
Semester 2: Big Data Framework

  • Introduction to Big Data: Characteristics, management architecture, types, and analytics

    Introduction to Big Data
    • Characteristics of Big Data

      Big Data is characterized by its volume, velocity, variety, veracity, and value. Volume refers to the vast amounts of data generated every second. Velocity is the speed at which this data is created and processed. Variety indicates the different types of data (structured, semi-structured, and unstructured). Veracity deals with the reliability and accuracy of the data. Finally, value refers to the insights and benefits derived from analyzing the data.

    • Big Data Management Architecture

      The management architecture for Big Data typically includes layers for data storage, processing, and analysis. The architecture often consists of a data ingestion layer that captures data from various sources, a storage layer that organizes and stores the data, a processing layer that transforms and analyzes the data, and a presentation layer that enables end-users to visualize and interact with the data.
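
      These layers can be sketched, very loosely, as a toy Python pipeline; the functions and data below are purely illustrative placeholders, not a real framework:

        # Toy illustration of the layered architecture (hypothetical names and data).
        def ingest():
            """Ingestion layer: capture raw records from a source."""
            return [{"sensor": "s1", "temp": 21.5}, {"sensor": "s2", "temp": 23.1}]

        def store(records, warehouse):
            """Storage layer: organize and persist the captured records."""
            warehouse.extend(records)

        def process(warehouse):
            """Processing layer: transform and analyze the stored data."""
            return sum(r["temp"] for r in warehouse) / len(warehouse)

        def present(result):
            """Presentation layer: expose results for end users to view."""
            print(f"Average temperature: {result:.1f}")

        warehouse = []
        store(ingest(), warehouse)
        present(process(warehouse))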

    • Types of Big Data

      Big Data can be classified into different types based on its source and structure. The main types are:
      1. Structured Data - highly organized and easily searchable (e.g., relational databases).
      2. Unstructured Data - unorganized data that does not fit into traditional data models (e.g., emails, videos).
      3. Semi-structured Data - contains elements of both structured and unstructured data (e.g., XML, JSON).
      Each type of data poses unique challenges and requires different approaches for storage and analysis.
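
      For illustration, the same information can appear in structured and semi-structured form; a small Python sketch (the field names and values are made up):

        import csv, json, io

        # Structured: a fixed schema, e.g. a CSV/relational row.
        structured = io.StringIO("id,name,city\n1,Asha,Chennai\n")
        rows = list(csv.DictReader(structured))

        # Semi-structured: self-describing JSON whose fields can vary per record.
        record = json.loads('{"id": 1, "name": "Asha", "contacts": {"email": "asha@example.com"}}')

        print(rows[0]["city"])               # access by column name
        print(record["contacts"]["email"])   # access nested, optional fields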

    • Big Data Analytics

      Big Data analytics involves leveraging advanced analytical techniques to process and analyze large datasets. Common techniques include data mining, machine learning, and predictive analytics. The goal is to uncover trends and insights that can inform business decisions. Analytics can be descriptive (analyzing past events), diagnostic (understanding why something happened), predictive (forecasting future events), or prescriptive (suggesting actions to achieve desired outcomes).

  • Hadoop Ecosystem: History, architecture, HDFS, cluster setup, YARN

    Hadoop Ecosystem
    • History of Hadoop

      Hadoop was created by Doug Cutting and Mike Cafarella and grew out of the open-source Apache Nutch web-crawler project. It was inspired by Google's papers on the Google File System (2003) and MapReduce (2004), and it became a separate open-source Apache project named Hadoop in 2006.

    • Architecture of Hadoop

      Hadoop follows a master-slave architecture. Master daemons (the HDFS NameNode and the YARN ResourceManager) manage the cluster and coordinate the work, while slave (worker) nodes run DataNode and NodeManager daemons that store the data blocks and execute the tasks assigned to them.

    • HDFS (Hadoop Distributed File System)

      HDFS is designed to store very large files across multiple machines. Files are split into large blocks (typically 128 MB by default), and each block is replicated across several DataNodes (three copies by default) for fault tolerance, while the NameNode keeps the metadata that maps files to their blocks.
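
      As a hedged sketch of working with HDFS from Python, the pyarrow library exposes a HadoopFileSystem client; this assumes a reachable NameNode (the host "namenode" below is a placeholder) and the native Hadoop client libraries (libhdfs) installed:

        from pyarrow import fs

        hdfs = fs.HadoopFileSystem(host="namenode", port=8020)  # placeholder host/port

        hdfs.create_dir("/user/demo")                            # like `hdfs dfs -mkdir`
        with hdfs.open_output_stream("/user/demo/hello.txt") as f:
            f.write(b"hello hdfs\n")                             # write a small file

        with hdfs.open_input_stream("/user/demo/hello.txt") as f:
            print(f.read())                                      # read it back

        # List the directory, similar to `hdfs dfs -ls /user/demo`.
        for info in hdfs.get_file_info(fs.FileSelector("/user/demo")):
            print(info.path, info.size)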

    • Cluster Setup

      Setting up a Hadoop cluster involves installing Java and the Hadoop distribution on every node, editing the configuration files (core-site.xml, hdfs-site.xml, mapred-site.xml, yarn-site.xml, and the workers file) to identify the master and worker nodes, formatting the NameNode, and then starting the HDFS and YARN daemons.

    • YARN (Yet Another Resource Negotiator)

      YARN handles resource management and job scheduling in the Hadoop ecosystem. A global ResourceManager allocates cluster resources, a NodeManager on each worker node runs task containers, and a per-application ApplicationMaster negotiates the resources its job needs. This decoupling allows multiple processing engines (MapReduce, Spark, Tez, and others) to share the data stored in HDFS.

  • MapReduce: Programming model, job run and failures, shuffle and sort

    MapReduce: Programming model, job run and failures, shuffle and sort
    • Overview of MapReduce

      MapReduce is a programming model designed for processing large datasets in a distributed computing environment. It consists of two main functions: Map and Reduce. The Map function takes input data and transforms it into a set of key-value pairs, while the Reduce function aggregates these pairs to produce the final output.

    • Map Function

      The Map function processes input data in parallel across multiple nodes. Each mapper reads a segment of the input dataset, processes it, and emits intermediate key-value pairs. This parallel processing allows for efficient handling of large-scale data.

    • Reduce Function

      The Reduce function takes the intermediate key-value pairs generated by the Map function. Reducers aggregate the data based on keys, performing operations such as summation or counting. The result is a reduced set of key-value pairs that represents the final output.
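
      As a concrete illustration, word count is the classic MapReduce example. A minimal sketch of the two functions in Python (not tied to any specific Hadoop API; in a real job the framework would run many mapper and reducer instances in parallel):

        def map_fn(line):
            """Map: turn one input line into (word, 1) key-value pairs."""
            return [(word, 1) for word in line.split()]

        def reduce_fn(word, counts):
            """Reduce: aggregate all of the counts that share the same key."""
            return (word, sum(counts))

        print(map_fn("big data big ideas"))   # [('big', 1), ('data', 1), ('big', 1), ('ideas', 1)]
        print(reduce_fn("big", [1, 1]))       # ('big', 2)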

    • Job Execution

      A MapReduce job goes through several stages: it is submitted to a resource manager, the input data is split across mappers, the mappers execute the Map function, followed by the shuffle and sort phase, and finally, the Reduce function is executed. The completed job outputs the results to a storage system.

    • Failures and Resilience

      MapReduce is resilient to failures. If a mapper or reducer fails during execution, the framework can restart the failed task on another node, ensuring that the overall job continues processing without significant downtime.

    • Shuffle and Sort Phase

      After the Map stage, the shuffle and sort phase reorganizes the intermediate key-value pairs produced by the mappers based on their keys. This phase is crucial for ensuring that all values associated with a particular key are sent to the same reducer, enabling accurate aggregation in the Reduce function.
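
      Putting the phases together, the whole flow can be simulated on a single machine; a toy Python sketch in which sorting and grouping stand in for the shuffle-and-sort phase:

        from itertools import groupby
        from operator import itemgetter

        lines = ["big data big ideas", "data drives ideas"]

        # Map phase: each line becomes a list of (word, 1) pairs.
        intermediate = [(word, 1) for line in lines for word in line.split()]

        # Shuffle and sort: order the pairs by key so that all values for a
        # given word sit together (and would be routed to the same reducer).
        intermediate.sort(key=itemgetter(0))

        # Reduce phase: aggregate the values for each key.
        counts = {word: sum(v for _, v in group)
                  for word, group in groupby(intermediate, key=itemgetter(0))}

        print(counts)   # {'big': 2, 'data': 2, 'drives': 1, 'ideas': 2}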

  • Pig Latin scripting: Data models, relational operators, developing and testing scripts

    Pig Latin scripting
    • Introduction to Pig Latin

      Apache Pig is a high-level platform for creating data-processing programs that run on Apache Hadoop. Its scripting language, Pig Latin, describes data transformations that Pig translates into jobs (classically MapReduce) executed on the cluster.

    • Data Models in Pig Latin

      Pig Latin supports various data types, including atomic types such as int, long, float, double, and bytearray. Complex data types include tuples, bags, and maps.
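
      To make the complex types concrete, here is a rough Python analogy (this is not Pig Latin syntax, only an illustration of the shape of each type):

        # A tuple: an ordered set of fields.
        pig_tuple = ("asha", 23, 8.5)

        # A bag: an unordered collection of tuples (duplicates allowed).
        pig_bag = [("asha", 23, 8.5), ("bala", 25, 7.9), ("bala", 25, 7.9)]

        # A map: a set of key-value pairs keyed by strings (chararrays in Pig).
        pig_map = {"city": "Chennai", "dept": "Analytics"}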

    • Relational Operators in Pig Latin

      Pig Latin provides several relational operators such as LOAD, STORE, FILTER, FOREACH, GROUP, JOIN, and COGROUP to process and manipulate data.

    • Developing Scripts in Pig Latin

      Developing a Pig Latin script involves writing a sequence of statements that load data, transform it step by step through named relations, and store or dump the results. Scripts can be entered interactively in the Grunt shell or saved in .pig files and submitted with the pig command.

    • Testing Scripts in Pig Latin

      Testing involves validating the script output against expected results, typically by running the script in Pig's local mode (pig -x local) on a small sample of data before running it in MapReduce mode on the full cluster. Operators such as DESCRIBE, DUMP, EXPLAIN, and ILLUSTRATE help inspect schemas and intermediate relations.

    • Best Practices in Pig Latin Scripting

      Best practices include using descriptive names for variables, breaking down complex scripts into smaller, manageable parts, and thorough testing of scripts before deployment.

  • NoSQL Databases: MongoDB, HBase concepts, data manipulation and use cases

    NoSQL Databases: MongoDB, HBase Concepts, Data Manipulation and Use Cases
    • Introduction to NoSQL Databases

      NoSQL databases are designed to handle large volumes of data that do not fit well into traditional relational databases. They offer flexible schema design, horizontal scaling, and varied data models including document, key-value, column-family, and graph.

    • MongoDB Overview

      MongoDB is a popular document-oriented NoSQL database. It stores data in BSON format, which is similar to JSON. Key features include schema flexibility, indexing, and powerful querying capabilities.

    • Data Manipulation in MongoDB

      Data manipulation in MongoDB can be done using CRUD operations. Create (insert documents), Read (query data), Update (modify existing documents), and Delete (remove documents) are all supported through a rich set of query languages and APIs.
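
      A minimal sketch of these CRUD operations using pymongo, the official Python driver (the connection URI, database, and collection names are placeholders):

        from pymongo import MongoClient

        client = MongoClient("mongodb://localhost:27017")   # placeholder URI
        students = client["university"]["students"]         # database and collection

        # Create: insert a document (the schema is flexible).
        students.insert_one({"name": "Asha", "dept": "Analytics", "marks": 82})

        # Read: query documents by field values.
        for doc in students.find({"dept": "Analytics"}):
            print(doc["name"], doc["marks"])

        # Update: modify matching documents.
        students.update_one({"name": "Asha"}, {"$set": {"marks": 88}})

        # Delete: remove matching documents.
        students.delete_one({"name": "Asha"})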

    • HBase Overview

      HBase is a distributed, scalable, column-family-oriented NoSQL database built on top of HDFS and modeled after Google's Bigtable. It is designed to handle very large, sparse tables with billions of rows and millions of columns, and it provides real-time random read/write access to that data.

    • Data Manipulation in HBase

      Data manipulation in HBase primarily involves using the HBase shell or APIs, including operations for creating tables, inserting and retrieving data, updating records, and deleting rows. HBase follows a key-value approach for handling data.
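
      A hedged sketch using happybase, a Python client that talks to HBase through its Thrift gateway (the host, table, and column-family names are illustrative, and an HBase Thrift server must be running):

        import happybase

        connection = happybase.Connection("localhost")          # Thrift server host
        connection.create_table("students", {"info": dict()})   # one column family

        table = connection.table("students")

        # Put: write cells for a row key (HBase stores everything as bytes).
        table.put(b"row1", {b"info:name": b"Asha", b"info:dept": b"Analytics"})

        # Get: read a single row back by its key.
        print(table.row(b"row1"))

        # Scan: iterate over rows, optionally restricted to certain columns.
        for key, data in table.scan(columns=[b"info:name"]):
            print(key, data)

        # Delete: remove a row (or specific columns within it).
        table.delete(b"row1")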

    • Use Cases of MongoDB

      MongoDB is widely used in various applications such as content management systems, real-time analytics, mobile applications, and IoT applications. Its flexible schema allows developers to iterate quickly.

    • Use Cases of HBase

      HBase is ideal for use cases that involve large-scale data, such as time-series data, data warehousing solutions, and applications requiring real-time analytics on massive datasets. It is particularly useful for big data applications.

