
Semester 2: Big Data Framework

  • Introduction to Big Data: Characteristics, management architecture, types, and analytics

    Introduction to Big Data
    • Characteristics of Big Data

      Big Data is characterized by its volume, velocity, variety, veracity, and value. Volume refers to the vast amounts of data generated every second. Velocity is the speed at which this data is created and processed. Variety indicates the different types of data (structured, semi-structured, and unstructured). Veracity deals with the reliability and accuracy of the data. Finally, value refers to the insights and benefits derived from analyzing the data.

    • Big Data Management Architecture

      The management architecture for Big Data typically includes layers for data storage, processing, and analysis. The architecture often consists of a data ingestion layer that captures data from various sources, a storage layer that organizes and stores the data, a processing layer that transforms and analyzes the data, and a presentation layer that enables end-users to visualize and interact with the data.
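
      These layers can be sketched, very loosely, as a toy Python pipeline; the functions and data below are purely illustrative placeholders, not a real framework:

        # Toy illustration of the layered architecture (hypothetical names and data).
        def ingest():
            """Ingestion layer: capture raw records from a source."""
            return [{"sensor": "s1", "temp": 21.5}, {"sensor": "s2", "temp": 23.1}]

        def store(records, warehouse):
            """Storage layer: organize and persist the captured records."""
            warehouse.extend(records)

        def process(warehouse):
            """Processing layer: transform and analyze the stored data."""
            return sum(r["temp"] for r in warehouse) / len(warehouse)

        def present(result):
            """Presentation layer: expose results for end users to view."""
            print(f"Average temperature: {result:.1f}")

        warehouse = []
        store(ingest(), warehouse)
        present(process(warehouse))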

    • Types of Big Data

      Big Data can be classified into different types based on its source and structure. The main types are:
      1. Structured Data - highly organized and easily searchable (e.g., relational databases).
      2. Unstructured Data - unorganized data that does not fit into traditional data models (e.g., emails, videos).
      3. Semi-structured Data - contains elements of both structured and unstructured data (e.g., XML, JSON).
      Each type of data poses unique challenges and requires different approaches for storage and analysis.
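
      For illustration, the same information can appear in structured and semi-structured form; a small Python sketch (the field names and values are made up):

        import csv, json, io

        # Structured: a fixed schema, e.g. a CSV/relational row.
        structured = io.StringIO("id,name,city\n1,Asha,Chennai\n")
        rows = list(csv.DictReader(structured))

        # Semi-structured: self-describing JSON whose fields can vary per record.
        record = json.loads('{"id": 1, "name": "Asha", "contacts": {"email": "asha@example.com"}}')

        print(rows[0]["city"])               # access by column name
        print(record["contacts"]["email"])   # access nested, optional fields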

    • Big Data Analytics

      Big Data analytics involves leveraging advanced analytical techniques to process and analyze large datasets. Common techniques include data mining, machine learning, and predictive analytics. The goal is to uncover trends and insights that can inform business decisions. Analytics can be descriptive (analyzing past events), diagnostic (understanding why something happened), predictive (forecasting future events), or prescriptive (suggesting actions to achieve desired outcomes).

  • Hadoop Ecosystem: History, architecture, HDFS, cluster setup, YARN

    Hadoop Ecosystem
    • History of Hadoop

      Hadoop was created by Doug Cutting and Mike Cafarella and grew out of the open-source Apache Nutch web-crawler project. It was inspired by Google's papers on the Google File System (2003) and MapReduce (2004), and it became a separate open-source Apache project named Hadoop in 2006.

    • Architecture of Hadoop

      Hadoop follows a master-slave architecture. Master daemons (the HDFS NameNode and the YARN ResourceManager) manage the cluster and coordinate the work, while slave (worker) nodes run DataNode and NodeManager daemons that store the data blocks and execute the tasks assigned to them.

    • HDFS (Hadoop Distributed File System)

      HDFS is designed to store very large files across multiple machines. Files are split into large blocks (typically 128 MB by default), and each block is replicated across several DataNodes (three copies by default) for fault tolerance, while the NameNode keeps the metadata that maps files to their blocks.
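
      As a hedged sketch of working with HDFS from Python, the pyarrow library exposes a HadoopFileSystem client; this assumes a reachable NameNode (the host "namenode" below is a placeholder) and the native Hadoop client libraries (libhdfs) installed:

        from pyarrow import fs

        hdfs = fs.HadoopFileSystem(host="namenode", port=8020)  # placeholder host/port

        hdfs.create_dir("/user/demo")                            # like `hdfs dfs -mkdir`
        with hdfs.open_output_stream("/user/demo/hello.txt") as f:
            f.write(b"hello hdfs\n")                             # write a small file

        with hdfs.open_input_stream("/user/demo/hello.txt") as f:
            print(f.read())                                      # read it back

        # List the directory, similar to `hdfs dfs -ls /user/demo`.
        for info in hdfs.get_file_info(fs.FileSelector("/user/demo")):
            print(info.path, info.size)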

    • Cluster Setup

      Setting up a Hadoop cluster involves installing Java and the Hadoop distribution on every node, editing the configuration files (core-site.xml, hdfs-site.xml, mapred-site.xml, yarn-site.xml, and the workers file) to identify the master and worker nodes, formatting the NameNode, and then starting the HDFS and YARN daemons.

    • YARN (Yet Another Resource Negotiator)

      YARN handles resource management and job scheduling in the Hadoop ecosystem. A global ResourceManager allocates cluster resources, a NodeManager on each worker node runs task containers, and a per-application ApplicationMaster negotiates the resources its job needs. This decoupling allows multiple processing engines (MapReduce, Spark, Tez, and others) to share the data stored in HDFS.

  • MapReduce: Programming model, job run and failures, shuffle and sort

    MapReduce: Programming model, job run and failures, shuffle and sort
    • Overview of MapReduce

      MapReduce is a programming model designed for processing large datasets in a distributed computing environment. It consists of two main functions: Map and Reduce. The Map function takes input data and transforms it into a set of key-value pairs, while the Reduce function aggregates these pairs to produce the final output.

    • Map Function

      The Map function processes input data in parallel across multiple nodes. Each mapper reads a segment of the input dataset, processes it, and emits intermediate key-value pairs. This parallel processing allows for efficient handling of large-scale data.

    • Reduce Function

      The Reduce function takes the intermediate key-value pairs generated by the Map function. Reducers aggregate the data based on keys, performing operations such as summation or counting. The result is a reduced set of key-value pairs that represents the final output.
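
      As a concrete illustration, word count is the classic MapReduce example. A minimal sketch of the two functions in Python (not tied to any specific Hadoop API; in a real job the framework would run many mapper and reducer instances in parallel):

        def map_fn(line):
            """Map: turn one input line into (word, 1) key-value pairs."""
            return [(word, 1) for word in line.split()]

        def reduce_fn(word, counts):
            """Reduce: aggregate all of the counts that share the same key."""
            return (word, sum(counts))

        print(map_fn("big data big ideas"))   # [('big', 1), ('data', 1), ('big', 1), ('ideas', 1)]
        print(reduce_fn("big", [1, 1]))       # ('big', 2)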

    • Job Execution

      A MapReduce job goes through several stages: it is submitted to a resource manager, the input data is split across mappers, the mappers execute the Map function, followed by the shuffle and sort phase, and finally, the Reduce function is executed. The completed job outputs the results to a storage system.

    • Failures and Resilience

      MapReduce is resilient to failures. If a mapper or reducer fails during execution, the framework can restart the failed task on another node, ensuring that the overall job continues processing without significant downtime.

    • Shuffle and Sort Phase

      After the Map stage, the shuffle and sort phase reorganizes the intermediate key-value pairs produced by the mappers based on their keys. This phase is crucial for ensuring that all values associated with a particular key are sent to the same reducer, enabling accurate aggregation in the Reduce function.
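
      Putting the phases together, the whole flow can be simulated on a single machine; a toy Python sketch in which sorting and grouping stand in for the shuffle-and-sort phase:

        from itertools import groupby
        from operator import itemgetter

        lines = ["big data big ideas", "data drives ideas"]

        # Map phase: each line becomes a list of (word, 1) pairs.
        intermediate = [(word, 1) for line in lines for word in line.split()]

        # Shuffle and sort: order the pairs by key so that all values for a
        # given word sit together (and would be routed to the same reducer).
        intermediate.sort(key=itemgetter(0))

        # Reduce phase: aggregate the values for each key.
        counts = {word: sum(v for _, v in group)
                  for word, group in groupby(intermediate, key=itemgetter(0))}

        print(counts)   # {'big': 2, 'data': 2, 'drives': 1, 'ideas': 2}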

  • Pig Latin scripting: Data models, relational operators, developing and testing scripts

    Pig Latin scripting
    • Introduction to Pig Latin

      Apache Pig is a high-level platform for creating data-processing programs that run on Apache Hadoop. Its scripting language, Pig Latin, describes data transformations that Pig translates into jobs (classically MapReduce) executed on the cluster.

    • Data Models in Pig Latin

      Pig Latin supports various data types, including atomic types such as int, long, float, double, and bytearray. Complex data types include tuples, bags, and maps.
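
      To make the complex types concrete, here is a rough Python analogy (this is not Pig Latin syntax, only an illustration of the shape of each type):

        # A tuple: an ordered set of fields.
        pig_tuple = ("asha", 23, 8.5)

        # A bag: an unordered collection of tuples (duplicates allowed).
        pig_bag = [("asha", 23, 8.5), ("bala", 25, 7.9), ("bala", 25, 7.9)]

        # A map: a set of key-value pairs keyed by strings (chararrays in Pig).
        pig_map = {"city": "Chennai", "dept": "Analytics"}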

    • Relational Operators in Pig Latin

      Pig Latin provides several relational operators such as LOAD, STORE, FILTER, FOREACH, GROUP, JOIN, and COGROUP to process and manipulate data.

    • Developing Scripts in Pig Latin

      Developing a Pig Latin script involves writing a sequence of statements that load data, transform it step by step through named relations, and store or dump the results. Scripts can be entered interactively in the Grunt shell or saved in .pig files and submitted with the pig command.

    • Testing Scripts in Pig Latin

      Testing involves validating the script output against expected results, typically by running the script in Pig's local mode (pig -x local) on a small sample of data before running it in MapReduce mode on the full cluster. Operators such as DESCRIBE, DUMP, EXPLAIN, and ILLUSTRATE help inspect schemas and intermediate relations.

    • Best Practices in Pig Latin Scripting

      Best practices include using descriptive names for variables, breaking down complex scripts into smaller, manageable parts, and thorough testing of scripts before deployment.

  • NoSQL Databases: MongoDB, HBase concepts, data manipulation and use cases

    NoSQL Databases: MongoDB, HBase Concepts, Data Manipulation and Use Cases
    • Introduction to NoSQL Databases

      NoSQL databases are designed to handle large volumes of data that do not fit well into traditional relational databases. They offer flexible schema design, horizontal scaling, and varied data models including document, key-value, column-family, and graph.

    • MongoDB Overview

      MongoDB is a popular document-oriented NoSQL database. It stores data in BSON format, which is similar to JSON. Key features include schema flexibility, indexing, and powerful querying capabilities.

    • Data Manipulation in MongoDB

      Data manipulation in MongoDB can be done using CRUD operations. Create (insert documents), Read (query data), Update (modify existing documents), and Delete (remove documents) are all supported through a rich set of query languages and APIs.
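
      A minimal sketch of these CRUD operations using pymongo, the official Python driver (the connection URI, database, and collection names are placeholders):

        from pymongo import MongoClient

        client = MongoClient("mongodb://localhost:27017")   # placeholder URI
        students = client["university"]["students"]         # database and collection

        # Create: insert a document (the schema is flexible).
        students.insert_one({"name": "Asha", "dept": "Analytics", "marks": 82})

        # Read: query documents by field values.
        for doc in students.find({"dept": "Analytics"}):
            print(doc["name"], doc["marks"])

        # Update: modify matching documents.
        students.update_one({"name": "Asha"}, {"$set": {"marks": 88}})

        # Delete: remove matching documents.
        students.delete_one({"name": "Asha"})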

    • HBase Overview

      HBase is a distributed, scalable, column-family-oriented NoSQL database built on top of HDFS and modeled after Google's Bigtable. It is designed to handle very large, sparse tables with billions of rows and millions of columns, and it provides real-time random read/write access to that data.

    • Data Manipulation in HBase

      Data manipulation in HBase primarily involves using the HBase shell or APIs, including operations for creating tables, inserting and retrieving data, updating records, and deleting rows. HBase follows a key-value approach for handling data.
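
      A hedged sketch using happybase, a Python client that talks to HBase through its Thrift gateway (the host, table, and column-family names are illustrative, and an HBase Thrift server must be running):

        import happybase

        connection = happybase.Connection("localhost")          # Thrift server host
        connection.create_table("students", {"info": dict()})   # one column family

        table = connection.table("students")

        # Put: write cells for a row key (HBase stores everything as bytes).
        table.put(b"row1", {b"info:name": b"Asha", b"info:dept": b"Analytics"})

        # Get: read a single row back by its key.
        print(table.row(b"row1"))

        # Scan: iterate over rows, optionally restricted to certain columns.
        for key, data in table.scan(columns=[b"info:name"]):
            print(key, data)

        # Delete: remove a row (or specific columns within it).
        table.delete(b"row1")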

    • Use Cases of MongoDB

      MongoDB is widely used in various applications such as content management systems, real-time analytics, mobile applications, and IoT applications. Its flexible schema allows developers to iterate quickly.

    • Use Cases of HBase

      HBase is ideal for use cases that involve large-scale data, such as time-series data, data warehousing solutions, and applications requiring real-time analytics on massive datasets. It is particularly useful for big data applications.

