Semester 4: Big Data Analytics

  • Big data - classification, structured vs unstructured data, characteristics, challenges

    Big Data Analytics
    • Classification of Big Data

      Big data can be classified along several dimensions. Most commonly, it is categorized into structured, unstructured, and semi-structured data. Structured data is highly organized, typically resides in databases, and is easily queried. Unstructured data lacks a predefined format and includes text, images, and video, which makes it more complex to analyze. Semi-structured data falls between the two: it carries some hierarchical structure, such as tags and nesting, but does not enforce a rigid schema, as in the example below.
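
      An illustrative semi-structured JSON record (the fields are hypothetical): the tags and nesting give it shape, but no fixed schema is enforced:

        { "name": "Asha",
          "email": "asha@example.com",
          "orders": [ { "id": 101, "total": 250.0 },
                      { "id": 102, "total": 90.5 } ] }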

    • Structured vs Unstructured Data

      Structured data has a fixed schema, is easy to enter, and can be analyzed using traditional databases. Examples include spreadsheets and SQL databases. Unstructured data does not have a specific structure, making it more challenging to analyze. Examples include social media posts, emails, and multimedia content. The management of both types of data requires different approaches and technologies, such as data lakes for unstructured data.

    • Characteristics of Big Data

      The main characteristics of big data are often referred to as the 5 Vs: Volume (huge amounts of data), Velocity (the high speed at which data is generated), Variety (different types of data), Veracity (uncertainty in the data), and Value (the utility of the data in producing meaningful insights). These characteristics present both opportunities and challenges for processing data and extracting value from it.

    • Challenges in Big Data Analytics

      Big data analytics faces several challenges, including data privacy and security issues, the complexity of data integration, the need for advanced analytical skills, and the costs associated with big data technologies. Additionally, the raw data often needs to be cleaned and pre-processed, which can be labor-intensive and time-consuming. Ensuring data quality and obtaining actionable insights remains a significant challenge in the field.

  • Technology landscape - NoSQL, comparison with SQL, Hadoop ecosystem

    Technology Landscape
    • Introduction to NoSQL

      NoSQL stands for Not Only SQL. It represents a diverse category of databases that are designed to handle large volumes of data, offering flexibility in data models. Key types include document stores, key-value stores, column-family stores, and graph databases. They are designed to scale out by distributing data across multiple servers.

    • Comparison between NoSQL and SQL

      SQL databases are structured and use predefined schemas, making them suitable for complex queries and transactions. NoSQL databases, in contrast, offer dynamic schemas and scale horizontally, which makes them well suited to unstructured or semi-structured data. SQL databases guarantee consistency through ACID properties, while many NoSQL systems adopt BASE properties, trading strict consistency for flexibility and performance in big data applications.
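
      A minimal sketch of the same query in both models (the users table/collection and its fields are illustrative):

        -- SQL: declarative query over a table with a fixed schema
        SELECT name, age FROM users WHERE age > 30 ORDER BY age DESC;

        // MongoDB: the same filter and sort over a schemaless collection
        db.users.find({ age: { $gt: 30 } }, { name: 1, age: 1 }).sort({ age: -1 })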

    • Hadoop Ecosystem Overview

      Hadoop is an open-source framework for storing and processing big data in a distributed environment. It consists of several components, including Hadoop Distributed File System (HDFS), MapReduce for processing, and various ecosystem tools such as Hive for data warehousing, Pig for data flow scripting, and HBase for real-time read/write access to large datasets.
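
      A minimal sketch of interacting with HDFS from the command line (the paths are illustrative):

        hdfs dfs -mkdir -p /user/demo          # create a directory in HDFS
        hdfs dfs -put sales.csv /user/demo/    # copy a local file into HDFS
        hdfs dfs -ls /user/demo                # list the directory
        hdfs dfs -cat /user/demo/sales.csv     # print the file's contents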

    • Integration of NoSQL with Hadoop

      NoSQL databases can integrate with Hadoop to enhance big data processing capabilities. Technologies like HBase and Cassandra can store data that Hadoop can process using MapReduce or through SQL-like languages like HiveQL. This integration allows analysts to leverage both structured and unstructured data effectively.
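
      A minimal sketch of exposing an HBase table to HiveQL through Hive's HBase storage handler (assuming the handler is available; the table and column names are illustrative):

        -- Map an existing HBase table "users" to an external Hive table
        CREATE EXTERNAL TABLE hbase_users (rowkey STRING, name STRING)
        STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
        WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,info:name")
        TBLPROPERTIES ("hbase.table.name" = "users");

        -- HiveQL queries now read directly from HBase
        SELECT name FROM hbase_users LIMIT 10;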

    • Use Cases and Applications

      NoSQL databases are prevalent in real-time web applications, big data analytics, social networks, and Internet of Things (IoT) platforms, where they support high-velocity data ingestion and diverse data types. SQL databases, by contrast, remain common in applications that require strict transaction management, such as banking and airline reservation systems.

  • MongoDB and MapReduce programming - terms, data types, query language, MapReduce phases

    MongoDB and MapReduce Programming
    • Introduction to MongoDB

      MongoDB is a NoSQL database that uses a document-oriented data model. It stores data in a binary, JSON-like format called BSON. It is designed for scalability and flexibility, making it well suited to handling large volumes of unstructured data.

    • Key Terms and Data Types

      MongoDB supports data types such as String, Integer, Boolean, Double, Array, and ObjectId. Key terms include Collections (equivalent to tables in SQL), Documents (equivalent to rows), and Indexes, which enable efficient data retrieval.

    • Query Language

      MongoDB provides a rich query language for data retrieval, filtering, and sorting. Queries are expressed in a JSON-like format, which gives flexibility in specifying search criteria.
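
      A minimal sketch in the mongo shell (the "students" collection and its fields are illustrative):

        // Insert one document into the students collection
        db.students.insertOne({ name: "Asha", dept: "MCA", marks: 82 })

        // Find MCA students with marks above 60, returning only name and marks,
        // sorted from highest to lowest
        db.students.find(
          { dept: "MCA", marks: { $gt: 60 } },   // filter criteria, JSON-like
          { name: 1, marks: 1, _id: 0 }          // projection: fields to return
        ).sort({ marks: -1 })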

    • MapReduce and Its Phases

      MapReduce is a programming model for processing large datasets in parallel across a distributed cluster. It involves two main functions: Map and Reduce.

      In the Map phase, the input data is divided into smaller splits that can be processed simultaneously; each map function processes its portion of the input and emits key-value pairs. In the Reduce phase, the output of the Map phase is aggregated: the reduce function receives each key together with its group of values and produces a reduced result.

      In MongoDB, MapReduce can be used for tasks such as data analysis and large-scale data aggregation, making it suitable for batch-oriented big data applications.
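
      A minimal sketch of MapReduce in the mongo shell (the "orders" collection and its fields are illustrative; recent MongoDB versions favor the aggregation pipeline for this kind of work):

        // Map: emit one key-value pair (customer, amount) per order document
        var mapFn = function () { emit(this.custId, this.amount); };

        // Reduce: sum all amounts emitted for one customer key
        var reduceFn = function (key, values) { return Array.sum(values); };

        // Run the job and write the totals to the order_totals collection
        db.orders.mapReduce(mapFn, reduceFn, { out: "order_totals" })
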
  • Hive - architecture, data types, file formats, query language, partitioning, UDFs

    Hive
    • Hive Architecture

      Hive architecture involves several key components: the Hive Metastore, which stores metadata; the Hive Driver, which manages query execution; and the execution engine, which translates queries into jobs executed on Hadoop. Hive provides a SQL-like interface for querying large datasets stored in Hadoop, so users can write queries without dealing with low-level programming directly.

    • Data Types

      Hive supports a variety of data types, including scalar types such as INT, FLOAT, STRING, and BOOLEAN, and collection types like ARRAY, MAP, and STRUCT, which help in modeling complex data. Understanding these data types is crucial for working efficiently with Hive tables and optimizing query performance.

    • File Formats

      Hive supports multiple file formats, including TextFile, SequenceFile, ORC (Optimized Row Columnar), and Parquet. Each format has its advantages; ORC and Parquet, for example, are optimized for analytical workloads and are highly compressible. Choosing the right file format can significantly affect query performance and storage efficiency.

    • Query Language

      Hive uses HiveQL, a SQL-like query language for data manipulation and retrieval. It supports operations such as SELECT, JOIN, GROUP BY, and ORDER BY, so users familiar with SQL can adapt quickly, which simplifies data analysis.

    • Partitioning

      Partitioning in Hive divides tables into smaller, more manageable pieces based on key values such as dates or regions. Queries that filter on the partition key scan only the matching partitions rather than the whole table, so a sound partitioning strategy can yield significant performance improvements, as the sketch below illustrates.
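
      A minimal sketch in HiveQL (the table and column names are illustrative):

        -- Partitioned table stored in the columnar ORC format
        CREATE TABLE sales (item STRING, amount DOUBLE)
        PARTITIONED BY (sale_date STRING)
        STORED AS ORC;

        -- Load a row into a specific partition
        INSERT INTO sales PARTITION (sale_date = '2024-01-15') VALUES ('pen', 10.0);

        -- Filtering on the partition key prunes all other partitions
        SELECT item, SUM(amount) AS total
        FROM sales
        WHERE sale_date = '2024-01-15'
        GROUP BY item;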

    • User Defined Functions (UDFs)

      UDFs allow users to extend Hive with custom functions for data processing tasks not covered by the built-in ones. Standard UDFs are written in Java; scripts in Python and other languages can be plugged in through Hive's streaming (TRANSFORM) mechanism.
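
      A minimal sketch of a Java UDF using the classic org.apache.hadoop.hive.ql.exec.UDF API (newer Hive versions prefer GenericUDF; the class name is illustrative):

        package com.example.hive;

        import org.apache.hadoop.hive.ql.exec.UDF;
        import org.apache.hadoop.io.Text;

        public final class UpperCase extends UDF {
          // Hive calls evaluate() once per row; nulls pass through unchanged.
          public Text evaluate(final Text s) {
            if (s == null) return null;
            return new Text(s.toString().toUpperCase());
          }
        }

      Once compiled into a JAR, the function is registered with ADD JAR and CREATE TEMPORARY FUNCTION before it can be called from HiveQL.
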
  • Pig - introduction, features, data types, execution modes, commands

    Pig
    • Introduction

      Pig is a high-level platform for creating programs that run on Apache Hadoop. It offers a scripting language called Pig Latin, which simplifies data processing and analysis.
    • Features

      Pig simplifies working with large datasets and offers features such as extensibility, rich data types, and opportunities for automatic optimization. It can handle both structured and unstructured data.
    • Data Types

      Pig supports various data types, including scalar types (int, long, float, double, chararray, bytearray) and complex types (tuple, bag, and map) that allow for more intricate data manipulations.
    • Execution Modes

      Pig has two execution modes: Local Mode and MapReduce Mode. Local Mode runs on a single machine and is useful for debugging, while MapReduce Mode leverages the Hadoop cluster for processing large datasets.
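
      A minimal sketch of launching each mode from the shell (the script name is illustrative):

        pig -x local analysis.pig        # Local Mode: single machine, local filesystem
        pig -x mapreduce analysis.pig    # MapReduce Mode: runs on the Hadoop cluster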
    • Commands

      Common commands in Pig include LOAD (to read data), STORE (to write data), FILTER (to select rows), FOREACH ... GENERATE (to transform each record), GROUP (to group data), and CUBE (to perform multi-dimensional aggregations). Together these commands form a data-processing pipeline, as the sketch below shows.
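
      A minimal sketch in Pig Latin (the file path, field names, and types are illustrative):

        -- Load a CSV file of student records
        students = LOAD 'students.csv' USING PigStorage(',')
                   AS (name:chararray, dept:chararray, marks:int);

        -- Keep only passing students
        passed = FILTER students BY marks >= 50;

        -- Average marks per department
        by_dept = GROUP passed BY dept;
        avgs = FOREACH by_dept GENERATE group AS dept, AVG(passed.marks) AS avg_marks;

        -- Write the result back to storage
        STORE avgs INTO 'dept_averages' USING PigStorage(',');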
