Semester 3: Big Data with Spark and Hive
Overview of Spark and Big Data: Architecture, philosophy, running and deploying applications
Overview of Spark and Big Data
Introduction to Big Data
Big Data refers to datasets that are so large or complex that traditional data processing applications are inadequate. It encompasses the three Vs: Volume, Variety, and Velocity.
Apache Spark Overview
Apache Spark is an open-source distributed computing system designed for fast, flexible data processing. It keeps intermediate results in memory wherever possible, which makes iterative and interactive workloads much faster than purely disk-based processing.
Spark Architecture
Spark follows a driver-executor (master-worker) architecture. The driver program runs the application's main function and creates the SparkContext (wrapped by a SparkSession in modern code), which coordinates the job. Executors run on worker nodes and are responsible for executing the tasks assigned to them.
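A minimal sketch of a driver program in PySpark (the application name and dataset are illustrative); the driver builds the plan, and the cluster manager assigns tasks to executors:

from pyspark.sql import SparkSession

# The driver creates the SparkSession, which wraps the SparkContext
spark = SparkSession.builder.appName("ArchitectureDemo").getOrCreate()

df = spark.range(1000)   # a distributed dataset of the numbers 0..999
print(df.count())        # an action: executors compute partial counts, the driver sums them

spark.stop()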
Main Components of Spark
Key components of Spark include Spark Core (task scheduling, memory management, fault recovery, and the RDD API), Spark SQL (for working with structured data), Spark Streaming (for real-time data processing), MLlib (for machine learning), and GraphX (for graph processing).
Running Spark Applications
Spark applications can be run in various modes: local mode, standalone mode, on Apache Mesos, or using Hadoop YARN. Each mode has its own configuration and resource management approach.
Deploying Spark Applications
Deploying Spark applications involves packaging the application, managing its dependencies, and selecting a cluster manager. The packaged application is then submitted with the spark-submit command.
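A sketch of a typical submission (the master URL, application name, and file path are placeholders):

spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --name my-spark-app \
  my_app.py

The --master flag selects the cluster manager (local, standalone, Mesos, or YARN), and --deploy-mode chooses whether the driver runs on the submitting machine (client) or inside the cluster (cluster).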
Comparison with Hadoop
While Hadoop is designed for batch processing with its MapReduce framework, Spark is optimized for speed with its in-memory processing capabilities. Spark can work alongside Hadoop and is often seen as a complement to it.
Use Cases of Spark in Big Data
Spark is used in various domains like real-time data analysis, machine learning pipelines, data integration, and large-scale data processing, making it a versatile tool in the big data ecosystem.
Spark Structured API: Data types, operations, JSON handling, aggregations, grouping and joins
Spark Structured API
Data Types
Spark Structured API supports various data types such as IntegerType, StringType, DoubleType, BooleanType, and complex types including ArrayType, MapType, and StructType. These types help in defining the schema for DataFrames, which allow for efficient processing of structured data.
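As a sketch, an explicit schema can be declared in PySpark with these types (the column names are illustrative):

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, ArrayType

spark = SparkSession.builder.getOrCreate()

# Illustrative schema: a name, an age, and a list of phone numbers
schema = StructType([
    StructField("name", StringType(), nullable=False),
    StructField("age", IntegerType(), nullable=True),
    StructField("phones", ArrayType(StringType()), nullable=True),
])

df = spark.createDataFrame([("Ada", 36, ["555-0100"])], schema=schema)
df.printSchema()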
Operations
Spark provides several core operations such as transformations (like map, filter, and select) and actions (like count, collect, and show). Transformations are lazy and return a new DataFrame, while actions trigger computation and return results.
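A short sketch of the distinction, assuming an active SparkSession:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()
df = spark.range(100)                       # DataFrame with a single 'id' column

# Transformations are lazy: this only records the plan
evens = df.filter(col("id") % 2 == 0).select("id")

# Actions trigger execution and return results
print(evens.count())    # 50
evens.show(5)           # prints the first five rows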
JSON Handling
The Spark Structured API has built-in support for JSON data through the spark.read.json method, which reads JSON files into DataFrames. The schema can be inferred automatically or defined explicitly, facilitating seamless integration with semi-structured data.
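A sketch of reading JSON (the file path is a placeholder; by default Spark expects one JSON object per line):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Infer the schema from the data
events = spark.read.json("/data/events.json")     # placeholder path
events.printSchema()

# Or supply an explicit schema to skip the inference pass:
# events = spark.read.schema(my_schema).json("/data/events.json")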
Aggregations
The Structured API supports aggregation operations using methods like groupBy, agg, and count. Users can perform operations such as sum, avg, min, and max on grouped data, providing insights into large datasets.
Grouping
Grouping data can be accomplished using the groupBy method, which allows users to group rows based on one or more columns. After grouping, aggregations can be applied to summarize the data effectively.
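A combined sketch of grouping and aggregation (the data and column names are illustrative):

from pyspark.sql import SparkSession
from pyspark.sql.functions import count, sum, avg

spark = SparkSession.builder.getOrCreate()
sales = spark.createDataFrame(
    [("north", 100.0), ("north", 250.0), ("south", 75.0)],
    ["region", "amount"],
)

# Group rows by region, then summarize each group
summary = sales.groupBy("region").agg(
    count("*").alias("orders"),
    sum("amount").alias("total"),
    avg("amount").alias("average"),
)
summary.show()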
Joins
Spark Structured API supports various join operations, including inner, outer, left, and right joins. DataFrames can be joined on one or more keys, enabling the combination of different datasets for comprehensive analysis.
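A sketch of a join on a single key (the datasets are illustrative; the how argument selects the join type):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
customers = spark.createDataFrame([(1, "Ada"), (2, "Alan")], ["id", "name"])
orders = spark.createDataFrame([(1, 99.0), (1, 15.0), (3, 42.0)], ["customer_id", "amount"])

# Inner join; "left", "right", or "outer" would keep unmatched rows from one or both sides
joined = customers.join(orders, customers.id == orders.customer_id, how="inner")
joined.show()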
Spark SQL and RDDs: Queries, tables, transformations, actions, accumulators
Spark SQL and RDDs
Introduction to Spark SQL
Spark SQL provides a programming interface for working with structured and semi-structured data. It allows for querying data using SQL as well as the DataFrame API.
RDDs (Resilient Distributed Datasets)
RDDs are the fundamental data structure in Spark, enabling distributed data processing. They are immutable, fault-tolerant, and can be created from existing datasets or by transforming other RDDs.
Queries in Spark SQL
Queries in Spark SQL can be executed using the SQL interface or the DataFrame API. Spark optimizes these queries using the Catalyst query optimizer.
Tables in Spark SQL
Tables in Spark SQL can be created from DataFrames and can be queried using SQL commands. They can be temporary or permanent and can be registered in the Hive metastore.
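A sketch of registering a DataFrame as a temporary view and querying it with SQL (the names are illustrative):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
people = spark.createDataFrame([("Ada", 36), ("Alan", 41)], ["name", "age"])

# Temporary view: visible only to this SparkSession
people.createOrReplaceTempView("people")
spark.sql("SELECT name FROM people WHERE age > 40").show()

# Saving as a table instead registers it in the metastore:
# people.write.saveAsTable("people_permanent")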
Transformations in RDDs
Transformations are operations that create a new RDD from an existing one. They are lazy, meaning they do not compute their results immediately, but rather define a lineage of operations.
Actions in RDDs
Actions are operations that trigger the execution of transformations and return a result to the driver program (or write it to storage). Examples include collect, count, and saveAsTextFile.
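A short sketch of RDD transformations and actions (the data is illustrative):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

rdd = sc.parallelize(range(10))               # RDD from a local collection

# Transformations build a lineage lazily
even_squares = rdd.map(lambda x: x * x).filter(lambda x: x % 2 == 0)

# Actions execute the lineage and return results to the driver
print(even_squares.collect())   # [0, 4, 16, 36, 64]
print(even_squares.count())     # 5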
Accumulators in Spark
Accumulators are shared variables used to aggregate information, such as counters and sums, across the cluster. Updates applied inside actions are guaranteed to be counted exactly once, whereas updates inside transformations may be re-applied when tasks are retried, so accumulator values should only be relied upon when the updates happen in actions.
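A sketch of a counter accumulator (the data and the predicate are illustrative):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

bad_records = sc.accumulator(0)     # numeric accumulator starting at 0

def check(value):
    if value < 0:
        bad_records.add(1)          # executors add; only the driver reads .value

# foreach is an action, so each update is applied exactly once
sc.parallelize([1, -2, 3, -4]).foreach(check)
print(bad_records.value)            # 2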
Streaming Fundamentals: Structured Streaming basics and design principles
Streaming Fundamentals
Introduction to Streaming
Streaming means processing data continuously as it arrives rather than in large, periodic batches. Unlike batch processing, streaming allows for near-immediate insights and actions on incoming data.
Key Concepts of Streaming
Important concepts include data ingestion, stream processing, and real-time analytics. Data ingestion involves capturing data from various sources, while stream processing manipulates the data in transit.
Streaming Architectures
Common architectures include lambda and kappa. Lambda architecture combines batch and stream processing, while kappa focuses solely on stream processing, simplifying data handling.
Design Principles for Streaming Applications
Designing streaming applications requires considerations such as scalability, fault tolerance, and data consistency. Applications should be able to scale with increasing data load and recover from failures.
Streaming Technologies
Popular streaming technologies include Apache Kafka, Spark Structured Streaming, and Apache Flink. Each has unique features and use cases catering to different streaming requirements.
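A minimal Structured Streaming sketch in PySpark, using the built-in rate source to generate test data (the grouping expression and run duration are illustrative):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# The rate source continuously generates rows with (timestamp, value) columns
stream = spark.readStream.format("rate").option("rowsPerSecond", 5).load()

# An incremental query: a running count per parity of the generated value
counts = stream.groupBy((stream.value % 2).alias("parity")).count()

# Write the updated results to the console as new data arrives
query = counts.writeStream.outputMode("complete").format("console").start()
query.awaitTermination(timeout=30)   # run briefly, then return
query.stop()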
Use Cases of Streaming
Streaming is widely used in various industries for real-time analytics, monitoring systems, fraud detection, and enhancing user experiences through personalization.
Hive Services: Data model, definitions, manipulation, queries, views, indexes, tuning
Data Model
Hive uses a schema-on-read approach: data is stored in HDFS and a schema is applied at query time. The primary components of the data model are databases, tables, columns, partitions, and buckets. Tables can be managed (internal), where Hive owns both the metadata and the data files, or external, where Hive manages only the metadata and the underlying files remain in place when the table is dropped.
Definitions
Data definitions in Hive are expressed with HiveQL Data Definition Language (DDL) commands, which create, alter, and drop objects such as databases, tables, partitions, and buckets.
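A sketch of HiveQL DDL, issued here through a Hive-enabled SparkSession (the table, columns, and location are illustrative):

from pyspark.sql import SparkSession

# enableHiveSupport connects the session to the Hive metastore
spark = SparkSession.builder.enableHiveSupport().getOrCreate()

spark.sql("""
    CREATE EXTERNAL TABLE IF NOT EXISTS web_logs (
        ip STRING,
        url STRING,
        bytes INT
    )
    PARTITIONED BY (log_date STRING)
    STORED AS ORC
    LOCATION '/data/web_logs'
""")

The same statement can be run directly in the Hive CLI or Beeline; dropping an external table like this one removes only the metadata, not the files.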
Manipulation
Data manipulation in Hive uses HiveQL Data Manipulation Language (DML) commands such as LOAD DATA, INSERT, UPDATE, and DELETE. Row-level UPDATE and DELETE require transactional (ACID) tables, typically stored as ORC; otherwise Hive favors bulk loads and INSERT ... SELECT statements.
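A sketch of two common DML operations against the illustrative web_logs table from the previous example (paths, partition values, and the summary table are placeholders):

from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# Bulk-load files into a specific partition
spark.sql("""
    LOAD DATA INPATH '/staging/2024-01-01'
    INTO TABLE web_logs PARTITION (log_date = '2024-01-01')
""")

# Append the results of a query (assumes a web_logs_summary table already exists)
spark.sql("""
    INSERT INTO TABLE web_logs_summary
    SELECT url, COUNT(*) AS hits FROM web_logs GROUP BY url
""")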
Queries
Hive allows users to write queries in HiveQL, which is similar to SQL. Queries can involve SELECT statements to retrieve data, JOINs to combine data from multiple tables, and GROUP BY clauses for aggregating data.
Views
Views in Hive are virtual tables that allow users to simplify complex queries. A view is defined by a SELECT statement and provides a way to present data without needing to create new tables physically.
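A sketch of a view over the illustrative web_logs table; the view stores only the query definition, not the data:

from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

spark.sql("""
    CREATE VIEW IF NOT EXISTS daily_traffic AS
    SELECT log_date, COUNT(*) AS requests
    FROM web_logs
    GROUP BY log_date
""")

spark.sql("SELECT * FROM daily_traffic ORDER BY requests DESC").show()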
Indexes
Indexes in Hive were designed to speed up queries by letting the optimizer skip irrelevant data; older releases support compact and bitmap indexes created on table columns. Indexing was removed in Hive 3.0, where columnar formats such as ORC and Parquet, along with materialized views, serve the same purpose.
Tuning
Tuning in Hive involves optimizing performance through various means, including adjusting execution settings, partitioning data to reduce scan times, and using efficient file formats like ORC or Parquet. Properly indexing data and using caching strategies can also enhance performance.
