Semester 3: Machine Learning

  • Data Analytics with pandas and NumPy: Basic statistics, working with data, null values, statistical graphs

    Data Analytics with pandas and NumPy
    • Basic Statistics

      Basic statistics includes measures of central tendency like mean, median, and mode, and measures of dispersion like variance and standard deviation. Pandas provides methods like .mean(), .median(), .mode(), .var(), and .std() to compute these statistics easily.
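
      A minimal sketch of these calls on a small made-up column (the values are purely illustrative):

        import pandas as pd

        df = pd.DataFrame({"marks": [78, 85, 62, 90, 85, 70]})

        print(df["marks"].mean())      # arithmetic mean
        print(df["marks"].median())    # middle value
        print(df["marks"].mode())      # most frequent value(s), returned as a Series
        print(df["marks"].var())       # sample variance (ddof=1 by default)
        print(df["marks"].std())       # sample standard deviation
        print(df["marks"].describe())  # count, mean, std, min, quartiles, max in one call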

    • Working with Data

      Pandas is an essential library for data manipulation. It allows for importing data from various formats such as CSV, Excel, and SQL databases. Key functionalities include DataFrame creation, indexing, filtering, and aggregating data using groupby and pivot_table methods.
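
      A short sketch of importing, filtering, and aggregating data; the file name and column names here are hypothetical and only illustrate the API:

        import pandas as pd

        df = pd.read_csv("students.csv")            # also: pd.read_excel, pd.read_sql

        passed = df[df["score"] >= 50]              # boolean filtering
        by_dept = df.groupby("department")["score"].mean()         # per-group aggregation
        table = df.pivot_table(values="score", index="department",
                               columns="gender", aggfunc="mean")   # two-way summary
        print(by_dept)
        print(table)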

    • Handling Null Values

      In data analysis, null values can affect the results. Pandas provides methods like .isnull(), .dropna(), and .fillna() to identify and handle missing data effectively, allowing for options to drop or fill null values.
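
      A small sketch of these methods on made-up data:

        import numpy as np
        import pandas as pd

        df = pd.DataFrame({"age": [21, np.nan, 25, 23], "city": ["A", "B", None, "A"]})

        print(df.isnull().sum())     # missing values per column
        dropped = df.dropna()        # drop rows containing any null
        filled = df.fillna({"age": df["age"].mean(), "city": "Unknown"})  # column-wise fill
        print(filled)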

    • Statistical Graphs

      Data visualization is crucial in data analytics. Libraries like Matplotlib and Seaborn can be used alongside pandas to create various statistical graphs such as histograms, box plots, and scatter plots. Visualization helps in understanding data distributions and relationships.
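
      As an illustration, a histogram and a box plot drawn from a randomly generated column (assuming Matplotlib is installed alongside pandas):

        import matplotlib.pyplot as plt
        import numpy as np
        import pandas as pd

        df = pd.DataFrame({"score": np.random.default_rng(0).normal(70, 10, 200)})

        df["score"].plot(kind="hist", bins=20, title="Score distribution")  # histogram
        plt.show()

        df.boxplot(column="score")   # box plot, useful for spotting outliers
        plt.show()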

  • Concepts and Types of Machine Learning: Supervised, unsupervised, pattern classification, ML algorithms

    Concepts and Types of Machine Learning
    • Introduction to Machine Learning

      Machine Learning (ML) is a field of artificial intelligence that uses statistical techniques to enable computers to improve their performance on a specific task through experience. It involves designing algorithms that can learn from and make predictions or decisions based on data.

    • Types of Machine Learning

      Machine Learning can be broadly categorized into three types: Supervised Learning, Unsupervised Learning, and Reinforcement Learning.

    • Supervised Learning

      In Supervised Learning, the algorithm is trained on a labeled dataset. This means that each training example is paired with an output label. The objective is to learn a model that can predict the output for new, unseen data. Common algorithms include linear regression, decision trees, and support vector machines.

    • Unsupervised Learning

      Unsupervised Learning deals with unlabeled data. The algorithm tries to learn the underlying patterns or distributions in the data without any explicit guidance on what the output should be. Techniques such as clustering and dimensionality reduction are commonly used, with algorithms like K-means and principal component analysis.

    • Pattern Classification

      Pattern classification is a type of supervised learning where the goal is to assign a label to input data based on the patterns identified during training. This can be applied in numerous applications, including image recognition and speech analysis.

    • Machine Learning Algorithms

      There are various algorithms used in Machine Learning, including but not limited to: 1. Linear Regression 2. Logistic Regression 3. Decision Trees 4. Random Forests 5. Support Vector Machines 6. Neural Networks 7. K-Nearest Neighbors.

    • Conclusion

      Understanding the different types of Machine Learning and their respective algorithms is crucial for effectively applying ML techniques in various data science problems. Each type serves different purposes and can be used in combination to solve complex tasks.

  • ML Classifiers: Perceptron, logistic regression, support vector machines, decision trees, k-nearest neighbors

    ML Classifiers
    • Perceptron

      Perceptron is one of the simplest types of artificial neurons. It is a supervised learning algorithm used for binary classification. The model takes multiple inputs, applies weights to them, sums the weighted inputs, and applies a step function to produce a binary output. Despite its simplicity, it can only solve linearly separable problems.
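
      A minimal sketch using scikit-learn's Perceptron on the built-in Iris data (an illustrative dataset choice, not one prescribed here):

        from sklearn.datasets import load_iris
        from sklearn.linear_model import Perceptron
        from sklearn.model_selection import train_test_split

        X, y = load_iris(return_X_y=True)
        X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

        clf = Perceptron(max_iter=1000, eta0=0.1, random_state=1)  # eta0 is the learning rate
        clf.fit(X_train, y_train)
        print("Test accuracy:", clf.score(X_test, y_test))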

    • Logistic Regression

      Logistic regression is a statistical method for predicting binary classes. The outcome is modeled using a logistic function, which is a type of sigmoid curve. This approach allows for probabilities to be mapped to classifications, making it useful for binary classification tasks. It is widely used and provides interpretable results.

    • Support Vector Machines (SVM)

      Support Vector Machines are powerful classifiers that work well on both linear and non-linear data. SVM creates a hyperplane in a high-dimensional space that separates different classes. The goal is to maximize the margin between the closest points of the classes, known as support vectors. SVM can also use kernel functions to handle non-linearity.
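
      A sketch of an RBF-kernel SVM on scikit-learn's two-moons toy data, which is not linearly separable (the parameter values are illustrative):

        from sklearn.datasets import make_moons
        from sklearn.svm import SVC

        X, y = make_moons(n_samples=200, noise=0.15, random_state=0)

        svm = SVC(kernel="rbf", C=1.0, gamma="scale")   # the kernel handles the non-linearity
        svm.fit(X, y)
        print("Training accuracy:", svm.score(X, y))
        print("Support vectors per class:", svm.n_support_)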

    • Decision Trees

      Decision trees are a non-parametric supervised learning method used for classification and regression. They model decisions based on feature values and lead to a tree structure where each node represents a feature, each branch represents a decision rule, and each leaf represents the outcome. They are easy to interpret but can overfit on complicated datasets.

    • K-Nearest Neighbors (KNN)

      KNN is a simple, instance-based learning algorithm used for classification and regression. It identifies the 'k' nearest neighbors to a data point and makes predictions based on the majority label of those neighbors. KNN is easy to implement and can be very effective, but it can also be computationally expensive for large datasets.
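
      A minimal KNN sketch with scikit-learn (k=5 is an illustrative choice):

        from sklearn.datasets import load_iris
        from sklearn.model_selection import train_test_split
        from sklearn.neighbors import KNeighborsClassifier

        X, y = load_iris(return_X_y=True)
        X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

        knn = KNeighborsClassifier(n_neighbors=5)   # majority vote among 5 nearest neighbours
        knn.fit(X_train, y_train)
        print("Test accuracy:", knn.score(X_test, y_test))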

  • Data Preprocessing: Missing data, categorical data, training/test splits, feature scaling, feature selection, random forests

    Data Preprocessing
    • Missing Data

      Missing data can significantly impact the performance of machine learning models. Approaches to handle missing data include: 1. Deletion: Removing records with missing values. Suitable when the missing data is negligible. 2. Imputation: Filling in missing values using strategies like mean, median, mode, or using predictive models. 3. Indicators: Adding binary columns to indicate whether data was missing.

    • Categorical Data

      Categorical data refers to variables that represent categories or groups. To preprocess categorical data: 1. Encoding: Use techniques like one-hot encoding or label encoding to convert categorical data into numerical form. 2. Consistent Representation: Ensure each category has a single, consistent form (for example, unified spelling and case) so the same group is not split across multiple codes.
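
      A short sketch of both encodings on made-up columns:

        import pandas as pd
        from sklearn.preprocessing import LabelEncoder

        df = pd.DataFrame({"size": ["S", "M", "L", "M"],
                           "colour": ["red", "blue", "red", "green"]})

        onehot = pd.get_dummies(df, columns=["colour"])              # one binary column per category
        df["size_code"] = LabelEncoder().fit_transform(df["size"])   # integer codes
        print(onehot)
        print(df)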

    • Training/Test Splits

      It is essential to split your dataset into training and testing sets to assess model performance. Common strategies include: 1. Holdout Method: Randomly splitting the dataset into a training set (usually 70-80%) and a test set (20-30%). 2. Cross-Validation: Using techniques like k-fold cross-validation to evaluate the model on multiple train-test splits for more reliable performance metrics.
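
      A sketch combining a holdout split with 5-fold cross-validation (the dataset and model are illustrative choices):

        from sklearn.datasets import load_iris
        from sklearn.linear_model import LogisticRegression
        from sklearn.model_selection import cross_val_score, train_test_split

        X, y = load_iris(return_X_y=True)

        # Holdout: 70% training, 30% test
        X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

        # 5-fold cross-validation on the training portion
        scores = cross_val_score(LogisticRegression(max_iter=1000), X_train, y_train, cv=5)
        print("Fold accuracies:", scores, "mean:", scores.mean())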

    • Feature Scaling

      Feature scaling ensures that numerical features contribute equally to the distance computations in models. Methods include: 1. Min-Max Scaling: Rescaling features to a range of [0, 1]. 2. Standardization: Rescaling features to have a mean of 0 and a standard deviation of 1.
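
      Both methods in a minimal scikit-learn sketch (the matrix is made up):

        import numpy as np
        from sklearn.preprocessing import MinMaxScaler, StandardScaler

        X = np.array([[1.0, 200.0], [2.0, 300.0], [3.0, 400.0]])

        print(MinMaxScaler().fit_transform(X))     # each column rescaled to [0, 1]
        print(StandardScaler().fit_transform(X))   # each column: mean 0, standard deviation 1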

    • Feature Selection

      Feature selection involves choosing the most relevant features for model training. Techniques include: 1. Filter Methods: Using statistical tests to select features based on their relationship with the target variable. 2. Wrapper Methods: Using model performance to evaluate feature subsets. 3. Embedded Methods: Performing feature selection as part of the model training process.

    • Random Forests

      Random forests are ensemble learning methods that combine multiple decision trees to improve model accuracy and control overfitting. Features include: 1. Bagging: Building multiple decision trees on bootstrapped samples of the data. 2. Feature Randomness: Randomly selecting a subset of features for splitting nodes, enhancing diversity among trees.
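
      A sketch of a random forest and its impurity-based feature importances, often used as a simple embedded feature-selection signal (the Iris data is an illustrative choice):

        from sklearn.datasets import load_iris
        from sklearn.ensemble import RandomForestClassifier

        data = load_iris()
        forest = RandomForestClassifier(n_estimators=100, random_state=0)
        forest.fit(data.data, data.target)

        for name, score in zip(data.feature_names, forest.feature_importances_):
            print(f"{name}: {score:.3f}")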

  • Dimensionality Reduction: Principal component analysis, linear discriminant analysis, kernel PCA

    Dimensionality Reduction
    • Principal Component Analysis (PCA)

      PCA is a statistical technique used to emphasize variation and bring out strong patterns in a dataset. It works by identifying the directions (principal components) along which the variance of the data is maximized. The first principal component captures the most variance, and each subsequent component captures the highest variance possible under the constraint of being orthogonal to the preceding components. PCA is commonly used for feature reduction, noise reduction, and visualization.
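
      A minimal PCA sketch, standardizing first because PCA is sensitive to feature scale:

        from sklearn.datasets import load_iris
        from sklearn.decomposition import PCA
        from sklearn.preprocessing import StandardScaler

        X, _ = load_iris(return_X_y=True)
        X_std = StandardScaler().fit_transform(X)

        pca = PCA(n_components=2)
        X_2d = pca.fit_transform(X_std)          # project onto the top two components
        print("Explained variance ratio:", pca.explained_variance_ratio_)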

    • Linear Discriminant Analysis (LDA)

      LDA is a supervised learning technique used for classification and dimensionality reduction. It aims to find the linear combinations of features that best separate two or more classes. LDA maximizes the ratio of between-class variance to within-class variance, ensuring that the classes are well-separated in the lower-dimensional space. It can also be used for feature extraction, helping to improve the performance of classifiers.

    • Kernel PCA

      Kernel PCA is an extension of PCA that allows for nonlinear dimensionality reduction by using kernel methods. It implicitly maps the original data into a higher-dimensional feature space in which linear PCA can be performed. By applying a kernel function, it computes the principal components in this higher-dimensional space without ever carrying out the mapping explicitly (the kernel trick). This technique is useful for capturing the structure of data that is not linearly separable in the original feature space.
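
      A sketch of kernel PCA on the two-moons toy data, where linear PCA would not unfold the structure (the gamma value is an illustrative choice):

        from sklearn.datasets import make_moons
        from sklearn.decomposition import KernelPCA

        X, y = make_moons(n_samples=200, noise=0.05, random_state=0)

        kpca = KernelPCA(n_components=2, kernel="rbf", gamma=15)
        X_kpca = kpca.fit_transform(X)   # classes become close to linearly separable here
        print(X_kpca[:5])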

  • Model Evaluation and Hyperparameter Tuning: Cross-validation, learning curves, grid search, performance metrics

    Model Evaluation and Hyperparameter Tuning
    • Cross-Validation

      Cross-validation is a technique used to assess how a model will generalize to an independent dataset. The main idea is to partition the data into several subsets or folds. The model is trained on some folds and tested on the remaining folds. This process is repeated multiple times to ensure a robust evaluation. K-fold cross-validation is one of the most common methods: the dataset is divided into K roughly equal folds, and each fold serves once as the test set while the remaining K-1 folds are used for training.

    • Learning Curves

      Learning curves are graphical representations that show the relationship between the size of the training dataset and the model's performance. They can help diagnose whether a model suffers from high bias or high variance. A learning curve plots training and validation scores as the size of the training set increases. If the training score is high but the validation score converges to a lower value, the model likely overfits.
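
      A sketch using scikit-learn's learning_curve helper (the dataset and model are illustrative); plotting the returned scores against the training sizes gives the curve described above:

        import numpy as np
        from sklearn.datasets import load_digits
        from sklearn.linear_model import LogisticRegression
        from sklearn.model_selection import learning_curve

        X, y = load_digits(return_X_y=True)
        sizes, train_scores, val_scores = learning_curve(
            LogisticRegression(max_iter=2000), X, y, cv=5,
            train_sizes=np.linspace(0.1, 1.0, 5))

        print("Training set sizes:", sizes)
        print("Mean training accuracy:", train_scores.mean(axis=1))
        print("Mean validation accuracy:", val_scores.mean(axis=1))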

    • Grid Search

      Grid search is an exhaustive search method used for hyperparameter tuning of models. It involves defining a set of hyperparameter values and systematically evaluating the model performance for every combination of those values. The combination that yields the best performance on a validation set is selected. This method can be computationally expensive but is effective for small hyperparameter spaces.
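
      A sketch with scikit-learn's GridSearchCV over a small SVM grid (the grid values are illustrative):

        from sklearn.datasets import load_iris
        from sklearn.model_selection import GridSearchCV
        from sklearn.svm import SVC

        X, y = load_iris(return_X_y=True)

        param_grid = {"C": [0.1, 1, 10, 100], "gamma": [0.01, 0.1, 1], "kernel": ["rbf"]}
        search = GridSearchCV(SVC(), param_grid, cv=5)   # every combination, 5-fold CV each
        search.fit(X, y)
        print("Best parameters:", search.best_params_)
        print("Best cross-validated accuracy:", search.best_score_)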

    • Performance Metrics

      Performance metrics are critical in assessing the effectiveness of a model. Common metrics include accuracy, precision, recall, F1 score, and area under the ROC curve (AUC-ROC). Each metric serves a different purpose depending on the specific problem and the data distribution. For instance, accuracy may not be the best metric for imbalanced datasets, while F1 score provides a balance between precision and recall.
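
      A small sketch computing several of these metrics on made-up predictions:

        from sklearn.metrics import accuracy_score, classification_report, roc_auc_score

        y_true = [0, 0, 1, 1, 1, 0, 1, 0]
        y_pred = [0, 1, 1, 1, 0, 0, 1, 0]
        y_score = [0.2, 0.6, 0.8, 0.9, 0.4, 0.3, 0.7, 0.1]   # predicted probability of class 1

        print("Accuracy:", accuracy_score(y_true, y_pred))
        print(classification_report(y_true, y_pred))          # precision, recall, F1 per class
        print("AUC-ROC:", roc_auc_score(y_true, y_score))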

  • Ensemble Learning: Bagging, boosting, combining classifiers

    Ensemble Learning: Bagging, Boosting, Combining Classifiers
    • Introduction to Ensemble Learning

      Ensemble learning is a technique that combines multiple models to improve performance. The main idea is to leverage the strengths of various algorithms to achieve higher accuracy.

    • Bagging

      Bagging stands for Bootstrap Aggregating. It involves training multiple models independently on different subsets of the data, which are obtained by random sampling with replacement. The final prediction is made by averaging the predictions or voting among the models. One of the main algorithms that use bagging is Random Forest.
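
      A sketch of bagging decision trees with scikit-learn (the dataset and settings are illustrative):

        from sklearn.datasets import load_breast_cancer
        from sklearn.ensemble import BaggingClassifier
        from sklearn.model_selection import cross_val_score
        from sklearn.tree import DecisionTreeClassifier

        X, y = load_breast_cancer(return_X_y=True)

        # The base estimator is passed positionally to avoid keyword-name differences across versions
        bag = BaggingClassifier(DecisionTreeClassifier(), n_estimators=50,
                                bootstrap=True, random_state=0)
        print("Bagging CV accuracy:", cross_val_score(bag, X, y, cv=5).mean())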

    • Boosting

      Boosting is an iterative technique where models are trained sequentially. Each new model focuses on correcting the errors made by the previous models. This method combines weak learners to create a strong learner. Common boosting algorithms include AdaBoost and Gradient Boosting.
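
      A sketch comparing the two boosting algorithms named above (the dataset and settings are illustrative):

        from sklearn.datasets import load_breast_cancer
        from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier
        from sklearn.model_selection import cross_val_score

        X, y = load_breast_cancer(return_X_y=True)

        ada = AdaBoostClassifier(n_estimators=100, random_state=0)        # boosts shallow trees
        gbm = GradientBoostingClassifier(n_estimators=100, random_state=0)
        print("AdaBoost CV accuracy:", cross_val_score(ada, X, y, cv=5).mean())
        print("Gradient boosting CV accuracy:", cross_val_score(gbm, X, y, cv=5).mean())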

    • Comparative Analysis of Bagging and Boosting

      While both bagging and boosting aim to improve model performance, they differ in their approach. Bagging reduces variance and is suited for high-variance models, while boosting reduces bias and can improve weak learners. Bagging is parallelizable, whereas boosting is sequential.

    • Combining Classifiers

      Combining classifiers can be done in various ways, such as voting, averaging, or stacking. It involves using the strengths of different classifiers to enhance the overall predictive performance. Stacking involves training a meta-learner on the predictions from base models to improve outcomes.
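
      A sketch of soft voting over three different classifiers (the base models are illustrative choices):

        from sklearn.datasets import load_iris
        from sklearn.ensemble import VotingClassifier
        from sklearn.linear_model import LogisticRegression
        from sklearn.model_selection import cross_val_score
        from sklearn.neighbors import KNeighborsClassifier
        from sklearn.tree import DecisionTreeClassifier

        X, y = load_iris(return_X_y=True)
        vote = VotingClassifier(
            estimators=[("lr", LogisticRegression(max_iter=1000)),
                        ("dt", DecisionTreeClassifier(max_depth=3)),
                        ("knn", KNeighborsClassifier())],
            voting="soft")                       # average the predicted probabilities
        print("Voting CV accuracy:", cross_val_score(vote, X, y, cv=5).mean())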

  • Regression Analysis: Linear regression, robust regression, polynomial regression, random forests

    Regression Analysis
    • Linear Regression

      Linear regression is the simplest form of regression analysis. It assumes a linear relationship between the independent variable X and the dependent variable Y. The method fits a linear equation to observed data, minimizing the sum of the squares of the vertical distances of the points from the line of best fit. It is widely used due to its simplicity and interpretability.

    • Robust Regression

      Robust regression is used when data contains outliers or is not homoscedastic. Unlike linear regression, which can be influenced significantly by outliers, robust regression techniques reduce the influence of these outliers. Methods such as Huber or Tukey's biweight are applied to achieve a more accurate model in the presence of anomalies.

    • Polynomial Regression

      Polynomial regression extends linear regression by considering polynomial equations for the relationship between variables. It allows for the modeling of non-linear relationships by introducing polynomial terms. The degree of the polynomial must be chosen carefully to avoid overfitting, especially in cases with limited data points.
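
      A sketch fitting a degree-2 polynomial via a scikit-learn pipeline on synthetic quadratic data:

        import numpy as np
        from sklearn.linear_model import LinearRegression
        from sklearn.pipeline import make_pipeline
        from sklearn.preprocessing import PolynomialFeatures

        rng = np.random.default_rng(0)
        X = np.linspace(-3, 3, 50).reshape(-1, 1)
        y = 0.5 * X.ravel() ** 2 + X.ravel() + rng.normal(0, 0.5, 50)   # quadratic plus noise

        model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
        model.fit(X, y)
        print("R^2 on the training data:", model.score(X, y))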

    • Random Forests

      A random forest is an ensemble learning method that constructs multiple decision trees during training and outputs the mode of their predictions for classification tasks or the mean prediction for regression tasks. It improves predictive accuracy and controls overfitting. The technique is particularly useful when dealing with large datasets or datasets with complex interactions.

  • Clustering: k-means, hierarchical clustering, DBSCAN

    Clustering
    • Introduction to Clustering

      Clustering is the task of grouping a set of objects in such a way that objects in the same group (cluster) are more similar to each other than to those in other groups. It is an unsupervised learning technique used in various fields such as data mining, image analysis, and pattern recognition.

    • K-Means Clustering

      K-means is a popular partitioning method that divides data into K distinct clusters. The algorithm starts with K initial centroids, assigns each data point to the nearest centroid, recalculates each centroid as the mean of its assigned points, and repeats these two steps until the assignments stop changing. Its advantages include simplicity and efficiency, while limitations include sensitivity to initial centroid placement and the requirement to predefine the number of clusters.
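
      A minimal K-means sketch on synthetic blob data (K=3 matches how the blobs were generated):

        from sklearn.cluster import KMeans
        from sklearn.datasets import make_blobs

        X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

        km = KMeans(n_clusters=3, n_init=10, random_state=0)
        labels = km.fit_predict(X)
        print("Cluster centroids:\n", km.cluster_centers_)
        print("Within-cluster sum of squares (inertia):", km.inertia_)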

    • Hierarchical Clustering

      Hierarchical clustering creates a tree-like structure of nested clusters. It can be agglomerative (bottom-up) or divisive (top-down). The process involves calculating the distance between data points and merging or splitting clusters based on a linkage criterion. This method does not require a predefined number of clusters and provides a dendrogram for visual representation, though it can be computationally intensive.
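
      A sketch building a Ward-linkage dendrogram with SciPy; the small blob dataset keeps the tree readable:

        import matplotlib.pyplot as plt
        from scipy.cluster.hierarchy import dendrogram, linkage
        from sklearn.datasets import make_blobs

        X, _ = make_blobs(n_samples=30, centers=3, random_state=0)

        Z = linkage(X, method="ward")   # agglomerative merges with Ward linkage
        dendrogram(Z)
        plt.show()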

    • DBSCAN (Density-Based Spatial Clustering of Applications with Noise)

      DBSCAN is a density-based clustering method that groups together points that are closely packed and marks points in low-density regions as outliers. By defining parameters like minimum points and epsilon (radius), DBSCAN can identify clusters of varying shapes and sizes, making it effective for spatial data. Its advantages include the ability to discover arbitrary-shaped clusters and handle noise, while its challenges involve setting appropriate parameters.
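
      A DBSCAN sketch on the two-moons data; the eps and min_samples values are illustrative and usually need tuning:

        from sklearn.cluster import DBSCAN
        from sklearn.datasets import make_moons

        X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

        db = DBSCAN(eps=0.2, min_samples=5)     # eps = neighbourhood radius
        labels = db.fit_predict(X)              # label -1 marks noise points
        print("Clusters found:", len(set(labels) - {-1}))
        print("Noise points:", list(labels).count(-1))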

  • Embedding ML Models into Web Applications: Serialization, SQLite, Flask, deployment

    Embedding ML Models into Web Applications
    • Serialization of ML Models

      Serialization is the process of converting a model into a format that can be saved to disk or transmitted over a network. Common libraries for serialization in Python include Pickle and Joblib. Using these libraries allows developers to save trained models so they can be loaded later without needing to retrain.
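
      A sketch of saving and reloading a trained model with joblib (the file name is hypothetical):

        import joblib
        from sklearn.datasets import load_iris
        from sklearn.linear_model import LogisticRegression

        X, y = load_iris(return_X_y=True)
        model = LogisticRegression(max_iter=1000).fit(X, y)

        joblib.dump(model, "model.pkl")       # serialize the trained model to disk
        restored = joblib.load("model.pkl")   # later: reload without retraining
        print(restored.predict(X[:3]))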

    • Using SQLite for Data Storage

      SQLite is a lightweight, disk-based database that does not require a separate server process. It is useful for storing application data and can be easily integrated into Flask applications. SQLite allows developers to manage user data and model inputs and outputs effectively.
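
      A minimal sketch with Python's built-in sqlite3 module (table and file names are hypothetical):

        import sqlite3

        conn = sqlite3.connect("predictions.db")   # creates the file if it does not exist
        conn.execute("CREATE TABLE IF NOT EXISTS results (input TEXT, prediction TEXT)")
        conn.execute("INSERT INTO results VALUES (?, ?)", ("5.1,3.5,1.4,0.2", "setosa"))
        conn.commit()
        for row in conn.execute("SELECT * FROM results"):
            print(row)
        conn.close()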

    • Building Web Applications with Flask

      Flask is a micro web framework for Python that allows developers to build web applications quickly. It provides routing, templating, and session management. Flask is a suitable choice for integrating machine learning models due to its simplicity and flexibility, allowing the serving of model predictions via web endpoints.
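
      A hedged sketch of a prediction endpoint, assuming a model was serialized earlier as model.pkl (a hypothetical file name):

        import joblib
        from flask import Flask, jsonify, request

        app = Flask(__name__)
        model = joblib.load("model.pkl")

        @app.route("/predict", methods=["POST"])
        def predict():
            features = request.get_json()["features"]   # e.g. {"features": [5.1, 3.5, 1.4, 0.2]}
            prediction = model.predict([features])[0]
            return jsonify({"prediction": int(prediction)})

        if __name__ == "__main__":
            app.run(debug=True)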

    • Deployment of ML Models

      Deployment refers to the process of making the ML model available for use in a production environment. Common deployment options include cloud services (like AWS, Google Cloud), containerization with Docker, and serverless architectures. It is crucial to ensure that the model performs well in production and can handle user requests efficiently.

Machine Learning | M.Sc. Data Science, Semester III | Core VI | Periyar University
