Semester 1: Statistics I

  • Introduction to Statistics: Data collection and descriptive statistics, populations and samples, history

    Introduction to Statistics
    • Data Collection

      Data collection is the process of gathering information to answer research questions. It can be completed through various methods including surveys, experiments, and observational studies. The data must be relevant, accurate, and collected systematically to ensure its reliability.

    • Descriptive Statistics

      Descriptive statistics summarize and describe the main features of a dataset. This includes measures such as mean, median, mode, variance, and standard deviation. Graphical representations like histograms, bar charts, and box plots are also essential for visualizing data distribution.

    • Populations and Samples

      A population refers to the entire group of individuals or instances about whom we hope to learn. A sample is a subset of the population used for analysis. Proper sampling methods are crucial to ensure that the sample accurately represents the population, influencing the validity of statistical conclusions.
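
      As a small illustration (a minimal sketch using only the standard library, with invented exam-score data), a simple random sample can be drawn from a finite population and its mean compared with the population mean:

        import random

        # Hypothetical population: exam scores of 1,000 students
        random.seed(42)
        population = [random.gauss(70, 10) for _ in range(1000)]

        # Simple random sample of 50 students, drawn without replacement
        sample = random.sample(population, k=50)

        population_mean = sum(population) / len(population)
        sample_mean = sum(sample) / len(sample)
        print(f"population mean: {population_mean:.2f}, sample mean: {sample_mean:.2f}")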

    • History of Statistics

      Statistical methods have developed over centuries. Early applications trace back to census data collection in ancient civilizations. The field gained formal recognition in the 18th century, building on 17th-century advances in probability theory by mathematicians such as Pascal and Fermat. The 20th century saw the growth of inferential statistics, allowing researchers to make predictions about populations based on sample data.

  • Organization and Presentation of Data: Types of data, measurements, graphical representation

    Organization and Presentation of Data
    • Types of Data

      Data can be categorized into different types, including qualitative and quantitative data. Qualitative data is descriptive and non-numerical, often involving categories or labels. Quantitative data is numerical and can be further divided into discrete data, which takes only countable, distinct values (such as counts), and continuous data, which can take any value within a range (such as measurements).

    • Measurements

      Measurement in statistics involves assigning values to observations on one of four scales: nominal, ordinal, interval, and ratio. Nominal scales categorize data without any specific order, while ordinal scales involve ordered categories. Interval scales have measurable distances between values but lack a true zero point, and ratio scales possess both measurable distances and a true zero point, allowing for meaningful ratios and comparisons.

    • Graphical Representation

      Graphical representation of data is crucial for visual understanding. Common forms include bar graphs for categorical data, histograms for frequency distribution of quantitative data, line graphs for trends over time, and pie charts for proportional data distribution. The choice of graph depends on the data type and the information one wants to convey.
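
      A minimal matplotlib sketch of two of the chart types mentioned above; the category counts and height measurements are invented purely for illustration:

        import matplotlib.pyplot as plt

        # Hypothetical categorical data: counts per blood group
        categories = ["A", "B", "AB", "O"]
        counts = [34, 22, 9, 35]

        # Hypothetical quantitative data: heights in cm
        heights = [158, 160, 162, 165, 165, 167, 170, 171, 172, 175, 178, 180, 183]

        fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))
        ax1.bar(categories, counts)   # bar graph for categorical data
        ax1.set_title("Blood group (bar chart)")
        ax2.hist(heights, bins=5)     # histogram for quantitative data
        ax2.set_title("Height (histogram)")
        plt.tight_layout()
        plt.show()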

    • Importance of Data Organization

      Organizing data systematically is crucial for effective analysis. This can include the use of tables, databases, and spreadsheets to ensure data is easily accessible and interpretable. Proper organization allows for better data management and facilitates the process of data analysis, enhancing the clarity of insights derived from the data.

  • Descriptive Statistics: Frequency tables, histograms, mean, median, mode, variance, percentiles, Chebyshev's inequality, normal data sets, correlation

    Descriptive Statistics
    • Frequency Tables

      Frequency tables organize data into categories showing the number of occurrences for each category. They help in identifying data distribution and patterns.
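
      A frequency table can be built directly with Python's collections.Counter; the grades below are hypothetical:

        from collections import Counter

        grades = ["A", "B", "B", "C", "A", "B", "D", "C", "B", "A"]  # hypothetical data
        freq = Counter(grades)

        print("Grade  Frequency  Relative frequency")
        for grade, count in sorted(freq.items()):
            print(f"{grade:<6} {count:<10} {count / len(grades):.2f}")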

    • Histograms

      Histograms are graphical representations of frequency distributions. They display the data by grouping values into intervals, providing insights into the shape and spread of the data.

    • Mean

      The mean is the average value, calculated by summing all data points and dividing by the number of points. It is sensitive to outliers, which can make it a misleading summary of a skewed dataset.

    • Median

      The median is the middle value in a sorted list of numbers. It is less affected by outliers, making it a better measure of central tendency for skewed data.

    • Mode

      The mode is the value that appears most frequently in a dataset. A dataset may have no mode, one mode, or multiple modes (bimodal or multimodal).

    • Variance

      Variance measures the degree of spread in the data set by calculating the average of the squared differences from the mean. A higher variance indicates more variability.
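
      The mean, median, mode, and variance described above can all be computed with Python's built-in statistics module; the small dataset is hypothetical:

        import statistics

        data = [4, 8, 6, 5, 3, 8, 9, 5, 8]  # hypothetical observations

        print("mean           :", statistics.mean(data))       # arithmetic average
        print("median         :", statistics.median(data))     # middle value of the sorted data
        print("mode           :", statistics.mode(data))       # most frequent value (8)
        print("variance       :", statistics.pvariance(data))  # population variance (divides by n)
        print("sample variance:", statistics.variance(data))   # divides by n - 1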

    • Percentiles

      Percentiles divide a dataset into 100 equal parts. The nth percentile is the value below which n percent of the data falls. This is useful for understanding the relative standing of a value.
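
      A short numpy sketch of percentile computation; the data values are hypothetical:

        import numpy as np

        data = np.array([12, 15, 17, 21, 22, 25, 28, 30, 33, 40])  # hypothetical data

        p25, p50, p90 = np.percentile(data, [25, 50, 90])
        print("25th percentile:", p25)   # value below which about 25% of the data fall
        print("50th percentile:", p50)   # the median
        print("90th percentile:", p90)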

    • Chebyshev's Inequality

      Chebyshev's inequality states that for any dataset, the proportion of values that lie within k standard deviations from the mean is at least 1 - (1/k^2) for k > 1. It applies to all distributions.
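
      A quick numeric check of the bound, using a hypothetical, strongly skewed dataset: for k = 2 the inequality guarantees that at least 1 - 1/4 = 75% of the values lie within two standard deviations of the mean.

        import numpy as np

        rng = np.random.default_rng(0)
        data = rng.exponential(scale=2.0, size=10_000)  # hypothetical skewed data

        mean, sd, k = data.mean(), data.std(), 2
        within = np.mean(np.abs(data - mean) <= k * sd)

        print(f"Chebyshev bound for k={k}: {1 - 1 / k**2:.2f}")
        print(f"Observed proportion      : {within:.3f}")  # always >= the bound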

    • Normal Data Sets

      Normally distributed data follow a symmetric, bell-shaped curve in which most observations cluster around the mean. Approximately 68%, 95%, and 99.7% of the observations fall within one, two, and three standard deviations of the mean, respectively.
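
      A small simulation, assuming a standard normal sample, illustrating the 68-95-99.7 rule mentioned above:

        import numpy as np

        rng = np.random.default_rng(1)
        x = rng.normal(loc=0.0, scale=1.0, size=100_000)  # simulated normal data

        for k in (1, 2, 3):
            prop = np.mean(np.abs(x) <= k)
            print(f"within {k} sd: {prop:.3f}")  # roughly 0.683, 0.954, 0.997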

    • Correlation

      Correlation measures the strength and direction of a linear relationship between two variables. A correlation coefficient close to 1 or -1 indicates a strong linear relationship, whereas a value near 0 indicates little or no linear relationship.
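
      A minimal example using numpy's Pearson correlation coefficient; the paired measurements are hypothetical:

        import numpy as np

        hours_studied = np.array([1, 2, 3, 4, 5, 6, 7, 8])        # hypothetical data
        exam_score = np.array([52, 55, 61, 64, 70, 74, 79, 85])

        r = np.corrcoef(hours_studied, exam_score)[0, 1]
        print(f"Pearson correlation: {r:.3f}")  # close to 1: strong positive linear relationship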

  • Random Variables and Expectation: Types, distributions, expectation and variance, moment generating functions

    Random Variables and Expectation
    • Types of Random Variables

      Random variables can be classified into two main types: discrete and continuous random variables. A discrete random variable takes on a countable number of distinct values, such as the roll of a die. A continuous random variable, on the other hand, can take on any value within a given range, such as the height of a person.

    • Probability Distributions

      Probability distributions describe how probabilities are distributed over the values of a random variable. For discrete random variables, the probability mass function (PMF) gives the probability of each possible value. For continuous random variables, the probability density function (PDF) plays a similar role: its value at a point is a relative likelihood rather than a probability, and probabilities are obtained by integrating the PDF over an interval.
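
      A brief scipy.stats sketch contrasting a PMF (discrete) with a PDF (continuous); the distributions and parameters are chosen arbitrarily for illustration:

        from scipy import stats

        # Discrete: number of heads in 10 fair coin tosses (binomial PMF)
        print("P(X = 4)     :", stats.binom.pmf(4, 10, 0.5))   # k=4 heads, n=10 trials, p=0.5

        # Continuous: standard normal PDF and an interval probability
        print("f(0)         :", stats.norm.pdf(0))             # a density, not a probability
        print("P(-1 < X < 1):", stats.norm.cdf(1) - stats.norm.cdf(-1))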

    • Expectation

      The expectation, or expected value, of a random variable is a measure of the central tendency. For a discrete random variable, it is calculated as the sum of the products of each value and its corresponding probability. For continuous random variables, it involves integrating the product of the value and its PDF across the possible range.

    • Variance

      Variance quantifies the spread of a random variable's values around its mean. It measures the average squared deviation from the expected value. For discrete random variables, variance is calculated using the probabilities and values, while for continuous variables, it involves integration with respect to the PDF.
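
      A worked example covering both the expectation and the variance above, assuming a fair six-sided die as the discrete random variable:

        # Fair die: values 1..6, each with probability 1/6
        values = range(1, 7)
        p = 1 / 6

        expectation = sum(x * p for x in values)                    # E[X] = 3.5
        variance = sum((x - expectation) ** 2 * p for x in values)  # Var(X) = E[(X - E[X])^2], about 2.917

        print("E[X]   =", expectation)
        print("Var(X) =", variance)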

    • Moment Generating Functions

      Moment generating functions (MGFs) encapsulate all moments of a random variable. The MGF is defined as M(t) = E[e^(tX)], where t is a real parameter and X is the random variable. Differentiating the MGF and evaluating at t = 0 yields the moments, and MGFs are also useful for identifying distributions and studying the convergence of sequences of random variables.
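
      A symbolic sketch with sympy, assuming a Bernoulli(p) random variable: its MGF is M(t) = 1 - p + p*e^t, and differentiating at t = 0 recovers the moments.

        import sympy as sp

        t, p = sp.symbols("t p", positive=True)
        M = 1 - p + p * sp.exp(t)                      # MGF of a Bernoulli(p) random variable

        first_moment = sp.diff(M, t).subs(t, 0)        # E[X] = p
        second_moment = sp.diff(M, t, 2).subs(t, 0)    # E[X^2] = p
        variance = sp.factor(second_moment - first_moment**2)  # equals p(1 - p)

        print(first_moment, second_moment, variance)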

  • Sampling Distributions and Parameter Estimation: Central limit theorem, maximum likelihood estimators, confidence intervals

    Sampling Distributions and Parameter Estimation
    • Central Limit Theorem

      The Central Limit Theorem states that the sampling distribution of the sample mean approaches a normal distribution as the sample size increases, regardless of the population's distribution, provided the observations are independent and identically distributed (i.i.d.) with finite variance. It is fundamental in inferential statistics, as it provides the foundation for making inferences about population parameters based on sample statistics.
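
      A short simulation illustrating the theorem: sample means of a skewed (exponential) population become approximately normal as the sample size grows. The population and sample sizes are arbitrary.

        import numpy as np

        rng = np.random.default_rng(7)

        def sample_means(n, reps=5_000):
            # Draw `reps` samples of size n from a skewed exponential population
            return rng.exponential(scale=1.0, size=(reps, n)).mean(axis=1)

        for n in (2, 10, 50):
            means = sample_means(n)
            # The spread shrinks roughly like 1/sqrt(n) and the histogram of `means`
            # looks increasingly bell-shaped as n grows.
            print(f"n={n:>3}: mean of sample means={means.mean():.3f}, sd={means.std():.3f}")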

    • Maximum Likelihood Estimators

      Maximum likelihood estimation (MLE) is a method of estimating parameters of a statistical model. It finds the parameter values that maximize the likelihood function, which measures how well the model explains the observed data. MLE is widely used due to its desirable properties, such as consistency and asymptotic normality under certain conditions.
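
      A minimal numerical MLE, assuming hypothetical Bernoulli coin-flip data: the log-likelihood is maximized over p, and the result matches the closed-form estimate (the sample proportion).

        import numpy as np
        from scipy.optimize import minimize_scalar

        data = np.array([1, 0, 1, 1, 0, 1, 1, 1, 0, 1])  # hypothetical coin flips (1 = heads)

        def neg_log_likelihood(p):
            return -np.sum(data * np.log(p) + (1 - data) * np.log(1 - p))

        result = minimize_scalar(neg_log_likelihood, bounds=(1e-6, 1 - 1e-6), method="bounded")
        print("Numerical MLE of p       :", result.x)   # approximately 0.7
        print("Closed form (sample mean):", data.mean())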

    • Confidence Intervals

      A confidence interval is a range of values that is likely to contain the population parameter with a specified level of confidence, usually expressed as a percentage (e.g., 95% confidence interval). It is constructed using the sample statistic and its standard error, reflecting the uncertainty regarding the estimation of the population parameter.
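
      A sketch of a 95% confidence interval for a mean using the normal approximation (z = 1.96); the sample values are hypothetical, and for a sample this small a t critical value would normally be preferred.

        import math
        import statistics

        sample = [23.1, 19.8, 21.4, 22.7, 20.9, 24.2, 22.0, 21.5, 23.8, 20.3]  # hypothetical data

        mean = statistics.mean(sample)
        se = statistics.stdev(sample) / math.sqrt(len(sample))  # standard error of the mean
        z = 1.96                                                # 95% normal critical value

        lower, upper = mean - z * se, mean + z * se
        print(f"95% CI for the mean: ({lower:.2f}, {upper:.2f})")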

  • Basics of Probability: Definition, classical and axiomatic approaches, laws of probability, conditional probability, Bayes theorem

    Basics of Probability
    • Definition

      Probability is a measure of the likelihood that an event will occur. It quantifies uncertainty and ranges from 0 (impossible event) to 1 (certain event). The probability of an event A is denoted as P(A).

    • Classical Approach

      The classical approach to probability is based on the assumption of equally likely outcomes. If an experiment has n equally likely outcomes, and m of those outcomes favor event A, then the probability of event A is P(A) = m/n. This is commonly used in scenarios such as rolling dice or flipping coins.

    • Axiomatic Approach

      The axiomatic approach, developed by Andrey Kolmogorov, defines probability through three axioms:
      1. Non-negativity: P(A) >= 0 for any event A.
      2. Normalization: P(S) = 1, where S is the sample space.
      3. Additivity: for any sequence of mutually exclusive events A1, A2, ..., P(A1 ∪ A2 ∪ ...) = P(A1) + P(A2) + ... .

    • Laws of Probability

      The laws of probability include the addition law and multiplication law. The addition law states that for any two events A and B, P(A ∪ B) = P(A) + P(B) - P(A ∩ B). The multiplication law states that for independent events A and B, P(A ∩ B) = P(A) * P(B).
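
      A brute-force check of the addition law by enumerating the sample space of one fair die roll; the events A and B are chosen arbitrarily.

        from fractions import Fraction

        S = set(range(1, 7))   # sample space of a fair die
        A = {2, 4, 6}          # event: even number
        B = {4, 5, 6}          # event: at least 4

        def prob(event):
            return Fraction(len(event), len(S))

        lhs = prob(A | B)
        rhs = prob(A) + prob(B) - prob(A & B)
        print(lhs, "==", rhs, "->", lhs == rhs)  # addition law holds: 2/3 == 2/3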

    • Conditional Probability

      Conditional probability is the probability of an event given that another event has occurred. It is denoted as P(A | B) and calculated using the formula P(A | B) = P(A ∩ B) / P(B), provided P(B) > 0.

    • Bayes Theorem

      Bayes' theorem relates conditional probabilities and can be expressed as P(A | B) = [P(B | A) * P(A)] / P(B). It is particularly useful for updating probabilities in light of new evidence.
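
      A worked numeric example under assumed values: a disease with 1% prevalence and a test with 95% sensitivity and a 10% false-positive rate. Bayes' theorem gives the probability of disease given a positive test.

        p_disease = 0.01             # assumed prevalence, P(D)
        p_pos_given_disease = 0.95   # assumed sensitivity, P(+ | D)
        p_pos_given_healthy = 0.10   # assumed false-positive rate, P(+ | not D)

        # Total probability of a positive test, P(+)
        p_pos = p_pos_given_disease * p_disease + p_pos_given_healthy * (1 - p_disease)

        # Bayes' theorem: P(D | +) = P(+ | D) * P(D) / P(+)
        p_disease_given_pos = p_pos_given_disease * p_disease / p_pos
        print(f"P(disease | positive test) = {p_disease_given_pos:.3f}")  # about 0.088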
