Periyar University
M.Sc. Data Science, Semester II
Core V: Statistics II

Introduction to Hypothesis Testing: Sampling distributions, test of significance, types of hypothesis, errors, p-value
Introduction to Hypothesis Testing
Sampling Distributions
A sampling distribution is the probability distribution of a statistic obtained from a large number of samples drawn from a specific population. It describes how a sample statistic, such as the sample mean or a sample proportion, varies from sample to sample, and it allows researchers to make inferences about the population from which the samples were taken.
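As a minimal illustration (a simulation sketch in Python assuming NumPy; the population and sample sizes are invented), repeatedly drawing samples from a skewed exponential population shows that the sample mean has an approximately normal sampling distribution with standard deviation sigma/sqrt(n):

import numpy as np

rng = np.random.default_rng(42)
scale = 2.0                      # exponential population: mean = 2, sd = 2
n, n_samples = 50, 10_000

# Draw many samples of size n and record each sample mean
sample_means = rng.exponential(scale, size=(n_samples, n)).mean(axis=1)

print(sample_means.mean())       # close to the population mean, 2.0
print(sample_means.std(ddof=1))  # close to sd / sqrt(n) = 2 / sqrt(50), about 0.28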
Test of Significance
A test of significance is a statistical procedure that determines whether the observed data deviate significantly from what the null hypothesis predicts. It helps in deciding whether to reject or fail to reject the null hypothesis based on the evidence provided by the sample data.
Types of Hypothesis
There are primarily two types of hypotheses in hypothesis testing: the null hypothesis (H0), which represents a statement of no effect or no difference, and the alternative hypothesis (H1 or Ha), which represents a statement that contradicts the null hypothesis.
Errors in Hypothesis Testing
Errors in hypothesis testing refer to incorrect conclusions based on the analysis of sample data. There are two types of errors: Type I error, which occurs when the null hypothesis is rejected when it is actually true, and Type II error, which occurs when the null hypothesis is not rejected when it is false.
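The Type I error rate can be checked by simulation. In the sketch below (Python with NumPy and SciPy; all numbers are invented), the null hypothesis is true by construction, so a test at significance level 0.05 should reject in roughly 5% of repetitions:

import numpy as np
from scipy import stats

rng = np.random.default_rng(14)
alpha, n, n_trials = 0.05, 30, 10_000

# H0 is true here (the population mean really is 0),
# so every rejection is a Type I error
rejections = 0
for _ in range(n_trials):
    sample = rng.normal(0.0, 1.0, size=n)
    _, p = stats.ttest_1samp(sample, popmean=0.0)
    rejections += p < alpha

print(rejections / n_trials)     # close to alpha = 0.05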
P-value
The p-value is the probability, computed assuming the null hypothesis is true, of obtaining a test statistic at least as extreme as the one actually observed. A low p-value (typically below a chosen significance level such as 0.05) indicates strong evidence against the null hypothesis, leading to its rejection, while a high p-value suggests weak evidence against it.
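Concretely, a one-sample t-test in SciPy returns both the test statistic and the p-value (a sketch with invented data):

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
sample = rng.normal(loc=5.3, scale=1.0, size=30)   # hypothetical measurements

# H0: population mean = 5.0 versus H1: population mean != 5.0
t_stat, p_value = stats.ttest_1samp(sample, popmean=5.0)
print(t_stat, p_value)           # reject H0 at the 5% level if p_value < 0.05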
Hypothesis Testing Methods: t-test, F-test, Chi-square tests for independence and goodness of fit
Hypothesis Testing Methods
t-test
The t-test is used to determine if there is a statistically significant difference between the means of two groups. It is applicable primarily when the sample size is small and the population standard deviation is unknown. There are different types of t-tests, including the independent t-test, which compares two different groups, and the paired t-test, which compares two measurements taken from the same group.
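A sketch of both variants with SciPy (the data are invented; Welch's form is used for the independent test because it does not require equal variances):

import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
group_a = rng.normal(10.0, 2.0, size=25)   # hypothetical scores, group A
group_b = rng.normal(11.0, 2.0, size=25)   # hypothetical scores, group B

# Independent t-test (Welch's form, equal variances not assumed)
print(stats.ttest_ind(group_a, group_b, equal_var=False))

# Paired t-test on before/after measurements of the same subjects
before, after = group_a, group_a + rng.normal(0.5, 0.5, size=25)
print(stats.ttest_rel(before, after))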
F-test
The F-test is used to compare two variances and can determine if they are significantly different from each other. It is commonly used in the context of ANOVA (Analysis of Variance), which extends the t-test for more than two groups. The F-test assesses whether the variability between group means is greater than the variability within the groups.
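SciPy has no dedicated two-sample variance F-test, so the sketch below computes the statistic directly from its definition (the data are invented):

import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
x = rng.normal(0.0, 1.0, size=30)
y = rng.normal(0.0, 1.5, size=30)

# F statistic: ratio of the two sample variances
f_stat = np.var(x, ddof=1) / np.var(y, ddof=1)
dfx, dfy = len(x) - 1, len(y) - 1

# Two-sided p-value from the F distribution
tail = stats.f.sf(f_stat, dfx, dfy) if f_stat > 1 else stats.f.cdf(f_stat, dfx, dfy)
print(f_stat, 2 * tail)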
Chi-square tests
Chi-square tests are non-parametric tests used to determine if there is a significant association between categorical variables. Two main types of chi-square tests are employed: the chi-square test for independence, which assesses if two categorical variables are independent, and the chi-square goodness of fit test, which checks if the observed distribution of data fits a specific theoretical distribution.
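Both variants are available in SciPy; the counts below are invented for illustration:

import numpy as np
from scipy import stats

# Independence: 2x2 contingency table of hypothetical counts
table = np.array([[30, 10],
                  [20, 40]])
chi2, p, dof, expected = stats.chi2_contingency(table)
print(chi2, p)

# Goodness of fit: do 120 hypothetical die rolls fit a fair die?
observed = np.array([18, 22, 19, 21, 24, 16])
print(stats.chisquare(observed))    # expected counts are uniform by default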
Regression Analysis: Least squares estimators, inferences, prediction intervals, polynomial and multiple regression, logistic regression models
Regression Analysis
Least Squares Estimators
The method of least squares estimates the parameters of a regression model by minimizing the sum of the squared residuals, the differences between observed and fitted values. The resulting estimators give the best-fitting line through the data points in the least squares sense.
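For a straight line y = a + bx, the least squares estimates have a closed form, sketched here with NumPy on invented data:

import numpy as np

rng = np.random.default_rng(3)
x = np.linspace(0, 10, 40)
y = 2.0 + 0.5 * x + rng.normal(0, 1.0, size=x.size)   # true line plus noise

# Closed-form least squares estimates for slope and intercept
slope = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
intercept = y.mean() - slope * x.mean()
print(intercept, slope)          # roughly 2.0 and 0.5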
Inferences
Inferences in regression involve drawing conclusions about the population parameters based on sample data. This includes hypothesis testing for coefficients and calculating confidence intervals for estimates.
Prediction Intervals
Prediction intervals provide a range in which future observations are expected to fall, given a certain level of confidence. These intervals account for both the variability of the data and the inherent uncertainty in the regression model.
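Both kinds of inference are available from a fitted model in statsmodels: the summary gives t-tests and confidence intervals for the coefficients, and in the prediction summary frame the mean_ci_* columns are the confidence interval for the mean response while the obs_ci_* columns are the wider prediction interval for a new observation (a sketch on invented data):

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(4)
x = np.linspace(0, 10, 40)
y = 2.0 + 0.5 * x + rng.normal(0, 1.0, size=x.size)

X = sm.add_constant(x)               # design matrix with an intercept column
model = sm.OLS(y, X).fit()
print(model.summary())               # coefficient tests and confidence intervals

# 95% intervals at two new x values
new_X = sm.add_constant(np.array([2.0, 8.0]))
print(model.get_prediction(new_X).summary_frame(alpha=0.05))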
Polynomial Regression
Polynomial regression is a form of regression analysis in which the relationship between the independent variable and dependent variable is modeled as an nth degree polynomial. It is useful for modeling nonlinear relationships.
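A quadratic fit with NumPy (invented data generated from a known quadratic):

import numpy as np

rng = np.random.default_rng(5)
x = np.linspace(-3, 3, 50)
y = 1.0 - 2.0 * x + 0.5 * x**2 + rng.normal(0, 0.5, size=x.size)

coeffs = np.polyfit(x, y, deg=2)     # coefficients, highest degree first
print(coeffs)                        # roughly [0.5, -2.0, 1.0]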
Multiple Regression
Multiple regression involves using two or more independent variables to predict the value of a dependent variable. It allows for the examination of the relationship between multiple predictors and the outcome.
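With two predictors, the model can be fitted by solving the least squares problem for the full design matrix (a NumPy sketch on invented data):

import numpy as np

rng = np.random.default_rng(6)
n = 100
x1, x2 = rng.normal(size=n), rng.normal(size=n)
y = 1.0 + 2.0 * x1 - 3.0 * x2 + rng.normal(0, 0.5, size=n)

# Design matrix: intercept column plus the two predictors
X = np.column_stack([np.ones(n), x1, x2])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta)                          # roughly [1.0, 2.0, -3.0]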
Logistic Regression
Logistic regression is used when the dependent variable is categorical. It models the probability of a certain class or event occurring and is commonly used in binary classification problems.
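A binary classification sketch with scikit-learn (the data are simulated so that the log-odds depend linearly on x):

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(7)
x = rng.normal(size=(200, 1))
p = 1 / (1 + np.exp(-(0.5 + 2.0 * x[:, 0])))   # true success probabilities
y = rng.binomial(1, p)

clf = LogisticRegression().fit(x, y)
print(clf.intercept_, clf.coef_)     # roughly 0.5 and 2.0
print(clf.predict_proba(x[:3]))      # class probabilities for the first rows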
Analysis of Variance (ANOVA): One-way, two-way ANOVA, multiple comparisons, interaction effects
Analysis of Variance (ANOVA)
One-way ANOVA
One-way ANOVA is used to compare means between three or more independent groups. The null hypothesis states that all group means are equal. The test calculates the F-statistic, which compares the variance between the groups to the variance within the groups. If the F-statistic is significantly high, we reject the null hypothesis, indicating at least one group mean is different.
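With SciPy this is a one-liner once the group data are in hand (invented samples):

import numpy as np
from scipy import stats

rng = np.random.default_rng(8)
g1 = rng.normal(10, 2, size=20)      # three hypothetical treatment groups
g2 = rng.normal(12, 2, size=20)
g3 = rng.normal(10, 2, size=20)

f_stat, p_value = stats.f_oneway(g1, g2, g3)
print(f_stat, p_value)               # small p-value: at least one mean differs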
Two-way ANOVA
Two-way ANOVA assesses the effect of two factors on a dependent variable, along with their interaction effects. It helps to understand if there is a significant effect of each factor, as well as whether the effect of one factor depends on the level of the other factor. It expands on one-way ANOVA by allowing interaction terms.
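A sketch with statsmodels, where the formula y ~ C(a) * C(b) fits both main effects and their interaction (the factor levels and effect sizes are invented):

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

rng = np.random.default_rng(9)
df = pd.DataFrame({
    "a": np.repeat(["a1", "a2"], 40),
    "b": np.tile(np.repeat(["b1", "b2"], 20), 2),
})
# Response built with main effects and an interaction
df["y"] = (rng.normal(size=80)
           + (df["a"] == "a2") * 1.0
           + (df["b"] == "b2") * 0.5
           + ((df["a"] == "a2") & (df["b"] == "b2")) * 1.5)

model = smf.ols("y ~ C(a) * C(b)", data=df).fit()
print(anova_lm(model, typ=2))        # rows for C(a), C(b), and C(a):C(b)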
Multiple Comparisons
After finding significant results in ANOVA, multiple comparisons tests such as Tukey's HSD or Bonferroni correction are conducted to identify which specific groups differ. These tests adjust for Type I error rates that can be inflated when multiple pairwise comparisons are made.
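Tukey's HSD is available in statsmodels (a sketch on invented groups; the output table flags which pairs differ):

import numpy as np
from statsmodels.stats.multicomp import pairwise_tukeyhsd

rng = np.random.default_rng(10)
values = np.concatenate([rng.normal(10, 2, 20),
                         rng.normal(12, 2, 20),
                         rng.normal(10, 2, 20)])
groups = np.repeat(["g1", "g2", "g3"], 20)

# All pairwise comparisons with family-wise error control at 5%
print(pairwise_tukeyhsd(values, groups, alpha=0.05))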
Interaction Effects
Interaction effects occur when the effect of one independent variable on the dependent variable differs depending on the level of another independent variable. In a two-way ANOVA, identifying interaction effects is crucial as it influences the interpretation of main effects.
Goodness of Fit and Categorical Data Analysis: Kolmogorov-Smirnov test, contingency tables analysis
Goodness of Fit
Goodness of fit refers to how well a statistical model fits a set of observations. It is often used to determine if a set of observed values significantly differs from expected values. Common methods to assess goodness of fit include the chi-square goodness of fit test and the Kolmogorov-Smirnov test.
Kolmogorov-Smirnov Test
The Kolmogorov-Smirnov test is a non-parametric test that compares the distribution of a sample with a reference probability distribution, or compares the distributions of two samples. Its statistic is the largest absolute difference between the empirical distribution function and the reference (or second empirical) distribution function. The test assumes continuous distributions; for categorical data a chi-square test is used instead.
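Both forms are in SciPy; note that in the one-sample form the reference distribution's parameters must be specified in advance, not estimated from the same sample (a sketch on invented data):

import numpy as np
from scipy import stats

rng = np.random.default_rng(11)
sample = rng.normal(loc=0.0, scale=1.0, size=100)

# One-sample KS test against a fully specified standard normal
print(stats.kstest(sample, "norm", args=(0, 1)))

# Two-sample KS test: were the two samples drawn from one distribution?
other = rng.uniform(-2, 2, size=100)
print(stats.ks_2samp(sample, other))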
Contingency Tables Analysis
Contingency tables are used to analyze the relationship between two categorical variables. They enable the computation of probabilities and the examination of associations between variables. The Chi-square test is often employed to assess independence in contingency tables.
Application in Categorical Data Analysis
Categorical data analysis involves techniques for analyzing data that can be grouped into categories. It relies on methods such as contingency tables and chi-square tests to determine whether significant differences or associations are present in the data.
Nonparametric Tests: Sign test, Wilcoxon signed rank test, runs test, median test, Mann-Whitney-Wilcoxon tests
Nonparametric Tests
Sign Test
A nonparametric test used to evaluate the median of a single population or to compare two related samples. It focuses on the sign of the differences between pairs rather than the actual values.
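SciPy has no dedicated sign-test function, but the test reduces to a binomial test on the signs of the paired differences, as in this sketch (the measurements are invented; binomtest requires SciPy 1.7 or later):

import numpy as np
from scipy import stats

before = np.array([12.1, 13.5, 11.8, 14.2, 12.9, 13.1, 12.4, 13.8])
after  = np.array([12.8, 13.9, 12.0, 14.0, 13.6, 13.5, 12.9, 14.1])

diffs = after - before
n_pos = int(np.sum(diffs > 0))
n = int(np.sum(diffs != 0))          # zero differences are discarded

# Under H0 (median difference = 0), the positive signs are Binomial(n, 0.5)
print(stats.binomtest(n_pos, n, p=0.5).pvalue)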
Wilcoxon Signed Rank Test
A nonparametric test that compares two related samples, matched samples, or repeated measurements to assess whether their population mean ranks differ. It considers both the direction and the magnitude of the differences.
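The same invented before/after data from the sign-test sketch above, now analysed with SciPy's wilcoxon, which also uses the magnitudes of the differences:

import numpy as np
from scipy import stats

before = np.array([12.1, 13.5, 11.8, 14.2, 12.9, 13.1, 12.4, 13.8])
after  = np.array([12.8, 13.9, 12.0, 14.0, 13.6, 13.5, 12.9, 14.1])

# Wilcoxon signed rank test on the paired differences
stat, p_value = stats.wilcoxon(before, after)
print(stat, p_value)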
Runs Test
A nonparametric test that checks for randomness in data sequences. It analyzes the occurrence of runs, or sequences of similar values, to test if the order is consistent with randomness.
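statsmodels carries a runs test only in a sandbox module, so a self-contained sketch of the Wald-Wolfowitz runs test (normal approximation, binary sequences only) is given below:

import numpy as np
from scipy import stats

def runs_test(x):
    # Wald-Wolfowitz runs test for randomness of a binary sequence
    x = np.asarray(x)
    vals = np.unique(x)
    assert vals.size == 2, "binary sequence expected"
    n1 = int(np.sum(x == vals[0]))
    n2 = int(np.sum(x == vals[1]))
    runs = 1 + int(np.sum(x[1:] != x[:-1]))   # a new run starts at each change

    # Mean and variance of the number of runs under randomness
    mu = 2 * n1 * n2 / (n1 + n2) + 1
    var = (2 * n1 * n2 * (2 * n1 * n2 - n1 - n2)
           / ((n1 + n2) ** 2 * (n1 + n2 - 1)))
    z = (runs - mu) / np.sqrt(var)
    return runs, z, 2 * stats.norm.sf(abs(z))  # two-sided p-value

print(runs_test(list("HHTTHTHHHTTH")))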
Median Test
A nonparametric test that compares the medians of two or more groups. It is especially useful when the data distribution is not normal.
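SciPy implements this as Mood's median test (a sketch on invented skewed samples):

import numpy as np
from scipy import stats

rng = np.random.default_rng(12)
g1 = rng.exponential(2.0, size=30)   # skewed hypothetical samples
g2 = rng.exponential(3.0, size=30)

# Mood's median test: do the groups share a common median?
stat, p_value, grand_median, table = stats.median_test(g1, g2)
print(stat, p_value)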
Mann-Whitney-Wilcoxon Test
A nonparametric test for assessing whether two independent samples come from the same distribution. It evaluates the ranks of the combined data to determine differences between groups.
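A sketch with SciPy on invented independent samples:

import numpy as np
from scipy import stats

rng = np.random.default_rng(13)
g1 = rng.normal(10.0, 2.0, size=25)
g2 = rng.normal(11.5, 2.0, size=25)

# Mann-Whitney U test on the two independent samples
u_stat, p_value = stats.mannwhitneyu(g1, g2, alternative="two-sided")
print(u_stat, p_value)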
