Semester 5: Biostatistics and Bioinformatics
History and introduction to Bioinformatics: Applications and data generation from molecular biology, genome sequencing, protein sequencing, NMR spectroscopy, microarray
History of Bioinformatics
Bioinformatics emerged in the 1960s alongside the development of molecular biology. Early efforts focused on genetic data analysis and the comparison of protein sequences. With the completion of the Human Genome Project in 2003, the field expanded significantly, integrating computational biology, statistics, and data analysis.
Introduction to Bioinformatics
Bioinformatics is the application of computer technology to manage biological information. It encompasses the storage, retrieval, and analysis of biological data, primarily focusing on genomic and proteomic information.
Applications of Bioinformatics
Bioinformatics is used in various applications, including gene identification, evolutionary studies, drug discovery, and personalized medicine. Its role is crucial in understanding biological processes and developing new therapies.
Data Generation from Molecular Biology
Molecular biology techniques, such as PCR and cloning, generate vast amounts of data. Sequence data from DNA, RNA, and proteins must be analyzed, stored, and interpreted using bioinformatics tools.
Genome Sequencing
Genome sequencing involves determining the complete nucleotide sequence of an organism's DNA. Next-generation sequencing technologies have revolutionized this field, enabling rapid and cost-effective sequencing.
Protein Sequencing
Protein sequencing determines the amino acid sequence of proteins. Techniques such as Edman degradation and mass spectrometry are used to analyze protein structure and function.
NMR Spectroscopy
Nuclear Magnetic Resonance (NMR) spectroscopy is a technique used to determine the structure of proteins and nucleic acids in solution. It provides insights into molecular dynamics and interactions.
Microarray Technology
Microarrays are used to study gene expression and variations within the genome. This high-throughput technology allows simultaneous analysis of thousands of genes, aiding in comparative genomics and personalized medicine.
Databases, data generation, storage and retrieval: Biological databases including NCBI, DDBJ, EMBL, protein databases, specialized genome and structure databases, file formats, metadata and search techniques
Databases, Data Generation, Storage and Retrieval in Biological Contexts
Introduction to Biological Databases
Biological databases are structured collections of biological data, serving as essential resources for researchers in genomics, proteomics, and other biological fields. Key databases include NCBI, DDBJ, and EMBL.
NCBI, DDBJ, and EMBL
NCBI (National Center for Biotechnology Information) provides access to biomedical and genomic information, including the GenBank nucleotide database. DDBJ (DNA Data Bank of Japan) focuses on DNA sequence data, while EMBL-EBI (the European Molecular Biology Laboratory's European Bioinformatics Institute) offers extensive nucleotide and protein resources such as the European Nucleotide Archive. The three nucleotide databases exchange records as part of the International Nucleotide Sequence Database Collaboration (INSDC).
Protein Databases
Protein databases contain information about protein sequences, structures, functions, and interactions. Examples include UniProt and PDB (Protein Data Bank), which are crucial for protein research.
Specialized Genome and Structure Databases
These databases focus on specific organisms or data types. Examples include Ensembl for genome annotation and RCSB PDB for protein structure data.
File Formats in Biological Databases
Common file formats for biological data include FASTA for sequences, GFF for genome annotations, and PDB for protein structures. Understanding these formats is essential for data manipulation.
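As an illustration of handling one of these formats, the following short Python sketch reads a FASTA file into a dictionary mapping record identifiers to sequences; the file name records.fasta is a hypothetical example.

def read_fasta(path):
    # Parse a FASTA file into {identifier: sequence}.
    records = {}
    name = None
    for line in open(path):
        line = line.strip()
        if line.startswith(">"):
            name = line[1:].split()[0]   # header line: keep the ID before the first space
            records[name] = ""
        elif name is not None:
            records[name] += line        # sequence lines may span several rows
    return records

sequences = read_fasta("records.fasta")
print(len(sequences), "sequences loaded")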
Metadata in Biological Databases
Metadata provides contextual information about the data, including its origin, quality, and structure. Proper metadata is essential for data interoperability and reproducibility.
Search Techniques in Biological Databases
Effective search techniques are vital for retrieving relevant information from biological databases. Techniques include keyword searches, Boolean operators, and metadata-based searches.
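As a small illustration of combining these techniques, a hypothetical Entrez-style query such as BRCA1[Gene Name] AND "Homo sapiens"[Organism] restricts a keyword to the gene-name field and adds an organism filter using the Boolean operator AND.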
Sequence and Phylogeny analysis: Sequences and alignments, dynamic programming, local and global alignment, pairwise alignment (BLAST and FASTA), multiple sequence alignment, phylogenetic analysis, PCR primer designing
Sequence and Phylogeny analysis
Sequences and Alignments
Sequences represent the order of nucleotides or amino acids in a DNA, RNA, or protein molecule. Alignments are used to compare these sequences to identify similarities and differences. They can help determine functional and evolutionary relationships.
Dynamic Programming
Dynamic programming is an algorithmic technique used for solving complex problems by breaking them down into simpler subproblems. In bioinformatics, it is often used in sequence alignment, allowing for efficient computation of optimal alignments.
Local and Global Alignment
Global alignment aims to align every residue in every sequence, while local alignment seeks to identify the most similar regions within sequences. The Needleman-Wunsch algorithm is the classic dynamic programming method for global alignment, and the Smith-Waterman algorithm for local alignment; which one is appropriate depends on the biological question.
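To make the dynamic programming idea concrete, the sketch below fills a Needleman-Wunsch matrix to score a global alignment of two short DNA sequences; the scoring values (match +1, mismatch -1, gap -2) are illustrative assumptions, not fixed standards.

def global_alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    # Needleman-Wunsch: dp[i][j] = best score aligning a[:i] with b[:j]
    n, m = len(a), len(b)
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = i * gap            # prefix of a aligned entirely against gaps
    for j in range(1, m + 1):
        dp[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = dp[i-1][j-1] + (match if a[i-1] == b[j-1] else mismatch)
            dp[i][j] = max(diag, dp[i-1][j] + gap, dp[i][j-1] + gap)
    return dp[n][m]

print(global_alignment_score("GATTACA", "GCATGCA"))

A local (Smith-Waterman) variant would additionally let each cell reset to zero and would report the maximum value found anywhere in the matrix rather than the bottom-right corner.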
Pairwise Alignment (BLAST and FASTA)
BLAST (Basic Local Alignment Search Tool) and FASTA are heuristic programs that rapidly align a query sequence against every sequence in a database, reporting pairwise local alignments ranked by statistical significance. The matches they return help uncover evolutionary relationships and suggest likely functions for uncharacterized sequences.
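A minimal sketch of submitting a remote BLAST search from Python, assuming Biopython is installed and its Bio.Blast.NCBIWWW interface is available; the query sequence here is an arbitrary made-up example.

from Bio.Blast import NCBIWWW, NCBIXML

# Submit a nucleotide query to NCBI's blastn service against the nt database.
query = "AGCTGATCGATCGTACGATCGATCGATCGTACGATCGATCG"
result_handle = NCBIWWW.qblast("blastn", "nt", query)

# Parse the XML result and report the best-scoring hits with their E-values.
record = NCBIXML.read(result_handle)
for alignment in record.alignments[:5]:
    print(alignment.title, alignment.hsps[0].expect)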
Multiple Sequence Alignment
Multiple sequence alignment involves aligning three or more sequences simultaneously. This approach provides a way to identify conserved regions, which may indicate important functional sites.
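As a small sketch of what "conserved regions" means in practice, the following pure-Python snippet scans a toy alignment (made-up sequences of equal length) and reports the columns where every sequence carries the same residue.

aligned = ["ATGCTGCA",      # toy pre-aligned sequences of equal length
           "ATGCTGCA",
           "ATGATGCA"]

conserved = [i for i in range(len(aligned[0]))
             if len({seq[i] for seq in aligned}) == 1]
print("Conserved columns:", conserved)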
Phylogenetic Analysis
Phylogenetic analysis involves the construction of evolutionary trees or phylogenies based on genetic information. Techniques in this area include distance-based methods, maximum likelihood, and Bayesian inference. These analyses help understand the evolutionary relationships among organisms.
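Distance-based tree building starts from a pairwise distance matrix. The sketch below computes a simple p-distance (fraction of differing positions) between aligned toy sequences; a method such as UPGMA or neighbor joining would then group the closest sequences into a tree.

def p_distance(a, b):
    # Proportion of aligned positions at which the two sequences differ.
    diffs = sum(1 for x, y in zip(a, b) if x != y)
    return diffs / len(a)

seqs = {"human": "ATGCTAGC", "mouse": "ATGCTAGT", "fly": "TTGATAGT"}   # made-up data
names = list(seqs)
for i in range(len(names)):
    for j in range(i + 1, len(names)):
        d = p_distance(seqs[names[i]], seqs[names[j]])
        print(names[i], names[j], round(d, 2))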
PCR Primer Designing
PCR (Polymerase Chain Reaction) primer designing is critical for amplifying specific DNA sequences. Designing effective primers requires understanding of the target sequence, melting temperature, and specificity to ensure successful amplification in PCR experiments.
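Two quantities routinely checked when designing primers are GC content and an estimated melting temperature. The sketch below uses the simple Wallace rule (Tm = 2(A+T) + 4(G+C)), a rough approximation intended for short primers of roughly 14-20 bases, not a replacement for dedicated primer-design software; the primer sequence is hypothetical.

def gc_content(primer):
    return (primer.count("G") + primer.count("C")) / len(primer) * 100

def wallace_tm(primer):
    # Wallace rule: 2 degrees per A/T plus 4 degrees per G/C.
    at = primer.count("A") + primer.count("T")
    gc = primer.count("G") + primer.count("C")
    return 2 * at + 4 * gc

primer = "AGCGTACCTGAAGTCT"   # hypothetical 16-mer primer
print("GC% =", round(gc_content(primer), 1), " Tm =", wallace_tm(primer), "C")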
Searching databases: SRS, Entrez, sequence similarity searches, genome annotation tools
Searching Databases
Introduction to Database Searching
Database searching is a key component in bioinformatics, allowing researchers to efficiently find relevant biological data from vast resources. It involves systematic querying of databases to retrieve biological sequences, annotations, and information.
SRS (Sequence Retrieval System)
SRS is a powerful tool for accessing various biological databases. It provides an easy-to-use interface for querying sequence data and allows users to perform complex searches across multiple databases.
Entrez
Entrez is a search and retrieval system developed by NCBI. It integrates a wide range of databases including nucleotide and protein sequences, literature references, and taxonomy data. Users can conduct searches using keywords, limits, and Boolean operators.
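A minimal sketch of querying Entrez programmatically, assuming Biopython is installed; the e-mail address and the search term are placeholders.

from Bio import Entrez

Entrez.email = "your.name@example.org"   # NCBI asks for a contact address

# Search the nucleotide database with a Boolean, field-restricted query.
handle = Entrez.esearch(db="nucleotide",
                        term='BRCA1[Gene Name] AND "Homo sapiens"[Organism]',
                        retmax=5)
record = Entrez.read(handle)
print(record["Count"], "matches; first IDs:", record["IdList"])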
Sequence Similarity Searches
Sequence similarity searches are essential for identifying homologous sequences across different organisms. Tools such as BLAST and FASTA compare query sequences against databases to find matches, providing insights into evolutionary relationships and functional predictions.
Genome Annotation Tools
Genome annotation tools assist in predicting the locations of genes and other features within a genome. Software such as GeneMark, MAKER, and Augustus are used to analyze genomic data, assigning biological meaning to sequences based on functional predictions and comparative genomics.
Integration of Tools and Databases
Combining these tools enhances the ability to analyze biological data comprehensively. For example, using sequence similarity results from BLAST in conjunction with genome annotation tools can help in understanding gene function and regulation in different species.
Types and collection of data: Primary and secondary data, graphical representation, measures of central tendency and dispersion, skewness and kurtosis
Types and collection of data
Primary Data
Primary data refers to information gathered directly from original sources through methods such as surveys, experiments, interviews, and observations. It is characterized by its relevance and reliability for specific research purposes.
Secondary Data
Secondary data is information that has already been collected and published by others. Sources of secondary data include academic journals, books, online databases, and government reports. It is less time-consuming to gather but may not be as specific as primary data.
Graphical Representation
Graphical representation involves using charts, graphs, and plots to visualize data. Common types include bar graphs, histograms, pie charts, and scatter plots. Visual representation helps in understanding trends, patterns, and distributions within the data.
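As a brief illustration, the following Python snippet (assuming matplotlib is installed) draws a histogram of a small made-up sample; bar charts and scatter plots are produced the same way with plt.bar and plt.scatter.

import matplotlib.pyplot as plt

values = [4, 5, 5, 6, 6, 6, 7, 7, 8, 9, 10, 12]   # made-up measurements

plt.hist(values, bins=5, edgecolor="black")   # histogram of the distribution
plt.xlabel("Measured value")
plt.ylabel("Frequency")
plt.title("Histogram of a small sample")
plt.show()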
Measures of Central Tendency
Measures of central tendency describe the center of a dataset. The three main measures are mean, median, and mode. The mean is the average, the median is the middle value when data is sorted, and the mode is the most frequently occurring value.
Measures of Dispersion
Measures of dispersion indicate the spread or variability of a dataset. Key measures include range (difference between highest and lowest values), variance (average of the squared differences from the mean), and standard deviation (square root of variance).
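Python's standard statistics module computes both sets of measures directly; the sample below is made-up.

import statistics as st

data = [2, 4, 4, 4, 5, 5, 7, 9]   # made-up sample

print("mean    =", st.mean(data))
print("median  =", st.median(data))
print("mode    =", st.mode(data))
print("range   =", max(data) - min(data))
print("variance (population) =", st.pvariance(data))
print("std dev (population)  =", st.pstdev(data))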
Skewness
Skewness assesses the asymmetry of the distribution of values in a dataset. Positive skew indicates a long tail on the right side, while negative skew indicates a long tail on the left. Skewness is important for understanding the nature of data distributions.
Kurtosis
Kurtosis measures the tailedness of the probability distribution of a real-valued random variable. High kurtosis indicates heavy tails, meaning more data is in the extremes, while low kurtosis indicates lighter tails and a peak closer to the mean.
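Skewness and kurtosis can be computed with SciPy, assuming it is installed; note that scipy.stats.kurtosis reports excess kurtosis by default, so a normal distribution scores close to 0.

from scipy.stats import skew, kurtosis

data = [2, 4, 4, 4, 5, 5, 7, 9, 15]   # made-up sample with a long right tail

print("skewness        =", round(skew(data), 2))       # positive value => right-skewed
print("excess kurtosis =", round(kurtosis(data), 2))   # heavy or light tails relative to normal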
Probability: Definition and theorems, elementary ideas of binomial, Poisson and normal distributions
Definition of Probability
Probability is the measure of the likelihood that an event will occur. It quantifies uncertainty, allowing predictions about future events based on past data. The probability of an event A is denoted P(A); in the classical definition it is the ratio of the number of favorable outcomes to the total number of possible outcomes, provided all outcomes are equally likely, and it always lies between 0 and 1.
Basic Theorems of Probability
Key theorems include the following.
Addition theorem: P(A or B) = P(A) + P(B) - P(A and B).
Multiplication theorem: P(A and B) = P(A) * P(B|A).
Bayes' theorem: P(A|B) = [P(B|A) * P(A)] / P(B).
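A short worked example of Bayes' theorem with made-up numbers for a diagnostic test: suppose 1% of a population has a disease, the test detects it 95% of the time, and it gives a false positive 5% of the time.

p_disease = 0.01            # P(A): prior probability of disease
p_pos_given_disease = 0.95  # P(B|A): sensitivity of the test
p_pos_given_healthy = 0.05  # false positive rate

# Total probability of a positive test, P(B)
p_pos = p_pos_given_disease * p_disease + p_pos_given_healthy * (1 - p_disease)

# Bayes' theorem: P(A|B) = P(B|A) * P(A) / P(B)
p_disease_given_pos = p_pos_given_disease * p_disease / p_pos
print(round(p_disease_given_pos, 3))   # roughly 0.16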
Binomial Distribution
The binomial distribution models the number of successes in a fixed number of independent Bernoulli trials. It is characterized by two parameters: n (number of trials) and p (probability of success). The probability mass function is given by: P(X = k) = (n choose k) * p^k * (1-p)^(n-k) where k = number of successes.
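The probability mass function implemented directly, e.g. the probability of exactly 3 successes in 10 trials with p = 0.5:

from math import comb

def binomial_pmf(k, n, p):
    # P(X = k) = C(n, k) * p^k * (1 - p)^(n - k)
    return comb(n, k) * p**k * (1 - p)**(n - k)

print(round(binomial_pmf(3, 10, 0.5), 4))   # 0.1172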
Poisson Distribution
The Poisson distribution describes the number of events occurring in a fixed interval of time or space. It is characterized by the rate parameter lambda (λ) which is the average number of occurrences in the interval. The probability function is: P(X = k) = (e^(-λ) * λ^k) / k! where k = number of occurrences.
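The Poisson formula implemented directly, e.g. the probability of observing exactly 2 events when the average rate is λ = 3 per interval:

from math import exp, factorial

def poisson_pmf(k, lam):
    # P(X = k) = e^(-λ) * λ^k / k!
    return exp(-lam) * lam**k / factorial(k)

print(round(poisson_pmf(2, 3), 4))   # about 0.224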
Normal Distribution
The normal distribution is a continuous probability distribution that is symmetric about the mean. It is characterized by its mean (µ) and standard deviation (σ). The probability density function is: f(x) = (1/(σ√(2π))) * e^(-0.5 * ((x-µ)/σ)^2). The total area under the curve equals 1 and is used to model many natural phenomena.
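The normal density evaluated directly from the formula above, e.g. at the mean of a standard normal distribution (µ = 0, σ = 1):

from math import sqrt, pi, exp

def normal_pdf(x, mu=0.0, sigma=1.0):
    # f(x) = 1 / (σ * sqrt(2π)) * e^(-0.5 * ((x - µ) / σ)^2)
    return (1 / (sigma * sqrt(2 * pi))) * exp(-0.5 * ((x - mu) / sigma) ** 2)

print(round(normal_pdf(0), 4))   # about 0.3989, the peak of the standard normal curve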
Sampling: Sampling methods, confidence level, hypothesis testing, large and small sample tests, t-test, chi-square, ANOVA
Sampling in Biostatistics
Sampling Methods
Sampling methods are the techniques used to select individuals or items from a population in order to gather data for analysis. Common methods include random sampling, stratified sampling, systematic sampling, cluster sampling, and convenience sampling.
Confidence Level
The confidence level quantifies the degree of certainty in the results obtained from a sample and is usually expressed as a percentage. Commonly used levels are 90%, 95%, and 99%.
Hypothesis Testing
Hypothesis testing is a statistical method used to make inferences or draw conclusions about population parameters based on sample data. The usual steps are: state the null hypothesis (H0) and the alternative hypothesis (H1); select a significance level (alpha); choose the appropriate statistical test; calculate the test statistic; and make a decision based on the p-value or confidence interval.
Large and Small Sample Tests
Large sample tests apply when the sample size is large (usually n > 30); examples are the Z-test and the large sample t-test. Small sample tests apply when the sample size is small (usually n ≤ 30); examples are the t-test and the small sample proportion test.
t-test
The t-test compares the means of two groups and determines whether they are significantly different from each other. The main variants are the independent t-test and the paired t-test.
Chi-square Test
The chi-square test determines whether there is a significant association between categorical variables. Its common forms are the chi-square goodness-of-fit test and the chi-square test for independence.
ANOVA
ANOVA (Analysis of Variance) compares means among three or more groups to find whether at least one group mean differs from the others. Common designs are one-way ANOVA and two-way ANOVA.
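A compact sketch of running these three tests with SciPy on made-up samples, assuming scipy is installed; interpreting each p-value against the chosen significance level follows the hypothesis-testing steps above.

from scipy import stats

group_a = [5.1, 4.9, 5.6, 5.2, 5.0, 5.4]    # made-up measurements
group_b = [5.8, 6.0, 5.7, 6.1, 5.9, 6.2]
group_c = [6.5, 6.8, 6.4, 6.9, 6.6, 6.7]

# Independent two-sample t-test: do groups a and b have the same mean?
t_stat, p_t = stats.ttest_ind(group_a, group_b)

# Chi-square test for independence on a 2x2 contingency table of counts.
table = [[30, 10],
         [20, 40]]
chi2, p_chi, dof, expected = stats.chi2_contingency(table)

# One-way ANOVA: is at least one of the three group means different?
f_stat, p_anova = stats.f_oneway(group_a, group_b, group_c)

print(round(p_t, 4), round(p_chi, 4), round(p_anova, 4))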
Correlation and Regression: Types, Karl-Pearson and Spearman correlations, regression analysis, differences between correlation and regression
Correlation and Regression
Introduction to Correlation
Correlation is a statistical measure that expresses the extent to which two variables are linearly related. Correlation values range from -1 to +1.
Types of Correlation
Correlation may be positive, negative, or zero. Positive correlation indicates that as one variable increases the other also increases, negative correlation indicates that as one variable increases the other decreases, and zero correlation indicates no linear relationship between the variables.
Karl-Pearson Correlation Coefficient
The Karl Pearson correlation coefficient measures the degree of linear relationship between two continuous variables. It is calculated as r = Σ(x - x̄)(y - ȳ) / √[Σ(x - x̄)² · Σ(y - ȳ)²]; values near +1 or -1 indicate a strong linear relationship, while values near 0 indicate little or no linear relationship.
Spearman Rank Correlation Coefficient
Spearman's rank correlation is a non-parametric measure that assesses how well the relationship between two variables can be described by a monotonic function. It is computed on the ranks of the observations rather than their raw values, which makes it preferable to Pearson's method for ordinal (rank) data or when the relationship is monotonic but not linear.
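Both coefficients can be computed with SciPy on paired samples, assuming it is installed; the data below are made-up.

from scipy import stats

x = [1, 2, 3, 4, 5, 6, 7, 8]                      # made-up paired observations
y = [2.1, 3.9, 6.2, 8.1, 9.8, 12.2, 13.9, 16.1]

pearson_r, p_pearson = stats.pearsonr(x, y)        # linear relationship on raw values
spearman_rho, p_spearman = stats.spearmanr(x, y)   # monotonic relationship on ranks

print(round(pearson_r, 3), round(spearman_rho, 3))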
Regression Analysis
Regression analysis predicts the value of a dependent variable from the value of one or more independent variables. Simple linear regression fits a straight line y = a + bx to a single predictor, while multiple regression extends the model to several predictors.
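A simple linear regression fit with scipy.stats.linregress on the same kind of made-up paired data, giving the slope and intercept of the prediction equation y = a + bx:

from scipy import stats

x = [1, 2, 3, 4, 5, 6, 7, 8]                      # independent variable (made-up)
y = [2.1, 3.9, 6.2, 8.1, 9.8, 12.2, 13.9, 16.1]   # dependent variable

result = stats.linregress(x, y)
print("slope =", round(result.slope, 3),
      "intercept =", round(result.intercept, 3),
      "r^2 =", round(result.rvalue**2, 3))

# Predict y for a new x value using the fitted equation.
x_new = 10
print("predicted y at x = 10:", round(result.intercept + result.slope * x_new, 2))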
Differences between Correlation and Regression
Correlation quantifies the strength and direction of the relationship between two variables, whereas regression provides an equation for predicting one variable from the other. The two therefore differ in purpose, in methodology, and in the types of conclusions that can be drawn: correlation is symmetric and descriptive, while regression distinguishes a dependent from an independent variable and is used for prediction.
