Semester 5: Biostatistics and Bioinformatics
History and introduction to Bioinformatics: Applications and data generation from molecular biology, genome sequencing, protein sequencing, NMR spectroscopy, microarray
History of Bioinformatics
Bioinformatics emerged in the 1960s alongside the development of molecular biology. Early efforts focused on genetic data analysis and the comparison of protein sequences. With the completion of the Human Genome Project in 2003, the field expanded significantly, integrating computational biology, statistics, and data analysis.
Introduction to Bioinformatics
Bioinformatics is the application of computer technology to manage biological information. It encompasses the storage, retrieval, and analysis of biological data, primarily focusing on genomic and proteomic information.
Applications of Bioinformatics
Bioinformatics is used in various applications, including gene identification, evolutionary studies, drug discovery, and personalized medicine. Its role is crucial in understanding biological processes and developing new therapies.
Data Generation from Molecular Biology
Molecular biology techniques, such as PCR and cloning, generate vast amounts of data. Sequence data from DNA, RNA, and proteins must be analyzed, stored, and interpreted using bioinformatics tools.
Genome Sequencing
Genome sequencing involves determining the complete nucleotide sequence of an organism's DNA. Next-generation sequencing technologies have revolutionized this field, enabling rapid and cost-effective sequencing.
Protein Sequencing
Protein sequencing determines the amino acid sequence of proteins. Techniques such as Edman degradation and mass spectrometry are used to analyze protein structure and function.
NMR Spectroscopy
Nuclear Magnetic Resonance (NMR) spectroscopy is a technique used to determine the structure of proteins and nucleic acids in solution. It provides insights into molecular dynamics and interactions.
Microarray Technology
Microarrays are used to study gene expression and variations within the genome. This high-throughput technology allows simultaneous analysis of thousands of genes, aiding in comparative genomics and personalized medicine.
Databases, data generation, storage and retrieval: Biological databases including NCBI, DDBJ, EMBL, protein databases, specialized genome and structure databases, file formats, metadata and search techniques
Databases, Data Generation, Storage and Retrieval in Biological Contexts
Introduction to Biological Databases
Biological databases are structured collections of biological data, serving as essential resources for researchers in genomics, proteomics, and other biological fields. Key databases include NCBI, DDBJ, and EMBL.
NCBI, DDBJ, and EMBL
NCBI (National Center for Biotechnology Information) provides access to biomedical and genomic information, including the GenBank nucleotide database. DDBJ (DNA Data Bank of Japan) focuses on DNA sequence data, while EMBL-EBI (the European Molecular Biology Laboratory's European Bioinformatics Institute) offers extensive nucleotide and protein resources such as the European Nucleotide Archive. The three nucleotide databases exchange records as part of the International Nucleotide Sequence Database Collaboration (INSDC).
Protein Databases
Protein databases contain information about protein sequences, structures, functions, and interactions. Examples include UniProt and PDB (Protein Data Bank), which are crucial for protein research.
Specialized Genome and Structure Databases
These databases focus on specific organisms or data types. Examples include Ensembl for genome annotation and RCSB PDB for protein structure data.
File Formats in Biological Databases
Common file formats for biological data include FASTA for sequences, GFF for genome annotations, and PDB for protein structures. Understanding these formats is essential for data manipulation.
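As an illustration of handling one of these formats, the following short Python sketch reads a FASTA file into a dictionary mapping record identifiers to sequences; the file name records.fasta is a hypothetical example.

def read_fasta(path):
    # Parse a FASTA file into {identifier: sequence}.
    records = {}
    name = None
    for line in open(path):
        line = line.strip()
        if line.startswith(">"):
            name = line[1:].split()[0]   # header line: keep the ID before the first space
            records[name] = ""
        elif name is not None:
            records[name] += line        # sequence lines may span several rows
    return records

sequences = read_fasta("records.fasta")
print(len(sequences), "sequences loaded")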
Metadata in Biological Databases
Metadata provides contextual information about the data, including its origin, quality, and structure. Proper metadata is essential for data interoperability and reproducibility.
Search Techniques in Biological Databases
Effective search techniques are vital for retrieving relevant information from biological databases. Techniques include keyword searches, Boolean operators, and metadata-based searches.
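As a small illustration of combining these techniques, a hypothetical Entrez-style query such as BRCA1[Gene Name] AND "Homo sapiens"[Organism] restricts a keyword to the gene-name field and adds an organism filter using the Boolean operator AND.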
Sequence and Phylogeny analysis: Sequences and alignments, dynamic programming, local and global alignment, pairwise alignment (BLAST and FASTA), multiple sequence alignment, phylogenetic analysis, PCR primer designing
Sequence and Phylogeny analysis
Sequences and Alignments
Sequences represent the order of nucleotides or amino acids in a DNA, RNA, or protein molecule. Alignments are used to compare these sequences to identify similarities and differences. They can help determine functional and evolutionary relationships.
Dynamic Programming
Dynamic programming is an algorithmic technique used for solving complex problems by breaking them down into simpler subproblems. In bioinformatics, it is often used in sequence alignment, allowing for efficient computation of optimal alignments.
Local and Global Alignment
Global alignment aims to align every residue in every sequence, while local alignment seeks to identify the most similar regions within sequences. The Needleman-Wunsch algorithm is the classic dynamic programming method for global alignment, and the Smith-Waterman algorithm for local alignment; which one is appropriate depends on the biological question.
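To make the dynamic programming idea concrete, the sketch below fills a Needleman-Wunsch matrix to score a global alignment of two short DNA sequences; the scoring values (match +1, mismatch -1, gap -2) are illustrative assumptions, not fixed standards.

def global_alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    # Needleman-Wunsch: dp[i][j] = best score aligning a[:i] with b[:j]
    n, m = len(a), len(b)
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = i * gap            # prefix of a aligned entirely against gaps
    for j in range(1, m + 1):
        dp[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = dp[i-1][j-1] + (match if a[i-1] == b[j-1] else mismatch)
            dp[i][j] = max(diag, dp[i-1][j] + gap, dp[i][j-1] + gap)
    return dp[n][m]

print(global_alignment_score("GATTACA", "GCATGCA"))

A local (Smith-Waterman) variant would additionally let each cell reset to zero and would report the maximum value found anywhere in the matrix rather than the bottom-right corner.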
Pairwise Alignment (BLAST and FASTA)
BLAST (Basic Local Alignment Search Tool) and FASTA are heuristic programs that rapidly align a query sequence against every sequence in a database, reporting pairwise local alignments ranked by statistical significance. The matches they return help uncover evolutionary relationships and suggest likely functions for uncharacterized sequences.
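A minimal sketch of submitting a remote BLAST search from Python, assuming Biopython is installed and its Bio.Blast.NCBIWWW interface is available; the query sequence here is an arbitrary made-up example.

from Bio.Blast import NCBIWWW, NCBIXML

# Submit a nucleotide query to NCBI's blastn service against the nt database.
query = "AGCTGATCGATCGTACGATCGATCGATCGTACGATCGATCG"
result_handle = NCBIWWW.qblast("blastn", "nt", query)

# Parse the XML result and report the best-scoring hits with their E-values.
record = NCBIXML.read(result_handle)
for alignment in record.alignments[:5]:
    print(alignment.title, alignment.hsps[0].expect)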
Multiple Sequence Alignment
Multiple sequence alignment involves aligning three or more sequences simultaneously. This approach provides a way to identify conserved regions, which may indicate important functional sites.
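As a small sketch of what "conserved regions" means in practice, the following pure-Python snippet scans a toy alignment (made-up sequences of equal length) and reports the columns where every sequence carries the same residue.

aligned = ["ATGCTGCA",      # toy pre-aligned sequences of equal length
           "ATGCTGCA",
           "ATGATGCA"]

conserved = [i for i in range(len(aligned[0]))
             if len({seq[i] for seq in aligned}) == 1]
print("Conserved columns:", conserved)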
Phylogenetic Analysis
Phylogenetic analysis involves the construction of evolutionary trees or phylogenies based on genetic information. Techniques in this area include distance-based methods, maximum likelihood, and Bayesian inference. These analyses help understand the evolutionary relationships among organisms.
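Distance-based tree building starts from a pairwise distance matrix. The sketch below computes a simple p-distance (fraction of differing positions) between aligned toy sequences; a method such as UPGMA or neighbor joining would then group the closest sequences into a tree.

def p_distance(a, b):
    # Proportion of aligned positions at which the two sequences differ.
    diffs = sum(1 for x, y in zip(a, b) if x != y)
    return diffs / len(a)

seqs = {"human": "ATGCTAGC", "mouse": "ATGCTAGT", "fly": "TTGATAGT"}   # made-up data
names = list(seqs)
for i in range(len(names)):
    for j in range(i + 1, len(names)):
        d = p_distance(seqs[names[i]], seqs[names[j]])
        print(names[i], names[j], round(d, 2))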
PCR Primer Designing
PCR (Polymerase Chain Reaction) primer designing is critical for amplifying specific DNA sequences. Designing effective primers requires understanding of the target sequence, melting temperature, and specificity to ensure successful amplification in PCR experiments.
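Two quantities routinely checked when designing primers are GC content and an estimated melting temperature. The sketch below uses the simple Wallace rule (Tm = 2(A+T) + 4(G+C)), a rough approximation intended for short primers of roughly 14-20 bases, not a replacement for dedicated primer-design software; the primer sequence is hypothetical.

def gc_content(primer):
    return (primer.count("G") + primer.count("C")) / len(primer) * 100

def wallace_tm(primer):
    # Wallace rule: 2 degrees per A/T plus 4 degrees per G/C.
    at = primer.count("A") + primer.count("T")
    gc = primer.count("G") + primer.count("C")
    return 2 * at + 4 * gc

primer = "AGCGTACCTGAAGTCT"   # hypothetical 16-mer primer
print("GC% =", round(gc_content(primer), 1), " Tm =", wallace_tm(primer), "C")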
Searching databases: SRS, Entrez, sequence similarity searches, genome annotation tools
Searching Databases
Introduction to Database Searching
Database searching is a key component in bioinformatics, allowing researchers to efficiently find relevant biological data from vast resources. It involves systematic querying of databases to retrieve biological sequences, annotations, and information.
SRS (Sequence Retrieval System)
SRS is a powerful tool for accessing various biological databases. It provides an easy-to-use interface for querying sequence data and allows users to perform complex searches across multiple databases.
Entrez
Entrez is a search and retrieval system developed by NCBI. It integrates a wide range of databases including nucleotide and protein sequences, literature references, and taxonomy data. Users can conduct searches using keywords, limits, and Boolean operators.
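A minimal sketch of querying Entrez programmatically, assuming Biopython is installed; the e-mail address and the search term are placeholders.

from Bio import Entrez

Entrez.email = "your.name@example.org"   # NCBI asks for a contact address

# Search the nucleotide database with a Boolean, field-restricted query.
handle = Entrez.esearch(db="nucleotide",
                        term='BRCA1[Gene Name] AND "Homo sapiens"[Organism]',
                        retmax=5)
record = Entrez.read(handle)
print(record["Count"], "matches; first IDs:", record["IdList"])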
Sequence Similarity Searches
Sequence similarity searches are essential for identifying homologous sequences across different organisms. Tools such as BLAST and FASTA compare query sequences against databases to find matches, providing insights into evolutionary relationships and functional predictions.
Genome Annotation Tools
Genome annotation tools assist in predicting the locations of genes and other features within a genome. Software such as GeneMark, MAKER, and Augustus are used to analyze genomic data, assigning biological meaning to sequences based on functional predictions and comparative genomics.
Integration of Tools and Databases
Combining these tools enhances the ability to analyze biological data comprehensively. For example, using sequence similarity results from BLAST in conjunction with genome annotation tools can help in understanding gene function and regulation in different species.
Types and collection of data: Primary and secondary data, graphical representation, measures of central tendency and dispersion, skewness and kurtosis
Types and collection of data
Primary Data
Primary data refers to information gathered directly from original sources through methods such as surveys, experiments, interviews, and observations. It is characterized by its relevance and reliability for specific research purposes.
Secondary Data
Secondary data is information that has already been collected and published by others. Sources of secondary data include academic journals, books, online databases, and government reports. It is less time-consuming to gather but may not be as specific as primary data.
Graphical Representation
Graphical representation involves using charts, graphs, and plots to visualize data. Common types include bar graphs, histograms, pie charts, and scatter plots. Visual representation helps in understanding trends, patterns, and distributions within the data.
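As a brief illustration, the following Python snippet (assuming matplotlib is installed) draws a histogram of a small made-up sample; bar charts and scatter plots are produced the same way with plt.bar and plt.scatter.

import matplotlib.pyplot as plt

values = [4, 5, 5, 6, 6, 6, 7, 7, 8, 9, 10, 12]   # made-up measurements

plt.hist(values, bins=5, edgecolor="black")   # histogram of the distribution
plt.xlabel("Measured value")
plt.ylabel("Frequency")
plt.title("Histogram of a small sample")
plt.show()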
Measures of Central Tendency
Measures of central tendency describe the center of a dataset. The three main measures are mean, median, and mode. The mean is the average, the median is the middle value when data is sorted, and the mode is the most frequently occurring value.
Measures of Dispersion
Measures of dispersion indicate the spread or variability of a dataset. Key measures include range (difference between highest and lowest values), variance (average of the squared differences from the mean), and standard deviation (square root of variance).
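Python's standard statistics module computes both sets of measures directly; the sample below is made-up.

import statistics as st

data = [2, 4, 4, 4, 5, 5, 7, 9]   # made-up sample

print("mean    =", st.mean(data))
print("median  =", st.median(data))
print("mode    =", st.mode(data))
print("range   =", max(data) - min(data))
print("variance (population) =", st.pvariance(data))
print("std dev (population)  =", st.pstdev(data))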
Skewness
Skewness assesses the asymmetry of the distribution of values in a dataset. Positive skew indicates a long tail on the right side, while negative skew indicates a long tail on the left. Skewness is important for understanding the nature of data distributions.
Kurtosis
Kurtosis measures the tailedness of the probability distribution of a real-valued random variable. High kurtosis indicates heavy tails, meaning more data is in the extremes, while low kurtosis indicates lighter tails and a peak closer to the mean.
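Skewness and kurtosis can be computed with SciPy, assuming it is installed; note that scipy.stats.kurtosis reports excess kurtosis by default, so a normal distribution scores close to 0.

from scipy.stats import skew, kurtosis

data = [2, 4, 4, 4, 5, 5, 7, 9, 15]   # made-up sample with a long right tail

print("skewness        =", round(skew(data), 2))       # positive value => right-skewed
print("excess kurtosis =", round(kurtosis(data), 2))   # heavy or light tails relative to normal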
Probability: Definition and theorems, elementary ideas of binomial, Poisson and normal distributions
Definition of Probability
Probability is the measure of the likelihood that an event will occur. It quantifies uncertainty, allowing predictions about future events based on past data. The probability of an event A is denoted P(A); in the classical definition it is the ratio of the number of favorable outcomes to the total number of possible outcomes, provided all outcomes are equally likely, and it always lies between 0 and 1.
Basic Theorems of Probability
Key theorems include the following.
Addition theorem: P(A or B) = P(A) + P(B) - P(A and B).
Multiplication theorem: P(A and B) = P(A) * P(B|A).
Bayes' theorem: P(A|B) = [P(B|A) * P(A)] / P(B).
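A short worked example of Bayes' theorem with made-up numbers for a diagnostic test: suppose 1% of a population has a disease, the test detects it 95% of the time, and it gives a false positive 5% of the time.

p_disease = 0.01            # P(A): prior probability of disease
p_pos_given_disease = 0.95  # P(B|A): sensitivity of the test
p_pos_given_healthy = 0.05  # false positive rate

# Total probability of a positive test, P(B)
p_pos = p_pos_given_disease * p_disease + p_pos_given_healthy * (1 - p_disease)

# Bayes' theorem: P(A|B) = P(B|A) * P(A) / P(B)
p_disease_given_pos = p_pos_given_disease * p_disease / p_pos
print(round(p_disease_given_pos, 3))   # roughly 0.16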
Binomial Distribution
The binomial distribution models the number of successes in a fixed number of independent Bernoulli trials. It is characterized by two parameters: n (number of trials) and p (probability of success). The probability mass function is given by: P(X = k) = (n choose k) * p^k * (1-p)^(n-k) where k = number of successes.
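The probability mass function implemented directly, e.g. the probability of exactly 3 successes in 10 trials with p = 0.5:

from math import comb

def binomial_pmf(k, n, p):
    # P(X = k) = C(n, k) * p^k * (1 - p)^(n - k)
    return comb(n, k) * p**k * (1 - p)**(n - k)

print(round(binomial_pmf(3, 10, 0.5), 4))   # 0.1172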
Poisson Distribution
The Poisson distribution describes the number of events occurring in a fixed interval of time or space. It is characterized by the rate parameter lambda (λ) which is the average number of occurrences in the interval. The probability function is: P(X = k) = (e^(-λ) * λ^k) / k! where k = number of occurrences.
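The Poisson formula implemented directly, e.g. the probability of observing exactly 2 events when the average rate is λ = 3 per interval:

from math import exp, factorial

def poisson_pmf(k, lam):
    # P(X = k) = e^(-λ) * λ^k / k!
    return exp(-lam) * lam**k / factorial(k)

print(round(poisson_pmf(2, 3), 4))   # about 0.224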
Normal Distribution
The normal distribution is a continuous probability distribution that is symmetric about the mean. It is characterized by its mean (µ) and standard deviation (σ). The probability density function is: f(x) = (1/(σ√(2π))) * e^(-0.5 * ((x-µ)/σ)^2). The total area under the curve equals 1 and is used to model many natural phenomena.
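The normal density evaluated directly from the formula above, e.g. at the mean of a standard normal distribution (µ = 0, σ = 1):

from math import sqrt, pi, exp

def normal_pdf(x, mu=0.0, sigma=1.0):
    # f(x) = 1 / (σ * sqrt(2π)) * e^(-0.5 * ((x - µ) / σ)^2)
    return (1 / (sigma * sqrt(2 * pi))) * exp(-0.5 * ((x - mu) / sigma) ** 2)

print(round(normal_pdf(0), 4))   # about 0.3989, the peak of the standard normal curve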
Sampling: Sampling methods, confidence level, hypothesis testing, large and small sample tests, t-test, chi-square, ANOVA
Sampling in Biostatistics
Sampling Methods
Sampling methods are the techniques used to select individuals or items from a population in order to gather data for analysis. Common methods include random sampling, stratified sampling, systematic sampling, cluster sampling, and convenience sampling.
Confidence Level
The confidence level quantifies the degree of certainty in the results obtained from a sample and is usually expressed as a percentage. Commonly used levels are 90%, 95%, and 99%.
Hypothesis Testing
Hypothesis testing is a statistical method used to make inferences or draw conclusions about population parameters based on sample data. The usual steps are: state the null hypothesis (H0) and the alternative hypothesis (H1); select a significance level (alpha); choose the appropriate statistical test; calculate the test statistic; and make a decision based on the p-value or confidence interval.
Large and Small Sample Tests
Large sample tests apply when the sample size is large (usually n > 30); examples are the Z-test and the large sample t-test. Small sample tests apply when the sample size is small (usually n ≤ 30); examples are the t-test and the small sample proportion test.
t-test
The t-test compares the means of two groups and determines whether they are significantly different from each other. The main variants are the independent t-test and the paired t-test.
Chi-square Test
The chi-square test determines whether there is a significant association between categorical variables. Its common forms are the chi-square goodness-of-fit test and the chi-square test for independence.
ANOVA
ANOVA (Analysis of Variance) compares means among three or more groups to find whether at least one group mean differs from the others. Common designs are one-way ANOVA and two-way ANOVA.
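A compact sketch of running these three tests with SciPy on made-up samples, assuming scipy is installed; interpreting each p-value against the chosen significance level follows the hypothesis-testing steps above.

from scipy import stats

group_a = [5.1, 4.9, 5.6, 5.2, 5.0, 5.4]    # made-up measurements
group_b = [5.8, 6.0, 5.7, 6.1, 5.9, 6.2]
group_c = [6.5, 6.8, 6.4, 6.9, 6.6, 6.7]

# Independent two-sample t-test: do groups a and b have the same mean?
t_stat, p_t = stats.ttest_ind(group_a, group_b)

# Chi-square test for independence on a 2x2 contingency table of counts.
table = [[30, 10],
         [20, 40]]
chi2, p_chi, dof, expected = stats.chi2_contingency(table)

# One-way ANOVA: is at least one of the three group means different?
f_stat, p_anova = stats.f_oneway(group_a, group_b, group_c)

print(round(p_t, 4), round(p_chi, 4), round(p_anova, 4))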
Correlation and Regression: Types, Karl-Pearson and Spearman correlations, regression analysis, differences between correlation and regression
Correlation and Regression
Introduction to Correlation
Correlation is a statistical measure that expresses the extent to which two variables are linearly related. Correlation values range from -1 to +1.
Types of Correlation
Correlation may be positive, negative, or zero. Positive correlation indicates that as one variable increases the other also increases, negative correlation indicates that as one variable increases the other decreases, and zero correlation indicates no linear relationship between the variables.
Karl-Pearson Correlation Coefficient
The Karl Pearson correlation coefficient measures the degree of linear relationship between two continuous variables. It is calculated as r = Σ(x - x̄)(y - ȳ) / √[Σ(x - x̄)² · Σ(y - ȳ)²]; values near +1 or -1 indicate a strong linear relationship, while values near 0 indicate little or no linear relationship.
Spearman Rank Correlation Coefficient
Spearman's rank correlation is a non-parametric measure that assesses how well the relationship between two variables can be described by a monotonic function. It is computed on the ranks of the observations rather than their raw values, which makes it preferable to Pearson's method for ordinal (rank) data or when the relationship is monotonic but not linear.
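Both coefficients can be computed with SciPy on paired samples, assuming it is installed; the data below are made-up.

from scipy import stats

x = [1, 2, 3, 4, 5, 6, 7, 8]                      # made-up paired observations
y = [2.1, 3.9, 6.2, 8.1, 9.8, 12.2, 13.9, 16.1]

pearson_r, p_pearson = stats.pearsonr(x, y)        # linear relationship on raw values
spearman_rho, p_spearman = stats.spearmanr(x, y)   # monotonic relationship on ranks

print(round(pearson_r, 3), round(spearman_rho, 3))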
Regression Analysis
Regression analysis predicts the value of a dependent variable from the value of one or more independent variables. Simple linear regression fits a straight line y = a + bx to a single predictor, while multiple regression extends the model to several predictors.
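A simple linear regression fit with scipy.stats.linregress on the same kind of made-up paired data, giving the slope and intercept of the prediction equation y = a + bx:

from scipy import stats

x = [1, 2, 3, 4, 5, 6, 7, 8]                      # independent variable (made-up)
y = [2.1, 3.9, 6.2, 8.1, 9.8, 12.2, 13.9, 16.1]   # dependent variable

result = stats.linregress(x, y)
print("slope =", round(result.slope, 3),
      "intercept =", round(result.intercept, 3),
      "r^2 =", round(result.rvalue**2, 3))

# Predict y for a new x value using the fitted equation.
x_new = 10
print("predicted y at x = 10:", round(result.intercept + result.slope * x_new, 2))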
Differences between Correlation and Regression
Correlation quantifies the strength and direction of the relationship between two variables, whereas regression provides an equation for predicting one variable from the other. The two therefore differ in purpose, in methodology, and in the types of conclusions that can be drawn: correlation is symmetric and descriptive, while regression distinguishes a dependent from an independent variable and is used for prediction.
