Semester 5: Biostatistics and Bioinformatics

  • History and introduction to Bioinformatics: Applications and data generation from molecular biology, genome sequencing, protein sequencing, NMR spectroscopy, microarray

    • History of Bioinformatics

      Bioinformatics emerged in the 1960s alongside the development of molecular biology. Early efforts focused on genetic data analysis and the comparison of protein sequences. With the completion of the Human Genome Project in 2003, the field expanded significantly, integrating computational biology, statistics, and data analysis.

    • Introduction to Bioinformatics

      Bioinformatics is the application of computer technology to manage biological information. It encompasses the storage, retrieval, and analysis of biological data, primarily focusing on genomic and proteomic information.

    • Applications of Bioinformatics

      Bioinformatics is used in various applications, including gene identification, evolutionary studies, drug discovery, and personalized medicine. Its role is crucial in understanding biological processes and developing new therapies.

    • Data Generation from Molecular Biology

      Molecular biology techniques, such as PCR, cloning, and sequencing, generate vast amounts of data. Sequence data from DNA, RNA, and proteins must be stored, analyzed, and interpreted using bioinformatics tools.

    • Genome Sequencing

      Genome sequencing involves determining the complete nucleotide sequence of an organism's DNA. Next-generation sequencing technologies have revolutionized this field, enabling rapid and cost-effective sequencing.

    • Protein Sequencing

      Protein sequencing determines the amino acid sequence of a protein. Techniques such as Edman degradation and mass spectrometry are used for this purpose, and the resulting sequence supports the analysis of protein structure and function.

    • NMR Spectroscopy

      Nuclear Magnetic Resonance (NMR) spectroscopy is a technique used to determine the structure of proteins and nucleic acids in solution. It provides insights into molecular dynamics and interactions.

    • Microarray Technology

      Microarrays are used to study gene expression and variations within the genome. This high-throughput technology allows simultaneous analysis of thousands of genes, aiding in comparative genomics and personalized medicine.

  • Databases, data generation, storage and retrieval: Biological databases including NCBI, DDBJ, EMBL, protein databases, specialized genome and structure databases, file formats, metadata and search techniques

    Databases, Data Generation, Storage and Retrieval in Biological Contexts
    • Introduction to Biological Databases

      Biological databases are structured collections of biological data, serving as essential resources for researchers in genomics, proteomics, and other biological fields. Key databases include NCBI, DDBJ, and EMBL.

    • NCBI, DDBJ, and EMBL

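      NCBI (National Center for Biotechnology Information) provides access to biomedical and genomic information, including the GenBank sequence database. DDBJ (DNA Data Bank of Japan) focuses on DNA sequence data, while EMBL (European Molecular Biology Laboratory), through its European Bioinformatics Institute (EMBL-EBI), offers extensive nucleotide and protein sequence data. The three exchange nucleotide sequence data through the International Nucleotide Sequence Database Collaboration (INSDC).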

    • Protein Databases

      Protein databases contain information about protein sequences, structures, functions, and interactions. Examples include UniProt and PDB (Protein Data Bank), which are crucial for protein research.

    • Specialized Genome and Structure Databases

      These databases focus on specific organisms or data types. Examples include Ensembl for genome annotation and RCSB PDB for protein structure data.

    • File Formats in Biological Databases

      Common file formats for biological data include FASTA for sequences, GFF for genome annotations, and PDB for protein structures. Understanding these formats is essential for data manipulation.
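
      A minimal sketch of reading the FASTA format mentioned above, assuming the usual convention that a header line starts with ">" and the sequence follows on one or more lines. The function name parse_fasta and the two records are made up for illustration.

```python
from typing import Dict, Iterable


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Parse FASTA-formatted lines into a {header: sequence} dictionary."""
    records: Dict[str, str] = {}
    header = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            header = line[1:]  # drop the leading ">"
            records[header] = ""
        elif header is not None:
            records[header] += line  # a sequence may span several lines
    return records


# Hypothetical two-record example (invented identifiers and sequences).
example = """>seq1 demo record
ATGCGT
ACGT
>seq2 another demo record
GGCCTA
"""
records = parse_fasta(example.splitlines())
```

      Real files can be large, so a production parser would more likely yield records one at a time instead of building a dictionary in memory.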

    • Metadata in Biological Databases

      Metadata provides contextual information about the data, including its origin, quality, and structure. Proper metadata is essential for data interoperability and reproducibility.

    • Search Techniques in Biological Databases

      Effective search techniques are vital for retrieving relevant information from biological databases. Techniques include keyword searches, Boolean operators, and metadata-based searches.

  • Sequence and Phylogeny analysis: Sequences and alignments, dynamic programming, local and global alignment, pairwise alignment (BLAST and FASTA), multiple sequence alignment, phylogenetic analysis, PCR primer designing

    Sequence and Phylogeny analysis
    • Sequences and Alignments

      Sequences represent the order of nucleotides or amino acids in a DNA, RNA, or protein molecule. Alignments are used to compare these sequences to identify similarities and differences. They can help determine functional and evolutionary relationships.

    • Dynamic Programming

      Dynamic programming is an algorithmic technique used for solving complex problems by breaking them down into simpler subproblems. In bioinformatics, it is often used in sequence alignment, allowing for efficient computation of optimal alignments.
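
      As an illustration, the sketch below fills a dynamic-programming table for global alignment in the style of the Needleman-Wunsch algorithm. The scoring values (match +1, mismatch -1, gap -2) and the function name are assumptions chosen for the example, not values from the syllabus.

```python
def global_alignment_score(a: str, b: str, match: int = 1,
                           mismatch: int = -1, gap: int = -2) -> int:
    """Return the optimal global alignment score of sequences a and b."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]

    # First column and row: aligning a prefix against nothing costs gaps only.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap

    # Each cell reuses the three already-solved neighbouring subproblems.
    for i in range(1, rows):
        for j in range(1, cols):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = score[i - 1][j] + gap      # gap in b
            left = score[i][j - 1] + gap    # gap in a
            score[i][j] = max(diag, up, left)

    return score[-1][-1]


identical = global_alignment_score("GATTACA", "GATTACA")
one_gap = global_alignment_score("ACGT", "ACT")
```

      The table has (len(a) + 1) x (len(b) + 1) cells and each is computed once, which is what makes the approach efficient compared with enumerating every possible alignment.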

    • Local and Global Alignment

      Global alignment aims to align every residue in every sequence, while local alignment seeks to identify the most similar regions within sequences. This differentiation is essential depending on the biological question.
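
      The local case can be sketched with a Smith-Waterman style table, in which no cell score is allowed to fall below zero, so the best-scoring region may start and end anywhere in either sequence. The scoring values (match +2, mismatch -1, gap -2), the function name, and the test sequences are assumptions for illustration.

```python
def local_alignment_score(a: str, b: str, match: int = 2,
                          mismatch: int = -1, gap: int = -2) -> int:
    """Return the best local alignment score between sequences a and b."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]  # borders stay at zero
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = score[i - 1][j] + gap
            left = score[i][j - 1] + gap
            # The extra 0 lets an alignment restart instead of going negative.
            score[i][j] = max(0, diag, up, left)
            best = max(best, score[i][j])
    return best


# Only the shared "ACG" region contributes to the score here.
shared_region = local_alignment_score("TTTACGTTT", "GGACGGG")
```

      A global alignment of the same two sequences would be penalized for the unrelated flanking residues, which is why local alignment is usually preferred when only a domain or motif is expected to be shared.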

    • Pairwise Alignment (BLAST and FASTA)

      BLAST (Basic Local Alignment Search Tool) and FASTA are heuristic programs that compare a query sequence pairwise against the sequences in a database. They facilitate the identification of similarities between sequences, which can help uncover evolutionary relationships and functional insights.

    • Multiple Sequence Alignment

      Multiple sequence alignment involves aligning three or more sequences simultaneously. This approach provides a way to identify conserved regions, which may indicate important functional sites.

    • Phylogenetic Analysis

      Phylogenetic analysis involves the construction of evolutionary trees or phylogenies based on genetic information. Techniques in this area include distance-based methods, maximum likelihood, and Bayesian inference. These analyses help understand the evolutionary relationships among organisms.

    • PCR Primer Designing

      PCR (Polymerase Chain Reaction) primer designing is critical for amplifying specific DNA sequences. Designing effective primers requires understanding of the target sequence, melting temperature, and specificity to ensure successful amplification in PCR experiments.
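
      Two of the quantities mentioned above can be estimated with a few lines of code. The sketch below computes GC content and a rough melting temperature using the Wallace rule, Tm = 2(A+T) + 4(G+C) in degrees Celsius, which is only a rule of thumb for short oligonucleotides. The primer sequence and function names are invented for the example.

```python
def gc_content(primer: str) -> float:
    """Fraction of bases in the primer that are G or C."""
    primer = primer.upper()
    return (primer.count("G") + primer.count("C")) / len(primer)


def wallace_tm(primer: str) -> int:
    """Approximate melting temperature (deg C) by the Wallace rule."""
    primer = primer.upper()
    at = primer.count("A") + primer.count("T")
    gc = primer.count("G") + primer.count("C")
    return 2 * at + 4 * gc


# Hypothetical 20-mer with five of each base.
primer = "ATGCATGCATGCATGCATGC"
gc = gc_content(primer)   # 10 of 20 bases are G or C
tm = wallace_tm(primer)   # 2 * 10 + 4 * 10
```

      Dedicated primer design software uses more detailed thermodynamic models and also checks specificity against the template, which this sketch does not attempt.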

  • Searching databases: SRS, Entrez, sequence similarity searches, genome annotation tools

    Searching Databases
    • Introduction to Database Searching

      Database searching is a key component in bioinformatics, allowing researchers to efficiently find relevant biological data from vast resources. It involves systematic querying of databases to retrieve biological sequences, annotations, and information.

    • SRS (Sequence Retrieval System)

      SRS is a powerful tool for accessing various biological databases. It provides an easy-to-use interface for querying sequence data and allows users to perform complex searches across multiple databases.

    • Entrez

      Entrez is a search and retrieval system developed by NCBI. It integrates a wide range of databases including nucleotide and protein sequences, literature references, and taxonomy data. Users can conduct searches using keywords, limits, and Boolean operators.
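
      Entrez can also be queried programmatically through the NCBI E-utilities. The sketch below only builds an ESearch request URL and does not contact the server. The query term, the choice of database, and the helper name build_esearch_url are illustrative assumptions.

```python
from urllib.parse import urlencode

# Base address of the E-utilities ESearch service.
EUTILS_ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_esearch_url(db: str, term: str, retmax: int = 20) -> str:
    """Build an ESearch URL for a database, a query term, and a result limit."""
    params = {"db": db, "term": term, "retmax": retmax}
    return EUTILS_ESEARCH + "?" + urlencode(params)


# Boolean operator (AND) combined with field tags in square brackets.
url = build_esearch_url(
    "nucleotide",
    "BRCA1[Gene] AND Homo sapiens[Organism]",
    retmax=5,
)
```

      The same term typed into the Entrez web search box should behave equivalently; urlencode simply escapes the spaces and brackets so the query is safe to place in a URL.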

    • Sequence Similarity Searches

      Sequence similarity searches are essential for identifying homologous sequences across different organisms. Tools such as BLAST and FASTA compare query sequences against databases to find matches, providing insights into evolutionary relationships and functional predictions.

    • Genome Annotation Tools

      Genome annotation tools assist in predicting the locations of genes and other features within a genome. Programs such as GeneMark, MAKER, and Augustus are used to analyze genomic data, assigning biological meaning to sequences based on functional predictions and comparative genomics.

    • Integration of Tools and Databases

      Combining these tools enhances the ability to analyze biological data comprehensively. For example, using sequence similarity results from BLAST in conjunction with genome annotation tools can help in understanding gene function and regulation in different species.

  • Types and collection of data: Primary and secondary data, graphical representation, measures of central tendency and dispersion, skewness and kurtosis

    Types and collection of data
    • Primary Data

      Primary data refers to information gathered directly from original sources through methods such as surveys, experiments, interviews, and observations. It is characterized by its relevance and reliability for specific research purposes.

    • Secondary Data

      Secondary data is information that has already been collected and published by others. Sources of secondary data include academic journals, books, online databases, and government reports. It is less time-consuming to gather but may not be as specific as primary data.

    • Graphical Representation

      Graphical representation involves using charts, graphs, and plots to visualize data. Common types include bar graphs, histograms, pie charts, and scatter plots. Visual representation helps in understanding trends, patterns, and distributions within the data.

    • Measures of Central Tendency

      Measures of central tendency describe the center of a dataset. The three main measures are mean, median, and mode. The mean is the average, the median is the middle value when data is sorted, and the mode is the most frequently occurring value.
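
      A short illustration of the three measures using Python's standard statistics module. The data values are invented for the example.

```python
import statistics

# Hypothetical observations.
data = [2, 3, 3, 5, 7, 10]

mean = statistics.mean(data)      # sum 30 divided by 6 values
median = statistics.median(data)  # average of the two middle values, 3 and 5
mode = statistics.mode(data)      # 3 occurs twice, more than any other value
```

      Here the mean (5) exceeds the median (4) because the single large value 10 pulls the average upward, while the median is unaffected by it.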

    • Measures of Dispersion

      Measures of dispersion indicate the spread or variability of a dataset. Key measures include range (difference between highest and lowest values), variance (average of the squared differences from the mean), and standard deviation (square root of variance).
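
      These measures can be computed as in the sketch below, again with invented data. It uses the population formulas (dividing by n), matching the definition of variance given above; the sample variance, which divides by n - 1, is shown for comparison.

```python
import statistics

# Hypothetical observations with mean 5.
data = [2, 4, 4, 4, 5, 5, 7, 9]

data_range = max(data) - min(data)                  # highest minus lowest value
population_variance = statistics.pvariance(data)    # mean squared deviation
population_sd = statistics.pstdev(data)             # square root of the variance
sample_variance = statistics.variance(data)         # divides by n - 1 instead of n
```

      The squared deviations from the mean sum to 32, so the population variance is 32 / 8 = 4 and the standard deviation is 2, whereas the sample variance is 32 / 7.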

    • Skewness

      Skewness assesses the asymmetry of the distribution of values in a dataset. Positive skew indicates a long tail on the right side, while negative skew indicates a long tail on the left. Skewness is important for understanding the nature of data distributions.

    • Kurtosis

      Kurtosis measures the tailedness of a probability distribution. High kurtosis indicates heavy tails, meaning that extreme values are more likely, while low kurtosis indicates light tails and fewer extreme values. The normal distribution has a kurtosis of 3 and serves as the usual reference.
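
      Skewness and kurtosis can both be written in terms of central moments. The sketch below assumes the simple moment-based (population) definitions, skewness = m3 / m2^1.5 and kurtosis = m4 / m2^2, where mk is the k-th central moment; other bias-corrected formulas exist. The data sets are invented.

```python
def central_moment(data, k):
    """k-th central moment: the mean of (x - mean)^k."""
    mean = sum(data) / len(data)
    return sum((x - mean) ** k for x in data) / len(data)


def skewness(data):
    """Moment-based skewness; 0 for a perfectly symmetric data set."""
    return central_moment(data, 3) / central_moment(data, 2) ** 1.5


def kurtosis(data):
    """Moment-based kurtosis; a normal distribution has the value 3."""
    return central_moment(data, 4) / central_moment(data, 2) ** 2


symmetric = [1, 2, 3, 4, 5]       # hypothetical symmetric data
right_skewed = [1, 1, 1, 2, 10]   # hypothetical data with a long right tail

skew_symmetric = skewness(symmetric)
skew_right = skewness(right_skewed)
kurt_symmetric = kurtosis(symmetric)
```

      For the symmetric data the positive and negative cubed deviations cancel, giving zero skewness, while the single large value in the second data set produces a positive skew.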

  • Probability: Definition and theorems, elementary ideas of binomial, Poisson and normal distributions

    • Definition of Probability

      Probability is a measure of the likelihood that an event will occur. It quantifies uncertainty on a scale from 0 to 1, allowing predictions about future events. The probability of an event A is denoted P(A); when all outcomes are equally likely, it is defined as the ratio of the number of favorable outcomes to the total number of possible outcomes.

    • Basic Theorems of Probability

      Key theorems include:
      1. Addition Theorem: P(A or B) = P(A) + P(B) - P(A and B)
      2. Multiplication Theorem: P(A and B) = P(A) * P(B|A)
      3. Bayes' Theorem: P(A|B) = [P(B|A) * P(A)] / P(B)
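
      A worked example of Bayes' Theorem, sketched for a hypothetical diagnostic test. All numbers (1% prevalence, 95% sensitivity, 5% false positive rate) are invented. P(B) is expanded with the law of total probability, P(B) = P(B|A) * P(A) + P(B|not A) * P(not A).

```python
def bayes_posterior(prior: float, likelihood: float,
                    false_positive_rate: float) -> float:
    """Return P(A|B) given P(A), P(B|A), and P(B|not A)."""
    # Total probability of observing B (a positive test result).
    evidence = likelihood * prior + false_positive_rate * (1 - prior)
    return likelihood * prior / evidence


# A = "has the condition", B = "tests positive" (hypothetical values).
posterior = bayes_posterior(prior=0.01, likelihood=0.95,
                            false_positive_rate=0.05)
```

      With these assumed values the posterior is 0.0095 / 0.059, roughly 0.16, so most positive results would still come from people without the condition because the condition is rare.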

    • Binomial Distribution

      The binomial distribution models the number of successes in a fixed number of independent Bernoulli trials. It is characterized by two parameters: n (number of trials) and p (probability of success). The probability mass function is given by: P(X = k) = (n choose k) * p^k * (1-p)^(n-k) where k = number of successes.
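
      The probability mass function above translates directly into code. This sketch uses math.comb for the "n choose k" term; the function name and the example values are chosen for illustration.

```python
from math import comb


def binomial_pmf(k: int, n: int, p: float) -> float:
    """P(X = k) for a binomial distribution with n trials and success probability p."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)


# Probability of exactly 2 successes in 4 trials with p = 0.5: 6 / 16.
two_of_four = binomial_pmf(2, 4, 0.5)

# The probabilities over all possible k should add up to 1.
total = sum(binomial_pmf(k, 4, 0.5) for k in range(5))
```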

    • Poisson Distribution

      The Poisson distribution describes the number of events occurring in a fixed interval of time or space. It is characterized by the rate parameter lambda (λ) which is the average number of occurrences in the interval. The probability function is: P(X = k) = (e^(-λ) * λ^k) / k! where k = number of occurrences.
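
      A direct transcription of the Poisson probability function, offered as a sketch. The rate of 2 events per interval is an invented example value.

```python
from math import exp, factorial


def poisson_pmf(k: int, lam: float) -> float:
    """P(X = k) for a Poisson distribution with rate parameter lam."""
    return exp(-lam) * lam ** k / factorial(k)


# Hypothetical rate: on average 2 events per interval.
p_zero = poisson_pmf(0, 2.0)   # reduces to e^(-2)
p_two = poisson_pmf(2, 2.0)    # e^(-2) * 4 / 2

# Summing enough terms should come very close to 1.
total = sum(poisson_pmf(k, 2.0) for k in range(50))
```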

    • Normal Distribution

      The normal distribution is a continuous probability distribution that is symmetric about the mean. It is characterized by its mean (µ) and standard deviation (σ). The probability density function is: f(x) = (1/(σ√(2π))) * e^(-0.5 * ((x-µ)/σ)^2). The total area under the curve equals 1. The normal distribution is used to model many natural phenomena.
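
      The density formula can be checked against Python's statistics.NormalDist, as in the sketch below. The function name normal_pdf is an illustrative choice.

```python
from math import exp, pi, sqrt
from statistics import NormalDist


def normal_pdf(x: float, mu: float, sigma: float) -> float:
    """Density of a normal distribution with mean mu and standard deviation sigma."""
    return (1 / (sigma * sqrt(2 * pi))) * exp(-0.5 * ((x - mu) / sigma) ** 2)


# Height of the standard normal curve at its mean.
peak = normal_pdf(0.0, 0.0, 1.0)

# Same value from the standard library, for comparison.
library_peak = NormalDist(mu=0.0, sigma=1.0).pdf(0.0)

# Area within one standard deviation of the mean.
standard = NormalDist(mu=0.0, sigma=1.0)
within_one_sigma = standard.cdf(1.0) - standard.cdf(-1.0)
```

      About 68% of the area lies within one standard deviation of the mean, which is the first part of the familiar 68-95-99.7 rule.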

  • Sampling: Sampling methods, confidence level, hypothesis testing, large and small sample tests, t-test, chi-square, ANOVA

    Sampling in Biostatistics
    • Sampling Methods

      Sampling methods refer to the techniques used to select individuals or items from a population to gather data for analysis.
      • Random sampling

      • Stratified sampling

      • Systematic sampling

      • Cluster sampling

      • Convenience sampling
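
      Three of the methods listed above can be sketched with the standard random module. The population of 100 numbered units, the sample size of 10, the two strata, and the fixed seed are all assumptions made so that the example is reproducible.

```python
import random

rng = random.Random(42)             # fixed seed for a repeatable illustration
population = list(range(1, 101))    # hypothetical units numbered 1 to 100
sample_size = 10

# Simple random sampling: every unit has the same chance of selection.
simple = rng.sample(population, sample_size)

# Systematic sampling: random start, then every k-th unit.
k = len(population) // sample_size
start = rng.randrange(k)
systematic = population[start::k]

# Stratified sampling: draw separately from each stratum.
strata = {"low": population[:50], "high": population[50:]}
stratified = [unit for group in strata.values() for unit in rng.sample(group, 5)]
```

      Stratified sampling guarantees that each stratum is represented (five units from each half here), which simple random sampling does not.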

    • Confidence Level

      The confidence level is the long-run proportion of confidence intervals, computed from repeated samples, that would contain the true population parameter. It is usually expressed as a percentage.
      • 90% confidence level

      • 95% confidence level

      • 99% confidence level
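
      The effect of the confidence level can be seen by computing an interval for a mean. This sketch assumes the normal (z) approximation, which is reasonable for large samples; for small samples a t critical value would be used instead. The measurements and the function name are invented.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev


def mean_confidence_interval(data, level=0.95):
    """Normal-approximation confidence interval for the population mean."""
    z = NormalDist().inv_cdf(0.5 + level / 2)    # about 1.96 for 95%
    margin = z * stdev(data) / sqrt(len(data))   # z times the standard error
    centre = mean(data)
    return centre - margin, centre + margin


# Hypothetical measurements.
sample = [4.8, 5.1, 5.0, 4.9, 5.2, 5.0, 5.1, 4.9]

low, high = mean_confidence_interval(sample, level=0.95)
low90, high90 = mean_confidence_interval(sample, level=0.90)
low99, high99 = mean_confidence_interval(sample, level=0.99)
```

      A higher confidence level gives a wider interval: the 99% interval contains the 95% interval, which in turn contains the 90% interval.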

    • Hypothesis Testing

      Hypothesis testing is a statistical method used to make inferences or draw conclusions about population parameters based on sample data.
      • State the null hypothesis (H0) and alternative hypothesis (H1)

      • Select a significance level (alpha)

      • Choose the appropriate statistical test

      • Calculate the test statistic

      • Make a decision based on the p-value or confidence interval
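
      The steps above can be sketched for a two-sided one-sample z-test, assuming the population standard deviation is known. The numbers (hypothesized mean 50, sample mean 52, sigma 6, n = 36) and the function name are invented for illustration.

```python
from math import sqrt
from statistics import NormalDist


def one_sample_z_test(sample_mean, mu0, sigma, n, alpha=0.05):
    """Two-sided z-test of H0: mean = mu0 against H1: mean != mu0."""
    # Test statistic: distance from mu0 in units of the standard error.
    z = (sample_mean - mu0) / (sigma / sqrt(n))
    # Two-sided p-value from the standard normal distribution.
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    # Decision: reject H0 when the p-value is below the significance level.
    return z, p_value, p_value < alpha


z, p_value, reject_h0 = one_sample_z_test(sample_mean=52, mu0=50,
                                          sigma=6, n=36, alpha=0.05)
```

      Here the standard error is 6 / 6 = 1, so z = 2.0 and the p-value is about 0.046, which is just below alpha = 0.05, so H0 would be rejected at that level.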

    • Large and Small Sample Tests

      Tests that are applicable when the sample size is large (usually n > 30).
      • Z-test

      • Large sample t-test

      Tests that are applicable when the sample size is small (usually n ≤ 30).
      • t-test

      • Small sample proportion test

    • t-test

      The t-test is a statistical test used to compare the means of two groups and determine if they are significantly different from each other.
      • Independent t-test

      • Paired t-test
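
      The two variants differ in how the test statistic is formed. The sketch below computes only the t statistic and degrees of freedom, assuming equal variances for the independent case (pooled variance). Looking up the p-value requires a t table or a statistics package. All data are invented.

```python
from math import sqrt
from statistics import mean, stdev, variance


def independent_t(a, b):
    """Pooled-variance t statistic and degrees of freedom for two independent groups."""
    na, nb = len(a), len(b)
    df = na + nb - 2
    pooled = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / df
    t = (mean(a) - mean(b)) / sqrt(pooled * (1 / na + 1 / nb))
    return t, df


def paired_t(before, after):
    """t statistic and degrees of freedom for paired observations."""
    diffs = [y - x for x, y in zip(before, after)]
    n = len(diffs)
    t = mean(diffs) / (stdev(diffs) / sqrt(n))
    return t, n - 1


# Hypothetical independent groups.
t_ind, df_ind = independent_t([5, 6, 7, 8, 9], [1, 2, 3, 4, 5])

# Hypothetical measurements on the same four subjects before and after treatment.
t_pair, df_pair = paired_t([10, 12, 9, 11], [12, 13, 12, 12])
```

      For the independent groups t = 4.0 with 8 degrees of freedom, which exceeds the two-tailed 5% critical value of 2.306, so the group means would be judged significantly different.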

    • Chi-square Test

      The chi-square test is a statistical test used to determine if there is a significant association between categorical variables.
      • Chi-square goodness of fit

      • Chi-square test for independence
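
      For the test of independence, the statistic is the sum of (observed - expected)^2 / expected over all cells, where each expected count is (row total * column total) / grand total. A sketch with an invented 2 x 2 contingency table:

```python
def chi_square_independence(table):
    """Chi-square statistic and degrees of freedom for a contingency table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)

    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            statistic += (observed - expected) ** 2 / expected

    df = (len(table) - 1) * (len(table[0]) - 1)
    return statistic, df


# Hypothetical counts: rows are two groups, columns are two outcomes.
observed = [[20, 30],
            [30, 20]]
chi2, df = chi_square_independence(observed)
```

      Every expected count is 25 here, so the statistic is 4 * (5^2 / 25) = 4.0 with 1 degree of freedom. This exceeds the 5% critical value of 3.841, suggesting an association between the two variables.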

    • ANOVA

      ANOVA, or Analysis of Variance, is a statistical method used to compare means among three or more groups to find if at least one group mean is different from the others.
      • One-way ANOVA

      • Two-way ANOVA
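
      A sketch of the one-way case: the F statistic is the ratio of the between-group mean square to the within-group mean square. The three groups below are invented, and the function returns only F and its degrees of freedom, not a p-value.

```python
def one_way_anova_f(groups):
    """F statistic and (between, within) degrees of freedom for one-way ANOVA."""
    all_values = [x for group in groups for x in group]
    grand_mean = sum(all_values) / len(all_values)
    k, n = len(groups), len(all_values)

    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups)
    # Variation of the observations around their own group mean.
    ss_within = sum((x - sum(g) / len(g)) ** 2 for g in groups for x in g)

    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (n - k)
    return ms_between / ms_within, k - 1, n - k


# Hypothetical measurements from three groups.
f_stat, df_between, df_within = one_way_anova_f([[1, 2, 3], [2, 3, 4], [5, 6, 7]])
```

      For these values the between-group sum of squares is 26 and the within-group sum of squares is 6, giving F = 13 / 1 = 13 on (2, 6) degrees of freedom. The result would then be compared with the critical value from an F table.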

  • Correlation and Regression: Types, Karl-Pearson and Spearman correlations, regression analysis, differences between correlation and regression

    Correlation and Regression
    • Introduction to Correlation

      Correlation is a statistical measure of the extent to which two variables are linearly related. The correlation coefficient ranges from -1 to +1.

    • Types of Correlation

      Correlation may be positive, negative, or zero. In a positive correlation, one variable tends to increase as the other increases; in a negative correlation, one variable tends to decrease as the other increases; a zero correlation indicates no linear relationship.

    • Karl-Pearson Correlation Coefficient

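      The Karl-Pearson correlation coefficient (r) measures the degree of linear relationship between two continuous variables. It is calculated as r = Σ(x - x̄)(y - ȳ) / √[Σ(x - x̄)^2 * Σ(y - ȳ)^2]. Values near +1 or -1 indicate a strong positive or negative linear relationship, and values near 0 indicate little or no linear relationship.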
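
      A sketch of the calculation, using the standard definition of r as the sum of products of deviations divided by the square root of the product of the two sums of squared deviations. The data and the function name are invented.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Hypothetical data: y is exactly twice x, then exactly decreasing in x.
r_positive = pearson_r([1, 2, 3, 4], [2, 4, 6, 8])
r_negative = pearson_r([1, 2, 3, 4], [8, 6, 4, 2])
```

      A perfectly linear increasing relationship gives r = +1 and a perfectly linear decreasing one gives r = -1; real data usually fall somewhere in between.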

    • Spearman Rank Correlation Coefficient

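      Spearman's rank correlation coefficient is a non-parametric measure that assesses how well the relationship between two variables can be described by a monotonic function. It is calculated from the ranks of the observations rather than their raw values, which makes it suitable for ordinal (ranked) data and less sensitive to outliers than Pearson's coefficient.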
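
      One way to compute Spearman's coefficient is to replace each value by its rank and then apply the Pearson formula to the ranks. The sketch below does this, giving tied values the average of their ranks. The function names and data are invented.

```python
from math import sqrt


def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman_rho(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    rx, ry = ranks(x), ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sxx = sum((a - mx) ** 2 for a in rx)
    syy = sum((b - my) ** 2 for b in ry)
    return sxy / sqrt(sxx * syy)


# Hypothetical data: y = x squared is monotonic but not linear.
rho = spearman_rho([1, 2, 3, 4, 5], [1, 4, 9, 16, 25])
tied = ranks([10, 20, 20, 30])
```

      Because the relationship is perfectly monotonic, Spearman's coefficient is 1 even though the points do not lie on a straight line, so Pearson's coefficient for the same data would be slightly below 1.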

    • Regression Analysis

      Regression analysis predicts the value of a dependent variable from the value of one or more independent variables. Simple linear regression uses a single independent variable, while multiple regression uses two or more.
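
      For a single independent variable, the least-squares line y = intercept + slope * x can be fitted as in the sketch below. The data points are invented and lie exactly on a line so the result is easy to verify by hand.

```python
def linear_regression(x, y):
    """Least-squares slope and intercept for simple linear regression."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x   # the line passes through the two means
    return slope, intercept


# Hypothetical data generated from y = 1 + 2x.
slope, intercept = linear_regression([1, 2, 3, 4], [3, 5, 7, 9])

# Predict the dependent variable for a new value of x.
predicted = intercept + slope * 5
```

      The fitted line recovers slope 2 and intercept 1, and predicts 11 at x = 5. With noisy data the fit would not be exact, and the residuals would indicate how well the line describes the relationship.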

    • Differences between Correlation and Regression

      Correlation quantifies the strength and direction of the relationship between two variables, whereas regression provides an equation for predicting one variable from the other. Correlation treats the two variables symmetrically, while regression distinguishes a dependent variable from one or more independent variables, so the two methods differ in purpose, methodology, and the kinds of conclusions they support.

Biostatistics and Bioinformatics

B100501T

Biotechnology

V

Mahatma Gandhi Kashi Vidyapith

GKPAD.COM by SK Yadav | Disclaimer