2. Factors that led to the
development
• The past decade has seen an explosive growth in:
1. Genomics
2. Proteomics
3. Functional genomics
4. Biomedical research
• Identification and comparative analysis of genomes of humans
and other species for investigation of genetic networks.
• Development of new pharmaceuticals and advances in cancer
therapies.
3. • DNA sequences form the foundation of genetic codes of all
living organisms.
• DNA sequences are composed of four basic building blocks
called nucleotides:
1. adenine (A)
2. cytosine (C)
3. guanine (G)
4. thymine (T)
• These four nucleotides (or bases) are combined to form long
chains that resemble a twisted ladder.
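The "twisted ladder" can be illustrated with a short sketch. It assumes the standard base-pairing rule (A with T, C with G), which the slide does not state explicitly, and the fragment used is made up for illustration.

```python
# Standard Watson-Crick base pairing: each "rung" of the ladder is one pair.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def reverse_complement(seq: str) -> str:
    """Return the opposite strand, read in its own direction."""
    return "".join(COMPLEMENT[base] for base in reversed(seq.upper()))


def base_counts(seq: str) -> dict:
    """Count how often each of the four nucleotides occurs."""
    upper = seq.upper()
    return {base: upper.count(base) for base in "ACGT"}


fragment = "ATGCGT"  # hypothetical fragment, not taken from a real genome
print(reverse_complement(fragment))
print(base_counts(fragment))
```

The helper names `reverse_complement` and `base_counts` are illustrative only.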
5. • DNA sequence … CTA CAC ACG TGT AAC …
• A gene usually comprises hundreds of individual nucleotides
arranged in a particular order.
• A genome is the complete set of genes of an organism.
• Genomics is the analysis of genome sequences.
• A proteome is the complete set of protein molecules present
in a cell, tissue, or organism.
• Proteomics is the study of proteome sequences.
6. Data mining may contribute to
biological data analysis in
the following aspects.
7. Biological data mining has
become an essential part of
a new research field called
bioinformatics.
8. 1) Semantic integration of
heterogeneous, distributed genomic and
proteomic databases.
• Genomic and proteomic data sets are often generated at
different labs and by different methods.
• They are distributed, heterogeneous, and of a wide variety.
• Integration of such data is essential to cross-site analysis of
biological data.
• Such integration and linkage analysis would facilitate the
systematic and coordinated analysis of genome and biological
data.
9. • This has promoted the development of integrated data
warehouses to store and manage derived biological data.
• Data cleaning, data integration, reference
reconciliation, classification, and clustering methods will
facilitate the integration of biological data and the
construction of data warehouses for biological data analysis.
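A minimal sketch of the cleaning and integration step, assuming two hypothetical labs that spell the same gene identifiers differently and sometimes report missing values. The record layout, the `normalize_id` rule, and the expression values are all assumptions made for illustration.

```python
def normalize_id(raw_id: str) -> str:
    """Map lab-specific spellings of a gene ID to one canonical form."""
    return raw_id.strip().upper().replace("_", "").replace("-", "")


def integrate(*sources):
    """Merge per-lab record lists into one table keyed by canonical gene ID."""
    merged = {}
    for lab_name, records in sources:
        for rec in records:
            key = normalize_id(rec["gene"])  # reference reconciliation
            entry = merged.setdefault(
                key, {"gene": key, "sources": [], "expression": []}
            )
            entry["sources"].append(lab_name)
            if rec.get("expression") is not None:  # data cleaning: skip missing values
                entry["expression"].append(rec["expression"])
    return merged


# Hypothetical records; the numbers are invented.
lab_a = [{"gene": "brca-1", "expression": 2.5}, {"gene": "TP53", "expression": None}]
lab_b = [{"gene": "BRCA1 ", "expression": 3.5}, {"gene": "tp_53", "expression": 1.0}]

warehouse = integrate(("lab_a", lab_a), ("lab_b", lab_b))
for gene, entry in sorted(warehouse.items()):
    print(gene, entry["sources"], entry["expression"])
```

A real warehouse would reconcile identifiers against a curated reference rather than a string rule; the string rule only keeps the sketch short.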
10. 2) Alignment, indexing, similarity search, and
comparative analysis of multiple nucleotide/protein
sequences.
• BLAST and FASTA, in particular, are tools for the systematic
analysis of genomic and proteomic data.
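• Biological sequence analysis methods differ from many
sequential pattern analysis algorithms proposed in data
mining: they must allow for gaps and mismatches between a
query sequence and the sequences being searched, to handle
insertions, deletions, and mutations.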
• For protein sequences, two amino acids should also be
considered a “match” if one can be derived from the other by
substitutions that are likely to occur in nature.
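The idea of scoring likely substitutions as partial matches can be sketched with a small global alignment scorer. This is not how BLAST or FASTA work internally; it is a plain Needleman-Wunsch dynamic program, and the similarity groups and score values are toy assumptions (real tools use substitution matrices such as BLOSUM or PAM).

```python
# Toy groups of similar amino acids (an assumption for this sketch only).
SIMILAR_GROUPS = [set("ILVM"), set("FYW"), set("KRH"), set("DE"), set("ST"), set("NQ")]


def score_pair(a: str, b: str) -> int:
    """Score two residues: identical, likely substitution, or mismatch."""
    if a == b:
        return 2
    if any(a in group and b in group for group in SIMILAR_GROUPS):
        return 1
    return -1


def global_alignment_score(s: str, t: str, gap: int = -2) -> int:
    """Needleman-Wunsch dynamic programming; returns the best global score."""
    prev = [j * gap for j in range(len(t) + 1)]
    for i in range(1, len(s) + 1):
        curr = [i * gap]
        for j in range(1, len(t) + 1):
            curr.append(max(
                prev[j - 1] + score_pair(s[i - 1], t[j - 1]),  # align two residues
                prev[j] + gap,      # residue of s against a gap
                curr[j - 1] + gap,  # residue of t against a gap
            ))
        prev = curr
    return prev[-1]


print(global_alignment_score("LKD", "LKD"))  # identical sequences
print(global_alignment_score("LKD", "IRE"))  # every residue substituted by a similar one
print(global_alignment_score("LKD", "LD"))   # one residue deleted
```

Keeping only two rows of the table is enough when just the score is needed; recovering the alignment itself would require the full table and a traceback.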
11. • There is a combinatorial number of ways to approximately
align multiple sequences. Two common approaches are:
1) reducing a multiple alignment to a series of pairwise
alignments and then combining the results;
2) using Hidden Markov Models (HMMs).
• Multiple alignment can be used to identify highly conserved
residues among genomes and to build phylogenetic trees that
infer evolutionary relationships among species.
• Genomic and proteomic sequences isolated from diseased
and healthy tissues can be compared to identify critical
differences between them.
• Sequences occurring more frequently in the diseased samples
may indicate genetic factors of the disease.
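Once sequences are aligned, finding conserved positions and comparing diseased with healthy samples reduces to column-wise checks. The sketch below assumes the alignment has already been computed (all strings have equal length, "-" marks a gap) and uses invented sequences; the function names are illustrative.

```python
def conserved_columns(alignment):
    """Indices of columns where every sequence carries the same residue."""
    return [i for i, column in enumerate(zip(*alignment))
            if len(set(column)) == 1 and column[0] != "-"]


def mismatch_fraction(a, b):
    """Fraction of aligned positions at which two sequences differ.

    A crude distance of the kind a tree-building method could start from.
    """
    return sum(x != y for x, y in zip(a, b)) / len(a)


def discriminating_columns(diseased, healthy):
    """Columns conserved within each group but different between the groups."""
    shared = set(conserved_columns(diseased)) & set(conserved_columns(healthy))
    return [i for i in sorted(shared) if diseased[0][i] != healthy[0][i]]


# Hypothetical, already-aligned sequences.
healthy = ["ACGTTA", "ACGTTA", "ACGTCA"]
diseased = ["ACCTTA", "ACCTGA", "ACCTTA"]

print(conserved_columns(healthy + diseased))
print(discriminating_columns(diseased, healthy))
print(mismatch_fraction(healthy[0], diseased[0]))
```

A column that is stable within each group but differs between the groups is only a candidate; the sketch says nothing about statistical significance.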
12. 3) Discovery of structural patterns and analysis of
genetic networks and protein pathways.
• Protein sequences are folded into 3D structures, and such
structures interact with each other based on their relative
positions and the distances between them.
• Such complex interactions lead to the formation of genetic
networks and protein pathways.
• It is important to develop powerful and scalable data mining
methods to discover patterns and to study regularities and
irregularities among such complex biological networks.
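As a rough illustration of distance-based interaction, the sketch below links any two structures whose centres lie within a cutoff and then lists the most connected ones. The names, coordinates, and the single-cutoff rule are assumptions; real interaction prediction is far more involved.

```python
from itertools import combinations
from math import dist


def interaction_network(positions, cutoff):
    """Link every pair of structures whose centres lie within `cutoff`."""
    network = {name: set() for name in positions}
    for a, b in combinations(positions, 2):
        if dist(positions[a], positions[b]) <= cutoff:
            network[a].add(b)
            network[b].add(a)
    return network


def hubs(network, min_degree):
    """Structures with at least `min_degree` interaction partners."""
    return sorted(name for name, partners in network.items()
                  if len(partners) >= min_degree)


# Hypothetical 3D centre coordinates.
positions = {
    "P1": (0.0, 0.0, 0.0),
    "P2": (3.0, 0.0, 0.0),
    "P3": (0.0, 4.0, 0.0),
    "P4": (10.0, 10.0, 10.0),
}

network = interaction_network(positions, cutoff=4.5)
print(network)
print(hubs(network, min_degree=2))
```

Nodes with no partners, such as `P4` here, are one simple example of an irregularity a pattern search might flag.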
13. 4) Association and path analysis: identifying
co-occurring gene sequences and linking genes to
different stages of disease development.
• Many studies have focused on the comparison of one gene
to another.
• Most diseases are not triggered by a single gene but by a
combination of genes acting together.
• Association analysis methods can be used to determine the
kinds of genes that are likely to co-occur in target samples.
• A group of genes may contribute to a disease process, with
different genes active at different stages; here path analysis
is expected to play an important role.
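Co-occurrence can be sketched as frequent-pair counting, the simplest form of association analysis. Each sample is assumed to be the set of genes observed as active in it; the gene labels and the support threshold are hypothetical.

```python
from collections import Counter
from itertools import combinations


def frequent_gene_pairs(samples, min_support):
    """Return gene pairs whose co-occurrence support is at least min_support.

    Support is the fraction of samples that contain both genes.
    """
    pair_counts = Counter()
    for genes in samples:
        pair_counts.update(combinations(sorted(set(genes)), 2))
    total = len(samples)
    return {pair: count / total
            for pair, count in pair_counts.items()
            if count / total >= min_support}


# Hypothetical target samples.
samples = [
    {"G1", "G2", "G3"},
    {"G1", "G2"},
    {"G2", "G3"},
    {"G1", "G2", "G4"},
]

print(frequent_gene_pairs(samples, min_support=0.5))
```

Extending this to larger gene sets would call for a proper frequent-itemset algorithm; the pairwise count only shows the idea.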
14. 5) Visualization tools in genetic data analysis.
• Alignments among genomic or proteomic sequences and
interactions between them can be:
1) expressed in graphic forms;
2) transformed into various kinds of easy-to-understand
visual displays.
• Such displays facilitate pattern understanding, knowledge
discovery, and interactive data exploration.
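Even a plain-text display makes an alignment easier to read. The sketch below assumes two already-aligned sequences of equal length and marks identical positions with "|"; the sequences are invented and `format_alignment` is an illustrative name.

```python
def format_alignment(a: str, b: str) -> str:
    """Render two aligned sequences with a marker line between them."""
    marks = "".join("|" if x == y and x != "-" else " " for x, y in zip(a, b))
    return "\n".join([a, marks, b])


print(format_alignment("ACGT-A", "ACCTTA"))
```

Graphical tools go much further (colour, zooming, linked views), but the principle of encoding agreement visually is the same.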