Sequence Matrix
 Gene concatenation made easy
  Gaurav Vaidya (1), David Lohman (2), Rudolf Meier (2)

                           1: NeatCo Asia, Singapore.
                           2: Department of Biological Sciences,
                              National University of Singapore, Singapore.
Our goals


 ✤   Many powerful tools exist for concatenating sequences.

 ✤   Adding new sequences to an existing dataset is tedious and time-consuming.

 ✤   Our initial goal: simple, user-friendly program for concatenating sequences.

 ✤   We also added a few tools to help you look for lab contamination in your dataset.
Sequence Matrix


✤   Written in Java.

    ✤   Graphical user interface libraries.

    ✤   Works on different operating systems.

    ✤   Easy to install: download and run the batch file.
Importing sequences



✤   You can use the sequence names as
    entered in the input file.

✤   Or you can ask Sequence Matrix to try
    to identify the species names.
Importing sequences

✤   Sequences mode:
    ✤   gi|237510679|gb|AY556753.2|Daubentonia madagascariensis voucher WE94001 5.8S
        ribosomal RNA gene, partial sequence; internal transcribed spacer 2, complete
        sequence; and 28S ribosomal RNA gene, partial sequence
    ✤   gi|237510678|gb|AY556735.2|Macaca sylvanus voucher OK96022 5.8S ribosomal
        RNA gene, partial sequence; internal transcribed spacer 2, complete sequence; and
        28S ribosomal RNA gene, partial sequence

✤   Species name:
    ✤   Daubentonia madagascariensis
    ✤   Macaca sylvanus
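
The species-name guess shown on this slide can be illustrated with a minimal sketch. This is not Sequence Matrix's actual algorithm; the function name `guess_species_name` and the rule it uses (take the first "Genus species" pair after the last `|`) are assumptions made for illustration only.

```python
import re
from typing import Optional

# Assumption: a species name looks like a capitalised genus followed by a
# lowercase epithet. Real names (subspecies, "sp.", hybrids) need more care.
_BINOMIAL = re.compile(r"\b([A-Z][a-z]+) ([a-z]+)\b")


def guess_species_name(header: str) -> Optional[str]:
    """Return a 'Genus species' guess from a GenBank-style sequence name."""
    # Drop the pipe-delimited identifiers and keep the free-text description.
    description = header.rsplit("|", 1)[-1]
    match = _BINOMIAL.search(description)
    if match is None:
        return None
    return f"{match.group(1)} {match.group(2)}"


header = ("gi|237510679|gb|AY556753.2|Daubentonia madagascariensis "
          "voucher WE94001 5.8S ribosomal RNA gene, partial sequence")
print(guess_species_name(header))
```

A guess like this can be wrong, which is presumably why the slides say Sequence Matrix will "try" to identify species names.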
Importing sequences



✤   A common source of error is forgetting
    to recode leading and trailing gaps as
    missing information.

✤   Sequence Matrix can automatically
    replace such gaps with question marks.
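
As a rough sketch of what this recoding involves (the function name and defaults below are hypothetical, not taken from Sequence Matrix), leading and trailing gap characters are swapped for question marks while internal gaps are left alone:

```python
def recode_terminal_gaps(sequence: str, gap: str = "-", missing: str = "?") -> str:
    """Replace leading and trailing gap characters with the missing-data symbol."""
    without_leading = sequence.lstrip(gap)
    leading = len(sequence) - len(without_leading)
    core = without_leading.rstrip(gap)
    trailing = len(without_leading) - len(core)
    # Internal gaps inside `core` are real alignment gaps and stay untouched.
    return missing * leading + core + missing * trailing


print(recode_terminal_gaps("--AC-GT---"))
```

The internal gap is kept because it records an alignment event, whereas terminal gaps usually only mean that the sequence was not read that far.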
Importing sequences: Naming



✤   Sequences from one dataset are matched up to another dataset by sequence name.

    ✤   Errors in sequence naming need to be fixed.

✤   We recommend naming your files by gene name: ‘coi’, ‘cytb’, ‘28S’ and so on.
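
Matching by sequence name is the core of concatenation. The sketch below is an assumption about how such a step could look, not the program's code: `datasets` is a hypothetical mapping from gene name to a dictionary of aligned sequences keyed by sequence name, and taxa missing from a gene are padded with question marks.

```python
def concatenate(datasets):
    """Join per-gene alignments into one matrix, matching rows by sequence name.

    Returns (matrix, charsets), where charsets maps each gene to its
    1-based inclusive (start, end) columns in the concatenated matrix.
    """
    taxa = sorted({name for seqs in datasets.values() for name in seqs})
    matrix = {name: "" for name in taxa}
    charsets = {}
    position = 0
    for gene, seqs in datasets.items():
        lengths = {len(s) for s in seqs.values()}
        if len(lengths) != 1:
            raise ValueError(f"{gene}: sequences are not aligned to one length")
        length = lengths.pop()
        for name in taxa:
            # A taxon with no sequence for this gene gets missing data.
            matrix[name] += seqs.get(name, "?" * length)
        charsets[gene] = (position + 1, position + length)
        position += length
    return matrix, charsets


matrix, charsets = concatenate({
    "coi": {"Homo sapiens": "ACGT", "Gorilla gorilla": "ACGA"},
    "28S": {"Homo sapiens": "GG"},
})
print(matrix)
print(charsets)
```

Because rows are matched purely by name, a misspelt name such as "Homo sapeins" would silently become a separate taxon, which is why naming errors need to be fixed first.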
Export: Taxonsets


✤   By default, we generate taxonsets on the
    basis of:

    ✤   Combined length.

    ✤   Number of character sets.

    ✤   Information for a particular gene.
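
The three criteria above could be computed along the following lines. This is only a sketch: the set names, the thresholds and the `datasets` layout (gene name to a dictionary of sequences keyed by taxon) are assumptions, and the actual rules used by Sequence Matrix may differ.

```python
def count_bases(sequence):
    """Count characters that carry information (not gaps or missing data)."""
    return sum(1 for ch in sequence if ch not in "-?")


def build_taxonsets(datasets, min_length, min_charsets):
    """Group taxa by gene coverage, combined length and number of charsets."""
    taxa = {name for seqs in datasets.values() for name in seqs}
    sets = {}
    # One taxonset per gene: taxa that have information for that gene.
    for gene, seqs in datasets.items():
        sets[f"has_{gene}"] = {n for n, s in seqs.items() if count_bases(s) > 0}
    totals = {n: sum(count_bases(seqs.get(n, "")) for seqs in datasets.values())
              for n in taxa}
    counts = {n: sum(1 for g in datasets if n in sets[f"has_{g}"]) for n in taxa}
    sets[f"length_at_least_{min_length}"] = {n for n in taxa if totals[n] >= min_length}
    sets[f"charsets_at_least_{min_charsets}"] = {n for n in taxa if counts[n] >= min_charsets}
    return sets


example = {
    "coi": {"A": "ACGT", "B": "AC--", "C": "????"},
    "28S": {"A": "GG"},
}
for name, members in sorted(build_taxonsets(example, 4, 1).items()):
    print(name, sorted(members))
```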
Gene trees



✤   Two ways to do them:

    ✤   Use the taxonset of taxa having information for a particular gene to exclude other
        taxa.

    ✤   Export the entire dataset with one file per column.
Export features



✤   You can also export the Sequence Matrix table as an Excel-readable text file.

    ✤   Supervisory mode.

    ✤   Keep track of a project as it grows.
Character sets


✤   We can read character sets defined in
    Nexus CHARSET and TNT xgroup
    commands.

✤   These can be “split” into individual
    columns, or imported as a single
    column representing the entire file.
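
To illustrate the "split" option, here is a small sketch that reads simple Nexus `CHARSET name = start-end;` definitions and cuts one sequence into per-charset pieces. It is deliberately limited: it assumes contiguous ranges only, does not handle step notation or multiple ranges, and does not attempt TNT xgroup syntax. The function names are hypothetical.

```python
import re

# Assumption: only the simplest form "CHARSET name = start-end;" is present.
_CHARSET = re.compile(r"charset\s+(\S+)\s*=\s*(\d+)\s*-\s*(\d+)\s*;", re.IGNORECASE)


def parse_charsets(nexus_text):
    """Return {name: (start, end)} with 1-based inclusive positions."""
    return {name: (int(start), int(end))
            for name, start, end in _CHARSET.findall(nexus_text)}


def split_by_charset(sequence, charsets):
    """Cut one concatenated sequence into one piece per character set."""
    return {name: sequence[start - 1:end] for name, (start, end) in charsets.items()}


sets_block = """BEGIN SETS;
  CHARSET coi = 1-4;
  CHARSET 28S = 5-6;
END;"""
charsets = parse_charsets(sets_block)
print(charsets)
print(split_by_charset("ACGTGG", charsets))
```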
Excision


✤   Individual sequences can be excised
    from the dataset.

✤   Excised sequences will not be exported.

    ✤   Sequence Matrix will warn you about
        that.
Contamination


✤   You thought you were sequencing Gorilla gorilla

    ✤   but you were really sequencing Homo sapiens.

✤   We have two tools you can use:

    ✤   If Homo sapiens is in your dataset.

    ✤   If Homo sapiens is not in your dataset (experimental!).
H. sapiens in dataset

✤   Looks for pairs of sequences whose
    pairwise distance is very low.

✤   Expected difference depends on gene:

    ✤   28S doesn’t change very much, but

    ✤   COI changes very quickly.

✤   Some interpretation is required.
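
A minimal sketch of this kind of check, assuming uncorrected p-distances on aligned sequences and a per-gene threshold chosen by the user (the distance measure and names here are assumptions; the slides do not say which measure Sequence Matrix uses):

```python
from itertools import combinations


def p_distance(a, b):
    """Proportion of differing sites, skipping gaps, missing data and N."""
    compared = differences = 0
    for x, y in zip(a.upper(), b.upper()):
        if x in "-?N" or y in "-?N":
            continue
        compared += 1
        if x != y:
            differences += 1
    return differences / compared if compared else None


def suspicious_pairs(seqs, threshold):
    """List (name1, name2, distance) for pairs at or below the threshold."""
    flagged = []
    for (n1, s1), (n2, s2) in combinations(sorted(seqs.items()), 2):
        d = p_distance(s1, s2)
        if d is not None and d <= threshold:
            flagged.append((n1, n2, d))
    return sorted(flagged, key=lambda item: item[2])


coi = {
    "Gorilla gorilla": "ACGTACGT",
    "Homo sapiens": "ACGTACGT",
    "Macaca sylvanus": "ACGAACTT",
}
print(suspicious_pairs(coi, threshold=0.01))
```

The threshold has to be set per gene: a value that flags contamination in a fast-evolving gene such as COI would flag almost every pair in a slowly evolving gene such as 28S.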
H. sapiens not present

✤   Use “Pairwise Distance Mode” to look
    for unusual pairwise distances.

✤   Ignore one charset, then sort taxa based
    on their pairwise distance to a
    “reference taxon”.

    ✤   Colour sequences by their individual
        pairwise distances to the reference
        taxon.
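
The procedure on this slide can be sketched as follows, under stated assumptions: distances are uncorrected p-distances, `datasets` maps gene name to sequences keyed by taxon, and the returned numbers stand in for the colours shown in the program. None of the names below come from Sequence Matrix itself.

```python
def count_differences(a, b):
    """Return (differences, compared sites), skipping gaps and missing data."""
    compared = differences = 0
    for x, y in zip(a.upper(), b.upper()):
        if x in "-?N" or y in "-?N":
            continue
        compared += 1
        differences += x != y
    return differences, compared


def rank_against_reference(datasets, reference, ignored_gene):
    """Sort taxa by distance to the reference over all genes except one.

    Each row is (taxon, distance on the other genes, distance on the
    ignored gene); a distance is None when no sites could be compared.
    """
    taxa = {n for seqs in datasets.values() for n in seqs} - {reference}
    rows = []
    for name in taxa:
        diffs = sites = 0
        gene_distance = None
        for gene, seqs in datasets.items():
            if reference not in seqs or name not in seqs:
                continue
            d, n = count_differences(seqs[reference], seqs[name])
            if gene == ignored_gene:
                gene_distance = d / n if n else None
            else:
                diffs += d
                sites += n
        background = diffs / sites if sites else None
        rows.append((name, background, gene_distance))
    # Taxa without a background distance are sorted last.
    return sorted(rows, key=lambda r: (r[1] is None,
                                       r[1] if r[1] is not None else 0.0,
                                       r[0]))


example = {
    "coi": {"Ref": "AAAA", "X": "AAAA", "Y": "AATT"},
    "28S": {"Ref": "GGGG", "X": "GGCC", "Y": "GGGG"},
}
for row in rank_against_reference(example, "Ref", "coi"):
    print(row)
```

In this toy example taxon X is far from the reference on the other genes but identical to it on the ignored gene, which is the kind of out-of-place value the colouring is meant to expose.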
H. sapiens not present

✤   Colour pairwise distances on the gene
    in question by their pairwise distance to
    the reference taxon.

✤   Look for colour variation which is
    unusual or out of place.

✤   We would expect the pairwise distances
    of different species to be correlated
    across genes.
Pairwise distance mode

✤   You need to vary:

    ✤   The gene you are studying.

    ✤   The reference taxon being compared
        against.

✤   Possibly helpful as an alert mechanism.
Summary

✤   Sequence Matrix allows you to assemble and examine multigene, multitaxon datasets.

✤   Taxonsets allow you to analyse subsets of your data in downstream programs.

✤   Excising sequences gives you greater control over which sequences to analyse.

✤   You can look for contamination in two ways:

    ✤   Looking for very low pairwise distances across your entire dataset.

    ✤   Looking for unusual pairwise distances in Pairwise Distance Mode.
Acknowledgements

✤   Rudolf Meier

✤   Zhang Guanyang

✤   Farhan Ali

✤   David Lohman

✤   Everybody at the NUS DBS
    Evolutionary Biology lab.
Question time!
