SlideShare ist ein Scribd-Unternehmen logo
1 von 21
Downloaden Sie, um offline zu lesen
Pau Corral Montañés

         RUGBCN
        03/05/2012
VIDEO:
“what we used to do in Bioinformatics“


note: due to memory space limitations, the video has not been embedded into
this presentation.

Content of the video:
A search in NCBI (www.ncbi.nlm.nih.gov) for the entry NC_001477 is done
interactively, leading to a single entry in the database related to Dengue Virus,
complete genome. Thanks to the facilities that the site provides, a FASTA file for
this complete genome is downloaded and thereafter opened with a text editor.

The purpose of the video is to show how tedious is accessing the site and
downleading an entry, with no less than 5 mouse clicks.
Unified records with in-house coding:
        The 3 databases share all sequences, but they use different                   A "mirror" of the content of other databases
        accession numbers (IDs) to refer to each entry.
        They update every night.

                                                                              ACNUC
                                                                                           A series of commands and their arguments were
        NCBI – ex.: #000134                                                                defined that allow
                                           EMBL – ex.: #000012
                                                                                           (1) database opening,
                                                                                           (2) query execution,
                                                                                           (3) annotation and sequence display,
                                                                                           (4) annotation, species and keywords browsing, and
                                                                                           (5) sequence extraction




                                              DDBJ – ex.:
                                              #002221




                                                                                  Other Databases
– NCBI - National Centre for Biotechnology Information - (www.ncbi.nlm.nih.gov)
– EMBL - European Molecular Biology Laboratory - (www.ebi.ac.uk/embl)
– DDBJ - DNA Data Bank of Japan - (www.ddbj.nig.ac.jp)
– ACNUC - http://pbil.univ-lyon1.fr/databases/acnuc/acnuc.html
ACNUC retrieval programs:

     i) SeqinR - (http://pbil.univ-lyon1.fr/software/seqinr/)

     ii) C language API - (http://pbil.univ-lyon1.fr/databases/acnuc/raa_acnuc.html)

     iii) Python language API - (http://pbil.univ-lyon1.fr/cgi-bin/raapythonhelp.csh)


     a.I) Query_win – GUI client for remote ACNUC retrieval operations
                   (http://pbil.univ-lyon1.fr/software/query_win.html)
     a.II) raa_query – same functionality as Query_win in command line interface
                  (http://pbil.univ-lyon1.fr/software/query.html)
Why use seqinR:
the Wheel          - available packages
Hotline            - discussion forum
Automation         - same source code for different purposes
Reproducibility    - tutorials in vignette format
Fine tuning        - function arguments

Usage of the R environment!!



#Install the package seqinr_3.0-6 (Apirl 2012)

>install.packages(“seqinr“)
package ‘seqinr’ was built under R version 2.14.2




                                                               A vingette:
                                                               http://seqinr.r-forge.r-project.org/seqinr_2_0-7.pdf
#Choose a mirror
>chooseCRANmirror(mirror_name)
>install.packages("seqinr")

>library(seqinr)

#The command lseqinr() lists all what is defined in the package:
>lseqinr()[1:9]

  [1]   "a"              "aaa"
  [3]   "AAstat"         "acnucclose"
  [5]   "acnucopen"      "al2bp"
  [7]   "alllistranks"   "alr"
  [9]   "amb"

>length(lseqinr())
[1] 209
How many different ways are there to work with biological sequences using SeqinR?

1)Sequences you have locally:

     i) read.fasta() and s2c() and c2s() and GC() and count() and translate()
     ii) write.fasta()
     iii) read.alignment() and consensus()

2) Sequences you download from a Database:

     i) browse Databases
     ii) query() and getSequence()
FASTA files example:
 Example with DNA data: (4 different characters, normally)

                                                                   The FASTA format is very
                                                                   simple and widely used for
                                                                   simple import of biological
                                                                   sequences.
                                                                   It begins with a single-line
                                                                   description starting with a
                                                                   character '>', followed by
                                                                   lines of sequence data of
                                                                   maximum 80 character each.
                                                                   Lines starting with a semi-
                                                                   colon character ';' are
                                                                   comment lines.

  Example with Protein data: (20 different characters, normally)
                                                                   Check Wikipedia for:
                                                                   i)Sequence representation
                                                                   ii)Sequence identifiers
                                                                   iii)File extensions
Read a file with read.fasta()

#Read the sequence from a local directory
> setwd("H:/Documents and Settings/Pau/Mis documentos/R_test")
> dir()
[1] "dengue_whole_sequence.fasta"

#Use the read.fasta (see next slide) function to load the sequence
> read.fasta(file="dengue_whole_sequence.fasta", seqtype="DNA")
$`gi|9626685|ref|NC_001477.1|`
[1] "a" "g" "t" "t" "g" "t" "t" "a" "g" "t" "c" "t" "a" "c" "g" "t" "g" "g"
[19] "a" "c" "c" "g" "a" "c" "a" "a" "g" "a" "a" "c" "a" "g" "t" "t" "t" "c"
[37] "g" "a" "a" "t" "c" "g" "g" "a" "a" "g" "c" "t" "t" "g" "c" "t" "t" "a"
[55] "a" "c" "g" "t" "a" "g" "t" "t" "c" "t" "a" "a" "c" "a" "g" "t" "t" "t"
  [............................................................]
[10711] "t" "g" "g" "t" "g" "c" "t" "g" "t" "t" "g" "a" "a" "t" "c" "a" "a" "c"
[10729] "a" "g" "g" "t" "t" "c" "t"
attr(,"name")
[1] "gi|9626685|ref|NC_001477.1|"
attr(,"Annot")
[1] ">gi|9626685|ref|NC_001477.1| Dengue virus 1, complete genome"
attr(,"class")
[1] "SeqFastadna"



> read.fasta(file="dengue_whole_sequence.fasta", seqtype="DNA", as.string=T, set.attributes=F)
$`gi|9626685|ref|NC_001477.1|`
[1] "agttgttagtctacgtggaccgacaagaacagtttcgaatcggaagcttgcttaacgtag...__truncated__......"

> seq <- read.fasta(file="dengue_whole_sequence.fasta", seqtype="DNA", as.string=T, set.attributes=F)
> str(seq)
List of 1
 $ gi|9626685|ref|NC_001477.1|: chr
"agttgttagtctacgtggaccgacaagaacagtttcgaatcggaagcttgcttaacgtag...__truncated__......"
Usage:
rea d.fa s ta (file, seqtype = c("DNA", "AA"), as.string = FALSE, forceDNAtolower =
TRUE, set.attributes = TRUE, legacy.mode = TRUE, seqonly = FALSE,                   strip.desc
= FALSE, bfa = FALSE, sizeof.longlong = .Machine$sizeof.longlong,
    endian = .Platform$endian, apply.mask = TRUE)

Arguments:
file - path (relative [getwd] is used if absoulte is not given) to FASTA file
seqtype - the nature of the sequence: DNA or AA
as.string - if TRUE sequences are returned as a string instead of a vector characters
forceDNAtolower - lower- or upper-case
set.attributes - whether sequence attributes should be set
legacy.mode - if TRUE lines starting with a semicolon ’;’ are ignored
seqonly - if TRUE, only sequences as returned (execution time is divided approximately by a factor 3)
strip.desc - if TRUE, removes the '>' at the beginning
bfa - if TRUE the fasta file is in MAQ binary format sizeof.longlong
endian - relative to MAQ files
apply.mask - relative to MAQ files

Value:
a list of vector of chars
Basic manipulations:
#Turn seqeunce into characters and count how many are there
> length(s2c(seq[[1]]))
[1] 10735

#Count how many different accurrences are there
> table(s2c(seq[[1]]))
   a    c    g    t
3426 2240 2770 2299

#Count the fraction of G and C bases in the sequence
> GC(s2c(seq[[1]]))
[1] 0.4666977

#Count all possible words in a sequence with a sliding window of size = wordsize
> seq_2 <- "actg"
> count(s2c(seq_2), wordsize=2)
aa ac ag at ca cc cg ct ga gc gg gt ta tc tg tt
 0 1 0 0 0 0 0 1 0 0 0 0 0 0 1 0

> count(s2c(seq_2), wordsize=2, by=2)
aa ac ag at ca cc cg ct ga gc gg gt ta tc tg tt
 0 1 0 0 0 0 0 0 0 0 0 0 0 0 1 0

#Translate into aminoacids the genetic sequence.
> c2s(translate(s2c(seq[[1]]), frame=0, sens='F', numcode=1))
[1] "SC*STWTDKNSFESEACLT*F*QFFIREQISDEQPTEKDGSTVFQYAETREKPRVNCFTVGEEILK...-Truncated-...
RIAFRPRTHEIGDGFYSIPKISSHTSNSRNFG*MGLIQEEWSDQGDPSQDTTQQRGPTPGEAVPWW*GLEVRGDPPHNNKQHIDAGRD
QRSCCLYSIIPGTERQKMEWCC*INRF"
#Note:
#    1)frame=0 means that first nuc. is taken as the first position in the codon
#    2)sense='F' means that the sequence is tranlated in the forward sense
#    3)numcode=1 means standard genetic code is used (in SeqinR, up to 23 variations)
Write a FASTA file (and how to save time)
> dir()
[1] "dengue_whole_sequence.fasta" "ER.fasta"
> system.time(seq <- read.fasta(file="ER.fasta", seqtype="DNA", as.string=T, set.attributes=F))
   user system elapsed
   3.28    0.00    3.33

> length(seq)
[1] 9984
> seq[[1]]
[1] "gtggtatcaacgcagagtacgcggggacgtttatatggacgctcctacaaggaaaccctagcctt..._truncated_...ctcata"
> names(seq)[1:5]
[1] "ER0101A01F" "ER0101A03F" "ER0101A05F" "ER0101A07F" "ER0101A09F"

#Clean the sequences keeping only the ones > 25 nucleotides
> char_seq = rapply(seq, s2c, how="list")       # create a list turning strings to characters
> len_seq = rapply(char_seq, length, how="list") # count how many characters are in each list
> bigger_25_list = c()
> for (val in 1:length(len_seq)){
+     if (len_seq[[val]] >= 25){
+         bigger_25_list = c(bigger_25_list, val)
+     }
+ }
> seq_25 = seq[bigger_25_list]            #indexing to get the desired list
> length(seq_25)
[1] 8928
> seq_25[1]
$ER0101A01F
[1] "gtggtatcaacgcagagtacgcggggacgtttatatggacgctcctacaaggaaaccctagccttctcatacct...truncated..."
#Write a FASTA file (see next slide)
> write.fasta(seq_25, names = names(seq_25), file.out="clean_seq.fasta", open="w")
> dir()
[1] "clean_seq.fasta" "dengue_whole_sequence.fasta" "ER.fasta"
Usage:
w rite.fa s ta (sequences, names, file.out, open = "w", nbchar = 80)

Arguments:
sequences - A DNA or protein sequence (in the form of a vector of single characters) or a list of such sequences.
names - The name(s) of the sequences.
file.out - The name of the output file.
open - Open the output file, use "w" to write into a new file, use "a" to append at the end of an already existing file.
nbchar - The number of characters per line (default: 60)


Value:
none in the R space. A FASTA formatted file is created
Write a FASTA file (and how to save time)
# Remember what we did before
> system.time(seq <- read.fasta(file="ER.fasta", seqtype="DNA", as.string=T, set.attributes=F))
   user system elapsed
   3.28    0.00    3.33


# After cleanig seq, we produced a file "clean_seq.fasta" that we read now
> system.time(seq_1 <- read.fasta(file="clean_seq.fasta", seqtype="DNA", as.string=T, set.attributes=F))
   user system elapsed
   1.30    0.01    1.31

#Time can be saved with the save function:
> save(seq_25, file = "ER_CLEAN_seqs.RData")

> system.time(load("ER_CLEAN_seqs.RData"))
   user system elapsed
   0.11    0.02    0.12
Read an alignment (or create an alignment object to be aligned):
#Create an alignment object through reading a FASTA file with two sequences (see next slide)
> fasta <- read.alignment(file = system.file("sequences/Anouk.fasta", package = "seqinr"), format ="fasta")
> fasta
$nb
[1] 2
$nam
[1] "LmjF01.0030" "LinJ01.0030"
$seq
$seq[[1]]
[1] "atgatgtcggccgagccgccgtcgtcgcagccgtacatcagcgacgtgctgcggcggtaccagc...truncated..."
$seq[[2]]
[1] "atgatgtcggccgagccgccgtcgtcgcagccgtacatcagcgacgtgctgcggcggtaccagc...truncated..."
$com
[1] NA
attr(,"class")
[1] "alignment"

# The consensus() function aligns the two sequences, producing a consensus sequences. IUPAC symbology is used.
> fixed_align = consensus(fasta, method="IUPAC")

> table(fixed_align)
fixed_align
  a   c   g   k   m  r    s   t   w    y
411 636 595   3   5 20   13 293   2   20
Usage:
rea d.a lig nm ent(file, format, forceToLower = TRUE)

Arguments:
file - The name of the file which the aligned sequences are to be read from. If it does not contain an absolute or relative
path, the file name is relative to the current working directory, getwd.
format - A character string specifying the format of the file: mas e, clus tal, phylip, fas ta or ms f
forceToLower - A logical defaulting to TRUE stating whether the returned characters in the sequence should be in
lower case



Value:
An object is created of class alignment which is a list with the following components:
nb ->the number of aligned sequences
nam ->a vector of strings containing the names of the aligned sequences
seq ->a vector of strings containing the aligned sequences
com ->a vector of strings containing the commentaries for each sequence or NA if there are no comments
Access a remote server:
> choosebank()
 [1] "genbank"       "embl"          "emblwgs"         "swissprot"   "ensembl"       "hogenom"
 [7] "hogenomdna"    "hovergendna"   "hovergen"        "hogenom5"    "hogenom5dna"   "hogenom4"
[13] "hogenom4dna"   "homolens"      "homolensdna"     "hobacnucl"   "hobacprot"     "phever2"
[19] "phever2dna"    "refseq"        "greviews"        "bacterial"   "protozoan"     "ensbacteria"
[25] "ensprotists"   "ensfungi"      "ensmetazoa"      "ensplants"   "mito"          "polymorphix"
[31] "emglib"        "taxobacgen"    "refseqViruses"

#Access a bank and see complementary information:
> choosebank(bank="genbank", infobank=T)
> ls()
[1] "banknameSocket"
> str(banknameSocket)
List of 9
 $ socket :Classes 'sockconn', 'connection' atomic [1:1] 3
  .. ..- attr(*, "conn_id")=<externalptr>
 $ bankname: chr "genbank"
 $ banktype: chr "GENBANK"
 $ totseqs : num 1.65e+08
 $ totspecs: num 968157
 $ totkeys : num 3.1e+07
 $ release : chr "           GenBank Rel. 189 (15 April 2012) Last Updated: May 1, 2012"
 $ status :Class 'AsIs' chr "on"
 $ details : chr [1:4] "              ****     ACNUC Data Base Content      ****                        " "
      GenBank Rel. 189 (15 April 2012) Last Updated: May 1, 2012" "139,677,722,280 bases; 152,280,170
sequences; 12,313,982 subseqs; 684,079 refers." "Software by M. Gouy, Lab. Biometrie et Biologie Evolutive,
Universite Lyon I "
> banknameSocket$details
[1] "              ****     ACNUC Data Base Content      ****                        "
[2] "           GenBank Rel. 189 (15 April 2012) Last Updated: May 1, 2012"
[3] "139,677,722,280 bases; 152,280,170 sequences; 12,313,982 subseqs; 684,079 refers."
[4] "Software by M. Gouy, Lab. Biometrie et Biologie Evolutive, Universite Lyon I "
Make a query:
# query (see next slide) all sequences that contain the words "virus" and "dengue" in the taxonomy field and
# that are not partial sequences
> system.time(query("All_Dengue_viruses_NOTpartial", ""sp=@virus@" AND "sp=@dengue@" AND
NOT "k=partial""))
   user system elapsed
   0.78    0.00    6.72

> All_Dengue_viruses_NOTpartial[1:4]
$call
query(listname = "All_Dengue_viruses_NOTpartial", query = ""sp=@virus@" AND "sp=@dengue@" AND
NOT "k=partial"")
$name
[1] "All_Dengue_viruses_NOTpartial"
$nelem
[1] 7741
$typelist
[1] "SQ"

> All_Dengue_viruses_NOTpartial[[5]][[1]]
    name   length    frame   ncbicg
"A13666"    "456"      "0"      "1"

> myseq = getSequence(All_Dengue_viruses_NOTpartial[[5]][[1]])
> myseq[1:20]
 [1] "a" "t" "g" "g" "c" "c" "a" "t" "g" "g" "a" "c" "c" "t" "t" "g" "g" "t" "g" "a"

> closebank()
Usage:
query(listname, query, socket = autosocket(), invisible = T, verbose = F, virtual = F)

Arguments:
listname - The name of the list as a quoted string of chars
query - A quoted string of chars containing the request with the syntax given in the details section
socket - An object of class sockconn connecting to a remote ACNUC database (default is a socket to the last opened
database).
invisible - if FALSE, the result is returned visibly.
verbose - if TRUE, verbose mode is on
virtual - if TRUE, no attempt is made to retrieve the information about all the elements of the list. In this case, the req
component of the list is set to NA.

Value:
The result is a list with the following 6 components:

call - the original call
name - the ACNUC list name
nelem - the number of elements (for instance sequences) in the ACNUC list
typelist - the type of the elements of the list. Could be SQ for a list of sequence names, KW for a list of keywords, SP for
a list of species names.
req - a list of sequence names that fit the required criteria or NA when called with parameter virtual is TRUE
socket - the socket connection that was used
Pau Corral Montañés

         RUGBCN
        03/05/2012

Weitere ähnliche Inhalte

Was ist angesagt?

Genomic Data Analysis: From Reads to Variants
Genomic Data Analysis: From Reads to VariantsGenomic Data Analysis: From Reads to Variants
Genomic Data Analysis: From Reads to VariantsAureliano Bombarely
 
Introduction to Galaxy and RNA-Seq
Introduction to Galaxy and RNA-SeqIntroduction to Galaxy and RNA-Seq
Introduction to Galaxy and RNA-SeqEnis Afgan
 
Rnaseq basics ngs_application1
Rnaseq basics ngs_application1Rnaseq basics ngs_application1
Rnaseq basics ngs_application1Yaoyu Wang
 
Workshop NGS data analysis - 1
Workshop NGS data analysis - 1Workshop NGS data analysis - 1
Workshop NGS data analysis - 1Maté Ongenaert
 
NGS data formats and analyses
NGS data formats and analysesNGS data formats and analyses
NGS data formats and analysesrjorton
 
Unique, dual-matched adapters mitigate index hopping between NGS samples
Unique, dual-matched adapters mitigate index hopping between NGS samplesUnique, dual-matched adapters mitigate index hopping between NGS samples
Unique, dual-matched adapters mitigate index hopping between NGS samplesIntegrated DNA Technologies
 
Quality control of sequencing with fast qc obtained with
Quality control of sequencing with fast qc obtained withQuality control of sequencing with fast qc obtained with
Quality control of sequencing with fast qc obtained withHafiz Muhammad Zeeshan Raza
 
Tools for Transcriptome Data Analysis
Tools for Transcriptome Data AnalysisTools for Transcriptome Data Analysis
Tools for Transcriptome Data AnalysisSANJANA PANDEY
 
Bioinformatics workshop Sept 2014
Bioinformatics workshop Sept 2014Bioinformatics workshop Sept 2014
Bioinformatics workshop Sept 2014LutzFr
 
Nested JSON data processing with Apache Spark
Nested JSON data processing with Apache SparkNested JSON data processing with Apache Spark
Nested JSON data processing with Apache SparkAegis Software Canada
 
CAP Trapper Technologies and Applications, CAP Analysis of Gene Expression (C...
CAP Trapper Technologies and Applications, CAP Analysis of Gene Expression (C...CAP Trapper Technologies and Applications, CAP Analysis of Gene Expression (C...
CAP Trapper Technologies and Applications, CAP Analysis of Gene Expression (C...Laura Berry
 
Overlap Layout Consensus assembly
Overlap Layout Consensus assemblyOverlap Layout Consensus assembly
Overlap Layout Consensus assemblyZhuyi Xue
 
Whole Genome Sequencing Analysis
Whole Genome Sequencing AnalysisWhole Genome Sequencing Analysis
Whole Genome Sequencing AnalysisEfi Athieniti
 
Galaxy RNA-Seq Analysis: Tuxedo Protocol
Galaxy RNA-Seq Analysis: Tuxedo ProtocolGalaxy RNA-Seq Analysis: Tuxedo Protocol
Galaxy RNA-Seq Analysis: Tuxedo ProtocolHong ChangBum
 
MongoDB Aggregation Performance
MongoDB Aggregation PerformanceMongoDB Aggregation Performance
MongoDB Aggregation PerformanceMongoDB
 

Was ist angesagt? (20)

Genomic Data Analysis: From Reads to Variants
Genomic Data Analysis: From Reads to VariantsGenomic Data Analysis: From Reads to Variants
Genomic Data Analysis: From Reads to Variants
 
Introduction to Galaxy and RNA-Seq
Introduction to Galaxy and RNA-SeqIntroduction to Galaxy and RNA-Seq
Introduction to Galaxy and RNA-Seq
 
Rnaseq basics ngs_application1
Rnaseq basics ngs_application1Rnaseq basics ngs_application1
Rnaseq basics ngs_application1
 
Sanger sequencing
Sanger sequencingSanger sequencing
Sanger sequencing
 
Workshop NGS data analysis - 1
Workshop NGS data analysis - 1Workshop NGS data analysis - 1
Workshop NGS data analysis - 1
 
Rt pcr
Rt pcrRt pcr
Rt pcr
 
DNA Notes
DNA NotesDNA Notes
DNA Notes
 
NGS data formats and analyses
NGS data formats and analysesNGS data formats and analyses
NGS data formats and analyses
 
Unique, dual-matched adapters mitigate index hopping between NGS samples
Unique, dual-matched adapters mitigate index hopping between NGS samplesUnique, dual-matched adapters mitigate index hopping between NGS samples
Unique, dual-matched adapters mitigate index hopping between NGS samples
 
Quality control of sequencing with fast qc obtained with
Quality control of sequencing with fast qc obtained withQuality control of sequencing with fast qc obtained with
Quality control of sequencing with fast qc obtained with
 
Tools for Transcriptome Data Analysis
Tools for Transcriptome Data AnalysisTools for Transcriptome Data Analysis
Tools for Transcriptome Data Analysis
 
Bioinformatics workshop Sept 2014
Bioinformatics workshop Sept 2014Bioinformatics workshop Sept 2014
Bioinformatics workshop Sept 2014
 
Nested JSON data processing with Apache Spark
Nested JSON data processing with Apache SparkNested JSON data processing with Apache Spark
Nested JSON data processing with Apache Spark
 
CAP Trapper Technologies and Applications, CAP Analysis of Gene Expression (C...
CAP Trapper Technologies and Applications, CAP Analysis of Gene Expression (C...CAP Trapper Technologies and Applications, CAP Analysis of Gene Expression (C...
CAP Trapper Technologies and Applications, CAP Analysis of Gene Expression (C...
 
Overlap Layout Consensus assembly
Overlap Layout Consensus assemblyOverlap Layout Consensus assembly
Overlap Layout Consensus assembly
 
Whole Genome Sequencing Analysis
Whole Genome Sequencing AnalysisWhole Genome Sequencing Analysis
Whole Genome Sequencing Analysis
 
Whole exome sequencing(wes)
Whole exome sequencing(wes)Whole exome sequencing(wes)
Whole exome sequencing(wes)
 
Galaxy RNA-Seq Analysis: Tuxedo Protocol
Galaxy RNA-Seq Analysis: Tuxedo ProtocolGalaxy RNA-Seq Analysis: Tuxedo Protocol
Galaxy RNA-Seq Analysis: Tuxedo Protocol
 
GoTermsAnalysisWithR
GoTermsAnalysisWithRGoTermsAnalysisWithR
GoTermsAnalysisWithR
 
MongoDB Aggregation Performance
MongoDB Aggregation PerformanceMongoDB Aggregation Performance
MongoDB Aggregation Performance
 

Andere mochten auch

Phylogenetics in R
Phylogenetics in RPhylogenetics in R
Phylogenetics in Rschamber
 
Phylogeny in R - Bianca Santini Sheffield R Users March 2015
Phylogeny in R - Bianca Santini Sheffield R Users March 2015Phylogeny in R - Bianca Santini Sheffield R Users March 2015
Phylogeny in R - Bianca Santini Sheffield R Users March 2015Paul Richards
 
Phylogenetics Analysis in R
Phylogenetics Analysis in RPhylogenetics Analysis in R
Phylogenetics Analysis in RKlaus Schliep
 
Evolution of transposons, genomes, and organisms (Hertweck Fall 2014)
Evolution of transposons, genomes, and organisms (Hertweck Fall 2014)Evolution of transposons, genomes, and organisms (Hertweck Fall 2014)
Evolution of transposons, genomes, and organisms (Hertweck Fall 2014)Kate Hertweck
 
Where can tell me who I am?
Where can tell me who I am?Where can tell me who I am?
Where can tell me who I am?seltzoid
 
Zed-Sales™ - a flagship product of Zed-Axis Technologies Pvt. Ltd.
Zed-Sales™ - a flagship product of Zed-Axis Technologies Pvt. Ltd.Zed-Sales™ - a flagship product of Zed-Axis Technologies Pvt. Ltd.
Zed-Sales™ - a flagship product of Zed-Axis Technologies Pvt. Ltd.Rakesh Kumar
 
Hacking your Kindle (OSCON Lightning Talk)
Hacking your Kindle (OSCON Lightning Talk)Hacking your Kindle (OSCON Lightning Talk)
Hacking your Kindle (OSCON Lightning Talk)Jesse Vincent
 
Exotic Orient
Exotic OrientExotic Orient
Exotic OrientRenny
 
IBM Big Data References
IBM Big Data ReferencesIBM Big Data References
IBM Big Data ReferencesRob Thomas
 
Engage Workshop Berlin09 Part2
Engage Workshop Berlin09 Part2Engage Workshop Berlin09 Part2
Engage Workshop Berlin09 Part2Paul Pivec
 
Blackwell Esteem AFSL
Blackwell Esteem AFSLBlackwell Esteem AFSL
Blackwell Esteem AFSLsamueltay77
 
Iman bysajib hossain akash-01725-340978.
Iman bysajib hossain akash-01725-340978.Iman bysajib hossain akash-01725-340978.
Iman bysajib hossain akash-01725-340978.Sajib Hossain Akash
 
Exposing Opportunities in China A50 using CFD
Exposing Opportunities in China A50 using CFDExposing Opportunities in China A50 using CFD
Exposing Opportunities in China A50 using CFDPhillip CFD
 
360Gate Business Objects portal
360Gate Business Objects portal360Gate Business Objects portal
360Gate Business Objects portalSebastien Goiffon
 
Ushul Firaq wal Adyaan wal Madzaahib Al Fikriyah (Ar)
Ushul Firaq wal Adyaan wal Madzaahib Al Fikriyah (Ar)Ushul Firaq wal Adyaan wal Madzaahib Al Fikriyah (Ar)
Ushul Firaq wal Adyaan wal Madzaahib Al Fikriyah (Ar)Sayogyo Rahman Doko
 
Tonometer Final NSF I-Corps presentation
Tonometer Final NSF I-Corps presentationTonometer Final NSF I-Corps presentation
Tonometer Final NSF I-Corps presentationStanford University
 
ShareThis Auto Study
ShareThis Auto Study ShareThis Auto Study
ShareThis Auto Study ShareThis
 

Andere mochten auch (20)

Phylogenetics in R
Phylogenetics in RPhylogenetics in R
Phylogenetics in R
 
Phylogeny in R - Bianca Santini Sheffield R Users March 2015
Phylogeny in R - Bianca Santini Sheffield R Users March 2015Phylogeny in R - Bianca Santini Sheffield R Users March 2015
Phylogeny in R - Bianca Santini Sheffield R Users March 2015
 
Phylogenetics Analysis in R
Phylogenetics Analysis in RPhylogenetics Analysis in R
Phylogenetics Analysis in R
 
Evolution of transposons, genomes, and organisms (Hertweck Fall 2014)
Evolution of transposons, genomes, and organisms (Hertweck Fall 2014)Evolution of transposons, genomes, and organisms (Hertweck Fall 2014)
Evolution of transposons, genomes, and organisms (Hertweck Fall 2014)
 
Where can tell me who I am?
Where can tell me who I am?Where can tell me who I am?
Where can tell me who I am?
 
Zed-Sales™ - a flagship product of Zed-Axis Technologies Pvt. Ltd.
Zed-Sales™ - a flagship product of Zed-Axis Technologies Pvt. Ltd.Zed-Sales™ - a flagship product of Zed-Axis Technologies Pvt. Ltd.
Zed-Sales™ - a flagship product of Zed-Axis Technologies Pvt. Ltd.
 
Hacking your Kindle (OSCON Lightning Talk)
Hacking your Kindle (OSCON Lightning Talk)Hacking your Kindle (OSCON Lightning Talk)
Hacking your Kindle (OSCON Lightning Talk)
 
Exotic Orient
Exotic OrientExotic Orient
Exotic Orient
 
IBM Big Data References
IBM Big Data ReferencesIBM Big Data References
IBM Big Data References
 
Engage Workshop Berlin09 Part2
Engage Workshop Berlin09 Part2Engage Workshop Berlin09 Part2
Engage Workshop Berlin09 Part2
 
Blackwell Esteem AFSL
Blackwell Esteem AFSLBlackwell Esteem AFSL
Blackwell Esteem AFSL
 
Bailey capítulo-6
Bailey capítulo-6Bailey capítulo-6
Bailey capítulo-6
 
Iman bysajib hossain akash-01725-340978.
Iman bysajib hossain akash-01725-340978.Iman bysajib hossain akash-01725-340978.
Iman bysajib hossain akash-01725-340978.
 
Exposing Opportunities in China A50 using CFD
Exposing Opportunities in China A50 using CFDExposing Opportunities in China A50 using CFD
Exposing Opportunities in China A50 using CFD
 
Gscm1
Gscm1Gscm1
Gscm1
 
360Gate Business Objects portal
360Gate Business Objects portal360Gate Business Objects portal
360Gate Business Objects portal
 
Ushul Firaq wal Adyaan wal Madzaahib Al Fikriyah (Ar)
Ushul Firaq wal Adyaan wal Madzaahib Al Fikriyah (Ar)Ushul Firaq wal Adyaan wal Madzaahib Al Fikriyah (Ar)
Ushul Firaq wal Adyaan wal Madzaahib Al Fikriyah (Ar)
 
Geohash
GeohashGeohash
Geohash
 
Tonometer Final NSF I-Corps presentation
Tonometer Final NSF I-Corps presentationTonometer Final NSF I-Corps presentation
Tonometer Final NSF I-Corps presentation
 
ShareThis Auto Study
ShareThis Auto Study ShareThis Auto Study
ShareThis Auto Study
 

Ähnlich wie SeqinR - biological data handling

Parsing and Type checking all 2^10000 configurations of the Linux kernel
Parsing and Type checking all 2^10000 configurations of the Linux kernelParsing and Type checking all 2^10000 configurations of the Linux kernel
Parsing and Type checking all 2^10000 configurations of the Linux kernelchk49
 
Next-generation sequencing data format and visualization with ngs.plot 2015
Next-generation sequencing data format and visualization with ngs.plot 2015Next-generation sequencing data format and visualization with ngs.plot 2015
Next-generation sequencing data format and visualization with ngs.plot 2015Li Shen
 
Learning to Spot and Refactor Inconsistent Method Names
Learning to Spot and Refactor Inconsistent Method NamesLearning to Spot and Refactor Inconsistent Method Names
Learning to Spot and Refactor Inconsistent Method NamesDongsun Kim
 
Bioinfo ngs data format visualization v2
Bioinfo ngs data format visualization v2Bioinfo ngs data format visualization v2
Bioinfo ngs data format visualization v2Li Shen
 
Profiling distributed Java applications
Profiling distributed Java applicationsProfiling distributed Java applications
Profiling distributed Java applicationsConstantine Slisenka
 
Как разработать DBFW с нуля
Как разработать DBFW с нуляКак разработать DBFW с нуля
Как разработать DBFW с нуляPositive Hack Days
 
Database Firewall from Scratch
Database Firewall from ScratchDatabase Firewall from Scratch
Database Firewall from ScratchDenis Kolegov
 
Terence Barr - jdk7+8 - 24mai2011
Terence Barr - jdk7+8 - 24mai2011Terence Barr - jdk7+8 - 24mai2011
Terence Barr - jdk7+8 - 24mai2011Agora Group
 
2016 bioinformatics i_bio_python_wimvancriekinge
2016 bioinformatics i_bio_python_wimvancriekinge2016 bioinformatics i_bio_python_wimvancriekinge
2016 bioinformatics i_bio_python_wimvancriekingeProf. Wim Van Criekinge
 
Fosdem10
Fosdem10Fosdem10
Fosdem10wremes
 
Language Integrated Query - LINQ
Language Integrated Query - LINQLanguage Integrated Query - LINQ
Language Integrated Query - LINQDoncho Minkov
 
Direct Code Execution - LinuxCon Japan 2014
Direct Code Execution - LinuxCon Japan 2014Direct Code Execution - LinuxCon Japan 2014
Direct Code Execution - LinuxCon Japan 2014Hajime Tazaki
 
Java user group 2015 02-09-java8
Java user group 2015 02-09-java8Java user group 2015 02-09-java8
Java user group 2015 02-09-java8marctritschler
 
Java user group 2015 02-09-java8
Java user group 2015 02-09-java8Java user group 2015 02-09-java8
Java user group 2015 02-09-java8Marc Tritschler
 

Ähnlich wie SeqinR - biological data handling (20)

Parsing and Type checking all 2^10000 configurations of the Linux kernel
Parsing and Type checking all 2^10000 configurations of the Linux kernelParsing and Type checking all 2^10000 configurations of the Linux kernel
Parsing and Type checking all 2^10000 configurations of the Linux kernel
 
2015 bioinformatics bio_python
2015 bioinformatics bio_python2015 bioinformatics bio_python
2015 bioinformatics bio_python
 
Next-generation sequencing data format and visualization with ngs.plot 2015
Next-generation sequencing data format and visualization with ngs.plot 2015Next-generation sequencing data format and visualization with ngs.plot 2015
Next-generation sequencing data format and visualization with ngs.plot 2015
 
NGS: Mapping and de novo assembly
NGS: Mapping and de novo assemblyNGS: Mapping and de novo assembly
NGS: Mapping and de novo assembly
 
TiReX: Tiled Regular eXpression matching architecture
TiReX: Tiled Regular eXpression matching architectureTiReX: Tiled Regular eXpression matching architecture
TiReX: Tiled Regular eXpression matching architecture
 
Biopython
BiopythonBiopython
Biopython
 
Learning to Spot and Refactor Inconsistent Method Names
Learning to Spot and Refactor Inconsistent Method NamesLearning to Spot and Refactor Inconsistent Method Names
Learning to Spot and Refactor Inconsistent Method Names
 
Bioinfo ngs data format visualization v2
Bioinfo ngs data format visualization v2Bioinfo ngs data format visualization v2
Bioinfo ngs data format visualization v2
 
Profiling distributed Java applications
Profiling distributed Java applicationsProfiling distributed Java applications
Profiling distributed Java applications
 
Как разработать DBFW с нуля
Как разработать DBFW с нуляКак разработать DBFW с нуля
Как разработать DBFW с нуля
 
Database Firewall from Scratch
Database Firewall from ScratchDatabase Firewall from Scratch
Database Firewall from Scratch
 
Biopython: Overview, State of the Art and Outlook
Biopython: Overview, State of the Art and OutlookBiopython: Overview, State of the Art and Outlook
Biopython: Overview, State of the Art and Outlook
 
Terence Barr - jdk7+8 - 24mai2011
Terence Barr - jdk7+8 - 24mai2011Terence Barr - jdk7+8 - 24mai2011
Terence Barr - jdk7+8 - 24mai2011
 
Java
JavaJava
Java
 
2016 bioinformatics i_bio_python_wimvancriekinge
2016 bioinformatics i_bio_python_wimvancriekinge2016 bioinformatics i_bio_python_wimvancriekinge
2016 bioinformatics i_bio_python_wimvancriekinge
 
Fosdem10
Fosdem10Fosdem10
Fosdem10
 
Language Integrated Query - LINQ
Language Integrated Query - LINQLanguage Integrated Query - LINQ
Language Integrated Query - LINQ
 
Direct Code Execution - LinuxCon Japan 2014
Direct Code Execution - LinuxCon Japan 2014Direct Code Execution - LinuxCon Japan 2014
Direct Code Execution - LinuxCon Japan 2014
 
Java user group 2015 02-09-java8
Java user group 2015 02-09-java8Java user group 2015 02-09-java8
Java user group 2015 02-09-java8
 
Java user group 2015 02-09-java8
Java user group 2015 02-09-java8Java user group 2015 02-09-java8
Java user group 2015 02-09-java8
 

Kürzlich hochgeladen

CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):comworks
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsMark Billinghurst
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxLoriGlavin3
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii SoldatenkoFwdays
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsPixlogix Infotech
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr BaganFwdays
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024Stephanie Beckett
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .Alan Dix
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebUiPathCommunity
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfAddepto
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLScyllaDB
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupFlorian Wilhelm
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Mattias Andersson
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxhariprasad279825
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 3652toLead Limited
 
Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clashcharlottematthew16
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsRizwan Syed
 
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfHyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfPrecisely
 

Kürzlich hochgeladen (20)

CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR Systems
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and Cons
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio Web
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdf
 
DMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special EditionDMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special Edition
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQL
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project Setup
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptx
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365
 
Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clash
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL Certs
 
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfHyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
 

SeqinR - biological data handling

  • 1. Pau Corral Montañés RUGBCN 03/05/2012
  • 2. VIDEO: “what we used to do in Bioinformatics“ note: due to memory space limitations, the video has not been embedded into this presentation. Content of the video: A search in NCBI (www.ncbi.nlm.nih.gov) for the entry NC_001477 is done interactively, leading to a single entry in the database related to Dengue Virus, complete genome. Thanks to the facilities that the site provides, a FASTA file for this complete genome is downloaded and thereafter opened with a text editor. The purpose of the video is to show how tedious is accessing the site and downleading an entry, with no less than 5 mouse clicks.
  • 3. Unified records with in-house coding: The 3 databases share all sequences, but they use different A "mirror" of the content of other databases accession numbers (IDs) to refer to each entry. They update every night. ACNUC A series of commands and their arguments were NCBI – ex.: #000134 defined that allow EMBL – ex.: #000012 (1) database opening, (2) query execution, (3) annotation and sequence display, (4) annotation, species and keywords browsing, and (5) sequence extraction DDBJ – ex.: #002221 Other Databases – NCBI - National Centre for Biotechnology Information - (www.ncbi.nlm.nih.gov) – EMBL - European Molecular Biology Laboratory - (www.ebi.ac.uk/embl) – DDBJ - DNA Data Bank of Japan - (www.ddbj.nig.ac.jp) – ACNUC - http://pbil.univ-lyon1.fr/databases/acnuc/acnuc.html
  • 4.
  • 5. ACNUC retrieval programs: i) SeqinR - (http://pbil.univ-lyon1.fr/software/seqinr/) ii) C language API - (http://pbil.univ-lyon1.fr/databases/acnuc/raa_acnuc.html) iii) Python language API - (http://pbil.univ-lyon1.fr/cgi-bin/raapythonhelp.csh) a.I) Query_win – GUI client for remote ACNUC retrieval operations (http://pbil.univ-lyon1.fr/software/query_win.html) a.II) raa_query – same functionality as Query_win in command line interface (http://pbil.univ-lyon1.fr/software/query.html)
  • 6. Why use seqinR: the Wheel - available packages Hotline - discussion forum Automation - same source code for different purposes Reproducibility - tutorials in vignette format Fine tuning - function arguments Usage of the R environment!! #Install the package seqinr_3.0-6 (Apirl 2012) >install.packages(“seqinr“) package ‘seqinr’ was built under R version 2.14.2 A vingette: http://seqinr.r-forge.r-project.org/seqinr_2_0-7.pdf
  • 7. #Choose a mirror >chooseCRANmirror(mirror_name) >install.packages("seqinr") >library(seqinr) #The command lseqinr() lists all what is defined in the package: >lseqinr()[1:9] [1] "a" "aaa" [3] "AAstat" "acnucclose" [5] "acnucopen" "al2bp" [7] "alllistranks" "alr" [9] "amb" >length(lseqinr()) [1] 209
  • 8. How many different ways are there to work with biological sequences using SeqinR? 1)Sequences you have locally: i) read.fasta() and s2c() and c2s() and GC() and count() and translate() ii) write.fasta() iii) read.alignment() and consensus() 2) Sequences you download from a Database: i) browse Databases ii) query() and getSequence()
  • 9. FASTA files example: Example with DNA data: (4 different characters, normally) The FASTA format is very simple and widely used for simple import of biological sequences. It begins with a single-line description starting with a character '>', followed by lines of sequence data of maximum 80 character each. Lines starting with a semi- colon character ';' are comment lines. Example with Protein data: (20 different characters, normally) Check Wikipedia for: i)Sequence representation ii)Sequence identifiers iii)File extensions
  • 10. Read a file with read.fasta() #Read the sequence from a local directory > setwd("H:/Documents and Settings/Pau/Mis documentos/R_test") > dir() [1] "dengue_whole_sequence.fasta" #Use the read.fasta (see next slide) function to load the sequence > read.fasta(file="dengue_whole_sequence.fasta", seqtype="DNA") $`gi|9626685|ref|NC_001477.1|` [1] "a" "g" "t" "t" "g" "t" "t" "a" "g" "t" "c" "t" "a" "c" "g" "t" "g" "g" [19] "a" "c" "c" "g" "a" "c" "a" "a" "g" "a" "a" "c" "a" "g" "t" "t" "t" "c" [37] "g" "a" "a" "t" "c" "g" "g" "a" "a" "g" "c" "t" "t" "g" "c" "t" "t" "a" [55] "a" "c" "g" "t" "a" "g" "t" "t" "c" "t" "a" "a" "c" "a" "g" "t" "t" "t" [............................................................] [10711] "t" "g" "g" "t" "g" "c" "t" "g" "t" "t" "g" "a" "a" "t" "c" "a" "a" "c" [10729] "a" "g" "g" "t" "t" "c" "t" attr(,"name") [1] "gi|9626685|ref|NC_001477.1|" attr(,"Annot") [1] ">gi|9626685|ref|NC_001477.1| Dengue virus 1, complete genome" attr(,"class") [1] "SeqFastadna" > read.fasta(file="dengue_whole_sequence.fasta", seqtype="DNA", as.string=T, set.attributes=F) $`gi|9626685|ref|NC_001477.1|` [1] "agttgttagtctacgtggaccgacaagaacagtttcgaatcggaagcttgcttaacgtag...__truncated__......" > seq <- read.fasta(file="dengue_whole_sequence.fasta", seqtype="DNA", as.string=T, set.attributes=F) > str(seq) List of 1 $ gi|9626685|ref|NC_001477.1|: chr "agttgttagtctacgtggaccgacaagaacagtttcgaatcggaagcttgcttaacgtag...__truncated__......"
  • 11. Usage: rea d.fa s ta (file, seqtype = c("DNA", "AA"), as.string = FALSE, forceDNAtolower = TRUE, set.attributes = TRUE, legacy.mode = TRUE, seqonly = FALSE, strip.desc = FALSE, bfa = FALSE, sizeof.longlong = .Machine$sizeof.longlong, endian = .Platform$endian, apply.mask = TRUE) Arguments: file - path (relative [getwd] is used if absoulte is not given) to FASTA file seqtype - the nature of the sequence: DNA or AA as.string - if TRUE sequences are returned as a string instead of a vector characters forceDNAtolower - lower- or upper-case set.attributes - whether sequence attributes should be set legacy.mode - if TRUE lines starting with a semicolon ’;’ are ignored seqonly - if TRUE, only sequences as returned (execution time is divided approximately by a factor 3) strip.desc - if TRUE, removes the '>' at the beginning bfa - if TRUE the fasta file is in MAQ binary format sizeof.longlong endian - relative to MAQ files apply.mask - relative to MAQ files Value: a list of vector of chars
  • 12. Basic manipulations: #Turn seqeunce into characters and count how many are there > length(s2c(seq[[1]])) [1] 10735 #Count how many different accurrences are there > table(s2c(seq[[1]])) a c g t 3426 2240 2770 2299 #Count the fraction of G and C bases in the sequence > GC(s2c(seq[[1]])) [1] 0.4666977 #Count all possible words in a sequence with a sliding window of size = wordsize > seq_2 <- "actg" > count(s2c(seq_2), wordsize=2) aa ac ag at ca cc cg ct ga gc gg gt ta tc tg tt 0 1 0 0 0 0 0 1 0 0 0 0 0 0 1 0 > count(s2c(seq_2), wordsize=2, by=2) aa ac ag at ca cc cg ct ga gc gg gt ta tc tg tt 0 1 0 0 0 0 0 0 0 0 0 0 0 0 1 0 #Translate into aminoacids the genetic sequence. > c2s(translate(s2c(seq[[1]]), frame=0, sens='F', numcode=1)) [1] "SC*STWTDKNSFESEACLT*F*QFFIREQISDEQPTEKDGSTVFQYAETREKPRVNCFTVGEEILK...-Truncated-... RIAFRPRTHEIGDGFYSIPKISSHTSNSRNFG*MGLIQEEWSDQGDPSQDTTQQRGPTPGEAVPWW*GLEVRGDPPHNNKQHIDAGRD QRSCCLYSIIPGTERQKMEWCC*INRF" #Note: # 1)frame=0 means that first nuc. is taken as the first position in the codon # 2)sense='F' means that the sequence is tranlated in the forward sense # 3)numcode=1 means standard genetic code is used (in SeqinR, up to 23 variations)
  • 13. Write a FASTA file (and how to save time) > dir() [1] "dengue_whole_sequence.fasta" "ER.fasta" > system.time(seq <- read.fasta(file="ER.fasta", seqtype="DNA", as.string=T, set.attributes=F)) user system elapsed 3.28 0.00 3.33 > length(seq) [1] 9984 > seq[[1]] [1] "gtggtatcaacgcagagtacgcggggacgtttatatggacgctcctacaaggaaaccctagcctt..._truncated_...ctcata" > names(seq)[1:5] [1] "ER0101A01F" "ER0101A03F" "ER0101A05F" "ER0101A07F" "ER0101A09F" #Clean the sequences keeping only the ones > 25 nucleotides > char_seq = rapply(seq, s2c, how="list") # create a list turning strings to characters > len_seq = rapply(char_seq, length, how="list") # count how many characters are in each list > bigger_25_list = c() > for (val in 1:length(len_seq)){ + if (len_seq[[val]] >= 25){ + bigger_25_list = c(bigger_25_list, val) + } + } > seq_25 = seq[bigger_25_list] #indexing to get the desired list > length(seq_25) [1] 8928 > seq_25[1] $ER0101A01F [1] "gtggtatcaacgcagagtacgcggggacgtttatatggacgctcctacaaggaaaccctagccttctcatacct...truncated..." #Write a FASTA file (see next slide) > write.fasta(seq_25, names = names(seq_25), file.out="clean_seq.fasta", open="w") > dir() [1] "clean_seq.fasta" "dengue_whole_sequence.fasta" "ER.fasta"
  • 14. Usage: w rite.fa s ta (sequences, names, file.out, open = "w", nbchar = 80) Arguments: sequences - A DNA or protein sequence (in the form of a vector of single characters) or a list of such sequences. names - The name(s) of the sequences. file.out - The name of the output file. open - Open the output file, use "w" to write into a new file, use "a" to append at the end of an already existing file. nbchar - The number of characters per line (default: 60) Value: none in the R space. A FASTA formatted file is created
  • 15. Write a FASTA file (and how to save time) # Remember what we did before > system.time(seq <- read.fasta(file="ER.fasta", seqtype="DNA", as.string=T, set.attributes=F)) user system elapsed 3.28 0.00 3.33 # After cleanig seq, we produced a file "clean_seq.fasta" that we read now > system.time(seq_1 <- read.fasta(file="clean_seq.fasta", seqtype="DNA", as.string=T, set.attributes=F)) user system elapsed 1.30 0.01 1.31 #Time can be saved with the save function: > save(seq_25, file = "ER_CLEAN_seqs.RData") > system.time(load("ER_CLEAN_seqs.RData")) user system elapsed 0.11 0.02 0.12
  • 16. Read an alignment (or create an alignment object to be aligned): #Create an alignment object through reading a FASTA file with two sequences (see next slide) > fasta <- read.alignment(file = system.file("sequences/Anouk.fasta", package = "seqinr"), format ="fasta") > fasta $nb [1] 2 $nam [1] "LmjF01.0030" "LinJ01.0030" $seq $seq[[1]] [1] "atgatgtcggccgagccgccgtcgtcgcagccgtacatcagcgacgtgctgcggcggtaccagc...truncated..." $seq[[2]] [1] "atgatgtcggccgagccgccgtcgtcgcagccgtacatcagcgacgtgctgcggcggtaccagc...truncated..." $com [1] NA attr(,"class") [1] "alignment" # The consensus() function aligns the two sequences, producing a consensus sequences. IUPAC symbology is used. > fixed_align = consensus(fasta, method="IUPAC") > table(fixed_align) fixed_align a c g k m r s t w y 411 636 595 3 5 20 13 293 2 20
  • 17. Usage: rea d.a lig nm ent(file, format, forceToLower = TRUE) Arguments: file - The name of the file which the aligned sequences are to be read from. If it does not contain an absolute or relative path, the file name is relative to the current working directory, getwd. format - A character string specifying the format of the file: mas e, clus tal, phylip, fas ta or ms f forceToLower - A logical defaulting to TRUE stating whether the returned characters in the sequence should be in lower case Value: An object is created of class alignment which is a list with the following components: nb ->the number of aligned sequences nam ->a vector of strings containing the names of the aligned sequences seq ->a vector of strings containing the aligned sequences com ->a vector of strings containing the commentaries for each sequence or NA if there are no comments
  • 18. Access a remote server: > choosebank() [1] "genbank" "embl" "emblwgs" "swissprot" "ensembl" "hogenom" [7] "hogenomdna" "hovergendna" "hovergen" "hogenom5" "hogenom5dna" "hogenom4" [13] "hogenom4dna" "homolens" "homolensdna" "hobacnucl" "hobacprot" "phever2" [19] "phever2dna" "refseq" "greviews" "bacterial" "protozoan" "ensbacteria" [25] "ensprotists" "ensfungi" "ensmetazoa" "ensplants" "mito" "polymorphix" [31] "emglib" "taxobacgen" "refseqViruses" #Access a bank and see complementary information: > choosebank(bank="genbank", infobank=T) > ls() [1] "banknameSocket" > str(banknameSocket) List of 9 $ socket :Classes 'sockconn', 'connection' atomic [1:1] 3 .. ..- attr(*, "conn_id")=<externalptr> $ bankname: chr "genbank" $ banktype: chr "GENBANK" $ totseqs : num 1.65e+08 $ totspecs: num 968157 $ totkeys : num 3.1e+07 $ release : chr " GenBank Rel. 189 (15 April 2012) Last Updated: May 1, 2012" $ status :Class 'AsIs' chr "on" $ details : chr [1:4] " **** ACNUC Data Base Content **** " " GenBank Rel. 189 (15 April 2012) Last Updated: May 1, 2012" "139,677,722,280 bases; 152,280,170 sequences; 12,313,982 subseqs; 684,079 refers." "Software by M. Gouy, Lab. Biometrie et Biologie Evolutive, Universite Lyon I " > banknameSocket$details [1] " **** ACNUC Data Base Content **** " [2] " GenBank Rel. 189 (15 April 2012) Last Updated: May 1, 2012" [3] "139,677,722,280 bases; 152,280,170 sequences; 12,313,982 subseqs; 684,079 refers." [4] "Software by M. Gouy, Lab. Biometrie et Biologie Evolutive, Universite Lyon I "
  • 19. Make a query: # query (see next slide) all sequences that contain the words "virus" and "dengue" in the taxonomy field and # that are not partial sequences > system.time(query("All_Dengue_viruses_NOTpartial", ""sp=@virus@" AND "sp=@dengue@" AND NOT "k=partial"")) user system elapsed 0.78 0.00 6.72 > All_Dengue_viruses_NOTpartial[1:4] $call query(listname = "All_Dengue_viruses_NOTpartial", query = ""sp=@virus@" AND "sp=@dengue@" AND NOT "k=partial"") $name [1] "All_Dengue_viruses_NOTpartial" $nelem [1] 7741 $typelist [1] "SQ" > All_Dengue_viruses_NOTpartial[[5]][[1]] name length frame ncbicg "A13666" "456" "0" "1" > myseq = getSequence(All_Dengue_viruses_NOTpartial[[5]][[1]]) > myseq[1:20] [1] "a" "t" "g" "g" "c" "c" "a" "t" "g" "g" "a" "c" "c" "t" "t" "g" "g" "t" "g" "a" > closebank()
  • 20. Usage: query(listname, query, socket = autosocket(), invisible = T, verbose = F, virtual = F) Arguments: listname - The name of the list as a quoted string of chars query - A quoted string of chars containing the request with the syntax given in the details section socket - An object of class sockconn connecting to a remote ACNUC database (default is a socket to the last opened database). invisible - if FALSE, the result is returned visibly. verbose - if TRUE, verbose mode is on virtual - if TRUE, no attempt is made to retrieve the information about all the elements of the list. In this case, the req component of the list is set to NA. Value: The result is a list with the following 6 components: call - the original call name - the ACNUC list name nelem - the number of elements (for instance sequences) in the ACNUC list typelist - the type of the elements of the list. Could be SQ for a list of sequence names, KW for a list of keywords, SP for a list of species names. req - a list of sequence names that fit the required criteria or NA when called with parameter virtual is TRUE socket - the socket connection that was used
  • 21. Pau Corral Montañés RUGBCN 03/05/2012