Alternative Approaches to Managing and Integrating
Bioinformatics Data
GBCB Seminar
October 9, 2014
Dan Sullivan
Cyberinfrastructure Division
• Bioinformatics and Relational Database Management Systems (RDBMS)
• Use Cases – Text Mining and Atherosclerosis
• Bioinformatics and NoSQL Databases
• How to Choose a Database for Your Project
• Closing Comments
Relational Database – “a database that [explicitly] stores information about both the data and how it is related.”
(Source: http://en.wikipedia.org/wiki/Relational_database)
NoSQL Database – “[a] database [that] provides a
mechanism for storage and retrieval of data that is
modeled in means other than the tabular relations used
in relational databases.”
(Source: http://en.wikipedia.org/wiki/NoSQL)
Volume of data
Variety of data
Integration of data
• Pragmatic
• Widely applicable
• Many options
• Modeling
• Reduce risk of data anomalies
• Separate logical and physical models
The key,
The whole key, and
Nothing but the key.
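The mnemonic above summarizes third normal form: each non-key attribute depends on the key, the whole key, and nothing but the key. A minimal sketch using Python's built-in sqlite3 module follows. The `subject` and `sample` tables and their columns are hypothetical illustrations, not the schema used in the projects described in this talk.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when enabled

# Subject attributes depend only on subject_id; sample attributes only on sample_id.
conn.executescript("""
CREATE TABLE subject (
    subject_id TEXT PRIMARY KEY,
    sex        TEXT NOT NULL
);
CREATE TABLE sample (
    sample_id  INTEGER PRIMARY KEY,
    subject_id TEXT NOT NULL REFERENCES subject(subject_id),
    type       TEXT NOT NULL
);
""")

conn.execute("INSERT INTO subject VALUES (?, ?)", ("F8273", "M"))
conn.executemany(
    "INSERT INTO sample (subject_id, type) VALUES (?, ?)",
    [("F8273", "Thoracic Aorta"), ("F8273", "Abdominal Aorta")],
)

# The subject's sex is stored once; a join reassembles the combined view.
rows = conn.execute(
    "SELECT s.subject_id, s.sex, m.type "
    "FROM subject s JOIN sample m ON m.subject_id = s.subject_id "
    "ORDER BY m.sample_id"
).fetchall()

# A sample pointing at an unknown subject is rejected (referential integrity).
try:
    conn.execute(
        "INSERT INTO sample (subject_id, type) VALUES (?, ?)", ("X0000", "LAD")
    )
    orphan_rejected = False
except sqlite3.IntegrityError:
    orphan_rejected = True
```

Storing each fact once is what reduces the risk of update anomalies. The cost is the join needed to read the data back together.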
Implementation bottlenecks
Data Modeler vs. Developer
Scaling-up vs.
scaling-out
Frequent need for
denormalization
Use Cases – Text Mining and Atherosclerosis
Text Mining
Storing Text
Caching Word Vectors
Extracted Features
Experiment Results
Atherosclerosis Research
Demographics
Sample Tracking
Genomic data
Sequence Variants
Mass Spec Results
Early 1950s: Korean War autopsies
1985-1998: Pathobiological Determinants of Atherosclerosis in Youth (PDAY) study
2012-2016: Genomic and Proteomic Architecture of Atherosclerosis (GPAA)
“… tell your
children not to do
what I have done …”
House of the Rising Sun
American Folk Song
Started with
MySQL
Could have stayed with
relational model, but:
Requirements change
New data sets
Unknown data structures
Increasingly complex
normalized model
Bioinformatics and NoSQL Databases
Scalability
Cost
Availability
Consistency
Flexibility
• Key Value Databases
• Document Databases
• Wide Column Stores
• Graph Databases
• Search Engines
Features
Simple primitive data
structure
No predefined schema
Limited query capabilities
Dictionary-like
functionality at large scale
key1    value1
key2    value2
key3    value3
Bioinformatics Use Case
Word vectors in text
mining
Caching
Limitations
Key lookup only, no
generalized query
Small number of
attributes per entity
>>> import redis
>>> # decode_responses=True returns str instead of bytes under Python 3
>>> r_server = redis.Redis("localhost", decode_responses=True)
>>> r_server.set("sample:123:type", "Aorta")
True
>>> r_server.get("sample:123:type")
'Aorta'
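The caching use case can be sketched without a running Redis server. In the example below, an in-memory dict stands in for the key-value store. The `KeyValueCache` class, the `doc:<id>:word_vector` key naming, and the toy `count_words` function are all assumptions made for illustration.

```python
import json


class KeyValueCache:
    """In-memory stand-in for a key-value store: lookup by exact key only."""

    def __init__(self):
        self._store = {}

    def set(self, key, value):
        # Values are opaque to the store, so structured data is serialized first.
        self._store[key] = json.dumps(value)

    def get(self, key):
        raw = self._store.get(key)
        return None if raw is None else json.loads(raw)


def cached_word_vector(cache, doc_id, compute):
    """Return the word vector for doc_id, computing it only on a cache miss."""
    key = f"doc:{doc_id}:word_vector"  # hypothetical naming convention
    vector = cache.get(key)
    if vector is None:
        vector = compute(doc_id)
        cache.set(key, vector)
    return vector


calls = []


def count_words(doc_id):
    # Toy "expensive" computation: records each call so cache hits are visible.
    calls.append(doc_id)
    return {"aorta": 2, "lesion": 1}


cache = KeyValueCache()
first = cached_word_vector(cache, 42, count_words)
second = cached_word_vector(cache, 42, count_words)
```

The second call is served from the cache, so `count_words` runs only once. The store supports no query other than lookup by the exact key, which is the limitation noted above.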
Features
• JSON/XML structures
• Fields vary between docs
• No predefined schema
• Documents analogous to rows
• Collections analogous to tables
• Query capabilities
Bioinformatics Use Case
Text mining
Atherosclerosis
Limitations
No joins
No referential integrity
checks
Object-based query language
{
id : <value>,
<key> : <value>,
<key> : <embedded document>,
<key> : <array>
}
{
subject_id : "F8273",
age : "26",
sex : "M",
date_of_death : "12-Jan-1995",
glycohemoglobin : "10%",
BMI : 22,
samples : [ {type: "Thoracic Aorta", AHA_score: 1},
            {type: "Abdominal Aorta", AHA_score: 2},
            {type: "LAD", AHA_score: 5} ],
sequence : {seq_file: "F8273_08152014.bam",
            variant_file: "F8273_08152014.vcf"}
}
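To illustrate how documents with varying fields can still be queried, here is a small sketch that uses plain Python dicts as stand-ins for documents in a collection. The second subject (`H0001`), the `smoker` field, the `find` helper, and the score threshold of 4 are hypothetical. A real document database would express the same filter in its own query language.

```python
subjects = [
    {
        "subject_id": "F8273",
        "sex": "M",
        "samples": [
            {"type": "Thoracic Aorta", "AHA_score": 1},
            {"type": "LAD", "AHA_score": 5},
        ],
        "sequence": {"variant_file": "F8273_08152014.vcf"},
    },
    # Hypothetical second subject: no sequencing data, an extra field instead.
    {
        "subject_id": "H0001",
        "sex": "F",
        "smoker": True,
        "samples": [{"type": "Abdominal Aorta", "AHA_score": 2}],
    },
]


def find(collection, predicate):
    """Return the documents for which predicate(doc) is true."""
    return [doc for doc in collection if predicate(doc)]


def max_aha_score(doc):
    # Missing fields are treated as "no data" rather than as an error.
    return max((s.get("AHA_score", 0) for s in doc.get("samples", [])), default=0)


advanced = find(subjects, lambda d: max_aha_score(d) >= 4)
sequenced = find(subjects, lambda d: "sequence" in d)
advanced_ids = [d["subject_id"] for d in advanced]
sequenced_ids = [d["subject_id"] for d in sequenced]
```

Because no schema is enforced, the query code has to tolerate absent fields. That is the price of the flexibility.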
Features
Groups attributes into
column families
Column families store key-
value pairs
Implemented as sparse
multi-dimensional arrays
Denormalized
10^4 to 10^6 columns; 10^9 rows
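One way to picture the sparse multi-dimensional array is as nested maps: row key, then column family, then column name. The sketch below assumes a hypothetical `WideColumnTable` class and made-up variant columns. It is a toy model of the data layout, not of how any particular wide column store is implemented.

```python
from collections import defaultdict


class WideColumnTable:
    """Sparse table: row key -> column family -> column name -> value."""

    def __init__(self, families):
        self.families = set(families)
        self._rows = defaultdict(lambda: defaultdict(dict))

    def put(self, row_key, family, column, value):
        # Column families are fixed up front; columns within them are not.
        if family not in self.families:
            raise KeyError(f"unknown column family: {family}")
        self._rows[row_key][family][column] = value

    def get(self, row_key, family, column, default=None):
        # Absent cells take no storage; reading one just returns the default.
        # .get() is used at every level so reads never create empty rows.
        return self._rows.get(row_key, {}).get(family, {}).get(column, default)

    def cell_count(self):
        return sum(
            len(cols) for fams in self._rows.values() for cols in fams.values()
        )


table = WideColumnTable(families=["demographics", "variants"])
table.put("F8273", "demographics", "sex", "M")
table.put("F8273", "variants", "chr10:90973326", "A>G")  # hypothetical variant
table.put("H0001", "variants", "chr1:12345", "C>T")      # hypothetical variant
```

Each row may carry a different set of columns. Only the three cells actually written are stored, however many distinct column names exist across rows.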
Bioinformatics Use Case
• Large studies
• Many experiments & data types
• Simulations
Limitations
• Operationally challenging
• Suitable for large number of servers
Features
Highly normalized
Graph-based query language (Gremlin)
SQL-inspired query language (Cypher)
Support for path finding and recursion
Bioinformatics Use Case
Epidemiology simulations
Interaction networks
Limitations
Less suited for tabular data
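Path finding over an interaction network can be sketched in plain Python with a breadth-first search. The network below is hypothetical (node names are placeholders). A graph database would typically expose this kind of traversal through its query language instead of hand-written code.

```python
from collections import deque

# Hypothetical undirected interaction network (names are placeholders).
interactions = {
    "A": {"B", "C"},
    "B": {"A", "D"},
    "C": {"A", "D"},
    "D": {"B", "C", "E"},
    "E": {"D"},
    "F": set(),  # isolated node
}


def shortest_path(graph, start, goal):
    """Breadth-first search; returns one shortest path as a list, or None."""
    queue = deque([[start]])
    visited = {start}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == goal:
            return path
        for neighbor in sorted(graph.get(node, ())):  # sorted for determinism
            if neighbor not in visited:
                visited.add(neighbor)
                queue.append(path + [neighbor])
    return None


path = shortest_path(interactions, "A", "E")
unreachable = shortest_path(interactions, "A", "F")
```

In a relational model the same question requires a recursive query or a self-join per hop. That is why connected data is a common reason to consider a graph database.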
How to Choose a Database for Your Project
Relational:
Requirements known at start
of project
Entities described by common
attributes
Compliance and audit issues
Need normalization
Acceptable performance on
small number of servers
Need server side joins

Key value:
Caching
Few attributes
Document databases:
Varying attributes
Integrate diverse data
types
Use denormalized
data
key1    value1
key2    value2
key3    value3
{
id : <value>,
<key> : <value>,
<key> : <embedded document>,
<key> : <array>
}
Wide column data stores:
• Extremely large volumes of data
• High availability
Graph Databases:
• Connected data
• Need path finding and recursive queries
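The selection criteria above can be read as a simple rule table. The sketch below encodes a subset of them. The flag names and the `suggest_databases` function are hypothetical, and the "any one criterion matches" rule is an assumption made to keep the example short, not a recommendation from the talk.

```python
def suggest_databases(needs):
    """Map a set of requirement flags to candidate database types.

    The flag names are hypothetical labels for the criteria listed above.
    """
    rules = [
        ("relational", {"known_requirements", "server_side_joins", "compliance_audit"}),
        ("key-value", {"caching", "few_attributes"}),
        ("document", {"varying_attributes", "diverse_data_types"}),
        ("wide-column", {"extreme_volume", "high_availability"}),
        ("graph", {"connected_data", "path_finding"}),
    ]
    # A type is suggested when at least one of its criteria is requested.
    return [name for name, criteria in rules if criteria & set(needs)]


text_mining = suggest_databases({"caching", "varying_attributes"})
networks = suggest_databases({"path_finding"})
```

The function returns a list because a project's requirements can point to more than one type of database.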
Multiple types of databases
NoSQL complements relational models
Research question drives selection
Balance benefits and limitations
May use multiple types of databases in a
single project
NoSQL databases are improving rapidly,
gaining additional functionality
*Dr. Rebecca Wattam,
Advisor
*Becky Will, GPAA VT PI
*Chengdong Zhang, DBA & SE
*Cyberinfrastructure Division
*GPAA Collaborators
Limits of RDBMS and Need for NoSQL in Bioinformatics

Weitere ähnliche Inhalte

Andere mochten auch

Key-Value Pairs
Key-Value PairsKey-Value Pairs
Key-Value Pairslittledata
 
SQL or NoSQL - how to choose
SQL or NoSQL - how to chooseSQL or NoSQL - how to choose
SQL or NoSQL - how to chooseLars Thorup
 
Nosql part1 8th December
Nosql part1 8th December Nosql part1 8th December
Nosql part1 8th December Ruru Chowdhury
 
Hadoop World 2011: Hadoop vs. RDBMS for Big Data Analytics...Why Choose?
Hadoop World 2011: Hadoop vs. RDBMS for Big Data Analytics...Why Choose?Hadoop World 2011: Hadoop vs. RDBMS for Big Data Analytics...Why Choose?
Hadoop World 2011: Hadoop vs. RDBMS for Big Data Analytics...Why Choose?Cloudera, Inc.
 
TAMBIS: Transparent Access to Multiple Bioinformatics Information SourcesTambis
TAMBIS: Transparent Access to Multiple Bioinformatics Information SourcesTambisTAMBIS: Transparent Access to Multiple Bioinformatics Information SourcesTambis
TAMBIS: Transparent Access to Multiple Bioinformatics Information SourcesTambisrobertstevens65
 
NoSQL-Database-Concepts
NoSQL-Database-ConceptsNoSQL-Database-Concepts
NoSQL-Database-ConceptsBhaskar Gunda
 
Genomics in Public Health
Genomics in Public HealthGenomics in Public Health
Genomics in Public HealthJennifer Gardy
 
An Introduction to "Bioinformatics & Internet"
An Introduction to "Bioinformatics & Internet"An Introduction to "Bioinformatics & Internet"
An Introduction to "Bioinformatics & Internet"Asar Khan
 
Solving The N+1 Problem In Personalized Genomics
Solving The N+1 Problem In Personalized GenomicsSolving The N+1 Problem In Personalized Genomics
Solving The N+1 Problem In Personalized GenomicsSpark Summit
 
Processing 70Tb Of Genomics Data With ADAM And Toil
Processing 70Tb Of Genomics Data With ADAM And ToilProcessing 70Tb Of Genomics Data With ADAM And Toil
Processing 70Tb Of Genomics Data With ADAM And ToilSpark Summit
 
7. Key-Value Databases: In Depth
7. Key-Value Databases: In Depth7. Key-Value Databases: In Depth
7. Key-Value Databases: In DepthFabio Fumarola
 
Sql vs NoSQL
Sql vs NoSQLSql vs NoSQL
Sql vs NoSQLRTigger
 
Introduction to NoSQL Databases
Introduction to NoSQL DatabasesIntroduction to NoSQL Databases
Introduction to NoSQL DatabasesDerek Stainer
 
A Beginners Guide to noSQL
A Beginners Guide to noSQLA Beginners Guide to noSQL
A Beginners Guide to noSQLMike Crabb
 
Graph database Use Cases
Graph database Use CasesGraph database Use Cases
Graph database Use CasesMax De Marzi
 
Lightning fast genomics with Spark, Adam and Scala
Lightning fast genomics with Spark, Adam and ScalaLightning fast genomics with Spark, Adam and Scala
Lightning fast genomics with Spark, Adam and ScalaAndy Petrella
 

Andere mochten auch (19)

Key-Value Pairs
Key-Value PairsKey-Value Pairs
Key-Value Pairs
 
SQL or NoSQL - how to choose
SQL or NoSQL - how to chooseSQL or NoSQL - how to choose
SQL or NoSQL - how to choose
 
Know what is NOSQL
Know what is NOSQL Know what is NOSQL
Know what is NOSQL
 
Nosql part1 8th December
Nosql part1 8th December Nosql part1 8th December
Nosql part1 8th December
 
SQL & NoSQL
SQL & NoSQLSQL & NoSQL
SQL & NoSQL
 
Hadoop World 2011: Hadoop vs. RDBMS for Big Data Analytics...Why Choose?
Hadoop World 2011: Hadoop vs. RDBMS for Big Data Analytics...Why Choose?Hadoop World 2011: Hadoop vs. RDBMS for Big Data Analytics...Why Choose?
Hadoop World 2011: Hadoop vs. RDBMS for Big Data Analytics...Why Choose?
 
TAMBIS: Transparent Access to Multiple Bioinformatics Information SourcesTambis
TAMBIS: Transparent Access to Multiple Bioinformatics Information SourcesTambisTAMBIS: Transparent Access to Multiple Bioinformatics Information SourcesTambis
TAMBIS: Transparent Access to Multiple Bioinformatics Information SourcesTambis
 
NoSQL-Database-Concepts
NoSQL-Database-ConceptsNoSQL-Database-Concepts
NoSQL-Database-Concepts
 
Genomics in Public Health
Genomics in Public HealthGenomics in Public Health
Genomics in Public Health
 
An Introduction to "Bioinformatics & Internet"
An Introduction to "Bioinformatics & Internet"An Introduction to "Bioinformatics & Internet"
An Introduction to "Bioinformatics & Internet"
 
Solving The N+1 Problem In Personalized Genomics
Solving The N+1 Problem In Personalized GenomicsSolving The N+1 Problem In Personalized Genomics
Solving The N+1 Problem In Personalized Genomics
 
Processing 70Tb Of Genomics Data With ADAM And Toil
Processing 70Tb Of Genomics Data With ADAM And ToilProcessing 70Tb Of Genomics Data With ADAM And Toil
Processing 70Tb Of Genomics Data With ADAM And Toil
 
7. Key-Value Databases: In Depth
7. Key-Value Databases: In Depth7. Key-Value Databases: In Depth
7. Key-Value Databases: In Depth
 
Biological databases
Biological databasesBiological databases
Biological databases
 
Sql vs NoSQL
Sql vs NoSQLSql vs NoSQL
Sql vs NoSQL
 
Introduction to NoSQL Databases
Introduction to NoSQL DatabasesIntroduction to NoSQL Databases
Introduction to NoSQL Databases
 
A Beginners Guide to noSQL
A Beginners Guide to noSQLA Beginners Guide to noSQL
A Beginners Guide to noSQL
 
Graph database Use Cases
Graph database Use CasesGraph database Use Cases
Graph database Use Cases
 
Lightning fast genomics with Spark, Adam and Scala
Lightning fast genomics with Spark, Adam and ScalaLightning fast genomics with Spark, Adam and Scala
Lightning fast genomics with Spark, Adam and Scala
 

Ähnlich wie Limits of RDBMS and Need for NoSQL in Bioinformatics

2.Introduction to NOSQL (Core concepts).pptx
2.Introduction to NOSQL (Core concepts).pptx2.Introduction to NOSQL (Core concepts).pptx
2.Introduction to NOSQL (Core concepts).pptxRushikeshChikane2
 
Presentation On NoSQL Databases
Presentation On NoSQL DatabasesPresentation On NoSQL Databases
Presentation On NoSQL DatabasesAbiral Gautam
 
data base system to new data science lerne
data base system to new data science lernedata base system to new data science lerne
data base system to new data science lernetarunprajapati0t
 
Softwae and database in data communication network
Softwae and database in data communication networkSoftwae and database in data communication network
Softwae and database in data communication networkAyoubSohiabMohammad
 
NoSQL Databases Introduction - UTN 2013
NoSQL Databases Introduction - UTN 2013NoSQL Databases Introduction - UTN 2013
NoSQL Databases Introduction - UTN 2013Facundo Farias
 
NoSQL Data Stores in Research and Practice - ICDE 2016 Tutorial - Extended Ve...
NoSQL Data Stores in Research and Practice - ICDE 2016 Tutorial - Extended Ve...NoSQL Data Stores in Research and Practice - ICDE 2016 Tutorial - Extended Ve...
NoSQL Data Stores in Research and Practice - ICDE 2016 Tutorial - Extended Ve...Felix Gessert
 
EVALUATING CASSANDRA, MONGO DB LIKE NOSQL DATASETS USING HADOOP STREAMING
EVALUATING CASSANDRA, MONGO DB LIKE NOSQL DATASETS USING HADOOP STREAMINGEVALUATING CASSANDRA, MONGO DB LIKE NOSQL DATASETS USING HADOOP STREAMING
EVALUATING CASSANDRA, MONGO DB LIKE NOSQL DATASETS USING HADOOP STREAMINGijiert bestjournal
 
Comparative study of no sql document, column store databases and evaluation o...
Comparative study of no sql document, column store databases and evaluation o...Comparative study of no sql document, column store databases and evaluation o...
Comparative study of no sql document, column store databases and evaluation o...ijdms
 
A Seminar on NoSQL Databases.
A Seminar on NoSQL Databases.A Seminar on NoSQL Databases.
A Seminar on NoSQL Databases.Navdeep Charan
 
CBSE XII Database Concepts And MySQL Presentation
CBSE XII Database Concepts And MySQL PresentationCBSE XII Database Concepts And MySQL Presentation
CBSE XII Database Concepts And MySQL PresentationGuru Ji
 
No sql – rise of the clusters
No sql – rise of the clustersNo sql – rise of the clusters
No sql – rise of the clustersresponseteam
 
3.Implementation with NOSQL databases Document Databases (Mongodb).pptx
3.Implementation with NOSQL databases Document Databases (Mongodb).pptx3.Implementation with NOSQL databases Document Databases (Mongodb).pptx
3.Implementation with NOSQL databases Document Databases (Mongodb).pptxRushikeshChikane2
 

Ähnlich wie Limits of RDBMS and Need for NoSQL in Bioinformatics (20)

2.Introduction to NOSQL (Core concepts).pptx
2.Introduction to NOSQL (Core concepts).pptx2.Introduction to NOSQL (Core concepts).pptx
2.Introduction to NOSQL (Core concepts).pptx
 
Presentation On NoSQL Databases
Presentation On NoSQL DatabasesPresentation On NoSQL Databases
Presentation On NoSQL Databases
 
data base system to new data science lerne
data base system to new data science lernedata base system to new data science lerne
data base system to new data science lerne
 
Softwae and database in data communication network
Softwae and database in data communication networkSoftwae and database in data communication network
Softwae and database in data communication network
 
Nosql
NosqlNosql
Nosql
 
NoSQL Databases Introduction - UTN 2013
NoSQL Databases Introduction - UTN 2013NoSQL Databases Introduction - UTN 2013
NoSQL Databases Introduction - UTN 2013
 
Database Systems Concepts, 5th Ed
Database Systems Concepts, 5th EdDatabase Systems Concepts, 5th Ed
Database Systems Concepts, 5th Ed
 
NoSQL Basics and MongDB
NoSQL Basics and  MongDBNoSQL Basics and  MongDB
NoSQL Basics and MongDB
 
NoSQL Data Stores in Research and Practice - ICDE 2016 Tutorial - Extended Ve...
NoSQL Data Stores in Research and Practice - ICDE 2016 Tutorial - Extended Ve...NoSQL Data Stores in Research and Practice - ICDE 2016 Tutorial - Extended Ve...
NoSQL Data Stores in Research and Practice - ICDE 2016 Tutorial - Extended Ve...
 
EVALUATING CASSANDRA, MONGO DB LIKE NOSQL DATASETS USING HADOOP STREAMING
EVALUATING CASSANDRA, MONGO DB LIKE NOSQL DATASETS USING HADOOP STREAMINGEVALUATING CASSANDRA, MONGO DB LIKE NOSQL DATASETS USING HADOOP STREAMING
EVALUATING CASSANDRA, MONGO DB LIKE NOSQL DATASETS USING HADOOP STREAMING
 
Nosql
NosqlNosql
Nosql
 
Unit-10.pptx
Unit-10.pptxUnit-10.pptx
Unit-10.pptx
 
Unit01 dbms
Unit01 dbmsUnit01 dbms
Unit01 dbms
 
Comparative study of no sql document, column store databases and evaluation o...
Comparative study of no sql document, column store databases and evaluation o...Comparative study of no sql document, column store databases and evaluation o...
Comparative study of no sql document, column store databases and evaluation o...
 
RDBMS to NoSQL. An overview.
RDBMS to NoSQL. An overview.RDBMS to NoSQL. An overview.
RDBMS to NoSQL. An overview.
 
A Seminar on NoSQL Databases.
A Seminar on NoSQL Databases.A Seminar on NoSQL Databases.
A Seminar on NoSQL Databases.
 
CBSE XII Database Concepts And MySQL Presentation
CBSE XII Database Concepts And MySQL PresentationCBSE XII Database Concepts And MySQL Presentation
CBSE XII Database Concepts And MySQL Presentation
 
No sql – rise of the clusters
No sql – rise of the clustersNo sql – rise of the clusters
No sql – rise of the clusters
 
3.Implementation with NOSQL databases Document Databases (Mongodb).pptx
3.Implementation with NOSQL databases Document Databases (Mongodb).pptx3.Implementation with NOSQL databases Document Databases (Mongodb).pptx
3.Implementation with NOSQL databases Document Databases (Mongodb).pptx
 
No sq lv2
No sq lv2No sq lv2
No sq lv2
 

Mehr von Dan Sullivan, Ph.D.

How to Design a Modern Data Warehouse in BigQuery
How to Design a Modern Data Warehouse in BigQueryHow to Design a Modern Data Warehouse in BigQuery
How to Design a Modern Data Warehouse in BigQueryDan Sullivan, Ph.D.
 
With Automated ML, is Everyone an ML Engineer?
With Automated ML, is Everyone an ML Engineer?With Automated ML, is Everyone an ML Engineer?
With Automated ML, is Everyone an ML Engineer?Dan Sullivan, Ph.D.
 
Getting Started with BigQuery ML
Getting Started with BigQuery MLGetting Started with BigQuery ML
Getting Started with BigQuery MLDan Sullivan, Ph.D.
 
Google Cloud Certifications & Machine Learning
Google Cloud Certifications & Machine LearningGoogle Cloud Certifications & Machine Learning
Google Cloud Certifications & Machine LearningDan Sullivan, Ph.D.
 
Unstructured text to structured data
Unstructured text to structured dataUnstructured text to structured data
Unstructured text to structured dataDan Sullivan, Ph.D.
 
A first look at tf idf-pdx data science meetup
A first look at tf idf-pdx data science meetupA first look at tf idf-pdx data science meetup
A first look at tf idf-pdx data science meetupDan Sullivan, Ph.D.
 
ACID vs BASE in NoSQL: Another False Dichotomy
ACID vs BASE in NoSQL: Another False DichotomyACID vs BASE in NoSQL: Another False Dichotomy
ACID vs BASE in NoSQL: Another False DichotomyDan Sullivan, Ph.D.
 
Big data, bioscience and the cloud biocatalyst june 2015 sullivan
Big data, bioscience and the cloud   biocatalyst june 2015 sullivanBig data, bioscience and the cloud   biocatalyst june 2015 sullivan
Big data, bioscience and the cloud biocatalyst june 2015 sullivanDan Sullivan, Ph.D.
 
Tools and Techniques for Analyzing Texts: Tweets to Intellectual Property
Tools and Techniques for Analyzing Texts: Tweets to Intellectual PropertyTools and Techniques for Analyzing Texts: Tweets to Intellectual Property
Tools and Techniques for Analyzing Texts: Tweets to Intellectual PropertyDan Sullivan, Ph.D.
 
Modeling with Document Database: 5 Key Patterns
Modeling with Document Database: 5 Key PatternsModeling with Document Database: 5 Key Patterns
Modeling with Document Database: 5 Key PatternsDan Sullivan, Ph.D.
 
Sullivan GBCB Seminar Fall 2014 - Limits of RDMS for Bioinformatics v2
Sullivan GBCB Seminar Fall 2014 - Limits of RDMS for Bioinformatics v2Sullivan GBCB Seminar Fall 2014 - Limits of RDMS for Bioinformatics v2
Sullivan GBCB Seminar Fall 2014 - Limits of RDMS for Bioinformatics v2Dan Sullivan, Ph.D.
 
Text Mining for Biocuration of Bacterial Infectious Diseases
Text Mining for Biocuration of Bacterial Infectious DiseasesText Mining for Biocuration of Bacterial Infectious Diseases
Text Mining for Biocuration of Bacterial Infectious DiseasesDan Sullivan, Ph.D.
 

Mehr von Dan Sullivan, Ph.D. (13)

How to Design a Modern Data Warehouse in BigQuery
How to Design a Modern Data Warehouse in BigQueryHow to Design a Modern Data Warehouse in BigQuery
How to Design a Modern Data Warehouse in BigQuery
 
With Automated ML, is Everyone an ML Engineer?
With Automated ML, is Everyone an ML Engineer?With Automated ML, is Everyone an ML Engineer?
With Automated ML, is Everyone an ML Engineer?
 
Getting Started with BigQuery ML
Getting Started with BigQuery MLGetting Started with BigQuery ML
Getting Started with BigQuery ML
 
Google Cloud Certifications & Machine Learning
Google Cloud Certifications & Machine LearningGoogle Cloud Certifications & Machine Learning
Google Cloud Certifications & Machine Learning
 
Unstructured text to structured data
Unstructured text to structured dataUnstructured text to structured data
Unstructured text to structured data
 
A first look at tf idf-pdx data science meetup
A first look at tf idf-pdx data science meetupA first look at tf idf-pdx data science meetup
A first look at tf idf-pdx data science meetup
 
Text mining meets neural nets
Text mining meets neural netsText mining meets neural nets
Text mining meets neural nets
 
ACID vs BASE in NoSQL: Another False Dichotomy
ACID vs BASE in NoSQL: Another False DichotomyACID vs BASE in NoSQL: Another False Dichotomy
ACID vs BASE in NoSQL: Another False Dichotomy
 
Big data, bioscience and the cloud biocatalyst june 2015 sullivan
Big data, bioscience and the cloud   biocatalyst june 2015 sullivanBig data, bioscience and the cloud   biocatalyst june 2015 sullivan
Big data, bioscience and the cloud biocatalyst june 2015 sullivan
 
Tools and Techniques for Analyzing Texts: Tweets to Intellectual Property
Tools and Techniques for Analyzing Texts: Tweets to Intellectual PropertyTools and Techniques for Analyzing Texts: Tweets to Intellectual Property
Tools and Techniques for Analyzing Texts: Tweets to Intellectual Property
 
Modeling with Document Database: 5 Key Patterns
Modeling with Document Database: 5 Key PatternsModeling with Document Database: 5 Key Patterns
Modeling with Document Database: 5 Key Patterns
 
Sullivan GBCB Seminar Fall 2014 - Limits of RDMS for Bioinformatics v2
Sullivan GBCB Seminar Fall 2014 - Limits of RDMS for Bioinformatics v2Sullivan GBCB Seminar Fall 2014 - Limits of RDMS for Bioinformatics v2
Sullivan GBCB Seminar Fall 2014 - Limits of RDMS for Bioinformatics v2
 
Text Mining for Biocuration of Bacterial Infectious Diseases
Text Mining for Biocuration of Bacterial Infectious DiseasesText Mining for Biocuration of Bacterial Infectious Diseases
Text Mining for Biocuration of Bacterial Infectious Diseases
 

Kürzlich hochgeladen

办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一F sss
 
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...soniya singh
 
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...Boston Institute of Analytics
 
Identifying Appropriate Test Statistics Involving Population Mean
Identifying Appropriate Test Statistics Involving Population MeanIdentifying Appropriate Test Statistics Involving Population Mean
Identifying Appropriate Test Statistics Involving Population MeanMYRABACSAFRA2
 
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...dajasot375
 
2006_GasProcessing_HB (1).pdf HYDROCARBON PROCESSING
2006_GasProcessing_HB (1).pdf HYDROCARBON PROCESSING2006_GasProcessing_HB (1).pdf HYDROCARBON PROCESSING
2006_GasProcessing_HB (1).pdf HYDROCARBON PROCESSINGmarianagonzalez07
 
Heart Disease Classification Report: A Data Analysis Project
Heart Disease Classification Report: A Data Analysis ProjectHeart Disease Classification Report: A Data Analysis Project
Heart Disease Classification Report: A Data Analysis ProjectBoston Institute of Analytics
 
Defining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data StoryDefining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data StoryJeremy Anderson
 
Easter Eggs From Star Wars and in cars 1 and 2
Easter Eggs From Star Wars and in cars 1 and 2Easter Eggs From Star Wars and in cars 1 and 2
Easter Eggs From Star Wars and in cars 1 and 217djon017
 
RABBIT: A CLI tool for identifying bots based on their GitHub events.
RABBIT: A CLI tool for identifying bots based on their GitHub events.RABBIT: A CLI tool for identifying bots based on their GitHub events.
RABBIT: A CLI tool for identifying bots based on their GitHub events.natarajan8993
 
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...limedy534
 
GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]📊 Markus Baersch
 
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degreeyuu sss
 
Multiple time frame trading analysis -brianshannon.pdf
Multiple time frame trading analysis -brianshannon.pdfMultiple time frame trading analysis -brianshannon.pdf
Multiple time frame trading analysis -brianshannon.pdfchwongval
 
Customer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptxCustomer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptxEmmanuel Dauda
 
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档208367051
 
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)jennyeacort
 
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default  Presentation : Data Analysis Project PPTPredictive Analysis for Loan Default  Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPTBoston Institute of Analytics
 
ASML's Taxonomy Adventure by Daniel Canter

Limits of RDBMS and Need for NoSQL in Bioinformatics

  • 1. Alternative Approaches to Managing and Integrating Bioinformatics Data GBCB Seminar October 9, 2014 Dan Sullivan Cyberinfrastructure Division
  • 2. Bioinformatics and Relational Database Management Systems (RDBMSs); Use Cases – Text Mining and Atherosclerosis; Bioinformatics and NoSQL Databases; How to Choose a Database for Your Project; Closing Comments
  • 3. Relational Database – “a database that [explicitly] stores information about both the data and how it is related.” (Source: http://en.wikipedia.org/wiki/Relational_database) NoSQL Database – “[a] database [that] provides a mechanism for storage and retrieval of data that is modeled in means other than the tabular relations used in relational databases.” (Source: http://en.wikipedia.org/wiki/NoSQL)
  • 4. Volume of data; Variety of data; Integration of data
  • 5.
  • 6. Pragmatic: widely applicable; many options. Modeling: reduce risk of data anomalies; separate logical and physical models.
  • 7. The key, The whole key, and Nothing but the key.
  • 9. Bioinformatics and Relational Database Management Systems (RDBMSs); Use Cases – Text Mining and Atherosclerosis; Bioinformatics and NoSQL Databases; How to Choose a Database for Your Project; Closing Comments
  • 10.
  • 11.
  • 12. Text Mining: storing text; caching word vectors; extracted features; experiment results. Atherosclerosis Research: demographics; sample tracking; genomic data; sequence variants; mass spec results.
  • 13.
  • 14. Early 1950s: Korean War autopsies. 1985-1998: Pathodeterminants of Atherosclerosis in Youth (PDAY) study. 2012-2016: Genomic and Proteomic Architecture of Atherosclerosis (GPAA).
  • 15. “… tell your children not to do what I have done …” House of the Rising Sun American Folk Song
  • 16. Started with MySQL. Could have stayed with relational model, but: requirements change; new data sets; unknown data structures; increasingly complex normalized model.
  • 17. Bioinformatics and Relational Database Management Systems (RDBMSs); Use Cases – Text Mining and Atherosclerosis; Bioinformatics and NoSQL Databases; How to Choose a Database for Your Project; Closing Comments
  • 19. Key Value Databases; Document Databases; Wide Column Stores; Graph Databases; Search Engines
  • 20. Features: simple primitive data structure; no predefined schema; limited query capabilities; dictionary-like functionality at large scale (key1 → value1, key2 → value2, key3 → value3). Bioinformatics Use Case: word vectors in text mining; caching. Limitations: key lookup only, no generalized query; small number of attributes per entity.
  • 21. >>> import redis
       >>> r_server = redis.Redis("localhost", decode_responses=True)
       >>> r_server.set("sample:123:type", "Aorta")
       True
       >>> r_server.get("sample:123:type")
       'Aorta'
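The key-naming pattern in the Redis session above (entity:id:attribute) can be tried without a running server. The following is a minimal sketch, assuming a plain in-memory dictionary in place of Redis; the class name `TinyKV`, the helper `sample_key`, and the word-vector key and values are hypothetical illustrations, not part of the talk or of the redis-py API.

```python
class TinyKV:
    """Hypothetical in-memory stand-in for a key-value store (not redis-py)."""

    def __init__(self):
        self._data = {}

    def set(self, key, value):
        self._data[key] = value
        return True

    def get(self, key):
        # Key lookup only: there is no way to query by value or by field.
        return self._data.get(key)


def sample_key(sample_id, attribute):
    """Build a key using the entity:id:attribute convention from the slide."""
    return f"sample:{sample_id}:{attribute}"


store = TinyKV()
store.set(sample_key(123, "type"), "Aorta")
# Cache a word vector under an illustrative key (term and values are made up).
store.set("wordvec:plaque", [0.12, 0.80, 0.05])

print(store.get(sample_key(123, "type")))
print(store.get("sample:999:type"))  # a missing key yields None
```

Because every read goes through a single key, the design question shifts from schema to key naming: anything you will need to look up has to be encoded in the key itself.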
  • 22.
  • 23. Features: JSON/XML structures; fields vary between docs; no predefined schema; documents analogous to rows; collections analogous to tables; query capabilities. Bioinformatics Use Case: text mining; atherosclerosis. Limitations: no joins; no referential integrity checks; object-based query language. Document template: { id : <value>, <key> : <value>, <key> : <embedded document>, <key> : <array> }
  • 24. { subject_id: "F8273", age: "26", sex: "M", date_of_death: "12-Jan-1995", glycohemoglobin: "10%", BMI: 22, samples: [ {type: "Thoracic Aorta", AHA_score: 1}, {type: "Abdominal Aorta", AHA_score: 2}, {type: "LAD", AHA_score: 5} ], sequence: {seq_file: "F8273_08152014.bam", variant_file: "F8273_08152014.vcf"} }
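To make the embedded-document idea concrete, here is a minimal sketch that holds part of the subject document from the slide as a Python dictionary and filters its embedded `samples` array. No document database is involved; the function name `samples_with_min_score` is a hypothetical helper, and storing `age` as a number rather than a string is an assumption made for illustration.

```python
# Part of the subject document from the slide, as a plain Python dict.
subject = {
    "subject_id": "F8273",
    "age": 26,  # assumption: stored as a number here
    "sex": "M",
    "samples": [
        {"type": "Thoracic Aorta", "AHA_score": 1},
        {"type": "Abdominal Aorta", "AHA_score": 2},
        {"type": "LAD", "AHA_score": 5},
    ],
    "sequence": {
        "seq_file": "F8273_08152014.bam",
        "variant_file": "F8273_08152014.vcf",
    },
}


def samples_with_min_score(doc, min_score):
    """Return sample types whose AHA_score is at least min_score.

    Fields may vary between documents, so missing keys are tolerated.
    """
    return [
        s["type"]
        for s in doc.get("samples", [])
        if s.get("AHA_score", 0) >= min_score
    ]


high = samples_with_min_score(subject, 2)
print(high)
```

The samples travel with the subject, so no join is needed to read them; the cost is that nothing enforces that every document has the same fields, which is why the helper uses `.get` with defaults.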
  • 25.
  • 26. Features: groups attributes into column families; column families store key-value pairs; implemented as sparse multi-dimensional arrays; denormalized; 10^4-10^6 columns, 10^9 rows. Bioinformatics Use Case: large studies; many experiments & data types; simulations. Limitations: operationally challenging; suitable for large number of servers.
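The "sparse multi-dimensional array" description can be illustrated with nested dictionaries: row key, then column family, then column. This is only a sketch of the data model under that assumption; the class `SparseWideTable`, the column names, and the numeric value are hypothetical, and real wide column stores add distribution, versioning and persistence that are not shown.

```python
from collections import defaultdict


class SparseWideTable:
    """Hypothetical sketch: row key -> column family -> column -> value."""

    def __init__(self):
        self._rows = defaultdict(lambda: defaultdict(dict))

    def put(self, row_key, family, column, value):
        self._rows[row_key][family][column] = value

    def get(self, row_key, family, column, default=None):
        row = self._rows.get(row_key)  # .get avoids creating empty rows
        if row is None:
            return default
        return row.get(family, {}).get(column, default)

    def cell_count(self):
        """Only populated cells are stored; absent columns cost nothing."""
        return sum(
            len(cols) for row in self._rows.values() for cols in row.values()
        )


table = SparseWideTable()
table.put("F8273", "demographics", "sex", "M")
table.put("F8273", "proteomics", "MRM:peptide_0001", 0.42)  # made-up value
table.put("F9001", "demographics", "sex", "F")

print(table.get("F8273", "proteomics", "MRM:peptide_0001"))
print(table.get("F9001", "proteomics", "MRM:peptide_0001"))  # None: never stored
print(table.cell_count())
```

Each row can carry a different set of columns, which is what makes very wide, mostly empty tables practical.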
  • 27.
  • 28. Features: highly normalized; graph-based query language (Gremlin); SQL-inspired query language (Cypher); support for path finding and recursion. Bioinformatics Use Case: epidemiology simulations; interaction networks. Limitations: less suited for tabular data.
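Path finding is the kind of query graph databases are built for. As a minimal sketch of the idea, assuming a small undirected interaction network held as an adjacency list, a breadth-first search returns a shortest path between two nodes. The node names and the function `shortest_path` are hypothetical; a graph database would express this in Gremlin or Cypher rather than application code.

```python
from collections import deque


def shortest_path(graph, start, goal):
    """Breadth-first search; returns a shortest path as a list, or None."""
    if start == goal:
        return [start]
    previous = {start: None}  # also serves as the visited set
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbor in graph.get(node, ()):
            if neighbor in previous:
                continue
            previous[neighbor] = node
            if neighbor == goal:
                # Walk back from goal to start, then reverse.
                path = [goal]
                while previous[path[-1]] is not None:
                    path.append(previous[path[-1]])
                return path[::-1]
            queue.append(neighbor)
    return None


# Illustrative interaction network (node names are made up).
interactions = {
    "A": ["B", "C"],
    "B": ["A", "D"],
    "C": ["A", "D"],
    "D": ["B", "C", "E"],
    "E": ["D"],
    "F": [],
}

print(shortest_path(interactions, "A", "E"))
print(shortest_path(interactions, "A", "F"))  # no connection
```

In a relational model the same question needs a recursive query or repeated self-joins, one per hop.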
  • 29.
  • 30. Bioinformatics and Relational Database Management Systems (RDBMSs); Use Cases – Text Mining and Atherosclerosis; Bioinformatics and NoSQL Databases; How to Choose a Database for Your Project; Closing Comments
  • 31. Relational: requirements known at start of project; entities described by common attributes; compliance and audit issues; need normalization; acceptable performance on small number of servers; need server-side joins.
  • 32. Key value: caching; few attributes (key1 → value1, key2 → value2, key3 → value3). Document databases: varying attributes; integrate diverse data types; use denormalized data. Document template: { id : <value>, <key> : <value>, <key> : <embedded document>, <key> : <array> }
  • 33. Wide column data stores: extremely large volumes of data; high availability. Graph databases: connected data; need path finding and recursive queries.
  • 34.
  • 35. Multiple types of databases. NoSQL complements relational models. Research question drives selection. Balance benefits and limitations. May use multiple types of databases in a single project. NoSQL databases are improving rapidly, gaining additional functionality.
  • 37. *Dr. Rebecca Wattam, Advisor *Becky Will, GPAA VT PI *Chengdong Zhang, DBA & SE *Cyberinfrastructure Division *GPAA Collaborators

Editor's notes

  1. Relational databases take advantage of relationships between entities (things, nouns) to minimize the amount of data stored. NoSQL databases model entities, but relationships are often implicit in the structure. There is less emphasis on minimizing storage, preserving data integrity, or avoiding data anomalies.
  2. Projects with any two of these can probably be handled well by an RDBMS. When all three are encountered in one project, NoSQL can often provide better performance, with different levels of support for Consistency, Availability and Partition tolerance (CAP theorem).
  3. Simple data sets can be managed in spreadsheets. Not ideal but works in some cases. Larger and more complicated data sets require a database. Relational is a natural next step from spreadsheets because of the tabular nature of data.
  4. Free, high-quality RDBMSs are available, e.g. MySQL and PostgreSQL. Many commercial options as well. Mature set of tools, such as IDEs for database developers. Many resources and best practices available. From a more theoretical perspective, the relational model reduces the risk of data anomalies (i.e. insert, delete and update anomalies). It also separates the logical model (what we see as database users) from the physical model (e.g. how data is actually stored on disk or other persistent storage media). There are some performance disadvantages due to the need for joins: gathering related information stored in separate tables, and therefore on different parts of the disk.
  5. Normalization is a process of reducing redundancy and the risk of data anomalies. There are several rules of normalization; the most important are Codd’s first three. Much of the code in an RDBMS is designed to support querying normalized data: how to bring related data together, and how to do it with an optimal set of steps (the query optimizer).
  6. RDBMSs run well on a single server. You can implement failover solutions and load-balance read-only queries, but it is difficult to build a distributed RDBMS with write operations and immediate consistency. Network and database latency causes a delay between the time a row is updated in one instance and the time it is updated in all others, and it can require locking all replicas of a row until every replica is updated. Distributing an RDBMS involves: two-phase commit for writes in a master-master configuration; master-slave replication, which helps with reads but not writes; sharding, which helps if querying by shard key but otherwise requires querying all servers; vertical partitioning, where tables are placed on different servers, which makes it hard to join tables on different servers. Watch out for software license costs if scaling out with COTS. NoSQL databases relax the consistency constraint; some implement eventual consistency. Implementation bottlenecks: you need a data modeler to change the schema and a DBA to implement those changes. NoSQL allows developers to add columns, collections and other structures on the fly, but you lose some benefits of an RDBMS, such as referential integrity. Joins are time- and resource-consuming, so developers often denormalize to improve performance. That makes one question the use of an RDBMS if its core functionality is not used.
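The sharding trade-off in the note above (fast when querying by shard key, otherwise every server must be queried) can be sketched in a few lines. This is a toy illustration under stated assumptions: shards are in-memory dictionaries, the shard key is a subject ID, and routing uses a SHA-256 hash modulo the shard count. The names `shard_for` and `ShardedStore` and the sample records are hypothetical.

```python
import hashlib


def shard_for(key, num_shards):
    """Map a shard key to a shard index with a stable hash."""
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards


class ShardedStore:
    """Hypothetical sketch: each shard is a dict standing in for a server."""

    def __init__(self, num_shards):
        self.shards = [{} for _ in range(num_shards)]

    def insert(self, subject_id, record):
        index = shard_for(subject_id, len(self.shards))
        self.shards[index][subject_id] = record

    def find_by_id(self, subject_id):
        """Query by shard key: touches exactly one shard."""
        index = shard_for(subject_id, len(self.shards))
        return self.shards[index].get(subject_id)

    def find_by_field(self, field, value):
        """Query by any other field: must visit every shard."""
        matches, shards_visited = [], 0
        for shard in self.shards:
            shards_visited += 1
            matches.extend(r for r in shard.values() if r.get(field) == value)
        return matches, shards_visited


store = ShardedStore(num_shards=4)
store.insert("F8273", {"subject_id": "F8273", "sex": "M"})
store.insert("F9001", {"subject_id": "F9001", "sex": "F"})
store.insert("F9002", {"subject_id": "F9002", "sex": "M"})

print(store.find_by_id("F8273"))
males, visited = store.find_by_field("sex", "M")
print(len(males), "matches after visiting", visited, "shards")
```

The second query is the expensive one: its cost grows with the number of servers, not with the number of matching rows.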
  7. Relational is good when: audit and compliance are important; referential integrity is needed; immediate consistency is needed; relational integrity is needed; durability is satisfied by backups. Use cases: financial services, health care, manufacturing, even our own beloved Hokie Spa. Our use cases are different. Is relational really the best data model? Not necessary when: some errors can be tolerated; availability is the primary concern; durability is important.
  8. Most important point of this talk: don’t be driven to choose a database model based on what you are familiar with, what others say is the “best” data model, or what has been used before just because it has been used before. Let research requirements, subject to constraints (time, funding, etc.), drive the decision. Some of us learn this lesson the hard way.
  9. I’ll discuss how NoSQL databases can be used in two different bioinformatics areas: text mining and atherosclerosis. I described the text mining project in detail in a seminar last semester, so I won’t go into much detail in that area, but I will spend a few minutes providing background on atherosclerosis. I’ll also use atherosclerosis examples when describing NoSQL data models.
  10. Buildup of plaque inside arteries. Plaque consists of fat, cholesterol, calcium and other substances. Limits flow of oxygen. Leads to: heart attack; stroke. From http://www.nhlbi.nih.gov/health/health-topics/topics/atherosclerosis/causes.html: The exact cause of atherosclerosis isn't known. However, studies show that atherosclerosis is a slow, complex disease that may start in childhood. It develops faster as you age. Atherosclerosis may start when certain factors damage the inner layers of the arteries. These factors include: smoking; high amounts of certain fats and cholesterol in the blood; high blood pressure; high amounts of sugar in the blood due to insulin resistance or diabetes. Plaque may begin to build up where the arteries are damaged. Over time, plaque hardens and narrows the arteries. Eventually, an area of plaque can rupture (break open). When this happens, blood cell fragments called platelets (PLATE-lets) stick to the site of the injury. They may clump together to form blood clots. Clots narrow the arteries even more, limiting the flow of oxygen-rich blood to your body.
  11. Autopsies performed during the Korean War found evidence of early-onset atherosclerosis. There was not enough time for lifestyle factors, such as high-fat diet, smoking and inactivity, to be the sole cause of plaque. Hypothesis: a genetic factor influences atherosclerosis. PDAY confirmed and expanded on the earlier findings. A large collaboration of pathologists collected samples from young people who died of non-cardiovascular causes: 3,000 autopsies of 15- to 34-year-olds. Aorta and LAD samples were preserved in formalin-fixed, paraffin-embedded (FFPE) blocks. Liver samples were also collected. GPAA uses the liver samples to sequence genomes. Proteomics collaborators have developed techniques for extracting proteins from old FFPE blocks, which makes genomic and proteomic analysis possible today.
  12. Time for confession. I ignored my earlier advice about letting requirements and constraints drive database selection in the GPAA project. I’ve worked with relational databases extensively and had developed models for demographic, phenotypic, genomic and proteomic data before. I did not pay enough attention to the “unknown unknowns”: collaborators had additional ideas of how to leverage other data about GWAS, eQTL, histones, chromatins, etc. I did not appreciate how much would change.
  13. Could have stayed with the relational model, but: requirements were changing; there were new data sets (GWAS, eQTL, Chromatin Segmentation, Histones); there were unknown data structures for Multiple Reaction Monitoring (MRM) Mass Spec and SWATH. The normalized model was beginning to be more trouble than it was worth. Flexibility was a primary concern.
  14. First 4 especially important to organizations with big data and need for constant access to data and applications – e.g. Facebook, Amazon, Google Flexibility is primary driver for us to consider and eventually adopt a NoSQL database.
  15. The 4 most commonly referenced database types in the NoSQL community and press. I will not discuss search databases here. PATRIC is using a hybrid relational-search database strategy, which is significantly improving performance over a relational-only approach. Integration is key for bioinformaticians and biologists; don’t make them integrate data.
  16. So simple, it is almost trivial. Can store non-atomic values as well, e.g. JSON documents, but can only access entire document, cannot select a single value in the document or search for values of a particular field.
  17. Example KV databases. Redis: popular, easy to use, commonly used for caching; master-slave replication; multiple servers respond to read requests; one server handles writes. Riak: scalable, masterless. BerkeleyDB: first widely used KV data store. Aerospike and FoundationDB: support ACID transactions. Amazon DynamoDB: available in the cloud (it was just announced on 10/9/2014 that DynamoDB will support documents as well as KVs).
  18. JSON/BSON or XML storage
  19. Cassandra: developed by Facebook. HBase: part of the Hadoop ecosystem. Accumulo: designed to support cell-level access control; originally created by the NSA. Hypertable: used commercially.
  20. Neo4j is probably the most widely used of the graph DBs. OrientDB incorporates document DB features as well as graph DB features. Titan runs on a cluster and uses Cassandra or HDFS (I think) for distributed storage. GraphChi-DB: a project to run large graphs on small machines, e.g. Mac Minis. AllegroGraph: a commercial product from Franz, a long-established Lisp vendor.