Nowomics.com is a website to help biology researchers stay up to date with the latest data and papers relevant to their research. This is a talk given at the Cambridge UK MongoDB User Group about how Nowomics is built on MongoDB.
3. Biomedical data are being generated and published at an unprecedented rate
[Figure labels: model organisms, 1500, literature, biological databases, proteins, pathways, genome annotation, gene expression, interactions, mutations, diseases, ~20,000 journal articles a week]
5. HOW NOWOMICS WORKS
• Fetch data from literature & databases every day
• Work out what’s changed, link to original data source
• Organise by gene, disease, process, author, etc
• Follow: users follow what they work on
• Personalised News Feed & email alerts
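The "work out what’s changed" step can be pictured as a diff between yesterday's and today's fetch. The sketch below is only an illustration: the record shape (`id`, `hash`) and the function name `diffSnapshots` are hypothetical, and the talk does not say how Nowomics actually detects changes.

```javascript
// Hypothetical record shape: { id: 'gene:530', hash: '...' }.
// The field names and the hash-based comparison are assumptions for illustration.
function diffSnapshots(previous, current) {
  const before = new Map(previous.map((rec) => [rec.id, rec.hash]));
  const after = new Map(current.map((rec) => [rec.id, rec.hash]));

  const added = [];
  const changed = [];
  for (const [id, hash] of after) {
    if (!before.has(id)) {
      added.push(id); // not seen in the previous fetch
    } else if (before.get(id) !== hash) {
      changed.push(id); // seen before, but the content differs
    }
  }
  // Anything in the previous fetch that is missing today
  const removed = [...before.keys()].filter((id) => !after.has(id));
  return { added, changed, removed };
}

const yesterday = [
  { id: 'gene:530', hash: 'a1' },
  { id: 'gene:672', hash: 'b2' },
];
const today = [
  { id: 'gene:530', hash: 'a9' }, // content changed
  { id: 'pub:1001', hash: 'c3' }, // newly published
];
const changes = diffSnapshots(yesterday, today);
```

Each entry in `added` or `changed` would become a candidate item for the news feed of anyone following that gene or publication.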
6. Alpha - summer 2013
Beta - November 2013
CEDAR Enterprise Fellowship - December 2013
13. 1.
2.
3.
4.
1 + 4 - precalculate with the aggregation framework
3 - wasn’t using the correct index, needed a hint
2 - uses the aggregation framework, which doesn’t support hint
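For a plain find() query, forcing an index looks roughly like the sketch below. The index definition `linksByGeneDate` and the helper `recentLinksForGene` are hypothetical; the talk does not say which query or index needed the hint. The cursor methods `sort()`, `limit()` and `hint()` are the standard mongo shell ones.

```javascript
// Hypothetical compound index on the links collection; the real one is not named in the talk.
const linksByGeneDate = { t1: 1, n1: 1, date: -1 };

// Build a query for one gene's most recent links and force the index with hint().
// `db` is the mongo shell database handle (or any object with the same shape).
function recentLinksForGene(db, geneId, limit) {
  return db.links
    .find({ t1: 'gene', n1: geneId })
    .sort({ date: -1 })
    .limit(limit)
    .hint(linksByGeneDate); // bypass the query planner's own index choice
}
```

In the shell this would be called as `recentLinksForGene(db, 530, 20)`, and appending `.explain()` to the returned cursor shows whether the hinted index was used.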
14. AGGREGATION FRAMEWORK I
• New in 2.2 - alternative to map reduce
• map reduce was slow and complex
• Analogous to SQL group by
• Run a pipeline of commands
db.links.aggregate([
{$match: {t2: 'pub', t1: 'gene'}},
{$group: {_id : '$n2', count: {$sum: 1}}} ])
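To make the SQL comparison concrete: the pipeline above corresponds roughly to `SELECT n2, COUNT(*) FROM links WHERE t2 = 'pub' AND t1 = 'gene' GROUP BY n2`. The sketch below reproduces the same two stages in plain JavaScript on a few made-up documents. Reading `t1`/`t2` as the types of the two linked items and `n1`/`n2` as their ids is an assumption based on the queries in the talk, and the sample data is hypothetical.

```javascript
// Hypothetical sample documents in the shape the talk's queries imply.
const links = [
  { t1: 'gene', n1: 530, t2: 'pub', n2: 9001 },
  { t1: 'gene', n1: 672, t2: 'pub', n2: 9001 },
  { t1: 'gene', n1: 530, t2: 'pub', n2: 9002 },
  { t1: 'gene', n1: 530, t2: 'disease', n2: 77 }, // filtered out by the match
];

// Plain-JavaScript equivalent of the $match + $group pipeline.
function countGenesPerPub(docs) {
  const counts = new Map();
  docs
    .filter((d) => d.t2 === 'pub' && d.t1 === 'gene') // $match
    .forEach((d) => counts.set(d.n2, (counts.get(d.n2) || 0) + 1)); // $group with $sum: 1
  return [...counts].map(([id, count]) => ({ _id: id, count: count }));
}

const perPub = countGenesPerPub(links);
```

Here `perPub` holds one `{_id, count}` document per publication, the same shape `$group` produces. MongoDB does not guarantee the order of `$group` output, so a real pipeline would add a `$sort` stage if order matters.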
17. AGGREGATION EXAMPLE
Count of genes linked to each publication
db.links.aggregate([
{$match: {t2: 'pub', t1: 'gene'}},
{$group: {_id : '$n2', count: {$sum: 1}}} ])
(actually precalculate for all and store results in collection)
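One way to precalculate and store the counts from client code is sketched below. It assumes a shell where `aggregate()` returns a cursor (2.6 or later) and `updateOne()` is available (3.2 or later); the collection name `pub_gene_counts` and the function name are hypothetical, and the talk does not show how Nowomics stores its results.

```javascript
const genesPerPubPipeline = [
  { $match: { t2: 'pub', t1: 'gene' } },
  { $group: { _id: '$n2', count: { $sum: 1 } } },
];

// Run the aggregation once and upsert one small document per publication.
// 'pub_gene_counts' is a hypothetical collection name.
function precalculateGeneCounts(db) {
  let written = 0;
  db.links.aggregate(genesPerPubPipeline).forEach((row) => {
    db.pub_gene_counts.updateOne(
      { _id: row._id },               // one document per publication id
      { $set: { count: row.count } }, // overwrite the stored count
      { upsert: true }                // create the document if it is missing
    );
    written += 1;
  });
  return written;
}
```

Page loads can then read a single document with `db.pub_gene_counts.findOne({_id: pubId})` instead of running the aggregation on every request.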
18. AGGREGATION EXAMPLE
Count of updates per month
db.links.aggregate([
  {$match: {date: {$gte: new ISODate('2013-02-01')},
            't1': 'gene', 'n1': 530}},
  {$project: {_id: 0, month: {$month: '$date'},
              year: {$year: '$date'}}},
  {$group: {'_id': {m: '$month', y: '$year'},
            count: {$sum: 1}}} ])
(actually precalculate for all and store results in collection)
19. AGGREGATION ISSUES
• No explain() (coming in 2.6)
• Can’t use index hints
• 16MB result limit - run in batches (coming in 2.6)
• Can’t output results to collection (coming in 2.6)
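For reference, the 2.6 features mentioned above look roughly like this in the shell. This is a sketch using the documented `explain` and `cursor` options of `aggregate()` and the `$out` stage; the function names and the target collection `pub_gene_counts` are hypothetical. Later releases (3.6 onwards) also accept a `hint` option on `aggregate()`.

```javascript
const countsPipeline = [
  { $match: { t2: 'pub', t1: 'gene' } },
  { $group: { _id: '$n2', count: { $sum: 1 } } },
];

// Ask the server how it would run the pipeline (explain option, 2.6+).
function explainCounts(db) {
  return db.links.aggregate(countsPipeline, { explain: true });
}

// Return results through a cursor in batches rather than one 16MB document (2.6+).
function countsInBatches(db, batchSize) {
  return db.links.aggregate(countsPipeline, { cursor: { batchSize: batchSize } });
}

// Write results straight to a collection with a final $out stage (2.6+).
// 'pub_gene_counts' is a hypothetical name; $out replaces that collection's contents.
function storeCounts(db) {
  return db.links.aggregate(countsPipeline.concat([{ $out: 'pub_gene_counts' }]));
}
```

Because `$out` replaces the target collection on each run, it suits a full nightly recalculation better than incremental updates.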
22. PERFORMANCE
Indexes & data (20GB) bigger than RAM (8GB)
• main indexes in RAM would be OK
• Loading data uses different indexes
• Slow page loads
23. PERFORMANCE
• ReadPreference.SECONDARY_PREFERRED - send links queries to secondary
• indexes stay in RAM
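The slide names a driver-level constant. In the mongo shell the same read preference is the mode string 'secondaryPreferred', set per query with `cursor.readPref()`. The helper name below is hypothetical; it is a sketch of routing only the links queries to a secondary, not the site's actual code.

```javascript
// Send a links query to a secondary when one is available, falling back to the primary.
// `db` is the mongo shell database handle, connected to a replica set.
function linksForGeneFromSecondary(db, geneId) {
  return db.links
    .find({ t1: 'gene', n1: geneId })
    .readPref('secondaryPreferred');
}
```

Reads from a secondary can lag slightly behind the primary, which is usually acceptable for data that is only refreshed once a day.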