Presentation given at the Workshop on Academic-Industrial Collaborations for Recommender Systems 2013 (http://bit.ly/114XDsE), JCDL'13. A walk through Mendeley as a platform, the growing pains of engineering at large scale, the data we're making publicly available, and some demos that have come out of academic collaborations.
9. Mendeley provides tools to help users...
...organise their research
...collaborate with one another
...discover new research
→ Explore crowdsourced research catalogue
→ Document statistics
→ Personalised article recommendations
→ Related research
→ Research contact suggestions
12. Our community from a data perspective
Social network (>2.4M users)
Research catalogue (~85M unique articles)
Research groups (~240K groups)
Personal libraries (>425M articles)
Logging a massive set of usage data
14. Lots of features to build & support
→ Reference management
→ Cite-as-you-write
→ Full-text article search
→ Digitalised annotations
→ Professional research groups
→ Social network
→ Annotation sharing
→ Explore crowdsourced research catalogue
→ Document statistics
→ Personalised article recommendations
→ Related research
→ Research contact suggestions
18. Lots of features to build & support
Research catalogue (~30M unique articles)
Personal libraries (>100M articles)
19. Crowdsourcing (deduplication, metadata aggregation, statistics)
24. The curse of success
• More articles came
• More users came
• Keeping catalogue data fresh was a burden
• Algorithms relied on global counts
• Iterating over MySQL tables was slow
• Needed to shard tables to grow catalogue
• In short, our backend system didn’t scale
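The table sharding mentioned above can be sketched in a few lines. This is an illustrative hash-based scheme, not Mendeley's actual one; the shard count and key format are assumptions:

```python
# Minimal sketch of hash-based sharding: map each catalogue document
# to one of N MySQL shards by hashing a stable document key.
import hashlib

N_SHARDS = 16  # hypothetical shard count

def shard_for(doc_id: str, n_shards: int = N_SHARDS) -> int:
    """Map a document id to a stable shard number."""
    digest = hashlib.md5(doc_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % n_shards

# The same id always lands on the same shard, so a lookup touches
# one small table instead of iterating over one huge one.
print(shard_for("doc-12345"))
```

Using a cryptographic hash (rather than Python's built-in `hash`) keeps the mapping stable across processes and restarts, which matters when many services must agree on where a row lives.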
26. ~0.5 million users; the 20 largest user bases:
University of Cambridge
Stanford University
MIT
University of Michigan
Harvard University
University of Oxford
Sao Paulo University
Imperial College London
University of Edinburgh
Cornell University
University of California at Berkeley
RWTH Aachen
Columbia University
Georgia Tech
University of Wisconsin
UC San Diego
University of California at LA
University of Florida
University of North Carolina
~30M research articles
27. The system started to become slow. How long did it take to generate our daily readership statistics?
28. 23 hours!
29. We had serious needs
• Build a catalogue based on billions of articles
• Support many features that rely on the catalogue
  • Statistics
  • Search
  • Recommendations
  • Sharing
• Data
  • Freshness
  • Consistency
• Business context
  • Agile development (rapid prototyping)
  • Cost effective
  • Going viral
  • Technical debt stacking up
30. Enter Hadoop
What is Hadoop? The Apache Hadoop Project develops open-source software for reliable, scalable, distributed computing.
www.hadoop.apache.org
31. Hadoop
• Designed to operate on a cluster of computers (1…thousands)
• Commodity hardware (low-cost units)
• Each node offers local computation and storage
• Provides a framework for working with big data (petabytes and beyond)
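To make the MapReduce model Hadoop provides concrete, here is a toy, single-process Python sketch of the kind of job described in this deck: the daily readership statistics, i.e. counting how many users hold each article. Data and names are illustrative; a real Hadoop job runs the map and reduce phases across many nodes:

```python
# Toy MapReduce: count readers per article from (user, article) pairs,
# i.e. "this user has this article in their library".
from collections import defaultdict

library_entries = [
    ("u1", "paperA"), ("u2", "paperA"), ("u2", "paperB"), ("u3", "paperA"),
]

def map_phase(entries):
    # Mapper: emit (article, 1) for every library entry.
    for _user, article in entries:
        yield article, 1

def reduce_phase(pairs):
    # Reducer: sum the counts per article key
    # (in Hadoop, the shuffle step groups equal keys together first).
    counts = defaultdict(int)
    for article, one in pairs:
        counts[article] += one
    return dict(counts)

readership = reduce_phase(map_phase(library_entries))
print(readership)  # paperA held by 3 users, paperB by 1
```

Because each mapper only sees its own slice of the input and reducers only see one key group at a time, the same code pattern scales from this toy list to the hundreds of millions of library entries mentioned earlier.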
32. New tech stack for backend
Research catalogue (~30M unique articles)
Personal libraries (>100M articles)
Crowdsourcing (deduplication, metadata aggregation, statistics)
33. 23-hour computations now took 15 minutes
34. Recommended reading
37. Generating recommendations through matrix multiplication
These are item-based recommendations, since similarity is computed between items, not users.
org.apache.mahout.cf.taste.hadoop.item.RecommenderJob
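The matrix-multiplication view of item-based recommendation can be sketched in plain Python. The data is illustrative; Mahout's RecommenderJob distributes the same idea across a Hadoop cluster and supports similarity measures beyond the raw co-occurrence used here:

```python
# User-item matrix: rows = users, columns = articles,
# 1 = the article is in that user's library.
A = [
    [1, 1, 0],  # user 0
    [1, 0, 1],  # user 1
    [0, 1, 0],  # user 2
]

def matmul(X, Y):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)]
            for row in X]

def transpose(X):
    return [list(col) for col in zip(*X)]

# Item-item co-occurrence: S = A^T A.
# S[i][j] counts users who hold both article i and article j.
S = matmul(transpose(A), A)        # [[2, 1, 1], [1, 2, 0], [1, 0, 1]]

# Preference scores = A . S: each user's library weighted by item
# similarity; high scores on articles not yet held become candidates.
scores = matmul(A, S)              # [[3, 3, 1], [3, 1, 2], [1, 2, 0]]
print(scores)
```

For user 2 (library = article 1 only), article 0 scores 1 while article 2 scores 0, so article 0 would be recommended first; this is exactly the "similarity is based on items, not users" point above.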
48. Disclaimer: these advantages have costs
• Migrating to a new system (data consistency)
• Setup costs
  • Learning the black magic of configuration
  • Hardware for the cluster
• Administrative costs
  • Steep learning curve to administer Hadoop
• Still an immature technology
  • You may need to debug the source code
• Developing against Mahout
  • Still needs lots of love
49. Big data backend
Research catalogue (~30M unique articles)
Personal libraries (>100M articles)
Crowdsourcing (deduplication, metadata aggregation, statistics)
51. Our community from a data perspective
Social network (>2.4M users)
Research catalogue (~85M unique articles)
Research groups (~240K groups)
Personal libraries (>425M articles)
Logging a massive set of usage data
55. PLoS/Mendeley's Binary Battle
Challenge: build an application with our data, make science more open.
More details at http://dev.mendeley.com/api-binary-battle/
57. ScienceRec Challenge 2012
Challenge: build an off-line system for scientific recommendations with our API and the DataTEL data set.
More details at http://2012.recsyschallenge.com/tracks/sciencerec/
61. We have a history of academic collaborations
Duration Project
2009-2011 MAKIN’IT
2010-2014 TEAM
2010-2011 DURA
2012-2012 CSL Editor
2012-2014 CODE
2012-2014 ERASM
2013-2015 EEXCESS
65. Want to collaborate?
67. Conclusions
→ Mendeley is far more than a reference manager – it's a platform that connects researchers, data and apps
→ Starting small is good, but be prepared for the cost of scaling up
→ We're opening up our data for you to build apps on our platform
→ We're always keen to collaborate with academic groups
Kris Jack, PhD
Chief Data Scientist, @_krisjack