Charleston Conference 2016


  1. From Maslow’s Hierarchy to Knowledge Graphs: Experiments in Big and Small Data at Elsevier. Anita de Waard (a.dewaard@elsevier.com), VP Research Data Management, Elsevier. Charleston Conference, November 4, 2016.
  2. Big Data vs. Small Data: What Will I Be Talking About? Data type (small / big): User (UX / user analytics); Performance (Pure / SciVal); Research (Research Data Management (RDM) / HPC systems: HEP, astronomy, etc.); Text (text mining / knowledge graphs); Health (medical systems / precision medicine). Legend: Elsevier does; I will talk about.
  3. When You Leave Your Institution, What Happens To Your Data? Answer categories: stays at institution; take it with me; don’t know; data is lost; other. Source: Bauer, B. et al. (2015), ‘Forschende und ihre Daten. Ergebnisse einer österreichweiten Befragung’ (eBook, in German), E-infrastructures Austria, https://phaidra.univie.ac.at/detail_object/o:407736
  4. When we talk about data, we really talk about the following: machine & environment settings; raw data; processed data; scripts & analyses; protocols, methods, algorithms. Accessibility, reproducibility, reusability, discoverability. Note: images for illustrative purposes only.
  5. A Maslow Hierarchy for Research Data: https://www.elsevier.com/connect/10-aspects-of-highly-effective-research-data
  6. Preserve Process: Hivebench (http://www.hivebench.com)
  7. Preserve Data: Mendeley Data (https://data.mendeley.com/). Linked to published papers, or not; linked to GitHub, or not; versioning and provenance.
  8. Share and Comprehend: SoftwareX (http://www.journals.elsevier.com/softwarex/)
  9. Access: Linking papers to data: www.Scholix.org • ICSU/WDS/RDA Publishing Data Services Working Group • Creating a linked-data model for exposing DOI-to-DOI links outside the publisher’s firewall • Merged with the National Data Service pilot, which has the same goal • Collaboration between CrossRef, DataCite, Europe PubMed Central, ANDS, Thomson Reuters, Elsevier, OpenAIRE. Objective: move from a plethora of (mostly) bilateral arrangements between the different players to a one-for-all cross-referencing service for articles and data.
  10. Discover: Data Search (http://datasearch.elsevier.com). (1) Across repositories; (2) (deep) indexing of data, so not just metadata; (3) data preview.
  11. A Maslow Hierarchy for Research Data: Data at Risk; Reproducibility Papers. https://www.elsevier.com/connect/10-aspects-of-highly-effective-research-data
  12. Towards an Elsevier Knowledge Graph. Goal: identify entities and relationships across the entire Elsevier corpus in ScienceDirect. Text mining + entity identification, using our taxonomies (EMMeT, Compendex, and others). Unsupervised, scalable, and built with off-the-shelf technologies. Collaboration with University College London and UMass Amherst [1]. Diagram labels: Content; Universal schema; Surface form relations; Structured relations; Factorization model; Matrix Construction; Open Information Extraction; Entity Resolution; Matrix Factorization; Knowledge graph; Curation; Predicted relations; Matrix Completion; Taxonomy; Triple Extraction. Diagram figures: 14M articles from ScienceDirect; 3.3M triples; 475M triples; 49M triples; p x r matrix; p x k, k x r latent factor matrices; ~102 triples; 920K concepts from EMMeT. [1] Riedel, S., L. Yao, A. McCallum, and B. M. Marlin (2013). “Relation extraction with matrix factorization and universal schemas”, http://www.aclweb.org/anthology/N13-1008
  13. Sample output (deduplication/normalization: downsampled from 49M entity-resolved triples): glaucoma developed many years after chronic inflammation of uveal tract; glaucoma develop following chronic inflammation of uveal tract; glaucoma can appear soon in family history of glaucoma; glaucoma can appear soon in age over 40; glaucoma the risk of functional visual field loss; glaucoma contributing causes of functional visual field loss; glaucoma contributed to functional visual field loss; glaucoma is considered the second leading cause of functional visual field loss; glaucoma remains the second leading cause of functional visual field loss.
  14. Knowledge Graphs for the Life Sciences: Bradley Allen, DC Conference, Oct 2016, http://www.slideshare.net/bpa777/dc2016-keynote-20161013-67164305/15
  15. Trends driving Digital Health & Precision Medicine: need for health data with consent. (1) Genetic testing: 4500 tests for gene disorders available (2013: 3200, +20% CAGR); $1245 cost to sequence a full genome (10/2014: $5730); $199 cost of a 23andMe test. (2) Information explosion: 25 million biomed articles referenced on PubMed; 1.2 million new biomed articles p.a. (3) Patient data: 76% of US hospitals use at least a basic EMR; 130 million patient data sets at a large insurer, 21 m complete for the last 2 years, 7 m with clinical and lab data; NB: 6 m (no clin, lab) in Germany, 6.5 million in Catalonia. (4) Biosensors, IoT in health: 105 mm ECG: high ECG quality, heart rate, respiratory, body temp, activity, body position, watertight, induction charged, Bluetooth, continuous data feed. (5) Machine learning: 30 days → 1 hour, manual to machine learning, time needed to develop one prediction model at Elsevier. (6) Patient empowerment: patientslikeme has 400,000+ members, 31 million data points covering 2,500+ conditions, donating data.
  16. The Elsevier Medical Graph is a deep predictive model that relates attributes of over 2000 medical conditions to phenotypes of patients at potential risk of re-admission. Probability of occurrence within the next five years. 2,083 ICD10 conditions. Based on a 6-year longitudinal history of 6 million German patients.
  17. Big Data vs. Small Data: What Did I Talk About? Data type (small / big): User (UX / user analytics); Performance (Pure / SciVal); Research (Research Data Management (RDM) / HPC systems: HEP, astronomy, etc.); Text (text mining / knowledge graphs); Health (medical systems / precision medicine). Legend: Elsevier does; I discussed!
  18. Thank you! Anita de Waard, VP Research Data Collaborations, Elsevier RDM Services, Jericho, VT 05465, a.dewaard@elsevier.com
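
Slide 12 mentions factorizing a p x r matrix (entity pairs by relations) into p x k and k x r latent factor matrices and using matrix completion to predict relations. The toy sketch below illustrates that general idea only. It is not Elsevier's pipeline: the entity pairs, the four relation columns, the held-out cells and all hyperparameters are made up for illustration, and it uses a plain logistic loss with SGD, which is a simplification and not necessarily the objective used in [1].

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def predict(P, R, p, r):
    """Score of cell (p, r): sigmoid of the dot product of the two latent vectors."""
    return sigmoid(sum(a * b for a, b in zip(P[p], R[r])))

def factorize(labels, n_pairs, n_rels, k=2, epochs=500, lr=0.1, reg=0.001, seed=0):
    """Fit a (p x k) pair matrix and an (r x k) relation matrix with logistic SGD."""
    rng = random.Random(seed)
    P = [[rng.gauss(0.0, 0.1) for _ in range(k)] for _ in range(n_pairs)]
    R = [[rng.gauss(0.0, 0.1) for _ in range(k)] for _ in range(n_rels)]
    cells = sorted(labels)
    for _ in range(epochs):
        rng.shuffle(cells)
        for p, r in cells:
            err = labels[(p, r)] - predict(P, R, p, r)
            for f in range(k):
                pf, rf = P[p][f], R[r][f]
                P[p][f] += lr * (err * rf - reg * pf)
                R[r][f] += lr * (err * pf - reg * rf)
    return P, R

# Hypothetical toy data. Columns 0 and 1 stand for surface forms such as
# "developed many years after" and "develop following"; column 2 stands for
# an invented structured relation; column 3 for an unrelated relation.
N_RELS = 4
observed = {
    0: {0, 1, 2}, 1: {0, 1, 2}, 2: {0, 1, 2},  # pairs with both surface forms and the structured relation
    3: {3}, 4: {3},                            # pairs with only the unrelated relation
    5: {0, 1},                                 # surface forms seen, structured cell held out
    6: {3},                                    # unrelated pair, structured cell held out
}
held_out = {(5, 2), (6, 2)}  # cells never shown to the model during training

# Every remaining cell is a training example: 1.0 if observed, else 0.0.
labels = {
    (p, r): 1.0 if r in rels else 0.0
    for p, rels in observed.items()
    for r in range(N_RELS)
    if (p, r) not in held_out
}

P, R = factorize(labels, n_pairs=len(observed), n_rels=N_RELS)

# Matrix completion: score the two cells the model never saw.
score_related = predict(P, R, 5, 2)
score_unrelated = predict(P, R, 6, 2)
print(f"pair 5, relation 2: {score_related:.3f}")
print(f"pair 6, relation 2: {score_unrelated:.3f}")
```

In this toy setup the held-out cell for pair 5 should score higher than the one for pair 6, because pair 5 shares its surface-form columns with the pairs that carry relation 2. At the scale shown on the slide (millions of triples), a sparse representation and sampled negatives would be needed instead of enumerating every cell.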
