The power of graphs to analyze biological data

  1. the power of graphs for analyzing biological datasets Davy Suvee Janssen Pharmaceutica
  2. about me: who am i ... (Davy Suvee, @DSUVEE) ➡ working as an IT lead / software architect @ Janssen Pharmaceutica • dealing with big scientific data sets • hands-on expertise in big data and NoSQL technologies ➡ founder of Datablend • providing big data and NoSQL consultancy • sharing practical knowledge and big data use cases via blog
  3. outline ➡ getting visual insights into big data sets ★ gene expression clustering (MongoDB, Neo4j, Gephi) ★ mutation prevalence (Cassandra, Neo4j, Gephi) ➡ FluxGraph, a time machine for your graphs ...
  4. insights in big data ➡ typical approach through warehousing ★ star schema with fact tables and dimension tables
  6. insights in big data ★ real-time visualization ★ filtering ★ metrics ★ layouting ★ modular ➡ references 1, 2: (1) http://gephi.org/plugins/neo4j-graph-database-support/ (2) http://github.com/datablend/gephi-blueprints-plugin
  7. gene expression clustering ➡ oncology data set: ★ 4,800 samples ★ 27,000 genes ➡ question: ★ for a particular subset of samples, which genes are co-expressed?
  8. MongoDB for storing gene expressions { "_id" : { "$oid" : "4f1fb64a1695629dd9d916e3" }, "sample_name" : "122551hp133a21.cel", "genomics_id" : 122551, "sample_id" : 343981, "donor_id" : 143981, "sample_type" : "Tissue", "sample_site" : "Ascending colon", "pathology_category" : "MALIGNANT", "pathology_morphology" : "Adenocarcinoma", "pathology_type" : "Primary malignant neoplasm of colon", "primary_site" : "Colon", "expressions" : [ { "gene" : "X1_at", "expression" : 5.54217719084415 }, { "gene" : "X10_at", "expression" : 3.92335121981739 }, { "gene" : "X100_at", "expression" : 7.81638155662255 }, { "gene" : "X1000_at", "expression" : 5.44318512260619 }, … ] }
  9. pearson correlation through map-reduce ➡ example: x = (43, 21, 25, 42, 57, 59), y = (99, 65, 79, 75, 87, 81) ➡ pearson correlation ≈ 0.53
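
The correlation on this slide can be reproduced with a few lines of plain Java. This is only an in-memory sketch of the formula, not the map-reduce job used in the talk; the class name `PearsonSketch` is made up for illustration, and the sample values are the x/y pairs shown on the slide.

```java
final class PearsonSketch {

    /**
     * Pearson correlation coefficient of two equally long samples.
     * Returns NaN when one of the samples has zero variance.
     */
    static double pearson(double[] x, double[] y) {
        if (x.length != y.length || x.length < 2) {
            throw new IllegalArgumentException("need two samples of equal length >= 2");
        }
        double meanX = 0.0;
        double meanY = 0.0;
        for (int i = 0; i < x.length; i++) {
            meanX += x[i];
            meanY += y[i];
        }
        meanX /= x.length;
        meanY /= y.length;

        double cov = 0.0;   // sum of products of deviations
        double varX = 0.0;  // sum of squared deviations of x
        double varY = 0.0;  // sum of squared deviations of y
        for (int i = 0; i < x.length; i++) {
            double dx = x[i] - meanX;
            double dy = y[i] - meanY;
            cov += dx * dy;
            varX += dx * dx;
            varY += dy * dy;
        }
        return cov / Math.sqrt(varX * varY);
    }

    public static void main(String[] args) {
        // The six (x, y) pairs from the slide.
        double[] x = {43, 21, 25, 42, 57, 59};
        double[] y = {99, 65, 79, 75, 87, 81};
        System.out.println("r = " + pearson(x, y));
    }
}
```

For the slide's sample the coefficient comes out at roughly 0.53. In a map-reduce setting the three sums (`cov`, `varX`, `varY`) are the parts that would have to be accumulated per gene pair.
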
  10. co-expression graph ➡ create a node for each gene ➡ if correlation between two genes >= 0.8, draw an edge between both nodes
  11. co-expression graph
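
The edge rule from the co-expression slide (one node per gene, an edge when the correlation is >= 0.8) can be sketched as follows. The gene names and expression values are invented toy data, and `CoExpressionSketch` is a hypothetical helper; the talk itself stores the graph in Neo4j and visualizes it with Gephi, which this sketch does not attempt.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class CoExpressionSketch {

    /** Pearson correlation of two equally long expression profiles. */
    static double pearson(double[] x, double[] y) {
        double meanX = 0.0;
        double meanY = 0.0;
        for (int i = 0; i < x.length; i++) {
            meanX += x[i];
            meanY += y[i];
        }
        meanX /= x.length;
        meanY /= y.length;
        double cov = 0.0;
        double varX = 0.0;
        double varY = 0.0;
        for (int i = 0; i < x.length; i++) {
            double dx = x[i] - meanX;
            double dy = y[i] - meanY;
            cov += dx * dy;
            varX += dx * dx;
            varY += dy * dy;
        }
        return cov / Math.sqrt(varX * varY);
    }

    /** Returns one "geneA--geneB" edge for every gene pair with correlation >= threshold. */
    static List<String> coExpressionEdges(Map<String, double[]> profiles, double threshold) {
        List<String> genes = new ArrayList<>(profiles.keySet());
        List<String> edges = new ArrayList<>();
        for (int i = 0; i < genes.size(); i++) {
            for (int j = i + 1; j < genes.size(); j++) {
                double r = pearson(profiles.get(genes.get(i)), profiles.get(genes.get(j)));
                if (r >= threshold) {
                    edges.add(genes.get(i) + "--" + genes.get(j));
                }
            }
        }
        return edges;
    }

    /** Invented toy profiles: geneA and geneB rise together, geneC does not. */
    static Map<String, double[]> toyProfiles() {
        Map<String, double[]> profiles = new LinkedHashMap<>();
        profiles.put("geneA", new double[] {1.0, 2.0, 3.0, 4.0});
        profiles.put("geneB", new double[] {2.0, 4.0, 6.0, 8.5});
        profiles.put("geneC", new double[] {4.0, 1.0, 3.0, 2.0});
        return profiles;
    }

    public static void main(String[] args) {
        System.out.println(coExpressionEdges(toyProfiles(), 0.8));
    }
}
```

With the toy profiles only geneA and geneB end up connected. The pairwise loop is quadratic in the number of genes, which is presumably why the talk computes the correlations with map-reduce for 27,000 genes.
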
  12. graphs and time ... ➡ reproducible graph state ➡ towards a time-aware graph ... ➡ FluxGraph: a Blueprints-compatible graph on top of Datomic ➡ make FluxGraph fully time-aware ★ travel your graph through time ★ time-scoped iteration of vertices and edges ★ temporal graph comparison
  13. travel through time ➡ code: FluxGraph fg = new FluxGraph();
  14. travel through time ➡ code: FluxGraph fg = new FluxGraph(); Vertex davy = fg.addVertex(); davy.setProperty("name", "Davy"); ➡ graph: Davy
  15. travel through time ➡ code: FluxGraph fg = new FluxGraph(); Vertex davy = fg.addVertex(); davy.setProperty("name", "Davy"); Vertex peter = ... ➡ graph: Davy; Peter
  16. travel through time ➡ code: FluxGraph fg = new FluxGraph(); Vertex davy = fg.addVertex(); davy.setProperty("name", "Davy"); Vertex peter = ... Vertex michael = ... ➡ graph: Davy; Peter; Michael
  17. travel through time ➡ code: FluxGraph fg = new FluxGraph(); Vertex davy = fg.addVertex(); davy.setProperty("name", "Davy"); Vertex peter = ... Vertex michael = ... Edge e1 = fg.addEdge(davy, peter, "knows"); ➡ graph: Davy -knows-> Peter; Michael
  18. travel through time ➡ code: Date checkpoint = new Date(); ➡ graph: Davy -knows-> Peter; Michael
  19. travel through time ➡ code: Date checkpoint = new Date(); davy.setProperty("name", "David"); ➡ graph: Davy -knows-> Peter; Michael
  20. travel through time ➡ code: Date checkpoint = new Date(); davy.setProperty("name", "David"); ➡ graph: David -knows-> Peter; Michael
  21. travel through time ➡ code: Date checkpoint = new Date(); davy.setProperty("name", "David"); Edge e2 = fg.addEdge(davy, michael, "knows"); ➡ graph: David -knows-> Peter; David -knows-> Michael
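  22. travel through time ➡ timeline: checkpoint graph (Davy -knows-> Peter; Michael), then current graph (David -knows-> Peter; David -knows-> Michael) ➡ by default, the graph is read at the current time
  23. travel through time ➡ timeline: checkpoint graph (Davy -knows-> Peter; Michael), then current graph (David -knows-> Peter; David -knows-> Michael) ➡ code: fg.setCheckpointTime(checkpoint); ➡ the graph is now read as of the checkpoint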
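
The idea behind reading a graph "as of" a checkpoint can be illustrated without FluxGraph or Datomic. The sketch below keeps an append-only history for a single property and answers reads for any point in time. The class `TimeTravelSketch` and its methods are hypothetical, logical timestamps (plain `long` values) stand in for `Date`, and nothing here reflects how FluxGraph is actually implemented.

```java
import java.util.Map;
import java.util.TreeMap;

final class TimeTravelSketch {

    /** Append-only history of one property: write time -> value written at that time. */
    private final TreeMap<Long, String> history = new TreeMap<>();

    /** Records a new value; earlier values stay available for time travel. */
    void set(long time, String value) {
        history.put(time, value);
    }

    /** Value visible at the given time: the latest write at or before it, or null if none. */
    String valueAt(long time) {
        Map.Entry<Long, String> entry = history.floorEntry(time);
        return entry == null ? null : entry.getValue();
    }

    /** Value visible "now", i.e. the most recent write, or null if nothing was written. */
    String currentValue() {
        return history.isEmpty() ? null : history.lastEntry().getValue();
    }

    public static void main(String[] args) {
        TimeTravelSketch name = new TimeTravelSketch();
        name.set(1L, "Davy");        // initial name
        long checkpoint = 2L;        // plays the role of "Date checkpoint = new Date()"
        name.set(3L, "David");       // rename after the checkpoint

        System.out.println(name.currentValue());       // read at the current time
        System.out.println(name.valueAt(checkpoint));  // read as of the checkpoint
    }
}
```

Reading at the current time yields the renamed value, while reading at the checkpoint still yields the original one, which mirrors the two graphs shown on the timeline slides.
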
  24. time-scoped iteration ➡ timeline: Davy (t1) -change-> Davy’ (t2) -change-> Davy’’ (t3) -change-> Davy’’’ (tcurrent) ➡ how to find the version of the vertex you are interested in?
  25. time-scoped iteration ➡ timeline: Davy (t1), Davy’ (t2), Davy’’ (t3), Davy’’’ (tcurrent), linked by next / previous
  26. time-scoped iteration ➡ timeline: Davy (t1), Davy’ (t2), Davy’’ (t3), Davy’’’ (tcurrent), linked by next / previous ➡ code: Vertex previousDavy = davy.getPreviousVersion();
  27. time-scoped iteration ➡ timeline: Davy (t1), Davy’ (t2), Davy’’ (t3), Davy’’’ (tcurrent), linked by next / previous ➡ code: Vertex previousDavy = davy.getPreviousVersion(); Iterable<Vertex> allDavy = davy.getNextVersions();
  28. time-scoped iteration ➡ timeline: Davy (t1), Davy’ (t2), Davy’’ (t3), Davy’’’ (tcurrent), linked by next / previous ➡ code: Vertex previousDavy = davy.getPreviousVersion(); Iterable<Vertex> allDavy = davy.getNextVersions(); Iterable<Vertex> selDavy = davy.getPreviousVersions(filter);
  29. time-scoped iteration ➡ timeline: Davy (t1), Davy’ (t2), Davy’’ (t3), Davy’’’ (tcurrent), linked by next / previous ➡ code: Vertex previousDavy = davy.getPreviousVersion(); Iterable<Vertex> allDavy = davy.getNextVersions(); Iterable<Vertex> selDavy = davy.getPreviousVersions(filter); Interval valid = davy.getTimerInterval();
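
A version chain with previous / next navigation and a validity interval per version can be sketched as below. `VersionChainSketch`, `Version`, `change`, `previousOf` and `nextOf` are invented names that only imitate the shape of the FluxGraph calls on the slides; times are logical `long` values and are assumed to be strictly increasing.

```java
import java.util.ArrayList;
import java.util.List;

final class VersionChainSketch {

    /** One immutable version of an element, valid in the half-open interval [from, to). */
    static final class Version {
        final String name;
        final long from;
        final long to;

        Version(String name, long from, long to) {
            this.name = name;
            this.from = from;
            this.to = to;
        }
    }

    /** Versions ordered from oldest to newest. */
    private final List<Version> versions = new ArrayList<>();

    /** Records a change: closes the interval of the latest version and opens a new one. */
    void change(long time, String name) {
        if (!versions.isEmpty()) {
            int lastIndex = versions.size() - 1;
            Version last = versions.get(lastIndex);
            versions.set(lastIndex, new Version(last.name, last.from, time));
        }
        // Long.MAX_VALUE marks a version that is still current.
        versions.add(new Version(name, time, Long.MAX_VALUE));
    }

    Version get(int index) {
        return versions.get(index);
    }

    /** The version right before the given one, or null for the oldest version. */
    Version previousOf(int index) {
        return index > 0 ? versions.get(index - 1) : null;
    }

    /** All versions that come after the given one, oldest first. */
    List<Version> nextOf(int index) {
        return new ArrayList<>(versions.subList(index + 1, versions.size()));
    }

    public static void main(String[] args) {
        VersionChainSketch davy = new VersionChainSketch();
        davy.change(1L, "Davy");
        davy.change(2L, "Davy'");
        davy.change(3L, "Davy''");

        System.out.println(davy.previousOf(2).name);                     // version before the newest
        System.out.println(davy.nextOf(0).size());                       // versions after the oldest
        System.out.println(davy.get(0).from + " .. " + davy.get(0).to);  // validity of the oldest
    }
}
```

Storing an explicit interval per version is one possible design; it makes "which version was valid at time t" a simple range check at the cost of rewriting the previous version's end time on every change.
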
  30. time-scoped iteration ➡ when does an element change? ➡ vertex: ★ setting or removing a property ★ being added to or removed from an edge ★ being removed
  31. time-scoped iteration ➡ when does an element change? ➡ vertex: ★ setting or removing a property ★ being added to or removed from an edge ★ being removed ➡ edge: ★ setting or removing a property ★ being removed
  32. time-scoped iteration ➡ when does an element change? ➡ vertex: ★ setting or removing a property ★ being added to or removed from an edge ★ being removed ➡ edge: ★ setting or removing a property ★ being removed ➡ ... and each element is time-scoped!
  33. temporal graph comparison ➡ current graph: David -knows-> Peter; David -knows-> Michael ➡ checkpoint graph: Davy -knows-> Peter; Michael ➡ what changed?
  34. temporal graph comparison ➡ difference(A, B) = union(A, B) - B ➡ ... as an (immutable) graph!
  35. temporal graph comparison ➡ difference(A, B) = union(A, B) - B ➡ ... as an (immutable) graph! ➡ example: difference(current graph, checkpoint graph) = David with a knows edge
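
The difference formula can be tried out on plain sets. In this sketch a graph is flattened into a set of fact strings (property values and edges), which is an assumption made purely for illustration; FluxGraph returns the difference as a graph, and `GraphDifferenceSketch` with its vertex ids `v1`, `v2`, `v3` is hypothetical.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.Set;

final class GraphDifferenceSketch {

    /** difference(A, B) = union(A, B) - B, applied to sets of graph facts. */
    static Set<String> difference(Set<String> a, Set<String> b) {
        Set<String> result = new LinkedHashSet<>(a);
        result.addAll(b);     // union(A, B)
        result.removeAll(b);  // ... minus B
        return result;
    }

    /** Facts of the current graph: David knows Peter and Michael. */
    static Set<String> currentGraph() {
        return new LinkedHashSet<>(Arrays.asList(
                "v1.name=David", "v1-knows->v2", "v1-knows->v3"));
    }

    /** Facts of the checkpoint graph: Davy knows Peter only. */
    static Set<String> checkpointGraph() {
        return new LinkedHashSet<>(Arrays.asList(
                "v1.name=Davy", "v1-knows->v2"));
    }

    public static void main(String[] args) {
        // Everything that holds now but did not hold at the checkpoint.
        System.out.println(difference(currentGraph(), checkpointGraph()));
    }
}
```

The result contains the renamed vertex and the new knows edge, which corresponds to the difference graph on the slide. On sets the formula reduces to A minus B, so the union step is kept only to follow the slide's definition literally.
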
  36. use case: longitudinal patient data ➡ timeline t1 ... t5: one patient vertex per time point, with edges to smoking, cancer and death nodes appearing over time
  37. use case: longitudinal patient data ➡ historical data for 15,000 patients over a period of 10 years (2001-2010)
  38. use case: longitudinal patient data ➡ historical data for 15,000 patients over a period of 10 years (2001-2010) ➡ example analysis: ★ if a male patient is no longer smoking in 2005 ★ what are the chances of getting lung cancer in 2010, comparing • patients that smoked before 2005 • patients that never smoked
  39. use case: longitudinal patient data ➡ get all male non-smokers in 2005 fg.setCheckpointTime(new DateTime(2005,12,31).toDate());
  40. use case: longitudinal patient data ➡ get all male non-smokers in 2005 fg.setCheckpointTime(new DateTime(2005,12,31).toDate()); Iterator<Vertex> males = fg.getVertices("gender", "male").iterator();
  41. use case: longitudinal patient data ➡ get all male non-smokers in 2005 fg.setCheckpointTime(new DateTime(2005,12,31).toDate()); Iterator<Vertex> males = fg.getVertices("gender", "male").iterator(); while (males.hasNext()) { Vertex p2005 = males.next(); boolean smoking2005 = p2005.getEdges(OUT, "smokingStatus").iterator().hasNext(); }
  42. use case: longitudinal patient data ➡ which patients were smoking before 2005? boolean smokingBefore2005 = ((FluxVertex)p2005).getPreviousVersions(new TimeAwareFilter() { public TimeAwareElement filter(TimeAwareVertex element) { return element.getEdges(OUT, "smokingStatus").iterator().hasNext() ? element : null; } }).iterator().hasNext();
  43. use case: longitudinal patient data ➡ which patients have cancer in 2010 ➡ code: Graph g = fg.difference(smokerws, time2010.toDate(), time2005.toDate()); (smokerws: working set of smokers)
  44. use case: longitudinal patient data ➡ which patients have cancer in 2010 ➡ code: Graph g = fg.difference(smokerws, time2010.toDate(), time2005.toDate()); (smokerws: working set of smokers) ➡ extract the patients that have an edge to the cancer node
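
Once the three flags per patient have been extracted from the graph (smoked before 2005, smoking in 2005, cancer in 2010), the comparison itself is a simple ratio per cohort. The sketch below assumes the patient list is already restricted to male patients; `CohortSketch`, `Patient` and the toy records are invented for illustration and are not results from the talk.

```java
import java.util.Arrays;
import java.util.List;

final class CohortSketch {

    /** The three facts per patient that the example analysis needs. */
    static final class Patient {
        final boolean smokedBefore2005;
        final boolean smoking2005;
        final boolean cancer2010;

        Patient(boolean smokedBefore2005, boolean smoking2005, boolean cancer2010) {
            this.smokedBefore2005 = smokedBefore2005;
            this.smoking2005 = smoking2005;
            this.cancer2010 = cancer2010;
        }
    }

    /**
     * Fraction of patients with cancer in 2010 among the non-smokers of 2005
     * that have the given smoking history. Returns NaN for an empty cohort.
     */
    static double cancerRate(List<Patient> patients, boolean smokedBefore2005) {
        int cohort = 0;
        int cases = 0;
        for (Patient p : patients) {
            if (p.smoking2005 || p.smokedBefore2005 != smokedBefore2005) {
                continue;  // still smoking in 2005, or in the other cohort
            }
            cohort++;
            if (p.cancer2010) {
                cases++;
            }
        }
        return cohort == 0 ? Double.NaN : (double) cases / cohort;
    }

    /** Made-up records: four ex-smokers, four never-smokers, one patient still smoking. */
    static List<Patient> toyPatients() {
        return Arrays.asList(
                new Patient(true, false, true),
                new Patient(true, false, true),
                new Patient(true, false, false),
                new Patient(true, false, false),
                new Patient(false, false, true),
                new Patient(false, false, false),
                new Patient(false, false, false),
                new Patient(false, false, false),
                new Patient(true, true, true));
    }

    public static void main(String[] args) {
        List<Patient> patients = toyPatients();
        System.out.println("ex-smokers:    " + cancerRate(patients, true));
        System.out.println("never-smokers: " + cancerRate(patients, false));
    }
}
```

With the toy records the ex-smoker cohort has a rate of 0.5 and the never-smoker cohort 0.25; these numbers only demonstrate the calculation and say nothing about the real data set.
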
  45. Questions?