Multiplatform Solution for Graph Datasources

One of the top banks in Europe needed a system that would provide better performance, scale almost linearly as the volume of information to be analyzed grows, and allow the processes currently executed on the host to be moved to a Big Data infrastructure. Over the course of a year we have worked on a system that gives users greater agility, flexibility and simplicity when viewing information for profiling, and that can now analyze the structure of profile data. It is a powerful way to make online queries against a graph database, integrated with Apache Spark and several graph libraries. Essentially, we obtain all the necessary information through Cypher queries sent to a Neo4j database.

Using the latest Big Data technologies, such as Spark DataFrames, HDFS, Stratio Intelligence and Stratio Crossdata, we have developed a solution that can obtain critical information from multiple datasources, such as text files or graph databases.

Published in: Data & Analytics

  1. @StratioBD Multiplatform Spark solution for Graph datasources, Stratio. Javier Domínguez
  2. Javier Dominguez Montes studied computer engineering at the ULPGC. He is passionate about Scala, Python and all Big Data technologies and is currently part of the Data Science team at Stratio Big Data, working with ML algorithms and profiling analysis based on Spark.
  3. Graph use cases; DataStores; Machine learning; Dataset; Main process explanation; Notebooks show off; Business example; Results; What's next?
  4. @StratioBD
  5. [Figure: data volume timeline. Years: 80's, 2000, 2010, 2015, 2020. Volumes: 20 GB - 100 GB, 500 GB - 2 TB, 4 TB - 8 TB, 100 TB, > 10 PB]
  6. VALUE IS THE DATA / VALUE IS UNDERSTANDING THE DATA
  7. DO NOT STAY ON THE SURFACE OF KNOWLEDGE
  8. • Graph use cases • DataStores • Machine learning @StratioBD
  9. MACHINE LEARNING LIFE CYCLE WITH BIG DATA: An example of how to exploit a massive database at different stages and through several graph technologies.
  10. Machine Learning life cycle. We show how a data scientist can take advantage of a graph database through different datasources and technologies thanks to our solution, using a massive dataset as an example. We query the datasource from different technologies: • GraphX • GraphFrames • Neo4j. And finally we apply Machine Learning to our information!
  11. USE CASES
  12. Batch Queries. Making use of a massive graph datasource implies running batch queries over it. We will need to run them with our distributed technologies... The easier the better. Motifs filter example:

          import org.apache.spark.sql.DataFrame
          import org.graphframes._

          // Assumes usersDf (with an "id" column) and relationshipsDf (with "src" and "dst" columns) are DataFrames
          val g: GraphFrame = GraphFrame(usersDf, relationshipsDf)
          // Search for chains person_1 -> person_2 -> technology
          val motifs: DataFrame = g.find("(person_1)-[relation]->(person_2); (person_2)-[abilities]->(technology)")
          motifs.show()
          // More complex queries can be expressed by applying filters.
          motifs.filter("person_1.name = 'Javier' AND technology.name = 'Neo4j'").show()
  13. Online queries. Most of our clients or teammates will need fast and easy access to the information. We need a way to make easy queries and, of course, a graphical representation of our data! We also need microservices, such as REST operations over our datastore.
  14. DATASTORES
  15. Spark: Apache Spark is a fast and general engine for large-scale data processing. GraphX: Spark API for the management and distributed computation of graphs. It comes with a great variety of graph algorithms: • Connected components • PageRank • Triangle count • SVD++. GraphFrames: It aims to provide both the functionality of GraphX and extended functionality taking advantage of Spark DataFrames. This extended functionality includes motif finding and highly expressive graph queries.
  16. Neo4j: Neo4j is a highly scalable native graph database that leverages data relationships as first-class entities. Big data alone used to be enough, but enterprise leaders need more than just volumes of information to make bottom-line decisions. You need real-time insights into how data is related.
  17. MACHINE LEARNING
  18. Machine learning. It's possible to quickly and automatically produce models that can analyze bigger, more complex data and deliver faster, more accurate results, even on a very large scale. The result? High-value predictions that can guide better decisions and smart actions in real time without human intervention. SVD: It will relate all the existing objects in our dataset and infer possible new behaviors.
  19. • Dataset • Main process explanation • Notebooks show off @StratioBD
  20. STRATIO INTELLIGENCE • Integration of different Open Source libraries of distributed machine learning algorithms. • Development environment adapted to each data scientist. • Real-time decisions based on models built with machine learning algorithms. • Integrated with all components of the Stratio Big Data Platform. • Comprehensive knowledge lifecycle management.
  21. DATASET
  22. Freebase, Google. Total triplets: 1.9 billion. Freebase aimed to create a global resource that allowed people (and machines) to access common information more effectively. Its model is based on converting statements about resources into subject-predicate-object expressions, which are called triplets. Subject: the resource, what we are describing. Predicate: a property of the subject or a relationship with the object value. Object value: the property's value or the related subject. Examples: <'Cristiano Ronaldo'> <'Scores in 2014/2015'> 61 . <'Cristiano Ronaldo'> <'Born in'> 'Portugal' .
  23. PROCESS EXPLANATION
  24. [Process diagram. Labels: Transforms; Cast; RDF Dataset; GraphFrames; Batch query; Neo4j; GraphX; Extracts sample & transforms; Online query]
  25. [Process diagram. Labels: SVD; K-core Decomposition; Strongly connected graph; Apply algorithms; Behavior Inference; Graph; Subject equality]
  26. K-Core process. A k-core of a graph G is a maximal connected subgraph of G in which all vertices have degree at least k. Equivalently, it is one of the connected components of the subgraph of G formed by repeatedly deleting all vertices of degree less than k. Objective: remove all nodes with fewer than k connections. In the end, we want only the most representative and connected elements in our graph. In our use case we used k = 5.
  27. NOTEBOOKS SHOW OFF
  28. BUSINESS EXAMPLE
  29. Jaccard Graph Clustering. Node clustering based on concrete relations, optimized for Big Data environments. We've developed a straightforward functionality that detects patterns and clusters data in a graph database through daily machine learning processes. Technologies: Neo4j, Scala, Java, HDFS / Parquet, Spark / GraphX. Graph functionalities: Jaccard indexation, connected components. 40B Jaccard distance calculations in the daily process. 400K-node graph clustering.
  30. • Results • What's next? @StratioBD
  31. Semantic search engine: Include ElasticSearch to run text searches as a search engine. Apply more Machine Learning algorithms: • Connected components: as we've already done, try to cluster information based on its relationships. • PageRank: measure the importance of a subject. • Triangle counting: check possible triangle relationships inside our dataset to avoid redundancy. New Graph use cases: • Fraud detection • Recommendation system • Profiling
  32. THANK YOU. UNITED STATES Tel: (+1) 408 5998830. EUROPE Tel: (+34) 91 828 64 73. contact@stratio.com www.stratio.com @StratioBD
  33. people@stratio.com WE ARE HIRING @StratioBD
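
The K-Core process described on slide 26 (repeatedly deleting every vertex of degree less than k) can be illustrated with a minimal in-memory sketch. It is not the deck's Spark implementation, which the slides do not show: the object name `KCoreSketch`, the adjacency-map representation and the helper `fromEdges` are assumptions made for this example. It returns the pruned subgraph and does not split it into connected components.

```scala
object KCoreSketch {
  // Hypothetical representation: undirected graph as vertex id -> set of neighbour ids.
  type Graph = Map[Int, Set[Int]]

  // Build a symmetric adjacency map from an undirected edge list.
  def fromEdges(edges: Seq[(Int, Int)]): Graph =
    edges
      .flatMap { case (a, b) => Seq(a -> b, b -> a) }
      .groupBy(_._1)
      .map { case (v, pairs) => v -> pairs.map(_._2).toSet }

  // Repeatedly delete every vertex whose degree is below k until none is left.
  @annotation.tailrec
  def kCore(graph: Graph, k: Int): Graph = {
    val weak = graph.collect { case (v, ns) if ns.size < k => v }.toSet
    if (weak.isEmpty) graph
    else {
      // Drop the weak vertices and remove them from the remaining neighbour sets.
      val pruned = graph
        .filter { case (v, _) => !weak.contains(v) }
        .map { case (v, ns) => v -> (ns -- weak) }
      kCore(pruned, k)
    }
  }
}
```

For instance, `KCoreSketch.kCore(KCoreSketch.fromEdges(Seq(1 -> 2, 2 -> 3, 3 -> 1, 3 -> 4)), 2)` keeps the triangle of vertices 1, 2 and 3 and drops the pendant vertex 4. The deck's use case applied the same idea with k = 5.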

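The Jaccard Graph Clustering on slide 29 combines Jaccard indexation with connected components. As a rough illustration of that combination, the sketch below links nodes whose relation sets have a Jaccard index above a threshold and returns the connected components as clusters. Everything here is an assumption for illustration: the object name `JaccardClusteringSketch`, the `threshold` parameter and the quadratic pairwise comparison are not taken from the deck, whose Spark / GraphX implementation is not shown.

```scala
object JaccardClusteringSketch {
  // Jaccard index of two sets: |A ∩ B| / |A ∪ B| (taken as 0.0 when both sets are empty).
  def jaccard[A](a: Set[A], b: Set[A]): Double = {
    val unionSize = (a union b).size
    if (unionSize == 0) 0.0 else (a intersect b).size.toDouble / unionSize
  }

  // Link every pair of nodes whose relation sets have a Jaccard index of at least
  // `threshold`, then return the connected components of that similarity graph.
  def cluster(relations: Map[String, Set[String]], threshold: Double): Set[Set[String]] = {
    val nodes = relations.keys.toVector.sorted
    val links = for {
      i <- nodes.indices
      j <- (i + 1) until nodes.size
      if jaccard(relations(nodes(i)), relations(nodes(j))) >= threshold
    } yield (nodes(i), nodes(j))

    // Each node starts in its own component; every link merges two components.
    val initial: Map[String, Set[String]] = nodes.map(n => n -> Set(n)).toMap
    val componentOf = links.foldLeft(initial) { case (comp, (a, b)) =>
      val merged = comp(a) ++ comp(b)
      merged.foldLeft(comp)((acc, n) => acc.updated(n, merged))
    }
    componentOf.values.toSet
  }
}
```

With `Map("a" -> Set("x", "y", "z"), "b" -> Set("x", "y"), "c" -> Set("q"))` and a threshold of 0.5, nodes `a` and `b` end up in one cluster and `c` in its own. At the scale quoted on the slide, the pairwise step would need a distributed formulation rather than this nested loop.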