Graph Analytics on Data from Meetup.com
1. Copyright © 2017, Oracle and/or its affiliates. All rights reserved. |
Improve Your Experience by Using Graph Analytics
Slides from my session at “Women Who Code” Meetup | 2018-05-23 | Berlin
Karin Patenge | @kpatenge | karin.patenge@oracle.com
Business Development Manager Technology (Europe North)
Oracle Deutschland B.V. & Co. KG
2. Oracle Code Berlin
June 12th 2018
Free full-day event @ Funkhaus Berlin
https://developer.oracle.com/code/berlin-june-2018
Including panel discussion:
Go for IT! Make Diversity Matter: Digital Transformation as a Chance for Women in Coding
3. Agenda
• Data of Interest
• Questions of Interest
• Data Processing Workflow
• Key Takeaways
• Q&A
4. Just briefly about myself
Since Nov. 2016: Business Development Manager focusing on new(er)/emerging technologies & modern data management platforms for Europe North
Joined Oracle in 2007: As Sales Consultant for Core Tech Products. Special topics: Spatial Technologies, Graph & Semantic Technologies, NoSQL, …
Before Oracle: Since 1989 worked as Computer Scientist in several IT roles | depts for Radio Technology Manufacturer | Public Sector | Pharma (Schering, Bayer Health Care)
5. Setting the Scene
6. Data of Interest
• Data retrieval via REST API
https://www.meetup.com/meetup_api
• Different API methods & versions
• API Key required
• Sample request
• Data returned as JSON
Relations shown in the data model diagram: is_interested_in, is_member_of, is_assigned_to, has_registered_for, takes_place_in, is_located_in
Direct relations not (yet) analyzed
7. Questions of Interest
• Which Meetup groups are most active in terms of:
– # members
– # events
– # event attendees
• Who and where are influencers in the Meetup community?
• Where are connections between the Meetup groups in different locations?
• Which topics are “hot”?
• How close/similar are groups?
• …
8. Data Processing Workflow: Overview
• Step 1, Retrieve & Prepare: prepare source data
– Retrieve data via REST API using R and convert JSON → CSV → OPV/OPE
• Step 2, Load & Build: load nodes and edges data into a graph
– Use Oracle NoSQL DB as graph data store
• Step 3, Analyze: analyze graph data
– Using Graph Analytics Engine (PGX) and Property Graph Query Language (PGQL)
• Step 4, Visualize: visualize graph data
– Using Cytoscape
• Step 5, Results: summarize results
9. Important Note
• Code and result (data) files can be downloaded from:
– https://github.com/karinpatenge/AnalyticsandDataSummit2018
10. Working Environment
• Available for free:
Oracle Big Data Lite VM 4.11 running in Oracle VirtualBox
http://www.oracle.com/technetwork/database/bigdata-appliance/oracle-bigdatalite-2104726.html
– Big Data Spatial and Graph (BDSG) 2.4 including Property Graph Query Language (PGQL) 1.0
  • Gremlin, Apache Groovy Shell
  • Zeppelin Notebook with PGX Interpreter
– Oracle NoSQL Database (Minimal instance with 1 node, no replication, aka kvlite)
– RStudio
  • Additional R packages loaded
– Cytoscape 3.6.0
  • Big Data Spatial and Graph 2.4 support installed
11. Modeling Data as Graphs
The more connected the data is, the better a Graph fits
Oracle NoSQL DB with Big Data Spatial and Graph
Graphic source: http://www.ateam-oracle.com/intro-to-graphs-at-oracle/
12. What is a Property Graph?
• A set of nodes (aka vertices)
– each vertex has a unique identifier
– each vertex has a set of in/out edges
– each vertex has a collection of key-value properties
• A set of edges
– each edge has a unique identifier
– each edge has a head/tail vertex
– each edge has a label denoting type of relationship between two vertices
– each edge has a collection of key-value properties
• Blueprints Java APIs
• Implementations
– Oracle (Spatial and Graph, Big Data Spatial and Graph), Neo4j, DataStax (Titan), InfiniteGraph, Dex, Sail, MongoDB, …
https://github.com/tinkerpop/blueprints/wiki/Property-Graph-Model
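The property graph model described on this slide can be sketched in a few lines of Python. This is only an illustration of the data structure, not the Blueprints or Oracle API: the class names, method names and sample IDs are assumptions made for this example, while the property keys and the edge label are taken from the slides.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class Vertex:
    id: int                                              # unique identifier
    properties: Dict[str, Any] = field(default_factory=dict)
    out_edges: List[int] = field(default_factory=list)   # IDs of outgoing edges
    in_edges: List[int] = field(default_factory=list)    # IDs of incoming edges


@dataclass
class Edge:
    id: int        # unique identifier
    label: str     # type of relationship
    tail: int      # ID of the source (outgoing) vertex
    head: int      # ID of the destination (incoming) vertex
    properties: Dict[str, Any] = field(default_factory=dict)


class PropertyGraph:
    """Hypothetical in-memory container, for illustration only."""

    def __init__(self):
        self.vertices: Dict[int, Vertex] = {}
        self.edges: Dict[int, Edge] = {}

    def add_vertex(self, vertex_id, **properties):
        vertex = Vertex(vertex_id, dict(properties))
        self.vertices[vertex_id] = vertex
        return vertex

    def add_edge(self, edge_id, tail_id, head_id, label, **properties):
        edge = Edge(edge_id, label, tail_id, head_id, dict(properties))
        self.edges[edge_id] = edge
        # Register the edge on both endpoints so each vertex knows its in/out edges.
        self.vertices[tail_id].out_edges.append(edge_id)
        self.vertices[head_id].in_edges.append(edge_id)
        return edge


graph = PropertyGraph()
graph.add_vertex(1, type="Group", group_name="Women Who Code Berlin")
graph.add_vertex(2, type="City", city_name="Berlin")
graph.add_edge(100, 1, 2, "is_located_in")
```

Edges store vertex IDs rather than object references, which keeps the records flat and close to the node and edge files used later in the workflow.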
13. Data Processing Workflow: Step 1 of 5, Retrieve & Prepare
Prepare source data
• Retrieve data via REST API using R and convert JSON → CSV → OPV/OPE
14. Requesting Data via Meetup REST API
Request URL (Example)
• https://api.meetup.com/Women-Who-Code-Berlin-Germany/events?&key=506c1916524f6d3a6c782432645f5eb&status=past,upcoming&omit=description
• Important note:
– For most requests, data are only returned for the city that matches the location assigned to the user profile possessing the API key
Response (JSON) [screenshot]
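The request URL above can be assembled programmatically. The talk used R for this step; the following is a minimal Python sketch that only builds the URL and does not send a request. The function name `build_events_url` and the placeholder `YOUR_API_KEY` are assumptions for illustration; the path and the `key`, `status` and `omit` parameters are the ones from the example URL.

```python
from urllib.parse import urlencode

API_BASE = "https://api.meetup.com"


def build_events_url(group_urlname, api_key,
                     statuses=("past", "upcoming"), omit=("description",)):
    """Build an events request URL shaped like the example on the slide."""
    query = urlencode(
        {
            "key": api_key,
            "status": ",".join(statuses),
            "omit": ",".join(omit),
        },
        safe=",",  # keep the comma-separated lists readable
    )
    return f"{API_BASE}/{group_urlname}/events?{query}"


url = build_events_url("Women-Who-Code-Berlin-Germany", "YOUR_API_KEY")
print(url)
```

Keeping the API key as a parameter avoids hard-coding it in scripts that are shared or published.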
15. Run R Code via RStudio
16. Intermediate Results: CSV text files
• Transform JSON into a flat structure: One record per instance of information type
– Cities
– Categories
– Groups
– Members
– Events
– Topics
• Store data in .csv
– Not required but convenient to have as intermediate format
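The flattening step can be sketched as follows. The talk did this in R; this Python version is an illustration only. The sample records, their field names and all values are made up for the example and do not come from the Meetup API or from the talk's data.

```python
import csv
import io
import json

# Made-up sample response: two event records with a nested "group" object.
SAMPLE_RESPONSE = json.loads("""
[
  {"id": "e1", "name": "Graph Analytics", "yes_rsvp_count": 42,
   "group": {"id": 7, "name": "Sample Group"}},
  {"id": "e2", "name": "Intro to R", "yes_rsvp_count": 17,
   "group": {"id": 7, "name": "Sample Group"}}
]
""")


def flatten(record, prefix=""):
    """Turn nested objects into one flat dict, e.g. group.name -> group_name."""
    flat = {}
    for key, value in record.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, prefix=f"{name}_"))
        else:
            flat[name] = value
    return flat


def to_csv(records):
    """Write one CSV row per record; columns are the union of all flat keys."""
    rows = [flatten(record) for record in records]
    fieldnames = sorted({key for row in rows for key in row})
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


events_csv = to_csv(SAMPLE_RESPONSE)
print(events_csv)
```

Running the same function once per information type (cities, groups, members, and so on) would give one CSV file per type, as listed on the slide.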
17. Final Results: Flat File Structure for Property Graph
• Extract attribute values from flat structure
• Append each as single record into nodes and edges files
[Screenshots: Nodes (aka Vertices) in flat file format; Edges in flat file format]
18. Useful Tips
• When creating nodes and edges files (.opv, .ope)
– Assign the right data type to attributes
– Check for NULL values
– Replace special characters
– Remove duplicates
– Check pattern of IDs used in source(s). Generate surrogate IDs if necessary.
  • Keep original ID by storing it as property if necessary
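Several of these tips can be combined in one small preparation routine. The sketch below is an assumption about how such a step might look, written in Python rather than the R used in the talk. It does not write the .opv format itself; the function names, the `original_id` property name and the sample records are hypothetical.

```python
import re


def clean_text(value):
    """Replace commas and line breaks, which would break a comma-separated record."""
    return re.sub(r"\s*[,\r\n]+\s*", " ", value).strip()


def build_nodes(records, node_type, id_field, start_id=1):
    """Return (nodes, id_map): nodes with surrogate IDs and a source-ID lookup."""
    nodes = []
    id_map = {}          # original ID -> surrogate ID, needed later to build edges
    next_id = start_id
    for record in records:
        original_id = record.get(id_field)
        # Skip records without an ID (NULL check) and records already seen (duplicates).
        if original_id is None or original_id in id_map:
            continue
        # Keep the original ID as a property next to the generated surrogate ID.
        props = {"type": node_type, "original_id": str(original_id)}
        for key, value in record.items():
            if key == id_field or value is None:
                continue
            props[key] = clean_text(value) if isinstance(value, str) else value
        id_map[original_id] = next_id
        nodes.append((next_id, props))
        next_id += 1
    return nodes, id_map


# Made-up sample records: one duplicate, one without ID, one with a NULL attribute.
groups = [
    {"id": "grp-a", "group_name": "Sample Group, Berlin", "group_members": 1500},
    {"id": "grp-a", "group_name": "Sample Group, Berlin", "group_members": 1500},
    {"id": None, "group_name": "No ID"},
    {"id": "grp-b", "group_name": "Another\nGroup", "group_members": None},
]
nodes, id_map = build_nodes(groups, "Group", "id")
for node_id, props in nodes:
    print(node_id, props)
```

The returned `id_map` is what an edge-building step would use to translate source IDs into the surrogate vertex IDs.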
19. Data Processing Workflow: Step 2 of 5, Load & Build
Load nodes and edges data into a graph
• Use Oracle NoSQL DB to store the graph
20. Architecture of Property Graph Support
[Architecture diagram; labels recovered from the figure:]
• Access: REST/Web Service/Notebooks | Java, Groovy, Python, … | Java APIs | Java APIs/JDBC/SQL/PLSQL
• Graph Analytics: Parallel In-Memory Graph Analytics (PGX) / Graph Querying (PGQL)
• Graph Data Access Layer (DAL): Blueprints & Lucene/SolrCloud | RDF (RDF/XML, N-Triples, N-Quads, TriG, N3, JSON)
• Property Graph formats: GraphML, GML, GraphSON, Flat Files
• Scalable and Persistent Storage Management: Oracle NoSQL Database | Oracle RDBMS | Apache HBase
• Also shown: Apache Spark
21. Import Nodes and Edges into a Property Graph
// Start Groovy Shell connecting to Oracle NoSQL DB
cd /opt/oracle/oracle-spatial-graph/property_graph/dal/groovy
./gremlin-opg-nosql.sh
server = new ArrayList();
server.add("bigdatalite.localdomain:5000");
// Create a graph config that contains the graph name "meetup"
// Name of KV store is "kvstore"
// Make sure to add all vertex/edge properties used in PGQL queries
cfg = GraphConfigBuilder.forPropertyGraphNosql()
.setName("meetup")
.setStoreName("kvstore")
.setHosts(server)
.addVertexProperty("type", PropertyType.STRING, "NA")
.addVertexProperty("city_name", PropertyType.STRING, "NA")
.addVertexProperty("city_country", PropertyType.STRING, "NA")
.addVertexProperty("city_member_count", PropertyType.INTEGER, 0)
.addVertexProperty("group_country", PropertyType.STRING, "NA")
.addVertexProperty("group_visibility", PropertyType.STRING, "NA")
.addVertexProperty("group_members", PropertyType.INTEGER, 0)
.addVertexProperty("group_name", PropertyType.STRING, "NA")
.addVertexProperty("member_name", PropertyType.STRING, "NA")
.addVertexProperty("topic_name", PropertyType.STRING, "NA")
.addVertexProperty("topic_urlkey", PropertyType.STRING, "NA")
.addVertexProperty("event_yes_rsvp_count", PropertyType.INTEGER, 0)
.addVertexProperty("event_rating_count", PropertyType.INTEGER, 0)
.addVertexProperty("event_rating_average", PropertyType.INTEGER, 0)
.addVertexProperty("event_waitlist_count", PropertyType.INTEGER, 0)
.hasEdgeLabel(true)
.setLoadEdgeLabel(true)
.setMaxNumConnections(2).build();
// Create an instance of the graph
opg = OraclePropertyGraph.getInstance(cfg);
opg.setClearTableDOP(2);
opg.clearRepository();
opg.getKVStoreConfig();
// Create an instance for the graph loader
opgdl=OraclePropertyGraphDataLoader.getInstance();
vfile="/home/oracle/Documents/Meetup/data/meetup.opv";
efile="/home/oracle/Documents/Meetup/data/meetup.ope";
// Load data into the graph
opgdl.loadData(opg, vfile, efile, 2);
// Do some checks
// Count vertices and edges
opg.countVertices();
opg.countEdges();
// Get vertices and edges
opg.getVertices();
opg.getEdges();
...
// Shut down instance and close shell
opg.shutdown();
:q
22. Data Processing Workflow: Step 3 of 5, Analyze
Analyze graph data
• Using Graph Analytics Engine (PGX) and Property Graph Query Language (PGQL)
23. How to Analyze Property Graph Data
PGX – Graph Analytics Engine
• Toolkit for In-Memory, Parallel Graph Analysis containing
– PGX shell
– Analyst API with a large collection of built-in algorithms
– and more
• Developed by Oracle Labs
• https://docs.oracle.com/cd/E56133_01/latest/index.html
• https://event.cwi.nl/grades/2018/07-VanRest.pdf
PGQL – Property Graph Query Language
• http://pgql-lang.org/
• Graph Pattern Matching combined with SQL
– WHERE clause: a set of comma-separated constraints
• Developed by Oracle Labs
• Proposed for standardization
24. Analyze Property Graph Data using PGX
• Start PGX server
/opt/oracle/oracle-spatial-graph/property_graph/pgx/bin/start-server
• Start / Return to Groovy Shell
// Create in-memory session and analyst for analytics
session=Pgx.createSession("session_ID_1");
analyst=session.createAnalyst();
// Read the graph from Oracle NoSQL DB into memory
pgxGraph = session.readGraphWithProperties(opg.getConfig());
// Working with In-Memory Analyst
// Execute PageRank (max. error 0.0001, damping factor 0.85, max. 100 iterations)
rank=analyst.pagerank(pgxGraph, 0.0001, 0.85, 100);
// Get top 10 vertices
rank.getTopKValues(10);
// Betweenness Centrality
bc=analyst.vertexBetweennessCentrality(pgxGraph);
// Get top 10 vertices
bc.getTopKValues(10);
...
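To make the PageRank parameters more tangible, here is a textbook power-iteration sketch in plain Python. It is not PGX's implementation, and details such as the handling of vertices without outgoing edges or the convergence measure may differ there. The defaults mirror the three values passed on the slide (0.0001, 0.85, 100); the function names and the tiny sample graph are made up.

```python
def pagerank(edges, tolerance=0.0001, damping=0.85, max_iterations=100):
    """Power iteration over a list of directed (source, destination) edges."""
    vertices = sorted({v for edge in edges for v in edge})
    out_neighbors = {v: [] for v in vertices}
    for src, dst in edges:
        out_neighbors[src].append(dst)

    n = len(vertices)
    rank = {v: 1.0 / n for v in vertices}
    for _ in range(max_iterations):
        # Rank held by vertices without out-edges is spread over all vertices.
        dangling = sum(rank[v] for v in vertices if not out_neighbors[v])
        new_rank = {v: (1.0 - damping) / n + damping * dangling / n
                    for v in vertices}
        for src in vertices:
            targets = out_neighbors[src]
            for dst in targets:
                new_rank[dst] += damping * rank[src] / len(targets)
        # Stop once the total change between two iterations is small enough.
        change = sum(abs(new_rank[v] - rank[v]) for v in vertices)
        rank = new_rank
        if change < tolerance:
            break
    return rank


def top_k(rank, k):
    """Return the k highest-ranked (vertex, score) pairs."""
    return sorted(rank.items(), key=lambda item: item[1], reverse=True)[:k]


# Made-up sample: three members, two groups, one city.
sample_edges = [
    ("m1", "g1"), ("m2", "g1"), ("m3", "g1"), ("m3", "g2"),
    ("g1", "c1"), ("g2", "c1"),
]
ranks = pagerank(sample_edges)
for vertex, score in top_k(ranks, 3):
    print(vertex, round(score, 3))
```

In this sample the city vertex collects rank from both groups, which illustrates why vertices with many incoming paths score highly.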
25. Analyzing Property Graph Data using PGQL
• Topology constraints
(n)-[e]->(m)
(n)-[e1]->(m1), (n)-[e2]->(m2)
(n1)-[e1]->(n2)-[e2]->(n3)-[e3]->(n4)
(n1)-[e1]->(n2)<-[e2]-(n3)
• Label matching
(x:Person) -[e:likes]-> (y:Person)
(:Person) -[:likes]-> (:Person)
(x:Student|Professor) -[e:likes|knows]-> (y:Student|Professor)
• Value constraints
(x) -> (y), x.name = 'John', y.age > 25
• In-Line constraints
(n WITH name = 'John' OR name = 'James', type = 'Person') -[e WITH type = 'workAt', workHours < 40]-> ()
• …
Syntax form | Example
Basic form | (n)-[e]->(m)
Omit variable name of the source vertex | ()-[e]->(m)
Omit variable name of the destination vertex | (n)-[e]->()
Omit variable names in both vertices | ()-[e]->()
Omit variable name in edge | (n)-->(m)
Omit variable name in edge (alternative, one dash) | (n)->(m)
26. Analyzing Property Graph Data using PGQL
• Start / Return to Groovy Shell
// Some PGQL queries
pgxResultSet = pgxGraph.queryPgql("SELECT * WHERE (x) -[e1:is_organizer_of]-> (y) -[e2:is_located_in]-> (z)");
pgxResultSet.print(5);
pgxResultSet.getNumResults();
pgxResultSet = pgxGraph.queryPgql("SELECT x.type, y.type, y.group_name, y.group_members WHERE (x) -[e1:is_organizer_of]-> (y WITH group_members > 1000) -[e2:is_located_in]-> (z) ORDER BY y.group_members DESC");
pgxResultSet.print(5);
pgxResultSet = pgxGraph.queryPgql("SELECT x.member_name, y.group_name, y.group_members WHERE (x) -[e1:is_organizer_of]-> (y WITH group_members > 1000) -[e2:is_located_in]-> (z)");
pgxResultSet.print(5);
pgxResultSet = pgxGraph.queryPgql("SELECT * WHERE (x WITH event_yes_rsvp_count > 250) -[e1:is_organized_by]-> (y) -[e2:is_located_in]-> (z)");
pgxResultSet.print(5);
...
https://blogs.oracle.com/bigdataspatialgraph/how-many-ways-to-run-property-graph-query-language-pgql-in-bdsg-i
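What a two-hop pattern such as (x) -[:is_organizer_of]-> (y WITH group_members > 1000) -[:is_located_in]-> (z) asks for can be illustrated without a graph engine. The sketch below is a naive Python matcher over an edge list, not how PGX evaluates PGQL. The function name `match_two_hops` and the sample vertices are made up; the edge labels and property names are the ones used in the queries above.

```python
# Made-up sample graph: vertex ID -> properties, plus (source, destination, label) edges.
vertices = {
    1: {"type": "Member", "member_name": "Alice"},
    2: {"type": "Group", "group_name": "Sample Group A", "group_members": 1500},
    3: {"type": "Group", "group_name": "Sample Group B", "group_members": 300},
    4: {"type": "City", "city_name": "Berlin"},
}
edges = [
    (1, 2, "is_organizer_of"),
    (1, 3, "is_organizer_of"),
    (2, 4, "is_located_in"),
    (3, 4, "is_located_in"),
]


def match_two_hops(vertices, edges, label1, label2, mid_filter=lambda props: True):
    """Find all (x, y, z) with x -[label1]-> y -[label2]-> z and mid_filter(y) true."""
    # Index edges by (source, label) so the second hop is a lookup, not a scan.
    by_source = {}
    for src, dst, label in edges:
        by_source.setdefault((src, label), []).append(dst)

    results = []
    for x, y, label in edges:
        if label != label1 or not mid_filter(vertices[y]):
            continue
        for z in by_source.get((y, label2), []):
            results.append((x, y, z))
    return results


matches = match_two_hops(
    vertices, edges, "is_organizer_of", "is_located_in",
    mid_filter=lambda props: props.get("group_members", 0) > 1000,
)
# Equivalent of ORDER BY y.group_members DESC
matches.sort(key=lambda m: vertices[m[1]]["group_members"], reverse=True)
for x, y, z in matches:
    print(vertices[x]["member_name"], vertices[y]["group_name"],
          vertices[y]["group_members"])
```

The `mid_filter` argument plays the role of the in-line WITH constraint on the middle vertex.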
27. Data Processing Workflow: Step 4 of 5, Visualize
Visualize graph data
• Using Cytoscape
28. Visualizing Data using Cytoscape connected to Big Data Spatial and Graph
29. PGQL – Examples (visualized using Cytoscape)
SELECT *
WHERE (x WITH type='Event') -[e1]-> (y WITH type='Group' and group_name = 'Women Who Code Berlin') <-[e2:is_assigned_to]- (z WITH type='Topic')
https://blogs.oracle.com/bigdataspatialgraph/how-many-ways-to-run-property-graph-query-language-pgql-in-bdsg-ii
30. PGQL – Examples (visualized using Cytoscape)
SELECT *
WHERE (x) -[e1:is_organized_by]-> (y WITH type='Group' and group_name = 'Women Who Code Berlin') <-[e2:is_assigned_to]- (z WITH type='Topic'), (y) -[e3:is_located_in]-> (w)
31. PGQL – Examples (visualized using Cytoscape)
SELECT *
WHERE (x WITH type='Topic' and topic_name = 'Women in Technology') -[e1]-> (y WITH type='Group') -[e2]-> (z WITH type = 'City' and (city_name = 'Berlin' or city_name = 'Hamburg' or city_name = 'München'))
32. PGQL – Examples (visualized using Cytoscape)
SELECT *
WHERE (x WITH type='Event' and event_yes_rsvp_count >= 250) -[e1]- (y WITH type='Group') -[e2]- (z WITH type='City')
33. PGQL – Examples (visualized using Cytoscape)
SELECT *
WHERE (x WITH type='Group' and group_name = 'Women Who Code Berlin') <-[e1:is_assigned_to]- (y WITH type='Topic') -[e2]-> (z WITH group_members >= 2000) -[e3:is_located_in]-> (w)
34. More Visualization Examples using Cytoscape
Meetup Groups in relation to organizers
[Graph visualization; cities labeled: Copenhagen, Berlin, Hamburg, Munich]
35. More Visualization Examples using Cytoscape
36. Data Processing Workflow: Step 5 of 5, Results
Summarize results
37. Summarize (Preliminary) Results
Who are important people in the Meetup landscape?
Which Meetup groups should we talk to for certain topics?
Which Meetup groups are relevant in terms of #Members, #Participants of events, #Events?
Which Meetup groups are related and how?
...
38. Key Takeaways – So far
• The graph data model is a perfect fit when the focus is on connectivity
• Code written once is reusable many times to retrieve data for every desired location (city)
• Visual analysis helps a great deal to understand how data are connected
• Big variety of analytic tools and frameworks to answer all kinds of questions
– Integrated distributed, in-memory Graph analytics engine
• Use case of how to combine Open Source with Oracle Technologies
• Please also check the latest Graph talks from the Analytics and Data Summit in March 2018
– https://analyticsanddatasummit.org/schedule/
40. Oracle Code Berlin
June 12th 2018
See you there