Hadoop User Group (HUG) Ireland - Eddie Baggott Presentation, April 2016
BAE SYSTEMS PROPRIETARY
Unpublished Work Copyright 2015 BAE Systems. All Rights Reserved.
(See final slide for restrictions on use.)

BAE Systems
Apache Spark GraphX and GraphFrames
April 11th 2016
Eddie Baggott

Introduction
• Functional and Data Architect
• BAE Systems, Norkom
• Anti-Fraud, AML, Compliance, Watch Lists, Cyber Security
• Disclaimer: all opinions are my own

What are graph databases?
• Graph databases are databases that use graph structures for semantic queries, with nodes, edges and properties to represent and store data.
• Storing and showing networks
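As a rough illustration of nodes, edges and properties, the sketch below models a tiny property graph in plain Python. It does not reflect how any particular graph database stores data. The names `Node`, `Edge`, `neighbours` and `co_owners`, and the sample customers and account, are made up for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    node_id: str
    label: str  # kind of entity, e.g. "Customer" or "Account"
    properties: dict = field(default_factory=dict)


@dataclass
class Edge:
    src: str
    dst: str
    label: str  # relationship type, e.g. "OWNS" or "PAID"
    properties: dict = field(default_factory=dict)


# Hypothetical sample data: two customers sharing one account
nodes = {
    "c1": Node("c1", "Customer", {"name": "Eddie"}),
    "c2": Node("c2", "Customer", {"name": "Matt"}),
    "a1": Node("a1", "Account", {"currency": "EUR"}),
}
edges = [
    Edge("c1", "a1", "OWNS"),
    Edge("c2", "a1", "OWNS"),
    Edge("c1", "c2", "PAID", {"amount": 10000}),
]


def neighbours(node_id, label=None):
    """Follow outgoing edges from node_id, optionally filtered by edge label."""
    return [e.dst for e in edges
            if e.src == node_id and (label is None or e.label == label)]


def co_owners(account_id):
    """Customers linked to the same account through an OWNS edge."""
    return sorted(e.src for e in edges
                  if e.dst == account_id and e.label == "OWNS")


print(neighbours("c1"))
print(neighbours("c1", "PAID"))
print(co_owners("a1"))
```

A graph database answers questions like `co_owners` by following edges directly, rather than joining tables.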

What are they used for?
• Finding networks
• Analyse relationships
• Want to see how customers and accounts are connected
• See the transactions between them
• Credit Card
• Compromised Devices
• AML Rings
• Insurance
• Unauthorized Trading
• Social Networks
• Uber – Lyft Cancel Wars
• Panama Papers

Customer behaviour
Relationships
Showing direction of payments, co-ownerships
Use different types of lines and shapes to give extra meaning
Width of lines can show bigger amounts

Panama Papers
offshoreleaks.icij.org/nodes/262484
Start search with “mossack fonseca”

Panama Papers
Spider out one level

Panama Papers
Show more connections

Graph Databases: different approaches
• Graph Databases
  • Neo4j, Titan, OrientDB
  • Can store and manage data
  • Traversal queries
• Processing Engine
  • Spark, Giraph
  • GraphX
  • GraphFrames
• Can be complementary and used together, e.g. MazeRunner
• Elastic Search Graph
  • New, uses search and term relevancy

Apache Spark
DataFrames
GraphFrames

GraphX
• GraphX is a graph computation engine built on top of Spark that enables users to interactively build, transform and reason about graph-structured data at scale. It comes complete with a library of common algorithms.
• Spark, based on RDDs
• Num Vertices, Num Edges, Degrees
• Algorithms
  • PageRank
  • Connected Components
  • Triangle Counting
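To make the graph metrics and two of the listed algorithms concrete, here is a minimal single-machine sketch in plain Python. It is not GraphX and does not use Spark. The edge list and the function names `connected_components` and `triangle_counts` are assumptions made for the illustration.

```python
from collections import defaultdict

# Hypothetical undirected edges forming two separate groups of vertices
edges = [(1, 2), (2, 3), (1, 3), (3, 4), (5, 6)]

# Adjacency sets: vertex -> set of neighbouring vertices
adj = defaultdict(set)
for a, b in edges:
    adj[a].add(b)
    adj[b].add(a)

num_vertices = len(adj)
num_edges = len(edges)
degrees = {v: len(ns) for v, ns in adj.items()}


def connected_components(adj):
    """Label every vertex with the smallest vertex id in its component."""
    labels = {}
    for start in sorted(adj):
        if start in labels:
            continue
        stack = [start]
        while stack:
            v = stack.pop()
            if v not in labels:
                labels[v] = start
                stack.extend(n for n in adj[v] if n not in labels)
    return labels


def triangle_counts(adj):
    """Number of triangles each vertex takes part in."""
    counts = {}
    for v, ns in adj.items():
        # A triangle through v is a pair of v's neighbours that are also linked
        counts[v] = sum(1 for a in ns for b in ns if a < b and b in adj[a])
    return counts


components = connected_components(adj)
triangles = triangle_counts(adj)
print(num_vertices, num_edges)
print(degrees)
print(components)
print(triangles)
```

Vertices 1 to 4 end up in one component and 5 to 6 in another; only vertices 1, 2 and 3 form a triangle.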

GraphX Example
• In Big Data, “Hello World” is usually a “Word Count” of Wikipedia
• So let’s graph Wikipedia
• Clean the data
• Making a Vertex RDD
import org.apache.spark.graphx._
import org.apache.spark.rdd.RDD
val vertices = articles.map(a => (pageHash(a.title), a.title))
• Making the Edge RDD
val edges: RDD[Edge[Double]] = articles.flatMap { a =>
  val srcVid = pageHash(a.title)
  // linkedTitles is a hypothetical helper returning the titles that article a links to
  linkedTitles(a).map(t => Edge(srcVid, pageHash(t), 1.0)) }
• Making the Graph
val graph = Graph(vertices, edges, "")
• Run PageRank on Wikipedia
val dublinGraph = graph.subgraph(vpred = (v, t) =>
  t.toLowerCase.contains("dublin"))
val prDublin = dublinGraph.staticPageRank(5)
// Join titles to ranks, then print the 10 highest-ranked pages
val titleAndPrGraph = dublinGraph.outerJoinVertices(prDublin.vertices) {
  (v, title, rank) => (rank.getOrElse(0.0), title) }
titleAndPrGraph.vertices.top(10)(Ordering.by((e: (VertexId, (Double, String))) => e._2._1)).foreach(println)
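For intuition about what a fixed number of PageRank iterations does, here is a small plain-Python sketch. It is an illustration only, not the GraphX implementation. The function `static_page_rank`, the update rule (rank = reset + (1 - reset) * incoming, starting every rank at 1.0) and the sample link graph are assumptions chosen for the example.

```python
def static_page_rank(edges, num_iter, reset_prob=0.15):
    """Run a fixed number of PageRank iterations over directed (src, dst) edges."""
    vertices = {v for e in edges for v in e}
    out_degree = {v: 0 for v in vertices}
    for src, _ in edges:
        out_degree[src] += 1
    ranks = {v: 1.0 for v in vertices}
    for _ in range(num_iter):
        incoming = {v: 0.0 for v in vertices}
        for src, dst in edges:
            # Each page splits its current rank evenly across its outgoing links
            incoming[dst] += ranks[src] / out_degree[src]
        ranks = {v: reset_prob + (1 - reset_prob) * incoming[v] for v in vertices}
    return ranks


# Hypothetical link graph between article titles
links = [
    ("Dublin", "Ireland"),
    ("Dublin Castle", "Dublin"),
    ("Trinity College Dublin", "Dublin"),
    ("Ireland", "Dublin"),
]

ranks = static_page_rank(links, num_iter=5)
top = sorted(ranks.items(), key=lambda kv: kv[1], reverse=True)
for title, rank in top:
    print(f"{title}: {rank:.3f}")
```

Pages with no incoming links stay at the reset value, while pages that many others link to accumulate rank.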

Spark GraphFrames
• GraphFrames support general graph processing, similar to Apache Spark’s GraphX library. However, GraphFrames are built on top of Spark DataFrames, resulting in some key advantages:
  • Python, Java & Scala APIs: GraphFrames provide uniform APIs for all 3 languages. For the first time, all algorithms in GraphX are available from Python & Java.
  • Powerful queries: GraphFrames allow users to phrase queries in the familiar, powerful APIs of Spark SQL and DataFrames.
  • Saving & loading graphs: GraphFrames fully support DataFrame data sources, allowing writing and reading graphs using many formats like Parquet, JSON, and CSV.
• In GraphFrames, vertices and edges are represented as DataFrames, allowing us to store arbitrary data with each vertex and edge
• http://spark-packages.org/package/graphframes/graphframes

Spark GraphFrames Example
customers:
Customer  ID
Eddie     1
Alan      2
Matt      3
Deirdre   4
Bob       5
Sue       6
John      7
payments:
Sender  Receiver  Amount   Country
Eddie   Matt      10,000   UK
Eddie   Deirdre   15,000   Irl
Eddie   Bob       25,000   USA
Alan    Sue       32,000   USA
Alan    John      43,000   USA
Matt    Alan      50,000   Irl
Matt    Deirdre   60,000   Irl
Matt    Bob       120,000  USA
# Create vertices (customers) and edges (payments)
# GraphFrames needs an "id" column on vertices and "src"/"dst" columns on edges;
# payments refer to customers by name, so the name is used as the vertex id
from graphframes import GraphFrame
vertices = customers.selectExpr("Customer as id", "ID as customer_id").distinct()
edges = payments.selectExpr("Sender as src", "Receiver as dst", "Amount as amount", "Country as country")
graph = GraphFrame(vertices, edges)

Spark GraphFrames Example
# GraphFrames edge columns: src is the sender, dst is the receiver
• Who sent more than 100k?
graph.edges.filter("amount > 100000").select("src").show()
Matt
• Who sent to more than 2 people?
graph.outDegrees.filter("outDegree > 2").show()
Eddie, Matt
• Who sent the most to Ireland?
graph.edges.filter("country = 'Irl'").groupBy("src").sum("amount").show()
• Who are the most connected?
results = graph.pageRank(resetProbability=0.15, maxIter=10)
results.vertices.orderBy("pagerank", ascending=False).show()
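The first three questions can be checked by hand against the payments table. The sketch below redoes them in plain Python, with no Spark or GraphFrames, using the rows from the example. The variable names are made up for the illustration, and "more than 2 people" is read here as more than two distinct receivers.

```python
from collections import defaultdict

# Payments from the example table: (sender, receiver, amount, country)
payments = [
    ("Eddie", "Matt", 10_000, "UK"),
    ("Eddie", "Deirdre", 15_000, "Irl"),
    ("Eddie", "Bob", 25_000, "USA"),
    ("Alan", "Sue", 32_000, "USA"),
    ("Alan", "John", 43_000, "USA"),
    ("Matt", "Alan", 50_000, "Irl"),
    ("Matt", "Deirdre", 60_000, "Irl"),
    ("Matt", "Bob", 120_000, "USA"),
]

# Who sent more than 100k in a single payment?
big_senders = sorted({s for s, _, amount, _ in payments if amount > 100_000})

# Who sent to more than 2 people? (distinct receivers per sender)
receivers = defaultdict(set)
for s, r, _, _ in payments:
    receivers[s].add(r)
busy_senders = sorted(s for s, rs in receivers.items() if len(rs) > 2)

# Who sent the most to Ireland? (total amount per sender where country is Irl)
to_ireland = defaultdict(int)
for s, _, amount, country in payments:
    if country == "Irl":
        to_ireland[s] += amount
top_ireland = max(to_ireland, key=to_ireland.get)

print(big_senders)       # ['Matt']
print(busy_senders)      # ['Eddie', 'Matt']
print(dict(to_ireland))  # {'Eddie': 15000, 'Matt': 110000}
print(top_ireland)       # Matt
```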

Chord Diagram
• Another way to see who is sending money to whom
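Chord diagram tools commonly take a square matrix of flows as input, with one row and one column per participant. As a sketch of that preparation step, the plain-Python snippet below builds such a matrix from the example payments. The names `names`, `index` and `matrix` are assumptions, and no particular charting library is implied.

```python
# Payments from the example table: (sender, receiver, amount)
payments = [
    ("Eddie", "Matt", 10_000),
    ("Eddie", "Deirdre", 15_000),
    ("Eddie", "Bob", 25_000),
    ("Alan", "Sue", 32_000),
    ("Alan", "John", 43_000),
    ("Matt", "Alan", 50_000),
    ("Matt", "Deirdre", 60_000),
    ("Matt", "Bob", 120_000),
]

# One row and one column per person, in alphabetical order
names = sorted({person for s, r, _ in payments for person in (s, r)})
index = {name: i for i, name in enumerate(names)}

# matrix[i][j] holds the total amount sent from names[i] to names[j]
matrix = [[0] * len(names) for _ in names]
for sender, receiver, amount in payments:
    matrix[index[sender]][index[receiver]] += amount

for name, row in zip(names, matrix):
    print(f"{name:8}", row)
```

Each row total is what that person sent and each column total is what they received, which is what the chord widths would encode.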

Elastic Search: Graph
www.elastic.co/products/graph
• Find connections based on relevance

Recap
• Graphs are good for finding networks and analysing relationships
• Different approaches
• Lots of visualization options
• Get the benefits of using Spark
• We’re hiring!
  • http://www.baesystems.com/en/cybersecurity/careers
• Any Questions?

FREEDOM OF INFORMATION ACT
This document (<projectreference><documentnumber>) contains confidential and commercially sensitive material
which is provided for the Authority’s internal use only and is not intended for general dissemination.
The information contained herein pertains to bodies dealing with security, national security and/or defence matters
that would be exempt under Sections 23, 24 and 26 of the Freedom of Information Act 2000 (FOIA). It also consists of
information which describes our methodologies, processes and commercial arrangements all of which would be exempt
from disclosure under Sections 41 and 43 of the Act.
Should the Authority receive any request for disclosure of the information provided in this document, the Authority is
requested to notify BAE Systems Applied Intelligence. BAE Systems Applied Intelligence shall provide every assistance
to the Authority in complying with its obligations under the Act.
BAE Systems Applied Intelligence’s point of contact for FOIA requests is:
Chief Counsel
Legal Department
BAE Systems Applied Intelligence
Surrey Research Park
Guildford GU2 7YP
Telephone 01483 816082