SlideShare ist ein Scribd-Unternehmen logo
1 von 31
June, 2013
Jay Tang
GRAPH MINING WITH APACHE
GIRAPH
Confidential and Proprietary2
• Introduction
• Big Data problem
• Graph mining platform
• Use case
• Lessons
• Future work
AGENDA
Confidential and Proprietary3
• Director of Big Data Platform & Analytics, PayPal
− Hadoop, Graph mining, Real-time analytics, ML, text mining
• 20 years of software experience in the valley focused on data
• Member of original Hadoop team @Yahoo
• Built data warehouse, relational database, OLAP product
@Yahoo, Oracle/Hyperion, IBM Informix, DB2
ABOUT ME
Confidential and Proprietary4
BIG DATA PROBLEM
Confidential and Proprietary5
• Enable Online, Offline, and Mobile payment
• 128M customers worldwide
• $160B payment volume processed annually
• Major retail locations accepting PayPal
20K today  2M end of 2013
• PayPal Here launching in US and international markets
Petabye Data Problem & Growing
BIG DATA PROBLEM @ PAYPAL
Confidential and Proprietary6
• Detect and prevent fraud
• Assess credit risk
• Relevant offer to our customers
• Improve user experience
• Provide better insights to our merchants
BIG DATA POWERS PAYPAL ANALYTICS
Confidential and Proprietary7
GRAPH MINING PLATFORM
Confidential and Proprietary8
BIG DATA STACK
Data
Cloud
Confidential and Proprietary9
Traditional data processing abstraction -- TABLE
• Rows
• Columns
• Data Types
DATA ABSTRACTION
Confidential and Proprietary10
• Internet & WWW
• Social network
• PayPal payment network – accounts & transactions
GRAPH IS EVERYWHERE
Confidential and Proprietary11
• Think like a vertex
• Two basic operations
− Fusion: aggregate information from neighbors to a set of entities
− Diffusion: propagate information from a vertex to neighbors
GRAPH COMPUTING
Confidential and Proprietary12
THING LIKE A VERTEX - FUSION
Confidential and Proprietary13
THINK LIKE A VERTEX - DIFFUSION
Confidential and Proprietary14
• Which graph mining engine to use?
− GraphLab
− Apache Giraph
− Apache Hamas
• Hadoop compatible
− Data is on Hadoop
− Leverage existing cluster infrastructure
− Integration with Hadoop
• Easy of deployment and update
• Community
GRAPH MINING ENGINE
Confidential and Proprietary15
• Apache open src implementation of Google Pregel on Hadoop
• Send msg from a vertex to any other vertex
• In-memory scalable system
− Map-only jobs, Zookeeper, Netty
BSP & GIRAPH
Confidential and Proprietary16
GRAPH MINING USE CASE
Confidential and Proprietary17
• Stop fraudsters from stealing money from PayPal payment
network
• Sophisticate risk models running in real-time based on
− Online data
− Offline data
• Risk profile traditionally based on a variety of data
− Account
− Transaction -- frequency, amount, history
− IP
− Email domain
RISK DETECTION & MITIGATION
Confidential and Proprietary18
RISK COMPUTATION
Current TX Details
Risk Models
Approve
DeclineHistory Data
Confidential and Proprietary19
• PayPal data are connected
• Form multiple communities that have hidden inferences
• Discover the inferences via a graph approach
• Build a system to extract the inferences
GRAPH MINING CONNECTED DATA
Confidential and Proprietary20
GRAPH VIEW OF DATA
User1
User2
Merchant
BUY
BUY
P2P Money
Transfer
Confidential and Proprietary21
GRAPH VIEW OF DATA
Account 1
IP1 IP2
Account 2
IP3
Confidential and Proprietary22
GRAPH MINING DATA PIPELINE
Pre
Processing
Graph Processing
Post
Processing
Confidential and Proprietary23
• Input data is raw transaction data
• Custom MapReduce jobs to pre-process data into graph
model
• Output is JSON format of adjacent node list
− Easy to consume in Java and by humans
− Use gson library
• Post processing – output format conversion
GRAPH DATA PIPELINE
Confidential and Proprietary24
• Customers/Accounts linked via transactions
• Compute risk = intrinsic risk + risk propagated from peers
• Send risk message to peers
• Iterate till converge
GRAPH PROCESSING
Cus1
Cus2
Transaction T1
Transaction T0
Transaction T2
Transaction T3
Confidential and Proprietary25
IP3
IP2
GRAPH PROCESSING
Account 1
IP1 IP2
Account 2
IP3IP1
Confidential and Proprietary26
LESSONS LEARNED
Confidential and Proprietary27
• Giraph is an emerging technology
− Incubation in 2012
− Rapidly evolving
− 0.1 and 0.2 are not compatible
− Lack of knowledge & doc
• Build internal git repo
• Read code and join mailing list
• Port code from 0.1 to 0.2
• Use Giraph 1.0 released on May 6 2013
GIRAPH
Confidential and Proprietary28
• Must guarantee minimum number of Mappers
• Capacity scheduler
− set MIN mapper of queue > Giraph job needs
• Fair scheduler
− set MIN mapper of queue > Giraph job needs
− Turn on pre-emption
− Set pre-emption wait time to a small interval – 20 sec
HADOOP ENVIRONMENT INTEGRATION
Confidential and Proprietary29
• Memory constraint in a shared Hadoop environment
− 1.2B edges and 300M nodes
− Single purpose POC cluster mapper memory = 10 GB
− Shared R&D cluster mapper memory = 3 GB
• Reduce memory consumption is key
− Convert String to long for graph processing
− Convert back to String in post-processing for downstream application
− Cap the number of messages passed
− distance from current vertex
− message payload data values
MEMORY SCALABILITY
Confidential and Proprietary30
• Giraph-based data engine to produce enriched data set
• Leverage Giraph on YARN
• Number of worker scalability
FUTURE WORK
Q&A
WE ARE HIRING

Weitere ähnliche Inhalte

Was ist angesagt?

Spark Intro @ analytics big data summit
Spark  Intro @ analytics big data summitSpark  Intro @ analytics big data summit
Spark Intro @ analytics big data summitSujee Maniyam
 
Data Science with Spark & Zeppelin
Data Science with Spark & ZeppelinData Science with Spark & Zeppelin
Data Science with Spark & ZeppelinVinay Shukla
 
Why spark by Stratio - v.1.0
Why spark by Stratio - v.1.0Why spark by Stratio - v.1.0
Why spark by Stratio - v.1.0Stratio
 
Graph Processing with Apache TinkerPop
Graph Processing with Apache TinkerPopGraph Processing with Apache TinkerPop
Graph Processing with Apache TinkerPopJason Plurad
 
Large Scale Graph Analytics with JanusGraph
Large Scale Graph Analytics with JanusGraphLarge Scale Graph Analytics with JanusGraph
Large Scale Graph Analytics with JanusGraphP. Taylor Goetz
 
Evolving from RDBMS to NoSQL + SQL
Evolving from RDBMS to NoSQL + SQLEvolving from RDBMS to NoSQL + SQL
Evolving from RDBMS to NoSQL + SQLMapR Technologies
 
Data Science lifecycle with Apache Zeppelin and Spark by Moonsoo Lee
Data Science lifecycle with Apache Zeppelin and Spark by Moonsoo LeeData Science lifecycle with Apache Zeppelin and Spark by Moonsoo Lee
Data Science lifecycle with Apache Zeppelin and Spark by Moonsoo LeeSpark Summit
 
Transitioning Compute Models: Hadoop MapReduce to Spark
Transitioning Compute Models: Hadoop MapReduce to SparkTransitioning Compute Models: Hadoop MapReduce to Spark
Transitioning Compute Models: Hadoop MapReduce to SparkSlim Baltagi
 
Apache con big data 2015 - Data Science from the trenches
Apache con big data 2015 - Data Science from the trenchesApache con big data 2015 - Data Science from the trenches
Apache con big data 2015 - Data Science from the trenchesVinay Shukla
 
The Evolution of Apache Kylin by Luke Han
The Evolution of Apache Kylin by Luke HanThe Evolution of Apache Kylin by Luke Han
The Evolution of Apache Kylin by Luke HanLuke Han
 
Data Science at Scale by Sarah Guido
Data Science at Scale by Sarah GuidoData Science at Scale by Sarah Guido
Data Science at Scale by Sarah GuidoSpark Summit
 
A Graph Service for Global Web Entities Traversal and Reputation Evaluation B...
A Graph Service for Global Web Entities Traversal and Reputation Evaluation B...A Graph Service for Global Web Entities Traversal and Reputation Evaluation B...
A Graph Service for Global Web Entities Traversal and Reputation Evaluation B...Chris Huang
 
Advanced Analytics and Big Data (August 2014)
Advanced Analytics and Big Data (August 2014)Advanced Analytics and Big Data (August 2014)
Advanced Analytics and Big Data (August 2014)Thomas W. Dinsmore
 
Spark at NASA/JPL-(Chris Mattmann, NASA/JPL)
Spark at NASA/JPL-(Chris Mattmann, NASA/JPL)Spark at NASA/JPL-(Chris Mattmann, NASA/JPL)
Spark at NASA/JPL-(Chris Mattmann, NASA/JPL)Spark Summit
 
Reference architecture for Internet Of Things
Reference architecture for Internet Of ThingsReference architecture for Internet Of Things
Reference architecture for Internet Of Thingselephantscale
 
Apache Kylin - OLAP Cubes for SQL on Hadoop
Apache Kylin - OLAP Cubes for SQL on HadoopApache Kylin - OLAP Cubes for SQL on Hadoop
Apache Kylin - OLAP Cubes for SQL on HadoopTed Dunning
 
Apache-Flink-What-How-Why-Who-Where-by-Slim-Baltagi
Apache-Flink-What-How-Why-Who-Where-by-Slim-BaltagiApache-Flink-What-How-Why-Who-Where-by-Slim-Baltagi
Apache-Flink-What-How-Why-Who-Where-by-Slim-BaltagiSlim Baltagi
 

Was ist angesagt? (20)

Spark Intro @ analytics big data summit
Spark  Intro @ analytics big data summitSpark  Intro @ analytics big data summit
Spark Intro @ analytics big data summit
 
Data Science with Spark & Zeppelin
Data Science with Spark & ZeppelinData Science with Spark & Zeppelin
Data Science with Spark & Zeppelin
 
Why spark by Stratio - v.1.0
Why spark by Stratio - v.1.0Why spark by Stratio - v.1.0
Why spark by Stratio - v.1.0
 
Graph Processing with Apache TinkerPop
Graph Processing with Apache TinkerPopGraph Processing with Apache TinkerPop
Graph Processing with Apache TinkerPop
 
Hadoop to spark-v2
Hadoop to spark-v2Hadoop to spark-v2
Hadoop to spark-v2
 
Large Scale Graph Analytics with JanusGraph
Large Scale Graph Analytics with JanusGraphLarge Scale Graph Analytics with JanusGraph
Large Scale Graph Analytics with JanusGraph
 
Evolving from RDBMS to NoSQL + SQL
Evolving from RDBMS to NoSQL + SQLEvolving from RDBMS to NoSQL + SQL
Evolving from RDBMS to NoSQL + SQL
 
Time-oriented event search. A new level of scale
Time-oriented event search. A new level of scale Time-oriented event search. A new level of scale
Time-oriented event search. A new level of scale
 
Data Science lifecycle with Apache Zeppelin and Spark by Moonsoo Lee
Data Science lifecycle with Apache Zeppelin and Spark by Moonsoo LeeData Science lifecycle with Apache Zeppelin and Spark by Moonsoo Lee
Data Science lifecycle with Apache Zeppelin and Spark by Moonsoo Lee
 
Transitioning Compute Models: Hadoop MapReduce to Spark
Transitioning Compute Models: Hadoop MapReduce to SparkTransitioning Compute Models: Hadoop MapReduce to Spark
Transitioning Compute Models: Hadoop MapReduce to Spark
 
Apache con big data 2015 - Data Science from the trenches
Apache con big data 2015 - Data Science from the trenchesApache con big data 2015 - Data Science from the trenches
Apache con big data 2015 - Data Science from the trenches
 
The Evolution of Apache Kylin by Luke Han
The Evolution of Apache Kylin by Luke HanThe Evolution of Apache Kylin by Luke Han
The Evolution of Apache Kylin by Luke Han
 
Data Science at Scale by Sarah Guido
Data Science at Scale by Sarah GuidoData Science at Scale by Sarah Guido
Data Science at Scale by Sarah Guido
 
A Graph Service for Global Web Entities Traversal and Reputation Evaluation B...
A Graph Service for Global Web Entities Traversal and Reputation Evaluation B...A Graph Service for Global Web Entities Traversal and Reputation Evaluation B...
A Graph Service for Global Web Entities Traversal and Reputation Evaluation B...
 
Advanced Analytics and Big Data (August 2014)
Advanced Analytics and Big Data (August 2014)Advanced Analytics and Big Data (August 2014)
Advanced Analytics and Big Data (August 2014)
 
Spark at NASA/JPL-(Chris Mattmann, NASA/JPL)
Spark at NASA/JPL-(Chris Mattmann, NASA/JPL)Spark at NASA/JPL-(Chris Mattmann, NASA/JPL)
Spark at NASA/JPL-(Chris Mattmann, NASA/JPL)
 
Reference architecture for Internet Of Things
Reference architecture for Internet Of ThingsReference architecture for Internet Of Things
Reference architecture for Internet Of Things
 
Apache Kylin - OLAP Cubes for SQL on Hadoop
Apache Kylin - OLAP Cubes for SQL on HadoopApache Kylin - OLAP Cubes for SQL on Hadoop
Apache Kylin - OLAP Cubes for SQL on Hadoop
 
Apache-Flink-What-How-Why-Who-Where-by-Slim-Baltagi
Apache-Flink-What-How-Why-Who-Where-by-Slim-BaltagiApache-Flink-What-How-Why-Who-Where-by-Slim-Baltagi
Apache-Flink-What-How-Why-Who-Where-by-Slim-Baltagi
 
Apache Spark Overview
Apache Spark OverviewApache Spark Overview
Apache Spark Overview
 

Andere mochten auch

Using Kafka and Kudu for fast, low-latency SQL analytics on streaming data
Using Kafka and Kudu for fast, low-latency SQL analytics on streaming dataUsing Kafka and Kudu for fast, low-latency SQL analytics on streaming data
Using Kafka and Kudu for fast, low-latency SQL analytics on streaming dataMike Percy
 
Introducing Apache Giraph for Large Scale Graph Processing
Introducing Apache Giraph for Large Scale Graph ProcessingIntroducing Apache Giraph for Large Scale Graph Processing
Introducing Apache Giraph for Large Scale Graph Processingsscdotopen
 
Kudu - Fast Analytics on Fast Data
Kudu - Fast Analytics on Fast DataKudu - Fast Analytics on Fast Data
Kudu - Fast Analytics on Fast DataRyan Bosshart
 
Time Series Analysis with Spark
Time Series Analysis with SparkTime Series Analysis with Spark
Time Series Analysis with SparkSandy Ryza
 
Machine Learning with GraphLab Create
Machine Learning with GraphLab CreateMachine Learning with GraphLab Create
Machine Learning with GraphLab CreateTuri, Inc.
 
Scaling Up Machine Learning: How to Benchmark GraphLab Create on Huge Datasets
Scaling Up Machine Learning: How to Benchmark GraphLab Create on Huge DatasetsScaling Up Machine Learning: How to Benchmark GraphLab Create on Huge Datasets
Scaling Up Machine Learning: How to Benchmark GraphLab Create on Huge DatasetsTuri, Inc.
 
Introduction to Apache Kudu
Introduction to Apache KuduIntroduction to Apache Kudu
Introduction to Apache KuduJeff Holoman
 
Introduction into scalable graph analysis with Apache Giraph and Spark GraphX
Introduction into scalable graph analysis with Apache Giraph and Spark GraphXIntroduction into scalable graph analysis with Apache Giraph and Spark GraphX
Introduction into scalable graph analysis with Apache Giraph and Spark GraphXrhatr
 
Apache Arrow (Strata-Hadoop World San Jose 2016)
Apache Arrow (Strata-Hadoop World San Jose 2016)Apache Arrow (Strata-Hadoop World San Jose 2016)
Apache Arrow (Strata-Hadoop World San Jose 2016)Wes McKinney
 
The Columnar Era: Leveraging Parquet, Arrow and Kudu for High-Performance Ana...
The Columnar Era: Leveraging Parquet, Arrow and Kudu for High-Performance Ana...The Columnar Era: Leveraging Parquet, Arrow and Kudu for High-Performance Ana...
The Columnar Era: Leveraging Parquet, Arrow and Kudu for High-Performance Ana...DataWorks Summit/Hadoop Summit
 
Efficient Data Storage for Analytics with Apache Parquet 2.0
Efficient Data Storage for Analytics with Apache Parquet 2.0Efficient Data Storage for Analytics with Apache Parquet 2.0
Efficient Data Storage for Analytics with Apache Parquet 2.0Cloudera, Inc.
 
Next-generation Python Big Data Tools, powered by Apache Arrow
Next-generation Python Big Data Tools, powered by Apache ArrowNext-generation Python Big Data Tools, powered by Apache Arrow
Next-generation Python Big Data Tools, powered by Apache ArrowWes McKinney
 

Andere mochten auch (14)

Using Kafka and Kudu for fast, low-latency SQL analytics on streaming data
Using Kafka and Kudu for fast, low-latency SQL analytics on streaming dataUsing Kafka and Kudu for fast, low-latency SQL analytics on streaming data
Using Kafka and Kudu for fast, low-latency SQL analytics on streaming data
 
Introducing Apache Giraph for Large Scale Graph Processing
Introducing Apache Giraph for Large Scale Graph ProcessingIntroducing Apache Giraph for Large Scale Graph Processing
Introducing Apache Giraph for Large Scale Graph Processing
 
Kudu - Fast Analytics on Fast Data
Kudu - Fast Analytics on Fast DataKudu - Fast Analytics on Fast Data
Kudu - Fast Analytics on Fast Data
 
Time Series Analysis with Spark
Time Series Analysis with SparkTime Series Analysis with Spark
Time Series Analysis with Spark
 
Machine Learning with GraphLab Create
Machine Learning with GraphLab CreateMachine Learning with GraphLab Create
Machine Learning with GraphLab Create
 
Apache kudu
Apache kuduApache kudu
Apache kudu
 
Scaling Up Machine Learning: How to Benchmark GraphLab Create on Huge Datasets
Scaling Up Machine Learning: How to Benchmark GraphLab Create on Huge DatasetsScaling Up Machine Learning: How to Benchmark GraphLab Create on Huge Datasets
Scaling Up Machine Learning: How to Benchmark GraphLab Create on Huge Datasets
 
Introduction to Apache Kudu
Introduction to Apache KuduIntroduction to Apache Kudu
Introduction to Apache Kudu
 
Introduction into scalable graph analysis with Apache Giraph and Spark GraphX
Introduction into scalable graph analysis with Apache Giraph and Spark GraphXIntroduction into scalable graph analysis with Apache Giraph and Spark GraphX
Introduction into scalable graph analysis with Apache Giraph and Spark GraphX
 
HPE Keynote Hadoop Summit San Jose 2016
HPE Keynote Hadoop Summit San Jose 2016HPE Keynote Hadoop Summit San Jose 2016
HPE Keynote Hadoop Summit San Jose 2016
 
Apache Arrow (Strata-Hadoop World San Jose 2016)
Apache Arrow (Strata-Hadoop World San Jose 2016)Apache Arrow (Strata-Hadoop World San Jose 2016)
Apache Arrow (Strata-Hadoop World San Jose 2016)
 
The Columnar Era: Leveraging Parquet, Arrow and Kudu for High-Performance Ana...
The Columnar Era: Leveraging Parquet, Arrow and Kudu for High-Performance Ana...The Columnar Era: Leveraging Parquet, Arrow and Kudu for High-Performance Ana...
The Columnar Era: Leveraging Parquet, Arrow and Kudu for High-Performance Ana...
 
Efficient Data Storage for Analytics with Apache Parquet 2.0
Efficient Data Storage for Analytics with Apache Parquet 2.0Efficient Data Storage for Analytics with Apache Parquet 2.0
Efficient Data Storage for Analytics with Apache Parquet 2.0
 
Next-generation Python Big Data Tools, powered by Apache Arrow
Next-generation Python Big Data Tools, powered by Apache ArrowNext-generation Python Big Data Tools, powered by Apache Arrow
Next-generation Python Big Data Tools, powered by Apache Arrow
 

Ähnlich wie Hadoop Graph Processing with Apache Giraph

Building a Hadoop Powered Commerce Data Pipeline
Building a Hadoop Powered Commerce Data PipelineBuilding a Hadoop Powered Commerce Data Pipeline
Building a Hadoop Powered Commerce Data PipelineDataWorks Summit
 
Venkatesh Ramanathan, Data Scientist, PayPal at MLconf ATL 2017
Venkatesh Ramanathan, Data Scientist, PayPal at MLconf ATL 2017Venkatesh Ramanathan, Data Scientist, PayPal at MLconf ATL 2017
Venkatesh Ramanathan, Data Scientist, PayPal at MLconf ATL 2017MLconf
 
Using Hadoop to Drive Down Fraud for Telcos
Using Hadoop to Drive Down Fraud for TelcosUsing Hadoop to Drive Down Fraud for Telcos
Using Hadoop to Drive Down Fraud for TelcosCloudera, Inc.
 
Rakuten techconf2015.baiji.he.bigdataforsmallstartupandbeyond
Rakuten techconf2015.baiji.he.bigdataforsmallstartupandbeyondRakuten techconf2015.baiji.he.bigdataforsmallstartupandbeyond
Rakuten techconf2015.baiji.he.bigdataforsmallstartupandbeyondBaiji He
 
PayPal Notebooks at Jupytercon 2018
PayPal Notebooks at Jupytercon 2018PayPal Notebooks at Jupytercon 2018
PayPal Notebooks at Jupytercon 2018Romit Mehta
 
Graph Gurus Episode 25: Unleash the Business Value of Your Data Lake with Gra...
Graph Gurus Episode 25: Unleash the Business Value of Your Data Lake with Gra...Graph Gurus Episode 25: Unleash the Business Value of Your Data Lake with Gra...
Graph Gurus Episode 25: Unleash the Business Value of Your Data Lake with Gra...TigerGraph
 
KNIME Software Overview
KNIME Software OverviewKNIME Software Overview
KNIME Software OverviewKNIMESlides
 
Atlanta hadoop users group july 2013
Atlanta hadoop users group july 2013Atlanta hadoop users group july 2013
Atlanta hadoop users group july 2013Christopher Curtin
 
Spark as part of a Hybrid RDBMS Architecture-John Leach Cofounder Splice Machine
Spark as part of a Hybrid RDBMS Architecture-John Leach Cofounder Splice MachineSpark as part of a Hybrid RDBMS Architecture-John Leach Cofounder Splice Machine
Spark as part of a Hybrid RDBMS Architecture-John Leach Cofounder Splice MachineData Con LA
 
Accumulo Summit 2014: Accumulo with Distributed SQL queries
Accumulo Summit 2014: Accumulo with Distributed SQL queriesAccumulo Summit 2014: Accumulo with Distributed SQL queries
Accumulo Summit 2014: Accumulo with Distributed SQL queriesAccumulo Summit
 
Big Data LDN 2016: When Big Data Meets Fast Data
Big Data LDN 2016: When Big Data Meets Fast DataBig Data LDN 2016: When Big Data Meets Fast Data
Big Data LDN 2016: When Big Data Meets Fast DataMatt Stubbs
 
RedisConf17 - Redfin - The Real Estate Brokerage and the In-memory Database
RedisConf17 - Redfin - The Real Estate Brokerage and the In-memory Database RedisConf17 - Redfin - The Real Estate Brokerage and the In-memory Database
RedisConf17 - Redfin - The Real Estate Brokerage and the In-memory Database Redis Labs
 
Large-Scale Data Science on Hadoop (Intel Big Data Day)
Large-Scale Data Science on Hadoop (Intel Big Data Day)Large-Scale Data Science on Hadoop (Intel Big Data Day)
Large-Scale Data Science on Hadoop (Intel Big Data Day)Uri Laserson
 
GOAI: GPU-Accelerated Data Science DataSciCon 2017
GOAI: GPU-Accelerated Data Science DataSciCon 2017GOAI: GPU-Accelerated Data Science DataSciCon 2017
GOAI: GPU-Accelerated Data Science DataSciCon 2017Joshua Patterson
 
How Data Drives Business at Choice Hotels
How Data Drives Business at Choice HotelsHow Data Drives Business at Choice Hotels
How Data Drives Business at Choice HotelsCloudera, Inc.
 
Spark: Building an application from Start to Finish
Spark: Building an application from Start to FinishSpark: Building an application from Start to Finish
Spark: Building an application from Start to FinishAdam Doyle
 
QCon 2018 | Gimel | PayPal's Analytic Platform
QCon 2018 | Gimel | PayPal's Analytic PlatformQCon 2018 | Gimel | PayPal's Analytic Platform
QCon 2018 | Gimel | PayPal's Analytic PlatformDeepak Chandramouli
 
Gimel at Dataworks Summit San Jose 2018
Gimel at Dataworks Summit San Jose 2018Gimel at Dataworks Summit San Jose 2018
Gimel at Dataworks Summit San Jose 2018Romit Mehta
 
Dataworks | 2018-06-20 | Gimel data platform
Dataworks | 2018-06-20 | Gimel data platformDataworks | 2018-06-20 | Gimel data platform
Dataworks | 2018-06-20 | Gimel data platformDeepak Chandramouli
 
Hadoop: The Unintended Benefits
Hadoop: The Unintended BenefitsHadoop: The Unintended Benefits
Hadoop: The Unintended BenefitsDataWorks Summit
 

Ähnlich wie Hadoop Graph Processing with Apache Giraph (20)

Building a Hadoop Powered Commerce Data Pipeline
Building a Hadoop Powered Commerce Data PipelineBuilding a Hadoop Powered Commerce Data Pipeline
Building a Hadoop Powered Commerce Data Pipeline
 
Venkatesh Ramanathan, Data Scientist, PayPal at MLconf ATL 2017
Venkatesh Ramanathan, Data Scientist, PayPal at MLconf ATL 2017Venkatesh Ramanathan, Data Scientist, PayPal at MLconf ATL 2017
Venkatesh Ramanathan, Data Scientist, PayPal at MLconf ATL 2017
 
Using Hadoop to Drive Down Fraud for Telcos
Using Hadoop to Drive Down Fraud for TelcosUsing Hadoop to Drive Down Fraud for Telcos
Using Hadoop to Drive Down Fraud for Telcos
 
Rakuten techconf2015.baiji.he.bigdataforsmallstartupandbeyond
Rakuten techconf2015.baiji.he.bigdataforsmallstartupandbeyondRakuten techconf2015.baiji.he.bigdataforsmallstartupandbeyond
Rakuten techconf2015.baiji.he.bigdataforsmallstartupandbeyond
 
PayPal Notebooks at Jupytercon 2018
PayPal Notebooks at Jupytercon 2018PayPal Notebooks at Jupytercon 2018
PayPal Notebooks at Jupytercon 2018
 
Graph Gurus Episode 25: Unleash the Business Value of Your Data Lake with Gra...
Graph Gurus Episode 25: Unleash the Business Value of Your Data Lake with Gra...Graph Gurus Episode 25: Unleash the Business Value of Your Data Lake with Gra...
Graph Gurus Episode 25: Unleash the Business Value of Your Data Lake with Gra...
 
KNIME Software Overview
KNIME Software OverviewKNIME Software Overview
KNIME Software Overview
 
Atlanta hadoop users group july 2013
Atlanta hadoop users group july 2013Atlanta hadoop users group july 2013
Atlanta hadoop users group july 2013
 
Spark as part of a Hybrid RDBMS Architecture-John Leach Cofounder Splice Machine
Spark as part of a Hybrid RDBMS Architecture-John Leach Cofounder Splice MachineSpark as part of a Hybrid RDBMS Architecture-John Leach Cofounder Splice Machine
Spark as part of a Hybrid RDBMS Architecture-John Leach Cofounder Splice Machine
 
Accumulo Summit 2014: Accumulo with Distributed SQL queries
Accumulo Summit 2014: Accumulo with Distributed SQL queriesAccumulo Summit 2014: Accumulo with Distributed SQL queries
Accumulo Summit 2014: Accumulo with Distributed SQL queries
 
Big Data LDN 2016: When Big Data Meets Fast Data
Big Data LDN 2016: When Big Data Meets Fast DataBig Data LDN 2016: When Big Data Meets Fast Data
Big Data LDN 2016: When Big Data Meets Fast Data
 
RedisConf17 - Redfin - The Real Estate Brokerage and the In-memory Database
RedisConf17 - Redfin - The Real Estate Brokerage and the In-memory Database RedisConf17 - Redfin - The Real Estate Brokerage and the In-memory Database
RedisConf17 - Redfin - The Real Estate Brokerage and the In-memory Database
 
Large-Scale Data Science on Hadoop (Intel Big Data Day)
Large-Scale Data Science on Hadoop (Intel Big Data Day)Large-Scale Data Science on Hadoop (Intel Big Data Day)
Large-Scale Data Science on Hadoop (Intel Big Data Day)
 
GOAI: GPU-Accelerated Data Science DataSciCon 2017
GOAI: GPU-Accelerated Data Science DataSciCon 2017GOAI: GPU-Accelerated Data Science DataSciCon 2017
GOAI: GPU-Accelerated Data Science DataSciCon 2017
 
How Data Drives Business at Choice Hotels
How Data Drives Business at Choice HotelsHow Data Drives Business at Choice Hotels
How Data Drives Business at Choice Hotels
 
Spark: Building an application from Start to Finish
Spark: Building an application from Start to FinishSpark: Building an application from Start to Finish
Spark: Building an application from Start to Finish
 
QCon 2018 | Gimel | PayPal's Analytic Platform
QCon 2018 | Gimel | PayPal's Analytic PlatformQCon 2018 | Gimel | PayPal's Analytic Platform
QCon 2018 | Gimel | PayPal's Analytic Platform
 
Gimel at Dataworks Summit San Jose 2018
Gimel at Dataworks Summit San Jose 2018Gimel at Dataworks Summit San Jose 2018
Gimel at Dataworks Summit San Jose 2018
 
Dataworks | 2018-06-20 | Gimel data platform
Dataworks | 2018-06-20 | Gimel data platformDataworks | 2018-06-20 | Gimel data platform
Dataworks | 2018-06-20 | Gimel data platform
 
Hadoop: The Unintended Benefits
Hadoop: The Unintended BenefitsHadoop: The Unintended Benefits
Hadoop: The Unintended Benefits
 

Mehr von DataWorks Summit

Floating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache RatisFloating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache RatisDataWorks Summit
 
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiTracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiDataWorks Summit
 
HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...DataWorks Summit
 
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...DataWorks Summit
 
Managing the Dewey Decimal System
Managing the Dewey Decimal SystemManaging the Dewey Decimal System
Managing the Dewey Decimal SystemDataWorks Summit
 
Practical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExamplePractical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExampleDataWorks Summit
 
HBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberHBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberDataWorks Summit
 
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixScaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixDataWorks Summit
 
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiBuilding the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiDataWorks Summit
 
Supporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsSupporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsDataWorks Summit
 
Security Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureSecurity Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureDataWorks Summit
 
Presto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EnginePresto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EngineDataWorks Summit
 
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...DataWorks Summit
 
Extending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudExtending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudDataWorks Summit
 
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiEvent-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiDataWorks Summit
 
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerSecuring Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerDataWorks Summit
 
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...DataWorks Summit
 
Computer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouComputer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouDataWorks Summit
 
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkBig Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkDataWorks Summit
 

Mehr von DataWorks Summit (20)

Data Science Crash Course
Data Science Crash CourseData Science Crash Course
Data Science Crash Course
 
Floating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache RatisFloating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache Ratis
 
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiTracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
 
HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...
 
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
 
Managing the Dewey Decimal System
Managing the Dewey Decimal SystemManaging the Dewey Decimal System
Managing the Dewey Decimal System
 
Practical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExamplePractical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist Example
 
HBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberHBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at Uber
 
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixScaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
 
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiBuilding the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
 
Supporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsSupporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability Improvements
 
Security Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureSecurity Framework for Multitenant Architecture
Security Framework for Multitenant Architecture
 
Presto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EnginePresto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything Engine
 
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
 
Extending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudExtending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google Cloud
 
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiEvent-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
 
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerSecuring Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
 
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
 
Computer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouComputer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near You
 
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkBig Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
 

Kürzlich hochgeladen

The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxLoriGlavin3
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfLoriGlavin3
 
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxLoriGlavin3
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii SoldatenkoFwdays
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxLoriGlavin3
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Mark Simos
 
SALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICESSALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICESmohitsingh558521
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteDianaGray10
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .Alan Dix
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsSergiu Bodiu
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenHervé Boutemy
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr BaganFwdays
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brandgvaughan
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxLoriGlavin3
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity PlanDatabarracks
 
Time Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsTime Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsNathaniel Shimoni
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024Stephanie Beckett
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024Lorenzo Miniero
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxLoriGlavin3
 

Kürzlich hochgeladen (20)

The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdf
 
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
 
SALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICESSALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICES
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test Suite
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache Maven
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brand
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity Plan
 
Time Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsTime Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directions
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
 

Hadoop Graph Processing with Apache Giraph

  • 1. June, 2013 Jay Tang GRAPH MINING WITH APACHE GIRAPH
  • 2. Confidential and Proprietary2 • Introduction • Big Data problem • Graph mining platform • Use case • Lessons • Future work AGENDA
  • 3. Confidential and Proprietary3 • Director of Big Data Platform & Analytics, PayPal − Hadoop, Graph mining, Real-time analytics, ML, text mining • 20 years of software experience in the valley focused on data • Member of original Hadoop team @Yahoo • Built data warehouse, relational database, OLAP product @Yahoo, Oracle/Hyperion, IBM Informix, DB2 ABOUT ME
  • 5. Confidential and Proprietary5 • Enable Online, Offline, and Mobile payment • 128M customers worldwide • $160B payment volume processed annually • Major retail locations accepting PayPal 20K today  2M end of 2013 • PayPal Here launching in US and international markets Petabye Data Problem & Growing BIG DATA PROBLEM @ PAYPAL
  • 6. Confidential and Proprietary6 • Detect and prevent fraud • Assess credit risk • Relevant offer to our customers • Improve user experience • Provide better insights to our merchants BIG DATA POWERS PAYPAL ANALYTICS
  • 8. Confidential and Proprietary8 BIG DATA STACK Data Cloud
  • 9. Confidential and Proprietary9 Traditional data processing abstraction -- TABLE • Rows • Columns • Data Types DATA ABSTRACTION
  • 10. Confidential and Proprietary10 • Internet & WWW • Social network • PayPal payment network – accounts & transactions GRAPH IS EVERYWHERE
  • 11. Confidential and Proprietary11 • Think like a vertex • Two basic operations − Fusion: aggregate information from neighbors to a set of entities − Diffusion: propagate information from a vertex to neighbors GRAPH COMPUTING
  • 12. Confidential and Proprietary12 THING LIKE A VERTEX - FUSION
  • 13. Confidential and Proprietary13 THINK LIKE A VERTEX - DIFFUSION
  • 14. Confidential and Proprietary14 • Which graph mining engine to use? − GraphLab − Apache Giraph − Apache Hamas • Hadoop compatible − Data is on Hadoop − Leverage existing cluster infrastructure − Integration with Hadoop • Easy of deployment and update • Community GRAPH MINING ENGINE
  • 15. Confidential and Proprietary15 • Apache open src implementation of Google Pregel on Hadoop • Send msg from a vertex to any other vertex • In-memory scalable system − Map-only jobs, Zookeeper, Netty BSP & GIRAPH
  • 17. Confidential and Proprietary17 • Stop fraudsters from stealing money from PayPal payment network • Sophisticate risk models running in real-time based on − Online data − Offline data • Risk profile traditionally based on a variety of data − Account − Transaction -- frequency, amount, history − IP − Email domain RISK DETECTION & MITIGATION
  • 18. Confidential and Proprietary18 RISK COMPUTATION Current TX Details Risk Models Approve DeclineHistory Data
  • 19. Confidential and Proprietary19 • PayPal data are connected • Form multiple communities that have hidden inferences • Discover the inferences via a graph approach • Build a system to extract the inferences GRAPH MINING CONNECTED DATA
  • 20. Confidential and Proprietary20 GRAPH VIEW OF DATA User1 User2 Merchant BUY BUY P2P Money Transfer
  • 21. Confidential and Proprietary21 GRAPH VIEW OF DATA Account 1 IP1 IP2 Account 2 IP3
  • 22. Confidential and Proprietary22 GRAPH MINING DATA PIPELINE Pre Processing Graph Processing Post Processing
  • 23. Confidential and Proprietary23 • Input data is raw transaction data • Custom MapReduce jobs to pre-process data into graph model • Output is JSON format of adjacent node list − Easy to consume in Java and by humans − Use gson library • Post processing – output format conversion GRAPH DATA PIPELINE
  • 24. Confidential and Proprietary24 • Customers/Accounts linked via transactions • Compute risk = intrinsic risk + risk propagated from peers • Send risk message to peers • Iterate till converge GRAPH PROCESSING Cus1 Cus2 Transaction T1 Transaction T0 Transaction T2 Transaction T3
  • 25. Confidential and Proprietary25 IP3 IP2 GRAPH PROCESSING Account 1 IP1 IP2 Account 2 IP3IP1
  • 27. Confidential and Proprietary27 • Giraph is an emerging technology − Incubation in 2012 − Rapidly evolving − 0.1 and 0.2 are not compatible − Lack of knowledge & doc • Build internal git repo • Read code and join mailing list • Port code from 0.1 to 0.2 • Use Giraph 1.0 released on May 6 2013 GIRAPH
  • 28. Confidential and Proprietary28 • Must guarantee minimum number of Mappers • Capacity scheduler − set MIN mapper of queue > Giraph job needs • Fair scheduler − set MIN mapper of queue > Giraph job needs − Turn on pre-emption − Set pre-emption wait time to a small interval – 20 sec HADOOP ENVIRONMENT INTEGRATION
  • 29. Confidential and Proprietary29 • Memory constraint in a shared Hadoop environment − 1.2B edges and 300M nodes − Single purpose POC cluster mapper memory = 10 GB − Shared R&D cluster mapper memory = 3 GB • Reduce memory consumption is key − Convert String to long for graph processing − Convert back to String in post-processing for downstream application − Cap the number of messages passed − distance from current vertex − message payload data values MEMORY SCALABILITY
  • 30. Confidential and Proprietary30 • Giraph-based data engine to produce enriched data set • Leverage Giraph on YARN • Number of worker scalability FUTURE WORK

Hinweis der Redaktion

  1. Input data size, giraph data size