BIG DATA ANALYTICS
Can it deliver speed and accuracy?
Risk & Compliance Engineering, PayPal
Gurinder S. Grewal
This deck contains generic architecture information, and does not
reflect the exact details of current or planned systems.
June 2013
ABOUT PAYPAL
• 123MM active users
• 190 markets, 25 currencies
• $300,000 in payments processed per minute
• 2B+ events/day
• 12 TB new data added per day
• 500K+ real time queries per second
• < 100ms average response time
We are talking about a lot of data: big data!
WHAT IS BIG DATA?
transactions
interactions
observations
petabytes of data
diverse analytics
variety of data structures
Hadoop
large number of characteristics
large map/reduce cluster
Teradata
GROWING COMPLEXITY AND EXPECTATIONS
Emerging technologies in the modern world are opening up possibilities
for sophisticated analytics.
Data infrastructure is growing, and so are the expectations: make decisions
fast and with higher accuracy!
[Chart: fraud sophistication and data complexity (low to high) over time, alongside the time taken to make a decision. Techniques shown: simple rules and black/white lists; linear models and aggregated variables; location and time analysis; inline history analysis; consistency; networks.]
DECISIONS MUST BE QUICK
• A gang of cyber-criminals stole $45 million in a matter of hours
• More than 36,000 transactions were made worldwide and about $40
million was stolen in 6 hours
Source: http://www.huffingtonpost.com/2013/05/09/atm-fraud_n_3248331.html
[Chart: business value (low to high) versus fraud loss (low to high); labels: Prevention, Fast Detection, High Fraud Loss.]
DECISIONS MUST BE ACCURATE
11:01AM
11:05AM
11:06AM
• Credit card used from three distant locations within a short time
Result based on real-time analysis: block the card, or leave it undecided?
• According to past purchasing behavior:
• Card holder lives in the US - wife paid a bill online from the home PC
• Card holder’s kid studies in Europe - used the card to purchase books
• Card holder travels to Japan - paid for lunch
Result based on historical analysis: it is legitimate usage
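The two verdicts on this slide can be sketched as a pair of checks. This is a hypothetical illustration, not PayPal's logic: the `CardEvent` fields, the 10-minute window, the region threshold, and the `known_regions` profile are all assumptions made up for the example.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class CardEvent:
    when: datetime
    region: str  # hypothetical field: where the card was used


def velocity_flag(events, window=timedelta(minutes=10), max_regions=2):
    """Real-time style check: too many distinct regions inside a short window."""
    events = sorted(events, key=lambda e: e.when)
    for i, first in enumerate(events):
        regions = {e.region for e in events[i:] if e.when - first.when <= window}
        if len(regions) > max_regions:
            return True
    return False


def matches_history(events, known_regions):
    """Historical style check: every event fits a region seen in past behavior."""
    return all(e.region in known_regions for e in events)


def decide(events, known_regions):
    """Block only when the fast check fires and history cannot explain it."""
    if not velocity_flag(events):
        return "allow"
    return "allow" if matches_history(events, known_regions) else "block"


day = datetime(2013, 6, 1)
events = [
    CardEvent(day.replace(hour=11, minute=1), "US"),
    CardEvent(day.replace(hour=11, minute=5), "Europe"),
    CardEvent(day.replace(hour=11, minute=6), "Japan"),
]
with_history = decide(events, known_regions={"US", "Europe", "Japan"})
without_history = decide(events, known_regions={"US"})
```

With the full profile the decision is "allow"; with only the US on record the same three events are blocked, which is the tension between speed and accuracy the slide describes.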
DO WE HAVE CONFLICTING REQUIREMENTS?
speed
• analyze data arriving at high velocity in a split second
• consume data in a timely manner to make decisions
accuracy
• utilize powerful analytics techniques (text mining, predictive analysis)
• process a large variety and volume of data (details)
cost
• can’t spend a dollar to save a penny – pick the right tool for the right job
TIERED BIG DATA STRATEGY
• real time, e.g. filters
• near real time, e.g. correlations
• offline, e.g. behavioral analysis
[Chart axes: cost and speed; data volume and accuracy; data age from seconds through hours to years; data in motion versus data in use]
effective decision = fn(accuracy, speed, cost)
BIG DATA - COMPUTATION STRATEGY
Realtime (in-flow processing)
• fast, with very stringent availability and performance SLAs
• computations are simple and eventually accurate
• computations are transient and short-lived (user sessions)
Near real-time (complex event processing)
• event-driven, incremental processing
• high efficiency and scalability
• data for short time windows (hours)
Offline (map-reduce, batch)
• optimized for throughput
• computations are slow and accurate
• data captured as events for historical analysis
[Diagram labels: offline variables, online variables]
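The contrast between event-driven, incremental processing and batch recomputation can be shown with a per-account running mean. This is a minimal sketch under assumed names (`RunningStats`, `batch_stats`, and the sample events are invented for illustration); it only demonstrates that the incremental path reaches the same answer without rescanning history.

```python
from collections import defaultdict


class RunningStats:
    """Online path: per-key count and mean, updated one event at a time."""

    def __init__(self):
        self.count = defaultdict(int)
        self.mean = defaultdict(float)

    def update(self, key, amount):
        self.count[key] += 1
        # Incremental mean: earlier events are never stored or rescanned.
        self.mean[key] += (amount - self.mean[key]) / self.count[key]


def batch_stats(events):
    """Offline path: recompute from the full captured event history."""
    totals, counts = defaultdict(float), defaultdict(int)
    for key, amount in events:
        totals[key] += amount
        counts[key] += 1
    return {k: (counts[k], totals[k] / counts[k]) for k in counts}


# Hypothetical (account, amount) events.
events = [("acct-1", 20.0), ("acct-2", 5.0), ("acct-1", 40.0), ("acct-1", 30.0)]

online = RunningStats()
for key, amount in events:
    online.update(key, amount)

offline = batch_stats(events)
```

The online object answers after every event at constant cost per update, while the batch function needs the whole history, which is why the deck assigns them to different tiers.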
BIG DATA IN USE - OFFLINE ECOSYSTEM
Hadoop Technology Stack
• Data Storage: HDFS, HBase
• Data Processing: MapReduce framework
• Data Integration: ETL, Flume, Sqoop
• Programming Languages: Pig, HiveQL
• Scheduling, Coordination: ZooKeeper, Oozie
• UI Framework/SDK: Hue, Hue SDK
• Structured Data: MPP DW, RDBMS
• Unstructured Data
BIG DATA IN MOTION – ONLINE ECOSYSTEM
[Diagram: events stream, message bus, complex event processing (correlations, filtering, aggregations, pattern matching), in-memory data store, decision service, monitoring, offline]
CEP enables continuous analytics on data in motion
• Solution for the velocity of big data
• Well suited for detection, decisioning, alerting and taking actions
• Relies on an in-memory data grid to provide low latency
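The CEP operations named on this slide (filtering, aggregation, pattern matching) can be sketched over a toy event stream. This is a hand-rolled illustration and not any particular CEP engine: the event tuple layout, the 60-second window, and the threshold of three payments are assumptions, and timestamps are assumed to arrive in order.

```python
from collections import defaultdict, deque


class SlidingWindowCounter:
    """Counts events per key over the last `window_seconds` of event time."""

    def __init__(self, window_seconds):
        self.window = window_seconds
        self.events = defaultdict(deque)  # key -> timestamps, oldest first

    def add(self, key, ts):
        q = self.events[key]
        q.append(ts)
        # Evict timestamps that have fallen out of the window.
        while q and q[0] <= ts - self.window:
            q.popleft()
        return len(q)


def detect_bursts(stream, window_seconds=60, threshold=3):
    """Yield (timestamp, key) whenever a key reaches the threshold in-window."""
    counter = SlidingWindowCounter(window_seconds)
    for ts, key, kind in stream:
        if kind != "payment":  # filtering
            continue
        if counter.add(key, ts) >= threshold:  # aggregation + simple pattern
            yield (ts, key)


# Hypothetical stream of (seconds, card id, event kind).
stream = [
    (0, "card-A", "payment"),
    (10, "card-A", "login"),
    (20, "card-A", "payment"),
    (30, "card-B", "payment"),
    (50, "card-A", "payment"),
    (200, "card-A", "payment"),
]
alerts = list(detect_bursts(stream))
```

All state lives in memory and is bounded by the window, which is the property the slide attributes to the in-memory data grid.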
BIG DATA MOVEMENT
[Diagram labels: Offline, Data Cloud]
Data movement between offline and online is the key and biggest challenge
• ETL jobs require custom coding and are the biggest bottleneck
• Data transfer is very expensive and slow across networks and multiple data centers
• Online data stores are not optimized for parallel or bulk loads
  • Slows down the data store during ETL operations
  • Negatively impacts the availability of online applications
BIG DATA MOVEMENT EVOLUTION
[Diagram labels: two-tier architecture with Offline and in-memory data store (Data Cloud); later architecture with Offline, NoSQL (persistent backing store), and in-memory data store (Data Cloud)]
Initial state
• 500 GB in 16 hours
Optimization – Phase 1
• 2 TB in 16 hours
• Split data files prepared offline
• Maximize data load parallelism
• Maximum data compression
• Optimize data format
• Validation before data movement
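The Phase 1 steps (validate, split, compress, load in parallel) can be sketched end to end. Everything here is an assumption for illustration: `OnlineStore` is a stand-in dictionary rather than a real data store, JSON plus gzip stands in for whatever format and compression were actually used, and the record shape is invented.

```python
import gzip
import json
from concurrent.futures import ThreadPoolExecutor
from threading import Lock


def validate(records):
    """Offline side: reject malformed rows before any data is moved."""
    for row in records:
        if len(row) != 2 or not isinstance(row[0], str) or not row[0]:
            raise ValueError(f"malformed row: {row!r}")


def prepare_chunks(records, chunk_size):
    """Offline side: validate, then split into compressed chunks."""
    validate(records)
    return [
        gzip.compress(json.dumps(records[i:i + chunk_size]).encode("utf-8"))
        for i in range(0, len(records), chunk_size)
    ]


class OnlineStore:
    """Hypothetical stand-in for an online key-value store."""

    def __init__(self):
        self._data = {}
        self._lock = Lock()

    def bulk_put(self, items):
        with self._lock:
            self._data.update(items)

    def get(self, key):
        return self._data.get(key)

    def __len__(self):
        return len(self._data)


def load_chunk(store, payload):
    rows = json.loads(gzip.decompress(payload).decode("utf-8"))
    store.bulk_put({key: value for key, value in rows})
    return len(rows)


def parallel_load(store, chunks, workers=4):
    """Load all chunks concurrently and return the number of rows written."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(lambda chunk: load_chunk(store, chunk), chunks))


records = [[f"user-{i}", {"risk_score": i % 7}] for i in range(1000)]
chunks = prepare_chunks(records, chunk_size=128)
store = OnlineStore()
loaded = parallel_load(store, chunks)
```

Validating before chunking means a bad row fails the job offline, before any transfer or online write has been paid for.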
Scale – Phase 2
• 10 TB in 6 hours
• Add a persistent NoSQL store behind the in-memory store
• Blast bulk loads into the NoSQL store
• Batch process warms the cache
• Lazy warm-up as needed, while serving reads and writes
• Refresh cache contents via time-based evictions
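The Phase 2 pattern (bulk load into a persistent store, batch and lazy cache warm-up, time-based eviction) can be sketched as a small read-through cache. This is an assumed design, not the system described in the deck: `TwoTierStore` and its methods are hypothetical, and a plain dictionary stands in for the NoSQL store.

```python
import time


class TwoTierStore:
    """In-memory cache in front of a persistent backing store (here: a dict)."""

    def __init__(self, backing, ttl_seconds, clock=time.monotonic):
        self.backing = backing  # stand-in for the NoSQL store
        self.ttl = ttl_seconds
        self.clock = clock
        self.cache = {}  # key -> (value, expires_at)
        self.backing_reads = 0

    def bulk_load(self, items):
        """Batch path: write straight to the backing store, bypassing the cache."""
        self.backing.update(items)

    def warm(self, keys):
        """Batch warm-up: pre-populate the cache for keys expected to be hot."""
        for key in keys:
            self.get(key)

    def get(self, key):
        now = self.clock()
        hit = self.cache.get(key)
        if hit is not None and hit[1] > now:
            return hit[0]
        # Miss or expired entry: lazily (re)load from the backing store.
        self.backing_reads += 1
        value = self.backing.get(key)
        if value is not None:
            self.cache[key] = (value, now + self.ttl)
        else:
            self.cache.pop(key, None)
        return value

    def put(self, key, value):
        """Online write: write through to both tiers."""
        self.backing[key] = value
        self.cache[key] = (value, self.clock() + self.ttl)


now = [0.0]  # controllable clock so the example does not sleep
store = TwoTierStore(backing={}, ttl_seconds=60, clock=lambda: now[0])
store.bulk_load({"acct-1": "v1", "acct-2": "v1"})
store.warm(["acct-1"])
first = store.get("acct-1")        # served from the warmed cache
store.bulk_load({"acct-1": "v2"})  # a new batch lands in the backing store
stale = store.get("acct-1")        # cache entry has not expired yet
now[0] = 61.0
fresh = store.get("acct-1")        # expired, so reloaded from the backing store
```

The bulk load never touches the cache, so online reads are not slowed by it; the cost is that a cached entry can lag the backing store by up to one TTL.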
Batch
Multi-tier architecture
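The three load figures on this slide are easier to compare as sustained throughput. The arithmetic below assumes decimal units (1 GB = 10^9 bytes, 1 TB = 10^12 bytes) and treats each figure as an average over the whole load; the deck does not state either.

```python
def throughput_mb_per_s(volume_bytes, hours):
    """Average sustained throughput in MB/s (1 MB = 10**6 bytes)."""
    return volume_bytes / (hours * 3600) / 1e6


GB, TB = 10**9, 10**12
phases = {
    "initial": throughput_mb_per_s(500 * GB, 16),
    "phase 1": throughput_mb_per_s(2 * TB, 16),
    "phase 2": throughput_mb_per_s(10 * TB, 6),
}
for name, rate in phases.items():
    ratio = rate / phases["initial"]
    print(f"{name}: {rate:.1f} MB/s, {ratio:.1f}x the initial rate")
```

Under those assumptions the rates come to roughly 8.7, 34.7, and 463 MB/s, so Phase 1 is a 4x improvement and Phase 2 is about 53x the initial state.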
Confidential and Proprietary
USE CASE: GRAPH BASED DECISIONING
[Diagram: on the offline side, a Map/Reduce graph builder sends daily incremental updates to the Online Graph Server; on the online side, the in-memory graph store receives continuous graph updates and rollup from the events stream and is read by the Decision Service]
• Generate the graph and associated complex variables on Hadoop daily
• Move the incremental changes to the online in-memory graph store
• Based on the event stream, keep the graph and offline variables up to date
• In-memory store provides fast read-only access to decision services
Avg. read time: 2 ms; 95th percentile: 6 ms
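The flow on this slide (daily deltas from the offline builder, continuous updates from the event stream, read-only lookups for decisions) can be sketched with a small in-memory graph. The class, the account/device node types, and the degree rollup are assumptions chosen for illustration; the deck does not say what the graph contains.

```python
from collections import defaultdict


class InMemoryGraph:
    """Undirected linkage graph with a precomputed per-node variable."""

    def __init__(self):
        self.adj = defaultdict(set)
        self.degree = {}  # rolled-up variable served to decision services

    def add_edge(self, a, b):
        """Apply one incremental change and keep the rollup current."""
        self.adj[a].add(b)
        self.adj[b].add(a)
        self.degree[a] = len(self.adj[a])
        self.degree[b] = len(self.adj[b])

    def apply_daily_delta(self, edges):
        """Daily incremental update produced by the offline graph builder."""
        for a, b in edges:
            self.add_edge(a, b)

    def on_event(self, event):
        """Continuous update from the event stream (hypothetical event shape)."""
        self.add_edge(event["account"], event["device"])

    def linked_accounts(self, account):
        """Read-only lookup: nodes that share any neighbor with `account`."""
        linked = set()
        for neighbor in self.adj.get(account, ()):
            linked |= self.adj[neighbor]
        linked.discard(account)
        return linked


graph = InMemoryGraph()
graph.apply_daily_delta([("acct-1", "device-X"), ("acct-2", "device-X")])
graph.on_event({"account": "acct-3", "device": "device-X"})
linked = graph.linked_accounts("acct-1")
```

Because the rollup is maintained on every write, the read path is a dictionary lookup plus a small set union, which fits the millisecond-level read times quoted above.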
CONCLUSION
• Hadoop is best for offline processing of high-variety, high-volume data, not for real time
• CEP is a solution for online big data in motion (velocity) and complements Hadoop
• Harness the true power of big data by combining offline and online data
• Data integration is key: careful planning and optimization are needed
• Online data stores are not optimized for highly parallel writes or bulk loads
• Big data can solve complex problems while delivering speed and accuracy
Big data – can it deliver speed and accuracy v1

Weitere ähnliche Inhalte

Was ist angesagt?

Introducing a horizontally scalable, inference-based business Rules Engine fo...
Introducing a horizontally scalable, inference-based business Rules Engine fo...Introducing a horizontally scalable, inference-based business Rules Engine fo...
Introducing a horizontally scalable, inference-based business Rules Engine fo...Cask Data
 
New World Hadoop Architectures (& What Problems They Really Solve) for Oracle...
New World Hadoop Architectures (& What Problems They Really Solve) for Oracle...New World Hadoop Architectures (& What Problems They Really Solve) for Oracle...
New World Hadoop Architectures (& What Problems They Really Solve) for Oracle...Rittman Analytics
 
Stream Processing as Game Changer for Big Data and Internet of Things by Kai ...
Stream Processing as Game Changer for Big Data and Internet of Things by Kai ...Stream Processing as Game Changer for Big Data and Internet of Things by Kai ...
Stream Processing as Game Changer for Big Data and Internet of Things by Kai ...Big Data Spain
 
Optimizing industrial operations using the big data ecosystem
Optimizing industrial operations using the big data ecosystemOptimizing industrial operations using the big data ecosystem
Optimizing industrial operations using the big data ecosystemDataWorks Summit
 
WSO2 Big Data Analytics Platform
WSO2 Big Data Analytics PlatformWSO2 Big Data Analytics Platform
WSO2 Big Data Analytics PlatformSamisa Abeysinghe
 
Enabling Next Gen Analytics with Azure Data Lake and StreamSets
Enabling Next Gen Analytics with Azure Data Lake and StreamSetsEnabling Next Gen Analytics with Azure Data Lake and StreamSets
Enabling Next Gen Analytics with Azure Data Lake and StreamSetsStreamsets Inc.
 
Hybrid Transactional/Analytics Processing: Beyond the Big Database Hype
Hybrid Transactional/Analytics Processing: Beyond the Big Database HypeHybrid Transactional/Analytics Processing: Beyond the Big Database Hype
Hybrid Transactional/Analytics Processing: Beyond the Big Database HypeAli Hodroj
 
xGem Data Stream Processing
xGem Data Stream ProcessingxGem Data Stream Processing
xGem Data Stream ProcessingJorge Hirtz
 
Strata San Jose 2017 - Ben Sharma Presentation
Strata San Jose 2017 - Ben Sharma PresentationStrata San Jose 2017 - Ben Sharma Presentation
Strata San Jose 2017 - Ben Sharma PresentationZaloni
 
Building Custom Big Data Integrations
Building Custom Big Data IntegrationsBuilding Custom Big Data Integrations
Building Custom Big Data IntegrationsPat Patterson
 
HP Discover: Real Time Insights from Big Data
HP Discover: Real Time Insights from Big DataHP Discover: Real Time Insights from Big Data
HP Discover: Real Time Insights from Big DataRob Winters
 
Finding the needle in the haystack: how Nestle is leveraging big data to defe...
Finding the needle in the haystack: how Nestle is leveraging big data to defe...Finding the needle in the haystack: how Nestle is leveraging big data to defe...
Finding the needle in the haystack: how Nestle is leveraging big data to defe...Big Data Spain
 
Billions of Rows, Millions of Insights, Right Now
Billions of Rows, Millions of Insights, Right NowBillions of Rows, Millions of Insights, Right Now
Billions of Rows, Millions of Insights, Right NowRob Winters
 
Apache Flink for IoT: How Event-Time Processing Enables Easy and Accurate Ana...
Apache Flink for IoT: How Event-Time Processing Enables Easy and Accurate Ana...Apache Flink for IoT: How Event-Time Processing Enables Easy and Accurate Ana...
Apache Flink for IoT: How Event-Time Processing Enables Easy and Accurate Ana...Big Data Spain
 
How a Tweet Went Viral - BIWA Summit 2017
How a Tweet Went Viral - BIWA Summit 2017How a Tweet Went Viral - BIWA Summit 2017
How a Tweet Went Viral - BIWA Summit 2017Rittman Analytics
 
Big Data LDN 2018: BIG DATA TOO SLOW? SPRINKLE IN SOME NOSQL
Big Data LDN 2018: BIG DATA TOO SLOW? SPRINKLE IN SOME NOSQLBig Data LDN 2018: BIG DATA TOO SLOW? SPRINKLE IN SOME NOSQL
Big Data LDN 2018: BIG DATA TOO SLOW? SPRINKLE IN SOME NOSQLMatt Stubbs
 
Scalable Data Management for Kafka and Beyond | Dan Rice, BigID
Scalable Data Management for Kafka and Beyond | Dan Rice, BigIDScalable Data Management for Kafka and Beyond | Dan Rice, BigID
Scalable Data Management for Kafka and Beyond | Dan Rice, BigIDHostedbyConfluent
 
Calum McCrea, Software Engineer at Kx Systems, "Kx: How Wall Street Tech can ...
Calum McCrea, Software Engineer at Kx Systems, "Kx: How Wall Street Tech can ...Calum McCrea, Software Engineer at Kx Systems, "Kx: How Wall Street Tech can ...
Calum McCrea, Software Engineer at Kx Systems, "Kx: How Wall Street Tech can ...Dataconomy Media
 
Webinar: The Modern Streaming Data Stack with Kinetica & StreamSets
Webinar: The Modern Streaming Data Stack with Kinetica & StreamSetsWebinar: The Modern Streaming Data Stack with Kinetica & StreamSets
Webinar: The Modern Streaming Data Stack with Kinetica & StreamSetsKinetica
 

Was ist angesagt? (20)

Introducing a horizontally scalable, inference-based business Rules Engine fo...
Introducing a horizontally scalable, inference-based business Rules Engine fo...Introducing a horizontally scalable, inference-based business Rules Engine fo...
Introducing a horizontally scalable, inference-based business Rules Engine fo...
 
New World Hadoop Architectures (& What Problems They Really Solve) for Oracle...
New World Hadoop Architectures (& What Problems They Really Solve) for Oracle...New World Hadoop Architectures (& What Problems They Really Solve) for Oracle...
New World Hadoop Architectures (& What Problems They Really Solve) for Oracle...
 
Stream Processing as Game Changer for Big Data and Internet of Things by Kai ...
Stream Processing as Game Changer for Big Data and Internet of Things by Kai ...Stream Processing as Game Changer for Big Data and Internet of Things by Kai ...
Stream Processing as Game Changer for Big Data and Internet of Things by Kai ...
 
Optimizing industrial operations using the big data ecosystem
Optimizing industrial operations using the big data ecosystemOptimizing industrial operations using the big data ecosystem
Optimizing industrial operations using the big data ecosystem
 
WSO2 Big Data Analytics Platform
WSO2 Big Data Analytics PlatformWSO2 Big Data Analytics Platform
WSO2 Big Data Analytics Platform
 
Enabling Next Gen Analytics with Azure Data Lake and StreamSets
Enabling Next Gen Analytics with Azure Data Lake and StreamSetsEnabling Next Gen Analytics with Azure Data Lake and StreamSets
Enabling Next Gen Analytics with Azure Data Lake and StreamSets
 
Hybrid Transactional/Analytics Processing: Beyond the Big Database Hype
Hybrid Transactional/Analytics Processing: Beyond the Big Database HypeHybrid Transactional/Analytics Processing: Beyond the Big Database Hype
Hybrid Transactional/Analytics Processing: Beyond the Big Database Hype
 
xGem Data Stream Processing
xGem Data Stream ProcessingxGem Data Stream Processing
xGem Data Stream Processing
 
Strata San Jose 2017 - Ben Sharma Presentation
Strata San Jose 2017 - Ben Sharma PresentationStrata San Jose 2017 - Ben Sharma Presentation
Strata San Jose 2017 - Ben Sharma Presentation
 
Building Custom Big Data Integrations
Building Custom Big Data IntegrationsBuilding Custom Big Data Integrations
Building Custom Big Data Integrations
 
HP Discover: Real Time Insights from Big Data
HP Discover: Real Time Insights from Big DataHP Discover: Real Time Insights from Big Data
HP Discover: Real Time Insights from Big Data
 
Finding the needle in the haystack: how Nestle is leveraging big data to defe...
Finding the needle in the haystack: how Nestle is leveraging big data to defe...Finding the needle in the haystack: how Nestle is leveraging big data to defe...
Finding the needle in the haystack: how Nestle is leveraging big data to defe...
 
Billions of Rows, Millions of Insights, Right Now
Billions of Rows, Millions of Insights, Right NowBillions of Rows, Millions of Insights, Right Now
Billions of Rows, Millions of Insights, Right Now
 
Apache Flink for IoT: How Event-Time Processing Enables Easy and Accurate Ana...
Apache Flink for IoT: How Event-Time Processing Enables Easy and Accurate Ana...Apache Flink for IoT: How Event-Time Processing Enables Easy and Accurate Ana...
Apache Flink for IoT: How Event-Time Processing Enables Easy and Accurate Ana...
 
How a Tweet Went Viral - BIWA Summit 2017
How a Tweet Went Viral - BIWA Summit 2017How a Tweet Went Viral - BIWA Summit 2017
How a Tweet Went Viral - BIWA Summit 2017
 
Loan Decisioning Transformation
Loan Decisioning TransformationLoan Decisioning Transformation
Loan Decisioning Transformation
 
Big Data LDN 2018: BIG DATA TOO SLOW? SPRINKLE IN SOME NOSQL
Big Data LDN 2018: BIG DATA TOO SLOW? SPRINKLE IN SOME NOSQLBig Data LDN 2018: BIG DATA TOO SLOW? SPRINKLE IN SOME NOSQL
Big Data LDN 2018: BIG DATA TOO SLOW? SPRINKLE IN SOME NOSQL
 
Scalable Data Management for Kafka and Beyond | Dan Rice, BigID
Scalable Data Management for Kafka and Beyond | Dan Rice, BigIDScalable Data Management for Kafka and Beyond | Dan Rice, BigID
Scalable Data Management for Kafka and Beyond | Dan Rice, BigID
 
Calum McCrea, Software Engineer at Kx Systems, "Kx: How Wall Street Tech can ...
Calum McCrea, Software Engineer at Kx Systems, "Kx: How Wall Street Tech can ...Calum McCrea, Software Engineer at Kx Systems, "Kx: How Wall Street Tech can ...
Calum McCrea, Software Engineer at Kx Systems, "Kx: How Wall Street Tech can ...
 
Webinar: The Modern Streaming Data Stack with Kinetica & StreamSets
Webinar: The Modern Streaming Data Stack with Kinetica & StreamSetsWebinar: The Modern Streaming Data Stack with Kinetica & StreamSets
Webinar: The Modern Streaming Data Stack with Kinetica & StreamSets
 

Andere mochten auch

PayPal Real Time Analytics
PayPal  Real Time AnalyticsPayPal  Real Time Analytics
PayPal Real Time AnalyticsAnil Madan
 
Big Data: It's More Than Volume, Paypal
Big Data: It's More Than Volume, PaypalBig Data: It's More Than Volume, Paypal
Big Data: It's More Than Volume, PaypalInnovation Enterprise
 
Big- Data and Risk Management - Ido Lustig, PayPal
Big- Data and Risk Management - Ido Lustig, PayPalBig- Data and Risk Management - Ido Lustig, PayPal
Big- Data and Risk Management - Ido Lustig, PayPalCodemotion Tel Aviv
 
H2O World - Data Science w/ Big Data in a Corporate Environment - Nachum Shacham
H2O World - Data Science w/ Big Data in a Corporate Environment - Nachum ShachamH2O World - Data Science w/ Big Data in a Corporate Environment - Nachum Shacham
H2O World - Data Science w/ Big Data in a Corporate Environment - Nachum ShachamSri Ambati
 
PayPal Behavioral Analytics on Hadoop
PayPal Behavioral Analytics on HadoopPayPal Behavioral Analytics on Hadoop
PayPal Behavioral Analytics on HadoopDataWorks Summit
 
PayPal Big Data and MySQL Cluster
PayPal Big Data and MySQL ClusterPayPal Big Data and MySQL Cluster
PayPal Big Data and MySQL ClusterMat Keep
 
Hadoop Graph Processing with Apache Giraph
Hadoop Graph Processing with Apache GiraphHadoop Graph Processing with Apache Giraph
Hadoop Graph Processing with Apache GiraphDataWorks Summit
 
eCommerce and ePayments markets in Russia : trends , analytics , perspect...
eCommerce and  ePayments markets in  Russia :  trends ,  analytics , perspect...eCommerce and  ePayments markets in  Russia :  trends ,  analytics , perspect...
eCommerce and ePayments markets in Russia : trends , analytics , perspect...Data Insight
 
Paymetrics Deck - Seed Round
Paymetrics Deck - Seed RoundPaymetrics Deck - Seed Round
Paymetrics Deck - Seed RoundShannon Sofield
 
PayPal: A case study
PayPal: A case studyPayPal: A case study
PayPal: A case studyKimberly Teo
 
Big Data Analytics with Hadoop
Big Data Analytics with HadoopBig Data Analytics with Hadoop
Big Data Analytics with HadoopPhilippe Julio
 

Andere mochten auch (11)

PayPal Real Time Analytics
PayPal  Real Time AnalyticsPayPal  Real Time Analytics
PayPal Real Time Analytics
 
Big Data: It's More Than Volume, Paypal
Big Data: It's More Than Volume, PaypalBig Data: It's More Than Volume, Paypal
Big Data: It's More Than Volume, Paypal
 
Big- Data and Risk Management - Ido Lustig, PayPal
Big- Data and Risk Management - Ido Lustig, PayPalBig- Data and Risk Management - Ido Lustig, PayPal
Big- Data and Risk Management - Ido Lustig, PayPal
 
H2O World - Data Science w/ Big Data in a Corporate Environment - Nachum Shacham
H2O World - Data Science w/ Big Data in a Corporate Environment - Nachum ShachamH2O World - Data Science w/ Big Data in a Corporate Environment - Nachum Shacham
H2O World - Data Science w/ Big Data in a Corporate Environment - Nachum Shacham
 
PayPal Behavioral Analytics on Hadoop
PayPal Behavioral Analytics on HadoopPayPal Behavioral Analytics on Hadoop
PayPal Behavioral Analytics on Hadoop
 
PayPal Big Data and MySQL Cluster
PayPal Big Data and MySQL ClusterPayPal Big Data and MySQL Cluster
PayPal Big Data and MySQL Cluster
 
Hadoop Graph Processing with Apache Giraph
Hadoop Graph Processing with Apache GiraphHadoop Graph Processing with Apache Giraph
Hadoop Graph Processing with Apache Giraph
 
eCommerce and ePayments markets in Russia : trends , analytics , perspect...
eCommerce and  ePayments markets in  Russia :  trends ,  analytics , perspect...eCommerce and  ePayments markets in  Russia :  trends ,  analytics , perspect...
eCommerce and ePayments markets in Russia : trends , analytics , perspect...
 
Paymetrics Deck - Seed Round
Paymetrics Deck - Seed RoundPaymetrics Deck - Seed Round
Paymetrics Deck - Seed Round
 
PayPal: A case study
PayPal: A case studyPayPal: A case study
PayPal: A case study
 
Big Data Analytics with Hadoop
Big Data Analytics with HadoopBig Data Analytics with Hadoop
Big Data Analytics with Hadoop
 

Ähnlich wie Big data – can it deliver speed and accuracy v1

Data Care, Feeding, and Maintenance
Data Care, Feeding, and MaintenanceData Care, Feeding, and Maintenance
Data Care, Feeding, and MaintenanceMercedes Coyle
 
Batch Processing vs Stream Processing Difference
Batch Processing vs Stream Processing DifferenceBatch Processing vs Stream Processing Difference
Batch Processing vs Stream Processing Differencejeetendra mandal
 
Real Time Business Platform by Ivan Novick from Pivotal
Real Time Business Platform by Ivan Novick from PivotalReal Time Business Platform by Ivan Novick from Pivotal
Real Time Business Platform by Ivan Novick from PivotalVMware Tanzu Korea
 
Industrial Data Science
Industrial Data ScienceIndustrial Data Science
Industrial Data ScienceNiko Vuokko
 
Colorado Springs Open Source Hadoop/MySQL
Colorado Springs Open Source Hadoop/MySQL Colorado Springs Open Source Hadoop/MySQL
Colorado Springs Open Source Hadoop/MySQL David Smelker
 
Engineering Machine Learning Data Pipelines Series: Streaming New Data as It ...
Engineering Machine Learning Data Pipelines Series: Streaming New Data as It ...Engineering Machine Learning Data Pipelines Series: Streaming New Data as It ...
Engineering Machine Learning Data Pipelines Series: Streaming New Data as It ...Precisely
 
Yaron Haviv, Iguaz.io - OpenStack and BigData - OpenStack Israel 2015
Yaron Haviv, Iguaz.io - OpenStack and BigData - OpenStack Israel 2015Yaron Haviv, Iguaz.io - OpenStack and BigData - OpenStack Israel 2015
Yaron Haviv, Iguaz.io - OpenStack and BigData - OpenStack Israel 2015Cloud Native Day Tel Aviv
 
Data Centric HPC for Numerical Weather Forecasting
Data Centric HPC for Numerical Weather ForecastingData Centric HPC for Numerical Weather Forecasting
Data Centric HPC for Numerical Weather ForecastingJames Arnold Faeldon
 
Harness the power of Data in a Big Data Lake
Harness the power of Data in a Big Data LakeHarness the power of Data in a Big Data Lake
Harness the power of Data in a Big Data LakeSaurabh K. Gupta
 
"An introduction to Kx Technology - a Big Data solution", Kyra Coyne, Data Sc...
"An introduction to Kx Technology - a Big Data solution", Kyra Coyne, Data Sc..."An introduction to Kx Technology - a Big Data solution", Kyra Coyne, Data Sc...
"An introduction to Kx Technology - a Big Data solution", Kyra Coyne, Data Sc...Maya Lumbroso
 
"An introduction to Kx Technology - a Big Data solution", Kyra Coyne, Data Sc...
"An introduction to Kx Technology - a Big Data solution", Kyra Coyne, Data Sc..."An introduction to Kx Technology - a Big Data solution", Kyra Coyne, Data Sc...
"An introduction to Kx Technology - a Big Data solution", Kyra Coyne, Data Sc...Dataconomy Media
 
Real-Time Analytics with MemSQL and Spark
Real-Time Analytics with MemSQL and SparkReal-Time Analytics with MemSQL and Spark
Real-Time Analytics with MemSQL and SparkSingleStore
 
ClickHouse in Real Life. Case Studies and Best Practices, by Alexander Zaitsev
ClickHouse in Real Life. Case Studies and Best Practices, by Alexander ZaitsevClickHouse in Real Life. Case Studies and Best Practices, by Alexander Zaitsev
ClickHouse in Real Life. Case Studies and Best Practices, by Alexander ZaitsevAltinity Ltd
 
How to Use Big Data to Transform IT Operations
How to Use Big Data to Transform IT OperationsHow to Use Big Data to Transform IT Operations
How to Use Big Data to Transform IT OperationsExtraHop Networks
 
Big data analytics and machine intelligence v5.0
Big data analytics and machine intelligence   v5.0Big data analytics and machine intelligence   v5.0
Big data analytics and machine intelligence v5.0Amr Kamel Deklel
 
Accelerating analytics in a new era of data
Accelerating analytics in a new era of dataAccelerating analytics in a new era of data
Accelerating analytics in a new era of dataArnon Shimoni
 
Ledingkart Meetup #4: Data pipeline @ lk
Ledingkart Meetup #4: Data pipeline @ lkLedingkart Meetup #4: Data pipeline @ lk
Ledingkart Meetup #4: Data pipeline @ lkMukesh Singh
 
Data lake-itweekend-sharif university-vahid amiry
Data lake-itweekend-sharif university-vahid amiryData lake-itweekend-sharif university-vahid amiry
Data lake-itweekend-sharif university-vahid amirydatastack
 
Future Grid Overview 2018
Future Grid Overview 2018Future Grid Overview 2018
Future Grid Overview 2018Chris J Law
 

Ähnlich wie Big data – can it deliver speed and accuracy v1 (20)

Data Care, Feeding, and Maintenance
Data Care, Feeding, and MaintenanceData Care, Feeding, and Maintenance
Data Care, Feeding, and Maintenance
 
Batch Processing vs Stream Processing Difference
Batch Processing vs Stream Processing DifferenceBatch Processing vs Stream Processing Difference
Batch Processing vs Stream Processing Difference
 
Real Time Business Platform by Ivan Novick from Pivotal
Real Time Business Platform by Ivan Novick from PivotalReal Time Business Platform by Ivan Novick from Pivotal
Real Time Business Platform by Ivan Novick from Pivotal
 
Industrial Data Science
Industrial Data ScienceIndustrial Data Science
Industrial Data Science
 
Colorado Springs Open Source Hadoop/MySQL
Colorado Springs Open Source Hadoop/MySQL Colorado Springs Open Source Hadoop/MySQL
Colorado Springs Open Source Hadoop/MySQL
 
Engineering Machine Learning Data Pipelines Series: Streaming New Data as It ...
Engineering Machine Learning Data Pipelines Series: Streaming New Data as It ...Engineering Machine Learning Data Pipelines Series: Streaming New Data as It ...
Engineering Machine Learning Data Pipelines Series: Streaming New Data as It ...
 
Yaron Haviv, Iguaz.io - OpenStack and BigData - OpenStack Israel 2015
Yaron Haviv, Iguaz.io - OpenStack and BigData - OpenStack Israel 2015Yaron Haviv, Iguaz.io - OpenStack and BigData - OpenStack Israel 2015
Yaron Haviv, Iguaz.io - OpenStack and BigData - OpenStack Israel 2015
 
Data Centric HPC for Numerical Weather Forecasting
Data Centric HPC for Numerical Weather ForecastingData Centric HPC for Numerical Weather Forecasting
Data Centric HPC for Numerical Weather Forecasting
 
Harness the power of Data in a Big Data Lake
Harness the power of Data in a Big Data LakeHarness the power of Data in a Big Data Lake
Harness the power of Data in a Big Data Lake
 
"An introduction to Kx Technology - a Big Data solution", Kyra Coyne, Data Sc...
"An introduction to Kx Technology - a Big Data solution", Kyra Coyne, Data Sc..."An introduction to Kx Technology - a Big Data solution", Kyra Coyne, Data Sc...
"An introduction to Kx Technology - a Big Data solution", Kyra Coyne, Data Sc...
 
"An introduction to Kx Technology - a Big Data solution", Kyra Coyne, Data Sc...
"An introduction to Kx Technology - a Big Data solution", Kyra Coyne, Data Sc..."An introduction to Kx Technology - a Big Data solution", Kyra Coyne, Data Sc...
"An introduction to Kx Technology - a Big Data solution", Kyra Coyne, Data Sc...
 
Real-Time Analytics with MemSQL and Spark
Real-Time Analytics with MemSQL and SparkReal-Time Analytics with MemSQL and Spark
Real-Time Analytics with MemSQL and Spark
 
ClickHouse in Real Life. Case Studies and Best Practices, by Alexander Zaitsev
ClickHouse in Real Life. Case Studies and Best Practices, by Alexander ZaitsevClickHouse in Real Life. Case Studies and Best Practices, by Alexander Zaitsev
ClickHouse in Real Life. Case Studies and Best Practices, by Alexander Zaitsev
 
Operational-Analytics
Operational-AnalyticsOperational-Analytics
Operational-Analytics
 
How to Use Big Data to Transform IT Operations
How to Use Big Data to Transform IT OperationsHow to Use Big Data to Transform IT Operations
How to Use Big Data to Transform IT Operations
 
Big data analytics and machine intelligence v5.0
Big data analytics and machine intelligence   v5.0Big data analytics and machine intelligence   v5.0
Big data analytics and machine intelligence v5.0
 
Accelerating analytics in a new era of data
Accelerating analytics in a new era of dataAccelerating analytics in a new era of data
Accelerating analytics in a new era of data
 
Ledingkart Meetup #4: Data pipeline @ lk
Ledingkart Meetup #4: Data pipeline @ lkLedingkart Meetup #4: Data pipeline @ lk
Ledingkart Meetup #4: Data pipeline @ lk
 
Data lake-itweekend-sharif university-vahid amiry
Data lake-itweekend-sharif university-vahid amiryData lake-itweekend-sharif university-vahid amiry
Data lake-itweekend-sharif university-vahid amiry
 
Future Grid Overview 2018
Future Grid Overview 2018Future Grid Overview 2018
Future Grid Overview 2018
 

Kürzlich hochgeladen

What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?Antenna Manufacturer Coco
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Scriptwesley chun
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘RTylerCroy
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slidevu2urc
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?Igalia
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsEnterprise Knowledge
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Big data – can it deliver speed and accuracy v1

  • 1. BIG DATA ANALYTICS Can it deliver speed and accuracy? Risk & Compliance Engineering, Paypal Gurinder S. Grewal This deck contains generic architecture information, and does not reflect the exact details of current or planned systems. June 2013
  • 2. ABOUT PAYPAL • 123MM active users • 190 markets, 25 currencies • $300,000 payments processed/minute • 2B+ events/day • 12 TB new data added per day • 500K+ real time queries per second • < 100ms average response time we are talking, a lot of data …big data!
  • 3. WHAT IS BIG DATA? transactions interactions observations petabytes of data diverse analytics variety of data structures Hadoop large number of characteristics large map/reduce cluster Teradata
  • 4. GROWING COMPLEXITY AND EXPECTATIONS Emerging technologies in the modern world are opening up possibilities for sophisticated analytics. Data infrastructure is growing, and so are the expectations: make decisions fast and with higher accuracy! [Chart: fraud sophistication and data complexity (each low to high) plotted against time. Labels: simple rules, black/white lists; linear models, aggregated variables; location, time analysis; inline histories analysis; consistency; networks.]
  • 5. DECISIONS MUST BE QUICK • A gang of cyber-criminals stole $45 million in a matter of hours • More than 36,000 transactions were made worldwide and about $40 million was stolen in 6 hours Source: http://www.huffingtonpost.com/2013/05/09/atm-fraud_n_3248331.html [Chart: business value (low to high) and fraud loss (low to high) against time taken to make a decision. Regions: Prevention, Fast Detection, High Fraud Loss.]
  • 6. DECISIONS MUST BE ACCURATE 11:01AM 11:05AM 11:06AM • Credit card used from three distant locations in a short time Result based on real-time analysis: Block the card, not decided? • According to past purchasing behavior • Card holder lives in US - wife paid bill online from home PC • Card holder’s kid studies in Europe - used card to purchase books • Card holder travels to Japan - paid for lunch Result based on historical analysis: It’s a legitimate usage
  • 7. DO WE HAVE CONFLICTING REQUIREMENTS? speed • analyze data incoming at high velocity in a split second • consume data in a timely manner to make decisions accuracy • utilize powerful analytics techniques (text mining, predictive analysis) • process a large variety and volume of data (details) cost • can’t spend a dollar to save a penny – pick the right tool for the right job
  • 8. TIERED BIG DATA STRATEGY real time e.g. filters near real time e.g. correlations offline e.g. behavioral analysis cost, speed data volume, accuracy effective decision = fn(accuracy, speed, cost) data age: seconds, hours, years Data in-motion Data in-use
  • 9. BIG DATA - COMPUTATION STRATEGY Offline (map-reduce, batch) Offline variables Online variables Near Real-time (complex event processing) Realtime (in-flow processing) • fast, very stringent availability and performance SLAs • computations are simple and eventually accurate • computations are transient, short lived (user sessions) • event-driven, incremental processing • high efficiency and scalability • data for short time windows (hours) • optimized for throughput • computations are slow and accurate • data captured as events for historical analysis
  • 10. Hadoop Technology Stack BIG DATA IN USE - OFFLINE ECOSYSTEM HDFS HBase Map Reduce Framework Data Storage Data Processing Data Integration ETL Flume, Sqoop Programming Languages Pig Hive QL Scheduling, Coordination Zookeeper Oozie UI Framework/SDK Hue Hue SDK Structured Data Unstructured Data MPP DW RDBMS
  • 11. BIG DATA IN MOTION – ONLINE ECOSYSTEM Complex Event Processing correlations filtering aggregations pattern matching In-memory data store Message Bus Offline Decision Service Events stream CEP enables continuous analytics on data in motion • Solution for velocity of big data • Well suited for detection, decisioning, alerting and taking actions • Relies on in-memory data grid for ability to provide low latency Monitoring
  • 12. BIG DATA MOVEMENT Offline Data movement between offline and online is the key and biggest challenge • ETL jobs require custom coding, biggest bottleneck • Data transfer very expensive, slow across networks, multiple data centers • Online data stores are not optimized for parallel or bulk loads Slows down data store during ETL operation Negatively impacts online applications availability Data Cloud Offline
  • 13. BIG DATA MOVEMENT EVOLUTION Offline In-memory data store Offline NoSQL (persistent backing store) In-memory data store Two-tier architecture Data Cloud Data Cloud Initial state • 500GB in 16 hours Optimization – Phase 1 • 2 TB in 16 hours • Split data files prepared offline • Maximize data load parallelism • Maximum data compression • Optimize data format • Validation before data movement Scale – Phase 2 • 10 TB in 6 hours • Add persistent NoSQL behind in-memory store • Blast bulk load into NoSQL store • Batch process will warm the cache • Lazy warm-up as needed, while serving r/w • Refresh cache contents via time based evictions Batch Multi-tier architecture
  • 14. Confidential and Proprietary USE CASE: GRAPH BASED DECISIONING Map/Reduce Graph builder In-memory graph store Online Graph Server Daily incremental updates Continuous graph updates and rollup • Generate graph and associated complex variables on Hadoop on a daily basis • Move the incremental changes to the online in-memory graph store • Based on the event stream, keep graph and offline variables up-to-date • In-memory store provides fast read-only access to Decision services Decision Service Avg. read time: 2ms 95th percentile: 6ms Events stream offline online
  • 15. CONCLUSION • Hadoop is best for offline processing of variety and volume data – not for real time • CEP is a solution for online, big data in motion (velocity), complements Hadoop • Harness the true power of big data by combining offline and online data • Data integration is key – careful planning and optimization is needed • Online data stores are not optimized for highly parallel writes, bulk loads • Big data can solve complex problems while delivering speed and accuracy
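
The card example on slide 6 and the conclusion's point about combining offline and online data can be illustrated with a small sketch. It is not the deck's implementation: the class `OnlineWindow`, the functions `realtime_only_decision` and `combined_decision`, the 10-minute window and the location threshold are all hypothetical names and values chosen for illustration. The in-memory window stands in for the online (data in motion) side, and a plain dictionary of known locations stands in for variables computed offline.

```python
from collections import defaultdict, deque


class OnlineWindow:
    """In-memory sliding window of (timestamp, location) events per card.

    Hypothetical stand-in for the online, data-in-motion tier.
    """

    def __init__(self, window_seconds=600):
        self.window_seconds = window_seconds
        self.events = defaultdict(deque)

    def add(self, card_id, timestamp, location):
        queue = self.events[card_id]
        queue.append((timestamp, location))
        # Evict events that have fallen out of the time window.
        while queue and queue[0][0] <= timestamp - self.window_seconds:
            queue.popleft()

    def recent_locations(self, card_id):
        return {location for _, location in self.events[card_id]}


def realtime_only_decision(card_id, window, max_locations=2):
    """Decide using only the short real-time window."""
    recent = window.recent_locations(card_id)
    return "block" if len(recent) > max_locations else "allow"


def combined_decision(card_id, window, offline_profile, max_locations=2):
    """Decide using the real-time window plus offline (historical) variables.

    offline_profile maps card_id to the set of locations that historical
    analysis already associates with the card holder (an assumption here).
    """
    recent = window.recent_locations(card_id)
    if len(recent) <= max_locations:
        return "allow"
    unexplained = recent - offline_profile.get(card_id, set())
    return "block" if unexplained else "allow"


# Events at 11:01, 11:05 and 11:06, expressed as seconds after 11:01.
window = OnlineWindow()
for timestamp, location in [(0, "US"), (240, "Europe"), (300, "Japan")]:
    window.add("card-1", timestamp, location)

history = {"card-1": {"US", "Europe", "Japan"}}
print(realtime_only_decision("card-1", window))      # block
print(combined_decision("card-1", window, history))  # allow
```

With the window alone, three locations in five minutes look like fraud. Adding the historical profile explains all three, which mirrors the slide's "legit usage" result.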

Editor's notes

  1. Key message: large user base, large number of lookups, stringent SLAs, big data …
  2. What is big data? You will get a variety of answers depending on who you ask. Answers would include Hadoop, a large processing cluster, large data size, variety of data, velocity of data, etc. What really matters is the insight it provides for making effective decisions through a rich set of characteristics. The richness of the characteristics affects the quality of the decision. The power of big data is realized when the insight it provides is transformed effectively into business value.
  3. Key message: There is much talk about the three dimensions of big data: velocity, variety and volume. In reality, these dimensions are a source of conflicting requirements. For example, in a stock trading system, price ticks change in a matter of milliseconds (velocity). The tick data needs to be analyzed and consumed quickly to make critical trading decisions. A small delay can alter the outcome (accuracy) and cause monetary damage. To make accurate decisions, the system should use both historical patterns and the current pattern. For example, customers who buy diapers also buy formula milk, so you want to recommend formula milk to every customer who puts diapers in the shopping cart. If a customer bought formula milk a couple of hours ago, apply further filters in real time and recommend something else, e.g. shampoo to a female customer or beer to a male customer. So the combination of historical and real-time analysis delivers the best possible decisions.
  4. Key message: The very first step is to lay out a data strategy to use big data effectively. Data analysis in the offline environment always lags behind for two reasons:
     - latency of data movement from the production (transactional) system to the analytics system
     - offline analysis is performed in batch mode, so at any given time there is a set of data not yet consumed by the offline system
     That means we need to deal with two states of the data.
     1. Data in Motion (online environment) has the following characteristics:
     - data volumes are relatively smaller than in the offline environment
     - main source of transactional data, user interactions and behavioral data, hence mainly structured (generic data models for flexibility)
     - data is transient and can be recreated at any time on the offline system using simulations
     - data technologies for online systems are optimized for speed and complex to maintain, so cost is high; able to handle terabytes of data
     2. Data in Use (offline environment) has the following characteristics:
     - data volumes are very large
     - large variety of data in different formats (structured, semi-structured, raw) from diverse data sources
     - master copy of the data, generally appended to in order to preserve a trail of changes for historical analysis
     - data technologies for offline systems are optimized for throughput, can handle more than petabytes of data, have a simple architecture, and can be deployed on commodity hardware
  5. You need to merge offline and online data for a comprehensive view. Hence, data integration is key. The single most important factor to consider in keeping data integration cost low: “minimize the data bits that need to be moved”.
  6. Online data technologies are not optimized for parallel/bulk data loads. A 1 TB data load can take up to 15 hours. Increase efficiency: use a less verbose data format, separate metadata from content, use compression aggressively, and partition data for parallel data load. Reduce cost: validate data before moving it to preserve data link bandwidth, and minimize resource usage by moving only changes to the data. Increase reliability: throttle data load speed to minimize degradation of online systems, and be sensitive to peak load time.
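
Three of the measures in note 6 (move only the changes, partition for parallel load, compress aggressively) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the pipeline described in the deck: the function names, the JSON-plus-gzip format, the CRC32 partitioning and the default of four partitions are all hypothetical choices, and deletions, validation and throttling are left out.

```python
import gzip
import json
import zlib


def changed_records(previous, current):
    """Return only the records that are new or whose value changed."""
    return {key: value for key, value in current.items()
            if previous.get(key) != value}


def partition(records, num_partitions):
    """Split records into buckets by a stable hash of the key,
    so each bucket can be loaded by a separate worker."""
    buckets = [dict() for _ in range(num_partitions)]
    for key, value in records.items():
        index = zlib.crc32(key.encode("utf-8")) % num_partitions
        buckets[index][key] = value
    return buckets


def compress_partition(bucket):
    """Serialize compactly (no extra whitespace) and gzip the result."""
    payload = json.dumps(bucket, sort_keys=True, separators=(",", ":"))
    return gzip.compress(payload.encode("utf-8"))


def prepare_load(previous, current, num_partitions=4):
    """Build compressed, non-empty partitions containing only the delta."""
    delta = changed_records(previous, current)
    return [compress_partition(bucket)
            for bucket in partition(delta, num_partitions) if bucket]


# Usage: only "b" (changed) and "c" (new) are prepared for transfer.
yesterday = {"a": 1, "b": 2}
today = {"a": 1, "b": 3, "c": 4}
blobs = prepare_load(yesterday, today)

restored = {}
for blob in blobs:
    restored.update(json.loads(gzip.decompress(blob)))
print(restored == {"b": 3, "c": 4})  # True
```

Hashing on the key keeps a given record in the same partition across runs, which is one simple way to let parallel loaders work without coordinating with each other.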