© 2013 eXelate Inc. Confidential and Proprietary. #bdx2013
From Big Data to Smart Data
A journey into the eXelate cloud
Motty Cohen, Chief Architect, eXelate
eXelate is the smart data company that powers smarter digital marketing decisions worldwide
Diagram labels: Advertiser 1st Party Data; Data Providers; Offline Data; Online Data; Media Platforms; Modeling; Scoring; Segmentation; Analytics; Distribution; Marketing; Data Exchange Platform
Smart Data: Accurate & actionable audience segmentation
• Demographic
  • Age: 40-55
  • Urbanicity: Suburban
  • Income: High
  • Education: Graduate Plus
  • Employment: Management
• Interest
  • Sport
  • Travels
  • Wines
  • Gadgets
• Intent
  • Travel to Barcelona
  • 4-star resort
Our journey begins in the browser
Inside eXelate Cloud: Real-time Serving & Smart data delivery
Get Event Info -> Add History Data -> Apply Rules & Models -> Sell to buyers
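The four serving steps can be sketched roughly as follows. This is a minimal illustration, not eXelate's implementation: the `UserProfile` record, the in-memory `PROFILE_STORE`, the `RULES` table, and the buyer subscriptions are all hypothetical names introduced for the example.

```python
from dataclasses import dataclass, field


@dataclass
class UserProfile:
    """Hypothetical per-user record: past events plus assigned segments."""
    user_id: str
    history: list = field(default_factory=list)
    segments: set = field(default_factory=set)


# Hypothetical stand-in for the user store; a real deployment would use a
# distributed key-value store rather than a dict.
PROFILE_STORE = {}

# Hypothetical rules: segment name -> predicate over the event history.
RULES = {
    "interest:travel": lambda history: any(e["category"] == "travel" for e in history),
    "interest:wine": lambda history: any(e["category"] == "wine" for e in history),
}


def handle_event(user_id, event, buyers):
    """Run the four steps for a single browser event."""
    # 1. Get event info (here the event arrives already parsed).
    # 2. Add history data: load the stored profile and append the event.
    profile = PROFILE_STORE.setdefault(user_id, UserProfile(user_id))
    profile.history.append(event)
    # 3. Apply rules and models to update the user's segments.
    for segment, predicate in RULES.items():
        if predicate(profile.history):
            profile.segments.add(segment)
    # 4. Sell to buyers: each buyer receives only the segments it subscribes to.
    return {buyer: sorted(profile.segments & wanted) for buyer, wanted in buyers.items()}
```

Calling `handle_event` repeatedly for the same user accumulates history, so later events can unlock segments that a single event would not.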
Challenges
Big Data:
• Relevancy
• Access Time
• On demand Analytics
Challenge 1: Relevancy
Grabbing the relevant audience on site, on time
Generating Models
• Data Mining
• Analytics
• Create Models
Diagram labels: Netezza tables; Running Analytics on Amazon; Java Packages; Model (x3)
Real time segmentation: Running rules and models
• Basic Rules
• Association Rules
• Analytic Models
Can we run all these within the limited time frame?
Continuous Incremental Segmentation
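The slide gives no detail on how the segmentation is made incremental. One plausible reading, sketched here purely as an assumption, is that each request evaluates only a bounded number of rules and the next request for the same user resumes where the previous one stopped. The class name, the per-request `budget`, and the rule format are hypothetical.

```python
from collections import deque


class IncrementalSegmenter:
    """Evaluates at most `budget` rules per request and resumes on the next one."""

    def __init__(self, rules, budget):
        self.rules = rules      # list of (segment_name, predicate) pairs
        self.budget = budget    # maximum number of rules evaluated per request
        self.pending = {}       # user_id -> deque of rule indices still to run
        self.segments = {}      # user_id -> segments found so far

    def on_request(self, user_id, profile):
        queue = self.pending.get(user_id)
        if not queue:
            # No unfinished pass for this user: start a new pass over all rules.
            queue = deque(range(len(self.rules)))
            self.pending[user_id] = queue
        found = self.segments.setdefault(user_id, set())
        # Spend this request's budget, then stop; the rest waits for the next request.
        for _ in range(min(self.budget, len(queue))):
            name, predicate = self.rules[queue.popleft()]
            if predicate(profile):
                found.add(name)
        return set(found)
```

In this sketch segments only accumulate; expiring stale segments, or bounding work by wall-clock time instead of a rule count, would be natural extensions.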
Challenge 2: Fast access to distributed big storage
User Object
• User Info
  • Segments, Delivery info, Intermediate results
  • Object size: tens to hundreds of KB (x10 KB ~ x100 KB)
  • ~ 850M unique users (UU)
• Access time
  • Read / Write within a few ms
• Availability
  • For any machine in the cluster
  • For any cluster in every data center
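A back-of-the-envelope calculation shows why this is a storage problem. The sketch below multiplies the slide's user count by an assumed object size; taking 10 KB and 100 KB as the low and high ends, using decimal units, and the `replicas` parameter are all assumptions made for illustration.

```python
UNIQUE_USERS = 850_000_000   # ~850M unique users, from the slide
KB = 1_000                   # decimal kilobyte (assumption)


def total_storage_tb(object_size_kb, users=UNIQUE_USERS, replicas=1):
    """Raw storage in terabytes for `users` objects of `object_size_kb` each."""
    return object_size_kb * KB * users * replicas / 1e12


low_tb = total_storage_tb(10)     # one copy at 10 KB per user
high_tb = total_storage_tb(100)   # one copy at 100 KB per user
```

Under these assumptions a single copy of the data lands somewhere between roughly 8.5 TB and 85 TB, before any replication across clusters or data centers.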
Aerospike: Frontend storage for fast access
• XDR: Cross Data Center Replication
• Optimized for SSD, Indexed in RAM
• Smart Eviction Policy
• Fast read/writes: 500K+ TPS
• Key-value NoSQL distributed DB
Diagram labels: Aerospike Cluster; Serving Cluster
Replicated storage across data centers
Aerospike XDR: Cross Datacenter Replication
• US WEST: CA
• US CENTRAL: TX
• EUROPE: NL
• US EAST: NY
Challenge 3: On demand analytics
Show me the data, now!
optiX: Interactive data analytics
On Demand Calculation
Elasticsearch: Using a search engine for counting
Diagram labels: Data Center; Netezza DWH; Aggregator; ES Cluster (30 Nodes); Reporter; S3; Loader; optiX; REST; FTP
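The idea of counting with a search engine can be illustrated with a toy inverted index: each segment maps to the set of users carrying it, and an audience count becomes the size of an intersection. This is a conceptual sketch only; it does not use Elasticsearch or reflect its internals, and `SegmentIndex` and the segment names are hypothetical.

```python
from collections import defaultdict


class SegmentIndex:
    """Toy inverted index: segment name -> set of user ids."""

    def __init__(self):
        self.postings = defaultdict(set)

    def index(self, user_id, segments):
        """Record that `user_id` belongs to each of `segments`."""
        for segment in segments:
            self.postings[segment].add(user_id)

    def count(self, *segments):
        """Number of users that belong to ALL of the given segments."""
        if not segments:
            return 0
        # Intersect the smallest posting lists first to keep intermediate sets small.
        sets = sorted((self.postings.get(s, set()) for s in segments), key=len)
        return len(set.intersection(*sets))
```

For example, after indexing users with their segments, `index.count("age:40-55", "interest:wine")` answers "how many users are in both segments?" without scanning user records.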
What have we seen so far?
• Data relevancy
  • Real-time scoring
  • Parallel processing
  • Split processing over time
• Big data access time
  • Front end, Replicated, Aerospike cluster
• On-demand analytics
  • Change your schema to optimize query time
  • Move processing from querying to loading phase
  • Trade off: Space + Processing -> Performance
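The trade-off in the on-demand analytics bullets can be sketched as follows: counts for single segments and segment pairs are computed once, while each user record is loaded, so that a query becomes a dictionary lookup. This is an illustration under assumptions; `PrecomputedCounts` and the choice to precompute pairs are hypothetical and not taken from the talk.

```python
from collections import Counter
from itertools import combinations


class PrecomputedCounts:
    """Pays space and load-time work to make count queries O(1) lookups."""

    def __init__(self):
        self.single = Counter()   # segment -> number of users
        self.pairs = Counter()    # (segment_a, segment_b) -> number of users in both

    def load(self, segments):
        """Loading phase: update every aggregate this user contributes to."""
        unique = sorted(set(segments))
        self.single.update(unique)
        # Pair keys are stored in sorted order so lookups are order-independent.
        self.pairs.update(combinations(unique, 2))

    def count(self, a, b=None):
        """Query phase: a lookup, with no scan over user records."""
        if b is None:
            return self.single[a]
        return self.pairs[tuple(sorted((a, b)))]
```

The cost is visible in `load`: a user with n segments adds n(n-1)/2 pair entries, which is the extra space and processing spent up front in exchange for query performance.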
Thank You
Questions?

HTML Injection Attacks: Impact and Mitigation Strategies
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
Advantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessAdvantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your Business
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 

Big data exposed from big data to smart data

  • 1. 1© 2013 eXelate Inc. Confidential and Proprietary. #bdx2013 From Big data to Smart data A journey into the eXelate cloud Motty Cohen, Chief Architect, eXelate
  • 2. eXelate is the smart data company that powers smarter digital marketing decisions worldwide Advertiser 1st Party Data Data Providers Offline Data Online Data Media Platforms Modeling Scoring Segmentation Analytics Distribution Marketing Data Exchange Platform
  • 3. • Demographic • Age: 40-55 • Urbanicity: Suburban • Income: High • Education: Graduate Plus • Employment: Management • Interest • Sport • Travels • Wines • Gadgets • Intent • Travel to Barcelona • 4-star resort Smart Data: Accurate & actionable audience segmentation
  • 4. Our journey begins in the browser
  • 5. Inside eXelate Cloud: Real-time Serving & Smart data delivery Get Event Info Add History Data Apply Rules & Models Sell to buyers
  • 6. Challenges Big Data Relevancy Access Time On demand Analytics
  • 7.
  • 8. Challenge 1: Relevancy Grabbing the relevant audience on site, on time
  • 9. Generating Models Model Model Model Data Mining Analytics Create Models Netezza tables Running Analytics on Amazon Java Packages
  • 10. Real time segmentation: Running rules and models Basic Rules Association Rules Analytic Models Model Model Model Can we run all these within the limited time frame?
  • 11. Continuous Incremental Segmentation Continuous Incremental Segmentation
  • 12. Challenge 2: Fast access to distributed big storage
  • 13. User Object • User Info • Segments, Delivery info, Intermediate results • Object Size: x10 KB ~ x100 KB • ~ 850M UU • Access time • Read / Write within a few ms • Availability • For any machine in the cluster • For any cluster in every data center
  • 14. Aerospike: Frontend storage for fast access Aerospike Cluster Serving Cluster XDR: Cross Data Center Replication Optimized for SSD, Indexed in RAM Smart Eviction Policy Fast read/writes: 500K+ TPS Key-value NoSQL distributed DB
  • 15. Replicated storage across data centers US WEST CA US CENTRAL TX EUROPE NL US EAST NY Aerospike XDR: Cross Datacenter Replication
  • 16. Challenge 3: On demand analytics Show me the data, Now!
  • 17. optiX: Interactive data analytics On Demand Calculation
  • 18. optiX: Interactive data analytics On Demand Calculation
  • 19. Data Center Elastic Search: Using search engine for counting. Netezza DWH Aggregator ES Cluster (30 Nodes) Reporter S3 Loader optiX REST FTP
  • 20. What did we have so far? • Data relevancy • Real-time scoring • Parallel processing • Split processing over time • Big data access time • Front end, Replicated, Aerospike cluster • On-demand analytics • Change your schema to optimize query time • Move processing from querying to loading phase • Trade off: Space + Processing -> Performance
  • 21. Thank You Questions?

Editor's notes

  1. In this session I’ll take the audience on a short trip into eXelate’s cloud, where I’ll present three big-data challenges and how we addressed them.
  2. The digital marketing industry is a huge ecosystem: publishers, data providers, data management platforms, ad networks, marketing agencies and marketers. Data is the fuel driving this industry. eXelate receives raw data online from publishers’ sites, enhances it with data from various offline providers, runs a set of deterministic rules and analytical models to mark users browsing the internet with specific attributes, and sells it online to marketers (via agencies and DMPs, i.e. data management platforms).
  3. The basic business entity we sell is a segment. A segment is a group of individuals (part of a larger population) who share similar attributes. Different segments have different attributes, and each may define a target audience of interest to marketers. Marketers are always looking for the most relevant target audience in order to lift their sales. We distinguish three major categories of segments. The first is demographic segments, which cover characteristics such as age, gender, income, education and employment level. These characteristics are fairly static and rarely change. The second category covers behavioral aspects such as domains of interest (sport, wines, travel, shopping, etc.). The third category is intent. As the name suggests, it implies that the person browsing the internet has an immediate intent (such as purchasing a specific item). These segments are the most relevant for marketers, since an advertisement related to the user’s intent will most likely be the most efficient from the marketer’s point of view and the most relevant for the user. eXelate “marks” users browsing the publishers’ sites with segments and sells them online to marketers / DMPs so they can direct their ads to the targeted audience.
  4. Our journey starts in the browser. A user browses to the site of one of our publisher partners, such as HomeAway, Kayak or Pronto. The site includes a tag (pixel) that references eXelate’s URL. In most cases the publisher adds details to the URL (e.g. publisher ID, some details it knows about the user, tags representing user activity on the site), and the data is processed by eXelate. This is the place to mention that eXelate is very sensitive to privacy issues. eXelate does not keep any attribute that identifies the user. The user is represented as an anonymous entity that belongs to some groups (segments). We are not interested in a specific user’s details, only in the fact that he is part of a large group of individuals with similar attributes.
  5. What happens inside the cloud? The request is processed within 200 ms to generate a response: a redirect to up to 5 buyers’ URLs. How do we identify those buyers? First we process the individual event information by extracting parameters from the URL and the user’s browser agent. From the cookie we can tell whether we have previous information on the user or need to generate a new entity to represent him. We add historical data and apply a set of rules and analytic models to the data to generate the segments. After the user is “marked” with segments, we run a buyer-matching process to select the 5 most relevant buyers and include their URLs in the redirect response. We handle 5 billion events, generating 27 GB of data, every day. We have a total of 850 million unique users at any given moment, and their data spans 14 TB of storage. We have more than 500,000 rules and 20,000 segments, which we generate and sell to more than 100 media platforms.
  6. So we do all of the above within a short 200 ms window. We have three challenges we would like to share. Relevancy: how do we identify the data most relevant to the buyer? Access time: how do we access a single user’s info in the shortest time? On-demand analytics: how do we process huge data sets on demand to provide meaningful insights to the marketer?
  7. We process 5 billion events per day, which produce a lot of data, but a lot of data = a lot of noise. Smart data is the signal, the actionable data, that we are looking for. In the previous slide, intent combined with user characteristics (segments) is a good example of smart data. A well-targeted ad will be the most relevant for this user. More than that, it will be most relevant now, not tomorrow and not even in the next hour.
  8. The problem is that our data set is not static: millions of users are browsing the web and performing many actions every hour, and we can learn new things about them. The goal is to mark as many users as we can with segments, to cover most of the target audience. In the classical approach, we would take a snapshot of the data once a day or every few hours, run the rules and analytical models to score the users, and generate the groups of segments. The problem is that we will not meet most of the users again: a user can be active for the next minutes or hours, or maybe again after a week. After a month it is most likely that we will not see him again (actually, we will see him identified as another user). The relevancy of the data drops rapidly. We therefore need to perform scoring and segmentation in real time, since the actionable data should be available to the advertiser in real time.
  9. The first step is to generate the analytical models. Our data scientists extract the data from the database; we use Netezza as our data warehouse. This data is event-centric and represents all the events generated by users over the last 90 days. The data scientists build the models using R and run them on the Amazon cloud. After validating the models, they implement them as Java packages, which we deploy to our eXtream cluster to perform scoring and generate models in real time.
  10. Inside the cluster, we run the following sequence on every event (URL call). We run basic rules (defined in an XML document) to generate demographic segments. Once we have the event info and user history, we run association rules, which are deterministic rules that can be implemented by pattern matching. We use JBoss Drools, which implements the Rete algorithm and is very efficient for this purpose. Finally, we apply the analytical models, which do the scoring and segmentation based on more advanced algorithms such as linear regression and collaborative filtering. In the near future, we will add real-time learning to generate the analytical models on the fly. We have over 500,000 rules and quite complex algorithms, and the numbers are still growing. Can we do all that within 200 ms?
  11. Our solution is Continuous Incremental Segmentation. We separated the serving layer from the rules computation (segmentation) layer. Each layer runs on dedicated hardware best suited to its role. Communication between the layers is implemented using 0MQ (ZeroMQ), a very fast messaging infrastructure. A single message, a request for segmentation, is sent from the serving cluster to the segmentation cluster. Models and rules run in parallel to generate segments, and results are sent asynchronously to the calling process. Each rule and model sends its results independently of the others. The serving cluster collects all the results that arrive within a specific time frame and builds the response. Results that do not arrive within this time frame are not included in the response. The segments and intermediate results are stored in the user storage. Why is it continuous? Because the same process is performed again for the same user on his next action on the page (or even on another page) that results in a call to eXelate serving. Why is it incremental? Because the next run executes only those models and rules that were not included in the previous response. Eventually, the process spans several iterations (calls) to generate as many segments as possible.
  12. We saw earlier that we process over 5 billion events per day, and we need fast access (both read and write) to a storage layer hosting the users’ information. Not only that, this data must be available to any machine in the cluster, and even to machines in another data center.
  13. What are the requirements for the user storage? For each user we save the segments, delivery information (to whom and when we delivered this user’s info as part of a segment) and intermediate results. A user object’s size may vary from a few dozen KB to a few hundred KB. We hold 850 million unique users at any given moment. We need fast access (read and write) within a few milliseconds. We need the user object to be available to every machine in the cluster and across data centers. At the time, we examined a few other NoSQL databases, Couchbase and MongoDB. Eventually we selected Aerospike.
  14. Aerospike is a key-value NoSQL DB that: supports billions of objects and works well in a cluster; sustains above 500K TPS (we get 1-2 ms access time); has a smart data eviction policy; is optimized for SSD and indexed in RAM; partitions objects by namespaces; supports cross data center replication.
  15. We have a 9-node Aerospike cluster in each of our 4 serving data centers. Aerospike replicates data across data centers within minutes. Some of the Aerospike challenges we faced: It was (and still is) a cutting-edge technology. We encountered some instabilities and bugs, but we were effectively their beta site and they provided very good support. Version 3 will be released soon, and it looks like the product is becoming more stable and mature. Management and monitoring capabilities are very basic compared with other products. The install base and ecosystem are small, resulting in a small knowledge base. It requires specific hardware (SSD), which is not yet commodity hardware.
  16. The background processes generate a lot of valuable data that can provide meaningful insights to another kind of client. This would be a marketing manager, short on time, working on several campaigns and needing to make marketing decisions quickly. For these customers we provide optiX.
  17. optiX is an interactive application that helps marketers understand the market and gain insights about the audience they are targeting. optiX provides this information based on the huge data cloud in our data warehouse. In this sample we can see a screen providing on-demand calculation of the size of the segments selected by the user. These are not simple counts; they include aggregation and de-duplication on the fly.
  18. Here is a more complicated screen with even more numbers to calculate on the fly. We can’t compute the numbers ahead of time, since the calculation depends on the user’s parameter selection. The challenge is to calculate them fast on big data sets. This could be a problem easily solved by a well-indexed relational database; the problem is that RDBMSs are not capable of processing this amount of data quickly and efficiently.
  19. We selected a solution based on a search engine. A search engine is optimized to count word instances across a large set of documents. This is very similar to our use case: in the on-demand queries, we are not interested in the data itself but in the number of instances (users) that share the same segments. We selected Elasticsearch for this task. Elasticsearch is an open-source, fast search engine based on Apache Lucene. We built a 30-node Elasticsearch cluster on the Amazon cloud. The aggregator collects the data from the Netezza data warehouse and reorganizes it into a structure optimized for our search (the data in Netezza is event-centric, while we need it to be user-centric in Elasticsearch). The aggregator generates data files and loads them to Amazon S3. Another process, the loader, loads them into Elasticsearch, where the data is indexed. When a query is issued from the optiX application server, the reporter machine translates it into a set of queries that run on the ES cluster. ES runs the queries in parallel and aggregates the results (Hadoop-style), and the result is returned to the application server in 1 second.
  20. Data relevancy: real-time scoring; parallel processing; split processing over time. Big data access time: front-end, replicated Aerospike cluster. On-demand analytics: change your schema to optimize query time; move processing from the querying phase to the loading phase; trade-off: space + processing -> performance.
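
Notes 4 and 5 describe extracting event parameters from the pixel URL and using the cookie to decide whether the user is already known. A minimal Python sketch of that step follows. The URL, the query parameter names (`pid`, `tags`) and the cookie name (`uid`) are hypothetical and chosen only for illustration; the actual eXelate request format is not given in the notes.

```python
import uuid
from urllib.parse import urlsplit, parse_qs

def parse_event(url, cookies, user_agent):
    """Turn one pixel call into an event record (hypothetical field names)."""
    query = parse_qs(urlsplit(url).query)
    # parse_qs returns a list per key; keep the first value of each parameter.
    params = {key: values[0] for key, values in query.items()}

    # A known user carries an anonymous id in the cookie; otherwise create one.
    user_id = cookies.get("uid")
    is_new_user = user_id is None
    if is_new_user:
        user_id = uuid.uuid4().hex  # random id, not derived from personal data

    return {
        "user_id": user_id,
        "is_new_user": is_new_user,
        "publisher_id": params.get("pid"),
        "tags": [t for t in params.get("tags", "").split(",") if t],
        "user_agent": user_agent,
    }

if __name__ == "__main__":
    event = parse_event(
        "https://pixel.example.com/load?pid=42&tags=travel,barcelona",
        cookies={},
        user_agent="Mozilla/5.0",
    )
    print(event["publisher_id"], event["tags"], event["is_new_user"])
```

The record produced here would then be joined with the stored user history before rules and models run, as note 5 outlines.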
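
The Continuous Incremental Segmentation idea in note 11 can be sketched in a few lines of Python. This is an illustration under stated assumptions, not eXelate's implementation: an in-process thread pool stands in for the separate segmentation cluster reached over 0MQ, and the rule names, simulated run times and the layout of the `user` dictionary are all hypothetical.

```python
import time
from concurrent.futures import ThreadPoolExecutor, wait

# Hypothetical rules/models: name -> (simulated run time in seconds, segment produced).
RULES = {
    "basic_rule": (0.0, "age_40_55"),
    "association_rule": (0.0, "interest_travel"),
    "analytic_model": (0.6, "intent_barcelona"),  # too slow for a short deadline
}

def run_rule(name):
    run_time, segment = RULES[name]
    time.sleep(run_time)
    return segment

def new_user():
    # Stand-in for the user object kept in the user storage.
    return {"segments": set(), "included": set(), "pending": {}}

def serve_call(user, pool, deadline_s):
    """Handle one serving call and return the segments that made the deadline."""
    # Incremental: submit only rules not yet included in an earlier response
    # and not already running from a previous call.
    for name in RULES:
        if name not in user["included"] and name not in user["pending"]:
            user["pending"][name] = pool.submit(run_rule, name)

    # Collect whatever finishes within the time frame; the rest keeps running.
    done, _ = wait(list(user["pending"].values()), timeout=deadline_s)

    response = []
    for name, future in list(user["pending"].items()):
        if future in done:
            response.append(future.result())
            user["included"].add(name)
            del user["pending"][name]

    user["segments"].update(response)
    return sorted(response)

if __name__ == "__main__":
    with ThreadPoolExecutor(max_workers=4) as pool:
        user = new_user()
        print(serve_call(user, pool, deadline_s=0.2))  # fast rules only
        print(serve_call(user, pool, deadline_s=2.0))  # the slow model arrives
        print(serve_call(user, pool, deadline_s=0.2))  # nothing left to run
```

In this sketch a late result is not discarded: the unfinished work stays in `pending` and is picked up by the user's next call, which is one possible reading of "intermediate results are stored in the user storage".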
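
Note 19 describes reorganizing event-centric data into user-centric documents at load time so that on-demand queries only need to count users. The plain-Python sketch below illustrates that idea with an inverted index (segment to set of users). It does not use the Elasticsearch API, and the sample events, function names and data layout are hypothetical.

```python
from collections import defaultdict

def to_user_centric(events):
    """Aggregator step: pivot (user_id, segment) events into one document per user."""
    users = defaultdict(set)
    for user_id, segment in events:
        users[user_id].add(segment)  # a set de-duplicates repeated events
    return users

def build_index(user_docs):
    """Loader step: build an inverted index mapping each segment to its users."""
    index = defaultdict(set)
    for user_id, segments in user_docs.items():
        for segment in segments:
            index[segment].add(user_id)
    return index

def count_all(index, segments):
    """Number of distinct users that belong to every selected segment."""
    sets = [index.get(s, set()) for s in segments]
    return len(set.intersection(*sets)) if sets else 0

def count_any(index, segments):
    """Number of distinct users that belong to at least one selected segment."""
    return len(set().union(*(index.get(s, set()) for s in segments)))

if __name__ == "__main__":
    events = [
        ("u1", "sport"), ("u1", "travel"), ("u2", "travel"),
        ("u1", "travel"), ("u3", "sport"), ("u2", "wines"),
    ]
    index = build_index(to_user_centric(events))
    print(count_all(index, ["sport", "travel"]))
    print(count_any(index, ["sport", "travel"]))
```

The expensive work (pivoting and indexing) happens once at load time, so each query reduces to set operations on the index. This is the "move processing from querying to loading" trade-off listed in note 20: more space and load-time processing in exchange for faster queries.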