High Value Approximation
Conquering Big Data @ Scale
September, 2016
Disclaimer: This presentation violates my most basic belief
that thought leadership at a conference should never involve
promoting a product
However: There is nothing else in the market that delivers
high value approximation via leveraging an abstraction layer
based on statistical models
Therefore: This is an extremely interesting concept, which
we see as exceptional thought leadership in the space, so
we have no other option
Apologies in advance.
Data production will be
44 times greater
in 2020 than it was in 2009
Keep Throwing Hardware at The Problem
Bad
30 Minute Query Times
3 Hour Query Times
More?
Bad
Use Approximation
Sampling
10% - 15% Samples Can Be Decent
10% - 15% of the Resources
Good
10 Times Faster Query Times
Also Good
Issues with JOINS
Issues with Missed Outliers
Issues with Cardinality
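The trade-off on the sampling slides can be made concrete with a small sketch. It assumes uniform random row sampling over a synthetic column; the function names and data are hypothetical and are not taken from any product described here.

```python
import random

def miss_probability(sample_rate: float, n_outliers: int) -> float:
    """Chance that a uniform sample contains none of n_outliers rare rows,
    assuming each row is selected independently with probability sample_rate."""
    return (1.0 - sample_rate) ** n_outliers

def sampled_count(rows, predicate, sample_rate, seed=0):
    """Estimate how many rows satisfy predicate by scaling up a uniform sample."""
    rng = random.Random(seed)
    k = int(len(rows) * sample_rate)
    sample = rng.sample(rows, k)
    hits = sum(1 for r in sample if predicate(r))
    return hits / sample_rate

# Synthetic column with values 0..9 repeated (illustrative data only).
rows = [i % 10 for i in range(100_000)]
exact = sum(1 for r in rows if r < 3)
estimate = sampled_count(rows, lambda r: r < 3, 0.10)

print(f"exact={exact}, estimate from 10% sample={estimate:.0f}")
print(f"P(miss a single outlier row) = {miss_probability(0.10, 1):.2f}")
```

Aggregate counts over common values are estimated well from a 10% sample, but a single rare row is absent from that sample about 90% of the time, which is the missed-outlier issue named above.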
There is an Alternative
Limit Use Cases to
Single Table
Low Cardinality
Where Missing Outliers
Is OK
Not the RIGHT Alternative
But 10 Times Faster
Using
10% of the Resources
is Good.
Right?
Using Statistical Models in Lieu of Sampling
100% Introspection of the Data
No Missed Outliers
Statistical Metadata Across Tables
High Value Approximation Across Tables
JOIN Support
Broad Range of Use Cases
Use Cases
Confidential – Do Not Distribute
Trading & Financial Services
- Applicable to trading and risk management
- Create and test trading algorithms in minutes, not weeks
- Generate higher profit with improved algorithms
- Improved assessment of risk
Trend Analysis & BI Integration
- Use investigative analytics approach to identify trends to enable decision making
- Visualize trends and forecast
- Simple integration without new tools
Security
- Applicable to network intrusion detection and troubleshooting
- Query iteratively to determine source
- Reduce remediation time from days to hours/mins
Adtech & Audience Profiling
- Currently only using impression and bidding data to form client profiles
- IAQ allows detailed auction data (i.e. cookie data, IP data, devices, sites) to be used and NOT discarded to form richer profiles for improved campaign planning
Exact queries required at end
Exact queries NEVER required
Exact vs. Approximate
[Figure: time to reach an exact answer, exact queries alone vs. IAQ approximate queries]
Exact queries search through all atomic data, pulling on more resources and requiring more time.
IAQ uses approximate queries to narrow answers until a final exact answer is required. At times, exact answers are NEVER required, eliminating this final step altogether.
Adtech: Audience Profiling
Serious limitations with current technologies
Throw away cookie data, IP data, devices, sites … (very rich information, but impractical to retain)
Keep impressions and what they bid on…
Listen to ad requests: extremely high record counts
Data sets: Impression & Bidding Data, Detailed Auction Data, Client Data
Profile based on as detailed an analysis as possible
The key is to join the underlying datasets, which are simply too big to be used in a practical manner
Adtech: Audience Profiling - Approximate Query
The inclusion of broader and richer data improves adtech results
Retain cookie data, IP data, devices, sites … (now capable of being joined and queried using Approximate Query. Exact answers are never required)
Keep impressions and what they bid on…
Listen to ad requests: extremely high record counts
Data sets: Impression & Bidding Data, Detailed Auction Data, Client Data
A much richer and more valuable profile (that was not possible before) can be created by leveraging detailed data previously disregarded. No exact result required.
IoT: Exploding Data for Many Use Cases
Use cases: Operational Surveillance, Logistics, Demand Forecasting, Cybersecurity, Warranty Analytics, Agricultural Yield Management
- Exponentially more data will be available
- More data sets will be combined for richer signatures
- Some IoT use cases will not require exact answers immediately
- Most will never need exact answers
1% of the Resources
150 to 5900 Times Faster
Extremely High Fidelity Results
Do you deal with Use Cases where
Approximation
Makes Sense?
Do you use BI Tools?
Are you a Data Scientist?
Is this Hard to Believe?
We Understand
BENCHMARK SUMMARY
IAQ Advantage
- Normalized Resources
Considerations in Projection
- Memory
- CPU
- I/O
IAQ Performance Advantage
         Avg     Min     Max
Hive     5950x   1176x   13081x
Impala   2986x   863x    5590x
Spark    160x    10x     1040x
Performance should get better with each release of IAQ
Hive/Impala Test Environments
                   Hive, Impala   IAQ
Nodes:             4              1
RAM (GB) Total:    64             64
Cores Total:       32             32
Proc Speed (MHz):  2400           2394
Test Environment
• 11 Terabyte Dataset
• IAQ: 99 GB on Disk
• Parquet: 2.4 Terabytes (1.2 TB with Replication Factor of 2)
Results on Average:
IAQ 3,300x faster than Hive
IAQ 1,900x faster than Impala
Query Times: Hive, Impala, and IAQ
[Chart: query time in seconds (axis 0 to 10,000) for Hive, Impala, and IAQ on Tests 1 to 18; the per-test values appear in the "Impala vs IAQ" table]
Impala vs IAQ
Test Name  Similarity (0.6.1)  IAQ (s)  Impala (s)  Impala/IAQ  Hive (s)  Hive/IAQ
Test 1     99.97%              0.5      1059.0      1611.8      6384.9    9718.2
Test 2     99.99%              1.5      1542.1      1057.7      7205.5    4942.0
Test 3     100.00%             1.2      2148.0      1695.4      7006.8    5530.3
Test 4     99.07%              1.8      6642.9      3767.9      8004.7    4540.4
Test 5     98.63%              1.0      5499.4      5149.2      7115.7    6662.6
Test 6     97.61%              4.1      4676.5      1040.1      6496.3    1444.9
Test 7     100.00%             0.6      3231.2      5412.5      7560.9    12664.8
Test 8     99.48%              5.0      5064.4      1004.4      9116.6    1808.1
Test 9     100.00%             1.3      3531.0      2664.9      8088.7    6104.7
Test 10    100.00%             1.7      6552.2      3874.7      9016.2    5331.9
Test 11    100.00%             1.0      5472.7      5297.9      8189.6    7928.0
Test 12    63.96%              5.9      5660.8      960.1       7778.1    1319.2
Test 13    100.00%             0.6      3185.2      5221.6      7681.8    12593.2
Test 14    100.00%             5.4      5070.1      938.7       9205.5    1704.4
Test 15    100.00%             1.3      3510.1      2562.1      8229.5    6006.9
Test 16    100.00%             1.7      6628.5      3770.5      9148.3    5203.8
Test 17    100.00%             1.0      5532.5      5185.1      8345.8    7821.8
Test 18    100.00%             6.6      5729.3      741.9       7809.5    1011.3
Spark v2.0 Test Environments
                   Spark   IAQ
Nodes:             6       1
RAM (GB) Total:    240     64
Cores Total:       32      32
Proc Speed (MHz):  2400    2394
Test Environment
• 11 Terabyte Dataset
• IAQ: 99 GB on Disk
• Parquet: 2.4 Terabytes (1.2 TB with Replication Factor of 2)
• Partitioned by Date
Results on Average:
IAQ 244x faster than Spark
Query Times: Spark v2 and IAQ
[Chart: query time in seconds (axis 0 to 2,000) for Spark and IAQ on Tests 1 to 18; the per-test values appear in the "Spark v2 vs IAQ" table]
Spark v2 vs IAQ
Test Name  Similarity (0.6.1)  IAQ (s)  Spark (s)  Spark/IAQ
Test 1     99.97%              0.5      6.6        12.4
Test 2     99.99%              1.5      1773.3     1216.3
Test 3     100.00%             1.2      126.5      103.8
Test 4     99.07%              1.8      125.8      71.6
Test 5     98.63%              1.0      44.7       43.6
Test 6     97.61%              4.1      1435.9     349.7
Test 7     100.00%             0.6      16.9       29.2
Test 8     99.48%              5.0      1820.6     361.1
Test 9     100.00%             1.3      112.2      84.7
Test 10    100.00%             1.7      108.7      64.3
Test 11    100.00%             1.0      40.8       40.2
Test 12    63.96%              5.9      1394.9     236.6
Test 13    100.00%             0.6      16.6       28.2
Test 14    100.00%             5.4      1726.3     319.6
Test 15    100.00%             1.3      104.3      79.1
Test 16    100.00%             1.7      106.2      63.1
Test 17    100.00%             1.0      34.8       33.8
Test 18    100.00%             6.6      1366.4     205.8
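The speedup figures can be recomputed from the Spark and IAQ timings. The sketch below assumes that the "on average" figure is the ratio of total query times; the slides do not state how it was derived, so treat that as a guess. It uses the rounded IAQ times as printed, so per-test ratios differ slightly from the Spark/IAQ column, which was presumably computed from unrounded timings.

```python
# Query times in seconds for Tests 1 to 18, copied from the Spark v2 and IAQ timings.
spark_s = [6.6, 1773.3, 126.5, 125.8, 44.7, 1435.9, 16.9, 1820.6, 112.2,
           108.7, 40.8, 1394.9, 16.6, 1726.3, 104.3, 106.2, 34.8, 1366.4]
iaq_s = [0.5, 1.5, 1.2, 1.8, 1.0, 4.1, 0.6, 5.0, 1.3,
         1.7, 1.0, 5.9, 0.6, 5.4, 1.3, 1.7, 1.0, 6.6]

def speedups(baseline, candidate):
    """Per-test speedup: baseline time divided by candidate time."""
    return [b / c for b, c in zip(baseline, candidate)]

per_test = speedups(spark_s, iaq_s)
total_ratio = sum(spark_s) / sum(iaq_s)      # ratio of total run times
mean_ratio = sum(per_test) / len(per_test)   # mean of per-test ratios

print(f"ratio of total times:    {total_ratio:.0f}x")
print(f"mean of per-test ratios: {mean_ratio:.0f}x")
```

With these rounded inputs the ratio of totals lands close to the quoted 244x, while the mean of per-test ratios is noticeably lower, so the choice of averaging method matters when comparing headline numbers.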
High Fidelity Results
Total events generated by internal IP addresses
- Similarity: 99.48%
- Results Nearly Indistinguishable
- Disk Usage:
  - Raw: 11 TB
  - Impala/Hive: 1.2 TB
  - IAQ: 0.099 TB
- Query Speeds:
  - IAQ: 5.58 s
  - Impala: 5,064.39 s
  - Hive: 9,116.59 s
- Query:
  select count(*) thecnt, tm_day from t_iee
  where issrcinternal = 1 group by tm_day;
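To show what the benchmark query returns, here is a small sketch that runs it against an in-memory SQLite table. The table and column names come from the slide; the rows, the column types, and the added `order by` (for stable output) are illustrative assumptions, and SQLite merely stands in for Hive, Impala, or IAQ.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema: only the two columns the query touches.
conn.execute("create table t_iee (tm_day text, issrcinternal integer)")
conn.executemany(
    "insert into t_iee values (?, ?)",
    [
        ("2016-09-01", 1), ("2016-09-01", 1), ("2016-09-01", 0),
        ("2016-09-02", 1), ("2016-09-02", 0),
    ],
)

# Count events per day whose source IP is internal.
rows = conn.execute(
    "select count(*) thecnt, tm_day from t_iee "
    "where issrcinternal = 1 group by tm_day order by tm_day"
).fetchall()

for thecnt, tm_day in rows:
    print(tm_day, thecnt)
conn.close()
```

The result is one row per day, which is why an approximate answer can be compared to the exact one value by value, as the similarity figure does.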
Timeline
Timeline axis: September, October, November, December, …, March
V0.7
- Introduce More Beta Customers
- Enhanced Functionality / Optimization Coverage
  o Basic Join Functionality
  o Improved Quality (Strong Zero Correlations, Better Domain End Point Modeling)
  o IAQ Agent – Global dictionary maintenance
V0.8
- Introduce More Beta Customers
- Enhanced Functionality / Optimization Coverage
  o Improved Aggregation
  o Improved Quality (e.g. Count Distinct)
  o Enhanced Memory Handling
V1.0
- GA
- Documentation complete
- Issues found in Beta addressed
V1.1
- Enhanced Functionality / Optimization Coverage
  o Greater "where" coverage (e.g. enhance "or")
  o Basic expressions
  o More join coverage
  o More skins for Ingest Agent
Thank You
Source: BDW Chicago 2016 - Don Deloach, CEO and President, Infobright - Rethinking Architectural Models With High Value Approximation

Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Wonjun Hwang
 


BDW Chicago 2016 - Don Deloach, CEO and President, Infobright - Rethinking Architectural Models With High Value Approximation

  • 1. High Value Approximation Conquering Big Data @ Scale September, 2016
  • 2. Disclaimer: This presentation violates my most basic belief that thought leadership at a conference should never involve promoting a product. However: There is nothing else in the market that delivers high value approximation via leveraging an abstraction layer based on statistical models. Therefore: This is an extremely interesting concept, which we see as exceptional thought leadership in the space, so we have no other option. Apologies in advance.
  • 3. Data production will be 44 times greater in 2020 than it was in 2009
  • 4. Keep Throwing Hardware at The Problem
  • 5. Keep Throwing Hardware at The Problem Bad
  • 6. 30 Minute Query Times 3 Hour Query Times More? Bad
  • 9. 10% - 15% Samples Can Be Decent
  • 10. 10% - 15% of the Resources Good
  • 11. 10 Times Faster Query Times Also Good
  • 13. Issues with Missed Outliers
  • 15. There is an Alternative
  • 16. Limit Use Cases to Single Table Low Cardinality Where Missing Outliers is all OK
  • 17. Not the RIGHT Alternative
  • 18. But 10 Times Faster Using 10% of the Resources is Good. Right?
  • 19. Using Statistical Models in Lieu of Sampling
  • 23. High Value Approximation Across Tables JOIN Support
  • 24. Broad Range of Use Cases
  • 25. Use Cases (Confidential – Do Not Distribute)
      Trading & Financial Services
      o Applicable to trading and risk management
      o Create and test trading algorithms in minutes, not weeks
      o Generate higher profit with improved algorithms
      o Improved assessment of risk
      Trend Analysis & BI Integration
      o Use investigative analytics approach to identify trends to enable decision making
      o Visualize trends and forecast
      o Simple integration without new tools
      Security
      o Applicable to network intrusion detection and troubleshooting
      o Query iteratively to determine source
      o Reduce remediation time from days to hours/mins
      Adtech & Audience Profiling
      o Currently only using impression and bidding data to form client profiles
      o IAQ allows detailed auction data (i.e. cookie data, IP data, devices, sites) to be used and NOT discarded to form richer profiles for improved campaign planning
      Axis labels: Exact queries required at end / Exact queries NEVER required
  • 26. Exact vs. Approximate
      Exact queries search through all atomic data, pulling on more resources and requiring more time.
      IAQ uses approximate queries to narrow answers until a final exact answer is required. At times, exact answers are NEVER required, eliminating this final step altogether.
      [Diagram labels: Time, Exact Answer]
  • 27. Adtech: Audience Profiling
      Serious limitations with current technologies
      o Throw away cookie data, IP data, devices, sites … (very rich information, but impractical to retain)
      o Keep impressions and what they bid on …
      o Listen to ad requests: extremely high record counts
      Data sets: Impression & Bidding Data, Detailed Auction Data, Client Data
      Profile based on as detailed an analysis as possible
      The key is to join the underlying datasets, which are simply too big to be used in a practical manner
  • 28. Adtech: Audience Profiling - Approximate Query
      The inclusion of broader and richer data improves adtech results
      o Retain cookie data, IP data, devices, sites … (now capable of being joined and queried using Approximate Query; exact answers are never required)
      o Keep impressions and what they bid on …
      o Listen to ad requests: extremely high record counts
      Data sets: Impression & Bidding Data, Detailed Auction Data, Client Data
      A much richer and more valuable profile (one that was not possible before) can be created by leveraging detailed data previously disregarded. No exact result required.
  • 29. IoT: Exploding Data for Many Use Cases
      Operational Surveillance Logistics Demand Forecasting Cybersecurity Warranty Analytics Agricultural Yield Management
      o Exponentially more data will be available
      o More data sets will be combined for richer signatures
      o Some IoT use cases will not require exact answers immediately
      o Most will never need exact answers
  • 30. 1% of the Resources
  • 31. 150 to 5900 Times Faster
  • 33. Do you deal with Use Cases where Approximation Makes Sense?
  • 34. Do you use BI Tools?
  • 35. Are you a Data Scientist?
  • 36. Is this Hard to Believe?
  • 38. BENCHMARK SUMMARY: IAQ Advantage - Normalized Resources
      Considerations in Projection: Memory, CPU, I/O
      IAQ Performance Advantage
                Avg      Min      Max
      Hive      5950x    1176x    13081x
      Impala    2986x    863x     5590x
      Spark     160x     10x      1040x
      Performance should get better with each release of IAQ
  • 39. Hive/Impala Test Environments
                          Hive, Impala    IAQ
      Nodes:              4               1
      RAM (GB) Total:     64              64
      Cores Total:        32              32
      Proc Speed (MHz)    2400            2394
      Test Environment
      o 11 Terabyte Dataset
      o IAQ: 99 GB on Disk
      o Parquet: 2.4 Terabytes (1.2 TB with Replication Factor of 2)
      Results on Average:
      o IAQ 3,300x faster than Hive
      o IAQ 1,900x faster than Impala
  • 40. Query Times: Hive, Impala, and IAQ
      [Chart: query time in seconds for Test 1 through Test 18; values listed in test order]
      Hive (s): 6384.9, 7205.5, 7006.8, 8004.7, 7115.7, 6496.3, 7560.9, 9116.6, 8088.7, 9016.2, 8189.6, 7778.1, 7681.8, 9205.5, 8229.5, 9148.3, 8345.8, 7809.5
      Impala (s): 1059.0, 1542.1, 2148.0, 6642.9, 5499.4, 4676.5, 3231.2, 5064.4, 3531.0, 6552.2, 5472.7, 5660.8, 3185.2, 5070.1, 3510.1, 6628.5, 5532.5, 5729.3
      IAQ (s): 0.5, 1.5, 1.2, 1.8, 1.0, 4.1, 0.6, 5.0, 1.3, 1.7, 1.0, 5.9, 0.6, 5.4, 1.3, 1.7, 1.0, 6.6
  • 41. Impala vs IAQ
      Test Name  Similarity (0.6.1)  IAQ (s)  Impala (s)  Impala/IAQ  Hive (s)  Hive/IAQ
      Test 1     99.97%              0.5      1059.0      1611.8      6384.9    9718.2
      Test 2     99.99%              1.5      1542.1      1057.7      7205.5    4942.0
      Test 3     100.00%             1.2      2148.0      1695.4      7006.8    5530.3
      Test 4     99.07%              1.8      6642.9      3767.9      8004.7    4540.4
      Test 5     98.63%              1.0      5499.4      5149.2      7115.7    6662.6
      Test 6     97.61%              4.1      4676.5      1040.1      6496.3    1444.9
      Test 7     100.00%             0.6      3231.2      5412.5      7560.9    12664.8
      Test 8     99.48%              5.0      5064.4      1004.4      9116.6    1808.1
      Test 9     100.00%             1.3      3531.0      2664.9      8088.7    6104.7
      Test 10    100.00%             1.7      6552.2      3874.7      9016.2    5331.9
      Test 11    100.00%             1.0      5472.7      5297.9      8189.6    7928.0
      Test 12    63.96%              5.9      5660.8      960.1       7778.1    1319.2
      Test 13    100.00%             0.6      3185.2      5221.6      7681.8    12593.2
      Test 14    100.00%             5.4      5070.1      938.7       9205.5    1704.4
      Test 15    100.00%             1.3      3510.1      2562.1      8229.5    6006.9
      Test 16    100.00%             1.7      6628.5      3770.5      9148.3    5203.8
      Test 17    100.00%             1.0      5532.5      5185.1      8345.8    7821.8
      Test 18    100.00%             6.6      5729.3      741.9       7809.5    1011.3
  • 42. Spark v2.0 Test Environments
                          Spark    IAQ
      Nodes:              6        1
      RAM (GB) Total:     240      64
      Cores Total:        32       32
      Proc Speed (MHz)    2400     2394
      Test Environment
      o 11 Terabyte Dataset
      o IAQ: 99 GB on Disk
      o Parquet: 2.4 Terabytes (1.2 TB with Replication Factor of 2)
      o Partitioned by Date
      Results on Average:
      o IAQ 244x faster than Spark
  • 43. Query Times: Spark v2 and IAQ
      [Chart: query time in seconds for Test 1 through Test 18; values listed in test order]
      Spark (s): 6.6, 1773.3, 126.5, 125.8, 44.7, 1435.9, 16.9, 1820.6, 112.2, 108.7, 40.8, 1394.9, 16.6, 1726.3, 104.3, 106.2, 34.8, 1366.4
      IAQ (s): 0.5, 1.5, 1.2, 1.8, 1.0, 4.1, 0.6, 5.0, 1.3, 1.7, 1.0, 5.9, 0.6, 5.4, 1.3, 1.7, 1.0, 6.6
  • 44. Spark v2 vs IAQ
      Test Name  Similarity (0.6.1)  IAQ (s)  Spark (s)  Spark/IAQ
      Test 1     99.97%              0.5      6.6        12.4
      Test 2     99.99%              1.5      1773.3     1216.3
      Test 3     100.00%             1.2      126.5      103.8
      Test 4     99.07%              1.8      125.8      71.6
      Test 5     98.63%              1.0      44.7       43.6
      Test 6     97.61%              4.1      1435.9     349.7
      Test 7     100.00%             0.6      16.9       29.2
      Test 8     99.48%              5.0      1820.6     361.1
      Test 9     100.00%             1.3      112.2      84.7
      Test 10    100.00%             1.7      108.7      64.3
      Test 11    100.00%             1.0      40.8       40.2
      Test 12    63.96%              5.9      1394.9     236.6
      Test 13    100.00%             0.6      16.6       28.2
      Test 14    100.00%             5.4      1726.3     319.6
      Test 15    100.00%             1.3      104.3      79.1
      Test 16    100.00%             1.7      106.2      63.1
      Test 17    100.00%             1.0      34.8       33.8
      Test 18    100.00%             6.6      1366.4     205.8
  • 45. High Fidelity Results
      Total events generated by internal IP addresses
      - Similarity 99.48%
      - Results Nearly Indistinguishable
      - Disk Usage:
        - Raw: 11 TB
        - Impala/Hive: 1.2 TB
        - IAQ: 0.099 TB
      - Query Speeds:
        - IAQ: 5.58 s
        - Impala: 5,064.39 s
        - Hive: 9,116.59 s
      - Query:
        select count(*) thecnt, tm_day from t_iee where issrcinternal = 1 group by tm_day;
  • 46. Timeline
      Months shown: September, October, November, December … March
      V0.7
      Introduce More Beta Customers
      Enhanced Functionality / Optimization Coverage
      o Basic Join Functionality
      o Improved Quality (Strong Zero Correlations, Better Domain End Point Modeling)
      o IAQ Agent – Global dictionary maintenance
      V0.8
      Introduce More Beta Customers
      Enhanced Functionality / Optimization Coverage
      o Improved Aggregation
      o Improved Quality (e.g. Count Distinct)
      o Enhanced Memory Handling
      V1.0 GA
      Documentation complete
      Issues found in Beta addressed
      V1.1
      Enhanced Functionality / Optimization Coverage
      o Greater "where" coverage
      o E.g. Enhance "or"
      o Basic expressions
      o More join coverage
      o More skins for Ingest Agent
  • 47. Thank You
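
The ratio columns on the benchmark slides (Spark/IAQ, Impala/IAQ, Hive/IAQ) are the baseline query time divided by the IAQ query time. The sketch below works through that arithmetic for the first three Spark rows of slide 44. The helper name `speedups` is hypothetical and not part of any tool named in the deck. The slide shows IAQ times rounded to one decimal place, so ratios recomputed from them will differ somewhat from the slide's own ratio column, which was presumably computed from unrounded timings.

```python
from statistics import mean

# Query times in seconds for Tests 1-3, copied from slide 44.
# IAQ values are rounded to one decimal on the slide.
iaq_s = {"Test 1": 0.5, "Test 2": 1.5, "Test 3": 1.2}
spark_s = {"Test 1": 6.6, "Test 2": 1773.3, "Test 3": 126.5}


def speedups(baseline, candidate):
    """Return baseline time / candidate time for each test (hypothetical helper)."""
    return {name: baseline[name] / candidate[name] for name in candidate}


ratios = speedups(spark_s, iaq_s)
for name, ratio in ratios.items():
    print(f"{name}: {ratio:.1f}x")

# Arithmetic mean of the per-test ratios for this three-row subset only.
print(f"mean over {len(ratios)} tests: {mean(ratios.values()):.1f}x")
```

The per-test ratios vary by two orders of magnitude across these rows, which is why the summary slide reports a minimum and maximum alongside the average.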