BDW Chicago 2016 - Don Deloach, CEO and President, Infobright - Rethinking Architectural Models With High Value Approximation

High Value Approximation
Conquering Big Data @ Scale
September, 2016

Disclaimer: This presentation violates my most basic belief that thought leadership at a conference should never involve promoting a product.
However: There is nothing else in the market that delivers high value approximation by leveraging an abstraction layer based on statistical models.
Therefore: This is an extremely interesting concept, which we see as exceptional thought leadership in the space, so we have no other option.
Apologies in advance.

Data production will be
44 times greater
in 2020 than it was in 2009

Keep Throwing Hardware at The Problem
Bad

30 Minute Query Times
3 Hour Query Times
More?
Bad

10% - 15% Samples Can Be Decent

10% - 15% of the Resources
Good

10 Times Faster Query Times
Also Good

Limit Use Cases to
Single Table
Low Cardinality
Where Missing Outliers
is all OK

But 10 Times Faster
Using
10% of the Resources
is Good.
Right?
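The trade-off on the sampling slides can be illustrated with a small sketch. It is an illustration only, not something shown in the talk: the toy rows, the channel names, and the helper `estimate_group_counts` are hypothetical, and it takes every 10th row (systematic sampling) rather than a random sample so that the output is reproducible.

```python
from collections import Counter

CHANNELS = ["search", "social", "email"]  # hypothetical low-cardinality column


def estimate_group_counts(rows, step):
    """Estimate per-group row counts from a 1-in-`step` sample.

    Returns the scaled-up counts and the sample itself.
    """
    sample = rows[::step]  # touch only 1/step of the data
    counts = Counter(channel for channel, _ in sample)
    return {channel: n * step for channel, n in counts.items()}, sample


# Toy single table: 1,000 rows of (channel, amount), plus one outlier amount.
rows = [(CHANNELS[i % 3], 10.0) for i in range(1000)]
rows[7] = (rows[7][0], 9999.0)  # the single outlier; row 7 is not sampled

estimates, sample = estimate_group_counts(rows, step=10)
exact = Counter(channel for channel, _ in rows)

for channel in CHANNELS:
    print(channel, "exact:", exact[channel], "estimated:", estimates[channel])

# The low-cardinality counts come out close, but the outlier is invisible.
print("max amount in full data:", max(amount for _, amount in rows))
print("max amount in sample:   ", max(amount for _, amount in sample))
```

The group counts land within a few percent of the exact values using a tenth of the rows, while the one extreme value never appears in the sample. That is the "where missing outliers is OK" restriction in miniature.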

Using Statistical Models in Lieu of Sampling

100% Introspection of the Data

Statistical Metadata Across Tables

High Value Approximation Across Tables
JOIN Support
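One way to picture "statistical models in lieu of sampling" is a summary that is built by scanning every row once and is then queried instead of the rows. The sketch below uses a generic equi-width histogram for a single column. It is not a description of how IAQ works internally (the slides do not say), it does not cover joins, and `build_histogram`, `approx_count_below`, and the bucket width are hypothetical.

```python
def build_histogram(values, bucket_width):
    """Scan every value once and keep only a count per bucket."""
    hist = {}
    for v in values:
        bucket = v // bucket_width
        hist[bucket] = hist.get(bucket, 0) + 1
    return hist


def approx_count_below(hist, bucket_width, threshold):
    """Estimate how many values are < threshold from the histogram alone.

    Assumes values are spread uniformly inside each bucket.
    """
    total = 0.0
    for bucket, n in hist.items():
        lo = bucket * bucket_width
        hi = lo + bucket_width
        if hi <= threshold:
            total += n  # bucket lies entirely below the threshold
        elif lo < threshold:
            total += n * (threshold - lo) / bucket_width  # partial bucket
    return total


# Toy column: 1,000 evenly spread values plus one far-away outlier.
values = list(range(1000)) + [5000]
hist = build_histogram(values, bucket_width=100)

print("buckets kept:", len(hist), "for", len(values), "rows")
print("approx count(value < 250):", approx_count_below(hist, 100, 250))
print("exact  count(value < 250):", sum(1 for v in values if v < 250))
# Every row was inspected when building the summary, so the outlier is counted.
print("approx count(all rows):", approx_count_below(hist, 100, 10_000))
```

Because the summary is built from all of the data, rare values still leave a trace in it, which a 10% sample cannot guarantee. On skewed data the uniform-within-bucket assumption makes the answer approximate, not exact.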

Use Cases
Trading & Financial Services
• Applicable to trading and risk management
• Create and test trading algorithms in minutes, not weeks
• Generate higher profit with improved algorithms
• Improved assessment of risk
Trend Analysis & BI Integration
• Use investigative analytics approach to identify trends to enable decision making
• Visualize trends and forecast
• Simple integration without new tools
Security
• Applicable to network intrusion detection and troubleshooting
• Query iteratively to determine source
• Reduce remediation time from days to hours/mins
Adtech & Audience Profiling
• Currently only using impression and bidding data to form client profiles
• IAQ allows detailed auction data (i.e. cookie data, IP data, devices, sites) to be used and NOT discarded to form richer profiles for improved campaign planning
Exact queries required at end
Exact queries NEVER required

Exact vs. Approximate
[Diagram: two paths to an "Exact Answer" plotted against "Time"]
Exact queries search through all atomic data, which pulls on more resources and requires more time.
IAQ uses approximate queries to narrow answers until a final exact answer is required. At times, exact answers are NEVER required, eliminating this final step altogether.
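The narrowing workflow described above can be sketched as follows. This is a toy illustration under stated assumptions: `approx_events_per_day` merely stands in for a fast approximate query (here it rounds counts to the nearest 100), `exact_events` stands in for an exact query, and the event data and all names are hypothetical.

```python
# Hypothetical raw events: (day, source_ip). Day 3 carries a burst from one IP.
events = [(day, f"10.0.0.{i % 50}") for day in range(1, 8) for i in range(200)]
events += [(3, "10.0.0.99")] * 800


def approx_events_per_day(rows):
    """Stand-in for an approximate query: per-day counts, rounded to 100s."""
    counts = {}
    for day, _ in rows:
        counts[day] = counts.get(day, 0) + 1
    return {day: round(n, -2) for day, n in counts.items()}


def exact_events(rows, day):
    """Stand-in for an exact query, restricted to a single day."""
    per_ip = {}
    for d, ip in rows:
        if d == day:
            per_ip[ip] = per_ip.get(ip, 0) + 1
    return per_ip


# Step 1: approximate pass over everything, to find where to look.
approx = approx_events_per_day(events)
suspect_day = max(approx, key=approx.get)

# Step 2: exact pass, only on the narrowed-down slice.
per_ip = exact_events(events, suspect_day)
top_ip = max(per_ip, key=per_ip.get)

print("suspect day:", suspect_day)
print("top source on that day:", top_ip, "with", per_ip[top_ip], "events")
```

The expensive exact step runs once, on a small slice. If the approximate pass alone answers the question, the exact step can be skipped.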

Adtech: Audience Profiling
Serious limitations with current technologies
Listen to ad requests: extremely high record counts
Detailed Auction Data: Throw away cookie data, IP data, devices, sites … (very rich information, but impractical to retain)
Impression & Bidding Data: Keep impressions and what they bid on …
Client Data: Profile based on as detailed an analysis as possible
The key is to join the underlying datasets, which are simply too big to be used in a practical manner

Adtech: Audience Profiling - Approximate Query
The inclusion of broader and richer data improves adtech results
Listen to ad requests: extremely high record counts
Detailed Auction Data: Retain cookie data, IP data, devices, sites … (now capable of being joined and queried using Approximate Query. Exact answers are never required)
Impression & Bidding Data: Keep impressions and what they bid on …
Client Data: A much richer and more valuable profile (that was not possible before) can be created by leveraging detailed data previously disregarded. No exact result required.

IoT: Exploding Data for Many Use Cases
Operational
Surveillance
Logistics
Demand Forecasting
Cybersecurity
Warranty Analytics
Agricultural Yield Management
• Exponentially more data will be available
• More data sets will be combined for richer signatures
• Some IoT use cases will not require exact answers immediately
• Most will never need exact answers

Extremely High Fidelity Results

Do you deal with Use Cases where
Approximation
Makes Sense?

BENCHMARK SUMMARY
IAQ Advantage - Normalized Resources
Considerations in Projection: Memory, CPU, I/O
IAQ Performance Advantage
          Avg      Min      Max
Hive      5950x    1176x    13081x
Impala    2986x    863x     5590x
Spark     160x     10x      1040x
Performance should get better with each release of IAQ

Hive/Impala Test Environments
                   Hive, Impala   IAQ
Nodes              4              1
RAM (GB) Total     64             64
Cores Total        32             32
Proc Speed (MHz)   2400           2394
Test Environment
• 11 Terabyte Dataset
• IAQ: 99 GB on Disk
• Parquet: 2.4 Terabytes (1.2 TB with Replication Factor of 2)
Results on Average:
IAQ 3,300x faster than Hive
IAQ 1,900x faster than Impala

Query Times: Hive, Impala, and IAQ
[Bar chart: query time in seconds for Tests 1 to 18, axis from 0 to 10,000; chart data follows]
Test 1 Test 2 Test 3 Test 4 Test 5 Test 6 Test 7 Test 8 Test 9 Test 10 Test 11 Test 12 Test 13 Test 14 Test 15 Test 16 Test 17 Test 18
Hive (s) 6384.9 7205.5 7006.8 8004.7 7115.7 6496.3 7560.9 9116.6 8088.7 9016.2 8189.6 7778.1 7681.8 9205.5 8229.5 9148.3 8345.8 7809.5
Impala (s) 1059.0 1542.1 2148.0 6642.9 5499.4 4676.5 3231.2 5064.4 3531.0 6552.2 5472.7 5660.8 3185.2 5070.1 3510.1 6628.5 5532.5 5729.3
IAQ (s) 0.5 1.5 1.2 1.8 1.0 4.1 0.6 5.0 1.3 1.7 1.0 5.9 0.6 5.4 1.3 1.7 1.0 6.6

Impala vs IAQ
Test Name   Similarity (0.6.1)   IAQ (s)   Impala (s)   Impala/IAQ   Hive (s)   Hive/IAQ
Test 1      99.97%               0.5       1059.0       1611.8       6384.9     9718.2
Test 2      99.99%               1.5       1542.1       1057.7       7205.5     4942.0
Test 3      100.00%              1.2       2148.0       1695.4       7006.8     5530.3
Test 4      99.07%               1.8       6642.9       3767.9       8004.7     4540.4
Test 5      98.63%               1.0       5499.4       5149.2       7115.7     6662.6
Test 6      97.61%               4.1       4676.5       1040.1       6496.3     1444.9
Test 7      100.00%              0.6       3231.2       5412.5       7560.9     12664.8
Test 8      99.48%               5.0       5064.4       1004.4       9116.6     1808.1
Test 9      100.00%              1.3       3531.0       2664.9       8088.7     6104.7
Test 10     100.00%              1.7       6552.2       3874.7       9016.2     5331.9
Test 11     100.00%              1.0       5472.7       5297.9       8189.6     7928.0
Test 12     63.96%               5.9       5660.8       960.1        7778.1     1319.2
Test 13     100.00%              0.6       3185.2       5221.6       7681.8     12593.2
Test 14     100.00%              5.4       5070.1       938.7        9205.5     1704.4
Test 15     100.00%              1.3       3510.1       2562.1       8229.5     6006.9
Test 16     100.00%              1.7       6628.5       3770.5       9148.3     5203.8
Test 17     100.00%              1.0       5532.5       5185.1       8345.8     7821.8
Test 18     100.00%              6.6       5729.3       741.9        7809.5     1011.3

Spark v2.0 Test Environments
                   Spark   IAQ
Nodes              6       1
RAM (GB) Total     240     64
Cores Total        32      32
Proc Speed (MHz)   2400    2394
Test Environment
• 11 Terabyte Dataset
• IAQ: 99 GB on Disk
• Parquet: 2.4 Terabytes (1.2 TB with Replication Factor of 2)
• Partitioned by Date
Results on Average:
IAQ 244x faster than Spark

Query Times: Spark v2 and IAQ
[Bar chart: query time in seconds for Tests 1 to 18, axis from 0 to 2,000; chart data follows]
Test 1 Test 2 Test 3 Test 4 Test 5 Test 6 Test 7 Test 8 Test 9 Test 10 Test 11 Test 12 Test 13 Test 14 Test 15 Test 16 Test 17 Test 18
Spark (s) 6.6 1773.3 126.5 125.8 44.7 1435.9 16.9 1820.6 112.2 108.7 40.8 1394.9 16.6 1726.3 104.3 106.2 34.8 1366.4
IAQ (s) 0.5 1.5 1.2 1.8 1.0 4.1 0.6 5.0 1.3 1.7 1.0 5.9 0.6 5.4 1.3 1.7 1.0 6.6

Spark v2 vs IAQ
Test Name   Similarity (0.6.1)   IAQ (s)   Spark (s)   Spark/IAQ
Test 1      99.97%               0.5       6.6         12.4
Test 2      99.99%               1.5       1773.3      1216.3
Test 3      100.00%              1.2       126.5       103.8
Test 4      99.07%               1.8       125.8       71.6
Test 5      98.63%               1.0       44.7        43.6
Test 6      97.61%               4.1       1435.9      349.7
Test 7      100.00%              0.6       16.9        29.2
Test 8      99.48%               5.0       1820.6      361.1
Test 9      100.00%              1.3       112.2       84.7
Test 10     100.00%              1.7       108.7       64.3
Test 11     100.00%              1.0       40.8        40.2
Test 12     63.96%               5.9       1394.9      236.6
Test 13     100.00%              0.6       16.6        28.2
Test 14     100.00%              5.4       1726.3      319.6
Test 15     100.00%              1.3       104.3       79.1
Test 16     100.00%              1.7       106.2       63.1
Test 17     100.00%              1.0       34.8        33.8
Test 18     100.00%              6.6       1366.4      205.8
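As a cross-check on the "Results on Average" figures, the sketch below divides each engine's total time over the 18 tests by IAQ's total, using the rounded per-test times from the tables. That the headline averages were derived this way is an assumption; the slides do not state the method. With these inputs the ratios come out at roughly 3,374x for Hive, 1,913x for Impala, and 246x for Spark, close to the quoted 3,300x, 1,900x, and 244x. The function name `total_time_ratio` is illustrative.

```python
# Per-test query times in seconds, copied from the tables (Tests 1-18).
iaq = [0.5, 1.5, 1.2, 1.8, 1.0, 4.1, 0.6, 5.0, 1.3,
       1.7, 1.0, 5.9, 0.6, 5.4, 1.3, 1.7, 1.0, 6.6]
hive = [6384.9, 7205.5, 7006.8, 8004.7, 7115.7, 6496.3, 7560.9, 9116.6, 8088.7,
        9016.2, 8189.6, 7778.1, 7681.8, 9205.5, 8229.5, 9148.3, 8345.8, 7809.5]
impala = [1059.0, 1542.1, 2148.0, 6642.9, 5499.4, 4676.5, 3231.2, 5064.4, 3531.0,
          6552.2, 5472.7, 5660.8, 3185.2, 5070.1, 3510.1, 6628.5, 5532.5, 5729.3]
spark = [6.6, 1773.3, 126.5, 125.8, 44.7, 1435.9, 16.9, 1820.6, 112.2,
         108.7, 40.8, 1394.9, 16.6, 1726.3, 104.3, 106.2, 34.8, 1366.4]


def total_time_ratio(engine_times, baseline_times):
    """Ratio of total elapsed time across all tests (engine / baseline)."""
    return sum(engine_times) / sum(baseline_times)


for name, times in [("Hive", hive), ("Impala", impala), ("Spark", spark)]:
    print(f"{name}: {total_time_ratio(times, iaq):,.0f}x slower than IAQ overall")
```

A ratio of totals weights long-running tests more heavily than a mean of the per-test ratios would, which is one reason a single "average speedup" figure should be read alongside the per-test table.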

High Fidelity Results
Total events generated by internal IP addresses
- Similarity: 99.48%
- Results Nearly Indistinguishable
- Disk Usage:
  - Raw: 11 TB
  - Impala/Hive: 1.2 TB
  - IAQ: 0.099 TB
- Query Speeds:
  - IAQ: 5.58 s
  - Impala: 5,064.39 s
  - Hive: 9,116.59 s
- Query:
  select count(*) thecnt, tm_day
  from t_iee
  where issrcinternal = 1
  group by tm_day;
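The disk usage figures on this slide imply the size ratios worked out below. The arithmetic uses only the three numbers quoted on the slide; the variable names are illustrative.

```python
# Disk usage in terabytes, as quoted on the slide.
raw_tb = 11.0
impala_hive_tb = 1.2
iaq_tb = 0.099

print(f"Raw vs IAQ:         {raw_tb / iaq_tb:.0f}x smaller")
print(f"Impala/Hive vs IAQ: {impala_hive_tb / iaq_tb:.0f}x smaller")
print(f"IAQ footprint as a share of raw data: {iaq_tb / raw_tb:.1%}")
```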

Timeline
September October November December … March
V0.7
Introduce More Beta Customers
Enhanced Functionality / Optimization Coverage
o Basic Join Functionality
o Improved Quality (Strong Zero Correlations, Better Domain End Point Modeling)
o IAQ Agent: Global dictionary maintenance
V0.8
Introduce More Beta Customers
o Improved Aggregation
o Improved Quality (e.g. Count Distinct)
o Enhanced Memory Handling
V1.0
GA
Documentation complete
Issues found in Beta addressed
V1.1
o Greater “where” coverage (e.g. enhance “or”)
o Basic expressions
o More join coverage
o More skins for Ingest Agent

Thank You
