Analyst and industry projections agree that data volumes are growing at an accelerating rate. Systems may already seem to be bursting at the seams, and the projections suggest that current architectural models are about to hit a wall, with no end to the growth in sight. When gaining insight from the data depends on one or more complex queries, simply adding more hardware and more people becomes infeasible.
In this presentation, Infobright CEO Don DeLoach will discuss how high-value approximation can deliver insight equivalent to that of exact queries while avoiding the prohibitive time and cost of continuing with traditional models.
Rethinking the problem in terms of statistical metadata offers a way past the mounting scale barriers: it drastically reduces resource requirements and query times, opening up previously unattainable opportunities.
2. Disclaimer: This presentation violates my most basic belief that thought leadership at a conference should never involve promoting a product
However: There is nothing else in the market that delivers high-value approximation by leveraging an abstraction layer based on statistical models
Therefore: This is an extremely interesting concept, which we see as exceptional thought leadership in the space, so we have no other option
Apologies in advance.
25. Use Cases
Confidential – Do Not Distribute
Adtech & Audience Profiling
- Currently only impression and bidding data are used to form client profiles
- IAQ allows detailed auction data (e.g. cookie data, IP data, devices, sites) to be used and NOT discarded, forming richer profiles for improved campaign planning
Trend Analysis & BI Integration
- Use an investigative analytics approach to identify trends and enable decision making
- Visualize trends and forecast
- Simple integration without new tools
Trading & Financial Services
- Applicable to trading and risk management
- Create and test trading algorithms in minutes, not weeks
- Generate higher profit with improved algorithms
- Improved assessment of risk
Security
- Applicable to network intrusion detection and troubleshooting
- Query iteratively to determine source
- Reduce remediation time from days to hours or minutes
Depending on the use case: exact queries required at end, or exact queries NEVER required
26. Exact vs. Approximate
(Diagram: progress toward the exact answer over time, for exact queries and for IAQ.)
Exact queries search through all atomic data, which consumes more resources and requires more time.
IAQ uses approximate queries to narrow answers until a final exact answer is required. At times, exact answers are NEVER required, eliminating this final step altogether.
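One way to picture the trade described above is to answer a count from small per-block summaries instead of scanning every row, and to run the full scan only when the estimate is not good enough. The sketch below is an illustration under stated assumptions and is not IAQ's algorithm: `BlockStats`, `summarize`, and `approx_count_below` are hypothetical names, the data is synthetic, and the estimate assumes values are spread uniformly between each block's minimum and maximum.

```python
import random
from dataclasses import dataclass


@dataclass
class BlockStats:
    """Hypothetical per-block summary of one numeric column."""
    count: int
    lo: float
    hi: float


def summarize(values, block_size):
    """Build one BlockStats per fixed-size block of rows."""
    blocks = []
    for start in range(0, len(values), block_size):
        chunk = values[start:start + block_size]
        blocks.append(BlockStats(len(chunk), min(chunk), max(chunk)))
    return blocks


def approx_count_below(blocks, threshold):
    """Estimate COUNT(*) WHERE value < threshold from summaries only.

    Assumption: values are spread uniformly between each block's min and max.
    """
    total = 0.0
    for b in blocks:
        if threshold <= b.lo:
            continue                      # no row in this block can match
        if threshold > b.hi:
            total += b.count              # every row in this block matches
        else:
            # Partial overlap: take the matching fraction of the value range.
            total += b.count * (threshold - b.lo) / (b.hi - b.lo)
    return total


def exact_count_below(values, threshold):
    """Full scan over the atomic data, for comparison."""
    return sum(1 for v in values if v < threshold)


# Synthetic data: 100,000 uniform values summarized into 100 blocks.
rng = random.Random(0)
values = [rng.uniform(0, 100) for _ in range(100_000)]
blocks = summarize(values, block_size=1_000)

estimate = approx_count_below(blocks, 25.0)
exact = exact_count_below(values, 25.0)
print(f"estimate={estimate:.0f} exact={exact} "
      f"relative error={abs(estimate - exact) / exact:.2%}")
```

The estimate reads 100 summaries rather than 100,000 rows. How close it lands depends entirely on how well the uniformity assumption fits the data.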
27. Adtech: Audience Profiling
Serious limitations with current technologies
- Listen to ad requests: extremely high record counts
- Impression & Bidding Data: keep impressions and what they bid on…
- Detailed Auction Data: throw away cookie data, IP data, devices, sites … (very rich information, but impractical to retain)
- Client Data: profile based on as detailed an analysis as possible
- The key is to join the underlying datasets, which are simply too big to be used in a practical manner
28. Adtech: Audience Profiling - Approximate Query
The inclusion of broader and richer data improves adtech results
- Listen to ad requests: extremely high record counts
- Impression & Bidding Data: keep impressions and what they bid on…
- Detailed Auction Data: retain cookie data, IP data, devices, sites … (now capable of being joined and queried using Approximate Query; exact answers are never required)
- Client Data: a much richer and more valuable profile (that was not possible before) can be created by leveraging detailed data previously disregarded. No exact result required.
29. IoT: Exploding Data for Many Use Cases
Operational Surveillance, Logistics, Demand Forecasting, Cybersecurity, Warranty Analytics, Agricultural Yield Management
- Exponentially more data will be available
- More data sets will be combined for richer signatures
- Some IoT use cases will not require exact answers immediately
- Most will never need exact answers
38. BENCHMARK SUMMARY
IAQ Advantage
- Normalized Resources
Considerations in Projection
- Memory
- CPU
- I/O
IAQ Performance Advantage
         Avg     Min     Max
Hive     5950x   1176x   13081x
Impala   2986x   863x    5590x
Spark    160x    10x     1040x
Performance should get better with each release of IAQ
39. Hive/Impala Test Environments
                   Hive, Impala   IAQ
Nodes:             4              1
RAM (GB) Total:    64             64
Cores Total:       32             32
Proc Speed (MHz):  2400           2394
Test Environment
• 11 Terabyte Dataset
• IAQ: 99 GB on Disk
• Parquet: 2.4 Terabytes (1.2 TB with Replication Factor of 2)
Results on Average:
IAQ 3,300x faster than Hive
IAQ 1,900x faster than Impala
40. Query Times: Hive, Impala, and IAQ
(Bar chart: query time in seconds per test for Hive, Impala, and IAQ; axis from 0 to 10,000 s.)
Test     Hive (s)  Impala (s)  IAQ (s)
Test 1   6384.9    1059.0      0.5
Test 2   7205.5    1542.1      1.5
Test 3   7006.8    2148.0      1.2
Test 4   8004.7    6642.9      1.8
Test 5   7115.7    5499.4      1.0
Test 6   6496.3    4676.5      4.1
Test 7   7560.9    3231.2      0.6
Test 8   9116.6    5064.4      5.0
Test 9   8088.7    3531.0      1.3
Test 10  9016.2    6552.2      1.7
Test 11  8189.6    5472.7      1.0
Test 12  7778.1    5660.8      5.9
Test 13  7681.8    3185.2      0.6
Test 14  9205.5    5070.1      5.4
Test 15  8229.5    3510.1      1.3
Test 16  9148.3    6628.5      1.7
Test 17  8345.8    5532.5      1.0
Test 18  7809.5    5729.3      6.6
41. Impala vs IAQ
Test Name  Similarity (0.6.1)  IAQ (s)  Impala (s)  Impala/IAQ  Hive (s)  Hive/IAQ
Test 1     99.97%              0.5      1059.0      1611.8      6384.9    9718.2
Test 2     99.99%              1.5      1542.1      1057.7      7205.5    4942.0
Test 3     100.00%             1.2      2148.0      1695.4      7006.8    5530.3
Test 4     99.07%              1.8      6642.9      3767.9      8004.7    4540.4
Test 5     98.63%              1.0      5499.4      5149.2      7115.7    6662.6
Test 6     97.61%              4.1      4676.5      1040.1      6496.3    1444.9
Test 7     100.00%             0.6      3231.2      5412.5      7560.9    12664.8
Test 8     99.48%              5.0      5064.4      1004.4      9116.6    1808.1
Test 9     100.00%             1.3      3531.0      2664.9      8088.7    6104.7
Test 10    100.00%             1.7      6552.2      3874.7      9016.2    5331.9
Test 11    100.00%             1.0      5472.7      5297.9      8189.6    7928.0
Test 12    63.96%              5.9      5660.8      960.1       7778.1    1319.2
Test 13    100.00%             0.6      3185.2      5221.6      7681.8    12593.2
Test 14    100.00%             5.4      5070.1      938.7       9205.5    1704.4
Test 15    100.00%             1.3      3510.1      2562.1      8229.5    6006.9
Test 16    100.00%             1.7      6628.5      3770.5      9148.3    5203.8
Test 17    100.00%             1.0      5532.5      5185.1      8345.8    7821.8
Test 18    100.00%             6.6      5729.3      741.9       7809.5    1011.3
42. Spark v2.0 Test Environments
                   Spark   IAQ
Nodes:             6       1
RAM (GB) Total:    240     64
Cores Total:       32      32
Proc Speed (MHz):  2400    2394
Test Environment
• 11 Terabyte Dataset
• IAQ: 99 GB on Disk
• Parquet: 2.4 Terabytes (1.2 TB with Replication Factor of 2)
• Partitioned by Date
Results on Average:
IAQ 244x faster than Spark
43. Query Times: Spark v2 and IAQ
(Bar chart: query time in seconds per test for Spark and IAQ; axis from 0 to 2,000 s.)
Test     Spark (s)  IAQ (s)
Test 1   6.6        0.5
Test 2   1773.3     1.5
Test 3   126.5      1.2
Test 4   125.8      1.8
Test 5   44.7       1.0
Test 6   1435.9     4.1
Test 7   16.9       0.6
Test 8   1820.6     5.0
Test 9   112.2      1.3
Test 10  108.7      1.7
Test 11  40.8       1.0
Test 12  1394.9     5.9
Test 13  16.6       0.6
Test 14  1726.3     5.4
Test 15  104.3      1.3
Test 16  106.2      1.7
Test 17  34.8       1.0
Test 18  1366.4     6.6
44. Spark v2 vs IAQ
Test Name  Similarity (0.6.1)  IAQ (s)  Spark (s)  Spark/IAQ
Test 1     99.97%              0.5      6.6        12.4
Test 2     99.99%              1.5      1773.3     1216.3
Test 3     100.00%             1.2      126.5      103.8
Test 4     99.07%              1.8      125.8      71.6
Test 5     98.63%              1.0      44.7       43.6
Test 6     97.61%              4.1      1435.9     349.7
Test 7     100.00%             0.6      16.9       29.2
Test 8     99.48%              5.0      1820.6     361.1
Test 9     100.00%             1.3      112.2      84.7
Test 10    100.00%             1.7      108.7      64.3
Test 11    100.00%             1.0      40.8       40.2
Test 12    63.96%              5.9      1394.9     236.6
Test 13    100.00%             0.6      16.6       28.2
Test 14    100.00%             5.4      1726.3     319.6
Test 15    100.00%             1.3      104.3      79.1
Test 16    100.00%             1.7      106.2      63.1
Test 17    100.00%             1.0      34.8       33.8
Test 18    100.00%             6.6      1366.4     205.8
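The per-test factors in the Spark/IAQ column can be reduced to the same average, minimum, and maximum form used in the benchmark summary. A minimal sketch follows, with `summarize_speedups` as a hypothetical helper name. The benchmark summary slide describes its figures as normalized for resources, so the unnormalized values computed here need not match it.

```python
from statistics import mean

# Spark/IAQ column, Test 1 through Test 18, copied from the table above.
SPARK_OVER_IAQ = [
    12.4, 1216.3, 103.8, 71.6, 43.6, 349.7,
    29.2, 361.1, 84.7, 64.3, 40.2, 236.6,
    28.2, 319.6, 79.1, 63.1, 33.8, 205.8,
]


def summarize_speedups(ratios):
    """Return the average, minimum, and maximum speedup factor."""
    return {"avg": mean(ratios), "min": min(ratios), "max": max(ratios)}


summary = summarize_speedups(SPARK_OVER_IAQ)
print({name: round(value, 1) for name, value in summary.items()})
```

The same helper applies unchanged to the Impala/IAQ and Hive/IAQ columns.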
45. High Fidelity Results
Total events generated by internal IP addresses
- Similarity 99.48%
- Results Nearly Indistinguishable
- Disk Usage:
  - Raw: 11 TB
  - Impala/Hive: 1.2 TB
  - IAQ: 0.099 TB
- Query Speeds:
  - IAQ: 5.58 s
  - Impala: 5,064.39 s
  - Hive: 9,116.59 s
- Query:
select count(*) thecnt, tm_day from t_iee
where issrcinternal = 1 group by tm_day;
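To make the comparison concrete, the query above can be run exactly and approximately on a small synthetic table, and the two results scored against each other. This is a sketch under loud assumptions. The real `t_iee` schema and data are not shown in the deck, so both are invented here. The approximation is a plain every-Nth-row sample scaled up, which is a stand-in and not how IAQ computes its answers. The deck does not define its "Similarity" metric, so `similarity` below is one assumed definition.

```python
import random
import sqlite3

# The query from the slide, unchanged apart from whitespace.
EXACT_SQL = """
select count(*) thecnt, tm_day from t_iee
where issrcinternal = 1 group by tm_day
"""

# Stand-in approximation: count every 10th row, then scale the count by 10.
SAMPLE_STEP = 10
SAMPLED_SQL = f"""
select count(*) * {SAMPLE_STEP} thecnt, tm_day from t_iee
where issrcinternal = 1 and rowid % {SAMPLE_STEP} = 0 group by tm_day
"""


def build_table(conn, n_rows, seed=0):
    """Fill t_iee with synthetic events (assumed two-column schema)."""
    rng = random.Random(seed)
    conn.execute("create table t_iee (tm_day integer, issrcinternal integer)")
    rows = [(rng.randint(1, 7), int(rng.random() < 0.6)) for _ in range(n_rows)]
    conn.executemany("insert into t_iee values (?, ?)", rows)


def counts_by_day(conn, sql):
    """Run a (thecnt, tm_day) query and return {tm_day: thecnt}."""
    return {day: cnt for cnt, day in conn.execute(sql)}


def similarity(exact, approx):
    """Assumed metric: 1 minus total absolute error over the total exact count."""
    error = sum(abs(exact[d] - approx.get(d, 0)) for d in exact)
    return 1.0 - error / sum(exact.values())


conn = sqlite3.connect(":memory:")
build_table(conn, n_rows=100_000)
exact = counts_by_day(conn, EXACT_SQL)
approx = counts_by_day(conn, SAMPLED_SQL)
print(f"similarity = {similarity(exact, approx):.2%}")
```

The sampled query reads about a tenth of the matching rows. A smaller sample would trade more similarity for less work, which is the same trade the slide reports at much larger scale.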
46. Timeline
V0.7
Introduce More Beta Customers
Enhanced Functionality / Optimization Coverage
o Basic Join Functionality
o Improved Quality (Strong Zero Correlations, Better Domain End Point Modeling)
o IAQ Agent – Global dictionary maintenance
V0.8
Introduce More Beta Customers
Enhanced Functionality / Optimization Coverage
o Improved Aggregation
o Improved Quality (e.g. Count Distinct)
o Enhanced Memory Handling
V1.0
GA
Documentation complete
Issues found in Beta addressed
V1.1
Enhanced Functionality / Optimization Coverage
o Greater “where” coverage
  o E.g. enhance “or”
o Basic expressions
o More join coverage
o More skins for Ingest Agent
September October November December … March