Tired of watching the loading spinner of doom while trying to analyze your big data in Tableau? Learn how Jethro accelerates your database so you can interactively analyze big data in Tableau and get the insights you need without losing your train of thought. Jethro keeps you flexible: there is no need to partition the data in order to speed up queries. This presentation explains why indexing is a superior architecture to MPP for the BI use case on big data.
Jethro for tableau webinar (11 15)
1.
2. Webinar Topics
• Who is Jethro?
• Tableau & Big Data: Extract vs. Live Connect
• Big Data Platforms: Hadoop vs. EDW Appliances
• Two DB architectures: Full-scan vs. Index Access
• Live Demo: Tableau over Impala / Redshift / Jethro
• What is Jethro for Tableau and how it accelerates Tableau’s performance
• Q&A
3. About Us
• What does Jethro do?
– SQL engine optimized for accelerating
BI on big data
• How does it work?
– Combines Columnar SQL DB design
with full-indexing technology
• Where is it?
– In dev since 2012; GA: mid 2015
– Download & free eval
• When to use it?
– BI Spinner Syndrome (BSS)
• Partnerships
– BI and Hadoop vendors
• Speaker
– Eli Singer, CEO JethroData
– esinger@jethrodata.com
– 917.509.6111
• Experience
– Long-time DBA
– Over 20 years of leading Tech startups
• Where to find us
– Jethrodata.com
– @JethroData
4. Tableau and Big Data: Extract (In-Mem)
[Diagram: Tableau, Extract, EDW / Hadoop]
• Typical Tableau usage is based
on extracting selective data from
remote sources
• Extracted data is then
dynamically loaded into Tableau
memory for interactive analysis
• Limitations: Performance
degradation and scale (typically
~200M rows)
5. Tableau and Big Data: Live Connect (In-DB)
[Diagram: Tableau, EDW / Hadoop]
• Tableau issues SQL queries
to the target DB for every
user interaction
• DB retrieves requested
data and returns to Tableau
• Limitation: DB
performance is significantly
slower than in-mem speed
[Diagram label: Live Connect]
6. Big Data Platforms: Hadoop Vs. EDW Appliances
10x-100x Data
1/10 HW $cost
Open Platform
Analytics: ETL, Predictive, Reporting, BI
SQL enables the change of data platform while keeping the analytic apps intact
7. The Hadoop Trade-Off: Scale & Cost Vs. Performance
SQL-on-Hadoop
ETL Predictive Reporting
BI
Too SLOW in Hadoop
It’s unrealistic to expect the same performance when data is much
larger, and highly optimized hardware is replaced with commodity boxes.
8. SQL-on-Hadoop – MPP / Full Scan Architecture
Architecture:
MPP / Full-Scan (All SQL-on-Hadoop)
Query:
List books by author “Stephen King”
Process:
Each librarian is assigned a rack, they
then pull each book, check if author is
“Stephen King”, if so, get book title
Result:
Too slow, costly, unscalable.
Unsuitable for BI
A Library Analogy:
Billions of books, Thousands of racks
9. SQL-on-Hadoop – Index-Access Architecture
Architecture:
Index Access (Only Jethro)
Query:
List books by author “Stephen King”
Process:
Access Author index, entry of
“Stephen King”, get list of books, fetch
only these books
Result:
Fast, minimal resources, scalable
Optimal for BI
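The library analogy from the two slides above can be sketched in a few lines of Python. This is only an illustration with made-up data and hypothetical function names, not Jethro's implementation: it counts how many "books" each approach has to touch to answer the same query.

```python
from collections import defaultdict

# Toy catalog of (title, author) rows. The data is invented for illustration.
books = [
    ("Carrie", "Stephen King"),
    ("Emma", "Jane Austen"),
    ("It", "Stephen King"),
    ("Dune", "Frank Herbert"),
    ("Persuasion", "Jane Austen"),
    ("Misery", "Stephen King"),
]

def full_scan(rows, author):
    """Check every row, as each librarian does with every book on the rack."""
    touched = 0
    titles = []
    for title, row_author in rows:
        touched += 1
        if row_author == author:
            titles.append(title)
    return titles, touched

def build_author_index(rows):
    """Map each author to the list of row numbers holding their books."""
    index = defaultdict(list)
    for row_id, (_, author) in enumerate(rows):
        index[author].append(row_id)
    return index

def index_access(rows, index, author):
    """Look up the author's index entry, then fetch only the listed rows."""
    row_ids = index.get(author, [])
    titles = [rows[i][0] for i in row_ids]
    return titles, len(row_ids)

author_index = build_author_index(books)
scan_titles, scan_touched = full_scan(books, "Stephen King")
idx_titles, idx_touched = index_access(books, author_index, "Stephen King")
print(scan_titles, scan_touched)
print(idx_titles, idx_touched)
```

Both calls return the same titles, but the scan touches every row while the index lookup touches only the matching ones. The gap grows with the size of the catalog.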
10. SQL on Hadoop – Competitive Landscape
• Hive
• Impala
• Presto
• SparkSQL
• Drill
• Pivotal/HAWQ
• IBM/Big SQL
• Actian
• Teradata/SQL-H
• …
• Jethro
Full-Scan Based Solutions
Reads all rows. Every Time.
Index Based Solution
Reads ONLY needed rows.
Use-Case Comparison:
Full-Scan: Optimal for Predictive, reporting
Index: Optimal for Interactive BI
11. LIVE Benchmark: BI on Hadoop (and Redshift)
Hardware – AWS
• Hadoop: CDH 5.4
• 6 nodes: m1.xlarge, r3.xlarge
• Jethro: r3.8xlarge
• Point browser at: tableau.jethrodata.com
– UID/PWD: demo / demo
• Choose workbook: “Jethro”, “Impala”, “Redshift”
• BI Dashboard: choose year, category or any other filter to drill-down
• Data
– Based on TPC-DS benchmark
– 1TB raw data (400GB fact)
– Fact table: ~2.9B rows
– Dimensions: 7
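Hardware and data format by engine:
• Jethro
– Data format: Jethro indexes (250GB)
– Hadoop cluster: 3x m1.xlarge
– Compute cluster: 2x r3.4xlarge (spot)
– Total RAM, CPU: 289GB, 44 cores
– AWS $ per hr.: $0.80
• Impala
– Data format: Parquet (160GB)
– Cluster: 8x r3.2xlarge, 1x r3.xlarge
– Total RAM, CPU: 510GB, 68 cores
– AWS $ per hr.: $5.95
• Redshift
– Data format: Redshift (229GB)
– Cluster: 8x dc1.large
– Total RAM, CPU: 120GB, 16 cores
– AWS $ per hr.: $2.00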
12. What Is Jethro for Tableau?
Tableau
EDW / Hadoop / Cloud / Local FS / NAS
Extract
• An indexing & caching server
• Relevant data is extracted from EDW
/ Hadoop into Jethro. No size
limitation
• Jethro then fully indexes the data
(every column!)
• Jethro’s column and index files are
stored back in Hadoop (or other
storage system)
• Tableau uses Live Connect to send
Jethro SQL queries (ODBC)
• Jethro uses indexes to speed up
queries and return results to Tableau
[Diagram steps: 1. Extract, 2. Store, 3. Live Connect]
13. Selecting Data for Jethro Acceleration
• Select only Tableau “worthy” datasets
– Not ALL data in Hadoop should have Jethro
• Use any ETL tool to extract from source
– Jethro receives data in a CSV/delimited format
– Extracted data can be temporarily stored in a file or
“piped” live to Jethro
• After initial creation, incremental loads are supported
– As frequently as every few min
• Jethro stores its version of the dataset back in HDFS
– Can also use local filesystem, network storage or cloud storage
• Load is fast
– ~1B rows/hour
– Data is highly compressed: 1TB -> 400GB data + indexes
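For planning purposes, the load rate and compression figures above can be turned into a rough estimate. The sketch below assumes both figures scale linearly, which the slide does not claim; the function name and the example dataset are hypothetical.

```python
# Figures quoted on the slide; treat them as rough, vendor-stated rates.
ROWS_PER_HOUR = 1_000_000_000   # "~1B rows/hour"
STORED_RATIO = 400 / 1000       # "1TB -> 400GB data + indexes"

def estimate_load(rows, raw_gb):
    """Return (hours to load, GB stored), assuming linear scaling."""
    hours = rows / ROWS_PER_HOUR
    stored_gb = raw_gb * STORED_RATIO
    return hours, stored_gb

# Hypothetical dataset: 5 billion rows, 2 TB of raw delimited data.
hours, stored_gb = estimate_load(rows=5_000_000_000, raw_gb=2000)
print(f"{hours:.1f} hours, {stored_gb:.0f} GB")
```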
EDW / Hadoop
Extract
14. Index-Access – How it works
[Diagram: Tableau sends SQL to Jethro query nodes, which read from storage on the Hadoop data nodes]
Tableau query: SELECT day, sum(sales) FROM t1 WHERE prod='abc' GROUP BY day
1. Index access
2. Read data only for required rows
Performance and resources are based on the size of the working set
Storage
- HDFS
- Cloud (S3, EFS)
- NAS/SAN
- Local FS
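The two steps on this slide can be sketched against the example query. The code below is a toy model with invented data, assuming a column-oriented table held in Python lists and a simple inverted list on `prod`; it is not how Jethro stores or reads data, only an illustration of "index access, then read only the required rows".

```python
from collections import defaultdict

# Column-oriented toy table t1; values are invented for illustration.
t1 = {
    "day":   ["Mon", "Mon", "Tue", "Tue", "Wed", "Wed"],
    "prod":  ["abc", "xyz", "abc", "abc", "xyz", "abc"],
    "sales": [10, 99, 20, 5, 42, 7],
}

def build_index(column):
    """Inverted list: column value -> row numbers where it occurs."""
    index = defaultdict(list)
    for row_id, value in enumerate(column):
        index[value].append(row_id)
    return index

prod_index = build_index(t1["prod"])

def sales_by_day(prod):
    """SELECT day, sum(sales) FROM t1 WHERE prod=? GROUP BY day."""
    # Step 1: index access, find the rows where prod matches.
    row_ids = prod_index.get(prod, [])
    # Step 2: read 'day' and 'sales' only for those rows, then aggregate.
    totals = defaultdict(int)
    for i in row_ids:
        totals[t1["day"][i]] += t1["sales"][i]
    return dict(totals), len(row_ids)

result, rows_read = sales_by_day("abc")
print(result, rows_read)
```

The work done depends on the number of matching rows (the working set), not on the size of the table.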
15. Jethro Indexes – Superior Technology
Patent Pending: http://www.google.com/patents/WO2013001535A3?cl=en
• Complete
– Every column is indexed
• Simple
– Inverted-list indexes map each column value to a list of rows
• Fast to read
– Direct access to a value entry
– No need to scan entire index, or load index to memory
• Scalable
– Distributed, highly hierarchical compressed bitmaps
– Appendable index structure for fast incremental loads
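To make the bitmap and append ideas concrete, here is a minimal sketch that uses plain Python integers as uncompressed bitmaps. The class and function names are hypothetical, and the real index format (hierarchical, compressed, distributed) is not described on the slide, so this shows only the general technique.

```python
class BitmapIndex:
    """One bitmap per distinct value; bit i is set when row i holds the value."""

    def __init__(self):
        self.bitmaps = {}
        self.row_count = 0

    def append(self, values):
        """Index newly loaded rows without rebuilding existing entries."""
        for value in values:
            self.bitmaps[value] = self.bitmaps.get(value, 0) | (1 << self.row_count)
            self.row_count += 1

    def lookup(self, value):
        """Direct access to one value's bitmap; 0 means no matching rows."""
        return self.bitmaps.get(value, 0)

def rows_of(bitmap):
    """Expand a bitmap into a sorted list of row numbers."""
    rows, i = [], 0
    while bitmap:
        if bitmap & 1:
            rows.append(i)
        bitmap >>= 1
        i += 1
    return rows

year_idx, category_idx = BitmapIndex(), BitmapIndex()
year_idx.append([2014, 2015, 2015, 2014])
category_idx.append(["Books", "Books", "Music", "Music"])

# Two filters combine with a bitwise AND of their bitmaps.
before = rows_of(year_idx.lookup(2015) & category_idx.lookup("Books"))

# Incremental load: append two more rows to each index.
year_idx.append([2015, 2015])
category_idx.append(["Books", "Music"])
after = rows_of(year_idx.lookup(2015) & category_idx.lookup("Books"))
print(before, after)
```

Because every column has its own index, adding a filter on any column is just one more AND, which is why filtering by arbitrary columns needs no special tuning in this model.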
16. Adaptive Optimization: Active Cache of Query Results
• Reuse of intermediate/final query results
– Repeat queries return immediately
• Addresses wide top-of-the-funnel queries
– Exploration starts with queries with no/few
filters
– Those queries are likely to be repeated in
dashboard scenarios
• Transparently adapts to incremental loads
– Execution on delta data + merge saved results
[Charts: query speed (slow to fast) vs. query selectivity (few to more) for three cases: Index Performance, Cache Performance, Index + Cache]
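The "execution on delta data + merge saved results" idea can be sketched as follows. This is a simplified model with a hypothetical class name that caches a single aggregate over an append-only table; the slide does not describe how Jethro's cache is actually keyed or stored.

```python
from collections import Counter

class CachedSum:
    """Caches SUM(sales) GROUP BY day and tracks how many rows it covers."""

    def __init__(self):
        self.rows = []           # (day, sales) tuples, append-only
        self.cached = Counter()  # saved aggregate result
        self.covered = 0         # number of rows reflected in the cache
        self.rows_scanned = 0    # bookkeeping for the demo

    def load(self, new_rows):
        """Incremental load: new rows are appended, the cache is left alone."""
        self.rows.extend(new_rows)

    def query(self):
        """Aggregate only the rows added since the last run, then merge."""
        delta = self.rows[self.covered:]
        self.rows_scanned += len(delta)
        for day, sales in delta:
            self.cached[day] += sales
        self.covered = len(self.rows)
        return dict(self.cached)

table = CachedSum()
table.load([("Mon", 10), ("Tue", 20), ("Mon", 5)])
first = table.query()        # scans 3 rows
repeat = table.query()       # repeat query: scans nothing
scanned_after_repeat = table.rows_scanned
table.load([("Tue", 1)])
merged = table.query()       # scans only the 1 new row
print(first, repeat, merged, table.rows_scanned)
```

A sum merges cleanly with saved results; other aggregates (for example distinct counts) would need more state than this sketch keeps.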
17. Summary: Why Is Index Access Optimal for BI?
1. Use of indexes eliminates need to read unnecessary data
2. The deeper you go, the faster it gets: as users drill down and add
more filters, queries run faster
3. Unlimited flexibility: users can aggregate and filter by any columns
they choose with no performance penalty
4. Concurrent users accessing dashboards generate repeatable queries
that result in high cache efficiency
5. Shields BI workload from other analytics overwhelming the cluster
18. Ready to Try Jethro?
1. Register: jethrodata.com/download-jethro-for-tableau
2. Schedule a 45min POC review with Jethro SA (free!)
3. One time setup
- Download and Install Jethro on a server / VM
- Start services, configure instance
4. Extract & Load data
5. Use Tableau
- Install ODBC driver
- Point Tableau data source at Jethro
That’s It!