Presto At Treasure Data - Presto Meetup Tokyo 2017
https://techplay.jp/event/621143
Published in: Technology

1. Presto At Treasure Data
Presto Meetup @ Tokyo - June 15, 2017
Taro L. Saito - GitHub: @xerial
Ph.D., Software Engineer at Treasure Data, Inc.
2. Presto Usage at Treasure Data (2017)
• Processing 15 Trillion Rows / Day (= 173 Million Rows / sec.)
• 150,000~ Queries / Day
• 1,500~ Users
• Hosting Presto as a service for 3 years
3. Configurations
• Hosted on AWS (us-east), AWS Tokyo, IDCF (Japan)
• Multi-Tenancy Clusters
• PlazmaDB
  • Storage: Amazon S3 or RiakCS
  • S3 file indexes: PostgreSQL
  • Storage format: Columnar MessagePack (MPC)
    • MessagePack: self-describing data format
    • Compact: 10x compression ratio over the original input data (JSON)
• 200GB JVM memory per node
  • To support a wide variety of query usage
  • Estimating the required memory in advance is difficult
  • Avoids the WAITING_FOR_MEMORY state that blocks the entire query processing
  • In small-memory configurations, major GCs were quite frequent
4. Challenges
• Major complaint: "Presto is slower than usual"
• Only 20% of the 150,000 daily queries use our scheduling feature
  • However, 85% of queries are actually scheduled by user scripts or third-party tools
• How can we know the expected performance?
  • (Implicit) Service Level Objectives (SLOs)
5. Understanding Implicit SLOs
• We usually looked into slow queries to figure out the performance bottlenecks
  • However, analyzing SQL takes a long time, because we need to understand the meaning of the data
  • Understanding a hundred lines of SQL is painful
• Created Presto query tuning guides
  • Presto Query FAQs: https://docs.treasuredata.com/articles/presto-query-faq
• Expectations of performance
  • Scheduled queries: we can estimate SLOs from historical stats
  • Scheduled, but submitted from third-party tools or user scripts: how do we know the expected performance?
  • We need to internalize customers' knowledge of query performance
6. Our Approach: Data-Driven Improvement
• Bad: collecting stdout/stderr logs of Presto
• Good: collecting logs in a format queryable with Presto
• Collecting query event logs to Treasure Data
  • Presto Event Listener -> fluentd -> Treasure Data (sketched below)
• Treasure Data is schema-less: the schema is automatically generated from the data
  • As we add new fields to the events, the schema evolves automatically
• We have been collecting every single query log since the beginning of the Presto service
• (cycle: Query Logs -> Store -> Analyze SQL -> Improve & Optimize)
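The event pipeline on this slide can be pictured with a small listener. Below is a minimal sketch, assuming the com.facebook.presto.spi.eventlistener API of that era and the fluent-logger-java client; the exact getters, field names, and the Fluentd tag are assumptions, not TD's actual implementation.

```scala
// Sketch: forwarding Presto query-completion events to Fluentd.
// Getter names follow the ~2017 presto-spi event-listener API; verify
// against your Presto version before use.
import com.facebook.presto.spi.eventlistener.{EventListener, QueryCompletedEvent}
import org.fluentd.logger.FluentLogger
import scala.jdk.CollectionConverters._

class TdQueryLogListener extends EventListener {
  private val fluentd = FluentLogger.getLogger("presto")

  override def queryCompleted(event: QueryCompletedEvent): Unit = {
    val record = Map[String, AnyRef](
      "query_id"    -> event.getMetadata.getQueryId,
      "user"        -> event.getContext.getUser,
      "total_rows"  -> Long.box(event.getStatistics.getTotalRows),
      "total_bytes" -> Long.box(event.getStatistics.getTotalBytes),
      "wall_ms"     -> Long.box(event.getStatistics.getWallTime.toMillis)
    )
    fluentd.log("query_completion", record.asJava) // -> fluentd -> Treasure Data
  }
}
```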
7. Query Event Logs
• Query Completion
  • queryId, user id, session parameters, etc.
  • Query stats: running time, total rows, bytes, splits, CPU time, etc.
  • SQL statement
• Split Completion
  • Running time, processed rows, bytes, etc.
  • S3 GET access count, read bytes
• Table Scan
  • Accessed table names, column sets
  • Accessed time ranges (e.g., queries looking at data of the past 1 hour, 7 days, etc.)
  • Filtering conditions (predicates)
8. Clustering Queries with Query Signatures
• Finding implicit SLOs: we need to classify the 85% of externally scheduled queries
• Extracting query signatures
  • Simplify complex SQL expressions into a tiny signature representation
  • Reusing the ANTLR parser of Presto
• Query signature example (a toy version is sketched below):
  • S[Cnt](J(T1,G(S[Cnt](T2))))
  • SELECT count(a), ... FROM T1 JOIN (SELECT count(b), ... FROM T2 GROUP BY x)
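The real extractor walks Presto's ANTLR parse tree; the sketch below illustrates the same idea over a hypothetical, simplified AST (the Plan case classes are stand-ins, not Presto classes) and reproduces the signature from the slide.

```scala
// Toy sketch of query-signature extraction over a simplified AST.
sealed trait Plan
case class Table(name: String)              extends Plan
case class Select(agg: String, child: Plan) extends Plan
case class GroupBy(child: Plan)             extends Plan
case class Join(left: Plan, right: Plan)    extends Plan

def signature(p: Plan): String = p match {
  case Table(name)    => name
  case Select(agg, c) => s"S[$agg](${signature(c)})"
  case GroupBy(c)     => s"G(${signature(c)})"
  case Join(l, r)     => s"J(${signature(l)},${signature(r)})"
}

// SELECT count(a), ... FROM T1 JOIN (SELECT count(b), ... FROM T2 GROUP BY x)
val q = Select("Cnt", Join(Table("T1"), GroupBy(Select("Cnt", Table("T2")))))
assert(signature(q) == "S[Cnt](J(T1,G(S[Cnt](T2))))")
```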
9. Implicit SLOs
• Collect the historical running times of queries that have the same query signature
• Median absolute deviation (MAD): the median of |running time - median|
• CoV: coefficient of variation = MAD / median
  • If CoV > 1, the query running time tends to vary
  • If CoV < 1, the median of the historical running times is useful for estimating the query running time (sketched below)
• SLO violation: the query runs longer than median + MAD
  • The customer feels the query is slower than usual
  • However, the query might be processing much more data than usual, so normalization by the processed data size is also necessary
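A minimal sketch of the MAD/CoV computation described above; the function names and the Option-based result are illustrative.

```scala
// Sketch: estimating an implicit SLO from the historical running times
// of queries sharing a signature.
def median(xs: Seq[Double]): Double = {
  val s = xs.sorted
  val n = s.size
  if (n % 2 == 1) s(n / 2) else (s(n / 2 - 1) + s(n / 2)) / 2.0
}

def implicitSlo(runningTimesSec: Seq[Double]): Option[Double] = {
  val m   = median(runningTimesSec)
  val mad = median(runningTimesSec.map(t => math.abs(t - m))) // median absolute deviation
  val cov = mad / m                                           // coefficient of variation
  if (cov < 1.0) Some(m + mad) // stable history: flag runs longer than median + MAD
  else None                    // running time varies too much to estimate an SLO
}
```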
10. Typical Performance Bottlenecks
• Huge queries
  • Frequent S3 access, wide table scans
• Single-node operators
  • order by, window functions, count(distinct x), processing skewed data, etc.
• Ill-performing worker nodes
  • Heavy load on a single worker node
  • Insufficient pool memory
  • Major/full GCs: we are using min.error-duration = 2m, but GC pauses can be longer
• Too much resource usage
  • A single query occupies the entire cluster (e.g., a query with hundreds of query stages!)
11. Split Resource Manager
• Problem: a single query can occupy the entire cluster resource
  • But Presto has only limited performance controls: CPU time, memory usage, and concurrent query (CQ) limits
  • No throttling or boosting
• Created the Split Resource Manager (sketched below)
  • Limits the max runnable splits for each customer
  • Uses a custom RemoteTask class, which adds a wait when no split slots are available
• => Efficient use of the multi-tenancy clusters
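The custom RemoteTask is TD-internal; the sketch below shows only the throttling idea using a per-customer counting semaphore (class and method names are hypothetical).

```scala
// Minimal sketch of per-customer split throttling; the real version
// lives inside a custom RemoteTask implementation.
import java.util.concurrent.{ConcurrentHashMap, Semaphore}

class SplitResourceManager(maxRunnableSplitsPerCustomer: Int) {
  private val permits = new ConcurrentHashMap[String, Semaphore]()

  private def semaphoreFor(customer: String): Semaphore =
    permits.computeIfAbsent(customer, _ => new Semaphore(maxRunnableSplitsPerCustomer))

  // Blocks (adds a wait) when the customer already runs the max number of splits
  def acquireSplit(customer: String): Unit = semaphoreFor(customer).acquire()
  def releaseSplit(customer: String): Unit = semaphoreFor(customer).release()
}
```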
12. Presto Ops Robot
• Problem: insufficient memory on a worker
  • Queries using that worker node enter the WAITING_FOR_MEMORY state
• Report JMX metrics -> fluentd -> DataDog -> trigger an alert -> Presto Ops Robot
• Presto Ops Robot
  • Sends a graceful shutdown command (POST a SHUTTING_DOWN message to /v1/status, sketched below)
  • or kills memory-consuming queries on the worker node
• Restarting worker JVM processes
  • At least once a week, to avoid issues from running a JVM for a long time
  • Resets any effects caused by unknown bugs
  • Useful for cleaning up untracked memory (e.g., ANTLR objects, etc.)
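A sketch of the robot's shutdown call, following the endpoint named on the slide; note that recent Presto releases expose graceful shutdown as PUT /v1/info/state instead. Host, port, and payload shape are assumptions.

```scala
// Sketch: asking a worker to drain and shut down gracefully.
import java.net.{HttpURLConnection, URL}
import java.nio.charset.StandardCharsets

def gracefulShutdown(workerHost: String): Int = {
  val conn = new URL(s"http://$workerHost:8080/v1/status")
    .openConnection().asInstanceOf[HttpURLConnection]
  conn.setRequestMethod("POST")
  conn.setDoOutput(true)
  conn.setRequestProperty("Content-Type", "application/json")
  conn.getOutputStream.write("\"SHUTTING_DOWN\"".getBytes(StandardCharsets.UTF_8))
  conn.getResponseCode // expect 2xx when the worker starts draining
}
```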
13. S3 Access Performance
• Problem: slow table scans
  • S3 GET requests have a constant latency: 30ms ~ 50ms regardless of the read size (up to an 8KB read)
  • Request retries on 500 (internal error) or 503 (SlowDown) are also necessary
  • Reading the small header parts of S3 objects can take the majority of the query processing time
    • Columnar format: header + column blocks
• IO Manager
  • Needs to send as many S3 GET requests in parallel as possible
  • 1 split = multiple S3 objects
  • Pipelines S3 GET requests and column reads (sketched below)
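A sketch of the pipelining idea: issue many ranged GETs concurrently so the per-request latency overlaps. It assumes AWS SDK v1; the pool size, bucket, and key names are illustrative, and the SDK already retries 500/503 by default.

```scala
// Sketch: overlapping many small S3 GETs to hide 30-50 ms per-request latency.
import com.amazonaws.services.s3.AmazonS3ClientBuilder
import com.amazonaws.services.s3.model.GetObjectRequest
import java.util.concurrent.Executors
import scala.concurrent.{ExecutionContext, Future}

val s3 = AmazonS3ClientBuilder.defaultClient()
// A wide thread pool so GET requests overlap instead of running serially
implicit val ec: ExecutionContext =
  ExecutionContext.fromExecutor(Executors.newFixedThreadPool(64))

def readHeader(bucket: String, key: String): Future[Array[Byte]] = Future {
  val req = new GetObjectRequest(bucket, key).withRange(0, 8191) // 8KB header read
  val in  = s3.getObject(req).getObjectContent
  try in.readAllBytes() finally in.close()
}

// One split covers multiple S3 objects; issue all GETs up front
def readSplit(bucket: String, keys: Seq[String]): Future[Seq[Array[Byte]]] =
  Future.sequence(keys.map(readHeader(bucket, _)))
```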
14. Presto Stella: Plazma Storage Optimizer
• Problem
  • Some queries read 1 million partitions, so the S3 latency overhead is quite high
  • Data from mobile applications often has a wide range of time values
• Presto Stella Connector
  • Uses Presto for optimizing the physical storage partitions
  • Input records: file list on S3
  • Table writer stage: merges fragmented partitions and uploads them to S3
  • Commit: updates the S3 file indexes on PostgreSQL (in an atomic transaction)
• Performance improvement
  • e.g., 10,000 partitions (30 sec.) -> 20 partitions (1.5 sec.): a 20x improvement
• Use cases
  • Maintaining fragmented user-defined partitions
  • 1-hour partitioning -> more flexible time-range partitioning
15. Transitions of Database Usages
16. New Directions Explored By Presto
• Traditional database usage
  • Required a database administrator (DBA)
  • The DBA designs the schema and queries, and tunes query performance
• After Presto
  • The schema is designed by data providers
    • 1st-party data (the user's customer data)
    • 3rd-party data sources
  • Analysts and marketers explore the data with Presto
    • They don't know the schema in advance
    • Convenient, low-latency access is necessary
    • SQL can be inefficient at first; while exploring the data, the SQL gets more sophisticated, but not always
17. Prestobase Proxy: Low-Latency Access to Presto
• Needed a more interactive Presto experience
• Prestobase Proxy: a gateway to the Presto coordinator
  • Talks the Presto protocol (/v1/statement/...)
  • Written in Scala, runs on Docker
  • Based on Finagle (an RPC/HTTP server library written by Twitter)
• Features
  • Works with standard Presto clients (e.g., presto-cli, presto-jdbc, presto-odbc, etc.)
  • Increased connectivity to BI tools: Tableau, Datorama, ChartIO, Looker, etc.
  • Authentication (API key)
  • Rewriting nextUri (internal IP address -> external host name)
  • BI-tool specific query filters
  • etc.
18. Customizing Prestobase Filters
• Prestobase Proxy: a gateway for accessing Presto
• Adding TD-specific bindings
  • Finagle filters -> injecting TD-specific filters (sketched below)
  • Using Airframe, a dependency injection library for Scala
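A minimal sketch of such a TD-specific Finagle filter, here an API-key check; the header name and validation function are assumptions. Filters compose with andThen in front of the proxied Presto service.

```scala
// Sketch: an authentication filter in front of the proxied Presto service.
import com.twitter.finagle.{Service, SimpleFilter}
import com.twitter.finagle.http.{Request, Response, Status}
import com.twitter.util.Future

class ApiKeyFilter(isValid: String => Boolean) extends SimpleFilter[Request, Response] {
  def apply(request: Request, service: Service[Request, Response]): Future[Response] =
    request.headerMap.get("X-TD-ApiKey") match { // header name is illustrative
      case Some(key) if isValid(key) => service(request) // forward to Presto
      case _                         => Future.value(Response(Status.Unauthorized))
    }
}

// Filters compose: new ApiKeyFilter(check) andThen nextUriRewriteFilter andThen prestoService
```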
19. Airframe
• http://wvlet.org/airframe
• Three-step DI in Scala: bind, design, build (example below)
• Built-in life cycle manager
  • Session start/shutdown
  • Examples: opening/closing a Presto connection, shutting down a Presto server, etc.
• Session
  • Manages singletons and binding rules
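A minimal bind/design/build example against the 2017-era Airframe API; the PrestoClient trait is illustrative.

```scala
import wvlet.airframe._

trait PrestoClient { def run(sql: String): String }
class PrestoClientImpl extends PrestoClient {
  def run(sql: String): String = s"running: $sql"
}

trait QueryRunner {
  private val client = bind[PrestoClient] // 1. bind a dependency
  def run(): String  = client.run("SELECT 1")
}

val design = newDesign                    // 2. design: binding rules
  .bind[PrestoClient].to[PrestoClientImpl]

design.build[QueryRunner] { runner =>     // 3. build within a session
  println(runner.run())
}                                         // session shutdown hooks fire here
```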
20. VCR Record/Replay for Testing Presto
• Launching Presto requires a lot of memory (e.g., 2GB or more)
  • Often crashes CI service containers (TravisCI, CircleCI, etc.)
• Recording Presto responses (prestobase-vcr)
  • With sqlite-jdbc: https://github.com/xerial/sqlite-jdbc
  • One DB file for each test suite (sketched below)
• Enabled small-memory-footprint testing
  • Can run many Presto tests in CI
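A sketch of the record/replay idea over sqlite-jdbc: key each recorded Presto response by a hash of the request. The table and column names are illustrative, not prestobase-vcr's actual schema.

```scala
// Sketch: recording and replaying responses from a per-suite SQLite file.
import java.sql.DriverManager

val conn = DriverManager.getConnection("jdbc:sqlite:target/presto-vcr.db")
conn.createStatement().executeUpdate(
  "CREATE TABLE IF NOT EXISTS vcr (request_hash TEXT PRIMARY KEY, response TEXT)")

def record(requestHash: String, response: String): Unit = {
  val ps = conn.prepareStatement("INSERT OR REPLACE INTO vcr VALUES (?, ?)")
  ps.setString(1, requestHash); ps.setString(2, response)
  ps.executeUpdate()
}

def replay(requestHash: String): Option[String] = {
  val ps = conn.prepareStatement("SELECT response FROM vcr WHERE request_hash = ?")
  ps.setString(1, requestHash)
  val rs = ps.executeQuery()
  if (rs.next()) Some(rs.getString(1)) else None // miss -> record a live response
}
```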
21. Optimizing QueryResults Transfer in Prestobase
• Accept: application/x-msgpack (HTTP header)
• Returns Presto query result rows in MessagePack format
  • The QueryResults object contains Array<Array<Object>> => MessagePack (compact binary)
  • Encoding QueryResults objects using MessagePack/Jackson: https://github.com/msgpack/msgpack-java
• The Presto client doesn't need to parse the row part
• 1.5x ~ 2.0x performance improvement for streaming query results
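The slide mentions the MessagePack/Jackson binding; the sketch below shows the equivalent encoding with msgpack-java's core packer, which makes the Array<Array<Object>> row layout explicit.

```scala
// Sketch: encoding result rows as MessagePack for
// Accept: application/x-msgpack responses.
import org.msgpack.core.MessagePack

def encodeRows(rows: Seq[Seq[Any]]): Array[Byte] = {
  val packer = MessagePack.newDefaultBufferPacker()
  packer.packArrayHeader(rows.size)
  for (row <- rows) {
    packer.packArrayHeader(row.size)
    row.foreach {
      case null       => packer.packNil()
      case s: String  => packer.packString(s)
      case i: Int     => packer.packInt(i)
      case l: Long    => packer.packLong(l)
      case d: Double  => packer.packDouble(d)
      case b: Boolean => packer.packBoolean(b)
      case other      => packer.packString(other.toString) // simplistic fallback
    }
  }
  packer.close()
  packer.toByteArray
}
```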
22. Prestobase Modules
• prestobase-proxy: proxy server to access Presto with authentication
• prestobase-agent: agent for running Presto queries and storing their results
• prestobase-vcr: for recording/replaying Presto responses
• prestobase-codec: MessagePack codec of Presto query responses
• prestobase-hq (headquarters): Presto usage analysis pipelines, SLO monitoring, etc.
• prestobase-conductor: multi-Presto-cluster management tool
• td-prestobase: Treasure Data specific bindings of prestobase
  • TD authentication, job logging/monitoring
  • BI-tool specific filters (Tableau, Looker, etc.)
23. Bridging Gaps Between SQL and Programming Languages
• Traditional approach
  • OR-Mapper: app developers design objects and schemas, then generate SQL
• New approach: SQL first
  • Need to manage various SQL results inside a programming language
• prestobase-hq
  • Needs to manage hundreds of SQL queries and their results
  • SLO analysis, query performance analysis, etc.
• But how?
24. sbt-sql: https://github.com/xerial/sbt-sql
• Scala SBT plugin for generating model classes from SQL files
  • src/main/sql/presto/*.sql (Presto queries)
• Using SQL as a function: read Presto SQL results as objects (illustrated below)
• Enabled managing SQL queries in GitHub
• Type-safe data analysis in prestobase-hq
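Roughly how this looks in use; the generated API shown here is an illustrative shape, not sbt-sql's exact output (see the README for the real interface).

```scala
// src/main/sql/presto/queries/SlowQueries.sql -- a plain Presto query file:
//   SELECT query_id, user, wall_time_sec
//   FROM query_completion WHERE wall_time_sec > 60
//
// sbt-sql generates a model class per .sql file, so results can be read
// as typed objects (hypothetical generated API):
val slow: Seq[queries.SlowQueries] = queries.SlowQueries.selectAll()
for (q <- slow) {
  println(s"${q.query_id} by ${q.user}: ${q.wall_time_sec} sec")
}
```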
25. Big Challenge: Splitting Huge Queries
• Table scan log analysis revealed that most customers are scanning the same data over and over
  • Optimizing SQL is not their major concern; analyzing the data has higher priority
• Splitting a huge query into scheduled hourly/daily jobs
  • digdag: an open-source workflow engine - http://digdag.io
  • YAML-based task definitions
  • Schedules and runs Presto queries; easy to use
26. Time Range Primitives
• TD_TIME_RANGE(time, '2017-06-15', '2017-06-16', 'PDT')
  • The most frequently used UDF, but inconvenient
• Use short descriptions of relative time ranges (expansion sketched below)
  • 1d (1 day), 7d (7 days), 1h (1 hour), 1w (1 week), 1M (1 month)
  • today, yesterday, lastWeek, thisWeek, etc.
• Recent data access
  • 1dU (1 day until now) => TD_TIME_RANGE(time, '2017-06-15', null, 'JST') - an open range
• Splitting ranges
  • 1w.splitIntoDays
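A sketch of how the proposed short ranges could expand into TD_TIME_RANGE calls; the parsing, formatting, and timezone handling here are hypothetical, not the deck's actual design.

```scala
// Sketch: expanding relative ranges (1h, 1d, 7d, 1w) into TD_TIME_RANGE SQL.
import java.time.{Duration, ZoneId, ZonedDateTime}
import java.time.format.DateTimeFormatter

def timeRangeSql(range: String, tz: String = "JST"): String = {
  val now = ZonedDateTime.now(ZoneId.of("Asia/Tokyo"))
  val d = range match {
    case "1h"        => Duration.ofHours(1)
    case "1d"        => Duration.ofDays(1)
    case "7d" | "1w" => Duration.ofDays(7)
    case other       => sys.error(s"unsupported range: $other")
  }
  val fmt   = DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss")
  val start = now.minus(d).format(fmt)
  // A "1dU" (until now) variant would leave the end open:
  //   TD_TIME_RANGE(time, '<start>', null, '<tz>')
  s"TD_TIME_RANGE(time, '$start', '${now.format(fmt)}', '$tz')"
}
```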
27. MessageFrame (In Design)
• Next-generation tabular data format
• Hybrid layout
  • Row-oriented: for streaming; quick writes
  • Column-oriented: better compression & fast reads
• Specification layers
  • Layer-0 (basic specs: keep it simple, stupid)
    • Data type: MessagePack
    • Compression codecs: raw, delta, gzip, (snappy, zstd?, etc.)
    • Column metadata: min/max/sum values of columns
  • Layer-1 (advanced compression)
  • Layer-N should be convertible to Layer-0
28. Summary
• Managing implicit SLOs
  • Data-oriented approach: Presto -> Fluentd -> Treasure Data -> Presto
  • SQL clustering -> find a bottleneck -> optimize it!
• Optimization approaches
  • Split usage control, Presto Ops Robot, Stella partition optimizer
  • Low-latency access with Prestobase
  • Workflows
• Ongoing work
  • Physical storage optimization (Stella)
  • Huge query optimization
  • Incremental processing support
  • digdag workflows
  • MessageFrame
• https://www.treasuredata.com/company/careers/
29. T R E A S U R E D A T A