17. Data Analytics Platform
• Data collection, storage: Ruby (OSS), Java/JRuby (OSS)
• Console & API endpoints: Ruby (RoR)
• Schema management: Ruby/Java (MessagePack)
• Processing (batch, query, ...): Java (Hadoop, Presto)
• Queuing & Scheduling: Ruby (OSS)
• Data connector/exporter: Java, Java/JRuby (OSS)
18. Treasure Data Architecture: Overview
[Architecture diagram. Labels: Console, API, EventCollector, PlazmaDB, Worker, Scheduler, Hadoop Cluster, Presto Cluster, DataConnector, USERS, TD SDKs, SERVERS, CUSTOMER's SYSTEMS]
19. OSS products
• To make logging easier and simpler than ever!
• Plugin system
• Open development
• For various environments and usages
• Fluentd, Fluent-Bit, Embulk
• Fluent-Bit: Data collector for Embedded Linux
http://fluentbit.io/
22. Bulk Data Loader
High Throughput & Reliability
Embulk
Written in Java/JRuby
http://www.slideshare.net/frsyuki/embuk-making-data-integration-works-relaxed
http://www.embulk.org/
25. Console/API
• RoR + AWS RDS + AngularJS
• on EC2 (API) and Heroku (Console)
• Operation, Configuration & Managing Data
27. Collecting Data
• Import over Console/API
• From browsers and CLI (TD toolbelt)
• Treasure Agent (rpm/deb)
• Fluentd packaged by Treasure Data
• Post from JavaScript/iOS/Android SDK
• To EventCollector (HTTP endpoint for SDKs, implemented with Fluentd)
29. DataConnector
• Bulk data loader for various data sources
• Loads customers' data into Treasure Data
• S3, Redshift, MySQL, PostgreSQL, Salesforce, ...
• Hosted Embulk
• Requires substantial computing resources
• Distributed execution on Hadoop MapReduce
31. Hadoop, Presto clusters
• Several Hadoop/Presto clusters
• We run the OSS products themselves, not customized versions
• with minimal patches for storage I/O
33. Queue/Worker, Scheduler
• Treasure Data: multi-tenant data analytics service
• executes many jobs in shared clusters (queries, imports, ...)
• CORE: queues-workers & schedulers
• The clusters have their own queues/schedulers, but that is not enough
• resource limits for each price plan
• priority queues for job types
• and many others
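The two requirements named above (per-plan resource limits and priority queues by job type) can be illustrated with a minimal sketch. This is not Treasure Data's implementation: the plan names, limits, priorities, and the `Dispatcher` class are all hypothetical, and a real service would keep this state in a shared store rather than in process memory.

```python
import heapq
import itertools

# Hypothetical values, for illustration only.
PLAN_LIMITS = {"free": 1, "premium": 3}    # max concurrent jobs per account
JOB_PRIORITY = {"query": 0, "import": 1}   # lower number is dispatched first


class Dispatcher:
    """Priority queue of jobs with a per-account concurrency limit."""

    def __init__(self):
        self._heap = []
        self._seq = itertools.count()  # tie-breaker keeps FIFO order per priority
        self.running = {}              # account -> number of running jobs

    def submit(self, account, plan, job_type):
        item = (JOB_PRIORITY[job_type], next(self._seq), account, plan, job_type)
        heapq.heappush(self._heap, item)

    def next_job(self):
        """Pop the highest-priority job whose account is under its plan limit."""
        skipped, picked = [], None
        while self._heap:
            item = heapq.heappop(self._heap)
            _, _, account, plan, job_type = item
            if self.running.get(account, 0) < PLAN_LIMITS[plan]:
                self.running[account] = self.running.get(account, 0) + 1
                picked = (account, job_type)
                break
            skipped.append(item)       # account is at its limit; try the next job
        for item in skipped:           # put blocked jobs back for later
            heapq.heappush(self._heap, item)
        return picked

    def finish(self, account):
        self.running[account] -= 1
```

A job from an account that has reached its limit is skipped rather than blocking the queue, so one tenant cannot starve the others.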
35. PerfectQueue
• Highly available distributed queue using RDBMS
• Written in CRuby
• Enqueue by INSERT INTO
• Dequeue/Commit by UPDATE
• Flexible scheduling rather than scalability
• Using Amazon RDS (MySQL) internally
• + Workers on EC2
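The "enqueue by INSERT, dequeue/commit by UPDATE" idea can be sketched as follows. This is an illustration only, not PerfectQueue's actual schema or API: the table layout, column names, and function names are assumptions, and SQLite stands in for MySQL so the example is self-contained.

```python
import sqlite3
import time


def create_queue(conn):
    # Assumed schema: `timeout` is the epoch second at which a task may be taken.
    conn.execute(
        """CREATE TABLE IF NOT EXISTS tasks (
               id       TEXT PRIMARY KEY,
               payload  TEXT NOT NULL,
               timeout  INTEGER NOT NULL,
               owner    TEXT,
               finished INTEGER NOT NULL DEFAULT 0)"""
    )


def enqueue(conn, task_id, payload, now=None):
    """Enqueue by INSERT; the primary key rejects a duplicate task id."""
    now = int(time.time()) if now is None else now
    try:
        with conn:
            conn.execute(
                "INSERT INTO tasks (id, payload, timeout) VALUES (?, ?, ?)",
                (task_id, payload, now),
            )
        return True
    except sqlite3.IntegrityError:
        return False


def acquire(conn, worker, lease=60, now=None):
    """Dequeue by UPDATE: take a lease by pushing `timeout` into the future."""
    now = int(time.time()) if now is None else now
    with conn:
        row = conn.execute(
            "SELECT id, payload FROM tasks"
            " WHERE finished = 0 AND timeout <= ? ORDER BY timeout, id LIMIT 1",
            (now,),
        ).fetchone()
        if row is None:
            return None
        # The WHERE clause re-checks the condition so only one worker wins.
        cur = conn.execute(
            "UPDATE tasks SET timeout = ?, owner = ?"
            " WHERE id = ? AND finished = 0 AND timeout <= ?",
            (now + lease, worker, row[0], now),
        )
        if cur.rowcount != 1:
            return None
    return row


def finish(conn, task_id, worker):
    """Commit by UPDATE; fails if the lease has passed to another worker."""
    with conn:
        cur = conn.execute(
            "UPDATE tasks SET finished = 1"
            " WHERE id = ? AND owner = ? AND finished = 0",
            (task_id, worker),
        )
    return cur.rowcount == 1
```

If a worker dies before committing, its lease simply expires and another worker can acquire the task, which is one way an RDBMS-backed queue can stay available without a separate coordinator.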
37. PerfectSched
• Highly available distributed scheduler using RDBMS
• Written in CRuby
• At-least-once semantics
• PerfectSched enqueues jobs into PerfectQueue
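One common way to get at-least-once scheduling is to enqueue first and advance the schedule second, using a task id derived from the scheduled time so that a retry after a crash produces the same id. The sketch below assumes that approach; it is not taken from PerfectSched's source, and `Schedule`, `DedupQueue`, and `run_due` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class Schedule:
    name: str
    interval: int    # seconds between runs
    next_time: int   # epoch second of the next due run


class DedupQueue:
    """Stand-in for a queue that rejects an already-enqueued task id."""

    def __init__(self):
        self.tasks = {}

    def enqueue(self, task_id, payload):
        if task_id in self.tasks:
            return False
        self.tasks[task_id] = payload
        return True


def run_due(schedules, queue, now, crash_before_advance=False):
    """Enqueue every due run, then advance the schedule.

    Enqueueing before advancing means a crash in between causes a retry,
    never a lost run. The deterministic task id lets the queue drop the retry.
    """
    for s in schedules:
        while s.next_time <= now:
            task_id = f"{s.name}.{s.next_time}"
            queue.enqueue(task_id, {"schedule": s.name, "time": s.next_time})
            if crash_before_advance:
                return  # simulate a failure before next_time is persisted
            s.next_time += s.interval
```

Running `run_due` again after a simulated crash re-submits the same task id, and the queue ignores it, so each scheduled run ends up in the queue exactly once even though the scheduler itself only promises at least once.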
38. Storage, Schema
• Another core technology for Treasure Data service
• High performance, schema on read, lower cost
• columnar file format
• high throughput & high concurrency
• compression
• Less schema management
• for customers
41. PlazmaDB
• Distributed database using RDBMS & Distributed FS
• metadata on RDBMS, data chunks on DFS
• Amazon RDS (PostgreSQL) + Amazon S3 / Riak CS
• High throughput & high availability by S3
• Columnar format based on MessagePack
• time-based chunking for time series data
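Time-based chunking and a columnar layout can be sketched in a few lines. This is a simplified illustration, not PlazmaDB's actual format: the one-hour chunk width, the function names, and the in-memory dictionaries are assumptions, and the MessagePack serialization and the RDBMS metadata index are left out.

```python
from collections import defaultdict

CHUNK_SECONDS = 3600  # assumed chunk width; the slides do not state one


def chunk_key(ts):
    """Start of the time window that a record's timestamp falls into."""
    return ts - ts % CHUNK_SECONDS


def to_columnar_chunks(records):
    """Group row-oriented records by time window, then pivot each group to columns."""
    grouped = defaultdict(list)
    for rec in records:
        grouped[chunk_key(rec["time"])].append(rec)
    chunks = {}
    for start, rows in grouped.items():
        names = sorted({key for row in rows for key in row})
        # Rows missing a field get None, so no schema is needed at write time.
        chunks[start] = {name: [row.get(name) for row in rows] for name in names}
    return chunks


def scan(chunks, column, t_from, t_to):
    """Read one column for [t_from, t_to), skipping chunks outside the range."""
    out = []
    for start, cols in sorted(chunks.items()):
        if start + CHUNK_SECONDS <= t_from or start >= t_to:
            continue  # the whole chunk is outside the requested range
        times = cols["time"]
        values = cols.get(column) or [None] * len(times)
        out.extend(v for ts, v in zip(times, values) if t_from <= ts < t_to)
    return out
```

A time-range query only touches the chunks whose window overlaps the range and only the columns it asks for, which is the main reason this layout suits time series data.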
42. Monitoring
• Using DataDog for internal operations
• Monitoring for our customers is also required:
• How many records are they importing?
• How many jobs are they executing?
• How many threads/processes is a job consuming?
44. PerfectMonitor
• Is still under construction :P
• Fluentd-based metrics collection
• Detailed metrics for real-time data, summarized metrics for past data
• Real-time metric storage using InfluxDB
• Historic metric storage using Treasure Data
• Real-time data series are disposable :D
• Potential next OSS product from Treasure Data
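The split between detailed real-time metrics and summarized historic metrics amounts to downsampling before long-term storage. A minimal sketch, assuming one-hour summary buckets and a hypothetical `summarize` helper (neither is stated in the slides):

```python
from collections import defaultdict


def summarize(points, bucket_seconds=3600):
    """Collapse (timestamp, value) points into one summary per time bucket."""
    buckets = defaultdict(list)
    for ts, value in points:
        buckets[ts - ts % bucket_seconds].append(value)
    return {
        start: {
            "count": len(values),
            "min": min(values),
            "max": max(values),
            "sum": sum(values),
        }
        for start, values in sorted(buckets.items())
    }
```

Once the summaries are stored, the raw points can be discarded, which matches the remark that the real-time series are disposable.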
45. For Further Improvement
• More performance for more customers
• Dynamic scaling for better performance and lower cost
• New analytics features for a brand-new experience