4. Company Overview
§ Silicon Valley-based Company
• All Founders are Japanese
• Hironobu Yoshikawa
• Kazuki Ohta
• Sadayuki Furuhashi
• About 20 people
• Over 3.5 million jobs
§ OSS Enthusiasts
• MessagePack, Fluentd, etc.
Sunday, July 14, 13
5. Investors
§ Bill Tai
§ Othman Laraki - Former VP Growth at Twitter
§ James Lindenbaum, Adam Wiggins, Orion Henry - Heroku Founders
§ Anand Babu Periasamy, Hitesh Chellani - Gluster Founders
§ Yukihiro “Matz” Matsumoto - Creator of Ruby
§ Dan Scheinman - Director of Arista Networks
§ Jerry Yang - Founder of Yahoo!
7. The Problem with Other Solutions
(Chart: customer value over time, starting from sign-up or PO)
§ On-Premise Solutions
• Obsolescence over time
• Need upgrade
§ AWS (or hosted Hadoops)
• EC2, EMR, RedShift, S3
• Step-by-step manual integrations
• Maintain
• Complex solutions + data collection + NO specialists = TOO LONG to get live
§ Treasure Data
• Fully integrated Big Data full-stack service with simple interface, low-friction initial engagement & continuous technical upgrade
8. Big Data Adoption Stages
(Chart: stages ordered by intelligence sophistication, from Reporting to Analytics)
• Standard Reports: What happened?
• Ad-hoc Reports: Where?
• Drill Down Query: Where exactly?
• Alerts: Error?
• Statistical Analysis: Why?
• Predictive Analysis: What’s a trend?
• Optimization: What’s the best?
9. Big Data Adoption Stages
(Same chart as slide 8, with one region highlighted)
§ Treasure Data’s FOCUS (80% of needs)
10. Full Stack Support for Big Data Reporting
§ Our best-in-class architecture and operations team ensure the integrity and availability of your data.
§ Data from almost any source can be securely and reliably uploaded using td-agent in streaming or batch mode.
§ Our SQL, REST, JDBC, ODBC and command-line interfaces support all major query tools and approaches.
§ You can store gigabytes to petabytes of data efficiently and securely in our cloud-based columnar datastore.
14. A case: “14 Days” from Signup to Success
1. Europe’s largest mobile ad exchange.
2. Serving >20 billion impressions/month for >15,000 mobile apps (Q1 2013)
3. Immediate need for analytics infrastructure: ASAP!
4. With TD, MobFox got into production in only 14 days, with one engineer.
"Time is the most precious asset in our fast-moving business, and Treasure Data saved us a lot of it."
Julian Zehetmayr, CEO & Founder
td-agent = fluentd rpm/deb
15. A case: “Replace” in-house Hadoop with TD
1. Global “Hulu”: online video service with millions of users
2. Video content is distributed in over 150 languages.
3. Had a hard time maintaining their Hadoop cluster
4. With TD, Viki deprecated their in-house Hadoop cluster and freed their engineers for the core business.
(Figure: Before / After)
"Treasure Data has always given us thorough and timely support peppered with insightful tips to make the best use of their service."
Huy Nguyen, Software Engineer
16. A case: Treasure Data with BI Tool (Tableau)
1. World’s largest Android application market
2. Serving >3 billion app downloads for >100 million users
3. Only one engineer managing the data infrastructure
4. With TD, the data engineer can focus on analyzing data with the existing BI tool
"I will recommend Treasure Data to my friends in a heartbeat because it benefits all three stakeholders: Operations, Engineering and Business."
Simon Dong, Principal Architect - Data Engineering
17. - Vision -
Single Analytics Platform for the World
http://www.chisite.org/initiatives/WGII
20. Architecture Breakdown
§ Data Collection
• Increasing variety of data sources
• No single data schema
• Lack of a streaming data collection method
• 60% of Big Data project resources consumed
§ Data Store/Analytics
• Remaining complexity in both traditional DWH and Hadoop (very slow time to market)
• Challenges in scaling data volume and expanding cost
§ Connectivity
• Required to ensure connectivity with existing BI/visualization/apps by JDBC, ODBC and REST
• Output to other services, e.g. S3, RDBMS, etc.
21. Product Philosophy
§ Data first, Schema later
• “Schema-on-Read”
• Both Batch and Query processing
§ Simple APIs
• Easy to use and powerful
§ Easy integration
• Log collection, BI tools, etc.
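The “data first, schema later” idea can be illustrated with a minimal Python sketch. This is not Treasure Data's implementation: the sample records, the `SCHEMA` mapping and the `read_with_schema` helper are hypothetical names chosen for illustration. Raw events are stored untouched, and a schema is applied only when they are read.

```python
import json

# Raw events are stored as-is; no schema is enforced at write time.
raw_log = [
    '{"user": "alice", "amount": "120", "path": "/buy"}',
    '{"user": "bob", "path": "/view"}',                     # no "amount" field
    '{"user": "carol", "amount": "35", "referrer": "ad"}',  # extra field
]

# Hypothetical schema chosen at query time: column name -> type caster.
SCHEMA = {"user": str, "amount": int}


def read_with_schema(lines, schema):
    """Parse each stored line and project it onto the schema at read time."""
    for line in lines:
        record = json.loads(line)
        row = {}
        for column, cast in schema.items():
            value = record.get(column)
            # Missing fields become None instead of failing at write time.
            row[column] = cast(value) if value is not None else None
        yield row


rows = list(read_with_schema(raw_log, SCHEMA))
total = sum(r["amount"] for r in rows if r["amount"] is not None)
print(rows)
print(total)
```

Changing `SCHEMA` later (for example, adding `"referrer": str`) needs no rewrite of the stored lines, which is the trade-off the slide's bullet points describe.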
22. Our technology stack
§ td-agent
• ETL part of Treasure Data
§ Plazma
• Big data processing infrastructure
• Columnar oriented storage
• Reliable data handling
§ Multi-tenant scheduler
• Robust distributed queue and scheduler
23. 1) Data Collection
§ 60% of BI project resources are consumed here
§ Most ‘underestimated’ and ‘unsexy’ but MOST important
§ Fluentd: OSS lightweight but robust Log Collector
• http://fluentd.org/
28. In short
§ Open-source log collector written in Ruby
• Easy to use, reliable and performant
• Like streaming event processing
§ Uses the RubyGems ecosystem for plugins
§ It’s like syslogd, but uses JSON for log messages
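To make the syslogd comparison concrete, here is a minimal Python sketch of a structured log event. A Fluentd event carries a tag, a time and a record; the helper names `make_event` and `to_line`, the sample values and the tab-separated line layout are assumptions of this sketch, not Fluentd's API or wire format.

```python
import json
import time


def make_event(tag, record, timestamp=None):
    """Build a Fluentd-style event: a tag, a Unix time and a JSON record."""
    if timestamp is None:
        timestamp = int(time.time())
    return {"tag": tag, "time": timestamp, "record": record}


def to_line(event):
    """Serialize the event as 'time<TAB>tag<TAB>json', one event per line."""
    return "\t".join([
        str(event["time"]),
        event["tag"],
        json.dumps(event["record"], sort_keys=True),
    ])


event = make_event(
    "apache.access",
    {"host": "127.0.0.1", "method": "GET", "code": 200},
    timestamp=1373760000,
)
line = to_line(event)
print(line)

# Because the record is JSON, a consumer recovers typed fields directly
# instead of parsing free-form text with regular expressions.
parsed = json.loads(line.split("\t", 2)[2])
```

The tag (`apache.access` here) is what a collector would use to route the event to an output.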
34. td-agent
§ Open sourced distribution package of Fluentd
• ETL part of Treasure Data
• rpm, deb and homebrew
§ Including useful components
• ruby, jemalloc, fluentd
• 3rd party gems: td, mongo, webhdfs, etc...
• td plugin is for Treasure Data
§ http://packages.treasure-data.com/
35. 2) Data Store / Analytics
§ Remaining complexity in both DWH and Hadoop
§ Challenges in scaling data volume and expanding cost
§ Plazma: Hadoop ecosystem and our own projects
37. AWS Component Dependencies (1)
§ RDS
• Store user information, job status, etc...
• Store metadata of our columnar database
• Queue worker / Scheduler
§ EC2
• API servers (Ruby on Rails 3)
• Hadoop clusters
• Job workers
• Using Chef to deploy
38. AWS Component Dependencies (2)
§ ELB
• Load balancing of API servers
• Load balancing of td-agents
§ S3
• Columnar storage built on top of S3
• MessagePack columnar format
• Realtime / Archive storage
• Our Result feature supports S3 output.
No EBS, EMR, SQS or other products!
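The columnar idea behind the S3 storage bullets can be sketched in a few lines of Python. This is an illustration only, not the Plazma format: `to_columns` and `serialize_columns` are hypothetical helpers, and `json` stands in for MessagePack so that the example needs only the standard library.

```python
import json

rows = [
    {"time": 1, "user": "alice", "code": 200},
    {"time": 2, "user": "bob", "code": 404},
    {"time": 3, "user": "carol", "code": 200},
]


def to_columns(rows):
    """Pivot row-oriented records into one list of values per column."""
    columns = {}
    for index, row in enumerate(rows):
        for name, value in row.items():
            # A column first seen at this row is padded with None for earlier rows.
            columns.setdefault(name, [None] * index).append(value)
        for values in columns.values():
            # Columns absent from this row get a None slot to stay aligned.
            if len(values) <= index:
                values.append(None)
    return columns


def serialize_columns(columns):
    """Serialize each column separately so a reader can fetch only what it needs."""
    return {name: json.dumps(values).encode("utf-8")
            for name, values in columns.items()}


chunks = serialize_columns(to_columns(rows))

# A query touching only "code" decodes a single chunk and skips the others.
codes = json.loads(chunks["code"])
ok_count = sum(1 for c in codes if c == 200)
print(ok_count)
```

Storing each column as its own chunk is what lets a query read a fraction of the data when it references only a few columns.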
39. Treasure Data Service Processing Flow
(Diagram: Frontend, Queue, Worker and Hadoop nodes send metrics through Fluentd)
§ Applications push metrics to Fluentd (via local Fluentd)
§ Fluentd sums up data per minute (partial aggregation)
§ Librato Metrics for realtime analysis
§ Treasure Data for historical analysis
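Partial aggregation, as named on this slide, can be sketched as follows. This is a hedged illustration rather than the actual Fluentd plugin logic: the `(timestamp, name, value)` event shape and the `partial_aggregate` and `merge` helpers are assumptions of the sketch. Each collector sums values into per-minute buckets, and the partial sums are merged downstream.

```python
from collections import defaultdict


def partial_aggregate(events):
    """Sum metric values per (minute, metric name) bucket on one collector."""
    buckets = defaultdict(float)
    for timestamp, name, value in events:
        minute = timestamp - (timestamp % 60)  # floor to the start of the minute
        buckets[(minute, name)] += value
    return dict(buckets)


def merge(partials):
    """Combine partial sums from several collectors into final totals."""
    totals = defaultdict(float)
    for partial in partials:
        for key, value in partial.items():
            totals[key] += value
    return dict(totals)


node_a = partial_aggregate([
    (120, "requests", 1), (125, "requests", 1), (185, "requests", 1),
])
node_b = partial_aggregate([
    (130, "requests", 2), (190, "requests", 4),
])
totals = merge([node_a, node_b])
print(totals)
```

Because addition is associative, summing per collector first and merging later gives the same totals as summing every raw event centrally, while sending far fewer records downstream.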
48. Multi-Tenancy
§ All customers share the Hadoop clusters (Multi Data Centers)
§ Resource Sharing (Burst Cores), Rapid Improvement, Ease of Upgrade
§ Global Scheduler
• Job Submission + Plan Change
• On-Demand Resource Allocation
§ Datacenters A, B, C and D
• Local FairScheduler in each datacenter
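Resource sharing among tenants can be illustrated with a max-min fair allocation, sketched below in Python. This is a generic textbook technique, not Hadoop's FairScheduler code or Treasure Data's global scheduler; the `fair_share` function, the tenant names and the slot counts are hypothetical.

```python
def fair_share(capacity, demands):
    """Max-min fair allocation of `capacity` slots across tenant demands."""
    allocation = {tenant: 0 for tenant in demands}
    remaining = capacity
    unsatisfied = {t for t, d in demands.items() if d > 0}
    while remaining > 1e-9 and unsatisfied:
        # Split what is left equally among tenants that still want more.
        share = remaining / len(unsatisfied)
        for tenant in sorted(unsatisfied):
            grant = min(share, demands[tenant] - allocation[tenant])
            allocation[tenant] += grant
            remaining -= grant
        # Tenants whose demand is met drop out; their unused share is
        # redistributed to the others in the next round.
        unsatisfied = {t for t in unsatisfied
                       if demands[t] - allocation[t] > 1e-9}
    return allocation


# 12 slots shared by three tenants with different demands.
allocation = fair_share(12, {"a": 2, "b": 5, "c": 10})
print(allocation)
```

A tenant with small demand never blocks capacity it does not use, so a busy tenant can burst beyond an equal split whenever others are idle.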
49. Trial and error on Cloud
§ Rapid development
• Change hardware
• New architecture testing
• Performance testing
• Change software
• Hadoop parameters
• etc...
§ Use git and chef for these purposes
• Easy to deploy and apply changes
• git for change history
50. Our Operation Stack: Full Use of SaaS
§ Services
• CopperEgg
• Librato Metrics
• Logentries
• NewRelic
• PagerDuty
• Desk.com
• Olark
• HipChat
• Alerting
§ Tools
• Hosted Chef (Opscode)
• Jenkins
• including integration test
54. 3) Connectivity
§ Need to visualize the query result
§ Use metrics / graph for interactive comparison
§ Result: export results and use existing tools
56. Pull and Push approaches
§ Treasure Data
• Columnar Storage
• Query Processing Cluster
• Query API
§ Query (Pull)
• BI apps
• REST API
• JDBC, ODBC Driver
• td-command
§ Result (Push)
• Web App
• MySQL
• S3
• …
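The difference between the two approaches can be sketched in Python. Everything here is a stand-in: `run_query` replaces a real query job, `push_result` and the `path_counts` table are hypothetical, and an in-memory SQLite database takes the place of MySQL so that the example runs with the standard library alone.

```python
import sqlite3


def run_query(events):
    """Count events per path (a stand-in for a real query job)."""
    counts = {}
    for event in events:
        counts[event["path"]] = counts.get(event["path"], 0) + 1
    return sorted(counts.items())


def push_result(rows, connection):
    """Write query results into a table that an existing app already reads."""
    connection.execute(
        "CREATE TABLE IF NOT EXISTS path_counts "
        "(path TEXT PRIMARY KEY, hits INTEGER)"
    )
    connection.executemany(
        "INSERT OR REPLACE INTO path_counts VALUES (?, ?)", rows
    )
    connection.commit()


events = [{"path": "/buy"}, {"path": "/view"}, {"path": "/view"}]

# Pull: the client issues the query and receives the rows itself.
pulled = run_query(events)

# Push: the job writes its result into an external store.
db = sqlite3.connect(":memory:")  # SQLite stands in for MySQL here
push_result(run_query(events), db)
pushed = db.execute(
    "SELECT path, hits FROM path_counts ORDER BY path"
).fetchall()
print(pulled, pushed)
```

With push, the web app or BI tool keeps querying the database it already knows, and never has to talk to the analytics backend directly.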
57. Support list
§ Result
• Treasure Data
• MySQL
• PostgreSQL
• Google SpreadSheet
• REST API
• S3
• etc...
§ BI tool
• Pentaho
• Tableau
• JasperSoft
• Indicee
• Dr. Sum
• Metric Insight
• etc...
http://docs.treasure-data.com/categories/3rd-party-tools-overview
http://docs.treasure-data.com/categories/result
58. Conclusion
§ Treasure Data
• Cloud-based Big Data analytics platform
• Provides a machete for Big Data reporting
§ Big Data processing
• Collect / Store / Analytics / Visualization
§ Consider trade-offs
• Cloud reinforces the idea but is not a differentiator
• What is the strong point?
• Should focus on own vision!
Our focus!
59. Big Data for the Rest of Us
www.treasure-data.com | @TreasureData