6. Main Topics around "Data"
• Data collection
• Storage
• Data processing
• Batch distributed processing
• Stream processing
• Machine Learning
• Near real-time query & Data lake
• Visualization
10. Using Services or Not
• Using fully managed services: a bit more cost, far less effort
• Google BigQuery & Dataflow
• Treasure Data services
• Using self-managed services: less cost, less effort
• Amazon EMR & Redshift
• Google Cloud Dataproc
• Using your own environment & cluster: fully under your own control, far more effort
11. Using Services or Not: "Use Services!"
• To concentrate on DATA and Analytics, NOT tools
12. Why should we use services?
• About distributed systems:
• hard to operate & upgrade
• impossible to "start small"
• very hard to hire experienced engineers
• Data-Driven Development:
• collect/store data first!
• consider the output data second!
• "before building your own environment"
13. Really? Are you a TD guy?
• ...Really!
• But it takes a very long discussion :P
• "Data processing infrastructure for startups: build it, or use it?" (article in Japanese)
http://tsuchinoko.dmmlabs.com/?p=1770
14. How to choose software/services in Data-Driven Development
15. "What" decides "How"
• Distributed systems exist to solve problems
• There are many kinds of data
• There are many problems
• Each system solves a different problem
• There is no "silver bullet"!
16. What First, How Second
• What do you want to do?
• Reporting? Analytics? Recommendation? or ...
• What type of data do you want to process?
• Large stored logs? Streaming sensor data? or ...
• What do you need as the result?
• CSV? Spreadsheet? Graph? DB Relation? or ...
17. How? (just examples)
• MapReduce, Tez
• Large batch jobs, big JOINs, high stability
• Spark
• Small/medium batch jobs, machine learning
• Impala, Presto, Drill, Redshift, BigQuery
• Near-real-time search, small-to-large analytics
• Storm, Spark Streaming
• Stream data conversion/aggregation
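To make "stream data conversion/aggregation" a bit more concrete, here is a minimal sketch in plain Python of a tumbling-window count. It is not Storm or Spark Streaming code; the function name `tumbling_window_counts`, the `(timestamp, key)` event shape, and the sample events are all assumptions made up for illustration, and the events are assumed to arrive in timestamp order.

```python
from collections import Counter


def tumbling_window_counts(events, window_seconds=60):
    """Count events per key in fixed, non-overlapping time windows.

    `events` is an iterable of (timestamp_seconds, key) pairs, assumed to
    arrive in timestamp order. A (window_start, Counter) pair is yielded
    each time a window closes.
    """
    current_start = None
    counts = Counter()
    for ts, key in events:
        window_start = ts - (ts % window_seconds)
        if current_start is None:
            current_start = window_start
        elif window_start != current_start:
            # An event from a later window arrived: emit the finished window.
            yield current_start, counts
            current_start, counts = window_start, Counter()
        counts[key] += 1
    if current_start is not None:
        # Flush the last, possibly partial, window at end of input.
        yield current_start, counts


# Hypothetical sample events: (seconds since start, event type).
sample = [(0, "login"), (10, "click"), (59, "click"), (61, "login"), (125, "click")]
results = list(tumbling_window_counts(sample, window_seconds=60))
```

A batch job would compute the same counts after all the data has been stored. The streaming version emits each window as soon as a later event shows that the window is over, so results are available while data is still arriving.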
19. Data Analytics Flow (again)
Data source -> Collect -> Store -> Process -> Visualize -> Reporting / Monitoring
21. Data Collection
• Data-Driven Development -> collect first!
• As batch: data already exists as files
• Easily integrated with existing batch systems
• Sqoop, Embulk, ...
• As stream: data is being generated right now
• Easily connected with monitoring systems
• Without bursts of network traffic
• Flume, Logstash, Fluentd, ...
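The difference between the two collection styles can be sketched in a few lines of plain Python. This is only an illustration under assumptions: `collect_batch`, `collect_stream`, the chunk size, and the sample records are hypothetical names and values, not the API of Sqoop, Embulk, Flume, Logstash, or Fluentd, and nothing here models the buffering and retry behaviour real collectors provide.

```python
import io


def collect_batch(log_file):
    """Batch style: the data already exists as a file, so ship it in one transfer."""
    lines = [line.rstrip("\n") for line in log_file]
    return [lines] if lines else []


def collect_stream(records, chunk_size=2):
    """Stream style: forward small chunks while records are being generated."""
    chunk = []
    for record in records:
        chunk.append(record)
        if len(chunk) >= chunk_size:
            yield chunk
            chunk = []
    if chunk:
        # Forward whatever is left when the source ends.
        yield chunk


# Hypothetical data: five log records, once as an existing file, once as a live source.
existing_file = io.StringIO("a\nb\nc\nd\ne\n")
batch_transfers = collect_batch(existing_file)
stream_transfers = list(collect_stream(iter(["a", "b", "c", "d", "e"]), chunk_size=2))

batch_sizes = [len(t) for t in batch_transfers]    # one large transfer
stream_sizes = [len(t) for t in stream_transfers]  # several small transfers
```

The batch path moves everything at once, which fits existing batch systems but produces one burst of traffic. The stream path spreads the same records over several small transfers, which is the "without bursts of network traffic" point above.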