4. Example Scalable Data Product
Now, we would like to count the views of each item over the past 100 days.
- How large is the event log for the past 100 days?
- 346 GB
- How many lines does the event log contain for the past 100 days?
- 1,265,405,561 lines
- How many items were viewed in the past 100 days?
- 11,279,020 items
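To get a feel for these numbers, here is a rough back-of-the-envelope sketch. It assumes 1 GB = 10^9 bytes and that every log line is one item view event; neither is stated on the slide, so the results are only order-of-magnitude estimates.

```python
# Figures quoted on the slide for the past 100 days of event logs.
LOG_SIZE_BYTES = 346 * 10**9   # assumption: 1 GB = 10^9 bytes
LOG_LINES = 1_265_405_561
VIEWED_ITEMS = 11_279_020
DAYS = 100

# Average size of one log line.
bytes_per_line = LOG_SIZE_BYTES / LOG_LINES

# Average ingest rate over the 100 days.
lines_per_day = LOG_LINES / DAYS
lines_per_second = lines_per_day / (24 * 60 * 60)

# Average lines per viewed item (assumption: one line = one item view).
lines_per_item = LOG_LINES / VIEWED_ITEMS

print(f"average line size : {bytes_per_line:.0f} bytes")
print(f"lines per day     : {lines_per_day:,.0f}")
print(f"lines per second  : {lines_per_second:.0f}")
print(f"lines per item    : {lines_per_item:.0f}")
```

Under these assumptions the log averages a few hundred bytes per line and on the order of a hundred lines per viewed item.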
6. Approach 1: Application-Based Implementation
- Pros
- The architecture could be easy to understand.
- Cons
- We have to implement a new one for every data product.
- We must control its horizontal scalability on another tier.
- Maintaining such a data product is annoying.
8. Approach 2: Stateful Streaming Processing
- Pros
- It is very easy for big data engineers to implement scalable algorithms.
- Cons
- We still have to implement a new one for every data product.
- Developing scalable data products as a team would be hard at a startup.
- At Mercari, who is really familiar with big data frameworks?
- Maintaining such a data product is annoying.
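As an illustration of what "stateful" means here, the following is a minimal single-process sketch of the state a streaming job would have to keep for a 100-day view count. It is not taken from the talk: the class name, item IDs, and dates are hypothetical, events are assumed to arrive in timestamp order, and a real streaming framework would add distribution and fault tolerance on top.

```python
from collections import defaultdict, deque
from datetime import datetime, timedelta

WINDOW = timedelta(days=100)


class SlidingViewCounter:
    """Keeps per-item view timestamps and drops those older than the window."""

    def __init__(self, window=WINDOW):
        self.window = window
        # item_id -> view timestamps, oldest first (assumes in-order arrival)
        self.views = defaultdict(deque)

    def add(self, item_id, viewed_at):
        """Record one view event for an item."""
        self.views[item_id].append(viewed_at)

    def count(self, item_id, now):
        """Return the number of views of an item within the window ending at now."""
        timestamps = self.views[item_id]
        # Evict expired events from the front of the queue.
        while timestamps and timestamps[0] <= now - self.window:
            timestamps.popleft()
        return len(timestamps)


counter = SlidingViewCounter()
now = datetime(2017, 7, 1)  # hypothetical "current" time
counter.add("m1", now - timedelta(days=150))  # outside the window
counter.add("m1", now - timedelta(days=10))
counter.add("m1", now - timedelta(days=1))
counter.add("m2", now - timedelta(days=5))
print(counter.count("m1", now), counter.count("m2", now))
```

Even this toy version has to decide how to expire old events, which hints at why every new data product built this way needs its own implementation and maintenance.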
9. How can we process big data?
[Diagram labels: Real-Time, Streaming, Batch; Advanced, Simple]
10. How can we process big data?
[Diagram labels: Real-Time, Streaming, Batch; Advanced, Simple]
Callout: Simple cases, but the largest segment!
11. How can we process big data?
[Diagram labels: Real-Time, Streaming, Batch; Advanced, Simple]
Callout: Do we really have to implement something with distributed processing frameworks for such simple cases?
12. How can we process big data?
[Diagram labels: Real-Time, Streaming, Batch; Advanced, Simple]
Callout: Machine learning engineers should focus on this area!
13. What are the tough points of scalable data products?
- Implementing an algorithm at scale is more difficult than implementing one on small data.
- Can you implement scalable algorithms with Apache Spark?
- Storing a bunch of records in a project-specific data format is annoying.
- Can you insert data into a database at 100,000 records/min?
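To put the 100,000 records/min figure in perspective, here is a small arithmetic sketch. Combining it with the 11,279,020 viewed items from slide 4 is my own assumption about what would be stored (one record per item), not something the slide states.

```python
RECORDS_PER_MINUTE = 100_000   # write rate asked about on the slide
ITEMS_TO_STORE = 11_279_020    # items viewed in the past 100 days (slide 4)

# Sustained write rate per second.
records_per_second = RECORDS_PER_MINUTE / 60

# Time to store one record per item at that rate (assumption: one record per item).
minutes_to_load = ITEMS_TO_STORE / RECORDS_PER_MINUTE

print(f"{records_per_second:.0f} records/s")
print(f"{minutes_to_load:.0f} minutes to store one record per item")
```

Under that assumption, a full refresh would take nearly two hours of sustained writes at this rate.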
14. High-Level Overview to Create Data Products
Hit upon an idea! -> Implement the idea -> Brush up the idea -> Ready for production
We should not spend much time on trial and error in the cycle.
16. Google BigQuery
A fast, economical and fully-managed enterprise data warehouse for large-scale
data analytics
- Scalable
- Flexible data operation with SQL
- Highly available
- Fully integrated with Apache Spark, Google Dataflow, and so on
- Cost effective
- However, BigQuery cannot be used like a relational database.
17. Google Datastore
Cloud Datastore is a highly-scalable NoSQL database for your web and mobile
applications
- Fast & highly scalable
- Highly available
- Fully-managed
18. Google Dataflow
Simplified stream and batch data processing, with equal reliability and
expressiveness
- Horizontal auto-scaling
- Fully-managed
19. Apache Beam
Implement batch and streaming data processing jobs that run on any execution
engine
- Highly scalable
- Portable
- Dataflow runner, Spark runner, Flink runner
- Extensible
- Text, JSON, XML, Avro, HDFS, Amazon Kinesis, Apache Kafka, Google Pub/Sub, MQTT,
Apache Cassandra, Apache HBase, Apache Hive, Apache Solr, Elasticsearch, BigQuery,
BigTable, Datastore, JDBC, MongoDB, S3, Google Cloud Storage, Apache Parquet,
Memcached, Redis, REST
20. What is bigquery-to-datastore?
- Export a BigQuery table to Google Datastore with Apache Beam and Google Dataflow,
without having to worry about the table's schema.
- One of my weekend projects
- https://github.com/yu-iskw/bigquery-to-datastore
- Pros
- We basically don't have to care about scalability.
- We don't have to implement anything! Just execute a command!
- Fully-managed
- BigQuery, Google Dataflow, Google Datastore
- Cons
- It doesn't support near real-time processing.
- Using machine learning on it could be hacky, but not impossible.
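To illustrate the idea of exporting rows without caring about the schema, here is a purely conceptual Python sketch. It is not the tool's code (the tool is built on Apache Beam), and the function name, the entity layout, and the kind name "ItemStats" are hypothetical; only the sample row comes from the demo slide.

```python
def row_to_entity(row, kind, key_column):
    """Turn one table row (a dict) into a Datastore-like entity dict.

    No column names are hard-coded: every column except the key column
    becomes a property, whatever the table's schema happens to be.
    """
    if key_column not in row:
        raise KeyError(f"row has no key column {key_column!r}")
    properties = {name: value for name, value in row.items() if name != key_column}
    return {"key": (kind, row[key_column]), "properties": properties}


# Sample row taken from the demo table; "ItemStats" is a hypothetical kind name.
entity = row_to_entity({"item_id": "m223014", "pv": 23, "uu": 16}, "ItemStats", "item_id")
print(entity)
```

Because the mapping is generic, adding or removing a column in the source table would not require touching the export step in this sketch.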
22. Demo Case
Create a scalable data product:
- The number of page views (pv) per item within the past 100 days.
- The number of unique viewing users (uu) per item within the past 100 days.
item_id    pv   uu
m223014    23   16
m99174      3    3
m9898374  105   89
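The aggregation behind such a table can be sketched with plain SQL. The example below uses Python's built-in sqlite3 as a stand-in for BigQuery so that it runs anywhere; the table name, column names, sample events, and the fixed "today" date are all hypothetical, and BigQuery's date functions differ from SQLite's.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical event log schema: one row per item view.
conn.execute("CREATE TABLE item_views (item_id TEXT, user_id TEXT, viewed_at TEXT)")
conn.executemany(
    "INSERT INTO item_views VALUES (?, ?, ?)",
    [
        ("m1", "u1", "2017-06-01"),
        ("m1", "u1", "2017-06-02"),  # same user again: counts for pv, not for uu
        ("m1", "u2", "2017-06-03"),
        ("m2", "u3", "2017-06-03"),
        ("m1", "u4", "2016-01-01"),  # older than 100 days: excluded
    ],
)

# pv = number of views, uu = number of distinct users, per item, past 100 days.
QUERY = """
SELECT item_id, COUNT(*) AS pv, COUNT(DISTINCT user_id) AS uu
FROM item_views
WHERE viewed_at >= date(:today, '-100 days')
GROUP BY item_id
ORDER BY item_id
"""
result = conn.execute(QUERY, {"today": "2017-07-01"}).fetchall()
print(result)  # [('m1', 3, 2), ('m2', 1, 1)]
```

The point of the sketch is that both metrics fall out of a single GROUP BY, which is the kind of simple case a data warehouse can handle without a custom distributed job.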
27. High-Level Overview of Creating Data Products
Hit upon an idea! -> Implement the idea -> Brush up the idea -> Ready for production
We should not spend much time on trial and error in the cycle.
28. How can we process big data?
[Diagram labels: Real-Time, Streaming, Batch; Advanced, Simple]
Callout: Simple cases, but the largest segment!