2017 09-27 democratize data products with SQL

Democratize Scalable Data Products with SQL
Yu Ishikawa

Have you implemented scalable data products?

Example Scalable Data Product
Suppose we want to count item views per item over the past 100 days.
- How large is the event log for the past 100 days?
  - 346 GB
- How many lines does the event log contain for the past 100 days?
  - 1,265,405,561 lines
- How many items were viewed in the past 100 days?
  - 11,279,020 items
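
To put these figures in perspective, here is a rough back-of-the-envelope calculation. It uses only the three numbers above; the choice of 1 GB = 10**9 bytes is an assumption, since the slide does not say whether GB or GiB is meant.

```python
# Back-of-the-envelope figures derived from the numbers on this slide.
# Assumption: 1 GB = 10**9 bytes (the slide does not distinguish GB from GiB).
TOTAL_BYTES = 346 * 10**9
TOTAL_LINES = 1_265_405_561
TOTAL_ITEMS = 11_279_020
DAYS = 100

bytes_per_line = TOTAL_BYTES / TOTAL_LINES        # average size of one log line
lines_per_item = TOTAL_LINES / TOTAL_ITEMS        # average log lines per viewed item
lines_per_day = TOTAL_LINES / DAYS                # average log lines per day
lines_per_minute = lines_per_day / (24 * 60)      # average log lines per minute

print(f"{bytes_per_line:.0f} bytes per line")
print(f"{lines_per_item:.1f} lines per viewed item")
print(f"{lines_per_day:,.0f} lines per day")
print(f"{lines_per_minute:,.0f} lines per minute")
```

Under that assumption, this works out to roughly 270 bytes per line, about 112 log lines per viewed item, and close to 8,800 log lines per minute on average.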

Approach 1: Application-Based
DB/Cache
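def item_view(self, item_id, user_id, item_seller_id):
    # Read the current view count for this item from the DB/cache.
    counter = get_counter(item_id)
    # Views by the item's own seller are not counted.
    if user_id != item_seller_id:
        counter += 1
        put_counter(item_id, counter)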
[Figure: application instances each run counter += 1 against the DB/Cache, which stores per-item view counts]
item_id    views
m13234234  2
m432809    4
m9487201   5
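
A minimal runnable sketch of the application-based counter. The `InMemoryCounterStore` class is a hypothetical stand-in for the DB/cache tier, not a real client; the item and user IDs are made up for illustration.

```python
from collections import defaultdict


class InMemoryCounterStore:
    """Hypothetical stand-in for the DB/cache tier (not a real client)."""

    def __init__(self):
        self._views = defaultdict(int)

    def get_counter(self, item_id):
        return self._views[item_id]

    def put_counter(self, item_id, counter):
        self._views[item_id] = counter


def item_view(store, item_id, user_id, item_seller_id):
    """Count one view unless the viewer is the item's own seller."""
    if user_id == item_seller_id:
        return
    counter = store.get_counter(item_id)
    store.put_counter(item_id, counter + 1)


store = InMemoryCounterStore()
item_view(store, "m432809", user_id="u1", item_seller_id="u9")
item_view(store, "m432809", user_id="u2", item_seller_id="u9")
item_view(store, "m432809", user_id="u9", item_seller_id="u9")  # seller: ignored
print(store.get_counter("m432809"))  # prints 2
```

Note that the read-then-write step is not atomic. With several application instances updating the same item, increments can be lost unless the store offers an atomic increment or a transaction, which is part of the scalability work this approach pushes onto another tier.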

Approach 1: Application-Based Implementation
- Pros
  - The architecture is easy to understand.
- Cons
  - We have to implement a new one for every data product.
  - We must handle its horizontal scalability on another tier.
  - Maintaining such a data product is annoying.

Approach 2: Stateful Streaming Processing
[Figure labels: Event Stream, State, Entity DB/Cache. The state holds per-item view counts:]
item_id    views
m13234234  2
m432809    4
m9487201   5
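
A minimal sketch of the idea behind stateful stream processing, written in plain Python rather than with a real streaming framework. The event fields (`item_id`, `user_id`, `seller_id`) and the sample events are assumptions for illustration; a production job would keep the state in the framework's managed, fault-tolerant state store.

```python
from collections import Counter


def count_views(event_stream, state=None):
    """Fold view events into per-item counts, yielding each updated count."""
    state = Counter() if state is None else state
    for event in event_stream:
        if event["user_id"] == event["seller_id"]:
            continue  # views by the item's own seller are not counted
        state[event["item_id"]] += 1
        yield event["item_id"], state[event["item_id"]]


# Hypothetical sample events, in arrival order.
events = [
    {"item_id": "m432809", "user_id": "u1", "seller_id": "u9"},
    {"item_id": "m9487201", "user_id": "u2", "seller_id": "u3"},
    {"item_id": "m432809", "user_id": "u9", "seller_id": "u9"},
    {"item_id": "m432809", "user_id": "u4", "seller_id": "u9"},
]

updates = list(count_views(events))
for item_id, views in updates:
    print(item_id, views)
```

Each yielded pair is what would be written to the entity DB/cache after the corresponding event.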

Approach 2: Stateful Streaming Processing
- Pros
  - Implementing scalable algorithms is very easy for engineers experienced with big data.
- Cons
  - We still have to implement a new one every time.
  - Developing scalable data products as a team would be hard at a startup.
    - At Mercari, who is really familiar with big data frameworks?
  - Maintaining such a data product is annoying.

How can we process big data?
[Figure: chart labeled Real-Time / Streaming / Batch and Advanced / Simple]

[Figure: chart labeled Real-Time / Streaming / Batch and Advanced / Simple, with the simple cases highlighted]
Simple cases, but the largest segment!

[Figure: chart labeled Real-Time / Streaming / Batch and Advanced / Simple]
Do we really have to implement something with distributed processing frameworks for such simple cases?

[Figure: chart labeled Real-Time / Streaming / Batch and Advanced / Simple, with one area highlighted]
Machine learning engineers should focus on this area!

What are the tough points of scalable data products?
- Implementing an algorithm at scale is more difficult than implementing one on small data.
  - Can you implement scalable algorithms with Apache Spark?
- Storing a bunch of records in a project-specific data format is annoying.
  - Can you insert data into a database at 100,000 records/min?

High-Level Overview of Creating Data Products
Hit upon an idea! -> Implement the idea -> Brush up the idea -> Ready for production
We should not spend much time on trial and error in this cycle.

Proposed Approach: bigquery-to-datastore

Google BigQuery
A fast, economical and fully-managed enterprise data warehouse for large-scale
data analytics
- Scalable
- Flexible data operation with SQL
- Highly available
- Fully integrated with Apache Spark, Google Dataflow, and so on
- Cost-effective
- However, we cannot use BigQuery like a relational database.

Google Datastore
Cloud Datastore is a highly-scalable NoSQL database for your web and mobile
applications
- Fast & highly scalable
- Highly available
- Fully-managed

Google Dataflow
Simplified stream and batch data processing, with equal reliability and
expressiveness
- Horizontal auto-scaling
- Fully-managed

Apache Beam
Implement batch and streaming data processing jobs that run on any execution
engine
- Highly scalable
- Portable
  - Dataflow runner, Spark runner, Flink runner
- Extensible
  - text, JSON, XML, Avro, HDFS, Amazon Kinesis, Apache Kafka, Google Pub/Sub, MQTT,
    Apache Cassandra, Apache HBase, Apache Hive, Apache Solr, Elasticsearch, BigQuery,
    BigTable, Datastore, JDBC, MongoDB, S3, Google Cloud Storage, Apache Parquet,
    Memcached, Redis, REST

What is bigquery-to-datastore?
- It exports a BigQuery table to Google Datastore, using Apache Beam and Google Dataflow,
  without our having to worry about the table's schema.
  - One of my weekend projects
  - https://github.com/yu-iskw/bigquery-to-datastore
- Pros
  - We basically don't have to care about scalability.
  - We don't have to implement anything! Just execute a command!
  - Fully managed
    - BigQuery, Google Dataflow, Google Datastore
- Cons
  - It doesn't support near-real-time processing.
  - Using machine learning on it could be hacky, but not impossible.

Type Conversion from BigQuery to Google Datastore

Demo Case
Create a scalable data product:
- The number of page views (PV) per item within the past 100 days.
- The number of unique viewing users (UU) per item within the past 100 days.
item_id   pv   uu
m223014   23   16
m99174    3    3
m9898374  105  89

BigQuery SQL to count PV and UU by Item
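
The SQL shown on this slide is not reproduced in the text. As a rough illustration of the aggregation it describes, here is a minimal sketch that runs an equivalent query locally. It uses SQLite only so that it is runnable without a cloud project; the table name `item_view_events`, its columns, and the sample rows are all assumptions, and BigQuery's date functions differ from SQLite's `date(...)`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical event-log schema; the real table is not shown in the slides.
conn.execute(
    "CREATE TABLE item_view_events (item_id TEXT, user_id TEXT, viewed_at TEXT)"
)
conn.executemany(
    "INSERT INTO item_view_events VALUES (?, ?, ?)",
    [
        ("m99174", "u1", "2017-09-01"),
        ("m99174", "u2", "2017-09-02"),
        ("m99174", "u3", "2017-09-03"),
        ("m223014", "u1", "2017-09-01"),
        ("m223014", "u1", "2017-09-05"),  # same user again: PV +1, UU unchanged
        ("m223014", "u2", "2017-04-01"),  # older than 100 days: excluded
    ],
)

# PV = number of view events, UU = number of distinct viewers, per item.
QUERY = """
SELECT item_id,
       COUNT(*) AS pv,
       COUNT(DISTINCT user_id) AS uu
FROM item_view_events
WHERE viewed_at >= date(:today, '-100 days')
GROUP BY item_id
ORDER BY item_id
"""

rows = conn.execute(QUERY, {"today": "2017-09-27"}).fetchall()
for item_id, pv, uu in rows:
    print(item_id, pv, uu)
```

The point of the approach is that a query of this shape, one `GROUP BY` with `COUNT(*)` and `COUNT(DISTINCT ...)`, is all the "implementation" the data product needs; the result table is then exported as-is.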

Export the BigQuery table to Google Datastore
java -cp $(pwd)/target/bigquery-to-datastore-bundled-0.1.jar \
  com.github.yuiskw.beam.BigQuery2Datastore \
  --project=sage-shard-740 \
  --runner=DataflowRunner \
  --inputBigQueryDataset=test_yu \
  --inputBigQueryTable=test_pv_uu \
  --outputDatastoreNamespace=test_double \
  --outputDatastoreKind=TestPvUu \
  --keyColumn=item_id \
  --tempLocation=gs://test_yu/test-log/ \
  --gcpTempLocation=gs://test_yu/test-log/ \
  --numWorkers=5 \
  --maxNumWorkers=10 \
  --workerMachineType=n1-standard-8

High-Level Overview of Creating Data Products
Hit upon an idea! -> Implement the idea -> Brush up the idea -> Ready for production
We should not spend much time on trial and error in this cycle.

You can build scalable data products!
