Democratize Scalable Data Products with SQL
Yu Ishikawa
Have you ever used SQL?
Have you implemented scalable data products?
Example Scalable Data Product
Now, we would like to count the item views per item over the past 100 days.
- How large is the event log for the past 100 days?
- 346 GB
- How many lines does the event log contain for the past 100 days?
- 1,265,405,561 lines
- How many items were viewed in the past 100 days?
- 11,279,020 items
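A quick back-of-the-envelope check of these figures, as a sketch only. It assumes 1 GB = 10^9 bytes and an even spread of events over the 100 days; neither assumption is stated in the slides.

```python
# Figures quoted on the slide for the past 100 days.
TOTAL_BYTES = 346 * 10**9   # assumption: 1 GB = 10^9 bytes
TOTAL_LINES = 1_265_405_561
TOTAL_ITEMS = 11_279_020
DAYS = 100

# Average size of one log line.
bytes_per_line = TOTAL_BYTES / TOTAL_LINES
# Average number of log lines per viewed item.
lines_per_item = TOTAL_LINES / TOTAL_ITEMS
# Average event rate, assuming events are spread evenly over time.
lines_per_second = TOTAL_LINES / (DAYS * 24 * 60 * 60)

print(f"average log line size   : {bytes_per_line:.0f} bytes")
print(f"average lines per item  : {lines_per_item:.0f}")
print(f"average event rate      : {lines_per_second:.0f} lines/s")
```

Real traffic is bursty, so peak rates are likely to be well above the average printed here.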
Approach 1: Application-Based
DB/Cache
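def item_view(self, item_id, user_id, item_seller_id, **args):
    # Read the current view count of the item from the DB/Cache.
    counter = get_counter(item_id)
    # Do not count views by the seller of the item.
    if user_id != item_seller_id:
        counter += 1
        # Write the updated count back under the same item ID.
        put_counter(item_id, counter)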
item_id views
m13234234 2
m432809 4
m9487201 5
(Diagram: the same item_view handler runs on several application servers, each reading from and writing to the DB/Cache.)
Approach 1: Application-Based Implementation
- Pros
- The architecture is easy to understand.
- Cons
- We have to implement a new one for every data product.
- We must manage its horizontal scalability ourselves, in another tier.
- Maintaining such a data product is annoying.
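A minimal sketch of the application-based approach, assuming an in-memory dict in place of the DB/Cache. The class name `ItemViewCounter` and the sample IDs are hypothetical; only `item_view`, `get_counter`, and `put_counter` come from the slide.

```python
class ItemViewCounter:
    """Application-side view counter; a dict stands in for the DB/Cache."""

    def __init__(self):
        self._store = {}  # hypothetical stand-in for the DB/Cache

    def get_counter(self, item_id):
        # Items that were never viewed start at zero.
        return self._store.get(item_id, 0)

    def put_counter(self, item_id, counter):
        self._store[item_id] = counter

    def item_view(self, item_id, user_id, item_seller_id):
        # Views by the seller of the item are not counted.
        if user_id == item_seller_id:
            return
        # Read-modify-write: two separate calls, so it is not atomic
        # when several servers share one real store.
        counter = self.get_counter(item_id)
        self.put_counter(item_id, counter + 1)


counter = ItemViewCounter()
counter.item_view("m432809", user_id="u1", item_seller_id="u9")
counter.item_view("m432809", user_id="u2", item_seller_id="u9")
counter.item_view("m432809", user_id="u9", item_seller_id="u9")  # seller, ignored
print(counter.get_counter("m432809"))
```

With a shared store and many application servers, the read-modify-write step would need an atomic increment or a lock, which is one reason scalability ends up being handled in a separate tier.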
Approach 2: Stateful Stream Processing
(Diagram labels: Event Stream, State, Entity DB/Cache, DB/Cache.)
item_id views
m13234234 2
m432809 4
m9487201 5
Approach 2: Stateful Stream Processing
- Pros
- Implementing a scalable algorithm is very easy for engineers who are used to big data.
- Cons
- We still have to implement one every time.
- Developing scalable data products as a team would be hard at a startup.
- At Mercari, who is really familiar with big data frameworks?
- Maintaining such a data product is annoying.
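The idea of stateful stream processing can be sketched without any framework: an operator keeps per-item state and updates it as each event arrives. This is an illustration only, not code for a real streaming engine. The event field names and the `count_views` function are assumptions, and the rule that a seller's own views are skipped is carried over from the Approach 1 snippet.

```python
from collections import Counter


def count_views(event_stream):
    """Fold an event stream into per-item view counts, yielding each update."""
    state = Counter()  # operator state: item_id -> views so far
    for event in event_stream:
        # Skip views by the seller of the item.
        if event["user_id"] == event["item_seller_id"]:
            continue
        state[event["item_id"]] += 1
        # Emit the updated count so a sink (DB/Cache) can store it.
        yield event["item_id"], state[event["item_id"]]


events = [
    {"item_id": "m13234234", "user_id": "u1", "item_seller_id": "u7"},
    {"item_id": "m13234234", "user_id": "u2", "item_seller_id": "u7"},
    {"item_id": "m432809", "user_id": "u7", "item_seller_id": "u7"},  # seller
    {"item_id": "m432809", "user_id": "u3", "item_seller_id": "u7"},
]

sink = {}  # hypothetical stand-in for the DB/Cache
for item_id, views in count_views(events):
    sink[item_id] = views
print(sink)
```

A real framework adds what this sketch leaves out: partitioning the state across workers, checkpointing it, and recovering it after failures.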
How can we process big data?
(Chart labels: Real-Time, Streaming, Batch; Advanced, Simple. The chart is repeated on successive slides with these callouts:)
- Simple cases, but the largest segment!
- Do we really have to implement something with a distributed processing framework for such simple cases?
- Machine learning engineers should focus on this area!
What are the tough points of scalable data products?
- Implementing an algorithm at scale is more difficult than implementing one on small data.
- Can you implement scalable algorithms with Apache Spark?
- Storing a bunch of records in each project's own data format is annoying.
- Can you insert data into a database at 100,000 records/min?
High-Level Overview of Creating Data Products
Hit upon an idea! -> Implement the idea -> Brush up the idea -> Ready for production
We should not spend much time on trial and error in this cycle.
Proposed Approach: bigquery-to-datastore
Google BigQuery
A fast, economical and fully-managed enterprise data warehouse for large-scale
data analytics
- Scalable
- Flexible data operation with SQL
- Highly available
- Fully integrated with Apache Spark, Google Dataflow, and so on
- Cost effective
- However, we cannot use BigQuery like a relational database.
Google Datastore
Cloud Datastore is a highly-scalable NoSQL database for your web and mobile
applications
- Fast & highly scalable
- Highly available
- Fully-managed
Google Dataflow
Simplified stream and batch data processing, with equal reliability and
expressiveness
- Horizontal auto-scaling
- Fully-managed
Apache Beam
Implement batch and streaming data processing jobs that run on any execution
engine
- Highly scalable
- Portable
- Dataflow runner, Spark runner, Flink runner
- Extensible
- text, JSON, XML, Avro, HDFS, Amazon Kinesis, Apache Kafka, Google Pub/Sub, MQTT,
Apache Cassandra, Apache HBase, Apache Hive, Apache Solr, Elasticsearch, BigQuery,
BigTable, Datastore, JDBC, MongoDB, S3, Google Cloud Storage, Apache Parquet,
Memcached, Redis, REST
What is bigquery-to-datastore?
- Export a BigQuery table to Google Datastore without being bothered by its
schema using Apache Beam and Google Dataflow.
- One of my weekend projects
- https://github.com/yu-iskw/bigquery-to-datastore
- Pros
- We basically don’t have to care about scalability.
- We don’t have to implement anything! Just execute a command!!
- Fully-managed
- BigQuery, Google Dataflow, Google Datastore
- Cons
- It doesn’t support near real-time processing.
- Using machine learning on it could be hacky, but not impossible.
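To make "without being bothered by its schema" concrete, here is a conceptual sketch of turning one table row into a Datastore-like entity keyed by a chosen column. It is not the tool's actual code. The function `row_to_entity`, the dict layout of the entity, and the choice to keep the key column as a property are all assumptions made for illustration.

```python
def row_to_entity(row, key_column, namespace, kind):
    """Turn one table row (a dict) into a Datastore-like entity (also a dict)."""
    if key_column not in row:
        raise KeyError(f"key column {key_column!r} is missing from the row")
    return {
        # The value of the key column identifies the entity.
        "key": {
            "namespace": namespace,
            "kind": kind,
            "name": str(row[key_column]),
        },
        # Every column is copied as a property; no per-table schema is
        # hard-coded (assumption: the key column is kept as a property too).
        "properties": dict(row),
    }


row = {"item_id": "m223014", "pv": 23, "uu": 16}
entity = row_to_entity(
    row, key_column="item_id", namespace="test_double", kind="TestPvUu"
)
print(entity["key"]["name"], entity["properties"])
```

Because the function only looks at the key column, the same code handles any table, which is the property that lets a single command export arbitrary query results.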
Type Conversion from BigQuery to Google Datastore
Demo Case
Create a scalable data product:
- The number of page views (pv) per item within the past 100 days.
- The number of unique viewing users (uu) per item within the past 100 days.
item_id pv uu
m223014 23 16
m99174 3 3
m9898374 105 89
BigQuery SQL to count PV and UU by Item
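The SQL on this slide did not survive in the transcript. As a rough sketch of the kind of query involved, the example below counts PV and UU per item with `COUNT(*)` and `COUNT(DISTINCT ...)`. An in-memory SQLite database stands in for BigQuery so that it runs anywhere. The table name `events`, the column names, and the sample rows are assumptions, and the 100-day filter is left out.

```python
import sqlite3

# In-memory SQLite database as a stand-in for BigQuery.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (item_id TEXT, user_id TEXT)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?)",
    [
        ("m99174", "u1"),
        ("m99174", "u2"),
        ("m99174", "u3"),
        ("m223014", "u1"),
        ("m223014", "u1"),  # same user viewing twice: 2 PV, 1 UU
        ("m223014", "u4"),
    ],
)

# pv = all views per item, uu = distinct viewing users per item.
QUERY = """
SELECT item_id,
       COUNT(*) AS pv,
       COUNT(DISTINCT user_id) AS uu
FROM events
GROUP BY item_id
ORDER BY item_id
"""

rows = conn.execute(QUERY).fetchall()
for item_id, pv, uu in rows:
    print(item_id, pv, uu)
```

The aggregation itself is plain SQL, so the same shape of query should carry over to BigQuery, with a time filter on the event timestamp added for the 100-day window.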
Demo
Export the BigQuery table to Google Datastore
java -cp $(pwd)/target/bigquery-to-datastore-bundled-0.1.jar \
  com.github.yuiskw.beam.BigQuery2Datastore \
  --project=sage-shard-740 \
  --runner=DataflowRunner \
  --inputBigQueryDataset=test_yu \
  --inputBigQueryTable=test_pv_uu \
  --outputDatastoreNamespace=test_double \
  --outputDatastoreKind=TestPvUu \
  --keyColumn=item_id \
  --tempLocation=gs://test_yu/test-log/ \
  --gcpTempLocation=gs://test_yu/test-log/ \
  --numWorkers=5 \
  --maxNumWorkers=10 \
  --workerMachineType=n1-standard-8
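If the export is run on a schedule, the command can be assembled from a dict of options. The following is a sketch under that assumption. The helper `build_command` is hypothetical; the class name, flags, and values are the ones from the command above.

```python
import shlex


def build_command(jar_path, options):
    """Build the argument list for the export job from a dict of options."""
    args = ["java", "-cp", jar_path, "com.github.yuiskw.beam.BigQuery2Datastore"]
    # Each option becomes one --name=value argument, in insertion order.
    args += [f"--{name}={value}" for name, value in options.items()]
    return args


command = build_command(
    "target/bigquery-to-datastore-bundled-0.1.jar",
    {
        "project": "sage-shard-740",
        "runner": "DataflowRunner",
        "inputBigQueryDataset": "test_yu",
        "inputBigQueryTable": "test_pv_uu",
        "outputDatastoreNamespace": "test_double",
        "outputDatastoreKind": "TestPvUu",
        "keyColumn": "item_id",
        "tempLocation": "gs://test_yu/test-log/",
        "gcpTempLocation": "gs://test_yu/test-log/",
        "numWorkers": 5,
        "maxNumWorkers": 10,
        "workerMachineType": "n1-standard-8",
    },
)

# Print the command as a single shell-quoted line for inspection.
print(shlex.join(command))
```

The list can be passed to `subprocess.run(command, check=True)` to launch the job; the sketch only prints it so that nothing is executed by accident.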
Demo
High-Level Overview of Creating Data Products
Hit upon an idea! -> Implement the idea -> Brush up the idea -> Ready for production
We should not spend much time on trial and error in this cycle.
How can we process big data?
(Chart labels: Real-Time, Streaming, Batch; Advanced, Simple.)
Simple cases, but the largest segment!
Can you use SQL?
You can build scalable data products!

 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : Uncertainty
 

2017 09-27 democratize data products with SQL

  • 1. Democratize Scalable Data Products with SQL Yu Ishikawa
  • 2. Have you ever used SQL?
  • 3. Have you implemented scalable data products?
  • 4. Example Scalable Data Product Now, we would like to count item views per item over the past 100 days. - How large is the event log for the past 100 days? - 346 GB - How many lines does the event log contain for the past 100 days? - 1,265,405,561 lines - How many items were viewed in the past 100 days? - 11,279,020 items
  • 5. Approach 1: Application-Based [Diagram: the handler below appears three times, each next to a DB/Cache]
        def item_view(self, **args):
            counter = get_counter(item_id)
            if user_id != item_seller_id:
                counter += 1
            put_counter(counter)
    Example counters (item_id, views): (m13234234, 2), (m432809, 4), (m9487201, 5)
  • 6. Approach 1: Application-Based Implementation - Pros - The architecture can be easy to understand. - Cons - We have to implement a new one for every data product. - We must control its horizontal scalability on another tier. - Maintaining such a data product is annoying.
  • 7. Approach 2: Stateful Streaming Processing [Diagram labels: Event Stream, State, Entity, DB/Cache] Example counters (item_id, views): (m13234234, 2), (m432809, 4), (m9487201, 5)
  • 8. Approach 2: Stateful Streaming Processing - Pros - Implementing a scalable algorithm is very easy for big data engineers. - Cons - We still have to implement a new one every time. - Developing scalable data products as a team would be hard at a startup. - At Mercari, who is really familiar with big data frameworks? - Maintaining such a data product is annoying.
  • 9. How can we process big data? [Chart labels: Real-Time, Streaming, Batch; Advanced, Simple]
  • 10. How can we process big data? [Chart labels: Real-Time, Streaming, Batch; Advanced, Simple] Simple cases, but the largest segment!
  • 11. How can we process big data? [Chart labels: Real-Time, Streaming, Batch; Advanced, Simple] Do we really have to implement something with distributed processing frameworks for such simple cases?
  • 12. How can we process big data? [Chart labels: Real-Time, Streaming, Batch; Advanced, Simple] Machine learning engineers should focus on this area!
  • 13. What are the tough points of scalable data products? - Implementing an algorithm at scale is more difficult than implementing it on small data. - Can you implement scalable algorithms with Apache Spark? - Storing a bunch of records in a project-specific data format is annoying. - Can you insert data into a database at 100,000 records/min?
  • 14. High-Level Overview of Creating Data Products Hit upon an idea! → Implement the idea → Brush up the idea → Ready for production. We should not spend much time on trial and error in the cycle.
  • 16. Google BigQuery A fast, economical and fully-managed enterprise data warehouse for large-scale data analytics - Scalable - Flexible data operation with SQL - Highly available - Fully integrated with Apache Spark, Google Dataflow and so on - Cost effective - However, BigQuery cannot be used like a relational database.
  • 17. Google Datastore Cloud Datastore is a highly-scalable NoSQL database for your web and mobile applications - Fast & highly scalable - Highly available - Fully-managed
  • 18. Google Dataflow Simplified stream and batch data processing, with equal reliability and expressiveness - Horizontal auto-scaling - Fully-managed
  • 19. Apache Beam Implement batch and streaming data processing jobs that run on any execution engine - Highly scalable - Portable - Dataflow runner, Spark runner, Flink runner - Extensible - text, JSON, XML, Avro, HDFS, Amazon Kinesis, Apache Kafka, Google Pub/Sub, MQTT, Apache Cassandra, Apache HBase, Apache Hive, Apache Solr, Elasticsearch, BigQuery, BigTable, Datastore, JDBC, MongoDB, S3, Google Cloud Storage, Apache Parquet, Memcached, Redis, REST
  • 20. What is bigquery-to-datastore? - Export a BigQuery table to Google Datastore, without having to worry about its schema, using Apache Beam and Google Dataflow. - One of my weekend projects - https://github.com/yu-iskw/bigquery-to-datastore - Pros - We basically don't have to care about scalability. - We don't have to implement anything! Just execute a command!! - Fully-managed - BigQuery, Google Dataflow, Google Datastore - Cons - It doesn't support near real-time processing. - Using machine learning on it could be hacky, but not impossible.
  • 21. Type Conversion from BigQuery to Google Datastore
  • 22. Demo Case Create scalable data products: - The number of page views (pv) per item within the past 100 days. - The number of unique viewing users (uu) per item within the past 100 days. Example output (item_id, pv, uu): (m223014, 23, 16), (m99174, 3, 3), (m9898374, 105, 89)
  • 23. BigQuery SQL to count PV and UU by Item
  • 24. Demo
  • 25. Export the BigQuery table to Google Datastore
        java -cp $(pwd)/target/bigquery-to-datastore-bundled-0.1.jar \
          com.github.yuiskw.beam.BigQuery2Datastore \
          --project=sage-shard-740 \
          --runner=DataflowRunner \
          --inputBigQueryDataset=test_yu \
          --inputBigQueryTable=test_pv_uu \
          --outputDatastoreNamespace=test_double \
          --outputDatastoreKind=TestPvUu \
          --keyColumn=item_id \
          --tempLocation=gs://test_yu/test-log/ \
          --gcpTempLocation=gs://test_yu/test-log/ \
          --numWorkers=5 \
          --maxNumWorkers=10 \
          --workerMachineType=n1-standard-8
  • 26. Demo
  • 27. High-Level Overview of Creating Data Products Hit upon an idea! → Implement the idea → Brush up the idea → Ready for production. We should not spend much time on trial and error in the cycle.
  • 28. How can we process big data? [Chart labels: Real-Time, Streaming, Batch; Advanced, Simple] Simple cases, but the largest segment!
  • 29. Can you use SQL?
  • 30. You can build scalable data products!
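
As a back-of-the-envelope check on the figures from slides 4 and 13, the following Python sketch derives the average size of a log line, the average number of log lines per viewed item, and how long it would take to write one counter record per item at 100,000 records/min. It assumes that 1 GB means 10^9 bytes and that each viewed item produces exactly one output record; neither assumption is stated on the slides.

```python
# Figures quoted on slide 4 (event log for the past 100 days).
LOG_SIZE_BYTES = 346 * 10**9   # assumption: 1 GB = 10^9 bytes
LOG_LINES = 1_265_405_561
VIEWED_ITEMS = 11_279_020

# Write rate asked about on slide 13.
RECORDS_PER_MIN = 100_000

bytes_per_line = LOG_SIZE_BYTES / LOG_LINES
lines_per_item = LOG_LINES / VIEWED_ITEMS
# Assumption: one output record (a counter row) per viewed item.
minutes_to_write = VIEWED_ITEMS / RECORDS_PER_MIN

print(f"{bytes_per_line:.0f} bytes per log line")
print(f"{lines_per_item:.0f} log lines per viewed item")
print(f"{minutes_to_write:.0f} minutes to write all counters")
```

Under these assumptions the result table is small compared with the raw log, which is why the write rate into the serving store matters more than the size of the output.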
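
The handler on slide 5 can be made runnable with a small stand-in for the DB/Cache tier. This is a minimal sketch, not the author's implementation: `InMemoryCounterStore` and the explicit `item_id`, `user_id` and `item_seller_id` parameters are hypothetical names introduced here for illustration.

```python
class InMemoryCounterStore:
    """Hypothetical stand-in for the DB/Cache tier shown on slide 5."""

    def __init__(self):
        self._views = {}

    def get_counter(self, item_id):
        # Items that were never viewed start at zero.
        return self._views.get(item_id, 0)

    def put_counter(self, item_id, counter):
        self._views[item_id] = counter


def item_view(store, item_id, user_id, item_seller_id):
    """Count a view unless the viewer is the item's seller."""
    counter = store.get_counter(item_id)
    if user_id != item_seller_id:
        counter += 1
    store.put_counter(item_id, counter)
    return counter


store = InMemoryCounterStore()
item_view(store, "m432809", user_id="u1", item_seller_id="s9")
item_view(store, "m432809", user_id="s9", item_seller_id="s9")  # seller, not counted
item_view(store, "m432809", user_id="u2", item_seller_id="s9")
print(store.get_counter("m432809"))
```

The read-modify-write in `item_view` is not atomic, so with several application servers sharing one store the increment would have to be made atomic in the store itself. That is one concrete form of the scalability work slide 6 lists as a con.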
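
The query on slide 23 is not part of this transcript. As a sketch of what the PV and UU aggregation from slide 22 could look like, the block below holds a hypothetical BigQuery standard SQL query as a string, followed by a pure-Python reference implementation of the same logic. The table name `my_project.my_dataset.item_view_events` and the columns `item_id`, `user_id` and `viewed_at` are assumptions, and unlike the handler on slide 5 this sketch does not exclude the seller's own views.

```python
from collections import defaultdict
from datetime import datetime, timedelta

# Hypothetical BigQuery standard SQL; table and column names are assumptions,
# and this is not the query shown on slide 23.
PV_UU_SQL = """
SELECT
  item_id,
  COUNT(*) AS pv,
  COUNT(DISTINCT user_id) AS uu
FROM `my_project.my_dataset.item_view_events`
WHERE viewed_at >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 100 DAY)
GROUP BY item_id
"""


def count_pv_uu(events, now, days=100):
    """Pure-Python reference for the same aggregation.

    events: iterable of (item_id, user_id, viewed_at) tuples.
    Returns {item_id: (pv, uu)} for views within the last `days` days.
    """
    cutoff = now - timedelta(days=days)
    pv = defaultdict(int)
    viewers = defaultdict(set)
    for item_id, user_id, viewed_at in events:
        if viewed_at >= cutoff:
            pv[item_id] += 1
            viewers[item_id].add(user_id)
    return {item_id: (pv[item_id], len(viewers[item_id])) for item_id in pv}


now = datetime(2017, 9, 27)
events = [
    ("m99174", "u1", now - timedelta(days=1)),
    ("m99174", "u1", now - timedelta(days=2)),
    ("m99174", "u2", now - timedelta(days=3)),
    ("m223014", "u3", now - timedelta(days=150)),  # older than 100 days, ignored
]
print(count_pv_uu(events, now))
```

The Python function is only useful for checking the query's semantics on a small sample; at the scale quoted on slide 4, the SQL version is the one intended to run.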
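
When the export on slide 25 is run repeatedly for different tables, the command line can be assembled from a dictionary. The sketch below uses only the flags and values that appear on slide 25; the helper name `build_command` is hypothetical, and the jar path is written relative to the working directory instead of using `$(pwd)`.

```python
import shlex


def build_command(jar_path, options):
    """Build the argument list for the bigquery-to-datastore job."""
    args = ["java", "-cp", jar_path, "com.github.yuiskw.beam.BigQuery2Datastore"]
    # Each option becomes a --name=value flag, in insertion order.
    args += [f"--{name}={value}" for name, value in options.items()]
    return args


# Flag names and values copied from slide 25.
options = {
    "project": "sage-shard-740",
    "runner": "DataflowRunner",
    "inputBigQueryDataset": "test_yu",
    "inputBigQueryTable": "test_pv_uu",
    "outputDatastoreNamespace": "test_double",
    "outputDatastoreKind": "TestPvUu",
    "keyColumn": "item_id",
    "tempLocation": "gs://test_yu/test-log/",
    "gcpTempLocation": "gs://test_yu/test-log/",
    "numWorkers": 5,
    "maxNumWorkers": 10,
    "workerMachineType": "n1-standard-8",
}

command = build_command("target/bigquery-to-datastore-bundled-0.1.jar", options)
print(shlex.join(command))
```

Passing `command` to `subprocess.run(command, check=True)` would launch the job, provided the jar has been built and Google Cloud credentials are configured.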