Do you know The Cloud Girl? She makes the cloud come alive with pictures and storytelling.
The Cloud Girl, Priyanka Vergadia, Chief Content Officer @Google, joins us to tell us about Scalable Data Analytics in Google Cloud.
Maybe, with her explanation, we'll finally understand it!
Priyanka is a technical storyteller and content creator who has created over 300 videos, articles, podcasts, courses, and tutorials that help developers learn Google Cloud fundamentals, solve their business challenges, and pass certifications! Check out her content on the Google Cloud Tech YouTube channel.
Priyanka enjoys drawing and painting, which she tries to bring into her advocacy.
Check out her website The Cloud Girl: https://thecloudgirl.dev/ and her new book: https://www.amazon.com/Visualizing-Google-Cloud-Illustrated-References/dp/1119816327
GDG Cloud Southlake #16: Priyanka Vergadia: Scalable Data Analytics in Google Cloud
1. Proprietary + Confidential
Data Science in Cloud
Quick Tour
Priyanka Vergadia
Staff Developer Advocate
Google Cloud
Twitter: @pvergadia
2. Agenda
1. Google Cloud Orientation
2. Data Science
3. Data Analytics Flow
4. MLOps
5. Some Google Cloud Tools - BigQuery, BigQuery ML and Vertex AI
6. Wrap up
4. Things I don't want to think about...
1. Provisioning hardware
2. Installing software
3. Upgrading operating systems
4. Security patching
5. System and network admin
6. Scaling up/down
7. Paying for stuff I don't use
8. Dealing with failures
9. Managing clusters
6. Getting things done using someone else's computers, especially where someone else worries about maintenance, provisioning, system administration, security, networking, failure recovery, etc.
13. "The real problems with an ML system will be found while you are continuously operating it for the long term."
Launching is easy,
Operating is hard.
15. …a product requires so much more
The surrounding infrastructure (with ML Code as the small box in the middle of the diagram):
● Configuration
● Data Collection
● Data Verification
● Feature Extraction
● Process Management Tools
● Analysis Tools
● Machine Resource Management
● Serving Infrastructure
● Monitoring
16. Why do things become harder in production?
(an incomplete list)
● data cleaning and processing is hard at scale
● scaling out training and serving; infrastructure issues
● tracking, monitoring, and reproducibility requirements
● model or data drift
● training/serving skew
● access control issues, security requirements
● (and lots more)
17. The level of automation defines the maturity of the ML process
● Level 0: Build and deploy manually
● Level 1: Automate the training phase
● Level 2: Automate training, validation, and deployment
19. "On any given day there are thousands of TFX pipelines running, which are processing exabytes of data and producing tens of thousands of models, which in turn are performing hundreds of millions of inferences per second."
22. Google BigQuery
Data warehouse with customers ranging from TB to 100+ PB
● Insights for everyone
● Cloud-scale enterprise data warehouse
● Serverless platform (unique)
● Standard SQL (ANSI 2011) with DML support
● Encrypted, durable, highly available (unique)
● Built-in ML (unique)
● Real-time insights (unique)
23. In ~15s:
● Read 2 TB: ~1k disks
● Run 50B regexps: ~3k cores
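A quick back-of-the-envelope check of those figures (the 2 TB, ~1k disks, 50B regexps, and ~3k cores are from the slide; everything else is simple arithmetic):

```python
# Sanity-check the slide's BigQuery demo numbers with simple arithmetic.
SECONDS = 15
BYTES_READ = 2e12   # 2 TB scanned
DISKS = 1_000       # ~1k disks
REGEXPS = 50e9      # 50 billion regexp evaluations
CORES = 3_000       # ~3k cores

aggregate_bps = BYTES_READ / SECONDS   # fleet-wide read throughput, bytes/sec
per_disk_bps = aggregate_bps / DISKS   # what each disk contributes
regexps_per_core_per_sec = REGEXPS / SECONDS / CORES

print(f"aggregate read:  {aggregate_bps / 1e9:.0f} GB/s")        # ~133 GB/s
print(f"per disk:        {per_disk_bps / 1e6:.0f} MB/s")         # ~133 MB/s
print(f"regexps/core/s:  {regexps_per_core_per_sec / 1e6:.1f}M") # ~1.1M
```

~133 MB/s per disk is in the range of a single spinning disk's sequential read speed: the scale comes from fanning out over ordinary hardware, not from exotic disks.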
24. BigQuery ML: train and deploy ML models in SQL
● Execute ML workflows without moving data from BigQuery
● Automate common ML tasks
● Built-in infrastructure management, security & compliance
25. BigQuery ML supported models and features
The data analyst's onramp to AI and ML
● Classification: logistic regression; DNN classifier (TensorFlow); XGBoost; AutoML Tables
● Regression: linear regression; DNN regressor (TensorFlow); XGBoost; AutoML Tables
● Other models: k-means clustering; time series forecasting; recommendation (matrix factorization); time series anomaly detection (Preview Q2'21, GA H2'21); Wide and Deep NNs (Preview, GA H1'21); autoencoders
● Model ops and explainability: import/export TensorFlow models for batch and online prediction; hyperparameter tuning using Cloud AI Vizier (Preview H1'21, GA H2'21); model explainability using Cloud AI (Preview H1'21, GA H2'21); managed Kubernetes and TFX pipelines (Preview H2'21, GA 2022); list models for comparison and online deployment in Cloud AI (Preview H2'21, GA 2022); model versioning and continuous monitoring (future)
27. Vertex AI is a managed ML platform to speed the rate of experimentation and accelerate deployment of AI models.
28. The end-to-end ML journey through Vertex AI
● Where can I find training data? → Feature Store, Datasets
● Where do I start with model experiments? → Workbench
● How can I track the results of experiments? → Experiments
● How can I train at scale? → Training
● How do I deploy? → Endpoints
● And for production? → Monitoring, Pipelines
30. Learn more
● Introduction to Data Science blog: https://goo.gle/dsintro
● Getting started docs: cloud.google.com/vertex-ai/docs
● Get started in Cloud Console: console.cloud.google.com/ai/platform
● Best practices: cloud.google.com/architecture/ml-on-gcp-best-practices
34. What is Dataplex?
● Built for distributed data: logically unify and organize your data without any data movement.
● Intelligent data management: automatic data discovery, metadata harvesting, lifecycle management, and data quality with built-in AI-driven intelligence.
● Centralized security & governance: central policy management, monitoring, and auditing for data authorization, retention, and classification.
The diagram shows analytics tools (BigQuery, Dataproc, AI Platform, Data Studio, Dataflow) sitting on Dataplex, which provides data lifecycle management (ingest, discover, prep, monitor, serve, archive), logical data organization, unified security and governance, unified metadata with auto-discovery, data classification and data quality, and data intelligence, over storage for structured, semi-structured, unstructured, and streaming* data across GCP, on-premises*, and multi-cloud* (*future capabilities).
36. Data Science On Google Cloud: A Guided Tour
Polong Lin & Marc Cohen
Developer Relations Engineers, Google Cloud
Slides: mco.fyi/ds
40. Complexity is a barrier to adoption
41. HELLO CSECT The name of this program is 'HELLO'
* Register 15 points here on entry from OPSYS or caller.
STM 14,12,12(13) Save registers 14,15, and 0 thru 12 in caller's Save area
LR 12,15 Set up base register with program's entry point address
USING HELLO,12 Tell assembler which register we are using for pgm. base
LA 15,SAVE Now Point at our own save area
ST 15,8(13) Set forward chain
ST 13,4(15) Set back chain
LR 13,15 Set R13 to address of new save area
* -end of housekeeping (similar for most programs) -
WTO 'Hello World' Write To Operator (Operating System macro)
*
L 13,4(13) restore address to caller-provided save area
XC 8(4,13),8(13) Clear forward chain
LM 14,12,12(13) Restore registers as on entry
DROP 12 The opposite of 'USING'
SR 15,15 Set register 15 to 0 so that the return code (R15) is Zero
BR 14 Return to caller
*
SAVE DS 18F Define 18 fullwords to save calling program registers
END HELLO This is the end of the program
45. Production ML Research
● Continuous Training for Production ML in the TFX Platform. OpML (2019).
● Slice Finder: Automated Data Slicing for Model Validation. ICDE (2019).
● Data Validation for Machine Learning. SysML (2019).
● TFX: A TensorFlow-Based Production-Scale Machine Learning Platform. KDD (2017).
● Data Management Challenges in Production Machine Learning. SIGMOD (2017).
● Rules of Machine Learning: Best Practices for ML Engineering. Google AI Web (2017).
● Machine Learning: The High Interest Credit Card of Technical Debt. NeurIPS (2015).
● Hidden Technical Debt in Machine Learning Systems. NeurIPS (2015).
47. Focus of today: Code & Config → Training pipeline → Registered model → Deployed model → Serving logs
Vertex AI components along the way: Vertex Workbench, Cloud Build, Vertex Feature Store, Vertex Training and Pipelines, Vertex ML Metadata, Vertex Endpoints and Prediction, Vertex Model Monitoring
48. Feature Store in one picture
● Ingestion: a Batch Ingestion API fed by batch feature engineering over a data lake (BQ, GCS), and a Stream Ingestion API fed by streaming feature engineering over Kafka/Pub/Sub.
● Storage: an Online Store (with cache) and an Offline Store.
● Serving: an Online Serving API for online prediction, and a Batch Serving API with point-in-time lookups for model training.
● Management: Feature Management API, Feature Discovery API, Registry, and Feature Monitoring.
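The point-in-time lookups in that picture are what keep training sets leakage-free: for each training example, the store serves the feature value that was current at the example's timestamp, never a later one. A minimal sketch of the idea in Python (the data shapes and function name are illustrative, not the Vertex AI Feature Store API):

```python
from bisect import bisect_right

def point_in_time_lookup(history, ts):
    """history: list of (timestamp, value) pairs sorted by timestamp.
    Return the value current as of ts (latest entry with timestamp <= ts)."""
    i = bisect_right(history, (ts, float("inf")))
    return history[i - 1][1] if i else None

# Feature history for one entity: account balance over time.
balance_history = [(100, 50.0), (200, 75.0), (300, 20.0)]

print(point_in_time_lookup(balance_history, 250))  # value as of t=250 -> 75.0
print(point_in_time_lookup(balance_history, 300))  # exact match      -> 20.0
print(point_in_time_lookup(balance_history, 50))   # before any data  -> None
```

A training example stamped t=250 sees the balance as it was then (75.0), even though the balance later dropped to 20.0; serving the latest value instead would leak the future into training.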
49. How does the new SDK fit in the picture?
The same Feature Store picture as on the previous slide, with the Vertex AI SDK layered on top as the single entry point for the data engineer, ML engineer, and data scientist.
51. Scalable training and serving on Vertex AI
● BigQuery ML (Data Analyst). Use when: all your data is contained in BigQuery; users are most comfortable with SQL; and the set of models available in BigQuery ML matches the problem you're trying to solve.
● AutoML, served with Vertex Prediction (ML Developer). Use when: your problem fits into one of the types AutoML supports; offers a point-and-click workflow. Natural Language and Video models are served from Google Cloud, while Vision and Tables support edge / downloadable models.
● Vertex Training, served with Vertex Prediction (Data Scientist). Use when: your problem doesn't match the criteria listed above for BigQuery ML or AutoML, or you're already running training on-premises or on another cloud and need consistency across the platforms.
52. BigQuery ML Roadmap for 2021
Themes: model development and data science; model deployment & management (MLOps); Explainable AI
● H1'21: TF Wide and Deep NNs (Preview); Autoencoders (Preview); PCA (Preview); P-values for linear models (Preview); Hyperparameter Tuning (Preview); Anomaly Detection (Preview); AutoML Tables (GA)
● H2'21: TF Wide and Deep NNs (GA); Autoencoders (GA); PCA (GA); P-values for linear models (GA); Hyperparameter Tuning (GA); Anomaly Detection (GA); Multivariate Time Series (AutoML) (Preview); Model Registry (Preview); Managed Pipelines (Preview); Explainable AI (Preview)
53. Preparing the training data
A mix of demographic & behavioural data; each row is a different user.
54. Preparing the training data
Same table; the goal is to create a cluster label for each user (e.g. 3, 2, 3, 2, 1).
56. Build and train with CREATE MODEL
CREATE OR REPLACE MODEL
  mydataset.kmeans_3
OPTIONS(
  model_type='KMEANS',
  kmeans_init_method='KMEANS++',
  num_clusters=3
) AS
SELECT
  * EXCEPT(userId)
FROM
  mydataset.train
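Behind model_type='KMEANS' is the classic Lloyd's algorithm. A toy pure-Python sketch of the training loop over 2-D points (naive random init for brevity, where BigQuery ML's KMEANS++ picks spread-out starting centroids; everything here is illustrative, not BigQuery's implementation):

```python
import math
import random

def kmeans(points, k, iters=20, seed=0):
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # naive init; k-means++ would spread these out
    for _ in range(iters):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: math.dist(p, centroids[j]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        for j, members in enumerate(clusters):
            if members:
                centroids[j] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return centroids

# Three well-separated blobs of "users" in a 2-D feature space.
pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (20, 0), (21, 0), (20, 1)]
print(sorted(kmeans(pts, 3)))
```

With a good init, each returned centroid lands near one blob; the cluster a user's row is assigned to is exactly the label BigQuery ML emits for it.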
61. Anomaly detection with k-means
Fraud detection
Each row is a transaction
Which rows are anomalies?
62. CREATE MODEL - k-means clustering
#Query for model training
CREATE MODEL demo.kmeans_model
OPTIONS(
  model_type='kmeans',
  num_clusters=8,
  kmeans_init_method='kmeans++'
)
AS
SELECT * EXCEPT(Time, Class)
FROM
  `bigquery-public-data.ml_datasets.ulb_fraud_detection`;
63. ML.DETECT_ANOMALIES with k-means clustering
#Query for creating anomaly detection results
SELECT
  *
FROM
  ML.DETECT_ANOMALIES(
    MODEL demo.kmeans_model,
    STRUCT(0.005 AS contamination),
    TABLE `bigquery-public-data.ml_datasets.ulb_fraud_detection`
  );
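The contamination parameter is simply a fraction of rows to flag: with a k-means model, ML.DETECT_ANOMALIES marks the rows farthest from their nearest centroid, up to that fraction of the data. The idea in miniature (toy data and a hypothetical helper, not BigQuery's implementation):

```python
import math

def detect_anomalies(points, centroids, contamination):
    """Flag the `contamination` fraction of points farthest from any centroid."""
    dists = [min(math.dist(p, c) for c in centroids) for p in points]
    n_flag = max(1, int(len(points) * contamination))
    cutoff = sorted(dists, reverse=True)[n_flag - 1]
    return [d >= cutoff for d in dists]

centroids = [(0.0, 0.0), (10.0, 10.0)]
points = [(0.1, 0.2), (9.8, 10.1), (0.3, -0.1), (50.0, 50.0)]  # last point is far from both centroids
print(detect_anomalies(points, centroids, 0.25))  # -> [False, False, False, True]
```

With contamination=0.005 on the fraud table, roughly 0.5% of transactions — the ones that fit no cluster — come back flagged as anomalies.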
66.-69. Hyperparameter tuning with BigQuery ML (Preview)
Automated HP tuning: have BigQuery ML automatically search for the optimal hyperparameters, using Vertex Vizier under the hood. Easy to use, no need to be an expert in HPs, and it saves the time of manually training models with different HPs.
1. Select the number of trials:
CREATE MODEL
  mydataset.my_logreg_model
OPTIONS(
  model_type="logistic_reg",
  input_label_cols=["mylabel"],
  num_trials=20
) AS
SELECT
  *
FROM
  mydataset.my_training_data
2. Inspect the trials info (even while it's still training!):
SELECT
  *
FROM
  ML.TRIAL_INFO(MODEL mydataset.my_logreg_model)
3. Evaluate your model.
4. Predict:
SELECT
  *
FROM
  ML.PREDICT(MODEL mydataset.my_logreg_model,
    TABLE mydataset.my_training_data)
70. How to import TensorFlow models to do batch predictions in BigQuery using BigQuery ML
71. Importing TensorFlow models into BigQuery: CREATE MODEL, then PREDICT
https://towardsdatascience.com/how-to-do-batch-predictions-of-tensorflow-models-directly-in-bigquery-ffa843ebdba6
https://cloud.google.com/bigquery-ml/docs/making-predictions-with-imported-tensorflow-models
73. Question: can we do text similarity based on embeddings?
# The following are example embedding outputs of 20 dimensions per sentence
# Embedding for: The quick brown fox jumps over the lazy dog.
# [0.0560572519898, 0.0534118898213, -0.0112254749984, ...]
# Embedding for: I am a sentence for which I would like to get its embedding.
# [-0.0343746766448, -0.0529498048127, 0.0469399243593, ...]
74. Text similarity using an imported TensorFlow model
Goal: I want to search for comments similar to: "power line down on a home"
https://towardsdatascience.com/how-to-do-text-similarity-search-and-document-clustering-in-bigquery-75eb8f45ab65
75. Step 1: Save the TensorFlow model to GCS
Step 2: CREATE MODEL using the GCS folder path
CREATE OR REPLACE MODEL
  mydataset.swivel_text_embed
OPTIONS(
  model_type='tensorflow',
  model_path='gs://BUCKET/swivel/*')
76. Step 3: Use ML.PREDICT to get comment embeddings
SELECT
  *
FROM
  ML.PREDICT(MODEL mydataset.swivel_text_embed,
    (SELECT
      comments AS sentences
    FROM
      mydataset.mydata));
Each comment's text is converted into an embedding of 20 floating-point values.
78. Step 4: Calculate distance between embeddings to compute text similarity
Input search term: "power line down on a home"
Result: the top 15 most similar comments to the input
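Step 4 is a nearest-neighbor search over the embedding vectors. A sketch using cosine similarity over made-up 3-D vectors (real Swivel embeddings are 20-D; the corpus and numbers here are hypothetical):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

def top_k_similar(query_vec, corpus, k=2):
    """corpus: list of (comment, embedding). Return the k most similar comments."""
    ranked = sorted(corpus, key=lambda item: cosine_similarity(query_vec, item[1]),
                    reverse=True)
    return [comment for comment, _ in ranked[:k]]

corpus = [
    ("power line down on a home", (0.9, 0.1, 0.0)),
    ("tree fell on power lines", (0.8, 0.2, 0.1)),
    ("requesting a trash pickup", (0.0, 0.1, 0.9)),
]
query = (0.85, 0.15, 0.05)  # embedding of the search term
print(top_k_similar(query, corpus))
```

In BigQuery the same ranking is expressed in SQL over the ML.PREDICT output; the math is identical, just run over 20 dimensions and the full comment table.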
79. Exporting BQML models for use with Vertex
Model trained with BigQuery ML → export to Cloud Storage → Vertex Pipelines
https://github.com/GoogleCloudPlatform/analytics-componentized-patterns/tree/master/retail/recommendation-system/bqml-mlops
80. What's included in Vertex AI?
Workflow stages: data readiness → feature engineering → training / HP-tuning → understanding / tuning → model serving → model monitoring → model management
Products and services: Datasets, Data Labeling, Feature Store, Notebooks, Experiments, Training, AutoML, DL Environment (DL VM + DL Container), AI Accelerators, Vizier Optimization, Pipelines (Orchestration), Explainable AI, Prediction, Edge, Hybrid AI, Continuous Monitoring, Model Monitoring, Metadata, and pretrained APIs for Vision, Video, Language, Translation, and Tables
81. Vertex Pipelines: Key capabilities
● Python SDKs: data-scientist-friendly Python SDKs.
● Serverless and scalable: run as many pipelines on as much data as you want.
● Metadata and lineage: store metadata for every artifact produced by the pipeline.
● Monitoring UIs and APIs: track and debug pipeline executions.
● Security: supports Cloud IAM, VPC-SC, and CMEK.
● Cost-effective: only pay for the pipelines you run and the resources they use.
87. Experimentation management with Vertex Pipelines
Iterative experimentation: Data Prep → Feature Eng → Model Training → Model Eval, over development datasets / features.
Training pipeline source code lives in a source repository, the training pipeline runs with automation, and parameters, metrics, and artifacts are recorded in experiment tracking.
88. Continuous Training with Vertex Pipelines
Orchestrated training pipeline: Data Extraction → Data Validation → Data Prep → Model Training → Model Eval → Model Validation, reading development datasets / features and writing the trained model to a Model Registry & Artifact Store, with training pipeline metadata captured throughout.
Training pipeline source code is delivered through training pipeline CI/CD.
89. Evaluate and Understand Models
● What-If Tool (WIT), for tabular data: visually probe the behavior of trained machine learning models, with minimal coding.
● Language Interpretability Tool (LIT), for text: an open-source platform for visualization and understanding of NLP models.
90. A canonical ML workflow
1. Experimentation: EDA / prototyping, training pipeline dev.
2. (Re)Training: pipeline CI/CD driving data validation → feature engineering → model training → model evaluation → candidate model generation into a model registry, with retrain triggers.
3. Model deployment: model serving, canary & A/B testing.
4. Continuous model monitoring: model performance monitoring.
Spanning training and serving, all of it is governed by model management & governance: model cards & reporting, model provenance, compliance.