Do you know The Cloud Girl? She makes the cloud come alive with pictures and storytelling.
The Cloud Girl, Priyanka Vergadia, Chief Content Officer @Google, joins us to tell us about Scalable Data Analytics in Google Cloud.
Maybe, with her explanation, we'll finally understand it!
Priyanka is a technical storyteller and content creator who has created over 300 videos, articles, podcasts, courses, and tutorials that help developers learn Google Cloud fundamentals, solve their business challenges, and pass certifications! Check out her content on the Google Cloud Tech YouTube channel.
Priyanka enjoys drawing and painting, which she tries to bring into her advocacy.
Check out her website The Cloud Girl: https://thecloudgirl.dev/ and her new book: https://www.amazon.com/Visualizing-Google-Cloud-Illustrated-References/dp/1119816327
GDG Cloud Southlake #16: Priyanka Vergadia: Scalable Data Analytics in Google Cloud
1. Proprietary + Confidential
Data Science in Cloud
Quick Tour
Priyanka Vergadia
Staff Developer Advocate
Google Cloud
Twitter: @pvergadia
2. Agenda
1. Google Cloud Orientation
2. Data Science
3. Data Analytics Flow
4. MLOps
5. Some Google Cloud Tools - BigQuery, BigQuery ML and Vertex AI
6. Wrap up
4. Things I don't want to think about...
1. Provisioning hardware
2. Installing software
3. Upgrading operating systems
4. Security patching
5. System and network admin
6. Scaling up/down
7. Paying for stuff I don't use
8. Dealing with failures
9. Managing clusters
6. Getting things done using someone else's computers, especially where someone else worries about maintenance, provisioning, system administration, security, networking, failure recovery, etc.
13. "The real problems with an ML system will be found while you are continuously operating it for the long term."
Launching is easy,
Operating is hard.
15. …a product requires so much more
The surrounding infrastructure (with ML Code as the small box in the middle of the diagram):
● Configuration
● Data Collection
● Data Verification
● Feature Extraction
● Process Management Tools
● Analysis Tools
● Machine Resource Management
● Serving Infrastructure
● Monitoring
16. Why do things become harder in production?
(an incomplete list)
● data cleaning and processing is hard at scale
● scaling out training and serving; infrastructure issues
● tracking, monitoring, and reproducibility requirements
● model or data drift
● training/serving skew
● access control issues, security requirements
● (and lots more)
17. The level of automation defines the maturity of the ML process
● Level 0: Build and deploy manually
● Level 1: Automate the training phase
● Level 2: Automate training, validation, and deployment
19. "On any given day there are thousands of TFX pipelines running, which are processing exabytes of data and producing tens of thousands of models, which in turn are performing hundreds of millions of inferences per second."
22. Google BigQuery
Data warehouse with customers ranging from TB to 100+ PB
● Insights for everyone
● Cloud-scale enterprise data warehouse
● Serverless platform (unique)
● Standard SQL (ANSI 2011) with DML support
● Encrypted, durable, highly available (unique)
● Built-in ML (unique)
● Real-time insights (unique)
23. In ~15s:
● Read 2 TB: ~1k disks
● Run 50B regexps: ~3k cores
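A quick back-of-the-envelope check of those figures (the 2 TB, ~1k disks, 50B regexps, and ~3k cores are from the slide; everything else is simple arithmetic):

```python
# Sanity-check the slide's BigQuery demo numbers with simple arithmetic.
SECONDS = 15
BYTES_READ = 2e12   # 2 TB scanned
DISKS = 1_000       # ~1k disks
REGEXPS = 50e9      # 50 billion regexp evaluations
CORES = 3_000       # ~3k cores

aggregate_bps = BYTES_READ / SECONDS   # fleet-wide read throughput, bytes/sec
per_disk_bps = aggregate_bps / DISKS   # what each disk contributes
regexps_per_core_per_sec = REGEXPS / SECONDS / CORES

print(f"aggregate read:  {aggregate_bps / 1e9:.0f} GB/s")        # ~133 GB/s
print(f"per disk:        {per_disk_bps / 1e6:.0f} MB/s")         # ~133 MB/s
print(f"regexps/core/s:  {regexps_per_core_per_sec / 1e6:.1f}M") # ~1.1M
```

~133 MB/s per disk is in the range of a single spinning disk's sequential read speed: the scale comes from fanning out over ordinary hardware, not from exotic disks.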
24. BigQuery ML: train and deploy ML models in SQL
● Execute ML workflows without moving data from BigQuery
● Automate common ML tasks
● Built-in infrastructure management, security & compliance
25. BigQuery ML supported models and features
The data analyst's onramp to AI and ML
● Classification: logistic regression; DNN classifier (TensorFlow); XGBoost; AutoML Tables
● Regression: linear regression; DNN regressor (TensorFlow); XGBoost; AutoML Tables
● Other models: k-means clustering; time series forecasting; recommendation (matrix factorization); time series anomaly detection (Preview Q2'21, GA H2'21); Wide and Deep NNs (Preview, GA H1'21); autoencoders
● Model ops and explainability: import/export TensorFlow models for batch and online prediction; hyperparameter tuning using Cloud AI Vizier (Preview H1'21, GA H2'21); model explainability using Cloud AI (Preview H1'21, GA H2'21); managed Kubernetes and TFX pipelines (Preview H2'21, GA 2022); list models for comparison and online deployment in Cloud AI (Preview H2'21, GA 2022); model versioning and continuous monitoring (future)
27. Vertex AI is a managed ML platform to speed the rate of experimentation and accelerate deployment of AI models.
28. The end-to-end ML journey through Vertex AI
● Where can I find training data? → Feature Store, Datasets
● Where do I start with model experiments? → Workbench
● How can I track the results of experiments? → Experiments
● How can I train at scale? → Training
● How do I deploy? → Endpoints
● And for production? → Monitoring, Pipelines
30. Learn more
● Introduction to Data Science blog: https://goo.gle/dsintro
● Getting started docs: cloud.google.com/vertex-ai/docs
● Get started in Cloud Console: console.cloud.google.com/ai/platform
● Best practices: cloud.google.com/architecture/ml-on-gcp-best-practices
34. What is Dataplex?
● Built for distributed data: logically unify and organize your data without any data movement.
● Intelligent data management: automatic data discovery, metadata harvesting, lifecycle management, and data quality with built-in AI-driven intelligence.
● Centralized security & governance: central policy management, monitoring, and auditing for data authorization, retention, and classification.
The diagram shows analytics tools (BigQuery, Dataproc, AI Platform, Data Studio, Dataflow) sitting on Dataplex, which provides data lifecycle management (ingest, discover, prep, monitor, serve, archive), logical data organization, unified security and governance, unified metadata with auto-discovery, data classification and data quality, and data intelligence, over storage for structured, semi-structured, unstructured, and streaming* data across GCP, on-premises*, and multi-cloud* (*future capabilities).
36. Data Science On Google Cloud: A Guided Tour
Polong Lin & Marc Cohen
Developer Relations Engineers, Google Cloud
Slides: mco.fyi/ds
40. Complexity is a barrier to adoption
41. HELLO CSECT The name of this program is 'HELLO'
* Register 15 points here on entry from OPSYS or caller.
STM 14,12,12(13) Save registers 14,15, and 0 thru 12 in caller's Save area
LR 12,15 Set up base register with program's entry point address
USING HELLO,12 Tell assembler which register we are using for pgm. base
LA 15,SAVE Now Point at our own save area
ST 15,8(13) Set forward chain
ST 13,4(15) Set back chain
LR 13,15 Set R13 to address of new save area
* -end of housekeeping (similar for most programs) -
WTO 'Hello World' Write To Operator (Operating System macro)
*
L 13,4(13) restore address to caller-provided save area
XC 8(4,13),8(13) Clear forward chain
LM 14,12,12(13) Restore registers as on entry
DROP 12 The opposite of 'USING'
SR 15,15 Set register 15 to 0 so that the return code (R15) is Zero
BR 14 Return to caller
*
SAVE DS 18F Define 18 fullwords to save calling program registers
END HELLO This is the end of the program
45. Production ML Research
● Continuous Training for Production ML in the TFX Platform. OpML (2019).
● Slice Finder: Automated Data Slicing for Model Validation. ICDE (2019).
● Data Validation for Machine Learning. SysML (2019).
● TFX: A TensorFlow-Based Production-Scale Machine Learning Platform. KDD (2017).
● Data Management Challenges in Production Machine Learning. SIGMOD (2017).
● Rules of Machine Learning: Best Practices for ML Engineering. Google AI Web (2017).
● Machine Learning: The High Interest Credit Card of Technical Debt. NeurIPS (2015).
● Hidden Technical Debt in Machine Learning Systems. NeurIPS (2015).
47. Focus of today: Code & Config → Training pipeline → Registered model → Deployed model → Serving logs
Vertex AI components along the way: Vertex Workbench, Cloud Build, Vertex Feature Store, Vertex Training and Pipelines, Vertex ML Metadata, Vertex Endpoints and Prediction, Vertex Model Monitoring
48. Feature Store in one picture
● Ingestion: a Batch Ingestion API fed by batch feature engineering over a data lake (BQ, GCS), and a Stream Ingestion API fed by streaming feature engineering over Kafka/Pub/Sub.
● Storage: an Online Store (with cache) and an Offline Store.
● Serving: an Online Serving API for online prediction, and a Batch Serving API with point-in-time lookups for model training.
● Management: Feature Management API, Feature Discovery API, Registry, and Feature Monitoring.
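The point-in-time lookups in that picture are what keep training sets leakage-free: for each training example, the store serves the feature value that was current at the example's timestamp, never a later one. A minimal sketch of the idea in Python (the data shapes and function name are illustrative, not the Vertex AI Feature Store API):

```python
from bisect import bisect_right

def point_in_time_lookup(history, ts):
    """history: list of (timestamp, value) pairs sorted by timestamp.
    Return the value current as of ts (latest entry with timestamp <= ts)."""
    i = bisect_right(history, (ts, float("inf")))
    return history[i - 1][1] if i else None

# Feature history for one entity: account balance over time.
balance_history = [(100, 50.0), (200, 75.0), (300, 20.0)]

print(point_in_time_lookup(balance_history, 250))  # value as of t=250 -> 75.0
print(point_in_time_lookup(balance_history, 300))  # exact match      -> 20.0
print(point_in_time_lookup(balance_history, 50))   # before any data  -> None
```

A training example stamped t=250 sees the balance as it was then (75.0), even though the balance later dropped to 20.0; serving the latest value instead would leak the future into training.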
49. How does the new SDK fit in the picture?
The same Feature Store picture as on the previous slide, with the Vertex AI SDK layered on top as the single entry point for the data engineer, ML engineer, and data scientist.
51. Scalable training and serving on Vertex AI
● BigQuery ML (Data Analyst). Use when: all your data is contained in BigQuery; users are most comfortable with SQL; and the set of models available in BigQuery ML matches the problem you're trying to solve.
● AutoML, served with Vertex Prediction (ML Developer). Use when: your problem fits into one of the types AutoML supports; offers a point-and-click workflow. Natural Language and Video models are served from Google Cloud, while Vision and Tables support edge / downloadable models.
● Vertex Training, served with Vertex Prediction (Data Scientist). Use when: your problem doesn't match the criteria listed above for BigQuery ML or AutoML, or you're already running training on-premises or on another cloud and need consistency across the platforms.
52. BigQuery ML Roadmap for 2021
Themes: model development and data science; model deployment & management (MLOps); Explainable AI
● H1'21: TF Wide and Deep NNs (Preview); Autoencoders (Preview); PCA (Preview); P-values for linear models (Preview); Hyperparameter Tuning (Preview); Anomaly Detection (Preview); AutoML Tables (GA)
● H2'21: TF Wide and Deep NNs (GA); Autoencoders (GA); PCA (GA); P-values for linear models (GA); Hyperparameter Tuning (GA); Anomaly Detection (GA); Multivariate Time Series (AutoML) (Preview); Model Registry (Preview); Managed Pipelines (Preview); Explainable AI (Preview)
53. Preparing the training data
A mix of demographic & behavioural data; each row is a different user.
54. Preparing the training data
Same table; the goal is to create a cluster label for each user (e.g. 3, 2, 3, 2, 1).
56. Build and train with CREATE MODEL
CREATE OR REPLACE MODEL
  mydataset.kmeans_3
OPTIONS(
  model_type='KMEANS',
  kmeans_init_method='KMEANS++',
  num_clusters=3
) AS
SELECT
  * EXCEPT(userId)
FROM
  mydataset.train
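Behind model_type='KMEANS' is the classic Lloyd's algorithm. A toy pure-Python sketch of the training loop over 2-D points (naive random init for brevity, where BigQuery ML's KMEANS++ picks spread-out starting centroids; everything here is illustrative, not BigQuery's implementation):

```python
import math
import random

def kmeans(points, k, iters=20, seed=0):
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # naive init; k-means++ would spread these out
    for _ in range(iters):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: math.dist(p, centroids[j]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        for j, members in enumerate(clusters):
            if members:
                centroids[j] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return centroids

# Three well-separated blobs of "users" in a 2-D feature space.
pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (20, 0), (21, 0), (20, 1)]
print(sorted(kmeans(pts, 3)))
```

With a good init, each returned centroid lands near one blob; the cluster a user's row is assigned to is exactly the label BigQuery ML emits for it.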
61. Anomaly detection with k-means
Fraud detection
Each row is a transaction
Which rows are anomalies?
62. CREATE MODEL - k-means clustering
#Query for model training
CREATE MODEL demo.kmeans_model
OPTIONS(
  model_type='kmeans',
  num_clusters=8,
  kmeans_init_method='kmeans++'
)
AS
SELECT * EXCEPT(Time, Class)
FROM
  `bigquery-public-data.ml_datasets.ulb_fraud_detection`;
63. ML.DETECT_ANOMALIES with k-means clustering
#Query for creating anomaly detection results
SELECT
  *
FROM
  ML.DETECT_ANOMALIES(
    MODEL demo.kmeans_model,
    STRUCT(0.005 AS contamination),
    TABLE `bigquery-public-data.ml_datasets.ulb_fraud_detection`
  );
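The contamination parameter is simply a fraction of rows to flag: with a k-means model, ML.DETECT_ANOMALIES marks the rows farthest from their nearest centroid, up to that fraction of the data. The idea in miniature (toy data and a hypothetical helper, not BigQuery's implementation):

```python
import math

def detect_anomalies(points, centroids, contamination):
    """Flag the `contamination` fraction of points farthest from any centroid."""
    dists = [min(math.dist(p, c) for c in centroids) for p in points]
    n_flag = max(1, int(len(points) * contamination))
    cutoff = sorted(dists, reverse=True)[n_flag - 1]
    return [d >= cutoff for d in dists]

centroids = [(0.0, 0.0), (10.0, 10.0)]
points = [(0.1, 0.2), (9.8, 10.1), (0.3, -0.1), (50.0, 50.0)]  # last point is far from both centroids
print(detect_anomalies(points, centroids, 0.25))  # -> [False, False, False, True]
```

With contamination=0.005 on the fraud table, roughly 0.5% of transactions — the ones that fit no cluster — come back flagged as anomalies.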
66.-69. Hyperparameter tuning with BigQuery ML (Preview)
Automated HP tuning: have BigQuery ML automatically search for the optimal hyperparameters, using Vertex Vizier under the hood. Easy to use, no need to be an expert in HPs, and it saves the time of manually training models with different HPs.
1. Select the number of trials:
CREATE MODEL
  mydataset.my_logreg_model
OPTIONS(
  model_type="logistic_reg",
  input_label_cols=["mylabel"],
  num_trials=20
) AS
SELECT
  *
FROM
  mydataset.my_training_data
2. Inspect the trials info (even while it's still training!):
SELECT
  *
FROM
  ML.TRIAL_INFO(MODEL mydataset.my_logreg_model)
3. Evaluate your model.
4. Predict:
SELECT
  *
FROM
  ML.PREDICT(MODEL mydataset.my_logreg_model,
    TABLE mydataset.my_training_data)
70. How to import TensorFlow models to do batch predictions in BigQuery using BigQuery ML
71. Importing TensorFlow models into BigQuery: CREATE MODEL, then PREDICT
https://towardsdatascience.com/how-to-do-batch-predictions-of-tensorflow-models-directly-in-bigquery-ffa843ebdba6
https://cloud.google.com/bigquery-ml/docs/making-predictions-with-imported-tensorflow-models
73. Question: can we do text similarity based on embeddings?
# The following are example embedding outputs of 20 dimensions per sentence
# Embedding for: The quick brown fox jumps over the lazy dog.
# [0.0560572519898, 0.0534118898213, -0.0112254749984, ...]
# Embedding for: I am a sentence for which I would like to get its embedding.
# [-0.0343746766448, -0.0529498048127, 0.0469399243593, ...]
74. Text similarity using an imported TensorFlow model
Goal: I want to search for comments similar to: "power line down on a home"
https://towardsdatascience.com/how-to-do-text-similarity-search-and-document-clustering-in-bigquery-75eb8f45ab65
75. Step 1: Save the TensorFlow model to GCS
Step 2: CREATE MODEL using the GCS folder path
CREATE OR REPLACE MODEL
  mydataset.swivel_text_embed
OPTIONS(
  model_type='tensorflow',
  model_path='gs://BUCKET/swivel/*')
76. Step 3: Use ML.PREDICT to get comment embeddings
SELECT
  *
FROM
  ML.PREDICT(MODEL mydataset.swivel_text_embed,
    (SELECT
      comments AS sentences
    FROM
      mydataset.mydata));
Each comment's text is converted into an embedding of 20 floating-point values.
78. Step 4: Calculate distance between embeddings to compute text similarity
Input search term: "power line down on a home"
Result: the top 15 most similar comments to the input
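Step 4 is a nearest-neighbor search over the embedding vectors. A sketch using cosine similarity over made-up 3-D vectors (real Swivel embeddings are 20-D; the corpus and numbers here are hypothetical):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

def top_k_similar(query_vec, corpus, k=2):
    """corpus: list of (comment, embedding). Return the k most similar comments."""
    ranked = sorted(corpus, key=lambda item: cosine_similarity(query_vec, item[1]),
                    reverse=True)
    return [comment for comment, _ in ranked[:k]]

corpus = [
    ("power line down on a home", (0.9, 0.1, 0.0)),
    ("tree fell on power lines", (0.8, 0.2, 0.1)),
    ("requesting a trash pickup", (0.0, 0.1, 0.9)),
]
query = (0.85, 0.15, 0.05)  # embedding of the search term
print(top_k_similar(query, corpus))
```

In BigQuery the same ranking is expressed in SQL over the ML.PREDICT output; the math is identical, just run over 20 dimensions and the full comment table.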
79. Exporting BQML models for use with Vertex
Model trained with BigQuery ML → export to Cloud Storage → Vertex Pipelines
https://github.com/GoogleCloudPlatform/analytics-componentized-patterns/tree/master/retail/recommendation-system/bqml-mlops
80. What's included in Vertex AI?
Workflow stages: data readiness → feature engineering → training / HP-tuning → understanding / tuning → model serving → model monitoring → model management
Products and services: Datasets, Data Labeling, Feature Store, Notebooks, Experiments, Training, AutoML, DL Environment (DL VM + DL Container), AI Accelerators, Vizier Optimization, Pipelines (Orchestration), Explainable AI, Prediction, Edge, Hybrid AI, Continuous Monitoring, Model Monitoring, Metadata, and pretrained APIs for Vision, Video, Language, Translation, and Tables
81. Vertex Pipelines: Key capabilities
● Python SDKs: data-scientist-friendly Python SDKs.
● Serverless and scalable: run as many pipelines on as much data as you want.
● Metadata and lineage: store metadata for every artifact produced by the pipeline.
● Monitoring UIs and APIs: track and debug pipeline executions.
● Security: supports Cloud IAM, VPC-SC, and CMEK.
● Cost-effective: only pay for the pipelines you run and the resources they use.
87. Experimentation management with Vertex Pipelines
Iterative experimentation: Data Prep → Feature Eng → Model Training → Model Eval, over development datasets / features.
Training pipeline source code lives in a source repository, the training pipeline runs with automation, and parameters, metrics, and artifacts are recorded in experiment tracking.
88. Continuous Training with Vertex Pipelines
Orchestrated training pipeline: Data Extraction → Data Validation → Data Prep → Model Training → Model Eval → Model Validation, reading development datasets / features and writing the trained model to a Model Registry & Artifact Store, with training pipeline metadata captured throughout.
Training pipeline source code is delivered through training pipeline CI/CD.
89. Evaluate and Understand Models
● What-If Tool (WIT), for tabular data: visually probe the behavior of trained machine learning models, with minimal coding.
● Language Interpretability Tool (LIT), for text: an open-source platform for visualization and understanding of NLP models.
90. A canonical ML workflow
1. Experimentation: EDA / prototyping, training pipeline dev.
2. (Re)Training: pipeline CI/CD driving data validation → feature engineering → model training → model evaluation → candidate model generation into a model registry, with retrain triggers.
3. Model deployment: model serving, canary & A/B testing.
4. Continuous model monitoring: model performance monitoring.
Spanning training and serving, all of it is governed by model management & governance: model cards & reporting, model provenance, compliance.