Gartner Catalyst 2017: The Data Warehouse Blueprint for ML, AI, and Hybrid Cloud
1. The Data Warehouse Blueprint for
ML, AI, and Hybrid Cloud
@garyorenstein @memsql
MemSQL 1
2. Today’s Talk
A Data Warehouse Blueprint for
• Machine Learning and Artificial Intelligence
• Hybrid Cloud
Live demonstration of machine learning in SQL
• k-means clustering
3. Demonstration Step 1
1. Launch cluster
2. Set up k_means functions with MemSQL extensibility
3. Load data
4. Train data
5. Gain insights
• important_tags.sql
• representative_channels.sql
5. The Real-Time Data Warehouse
for the front lines of your business
6. What is a real-time data warehouse?
Similar to an
“Operational Data Warehouse”
7. A Real-Time Data Warehouse
• Adds real-time to analytics
• Reduces latency and ETL
• Manages structured data, loaded continuously
• Supports real-time decisions with embedded analytics
• Serves as an operational data store
• Delivers low latency reporting with automated queries
8. MemSQL: A Real-Time Data Warehouse
Streaming, Live and Historical Data
Immediate Insights with SQL
Scalable and distributed
22. ...you can’t do AI without
machine learning. You also can’t
do machine learning without
analytics, and you can’t do analytics
without data infrastructure.
— Hilary Mason, Data Scientist
23. Demonstration Step 2 and 3
1. Launch cluster
2. Set up k_means functions with MemSQL extensibility
3. Load data
4. Train data
5. Gain insights
• important_tags.sql
• representative_channels.sql
26. Over a billion users
Almost 1/3 of all people on the
Internet
Every day those users watch a
billion hours of video, generating
billions of views.
28. YouTube Tags Data Set
Channel, Video, Tag
(Gary’s Channel, GO Video 1, hi)
(Gary’s Channel, GO Video 1, hello)
(Gary’s Channel, GO Video 2, hello)
(Gary’s Channel, GO Video 2, blue)
“Tag” Vector for Gary’s Channel
(hi:1, hello:2, blue:1)
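The tag vector above can be reproduced with a short sketch. This is illustrative Python, not the MemSQL code used in the demo; the helper name `tag_vectors` is hypothetical, and the rows are the four example rows from the slide.

```python
from collections import Counter, defaultdict

# Example rows from the slide: (channel, video, tag).
rows = [
    ("Gary's Channel", "GO Video 1", "hi"),
    ("Gary's Channel", "GO Video 1", "hello"),
    ("Gary's Channel", "GO Video 2", "hello"),
    ("Gary's Channel", "GO Video 2", "blue"),
]

def tag_vectors(rows):
    """Count how often each tag appears per channel (hypothetical helper)."""
    vectors = defaultdict(Counter)
    for channel, _video, tag in rows:
        vectors[channel][tag] += 1
    # Convert to plain dicts so the result prints cleanly.
    return {channel: dict(counts) for channel, counts in vectors.items()}

vectors = tag_vectors(rows)
print(vectors["Gary's Channel"])  # {'hi': 1, 'hello': 2, 'blue': 1}
```

The video column is ignored on purpose: the vector describes the channel as a whole, so a tag used on two videos counts twice.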
29. Now we can compare vectors and
calculate clusters with k-means
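One way to compare two tag vectors is squared Euclidean distance, treating a missing tag as a count of zero. The slides do not say which distance the demo uses, so the metric here is an assumption, and the second vector is made up purely for illustration.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two sparse tag vectors (dicts).

    A tag missing from one vector is treated as a count of 0.
    """
    tags = set(a) | set(b)
    return sum((a.get(t, 0) - b.get(t, 0)) ** 2 for t in tags)

gary = {"hi": 1, "hello": 2, "blue": 1}   # vector from the slide
other = {"hello": 1, "blue": 3}           # made-up vector, illustration only

print(squared_distance(gary, other))  # 6  (hi: 1, hello: 1, blue: 4)
```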
30. k-means clustering partitions
observations into k clusters
Each observation belongs to the
cluster with the nearest mean,
serving as a prototype of the cluster
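The definition above has two parts: each cluster is summarized by its mean, and an observation belongs to whichever mean is nearest. A minimal Python sketch of both, using made-up 2-D points and assuming squared Euclidean distance (the helper names `mean` and `nearest` are hypothetical):

```python
def mean(points):
    """Component-wise mean of a non-empty list of equal-length points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

def nearest(point, means):
    """Index of the mean closest to point by squared Euclidean distance."""
    def dist2(m):
        return sum((p - q) ** 2 for p, q in zip(point, m))
    return min(range(len(means)), key=lambda i: dist2(means[i]))

# Two made-up clusters of 2-D observations (illustration only).
clusters = [[(0.0, 0.0), (2.0, 0.0)], [(10.0, 10.0), (12.0, 10.0)]]
means = [mean(c) for c in clusters]  # prototypes: (1.0, 0.0) and (11.0, 10.0)

print(nearest((3.0, 1.0), means))  # 0, the first cluster's mean is closer
```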
33. K-means in MemSQL with Extensibility
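-- Run k-means: seed the centroids, then refine them for num_its iterations.
create or replace procedure k_means(num_its bigint, num_centroids bigint)
as
begin
  -- Choose the starting centroids.
  call initialize_centroids(num_centroids);
  -- Run one k-means pass per loop iteration.
  for i in 1 .. num_its loop
    call k_means_iteration();
  end loop;
end //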
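The procedure has a simple shape: initialize once, then iterate a fixed number of times. The Python sketch below mirrors that shape so the control flow can be tried outside the database. The function names are borrowed from the procedure, but their bodies are assumptions: the slides do not show `initialize_centroids` or `k_means_iteration`, so this uses random initial picks and the textbook assign-then-average step. Points are passed in explicitly here, whereas the procedure presumably reads them from tables.

```python
import random

def initialize_centroids(points, num_centroids, rng):
    """Pick num_centroids distinct observations as starting centroids."""
    return [list(p) for p in rng.sample(points, num_centroids)]

def k_means_iteration(points, centroids):
    """One pass: assign each point to its nearest centroid, then re-average."""
    groups = [[] for _ in centroids]
    for p in points:
        idx = min(
            range(len(centroids)),
            key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centroids[i])),
        )
        groups[idx].append(p)
    new_centroids = []
    for old, group in zip(centroids, groups):
        if group:
            new_centroids.append([sum(c) / len(group) for c in zip(*group)])
        else:
            new_centroids.append(old)  # keep an empty cluster's centroid as-is
    return new_centroids

def k_means(points, num_its, num_centroids, seed=0):
    """Same control flow as the stored procedure: initialize, then loop."""
    rng = random.Random(seed)
    centroids = initialize_centroids(points, num_centroids, rng)
    for _ in range(num_its):
        centroids = k_means_iteration(points, centroids)
    return centroids

# Made-up data: two well-separated pairs of points.
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
print(sorted(k_means(points, num_its=5, num_centroids=2)))
# [[0.0, 0.5], [10.0, 10.5]]
```

A fixed iteration count, as in the procedure, keeps the runtime predictable; a convergence check would be the usual alternative.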
34. Demonstration Step 4 and 5
1. Launch cluster
2. Set up k_means functions with MemSQL extensibility
3. Load data
4. Train data
5. Gain insights
• important_tags.sql
• representative_channels.sql
35. Steps 4 and 5
Train and Gain Insights
36. important_tags.sql
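-- For each centroid, list the fields (tags) whose value is furthest above
-- the average across all centroids.
select centroid_id, field_ids.field_id, importance, rn
from
(
  select centroids.centroid_id,
         centroids.field_id,
         -- importance = this centroid's value minus the cross-centroid average
         centroids.val - centroid_sums.val importance,
         row_number() over (partition by centroids.centroid_id
                            order by centroids.val - centroid_sums.val desc) rn
  from centroids
  join
  (
    -- despite the alias, this is the per-field average over all centroids
    select field_id,
           sum(val) / (select count(distinct centroid_id) from centroids) as val
    from centroids
    group by field_id
  ) centroid_sums
  on centroids.field_id = centroid_sums.field_id
) centroids
join field_ids
on centroids.field_id = field_ids.id
where rn < 10  -- keep the top 9 fields per centroid
order by centroid_id, rn;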
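The query scores each field of a centroid by how far its value sits above that field's average over all centroids, then keeps the highest-ranked fields per centroid. The Python sketch below applies the same arithmetic to a small made-up input. The function name `important_fields` and the nested-dict layout are hypothetical, and tag names stand in directly for the `field_ids` lookup.

```python
def important_fields(centroids, top_n=9):
    """Rank fields per centroid by (value - average value across centroids).

    centroids: {centroid_id: {field_id: val}}
    returns:   {centroid_id: [(field_id, importance), ...]} with the highest
               importance first, at most top_n entries per centroid.
    """
    num_centroids = len(centroids)
    totals = {}
    for fields in centroids.values():
        for field_id, val in fields.items():
            totals[field_id] = totals.get(field_id, 0.0) + val
    # Like the SQL, divide by the number of centroids, not by row count.
    averages = {f: total / num_centroids for f, total in totals.items()}

    ranked = {}
    for centroid_id, fields in centroids.items():
        scored = [(f, val - averages[f]) for f, val in fields.items()]
        scored.sort(key=lambda item: item[1], reverse=True)
        ranked[centroid_id] = scored[:top_n]
    return ranked

# Made-up centroid values (illustration only).
centroids = {
    0: {"hello": 4.0, "blue": 0.0},
    1: {"hello": 2.0, "blue": 6.0},
}
print(important_fields(centroids, top_n=1))
# {0: [('hello', 1.0)], 1: [('blue', 3.0)]}
```

Subtracting the average means a tag that is common everywhere scores near zero, so the top entries are the tags that set a cluster apart from the others.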