CONFIDENTIAL © 2018
The road to AI is paved with
pragmatic intentions
Jean Georges “JG” Perrin
August 22nd 2018
JGP • Jean Georges Perrin
• @jgperrin
• Chapel Hill, NC
• I 🏗 SW • Since 1983
• #Knowledge = 𝑓(∑(#SmallData, #BigData), #DataScience) & #Software
• #IBMChampion x10 • #KeepLearning
• @ http://jgp.net
Who art thou?
• Experience with Spark?
• Who is familiar with Data Quality?
• Who has already implemented Data Quality in Spark?
• Who is expecting to be an AI guru after this session?
• Who just came for the free food?
• Who is expecting to provide better insights faster after this session?
Agenda
• What is Spark?
• What can I do with Spark?
• What is a Spark app, anyway?
• What’s AI?
• Why is Spark a great environment for AI?
• Meet CACTAR, the Mongolian warlord of data quality
• Why does data quality matter?
• Your first AI app with Spark
• And finally a little surprise…
Analytics operating system
An analytics operating system?
[Diagram: software stacks built from Hardware, OS, Distributed OS, Analytics OS, and Apps layers, going from a single machine to several nodes. Column labels: Apps, Analytics, Distrib.]
Spark use cases
• NCEatery.com
• Restaurant analytics
• 1.57×10²¹ datapoints analyzed (that’s over one zetta datapoints)
• (@ Lumeris)
• General compute
• Distributed data transfer
• IBM
• DSX (Data Science Experience)
• Event Store - http://jgp.net/2017/06/22/spark-boosts-ibm-event-store/
• CERN
• Analysis of the science experiments in the LHC - Large Hadron Collider
Apache Spark
• Spark SQL
• Spark Streaming
• MLlib (machine learning)
• GraphX (graph)
[Diagram: your application sits on top of Spark's unified API (Spark SQL, Spark Streaming, machine learning and deep learning, GraphX). The unified API spans nodes 1 to 5 and beyond, each with its own OS and hardware, and is exposed through the DataFrame.]
http://bit.ly/spark-clego
What’s #AI?
Popular beliefs (general AI)
• Robot with human-like behavior
• HAL from 2001
• Isaac Asimov
• Potential ethical problems
Current state of the art (narrow AI)
• Lots of mathematics
• Heavy calculations
• Algorithms
• Self-driving cars
In 2018…
I am an expert in
general AI
ARTIFICIAL INTELLIGENCE
is Machine Learning
Machine learning
• Common algorithms
• Linear and logistic regressions
• Classification and regression trees
• K-nearest neighbors (KNN)
• Deep learning
• Subset of ML
• Artificial neural networks (ANNs)
• Super CPU intensive, use of GPU
There are two kinds of data scientists:
1) Those who can extrapolate from incomplete data.
- The Internet, and my personal dedication to Sam Christie
Data engineer
Develop, build, test, and operationalize datastores and large-scale processing systems.
DataOps is the new DevOps.
• Match architecture with business needs.
• Develop processes for data modeling, mining, and pipelines.
• Improve data reliability and quality.
Data scientist
Clean, massage, and organize data. Perform statistics and analysis to develop insights, build models, and search for innovative correlations.
• Prepare data for predictive models.
• Explore data to find hidden gems and patterns.
• Tell stories to key stakeholders.
Adapted from: https://www.datacamp.com/community/blog/data-scientist-vs-data-engineer
[Diagram: the data engineer and the data scientist, with SQL shown between the two roles.]
Programming languages
[Chart: RedMonk programming language rankings, January 2014 to January 2018, for Scala, Java, Python, and R.]
[Chart: IEEE Spectrum top programming languages, 2014 to 2018, for Scala, Java, Python, R, and SQL.]
As the old adage goes:
Garbage in, garbage out
(Cartoon: xkcd)
If Everything Was As Simple…
Dinner revenue per number of guests
…as a Visual Representation
Anomaly #1
Anomaly #2
I Love It When a Plan Comes Together
Data is like a box of chocolates, you never know what you're gonna get.
Jean “Gump” Perrin, June 2017
Data from everywhere
Databases
RDBMS, NoSQL
Files
CSV, XML, JSON, Excel, Photos, Video
Machines
Services, REST, Streaming, IoT…
CACTAR is not a Mongolian warlord
What is data quality?
Attributes (CACTAR):
• Consistency,
• Accuracy,
• Completeness,
• Timeliness,
• Accessibility,
• Reliability.
To allow:
• Operations,
• Decision making,
• Planning,
• Machine Learning,
• Artificial Intelligence.
Ouch, bad data story
Hubble was blind because of a 2.2 µm error
Legend says it’s a metric to imperial/US conversion
Now it hurts
• Challenger exploded in 1986 because of a defective O-ring
• Root causes were:
• Invalid data for the O-ring parts
• Lack of data lineage & reporting
#1 Scripts and the like
• Source data are cleaned by scripts (shell, Python, Java app…)
• I/O intensive
• Storage space intensive
• No parallelization
#2 Use Spark SQL
• All in memory!
• But limited to SQL and built-in functions
#3 Use UDFs
• User-defined functions (UDFs)
• SQL can be extended with UDFs
• UDFs benefit from the cluster architecture and distributed processing
Enough!
Let me code!
SPRU your UDF
• Service: build your code; it might already exist!
• Plumbing: connect your existing business logic to Spark via a UDF
• Register: the UDF in Spark
• Use: the UDF is available in Spark SQL and via callUDF()
Code sample #1.2 - plumbing
package net.jgp.labs.sparkdq4ml.dq.udf;

import org.apache.spark.sql.api.java.UDF1;
import net.jgp.labs.sparkdq4ml.dq.service.*;

// Plumbing: exposes the existing business rule to Spark as a one-argument UDF.
public class MinimumPriceDataQualityUdf
    implements UDF1<Double, Double> {
  @Override
  public Double call(Double price) throws Exception {
    // Delegate to the service so the rule stays reusable outside Spark.
    return MinimumPriceDataQualityService.checkMinimumPrice(price);
  }
}
/jgperrin/net.jgp.labs.sparkdq4ml
If the price is OK, the UDF returns the price;
if the price is not OK, it returns -1
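The slides show the plumbing but not the service it calls. As a rough sketch only, the rule could look like the class below. The 20.0 threshold and the `MIN_PRICE` name are assumptions made for illustration: the slides only show that prices such as 3.0, 10.0, and 15.0 are flagged while 23.1 passes. The real service lives in the repository referenced above and may differ, for instance in how it treats missing values.

```java
// Hypothetical sketch of the service behind the UDF, not the author's code.
final class MinimumPriceDataQualityService {
  // Assumed threshold: the slides do not state the actual minimum price.
  private static final double MIN_PRICE = 20.0;

  private MinimumPriceDataQualityService() {
    // Utility class, not meant to be instantiated.
  }

  /** Returns the price unchanged when it passes the rule, -1 otherwise. */
  static double checkMinimumPrice(double price) {
    return price < MIN_PRICE ? -1.0 : price;
  }

  public static void main(String[] args) {
    // A plausible price passes through, a suspiciously low one is flagged.
    System.out.println(checkMinimumPrice(23.1));
    System.out.println(checkMinimumPrice(3.0));
  }
}
```

Keeping the rule in a plain class like this means it can be unit tested and reused without a Spark session; the UDF only adapts it.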
Code sample #1.3 - register
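import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataTypes;

// Register: make the UDF callable by name from Spark SQL and callUDF().
SparkSession spark = SparkSession
    .builder().appName("DQ4ML").master("local").getOrCreate();
spark.udf().register(
    "minimumPriceRule",
    new MinimumPriceDataQualityUdf(),
    DataTypes.DoubleType);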
/jgperrin/net.jgp.labs.sparkdq4ml
Code sample #1.4 - use
import static org.apache.spark.sql.functions.callUDF;

String filename = "data/dataset.csv";
// The file has no header row, so Spark names the columns _c0 and _c1.
Dataset<Row> df = spark.read().format("csv")
    .option("inferSchema", "true").option("header", "false")
    .load(filename);
df = df.withColumn("guest", df.col("_c0")).drop("_c0");
df = df.withColumn("price", df.col("_c1")).drop("_c1");
// Use: apply the registered rule; invalid prices come back as -1.
df = df.withColumn(
    "price_no_min",
    callUDF("minimumPriceRule", df.col("price")));
df.createOrReplaceTempView("price");
// Keep only the rows that passed the rule.
df = spark.sql(
    "SELECT guest, price_no_min AS price FROM price "
        + "WHERE price_no_min > 0");
Using CSV, but it could be Hive, JDBC, you name it…
/jgperrin/net.jgp.labs.sparkdq4ml
Dataset with anomalies
+-----+-----+
|guest|price|
+-----+-----+
|    1|23.24|
|    2|30.89|
|    2|33.74|
|    3|34.89|
|    3|29.91|
|    3| 38.0|
|    4| 40.0|
|    5|120.0|
|    6| 50.0|
|    6|112.0|
|    8| 60.0|
|    8|127.0|
|    8|120.0|
|    9|130.0|
+-----+-----+
Highlight anomalies
+-----+-----+------------+
|guest|price|price_no_min|
+-----+-----+------------+
|    1| 23.1|        23.1|
|    2| 30.0|        30.0|
|    2| 33.0|        33.0|
|    3| 34.0|        34.0|
|   24|142.0|       142.0|
|   24|138.0|       138.0|
|   25|  3.0|        -1.0|
|   26| 10.0|        -1.0|
|   25| 15.0|        -1.0|
|   26|  4.0|        -1.0|
|   28| 10.0|        -1.0|
|   28|158.0|       158.0|
|   30|170.0|       170.0|
|   31|180.0|       180.0|
+-----+-----+------------+
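To make the flag-then-filter idea concrete without a Spark cluster, here is a plain Java sketch of what the `price_no_min` column and the `WHERE price_no_min > 0` clause do to a few rows taken from the table above. The class name, the row layout, and the 20.0 threshold are all assumptions for illustration; the slides do not state the actual threshold.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical plain-Java imitation of the Spark flag-then-filter pipeline.
final class CleansingSketch {
  // Assumed threshold, chosen so that 15.0 fails and 23.1 passes.
  static final double MIN_PRICE = 20.0;

  private CleansingSketch() {
  }

  /** Same contract as the UDF: the price if valid, -1 otherwise. */
  static double checkMinimumPrice(double price) {
    return price < MIN_PRICE ? -1.0 : price;
  }

  /** Each row is {guest, price}; returns only the rows that pass the rule. */
  static List<double[]> cleanse(List<double[]> rows) {
    List<double[]> kept = new ArrayList<>();
    for (double[] row : rows) {
      double priceNoMin = checkMinimumPrice(row[1]); // the price_no_min column
      if (priceNoMin > 0) {                          // the WHERE clause
        kept.add(new double[] {row[0], priceNoMin});
      }
    }
    return kept;
  }

  public static void main(String[] args) {
    List<double[]> rows = new ArrayList<>();
    rows.add(new double[] {1, 23.1});
    rows.add(new double[] {2, 30.0});
    rows.add(new double[] {25, 3.0});
    rows.add(new double[] {26, 10.0});
    rows.add(new double[] {28, 158.0});
    for (double[] row : cleanse(rows)) {
      System.out.println((int) row[0] + " guests: " + row[1]);
    }
  }
}
```

In Spark the same two steps run distributed across the cluster; the sketch only mirrors the logic on a single machine.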
Cleansed dataset
+-----+-----+
|guest|price|
+-----+-----+
|    1| 23.1|
|    2| 30.0|
|    2| 33.0|
|    3| 34.0|
|    3| 30.0|
|    4| 40.0|
|   19|110.0|
|   20|120.0|
|   22|131.0|
|   24|142.0|
|   24|138.0|
|   28|158.0|
|   30|170.0|
|   31|180.0|
+-----+-----+
Data can now be used for ML
• Convert/Adapt dataset to Features and Label
• Required for Linear Regression in MLlib
• Needs a column called label of type double
• Needs a column called features of type VectorUDT
Code sample #2 - register & use
spark.udf().register(
    "vectorBuilder",
    new VectorBuilder(),
    new VectorUDT());
// MLlib's linear regression expects a "label" and a "features" column.
df = df.withColumn("label", df.col("price"));
df = df.withColumn("features", callUDF("vectorBuilder", df.col("guest")));

// ... Lots of complex ML code goes here ...
double p = model.predict(features);
System.out.println("Prediction for " + feature + " guests is " + p);
/jgperrin/net.jgp.labs.sparkdq4ml
Prediction for 40 guests…
+-----+-----+-----+--------+------------------+
|guest|price|label|features|        prediction|
+-----+-----+-----+--------+------------------+
|    1| 23.1| 23.1|   [1.0]|24.563807596513133|
|    2| 30.0| 30.0|   [2.0]|29.595283312577884|
|    2| 33.0| 33.0|   [2.0]|29.595283312577884|
|    3| 34.0| 34.0|   [3.0]| 34.62675902864264|
|    3| 30.0| 30.0|   [3.0]| 34.62675902864264|
|    3| 38.0| 38.0|   [3.0]| 34.62675902864264|
|    4| 40.0| 40.0|   [4.0]| 39.65823474470739|
|   14| 89.0| 89.0|  [14.0]| 89.97299190535493|
|   16|102.0|102.0|  [16.0]|100.03594333748444|
|   20|120.0|120.0|  [20.0]|120.16184620174346|
|   22|131.0|131.0|  [22.0]|130.22479763387295|
|   24|142.0|142.0|  [24.0]|140.28774906600245|
+-----+-----+-----+--------+------------------+
Prediction for 40.0 guests is 220.79136052303852
(the complex ML code)
// Define the algorithm and its (hyper)parameters
LinearRegression lr = new LinearRegression()
    .setMaxIter(40)
    .setRegParam(1)
    .setElasticNetParam(1);
// Create a model from our data
LinearRegressionModel model = lr.fit(df);
// Apply the model to new data: predict
Double feature = 40.0;
Vector features = Vectors.dense(40.0);
double p = model.predict(features);
/jgperrin/net.jgp.labs.sparkdq4ml
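For intuition about what fitting and predicting mean here, the sketch below does an ordinary least-squares fit of a straight line in plain Java. This is a swapped-in, simpler technique, not Spark's `LinearRegression`, which in the sample above also applies regularization through `setRegParam` and `setElasticNetParam`, so its numbers will not match the slide's output. The class and method names are hypothetical, and the data is only the excerpt shown in the prediction table.

```java
// Hypothetical sketch: ordinary least squares for one feature, without Spark.
final class SimpleLinearFit {
  private SimpleLinearFit() {
  }

  /** Returns {intercept, slope} minimizing the squared error of y = a + b x. */
  static double[] fit(double[] x, double[] y) {
    if (x.length != y.length || x.length < 2) {
      throw new IllegalArgumentException("need two arrays of equal length >= 2");
    }
    int n = x.length;
    double meanX = 0.0;
    double meanY = 0.0;
    for (int i = 0; i < n; i++) {
      meanX += x[i];
      meanY += y[i];
    }
    meanX /= n;
    meanY /= n;
    double sxy = 0.0; // co-variation of x and y around their means
    double sxx = 0.0; // variation of x around its mean
    for (int i = 0; i < n; i++) {
      double dx = x[i] - meanX;
      sxy += dx * (y[i] - meanY);
      sxx += dx * dx;
    }
    if (sxx == 0.0) {
      throw new IllegalArgumentException("x values must not all be equal");
    }
    double slope = sxy / sxx;
    return new double[] {meanY - slope * meanX, slope};
  }

  /** Applies a fitted {intercept, slope} model to a new x value. */
  static double predict(double[] model, double x) {
    return model[0] + model[1] * x;
  }

  public static void main(String[] args) {
    // Guests (feature) and price (label) rows from the prediction table.
    double[] guests = {1, 2, 2, 3, 3, 3, 4, 14, 16, 20, 22, 24};
    double[] prices = {23.1, 30, 33, 34, 30, 38, 40, 89, 102, 120, 131, 142};
    double[] model = fit(guests, prices);
    System.out.println("Prediction for 40.0 guests is " + predict(model, 40.0));
  }
}
```

The same three steps appear as in the Spark sample: choose an algorithm, fit it to labeled data, then apply the fitted model to an unseen value.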
Surprise!
Using Spark to analyse the World Cup 2018
ScoreStatisticsApp:
Counting the goals and other statistics from World Cup 2018
Goals by country
+-------+----------+
|country|sum(score)|
+-------+----------+
|Belgium|        16|
| France|        14|
|Croatia|        14|
|England|        12|
| Russia|        11|
+-------+----------+
only showing top 5 rows
/jgperrin/net.jgp.labs.spark.football
Using Spark to analyse all the World Cups
HistoricScoreStatisticsApp:
Analyzing the attendance for all World Cups and other stats
Attendance per year
+----+----------+
|Year|Attendance|
+----+----------+
|1930|   1181098|
|1934|    726000|
|1938|    751400|
|1950|   2090492|
|1954|   1537214|
|1958|   1639620|
/jgperrin/net.jgp.labs.spark.football
Demo
Using Spark to predict the next World Cup winner
• Analysis of World Cup soccer tournaments from 1930 to 2018
• Numerous datasets from Kaggle and data.world
• Potentially billions of data points
• Heavy usage of Java
/jgperrin/net.jgp.labs.spark.football
Conclusion
Key takeaways
• Build your data lake in memory (and disk).
• Store first, ask questions later.
• Sample your data (to spot anomalies and more).
• Build (and reuse!) business rules with the business people.
• Use Spark dataframes, SQL, and UDFs to build a consistent & coherent dataset.
• Rely on UDFs for prepping ML formats.
• Use Java.
Going further
• Contact me @jgperrin
• Hands-on tutorial at All Things Open (October in Raleigh, NC)
• Join the Spark User mailing list
• Get help from Stack Overflow
• fb.com/TriangleSpark
• Buy my book on Spark with Java in MEAP (ok, really shameless plug here)
Going even further
Spark with Java (MEAP)
by Jean Georges Perrin (@jgperrin)
published by Manning
https://www.manning.com/books/spark-with-java
• sparkwjava-B108: one free book
• sparkwithjava: 40% off

2CRSI presentation for ISC-HPC: When High-Performance Computing meets High-Pe...Jean-Georges Perrin
 
Vision stratégique de l'utilisation de l'(Open)Data dans l'entreprise
Vision stratégique de l'utilisation de l'(Open)Data dans l'entrepriseVision stratégique de l'utilisation de l'(Open)Data dans l'entreprise
Vision stratégique de l'utilisation de l'(Open)Data dans l'entrepriseJean-Georges Perrin
 
Informix is not for legacy applications
Informix is not for legacy applicationsInformix is not for legacy applications
Informix is not for legacy applicationsJean-Georges Perrin
 
GreenIvory : products and services
GreenIvory : products and servicesGreenIvory : products and services
GreenIvory : products and servicesJean-Georges Perrin
 
GreenIvory : produits & services
GreenIvory : produits & servicesGreenIvory : produits & services
GreenIvory : produits & servicesJean-Georges Perrin
 
A la découverte des nouvelles tendances du web (Mulhouse Edition)
A la découverte des nouvelles tendances du web (Mulhouse Edition)A la découverte des nouvelles tendances du web (Mulhouse Edition)
A la découverte des nouvelles tendances du web (Mulhouse Edition)Jean-Georges Perrin
 
MashupXFeed et la stratégie éditoriale - Workshop Activis - GreenIvory
MashupXFeed et la stratégie éditoriale - Workshop Activis - GreenIvoryMashupXFeed et la stratégie éditoriale - Workshop Activis - GreenIvory
MashupXFeed et la stratégie éditoriale - Workshop Activis - GreenIvoryJean-Georges Perrin
 
MashupXFeed et le référencement - Workshop Activis - Greenivory
MashupXFeed et le référencement - Workshop Activis - GreenivoryMashupXFeed et le référencement - Workshop Activis - Greenivory
MashupXFeed et le référencement - Workshop Activis - GreenivoryJean-Georges Perrin
 

Mehr von Jean-Georges Perrin (20)

It's painful how much data rules the world
It's painful how much data rules the worldIt's painful how much data rules the world
It's painful how much data rules the world
 
Apache Spark v3.0.0
Apache Spark v3.0.0Apache Spark v3.0.0
Apache Spark v3.0.0
 
Big data made easy with a Spark
Big data made easy with a SparkBig data made easy with a Spark
Big data made easy with a Spark
 
Why i love Apache Spark?
Why i love Apache Spark?Why i love Apache Spark?
Why i love Apache Spark?
 
Big Data made easy with a Spark
Big Data made easy with a SparkBig Data made easy with a Spark
Big Data made easy with a Spark
 
Spark Summit Europe Wrap Up and TASM State of the Community
Spark Summit Europe Wrap Up and TASM State of the CommunitySpark Summit Europe Wrap Up and TASM State of the Community
Spark Summit Europe Wrap Up and TASM State of the Community
 
Spark hands-on tutorial (rev. 002)
Spark hands-on tutorial (rev. 002)Spark hands-on tutorial (rev. 002)
Spark hands-on tutorial (rev. 002)
 
Spark Summit 2017 - A feedback for TASM
Spark Summit 2017 - A feedback for TASMSpark Summit 2017 - A feedback for TASM
Spark Summit 2017 - A feedback for TASM
 
HTML (or how the web got started)
HTML (or how the web got started)HTML (or how the web got started)
HTML (or how the web got started)
 
2CRSI presentation for ISC-HPC: When High-Performance Computing meets High-Pe...
2CRSI presentation for ISC-HPC: When High-Performance Computing meets High-Pe...2CRSI presentation for ISC-HPC: When High-Performance Computing meets High-Pe...
2CRSI presentation for ISC-HPC: When High-Performance Computing meets High-Pe...
 
Vision stratégique de l'utilisation de l'(Open)Data dans l'entreprise
Vision stratégique de l'utilisation de l'(Open)Data dans l'entrepriseVision stratégique de l'utilisation de l'(Open)Data dans l'entreprise
Vision stratégique de l'utilisation de l'(Open)Data dans l'entreprise
 
Informix is not for legacy applications
Informix is not for legacy applicationsInformix is not for legacy applications
Informix is not for legacy applications
 
Vendre des produits techniques
Vendre des produits techniquesVendre des produits techniques
Vendre des produits techniques
 
Vendre plus sur le web
Vendre plus sur le webVendre plus sur le web
Vendre plus sur le web
 
Vendre plus sur le Web
Vendre plus sur le WebVendre plus sur le Web
Vendre plus sur le Web
 
GreenIvory : products and services
GreenIvory : products and servicesGreenIvory : products and services
GreenIvory : products and services
 
GreenIvory : produits & services
GreenIvory : produits & servicesGreenIvory : produits & services
GreenIvory : produits & services
 
A la découverte des nouvelles tendances du web (Mulhouse Edition)
A la découverte des nouvelles tendances du web (Mulhouse Edition)A la découverte des nouvelles tendances du web (Mulhouse Edition)
A la découverte des nouvelles tendances du web (Mulhouse Edition)
 
MashupXFeed et la stratégie éditoriale - Workshop Activis - GreenIvory
MashupXFeed et la stratégie éditoriale - Workshop Activis - GreenIvoryMashupXFeed et la stratégie éditoriale - Workshop Activis - GreenIvory
MashupXFeed et la stratégie éditoriale - Workshop Activis - GreenIvory
 
MashupXFeed et le référencement - Workshop Activis - Greenivory
MashupXFeed et le référencement - Workshop Activis - GreenivoryMashupXFeed et le référencement - Workshop Activis - Greenivory
MashupXFeed et le référencement - Workshop Activis - Greenivory
 

Kürzlich hochgeladen

English-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdf
English-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdfEnglish-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdf
English-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdfblazblazml
 
Learn How Data Science Changes Our World
Learn How Data Science Changes Our WorldLearn How Data Science Changes Our World
Learn How Data Science Changes Our WorldEduminds Learning
 
Real-Time AI Streaming - AI Max Princeton
Real-Time AI  Streaming - AI Max PrincetonReal-Time AI  Streaming - AI Max Princeton
Real-Time AI Streaming - AI Max PrincetonTimothy Spann
 
6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...
6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...
6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...Dr Arash Najmaei ( Phd., MBA, BSc)
 
Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...
Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...
Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...Boston Institute of Analytics
 
Student profile product demonstration on grades, ability, well-being and mind...
Student profile product demonstration on grades, ability, well-being and mind...Student profile product demonstration on grades, ability, well-being and mind...
Student profile product demonstration on grades, ability, well-being and mind...Seán Kennedy
 
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...Boston Institute of Analytics
 
The Power of Data-Driven Storytelling_ Unveiling the Layers of Insight.pptx
The Power of Data-Driven Storytelling_ Unveiling the Layers of Insight.pptxThe Power of Data-Driven Storytelling_ Unveiling the Layers of Insight.pptx
The Power of Data-Driven Storytelling_ Unveiling the Layers of Insight.pptxTasha Penwell
 
FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024
FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024
FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024Susanna-Assunta Sansone
 
Decoding Patterns: Customer Churn Prediction Data Analysis Project
Decoding Patterns: Customer Churn Prediction Data Analysis ProjectDecoding Patterns: Customer Churn Prediction Data Analysis Project
Decoding Patterns: Customer Churn Prediction Data Analysis ProjectBoston Institute of Analytics
 
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default  Presentation : Data Analysis Project PPTPredictive Analysis for Loan Default  Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPTBoston Institute of Analytics
 
wepik-insightful-infographics-a-data-visualization-overview-20240401133220kwr...
wepik-insightful-infographics-a-data-visualization-overview-20240401133220kwr...wepik-insightful-infographics-a-data-visualization-overview-20240401133220kwr...
wepik-insightful-infographics-a-data-visualization-overview-20240401133220kwr...KarteekMane1
 
Easter Eggs From Star Wars and in cars 1 and 2
Easter Eggs From Star Wars and in cars 1 and 2Easter Eggs From Star Wars and in cars 1 and 2
Easter Eggs From Star Wars and in cars 1 and 217djon017
 
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...Amil Baba Dawood bangali
 
Defining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data StoryDefining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data StoryJeremy Anderson
 
convolutional neural network and its applications.pdf
convolutional neural network and its applications.pdfconvolutional neural network and its applications.pdf
convolutional neural network and its applications.pdfSubhamKumar3239
 
Unveiling the Role of Social Media Suspect Investigators in Preventing Online...
Unveiling the Role of Social Media Suspect Investigators in Preventing Online...Unveiling the Role of Social Media Suspect Investigators in Preventing Online...
Unveiling the Role of Social Media Suspect Investigators in Preventing Online...Milind Agarwal
 
Student Profile Sample report on improving academic performance by uniting gr...
Student Profile Sample report on improving academic performance by uniting gr...Student Profile Sample report on improving academic performance by uniting gr...
Student Profile Sample report on improving academic performance by uniting gr...Seán Kennedy
 
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...Boston Institute of Analytics
 
Conf42-LLM_Adding Generative AI to Real-Time Streaming Pipelines
Conf42-LLM_Adding Generative AI to Real-Time Streaming PipelinesConf42-LLM_Adding Generative AI to Real-Time Streaming Pipelines
Conf42-LLM_Adding Generative AI to Real-Time Streaming PipelinesTimothy Spann
 

Kürzlich hochgeladen (20)

English-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdf
English-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdfEnglish-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdf
English-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdf
 
Learn How Data Science Changes Our World
Learn How Data Science Changes Our WorldLearn How Data Science Changes Our World
Learn How Data Science Changes Our World
 
Real-Time AI Streaming - AI Max Princeton
Real-Time AI  Streaming - AI Max PrincetonReal-Time AI  Streaming - AI Max Princeton
Real-Time AI Streaming - AI Max Princeton
 
6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...
6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...
6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...
 
Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...
Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...
Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...
 
Student profile product demonstration on grades, ability, well-being and mind...
Student profile product demonstration on grades, ability, well-being and mind...Student profile product demonstration on grades, ability, well-being and mind...
Student profile product demonstration on grades, ability, well-being and mind...
 
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...
 
The Power of Data-Driven Storytelling_ Unveiling the Layers of Insight.pptx
The Power of Data-Driven Storytelling_ Unveiling the Layers of Insight.pptxThe Power of Data-Driven Storytelling_ Unveiling the Layers of Insight.pptx
The Power of Data-Driven Storytelling_ Unveiling the Layers of Insight.pptx
 
FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024
FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024
FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024
 
Decoding Patterns: Customer Churn Prediction Data Analysis Project
Decoding Patterns: Customer Churn Prediction Data Analysis ProjectDecoding Patterns: Customer Churn Prediction Data Analysis Project
Decoding Patterns: Customer Churn Prediction Data Analysis Project
 
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default  Presentation : Data Analysis Project PPTPredictive Analysis for Loan Default  Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPT
 
wepik-insightful-infographics-a-data-visualization-overview-20240401133220kwr...
wepik-insightful-infographics-a-data-visualization-overview-20240401133220kwr...wepik-insightful-infographics-a-data-visualization-overview-20240401133220kwr...
wepik-insightful-infographics-a-data-visualization-overview-20240401133220kwr...
 
Easter Eggs From Star Wars and in cars 1 and 2
Easter Eggs From Star Wars and in cars 1 and 2Easter Eggs From Star Wars and in cars 1 and 2
Easter Eggs From Star Wars and in cars 1 and 2
 
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
 
Defining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data StoryDefining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data Story
 
convolutional neural network and its applications.pdf
convolutional neural network and its applications.pdfconvolutional neural network and its applications.pdf
convolutional neural network and its applications.pdf
 
Unveiling the Role of Social Media Suspect Investigators in Preventing Online...
Unveiling the Role of Social Media Suspect Investigators in Preventing Online...Unveiling the Role of Social Media Suspect Investigators in Preventing Online...
Unveiling the Role of Social Media Suspect Investigators in Preventing Online...
 
Student Profile Sample report on improving academic performance by uniting gr...
Student Profile Sample report on improving academic performance by uniting gr...Student Profile Sample report on improving academic performance by uniting gr...
Student Profile Sample report on improving academic performance by uniting gr...
 
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
 
Conf42-LLM_Adding Generative AI to Real-Time Streaming Pipelines
Conf42-LLM_Adding Generative AI to Real-Time Streaming PipelinesConf42-LLM_Adding Generative AI to Real-Time Streaming Pipelines
Conf42-LLM_Adding Generative AI to Real-Time Streaming Pipelines
 

The road to AI is paved with pragmatic intentions

  • 1. CONFIDENTIAL © 2018 The road to AI is paved with pragmatic intentions. Jean Georges “JG” Perrin, August 22nd 2018
  • 2. JGP • Jean Georges Perrin • @jgperrin • Chapel Hill, NC • I 🏗 SW • Since 1983 • #Knowledge = 𝑓(∑(#SmallData, #BigData), #DataScience) & #Software • #IBMChampion x10 • #KeepLearning • @ http://jgp.net
  • 4. Who art thou?
      • Experience with Spark?
      • Who is familiar with data quality?
      • Who has already implemented data quality in Spark?
      • Who is expecting to be an AI guru after this session?
      • Who just came for the free food?
      • Who is expecting to provide better insights faster after this session?
  • 5. Agenda
      • What is Spark?
      • What can I do with Spark?
      • What is a Spark app, anyway?
      • What’s AI?
      • Why is Spark a great environment for AI?
      • Meet Cactar, the Mongolian warlord of data quality
      • Why does data quality matter?
      • Your first AI app with Spark
      • And finally a little surprise…
  • 7. An analytics operating system? [Diagram: stacks of Hardware, OS, and Apps layers, extended with Distributed OS ("Distrib.") and Analytics OS ("Analytics") layers]
  • 8-10. An analytics operating system? [Diagram, repeated on three slides: Hardware and OS per node, then Distributed OS, Analytics OS, and Apps]
  • 11. Spark use cases
      • NCEatery.com
          • Restaurant analytics
          • 1.57×10²¹ datapoints analyzed (that’s about one zetta datapoints)
      • (@ Lumeris)
          • General compute
          • Distributed data transfer
      • IBM
          • DSX (Data Science Experience)
          • Event Store - http://jgp.net/2017/06/22/spark-boosts-ibm-event-store/
      • CERN
          • Analysis of the science experiments in the LHC (Large Hadron Collider)
  • 13. [Diagram: Your Application on top of the Unified API (Spark SQL, Spark Streaming, Machine Learning (& Deep Learning), GraphX), running across Nodes 1 to 5, each with its own OS and hardware]
  • 14. [Diagram: Your Application, the Unified API (Spark SQL, Spark Streaming, Machine Learning (& Deep Learning), GraphX), and the DataFrame, across Nodes 1 to 5]
  • 15. [Diagram: Spark SQL, Spark Streaming, Machine Learning (& Deep Learning), and GraphX with the DataFrame]
  • 18. Popular beliefs
      • Robot with human-like behavior
      • HAL from 2001
      • Isaac Asimov
      • Potential ethical problems
      • Lots of mathematics
      • Heavy calculations
      • Algorithms
      • Self-driving cars
      [Slide labels: Current state-of-the-art, General AI, Narrow AI]
  • 19. In 2018… I am an expert in general AI. ARTIFICIAL INTELLIGENCE is Machine Learning
  • 20. Machine learning
      • Common algorithms
          • Linear and logistic regressions
          • Classification and regression trees
          • K-nearest neighbors (KNN)
      • Deep learning
          • Subset of ML
          • Artificial neural networks (ANNs)
          • Super CPU intensive, use of GPU
  • 21. There are two kinds of data scientists: 1) Those who can extrapolate from incomplete data. -The Internet, and my personal dedication to Sam Christie
  • 22. Adapted from: https://www.datacamp.com/community/blog/data-scientist-vs-data-engineer
      • DATA Engineer: Develop, build, test, and operationalize datastores and large-scale processing systems. DataOps is the new DevOps. Match architecture with business needs. Develop processes for data modeling, mining, and pipelines. Improve data reliability and quality.
      • DATA Scientist: Clean, massage, and organize data. Perform statistics and analysis to develop insights, build models, and search for innovative correlations. Prepare data for predictive models. Explore data to find hidden gems and patterns. Tell stories to key stakeholders.
  • 23. Adapted from: https://www.datacamp.com/community/blog/data-scientist-vs-data-engineer [Diagram labels: DATA Engineer, DATA Scientist, SQL]
  • 24. Programming languages
      [Chart: RedMonk programming language rankings, Jan-14 to Jan-18, for Scala, Java, Python, R]
      [Chart: IEEE Spectrum, top programming languages, 2014 to 2018, for Scala, Java, Python, R, SQL]
  • 25. As the old adage goes: garbage in, garbage out. [Comic: xkcd]
  • 26. If Everything Was As Simple… Dinner revenue per number of guests
  • 27. …as a Visual Representation [Annotations: Anomaly #1, Anomaly #2]
  • 28. I Love It When a Plan Comes Together
  • 29. Data is like a box of chocolates, you never know what you're gonna get. Jean “Gump” Perrin, June 2017
  • 30. Data from everywhere
      • Databases: RDBMS, NoSQL
      • Files: CSV, XML, JSON, Excel, photos, video
      • Machines: services, REST, streaming, IoT…
  • 31. CACTAR is not a Mongolian warlord
  • 32. What is data quality?
      • Attributes (CACTAR): consistency, accuracy, completeness, timeliness, accessibility, reliability.
      • To allow: operations, decision making, planning, machine learning, artificial intelligence.
  • 33. Ouch, bad data story: Hubble was blind because of a 2.2 µm error. Legend says it’s a metric to imperial/US conversion.
  • 34. Now it hurts
      • Challenger exploded in 1986 because of a defective O-ring
      • Root causes were:
          • Invalid data for the O-ring parts
          • Lack of data lineage & reporting
  • 35. #1 Scripts and the like
      • Source data are cleaned by scripts (shell, Python, Java app…)
      • I/O intensive
      • Storage space intensive
      • No parallelization
  • 36. #2 Use Spark SQL
      • All in memory!
      • But limited to SQL and built-in functions
  • 37. #3 Use UDFs
      • User-defined functions
      • SQL can be extended with UDFs
      • UDFs benefit from the cluster architecture and distributed processing
  • 39. SPRU your UDF
      • Service: build your code; it might already exist!
      • Plumbing: connect your existing business logic to Spark via a UDF
      • Register: the UDF in Spark
      • Use: the UDF is available in Spark SQL and via callUDF()
  • 40. Code sample #1.2 - plumbing (/jgperrin/net.jgp.labs.sparkdq4ml)

        package net.jgp.labs.sparkdq4ml.dq.udf;

        import org.apache.spark.sql.api.java.UDF1;
        import net.jgp.labs.sparkdq4ml.dq.service.*;

        public class MinimumPriceDataQualityUdf implements UDF1<Double, Double> {
          public Double call(Double price) throws Exception {
            return MinimumPriceDataQualityService.checkMinimumPrice(price);
          }
        }

      If the price is OK, returns the price; if the price is KO, returns -1.
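The "Service" step that this UDF delegates to does not appear in the transcript. A minimal sketch of what it could look like, in plain Java with no Spark dependency, follows. The threshold `MIN_PRICE` and the `cleanse` helper are assumptions made for illustration; only the name `checkMinimumPrice` and the "return the price or -1" contract come from the slides.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/**
 * Hypothetical sketch of the service behind the minimum price rule.
 * The real class is not shown in the slides.
 */
final class MinimumPriceDataQualityService {

  // ASSUMPTION: the slides never state the threshold. In the anomaly table a
  // price of 15.0 is rejected and 23.1 is accepted, so 20.0 is one value that
  // fits; it is a guess, not the author's constant.
  static final double MIN_PRICE = 20.0;

  private MinimumPriceDataQualityService() {
  }

  /** Returns the price when it passes the rule, -1 otherwise. */
  static double checkMinimumPrice(double price) {
    return price < MIN_PRICE ? -1.0 : price;
  }

  /**
   * Hypothetical helper that mimics the final SQL filter
   * (WHERE price_no_min > 0) on a plain list of prices.
   */
  static List<Double> cleanse(List<Double> prices) {
    List<Double> kept = new ArrayList<>();
    for (double price : prices) {
      double checked = checkMinimumPrice(price);
      if (checked > 0) {
        kept.add(checked);
      }
    }
    return kept;
  }

  public static void main(String[] args) {
    // A few prices taken from the "Highlight anomalies" slide.
    List<Double> prices = Arrays.asList(23.1, 30.0, 3.0, 10.0, 15.0, 158.0);
    System.out.println(cleanse(prices)); // prints [23.1, 30.0, 158.0]
  }
}
```

Keeping the rule in a plain class like this means it can be unit-tested and reused outside Spark; the UDF stays a thin adapter. Null prices are not handled in this sketch.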
  • 41. Code sample #1.3 - register (/jgperrin/net.jgp.labs.sparkdq4ml)

        SparkSession spark = SparkSession
            .builder().appName("DQ4ML").master("local").getOrCreate();
        spark.udf().register(
            "minimumPriceRule",
            new MinimumPriceDataQualityUdf(),
            DataTypes.DoubleType);
  • 42. Code sample #1.4 - use (/jgperrin/net.jgp.labs.sparkdq4ml)

        String filename = "data/dataset.csv";
        Dataset<Row> df = spark.read().format("csv")
            .option("inferSchema", "true").option("header", "false")
            .load(filename);
        df = df.withColumn("guest", df.col("_c0")).drop("_c0");
        df = df.withColumn("price", df.col("_c1")).drop("_c1");
        df = df.withColumn(
            "price_no_min",
            callUDF("minimumPriceRule", df.col("price")));
        df.createOrReplaceTempView("price");
        df = spark.sql("SELECT guest, price_no_min AS price FROM price WHERE price_no_min > 0");

      Using CSV, but could be Hive, JDBC, name it…
  • 43. Dataset with anomalies

        +-----+-----+
        |guest|price|
        +-----+-----+
        |    1|23.24|
        |    2|30.89|
        |    2|33.74|
        |    3|34.89|
        |    3|29.91|
        |    3| 38.0|
        |    4| 40.0|
        |    5|120.0|
        |    6| 50.0|
        |    6|112.0|
        |    8| 60.0|
        |    8|127.0|
        |    8|120.0|
        |    9|130.0|
        +-----+-----+
  • 44. Code sample #1.4 - use (same code as slide 42) (/jgperrin/net.jgp.labs.sparkdq4ml)
  • 45. Highlight anomalies

        +-----+-----+------------+
        |guest|price|price_no_min|
        +-----+-----+------------+
        |    1| 23.1|        23.1|
        |    2| 30.0|        30.0|
        |    2| 33.0|        33.0|
        |    3| 34.0|        34.0|
        |   24|142.0|       142.0|
        |   24|138.0|       138.0|
        |   25|  3.0|        -1.0|
        |   26| 10.0|        -1.0|
        |   25| 15.0|        -1.0|
        |   26|  4.0|        -1.0|
        |   28| 10.0|        -1.0|
        |   28|158.0|       158.0|
        |   30|170.0|       170.0|
        |   31|180.0|       180.0|
        +-----+-----+------------+
  • 46. Code sample #1.4 - use (same code as slide 42) (/jgperrin/net.jgp.labs.sparkdq4ml)
  • 47. Cleansed dataset

        +-----+-----+
        |guest|price|
        +-----+-----+
        |    1| 23.1|
        |    2| 30.0|
        |    2| 33.0|
        |    3| 34.0|
        |    3| 30.0|
        |    4| 40.0|
        |   19|110.0|
        |   20|120.0|
        |   22|131.0|
        |   24|142.0|
        |   24|138.0|
        |   28|158.0|
        |   30|170.0|
        |   31|180.0|
        +-----+-----+
  • 48. Data can now be used for ML
      • Convert/adapt dataset to features and label
      • Required for linear regression in MLlib
          • Needs a column called label of type double
          • Needs a column called features of type VectorUDT
  • 49. Code sample #2 - register & use (/jgperrin/net.jgp.labs.sparkdq4ml)

        spark.udf().register(
            "vectorBuilder", new VectorBuilder(), new VectorUDT());
        df = df.withColumn("label", df.col("price"));
        df = df.withColumn("features", callUDF("vectorBuilder", df.col("guest")));

        // ... Lots of complex ML code goes here ...
        double p = model.predict(features);
        System.out.println("Prediction for " + feature + " guests is " + p);
  • 50. Prediction for 40 guests…

        +-----+-----+-----+--------+------------------+
        |guest|price|label|features|        prediction|
        +-----+-----+-----+--------+------------------+
        |    1| 23.1| 23.1|   [1.0]|24.563807596513133|
        |    2| 30.0| 30.0|   [2.0]|29.595283312577884|
        |    2| 33.0| 33.0|   [2.0]|29.595283312577884|
        |    3| 34.0| 34.0|   [3.0]| 34.62675902864264|
        |    3| 30.0| 30.0|   [3.0]| 34.62675902864264|
        |    3| 38.0| 38.0|   [3.0]| 34.62675902864264|
        |    4| 40.0| 40.0|   [4.0]| 39.65823474470739|
        |   14| 89.0| 89.0|  [14.0]| 89.97299190535493|
        |   16|102.0|102.0|  [16.0]|100.03594333748444|
        |   20|120.0|120.0|  [20.0]|120.16184620174346|
        |   22|131.0|131.0|  [22.0]|130.22479763387295|
        |   24|142.0|142.0|  [24.0]|140.28774906600245|
        +-----+-----+-----+--------+------------------+

        Prediction for 40.0 guests is 220.79136052303852
  • 51. (the complex ML code) (/jgperrin/net.jgp.labs.sparkdq4ml)

        // Define the algorithm and its (hyper)parameters
        LinearRegression lr = new LinearRegression()
            .setMaxIter(40)
            .setRegParam(1)
            .setElasticNetParam(1);

        // Create a model from our data
        LinearRegressionModel model = lr.fit(df);

        // Apply the model to a new dataset: predict
        Double feature = 40.0;
        Vector features = Vectors.dense(40.0);
        double p = model.predict(features);
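To make the fit-then-predict idea concrete without a Spark cluster, here is a minimal sketch of a one-feature linear regression in plain Java. It uses ordinary least squares, which is a swapped-in technique: it is not Spark MLlib and it ignores the regularization set through `setRegParam` and `setElasticNetParam` on the slide. The class name `SimpleLinearRegression` is hypothetical, and the sample rows are only the (guest, price) pairs visible on the prediction slide, not the full dataset.

```java
/**
 * Minimal sketch: ordinary least squares for y = intercept + slope * x.
 * Illustrative only; not the regularized MLlib model used in the talk.
 */
final class SimpleLinearRegression {
  final double slope;
  final double intercept;

  private SimpleLinearRegression(double slope, double intercept) {
    this.slope = slope;
    this.intercept = intercept;
  }

  /** Fits a line through the (x[i], y[i]) points. */
  static SimpleLinearRegression fit(double[] x, double[] y) {
    if (x.length != y.length || x.length < 2) {
      throw new IllegalArgumentException("need two arrays of equal length >= 2");
    }
    int n = x.length;
    double meanX = 0.0;
    double meanY = 0.0;
    for (int i = 0; i < n; i++) {
      meanX += x[i];
      meanY += y[i];
    }
    meanX /= n;
    meanY /= n;

    double sxy = 0.0; // sum of (x - meanX) * (y - meanY)
    double sxx = 0.0; // sum of (x - meanX)^2
    for (int i = 0; i < n; i++) {
      double dx = x[i] - meanX;
      sxy += dx * (y[i] - meanY);
      sxx += dx * dx;
    }
    if (sxx == 0.0) {
      throw new IllegalArgumentException("x values must not all be equal");
    }
    double slope = sxy / sxx;
    return new SimpleLinearRegression(slope, meanY - slope * meanX);
  }

  /** Applies the fitted line to a new x value. */
  double predict(double x) {
    return intercept + slope * x;
  }

  public static void main(String[] args) {
    // (guest, price) rows visible on the "Prediction for 40 guests" slide.
    double[] guests = {1, 2, 2, 3, 3, 3, 4, 14, 16, 20, 22, 24};
    double[] prices = {23.1, 30, 33, 34, 30, 38, 40, 89, 102, 120, 131, 142};

    SimpleLinearRegression model = fit(guests, prices);
    System.out.println("Prediction for 40 guests is " + model.predict(40.0));
  }
}
```

On these twelve rows the sketch predicts roughly 222 for 40 guests, close to but not the same as the 220.79 on the slide, which is expected given the different method and the partial data.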
  • 53. Using Spark to analyse the World Cup 2018 (/jgperrin/net.jgp.labs.spark.football)
      ScoreStatisticsApp: counting the goals and other statistics from World Cup 2018

      Goals by country

        +-------+----------+
        |country|sum(score)|
        +-------+----------+
        |Belgium|        16|
        | France|        14|
        |Croatia|        14|
        |England|        12|
        | Russia|        11|
        +-------+----------+
        only showing top 5 rows

  • 54. Using Spark to analyse all the World Cups (/jgperrin/net.jgp.labs.spark.football)
      HistoricScoreStatisticsApp: analyzing the attendance for all World Cups and other stats

      Attendance per year

        +----+----------+
        |Year|Attendance|
        +----+----------+
        |1930|   1181098|
        |1934|    726000|
        |1938|    751400|
        |1950|   2090492|
        |1954|   1537214|
        |1958|   1639620|
  • 55. Demo: Using Spark to predict the next World Cup winner (/jgperrin/net.jgp.labs.spark.football)
      • Analysis of soccer tournaments during World Cups from 1930 to 2018
      • Numerous datasets from Kaggle, data.world
      • Potentially billions of data points
      • Heavy usage of Java
  • 57. Key takeaways
      • Build your data lake in memory (and disk).
      • Store first, ask questions later.
      • Sample your data (to spot anomalies and more).
      • Build (and reuse!) business rules with the business people.
      • Use Spark dataframe, SQL, and UDFs to build a consistent & coherent dataset.
      • Rely on UDFs for prepping ML formats.
      • Use Java.
  • 58. Going further
      • Contact me @jgperrin
      • Hands-on tutorial at All Things Open (October in Raleigh, NC)
      • Join the Spark User mailing list
      • Get help from Stack Overflow
      • fb.com/TriangleSpark
      • Buy my book on Spark with Java in MEAP (ok, really shameless plug here)
  • 59. Going even further
      • Spark with Java (MEAP) by Jean Georges Perrin (@jgperrin), published by Manning
      • https://www.manning.com/books/spark-with-java
      • Codes: sparkwjava-B108, sparkwithjava (one free book, 40% off)