© Cloudera, Inc. All rights reserved.
IMPROVING COMPUTER VISION MODELS AT SCALE
Dr. Mirko Kämpf | Senior Solutions Architect
VISION IS EVERYTHING
• Automotive OEM & Tier 1 Suppliers: autonomous vehicle programs
• Healthcare & Medical Devices: physician augmentation and robotic devices
• Manufacturing & Pharmaceuticals: visual inspection for quality & yield
• Security & Public Sector: customs & border protection, anti-crime efforts
• Insurance: claims processing and fraud detection using images & video
IMAGINE THE POSSIBILITIES...
COMPUTER VISION TECHNOLOGY ALLOWS US TO:
• detect tumors in medical images
• detect broken parts in a manufacturing line
• detect violence in public spaces
• detect dangerous situations in traffic
• detect a fire early
... ALL THAT @ SCALE AND USING THE CLOUD!
BE OPEN AND THINK BIG!
• Cameras are everywhere ...
  • In traffic (cars, trains, planes)
  • In public places (train stations, airports, public buildings)
• Many public datasets are available ...
  • Udacity: autonomous driving datasets
  • Medical images
• Search like (and with) Google:
  • Dataset search is a great new tool.
  • Image search makes grabbing tagged images easier than ever before.
CLOUDERA BUILDS ENTERPRISE SOLUTIONS ON OPEN STANDARDS.
BIG DATA, MACHINE LEARNING, ARTIFICIAL INTELLIGENCE
BUT… THERE ARE CHALLENGES
CHALLENGES
VOLUME
ANNOTATION
MANAGEMENT
VOLUME CHALLENGE
ANNOTATION CHALLENGE
“Let’s consider Cityscapes dataset (useful for
self-driving cars). Fine pixel-level annotation
of a single image from cityscapes
required more than 1.5 hours on average.
They annotated 5000 images. With a simple
math we can calculate, that they spent about
5000 * 1.5 = 7500 hours...”
Source: https://hackernoon.com/%EF%B8%8F-big-challenge-in-deep-learning-training-data-31a88b97b282
Are you willing to invest thousands of hours?
Industry prefers automation!
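The arithmetic in the quote can be reproduced in a few lines. The 8-hour working day and the 220 working days per year are assumptions added here for illustration only; they are not from the source.

```python
# Annotation effort for Cityscapes, using the figures quoted above.
IMAGES = 5000            # finely annotated images (from the quote)
HOURS_PER_IMAGE = 1.5    # average annotation time per image (from the quote)

# Assumptions for illustration only (not from the source):
HOURS_PER_DAY = 8
WORKING_DAYS_PER_YEAR = 220

total_hours = IMAGES * HOURS_PER_IMAGE
person_years = total_hours / (HOURS_PER_DAY * WORKING_DAYS_PER_YEAR)

print(f"Total annotation time: {total_hours:.0f} hours")
print(f"Roughly {person_years:.1f} person-years under the assumptions above")
```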
MANAGEMENT CHALLENGE
SHARING
DISCOVERY
RESULTING STATUS QUO
VAST UNKNOWN
SMALL TRAINING SETS
USED IN ISOLATION
CLOUDERA DIFFERENCE
INTERSECTION OF TWO WORLDS
Digital Asset Management
Data Science Platforms
Automated annotation of new data
Optimized model improvements
ML-POWERED, DOMAIN-AWARE IMAGE REPOSITORY FOR DATA SCIENCE
Core functions: Implemented as reusable building blocks:
Provide extensible, repeatable, and focused solutions
• Asset Acquisition and Processing: processing of digital assets with machine learning annotation and enrichment
• Domain-Oriented Query and Access: semantic query, asset acquisition, and discovery with domain-specific ontologies
• Training Set Management: using the “domain-aware DAM” to generate and retrieve relevant data sets for model development and automated testing
• Model Development: employing relevant training sets in a shared data science environment to construct domain-specific models and to power automated model testing services
This looks a bit like a part of a supply chain, doesn’t it?
WHAT IS IMAGE LOGISTICS?
... the problem we solve ;-)
• Feed images into training procedures efficiently (fast!)
• Identify relevant images to train better DNNs
• Rearrange images quickly to adapt to users' needs
• Manage multiple kinds of metadata: movies, images + context data
• Manage the lifecycle of compound datasets
HOW THINGS WORK
DATA ENGINEERING AND MODEL LIFE CYCLE
FUNCTIONAL REQUIREMENTS
• Fast random access to images
• Free text search for labels / tags / statistical properties
• Execute existing Python and Scala deep learning pipelines at scale
• Automatic labeling and indexing of detected facts
• Visual model comparison
• Search for complex scenarios: situational awareness
SOLUTION OVERVIEW (1)
SOLUTION OVERVIEW (2)
Main users: Data Scientists and Domain Experts
CONCEPTS FOR EFFICIENT IMAGE WAREHOUSING:
COMPOUND DATASETS - ASSET ACQUISITION
Image + Metadata
UNIFIED ASSET PROCESSING
ONE API – MANY FRAMEWORKS – MANY MODELS
• Domain specific attribute extraction and enhancement
• Additional domains added as needed
Autonomous driving | Disease identification
TENZING
Simplified access to large image datasets
• We provide an API for accessing images and image sets via Apache Spark, for tagging/labeling (in CDH) and model training (in CDSW).
• Solr’s powerful search capabilities are used to identify the right images.
• The complexity of allocating individual images or image sets within HBase is hidden within the data access layer (DAL) and the Tenzing API.
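The access pattern described above (search first, then fetch by key) can be sketched with in-memory stand-ins. This is a minimal illustration, not the Tenzing API: `ImageStore`, `TagIndex`, `DataAccessLayer`, and all method names are hypothetical, and plain dictionaries take the place of HBase and Solr.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class ImageStore:
    """Stand-in for the HBase table: row key -> image bytes."""
    rows: Dict[str, bytes] = field(default_factory=dict)

    def get(self, key: str) -> bytes:
        return self.rows[key]


@dataclass
class TagIndex:
    """Stand-in for the Solr index: tag -> set of row keys."""
    postings: Dict[str, Set[str]] = field(default_factory=dict)

    def add(self, key: str, tags: List[str]) -> None:
        for tag in tags:
            self.postings.setdefault(tag, set()).add(key)

    def search(self, *tags: str) -> List[str]:
        """Return the keys of images carrying all given tags, sorted."""
        if not tags:
            return []
        hits = [self.postings.get(tag, set()) for tag in tags]
        return sorted(set.intersection(*hits))


class DataAccessLayer:
    """Hides where images live: callers search by tag and get bytes back."""

    def __init__(self, store: ImageStore, index: TagIndex) -> None:
        self.store = store
        self.index = index

    def put(self, key: str, image: bytes, tags: List[str]) -> None:
        self.store.rows[key] = image
        self.index.add(key, tags)

    def find_images(self, *tags: str) -> Dict[str, bytes]:
        return {key: self.store.get(key) for key in self.index.search(*tags)}


dal = DataAccessLayer(ImageStore(), TagIndex())
dal.put("20180428152138", b"jpeg-bytes-1", ["truck"])
dal.put("20180428152330", b"jpeg-bytes-2", ["person", "bicycle"])
dal.put("20180428152831", b"jpeg-bytes-3", ["person", "stop-sign"])

result = dal.find_images("person", "bicycle")
print(sorted(result))
```

The caller never touches row keys or storage layout directly, which is the point of putting a data access layer between the search index and the image store.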
A FULL DATA PIPELINE FOR THE TRAFFIC DOMAIN:
[Figure: pipeline diagram. Video is split into frames (img) with ffmpeg. NMEA GPS logs are parsed (pynmea2, gps2avro) into Avro records with lat, lon, and timestamp. Geodata attributes (area, tunnel, bridge) come from OpenStreetMap / Overpass API via overpy. Rows are stored in HBase under timestamp keys (e.g., 20180428152330) in the column families CF:img_all (jpg frames), CF:tags (labels such as stop-sign, person, truck, bicycle, traffic light, and boat from the imagenet, retinanet, and tiny-yolo models), and CF:geo. Other components shown: HBaseStorageHandler, Lily (hbase-indexer-mr-job.jar), Tenzing.]
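One step of this pipeline, attaching a GPS position to each video frame, can be sketched as a lookup of the most recent fix at or before the frame's timestamp key. This is an assumption about how the join could work, not the presented implementation; the coordinate values loosely follow the figure, their pairing with timestamps is invented, and the `geo:` column names are hypothetical.

```python
from bisect import bisect_right

# Hypothetical GPS fixes (timestamp key, lat, lon), sorted by time.
# Fixed-width yyyyMMddHHmmss strings sort chronologically, so plain
# string comparison is enough here.
gps_fixes = [
    ("20180428152138", 48.1, 9.0),
    ("20180428152330", 48.3, 9.1),
    ("20180428152831", 48.5, 9.2),
]


def latest_fix_at(frame_ts, fixes):
    """Return the most recent GPS fix at or before frame_ts, or None."""
    times = [ts for ts, _, _ in fixes]
    pos = bisect_right(times, frame_ts)
    return fixes[pos - 1] if pos > 0 else None


def geo_columns(frame_ts, fixes):
    """Build the cells one might write to a geo column family for a frame."""
    fix = latest_fix_at(frame_ts, fixes)
    if fix is None:
        return {}
    fix_ts, lat, lon = fix
    return {"geo:lat": lat, "geo:lon": lon, "geo:fix_ts": fix_ts}


frame_keys = ["20180428152100", "20180428152330", "20180428152500"]
rows = {key: geo_columns(key, gps_fixes) for key in frame_keys}
for key, cells in rows.items():
    print(key, cells)
```

A frame recorded before the first fix gets no geo cells at all, which keeps missing positions visible instead of silently borrowing a later one.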
SOME IMPLEMENTATION DETAILS ...
PYSPARK IMPLEMENTATION (KERAS)
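# common, FLAGS, conf, MODEL_NAME, and run_inference_on_image are assumed to be defined elsewhere in the project.
from keras.applications.inception_v3 import InceptionV3
from pyspark import SparkContext

def predict(iterator):
    # Build the model once per partition, then classify every image in it.
    model = InceptionV3(weights=None)
    model.load_weights(FLAGS.weights_file)
    return [(x[0], run_inference_on_image(model, x[1])) for x in iterator]

def main():
    sc = SparkContext(conf=conf)
    hbase_io = common.HbaseIO(FLAGS)
    out_format = common.OutputFormatter(FLAGS, MODEL_NAME)
    hbase_images = hbase_io.load_from_hbase(sc)
    classified_images = (hbase_images.mapPartitions(predict)
                         .map(out_format.imagenet_format))
    classified_images.foreachPartition(hbase_io.put_to_hbase)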
PYSPARK IMPLEMENTATION (KERAS)
• The Python environment with TensorFlow is distributed to the executors at runtime; it is not preinstalled on the nodes.
• The individual models only need to implement the following functions:
  • prepare
  • predict
  • output_format
• Conceptually very close to the scikit-learn or Spark ML Pipelines approach
• Deep Learning Pipelines can be a way to streamline the implementation:
https://databricks.com/blog/2017/06/06/databricks-vision-simplify-large-scale-deep-learning.html
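The three-function contract listed above might look roughly like the following. The slide names only the functions; their signatures, the division of work between them, the `BrightnessModel` toy class, and `run_partition` are all assumptions made for this sketch.

```python
from typing import Iterable, Iterator, List, Tuple


class BrightnessModel:
    """Toy stand-in for a deep learning model wrapper.

    It labels an "image" (here just a list of pixel intensities) as
    bright or dark. Only the three-function shape is the point.
    """

    def prepare(self) -> None:
        # A real wrapper would load weights here, once per partition.
        self.threshold = 128

    def predict(self, pixels: List[int]) -> str:
        mean = sum(pixels) / len(pixels)
        return "bright" if mean >= self.threshold else "dark"

    def output_format(self, key: str, label: str) -> Tuple[str, dict]:
        # Shape the result as (row key, cells to write back).
        return key, {"tags:brightness": label}


def run_partition(model, rows: Iterable[Tuple[str, List[int]]]) -> Iterator[Tuple[str, dict]]:
    """What a function passed to mapPartitions could look like."""
    model.prepare()  # executed once per partition, not once per image
    for key, pixels in rows:
        yield model.output_format(key, model.predict(pixels))


partition = [("img-1", [200, 220, 240]), ("img-2", [10, 20, 30])]
results = list(run_partition(BrightnessModel(), partition))
print(results)
```

Keeping model loading in `prepare` is what makes the per-partition pattern pay off: the expensive setup cost is spread over every image in the partition.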
SPARK IMPLEMENTATION (DL4J / SCALA)
def predict(pairs: Iterator[(String, (INDArray, Int, Int))]) = {
  // Restore the model once per partition.
  val model = ModelSerializer.restoreComputationGraph(modelLoc)
  pairs.map { case (name, image) =>
    (name, run_inference_on_image(model, image))
  }
}

def main(args: Array[String]): Unit = {
  val sc = new SparkContext(conf)
  val hbase_io = common.HbaseIO(args)
  val out_format = common.OutputFormatter(args)
  val hbase_images = hbase_io.load_from_hbase(sc)
  val classified_images = hbase_images.mapPartitions(predict)
    .map(out_format.imagenet_format)
  classified_images.foreachPartition(hbase_io.put_to_hbase)
}
VISUAL MODEL AND DATA INSPECTION
DOMAIN-ORIENTED QUERY AND ACCESS
Bounding box inspection for model comparison
DOMAIN-ORIENTED IMAGE SEARCH
Semantic Search For End Users
• Semantic search:
• Things
• Relationships
• Activities
• Situations
• End user tools:
• HUE dashboard
• CDSW notebook
bounding boxes overlap
car in front of... truck
A property of the object pair becomes a fact.
Facts are added to the search index.
This makes semantic search straightforward.
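A minimal sketch of how a property of an object pair could become an indexable fact. It uses intersection over union as the overlap measure and a threshold of 0.1; both choices, the box coordinates, and the function names are assumptions for illustration, not the presented implementation.

```python
from itertools import combinations


def overlap_ratio(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def overlap_facts(detections, threshold=0.1):
    """Turn overlapping object pairs into searchable fact strings."""
    facts = set()
    for (label_a, box_a), (label_b, box_b) in combinations(detections, 2):
        if overlap_ratio(box_a, box_b) >= threshold:
            first, second = sorted((label_a, label_b))
            facts.add(f"overlap:({first}, {second})")
    return sorted(facts)


# Invented detections: (label, bounding box) pairs for one image.
detections = [
    ("person", (10, 10, 50, 100)),
    ("bicycle", (20, 50, 80, 110)),
    ("truck", (200, 20, 300, 90)),
]
print(overlap_facts(detections))
```

Sorting the two labels before building the string gives each pair one canonical fact, so a query does not need to try both orders.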
DEMO 1
Image Search and Model Comparison
VISUAL LABEL INSPECTION VIA HUE:
How to Inspect Label Quality & Relations Between Objects?
Index contains:
- object relations
- predicted labels
- object statistics
Rendered bounding boxes are key to visual inspection:
- easy comparison of multiple model classes (A, B) or model versions (C1, C2)
Model A | Model B
TRAINING SET MANAGEMENT
Select the right data for the question
riders = ImageSet.select("overlap:(bicycle, person)")
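To make the call above concrete, here is a hypothetical in-memory `ImageSet`. It is not the Tenzing API: the class body, the fact strings, and the sample index are invented, and a real implementation would presumably query the search index instead of a dictionary.

```python
class ImageSet:
    """Hypothetical in-memory image set; not the Tenzing API."""

    # image key -> facts stored for that image in the search index
    _index = {
        "20180428152138": {"label:truck"},
        "20180428152330": {"label:person", "label:bicycle",
                           "overlap:(bicycle, person)"},
        "20180428152831": {"label:person", "label:bicycle"},
    }

    def __init__(self, keys):
        self.keys = sorted(keys)

    @classmethod
    def select(cls, fact):
        """Return the images whose indexed facts contain `fact`."""
        return cls(k for k, facts in cls._index.items() if fact in facts)

    def __len__(self):
        return len(self.keys)


riders = ImageSet.select("overlap:(bicycle, person)")
print(riders.keys)
```

The third image contains both a person and a bicycle but no overlap fact, so it is not selected; that is the difference between searching for labels and searching for relations.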
MODEL DEVELOPMENT
Default models are usually not enough
cyclist | person holding bike
rider_model = model.fit(
    riders,
    rider_labels,
    epochs=30,
    batch_size=20,
    validation_data=(validation_features, validation_labels),
)
NEXT STEPS TOWARDS CONTEXTUAL AWARENESS
HOW TO IDENTIFY SEMANTIC RELATIONS?
From labels to semantic graphs ...
1. Build an ontology for traffic scenes (or any other domain you work on).
2. Map statistical object properties to an RDF graph using heuristics.
3. Combine scene graphs in a triple store.
4. Search with SPARQL.
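The steps above can be illustrated without a triple store. The sketch below keeps scene facts as plain (subject, predicate, object) tuples and joins triple patterns in the spirit of a SPARQL basic graph pattern. The `ex:` vocabulary and the sample scene are made up, and a real deployment would use an RDF library and a SPARQL endpoint instead of this toy matcher.

```python
# Scene facts as (subject, predicate, object) triples. The "ex:" names are
# a made-up vocabulary for this sketch, not a published ontology.
triples = {
    ("ex:img1", "ex:contains", "ex:car1"),
    ("ex:img1", "ex:contains", "ex:truck1"),
    ("ex:car1", "rdf:type", "ex:Car"),
    ("ex:truck1", "rdf:type", "ex:Truck"),
    ("ex:car1", "ex:inFrontOf", "ex:truck1"),
    ("ex:img2", "ex:contains", "ex:car2"),
    ("ex:car2", "rdf:type", "ex:Car"),
}


def match(pattern, bindings):
    """Yield extended bindings for one triple pattern (variables start with '?')."""
    for triple in triples:
        new = dict(bindings)
        for term, value in zip(pattern, triple):
            if term.startswith("?"):
                # Bind the variable, or reject if it is already bound differently.
                if new.setdefault(term, value) != value:
                    break
            elif term != value:
                break
        else:
            yield new


def query(patterns):
    """Join triple patterns one after another."""
    results = [{}]
    for pattern in patterns:
        results = [b for bound in results for b in match(pattern, bound)]
    return results


# Roughly: SELECT ?img WHERE { ?img ex:contains ?a . ?a rdf:type ex:Car .
#                              ?a ex:inFrontOf ?b . ?b rdf:type ex:Truck }
rows = query([
    ("?img", "ex:contains", "?a"),
    ("?a", "rdf:type", "ex:Car"),
    ("?a", "ex:inFrontOf", "?b"),
    ("?b", "rdf:type", "ex:Truck"),
])
print(sorted({row["?img"] for row in rows}))
```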
HOW TO WORK WITH SEMANTIC RELATIONS?
Creation of the local graph ...
• Build an ontology for traffic scenes
• Map statistical object properties to an RDF graph using heuristics
• Combine scene graphs (triple store)
• Search with SPARQL
• Object detection
• Deep neural networks
• Bounding Box analysis
• Render BBs with labels
• Geometry based heuristics
• Overlap ratios
• Orientation analysis
• SOLR Search by
• Labels
• Relations
• SPARQL for RDF data
• Scene recognition
WHY SEARCH ON A KNOWLEDGE BASE?
Provide a better search experience
• Local knowledge graphs enable search for:
THINGS (pedestrian, stop sign, hot spot, gun, …)
RELATIONSHIPS (close by, in front of, above, underneath, ...)
ACTIVITIES (danger, theft, evasion, escape)
SITUATIONS (combinations of THINGS, RELATIONS, and ACTIVITIES)
• ... very fast, even in huge image collections.
Knowledge graphs remove the need to know SOLR schema details.
IMPLEMENTATION OF COMPLEMENTARY SEARCH CHANNELS
Triplification of information from images using local graphs
MOVING FORWARD
CLOUDERA DATA SCIENCE WORKBENCH
Accelerate machine learning from research to production
Data Science is an essential part
of the bigger picture
[Figure: platform overview. Workloads: Data Engineering, Data Science, Data Warehouse, Operational Database, 3rd-party services. Common services: Data Catalog, Governance, Security, Lifecycle Management. Control Plane. Storage: HDFS, Kudu, Amazon S3, Microsoft ADLS.]
CLOUDERA DATA SCIENCE WORKBENCH
Accelerate machine learning from research to production
For data scientists:
• Experiment faster
Use R, Python, or Scala with
on-demand compute and
secure CDH data access.
• Work together
Share reproducible research
with your whole team.
• Deploy with confidence
Get to production repeatably
and without recoding.
For IT professionals:
• Bring data science to the data
Give your data science team
more freedom while reducing
the risk and cost of silos.
• Secure by default
Leverage common security and
governance across workloads.
• Run anywhere
On-premises or in the cloud, on
CPU or GPU.
OUTLOOK
EXTEND TENZING ... for better image processing
IMAGE CLUSTERING USING K-MEANS
• Extraction of image features
• Conversion of specific data formats into:
org.apache.spark.mllib.linalg.{Vector, Vectors}
• Features are persisted in this reusable format as a Parquet file
• From here we go ... apply SparkML code to a part of the compound dataset.
• Finally, we feed the new labels back into the compound dataset,
  e.g., for comparison of different clustering models with known clusters
DEMO 2
Image & Feature Processing
/GITHUB/TSA_finance/bin
./runSpark2.sh
import org.apache.spark.rdd.RDD
import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.clustering.{KMeans, KMeansModel}
import lire.Base64ImageConverter
import lire.LireToolWrapperGF
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.implicits._
// FEATURE EXTRACTION USING OPEN SOURCE LIBRARIES ...
// THIS IS ONE SLICE OF THE "COMPOUND DATASET": images can be vectorized in multiple ways, e.g., using color histograms
val dataPath = "/GITHUB/finance/data/out/image-analysis/A_2018-09-10-21-11-49/image_MD_ALL.parquet"
val df = sqlContext.read.parquet(dataPath)
val feature = df.select("FCTH")
val input = feature.rdd.map(x => x.getAs[org.apache.spark.mllib.linalg.Vector]("FCTH"))
val kmeans = new KMeans().setK(8).setSeed(1L)
val model = kmeans.run(input)
// Evaluate the cost on the same vectors the model was trained on
val WSSSE = model.computeCost(input)
println(s"Within Set Sum of Squared Errors = $WSSSE")
// Show the result. TODO: wrap this model into a "Tenzing labeler" which also persists the results in the MD collection
println("Cluster Centers: ")
model.clusterCenters.foreach(println)
val dataToCluster = ....
val labeledData = model.predict(dataToCluster)
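For readers who want to see what the clustering step and the WSSSE figure compute, here is a small pure-Python version of Lloyd's algorithm. It is a teaching sketch, not Spark MLlib: the initialization (first k points), the fixed iteration count, and the two-dimensional sample vectors are assumptions chosen to keep the run deterministic.

```python
def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def closest(point, centers):
    """Index of the center nearest to point."""
    return min(range(len(centers)), key=lambda i: squared_distance(point, centers[i]))


def kmeans(points, k, iterations=10):
    """Lloyd's algorithm with the first k points as initial centers."""
    centers = [list(p) for p in points[:k]]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[closest(p, centers)].append(p)
        for i, members in enumerate(clusters):
            if members:  # keep the old center if a cluster is empty
                centers[i] = [sum(dim) / len(members) for dim in zip(*members)]
    return centers


def wssse(points, centers):
    """Within-set sum of squared errors: squared distance of each point to its center."""
    return sum(squared_distance(p, centers[closest(p, centers)]) for p in points)


# Invented 2-D "feature vectors"; real image descriptors have many more dimensions.
features = [(0.0, 0.0), (10.0, 10.0), (0.0, 1.0), (1.0, 0.0), (10.0, 11.0), (11.0, 10.0)]
centers = kmeans(features, k=2)
labels = [closest(p, centers) for p in features]
cost = wssse(features, centers)
print(centers, labels, cost)
```

The `labels` list plays the role of the cluster assignments that would be written back into the compound dataset.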
OUTLOOK: SPARK IMPLEMENTATION OF KMEANS CLUSTERING (SCALA)
def predict(pairs: Iterator[(String, (INDArray, Int, Int))]) = {
  val model = ClusteringModelSerializer.restoreKMeansModel(modelLoc)
  pairs.map { case (name, imageFeatures) =>
    (name, run_kmeans_clustering_on_image_features(model, imageFeatures))
  }
}

def main(args: Array[String]): Unit = {
  val sc = new SparkContext(conf)
  val hbase_io = common.HbaseIO(args)
  val out_format = common.KMeansOutputFormatter(args)
  val hbase_image_features = hbase_io.load_features_from_hbase(sc)
  val clustered_images = hbase_image_features.mapPartitions(predict)
    .map(out_format.imagenet_format)
  clustered_images.foreachPartition(hbase_io.put_to_hbase)
}
SUMMARY
What we can do with image search today:
• Search for combinations and amounts of objects at scale: "at least 5 cars and 2 trucks"
• Search for basic relationships among those things: "in front of", "in a line with"
• Enrich the search experience with other domains: geospatial, sensor data, etc.
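A query such as "at least 5 cars and 2 trucks" boils down to a filter on per-image label counts. The sketch below shows the idea on an invented dictionary of detections; the `at_least` helper is hypothetical. In a search index the same condition could be expressed as range queries on per-label count fields, but that field layout would be an assumption as well.

```python
from collections import Counter

# Hypothetical detections per image: image key -> list of predicted labels.
detections = {
    "img-a": ["car"] * 6 + ["truck"] * 2 + ["person"],
    "img-b": ["car"] * 5 + ["truck"],
    "img-c": ["car"] * 2 + ["truck"] * 3,
}


def at_least(detections, **minimums):
    """Keys of images that contain at least the given number of each label."""
    hits = []
    for key, labels in detections.items():
        counts = Counter(labels)
        if all(counts[label] >= n for label, n in minimums.items()):
            hits.append(key)
    return sorted(hits)


print(at_least(detections, car=5, truck=2))
```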
This helps to:
• Gain a better understanding of the quality of our CV models/apps
• Discover corner cases and improve the model lifecycle
• Build new (data) products faster
In the near future:
• Focus on semantic search, advanced visualization
• Improved model lifecycles and AutoML.
Many thanks to collaborators & supporters:
Anton Vukovic, Jan Kunigk, Marton Balassi
Alexander Bartfeld, Willem Stoop
THANK YOU
APPENDIX: GETTING DATA
There are many great datasets out there for research purposes:
• Cityscapes, https://www.cityscapes-dataset.com/
• COCO, http://cocodataset.org/#home
• YouTube-8M, https://research.google.com/youtube8m/

Improving computer vision models at scale (Strata Data NYC)

  • 1. © Cloudera, Inc. All rights reserved. IMPROVING COMPUTER VISION MODELS AT SCALE Dr. Mirko Kämpf | Senior Solutions Architect
  • 2. VISION IS EVERYTHING • Automotive OEM & Tier 1 Suppliers: autonomous vehicle programs • Healthcare & Medical Devices: physician augmentation and robotic devices • Manufacturing & Pharmaceuticals: visual inspection for quality & yield • Security & Public Sector: customs & border protection, anti-crime efforts • Insurance: claims processing and fraud detection using images & video
  • 3. IMAGINE THE POSSIBILITIES... COMPUTER VISION TECHNOLOGY ALLOWS US TO: • detect tumors in medical images • detect broken parts in a manufacturing line • detect violence in public spaces • detect dangerous situations in traffic • detect a fire early ... ALL THAT @ SCALE AND USING THE CLOUD!
  • 4. BE OPEN AND THINK BIG! • Cameras are everywhere ... • In traffic (cars, trains, planes) • In public places (train stations, airports, public buildings) • Many public datasets are available ... • Udacity: autonomous driving datasets • Medical images • Search like (with) Google: • Dataset search is a great new tool. • Image search makes grabbing tagged images easier than ever before. CLOUDERA BUILDS ENTERPRISE SOLUTIONS ON OPEN STANDARDS.
  • 5. BIG DATA, MACHINE LEARNING, ARTIFICIAL INTELLIGENCE BUT… THERE ARE CHALLENGES
  • 6. CHALLENGES: VOLUME, ANNOTATION, MANAGEMENT
  • 7. VOLUME CHALLENGE
  • 8. ANNOTATION CHALLENGE “Let’s consider Cityscapes dataset (useful for self-driving cars). Fine pixel-level annotation of a single image from cityscapes required more than 1.5 hours on average. They annotated 5000 images. With a simple math we can calculate, that they spent about 5000 * 1.5 = 7500 hours...” Source: https://hackernoon.com/%EF%B8%8F-big-challenge-in-deep-learning-training-data-31a88b97b282 Are you willing to invest thousands of hours? Industry prefers automation!
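The arithmetic in the quote can be reproduced in a few lines. This is only a sketch: the image count and hours per image come from the slide, while the 8-hour working day and 220 working days per year are illustrative assumptions used to express the total as person-years.

```python
# Figures quoted on the slide (Cityscapes fine annotation).
IMAGES = 5000
HOURS_PER_IMAGE = 1.5

# Illustrative assumptions, not from the talk.
HOURS_PER_WORKDAY = 8
WORKDAYS_PER_YEAR = 220


def annotation_hours(images: int, hours_per_image: float) -> float:
    """Total manual annotation effort in hours."""
    return images * hours_per_image


def person_years(hours: float) -> float:
    """Convert hours into person-years under the assumed working calendar."""
    return hours / (HOURS_PER_WORKDAY * WORKDAYS_PER_YEAR)


total_hours = annotation_hours(IMAGES, HOURS_PER_IMAGE)
total_person_years = round(person_years(total_hours), 2)
print(total_hours, total_person_years)
```

Under these assumptions the 7500 hours correspond to a little over four person-years of full-time work for one fairly small dataset.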
  • 9. MANAGEMENT CHALLENGE: SHARING, DISCOVERY
  • 10. RESULTING STATUS QUO: VAST UNKNOWN; SMALL TRAINING SETS USED IN ISOLATION
  • 12. CLOUDERA DIFFERENCE
  • 13. INTERSECTION OF TWO WORLDS: Digital Asset Management and Data Science Platforms • Automated annotation of new data • Optimized model improvements
  • 14. ML-POWERED, DOMAIN-AWARE IMAGE REPOSITORY FOR DATA SCIENCE Core functions, implemented as reusable building blocks that provide extensible, repeatable, and focused solutions: • Asset Acquisition and Processing: processing of digital assets with machine learning annotation and enrichment • Domain-Oriented Query and Access: semantic query asset acquisition and discovery with domain-specific ontologies • Training Set Management: using the “domain-aware DAM” to generate and retrieve relevant data sets for model development and automated testing • Model Development: employing relevant training sets in a shared data science environment to construct domain-specific models and to power automated model testing services This looks a bit like part of a supply chain, doesn’t it?
  • 15. WHAT IS IMAGE LOGISTICS? ... the problem we solve ;-) • Provide images to training procedures efficiently (fast!) • Identify relevant images to train better DNNs • Rearrange images quickly to adapt to users’ needs • Manage multiple kinds of metadata: movies, images + context data • Manage the lifecycle of compound datasets
  • 16. HOW THINGS WORK
  • 17. DATA ENGINEERING AND MODEL LIFECYCLE
  • 18. FUNCTIONAL REQUIREMENTS • Fast random access to images • Free text search for labels / tags / statistical properties • Execute existing Python and Scala deep learning pipelines at scale • Automatic labeling and indexing of detected facts • Visual model comparison • Search for complex scenarios: situational awareness
  • 19. SOLUTION OVERVIEW (1)
  • 20. SOLUTION OVERVIEW (2) Main users: Data Scientists and Domain Experts
  • 21. CONCEPTS FOR EFFICIENT IMAGE WAREHOUSING:
  • 22. COMPOUND DATASETS - ASSET ACQUISITION: Image + Metadata
  • 23. UNIFIED ASSET PROCESSING ONE API – MANY FRAMEWORKS – MANY MODELS • Domain-specific attribute extraction and enhancement • Additional domains added as needed • Example domains: autonomous driving, disease identification
  • 24. TENZING: simplified access to large image datasets • We provide an API for accessing images and image sets via Apache Spark, for tagging/labeling (in CDH) and model training (in CDSW). • Solr’s powerful search capabilities are used to identify the right images. • The complexity of allocating individual images or image sets within HBase is hidden within the data access layer (DAL) and the Tenzing API.
  • 25. A FULL DATA PIPELINE FOR THE TRAFFIC DOMAIN [Diagram. Recoverable labels: ffmpeg producing img frames (jpg); Image Data in HBase with column families CF:img_all, CF:tags, CF:geo; keys shown as timestamps (e.g. 20180428152330) with numbers 30, 31; models imagenet, retinanet, tiny-yolo producing tags such as stop-sign, person, truck, boat, bicycle, traffic light; NMEA GPS data (timestamp, lat 48.1 to 48.5, lon 9.0 to 9.2) converted to AVRO via gps2avro and pynmea2; Geodata (area such as B14 or Stadium, tunnel yes/no, bridge yes/no) from OpenStreetMap / overpass API via overpy; HBaseStorageHandler; Lily hbase-indexer-mr-job.jar; Tenzing]
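One step of the pipeline above, turning NMEA coordinates into decimal degrees and attaching them to image frames by timestamp, can be sketched as follows. The deck uses pynmea2 for parsing; this sketch uses only the standard library, and the `row_key` layout (timestamp plus frame number) and the sample values are assumptions made for illustration.

```python
from typing import Dict, Tuple


def nmea_to_decimal(value: str, hemisphere: str) -> float:
    """Convert an NMEA (d)ddmm.mmmm value plus hemisphere letter to signed decimal degrees."""
    dot = value.index(".")
    degrees = int(value[:dot - 2])      # everything before the two minute digits
    minutes = float(value[dot - 2:])    # mm.mmmm
    decimal = degrees + minutes / 60.0
    return -decimal if hemisphere in ("S", "W") else decimal


def row_key(timestamp: str, frame: int) -> str:
    """Hypothetical key layout: timestamp plus zero-padded frame number."""
    return f"{timestamp}-{frame:02d}"


# One GPS fix per timestamp (sample values chosen to resemble the slide).
gps_by_timestamp: Dict[str, Tuple[float, float]] = {
    "20180428152330": (
        nmea_to_decimal("4806.000", "N"),
        nmea_to_decimal("00900.000", "E"),
    ),
}

# Frames extracted from the video share the timestamp of their GPS fix.
frame_keys = [row_key("20180428152330", n) for n in (30, 31)]

# Stand-in for the geo column family: frame key -> (lat, lon).
geo_column = {key: gps_by_timestamp[key.split("-")[0]] for key in frame_keys}
print(geo_column)
```

Joining on the timestamp prefix keeps image data and context data addressable through the same key, which is the idea behind a compound dataset.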
  • 26. SOME IMPLEMENTATION DETAILS ...
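  • 27. PYSPARK IMPLEMENTATION (KERAS)

        from pyspark import SparkContext
        from keras.applications.inception_v3 import InceptionV3  # assumed import, not shown on the slide

        # FLAGS, conf, common, MODEL_NAME and run_inference_on_image
        # come from the project code and are not shown on the slide.

        def predict(iterator):
            # Load the model once per partition, not once per image.
            model = InceptionV3(weights=None)
            model.load_weights(FLAGS.weights_file)
            return [(x[0], run_inference_on_image(model, x[1])) for x in iterator]

        def main():
            sc = SparkContext(conf=conf)
            hbase_io = common.HbaseIO(FLAGS)
            out_format = common.OutputFormatter(FLAGS, MODEL_NAME)
            hbase_images = hbase_io.load_from_hbase(sc)
            # Parentheses let the chained call span two lines.
            classified_images = (hbase_images.mapPartitions(predict)
                                 .map(out_format.imagenet_format))
            classified_images.foreachPartition(hbase_io.put_to_hbase)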
  • 28. PYSPARK IMPLEMENTATION (KERAS) • The Python environment with TensorFlow is distributed to the executors at runtime; it is not preinstalled on the nodes. • The individual models only need to implement the following functions: prepare, predict, output_format • Conceptually very close to the scikit-learn or Spark ML Pipelines approach • Deep Learning Pipelines can be a way to streamline the implementation: https://databricks.com/blog/2017/06/06/databricks-vision-simplify-large-scale-deep-learning.html
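The three-function contract named above (prepare, predict, output_format) might look roughly like the following. This is a hypothetical sketch, not the Tenzing interface: the class names, the toy brightness "model", and the `run_partition` helper, which mimics what a Spark `mapPartitions` call would do on one partition, are all assumptions.

```python
from abc import ABC, abstractmethod
from typing import Any, Dict, Iterable, Iterator, List, Tuple


class ImageModel(ABC):
    """Hypothetical contract: every model supplies these three functions."""

    @abstractmethod
    def prepare(self) -> None:
        """Load weights or other state once per partition."""

    @abstractmethod
    def predict(self, image: Any) -> Any:
        """Run inference on a single image."""

    @abstractmethod
    def output_format(self, key: str, prediction: Any) -> Dict[str, Any]:
        """Shape the prediction into a record that can be written back."""


class MeanBrightnessModel(ImageModel):
    """Toy stand-in for a deep learning model: labels images by mean pixel value."""

    def prepare(self) -> None:
        self.threshold = 128

    def predict(self, image: List[int]) -> str:
        mean = sum(image) / len(image)
        return "bright" if mean >= self.threshold else "dark"

    def output_format(self, key: str, prediction: str) -> Dict[str, Any]:
        return {"row_key": key, "label": prediction}


def run_partition(model: ImageModel,
                  rows: Iterable[Tuple[str, Any]]) -> Iterator[Dict[str, Any]]:
    """Process one partition of (key, image) pairs with a single model instance."""
    model.prepare()
    for key, image in rows:
        yield model.output_format(key, model.predict(image))


rows = [("img-1", [200, 220, 240]), ("img-2", [10, 20, 30])]
results = list(run_partition(MeanBrightnessModel(), rows))
print(results)
```

Keeping the contract this small is what lets the surrounding Spark job stay identical while models and frameworks are swapped.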
  • 29. SPARK IMPLEMENTATION (DL4J / SCALA)

        import org.apache.spark.SparkContext
        import org.deeplearning4j.util.ModelSerializer
        import org.nd4j.linalg.api.ndarray.INDArray

        // conf, modelLoc, common and run_inference_on_image
        // come from the project code and are not shown on the slide.

        def predict(pairs: Iterator[(String, (INDArray, Int, Int))]) = {
          // Restore the model once per partition.
          val model = ModelSerializer.restoreComputationGraph(modelLoc)
          pairs.map { case (name, image) =>
            (name, run_inference_on_image(model, image))
          }
        }

        def main(args: Array[String]) = {
          val sc = new SparkContext(conf)
          val hbase_io = common.HbaseIO(args)
          val out_format = common.OutputFormatter(args)
          val hbase_images = hbase_io.load_from_hbase(sc)
          val classified_images = hbase_images.mapPartitions(predict)
            .map(out_format.imagenet_format)
          classified_images.foreachPartition(hbase_io.put_to_hbase)
        }
  • 30. VISUAL MODEL AND DATA INSPECTION
  • 31. DOMAIN-ORIENTED QUERY AND ACCESS Bounding box inspection for model comparison
  • 32. DOMAIN-ORIENTED IMAGE SEARCH Semantic Search For End Users • Semantic search: things, relationships, activities, situations • End user tools: HUE dashboard, CDSW notebook • Example: bounding boxes overlap, “car in front of ... truck”. A property of the object pair becomes a fact. Facts are added to the search index. This makes semantic search easy.
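How a property of an object pair becomes an indexable fact can be sketched with plain bounding-box geometry. The overlap measure (intersection area divided by the smaller box area), the 0.5 threshold, and the sample boxes are assumptions; the fact string follows the `overlap:(bicycle, person)` form used later in the deck.

```python
from itertools import combinations
from typing import List, NamedTuple


class Box(NamedTuple):
    label: str
    x1: float
    y1: float
    x2: float
    y2: float


def area(box: Box) -> float:
    return max(0.0, box.x2 - box.x1) * max(0.0, box.y2 - box.y1)


def overlap_ratio(a: Box, b: Box) -> float:
    """Intersection area relative to the smaller box (assumed heuristic)."""
    width = min(a.x2, b.x2) - max(a.x1, b.x1)
    height = min(a.y2, b.y2) - max(a.y1, b.y1)
    if width <= 0 or height <= 0:
        return 0.0
    smaller = min(area(a), area(b))
    return (width * height) / smaller if smaller > 0 else 0.0


def overlap_facts(boxes: List[Box], threshold: float = 0.5) -> List[str]:
    """Emit one fact string per pair of boxes that overlap strongly enough."""
    facts = []
    for a, b in combinations(boxes, 2):
        if overlap_ratio(a, b) >= threshold:
            first, second = sorted((a.label, b.label))
            facts.append(f"overlap:({first}, {second})")
    return facts


boxes = [
    Box("person", 10, 10, 50, 110),
    Box("bicycle", 5, 60, 70, 120),
    Box("truck", 200, 20, 400, 150),
]
facts = overlap_facts(boxes)
print(facts)
```

Each fact string can then be stored next to the image's labels, so a text search for the fact returns the images in which the relation holds.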
  • 33. DEMO 1: Image Search and Model Comparison
  • 34. VISUAL LABEL INSPECTION VIA HUE: How to Inspect Label Quality & Relations Between Objects? Index contains: object relations, predicted labels, object statistics. Rendered bounding boxes are key to visual inspection: easy comparison of multiple model classes (A, B) or model versions (C1, C2). [Images: Model A, Model B]
  • 35. TRAINING SET MANAGEMENT Select the right data for the question: riders = ImageSet.select("overlap:(bicycle, person)")
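What such a selection does conceptually can be shown with an in-memory stand-in. `InMemoryImageSet` is a hypothetical class written for this sketch, not the `ImageSet` from the talk, which is backed by a search index; the image ids and facts are made up.

```python
from typing import Dict, List, Set


class InMemoryImageSet:
    """Hypothetical stand-in: maps image ids to the facts indexed for them."""

    def __init__(self, index: Dict[str, Set[str]]) -> None:
        self._index = index

    def select(self, fact: str) -> List[str]:
        """Return the ids of all images for which the given fact was indexed."""
        return sorted(image_id for image_id, facts in self._index.items()
                      if fact in facts)


index = {
    "frame-0030": {"label:person", "label:bicycle", "overlap:(bicycle, person)"},
    "frame-0031": {"label:person", "label:truck"},
    "frame-0032": {"label:bicycle", "label:person", "overlap:(bicycle, person)"},
}

riders = InMemoryImageSet(index).select("overlap:(bicycle, person)")
print(riders)
```

The selected ids form the candidate training set for a more specific class such as "cyclist".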
  • 36. MODEL DEVELOPMENT Default models are usually not enough [Image labels: cyclist, person holding bike] rider_model = model.fit(riders, rider_labels, epochs=30, batch_size=20, validation_data=(validation_features, validation_labels))
  • 37. NEXT STEPS TOWARDS CONTEXTUAL AWARENESS
  • 38. HOW TO IDENTIFY SEMANTIC RELATIONS? From labels to semantic graphs ... 1. Build an ontology for traffic scenes (or any other domain you work on). 2. Map statistical object properties to an RDF graph using heuristics. 3. Combine scene graphs in a triple store. 4. Search with SPARQL.
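The four steps above can be illustrated without a real triple store. In this sketch a Python list of (subject, predicate, object) tuples stands in for the RDF store and a small pattern matcher stands in for SPARQL; the `ex:` vocabulary and the scene contents are hypothetical.

```python
from typing import List, Optional, Tuple

Triple = Tuple[str, str, str]

# Steps 1 and 2: scene graphs expressed with a tiny, made-up ontology.
scene1: List[Triple] = [
    ("scene1/car1", "rdf:type", "ex:Car"),
    ("scene1/truck1", "rdf:type", "ex:Truck"),
    ("scene1/car1", "ex:inFrontOf", "scene1/truck1"),
]
scene2: List[Triple] = [
    ("scene2/car1", "rdf:type", "ex:Car"),
    ("scene2/person1", "rdf:type", "ex:Person"),
    ("scene2/person1", "ex:inFrontOf", "scene2/car1"),
]

# Step 3: combine the scene graphs (a list stands in for the triple store).
store: List[Triple] = scene1 + scene2


def match(triples: List[Triple], s: Optional[str] = None,
          p: Optional[str] = None, o: Optional[str] = None) -> List[Triple]:
    """Return all triples matching the pattern; None acts as a wildcard."""
    return [t for t in triples
            if (s is None or t[0] == s)
            and (p is None or t[1] == p)
            and (o is None or t[2] == o)]


def car_in_front_of_truck(triples: List[Triple]) -> List[Tuple[str, str]]:
    """Step 4. Roughly equivalent SPARQL (with suitable PREFIX declarations):
    SELECT ?a ?b WHERE { ?a a ex:Car . ?a ex:inFrontOf ?b . ?b a ex:Truck }
    """
    cars = {s for s, _, _ in match(triples, p="rdf:type", o="ex:Car")}
    trucks = {s for s, _, _ in match(triples, p="rdf:type", o="ex:Truck")}
    return [(s, o) for s, _, o in match(triples, p="ex:inFrontOf")
            if s in cars and o in trucks]


hits = car_in_front_of_truck(store)
print(hits)
```

The query names concepts from the ontology rather than index fields, which is why the deck argues that a knowledge graph spares users the search schema details.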
  • 39. HOW TO WORK WITH SEMANTIC RELATIONS? Creation of the local graph ... • Build ontology for traffic scenes • Map statistical object properties to RDF graph using heuristics • Combine scene graphs (triple store) • Search with SPARQL • Object detection: deep neural networks • Bounding box analysis: render BBs with labels • Geometry-based heuristics: overlap ratios, orientation analysis • SOLR search by labels, relations • SPARQL for RDF data • Scene recognition
  • 40. WHY SEARCH ON A KNOWLEDGE BASE? Provide a better search experience • Local knowledge graphs enable search for: THINGS (pedestrian, stop sign, hot spot, gun, …), RELATIONSHIPS (close by, in front of, above, underneath, ...), ACTIVITIES (danger, theft, evasion, escape), SITUATIONS (combinations of THINGS, RELATIONSHIPS, and ACTIVITIES) • ... very fast, even in huge image collections. Knowledge graphs remove the need to know SOLR schema details.
  • 41. IMPLEMENTATION OF COMPLEMENTARY SEARCH CHANNELS Triplification of information from images using local graphs
  • 42. MOVING FORWARD
  • 43. CLOUDERA DATA SCIENCE WORKBENCH Accelerate machine learning from research to production Data Science is an essential part of the bigger picture [Platform diagram. Labels: WORKLOADS (data engineering, data science, data warehouse, operational database, 3rd party services); COMMON SERVICES (data catalog, governance, security, lifecycle management); STORAGE (HDFS, Kudu, Amazon S3, Microsoft ADLS); CONTROL PLANE]
  • 44. CLOUDERA DATA SCIENCE WORKBENCH Accelerate machine learning from research to production For data scientists: • Experiment faster: use R, Python, or Scala with on-demand compute and secure CDH data access. • Work together: share reproducible research with your whole team. • Deploy with confidence: get to production repeatably and without recoding. For IT professionals: • Bring data science to the data: give your data science team more freedom while reducing the risk and cost of silos. • Secure by default: leverage common security and governance across workloads. • Run anywhere: on-premises or in the cloud, on CPU or GPU.
  • 45. OUTLOOK: EXTEND TENZING ... for better image processing
  • 46. IMAGE CLUSTERING USING K-MEANS • Extraction of image features • Conversion of specific data formats into org.apache.spark.mllib.linalg.{Vector, Vectors} • Features are persisted in this reusable format as a Parquet file • From here we apply Spark ML code to a part of the compound dataset • Finally we feed the new labels back into the compound dataset, e.g., for comparison of different clustering models with known clusters
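The clustering step itself can be sketched in plain Python. This is a minimal Lloyd's-algorithm illustration on made-up 2-D feature vectors, not the Spark MLlib implementation used in the deck: the deterministic initialisation (evenly spaced input points) is a simplifying assumption, and real image features such as FCTH vectors have many more dimensions.

```python
from typing import List, Sequence, Tuple

Point = Sequence[float]


def squared_distance(a: Point, b: Point) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def closest(point: Point, centers: List[Point]) -> int:
    """Index of the nearest center (ties go to the lower index)."""
    return min(range(len(centers)),
               key=lambda i: squared_distance(point, centers[i]))


def mean(points: List[Point]) -> List[float]:
    return [sum(axis) / len(points) for axis in zip(*points)]


def kmeans(points: List[Point], k: int,
           iterations: int = 20) -> Tuple[List[Point], List[int]]:
    # Assumed deterministic initialisation: k evenly spaced input points.
    centers: List[Point] = [list(points[i * len(points) // k]) for i in range(k)]
    for _ in range(iterations):
        labels = [closest(p, centers) for p in points]
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # keep the old center if a cluster ends up empty
                centers[c] = mean(members)
    # Final assignment against the final centers.
    return centers, [closest(p, centers) for p in points]


# Made-up feature vectors forming two obvious groups.
features = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
            (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]

centers, labels = kmeans(features, k=2)

# Within Set Sum of Squared Errors, the quality measure printed in the deck's Spark code.
wssse = sum(squared_distance(p, centers[label])
            for p, label in zip(features, labels))
print(labels, wssse)
```

The resulting `labels` are what would be fed back into the compound dataset as a new annotation per image.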
  • 47. DEMO 2: Image & Feature Processing
  • 48. /GITHUB/TSA_finance/bin ./runSpark2.sh

        import org.apache.spark.rdd.RDD
        import org.apache.spark.mllib.linalg.{Vector, Vectors}
        import org.apache.spark.mllib.regression.LabeledPoint
        import org.apache.spark.mllib.clustering.{KMeans, KMeansModel}
        import lire.Base64ImageConverter
        import lire.LireToolWrapperGF

        val sqlContext = new org.apache.spark.sql.SQLContext(sc)
        import sqlContext.implicits._

        // Feature extraction using open source libraries ...
        // This is one slice of the "compound dataset": images can be vectorized
        // in multiple ways, e.g., using color histograms.
        val dataPath = "/GITHUB/finance/data/out/image-analysis/A_2018-09-10-21-11-49/image_MD_ALL.parquet"
        val df = sqlContext.read.parquet(dataPath)
        val feature = df.select("FCTH")
        val input = feature.rdd.map(x => x.getAs[org.apache.spark.mllib.linalg.Vector]("FCTH"))

        val kmeans = new KMeans().setK(8).setSeed(1L)
        val model = kmeans.run(input)

        // Cost is computed on the vectors the model was trained on.
        val WSSSE = model.computeCost(input)
        println(s"Within Set Sum of Squared Errors = $WSSSE")

        // TODO: wrap this model into a "Tenzing labeler" which also persists the results in the MD-Collection
        println("Cluster Centers: ")
        model.clusterCenters.foreach(println)

        val dataToCluster = ....  // elided on the slide
        val labeledData = model.predict(dataToCluster)
  • 49. OUTLOOK: SPARK IMPLEMENTATION OF KMEANS CLUSTERING (SCALA)

        // conf, modelLoc, common, ClusteringModelSerializer and
        // run_kmeans_clustering_on_image_features are not shown on the slide.

        def predict(pairs: Iterator[(String, (INDArray, Int, Int))]) = {
          // Restore the clustering model once per partition.
          val model = ClusteringModelSerializer.restoreKMeansModel(modelLoc)
          pairs.map { case (name, imageFeatures) =>
            (name, run_kmeans_clustering_on_image_features(model, imageFeatures))
          }
        }

        def main(args: Array[String]) = {
          val sc = new SparkContext(conf)
          val hbase_io = common.HbaseIO(args)
          val out_format = common.KMeansOutputFormatter(args)
          val hbase_image_features = hbase_io.load_features_from_hbase(sc)
          val clustered_images = hbase_image_features.mapPartitions(predict)
            .map(out_format.imagenet_format)
          clustered_images.foreachPartition(hbase_io.put_to_hbase)
        }
  • 50. SUMMARY What we can do with image search today: • Search for combinations and amounts of objects at scale: “at least 5 cars and 2 trucks” • Search for basic relationships among those things: “in front of”, “in a line with” • Enrich the search experience with other domains: geospatial, sensor data, etc. This helps to: • Gain a better understanding of the quality of our CV models/apps • Discover corner cases and improve the model lifecycle • Build new (data) products faster In the near future: • Focus on semantic search and advanced visualization • Improved model lifecycles and AutoML
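The count-based query mentioned in the summary ("at least 5 cars and 2 trucks") reduces to a simple predicate over per-image label counts. The sketch below shows only that predicate on made-up data; in the deck the query is answered by the search index, and the image ids and label lists here are assumptions.

```python
from collections import Counter
from typing import Dict, List


def has_at_least(labels: List[str], minimums: Dict[str, int]) -> bool:
    """True if every required label occurs at least the required number of times."""
    counts = Counter(labels)
    return all(counts[label] >= minimum for label, minimum in minimums.items())


# Made-up detection results: image id -> labels of all detected objects.
detections = {
    "frame-0100": ["car"] * 6 + ["truck"] * 2 + ["person"],
    "frame-0101": ["car"] * 5 + ["truck"],
    "frame-0102": ["car"] * 7 + ["truck"] * 3,
}

query = {"car": 5, "truck": 2}
hits = sorted(image_id for image_id, labels in detections.items()
              if has_at_least(labels, query))
print(hits)
```

Storing the counts as object statistics at indexing time is what makes this kind of question cheap to answer over a large collection.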
  • 51. Many thanks to collaborators & supporters: Anton Vukovic, Jan Kunigk, Marton Balassi, Alexander Bartfeld, Willem Stoop
  • 52. THANK YOU
  • 53. APPENDIX: GETTING DATA There are many great datasets out there for research purposes: • Cityscapes, https://www.cityscapes-dataset.com/ • COCO, http://cocodataset.org/#home • YouTube-8M, https://research.google.com/youtube8m/