Human-Efficient Discovery of Training Data for Visual ML
1. Human-Efficient Discovery of
Training Data for
Visual Machine Learning
Thesis Proposal
Ziqiang (Edmond) Feng
Committee:
Mahadev Satyanarayanan (Chair)
Martial Hebert
Roberta Klatzky
Padmanabhan Pillai (Intel Labs)
2. Agenda
• The Problem
• Thesis Statement
• Overview of Eureka
• Research Thrusts
• Related Work
• Timeline
3. Deep Learning for Computer Vision
[Example tasks: classification, detection, segmentation, activity recognition; classes such as bird, cat, dog]
4. Training Data Is A Key Ingredient
(1) Forward pass: raw pixels → Deep Neural Network → prediction
(2) Backward pass (aka back-propagation): prediction ⊗ label (ground truth) → error → Deep Neural Network

A training example = (raw pixels, label/ground truth)
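The forward/backward pass above can be sketched as a minimal gradient-descent step on a toy one-weight linear model; all names and values here are illustrative, not part of any real training system.

```python
# Minimal sketch of one training step: forward pass, error, backward pass.
# Toy linear model y = w * x with squared-error loss; all values illustrative.

def train_step(w, x, label, lr=0.1):
    pred = w * x                    # (1) forward pass: raw input -> prediction
    error = pred - label            # compare prediction with ground-truth label
    grad = 2 * error * x            # (2) backward pass: d(loss)/dw by chain rule
    return w - lr * grad            # update weight to reduce the error

w = 0.0
for _ in range(50):                 # repeatedly train on one example
    w = train_step(w, x=1.0, label=2.0)
print(round(w, 3))                  # w converges toward the label, 2.0
```

A real DNN repeats exactly this loop, only with millions of weights and the chain rule applied layer by layer.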
8. The Training Data Problem of Domain Experts
(scientists, military, medical doctors, etc.)

• Masked palm civet (Paguma larvata): transmitter of SARS during its 2003 outbreak
• BUK-M1: believed to have shot down MH17, killing 298 people in 2014
• Nuclear atypia in pathological images: a cue of several diseases and cancers
9. Why Is It Difficult?
Crowds are not experts.
Domain-specific expert knowledge is required.

Interesting phenomena are rare.
One must scan through a lot of data to find a few positives.

Access restrictions.
Only one or a few experts can label the data.
10. How can a single domain expert discover
thousands of positive examples of a rare
object from unlabeled data efficiently?
11. Thesis Statement
The manual effort of discovering a large training set for visual machine
learning can be reduced by a system combining:
• Early discard
• Just-in-time machine learning
• The ability to create more accurate filters without writing new code
This approach is efficient in:
• Different computing landscapes
(e.g., edge computing and smart storage)
• Different problem domains
(e.g., object detection in images and activity recognition in videos).
12. Agenda
• The Problem
• Thesis Statement
• Overview of Eureka
• Research Thrusts
• Related Work
• Timeline
13. Eureka
A methodology and a system.
• For finding rare phenomena in unlabeled visual data
Goal: utilize an expert’s time efficiently
• Reduce expert’s idle time
• Improve candidate examples’ quality
14. Itemizer (scoping)

Data Source (images, videos, map data, etc.) → Item Processor (cascaded filters F1, F2, …) → User Interface

• Item: independent unit of early discard and display (e.g., a single image)
• Attribute: key-value pairs attached by filters; facilitate communication between filters and post-analysis
• Filter: examines items and tries to drop them; short-circuit evaluation of cascaded filter chains
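The item/attribute/filter abstractions can be sketched as follows; the class and function names are my illustration, not the actual Eureka API.

```python
# Sketch of a short-circuiting early-discard filter chain.
# An Item carries raw data plus attributes written by filters.

class Item:
    def __init__(self, data):
        self.data = data
        self.attributes = {}        # key-value pairs shared between filters

def run_chain(item, filters):
    """Apply filters in order; stop at the first one that drops the item."""
    for f in filters:
        if not f(item):             # filter examines item, may add attributes
            return False            # short-circuit: later filters never run
    return True                     # item survives and is shown to the expert

# Two toy filters: a cheap color test, then a more expensive score test.
def color_filter(item):
    item.attributes["blueish"] = item.data.get("blue", 0) > 100
    return item.attributes["blueish"]

def score_filter(item):
    item.attributes["score"] = item.data.get("score", 0.0)
    return item.attributes["score"] > 0.5

items = [Item({"blue": 200, "score": 0.9}), Item({"blue": 10, "score": 0.9})]
kept = [it for it in items if run_chain(it, [color_filter, score_filter])]
print(len(kept))                    # only the first item passes both filters
```

Because the second item fails the cheap color test, the expensive score filter never runs on it, which is the point of ordering a cascade from cheap to costly.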
15. Early Discard Filters
• Purpose: drop probably-negative data and narrow the search space
• Not to be taken as a “perfect detector”
• Reduce the demand for expert time & attention
• Examples:
• Sky-blue color for birds
• Bullet shape for rocket-propelled grenades (RPGs)
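A “sky-blue for birds” early-discard filter can be as crude as a fraction-of-bluish-pixels test; the thresholds below are illustrative, not tuned values from the system.

```python
# Illustrative early-discard filter: keep an image only if enough of its
# pixels look sky-blue. Pixels are (r, g, b) tuples in 0-255.

def is_sky_blue(pixel):
    r, g, b = pixel
    return b > 150 and b > r + 30 and b > g + 20   # rough cue, not a detector

def sky_filter(pixels, min_fraction=0.2):
    blue = sum(1 for p in pixels if is_sky_blue(p))
    return blue / len(pixels) >= min_fraction       # True = keep for review

sky_patch = [(90, 160, 220)] * 80 + [(100, 100, 100)] * 20
indoor    = [(120, 100, 80)] * 100
print(sky_filter(sky_patch), sky_filter(indoor))    # True False
```

Such a filter misses birds against green foliage, which is acceptable: its job is only to cheaply shrink the candidate pool, not to be right every time.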
20. Agenda
• The Problem
• Thesis Statement
• Overview of Eureka
• Research Thrusts
• Related Work
• Timeline
21. Eureka in Different Computing Landscapes

Edge Computing · Cloud Computing · Smart Storage
Image Data

Focus:
• Computer system efficiency (throughput, latency, etc.)
• Identify hardware and software bottlenecks
• Develop techniques to improve computational efficiency
22. Eureka in Different Problem Domains

Edge Computing
Image Data · Video Data · Other multidimensional data (e.g., whole-slide images, HD maps)

Focus:
• Domain-specific optimization
• Expressive programming abstractions
• User productivity
23. Research Thrusts: Progress

[Roadmap grid: Edge Computing / Cloud Computing / Smart Storage × Image Data / Video Data / Other multidimensional data (e.g., whole-slide images, HD maps)]
24. [Roadmap grid repeated; next thrust: Edge Computing + Image Data]
25. Why the Edge?
• Data is generated at the edge
• Sensors, cameras, smartphones, drones, self-driving cars, smart streetlights, etc.
• Edge computing is the answer to scalability
• Can’t afford to send all data to the cloud for computation
• US average Internet bandwidth (2017) = 19 Mbps
• Barely enough to stream a single 4K Netflix video
27. Experiments

Data:
• Yahoo! Flickr 100 Million (YFCC100M) images
• Unlabeled; real-life object distribution
• Evenly partitioned across cloudlets

Cloudlets:
• 8 cloudlets
• Intel Xeon E5-1650, 32 GB DRAM
• Nvidia GTX 1060 GPU

Workflow:
• 5 iterations for each target
• Start with SIFT, RGB color histogram, Difference of Gaussian, …
• Later: iteratively re-train an SVM on MobileNet features
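The iterative workflow can be sketched as follows. This is a toy stand-in, not the real pipeline: a tiny perceptron replaces the SVM, 2-D random points replace MobileNet features, and an oracle function replaces the human expert.

```python
# Sketch of Eureka's iterative loop: rank unlabeled items with the current
# filter, have the "expert" label the top candidates, re-train on all labels.
import random

random.seed(0)

def train(examples, epochs=30, lr=0.1):
    """Perceptron stand-in for the SVM re-trained each iteration."""
    w = [0.0, 0.0, 0.0]                              # two features + bias
    for _ in range(epochs):
        for x, y in examples:                        # y is +1 or -1
            score = w[0] * x[0] + w[1] * x[1] + w[2]
            if score * y <= 0:                       # misclassified: update
                w = [w[0] + lr * y * x[0],
                     w[1] + lr * y * x[1],
                     w[2] + lr * y]
    return w

def expert_label(x):
    """Oracle standing in for the human expert; positives have x[0] > 0.7."""
    return 1 if x[0] > 0.7 else -1

unlabeled = [(random.random(), random.random()) for _ in range(500)]
labeled, seen = [], set()
w = [1.0, 0.0, 0.0]                                  # crude seed filter
for it in range(5):                                  # 5 iterations, as above
    scored = sorted((x for x in unlabeled if x not in seen),
                    key=lambda x: -(w[0]*x[0] + w[1]*x[1] + w[2]))
    for x in scored[:50]:                            # expert labels top 50
        seen.add(x)
        labeled.append((x, expert_label(x)))
    w = train(labeled)                               # re-train on all labels
positives = sum(1 for _, y in labeled if y == 1)
print(positives, len(labeled))
```

The structure is what matters: each round, the improving filter concentrates the expert's 50 labels on likelier positives, and every label (positive or negative) feeds the next model.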
28. Edge + Image: Results
                                          Deer        Taj Mahal   Fire hydrant
Estimated base rate (prevalence)          0.07%       0.02%       0.005%
True positives collected in 5 iterations  111         105         74
Images labeled by user                    7,447       4,791       15,379
Images discarded by Eureka                2,104,076   2,542,889   2,734,070
29. Compare with Naïve Hand-Labeling
[Bar chart, log scale 1,000 to 1,000,000: images (TP+FP) the user inspected to collect ~100 true positives of Deer, Taj Mahal, and Fire hydrant, comparing naïve hand-labeling, single-pass early discard, and Eureka. Lower is better.]
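The gap in the chart can be reproduced from the Deer column of the results table above: at a 0.07% base rate, collecting 111 true positives by naïve labeling would mean inspecting roughly 111 / 0.0007 images, versus the 7,447 survivors the user actually labeled with Eureka.

```python
# Expected images to inspect to find k true positives at base rate p,
# assuming uniform prevalence (numbers from the Deer column above).
base_rate = 0.0007            # 0.07%
true_positives = 111
naive = true_positives / base_rate
eureka = 7447                 # images the user actually labeled with Eureka
print(round(naive), round(naive / eureka, 1))   # 158571 21.3
```

So even this back-of-the-envelope estimate puts Eureka at roughly a 21× reduction in the expert's inspection burden for the deer target.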
30. Effect of the Iterative Workflow
[Line chart: newly-discovered true positives per minute vs. cumulative minutes in workflow, for Deer (base rate 0.07%), Taj Mahal (0.02%), and Fire Hydrant (0.005%). Discovery rates fall with lower base rates; the Fire Hydrant search is bottlenecked by its low base rate and by computation (the user waiting). Higher is better.]
31. Proximity to Data Is Important
[Bar chart: processing throughput (images/s) of an RGB color histogram filter while throttling the bandwidth between compute and data to 10 Mbps, 25 Mbps, 100 Mbps, and 1 Gbps (LAN). US average: 18.7 Mbps (2017). Higher is better.]
32. [Roadmap grid repeated; next thrust: Cloud Computing]
33. Why the Cloud?
• Historically, many data sets have been centralized in the cloud
• Elasticity: easy to recruit more compute resources by adding VMs
• Trade off $$$ for better use of expert time
34. Edge vs. Cloud

Edge:
• Independent CPUs and disks
• Access to local disk is fast

Cloud (Amazon Web Services as an example):
• [EC2 instances ↔ network ↔ S3 storage]
• Elasticity leads to separation of the compute and storage layers
• The I/O stack adds extra latency
• Contention for shared bandwidth
36. What Can We Do in the Cloud?

• Use extra threads to pre-fetch data asynchronously
• Utilizes the many cores
• In practice: got throttled by the service provider
• Cache data for later re-access
• Utilizes the large main memory
• Useful if the workload revisits data items
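The prefetch-and-cache ideas above can be sketched with a background thread filling a bounded queue plus a simple in-memory cache; `fetch()` is a made-up stand-in for a cloud-storage GET, not a real API.

```python
# Sketch: hide cloud-storage latency by prefetching items on an extra thread
# and caching them for re-access. fetch() stands in for a storage GET.
import queue
import threading
import time

cache = {}

def fetch(key):
    time.sleep(0.01)                 # simulated storage latency
    return b"data-" + key.encode()

def prefetcher(keys, out):
    for k in keys:
        out.put((k, fetch(k)))       # runs ahead of the consumer
    out.put(None)                    # sentinel: no more items

keys = [f"img{i}" for i in range(10)]
q = queue.Queue(maxsize=4)           # bounded: don't prefetch unboundedly
threading.Thread(target=prefetcher, args=(keys, q), daemon=True).start()

processed = 0
while (item := q.get()) is not None:
    k, data = item
    cache[k] = data                  # later re-access hits memory, not storage
    processed += 1
print(processed)                     # 10
```

The bounded queue is the key design choice: it overlaps storage latency with computation while capping memory use, which matters once the provider starts throttling aggressive readers.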
37. [Roadmap grid repeated; next thrust: Smart Storage]
38. Eureka + Smart Storage
(work in progress)
Smart storage = execute application logic in on-disk controllers
• Today’s disk controllers are already small computers
Why do this?
• Storage is the first thing that scales with data
• Lower energy consumption
• Fast access to data
• Passing application knowledge to the storage system for optimization
Challenges
• Low compute capacity on device
• Difficulty of programming/debugging
39. Optimizing Image Storage for Eureka
• Object store semantics → no need for partial reads
• Read-only → no writes
• Read order doesn’t matter → reduce disk seeks, exploit caching, etc.
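One way to exploit the “read order doesn’t matter” property is to let the storage layer serve pending requests in on-disk order rather than arrival order; the object names and offsets below are made up for illustration.

```python
# Sketch: since Eureka doesn't care about read order, a smart store can
# sort pending object reads by physical offset to minimize seeks.
pending = {"img7": 9000, "img2": 1200, "img5": 4000}   # object -> disk offset

def schedule(reads):
    """Return object names in on-disk order instead of request order."""
    return sorted(reads, key=reads.get)

order = schedule(pending)
print(order)                         # ['img2', 'img5', 'img7']
```

This is the application knowledge mentioned above being passed down: a generic file system must preserve request semantics, but a Eureka-aware store is free to reorder for locality.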
40. [Roadmap grid repeated; next thrust: Video Data]
41. Activity Recognition in Video
(work in progress)
• Challenge 1: the extra time dimension
• Both algorithmic and computational
42. Eureka + Video Data
• Challenge 2: gap between available training sets and real-world data
UCF101 data set (Soomro et al.) Surveillance Video on Forbes Avenue near CMU
43. Eureka + Video Data
• Challenge 3: complex search conditions
• “Search for men”
• “Search for a man in a red shirt running after a child on the street side”
• Features needed:
• Combining techniques for video (frame sequences) and frames (static images)
• Nested detection
• Correlating timestamps and locations
• Feeding results back into Eureka’s iterative workflow
44. Other Multidimensional Data

• Examples:
• Whole-slide image pyramids in digital pathology
• Map data
• Challenges:
• Query interface
• Efficient computation
• What can be discovered from the data?

Let’s build the telescope so that domain experts can discover craters.
45. Agenda
• The Problem
• Thesis Statement
• Overview of Eureka
• Research Thrusts
• Related Work
• Timeline
46. Related Work: DNN Training and Inference

• DNN structures
• AlexNet, VGG, Inception, MobileNet, Faster R-CNN, …
• Software libraries
• TensorFlow (Google), PyTorch (Facebook), …
• Hardware accelerators
• Movidius (Intel), TPU (Google), …
• On constrained hardware
• Model compression, model quantization, …
• Video analytics
• VideoStorm, NoScope, Focus, FilterForward, BlazeIt, VideoEdge, …
• Premise: a “good” model exists to detect the object of interest
• Ask: “how to run it faster?”
• Batch or stream processing
• My thesis is intrinsically human-in-the-loop and interactive, and has no good model to begin with
47. Related Work: Training Data Augmentation and Synthesis

• Traditional data augmentation
• More intelligent data synthesis based on computer graphics and machine learning
(Source: Dwibedi et al., ICCV’17)

Problem: not truly diverse examples.
48. Related Work: Human-Sourced Labeling

• Crowd-sourcing
• Only useful for common targets
• Human-computer interaction, active learning, etc. (CVPR’13, ECCV’16, etc.)
• Expert/crowd-sourcing
• Medical literature screening (HCOMP’15, etc.)
• Snorkel
• Ask experts to write “labeling functions”
• Infer labels using statistical models (NeurIPS’16, VLDB’18, etc.)
• My thesis proposal: targets visual data and domain experts, with no coding barrier