Exploring VisageCloud's experience and insights in building, training, testing and ramping up to production face recognition workloads that can be easily integrated with big data stores.
Scaling face recognition with big data - Bogdan Bocse
@ITCAMPRO #ITCAMP17 Community Conference for IT Professionals
Scaling Face Recognition with Big Data
Bogdan BOCČE
Solutions Architect & Co-founder VisageCloud
https://VisageCloud.com
https://www.linkedin.com/in/bogdanbocse/
https://twitter.com/bocse
Agenda
• How to learn?
• What to learn?
• Defining learning objectives
• How to scale learning?
• Gotchas
• VisageCloud
  – Architecture
  – Use Cases
Objectives
• What questions to ask before writing the code?
• How to look at the data before feeding it to the machine?
• What is the state of the art regarding ML?
• What frameworks to use?
• What are the common traps to avoid?
• How to design for scale?
How to Learn?
Vision
• Convolutional Neural Networks
• Inception Paper
NLP
• Word2Vec
• GloVe: Global Vectors for Word Representation
Generic
• Classification
• Prediction
Convolutional Neural Networks: Components
• Pooling / Max Pooling
• Convolution
• Fully Connected Activation
  – Activation function, e.g. ReLU
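The components named on this slide can be illustrated with a minimal pure-Python sketch. It is not how a production framework implements them: the function names, the toy 5x5 image and the edge-detecting kernel are all assumptions chosen for illustration.

```python
def conv2d(image, kernel):
    """Valid (no padding), stride-1 cross-correlation, which is what
    deep learning frameworks usually call 'convolution'."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[i + a][j + b] * kernel[a][b]
                for a in range(kh) for b in range(kw))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


def relu(feature_map):
    """Activation function: negative responses are clipped to zero."""
    return [[max(0.0, v) for v in row] for row in feature_map]


def max_pool(feature_map, size=2):
    """Non-overlapping max pooling; rows/columns that do not fill a
    complete window are dropped."""
    out_h = len(feature_map) // size
    out_w = len(feature_map[0]) // size
    return [
        [
            max(feature_map[i * size + a][j * size + b]
                for a in range(size) for b in range(size))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


# Toy image: dark on the left, bright on the right (a vertical edge).
image = [[0, 0, 1, 1, 1] for _ in range(5)]
# Hypothetical kernel that responds to dark-to-bright vertical edges.
kernel = [[-1, 1], [-1, 1]]

features = max_pool(relu(conv2d(image, kernel)))
```

The pooled feature map is non-zero only in the column where the edge sits, which is the kind of local, translation-tolerant feature the later fully connected layers consume.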
Taking a Step Back: The Math
• Learning is an optimization problem
  – Find parameters of a system (neural network) that minimize a fixed error function
  – Not unlike planning orbital paths
• Defining the network architecture
• Defining the training algorithm
  – Stochastic Gradient Descent
    • With momentum
    • With noise
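To make "learning is an optimization problem" concrete, here is a small sketch of stochastic gradient descent with momentum. It assumes the simplest possible "network" (a line y = w*x + b) and a squared-error function; the learning rate, momentum value and toy data are illustrative choices, not values from the talk.

```python
import random


def sgd_momentum(samples, lr=0.05, momentum=0.8, steps=5000, seed=0):
    """Fit y = w*x + b by minimizing 0.5 * (prediction - y)^2,
    one randomly drawn sample per step."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    vel_w, vel_b = 0.0, 0.0  # momentum "velocity" for each parameter
    for _ in range(steps):
        x, y = rng.choice(samples)       # stochastic: one sample at a time
        err = (w * x + b) - y
        grad_w, grad_b = err * x, err    # gradient of the squared error
        vel_w = momentum * vel_w - lr * grad_w
        vel_b = momentum * vel_b - lr * grad_b
        w += vel_w
        b += vel_b
    return w, b


# Noise-free toy data generated from y = 2x + 1.
samples = [(x / 4, 2 * (x / 4) + 1) for x in range(5)]
w, b = sgd_momentum(samples)
```

The velocity terms carry part of the previous update into the next one, which smooths the noisy per-sample gradients. Training a real neural network follows the same loop, only with many more parameters and gradients computed by backpropagation.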
Frameworks
• DeepLearning4j
  – Independent company
  – Java interface with C-bindings for performance
• TensorFlow
  – Python & C++ API
  – Developed by Google
  – Compatible with TPU
• Torch
  – Developed by Facebook
  – Written in LuaJIT, with Python bindings
Data Sets
• Public data sets
  – Labeled Faces in the Wild (LFW)
  – YouTube Faces
  – Kaggle
• Private data sets
• Build your own
  – Outsourcing: Mechanical Turk
  – Crowdsourcing: reCAPTCHA model
Preparing Data
• Machine learning is not magic
• If you can't understand the data, a machine probably won't either
• Preprocessing makes the difference between results
• Applying filters, normalization, anomaly detection is computationally inexpensive
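As an illustration of how cheap such preprocessing can be, the sketch below applies min-max normalization and a z-score anomaly check to a list of numbers. The "brightness" readings and the threshold of 2.0 are made-up values for the example, not figures from the talk.

```python
from statistics import mean, pstdev


def min_max_normalize(values):
    """Rescale values linearly into the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def zscore_outliers(values, threshold=2.0):
    """Return indices of values lying more than `threshold`
    population standard deviations away from the mean."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return []
    return [i for i, v in enumerate(values)
            if abs(v - mu) / sigma > threshold]


# Hypothetical average-brightness readings for eight images;
# the last one is suspiciously overexposed.
brightness = [102, 98, 101, 99, 100, 97, 103, 250]

suspicious = zscore_outliers(brightness)
clean = [v for i, v in enumerate(brightness) if i not in suspicious]
normalized = min_max_normalize(clean)
```

Both passes are linear in the number of samples, so they cost far less than a single training epoch and can be run before any data reaches the network.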
Defining learning objectives
• Supervised
  – Classification
  – Scoring and regression
  – Identification
• Unsupervised
  – Clustering
Classification
• Projecting input onto a fixed set of classes
• "Don't use a cannon to kill a fly"
  – Support Vector Machines
    • Linear
    • Radial Basis Function (RBF)
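A linear Support Vector Machine is small enough to sketch from scratch. The version below minimizes the regularized hinge loss with plain subgradient descent on a toy two-class data set; the hyperparameters and data points are assumptions for illustration, and a real project would more likely rely on an established SVM library. The RBF variant is not covered here.

```python
def train_linear_svm(points, labels, lam=0.01, lr=0.1, epochs=200):
    """Full-batch subgradient descent on
    lam/2 * |w|^2 + mean(max(0, 1 - y * (w.x + b)))."""
    dim = len(points[0])
    n = len(points)
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w = [lam * wi for wi in w]   # gradient of the regularizer
        grad_b = 0.0
        for x, y in zip(points, labels):
            margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
            if margin < 1:                # point violates the margin
                for k in range(dim):
                    grad_w[k] -= y * x[k] / n
                grad_b -= y / n
        w = [wi - lr * gi for wi, gi in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b


def predict(w, b, x):
    """Class +1 or -1 depending on the side of the hyperplane."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b >= 0 else -1


# Toy, linearly separable data: labels are +1 and -1.
points = [(2, 2), (3, 3), (2, 3), (-2, -2), (-3, -3), (-2, -3)]
labels = [1, 1, 1, -1, -1, -1]

w, b = train_linear_svm(points, labels)
```

For a problem with a fixed, small set of classes, a model of this size trains in milliseconds, which is the point of the "cannon" remark on the slide.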
Identification
• Embedding
  – Projecting input (image) onto a vector space with a known property
• Triplet Loss Function
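The triplet loss can be written in a few lines. The sketch below uses the commonly published form with squared Euclidean distances and a margin; the margin of 0.2 and the two-dimensional toy embeddings are illustrative assumptions, and the talk does not state which exact variant VisageCloud uses.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two embeddings."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def triplet_loss(anchor, positive, negative, margin=0.2):
    """Zero when the anchor is closer to the positive (same person)
    than to the negative (different person) by at least `margin`."""
    d_pos = squared_distance(anchor, positive)
    d_neg = squared_distance(anchor, negative)
    return max(0.0, d_pos - d_neg + margin)


anchor = (0.0, 0.0)
same_person = (0.1, 0.0)
far_stranger = (1.0, 0.0)    # well separated: no loss
near_stranger = (0.2, 0.0)   # too close to the anchor: positive loss

easy = triplet_loss(anchor, same_person, far_stranger)
hard = triplet_loss(anchor, same_person, near_stranger)
```

Minimizing this loss over many triplets gives the embedding space its "known property": faces of the same person end up closer together than faces of different people.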
Clustering
• Splitting a set of items into non-overlapping subsets, based on item attributes
• Counting people in video streams
• Algorithms:
  – Fixed threshold
  – K-means
  – Rank-order clustering
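The fixed-threshold approach is the simplest of the three listed algorithms. One possible greedy formulation is sketched below: each embedding joins the first cluster whose representative lies within the threshold, otherwise it starts a new cluster. The threshold and the toy embeddings are assumptions for the example.

```python
import math


def threshold_cluster(embeddings, threshold):
    """Greedy fixed-threshold clustering.

    Each cluster is a list of indices into `embeddings`; the first
    index in a cluster acts as its representative."""
    clusters = []
    for idx, emb in enumerate(embeddings):
        for cluster in clusters:
            representative = embeddings[cluster[0]]
            if math.dist(emb, representative) <= threshold:
                cluster.append(idx)
                break
        else:
            # No existing cluster is close enough: start a new one.
            clusters.append([idx])
    return clusters


# Hypothetical 2-D face embeddings extracted from video frames.
frames = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0), (0.0, 0.1)]

clusters = threshold_cluster(frames, threshold=0.5)
people_count = len(clusters)
```

When the embeddings come from faces detected in a video stream, the number of clusters gives an estimate of how many distinct people appeared. The result depends on the order of the input and on the chosen threshold, which is one reason K-means or rank-order clustering may be preferred.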
How to scale learning?
• Scaling training
  – Requires shared memory space
  – Vertical scaling
    • GPU
    • Soon-to-come: TPU (tensor processing unit)
• Scaling evaluation
  – Shared-nothing architecture
  – Neural network/classifier rarely changes
  – Load balancing pattern
  – Partitioning data if needed
Ideas for horizontal scaling
• There is no "reduce" for neural networks
• Averaging weights/parameters
  – Usually not a good idea
• Genetic algorithms
  – Requires a lot of processing power
  – Running independent iterations on different machines
  – Crossover between weights/parameters of independently trained neural networks after each epoch
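The crossover idea can be sketched as follows, treating each independently trained network as a flat list of weights. Uniform crossover, keeping the two fittest networks, and the toy fitness function are all assumptions made for this example; the talk does not specify a particular scheme.

```python
import random


def uniform_crossover(parent_a, parent_b, rng):
    """Child takes each weight from one of the two parents at random."""
    return [a if rng.random() < 0.5 else b
            for a, b in zip(parent_a, parent_b)]


def next_generation(population, fitness, rng, keep=2):
    """Keep the `keep` fittest weight vectors (keep >= 2) and refill
    the population with offspring of randomly paired survivors."""
    ranked = sorted(population, key=fitness, reverse=True)
    parents = ranked[:keep]
    children = []
    for _ in range(len(population) - keep):
        mother, father = rng.sample(parents, 2)
        children.append(uniform_crossover(mother, father, rng))
    return parents + children


# Hypothetical weight vectors, one per machine, after one epoch.
population = [
    [0.1, 0.2, 0.3],
    [1.0, 1.0, 1.0],
    [0.0, 0.1, 0.0],
    [2.0, -2.0, 2.0],
]


def toy_fitness(weights):
    """Stand-in for validation accuracy: higher is better."""
    return -sum(w * w for w in weights)


rng = random.Random(1)
new_population = next_generation(population, toy_fitness, rng)
```

Each machine would then continue training from one member of the new population. Evaluating the fitness of every candidate requires a full validation pass, which is where the large processing cost mentioned on the slide comes from.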
The Curse of Dimensionality
• Our 2D and 3D intuition often fails in high dimensions
• Distances tend to become relatively "the same" as number of dimensions increases
• Dimensionality reduction
  – Embedding functions
  – Principal component analysis
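The claim that distances become relatively "the same" can be checked with a small experiment. The sketch below draws uniformly random points and measures the relative contrast between the farthest and the nearest neighbor of a query point; the point count, the seed and the chosen dimensions are arbitrary assumptions.

```python
import math
import random


def relative_contrast(dim, n_points=200, seed=0):
    """(farthest - nearest) / nearest distance from a random query
    point to `n_points` random points in the unit hypercube."""
    rng = random.Random(seed)
    query = [rng.random() for _ in range(dim)]
    distances = [
        math.dist(query, [rng.random() for _ in range(dim)])
        for _ in range(n_points)
    ]
    return (max(distances) - min(distances)) / min(distances)


low_dim = relative_contrast(2)
high_dim = relative_contrast(1000)
```

In two dimensions the farthest point is many times farther away than the nearest one, while in a thousand dimensions the two distances differ by only a small fraction. This is why nearest-neighbor comparisons work better on compact embeddings than on raw, high-dimensional inputs.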
Local Optima
• "The bottom of a valley is not necessarily the lowest point on Earth"
• Learning algorithms may get stuck in local optima
• Using momentum or some random noise reduces this possibility
• Using genetic algorithms can be even more robust, but it's computationally expensive
Evaluation of Learning
"Based on state-of-the-art machine learning, our weather forecast system can predict tomorrow's weather with 72% accuracy"
You get the same results by saying "it's going to be the same as today"
Evaluation of Learning
• Don't test on the data you train on
  – Use a different data set
  – Split the data sets you have
• Beware of data biases
  – Confirmation bias
  – Survivorship bias
  – Selection bias
• Compare against a benchmark, even a dummy one
  – Coin flip
  – Linear algorithms
  – "Same-as-before"
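Two of these points, holding out test data and comparing against a dummy benchmark, fit in a short sketch. The weather sequence is invented for the example and the helper names are hypothetical.

```python
import random


def train_test_split(items, test_ratio=0.2, seed=0):
    """Shuffle a copy of the items and hold out the last
    `test_ratio` share for testing."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_ratio))
    return shuffled[:cut], shuffled[cut:]


def accuracy(predictions, actual):
    """Share of predictions that match the observed values."""
    hits = sum(p == a for p, a in zip(predictions, actual))
    return hits / len(actual)


def same_as_before(observations):
    """Dummy benchmark: predict that each day repeats the previous one."""
    return observations[:-1]


# Made-up daily weather observations.
weather = ["sun", "sun", "sun", "rain", "rain", "sun",
           "sun", "sun", "rain", "sun", "sun"]

baseline_accuracy = accuracy(same_as_before(weather), weather[1:])
train, test = train_test_split(range(10))
```

Any trained model is worth its cost only if its accuracy on the held-out set clearly exceeds `baseline_accuracy`. For time series such as weather, a chronological split is usually more appropriate than the shuffled split shown here.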
High Level Architecture: VisageCloud Production
[Diagram. Components: API Consumer (Customer Infrastructure); HAProxy (reverse proxy); Service (API Controller); Neural Networks (OpenCV, Dlib, Torch, pixie magic); Cassandra; Image Storage (AWS S3); Containers (Docker). Connection labels: HTTPS, HTTP, CQL Binary.]
Processing Pipeline
Detect faces → Align faces → Pre-processing → Feature extraction → Feature comparison
Partitioning Data: Application Level Logic
• The collection
  – Slice of data used together
  – 10K-100K records
• The Cache-Inside Pattern
  – Loading / preloading collection in one application server
  – Content-based routing/balancing to maximize cache hits
  – No logic in the database layer
  – Requires periodic polling for updates
• Weaker consistency
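A rough sketch of the Cache-Inside Pattern is shown below, assuming collections are identified by a string ID. The `route` function, the `CollectionCache` class and the server names are hypothetical and do not describe VisageCloud's actual implementation.

```python
import hashlib


def route(collection_id, servers):
    """Content-based routing: the same collection always maps to the
    same application server, so its cached copy keeps being reused."""
    digest = hashlib.sha256(collection_id.encode("utf-8")).digest()
    return servers[int.from_bytes(digest[:8], "big") % len(servers)]


class CollectionCache:
    """Keeps whole collections in application memory.

    `loader` is any callable that fetches a collection from the
    database, e.g. a function wrapping a Cassandra query."""

    def __init__(self, loader):
        self._loader = loader
        self._collections = {}

    def get(self, collection_id):
        """Return the cached collection, loading it on first access."""
        if collection_id not in self._collections:
            self._collections[collection_id] = self._loader(collection_id)
        return self._collections[collection_id]

    def refresh(self, collection_id):
        """Called by a periodic poller to pick up database updates."""
        self._collections[collection_id] = self._loader(collection_id)


servers = ["app-1", "app-2", "app-3"]   # hypothetical server names
target = route("customer-42/employees", servers)
```

Between two refreshes a server may serve slightly stale data, which is the weaker consistency noted on the slide. A simple modulo mapping also reshuffles most collections when the server list changes, so a consistent-hashing scheme may be a better fit if servers are added or removed often.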
Partitioning Data: Database Level Logic
• Perform comparison logic in database
  – User Defined Aggregate Functions
• Removes the need to move data around between application and database
• Harder to deploy/test
• Stronger consistency
Key Take-away
• It's math, not magic
• If you don't understand the data, neither will the machine
• Preprocessing makes the difference
• Test against a benchmark, any benchmark
• Evaluate first, scale later
Let's keep in touch
Bogdan@VisageCloud.com
+(40) 724 714 234
https://www.linkedin.com/in/bogdanbocse/
https://twitter.com/bocse