Hadoop World 2010 - BAH - Fuzzy Table

Fuzzy Table: Distributed Fuzzy Matching Database
Lalit Kapoor, kapoor_lalit@bah.com

Session Agenda
- Fuzzy Matching?
- The Big Data Problem
- A Scalable Solution
- Performance
- Questions?

What is Fuzzy Matching?
- Start with some multimedia (image, voice, audio, video, etc.)
- Feature Extraction & Normalization: create a vector or matrix of doubles for each item (image 1, image 2)
- Distance Function*: 31.46 between the two feature vectors
- These images are very similar, but obviously not the same. To find image #2 given image #1, some sort of fuzzy matching technique needs to be used.
*Euclidean Distance in this example
Images from Flickr; licensed under Creative Commons:
http://www.flickr.com/photos/mdpettitt/455527136/sizes/l/in/photostream/
http://www.flickr.com/photos/mdpettitt/455539917/sizes/l/in/photostream/
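The distance function named on this slide can be sketched in a few lines. This is a minimal illustration, assuming each item has already been reduced to a flat feature vector of doubles of the same length; the class and method names are hypothetical and not taken from Fuzzy Table.

```java
/** Minimal sketch: Euclidean distance between two equal-length feature vectors. */
class EuclideanDistance {
    static double distance(double[] a, double[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("feature vectors must have the same length");
        }
        double sumOfSquares = 0.0;
        for (int i = 0; i < a.length; i++) {
            double diff = a[i] - b[i];
            sumOfSquares += diff * diff; // accumulate squared per-feature differences
        }
        return Math.sqrt(sumOfSquares);
    }
}
```

A smaller distance means the two items are more alike. For example, `EuclideanDistance.distance(new double[]{0, 0}, new double[]{3, 4})` gives 5.0, and identical vectors give 0.0.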

How Is Fuzzy Matching Being Used Today?

Why Do We Care?
- At the forefront of strategy and technology consulting for nearly a century
- Deep functional knowledge spanning strategy and organization, technology, operations, and analytics
- Clients: US government agencies in the defense, security, and civil sectors, as well as corporations, institutions, and not-for-profit organizations

Biometrics – A Fuzzy Matching Problem
- Lifted from a crime scene
- Law enforcement database
- Same person?

Biometrics – Example
- Query (sample 1) and Biometrics Database (sample 2)
- Feature Extraction & Normalization: create a vector or matrix of doubles for each sample
- Distance Function*: 2.41 between the two feature vectors
*Euclidean Distance in this example

Growth of Multimedia Databases
- Flickr – over 5 billion images
- ImageShack – over 20 billion unique images
- Hulu – over 380 million videos
Sources:
http://techcrunch.com/2009/04/07/who-has-the-most-photos-of-them-all-hint-it-is-not-facebook/
http://ksudigg.wetpaint.com/page/YouTube+Statistics
http://techcrunch.com/2009/04/28/as-youtube-passes-a-billion-unique-us-viewers-hulu-rushes-into-third-place/

Growth of Biometric Databases
- Combined U.S. government biometric databases are expected to grow to hold billions of identities
- The DHS’s US-VISIT program has the world’s largest and fastest biometric database (IDENT) with over 110 million identities and roughly 145,000 identities enrolled or verified daily*
- From the FBI’s Integrated Automated Fingerprint Identification System (IAFIS) alone, there are 66 million identities with 8,000 more subjects added each day**
- European Union’s Biometric Matching System (EU-BMS) is expected to hold 70 million people’s biometric data to support visa applications, border control, and immigration****
- AllTrust Networks’ Paycheck Secure system has enrolled over 6 million users and has performed over 70 million transactions*****
* US-VISIT: The world’s largest biometric application. William Graves.
** http://www.fbi.gov/hq/cjisd/iafis/iafis_facts.htm
*** http://www.business-standard.com/india/news/national-population-register-to-start-biometrics-data-collectiondec/399135/
**** http://www.findbiometrics.com/articles/i/5220/
***** http://www.alltrustnetworks.com/News/6Million/tabid/378/Default.aspx

Biometric Databases are a Big (Data) Problem
- Large scale operations
  - Searching and storing 100’s of millions to billions of identities
- Multiple biometric templates and raw files per identity for multimodal matching
  - Fingerprints, faces, and iris
- New raw files and templates typically stored after each Verification and Identification operation
- Results are expected in real time
- Systems require cost efficient storage and retrieval for biometric matches
  - Need innovative ways to reduce costs per match
- Raw images: 500 M identities x (16 KB to 300 KB) x (10 to 20) = 1 – 2 PB
- Biometric templates: 500 M identities x (256 b to 3 KB) x (10 to 20) = 2 – 27 TB
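The sizing estimates on this slide multiply three ranges together. The sketch below reproduces that arithmetic under stated assumptions that are not in the slide: 1 KB is taken as 1024 bytes, "256 b" is read as 256 bytes, and the "10 to 20" factor is read as files stored per identity. The class and method names are hypothetical.

```java
/** Minimal sketch of the slide's storage arithmetic (identities x file size x files per identity). */
class StorageEstimate {
    /** Total bytes stored; throws ArithmeticException on long overflow. */
    static long totalBytes(long identities, long bytesPerFile, long filesPerIdentity) {
        return Math.multiplyExact(Math.multiplyExact(identities, bytesPerFile), filesPerIdentity);
    }

    /** Converts a byte count to binary terabytes (TiB, 2^40 bytes). */
    static double toTiB(long bytes) {
        return bytes / Math.pow(2, 40);
    }
}
```

Under those assumptions the extreme corners of the ranges come to roughly 75 TiB to 2.7 PiB for raw images and roughly 1.2 TiB to 28 TiB for templates, which brackets the slide's rounded totals. For instance, `StorageEstimate.toTiB(StorageEstimate.totalBytes(500_000_000L, 3 * 1024, 20))` is about 27.9.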

Hadoop and Multimedia Databases
- HDFS as file storage for petabytes worth of multimedia (images, audio, video, etc.)
  - Redundancy
  - Distribution
- Mahout/MapReduce used for indexing and clustering similar objects
  - Improve overall search speeds
- Improving feature selection by analyzing the entire database with MapReduce
  - Select most effective features in distinguishing identities
- N-to-N matching search (special type of Identification search) to cleanse database
  - Find people trying to circumvent the system (identity fraud, etc.)

Fuzzy Table: Large-scale, Low Latency, Fuzzy Matching Database
- Originally designed for biometric applications, but has uses in other domains
- Enables fast parallel searches against keys that cannot be effectively ordered and that require fuzzy matching, such as biometric identification, large scale image search, large scale audio search, etc.
- It inherits some of the nice features of Hadoop:
  - Horizontal scalability over commodity hardware
  - Distributed and parallel computation
  - High reliability and redundancy

Fuzzy Table: Bulk Data Processing Component
- Mahout’s Canopy Clustering and K-means Clustering partition data into clusters (bins)
  - Reduces search space so only a small subset of the data must be processed
  - This concept is based on work done in academia*
- Centroids from K-means clustering are used to create a “Bin classifier”
  - Determines the best bins to search for a given key
- {Key, Value} records are stored as Sequence Files in HDFS
  - Spread across the cluster for optimal parallel searching
- MapReduce is used for all other bulk or batch data processing
  - Batch fuzzy match searching
  - Re-encoding the raw files into feature vectors
  - Performing large-scale feature evaluation to improve clustering
* Efficient Search and Retrieval in Biometric Databases. Amit Mhatre, Srinivas Palla, Sharat Chikkerur and Venu Govindaraju
* Efficient fingerprint search based on database clustering. Manhua Liu, Xudong Jiang, Alex Chichung Kot

Bulk Clustering and Real-time Classification
- This makes searching for keys faster because only a small subset of the entire dataset needs to be processed using fuzzy matching
- The classifier determines which bins need to be searched in order to find the most likely matching keys
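One plausible way to build a bin classifier from k-means centroids is to rank bins by how close their centroid is to the query key and search only the nearest few. The sketch below assumes exactly that; the deck does not say how Fuzzy Table's classifier works internally, and `BinClassifier`, `bestBins`, and the `binsToSearch` parameter are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Minimal sketch: pick the bins whose k-means centroids are nearest to a query key. */
class BinClassifier {
    private final double[][] centroids; // centroids[i] is the centroid of bin i

    BinClassifier(double[][] centroids) {
        this.centroids = centroids;
    }

    /** Returns up to binsToSearch bin indices, nearest centroid first. */
    List<Integer> bestBins(double[] key, int binsToSearch) {
        List<Integer> bins = new ArrayList<>();
        for (int i = 0; i < centroids.length; i++) {
            bins.add(i);
        }
        // Squared distance preserves the ordering of Euclidean distance and avoids the sqrt.
        bins.sort(Comparator.comparingDouble((Integer i) -> squaredDistance(key, centroids[i])));
        return new ArrayList<>(bins.subList(0, Math.min(binsToSearch, bins.size())));
    }

    private static double squaredDistance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double diff = a[i] - b[i];
            sum += diff * diff;
        }
        return sum;
    }
}
```

Searching more than one bin trades extra work for a lower chance of missing a match whose key falls near a cluster boundary.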

Fuzzy Table: Data Storage and Bins
- Bins are represented as directories in HDFS with one or more chunk files (Sequence Files):
  /fuzzytable/_table_fingerprints/_bin_000001/_chunk_000001
- Chunk files contain many {Key, Value} pairs
  - Sized at a small multiple of the HDFS block size
- Chunk files are distributed uniformly and randomly across the Data Servers in the cluster
  - Ensures that the bins are striped across the cluster for optimal parallel searching
  - Replicated across the Data Servers using HDFS’s replication mechanism
- Data Servers only search through local chunk files
  - Results returned in real-time as soon as a match is found
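The directory layout in the example path can be expressed as a small helper. This is only a sketch inferred from the single path shown on the slide: the six-digit zero padding and the `_table_`, `_bin_`, and `_chunk_` prefixes are read off that example, and `ChunkPaths.chunkPath` is a hypothetical name.

```java
import java.util.Locale;

/** Minimal sketch: build an HDFS path for a chunk file following the slide's example layout. */
class ChunkPaths {
    static String chunkPath(String table, int bin, int chunk) {
        // Locale.ROOT keeps the zero-padded digits ASCII regardless of the default locale.
        return String.format(Locale.ROOT,
                "/fuzzytable/_table_%s/_bin_%06d/_chunk_%06d", table, bin, chunk);
    }
}
```

With this helper, `ChunkPaths.chunkPath("fingerprints", 1, 1)` reproduces the path shown on the slide.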

Fuzzy Table: Low Latency Fuzzy Matching Component
- The low latency component consists of three main parts
  - Client: submits queries for Keys and gets back {Key, Value} pairs
  - Master Server: serves metadata about which Data Servers host which bins
  - Data Servers: actually perform fuzzy matching searches
- Data Servers perform fuzzy matching against Keys in order to find {Key, Value} records
    double score = fuzzyMatcher.match(key, storedRec.getKey());
    if (score >= threshold) return storedRec;
- Fuzzy matching searches are performed in parallel across many Data Servers
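The two-line snippet on this slide can be expanded into a self-contained sketch of a Data Server scanning its local records, plus a client-side fan-out over several servers. Everything beyond the slide's `score >= threshold` test is an assumption: `StoredRec` is a hypothetical stand-in for the real record type, the matcher is passed in as a plain function, in-memory lists stand in for chunk files and network calls, and `similarity` is an illustrative scoring function (higher is better) rather than Fuzzy Table's matcher.

```java
import java.util.List;
import java.util.Optional;
import java.util.function.ToDoubleBiFunction;
import java.util.stream.Collectors;

/** Hypothetical stored record: a feature-vector key plus an opaque value. */
class StoredRec {
    private final double[] key;
    private final String value;

    StoredRec(double[] key, String value) {
        this.key = key;
        this.value = value;
    }

    double[] getKey() { return key; }
    String getValue() { return value; }
}

class FuzzySearchSketch {
    /** Illustrative score in (0, 1]: 1.0 for identical vectors, lower as distance grows. */
    static double similarity(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double diff = a[i] - b[i];
            sum += diff * diff;
        }
        return 1.0 / (1.0 + Math.sqrt(sum));
    }

    /** One Data Server: scan local records and return the first whose score meets the threshold. */
    static Optional<StoredRec> searchLocal(double[] key, List<StoredRec> localRecords,
            ToDoubleBiFunction<double[], double[]> fuzzyMatcher, double threshold) {
        for (StoredRec storedRec : localRecords) {
            double score = fuzzyMatcher.applyAsDouble(key, storedRec.getKey());
            if (score >= threshold) {
                return Optional.of(storedRec); // stop as soon as a match is found
            }
        }
        return Optional.empty();
    }

    /** Client side: query every Data Server in parallel and collect the matches they report. */
    static List<StoredRec> searchAll(double[] key, List<List<StoredRec>> dataServers,
            ToDoubleBiFunction<double[], double[]> fuzzyMatcher, double threshold) {
        return dataServers.parallelStream()
                .map(local -> searchLocal(key, local, fuzzyMatcher, threshold))
                .filter(Optional::isPresent)
                .map(Optional::get)
                .collect(Collectors.toList());
    }
}
```

A call such as `FuzzySearchSketch.searchAll(key, servers, FuzzySearchSketch::similarity, 0.9)` returns at most one record per server. In a real deployment the fan-out would go only to the Data Servers hosting the bins chosen for the key.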

Fuzzy Table: Optimizations
- Master Server HDFS metadata caching
  - The HDFS Namenode is a performance bottleneck for low latency searches
  - Master Server caches HDFS block locations for all Fuzzytable files (bins and chunk files)
  - Periodic refresh of the cache so its metadata is always fresh
- Increased HDFS replication factor (replication factor of N)
  - Fuzzytable is close to a read only system
  - More data replication means increased parallelism and faster query return
  - Data Servers only perform searches against data that resides locally on disk
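The metadata caching idea can be sketched as a snapshot map that lookups read without touching the Namenode, replaced wholesale on a timer. This is an illustration under assumptions, not the Master Server's code: the `loader` supplier stands in for whatever call fetches block locations from HDFS, and `BlockLocationCache`, `hostsFor`, and `startPeriodicRefresh` are hypothetical names.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

/** Minimal sketch: cache chunk-file -> host list metadata and refresh it periodically. */
class BlockLocationCache {
    private final Supplier<Map<String, List<String>>> loader; // stand-in for the HDFS metadata query
    private volatile Map<String, List<String>> snapshot = Collections.emptyMap();

    BlockLocationCache(Supplier<Map<String, List<String>>> loader) {
        this.loader = loader;
        refresh(); // populate once so the first lookup is already served from memory
    }

    /** Reloads all metadata and swaps in the new snapshot atomically. */
    void refresh() {
        snapshot = Collections.unmodifiableMap(new HashMap<>(loader.get()));
    }

    /** Served from memory; returns an empty list for unknown paths. */
    List<String> hostsFor(String chunkPath) {
        return snapshot.getOrDefault(chunkPath, Collections.emptyList());
    }

    /** Schedules refresh() on a daemon thread; the caller may shut the returned executor down. */
    ScheduledExecutorService startPeriodicRefresh(long periodSeconds) {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor(r -> {
            Thread t = new Thread(r, "block-location-refresh");
            t.setDaemon(true);
            return t;
        });
        scheduler.scheduleAtFixedRate(this::refresh, periodSeconds, periodSeconds, TimeUnit.SECONDS);
        return scheduler;
    }
}
```

Between refreshes the cache can lag behind HDFS by up to one period, which is tolerable for a system the slide describes as close to read only.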

Performance and Scalability Testing
- Employed EC2 for all testing
- Downloaded ~1 TB of images from Flickr (100 nodes)
- Performed the Bulk Processing Component’s tasks across all 1 TB of images (80 nodes)
  - Duplicate detection and removal
  - Feature extraction and normalization
  - Mahout’s canopy clustering
  - Mahout’s k-means clustering
  - Join clusters with features
  - Post processing data into bins and chunk files
- Ramping up the number of queries per second on fixed cluster size
- Querying increasing cluster sizes

Shortest Query Times
[Figure: Time To Respond (ms) vs. # Of Data Servers]

Longest Query Times
[Figure: Time To Respond (ms) vs. # Of Data Servers]

Caching Performance
[Figure: Average Response Time (ns) vs. # Threads Polling The Master Server]

Conclusion
- Large-scale, real-time Multimedia/Biometric Database search is a hard problem
  - And it’s becoming computationally more expensive as the amount of data grows
- Hadoop is a potential solution to this problem
  - MapReduce can be used for bulk processing to enable distributed, low latency fuzzy matching over HDFS
- Hadoop is a great platform for solving all sorts of Big Data and distributed computing problems, even for low latency searching

Contact Information – Cloud Computing Team
Booz Allen Hamilton Inc., 134 National Business Parkway, Annapolis Junction, Maryland 20701
- Michael Ridley, Associate, (301)543-4611, ridley_michael@bah.com
- Lalit Kapoor, Senior Consultant, (301)821-8000, kapoor_lalit@bah.com
- Edmund Kohlwey, Senior Consultant, (301)821-8000, kohlwey_edmund@bah.com
- Robert Gordon, Associate, (301)821-8000, gordon_robert@bah.com
- Jason Trost, Associate, (301)543-4400, trost_jason@bah.com
- Jesse Yates, Consultant, (301)617-3523, yates_jesse@bah.com
Twitter: @jason_trost @idefine @jesse_yates @ekohlwey @mikeridley

Thanks
- Brandyn White (@brandynwhite): assistance with Flickr image retrieval
