Fuzzy Table: Distributed Fuzzy Matching Database
Hadoop World 2010
Lalit Kapoor, kapoor_lalit@bah.com
Session Agenda
- Fuzzy Matching?
- The Big Data Problem
- A Scalable Solution
- Performance
- Questions?
Fuzzy Matching?
What is Fuzzy Matching?
- Start with some multimedia (image, voice, audio, video, etc.)
- Feature extraction and normalization: create a vector or matrix of doubles for each item
- Compare the feature vectors with a distance function* (31.46 for the two example images)
- These images are very similar, but obviously not the same. To find image #2 given image #1, some sort of fuzzy matching technique needs to be used
* Euclidean distance in this example
Images from Flickr; licensed under Creative Commons
http://www.flickr.com/photos/mdpettitt/455527136/sizes/l/in/photostream/
http://www.flickr.com/photos/mdpettitt/455539917/sizes/l/in/photostream/
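The distance step on this slide can be sketched in a few lines of Java. This is a minimal illustration, assuming each item has already been reduced to a fixed-length array of doubles; the class name `EuclideanDistance` and the sample vectors are hypothetical and do not come from the talk.

```java
// Hypothetical helper, not part of Fuzzy Table: Euclidean distance between two feature vectors.
final class EuclideanDistance {

    /** Returns the Euclidean (L2) distance between two equal-length feature vectors. */
    static double distance(double[] a, double[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("feature vectors must have the same length");
        }
        double sumOfSquares = 0.0;
        for (int i = 0; i < a.length; i++) {
            double diff = a[i] - b[i];
            sumOfSquares += diff * diff;
        }
        return Math.sqrt(sumOfSquares);
    }

    public static void main(String[] args) {
        // Made-up three-dimensional feature vectors for two similar items.
        double[] image1 = {0.0, 3.0, 1.0};
        double[] image2 = {4.0, 0.0, 1.0};
        System.out.println(distance(image1, image2)); // prints 5.0
    }
}
```

A smaller distance means the two items are more alike; a fuzzy match is declared when the distance falls under some application-specific cutoff.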
How Is Fuzzy Matching Being Used Today?
Why Do We Care?
- At the forefront of strategy and technology consulting for nearly a century
- Deep functional knowledge spanning strategy and organization, technology, operations, and analytics
- Serves US government agencies in the defense, security, and civil sectors, as well as corporations, institutions, and not-for-profit organizations
Biometrics – A Fuzzy Matching Problem
[Figure: a sample lifted from a crime scene compared with a law enforcement database record. Same person?]
Biometrics – Example
- The query and the biometrics database record each go through feature extraction and normalization, creating a vector or matrix of doubles
- Distance function* between the two: 2.41
* Euclidean distance in this example
The Big Data Problem
Growth of Multimedia Databases
- Flickr: over 5 billion images
- ImageShack: over 20 billion unique images
- Hulu: over 380 million videos
Sources:
http://techcrunch.com/2009/04/07/who-has-the-most-photos-of-them-all-hint-it-is-not-facebook/
http://ksudigg.wetpaint.com/page/YouTube+Statistics
http://techcrunch.com/2009/04/28/as-youtube-passes-a-billion-unique-us-viewers-hulu-rushes-into-third-place/
Growth of Biometric Databases
- Combined U.S. government biometric databases are expected to grow to hold billions of identities
- The DHS’s US-VISIT program has the world’s largest and fastest biometric database (IDENT), with over 110 million identities and roughly 145,000 identities enrolled or verified daily*
- The FBI’s Integrated Automated Fingerprint Identification System (IAFIS) alone holds 66 million identities, with 8,000 more subjects added each day**
- The European Union’s Biometric Matching System (EU-BMS) is expected to hold 70 million people’s biometric data to support visa applications, border control, and immigration****
- AllTrust Networks’ Paycheck Secure system has enrolled over 6 million users and has performed over 70 million transactions*****
* US-VISIT: The world’s largest biometric application. William Graves.
** http://www.fbi.gov/hq/cjisd/iafis/iafis_facts.htm
*** http://www.business-standard.com/india/news/national-population-register-to-start-biometrics-data-collectiondec/399135/
**** http://www.findbiometrics.com/articles/i/5220/
***** http://www.alltrustnetworks.com/News/6Million/tabid/378/Default.aspx
Biometric Databases are a Big (Data) Problem
- Large-scale operations: searching and storing hundreds of millions to billions of identities
- Multiple biometric templates and raw files per identity for multimodal matching (fingerprints, faces, and iris)
- New raw files and templates are typically stored after each verification and identification operation
  - Raw images: 500 M identities x (16 KB to 300 KB) x (10 to 20) = 1 – 2 PB
  - Biometric templates: 500 M identities x (256 b to 3 KB) x (10 to 20) = 2 – 27 TB
- Results are expected in real time
- Systems require cost-efficient storage and retrieval for biometric matches
- Need innovative ways to reduce costs per match
A Scalable Solution
Hadoop and Multimedia Databases
- HDFS as file storage for petabytes of multimedia (images, audio, video, etc.)
  - Redundancy
  - Distribution
- Mahout/MapReduce used for indexing and clustering similar objects
  - Improves overall search speeds
- Improving feature selection by analyzing the entire database with MapReduce
  - Select the most effective features for distinguishing identities
- N-to-N matching search (a special type of identification search) to cleanse the database
  - Find people trying to circumvent the system (identity fraud, etc.)
Fuzzy Table: Large-scale, Low Latency, Fuzzy Matching Database
- Originally designed for biometric applications, but has uses in other domains
- Enables fast parallel searches against keys that cannot be effectively ordered and that require fuzzy matching, such as biometric identification, large-scale image search, large-scale audio search, etc.
- It inherits some of the nice features of Hadoop:
  - Horizontal scalability over commodity hardware
  - Distributed and parallel computation
  - High reliability and redundancy
Fuzzy Table Architecture
Fuzzy Table: Bulk Data Processing Component
- Mahout’s Canopy Clustering and K-means Clustering partition data into clusters (bins)
  - Reduces the search space so only a small subset of the data must be processed
  - This concept is based on work done in academia*
- Centroids from K-means clustering are used to create a “Bin classifier”
  - Determines the best bins to search for a given key
- {Key, Value} records are stored as Sequence Files in HDFS
  - Spread across the cluster for optimal parallel searching
- MapReduce is used for all other bulk or batch data processing
  - Batch fuzzy match searching
  - Re-encoding the raw files into feature vectors
  - Performing large-scale feature evaluation to improve clustering
* Efficient Search and Retrieval in Biometric Databases. Amit Mhatre, Srinivas Palla, Sharat Chikkerur, and Venu Govindaraju
* Efficient fingerprint search based on database clustering. Manhua Liu, Xudong Jiang, Alex Chichung Kot
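One plausible reading of the "Bin classifier" is a nearest-centroid lookup: rank the k-means centroids by distance to the query key and search the closest few bins. The sketch below illustrates that idea only. The class name `BinClassifier`, the `maxBins` parameter, and the use of squared Euclidean distance are assumptions, not details confirmed by the talk.

```java
import java.util.Arrays;
import java.util.Comparator;

// Hypothetical nearest-centroid classifier; bin ids are simply indexes into the centroid array.
final class BinClassifier {
    private final double[][] centroids;

    BinClassifier(double[][] centroids) {
        this.centroids = centroids;
    }

    /** Returns the ids of the maxBins bins whose centroids are closest to the query, nearest first. */
    int[] bestBins(double[] query, int maxBins) {
        Integer[] binIds = new Integer[centroids.length];
        for (int i = 0; i < binIds.length; i++) {
            binIds[i] = i;
        }
        // Squared distance preserves the ordering and avoids a square root per centroid.
        Arrays.sort(binIds, Comparator.comparingDouble(id -> squaredDistance(query, centroids[id])));
        int n = Math.min(maxBins, binIds.length);
        int[] result = new int[n];
        for (int i = 0; i < n; i++) {
            result[i] = binIds[i];
        }
        return result;
    }

    private static double squaredDistance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double diff = a[i] - b[i];
            sum += diff * diff;
        }
        return sum;
    }
}
```

Searching more than one bin trades extra work for a lower chance of missing a match whose key sits near a cluster boundary.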
Bulk Clustering and Real-time Classification
- Clustering makes searching for keys faster because only a small subset of the entire dataset needs to be processed using fuzzy matching
- The classifier determines which bins need to be searched in order to find the most likely matching keys
Procedure
Fuzzy Table: Data Storage and Bins
- Bins are represented as directories in HDFS with one or more chunk files (Sequence Files):
  /fuzzytable/_table_fingerprints/_bin_000001/_chunk_000001
- Chunk files contain many {Key, Value} pairs
  - Sized at a small multiple of the HDFS block size
- Chunk files are distributed uniformly and randomly across the Data Servers in the cluster
  - Ensures that the bins are striped across the cluster for optimal parallel searching
  - Replicated across the Data Servers using HDFS’s replication mechanism
- Data Servers only search through local chunk files
- Results are returned in real time as soon as a match is found
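The directory layout can be illustrated with a small path builder. The naming scheme below is inferred from the single example path on the slide; the six-digit zero padding, the class name `ChunkPaths`, and the method name are assumptions made for this sketch.

```java
import java.util.Locale;

// Hypothetical helper that reproduces the example path shown on the slide.
final class ChunkPaths {

    /** Builds the HDFS path of one chunk file for the given table, bin number, and chunk number. */
    static String chunkPath(String table, int bin, int chunk) {
        // Locale.ROOT keeps the digits ASCII regardless of the JVM's default locale.
        return String.format(Locale.ROOT, "/fuzzytable/_table_%s/_bin_%06d/_chunk_%06d", table, bin, chunk);
    }

    public static void main(String[] args) {
        // prints /fuzzytable/_table_fingerprints/_bin_000001/_chunk_000001
        System.out.println(chunkPath("fingerprints", 1, 1));
    }
}
```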
Fuzzy Table: Low Latency Fuzzy Matching Component
- The low latency component consists of three main parts
  - Client: submits queries for Keys and gets back {Key, Value} pairs
  - Master Server: serves metadata about which Data Servers host which bins
  - Data Servers: actually perform fuzzy matching searches
- Data Servers perform fuzzy matching against Keys in order to find {Key, Value} records
    double score = fuzzyMatcher.match(key, storedRec.getKey());
    if (score >= threshold)
        return storedRec;
- Fuzzy matching searches are performed in parallel across many Data Servers
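The scan-and-threshold loop and the parallel fan-out described on this slide can be sketched in plain Java. This is not Fuzzy Table's code: the `FuzzyMatcher` interface, the `Rec` class, and both method names are hypothetical, in-memory lists stand in for chunk files, and a thread pool stands in for the Data Servers.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Illustrative sketch only; names are hypothetical, not Fuzzy Table's API.
final class ParallelFuzzySearch {

    /** Returns a similarity score; higher means a closer match. */
    interface FuzzyMatcher {
        double match(double[] queryKey, double[] storedKey);
    }

    /** A stored {Key, Value} record. */
    static final class Rec {
        final double[] key;
        final String value;

        Rec(double[] key, String value) {
            this.key = key;
            this.value = value;
        }
    }

    /** One Data Server's job: scan a local chunk and return the first record at or above the threshold. */
    static Optional<Rec> searchChunk(List<Rec> chunk, double[] queryKey,
                                     FuzzyMatcher matcher, double threshold) {
        for (Rec storedRec : chunk) {
            double score = matcher.match(queryKey, storedRec.key);
            if (score >= threshold) {
                return Optional.of(storedRec);
            }
        }
        return Optional.empty();
    }

    /** Fans the query out to every chunk in parallel and gathers the hits in chunk order. */
    static List<Rec> searchAll(List<List<Rec>> chunks, double[] queryKey,
                               FuzzyMatcher matcher, double threshold) {
        ExecutorService pool = Executors.newFixedThreadPool(Math.max(1, chunks.size()));
        try {
            List<Future<Optional<Rec>>> pending = new ArrayList<>();
            for (List<Rec> chunk : chunks) {
                pending.add(pool.submit(() -> searchChunk(chunk, queryKey, matcher, threshold)));
            }
            List<Rec> hits = new ArrayList<>();
            for (Future<Optional<Rec>> result : pending) {
                result.get().ifPresent(hits::add);
            }
            return hits;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("search interrupted", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("chunk search failed", e);
        } finally {
            pool.shutdown();
        }
    }
}
```

Because the slide compares `score >= threshold`, the matcher here is treated as a similarity measure. A distance such as the Euclidean one would need to be inverted (for example `1 / (1 + distance)`) or compared with `<=` instead.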
Fuzzy Table: Optimizations
- Master Server HDFS metadata caching
  - The HDFS Namenode is a performance bottleneck for low latency searches
  - The Master Server caches HDFS block locations for all Fuzzy Table files (bins and chunk files)
  - The cache is refreshed periodically so its metadata stays current
- Increased HDFS replication factor (replication factor of N)
  - Fuzzy Table is close to a read-only system
  - More data replication means increased parallelism and faster query return
  - Data Servers only perform searches against data that resides locally on disk
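The metadata cache with periodic refresh might look roughly like the following. It is a sketch under stated assumptions: the `Supplier` stands in for whatever call actually asks the Namenode for block locations, and the class name `BlockLocationCache`, the map shape (chunk path to host names), and the refresh period are all hypothetical.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

// Hypothetical cache of chunk path -> hosts holding a replica; not Fuzzy Table's actual class.
final class BlockLocationCache implements AutoCloseable {
    private final Supplier<Map<String, List<String>>> loader;
    private final ScheduledExecutorService refresher =
        Executors.newSingleThreadScheduledExecutor(task -> {
            Thread t = new Thread(task, "block-location-refresh");
            t.setDaemon(true); // do not keep the JVM alive just for refreshes
            return t;
        });
    // Readers always see a complete snapshot; refresh swaps the reference in one step.
    private volatile Map<String, List<String>> snapshot;

    BlockLocationCache(Supplier<Map<String, List<String>>> loader, long period, TimeUnit unit) {
        this.loader = loader;
        refresh(); // load once up front so lookups never see an empty cache
        refresher.scheduleAtFixedRate(() -> {
            try {
                refresh();
            } catch (RuntimeException e) {
                // Keep serving the previous snapshot if a refresh fails.
            }
        }, period, period, unit);
    }

    /** Reloads all locations from the underlying source (the expensive call). */
    void refresh() {
        snapshot = Collections.unmodifiableMap(new HashMap<>(loader.get()));
    }

    /** Answers from memory without touching the underlying source. */
    List<String> hostsFor(String chunkPath) {
        return snapshot.getOrDefault(chunkPath, Collections.emptyList());
    }

    @Override
    public void close() {
        refresher.shutdownNow();
    }
}
```

Swapping an immutable snapshot keeps lookups lock-free, which suits a workload that is almost entirely reads.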
Fuzzy Table Query [diagram sequence, slides 23-30]
Performance
Performance and Scalability Testing
- Employed EC2 for all testing
- Downloaded ~1 TB of images from Flickr (100 nodes)
- Performed the Bulk Processing Component’s tasks across all 1 TB of images (80 nodes)
  - Duplicate detection and removal
  - Feature extraction and normalization
  - Mahout’s canopy clustering
  - Mahout’s k-means clustering
  - Join clusters with features
  - Post-processing data into bins and chunk files
- Ramping up the number of queries per second on a fixed cluster size
- Querying increasing cluster sizes
Shortest Query Times [chart: Time To Respond (ms) vs. # Of Data Servers]
Longest Query Times [chart: Time To Respond (ms) vs. # Of Data Servers]
Caching Performance [chart: Average Response Time (ns) vs. # Threads Polling The Master Server]
Conclusion
- Large-scale, real-time multimedia/biometric database search is a hard problem
  - And it’s becoming computationally more expensive as the amount of data grows
- Hadoop is a potential solution to this problem
  - MapReduce can be used for bulk processing to enable distributed, low latency fuzzy matching over HDFS
- Hadoop is a great platform for solving all sorts of Big Data and distributed computing problems, even for low latency searching
Contact Information – Cloud Computing Team
Booz Allen Hamilton Inc., 134 National Business Parkway, Annapolis Junction, Maryland 20701
- Michael Ridley, Associate, (301)543-4611, ridley_michael@bah.com, @mikeridley
- Lalit Kapoor, Senior Consultant, (301)821-8000, kapoor_lalit@bah.com
- Edmund Kohlwey, Senior Consultant, (301)821-8000, kohlwey_edmund@bah.com, @ekohlwey
- Robert Gordon, Associate, (301)821-8000, gordon_robert@bah.com
- Jason Trost, Associate, (301)543-4400, trost_jason@bah.com, @jason_trost
- Jesse Yates, Consultant, (301)617-3523, yates_jesse@bah.com, @jesse_yates
Also listed: @idefine
Thanks
Brandyn White (@brandynwhite) – Assistance with Flickr image retrieval
Questions?

Weitere ähnliche Inhalte

Was ist angesagt?

Using R to Visualize Spatial Data: R as GIS - Guy Lansley
Using R to Visualize Spatial Data: R as GIS - Guy LansleyUsing R to Visualize Spatial Data: R as GIS - Guy Lansley
Using R to Visualize Spatial Data: R as GIS - Guy LansleyGuy Lansley
 
Real Time Reporting Platform
Real Time Reporting PlatformReal Time Reporting Platform
Real Time Reporting PlatformKyle Burke
 
Paper@Soict2015: GPSInsights: towards a scalable framework for mining massive...
Paper@Soict2015: GPSInsights: towards a scalable framework for mining massive...Paper@Soict2015: GPSInsights: towards a scalable framework for mining massive...
Paper@Soict2015: GPSInsights: towards a scalable framework for mining massive...Viet-Trung TRAN
 
Microsoft R Server for Data Sciencea
Microsoft R Server for Data ScienceaMicrosoft R Server for Data Sciencea
Microsoft R Server for Data ScienceaData Science Thailand
 
"Quantum Clustering - Physics Inspired Clustering Algorithm", Sigalit Bechler...
"Quantum Clustering - Physics Inspired Clustering Algorithm", Sigalit Bechler..."Quantum Clustering - Physics Inspired Clustering Algorithm", Sigalit Bechler...
"Quantum Clustering - Physics Inspired Clustering Algorithm", Sigalit Bechler...Dataconomy Media
 
Introduction to Microsoft R Services
Introduction to Microsoft R ServicesIntroduction to Microsoft R Services
Introduction to Microsoft R ServicesGregg Barrett
 
DataTalks #4: Построение хранилища данных на основе платформы hadoop / Игорь ...
DataTalks #4: Построение хранилища данных на основе платформы hadoop / Игорь ...DataTalks #4: Построение хранилища данных на основе платформы hadoop / Игорь ...
DataTalks #4: Построение хранилища данных на основе платформы hadoop / Игорь ...WG_ Events
 
Hadoop Presentation - PPT
Hadoop Presentation - PPTHadoop Presentation - PPT
Hadoop Presentation - PPTAnand Pandey
 
Hadoop Big data Solution Provider
Hadoop Big data Solution ProviderHadoop Big data Solution Provider
Hadoop Big data Solution ProviderAgileiss
 
Steve Woolege Of Aster Data Gives Lightning Talk At BigDataCamp
Steve Woolege Of Aster Data Gives Lightning Talk At BigDataCampSteve Woolege Of Aster Data Gives Lightning Talk At BigDataCamp
Steve Woolege Of Aster Data Gives Lightning Talk At BigDataCampBigDataCamp
 
Introduction_OF_Hadoop_and_BigData
Introduction_OF_Hadoop_and_BigDataIntroduction_OF_Hadoop_and_BigData
Introduction_OF_Hadoop_and_BigDataNilay Mishra
 
R server and spark
R server and sparkR server and spark
R server and sparkBAINIDA
 
Introduction to Big Data & Hadoop Architecture - Module 1
Introduction to Big Data & Hadoop Architecture - Module 1Introduction to Big Data & Hadoop Architecture - Module 1
Introduction to Big Data & Hadoop Architecture - Module 1Rohit Agrawal
 

Was ist angesagt? (20)

Using R to Visualize Spatial Data: R as GIS - Guy Lansley
Using R to Visualize Spatial Data: R as GIS - Guy LansleyUsing R to Visualize Spatial Data: R as GIS - Guy Lansley
Using R to Visualize Spatial Data: R as GIS - Guy Lansley
 
Spark graphx
Spark graphxSpark graphx
Spark graphx
 
Real Time Reporting Platform
Real Time Reporting PlatformReal Time Reporting Platform
Real Time Reporting Platform
 
R and-hadoop
R and-hadoopR and-hadoop
R and-hadoop
 
Paper@Soict2015: GPSInsights: towards a scalable framework for mining massive...
Paper@Soict2015: GPSInsights: towards a scalable framework for mining massive...Paper@Soict2015: GPSInsights: towards a scalable framework for mining massive...
Paper@Soict2015: GPSInsights: towards a scalable framework for mining massive...
 
Microsoft R Server for Data Sciencea
Microsoft R Server for Data ScienceaMicrosoft R Server for Data Sciencea
Microsoft R Server for Data Sciencea
 
Working with Scientific Data in MATLAB
Working with Scientific Data in MATLABWorking with Scientific Data in MATLAB
Working with Scientific Data in MATLAB
 
Multidimensional Scientific Data in ArcGIS
Multidimensional Scientific Data in ArcGISMultidimensional Scientific Data in ArcGIS
Multidimensional Scientific Data in ArcGIS
 
"Quantum Clustering - Physics Inspired Clustering Algorithm", Sigalit Bechler...
"Quantum Clustering - Physics Inspired Clustering Algorithm", Sigalit Bechler..."Quantum Clustering - Physics Inspired Clustering Algorithm", Sigalit Bechler...
"Quantum Clustering - Physics Inspired Clustering Algorithm", Sigalit Bechler...
 
Introduction to Microsoft R Services
Introduction to Microsoft R ServicesIntroduction to Microsoft R Services
Introduction to Microsoft R Services
 
DataTalks #4: Построение хранилища данных на основе платформы hadoop / Игорь ...
DataTalks #4: Построение хранилища данных на основе платформы hadoop / Игорь ...DataTalks #4: Построение хранилища данных на основе платформы hadoop / Игорь ...
DataTalks #4: Построение хранилища данных на основе платформы hadoop / Игорь ...
 
Hadoop Presentation - PPT
Hadoop Presentation - PPTHadoop Presentation - PPT
Hadoop Presentation - PPT
 
Hadoop Big data Solution Provider
Hadoop Big data Solution ProviderHadoop Big data Solution Provider
Hadoop Big data Solution Provider
 
Steve Woolege Of Aster Data Gives Lightning Talk At BigDataCamp
Steve Woolege Of Aster Data Gives Lightning Talk At BigDataCampSteve Woolege Of Aster Data Gives Lightning Talk At BigDataCamp
Steve Woolege Of Aster Data Gives Lightning Talk At BigDataCamp
 
Hadoop An Introduction
Hadoop An IntroductionHadoop An Introduction
Hadoop An Introduction
 
Introduction_OF_Hadoop_and_BigData
Introduction_OF_Hadoop_and_BigDataIntroduction_OF_Hadoop_and_BigData
Introduction_OF_Hadoop_and_BigData
 
R server and spark
R server and sparkR server and spark
R server and spark
 
CSB_community
CSB_communityCSB_community
CSB_community
 
Introduction to Big Data & Hadoop Architecture - Module 1
Introduction to Big Data & Hadoop Architecture - Module 1Introduction to Big Data & Hadoop Architecture - Module 1
Introduction to Big Data & Hadoop Architecture - Module 1
 
Big data and tools
Big data and tools Big data and tools
Big data and tools
 

Andere mochten auch

Real Time Fuzzy Matching with Spark and Elastic Search-(Sonal Goyal, Nube)
Real Time Fuzzy Matching with Spark and Elastic Search-(Sonal Goyal, Nube)Real Time Fuzzy Matching with Spark and Elastic Search-(Sonal Goyal, Nube)
Real Time Fuzzy Matching with Spark and Elastic Search-(Sonal Goyal, Nube)Spark Summit
 
New Directions for Mahout
New Directions for MahoutNew Directions for Mahout
New Directions for MahoutTed Dunning
 
Biometric Databases and Hadoop__HadoopSummit2010
Biometric Databases and Hadoop__HadoopSummit2010Biometric Databases and Hadoop__HadoopSummit2010
Biometric Databases and Hadoop__HadoopSummit2010Yahoo Developer Network
 
Aadhaar at 5th_elephant_v3
Aadhaar at 5th_elephant_v3Aadhaar at 5th_elephant_v3
Aadhaar at 5th_elephant_v3Regunath B
 
Hadoop at aadhaar
Hadoop at aadhaarHadoop at aadhaar
Hadoop at aadhaarRegunath B
 
Zappos.com, My Experience: Colin Gilchrist
Zappos.com, My Experience: Colin GilchristZappos.com, My Experience: Colin Gilchrist
Zappos.com, My Experience: Colin GilchristColin Gilchrist
 
O Melhor do Direito:Material de Assimilação - Lei 4.717
O Melhor do Direito:Material de Assimilação - Lei 4.717O Melhor do Direito:Material de Assimilação - Lei 4.717
O Melhor do Direito:Material de Assimilação - Lei 4.717omelhordodireito
 
Africa conf-report en Conference report High-Level Conference EU-Africa Pa...
Africa conf-report en Conference report    High-Level Conference EU-Africa Pa...Africa conf-report en Conference report    High-Level Conference EU-Africa Pa...
Africa conf-report en Conference report High-Level Conference EU-Africa Pa...Ethio-Afric News en Views Media!!
 
Octink construction product guide 2013
Octink construction product guide 2013Octink construction product guide 2013
Octink construction product guide 2013Paul Gilligan
 
NdP_Akamon gana el primer premio “Who’s got game” como mejor startup de juego...
NdP_Akamon gana el primer premio “Who’s got game” como mejor startup de juego...NdP_Akamon gana el primer premio “Who’s got game” como mejor startup de juego...
NdP_Akamon gana el primer premio “Who’s got game” como mejor startup de juego...Akamon Entertainment
 
Baillieu Holst Post Election Seminar
Baillieu Holst Post Election Seminar Baillieu Holst Post Election Seminar
Baillieu Holst Post Election Seminar Darryl Gobbett
 
Congelamiento de precios productos en coto
Congelamiento de precios   productos en cotoCongelamiento de precios   productos en coto
Congelamiento de precios productos en cotoDiario Elcomahueonline
 

Andere mochten auch (20)

Real Time Fuzzy Matching with Spark and Elastic Search-(Sonal Goyal, Nube)
Real Time Fuzzy Matching with Spark and Elastic Search-(Sonal Goyal, Nube)Real Time Fuzzy Matching with Spark and Elastic Search-(Sonal Goyal, Nube)
Real Time Fuzzy Matching with Spark and Elastic Search-(Sonal Goyal, Nube)
 
Fuzzy Data Leaks
Fuzzy Data LeaksFuzzy Data Leaks
Fuzzy Data Leaks
 
New Directions for Mahout
New Directions for MahoutNew Directions for Mahout
New Directions for Mahout
 
Nov 2010 HUG: Fuzzy Table - B.A.H
Nov 2010 HUG: Fuzzy Table - B.A.HNov 2010 HUG: Fuzzy Table - B.A.H
Nov 2010 HUG: Fuzzy Table - B.A.H
 
Biometric Databases and Hadoop__HadoopSummit2010
Biometric Databases and Hadoop__HadoopSummit2010Biometric Databases and Hadoop__HadoopSummit2010
Biometric Databases and Hadoop__HadoopSummit2010
 
Aadhaar at 5th_elephant_v3
Aadhaar at 5th_elephant_v3Aadhaar at 5th_elephant_v3
Aadhaar at 5th_elephant_v3
 
Hadoop at aadhaar
Hadoop at aadhaarHadoop at aadhaar
Hadoop at aadhaar
 
Oficio previc copy
Oficio previc copyOficio previc copy
Oficio previc copy
 
Zappos.com, My Experience: Colin Gilchrist
Zappos.com, My Experience: Colin GilchristZappos.com, My Experience: Colin Gilchrist
Zappos.com, My Experience: Colin Gilchrist
 
O Melhor do Direito:Material de Assimilação - Lei 4.717
O Melhor do Direito:Material de Assimilação - Lei 4.717O Melhor do Direito:Material de Assimilação - Lei 4.717
O Melhor do Direito:Material de Assimilação - Lei 4.717
 
PlayStation 4
PlayStation 4PlayStation 4
PlayStation 4
 
Hellen e vitoria musicas ....
Hellen e vitoria musicas ....Hellen e vitoria musicas ....
Hellen e vitoria musicas ....
 
Africa conf-report en Conference report High-Level Conference EU-Africa Pa...
Africa conf-report en Conference report    High-Level Conference EU-Africa Pa...Africa conf-report en Conference report    High-Level Conference EU-Africa Pa...
Africa conf-report en Conference report High-Level Conference EU-Africa Pa...
 
Octink construction product guide 2013
Octink construction product guide 2013Octink construction product guide 2013
Octink construction product guide 2013
 
eHealth
eHealtheHealth
eHealth
 
Zeimer BNI Presentation June 8, 2011
Zeimer BNI Presentation June 8, 2011Zeimer BNI Presentation June 8, 2011
Zeimer BNI Presentation June 8, 2011
 
NdP_Akamon gana el primer premio “Who’s got game” como mejor startup de juego...
NdP_Akamon gana el primer premio “Who’s got game” como mejor startup de juego...NdP_Akamon gana el primer premio “Who’s got game” como mejor startup de juego...
NdP_Akamon gana el primer premio “Who’s got game” como mejor startup de juego...
 
LinkedIn for Beginners
LinkedIn for BeginnersLinkedIn for Beginners
LinkedIn for Beginners
 
Baillieu Holst Post Election Seminar
Baillieu Holst Post Election Seminar Baillieu Holst Post Election Seminar
Baillieu Holst Post Election Seminar
 
Congelamiento de precios productos en coto
Congelamiento de precios   productos en cotoCongelamiento de precios   productos en coto
Congelamiento de precios productos en coto
 

Ähnlich wie Hadoop World 2010 - BAH - Fuzzy Table

My Other Computer is a Data Center: The Sector Perspective on Big Data
My Other Computer is a Data Center: The Sector Perspective on Big DataMy Other Computer is a Data Center: The Sector Perspective on Big Data
My Other Computer is a Data Center: The Sector Perspective on Big DataRobert Grossman
 
A Gen3 Perspective of Disparate Data
A Gen3 Perspective of Disparate DataA Gen3 Perspective of Disparate Data
A Gen3 Perspective of Disparate DataRobert Grossman
 
Sem tech 2011 v8
Sem tech 2011 v8Sem tech 2011 v8
Sem tech 2011 v8dallemang
 
Big Data Lessons from the Cloud
Big Data Lessons from the CloudBig Data Lessons from the Cloud
Big Data Lessons from the CloudMapR Technologies
 
Big data & hadoop framework
Big data & hadoop frameworkBig data & hadoop framework
Big data & hadoop frameworkTu Pham
 
Big Data in Distributed Analytics,Cybersecurity And Digital Forensics
Big Data in Distributed Analytics,Cybersecurity And Digital ForensicsBig Data in Distributed Analytics,Cybersecurity And Digital Forensics
Big Data in Distributed Analytics,Cybersecurity And Digital ForensicsSherinMariamReji05
 
ALM Search Presentation for the VSS Arch Council
ALM Search Presentation for the VSS Arch CouncilALM Search Presentation for the VSS Arch Council
ALM Search Presentation for the VSS Arch CouncilSunita Shrivastava
 
Database Design and Implementation
Database Design and ImplementationDatabase Design and Implementation
Database Design and ImplementationChristian Reina
 
Apache Kafka and the Data Mesh | Ben Stopford and Michael Noll, Confluent
Apache Kafka and the Data Mesh | Ben Stopford and Michael Noll, ConfluentApache Kafka and the Data Mesh | Ben Stopford and Michael Noll, Confluent
Apache Kafka and the Data Mesh | Ben Stopford and Michael Noll, ConfluentHostedbyConfluent
 
Design of file system architecture with cluster
Design of file system architecture with clusterDesign of file system architecture with cluster
Design of file system architecture with clustereSAT Publishing House
 
Efficient Similarity Search Over Encrypted Data
Efficient Similarity Search Over Encrypted DataEfficient Similarity Search Over Encrypted Data
Efficient Similarity Search Over Encrypted DataIRJET Journal
 
Introduction Big Data
Introduction Big DataIntroduction Big Data
Introduction Big DataFrank Kienle
 
eScience: A Transformed Scientific Method
eScience: A Transformed Scientific MethodeScience: A Transformed Scientific Method
eScience: A Transformed Scientific MethodDuncan Hull
 
Unit-1 Introduction to Big Data.pptx
Unit-1 Introduction to Big Data.pptxUnit-1 Introduction to Big Data.pptx
Unit-1 Introduction to Big Data.pptxAnkitChauhan817826
 
IRJET - Survey Paper on Map Reduce Processing using HADOOP
IRJET - Survey Paper on Map Reduce Processing using HADOOPIRJET - Survey Paper on Map Reduce Processing using HADOOP
IRJET - Survey Paper on Map Reduce Processing using HADOOPIRJET Journal
 
Database Systems
Database SystemsDatabase Systems
Database SystemsUsman Tariq
 

Ähnlich wie Hadoop World 2010 - BAH - Fuzzy Table (20)

My Other Computer is a Data Center: The Sector Perspective on Big Data
My Other Computer is a Data Center: The Sector Perspective on Big DataMy Other Computer is a Data Center: The Sector Perspective on Big Data
My Other Computer is a Data Center: The Sector Perspective on Big Data
 
A Gen3 Perspective of Disparate Data
A Gen3 Perspective of Disparate DataA Gen3 Perspective of Disparate Data
A Gen3 Perspective of Disparate Data
 
Sem tech 2011 v8
Sem tech 2011 v8Sem tech 2011 v8
Sem tech 2011 v8
 
Big Data Lessons from the Cloud
Big Data Lessons from the CloudBig Data Lessons from the Cloud
Big Data Lessons from the Cloud
 
Big data & hadoop framework
Big data & hadoop frameworkBig data & hadoop framework
Big data & hadoop framework
 
Big Data in Distributed Analytics,Cybersecurity And Digital Forensics
Big Data in Distributed Analytics,Cybersecurity And Digital ForensicsBig Data in Distributed Analytics,Cybersecurity And Digital Forensics
Big Data in Distributed Analytics,Cybersecurity And Digital Forensics
 
ALM Search Presentation for the VSS Arch Council
ALM Search Presentation for the VSS Arch CouncilALM Search Presentation for the VSS Arch Council
ALM Search Presentation for the VSS Arch Council
 
Database Design and Implementation
Database Design and ImplementationDatabase Design and Implementation
Database Design and Implementation
 
Apache Kafka and the Data Mesh | Ben Stopford and Michael Noll, Confluent
Apache Kafka and the Data Mesh | Ben Stopford and Michael Noll, ConfluentApache Kafka and the Data Mesh | Ben Stopford and Michael Noll, Confluent
Apache Kafka and the Data Mesh | Ben Stopford and Michael Noll, Confluent
 
Design of file system architecture with cluster
Design of file system architecture with clusterDesign of file system architecture with cluster
Design of file system architecture with cluster
 
Efficient Similarity Search Over Encrypted Data
Efficient Similarity Search Over Encrypted DataEfficient Similarity Search Over Encrypted Data
Efficient Similarity Search Over Encrypted Data
 
Introduction Big Data
Introduction Big DataIntroduction Big Data
Introduction Big Data
 
eScience: A Transformed Scientific Method
eScience: A Transformed Scientific MethodeScience: A Transformed Scientific Method
eScience: A Transformed Scientific Method
 
Unit-1 Introduction to Big Data.pptx
Unit-1 Introduction to Big Data.pptxUnit-1 Introduction to Big Data.pptx
Unit-1 Introduction to Big Data.pptx
 
Data Privacy at Scale
Data Privacy at ScaleData Privacy at Scale
Data Privacy at Scale
 
UNIT_4.pptx
UNIT_4.pptxUNIT_4.pptx
UNIT_4.pptx
 
Meetup SF - Amundsen
Meetup SF  -  AmundsenMeetup SF  -  Amundsen
Meetup SF - Amundsen
 
IRJET - Survey Paper on Map Reduce Processing using HADOOP
IRJET - Survey Paper on Map Reduce Processing using HADOOPIRJET - Survey Paper on Map Reduce Processing using HADOOP
IRJET - Survey Paper on Map Reduce Processing using HADOOP
 
Database Systems
Database SystemsDatabase Systems
Database Systems
 
E018142329
E018142329E018142329
E018142329
 

Mehr von Cloudera, Inc.

Partner Briefing_January 25 (FINAL).pptx
Partner Briefing_January 25 (FINAL).pptxPartner Briefing_January 25 (FINAL).pptx
Partner Briefing_January 25 (FINAL).pptxCloudera, Inc.
 
Cloudera Data Impact Awards 2021 - Finalists
Cloudera Data Impact Awards 2021 - Finalists Cloudera Data Impact Awards 2021 - Finalists
Cloudera Data Impact Awards 2021 - Finalists Cloudera, Inc.
 
2020 Cloudera Data Impact Awards Finalists
2020 Cloudera Data Impact Awards Finalists2020 Cloudera Data Impact Awards Finalists
2020 Cloudera Data Impact Awards FinalistsCloudera, Inc.
 
Edc event vienna presentation 1 oct 2019
Edc event vienna presentation 1 oct 2019Edc event vienna presentation 1 oct 2019
Edc event vienna presentation 1 oct 2019Cloudera, Inc.
 
Machine Learning with Limited Labeled Data 4/3/19
Machine Learning with Limited Labeled Data 4/3/19Machine Learning with Limited Labeled Data 4/3/19
Machine Learning with Limited Labeled Data 4/3/19Cloudera, Inc.
 
Data Driven With the Cloudera Modern Data Warehouse 3.19.19
Data Driven With the Cloudera Modern Data Warehouse 3.19.19Data Driven With the Cloudera Modern Data Warehouse 3.19.19
Data Driven With the Cloudera Modern Data Warehouse 3.19.19Cloudera, Inc.
 
Introducing Cloudera DataFlow (CDF) 2.13.19
Introducing Cloudera DataFlow (CDF) 2.13.19Introducing Cloudera DataFlow (CDF) 2.13.19
Introducing Cloudera DataFlow (CDF) 2.13.19Cloudera, Inc.
 
Introducing Cloudera Data Science Workbench for HDP 2.12.19
Introducing Cloudera Data Science Workbench for HDP 2.12.19Introducing Cloudera Data Science Workbench for HDP 2.12.19
Introducing Cloudera Data Science Workbench for HDP 2.12.19Cloudera, Inc.
 
Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19

Hadoop World 2010 - BAH - Fuzzy Table

  • 1. Fuzzy Table: Distributed Fuzzy Matching Database. Lalit Kapoor, kapoor_lalit@bah.com
  • 2. Session Agenda: Fuzzy Matching? The Big Data Problem. A Scalable Solution. Performance. Questions?
  • 4. What is Fuzzy Matching? Start with some multimedia (image/voice/audio/video/etc.). Feature Extraction & Normalization creates a Vector or Matrix of doubles for each item (1 and 2), and a Distance Function* compares them: 31.46 in this example. These images are very similar, but obviously not the same. To find image #2 given image #1, some sort of fuzzy matching technique needs to be used. *Euclidean Distance in this example. Images from Flickr; licensed under Creative Commons: http://www.flickr.com/photos/mdpettitt/455527136/sizes/l/in/photostream/ http://www.flickr.com/photos/mdpettitt/455539917/sizes/l/in/photostream/
  • 5. How Is Fuzzy Matching Being Used Today?
  • 6. Why Do We Care? At the forefront of strategy and technology consulting for nearly a century. Deep functional knowledge spanning strategy and organization, technology, operations, and analytics. Services to US government agencies in the defense, security, and civil sectors, as well as to corporations, institutions, and not-for-profit organizations.
  • 7. Biometrics – A Fuzzy Matching Problem. Same person? Lifted from a crime scene vs. law enforcement database.
  • 8. Biometrics – Example. The query and the biometrics database entry each go through Feature Extraction & Normalization (1 and 2) to create a Vector or Matrix of doubles; the Distance Function* returns 2.41. *Euclidean Distance in this example.
  • 9. The Big Data Problem
  • 11. Hulu – over 380 million videos. Sources: http://techcrunch.com/2009/04/07/who-has-the-most-photos-of-them-all-hint-it-is-not-facebook/ http://ksudigg.wetpaint.com/page/YouTube+Statistics http://techcrunch.com/2009/04/28/as-youtube-passes-a-billion-unique-us-viewers-hulu-rushes-into-third-place/
  • 13. European Union’s Biometric Matching System (EU-BMS) is expected to hold 70 million people’s biometric data to support visa applications, border control, and immigration ****
  • 14. AllTrust Networks Paycheck Secure system has enrolled over 6 million users and has performed over 70 million transactions *****. Sources: * US-VISIT: The world’s largest biometric application. William Graves. ** http://www.fbi.gov/hq/cjisd/iafis/iafis_facts.htm *** http://www.business-standard.com/india/news/national-population-register-to-start-biometrics-data-collectiondec/399135/ **** http://www.findbiometrics.com/articles/i/5220/ ***** http://www.alltrustnetworks.com/News/6Million/tabid/378/Default.aspx
  • 15. Biometric Databases are a Big (Data) Problem. Large scale operations: searching and storing 100’s of millions to billions of identities. Multiple biometric templates and raw files per identity for multimodal matching (fingerprints, faces, and iris). New raw files and templates are typically stored after each Verification and Identification operation. Results are expected in real time. Systems require cost efficient storage and retrieval for biometric matches; innovative ways are needed to reduce costs per match. Raw images: 500 M identities x (16 KB to 300 KB) x (10 to 20) = 1 – 2 PB. Biometric templates: 500 M identities x (256 b to 3 KB) x (10 to 20) = 2 – 27 TB.
  • 17. Hadoop and Multimedia Databases. HDFS as file storage for petabytes worth of multimedia (images/audio/video/etc.): redundancy and distribution. Mahout/MapReduce used for indexing and clustering similar objects to improve overall search speeds. Improving feature selection by analyzing the entire database with MapReduce: select the most effective features in distinguishing identities. N-to-N matching search (a special type of Identification search) to cleanse the database: find people trying to circumvent the system (identity fraud, etc.).
  • 20. Fuzzy Table: Bulk Data Processing Component. Mahout’s Canopy Clustering and K-means Clustering partition data into clusters (bins). This reduces the search space so only a small subset of the data must be processed; the concept is based on work done in academia*. Centroids from K-means clustering are used to create a “Bin classifier”, which determines the best bins to search for a given key. {Key, Value} records are stored as Sequence Files in HDFS, spread across the cluster for optimal parallel searching. MapReduce is used for all other bulk or batch data processing: batch fuzzy match searching, re-encoding the raw files into feature vectors, and performing large-scale feature evaluation to improve clustering. * Amit Mhatre, Srinivas Palla, Sharat Chikkerur and Venu Govindaraju, “Efficient Search and Retrieval in Biometric Databases”; Manhua Liu, Xudong Jiang, Alex Chichung Kot, “Efficient fingerprint search based on database clustering”.
  • 21. Bulk Clustering and Real-time Classification. The classifier determines which Bins need to be searched in order to find the most likely matching keys. This makes searching for keys faster because only a small subset of the entire dataset needs to be processed using fuzzy matching.
  • 23. Fuzzy Table: Data Storage and Bins. Bins are represented as directories in HDFS with one or more chunk files (Sequence Files), e.g. /fuzzytable/_table_fingerprints/_bin_000001/_chunk_000001. Chunk files contain many {Key, Value} pairs and are sized at a small multiple of the HDFS block size. Chunk files are distributed uniformly and randomly across the Data Servers in the cluster, which ensures that the bins are striped across the cluster for optimal parallel searching, and they are replicated across the Data Servers using HDFS’s replication mechanism. Data Servers only search through local chunk files. Results are returned in real time as soon as a match is found.
  • 24. Fuzzy Table: Low Latency Fuzzy Matching Component. The low latency component consists of three main parts. Client: submits queries for Keys and gets back {Key, Value} pairs. Master Server: serves metadata about which Data Servers host which bins. Data Servers: actually perform the fuzzy matching searches. Data Servers perform fuzzy matching against Keys in order to find {Key, Value} records: double score = fuzzyMatcher.match(key, storedRec.getKey()); if (score >= threshold) return storedRec; Fuzzy matching searches are performed in parallel across many Data Servers.
  • 25. Fuzzy Table: Optimizations. Master Server HDFS metadata caching: the HDFS Namenode is a performance bottleneck for low latency searches, so the Master Server caches HDFS block locations for all Fuzzytable files (Bins and Chunk Files) and periodically refreshes the cache to keep its metadata current. Increased HDFS replication factor (replication factor of N): Fuzzytable is close to a read-only system, so more data replication means increased parallelism and faster query return. Data Servers only perform searches against data that resides locally on disk.
  • 36. Shortest Query Times [chart: Time To Respond (ms) vs. # Of Data Servers]
  • 37. Longest Query Times [chart: Time To Respond (ms) vs. # Of Data Servers]
  • 38. Caching Performance [chart: Average Response Time (ns) vs. # Threads Polling The Master Server]
  • 39. Conclusion. Large-scale, real-time Multimedia/Biometric Database search is a hard problem, and it’s becoming computationally more expensive as the amount of data grows. Hadoop is a potential solution to this problem: MapReduce can be used for bulk processing to enable distributed, low latency fuzzy matching over HDFS. Hadoop is a great platform for solving all sorts of Big Data and distributed computing problems, even for low latency searching.
  • 40. Contact Information – Cloud Computing Team. Booz Allen Hamilton Inc., 134 National Business Parkway, Annapolis Junction, Maryland 20701. Michael Ridley, Associate, (301)543-4611, ridley_michael@bah.com, @mikeridley; Lalit Kapoor, Senior Consultant, (301)821-8000, kapoor_lalit@bah.com; Edmund Kohlwey, Senior Consultant, (301)821-8000, kohlwey_edmund@bah.com, @ekohlwey; Robert Gordon, Associate, (301)821-8000, gordon_robert@bah.com; Jason Trost, Associate, (301)543-4400, trost_jason@bah.com, @jason_trost; Jesse Yates, Consultant, (301)617-3523, yates_jesse@bah.com, @jesse_yates. Also listed: @idefine.
  • 41. Thanks: Brandyn White (@brandynwhite) – assistance with Flickr image retrieval
  • 46. Technologies Used: Cloudera’s Distribution of Hadoop (CDH3), MapReduce, HDFS, Mahout, Avro, Amazon EC2, Ubuntu Linux, Java, Python, Bash
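
The search path described in slides 20 to 24 (a bin classifier built from K-means centroids, followed by a thresholded fuzzy match against the keys stored in a bin) can be sketched in a few lines of Java. This is only an illustration under stated assumptions, not Fuzzy Table's actual code: the class and method names (`FuzzyTableSketch`, `StoredRec`, `nearestBin`, `search`) are hypothetical, the bins are in-memory lists rather than Sequence Files in HDFS, the classifier picks a single nearest bin where the slides speak of "the best bins", and the score `1 / (1 + distance)` is an assumed way to turn the Euclidean distance from the earlier slides into a value where higher means more similar, so that the `score >= threshold` test from slide 24 applies.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch; names and the score formula are assumptions, not Fuzzy Table's API. */
public class FuzzyTableSketch {

    /** A stored {Key, Value} record; the key is a feature vector of doubles. */
    static final class StoredRec {
        final double[] key;
        final String value;
        StoredRec(double[] key, String value) { this.key = key; this.value = value; }
    }

    /** Euclidean distance between two feature vectors of equal length. */
    static double euclidean(double[] a, double[] b) {
        if (a.length != b.length) throw new IllegalArgumentException("vector lengths differ");
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    /** Assumed similarity score in (0, 1]: identical vectors score 1.0. */
    static double score(double[] a, double[] b) {
        return 1.0 / (1.0 + euclidean(a, b));
    }

    /** Bin classifier: index of the centroid closest to the query key. */
    static int nearestBin(double[] key, double[][] centroids) {
        int best = 0;
        double bestDist = Double.POSITIVE_INFINITY;
        for (int i = 0; i < centroids.length; i++) {
            double d = euclidean(key, centroids[i]);
            if (d < bestDist) { bestDist = d; best = i; }
        }
        return best;
    }

    /** Scan one bin and keep every record whose score meets the threshold. */
    static List<StoredRec> search(double[] key, List<StoredRec> bin, double threshold) {
        List<StoredRec> matches = new ArrayList<>();
        for (StoredRec rec : bin) {
            if (score(key, rec.key) >= threshold) matches.add(rec);
        }
        return matches;
    }

    public static void main(String[] args) {
        // Two made-up centroids and three made-up records, for illustration only.
        double[][] centroids = { {0.0, 0.0}, {10.0, 10.0} };
        List<List<StoredRec>> bins = new ArrayList<>();
        bins.add(new ArrayList<>());
        bins.add(new ArrayList<>());
        bins.get(0).add(new StoredRec(new double[] {0.5, 0.5}, "identity-A"));
        bins.get(0).add(new StoredRec(new double[] {3.0, 4.0}, "identity-B"));
        bins.get(1).add(new StoredRec(new double[] {9.0, 11.0}, "identity-C"));

        double[] query = {0.4, 0.6};
        int bin = nearestBin(query, centroids);
        for (StoredRec rec : search(query, bins.get(bin), 0.8)) {
            System.out.println("bin " + bin + " matched " + rec.value);
        }
    }
}
```

Only the records in the selected bin are scored, which is the search-space reduction the clustering slides describe; in the deck's design that scan would additionally be spread across Data Servers holding local chunk files.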