Scalable high-dimensional
indexing with Hadoop
TEXMEX team, INRIA Rennes, France
Denis Shestakov, PhD
denis.shestakov at {aalto.fi,inria.fr}
linkedin: linkedin.com/in/dshestakov
mendeley: mendeley.com/profiles/denis-shestakov
Denis Shestakov, Diana Moise,
Gylfi Gudmundsson, Laurent Amsaleg
Outline
● Motivation
● Approach overview: scaling indexing &
searching using Hadoop
● Experimental setup: datasets, resources,
configuration
● Results
● Observations & implications
● Things to share
● Future directions
Motivation
● Big data is here
○ Lots of multimedia content
○ Even setting aside the 'big' companies, 1TB/day of
multimedia is now common for many organizations
● Solution: apply more computational power
○ Luckily, easier access to such power via grid/cloud
resources
● Applications:
○ Large-scale image retrieval: e.g., detecting copyright
violations in huge image repositories
○ Google Goggles-like systems: annotating the scene
Our approach
● Index & search huge image collection using
MapReduce-based eCP algorithm
○ See our work at ICMR'13: Indexing and searching
100M images with MapReduce [7]
○ See Section II for quick overview
● Use the Grid5000 platform
○ Distributed infrastructure available to French
researchers & their partners
● Use the Hadoop framework
○ Most popular open-source implementation of
MapReduce model
○ Data stored in HDFS that splits it into chunks (64MB or
often bigger) and distributes it across nodes
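The block splitting described in the last bullet can be pictured with a toy model. This is only a sketch: the function names and node names are made up for illustration, and the round-robin placement is an assumption that stands in for the real HDFS placement policy, which also handles replication and rack awareness.

```python
def split_into_blocks(file_size_mb, block_size_mb=64):
    """Return the sizes (in MB) of the blocks a file is cut into.

    Every block is block_size_mb large except possibly the last one.
    """
    full_blocks, remainder = divmod(file_size_mb, block_size_mb)
    return [block_size_mb] * full_blocks + ([remainder] if remainder else [])


def assign_round_robin(num_blocks, nodes):
    """Toy placement: hand out block ids to nodes in turn (NOT the HDFS policy)."""
    placement = {node: [] for node in nodes}
    for block_id in range(num_blocks):
        placement[nodes[block_id % len(nodes)]].append(block_id)
    return placement


# Hypothetical 200MB file on three hypothetical nodes.
blocks = split_into_blocks(200, 64)
placement = assign_round_robin(len(blocks), ["node-a", "node-b", "node-c"])
print(blocks)     # [64, 64, 64, 8]
print(placement)  # {'node-a': [0, 3], 'node-b': [1], 'node-c': [2]}
```

Each block is typically processed by one map task, so the block size also determines how many map tasks a job can run in parallel.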
Our approach
● Hadoop used for both indexing and searching
● Our search scenario:
■ Searching for batch of images
● Thousands of images in one run
● Focus on throughput, not on response time
for individual image
■ Use case: copyright violation detection
● Note: indexed dataset can be searched on single
machine with adequate disk capacity if necessary
Experimental setup
● Used Grid5000 platform:
○ Nodes at the Rennes site of Grid5000
■ Up to 110 nodes available
■ Node capacity/performance varied
● Heterogeneous, drawn from three clusters
● From 8 cores to 24 cores per node
● From 24GB to 48GB RAM per node
● Hadoop ver.1.0.1
○ (!) No changes in Hadoop internals
■ Pros: easy for others to migrate, try and compare
■ Cons: not top performance
Experimental setup
● Over 100 million images (~30 billion SIFT descriptors)
○ Collected from the Web and provided by one of the
partners in Quaero project
■ One of the largest reported in literature
○ Images resized to 150px on largest side
○ Worked with
■ The whole set (~4TB)
■ A subset of 20 million images (~1TB)
○ Used as distracting dataset
Experimental setup
● For evaluation of indexing quality:
○ Added to distracting datasets:
■ INRIA Copydays (127 images)
○ Queried for
■ Copydays batch (3055 images = 127 original
images and their associated variants incl. strong
distortions, e.g. print-crumple-scan )
■ 12k batch (12081 images = 245 random images
from dataset and their variants)
○ Checked if original images returned as top voted
search results
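The check in the last bullet can be sketched as follows. This is a minimal illustration, not the authors' evaluation code: the function names, the layout of the vote dictionaries and the sample image identifiers are all assumptions made for the example.

```python
from collections import Counter


def top_voted(votes):
    """Return the indexed image with the most votes, or None if there are none."""
    if not votes:
        return None
    return max(votes.items(), key=lambda item: item[1])[0]


def fraction_top_voted(results, ground_truth):
    """Fraction of query images whose top-voted result is the expected original.

    results:      query id -> Counter {indexed image id: vote count}
    ground_truth: query id -> id of the original image
    """
    if not ground_truth:
        return 0.0
    hits = sum(
        1
        for query_id, original in ground_truth.items()
        if top_voted(results.get(query_id, {})) == original
    )
    return hits / len(ground_truth)


# Hypothetical search output for two distorted variants of one original.
results = {
    "copy_001_crop": Counter({"orig_001": 42, "distractor_17": 3}),
    "copy_001_scan": Counter({"distractor_99": 5, "orig_001": 4}),
}
ground_truth = {"copy_001_crop": "orig_001", "copy_001_scan": "orig_001"}
print(fraction_top_voted(results, ground_truth))  # 0.5
```

In this made-up example the cropped variant finds its original, while the heavily distorted scan is outvoted by a distractor.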
Results: workflow overview
● Experiment on indexing & searching 1TB took 5-6
hours
Results: indexing 1TB
Results: indexing 4TB
● 4TB
● 100 nodes
● Used tuned parameters
○ Except change in #mappers/#reducers per node
■ To fit bigger index tree (for 4TB) to RAM
■ 4 mappers/2 reducers
● Time: 507min
Results: search quality
Results: search scalability
Results: search execution
Search 12k batch over 1TB using 100 nodes
Results: searching 4TB
● 4TB
● 87 nodes
● Copydays query batch (3k images)
○ Throughput: 460ms per image
● 12k query batch
○ Throughput: 210ms per image
● Bigger batches improve throughput only marginally
○ bigger batch -> bigger lookup table -> more RAM per
mapper required -> fewer mappers per node
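The per-image figures above translate into total batch times as sketched below. The sketch assumes that "per image" means total wall-clock search time divided by batch size, which the slide does not state explicitly; the function names are made up for the example.

```python
def batch_time_minutes(num_images, ms_per_image):
    """Total search time for a batch, in minutes, from an average per-image cost."""
    return num_images * ms_per_image / 1000.0 / 60.0


def images_per_second(ms_per_image):
    """Convert an average per-image cost into images processed per second."""
    return 1000.0 / ms_per_image


# Batch sizes and per-image times as reported on the slides.
copydays_minutes = batch_time_minutes(3055, 460)
batch_12k_minutes = batch_time_minutes(12081, 210)

print(f"Copydays batch: {copydays_minutes:.1f} min, {images_per_second(460):.2f} images/s")
print(f"12k batch:      {batch_12k_minutes:.1f} min, {images_per_second(210):.2f} images/s")
# Copydays batch: 23.4 min, 2.17 images/s
# 12k batch:      42.3 min, 4.76 images/s
```

Under this reading, a batch four times larger takes less than twice as long, which is why the scenario favors throughput over per-image response time.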
Observations &
implications
● HDFS block size limits scalability
○ 1TB dataset => 1186 blocks of 1024MB size
○ Assuming 8-core nodes and reported searching
method: no scaling after 149 nodes (i.e.
8x149=1192)
○ Solutions:
■ Smaller HDFS blocks, e.g., scaling up to 280 nodes for
512MB blocks
■ Re-visit search process: e.g., partial-loading of lookup
table
● Big data is here, but the resources to process it are not
○ E.g., indexing & searching >10TB was not possible with the resources we had
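The block-size limit above can be reproduced with a little arithmetic. This is a naive model under stated assumptions: one map task per HDFS block, and at most one map task per core at a time. The function names are made up for the example.

```python
import math


def max_useful_nodes(num_blocks, cores_per_node):
    """Smallest node count at which every block gets its own core.

    Adding nodes beyond this leaves cores idle, since there are no more
    blocks (map tasks) to hand out.
    """
    return math.ceil(num_blocks / cores_per_node)


def blocks_after_resize(num_blocks, old_block_mb, new_block_mb):
    """Approximate block count after changing the HDFS block size."""
    return math.ceil(num_blocks * old_block_mb / new_block_mb)


# Figures from the slide: 1186 blocks of 1024MB, 8-core nodes.
print(max_useful_nodes(1186, 8))  # 149

# Halving the block size doubles the number of blocks.
blocks_512 = blocks_after_resize(1186, 1024, 512)
print(blocks_512)                       # 2372
print(max_useful_nodes(blocks_512, 8))  # 297
```

For 512MB blocks this naive model gives an upper bound of 297 nodes, somewhat above the 280 quoted on the slide; the slide does not say how 280 was derived.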
Things to share
● Our methods/system can be applied to audio datasets
○ No major changes expected
○ Contact me if interested
● Code for MapReduce-eCP algorithm available on request
○ Should run smoothly on your Hadoop cluster
○ Interested in comparisons
● Hadoop job history logs behind our experiments (not only
for those reported at CBMI) available on request
○ Logs describe indexing/searching of our dataset, with details
on map/reduce task execution
○ Insights on better analysis/visualization are welcome
○ Job logs for CBMI'13 experiments: http://goo.gl/e06wE
Future directions
● Deal with big batches of query images
○ ~200k query images
● Share auxiliary data (index tree, lookup table) among
mappers
○ Multithreaded map tasks
● (environment-specific) Test scalability on more nodes
○ Use several sites of Grid5000 infrastructure
■ Rennes + Nancy sites (up to 300 nodes), in
progress
Acknowledgements
● TEXMEX team, INRIA Rennes,
http://www.irisa.fr/texmex/index_en.php
● Quaero project, http://www.quaero.org/
● Grid5000 infrastructure & its Rennes
maintenance team, https://www.grid5000.fr
Thank you!
Questions?
