SlideShare ist ein Scribd-Unternehmen logo
1 von 45
A Study of I/O and Virtualization Performance with a Search Engine based on an XML database and Lucene Ed Buech é , EMC edward.bueche@emc.com, May 25, 2011
Agenda ,[object Object],[object Object],[object Object],[object Object]
My Background ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object]
Documentum search 101 ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object]
Introducing Documentum xPlore  ,[object Object],[object Object],[object Object]
[object Object],[object Object],[object Object],1996 2010 2005 ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object]
Enhancing Documentum Deployments  with Search ,[object Object],[object Object],[object Object],Content Server DCTM client DQL SQL RDBMS search
Enhancing Documentum Deployments  with Search ,[object Object],[object Object],[object Object],Content Server Documentum client DQL SQL xQuery RDBMS Metadata + content search
Some Basic Design Concepts behind Documentum xPlore ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object]
Design concepts (con’t) ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object]
Lessons Learned… Structured Query use-cases Unstructured Query use-cases Fit to use-case
Indexes, DB, and IR Structured Query use-cases Unstructured Query use-cases Fit to use-case Scoring, Relevance, Entities Hierarchical data representations (XML) Full Text searches Constantly changing schemas Relational DB technology
Indexes, DB, and IR Structured Query use-cases Unstructured Query use-cases Fit to use-case Meta data query Transactions   Advanced data management (partitions)   JOINs   Full Text index technology
Indexes, DB, and IR Structured Query use-cases Unstructured Query use-cases Fit to use-case Relational DB technology Full Text index technology
Documentum xPlore ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],xDB Transaction, Index & Page Management xDB Query Processing& Optimization xDB API xPlore API Search  Services Node & Data  Management  Services Indexing  Services Admin  Services Content Processing  Services Analytics
EMC xDB: Native XML database ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object]
Libraries / Collections & Indexes = xDB segment =  xDB Library  / xPlore collection =  xDB Index =  xDB xml file ( dftxml , tracking  xml, status, metrics, audit)
Lucene Integration ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object]
Lucene Integration (con’t) ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object]
xPlore has lucene search engine capabilities plus…. ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object]
Tips and Observations on  IO and Host Virtualization ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object]
Tip #1: Don’t assume that  one-size-fits all ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object]
Same concept applies for disk virtualization ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],50GB and 100 I/O’s per sec capacity 50GB and 200 I/O’s per sec capacity 50GB and 400 I/O’s per sec capacity
Linear mapping’s and Luns ,[object Object],[object Object]
EMC Symmetrix: Nondisruptive Mobility ,[object Object],[object Object],[object Object],[object Object],[object Object],Virtual Pools Flash 400 GB RAID 5 Fibre Channel 600 GB 15K RAID 1 SATA 2 TB RAID 6
Tip #2: Consolidation Contention ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object]
Some Vmware statistics ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object]
Sample %Ready for a production VM with xPlore deployment for an entire week “ official” area that Indicates pain In this case Avg resp time doubled and max resp time grew by 5x
Actual Ready samples during  several hour period
Some Subtleties with  Interactive CPU denial ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],20 sec interval Denial spike
Sharing I/O capacity ,[object Object],[object Object],[object Object],Volume for Lucene application Both volumes spread over the same set of drives and effectively sharing the I/O capacity Volume for other application
Recommendations on diagnosing disk I/O related issues ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object]
Sample output from the Bonnie tool ¹ Bonnie is an open source disk I/O driver tool for Linux that can be useful for pretesting Linux disk environments prior to an xPlore/Lucene  install. bonnie -s 1024 -y -u -o_direct -v 10 -p 10 This will increase the size of the file to 2 Gb. Examine the output. Focus on the random I/O area: ---Sequential Output (sync)----- ---Sequential Input-- --Rnd Seek- -CharUnlk- -DIOBlock- -DRewrite- -CharUnlk- -DIOBlock- --04k (10)- Machine  MB K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU  /sec %CPU Mach2  10*2024 73928 97  104142  5.3 26246  2.9  8872 22.5 43794  1.9  735.7 15.2 -s 1024 means that 2 GB files will be created -o_direct means that direct I/O (by-passing buffer cache) will be done -v 10 means that 10 different 2GB files will be created. -p 10 means that 10 different threads will query those files This output means that the random read test saw 735 random I/O’s per sec at 15% CPU busy
Linux indicators compared  to bonnie output See  https://community.emc.com/docs/DOC-9179  for additional example Device:  tps   kB_read/s  kB_wrtn/s  kB_read  kB_wrtn sde  206.10  2402.40  0.80  24024  8 09:29:17  DEV  tps  rd_sec/s  wr_sec/s  avgrq-sz  avgqu-sz  await  svctm  %util 09:29:27  dev8-65  209.24   4877.97  1.62  23.32  1.62  7.75  3.80  79.59 09:29:17 PM  CPU  %user  %nice  %system  %iowait   %steal  %idle 09:29:27 PM  all  41.37  0.00  5.56  29.86   0.00  23.21 09:29:27 PM  0  62.44  0.00  10.56  25.38   0.00  1.62 09:29:27 PM  1  30.90  0.00  4.26  35.56   0.00  29.28 09:29:27 PM  2  36.35  0.00  3.96  30.76  0.00  28.93 09:29:27 PM  3  35.77  0.00  3.46  27.64  0.00  33.13 I/O stat output: SAR –d output: SAR –u output: Notice that at 200+ I/Os per sec the underlying volume is 80% busy. Although there could be multiple causes, one could be that some other VM is consuming the remaining I/O capacity (735 – 209 = 500+). High I/O wait
Tip #3: Try to ensure availability  of resources ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object]
IO / caching test use-case ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object]
Some xPlore Structures for Search ¹ Dictionary of terms Posting list (doc-id’s for term) Stored fields (facets and node-ids) Security indexes (b-tree based) xDB XML store (contains text for summary) 1 st  doc N-th doc Facet decompression map ¹ Frequency and position structures ignored for simplicity
IO model for search in xPlore Dictionary Posting list (doc-id’s for term) Stored fields Xdb node-id plus facet / security info Security lookup  (b-tree based) xDB XML store (contains text for summary) Facet decompression map Search Term: ‘ term1 term2 ’ Result set
Separation of “covering values” in  stored fields and summary  Facet Calc FinalFacet  calc values over thousands of results Res-1 - sum Res-2 - sum Res-3 - sum : : Res-350-sum Xdb docs with text for summary Small number for result window Small structure Potentially thousands of results Stored fields (Random access) Potentially thousands of hits Security lookup
xPlore Memory Pool areas  at-a-glance xPlore Instance  (fixed size) memory xDB Buffer Cache Lucene Caches & working memory xPlore caches  Other vm working mem Operating System File Buffer cache ( dynamically sized ) Native code content extraction & linguistic processing  memory
Lucene data resides primarily in OS buffer cache Potential for many things to sweep  lucene from that cache
Test Env ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object]
Some results of the query suite ,[object Object],[object Object],[object Object],[object Object],Test Avg Resp to consume all results (sec) MB pre-cached I/O per result Total MB loaded into memory (cached + test) Nothing cached 1.89 0 0.89 77  Stored fields cached 0.95 241  0.38 272  Term dict cached 1.73 537  0.79 604  Positions cached 1.58 8,789  0.74 8,800  Frequencies cached 1.65 1,406  0.63 1,436  Entire index cached 0.59 10,970  < 0.05 10,970
Other Notes ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object]
Contact ,[object Object],[object Object],[object Object],[object Object]

Weitere ähnliche Inhalte

Was ist angesagt?

Design Patterns for Distributed Non-Relational Databases
Design Patterns for Distributed Non-Relational DatabasesDesign Patterns for Distributed Non-Relational Databases
Design Patterns for Distributed Non-Relational Databasesguestdfd1ec
 
Netezza Deep Dives
Netezza Deep DivesNetezza Deep Dives
Netezza Deep DivesRush Shah
 
Time Series Classification with Deep Learning | Marco Del Pra
Time Series Classification with Deep Learning | Marco Del PraTime Series Classification with Deep Learning | Marco Del Pra
Time Series Classification with Deep Learning | Marco Del PraData Science Milan
 
Netezza fundamentals for developers
Netezza fundamentals for developersNetezza fundamentals for developers
Netezza fundamentals for developersBiju Nair
 
What's new with Azure Sql Database
What's new with Azure Sql DatabaseWhat's new with Azure Sql Database
What's new with Azure Sql DatabaseMarco Parenzan
 
Lucene Bootcamp - 2
Lucene Bootcamp - 2Lucene Bootcamp - 2
Lucene Bootcamp - 2GokulD
 
Modeling data and best practices for the Azure Cosmos DB.
Modeling data and best practices for the Azure Cosmos DB.Modeling data and best practices for the Azure Cosmos DB.
Modeling data and best practices for the Azure Cosmos DB.Mohammad Asif
 
Exploring Microsoft Azure Infrastructures
Exploring Microsoft Azure InfrastructuresExploring Microsoft Azure Infrastructures
Exploring Microsoft Azure InfrastructuresCCG
 
Azure Cosmos DB + Gremlin API in Action
Azure Cosmos DB + Gremlin API in ActionAzure Cosmos DB + Gremlin API in Action
Azure Cosmos DB + Gremlin API in ActionDenys Chamberland
 
AWS Cloud SAA Relational Database presentation
AWS Cloud SAA Relational Database presentationAWS Cloud SAA Relational Database presentation
AWS Cloud SAA Relational Database presentationTATA LILIAN SHULIKA
 
Whitepaper : Working with Greenplum Database using Toad for Data Analysts
Whitepaper : Working with Greenplum Database using Toad for Data Analysts Whitepaper : Working with Greenplum Database using Toad for Data Analysts
Whitepaper : Working with Greenplum Database using Toad for Data Analysts EMC
 
Netezza workload management
Netezza workload managementNetezza workload management
Netezza workload managementBiju Nair
 
Improve deep learning inference  performance with Microsoft Azure Esv4 VMs wi...
Improve deep learning inference  performance with Microsoft Azure Esv4 VMs wi...Improve deep learning inference  performance with Microsoft Azure Esv4 VMs wi...
Improve deep learning inference  performance with Microsoft Azure Esv4 VMs wi...Principled Technologies
 
Apache Impala (incubating) 2.5 Performance Update
Apache Impala (incubating) 2.5 Performance UpdateApache Impala (incubating) 2.5 Performance Update
Apache Impala (incubating) 2.5 Performance UpdateCloudera, Inc.
 
Incorta spark integration
Incorta spark integrationIncorta spark integration
Incorta spark integrationDylan Wan
 
Performance benchmark results: Amazon Web Services (AWS) SAN in the Cloud vs....
Performance benchmark results: Amazon Web Services (AWS) SAN in the Cloud vs....Performance benchmark results: Amazon Web Services (AWS) SAN in the Cloud vs....
Performance benchmark results: Amazon Web Services (AWS) SAN in the Cloud vs....Principled Technologies
 
Big Data Engineering for Machine Learning
Big Data Engineering for Machine LearningBig Data Engineering for Machine Learning
Big Data Engineering for Machine LearningVasu S
 
Get more out of your Windows 10 laptop experience with SSD storage instead of...
Get more out of your Windows 10 laptop experience with SSD storage instead of...Get more out of your Windows 10 laptop experience with SSD storage instead of...
Get more out of your Windows 10 laptop experience with SSD storage instead of...Principled Technologies
 

Was ist angesagt? (20)

Design Patterns for Distributed Non-Relational Databases
Design Patterns for Distributed Non-Relational DatabasesDesign Patterns for Distributed Non-Relational Databases
Design Patterns for Distributed Non-Relational Databases
 
Netezza Deep Dives
Netezza Deep DivesNetezza Deep Dives
Netezza Deep Dives
 
Time Series Classification with Deep Learning | Marco Del Pra
Time Series Classification with Deep Learning | Marco Del PraTime Series Classification with Deep Learning | Marco Del Pra
Time Series Classification with Deep Learning | Marco Del Pra
 
Unit 2.pptx
Unit 2.pptxUnit 2.pptx
Unit 2.pptx
 
Netezza fundamentals for developers
Netezza fundamentals for developersNetezza fundamentals for developers
Netezza fundamentals for developers
 
What's new with Azure Sql Database
What's new with Azure Sql DatabaseWhat's new with Azure Sql Database
What's new with Azure Sql Database
 
Lucene Bootcamp - 2
Lucene Bootcamp - 2Lucene Bootcamp - 2
Lucene Bootcamp - 2
 
Modeling data and best practices for the Azure Cosmos DB.
Modeling data and best practices for the Azure Cosmos DB.Modeling data and best practices for the Azure Cosmos DB.
Modeling data and best practices for the Azure Cosmos DB.
 
Exploring Microsoft Azure Infrastructures
Exploring Microsoft Azure InfrastructuresExploring Microsoft Azure Infrastructures
Exploring Microsoft Azure Infrastructures
 
Azure Cosmos DB + Gremlin API in Action
Azure Cosmos DB + Gremlin API in ActionAzure Cosmos DB + Gremlin API in Action
Azure Cosmos DB + Gremlin API in Action
 
AWS Cloud SAA Relational Database presentation
AWS Cloud SAA Relational Database presentationAWS Cloud SAA Relational Database presentation
AWS Cloud SAA Relational Database presentation
 
Whitepaper : Working with Greenplum Database using Toad for Data Analysts
Whitepaper : Working with Greenplum Database using Toad for Data Analysts Whitepaper : Working with Greenplum Database using Toad for Data Analysts
Whitepaper : Working with Greenplum Database using Toad for Data Analysts
 
Netezza workload management
Netezza workload managementNetezza workload management
Netezza workload management
 
Improve deep learning inference  performance with Microsoft Azure Esv4 VMs wi...
Improve deep learning inference  performance with Microsoft Azure Esv4 VMs wi...Improve deep learning inference  performance with Microsoft Azure Esv4 VMs wi...
Improve deep learning inference  performance with Microsoft Azure Esv4 VMs wi...
 
Apache Impala (incubating) 2.5 Performance Update
Apache Impala (incubating) 2.5 Performance UpdateApache Impala (incubating) 2.5 Performance Update
Apache Impala (incubating) 2.5 Performance Update
 
Incorta spark integration
Incorta spark integrationIncorta spark integration
Incorta spark integration
 
Performance benchmark results: Amazon Web Services (AWS) SAN in the Cloud vs....
Performance benchmark results: Amazon Web Services (AWS) SAN in the Cloud vs....Performance benchmark results: Amazon Web Services (AWS) SAN in the Cloud vs....
Performance benchmark results: Amazon Web Services (AWS) SAN in the Cloud vs....
 
Big Data Engineering for Machine Learning
Big Data Engineering for Machine LearningBig Data Engineering for Machine Learning
Big Data Engineering for Machine Learning
 
Explore Azure Cosmos DB
Explore Azure Cosmos DBExplore Azure Cosmos DB
Explore Azure Cosmos DB
 
Get more out of your Windows 10 laptop experience with SSD storage instead of...
Get more out of your Windows 10 laptop experience with SSD storage instead of...Get more out of your Windows 10 laptop experience with SSD storage instead of...
Get more out of your Windows 10 laptop experience with SSD storage instead of...
 

Ähnlich wie I/O & virtualization performance with a search engine based on an xml database & lucene - By Ed Bueche

Windows Azure: Lessons From The Field
Windows Azure: Lessons From The FieldWindows Azure: Lessons From The Field
Windows Azure: Lessons From The FieldRob Gillen
 
GWAB 2015 - Data Plaraform
GWAB 2015 - Data PlaraformGWAB 2015 - Data Plaraform
GWAB 2015 - Data PlaraformMarcelo Paiva
 
Azure SQL - more or/and less than SQL Server
Azure SQL - more or/and less than SQL ServerAzure SQL - more or/and less than SQL Server
Azure SQL - more or/and less than SQL ServerRafał Hryniewski
 
Tech-Spark: Exploring the Cosmos DB
Tech-Spark: Exploring the Cosmos DBTech-Spark: Exploring the Cosmos DB
Tech-Spark: Exploring the Cosmos DBRalph Attard
 
Handling Data in Mega Scale Systems
Handling Data in Mega Scale SystemsHandling Data in Mega Scale Systems
Handling Data in Mega Scale SystemsDirecti Group
 
Azure Data platform
Azure Data platformAzure Data platform
Azure Data platformMostafa
 
Microsoft Database Options
Microsoft Database OptionsMicrosoft Database Options
Microsoft Database OptionsDavid Chou
 
Windows Azure - Uma Plataforma para o Desenvolvimento de Aplicações
Windows Azure - Uma Plataforma para o Desenvolvimento de AplicaçõesWindows Azure - Uma Plataforma para o Desenvolvimento de Aplicações
Windows Azure - Uma Plataforma para o Desenvolvimento de AplicaçõesComunidade NetPonto
 
Deep dive into the native multi model database ArangoDB
Deep dive into the native multi model database ArangoDBDeep dive into the native multi model database ArangoDB
Deep dive into the native multi model database ArangoDBArangoDB Database
 
Azure BI Cloud Architectural Guidelines.pdf
Azure BI Cloud Architectural Guidelines.pdfAzure BI Cloud Architectural Guidelines.pdf
Azure BI Cloud Architectural Guidelines.pdfpbonillo1
 
Modern Database Development Oow2008 Lucas Jellema
Modern Database Development Oow2008 Lucas JellemaModern Database Development Oow2008 Lucas Jellema
Modern Database Development Oow2008 Lucas JellemaLucas Jellema
 
Prague data management meetup 2018-03-27
Prague data management meetup 2018-03-27Prague data management meetup 2018-03-27
Prague data management meetup 2018-03-27Martin Bém
 
Scaling Web Apps P Falcone
Scaling Web Apps P FalconeScaling Web Apps P Falcone
Scaling Web Apps P Falconejedt
 
MOSS 2007 Deployment Fundamentals -Part2
MOSS 2007 Deployment Fundamentals -Part2MOSS 2007 Deployment Fundamentals -Part2
MOSS 2007 Deployment Fundamentals -Part2Information Technology
 
A Tour of Azure SQL Databases (NOVA SQL UG 2020)
A Tour of Azure SQL Databases  (NOVA SQL UG 2020)A Tour of Azure SQL Databases  (NOVA SQL UG 2020)
A Tour of Azure SQL Databases (NOVA SQL UG 2020)Timothy McAliley
 
Atmosphere 2014: Switching from monolithic approach to modular cloud computin...
Atmosphere 2014: Switching from monolithic approach to modular cloud computin...Atmosphere 2014: Switching from monolithic approach to modular cloud computin...
Atmosphere 2014: Switching from monolithic approach to modular cloud computin...PROIDEA
 
2014.11.14 Data Opportunities with Azure
2014.11.14 Data Opportunities with Azure2014.11.14 Data Opportunities with Azure
2014.11.14 Data Opportunities with AzureMarco Parenzan
 
HPE Hadoop Solutions - From use cases to proposal
HPE Hadoop Solutions - From use cases to proposalHPE Hadoop Solutions - From use cases to proposal
HPE Hadoop Solutions - From use cases to proposalDataWorks Summit
 

Ähnlich wie I/O & virtualization performance with a search engine based on an xml database & lucene - By Ed Bueche (20)

Windows Azure: Lessons From The Field
Windows Azure: Lessons From The FieldWindows Azure: Lessons From The Field
Windows Azure: Lessons From The Field
 
GWAB 2015 - Data Plaraform
GWAB 2015 - Data PlaraformGWAB 2015 - Data Plaraform
GWAB 2015 - Data Plaraform
 
Azure SQL - more or/and less than SQL Server
Azure SQL - more or/and less than SQL ServerAzure SQL - more or/and less than SQL Server
Azure SQL - more or/and less than SQL Server
 
Tech-Spark: Exploring the Cosmos DB
Tech-Spark: Exploring the Cosmos DBTech-Spark: Exploring the Cosmos DB
Tech-Spark: Exploring the Cosmos DB
 
Azure Data Storage
Azure Data StorageAzure Data Storage
Azure Data Storage
 
Handling Data in Mega Scale Systems
Handling Data in Mega Scale SystemsHandling Data in Mega Scale Systems
Handling Data in Mega Scale Systems
 
Azure Data platform
Azure Data platformAzure Data platform
Azure Data platform
 
Microsoft Database Options
Microsoft Database OptionsMicrosoft Database Options
Microsoft Database Options
 
Windows Azure - Uma Plataforma para o Desenvolvimento de Aplicações
Windows Azure - Uma Plataforma para o Desenvolvimento de AplicaçõesWindows Azure - Uma Plataforma para o Desenvolvimento de Aplicações
Windows Azure - Uma Plataforma para o Desenvolvimento de Aplicações
 
Deep dive into the native multi model database ArangoDB
Deep dive into the native multi model database ArangoDBDeep dive into the native multi model database ArangoDB
Deep dive into the native multi model database ArangoDB
 
Azure BI Cloud Architectural Guidelines.pdf
Azure BI Cloud Architectural Guidelines.pdfAzure BI Cloud Architectural Guidelines.pdf
Azure BI Cloud Architectural Guidelines.pdf
 
Modern Database Development Oow2008 Lucas Jellema
Modern Database Development Oow2008 Lucas JellemaModern Database Development Oow2008 Lucas Jellema
Modern Database Development Oow2008 Lucas Jellema
 
Prague data management meetup 2018-03-27
Prague data management meetup 2018-03-27Prague data management meetup 2018-03-27
Prague data management meetup 2018-03-27
 
Scaling Web Apps P Falcone
Scaling Web Apps P FalconeScaling Web Apps P Falcone
Scaling Web Apps P Falcone
 
MOSS 2007 Deployment Fundamentals -Part2
MOSS 2007 Deployment Fundamentals -Part2MOSS 2007 Deployment Fundamentals -Part2
MOSS 2007 Deployment Fundamentals -Part2
 
A Tour of Azure SQL Databases (NOVA SQL UG 2020)
A Tour of Azure SQL Databases  (NOVA SQL UG 2020)A Tour of Azure SQL Databases  (NOVA SQL UG 2020)
A Tour of Azure SQL Databases (NOVA SQL UG 2020)
 
Atmosphere 2014: Switching from monolithic approach to modular cloud computin...
Atmosphere 2014: Switching from monolithic approach to modular cloud computin...Atmosphere 2014: Switching from monolithic approach to modular cloud computin...
Atmosphere 2014: Switching from monolithic approach to modular cloud computin...
 
2014.11.14 Data Opportunities with Azure
2014.11.14 Data Opportunities with Azure2014.11.14 Data Opportunities with Azure
2014.11.14 Data Opportunities with Azure
 
HPE Hadoop Solutions - From use cases to proposal
HPE Hadoop Solutions - From use cases to proposalHPE Hadoop Solutions - From use cases to proposal
HPE Hadoop Solutions - From use cases to proposal
 
Analytics on AWS - IP Expo 2013
Analytics on AWS - IP Expo 2013Analytics on AWS - IP Expo 2013
Analytics on AWS - IP Expo 2013
 

Mehr von lucenerevolution

Text Classification Powered by Apache Mahout and Lucene
Text Classification Powered by Apache Mahout and LuceneText Classification Powered by Apache Mahout and Lucene
Text Classification Powered by Apache Mahout and Lucenelucenerevolution
 
State of the Art Logging. Kibana4Solr is Here!
State of the Art Logging. Kibana4Solr is Here! State of the Art Logging. Kibana4Solr is Here!
State of the Art Logging. Kibana4Solr is Here! lucenerevolution
 
Building Client-side Search Applications with Solr
Building Client-side Search Applications with SolrBuilding Client-side Search Applications with Solr
Building Client-side Search Applications with Solrlucenerevolution
 
Integrate Solr with real-time stream processing applications
Integrate Solr with real-time stream processing applicationsIntegrate Solr with real-time stream processing applications
Integrate Solr with real-time stream processing applicationslucenerevolution
 
Scaling Solr with SolrCloud
Scaling Solr with SolrCloudScaling Solr with SolrCloud
Scaling Solr with SolrCloudlucenerevolution
 
Administering and Monitoring SolrCloud Clusters
Administering and Monitoring SolrCloud ClustersAdministering and Monitoring SolrCloud Clusters
Administering and Monitoring SolrCloud Clusterslucenerevolution
 
Implementing a Custom Search Syntax using Solr, Lucene, and Parboiled
Implementing a Custom Search Syntax using Solr, Lucene, and ParboiledImplementing a Custom Search Syntax using Solr, Lucene, and Parboiled
Implementing a Custom Search Syntax using Solr, Lucene, and Parboiledlucenerevolution
 
Using Solr to Search and Analyze Logs
Using Solr to Search and Analyze Logs Using Solr to Search and Analyze Logs
Using Solr to Search and Analyze Logs lucenerevolution
 
Enhancing relevancy through personalization & semantic search
Enhancing relevancy through personalization & semantic searchEnhancing relevancy through personalization & semantic search
Enhancing relevancy through personalization & semantic searchlucenerevolution
 
Real-time Inverted Search in the Cloud Using Lucene and Storm
Real-time Inverted Search in the Cloud Using Lucene and StormReal-time Inverted Search in the Cloud Using Lucene and Storm
Real-time Inverted Search in the Cloud Using Lucene and Stormlucenerevolution
 
Solr's Admin UI - Where does the data come from?
Solr's Admin UI - Where does the data come from?Solr's Admin UI - Where does the data come from?
Solr's Admin UI - Where does the data come from?lucenerevolution
 
Schemaless Solr and the Solr Schema REST API
Schemaless Solr and the Solr Schema REST APISchemaless Solr and the Solr Schema REST API
Schemaless Solr and the Solr Schema REST APIlucenerevolution
 
High Performance JSON Search and Relational Faceted Browsing with Lucene
High Performance JSON Search and Relational Faceted Browsing with LuceneHigh Performance JSON Search and Relational Faceted Browsing with Lucene
High Performance JSON Search and Relational Faceted Browsing with Lucenelucenerevolution
 
Text Classification with Lucene/Solr, Apache Hadoop and LibSVM
Text Classification with Lucene/Solr, Apache Hadoop and LibSVMText Classification with Lucene/Solr, Apache Hadoop and LibSVM
Text Classification with Lucene/Solr, Apache Hadoop and LibSVMlucenerevolution
 
Faceted Search with Lucene
Faceted Search with LuceneFaceted Search with Lucene
Faceted Search with Lucenelucenerevolution
 
Recent Additions to Lucene Arsenal
Recent Additions to Lucene ArsenalRecent Additions to Lucene Arsenal
Recent Additions to Lucene Arsenallucenerevolution
 
Turning search upside down
Turning search upside downTurning search upside down
Turning search upside downlucenerevolution
 
Spellchecking in Trovit: Implementing a Contextual Multi-language Spellchecke...
Spellchecking in Trovit: Implementing a Contextual Multi-language Spellchecke...Spellchecking in Trovit: Implementing a Contextual Multi-language Spellchecke...
Spellchecking in Trovit: Implementing a Contextual Multi-language Spellchecke...lucenerevolution
 
Shrinking the haystack wes caldwell - final
Shrinking the haystack   wes caldwell - finalShrinking the haystack   wes caldwell - final
Shrinking the haystack wes caldwell - finallucenerevolution
 

Mehr von lucenerevolution (20)

Text Classification Powered by Apache Mahout and Lucene
Text Classification Powered by Apache Mahout and LuceneText Classification Powered by Apache Mahout and Lucene
Text Classification Powered by Apache Mahout and Lucene
 
State of the Art Logging. Kibana4Solr is Here!
State of the Art Logging. Kibana4Solr is Here! State of the Art Logging. Kibana4Solr is Here!
State of the Art Logging. Kibana4Solr is Here!
 
Search at Twitter
Search at TwitterSearch at Twitter
Search at Twitter
 
Building Client-side Search Applications with Solr
Building Client-side Search Applications with SolrBuilding Client-side Search Applications with Solr
Building Client-side Search Applications with Solr
 
Integrate Solr with real-time stream processing applications
Integrate Solr with real-time stream processing applicationsIntegrate Solr with real-time stream processing applications
Integrate Solr with real-time stream processing applications
 
Scaling Solr with SolrCloud
Scaling Solr with SolrCloudScaling Solr with SolrCloud
Scaling Solr with SolrCloud
 
Administering and Monitoring SolrCloud Clusters
Administering and Monitoring SolrCloud ClustersAdministering and Monitoring SolrCloud Clusters
Administering and Monitoring SolrCloud Clusters
 
Implementing a Custom Search Syntax using Solr, Lucene, and Parboiled
Implementing a Custom Search Syntax using Solr, Lucene, and ParboiledImplementing a Custom Search Syntax using Solr, Lucene, and Parboiled
Implementing a Custom Search Syntax using Solr, Lucene, and Parboiled
 
Using Solr to Search and Analyze Logs
Using Solr to Search and Analyze Logs Using Solr to Search and Analyze Logs
Using Solr to Search and Analyze Logs
 
Enhancing relevancy through personalization & semantic search
Enhancing relevancy through personalization & semantic searchEnhancing relevancy through personalization & semantic search
Enhancing relevancy through personalization & semantic search
 
Real-time Inverted Search in the Cloud Using Lucene and Storm
Real-time Inverted Search in the Cloud Using Lucene and StormReal-time Inverted Search in the Cloud Using Lucene and Storm
Real-time Inverted Search in the Cloud Using Lucene and Storm
 
Solr's Admin UI - Where does the data come from?
Solr's Admin UI - Where does the data come from?Solr's Admin UI - Where does the data come from?
Solr's Admin UI - Where does the data come from?
 
Schemaless Solr and the Solr Schema REST API
Schemaless Solr and the Solr Schema REST APISchemaless Solr and the Solr Schema REST API
Schemaless Solr and the Solr Schema REST API
 
High Performance JSON Search and Relational Faceted Browsing with Lucene
High Performance JSON Search and Relational Faceted Browsing with LuceneHigh Performance JSON Search and Relational Faceted Browsing with Lucene
High Performance JSON Search and Relational Faceted Browsing with Lucene
 
Text Classification with Lucene/Solr, Apache Hadoop and LibSVM
Text Classification with Lucene/Solr, Apache Hadoop and LibSVMText Classification with Lucene/Solr, Apache Hadoop and LibSVM
Text Classification with Lucene/Solr, Apache Hadoop and LibSVM
 
Faceted Search with Lucene
Faceted Search with LuceneFaceted Search with Lucene
Faceted Search with Lucene
 
Recent Additions to Lucene Arsenal
Recent Additions to Lucene ArsenalRecent Additions to Lucene Arsenal
Recent Additions to Lucene Arsenal
 
Turning search upside down
Turning search upside downTurning search upside down
Turning search upside down
 
Spellchecking in Trovit: Implementing a Contextual Multi-language Spellchecke...
Spellchecking in Trovit: Implementing a Contextual Multi-language Spellchecke...Spellchecking in Trovit: Implementing a Contextual Multi-language Spellchecke...
Spellchecking in Trovit: Implementing a Contextual Multi-language Spellchecke...
 
Shrinking the haystack wes caldwell - final
Shrinking the haystack   wes caldwell - finalShrinking the haystack   wes caldwell - final
Shrinking the haystack wes caldwell - final
 

Kürzlich hochgeladen

Zeshan Sattar- Assessing the skill requirements and industry expectations for...
Zeshan Sattar- Assessing the skill requirements and industry expectations for...Zeshan Sattar- Assessing the skill requirements and industry expectations for...
Zeshan Sattar- Assessing the skill requirements and industry expectations for...itnewsafrica
 
Testing tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examplesTesting tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examplesKari Kakkonen
 
Decarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a realityDecarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a realityIES VE
 
QCon London: Mastering long-running processes in modern architectures
QCon London: Mastering long-running processes in modern architecturesQCon London: Mastering long-running processes in modern architectures
QCon London: Mastering long-running processes in modern architecturesBernd Ruecker
 
MuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotes
MuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotesMuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotes
MuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotesManik S Magar
 
A Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersA Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersNicole Novielli
 
Potential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and InsightsPotential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and InsightsRavi Sanghani
 
Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024Hiroshi SHIBATA
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxLoriGlavin3
 
Connecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdfConnecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdfNeo4j
 
UiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to HeroUiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to HeroUiPathCommunity
 
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24Mark Goldstein
 
Top 10 Hubspot Development Companies in 2024
Top 10 Hubspot Development Companies in 2024Top 10 Hubspot Development Companies in 2024
Top 10 Hubspot Development Companies in 2024TopCSSGallery
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxLoriGlavin3
 
So einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdfSo einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdfpanagenda
 
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyesHow to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyesThousandEyes
 
Design pattern talk by Kaya Weers - 2024 (v2)
Design pattern talk by Kaya Weers - 2024 (v2)Design pattern talk by Kaya Weers - 2024 (v2)
Design pattern talk by Kaya Weers - 2024 (v2)Kaya Weers
 
Emixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native developmentEmixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native developmentPim van der Noll
 
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 
2024 April Patch Tuesday
2024 April Patch Tuesday2024 April Patch Tuesday
2024 April Patch TuesdayIvanti
 

Kürzlich hochgeladen (20)

Zeshan Sattar- Assessing the skill requirements and industry expectations for...
Zeshan Sattar- Assessing the skill requirements and industry expectations for...Zeshan Sattar- Assessing the skill requirements and industry expectations for...
Zeshan Sattar- Assessing the skill requirements and industry expectations for...
 
Testing tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examplesTesting tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examples
 
Decarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a realityDecarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a reality
 
QCon London: Mastering long-running processes in modern architectures
QCon London: Mastering long-running processes in modern architecturesQCon London: Mastering long-running processes in modern architectures
QCon London: Mastering long-running processes in modern architectures
 
MuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotes
MuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotesMuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotes
MuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotes
 
A Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersA Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software Developers
 
Potential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and InsightsPotential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and Insights
 
Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
 
Connecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdfConnecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdf
 
UiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to HeroUiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to Hero
 
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
 
Top 10 Hubspot Development Companies in 2024
Top 10 Hubspot Development Companies in 2024Top 10 Hubspot Development Companies in 2024
Top 10 Hubspot Development Companies in 2024
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
 
So einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdfSo einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdf
 
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyesHow to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
 
Design pattern talk by Kaya Weers - 2024 (v2)
Design pattern talk by Kaya Weers - 2024 (v2)Design pattern talk by Kaya Weers - 2024 (v2)
Design pattern talk by Kaya Weers - 2024 (v2)
 
Emixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native developmentEmixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native development
 
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
2024 April Patch Tuesday
2024 April Patch Tuesday2024 April Patch Tuesday
2024 April Patch Tuesday
 

I/O & virtualization performance with a search engine based on an xml database & lucene - By Ed Bueche

  • 1. A Study of I/O and Virtualization Performance with a Search Engine based on an XML database and Lucene Ed Buech é , EMC edward.bueche@emc.com, May 25, 2011
  • 2.
  • 3.
  • 4.
  • 5.
  • 6.
  • 7.
  • 8.
  • 9.
  • 10.
  • 11. Lessons Learned… Structured Query use-cases Unstructured Query use-cases Fit to use-case
  • 12. Indexes, DB, and IR Structured Query use-cases Unstructured Query use-cases Fit to use-case Scoring, Relevance, Entities Hierarchical data representations (XML) Full Text searches Constantly changing schemas Relational DB technology
  • 13. Indexes, DB, and IR Structured Query use-cases Unstructured Query use-cases Fit to use-case Meta data query Transactions Advanced data management (partitions) JOINs Full Text index technology
  • 14. Indexes, DB, and IR Structured Query use-cases Unstructured Query use-cases Fit to use-case Relational DB technology Full Text index technology
  • 15.
  • 16.
  • 17. Libraries / Collections & Indexes = xDB segment = xDB Library / xPlore collection = xDB Index = xDB xml file ( dftxml , tracking xml, status, metrics, audit)
  • 18.
  • 19.
  • 20.
  • 21.
  • 22.
  • 23.
  • 24.
  • 25.
  • 26.
  • 27.
  • 28. Sample %Ready for a production VM with xPlore deployment for an entire week “ official” area that Indicates pain In this case Avg resp time doubled and max resp time grew by 5x
  • 29. Actual Ready samples during several hour period
  • 30.
  • 31.
  • 32.
  • 33. Sample output from the Bonnie tool ¹ Bonnie is an open source disk I/O driver tool for Linux that can be useful for pretesting Linux disk environments prior to an xPlore/Lucene install. bonnie -s 1024 -y -u -o_direct -v 10 -p 10 This will increase the size of the file to 2 Gb. Examine the output. Focus on the random I/O area: ---Sequential Output (sync)----- ---Sequential Input-- --Rnd Seek- -CharUnlk- -DIOBlock- -DRewrite- -CharUnlk- -DIOBlock- --04k (10)- Machine MB K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU /sec %CPU Mach2 10*2024 73928 97 104142 5.3 26246 2.9 8872 22.5 43794 1.9 735.7 15.2 -s 1024 means that 2 GB files will be created -o_direct means that direct I/O (by-passing buffer cache) will be done -v 10 means that 10 different 2GB files will be created. -p 10 means that 10 different threads will query those files This output means that the random read test saw 735 random I/O’s per sec at 15% CPU busy
  • 34. Linux indicators compared to bonnie output See https://community.emc.com/docs/DOC-9179 for additional example Device: tps kB_read/s kB_wrtn/s kB_read kB_wrtn sde 206.10 2402.40 0.80 24024 8 09:29:17 DEV tps rd_sec/s wr_sec/s avgrq-sz avgqu-sz await svctm %util 09:29:27 dev8-65 209.24 4877.97 1.62 23.32 1.62 7.75 3.80 79.59 09:29:17 PM CPU %user %nice %system %iowait %steal %idle 09:29:27 PM all 41.37 0.00 5.56 29.86 0.00 23.21 09:29:27 PM 0 62.44 0.00 10.56 25.38 0.00 1.62 09:29:27 PM 1 30.90 0.00 4.26 35.56 0.00 29.28 09:29:27 PM 2 36.35 0.00 3.96 30.76 0.00 28.93 09:29:27 PM 3 35.77 0.00 3.46 27.64 0.00 33.13 I/O stat output: SAR –d output: SAR –u output: Notice that at 200+ I/Os per sec the underlying volume is 80% busy. Although there could be multiple causes, one could be that some other VM is consuming the remaining I/O capacity (735 – 209 = 500+). High I/O wait
  • 35.
  • 36.
  • 37. Some xPlore Structures for Search ¹ Dictionary of terms Posting list (doc-id’s for term) Stored fields (facets and node-ids) Security indexes (b-tree based) xDB XML store (contains text for summary) 1 st doc N-th doc Facet decompression map ¹ Frequency and position structures ignored for simplicity
  • 38. IO model for search in xPlore Dictionary Posting list (doc-id’s for term) Stored fields Xdb node-id plus facet / security info Security lookup (b-tree based) xDB XML store (contains text for summary) Facet decompression map Search Term: ‘ term1 term2 ’ Result set
  • 39. Separation of “covering values” in stored fields and summary Facet Calc FinalFacet calc values over thousands of results Res-1 - sum Res-2 - sum Res-3 - sum : : Res-350-sum Xdb docs with text for summary Small number for result window Small structure Potentially thousands of results Stored fields (Random access) Potentially thousands of hits Security lookup
  • 40. xPlore Memory Pool areas at-a-glance xPlore Instance (fixed size) memory xDB Buffer Cache Lucene Caches & working memory xPlore caches Other vm working mem Operating System File Buffer cache ( dynamically sized ) Native code content extraction & linguistic processing memory
  • 41. Lucene data resides primarily in OS buffer cache Potential for many things to sweep lucene from that cache
  • 42.
  • 43.
  • 44.
  • 45.

Hinweis der Redaktion

  1. Prior to VM’s the CPU/memory resources of a server were dedicated to that application resource planners didn’t worry too much about inter-system contention Resource planning more simple , but expensive in unused capacity VMware achieves significant cost reduction by allowing applications to share unused capacity On average (over the day) the Lucene CPU consumption might be low However, the challenge is that if the capacity is not available when Lucene needs it, then response delays will result