I/O & virtualization performance with a search engine based on an XML database & Lucene - By Ed Bueche
1. A Study of I/O and Virtualization Performance with a Search Engine based on an XML database and Lucene. Ed Bueché, EMC, edward.bueche@emc.com, May 25, 2011
12. Indexes, DB, and IR: fit to use-case, plotted from structured query use-cases to unstructured query use-cases. Relational DB technology fits structured queries well, but its fit falls off for scoring, relevance, and entities; hierarchical data representations (XML); full text searches; and constantly changing schemas.
13. Indexes, DB, and IR: on the same axes, full text index technology fits unstructured queries well, but its fit falls off for metadata query, transactions, advanced data management (partitions), and JOINs.
14. Indexes, DB, and IR: relational DB technology and full text index technology overlaid on the same fit-to-use-case axes; each is strong where the other is weak.
28. Sample %Ready for a production VM with an xPlore deployment over an entire week. The chart marks the "official" %Ready range that indicates pain; in this case, average response time doubled and maximum response time grew by 5x.
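One way to capture %Ready over a long window like this is esxtop on the ESX host; the sampling interval and iteration count below are illustrative assumptions, not values from the deck:

esxtop -b -d 10 -n 360 > esxtop-stats.csv   # batch mode: 10-second samples, 360 iterations, CSV output
# Interactively: run esxtop, press 'c' for the CPU view, and watch the %RDY column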
33. Sample output from the Bonnie tool¹
¹ Bonnie is an open-source disk I/O benchmark tool for Linux that can be useful for pretesting Linux disk environments prior to an xPlore/Lucene install.

bonnie -s 1024 -y -u -o_direct -v 10 -p 10

This will increase the size of each file to 2 GB. Examine the output, focusing on the random I/O area:

              ---Sequential Output (sync)----- ---Sequential Input-- --Rnd Seek-
              -CharUnlk-  -DIOBlock-  -DRewrite-  -CharUnlk-  -DIOBlock-  --04k (10)-
Machine    MB K/sec %CPU  K/sec %CPU  K/sec %CPU  K/sec %CPU  K/sec %CPU   /sec %CPU
Mach2 10*2024 73928   97 104142  5.3  26246  2.9   8872 22.5  43794  1.9  735.7 15.2

-s 1024 means that 2 GB files will be created
-o_direct means that direct I/O (bypassing the buffer cache) will be done
-v 10 means that 10 different 2 GB files will be created
-p 10 means that 10 different threads will query those files

This output means that the random-read test saw 735 random I/Os per second at 15% CPU busy.
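As a cross-check not shown in the deck, a comparable random-read test can be run with the open-source fio tool; the job parameters below are illustrative assumptions chosen to approximate the Bonnie run above:

# Hypothetical fio job: 10 threads of 4 KB random reads with direct I/O against 2 GB files
fio --name=randread --rw=randread --bs=4k --size=2g \
    --numjobs=10 --direct=1 --runtime=60 --time_based \
    --group_reporting

If the volume is otherwise idle, the reported read IOPS should land in the same neighborhood as Bonnie's random-seek figure (735.7/sec here).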
34. Linux indicators compared to Bonnie output. See https://community.emc.com/docs/DOC-9179 for an additional example.

iostat output:
Device:   tps     kB_read/s  kB_wrtn/s  kB_read  kB_wrtn
sde       206.10  2402.40    0.80       24024    8

sar -d output:
09:29:17  DEV      tps     rd_sec/s  wr_sec/s  avgrq-sz  avgqu-sz  await  svctm  %util
09:29:27  dev8-65  209.24  4877.97   1.62      23.32     1.62      7.75   3.80   79.59

sar -u output:
09:29:17 PM  CPU  %user  %nice  %system  %iowait  %steal  %idle
09:29:27 PM  all  41.37  0.00   5.56     29.86    0.00    23.21
09:29:27 PM  0    62.44  0.00   10.56    25.38    0.00    1.62
09:29:27 PM  1    30.90  0.00   4.26     35.56    0.00    29.28
09:29:27 PM  2    36.35  0.00   3.96     30.76    0.00    28.93
09:29:27 PM  3    35.77  0.00   3.46     27.64    0.00    33.13

Notice the high I/O wait, and that at 200+ I/Os per second the underlying volume is already 80% busy. Although there could be multiple causes, one could be that some other VM is consuming the remaining I/O capacity (735 - 209 = 500+ I/Os per second).
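The invocations below are one way to collect the three views above (the intervals and counts are illustrative; the deck does not state the exact commands used):

iostat -k 10          # per-device tps and throughput in kB/s, 10-second samples
sar -d -p 10 6        # per-device tps, queue depth, await, and %util (-p prints names like sde)
sar -u -P ALL 10 6    # CPU utilization including %iowait, broken out per CPU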
37. Some xPlore structures for search¹:
- Dictionary of terms
- Posting list (doc-ids for each term, from the 1st doc through the N-th doc)
- Stored fields (facets and node-ids)
- Facet decompression map
- Security indexes (b-tree based)
- xDB XML store (contains the text for the summary)
¹ Frequency and position structures are ignored for simplicity.
38. I/O model for search in xPlore. A search on the terms "term1 term2" flows through the dictionary, then the posting list (doc-ids for each term), then the stored fields (xDB node-id plus facet/security info, decoded via the facet decompression map), then the security lookup (b-tree based), and finally the xDB XML store (which contains the text for the summary) to build the result set.
39. Separation of "covering values" in stored fields and summary. The security lookup and the stored fields (random access) may touch potentially thousands of hits, and the facet calculation produces final facet values over those thousands of results (Res-1, Res-2, Res-3, ..., Res-350, ...), but each of these is a small structure. Only the small number of results in the visible result window requires fetching xDB docs with the text for the summary.
40. xPlore memory pool areas at a glance:
- xPlore instance memory (fixed size): xDB buffer cache, Lucene caches & working memory, xPlore caches, other VM working memory
- Native-code content extraction & linguistic processing memory
- Operating system file buffer cache (dynamically sized)
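A quick way to see how a host's memory splits between the dynamically sized OS file buffer cache and fixed-size process memory (illustrative commands, not from the deck; the pid is a placeholder):

free -m                              # 'buffers' and 'cached' columns show the OS file cache in MB
ps -o rss,vsz,cmd -p <xplore_pid>    # resident and virtual size of the xPlore instance process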
41. Lucene data resides primarily in the OS buffer cache, so there is potential for many other activities to sweep Lucene data out of that cache.
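One way to verify whether Lucene index files are actually resident in the OS buffer cache is the open-source vmtouch utility (a sketch; the index path is a hypothetical placeholder):

vmtouch -v /path/to/xplore/lucene/index   # report how many pages of each file are resident in the page cache
# vmtouch -t <dir> touches the files, pre-warming them back into the cache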
Editor's notes
Prior to VMs, the CPU and memory resources of a server were dedicated to a single application, so resource planners did not worry much about inter-system contention. Resource planning was simpler, but expensive in unused capacity. VMware achieves significant cost reduction by allowing applications to share that unused capacity. On average (over the day) the Lucene CPU consumption might be low; the challenge, however, is that if the capacity is not available when Lucene needs it, response delays will result.