1. Unifying Search Engine and NoSQL DBMS with a Universal Index Chris Biow MarkLogic Federal CTO
2. Agenda MarkLogic: company MarkLogic Server: DBMS product Search: indexing text Query: indexing XML structure and semantics Scalars: indexing ranges, analyzing aggregations Geo: indexing spatial locations Alerting: indexing queries, performing reverse query Strange Loop: composing forward and reverse query MVCC: temporizing the DBMS No-compromise lock-free queries LSM (Log Structured Merge) Trees No-compromise ingest, query, search Examples: BusinessWeek.com, WarriorGateway.org, USAF KM
4. >200 customers, >170 employees HQ: San Carlos, CA Lead investor: Sequoia Capital 2008: Top 5 fastest growing technology companies in Silicon Valley (Deloitte) 2009, 2010: Best DBMS (SIIA CODiE). Previously best Search, CMS. About MarkLogic MarkLogic Corporation makes a purpose-built database for unstructured information
6. What is MarkLogic Server? A hybrid (integrated parts) Special purpose DBMS for XML, with enterprise expectations ACID transactions DBA, backup, replication Search engine kernel, with enterprise expectations Full text Faceted navigation, at massive scale Boolean, proximity, stemming, tokenization, decompounding, case, diacritics, … Application Server HTTP XCC Java/.NET WebDAV
7. MarkLogic as Special DBMS Not relational (RDBMS) XML The only data model required Schema agnostic Text a first-class citizen among data types XQuery (SQL) Search engine algorithms for many DB queries O(1) initial lookup in number of docs O(log(n)) in range indexing Very low DBA overhead (0.5 FTE / 100 hosts) 5-minute install 5-minute scale-out Database and search engine are the same
8. MarkLogic as NoSQL DBMS SQL XQuery ! Extensions: cts:search() / xdmp:document-insert() NoSQL Categories [per AKF Partners] Key->Value store URI -> document (XML, JSON, text, binary) Extensible Record store Extensible Markup Language Document store XML documents, natch Differentiators ACID transactions in LAN cluster Ad hoc XQuery XML declares what is to be indexed, independently for each document DBMS and Search Engine are the same
9. MarkLogic as Special Search Engine Understands document structure Transactional: high CRUD load Unicode Holds the documents Update / reindexing Delivery Geospatial: Box, Point/radius, polygon Alerting: Profiles, alerts, filters, tipping, selectors, “triggers,” … Analytics: Facets, Co-occurrence, word lexicons, … Everything composes (e.g. geo-alerting, geo-text-data, search-alerting) Processing near the data Relational joins and inferencing Database and search engine are the same [Figure: XML tree for a timeline element containing message elements, labeled by node kind: element nodes (message, status), attribute nodes (@id 3, @id 5), text nodes (“Oh boy…”, “Testing XProc”)]
10. MarkLogic as Special App Server Native HTTP(S) server RESTful XML by default Transform to HTML Transform to PDF, MS Office, etc. PKI with no dependencies Optionally with external Auth HTTP(S) client XCC Java / .NET server Similar to JDBC / ADO.NET WebDAV Folder on the user’s desktop RESTful Architecture user/ Representations + get .json.xqy .xml.xqy + … user.xqy + Get + Put + Post + Delete Resources notes/ URL Rewriter note.xqy Routes.xqy
11. MarkLogic at Scale Scale up: typically 1-2TB+ XML per server Scale out: low hundreds(++) of servers in a cluster Commodity hardware Typically ~$15K HW budget per server 2-CPU x 6-core/hyperthreaded 32+ GB RAM 3x disk: local mount with failover OS Linux RHEL 5 Solaris 10 Windows 2003/8 (XP/Vista/7 for dev)
12. Collapsing the Stack The extended stack Data (CSV: data) DBMS (SQL: result sets) Search Engine (Search languages: result page) Web service (Java: XML, JSON) Application server (Ruby: HTML) MarkLogic Data (XML: xml) DBMS, Search, Service, App (XQuery, XQuery, XQuery, XQuery)
13. Data Model A database for unstructured (and semi-structured) information XML Data Model [Figure: two example XML trees. Node labels for the Document tree: Title, Author, First, Last, Metadata, Section. Node labels for the fpML Trade tree: Product, ID, Cashflow, TradeLeg, Event]
14. Example Document [Figure: page layout of a document with regions labeled Title, Author, Abstract, Section (several, one continued), Metadata, Footer]
15. Serialized as XML
<article>
  <title>A Relational Model of Data for Large Shared Data Banks</title>
  <author><first-name>Edgar</first-name><last-name>Codd</last-name></author>
  <abstract>
    Future users of data banks must be protected from having to know how the data is organized in the machine (the internal representation). . . . Changes in data representation will often be needed . . .
  </abstract>
  <body>
    <section>
      <section> … has values which uniquely identify each element ... </section>
    </section>
    <section>… version of <product>IMS</product> provides the user . . . </section>
  </body>
  <metadata><vol>13</vol><number>6</number><year>1970</year></metadata>
</article>
16. Target Query Classes Target to optimize performance for these kinds of queries Full-Text Search Find all documents that contain the phrase “uniquely identify”. XML Structure Find all articles that have an abstract. XML Semantics Find all documents that mention the product “IMS”. Aggregate Queries How many articles that contain “data base” were written in each of the last 5 decades. All of the above . . . Count all articles that contain “data” in the title and mention the product “IMS” in a section, grouping by year. at the same time
18. 1) Full-text Search Find all documents that contain the phrase “uniquely identify”. [The slide repeats the example article from slide 15.]
21. 2) XML Structure Find all articles that have an abstract. [The slide repeats the example article from slide 15.]
23. 3) XML Semantics Find all documents that mention the product “IMS”. [The slide repeats the example article from slide 15.]
26. 4) Aggregate Queries How many of the articles that contain “data base” were written in each of the last 5 decades? [The slide repeats the example article from slide 15.]
27. 4) Aggregates How many of the articles that contain “data base” were written in each of the last 5 decades? UNIVERSAL INDEX (term → document references): “which” → 123, 127, 129, 152, 344, 791 . . .; “uniquely” → 122, 125, 126, 129, 130, 167 . . .; “identify” → 123, 126, 130, 142, 143, 167 . . .; “each” → 123, 130, 131, 135, 162, 177 . . .; “data base” → 126, 130, 167, 212, 219, 377 . . .; <article> → . . . 126, 130, 167, …; <article>/<abstract> → . . .; <product>IMS</product>. RANGE INDEXES: YEAR, Volume
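The aggregate on this slide can be sketched as two steps: intersect term lists from the universal index, then group the surviving document ids by a range-index value. The Python below is only an illustrative model, not MarkLogic's implementation. The “data base” term list is copied from the slide; the `element:article` list, the `year_index` values, and every function name are made up for the example.

```python
from collections import Counter

# Hypothetical stand-in for the universal index: term -> sorted document ids.
term_lists = {
    'word:"data base"': [126, 130, 167, 212, 219, 377],  # ids from the slide
    "element:article": [123, 126, 130, 167, 212, 377],   # invented for the example
}

# Hypothetical range index on <year>: document id -> typed value (invented values).
year_index = {123: 1970, 126: 1970, 130: 1985, 167: 1992,
              212: 2004, 219: 2004, 377: 2009}


def intersect(a, b):
    """Linear merge of two sorted id lists, keeping ids present in both."""
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out


def articles_per_decade(phrase_term, structure_term):
    """Count matching documents per decade using only index data."""
    docs = intersect(term_lists[phrase_term], term_lists[structure_term])
    return Counter(year_index[d] // 10 * 10 for d in docs)


counts = articles_per_decade('word:"data base"', "element:article")
```

Note that the documents themselves are never opened: both the match and the grouping are resolved from index structures.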
28. 5) All Of The Above Count all articles that contain “data” in the title and mention the product IMS in a section, grouping by year. [The slide repeats the example article from slide 15.]
30. Additional Uses For Universal Index Directories Exclusive, hierarchical, analogous to file system URI: /some/directory/hierarchy/me.xml Collections Set-based, N:M relationship Document URI : Collection URI Security Invisible to your app Document: Role, action
37. Spatial Indexing Points ordered in latitude major order; special scan operators apply geospatial query constraints. [Figure: GEOSPATIAL INDEX, a list of latitude,longitude points (0,-124; -10,10.5; 0,0; 0,-167; 0,-29; 0,-12; 10.1,35.553; …) alongside document references (113, 123, 126, 127, 130, 167, …)]
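A rough way to picture a latitude-major index is a list of points sorted by latitude first: a binary search finds the latitude band, and a scan over that band applies the longitude constraint. The sketch below assumes that reading of the slide. The pairing of points with document ids is invented, and `box_query` is a hypothetical name, not a MarkLogic function. It also ignores boxes that cross the antimeridian.

```python
import bisect

# Hypothetical index entries: (latitude, longitude, document id).
# Sorting tuples makes latitude the major sort key.
geo_index = sorted([
    (-10.0, 10.5, 123),
    (0.0, -167.0, 126),
    (0.0, -124.0, 130),
    (0.0, -12.0, 113),
    (0.0, 0.0, 167),
    (10.1, 35.553, 113),
])


def box_query(south, north, west, east):
    """Return sorted ids of documents with a point inside the lat/lon box."""
    # O(log n): locate the slice of entries whose latitude is within bounds.
    lo = bisect.bisect_left(geo_index, (south, float("-inf"), -1))
    hi = bisect.bisect_right(geo_index, (north, float("inf"), float("inf")))
    # Scan only that slice, filtering on longitude.
    return sorted({doc for _lat, lon, doc in geo_index[lo:hi] if west <= lon <= east})
```

For example, `box_query(-1, 1, -130, 1)` touches only the four entries at latitude 0 and then drops the one whose longitude falls outside the box.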
38. Spatial Query Data examples Latitude / Longitude Any other pair (e.g. volume / price) Query types Point (exact value) Point-Radius (circle) Lat/Lon bound (Mercator “rectangle”) Polygon (10K+ vertices) Composition with… Full Text XML structure XML semantics Other range indexes (e.g. temporal)
40. Query Registration Canonicalize a cts:query() Hash for an ID Resolve the query Cache the term list in memory Reuse as materialized sub-query (AKA Topics, Concepts, Macros, etc.) (This is not alerting)
42. Query Indexing “Alerting” Real-time search, selectors, tippers, standing queries, filters, “triggers*”, content-based routing, stream DBMS, etc. Search(query, Index[docs]) -> docs Alert(doc, Index[queries]) -> queries Queries are XML documents XML serialization of cts:query() First step is O(1) in number of queries O(n) in results returned cts:reverse-query() * MarkLogic also has pre- and post-commit DBMS triggers, which are unrelated Alert New patient matching your study profile
43. The Reverse Index [Figure: REVERSE INDEX mapping queries, stored as unified expression trees, to query-document references (437, 562, 597, 623, . . .). Example queries: year >= 1970 and (“data” and “size”); year > 2003 and (“data” and “web”); year < 2000 and (“data” and “web”); (2000 <= year <= 2010) and “web”. Shared leaf terms: “data”, “size”, “web”, year >= 1970, year < 2000, year >= 2000, year > 2003, year <= 2010]
44. Alerting in Composition Scalar Query on scalar data with range queries Alert on range data with scalar reverse-queries Geospatial Query on point data with box, circle, polygon query constraint Compose with text, structure, XML-semantic query Alert on box, circle, polygon data with point reverse-query Compose with text, structure, XML-semantic data Search [Forward-]Query composes with Boolean operations (AND, OR, NOT, (()()(()())) Reverse- and forward-query compose (AND, OR, NOT, (()()(()())) Why would you ever want to do that?
46. Search Composed with Alerting In Soviet Russia, the document searches YOU! If you express yourself as XML Documents are XML Elements Text Typed data Serialized query against documents [cts:query()] Attributes Typed data Composition Boolean (AND) incorporating [forward-]query and reverse-query
47. Matchmaking Constraints upon each other Matching pairs or one-to-many Examples Suitable date: man / woman mixed pool Employment: job / resume Medication: patient / drug Search security: document / user Battle: target / shooter Carpool ride: driver / rider
48. Carpool Driver: non-smoking woman driving from San Ramon to San Carlos, leaving at 8AM; listens to rock, pop, hip-hop; wants $10 for gas; requires female passenger within five miles of start and end. Passenger: woman; will pay up to $20; from 3001 Summit View Dr, San Ramon, CA 94582; to 400 Concourse Drive, Belmont, CA 94002; requires non-smoking car; won’t listen to country music.
49. Driver
let $from := cts:point(37.751658, -121.898387)  (: San Ramon :)
let $to := cts:point(37.507363, -122.247119)    (: San Carlos :)
return xdmp:document-insert(
  "/driver.xml",
  <driver>
    <from>{$from}</from>
    <to>{$to}</to>
    <when>2010-01-20T08:00:00-08:00</when>
    <gender>female</gender>
    <smoke>no</smoke>
    <music>rock, pop, hip-hop</music>
    <cost>10</cost>
    <preferences>
      {cts:and-query((
        cts:element-value-query(xs:QName("gender"), "female"),
        cts:element-geospatial-query(xs:QName("from"), cts:circle(5, $from)),
        cts:element-geospatial-query(xs:QName("to"), cts:circle(5, $to))
      ))
      }
    </preferences>
  </driver>)
63. Driver
(: I'm the driver, find me passengers :)
let $me := fn:doc("/driver.xml")/driver
for $match in cts:search(/passenger,
  cts:and-query((
    cts:query($me/preferences/element()),
    cts:reverse-query($me)
  ))
)[1 to 3]
return fn:base-uri($match)
67. Passenger
(: I'm a passenger, find me a driver :)
let $me := fn:doc("/passenger.xml")/passenger
for $match in cts:search(/driver,
  cts:and-query((
    cts:query($me/preferences/element()),
    cts:reverse-query($me)
  ))
)
return fn:base-uri($match)
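The logic of composing a forward query with a reverse query is a two-way check: my stored preferences must accept the other party's data, and their stored preferences must accept mine. The Python below models only that logic with a linear scan, whereas the slides resolve both halves from indexes. The `Profile` class, the use of plain predicates in place of serialized queries, the second and third passengers, and the omission of the geospatial constraints are all assumptions of this sketch.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Profile:
    uri: str
    data: dict
    preferences: Callable[[dict], bool]  # stands in for the serialized query


driver = Profile(
    "/driver.xml",
    {"gender": "female", "smoke": "no", "music": {"rock", "pop", "hip-hop"}, "cost": 10},
    lambda other: other["gender"] == "female",
)

passengers = [
    # Modeled on the carpool slide: pays up to $20, non-smoking car, no country music.
    Profile("/passenger.xml", {"gender": "female"},
            lambda car: car["smoke"] == "no" and car["cost"] <= 20
            and "country" not in car["music"]),
    # Invented: will pay at most $5, so her preferences reject this driver.
    Profile("/passenger-2.xml", {"gender": "female"}, lambda car: car["cost"] <= 5),
    # Invented: accepts any car, but the driver's preferences reject him.
    Profile("/passenger-3.xml", {"gender": "male"}, lambda car: True),
]


def matches(me, candidates):
    """Return URIs of candidates where both sides' constraints hold."""
    return [
        c.uri for c in candidates
        if me.preferences(c.data)     # forward query: my preferences vs. their data
        and c.preferences(me.data)    # reverse query: their preferences vs. my data
    ]
```

The same `matches` function serves both directions: pass the driver and a pool of passengers, or a passenger and a pool of drivers.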
69. Search and Alerting Composed XML data typing expectations In both directions Arbitrary schema At each document In each query Don’t [have to] declare anything You choose the logic for empty data Specify typed range indexes for scalar, geo, faceting Search engine speed and scalability O(1) term lookup Word, structure, values Query sub-expression O(log(n)) range lookup and term list intersection Shared-nothing (sharded) query evaluation
70. Document Security Document queries the user Rules for who can see me, the document Open, ad-hoc security [lack of] model Each document declares any rules it wants “user eye color not blue” Descriptive model/schema if desired Extensible without changing DBMS schema
71. Medication Patient: Diagnosis, Background, Idiopathic history, Vital Statistics, Treatment baseline. Drug: Therapeutics, Side effects, Interactions, Contraindications.
72. What is a strange loop? Not mere hierarchies of abstraction Abstraction is simplification and distortion Useful if bounded and well-ordered Problems if poorly bounded: Watch Ourselves Strange loops are disorderings of the hierarchy: heterarchy Hofstadter: “a paradoxical level-crossing feedback loop” Good strange loops gracefully accommodate the disordering Established examples in Computer Science Gödel’s Incompleteness Theorems Self-compiling languages
73. Strange Loop: Fwd∘Rev Query Queries are an abstraction over the data /myroot//foo[@bar=7]/bash Reverse-query Indexed data are [serialized] queries. Documents are the queries. Queries can still be abstractions over a stream of docs. Composing forward and reverse query Strange Loop! Escher Drawing Hands
74. Strange Loop: XQuery throughout the Stack Query language is an abstraction over the data But the query language is re-used in the application The [No]SQL is the PL Declarative query becomes functional programming Creeping Lazy evaluation Parallelization Discarding unneeded work Schrödinger’s tuple Not religious side effects available when required DBMS transactions xdmp:set() Escher Waterfall
75. Strange Loop: XQuery on XQuery In SVN Project organization No search Dependency tracking Import requires absolute paths Namespace prefix conflicts Surprise modules (functx) XQuery (with cts: extensions) to discover and automate imports
76. The Composable, Universal Index Full text XML structure XML semantics Range indexes Range queries Aggregations Co-occurrence Spatial indexes Query indexes
77. Database of documents Stored in partitions [Figure: a database divided into Partition1, Partition2, Partition3]
82. Multiversion Concurrency Control [Figure: two versions of /articles/codd.xml, each drawn as a Document tree (Title, Author with First and Last, Metadata with Year, Sections) and annotated with a creation timestamp (c) and a deleted timestamp (d); the timestamp values shown are 523, 628, and ∞]
83. Multiversion Concurrency Benefits High Throughput Queries don’t require locks Queries and Updates do not conflict ACID Cluster consistency: 2-phase commit Zero-latency ingestion and Indexing Append Only Ingest/update rates of ~400GB per partition per day [Figure: the current version of /articles/codd.xml, with timestamps 628 and ∞]
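Timestamp-based versioning is what lets queries run without locks: a reader picks a timestamp and sees exactly the versions whose creation timestamp is at or before it and whose deleted timestamp is after it. The following is a minimal single-process sketch of that visibility rule, not MarkLogic's implementation; the class and method names are hypothetical.

```python
import math


class VersionedStore:
    """Toy multiversion store: updates append a new version, never overwrite content."""

    def __init__(self):
        self.clock = 0
        self.versions = []  # each entry: uri, content, created, deleted

    def write(self, uri, content):
        """Create a new version at the next timestamp and retire the live one."""
        self.clock += 1
        for v in self.versions:
            if v["uri"] == uri and v["deleted"] == math.inf:
                v["deleted"] = self.clock  # old version stays readable for older timestamps
        self.versions.append(
            {"uri": uri, "content": content, "created": self.clock, "deleted": math.inf}
        )
        return self.clock

    def read(self, uri, at):
        """Return the version visible at timestamp `at`: created <= at < deleted."""
        for v in self.versions:
            if v["uri"] == uri and v["created"] <= at < v["deleted"]:
                return v["content"]
        return None


store = VersionedStore()
t1 = store.write("/articles/codd.xml", "v1")
t2 = store.write("/articles/codd.xml", "v2")
```

A query that started at `t1` keeps reading `"v1"` even after the update at `t2` commits, which is why readers and writers do not block each other in this model.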
85. A Single Forest [Figure: a host holding forest k, drawn as buffers plus stands 1 through n]
86. 1. Create A New Tree [Figure: the same host and forest k, with buffers and stands 1 through n]
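The forest diagram matches the general shape of a log-structured merge tree: writes land in a buffer, a full buffer is written out as an immutable sorted run, and runs are merged in the background. The sketch below is a generic LSM model that borrows the slide's words "forest", "stand", and "buffer". How MarkLogic actually sizes, flushes, and merges stands is not stated in the slides, so every behavior here is an assumption.

```python
class Forest:
    """Toy LSM structure: one mutable buffer plus immutable sorted stands."""

    def __init__(self, buffer_limit=2):
        self.buffer = {}
        self.stands = []  # immutable sorted runs of (key, value), oldest first
        self.buffer_limit = buffer_limit

    def insert(self, key, value):
        """Writes only touch the buffer; a full buffer becomes a new stand."""
        self.buffer[key] = value
        if len(self.buffer) >= self.buffer_limit:
            self.flush()

    def flush(self):
        if self.buffer:
            self.stands.append(tuple(sorted(self.buffer.items())))
            self.buffer = {}

    def get(self, key):
        """Check the buffer first, then stands from newest to oldest."""
        if key in self.buffer:
            return self.buffer[key]
        for stand in reversed(self.stands):
            for k, v in stand:
                if k == key:
                    return v
        return None

    def merge(self):
        """Combine all stands into one; newer values win over older ones."""
        merged = {}
        for stand in self.stands:  # oldest first, so later updates overwrite
            merged.update(stand)
        self.stands = [tuple(sorted(merged.items()))] if merged else []


forest = Forest(buffer_limit=2)
for key, value in [("a", 1), ("b", 2), ("a", 3), ("c", 4)]:
    forest.insert(key, value)
```

Since stands are never modified in place, ingest never has to rewrite existing data; merging trades some background work for fewer stands to consult on each read.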
93. How Did We Get Here? Founder: Christopher Lindblad MIT Architect of Ultraseek Server Intranet search engine product Met people who wanted to use a search engine like a database Rich query language Guaranteed correctness Transactions
94. So We Built XML as data model Ad hoc schema A search engine core Universal Index Transaction model based on multiversion concurrency High throughput while keeping . . . Performance and scalability of a search engine
96. Who Uses MarkLogic? Magazine Publishing Education Healthcare Software / Services Legal Tax Financial Enterprise Aggregation Scientific Technical Medical
The geospatial index is ordered in latitude-major order: an O(log(n)) lookup finds the latitude bounds, then a scan applies the longitude bounds.
Generate all the normal indexing terms for the reverse-query document, then do a linear merge to match query-document terms with the root nodes of the unified expression tree. Based on which terms do or don't match, nominate documents that may contain matching queries. For each nominated query-document, evaluate from the root of the query tree on the right side towards the leaf nodes at the left of the slide. Once a subtree has been evaluated for one query-document, we remember the result and short-circuit that evaluation for any other query-documents that share the subquery.
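The evaluation step described above, where a subtree's result is remembered and reused across query-documents, can be sketched with a memo table keyed by sub-expression. This Python model skips the term-merge nomination step and simply evaluates every stored query. The assignment of ids 437, 562, and 597 to particular queries is an assumption, since the slide's figure does not make the mapping recoverable, and the tuple encoding of queries is invented for the example.

```python
import operator

OPS = {">=": operator.ge, ">": operator.gt, "<": operator.lt, "<=": operator.le}

# Stored queries as nested tuples (hashable, so shared subtrees compare equal).
DATA_AND_WEB = ("and", ("word", "data"), ("word", "web"))
stored_queries = {
    437: ("and", ("year", ">=", 1970), ("and", ("word", "data"), ("word", "size"))),
    562: ("and", ("year", ">", 2003), DATA_AND_WEB),
    597: ("and", ("year", "<", 2000), DATA_AND_WEB),
}


def reverse_query(doc_words, doc_year, queries):
    """Return ids of stored queries that match the incoming document."""
    memo = {}  # sub-expression -> result, shared across all stored queries

    def evaluate(node):
        if node not in memo:
            kind = node[0]
            if kind == "word":
                memo[node] = node[1] in doc_words
            elif kind == "year":
                memo[node] = OPS[node[1]](doc_year, node[2])
            elif kind == "and":
                memo[node] = all(evaluate(child) for child in node[1:])
            else:  # "or"
                memo[node] = any(evaluate(child) for child in node[1:])
        return memo[node]

    return sorted(qid for qid, query in queries.items() if evaluate(query))
```

In this example the `("data" and "web")` subtree appears in two stored queries but is evaluated once per document; the second query reads its result from the memo table.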