Datalevin
A simple, fast and free Datalog database for everyone
Huahai Yang, Ph.D.
Juji, Inc.
September 22, 2020
Background & Motivation
• Juji is a conversational AI company
• Conversational data query (NLDB)
• Upload a CSV file, then query it
• Natural language => database query
• Context sensitive
• My previous research in NLDB convinced me:
• NLDB is more of a DB problem than an NL problem
• Data themselves provide the best context
• Better DB is the key
Database Design Goals
• Datalog is the best target query language for NLDB
• Declarative
• Composable
• Amenable to code generation
• In-process embedded use
• Bulk writes, frequent reads
• Multiple DB paradigms
• Transparent data replication
Datalevin Design Principle - Simplicity
• Simple to use
• Just a library, add to deps, and start coding
• Simply require a different namespace to get a different DB paradigm
• Current: Key-value, Datalog
• Future: Graph, Document
• Simple to operate
• No need for complex ops: setup, backup and recovery should be dead simple
• No DB maintenance threads or processes
• No need for performance tuning
• Simple to scale
• Just provision more physical resources
Why fork Datascript?
• Datascript is a great baseline Datalog implementation
• Comprehensive test coverage
• Well maintained code base
• Similar API to Datomic
• Lots of users
• We have very different goals from the alternatives
• No interest in building a Datomic clone
• Focus on query performance
• We have plans to go far beyond NLDB
• Juji Slogan for AI: “Symbolic as the bones, machine learning as the flesh”
• A high-performance graph database is the basis of the symbolic AI of the future
Roles of Database
• Operational
• Database as a surrogate for the external world
• ACID is derived from this use: to maintain the illusion of the external world
• Primary, necessary for most use cases
• Focus on present, OLTP
• Archival
• Database as a recording of events and facts
• Don’t need ACID, eventual consistency is fine
• Secondary, necessary for many use cases, but not all
• Focus on provenance and history, OLAP
Merging operational and archival DB is hard
• More stringent performance requirements
• History has more data than present
• More complex APIs
• Need to deal with history
• Need to distinguish history and present
• More complex user mental model
• More things to consider -> less simple
• Mind needs to forget to work properly
• Hyperthymesia is a painful condition
Operational DB should be stateful
• In people’s minds, the external world is stateful
• Wrong assumptions about the time model are one of the main sources of immutable DB programming errors
• “Why do I get the wrong data with this query?”
• “I have to sort by transaction id to get the latest version?”
• Datalevin is an operational database
• meant to be embedded in applications to manage state
Datalevin Architecture
• LMDB key-value store as the storage
• Optimized Clojure API for LMDB
• EAV index on top of key-value
• User-facing API on top
• Key-value
• EAV index access
• Datalog
[Figure: architecture diagram with layers labeled LMDB, Key Value Processing, EAV Index Processing, Key-value API, Index Access API, and Datalog API]
LMDB Features
• Lightning Memory-Mapped Database
• ACID key-value database
• DB is a memory-mapped file
• Uses the OS filesystem cache
• B+ tree, optimized for reads
• The fastest key-value store for reads
• Performs well when writing large values (>2KB)
• Works on bytes, supports range queries
• Supports multiple independent tables (DBI)
LMDB Design
• Read and write transactions
• Single writer
• Many concurrent readers (MVCC)
• No locks on read
• Scales linearly with reader threads
• Copy on write
• Similar to immutable data structure
• Reclaim obsolete pages
• Read/write do not block each other
Datalevin Optimizations
• Read transaction pool
• Avoid cost of allocating read transactions
• Pre-allocate off-heap buffers in JVM
• Write buffer (one per DBI)
• Read buffer
• Range query start and end buffers
• Auto-resize value buffers
• Re-allocate on overflows
• Auto-resize DB size
• LMDB requires the total DB size (map size) to be specified up front
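The "re-allocate on overflows" idea can be illustrated with a minimal Python sketch. This is not Datalevin's code: the class name `GrowableBuffer`, the starting capacity, and the doubling policy are all assumptions made for illustration, and a `bytearray` stands in for a pre-allocated off-heap buffer.

```python
class GrowableBuffer:
    """Hypothetical reusable buffer that is re-allocated only when a value overflows it."""

    def __init__(self, capacity: int = 16) -> None:
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        self.buf = bytearray(capacity)   # stands in for a pre-allocated off-heap buffer
        self.length = 0                  # number of valid bytes currently held

    def put(self, value: bytes) -> None:
        if len(value) > len(self.buf):
            # Overflow: double the capacity until the value fits, then re-allocate once.
            capacity = len(self.buf)
            while capacity < len(value):
                capacity *= 2
            self.buf = bytearray(capacity)
        # Same-length slice assignment overwrites in place without resizing.
        self.buf[:len(value)] = value
        self.length = len(value)

    def get(self) -> bytes:
        return bytes(self.buf[:self.length])
```

Small values keep reusing the same allocation; only a value larger than the current capacity pays for a new one, and the larger buffer is then kept for later writes.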
Datalevin Key-Value API
• Open/close LMDB
• Open/clear/drop DBI
• Transact key-values as a batch
• :put, :del
• Fetch single value
• get-value, get-first
• Range query
• get-range
• Predicate filtering
• get-some, range-filter
• Counts
• entries, range-count, range-filter-count
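To make the shape of this API concrete, here is a small in-memory Python stand-in for a single DBI. It is only a sketch: the method names echo the operations listed above, but the class `SortedKV`, the argument order, and the inclusive range bounds are assumptions, not Datalevin's actual signatures.

```python
import bisect


class SortedKV:
    """Hypothetical in-memory stand-in for one DBI: byte keys kept in sorted order."""

    def __init__(self):
        self._keys = []   # sorted list of keys, so range queries can use binary search
        self._vals = {}   # key -> value

    def transact(self, ops):
        """Apply a batch of ("put", key, value) and ("del", key) operations."""
        for op in ops:
            if op[0] == "put":
                _, k, v = op
                if k not in self._vals:
                    bisect.insort(self._keys, k)
                self._vals[k] = v
            elif op[0] == "del":
                k = op[1]
                if k in self._vals:
                    del self._vals[k]
                    self._keys.pop(bisect.bisect_left(self._keys, k))
            else:
                raise ValueError(f"unknown op: {op[0]!r}")

    def get_value(self, k):
        return self._vals.get(k)

    def get_range(self, start, end):
        """All (key, value) pairs with start <= key <= end, in key order."""
        lo = bisect.bisect_left(self._keys, start)
        hi = bisect.bisect_right(self._keys, end)
        return [(k, self._vals[k]) for k in self._keys[lo:hi]]

    def range_count(self, start, end):
        return bisect.bisect_right(self._keys, end) - bisect.bisect_left(self._keys, start)

    def range_filter(self, pred, start, end):
        """Range query restricted to pairs for which pred(key, value) is true."""
        return [(k, v) for k, v in self.get_range(start, end) if pred(k, v)]
```

Usage: `kv.transact([("put", b"a", b"1"), ("del", b"b")])` applies both operations as one batch, and `kv.get_range(b"a", b"c")` returns the pairs in key order.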
EAV Index Processing
• Entity-Attribute-Value data model
• Versatile
• relational model: entity = tuple, attribute = column, value = value
• graph model: entity = node, attribute = edge, value = node (ref)
• RDF triple: entity = subject, attribute = predicate, value = object
• The triple is called a “datom”
• Covering indices
• EAV: row oriented index, all datoms
• AEV: column oriented index, all datoms
• AVE: support attribute range query, all datoms
• VAE: graph reverse index, only for reference type datoms
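The four indices are the same datoms kept under different sort orders. The Python sketch below illustrates that idea with made-up data; the `Datom` tuple, the `REF_ATTRS` schema, and the helper functions are hypothetical, and the helpers scan linearly for brevity where a real index would seek to a key prefix.

```python
from typing import NamedTuple


class Datom(NamedTuple):
    e: int      # entity id
    a: str      # attribute
    v: object   # value (an entity id when the attribute is a reference)


REF_ATTRS = {"friend"}   # hypothetical schema: attributes whose values are entity ids

datoms = [
    Datom(1, "name", "Ann"), Datom(1, "age", 31), Datom(1, "friend", 2),
    Datom(2, "name", "Bob"), Datom(2, "age", 27),
]

# Each index holds the same facts, ordered by a different component sequence.
eav = sorted(datoms, key=lambda d: (d.e, d.a, d.v))   # row oriented
aev = sorted(datoms, key=lambda d: (d.a, d.e, d.v))   # column oriented
ave = sorted(datoms, key=lambda d: (d.a, d.v, d.e))   # attribute range queries
vae = sorted((d for d in datoms if d.a in REF_ATTRS),
             key=lambda d: (d.v, d.a, d.e))            # reverse references only


def entity(e):
    """Row lookup via EAV: every attribute of one entity."""
    return {d.a: d.v for d in eav if d.e == e}


def column(a):
    """Column lookup via AEV: one attribute across all entities."""
    return [(d.e, d.v) for d in aev if d.a == a]


def attr_range(a, lo, hi):
    """Range lookup via AVE: entities whose attribute value lies in [lo, hi]."""
    return [d.e for d in ave if d.a == a and lo <= d.v <= hi]


def referrers(v):
    """Reverse lookup via VAE: which entities point at entity v."""
    return [(d.e, d.a) for d in vae if d.v == v]
```

For example, `attr_range("age", 25, 30)` finds entity 2 through the AVE order, and `referrers(2)` walks the graph edge backwards through VAE.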
Index Storage
• In-memory indices as cache
• Inherits Datascript’s persistent sorted sets
• On-disk indices as permanent storage
• Datoms are binary encoded into key-values
• LMDB’s maximum key size is fixed at compile time, default: 511 bytes
• Each index is stored in its own DBI
• Key (up to 511 bytes)
• Small value: encoded datom
• Large value: encoded datom with (truncated value + hash) to support range query
• Value (8 bytes)
• Small: a sentinel long, indicating small value
• Large: a long reference to the key of the full datom in the “giant” DBI
Datom Index Disk Format
• Attribute id (aid): binary encoded 32 bit integer
• Entity id (eid): binary encoded 64 bit long
• Value:
• Data type header byte, using bytes that are disallowed in UTF-8
• Data types: int, long, id, boolean, float, double, byte, bytes, keyword, symbol, instant, uuid
• Potentially truncated prefix bytes of the value
• Each value data type is encoded differently to ensure: bitwise order = value order
• If truncated, a truncator byte
• If truncated, a 32 bit Clojure hash of the value
• A separator byte
[Figure: layout of the 511-byte key, with fields labeled aid, eid, header, value, truncator, hash, and separator]
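The requirement "bitwise order = value order" and the truncate-plus-hash scheme can be sketched in Python as follows. This is an illustration under stated assumptions, not Datalevin's actual format: the header and truncator byte values, the 16-byte value budget, the use of CRC32 in place of the Clojure hash, and the omission of the separator byte are all choices made here for brevity.

```python
import struct
import zlib

MAX_VALUE_BYTES = 16                # hypothetical budget, far below the 511-byte key limit
HDR_LONG, HDR_STRING = 0xC0, 0xC1   # hypothetical headers; 0xC0 and 0xC1 never occur in valid UTF-8
TRUNCATOR = 0xFF                    # hypothetical marker byte, also invalid in UTF-8


def encode_value(v) -> bytes:
    """Encode a value so that comparing the bytes matches comparing the values."""
    if isinstance(v, int):
        # Shift the signed 64-bit range onto the unsigned range, then write big-endian,
        # so negative numbers sort before positive ones byte by byte.
        return bytes([HDR_LONG]) + struct.pack(">Q", v + (1 << 63))
    if isinstance(v, str):
        raw = v.encode("utf-8")      # UTF-8 byte order matches code point order
        if len(raw) <= MAX_VALUE_BYTES:
            return bytes([HDR_STRING]) + raw
        # Too long: keep a prefix for range queries, then a truncator and a 32-bit hash.
        # CRC32 is a stand-in here; the slides specify a Clojure hash.
        digest = struct.pack(">I", zlib.crc32(raw))
        return bytes([HDR_STRING]) + raw[:MAX_VALUE_BYTES] + bytes([TRUNCATOR]) + digest
    raise TypeError(f"unsupported value type: {type(v).__name__}")


def encode_key(aid: int, eid: int, v) -> bytes:
    """Concatenate the fields in the order the slide lists them: aid, eid, value."""
    return struct.pack(">I", aid) + struct.pack(">Q", eid) + encode_value(v)
```

Sorting the encoded bytes gives the same order as sorting the original longs or strings, which is what lets a byte-oriented store answer range queries. Values that share a truncated prefix are ordered by hash rather than by their full content, so a range scan in this sketch is exact only up to the prefix.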
More Disk Storage Details
• Giants
• For large values, the full datoms are stored in a giant DBI
• Append-only, fast write
• Key: auto-incrementing long (gid)
• Value: serialized full datom
• Schema
• Stored in a schema DBI
• Key: attribute name
• Value: serialized Clojure map of attribute properties
• TODO: non-trivial schema migration
Datalog Query
• Retain most Datascript query logic
• Search on-disk indices instead of in-memory cache
• Leverage indices that Datascript does not enable: AVET and VAET
• Adopted a few performance optimization PRs that Datascript did not merge
• Cache the results of all on-disk index access API calls in an LRU cache
• Main reason for the speed advantage shown in query benchmarks
• TODO: move to a more performant query engine
• Datascript query engine does hash joins on returned full datoms
• Nested maps should do less work and be more performant
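For readers unfamiliar with the term, a hash join over two sets of query bindings can be sketched in a few lines of Python. This shows the general technique only, not Datascript's implementation; the function name and the representation of bindings as dictionaries keyed by variable name are assumptions.

```python
from collections import defaultdict


def hash_join(left, right, key):
    """Join two lists of binding maps on the shared variable `key`."""
    buckets = defaultdict(list)
    for row in left:                       # build phase: hash the left relation
        buckets[row[key]].append(row)
    joined = []
    for r in right:                        # probe phase: look up each right row
        for l in buckets.get(r[key], []):
            joined.append({**l, **r})
    return joined


# Bindings for the clauses [?e :name ?n] and [?e :age ?a]
names = [{"?e": 1, "?n": "Ann"}, {"?e": 2, "?n": "Bob"}]
ages = [{"?e": 2, "?a": 27}, {"?e": 3, "?a": 40}]
result = hash_join(names, ages, "?e")
```

Only entity 2 appears in both relations, so `result` holds a single merged binding. Every clause first materializes its matching rows before the join runs, which is the work that the last bullet above expects a different engine design to reduce.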
Datalog Transaction
• Retain Datascript transaction logic
• Reads during transaction: first search in-memory cache, then search on disk
• Transact to in-memory cache
• Identical to Datascript
• Cache content is lost when DB restarts
• Transact to disk storage
• Collect transacted datoms, commit them as a batch
• Sync to disk after each transaction
• Clear on-disk index access cache after a transaction
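The interplay between the LRU cache of index access results and its invalidation on each transaction can be sketched as below. This is a simplified Python model, not Datalevin's code: `CachedIndex`, its capacity, the `disk_reads` counter, and the use of a plain set in place of the on-disk index are all hypothetical.

```python
from collections import OrderedDict


class CachedIndex:
    """Hypothetical index wrapper: LRU-cached reads, cache cleared on every transaction."""

    def __init__(self, capacity=2):
        self.datoms = set()          # stands in for the on-disk index
        self.cache = OrderedDict()   # attribute -> cached scan result, oldest first
        self.capacity = capacity
        self.disk_reads = 0          # counts how often the "disk" is actually scanned

    def _scan(self, attr):
        self.disk_reads += 1
        return sorted(d for d in self.datoms if d[1] == attr)

    def datoms_for(self, attr):
        if attr in self.cache:
            self.cache.move_to_end(attr)       # mark as most recently used
            return self.cache[attr]
        result = self._scan(attr)
        self.cache[attr] = result
        if len(self.cache) > self.capacity:
            self.cache.popitem(last=False)     # evict the least recently used entry
        return result

    def transact(self, new_datoms):
        self.datoms.update(new_datoms)         # commit the collected datoms as one batch
        self.cache.clear()                     # cached results may now be stale
```

Repeated reads between transactions are served from the cache, while the first read after a transaction goes back to storage. Clearing the whole cache is coarse but keeps invalidation trivially correct, which suits a workload of bulk writes and frequent reads.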
Status
• Index Access API is identical to Datascript
• Features missing from Datascript
• Composite tuples (TODO)
• Persisted transaction functions (TODO)
• Features that make sense for an in-memory DB (maybe)
• DB serialization
• DB pretty print
Benchmark: Write
• 100K entities of random people information
• Bulk load of datoms is fast
• Bulk transaction is fast too
• Transacting a small number of datoms at a time is slow
• Advice: batch as much data as possible in a transaction
Benchmark: Read
• Datalevin is faster than Datascript across the board for all tested Datalog queries
Benchmark: Multi-threads Read
• Does LMDB’s claim of linear scaling with reader threads hold?
• Yes
• Does Datalevin preserve it?
• Yes
Roadmap
• 0.4.0 Distributed mode with Raft-based replication
• 0.5.0 New Datalog query engine with an optimizer
• 0.6.0 Automatic schema migration
• 0.7.0 Datalog query parity with Datascript
• 0.8.0 Implement Loom graph protocols
• 0.9.0 Auto indexing of document fields
• 1.0.0 Materialized views and incremental maintenance
Thank you! Questions?
Huahai Yang
https://github.com/huahaiy
@huahaiy
https://juji.io

Datalevin London-meetup2020
