3. Schema Design – Gary Murakami
Chess 4.5 (Northwestern University)
Larry Atkin & Dave Slate
4. Schema Design – Gary Murakami
Agenda
• What is a Record?
• Core Concepts
• What is an Entity?
• Associating Entities
• General Recommendations
• Questions
5. Schema Design – Gary Murakami
All application development is
Schema Design
6. Schema Design – Gary Murakami
Success comes from
Proper Data
Structure
8. Schema Design – Gary Murakami
Key → Value
• One-dimensional
• Single value is a blob
• Query on key only
• No schema
• Value cannot be updated, only replaced
Key Blob
9. Schema Design – Gary Murakami
Relational
• Two-dimensional (tuples)
• Each field is a single value
• Query on any field
• Very structured schema (table)
• In-place updates *
• Normalization requires many tables, joins,
and indexes, and results in poor data locality
and performance
Primary
Key
10. Schema Design – Gary Murakami
Document
• N-dimensional
• Each field can contain 0, 1,
many, or embedded values
• Query on any field & level
• Flexible schema
• Inline updates *
• Embedding related data has optimal data
locality, requires fewer indexes, has better
performance
_id
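The "0, 1, many, or embedded values" bullet can be illustrated with a minimal sketch. It uses a plain Python dict as a stand-in for a document; the contact, its field names, and the helper `field_shape` are hypothetical and are not part of any MongoDB API.

```python
# Hypothetical contact document, modeled as a plain Python dict.
contact = {
    "_id": 1,
    "name": "Ada Example",                               # a single value
    "nickname": None,                                    # explicitly null
    "emails": ["ada@example.com", "ada@work.example"],   # many values
    "address": {"city": "Minneapolis", "state": "MN"},   # embedded document
    # "title" is simply absent (zero values)
}


def field_shape(doc, field):
    """Classify a field as absent, null, single, many, or embedded."""
    if field not in doc:
        return "absent"
    value = doc[field]
    if value is None:
        return "null"
    if isinstance(value, list):
        return "many"
    if isinstance(value, dict):
        return "embedded"
    return "single"


shapes = {f: field_shape(contact, f)
          for f in ("title", "nickname", "name", "emails", "address")}
```

A relational row would need one column per single value and a separate table for `emails`; here every shape lives in the same record.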
12. Schema Design – Gary Murakami
Traditional Schema Design
Focus on data storage
13. Schema Design – Gary Murakami
Document Schema Design
Focus on data use
14. Schema Design – Gary Murakami
Another way to think about it
Traditional:
What answers do I have?
Document:
What questions do I
have?
15. Schema Design – Gary Murakami
Three Building Blocks of
Document Schema
Design
16. Schema Design – Gary Murakami
1 – Flexibility
• Choices for schema design
• Each record can have different fields
• Field names consistent for programming
• Common structure can be enforced by
application
• Easy to evolve as needed
17. Schema Design – Gary Murakami
2 – Arrays
Multiple Values per Field
• Each field can be:
– Absent
– Set to null
– Set to a single value
– Set to an array of many values
• Query for any matching value
– Can be indexed and each value in the array is in the
index
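As a rough sketch of "query for any matching value" and "each value in the array is in the index", the snippet below uses plain Python structures. The helpers `matches` and `build_index` and the sample data are assumptions made for illustration; they only mimic the idea and are not how MongoDB implements it.

```python
from collections import defaultdict

# Hypothetical contacts; "tags" may be an array, a single value, or absent.
contacts = [
    {"_id": 1, "name": "Ann", "tags": ["chess", "audio"]},
    {"_id": 2, "name": "Bob", "tags": "chess"},
    {"_id": 3, "name": "Cy"},
]


def matches(doc, field, wanted):
    """True if the field equals `wanted` or is an array containing it."""
    if field not in doc:
        return False
    value = doc[field]
    if isinstance(value, list):
        return wanted in value
    return value == wanted


def build_index(docs, field):
    """Map every value of the field, including each array element, to _ids."""
    index = defaultdict(list)
    for doc in docs:
        if field not in doc:
            continue
        value = doc[field]
        values = value if isinstance(value, list) else [value]
        for v in values:
            index[v].append(doc["_id"])
    return index


chess_players = [d["_id"] for d in contacts if matches(d, "tags", "chess")]
tag_index = build_index(contacts, "tags")
```

The same lookup works whether the field holds one value or many, which is what removes the need for a separate child table.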
18. Schema Design – Gary Murakami
3 – Embedded Documents
• Any value can be a document
• Nested documents provide structure
• Query any field at any level
– Can be indexed
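"Query any field at any level" is usually written with a dotted path such as "twitter.name". The helper below is a hypothetical, simplified resolver over plain Python dicts and lists, offered only as a sketch of the idea; real query semantics have more cases than this.

```python
def get_path(doc, path):
    """Return every value reachable at a dotted path, descending through
    embedded documents (dicts) and arrays (lists)."""
    values = [doc]
    for key in path.split("."):
        next_values = []
        for value in values:
            candidates = value if isinstance(value, list) else [value]
            for candidate in candidates:
                if isinstance(candidate, dict) and key in candidate:
                    next_values.append(candidate[key])
        values = next_values
    return values


# Hypothetical contact with one embedded document and one array of documents.
contact = {
    "name": "Ann",
    "twitter": {"name": "@ann", "location": "MN"},
    "phones": [
        {"type": "mobile", "number": "555-0101"},
        {"type": "work", "number": "555-0102"},
    ],
}

twitter_names = get_path(contact, "twitter.name")
phone_types = get_path(contact, "phones.type")
```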
19. Schema Design – Gary Murakami
Belle and Endgame tablebases
Play chess with God – Ken
Thompson
21. Schema Design – Gary Murakami
An Entity
• Object in your model
• Associations with other entities
Referencing (Relational)    Embedding (Document)
has_one                     embeds_one
belongs_to                  embedded_in
has_many                    embeds_many
has_and_belongs_to_many
MongoDB has both referencing and embedding for universal
coverage
22. Schema Design – Gary Murakami
Let's model something
together
How about a business
card?
26. Schema Design – Gary Murakami
Relational Schema
Contact
• name
• company
• title
• phone
Address
• street
• city
• state
• zip_code
27. Schema Design – Gary Murakami
Document Schema
Contact
• name
• company
• title
• phone
• address
– street
– city
– state
– zip_code
28. Schema Design – Gary Murakami
How are they different? Why?
Contact
• name
• company
• title
• phone
Address
• street
• city
• state
• zip_code
Contact
• name
• company
• title
• phone
• address
– street
– city
– state
– zip_code
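One way to see the difference is to count fetches. In the sketch below, plain dicts stand in for tables and collections, and `fetch` is a hypothetical helper that counts lookups; the `address_id` reference follows the relational layout, while all names and values are made up for illustration.

```python
fetches = 0


def fetch(collection, key):
    """Look up one record by key and count the round trip."""
    global fetches
    fetches += 1
    return collection[key]


# Referencing: contact and address live in separate tables.
contacts = {1: {"name": "Pat Example", "company": "Example Co",
                "title": "Engineer", "phone": "555-0100", "address_id": 10}}
addresses = {10: {"street": "1 Main St", "city": "Springfield",
                  "state": "IL", "zip_code": "62701"}}

# Embedding: the address is stored inside the contact document.
contact_docs = {1: {"name": "Pat Example", "company": "Example Co",
                    "title": "Engineer", "phone": "555-0100",
                    "address": {"street": "1 Main St", "city": "Springfield",
                                "state": "IL", "zip_code": "62701"}}}


def card_by_reference(contact_id):
    contact = fetch(contacts, contact_id)
    address = fetch(addresses, contact["address_id"])  # second lookup
    return contact["name"], address["city"]


def card_by_embedding(contact_id):
    contact = fetch(contact_docs, contact_id)          # single lookup
    return contact["name"], contact["address"]["city"]


fetches = 0
by_reference = card_by_reference(1)
reference_fetches = fetches

fetches = 0
by_embedding = card_by_embedding(1)
embedding_fetches = fetches
```

Both functions return the same card; the embedded form gets there with one lookup and keeps the address next to its contact.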
30. Schema Design – Gary Murakami
Longest “Database Endgame”
Mate
• Augment schema with metadata
– Distance to mate (DTM)
– Distance to conversion (DTC)
• Retrograde analysis of DB
• Longest checkmate
– 6 piece – 262 moves, KRNKNN
– 7 piece – 517 moves, so far
• Completion by 2015
33. Schema Design – Gary Murakami
Address Book
• What questions do I have?
• What are my entities?
• What are my associations?
34. Schema Design – Gary Murakami
Address Book Entity-Relationship
Contacts
• name
• company
• title
Addresses (N per contact)
• type
• street
• city
• state
• zip_code
Phones (N per contact)
• type
• number
Emails (N per contact)
• type
• address
Thumbnails (1 per contact)
• mime_type
• data
Portraits (1 per contact)
• mime_type
• data
Twitters (1 per contact)
• name
• location
• web
• bio
Groups (N to N with contacts)
• name
36. Schema Design – Gary Murakami
One to One
37. Schema Design – Gary Murakami
One to One
Schema Design Choices
contact (1) references twitter (1)
• twitter_id
twitter (1) references contact (1)
• contact_id
Redundant to track relationship on both sides
• Both references must be updated for consistency
• Saves a fetch if no twitter
Contact embeds twitter (1)
• twitter
38. Schema Design – Gary Murakami
One to One
General Recommendation
• Full contact info all at once
– Contact embeds twitter
• Parent-child relationship
– “contains”
• No additional data duplication
• Can query or index on embedded field
– e.g., “twitter.name”
Contact embeds twitter (1)
• twitter
39. Schema Design – Gary Murakami
One to Many
40. Schema Design – Gary Murakami
One to Many
Schema Design Choices
contact (1) references phones (N)
• phone_ids: [ ]
phone (N) references contact (1)
• contact_id
Redundant to track relationship on both sides
• Both references must be updated for consistency
• Not possible in relational DBs
• Saves a fetch if no phones
Contact embeds phones (N)
• phones
41. Schema Design – Gary Murakami
One to Many
General Recommendation
• Full contact info all at once
– Contact embeds multiple phones
• Parent-children relationship
– “contains”
• No additional data duplication
• Can query or index on any field
– e.g., { “phones.type”: “mobile” }
Contact embeds phones (N)
• phones
42. Schema Design – Gary Murakami
Many to Many
43. Schema Design – Gary Murakami
Many to Many
Traditional Relational Association
Join table
Contacts
• name
• company
• title
• phone
Groups
• name
GroupContacts
• group_id
• contact_id
Use arrays instead of a join table
44. Schema Design – Gary Murakami
Many to Many
Schema Design Choices
group (N) references contacts (N)
• contact_ids: [ ]
contact (N) references groups (N)
• group_ids: [ ]
Redundant to track relationship on both sides
• Both references must be updated for consistency
group embeds contacts (N)
• contacts
contact embeds groups (N)
• groups
Redundant to track relationship on both sides
• Duplicated data must be updated for consistency
45. Schema Design – Gary Murakami
Many to Many
General Recommendation
• Depends on use case
1. Simple address book
• Contact references groups
2. Corporate email groups
• Group embeds contacts for performance
contact (N) references groups (N)
• group_ids: [ ]
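For the simple address book case, "contact references groups" might look like the sketch below. Plain Python dicts and lists stand in for the two collections; the group ids, names, and the helpers `groups_of` and `members_of` are hypothetical.

```python
# Groups collection, keyed by _id.
groups = {
    "g1": {"_id": "g1", "name": "Friends"},
    "g2": {"_id": "g2", "name": "Chess Club"},
}

# Each contact carries an array of group references; no join table is needed.
contacts = [
    {"_id": 1, "name": "Ann", "group_ids": ["g1", "g2"]},
    {"_id": 2, "name": "Bob", "group_ids": ["g2"]},
    {"_id": 3, "name": "Cy", "group_ids": []},
]


def groups_of(contact):
    """Resolve a contact's group references to group names."""
    return [groups[group_id]["name"] for group_id in contact["group_ids"]]


def members_of(group_id):
    """Find the contacts whose reference array contains the group."""
    return [c["name"] for c in contacts if group_id in c["group_ids"]]


ann_groups = groups_of(contacts[0])
chess_club = members_of("g2")
```

The relationship is stored on one side only, so adding Bob to a group changes a single document.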
46. Schema Design – Gary Murakami
Contacts
• name
• company
• title
• addresses (N)
– type
– street
– city
– state
– zip_code
• phones (N)
– type
– number
• emails (N)
– type
– address
• thumbnail (1)
– mime_type
– data
• twitter (1)
– name
– location
– web
– bio
Portraits (1)
• mime_type
• data
Groups (N)
• name
Document model - holistic and efficient representation
48. Schema Design – Gary Murakami
Can We Solve Chess
One Day?
• Chess tablebase problem
– Chess programs often play worse
– Search is not localized, poor cache performance, seeks
– Working set too large for memory
• Endgame database size – big data
– 5 piece: 7 GB compressed 75%
• 157 MB Shredderbase – 1000x
• 441 MB Shredderbase – 10,000x
– 6 piece: 1.2 TB compressed
– 7 piece: 70 TB estimated by 2015
49. Schema Design – Gary Murakami
Working Set
1. To reduce the working set
– reference less-used data instead of embedding
• extract into referenced child document
– reference bulk data, e.g., portrait
2. To increase resources
– read from secondaries in a replica set
– use sharding
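A minimal sketch of "reference bulk data, e.g., portrait", assuming plain Python dicts as stand-ins for documents. The helper names, the `portrait_id` field, and the use of JSON length as a size estimate are all assumptions for illustration; real document sizes are measured in BSON and will differ.

```python
import json


def split_portrait(contact, portraits):
    """Move the bulky portrait into a separate store and leave a reference."""
    slim = {k: v for k, v in contact.items() if k != "portrait"}
    if "portrait" in contact:
        portrait_id = "portrait-%s" % contact["_id"]
        portraits[portrait_id] = contact["portrait"]
        slim["portrait_id"] = portrait_id
    return slim


def approx_size(doc):
    """Very rough size estimate: length of the JSON text."""
    return len(json.dumps(doc))


# Hypothetical contact: a small thumbnail and a very large portrait.
contact = {
    "_id": 7,
    "name": "Ann",
    "thumbnail": {"mime_type": "image/png", "data": "x" * 100},
    "portrait": {"mime_type": "image/jpeg", "data": "x" * 100000},
}

portraits = {}
slim = split_portrait(contact, portraits)
```

The frequently read contact stays small, so more contacts fit in memory; the portrait costs an extra fetch only when it is actually displayed.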
51. Schema Design – Gary Murakami
Embedding over Referencing
• Embed
– When “one” or “many” objects are viewed with their parent
– For performance
– For atomicity
• Reference
– When you need more scaling: max document size is
16MB
– For easy “many to many” associations
– For smaller parent documents and working set
52. Schema Design – Gary Murakami
Legacy Migration
1. Copy existing schema & some data to
MongoDB
2. Iterate schema design
1. Measure performance and find bottlenecks
2. Denormalize by embedding
1. one to one associations first
2. one to many associations next
3. many to many associations last
3. Examine, measure and analyze, review concerns,
scaling
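The "denormalize by embedding" step can be sketched as a small transformation over copied rows. Everything here is hypothetical: the row layouts, the `id` and `contact_id` column names, and the helper `embed_children` are assumptions made for illustration, with plain Python lists standing in for tables.

```python
def embed_children(parents, children, foreign_key, field, many):
    """Return copies of the parent rows with matching child rows embedded
    under `field` (a list if `many`, otherwise a single document)."""
    by_parent = {}
    for child in children:
        # Drop the child's own id and the foreign key; the nesting replaces them.
        embedded = {k: v for k, v in child.items()
                    if k not in ("id", foreign_key)}
        by_parent.setdefault(child[foreign_key], []).append(embedded)
    docs = []
    for parent in parents:
        doc = dict(parent)
        found = by_parent.get(parent["id"], [])
        if many:
            doc[field] = found
        elif found:
            doc[field] = found[0]
        docs.append(doc)
    return docs


contact_rows = [{"id": 1, "name": "Ann"}, {"id": 2, "name": "Bob"}]
twitter_rows = [{"id": 10, "contact_id": 1, "name": "@ann"}]
phone_rows = [
    {"id": 20, "contact_id": 1, "type": "mobile", "number": "555-0101"},
    {"id": 21, "contact_id": 1, "type": "work", "number": "555-0102"},
    {"id": 22, "contact_id": 2, "type": "home", "number": "555-0103"},
]

# One-to-one associations first, then one-to-many.
docs = embed_children(contact_rows, twitter_rows, "contact_id", "twitter", many=False)
docs = embed_children(docs, phone_rows, "contact_id", "phones", many=True)
```

Many-to-many associations are left for last because they usually stay as references rather than being embedded.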
53. Schema Design – Gary Murakami
New Application
1. Focus on your application
1. Requests
2. Responses
3. Business-domain model objects / data structures
2. Then persist language object data to
MongoDB
1. Collections
2. Associations
3. Refactor for optimization and add indices
54. Schema Design – Gary Murakami
It’s All About Your
Application
• Your schema is the impedance matcher
– Design choices: normalize/denormalize,
reference/embed
– Melds programming with MongoDB for best of both
– Flexible for development and change
• Programs+Databases = (Big) Data Applications
55. Schema Design – Gary Murakami
It’s All About Your
Application
• Your schema is the impedance matcher
– Design choices: normalize/denormalize,
reference/embed
– Melds programming with MongoDB for best of both
– Flexible for development and change
• Programs MongoDB = Great Big Data
Applications
• Play chess with God
56. Schema Design – Gary Murakami
It’s All About Your
Application
• Your schema is the impedance matcher
– Design choices: normalize/denormalize,
reference/embed
– Melds programming with MongoDB for best of both
– Flexible for development and change
• Programs MongoDB = Great Big Data
Applications
• Play music with God – AAC
58. Lead Engineer / Evangelist
Gary J. Murakami, Ph.D.
#MongoDB
Questions?
"His pattern indicates
two-dimensional thinking."
- Spock
Star Trek II: The Wrath of Khan
www.3dchessfederation.com
59. Thank you so much to our community who
made An Evening with MongoDB Minneapolis
possible:
• David Hussman
• Josh Kennedy
• Matthew Chimento
• Jeffrey Lemmerman
• Dan Chamberlain
• Christopher Rueber
• Erin Newkirk
Thank you DevJam for hosting our event!
Editor's notes
A long, long time ago in a state not too far from here, I was in high school. There I discovered the wonder of computer programming. I was on the chess team and on the wrestling team. I ran laps as conditioning for wrestling, and to keep running, I dreamed up algorithms and data structures to play chess. The importance of data structures was confirmed to me at Northwestern University when I took a course that used Pascal and Niklaus Wirth’s book “Algorithms + Data Structures = Programs.” And such data structures could be used to program computers to play chess.
(Next slide; skip the following.)
At the Illinois High School chess finals, I was astounded by my opponent. Fortunately, it was not by his play on the chess board, but by an extremely thick printout of his Tic-Tac-Toe program. It was one huge nested if statement that exhaustively enumerated all of the possibilities. The complexity of this is illustrated in the diagram that shows the map of optimal moves for O, playing second. I knew that the “program” was an abuse of a programming language and of a tree, and worse than a chess blunder, a travesty. An application without good Schema Design is a similar travesty.
Chess 4.5 was a pioneering chess program in the 1970s. It was the first program to win a human chess tournament. I enjoyed playing against it at Northwestern, and I even played a rated chess game against the programmer Dave Slate. Chess 4.5 added a database of “book” openings that greatly improved the capability of the program. So the chess program melded algorithms, data structures, and a database to take on human chess masters. Could you do similar great things with good schema design?
Perhaps you will have moments of insight where you say “Aha!” For those of you who say “Of course, I knew that,” may the truth resonate and grow. Some might disagree strongly with my general recommendations. May you all find the presentation interesting and thought provoking. And may it inspire enthusiasm in your schema design work for your applications.
Schema Design is very important; its impact on your application is pervasive.
The wrong data structure will hurt you. The proper data structure can make all the pieces fall into place.
One-dimensional storage can be very fast but is limited with respect to querying. Speed is why key-value stores are popular for modern web applications.
A record in a traditional relational DB is a tuple or row in a table. This table representation forces normalization of your data. Normalization is good for querying anything that the data can answer, and it is good for new queries. Relational DBs won out over other DBs that came before. To me, the winning technology is that every field or value is first class: in essence, every field can be addressed in queries and can be indexed for faster responses. But normalization requires many tables, joins to rehydrate relations, and indexes to make joins faster, and it results in poor data locality. For example, in order to represent an array, another table must be used just for that array. Slow performance is why NoSQL alternatives are becoming popular.
In-place updates *: SQL storage may use “padding” space for dynamic strings instead of fixed allocation.
“Document” is somewhat of a misnomer: not the Constitution or XML, but object data (without methods), often visualized as JSON.
Inline updates *: a padding factor can reduce the need to move a document.
The essential capability (querying and indexing) persists and gets even better. The document structure can match your data structures – your schema.
Answers: data. Questions: application. Does your schema take advantage of your application-specific knowledge of known queries, use cases, and client-program data structures? Traditional DBs make it hard to take advantage of them. Document DBs make it easy to take advantage of them. MongoDB documents can match your application, given good schema design.
Not “schema-less” but rather “flexible schema.” Common structure can be enforced by the application. While MongoDB does not enforce common structure, neither does it restrict your application. Documents may have a common structure that is optionally extended at the document level. Use this flexibility for a class hierarchy with subclasses:
- Traditional relational representation requires separate tables
- Work around with multiple mostly-empty columns
- Example: three days for schema migration
Keywords: flexible, choice, evolve, change, modify
The lack of multivalued fields is usually the first complaint of programmers who don’t wish to pay the cost of normalization. The concept of arrays incorporates multiple values and also associations involving many entities.
Keywords: array, multiple, many
Documents may have a common structure that is optionally extended at the document level. The application mapping can enforce the required and optional fields. What could you do with these building blocks? Perhaps play chess, and beat human chess masters?
Belle (picture on the left) was the first computer built for the sole purpose of chess playing. It was developed by former coworkers of mine, Joe Condon and Ken Thompson, at Bell Labs in the 1970s and 1980s. Ken is renowned for developing the Unix operating system in the C programming language. Belle officially became the first master-level machine in 1983 and dominated play throughout the 1980s. Ken used Belle extensively for pioneering research with chess endgame tablebases. Starting from all possible checkmates with 3 pieces, retrograde analysis was used to exhaustively calculate all possible positions with forced mates. Ken completed the endgame tablebase for up to five pieces and published it on CD-ROM. It represents years of compute time and is still available online under the caption “Play chess with God.” Good Schema Design matched the endgame tablebase to live chess playing so that Belle could beat human chess masters. Let’s investigate good schema design for an application.
“Vintage” business card
Contact and Address entities are associated one to one. Traditional relational association is via referencing. In this example, the contact record for Steve Jobs has a reference to his address via the address_id field.
We’ve discussed Entities, Associations, Referencing, Embedding, and business cards, and we’ll build on that knowledge. Chess programmers have built on the endgame database with interesting results.
Entity-Relational diagram
Entity-Relational diagram for embedding documents
Left – relational – requires either two fetches/queries or a join in a relational DB. Right – document – requires only one fetch/query and has data locality.
We have discussed Entities, Associations, Referencing, Embedding, and business cards as sample data. We’ll build on that knowledge. Chess programmers have built on the endgame database with interesting results.
Likewise for your application, use Schema Design and the flexible schema of MongoDB to empower your database analytics
A common example will help us understand the joy of flexible document structure.
Left: One to one. We're going to assume users only have one Twitter account. A thumbnail is a small profile image while a portrait is a very large profile image. Right: One to many. Middle: Many to many.
Arrays of references are more direct than a join table and save a fetch.
Fundamentally not “contains.”
Concerns – exceptional cases:
- Exceeding maximum document size due to large data or scaling
- Transferring very large documents is probably a performance concern
- Scaling may affect working set size
Schema can be adjusted to improve performance:
- Fetch only the data that you need
Embedding entities in the contact document reduces six fetches to one
We’ve completed our address book example, but what about chess?
Chess is not just an interesting challenge that raises philosophical questions about the intelligence of humans and computers.It is also a prime example of the effectiveness of algorithms plus data structures, plus good schema design for databases.And the endgame database has the challenges of big data and working set size that we face in our growing big data applications.
To increase resources with MongoDB:
- Use a replica set and read from secondaries
- Use sharding
Embedding is a bit like pre-joined data. BSON (Binary JSON) document ops are easy for the server. Choose embedding by default as opposed to referencing.
Embed (90/10 following rule of thumb):
- When the “one” or “many” objects are viewed in the context of their parent
Reference:
- For easy consistency with “many to many” associations without duplicated data
Referencing is not just the default for relational DBs; there is no other choice.
You no longer have to coerce your data into a form acceptable to a SQL database. You can now architect or tailor your data to your application in your programming language and persist it to MongoDB.
May you build Great Big Data Applications. Perhaps you can say inspiring quotes like Ken Thompson, “Play chess with God.”
Good news: giving power and control back to the programmer and the programming language. Ken and I worked on Perceptual Audio Coding, better known as Advanced Audio Coding or AAC, as found in the iPod and iPhone. So I hope that this will inspire you to “Play music with God” to build your killer app. How is this made possible? Here’s the technology in MongoDB that makes this all possible.
BSON (Binary JSON) is the “magic” or core technology in MongoDB for data structures and performance. BSON does not have to be parsed like JSON, but is rather a format that can be traversed easily.
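To make "traversed rather than parsed" concrete, here is a minimal sketch that hand-encodes a tiny subset of BSON (flat documents with string and int32 values only) and then looks up one field by hopping over the others using their length prefixes. The helpers `encode` and `find_field` are written for illustration under that assumption; they are not a driver API and ignore every other BSON type.

```python
import struct


def encode(doc):
    """Encode a flat dict of str / int32 values as BSON (subset only)."""
    body = b""
    for name, value in doc.items():
        key = name.encode("utf-8") + b"\x00"           # element name as a C string
        if isinstance(value, str):
            data = value.encode("utf-8") + b"\x00"
            # type 0x02: int32 byte length (including the trailing NUL), then bytes
            body += b"\x02" + key + struct.pack("<i", len(data)) + data
        elif isinstance(value, int):
            body += b"\x10" + key + struct.pack("<i", value)  # type 0x10: int32
        else:
            raise TypeError("this sketch handles only str and int32 values")
    # int32 total length + elements + terminating NUL
    return struct.pack("<i", len(body) + 5) + body + b"\x00"


def find_field(data, wanted):
    """Walk the elements and return the wanted value, skipping the rest."""
    pos = 4                                            # skip the total-length prefix
    while data[pos] != 0:
        kind = data[pos]
        end = data.index(b"\x00", pos + 1)
        name = data[pos + 1:end].decode("utf-8")
        pos = end + 1
        if kind == 0x02:
            (size,) = struct.unpack_from("<i", data, pos)
            if name == wanted:
                return data[pos + 4:pos + 4 + size - 1].decode("utf-8")
            pos += 4 + size                            # jump over the string undecoded
        elif kind == 0x10:
            if name == wanted:
                return struct.unpack_from("<i", data, pos)[0]
            pos += 4
        else:
            raise ValueError("unsupported element type in this sketch")
    return None


encoded = encode({"name": "Ann", "age": 30})
age = find_field(encoded, "age")
```

Because each string carries its length, the lookup for `age` never decodes the value of `name`; a JSON reader would have to scan that text character by character.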