Creating a Single View Part 2: Loading Disparate Source Data and Creating a Single Enterprise-Wide View
1. Enterprise Architect, MongoDB
Buzz Moschetti
buzz.moschetti@mongodb.com
Creating a Single View Part 2:
Data Design & Loading
Strategies
2. Who Is Talking To You?
• Yes, I use “Buzz” on my business cards
• Former Investment Bank Chief Architect at
JPMorganChase and Bear Stearns before that
• Over 27 years of designing and building systems
• Big and small
• Super-specialized to broadly useful in any vertical
• “Traditional” to completely disruptive
• Advocate of language leverage and strong factoring
• Inventor of perl DBI/DBD
• Still programming – using emacs, of course
3. What Is He Going To Talk About?
Historic Challenges
New Strategy for Success
Technical examples and tips
Creating A Single View
Part 1: Overview & Data Analysis
Part 2: Data Design & Loading Strategies
Part 3: Securing Your Deployment
5. It’s 2014: Why is this still hard to do?
• Business / Technical / Information Challenges
• Missteps in evolution of data transfer technology
A X
6. We wish this “just worked”
A
Query objects from A
with great performance
Query objects from B
with great performance
X
Query objects from
merged A and B with
great performance
B
7. …but Beware The Blue Arrow!
A X
• Extracting many tables into many files
• Some tables require more than one file to capture representation
• Encoding/formatting clever tricks
• Reconciliation
• Different extracts for different consumers
• Different extracts for different versions of data to same consumer
8. Loss of fidelity exposed
class Product {
String productName;
List<Features> ff;
Date introDate;
List<Date> versDates;
int[] unitBundles;
//…
}
widget1,,3,,good texture,retains value,,,20142304,102.3,201401
widget2,XS,6,,,,not fragile,,,20132304,73,87653
widget3,XT,,,4,,dense,shiny,mysterious,,,19990304,73,87653,,
widget4,,,3,4,,,,,,20040101,,999999,,
A
ORM
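A minimal Python sketch of what the consumer of that extract actually receives. It parses two of the rows above with the standard csv module; nothing here is from the original deck beyond the two rows themselves, and the reading of which field is a "feature" is an assumption made only for illustration.

```python
import csv
import io

# Two rows copied from the extract above.
raw = (
    "widget1,,3,,good texture,retains value,,,20142304,102.3,201401\n"
    "widget2,XS,6,,,,not fragile,,,20132304,73,87653\n"
)

rows = list(csv.reader(io.StringIO(raw)))

# Every field comes back as a string; empty strings stand in for "missing".
all_strings = all(isinstance(field, str) for row in rows for field in row)

# The rows do not agree on their width, so position alone cannot tell
# the consumer which column holds which attribute.
widths = [len(row) for row in rows]

# Assumed reading: the first feature text sits at a different index per row.
first_text_index = [row.index("good texture") if "good texture" in row
                    else row.index("not fragile") for row in rows]

print(all_strings, widths, first_text_index)
```

The lists, dates, and int array of the Product class have all collapsed into positional strings, and the positions are not even stable from row to row.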
9. What happened to XML?
class Product {
String productName;
List<Features> ff;
Date introDate;
List<Date> versDates;
int[] unitBundles;
//…
}
<product>
<name>widget1</name>
<features>
<feature>
<text>good texture</text>
<type>A</type>
</feature>
</features>
<introDate>20140204</introDate>
<versDates>
<versDate>20100103</versDate>
<versDate>20100601</versDate>
</versDates>
<unitBundles>1,3,9</unitBun…
10. XML: Created More Issues Than Solved
<product>
<name>widget1</name>
<features>
<feature>
<text>good texture</text>
<type>A</type>
</feature>
</features>
<introDate>20140204</introDate>
<versDates>
<versDate>20100103</versDate>
<versDate>20100601</versDate>
</versDates>
<unitBundles>1,3,9</unitBun…
• No native handling of arrays
• Attribute vs. nested tag rules/conventions widely variable
• Generic parsing (DOM) yields a tree of Nodes of Strings – not very friendly
• SAX is fast but too low level
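A small sketch of the first and third bullets, using Python's standard ElementTree parser as a stand-in for a generic DOM-style parse. The document is a shortened copy of the product XML above; the closing unitBundles tag is truncated on the slide, so completing it here is an assumption.

```python
import xml.etree.ElementTree as ET

doc = """
<product>
  <name>widget1</name>
  <introDate>20140204</introDate>
  <versDates>
    <versDate>20100103</versDate>
    <versDate>20100601</versDate>
  </versDates>
  <unitBundles>1,3,9</unitBundles>
</product>
"""

root = ET.fromstring(doc)

# Every leaf value arrives as a string, whatever its intended type.
intro_date = root.findtext("introDate")

# An "array" is only a convention: repeated child tags in one place...
vers_dates = [e.text for e in root.find("versDates")]

# ...and a comma-delimited string in another, which the consumer must split.
unit_bundles = [int(s) for s in root.findtext("unitBundles").split(",")]

print(type(intro_date).__name__, vers_dates, unit_bundles)
```

Two different array conventions in one small document, and the consumer has to know about both in advance.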
11. … and it eventually became this
<p name=“widget1” ftxt1=“good texture” ftyp1=“A” idt=“20140203” …
<p name=“widget2” ftxt1=“not fragile” ftyp1=“A” idt=“20110117” …
<p name=“widget3” ftxt1=“dense” idt=“20140203” …
<p name=“widget4” idt=“20140203” versD=“20130403,20130104,20100605” …
• Short, cryptic, conflated tag names
• Everything is a string attribute
• Mix of flattened arrays and delimited strings
• Irony: org.xml.sax.Attributes easier to deal with than rest of DOM
12. Schema Change Challenges: Multiplied & Concentrated!
X
Alter table(s)
split() more data
A
Alter table(s)
Extract more data
LOE = x1
Alter table(s)
split() more data
Alter table(s)
split() more data
B
Alter table(s)
Extract more
data
LOE = x2
C
Alter table(s)
Extract more
data
LOE = x3
$\mathrm{LOE} = \sum_{i=1}^{n} x_i + f(n)$
where f() is nonlinear wrt n
13. SLAs & Security: Tough to Combine
A
B
User 1 entitled to see X
User 2 entitled to see Y
User 1 entitled to see Z
User 2 entitled to see V
X
Entitlements managed per-system/per-application here….
…are lost in the low-fidelity transfer of data….
…and have to be reconstituted here…somehow…
16. Overall Strategy For Success
• Let the source systems’ entities drive the data design, not the physical database
• Capture data in full fidelity
• Perform cross-ref and additional logic at the single point of view
17. Don’t forget the power of the API
class Product {
String productName;
List<Features> ff;
Date introDate;
List<Date> versDates;
int[] unitBundles;
//…
}
If you can, avoid files altogether!
Haskell
18. But if you are creating files: emit JSON
class Product {
String productName;
List<Features> ff;
Date introDate;
List<Date> versDates;
int[] unitBundles;
//…
}
{
  "name": "widget1",
  "features": [
    { "text": "good texture",
      "type": "A" }
  ],
  "introDate": "20140204",
  "versDates": [
    "20100103", "20100601"
  ],
  "unitBundles": [1,3,7,9]
  // …
}
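As a rough sketch of how little work the producer has to do, here is the same product built as a plain Python structure and emitted with the standard json module. The dict simply mirrors the JSON above; a real feeder would build it from its own objects.

```python
import json

# The product from the slide, as an ordinary in-memory structure.
product = {
    "name": "widget1",
    "features": [{"text": "good texture", "type": "A"}],
    "introDate": "20140204",
    "versDates": ["20100103", "20100601"],
    "unitBundles": [1, 3, 7, 9],
}

# One call to emit, one call to read back.
line = json.dumps(product)
restored = json.loads(line)

print(line)
```

Lists stay lists and numbers stay numbers on the round trip; only dates and binary data need extra help, since JSON has no native type for either.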
19. Let The Feeding System Express Itself
A
B
C
{ "name": "widget1",
  "features": [
    { "text": "good texture",
      "type": "A" }
  ]
}
{ "myColors": ["red","blue"],
  "myFloats": [ 3.14159, 2.71828 ],
  "nest": { "as": { "deep": true }}
}
{ "myBlob": { "$binary": "aGVsbG8K"},
  "myDate": { "$date": "20130405" }
}
21. The Joy (and value) of mongoDB
A
Alter table(s)
Extract more
data
LOE = .25x1
B
Alter table(s)
Extract more data
LOE = .25x2
C
Alter table(s)
Extract more data
LOE = .25x3
LOE = O(1)
22. Helpful Hint: Use the APIs
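# curs: an open DB-API cursor on the source RDBMS
# coll: a pymongo collection (insert_one replaces the older insert call)
curs.execute("select A.did, A.fullname, B.number from contact A "
             "left outer join phones B on A.did = B.did order by A.did")
lastDID = None
contact = None
for q in curs.fetchall():
    if q[0] != lastDID:
        # New contact: flush the previous one before starting the next
        if lastDID is not None:
            coll.insert_one(contact)
        contact = {"did": q[0], "name": q[1]}
        lastDID = q[0]
    if q[2] is not None:
        # Create the phones array only when the contact has phone rows
        if 'phones' not in contact:
            contact['phones'] = []
        contact['phones'].append({"number": q[2]})
# Flush the final contact
if lastDID is not None:
    coll.insert_one(contact)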
{
  "did" : "D159308",
  "phones" : [
    {"number": "1-666-444-3333"},
    {"number": "1-999-444-3333"},
    {"number": "1-999-444-9999"}
  ],
  "name" : "Buzz"
}
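The same join-and-fold idea as a self-contained sketch that runs without a database server. It swaps in an in-memory sqlite3 database for the source system and itertools.groupby for the lastDID bookkeeping, and collects the documents in a list where the slide inserts into MongoDB. The table contents and the second contact are made up for the example.

```python
import sqlite3
from itertools import groupby
from operator import itemgetter

# Stand-in source system: an in-memory database with made-up rows.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table contact (did text, fullname text);
    create table phones (did text, number text);
    insert into contact values ('D159308', 'Buzz');
    insert into contact values ('D2', 'Steve');
    insert into phones values ('D159308', '1-666-444-3333');
    insert into phones values ('D159308', '1-999-444-3333');
""")

rows = conn.execute(
    "select A.did, A.fullname, B.number from contact A "
    "left outer join phones B on A.did = B.did order by A.did"
)

contacts = []
# Rows arrive ordered by did, so consecutive rows group into one contact.
for (did, name), group in groupby(rows, key=itemgetter(0, 1)):
    contact = {"did": did, "name": name}
    phones = [{"number": number} for _, _, number in group if number is not None]
    if phones:
        contact["phones"] = phones
    contacts.append(contact)  # with pymongo, the insert would go here

print(contacts)
```

A contact with no phone rows simply has no phones field, which is the rich-structure equivalent of the null that the outer join produces.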
23. Helpful Hint: Declare Types
Use mongoDB conventions for dates and binary data:
{"dateA": {"$date": "2014-05-16T09:42:57.112-0000"}}
{"dateB": {"$date": 1400617865438}}
{"someBlob": { "$binary" : "YmxhIGJsYSBibGE=",
               "$type" : "00" }}
24. Helpful Hint: Keep the file flexible
Use CR-delimited JSON, one document per line:
{ "name": "buzz", "locale": "NY"}
{ "name": "steve", "locale": "UK"}
{ "name": "john", "locale": "NY"}
…instead of a giant array:
records = [
{ "name": "buzz", "locale": "NY"},
{ "name": "steve", "locale": "UK"},
{ "name": "john", "locale": "NY"}
]
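A minimal sketch of why the one-document-per-line layout is kinder to the loader: each line parses on its own, so the reader never has to hold the whole file in memory. The function name read_records is made up for the example, and an in-memory stream stands in for the file.

```python
import io
import json

def read_records(stream):
    """Yield one parsed document per non-blank line."""
    for line in stream:
        line = line.strip()
        if line:
            yield json.loads(line)

# Stand-in for an open file; a real loader would pass open(path) instead.
feed = io.StringIO(
    '{ "name": "buzz", "locale": "NY"}\n'
    '{ "name": "steve", "locale": "UK"}\n'
    '{ "name": "john", "locale": "NY"}\n'
)

ny_names = [r["name"] for r in read_records(feed) if r["locale"] == "NY"]
print(ny_names)
```

With the giant-array layout, the standard json module would have to read and parse the entire array before the first record became available.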
25. Helpful Hint: Don’t be afraid of metadata
Use a version number in each document:
{ "v": 1, "name": "buzz", "locale": "NY"}
{ "v": 1, "name": "steve", "locale": "UK"}
{ "v": 2, "name": "john", "region": "NY"}
…or get fancier and use a header record:
{ "vers": 1, "creator": "ID", "createDate": …}
{ "name": "buzz", "locale": "NY"}
{ "name": "steve", "locale": "UK"}
{ "name": "john", "locale": "NY"}
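One way a feed handler might use the per-document version number, sketched in Python. The normalize function is hypothetical, and so is the interpretation that version 2 renamed "locale" to "region"; the slide shows the two shapes but does not say how they relate.

```python
import json

def normalize(doc):
    """Map each known version of a record onto one in-memory shape."""
    version = doc.get("v", 1)
    if version == 1:
        return {"name": doc["name"], "locale": doc["locale"]}
    if version == 2:
        # Assumption for this sketch: v2 renamed "locale" to "region".
        return {"name": doc["name"], "locale": doc["region"]}
    raise ValueError(f"unknown record version: {version}")

lines = [
    '{ "v": 1, "name": "buzz", "locale": "NY"}',
    '{ "v": 2, "name": "john", "region": "NY"}',
]

people = [normalize(json.loads(line)) for line in lines]
print(people)
```

Old and new records can sit in the same file, and the handler fails loudly on a version it has never seen instead of guessing.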
27. Now that we have the data…
You’re well on your way to a single view consolidation…but first:
– Data Work
• Cross-reference important keys
• Potential scrubbing/cleansing
– Software Stack Work
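A rough sketch of the cross-referencing step, under heavy assumptions: the field names custId and acct, the sample documents, and the xref table are all hypothetical, and plain dicts stand in for collections. The point is only that each source's document is kept intact and the cross-reference is applied at the single view.

```python
# Hypothetical documents as loaded from two feeder systems, A and B.
from_a = [{"custId": "A-17", "name": "buzz", "locale": "NY"}]
from_b = [
    {"acct": "B-9001", "phones": [{"number": "1-666-444-3333"}]},
    {"acct": "B-4242", "phones": []},
]

# Hypothetical cross-reference: system B's key -> system A's key.
xref = {"B-9001": "A-17"}

single_view = {doc["custId"]: {"a": doc} for doc in from_a}
unmatched = []
for doc in from_b:
    cust_id = xref.get(doc["acct"])
    if cust_id in single_view:
        # Keep each source's document intact under its own name.
        single_view[cust_id]["b"] = doc
    else:
        unmatched.append(doc)  # candidates for scrubbing/cleansing

print(single_view, unmatched)
```

Documents that do not match any key end up in the unmatched list, which is where the scrubbing and cleansing work would start.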
30. Build THIS!
http://yourcompany/yourapp
Data Access Layer
Object Construction Layer
Basic Functional Layer
Portal Functional Layer
GUI adapter Layer
Web Service Layer
Other Regular
Performance
Applications
Higher Performance
Applications
Special
Generic Applications
31. What Is Happening Next?
Access Control
Data Protection
Auditing
Creating A Single View
Part 1: Overview & Data Analysis
Part 2: Data Design & Loading Strategies
Part 3: Securing Your Deployment
AND WHY ARE WE DOING IT AT ALL! Federation? Managed QoS? Because traditional RDBMS dynamics make it difficult to well-serve a number of access patterns
The single most important part of this that will make you successful is the simplest – and is part of the mongoDB data environment
Fidelity of data in the ETL fabric is typically lowest common denominator (LCD)
CSV still carries the day because it is easy to make and technically easy to parse (but difficult to change or to express things in)
XML / XSD “too hard” to technically make, parse/consume, and harder still to create consistent list/array conventions
Anecdote about getting skewered by the arrow
The arrow is disingenuous!
This is LOSS OF FIDELITY
Most people use an ORM to get from DB to good objects – and mongoDB has a story around that too!
But for the moment, assume we use it.
XML was supposed to be The Thing.
No one runs schema validation in production because of performance
Schemas became too complicated anyway…..
JAXB, JAXP are compile-time bound
XML set us back about 10 years
Leads to this: Can you please just send me a CSV again?
Changes to data in source system imply DB schema upgrade in data hub – with X source systems, this starts to become unscalable
Hub Data storage scalability
In summary: traditionally, common data hubs are harder to manage than the sum of their source systems – which themselves are not so easy to manage!
Remember this formula; we’ll see how we improve upon this in just a bit.
Data entitlement implicit to system access
Fast-moving businesses cannot be held up by naturally slower-moving ones
(Andreas will cover this in greater detail later)
How did we get here, examples from past? Anecdotal reinforcement. Knowing legacy problems and experience, here are the 3 things that work.
Don’t think about transferring tables; think about transferring products, logs, trades, customers
----- Meeting Notes (5/19/14 13:31) -----
A zillion APIs.
This does not necessarily mean REALTIME. We can do realtime with “microbatching”. We can do EOD batch with a filefree API. It’s all about how producer and consumer agree to capture the data – we’ll see more about this context later in the presentation.
Our most successful customers do this or use microbatching.
The Green Arrow
JSON is the new leader among highly interoperable, ASCII structured data formats
ASCII interop is critical so GPB, Avro, and other formats are out.
Better than XML because:
• Strings, numbers, maps, and arrays natively supported
• Simpler data model (no attributes or unnested content)
• Easier to programmatically construct
(Much!) better than CSV because:
• Rich detail is preserved
• Content can be expanded later without struggling with “comma hell”
Warning: JSON does NOT have Date or binary (BLOB) types! We’ll come back to a strategy on that….
The Basic Rules:
Let feeder systems drive the data design
Do not dilute, format, or otherwise mess with the data
JUST ADD IT.
Not talking about doubles turning into lists of dates – but there’s a hint coming up that could help there too.
MUCH easier to update JSON feed handler for new data
Essentially constant time to ingest new or changed data!
Build the rich structure!
You have to do this anyway to produce a JSON file so if you can, go the extra distance and just directly insert the content.
Don’t worry about transactions; you should be using batchID which we’ll get to in a moment.
mongoDB does not extend JSON per se. Rather, within the JSON spec, we have a structural-naming convention that allows us to clearly hint at the true intended type of the string value.
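A sketch of a producer applying that naming convention with nothing but the Python standard library. The helper to_extended is made up for the example, it assumes timezone-aware datetimes, and it covers only the two cases shown on the slide ($date as epoch milliseconds and $binary with $type "00"); MongoDB's own drivers ship fuller implementations.

```python
import base64
import json
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)

def to_extended(value):
    """Wrap dates and bytes in the $date / $binary naming convention."""
    if isinstance(value, datetime):
        # Integer milliseconds since the Unix epoch (assumes an aware datetime).
        return {"$date": (value - EPOCH) // timedelta(milliseconds=1)}
    if isinstance(value, bytes):
        return {"$binary": base64.b64encode(value).decode("ascii"), "$type": "00"}
    raise TypeError(f"not JSON serializable: {type(value).__name__}")

record = {
    "dateB": datetime(2014, 5, 20, 20, 31, 5, 438000, tzinfo=timezone.utc),
    "someBlob": b"bla bla bla",
}

# json.dumps calls to_extended for any value it cannot serialize itself.
line = json.dumps(record, default=to_extended)
print(line)
```

The output is still ordinary JSON that any parser can read; the wrapper keys only hint at the intended type for consumers that know the convention.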
Easy to grep and use jq too
Std unix utils work nicely too:
Same format as mongoimport and mongoexport
Does not force large memory footprint on loader
Don’t be afraid to make mistakes – for the same reason we explored on slide 21.
Context is an identifier for a set of data: ABC123
Dates are dangerous
For global systems, two (or more!) local dates possible.
System processing date can be misleading
Context has additional benefits
Easy to associate other information with context ID like functional ID
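One possible reading of the context idea, sketched in Python. The field names ctx, functionalId, and asOf, the sample values, and the back-out step are all assumptions made for illustration; the notes only say that a context identifies a set of data and that other information can be associated with it.

```python
import json

def stamp(lines, context_id):
    """Attach the context ID to every document in one load."""
    for line in lines:
        doc = json.loads(line)
        doc["ctx"] = context_id
        yield doc

feed = [
    '{"name": "buzz", "locale": "NY"}',
    '{"name": "steve", "locale": "UK"}',
]
store = list(stamp(feed, "ABC123"))

# Other information hangs off the context, not off each document,
# including more than one local date for a global system.
contexts = {
    "ABC123": {"functionalId": "eod-loader", "asOf": ["2014-05-19 NY", "2014-05-20 HK"]},
}

# Backing out a bad load becomes a filter on the context ID.
store_after_backout = [d for d in store if d["ctx"] != "ABC123"]

print(store, store_after_backout)
```

Because the set of data is named by an opaque identifier instead of a date, the question of whose date it is gets answered once, in the context record.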
Single View of Customer does not mean Single Technical visualization of Customer thru GUI!!