Creating a Single View: Data Design and Loading Strategies
1. Enterprise Architect, MongoDB
Buzz Moschetti
buzz.moschetti@mongodb.com
#ConferenceHashTag
Creating a Single View Part 2:
Data Design & Loading
Strategies
2. Who Is Talking To You?
• Yes, I use “Buzz” on my business cards
• Former Investment Bank Chief Architect at
JPMorganChase and Bear Stearns before that
• Over 27 years of designing and building systems
• Big and small
• Super-specialized to broadly useful in any vertical
• “Traditional” to completely disruptive
• Advocate of language leverage and strong factoring
• Inventor of perl DBI/DBD
• Still programming – using emacs, of course
3. What Is He Going To Talk About?
Historic Challenges
New Strategy for Success
Technical examples and tips
Overview &
Data Analysis
Data Design &
Loading
Strategies
Securing Your
Deployment
Creating A Single View
Part
1
Part
2
Part
3
5. It’s 2014: Why is this still hard to
do?
• Business / Technical / Information Challenges
• Missteps in evolution of data transfer technology
A X
6. We wish this “just worked”
A
Query objects from A
with great performance
Query objects from B
with great performance
X
Query objects from
merged A and B with
great performance
B
7. …but Beware The Blue Arrow!
A X
• Extracting many tables into many files
• Some tables require more than one file to capture representation
• Encoding/formatting clever tricks
• Reconciliation
• Different extracts for different consumers
• Different extracts for different versions of data to same consumer
8. Loss of fidelity exposed
class Product {
String productName;
List<Features> ff;
Date introDate;
List<Date>
versDates;
int[] unitBundles;
//…
}
widget1,,3,,good texture,retains value,,,20142304,102.3,201401
widget2,XS,6,,,,not fragile,,,20132304,73,87653
widget3,XT,,,4,,dense,shiny,mysterious,,,19990304,73,87653,,
widget4,,,3,4,,,,,,20040101,,999999,,
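The fidelity loss is easy to demonstrate. This minimal Python sketch parses one row taken from the extract above; everything comes back as flat, positional strings, so the feature list, the date type, and the integer array exist only in out-of-band documentation.

```python
import csv
import io

# One row from the slide's extract file
row_text = "widget2,XS,6,,,,not fragile,,,20132304,73,87653"

# csv hands back a flat list of strings; which columns were the
# feature array, which were dates, and why some are empty is
# knowledge the file format cannot carry.
row = next(csv.reader(io.StringIO(row_text)))
print(row[0])   # widget2
print(row[2])   # the string '6', not the int 6
```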
9. What happened to XML?
class Product {
String productName;
List<Features> ff;
Date introDate;
List<Date>
versDates;
int[] unitBundles;
//…
}
<product>
<name>widget1</name>
<features>
<feature>
<text>good texture</text>
<type>A</type>
</feature>
</features>
<introDate>20140204</introDate>
<versDates>
<versDate>20100103</versDate>
<versDate>20100601</versDate>
</versDates>
<unitBundles>1,3,9</unitBun…
10. XML: Created More Issues Than
Solved
<product>
<name>widget1</name>
<features>
<feature>
<text>good texture</text>
<type>A</type>
</feature>
</features>
<introDate>20140204</introDate>
<versDates>
<versDate>20100103</versDate>
<versDate>20100601</versDate>
</versDates>
<unitBundles>1,3,9</unitBun…
• No native handling of
arrays
• Attribute vs. nested tag
rules/conventions widely
variable
• Generic parsing (DOM)
yields a tree of Nodes of
Strings – not very friendly
• SAX is fast but too low
level
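The "tree of Nodes of Strings" complaint can be sketched in a few lines of Python, using element names borrowed from the earlier XML example: even the comma-packed "array" comes back as one string the consumer must split and convert by hand.

```python
from xml.dom.minidom import parseString

doc = parseString(
    "<product><name>widget1</name>"
    "<unitBundles>1,3,9</unitBundles></product>")

# DOM hands back generic nodes whose payloads are all text;
# typing and list structure are the consumer's problem.
text = doc.getElementsByTagName("unitBundles")[0].firstChild.data
bundles = [int(x) for x in text.split(",")]
print(bundles)  # [1, 3, 9]
```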
11. … and it eventually became this
<p name="widget1" ftxt1="good texture" ftyp1="A" idt="20140203" …
<p name="widget2" ftxt1="not fragile" ftyp1="A" idt="20110117" …
<p name="widget3" ftxt1="dense" idt="20140203" …
<p name="widget4" idt="20140203" versD="20130403,20130104,20100605" …
• Short, cryptic, conflated tag names
• Everything is a string attribute
• Mix of flattened arrays and delimited strings
• Irony: org.xml.sax.Attributes easier to deal with than rest of
DOM
12. Schema Change Challenges:
Multiplied & Concentrated!
X
Alter table(s)
split() more data
A
Alter table(s)
Extract more data
LOE = x1
Alter table(s)
split() more data
Alter table(s)
split() more data
B
Alter table(s)
Extract more
data
LOE = x2
C
Alter table(s)
Extract more
data
LOE = x3
LOE = xn
Total LOE = Σ (i = 1..n) xi + f(n)
where f() is nonlinear wrt n
13. SLAs & Security: Tough to
Combine
A
B
User 1 entitled to see X
User 2 entitled to see Y
User 1 entitled to see Z
User 2 entitled to see V
X
Entitlements managed per-system/per-application here….
…are lost in the
low-fidelity transfer
of data….
…and have to be
reconstituted here
…somehow…
16. Overall Strategy For Success
• Let the source systems’ entities drive the
data design, not the physical database
• Capture data in full fidelity
• Perform cross-ref and additional logic at the
single point of view, not in transit
17. Don’t forget the power of the API
class Product {
String productName;
List<Features> ff;
Date introDate;
List<Date> versDates;
int[] unitBundles;
//…
}
If you can, avoid files altogether!
Haskell
18. But if you are creating files: emit
JSON
class Product {
String productName;
List<Features> ff;
Date introDate;
List<Date> versDates;
int[] unitBundles;
//…
}
{
"name": "widget1",
"features": [
{ "text": "good texture",
"type": "A" }
],
"introDate": "20140204",
"versDates": [
"20100103", "20100601"
],
"unitBundles": [1,3,7,9]
// …
}
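A minimal Python sketch of that emission path, using the field names from the slide: build a plain map, then let a standard library serialize it; never hand-assemble JSON strings.

```python
import json

# Mirror of the Product class as a plain dict (values illustrative)
product = {
    "name": "widget1",
    "features": [{"text": "good texture", "type": "A"}],
    "introDate": "20140204",
    "versDates": ["20100103", "20100601"],
    "unitBundles": [1, 3, 7, 9],
}

# One compact JSON document; arrays and nesting survive intact
line = json.dumps(product)
print(line)
```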
19. Let The Feeding System Express
itself
A
B
C
{ "name": "widget1",
"features": [
{ "text": "good texture",
"type": "A" }
]
}
{ "myColors": ["red", "blue"],
"myFloats": [ 3.14159, 2.71828 ],
"nest": { "as": { "deep": true }}}
{ "myBlob": { "$binary": "aGVsbG8K"},
"myDate": { "$date": "20130405" }
}
21. The Joy (and value) of mongoDB
A
Alter table(s)
Extract more
data
LOE = .25x1
B
Alter table(s)
Extract more data
LOE = .25x2
C
Alter table(s)
Extract more data
LOE = .25x3
LOE = O(1)
23. Helpful Hint: Use the APIs
lastDID = None
contact = None
curs.execute("select A.did, A.fullname, B.number from contact A "
             "left outer join phones B on A.did = B.did order by A.did")
for q in curs.fetchall():
    if q[0] != lastDID:
        if lastDID is not None:
            coll.insert(contact)
        contact = {"did": q[0], "name": q[1]}
        lastDID = q[0]
    if q[2] is not None:
        if 'phones' not in contact:
            contact['phones'] = []
        contact['phones'].append({"number": q[2]})
if lastDID is not None:
    coll.insert(contact)
{
"did": "D159308",
"phones": [
{"number": "1-666-444-3333"},
{"number": "1-999-444-3333"},
{"number": "1-999-444-9999"}
],
"name": "Buzz"
}
24. Helpful Hint: Declare Types
Use mongoDB conventions for dates and binary data:
{"dateA": {"$date": "2014-05-16T09:42:57.112-0000"}}
{"dateB": {"$date": 1400617865438}}
{"someBlob": { "$binary": "YmxhIGJsYSBibGE=",
"$type": "00" }}
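These conventions are easy to generate from a feeder program. Here is a small Python sketch (the helper names are mine, not a library API) that wraps a datetime and a blob in the $date / $binary structures shown above.

```python
import base64
import json
from datetime import datetime, timezone

def mongo_date(dt):
    # {"$date": millis-since-epoch} form, as on the slide
    return {"$date": int(dt.timestamp() * 1000)}

def mongo_binary(raw):
    # {"$binary": ..., "$type": "00"}; payload is base64 text
    return {"$binary": base64.b64encode(raw).decode("ascii"),
            "$type": "00"}

doc = {
    "dateB": mongo_date(datetime(2014, 5, 16, tzinfo=timezone.utc)),
    "someBlob": mongo_binary(b"bla bla bla"),
}
print(json.dumps(doc))
```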
25. Helpful Hint: Keep the file flexible
Use CR-delimited JSON:
{ "name": "buzz", "locale": "NY"}
{ "name": "steve", "locale": "UK"}
{ "name": "john", "locale": "NY"}
…instead of a giant array:
records = [
{ "name": "buzz", "locale": "NY"},
{ "name": "steve", "locale": "UK"},
{ "name": "john", "locale": "NY"},
]
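A quick Python sketch of both sides of that contract, using an in-memory buffer as a stand-in for the file: the writer emits one compact document per line, and the reader streams line by line with a constant memory footprint.

```python
import io
import json

people = [
    {"name": "buzz", "locale": "NY"},
    {"name": "steve", "locale": "UK"},
    {"name": "john", "locale": "NY"},
]

buf = io.StringIO()                     # stand-in for an open file
for p in people:
    buf.write(json.dumps(p) + "\n")     # one line per doc, no pretty-printing

# The reader never needs the whole file in memory
buf.seek(0)
loaded = [json.loads(line) for line in buf]
```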
30. Now that we have the data…
You’re well on your way to a single view
consolidation…but first:
– Data Work
• Cross-reference important keys
• Potential scrubbing/cleansing
– Software Stack Work
33. Build THIS!
http://yourcompany/yourapp
Data Access Layer
Object Construction Layer
Basic Functional Layer
Portal Functional Layer
GUI adapter Layer
Web Service Layer
Other Regular
Performance
Applications
Higher Performance
Applications
Special
Generic Applications
34. What Is Happening Next?
Access Control
Data Protection
Auditing
Overview &
Data Analysis
Data Design &
Loading
Strategies
Creating A Single View
Part
1
Part
2
Securing Your
Deployment
Part
3
And why are we doing it at all? Federation? Managed QoS? Because traditional RDBMS dynamics make it difficult to serve a number of access patterns well
The single most important part of this that will make you successful is the simplest – and is part of the mongoDB data environment
ETL fabric: fidelity of data is typically lowest-common-denominator (LCD)
CSV still carries the day because easy to make and technically parse (but difficult to change or express things)
XML / XSD “too hard” to technically make, parse/consume, and harder still to create consistent list/array conventions
Anecdote about getting screwed by the arrow
The arrow is disingenuous!
This is LOSS OF FIDELITY
Most people use an ORM to get from DB to good objects – and mongoDB has a story around that too!
But for the moment, assume we use it.
XML was supposed to be The Thing.
No one runs schema validation in production because of performance
Schemas became too complicated anyway…..
JAXB, JAXP are compile-time bound
XML set us back about 10 years
Leads to this: Can you please just send me a CSV again?
Changes to data in source system imply DB schema upgrade in data hub – with X source systems, this starts to become unscalable
Hub Data storage scalability
In summary: traditionally, common data hubs are harder to manage than the sum of their source systems – which themselves are not so easy to manage!
Remember this formula; we’ll see how we improve upon this in just a bit.
Data entitlement implicit to system access
Fast-moving businesses cannot be held up by naturally slower-moving ones
(Andreas will cover this in greater detail later)
Knowing legacy problems and experience, here are the 3 things that work.
Don’t think about transferring tables; think about transferring products, logs, trades, customers
Cross ref at the SPOV. Especially as the number of feeders grows large, you’ll want to concentrate and control enrichment instead of having potentially dozens of scripts and utils getting involved in the flow. This also vastly simplifies a necessary evil: reconciliation.
----- Meeting Notes (5/19/14 13:31) -----
A zillion APIs.
This does not necessarily mean REALTIME. We can do realtime with “microbatching”. We can do EOD batch with a filefree API. It’s all about how producer and consumer agree to capture the data – we’ll see more about this context later in the presentation.
Our most successful customers do this
or use microbatching.
If direct connect isn’t your bag, feel free to create a web service: but pass JSON to that web service.
JSON is the new leader in highly interoperable, ASCII structured data format
ASCII interop is critical so GPB, Avro, and other formats are out.
Better than XML because
Strings, numbers, maps, and arrays natively supported
Simpler data model (no attributes or unnested content)
Easier to programmatically construct
(Much!) better than CSV because
Rich detail is preserved
Content can be expanded later without struggling with “comma hell”
Warning: JSON does NOT have Date or binary (BLOB) types! We’ll come back to a strategy on that….
WRT actually creating JSON, there are all sorts of options including frameworks that use annotations on your POJOs
BUT: My recommendation observes software engineering 101: have the feeder program build a Map, then use any one of the JSON parser/generators like Jackson to emit it
The Basic Rules:
Let feeder systems drive the data design
Do not dilute, format, or otherwise mess with the data
Schema Design: An entire session could be devoted to schema design. In general,
always embed 1:1
embed “co-lo” 1:n (vectors of bespoke results, contact and phone numbers)
use foreign keys to link 1:n where n is shared by others
use foreign keys for n:n
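The four rules can be illustrated with a pair of hypothetical documents (all names and IDs here are invented for this sketch, not from the deck):

```python
# 1:1 data and 1:n data owned solely by this contact are embedded;
# a 1:n relationship whose "n" side is shared with other documents
# is expressed as a foreign key instead. Names/IDs are hypothetical.
contact = {
    "did": "D159308",
    "name": "Buzz",
    "address": {"city": "NY"},        # 1:1 -> embed
    "phones": [                       # co-lo 1:n -> embed the array
        {"number": "1-666-444-3333", "type": "mobile"},
    ],
    "employerId": "ORG-42",           # shared 1:n -> foreign key
}
employer = {"_id": "ORG-42", "orgName": "Acme"}
```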
JUST ADD IT.
Not talking about doubles turning into lists of dates – but there’s a hint coming up around versioning that could help there too.
If you do this even halfway right, it may be the last feed infra you need to create for this consolidated view.
MUCH easier to update JSON feed handler for new data
Essentially constant time to ingest new or changed data!
No silver bullet or magic about processing the data – but you are no longer wrestling with the database!
Build the rich structure!
You have to do this anyway to produce a JSON file so if you can, go the extra distance and just directly insert the content.
Don’t worry about transactions; you should be using batchID which we’ll get to in a moment.
mongoDB does not extend JSON per se. Rather, within the JSON spec, we have a structural-naming convention that allows us to clearly hint at the true intended type of the string value.
These are natively grok’d by mongoimport, BTW.
By CR delimited we mean no pretty-printing of the JSON.
The computer doesn’t care if it’s pretty or not and
Packing everything on one line allows you to:
Easily read it with a BufferedReader / fread
Easily grep it; standard Unix utils work nicely too
Same format as mongoimport and mongoexport
Does not force large memory footprint on loader
and you can use jq!
We have 100,000 items.
Goal: How many mobile phones are explicitly marked as do-not-call?
Challenge: single person per “greppable” line and phones is an array.
In these 2 lines, there are 5 phones.
Also phones.type is not the same as .type, so grepping for “mobile” leads to peril and very often wrong results
.phones select phones element from doc
But we still have it as an ARRAY
[] “flattens” out the array to be a set of documents! (just like $unwind in the mongoDB agg framework)
jq operations are very rich. You can redact/replace fields, add brand new fields to output, etc.
The –c option produces CR-delimited JSON
JSON compresses very well (like one FIFTH the space) so go ahead and gzip -9 the JSON and decompress on the fly into jq!
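The counting task in these notes can also be sketched in Python against the same newline-delimited JSON (the doNotCall field name is assumed here, not from the deck): unwind each phones array, as jq's .phones[] or the aggregation framework's $unwind would, and filter on the nested type, which a raw grep for "mobile" cannot do safely.

```python
import json

# Two newline-delimited JSON records in the notes' scenario;
# the doNotCall field name is hypothetical.
lines = [
    '{"name": "buzz", "phones": [{"number": "1-666-444-3333", '
    '"type": "mobile", "doNotCall": true}, '
    '{"number": "1-999-444-3333", "type": "home"}]}',
    '{"name": "steve", "phones": [{"number": "1-555-111-2222", '
    '"type": "mobile", "doNotCall": false}]}',
]

# Unwind each phones array, then count only mobile numbers
# explicitly marked do-not-call
count = sum(
    1
    for line in lines
    for phone in json.loads(line).get("phones", [])
    if phone.get("type") == "mobile" and phone.get("doNotCall") is True
)
print(count)  # 1
```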
Don’t be afraid to make mistakes – for the same reason we explored on slide 21.
Context is an identifier for a set of data: ABC123
Dates are dangerous
For global systems, two (or more!) local dates possible.
System processing date can be misleading
Context has additional benefits
Easy to associate other information with context ID like functional ID
Single View of Customer does not mean Single Technical visualization of Customer thru GUI!!