Hi we’re Josh and Jordan
1
And we work for MIOsoft. We’re here to give you a peek at some of the R&D MIOsoft has been doing since 1998.
2
MIOsoft has its world and R&D headquarters in Madison, and major offices in Hamburg and Beijing. We started in 1998 with a project for CUNA
Mutual Group.
3
When you ask us what MIOsoft does, we typically throw around a lot of technical terms. Or buzzwords.
4
But really, what we’re focused on is data.
5
And it’s everywhere. In business systems. In sensors. In machines.
6
Companies are just now starting to realize this. And they’re trying to gather it up as cheaply as possible, hoping to derive some insight or discover
some new business line.
7
It’s like the gold rush. Everyone’s digging in their data for something.
8
The real value of the data is what sorts of relationships it can show you as a business. You want to find relationships between pieces of data so
you can contextually understand entities like customers. You want to find statistical relationships between business outcomes and contextual
state, so you can influence future business outcomes.
9
From the business side, they’re looking for a silver bullet. They want prescriptive analytics to enable proactive service. For a customer-centric
business, you really want to treat every individual event or encounter in the context of the customer. For instance, in a coffee shop the barista can be
advised to start making a customer’s favorite coffee automatically, or a notification can be sent to the customer’s phone with a list of their favorite
items, sorted from most likely to least likely, with an offer to start putting the order together and to pay without waiting in line. This would also be
an opportunity to introduce the customer to something new, based on discoveries made by observing other, similar customers.
10
In many ways, this is what we mean by contextual understanding. We call it context.
11
Context is incredibly valuable, and analysts are starting to recognize that.
12
When you can quickly discover and operate on contextually related data, development effort that would ordinarily go into hard technology
problems and architectural components can instead be focused on business value and business use cases.
13
From an architecture and development perspective, we define context as containing all the data, logic, and relationships you need to understand
the state of an entity at a particular point in time, and to provide relevant services such as concise views and proactive action.
14
15
Rather than modelling *structurally related* data, like in tables, we want to model *contextually related* data together. We call it a context: a
graph of closely related objects that can additionally have relationships to other contexts.
16
A context model is essentially an object model with boundaries defined by relationships. This is convenient for OO programming. And like an
object model, it includes behavior.
17
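To make the idea concrete, here’s a minimal sketch of a context as an object graph with behavior. The class names (`CustomerContext`, `Bill`) and fields are purely illustrative assumptions, not MIOsoft’s actual model or API.

```python
from dataclasses import dataclass, field

@dataclass
class Bill:
    amount: float
    paid: bool = False

@dataclass
class CustomerContext:
    name: str
    # Near-related objects live inside the context's graph.
    bills: list = field(default_factory=list)
    # Far relationships point at other contexts.
    related_contexts: list = field(default_factory=list)

    # Like an object model, a context includes behavior with its data.
    def outstanding_balance(self) -> float:
        return sum(b.amount for b in self.bills if not b.paid)

ctx = CustomerContext("Alice")
ctx.bills.append(Bill(120.0))
ctx.bills.append(Bill(45.5, paid=True))
print(ctx.outstanding_balance())  # 120.0
```

An application asking about Alice gets the whole graph plus its behavior, rather than reassembling rows from several tables.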
Context is exactly what we want to serve to applications, so they can leverage that unified understanding of an entity.
18
The first step in representing context is figuring out how to store and retrieve the data.
19
We want hard disk because it’s high-capacity and cheap.
20
Ok, let’s just say that’s settled for the moment. But it’s not just the cost of the hard disk. It’s the whole infrastructure.
21
And we know that commodity hardware is way cheaper than specialized hardware.
22
But 48 TB, although a lot of space, is hardly petabytes.
23
This must be economical at scale. When we say scale, think IoT + social + traditional. Think petabytes at the bottom end.
24
Think distributed.
25
Lucky for us, context can also be our unit of distribution. By spreading contexts across several servers, we can leverage the combined storage of
commodity servers without sharding the database in an unnatural way, as relational databases do. Since our contexts have behavior, we also have
a way to distribute processing.
26
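A toy sketch of contexts as the unit of distribution: hash each context’s identifier to pick one of N commodity servers, so storage (and processing) spreads across the cluster. The server names and the simple modulo scheme are illustrative assumptions; a production system would need rebalancing-friendly placement.

```python
import hashlib

SERVERS = ["server-0", "server-1", "server-2", "server-3"]

def server_for(context_id: str) -> str:
    """Deterministically map a context to one server in the cluster."""
    digest = hashlib.sha256(context_id.encode()).hexdigest()
    return SERVERS[int(digest, 16) % len(SERVERS)]

# Every request about the same context lands on the same server,
# so that server owns both the context's storage and its behavior.
print(server_for("customer:alice"))
```

Because a context is self-contained, no cross-server joins are needed to answer questions about the entity it represents.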
Once those contexts are distributed, however, we need to be able to quickly retrieve them if we’re going to have any hope of an application using
the database.
27
Like we said, we want hard disk because it’s high-capacity and cheap. But it’s slow.
28
It’s slow because of physics. The disk has to spin and the head has to move physically to access a new region of the disk.
29
How slow?
Say you want to assemble an application screen and it takes 100 queries (not out of line). At 5 ms per seek, that’s a dedicated half second if
nothing else is going on. You could easily wait several seconds when multiple users are accessing the system.
Or say you want to do some ad-hoc analysis and aggregation on some sensor events. If each requires a seek, one million would take 83 minutes,
and one billion would take 58 days.
30
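The arithmetic behind those numbers, assuming the 5 ms average seek time from the slide:

```python
SEEK_MS = 5  # assumed average seek time per random disk access

screen_seconds = 100 * SEEK_MS / 1000                      # one app screen
million_minutes = 1_000_000 * SEEK_MS / 1000 / 60          # 1M seeks
billion_days = 1_000_000_000 * SEEK_MS / 1000 / 86400      # 1B seeks

print(screen_seconds)        # 0.5 seconds
print(round(million_minutes))  # 83 minutes
print(round(billion_days))     # 58 days
```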
Since we’re logically clustering data anyway, why not just physically cluster it? That customer-360-degree-view app now only needs one disk seek
to get everything needed to populate a page.
31
We also, however, want to do analysis across many contexts.
32
At a very basic level, since each server is in charge of some number of contexts, and we are interested in potentially most of the contexts in the
system, we can just have that server read in disk physical order to reduce seeks.
33
Of course, this is going to fight our operational queries and updates. So how do we minimize contention between transactions, analytics, and
queries?
34
Most databases use write locking, which usually means that if something is writing, nothing else can read. But even if we thought locks were
great, maintaining locks at a large scale can cause a huge performance hit. We could choose to lock an entire context at a time to reduce
individual object locks.
35
But we’re still going to be wasting valuable time coordinating who goes next, and arbitrarily telling requestors to wait.
36
So if we’re going to let transactions, queries, and analytics run, we’re going to be bold and get rid of locks altogether.
37
We queue up transactions to operate one at a time on a context. When a transaction is committed to the queue, it is guaranteed to run, and if the
requestor stops listening at that point, it’s essentially eventually consistent. However, if the requestor wishes, they can wait until they get
confirmation that the transaction actually ran.
38
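A minimal sketch of per-context transaction queueing, assuming transactions can be modeled as functions from state to state. The `ContextQueue` class is hypothetical; the point is that serializing writes per context removes the need for locks.

```python
from collections import deque

class ContextQueue:
    """Run transactions on one context strictly one at a time, in order."""

    def __init__(self, state):
        self.state = state
        self.pending = deque()

    def commit(self, txn):
        # Once enqueued, the transaction is guaranteed to run eventually.
        self.pending.append(txn)

    def run_next(self):
        # A caller that wants confirmation waits for this to happen;
        # one that walks away gets eventual consistency.
        if self.pending:
            self.state = self.pending.popleft()(self.state)
        return self.state

q = ContextQueue(state=0)
q.commit(lambda s: s + 1)
q.commit(lambda s: s * 10)
q.run_next()
print(q.run_next())  # 10
```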
So how can we prevent a backlog from getting out of control? Presumably, if we keep stopping, we’ll just exacerbate the situation by introducing
extra wait time for everyone in line.
39
So instead, we actually use this stoppage to our benefit. We can batch up the transactions targeting a particular context and run them at the
same time, with one write to disk.
40
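The batching idea, sketched under the same state-as-function assumption as above: drain the whole backlog for a context in memory and persist the final state once. `write_to_disk` is a stand-in for the single physical write.

```python
writes = []

def write_to_disk(state):
    """Stand-in for one physical disk write of the context's state."""
    writes.append(state)

def run_batch(state, backlog):
    # Apply every queued transaction in order, in memory...
    for txn in backlog:
        state = txn(state)
    # ...then pay for exactly one write for the whole batch.
    write_to_disk(state)
    return state

final = run_batch(0, [lambda s: s + 1, lambda s: s + 2, lambda s: s * 3])
print(final, len(writes))  # 9 1
```

Three transactions, one write: the longer the line gets, the more work each disk write amortizes.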
Now for the hard part.
41
How do we determine related objects from any source and cluster them in the first place? It turns out this is actually fairly difficult, as evidenced
by the fact that, according to Forrester Research, enterprises analyze only 12% of their existing data on average. And that’s only the data they
already have.
42
What we want is to pour the data in and let the database figure it out. We want relationship discovery.
43
44
When we say “near” relationships, we mean data that belongs in the same context. If it’s a customer context, for instance, then it’s about the
same customer, including supplementary data like details about a customer’s bills or perhaps recent GPS location. Near-related data might
include overlapping or conflicting pieces of information, even though there’s theoretically a single contextual identity.
45
We call pieces of data that are incomplete representations of a context “context fragments.” The goal is to recognize when context fragments are
related enough to put them in the same cluster.
46
To do this we use a distance-based, unsupervised machine-learning clustering engine.
47
There are many different ways you can define distance between two items. It could be geospatial, or even temporal or phonetic.
48
For instance, here’s an example of distance-based clustering of fatal accidents. It takes into account not only geospatial closeness, but also
whether the accidents happened on the same street or at an intersection.
49
There are a few concepts that make “near” relationship discovery really powerful. First, we allow context fragments to match transitively, meaning
two fragments may be brought together if an intermediate piece of data is closely related to each.
50
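Transitive matching can be sketched as single-linkage clustering with a union-find structure: fragments A and C land in the same cluster whenever each is within the distance threshold of some intermediate B, even if A and C are far apart directly. The geometric distance and the 1.5 threshold are made-up illustrations, not MIOsoft’s actual engine.

```python
import math

def cluster(points, threshold):
    """Single-linkage clustering of 2D points via union-find."""
    parent = list(range(len(points)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    # Union every pair that is "near" under the distance threshold.
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            if math.dist(points[i], points[j]) <= threshold:
                parent[find(i)] = find(j)

    clusters = {}
    for i in range(len(points)):
        clusters.setdefault(find(i), []).append(points[i])
    return list(clusters.values())

# (0,0) and (2,0) are 2.0 apart -- over the threshold -- but both are
# within 1.5 of (1,0), so the chain forms one cluster; (10,10) stays alone.
print(cluster([(0, 0), (1, 0), (2, 0), (10, 10)], threshold=1.5))
```

In a real contextualization engine the distance function could just as well be temporal or phonetic, as the earlier slide notes.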
For instance, in this example the top-left crash is on a different street and too far away from the bottom-most crashes to match directly. However,
it is linked through the intermediate crashes, which individually matched adjacent crashes in some way.
51
We also provide for provenance.
52
Provenance meaning where a fragment came from and how it ended up in a certain context.
53
And provenance meaning the preservation of the fragment for the future, in case it’s needed.
54
Combining transitivity and provenance allows new fragments entering the system to cause contexts to merge or split, so the system discovers a
better understanding of near relationships as new data becomes available.
55
The data may also imply a relationship with another context. That’s a “far” relationship.
56
So when a context fragment implies a relationship with another context, how do we find that context and build the relationship? Again, the data
might be incomplete or inconsistent. For instance, it could be just an identifier or a name.
57
So we’re going to use the same method. We’re going to inject a special fragment, called a grapple fragment, with the relationship data. That way,
it can use the same contextualization process to find the most appropriate context, if one is available.
58
When it reaches a context, it will grapple back to the originating related context, and a relationship will be formed.
59
Relationships are first class citizens in our database model, so immediately both contexts will be aware of their new relationship. Think of
intelligence analysis. Connecting the dots.
60
Since we use the same mechanism, we get the benefits of transitivity and provenance for relationships. For instance, if a new fragment arrives
about an entity where only a grapple fragment existed previously, that context will be created and the related context will be grappled to
automatically. In fact, the grapple fragment itself can cause contextualization of data, like context merging.
61
Questions.
62
There’s also a bunch of other pieces that are needed to make the system useful for mission-critical, large enterprise systems. Operations and
management tools, elasticity, containerization, conflict resolution, automatic denormalization of summary data across relationships, advanced
indexing, parallel aggregation, data curation tools, etc.
63
It’s a big project. It’s over 15 years of advanced R&D. And we’re now just really starting to talk about the technology beneath the business
solutions.
64
65

More Related Content

Viewers also liked

Self-Service Access and Exploration of Big Data
Self-Service Access and Exploration of Big DataSelf-Service Access and Exploration of Big Data
Self-Service Access and Exploration of Big DataInside Analysis
 
10 razones para quiebran un emprendimiento (2)
10 razones para quiebran un emprendimiento (2)10 razones para quiebran un emprendimiento (2)
10 razones para quiebran un emprendimiento (2)Ronald Quiros
 
Convergence and Interoperability (IFLA 2011)
Convergence and Interoperability (IFLA 2011)Convergence and Interoperability (IFLA 2011)
Convergence and Interoperability (IFLA 2011)Figoblog
 
Data Lake vs. Data Warehouse: Which is Right for Healthcare?
Data Lake vs. Data Warehouse: Which is Right for Healthcare?Data Lake vs. Data Warehouse: Which is Right for Healthcare?
Data Lake vs. Data Warehouse: Which is Right for Healthcare?Health Catalyst
 
Enterprise Data Hub: The Next Big Thing in Big Data
Enterprise Data Hub: The Next Big Thing in Big DataEnterprise Data Hub: The Next Big Thing in Big Data
Enterprise Data Hub: The Next Big Thing in Big DataCloudera, Inc.
 
Big Data = Bigger Metadata
Big Data = Bigger MetadataBig Data = Bigger Metadata
Big Data = Bigger MetadataIan White
 
Master Data Management methodology
Master Data Management methodologyMaster Data Management methodology
Master Data Management methodologyDatabase Architechs
 

Viewers also liked (8)

3 dw architectures
3 dw architectures3 dw architectures
3 dw architectures
 
Self-Service Access and Exploration of Big Data
Self-Service Access and Exploration of Big DataSelf-Service Access and Exploration of Big Data
Self-Service Access and Exploration of Big Data
 
10 razones para quiebran un emprendimiento (2)
10 razones para quiebran un emprendimiento (2)10 razones para quiebran un emprendimiento (2)
10 razones para quiebran un emprendimiento (2)
 
Convergence and Interoperability (IFLA 2011)
Convergence and Interoperability (IFLA 2011)Convergence and Interoperability (IFLA 2011)
Convergence and Interoperability (IFLA 2011)
 
Data Lake vs. Data Warehouse: Which is Right for Healthcare?
Data Lake vs. Data Warehouse: Which is Right for Healthcare?Data Lake vs. Data Warehouse: Which is Right for Healthcare?
Data Lake vs. Data Warehouse: Which is Right for Healthcare?
 
Enterprise Data Hub: The Next Big Thing in Big Data
Enterprise Data Hub: The Next Big Thing in Big DataEnterprise Data Hub: The Next Big Thing in Big Data
Enterprise Data Hub: The Next Big Thing in Big Data
 
Big Data = Bigger Metadata
Big Data = Bigger MetadataBig Data = Bigger Metadata
Big Data = Bigger Metadata
 
Master Data Management methodology
Master Data Management methodologyMaster Data Management methodology
Master Data Management methodology
 

Similar to Big Data Madison: Architecting for Big Data (with notes)

Intralinks Uses Hybrid Computing to Blaze a Compliance Trail Across the Regul...
Intralinks Uses Hybrid Computing to Blaze a Compliance Trail Across the Regul...Intralinks Uses Hybrid Computing to Blaze a Compliance Trail Across the Regul...
Intralinks Uses Hybrid Computing to Blaze a Compliance Trail Across the Regul...Dana Gardner
 
Delivering on the Promise of Big Data and the Cloud
Delivering on the Promise of Big Data and the CloudDelivering on the Promise of Big Data and the Cloud
Delivering on the Promise of Big Data and the CloudBooz Allen Hamilton
 
Many Sources, Many Sinks, One Stream With Joel Eaton | Current 2022
Many Sources, Many Sinks, One Stream With Joel Eaton | Current 2022Many Sources, Many Sinks, One Stream With Joel Eaton | Current 2022
Many Sources, Many Sinks, One Stream With Joel Eaton | Current 2022HostedbyConfluent
 
IT Performance Management Handbook for CIOs
IT Performance Management Handbook for CIOsIT Performance Management Handbook for CIOs
IT Performance Management Handbook for CIOsVikram Ramesh
 
A strategy for security data analytics - SIRACon 2016
A strategy for security data analytics - SIRACon 2016A strategy for security data analytics - SIRACon 2016
A strategy for security data analytics - SIRACon 2016Jon Hawes
 
Data lakehouse fallacies
 Data lakehouse fallacies Data lakehouse fallacies
Data lakehouse fallaciesNeil Raden
 
Scoda project companygraph
Scoda project companygraphScoda project companygraph
Scoda project companygraphTony Hirst
 
Scoda company networks2
Scoda company networks2Scoda company networks2
Scoda company networks2Tony Hirst
 
How the Journey to Modern Data Management is Paved with an Inclusive Edge-to-...
How the Journey to Modern Data Management is Paved with an Inclusive Edge-to-...How the Journey to Modern Data Management is Paved with an Inclusive Edge-to-...
How the Journey to Modern Data Management is Paved with an Inclusive Edge-to-...Dana Gardner
 
Gerenral insurance Accounts IT and Investment
Gerenral insurance Accounts IT and InvestmentGerenral insurance Accounts IT and Investment
Gerenral insurance Accounts IT and Investmentvijayk23x
 
Know your cirrus from your cumulus (with notes)
Know your cirrus from your cumulus (with notes)Know your cirrus from your cumulus (with notes)
Know your cirrus from your cumulus (with notes)Andrew Phillips
 
Qlik’s CTO on Why the Cloud Data Diaspora Forces Businesses To Rethink their ...
Qlik’s CTO on Why the Cloud Data Diaspora Forces Businesses To Rethink their ...Qlik’s CTO on Why the Cloud Data Diaspora Forces Businesses To Rethink their ...
Qlik’s CTO on Why the Cloud Data Diaspora Forces Businesses To Rethink their ...Dana Gardner
 
Efficient Association Rule Mining in Heterogeneous Data Base
Efficient Association Rule Mining in Heterogeneous Data BaseEfficient Association Rule Mining in Heterogeneous Data Base
Efficient Association Rule Mining in Heterogeneous Data BaseIJTET Journal
 
12 Things the Semantic Web Should Know about Content Analytics
12 Things the Semantic Web Should Know about Content Analytics12 Things the Semantic Web Should Know about Content Analytics
12 Things the Semantic Web Should Know about Content AnalyticsSeth Grimes
 
BIAM 410 Final Paper - Beyond the Buzzwords: Big Data, Machine Learning, What...
BIAM 410 Final Paper - Beyond the Buzzwords: Big Data, Machine Learning, What...BIAM 410 Final Paper - Beyond the Buzzwords: Big Data, Machine Learning, What...
BIAM 410 Final Paper - Beyond the Buzzwords: Big Data, Machine Learning, What...Thomas Rones
 
zenoh: The Edge Data Fabric
zenoh: The Edge Data Fabriczenoh: The Edge Data Fabric
zenoh: The Edge Data FabricAngelo Corsaro
 
101 ways to fail at security analytics ... and how not to do that - BSidesLV ...
101 ways to fail at security analytics ... and how not to do that - BSidesLV ...101 ways to fail at security analytics ... and how not to do that - BSidesLV ...
101 ways to fail at security analytics ... and how not to do that - BSidesLV ...Jon Hawes
 
Webcast Q&A- Big Data Architectures Beyond Hadoop
Webcast Q&A- Big Data Architectures Beyond HadoopWebcast Q&A- Big Data Architectures Beyond Hadoop
Webcast Q&A- Big Data Architectures Beyond HadoopImpetus Technologies
 

Similar to Big Data Madison: Architecting for Big Data (with notes) (20)

Intralinks Uses Hybrid Computing to Blaze a Compliance Trail Across the Regul...
Intralinks Uses Hybrid Computing to Blaze a Compliance Trail Across the Regul...Intralinks Uses Hybrid Computing to Blaze a Compliance Trail Across the Regul...
Intralinks Uses Hybrid Computing to Blaze a Compliance Trail Across the Regul...
 
Delivering on the Promise of Big Data and the Cloud
Delivering on the Promise of Big Data and the CloudDelivering on the Promise of Big Data and the Cloud
Delivering on the Promise of Big Data and the Cloud
 
Many Sources, Many Sinks, One Stream With Joel Eaton | Current 2022
Many Sources, Many Sinks, One Stream With Joel Eaton | Current 2022Many Sources, Many Sinks, One Stream With Joel Eaton | Current 2022
Many Sources, Many Sinks, One Stream With Joel Eaton | Current 2022
 
IT Performance Management Handbook for CIOs
IT Performance Management Handbook for CIOsIT Performance Management Handbook for CIOs
IT Performance Management Handbook for CIOs
 
Database Essay
Database EssayDatabase Essay
Database Essay
 
A strategy for security data analytics - SIRACon 2016
A strategy for security data analytics - SIRACon 2016A strategy for security data analytics - SIRACon 2016
A strategy for security data analytics - SIRACon 2016
 
Data lakehouse fallacies
 Data lakehouse fallacies Data lakehouse fallacies
Data lakehouse fallacies
 
Scoda project companygraph
Scoda project companygraphScoda project companygraph
Scoda project companygraph
 
Scoda company networks2
Scoda company networks2Scoda company networks2
Scoda company networks2
 
How the Journey to Modern Data Management is Paved with an Inclusive Edge-to-...
How the Journey to Modern Data Management is Paved with an Inclusive Edge-to-...How the Journey to Modern Data Management is Paved with an Inclusive Edge-to-...
How the Journey to Modern Data Management is Paved with an Inclusive Edge-to-...
 
Gerenral insurance Accounts IT and Investment
Gerenral insurance Accounts IT and InvestmentGerenral insurance Accounts IT and Investment
Gerenral insurance Accounts IT and Investment
 
Know your cirrus from your cumulus (with notes)
Know your cirrus from your cumulus (with notes)Know your cirrus from your cumulus (with notes)
Know your cirrus from your cumulus (with notes)
 
Qlik’s CTO on Why the Cloud Data Diaspora Forces Businesses To Rethink their ...
Qlik’s CTO on Why the Cloud Data Diaspora Forces Businesses To Rethink their ...Qlik’s CTO on Why the Cloud Data Diaspora Forces Businesses To Rethink their ...
Qlik’s CTO on Why the Cloud Data Diaspora Forces Businesses To Rethink their ...
 
Efficient Association Rule Mining in Heterogeneous Data Base
Efficient Association Rule Mining in Heterogeneous Data BaseEfficient Association Rule Mining in Heterogeneous Data Base
Efficient Association Rule Mining in Heterogeneous Data Base
 
12 Things the Semantic Web Should Know about Content Analytics
12 Things the Semantic Web Should Know about Content Analytics12 Things the Semantic Web Should Know about Content Analytics
12 Things the Semantic Web Should Know about Content Analytics
 
BIAM 410 Final Paper - Beyond the Buzzwords: Big Data, Machine Learning, What...
BIAM 410 Final Paper - Beyond the Buzzwords: Big Data, Machine Learning, What...BIAM 410 Final Paper - Beyond the Buzzwords: Big Data, Machine Learning, What...
BIAM 410 Final Paper - Beyond the Buzzwords: Big Data, Machine Learning, What...
 
Data lake ppt
Data lake pptData lake ppt
Data lake ppt
 
zenoh: The Edge Data Fabric
zenoh: The Edge Data Fabriczenoh: The Edge Data Fabric
zenoh: The Edge Data Fabric
 
101 ways to fail at security analytics ... and how not to do that - BSidesLV ...
101 ways to fail at security analytics ... and how not to do that - BSidesLV ...101 ways to fail at security analytics ... and how not to do that - BSidesLV ...
101 ways to fail at security analytics ... and how not to do that - BSidesLV ...
 
Webcast Q&A- Big Data Architectures Beyond Hadoop
Webcast Q&A- Big Data Architectures Beyond HadoopWebcast Q&A- Big Data Architectures Beyond Hadoop
Webcast Q&A- Big Data Architectures Beyond Hadoop
 

Recently uploaded

April 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's AnalysisApril 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's Analysismanisha194592
 
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Callshivangimorya083
 
BigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptxBigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptxolyaivanovalion
 
Carero dropshipping via API with DroFx.pptx
Carero dropshipping via API with DroFx.pptxCarero dropshipping via API with DroFx.pptx
Carero dropshipping via API with DroFx.pptxolyaivanovalion
 
Ukraine War presentation: KNOW THE BASICS
Ukraine War presentation: KNOW THE BASICSUkraine War presentation: KNOW THE BASICS
Ukraine War presentation: KNOW THE BASICSAishani27
 
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Callshivangimorya083
 
Schema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfSchema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfLars Albertsson
 
Log Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptxLog Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptxJohnnyPlasten
 
Data-Analysis for Chicago Crime Data 2023
Data-Analysis for Chicago Crime Data  2023Data-Analysis for Chicago Crime Data  2023
Data-Analysis for Chicago Crime Data 2023ymrp368
 
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service Amravati
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service AmravatiVIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service Amravati
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service AmravatiSuhani Kapoor
 
Generative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and MilvusGenerative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and MilvusTimothy Spann
 
Edukaciniai dropshipping via API with DroFx
Edukaciniai dropshipping via API with DroFxEdukaciniai dropshipping via API with DroFx
Edukaciniai dropshipping via API with DroFxolyaivanovalion
 
RA-11058_IRR-COMPRESS Do 198 series of 1998
RA-11058_IRR-COMPRESS Do 198 series of 1998RA-11058_IRR-COMPRESS Do 198 series of 1998
RA-11058_IRR-COMPRESS Do 198 series of 1998YohFuh
 
Low Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service Bhilai
Low Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service BhilaiLow Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service Bhilai
Low Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service BhilaiSuhani Kapoor
 
Mature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptxMature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptxolyaivanovalion
 
04242024_CCC TUG_Joins and Relationships
04242024_CCC TUG_Joins and Relationships04242024_CCC TUG_Joins and Relationships
04242024_CCC TUG_Joins and Relationshipsccctableauusergroup
 
定制英国白金汉大学毕业证(UCB毕业证书) 成绩单原版一比一
定制英国白金汉大学毕业证(UCB毕业证书)																			成绩单原版一比一定制英国白金汉大学毕业证(UCB毕业证书)																			成绩单原版一比一
定制英国白金汉大学毕业证(UCB毕业证书) 成绩单原版一比一ffjhghh
 

Recently uploaded (20)

April 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's AnalysisApril 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's Analysis
 
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
 
BigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptxBigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptx
 
Carero dropshipping via API with DroFx.pptx
Carero dropshipping via API with DroFx.pptxCarero dropshipping via API with DroFx.pptx
Carero dropshipping via API with DroFx.pptx
 
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in Kishangarh
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in  KishangarhDelhi 99530 vip 56974 Genuine Escort Service Call Girls in  Kishangarh
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in Kishangarh
 
Ukraine War presentation: KNOW THE BASICS
Ukraine War presentation: KNOW THE BASICSUkraine War presentation: KNOW THE BASICS
Ukraine War presentation: KNOW THE BASICS
 
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
 
Schema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfSchema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdf
 
Log Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptxLog Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptx
 
VIP Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Boo...
VIP Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Boo...VIP Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Boo...
VIP Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Boo...
 
Data-Analysis for Chicago Crime Data 2023
Data-Analysis for Chicago Crime Data  2023Data-Analysis for Chicago Crime Data  2023
Data-Analysis for Chicago Crime Data 2023
 
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service Amravati
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service AmravatiVIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service Amravati
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service Amravati
 
Generative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and MilvusGenerative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and Milvus
 
Edukaciniai dropshipping via API with DroFx
Edukaciniai dropshipping via API with DroFxEdukaciniai dropshipping via API with DroFx
Edukaciniai dropshipping via API with DroFx
 
Sampling (random) method and Non random.ppt
Sampling (random) method and Non random.pptSampling (random) method and Non random.ppt
Sampling (random) method and Non random.ppt
 
RA-11058_IRR-COMPRESS Do 198 series of 1998
RA-11058_IRR-COMPRESS Do 198 series of 1998RA-11058_IRR-COMPRESS Do 198 series of 1998
RA-11058_IRR-COMPRESS Do 198 series of 1998
 
Low Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service Bhilai
Low Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service BhilaiLow Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service Bhilai
Low Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service Bhilai
 
Mature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptxMature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptx
 
04242024_CCC TUG_Joins and Relationships
04242024_CCC TUG_Joins and Relationships04242024_CCC TUG_Joins and Relationships
04242024_CCC TUG_Joins and Relationships
 
定制英国白金汉大学毕业证(UCB毕业证书) 成绩单原版一比一
定制英国白金汉大学毕业证(UCB毕业证书)																			成绩单原版一比一定制英国白金汉大学毕业证(UCB毕业证书)																			成绩单原版一比一
定制英国白金汉大学毕业证(UCB毕业证书) 成绩单原版一比一
 

Big Data Madison: Architecting for Big Data (with notes)

  • 1. Hi we’re Josh and Jordan 1
  • 2. And we work for MIOsoft. We’re here to give you a peek at some of the R&D MIOsoft has been doing since 1998. . 2
  • 3. MIOsoft has its world and R&D headquarters in Madison, and major offices in Hamburg and Beijing. We started in 1998 with a project for CUNA Mutual Group. 3
  • 4. When you ask us what MIOsoft does, we typically throw around a lot of technical terms. Or buzz words. 4
  • 5. But really, what we’re focused on is data. 5
  • 6. And it’s everywhere. In business systems. In sensors. In machines. 6
  • 7. Companies are just now starting to realize this. And they are trying to gather it up as cheap as possible, hoping to derive some insight or discover some new business line. 7
  • 8. Its like the gold rush. Everyone’s digging in their data for something. 8
  • 9. The real value of the data is what sorts of relationships it can show you as a business. You want to find relationships between pieces of data so you can contextually understand entities like customers. You want to find statistical relationships between business outcomes and contextual state, so you can influence future business outcomes. 9
  • 10. From the business side, they’re looking for a silver bullet. They want prescriptive analytics to enable proactive service. For a customer-centric business, you really want to treat every individual event or encounter in context of the customer. For instance, in a coffee shop the barista can be advised to automatically start making their favorite coffee, or a notification can be sent to the customer’s phone with a list of their favorite items sorted from most likely to least likely with an offer to start putting the order together and to pay without having to wait in line. This would also be an opportunity to introduce the customer to something new based on discoveries made by observing other similar customers. 10
  • 11. In many ways, this is what we mean by contextual understanding. We call it context. 11
  • 12. Context is incredibly valuable, and analysts are starting to recognize that. 12
  • 13. When you can quickly discover and operate on contextually related data, ordinarily technologically hard development can instead be focused on business value and business use cases instead of architectural components. 13
  • 14. From an architecture and development perspective, we define context as containing all the data, logic, and relationships you need to understand the state of an entity at a particular point in time, and to provide relevant services such as concise views and proactive action. 14
  • 15. 15
  • 16. Rather than modelling *structurally related* data like in tables, we want to model *contextually related* data together. We call it a context: a graph of closely related objects, that can additionally have relationships to other contexts. 16
  • 17. A context model is essentially an object model with boundaries defined by relationships. This is convenient for OO programming. And like an object model, it includes behavior. 17
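A context model like this can be sketched in plain object-oriented code. This is a minimal illustration, not MIOsoft’s actual API; the `CustomerContext` and `Bill` names are hypothetical:

```python
from dataclasses import dataclass, field

@dataclass
class Bill:
    amount: float
    due: str

@dataclass
class CustomerContext:
    """A context: a graph of closely related objects, with behavior."""
    name: str
    # Objects inside the context boundary ("near" data).
    bills: list = field(default_factory=list)
    # References to other contexts ("far" relationships).
    related: list = field(default_factory=list)

    def balance(self) -> float:
        # Behavior lives alongside the data, as in any object model.
        return sum(b.amount for b in self.bills)

c = CustomerContext("Ada")
c.bills.append(Bill(42.0, "2015-06-01"))
print(c.balance())  # 42.0
```

The boundary is defined by the relationships: everything reachable inside `bills` belongs to this context, while `related` only points across to other contexts.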
  • 18. Context is exactly what we want to serve to applications, so they can leverage that unified understanding of an entity. 18
  • 19. First step to representing context is figuring out how to store and retrieve the data. 19
  • 20. We want hard disk because it’s high-capacity and cheap. 20
  • 21. Ok, let’s just say that’s settled for the moment. But it’s not just the cost of the hard disk. It’s the whole infrastructure. 21
  • 22. And we know that commodity hardware is way cheaper than specialized hardware. 22
  • 23. But 48 TB, although a lot of space, is hardly petabytes. 23
  • 24. This must be economical at scale. When we say scale, think IoT + social + traditional. Think petabytes at the bottom end. 24
  • 26. Lucky for us. Context can also be our unit of distribution. By spreading contexts across several servers, we can leverage the combined storage of commodity servers without sharding the database in an unnatural way like in relational. Since our contexts have behavior, we also have a way to distribute processing 26
  • 27. Once those contexts are distributed, however, we need to be able to quickly retrieve them if we’re going to have any hope of an application using the database 27
  • 28. Like we said, we want hard disk because it’s high-capacity and cheap. But it’s slow. 28
  • 29. It’s slow because of physics. The disk has to spin and the head has to move physically to access a new region of the disk. 29
  • 30. How slow? Say you want to assemble an application screen and it takes 100 queries (not out of line); at 5 ms per seek that’s a dedicated half second if nothing else is going on. You could easily wait several seconds when multiple users are accessing it. Or say you want to do some ad-hoc analysis and aggregation on some sensor events. If each requires a seek, one million would take 83 minutes; one billion would take 58 days. 30
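The seek arithmetic above checks out, assuming a 5 ms average seek:

```python
SEEK_MS = 5  # assumed average seek time

screen  = 100 * SEEK_MS / 1000                       # one app screen, seconds
million = 1_000_000 * SEEK_MS / 1000 / 60            # 1M seeks, minutes
billion = 1_000_000_000 * SEEK_MS / 1000 / 86_400    # 1B seeks, days

print(screen)          # 0.5
print(round(million))  # 83
print(round(billion))  # 58
```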
  • 31. Since we’re logically clustering data anyway, why not just physically cluster it? That Customer-360-degree-view app now only needs 1 disk seek to get everything needed to populate a page. 31
  • 32. We also, however want to do analysis across many contexts. 32
  • 33. At a very basic level, since each server is in charge of some number of contexts, and we are interested in potentially most of the contexts in the system, we can just have that server read in disk physical order to reduce seeks. 33
  • 34. Of course, this is going to fight our operational queries and updates. So how do we minimize contention between transactions, analytics, and queries? 34
  • 35. Most databases use write locking. Which usually means that while something is writing, nothing else can read. But even if we thought locks were great, maintaining locks on a large scale can cause a huge performance hit. We could choose to lock an entire context at a time to reduce individual object locks. 35
  • 36. But we’re still going to be wasting valuable time coordinating who goes next. And arbitrarily telling requestors to wait. 36
  • 37. So if we’re going to let transactions, queries, and analytics run, we’re going to be bold and get rid of locks altogether. 37
  • 38. We queue up transactions to operate one at a time on a context. When a transaction is committed to the queue it is guaranteed to run, and if the requestor stops listening at that point it’s essentially eventually consistent. However, if the requestor wishes, they can wait until they get confirmation that the transaction actually ran. 38
  • 39. So how can we prevent a backlog from getting out of control? Presumably if we keep stopping we’ll just exacerbate the situation, by introducing extra wait time to everyone in the line. 39
  • 40. So instead, we actually use this stoppage to our benefit. We can batch up the transactions targeting a particular context and run them at the same time, with one write to disk. 40
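The per-context queue and batching ideas from the last few slides can be sketched like this. It is a single-threaded toy, assuming hypothetical names (`ContextQueue`, `submit`, `drain`), not the real engine:

```python
from collections import deque

class ContextQueue:
    """Serialize transactions per context: no locks, one writer at a time."""
    def __init__(self):
        self.pending = deque()

    def submit(self, txn):
        # Once enqueued, the transaction is guaranteed to run. A caller
        # that stops listening here gets eventual consistency; a caller
        # that waits for drain() gets confirmation it actually ran.
        self.pending.append(txn)

    def drain(self, state):
        # Batch everything queued for this context and apply it all,
        # standing in for a single write to disk.
        batch = list(self.pending)
        self.pending.clear()
        for txn in batch:
            state = txn(state)
        return state, len(batch)

q = ContextQueue()
q.submit(lambda s: s + 1)
q.submit(lambda s: s * 10)
state, applied = q.drain(0)
print(state)    # 10
print(applied)  # 2 transactions, one write
```

The point of the batch: two queued transactions cost one write instead of two, so a backlog makes the system more efficient per write rather than slower.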
  • 41. Now for the hard part. 41
  • 42. How do we determine related objects from any source and cluster them in the first place? Turns out this is actually fairly difficult. As evidenced by the fact that the average amount of existing data enterprises analyze is only 12%, according to Forrester Research. And that’s only the data they already have. 42
  • 43. What we want is to pour the data in and let the database figure it out. We want relationship discovery. 43
  • 44. 44
  • 45. When we say “NEAR” relationships, we mean data that belongs in the same context. If it’s a customer context, for instance, then it’s about the same customer, including supplementary data like details about a customer’s bills or perhaps recent GPS location. Near-related data might include overlapping or conflicting pieces of information, even though there’s theoretically a single contextual identity. 45
  • 46. We call pieces of data that are incomplete representations of a context “context fragments.” The goal is to recognize when context fragments are related enough to put them in the same cluster. 46
  • 47. To do this we use a distance-based, unsupervised machine-learning clustering engine. 47
  • 48. There are many different ways you can define distance between two items. It could be geospatial, or even temporal or phonetic. 48
  • 49. For instance, here’s an example of distance-based clustering for fatal accidents to show clusters. It takes into account not only geospatial closeness, but also whether the accidents happened on the same street, or at an intersection. 49
  • 50. There are a few concepts that make “NEAR” relationship discovery really powerful. First, we provide for context fragments to match transitively, meaning two fragments may be brought together if an intermediate piece of data is closely related to each. 50
  • 51. For instance, in this example the top left crash is on a different street and too far away from the bottom-most crashes to match directly. However, it is linked through the intermediate crashes, that individually matched adjacent crashes in some way. 51
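Transitive, distance-based matching like the crash example can be sketched with a union-find structure: fragments within a threshold merge, and chains of intermediate matches pull distant fragments into one cluster. The crash coordinates and threshold here are invented for illustration:

```python
from math import hypot

# Hypothetical crash fragments as (x, y) positions.
crashes = [(0, 0), (1, 0), (2, 0), (10, 10)]
THRESHOLD = 1.5  # assumed "near" distance

parent = list(range(len(crashes)))

def find(i):
    while parent[i] != i:
        parent[i] = parent[parent[i]]  # path compression
        i = parent[i]
    return i

def union(i, j):
    parent[find(i)] = find(j)

# Merge every pair of fragments that match directly. Fragments that only
# share an intermediate match still end up in the same cluster: here
# (0,0) and (2,0) are too far apart to match directly, but both match (1,0).
for i in range(len(crashes)):
    for j in range(i + 1, len(crashes)):
        dx = crashes[i][0] - crashes[j][0]
        dy = crashes[i][1] - crashes[j][1]
        if hypot(dx, dy) <= THRESHOLD:
            union(i, j)

clusters = {}
for i in range(len(crashes)):
    clusters.setdefault(find(i), []).append(i)
print(list(clusters.values()))  # [[0, 1, 2], [3]]
```

Real distance functions would combine geospatial, temporal, or phonetic closeness rather than plain Euclidean distance, but the transitive merging works the same way.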
  • 52. We also provide for provenance. 52
  • 53. Provenance being where a fragment came from and how it ended up in a certain Context. 53
  • 54. And provenance being the preservation of the fragment for the future, in case it’s needed. 54
  • 55. Combining transitivity and provenance allows new fragments entering the system to cause contexts to merge or split, allowing the system to discover a better understanding of near relationships as new data is available. 55
  • 56. The data may also imply a relationship with another context. That’s a “far” relationship. 56
  • 57. So when a context fragment is implying a relationship with another context, how do we find that context and build that relationship? Again the data might be incomplete or inconsistent. For instance, it could just be an identifier or a name. 57
  • 58. So we’re going to use the same method. We’re going to inject a special fragment, called a grapple fragment, with the relationship data. That way, it can use the same contextualization process to find the most appropriate context, if one is available. 58
  • 59. When it reaches a context, it will grapple back to the originating related context, and a relationship will be formed. 59
  • 60. Relationships are first class citizens in our database model, so immediately both contexts will be aware of their new relationship. Think of intelligence analysis. Connecting the dots. 60
  • 61. Since we use the same mechanism, we get the benefits of transitivity and provenance for relationships. For instance, if a new fragment arrives about an entity where only a grapple fragment existed previously, that context will be created and the related context will be grappled to automatically. In fact, the grapple fragment itself can cause contextualization of data, like context merging. 61
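A grapple fragment’s round trip might look like this. The resolution step is reduced to a dictionary lookup purely for illustration; in the described system it would be the same distance-based contextualization used for ordinary fragments, and all names here are hypothetical:

```python
# Two contexts, each tracking its first-class relationships.
contexts = {
    "cust:ada": {"relations": set()},
    "acct:77":  {"relations": set()},
}

def resolve(relation_data):
    """Stand-in for contextualization: find the best-matching context."""
    return relation_data if relation_data in contexts else None

def inject_grapple(origin_id, relation_data):
    """A grapple fragment rides the normal pipeline to its target context,
    then grapples back so both sides learn of the relationship at once."""
    target_id = resolve(relation_data)
    if target_id is not None:
        contexts[origin_id]["relations"].add(target_id)
        contexts[target_id]["relations"].add(origin_id)

inject_grapple("cust:ada", "acct:77")
print(contexts["cust:ada"]["relations"])  # {'acct:77'}
```

If no target exists yet, the grapple fragment simply waits as a fragment; when a real fragment for that entity later arrives, the context forms and the relationship connects automatically.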
  • 63. There’s also a bunch of other pieces that are needed to make the system useful for mission-critical, large enterprise systems. Operations and management tools, elasticity, containerization, conflict resolution, automatic denormalization of summary data across relationships, advanced indexing, parallel aggregation, data curation tools, etc. 63
  • 64. It’s a big project. It’s over 15 years of advanced R&D. And we’re now just really starting to talk about the technology beneath the business solutions. 64
  • 65. 65