A successful digital information strategy depends on being able to find, connect and consume diverse data sources repeatably and at scale. But top-down, deterministic data unification approaches (such as ETL, ELT and MDM) weren’t designed to scale to the variety of hundreds, thousands or tens of thousands of data silos. A new bottom-up, probabilistic approach to data unification complements MDM by providing the agility and scalability to exploit data variety.
1. MDM AND THE DATA UNIFICATION IMPERATIVE
JAMES MARKARIAN | ADVISOR, TAMR
2. Data Heterogeneity is Inherent in Large Companies
Data sources are bound to applications with idiosyncratic bias
[Diagram: application data silos across Sales, Marketing, Manufacturing, HR, Support and Finance]
5. Result: Just 10% of Data is Consumable by Any One Person
And 80% of data scientist time is spent preparing it
[Chart: 90% of data is dark]
6. Expectations for Global Corporate IT as Data Broker
Increasing quickly -- along with the hype about Big Data/Analytics 3.0
[Diagram: Corporate IT brokering data across HR, Sales, Finance, Marketing, Manufacturing, Engineering and divisions]
7. Some Options
Option #1 - Deny Variety - use information that is easiest/closest
Option #2 - Manage Variety incrementally - using traditional approaches:
● Standardization
● Aggregation
● Master Data Management
● Rationalize Systems
● Throw Bodies at it
● Improve Individual Productivity
Option #3 - Embrace Variety using a probabilistic/model-based approach - Tamr
8. Traditional Data Management Approaches: Necessary but not sufficient
Option #2: “Manage” Variety Using Traditional Approaches
● Standardization
● Aggregation
● Master Data Management
● Rationalize Systems
● Throw Bodies at it
● Improve Individual Productivity
9. Logical Evolution to Probabilistic/Model-Based Approach
[Chart: Today the mix is roughly 95% deterministic / 5% probabilistic; in the future, roughly 80% probabilistic / 20% deterministic]
Probabilistic (Tamr) complements, NOT Replaces, Deterministic (MDM)
10. INTRODUCING TAMR
▪ Founded in 2013 by enterprise database software veterans
▪ World-class engineering team
▪ Top-tier venture backing (Google Ventures, NEA)
Team: Jerry Held, PhD; Andy Palmer; Mike Stonebraker, PhD; Ihab Ilyas, PhD; Kevin Burke; Nidhi Aggarwal, PhD; Min Xiao; Nik Bates-Haus; Kevin Willis
11. Managing enterprise information as an asset requires a new, bottom-up design pattern
Catalog: ALL your metadata and map it to logical entities
Connect: Entities and attributes to remove information silos
Consume: Unified data in the application of your choice via APIs
“Embrace” Variety -- Tamr’s NextGen Approach
12. Tamr’s Design Pattern: “Back to the Future”
1990s Web: Yahoo’s top-down organization
2020s Enterprise: Probabilistic data source cataloging, connection and consumption
14. TAMR WORKS WITH MDM SYSTEMS TO HANDLE EXTREME DATA VARIETY
[Diagram: Tamr schema-maps and matches the long tail of disparate data sources and publishes keys to the MDM system; MDM handles the few well-understood sources with cleansing, consolidation, survivorship and governance, feeding the EDW and rapid analytics]
Benefits
● Business agility
● Faster MDM implementations (months -> weeks)
● Significantly lower ongoing maintenance
15. Fortune 50 company -- Optimized Sourcing Analysis
Benefits
● Massive reductions in supplier list size & number of distinct suppliers
● Automated data maintenance; lower cost of ownership
● Powering strategic sourcing analytics and governance
● Empowering individual procurement team with global view of payment terms
16. Catalog
Tamr helps you catalog metadata across the entire enterprise, providing a logical map of all of your information
Connect
Tamr helps match entities and attributes across the full variety of your sources, leveraging entity relationships for high accuracy
Consume
Tamr provides a consolidated view of entities and records for downstream applications via a set of RESTful APIs
Learn more at tamr.com
Find us at Booth #613
Speaker Notes
Key Messages:
Introduce yourself as James Markarian
I am currently an EIR at Khosla Ventures. Prior to Khosla, I spent 15 years as the CTO of Informatica, a leader in the ETL space, where I focused on <x>
Recently, I joined Tamr, a company focused on unifying and enriching internal and external data for enterprise analytics, to advise them on product architecture and strategy.
Today I’ll be speaking a bit about how data variety -- the natural, siloed nature of data as it’s created -- is creating a bottleneck to analytics, and how deterministic data unification approaches aren’t sufficient on their own to scale to the variety of hundreds or thousands of data silos found within the enterprise.
>>> Heterogeneity of information sources is natural in large companies
Much of the roughly $3-4 trillion invested in enterprise software over the last 20 years has gone toward building and deploying software systems and applications to automate and optimize key business processes in the context of specific functions (sales, marketing, manufacturing) and/or geographies (countries, regions, states, etc.). Essentially, these are systems that produce data, and do so in a very idiosyncratic manner.
As each of these idiosyncratic applications is deployed, an equally idiosyncratic data source is created. The result: the data tied to enterprise investments in software is extremely heterogeneous and siloed. Broad use of the data has been secondary to the primary activity of automating business processes that produce the data. The data is almost like an idiosyncratic exhaust of all of these various applications.
It’s not surprising (it’s actually natural) that information across a large enterprise is disconnected and is managed more as the exhaust of 30+ years of business process automation. I think of this as a form of enterprise information entropy. The effort to standardize on single-vendor platforms, as well as to create enterprise-wide data warehouses, has largely been an attempt to compensate for natural enterprise data variety/entropy. Ironically, the top-down approaches used to rationalize to a single platform or implement most warehouses (deterministic ETL, Master Data Management and waterfall data management methods) created not fewer silos but just additional, larger silos that increased the overall variety of data sources within an organization.
On top of the historical pull toward application- and organization-specific data sources, these systems get even more complicated and disconnected when you add the confusion and complexity that results from:
M&A events every quarter
Reorganizations every 6-12 months
Changes in leadership every few years
Objective estimates of the scale of this problem are surprising. Specifically, industry analysts estimate that:
90% of big data is dark (not used or cataloged within the enterprise)
90% of collected data isn’t consumable (requires significant work to be useful)
80% of data scientist time is spent preparing the data for consumption
In short, this data is not being managed as an asset.
This challenge is only going to become more critical -- especially as expectations of Global Corporate IT as data broker are increasing quickly along with the hype around Big Data/Analytics 3.0
As we look forward to the next 20 years, most companies have begun investing heavily in Big Data Analytics – $44 billion in 2014 alone according to Gartner << insert reference to Data/Analytics being the top priority for CIOs >>.
In this context, merely managing all of a company’s data as an asset presents a significant challenge for a globally missioned IT organization. Now enter the trend toward proverbial Big Data and Analytics 3.0, and the already impossible problem of managing data variety becomes a strategic imperative for the IT organization, which is expected to integrate analytics and data seamlessly and quickly across all of these idiosyncratic silos so that all the users with great new democratized viz tools can actually put them to work.
We’d like to think that our data integration and preparation capabilities are advanced enough to service this great democratization, and that our “plumbing” is capable of treating the massive reserves of siloed, heterogeneous data.
However, these aspirations and the cool new viz tools that are available to everyone in the enterprise require clean, unified data that spans all the various silos. Most companies are finding this heterogeneity is a massive, fundamental roadblock to effectively using state-of-the-art analytics and visualization tools. Basically, Big Data variety and heterogeneity is the dirty little secret of most enterprises, and while it’s not sexy to spend time cleaning and preparing data, unified data is as important to enterprise analytics as reliable water treatment is to providing clean drinking water to the population.
All of this leaves Corporate IT organizations with several options to address the data variety problem as data brokers for their enterprise.
Some orgs are simply ignoring the opportunity to convert variety into value – overwhelmed by the sheer volume of heterogeneous sources and data.
So they go ahead and carve out their pile, go to their corner, and work with what they have.
>>> Traditional approaches to managing data are necessary but not sufficient to address the broad enterprise data variety problem
In order to realize the opportunity in variety – IT brokers need to recognize that their existing top-down tools/approaches are necessary but not sufficient to solve the variety problem.
There is a long list of tools in the enterprise arsenal to try to tackle data variety - I’ve tried all of them over the years - specifically:
Master Data Management - most of the efforts to do top-down, deterministic data modeling result in useful taxonomies, controlled vocabularies and ontologies. This requires you to “tell” the various divisions what they are going to map to, which inevitably degrades into a debate about who is the Master and who is the “Slave”. These, too, are necessary but not sufficient to manage the broad variety of tabular data in most enterprises. There are always deviations from whatever the three-star wizards in labcoats responsible for the “Master” reference data decide.
Multiple approaches have emerged to deal with the Data Variety problem, with the current state dominated by extreme top-down management (95% deterministic to 5% probabilistic). I predict that the sheer number of data sources and complexity of change is going to drive us toward a bottom-up approach (80% probabilistic to 20% deterministic).
The only viable way to tame enterprise data variety is through “bottom-up, collaborative data curation” that complements traditional MDM, ETL, data profiling and data quality methods.
A Next-Gen Approach
We believe that big companies should start by deploying a fundamentally new design pattern for data management which enables their organization to dynamically catalog, connect and curate ALL of their enterprise information sources from the bottom up, using a scalable and agile approach.
NOTE that Tamr operationalizes this approach at scale, across the enterprise -- NOT as another idiosyncratic solution -- AND works with existing data management and analytics tools.
Connect - Our emphasis has been on connecting diverse data sources across the enterprise, at scale. We are now expanding the platform to bring this level of scalable data unification and use across the enterprise.
Catalog - At the front end, Tamr now solves a very common problem: What data do I use to solve this problem?
Consume/Curate - Unified data doesn’t live in Tamr. We make it available to any downstream application or analytic tool -- including something as simple as spreadsheets -- via a set of RESTful APIs.
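To make the “Consume” step concrete, here is a minimal sketch of what pulling unified records over REST might look like from a downstream script. The base URL, endpoint path and response fields are hypothetical placeholders for illustration, not Tamr’s documented API.
```python
import requests

# Hypothetical base URL and endpoint -- placeholders for illustration,
# not Tamr's documented API surface.
BASE_URL = "https://tamr.example.com/api/v1"

def fetch_unified_suppliers(limit=100):
    """Pull consolidated supplier entities for a downstream app or spreadsheet."""
    resp = requests.get(f"{BASE_URL}/entities/suppliers", params={"limit": limit})
    resp.raise_for_status()  # fail loudly on HTTP errors
    return resp.json()       # assumed: a JSON list of unified entity records

# Feed the unified records into whatever tool you like -- even a CSV for spreadsheets
for supplier in fetch_unified_suppliers(limit=10):
    print(supplier.get("name"), supplier.get("payment_terms"))
```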
This design pattern is not new - it mimics the design patterns of the modern World Wide Web - but it is aimed at connecting the primary information asset of the enterprise: tabular data. In the mid-1990s, the early days of Yahoo!, they used library science professionals and top-down information management practices and tools to organize websites and web content for search. Over time, it became clear that Google’s bottom-up, probabilistic approach to matching web content with search terms was going to be a much more scalable and effective approach - so much so that, as most of you know, Yahoo! decided to license Google’s tech.
Inside the enterprise, tabular data sources are the primary assets to be connected instead of websites, and companies need a new set of tools to register/catalog, connect and curate tabular data that is matched to the data/attributes that analytic users want/need. We believe that our technology at Tamr will be incorporated into existing legacy MDM, ETL and Data Management tools much in the way that Yahoo! licensed Google.
Tamr automates schema mapping using a bottom-up approach
Tamr is the master for probabilistic keys
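As a rough illustration of what bottom-up, probabilistic schema mapping means in practice, the sketch below scores candidate attribute pairs from two sources by blending column-name similarity with sample-value overlap and keeps pairs above a confidence threshold. The weights, threshold and scoring scheme are assumptions for illustration, not Tamr’s actual model.
```python
from difflib import SequenceMatcher

def attribute_match_score(name_a, values_a, name_b, values_b):
    """Blend column-name similarity with sample-value overlap.

    Illustrative only -- the 0.4/0.6 weights are assumptions.
    """
    name_sim = SequenceMatcher(None, name_a.lower(), name_b.lower()).ratio()
    set_a, set_b = set(values_a), set(values_b)
    overlap = len(set_a & set_b) / max(len(set_a | set_b), 1)
    return 0.4 * name_sim + 0.6 * overlap

def propose_mappings(source_a, source_b, threshold=0.5):
    """Return candidate attribute mappings scored above a threshold.

    source_a / source_b: dicts of {column_name: sample_values}.
    Pairs below the threshold would be routed to a human expert.
    """
    candidates = []
    for name_a, vals_a in source_a.items():
        for name_b, vals_b in source_b.items():
            score = attribute_match_score(name_a, vals_a, name_b, vals_b)
            if score >= threshold:
                candidates.append((name_a, name_b, round(score, 2)))
    return sorted(candidates, key=lambda c: -c[2])

# Two supplier tables with idiosyncratic column names
crm = {"vendor_name": ["Acme Corp", "Globex"], "pay_terms": ["NET30", "NET60"]}
erp = {"supplier": ["ACME CORP", "Globex Inc"], "payment_terms": ["NET30", "NET45"]}
print(propose_mappings(crm, erp))  # strongest candidate: ('pay_terms', 'payment_terms')
```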
MDM provides capabilities for:
Data cleansing
Data consolidation
Data survivorship
Active and passive data governance
Results
Reduced MDM implementation time (months -> weeks)
Reduced ongoing maintenance
Use Tamr without MDM for analytical use cases which prioritize velocity of analysis
Challenge
With thousands of suppliers spanning many P&Ls and ERP systems, the company has been challenged to maintain an accurate supplier master file (SMF) to drive strategic sourcing analysis
Solution
Create a unified data model that leverages all relevant sources, including address, tax and government data
Machine learning algorithms continuously evaluate & remove potential SMF duplicates
Automated processing incrementally improves as validation is received from SMEs
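A minimal sketch of that evaluate-and-validate loop, assuming simple pair features and an off-the-shelf classifier; the features, training data and model choice are illustrative, not the production system:
```python
from difflib import SequenceMatcher
from sklearn.linear_model import LogisticRegression

def pair_features(rec_a, rec_b):
    """Similarity features for a candidate duplicate pair (illustrative)."""
    name_sim = SequenceMatcher(None, rec_a["name"].lower(), rec_b["name"].lower()).ratio()
    same_tax_id = 1.0 if rec_a.get("tax_id") == rec_b.get("tax_id") else 0.0
    return [name_sim, same_tax_id]

# SME-validated labels: 1 = confirmed duplicate, 0 = confirmed distinct
labeled_pairs = [
    (({"name": "Acme Corp", "tax_id": "123"}, {"name": "ACME Corporation", "tax_id": "123"}), 1),
    (({"name": "Acme Corp", "tax_id": "123"}, {"name": "Apex Labs", "tax_id": "999"}), 0),
    (({"name": "Globex Inc", "tax_id": "555"}, {"name": "Globex", "tax_id": "555"}), 1),
    (({"name": "Initech", "tax_id": "777"}, {"name": "Initrode", "tax_id": "888"}), 0),
]

X = [pair_features(a, b) for (a, b), _ in labeled_pairs]
y = [label for _, label in labeled_pairs]
model = LogisticRegression().fit(X, y)

# Score a new candidate pair; low-confidence cases are routed back to SMEs,
# and their answers become new labels for the next retraining pass.
candidate = ({"name": "Acme Co", "tax_id": "123"}, {"name": "Acme Corp", "tax_id": "123"})
prob = model.predict_proba([pair_features(*candidate)])[0][1]
print("duplicate probability:", round(prob, 2))
```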
Benefits
Massive reductions in supplier list size & number of distinct suppliers
Automated data maintenance; lower cost of ownership in production
Powering strategic sourcing analytics and governance at a corporate level
Empowering individual procurement team with global view of payment terms
Here’s the link for the long-form write-up the team did, for background:
https://docs.google.com/a/tamr.com/document/d/12JvLG4wr_PjpKOGlUyoDx6iVULCAkwm5bhHKMYP7vwU/edit?usp=sharing