By managing Data in Motion, Data at Rest, and Data in Use differently, modern Information Management Solutions are enabling a whole range of architecture and design patterns that allow enterprises to fully harness the value in data flowing through their systems. In this session we explored some of the patterns (e.g. operational data lakes, CQRS, microservices and containerisation) that enable CIOs, CDOs and senior architects to tame the data challenge, and start to use data as a cross-enterprise asset.
4. Building Blocks – The New Enterprise Stack
            TRADITIONAL                MODERNISED
APPS        On-Premise, Monoliths      SaaS, Microservices
DATABASE    Relational                 Non-Relational
EDW         Teradata, Oracle, etc.     Hadoop
COMPUTE     Scale-Up Server            Containers / Commodity Server / Cloud
STORAGE     SAN                        Local Storage & Data Lakes
NETWORK     Routers and Switches       Software-Defined Networks
5. Challenges of Digital Transformation
• Growth in Data
• Silos
• Lack of Real-Time Insight
• Existing Systems Overwhelmed
10. What is a Single View?
• AKA: Data Hub, 360 Degree View, Multi-Channel Display
• A system that gathers data…
• …from multiple, disconnected sources…
• …and aggregates it to provide a single view
• Foundation for analytics – cross-sell, upsell, churn risk
11. A Single View… of What?
• Customer
• Product
• Employee
• Asset
• Risk
• City
• Anything meaningful to a business
13. Why Not Use The Usual Tech – Relational Databases?
• Database must simultaneously handle the complexity of every source system
• Untenable change management
• Complex data access
15. Single View – Required Database Capabilities
• Flexible data model
• Rich query, aggregation, search & reporting
• High availability
• Predictable scalability
• Flexible deployment model
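To make the "flexible data model" requirement concrete, here is a toy sketch (in Python, with invented field names that are not taken from the deck) of a single-view customer document, folding attributes from several source systems into one record:

```python
# A hypothetical single-view customer document: attributes merged from
# CRM, web, and mainframe policy sources into one flexible record.
# All field names and values are invented for illustration.
single_view_customer = {
    "_id": "cust-001",  # the common key shared across source systems
    "name": "A. Example",
    "contact": {"email": "a@example.com", "phone": "+44 20 7946 0000"},
    "policies": [  # from the mainframe policy system
        {"policy_id": "P-100", "type": "home", "premium": 320.0},
        {"policy_id": "P-101", "type": "motor", "premium": 540.0},
    ],
    "web_activity": {"last_login": "2017-05-02", "quotes_viewed": 3},
    "crm": {"segment": "gold", "churn_risk": 0.12},
}

# A consuming app reads one document instead of querying N systems.
total_premium = sum(p["premium"] for p in single_view_customer["policies"])
print(total_premium)  # 860.0
```

Because the document model is flexible, a new source system can contribute fields (say, a telematics block) without a schema migration across every consumer – which is the point these bullets are making.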
16. Single View – High Level Data Flow
[Diagram] Sources (web app, CRM app, mainframe system) feed documents into the single-view database, in batch or real time, with validation and aggregation steps (group, filter, sort, count, average, deviations) applied on the way in. Consuming apps (customer service app, churn analytics, risk model) get real-time access; their updates flow back to the source systems via an update queue.
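The ingest side of this flow can be sketched as a merge keyed on an identifier shared across the source systems. This is a toy in-memory version (function, source, and field names are all invented):

```python
def upsert_into_single_view(view, record, source):
    """Fold a source-system record into the single-view store,
    keyed on a customer id shared across systems."""
    cust_id = record["customer_id"]
    doc = view.setdefault(cust_id, {"_id": cust_id})
    # Namespace each source's fields so updates from one system
    # don't clobber fields owned by another.
    doc.setdefault(source, {}).update(
        {k: v for k, v in record.items() if k != "customer_id"})
    return doc

view = {}
upsert_into_single_view(view, {"customer_id": "c1", "plan": "broadband"}, "crm")
upsert_into_single_view(view, {"customer_id": "c1", "last_login": "2017-05-02"}, "web")
print(view["c1"]["crm"]["plan"])   # broadband
print(sorted(view["c1"].keys()))   # ['_id', 'crm', 'web']
```

In the real pattern this merge runs in the single-view database (batch loads or a message broker feeding real-time upserts), not in application memory.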
17. Why MongoDB for Single View?
• Flexible data model
• Rich query, aggregation, search & reporting
• High availability
• Predictable scalability
• Flexible deployment model
18. Single View of Customer
Insurance leader generates coveted single view of customers in 90 days – “The Wall”
Problem:
• No single view of customer, leading to poor customer experience and churn
• 145 years of policy data, 70+ systems, 24 different 800 numbers, 15+ front-end apps that are not integrated
• Spent 2 years and $25M trying to build a single view with Oracle – failed
Solution:
• Built “The Wall,” pulling in disparate data and serving a single view to customer service reps in real time
• Flexible data model to aggregate disparate data into a single data store
• Expressive query language and secondary indexes to serve any field in real time
Results:
• Prototyped in 2 weeks
• Deployed to production in 90 days
• Decreased churn and improved ability to upsell/cross-sell
20. What is a Data Lake?
• Centralised repository for data collected from operational systems
• Exploratory analytics
• Extension of the EDW: often based on Hadoop
• 50% of organisations have invested in data lakes*
* Gartner
23. How To Avoid Being In The 70%
• Unify analytics with operational applications
• Create smart, contextually aware, data-driven apps & insights
• Integrate the operational database with the data lake
24. Operational Database Requirements
• Smart/native integration with the data lake
• Powerful real-time analytics
• Flexible, governed data model
• Scale with the data lake
• Sophisticated management & security
• MongoDB provides all these capabilities
25. Design Pattern: Operationalised Data Lake
[Diagram] Sources (sensors, user data, clickstreams, logs) feed a message queue, which lands raw data in the data lake and processed events in MongoDB. Distributed processing frameworks generate analytics models – churn analysis, enriched customer profiles, risk modeling, predictive analytics – from the raw data.
• Real-time access layer (MongoDB): millisecond latency; expressive querying & flexible indexing against subsets of data; updates in place; in-database aggregations & transformations. Serves customer data management, mobile apps, IoT apps and live dashboards.
• Batch processing, batch views (data lake): multi-minute latency with scans across TB/PB of data; no indexes; data stored in 128MB blocks; write-once-read-many & append-only storage model.
26. Design Pattern: Operationalised Data Lake (same diagram as slide 25)
Callout: configure where to land incoming data
27. Design Pattern: Operationalised Data Lake (same diagram)
Callout: raw data is processed to generate analytics models
28. Design Pattern: Operationalised Data Lake (same diagram)
Callout: MongoDB exposes the analytics models to operational apps and handles real-time updates
29. Design Pattern: Operationalised Data Lake (same diagram)
Callout: compute new models against MongoDB & HDFS
30. UK’s Leading Price Comparison Site
Out-pacing Internet search giants with a continuous delivery pipeline powered by microservices & Docker running MongoDB, Kafka and Hadoop in the cloud
Problem:
• Existing EDW with nightly batch loads
• No real-time analytics to personalize the user experience
• Application changes broke the ETL pipeline
• Unable to scale as services expanded
Solution:
• Microservices architecture running on AWS
• All application events written to a Kafka queue, routed to MongoDB and Hadoop
• Events that personalize the real-time experience (e.g. triggering an email send, additional questions, offers) written to MongoDB
• All event data aggregated with other data sources and analyzed in Hadoop; updated customer profiles written back to MongoDB
Results:
• 2x faster delivery of new services after migrating to the new architecture
• Enabled continuous delivery: pushing new features every day
• Personalized user experience, plus higher uptime and scalability
32. Standardising the Database Environment
• Development agility
• Data re-use
• Operational efficiency
• Corporate governance and data lineage
• Cost accountability
33. Data-as-a-Service High Level Architecture
[Diagram] Apps (App1, App2, App3) call an API access layer sitting in front of the operational data domains (customers, products, accounts, transactions), which run on shared physical infrastructure.
• Shared, multi-tenant database accessible via a common API
• Exposes CRUD, search, geospatial, graph, analytics
• Each data domain isolated into its own replica set
• Logically managed as one service, with a UI for self-service provisioning & scaling
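One way to picture the "each data domain isolated into its own replica set" point: the API access layer resolves a logical domain name to the physical replica set that backs it. A minimal sketch (all hostnames and connection strings are made up):

```python
# Hypothetical mapping from logical data domain to its dedicated
# replica set; the API access layer resolves this before querying,
# so tenant apps see one logical service while each domain is
# provisioned and scaled independently.
DOMAIN_REPLICA_SETS = {
    "customers":    "mongodb://cust-rs0,cust-rs1,cust-rs2/?replicaSet=customers",
    "products":     "mongodb://prod-rs0,prod-rs1,prod-rs2/?replicaSet=products",
    "accounts":     "mongodb://acct-rs0,acct-rs1,acct-rs2/?replicaSet=accounts",
    "transactions": "mongodb://txn-rs0,txn-rs1,txn-rs2/?replicaSet=transactions",
}

def resolve_domain(domain):
    """Return the connection string for a data domain."""
    try:
        return DOMAIN_REPLICA_SETS[domain]
    except KeyError:
        raise ValueError(f"unknown data domain: {domain}")

print(resolve_domain("customers"))
```

Keeping this mapping behind the API layer is what lets the platform team add, move, or scale a domain's replica set without any consuming application changing its code.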
35. Patterns for Modern Data Architectures
Challenges – Growth in Data, Silos, Lack of Real-Time Insight, Existing Systems Overwhelmed – map to the patterns:
• Single View
• Data-as-a-Service
• Operationalised Data Lake
Speaker Notes
We all hear a lot about the benefits of DT
A number of challenges in delivering this to the business
How do we make it as fast as possible to launch new services, while at the same time provide cross enterprise use of data?
What I’ll do is present 3 deployment patterns we’ve seen be particularly effective at supporting digital transformation initiatives, and elevating data to that cross-enterprise asset
Dive into each of these – talk about benefits, high level arch patterns, and give examples of where they’ve been successfully applied
So… why is this not easy? Why hasn’t everyone been successful?
There has been disruption at every layer of the tech stack
This new tech can help give us the scale and business agility we need to deliver on DT
But simply throwing tech at the problem doesn’t help us get value from the greatest byproduct that comes from digital transformation – data
New technology alone won’t solve everything.
Unless you apply new methodologies and architectural approaches, you will run into the same issues again and again
The definition of insanity. Same thing again and again, expecting a different outcome.
Take the wrong approach and, with an eagerness to deliver quickly, you might INCREASE the amount of siloed data.
Overwhelmed by new sources of data entering the business – relying on a patchwork of tech to keep up – from databases to memory grids to data lakes
That in turn has given rise to data silos and data duplication – apps using specific niche technologies to address specific app requirements, making it hard to share that data across all of the business processes that need it, and to enforce centralized data controls
Then how do you unlock immediate insight from that data? Batch loads via ETL to the EDW take too long – data is stale, and you are out-innovated by competitors. Users are demanding analytics delivered to the business in real time: serving up a recommendation on the next best offer, identifying a critical fault in a manufacturing assembly line, updating fraud models based on new behaviors or breaches. The companies that survive in the future won’t be those that have the most data, but those that make use of that data faster than their competitors
There are a whole host of modern approaches to delivery, and architecture patterns that are being adopted
Perhaps a bit buzzwordy, but there’s a reason for the popularity of these things
You might find yourself using 1, a few or indeed many of these in conjunction with each other
It’s going to depend on your use case, but ultimately you need to leverage your data. How best to do this?
CQRS – use a different model to update information than you use for reading information
We’ve observed 3 common patterns to tame data challenges – these are not exclusive – can be adopted individually or together:
Single view: bring data together from multiple silos to create a 360 degree view of the customer, a single view of risk in financial services, a single view of the supply chain. Better serve customer-facing apps, and provide a foundation for richer analytics against that entity
2. Operationalized Data lake: real time database layer on top of EDW or Hadoop that marries analytics to operational apps. Generate insights faster
3 Data-as-a-Service: standardised database platform that is delivered as a service to project teams – allowing data reuse across apps
Single view is necessary because data for a business entity, e.g. a customer, is typically spread across multiple systems – which we need to query individually to build a picture of that entity, or to run any analytics against it
Think about a subscriber to a telco – landline, mobile, broadband, an app store – data for each service will be stored in individual systems. Maybe you’re giving the customer a self-service portal to manage their account, or you’re responding to questions from a call center: either way you’re navigating multiple screens to bring together customer data
Much more efficient to aggregate that data into a single view – much faster to build that picture – and the customer experience is improved.
Once we have all of the data in a single view, we can start to create new insights by mining and analyzing it – e.g. bringing comms products together, we can identify opportunities for cross-sell and upsell. We can start to run regression analysis to find relationships between customers with similar attributes and the products they’ve bought, to predict which products might be most useful, or which customers are at greatest risk of churn based on similar customers and their actions
Single view is about aggregating data from multiple systems to create a single consolidated view of a business entity – data aggregated from multiple sources: web and mobile platforms, CRM, call center apps, etc.
Business entity: customer, financial instrument, fleet of trucks
Supports process improvements in our customer-facing apps – much faster if a call center agent can retrieve all customer info in a single click than by navigating 15 different screens: a customer of a telco can see all the services subscribed to – landline, mobile, satellite – and see usage consumed across those services
Key is that once data is in this single view, it can create insights that we’ve never had before
Customer across CRM, order, billing and marketing systems
Risk across all asset classes and geographic regions
Inventory in motion across supply chain, production, warehouse, online channels
So I’ve got a bunch of different data types from varied sources.
And I need to put them somewhere, but where?
Ok, here’s one reason, and it’s not pretty.
Great because you can join multiple tables to create the view?
Any schema changes in upstream systems will break the data model
Trying to join tens or hundreds of tables at run time will take far too long
Flexible document model is key – can aggregate data from multiple sources into single documents – fast to retrieve. And the schema can be adapted without app downtime
When designing a single view – define the data sources, and define common fields that uniquely identify the entity – it’s against these that we can apply governance controls to ensure the data is usable by our consuming apps, while also providing the flexibility to apply dynamic fields that vary from document to document
3 core tech requirements: flexible schema to ingest data in many different shapes from many different systems – need to evolve without downtime as source systems evolve
At the same time, need to enforce data quality: mandatory attributes you need to capture to uniquely identify entities – the database should validate data – presence of fields, types
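MongoDB can enforce rules like this with document validation (a validator expression supplied when creating the collection). The sketch below shows the shape of such a rule as a plain Python dict, plus a toy local check; the field names are invented and the exact validator syntax depends on your MongoDB version:

```python
# Sketch of a document-validation rule enforcing the mandatory,
# uniquely identifying fields described above. In MongoDB this dict
# would be passed as the `validator` option when creating the
# collection; field names here are illustrative.
customer_validator = {
    "customer_id": {"$exists": True, "$type": "string"},
    "email":       {"$exists": True, "$type": "string"},
}

def passes_validation(doc, validator):
    """Toy local mirror of the rule: every required field present and
    a string. Real enforcement happens inside the database on write."""
    return all(
        field in doc and isinstance(doc[field], str)
        for field in validator
    )

print(passes_validation({"customer_id": "c1", "email": "a@b.com"},
                        customer_validator))  # True
print(passes_validation({"customer_id": "c1"},
                        customer_validator))  # False
```

The point of the pattern: mandatory identifying fields are validated centrally, while everything outside the validator stays schema-flexible per document.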
Collecting data isn’t sufficient – need to query it in many ways – e.g. all subscribers that have spent more than £100 making international calls in the past 3 months who also use broadband, so we can then go after them with a roaming data offer
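That example query could be expressed as an aggregation pipeline. It is built here as plain Python data so the structure is visible without a running database; the collection layout and field names (`services`, `calls.type`, `calls.cost`) are assumptions:

```python
# Hypothetical pipeline for collection.aggregate(): subscribers with
# over £100 of international calls in the last 90 days who also use
# broadband. Field names are invented for illustration.
from datetime import datetime, timedelta

three_months_ago = datetime(2017, 6, 1) - timedelta(days=90)

pipeline = [
    {"$match": {                       # cheap pre-filter on indexed fields
        "services": "broadband",
        "calls.type": "international",
        "calls.date": {"$gte": three_months_ago},
    }},
    {"$unwind": "$calls"},             # one document per call record
    {"$match": {"calls.type": "international",
                "calls.date": {"$gte": three_months_ago}}},
    {"$group": {"_id": "$customer_id",
                "intl_spend": {"$sum": "$calls.cost"}}},
    {"$match": {"intl_spend": {"$gt": 100}}},   # over £100
]

print(len(pipeline))  # 5 stages
```

Running this in the database (rather than ETL-ing into a warehouse) is exactly the "rich query, aggregation, search & reporting" capability the slides call out.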
Typically powering customer service apps – the service needs to be highly available, and able to handle the increasing volume of data we collect about every part of our business.
As updates are made to the source systems on the left, those are propagated to the database serving the single view to the consuming apps on the right. Could be in batches, or (increasingly) in real time using message brokers such as Kafka or RabbitMQ. Validation rules are applied by our single view database to ensure data is properly formed.
If any of those consuming systems need to apply an update, e.g. the customer places a new order, that wouldn’t be applied to our single view, but rather to the source app via an update queue
Able to sync updates across all our systems
Well let’s look back at the requirements?
MongoDB provides all these
Metlife
Industry: Insurance, Financial Services
Use Case: Single View
Fairly new phenomenon
New sources – logs, clickstreams, social feeds, iot sensor data
The EDW is typically optimised for the upper left of the quadrant – structured data from internal systems – but struggles when data comes from outside the enterprise and is unstructured – the volume and variety of data
This is where the data lake provides a solution
While something like 50% of enterprises either have or are evaluating Hadoop to create new classes of app, it’s not without its challenges
This appears in a number of Gartner analyses
One of the fundamental challenges is how to integrate the data lake with your operational systems
Operational apps run the business – how do you expose analytics created in the data lake to better serve customers with more relevant products and offers, or to drive efficiency savings from an IoT-enabled smart factory
Unify data lake analytics with the operational applications
Enables you to create smart, contextually aware, data-driven apps
Integrated database layer operationalizes the data lake
Beyond low latency performance, specific requirements. Need much more than just a datastore, fully-featured database serving as a System of Record for online applications
Tight integration between MongoDB and the data lake – minimize data movement between them, fully exploit the native capabilities of each part of the system
Need to be able to serve operational workloads and run analytics against live operational data – e.g. the top trending articles right now so I know where to place my ads; how many widgets coming off my production line are failing QA, and is that up or down versus previous trends. Gartner calls it HTAP (Hybrid Transactional and Analytical Processing); Forrester calls it translytics. To do that, need: a powerful query language, secondary indexes, aggregations & transformations all within the database – not ETL into a warehouse
Workload isolation: operational & analytics workloads shouldn’t contend for the same resources
Flexible schema to handle multi-structured data, but need to enforce governance to that data
Secure access to the data: the operational DB is typically accessed by a much broader audience than Hadoop, so security controls are critical – robust access controls – LDAP, Kerberos, RBAC
Auditing of all events for regulatory compliance. Encryption of data in motion and at rest, all built into the database
Need to scale as the data lake scales – means scaling out on commodity hardware, often across geo regions
To simplify the environment, need sophisticated management tools: to automate database deployment, scaling, monitoring and alerting, and disaster recovery.
Tight integration: not enough just to move data between analytics and operational layers – need to move it efficiently. Connectors should allow selective filtering by using secondary indexes to extract and process only the range of data it needs – for example, retrieving all customers located in a specific geography. This is very different from other databases that do not support secondary indexes. In these cases, Spark and Hadoop jobs are limited to extracting all data based on a simple primary key, even if only a subset of that data is required for the query. This means more processing overhead, more hardware, and longer time-to-insight for the user.
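The selective filtering described above amounts to pushing a predicate down to an indexed field so that only the needed slice of data ever leaves the database. A toy illustration (with the MongoDB Spark connector this would typically be expressed as a `$match` stage; the `region` field and the documents are invented):

```python
# Predicate pushed down to the connector so only matching documents
# leave the database; 'region' is assumed to carry a secondary index.
pushdown = [{"$match": {"region": "EMEA"}}]

def extract(docs, pipeline):
    """Toy stand-in for the connector: applies the $match locally to
    show its effect. The real win is that a secondary index evaluates
    this server-side instead of scanning everything by primary key."""
    match = pipeline[0]["$match"]
    return [d for d in docs if all(d.get(k) == v for k, v in match.items())]

docs = [{"_id": 1, "region": "EMEA"},
        {"_id": 2, "region": "APAC"},
        {"_id": 3, "region": "EMEA"}]
print([d["_id"] for d in extract(docs, pushdown)])  # [1, 3]
```

Without the pushdown, the Spark or Hadoop job would extract all three documents and filter afterwards – the extra processing overhead and time-to-insight the note describes.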
Workload isolation: provision database clusters with dedicated analytic nodes, allowing users to simultaneously run real-time analytics and reporting queries against live data, without impacting nodes servicing the operational application.
Flexible data model to store data of any structure, and easily evolve the model to capture new attributes – e.g. enriching user profiles with geospatial data. Also need to ensure data quality by enforcing validation rules against the data – to ensure it is appropriately typed and contains all attributes needed by the app
Expressive queries allow developers to build applications that can query and analyze the data in multiple ways – from single keys, ranges, text search, and geospatial queries through to complex aggregations and MapReduce jobs, returning responses in milliseconds. Complex queries are executed natively in the database without having to use additional analytics frameworks or tools, avoiding the latency that comes from moving data between operational and analytical engines. Secondary indexes give the option to filter data in any way you need – key for low latency operational queries
Robust security controls: govern access, provide audit trails and encrypt data in flight and at rest
Scale-out – match the scale-out of the data lake; as it grows, add new nodes to service higher data volumes or user load
Advanced management platform. To reduce data lake TCO and risk of application downtime, powerful tooling to automate database deployment, scaling, monitoring and alerting, and disaster recovery.
Let’s go deeper and wider
This is a design pattern for the data lake – multiple components that collectively handle ingest, storage, processing and analysis of data, then serving it to consuming operational apps
Step through it
Data ingestion: Data streams are ingested to a pub/sub message queue, which routes all raw data into HDFS.
Often also have event processing running against the queue to find interesting events that need to be consumed by the operational apps immediately – e.g. displaying an offer to a user browsing a product page, or alarms generated against vehicle telemetry from an IoT app – which are routed to MongoDB for immediate consumption by operational applications.
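A sketch of that routing decision: raw data always lands in the lake, while events the operational apps must see immediately also go to MongoDB. The event type names here are invented:

```python
# Event types the operational apps must see immediately (illustrative).
OPERATIONAL_EVENT_TYPES = {"offer_display", "vehicle_alarm", "fraud_alert"}

def route_event(event):
    """Return the destinations for an event from the message queue:
    all raw data goes to the data lake; urgent events are also routed
    to the operational database for immediate consumption."""
    destinations = ["hdfs"]                # raw data always lands in HDFS
    if event.get("type") in OPERATIONAL_EVENT_TYPES:
        destinations.append("mongodb")     # served to apps in real time
    return destinations

print(route_event({"type": "vehicle_alarm", "vehicle": "v42"}))  # ['hdfs', 'mongodb']
print(route_event({"type": "clickstream"}))                      # ['hdfs']
```

In a real deployment this classification would typically live in a stream processor consuming the queue (e.g. against Kafka topics), not in application code.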
Raw data is loaded into the data lake, where we can use Hadoop jobs – MapReduce or Spark – to generate analytics models from the raw data – see the examples in the layer above HDFS
MongoDB exposes these models to the operational processes, serving indexed queries and updates against them with real-time latency
The distributed processing frameworks can re-compute analytics models, against data stored in either HDFS or MongoDB, continuously flowing updates from the operational database to analytics models
We’ll look at some examples of users who have deployed this type of design pattern a little later
CTM – the UK’s leading price comparison site – moved from an on-prem, RDBMS-based monolithic app to a microservices architecture powered by MongoDB, with Hadoop at the back end providing analytics – enabling them to better personalize the customer experience and deepen relationships
Read through bullets
Standardized database service, accessible across multiple apps – exposed to developers as a set of APIs
Agility: devs can build on a standard data management infrastructure. Focus on the app, not on the underlying database
Data re-use – being able to share data between applications without expensive ETL and reconciliation. Eliminates duplication
Operational efficiency: using standard building blocks and best practices between projects, drive up utilization
Corporate governance: institutionalize standards for DR, security, reporting – a common set of security controls enforced at the database layer that don’t need to be repeated for each app
Cost accountability – centralized visibility of resource consumption across projects and business units
Logically this looks like one database managed by Cloud Manager – but each domain is a separate replica set
Mike will be talking about how RBS have successfully implemented their Data Fabric - Data-as-a-Service