Analytics, machine and deep learning, data/event streaming
Big data streaming: enabling the time machine
Real-time event streaming and new conceptual paradigms:
- Distributed transactions
- Eventual consistency
- Materialized projections
Real-time event streaming and new architectural paradigms:
- Enterprise service bus
- Event store
- Projection database
An introduction to Domain-Driven Design: a strategic view of modeling your business domain in the Big Data era.
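A minimal in-memory sketch (hypothetical names, not a production event store) can make the relationship between an event store and a materialized projection concrete: the log is the immutable history, and any projection can be rebuilt, eventually consistent, by replaying it.

```python
from collections import defaultdict

class EventStore:
    """Minimal append-only event store (in-memory sketch)."""
    def __init__(self):
        self.log = []  # immutable history: events are only ever appended

    def append(self, event):
        self.log.append(event)

    def replay(self):
        # projections are (re)built by replaying the full history
        return iter(self.log)

def project_balances(events):
    """Materialized projection: current balance per account."""
    balances = defaultdict(int)
    for e in events:
        if e["type"] == "deposited":
            balances[e["account"]] += e["amount"]
        elif e["type"] == "withdrawn":
            balances[e["account"]] -= e["amount"]
    return dict(balances)

store = EventStore()
store.append({"type": "deposited", "account": "A", "amount": 100})
store.append({"type": "withdrawn", "account": "A", "amount": 30})
balances = project_balances(store.replay())
```

Because the projection is derived, it can be dropped and rebuilt at any time, which is exactly the "time machine" property the streaming approach enables.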
2. Big Data, Analytics, AI, Machine Learning, Deep Learning
Agenda:
- Executive Overview
- Big Data & Analytics Reference Architectures Conceptual View
- Big Data & Analytics Reference Architectures Logical View
- Big Data & Analytics Reference Architectures Technological View
- Analytics Overview and Case Studies
- Event Store
- Domain Model
- Cloudera
8. Data is often considered to be the crown jewels of an organization.
1) Most companies already use analytics in the form of reports and dashboards to help run
their business. This is largely based on well structured data from operational systems that
conform to pre-determined relationships (“a single version of the truth”).
2) Big Data, however, doesn’t follow this structured model. The streams are all different and it
is difficult to establish common relationships. But with its diversity and abundance come
opportunities to learn and to develop new ideas – ideas that can help change the business
(“a single version of the facts”)
The architectural challenge is to bring the two paradigms together. So, rather than approach Big
Data as a new technology silo, an organization should strive to create a unified information
architecture – one that enables it to leverage all types of data, as situations demand, to
promptly satisfy business needs.
The objective of this workshop is to describe a reference architecture (and its implementation)
that promotes a unified vision for information management and analytics.
Executive Overview
11. The architecture is organized into views that highlight three focus areas:
1. universal information management
2. real-time analytics
3. intelligent processes
They represent architecturally significant capabilities that are important to most organizations
today.
Big Data & Analytics Reference Architectures Conceptual View
12. Unified Information Management addresses the need to manage information holistically as
opposed to maintaining independently governed silos. At a high level this includes:
o High Volume Data Acquisition – The system must be able to acquire data despite high
volumes, velocity, and variety. It may not be necessary to persist all data that is received.
o Multi-Structured Data Organization and Discovery – The ability to navigate and search across
different forms of data can be enhanced by the capability to organize data of different
structures into a common schema.
o Low Latency Data Processing – Data processing can occur at many stages of the architecture.
In order to support the processing requirements of Big Data, the system must be fast and
efficient.
o Single Version of the Truth – When two people perform the same form of analysis they
should get the same result. As obvious as this seems, it is no small feat, especially if the
two people belong to different departments or divisions of a company. A single version of
the truth requires architectural consistency and governance.
Unified Information Management
13. Real-Time Analytics enables the business to leverage information and analysis as events are
unfolding. At a high level this includes:
o Speed of Thought Analysis – Analysis is often a journey of discovery, where the results of one
query determine the content of the next. The system must support this journey in an
expeditious manner. System performance must keep pace with the users’ thought process.
o Interactive Dashboards – Interactive dashboards allow the user to immediately react to
information being displayed, providing the ability to drill down and perform root cause
analysis of situations at hand.
o Advanced Analytics – Advanced forms of analytics, including data mining, machine learning,
and statistical analysis enable businesses to better understand past activities and spot trends
that can carry forward into the future. Applied in real-time, advanced analytics can enhance
customer interactions and buying decisions, detect fraud and waste, and enable the business
to make adjustments according to current conditions.
o Event Processing – Real-time processing of events enables immediate responses to existing
problems and opportunities. It filters through large quantities of streaming data, triggering
predefined responses to known data patterns.
Real-Time Analytics
14. A key objective for any Big Data and Analytics program is to execute business processes more
effectively and efficiently. This means channeling the intelligence one gains from analysis
directly into the processes that the business is performing. At a high level this includes:
o Application-Embedded Analysis – Many workers today can be classified as knowledge
workers; they routinely make decisions that affect business performance. Embedding analysis
into the applications they use helps them to make more informed decisions.
o Optimized Rules and Recommendations –With optimized rules and recommendations,
insight from analysis is used to influence the decision logic as the process is being executed.
o Guided User Navigation – Whenever possible the system should leverage the information
available in order to guide the user along the most appropriate path of investigation.
o Performance and Strategy Management – Analytics can also provide insight to guide and
support the performance and strategy management processes of a business. It can help to
ensure that strategy is based on sound analysis. Likewise, it can track business performance
versus objectives in order to provide insight on strategy achievement.
Intelligent Processes
16. Big Data & Analytics Reference Architectures Logical View
The high-level logical view defines a multi-tier architecture template that can be used to describe
many types of technology solutions.
17. Big Data & Analytics Reference Architectures Logical View
This layer includes the hardware and platforms on which the Big Data and Analytics components run. As shared infrastructure, it can support multiple concurrent implementations, in the manner of (or analogous to) Cloud Computing.
This layer includes infrastructure to support traditional databases, specialized Big Data management systems, and infrastructure that has been optimized for analytics.
18. Big Data & Analytics Reference Architectures Logical View
At the bottom are data stores that have been commissioned for specific purposes (e.g. individual operational data stores, CMS, etc.). These data stores represent sources of data that are ingested (upward) into the Logical Data Warehouse (LDW). The LDW represents a collection of data that has been provisioned for historical and analytical purposes.
Above the LDW are components that provide processing and event detection for all forms of data.
At the top of the layer are components that virtualize all forms of data for universal consumption.
19. Big Data & Analytics Reference Architectures Logical View
The Services Layer includes components that provide or perform commonly used services. Presentation Services and Information Services are types of Services in a Service Oriented Architecture (SOA). They can be defined, cataloged, used, and shared across solutions. Business Activity Monitoring, Business Rules, and Event Handling provide common services for the processing layer(s) above.
20. Big Data & Analytics Reference Architectures Logical View
The Process Layer represents components that perform higher-level processing activities. For the purpose of Big Data and Analytics, this layer calls out several types of applications that support analytical, intelligence-gathering, and performance management processes.
The Interaction Layer is comprised of components used to support interaction with end users. Common artifacts for this layer include dashboards, reports, charts, graphs, and spreadsheets. In addition, this layer includes the tools used by analysts to perform analysis and discovery activities.
21. Big Data & Analytics Reference Architectures Logical View
The results of analysis can be delivered via many different channels. The architecture calls out common IP-network-based channels such as desktops and laptops, common mobile network channels such as mobile phones and tablets, and other channels such as email, SMS, and hardcopy.
The architecture is supported by a number of components that affect all layers of the architecture. These include information and analysis modeling, monitoring, management, security, and governance.
23. Big Data & Analytics Reference Architectures Technological View
24. Apache Kafka
Apache Kafka™ is a distributed streaming platform:
- It lets you publish and subscribe to streams of records. In this respect it is similar to a message queue or an enterprise messaging system.
- It lets you store streams of records in a fault-tolerant way.
- It lets you process streams of records as they occur.
Website: https://kafka.apache.org/
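Kafka's core abstraction, a partitioned, append-only log with per-consumer-group offsets, can be sketched conceptually in a few lines (an illustrative toy, not the Kafka client API):

```python
class Topic:
    """Conceptual sketch of a Kafka-style log (not the real client API)."""
    def __init__(self):
        self.records = []   # ordered, append-only log of records
        self.offsets = {}   # consumer group -> next offset to read

    def publish(self, record):
        self.records.append(record)

    def poll(self, group, max_records=10):
        start = self.offsets.get(group, 0)
        batch = self.records[start:start + max_records]
        self.offsets[group] = start + len(batch)  # commit the new offset
        return batch

t = Topic()
t.publish("order-created")
t.publish("order-shipped")
first = t.poll("billing")     # billing group reads both records
again = t.poll("billing")     # nothing new: offset already committed
fresh = t.poll("analytics")   # independent group re-reads from offset 0
```

The key point the sketch shows: records are retained regardless of consumption, and each consumer group tracks its own position, which is what lets multiple downstream systems read the same stream independently.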
25. Apache Spark
Apache Spark™ is a fast and general engine for large-scale data processing:
- Speed: up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk
- Ease of use: APIs in Java, Scala, Python and R
- Generality: a powerful stack of libraries including SQL and DataFrames, MLlib for machine learning, GraphX, and Spark Streaming
- Runs everywhere: Spark runs on Hadoop, Mesos, standalone, or in the cloud
Website: http://spark.apache.org/
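Spark's programming model is easiest to see in the classic word count. The sketch below mimics the map and reduceByKey phases with plain Python (in real Spark these would be RDD or DataFrame operations distributed across a cluster):

```python
from functools import reduce
from collections import Counter

lines = ["big data analytics", "big data streaming"]

# map phase: each line becomes a list of (word, 1) pairs
mapped = [(w, 1) for line in lines for w in line.split()]

# reduce phase: sum counts per key, as Spark's reduceByKey
# would do across partitions
def merge(acc, pair):
    word, n = pair
    acc[word] += n
    return acc

counts = reduce(merge, mapped, Counter())
```

In Spark the same logic would be `rdd.flatMap(...).map(lambda w: (w, 1)).reduceByKey(add)`; the in-memory caching of intermediate results is where the claimed speedups over MapReduce come from.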
27. Reference Architectures – Lambda Architecture
- Batch Layer: manages the master data set, an immutable, append-only set of raw data
- Speed Layer: ingests streaming data or micro-batches and provides an "active partition" with a limited window of mutability
- Serving Layer: output from the batch and speed layers is stored in the serving layer (BASE compliant)
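The interplay of the three layers can be sketched as follows: the batch layer recomputes a view from the immutable master data set, the speed layer covers events that arrived after the last batch run, and the serving layer merges the two (illustrative names and logic):

```python
def batch_view(master_dataset):
    # batch layer: recomputed from scratch over the immutable master data set
    view = {}
    for user, amount in master_dataset:
        view[user] = view.get(user, 0) + amount
    return view

def merge_views(batch, speed):
    # serving layer: batch view plus the recent, still-mutable window
    merged = dict(batch)
    for user, amount in speed.items():
        merged[user] = merged.get(user, 0) + amount
    return merged

master = [("alice", 10), ("bob", 5), ("alice", 7)]
speed = {"alice": 3}   # events that arrived after the last batch run
serving = merge_views(batch_view(master), speed)
```

The duplication is visible even in this toy: the per-user aggregation logic exists twice (batch and speed paths), which is exactly the "two code bases" drawback discussed next.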
28. Reference Architectures – Lambda Architecture
Complexity:
- Many moving parts
- Restatement is difficult
- Two code bases must be kept in sync
- Proper failure handling is complex
29. Reference Architectures – Kappa Architecture
Jay Kreps, the creator of Kafka and one of the first proponents of stream-based architectures, jokingly called his alternative the “Kappa Architecture”.
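The Kappa idea, keep only the streaming path and reprocess by replaying the log through a new version of the job, can be sketched as (hypothetical example):

```python
# the retained event log (e.g. a Kafka topic with long retention)
log = [("page_view", "home"), ("page_view", "pricing"), ("signup", "pricing")]

def build_view(events, count_signups=False):
    """One streaming job; a flag stands in for 'version 2' of the logic."""
    view = {}
    for kind, page in events:
        if kind == "page_view" or (count_signups and kind == "signup"):
            view[page] = view.get(page, 0) + 1
    return view

v1 = build_view(log)                       # current serving view
v2 = build_view(log, count_signups=True)   # new logic: replay the same log
# once v2 catches up, swap it in and retire v1: no separate batch layer needed
```

Restatement thus becomes "run the new job over the old log", removing the batch/speed code duplication at the cost of requiring long log retention.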
31. There are more options today for where to deploy a solution than ever before. At a high level
the four options for deployment of architecture components are:
1) Public Cloud – In the public cloud model, a company rents resources from a third party. The
most advanced usage of public cloud is where the business functionality is provided by the
cloud provider (i.e., software-as-a-service). Public cloud might also be used as the platform
upon which the business functionality is built (i.e., platform-as-a-service), or the public
cloud may simply provide the infrastructure for the system (i.e., infrastructure-as-a-service).
2) Private Cloud - Private cloud is the same as public cloud, but the cloud is owned by a
company instead of being provided by a third party. Private clouds are ideal for hosting and
integrating very large data volumes while keeping data secure behind corporate firewalls.
3) Managed Services – In this model a company owns the components of the system, but
outsources some or all aspects of runtime operations.
4) Traditional IT – In this model a company owns and operates the system.
These various options for deployment are not mutually exclusive.
Deployment
32. Security
The keys to securing an enterprise Big Data platform are:
1) Authentication (Kerberos, LDAP, …)
2) Authorization (ACE, ACL, Sentry, …)
3) Encryption & Data Masking (over-the-wire encryption, encryption at rest, field-level encryption, format-preserving encryption)
4) Auditing & Data Lineage
5) Disaster Recovery & Backup
35. Analytics - Data Science: Notebooks
Notebooks combine code, output and narrative into a single document:
- You can conduct analysis writing down code, results, ideas and thoughts
- You have multiple languages and versions in a single multi-tenant environment
- Easy to share
- Easy version control
36. Analytics - Data Science: Data Products
Data Science is the science of building data products.
Overt data products: products where the data is clearly visible as part of the deliverable.
- Descriptive analysis
- Dashboarding
- Reporting
Covert data products: deliver results rather than data; the data is hidden.
- Recommendation engine
- …
Website: https://www.oreilly.com/ideas/evolution-of-data-products
37. BENEFITS
Analytics allows companies to better manage the customer base and extract customer value:
- Analyze customer profiles, behaviors and purchases to obtain a complete and strategic view of the most recurrent customer behaviors
- Develop a tailored proposition by customer segment to increase customer value along the whole client lifecycle
- Address marketing efforts based on customer insights and value
- Drive consumer segments to exploit the product portfolio at the right time of their customer journey
38. Analytics will be carried out in order to offer actionable insights on customers and will follow a multi-step approach:
1. Business Objective & Question
2. Data Exploration/Understanding: simple exploratory analysis in order to understand the whole set of information available, identify problems in the data, and start observing relationships among variables; use of data visualization techniques for exploring the set of information
3. Data Preparation: data is prepared for data mining and machine learning models, including imputation of missing values, computation of new variables potentially useful for the business question, and transformation of variables to make them meaningful for the problem to be solved
4. Modeling: models are implemented; available data is used and synthesized to answer the business question, by identifying relationships among the target variable and input variables; it may be a recursive process based also on sampling data and assessing models and results
5. Model Interpretation: model results are interpreted in order to be useful for business strategy and actions
6. Business Actions
[Figure: gains chart showing the cumulative percentage of accounts by score decile, responders vs non-responders]
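As a concrete illustration of the model-interpretation step, the responders-by-decile (gains) view shown on this slide can be computed from scored customers roughly like this (toy data, assumed scoring):

```python
# scored customers: (propensity score, responded? 1/0) -- toy data
scored = [(0.9, 1), (0.8, 1), (0.7, 0), (0.6, 1), (0.5, 0),
          (0.4, 0), (0.3, 1), (0.2, 0), (0.1, 0), (0.05, 0)]

# rank customers from highest to lowest score
scored.sort(key=lambda x: x[0], reverse=True)
total_responders = sum(r for _, r in scored)

def gains(ranked, total):
    """Cumulative share of responders captured in the top-k scored customers."""
    captured, curve = 0, []
    for _, responded in ranked:
        captured += responded
        curve.append(captured / total)
    return curve

curve = gains(scored, total_responders)
```

A good model captures most responders in the first deciles, so the curve rises steeply above the diagonal that random targeting would produce.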
39. OBJECTIVES that can be answered through specific statistical models and approaches (ANALYTICAL MODEL)
Objectives, positioned along the customer value / customer lifetime curve:
- ENGAGE NEW CUSTOMERS: new customer identification and engagement
- NURTURE & DEVELOP LOYAL CUSTOMERS: clienteling & caring program
- RETAIN LEAVING CUSTOMERS: actions to retain leaving customers
Analytical models:
- Clickstream & Content Analysis
- Next Best Offer Analysis
- Segmentation (deterministic vs behavioural)
- Propensity Model
- Churn Model
[Figure: objectives mapped to analytical models along the customer lifecycle]
40. Propensity Models
What: the model assigns a propensity score to each customer and allows initiatives to be prioritized. A propensity model allows you to estimate:
- Re-purchasing probability of customers
- Retargeting optimization: predict the likelihood of booking a flight for potential customers
- Up-selling propensity: reservation upgrade or ancillary services proposal
- Etc.
Why: address marketing investments on the customers with the highest propensity, to:
- Increase up-selling
- Increase cross-selling
- Increase active customers
- Increase redemption of marketing campaigns
Algorithms: regressions, decision trees, random forests, neural networks, support vector machines, ensemble models, …
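A propensity score is typically the output of a classifier such as a logistic regression; the sketch below uses hand-picked (illustrative, untrained) weights just to show how scores are produced and used to rank customers:

```python
import math

def propensity(features, weights, bias):
    """Logistic propensity score in (0, 1); illustrative, not a trained model."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return 1 / (1 + math.exp(-z))

# hypothetical features: recency (months), frequency (flights/yr), monetary (k euro)
weights = [-0.4, 0.3, 0.2]   # assumed: recent + frequent + high-spend = likely buyer
bias = -0.5

customers = {"c1": [2, 6, 3],    # recent, frequent flyer
             "c2": [10, 1, 1]}   # lapsed, infrequent

score = propensity(customers["c1"], weights, bias)
# prioritize initiatives: address the highest-propensity customers first
ranked = sorted(customers,
                key=lambda c: propensity(customers[c], weights, bias),
                reverse=True)
```

In practice the weights would be fitted on historical purchase data (by any of the algorithms listed above); the ranking step is what turns the scores into a targeting list.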
41. Behavioural Segmentation
What: behavioural segmentation follows a statistical clustering algorithm which:
- Identifies the most significant variables for the analysis
- Aggregates customers into mutually exclusive groups with similar behavioural patterns, creating clusters that are as internally similar as possible
A customer's affiliation to a specific cluster varies over time, based on their behaviour.
Why:
- Get strategic insight on the customer base to increase loyalty and value
- Tailor the contact strategy (“the right action for the right customer”)
- Enhance the website experience
- Increase the redemption rate of targeted marketing campaigns
Algorithms: data transformation, factor analysis, unsupervised clustering models
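The clustering step can be illustrated with a toy one-dimensional k-means over a single behavioural variable (real segmentations use many variables and more robust tooling):

```python
def kmeans_1d(points, centers, iterations=10):
    """Toy unsupervised clustering on one behavioural variable."""
    clusters = [[] for _ in centers]
    for _ in range(iterations):
        # assignment step: each customer joins the nearest cluster
        clusters = [[] for _ in centers]
        for p in points:
            idx = min(range(len(centers)), key=lambda i: abs(p - centers[i]))
            clusters[idx].append(p)
        # update step: recompute each cluster centre as the mean of its members
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers, clusters

# hypothetical variable: flights per year for seven customers
flights = [1, 2, 2, 3, 10, 11, 12]
centers, clusters = kmeans_1d(flights, centers=[0.0, 15.0])
```

The algorithm converges to an "occasional flyers" segment and a "frequent flyers" segment; re-running it on fresh data is what lets a customer's cluster affiliation change over time.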
42. Churn Models
What: churn analysis is a multivariate data mining technique that assigns a score to customer attrition. It estimates the probability that a customer will not buy from the company anymore, or not for a given period of time. Historical data on customers leaving the company is investigated in order to identify anticipatory signals; information on flying behavior, enriched data (lifestyle, interests, motivation, SOW, price sensitivity) and the customer hyper-profile are used to compare churn vs loyal behavior.
Why:
- Optimization of costs and marketing activities in customer retention
- Identification of high-risk customers sorted by profitability
- Increase active customers
Algorithms: regressions, decision trees, random forests, neural networks, support vector machines, ensemble models, …
50. Place the project's primary focus on the core domain and domain logic
Base complex designs on a model of the domain
Initiate a creative collaboration between technical and domain experts to iteratively refine a
conceptual model that addresses particular domain problems.
Concepts
– Context: the setting in which a word or statement appears that determines its meaning
– Domain: a sphere of knowledge (ontology), influence, or activity. The subject area to which
the user applies a program is the domain of the software
– Model: a system of abstractions that describes selected aspects of a domain and can be used
to solve problems related to that domain
– Ubiquitous Language: a language structured around the domain model and used by all team
members to connect all the activities of the team with the software
– Bounded context: explicitly define the context within which a model applies. Explicitly set
boundaries in terms of team organization, usage within specific parts of the application
– Context map: Identify each model in play on the project and define its bounded context. This
includes the implicit models of non-object-oriented subsystems. Name each bounded
context, and make the names part of the ubiquitous language. Describe the points of contact
between the models, outlining explicit translation for any communication
Domain Driven Design: Concepts
51. Entity: An object that is not defined by its attributes, but rather by a thread of continuity and its
identity
Value Object: an object that contains attributes but has no conceptual identity. They should be
treated as immutable
Aggregate: a collection of objects that are bound together by a root entity, otherwise known as
an aggregate root. The aggregate root guarantees the consistency of changes being made within
the aggregate by forbidding external objects from holding references to its members
Domain Event: a domain object that defines an event (something that happens). A domain
event is an event that domain experts care about
Service: when an operation does not conceptually belong to any object. Following the natural
contours of the problem, you can implement these operations in services
Domain Driven Design: Building Blocks
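These building blocks can be sketched in code; the names below (Order, Money, etc.) are hypothetical examples, not prescribed by DDD:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Money:
    """Value object: defined only by its attributes, immutable."""
    amount: int
    currency: str

@dataclass
class OrderLine:
    product: str
    price: Money

class Order:
    """Aggregate root: the only entry point for changes inside the aggregate."""
    def __init__(self, order_id):
        self.order_id = order_id  # entity identity: stable across state changes
        self._lines = []          # members are not exposed to external references
        self.events = []          # domain events raised by state changes

    def add_line(self, product, price):
        self._lines.append(OrderLine(product, price))
        self.events.append(("line_added", self.order_id, product))

    def total(self, currency):
        return Money(sum(l.price.amount for l in self._lines
                         if l.price.currency == currency), currency)

order = Order("o-1")
order.add_line("seat-upgrade", Money(40, "EUR"))
order.add_line("extra-bag", Money(25, "EUR"))
total = order.total("EUR")
```

Note how the pieces line up with the definitions above: `Money` is compared by value and never mutated, `Order` is identified by `order_id` regardless of its contents, and the domain events it records are exactly what an event store would persist.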
56. Cloudera Manager
Cloudera Manager is an end-to-end application for managing CDH clusters. Cloudera Manager sets the
standard for enterprise deployment by delivering granular visibility into and control over every part of
the CDH cluster—empowering operators to improve performance, enhance quality of service,
increase compliance and reduce administrative costs.
58. Cloudera Navigator Optimizer
How can you assess the risk and true cost of offloading ETL and analytic workloads and understand
what it takes to get there?
o Cloudera Navigator Optimizer gives you the insights and risk-assessments you need to build out
a comprehensive strategy for Hadoop success. Simply upload your existing SQL workloads to
get started, and Navigator Optimizer will identify relative risks and development costs for
offloading these to Hadoop based on compatibility and complexity.
o To efficiently optimize performance for the latest technologies, like Hive and Impala, you need
visibility into what users are doing with the data and when the queries themselves are to
blame. Cloudera Navigator Optimizer gives you that visibility and lets you focus optimization
efforts on critical areas and best practices.
64. Cloudera Product Mapping View
Cloudera Enterprise is available on a subscription basis in five editions, each designed for your
specific needs.
– Essentials provides superior support and advanced management for core Apache Hadoop
– Data Science and Engineering for programmatic preparation and predictive modeling
– Operational DB for online applications and real-time serving
– Analytic DB for BI and SQL analytics
– The Enterprise Data Hub gives you everything you need to become information-driven, with
complete use of the platform.