Sören Eickhoff, Informatica GmbH, "Informatica Intelligent Data Lake – Self Service for Data Analysts"
1. Informatica Intelligent Data Lake
Self Service for Data Analysts
February 2017
Sören Eickhoff
Sales Consultant Central Europe
SEickhoff@informatica.com
3. Data Platform
Data Lake
Use Case: Data Lake / Data Platform Reference Architecture
Landing Zone
Structured and unstructured enterprise and external data is landed in its raw form,
normalized and ready for use
Personas: Data Analyst, Data Scientist, Business User, Data Steward, Data Modeler, Data Engineer
Discovery Zone
User sandbox for self-serve access to data for exploration, data blending, hypothesis
testing, analytics, and collaboration
Production Zone
Sanitized transactional, master, and reference data & enriched data models certified for
enterprise use
Source types: machine and device data, cloud; documents and emails; relational and mainframe; social media and web logs
Business outcomes: improve predictive maintenance, increase operational efficiency, increase customer loyalty, reduce security risk, improve fraud detection
4. Challenges Faced by the Business and IT Today
Data Analysts:
• Can’t easily find trusted data
• Limited access to the data
• Frustrated by slow response from IT due to long backlog
• Constrained by disparate desktop tools and manual steps
• No way to collaborate, share, and update curated datasets
IT:
• Can’t cope with growing demand from the business
• No visibility into what the business is doing with the data
• Struggling to deliver value to the business
• Losing the ability to govern and manage data as an asset
5. Informatica Data Lake Management
Data Lake Management applications: Enterprise Information Catalog, Intelligent Data Lake, Secure@Source
Underlying engines: Titan, Blaze
Foundation: Live Data Map (metadata integration), Big Data Management (data integration), Intelligent Streaming
Personas: Data Architect / Steward, Data Scientist / Analyst, InfoSec Analyst, Data Engineer
6. Unified view into enterprise information assets
• Business-user oriented solution
• Semantic search with dynamic facets
• Detailed Lineage and Impact Analysis
• Business Glossary Integration
• Relationships discovery
• High-level data profiling
• Automatic classification with data domains
• Business classification with custom attributes
• Broad metadata source connectivity
• Big data scale
Enterprise Information Catalog
7. Self-service data preparation with collaborative data governance
• Collaborative project workspaces
• Automated data ingestion
• Search data asset catalog
• Rapid blending of datasets
• Crowd-sourced data asset tagging & data sharing
• Automated data asset discovery & recommendations
• Rapid ‘industrialization’ of preparation
steps into re-usable workflows
• Complete tracking of usage, lineage, and
security
• Easily support Data Discovery Platforms
Intelligent Data Lake
8. Enterprise-wide visibility into sensitive data risks
• Sensitive data classification & discovery
• Sensitive data proliferation analysis
• Who has access to sensitive data
• User activity on sensitive data
• Sensitive Data policy-based alerting
• Multi-factor risk scoring
• Identification of highest risk areas
• Integrates data security information from 3rd parties:
  – Data stores, owner, classification
  – Protection status
  – User access info (LDAP, IAM) and activity logs (DB, Hadoop, Salesforce, DAM)
Secure@Source
9. Easily integrate more data faster from more data sources
Big Data Management
Informatica Big Data Management architecture: the Smart Executor routes work either to the Informatica Data Transformation engine on dedicated ETL/DI servers or to the Hadoop cluster (YARN, HDFS), where mappings run on MapReduce, Hive on MapReduce, Hive on Tez, Spark, or the cluster-aware Blaze engine. Capabilities span data connectivity, data integration, data masking, data quality, and data governance.
• Visual development interface accelerates
developer productivity
• Near universal data connectivity
• Complex data parsing on Hadoop
• Data profiling on Hadoop
• High-speed data ingestion and extraction
• Process and deliver data at scale on
Hadoop
• Dynamic schemas and mapping
templates
• Data Quality and Data Governance on
Hadoop
10. Take Big Data Management to the Next Level
Improve developer productivity: Dynamic Mappings and re-use of PowerCenter & SQL logic
Automatically benefit from new technologies and choose the best execution option (MapReduce, Spark, or Blaze): Smart Optimizer
Diagram: generic source → rule-based logic → generic target
11. Informatica Intelligent Streaming
• Streaming analytics capability
into the Intelligent Data Platform
• Unified UI with multiple engines
underneath the covers
• Frictionless conversion/extension of batch mappings into a streaming context
• Abstracted from runtime
framework
Collect, ingest, and process data in real time and streaming
Diagram: real-time source → window transformation → real-time target (Spark Streaming code is generated)
14. How?
Sources and consumers: applications & databases, Internet of Things, 3rd-party data, data modeling tools, BI tools, cloud, custom; all connected through a data access & metadata connectivity layer.
Intelligent Metadata Foundation: catalog, index, and classify assets; maintain data lineage, data relationships, smart domains, and data profiles.
Data discovery & analysis process for the data analyst / scientist: discover, recommend, prepare, collaborate, publish, operationalize/monitor.
Intelligent Data Lake
15. Terminology
Data Asset: data you work with as a unit.
Project: a container for data assets and worksheets.
Recipe: the steps taken to prepare data in a worksheet.
Data Preparation: the process of combining, cleansing, transforming, and structuring data from one or more data assets so that it is ready for analysis.
Data Publication: the process of making prepared data available in the data lake.
Intelligent Data Lake
16. Search and Discovery
Data discovery through a powerful search engine to find relevant data
Semantic search
Facet filtering by asset, resource type, latest, size, custom attributes…
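The facet filtering described above can be sketched as a simple in-memory filter over catalog entries. The asset fields and facet names below are illustrative only, not the actual EIC data model.

```python
# Tiny stand-in for a catalog of data assets (fields are invented).
assets = [
    {"name": "customers", "type": "table", "resource": "Hive", "size_mb": 120},
    {"name": "web_logs", "type": "file", "resource": "HDFS", "size_mb": 9800},
    {"name": "orders_view", "type": "view", "resource": "Oracle", "size_mb": 45},
]

def facet_filter(catalog, **facets):
    """Keep only assets whose fields match every requested facet value."""
    return [a for a in catalog
            if all(a.get(k) == v for k, v in facets.items())]

hive_tables = facet_filter(assets, type="table", resource="Hive")
```

In the real catalog, the same narrowing happens against a search index rather than a Python list, but the facet semantics are the same: each selected facet value restricts the result set further.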
17. Data Asset Overview
Overview with asset attributes and integrated profiling stats
Asset attributes collected from the source system, enriched by users to add business context.
Column profiling stats include null/unique/duplicate percentages, inferred data types, and data domains; detail stats include value and pattern distributions.
Add a data asset to a project from any exploration view.
18. Business Glossary Integration
View Business Glossary assets such as terms, policies, and categories in the catalog.
View and navigate to related technical and business assets in the catalog.
19. Data Lineage
Interactively trace data origin through summarized lineage views for analysts
Use the lineage and impact sliders to drill down to the desired lineage levels on either side of the seed object.
20. Relationship View
Shows the ecosystem of the asset in the enterprise, based on its associations to other assets
Get a 360-degree view of a data asset using the relationship view, including related tables, views, domains, reports, users, etc.
Zoom, find specific assets in the view, and filter by asset type.
Expand relationship circles to get more details on relationship types and objects.
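The expanding relationship circles above amount to a breadth-first walk over a graph of asset associations. A minimal sketch, with invented asset names and relationship kinds:

```python
# Toy relationship graph between catalog assets (names and relationship
# kinds are invented for illustration).
relationships = {
    "customers": [("orders", "foreign key"), ("sales_report", "used by")],
    "orders": [("orders_view", "source of")],
}

def related(asset, depth=1):
    """Collect every asset reachable within `depth` relationship hops,
    mimicking expanding the relationship circles in the view."""
    seen, frontier = set(), {asset}
    for _ in range(depth):
        nxt = set()
        for current in frontier:
            for other, _kind in relationships.get(current, []):
                if other not in seen and other != asset:
                    seen.add(other)
                    nxt.add(other)
        frontier = nxt
    return seen

neighbourhood = related("customers", depth=2)
```

Each extra hop widens the circle: depth 1 returns direct associations, depth 2 pulls in assets related to those, and so on.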
21. Data Preparation continued…
Excel-based data preparation on sample data
New formula definition with type-ahead.
A large number of functions is available for all types of data: string, numeric, date, statistical, math, etc.
Advanced functionality such as join, merge, aggregate, filter, and sort.
New values are calculated and shown right away.
22. Data Preparation continued…
Excel-based data preparation on sample data
Column-level summary and column value distributions.
Column-level suggestions.
Data preparation steps are captured as a “recipe”.
23. Data Publication
Execution of data preparation steps on the actual data using an Informatica mapping
Publish the output of the data preparation steps back to the lake.
Recipe steps are translated into an Informatica mapping, which is handed over to the BDM platform for execution on the actual data sources.
The BDM platform uses MapReduce, Blaze, or Spark to execute the mapping.
The mapping is available to ETL specialists to open in the Informatica Developer tool and operationalize.
User credentials are used to access the underlying database.
24. Organizations need ONE solution that helps them…
Easily find & catalog data & discover relationships
Rapidly prepare & share data exactly when it is needed
Get instant access to trusted & secure data for advanced analytics
Ingest, cleanse, integrate & protect data at scale
If your customer thinks of Informatica as an ETL company, this is a chance to change their perception. We are the #1 leader in 6 important data categories:
First, cloud data management – we have a full portfolio of data management services for all the major cloud ecosystems – either cloud only or hybrid
Data integration – our bread and butter – we have been the best at it for a long time and we continue to set the bar
Big Data Management – we are the leader in data management for Big Data platforms. We work closely with all the major Hadoop, NoSQL ecosystems and with all the latest Big Data technologies like Spark
Master Data Management – we are the leader in MDM for customer data and any other data that is important to their business. Our secret sauce is our matching engine, ability to discover relationships, and scalability. We can do this on any data platform, either on-premise or in the Cloud.
Data Quality – we are setting the bar in DQ, whether for stand-alone initiatives like data governance or for embedding data quality into business processes
Data Security – we are pioneering a new approach in security. Security remains an unsolved problem, and we can address it at the data level
Most organizations are building out some version of a data lake or enterprise data hub.
Really, they are looking to get all their data into one place for the next generation of analytics and to give everyone access to information.
They are usually divided up into multiple types of zones.
To serve these market trends best Informatica developed a Big Data solution that addresses each of the trends.
The EIC module helps people understand the data they are looking at by providing context.
The IDL module allows the business to be more self-service by providing self-service data preparation capabilities, while also helping IT operationalize the data preparation steps at scale in a managed and governed way.
Secure@Source gives insight into potential risks around privacy-sensitive data by showing where this data is located, how it has proliferated across the data lake (and surrounding applications), and what the associated risks are.
Big Data Management helps customers ingest, parse, cleanse, integrate and deliver big data at scale.
Finally, Intelligent Streaming allows processes to use real-time and streaming data sources.
All this functionality is built as part of the Intelligent Data Platform, where we try to use as many open-source tools as possible, leveraging the power of the ecosystem.
We use HBase to store different types of metadata and Titan as a graph database to store the relationship information between data assets. We use Spark (incl. Spark Streaming) and Blaze to process data at scale, Kafka as a high-speed data transfer mechanism, and Solr to index metadata so it can be searched through a Google-like search interface.
The Enterprise Information Catalog (EIC) application allows business users to quickly find all information around the collection of data assets in their data lake.
Since EIC can leverage the metadata provided by Cloudera Navigator we can even show Hive/Impala scripts and Pig scripts that are being used to process data.
Intelligent Data Lake provides capabilities that enable business users to do data preparation.
Secure@Source gives insight into sensitive data risks.
Dynamic Mappings
Build a template once; mapping execution is automated for 1000s of sources with different schemas.
The mapping self-adjusts dynamically to external schema changes and column characteristics.
Ability to process flat files with changing column order (a,b,c or c,a,b) and changing numbers of columns dynamically.
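The flat-file behaviour described above can be sketched outside the product: reading columns by header name instead of by position makes the logic indifferent to column order. This is only an illustration of the idea; the target schema and function names below are invented, not Informatica APIs.

```python
import csv
import io

TARGET_SCHEMA = ["a", "b", "c"]  # hypothetical canonical column order

def normalize(flat_file_text):
    """Read a delimited file whose header may arrive in any order
    (a,b,c or c,a,b) and emit rows in canonical schema order.
    Columns absent from the file come through as None."""
    reader = csv.DictReader(io.StringIO(flat_file_text))
    return [[row.get(col) for col in TARGET_SCHEMA] for row in reader]

# Columns arrive as c,a,b but come out in a,b,c order.
reordered = normalize("c,a,b\n3,1,2\n")
```

The same header-driven approach also tolerates a changing number of columns, since missing fields simply resolve to None.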
Re-use PowerCenter and SQL logic
Many customers have existing investments in traditional PowerCenter mappings and/or SQL scripts. To allow re-use of these components, Informatica provides capabilities to migrate existing PowerCenter logic to run on Hadoop and to convert existing SQL code into Big Data mapping logic that can be executed at scale.
Smart Optimizer
In-built mapping optimizer that automatically tunes and re-arranges the mapping for high performance
Early selection, Early projection, Mapping pruning, Semi-join, Join re-ordering, etc
Automatic partitioning support based on statistics and other heuristics
Advanced full pushdown optimization support including data ship join
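One of the optimizer rewrites listed above, early projection, can be illustrated independently of the engine: drop the columns downstream steps never use as early as possible, so later joins and sorts move less data. The row shapes and column names below are invented for illustration.

```python
def early_projection(rows, needed_columns):
    """Keep only the columns downstream steps actually use, so later
    joins and sorts move less data (early projection)."""
    return [{k: r[k] for k in needed_columns} for r in rows]

wide_rows = [
    {"id": 1, "name": "a", "blob": "x" * 1000},  # "blob" is never used later
    {"id": 2, "name": "b", "blob": "y" * 1000},
]
slim_rows = early_projection(wide_rows, ["id", "name"])
```

An optimizer applies rewrites like this automatically by inspecting which columns the rest of the mapping references.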
Intelligent streaming aims to bring the following capabilities into the Informatica Platform:
Real-time data ingestion from streaming data sources
Rule evaluation and event triggering on a real-time data stream
Real-time Data Integration: complex transforms, lookups, joins etc. in real time
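The windowed rule evaluation listed above can be sketched, independently of Spark Streaming, as a tumbling-window aggregation over timestamped events with a threshold rule. The window size, threshold, and event shape are made up for illustration.

```python
from collections import defaultdict

def tumbling_windows(events, window_sec):
    """Group (timestamp, value) events into fixed-size windows and sum them."""
    buckets = defaultdict(float)
    for ts, value in events:
        buckets[(ts // window_sec) * window_sec] += value
    return dict(buckets)

def alerts(events, window_sec, threshold):
    """Fire an alert for every window whose aggregate exceeds the threshold."""
    return sorted(start for start, total
                  in tumbling_windows(events, window_sec).items()
                  if total > threshold)

events = [(0, 5.0), (3, 6.0), (12, 2.0)]   # (seconds, metric value)
fired = alerts(events, window_sec=10, threshold=10.0)
```

A streaming engine does the same thing incrementally over an unbounded stream instead of a finished list.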
Data stewards are responsible for strategically managing data assets in the data lake and the enterprise, ensuring high levels of data quality, integrity, availability, trustworthiness, and security while emphasizing the business value of data. By building a catalog, classifying metadata and data definitions, maintaining technical and business rules, and monitoring data quality, data stewards ensure that data in the lake is consistent for use in the discovery zone and enterprise zone. As the inventory of technical and business metadata is established and data sets become available, data architects must design a robust, scalable data lake architecture that meets the business goals of the marketing data lake.
Before we dive into the demo, let’s look at some terminology; I will be using these terms quite a bit in the demo:
Data Lake
A data lake is a centralized repository of large volumes of structured and unstructured data. A data lake can contain different types of data, including raw data, refined data, master data, transactional data, log file data, and machine data. In Intelligent Data Lake, the data lake is a Hadoop cluster.
Data Asset A data asset is data that you work with as a unit. Data assets can include items such as a flat file, table, or view. A data asset can include data stored in or outside the data lake.
Project A project is a container that stores data assets and worksheets.
Data Preparation The process of combining, cleansing, transforming, and structuring data from one or more data assets so that it is ready for analysis.
Recipe A recipe includes the list of input sources and the steps taken to prepare data in a worksheet.
Data Publication Data publication is the process of making prepared data available in the data lake. When you publish prepared data, Intelligent Data Lake writes the transformed input source to a Hive table in the data lake. Other analysts can add the published data to their projects and create new data assets, or use a third-party business intelligence or advanced analytics tool to run reports that further analyze the published data.
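The recipe/publication split above can be sketched as follows: a recipe is recorded once as an ordered list of steps while working on sample rows, then replayed unchanged against the full dataset at publish time. The step kinds, row shapes, and the 0.5 conversion rate below are invented; Intelligent Data Lake’s internal representation differs.

```python
# A recipe: an ordered, replayable list of preparation steps (illustrative).
recipe = [
    ("filter", lambda row: row["amount"] > 0),                    # drop bad rows
    ("derive", lambda row: {**row, "amount_eur": row["amount"] * 0.5}),  # made-up rate
]

def apply_recipe(rows, steps):
    """Replay recipe steps over any dataset: the sample during
    preparation, or the full data at publication time."""
    for kind, fn in steps:
        if kind == "filter":
            rows = [r for r in rows if fn(r)]
        elif kind == "derive":
            rows = [fn(r) for r in rows]
    return rows

sample = [{"amount": 10.0}, {"amount": -3.0}]
published = apply_recipe(sample, recipe)
```

In the product, the replay does not happen row by row in the client: the recipe is compiled into an Informatica mapping and executed by BDM on the cluster, with the result written to a Hive table.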