1. The First Step in Information Management
looker.com
Produced by:
MONTHLY SERIES
In partnership with:
Data Lake Architecture
October 5, 2017
2. Topics for Today’s Analytics Webinar
Benefits and Risks of a Data Lake
Data Lake Reference Architecture
Lab and the Factory
Base Environment for Batch Analytics, Streaming and Real-Time Data
Critical Governance Components
Key Take-Aways
Q&A
pg 2© 2017 First San Francisco Partners www.firstsanfranciscopartners.com
3. Polling Questions
Do you have a data lake?
− Yes
− No
− Unsure
If yes, is it:
− Operational and regularly used for analytics
− Informally used, like a lab or sandbox
− Unsure
4. Defining the Data Lake
A data lake is a collection of storage instances of various data assets additional to the
originating data sources. These assets are stored in a near-exact, or even exact, copy
of the source format.
The purpose of a data lake is to present an unrefined view of data to only the most
highly skilled analysts, to help them explore their data refinement and analysis
techniques independent of any of the system-of-record compromises that may exist
in a traditional analytic data store (such as a data mart or data warehouse).
A data lake can support exploratory analytics, operational uses of data, or both.
Source: Gartner IT Glossary
6. Benefits of the Data Lake
Enables “productionizing” advanced analytics
Cost-effective scalability and flexibility
Derives value from unlimited data types (including raw data)
Reduces long-term cost of ownership across entire spectrum of data use
7. Risks of the Data Lake
Loss of trust
Loss of relevance and momentum
Increased risk
Long-term excessive cost
9. Modern Reality of the Data Lake
The data lake has changed due to storage availability, data management tools
and the ease with which data can be managed.
Today’s data lake comprises:
‒ Landing Zone
‒ Standardization Zone
‒ Analytics Sandbox
10. Modern Reality of the Data Lake
[Diagram: Data Sources, Landing Zone]
Landing Zone: Closest to original data lake conception where raw data is stored and available for consumption
11. Modern Reality of the Data Lake
[Diagram: Data Sources, Landing Zone, Standardization Zone]
Standardization Zone: Standardized, cleaned data – the preferred version for downstream consumers and the Analytics Sandbox
12. Modern Reality of the Data Lake
[Diagram: Data Sources, Landing Zone, Standardization Zone, Analytics Sandbox]
Analytics Sandbox: Where Data Scientists work to create new models
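The three zones described on slides 10 to 12 can be sketched as a toy, in-memory structure. This is only an illustration under stated assumptions: the `DataLake` class, its method names and the `id`/`country` fields are hypothetical and do not come from the webinar, and a real lake would sit on distributed storage rather than Python lists.

```python
import json
from dataclasses import dataclass, field


@dataclass
class DataLake:
    """Toy in-memory lake with the three zones named on the slides."""
    landing: list = field(default_factory=list)       # raw, exact copies of source payloads
    standardized: list = field(default_factory=list)  # cleaned, consistently typed records
    sandbox: dict = field(default_factory=dict)       # per-experiment working copies

    def land(self, raw_payload: str) -> None:
        # Landing Zone: keep the payload exactly as the source sent it.
        self.landing.append(raw_payload)

    def standardize(self) -> None:
        # Standardization Zone: parse, trim and type each raw record.
        # The field names used here are hypothetical.
        self.standardized = []
        for raw in self.landing:
            record = json.loads(raw)
            self.standardized.append({
                "customer_id": int(record["id"]),
                "country": record["country"].strip().upper(),
            })

    def checkout(self, experiment: str) -> list:
        # Analytics Sandbox: hand out a copy so experiments cannot alter
        # the standardized data.
        copy = [dict(r) for r in self.standardized]
        self.sandbox[experiment] = copy
        return copy


lake = DataLake()
lake.land('{"id": "7", "country": " us "}')
lake.standardize()
working = lake.checkout("churn-model")
working[0]["country"] = "XX"  # the change stays inside the sandbox copy
```

The raw payload in the landing zone and the cleaned record in the standardization zone are both unchanged after the sandbox edit, which is the separation the zone model is meant to provide.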
13-16. Modern Reality of the Data Lake
[Diagram, built up across slides 13 to 16: Data Sources; Landing Zone, Standardization Zone, Analytics Sandbox; Data Management; Data Governance and Data Operations; Data Scientists; Data Consumers]
17. Reminder: Two Lenses to Derive an Effective Architecture
Form: Developing the architecture so all stakeholders can actually understand and develop it
Progression: Develop architectures that are best fit for purpose and effective, no matter how simple or complex
19. Why is This Topic Important?
A key to successful data lake management is understanding whether it is a lab, a factory or both.
There are architectural, governance and organizational impacts.
You must clearly identify whether you are evolving from a lab to a factory or intend to keep them separate.
20. First Progression – Lab Elements
[Table: columns Organization Elements, Functional Elements, Technology Elements; rows Data Consumption, Data Supply Chain/Logistics, Data Management; cell contents listed below]
− Landing/Staging
− ETL
− Data Analysts
− Access – Publish, Subscribe, Notify
− Access Tools – BI, Analytics
− Analytics – Descriptive, Predictive, Prescriptive
− HDFS, Columnar and Graph
21. Operational Elements
[Table: columns Organization Elements, Functional Elements, Technology Elements; rows Data Consumption, Data Supply Chain/Logistics, Data Management; cell contents listed below]
− Pedigree and Preparation
− Landing/Staging
− Model/Metrics Management
− Data Reduction
− Glossary Management
− Machine Learning/AI
− Data Governance
− Data Operations
− Data Ingestion
− Reference and Master Data
− Competency Centers
− Self-Service/Data Citizens
− ETL/Virtualization
− Distributed Processing
− Metadata
− Data Quality/Hygiene
− Lake, Pond, Warehouse
− HDFS, Columnar and Graph
− Data Streaming
− Data Glossary
− Data Lake Management
− Taxonomy/Ontology
− Web Services
− Policy and Process
− Data Analysts and Scientists
− Collaboration, Decision-Making
− Access – Publish, Subscribe, Notify
− Access Tools – BI, Analytics
− Applications
− Analytics – Descriptive, Predictive, Prescriptive
− Business/Tech. Planning
− Security, Privacy
− Business Continuity
22. The Lab – Characteristics
Allows for experimentation, testing new models, proofs of concept
Technical
− Flexible architectures, even ad hoc or non-persistent
− Rarely documented
− Schema on read
Organizational
− Run by the main users, hence informal or departmental
Functional
− By nature, results should be evaluated for relevance
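"Schema on read" in the list above means the data is stored as it arrives and a structure is applied only when someone reads it. A minimal sketch, assuming JSON lines as the raw format; the sample records, the `billing_schema` mapping and the `read_with_schema` helper are hypothetical and only illustrate the idea.

```python
import json

# Raw lines are stored untouched; nothing is validated at write time.
raw_lines = [
    '{"id": "1", "amount": "19.90", "note": "first order"}',
    '{"id": "2", "amount": "5"}',
]

# A schema here is a mapping of field name -> converter, chosen by the reader.
billing_schema = {"id": int, "amount": float}


def read_with_schema(lines, schema):
    """Apply a schema while reading; fields outside the schema are ignored."""
    for line in lines:
        record = json.loads(line)
        yield {name: convert(record[name]) for name, convert in schema.items()}


rows = list(read_with_schema(raw_lines, billing_schema))
```

A different reader could apply a different schema to the same raw lines, which is what makes this approach suit a lab and also why its results need to be evaluated before they are relied on.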
23. The Factory – Characteristics
Addressing directed requirements, producing regular outputs
associated with a business service, product or action
Technical
− Architecture needs to be defined so its use and limits are understood
Organizational
− Published rules of engagement
Functional
− Data quality is monitored and known
− Lineage and metadata support navigation and use of content
− May need scheduled access and loading
− Publishing results will require some form of quality control and approval
− Models that are executed on a scheduled basis will require some sort of administrative and maintenance capabilities
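Two of the factory characteristics above, monitored data quality and quality control plus approval before publishing, can be sketched as a simple gate. This is an illustration under assumptions: completeness is used as the only quality measure, and the `QualityReport`, `measure` and `may_publish` names, the sample records and the thresholds are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class QualityReport:
    total: int
    complete: int

    @property
    def completeness(self) -> float:
        # Share of records with every required field present.
        return self.complete / self.total if self.total else 0.0


def measure(records, required_fields):
    """Count records in which every required field is present and non-empty."""
    complete = sum(
        all(r.get(f) not in (None, "") for f in required_fields) for r in records
    )
    return QualityReport(total=len(records), complete=complete)


def may_publish(report, threshold, approved_by):
    """Publishing needs both a passing quality score and a named approver."""
    return report.completeness >= threshold and bool(approved_by)


records = [
    {"id": 1, "region": "EMEA"},
    {"id": 2, "region": ""},
    {"id": 3, "region": "APAC"},
    {"id": 4, "region": "AMER"},
]
report = measure(records, ["id", "region"])
```

Calling `may_publish(report, 0.7, "data steward")` allows publication for this sample, while a stricter threshold or a missing approver blocks it.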
25. A Base Environment
[Diagram labels: Data Governance, Data Operations]
Rapid ingestion – stream, low latency or batch updating
Ease of access – find it, use it, know what it means
Effective data supply chain – data of the correct quality needs to be where it is supposed to be
Flexibility – Data Scientists need to be able to experiment, but without polluting the lake
26. Additional Components for Real-time Analytics and Ingesting Streaming Data
Are you replacing the Operational Data Store (ODS)?
Will you be doing full CRUD operations (Create, Read, Update, Delete)?
How fast do you need to go? Latencies should match your real needs.
Vendors – Hortonworks, Attunity, Splice
− Ingest
− Process
− Consumption
Technologies you will hear about
− Apache Kafka, Storm (real-time streaming components)
− Apache Spark (fast batches)
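The difference between real-time streaming and "fast batches" can be sketched in plain Python. This does not use the Kafka, Storm or Spark APIs; the `per_event_total` and `micro_batches` helpers and the sample events are hypothetical and only contrast per-event processing with small fixed-size batches.

```python
from itertools import islice


def per_event_total(events):
    # Streaming style: update the running result after every event.
    total = 0
    for event in events:
        total += event
        yield total


def micro_batches(events, batch_size):
    """Group an event stream into small fixed-size batches."""
    iterator = iter(events)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


events = [3, 1, 4, 1, 5]
running = list(per_event_total(events))               # one result per event
batched = [sum(b) for b in micro_batches(events, 2)]  # one result per batch
```

Per-event processing gives the lowest latency at the cost of more work per record, while batching trades some delay for throughput, which is why the slide asks how fast you really need to go.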
28. Major Areas of Data Governance Concern in the Data Lake
[Diagram: Data Sources, Landing Zone, Standardization Zone, Analytics Sandbox, Data Scientists, Data Consumers, Data Governance and Data Operations, annotated with markers 1 to 6; marker 3 appears three times]
(1) Data Acquisition
(2) Data Catalog
(3) Data Decisions
(4) Analytics Governance
(5) Data Usage
(6) Model Productionalization
Some Data Governance approaches are new, and others are applications of traditional approaches
29. Major Areas of Data Governance Concern in the Data Lake
Data is cataloged/mapped so it’s easily found
Data is described adequately to permit reuse for any need
Decisions about data are logged and communicated
Flow of data (data lineage) is documented, so users/regulators can understand where it came from
Staff who know and understand the data are identified
Data Governance defines the information you need to maintain your data, develops the processes to do this, trains staff and provides the environments to manage the knowledge, while monitoring and ensuring compliance.
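The governance concerns on this slide (cataloged and described data, logged decisions, documented lineage, identified staff) could be captured in a catalog record along these lines. It is a sketch only: the `CatalogEntry` fields, the `lineage` helper and the dataset and steward names are hypothetical and not taken from the webinar or from any catalog product.

```python
from dataclasses import dataclass, field


@dataclass
class CatalogEntry:
    name: str
    description: str                               # enough detail to permit reuse
    steward: str                                   # staff member who knows the data
    upstream: list = field(default_factory=list)   # datasets this one is derived from
    decisions: list = field(default_factory=list)  # logged decisions about the data


def lineage(catalog, name):
    """Return every upstream dataset of `name`, nearest first."""
    seen = []
    queue = list(catalog[name].upstream)
    while queue:
        current = queue.pop(0)
        if current not in seen:
            seen.append(current)
            queue.extend(catalog[current].upstream)
    return seen


catalog = {
    "crm_raw": CatalogEntry("crm_raw", "Exact copy of the CRM export", "A. Steward"),
    "crm_clean": CatalogEntry(
        "crm_clean", "Standardized CRM customers", "A. Steward", upstream=["crm_raw"]
    ),
    "churn_scores": CatalogEntry(
        "churn_scores", "Monthly churn model output", "B. Analyst", upstream=["crm_clean"]
    ),
}
catalog["crm_clean"].decisions.append("Country codes normalized to upper case")
```

Walking `lineage(catalog, "churn_scores")` answers the "where did it come from" question a user or regulator might ask, and the `steward` field names who to ask about each dataset.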
30. Evolution of Critical Governance Components
While the data lake is flexible, governance is required to ensure appropriate use.
While the data lake is operational, governance will ensure legitimacy and compliance and verify alignment with business needs.
To move to operational use, governance should supply a road map, new policies, training and organization management.
32. Key Take-Aways
Make sure you offer up business benefits in addition to traditional “access to data” – such as lower costs, more nimble reactions.
Avoid additional data risks by providing oversight of data quality and sources. Do not take a casual approach to managing the data lake assets.
Understand that the architectural aspect of the data lake (as it is evolving) is becoming a standard, much like the data warehouse.
Maintain an open mind for supporting technologies, because they are changing every day.
Implement Data Governance. It is a critical success factor, no matter how you view the data lake.
34. Thank you for joining today!
Please join us Thursday, Nov. 2 for the
Keys to Effective Data Visualization webinar.
John Ladley @jladley
john@firstsanfranciscopartners.com
Kelle O’Neal @kellezoneal
kelle@firstsanfranciscopartners.com