@joe_Caserta
Big Data: Setting up a Big Data Lake
Joe Caserta
President
Caserta Concepts
September 17, 2015 - New York City
Caserta Timeline
• 1986: Began consulting in database programming and data modeling; 25+ years of hands-on experience building database solutions
• 1996: Data Analysis, Data Warehousing and Business Intelligence work since 1996
• 2001: Founded Caserta Concepts
• 2004: Co-author, with Ralph Kimball, of The Data Warehouse ETL Toolkit
• 2009: Web log analytics solution published in Intelligent Enterprise
• 2012: Launched Big Data practice; launched the Big Data Warehousing (BDW) Meetup in NYC (3,000+ members)
• 2013: Launched Data Science, Data Interaction and Cloud practices
• 2014: Laser focus on extending Data Analytics with Big Data solutions; dedicated to Data Governance techniques on Big Data (innovation); named to CIO Review's Top 20 Big Data Consulting list and among the Top 20 Most Powerful Big Data consulting firms
• 2015: Awarded for getting data out of SAP for data analytics; established best practices for big data ecosystem implementations; awarded Top Healthcare Analytics Solution Provider
About Caserta Concepts
• Consulting firm with focused expertise on Data Innovation, using Modern Data
Engineering approaches to solve highly complex business data challenges
• Award-winning company
• Internationally recognized work force
• Mentoring, Training, Knowledge Transfer
• Strategy, Architecture, Implementation
• An Innovation Partner
• Transformative Data Strategies
• Modern Data Engineering
• Advanced Architecture
• Leaders in architecting and implementing enterprise data solutions
• Data Warehousing
• Business Intelligence
• Big Data Analytics
• Data Science
• Data on the Cloud
• Data Interaction & Visualization
• Strategic Consulting
• Technical Design
• Build & Deploy Solutions
Awards and Recognitions
Client Portfolio
Retail/eCommerce
& Manufacturing
Digital Media/AdTech
Education & Services
Finance, Healthcare
& Insurance
Partners
Caserta Innovation Lab (CIL)
• Internal laboratory established to test & develop solution concepts and
ideas
• Used to accelerate client projects
• Examples:
• Search (SOLR) based BI
• Big Data Governance Toolkit
• Text Analytics on Social Network Data
• Continuous Integration / End-to-end streaming (Spark)
• Recommendation Engine Optimization
• Relationship Intelligence (Graph DB/Search)
• Others (confidential)
• CIL is hosted on
Community
New York City
3,000+ members
Free Knowledge Sharing
The Future is Today
As a Mindful Cyborg, Chris Dancy utilizes up to 700 sensors, devices, applications, and
services to track, analyze, and optimize as many areas of his existence.
This quantification enables him to see the connections of otherwise invisible data,
resulting in dramatic upgrades to his health, productivity, and quality of life.
The Progression of Data Analytics
• Descriptive Analytics: What happened?
• Diagnostic Analytics: Why did it happen?
• Predictive Analytics: What will happen?
• Prescriptive Analytics: How can we make it happen?
Business value grows with data analytics sophistication:
Reports → Correlations → Predictions → Recommendations
…leading to Cognitive Computing / Cognitive Data Analytics
Source: Gartner
Innovation is the only sustainable competitive advantage a company can have
Innovations may fail, but companies that don’t innovate will fail
What’s New in Modern Data Engineering?
What you need to know (according to Joe)
• Hadoop Distribution: Apache, Cloudera, Hortonworks, MapR, IBM
• Tools:
  • Hive: map data to structures and use SQL-like queries
  • Pig: data transformation language for big data
  • Sqoop: extracts external sources and loads Hadoop
  • Storm: real-time ETL
  • Spark: general-purpose cluster computing framework
• NoSQL:
  • Document: MongoDB, CouchDB
  • Graph: Neo4j, Titan
  • Key-Value: Riak, Redis
  • Columnar: Cassandra, HBase
  • Search: Lucene, Solr, Elasticsearch
• Languages: Python, Java, R, Scala
The Evolution of Modern Data Engineering
[Diagram: source systems (Enrollments, Claims, Finance, others) feed two destinations. ETL loads a traditional EDW serving ad-hoc and canned reporting (traditional BI). ETL also loads a Big Data Lake: a horizontally scalable environment optimized for analytics, built on the Hadoop Distributed File System (HDFS) across nodes N1-N5 and running Spark, MapReduce, and Pig/Hive. The lake supports ad-hoc query, canned reporting, Big Data Analytics, Data Science, and NoSQL databases.]
How We've Built Data Warehouses
• Design: top down / bottom up
  • Customer interviews and requirements gathering
  • Data profiling
• Extract, Transform, Load data from source to data warehouse
• Create facts and dimensions
• Put a BI tool on top
• Develop reports
• Data Governance
The Traditional Conversation
• Kimball vs. Inmon
• Dimensional vs. 3rd Normal Form
• What hardware do we need (that will be ready in 6 months)?
• Oracle vs. SQL Server, or Postgres or MySQL if we were brave (and cheap)
• Which ETL tool should we BUY → Informatica? DataStage?
• Which BI tool should we sit on top → Business Objects? Cognos?
The New Conversation
• Do we need a Data Warehouse at all?
• If we do, does it need to be relational?
• Should we leverage Hadoop or NoSQL?
• Which platform and language are we going to code in?
• Which bleeding-edge Apache project should we put in production?
Why Change?
New technologies are great and all… but what drives our adoption of new technologies and techniques?
• Data has changed: semi-structured, unstructured, sparse, and evolving schemas
• Volumes have changed → GB to TB to PB workloads
• Cracks in the armor of the traditional Data Warehousing approach!
AND MOST IMPORTANTLY:
Companies that innovate to leverage their data win!
Cracks in the Data Warehouse Armor
• Onboarding new data is difficult!
• Data structures are rigid!
• Data Governance is slow!
• Disconnected from business needs:
  "Hey, I need to munge some new data to see if it has value."
  "Wait! We have to…
    profile, analyze and conform the data,
    change data models and load it into dimensional models,
    build a semantic layer (that nobody is going to use),
    create a dashboard we hope someone will notice,
  …and then you can have at it 3-6 months later to see if it has value!"
Is Anyone Surprised?
Data warehouses have a 70% FAILURE RATE
• Semi-scientific analysis suggests the majority of data analytics projects fail…
• And of those that don't fail, only a fraction are deemed a "success"; the others just finish!
• Data is just REALLY hard, especially without the right strategy
What do we think the Data Governance failure rate is?
Is Traditional Warehousing All Wrong?
NO! The concept of a Data Warehouse is sound:
• Consolidating data from disparate source systems
• Clean and conformed reference data
• Clean and integrated business facts
• Data governance (a more pragmatic version)
We can be more successful by acknowledging that the EDW can't solve all problems.
So what's missing? The Data Lake
A storage and processing layer for all data:
• Store anything: source data, semi-structured, unstructured, structured
• Keep it as long as needed
• Support a number of processing workloads
• Scale out
…and here is where Hadoop can help us!
Hadoop (Typically) Powers the Data Lake
Hadoop provides us:
• Distributed storage → HDFS
• Resource management → YARN
• Many workloads, not just MapReduce
Governing Big Data
Before Data Governance:
• Users trying to produce reports from raw source data
• No data conformance
• No Master Data Management
• No data quality processes
• No trust: two analysts were almost guaranteed to come up with two different sets of numbers!
Before Big Data Governance:
• We can put "anything" in Hadoop
• We can analyze anything
• We're scientists, we don't need IT, we make the rules
Rule #1: Dumping data into Hadoop with no repeatable process, procedure, or governance will create a mess.
Rule #2: Information harvested from an ungoverned system will take us back to the old days: No Trust = Not Actionable.
Data Governance
• Organization: the "people" part; establishing an Enterprise Data Council, Data Stewards, etc.
• Metadata: definitions, lineage (where does this data come from), business definitions, technical metadata
• Privacy/Security: identify and control sensitive data; regulatory compliance
• Data Quality and Monitoring: data must be complete and correct; measure, improve, certify
• Business Process Integration: policies around data frequency, source availability, etc.
• Master Data Management: ensure consistent business-critical data, i.e. Members, Providers, Agents, etc.
• Information Lifecycle Management (ILM): data retention, purge schedule, storage/archiving
Extending these disciplines for Big Data:
• Add Big Data to the overall framework and assign responsibility
• Add data scientists to the Stewardship program
• Assign stewards to new data sets (Twitter, call center logs, etc.)
• Graph databases are more flexible than relational
• Lower-latency services are required
• Distributed data quality and matching algorithms
• Data Quality and Monitoring tooling (probably home grown; Drools?)
• Quality checks are not only SQL: machine learning, Pig and MapReduce
• Acting on large-dataset quality checks may require distribution
• Larger scale, new datatypes
• Integrate with the Hive Metastore, HCatalog, home-grown tables
• Secure and mask multiple data types (not just tabular)
• Deletes are more uncommon (unless there is a regulatory requirement)
• Take advantage of compression and archiving (like AWS Glacier)
• Data detection and masking on unstructured data upon ingest
• Near-zero latency, DevOps, a core component of business operations
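As a concrete illustration of "data detection and masking on unstructured data upon ingest," here is a minimal sketch. The regex patterns and the `[TYPE]` placeholder format are illustrative assumptions, not part of any Caserta toolkit; real deployments need far more robust detection.

```python
import re

# Illustrative detectors only: real PII detection needs broader patterns
# (and often ML-based classifiers, as the deck notes for big data quality).
PATTERNS = {
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
}

def mask_sensitive(text: str) -> str:
    """Replace each detected sensitive value with a [TYPE] placeholder."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text

note = "Member 123-45-6789 emailed claims@example.com about a denied claim."
print(mask_sensitive(note))
# → Member [SSN] emailed [EMAIL] about a denied claim.
```

Running masking at ingest (rather than at query time) means nothing sensitive ever lands unprotected in the lake.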
Making it Right
The promise is an "agile" data culture where communities of users are encouraged to explore new datasets in new ways:
• New tools
• External data
• Data blending
• Decentralization
With all the V's, data scientists, new tools, and new data, we must rely LESS on HUMANS:
• We need more systemic administration
• We need systems and tools to help with big data governance
• This space is EXTREMELY immature!
Steps towards Data Governance for the Data Lake:
1. Establish the difference between traditional data and big data governance
2. Establish basic rules for where new data governance can be applied
3. Establish processes for graduating the products of data science to governance
4. Establish a set of tools to make governing Big Data feasible
Data Governance for the Data Lake: Process Architecture
BDG provides vision, oversight and accountability for leveraging corporate information assets to create competitive advantage, and accelerate the vision of integrated delivery.
[Organization chart:]
• Enterprise Data Council: executive oversight; prioritizes work; drives change; accountable for results
• Governance Committees: value creation; acts on requirements; builds capabilities
• Data Stewards / Project Teams: does the work; responsible for adherence
Framework elements: Communication, Organization, IFP Governance, Administration, Compliance Reporting, Standards, Value Proposition, Risk/Reward, Information Accountabilities, Stewardship, Architecture, Data Integrity Metrics, Control Mechanisms, Principles and Standards, Information Usability, Definitions
Data Lake Governance Realities
• Full data governance can only be applied to "structured" data
• The data must have a known and well-documented schema
• This can include materialized endpoints such as files or tables, OR projections such as a Hive table
• Governed structured data must have:
  • A known schema with metadata
  • A known and certified lineage
  • A monitored, quality-tested, managed process for ingestion and transformation
  • A governed usage → data isn't just for enterprise BI tools anymore
• We talk about unstructured data in Hadoop, but more often it's semi-structured/structured with a definable schema
• Even in the case of unstructured data, structure must be extracted/applied in just about every case imaginable before analysis can be performed
Modern Data Quality Priorities
• Be Corrective
• Be Fast
• Be Transparent
• Be Thorough
Data Quality Priorities
[Chart: speed to value (fast to slow) plotted against data quality (raw to refined): raw data yields value fast, refined data takes longer.]
The Data Scientists Can Help!
Data Science to Big Data Warehouse mapping:
• Full Data Governance requirements
  • Provide full process lineage
  • Data certification process by data stewards and business owners
  • Ongoing data quality monitoring that includes quality checks
• Provide requirements for the Data Lake
  • Proper metadata established: catalog, data definitions, lineage
  • Quality monitoring
  • Know and validate data completeness
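"Know and validate data completeness" can be sketched as a simple profiling check that a steward might run. The member records, field names, and the 90% threshold below are illustrative assumptions, not from the deck:

```python
# For each required field, compute the fraction of records where it is
# populated, then flag fields that fall below an agreed threshold.

def completeness(records, required_fields):
    """Return {field: fraction of records with a non-empty value}."""
    total = len(records)
    return {
        f: sum(1 for r in records if r.get(f) not in (None, "")) / total
        for f in required_fields
    }

members = [
    {"member_id": "M1", "plan": "PPO", "dob": "1980-01-01"},
    {"member_id": "M2", "plan": "",    "dob": "1975-06-30"},
    {"member_id": "M3", "plan": "HMO", "dob": None},
    {"member_id": "M4", "plan": "HMO", "dob": "1990-12-12"},
]
scores = completeness(members, ["member_id", "plan", "dob"])
failing = [f for f, pct in scores.items() if pct < 0.9]
print(failing)  # → ['plan', 'dob']
```

At lake scale the same per-field aggregation would run distributed (Pig, MapReduce, or Spark, as the governance section notes), but the check itself is this simple.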
The Big Data Pyramid
From bottom to top:
• Landing Area: source data in "full fidelity"; raw machine data collection, collect everything. Governance: Metadata → Catalog; ILM → who has access, how long do we "manage it".
• Data Lake (Integrated Sandbox): data is ready to be turned into information: organized, well defined, complete. Governance adds Data Quality and Monitoring → monitoring of completeness of data.
• Data Science Workspace: agile business insight through data munging, machine learning, blending with external data, development of to-be BDW facts.
• Big Data Warehouse: fully data governed (trusted); the user community runs arbitrary queries and reporting. Governance: Metadata → Catalog; ILM; Data Quality and Monitoring.
Usage pattern drives data governance: data has different governance demands at each tier, and only the top tier of the pyramid is fully governed. We refer to this as the Trusted tier of the Big Data Warehouse.
Peeling back the layer… The Landing Area
• Source data in its full fidelity
• Programmatically loaded
• Partitioned for data processing
• No governance other than catalog and ILM (security and retention)
• Consumers: data scientists, ETL processes, applications
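Programmatic loading into partitioned paths can be sketched as follows. The `/landing/source/entity/ingest_date=…` convention here is an assumed example layout, not a layout the deck prescribes:

```python
from datetime import date

# Files land under a source/entity/ingest-date partition scheme so that
# downstream processing can prune by partition instead of scanning everything.

def landing_path(source: str, entity: str, ingest_date: date, filename: str) -> str:
    """Build an HDFS-style partitioned path for a raw file in the landing area."""
    return (
        f"/landing/{source}/{entity}/"
        f"ingest_date={ingest_date.isoformat()}/{filename}"
    )

p = landing_path("claims_system", "claims", date(2015, 9, 17), "batch_0001.json")
print(p)
# → /landing/claims_system/claims/ingest_date=2015-09-17/batch_0001.json
```

Keeping the convention purely mechanical is what makes the landing area "programmatically loaded": no human decisions per file, so ingestion scales to any number of sources.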
Data Lake
• Enriched, lightly integrated
• Data is made accessible in the Hive Metastore:
  • Either processed into tabular relations
  • Or via Hive SerDes directly on the raw data
• Partitioned for data access
• Governance additionally includes a guarantee of completeness
• Consumers: data scientists, ETL processes, applications, data analysts
A Note On Unstructured Data
• Structure must be extracted/applied in just about every case imaginable before analysis can be performed
• Full data governance can only be applied to "structured" data
• This can include materialized endpoints such as files or tables, OR projections such as a Hive table
• Governed structured data must have:
  • A known schema with metadata
  • A known and certified lineage
  • A monitored, quality-tested, managed process for ingestion and transformation
Data Science Workspace
• No barrier to onboarding and analysis of new data
• Blending of new data with the entire Data Lake, including the Big Data Warehouse
• Data scientists enrich data with insight
• Consumers: data scientists (cool cats) only!
Big Data Warehouse
• Data is fully governed
• Data is structured
• Partitioned/tuned for data access
• Governance includes a guarantee of completeness and accuracy
• Consumers: data scientists, ETL processes, applications, data analysts, and business users (the masses)
The Refinery
[Diagram: cool new data enters the Data Science Workspace, which sits alongside the Landing Area, Data Lake, and BDW; new insights flow back out into the stack.]
• The feedback loop between Data Science and the Data Warehouse is critical
• Successful work products of science must graduate into the appropriate layers of the Data Lake
Big Data Warehouse Technology?
“Polyglot Persistence - where any decent sized
enterprise will have a variety of different data
storage technologies for different kinds of data.
There will still be large amounts of it managed in
relational stores, but increasingly we'll be first asking
how we want to manipulate the data and only then
figuring out what technology is the best bet for it…”
- Martin Fowler (http://martinfowler.com)
Abridged Version: Use the right tool for the job!
Polyglot Warehouse
We promote the concept that the Big Data Warehouse may live in one or more platforms:
• Full Hadoop solutions
• Hadoop plus MPP or relational
Supplemental technologies:
• NoSQL: columnar, key-value, time series, graph
• Search technologies
Hadoop is the Data Warehouse?
• Hadoop can be the platform for the entire data pyramid, including the landing area, the Data Lake, and the Big Data Warehouse
• It especially serves as the Data Lake and "Refinery"
• Query engines such as Hive and Impala provide SQL support
More Typical: Hadoop + Relational
• Hadoop is the platform for the Data Lake and Refinery
• The active set is federated out into MPP or relational platforms → the presentation layer
• This serves as a good model when there is an existing MPP or relational Data Warehouse in place
On the Cloud
AWS and other cloud providers present a very powerful design pattern:
• S3 serves as the storage layer for the Data Lake
• EMR (Elastic MapReduce) provides the Refinery; most clusters can be ephemeral
• The active set is stored in Redshift MPP or relational platforms
This eliminates a massive on-premises appliance footprint.
Data Warehousing is not Dead!
• The principles of Data Warehousing still make sense
• Recognize gaps in the features/functionality of the relational database and traditional Data Warehousing
• Believe in the Data Lake and accept tunable governance
• Think Polyglot Warehouse and use the right tool for the job
What skills are needed?
The intersection of three areas:
• Modern Data Engineering / Data Preparation
• Domain Knowledge / Business Expertise
• Advanced Mathematics / Statistics
What about the tools I have?
People, processes and business commitment are still critical!
Caution: some assembly required. The V's require robust tooling:
• Some of the most hopeful tools are brand new or in incubation!
• Enterprise big data implementations typically combine products with some custom-built components
Use Cases
• Real-Time Trade Data Analytics
• Comply with Dodd-Frank
• Electronic Medical Record Analytics
• Save lives?
High Volume Trade Data Project
• The equity trading arm of a large US bank needed to scale its infrastructure to process/parse trade data in real time and calculate aggregations/statistics:
  ~1.4 million messages/second, ~12 billion messages/day, ~240 billion messages/month
• The solution needed to map the raw data to a data model in memory or at low latency (for real time), while persisting mapped data to disk (for end-of-day reporting)
• The proposed solution also needed to handle ad-hoc data requests for data analytics
The Data
• Primarily FIX messages: Financial Information eXchange
• Established in the early '90s as a standard for trade data communication, widely used throughout the industry
• Basically a delimited stream of variable attribute-value pairs
• Looks something like this:
8=FIX.4.2 | 9=178 | 35=8 | 49=PHLX | 56=PERS | 52=20071123-05:30:00.000 | 11=ATOMNOCCC9990900 | 20=3 | 150=E | 39=E | 55=MSFT | 167=CS | 54=1 | 38=15 | 40=2 | 44=15 | 58=PHLX EQUITY TESTING | 59=0 | 47=C | 32=0 | 31=0 | 151=15 | 14=0 | 6=0 | 10=128 |
• A single trade can be comprised of hundreds of such messages, although typical trades have about a dozen
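A minimal parser for such messages can be sketched in Python. This is an illustration, not the bank's production code; tag 55 being the Symbol field comes from the public FIX 4.2 specification:

```python
# Split a pipe-delimited FIX message into tag/value pairs and look up
# well-known tags (e.g. 55 = Symbol, 35 = MsgType per the FIX 4.2 spec).

def parse_fix(message: str, delimiter: str = "|") -> dict:
    """Parse 'tag=value' pairs from a delimited FIX message into a dict."""
    pairs = {}
    for field in message.split(delimiter):
        field = field.strip()
        if not field:
            continue  # skip empty segments from trailing delimiters
        tag, _, value = field.partition("=")
        pairs[tag.strip()] = value.strip()
    return pairs

msg = "8=FIX.4.2 | 9=178 | 35=8 | 49=PHLX | 56=PERS | 55=MSFT | 10=128 |"
fields = parse_fix(msg)
print(fields["55"])  # tag 55 = Symbol → MSFT
```

Because attributes are variable per message, the dict representation is a natural fit; at 1.4 million messages/second this mapping runs inside the Storm topology rather than a single process.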
High Volume Real-time Analytics Solution Architecture
[Diagram: trade data flows through Kafka into a Storm cluster running a data quality rules engine; atomic data and aggregates land in a Hadoop cluster; event monitors and d3.js power real-time analytics; low-latency analytics are served from the aggregates.]
• The Kafka messaging system is used for ingestion
• Storm is used for real-time ETL and outputs atomic data and derived data needed for analytics
• Redis is used as a reference data lookup cache
• Real-time analytics are produced from the aggregated data
• Higher-latency ad-hoc analytics are done in Hadoop using Pig and Hive
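The real-time aggregation step can be reduced to a toy sketch: tumbling one-second windows counting messages per symbol. In the actual solution this runs in Storm over the Kafka stream; the timestamps and symbols below are made up:

```python
from collections import defaultdict

# Bucket trade messages into tumbling windows and count messages per symbol.
# This is the shape of the derived data Storm emits for the real-time views.

def windowed_counts(messages, window_secs=1):
    """messages: iterable of (epoch_ts, symbol) → {(window_start, symbol): count}."""
    counts = defaultdict(int)
    for ts, symbol in messages:
        window_start = int(ts // window_secs) * window_secs
        counts[(window_start, symbol)] += 1
    return dict(counts)

trades = [(100.1, "MSFT"), (100.7, "MSFT"), (100.9, "AAPL"), (101.2, "MSFT")]
print(windowed_counts(trades))
# → {(100, 'MSFT'): 2, (100, 'AAPL'): 1, (101, 'MSFT'): 1}
```

Emitting only the windowed aggregates keeps the real-time path small while the atomic messages are persisted to Hadoop for end-of-day and ad-hoc work.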
Electronic Medical Records (EMR) Analytics
[Pipeline diagram: ~100k small files (each with variants 1..n) arrive on an edge node; Forqlift packs them into Sequence Files; an HDFS put moves them into the Hadoop Data Lake; a Pig EMR processor with a UDF library and a Python wrapper produces Parquet tables (Provider, Member, and 15 more entities); Sqoop loads the results into a Netezza DW as more dimensions and facts.]
• Receive Electronic Medical Records from various providers in various formats
• Address the Hadoop "small file" problem
• No barrier for onboarding and analysis of new data
• Blend new data with the Data Lake and Big Data Warehouse
• Machine Learning, Text Analytics, Natural Language Processing
• Reporting and ad-hoc queries
• File ingestion
• Information Lifecycle Management
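The "small file" fix can be illustrated with a greedy batching sketch. The project used Forqlift to pack files into Hadoop SequenceFiles; the grouping logic below, the 128 MB target, and the file sizes are illustrative stand-ins:

```python
# Greedily group many small files into batches no larger than a target size,
# so HDFS holds a few large objects instead of ~100k tiny ones (each tiny
# file otherwise costs a NameNode entry and a near-empty block).

def batch_files(files, max_batch_bytes=128 * 1024 * 1024):
    """files: list of (name, size_bytes) → list of batches (lists of names)."""
    batches, current, current_size = [], [], 0
    for name, size in files:
        if current and current_size + size > max_batch_bytes:
            batches.append(current)
            current, current_size = [], 0
        current.append(name)
        current_size += size
    if current:
        batches.append(current)
    return batches

small = [(f"record_{i}.xml", 40 * 1024 * 1024) for i in range(7)]  # 7 x 40 MB
print([len(b) for b in batch_files(small)])
# → [3, 3, 1]
```

Sizing batches near the HDFS block size is the usual rule of thumb, which is why a 128 MB target is a plausible default here.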
Some Thoughts: Enable the Future
• Big Data requires the convergence of data governance, advanced data engineering, data science and business smarts
• Make sure your data can be trusted, and that people can be held accountable for impact caused by low data quality
• It takes a village to achieve all the tasks required for effective big data strategy & execution
• Get experts that have done it before!
Achieve the impossible… everything is impossible until someone does it!
Agile DW & ETL Training in NYC, 2015
Workshops: www.casertaconcepts.com/training
• Sept 21-22 (2 days): Agile Data Warehousing, taught by Lawrence Corr
• Sept 23-24 (2 days): ETL Architecture and Design, taught by Joe Caserta (Big Data module added)
SAVE $300 by using discount code: DAMANYC
New York Executive Conference Center, 1601 Broadway @ 48th St., New York, NY 10019
Recommended Reading
Thank You / Q&A
Joe Caserta
President, Caserta Concepts
joe@casertaconcepts.com
(914) 261-3648
The New Frontier: Optimizing Big Data ExplorationThe New Frontier: Optimizing Big Data Exploration
The New Frontier: Optimizing Big Data ExplorationInside Analysis
 

Ähnlich wie Big Data: Setting Up the Big Data Lake (20)

What Data Do You Have and Where is It?
What Data Do You Have and Where is It? What Data Do You Have and Where is It?
What Data Do You Have and Where is It?
 
Balancing Data Governance and Innovation
Balancing Data Governance and InnovationBalancing Data Governance and Innovation
Balancing Data Governance and Innovation
 
Architecting for Big Data: Trends, Tips, and Deployment Options
Architecting for Big Data: Trends, Tips, and Deployment OptionsArchitecting for Big Data: Trends, Tips, and Deployment Options
Architecting for Big Data: Trends, Tips, and Deployment Options
 
Setting Up the Data Lake
Setting Up the Data LakeSetting Up the Data Lake
Setting Up the Data Lake
 
Intro to Data Science on Hadoop
Intro to Data Science on HadoopIntro to Data Science on Hadoop
Intro to Data Science on Hadoop
 
Big Data's Impact on the Enterprise
Big Data's Impact on the EnterpriseBig Data's Impact on the Enterprise
Big Data's Impact on the Enterprise
 
Introduction to Data Science
Introduction to Data ScienceIntroduction to Data Science
Introduction to Data Science
 
Balancing Data Governance and Innovation
Balancing Data Governance and InnovationBalancing Data Governance and Innovation
Balancing Data Governance and Innovation
 
Hadoop and Your Data Warehouse
Hadoop and Your Data WarehouseHadoop and Your Data Warehouse
Hadoop and Your Data Warehouse
 
Defining and Applying Data Governance in Today’s Business Environment
Defining and Applying Data Governance in Today’s Business EnvironmentDefining and Applying Data Governance in Today’s Business Environment
Defining and Applying Data Governance in Today’s Business Environment
 
Big Data Analytics with Microsoft
Big Data Analytics with MicrosoftBig Data Analytics with Microsoft
Big Data Analytics with Microsoft
 
The Right Data Warehouse: Automation Now, Business Value Thereafter
The Right Data Warehouse: Automation Now, Business Value ThereafterThe Right Data Warehouse: Automation Now, Business Value Thereafter
The Right Data Warehouse: Automation Now, Business Value Thereafter
 
The Emerging Role of the Data Lake
The Emerging Role of the Data LakeThe Emerging Role of the Data Lake
The Emerging Role of the Data Lake
 
Architecting Data For The Modern Enterprise - Data Summit 2017, Closing Keynote
Architecting Data For The Modern Enterprise - Data Summit 2017, Closing KeynoteArchitecting Data For The Modern Enterprise - Data Summit 2017, Closing Keynote
Architecting Data For The Modern Enterprise - Data Summit 2017, Closing Keynote
 
Building a New Platform for Customer Analytics
Building a New Platform for Customer Analytics Building a New Platform for Customer Analytics
Building a New Platform for Customer Analytics
 
Big Data, NoSQL, NewSQL & The Future of Data Management
Big Data, NoSQL, NewSQL & The Future of Data ManagementBig Data, NoSQL, NewSQL & The Future of Data Management
Big Data, NoSQL, NewSQL & The Future of Data Management
 
BAR360 open data platform presentation at DAMA, Sydney
BAR360 open data platform presentation at DAMA, SydneyBAR360 open data platform presentation at DAMA, Sydney
BAR360 open data platform presentation at DAMA, Sydney
 
Data Science Overview
Data Science OverviewData Science Overview
Data Science Overview
 
Bi overview
Bi overviewBi overview
Bi overview
 
The New Frontier: Optimizing Big Data Exploration
The New Frontier: Optimizing Big Data ExplorationThe New Frontier: Optimizing Big Data Exploration
The New Frontier: Optimizing Big Data Exploration
 

Mehr von Caserta

Using Machine Learning & Spark to Power Data-Driven Marketing
Using Machine Learning & Spark to Power Data-Driven MarketingUsing Machine Learning & Spark to Power Data-Driven Marketing
Using Machine Learning & Spark to Power Data-Driven MarketingCaserta
 
Data Intelligence: How the Amalgamation of Data, Science, and Technology is C...
Data Intelligence: How the Amalgamation of Data, Science, and Technology is C...Data Intelligence: How the Amalgamation of Data, Science, and Technology is C...
Data Intelligence: How the Amalgamation of Data, Science, and Technology is C...Caserta
 
Creating a DevOps Practice for Analytics -- Strata Data, September 28, 2017
Creating a DevOps Practice for Analytics -- Strata Data, September 28, 2017Creating a DevOps Practice for Analytics -- Strata Data, September 28, 2017
Creating a DevOps Practice for Analytics -- Strata Data, September 28, 2017Caserta
 
General Data Protection Regulation - BDW Meetup, October 11th, 2017
General Data Protection Regulation - BDW Meetup, October 11th, 2017General Data Protection Regulation - BDW Meetup, October 11th, 2017
General Data Protection Regulation - BDW Meetup, October 11th, 2017Caserta
 
Integrating the CDO Role Into Your Organization; Managing the Disruption (MIT...
Integrating the CDO Role Into Your Organization; Managing the Disruption (MIT...Integrating the CDO Role Into Your Organization; Managing the Disruption (MIT...
Integrating the CDO Role Into Your Organization; Managing the Disruption (MIT...Caserta
 
Introduction to Data Science (Data Summit, 2017)
Introduction to Data Science (Data Summit, 2017)Introduction to Data Science (Data Summit, 2017)
Introduction to Data Science (Data Summit, 2017)Caserta
 
Looker Data Modeling in the Age of Cloud - BDW Meetup May 2, 2017
Looker Data Modeling in the Age of Cloud - BDW Meetup May 2, 2017Looker Data Modeling in the Age of Cloud - BDW Meetup May 2, 2017
Looker Data Modeling in the Age of Cloud - BDW Meetup May 2, 2017Caserta
 
The Rise of the CDO in Today's Enterprise
The Rise of the CDO in Today's EnterpriseThe Rise of the CDO in Today's Enterprise
The Rise of the CDO in Today's EnterpriseCaserta
 
Building New Data Ecosystem for Customer Analytics, Strata + Hadoop World, 2016
Building New Data Ecosystem for Customer Analytics, Strata + Hadoop World, 2016Building New Data Ecosystem for Customer Analytics, Strata + Hadoop World, 2016
Building New Data Ecosystem for Customer Analytics, Strata + Hadoop World, 2016Caserta
 
Big Data Analytics on the Cloud
Big Data Analytics on the CloudBig Data Analytics on the Cloud
Big Data Analytics on the CloudCaserta
 
Not Your Father's Database by Databricks
Not Your Father's Database by DatabricksNot Your Father's Database by Databricks
Not Your Father's Database by DatabricksCaserta
 
Mastering Customer Data on Apache Spark
Mastering Customer Data on Apache SparkMastering Customer Data on Apache Spark
Mastering Customer Data on Apache SparkCaserta
 
Moving Past Infrastructure Limitations
Moving Past Infrastructure LimitationsMoving Past Infrastructure Limitations
Moving Past Infrastructure LimitationsCaserta
 
Introducing Kudu, Big Data Warehousing Meetup
Introducing Kudu, Big Data Warehousing MeetupIntroducing Kudu, Big Data Warehousing Meetup
Introducing Kudu, Big Data Warehousing MeetupCaserta
 
Real Time Big Data Processing on AWS
Real Time Big Data Processing on AWSReal Time Big Data Processing on AWS
Real Time Big Data Processing on AWSCaserta
 
Big Data Warehousing Meetup: Dimensional Modeling Still Matters!!!
Big Data Warehousing Meetup: Dimensional Modeling Still Matters!!!Big Data Warehousing Meetup: Dimensional Modeling Still Matters!!!
Big Data Warehousing Meetup: Dimensional Modeling Still Matters!!!Caserta
 

Mehr von Caserta (16)

Using Machine Learning & Spark to Power Data-Driven Marketing
Using Machine Learning & Spark to Power Data-Driven MarketingUsing Machine Learning & Spark to Power Data-Driven Marketing
Using Machine Learning & Spark to Power Data-Driven Marketing
 
Data Intelligence: How the Amalgamation of Data, Science, and Technology is C...
Data Intelligence: How the Amalgamation of Data, Science, and Technology is C...Data Intelligence: How the Amalgamation of Data, Science, and Technology is C...
Data Intelligence: How the Amalgamation of Data, Science, and Technology is C...
 
Creating a DevOps Practice for Analytics -- Strata Data, September 28, 2017
Creating a DevOps Practice for Analytics -- Strata Data, September 28, 2017Creating a DevOps Practice for Analytics -- Strata Data, September 28, 2017
Creating a DevOps Practice for Analytics -- Strata Data, September 28, 2017
 
General Data Protection Regulation - BDW Meetup, October 11th, 2017
General Data Protection Regulation - BDW Meetup, October 11th, 2017General Data Protection Regulation - BDW Meetup, October 11th, 2017
General Data Protection Regulation - BDW Meetup, October 11th, 2017
 
Integrating the CDO Role Into Your Organization; Managing the Disruption (MIT...
Integrating the CDO Role Into Your Organization; Managing the Disruption (MIT...Integrating the CDO Role Into Your Organization; Managing the Disruption (MIT...
Integrating the CDO Role Into Your Organization; Managing the Disruption (MIT...
 
Introduction to Data Science (Data Summit, 2017)
Introduction to Data Science (Data Summit, 2017)Introduction to Data Science (Data Summit, 2017)
Introduction to Data Science (Data Summit, 2017)
 
Looker Data Modeling in the Age of Cloud - BDW Meetup May 2, 2017
Looker Data Modeling in the Age of Cloud - BDW Meetup May 2, 2017Looker Data Modeling in the Age of Cloud - BDW Meetup May 2, 2017
Looker Data Modeling in the Age of Cloud - BDW Meetup May 2, 2017
 
The Rise of the CDO in Today's Enterprise
The Rise of the CDO in Today's EnterpriseThe Rise of the CDO in Today's Enterprise
The Rise of the CDO in Today's Enterprise
 
Building New Data Ecosystem for Customer Analytics, Strata + Hadoop World, 2016
Building New Data Ecosystem for Customer Analytics, Strata + Hadoop World, 2016Building New Data Ecosystem for Customer Analytics, Strata + Hadoop World, 2016
Building New Data Ecosystem for Customer Analytics, Strata + Hadoop World, 2016
 
Big Data Analytics on the Cloud
Big Data Analytics on the CloudBig Data Analytics on the Cloud
Big Data Analytics on the Cloud
 
Not Your Father's Database by Databricks
Not Your Father's Database by DatabricksNot Your Father's Database by Databricks
Not Your Father's Database by Databricks
 
Mastering Customer Data on Apache Spark
Mastering Customer Data on Apache SparkMastering Customer Data on Apache Spark
Mastering Customer Data on Apache Spark
 
Moving Past Infrastructure Limitations
Moving Past Infrastructure LimitationsMoving Past Infrastructure Limitations
Moving Past Infrastructure Limitations
 
Introducing Kudu, Big Data Warehousing Meetup
Introducing Kudu, Big Data Warehousing MeetupIntroducing Kudu, Big Data Warehousing Meetup
Introducing Kudu, Big Data Warehousing Meetup
 
Real Time Big Data Processing on AWS
Real Time Big Data Processing on AWSReal Time Big Data Processing on AWS
Real Time Big Data Processing on AWS
 
Big Data Warehousing Meetup: Dimensional Modeling Still Matters!!!
Big Data Warehousing Meetup: Dimensional Modeling Still Matters!!!Big Data Warehousing Meetup: Dimensional Modeling Still Matters!!!
Big Data Warehousing Meetup: Dimensional Modeling Still Matters!!!
 

Kürzlich hochgeladen

Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...HostedbyConfluent
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountPuma Security, LLC
 
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Alan Dix
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationSafe Software
 
How to Remove Document Management Hurdles with X-Docs?
How to Remove Document Management Hurdles with X-Docs?How to Remove Document Management Hurdles with X-Docs?
How to Remove Document Management Hurdles with X-Docs?XfilesPro
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreternaman860154
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slidespraypatel2
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxKatpro Technologies
 
Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitecturePixlogix Infotech
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking MenDelhi Call girls
 
Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Paola De la Torre
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticscarlostorres15106
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Allon Mureinik
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking MenDelhi Call girls
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsMark Billinghurst
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)Gabriella Davis
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...shyamraj55
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptxHampshireHUG
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsMaria Levchenko
 

Kürzlich hochgeladen (20)

Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
How to Remove Document Management Hurdles with X-Docs?
How to Remove Document Management Hurdles with X-Docs?How to Remove Document Management Hurdles with X-Docs?
How to Remove Document Management Hurdles with X-Docs?
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slides
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
 
Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC Architecture
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR Systems
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 

Big Data: Setting Up the Big Data Lake

  • 6. @joe_Caserta Client Portfolio Retail/eCommerce & Manufacturing Digital Media/AdTech Education & Services Finance, Healthcare & Insurance
  • 8. @joe_Caserta Caserta Innovation Lab (CIL) • Internal laboratory established to test & develop solution concepts and ideas • Used to accelerate client projects • Examples: • Search (SOLR) based BI • Big Data Governance Toolkit • Text Analytics on Social Network Data • Continuous Integration / End-to-end streaming (Spark) • Recommendation Engine Optimization • Relationship Intelligence (Graph DB/Search) • Others (confidential) • CIL is hosted on
  • 9. @joe_Caserta Community New York City 3,000+ members Free Knowledge Sharing
  • 10. @joe_Caserta As a Mindful Cyborg, Chris utilizes up to 700 sensors, devices, applications, and services to track, analyze, and optimize many areas of his existence. This quantification enables him to see the connections of otherwise invisible data, resulting in dramatic upgrades to his health, productivity, and quality of life. The Future is Today
  • 11. @joe_Caserta The Progression of Data Analytics [maturity curve after Gartner reports: business value rises with data analytics sophistication] • Descriptive Analytics: What happened? • Diagnostic Analytics: Why did it happen? • Predictive Analytics: What will happen? • Prescriptive Analytics: How can we make it happen?  Correlations  Predictions  Recommendations  Cognitive Computing / Cognitive Data Analytics
  • 12. @joe_Caserta Innovation is the only sustainable competitive advantage a company can have Innovations may fail, but companies that don’t innovate will fail
  • 13. @joe_Caserta What’s New in Modern Data Engineering?
  • 14. @joe_Caserta What you need to know (according to Joe) Hadoop Distribution: Apache, Cloudera, Hortonworks, MapR, IBM  Tools:  Hive: map data to structures and use SQL-like queries  Pig: data transformation language for big data  Sqoop: extracts external sources and loads Hadoop  Storm: real-time ETL  Spark: general-purpose cluster computing framework  NoSQL:  Document: MongoDB, CouchDB  Graph: Neo4j, Titan  Key-value: Riak, Redis  Columnar: Cassandra, HBase  Search: Lucene, Solr, Elasticsearch  Languages: Python, Java, R, Scala
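The engines named on this slide (MapReduce, Spark, Pig, Hive) all generalize the same map/shuffle/reduce model. As a rough single-machine illustration only, in plain Python rather than any real Hadoop or Spark API, a word count in that style might look like:

```python
from collections import defaultdict

def map_phase(lines):
    # Map step: emit (word, 1) pairs, as a MapReduce mapper would
    for line in lines:
        for word in line.split():
            yield word.lower(), 1

def reduce_phase(pairs):
    # Shuffle + reduce step: group pairs by key and sum the counts
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

lines = ["big data lake", "data lake governance", "big data"]
counts = reduce_phase(map_phase(lines))
print(counts["data"])  # 3
```

A distributed engine runs the same two phases across many nodes, shuffling intermediate pairs by key between them; the single-process version above only shows the shape of the computation.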
  • 15. @joe_Caserta The Evolution of Modern Data Engineering [architecture diagram: source systems (Enrollments, Claims, Finance, others) feed via ETL both a traditional EDW, serving ad-hoc/canned reporting and traditional BI, and a horizontally scalable Big Data Lake optimized for analytics (Spark, MapReduce, and Pig/Hive over HDFS nodes N1–N5, plus NoSQL databases), serving ad-hoc query, canned reporting, Big Data Analytics, and Data Science]
  • 16. @joe_Caserta How We’ve Built Data Warehouses • Design – Top Down / Bottom Up • Customer interviews and requirements gathering • Data profiling • Extract, Transform, Load data from source to data warehouse • Create facts and dimensions • Put a BI tool on top • Develop reports • Data Governance
  • 17. @joe_Caserta The Traditional Conversation • Kimball vs. Inmon • Dimensional vs. 3rd Normal Form • What hardware do we need (that will be ready in 6 months)? • Oracle vs. SQL Server, Postgres or MySQL if we were brave (and cheap) • Which ETL tool should we BUY?  Informatica, DataStage? • Which BI tool should we sit on top?  Business Objects, Cognos?
  • 18. @joe_Caserta The New Conversation • Do we need a Data Warehouse at all? • If we do, does it need to be relational? • Should we leverage Hadoop or NoSQL? • Which platform and language are we going to code in? • Which bleeding-edge Apache project should we put in production?
  • 19. @joe_Caserta Why Change? New technologies are great and all, but what drives our adoption of new technologies and techniques? • Data has changed  semi-structured, unstructured, sparse, and evolving schemas • Volumes have changed  GB to TB to PB workloads • Cracks in the armor of the traditional Data Warehousing approach! AND MOST IMPORTANTLY: Companies that innovate to leverage their data win!
  • 20. @joe_Caserta Cracks in the Data Warehouse Armor • Onboarding new data is difficult! • Data structures are rigid! • Data Governance is slow! • Disconnected from business needs: “Hey, I need to munge some new data to see if it has value.” “Wait! We have to profile, analyze, and conform the data; change data models and load it into dimensional models; build a semantic layer (that nobody is going to use); create a dashboard we hope someone will notice… and then you can have at it 3–6 months later to see if it has value!”
  • 21. @joe_Caserta Is Anyone Surprised? DWs have a 70% FAILURE RATE • Semi-scientific analysis has shown the majority of data analytics projects fail… • And of those that don’t fail, only a fraction are deemed a “success”; the others just finish! • Data is just REALLY hard, especially without the right strategy What do we think the Data Governance failure rate is?
  • 22. @joe_Caserta Is Traditional Warehousing All Wrong? NO! The concept of a Data Warehouse is sound: • Consolidating data from disparate source systems • Clean and conformed reference data • Clean and integrated business facts • Data governance (a more pragmatic version) We can be more successful by acknowledging the EDW can’t solve all problems.
  • 23. @joe_Caserta So what’s missing? The Data Lake A storage and processing layer for all data • Store anything: source data, semi-structured, unstructured, structured • Keep it as long as needed • Support a number of processing workloads • Scale-out ..and here is where Hadoop can help us!
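As a sketch of what “store anything, keep it as long as needed” can look like on disk, here is a toy zone layout in plain Python, using the local filesystem as a stand-in for HDFS. The zone names (raw/refined/trusted) and the `land_file` helper are illustrative conventions, not a standard API:

```python
import tempfile
from pathlib import Path

# Illustrative zone names; raw -> refined -> trusted is one common convention
ZONES = ["raw", "refined", "trusted"]

def init_lake(root: Path) -> None:
    # Create one directory per zone of the lake
    for zone in ZONES:
        (root / zone).mkdir(parents=True, exist_ok=True)

def land_file(root: Path, source: str, name: str, data: bytes) -> Path:
    # New data always lands in the raw zone first, untouched,
    # partitioned by the source system it came from
    target = root / "raw" / source / name
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_bytes(data)
    return target

root = Path(tempfile.mkdtemp())
init_lake(root)
landed = land_file(root, "claims", "2015-09-17.csv", b"id,amount\n1,9.99\n")
print(landed.read_bytes().decode().splitlines()[0])  # id,amount
```

The point of the raw zone is that source data is preserved exactly as received; downstream processing writes derived copies into the other zones rather than mutating what landed.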
  • 24. @joe_Caserta Hadoop (Typically) Powers the Data Lake Hadoop provides us: • Distributed storage  HDFS • Resource management  YARN • Many workloads, not just MapReduce
  • 25. @joe_Caserta Governing Big Data  Before Data Governance  Users trying to produce reports from raw source data  No data conformance  No Master Data Management  No data quality processes  No trust: two analysts were almost guaranteed to come up with two different sets of numbers!  Before Big Data Governance  We can put “anything” in Hadoop  We can analyze anything  We’re scientists, we don’t need IT, we make the rules  Rule #1: Dumping data into Hadoop with no repeatable process, procedure, or governance will create a mess  Rule #2: Information harvested from ungoverned systems will take us back to the old days: No Trust = Not Actionable
  • 26. @joe_Caserta Data Governance pillars: • Organization: this is the ‘people’ part; establishing an Enterprise Data Council, Data Stewards, etc. • Metadata: definitions, lineage (where does this data come from), business definitions, technical metadata • Privacy/Security: identify and control sensitive data, regulatory compliance • Data Quality and Monitoring: data must be complete and correct; measure, improve, certify • Business Process Integration: policies around data frequency, source availability, etc. • Master Data Management: ensure consistent business-critical data, i.e. Members, Providers, Agents, etc. • Information Lifecycle Management (ILM): data retention, purge schedule, storage/archiving. What Big Data adds across the framework: • Add Big Data to the overall framework and assign responsibility • Add data scientists to the stewardship program • Assign stewards to new data sets (Twitter, call center logs, etc.) • Graph databases are more flexible than relational • Lower-latency service required • Distributed data quality and matching algorithms • Data quality and monitoring (probably home-grown; Drools?) • Quality checks not only SQL: machine learning, Pig, and MapReduce • Acting on large-dataset quality checks may require distribution • Larger scale • New datatypes • Integrate with Hive Metastore, HCatalog, home-grown tables • Secure and mask multiple data types (not just tabular) • Deletes are more uncommon (unless there is a regulatory requirement) • Take advantage of compression and archiving (like AWS Glacier) • Data detection and masking on unstructured data upon ingest • Near-zero latency, DevOps, core component of business operations for Big Data
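To make the Data Quality and Monitoring pillar concrete: a minimal home-grown check, of the kind the slide suggests might be built in-house, could score each field for completeness and validity. The record layout, field names, and thresholds below are hypothetical, and a production version would be distributed via Spark, Pig, or MapReduce rather than run on one machine:

```python
# Minimal data-quality scoring sketch; field names and rules are illustrative
def check_complete(records, field):
    # Completeness: fraction of records where the field is present and non-empty
    ok = sum(1 for r in records if r.get(field) not in (None, ""))
    return ok / len(records)

def check_valid(records, field, predicate):
    # Correctness: fraction of records whose field value passes a validity rule
    ok = sum(1 for r in records if predicate(r.get(field)))
    return ok / len(records)

members = [
    {"member_id": "M1", "age": 34},
    {"member_id": "M2", "age": -4},   # fails the validity rule
    {"member_id": "",   "age": 51},   # fails the completeness rule
]
print(round(check_complete(members, "member_id"), 2))  # 0.67
age_score = check_valid(members, "age", lambda a: isinstance(a, int) and 0 <= a <= 120)
```

Scores like these can then be compared against certification thresholds per data set, which is what turns a one-off check into the “measure, improve, certify” loop the pillar describes.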
  • 27. @joe_Caserta Making it Right  The promise is an “agile” data culture where communities of users are encouraged to explore new datasets in new ways  New tools  External data  Data blending  Decentralization  With all the V’s, data scientists, new tools, and new data, we must rely LESS on HUMANS  We need more systemic administration  We need systems and tools to help with big data governance  This space is EXTREMELY immature!  Steps towards Data Governance for the Data Lake 1. Establish the difference between traditional data and big data governance 2. Establish basic rules for where new data governance can be applied 3. Establish processes for graduating the products of data science to governance 4. Establish a set of tools to make governing Big Data feasible
  • 28. @joe_Caserta Data Governance for the Data Lake  BDG provides vision, oversight and accountability for leveraging corporate information assets to create competitive advantage, and accelerate the vision of integrated delivery.  Scope: Process, Architecture, Communication, Organization, IFP Governance Administration, Compliance Reporting, Standards, Value Proposition, Risk/Reward, Information Accountabilities, Stewardship, Data Integrity Metrics, Control Mechanisms, Principles and Standards, Information Usability, Definitions  Roles:  Enterprise Data Council: executive oversight; prioritizes work; drives change; accountable for results  Governance Committees  Data Stewards: build capabilities; does the work; responsible for adherence  Project Teams: value creation; acts on requirements
  • 29. @joe_Caserta Data Lake Governance Realities  Full data governance can only be applied to “Structured” data  The data must have a known and well documented schema  This can include materialized endpoints such as files or tables OR projections such as a Hive table  Governed structured data must have:  A known schema with Metadata  A known and certified lineage  A monitored, quality-tested, managed process for ingestion and transformation  A governed usage  Data isn’t just for enterprise BI tools anymore  We talk about unstructured data in Hadoop, but more often it’s semi-structured/structured with a definable schema.  Even in the case of unstructured data, structure must be extracted/applied in just about every case imaginable before analysis can be performed.
  • 30. @joe_Caserta Modern Data Quality Priorities Be Corrective Be Fast Be Transparent Be Thorough
  • 31. @joe_Caserta Data Quality Priorities  [Chart: Speed to Value (fast  slow) vs. Data Quality (raw  refined)]
  • 32. @joe_Caserta The Data Scientists Can Help!  Data Science to Big Data Warehouse mapping  Full Data Governance Requirements  Provide full process lineage  Data certification process by data stewards and business owners  Ongoing Data Quality monitoring that includes Quality Checks  Provide requirements for Data Lake  Proper metadata established:  Catalog  Data Definitions  Lineage  Quality monitoring  Know and validate data completeness
  • 33. @joe_Caserta The Big Data Pyramid  Data has different governance demands at each tier  Only the top tier of the pyramid is fully governed  We refer to this as the Trusted tier of the Big Data Warehouse
  • Landing Area (Source Data in “Full Fidelity”): raw machine data collection, collect everything. Governance: Metadata  Catalog; ILM  who has access, how long do we “manage it”
  • Data Lake (Integrated Sandbox): data is ready to be turned into information: organized, well defined, complete. Governance: Metadata; ILM; Data Quality and Monitoring  monitoring of completeness of data
  • Data Science Workspace: agile business insight through data-munging, machine learning, blending with external data, development of to-be BDW facts. Governance: Metadata; ILM; Data Quality and Monitoring
  • Big Data Warehouse: Fully Data Governed (trusted). Usage pattern: user community arbitrary queries and reporting
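The per-tier governance demands of the pyramid can be captured as metadata on each dataset. A sketch of what a minimal catalog entry might track (the field names and tier labels are illustrative, not a real catalog API):

```python
from dataclasses import dataclass, field

# Illustrative catalog entry reflecting the pyramid's tiers and the
# governance attributes each tier demands. All names are assumptions.

@dataclass
class CatalogEntry:
    name: str
    tier: str                      # "landing" | "lake" | "science" | "bdw"
    schema: dict = field(default_factory=dict)   # known schema with metadata
    lineage: list = field(default_factory=list)  # certified lineage steps
    quality_checked: bool = False  # required from the Data Lake tier up
    fully_governed: bool = False   # true only for the trusted BDW tier

# A freshly landed dataset: cataloged, but not yet quality-checked or governed.
trades = CatalogEntry(name="trades", tier="landing")
```

Even the landing area gets an entry: per the pyramid, catalog and ILM apply at every tier, while quality monitoring and full governance are layered on as data moves up.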
  • 34. @joe_Caserta Peeling back the layers… The Landing Area •Source data in its full fidelity •Programmatically loaded •Partitioned for data processing •No governance other than catalog and ILM (Security and Retention) •Consumers: Data Scientists, ETL Processes, Applications
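"Partitioned for data processing" typically means laying files out under predictable partition keys so downstream jobs can prune by ingest date. A small sketch of such a path scheme (the layout is an illustrative assumption, not from the talk):

```python
from datetime import date

# Illustrative: build a date-partitioned landing-area path so batch jobs
# can select a single source and ingest day without scanning everything.

def landing_path(source: str, ingest_date: date, filename: str) -> str:
    d = ingest_date
    return (f"/landing/{source}/year={d.year}"
            f"/month={d.month:02d}/day={d.day:02d}/{filename}")

path = landing_path("trades", date(2015, 9, 17), "fix_messages.log")
```

The `key=value` directory convention shown here is the one Hive-style engines recognize for partition pruning, which keeps landing data queryable by projection without moving it.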
  • 35. @joe_Caserta Data Lake •Enriched, lightly integrated •Data is accessible in the Hive Metastore • Either processed into tabular relations • Or via Hive SerDes directly upon the raw data •Partitioned for data access •Governance additionally includes a guarantee of completeness •Consumers: Data Scientists, ETL Processes, Applications, Data Analysts
  • 36. @joe_Caserta A Note On Unstructured Data • Structure must be extracted/applied in just about every case imaginable before analysis can be performed. • Full data governance can only be applied to “Structured” data • This can include materialized endpoints such as files or tables OR projections such as a Hive table • Governed structured data must have: • A known schema with Metadata • A known and certified lineage • A monitored, quality-tested, managed process for ingestion and transformation
  • 37. @joe_Caserta Data Science Workspace •No barrier for onboarding and analysis of new data •Blending of new data with entire Data Lake, including the Big Data Warehouse •Data Scientists enrich data with insight •Consumers: Data Scientists (cool cats) only!
  • 38. @joe_Caserta Big Data Warehouse •Data is Fully Governed •Data is Structured •Partitioned/tuned for data access •Governance includes a guarantee of completeness and accuracy •Consumers: Data Scientists, ETL Processes, Applications, Data Analysts, and Business Users (the masses)
  • 39. @joe_Caserta The Refinery BDW Data Science Workspace Data Lake Landing Area Cool new data New Insights •The feedback loop between Data Science and Data Warehouse is critical •Successful work products of science must Graduate into the appropriate layers of the Data Lake
  • 40. @joe_Caserta Big Data Warehouse Technology? “Polyglot Persistence - where any decent sized enterprise will have a variety of different data storage technologies for different kinds of data. There will still be large amounts of it managed in relational stores, but increasingly we'll be first asking how we want to manipulate the data and only then figuring out what technology is the best bet for it…” - Martin Fowler (http://martinfowler.com) Abridged Version: Use the right tool for the job!
  • 41. @joe_Caserta Polyglot Warehouse We promote the concept that the Big Data Warehouse may live in one or more platforms •Full Hadoop Solutions •Hadoop plus MPP or Relational Supplemental technologies: •NoSQL: Columnar, Key value, Timeseries, Graph •Search Technologies
  • 42. @joe_Caserta Hadoop is the Data Warehouse? •Hadoop can be the platform for the entire data pyramid, including landing, the data lake and the Big Data Warehouse •Especially serves as the Data Lake and “Refinery” •Query engines such as Hive and Impala provide SQL support
  • 43. @joe_Caserta More Typical: Hadoop + Relational •Hadoop is the platform for the Data Lake and Refinery •The Active Set is federated out into MPP or Relational platforms  Presentation Layer •Serves as a good model when there is an existing MPP or Relational Data Warehouse in place
  • 44. @joe_Caserta On the Cloud AWS and other cloud providers present a very powerful design pattern: •S3 serves as the storage layer for the Data Lake •EMR (Elastic MapReduce, hosted Hadoop) provides the Refinery; most clusters can be ephemeral •The Active Set is stored in Redshift or other MPP/Relational platforms •Eliminates a massive on-premise appliance footprint
  • 45. @joe_Caserta Data Warehousing is not Dead! • The principles of Data Warehousing still make sense • Recognize gaps in feature/functionality of the Relational Database and traditional Data Warehousing • Believe in the Data Lake and accept Tunable Governance • Think Polyglot Warehouse and use the right tool for the job
  • 46. @joe_Caserta What skills are needed? Modern Data Engineering/Data Preparation Domain Knowledge/Business Expertise Advanced Mathematics/ Statistics
  • 47. @joe_Caserta What about the tools I have? People, Processes and Business commitment is still critical! Caution: Some Assembly Required The V’s require robust tooling: Some of the most hopeful tools are brand new or in incubation! Enterprise big data implementations typically combine products with some custom built components
  • 48. @joe_Caserta Use Cases • Real-Time Trade Data Analytics • Comply with Dodd-Frank • Electronic Medical Record Analytics • Save lives?
  • 49. @joe_Caserta High Volume Trade Data Project • The equity trading arm of a large US bank needed to scale its infrastructure to process/parse trade data in real time and calculate aggregations/statistics: ~1.4 Million messages/second, ~12 Billion messages/day, ~240 Billion/month • The solution needed to map the raw data to a data model in memory or with low latency (for real-time), while persisting mapped data to disk (for end-of-day reporting). • The proposed solution also needed to handle ad-hoc requests for data analytics.
  • 50. @joe_Caserta The Data • Primarily FIX messages: Financial Information eXchange • Established in the early ’90s as a standard for trade data communication; widely used throughout the industry • Basically a delimited list of variable attribute-value pairs • Looks something like this: 8=FIX.4.2 | 9=178 | 35=8 | 49=PHLX | 56=PERS | 52=20071123-05:30:00.000 | 11=ATOMNOCCC9990900 | 20=3 | 150=E | 39=E | 55=MSFT | 167=CS | 54=1 | 38=15 | 40=2 | 44=15 | 58=PHLX EQUITY TESTING | 59=0 | 47=C | 32=0 | 31=0 | 151=15 | 14=0 | 6=0 | 10=128 | • A single trade can be comprised of hundreds of such messages, although typical trades have about a dozen
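Because each message is just tag=value pairs, parsing one into a lookup structure is straightforward. A minimal sketch mirroring the pipe-delimited sample above (real FIX wire format uses the SOH character, \x01, as the delimiter and a data dictionary to name each numeric tag):

```python
# Parse a FIX-style message into a dict of tag -> value.
# This mirrors the pipe-delimited example on the slide; production FIX
# uses \x01 (SOH) as the delimiter.

def parse_fix(message: str, delimiter: str = "|") -> dict:
    fields = {}
    for part in message.split(delimiter):
        part = part.strip()
        if "=" in part:
            tag, _, value = part.partition("=")  # split on first '=' only
            fields[tag.strip()] = value.strip()
    return fields

msg = "8=FIX.4.2 | 35=8 | 55=MSFT | 38=15 | 44=15 | 10=128 |"
fields = parse_fix(msg)  # e.g. tag 55 is Symbol, tag 38 is OrderQty
```

At ~1.4 million messages/second, this per-message work is what the streaming tier (Storm, below) has to distribute across many workers.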
  • 51. @joe_Caserta Data Quality Rules Engine Storm Cluster Trade Data d3.js Real-time Analytics Hadoop Cluster Low Latency Analytics Atomic data Aggregates Event Monitors • The Kafka messaging system is used for ingestion • Storm is used for real-time ETL and outputs atomic data and derived data needed for analytics • Redis is used as a reference data lookup cache • Real time analytics are produced from the aggregated data. • Higher latency ad-hoc analytics are done in Hadoop using Pig and Hive Kafka High Volume Real-time Analytics Solution Architecture
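The "atomic data and derived data" a Storm bolt emits usually comes from running aggregates kept per key. A pure-Python sketch of the kind of per-symbol state such a bolt might maintain (the class and field names are illustrative assumptions, not the project's actual topology):

```python
from collections import defaultdict

# Sketch of a per-symbol running aggregate, as a streaming bolt might
# keep in memory before emitting snapshots downstream. Names are illustrative.

class TradeAggregator:
    def __init__(self):
        self.count = defaultdict(int)   # trades seen per symbol
        self.volume = defaultdict(int)  # shares traded per symbol

    def on_trade(self, symbol: str, qty: int) -> None:
        """Update running aggregates as each parsed trade arrives."""
        self.count[symbol] += 1
        self.volume[symbol] += qty

    def snapshot(self, symbol: str) -> dict:
        """Current aggregate view for one symbol (fed to real-time analytics)."""
        return {"trades": self.count[symbol], "shares": self.volume[symbol]}

agg = TradeAggregator()
agg.on_trade("MSFT", 100)
agg.on_trade("MSFT", 50)
```

In the architecture above, these in-memory aggregates drive the low-latency d3.js analytics, while the atomic records flow on to Hadoop for the higher-latency ad-hoc work in Pig and Hive.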
  • 52. @joe_Caserta Electronic Medical Records (EMR) Analytics  [Diagram: Edge Node (~100k files, variants 1..n)  HDFS Put / Forqlift  Sequence Files in the Hadoop Data Lake  Pig EMR Processor (UDF Library, Python Wrapper)  Provider, Member and 15 more entities as Parquet tables  Sqoop  Netezza DW (Provider table, Member table, more dimensions and facts)] • Receive Electronic Medical Records from various providers in various formats • Address the Hadoop ‘small file’ problem • No barrier for onboarding and analysis of new data • Blend new data with the Data Lake and Big Data Warehouse • Machine Learning • Text Analytics • Natural Language Processing • Reporting • Ad-hoc queries • File ingestion • Information Lifecycle Mgmt
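The 'small file' problem above arises because HDFS tracks every file in NameNode memory and schedules at least one task per file, so ~100k tiny records are far cheaper to process packed into a few large containers. The project used Forqlift to build Hadoop SequenceFiles; this pure-Python sketch shows only the key-to-bytes packing idea, not the real SequenceFile format:

```python
# Illustrative sketch of packing many small files into one blob with an
# index, the same idea a SequenceFile realizes (filename as key, bytes
# as value). Not the actual Hadoop file format.

def pack(files: dict) -> tuple:
    """files: {filename: bytes}. Returns (blob, index of name -> (offset, length))."""
    blob, index, offset = bytearray(), {}, 0
    for name, data in sorted(files.items()):
        index[name] = (offset, len(data))
        blob.extend(data)
        offset += len(data)
    return bytes(blob), index

def unpack(blob: bytes, index: dict, name: str) -> bytes:
    """Random access into the packed blob via the index."""
    offset, length = index[name]
    return blob[offset:offset + length]

blob, index = pack({"rec1.xml": b"<a/>", "rec2.xml": b"<b/>"})
```

One large blob plus an index replaces thousands of NameNode entries, and a Pig or MapReduce job then reads it with a handful of long sequential scans instead of one task per tiny file.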
  • 53. @joe_Caserta Some Thoughts – Enable the Future  Big Data requires the convergence of data governance, advanced data engineering, data science and business smarts  Make sure your data can be trusted and people can be held accountable for impact caused by low data quality. It takes a village to achieve all the tasks required for effective big data strategy & execution  Get experts that have done it before! Achieve the impossible….. … everything is impossible until someone does it!
  • 54. @joe_Caserta Workshops: www.casertaconcepts.com/training Sept 21-22 (2 days), Agile Data Warehousing taught by Lawrence Corr Sept 23-24 (2 days), ETL Architecture and Design taught by Joe Caserta (Big Data module added) SAVE $300 by using discount code: DAMANYC Agile DW & ETL Training in NYC, 2015 New York Executive Conference Center 1601 Broadway @48th St. New York, NY 10019
  • 56. @joe_Caserta Thank You / Q&A Joe Caserta President, Caserta Concepts joe@casertaconcepts.com (914) 261-3648 @joe_Caserta

Editor's notes

  1. Reports  correlations  predictions  recommendations
  2. Last 2 years have been more exciting than previous 27
  3. We focused our attention on building a single version of the truth. We mainly applied data governance to the EDW itself and a few primary supporting systems, like MDM. We had a fairly restrictive set of tools for using the EDW data  Enterprise BI tools  It was easier to GOVERN how the data would be used.
  4. Volume, Variety, Veracity and Velocity
  5. Spark would make this easier and could leverage same DQ code