Achieving Data Governance in Big Data
Joe Caserta
June, 2014
Top 20 Big Data Consulting by CIO Review
Joe Caserta Timeline
• 1986 – Began consulting: database programming and data modeling
• 1996 – Dedicated to Data Warehousing and Business Intelligence
• 2001 – Founded Caserta Concepts in NYC
• 2004 – Co-author, with Ralph Kimball, of The Data Warehouse ETL Toolkit (Wiley)
• 2009 – Web log analytics solution published in Intelligent Enterprise
• 2010 – Launched Big Data practice; laser focus on extending Data Warehouses with Big Data solutions
• 2012 – Formalized alliances/partnerships with system integrators; partnered with Big Data vendors (Cloudera, Hortonworks, Datameer, more)
• 2013 – Launched Big Data Warehousing Meetup in NYC (850+ members); launched Training practice, teaching data concepts world-wide
• 2014 – Dedicated to Data Governance techniques on Big Data (innovation); established best practices for big data ecosystem implementation in Healthcare, Finance, and Insurance
• 25+ years of hands-on experience building database solutions
About Caserta Concepts
• Technology services company with expertise in data analysis:
• Big Data Solutions
• Data Warehousing
• Business Intelligence
• Core focus in the following industries:
• eCommerce / Retail / Marketing
• Financial Services / Insurance
• Healthcare / Higher Education
• Established in 2001:
• Increased growth year-over-year
• Industry-recognized workforce
• Strategy, Implementation
• Writing, Education, Mentoring
• Data Science & Analytics
• Data on the Cloud
• Data Interaction & Visualization
Client Portfolio
• Finance, Healthcare & Insurance
• Retail/eCommerce & Manufacturing
• Education & Services
Expertise & Offerings
• Strategic Roadmap / Assessment / Consulting / Implementation
• Data Warehousing / ETL / Data Integration
• BI / Visualization / Analytics
• Big Data Analytics
Why Big Data?
[Architecture diagram] A traditional EDW takes Enrollments, Claims, Finance, and other sources through ETL into ad-hoc query and canned reporting (traditional BI). Alongside it sits a horizontally scalable environment optimized for analytics: a Big Data cluster running the Hadoop Distributed File System (HDFS) across nodes N1–N5, with MapReduce, Mahout, and Pig/Hive on top, plus NoSQL databases, ETL, ad-hoc/canned reporting, and Big Data analytics.
The Challenges With Big Data (the V's, surrounding Data Science)
• Velocity – Data is coming in so fast, how do we monitor it? Real real-time analytics.
• Veracity – What does “complete” mean? Dealing with sparse, incomplete, volatile, and highly manufactured data; how do you certify sentiment analysis?
• Variety – A wider breadth of datasets and sources in scope requires broader data governance support; data governance cannot start at the warehouse.
• Volume – Data volume is higher, so the process must rely more on programmatic administration, with less dependence on people and process.
Why is Big Data Governance Important?
 Convergence of
 Data quality
 Management and policies
 All data in an organization
 Set of processes
 Ensures important data assets are formally managed throughout the
enterprise.
 Ensures data can be trusted
 People are held accountable for low data quality
It is about putting people and technology in place to fix and
prevent issues with data so that the enterprise can become
more efficient.
The Components of Data Governance
• Organization – the ‘people’ part: establishing an Enterprise Data Council, Data Stewards, etc.
• Metadata – definitions, lineage (where does this data come from), business definitions, technical metadata
• Privacy/Security – identify and control sensitive data; regulatory compliance
• Data Quality and Monitoring – data must be complete and correct: measure, improve, certify
• Business Process Integration – policies around data frequency, source availability, etc.
• Master Data Management – ensure consistent business-critical data, i.e. Members, Providers, Agents, etc.
• Information Lifecycle Management (ILM) – data retention, purge schedule, storage/archiving
What’s Old is New Again
 Before Data Warehousing Data Governance
 Users trying to produce reports from raw source data
 No Data Conformance
 No Master Data Management
 No Data Quality processes
 No Trust: Two analysts were almost guaranteed to come up
with two different sets of numbers!
 Before Big Data Governance
 We can put “anything” in Hadoop
 We can analyze anything
 We’re scientists, we don’t need IT, we make the rules
 Rule #1: Dumping data into Hadoop with no repeatable process, procedure, or
data governance will create a mess
 Rule #2: Information harvested from an ungoverned system will take us back to
the old days: No Trust = Not Actionable
Making it Right
 The promise is an “agile” data culture where communities of users are encouraged
to explore new datasets in new ways
 New tools
 External data
 Data blending
 Decentralization
 With all the V’s, data scientists, new tools, and new data, we must rely LESS on HUMANS
 We need more systemic administration
 We need systems, tools to help with big data governance
 This space is EXTREMELY immature!
 Steps towards Big Data Governance
1. Establish difference between traditional data and big data governance
2. Establish basic rules for where new data governance can be applied
3. Establish processes for graduating the products of data science to
governance
4. Establish a set of tools to make governing Big Data feasible
• Org and Process – add Big Data to the overall framework and assign responsibility; add data scientists to the Stewardship program; assign stewards to new data sets (Twitter, call center logs, etc.)
• Master Data Management – graph databases are more flexible than relational; a lower-latency service is required; distributed data quality and matching algorithms
• Data Quality and Monitoring – probably home-grown (Drools?); quality checks not only in SQL: machine learning, Pig, and MapReduce; acting on large-dataset quality checks may require distribution
• Metadata – larger scale; new datatypes; integrate with the Hive Metastore, HCatalog, and home-grown tables
• Information Lifecycle – secure and mask multiple data types (not just tabular); deletes are more uncommon (unless there is a regulatory requirement); take advantage of compression and archiving (like AWS Glacier)
Preventing a Data Swamp with Governance
The Big Data Governance Pyramid (bottom to top):
1. Landing Area – source data in “full fidelity”: raw machine data collection; collect everything.
   Governance: Metadata  Catalog; ILM  who has access, how long do we “manage” it.
2. Data Lake – integrated sandbox.
   Governance: Metadata  Catalog; ILM  who has access, how long do we “manage” it; Data Quality and Monitoring  monitoring of completeness of data.
3. Data Science Workspace – agile business insight through data-munging, machine learning, blending with external data, development of to-be BDW facts.
   Governance: Metadata  Catalog; ILM  who has access, how long do we “manage” it; Data Quality and Monitoring  monitoring of completeness of data.
4. Big Data Warehouse – data is ready to be turned into information: organized, well defined, complete. Fully data governed (trusted); the user community runs arbitrary queries and reporting.
 Hadoop has different governance demands at each tier.
 Only the top tier of the pyramid is fully governed.
 We refer to this as the Trusted tier of the Big Data Warehouse.
Big Data Governance Realities
 Full data governance can only be applied to “Structured” data
 The data must have a known and well documented schema
 This can include materialized endpoints such as files or tables OR
projections such as a Hive table
 Governed structured data must have:
 A known schema with Metadata
 A known and certified lineage
 A monitored, quality-tested, managed process for ingestion and
transformation
 A governed usage  data isn’t just for enterprise BI tools anymore
 We talk about unstructured data in Hadoop, but more often it’s
semi-structured/structured with a definable schema.
 Even in the case of truly unstructured data, structure must be
extracted/applied in just about every case imaginable before analysis
can be performed.
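As a concrete illustration of applying structure before analysis, here is a minimal sketch; the log format and every field name are invented for this example:

```python
import re

# Hypothetical call-center log line, used only for illustration:
# "2014-06-05 14:22:03 | agent=jsmith | wait_secs=84 | sentiment=negative"
LOG_PATTERN = re.compile(
    r"(?P<date>\d{4}-\d{2}-\d{2}) (?P<time>\d{2}:\d{2}:\d{2}) \| "
    r"agent=(?P<agent>\w+) \| wait_secs=(?P<wait_secs>\d+) \| "
    r"sentiment=(?P<sentiment>\w+)"
)

def extract_record(line):
    """Apply a schema to a raw line; return None if it doesn't conform."""
    m = LOG_PATTERN.match(line)
    if m is None:
        # Unparseable lines are a data quality event, not analysis input.
        return None
    rec = m.groupdict()
    rec["wait_secs"] = int(rec["wait_secs"])  # type the numeric field
    return rec
```

Only once lines pass through a step like this do they have the “definable schema” that governance (metadata, quality checks) can attach to.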
The Data Scientists Can Help!
 Data Science to Big Data Warehouse mapping
 Full Data Governance Requirements
 Provide full process lineage
 Data certification process by data stewards and business owners
 Ongoing Data Quality monitoring that includes Quality Checks
 Provide requirements for Data Lake
 Proper metadata established:
 Catalog
 Data Definitions
 Lineage
 Quality monitoring
 Know and validate data
completeness
People, Processes and Business commitment is still critical!
 Apache Falcon (incubating) promises many of the
features we need, but it is fairly immature (version 0.3).
Recommendation: Roll your own custom lifecycle management
workflow using Oozie + retention metadata
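The retention-metadata half of that recommendation can be sketched as a pure function that an Oozie-scheduled purge job might call; the path layout and retention values are illustrative assumptions:

```python
from datetime import date, timedelta

def expired_partitions(partitions, retention_days, today):
    """Given {hdfs_path: partition_date}, return paths past their retention window.

    A daily Oozie coordinator could feed each returned path to a purge action
    (e.g. `hdfs dfs -rm -r -skipTrash`) or an archive step; retention_days would
    come from the retention-metadata store the recommendation describes.
    """
    cutoff = today - timedelta(days=retention_days)
    return sorted(path for path, d in partitions.items() if d < cutoff)
```

Keeping the expiry decision separate from the delete action makes the policy testable on its own, independent of the cluster.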
The Non-Data Part of Big Data
Caution: Some Assembly Required
The V’s require robust tooling:
 Unfortunately the toolset is pretty
thin: Some of the most hopeful tools
are brand new or in incubation!
 Components like ILM have fair
tooling, others like MDM and Data
Quality are sparse
Master Data Management
 Traditional MDM may do, depending on your data size and requirements:
 Relational is awkward: extreme normalization, poor usability and performance
 NoSQL stores like HBase have benefits:
 If you need super-high-performance, low-millisecond response times to
incorporate into your Big Data ETL
 Flexible schema
 A graph database is a near-perfect fit: relationships and graph analysis bring
master data to life!
 Data quality and matching processes are required
 Little to no community or vendor support
 More will come with YARN (more commercial and open-source IP
will be leverageable in the Hadoop framework)
Recommendation: Buy + Enhance, or Build.
The Reality of Mastering Data
Pipeline: Staging Library → Standardization → Matching → Consolidated Library → Survivorship → Integrated Library

Staging Library (raw source records):
| Source | ID  | Name          | Home Address    | Birth Date | SSN         |
| SYS A  | 123 | Jim Stagnitto | 123 Main St     | 8/20/1959  | 123-45-6789 |
| SYS B  | ABC | J. Stagnitto  | 132 Main Street | 8/20/1959  | 123-45-6789 |
| SYS C  | XYZ | James Stag    | NULL            | 8/20/1959  | NULL        |

Consolidated Library (after standardization, validation, and matching):
| Source | ID  | Name          | Home Address    | Birth Date | SSN         | Std Name        | Std Addr        | MDM ID |
| SYS A  | 123 | Jim Stagnitto | 123 Main St     | 8/20/1959  | 123-45-6789 | James Stagnitto | 123 Main Street | 1      |
| SYS B  | ABC | J. Stagnitto  | 132 Main Street | 8/20/1959  | 123-45-6789 | James Stagnitto | 132 Main Street | 1      |
| SYS C  | XYZ | James Stag    | NULL            | 8/20/1959  | NULL        | James Stag      | NULL            | 1      |

Integrated Library (after survivorship):
| MDM ID | Name            | Home Address    | Birth Date | SSN         |
| 1      | James Stagnitto | 123 Main Street | 8/20/1959  | 123-45-6789 |
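The survivorship step in the tables above can be sketched as follows; the “longest non-NULL value wins” rule is a deliberately simplistic stand-in for real survivorship logic:

```python
def survive(matched_records, fields=("Std Name", "Std Addr", "Birth Date", "SSN")):
    """Build one golden record from records already matched to one MDM ID.

    Illustrative rule only: per attribute, take the longest non-NULL value
    (ties go to the earliest source). Real survivorship would also weigh
    source trust, recency, and verification status.
    """
    golden = {}
    for f in fields:
        # Treat the literal string 'NULL' (as in the staging table) as missing.
        candidates = [r[f] for r in matched_records if r.get(f) and r[f] != "NULL"]
        golden[f] = max(candidates, key=len) if candidates else "NULL"
    return golden
```

Run over the three consolidated records, this yields a golden record like the integrated-library row shown above.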
Graph Databases (NoSQL) to the Rescue
 Hierarchical relationships are never
rigid
 Relational models with tables and
columns are not flexible enough
 Neo4j is the leading graph database
 Many MDM systems are going graph:
 Pitney Bowes - Spectrum MDM
 Reltio - Worry-Free Data for Life Sciences.
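To illustrate why graph structures suit master data, here is a toy in-memory labeled graph; a real implementation would use Neo4j or a commercial MDM product, and every node and relationship name below is made up:

```python
from collections import defaultdict

class MasterGraph:
    """Toy labeled-edge graph; nodes are (type, id) tuples."""

    def __init__(self):
        self._edges = defaultdict(list)

    def relate(self, src, rel, dst):
        self._edges[src].append((rel, dst))

    def neighbors(self, node, rel):
        return [dst for r, dst in self._edges[node] if r == rel]

# Link a mastered member to its source records and to a provider.
# Adding a new relationship type requires no schema change, which is
# the flexibility rigid relational models lack.
g = MasterGraph()
g.relate(("member", "MDM-1"), "SOURCED_FROM", ("record", "SYS A/123"))
g.relate(("member", "MDM-1"), "SOURCED_FROM", ("record", "SYS B/ABC"))
g.relate(("member", "MDM-1"), "HAS_PROVIDER", ("provider", "P-77"))
```

Traversals over such relationships (households, referral networks) are what "bring master data to life."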
Big Data Security
 Determining Who Sees What:
 Need to be able to secure as many data types as possible
 Auto-discovery important!
 Current products:
 Sentry – SQL security semantics to Hive
 Knox – Central authentication mechanism to Hadoop
 Cloudera Navigator – Central security auditing
 Hadoop – Good old *NIX permissions with LDAP
 Dataguise – Auto-discovery, masking, encryption
 Datameer – The BI Tool for Hadoop
Recommendation: Assemble based on existing tools
Metadata
• For now, Hive Metastore + HCatalog + custom might be best
• HCatalog gives great “abstraction” services:
• Maps to a relational schema
• Developers don’t need to worry about data formats and storage
• Can use SuperLuminate to get started
Recommendation: Leverage HCatalog + custom metadata tables
The Twitter Way
 Twitter was suffering from a data science wild west.
 Developed their own enterprise Data Access Layer (DAL)
 They gave developers and data scientists a reason to use it:
• Easy-to-use storage handlers
• Automatic partitioning
• Schema backwards compatibility
• Monitoring and dependency checks
Data Quality and Monitoring
 To TRUST your information a robust set of tools for continuous
monitoring is needed
 Accuracy and completeness of data must be ensured.
 Any piece of information in the Big Data Warehouse must have
monitoring:
 Basic Stats: source to target counts
 Error Events: did we trap any errors during processing
 Business Checks: is the metric “within expectations”? How
does it compare with an abridged alternate calculation?
There is a large gap in commercial and open-source project offerings here.
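A minimal sketch of how the check families above might be expressed programmatically; the function name, event shapes, and the 5% tolerance are assumptions, not any vendor's API:

```python
def run_quality_checks(source_count, target_count, metric, expected, tolerance=0.05):
    """Return a list of DQ error events; an empty list means the load can be certified.

    Covers two of the families named above: basic stats (source-to-target counts)
    and business checks (metric within a tolerance of an alternate calculation).
    """
    events = []
    if source_count != target_count:
        events.append(("basic_stats", f"count mismatch: {source_count} -> {target_count}"))
    if expected and abs(metric - expected) / expected > tolerance:
        events.append(("business_check", f"metric {metric} outside {tolerance:.0%} of {expected}"))
    return events
```

In a Big Data Warehouse these events would land in an error-event fact table so that trust can be demonstrated, not just asserted.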
Data Quality and Monitoring Recommendation
• BUILD a robust data quality subsystem, based on The Data Warehouse ETL Toolkit:
• A DQ engine that reads DQ metadata, builds quality checks, and runs them via Hive, Pig, or MapReduce
• A DQ notifier and logger, writing DQ events and time-series facts
• HBase for metadata and error event facts
• Oozie for orchestration
Closing Thoughts – Enable the Future
 Big Data requires the
convergence of data quality, data
management, data engineering
and business policies.
 Make sure your data can be
trusted and people can be held
accountable for impact caused by
low data quality.
 Get experts to help calm the
turbulence… it can be exhausting!
 Blaze new trails!
Polyglot Persistence – “where any decent
sized enterprise will have a variety of different
data storage technologies for different kinds of
data. There will still be large amounts of it
managed in relational stores, but increasingly
we'll be first asking how we want to manipulate
the data and only then figuring out what
technology is the best bet for it.”
-- Martin Fowler
Thank You
Joe Caserta
President, Caserta Concepts
joe@casertaconcepts.com
(914) 261-3648
@joe_Caserta
© 2014 Dataguise Inc. All rights reserved.
Protect Your Big Data:
5 Steps You Need To Take
Jeremy Stieglitz, VP Product Management
Achieving Data Governance
in Big Data
… How To Bust Those Elephants Free
Executive Summary
• “Big Data” has become priority #1
for large enterprises in 2014:
Faster time to insights, touching $$$
Large Enterprise Driven
Realtime, Automation and On Demand
• Key challenge is to fully leverage
Big Data without exposing or risking
sensitive information
» Intelligent data discovery and adaptive
and automated protection
» Dashboard visibility into all security
actions and operators
» “Transparent” to the business analysts
Business Challenge: Data Growth
• 100% growth and 80% unstructured data by 2015
…finding and classifying sensitive data will get
harder
Real-world unstructured data scenarios
Voice-to-text files in Hadoop
for customer service optimization
Patient and doctor medical data
in PDFs, X-Rays, doctor’s notes
Web comment fields and customer
surveys
Log data from wellheads and
oil drilling sensors
Web e-Commerce
Pay System
The Importance of Automation
• Data will grow 7500%, while Enterprise IT for Big Data will grow only 150%
On-Demand Hadoop.
• Without adequate sensitive
data protection, customers
left to “Penalty Boxing”
Hadoop with “Security zones”
imposed by InfoSec
» Slows business, costly and
cumbersome
• Automated sensitive data
protection can set those
assets free and be ready for
real-time Hadoop 2.0
Data Protection
In Hadoop
Security in Hadoop
• Like the Internet before it, Hadoop was
designed without built-in security
» Traditional (infrastructure) tools don’t address the
distributed, sharing requirements of Hadoop
» Ad hoc (arbitrary code) computing environment
» Existing authorization mechanisms (Kerberos, ACLs) don’t
address compliance and insider-threat issues for
sensitive data
Hadoop Security Framework
• The 4 approaches to address security within Hadoop: Perimeter, Data, Access, Visibility
• Dataguise discovers & protects at the data layer and provides visibility for audit reporting and data lineage
• Perimeter – guarding access to the cluster itself. Technical concepts: authentication, network isolation
• Data – protecting data in the cluster from unauthorized visibility. Technical concepts: encryption, tokenization, data masking
• Access – defining what users and applications can do with data. Technical concepts: permissions, authorization
• Visibility – reporting on where data came from and how it’s being used. Technical concepts: auditing, lineage
Elements of Data Centric Protection
1. Identify which elements you want to protect via:
» Delimiters (structured data), name-value pairs (semi-structured), or a data discovery service (unstructured)
2. Automated protection options – automatically apply protection via:
» Format-preserving encryption (FPE)
» Masking (replace, randomize, intellimask, static)
» Redaction (nullify)
3. Audit strategy:
» Sensitive data protection/access/lineage
Discovery
• Within HDFS
» Search for sensitive data per company policy – PII, PCI,…
» Handle complex data types such as addresses
» Process incrementally (default) to handle only the new content
• In-flight
» Support processing data on the fly as they are ingested into Hadoop HDFS
» Plug-in solutions for FTP and Flume
» Search for sensitive data per company policy – PII, PCI, HIPAA…
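A toy illustration of pattern-based discovery within free text; the two regex policies are simplistic stand-ins, and real discovery products handle many more types, formats, and context rules:

```python
import re

# Illustrative policy patterns only (real PII/PCI discovery is far richer):
POLICY = {
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "CREDIT_CARD": re.compile(r"\b(?:\d{4}[ -]?){3}\d{4}\b"),
}

def discover(text):
    """Return {policy_name: [matches]} for sensitive elements found in text."""
    hits = {}
    for name, pattern in POLICY.items():
        found = pattern.findall(text)
        if found:
            hits[name] = found
    return hits
```

Running such a scan incrementally over only newly ingested content is what makes discovery tractable at HDFS scale.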
Protection Measures
• A protection plan should start with cutting:
» What data can we delete/cut?
» What data can be redacted?
» Masking choices
• Consistency
• Realistic looking data
• Partial reveal (Intellimask)
Credit Card # 4541 **** **** 3241
• What data needs reversibility
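The partial-reveal (Intellimask-style) treatment shown for the credit card number can be sketched as below; the masking rule here is an assumption for illustration, not Dataguise's actual algorithm:

```python
def partial_mask(value, keep_head=4, keep_tail=4, mask_char="*"):
    """Mask digits in the middle of a value, revealing head and tail.

    Illustrative rule: digits in the middle become mask_char,
    separators (spaces, dashes) are kept for realistic-looking output.
    """
    middle = value[keep_head:len(value) - keep_tail]
    masked = "".join(mask_char if ch.isdigit() else ch for ch in middle)
    return value[:keep_head] + masked + value[len(value) - keep_tail:]
```

Because the output keeps the original format, downstream jobs that expect a card-number-shaped field keep working on the masked data.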
Encryption “vs” Masking
• Encryption:
+ Reversible
+ Trusted, with security proofs
+ Format-preserving and partial reveals
+ Scale-out and distributed
+ The first hammer
+ De-centralized architectures
- Complex
- Key management
- Useless without robust authentication and authorization
- Data value destruction
• Masking:
+ Highest security
+ Realistic data
+ Range- and value-preserving
+ Format-preserving and partial reveals
+ Scale-out and distributed
+ No performance impact on usage
+ Zero need for authentication, authorization, or key management
- Not as well marketed
- Not reversible
- Perceived to grow data
So Which Do I Use?
The answer is both! Mask values that can
be analyzed with a substitute value (such
as Names and SSNs), and encrypt values
that ultimately require their original value
(such as IP address or purchase amount).
Audit Strategy
• Essential to all goals: Compliance, breach
protection, visibility and metrics
• Avoids the “gotcha” moment
» Show all sensitive elements (count, location)
» Remediation applied
» Dashboard for fast access to critical policies, with drill-downs
for file and user action
Putting it All Together: Sensitive Data Protection in Hadoop
• Policy Management – define Hadoop file and directory scans; define data elements; define protection rules; define audits/reports/alerts
• Discover – in-flight and within HDFS; full vs. incremental; structured vs. unstructured
• Remediation – domain-based; masking; encryption (record, field, FPE)
• Reporting – job level (sensitive elements, directories & files, remediation applied); dashboard (by directory or by policy, with drill-down); audit report (user actions); notifications
• Administration – user management; masking domains; file structures; notifications; monitor system status
Dataguise: Market Leader in
Big Data Protective Intelligence (BDPI)
• Only solution with Hadoop data discovery
• Best-in-class data protection with simplicity, scalability, and automation
• Business-friendly to operators and business analysts
Joe Caserta
President, Caserta Concepts
joe@casertaconcepts.com
(914) 261-3648
@joe_Caserta
Jeremy Stieglitz
VP Prod. Mgmt.
jeremy@dataguise.com
(510) 896-3755
@BigDataProtect

The Data Lake - Balancing Data Governance and Innovation The Data Lake - Balancing Data Governance and Innovation
The Data Lake - Balancing Data Governance and Innovation
 
Making Big Data Easy for Everyone
Making Big Data Easy for EveryoneMaking Big Data Easy for Everyone
Making Big Data Easy for Everyone
 
Benefits of the Azure Cloud
Benefits of the Azure CloudBenefits of the Azure Cloud
Benefits of the Azure Cloud
 
Big Data Analytics on the Cloud
Big Data Analytics on the CloudBig Data Analytics on the Cloud
Big Data Analytics on the Cloud
 
Intro to Data Science on Hadoop
Intro to Data Science on HadoopIntro to Data Science on Hadoop
Intro to Data Science on Hadoop
 
The Emerging Role of the Data Lake
The Emerging Role of the Data LakeThe Emerging Role of the Data Lake
The Emerging Role of the Data Lake
 
Not Your Father's Database by Databricks
Not Your Father's Database by DatabricksNot Your Father's Database by Databricks
Not Your Father's Database by Databricks
 
Mastering Customer Data on Apache Spark
Mastering Customer Data on Apache SparkMastering Customer Data on Apache Spark
Mastering Customer Data on Apache Spark
 
Moving Past Infrastructure Limitations
Moving Past Infrastructure LimitationsMoving Past Infrastructure Limitations
Moving Past Infrastructure Limitations
 

Recently uploaded

"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek SchlawackFwdays
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfAddepto
 
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxLoriGlavin3
 
SALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICESSALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICESmohitsingh558521
 
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxLoriGlavin3
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsPixlogix Infotech
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxLoriGlavin3
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfLoriGlavin3
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity PlanDatabarracks
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Mark Simos
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brandgvaughan
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxLoriGlavin3
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr BaganFwdays
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfAlex Barbosa Coqueiro
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 3652toLead Limited
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxLoriGlavin3
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLScyllaDB
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupFlorian Wilhelm
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024Lonnie McRorey
 

Recently uploaded (20)

"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdf
 
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
 
SALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICESSALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICES
 
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and Cons
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdf
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity Plan
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brand
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan
 
DMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special EditionDMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special Edition
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdf
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQL
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project Setup
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024
 

Big Data Governance & Compliance: Protecting Confidential Data in Hadoop

  • 1. Achieving Data Governance in Big Data Joe Caserta June, 2014
  • 2. Joe Caserta Timeline (Top 20 Big Data Consulting by CIO Review) – milestones from 1986 to 2014: began consulting database programming and data modeling; 25+ years hands-on experience building database solutions; dedicated to Data Warehousing and Business Intelligence since 1996; founded Caserta Concepts in NYC; co-author, with Ralph Kimball, of The Data Warehouse ETL Toolkit (Wiley); web log analytics solution published in Intelligent Enterprise; formalized alliances/partnerships with system integrators; partnered with Big Data vendors Cloudera, HortonWorks, Datameer, and more; launched Big Data practice; launched Training practice, teaching data concepts world-wide; laser focus on extending Data Warehouses with Big Data solutions; 2013: launched Big Data Warehousing Meetup in NYC (850+ members); 2014: dedicated to Data Governance techniques on Big Data (innovation); established best practices for big data ecosystem implementation in Healthcare, Finance, and Insurance
  • 3. About Caserta Concepts • Technology services company with expertise in data analysis: • Big Data Solutions • Data Warehousing • Business Intelligence • Core focus in the following industries: • eCommerce / Retail / Marketing • Financial Services / Insurance • Healthcare / Higher Education • Established in 2001: • Increased growth year-over-year • Industry recognized work force • Strategy, Implementation • Writing, Education, Mentoring • Data Science & Analytics • Data on the Cloud • Data Interaction & Visualization
  • 4. Client Portfolio Finance, Healthcare & Insurance Retail/eCommerce & Manufacturing Education & Services
  • 5. Expertise & Offerings Strategic Roadmap/ Assessment/Consulting/ Implementation Data Warehousing/ ETL/Data Integration BI/Visualization/ Analytics Big Data Analytics
  • 6. Why Big Data? Architecture diagram: sources such as enrollments, claims, and finance feed both a traditional EDW (ETL, ad-hoc query, canned reporting, traditional BI) and a horizontally scalable big data cluster optimized for analytics: the Hadoop Distributed File System (HDFS) across nodes N1-N5 running MapReduce, Pig/Hive, and Mahout, plus NoSQL databases and others, supporting big data analytics and data science
  • 7. The Challenges With Big Data (Volume, Variety, Velocity, Veracity) • Data is coming in so fast, how do we monitor it? • Truly real-time analytics • What does “complete” mean? • Dealing with sparse, incomplete, volatile, and highly manufactured data; how do you certify sentiment analysis? • A wider breadth of datasets and sources in scope requires larger data governance support • Data governance cannot start at the warehouse • Data volume is higher, so the process must be more reliant on programmatic administration, with less people/process dependence
  • 8. Why is Big Data Governance Important?  Convergence of  Data quality  Management and policies  All data in an organization  A set of processes that  Ensures important data assets are formally managed throughout the enterprise  Ensures data can be trusted  Holds people accountable for low data quality It is about putting people and technology in place to fix and prevent issues with data so that the enterprise can become more efficient.
  • 9. The Components of Data Governance • Organization: the ‘people’ part; establishing an Enterprise Data Council, Data Stewards, etc. • Metadata: definitions, lineage (where does this data come from), business definitions, technical metadata • Privacy/Security: identify and control sensitive data, regulatory compliance • Data Quality and Monitoring: data must be complete and correct; measure, improve, certify • Business Process Integration: policies around data frequency, source availability, etc. • Master Data Management: ensure consistent business-critical data, i.e. Members, Providers, Agents, etc. • Information Lifecycle Management (ILM): data retention, purge schedule, storage/archiving
  • 10. What’s Old is New Again  Before Data Warehousing Data Governance  Users trying to produce reports from raw source data  No Data Conformance  No Master Data Management  No Data Quality processes  No Trust: two analysts were almost guaranteed to come up with two different sets of numbers!  Before Big Data Governance  We can put “anything” in Hadoop  We can analyze anything  We’re scientists, we don’t need IT, we make the rules  Rule #1: Dumping data into Hadoop with no repeatable process, procedure, or data governance will create a mess  Rule #2: Information harvested from an ungoverned system will take us back to the old days: No Trust = Not Actionable
  • 11. Making it Right  The promise is an “agile” data culture where communities of users are encouraged to explore new datasets in new ways  New tools  External data  Data blending  Decentralization  With all the V’s, data scientists, new tools, and new data, we must rely LESS on HUMANS  We need more systemic administration  We need systems and tools to help with big data governance  This space is EXTREMELY immature!  Steps towards Big Data Governance 1. Establish the difference between traditional data and big data governance 2. Establish basic rules for where new data governance can be applied 3. Establish processes for graduating the products of data science to governance 4. Establish a set of tools to make governing Big Data feasible
  • 12. Preventing a Data Swamp with Governance • Org and Process: add Big Data to the overall framework and assign responsibility; add data scientists to the Stewardship program; assign stewards to new data sets (Twitter, call center logs, etc.) • Master Data Management: graph databases are more flexible than relational; lower latency service required; distributed data quality and matching algorithms • Data Quality and Monitoring: probably home grown (Drools?); quality checks not only SQL: machine learning, Pig and MapReduce; acting on large dataset quality checks may require distribution • Metadata: larger scale; new datatypes; integrate with Hive Metastore, HCatalog, home grown tables • Information Lifecycle: secure and mask multiple data types (not just tabular); deletes are more uncommon (unless there is a regulatory requirement); take advantage of compression and archiving (like AWS Glacier)
  • 13. The Big Data Governance Pyramid  Hadoop has different governance demands at each tier; only the top tier of the pyramid is fully governed, which we refer to as the Trusted tier of the Big Data Warehouse. 1. Landing Area – source data in “full fidelity”: raw machine data collection, collect everything. Governance: metadata catalog; ILM (who has access, how long do we “manage it”). 2. Data Lake – integrated sandbox. Governance: metadata catalog; ILM; data quality and monitoring (monitoring of completeness of data). 3. Data Science Workspace – agile business insight through data-munging, machine learning, blending with external data, development of to-be BDW facts. 4. Big Data Warehouse – data is ready to be turned into information: organized, well defined, complete; fully data governed (trusted); user community arbitrary queries and reporting. Governance: metadata catalog; ILM; data quality and monitoring.
  • 14. Big Data Governance Realities  Full data governance can only be applied to “Structured” data  The data must have a known and well documented schema  This can include materialized endpoints such as files or tables OR projections such as a Hive table  Governed structured data must have:  A known schema with Metadata  A known and certified lineage  A monitored, quality-tested, managed process for ingestion and transformation  A governed usage  Data isn’t just for enterprise BI tools anymore  We talk about unstructured data in Hadoop, but more often it’s semi-structured/structured with a definable schema.  Even in the case of unstructured data, structure must be extracted/applied in just about every case imaginable before analysis can be performed.
  • 15. The Data Scientists Can Help!  Data Science to Big Data Warehouse mapping  Full Data Governance Requirements  Provide full process lineage  Data certification process by data stewards and business owners  Ongoing Data Quality monitoring that includes Quality Checks  Provide requirements for Data Lake  Proper metadata established:  Catalog  Data Definitions  Lineage  Quality monitoring  Know and validate data completeness
  • 16. The Non-Data Part of Big Data – Caution: Some Assembly Required  People, processes, and business commitment are still critical!  The V’s require robust tooling; unfortunately the toolset is pretty thin: some of the most hopeful tools are brand new or in incubation!  Components like ILM have fair tooling; others like MDM and Data Quality are sparse  Apache Falcon (incubating) promises many of the features we need, however it is fairly immature (version 0.3). Recommendation: roll your own custom lifecycle management workflow using Oozie + retention metadata
  • 17. Master Data Management  Traditional MDM will do, depending on your data size and requirements:  Relational is awkward: extreme normalization, poor usability and performance  NoSQL stores like HBase have benefits  If you need super high performance (low millisecond response times) to incorporate into your Big Data ETL  Flexible schema  A graph database is a near perfect fit: relationships and graph analysis bring master data to life!  Data quality and matching processes are required  Little to no community or vendor support  More will come with YARN (more commercial and open source IP will be leverageable in the Hadoop framework)  Recommendation: Buy + Enhance, or Build.
  • 18. Mastering Data Validation – Staging Library → (Standardization, Matching) → Consolidated Library → (Survivorship) → Integrated Library

Staging Library:
Source | ID  | Name          | Home Address    | Birth Date | SSN
SYS A  | 123 | Jim Stagnitto | 123 Main St     | 8/20/1959  | 123-45-6789
SYS B  | ABC | J. Stagnitto  | 132 Main Street | 8/20/1959  | 123-45-6789
SYS C  | XYZ | James Stag    | NULL            | 8/20/1959  | NULL

Consolidated Library (after standardization and matching):
Source | ID  | Name          | Home Address    | Birth Date | SSN         | Std Name        | Std Addr        | MDM ID
SYS A  | 123 | Jim Stagnitto | 123 Main St     | 8/20/1959  | 123-45-6789 | James Stagnitto | 123 Main Street | 1
SYS B  | ABC | J. Stagnitto  | 132 Main Street | 8/20/1959  | 123-45-6789 | James Stagnitto | 132 Main Street | 1
SYS C  | XYZ | James Stag    | NULL            | 8/20/1959  | NULL        | James Stag      | NULL            | 1

Integrated Library (after survivorship):
MDM ID | Name            | Home Address    | Birth Date | SSN
1      | James Stagnitto | 123 Main Street | 8/20/1959  | 123-45-6789
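The standardize → match → survive flow in the example above can be sketched end to end. This is a minimal illustration only; the nickname table, the address normalization rule, the match keys (SSN, else first name + birth date), and the longest-non-null survivorship rule are assumptions made for the sketch, not a real MDM engine:

```python
# Toy MDM pipeline over the slide's three-record example.
NICKNAMES = {"Jim": "James", "J.": "James"}

def standardize(rec):
    """Expand nicknames and normalize 'St' -> 'Street' (illustrative rules)."""
    rec = dict(rec)
    first, *rest = rec["name"].split()
    rec["std_name"] = " ".join([NICKNAMES.get(first, first)] + rest)
    addr = rec["addr"]
    if addr and addr.endswith(" St"):
        addr = addr + "reet"          # "123 Main St" -> "123 Main Street"
    rec["std_addr"] = addr
    return rec

def assign_ids(records):
    """Matching: link records that share an SSN, or a (first name, DOB) pair."""
    key_to_id, next_id = {}, 1
    for rec in records:
        keys = [k for k in (rec["ssn"], (rec["std_name"].split()[0], rec["dob"])) if k]
        found = next((key_to_id[k] for k in keys if k in key_to_id), None)
        rec["mdm_id"] = found or next_id
        if found is None:
            next_id += 1
        for k in keys:
            key_to_id[k] = rec["mdm_id"]
    return records

def survive(group):
    """Survivorship: per field, keep the longest non-null value."""
    golden = {}
    for field in ("std_name", "std_addr", "dob", "ssn"):
        values = [r[field] for r in group if r[field]]
        golden[field] = max(values, key=len) if values else None
    return golden

staging = [
    {"source": "SYS A", "name": "Jim Stagnitto", "addr": "123 Main St",
     "dob": "8/20/1959", "ssn": "123-45-6789"},
    {"source": "SYS B", "name": "J. Stagnitto", "addr": "132 Main Street",
     "dob": "8/20/1959", "ssn": "123-45-6789"},
    {"source": "SYS C", "name": "James Stag", "addr": None,
     "dob": "8/20/1959", "ssn": None},
]

consolidated = assign_ids([standardize(r) for r in staging])
golden = survive([r for r in consolidated if r["mdm_id"] == 1])
```

All three records land on MDM ID 1 (SYS C joins via name + birth date despite its NULL SSN), and the golden record recovers the values shown in the Integrated Library.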
  • 19. The Reality of Mastering Data
  • 20. Graph Databases (NoSQL) to the Rescue  Hierarchical relationships are never rigid  Relational models with tables and columns are not flexible enough  Neo4j is the leading graph database  Many MDM systems are going graph:  Pitney Bowes – Spectrum MDM  Reltio – Worry-Free Data for Life Sciences
  • 21. Big Data Security  Determining Who Sees What:  Need to be able to secure as many data types as possible  Auto-discovery important!  Current products:  Sentry – SQL security semantics for Hive  Knox – Central authentication mechanism for Hadoop  Cloudera Navigator – Central security auditing  Hadoop – Good old *NIX permissions with LDAP  Dataguise – Auto-discovery, masking, encryption  Datameer – The BI Tool for Hadoop Recommendation: Assemble based on existing tools
  • 22. Metadata • For now, Hive Metastore, HCatalog + custom might be best • HCatalog gives great “abstraction” services • Maps to a relational schema • Developers don’t need to worry about data formats and storage • Can use SuperLuminate to get started • Recommendation: Leverage HCatalog + custom metadata tables
  • 23. The Twitter Way  Twitter was suffering from a data science wild west.  Developed their own enterprise Data Access Layer (DAL)  They gave developers and data scientists a reason to use it: • Easy to use storage handlers • Automatic partitioning • Schema backwards compatibility • Monitoring and dependency checks
  • 24. Data Quality and Monitoring  To TRUST your information, a robust set of tools for continuous monitoring is needed  Accuracy and completeness of data must be ensured  Any piece of information in the Big Data Warehouse must have monitoring:  Basic stats: source-to-target counts  Error events: did we trap any errors during processing?  Business checks: is the metric “within expectations”? How does it compare with an abridged alternate calculation?  There is a large gap in commercial and open source project offerings
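The three monitoring categories above can be expressed as simple programmatic gates. A minimal sketch, assuming illustrative thresholds and an invented alternate calculation; the function names and result shape are ours, not from any DQ product:

```python
def count_check(source_rows, target_rows):
    """Basic stat: did every source row land in the target?"""
    return {"check": "row_count", "passed": source_rows == target_rows,
            "source": source_rows, "target": target_rows}

def error_event_check(error_events, max_allowed=0):
    """Error events: did we trap any errors during processing?"""
    return {"check": "error_events", "passed": len(error_events) <= max_allowed,
            "events": list(error_events)}

def business_check(metric, alternate, tolerance=0.05):
    """Business check: is the metric within expectations of an
    abridged alternate calculation? Tolerance here is an assumption."""
    drift = abs(metric - alternate) / alternate
    return {"check": "business_rule", "passed": drift <= tolerance, "drift": drift}

results = [
    count_check(1_000_000, 1_000_000),
    error_event_check([]),
    business_check(metric=98_700.0, alternate=100_000.0),  # 1.3% drift
]
trusted = all(r["passed"] for r in results)  # gate before the data is "trusted"
```

A load that fails any gate would be held out of the Trusted tier and routed to remediation rather than published.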
  • 25. Data Quality and Monitoring Recommendation • BUILD a robust data quality subsystem: • HBase for metadata and error event facts • Oozie for orchestration • Based on The Data Warehouse ETL Toolkit • DQ engine components: DQ metadata; Quality Check Builder; checks running in Hive, Pig, and MapReduce; DQ Notifier and Logger; DQ Events and Timeseries Facts
  • 26. Closing Thoughts – Enable the Future  Big Data requires the convergence of data quality, data management, data engineering and business policies.  Make sure your data can be trusted and people can be held accountable for impact caused by low data quality.  Get experts to help calm the turbulence… it can be exhausting!  Blaze new trails! Polyglot Persistence – “where any decent sized enterprise will have a variety of different data storage technologies for different kinds of data. There will still be large amounts of it managed in relational stores, but increasingly we'll be first asking how we want to manipulate the data and only then figuring out what technology is the best bet for it.” -- Martin Fowler
  • 27. Thank You Joe Caserta President, Caserta Concepts joe@casertaconcepts.com (914) 261-3648 @joe_Caserta
  • 28. © 2014 Dataguise Inc. All rights reserved. Protect Your Big Data: 5 Steps You Need To Take Jeremy Stieglitz, VP Product Management Achieving Data Governance in Big Data … How To Bust Those Elephants Free
  • 29. Executive Summary • “Big Data” has become priority #1 for large enterprises in 2014: faster time to insights, touching $$$; large-enterprise driven; realtime, automation and on demand • Key challenge is to fully leverage Big Data without exposing or risking sensitive information » Intelligent data discovery and adaptive, automated protection » Dashboard visibility into all security actions and operators » “Transparent” to the business analysts
  • 30. Business Challenge: Data Growth • 100% growth and 80% unstructured data by 2015 … finding and classifying sensitive data will get harder (chart: data growth in exabytes)
  • 31. Real-world unstructured data scenarios: voice-to-text files in Hadoop for customer service optimization; patient and doctor medical data in PDFs, X-rays, doctor’s notes; web comment fields and customer surveys; log data from wellheads and oil drilling sensors; web e-commerce pay systems
  • 32. The Importance of Automation • Data will grow 7500% • Enterprise IT for Big Data will grow 150%
  • 33. On-Demand Hadoop • Without adequate sensitive data protection, customers are left to “penalty boxing” Hadoop with “security zones” imposed by InfoSec » Slows business, costly and cumbersome • Automated sensitive data protection can set those assets free and be ready for real-time Hadoop 2.0
  • 34. Data Protection In Hadoop
  • 35. Security in Hadoop • Like the Internet before it, Hadoop was designed without built-in security » Traditional (infrastructure) tools don’t address the distributed, sharing requirements of Hadoop » Ad hoc (arbitrary code) computing environment » Existing authorization (Kerberos, ACLs) doesn’t address compliance and insider threat issues for sensitive data
  • 36. Hadoop Security Framework • The 4 approaches to address security within Hadoop: • Perimeter: guarding access to the cluster itself (technical concepts: authentication, network isolation) • Data: protecting data in the cluster from unauthorized visibility (technical concepts: encryption, tokenization, data masking) • Access: defining what users and applications can do with data (technical concepts: permissions, authorization) • Visibility: reporting on where data came from and how it’s being used (technical concepts: auditing, lineage) • Dataguise discovers & protects at the data layer and provides visibility for audit reporting and data lineage
  • 37. Elements of Data Centric Protection • 1. Identify which elements you want to protect via: » delimiters (structured data), name-value pairs (semi-structured), or a data discovery service (unstructured) • 2. Automated protection options; automatically apply protection via: » format-preserving encryption (FPE) » masking (replace, randomize, Intellimask, static) » redaction (nullify) • 3. Audit strategy » sensitive data protection/access/lineage
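Steps 1–3 can be sketched for a single tabular record. The column-to-rule mapping, the mask formats, and the audit shape are illustrative assumptions; FPE is omitted here since it needs a real cryptography library:

```python
# Toy data-centric protection: identify fields by name, apply a rule,
# and record what was remediated (the audit trail of step 3).

def mask_ssn(v):
    """Masking with partial reveal: keep only the last 4 digits."""
    return "***-**-" + v[-4:]

def redact(v):
    """Redaction: nullify the value entirely."""
    return None

# Step 1 (identify): which columns carry sensitive data, and which
# protection applies. These field names are hypothetical.
RULES = {"ssn": mask_ssn, "notes": redact}

def protect(record, rules=RULES):
    out = dict(record)
    audit = []                         # step 3: remediation applied
    for field, rule in rules.items():
        if out.get(field) is not None:
            out[field] = rule(out[field])
            audit.append((field, rule.__name__))
    return out, audit

row = {"name": "Jim", "ssn": "123-45-6789", "notes": "called about claim"}
safe, audit = protect(row)
```

Non-sensitive fields pass through untouched, while the audit list gives the "who/what was remediated" record the audit strategy calls for.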
  • 38. Discovery • Within HDFS » Search for sensitive data per company policy – PII, PCI, … » Handle complex data types such as addresses » Process incrementally (default) to handle only the new content • In-flight » Support processing data on the fly as it is ingested into Hadoop HDFS » Plug-in solution for ftp, Flume » Search for sensitive data per company policy – PII, PCI, HIPAA…
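Policy-driven discovery with incremental processing can be sketched as follows. The regex patterns are toy stand-ins for a real PII/PCI policy, and the "already scanned" set is a simplified version of incremental state tracking:

```python
import re

# Hypothetical sensitive-data policy: pattern name -> detector.
POLICY = {
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "credit_card": re.compile(r"\b(?:\d{4}[ -]?){3}\d{4}\b"),
}

def discover(files, already_scanned):
    """Scan only paths not seen before; report hit counts per pattern."""
    findings = {}
    for path, text in files.items():
        if path in already_scanned:        # incremental: skip old content
            continue
        hits = {name: len(rx.findall(text)) for name, rx in POLICY.items()}
        findings[path] = {k: v for k, v in hits.items() if v}
        already_scanned.add(path)
    return findings

state = {"/data/old.log"}                  # persisted between runs
files = {
    "/data/old.log": "ssn 123-45-6789",
    "/data/new.log": "card 4541 1234 5678 3241 for order 17",
}
found = discover(files, state)
```

The same scan function could sit behind an in-flight hook (e.g. a Flume interceptor) or run as a batch over HDFS directories; only the source of `files` changes.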
  • 39. Protection Measures • A protection plan should start with cutting: » What data can we delete/cut? » What data can be redacted? » Masking choices: consistency; realistic-looking data; partial reveal (Intellimask), e.g. Credit Card # 4541 **** **** 3241 » What data needs reversibility?
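The Intellimask-style partial reveal and the consistency property can both be sketched; the card number below is made up, and `consistent_mask` shows one common way (a truncated hash) to get repeatable, irreversible substitutes rather than any vendor's actual algorithm:

```python
import hashlib

def intellimask(card_number):
    """Partial reveal: keep the first and last groups, mask the middle,
    matching the slide's example format: 4541 **** **** 3241."""
    groups = card_number.split()
    return " ".join([groups[0]] + ["****"] * (len(groups) - 2) + [groups[-1]])

def consistent_mask(value, length=6):
    """Consistency: the same input always yields the same substitute,
    so joins still work on masked data. Not reversible."""
    return hashlib.sha256(value.encode()).hexdigest()[:length]

masked = intellimask("4541 8301 9922 3241")   # fabricated sample number
```

Partial reveal keeps enough of the value for human recognition (last-four lookups), while consistent substitution preserves analytic value (grouping, joining) without exposing the original.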
  • 40. Encryption “vs” Masking • Encryption: + reversible + trusted, with security proofs + format-preserving and partial reveals + scale-out and distributed + the first hammer + de-centralized architectures - complex - key management - useless without robust authentication and authorization - data value destruction • Masking: + highest security + realistic data + range and value preserving + format-preserving and partial reveals + scale-out and distributed + no performance impact on usage + zero need for authentication, authorization, and key management - not as well marketed - not reversible - perceived to grow data
  • 41. Encryption “vs” Masking – So Which Do I Use? The answer is both! Mask values that can be analyzed with a substitute value (such as names and SSNs), and encrypt values that ultimately require their original value (such as IP address or purchase amount)
  • 42. © 2014 Dataguise Inc. All rights reserved. Audit Strategy
• Essential to all goals: compliance, breach protection, visibility, and metrics
• Avoids the “gotcha” moment:
» Show all sensitive elements (count, location)
» Remediation applied
» Dashboard for fast access to critical policies, with drill-downs for file and user actions
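The audit roll-up the slide describes (sensitive-element counts, their locations, and the remediation applied) can be sketched as a simple aggregation over scan findings. The file paths and findings below are hypothetical sample data, not output from any real scan.

```python
from collections import Counter

# Hypothetical scan findings: (file_path, element_type, remediation)
findings = [
    ("/data/claims/part-0001", "ssn", "masked"),
    ("/data/claims/part-0001", "credit_card", "encrypted"),
    ("/data/logs/2014-06-01", "ip_address", "encrypted"),
    ("/data/logs/2014-06-01", "ssn", "none"),
]

def audit_summary(findings):
    """Roll findings up into the counts a dashboard would drill into:
    elements by type, by file, and any unremediated exposures."""
    by_element = Counter(e for _, e, _ in findings)
    by_file = Counter(f for f, _, _ in findings)
    unremediated = [(f, e) for f, e, r in findings if r == "none"]
    return {"by_element": dict(by_element), "by_file": dict(by_file),
            "unremediated": unremediated}

print(audit_summary(findings))
```

The `unremediated` list is what avoids the "gotcha" moment: it surfaces sensitive data that slipped through before an auditor finds it.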
  • 43. © 2014 Dataguise Inc. All rights reserved. Putting it All Together: Sensitive Data Protection in Hadoop
• Policy Management: define Hadoop file and directory scans, data elements, protection rules, and audit/reports/alerts
• Discover: in-flight and within HDFS; full vs. incremental; structured vs. unstructured
• Remediation: domain-based masking; encryption – record, field, FPE
• Reporting: job level (sensitive elements, directories & files, remediation applied); dashboard (by directory or by policy, with drill-down); audit report (user actions); notifications
• Administration: user management, masking domains, file structures, notifications, monitor system status
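The Policy Management column above boils down to a declarative policy document: what to scan, which elements to find, how to remediate each, and what to report. A minimal sketch, with entirely hypothetical keys and values:

```python
# Hypothetical policy document mirroring the workflow on the slide.
policy = {
    "scan": {"paths": ["/data/claims", "/data/logs"], "mode": "incremental"},
    "elements": ["ssn", "credit_card", "ip_address"],
    "remediation": {"ssn": "mask", "credit_card": "mask", "ip_address": "encrypt"},
    "reporting": {"dashboard": True, "alert_on": ["unremediated"]},
}

def validate(policy):
    """Minimal sanity check: every element to discover must have a remediation rule."""
    return [e for e in policy["elements"] if e not in policy["remediation"]]

print(validate(policy))  # -> []
```

Driving discovery, remediation, and reporting from one document like this keeps the three stages consistent with each other.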
  • 44. © 2014 Dataguise Inc. All rights reserved. Dataguise: Market Leader in Big Data Protective Intelligence (BDPI)
• Only solution with Hadoop data discovery
• Best-in-class data protection with simplicity, scalability, and automation
• Business-friendly to operators and business analysts
  • 45. © 2014 Dataguise Inc. All rights reserved.
Joe Caserta, President, Caserta Concepts
joe@casertaconcepts.com | (914) 261-3648 | @joe_Caserta
Jeremy Stieglitz, VP Product Management
jeremy@dataguise.com | (510) 896-3755 | @BigDataProtect

Editor's Notes

  1. We focused our attention on building a single version of the truth. We mainly applied data governance to the EDW itself and a few primary supporting systems, like MDM. We had a fairly restrictive set of tools for using the EDW data – enterprise BI tools – so it was easier to GOVERN how the data would be used.
  2. “Obviously, everyone in the room is facing unprecedented DATA GROWTH. But what we are also starting to see in Big Data architectures is the collection of both structured and unstructured data side-by-side.” HIT ON: lots of growth in unstructured data now drives a much higher need for data discovery, to go find where that sensitive data lies.
  3. Go quickly through this slide. The fundamental point is that in all industries (entertainment, healthcare, oil and gas, payments and finance), Big Data architectures allow businesses to bring together lots of new, varied data they could never assemble and analyze before. Of course, those unstructured data types also force organizations to prove and audit their compliance with financial regulations and national and state privacy laws.
  4. Stage 3, aka The Penalty Box, aka the CUSTOMER PAIN POINT, is really where customers finally start thinking about SECURITY ARCHITECTURES for Big Data. Explain that some of our largest customers were literally in FAILED AUDITS because of the amount of sensitive data going into HADOOP without adequate protections.
  5. Introducing Dataguise
  6. I call this a COOKBOOK. It’s a quick guideline for how to think about protecting sensitive data, per the agenda item: double down on building a solid data protection framework to achieve security in all aspects of handling personal data.