Enterprise Data Management: A
Perspective
From the days of Data Silo, EDW to the present day of Hadoop & Data Lake
This document discusses the evolution of enterprise data management over the years, the
challenges facing today’s CTOs and chief enterprise architects, and the concept of the Data Lake as a
means to tackle those challenges. It also covers some reference architectures and a recommended
toolset in today’s context.
March, 2016
Authors:
 Selva Kumar VR
 Saurav Mukherjee
Contents
1. The Evolution of Data Management – what led to ‘Data Lake’?
1.1. Data Silo
1.2. Enterprise Data Warehouse (EDW)
1.3. Big Data
1.4. Hadoop
2. The Challenges of present CTOs
3. Data Lake
3.1. Key Components of Data Lake
3.1.1. Storage
3.1.2. Ingestion
3.1.3. Inventory & Cataloguing
3.1.4. Exploration
3.1.5. Entitlement
3.1.6. API & User Interface
4. Data Lake – Implementing the Architecture
4.1. Storage
4.2. Ingestion
4.2.1. The Challenges
4.2.2. Recommendation
4.3. Inventory, Catalogue & Explore
4.3.1. Discovery
4.3.2. Catalog & Visualization
4.4. Entitlement & Auditing
4.5. API & User Interface Access
5. Conclusion
6. Bibliography
7. Few Other Useful References
Figures
Figure 1: Data Management in Silos
Figure 2: Typical EDW Implementation
Figure 3: Typical Data Lake Implementation
Figure 4: Apache Nifi Data Flow View
Figure 5: Apache Nifi Data Provenance View
Figure 6: Nifi - The Power of Provenance
Figure 7: Apache Nifi Stats View
Tables
Table 1: Key Challenges for CTOs/Chief Enterprise Architects of today
Table 2: Data Ingestion Challenges - beyond just the tools
1. The Evolution of Data Management – what led to ‘Data Lake’?
Data management has evolved over the last 30 years around the idea of providing better and more
timely analytics to business teams. IT teams have always struggled with the business demand to serve
up everything within a minute of a new business idea emerging.
1.1. Data Silo
Initially, data management systems for analytics were created in silos. This approach helped extract
some insights from the organization’s data assets. However, the silos were restricted to individual
lines of business (LOBs) and hence were never considered comprehensive. LOBs would send data to
other LOBs as required and requested; in most cases, these were just reports (static & analytical)
pulled from application databases.
Figure 1: Data Management in Silos
1.2. Enterprise Data Warehouse (EDW)
To break away from the data silos while still giving LOBs the freedom to create their own data marts,
the idea of the Enterprise Data Warehouse (EDW) was widely adopted by industry. The concept has
been researched for a long time; a joint paper by HP Labs and Microsoft Research provides a good
overview of the concept and approach (Chaudhuri, et al., 1997). All data marts source their data from
one central version of the data, thereby maintaining data integrity and consistency at the enterprise level.
Though the EDW solved the problem of providing an enterprise-level view of data to all business teams
to a certain extent, answering questions or providing the necessary data within a minute of a new
business idea still remained a cherished but elusive dream for IT & business teams. The ‘one version
fits all’ idea also did not sit well with every group in the organization, and the culture of business
analysts downloading data from the EDW into Microsoft Excel spreadsheets or Microsoft Access and
merging it with source data continued to be widely followed.
The EDW architecture also posed numerous technical challenges. A few are listed below.
 Cost
 Licensing cost (database licenses, ETL tools etc.)
 Storage cost
 Long lead times before database schemas could be created as per standards, which in turn
were followed by long ETL development cycles
 Every post-production fix involved a long and repetitive development cycle
 Complicated designs
 Need for highly skilled labor force
Figure 2: Typical EDW Implementation
1.3. Big Data
In the meanwhile, technology pioneers like Google, Netflix, Amazon, Facebook, Twitter, advanced oil-drill
equipment manufacturers, space companies etc. injected new types of problems into the data space
around data type and volume. The structured-data mindset no longer applied: the new workloads
involved unstructured data like videos, social text streams, sensor data, data streams from IoT devices
etc. These data types can neither be accommodated in a traditional database nor managed at scale as
easily as structured data. In addition to the volume, the variety and velocity of data flow had to be
tackled together to derive business advantage, and faster than the competition. These new-generation
companies also created applications that are distributed from the ground up; new distributed file
systems, new distributed processing applications etc. were required to handle the volume and the
velocity. Papers from companies like Google (Chang, et al., 2006) (Dean, et al.,
2004) (Ghemawat, et al., 2003), Amazon (DeCandia, et al., 2007) etc. offer detailed discussions on this
topic. The dimensions of volume, variety and velocity gave birth to what came to be known as ‘Big
Data’.¹
¹ Over time, a couple more V’s – veracity & volatility – got attributed to Big Data.
1.4. Hadoop
Doug Cutting, now Chief Architect at Cloudera, adopted the distributed systems idea and created
Hadoop, inspired by and modeled on Google’s high-volume data processing systems. Hadoop is open
source and relies on clusters of commodity hardware. It solves the cost issue (licensing cost, storage
cost) and the data variety issue.
Over time, a new ecosystem grew around HDFS (the Hadoop Distributed File System). It generated new
efficiencies for data architecture by optimizing data processing workloads such as data transformation
and integration, while simultaneously lowering the cost of storage. Ideas like flexible ‘schema-on-read’
access to all enterprise data started taking shape, allowing teams to circumvent long database schema
design and ETL development cycles.
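To make the schema-on-read idea concrete, here is a minimal sketch using the Spark 1.x Java API (the HDFS path and the "country" field are hypothetical): no table is designed up front, and the structure is inferred only when the raw files are read.

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.SQLContext;

public class SchemaOnReadSketch {
  public static void main(String[] args) {
    JavaSparkContext sc =
        new JavaSparkContext(new SparkConf().setAppName("schema-on-read"));
    SQLContext sqlContext = new SQLContext(sc);

    // No upfront DDL: the schema is inferred at read time from the raw JSON
    // files sitting in the lake (hypothetical path).
    DataFrame events = sqlContext.read().json("hdfs:///lake/raw/clickstream/");
    events.printSchema();

    // Query a field that was never declared anywhere ("country" is assumed
    // to exist in the raw events, purely for illustration).
    events.filter("country = 'US'").show();

    sc.stop();
  }
}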
Though Hadoop largely solves the data storage problem, data retrieval suffers from high latency (batch
processing). The latency issue led to new ways of storing & retrieving data in the form of NoSQL
databases, e.g., Apache HBase and Apache Cassandra (inspired by Amazon (DeCandia, et al., 2007)),
and to better processing engines like Spark (Zaharia, et al., 2012) (Zaharia, et al., 2010) (Zaharia, et al.,
2012) and Flink (Apache Software Foundation, 2015). However, NoSQL databases have their own
challenges, like complicated table designs and joins that do not work as well as in a traditional RDBMS.
This landed the industry at the juncture of a good infrastructure framework: low cost open source tools
(e.g., storage tools like HDFS; NoSQL databases like MongoDB, HBase, Cassandra, Memcached etc.; data
processing tools like Spark, MapReduce, Pig, Hive, Flink, Nifi etc.; message brokering tools like Kafka
(Kreps, et al.), RabbitMQ etc.) and, of course, the existing high cost enterprise toolsets & easy-access
storage (i.e., RDBMS like Oracle, DB2, SQL Server etc.; Massively Parallel Processing (MPP) tools like
Teradata, Impala etc.; processing tools like Ab Initio, Informatica, DataStage etc.).
Along the way, the revolution called open source added significant value to the technology community.
It facilitated the creation of a lot of start-ups, encouraged new ideas and, of course, added a lot of
chaos. Each of these tools (whether low cost or high cost) is focused on solving a specific use case, and
every other month new open source products get released. For an enterprise CTO or architect, it is
really challenging to identify sustainable open source solutions that solve multiple use cases instead of
specific ones. Here came the open source bundling companies, e.g., Cloudera, Hortonworks, MapR etc.
They took ownership of identifying software that is good and sustainable, and of managing tools that go
through very frequent releases for improved versions. This solved the basic adoption problem of the
open source ecosystem in the enterprise to a good extent. The bundling companies differ in their
selection of tools and, of course, it is purely left to the enterprise’s use cases to decide which one to go for.
Once the new ecosystem (based largely on open source solutions) stabilized, the next challenge was to
adopt a suitable methodology for application development and maintenance. Adoption of the open
source ecosystem also mandated replacing some or all of the well-accepted traditional enterprise
software and tools, and such replacement entails its own share of risks.
Also, there are no widely practiced and adopted industry standards for open source based enterprise
data management solutions. Most advanced business analysts still rely on power tools like SQL,
metadata management repositories etc. to infer business insights, and Hadoop lacks the flexibility of
data extraction using SQL at comparable speed. On top of that, there have been the challenges of
dealing with regulations, preventing data from falling into the wrong hands, auditing etc.
2. The Challenges of present CTOs
The previous section discussed the evolution of data management, the multidimensional challenges it
posed, and the difficulty of identifying an adoption framework or architecture that can be standardized
and easily adopted by enterprises. CTOs and architects would be better served by having a reference
architecture or framework to minimize the risks involved; the few exceptional use cases which do not
fit well in such a framework can be handled separately.
Before delving deep into the adoption framework or architecture, here is a quick summary of the
critical challenges from an enterprise data management perspective, as an evolution from the EDW era.
# Description
1 Provide low cost storage and processing. Accommodate any data type.
2 Provide a consolidated view of enterprise data to empower business teams to pull all required
information the minute a new business idea pops up.
3 Provide a consolidated view of enterprise data and the flexibility of ad hoc reporting on any data
element in the enterprise to the business analyst.
4 Provide metadata cataloguing and a search facility for metadata.
5 Store data in original raw form to guarantee data fidelity.
6 Provide entitlement management features that take care of regulation, authorization,
authentication, encryption, data masking, auditing etc.
7 Leverage existing licensed tools for use cases / problems which open source systems cannot
solve.
8 Maintain existing good features like fast data extraction using SQL for analysis, and add new
features that significantly reduce the latency of building advanced analytical applications like
machine learning.
9 Provide data access to external & internal teams based on entitlement.
10 Provide enterprise data elements in raw form to a new category of analysts, called data
scientists.
11 Select technologies that minimize tool replacement costs and keep up with technology trends to
keep the enterprise competitive.
12 Integrate data profiling and data quality results into the metadata management framework.
Table 1: Key Challenges for CTOs/Chief Enterprise Architects of today
3. Data Lake
The ‘Data Lake’ emerged as the next key concept in the data management area, conceptualized
primarily to tackle the challenges listed in the section above - The Challenges of present CTOs. It is more
of an architectural concept and may be defined as a “repository of enterprise-wide, large quantities and
variety of data elements, both structured and unstructured, in raw form.”
This definition is based on insights from multiple data management implementations in Hadoop
environments - identifying the challenges and arriving at an architecture that solves them. However, a
repository alone will not suffice to meet the challenges in Table 1; supporting components are required
to deliver the benefits.
3.1. Key Components of Data Lake
The Data Lake architecture involves some mandatory components (listed below) for a successful
implementation.
3.1.1. Storage
 Low cost
 Store raw data from different input sources
 Support any data type
 High durability
3.1.2. Ingestion
 Facilitate both batch & streaming ingestion frameworks
 Offer low latency
3.1.3. Inventory & Cataloguing
 Discover metadata and generate tags
 Discover lineage information
 Manage tags
3.1.4. Exploration
 Browse / Search Inventory
 Inspect Data Quality
 Tag Data Quality attributes
 Auditing
3.1.5. Entitlement
 Identity & Access Management
o Authentication, Authorization, Encryption, Quotas, Data Masking
3.1.6. API & User Interface
 Expose search API
 Expose Data Lake to customers using API & SQL interface based on entitlements and access
rights
4. Data Lake – Implementing the Architecture
Figure 3: Typical Data Lake Implementation
The components shown in Figure 3 above are the minimal requirements for implementing a Data Lake.
Hadoop (HDFS) can accommodate application storage as well, and those applications can also leverage
the Data Lake’s built-in framework components like Catalogue, Data Quality, and Search & Entitlements.
4.1. Storage
The primary requirements for storage are low cost, the ability to accommodate high volumes, and high
durability. Storage should also accommodate any data type. Current technology trends suggest that
HDFS, MapR-FS and Amazon S3 suit the need; even though they have different underlying
implementations, they all adhere to Hadoop standards.
Along with storing data in distributed file systems, it is a good idea to identify suitable storage options
for the different data types, as below (a short Avro sketch follows the list).
 Unstructured data
o Store in native file format (logs, dump files, videos etc.)
o Compress with a streaming codec (LZO, Snappy)
 Semi-structured data – JSON, XML files
o Good to store in schema-aware formats, e.g., Avro. Avro allows versioning &
extensibility, like adding new fields.
 Structured data
o Flat records (CSV or some other field-separated format)
o Avro or columnar storage (Parquet)
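As an illustration of the schema evolution point above, here is a minimal sketch using the Apache Avro Java API. The "Customer" record and its fields are hypothetical; the key detail is that the newly added "email" field carries a default, so files written with the older schema remain readable.

import java.io.File;

import org.apache.avro.Schema;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;

public class AvroEvolutionSketch {
  // v2 of a hypothetical "Customer" schema: "email" is newly added, but has a
  // default, so readers using this schema can still consume v1 files.
  private static final String SCHEMA_V2 =
      "{\"type\":\"record\",\"name\":\"Customer\",\"fields\":["
      + "{\"name\":\"id\",\"type\":\"long\"},"
      + "{\"name\":\"name\",\"type\":\"string\"},"
      + "{\"name\":\"email\",\"type\":[\"null\",\"string\"],\"default\":null}]}";

  public static void main(String[] args) throws Exception {
    Schema schema = new Schema.Parser().parse(SCHEMA_V2);

    GenericRecord rec = new GenericData.Record(schema);
    rec.put("id", 42L);
    rec.put("name", "Acme Corp");
    rec.put("email", "ops@acme.example");

    try (DataFileWriter<GenericRecord> writer =
             new DataFileWriter<>(new GenericDatumWriter<GenericRecord>(schema))) {
      // Avro container files embed the writer schema, so the data stays
      // self-describing inside the lake.
      writer.create(schema, new File("customers.avro"));
      writer.append(rec);
    }
  }
}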
A storage life cycle policy can also be defined. There are open source tools like Apache Falcon (Apache
Software Foundation, 2016) that operate on pre-defined policies. A data directory structure can be
defined to segregate data based on the life cycle policy - e.g., the latest data, data up to 7 years old as
required by regulations, data older than 7 years etc. A hand-rolled sketch of such a sweep is shown below.
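In practice a policy engine like Falcon would express this declaratively; purely for illustration, here is a minimal retention sweep over a hypothetical date-partitioned layout (/lake/raw/orders/dt=YYYY-MM-DD) using the standard Hadoop FileSystem API.

import java.text.SimpleDateFormat;
import java.util.Calendar;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class RetentionSweepSketch {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());

    // Regulatory horizon assumed here: keep 7 years, drop anything older.
    Calendar cutoff = Calendar.getInstance();
    cutoff.add(Calendar.YEAR, -7);
    String cutoffDt = new SimpleDateFormat("yyyy-MM-dd").format(cutoff.getTime());

    // Hypothetical layout: one directory per day, e.g. "dt=2008-03-01".
    for (FileStatus dir : fs.listStatus(new Path("/lake/raw/orders"))) {
      String name = dir.getPath().getName();
      if (name.startsWith("dt=") && name.substring(3).compareTo(cutoffDt) < 0)
        fs.delete(dir.getPath(), true); // recursively drop the expired partition
    }
  }
}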
4.2. Ingestion
Ingestion is the first piece of the puzzle to put in place after setting up the storage. This involves setting
up an ingestion framework that handles both batch and streaming data. Looking at current trends in
data processing tools, the next generation of technologies might treat batch processing as legacy:
newer processing tools (e.g., Spark (near real-time), Flink (real-time) etc.) promote batches as a special
case of streams, as the consumer sketch below illustrates. The complexity of good stream processing
depends on the use case; O’Reilly offers an in-depth discussion of streaming - going beyond batch
(Akidau, 2015) (Akidau, 2016).
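For flavor, here is a minimal streaming-ingestion sketch using the Kafka consumer API (the broker address, group id and topic name are hypothetical). Each poll delivers a small batch of the stream, which is the batch-as-stream view mentioned above; a real pipeline would land the records in HDFS/S3 instead of printing them.

import java.util.Collections;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class StreamingIngestSketch {
  public static void main(String[] args) {
    Properties props = new Properties();
    props.put("bootstrap.servers", "broker1:9092"); // hypothetical broker
    props.put("group.id", "lake-ingest");
    props.put("key.deserializer",
        "org.apache.kafka.common.serialization.StringDeserializer");
    props.put("value.deserializer",
        "org.apache.kafka.common.serialization.StringDeserializer");

    KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
    consumer.subscribe(Collections.singletonList("clickstream")); // hypothetical topic

    while (true) {
      // Each poll returns a micro-batch of the stream.
      ConsumerRecords<String, String> records = consumer.poll(1000);
      for (ConsumerRecord<String, String> record : records)
        System.out.printf("offset=%d value=%s%n", record.offset(), record.value());
      // A real ingest pipeline would write these records to HDFS/S3 here.
    }
  }
}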
4.2.1. The Challenges
However, having advanced processing tools alone is not enough to ensure proper ingestion. The
following list summarizes a few further challenges that need to be circumvented.
# Description
1 Making use of advanced processing tools requires highly skilled resources in good numbers.
2 Traditional data processing engineers widely use GUI-based ETL tools (like Ab Initio, Informatica
etc.) that follow data flow programming techniques. For those engineers, coding applications in
open source processing tools (like Spark, Flink etc.) still takes considerable development and
testing time.
3 Due to the nature of the open source ecosystem, there will always be a new processing tool that
outruns the benefits of the current toolset and offers a business advantage over the
competition. This demands easy and quick adoptability, which may be a big challenge.
4 Data processing tools are good at processing data. However, an ingestion framework also needs
to go beyond that and solve challenges like:
 Low latency & guaranteed data delivery
 Handling back pressure
 Data provenance (tracking data all the way from the data source)
 Customizability
 Quick implementation and a better UI for the operations team
 Support for the wide variety of protocols used for sending/receiving data (e.g., SSL, SSH,
HTTPS, other encrypted content etc.)
 Loading data into a wide number of destinations (HDFS, Spark, MapR-FS, S3, RDBMS,
NoSQL etc.)
5 From an enterprise perspective, it is often desired to use the same tools across the enterprise
for any application that requires a data push/pull. However, zeroing in on ‘the one toolset’ is
always challenging.
Table 2: Data Ingestion Challenges - beyond just the tools
4.2.2. Recommendation
Based on tool evaluation research, two tools may be recommended to handle the ingestion problems
and the quick-adoptability challenge:
4.2.2.1. Apache Nifi
Apache Nifi (Apache Software Foundation, 2015) is one of the best open source data flow programming
tools, and it fits the bill for most data push/pull use cases. Just to get the uninitiated excited about it,
here are a few Nifi snapshots:
Figure 4: Apache Nifi Data Flow View
Figure 5: Apache Nifi Data Provenance View
Figure 6: Nifi - The Power of Provenance
Figure 7: Apache Nifi Stats View
Nifi can be used as a full-fledged ETL tool and supports most ETL features. However, Nifi still describes
itself as a simple event processing and data provenance tool. With continued open source support, it
may well be transformed into a full-fledged ETL tool.
4.2.2.2. Cascading
To deal with the quick-adoptability part, it is a good idea to have wrapper technologies: they allow the
code to be written once while the processing engine underneath is swapped based on the latest trends
or best fit. Our research suggests Cascading (Driven, Inc., 2015) as a good candidate here. At present,
Cascading supports multiple processing engines underneath (Spark, MapReduce, Flink etc.).
Cascading supports development in Java and Scala, and allows the business logic to be developed
separately from the integration logic. Complete applications may be developed, and unit tests written,
without touching a single Hadoop API. This provides the degrees of freedom to move easily through the
application development life-cycle and to deal separately with integrating existing systems.
Cascading provides a rich API that lets developers think in terms of data and business problems, with
capabilities such as sort, average, filter, merge etc. Its computation engine and process planner convert
the business logic into efficient parallel jobs, delivering an optimal plan at run-time for the computation
fabric of choice.
In simple terms, Cascading may be considered the plumbing for building pipelines: it provides sinks,
traps, connections etc., and it is just a matter of plugging them together to build business logic without
worrying about whether the code will run on MapReduce, Spark or Flink. This style is often described as
a pattern language. Developers can go all the way to unit testing without touching Hadoop or any
processing engine; from a technology category perspective, it is middleware for designing workflows.
A minimal sketch follows.
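Here is a minimal Cascading sketch along the lines of its published tutorials: a pipe assembly that copies tab-delimited records between two taps (the file paths are placeholders). Swapping the flow connector for another engine's connector retargets the same assembly without changing the business logic.

import java.util.Properties;

import cascading.flow.Flow;
import cascading.flow.FlowDef;
import cascading.flow.hadoop2.Hadoop2MR1FlowConnector;
import cascading.pipe.Pipe;
import cascading.property.AppProps;
import cascading.scheme.hadoop.TextDelimited;
import cascading.tap.Tap;
import cascading.tap.hadoop.Hfs;

public class CopyFlowSketch {
  public static void main(String[] args) {
    String inPath = args[0];   // e.g. hdfs:///lake/raw/orders.tsv (placeholder)
    String outPath = args[1];  // e.g. hdfs:///lake/standardized/orders (placeholder)

    Properties properties = new Properties();
    AppProps.setApplicationJarClass(properties, CopyFlowSketch.class);

    // Taps are the plumbing endpoints: where data is read from and written to.
    Tap inTap = new Hfs(new TextDelimited(true, "\t"), inPath);
    Tap outTap = new Hfs(new TextDelimited(true, "\t"), outPath);

    // A pipe assembly carries the business logic; this one is a plain copy,
    // but filters, joins, aggregations etc. would be added to the same pipe.
    Pipe copyPipe = new Pipe("copy");

    FlowDef flowDef = FlowDef.flowDef()
        .addSource(copyPipe, inTap)
        .addTailSink(copyPipe, outTap);

    // The connector binds the assembly to an engine; substituting another
    // engine's connector leaves the pipe assembly untouched.
    Flow flow = new Hadoop2MR1FlowConnector(properties).connect(flowDef);
    flow.complete();
  }
}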
4.3. Inventory, Catalogue & Explore
The storage & ingestion ideas above solve a few of the challenges in Table 1: low cost storage, low cost
processing, near real-time sync with data sources, and data kept in raw form to maintain fidelity.
Enterprise data pushed in raw form into the Data Lake gives business analysts and data scientists the
flexibility to pull any enterprise data element as required, without waiting for long ETL development
and data modeling exercises to complete. Streaming ingestion keeps the Data Lake in sync with the data
sources in as near real time as possible.
However, enterprise data in its raw format can be huge, and for a data scientist or any other user,
working with it would be like finding a needle in a haystack. This mandates a self-data-service
framework to be built for data discovery (Inventory), data preparation (Catalogue) and data
visualization (Explore).
4.3.1. Discovery
The first step in data discovery is to provide a metadata framework (a sub-component of the self-data-
service framework) to capture business, technical and operational metadata. This process needs to be
automated to handle the sheer volume of files loaded into the Data Lake. Even though in theory the
Data Lake makes data available to everyone, constraints in the form of entitlements need to be put in
place for data governance purposes.
The metadata framework should also capture data lineage information as part of the ingestion
framework. This enables lineage all the way from the data source into the Data Lake.
4.3.2. Catalog & Visualization
Once metadata (business, technical & operational) has been captured for the raw data provided by the
data sources, it may be used as a catalog, and a UI may then be used to explore it. Along with metadata,
data profiling abilities & data quality metrics for all data pushed into the Data Lake are really valuable
and desirable in this context.
Most of the available frameworks are tag based: they identify and mark metadata, profiling metrics &
quality metrics, and come with built-in CRUD, query or analytics APIs for handling metadata
management. A hypothetical search call against such an API is sketched below.
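To give a feel for the tag-based API style, here is a sketch of querying a catalog's search endpoint over REST. The host, path and query syntax are entirely hypothetical; real products (Cloudera Navigator, Apache Atlas, Waterline Data) each expose their own concrete APIs.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.net.URLEncoder;

public class CatalogSearchSketch {
  public static void main(String[] args) throws Exception {
    // Hypothetical endpoint and query language of a tag-based catalog service.
    String query = URLEncoder.encode("tag=PII AND source=LOB-2", "UTF-8");
    URL url = new URL("http://catalog.example.com/api/v1/search?q=" + query);

    HttpURLConnection conn = (HttpURLConnection) url.openConnection();
    conn.setRequestMethod("GET");
    conn.setRequestProperty("Accept", "application/json");

    // Print the JSON list of matching datasets returned by the catalog.
    try (BufferedReader in = new BufferedReader(
             new InputStreamReader(conn.getInputStream(), "UTF-8"))) {
      String line;
      while ((line = in.readLine()) != null)
        System.out.println(line);
    }
  }
}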
This area is fairly new to the industry, and only a few vendors provide a data self-service framework.
Below is a list of such vendors and their products.
 Cloudera (Cloudera, 2016) – Cloudera Navigator (not open source, license based).
 Waterline Data (Waterline Data, Inc., 2016) – an independent organization; integrates with any
Hadoop distribution.
 Hortonworks (Hortonworks Inc., 2016) – Apache Atlas, still in incubation. However, a limited-
feature version has been added to the HDP 2.3 release. Hortonworks has also actively partnered
with Waterline Data.
4.4. Entitlement & Auditing
Entitlement is one of the primary pieces of data governance. Generally, data governance has a few
mandatory components: data profiling, data quality, entitlement and auditing. The main goal of
governance is to facilitate easy & secured data accessibility along with reliability of the data (profiling &
data quality measures). The previous section discussed profiling & data quality; this section focuses on
entitlement and auditing.
Entitlement & auditing covers a wide range of activities, like
o Authentication
o Authorization
o Encryption
o Auditing
o Data Masking
o Data Field Level Authorization
Almost all Hadoop distribution vendors use Kerberos as the authentication protocol. MapR uses a
proprietary authentication tool that follows a similar approach to Kerberos.
For authorization, data masking & data-field-level authorization, different Hadoop distribution vendors
use different toolsets. Cloudera uses Sentry & Cloudera Navigator. Hortonworks uses Apache Ranger &
Apache Knox. MapR uses its proprietary ACE (Access Control Expression) mechanism, which provides
better flexibility than ACLs (access control lists); ACLs, by contrast, are supported by all vendors.
All vendors offer encryption for at-rest and in-transit data. The approaches taken to manage the keys
used for encryption/decryption are quite proprietary.
There are multiple open source projects in the Hadoop security area. A few are listed below.
 Apache Knox (Apache Software Foundation, 2016): A REST API gateway that provides a single
access point for all REST interactions with Hadoop clusters.
 Apache Sentry (Apache Incubator): A modular system providing role-based authorization for
both data and metadata stored in HDFS. The Sentry project is primarily led by Cloudera, one of
the best-known Hadoop distributors.
 Apache Ranger (Hortonworks, Inc., 2016): A centralized environment for administering and
managing security policies across the Hadoop ecosystem. This project is led by Hortonworks,
another well-known Hadoop distributor, and includes technology gained through its acquisition
of XA Secure in mid-2014 (Hortonworks, Inc., 2014).
 Apache Falcon (Apache Software Foundation, 2016): A data governance engine that allows
administrators to define and schedule data management and governance policies across the
Hadoop environment. Section 4.1 also discusses this tool.
 Project Rhino (Williams, 2013): Creates encryption and key management capabilities and a
common authorization framework across Hadoop projects and subprojects (TechTarget). This
project is led by Intel.
Most of these security tools come built in with the different Hadoop bundles.
4.5. API & User Interface Access
To provide easy and secure access, it is recommended to allow controlled access to the Data Lake either
through an API or through interactive SQL. This in turn enforces the built-in entitlements discussed in
the sections above.
A wide range of tools is available for API management and SQL (Spark SQL, Flink SQL, Impala etc.); an
example of the SQL route is sketched below. Even with all these tools, data access might not be as fast
as with RDBMS tools - a case in point for leveraging existing enterprise tools.
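A minimal SQL-access sketch using the Spark 1.x Java API, assuming a Parquet-backed zone of the lake at a hypothetical path with hypothetical columns; entitlement enforcement would sit in front of or underneath this access path.

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.SQLContext;

public class LakeSqlAccessSketch {
  public static void main(String[] args) {
    JavaSparkContext sc =
        new JavaSparkContext(new SparkConf().setAppName("lake-sql-access"));
    SQLContext sqlContext = new SQLContext(sc);

    // Hypothetical Parquet-backed zone of the lake.
    DataFrame orders = sqlContext.read().parquet("hdfs:///lake/standardized/orders/");
    orders.registerTempTable("orders");

    // Interactive-SQL style access for analysts ("customer_id" and "amount"
    // are assumed columns, used purely for illustration).
    DataFrame topCustomers = sqlContext.sql(
        "SELECT customer_id, SUM(amount) AS total "
        + "FROM orders GROUP BY customer_id ORDER BY total DESC LIMIT 10");
    topCustomers.show();

    sc.stop();
  }
}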
As mentioned earlier, most of the framework set up for the Data Lake can be re-used for other use
cases. If data cleansing & standardization has to be done, it can run in the Hadoop environment using
data processing tools like MapReduce, Cascading, Spark, Flink etc., and the HDFS environment can be
segmented to hold cleansed, standardized and aggregated information. The standardized version of the
data may also be pushed to the existing EDW. This approach moves the complete ETL from the EDW to
the Hadoop environment, minimizing processing and licensing costs; keeping highly granular data in the
lake also reduces RDBMS storage costs.
5. Conclusion
The Data Lake provides an architectural approach with an embedded governance model. It helps data
management teams implement a variety of solutions using cost-effective storage, efficient processing
engines and self-data-service features. Teams implementing a Data Lake need to focus heavily on
defining metadata for all types of data objects ingested into it: metadata plays the key role in exposing
self-data-service flexibility to analysts, data scientists and other users, and it is a key component for
defining entitlements.
6. Bibliography
Akidau, Tyler. The world beyond batch: Streaming 101 [Online]. O’Reilly, Aug 05, 2015. Accessed Mar 08, 2016. https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-101.
Akidau, Tyler. The world beyond batch: Streaming 102 [Online]. O’Reilly, Jan 20, 2016. Accessed Mar 08, 2016. https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-102.
Apache Incubator. Apache Sentry (incubating) [Online]. Apache Software Foundation. Accessed Mar 09, 2016. https://sentry.incubator.apache.org/.
Apache Software Foundation. Apache Flink: Scalable Batch and Stream Data Processing [Online]. 2015. Accessed Mar 07, 2016. http://flink.apache.org/.
Apache Software Foundation. Apache Nifi [Online]. 2015. Accessed Mar 08, 2016. https://nifi.apache.org/.
Apache Software Foundation. Falcon – Feed Management & Data Processing Platform [Online]. Feb 15, 2016. Accessed Mar 08, 2016. https://falcon.apache.org/.
Apache Software Foundation. Knox Gateway – REST API Gateway for the Hadoop Ecosystem [Online]. Mar 01, 2016. Accessed Mar 09, 2016. https://knox.apache.org/.
Chang, Fay, et al. Bigtable: A Distributed Storage System for Structured Data [Online]. Google, 2006. Accessed Mar 15, 2016. http://static.googleusercontent.com/media/research.google.com/en//archive/bigtable-osdi06.pdf.
Chaudhuri, Surajit, and Dayal, Umeshwar. An Overview of Data Warehousing and OLAP Technology [Online]. Microsoft Research, Mar 1997. Accessed Mar 03, 2016. http://research.microsoft.com/pubs/76058/sigrecord.pdf.
Cloudera. Cloudera [Online]. 2016. Accessed Mar 08, 2016. https://cloudera.com/.
Dean, Jeffrey, and Ghemawat, Sanjay. MapReduce: Simplified Data Processing on Large Clusters [Online]. Google, 2004. Accessed Mar 15, 2016. http://static.googleusercontent.com/media/research.google.com/en//archive/mapreduce-osdi04.pdf.
DeCandia, Giuseppe, et al. Dynamo: Amazon’s Highly Available Key-value Store [Online]. Amazon, 2007. Accessed Mar 15, 2016. http://www.allthingsdistributed.com/files/amazon-dynamo-sosp2007.pdf.
Driven, Inc. Cascading | Application Platform for Enterprise Big Data [Online]. Sep 2015. Accessed Mar 08, 2016. http://www.cascading.org/.
Ghemawat, Sanjay, Gobioff, Howard, and Leung, Shun-Tak. The Google File System [Online]. Google, 2003. Accessed Mar 15, 2016. http://static.googleusercontent.com/media/research.google.com/en//archive/gfs-sosp2003.pdf.
Hortonworks Inc. Hortonworks: Open and Connected Data Platforms [Online]. 2016. Accessed Mar 08, 2016. http://hortonworks.com/.
Hortonworks, Inc. Apache Ranger [Online]. 2016. Accessed Mar 09, 2016. http://hortonworks.com/hadoop/ranger/.
Hortonworks, Inc. Hortonworks Acquires XA Secure [Online]. May 15, 2014. Accessed Mar 09, 2016. http://hortonworks.com/press-releases/hortonworks-acquires-xa-secure/.
Kreps, Jay, Narkhede, Neha, and Rao, Jun. Kafka: a Distributed Messaging System for Log Processing [Online]. LinkedIn Corp. Accessed Mar 07, 2016. http://research.microsoft.com/en-us/um/people/srikanth/netdb11/netdb11papers/netdb11-final12.pdf.
TechTarget. Managing Hadoop projects: What you need to know to succeed [Online]. Accessed Mar 09, 2016. http://searchdatamanagement.techtarget.com/essentialguide/Managing-Hadoop-projects-What-you-need-to-know-to-succeed.
Waterline Data, Inc. Waterline Data | Find, understand, and govern data in Hadoop [Online]. 2016. Accessed Mar 09, 2016. http://www.waterlinedata.com/.
Williams, Alex. Intel Launches Hadoop Distribution And Project Rhino, An Effort To Bring Better Security To Big Data [Online]. TechCrunch, Feb 26, 2013. Accessed Mar 09, 2016. http://techcrunch.com/2013/02/26/intel-launches-hadoop-distribution-and-project-rhino-an-effort-to-bring-better-security-to-big-data/.
Zaharia, Matei, et al. Discretized Streams: An Efficient and Fault-Tolerant Model for Stream Processing on Large Clusters [Online]. University of California, Berkeley, 2012. Accessed Mar 07, 2016. https://people.csail.mit.edu/matei/papers/2012/hotcloud_spark_streaming.pdf.
Zaharia, Matei, et al. Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing [Online]. University of California, Berkeley, 2012. Accessed Mar 15, 2016. https://www.cs.berkeley.edu/~matei/papers/2012/nsdi_spark.pdf.
Zaharia, Matei, et al. Spark: Cluster Computing with Working Sets [Online]. University of California, Berkeley, 2010. Accessed Mar 15, 2016. http://www.cs.berkeley.edu/~matei/papers/2010/hotcloud_spark.pdf.
7. Few Other Useful References
Data Lake References
 Hortonworks & Teradata Paper - Data Lake
 Amazon’s experience on Data Lake - Data Lake Implementation Guidelines
 Knowledgent Reference - Data Lake Design
 Waterline Data - Self Data Service
Flink: A new breed in processing tool
 Flink Streaming & Batching in One Engine
Data Security
 Cloudera Security – Paper on Hadoop Security
 Cloudera reference on Hadoop Encryption - Encryption in Cloudera
 Hortonworks - Data Governance
Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...
 
Fun all Day Call Girls in Jaipur 9332606886 High Profile Call Girls You Ca...
Fun all Day Call Girls in Jaipur   9332606886  High Profile Call Girls You Ca...Fun all Day Call Girls in Jaipur   9332606886  High Profile Call Girls You Ca...
Fun all Day Call Girls in Jaipur 9332606886 High Profile Call Girls You Ca...
 

Enterprise Data Management - Data Lake - A Perspective

  • 1. Enterprise Data Management: A Perspective From the days of Data Silo, EDW to the present day of Hadoop & Data Lake This document discusses the evolution of the enterprise data management over the years, the challenges of the current CTOs and chief enterprise architects, and the concept of the Data Lake as a means to tackle such challenges. It also talks about some reference architectures and recommended toolset in today’s context. March, 2016 Authors:  Selva Kumar VR  Saurav Mukherjee
4.3.2. Catalog & Visualization .......................................................12
4.4. Entitlement & Auditing ..........................................................12
4.5. API & User Interface Access .....................................................13
5. Conclusion ........................................................................14
6. Bibliography ......................................................................15
7. Few Other Useful References .......................................................18
Figures
Figure 1: Data Management in Silos ....................................................3
Figure 2: Typical EDW Implementation ..................................................4
Figure 3: Typical Data Lake Implementation ............................................8
Figure 4: Apache Nifi Data Flow View .................................................10
Figure 5: Apache Nifi Data Provenance View ...........................................10
Figure 6: Nifi - The Power of Provenance .............................................10
Figure 7: Apache Nifi Stats View .....................................................11

Tables
Table 1: Key Challenges for CTOs/Chief Enterprise Architects of today .................6
Table 2: Data Ingestion Challenges - beyond just the tools ............................9
1. The Evolution of Data Management – what led to ‘Data Lake’?

The concept of data management has evolved over the last 30 years around the idea of providing better and more timely analytics to business teams. IT teams have always struggled with the business demand of providing everything “in the next minute” to serve new business ideas.

1.1. Data Silo

Initially, data management systems for analytics were created in silos. This approach helped extract some insights from the organization’s data assets. However, the silos were restricted to individual LOBs (lines of business) and hence were never considered comprehensive. Typically, LOBs sent data to other LOBs as required and requested; in most cases these were just reports (static and analytical) pulled from application databases.

Figure 1: Data Management in Silos

1.2. Enterprise Data Warehouse (EDW)

To break away from the data silos while still giving LOBs the freedom to create their own data marts, the idea of the Enterprise Data Warehouse (EDW) was adopted widely by industry. The concept has been researched for a long time; a joint paper by HP Labs and Microsoft Research provides a good overview of the concept and approach (Chaudhuri, et al., 1997). All data marts source their data from one central version of the data, thereby maintaining data integrity and consistency at the enterprise level.

Though the EDW solved the problem of providing an enterprise-level view of data to all business teams to a certain extent, answering questions or providing the necessary data to business teams within a minute of a new business idea still remained a cherished but elusive dream for IT and business teams. Also, this ‘one version fits all’ idea did not go down well with every group in the organization, and the culture of business analysts downloading data from the EDW into Microsoft Excel spreadsheets or Microsoft Access and merging it with source data continued to be widely followed.

The EDW architecture also posed numerous technical challenges. A few such challenges are listed below.

• Cost
o Licensing cost (database licenses, ETL tools etc.)
o Storage cost
• Ridiculously long lead times before database schemas could be created as per standards, followed in turn by long ETL development cycles
• Every post-production fix involved a long and repetitive development cycle
• Complicated designs
• Need for a highly skilled labor force

Figure 2: Typical EDW Implementation

1.3. Big Data

In the meanwhile, technology evangelists like Google, Netflix, Amazon, Facebook, Twitter, advanced oil-drilling equipment manufacturers, space companies etc. injected new types of problems into the data space, e.g., new data types and volumes. It was no longer a case of a structured-data mindset: it involved unstructured data like videos, social text streams, sensor data, data streams from IoT devices etc. These data types can neither be accommodated in traditional databases, nor is their scale as easily manageable as that of structured data. In addition to data volume, the variety and velocity of data flow had to be tackled together to derive business advantage, and to do so faster than the competition.

These new-generation companies also created applications that are distributed from the ground up. New distributed file systems, new distributed processing applications etc. were required to handle the volume and the velocity. Papers from companies like Google (Chang, et al., 2006) (Dean, et al., 2004) (Ghemawat, et al., 2003) and Amazon (DeCandia, et al., 2007) offer detailed discussions on this topic. The dimensions of volume, variety and velocity gave birth to what came to be known as ‘Big Data’¹.

¹ Over time, a couple more V’s, veracity and volatility, got attributed to Big Data.
1.4. Hadoop

Doug Cutting, now Chief Architect at Cloudera, adopted the distributed-systems idea and created Hadoop, inspired by and modeled on Google’s high-volume data processing systems. Hadoop is open source and relies on the concept of bulk commodity hardware. It solves the cost issues (licensing cost, storage cost) and the data variety issue.

Over time, a new ecosystem got created around HDFS (Hadoop Distributed File System). It generated new efficiencies for data architecture through optimization of data processing workloads such as data transformation and integration, while simultaneously lowering the cost of storage. Ideas like flexible ‘schema-on-read’ access to all enterprise data, which allows circumventing long database schema design and long ETL development cycles, started taking shape.

Though Hadoop potentially solves the data storage problem, data retrieval suffers from high latency (batch processing). The latency issue led to new ways of data storage and retrieval in the form of NoSQL databases, e.g., Apache HBase and Apache Cassandra, inspired by Amazon (DeCandia, et al., 2007), and better processing engines like Spark (Zaharia, et al., 2012) (Zaharia, et al., 2010) (Zaharia, et al., 2012) and Flink (Apache Software Foundation, 2015). However, NoSQL databases have their own challenges, like complicated table designs and joins that do not work as well as in a traditional RDBMS.

This landed the industry at the juncture of a good infrastructure framework; low-cost open source tools (e.g., storage tools like HDFS; NoSQL databases like MongoDB, HBase, Cassandra, Memcached etc.; data processing tools like Spark, MapReduce, Pig, Hive, Flink, Nifi etc.; message brokering tools like Kafka (Kreps, et al.), RabbitMQ etc.); and, of course, the existing high-cost enterprise toolsets and easy-access storage (i.e., RDBMSs like Oracle, DB2, SQL Server etc.; Massively Parallel Processing (MPP) tools like Teradata, Impala etc.; processing tools like AbInitio, Informatica, DataStage etc.).

Along the way, the revolution called open source added significant value to the technology community. It facilitated the creation of many start-ups, encouraged new ideas and, of course, added a lot of chaos. Each of these tools (whether low cost or high cost) is focused on solving a specific use case, and every other month new open source products get released. However, for an enterprise CTO or architect, it gets really challenging to identify sustainable open source solutions that would also solve multiple use cases instead of specific ones.

Here came the open source bundling companies, e.g., Cloudera, Hortonworks, MapR etc. They took ownership of identifying software that is good and sustainable, and of managing tools which go through very frequent releases of improved versions. This solved the basic adoption problem of the open source ecosystem in an enterprise to a good extent. There have been differences in the selection of tools among the open source bundling companies and, of course, it is purely left to the enterprise’s use cases to decide which one to go for.

Once the new ecosystem (largely based on open source solutions) got stabilized, the next challenge was to adopt a suitable methodology for application development and maintenance. Adoption of the open source ecosystem also mandated the replacement of all or some of the well-accepted traditional enterprise software and tools. Such replacement entails its own share of risks.
Also, there are no widely practiced and adopted standards in the industry for open source based enterprise data management solutions. Most advanced business analysts still rely on power tools like SQL, metadata management repositories etc. to infer business insights. Hadoop lacks the flexibility
of data extraction using SQL at similar speed. On top of that, there have been challenges of dealing with regulations, preventing data from falling into the wrong hands, auditing etc.

2. The Challenges of present CTOs

The previous section discussed the evolution of data management, the multidimensional challenges it posed, and the difficulty of identifying a proper adoption framework or architecture that may be widely used, standardized and easily adopted by enterprises. CTOs and architects would be better served by having a reference architecture or framework to minimize the risks involved. The few exceptional use cases which may not fit well in this framework or architecture can be handled separately.

Before delving deep into the adoption framework or architecture, here is a quick summary of the critical challenges from the enterprise data management perspective, as an evolution from the EDW era.

1. Provide low cost storage and processing. Accommodate any data type.
2. Provide a consolidated view of enterprise data to empower business teams to pull all required information the minute a new business idea pops up.
3. Provide a consolidated view of enterprise data and the flexibility of ad hoc reporting on any data element in the enterprise to the business analyst.
4. Provide metadata cataloguing and a search facility for metadata.
5. Store data in original raw form to guarantee data fidelity.
6. Provide entitlement management features that take care of regulation, authorization, authentication, encryption, data masking, auditing etc.
7. Leverage existing licensed tools for use cases / problems which open source systems cannot solve.
8. Maintain existing good features like fast data extraction using SQL for analysis, and add new features that significantly reduce the latency of creating advanced analytical applications like machine learning.
9. Provide data access to external and internal teams based on entitlement.
10. Provide enterprise data elements in raw form to a new category of analysts, called data scientists.
11. Select technologies to minimize tool replacement costs and keep up with technology trends to keep the enterprise competitive.
12. Integrate data profiling and data quality results into the metadata management framework.

Table 1: Key Challenges for CTOs/Chief Enterprise Architects of today
3. Data Lake

‘Data Lake’ came across as the next key concept in the data management area and was primarily conceptualized to tackle the challenges mentioned in the section above, The Challenges of present CTOs. It is more of an architectural concept and may be defined as:

“A repository of enterprise-wide, large quantities and variety of data elements, both structured and unstructured, in raw form.”

This definition is purely based on insights from multiple data management implementations in Hadoop environments, identifying challenges and coming up with an architecture to solve them. However, a repository alone will not suffice in meeting the challenges mentioned in Table 1; it requires supporting components to deliver the benefits.

3.1. Key Components of Data Lake

The Data Lake architecture involves some mandatory components (listed below) to make it a successful implementation.

3.1.1. Storage
• Low cost
• Store raw data from different input sources
• Support any data type
• High durability

3.1.2. Ingestion
• Facilitate both batch and streaming ingestion frameworks
• Offer low latency

3.1.3. Inventory & Cataloguing
• Discover metadata and generate tags
• Discover lineage information
• Manage tags

3.1.4. Exploration
• Browse / search inventory
• Inspect data quality
• Tag data quality attributes
• Auditing

3.1.5. Entitlement
• Identity & Access Management
o Authentication, authorization, encryption, quotas, data masking

3.1.6. API & User Interface
• Expose a search API
• Expose the Data Lake to customers via API and SQL interfaces based on entitlements and access rights
4. Data Lake – Implementing the Architecture

Figure 3: Typical Data Lake Implementation

The components shown in Figure 3 above are the minimal requirements for implementing a Data Lake. Hadoop (HDFS) can accommodate application storage as well, and such applications can also leverage the Data Lake’s built-in framework components like Catalogue, Data Quality, Search and Entitlements.

4.1. Storage

The primary requirements for storage are low cost, the ability to accommodate high volume, and long durability. Storage should also be able to accommodate any data type. Current technological trends suggest that HDFS, MapR-FS and Amazon S3 suit the need; even though they have different underlying implementations, they all adhere to Hadoop standards.

Along with storing data in distributed file systems, it is a good idea to identify suitable storage options for the different data types, as below.

• Unstructured data
o Store in the native file format (logs, dump files, videos etc.)
o Compress with a streaming codec (LZO, Snappy)
• Semi-structured data (JSON, XML files)
o Good to store in schema-aware formats, e.g., Avro. Avro allows versioning and extensibility, such as adding new fields (see the sketch below).
• Structured data
o Flat records (CSV or some other field-separated format)
o Avro or columnar storage (Parquet)
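To make the schema-evolution point concrete, here is a minimal Java sketch using the standard Avro API. The record name, field names and default value are illustrative only; the point is that a field added with a default keeps files written under the old schema readable.

    import java.io.File;
    import org.apache.avro.Schema;
    import org.apache.avro.file.DataFileWriter;
    import org.apache.avro.generic.GenericData;
    import org.apache.avro.generic.GenericDatumWriter;
    import org.apache.avro.generic.GenericRecord;

    public class AvroEvolutionSketch {
        // v2 of a hypothetical 'event' schema: 'source' is the newly added field,
        // carrying a default so readers of older v1 files still work.
        private static final String SCHEMA_V2 =
            "{\"type\":\"record\",\"name\":\"event\",\"fields\":["
          + "{\"name\":\"id\",\"type\":\"string\"},"
          + "{\"name\":\"payload\",\"type\":\"string\"},"
          + "{\"name\":\"source\",\"type\":\"string\",\"default\":\"unknown\"}]}";

        public static void main(String[] args) throws Exception {
            Schema schema = new Schema.Parser().parse(SCHEMA_V2);
            GenericRecord rec = new GenericData.Record(schema);
            rec.put("id", "evt-001");
            rec.put("payload", "{\"clicks\":42}");
            rec.put("source", "web");

            // Write a self-describing Avro container file; the schema travels with the data.
            try (DataFileWriter<GenericRecord> writer =
                     new DataFileWriter<>(new GenericDatumWriter<GenericRecord>(schema))) {
                writer.create(schema, new File("events.avro"));
                writer.append(rec);
            }
        }
    }

Because the schema is embedded in the file, downstream consumers can apply schema-on-read without any central schema registry, though one may be added for governance.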
A storage life-cycle policy can also be defined. There are many open source tools, like Apache Falcon (Apache Software Foundation, 2016), that operate based on pre-defined policies. A data directory structure can be defined to segregate data based on the life-cycle policy, e.g., latest data, data up to 7 years old as required by regulations, data older than 7 years etc.

4.2. Ingestion

Ingestion is the first piece of the puzzle that needs to be put in place after setting up the storage. This involves setting up an ingestion framework that will handle both batch and streaming data. Looking at current trends in data processing tools, the next generation of technologies might treat batch processing as legacy. Newer processing engines (e.g., Spark (near real-time), Flink (real-time)) are promoting batch as a special case of streams. The complexity of good stream processing depends on the use case; O’Reilly offers an in-depth discussion of streaming and going beyond batch (Akidau, 2015) (Akidau, 2016).

4.2.1. The Challenges

Having advanced processing tools alone, however, is not enough to ensure proper ingestion. The following list summarizes a few challenges that need to be circumvented.

1. Making use of advanced processing tools requires highly skilled resources in good numbers.
2. Traditional data processing engineers widely use GUI-based ETL tools (like AbInitio, Informatica etc.) that use data-flow programming techniques. For those engineers, coding applications with open source processing tools (like Spark, Flink etc.) still takes considerable time for development and testing.
3. Due to the nature of the open source ecosystem, there will always be a new processing tool that outruns the benefits of the current toolset and adds business advantage over the enterprise’s competition. This requires easy and quick adoptability, which may be a big challenge.
4. Data processing tools are good at processing data. However, an ingestion framework also needs to go beyond that and solve challenges like:
• Low latency and guaranteed data delivery (see the sketch after the table)
• Handling back pressure
• Data provenance (tracking data all the way from the data source)
• Customizability
• Quick implementation and a better UI for the operations team
• Supporting the wide variety of protocols used for sending/receiving data (e.g., SSL, SSH, HTTPS, other encrypted content)
• Loading data into a wide number of destinations (HDFS, Spark, MapR-FS, S3, RDBMS, NoSQL etc.)
5. From an enterprise perspective, it is often desired to have the same tools used across the enterprise for any application that requires data push/pull. However, zeroing in on ‘the one toolset’ is always challenging.

Table 2: Data Ingestion Challenges - beyond just the tools
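To ground the “guaranteed data delivery” row above, here is a minimal Java sketch of a streaming producer using the Kafka client API mentioned earlier in this document. The broker address and topic name are illustrative assumptions, not part of any recommended configuration.

    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.Producer;
    import org.apache.kafka.clients.producer.ProducerRecord;

    public class IngestProducerSketch {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "broker1:9092"); // illustrative address
            props.put("acks", "all");   // wait for all in-sync replicas: delivery guarantee
            props.put("retries", 3);    // retry transient broker failures
            props.put("key.serializer",
                      "org.apache.kafka.common.serialization.StringSerializer");
            props.put("value.serializer",
                      "org.apache.kafka.common.serialization.StringSerializer");

            try (Producer<String, String> producer = new KafkaProducer<>(props)) {
                // Keying by source system keeps per-source ordering within a partition.
                producer.send(new ProducerRecord<>("raw-events", "lob-1",
                                                   "{\"event\":\"order\"}"));
            }
        }
    }

The acks=all setting trades a little latency for delivery assurance; back pressure and provenance still need to be handled by the surrounding framework, which is where the tools recommended next come in.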
4.2.2. Recommendation

Based on tool evaluation research, two tools may be recommended to handle the ingestion problems and the quick-adoptability challenge.

4.2.2.1. Apache Nifi

Apache Nifi (Apache Software Foundation, 2015) is one of the best open source data-flow programming tools, and it fits the bill for most data push/pull use cases. Just to get the uninitiated excited about it, here are a few Nifi snapshots:

Figure 4: Apache Nifi Data Flow View

Figure 5: Apache Nifi Data Provenance View

Figure 6: Nifi - The Power of Provenance
Figure 7: Apache Nifi Stats View

Nifi can be used as an ETL tool and supports most ETL features. However, Nifi still describes itself as a simple event processing and data provenance tool. If open source support is extended to it, Nifi may well be transformed into a full-fledged ETL tool.

4.2.2.2. Cascading

To deal with the quick-adoptability part, it is a good idea to have wrapper technologies. They allow the code to be written once while the processing engine underneath is changed based on the latest trends or best fit. Our research recommends Cascading (Driven, Inc., 2015) as a good candidate here. At present, Cascading supports multiple processing engines underneath (Spark, MapReduce, Flink etc.) and supports development in Java and Scala.

Cascading allows developing the business logic separately from the integration logic. Complete applications may be developed and unit tests written without touching a single Hadoop API. This provides the degrees of freedom to move easily through the application development life cycle and to deal separately with integrating existing systems. Cascading provides a rich API that allows thinking in terms of data and business problems, with capabilities such as sort, average, filter, merge etc. The computation engine and process planner convert the business logic into efficient parallel jobs, delivering the optimal plan at run time to the computation fabric of choice.

In simple terms, Cascading may be considered the plumbing used for building pipelines. It provides sinks, traps, connections etc., and it is just a matter of plugging them together to build the business logic without bothering about whether the code will run on MapReduce, Spark or Flink. This is famously known as a pattern language. Developers can develop all the way through unit testing without touching Hadoop or any processing engine. From a technology category perspective, it is middleware for designing workflows. A minimal sketch follows.
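As a flavor of the API, here is a minimal word-count sketch in Java using the public Cascading 2.x classes; the HDFS paths are illustrative assumptions. The business logic lives entirely in the pipe assembly, and only the connector on the last line binds it to a fabric (classic MapReduce here; a different connector would target another engine).

    import java.util.Properties;
    import cascading.flow.FlowDef;
    import cascading.flow.hadoop.HadoopFlowConnector;
    import cascading.operation.aggregator.Count;
    import cascading.operation.regex.RegexSplitGenerator;
    import cascading.pipe.Each;
    import cascading.pipe.Every;
    import cascading.pipe.GroupBy;
    import cascading.pipe.Pipe;
    import cascading.scheme.hadoop.TextDelimited;
    import cascading.scheme.hadoop.TextLine;
    import cascading.tap.Tap;
    import cascading.tap.hadoop.Hfs;
    import cascading.tuple.Fields;

    public class WordCountSketch {
        public static void main(String[] args) {
            // Source and sink taps: where data comes from and goes to (paths illustrative).
            Tap docTap = new Hfs(new TextLine(new Fields("line")), "hdfs:///lake/raw/docs");
            Tap wcTap  = new Hfs(new TextDelimited(true, "\t"), "hdfs:///lake/curated/wc");

            // Business logic as a pipe assembly: split lines into words, group, count.
            Pipe wcPipe = new Pipe("wordcount");
            wcPipe = new Each(wcPipe, new Fields("line"),
                              new RegexSplitGenerator(new Fields("word"), "\\s+"));
            wcPipe = new GroupBy(wcPipe, new Fields("word"));
            wcPipe = new Every(wcPipe, Fields.ALL, new Count(), Fields.ALL);

            FlowDef flowDef = FlowDef.flowDef()
                .setName("wc")
                .addSource(wcPipe, docTap)
                .addTailSink(wcPipe, wcTap);

            // The connector plans the assembly onto a fabric; here, classic MapReduce.
            new HadoopFlowConnector(new Properties()).connect(flowDef).complete();
        }
    }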
4.3. Inventory, Catalogue & Explore

The Data Lake storage and ingestion ideas mentioned above solve a few of the challenges in Table 1, such as low cost storage, low cost processing, real-time sync with data sources, and data kept in raw form to maintain fidelity. Enterprise data pushed in raw form into the Data Lake gives business analysts and data scientists the flexibility to pull any enterprise data element as required, without waiting for long ETL development and data modeling exercises to complete. Streaming ingestion keeps the Data Lake in sync with the data sources in as near real time as possible.

However, enterprise data in its raw format might be huge, and finding anything in it will be like finding a needle in a haystack for a data scientist or any other user. This mandates that a self-data-service framework be built for data discovery (Inventory), data preparation (Catalogue) and data visualization (Explore).

4.3.1. Discovery

The first step in data discovery is to provide a metadata framework (a sub-component of the self-data-service framework) to capture business, technical and operational metadata. This process needs to be automated to handle the sheer volume of files loaded into the Data Lake. Even though in theory the Data Lake makes data available to everyone, constraints in the form of entitlements need to be put in place for data governance purposes. The metadata framework should also create data lineage information as part of the ingestion framework; this enables lineage all the way from the data source to the Data Lake.

4.3.2. Catalog & Visualization

Once metadata (business, technical and operational) has been captured for the raw data provided by data sources, it may be used as a catalog, and a UI may then be used to explore the metadata. Along with metadata, data profiling abilities and data quality metrics for all data pushed into the Data Lake are really valuable and desirable in this context. Most of the available frameworks are tag based: they identify and mark metadata, profiling metrics and quality metrics, and come with built-in CRUD, Query or Analytics APIs for handling metadata management (see the sketch after the vendor list below).

This area is fairly new to the industry; only a few vendors provide a data self-service framework. Below is a list of such vendors and their products.

• Cloudera (Cloudera, 2016) – Cloudera Navigator (not open source, license based).
• Waterline Data (Waterline Data, Inc., 2016) – an independent organization; integrates with any Hadoop distribution.
• Hortonworks (Hortonworks Inc., 2016) – Apache Atlas, still in incubation; however, a limited-feature version has been added to the HDP 2.3 release. Hortonworks has actively partnered with Waterline Data as well.
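To illustrate what a tag-based catalog entry might hold, here is a hypothetical Java sketch. Every class, field and method name below is invented for illustration and does not correspond to any vendor’s API; a real framework would persist and index these entries rather than keep them in memory.

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    // Hypothetical catalog entry: one per data set landed in the lake.
    class CatalogEntry {
        final String path;                                          // e.g., hdfs:///lake/raw/lob1/orders
        final Map<String, String> technicalMeta = new HashMap<>();  // format, schema, size
        final Map<String, String> businessMeta  = new HashMap<>();  // owner, domain, PII flag
        final Map<String, Double> qualityTags   = new HashMap<>();  // e.g., null ratio per field
        String lineage;                                             // upstream source captured at ingestion

        CatalogEntry(String path) { this.path = path; }
    }

    // Hypothetical in-memory catalog exposing the CRUD-style operations the text
    // describes, plus a minimal query over business tags.
    class Catalog {
        private final Map<String, CatalogEntry> entries = new ConcurrentHashMap<>();

        void create(CatalogEntry e)    { entries.put(e.path, e); }
        CatalogEntry read(String path) { return entries.get(path); }
        void delete(String path)       { entries.remove(path); }

        List<CatalogEntry> findByBusinessTag(String key, String value) {
            List<CatalogEntry> hits = new ArrayList<>();
            for (CatalogEntry e : entries.values())
                if (value.equals(e.businessMeta.get(key))) hits.add(e);
            return hits;
        }
    }

The point of the sketch is the shape of the data, not the implementation: technical, business and quality metadata hang off the same entry, which is what lets one search span discovery, profiling and governance concerns.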
4.4. Entitlement & Auditing

Entitlement is one of the primary pieces of data governance. Generally, data governance has a few mandatory components: Data Profiling, Data Quality, Entitlement and Auditing. The main goal of governance is to facilitate easy and secure data accessibility along with reliability of the data (profiling and data quality measures). The previous section discussed profiling and data quality; this section focuses on entitlement and auditing.

Entitlement and auditing cover a wide range of activities, like:

• Authentication
• Authorization
• Encryption
• Auditing
• Data masking
• Data field level authorization

Almost all Hadoop distribution vendors use Kerberos as the authentication protocol. MapR uses a proprietary authentication tool, which follows a similar approach to Kerberos. For authorization, data masking and data field level authorization, different Hadoop distribution vendors use different toolsets. Cloudera uses Sentry and Cloudera Navigator. Hortonworks uses Apache Ranger and Apache Knox. MapR uses the proprietary ACE (Access Control Expression), which provides better flexibility than the ACL (access control list) supported by all vendors. All vendors offer encryption for data at rest and in transit; the approaches taken to manage the keys used for encryption/decryption are quite proprietary.

There are multiple open source projects in the Hadoop security area. A few such projects are listed below.

• Apache Knox (Apache Software Foundation, 2016): a REST API gateway that provides a single access point for all REST interactions with Hadoop clusters.
• Apache Sentry (Apache Incubator): a modular system for providing role-based authorization for both data and metadata stored in HDFS. The Sentry project is primarily led by Cloudera, one of the best-known Hadoop distributors.
• Apache Ranger (Hortonworks, Inc., 2016): a centralized environment for administering and managing security policies across the Hadoop ecosystem. This project is led by Hortonworks, another well-known Hadoop distributor, and includes technology it gained when it acquired XA Secure in mid-2014 (Hortonworks, Inc., 2014).
• Apache Falcon (Apache Software Foundation, 2016): a data governance engine that allows administrators to define and schedule data management and governance policies across the Hadoop environment. Section 4.1 also discusses this.
• Project Rhino (Williams, 2013): creates encryption and key management capabilities and a common authorization framework across Hadoop projects and subprojects (TechTarget). This project is led by Intel.

Most of these security tools are built in and distributed by the different Hadoop bundling vendors.

4.5. API & User Interface Access

To provide easy and secure access, it is recommended to allow controlled access to the Data Lake either through an API or through interactive SQL (see the sketch below). This in turn enforces the built-in entitlements discussed in the sections above.
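As a flavor of the interactive-SQL path, here is a minimal Java sketch using Spark 1.x SQL, one of the tools named in this section. The Parquet path, table name and column names are illustrative assumptions; entitlement enforcement would sit in front of this in a real deployment.

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaSparkContext;
    import org.apache.spark.sql.DataFrame;
    import org.apache.spark.sql.SQLContext;

    public class LakeSqlSketch {
        public static void main(String[] args) {
            SparkConf conf = new SparkConf().setAppName("lake-sql");
            JavaSparkContext sc = new JavaSparkContext(conf);
            SQLContext sqlContext = new SQLContext(sc);

            // Schema-on-read: the Parquet footer supplies the schema at query time.
            DataFrame events = sqlContext.read().parquet("hdfs:///lake/curated/events");
            events.registerTempTable("events");

            // Analysts issue ad hoc SQL without an upfront ETL or modeling cycle.
            DataFrame daily = sqlContext.sql(
                "SELECT event_date, COUNT(*) AS cnt FROM events GROUP BY event_date");
            daily.show();

            sc.stop();
        }
    }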
A wide range of tools is available for API management and SQL (Spark SQL, Flink SQL, Impala etc.). Even with all these tools, data access might not be as fast as with RDBMS tools; this is a case in point for leveraging existing enterprise tools.

As mentioned earlier, most of the framework set up for the Data Lake can be re-used for other use cases. If data cleansing and standardization have to be done, they can be run in the Hadoop environment using data processing tools like MapReduce, Cascading, Spark, Flink etc., and the HDFS environment can be segmented to hold cleansed, standardized and aggregated information. A standardized version of the data may also be pushed to the existing EDW. This approach moves the complete ETL workload from the EDW to the Hadoop environment, minimizing processing and licensing costs. Keeping the highly granular data in Hadoop also reduces RDBMS storage costs.

5. Conclusion

The Data Lake provides an architectural approach with an embedded governance model. It helps data management teams implement a variety of solutions using cost-effective storage, efficient processing engines and self-data-service features. Teams implementing a Data Lake need to focus heavily on defining metadata for all types of data objects ingested into the lake. Metadata plays a key role in exposing self-data-service flexibility to analysts, data scientists and users, and it is a key component for defining entitlements.
6. Bibliography

Akidau, Tyler. “The world beyond batch: Streaming 101.” O’Reilly Media, Aug 05, 2015 (accessed Mar 08, 2016). https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-101.

Akidau, Tyler. “The world beyond batch: Streaming 102.” O’Reilly Media, Jan 20, 2016 (accessed Mar 08, 2016). https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-102.

Apache Incubator. “Apache Sentry (incubating).” Apache Software Foundation (accessed Mar 09, 2016). https://sentry.incubator.apache.org/.

Apache Software Foundation. “Apache Flink: Scalable Batch and Stream Data Processing.” 2015 (accessed Mar 07, 2016). http://flink.apache.org/.

Apache Software Foundation. “Apache Nifi.” 2015 (accessed Mar 08, 2016). https://nifi.apache.org/.

Apache Software Foundation. “Falcon – Feed Management & Data Processing Platform.” Feb 15, 2016 (accessed Mar 08, 2016). https://falcon.apache.org/.

Apache Software Foundation. “Knox Gateway – REST API Gateway for the Hadoop Ecosystem.” Mar 01, 2016 (accessed Mar 09, 2016). https://knox.apache.org/.

Chang, Fay, et al. “Bigtable: A Distributed Storage System for Structured Data.” Google, 2006 (accessed Mar 15, 2016). http://static.googleusercontent.com/media/research.google.com/en//archive/bigtable-osdi06.pdf.

Chaudhuri, Surajit, and Umeshwar Dayal. “An Overview of Data Warehousing and OLAP Technology.” Microsoft Research, Mar 1997 (accessed Mar 03, 2016). http://research.microsoft.com/pubs/76058/sigrecord.pdf.

Cloudera. “Cloudera.” 2016 (accessed Mar 08, 2016). https://cloudera.com/.

Dean, Jeffrey, and Sanjay Ghemawat. “MapReduce: Simplified Data Processing on Large Clusters.” Google, 2004 (accessed Mar 15, 2016). http://static.googleusercontent.com/media/research.google.com/en//archive/mapreduce-osdi04.pdf.

DeCandia, Giuseppe, et al. “Dynamo: Amazon’s Highly Available Key-value Store.” Amazon, 2007 (accessed Mar 15, 2016). http://www.allthingsdistributed.com/files/amazon-dynamo-sosp2007.pdf.

Driven, Inc. “Cascading | Application Platform for Enterprise Big Data.” Sep 2015 (accessed Mar 08, 2016). http://www.cascading.org/.

Ghemawat, Sanjay, Howard Gobioff, and Shun-Tak Leung. “The Google File System.” Google, 2003 (accessed Mar 15, 2016). http://static.googleusercontent.com/media/research.google.com/en//archive/gfs-sosp2003.pdf.

Hortonworks Inc. “Hortonworks: Open and Connected Data Platforms.” 2016 (accessed Mar 08, 2016). http://hortonworks.com/.

Hortonworks, Inc. “Apache Ranger.” 2016 (accessed Mar 09, 2016). http://hortonworks.com/hadoop/ranger/.

Hortonworks, Inc. “Hortonworks Acquires XA Secure.” May 15, 2014 (accessed Mar 09, 2016). http://hortonworks.com/press-releases/hortonworks-acquires-xa-secure/.

Kreps, Jay, Neha Narkhede, and Jun Rao. “Kafka: a Distributed Messaging System for Log Processing.” LinkedIn Corp. (accessed Mar 07, 2016). http://research.microsoft.com/en-us/um/people/srikanth/netdb11/netdb11papers/netdb11-final12.pdf.

TechTarget. “Managing Hadoop projects: What you need to know to succeed.” (accessed Mar 09, 2016). http://searchdatamanagement.techtarget.com/essentialguide/Managing-Hadoop-projects-What-you-need-to-know-to-succeed.

Waterline Data, Inc. “Waterline Data | Find, understand, and govern data in Hadoop.” 2016 (accessed Mar 09, 2016). http://www.waterlinedata.com/.

Williams, Alex. “Intel Launches Hadoop Distribution And Project Rhino, An Effort To Bring Better Security To Big Data.” TechCrunch, Feb 26, 2013 (accessed Mar 09, 2016). http://techcrunch.com/2013/02/26/intel-launches-hadoop-distribution-and-project-rhino-an-effort-to-bring-better-security-to-big-data/.

Zaharia, Matei, et al. “Discretized Streams: An Efficient and Fault-Tolerant Model for Stream Processing on Large Clusters.” University of California, Berkeley, 2012 (accessed Mar 07, 2016). https://people.csail.mit.edu/matei/papers/2012/hotcloud_spark_streaming.pdf.

Zaharia, Matei, et al. “Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing.” University of California, Berkeley, 2012 (accessed Mar 15, 2016). https://www.cs.berkeley.edu/~matei/papers/2012/nsdi_spark.pdf.

Zaharia, Matei, et al. “Spark: Cluster Computing with Working Sets.” University of California, Berkeley, 2010 (accessed Mar 15, 2016). http://www.cs.berkeley.edu/~matei/papers/2010/hotcloud_spark.pdf.
7. Few Other Useful References

Data Lake references
• Hortonworks & Teradata paper – Data Lake
• Amazon’s experience on Data Lake – Data Lake Implementation Guidelines
• Knowledgent reference – Data Lake Design
• Waterline Data – Self Data Service

Flink: a new breed of processing tool
• Flink Streaming & Batching in One Engine

Data security
• Cloudera Security – paper on Hadoop security
• Cloudera reference on Hadoop encryption – Encryption in Cloudera
• Hortonworks – Data Governance