TDWI CHECKLIST REPORT
TDWI RESEARCH
tdwi.org
Active Data Archiving
For Big Data, Compliance, and Analytics
By Philip Russom
Sponsored by: RainStor
MAY 2014

TABLE OF CONTENTS
FOREWORD
NUMBER ONE
Embrace modern practices and platforms for active data archiving
NUMBER TWO
Assure and improve data governance by using a compliance data archive
NUMBER THREE
Consider an analytics archive for critical, high-value, and aging analytics data
NUMBER FOUR
Rethink how data is committed to an archive
NUMBER FIVE
Rethink how archived data is accessed and used actively
NUMBER SIX
Deploy archiving systems that have multiple storage and processing tiers
NUMBER SEVEN
Make security a high priority because it will make or break an archive
ABOUT OUR SPONSOR
ABOUT THE AUTHOR
ABOUT TDWI RESEARCH
ABOUT THE TDWI CHECKLIST REPORT SERIES

© 2014 by TDWI (The Data Warehousing Institute™), a division of 1105 Media, Inc. All rights reserved. Reproductions in whole or in part are prohibited except by written permission. E-mail requests or feedback to info@tdwi.org. Product and company names mentioned herein may be trademarks and/or registered trademarks of their respective companies.

555 S Renton Village Place, Ste. 700
Renton, WA 98057-3295
T	 425.277.9126
F	 425.687.2842
E	 info@tdwi.org
tdwi.org
TDWI CHECKLIST REPORT: ACTIVE DATA ARCHIVING FOR BIG DATA, COMPLIANCE, AND ANALYTICS

FOREWORD

Data archiving presents various problems in the enterprise today. Many organizations don’t archive at all. Others mistakenly think that mere data backups can serve as archives, whereas tape is actually the final burial place of data, from which it rarely returns. Equally off base, others believe a data warehouse is an archive. Although data archiving processes do exist in some organizations, they are rarely formalized or policy driven, so data is archived in an ad hoc fashion (typically per application or per department) without an enterprise standard or strategy.

Even when an organization makes an honest attempt at an enterprise data archive, the result is usually not trustworthy (because data is easily altered), not auditable (due to poor metadata and documentation), not compliant (due to inadequate usage monitoring or the inability to purge data at specified milestones), and not properly secured (lacking encryption, masking, and security standards). Furthermore, with most existing data archives, it’s hard to get data in with integrity and out with speed because the primary platform is not online, active, and highly available.

Why don’t more organizations invest in formal archiving processes and technical solutions? Most likely because of the common belief that archives provide little or no return on investment (ROI), since users rarely (if ever) access the archive. Without prominent and frequent usage, a respectable ROI is unlikely.

A data archive can achieve ROI by serving multiple uses and users from an online, active platform. Yes, organizations do need to retain data; that’s not in question. However, archived data is not just insurance for compliance, audit, and legal contingencies. Those are important goals, but a data archive should also be treated as an enterprise asset to be leveraged, typically via analytics. Hence, a data archive can be more than a cost center; it can achieve ROI when it serves multiple uses (archiving, compliance, and analytics of deep historical data sets) and manages data online for active access at any time by a wide range of users.

Users must start planning today for active data archiving. To help them prepare, this TDWI Checklist Report drills into the desirable attributes, use cases, user best practices, and enabling technologies of active data archiving.

NUMBER ONE
EMBRACE MODERN PRACTICES AND PLATFORMS FOR ACTIVE DATA ARCHIVING

There are compelling reasons for improving data archives. Traditional reasons for data archives still apply: namely, supplying data for compliance, audit, and legal requirements. However, a modern online data archive brings greater speed, accuracy, and credibility to these tasks so they are a smaller drain on enterprise processes and resources.

New reasons have come into play as well: namely, organizations’ voracious hunger for actionable insights discovered through advanced analysis of raw source data, big data, and a broadening diversity of data types. One of the most influential changes, however, concerns the state of the art in data platforms, both hardware and software. Their speed, scale, and functionality continue to rise even as their costs fall, which in turn makes improving users’ data archive solutions feasible for both technical and financial reasons.

Active data archiving can address these problems and opportunities. Enterprises need to embrace the emerging practice of active data archiving along with its enabling technologies. A modern solution for active data archiving will:

• Be built primarily for compliance or data governance but also serve the archival needs of analytics and sometimes data backup and disaster recovery.
• Be open to active access by a wide range of users, including those who need simple lookups and easy data exploration.
• Manage data as an immutable record that cannot be altered so that data is trustworthy for compliance and legal requirements.
• Be secured like a bank vault, for data security, privacy, and trust, using role-based permission access, data masking, encryption, and multiple data security standards.
• Scale up to multi-terabyte and petabyte data volumes using fast bulk loads and data compression, both to embrace new big data sources and because archives inevitably grow over time.
• Operate online with high availability around the clock to enable active data loads and extracts that keep the archive current up to the minute. Furthermore, data is constantly appended to an active archive without downtime or performance degradation.
• Support high-performance access based on SQL and other standards because users expect quick responses as they run queries and searches against archived data.

NUMBER TWO
ASSURE AND IMPROVE DATA GOVERNANCE BY USING A COMPLIANCE DATA ARCHIVE

Two broad archive categories, defined by their content and the primary use of that information, can coexist and overlap in active data archiving solutions:

• Compliance archives: Data retained in the content, format, and timeframes prescribed by legislation and other regulations (e.g., those concerning partners, lenders, and legal liabilities)
• Analytic archives: Detailed source data from operational and transactional applications, extracted for general business intelligence purposes but retained for advanced analytics (as defined in the next section of this report)

Compliance archives have a number of desirable process and technical attributes:

Data that’s properly archived is solid evidence of an organization’s compliance. In legal terms, honest attempts at archiving constitute proper intent, whereas a lack of archiving may be construed as malfeasance.

Data archived for compliance must support appropriate regulations. These vary by industry. For example, in the United States, the most stringent regulations target banking and the financial services industry, as seen in the Dodd-Frank legislation or SEC Rule 17a-4. Similarly, the telecommunications industry is subject to legal hold and lawful intercept requirements that demand timed data retention.

Archived data must be tamper proof to be trusted. Most archived data is captured and stored in its original form so it’s a credible representation of a transaction, report, business process, or other event at a specific time. If archived data is altered, it is no longer considered credible. For example, stock trades are stored for exact timeframes to protect both trader and institution. Transparency is of the utmost importance to compliance archives, and WORM (write once, read many times) storage has become key.

Archived data demands a convincingly documented audit trail. Most audits commence with a request for information, followed by a request for an audit trail for the supplied information. With data stored properly in an active archive, audits go faster, and perhaps more accurately, than with traditional offline, ad hoc archives. The speedy, documented response builds confidence with auditing bodies and contributes to favorable outcomes.

An active data archive should have tracking functions so an organization can monitor and study its own activities to assure compliance and make improvements. The same tracking functions can flag data that has aged beyond its compliance requirements and should be deleted.
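
To make the idea of flagging data that has aged beyond its compliance requirements concrete, here is a minimal Python sketch. The record classes, the retention periods in `RETENTION_DAYS`, and the `ArchivedRecord` and `expired_records` names are all hypothetical; real retention periods come from the applicable regulation and from legal counsel, not from this example.

```python
from dataclasses import dataclass
from datetime import date, timedelta

# Hypothetical retention periods in days per record class.
# These values are illustrative assumptions, not legal guidance.
RETENTION_DAYS = {"trade": 6 * 365, "call_record": 2 * 365}


@dataclass(frozen=True)
class ArchivedRecord:
    record_id: str
    record_class: str
    archived_on: date


def expired_records(records, today):
    """Return the records whose retention period has elapsed as of `today`."""
    flagged = []
    for rec in records:
        limit = timedelta(days=RETENTION_DAYS[rec.record_class])
        if today - rec.archived_on > limit:
            flagged.append(rec)
    return flagged


records = [
    ArchivedRecord("T-1", "trade", date(2005, 1, 10)),
    ArchivedRecord("T-2", "trade", date(2013, 6, 1)),
    ArchivedRecord("C-1", "call_record", date(2011, 3, 15)),
]
flagged = expired_records(records, today=date(2014, 5, 1))
print([r.record_id for r in flagged])  # ['T-1', 'C-1']
```

The function only flags candidates; in practice a purge would likely also need to be logged in the audit trail and checked against any active legal hold.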

NUMBER THREE
CONSIDER AN ANALYTICS ARCHIVE FOR CRITICAL, HIGH-VALUE, AND AGING ANALYTICS DATA

Archiving operational data for analytic purposes is on the rise. As more advanced forms of analytics have gained credence over the last 15 to 20 years, user organizations have been retaining more detailed source data. The traditional practice was to extract data from operational applications and other sources, process that data and load the results into a data warehouse (DW), then delete the extracted source data. The accepted practice today keeps most source data because it is also the preferred material for analytics based on data mining, statistical analyses, natural language processing, and SQL-based analytics.

An analytic archive and a data warehouse are similar but different. Because of the stepped-up data retention, the data staging areas within most data warehouse architectures today are bigger than their core warehouses. This is tantamount to data archiving, though few BI/DW professionals call it archiving. All they know is that they have to do something to improve the content and accessibility of their analytic data archives. Furthermore, they need to offload this burden from core warehouses, which have higher priorities than analytics (namely reporting, OLAP, and performance management). Hence, as BI/DW professionals ponder where to put certain classes of analytic data, they should consider a platform for active data archiving.

An analytic archive easily integrates with multi-platform DW architectures. DW system architectures have always been multi-platform, but this trend has accelerated in recent years as users have extended their DW environments by adding new platforms for columnar databases, appliances, NoSQL, and Hadoop. An additional platform, one that specializes in archiving data for advanced analytics, would wring more value from archived source data and easily integrate with multi-platform DW architectures.

A data archive can future-proof analytic applications. Most data warehouses are designed by their users (not vendors) for the data requirements of reporting, OLAP, and performance management. These practices need calculated, aggregated, standardized, and time-series numeric values modeled in multidimensional structures that don’t exist in source systems. Advanced analytics has different data requirements. It needs a very large store of unaltered (or lightly transformed) detailed source data. Other than that, it’s impossible to anticipate data requirements for future analytic applications (AAs). Accordingly, an analytic archive preserves source data in its original form, so the source is there for future AAs to explore and repurpose.

NUMBER FOUR
RETHINK HOW DATA IS COMMITTED TO AN ARCHIVE

A data archive has to be more than a dumping ground. For one thing, there needs to be a strategy based on new and evolving user requirements for aging, less frequently accessed data, along with other metrics for identifying which data should be archived at what level and on what schedule. Note that not all data should be archived: some data belongs elsewhere, say, in its original application database or in a data warehouse. Archive specialists need to interview a broad range of business users and managers to determine users’ needs for archived data. If your organization has a legal department and compliance officers, give priority to their needs but without neglecting the rest of the enterprise.

On a technology level, develop interfaces and integration logic for getting data into the archive quickly and in lightly transformed states that are conducive to query and search, without altering the essential content of archived data. Finally, assume that all the data in the archive needs an audit trail and documentation (via metadata, etc.) that is sufficient to satisfy even the most aggressive users and auditors.

What if data comes from applications that have been upgraded or customized (which can alter data models)? Look for a data archiving platform that can manage changing data models. That way, the platform understands changes to source schema and adjusts metadata and pointers accordingly.

What if archived data comes from an application that was decommissioned (also known as application retirement)? When the only application that can read a dataset with full integrity is gone, that application’s data may need to be lightly transformed before entering an archive (or after it’s in the archive) so it can be easily accessed by common query and search tools. This practice is inspired by data warehousing, but it does not require the full-blown time, skills, and expense of the average data warehouse.

Some archived data needs encryption (for security) or compression (to reduce its storage footprint). Look for a platform that can apply these and other data operations as data enters the archive or after data is in the archive. Furthermore, as data growth rates continue to rise over time and business demands for retaining older data grow, data should be stored in a compressed state to optimize storage capacity and scale over time. Similarly, the security classification of data can change as organizational rules and policies evolve.
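
The following Python sketch illustrates one way data might be committed to an archive as described above: the payload is compressed on the way in, its essential content is left unchanged, and metadata is recorded for the audit trail. It uses only the standard library (`zlib`, `hashlib`). The function names, the metadata fields, and the `orders_app` source are assumptions made for illustration, not features of any particular archiving product.

```python
import hashlib
import json
import zlib
from datetime import datetime, timezone


def archive_record(raw: bytes, source: str, schema_version: str) -> dict:
    """Compress a raw source record and attach metadata for the audit trail."""
    return {
        "payload": zlib.compress(raw, 9),
        "metadata": {
            "source": source,
            "schema_version": schema_version,  # tracks changes to the source data model
            "sha256": hashlib.sha256(raw).hexdigest(),  # fingerprint of the original content
            "original_bytes": len(raw),
            "archived_at": datetime.now(timezone.utc).isoformat(),
        },
    }


def restore_record(entry: dict) -> bytes:
    """Decompress a payload and verify it matches what was originally archived."""
    raw = zlib.decompress(entry["payload"])
    if hashlib.sha256(raw).hexdigest() != entry["metadata"]["sha256"]:
        raise ValueError("archived payload does not match its recorded checksum")
    return raw


# Hypothetical source record from an operational application.
raw = json.dumps({"order_id": 1001, "status": "SHIPPED", "notes": "x" * 500}).encode("utf-8")
entry = archive_record(raw, source="orders_app", schema_version="v2")

restored = restore_record(entry)
print(restored == raw)  # True: the content survives the round trip unchanged
print(len(entry["payload"]) < entry["metadata"]["original_bytes"])  # True: smaller footprint
```

Recording the schema version next to each payload is one simple way to cope with source applications whose data models change over time.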

NUMBER FIVE
RETHINK HOW ARCHIVED DATA IS ACCESSED AND USED ACTIVELY

Let’s be honest: We’ve all worked in organizations where archives were purely pro forma, without a credible effort to preserve data in a state that’s quickly or easily accessed by anyone, much less the growing number of employees who can benefit from accessing the information. Luckily, this old “worst practice” is giving way to the realization that all enterprise datasets, including archived data, are valuable assets that can contribute to many business goals. The recent craze for analytics with big data has led many organizations to seek more business value from their datasets.

With that in mind, active data archiving is a bit of a culture shock in some organizations. To get past the shock, these organizations need upper management to define a mandate for modern archiving based on the following goals:

Archived data must be leveraged. Typical use cases include fast, documented auditing for compliance, a source for analytic applications, data exploration, and information lookups.

Some data will come out of the archive to be used elsewhere. To enable a broad range of users, tools, and purposes, the archive should support both query and search mechanisms. Furthermore, the archive should serve as a source for other data platforms, especially those for business intelligence and analytics.

A growing constituency of users will have access to archived data. This is a sticking point in organizations that define data governance and compliance as the process of limiting data access. The challenge is to balance access and control, typically through well-defined user types controlled via role-based user access and strong security features in the archival platform.

Accessing archived data will be timely. First, to be truly active, the archive must be online like a database, not offline like magnetic tapes and optical disks or any media that demand a distracting and time-consuming restoration process. Second, data access mechanisms should perform at or near real time for the sake of user productivity.
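
Balancing access and control through well-defined user types can be sketched as a simple role-to-permission mapping. The roles, the actions, and the `is_allowed` helper below are hypothetical and only illustrate the idea; a production archive would more likely rely on the security features of its DBMS or directory service.

```python
from enum import Enum


class Action(Enum):
    LOOKUP = "lookup"
    SEARCH = "search"
    QUERY = "query"
    EXPORT = "export"


# Hypothetical user types and the archive operations each one may perform.
ROLE_PERMISSIONS = {
    "business_user": {Action.LOOKUP, Action.SEARCH},
    "auditor": {Action.LOOKUP, Action.SEARCH, Action.QUERY},
    "analyst": {Action.LOOKUP, Action.SEARCH, Action.QUERY, Action.EXPORT},
}


def is_allowed(role: str, action: Action) -> bool:
    """Return True if the given role may perform the action; unknown roles get nothing."""
    return action in ROLE_PERMISSIONS.get(role, set())


print(is_allowed("business_user", Action.SEARCH))  # True
print(is_allowed("business_user", Action.EXPORT))  # False
print(is_allowed("contractor", Action.LOOKUP))     # False: role is not defined
```

Denying by default for unknown roles keeps the archive open to a growing constituency of users without granting access by accident.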

NUMBER SIX
DEPLOY ARCHIVING SYSTEMS THAT HAVE MULTIPLE STORAGE AND PROCESSING TIERS

For a data archive to be truly active, its primary tier should be based on a robust database management system (DBMS). The DBMS must include traditional relational functions (for query and data exploration) and functions for multiple security strategies, scalability, and high availability. The assumptions here are that most data being archived will be structured and that most users and applications will need to access data via queries. Even so, some functions of the DBMS should be controlled; for example, inserting and updating data can destroy data’s original state, whereas appending data avoids such integrity problems. In addition to relational technology, free text search is critical to finding records of interest and to enabling non-technical users.

An active data archiving platform can host many archives, each with its own unique requirements, similar to how a DBMS can manage several databases (defined as collections of data). Thus, multi-tenancy is another key assumption for a modern data archive.

In most cases, an archive platform is not a data processing or analytics platform. Hence, archived data is best extracted, then moved to a DBMS or other data platform that is more conducive to in-database analytics, intense SQL-based analytics, and miscellaneous forms of advanced analytics. For these purposes, mature organizations already have in place relational data warehouses, columnar databases, and DW appliances, possibly NoSQL databases and Hadoop. As an exception, when an active archive runs atop Hadoop, it may make sense to process and analyze data on the same platform where it’s archived. Note that the DBMS in the primary tier of a data archive does not replace other DBMSs, especially not those deployed for analytics. Instead, it complements them and (in addition to its archival purpose) serves as yet another source of data for analytics (largely historical data).

The storage tier of an active archive should be diverse. This is to accommodate subsystems users already have as well as newer commodity-priced types such as CAS hardware or the Hadoop Distributed File System (HDFS). Even a modern active archive might include systems for magnetic tape and optical disk in the storage tier. After all, many organizations have pre-existing magnetic tape or optical disk libraries that they must maintain. Note that these archaic media are antithetical to an active data archive; if possible, their data should be migrated into the active archive so it’s online and available when users need it.

In the case of a compliance archive (for, say, a financial services institution), the archive must reside in a WORM storage platform. This, in turn, requires a DBMS that supports WORM devices. WORM technologies are worth the investment because they keep compliance and risk officers happy and they avoid fines, penalties, and damaging publicity.
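
As a rough illustration of controlling DBMS write functions so that archived rows keep their original state, the sketch below uses SQLite (chosen only because it ships with Python; the report names no specific DBMS) with triggers that reject UPDATE and DELETE, so rows can only be appended. The `trades` table, its columns, and the helper functions are hypothetical. This enforces append-only behavior at the database layer only and is not a substitute for WORM storage.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE trades (
    trade_id    INTEGER PRIMARY KEY,
    symbol      TEXT NOT NULL,
    quantity    INTEGER NOT NULL,
    executed_at TEXT NOT NULL
);

-- Reject any attempt to change or remove an archived row.
CREATE TRIGGER trades_no_update BEFORE UPDATE ON trades
BEGIN
    SELECT RAISE(ABORT, 'archive is append-only: UPDATE rejected');
END;

CREATE TRIGGER trades_no_delete BEFORE DELETE ON trades
BEGIN
    SELECT RAISE(ABORT, 'archive is append-only: DELETE rejected');
END;
""")


def append_trade(symbol: str, quantity: int, executed_at: str) -> None:
    """Append one record; this is the only write path the archive allows."""
    with conn:
        conn.execute(
            "INSERT INTO trades (symbol, quantity, executed_at) VALUES (?, ?, ?)",
            (symbol, quantity, executed_at),
        )


def update_is_rejected() -> bool:
    """Return True if the archive refuses to modify an existing row."""
    try:
        with conn:
            conn.execute("UPDATE trades SET quantity = 1 WHERE trade_id = 1")
    except sqlite3.DatabaseError:
        return True
    return False


append_trade("ACME", 100, "2014-05-01T09:30:00Z")
append_trade("ACME", 250, "2014-05-01T09:31:00Z")

print(update_is_rejected())  # True: the trigger aborts the UPDATE
total = conn.execute("SELECT SUM(quantity) FROM trades").fetchone()[0]
print(total)  # 350: ordinary SQL queries still work against the archived rows
```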

Users should consider Hadoop as both a highly scalable storage platform for archiving and a low-cost processing platform for analytics. Note that open-source Hadoop’s poor support for two key standards, SQL (and other relational technologies) and security (especially LDAP and Linux PAM), keeps it unpalatable for mature IT organizations.

Despite these two limitations, Hadoop has roles to play in multi-platform archive architectures. Hadoop excels with very large data volumes, as well as with file-based data, data documents (XML and JSON), textual content (e-mail and word processing files), unstructured and non-relational structured data, and schema-free data. Hadoop’s low price is appropriate to many kinds of lower-value (but high-volume) historic data, such as Web logs. However, due to limitations in current releases, purely open-source Hadoop may not be the best choice for structured data that needs relational processing (such as intense SQL or multi-way joins) or sensitive data that demands high security. That’s not a show stopper because a number of software vendors offer products that integrate with Hadoop to give it stronger and broader support for security and relational technologies like standard SQL.

Consider economics as you select platforms, tools, and features for a new active archiving architecture. For example, it’s technically possible to include almost any brand of relational DBMS in an archiving solution. However, the older and more mature vendor brands are relatively expensive, especially once an archive scales into multi-terabytes, and they include far more features and functions than are required for archiving. A more cost-effective choice is a DBMS designed for archiving or one of the newer columnar, open-source, or appliance-based DBMSs. In this context, Hadoop is affordable in terms of dollars per terabyte of storage. Similarly, data compression is a feature that can reduce storage costs because it reduces the footprint of archived data in storage.

NUMBER SEVEN
MAKE SECURITY A HIGH PRIORITY BECAUSE IT WILL MAKE OR BREAK AN ARCHIVE

Put succinctly, if an archive isn’t secure, it won’t meet the compliance goals that are its primary purpose. Furthermore, if users don’t trust the security of the archival platform, they won’t use it or its data, and the archive will fail to demonstrate a positive ROI.

The primary line of defense is the security layer built into the relational DBMS at the heart of an active data archiving platform. Most mature IT departments and DBMS teams prefer role-based approaches to security, and many have LDAP and other directories they’d like to reuse and apply within the active archiving solution.

If Hadoop is to be part of an active archive’s infrastructure, note that security in purely open-source Hadoop today is mostly about general access privileges controlled through Kerberos. However, a few third parties now offer add-on products that enable LDAP, Active Directory, and other approaches to security for the Hadoop family of products.

Almost all modern data archives are loaded with sensitive data about customers, partners, and employees, including Social Security numbers, credit card numbers, transactions, internal financials, and so on. Encryption or data masking can make this data unreadable in the event of a hack or other unauthorized access.

Additional layers of data protection may be used to keep data locked and immutable. This provides evidence that data records and files have not been altered, which is fundamental to a credible audit. Likewise, records and files cannot be deleted before their retention periods expire.
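
One common way to produce evidence that records have not been altered is a hash chain, in which each entry’s hash covers both the record and the previous entry’s hash. The report does not name this technique; it is offered here only as a minimal sketch, and the `append_entry` and `verify` functions and the sample records are hypothetical.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder "previous hash" for the first entry


def _digest(prev_hash: str, record: dict) -> str:
    """Hash the previous entry's hash together with a canonical form of the record."""
    body = json.dumps(record, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + body).encode("utf-8")).hexdigest()


def append_entry(chain: list, record: dict) -> None:
    """Add a record whose hash depends on everything archived before it."""
    prev_hash = chain[-1]["hash"] if chain else GENESIS
    chain.append({"record": record, "hash": _digest(prev_hash, record)})


def verify(chain: list) -> bool:
    """Recompute every hash in order; any altered record breaks the chain."""
    prev_hash = GENESIS
    for entry in chain:
        if entry["hash"] != _digest(prev_hash, entry["record"]):
            return False
        prev_hash = entry["hash"]
    return True


chain = []
append_entry(chain, {"id": 1, "event": "account opened"})
append_entry(chain, {"id": 2, "event": "wire transfer", "amount": 5000})
print(verify(chain))  # True: nothing has been altered

chain[0]["record"]["event"] = "account closed"  # simulate tampering
print(verify(chain))  # False: the stored hash no longer matches the record
```

A chain like this detects alteration but does not prevent it; prevention still depends on the access controls and immutable storage discussed above.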

ABOUT OUR SPONSOR

www.rainstor.com

RainStor provides the world’s most efficient database solutions that reduce the cost, complexity, and compliance risk of managing data. RainStor delivers solutions to the enterprise so you can quickly deploy an Analytical Archive or Compliance Archive and continue to create business value while staying compliant. RainStor runs anywhere: on-premises or in the cloud and natively on Hadoop. Among RainStor’s customers are 20 of the world’s largest communications providers and 10 of the biggest banks and financial services organizations, which use RainStor to manage historical data while saving millions. For more info: www.rainstor.com or join the conversation: @rainstor.

ABOUT THE AUTHOR

Philip Russom is the research director for data management at The Data Warehousing Institute (TDWI), where he oversees many of TDWI’s research-oriented publications, services, and events. He’s been an industry analyst at Forrester Research and Giga Information Group, where he researched, wrote, spoke, and consulted about BI issues. Before that, Russom worked in technical and marketing positions for various database vendors. Over the years, Russom has produced over 500 publications and speeches. You can reach him at prussom@tdwi.org.

ABOUT TDWI RESEARCH

TDWI Research provides research and advice for business intelligence and data warehousing professionals worldwide. TDWI Research focuses exclusively on BI/DW issues and teams up with industry thought leaders and practitioners to deliver both broad and deep understanding of the business and technical challenges surrounding the deployment and use of business intelligence and data warehousing solutions. TDWI Research offers in-depth research reports, commentary, inquiry services, and topical conferences as well as strategic planning services to user and vendor organizations.

ABOUT THE TDWI CHECKLIST REPORT SERIES

TDWI Checklist Reports provide an overview of success factors for a specific project in business intelligence, data warehousing, or a related data management discipline. Companies may use this overview to get organized before beginning a project or to identify goals and areas of improvement for current projects.

Weitere ähnliche Inhalte

Was ist angesagt?

Benefits of data_archiving_in_data _warehouses
Benefits of data_archiving_in_data _warehousesBenefits of data_archiving_in_data _warehouses
Benefits of data_archiving_in_data _warehousesSurendar Bandi
 
Data breach protection from a DB2 perspective
Data breach protection from a  DB2 perspectiveData breach protection from a  DB2 perspective
Data breach protection from a DB2 perspectiveCraig Mullins
 
Reconciling your Enterprise Data Warehouse to Source Systems
Reconciling your Enterprise Data Warehouse to Source SystemsReconciling your Enterprise Data Warehouse to Source Systems
Reconciling your Enterprise Data Warehouse to Source SystemsMethod360
 
Hadoop and Data Virtualization - A Case Study by VHA
Hadoop and Data Virtualization - A Case Study by VHAHadoop and Data Virtualization - A Case Study by VHA
Hadoop and Data Virtualization - A Case Study by VHAHortonworks
 
DocuClassify - AutoClassification at its best
DocuClassify - AutoClassification at its bestDocuClassify - AutoClassification at its best
DocuClassify - AutoClassification at its bestDocuLynx
 
Symantec Data Insight for Storage
Symantec Data Insight for StorageSymantec Data Insight for Storage
Symantec Data Insight for StorageSymantec
 
Records Governance, Part 2: Can One Solution Manage All Your Archiving Needs?
Records Governance, Part 2: Can One Solution Manage All Your Archiving Needs?Records Governance, Part 2: Can One Solution Manage All Your Archiving Needs?
Records Governance, Part 2: Can One Solution Manage All Your Archiving Needs?Everteam
 
Data Services Marketplace
Data Services MarketplaceData Services Marketplace
Data Services MarketplaceDenodo
 
CXAIR for Data Migration
CXAIR for Data MigrationCXAIR for Data Migration
CXAIR for Data MigrationConnexica
 
Data warehouse concepts
Data warehouse conceptsData warehouse concepts
Data warehouse conceptsobieefans
 
Data-As-A-Service to enable compliance reporting
Data-As-A-Service to enable compliance reportingData-As-A-Service to enable compliance reporting
Data-As-A-Service to enable compliance reportingAnalyticsWeek
 
Top 10 Best Practices for Implementing Data Classification
Top 10 Best Practices for Implementing Data ClassificationTop 10 Best Practices for Implementing Data Classification
Top 10 Best Practices for Implementing Data ClassificationWatchful Software
 
Preparing your data for sharing and publishing
Preparing your data for sharing and publishingPreparing your data for sharing and publishing
Preparing your data for sharing and publishingVarsha Khodiyar
 
Recommind-AXC-Data-Management-Intelligent-Information-Governance-DS
Recommind-AXC-Data-Management-Intelligent-Information-Governance-DSRecommind-AXC-Data-Management-Intelligent-Information-Governance-DS
Recommind-AXC-Data-Management-Intelligent-Information-Governance-DSrschrader1954
 
Introduction to the Query-driven Approach
Introduction to the Query-driven ApproachIntroduction to the Query-driven Approach
Introduction to the Query-driven ApproachTimothy Valihora
 
001 More introduction to big data analytics
001   More introduction to big data analytics001   More introduction to big data analytics
001 More introduction to big data analyticsDendej Sawarnkatat
 
2013 OHSUG - Clinical Data Warehouse Implementation
2013 OHSUG - Clinical Data Warehouse Implementation2013 OHSUG - Clinical Data Warehouse Implementation
2013 OHSUG - Clinical Data Warehouse ImplementationPerficient
 
Tarmin GridBank Overview
Tarmin GridBank OverviewTarmin GridBank Overview
Tarmin GridBank OverviewTarminInc
 
Planning for Research Data Managment
Planning for Research Data ManagmentPlanning for Research Data Managment
Planning for Research Data ManagmentDaniel Crane
 

Was ist angesagt? (20)

Benefits of data_archiving_in_data _warehouses
Benefits of data_archiving_in_data _warehousesBenefits of data_archiving_in_data _warehouses
Benefits of data_archiving_in_data _warehouses
 
Data breach protection from a DB2 perspective
Data breach protection from a  DB2 perspectiveData breach protection from a  DB2 perspective
Data breach protection from a DB2 perspective
 
Reconciling your Enterprise Data Warehouse to Source Systems
Reconciling your Enterprise Data Warehouse to Source SystemsReconciling your Enterprise Data Warehouse to Source Systems
Reconciling your Enterprise Data Warehouse to Source Systems
 
Hadoop and Data Virtualization - A Case Study by VHA
TDWI Checklist Report: Active Data Archiving

TDWI CHECKLIST REPORT
Active Data Archiving for Big Data, Compliance, and Analytics
By Philip Russom
May 2014

TABLE OF CONTENTS
FOREWORD
NUMBER ONE: Embrace modern practices and platforms for active data archiving
NUMBER TWO: Assure and improve data governance by using a compliance data archive
NUMBER THREE: Consider an analytics archive for critical, high-value, and aging analytics data
NUMBER FOUR: Rethink how data is committed to an archive
NUMBER FIVE: Rethink how archived data is accessed and used actively
NUMBER SIX: Deploy archiving systems that have multiple storage and processing tiers
NUMBER SEVEN: Make security a high priority because it will make or break an archive
ABOUT OUR SPONSOR
ABOUT THE AUTHOR
ABOUT TDWI RESEARCH
ABOUT THE TDWI CHECKLIST REPORT SERIES

© 2014 by TDWI (The Data Warehousing Institute™), a division of 1105 Media, Inc. All rights reserved. Reproductions in whole or in part are prohibited except by written permission. E-mail requests or feedback to info@tdwi.org. Product and company names mentioned herein may be trademarks and/or registered trademarks of their respective companies.

TDWI, 555 S Renton Village Place, Ste. 700, Renton, WA 98057-3295
T 425.277.9126 · F 425.687.2842 · E info@tdwi.org · tdwi.org
FOREWORD

Data archiving presents various problems in the enterprise today. Many organizations don’t archive at all. Others mistakenly think that mere data backups can serve as archives, whereas tape is actually the final burial place of data, from which it rarely returns. Equally off base, others believe a data warehouse is an archive. Although it’s true that data archiving processes exist today in some organizations, these are rarely formalized or policy driven, such that data is archived in an ad hoc fashion (typically per application or per department) without an enterprise standard or strategy.

Even when an organization makes an honest attempt at an enterprise data archive, the result is usually not trustworthy (because data is easily altered), not auditable (due to poor metadata and documentation), not compliant (due to inadequate usage monitoring or the inability to purge data at specified milestones), and not properly secured (lacking encryption, masking, and security standards). Furthermore, with most existing data archives, it’s hard to get data in with integrity and out with speed because the primary platform is not online, active, and highly available.

Why don’t more organizations invest in formal archiving processes and technical solutions? Most likely it’s their common belief that archives provide little or no return on investment (ROI) because users rarely (if ever) access the archive. Without prominent and frequent usage, a respectable ROI is unlikely.

A data archive can achieve ROI by serving multiple uses and users from an online, active platform. Yes, organizations do need to retain data; that’s not in question. However, archived data is not just insurance for compliance, audit, and legal contingencies. Those are important goals, but a data archive should also be treated as an enterprise asset to be leveraged, typically via analytics. Hence, a data archive can be more than a cost center; it can achieve ROI when it serves multiple uses (archiving, compliance, and analytics of deep historical data sets) and it manages data online for active access at any time by a wide range of users.

Users must start planning today for active data archiving. To help them prepare, this TDWI Checklist report will drill into the desirable attributes, use cases, user best practices, and enabling technologies of active data archiving.

NUMBER ONE
EMBRACE MODERN PRACTICES AND PLATFORMS FOR ACTIVE DATA ARCHIVING

There are compelling reasons for improving data archives. Traditional reasons for data archives still apply: namely, supplying data for compliance, audit, and legal requirements. However, a modern online data archive brings greater speed, accuracy, and credibility to these tasks so they are a smaller drain on enterprise processes and resources. New reasons have come into play as well: namely, organizations’ voracious hunger for actionable insights discovered through advanced analysis of raw source data, big data, and a broadening diversity of data types. One of the most influential changes, however, concerns the state of the art in data platforms, both hardware and software. Their speed, scale, and functionality continue to rise even as their costs fall, which in turn makes the improvement of users’ data archive solutions feasible for both technical and financial reasons.

Active data archiving can address these problems and opportunities. Enterprises need to embrace the emerging practice of active data archiving along with its enabling technologies. A modern solution for active data archiving will:

• Be built primarily for compliance or data governance but also serve the archival needs of analytics and sometimes data backup and disaster recovery.
• Be open to active access by a wide range of users, including those who need simple lookups and easy data exploration.
• Manage data as an immutable record that cannot be altered so that data is trustworthy for compliance and legal requirements.
• Be secured like a bank vault, for data security, privacy, and trust, using role-based permission access, data masking, encryption, and multiple data security standards.
• Scale up to multi-terabyte and petabyte data volumes using fast bulk loads and data compression to embrace new big data sources and because archives inevitably grow over time.
• Operate online with high availability around the clock to enable active data loads and extracts that keep the archive current up to the minute. Furthermore, data is constantly appended to an active archive without downtime or performance degradation.
• Support high-performance access based on SQL and other standards because users expect quick responses as they run queries and searches against archived data.
NUMBER TWO
ASSURE AND IMPROVE DATA GOVERNANCE BY USING A COMPLIANCE DATA ARCHIVE

Two broad archive categories, defined by their content and the primary use of that information, can coexist and overlap in active data archiving solutions:

• Compliance archives: Data retained in content, format, and for timeframes prescribed by legislation and other regulations (e.g., partners, lenders, and legal liabilities)
• Analytic archives: Detailed source data from operational and transactional applications, extracted for general business intelligence purposes but retained for advanced analytics (as defined in the next section of this report)

Compliance archives have a number of desirable process and technical attributes:

Data that’s properly archived is solid evidence of an organization’s compliance. In legal terms, honest attempts at archiving constitute proper intent, whereas a lack of archiving may be construed as malfeasance.

Data archived for compliance must support appropriate regulations. These vary by industry. For example, in the United States, the most stringent regulations target banking and the financial services industry as seen in the Dodd-Frank legislation or SEC Rule 17a-4. Similarly, the telecommunications industry is subject to legal hold and lawful intercept requirements that demand timed data retention.

Archived data must be tamper proof to be trusted. Most is captured and stored in original form so it’s a credible representation of a transaction, report, business process, or other event at a specific time. If archived data becomes altered, it is no longer considered credible. For example, stock trades are stored for exact timeframes, to protect both trader and institution. Transparency is of the utmost importance to compliance archives, and WORM (write once, read many times) storage has become key.

Archived data demands a convincingly documented audit trail. Most audits commence with a request for information, followed by a request for an audit trail for supplied information. With data stored properly in an active archive, audits go faster (and perhaps more accurately, too) than with traditional offline, ad hoc archives. The speedy, documented response builds confidence with auditing bodies and contributes to favorable outcomes.

An active data archive should have tracking functions so an organization can monitor and study its own activities to assure compliance and make improvements. The same tracking functions can flag data that has aged beyond its compliance requirements and should be deleted.

NUMBER THREE
CONSIDER AN ANALYTICS ARCHIVE FOR CRITICAL, HIGH-VALUE, AND AGING ANALYTICS DATA

Archiving operational data for analytic purposes is on the rise. As more advanced forms of analytics have gained credence over the last 15 to 20 years, user organizations have been retaining more detailed source data. The traditional practice was to extract data from operational applications and other sources, process that data and load the results into a DW, then delete the extracted source data. The accepted practice today keeps most source data because it is also the preferred material for analytics based on data mining, statistical analyses, natural language processing, and SQL-based analytics.

An analytic archive and a data warehouse are similar but different. Because of the stepped-up data retention, the data staging areas within most data warehouse architectures today are bigger than their core warehouses. This is tantamount to data archiving, though few BI/DW professionals call it archiving. All they know is that they have to do something to improve the content and accessibility of their analytic data archives. Furthermore, they need to offload this burden from core warehouses, which have higher priorities than analytics (namely reporting, OLAP, and performance management). Hence, as BI/DW professionals ponder where to put certain classes of analytic data, they should consider a platform for active data archiving.

An analytic archive easily integrates with multi-platform DW architectures. DW system architectures have always been multi-platform, but this trend has accelerated in recent years as users have extended their DW environments by adding new platforms for columnar databases, appliances, NoSQL, and Hadoop. An additional platform, one that specializes in archiving data for advanced analytics, would wring more value from archived source data and easily integrate with multi-platform DW architectures.

A data archive can future-proof analytic applications. Most data warehouses are designed by their users (not vendors) for the data requirements of reporting, OLAP, and performance management. These practices need calculated, aggregated, standardized, and time-series numeric values modeled in multidimensional structures that don’t exist in source systems. Advanced analytics has different data requirements. It needs a very large store of unaltered (or lightly transformed) detailed source data. Other than that, it’s impossible to anticipate data requirements for future analytic applications (AA). Accordingly, an analytic archive preserves source data in its original form, so the source is there for future AAs to explore and repurpose.
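Both archive types depend on knowing how long data has been held. As a rough illustration of the tracking function described under Number Two, which flags data that has aged beyond its compliance requirement, the following Python sketch may help. The record classes, retention periods, and the `legal_hold` flag are all hypothetical placeholders chosen for illustration; real values come from the regulations that apply to a given industry.

```python
from dataclasses import dataclass
from datetime import date, timedelta

# Hypothetical retention periods (in days) per record class.
# These numbers are placeholders, not regulatory guidance.
RETENTION_DAYS = {"trade": 6 * 365, "call_record": 2 * 365}


@dataclass(frozen=True)
class ArchivedRecord:
    record_id: str
    record_class: str
    archived_on: date
    legal_hold: bool = False  # assumed flag: held records are never purged


def flag_expired(records, today):
    """Return IDs of records past their retention period and not on legal hold."""
    expired = []
    for rec in records:
        limit = timedelta(days=RETENTION_DAYS[rec.record_class])
        if not rec.legal_hold and today - rec.archived_on > limit:
            expired.append(rec.record_id)
    return expired
```

The function only reports candidates for deletion; in this sketch the purge itself is assumed to be a separate, audited step.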
  • 5. NUMBER FOUR: RETHINK HOW DATA IS COMMITTED TO AN ARCHIVE

A data archive has to be more than a dumping ground. For one thing, there needs to be a strategy based on new and evolving user requirements for aging, less frequently accessed data and other metrics for identifying which data should be archived at what level and on what schedule. Note that not all data should be archived: some data belongs elsewhere, say, in its original application database or in a data warehouse.

Archive specialists need to interview a broad range of business users and managers to determine users’ needs for archived data. If your organization has a legal department and compliance officers, give priority to their needs but without neglecting the rest of the enterprise.

On a technology level, develop interfaces and integration logic for getting data into the archive quickly and in lightly transformed states that are conducive to query and search, without altering the essential content of archived data. Finally, assume that all the data in the archive needs an audit trail and documentation (via metadata, etc.) that is sufficient to satisfy even the most aggressive users and auditors.

What if data comes from applications that have been upgraded or customized (which can alter data models)? Look for a data archiving platform that can manage changing data models. That way, the platform understands changes to source schema and adjusts metadata and pointers accordingly.

What if archived data comes from an application that was decommissioned (also known as application retirement)? When the only application that can read a dataset with full integrity is gone, that application’s data may need to be lightly transformed before entering an archive (or after it’s in the archive) so it can be easily accessed by common query and search tools. This practice is inspired by data warehousing but it does not require the full-blown time, skills, and expense of the average data warehouse.

Some archived data needs encryption (for security) or compression (to reduce its storage footprint). Look for a platform that can apply these and other data operations as data enters the archive or after data is in the archive. Furthermore, as data growth rates continue to rise over time and business demands for retaining older data grow, data should be stored in a compressed state to optimize storage capacity and scale over time. Similarly, the security classification of data can change as organizational rules and policies evolve.

NUMBER FIVE: RETHINK HOW ARCHIVED DATA IS ACCESSED AND USED ACTIVELY

Let’s be honest: We’ve all worked in organizations where archives were purely pro forma, without a credible effort to preserve data in a state that’s quickly or easily accessed by anyone, much less the growing number of employees who can benefit from accessing the information. Luckily, this old “worst practice” is giving way to the realization that all enterprise datasets, including archived data, are valuable assets that can contribute to many business goals. The recent craze for analytics with big data has led many organizations to seek more business value from their datasets.

With that in mind, active data archiving is a bit of a cultural shock in some organizations. To get past the shock, these organizations need upper management to define a mandate for modern archiving based on the following goals:

Archived data must be leveraged. Typical use cases include fast, documented auditing for compliance, a source for analytic applications, data exploration, and information lookups. Some data will come out of the archive to be used elsewhere. To enable a broad range of users, tools, and purposes, the archive should support both query and search mechanisms. Furthermore, the archive should serve as a source for other data platforms, especially those for business intelligence and analytics.

A growing constituency of users will have access to archived data. This is a sticky point in organizations that define data governance and compliance as the process of limiting data access. The catch is to balance access and control, typically through well-defined user types controlled via role-based user access and strong security features in the archival platform.

Accessing archived data will be timely. First, to be truly active, the archive must be online like a database, not offline like magnetic tapes and optical disks or any media that demand a distracting and time-consuming restoration process. Second, data access mechanisms should perform at or near real time for the sake of user productivity.
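The balance between access and control can be illustrated with a small role-based check. This is only a sketch under stated assumptions: the role names, the action names, and the `is_allowed` helper are hypothetical and are not taken from the report or from any particular archiving product.

```python
from dataclasses import dataclass

# Hypothetical user types and the archive actions each one may perform.
# Real deployments would usually load these from a directory or policy store.
ROLE_PERMISSIONS = {
    "auditor": {"query", "search", "export"},
    "analyst": {"query", "search"},
    "business_user": {"search"},
}


@dataclass(frozen=True)
class User:
    name: str
    role: str


def is_allowed(user: User, action: str) -> bool:
    """Return True if the user's role grants the requested archive action."""
    # Unknown roles get an empty permission set, so access is denied by default.
    return action in ROLE_PERMISSIONS.get(user.role, set())
```

In this sketch, widening the constituency of users means adding or assigning roles rather than granting ad hoc privileges, and anyone without a recognized role is denied by default.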
  • 6. NUMBER SIX: DEPLOY ARCHIVING SYSTEMS THAT HAVE MULTIPLE STORAGE AND PROCESSING TIERS

For a data archive to be truly active, its primary tier should be based on a robust database management system (DBMS). The DBMS must include traditional relational functions (for query and data exploration) and functions for multiple security strategies, scalability, and high availability. The assumptions here are that most data being archived will be structured and that most users and applications will need to access data via queries. Even so, some functions of the DBMS should be controlled; for example, inserting and updating data can destroy data’s original state, whereas appending data avoids such integrity problems. In addition to relational technology, free text search is critical to finding records of interest and to enabling non-technical users.

An active data archiving platform can host many archives, each with its own unique requirements, similar to how a DBMS can manage several databases (defined as collections of data). Thus, multi-tenancy is another key assumption for a modern data archive.

In most cases, an archive platform is not a data processing or analytics platform. Hence, archived data is best extracted, then moved to a DBMS or other data platform that is more conducive to in-database analytics, intense SQL-based analytics, and miscellaneous forms of advanced analytics. For these purposes, mature organizations already have in place relational data warehouses, columnar databases, and DW appliances, possibly NoSQL databases and Hadoop. As an exception, when an active archive runs atop Hadoop, it may make sense to process and analyze data on the same platform where it’s archived.

Note that the DBMS in the primary tier of a data archive does not replace other DBMSs, especially not those deployed for analytics. Instead, it complements them and (in addition to its archival purpose) serves as yet another source of data for analytics (largely historical data).

The storage tier of an active archive should be diverse. This is to accommodate subsystems users already have as well as newer commodity-priced types such as CAS hardware or the Hadoop Distributed File System (HDFS). Even a modern active archive might include systems for magnetic tape and optical disk in the storage tier. After all, many organizations have pre-existing mag tape or op disk libraries that they must maintain. Note that these archaic media are antithetical to an active data archive; if possible, their data should be migrated into the active archive so it’s online and available when users need it.

In the case of a compliance archive (for, say, a financial services institution), the archive must reside in a WORM storage platform. This, in turn, requires a DBMS that supports WORM devices. WORM technologies are worth the investment because they keep compliance and risk officers happy and they avoid fines, penalties, and damaging publicity.

Users should consider Hadoop as both a highly scalable storage platform for archiving and a low-cost processing platform for analytics. Note that open-source Hadoop’s poor support for two key standards, SQL (and other relational technologies) and security (especially LDAP and Linux PAM), keeps it unpalatable for mature IT organizations. Despite these two limitations, Hadoop has roles to play in multi-platform archive architectures. Hadoop excels with very large data volumes, as well as with file-based data, data documents (XML and JSON), textual content (e-mail and word processing files), unstructured and non-relational structured data, and schema-free data. Hadoop’s low price is appropriate to many kinds of lower-value (but high-volume) historic data, such as Web logs. However, due to limitations in current releases, purely open-source Hadoop may not be the best choice for structured data that needs relational processing (such as intense SQL or multi-way joins) or sensitive data that demands high security. That’s not a show stopper because a number of software vendors offer products that integrate with Hadoop to give it stronger and broader support for security and relational technologies like standard SQL.

Consider economics as you select platforms, tools, and features for a new active archiving architecture. For example, it’s technically possible to include almost any brand of relational DBMS in an archiving solution. However, the older and more mature vendor brands are relatively expensive, especially once an archive scales into multi-terabytes, and they include far more features and functions than are required for archiving. A more cost-effective choice is a DBMS designed for archiving or one of the newer columnar, open-source, or appliance-based DBMSs. In this context, Hadoop is affordable in terms of dollars per terabyte of storage. Similarly, data compression is a feature that can reduce storage costs because it reduces the footprint of archived data in storage.
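As a rough illustration of how compression reduces the storage footprint of high-volume, lower-value data such as Web logs, the following sketch compresses a batch of records with Python's standard `zlib` module and reports the resulting size ratio. The function names and the record layout are assumptions made for the example; commercial archiving platforms typically use their own, more specialized compression schemes.

```python
import zlib


def compress_record_batch(records: list[str]) -> bytes:
    """Join records with newlines and compress them at the highest zlib level."""
    raw = "\n".join(records).encode("utf-8")
    return zlib.compress(raw, 9)


def decompress_record_batch(blob: bytes) -> list[str]:
    """Reverse compress_record_batch; assumes records contain no newlines."""
    return zlib.decompress(blob).decode("utf-8").split("\n")


def footprint_ratio(records: list[str]) -> float:
    """Compressed size divided by raw size (lower means a smaller footprint)."""
    raw_size = len("\n".join(records).encode("utf-8"))
    if raw_size == 0:
        return 1.0  # nothing to compress, so report no savings
    return len(compress_record_batch(records)) / raw_size
```

Because the round trip is lossless, the essential content of the archived records is preserved. The ratio actually achieved depends heavily on how repetitive the data is.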
  • 7. NUMBER SEVEN: MAKE SECURITY A HIGH PRIORITY BECAUSE IT WILL MAKE OR BREAK AN ARCHIVE

Put succinctly, if an archive isn’t secure, it won’t meet the compliance goals that are its primary purpose. Furthermore, if users don’t trust the security of the archival platform, they won’t use it or its data, and the archive will fail to demonstrate a positive ROI.

The primary line of defense is the security layer built into the relational DBMS at the heart of an active data archiving platform. Most mature IT departments and DBMS teams prefer role-based approaches to security, and many have LDAP and other directories they’d like to reuse and apply within the active archiving solution.

If Hadoop is to be part of an active archive’s infrastructure, note that security in purely open-source Hadoop today is mostly about general access privileges controlled through Kerberos. However, a few third parties now offer add-on products that enable LDAP, Active Directory, and other approaches to security for the Hadoop family of products.

Almost all modern data archives are loaded with sensitive data about customers, partners, employees, Social Security numbers, credit card numbers, transactions, internal financials, and so on. Encryption or data masking can make this data unreadable in the eventuality of a hack or other unauthorized access.

Additional layers of data protection may be used to keep data locked and immutable. This provides evidence that data records and files have not been altered, which is fundamental to a credible audit. Likewise, records and files cannot be deleted before their retention periods expire.
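One way to picture evidence that records have not been altered, together with a retention rule, is the sketch below. It is a generic technique (a SHA-256 hash chain plus a date comparison) chosen for illustration; the report does not say which mechanism any archiving platform uses, and the function names here are hypothetical.

```python
import hashlib
from datetime import date


def chain_digest(previous_digest: str, payload: bytes) -> str:
    """Hash a record together with the digest of the record before it."""
    return hashlib.sha256(previous_digest.encode("ascii") + payload).hexdigest()


def build_chain(payloads: list[bytes]) -> list[str]:
    """Compute one digest per record; each digest depends on all earlier records."""
    digests = []
    previous = ""  # the first record has no predecessor
    for payload in payloads:
        previous = chain_digest(previous, payload)
        digests.append(previous)
    return digests


def verify_chain(payloads: list[bytes], digests: list[str]) -> bool:
    """Return True only if every stored digest still matches its record."""
    return build_chain(payloads) == digests


def may_delete(retain_until: date, today: date) -> bool:
    """A record may be deleted only after its retention period has expired."""
    return today > retain_until
```

If the digests are stored separately from the records (for example, on WORM media), a change to any record makes `verify_chain` fail for that record and every record after it, which is the kind of evidence an auditor can check independently.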
  • 8. ABOUT OUR SPONSOR

www.rainstor.com

RainStor provides the world’s most efficient database solutions that reduce the cost, complexity, and compliance risk of managing data. Delivering solutions to the enterprise, you can quickly deploy an Analytical Archive or Compliance Archive so you continue to create business value and stay compliant. RainStor runs anywhere: on-premises or in the cloud and natively on Hadoop. Among RainStor’s customers are 20 of the world’s largest communications providers and 10 of the biggest banks and financial services organizations, which use RainStor to manage historical data, while saving millions. For more info: www.rainstor.com or join the conversation: @rainstor.

ABOUT THE AUTHOR

Philip Russom is the research director for data management at The Data Warehousing Institute (TDWI), where he oversees many of TDWI’s research-oriented publications, services, and events. He’s been an industry analyst at Forrester Research and Giga Information Group, where he researched, wrote, spoke, and consulted about BI issues. Before that, Russom worked in technical and marketing positions for various database vendors. Over the years, Russom has produced over 500 publications and speeches. You can reach him at prussom@tdwi.org.

ABOUT TDWI RESEARCH

TDWI Research provides research and advice for business intelligence and data warehousing professionals worldwide. TDWI Research focuses exclusively on BI/DW issues and teams up with industry thought leaders and practitioners to deliver both broad and deep understanding of the business and technical challenges surrounding the deployment and use of business intelligence and data warehousing solutions. TDWI Research offers in-depth research reports, commentary, inquiry services, and topical conferences as well as strategic planning services to user and vendor organizations.

ABOUT THE TDWI CHECKLIST REPORT SERIES

TDWI Checklist Reports provide an overview of success factors for a specific project in business intelligence, data warehousing, or a related data management discipline. Companies may use this overview to get organized before beginning a project or to identify goals and areas of improvement for current projects.