Data Analytics
Lifecycle
Data Analytics Lifecycle
• Big Data analysis differs from traditional data
analysis primarily due to the volume, velocity, and
variety characteristics of the data being
processed.
• To address the distinct requirements for
performing analysis on Big Data, a step-by-step
methodology is needed to organize the activities
and tasks involved with acquiring, processing,
analyzing and repurposing data.
Data Analytics Lifecycle (cont.)
From a Big Data adoption and planning perspective, it is important
that in addition to the lifecycle, consideration be made for issues of
training, education, tooling and staffing of a data analytics team.
Key Roles for a
Successful Analytics Project
• Business User – understands the domain area
• Project Sponsor – provides requirements
• Project Manager – ensures meeting objectives
• Business Intelligence Analyst – provides business domain expertise based on deep understanding of the data
• Database Administrator (DBA) – creates DB environment
• Data Engineer – provides technical skills, assists data management and extraction, supports analytic sandbox
• Data Scientist – provides analytic techniques and modeling
Data Analytics Lifecycle (cont.)
• The Data Analytics Lifecycle is designed for Big Data problems and data science projects.
• The cycle is iterative to represent a real project.
• Work can return to earlier phases as new information is uncovered.
Data Analytics Lifecycle-Abstract view
Phase 1: Discovery
In this phase:
• The data science team must learn and investigate the problem.
• The team develops context and understanding.
• The team learns about the data sources needed and available for the project.
• In addition, the team formulates initial hypotheses that can later be tested with data.
Phase 1: Discovery (cont.)
The team should perform five main activities during this step of the discovery phase:
• Identify data sources: Make a list of data sources the team may need to test the initial hypotheses outlined in this phase. Make an inventory of the datasets currently available and those that can be purchased or otherwise acquired for the tests the team wants to perform.
• Capture aggregate data sources: This is for previewing the data and providing high-level understanding. It enables the team to gain a quick overview of the data and perform further exploration on specific areas.
• Review the raw data: Begin understanding the interdependencies among the data attributes. Become familiar with the content of the data, its quality, and its limitations.
Phase 1: Discovery (cont.)
• Evaluate the data structures and tools needed: The data type and structure dictate which tools the team can use to analyze the data.
• Scope the sort of data infrastructure needed for this type of problem: In addition to the tools needed, the data influences the kind of infrastructure that's required, such as disk storage and network capacity.
• Unlike many traditional stage-gate processes, in which the team can advance only when specific criteria are met, the Data Analytics Lifecycle is intended to accommodate more ambiguity.
• For each phase of the process, it is recommended to pass certain checkpoints as a way of gauging whether the team is ready to move to the next phase of the Data Analytics Lifecycle.
Phase 2: Data preparation
• This phase includes steps to explore, preprocess, and condition data prior to modeling and analysis.
Phase 2: Data preparation (cont.)
• It requires the presence of an analytic sandbox (workspace), in which the team can work with data and perform analytics for the duration of the project.
• The team needs to execute extract, load, and transform (ELT) or extract, transform, and load (ETL) to get data into the sandbox.
• In ETL, users extract data from a datastore, perform data transformations, and load the data back into the datastore. In ELT, the data is loaded into the sandbox first and transformed there, so the raw data remains available.
• Combined, ELT and ETL are sometimes abbreviated as ETLT. Data should be transformed in the ETLT process so the team can work with it and analyze it.
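The difference in ordering can be sketched with Python's built-in sqlite3 module. This is a minimal illustration, not a prescribed implementation: the in-memory database stands in for the analytic sandbox, and the table names `raw_sales` and `sales_clean`, the helper `to_float`, and the sample rows are all assumptions. The raw rows are loaded first and transformed afterwards inside the sandbox, which is the ELT ordering.

```python
import sqlite3

# Hypothetical source extract: (customer, amount as raw text).
source_rows = [("alice", " 10.50 "), ("BOB", "7"), ("carol", "n/a")]

def to_float(text):
    """Return the amount as a float, or None if it cannot be parsed."""
    try:
        return float(text.strip())
    except ValueError:
        return None

sandbox = sqlite3.connect(":memory:")  # stands in for the analytic sandbox

# Extract + Load: copy the raw rows into the sandbox unchanged.
sandbox.execute("CREATE TABLE raw_sales (customer TEXT, amount TEXT)")
sandbox.executemany("INSERT INTO raw_sales VALUES (?, ?)", source_rows)

# Transform: build a cleaned table inside the sandbox; raw_sales stays intact.
sandbox.execute("CREATE TABLE sales_clean (customer TEXT, amount REAL)")
raw = sandbox.execute("SELECT customer, amount FROM raw_sales").fetchall()
for customer, amount in raw:
    value = to_float(amount)
    if value is not None:
        sandbox.execute(
            "INSERT INTO sales_clean VALUES (?, ?)", (customer.lower(), value)
        )

raw_count = sandbox.execute("SELECT COUNT(*) FROM raw_sales").fetchone()[0]
clean_rows = sandbox.execute(
    "SELECT customer, amount FROM sales_clean ORDER BY customer"
).fetchall()
```

Because the raw table is kept, a later analysis can apply a different transformation without going back to the source system.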
Phase 2: Data preparation
Rules for Analytics Sandbox
• When developing the analytic sandbox, collect all kinds of data there, as team members need access to high volumes and varieties of data for a Big Data analytics project.
• This can include everything from summary-level aggregated data, structured data, raw data feeds, and unstructured text data from call logs or web logs, depending on the kind of analysis the team plans to undertake.
• A good rule is to plan for the sandbox to be at least 5–10 times the size of the original datasets, partly because copies of the data may be created that serve as specific tables or data stores for specific kinds of analysis in the project.
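The sizing rule of thumb above is simple arithmetic, sketched here as a small helper. The function name `sandbox_size_range` and the 200 GB figure are hypothetical; only the 5x and 10x factors come from the rule itself.

```python
def sandbox_size_range(original_gb, low_factor=5, high_factor=10):
    """Return the (minimum, maximum) planned sandbox size in GB."""
    if original_gb < 0:
        raise ValueError("original_gb must be non-negative")
    return original_gb * low_factor, original_gb * high_factor

# Hypothetical project with 200 GB of original datasets.
low, high = sandbox_size_range(200)
print(f"Plan for roughly {low} to {high} GB of sandbox storage")
```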
Phase 2: Data preparation
Performing ETLT
• As part of the ETLT step, it is advisable to make an inventory of the data and compare the data currently available with datasets the team needs.
• Performing this sort of gap analysis provides a framework for understanding which datasets the team can take advantage of today and where the team needs to initiate projects for data collection or access to new datasets currently unavailable.
• A component of this subphase involves extracting data from the available sources and determining data connections for raw data, online transaction processing (OLTP) databases, online analytical processing (OLAP) cubes, or other data feeds.
• Data conditioning refers to the process of cleaning data, normalizing datasets, and performing transformations on the data.
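The inventory comparison and the conditioning step described above can be sketched in plain Python. Everything named here is hypothetical: the dataset names, the `name` and `spend` fields, and the choice of min-max scaling as the normalization. Treat it as one possible illustration rather than the prescribed procedure.

```python
# Gap analysis: compare what the team needs with what is available today.
needed = {"web_logs", "call_logs", "crm_accounts", "billing"}
available = {"web_logs", "crm_accounts", "survey_results"}

usable_today = sorted(needed & available)  # can be used right away
gaps = sorted(needed - available)          # must be collected or acquired

def condition(records):
    """Clean names and min-max normalize the 'spend' field to [0, 1]."""
    cleaned = [
        {"name": r["name"].strip().title(), "spend": float(r["spend"])}
        for r in records
        if r.get("name") and r.get("spend") not in (None, "")
    ]
    if not cleaned:
        return []
    lowest = min(r["spend"] for r in cleaned)
    highest = max(r["spend"] for r in cleaned)
    span = (highest - lowest) or 1.0  # avoid dividing by zero if all values match
    return [{**r, "spend": (r["spend"] - lowest) / span} for r in cleaned]

raw = [
    {"name": "  acme corp ", "spend": "100"},
    {"name": "globex", "spend": "300"},
    {"name": "", "spend": "50"},         # dropped: missing name
    {"name": "initech", "spend": None},  # dropped: missing spend
]
conditioned = condition(raw)
```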
Common Tools for the Data Preparation Phase
Several tools are commonly used for this phase:
• Hadoop can perform massively parallel ingest and custom analysis for web traffic analysis, GPS location analytics, and combining of massive unstructured data feeds from multiple sources.
• Alpine Miner provides a graphical user interface (GUI) for creating analytic workflows, including data manipulations and a series of analytic events such as staged data-mining techniques (for example, first select the top 100 customers, and then run descriptive statistics and clustering).
• OpenRefine (formerly called Google Refine) is “a free, open source, powerful tool for working with messy data.” It is a GUI-based tool for performing data transformations, and it's one of the most robust free tools currently available.
• Similar to OpenRefine, Data Wrangler is an interactive tool for data cleaning and transformation. Wrangler was developed at Stanford University and can be used to perform many transformations on a given dataset.
Phase 3: Model Planning
• Phase 3 is model planning, where the team determines the methods, techniques, and workflow it intends to follow for the subsequent model building phase.
• The team explores the data to learn about the relationships between variables and subsequently selects key variables and the most suitable models.
• During this phase, the team refers to the hypotheses developed in Phase 1, when it first became acquainted with the data and began to understand the business problems or domain area.
Common Tools for the Model Planning Phase
Here are several of the more common ones:
• R has a complete set of modeling capabilities and provides a good environment for building interpretive models with high-quality code. In addition, it has the ability to interface with databases via an ODBC connection and execute statistical tests.
• SQL Analysis Services can perform in-database analytics of common data mining functions, involved aggregations, and basic predictive models.
• SAS/ACCESS provides integration between SAS and the analytics sandbox via multiple data connectors such as ODBC, JDBC, and OLE DB. SAS itself is generally used on file extracts, but with SAS/ACCESS, users can connect to relational databases (such as Oracle or Teradata).
Phase 4: Model Building
• In this phase, the data science team develops datasets for training, testing, and production purposes. These datasets enable the data scientist to develop the analytical model and train it ("training data"), while holding aside some of the data ("hold-out data" or "test data") for testing the model.
• In addition, in this phase the team builds and executes models based on the work done in the model planning phase.
• The team also considers whether its existing tools will be sufficient for running the models, or whether it will need a more robust environment for executing models and workflows (for example, fast hardware and parallel processing, if applicable).
• Free or open source tools: R and PL/R, Octave, WEKA, Python
• Commercial tools: MATLAB, STATISTICA
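The split into training data and hold-out data described for the model building phase can be sketched as follows. The helper name `train_test_split`, the 80/20 proportion, and the fixed seed are assumptions chosen for illustration; the slides do not prescribe a particular ratio.

```python
import random

def train_test_split(rows, test_fraction=0.2, seed=42):
    """Shuffle a copy of rows and split it into (training, hold-out) lists."""
    if not 0 < test_fraction < 1:
        raise ValueError("test_fraction must be between 0 and 1")
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

records = list(range(100))  # stand-in for 100 labeled observations
train, test = train_test_split(records)
```

The model is fitted only on `train`; `test` is set aside until the model is evaluated, so the evaluation uses data the model has not seen.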
Phase 5: Communicate Results
• In Phase 5, after executing the model, the team needs to compare the outcomes of the modeling to the criteria established for success and failure.
• The team considers how best to articulate the findings and outcomes to the various team members and stakeholders, taking into account warnings, assumptions, and any limitations of the results.
• The team should identify key findings, quantify the business value, and develop a narrative to summarize and convey findings to stakeholders.
Phase 6: Operationalize
• In the final phase (Phase 6, Operationalize), the team communicates the benefits of the project more broadly and sets up a pilot project to deploy the work in a controlled way before broadening the work to a full enterprise or ecosystem of users.
• This approach enables the team to learn about the performance and related constraints of the model in a production environment on a small scale and make adjustments before a full deployment.
• The team delivers final reports, briefings, code, and technical documents. In addition, the team may run a pilot project to implement the models in a production environment.
Common Tools for the Model Building Phase
Free or Open Source tools:
• R and PL/R: R was described earlier in the model planning phase, and PL/R is a procedural language for PostgreSQL with R. Using this approach means that R commands can be executed in-database.
• Octave, a free software programming language for computational modeling, has some of the functionality of MATLAB. Because it is freely available, Octave is used in major universities when teaching machine learning.
• WEKA is a free data mining software package with an analytic workbench. The functions created in WEKA can be executed within Java code.
• Python is a programming language that provides toolkits for machine learning and analysis, such as scikit-learn, NumPy, SciPy, and pandas, and related data visualization using matplotlib.
• SQL in-database implementations, such as MADlib, provide an alternative to in-memory desktop analytical tools.
• MADlib provides an open-source machine learning library of algorithms that can be executed in-database, for PostgreSQL or Greenplum.
Key outputs for each of the main stakeholders
The main stakeholders of an analytics project usually expect the following at the conclusion of a project:
• Business User typically tries to determine the benefits and implications of the findings to the business.
• Project Sponsor typically asks questions related to the business impact of the project, the risks and return on investment (ROI), and the way the project can be evangelized within the organization (and beyond).
• Project Manager needs to determine if the project was completed on time and within budget and how well the goals were met.
• Business Intelligence Analyst needs to know if the reports and dashboards he manages will be impacted and need to change.
• Data Engineer and Database Administrator (DBA) typically need to share their code from the analytics project and create a technical document on how to implement it.
• Data Scientist needs to share the code and explain the model to her peers, managers, and other stakeholders.
Data Analytics Lifecycle Overview
The Big Data analytics lifecycle can be divided into the following nine stages:
1. Business Case Evaluation
2. Data Identification
3. Data Acquisition & Filtering
4. Data Extraction
5. Data Validation & Cleansing
6. Data Aggregation & Representation
7. Data Analysis
8. Data Visualization
9. Utilization of Analysis Results
1. Business Case Evaluation
• Before any Big Data project can be started, it needs to be clear what the business objectives and results of the data analysis should be.
• This initial phase focuses on understanding the project objectives and requirements from a business perspective, and then converting this knowledge into a data mining problem definition.
• A preliminary plan is designed to achieve the objectives. A decision model, especially one built using the Decision Model and Notation standard, can be used.
• Once an overall business problem is defined, the problem is converted into an analytical problem.
2. Data Identification
• The Data Identification stage determines the origin of data. Before data can be analysed, it is important to know what the sources of the data will be.
• Especially if data is procured from external suppliers, it is necessary to clearly identify what the original source of the data is and how reliable the dataset is (frequently referred to as the veracity of the data).
• The second stage of the Big Data Lifecycle is very important, because if the input data is unreliable, the output will be unreliable as well.
• Identifying a wider variety of data sources may increase the probability of finding hidden patterns and correlations.
3. Data Acquisition and Filtering
• The Data Acquisition and Filtering stage builds upon the previous stage of the Big Data Lifecycle.
• In this stage, the data is gathered from different sources, both from within the company and outside of the company.
• After the acquisition, a first step of filtering is conducted to filter out corrupt data.
• Additionally, data that is not necessary for the analysis will be filtered out as well.
• The filtering step is applied to each data source individually, that is, before the data is aggregated into the data warehouse.
• In many cases, especially where external, unstructured data is concerned, some or most of the acquired data may be irrelevant (noise) and can be discarded as part of the filtering process.
3. Data Acquisition and Filtering (cont.)
• Data classified as “corrupt” can include records with missing or nonsensical values or invalid data types. Data that is filtered out for one analysis may possibly be valuable for a different type of analysis.
• Metadata can be added via automation to data from both internal and external data sources to improve the classification and querying.
• Examples of appended metadata include dataset size and structure, source information, date and time of creation or collection, and language-specific information.
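The filtering and metadata steps described above can be sketched as follows. The corruption rules, the field names `user_id` and `age`, and the source label `crm_export` are hypothetical; a real project would define its own rules per data source.

```python
from datetime import datetime, timezone

def is_corrupt(record):
    """Treat a record as corrupt if a field is missing, mistyped, or nonsensical."""
    age = record.get("age")
    return (
        not isinstance(record.get("user_id"), str)
        or not isinstance(age, int)
        or age < 0
    )

def acquire(records, source):
    """Split records into kept and rejected, and describe the kept batch."""
    kept = [r for r in records if not is_corrupt(r)]
    # Rejected records are returned, not deleted: they may suit another analysis.
    rejected = [r for r in records if is_corrupt(r)]
    metadata = {
        "source": source,
        "record_count": len(kept),
        "fields": sorted(kept[0]) if kept else [],
        "collected_at": datetime.now(timezone.utc).isoformat(),
    }
    return kept, rejected, metadata

batch = [
    {"user_id": "u1", "age": 34},
    {"user_id": "u2", "age": -5},     # nonsensical value
    {"user_id": "u3", "age": "n/a"},  # invalid data type
    {"age": 41},                      # missing value
]
kept, rejected, metadata = acquire(batch, source="crm_export")
```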
4. Data Extraction
• Some of the data identified in the two previous stages may be incompatible with the Big Data tool that will perform the actual analysis.
• To deal with this problem, the Data Extraction stage is dedicated to extracting different data formats from data sets (e.g. the data source) and transforming these into a format the Big Data tool is able to process and analyse.
• The complexity of the transformation and the extent to which data must be transformed depend greatly on the Big Data tool that has been selected.
• The Data Extraction lifecycle stage is dedicated to extracting disparate data and transforming it into a format that the underlying Big Data solution can use for the purpose of the data analysis.
4. Data Extraction (cont.)
(A) Illustrates the extraction of comments and a user ID embedded within an XML document without the need for further transformation.
(B) Demonstrates the extraction of the latitude and longitude coordinates of a user from a single JSON field.
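Both extraction cases, (A) and (B), can be sketched with Python's standard library. The XML and JSON documents below are invented for illustration; the element names, the `id` attribute, and the comma-separated `location` field are assumptions, since the original figures are not reproduced here.

```python
import json
import xml.etree.ElementTree as ET

# (A) Hypothetical XML document with an embedded user ID and comment.
xml_doc = """
<post>
  <user id="42">jdoe</user>
  <comment>Great product, fast delivery.</comment>
</post>
""".strip()
root = ET.fromstring(xml_doc)
user_id = root.find("user").get("id")
comment = root.find("comment").text

# (B) Hypothetical JSON record where both coordinates share a single field.
json_doc = '{"user": "jdoe", "location": "48.8566,2.3522"}'
record = json.loads(json_doc)
latitude, longitude = (float(part) for part in record["location"].split(","))
```

In case (A) the values are usable as extracted, while case (B) needs one extra step to split the combined field into two numeric values.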
5. Data Validation and Cleansing
• Data that is invalid leads to invalid results. In order to ensure only the appropriate data is analysed, the Data Validation and Cleansing stage of the Big Data Lifecycle is required.
• During this stage, data is validated against a set of predetermined conditions and rules in order to ensure the data is not corrupt.
• An example of a validation rule would be to exclude all persons older than 100 years, since it is very unlikely that data about these persons is correct.
• The Data Validation and Cleansing stage is dedicated to establishing often complex validation rules and removing any known invalid data.
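A rule-based check like the age rule above might look like this in Python. The 100-year limit comes from the example; the lower bound of 0, the required `name` field, and the function name `validate` are additional assumptions made for illustration.

```python
def validate(record, max_age=100):
    """Return a list of rule violations; an empty list means the record is valid."""
    problems = []
    age = record.get("age")
    if not isinstance(age, (int, float)):
        problems.append("age is missing or not numeric")
    elif age < 0 or age > max_age:
        problems.append(f"age {age} is outside 0-{max_age}")
    if not record.get("name"):
        problems.append("name is missing")
    return problems

people = [
    {"name": "Ana", "age": 37},
    {"name": "Old Record", "age": 140},
    {"name": "", "age": 25},
]
valid = [p for p in people if not validate(p)]
invalid = [(p, validate(p)) for p in people if validate(p)]
```

Returning the list of violations, rather than a plain true or false, makes it easier to report why a record was removed.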
5. Data Validation and Cleansing (cont.)
• For example, as illustrated in Fig. 1, the first value in Dataset B is validated against its corresponding value in Dataset A.
• The second value in Dataset B is not validated against its corresponding value in Dataset A. If a value is missing, it is inserted from Dataset A.
• Data validation can be used to examine interconnected datasets in order to fill in missing valid data.
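The cross-dataset check described above can be sketched with two small dictionaries keyed by record ID. The city values are invented, and treating Dataset A as the trusted reference is an assumption taken from the description rather than from the figure itself.

```python
# Dataset A is treated as the trusted reference; Dataset B has one mismatch and one gap.
dataset_a = {1: "Paris", 2: "Berlin", 3: "Madrid"}
dataset_b = {1: "Paris", 2: "Munich", 3: None}

validated, mismatched, filled = [], [], []
repaired_b = dict(dataset_b)
for key, value in dataset_b.items():
    reference = dataset_a.get(key)
    if value is None:
        if reference is not None:
            repaired_b[key] = reference  # insert the missing value from Dataset A
            filled.append(key)
    elif value == reference:
        validated.append(key)
    else:
        mismatched.append(key)  # flagged for review rather than overwritten
```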
6. Data Aggregation and Representation
• Data may be spread across multiple datasets, requiring that the datasets be joined together to conduct the actual analysis.
• In order to ensure only the correct data will be analysed in the next stage, it might be necessary to integrate multiple datasets.
• The Data Aggregation and Representation stage is dedicated to integrating multiple datasets to arrive at a unified view.
• Additionally, data aggregation will greatly speed up the analysis process of the Big Data tool, because the tool will not be required to join different tables from different datasets.
6. Data Aggregation and Representation (cont.)
• Performing this stage can become complicated because of differences in:
  • Data Structure – Although the data format may be the same, the data model may be different.
  • Semantics – A value that is labeled differently in two different datasets may mean the same thing, for example “surname” and “last name.”
• Whether data aggregation is required or not, it is important to understand that the same data can be stored in many different forms.
A simple example of data aggregation where two datasets are aggregated together using the Id field.
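Aggregating two datasets on a shared Id field can be sketched as a simple inner join. The `customers` and `orders` records are hypothetical; only the use of an `Id` field as the join key comes from the example above.

```python
# Two hypothetical datasets that share an Id field.
customers = [
    {"Id": 1, "name": "Ana"},
    {"Id": 2, "name": "Ben"},
    {"Id": 3, "name": "Chen"},
]
orders = [
    {"Id": 1, "total": 120.0},
    {"Id": 3, "total": 75.5},
]

orders_by_id = {row["Id"]: row for row in orders}  # index one side for fast lookup

# Inner join: keep only the Ids that are present in both datasets.
unified = [
    {**customer, **orders_by_id[customer["Id"]]}
    for customer in customers
    if customer["Id"] in orders_by_id
]
```

If the two datasets labeled the key differently (for example `Id` and `customer_id`), the semantic difference would have to be resolved before the join.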
7. Data Analysis
• The Data Analysis stage of the Big Data Lifecycle is dedicated to carrying out the actual analysis task.
• It runs the code or algorithm that makes the calculations that will lead to the actual result.
• Data Analysis can be simple or really complex, depending on the required analysis type.
• In this stage the ‘actual value’ of the Big Data project will be generated. If all previous stages have been executed carefully, the results will be factual and correct.
• Depending on the type of analytic result required, this stage can be as simple as querying a dataset to compute an aggregation for comparison.
• On the other hand, it can be as challenging as combining data mining and complex statistical analysis techniques to discover patterns and anomalies or to generate a statistical or mathematical model to depict relationships between variables.
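The simple end of that range, computing an aggregation for comparison, can be sketched in a few lines. The `sales` records and the choice of an average per region are assumptions made for illustration.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical sales records.
sales = [
    {"region": "north", "amount": 100.0},
    {"region": "south", "amount": 80.0},
    {"region": "north", "amount": 140.0},
    {"region": "south", "amount": 60.0},
]

# Group the amounts by region.
amounts_by_region = defaultdict(list)
for row in sales:
    amounts_by_region[row["region"]].append(row["amount"])

# Aggregate each group, then compare the groups.
average_by_region = {
    region: mean(values) for region, values in amounts_by_region.items()
}
best_region = max(average_by_region, key=average_by_region.get)
```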
7. Data Analysis (cont.)
• Data analysis can be classified as confirmatory analysis or exploratory analysis, the latter of which is linked to data mining:
  • Confirmatory data analysis is a deductive approach where the cause of the phenomenon being investigated is proposed beforehand. The proposed cause or assumption is called a hypothesis.
  • Exploratory data analysis is an inductive approach that is closely associated with data mining. No hypothesis or predetermined assumptions are generated. Instead, the data is explored through analysis to develop an understanding of the cause of the phenomenon.
8. Data Visualization
• The ability to analyse massive amounts of data and find useful insight is one thing; communicating the results in a way that everybody can understand is something completely different.
• The Data Visualization stage is dedicated to using data visualization techniques and tools to graphically communicate the analysis results for effective interpretation by business users. Frequently this requires plotting data points in charts, graphs, or maps.
• The results of completing the Data Visualization stage provide users with the ability to perform visual analysis, allowing for the discovery of answers to questions that users have not yet even formulated.
8. Data Visualization (cont.)
9. Utilization of Analysis Results
• After the data analysis has been performed and the results have been presented, the final step of the Big Data Lifecycle is to use the results in practice.
• The Utilization of Analysis Results stage is dedicated to determining how and where the processed data can be further utilised to leverage the result of the Big Data project.
• Depending on the nature of the analysis problems being addressed, it is possible for the analysis results to produce “models” that encapsulate new insights and understandings about the nature of the patterns and relationships that exist within the data that was analyzed.
• A model may look like a mathematical equation or a set of rules. Models can be used to improve business process logic and application system logic, and they can form the basis of a new system or software program.
 
Lecture3 business intelligence
Lecture3 business intelligenceLecture3 business intelligence
Lecture3 business intelligencehktripathy
 
Lecture1 introduction to big data
Lecture1 introduction to big dataLecture1 introduction to big data
Lecture1 introduction to big datahktripathy
 
Lect9 Decision tree
Lect9 Decision treeLect9 Decision tree
Lect9 Decision treehktripathy
 
Lect8 Classification & prediction
Lect8 Classification & predictionLect8 Classification & prediction
Lect8 Classification & predictionhktripathy
 
Lect7 Association analysis to correlation analysis
Lect7 Association analysis to correlation analysisLect7 Association analysis to correlation analysis
Lect7 Association analysis to correlation analysishktripathy
 
Lect6 Association rule & Apriori algorithm
Lect6 Association rule & Apriori algorithmLect6 Association rule & Apriori algorithm
Lect6 Association rule & Apriori algorithmhktripathy
 
Lect5 principal component analysis
Lect5 principal component analysisLect5 principal component analysis
Lect5 principal component analysishktripathy
 
Lect4 principal component analysis-I
Lect4 principal component analysis-ILect4 principal component analysis-I
Lect4 principal component analysis-Ihktripathy
 
Lect 3 background mathematics for Data Mining
Lect 3 background mathematics for Data MiningLect 3 background mathematics for Data Mining
Lect 3 background mathematics for Data Mininghktripathy
 
Lect 2 getting to know your data
Lect 2 getting to know your dataLect 2 getting to know your data
Lect 2 getting to know your datahktripathy
 
Lect 1 introduction
Lect 1 introductionLect 1 introduction
Lect 1 introductionhktripathy
 

Mehr von hktripathy (18)

Lect 3 background mathematics
Lect 3 background mathematicsLect 3 background mathematics
Lect 3 background mathematics
 
Lect 2 getting to know your data
Lect 2 getting to know your dataLect 2 getting to know your data
Lect 2 getting to know your data
 
Lect 1 introduction
Lect 1 introductionLect 1 introduction
Lect 1 introduction
 
Lecture7.1 data sampling
Lecture7.1 data samplingLecture7.1 data sampling
Lecture7.1 data sampling
 
Lecture5 virtualization
Lecture5 virtualizationLecture5 virtualization
Lecture5 virtualization
 
Lecture6 introduction to data streams
Lecture6 introduction to data streamsLecture6 introduction to data streams
Lecture6 introduction to data streams
 
Lecture4 big data technology foundations
Lecture4 big data technology foundationsLecture4 big data technology foundations
Lecture4 big data technology foundations
 
Lecture3 business intelligence
Lecture3 business intelligenceLecture3 business intelligence
Lecture3 business intelligence
 
Lecture1 introduction to big data
Lecture1 introduction to big dataLecture1 introduction to big data
Lecture1 introduction to big data
 
Lect9 Decision tree
Lect9 Decision treeLect9 Decision tree
Lect9 Decision tree
 
Lect8 Classification & prediction
Lect8 Classification & predictionLect8 Classification & prediction
Lect8 Classification & prediction
 
Lect7 Association analysis to correlation analysis
Lect7 Association analysis to correlation analysisLect7 Association analysis to correlation analysis
Lect7 Association analysis to correlation analysis
 
Lect6 Association rule & Apriori algorithm
Lect6 Association rule & Apriori algorithmLect6 Association rule & Apriori algorithm
Lect6 Association rule & Apriori algorithm
 
Lect5 principal component analysis
Lect5 principal component analysisLect5 principal component analysis
Lect5 principal component analysis
 
Lect4 principal component analysis-I
Lect4 principal component analysis-ILect4 principal component analysis-I
Lect4 principal component analysis-I
 
Lect 3 background mathematics for Data Mining
Lect 3 background mathematics for Data MiningLect 3 background mathematics for Data Mining
Lect 3 background mathematics for Data Mining
 
Lect 2 getting to know your data
Lect 2 getting to know your dataLect 2 getting to know your data
Lect 2 getting to know your data
 
Lect 1 introduction
Lect 1 introductionLect 1 introduction
Lect 1 introduction
 

Kürzlich hochgeladen

HMCS Vancouver Pre-Deployment Brief - May 2024 (Web Version).pptx
HMCS Vancouver Pre-Deployment Brief - May 2024 (Web Version).pptxHMCS Vancouver Pre-Deployment Brief - May 2024 (Web Version).pptx
HMCS Vancouver Pre-Deployment Brief - May 2024 (Web Version).pptxmarlenawright1
 
Accessible Digital Futures project (20/03/2024)
Accessible Digital Futures project (20/03/2024)Accessible Digital Futures project (20/03/2024)
Accessible Digital Futures project (20/03/2024)Jisc
 
Python Notes for mca i year students osmania university.docx
Python Notes for mca i year students osmania university.docxPython Notes for mca i year students osmania university.docx
Python Notes for mca i year students osmania university.docxRamakrishna Reddy Bijjam
 
HMCS Max Bernays Pre-Deployment Brief (May 2024).pptx
HMCS Max Bernays Pre-Deployment Brief (May 2024).pptxHMCS Max Bernays Pre-Deployment Brief (May 2024).pptx
HMCS Max Bernays Pre-Deployment Brief (May 2024).pptxEsquimalt MFRC
 
Jual Obat Aborsi Hongkong ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Hongkong ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...Jual Obat Aborsi Hongkong ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Hongkong ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...ZurliaSoop
 
Food safety_Challenges food safety laboratories_.pdf
Food safety_Challenges food safety laboratories_.pdfFood safety_Challenges food safety laboratories_.pdf
Food safety_Challenges food safety laboratories_.pdfSherif Taha
 
On National Teacher Day, meet the 2024-25 Kenan Fellows
On National Teacher Day, meet the 2024-25 Kenan FellowsOn National Teacher Day, meet the 2024-25 Kenan Fellows
On National Teacher Day, meet the 2024-25 Kenan FellowsMebane Rash
 
General Principles of Intellectual Property: Concepts of Intellectual Proper...
General Principles of Intellectual Property: Concepts of Intellectual  Proper...General Principles of Intellectual Property: Concepts of Intellectual  Proper...
General Principles of Intellectual Property: Concepts of Intellectual Proper...Poonam Aher Patil
 
Beyond_Borders_Understanding_Anime_and_Manga_Fandom_A_Comprehensive_Audience_...
Beyond_Borders_Understanding_Anime_and_Manga_Fandom_A_Comprehensive_Audience_...Beyond_Borders_Understanding_Anime_and_Manga_Fandom_A_Comprehensive_Audience_...
Beyond_Borders_Understanding_Anime_and_Manga_Fandom_A_Comprehensive_Audience_...Pooja Bhuva
 
The basics of sentences session 3pptx.pptx
The basics of sentences session 3pptx.pptxThe basics of sentences session 3pptx.pptx
The basics of sentences session 3pptx.pptxheathfieldcps1
 
UGC NET Paper 1 Mathematical Reasoning & Aptitude.pdf
UGC NET Paper 1 Mathematical Reasoning & Aptitude.pdfUGC NET Paper 1 Mathematical Reasoning & Aptitude.pdf
UGC NET Paper 1 Mathematical Reasoning & Aptitude.pdfNirmal Dwivedi
 
Application orientated numerical on hev.ppt
Application orientated numerical on hev.pptApplication orientated numerical on hev.ppt
Application orientated numerical on hev.pptRamjanShidvankar
 
80 ĐỀ THI THỬ TUYỂN SINH TIẾNG ANH VÀO 10 SỞ GD – ĐT THÀNH PHỐ HỒ CHÍ MINH NĂ...
80 ĐỀ THI THỬ TUYỂN SINH TIẾNG ANH VÀO 10 SỞ GD – ĐT THÀNH PHỐ HỒ CHÍ MINH NĂ...80 ĐỀ THI THỬ TUYỂN SINH TIẾNG ANH VÀO 10 SỞ GD – ĐT THÀNH PHỐ HỒ CHÍ MINH NĂ...
80 ĐỀ THI THỬ TUYỂN SINH TIẾNG ANH VÀO 10 SỞ GD – ĐT THÀNH PHỐ HỒ CHÍ MINH NĂ...Nguyen Thanh Tu Collection
 
Kodo Millet PPT made by Ghanshyam bairwa college of Agriculture kumher bhara...
Kodo Millet  PPT made by Ghanshyam bairwa college of Agriculture kumher bhara...Kodo Millet  PPT made by Ghanshyam bairwa college of Agriculture kumher bhara...
Kodo Millet PPT made by Ghanshyam bairwa college of Agriculture kumher bhara...pradhanghanshyam7136
 
Key note speaker Neum_Admir Softic_ENG.pdf
Key note speaker Neum_Admir Softic_ENG.pdfKey note speaker Neum_Admir Softic_ENG.pdf
Key note speaker Neum_Admir Softic_ENG.pdfAdmir Softic
 
SOC 101 Demonstration of Learning Presentation
SOC 101 Demonstration of Learning PresentationSOC 101 Demonstration of Learning Presentation
SOC 101 Demonstration of Learning Presentationcamerronhm
 
Fostering Friendships - Enhancing Social Bonds in the Classroom
Fostering Friendships - Enhancing Social Bonds  in the ClassroomFostering Friendships - Enhancing Social Bonds  in the Classroom
Fostering Friendships - Enhancing Social Bonds in the ClassroomPooky Knightsmith
 
2024-NATIONAL-LEARNING-CAMP-AND-OTHER.pptx
2024-NATIONAL-LEARNING-CAMP-AND-OTHER.pptx2024-NATIONAL-LEARNING-CAMP-AND-OTHER.pptx
2024-NATIONAL-LEARNING-CAMP-AND-OTHER.pptxMaritesTamaniVerdade
 
Micro-Scholarship, What it is, How can it help me.pdf
Micro-Scholarship, What it is, How can it help me.pdfMicro-Scholarship, What it is, How can it help me.pdf
Micro-Scholarship, What it is, How can it help me.pdfPoh-Sun Goh
 

Kürzlich hochgeladen (20)

HMCS Vancouver Pre-Deployment Brief - May 2024 (Web Version).pptx
HMCS Vancouver Pre-Deployment Brief - May 2024 (Web Version).pptxHMCS Vancouver Pre-Deployment Brief - May 2024 (Web Version).pptx
HMCS Vancouver Pre-Deployment Brief - May 2024 (Web Version).pptx
 
Accessible Digital Futures project (20/03/2024)
Accessible Digital Futures project (20/03/2024)Accessible Digital Futures project (20/03/2024)
Accessible Digital Futures project (20/03/2024)
 
Python Notes for mca i year students osmania university.docx
Python Notes for mca i year students osmania university.docxPython Notes for mca i year students osmania university.docx
Python Notes for mca i year students osmania university.docx
 
HMCS Max Bernays Pre-Deployment Brief (May 2024).pptx
HMCS Max Bernays Pre-Deployment Brief (May 2024).pptxHMCS Max Bernays Pre-Deployment Brief (May 2024).pptx
HMCS Max Bernays Pre-Deployment Brief (May 2024).pptx
 
Jual Obat Aborsi Hongkong ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Hongkong ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...Jual Obat Aborsi Hongkong ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Hongkong ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
 
Food safety_Challenges food safety laboratories_.pdf
Food safety_Challenges food safety laboratories_.pdfFood safety_Challenges food safety laboratories_.pdf
Food safety_Challenges food safety laboratories_.pdf
 
On National Teacher Day, meet the 2024-25 Kenan Fellows
On National Teacher Day, meet the 2024-25 Kenan FellowsOn National Teacher Day, meet the 2024-25 Kenan Fellows
On National Teacher Day, meet the 2024-25 Kenan Fellows
 
General Principles of Intellectual Property: Concepts of Intellectual Proper...
General Principles of Intellectual Property: Concepts of Intellectual  Proper...General Principles of Intellectual Property: Concepts of Intellectual  Proper...
General Principles of Intellectual Property: Concepts of Intellectual Proper...
 
Beyond_Borders_Understanding_Anime_and_Manga_Fandom_A_Comprehensive_Audience_...
Beyond_Borders_Understanding_Anime_and_Manga_Fandom_A_Comprehensive_Audience_...Beyond_Borders_Understanding_Anime_and_Manga_Fandom_A_Comprehensive_Audience_...
Beyond_Borders_Understanding_Anime_and_Manga_Fandom_A_Comprehensive_Audience_...
 
The basics of sentences session 3pptx.pptx
The basics of sentences session 3pptx.pptxThe basics of sentences session 3pptx.pptx
The basics of sentences session 3pptx.pptx
 
UGC NET Paper 1 Mathematical Reasoning & Aptitude.pdf
UGC NET Paper 1 Mathematical Reasoning & Aptitude.pdfUGC NET Paper 1 Mathematical Reasoning & Aptitude.pdf
UGC NET Paper 1 Mathematical Reasoning & Aptitude.pdf
 
Application orientated numerical on hev.ppt
Application orientated numerical on hev.pptApplication orientated numerical on hev.ppt
Application orientated numerical on hev.ppt
 
80 ĐỀ THI THỬ TUYỂN SINH TIẾNG ANH VÀO 10 SỞ GD – ĐT THÀNH PHỐ HỒ CHÍ MINH NĂ...
80 ĐỀ THI THỬ TUYỂN SINH TIẾNG ANH VÀO 10 SỞ GD – ĐT THÀNH PHỐ HỒ CHÍ MINH NĂ...80 ĐỀ THI THỬ TUYỂN SINH TIẾNG ANH VÀO 10 SỞ GD – ĐT THÀNH PHỐ HỒ CHÍ MINH NĂ...
80 ĐỀ THI THỬ TUYỂN SINH TIẾNG ANH VÀO 10 SỞ GD – ĐT THÀNH PHỐ HỒ CHÍ MINH NĂ...
 
Kodo Millet PPT made by Ghanshyam bairwa college of Agriculture kumher bhara...
Kodo Millet  PPT made by Ghanshyam bairwa college of Agriculture kumher bhara...Kodo Millet  PPT made by Ghanshyam bairwa college of Agriculture kumher bhara...
Kodo Millet PPT made by Ghanshyam bairwa college of Agriculture kumher bhara...
 
Key note speaker Neum_Admir Softic_ENG.pdf
Key note speaker Neum_Admir Softic_ENG.pdfKey note speaker Neum_Admir Softic_ENG.pdf
Key note speaker Neum_Admir Softic_ENG.pdf
 
SOC 101 Demonstration of Learning Presentation
SOC 101 Demonstration of Learning PresentationSOC 101 Demonstration of Learning Presentation
SOC 101 Demonstration of Learning Presentation
 
Fostering Friendships - Enhancing Social Bonds in the Classroom
Fostering Friendships - Enhancing Social Bonds  in the ClassroomFostering Friendships - Enhancing Social Bonds  in the Classroom
Fostering Friendships - Enhancing Social Bonds in the Classroom
 
Mehran University Newsletter Vol-X, Issue-I, 2024
Mehran University Newsletter Vol-X, Issue-I, 2024Mehran University Newsletter Vol-X, Issue-I, 2024
Mehran University Newsletter Vol-X, Issue-I, 2024
 
2024-NATIONAL-LEARNING-CAMP-AND-OTHER.pptx
2024-NATIONAL-LEARNING-CAMP-AND-OTHER.pptx2024-NATIONAL-LEARNING-CAMP-AND-OTHER.pptx
2024-NATIONAL-LEARNING-CAMP-AND-OTHER.pptx
 
Micro-Scholarship, What it is, How can it help me.pdf
Micro-Scholarship, What it is, How can it help me.pdfMicro-Scholarship, What it is, How can it help me.pdf
Micro-Scholarship, What it is, How can it help me.pdf
 

Lecture2 big data life cycle

  • 2. Data Analytics Lifecycle
      • Big Data analysis differs from traditional data analysis primarily due to the volume, velocity and variety characteristics of the data being processed.
      • To address the distinct requirements for performing analysis on Big Data, a step-by-step methodology is needed to organize the activities and tasks involved with acquiring, processing, analyzing and repurposing data.
  • 3. Data Analytics Lifecycle (cont.)
      • From a Big Data adoption and planning perspective, it is important that, in addition to the lifecycle, consideration be given to the training, education, tooling and staffing of a data analytics team.
  • 4. Key Roles for a Successful Analytics Project
      • Business User: understands the domain area
      • Project Sponsor: provides requirements
      • Project Manager: ensures meeting objectives
      • Business Intelligence Analyst: provides business domain expertise based on deep understanding of the data
      • Database Administrator (DBA): creates DB environment
      • Data Engineer: provides technical skills, assists data management and extraction, supports analytic sandbox
      • Data Scientist: provides analytic techniques and modeling
  • 5. Data Analytics Lifecycle (cont.)
      • The data analytics lifecycle is designed for Big Data problems and data science projects.
      • The cycle is iterative, to represent a real project.
      • Work can return to earlier phases as new information is uncovered.
  • 7. Phase 1: Discovery
      • In this phase, the data science team must learn and investigate the problem, develop context and understanding, and learn about the data sources needed and available for the project.
      • In addition, the team formulates initial hypotheses that can later be tested with data.
  • 8. Phase 1: Discovery (cont.)
      • The team should perform five main activities during this step of the discovery phase:
      • Identify data sources: Make a list of data sources the team may need to test the initial hypotheses outlined in this phase. Make an inventory of the datasets currently available and those that can be purchased or otherwise acquired for the tests the team wants to perform.
      • Capture aggregate data sources: This is for previewing the data and providing high-level understanding. It enables the team to gain a quick overview of the data and perform further exploration on specific areas.
      • Review the raw data: Begin understanding the interdependencies among the data attributes. Become familiar with the content of the data, its quality, and its limitations.
  • 9. Phase 1: Discovery (cont.)
      • Evaluate the data structures and tools needed: The data type and structure dictate which tools the team can use to analyze the data.
      • Scope the sort of data infrastructure needed for this type of problem: In addition to the tools needed, the data influences the kind of infrastructure that is required, such as disk storage and network capacity.
      • Unlike many traditional stage-gate processes, in which the team can advance only when specific criteria are met, the Data Analytics Lifecycle is intended to accommodate more ambiguity.
      • For each phase of the process, it is recommended to pass certain checkpoints as a way of gauging whether the team is ready to move to the next phase of the Data Analytics Lifecycle.
  • 10. Phase 2: Data preparation
      • This phase includes steps to explore, preprocess, and condition data prior to modeling and analysis.
  • 11. Phase 2: Data preparation (cont.)
      • This phase requires the presence of an analytic sandbox (workspace), in which the team can work with data and perform analytics for the duration of the project.
      • The team needs to execute extract, load, and transform (ELT) or extract, transform, and load (ETL) to get data into the sandbox.
      • In ETL, users perform processes to extract data from a datastore, perform data transformations, and load the data back into the datastore.
      • ELT and ETL are sometimes jointly abbreviated as ETLT. Data should be transformed in the ETLT process so the team can work with it and analyze it.
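The difference between the two orderings can be sketched in a few lines. This is a minimal illustration only, assuming an in-memory SQLite database as a stand-in for the analytic sandbox; the table names, the sample rows, and the `transform` helper are hypothetical and do not come from the slides.

```python
import sqlite3

# Hypothetical raw records from a source system (illustrative data only).
raw_rows = [(" Alice ", "42"), ("BOB", "17"), ("carol", "n/a")]

def transform(name, amount):
    """Trim and lowercase the name; parse the amount, or use None if it is not numeric."""
    return name.strip().lower(), int(amount) if amount.isdigit() else None

sandbox = sqlite3.connect(":memory:")  # stands in for the analytic sandbox

# ETL: transform in application code first, then load the cleaned rows.
sandbox.execute("CREATE TABLE etl_sales (name TEXT, amount INTEGER)")
sandbox.executemany("INSERT INTO etl_sales VALUES (?, ?)",
                    [transform(n, a) for n, a in raw_rows])

# ELT: load the raw rows untouched, then transform inside the sandbox with SQL.
sandbox.execute("CREATE TABLE raw_sales (name TEXT, amount TEXT)")
sandbox.executemany("INSERT INTO raw_sales VALUES (?, ?)", raw_rows)
sandbox.execute("""
    CREATE TABLE elt_sales AS
    SELECT LOWER(TRIM(name)) AS name,
           CASE WHEN amount <> '' AND amount NOT GLOB '*[^0-9]*'
                THEN CAST(amount AS INTEGER) END AS amount
    FROM raw_sales
""")

etl_rows = sandbox.execute("SELECT name, amount FROM etl_sales ORDER BY name").fetchall()
elt_rows = sandbox.execute("SELECT name, amount FROM elt_sales ORDER BY name").fetchall()
```

Both paths end with the same cleaned table; the ELT path additionally keeps `raw_sales` in the sandbox, so the untransformed data stays available for other analyses.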
  • 12. Phase 2: Data preparation: Rules for Analytics Sandbox
      • When developing the analytic sandbox, collect all kinds of data there, as team members need access to high volumes and varieties of data for a Big Data analytics project.
      • This can include everything from summary-level aggregated data, structured data, raw data feeds, and unstructured text data from call logs or web logs, depending on the kind of analysis the team plans to undertake.
      • A good rule is to plan for the sandbox to be at least 5–10 times the size of the original datasets, partly because copies of the data may be created that serve as specific tables or data stores for specific kinds of analysis in the project.
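As a quick worked example of the 5–10 times rule, the helper below turns an original dataset size into a planning range. The function name and the 200 GB figure are illustrative assumptions, not values from the slides.

```python
def sandbox_size_range(original_gb, low_factor=5, high_factor=10):
    """Return the (minimum, maximum) sandbox size in GB under the 5-10x rule of thumb."""
    return original_gb * low_factor, original_gb * high_factor

# Hypothetical project with 200 GB of original datasets.
low_gb, high_gb = sandbox_size_range(200)
```

For 200 GB of source data this suggests planning for roughly 1,000 to 2,000 GB of sandbox storage.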
  • 13. Phase 2: Data preparation: Performing ETLT
      • As part of the ETLT step, it is advisable to make an inventory of the data and compare the data currently available with the datasets the team needs.
      • Performing this sort of gap analysis provides a framework for understanding which datasets the team can take advantage of today and where the team needs to initiate projects for data collection or access to new datasets currently unavailable.
      • A component of this subphase involves extracting data from the available sources and determining data connections for raw data, online transaction processing (OLTP) databases, online analytical processing (OLAP) cubes, or other data feeds.
      • Data conditioning refers to the process of cleaning data, normalizing datasets, and performing transformations on the data.
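The gap analysis described above amounts to comparing two inventories. A minimal sketch using Python sets follows; the dataset names are made up for illustration.

```python
# Hypothetical inventories (names are illustrative only).
available = {"web_logs", "crm_accounts", "sales_orders"}
needed = {"web_logs", "sales_orders", "call_center_transcripts", "gps_traces"}

usable_today = sorted(available & needed)  # datasets the team can use right away
gaps = sorted(needed - available)          # datasets that must be collected or acquired
unused = sorted(available - needed)        # available, but not required by the current hypotheses
```

The `gaps` list is the part that feeds into new data collection or access requests.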
  • 14. Common Tools for the Data Preparation Phase
      • Several tools are commonly used for this phase:
      • Hadoop can perform massively parallel ingest and custom analysis for web traffic analysis, GPS location analytics, and combining of massive unstructured data feeds from multiple sources.
      • Alpine Miner provides a graphical user interface (GUI) for creating analytic workflows, including data manipulations and a series of analytic events such as staged data-mining techniques (for example, first select the top 100 customers, and then run descriptive statistics and clustering).
      • OpenRefine (formerly called Google Refine) is "a free, open source, powerful tool for working with messy data." It is a GUI-based tool for performing data transformations, and it is one of the most robust free tools currently available.
      • Similar to OpenRefine, Data Wrangler is an interactive tool for data cleaning and transformation. Wrangler was developed at Stanford University and can be used to perform many transformations on a given dataset.
  • 15. Phase 3: Model Planning
      • Phase 3 is model planning, where the team determines the methods, techniques, and workflow it intends to follow for the subsequent model building phase.
      • The team explores the data to learn about the relationships between variables and subsequently selects key variables and the most suitable models.
      • It is during this phase that the team refers to the hypotheses developed in Phase 1, when they first became acquainted with the data and began understanding the business problems or domain area.
  • 16. Common Tools for the Model Planning Phase
      • Here are several of the more common ones:
      • R has a complete set of modeling capabilities and provides a good environment for building interpretive models with high-quality code. In addition, it has the ability to interface with databases via an ODBC connection and execute statistical tests.
      • SQL Analysis Services can perform in-database analytics of common data mining functions, involved aggregations, and basic predictive models.
      • SAS/ACCESS provides integration between SAS and the analytics sandbox via multiple data connectors such as ODBC, JDBC, and OLE DB. SAS itself is generally used on file extracts, but with SAS/ACCESS, users can connect to relational databases (such as Oracle or Teradata).
  • 17. Phase 4: Model Building
      • In this phase the data science team needs to develop datasets for training, testing, and production purposes. These datasets enable the data scientist to develop the analytical model and train it ("training data"), while holding aside some of the data ("hold-out data" or "test data") for testing the model.
      • In addition, in this phase the team builds and executes models based on the work done in the model planning phase.
      • The team also considers whether its existing tools will be sufficient for running the models, or if it will need a more robust environment for executing models and workflows (for example, fast hardware and parallel processing, if applicable).
      • Free or open source tools: R and PL/R, Octave, WEKA, Python
      • Commercial tools: Matlab, STATISTICA
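A hold-out split of the kind described above can be sketched with the standard library alone. The 80/20 ratio, the fixed seed, and the function name are assumptions chosen for illustration; the slides do not prescribe a particular split.

```python
import random

def holdout_split(records, test_fraction=0.2, seed=0):
    """Shuffle a copy of the records and split it into (training, hold-out) lists."""
    rng = random.Random(seed)   # fixed seed so the split is reproducible
    shuffled = list(records)
    rng.shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

# 100 hypothetical record IDs: 80 go to training, 20 are held out for testing.
train, test = holdout_split(range(100))
```

The model is fitted only on `train`; `test` is touched once, at evaluation time, so the reported performance is not inflated by data the model has already seen.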
  • 18. Phase 5: Communicate Results
      • In Phase 5, after executing the model, the team needs to compare the outcomes of the modeling to the criteria established for success and failure.
      • The team considers how best to articulate the findings and outcomes to the various team members and stakeholders, taking into account warnings, assumptions, and any limitations of the results.
      • The team should identify key findings, quantify the business value, and develop a narrative to summarize and convey findings to stakeholders.
  • 19. Phase 6: Operationalize
      • In the final phase (Phase 6, Operationalize), the team communicates the benefits of the project more broadly and sets up a pilot project to deploy the work in a controlled way before broadening the work to a full enterprise or ecosystem of users.
      • This approach enables the team to learn about the performance and related constraints of the model in a production environment on a small scale and make adjustments before a full deployment.
      • The team delivers final reports, briefings, code, and technical documents. In addition, the team may run a pilot project to implement the models in a production environment.
  • 20. Common Tools for the Model Building Phase
      • Free or open source tools:
      • R and PL/R: R was described earlier in the model planning phase, and PL/R is a procedural language for PostgreSQL with R. Using this approach means that R commands can be executed in-database.
      • Octave, a free software programming language for computational modeling, has some of the functionality of Matlab. Because it is freely available, Octave is used in major universities when teaching machine learning.
      • WEKA is a free data mining software package with an analytic workbench. The functions created in WEKA can be executed within Java code.
      • Python is a programming language that provides toolkits for machine learning and analysis, such as scikit-learn, numpy, scipy, and pandas, and related data visualization using matplotlib.
      • SQL in-database implementations, such as MADlib, provide an alternative to in-memory desktop analytical tools.
      • MADlib provides an open-source machine learning library of algorithms that can be executed in-database, for PostgreSQL or Greenplum.
  • 21. Key outputs for each of the main stakeholders
      • The following are the key outputs for each of the main stakeholders of an analytics project, and what they usually expect at the conclusion of a project.
      • Business User typically tries to determine the benefits and implications of the findings to the business.
      • Project Sponsor typically asks questions related to the business impact of the project, the risks and return on investment (ROI), and the way the project can be evangelized within the organization (and beyond).
      • Project Manager needs to determine if the project was completed on time and within budget and how well the goals were met.
      • Business Intelligence Analyst needs to know if the reports and dashboards he manages will be impacted and need to change.
      • Data Engineer and Database Administrator (DBA) typically need to share their code from the analytics project and create a technical document on how to implement it.
      • Data Scientist needs to share the code and explain the model to her peers, managers, and other stakeholders.
  • 22. Data Analytics Lifecycle Overview
      • The Big Data analytics lifecycle can be divided into the following nine stages:
      1. Business Case Evaluation
      2. Data Identification
      3. Data Acquisition & Filtering
      4. Data Extraction
      5. Data Validation & Cleansing
      6. Data Aggregation & Representation
      7. Data Analysis
      8. Data Visualization
      9. Utilization of Analysis Results
  • 23. 1. Business Case Evaluation
      • Before any Big Data project can be started, it needs to be clear what the business objectives and results of the data analysis should be.
      • This initial phase focuses on understanding the project objectives and requirements from a business perspective, and then converting this knowledge into a data mining problem definition.
      • A preliminary plan is designed to achieve the objectives. A decision model, especially one built using the Decision Model and Notation standard, can be used.
      • Once an overall business problem is defined, the problem is converted into an analytical problem.
  • 24. 2. Data Identification
      • The Data Identification stage determines the origin of data. Before data can be analysed, it is important to know what the sources of the data will be.
      • Especially if data is procured from external suppliers, it is necessary to clearly identify the original source of the data and how reliable the dataset is (frequently referred to as the veracity of the data).
      • This second stage of the Big Data Lifecycle is very important, because if the input data is unreliable, the output will be unreliable as well.
      • Identifying a wider variety of data sources may increase the probability of finding hidden patterns and correlations.
  • 25. 3. Data Acquisition and Filtering
      • The Data Acquisition and Filtering phase builds upon the previous stage of the Big Data Lifecycle.
      • In this stage, the data is gathered from different sources, both from within the company and outside of the company.
      • After the acquisition, a first step of filtering is conducted to filter out corrupt data.
      • Additionally, data that is not necessary for the analysis will be filtered out as well.
      • The filtering step is applied to each data source individually, that is, before the data is aggregated into the data warehouse.
      • In many cases, especially where external, unstructured data is concerned, some or most of the acquired data may be irrelevant (noise) and can be discarded as part of the filtering process.
  • 26. 3. Data Acquisition and Filtering (cont.)
      • Data classified as "corrupt" can include records with missing or nonsensical values or invalid data types. Data that is filtered out for one analysis may possibly be valuable for a different type of analysis.
      • Metadata can be added via automation to data from both internal and external data sources to improve the classification and querying.
      • Examples of appended metadata include dataset size and structure, source information, date and time of creation or collection, and language-specific information.
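The filtering and metadata steps above might look like the following sketch. The records, the `is_corrupt` rule, and the source name are hypothetical; a real project would define its own corruption criteria per data source.

```python
from datetime import datetime, timezone

# Hypothetical acquired records (illustrative only).
records = [
    {"user_id": 1, "age": 34},
    {"user_id": 2, "age": None},       # missing value
    {"user_id": 3, "age": "unknown"},  # invalid data type
    {"user_id": 4, "age": -5},         # nonsensical value
]

def is_corrupt(rec):
    """Flag records whose age is missing, not an integer, or negative."""
    age = rec.get("age")
    return not isinstance(age, int) or isinstance(age, bool) or age < 0

kept = [r for r in records if not is_corrupt(r)]
rejected = [r for r in records if is_corrupt(r)]  # set aside; may matter for another analysis

# Metadata appended automatically to the filtered dataset.
dataset = {
    "metadata": {
        "source": "internal_crm_export",  # hypothetical source name
        "collected_at": datetime.now(timezone.utc).isoformat(),
        "record_count": len(kept),
        "fields": sorted(kept[0].keys()) if kept else [],
    },
    "records": kept,
}
```

Keeping `rejected` instead of deleting it reflects the point that data filtered out for one analysis may still be valuable for another.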
  • 27. 4. Data Extraction
      • Some of the data identified in the two previous stages may be incompatible with the Big Data tool that will perform the actual analysis.
      • To deal with this problem, the Data Extraction stage is dedicated to extracting different data formats from datasets (e.g. the data source) and transforming these into a format the Big Data tool is able to process and analyse.
      • The complexity of the transformation and the extent to which it is necessary to transform data depend greatly on the Big Data tool that has been selected.
      • The Data Extraction lifecycle stage is dedicated to extracting disparate data and transforming it into a format that the underlying Big Data solution can use for the purpose of the data analysis.
  • 28. 4. Data Extraction (cont.)
      • (A) illustrates the extraction of comments and a user ID embedded within an XML document without the need for further transformation.
      • (B) demonstrates the extraction of the latitude and longitude coordinates of a user from a single JSON field.
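Both extraction cases can be sketched with Python's standard library. The XML and JSON documents below are invented stand-ins for the figures, which are not reproduced in the transcript; the element and field names are assumptions.

```python
import json
import xml.etree.ElementTree as ET

# (A) Hypothetical XML document with an embedded user ID and comments.
xml_doc = (
    "<post><userId>u-1001</userId>"
    "<comment>Great product</comment>"
    "<comment>Fast delivery</comment></post>"
)
root = ET.fromstring(xml_doc)
user_id = root.findtext("userId")
comments = [c.text for c in root.findall("comment")]

# (B) Hypothetical JSON record storing both coordinates in one string field.
json_doc = '{"user": "u-1001", "location": "48.8566, 2.3522"}'
lat_str, lon_str = json.loads(json_doc)["location"].split(",")
latitude, longitude = float(lat_str), float(lon_str)
```

Case (A) needs no transformation beyond pulling the values out, while case (B) requires splitting and type conversion before the coordinates are usable.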
  • 29. 5. Data Validation and Cleansing
      • Data that is invalid leads to invalid results. To ensure only the appropriate data is analysed, the Data Validation and Cleansing stage of the Big Data Lifecycle is required.
      • During this stage, data is validated against a set of predetermined conditions and rules in order to ensure the data is not corrupt.
      • An example of a validation rule would be to exclude all persons that are older than 100 years old, since it is very unlikely that data about these persons would be correct due to physical constraints.
      • The Data Validation and Cleansing stage is dedicated to establishing often complex validation rules and removing any known invalid data.
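The age rule from the slide can be written as a small predicate. The sample records and the lower bound of zero are assumptions added for illustration.

```python
MAX_AGE = 100  # rule from the slide: exclude persons older than 100 years

# Hypothetical person records (illustrative only).
people = [
    {"name": "Ana", "age": 37},
    {"name": "Raj", "age": 142},  # implausible, fails the rule
    {"name": "Liu", "age": 100},  # exactly 100 is not "older than 100", so it passes
]

def passes_age_rule(person):
    """A record is valid when its age is a number between 0 and MAX_AGE inclusive."""
    age = person.get("age")
    return isinstance(age, (int, float)) and 0 <= age <= MAX_AGE

valid_people = [p for p in people if passes_age_rule(p)]
invalid_people = [p for p in people if not passes_age_rule(p)]
```

In practice several such predicates would be combined, one per validation rule.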
  • 30. 5. Data Validation and Cleansing (cont.) n For example, as illustrated in Fig. 1, the first value in Dataset B is validated against its corresponding value in Dataset A. n The second value in Dataset B is not validated against its corresponding value in Dataset A. If a value is missing, it is inserted from Dataset A. n Data validation can be used to examine interconnected datasets in order to fill in missing valid data.
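Validating one dataset against another and filling gaps can be sketched as follows. The dictionaries keyed by record ID, the city values and the `validate_and_fill` helper are assumptions for illustration. In this sketch a mismatching value is only flagged, not corrected.

```python
# Hypothetical interconnected datasets keyed by record ID.
dataset_a = {1: "Toronto", 2: "Lagos", 3: "Osaka"}
dataset_b = {1: "Toronto", 2: "Lagoss", 3: None}

def validate_and_fill(reference, target):
    """Check target values against the reference and fill in missing ones."""
    cleaned, mismatches = {}, []
    for key, value in target.items():
        expected = reference.get(key)
        if value is None:
            cleaned[key] = expected      # missing: insert from the reference
        else:
            cleaned[key] = value
            if expected is not None and value != expected:
                mismatches.append(key)   # present but fails validation
    return cleaned, mismatches

cleaned_b, mismatched_keys = validate_and_fill(dataset_a, dataset_b)
```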
  • 31. 6. Data Aggregation and Representation n Data may be spread across multiple datasets, requiring that the datasets be joined together to conduct the actual analysis. n To ensure that only the correct data is analysed in the next stage, it might be necessary to integrate multiple datasets. n The Data Aggregation and Representation stage is dedicated to integrating multiple datasets to arrive at a unified view. n Additionally, data aggregation can greatly speed up the analysis, because the Big Data tool will not be required to join tables from different datasets.
  • 32. 6. Data Aggregation and Representation (cont.) n Performing this stage can become complicated because of differences in: • Data Structure – Although the data format may be the same, the data model may be different. • Semantics – A value that is labeled differently in two different datasets may mean the same thing, for example “surname” and “last name.” n Whether data aggregation is required or not, it is important to understand that the same data can be stored in many different forms. Figure: A simple example of data aggregation where two datasets are aggregated together using the Id field.
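Joining two datasets on the Id field, while reconciling the “surname” and “last name” labels, can be sketched as below. The sample records, the `FIELD_ALIASES` mapping and the helper names are hypothetical.

```python
# Hypothetical datasets that label the same attribute differently.
dataset_a = [
    {"Id": 1, "surname": "Okafor"},
    {"Id": 2, "surname": "Silva"},
]
dataset_b = [
    {"Id": 1, "last name": "Okafor", "city": "Lagos"},
    {"Id": 2, "last name": "Silva", "city": "Porto"},
    {"Id": 3, "last name": "Chen", "city": "Taipei"},
]

# Semantic mapping: labels that mean the same thing share one canonical name.
FIELD_ALIASES = {"last name": "surname"}

def normalise(record):
    """Rename aliased fields to their canonical label."""
    return {FIELD_ALIASES.get(key, key): value for key, value in record.items()}

def aggregate_by_id(*datasets):
    """Merge records that share an Id into one unified view."""
    merged = {}
    for dataset in datasets:
        for record in dataset:
            row = normalise(record)
            merged.setdefault(row["Id"], {}).update(row)
    return [merged[key] for key in sorted(merged)]

unified = aggregate_by_id(dataset_a, dataset_b)
```

If both datasets carry conflicting values for the same field, the later dataset wins in this sketch. A real pipeline would need an explicit conflict policy.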
  • 33. 7. Data Analysis n The Data Analysis stage of the Big Data Lifecycle is dedicated to carrying out the actual analysis task. n It runs the code or algorithm that makes the calculations that lead to the actual result. n Data analysis can be simple or highly complex, depending on the required analysis type. n In this stage the ‘actual value’ of the Big Data project is generated. If all previous stages have been executed carefully, the results are far more likely to be accurate. n Depending on the type of analytic result required, this stage can be as simple as querying a dataset to compute an aggregation for comparison. n On the other hand, it can be as challenging as combining data mining and complex statistical analysis techniques to discover patterns and anomalies or to generate a statistical or mathematical model to depict relationships between variables.
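At the simple end of that range, an analysis may be no more than an aggregation computed for comparison. The sketch below averages a value per group. The `sales` records and the `average_by` helper are hypothetical.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical dataset: sales amounts per region.
sales = [
    {"region": "north", "amount": 120.0},
    {"region": "south", "amount": 80.0},
    {"region": "north", "amount": 100.0},
    {"region": "south", "amount": 60.0},
]

def average_by(records, key, value):
    """Group records by `key` and average the `value` field in each group."""
    groups = defaultdict(list)
    for record in records:
        groups[record[key]].append(record[value])
    return {group: mean(values) for group, values in groups.items()}

averages = average_by(sales, key="region", value="amount")
```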
  • 34. 7. Data Analysis (cont.) n Data analysis can be classified as confirmatory analysis or exploratory analysis, the latter of which is linked to data mining, as shown in the figure. • Confirmatory data analysis is a deductive approach where the cause of the phenomenon being investigated is proposed beforehand. The proposed cause or assumption is called a hypothesis. • Exploratory data analysis is an inductive approach that is closely associated with data mining. No hypothesis or predetermined assumptions are generated. Instead, the data is explored through analysis to develop an understanding of the cause of the phenomenon.
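The contrast between the two approaches can be illustrated with a small sketch. The exploratory part simply summarises the data with no hypothesis in mind. The confirmatory part tests a hypothesis stated in advance, here with an exact permutation test. That test is a technique chosen for this example and is not named in the slides. The order-size data and function names are hypothetical.

```python
from itertools import combinations
from statistics import mean

# Hypothetical data: order sizes with and without a discount.
with_discount = [12, 14, 15, 13, 16]
without_discount = [8, 9, 7, 10, 9]

def summarise(values):
    """Exploratory: describe the data without assuming any cause."""
    return {"min": min(values), "max": max(values), "mean": mean(values)}

def exact_permutation_p_value(group_a, group_b):
    """Confirmatory: one-sided p-value for 'group_a has the larger mean'.

    Enumerates every way of splitting the pooled values into two groups
    of the original sizes and counts the splits at least as extreme as
    the observed one.
    """
    pooled = group_a + group_b
    observed = sum(group_a)
    size = len(group_a)
    # With fixed group sizes, a larger sum for A is equivalent to a
    # larger difference in means, so integer sums are compared directly.
    splits = list(combinations(range(len(pooled)), size))
    extreme = sum(1 for idx in splits if sum(pooled[i] for i in idx) >= observed)
    return extreme / len(splits)

summary = summarise(with_discount + without_discount)
p_value = exact_permutation_p_value(with_discount, without_discount)
```

A small p-value supports the hypothesis proposed beforehand, whereas the summary only suggests where to look next.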
  • 35. 8. Data Visualization n The ability to analyse massive amounts of data and find useful insight is one thing; communicating the results in a way that everybody can understand is something else entirely. n The Data Visualization stage is dedicated to using data visualization techniques and tools to graphically communicate the analysis results for effective interpretation by business users. Frequently this requires plotting data points in charts, graphs or maps. n The results of completing the Data Visualization stage provide users with the ability to perform visual analysis, allowing for the discovery of answers to questions that users have not yet even formulated.
  • 37. 9. Utilization of Analysis Results n After the data analysis has been performed and the results have been presented, the final step of the Big Data Lifecycle is to use the results in practice. n The Utilization of Analysis Results stage is dedicated to determining how and where the processed data can be further utilized to leverage the results of the Big Data project. n Depending on the nature of the analysis problems being addressed, it is possible for the analysis results to produce “models” that encapsulate new insights and understandings about the nature of the patterns and relationships that exist within the data that was analyzed. n A model may look like a mathematical equation or a set of rules. Models can be used to improve business process logic and application system logic, and they can form the basis of a new system or software program.
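To make the two model shapes concrete, the sketch below shows one model as an equation and one as a set of rules, each embedded in application logic. All coefficients, thresholds, field names and actions are invented for illustration and do not come from any real analysis.

```python
def predicted_demand(price, intercept=100.0, slope=-2.0):
    """Model as an equation: demand = intercept + slope * price.

    The coefficients are hypothetical placeholders for values that an
    analysis would have estimated.
    """
    return intercept + slope * price

# Model as a set of rules: ordered (condition, action) pairs.
ROUTING_RULES = [
    (lambda order: order["amount"] > 1000 and order["new_customer"], "manual_review"),
    (lambda order: order["amount"] > 5000, "manager_approval"),
]

def route_order(order):
    """Apply the first matching rule; fall back to automatic approval."""
    for condition, action in ROUTING_RULES:
        if condition(order):
            return action
    return "auto_approve"

demand_at_10 = predicted_demand(10.0)
decision = route_order({"amount": 1500, "new_customer": True})
```

Keeping the rules in a data structure rather than hard-coded branches makes it easier to update the business logic when a later analysis produces a revised model.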