Introduction to Data Science: Techniques, Tools and Business Applications

Introduction to
Data Science

Prithwis Mukerjee, PhD
Praxis Business School, Calcutta

Agenda
● Why data science ?
● Techniques
○ Statistics
○ Data Mining
○ Visualisation
● Tools & Platforms
○ R
○ Hadoop / MapReduce
○ Real Time Systems
● Business Domains


Volume
Data is being acquired from a variety of sources
● EFT in Banks, Credit card payments
● Cell phones
● Sensors attached to a variety of equipment
● Surveillance cameras, CCTV
● Social Media Updates
● Blogs
● Websites


Variety / Velocity
● Numeric data
● Structured text data
● Unstructured text data
● Images
● Sound and video recordings
● Graph Nodes
○ Social Media “friends”
○ Websites linked to each other


Data is being generated fast and is becoming obsolete or useless equally fast
● Realtime (or near realtime) data from sensors, cameras
● Website traffic
● Social media “trends”

So what is Big Data ?
● Volume
● Velocity
● Variety ?

A new term coined by IT vendors to push new technology like
● Map Reduce
● Hadoop
● NOSQL

A new way to
● collect
● store
● manage
● analyse
● visualise data

Big Data is like Crude Oil { not new Oil }
Think of data as crude oil !
Big Data is like extracting the crude oil, transporting it in mega tankers, pumping it through pipelines and storing it in massive silos

But what about refining ?

The Science (and Art ) of Data
Think of data as crude oil !

Big Data is like extracting the crude oil, transporting it in mega tankers, pumping it through pipelines and storing it in massive silos

Data Science : Refining
● Discovering what we do not know about the data
● Obtaining predictive, actionable insight
● Creating data products that have business impacts
● Communicating relevant business stories

Two Perspectives

[Figure: Venn diagram. Labels: Programming or “Hacking” Skills; Machine Learning; Mathematics, Statistics Knowledge; Data Science; RDBMS, ERP / BI; Operations Research; Business Domain Knowledge]


10 Things {most} Data Scientists do ...
1. Ask good questions
   What is what ? We do not know ! We would like to know
2. Define and test hypotheses, run experiments
3. Scoop, scrape, sample business data
4. Wrestle and tame data
5. Play with data, discover unknowns
6. Create models, algorithms
7. Understand data relationships
8. Tell the machine how to learn from the data
9. Create data products that deliver actionable insights
10. Tell relevant business stories from data

Statistics - World of Data
● Data comes in various types
○ Nominal - colour, gender, PIN code
○ Ordinal - scale of 1-10, {high, medium, low}
○ Interval - Dates, Temperature (Centigrade)
○ Ratio - length, weight, count
● Data comes in various structures
○ Structured data - nominal, ordinal, interval, ratio
○ Unstructured text - email, tweets, reviews
○ Images, voice prints
○ Graphs, networks - social media friendships, likes

Descriptive Statistics
● Numeric Description
○ Mean, Median, Mode
○ Quartile, Percentile
○ Variance / Standard Deviation
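
As an illustration of the numeric descriptions listed above, here is a minimal sketch using Python's standard `statistics` module. The sample values and the variable name `spend` are made up for this example and are not from the presentation.

```python
import statistics

# Hypothetical sample: monthly spend of nine customers (made-up numbers).
spend = [12, 15, 15, 18, 20, 22, 25, 30, 41]

mean = statistics.mean(spend)      # arithmetic average
median = statistics.median(spend)  # middle value of the sorted data
mode = statistics.mode(spend)      # most frequent value

# Quartiles: the three cut points that split the sorted data into four parts.
q1, q2, q3 = statistics.quantiles(spend, n=4)

# Percentiles: 99 cut points; index 89 is the 90th percentile.
p90 = statistics.quantiles(spend, n=100)[89]

# Sample variance and standard deviation (divide by n - 1).
variance = statistics.variance(spend)
std_dev = statistics.stdev(spend)

print(mean, median, mode)
print(q1, q2, q3, p90)
print(variance, std_dev)
```

Note that `statistics.variance` and `statistics.stdev` treat the data as a sample; `pvariance` and `pstdev` are the population versions.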


Statistics : The Path Ahead

● Probability, Distributions
● Testing of Hypothesis
● Regression, Testing
● Predictive Analysis

Data Mining / Machine Learning
Is the process of obtaining
● novel
● valid
● potentially useful
● understandable
patterns in data

Typical tasks are
● classification
● clustering
● association rules
● sequential patterns
● regression
● deviation detection


Some definitions
Instance (an item or record)
● an observation that is characterised by a number of attributes
○ person - with attributes like age, salary, qualification
○ sale - with product, quantity, price

Attribute
● measuring characteristics of an instance

Class
● grouping of an instance into
○ acceptable, not acceptable
○ mammal, fish, bird

Nominal
● colour, PIN code, state

Ordinal
● ranking : tall, medium, short or feedback on a scale of 1 - 10

Ratio
● length, price, duration, quantity

Interval
● date, temperature

Data Mining : Classification
Classification
● Which loan applicant will not default on the loan ?
● Which potential customer will respond to a mailer campaign ?


Classification Example
[Figure: a table of records with column types categorical, categorical, continuous and class. Boxes: Training Set, Learn Classifier, Model, Test Set]
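
The training set / classifier / test set idea can be sketched in a few lines. This is only an illustration under stated assumptions: the slide does not name an algorithm, so a k-nearest-neighbours classifier is swapped in here, and the loan records, attribute choice and function name `predict` are all hypothetical.

```python
import math
from collections import Counter

# Hypothetical labelled records: (age, salary in thousands) -> class.
training_set = [
    ((25, 20), "default"),
    ((30, 25), "default"),
    ((28, 22), "default"),
    ((45, 80), "repay"),
    ((50, 90), "repay"),
    ((40, 70), "repay"),
]

# Records held back from training, used only to check the model.
test_set = [
    ((27, 24), "default"),
    ((48, 85), "repay"),
]

def predict(model, attributes, k=3):
    """Return the majority class among the k nearest training instances."""
    nearest = sorted(model, key=lambda rec: math.dist(rec[0], attributes))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# For nearest neighbours, "learning" simply means storing the training set.
model = training_set

# Apply the model to the test set and measure how often it is right.
correct = sum(predict(model, x) == label for x, label in test_set)
accuracy = correct / len(test_set)
print(f"accuracy on test set: {accuracy:.2f}")
```

In practice the attributes would be scaled first, since salary and age are on different ranges and the larger one dominates the distance.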

Data Mining : Clustering
Given a set of unclassified data points, how to find a natural grouping within them
● Can we segment the market in some way that is not yet known ?
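
One common way to look for a natural grouping is k-means, sketched below. The slide does not name a clustering algorithm, so k-means is an assumption made for illustration, as are the customer data and the function name `k_means`.

```python
import math

def k_means(points, k, iterations=10):
    """Plain k-means: assign each point to its nearest centre, then move centres."""
    centres = list(points[:k])  # naive start: use the first k points
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centres[i]))
            clusters[nearest].append(p)
        for i, members in enumerate(clusters):
            if members:  # keep the old centre if a cluster ends up empty
                centres[i] = tuple(sum(c) / len(members) for c in zip(*members))
    return centres, clusters

# Hypothetical customers: (visits per month, average basket value).
customers = [(1, 10), (2, 12), (1, 11), (9, 50), (10, 55), (8, 52)]

centres, clusters = k_means(customers, k=2)
for centre, members in zip(centres, clusters):
    print(centre, members)
```

The number of groups `k` has to be chosen by the analyst, and the result can depend on the starting centres, so real uses typically try several starts.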


Example of Document Clustering
Clustering points : 3204 articles from the Los Angeles Times
Similarity Measure : How many words are common in these documents (after excluding some common words)
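
The similarity measure described above can be sketched as a count of shared words. The stop-word list, the sample sentences and the function names are hypothetical; they are not taken from the Los Angeles Times data set.

```python
# Hypothetical list of common words to exclude; a real one would be much longer.
COMMON_WORDS = {"the", "a", "an", "in", "of", "to", "and", "is", "on"}

def significant_words(text):
    """Lower-case the text, split on whitespace, strip punctuation, drop common words."""
    words = {w.strip(".,;:!?") for w in text.lower().split()}
    return words - COMMON_WORDS

def similarity(doc_a, doc_b):
    """Number of distinct significant words that the two documents share."""
    return len(significant_words(doc_a) & significant_words(doc_b))

doc1 = "The market rallied on strong earnings in the technology sector."
doc2 = "Technology earnings lift the stock market."
doc3 = "The Lakers won the game in overtime."

print(similarity(doc1, doc2))  # the two business articles share words
print(similarity(doc1, doc3))  # the business and sports articles do not
```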


Clustering of S&P Stock Data
● Observe Stock Movements every day.
● Clustering points: Stock{UP/DOWN}
● Similarity Measure: Two points are more similar if the events described by them frequently happen together on the same day.
● We used association rules to quantify a similarity measure.


Regression
● Predict a value of a given continuous valued variable based on the values of other variables, assuming a linear or nonlinear model of dependency.
○ Extensively studied in statistics and neural network fields.
● Examples:
○ Predicting sales amounts of a new product based on advertising expenditure.
○ Predicting wind velocities as a function of temperature, humidity, air pressure, etc.
○ Time series prediction of stock market indices.
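
For the linear case, the sales and advertising example can be sketched with ordinary least squares on a single variable. The figures below are invented so that the answer is easy to check by hand, and the function name `fit_line` is hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical data: advertising expenditure and resulting sales.
# The sales figures were constructed as 2 * advertising + 5.
advertising = [10, 20, 30, 40, 50]
sales = [25, 45, 65, 85, 105]

slope, intercept = fit_line(advertising, sales)

# Predict sales for a new level of advertising expenditure.
predicted = slope * 60 + intercept
print(slope, intercept, predicted)
```

Real data will not lie exactly on a line, so the fit is usually judged by looking at the residuals as well as the coefficients.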

Data Mining : Association Rules Mining
Association Rules
● which products should be kept along with other products
● which two products should never be discounted together
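
Questions like these are usually answered by counting how often items appear together. The sketch below computes two standard measures, support and confidence, for pairs of items. The baskets and function names are hypothetical, and a full association rules miner (for example the Apriori algorithm) does considerably more than this.

```python
from collections import Counter
from itertools import combinations

# Hypothetical shopping baskets.
baskets = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"milk", "butter"},
    {"bread", "butter", "jam"},
]

# How many baskets contain each item, and each (alphabetically ordered) pair.
item_counts = Counter(item for b in baskets for item in b)
pair_counts = Counter(
    pair for b in baskets for pair in combinations(sorted(b), 2)
)

def support(a, b):
    """Fraction of all baskets that contain both a and b."""
    return pair_counts[tuple(sorted((a, b)))] / len(baskets)

def confidence(a, b):
    """Of the baskets that contain a, the fraction that also contain b."""
    return pair_counts[tuple(sorted((a, b)))] / item_counts[a]

print(support("bread", "butter"))
print(confidence("bread", "butter"))
print(confidence("milk", "butter"))
```

A pair with high support and high confidence is a candidate for being shelved together; it is also a pair where discounting both items at once may give away margin unnecessarily.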


Visualisation : The need to tell a story


Definitions
Data Mining
● Is the process of extracting unknown, valid and actionable information from large databases and using this to make business decisions
● Non trivial process of identifying valid, novel, potentially useful and understandable / explainable patterns in data

Data Science is a rare combination of multiple skills that include
● Technology : obviously !
but also
● Curiosity - a desire to go below the surface and discover a hypothesis that can be tested
● Storytelling - create a business story around the data
● Cleverness - again obviously, to look at the problem from different angles

R : Your first step into Data Science


Try out this free interactive tutorial just now

Statistical Tools


http://r4stats.com/articles/popularity/

Some Comparisons


Map Reduce
● Input : A set of (key, value) pairs
● User supplies two functions
○ Map (k,v) => List(k1,v1)
○ Reduce (k1, list(v1)) => v2
● Output is the set of (k1,v2) pairs
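
The Map and Reduce signatures above can be illustrated with the usual word-count example, run here in a single Python process. This is a sketch of the programming model only, not of Hadoop itself; the driver function `map_reduce` and the input lines are hypothetical.

```python
from collections import defaultdict

def map_fn(key, value):
    """Map (k, v) => list of (k1, v1): emit (word, 1) for every word in a line."""
    return [(word, 1) for word in value.lower().split()]

def reduce_fn(key, values):
    """Reduce (k1, list(v1)) => v2: add up the counts for one word."""
    return sum(values)

def map_reduce(pairs, map_fn, reduce_fn):
    """Single-process stand-in for the framework: map, group by key, reduce."""
    grouped = defaultdict(list)
    for k, v in pairs:
        for k1, v1 in map_fn(k, v):
            grouped[k1].append(v1)  # the "shuffle" step: collect values per key
    return {k1: reduce_fn(k1, v1s) for k1, v1s in grouped.items()}

# Input: a set of (key, value) pairs, here (line number, line text).
lines = [(1, "big data is big"), (2, "data science is fun")]

counts = map_reduce(lines, map_fn, reduce_fn)
print(counts)
```

On a real cluster the map calls run in parallel on different machines and the grouping step moves data across the network, but the user still writes only the two functions.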


Hadoop
A programming framework that allows you to run Map-Reduce jobs on a distributed cluster of low cost machines without having to bother about anything except
● the Map and Reduce functions
● loading data into HDFS

1. HIVE
a. A plug-in that allows one to use SQL like queries that are converted into map-reduce jobs
2. PIG
a. A scripting language for writing long queries
3. HBASE
a. A non-relational DBMS
4. SQOOP
a. Moves data to and from HDFS

Data-in-Flight


JavaScript for Data Visualisation


Business Domain
● Financial Sector
○ Risk Management, Credit Scoring
○ Predict Customer Spend
○ Stock and Investment Analysis
○ Loan approval
● Telecom Sector
○ Fraud Detection
○ Churn Prediction
● Retail and Marketing
○ Market segmentation
○ Promotional strategy
○ Market Basket Analysis
○ Trend Analysis
● Healthcare & Insurance
○ Fraud Detection
○ Drug Development
○ Medical Diagnostic Tools

Conclusion
● Why data science ?
● Techniques
○ Statistics
○ Data Mining
○ Visualisation
● Tools & Platforms
○ R
○ Hadoop / MapReduce
○ Real Time Systems
● Business Domains

Data Science is a rare combination of multiple skills that include
● Technology : obviously !
but also
● Curiosity - a desire to go below the surface and discover a hypothesis that can be tested
● Storytelling - create a business story around the data
● Cleverness - again obviously, to look at the problem from different angles

Thank You
Contact
Prithwis Mukerjee
Professor, Praxis Business School
prithwis@praxis.ac.in

This presentation is accessible at the blog
http://blog.yantrajaal.com
at the following URL
http://bit.ly/pm-datascience

