This document provides an introduction to data science. It discusses why data science is important and covers key techniques like statistics, data mining, and visualization. It also reviews popular tools and platforms for data science like R, Hadoop, and real-time systems. Finally, it discusses how data science can be applied across different business domains such as financial services, telecom, retail, and healthcare.
4. Volume
Data is being acquired from a variety of sources
● EFT in Banks, Credit card payments
● Cell phones
● Sensors attached to a variety of equipment
● Surveillance cameras, CCTV
● Social Media Updates
● Blogs
● Websites
prithwis mukerjee, ph.d.
5. Variety / Velocity
● Numeric data
● Structured text data
● Unstructured text data
● Images
● Sound and video recordings
● Graph Nodes
○ Social Media “friends”
○ Websites linked to each other
Data is being generated fast and is becoming obsolete or useless equally fast
● Realtime (or near realtime) data from sensors, cameras
● Website traffic
● Social media “trends”
6. So what is Big Data ?
● Volume
● Velocity
● Variety ?
A new term coined by IT vendors to push new technology like
● Map Reduce
● Hadoop
● NOSQL
A new way to
● collect
● store
● manage
● analyse
● visualise data
7. Big Data is like Crude Oil { not new Oil }
Think of data as crude oil !
Big Data is like extracting the crude oil, transporting it in mega tankers, pumping it through pipelines and storing it in massive silos
But what about refining ?
8. The Science (and Art ) of Data
Think of data as crude oil !
Big Data is like extracting the crude oil, transporting it in mega tankers, pumping it through pipelines and storing it in massive silos
Data Science : Refining
● Discovering what we do not know about the data
● Obtaining predictive, actionable insight
● Creating data products that have business impact
● Communicating relevant business stories
10. 10 Things {most} Data Scientists do ...
1. Ask good questions
What is what ? We do not know ! We would like to know
2. Define, Test Hypothesis, Run experiments
3. Scoop, scrape, sample business data
4. Wrestle and tame data
5. Play with data, discover unknowns
6. Create models, algorithms
7. Understand data relationships
8. Tell the machine how to learn from the data
9. Create data products that deliver actionable insights
10. Tell relevant business stories from data
11. Statistics - World of Data
● Data comes in various types
○ Nominal - colour, gender, PIN code
○ Ordinal - scale of 1-10, {high, medium, low}
○ Interval - Dates, Temperature (Centigrade)
○ Ratio - length, weight, count
● Data comes in various structures
○ Structured data - nominal, ordinal, interval, ratio
○ Unstructured text - email, tweets, reviews
○ Images, voice prints
○ Graphs, networks - social media friendships, likes
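The four data types listed above determine which summary statistics are meaningful. The following is a minimal sketch using Python's standard statistics module; the sample values and variable names are invented for illustration and do not come from the presentation.

```python
import statistics

# Hypothetical sample values, one list per measurement scale.
colours = ["red", "blue", "red", "green"]                   # nominal
feedback = ["low", "medium", "high", "medium", "medium"]    # ordinal
temps_c = [20.0, 25.0, 30.0]                                # interval
weights_kg = [40.0, 60.0, 80.0]                             # ratio

# Nominal: categories can only be counted, so the mode is the natural summary.
most_common_colour = statistics.mode(colours)

# Ordinal: values can be ranked, so a median is meaningful once the
# ordering is made explicit (assumed order: low < medium < high).
rank = {"low": 0, "medium": 1, "high": 2}
label = {position: name for name, position in rank.items()}
median_feedback = label[statistics.median_low([rank[f] for f in feedback])]

# Interval: differences are meaningful, so a mean can be computed.
mean_temp = statistics.mean(temps_c)

# Ratio: there is a true zero, so "twice as heavy" is a valid statement.
weight_ratio = weights_kg[2] / weights_kg[0]

print(most_common_colour, median_feedback, mean_temp, weight_ratio)
```

Note that a ratio such as 30 °C / 20 °C is not meaningful, because the Centigrade zero is arbitrary; that is the practical difference between interval and ratio data.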
13. Statistics : The Path Ahead
Probability, Distributions
Testing of Hypothesis
Regression, Testing
Predictive Analysis
14. Data Mining / Machine Learning
Is the process of obtaining
● novel
● valid
● potentially useful
● understandable
patterns in data
Typical tasks are
● classification
● clustering
● association rules
● sequential patterns
● regression
● deviation detection
15. Some definitions
Instance ( an item or record)
● an observation that is characterised by a number of attributes
○ person - with attributes like age, salary, qualification
○ sale - with product, quantity, price
Attribute
● measuring characteristics of an instance
Class
● grouping of an instance into
○ acceptable, not acceptable
○ mammal, fish, bird
Nominal
● colour, PIN code, state
Ordinal
● ranking : tall, medium, short or feedback on a scale of 1 - 10
Ratio
● length, price, duration, quantity
Interval
● date, temperature
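One way to picture instance, attribute and class is as a small record type. This is only a sketch: the Person type, its field names and the two sample records are hypothetical, chosen to mirror the "person" example above.

```python
from dataclasses import dataclass, fields

@dataclass
class Person:
    # Attributes: the measured characteristics of an instance.
    age: int                # ratio
    salary: float           # ratio
    qualification: str      # nominal
    # Class: the group the instance is assigned to.
    label: str

# Two hypothetical instances (records).
people = [
    Person(age=34, salary=52000.0, qualification="graduate", label="acceptable"),
    Person(age=22, salary=18000.0, qualification="school", label="not acceptable"),
]

attribute_names = [f.name for f in fields(Person) if f.name != "label"]
classes = sorted({p.label for p in people})

print(attribute_names)
print(classes)
```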
16. Data Mining : Classification
Classification
● Which loan applicant will not default on the loan ?
● Which potential customer will respond to a mailer campaign ?
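The loan question above can be illustrated with one of the simplest classifiers, 1-nearest-neighbour: a new applicant receives the class of the most similar past applicant. The presentation does not name a specific algorithm, so this choice is an assumption, and the figures and the classify function are hypothetical.

```python
import math

# Hypothetical past applicants: (income, existing debt) in thousands -> outcome.
history = [
    ((60.0, 5.0), "repaid"),
    ((55.0, 10.0), "repaid"),
    ((20.0, 15.0), "defaulted"),
    ((25.0, 20.0), "defaulted"),
]

def classify(applicant, training=history):
    """Return the class of the closest training instance (1-nearest-neighbour)."""
    nearest = min(training, key=lambda row: math.dist(row[0], applicant))
    return nearest[1]

print(classify((58.0, 8.0)))
print(classify((22.0, 18.0)))
```

In practice the attributes would be rescaled first so that no single attribute dominates the distance.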
18. Data Mining : Clustering
Given a set of unclassified data points, how to find a natural grouping within them
● Can we segment the market in some way that is not yet known ?
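A common way to find such groupings is k-means, sketched below in plain Python. The presentation does not say which clustering algorithm it has in mind, so k-means is an assumption here; the customer figures and the naive initialisation (the first k points) are for illustration only.

```python
import math

def k_means(points, k, iterations=10):
    """Plain k-means. Returns (centroids, assignments)."""
    centroids = list(points[:k])          # naive initialisation: first k points
    assignments = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach every point to its nearest centroid.
        assignments = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                centroids[c] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return centroids, assignments

# Hypothetical customers: (visits per month, average spend).
customers = [(1.0, 10.0), (9.0, 95.0), (2.0, 12.0),
             (8.0, 90.0), (1.5, 11.0), (9.5, 100.0)]
centroids, segments = k_means(customers, k=2)
print(segments)
```

No class labels are supplied; the two market segments emerge from the data alone, which is what distinguishes clustering from classification.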
19. Example of Document Clustering
Clustering points : 3204 articles from the Los Angeles Times
Similarity Measure : How many words are common in these documents ( after excluding some common words )
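The similarity measure described above can be sketched as a count of shared words after stop words are dropped. The stop-word list and the three sample sentences below are invented for illustration; they are not the ones used in the Los Angeles Times example.

```python
# Illustrative stop-word list (an assumption, not the list used in the study).
STOP_WORDS = {"the", "a", "of", "in", "and", "to", "is"}

def content_words(text):
    """Lower-case the text and keep the distinct words that are not stop words."""
    return {w for w in text.lower().split() if w not in STOP_WORDS}

def common_word_count(doc_a, doc_b):
    """Similarity = number of distinct non-stop words the two documents share."""
    return len(content_words(doc_a) & content_words(doc_b))

doc1 = "The mayor of Los Angeles opened a new metro line"
doc2 = "A new metro line is planned in the city"
doc3 = "The Lakers won the game in overtime"

print(common_word_count(doc1, doc2))
print(common_word_count(doc1, doc3))
```

Documents that share many content words end up close together and can then be grouped by any clustering method.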
20. Clustering of S&P Stock Data
● Observe Stock Movements every day.
● Clustering points: Stock{UP/DOWN}
● Similarity Measure: Two points are more similar if the events described by them frequently happen together on the same day.
● We used association rules to quantify a similarity measure.
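The idea that two stock events are similar when they tend to occur on the same day can be sketched with a simple co-occurrence ratio. This is a stand-in and not the association-rule measure used in the original study; the ticker names and daily observations are hypothetical.

```python
# Hypothetical daily observations: the set of stock-movement events seen each day.
days = [
    {"AAA-UP", "BBB-UP", "CCC-DOWN"},
    {"AAA-UP", "BBB-UP"},
    {"AAA-DOWN", "CCC-DOWN"},
    {"AAA-UP", "BBB-UP", "CCC-UP"},
]

def co_occurrence(event_a, event_b, observations=days):
    """Of the days on which either event occurred, the fraction on which both did."""
    both = sum(1 for d in observations if event_a in d and event_b in d)
    either = sum(1 for d in observations if event_a in d or event_b in d)
    return both / either if either else 0.0

print(co_occurrence("AAA-UP", "BBB-UP"))
print(co_occurrence("AAA-UP", "CCC-DOWN"))
```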
21. Regression
● Predict a value of a given continuous valued variable based on the values of other variables, assuming a linear or nonlinear model of dependency.
○ Extensively studied in statistics and neural network fields.
● Examples:
○ Predicting sales amounts of new product based on advertising expenditure.
○ Predicting wind velocities as a function of temperature, humidity, air pressure, etc.
○ Time series prediction of stock market indices.
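For the linear case with a single predictor, the sales-versus-advertising example can be sketched with ordinary least squares. The figures below are made up and deliberately lie on an exact line (sales = 10 + 2 × advertising) so the result is easy to check; fit_line is a hypothetical helper name.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical data: advertising expenditure and resulting sales.
advertising = [1.0, 2.0, 3.0, 4.0]
sales = [12.0, 14.0, 16.0, 18.0]

slope, intercept = fit_line(advertising, sales)
predicted_sales = intercept + slope * 5.0   # forecast for a new spend level
print(slope, intercept, predicted_sales)
```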
22. Data Mining : Association Rules Mining
Association Rules
● which products should be kept along with other products
● which two products should never be discounted together
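Association rules are usually scored with two standard measures, support and confidence. The sketch below computes both over a handful of hypothetical shopping baskets; the basket contents are invented for illustration.

```python
# Hypothetical shopping baskets (one set of products per transaction).
baskets = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"milk", "butter"},
]

def support(items, transactions=baskets):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    return sum(1 for t in transactions if items <= t) / len(transactions)

def confidence(antecedent, consequent, transactions=baskets):
    """Of the transactions containing the antecedent, the fraction that also contain the consequent."""
    joint = support(set(antecedent) | set(consequent), transactions)
    return joint / support(antecedent, transactions)

print(support({"bread", "butter"}))
print(confidence({"bread"}, {"butter"}))
```

A rule such as bread => butter with high support and confidence suggests keeping the two products together; it also suggests that discounting both at once gives away margin on items that would sell together anyway.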
25. Definitions
Data Mining
● Is the process of extracting unknown, valid and actionable information from large databases and using this to make business decisions
● Non trivial process of identifying valid, novel, potentially useful and understandable / explainable patterns in data
Data Science is a rare combination of multiple skills that include
● Technology : obviously !
but also
● Curiosity - a desire to go below the surface and discover a hypothesis that can be tested
● Storytelling - create a business story around the data
● Cleverness - again obviously, to look at the problem from different angles
30. Map Reduce
● Input : A set of (key, value) pairs
● User supplies two functions
○ Map (k,v) => List(k1,v1)
○ Reduce (k1, list(v1)) => v2
● Output is the set of (k1,v2) pairs
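The Map and Reduce signatures above can be illustrated with the classic word-count example. This is a single-process sketch of the programming model only, not of Hadoop itself; run_map_reduce is a hypothetical driver that stands in for the framework's shuffle-and-group step.

```python
from collections import defaultdict

def map_fn(key, value):
    """Map (k, v) => list of (k1, v1): emit (word, 1) for every word in a line."""
    return [(word.lower(), 1) for word in value.split()]

def reduce_fn(key, values):
    """Reduce (k1, list(v1)) => v2: add up the counts for one word."""
    return sum(values)

def run_map_reduce(records, map_fn, reduce_fn):
    """Hypothetical driver: map every record, group by k1, then reduce each group."""
    groups = defaultdict(list)
    for k, v in records:
        for k1, v1 in map_fn(k, v):
            groups[k1].append(v1)
    return {k1: reduce_fn(k1, v1s) for k1, v1s in groups.items()}

# Input: (line number, line text) pairs.
lines = [(0, "big data is big"), (1, "data science")]
counts = run_map_reduce(lines, map_fn, reduce_fn)
print(counts)
```

On a real cluster the map calls run in parallel on different machines and the framework moves all values for the same key to one reducer; the user still writes only the two functions.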
31. Hadoop
A programming framework that allows you to run Map-Reduce jobs on a distributed cluster of low cost machines without having to bother about anything except
● the Map and Reduce functions
● loading data into HDFS
1. HIVE
a. A plug-in that allows one to use SQL like queries that are converted into map-reduce jobs
2. PIG
a. A scripting language for writing long queries
3. HBASE
a. A non-relational DBMS
4. SQOOP
a. moves data to and from HDFS
35. Conclusion
● Why data science ?
● Techniques
○ Statistics
○ Data Mining
○ Visualisation
● Tools & Platforms
○ R
○ Hadoop / MapReduce
○ Real Time Systems
● Business Domains
Data Science is a rare combination of multiple skills that include
● Technology : obviously !
but also
● Curiosity - a desire to go below the surface and discover a hypothesis that can be tested
● Storytelling - create a business story around the data
● Cleverness - again obviously, to look at the problem from different angles
37. Thank You
Contact
Prithwis Mukerjee
Professor, Praxis Business School
http://blog.yantrajaal.com
prithwis@praxis.ac.in
This presentation is accessible at the blog at the following URL
http://bit.ly/pm-datascience