DATA SCIENCE
Colloquium (7)
MS(LIS) 2013-2015
Indian Statistical Institute
Documentation Research and Training Centre
●
Data Science is a newly emerging field dedicated to
analyzing and manipulating data to derive insights
and build data products. It combines skill-sets
ranging from computer science, to mathematics, to
art. (www.kaggle.com)
●
Data science implies a focus involving data and, by
extension, statistics, or the systematic study of the
organization, properties, and analysis of data and its
role in inference, including our confidence in the
inference. (D.J.Patil)
●
In simple words, it is the process of extracting
information and knowledge from huge volumes of data.
Evolution
• 1900 - Statistics
• 1960 - “Data Mining”
• 2006 - Google Analytics appears
• 2007 - Business/Data/Predictive Analytics
• 2012 - Big Data surge
• 2013 - Data Science
• 2015 - ??
●
Data is growing at a very high (exponential) pace.
●
According to IBM, 2.5 exabytes - that's 2.5 billion
gigabytes (GB) - of data was generated every day in
2012. About 75% of data is unstructured, coming
from sources such as text, voice and video.
●
In 2012 the total volume of data reached 2.8 zettabytes
(ZB), and IDC forecasts that we will generate 40 ZB by 2020,
which is the equivalent of 5,200 GB of data for every
man, woman and child on Earth.
●
90% of all the data in the world today has been
created in the past few years.
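The unit arithmetic behind these figures can be checked with a short sketch. It assumes decimal (SI) units, so 1 GB = 10^9 bytes, 1 EB = 10^18 bytes and 1 ZB = 10^21 bytes; the slides do not state which convention is used, and the constant and function names are illustrative only.

```python
# Decimal (SI) byte units; an assumption, since the slides do not name a convention.
GB = 10**9
EB = 10**18
ZB = 10**21

def to_gigabytes(n_bytes):
    """Convert a byte count to gigabytes."""
    return n_bytes / GB

# 2.5 exabytes per day expressed in gigabytes.
daily_gb = to_gigabytes(2.5 * EB)

# Population implied by "40 ZB is the equivalent of 5,200 GB per person".
implied_population = (40 * ZB) / (5200 * GB)

print(f"2.5 EB = {daily_gb:.2e} GB")
print(f"Implied population: {implied_population:.2e}")
```

Under these assumptions 2.5 EB is 2.5 billion GB, as stated, and the 40 ZB forecast corresponds to a population of roughly 7.7 billion people.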
S.No.  Sub-Topic                               Speaker
1.     What is Data Science                    Sandip Das
2.     Data Scientist                          Anwesha Bhattacharya
3.     Applications of Data Science            Manasa Rath
4.     Workflow of Data Science                Dibakar Sen
5.     Challenges in Workflow of Data Science  Jayanta Kr. Nayek
6.     Tools and Technology                    Tanmay & Manash
7.     Machine Learning in Data Science        Samhati Soor
8.     Conclusion                              Shiv Shakti Ghosh
References
●
http://bit.ly/1gyRYcM
●
http://bit.ly/SdJ2OU
●
http://bit.ly/RzrZ9k
●
http://bit.ly/1pwlEY4
●
http://bit.ly/1pwlUq6
What is Data Science
Sandip Das
DATA SCIENCE
Data
What kind of data might you collect?
Data

How many Lily pads

The size of the Lily pads,
measured in inches

How many small,
medium or large
Lily pads

How many frogs
What is Data?

It is something you want to know.

A collection of facts.

Facts and statistics collected together for reference or analysis.

Data as the plural form of datum; as pieces of information; and as a collection
of object-units that are distinct from one another.

Data is undifferentiated observation of facts in terms of words, numbers,
symbols, etc.
What is Data?
Computer data is information processed or stored by a computer.
This information may be in the form of text documents, images,
audio clips, software programs, or other types of data. Computer
data may be processed by the computer's CPU and is stored in
files and folders on the computer's hard disk.
Science

The systematic observation of natural events and
conditions in order to discover facts about them and to
formulate laws and principles based on these facts.

Science involves more than the gaining of knowledge. It is
about gaining a deeper and often useful understanding of
the world.
The Science is an art of

Discovering what we don't know from data

Obtaining predictive, actionable insight from data

Creating Data products that have business impact

Building confidence in decisions that drive business
value
Data science

According to computer scientist Peter Naur:
“The science of dealing with Data, once they have
been established”

Data Science is the scientific study of the creation,
validation and transformation of data to create meaning.

Data science is the study of the generalizable extraction of
knowledge from data.
Multidisciplinary Approach
Domain Expertise

Domain expertise is proficiency, with special
knowledge or skills, in a particular area or topic.

Domain expertise includes knowing what problems
are important to solve and knowing what sufficient
answers look like. Domain experts understand what
the customers of their knowledge want to know.
Data Engineering
It is the data part of data science. It involves
Acquiring
Ingesting
Transforming
Storing
Retrieving data
Scientific Method
It is the process for acquiring new knowledge by applying
the principles of reasoning on empirical evidence derived
from testing hypotheses through repeatable experiments.
Statistics & Mathematics
Statistics (along with mathematics) is the cerebral part of
Data Science. They are used to collect, organize, analyse and
interpret data.
Advanced Computing
Advanced computing is the heavy lifting of data science. It
consists of software design and programming.
Visualization

It is the pretty face of data science.

A good visualization is the result of a creative process
that composes an abstraction of the data in an
informative and aesthetically interesting form.
Hacker mindset

Hacking is modifying one's own computer system,
including building, rebuilding, modifying and creating
software, electronic hardware or peripherals, in order
to make it better, make it faster, give it added
features.

Data science hacking involves inventing new models,
exploring.
References
●
http://bit.ly/1jZR0WA
●
http://bit.ly/1pwmV1m
●
http://bit.ly/1tkKyKG
●
http://bit.ly/1ntd13L
●
http://bit.ly/1wi9t5Z
Data Scientist
Anwesha Bhattacharya
(& I am not a data scientist)
Who is a data scientist?
●
A practitioner of data science is
called a data scientist. (Wikipedia)
●
Data scientists use technology
and skills to increase awareness,
clarity and direction for those
working with data.
(http://www.datascientists.net)
Why do we need data scientists?
●
Firstly, there is more data than we can consume. We
require a data scientist who can look at the data and
say, “This is important. Check out this one.”
●
They are the people who can understand and provide
meaning to the piles and piles of data that are
collected. “Big data” is the buzzword that represents
those piles.
●
Minimise the disruptions that are encountered while
dealing with data.
●
Present data with an awareness of the consequences of
presenting that data.
Data Scientist aims
Types of Data Scientists
Data scientists can be
broadly classified into
two categories:
Product-focused data scientists.
Business Intelligence style of
data scientists.
There are roughly 4 to 5
groups in each
category.
Product-focused Data Scientists

Data Researcher
The professionals in this category come from the
academic world and have in-depth backgrounds in
statistics or the physical or social sciences. This
type of data scientist often holds a PhD but is
weakly skilled in Machine learning, Programming
or Business.
Data Developer
These guys tend to concentrate on technical issues
that come with handling data. They are strong in
programming and machine learning but weak in
business and statistics skills.

Data Creatives
These are the guys who make something
innovative out of mountains of data. They are
strongly skilled in machine learning, Big Data,
programming and other skills to handle massive
data.

Data Business people
They represent the business side and are
responsible for making vital business decisions
through data analytics techniques. They are a
blend of business and technical proficiency.
Business Intelligence based Data Scientists
●
Quantitative, exploratory Data Scientists
Quantitative, exploratory data scientists are inclined to have
PhDs and use theory to comprehend behaviour. By
combining theory and exploratory research, these data
scientists improve products.
●
Operational Data Scientists
Operational data scientists frequently work in finance, sales or
operations teams in an organization. Their role is to analyse
the performance, responses and behavior of a process, to
improve the organization’s strategy and efficiency.
●
Product Data Scientists
Product data scientists fit into product management or
engineering. Their job is to understand the way users
make use of a product and to apply that knowledge to
fine-tune the product.
●
Marketing Data Scientists
Marketing data scientists focus on the user base, evaluate
performance and work on improving efficiency, pretty
much like the standard marketing guy.
●
Research Data Scientists
Research data scientists create insights from a data set.
Profile of Data Scientist
●
They love data
●
Have an investigative mindset
●
Goal of work: finding patterns in data
and data driven products
●
Are practitioners, not theorists
●
Have “hands on” skills
●
Have domain expertise
●
Team players
●
Technically focused
●
Versatile communication and
collaboration skills
●
Curiosity for exploring and
experimenting with data.
●
Sceptical people, likely to ask a lot of
questions around the viability of a
given solution and whether it will
really work.
Required skills
●
Data mining - Computational process of discovering patterns in large data
sets. The analysis step of the "Knowledge Discovery in Databases".
●
Programming - The act of instructing computers to perform tasks.
●
Algorithms - Step-by-step procedure for calculations used for analysis of
data.
●
Statistics – The collection, organization, analysis, interpretation and
presentation of data.
●
NLP - Interactions between computers and human languages.
●
Machine learning - The science of getting computers to act without being
explicitly programmed.
●
Distributed systems – The components located on networked computers
communicate and coordinate their actions by passing messages.
●
Visualization - The creation and study of the visual representation of data,
to communicate both abstract and concrete ideas.
●
.........
What Does a Data Scientist Do?
10 Things [most] Data Scientists Do:
1) Ask Good Questions.
What is What?
We don’t know! We’d like to know?
2) Explore data & generate hypothesis. Run experiments
3) Scoop, Scrape & Sample Data
4) Tame Data
5) Discover the unknowns.
6) Model Data. Model Algorithms.
7) Understand Data Relationships
8) Tell the Machine How to Learn from Data
9) Create Data Products that Deliver Actionable Insight
10) Communicate the results using visualization, presentations
DIKUW
Data     Information    Knowledge    Understanding   Wisdom
Raw      What           How to       Why             When
Numbers  Description    Extract      Cause & Effect  Prediction
Letters  Context        Test         Proved          What's best
Symbols  Relationships  Instruction  Known Unknowns  Unknown Unknowns
Roles: Data Engineer, Data Analyst, Data Miner, Data Scientist
Timeline: PAST to FUTURE
Data Scientist
-Familiarity with database systems, e.g. MySQL
-Better to be familiar with Java, Python
-Should have a clear understanding of various analytical functions (median, rank, etc.) and how to use them on data sets
-Perfection in mathematics, statistics, correlation, data mining, etc.
-Deep statistical insights and machine learning
Data Analyst
-Familiarity with data warehousing and business intelligence concepts
-In-depth exposure to SQL and analytics
-Strong understanding of Hadoop-based analytics
-Perfection regarding the tools and components of data architecture
-Proficiency in decision making
● Data analysis has been generally
used as a way of explaining some
phenomenon by extracting interesting
patterns from individual data sets with
well-formulated queries.
● Data science, on the other hand, aims
to discover and extract actionable
knowledge from the data, that is,
knowledge that can be used to make
decisions and predictions, not just to
explain what’s going on.
Data Scientist vs Data Analyst
Challenges of data scientist
●
Red tape
No access allowed
●
Unknown need
What's the organization's
goal?
●
Terminology
What's a wonkulator?
●
Real world data
Messy, noisy, missing
●
Analysis distrust
...but I don't like that
result
References
●
Zhukov, Leonid. Data Scientists. Higher School of Economics.
National Research University.
●
http://bit.ly/1kduMvA
●
http://bit.ly/1orF9DL
●
http://bit.ly/1tMBBvQ
●
http://bit.ly/1kJ9gU8
●
http://bit.ly/TS9H5e
●
http://bit.ly/1jZR0WA
APPLICATIONS of DATA
SCIENCE
by
Manasa Rath
Reaching to Data Science
APPLICATIONS
agriculture
pharmacy
energy
retail
tourism
real estate
import-export
finance
business
services
Applications in Education sector
-Survey done by the Pearson group to improve learning software and
course materials, for better quality and efficacy in learning
-Tools used are Python, R and Google BigQuery
Data Science in Healthcare Industry
-Consider a group that has been diagnosed with Type 2 diabetes, of which some
subset has developed complications
-We would like to know whether there is any pattern to the complications and
whether the probability of complication can be predicted and therefore
acted upon
Healthcare Use Database Snippet
Extracting Interesting Patterns of Health outcomes from Healthcare System Care
Whether the pattern is robust and
predictive ??
OBSERVATIONS
What is the incidence of complications of Type 2 diabetes for people over 37 who are on more than six medications?
Remarks
-Predictive accuracy becomes a primary objective, the computer
tends to play a significant role in model building and decision
making
Shows an integrated
skill set spanning
mathematics, statistics,
AI, databases and
optimization, along with a
deep understanding of
the craft of problem
formulation to engineer
effective solutions
Applications in Social Networking
sites
Key Points
--ability to interpret unstructured data and integrate it with numbers
further increases our ability to extract useful knowledge in real-
time and act on it
References
1.Data Science and Prediction by Vasant Dhar
http://bit.ly/1tiRvMr
Workflow of Data Science
Dibakar Sen
Work flow of Data Science
●
The workflow process
consists of three major
activities:
-Organising
-Packaging
-Delivering
Work flow Phases
[Diagram: workflow phases, including understanding of data and evaluation]
Understanding of Data
- set objectives or goal
- set data fields
- data collection procedure
Preparation Phase
●
Acquire data
The obvious first step in any data science
workflow is to acquire the data to analyze. Data can
be acquired from a variety of sources. e.g.,:
-Existing Data can be used (e.g., U.S. Census data
sets).
-Data can be automatically generated by computer
software.
-Data can be manually entered into a spreadsheet
or text file by a human through survey.
Preparation Phase
●
Reformat and clean data
-Before analysis begins, we need to verify that the data are accurate
and that the variables are well named and properly labeled.
-We have to store the data in the desired format.
- Verify the sample and variables
- Do the variables have the correct values?
- Are missing data coded appropriately?
-Are the data internally consistent?
- Is the sample size correct? etc.
-Programmers reformat and clean data either by writing scripts or by
manually editing data, say, a spreadsheet.
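The verification questions above can be turned into a small script. The following is a minimal sketch, assuming hypothetical survey records, an assumed expected sample size and an assumed plausible age range; none of these names or values come from the slides.

```python
# Hypothetical survey records; None marks a missing value.
records = [
    {"id": 1, "age": 34, "medications": 2},
    {"id": 2, "age": None, "medications": 7},
    {"id": 3, "age": 212, "medications": 1},
    {"id": 4, "age": 51, "medications": 4},
]

EXPECTED_SAMPLE_SIZE = 4   # assumed study design
AGE_RANGE = (0, 120)       # assumed plausible range for the age variable

def check_records(rows, expected_size, age_range):
    """Return a list of human-readable problems found in the rows."""
    problems = []
    # Is the sample size correct?
    if len(rows) != expected_size:
        problems.append(f"sample size is {len(rows)}, expected {expected_size}")
    low, high = age_range
    for row in rows:
        age = row["age"]
        if age is None:
            # Are missing data coded appropriately?
            problems.append(f"id {row['id']}: age is missing")
        elif not low <= age <= high:
            # Do the variables have plausible values?
            problems.append(f"id {row['id']}: age {age} is out of range")
    return problems

problems = check_records(records, EXPECTED_SAMPLE_SIZE, AGE_RANGE)
for problem in problems:
    print(problem)
```

A script like this only reports problems; deciding whether to drop, correct or impute a flagged value remains a judgment for the analyst.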
Analysis Phase
●
Data Analysis
- The core activity of data science is the
analysis phase: writing, executing, and refining
computer programs to analyze and obtain insights
from data.
- Different "scripting" languages such
as Python, Perl, R, and MATLAB are used to
analyze the data. However, data scientists also use compiled
languages such as C, C++, and Fortran when
appropriate.
●
In the analysis phase, the programmer engages in a repeated
iteration cycle of editing scripts, executing to produce output files,
inspecting the output files to gain insights and discover mistakes,
debugging, and re-editing.
Reflection / Evaluation Phase
While the analysis phase involves programming, the reflection phase
involves thinking and communicating about the outputs of
analyses. After inspecting a set of output files, a data scientist might
perform the following types of reflection:
-Take notes
- Hold meetings
- Make comparisons and
explore alternatives
Dissemination Phase
The final phase of data science is disseminating results: preparing
reports in order to communicate findings to the appropriate
audience. Results are most commonly in the form of written reports
such as internal memos, slideshow presentations, business / policy
white papers, or academic research publications.
●
Beyond presenting results in written form,
some data scientists also want to distribute
their software so that colleagues can
reproduce their experiments or play with
their prototype systems.
References
●
http://bit.ly/1jZcx2I
●
http://bit.ly/1jZeTyN
●
http://bit.ly/1hbQuWx
Challenges in Workflow of Data Science
Jayanta Kr. Nayek
Preparation phase
Acquire data:
-Keeping track of provenance :
-Where each piece of data comes from and whether it is still up-to-date.
-Data management :
-Programmers must assign names to data files that they create or download
and then organize those files into directories.
-When they create or download new versions of those files, they must make
sure to assign proper filenames to all versions and keep track of their
differences.
-Storage :
-Sometimes there is so much data that it cannot fit on a single hard drive,
so it must be stored on remote servers.
Preparation Phase
Reformat and clean data :
-A related problem is that raw data often contains semantic errors (errors in
logic or arithmetic that must be detected at run time), missing entries, or
inconsistent formatting, so it needs to be "cleaned" prior to analysis.
-Data integration :
-Data integration involves combining data residing in different sources and
providing users with a unified view of these data.
-Heterogeneous Data:
-Data integration involves synchronizing huge quantities of variable,
heterogeneous data resulting from internal legacy systems that vary in data
format. (A legacy system is an old method, technology, computer system, or
application program that is, or relates to, a previous or outdated computer
system.) Legacy systems may have been created around flat file, network, or
hierarchical databases.
Preparation Phase
●
Data Integration Problems:
-Unanticipated Costs:
-Labor costs for initial planning, evaluation, programming and
additional data acquisition
-Software and hardware purchases
-Unanticipated technology changes/advances
-Both labor and the direct costs of data storage and
maintenance
-Lack of Data Management Expertise:
-support required to engage and convey to everyone in the agency
the need for and benefits of data integration is unlikely
to flow from leaders who lack awareness of or
commitment to the benefits of data integration.
Preparation Phase
Data transmission:
-It is the physical transfer of data over a point-to-point or
point-to-multipoint communication channel.
-Cloud data storage has become popular with the development of cloud
technologies.
-We know that the network bandwidth capacity is the bottleneck in
cloud and distributed systems, especially when the volume of
communication is large.
-On the other hand, cloud storage also leads to data security problems, such as
the need for data integrity checking.
Analysis Phase
-Data inconsistency and incompleteness:
-A number of data preprocessing techniques, including data cleaning, data
integration, data transformation and data reduction, can be applied to remove
noise and correct inconsistencies.
-Scalability:
-Scalability is the biggest and most important challenge when we deal with
Big Data analysis.
-In the last few decades, researchers have paid more attention to accelerating
analysis algorithms to cope with increasing volumes of data, and to speeding up
processors in line with Moore’s Law.
-Data Curation:
-Data curation is aimed at data discovery and retrieval, data quality assurance,
value addition, reuse and preservation over time.
-Existing database management tools are unable to process Big Data that
has grown so large and complex.
Analysis Phase
-Timeliness:
-For real-time Big Data applications, such as navigation, social networks, finance, biomedicine,
astronomy, intelligent transport systems, and the Internet of Things, timeliness is the top
priority. How can we guarantee the timeliness of the response when the volume of
data to be processed is very large?
-File and metadata management:
-Repeatedly editing and executing scripts while iterating on experiments causes the
production of numerous output files, such as intermediate data, textual reports, tables,
and graphical visualizations.
-This leads to data management problems due to the abundance of files and
the fact that programmers often later forget their own ad-hoc naming conventions.
-Data security:
-Firstly, the size of Big Data is extremely large, which challenges existing protection approaches.
-Secondly, it also leads to a much heavier security workload.
Analysis Phase
-Absolute running times:
Scripts might take a long time to terminate, either due to large amounts
of data being processed or the algorithms being slow.
-Incremental running times:
Scripts might take a long time to terminate after minor incremental code
edits done while iterating on analyses, which wastes time re-
computing almost the same results as previous runs.
-Crashes from errors:
Scripts might crash prematurely due to errors in either the code or
inconsistencies in data sets. Programmers often need to endure several
rounds of debugging before their scripts can terminate with useful results.
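Two of these problems, repeated recomputation and crashes on inconsistent data, have simple partial remedies. The sketch below is one possible approach and not something the slides prescribe: it caches the result of a slow step in memory with `functools.lru_cache` and collects unparseable rows instead of stopping at the first one. The function names and the sample rows are hypothetical.

```python
import functools

@functools.lru_cache(maxsize=None)
def expensive_summary(values):
    """Stand-in for a slow analysis step; `values` must be hashable (a tuple)."""
    return sum(v * v for v in values) / len(values)

def safe_parse(raw_rows):
    """Parse numeric strings, collecting bad rows instead of crashing."""
    good, bad = [], []
    for raw in raw_rows:
        try:
            good.append(float(raw))
        except ValueError:
            bad.append(raw)
    return tuple(good), bad

values, rejected = safe_parse(["1", "2", "n/a", "3"])
first = expensive_summary(values)    # computed on the first call
second = expensive_summary(values)   # served from the in-memory cache
print(first, rejected, expensive_summary.cache_info().hits)
```

An in-memory cache only helps within a single run. Avoiding recomputation between separate script executions would need results saved to disk, which this sketch does not attempt.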
Reflection Phase
●
Take notes:
Since notes are a form of data, the usual data management problems arise in
notetaking, most notably how to organize notes and link them with
the context in which they were originally written.
●
Make comparisons and explore alternatives:
Data scientists must organize, manage, and compare these graphs to gain
insights and ideas for what alternative hypotheses to explore.
Dissemination Phase
-Functionalities:
-To convey information easily by providing knowledge hidden in the complex
and large-scale data sets, both aesthetic form and functionality are
necessary.
-Current tools mostly perform poorly in functionality and
response time.
-Scalability :
-It is particularly difficult to conduct data visualization (the main objective
of data visualization is to represent knowledge more intuitively and
effectively by using different graphs) because of the large size and high
dimension of Big Data.
Dissemination Phase
●
Difficult to distribute research code:
Some data scientists also want to distribute their software so
that colleagues can reproduce their experiments or play
with their prototype systems. It is difficult to distribute
research code in a form that other people can easily
execute on their own computers.
●
Difficult to reproduce the results:
It can even be difficult to reproduce the results of one's own
experiments a few months or years in the future, since
one's own operating system and software inevitably
get upgraded in some incompatible manner such that
the original code no longer runs.
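One small, partial mitigation is to save a description of the software environment next to each set of results, so that a later reader at least knows what the code originally ran on. The following is a minimal sketch using only the Python standard library; the function name and the choice of fields are illustrative assumptions, and it does not by itself make old code runnable again.

```python
import json
import platform
import sys

def environment_snapshot():
    """Collect basic facts about the interpreter and operating system."""
    return {
        "python_version": platform.python_version(),
        "implementation": platform.python_implementation(),
        "operating_system": platform.system(),
        "os_release": platform.release(),
        "executable": sys.executable,
    }

snapshot = environment_snapshot()
# Printing as JSON makes the record easy to store alongside output files.
print(json.dumps(snapshot, indent=2))
```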
Reference
●
Chen, C.L. Philip and Zhang, Chun-Yang (2014).
Data-intensive applications, challenges, techniques
and technologies: A survey on Big Data. Information
Sciences. Elsevier. Department of Computer and
Information Science, Faculty of Science and
Technology, University of Macau, Macau, China.
●
http://bit.ly/1jZcx2I
●
http://1.usa.gov/SNspKm
TECHNOLOGY and Tools for DATA
SCIENCE
TANMAY MONDAL & MANASH KUMAR
We need to
● Organise Data
● Analyse Data
● Package and Deliver Data
Data Science Tools

Language
− Java, R, Python, ...

Databases/Data Warehouses
− Apache Cassandra, Apache HBase, MongoDB, ....

Data Mining
− RapidMiner/RapidAnalytics, Orange, Weka, ....

File Systems
− Gluster, Hadoop Distributed File System, ...
Data Science Tools

Big Data Search
− Lucene, Solr, ...

Data Aggregation and Transfer
− Sqoop, Flume, ....

Miscellaneous Big Data Tools
– Hadoop, Avro, Zookeeper, ...

......................
What is Hadoop?
●
The Apache Hadoop is a framework that allows for
the distributed processing of large data sets across
clusters of computers using simple programming
models.
[Figure: a Hadoop cluster made up of nodes]
Why Hadoop?
• Handles enormous data volumes.
• Cost-effective.
• Scalable.
• Fault tolerant.
Origin of Hadoop
• Google introduced two key technologies for handling Big Data: the Google File
System (a distributed file system) in 2003 and MapReduce
(a framework for a distributed compute model) in 2004.
• Early in 2005, the Nutch developers had a working MapReduce
implementation in Nutch, and by the middle of that year all the major
Nutch algorithms had been ported to run using MapReduce and NDFS.
• In February 2006 they moved out of Nutch to form an independent
subproject of Lucene called Hadoop.
• First release of Apache Hadoop in September 2007
When should we go for Hadoop ?

Data is too huge

Unstructured data

Parallelism

Processes are independent

Need better scalability
The Hadoop Ecosystem
●
HDFS - Hadoop Distributed File System.
●
MapReduce - A distributed framework for executing work in
parallel.
• Hive - Hive is a data warehouse infrastructure built on top of
Hadoop for providing data summarization, query, and
analysis.
●
Pig – Pig is a high-level platform for creating MapReduce
programs used with Hadoop.
●
HBase – A non-relational, distributed database system.
●
..........
The Major Components of Hadoop

Hadoop uses its own distributed file system, HDFS, which makes
data available to multiple computing nodes.

Hadoop uses MapReduce, where the application is divided into
many small fragments of work, each of which may be executed or
re-executed on any node in the cluster.
HDFS

Hierarchical UNIX-like file system for data storage,
with splitting of large files into blocks.

Stores files in blocks across many nodes in a
cluster.

Distribution and replication of blocks to different
nodes.

Has a master-slave architecture.
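The idea of splitting a file into blocks and replicating each block on several nodes can be illustrated with a toy model. This is a simplified sketch and not how HDFS actually chooses nodes: the block size, the replication factor, the node names and the round-robin placement are all assumptions made for illustration.

```python
import math

def split_into_blocks(file_size, block_size):
    """Return the sizes of the blocks a file of `file_size` (> 0) is cut into."""
    n_blocks = math.ceil(file_size / block_size)
    sizes = [block_size] * (n_blocks - 1)
    # The last block holds whatever is left over.
    sizes.append(file_size - block_size * (n_blocks - 1))
    return sizes

def place_replicas(n_blocks, nodes, replication):
    """Assign each block to `replication` distinct nodes, round-robin.

    Assumes replication <= len(nodes); real placement policies are more involved.
    """
    placement = {}
    for block in range(n_blocks):
        placement[block] = [nodes[(block + r) % len(nodes)] for r in range(replication)]
    return placement

# Illustrative values in megabytes, not taken from the slides.
blocks = split_into_blocks(file_size=300, block_size=128)
placement = place_replicas(len(blocks), ["node1", "node2", "node3", "node4"], replication=3)
print(blocks)
print(placement)
```

Because every block lives on more than one node, the loss of a single node in this model leaves at least two copies of each block available.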
HDFS Architecture
HDFS ...
NameNode

Runs on a single node as a master process

Holds file metadata (which blocks are where)

Directs client access to files in HDFS
SecondaryNameNode

Maintains a copy of the NameNode metadata
Data Node
●
Stores data in the local file system
●
Periodically sends a report of all existing blocks
to the NameNode
WHAT IS MAP REDUCE?
MapReduce is a programming model for
processing large data sets with a parallel,
distributed algorithm on a cluster
Map Reduce Paradigm
Data processing system with two key phases
Map
Perform a map function on input key/value pairs to
generate intermediate key/value pairs
Reduce
Perform a reduce function on intermediate
key/value groups to generate output key/value pairs
Map Reduce Daemons
•JobTracker (Master)
-Monitors job and task progress
- Manages MapReduce jobs
-Giving tasks to different nodes
•TaskTracker (Slave)
- Creates individual map and reduce tasks
- Reports task status to JobTracker
-Runs on same node as DataNode service
Hadoop Map Reduce Components
Map Phase
Input Format
Record Reader
Mapper
Combiner
Reduce Phase
Shuffle
Sort
Reducer
Output Format
How does Map Reduce work?
➢
The run time partitions the input and provides it to different
Map instances
➢
Map (key, value) → (key’, value’)
➢
The run time collects the (key’, value’) pairs and distributes
them to several Reduce functions so that each Reduce function
gets the pairs with the same key’.
➢
Map and Reduce are user-written functions in Java
WORD COUNT IN MAP REDUCE
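The word count example can be sketched in a few lines. The version below is a single-process Python simulation of the map, shuffle and reduce steps, written only to show the data flow; it does not use the Hadoop API (where these functions would be written in Java), and the function names and sample documents are made up.

```python
from collections import defaultdict

def map_phase(document):
    """Map: emit an intermediate (word, 1) pair for every word."""
    for word in document.lower().split():
        yield word, 1

def shuffle(pairs):
    """Shuffle: group intermediate values by key, as the run time would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single output pair."""
    return key, sum(values)

documents = ["the quick brown fox", "the lazy dog", "the fox"]

# Run every document through the mapper, then group and reduce.
intermediate = [pair for doc in documents for pair in map_phase(doc)]
counts = dict(reduce_phase(k, v) for k, v in sorted(shuffle(intermediate).items()))
print(counts)
```

On a real cluster the map calls would run on different nodes and the shuffle would move data across the network, but the contract of each function stays the same.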
Validation of data extract and load into
EDW(Enterprise Data Warehouse)
Once the MapReduce process is completed and data
output files are generated, the data is moved to an
enterprise data warehouse or any other
transactional system, depending on the
requirement.
USERS OF HADOOP
Yahoo! -

More than 100,000 CPUs in 40,000 computers running
Hadoop

Produces data that was used in every Yahoo! Web search
query
Facebook -

In 2010 Facebook claimed that they had the largest
Hadoop cluster in the world with 21 PB of storage.

On June 13, 2012 they announced the data had grown to
100 PB.

Each (commodity) node has 8 cores and 12 TB of storage
USERS OF HADOOP
Adobe -
Adobe uses Apache Hadoop and Apache HBase in
several areas from social services to structured data storage
and processing for internal use.
Currently have about 30 nodes running HDFS
Ebay -
532 nodes cluster (8 * 532 cores, 5.3PB)
Heavy usage of Java MapReduce, Apache Pig, Apache
Hive, Apache HBase
Using it for Search optimization and Research.
Twitter

We use Apache Hadoop to store and process tweets, log
files, and many other types of data generated across
Twitter.
GBIF (Global Biodiversity Information Facility)

Nonprofit organization that focuses on making scientific
data on biodiversity available via the Internet

18 nodes running a mix of Apache Hadoop and Apache
HBase
University of Glasgow

30 nodes cluster (Xeon Quad Core 2.4GHz, 4GB RAM,
1TB/node storage).
To facilitate information retrieval research & experimentation,
particularly for TREC
Greece.com

Using Apache Hadoop for analyzing data for millions of
images, log analysis, data mining
References
http://bit.ly/1km1e46

http://bit.ly/Rzuzfz

http://yhoo.it/1pheFVK

Big data: Testing Approach to Overcome Quality Challenges
By Mahesh Gudipati, Shanthi Rao, Naju D. Mohan and Naveen
Kumar Gajja.
Machine Learning
Samhati Soor
What is it?
Learning is a process of knowledge acquisition with specific
purpose.
Machine learning is the study of how to use computers to
simulate human learning activities.
[Diagram: a training set feeds a learning algorithm, which produces a hypothesis; the hypothesis maps an input to a predicted output, with feedback]
Why Machine Learning is Possible?
Mass Storage
More data available
Higher Performance of Computer
Larger memory in handling the data
Greater computational power for calculating and even
online learning
Machine Learning Basics: 1. General Introduction
Basic Structure of the Machine
Learning System
External
environment
Corpus
study
Knowledge
Representation
Execution
Machine Learning Model
The Goal of Machine Learning is...
to create a predictive model that is indistinguishable
from a correct model.
Without Logic
With Logic
Two Phases
Machine learning methods are broken into two
phases:
Training
Application
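The two phases can be made concrete with a very small example. The sketch below uses ordinary least squares on one input variable as a stand-in learning method; the choice of method, the function names and the made-up training set are assumptions for illustration, not something the slides specify.

```python
def train(xs, ys):
    """Training phase: fit y = slope * x + intercept by least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return slope, intercept

def apply_model(model, x):
    """Application phase: use the fitted model on a new input."""
    slope, intercept = model
    return slope * x + intercept

# Made-up training set that follows y = 2x + 1 exactly.
training_x = [0.0, 1.0, 2.0, 3.0]
training_y = [1.0, 3.0, 5.0, 7.0]

model = train(training_x, training_y)
prediction = apply_model(model, 10.0)
print(prediction)
```

The model is fitted once from the training set and can then be applied to any number of new inputs without seeing the training data again.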
Types of Machine Learning
Main types:
1. Supervised Learning
2. Unsupervised learning
3. Reinforcement learning
Other types:
1. Semi-supervised learning
2. Time-series forecasting
3. Anomaly detection
4. Active learning
The Main Research Work on Machine
Learning Field
Task-oriented research
Cognitive simulation
Theoretical analysis
Data Science and Machine Learning
If we give the computer rules and/or algorithms to
automatically search through our data to “learn” how to
recognize patterns and make complex decisions (such as
identifying spam emails), we are implementing
machine learning.
In Data science, Data scientists use both statistical techniques
and machine learning algorithms for identifying patterns and
structure in data.
Role of Machine Learning in Data
Science
https://doubleclix.wordpress.com/category/data-science/
A Simple Implementation
Suppose we have a model consisting of the probability θ of
the coin landing heads (with a prior over θ), while the data
consist of the results of N coin flips.
We observe some data.
Our goal is to determine the model from the data, i.e. to
find the probability of the model given the data,
p(model|data).
Using conditional probability,
p(data and model) = p(data|model) * p(model) --(1)
p(data and model) = p(model|data) * p(data) --(2)
From (1) and (2) we get,
p(data|model) * p(model) = p(model|data) * p(data)
That implies:
p(model|data) = (p(data|model) * p(model)) / p(data)
posterior = (likelihood * prior) / evidence
The likelihood distribution describes the likelihood of data
given model; it reflects our assumptions about how the
data was generated.
The prior distribution describes our assumptions about
model before observing the data.
The posterior distribution describes our knowledge of
model, incorporating both the data and the prior.
The evidence is useful in model selection.
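For the coin-flip model, the posterior can be computed numerically by applying Bayes' rule (posterior = likelihood * prior / evidence) on a grid of candidate values of θ. The sketch below assumes a uniform prior over θ and independent flips; the function name, the grid size and the sample flips are illustrative choices, not part of the original slides.

```python
def posterior_on_grid(flips, grid_size=101):
    """Return (thetas, posterior) for p(theta | flips) under a uniform prior."""
    heads = sum(flips)
    tails = len(flips) - heads
    thetas = [i / (grid_size - 1) for i in range(grid_size)]
    prior = [1.0 / grid_size] * grid_size
    # Likelihood of the observed flips for each candidate theta.
    likelihood = [t**heads * (1 - t)**tails for t in thetas]
    unnormalized = [l * p for l, p in zip(likelihood, prior)]
    evidence = sum(unnormalized)          # p(data) on this grid
    posterior = [u / evidence for u in unnormalized]
    return thetas, posterior

flips = [1, 1, 0, 1, 0, 1, 1, 1]   # made-up data: 1 = heads, 0 = tails
thetas, posterior = posterior_on_grid(flips)
best = thetas[posterior.index(max(posterior))]
print(f"Most probable theta on the grid: {best:.2f}")
```

With a uniform prior the most probable θ coincides with the observed fraction of heads; a different prior would pull the posterior towards the values it favours.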
Working Method of a Predictive
Modeler and a Data Scientist
A predictive modeler may use a machine learning approach to predict a
value or the likelihood of an outcome, given a number of input variables.
A data scientist applies these same approaches on large data sets,
writing code and using software adapted to work on big data.
The available library of statistical and machine learning
algorithms for evaluating and learning from big data is
growing, but is not yet as comprehensive as the algorithms
available for the non-distributed world.
The algorithms vary by product, so it is important to
understand what is and is not available.
Moreover, not all algorithms familiar to the statistician and
data miner are easily converted to the distributed computing
environment.
The bottom line is that, while fitting models on big data has
the potential benefit of greater predictive power, some of the
costs are loss of flexibility in algorithm choices and/or
extensive programming time.
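One way to see why some algorithms move to a distributed setting more easily than others is to look at a statistic that can be built from per-partition summaries. The sketch below is illustrative only: the partitions are plain Python lists standing in for data held on different machines, and the function names are hypothetical.

```python
def summarise(chunk):
    """Map step: reduce one partition to (count, sum, sum of squares)."""
    return len(chunk), sum(chunk), sum(x * x for x in chunk)


def combine(a, b):
    """Reduce step: merge two partial summaries."""
    return a[0] + b[0], a[1] + b[1], a[2] + b[2]


def mean_and_variance(partitions):
    """Mean and population variance computed from partial summaries only."""
    total = (0, 0.0, 0.0)
    for chunk in partitions:
        total = combine(total, summarise(chunk))
    n, s, ss = total
    mean = s / n
    variance = ss / n - mean ** 2
    return mean, variance


# Three partitions, as if stored on three different nodes.
partitions = [[1.0, 2.0], [3.0], [4.0, 5.0, 6.0]]
mean, variance = mean_and_variance(partitions)
print(mean, variance)
```

The mean and variance work because each partition can be summarised independently and the summaries simply add up. An exact median cannot be obtained by combining per-partition medians, which is one small instance of the loss of flexibility described above.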
Prospective
Conclusion
Shiv Shakti Ghosh
Research Areas
Cloud computing
Databases and Database Management Systems
Natural language processing
Signal Processing
Computer vision
Cloud computing
Cloud computing involves distributed computing
over a network, where a program or application may
run on many connected computers at the same time.
It specifically refers to a server connected through a
communication network such as the Internet, an
intranet, a local area network (LAN) or wide area
network (WAN).
Issues
Privacy - The increased use of cloud computing
services such as Gmail and Google Docs has heightened
privacy concerns. These services hold a large volume
of user data, which carries a serious risk of being
disclosed either accidentally or deliberately.
Legal - Certain legal issues arise with cloud
computing, including trademark infringement,
security concerns and sharing of proprietary data
resources.
Vendor lock-in - Cloud computing is still relatively
new, and standards are still being developed. Many cloud
platforms and services are built on the specific
standards, tools and protocols developed by a
particular vendor for its particular cloud offering.
This is a major challenge for interoperability.
Research areas
open interoperation across cloud solutions at IaaS,
PaaS and SaaS levels
managing multi tenancy at large scale and in
heterogeneous environments
dynamic and seamless elasticity from private clouds
to public clouds for unusual and/or infrequent
requirements
data management in a cloud environment, taking the
technical and legal constraints into consideration
Databases & DBMS
A database is an organized collection of data. The
data are typically organized in a way that supports
processes requiring this information.
Database management systems (DBMSs) are
specially designed software applications that interact
with the user, other applications, and the database
itself to capture and analyze data.
Issues
Data definition – Defining new data structures for a
database, removing data structures from the database,
modifying the structure of existing data.
Update – Inserting, modifying, and deleting data.
Retrieval – Obtaining information either for end-user
queries and reports or for processing by applications.
Administration – Registering and monitoring users,
enforcing data security, monitoring performance,
maintaining data integrity, dealing with concurrency
control, and recovering information if the system fails.
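The first three of these functions can be shown with Python's built-in `sqlite3` module and an in-memory database. This is a small sketch, not a recommended schema: the `journal` table and its rows are made up for the example, and administration tasks such as user management are not shown because SQLite has no user accounts.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Data definition: create a new data structure.
conn.execute(
    "CREATE TABLE journal ("
    "id INTEGER PRIMARY KEY, title TEXT NOT NULL, publisher TEXT)"
)

# Update: insert, modify and delete data.
conn.executemany(
    "INSERT INTO journal (title, publisher) VALUES (?, ?)",
    [
        ("EPJ Data Science", "Springer"),
        ("Data Science Journal", "CODATA"),
        ("Draft Journal", None),  # hypothetical row, deleted below
    ],
)
conn.execute(
    "UPDATE journal SET publisher = ? WHERE title = ?",
    ("SpringerOpen", "EPJ Data Science"),
)
conn.execute("DELETE FROM journal WHERE title = ?", ("Draft Journal",))
conn.commit()

# Retrieval: query the data for a report or an application.
rows = conn.execute(
    "SELECT title, publisher FROM journal ORDER BY title"
).fetchall()
print(rows)
```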
Research areas
Research activity includes theory and development of
prototypes and models. Notable research topics
include the atomic transaction concept and related
concurrency control techniques, query languages and
query optimization methods, RAID, and more.
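The atomic transaction concept means that a group of changes either all take effect or none do. A minimal sketch with Python's built-in `sqlite3` module follows; the `account` table, the names and the amounts are invented, and the `transfer` helper is hypothetical. Using the connection as a context manager commits on success and rolls back if an exception is raised.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE account ("
    "name TEXT PRIMARY KEY, balance INTEGER CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO account VALUES (?, ?)", [("alice", 100), ("bob", 20)])
conn.commit()


def transfer(conn, src, dst, amount):
    """Move amount from src to dst as one atomic unit; return True on success."""
    try:
        with conn:  # commit if the block succeeds, roll back if it raises
            conn.execute(
                "UPDATE account SET balance = balance + ? WHERE name = ?",
                (amount, dst),
            )
            conn.execute(
                "UPDATE account SET balance = balance - ? WHERE name = ?",
                (amount, src),
            )
        return True
    except sqlite3.IntegrityError:
        return False


# bob has only 20, so the debit violates the CHECK constraint.
first_ok = transfer(conn, "bob", "alice", 50)
after_failure = dict(conn.execute("SELECT name, balance FROM account"))

second_ok = transfer(conn, "alice", "bob", 30)
after_success = dict(conn.execute("SELECT name, balance FROM account"))
print(first_ok, after_failure, second_ok, after_success)
```

In the failed transfer the credit to alice had already been executed, yet it is undone together with the rejected debit, which is the all-or-nothing behaviour the term describes.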
NLP
Natural language processing (NLP) is a field of
computer science, artificial intelligence, and
linguistics concerned with the interactions between
computers and human (natural) languages. As such,
NLP is related to the area of human–computer
interaction.
Human-level natural language processing is an AI
problem that is equivalent to making computers as
intelligent as people. NLP's future is therefore tied
closely to the development of AI in general.
As natural language understanding improves,
computers will be able to learn from the information
online and apply what they learned in the real world.
In the future, humans may not need to code
programs, but will dictate to a computer in a human
natural language, and the computer will understand
and act upon the instructions.
Signal Processing
Signal processing is an area of Systems Engineering,
Electrical Engineering and applied mathematics that
deals with operations on or analysis of analog as well
as digitized signals, representing time-varying or
spatially varying physical quantities.
Signals of interest can include sound,
electromagnetic radiation, images, and sensor
readings, for example biological measurements such
as electrocardiograms, control system signals,
telecommunication transmission signals, and many
others.
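A simple instance of an operation on a digitized signal is smoothing with a moving average. The sketch below is only an illustration: the sample values are invented and `moving_average` is a hypothetical helper that returns the mean of each full window.

```python
def moving_average(signal, window=3):
    """Smooth a sampled signal with a sliding mean over full windows only."""
    if window < 1 or window > len(signal):
        raise ValueError("window must be between 1 and len(signal)")
    return [
        sum(signal[i:i + window]) / window
        for i in range(len(signal) - window + 1)
    ]


# Invented samples: a value that alternates between 0 and 3.
samples = [0.0, 3.0, 0.0, 3.0, 0.0, 3.0]
smoothed = moving_average(samples, window=2)
print(smoothed)
```

Averaging over two samples removes the alternation completely and leaves a constant level, which shows how a filter can suppress a fast-changing component while keeping the slowly varying part.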
Computer vision
Computer vision is a field that includes methods for
acquiring, processing, analyzing, and understanding
images and, in general, high-dimensional data from
the real world in order to produce numerical or
symbolic information.
A theme in the development of this field has been to
duplicate the abilities of human vision by
electronically perceiving and understanding an
image.
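The step from an image to numerical or symbolic information can be illustrated on a tiny greyscale grid. This is a deliberately simple sketch and not a real vision pipeline: the image is a hand-written list of pixel intensities, the threshold of 128 is an arbitrary assumption, and `bright_region` is a hypothetical helper.

```python
def bright_region(image, threshold=128):
    """Return (top, left, bottom, right) of pixels >= threshold, or None."""
    coords = [
        (r, c)
        for r, row in enumerate(image)
        for c, value in enumerate(row)
        if value >= threshold
    ]
    if not coords:
        return None
    rows = [r for r, _ in coords]
    cols = [c for _, c in coords]
    return min(rows), min(cols), max(rows), max(cols)


# Invented 4x4 greyscale image with one bright square in the middle.
image = [
    [10, 12, 11, 9],
    [10, 200, 210, 9],
    [11, 190, 220, 8],
    [10, 12, 11, 9],
]
box = bright_region(image)
print(box)
```

The output is a bounding box, a compact description of where the bright object lies, rather than the raw pixel values.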
Data Science Higher Education Programmes 2014
Institute / Organization | Course
Indiana University, Indiana, US * | Online Certificate in Data Science (January 2014)
University of California, Berkeley | Master of Information and Data Science program
Saint Peter's University, US ** | Master of Science in Data Science program
Worcester Polytechnic Institute, Worcester, Massachusetts, US | Master of Science in Data Science program
University of Virginia, US *** | Master of Science in Data Science
* The program consists of 12 credits, including cloud computing, data management and
data analysis.
** The program’s curriculum will include topics such as decision analysis and
optimization, predictive modeling, data mining and visualization.
*** A professional program to prepare students for the use of data analysis in major
industries such as health care, business, and science.
Conferences on Data Science
2014
International Conference on Data Science and
Engineering (26-28 August 2014)
Hosted by:
School of Computer Science Studies,
Cochin University of Science & Technology.
Co-sponsored by IEEE Kerala.
DataEDGE Conference: A new vision for data science
(May 8–9, 2014, Berkeley, CA)
Discussions will be on the way organizations are using
data to address business and social issues, about the
challenges of working with data at scale, and about the
most pressing questions and debates facing data
scientists today.
O'Reilly Strata is organising three conferences:
New York (October 15-17, 2014): Discussions will be
on complex issues and opportunities brought to
business by big data, data science, and pervasive
computing.
Barcelona, Spain (November 19–21, 2014): Discussions
will be on big data analytics.
San Jose, CA (February 18–20, 2015)
ASE (Academy of Science and Engineering) is organising
three conferences:
Stanford University, CA, USA (May 27-31, 2014)
Tsinghua University, Beijing, China (August 4-7, 2014)
Harvard University, Cambridge, MA, USA (December 15-19, 2014)
IEEE International Conference on Big Data Science and
Engineering (Tsinghua University, Beijing, China, 24-26
Sept. 2014)
The 2014 International Conference on Data Science and
Advanced Analytics (October 30 - November 1, 2014,
Shanghai, China)
Journals of Data Science
Journal of Data Science: an international journal
devoted to applications of statistical methods at
large.
Online version is free.
Hard copy version: 300 USD/year
CODATA Data Science Journal
Published by CODATA.
EPJ Data Science: a SpringerOpen journal
International Journal of Data Science: Inderscience
Publishers.
Colloquium(7)_DataScience:ShivShaktiGhosh&MohitGarg
Colloquium(7)_DataScience:ShivShaktiGhosh&MohitGarg

Weitere ähnliche Inhalte

Was ist angesagt?

Heartificial intelligence - claudio-mirti
Heartificial intelligence - claudio-mirtiHeartificial intelligence - claudio-mirti
Heartificial intelligence - claudio-mirtiPistoia Alliance
 
Data science & data scientist
Data science & data scientistData science & data scientist
Data science & data scientistVijayMohan Vasu
 
1. Data Analytics-introduction
1. Data Analytics-introduction1. Data Analytics-introduction
1. Data Analytics-introductionkrishna singh
 
introduction to data science
introduction to data scienceintroduction to data science
introduction to data sciencebhavesh lande
 
Data Science
Data ScienceData Science
Data ScienceRabin BK
 
Introduction to data science
Introduction to data scienceIntroduction to data science
Introduction to data scienceSampath Kumar
 
Data Science: An Emerging Field for Future Jobs
Data Science: An Emerging Field for Future JobsData Science: An Emerging Field for Future Jobs
Data Science: An Emerging Field for Future JobsJian Qin
 
Introduction to Data Science - Week 3 - Steps involved in Data Science
Introduction to Data Science - Week 3 - Steps involved in Data ScienceIntroduction to Data Science - Week 3 - Steps involved in Data Science
Introduction to Data Science - Week 3 - Steps involved in Data ScienceFerdin Joe John Joseph PhD
 
Data science e machine learning
Data science e machine learningData science e machine learning
Data science e machine learningGiuseppe Manco
 
Pistoia Alliance Demystifying AI & ML part 2
Pistoia Alliance Demystifying AI & ML part 2Pistoia Alliance Demystifying AI & ML part 2
Pistoia Alliance Demystifying AI & ML part 2Pistoia Alliance
 
Data Science Applications | Data Science For Beginners | Data Science Trainin...
Data Science Applications | Data Science For Beginners | Data Science Trainin...Data Science Applications | Data Science For Beginners | Data Science Trainin...
Data Science Applications | Data Science For Beginners | Data Science Trainin...Edureka!
 
Data Literacy -- Necessity and challenges
Data Literacy -- Necessity and challengesData Literacy -- Necessity and challenges
Data Literacy -- Necessity and challengesSrdjan Verbić
 
Data Science - An emerging Stream of Science with its Spreading Reach & Impact
Data Science - An emerging Stream of Science with its Spreading Reach & ImpactData Science - An emerging Stream of Science with its Spreading Reach & Impact
Data Science - An emerging Stream of Science with its Spreading Reach & ImpactDr. Sunil Kr. Pandey
 
Making an impact with data science
Making an impact  with data scienceMaking an impact  with data science
Making an impact with data scienceJordan Engbers
 
Understand the Demand of Analyst Opportunity in U.S
Understand the Demand of Analyst Opportunity in U.SUnderstand the Demand of Analyst Opportunity in U.S
Understand the Demand of Analyst Opportunity in U.SJiaming Zhang
 

Was ist angesagt? (20)

NLP & ML Webinar
NLP & ML WebinarNLP & ML Webinar
NLP & ML Webinar
 
Heartificial intelligence - claudio-mirti
Heartificial intelligence - claudio-mirtiHeartificial intelligence - claudio-mirti
Heartificial intelligence - claudio-mirti
 
Data science & data scientist
Data science & data scientistData science & data scientist
Data science & data scientist
 
1. Data Analytics-introduction
1. Data Analytics-introduction1. Data Analytics-introduction
1. Data Analytics-introduction
 
Lecture #01
Lecture #01Lecture #01
Lecture #01
 
introduction to data science
introduction to data scienceintroduction to data science
introduction to data science
 
Data Science
Data ScienceData Science
Data Science
 
Introduction to data science
Introduction to data scienceIntroduction to data science
Introduction to data science
 
Data Science: An Emerging Field for Future Jobs
Data Science: An Emerging Field for Future JobsData Science: An Emerging Field for Future Jobs
Data Science: An Emerging Field for Future Jobs
 
Introduction to Data Science - Week 3 - Steps involved in Data Science
Introduction to Data Science - Week 3 - Steps involved in Data ScienceIntroduction to Data Science - Week 3 - Steps involved in Data Science
Introduction to Data Science - Week 3 - Steps involved in Data Science
 
Data science e machine learning
Data science e machine learningData science e machine learning
Data science e machine learning
 
Pistoia Alliance Demystifying AI & ML part 2
Pistoia Alliance Demystifying AI & ML part 2Pistoia Alliance Demystifying AI & ML part 2
Pistoia Alliance Demystifying AI & ML part 2
 
Data Science Applications | Data Science For Beginners | Data Science Trainin...
Data Science Applications | Data Science For Beginners | Data Science Trainin...Data Science Applications | Data Science For Beginners | Data Science Trainin...
Data Science Applications | Data Science For Beginners | Data Science Trainin...
 
Big Data Analytics
Big Data AnalyticsBig Data Analytics
Big Data Analytics
 
Data Literacy -- Necessity and challenges
Data Literacy -- Necessity and challengesData Literacy -- Necessity and challenges
Data Literacy -- Necessity and challenges
 
Data Science - An emerging Stream of Science with its Spreading Reach & Impact
Data Science - An emerging Stream of Science with its Spreading Reach & ImpactData Science - An emerging Stream of Science with its Spreading Reach & Impact
Data Science - An emerging Stream of Science with its Spreading Reach & Impact
 
Making an impact with data science
Making an impact  with data scienceMaking an impact  with data science
Making an impact with data science
 
Data analytics
Data analyticsData analytics
Data analytics
 
Understand the Demand of Analyst Opportunity in U.S
Understand the Demand of Analyst Opportunity in U.SUnderstand the Demand of Analyst Opportunity in U.S
Understand the Demand of Analyst Opportunity in U.S
 
Data science Big Data
Data science Big DataData science Big Data
Data science Big Data
 

Andere mochten auch

HR Conference 2014 - Smartree Romania
HR Conference 2014 - Smartree RomaniaHR Conference 2014 - Smartree Romania
HR Conference 2014 - Smartree RomaniaSmartree Romania
 
ImPOS - Restaurant Diary
ImPOS - Restaurant DiaryImPOS - Restaurant Diary
ImPOS - Restaurant Diary3floorsup
 
صور العابد 2013
صور العابد 2013صور العابد 2013
صور العابد 2013menasolomon
 
New microsoft-powerpoint-presentation-(2)
New microsoft-powerpoint-presentation-(2)New microsoft-powerpoint-presentation-(2)
New microsoft-powerpoint-presentation-(2)Batuhan Batuhan
 
Im pos restaurant diary
Im pos   restaurant diaryIm pos   restaurant diary
Im pos restaurant diary3floorsup
 
Storyboard animatic
Storyboard animaticStoryboard animatic
Storyboard animaticelliereedx
 
Modal verbs (1)
Modal verbs (1)Modal verbs (1)
Modal verbs (1)t-saqer
 
New microsoft-powerpoint-presentation-(2) (1)
New microsoft-powerpoint-presentation-(2) (1)New microsoft-powerpoint-presentation-(2) (1)
New microsoft-powerpoint-presentation-(2) (1)Batuhan Batuhan
 
Hi there, it's a pleasure to meet you, see you real
Hi there, it's a pleasure to meet you, see you real Hi there, it's a pleasure to meet you, see you real
Hi there, it's a pleasure to meet you, see you real oldchrome
 
Aziende Partner CasaClima Tour 2015
Aziende Partner CasaClima Tour 2015Aziende Partner CasaClima Tour 2015
Aziende Partner CasaClima Tour 2015klimahaus_casaclima
 
Analysis of film trailer the conjuring
Analysis of film trailer     the conjuringAnalysis of film trailer     the conjuring
Analysis of film trailer the conjuringelliereedx
 
CasaClima Tour Cagliari 21/05/2015
CasaClima Tour Cagliari 21/05/2015CasaClima Tour Cagliari 21/05/2015
CasaClima Tour Cagliari 21/05/2015klimahaus_casaclima
 
Marco Europeo Por Jefferson Defas
Marco Europeo Por Jefferson DefasMarco Europeo Por Jefferson Defas
Marco Europeo Por Jefferson DefasJeferson Defas
 

Andere mochten auch (20)

HR Conference 2014 - Smartree Romania
HR Conference 2014 - Smartree RomaniaHR Conference 2014 - Smartree Romania
HR Conference 2014 - Smartree Romania
 
CasaClima Tour 2015 a Trieste
CasaClima Tour 2015 a TriesteCasaClima Tour 2015 a Trieste
CasaClima Tour 2015 a Trieste
 
ImPOS - Restaurant Diary
ImPOS - Restaurant DiaryImPOS - Restaurant Diary
ImPOS - Restaurant Diary
 
صور العابد 2013
صور العابد 2013صور العابد 2013
صور العابد 2013
 
Presentation1
Presentation1Presentation1
Presentation1
 
New microsoft-powerpoint-presentation-(2)
New microsoft-powerpoint-presentation-(2)New microsoft-powerpoint-presentation-(2)
New microsoft-powerpoint-presentation-(2)
 
Untitled
UntitledUntitled
Untitled
 
15.05.14 comune clima_it
15.05.14 comune clima_it15.05.14 comune clima_it
15.05.14 comune clima_it
 
Im pos restaurant diary
Im pos   restaurant diaryIm pos   restaurant diary
Im pos restaurant diary
 
Storyboard animatic
Storyboard animaticStoryboard animatic
Storyboard animatic
 
Imc ppt 2
Imc ppt 2Imc ppt 2
Imc ppt 2
 
Modal verbs (1)
Modal verbs (1)Modal verbs (1)
Modal verbs (1)
 
New microsoft-powerpoint-presentation-(2) (1)
New microsoft-powerpoint-presentation-(2) (1)New microsoft-powerpoint-presentation-(2) (1)
New microsoft-powerpoint-presentation-(2) (1)
 
Hi there, it's a pleasure to meet you, see you real
Hi there, it's a pleasure to meet you, see you real Hi there, it's a pleasure to meet you, see you real
Hi there, it's a pleasure to meet you, see you real
 
Aziende Partner CasaClima Tour 2015
Aziende Partner CasaClima Tour 2015Aziende Partner CasaClima Tour 2015
Aziende Partner CasaClima Tour 2015
 
العابد
العابدالعابد
العابد
 
tugas
tugastugas
tugas
 
Analysis of film trailer the conjuring
Analysis of film trailer     the conjuringAnalysis of film trailer     the conjuring
Analysis of film trailer the conjuring
 
CasaClima Tour Cagliari 21/05/2015
CasaClima Tour Cagliari 21/05/2015CasaClima Tour Cagliari 21/05/2015
CasaClima Tour Cagliari 21/05/2015
 
Marco Europeo Por Jefferson Defas
Marco Europeo Por Jefferson DefasMarco Europeo Por Jefferson Defas
Marco Europeo Por Jefferson Defas
 

Ähnlich wie Colloquium(7)_DataScience:ShivShaktiGhosh&MohitGarg

ds.pptx
ds.pptxds.pptx
ds.pptxElves3
 
Data Science Unit1 AMET.pdf
Data Science Unit1 AMET.pdfData Science Unit1 AMET.pdf
Data Science Unit1 AMET.pdfmustaq4
 
What is data science artical
What is data science articalWhat is data science artical
What is data science articalkavyapandala
 
Demand For Data Scientist
Demand For Data ScientistDemand For Data Scientist
Demand For Data ScientistZaranTech LLC
 
Introduction to Data Science.pdf
Introduction to Data Science.pdfIntroduction to Data Science.pdf
Introduction to Data Science.pdfUniversity of Sindh
 
INTRODUCTION TO DATA SCIENCE -CONCEPTS.pptx
INTRODUCTION TO DATA SCIENCE -CONCEPTS.pptxINTRODUCTION TO DATA SCIENCE -CONCEPTS.pptx
INTRODUCTION TO DATA SCIENCE -CONCEPTS.pptxMadhumitha N
 
Data Science- Basics.pptx
Data Science- Basics.pptxData Science- Basics.pptx
Data Science- Basics.pptxRupaliKute3
 
Ch1IntroductiontoDataScience.pptx
Ch1IntroductiontoDataScience.pptxCh1IntroductiontoDataScience.pptx
Ch1IntroductiontoDataScience.pptxAbderrahmanABID2
 
data science and business analytics
data science and business analyticsdata science and business analytics
data science and business analyticssunnypatil1778
 
Data fluency for the 21st century
Data fluency for the 21st centuryData fluency for the 21st century
Data fluency for the 21st centuryMartinFrigaard
 
Data science Nagarajan and madhav.pptx
Data science Nagarajan and madhav.pptxData science Nagarajan and madhav.pptx
Data science Nagarajan and madhav.pptxNagarajanG35
 
Welcome to Data Science
Welcome to Data ScienceWelcome to Data Science
Welcome to Data ScienceNyraSehgal
 
Introduction-to-Data-Science.pdf
Introduction-to-Data-Science.pdfIntroduction-to-Data-Science.pdf
Introduction-to-Data-Science.pdfmallikarjuntalakal
 
Introduction-to-Data-Science.pdf
Introduction-to-Data-Science.pdfIntroduction-to-Data-Science.pdf
Introduction-to-Data-Science.pdfikenossama03
 
Data science and data analytics major similarities and distinctions (1)
Data science and data analytics  major similarities and distinctions (1)Data science and data analytics  major similarities and distinctions (1)
Data science and data analytics major similarities and distinctions (1)Robert Smith
 
How to Prepare for a Career in Data Science
How to Prepare for a Career in Data ScienceHow to Prepare for a Career in Data Science
How to Prepare for a Career in Data ScienceJuuso Parkkinen
 

Ähnlich wie Colloquium(7)_DataScience:ShivShaktiGhosh&MohitGarg (20)

Untitled document.pdf
Untitled document.pdfUntitled document.pdf
Untitled document.pdf
 
ds.pptx
ds.pptxds.pptx
ds.pptx
 
Information & data science (1) converted
Information & data science (1) convertedInformation & data science (1) converted
Information & data science (1) converted
 
Data Science Unit1 AMET.pdf
Data Science Unit1 AMET.pdfData Science Unit1 AMET.pdf
Data Science Unit1 AMET.pdf
 
What is data science artical
What is data science articalWhat is data science artical
What is data science artical
 
Demand For Data Scientist
Demand For Data ScientistDemand For Data Scientist
Demand For Data Scientist
 
Introduction to Data Science.pdf
Introduction to Data Science.pdfIntroduction to Data Science.pdf
Introduction to Data Science.pdf
 
INTRODUCTION TO DATA SCIENCE -CONCEPTS.pptx
INTRODUCTION TO DATA SCIENCE -CONCEPTS.pptxINTRODUCTION TO DATA SCIENCE -CONCEPTS.pptx
INTRODUCTION TO DATA SCIENCE -CONCEPTS.pptx
 
Data Science- Basics.pptx
Data Science- Basics.pptxData Science- Basics.pptx
Data Science- Basics.pptx
 
Ch1IntroductiontoDataScience.pptx
Ch1IntroductiontoDataScience.pptxCh1IntroductiontoDataScience.pptx
Ch1IntroductiontoDataScience.pptx
 
1 UNIT-DSP.pptx
1 UNIT-DSP.pptx1 UNIT-DSP.pptx
1 UNIT-DSP.pptx
 
data science and business analytics
data science and business analyticsdata science and business analytics
data science and business analytics
 
Data fluency for the 21st century
Data fluency for the 21st centuryData fluency for the 21st century
Data fluency for the 21st century
 
Data science Nagarajan and madhav.pptx
Data science Nagarajan and madhav.pptxData science Nagarajan and madhav.pptx
Data science Nagarajan and madhav.pptx
 
Welcome to Data Science
Welcome to Data ScienceWelcome to Data Science
Welcome to Data Science
 
365 Data Science
365 Data Science365 Data Science
365 Data Science
 
Introduction-to-Data-Science.pdf
Introduction-to-Data-Science.pdfIntroduction-to-Data-Science.pdf
Introduction-to-Data-Science.pdf
 
Introduction-to-Data-Science.pdf
Introduction-to-Data-Science.pdfIntroduction-to-Data-Science.pdf
Introduction-to-Data-Science.pdf
 
Data science and data analytics major similarities and distinctions (1)
Data science and data analytics  major similarities and distinctions (1)Data science and data analytics  major similarities and distinctions (1)
Data science and data analytics major similarities and distinctions (1)
 
How to Prepare for a Career in Data Science
How to Prepare for a Career in Data ScienceHow to Prepare for a Career in Data Science
How to Prepare for a Career in Data Science
 

Kürzlich hochgeladen

Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...amitlee9823
 
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...Delhi Call girls
 
Best VIP Call Girls Noida Sector 22 Call Me: 8448380779
Best VIP Call Girls Noida Sector 22 Call Me: 8448380779Best VIP Call Girls Noida Sector 22 Call Me: 8448380779
Best VIP Call Girls Noida Sector 22 Call Me: 8448380779Delhi Call girls
 
April 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's AnalysisApril 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's Analysismanisha194592
 
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdfMarket Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdfRachmat Ramadhan H
 
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al BarshaAl Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al BarshaAroojKhan71
 
Halmar dropshipping via API with DroFx
Halmar  dropshipping  via API with DroFxHalmar  dropshipping  via API with DroFx
Halmar dropshipping via API with DroFxolyaivanovalion
 
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Callshivangimorya083
 
Vip Model Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...
Vip Model  Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...Vip Model  Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...
Vip Model Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...shivangimorya083
 
BabyOno dropshipping via API with DroFx.pptx
BabyOno dropshipping via API with DroFx.pptxBabyOno dropshipping via API with DroFx.pptx
BabyOno dropshipping via API with DroFx.pptxolyaivanovalion
 
Data-Analysis for Chicago Crime Data 2023
Data-Analysis for Chicago Crime Data  2023Data-Analysis for Chicago Crime Data  2023
Data-Analysis for Chicago Crime Data 2023ymrp368
 
Capstone Project on IBM Data Analytics Program
Capstone Project on IBM Data Analytics ProgramCapstone Project on IBM Data Analytics Program
Capstone Project on IBM Data Analytics ProgramMoniSankarHazra
 
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779Delhi Call girls
 
Call Girls 🫤 Dwarka ➡️ 9711199171 ➡️ Delhi 🫦 Two shot with one girl
Call Girls 🫤 Dwarka ➡️ 9711199171 ➡️ Delhi 🫦 Two shot with one girlCall Girls 🫤 Dwarka ➡️ 9711199171 ➡️ Delhi 🫦 Two shot with one girl
Call Girls 🫤 Dwarka ➡️ 9711199171 ➡️ Delhi 🫦 Two shot with one girlkumarajju5765
 
FESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfFESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfMarinCaroMartnezBerg
 
Invezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signalsInvezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signalsInvezz1
 
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130Suhani Kapoor
 
100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptx100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptxAnupama Kate
 

Kürzlich hochgeladen (20)

Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
 
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
 
Best VIP Call Girls Noida Sector 22 Call Me: 8448380779
Best VIP Call Girls Noida Sector 22 Call Me: 8448380779Best VIP Call Girls Noida Sector 22 Call Me: 8448380779
Best VIP Call Girls Noida Sector 22 Call Me: 8448380779
 
April 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's AnalysisApril 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's Analysis
 
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdfMarket Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
 
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al BarshaAl Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
 
Halmar dropshipping via API with DroFx
Halmar  dropshipping  via API with DroFxHalmar  dropshipping  via API with DroFx
Halmar dropshipping via API with DroFx
 
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Colloquium(7)_DataScience:ShivShaktiGhosh&MohitGarg

  • 1. DATA SCIENCE Colloquium (7) MS(LIS) 2013-2015 Indian Statistical Institute Documentation Research and Training Centre
  • 2.
  • 3. ● Data Science is a newly emerging field dedicated to analyzing and manipulating data to derive insights and build data products. It combines skill-sets ranging from computer science, to mathematics, to art. (www.kaggle.com)
  • 4. ● Data science implies a focus involving data and, by extension, statistics, or the systematic study of the organization, properties, and analysis of data and its role in inference, including our confidence in the inference. (D.J.Patil) ● In simple words, it is a process that extracts information/knowledge from huge volumes of data.
  • 5.
  • 6. Evolution • 1900 - Statistics • 1960 - “Data Mining” • 2006 - Google Analytics appears • 2007 - Business/Data/Predictive Analytics • 2012 - Big Data surge • 2013 - Data Science • 2015 - ??
  • 7.
  • 8. ● Data is growing at a very high pace (exponentially). ● According to IBM, 2.5 exabytes - that's 2.5 billion gigabytes (GB) - of data was generated every day in 2012. About 75% of data is unstructured, coming from sources such as text, voice and video.
  • 9. ● In 2012 the volume of data reached 2.8 zettabytes, and IDC forecasts that we will generate 40 zettabytes (ZB) by 2020, which is the equivalent of 5,200 GB of data for every man, woman and child on Earth. ● 90% of all the data in the world today has been created in the past few years.
  • 10.
  • 11. S.No. Sub-Topic (Speaker): 1. What is Data Science (Sandip Das) 2. Data Scientist (Anwesha Bhattacharya) 3. Applications of Data Science (Manasa Rath) 4. Workflow of Data Science (Dibakar Sen) 5. Challenges in Workflow of Data Science (Jayanta Kr. Nayek) 6. Tools and Technology (Tanmay & Manash) 7. Machine Learning in Data Science (Samhati Soor) 8. Conclusion (Shiv Shakti Ghosh)
  • 13. What is Data Science Sandip Das
  • 15. Data What kind of data might you collect?
  • 16. Data ● How many lily pads ● The size of the lily pads in inches ● How many small, medium or large lily pads ● How many frogs
  • 17. What is Data? ● It is something you want to know. ● A collection of facts. ● Facts and statistics collected together for reference or analysis. ● Data as the plural form of datum; as pieces of information; and as a collection of object-units that are distinct from one another. ● Data is the undifferentiated observation of facts in terms of words, numbers, symbols, etc.
  • 18. What is Data? Computer data is information processed or stored by a computer. This information may be in the form of text documents, images, audio clips, software programs, or other types of data. Computer data may be processed by the computer's CPU and is stored in files and folders on the computer's hard disk.
  • 19. Science ● The systematic observation of natural events and conditions in order to discover facts about them and to formulate laws and principles based on these facts. ● Science involves more than the gaining of knowledge. It is about gaining a deeper and often useful understanding of the world.
  • 20. The Science is an art of ● Discovering what we don't know from data ● Obtaining predictive, actionable insight from data ● Creating data products that have business impact ● Building confidence in decisions that drive business value
  • 21. Data science ● According to computer scientist Peter Naur: “The science of dealing with data, once they have been established.” ● Data science is the scientific study of the creation, validation and transformation of data to create meaning. ● Data science is the study of the generalizable extraction of knowledge from data.
  • 23. Domain Expertise ● Domain expertise is proficiency, with special knowledge or skills, in a particular area or topic. ● Domain expertise includes knowing what problems are important to solve and knowing what sufficient answers look like. Domain experts understand what the customers of their knowledge want to know.
  • 24. Data Engineering It is the data part of data science. It involves acquiring, ingesting, transforming, storing and retrieving data.
  • 25. Scientific Method It is the process for acquiring new knowledge by applying the principles of reasoning on empirical evidence derived from testing hypotheses through repeatable experiments.
  • 26. Statistics & Mathematics Statistics (along with mathematics) is the cerebral part of data science. It covers collecting, organizing, analysing and interpreting data.
  • 27. Advanced Computing Advanced computing is the heavy lifting of data science. It consists of software design and programming.
  • 28. Visualization ● It is the pretty face of data science. ● A good visualization is the result of a creative process that composes an abstraction of the data in an informative and aesthetically interesting form.
  • 29. Hacker mindset ● Hacking is modifying one's own computer system, including building, rebuilding, modifying and creating software, electronic hardware or peripherals, in order to make it better, make it faster, or give it added features. ● Data science hacking involves inventing new models, exploring.
  • 31. Data Scientist Anwesha Bhattacharya (& I am not a data scientist)
  • 32. Who is a data scientist? ● A practitioner of data science is called a data scientist.(~Wikipedia) ● Data scientists use technology and skills to increase awareness, clarity and direction for those working with data. (http://www.datascientists.net)
  • 33.
  • 34. Why do we need data scientists? ● Firstly, there is more data than we can consume. We require a data scientist who can look at the data and say, “This is important. Check out this one.” ● They are the people who can understand and provide meaning to the piles and piles of data that are collected. “Big data” is the buzzword that represents those piles. ● They minimise the disruptions that are encountered while dealing with data. ● They present data with an awareness of the consequences of presenting that data.
  • 36. Types of Data Scientists Data scientists can be broadly classified into two categories: Product-focused data scientists. Business Intelligence style of data scientists. There are roughly 4 to 5 groups in each category.
  • 37. Product-focused Data Scientists ● Data Researcher: The professionals in this category come from the academic world and have in-depth backgrounds in statistics or the physical or social sciences. This type of data scientist often holds a PhD but is weakly skilled in machine learning, programming or business. ● Data Developer: These guys tend to concentrate on technical issues that come with handling data. They are strong in programming and machine learning but weak in business and statistics skills. ● Data Creatives: These are the guys who make something innovative out of mountains of data. They are strongly skilled in machine learning, Big Data, programming and other skills to handle massive data. ● Data Business people: They represent the business side and are responsible for making vital business decisions through data analytics techniques. They are a blend of business and technical proficiency.
  • 38. Business Intelligence based Data Scientists ● Quantitative, exploratory Data Scientists: Quantitative, exploratory data scientists are inclined to have PhDs and use theory to comprehend behaviour. By combining theory and exploratory research, these data scientists improve products. ● Operational Data Scientists: Operational data scientists frequently work in finance, sales or operations teams in an organization. Their role is to analyse the performance, responses and behaviour of a process, to improve the organization’s strategy and efficiency. ● Product Data Scientists: Product data scientists fit into product management or engineering. Their job is to understand the way users make use of a product and use that knowledge to fine-tune the product. ● Marketing Data Scientists: Marketing data scientists focus on the user base, evaluate performance and work on improving efficiency, pretty much like the standard marketing guy. ● Research Data Scientists: Research data scientists create insights from a data set.
  • 39. Profile of Data Scientist ● They love data ● Have investigative mind set ● Goal of work: finding patterns in data and data driven products ● Are practitioners, not theorists ● Have “hands on” skills ● Have domain expertise ● Team players ● Technically focused ● Versatile communication and collaboration skills ● Curiosity for exploring and experimenting with data. ● Sceptical people, likely to ask a lot of questions around the viability of a given solution and whether it will really work.
  • 40. Required skills ● Data mining - Computational process of discovering patterns in large data sets. The analysis step of the "Knowledge Discovery in Databases". ● Programming - The act of instructing computers to perform tasks. ● Algorithms - Step-by-step procedure for calculations used for analysis of data. ● Statistics – The collection, organization, analysis, interpretation and presentation of data. ● NLP - Interactions between computers and human languages. ● Machine learning - The science of getting computers to act without being explicitly programmed. ● Distributed systems – The components located on networked computers communicate and coordinate their actions by passing messages. ● Visualization - The creation and study of the visual representation of data, communicate both abstract and concrete ideas. ● .........
  • 41. What Does a Data Scientist Do? 10 Things [most] Data Scientists Do: 1) Ask Good Questions. What is What? We don’t know! We’d like to know? 2) Explore data & generate hypothesis. Run experiments 3) Scoop, Scrap & Sample Data 4) Tame Data 5) Discover the unknowns. 6) Model Data. Model Algorithms. 7) Understand Data Relationships 8) Tell the Machine How to Learn from Data 9) Create Data Products that Deliver Actionable Insight 10) Communicate the results using visualization, presentations
  • 42. DIKUW (from PAST to FUTURE) ● D - Data: Raw (numbers, letters, symbols) ● I - Information: What (description, context, relationships) ● K - Knowledge: How to (extract, test, instruction) ● U - Understanding: Why (cause & effect, proved, known unknowns) ● W - Wisdom: When (prediction, what's best, unknown unknowns) ● Roles along this spectrum: Data Engineer, Data Analyst, Data Miner, Data Scientist
  • 43. Data Scientist vs Data Analyst ● Data Scientist: familiarity with database systems, e.g. MySQL; better to be familiar with Java, Python; should have a clear understanding of various analytical functions (median, rank, etc.) and how to use them on data sets; perfection in mathematics, statistics, correlation, data mining, etc.; deep statistical insights and machine learning. ● Data Analyst: familiarity with data warehousing and business intelligence concepts; in-depth exposure to SQL and analytics; strong understanding of Hadoop-based analytics; perfection regarding the tools and components of data architecture; proficiency in decision making. ● Data analysis has been generally used as a way of explaining some phenomenon by extracting interesting patterns from individual data sets with well-formulated queries. ● Data science, on the other hand, aims to discover and extract actionable knowledge from the data, that is, knowledge that can be used to make decisions and predictions, not just to explain what’s going on.
  • 44. Challenges of a data scientist ● Red tape: No access allowed ● Unknown need: What's the organization's goal? ● Terminology: What's a wonkulator? ● Real world data: Messy, noisy, missing ● Analysis distrust: ...but I don't like that result
  • 45. References ● Zhukov, Leonid. Data Scientists. Higher School of Economics. National Research University. ● http://bit.ly/1kduMvA ● http://bit.ly/1orF9DL ● http://bit.ly/1tMBBvQ ● http://bit.ly/1kJ9gU8 ● http://bit.ly/TS9H5e ● http://bit.ly/1jZR0WA
  • 47. Reaching to Data Science
  • 49. Applications in Education sector - Survey done by the Pearson group to improve learning software and course materials, for better quality and efficacy in learning - Tools used: Python, R, Google BigQuery
  • 50. Data Science in Healthcare Industry - Consider a group that has been diagnosed with Type 2 diabetes, where some subset of this group has developed complications - We would like to know whether there is any pattern to the complications and whether the probability of complication can be predicted and therefore acted upon Healthcare Use Database Snippet
  • 51. Extracting Interesting Patterns of Health outcomes from Healthcare System Care Whether the pattern is robust and predictive ?? OBSERVATIONS What is the incidence of complications of Type 2 diabetes for people over 37 who are on more than six medications?
  • 52. Remarks - When predictive accuracy becomes a primary objective, the computer tends to play a significant role in model building and decision making - This calls for an integrated skill set spanning mathematics, statistics, AI, databases and optimization, along with a deep understanding of the craft of problem formulation to engineer effective solutions
  • 53. Applications in Social Networking sites
  • 54.
  • 55.
  • 56. Key Points - The ability to interpret unstructured data and integrate it with numbers further increases our ability to extract useful knowledge in real time and act on it
  • 57. References 1.Data Science and Prediction by Vasant Dhar http://bit.ly/1tiRvMr
  • 58. Workflow of Data Science Dibakar Sen
  • 59. Workflow of Data Science ● The workflow process consists of three major activities: - Organising - Packaging - Delivering
  • 61. Understanding of Data - set objectives or goal - set data fields - data collection procedure
  • 63. Preparation Phase ● Acquire data The obvious first step in any data science workflow is to acquire the data to analyze. Data can be acquired from a variety of sources, for example: - Existing data can be used (e.g., U.S. Census data sets). - Data can be automatically generated by computer software. - Data can be manually entered into a spreadsheet or text file by a human, for example from a survey.
  • 64. Preparation Phase ● Reformat and clean data - Before analysis begins, we need to verify that the data are accurate and that the variables are well named and properly labeled. - We have to store the data in the desired format. - Verify the sample and variables. - Do the variables have the correct values? - Are missing data coded appropriately? - Are the data internally consistent? - Is the sample size correct? etc. - Programmers reformat and clean data either by writing scripts or by manually editing data in, say, a spreadsheet.
  • 66. Analysis Phase ● Data Analysis - The core activity of data science is the analysis phase: writing, executing, and refining computer programs to analyze and obtain insights from data. - Different "scripting" languages such as Python, Perl, R, and MATLAB are used to analyze the data. However, data scientists also use compiled languages such as C, C++, and Fortran when appropriate.
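The checks listed on this slide (correct values, missing-data coding, internal consistency) can be sketched as a small cleaning script. This is only a minimal illustration: the survey columns, the `NA` missing-value token and the 0-120 age range are assumptions made up for the example, not part of the original slides.

```python
import csv
import io

# Hypothetical raw survey data; "NA" and empty strings stand for missing values.
RAW = """id,age,city
1,34,Kolkata
2,NA,Bangalore
3,,Delhi
4,-5,Mumbai
4,41,Chennai
"""

def clean_rows(text, missing_tokens=("", "NA")):
    """Parse CSV text, code missing ages as None, and report basic problems."""
    rows, problems, seen_ids = [], [], set()
    for row in csv.DictReader(io.StringIO(text)):
        age = row["age"].strip()
        # Code missing data consistently instead of keeping ad-hoc tokens.
        row["age"] = None if age in missing_tokens else int(age)
        # Do the variables have plausible values? (assumed range: 0-120)
        if row["age"] is not None and not 0 <= row["age"] <= 120:
            problems.append((row["id"], "age out of range"))
        # Are the data internally consistent? Here: ids must be unique.
        if row["id"] in seen_ids:
            problems.append((row["id"], "duplicate id"))
        seen_ids.add(row["id"])
        rows.append(row)
    return rows, problems

rows, problems = clean_rows(RAW)
print(len(rows), "rows read;", len(problems), "problems found")
for row_id, message in problems:
    print("row id", row_id, "->", message)
```

The script only reports problems; deciding whether to drop, correct or keep a flagged row is left to the analyst.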
  • 67. ● In the analysis phase, the programmer engages in a repeated iteration cycle of editing scripts, executing to produce output files, inspecting the output files to gain insights and discover mistakes, debugging, and re-editing.
  • 69. Reflection / Evaluation Phase Whereas the analysis phase involves programming, the reflection phase involves thinking and communicating about the outputs of analyses. After inspecting a set of output files, a data scientist might perform the following types of reflection: - Take notes - Hold meetings - Make comparisons and explore alternatives
  • 71. Dissemination Phase The final phase of data science is disseminating results. Prepare reports in order to communicate findings to the appropriate audience. Results are most commonly in the form of written reports such as internal memos, slideshow presentation, business / policy white paper, or academic research publications. ● Beyond presenting results in written form, some data scientists also want to distribute their software so that colleagues can reproduce their experiments or play with their prototype systems.
  • 73. Challenges in Workflow of Data Science Jayanta Kr. Nayek
  • 74. Preparation Phase Acquire data: - Keeping track of provenance: - Where each piece of data comes from and whether it is still up-to-date. - Data management: - Programmers must assign names to data files that they create or download and then organize those files into directories. - When they create or download new versions of those files, they must make sure to assign proper filenames to all versions and keep track of their differences. - Storage: - Sometimes there is so much data that it cannot fit on a single hard drive, so it must be stored on remote servers.
  • 75. Preparation Phase Reformat and clean data: - Raw data often contains semantic errors (errors in logic or arithmetic that must be detected at run time), missing entries, or inconsistent formatting, so it needs to be "cleaned" prior to analysis. - Data integration: - Data integration involves combining data residing in different sources and providing users with a unified view of these data. - Heterogeneous data: - Data integration involves synchronizing huge quantities of variable, heterogeneous data resulting from internal legacy systems (old methods, technologies, computer systems, or application programs; "of, relating to, or being a previous or outdated computer system") that vary in data format. Legacy systems may have been created around flat file, network, or hierarchical databases.
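One way to keep track of provenance and file versions is to store a small record next to each acquired data file. The sketch below is an illustration under assumptions: the field names and the example source label are hypothetical, and a real project might prefer a dedicated data-versioning tool.

```python
import hashlib
from datetime import datetime, timezone

def provenance_record(data: bytes, source: str) -> dict:
    """Describe where a piece of data came from and what it contained."""
    return {
        "source": source,                                # where the data came from
        "sha256": hashlib.sha256(data).hexdigest(),      # fingerprint of the content
        "size_bytes": len(data),
        "retrieved_at": datetime.now(timezone.utc).isoformat(),
    }

def is_same_version(record: dict, data: bytes) -> bool:
    """Check whether new data is identical to the version described by record."""
    return record["sha256"] == hashlib.sha256(data).hexdigest()

# Hypothetical example: two downloads of the "same" file.
first = b"a,b\n1,2\n"
second = b"a,b\n1,3\n"
record = provenance_record(first, source="hypothetical census extract")
print(is_same_version(record, first))
print(is_same_version(record, second))
```

Comparing checksums answers the "is it still up-to-date?" question without relying on filenames alone.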
  • 76. Preparation Phase ● Data Integration Problems: -Unanticipated Costs: -Labor costs for initial planning, evaluation, programming and additional data acquisition -Software and hardware purchases -Unanticipated technology changes/advances -Both labor and the direct costs of data storage and maintenance -Lack of Data Management Expertise: -support required to engage and convey to everyone in the agency the need for and benefits of data integration is unlikely to flow from leaders who lack awareness of or commitment to the benefits of data integration.
  • 77. Preparation Phase Data transmission: - It is the physical transfer of data over a point-to-point or point-to-multipoint communication channel. - Cloud data storage has become popular with the development of cloud technologies. - Network bandwidth capacity is the bottleneck in cloud and distributed systems, especially when the volume of communication is large. - On the other side, cloud storage also leads to data security problems, such as the need for data integrity checking.
  • 78. Analysis Phase - Data inconsistency and incompleteness: - A number of data preprocessing techniques, including data cleaning, data integration, data transformation and data reduction, can be applied to remove noise and correct inconsistencies. - Scalability: - The biggest and most important challenge in Big Data analysis is scalability. - In the last few decades, researchers paid more attention to accelerating analysis algorithms to cope with increasing volumes of data, while processors sped up following Moore’s Law. - Data curation: - Data curation is aimed at data discovery and retrieval, data quality assurance, value addition, reuse and preservation over time. - The existing database management tools are unable to process Big Data that grow so large and complex.
  • 79. Analysis Phase - Timeliness: - For real-time Big Data applications, such as navigation, social networks, finance, biomedicine, astronomy, intelligent transport systems and the Internet of Things, timeliness is the top priority. How can we guarantee the timeliness of response when the volume of data to be processed is very large? - File and metadata management: - Repeatedly editing and executing scripts while iterating on experiments produces numerous output files, such as intermediate data, textual reports, tables, and graphical visualizations. - This leads to data management problems due to the abundance of files and the fact that programmers often later forget their own ad-hoc naming conventions. - Data security: - Firstly, the size of Big Data is extremely large, challenging the protection approaches. - Secondly, it also leads to a much heavier security workload.
  • 80. Analysis Phase - Absolute running times: Scripts might take a long time to terminate, either due to large amounts of data being processed or the algorithms being slow. - Incremental running times: Scripts might take a long time to terminate after minor incremental code edits done while iterating on analyses, which wastes time recomputing almost the same results as previous runs. - Crashes from errors: Scripts might crash prematurely due to errors in either the code or inconsistencies in data sets. Programmers often need to endure several rounds of debugging before their scripts can terminate with useful results.
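A common way to reduce incremental running times is to cache intermediate results, so that a re-run after a small edit does not recompute steps whose inputs are unchanged. The sketch below is a minimal in-memory illustration; `expensive_summary` is a hypothetical stand-in for a slow analysis step, and a real workflow would more likely cache to disk.

```python
import hashlib
import json

_cache = {}              # maps a fingerprint of the input to a stored result
calls = {"count": 0}     # counts how often the slow step really runs

def expensive_summary(values):
    """Hypothetical stand-in for a slow analysis step."""
    calls["count"] += 1
    return {"n": len(values), "mean": sum(values) / len(values)}

def cached_summary(values):
    """Run expensive_summary only if this exact input has not been seen before."""
    key = hashlib.sha256(json.dumps(values).encode("utf-8")).hexdigest()
    if key not in _cache:
        _cache[key] = expensive_summary(values)
    return _cache[key]

first = cached_summary([1, 2, 3])
second = cached_summary([1, 2, 3])   # served from the cache
third = cached_summary([1, 2, 4])    # different input, so it is recomputed
print(first, calls["count"])
```

The cache key is derived from the input data, so a change in the data invalidates the cached result automatically; changes to the analysis code itself would need to be folded into the key as well.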
  • 81. Reflection Phase ● Take notes: Since notes are a form of data, the usual data management problems arise in notetaking, most notably how to organize notes and link them with the context in which they were originally written. ● Make comparisons and explore alternatives: Data scientists must organize, manage, and compare these graphs to gain insights and ideas for what alternative hypotheses to explore.
  • 82. Dissemination Phase -Functionalities: -To convey information easily by providing knowledge hidden in the complex and large-scale data sets, both aesthetic form and functionality are necessary. -Current tools mostly have poor performances in functionalities and response time. -Scalability : -It is particularly difficult to conduct data visualization (the main objective of data visualization is to represent knowledge more intuitively and effectively by using different graphs) because of the large size and high dimension of Big Data.
  • 83. Dissemination Phase ● Difficult to distribute research code: Some data scientists also want to distribute their software so that colleagues can reproduce their experiments or play with their prototype systems. It is difficult to distribute research code in a form that other people can easily execute on their own computers. ● Difficult to reproduce the results: It is even difficult to reproduce the results of one's own experiments a few months or years in the future, since one's own operating system and software inevitably get upgraded in some incompatible manner such that the original code no longer runs.
  • 84. Reference ● Chen, Philip C. L. and Zhang, Chun-Yang (2014). Data-intensive applications, challenges, techniques and technologies: A survey on Big Data. Information Sciences. Elsevier. Department of Computer and Information Science, Faculty of Science and Technology, University of Macau, Macau, China. ● http://bit.ly/1jZcx2I ● http://1.usa.gov/SNspKm
  • 85. TECHNOLOGY and Tools for DATA SCIENCE TANMAY MONDAL & MANASH KUMAR
  • 86. We need to ● Organise data ● Analyse data ● Package and deliver data
  • 87. Data Science Tools ● Languages: Java, R, Python, ... ● Databases/Data Warehouses: Apache Cassandra, Apache HBase, MongoDB, ... ● Data Mining: RapidMiner/RapidAnalytics, Orange, Weka, ... ● File Systems: Gluster, Hadoop Distributed File System, ...
  • 88. Data Science Tools ● Big Data Search: Lucene, Solr, ... ● Data Aggregation and Transfer: Sqoop, Flume, ... ● Miscellaneous Big Data Tools: Hadoop, Avro, Zookeeper, ... ● ...
  • 89.
  • 90. What is Hadoop? ● Apache Hadoop is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. (Figure: nodes in a Hadoop cluster)
  • 91. Why Hadoop? • Handles enormous data volumes. • Cost-effective. • Scalable. • Fault tolerant.
  • 92. Origin of Hadoop • Google introduced two key technologies for handling Big Data: the Google File System (a distributed file system technology) in 2003 and MapReduce (a framework for a distributed compute model) in 2004. • Early in 2005, the Nutch developers had a working MapReduce implementation in Nutch, and by the middle of that year all the major Nutch algorithms had been ported to run using MapReduce and NDFS. • In February 2006 they moved out of Nutch to form an independent subproject of Lucene called Hadoop. • First release of Apache Hadoop in September 2007
  • 93. When should we go for Hadoop? ● Data is too huge ● Unstructured data ● Parallelism ● Processes are independent ● Need better scalability
  • 94. The Hadoop Ecosystem ● HDFS - Hadoop Distributed File System. ● MapReduce - A distributed framework for executing work in parallel. ● Hive - Hive is a data warehouse infrastructure built on top of Hadoop for providing data summarization, query, and analysis. ● Pig - Pig is a high-level platform for creating MapReduce programs used with Hadoop. ● HBase - A non-relational, distributed database system. ● ...
  • 95. The Major Components of Hadoop ● Hadoop uses its own distributed file system, HDFS, which makes data available to multiple computing nodes. ● Hadoop uses MapReduce, where the application is divided into many small fragments of work, each of which may be executed or re-executed on any node in the cluster.
  • 96. HDFS ● Hierarchical UNIX-like file system for data storage. ● Splits large files into blocks. ● Stores files in blocks across many nodes in a cluster. ● Distributes and replicates blocks to different nodes. ● Has a master/slave architecture.
  • 98. HDFS ... NameNode: ● Runs on a single node as a master process ● Holds file metadata (which blocks are where) ● Directs client access to files in HDFS. SecondaryNameNode: ● Maintains a copy of the NameNode metadata. DataNode: ● Stores data in the local file system ● Periodically sends a report of all existing blocks to the NameNode
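The idea of splitting a file into blocks, replicating each block on several nodes, and keeping "which blocks are where" as metadata can be illustrated with a toy, single-process sketch. This is not the HDFS API: the tiny block size, the node names and the round-robin placement are assumptions for illustration (real HDFS uses much larger blocks and a rack-aware placement policy; a replication factor of 3 is its usual default).

```python
def split_into_blocks(data: bytes, block_size: int):
    """Cut a file's bytes into fixed-size blocks (the last one may be shorter)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def place_blocks(num_blocks: int, nodes, replication: int = 3):
    """Toy NameNode metadata: map each block index to the nodes holding a replica.

    Round-robin placement is an assumption made for this sketch.
    """
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    return {
        block: [nodes[(block + r) % len(nodes)] for r in range(replication)]
        for block in range(num_blocks)
    }

# Hypothetical 10-byte "file", 4-byte blocks, and four data nodes.
blocks = split_into_blocks(b"x" * 10, block_size=4)
placement = place_blocks(len(blocks), nodes=["node0", "node1", "node2", "node3"])
for block_id, holders in placement.items():
    print("block", block_id, "size", len(blocks[block_id]), "stored on", holders)
```

Because every block lives on several nodes, losing any single node in this toy cluster still leaves at least two copies of each block.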
  • 99. WHAT IS MAP REDUCE? MapReduce is a programming model for processing large data sets with a parallel, distributed algorithm on a cluster
  • 100.
  • 101. Map Reduce Paradigm Data processing system with two key phases. Map: Perform a map function on input key/value pairs to generate intermediate key/value pairs. Reduce: Perform a reduce function on intermediate key/value groups to generate output key/value pairs.
  • 102. Map Reduce Daemons • JobTracker (Master) - Monitors job and task progress - Manages MapReduce jobs - Gives tasks to different nodes • TaskTracker (Slave) - Creates individual map and reduce tasks - Reports task status to JobTracker - Runs on the same node as the DataNode service
  • 103.
  • 104. Hadoop Map Reduce Components Map Phase: Input Format, Record Reader, Mapper, Combiner. Reduce Phase: Shuffle, Sort, Reducer, Output Format.
  • 105. How does Map Reduce work? ➢ The run time partitions the input and provides it to different Map instances ➢ Map (key, value) → (key’, value’) ➢ The run time collects the (key’, value’) pairs and distributes them to several Reduce functions so that each Reduce function gets the pairs with the same key’. ➢ Map and Reduce are user-written functions in Java
  • 106.
  • 107. WORD COUNT IN MAP REDUCE
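Word count can be sketched in a few lines to show the map, shuffle and reduce steps. This is a single-process Python illustration of the paradigm, not Hadoop's Java API; the function names and the sample documents are assumptions for the example.

```python
from collections import defaultdict

def map_phase(document):
    """Map: emit an intermediate (word, 1) pair for every word in the input."""
    for word in document.lower().split():
        yield (word, 1)

def shuffle(pairs):
    """Shuffle: group intermediate values by key, as the run time would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: combine all values that share a key into one output pair."""
    return (key, sum(values))

def word_count(documents):
    pairs = [pair for doc in documents for pair in map_phase(doc)]
    groups = shuffle(pairs)
    return dict(reduce_phase(key, values) for key, values in sorted(groups.items()))

counts = word_count(["data science", "big data"])
print(counts)
```

In a real cluster the map calls run on different nodes and the shuffle moves data across the network; here everything runs in one process to keep the data flow visible.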
  • 108. Validation of data extract and load into EDW(Enterprise Data Warehouse) Once map-reduce process is completed and data output files are generated, then data is moved to enterprise data warehouse or any other transactional systems depending on the requirement.
  • 109. USERS OF HADOOP Yahoo!: ● More than 100,000 CPUs in 40,000 computers running Hadoop ● Produces data that was used in every Yahoo! Web search query. Facebook: ● In 2010 Facebook claimed that they had the largest Hadoop cluster in the world with 21 PB of storage. ● On June 13, 2012 they announced the data had grown to 100 PB. ● Each (commodity) node has 8 cores and 12 TB of storage
  • 110. USERS OF HADOOP Adobe - Adobe uses Apache Hadoop and Apache HBase in several areas from social services to structured data storage and processing for internal use. Currently have about 30 nodes running HDFS Ebay - 532 nodes cluster (8 * 532 cores, 5.3PB) Heavy usage of Java MapReduce, Apache Pig, Apache Hive, Apache HBase Using it for Search optimization and Research.
  • 111. Twitter: ● We use Apache Hadoop to store and process tweets, log files, and many other types of data generated across Twitter. GBIF (Global Biodiversity Information Facility): ● Nonprofit organization that focuses on making scientific data on biodiversity available via the Internet ● 18 nodes running a mix of Apache Hadoop and Apache HBase
  • 112. University of Glasgow: ● 30-node cluster (Xeon Quad Core 2.4GHz, 4GB RAM, 1TB/node storage). To facilitate information retrieval research & experimentation, particularly for TREC. Greece.com: ● Using Apache Hadoop for analyzing data for millions of images, log analysis, data mining
  • 113. References ● http://bit.ly/1km1e46 ● http://bit.ly/Rzuzfz ● http://yhoo.it/1pheFVK ● Big data: Testing Approach to Overcome Quality Challenges, by Mahesh Gudipati, Shanthi Rao, Naju D. Mohan and Naveen Kumar Gajja.
  • 115. What is it? Learning is a process of knowledge acquisition with a specific purpose. Machine learning is the study of how to use computers to simulate human learning activities. (Figure: a training set feeds a learning algorithm, which produces a hypothesis; the hypothesis maps input to predicted output, with feedback.)
  • 116. Why is Machine Learning Possible? ● Mass storage: more data available ● Higher performance of computers: larger memory for handling the data; greater computational power for calculating and even online learning
  • 117. Basic Structure of the Machine Learning System External environment Corpus study Knowledge Representation Execution Machine Learning Model
  • 118. The Goal of Machine Learning is... to create a predictive model that is indistinguishable from a correct model. Without Logic With Logic
  • 119. Two Phases Machine learning methods are broken into two phases: Training and Application.
  • 120. Types of Machine Learning Main types: 1. Supervised learning 2. Unsupervised learning 3. Reinforcement learning Other types: 1. Semi-supervised learning 2. Time-series forecasting 3. Anomaly detection 4. Active learning
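The two phases can be illustrated with a very small supervised example: a training step that fits a straight line to known (input, output) pairs, and an application step that uses the fitted line on new input. Ordinary least squares is chosen here only as a simple stand-in for a learning algorithm, and the data points are hypothetical.

```python
def train(xs, ys):
    """Training phase: fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

def predict(model, x):
    """Application phase: use the learned model on previously unseen input."""
    slope, intercept = model
    return slope * x + intercept

# Hypothetical training set in which the output is twice the input.
model = train([1, 2, 3], [2, 4, 6])
prediction = predict(model, 4)
print(model, prediction)
```

The learned pair `(slope, intercept)` plays the role of the hypothesis: it is produced once from the training set and then reused for every new prediction.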
  • 121. The Main Research Work on Machine Learning Field Task-oriented research Cognitive simulation Theoretical analysis
  • 122. Data Science and Machine Learning  If we are giving the computer rules and/or algorithms to automatically search through your data to “learn” how recognize patterns and make complex decisions (such as identifying spam emails), we are implementing machine learning. In Data science, Data scientists use both statistical techniques and machine learning algorithms for identifying patterns and structure in data.
  • 123. . Role of Machine Learning in Data Science https://doubleclix.wordpress.com/category/data-science/
  • 124. A Simple Implementation Let, we have a model consisted of the likelihood of the coin landing heads (prior over θ), while the data consisted of the results of N coin flips. We are observing some data. Our goal is to determine the model from the data i.e. we will find the probability of getting desired model using the given data or p(model|data).
  • 125. Using Conditional Probability, p(data|model) =p(data and model) * p(model) --(1) p(model|data) =p(data and model) * p(data) --(2) From (1) and (2) we get, p(data|model) / p(model) = p(model|data) / p(data) That implies : p(model|data) = (p(model|data) * p(data)) / p(model) posterior likelihood prior evidence
  • 126. The likelihood distribution describes the likelihood of the data given the model; it reflects our assumptions about how the data were generated. The prior distribution describes our assumptions about the model before observing the data. The posterior distribution describes our knowledge of the model, incorporating both the data and the prior. The evidence is useful in model selection.
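For the coin example, the four quantities can be computed directly. The following is a minimal sketch under stated assumptions: θ is restricted to a discrete grid, the prior is uniform, and the function name `posterior_over_theta` and the counts (7 heads in 10 flips) are invented for illustration. It multiplies likelihood by prior and divides by the evidence to obtain the posterior.

```python
def posterior_over_theta(heads, flips, grid_size=101):
    # Candidate models: theta = probability of heads, on an evenly spaced grid.
    thetas = [i / (grid_size - 1) for i in range(grid_size)]
    # Prior p(model): uniform over the grid (an assumption of this sketch).
    prior = [1.0 / grid_size] * grid_size
    # Likelihood p(data|model) for one particular sequence of flips.
    likelihood = [t ** heads * (1 - t) ** (flips - heads) for t in thetas]
    # Evidence p(data): sum of likelihood * prior over all candidate models.
    unnormalized = [l * p for l, p in zip(likelihood, prior)]
    evidence = sum(unnormalized)
    # Posterior p(model|data) = likelihood * prior / evidence.
    posterior = [u / evidence for u in unnormalized]
    return thetas, posterior


# Hypothetical data: 7 heads observed in 10 flips.
thetas, posterior = posterior_over_theta(heads=7, flips=10)
best_index = max(range(len(posterior)), key=posterior.__getitem__)
best_theta = thetas[best_index]
```

With a uniform prior the most probable θ coincides with the observed fraction of heads; a different prior would shift the posterior accordingly.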
  • 127. Working Method of a Predictive Modeler and a Data Scientist A predictive modeler may use a machine learning approach to predict a value or the likelihood of an outcome, given a number of input variables. A data scientist applies these same approaches to large data sets, writing code and using software adapted to work on big data.
  • 128. Prospective The available library of statistical and machine learning algorithms for evaluating and learning from big data is growing, but is not yet as comprehensive as the algorithms available for the non-distributed world. The algorithms vary by product, so it is important to understand what is and is not available. Moreover, not all algorithms familiar to the statistician and data miner are easily converted to the distributed computing environment. The bottom line is that, while fitting models on big data has the potential benefit of greater predictive power, some of the costs are loss of flexibility in algorithm choices and/or extensive programming time.
  • 129. References Machine Learning and Data Mining Lecture Notes, CSC 411/D11, Computer Science Department, University of Toronto, Version: February 6, 2012; Tom M. Mitchell, The Discipline of Machine Learning, July 2006, CMU-ML-06-108, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213; Nic Schraudolph, Statistical Machine Learning; http://bit.ly/1oFt1ws; http://bit.ly/1oFtNty
  • 131. Research Areas Cloud computing Databases and Database Management Systems Natural language processing Signal Processing Computer vision
  • 132. Cloud computing Cloud computing involves distributed computing over a network, where a program or application may run on many connected computers at the same time. It specifically refers to a server connected through a communication network such as the Internet, an intranet, a local area network (LAN) or wide area network (WAN).
  • 133. Issues Privacy -The increased use of cloud computing services such as Gmail and Google Docs has pressed the issue of privacy concerns. The greater use of cloud computing services has given access to a plethora of data which has the immense risk of data being disclosed either accidentally or deliberately.
  • 134. Contd.. Legal-certain legal issues arise with cloud computing, including trademark infringement, security concerns and sharing of proprietary data resources. Vendor lock-in-cloud computing is still relatively new, standards are still being developed. Many cloud platforms and services are built on the specific standards, tools and protocols developed by a particular vendor for its particular cloud offering. This is a major challenge in interoperability.
  • 135. Research areas: open interoperation across cloud solutions at IaaS, PaaS and SaaS levels; managing multi-tenancy at large scale and in heterogeneous environments; dynamic and seamless elasticity from private clouds to public clouds for unusual and/or infrequent requirements; data management in a cloud environment, taking the technical and legal constraints into consideration.
  • 136. Databases & DBMS A database is an organized collection of data. The data are typically organized in a way that supports processes requiring this information. Database management systems (DBMSs) are specially designed software applications that interact with the user, other applications, and the database itself to capture and analyze data.
  • 137. Issues Data definition – Defining new data structures for a database, removing data structures from the database, modifying the structure of existing data. Update – Inserting, modifying, and deleting data. Retrieval – Obtaining information either for end-user queries and reports or for processing by applications. Administration – Registering and monitoring users, enforcing data security, monitoring performance, maintaining data integrity, dealing with concurrency control, and recovering information if the system fails.
  • 138. Research areas Research activity includes theory and the development of prototypes and models. Notable research topics include the atomic transaction concept and related concurrency control techniques, query languages and query optimization methods, RAID, and more.
  • 139. NLP Natural language processing (NLP) is a field of computer science, artificial intelligence, and linguistics concerned with the interactions between computers and human (natural) languages. As such, NLP is related to the area of human–computer interaction.
  • 140. Human-level natural language processing is an AI problem that is equivalent to making computers as intelligent as people. NLP's future is therefore tied closely to the development of AI in general. As natural language understanding improves, computers will be able to learn from the information online and apply what they have learned in the real world. In the future, humans may not need to code programs, but will dictate to a computer in a natural human language, and the computer will understand and act upon the instructions.
  • 141. Signal Processing Signal processing is an area of Systems Engineering, Electrical Engineering and applied mathematics that deals with operations on or analysis of analog as well as digitized signals, representing time-varying or spatially varying physical quantities. Signals of interest can include sound, electromagnetic radiation, images, and sensor readings, for example biological measurements such as electrocardiograms, control system signals, telecommunication transmission signals, and many others.
  • 142. Computer vision Computer vision is a field that includes methods for acquiring, processing, analyzing, and understanding images and, in general, high-dimensional data from the real world in order to produce numerical or symbolic information. A theme in the development of this field has been to duplicate the abilities of human vision by electronically perceiving and understanding an image.
  • 143. Data Science Higher Education Programmes, 2014 (Institute / Organization: Course) Indiana University, Indiana, US: Online Certificate in Data Science (January 2014); the program consists of 12 credits, including cloud computing, data management and data analysis. University of California, Berkeley: Master of Information and Data Science program. Saint Peter's University, US: Master of Science in Data Science program; the curriculum will include topics such as decision analysis and optimization, predictive modeling, data mining and visualization. Worcester Polytechnic Institute, Worcester, Massachusetts, US: Master of Science in Data Science program. University of Virginia, US: Master of Science in Data Science; a professional program to prepare students for the use of data analysis in major industries such as health care, business, and science.
  • 144. Conferences on Data Science 2014 International Conference on Data Science and Engineering (26-28 August 2014), hosted by the School of Computer Science Studies, Cochin University of Science & Technology, co-sponsored by IEEE Kerala. DataEDGE Conference: A new vision for data science (May 8–9, 2014, Berkeley, CA). Discussions will cover the ways organizations are using data to address business and social issues, the challenges of working with data at scale, and the most pressing questions and debates facing data scientists today.
  • 145. O’Reilly Strata is organising three conferences: New York (October 15-17, 2014): discussions will be on complex issues and opportunities brought to business by big data, data science, and pervasive computing. Barcelona, Spain (November 19–21, 2014): discussions will be on big data analytics. San Jose, CA (February 18–20, 2015).
  • 146. ASE (Academy of Science and Engineering) is organising three conferences: Stanford University, CA, USA (May 27-31, 2014); Tsinghua University, Beijing, China (August 4-7, 2014); Harvard University, Cambridge, MA, US (December 15-19, 2014). IEEE International Conference on Big Data Science and Engineering (Tsinghua University, Beijing, China, 24-26 Sept. 2014). The 2014 International Conference on Data Science and Advanced Analytics (October 30 - November 1, 2014, Shanghai, China).
  • 147. Journals of Data Science Journal of Data Science: an international journal devoted to applications of statistical methods at large; the online version is free, the hard copy version costs 300 USD/year. CODATA Data Science Journal: published by CODATA. EPJ Data Science: a SpringerOpen journal. International Journal of Data Science: Inderscience Publishers.

Editor's notes

  1. Data never sleeps
  2. This classification shows that any group of people can be placed in one of the categories. The right type of data scientist can be chosen based on the organization’s requirements. Before choosing the type of data scientist you want to become, consider the skills required and the skills you already possess, so that you proceed in the appropriate direction. So who are you going to be? A programmer, a statistician, a marketer, a business lead or a jack of all trades?
  3. Department of Statistics, Columbia University, New York + Department of Statistics and Information Science, Catholic Fu-jen University, Taipei + Data Mining Center, Renmin University of China, Beijing