Huge amounts of data are being collected everywhere: when we browse the web, visit the doctor's clinic, shop at the supermarket, tweet, or watch a movie. This plethora of data is dealt with under a new realm called Data Science. Data Science is now recognized as a highly critical, growing area with impact across many sectors including science, government, finance, health care, social networks, manufacturing, advertising, retail,
and others. This colloquium will try to provide an overview as well as clarify the bits and pieces of this emerging field.
3. ●
Data Science is a newly emerging field dedicated to
analyzing and manipulating data to derive insights
and build data products. It combines skill-sets
ranging from computer science, to mathematics, to
art. (www.kaggle.com)
4. ●
Data science implies a focus involving data and, by
extension, statistics, or the systematic study of the
organization, properties, and analysis of data and its
role in inference, including our confidence in the
inference. (D.J. Patil)
●
In simple words, it is a process which extracts
information and knowledge from huge amounts of data.
5.
6. Evolution
• 1900 - Statistics
• 1960 - “Data Mining”
• 2006 - Google Analytics appears
• 2007 - Business/Data/Predictive Analytics
• 2012 - Big Data surge
• 2013 - Data Science
• 2015 - ??
7.
8. ●
Data is growing at a very high pace (exponentially).
●
According to IBM, 2.5 exabytes - that's 2.5 billion
gigabytes (GB) - of data was generated every day in
2012. About 75% of data is unstructured, coming
from sources such as text, voice and video.
9. ●
In 2012 it reached 2.8 zettabytes and IDC forecasts
that we will generate 40 zettabytes (ZB) by 2020
which is the equivalent of 5,200 GB of data for every
man, woman and child on Earth.
●
90% of all the data in the world today has been
created in the past few years.
10.
11. S.No. Sub-Topic Speaker
1. What is Data Science Sandip Das
2. Data Scientist Anwesha Bhattacharya
3. Applications of Data Science Manasa Rath
4. Workflow of Data Science Dibakar Sen
5. Challenges in Workflow of Data Science Jayanta Kr. Nayek
6. Tools and Technology Tanmay & Manash
7. Machine Learning in Data Science Samhati Soor
8. Conclusion Shiv Shakti Ghosh
16. Data
How many lily pads?
Measure the lily pads in inches
How many small, medium or large lily pads?
How many frogs?
17. What is Data?
It is something you want to know.
A collection of facts.
Facts and statistics collected together for reference or analysis.
Data as the plural form of datum; as pieces of information; and as a collection
of object-units that are distinct from one another.
Data is undifferentiated observation of facts in terms of words, numbers,
symbols, etc.
18. What is Data?
Computer data is information processed or stored by a computer.
This information may be in the form of text documents, images,
audio clips, software programs, or other types of data. Computer
data may be processed by the computer's CPU and is stored in
files and folders on the computer's hard disk.
19. Science
The systematic observation of natural events and
conditions in order to discover facts about them and to
formulate laws and principles based on these facts.
Science involves more than the gaining of knowledge. It is
about gaining a deeper and often useful understanding of
the world.
20. Science is the art of
Discovering what we don't know from data
Obtaining predictive, actionable insight from data
Creating data products that have business impact
Building confidence in decisions that drive business
value
21. Data science
According to computer scientist Peter Naur,
“The science of dealing with data, once they have
been established.”
Data Science is the scientific study of the creation,
validation and transformation of data to create meaning.
Data science is the study of the generalizable extraction of
knowledge from data.
23. Domain Expertise
Domain expertise is proficiency, with special
knowledge or skills, in a particular area or topic.
Domain expertise includes knowing what problems
are important to solve and knowing what sufficient
answers look like. Domain experts understand what
the customers of their knowledge want to know.
24. Data Engineering
It is the data part of data science. It involves
Acquiring
Ingesting
Transforming
Storing
Retrieving data
25. Scientific Method
It is the process for acquiring new knowledge by applying
the principles of reasoning on empirical evidence derived
from testing hypotheses through repeatable experiments.
26. Statistics & Mathematics
Statistics (along with mathematics) is the cerebral part of
Data Science. Together they collect, organize, analyse and
interpret data.
28. Visualization
It is the pretty face of data science.
A good visualization is the result of a creative process
that composes an abstraction of the data in an
informative and aesthetically interesting form.
29. Hacker mindset
Hacking is modifying one's own computer system,
including building, rebuilding, modifying and creating
software, electronic hardware or peripherals, in order
to make it better, make it faster, or give it added
features.
Data science hacking involves inventing new models
and exploring.
32. Who is a data scientist?
●
A practitioner of data science is
called a data scientist.(~Wikipedia)
●
Data scientists use technology
and skills to increase awareness,
clarity and direction for those
working with data.
(http://www.datascientists.net)
33.
34. Why do we need data scientists?
●
Firstly, there is more data than we can consume. We
require a data scientist who can look at the data and
say, “This is important. Check out this one.”
●
They are the people who can understand and provide
meaning to the piles and piles of data that are
collected. “Big data” is the buzzword that represents
those piles.
●
Minimise the disruptions that are encountered while
dealing with data.
●
Present data with an awareness of the consequences of
presenting that data.
36. Types of Data Scientists
Data scientists can be
broadly classified into
two categories:
Product-focused data scientists.
Business Intelligence style of
data scientists.
There are roughly 4 to 5
groups in each
category.
37. Product-focused Data Scientists
Data Researcher
The professionals in this category come from the
academic world and have in-depth backgrounds in
statistics or the physical or social sciences. This
type of data scientist often holds a PhD but is
weakly skilled in Machine learning, Programming
or Business.
Data Developer
These guys tend to concentrate on technical issues
that come with handling data. They are strong in
programming and machine learning but weak in
business and statistics skills.
Data Creatives
These are the guys who make something
innovative out of mountains of data. They are
strongly skilled in machine learning, Big Data,
programming and other skills to handle massive
data.
Data Business people
They represent the business side and are
responsible for making vital business decisions
through data analytics techniques. They are a
blend of business and technical proficiency.
38. Business Intelligence based Data Scientists
●
Quantitative, exploratory Data Scientists
Quantitative, exploratory data scientists are inclined to have
PhDs and use theory to comprehend behaviour. By
combining theory and exploratory research, these data
scientists improve products.
●
Operational Data Scientists
Operational data scientists frequently work in finance, sales or
operations teams in an organization. Their role is to analyse the
performance, responses and behavior of a process, to
improve the organization's strategy and efficiency.
●
Product Data Scientists
Product data scientists fit in to product management or
engineering. Their job is to understand the way users
make use of a product and make use of that knowledge to
fine tune the product.
●
Marketing Data Scientists
Marketing data scientists focus on the user base, evaluate
performance and work on improving efficiency, much
like a standard marketing professional.
●
Research Data Scientists
Research data scientists create insights from a data set.
39. Profile of Data Scientist
●
They love data
●
Have investigative mind set
●
Goal of work: finding patterns in data
and data driven products
●
Are practitioners, not theorists
●
Have “hands on” skills
●
Have domain expertise
●
Team players
●
Technically focused
●
Versatile communication and
collaboration skills
●
Curiosity for exploring and
experimenting with data.
●
Sceptical people, likely to ask a lot of
questions around the viability of a
given solution and whether it will
really work.
40. Required skills
●
Data mining - Computational process of discovering patterns in large data
sets. The analysis step of the "Knowledge Discovery in Databases".
●
Programming - The act of instructing computers to perform tasks.
●
Algorithms - Step-by-step procedure for calculations used for analysis of
data.
●
Statistics – The collection, organization, analysis, interpretation and
presentation of data.
●
NLP - Interactions between computers and human languages.
●
Machine learning - The science of getting computers to act without being
explicitly programmed.
●
Distributed systems – The components located on networked computers
communicate and coordinate their actions by passing messages.
●
Visualization - The creation and study of the visual representation of data,
communicate both abstract and concrete ideas.
●
.........
41. What Does a Data Scientist Do?
10 Things [most] Data Scientists Do:
1) Ask Good Questions.
What is What?
We don’t know! We’d like to know?
2) Explore data & generate hypothesis. Run experiments
3) Scoop, Scrap & Sample Data
4) Tame Data
5) Discover the unknowns.
6) Model Data. Model Algorithms.
7) Understand Data Relationships
8) Tell the Machine How to Learn from Data
9) Create Data Products that Deliver Actionable Insight
10) Communicate the results using visualization, presentations
42. DIKUW
Data (D): raw numbers, letters, symbols
Information (I): what; description, context, relationships
Knowledge (K): how to; extract, test, instruction
Understanding (U): why; cause & effect, proved, known unknowns
Wisdom (W): when; prediction, what's best, unknown unknowns
Roles along the spectrum: Data Engineer, Data Analyst, Data Miner, Data Scientist
PAST → FUTURE
43. Data Scientist vs Data Analyst
Data Scientist:
Familiarity with database systems, e.g. MySQL
Better to be familiar with Java, Python
Strong understanding of Hadoop-based analytics
Perfection in mathematics, statistics, correlation, data mining etc.
Deep statistical insights and machine learning
Data Analyst:
Familiarity with data warehousing and business intelligence concepts
In-depth exposure to SQL and analytics
Clear understanding of various analytical functions (median, rank etc.) and how to use them on data sets
Perfection regarding the tools and components of data architecture
Proficiency in decision making
● Data analysis has been generally used as a way of explaining some phenomenon by extracting interesting patterns from individual data sets with well-formulated queries.
● Data science, on the other hand, aims to discover and extract actionable knowledge from the data, that is, knowledge that can be used to make decisions and predictions, not just to explain what's going on.
44. Challenges of data scientist
●
Red tape
No access allowed
●
Unknown need
What's the organization's
goal?
●
Terminology
What's a wonkulator?
●
Real world data
Messy, noisy, missing
●
Analysis distrust
...but I don't like that
result
45. References
●
Zhukov, Leonid. Data Scientists. Higher School of Economics.
National Research University.
●
http://bit.ly/1kduMvA
●
http://bit.ly/1orF9DL
●
http://bit.ly/1tMBBvQ
●
http://bit.ly/1kJ9gU8
●
http://bit.ly/TS9H5e
●
http://bit.ly/1jZR0WA
49. Applications in the Education Sector
-Survey done by the Pearson group to improve learning software and
course materials for better quality and efficacy in learning
-Tools used: Python, R, Google BigQuery
50. Data Science in Healthcare Industry
-where a group has been diagnosed with Type2 Diabetes & some subset of
this group has developed complications
-would like to know whether there is any pattern to complications and
whether the probability of complication can be predicted and therefore
acted upon
(Figure: healthcare use database snippet)
51. Extracting Interesting Patterns of Health Outcomes from the Healthcare System
OBSERVATION
What is the incidence of complications of Type 2 diabetes for people over 37 who are on more than six medications?
Is the pattern robust and predictive?
52. Remarks
-When predictive accuracy becomes a primary objective, the computer
tends to play a significant role in model building and decision
making.
-This shows an integrated skill set spanning mathematics, statistics,
AI, databases and optimization, along with a deep understanding of
the craft of problem formulation to engineer effective solutions.
56. Key Points
--The ability to interpret unstructured data and integrate it with numbers
further increases our ability to extract useful knowledge in real
time and act on it.
63. Preparation Phase
●
Acquire data
The obvious first step in any data science
workflow is to acquire the data to analyze. Data can
be acquired from a variety of sources. e.g.,:
-Existing Data can be used (e.g., U.S. Census data
sets).
-Data can be automatically generated by computer
software.
-Data can be manually entered into a spreadsheet
or text file by a human through survey.
64. Preparation Phase
●
Reform and clean data
-Before analysis begins, we need to verify that the data are accurate
and that the variables are well named and properly labeled.
-We have to store the data in desired format,
- Verify the sample and variables
- Do the variables have the correct values?
- Are missing data coded appropriately?
-Are the data internally consistent?
- Is the sample size correct? etc.
-Programmers reformat and clean data either by writing scripts or by
manually editing data, say, a spreadsheet.
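A minimal sketch of the verify-and-clean steps listed above, in plain Python; the field names and the missing-value code ("n/a") are invented for illustration, not taken from any real survey:

```python
# Hypothetical survey records as a script might receive them: strings,
# with missing values coded inconsistently ("n/a", empty string).
RAW_RECORDS = [
    {"age": "34", "income": "52000"},
    {"age": "n/a", "income": "61000"},   # missing value coded as "n/a"
    {"age": "29", "income": ""},         # empty entry
]

def clean(records):
    """Coerce fields to numbers; recode missing entries uniformly as None."""
    cleaned = []
    for rec in records:
        row = {}
        for field, value in rec.items():
            value = value.strip()
            row[field] = int(value) if value.isdigit() else None
        cleaned.append(row)
    return cleaned

def verify(records):
    """Basic checks from the slide: sample size and value ranges."""
    assert len(records) == 3, "unexpected sample size"
    for rec in records:
        if rec["age"] is not None:
            assert 0 <= rec["age"] <= 120, "age out of range"
    return True

data = clean(RAW_RECORDS)
print(data[1]["age"])  # None: missing data now coded appropriately
```

In practice such scripts grow checks for each question on the slide (correct values, internal consistency, sample size) before any analysis runs.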
66. Analysis Phase
●
Data Analysis
- The core activity of data science is the
analysis phase: writing, executing, and refining
computer programs to analyze and obtain insights
from data.
- Different "scripting" languages such
as Python, Perl, R, and MATLAB are used to
analyse the data. However, they also use compiled
languages such as C, C++, and Fortran when
appropriate.
67. ●
In the analysis phase, the programmer engages in a repeated
iteration cycle of editing scripts, executing to produce output files,
inspecting the output files to gain insights and discover mistakes,
debugging, and re-editing.
69. Reflection / Evaluation Phase
While the analysis phase involves programming, the reflection phase
involves thinking and communicating about the outputs of
analyses. After inspecting a set of output files, a data scientist might
perform the following types of reflection:
-Take notes
- Hold meetings
- Make comparisons and
explore alternatives
71. Dissemination Phase
The final phase of data science is disseminating results. Prepare
reports in order to communicate findings to the appropriate
audience. Results are most commonly in the form of written reports
such as internal memos, slideshow presentation, business / policy
white paper, or academic research publications.
●
Beyond presenting results in written form,
some data scientists also want to distribute
their software so that colleagues can
reproduce their experiments or play with
their prototype systems.
74. Preparation phase
Acquire data:
-Keeping track of provenance :
-Where each piece of data comes from and whether it is still up-to-date.
-Data management :
-Programmers must assign names to data files that they create or download
and then organize those files into directories.
-When they create or download new versions of those files, they must make
sure to assign proper filenames to all versions and keep track of their
differences.
-Storage :
-Sometimes there is so much data that it cannot fit on a single hard drive,
so it must be stored on remote servers.
75. Preparation Phase
Reformat and clean data :
-A related problem is that raw data often contains semantic errors (errors in
logic or arithmetic that must be detected at run time), missing entries, or
inconsistent formatting, so it needs to be "cleaned" prior to analysis.
-Data integration :
-Data integration involves combining data residing in different sources and
providing users with a unified view of these data.
-Heterogeneous Data:
-data integration involves synchronizing huge quantities of variable,
heterogeneous data resulting from internal legacy systems (an old method,
technology, computer system, or application program,"of, relating to, or
being a previous or outdated computer system) that vary in data format.
Legacy systems may have been created around flat file, network, or
hierarchical databases.
76. Preparation Phase
●
Data Integration Problems:
-Unanticipated Costs:
-Labor costs for initial planning, evaluation, programming and
additional data acquisition
-Software and hardware purchases
-Unanticipated technology changes/advances
-Both labor and the direct costs of data storage and
maintenance
-Lack of Data Management Expertise:
-support required to engage and convey to everyone in the agency
the need for and benefits of data integration is unlikely
to flow from leaders who lack awareness of or
commitment to the benefits of data integration.
77. Preparation Phase
Data transmission:
-It is the physical transfer of data over a point-to-point or
point-to-multipoint communication channel.
-Cloud data storage is popularly used as the development of cloud
technologies.
-We know that the network bandwidth capacity is the bottleneck in
cloud and distributed systems, especially when the volume of
communication is large.
-On the other side, cloud storage also leads to data security problems, such as
the requirements of data integrity checking.
78. Analysis Phase
-Data inconsistency and incompleteness:
-A number of data preprocessing techniques, including data cleaning, data
integration, data transformation and data reduction, can be applied to remove
noise and correct inconsistencies.
-Scalability:
-The biggest and most important challenge is scalability when we deal with the
Big Data analysis.
-In the last few decades, researchers have paid more attention to accelerating
analysis algorithms to cope with increasing volumes of data and to speeding up
processors following Moore's Law.
-Data Curation:
-Data curation is aimed at data discovery and retrieval, data quality assurance,
value addition, reuse and preservation over time.
-The existing database management tools are unable to process Big Data that
grow so large and complex.
79. Analysis Phase
-Timeliness:
-For real-time Big Data applications, like navigation, social networks, finance, biomedicine,
astronomy, intelligent transport systems, and the Internet of Things, timeliness is the top
priority. How can we guarantee the timeliness of response when the volume of
data to be processed is very large?
-File and metadata management:
-Repeatedly editing and executing scripts while iterating on experiments causes the
production of numerous output files, such as intermediate data, textual reports, tables,
and graphical visualizations.
-However, doing so leads to data management problems due to the abundance of files and
the fact that programmers often later forget their own ad-hoc naming conventions.
-Data security:
-Firstly, the size of Big Data is extremely large, challenging the protection approaches.
-Secondly, it also leads to a much heavier security workload.
80. Analysis Phase
-Absolute running times:
Scripts might take a long time to terminate, either due to large amounts
of data being processed or the algorithms being slow.
-Incremental running times:
Scripts might take a long time to terminate after minor incremental code
edits done while iterating on analyses, which wastes time
re-computing almost the same results as previous runs.
-Crashes from errors:
Scripts might crash prematurely due to errors in either the code or
inconsistencies in data sets. Programmers often need to endure several
rounds of debugging before their scripts can terminate with useful results.
81. Reflection Phase
●
Take notes:
Since notes are a form of data, the usual data management problems arise in
notetaking, most notably how to organize notes and link them with
the context in which they were originally written.
●
Make comparisons and explore alternatives:
Data scientists must organize, manage, and compare these graphs to gain
insights and ideas for what alternative hypotheses to explore.
82. Dissemination Phase
-Functionalities:
-To convey information easily by providing knowledge hidden in the complex
and large-scale data sets, both aesthetic form and functionality are
necessary.
-Current tools mostly have poor performance in functionality and
response time.
-Scalability :
-It is particularly difficult to conduct data visualization (the main objective
of data visualization is to represent knowledge more intuitively and
effectively by using different graphs) because of the large size and high
dimension of Big Data.
83. Dissemination Phase
●
Difficult to distribute research code:
Some data scientists also want to distribute their software so
that colleagues can reproduce their experiments or play
with their prototype systems. It is difficult to distribute
research code in a form that other people can easily
execute on their own computers.
●
Difficult to reproduce the results:
It is even difficult to reproduce the results of one's own
experiments a few months or years in the future, since
one's own operating system and software inevitably
get upgraded in some incompatible manner such that
the original code no longer runs.
84. Reference
●
Chen, Philip C. L. and Zhang, Chun-Yang. (2014).
Data-intensive applications, challenges, techniques
and technologies: A survey on Big Data. Information
Sciences. Elsevier. Department of Computer and
Information Science, Faculty of Science and
Technology, University of Macau, Macau, China.
●
http://bit.ly/1jZcx2I
●
http://1.usa.gov/SNspKm
87. Data Science Tools
Language
− Java, R, Python, ...
Databases/Data Warehouses
− Apache Cassandra, Apache HBase, MongoDB, ....
Data Mining
− RapidMiner/RapidAnalytics, Orange, Weka, ....
File Systems
− Gluster, Hadoop Distributed File System, ...
88. Data Science Tools
Big Data Search
− Lucene, Solr, ...
Data Aggregation and Transfer
− Sqoop, Flume, ....
Miscellaneous Big Data Tools
– Hadoop, Avro, Zookeeper, ...
......................
89.
90. What is Hadoop?
●
Apache Hadoop is a framework that allows for
the distributed processing of large data sets across
clusters of computers using simple programming
models.
(Figure: a Hadoop cluster made up of nodes)
92. Origin of Hadoop
• Google introduced two key technologies for handling Big Data: the Google File
System (a distributed file system technology) in 2003 and MapReduce
(a framework for a distributed compute model) in 2004.
• Early in 2005, the Nutch developers had a working MapReduce
implementation in Nutch, and by the middle of that year all the major
Nutch algorithms had been ported to run using MapReduce and NDFS.
• In February 2006 they moved out of Nutch to form an independent
subproject of Lucene called Hadoop.
• First release of Apache Hadoop in September 2007
93. When should we go for Hadoop ?
Data is too huge
Unstructured data
Parallelism
Processes are independent
Need better scalability
94. The Hadoop Ecosystem
●
HDFS - Hadoop Distributed File System.
●
MapReduce - A distributed framework for executing work in
parallel.
• Hive - Hive is a data warehouse infrastructure built on top of
Hadoop for providing data summarization, query, and
analysis.
●
Pig – Pig is a high-level platform for creating MapReduce
programs used with Hadoop.
●
HBase – A non-relational, distributed database system.
●
..........
95. The Major Component of Hadoop
Hadoop uses its own distributed file system, HDFS, which makes
data available to multiple computing nodes.
Hadoop uses MapReduce, where the application is divided into
many small fragments of work, each of which may be executed or
re-executed on any node in the cluster.
96. HDFS
Hierarchical UNIX-like file system for data storage
Splits large files into blocks
Stores file blocks across many nodes in a
cluster
Distributes and replicates blocks to different
nodes
Has a master-slave architecture
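As a toy illustration of the block splitting and replication described above (this is not the HDFS API; the block size, node names and replication factor are made-up values, whereas real HDFS defaults are on the order of 128 MB blocks with replication factor 3):

```python
# Illustrative constants, tiny so the example is easy to follow.
BLOCK_SIZE = 8          # bytes; real HDFS blocks are megabytes in size
REPLICATION = 3
NODES = ["node1", "node2", "node3", "node4"]

def split_into_blocks(data, block_size=BLOCK_SIZE):
    """Split a file's bytes into fixed-size blocks."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def place_blocks(blocks, nodes=NODES, replication=REPLICATION):
    """Assign each block to `replication` distinct nodes, round-robin."""
    placement = {}
    for i, _ in enumerate(blocks):
        placement[i] = [nodes[(i + r) % len(nodes)] for r in range(replication)]
    return placement

blocks = split_into_blocks(b"hello distributed file system")
placement = place_blocks(blocks)
print(len(blocks), placement[0])  # 4 ['node1', 'node2', 'node3']
```

Losing any single node leaves at least two surviving copies of every block, which is the point of the replication scheme.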
98. HDFS ...
NameNode
Runs on a single node as a master process
Holds file metadata (which blocks are where)
Directs client access to files in HDFS
SecondaryNameNode
Maintains a copy of the NameNode metadata
Data Node
●
Stores data in the local file system
●
Periodically sends a report of all existing blocks
to the NameNode
99. WHAT IS MAP REDUCE?
MapReduce is a programming model for
processing large data sets with a parallel,
distributed algorithm on a cluster
100.
101. Map Reduce Paradigm
Data processing system with two key phases
Map
Perform a map function on input key/value pairs to
generate intermediate key/value pairs
Reduce
Perform a reduce function on intermediate
key/value groups to generate output key/value pairs
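The two phases above can be sketched as a word count in plain Python (no Hadoop cluster involved; the real framework distributes the map and reduce work across many nodes and performs the grouping step itself):

```python
from collections import defaultdict

def map_phase(document):
    """Map: emit an intermediate (word, 1) pair for each word."""
    return [(word, 1) for word in document.split()]

def shuffle(pairs):
    """Group intermediate pairs by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: combine each key's values into an output (word, count) pair."""
    return {key: sum(values) for key, values in groups.items()}

docs = ["big data is big", "data science"]
intermediate = [pair for d in docs for pair in map_phase(d)]
counts = reduce_phase(shuffle(intermediate))
print(counts["big"])  # 2
```

Because each map call touches one document and each reduce call touches one key, both phases parallelize naturally, which is what makes the model suit large clusters.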
102. Map Reduce Daemons
•JobTracker (Master)
-Monitors job and task progress
- Manages MapReduce jobs
-Assigns tasks to different nodes
•TaskTracker (Slave)
- Creates individual map and reduce tasks
- Reports task status to JobTracker
-Runs on same node as DataNode service
103.
104. Hadoop Map Reduce Components
Reduce Phase
Shuffle
Sort
Reducer
Output Format
Map Phase
Input Format
Record Reader
Mapper
Combiner
105. How does Map Reduce work?
➢
The run time partitions the input and provides it to different
Map instances
➢
Map(key, value) → (key’, value’)
➢
The run time collects the (key’, value’) pairs and distributes
them to several Reduce functions so that each Reduce function
gets the pairs with the same key’.
➢
Map and Reduce are user-written functions in Java
108. Validation of data extract and load into
EDW(Enterprise Data Warehouse)
Once the map-reduce process is completed and data
output files are generated, the data is moved to an
enterprise data warehouse or any other
transactional system depending on the
requirement.
109. USERS OF HADOOP
Yahoo! -
More than 100,000 CPUs in 40,000 computers running
Hadoop
Produces data that was used in every Yahoo! Web search
query
Facebook -
In 2010 Facebook claimed that they had the largest
Hadoop cluster in the world with 21 PB of storage.
On June 13, 2012 they announced the data had grown to
100 PB.
Each (commodity) node has 8 cores and 12 TB of storage
110. USERS OF HADOOP
Adobe -
Adobe uses Apache Hadoop and Apache HBase in
several areas from social services to structured data storage
and processing for internal use.
Currently have about 30 nodes running HDFS
Ebay -
532 nodes cluster (8 * 532 cores, 5.3PB)
Heavy usage of Java MapReduce, Apache Pig, Apache
Hive, Apache HBase
Using it for Search optimization and Research.
111. Twitter
We use Apache Hadoop to store and process tweets, log
files, and many other types of data generated across
Twitter.
GBIF (Global Biodiversity Information Facility)
Nonprofit organization that focuses on making scientific
data on biodiversity available via the Internet
18 nodes running a mix of Apache Hadoop and Apache
HBase
112. University of Glasgow
30 nodes cluster (Xeon Quad Core 2.4GHz, 4GB RAM,
1TB/node storage).
To facilitate information retrieval research & experimentation,
particularly for TREC
Greece.com
Using Apache Hadoop for analyzing data for millions of
images, log analysis, data mining
115. What is it?
Learning is a process of knowledge acquisition with a specific
purpose.
Machine learning is the study of how to use computers to
simulate human learning activities.
(Figure: Training Set → Learning Algorithm → hypothesis; Input → hypothesis → Predicted Output, with Feedback)
116. Why is Machine Learning Possible?
Mass Storage
More data available
Higher Performance of Computers
Larger memory in handling the data
Greater computational power for calculating and even
online learning
Machine Learning Basics: 1. General Introduction
117. Basic Structure of the Machine
Learning System
(Figure: Machine Learning Model: external environment → corpus study → knowledge representation → execution)
118. The Goal of Machine Learning is...
to create a predictive model that is indistinguishable
from a correct model.
Without Logic
With Logic
120. Types of Machine Learning
Other types:
1. Semi-supervised learning
2. Time-series forecasting
3. Anomaly detection
4. Active learning
Main types:
1. Supervised Learning
2. Unsupervised learning
3. Reinforcement learning
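As a hedged illustration of supervised learning, the first main type above, here is a one-nearest-neighbour classifier in plain Python; the labelled points and the spam/ham labels are invented for the example:

```python
def nearest_neighbour(train, query):
    """Predict the label of `query` from labelled (point, label) pairs
    by copying the label of the closest training point."""
    def dist(a, b):
        # Squared Euclidean distance is enough for comparing neighbours.
        return sum((x - y) ** 2 for x, y in zip(a, b))
    _, label = min(train, key=lambda pair: dist(pair[0], query))
    return label

# Supervised learning: the training set pairs each feature vector
# with a known class label.
train = [((1.0, 1.0), "spam"), ((1.2, 0.8), "spam"),
         ((5.0, 5.0), "ham"),  ((4.8, 5.2), "ham")]

print(nearest_neighbour(train, (1.1, 0.9)))  # spam
```

Unsupervised learning, by contrast, would receive only the feature vectors and have to discover the two clusters itself.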
121. The Main Research Work on Machine
Learning Field
Task-oriented research
Cognitive simulation
Theoretical analysis
122. Data Science and Machine Learning
If we are giving the computer rules and/or algorithms to
automatically search through our data to “learn” how to
recognize patterns and make complex decisions (such as
identifying spam emails), we are implementing machine
learning.
In Data science, Data scientists use both statistical techniques
and machine learning algorithms for identifying patterns and
structure in data.
123. Role of Machine Learning in Data
Science
https://doubleclix.wordpress.com/category/data-science/
124. A Simple Implementation
Suppose we have a model consisting of the likelihood of a
coin landing heads (a prior over θ), while the data
consist of the results of N coin flips.
We are observing some data.
Our goal is to determine the model from the data, i.e. we
will find the probability of the desired model given the
data, p(model|data).
125. Using conditional probability,
p(data|model) = p(data and model) / p(model) --(1)
p(model|data) = p(data and model) / p(data) --(2)
From (1) and (2) we get,
p(data|model) * p(model) = p(model|data) * p(data)
That implies:
p(model|data) = (p(data|model) * p(model)) / p(data)
i.e. posterior = (likelihood * prior) / evidence
126. The likelihood distribution describes the likelihood of the data
given the model; it reflects our assumptions about how the
data was generated.
The prior distribution describes our assumptions about
model before observing the data.
The posterior distribution describes our knowledge of
model, incorporating both the data and the prior.
The evidence is useful in model selection.
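The posterior formula above can be checked numerically for the coin-flip example. This sketch assumes a uniform prior over a grid of candidate θ values (an illustrative choice, not part of the slides):

```python
def posterior(heads, tails, grid_size=101):
    """Posterior p(model|data) = likelihood * prior / evidence over a grid
    of candidate values of theta, the probability of heads."""
    thetas = [i / (grid_size - 1) for i in range(grid_size)]
    prior = [1.0 / grid_size] * grid_size            # uniform prior
    # Likelihood of observing the flips given each candidate theta.
    likelihood = [t ** heads * (1 - t) ** tails for t in thetas]
    # Evidence: total probability of the data under the prior.
    evidence = sum(l * p for l, p in zip(likelihood, prior))
    post = [l * p / evidence for l, p in zip(likelihood, prior)]
    return thetas, post

thetas, post = posterior(heads=7, tails=3)
best = thetas[post.index(max(post))]
print(best)  # posterior mode at 0.7, matching the observed fraction of heads
```

With a uniform prior the posterior mode equals the observed heads fraction; a stronger prior would pull the mode toward the prior's peak, which is exactly the prior/posterior interplay described above.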
127. Working Method of a Predictive
Modeler and a Data Scientist
A predictive modeler may use machine learning approach to predict a
value or likelihood of an outcome, given a number of input variables.
A data scientist applies these same approaches on large data sets,
writing code and using software adapted to work on big data.
128. The available library of statistical and machine learning
algorithms for evaluating and learning from big data is
growing, but is not yet as comprehensive as the algorithms
available for the non-distributed world.
The algorithms vary by product, so it is important to
understand what is and is not available.
Not all algorithms familiar to the statistician and data
miner are easily converted to the distributed computing
environment.
The bottom line is that, while fitting models on big data has
the potential benefit of greater predictive power, some of the
costs are loss of flexibility in algorithm choices and/or
extensive programming time.
Prospective
129. References
Machine Learning and Data Mining, Lecture Notes, CSC 411/D11,
Computer Science Department, University of Toronto,
Version: February 6, 2012.
Mitchell, Tom M. (2006). The Discipline of Machine Learning.
CMU-ML-06-108, School of Computer Science,
Carnegie Mellon University, Pittsburgh, PA 15213.
Schraudolph, Nic. Statistical Machine Learning.
http://bit.ly/1oFt1ws
http://bit.ly/1oFtNty
132. Cloud computing
Cloud computing involves distributed computing
over a network, where a program or application may
run on many connected computers at the same time.
It specifically refers to a server connected through a
communication network such as the Internet, an
intranet, a local area network (LAN) or wide area
network (WAN).
133. Issues
Privacy - The increased use of cloud computing
services such as Gmail and Google Docs has pressed
the issue of privacy concerns. The greater use of
cloud computing services has given access to a
plethora of data, which carries the immense risk of
being disclosed either accidentally or deliberately.
134. Contd..
Legal – Certain legal issues arise with cloud
computing, including trademark infringement,
security concerns and the sharing of proprietary
data resources.
Vendor lock-in – Because cloud computing is still
relatively new, standards are still being developed.
Many cloud platforms and services are built on the
specific standards, tools and protocols developed by
a particular vendor for its particular cloud offering,
which poses a major challenge for interoperability.
135. Research areas
Open interoperation across cloud solutions at the
IaaS, PaaS and SaaS levels.
Managing multi-tenancy at large scale and in
heterogeneous environments.
Dynamic and seamless elasticity from private clouds
to public clouds for unusual and/or infrequent
requirements.
Data management in a cloud environment, taking the
technical and legal constraints into consideration.
136. Databases &DBMS
A database is an organized collection of data. The
data are typically organized in a way that supports
the processes requiring this information.
Database management systems (DBMSs) are
specially designed software applications that interact
with the user, other applications, and the database
itself to capture and analyze data.
137. Issues
Data definition – Defining new data structures for a
database, removing data structures from the database,
modifying the structure of existing data.
Update – Inserting, modifying, and deleting data.
Retrieval – Obtaining information either for end-user
queries and reports or for processing by applications.
Administration – Registering and monitoring users,
enforcing data security, monitoring performance,
maintaining data integrity, dealing with concurrency
control, and recovering information if the system fails.
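The operations listed above can be sketched with Python's built-in sqlite3 module against an in-memory database (the `users` table and its columns are invented for illustration):

```python
import sqlite3

# Sketch of the core DBMS operations from the slide above, using the
# standard-library sqlite3 module with an in-memory database.

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Data definition: create a new data structure.
cur.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")

# Update: insert, then modify, data.
cur.execute("INSERT INTO users (name) VALUES (?)", ("Ada",))
cur.execute("UPDATE users SET name = ? WHERE name = ?", ("Ada Lovelace", "Ada"))

# Retrieval: query the data back for an end user or application.
rows = cur.execute("SELECT name FROM users").fetchall()

conn.close()
```

Administration (users, security, concurrency, recovery) has no one-line equivalent here; in a production DBMS it is handled by the server rather than by application code.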
138. Research areas
Research activity includes theory and the development
of prototypes and models. Notable research topics
include the atomic transaction concept and related
concurrency control techniques, query languages and
query optimization methods, RAID, and more.
139. NLP
Natural language processing (NLP) is a field of
computer science, artificial intelligence, and
linguistics concerned with the interactions between
computers and human (natural) languages. As such,
NLP is related to the area of human–computer
interaction.
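At its simplest, a natural-language pipeline begins by turning raw text into tokens a computer can count and compare; the pure-Python sketch below illustrates that first step (the sample sentence is invented):

```python
from collections import Counter

# Minimal NLP sketch: tokenise text and count word frequencies, one of
# the most basic operations a natural-language pipeline performs.

def tokenize(text):
    """Lower-case the text and split it into alphabetic word tokens."""
    return [tok for tok in text.lower().split() if tok.isalpha()]

sentence = "the cat sat on the mat"
tokens = tokenize(sentence)
freq = Counter(tokens)  # e.g. 'the' occurs twice
```

Real NLP systems go far beyond this (parsing, disambiguation, understanding), but counting tokens is where most text-processing pipelines start.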
140. Human-level natural language processing is an AI
problem equivalent to making computers as intelligent
as people. NLP's future is therefore tied closely to
the development of AI in general.
As natural language understanding improves,
computers will be able to learn from the information
online and apply what they learn in the real world.
In the future, humans may not need to write code,
but will instead dictate to a computer in natural
language, and the computer will understand and act
upon the instructions.
141. Signal Processing
Signal processing is an area of Systems Engineering,
Electrical Engineering and applied mathematics that
deals with operations on or analysis of analog as well
as digitized signals, representing time-varying or
spatially varying physical quantities.
Signals of interest can include sound,
electromagnetic radiation, images, and sensor
readings, for example biological measurements such
as electrocardiograms, control system signals,
telecommunication transmission signals, and many
others.
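A minimal example of such an operation on a digitized signal is a moving-average filter, sketched below in pure Python (the sample values are invented):

```python
# Signal-processing sketch: a moving-average filter that smooths a
# sampled (digitized) signal by replacing each point with the mean
# of a sliding window of samples.

def moving_average(signal, window):
    """Return the running mean of `signal` over `window` samples."""
    out = []
    for i in range(len(signal) - window + 1):
        out.append(sum(signal[i:i + window]) / window)
    return out

noisy = [1.0, 3.0, 2.0, 4.0, 3.0, 5.0]
smoothed = moving_average(noisy, 3)  # [2.0, 3.0, 3.0, 4.0]
```

The moving average is a simple low-pass filter: it attenuates rapid fluctuations while preserving the slower trend in the signal.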
142. Computer vision
Computer vision is a field that includes methods for
acquiring, processing, analyzing, and understanding
images and, in general, high-dimensional data from
the real world in order to produce numerical or
symbolic information.
A theme in the development of this field has been to
duplicate the abilities of human vision by
electronically perceiving and understanding an
image.
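Producing symbolic information from an image can be shown in miniature with binary thresholding, sketched below in pure Python (the 2x2 intensity grid is invented for illustration):

```python
# Computer-vision sketch: binary thresholding turns a greyscale image
# (a grid of intensities) into symbolic foreground/background labels.

def threshold(image, level):
    """Return a binary image: 1 where intensity exceeds `level`, else 0."""
    return [[1 if px > level else 0 for px in row] for row in image]

grey = [
    [10, 200],
    [180, 30],
]
binary = threshold(grey, 128)  # [[0, 1], [1, 0]]
```

Thresholding is one of the oldest segmentation techniques; modern vision systems build far richer numerical and symbolic descriptions, but the input-to-symbols pattern is the same.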
143. Data Science Higher Education Programmes in 2014
Institute / Organization – Course
Indiana University, Indiana, US * – Online Certificate in Data Science
(January 2014).
University of California, Berkeley – Master of Information and Data
Science program.
Saint Peters University, US ** – Master of Science in Data Science
program.
Worcester Polytechnic Institute, Worcester, Massachusetts, US –
Master of Science in Data Science program.
University of Virginia, US *** – Master of Science in Data Science.
* The program consists of 12 credits, including cloud computing, data management and
data analysis.
** The program’s curriculum will include topics such as decision analysis and
optimization, predictive modeling, data mining and visualization.
*** A professional program to prepare students for the use of data analysis in major
industries such as health care, business, and science.
144. Conferences on Data Science
2014
International Conference on Data Science and
Engineering, (26-28 August 2014)
Hosted By :
School of Computer Science Studies
Cochin University of Science & Technology,
Co-Sponsored by IEEE Kerala.
DataEDGE Conference : A new vision for data science,
(May 8–9, 2014 Berkeley, CA )
Discussions will be on the way organizations are using
data to address business and social issues, about the
challenges of working with data at scale, and about the
most pressing questions and debates facing data
scientists today.
145. O’REILLY Strata is organising three conferences:
New York(October 15-17, 2014 ) Discussions will be
on complex issues and opportunities brought to
business by big data, data science, and pervasive
computing.
Barcelona, Spain (November 19–21,2014) Discussions
will be on big data analytics.
San Jose, CA (February 18–20, 2015)
146. ASE(Academy of Science and Engineering) is organising
three conferences:
Stanford University, CA, USA, (May 27 - May 31, 2014)
Tsinghua University, Beijing, China, (August 4-7, 2014)
Harvard University, Cambridge, MA, US (December 15-
19, 2014).
IEEE International Conference on Big Data Science and
Engineering (Tsinghua University, Beijing, China, 24-26
Sept. 2014).
The 2014 International Conference on Data Science and
Advanced Analytics(October 30 - November 1, 2014,
Shanghai, China).
147. Journals of Data Science
Journal of Data Science – an international journal
devoted to applications of statistical methods at
large.
Online version is free.
Hard-copy version – 300 USD/year.
CODATA Data Science Journal
Published by CODATA.
EPJ Data Science: a Springer Open Journal
International Journal of Data Science –
Inderscience Publishers.
This classification shows that any group of people can be placed in one of these categories; the right type of data scientist can be chosen based on the organization's requirements.
Before choosing the type of data scientist you want to become, consider the skills required, or the skills you already possess, to proceed in the appropriate direction.
So who are you going to be? A programmer, a statistician, a marketer, a business lead, or a jack of all trades?