SlideShare ist ein Scribd-Unternehmen logo
1 von 59
Downloaden Sie, um offline zu lesen
UPRM 2015 Statistics, Data Mining, BIg Data,... Edgar Acuña 1
Statistics, Data Mining, Big Data, Data
Science: More than a play of names
Edgar Acuna
Department of Mathematical Science
University of Puerto Rico-Mayaguez
E-mail: edgar.acuna@upr.edu , eacunaf@gmail.com
Website: academic.uprm.edu/eacuna
UPRM 2015 Statistics, Data Mining, BIg Data,... Edgar Acuña 2
Outline
I. Introduction
II. Data Mining
III. From Statistics to Data Mining
IV. From STAT/DM to Big Data
V. From Statistics to Data Science
UPRM 2015 Statistics, Data Mining, BIg Data,... Edgar Acuña 3
Introduction
The mechanisms for automatic recollection of
data and the development of databases
technology has made possible that a large
amount of data can be available in
databases, data warehouses and other
information repositories.
Nowdays, there is the need to convert this
data in knowledge and information.
The first hard drive, 1956
UPRM 2015 Statistics, Data Mining, BIg Data,... Edgar Acuña 4
IBM 350, had the size of two refrigerators and about 3.75MB of
storage. It weighted over a ton. Approximately price of 50,000
dollars. Today, Seagate sells a 2TB hard drive it weights only .33
pounds. It costs around $100.
UPRM 2015 Statistics, Data Mining, BIg Data,... Edgar Acuña
5
Size (in Bytes) of datasets
Description Size Storage Media
Very small 102 Piece of paper
Small 104 Several sheets of
paper
Medium 106 (megabyte) Floppy Disk
Large 109(gigabite) USB/Hard Disk
Massive 1012(Terabyte) Hard disk/USB
Super-massive 1015(Petabyte) File of distributed data
Exabyte(1018), Zettabytes(1021), Yottabytes(1024)
The economist, February 2010
UPRM 2015 Statistics, Data Mining, BIg Data,... Edgar Acuña 6
UPRM 2015 Statistics, Data Mining, BIg Data,... Edgar Acuña 7
Data Revolution
Year Digital Analog Amount
2000 25% 75% 2 Exabytes
2007 93% 7% 300 Exaby
2013 98% 2% 1,200 Exaby
Source: Viktor Mayer-Schönberger and Kenneth Cukier: Big Data: A
Revolution that will Transform how We Live, Work and Think 2013)
UPRM 2015 Statistics, Data Mining, BIg Data,... Edgar Acuña 8
What is Data Mining?
 It is the process of extracting valid
knowledge/information from a very large dataset.
The knowledge is given as patterns and rules that
are non-trivial, previously unknown, understandable
and with a high potential to be useful.
 Other names: Knowledge discovery in databases
(KDD), Intelligent Data Analysis, Business
Intelligence.
 The first paper in Data Mining: Agrawal et al. Mining
Association rules, ACM SIGMOD 1993.
Areas related to Data Mining
Statistics
Machine
Learning
Databases
Visualization
Data Mining
9UPRM 2015 Statistics, Data Mining, BIg Data,... Edgar Acuña
UPRM 2015 Statistics, Data Mining, BIg Data,... Edgar Acuña 10
Contribution of of each area to Data
MIning
 Statistics (~35% ): Estimation of prediction models. Assume distribution for
the features used in the model. Use of sampling
 Machine learning: (~30 % ): Part of Artificial Intelligence. More heuristic
than Statistics. Small data and complex models
 Databases: (~20%):large Scale data, simple queries. The data is
maintaned in tables that are accesed quickly.
 Visualization: ( ~ 5%).It can be used in either the pre-processing o post-
processing step of the KDD process.
 Other Areas: ( ~10%): Pattern Recognition, Expert Systems, High
Performance Computing.
UPRM 2015 Statistics, Data Mining, BIg Data,... Edgar Acuña 11
Data Mining Applications
Science: Astronomy, Bioinformatics (Genomics,
Proteonomics, Metabolomics), drug discovery.
Business: Marketing, credit risk, Security and Fraud
detection,
Govermment: detection of tax cheaters, anti-terrorism.
Text Mining:
Discover distinct groups of potential buyers according
to a user text based profile. Draw information from
different written sources (e-mails).
Web mining: Identifying groups of competitors web
pages. Recomemder systems(Netflix, Amazon,
Ebay)
UPRM 2015 Statistics, Data Mining, BIg Data,... Edgar Acuña 12
Data Mining: Type of tasks
 Descriptive: General properties of the
database are determined. The most important
features of the databases are discovered. It includes
outlier detection, visualization, clustering and association
rules.
Predictive: The collected data is used to train a
model for making future predictions. Never is
100% accurate and the most important matter
is the performance of the model when is applied to future
data. It Includes Regression and Classification
UPRM 2015 Statistics, Data Mining, BIg Data,... Edgar Acuña 13
Challenges of Data Mining
 Scalability
 Dimensionality
 Complex and Heterogeneous Data.
 Quality of Data
 Privacy Data
 Streaming Data
UPRM 2015 Statistics, Data Mining, BIg Data,... Edgar Acuña 14
Software for Data Mining
 Open source
 R (cran.r-project.org). Related to Statistics (38.5% users,
Kdnuggets June 2014).
 RapidMiner (http://rapidminer.com). (44.2%) Related to the
Database community. **
 Weka ( http://www.cs.waikato.ac.nz/ml/weka/ ) (17.0%):
Related to Machine Learning. Written in Java.
 Python, iPython (19.5%). Hadoop(12.7%).
 Comercial: SAS (10.8%), XLMiner(25%), Microsoft
SQL(10.5%), KNIME (15.0%), Oracle(2.2%).
UPRM 2015 Statistics, Data Mining, BIg Data,... Edgar Acuña 15
A quick comparison of R and
iPython Notebook
 Dataset: Titanic: Machine learning from disaster (
www.kaggle.com). The goal is to apply the tools of ML to
predict which passengers survive the tragedy.
 S(Chambers, 1983), R (Ihaka nad Gentleman, 1994),
Rstudio (2011).
 Python(1989, G. Van Rossum, Netherlands), iPython
(2011, F. Perez, UC Berkeley)
 Pandas (2012) is a software library written for the Python
programming language for data manipulation and analysis.
 Rpython(Gil Bellosta, 2014)

UPRM 2015 Statistics, Data Mining, BIg Data,... Edgar Acuña 16
Google Trends for Data Mining
From STAT/DM Big Data[1]
UPRM 2015 Statistics, Data Mining, BIg Data,... Edgar Acuña 17
February 1977, L. Breiman organized the conference “Analysis of large
complex data sets”. Sponsored by ASA and IMS Dallas, USA.
1994 COMPSTAT, Proceedings of Computational Statistics. Part I.
Treatment of “huge” Data sets.
May 1997. The 29th Symposium on the Interface (Houston, TX) “ Data
Mining and the analysis of large data sets”.
October 1997, M. Cox y D. Ellsworth used the term “big data” in a
conference on visualization organized by the IEEE.
1998, Workshop on Massive data sets. Committee on Applied and
Theoretical Statistics. NRC, USA.
April 1998, John Massey, chief scientist of SGI, presented a paper “Big
data… and the next wave of Infra-stress” in a meeting of the USENIX.
From STAT/DM to Big Data[2]
UPRM 2015 Statistics, Data Mining, BIg Data,... Edgar Acuña 18
In February 2001, Doug Laney, analyst for the Meta Group, defined
data challenges and opportunities as being three-dimensional:
increasing volume (amount of data), velocity (speed of data in and out),
and variety (range of data types and sources).
September 2008, the scientific magazine Nature published a special
edition about “big data”.
May 2011, Manyika, J. et al. researchers of the McKinsey Global
Institute published the report: “Big data: the next frontier for innovation,
competition and productivity”.
UPRM 2015 Statistics, Data Mining, BIg Data,... Edgar Acuña 19
UPRM 2015 Statistics, Data Mining, BIg Data,... Edgar Acuña 20
Big Data: Examples
 The Large Hadron Collider (LCH) storages around
25 Petabytes of sensor data per year.
 In 2010, the ATT’s database of calling records was
of 323 Terabytes.
 El 2010, Walmart handled 2.5 Petabytes of
transactions.
 In 2009, there was 500 exabytes of information on
the internet.
 In 2011, Google searched in more than 20 billions of
web pages. This represents aprox. 400 TB.
 In 2013, it was announced that the NSA’s Data
Center at Utah will storage up to 5 zettabytes (5,000
exabytes).
Big Data: Definition
UPRM 2015 Statistics, Data Mining, BIg Data,... Edgar Acuña 21
In 2012, Doug Laney, working now at the Gartner Group, updated its
definition of Big Data as follows: "Big data is high volume, high velocity,
and/or high variety information assets that require new forms of
processing to enable enhanced decision making, insight discovery and
process optimization.“
Big data usually includes data sets with sizes beyond the ability of
commonly used software tools to capture, manage, and process the
data within a tolerable elapsed time.
Big data sizes are constantly changing. In 2002, maybe 100GB, in
2012 perhaps from 10TB to many petabytes of data in a single data
set.
UPRM 2015 Statistics, Data Mining, BIg Data,... Edgar Acuña 22
UPRM 2015 Statistics, Data Mining, BIg Data,... Edgar Acuña 23
Google trends for Big data
UPRM 2015 Statistics, Data Mining, BIg Data,... Edgar Acuña 24
Big data is …
Big data is like teenage sex:
everyone talks about it, nobody really knows how to do it,
everyone thinks everyone else is doing it,
so everyone claims they are doing it...
Dan Ariely ( June 2013), Professor of Psychology and
Behavioral Economics. Duke University.
Single Node Architecture
UPRM 2015 Statistics, Data Mining, BIg Data,... Edgar Acuña 25
Motivation: Google Example
UPRM 2015 Statistics, Data Mining, BIg Data,... Edgar Acuña 26
Google searches in more than 20 billion of webpages x20KB= 400+
TB. One computer reads with a speed of 30-35 MB/sec from disk.
It will need aprox 4 months to read the web.
It will be necessary aprox 1000 hard drives to read the web
It will be necessary even more than that to analyze the data
Today a standard architecture for such problems is being used. It
consists of
-a cluster of commodity Linux nodes
-commodity network (ethernet) to connect them
Cluster Architecture
UPRM 2015 Statistics, Data Mining, BIg Data,... Edgar Acuña 27
Challenges in large-scale
computing for data mining
UPRM 2015 Statistics, Data Mining, BIg Data,... Edgar Acuña 28
How to distribute the computation?
How to write down distributed program easily?
Machines fail!.
One computer may stay up three years (1000 days)
If you have 1000 servers , expect to loose 1 per day
In 2011, it was estimated that Google has 1 million of
computers, so 1000 servers can fail everyday
Using R in parallel and distributed
computation
UPRM 2015 Statistics, Data Mining, BIg Data,... Edgar Acuña 29
RMPI: Package to run R on MPI
Snow: Package to run R on MPI
Gputools: Package to run R on CUDA
Rhadoop: Interface between R y Hadoop. It includes rhbase, rhdfs,
rmr. Developed by Revolution Analytics, a company bought by
Microsoft in March 2015.
Between 2003-2006, Elio Lozano and Edgar Acuna wrote several
articles on aplicacion of R in parallel and distributed computation to
calculate metaclassifiers, metaclustering, boosting, bagging, kernel
density estimation, ensembles and outlier detection.
What is Hadoop?
 In 2004, J. Dean and S. Ghemawhat wrote a paper explaining
Google’s MapReduce, a programming model and a associated
infrastucture for storage of large data sets (file system) called
Google File System (GFS).
 GFS is not open source.
 In 2006, Doug Cutting at Yahoo! , created a open source GFS
and called it Hadoop Distributed File System (HDFS). In 2009,
he left to Cloudera.
 The software framework that supports HDFS, MapReduce and
other related entities is called the project Hadoop or simply
Hadoop.
 Hadoop is distributed by the Apache Software Foundation.
UPRM 2015 Statistics, Data Mining, BIg Data,... Edgar Acuña 30
Hadoop
UPRM 2015 Statistics, Data Mining, BIg Data,... Edgar Acuña 31
Hadoop includes:
Distributed Files System(HDFS) –distributes data
Map/Reduce-distributes application
It is written in Java
Runs on
-Linux, MacOS/X, Windows, and Solaris
-Uses commodity hardware
Distributed File System
UPRM 2015 Statistics, Data Mining, BIg Data,... Edgar Acuña 32
HDFS Architecture
 Provides
• Automatic Parallelization and Distribution
• Fault Tolerance
• I/O Scheduling
• Monitoring and Status Updates
UPRM 2015 Statistics, Data Mining, BIg Data,... Edgar Acuña 33
MapReduce
UPRM 2015 Statistics, Data Mining, BIg Data,... Edgar Acuña 34
Map-Reduce is a programming model for efficient distributed
computing
It works like a Unix pipeline:
-cat imput |grep | sort |unique –c |cat > ouput
- Input | Map | Shuffle & Sort |Reduce |Output
Efficient because reduces seeks and the use of pipeline
The WorkFlow
 Load data into the Cluster (HDFS writes)
 Analyze the data (MapReduce)
 Store results in the Cluster (HDFS)
 Read the results from the Cluster (HDFS
reads)
UPRM 2015 Statistics, Data Mining, BIg Data,... Edgar Acuña 35
Hadoop Non-Java Interfaces
UPRM 2015 Statistics, Data Mining, BIg Data,... Edgar Acuña 36
Hadoop streaming: C++,Python, perl,ruby.
Rhadoop (R and Hadoop),
Weka(Mark Hall is working on that),
Radoop (Rapidminer and hadoop, comercial)
Hadoop Pipes: (C++) It not recommendable
Where can I run hadoop?
UPRM 2015 Statistics, Data Mining, BIg Data,... Edgar Acuña 37
In your personal computer using the single node hadoop cluster. If you
have windows install a virtual machine for running Ubuntu(See Michael
Noll’s website)
Free:
At the Gordon cluster of the SDSC (1024 nodes, each node has 16
cores) and the Stampede cluster of TACC (6400 nodes, each node has
16 cores). Access can be gained through the XSEDE project.
Non-Free, but not too expensive
Amazon Elastic Compute Cloud ( EC2)
Who is using Hadoop?
UPRM 2015 Statistics, Data Mining, BIg Data,... Edgar Acuña 38
Yahoo
Facebook
Amazon
Google
IBM
Netflix
Ebay
Linkedln
Twitter
Example of Performance:Hadoop for
Feature Selection(Gomez & Acuna,2014)
39
Before Hadoop After Hadoop
Time 17 hours 5m23s
Language R and c++ Java
The Goal is to select the best features for prediction in the Covtype dataset.
Covtype has 581,102 instances and 54 predictors and 7 classes.
UPRM 2015 Statistics, Data Mining, BIg Data,... Edgar Acuña
Mahout
UPRM 2015 Statistics, Data Mining, BIg Data,... Edgar Acuña 40
Scalable Machine learning library that runs on Hadoop
Includes some algorithms for:
Recommendation mining
Clustering
Classification
Finding frequent itemsets
Some people thinks that the algorithms included in Mahout are not
programmed in a optimal way.
UPRM 2015 Statistics, Data Mining, BIg Data,... Edgar Acuña 41
In February 2009, researchers from Google tried to predict the spread of
the flu epidemic building a algorithmic model. Google’s algorithm mined
five years of web logs, containing hundreds of billions of searches, and
created a predictive model utilizing 45 search terms that proved to be a
more useful and timely indicator of flu than government statistics with their
natural reporting lags.(Viktor Mayer-Schönberger and Kenneth Cukier: Big
Data, 2013).
During the first two years the model was given acceptable predictions. But,
in 2011 the model starts to fail beginning with a 50% up to 92% of error in
the predictions. (Science, March 2014). The GFT has over-estimated the
prevalence of flu for 100 out of the last 108 weeks; it’s been wrong since
August 2011.
In May 2011, Google launched the Dengue Google Trends
Google Flu Trends: A Failure of Big data
UPRM 2015 Statistics, Data Mining, BIg Data,... Edgar Acuña 42
Spark developed at the AMP lab of the UC Berkeley. Programming must be
done in Scala, Java, Python and R. It is Free. AMPLAB claims the Spark
beats MapReduce on several algorithms.
Watson Analytics from IBM (SPSS’s owner). Free for academia.
Azure Machine Learning from Microsoft Research, software built on Linux.
Free for academia.
Accoding to kdnuggets, Azure is better developed than Watson
Amazon Web Services.
Other computational tools for Big data
UPRM 2015 Statistics, Data Mining, BIg Data,... Edgar Acuña 43
Face detection: Azure Machine Learning
UPRM 2015 Statistics, Data Mining, BIg Data,... Edgar Acuña 44
-May 2013, White House Data Partners Workshop-0/19 statisticians.
-April 2014, National Academy of Science Big Data Workshop-2/16
speakers statisticians.
- In 2012, a NSF formed a working group of 100 members on Big data.
None statistician were included.
-2012, NIH BD2K executive committee composed of 18 members does
not include any statistician.
More examples in http://simplystatistics.org
The absence of statisticians in Big Data
activities
UPRM 2015 Statistics, Data Mining, BIg Data,... Edgar Acuña 45
Layman’s perception of statisticians
A statistician is someone that counts numbers
(similar to an accountant).
A statistician makes tables and graphs to report
and/or summarize information.
A statistician computes averages percentages and
standard deviations to summarize information.
A statistician helps politicians and the government to
fool the people (Lies , Dammed lies and Statistics).
More details in J. Wu’s presentation(1997).
UPRM 2015 Statistics, Data Mining, BIg Data,... Edgar Acuña 46
From Statistics to Data Science[1]
John Tukey: “For a long time I have thought I was a
statistician interested in inferences… I have come to
feel that my central interest is in data analysis “.
(Annals of Mathematical Statistics, 1962).
In 1977, Tukey published “Exploratory Data
Analysis”. The data is analyzed to summarize theur
main characteristics using mostly visual
methods:Boxplots, stem-and-leaf, Parallel
coordinates, Multidimensional Scaling, etc.
In 1996, the International Federation of Classification
Societies (IFCS) held a conference titled “Data
Science,Classification and Related Methods”.
UPRM 2015 Statistics, Data Mining, BIg Data,... Edgar Acuña 47
From Statistics to Data Science [2]
J. Wu (1997). In his acceptance lecture as a chairman
of the Statistics Department at University of Michigan
proposed to change the name from Statistics to Data
Science.
Prof. Wu also proposed to rename statisticians as data
scientists [But Stigler’s law will apply here].
Stigler's law of eponymy". Transactions of the New
York Academy of Sciences 39: 147–58(1980).
Wu also proposed that statisticians should get
involved with large and complex data.
UPRM 2015 Statistics, Data Mining, BIg Data,... Edgar Acuña 48
From Statistics to Data Science[3]
J. H. Friedman proposed that Statistics should
embrace Data Mining as a sub-discipline.
“The role of the statistics in the data revolution”.
Stanford University Tech Report (2000).
 In 2001 appeared the Breiman’s paper. “Statistical
Modelling: The two cultures”. He states; “There are
two cultures in the use of statistical modeling to
reach conclusions from data. One assumes that the
data are generated by a given stochastic data
model. The other uses algorithmic models and treats
the data mechanism as unknown.
UPRM 2015 Statistics, Data Mining, BIg Data,... Edgar Acuña 49
From Statistics to Data Science[4]
The statistical community has been committed to the
almost exclusive use of data models. This commitment
has led to irrelevant theory, questionable conclusions,
and has kept statisticians from working on a large
range of interesting current problems.
Algorithmic modeling can be used both on large
complex data sets and as a more accurate and
informative alternative to data modeling on smaller data
sets. If our goal as a field is to use data to solve
problems, then we need to move away from exclusive
dependence on data models and adopt a more diverse
set of tools.
UPRM 2015 Statistics, Data Mining, BIg Data,... Edgar Acuña 50
From Statistics to Data Science[5]
W. Cleveland (2001). Proposed the creation of the fiel
of Data Science, similar to multi-disciplinary
statistics. ”Data Science: An Action Plan for
Expanding the Technical Areas of the Field of
Statistics”.
In 2002 starts the publication of the “Data Science
Journal” and the “Journal of Data Science”.
En el 2008, D.J.Patil (LinkedIn) y Jeff Hamerbacher
(Facebook) introduced the term “data
scientist”.(Stigler’s law was applied here).
UPRM 2015 Statistics, Data Mining, BIg Data,... Edgar Acuña 51
Data Science as a discipline
In January 2009, Hail Varian, chief economist of
Google stated “I keep saying the sexy job in the next
ten years will be statisticians..” (The McKinsey
Quarterly).
From 2009 through 2012, several professionals from
computer science, social networks and business
schools began introducing the idea that Statistics
according to Varian was not the traditional one, but a
new field called “Data Science”. T.H Davenport and
J.H. Patil (October 2012, Harvard Business Review)
“Data Scientist: The sexiest job of the 21st century”
UPRM 2015 Statistics, Data Mining, BIg Data,... Edgar Acuña 52
Statistician versus Data Scientist (D.
Smith Revolution Analytics, 2013)
Statistician Data Scientist
Imagen Baseball/Football HBR sexiest job
of the 21st century
Works Solo In a team
Data Prepared, clean Distributed,
Messy,
Unstructured
Data Size Kilobytes Gigabytes
Tools SAS, Mainframe R, Python,
Hadoop, Linux
Focus Inference(why) Prediction(what)
Latency weeks seconds
Output Report Data
Product/data App
UPRM 2015 Statistics, Data Mining, BIg Data,... Edgar Acuña 53
Hillary Mason (MS in CS, bit.ly-Accel Partners)
John Myles White (PhD Psychology, Facebook)
Peter Skomoroch (BS in Mathematics, Linkedin-DataWrangling)
Gregory Piatesky (PhD in CS, Kdnuggets)
D.J Patil (PhD Math, Linkedin- White House)
Jeff Hammerbacher(BS Mathematics, Facebook-Cloudera y Columbia
University)
David Smith(PhD Statistics, Revolution Analytics)
Christopher D. Long (MS in Math/STAT, JMI Equity)*
Peter Norvig (PhD in CS, Google)
Ben Lorica (PhD Math, O’Reilly Media)
Top 10 Data Scientist
UPRM 2015 Statistics, Data Mining, BIg Data,... Edgar Acuña 54
Undergraduate: University of Rochester, University of San Francisco,
College of Charleston, Northern Kentucky University, Case Western
University (more than a dozen in USA). Warwick University (UK).
Master: Carnegie Mellon University, Stanford University, Columbia
University, Indiana University, NYU, SMU, Virginia. More than 30 in USA,
Canada, UK, Spain, France, Germany and the rest of the world.
Latinoamerica: Only Mexico (ITAM, since 2013). Next August in Peru.
Doctorate: Edimburgh University (UK), Columbia. Less than one dozen.
Academic Programs in Data Science
UPRM 2015 Statistics, Data Mining, BIg Data,... Edgar Acuña 55
Core Courses:
Theory of Statistics
Computational Statistics
Data Mining
Machine Learning
Databases Systems
Data Science Capstone (Data Science Practicum)
Three elective courses: Algorithms, High Performance Computing,
Scientific Visualization, Regression, Big data analytics , a business course,
a biology course, a chemistry course, a marine science course, etc.
Sample Curriculum of a MS Program in
Data Science
UPRM 2015 Statistics, Data Mining, BIg Data,... Edgar Acuña 56
Data Science segun Google Trends
UPRM 2015 Statistics, Data Mining, BIg Data,... Edgar Acuña 57
Comparacion de Data Mining, Big Data, Bioinformatica y
Data Science segun Google Trends
Segun Google Trends
UPRM 2015 Statistics, Data Mining, BIg Data,... Edgar Acuña 58
… And soon, in a two- to five-year span, people will say, “The whole big-
data thing came and went. It died. It was wrong.” I am predicting that. (Prof.
Michael Jordan, EECS/STAT UC Berkeley in a interview for IEEE
Spectrum, October 20, 2014)
Talking at Gartner’s Business Intelligence and Information Management
summit in São Paulo, Feinberg said “The term ‘big data’ is going to
disappear in the next two years, to become just ‘data’ or ‘any data.’ But, of
course, the analysis of data will continue.” (dataconomy.com, May 2014)
On the Future of Big Data
UPRM 2015 Statistics, Data Mining, BIg Data,... Edgar Acuña 59
Is Data Science same as Statistics using more computer power?
Some people, particularly statisticians consider it so.
Is Data Science same as multidisciplinary modern applied Statistics?
Most of the people consider that this is the case.
Is Data Science the field to analyze only Big data?
Some people, particularly from the CS community consider it so.
Only two of the top ten Business School (Sloan at MIT and Kellog at NU) do
data science. Kellog has a statistics department inside its business school.
But some business schools are trying to rename “business statistics” as
“data science”.
On the Future Data Science

Weitere ähnliche Inhalte

Was ist angesagt?

Introduction on Data Science
Introduction on Data ScienceIntroduction on Data Science
Introduction on Data ScienceEdureka!
 
Big Data Analytics : Understanding for Research Activity
Big Data Analytics : Understanding for Research ActivityBig Data Analytics : Understanding for Research Activity
Big Data Analytics : Understanding for Research ActivityAndry Alamsyah
 
DN2017 | From Big Data to Smart Data | Kirk Borne | Booz Allen Hamilton
DN2017 | From Big Data to Smart Data | Kirk Borne | Booz Allen HamiltonDN2017 | From Big Data to Smart Data | Kirk Borne | Booz Allen Hamilton
DN2017 | From Big Data to Smart Data | Kirk Borne | Booz Allen HamiltonDataconomy Media
 
DN 2017 | A New Data Economy with Power to the People | Trent McConaghy | B...
DN 2017 |  A New Data Economy with Power  to the People | Trent McConaghy | B...DN 2017 |  A New Data Economy with Power  to the People | Trent McConaghy | B...
DN 2017 | A New Data Economy with Power to the People | Trent McConaghy | B...Dataconomy Media
 
Bắt đầu tìm hiểu về dữ liệu lớn như thế nào - 2017
Bắt đầu tìm hiểu về dữ liệu lớn như thế nào - 2017Bắt đầu tìm hiểu về dữ liệu lớn như thế nào - 2017
Bắt đầu tìm hiểu về dữ liệu lớn như thế nào - 2017Viet-Trung TRAN
 
International Journal of Engineering Research and Development (IJERD)
International Journal of Engineering Research and Development (IJERD)International Journal of Engineering Research and Development (IJERD)
International Journal of Engineering Research and Development (IJERD)IJERD Editor
 
Tools and techniques adopted for big data analytics
Tools and techniques adopted for big data analyticsTools and techniques adopted for big data analytics
Tools and techniques adopted for big data analyticsJOSEPH FRANCIS
 
Introduction of Data Science
Introduction of Data ScienceIntroduction of Data Science
Introduction of Data ScienceJason Geng
 
Big Data - Big Deal? - Edison's Academic Paper in SMU
Big Data - Big Deal? - Edison's Academic Paper in SMUBig Data - Big Deal? - Edison's Academic Paper in SMU
Big Data - Big Deal? - Edison's Academic Paper in SMUEdison Lim Jun Hao
 
Data Mining and Big Data Challenges and Research Opportunities
Data Mining and Big Data Challenges and Research OpportunitiesData Mining and Big Data Challenges and Research Opportunities
Data Mining and Big Data Challenges and Research OpportunitiesKathirvel Ayyaswamy
 

Was ist angesagt? (20)

Introduction on Data Science
Introduction on Data ScienceIntroduction on Data Science
Introduction on Data Science
 
Big Data Analytics : Understanding for Research Activity
Big Data Analytics : Understanding for Research ActivityBig Data Analytics : Understanding for Research Activity
Big Data Analytics : Understanding for Research Activity
 
DN2017 | From Big Data to Smart Data | Kirk Borne | Booz Allen Hamilton
DN2017 | From Big Data to Smart Data | Kirk Borne | Booz Allen HamiltonDN2017 | From Big Data to Smart Data | Kirk Borne | Booz Allen Hamilton
DN2017 | From Big Data to Smart Data | Kirk Borne | Booz Allen Hamilton
 
DN 2017 | A New Data Economy with Power to the People | Trent McConaghy | B...
DN 2017 |  A New Data Economy with Power  to the People | Trent McConaghy | B...DN 2017 |  A New Data Economy with Power  to the People | Trent McConaghy | B...
DN 2017 | A New Data Economy with Power to the People | Trent McConaghy | B...
 
Intro to Data Science Concepts
Intro to Data Science ConceptsIntro to Data Science Concepts
Intro to Data Science Concepts
 
Big Data for Ag (2019)
Big Data for Ag (2019)Big Data for Ag (2019)
Big Data for Ag (2019)
 
Bắt đầu tìm hiểu về dữ liệu lớn như thế nào - 2017
Bắt đầu tìm hiểu về dữ liệu lớn như thế nào - 2017Bắt đầu tìm hiểu về dữ liệu lớn như thế nào - 2017
Bắt đầu tìm hiểu về dữ liệu lớn như thế nào - 2017
 
Analytics and Data Mining Industry Overview
Analytics and Data Mining Industry OverviewAnalytics and Data Mining Industry Overview
Analytics and Data Mining Industry Overview
 
International Journal of Engineering Research and Development (IJERD)
International Journal of Engineering Research and Development (IJERD)International Journal of Engineering Research and Development (IJERD)
International Journal of Engineering Research and Development (IJERD)
 
Tools and techniques adopted for big data analytics
Tools and techniques adopted for big data analyticsTools and techniques adopted for big data analytics
Tools and techniques adopted for big data analytics
 
Introduction of Data Science
Introduction of Data ScienceIntroduction of Data Science
Introduction of Data Science
 
Elementary Concepts of data minig
Elementary Concepts of data minigElementary Concepts of data minig
Elementary Concepts of data minig
 
Overview of Big Data
Overview of Big DataOverview of Big Data
Overview of Big Data
 
Spark
SparkSpark
Spark
 
Big Data - Big Deal? - Edison's Academic Paper in SMU
Big Data - Big Deal? - Edison's Academic Paper in SMUBig Data - Big Deal? - Edison's Academic Paper in SMU
Big Data - Big Deal? - Edison's Academic Paper in SMU
 
Big Data Paper
Big Data PaperBig Data Paper
Big Data Paper
 
GADLJRIET850691
GADLJRIET850691GADLJRIET850691
GADLJRIET850691
 
Big Data Trends
Big Data TrendsBig Data Trends
Big Data Trends
 
Big data
Big dataBig data
Big data
 
Data Mining and Big Data Challenges and Research Opportunities
Data Mining and Big Data Challenges and Research OpportunitiesData Mining and Big Data Challenges and Research Opportunities
Data Mining and Big Data Challenges and Research Opportunities
 

Ähnlich wie Math2015

BigData & Supply Chain: A "Small" Introduction
BigData & Supply Chain: A "Small" IntroductionBigData & Supply Chain: A "Small" Introduction
BigData & Supply Chain: A "Small" IntroductionIvan Gruer
 
Machine Learning, Data Mining, and
Machine Learning, Data Mining, and Machine Learning, Data Mining, and
Machine Learning, Data Mining, and butest
 
Big Data Applications & Analytics Motivation: Big Data and the Cloud; Center...
Big Data Applications & Analytics Motivation: Big Data and the Cloud; Center...Big Data Applications & Analytics Motivation: Big Data and the Cloud; Center...
Big Data Applications & Analytics Motivation: Big Data and the Cloud; Center...Geoffrey Fox
 
SWOT of Bigdata Security Using Machine Learning Techniques
SWOT of Bigdata Security Using Machine Learning TechniquesSWOT of Bigdata Security Using Machine Learning Techniques
SWOT of Bigdata Security Using Machine Learning Techniquesijistjournal
 
Big data and Internet
Big data and InternetBig data and Internet
Big data and InternetSanoj Kumar
 
Data Mining and Knowledge Discovery in Business Databases
Data Mining and Knowledge Discovery in Business DatabasesData Mining and Knowledge Discovery in Business Databases
Data Mining and Knowledge Discovery in Business Databasesbutest
 
Big Data, Big Deal: For Future Big Data Scientists
Big Data, Big Deal: For Future Big Data ScientistsBig Data, Big Deal: For Future Big Data Scientists
Big Data, Big Deal: For Future Big Data ScientistsWay-Yen Lin
 
Qu'est ce que le Big Data ? Avec Victoria Galano Data Scientist chez Air France
Qu'est ce que le Big Data ? Avec Victoria Galano Data Scientist chez Air FranceQu'est ce que le Big Data ? Avec Victoria Galano Data Scientist chez Air France
Qu'est ce que le Big Data ? Avec Victoria Galano Data Scientist chez Air FranceJedha Bootcamp
 
Introduction to big data
Introduction to big dataIntroduction to big data
Introduction to big dataHari Priya
 
Data Science - An emerging Stream of Science with its Spreading Reach & Impact
Data Science - An emerging Stream of Science with its Spreading Reach & ImpactData Science - An emerging Stream of Science with its Spreading Reach & Impact
Data Science - An emerging Stream of Science with its Spreading Reach & ImpactDr. Sunil Kr. Pandey
 
JIMS Rohini IT Flash Monthly Newsletter - October Issue
JIMS Rohini IT Flash Monthly Newsletter  - October IssueJIMS Rohini IT Flash Monthly Newsletter  - October Issue
JIMS Rohini IT Flash Monthly Newsletter - October IssueJIMS Rohini Sector 5
 
Big data using Public Cloud
Big data using Public CloudBig data using Public Cloud
Big data using Public CloudIMC Institute
 
A Short History of Big Data
A Short History of Big DataA Short History of Big Data
A Short History of Big DataGadi Eichhorn
 
How to design ai functions to the cloud native infra
How to design ai functions to the cloud native infraHow to design ai functions to the cloud native infra
How to design ai functions to the cloud native infraChun Myung Kyu
 
Carlo Colicchio: Big Data for business
Carlo Colicchio: Big Data for businessCarlo Colicchio: Big Data for business
Carlo Colicchio: Big Data for businessCarlo Vaccari
 

Ähnlich wie Math2015 (20)

BigData & Supply Chain: A "Small" Introduction
BigData & Supply Chain: A "Small" IntroductionBigData & Supply Chain: A "Small" Introduction
BigData & Supply Chain: A "Small" Introduction
 
Machine Learning, Data Mining, and
Machine Learning, Data Mining, and Machine Learning, Data Mining, and
Machine Learning, Data Mining, and
 
Big Data Applications & Analytics Motivation: Big Data and the Cloud; Center...
Big Data Applications & Analytics Motivation: Big Data and the Cloud; Center...Big Data Applications & Analytics Motivation: Big Data and the Cloud; Center...
Big Data Applications & Analytics Motivation: Big Data and the Cloud; Center...
 
Business with Big data
Business with Big dataBusiness with Big data
Business with Big data
 
SWOT of Bigdata Security Using Machine Learning Techniques
SWOT of Bigdata Security Using Machine Learning TechniquesSWOT of Bigdata Security Using Machine Learning Techniques
SWOT of Bigdata Security Using Machine Learning Techniques
 
Big data and Internet
Big data and InternetBig data and Internet
Big data and Internet
 
Data Mining and Knowledge Discovery in Business Databases
Data Mining and Knowledge Discovery in Business DatabasesData Mining and Knowledge Discovery in Business Databases
Data Mining and Knowledge Discovery in Business Databases
 
Big Data, Big Deal: For Future Big Data Scientists
Big Data, Big Deal: For Future Big Data ScientistsBig Data, Big Deal: For Future Big Data Scientists
Big Data, Big Deal: For Future Big Data Scientists
 
Data_Mining.ppt
Data_Mining.pptData_Mining.ppt
Data_Mining.ppt
 
Qu'est ce que le Big Data ? Avec Victoria Galano Data Scientist chez Air France
Qu'est ce que le Big Data ? Avec Victoria Galano Data Scientist chez Air FranceQu'est ce que le Big Data ? Avec Victoria Galano Data Scientist chez Air France
Qu'est ce que le Big Data ? Avec Victoria Galano Data Scientist chez Air France
 
Introduction to big data
Introduction to big dataIntroduction to big data
Introduction to big data
 
Data Science - An emerging Stream of Science with its Spreading Reach & Impact
Data Science - An emerging Stream of Science with its Spreading Reach & ImpactData Science - An emerging Stream of Science with its Spreading Reach & Impact
Data Science - An emerging Stream of Science with its Spreading Reach & Impact
 
isd314-01
isd314-01isd314-01
isd314-01
 
JIMS Rohini IT Flash Monthly Newsletter - October Issue
JIMS Rohini IT Flash Monthly Newsletter  - October IssueJIMS Rohini IT Flash Monthly Newsletter  - October Issue
JIMS Rohini IT Flash Monthly Newsletter - October Issue
 
L18 Big Data and Analytics
L18 Big Data and AnalyticsL18 Big Data and Analytics
L18 Big Data and Analytics
 
Big data using Public Cloud
Big data using Public CloudBig data using Public Cloud
Big data using Public Cloud
 
Data mining and knowledge Discovery
Data mining and knowledge DiscoveryData mining and knowledge Discovery
Data mining and knowledge Discovery
 
A Short History of Big Data
A Short History of Big DataA Short History of Big Data
A Short History of Big Data
 
How to design ai functions to the cloud native infra
How to design ai functions to the cloud native infraHow to design ai functions to the cloud native infra
How to design ai functions to the cloud native infra
 
Carlo Colicchio: Big Data for business
Carlo Colicchio: Big Data for businessCarlo Colicchio: Big Data for business
Carlo Colicchio: Big Data for business
 

Kürzlich hochgeladen

Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 nightCheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 nightDelhi Call girls
 
VidaXL dropshipping via API with DroFx.pptx
VidaXL dropshipping via API with DroFx.pptxVidaXL dropshipping via API with DroFx.pptx
VidaXL dropshipping via API with DroFx.pptxolyaivanovalion
 
Invezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signalsInvezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signalsInvezz1
 
Midocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFxMidocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFxolyaivanovalion
 
CALL ON ➥8923113531 🔝Call Girls Chinhat Lucknow best sexual service Online
CALL ON ➥8923113531 🔝Call Girls Chinhat Lucknow best sexual service OnlineCALL ON ➥8923113531 🔝Call Girls Chinhat Lucknow best sexual service Online
CALL ON ➥8923113531 🔝Call Girls Chinhat Lucknow best sexual service Onlineanilsa9823
 
FESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfFESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfMarinCaroMartnezBerg
 
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% SecureCall me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% SecurePooja Nehwal
 
Week-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interactionWeek-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interactionfulawalesam
 
Log Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptxLog Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptxJohnnyPlasten
 
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...Delhi Call girls
 
Vip Model Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...
Vip Model  Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...Vip Model  Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...
Vip Model Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...shivangimorya083
 
100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptx100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptxAnupama Kate
 
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130Suhani Kapoor
 
Capstone Project on IBM Data Analytics Program
Capstone Project on IBM Data Analytics ProgramCapstone Project on IBM Data Analytics Program
Capstone Project on IBM Data Analytics ProgramMoniSankarHazra
 
BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptx
BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptxBPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptx
BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptxMohammedJunaid861692
 
Introduction-to-Machine-Learning (1).pptx
Introduction-to-Machine-Learning (1).pptxIntroduction-to-Machine-Learning (1).pptx
Introduction-to-Machine-Learning (1).pptxfirstjob4
 
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Callshivangimorya083
 
Generative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and MilvusGenerative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and MilvusTimothy Spann
 
Zuja dropshipping via API with DroFx.pptx
Zuja dropshipping via API with DroFx.pptxZuja dropshipping via API with DroFx.pptx
Zuja dropshipping via API with DroFx.pptxolyaivanovalion
 

Kürzlich hochgeladen (20)

Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 nightCheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
 
VidaXL dropshipping via API with DroFx.pptx
VidaXL dropshipping via API with DroFx.pptxVidaXL dropshipping via API with DroFx.pptx
VidaXL dropshipping via API with DroFx.pptx
 
Invezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signalsInvezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signals
 
Midocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFxMidocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFx
 
CALL ON ➥8923113531 🔝Call Girls Chinhat Lucknow best sexual service Online
CALL ON ➥8923113531 🔝Call Girls Chinhat Lucknow best sexual service OnlineCALL ON ➥8923113531 🔝Call Girls Chinhat Lucknow best sexual service Online
CALL ON ➥8923113531 🔝Call Girls Chinhat Lucknow best sexual service Online
 
Sampling (random) method and Non random.ppt
Sampling (random) method and Non random.pptSampling (random) method and Non random.ppt
Sampling (random) method and Non random.ppt
 
FESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfFESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdf
 
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% SecureCall me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
 
Week-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interactionWeek-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interaction
 
Log Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptxLog Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptx
 
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
 
Vip Model Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...
Vip Model  Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...Vip Model  Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...
Vip Model Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...
 
100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptx100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptx
 
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
 
Capstone Project on IBM Data Analytics Program
Capstone Project on IBM Data Analytics ProgramCapstone Project on IBM Data Analytics Program
Capstone Project on IBM Data Analytics Program
 
BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptx
BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptxBPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptx
BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptx
 
Introduction-to-Machine-Learning (1).pptx
Introduction-to-Machine-Learning (1).pptxIntroduction-to-Machine-Learning (1).pptx
Introduction-to-Machine-Learning (1).pptx
 
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
 
Generative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and MilvusGenerative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and Milvus
 
Zuja dropshipping via API with DroFx.pptx
Zuja dropshipping via API with DroFx.pptxZuja dropshipping via API with DroFx.pptx
Zuja dropshipping via API with DroFx.pptx
 

Math2015

  • 1. UPRM 2015 Statistics, Data Mining, BIg Data,... Edgar Acuña 1 Statistics, Data Mining, Big Data, Data Science: More than a play of names Edgar Acuna Department of Mathematical Science University of Puerto Rico-Mayaguez E-mail: edgar.acuna@upr.edu , eacunaf@gmail.com Website: academic.uprm.edu/eacuna
  • 2. UPRM 2015 Statistics, Data Mining, BIg Data,... Edgar Acuña 2 Outline I. Introduction II. Data Mining III. From Statistics to Data Mining IV. From STAT/DM to Big Data V. From Statistics to Data Science
  • 3. UPRM 2015 Statistics, Data Mining, BIg Data,... Edgar Acuña 3 Introduction The mechanisms for automatic recollection of data and the development of databases technology has made possible that a large amount of data can be available in databases, data warehouses and other information repositories. Nowdays, there is the need to convert this data in knowledge and information.
  • 4. The first hard drive, 1956 UPRM 2015 Statistics, Data Mining, BIg Data,... Edgar Acuña 4 IBM 350, had the size of two refrigerators and about 3.75MB of storage. It weighted over a ton. Approximately price of 50,000 dollars. Today, Seagate sells a 2TB hard drive it weights only .33 pounds. It costs around $100.
  • 5. UPRM 2015 Statistics, Data Mining, BIg Data,... Edgar Acuña 5 Size (in Bytes) of datasets Description Size Storage Media Very small 102 Piece of paper Small 104 Several sheets of paper Medium 106 (megabyte) Floppy Disk Large 109(gigabite) USB/Hard Disk Massive 1012(Terabyte) Hard disk/USB Super-massive 1015(Petabyte) File of distributed data Exabyte(1018), Zettabytes(1021), Yottabytes(1024)
  • 6. The economist, February 2010 UPRM 2015 Statistics, Data Mining, BIg Data,... Edgar Acuña 6
  • 7. UPRM 2015 Statistics, Data Mining, BIg Data,... Edgar Acuña 7 Data Revolution Year Digital Analog Amount 2000 25% 75% 2 Exabytes 2007 93% 7% 300 Exaby 2013 98% 2% 1,200 Exaby Source: Viktor Mayer-Schönberger and Kenneth Cukier: Big Data: A Revolution that will Transform how We Live, Work and Think 2013)
  • 8. UPRM 2015 Statistics, Data Mining, BIg Data,... Edgar Acuña 8 What is Data Mining?  It is the process of extracting valid knowledge/information from a very large dataset. The knowledge is given as patterns and rules that are non-trivial, previously unknown, understandable and with a high potential to be useful.  Other names: Knowledge discovery in databases (KDD), Intelligent Data Analysis, Business Intelligence.  The first paper in Data Mining: Agrawal et al. Mining Association rules, ACM SIGMOD 1993.
  • 9. Areas related to Data Mining Statistics Machine Learning Databases Visualization Data Mining 9UPRM 2015 Statistics, Data Mining, BIg Data,... Edgar Acuña
  • 10. UPRM 2015 Statistics, Data Mining, BIg Data,... Edgar Acuña 10 Contribution of of each area to Data MIning  Statistics (~35% ): Estimation of prediction models. Assume distribution for the features used in the model. Use of sampling  Machine learning: (~30 % ): Part of Artificial Intelligence. More heuristic than Statistics. Small data and complex models  Databases: (~20%):large Scale data, simple queries. The data is maintaned in tables that are accesed quickly.  Visualization: ( ~ 5%).It can be used in either the pre-processing o post- processing step of the KDD process.  Other Areas: ( ~10%): Pattern Recognition, Expert Systems, High Performance Computing.
  • 11. UPRM 2015 Statistics, Data Mining, BIg Data,... Edgar Acuña 11 Data Mining Applications Science: Astronomy, Bioinformatics (Genomics, Proteonomics, Metabolomics), drug discovery. Business: Marketing, credit risk, Security and Fraud detection, Govermment: detection of tax cheaters, anti-terrorism. Text Mining: Discover distinct groups of potential buyers according to a user text based profile. Draw information from different written sources (e-mails). Web mining: Identifying groups of competitors web pages. Recomemder systems(Netflix, Amazon, Ebay)
  • 12. UPRM 2015 Statistics, Data Mining, BIg Data,... Edgar Acuña 12 Data Mining: Type of tasks  Descriptive: General properties of the database are determined. The most important features of the databases are discovered. It includes outlier detection, visualization, clustering and association rules. Predictive: The collected data is used to train a model for making future predictions. Never is 100% accurate and the most important matter is the performance of the model when is applied to future data. It Includes Regression and Classification
  • 13. UPRM 2015 Statistics, Data Mining, BIg Data,... Edgar Acuña 13 Challenges of Data Mining  Scalability  Dimensionality  Complex and Heterogeneous Data.  Quality of Data  Privacy Data  Streaming Data
  • 14. UPRM 2015 Statistics, Data Mining, BIg Data,... Edgar Acuña 14 Software for Data Mining  Open source  R (cran.r-project.org). Related to Statistics (38.5% users, Kdnuggets June 2014).  RapidMiner (http://rapidminer.com). (44.2%) Related to the Database community. **  Weka ( http://www.cs.waikato.ac.nz/ml/weka/ ) (17.0%): Related to Machine Learning. Written in Java.  Python, iPython (19.5%). Hadoop(12.7%).  Comercial: SAS (10.8%), XLMiner(25%), Microsoft SQL(10.5%), KNIME (15.0%), Oracle(2.2%).
  • 15. UPRM 2015 Statistics, Data Mining, BIg Data,... Edgar Acuña 15 A quick comparison of R and iPython Notebook  Dataset: Titanic: Machine learning from disaster ( www.kaggle.com). The goal is to apply the tools of ML to predict which passengers survive the tragedy.  S(Chambers, 1983), R (Ihaka nad Gentleman, 1994), Rstudio (2011).  Python(1989, G. Van Rossum, Netherlands), iPython (2011, F. Perez, UC Berkeley)  Pandas (2012) is a software library written for the Python programming language for data manipulation and analysis.  Rpython(Gil Bellosta, 2014) 
  • 16. UPRM 2015 Statistics, Data Mining, BIg Data,... Edgar Acuña 16 Google Trends for Data Mining
  • 17. From STAT/DM Big Data[1] UPRM 2015 Statistics, Data Mining, BIg Data,... Edgar Acuña 17 February 1977, L. Breiman organized the conference “Analysis of large complex data sets”. Sponsored by ASA and IMS Dallas, USA. 1994 COMPSTAT, Proceedings of Computational Statistics. Part I. Treatment of “huge” Data sets. May 1997. The 29th Symposium on the Interface (Houston, TX) “ Data Mining and the analysis of large data sets”. October 1997, M. Cox y D. Ellsworth used the term “big data” in a conference on visualization organized by the IEEE. 1998, Workshop on Massive data sets. Committee on Applied and Theoretical Statistics. NRC, USA. April 1998, John Massey, chief scientist of SGI, presented a paper “Big data… and the next wave of Infra-stress” in a meeting of the USENIX.
  • 18. From STAT/DM to Big Data[2] UPRM 2015 Statistics, Data Mining, BIg Data,... Edgar Acuña 18 In February 2001, Doug Laney, analyst for the Meta Group, defined data challenges and opportunities as being three-dimensional: increasing volume (amount of data), velocity (speed of data in and out), and variety (range of data types and sources). September 2008, the scientific magazine Nature published a special edition about “big data”. May 2011, Manyika, J. et al. researchers of the McKinsey Global Institute published the report: “Big data: the next frontier for innovation, competition and productivity”.
  • 19. UPRM 2015 Statistics, Data Mining, BIg Data,... Edgar Acuña 19
  • 20. UPRM 2015 Statistics, Data Mining, BIg Data,... Edgar Acuña 20 Big Data: Examples  The Large Hadron Collider (LCH) storages around 25 Petabytes of sensor data per year.  In 2010, the ATT’s database of calling records was of 323 Terabytes.  El 2010, Walmart handled 2.5 Petabytes of transactions.  In 2009, there was 500 exabytes of information on the internet.  In 2011, Google searched in more than 20 billions of web pages. This represents aprox. 400 TB.  In 2013, it was announced that the NSA’s Data Center at Utah will storage up to 5 zettabytes (5,000 exabytes).
  • 21. Big Data: Definition UPRM 2015 Statistics, Data Mining, BIg Data,... Edgar Acuña 21 In 2012, Doug Laney, working now at the Gartner Group, updated its definition of Big Data as follows: "Big data is high volume, high velocity, and/or high variety information assets that require new forms of processing to enable enhanced decision making, insight discovery and process optimization.“ Big data usually includes data sets with sizes beyond the ability of commonly used software tools to capture, manage, and process the data within a tolerable elapsed time. Big data sizes are constantly changing. In 2002, maybe 100GB, in 2012 perhaps from 10TB to many petabytes of data in a single data set.
  • 22. UPRM 2015 Statistics, Data Mining, BIg Data,... Edgar Acuña 22
  • 23. UPRM 2015 Statistics, Data Mining, BIg Data,... Edgar Acuña 23 Google trends for Big data
  • 24. UPRM 2015 Statistics, Data Mining, BIg Data,... Edgar Acuña 24 Big data is … Big data is like teenage sex: everyone talks about it, nobody really knows how to do it, everyone thinks everyone else is doing it, so everyone claims they are doing it... Dan Ariely ( June 2013), Professor of Psychology and Behavioral Economics. Duke University.
  • 25. Single Node Architecture UPRM 2015 Statistics, Data Mining, BIg Data,... Edgar Acuña 25
  • 26. Motivation: Google Example UPRM 2015 Statistics, Data Mining, BIg Data,... Edgar Acuña 26 Google searches in more than 20 billion of webpages x20KB= 400+ TB. One computer reads with a speed of 30-35 MB/sec from disk. It will need aprox 4 months to read the web. It will be necessary aprox 1000 hard drives to read the web It will be necessary even more than that to analyze the data Today a standard architecture for such problems is being used. It consists of -a cluster of commodity Linux nodes -commodity network (ethernet) to connect them
  • 27. Cluster Architecture UPRM 2015 Statistics, Data Mining, BIg Data,... Edgar Acuña 27
  • 28. Challenges in large-scale computing for data mining UPRM 2015 Statistics, Data Mining, BIg Data,... Edgar Acuña 28 How to distribute the computation? How to write down distributed program easily? Machines fail!. One computer may stay up three years (1000 days) If you have 1000 servers , expect to loose 1 per day In 2011, it was estimated that Google has 1 million of computers, so 1000 servers can fail everyday
  • 29. Using R in parallel and distributed computation UPRM 2015 Statistics, Data Mining, BIg Data,... Edgar Acuña 29 RMPI: Package to run R on MPI Snow: Package to run R on MPI Gputools: Package to run R on CUDA Rhadoop: Interface between R y Hadoop. It includes rhbase, rhdfs, rmr. Developed by Revolution Analytics, a company bought by Microsoft in March 2015. Between 2003-2006, Elio Lozano and Edgar Acuna wrote several articles on aplicacion of R in parallel and distributed computation to calculate metaclassifiers, metaclustering, boosting, bagging, kernel density estimation, ensembles and outlier detection.
  • 30. What is Hadoop?  In 2004, J. Dean and S. Ghemawhat wrote a paper explaining Google’s MapReduce, a programming model and a associated infrastucture for storage of large data sets (file system) called Google File System (GFS).  GFS is not open source.  In 2006, Doug Cutting at Yahoo! , created a open source GFS and called it Hadoop Distributed File System (HDFS). In 2009, he left to Cloudera.  The software framework that supports HDFS, MapReduce and other related entities is called the project Hadoop or simply Hadoop.  Hadoop is distributed by the Apache Software Foundation. UPRM 2015 Statistics, Data Mining, BIg Data,... Edgar Acuña 30
  • 31. Hadoop UPRM 2015 Statistics, Data Mining, BIg Data,... Edgar Acuña 31 Hadoop includes: Distributed Files System(HDFS) –distributes data Map/Reduce-distributes application It is written in Java Runs on -Linux, MacOS/X, Windows, and Solaris -Uses commodity hardware
  • 32. Distributed File System UPRM 2015 Statistics, Data Mining, BIg Data,... Edgar Acuña 32
  • 33. HDFS Architecture  Provides • Automatic Parallelization and Distribution • Fault Tolerance • I/O Scheduling • Monitoring and Status Updates UPRM 2015 Statistics, Data Mining, BIg Data,... Edgar Acuña 33
  • 34. MapReduce UPRM 2015 Statistics, Data Mining, BIg Data,... Edgar Acuña 34 Map-Reduce is a programming model for efficient distributed computing It works like a Unix pipeline: -cat imput |grep | sort |unique –c |cat > ouput - Input | Map | Shuffle & Sort |Reduce |Output Efficient because reduces seeks and the use of pipeline
  • 35. The WorkFlow  Load data into the Cluster (HDFS writes)  Analyze the data (MapReduce)  Store results in the Cluster (HDFS)  Read the results from the Cluster (HDFS reads) UPRM 2015 Statistics, Data Mining, BIg Data,... Edgar Acuña 35
  • 36. Hadoop Non-Java Interfaces UPRM 2015 Statistics, Data Mining, BIg Data,... Edgar Acuña 36 Hadoop streaming: C++,Python, perl,ruby. Rhadoop (R and Hadoop), Weka(Mark Hall is working on that), Radoop (Rapidminer and hadoop, comercial) Hadoop Pipes: (C++) It not recommendable
  • 37. Where can I run hadoop? UPRM 2015 Statistics, Data Mining, BIg Data,... Edgar Acuña 37 In your personal computer using the single node hadoop cluster. If you have windows install a virtual machine for running Ubuntu(See Michael Noll’s website) Free: At the Gordon cluster of the SDSC (1024 nodes, each node has 16 cores) and the Stampede cluster of TACC (6400 nodes, each node has 16 cores). Access can be gained through the XSEDE project. Non-Free, but not too expensive Amazon Elastic Compute Cloud ( EC2)
  • 38. Who is using Hadoop? UPRM 2015 Statistics, Data Mining, BIg Data,... Edgar Acuña 38 Yahoo Facebook Amazon Google IBM Netflix Ebay Linkedln Twitter
  • 39. Example of Performance:Hadoop for Feature Selection(Gomez & Acuna,2014) 39 Before Hadoop After Hadoop Time 17 hours 5m23s Language R and c++ Java The Goal is to select the best features for prediction in the Covtype dataset. Covtype has 581,102 instances and 54 predictors and 7 classes. UPRM 2015 Statistics, Data Mining, BIg Data,... Edgar Acuña
  • 40. Mahout UPRM 2015 Statistics, Data Mining, BIg Data,... Edgar Acuña 40 Scalable Machine learning library that runs on Hadoop Includes some algorithms for: Recommendation mining Clustering Classification Finding frequent itemsets Some people thinks that the algorithms included in Mahout are not programmed in a optimal way.
  • 41. UPRM 2015 Statistics, Data Mining, BIg Data,... Edgar Acuña 41 In February 2009, researchers from Google tried to predict the spread of the flu epidemic building a algorithmic model. Google’s algorithm mined five years of web logs, containing hundreds of billions of searches, and created a predictive model utilizing 45 search terms that proved to be a more useful and timely indicator of flu than government statistics with their natural reporting lags.(Viktor Mayer-Schönberger and Kenneth Cukier: Big Data, 2013). During the first two years the model was given acceptable predictions. But, in 2011 the model starts to fail beginning with a 50% up to 92% of error in the predictions. (Science, March 2014). The GFT has over-estimated the prevalence of flu for 100 out of the last 108 weeks; it’s been wrong since August 2011. In May 2011, Google launched the Dengue Google Trends Google Flu Trends: A Failure of Big data
  • 42. UPRM 2015 Statistics, Data Mining, BIg Data,... Edgar Acuña 42 Spark developed at the AMP lab of the UC Berkeley. Programming must be done in Scala, Java, Python and R. It is Free. AMPLAB claims the Spark beats MapReduce on several algorithms. Watson Analytics from IBM (SPSS’s owner). Free for academia. Azure Machine Learning from Microsoft Research, software built on Linux. Free for academia. Accoding to kdnuggets, Azure is better developed than Watson Amazon Web Services. Other computational tools for Big data
  • 43. UPRM 2015 Statistics, Data Mining, BIg Data,... Edgar Acuña 43 Face detection: Azure Machine Learning
  • 44. UPRM 2015 Statistics, Data Mining, BIg Data,... Edgar Acuña 44 -May 2013, White House Data Partners Workshop-0/19 statisticians. -April 2014, National Academy of Science Big Data Workshop-2/16 speakers statisticians. - In 2012, a NSF formed a working group of 100 members on Big data. None statistician were included. -2012, NIH BD2K executive committee composed of 18 members does not include any statistician. More examples in http://simplystatistics.org The absence of statisticians in Big Data activities
  • 45. UPRM 2015 Statistics, Data Mining, BIg Data,... Edgar Acuña 45 Layman’s perception of statisticians A statistician is someone that counts numbers (similar to an accountant). A statistician makes tables and graphs to report and/or summarize information. A statistician computes averages percentages and standard deviations to summarize information. A statistician helps politicians and the government to fool the people (Lies , Dammed lies and Statistics). More details in J. Wu’s presentation(1997).
  • 46. UPRM 2015 Statistics, Data Mining, BIg Data,... Edgar Acuña 46 From Statistics to Data Science[1] John Tukey: “For a long time I have thought I was a statistician interested in inferences… I have come to feel that my central interest is in data analysis “. (Annals of Mathematical Statistics, 1962). In 1977, Tukey published “Exploratory Data Analysis”. The data is analyzed to summarize theur main characteristics using mostly visual methods:Boxplots, stem-and-leaf, Parallel coordinates, Multidimensional Scaling, etc. In 1996, the International Federation of Classification Societies (IFCS) held a conference titled “Data Science,Classification and Related Methods”.
  • 47. UPRM 2015 Statistics, Data Mining, BIg Data,... Edgar Acuña 47 From Statistics to Data Science [2] J. Wu (1997). In his acceptance lecture as a chairman of the Statistics Department at University of Michigan proposed to change the name from Statistics to Data Science. Prof. Wu also proposed to rename statisticians as data scientists [But Stigler’s law will apply here]. Stigler's law of eponymy". Transactions of the New York Academy of Sciences 39: 147–58(1980). Wu also proposed that statisticians should get involved with large and complex data.
  • 48. UPRM 2015 Statistics, Data Mining, BIg Data,... Edgar Acuña 48 From Statistics to Data Science[3] J. H. Friedman proposed that Statistics should embrace Data Mining as a sub-discipline. “The role of the statistics in the data revolution”. Stanford University Tech Report (2000).  In 2001 appeared the Breiman’s paper. “Statistical Modelling: The two cultures”. He states; “There are two cultures in the use of statistical modeling to reach conclusions from data. One assumes that the data are generated by a given stochastic data model. The other uses algorithmic models and treats the data mechanism as unknown.
  • 49. UPRM 2015 Statistics, Data Mining, BIg Data,... Edgar Acuña 49 From Statistics to Data Science[4] The statistical community has been committed to the almost exclusive use of data models. This commitment has led to irrelevant theory, questionable conclusions, and has kept statisticians from working on a large range of interesting current problems. Algorithmic modeling can be used both on large complex data sets and as a more accurate and informative alternative to data modeling on smaller data sets. If our goal as a field is to use data to solve problems, then we need to move away from exclusive dependence on data models and adopt a more diverse set of tools.
  • 50. UPRM 2015 Statistics, Data Mining, BIg Data,... Edgar Acuña 50 From Statistics to Data Science[5] W. Cleveland (2001). Proposed the creation of the fiel of Data Science, similar to multi-disciplinary statistics. ”Data Science: An Action Plan for Expanding the Technical Areas of the Field of Statistics”. In 2002 starts the publication of the “Data Science Journal” and the “Journal of Data Science”. En el 2008, D.J.Patil (LinkedIn) y Jeff Hamerbacher (Facebook) introduced the term “data scientist”.(Stigler’s law was applied here).
  • 51. UPRM 2015 Statistics, Data Mining, BIg Data,... Edgar Acuña 51 Data Science as a discipline In January 2009, Hail Varian, chief economist of Google stated “I keep saying the sexy job in the next ten years will be statisticians..” (The McKinsey Quarterly). From 2009 through 2012, several professionals from computer science, social networks and business schools began introducing the idea that Statistics according to Varian was not the traditional one, but a new field called “Data Science”. T.H Davenport and J.H. Patil (October 2012, Harvard Business Review) “Data Scientist: The sexiest job of the 21st century”
  • 52. UPRM 2015 Statistics, Data Mining, BIg Data,... Edgar Acuña 52 Statistician versus Data Scientist (D. Smith Revolution Analytics, 2013) Statistician Data Scientist Imagen Baseball/Football HBR sexiest job of the 21st century Works Solo In a team Data Prepared, clean Distributed, Messy, Unstructured Data Size Kilobytes Gigabytes Tools SAS, Mainframe R, Python, Hadoop, Linux Focus Inference(why) Prediction(what) Latency weeks seconds Output Report Data Product/data App
  • 53. UPRM 2015 Statistics, Data Mining, BIg Data,... Edgar Acuña 53 Hillary Mason (MS in CS, bit.ly-Accel Partners) John Myles White (PhD Psychology, Facebook) Peter Skomoroch (BS in Mathematics, Linkedin-DataWrangling) Gregory Piatesky (PhD in CS, Kdnuggets) D.J Patil (PhD Math, Linkedin- White House) Jeff Hammerbacher(BS Mathematics, Facebook-Cloudera y Columbia University) David Smith(PhD Statistics, Revolution Analytics) Christopher D. Long (MS in Math/STAT, JMI Equity)* Peter Norvig (PhD in CS, Google) Ben Lorica (PhD Math, O’Reilly Media) Top 10 Data Scientist
  • 54. UPRM 2015 Statistics, Data Mining, BIg Data,... Edgar Acuña 54 Undergraduate: University of Rochester, University of San Francisco, College of Charleston, Northern Kentucky University, Case Western University (more than a dozen in USA). Warwick University (UK). Master: Carnegie Mellon University, Stanford University, Columbia University, Indiana University, NYU, SMU, Virginia. More than 30 in USA, Canada, UK, Spain, France, Germany and the rest of the world. Latinoamerica: Only Mexico (ITAM, since 2013). Next August in Peru. Doctorate: Edimburgh University (UK), Columbia. Less than one dozen. Academic Programs in Data Science
  • 55. UPRM 2015 Statistics, Data Mining, BIg Data,... Edgar Acuña 55 Core Courses: Theory of Statistics Computational Statistics Data Mining Machine Learning Databases Systems Data Science Capstone (Data Science Practicum) Three elective courses: Algorithms, High Performance Computing, Scientific Visualization, Regression, Big data analytics , a business course, a biology course, a chemistry course, a marine science course, etc. Sample Curriculum of a MS Program in Data Science
  • 56. UPRM 2015 Statistics, Data Mining, BIg Data,... Edgar Acuña 56 Data Science segun Google Trends
  • 57. UPRM 2015 Statistics, Data Mining, BIg Data,... Edgar Acuña 57 Comparacion de Data Mining, Big Data, Bioinformatica y Data Science segun Google Trends Segun Google Trends
  • 58. UPRM 2015 Statistics, Data Mining, BIg Data,... Edgar Acuña 58 … And soon, in a two- to five-year span, people will say, “The whole big- data thing came and went. It died. It was wrong.” I am predicting that. (Prof. Michael Jordan, EECS/STAT UC Berkeley in a interview for IEEE Spectrum, October 20, 2014) Talking at Gartner’s Business Intelligence and Information Management summit in São Paulo, Feinberg said “The term ‘big data’ is going to disappear in the next two years, to become just ‘data’ or ‘any data.’ But, of course, the analysis of data will continue.” (dataconomy.com, May 2014) On the Future of Big Data
  • 59. UPRM 2015 Statistics, Data Mining, BIg Data,... Edgar Acuña 59 Is Data Science same as Statistics using more computer power? Some people, particularly statisticians consider it so. Is Data Science same as multidisciplinary modern applied Statistics? Most of the people consider that this is the case. Is Data Science the field to analyze only Big data? Some people, particularly from the CS community consider it so. Only two of the top ten Business School (Sloan at MIT and Kellog at NU) do data science. Kellog has a statistics department inside its business school. But some business schools are trying to rename “business statistics” as “data science”. On the Future Data Science