“Big Data” and the Challenges for Statisticians
1. “Big Data” and the Challenges for Statisticians
Setia Pramana
Math Department, FMIPA Brawijaya University
Malang, 7 February 2014
2. Data Explosion
• Interactions of billions of people using computers, GPS devices, cell phones, and medical devices.
• Online or mobile financial transactions, social media traffic, and GPS coordinates.
• “In the next five years, we’ll generate more data as humankind than we generated in the previous 5,000 years.” Eron Kelly, GM, Microsoft
6. Big Data
• Volume
• Velocity
• Variety
http://www.datasciencecentral.com/forum/topics/the-3vs-that-define-big-data
7. Big Data
• Volume
• Velocity
• Variety
http://www.documentcapture.co.uk
A challenge: managing its size
Storing, searching, analyzing, comparing, refining, combining, and
visualizing.
8. Big Data
• Veracity: biases, noise and abnormality in data.
• Validity: is the data correct and accurate for the intended use?
• Volatility: how long is data valid and how long should it be
stored?
12. Connecting the Data
• After Haiti’s earthquake (2010), researchers at the Karolinska Institute and Columbia University showed that mobile data patterns could be used to understand the movement of refugees and the consequent health risks posed by these movements.
13. Big Data in Biomedicine
Where does the big data come from?
• Billions of measurements in the health system: physician diagnoses, drug dispensing, blood tests, X-rays or CT scans, etc.
• Advanced molecular technologies: microarrays, next-generation sequencing
15. Microarray
• Measures the expression of thousands of genes under different conditions.
• Thousands of variables -> special statistical methods are needed
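One reason thousands of variables call for special methods is multiple testing: if every gene is tested at the 5% level, many genes will look significant by chance alone. The slides do not name a specific method, so the sketch below uses the Benjamini-Hochberg false discovery rate procedure as one commonly used example. The function name and the p-values are made up for illustration.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return a list of booleans, True where a hypothesis is rejected
    while controlling the false discovery rate at level alpha."""
    m = len(p_values)
    # Indices of the p-values, sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k (1-based) with p_(k) <= (k / m) * alpha.
    largest_k = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            largest_k = rank
    # Reject the hypotheses with the largest_k smallest p-values.
    rejected = [False] * m
    for idx in order[:largest_k]:
        rejected[idx] = True
    return rejected


# Hypothetical p-values for six genes (illustrative, not real data).
p_values = [0.001, 0.008, 0.039, 0.041, 0.042, 0.60]

# Naive approach: test each gene separately at the 5% level.
naive = [p <= 0.05 for p in p_values]

# FDR-controlled approach.
bh = benjamini_hochberg(p_values, alpha=0.05)

print("naive rejections:", sum(naive))
print("Benjamini-Hochberg rejections:", sum(bh))
```

With these toy values the naive rule flags five genes, while the corrected procedure keeps only the two with the strongest evidence. On a real microarray the same gap would be spread over thousands of genes.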
21. Relating Several Data Repositories
Disease Gene Expression DB
Drug Gene Expression DB
• New drug discovery
• Drug repositioning, e.g., Viagra (unexpected)
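As a rough illustration of how two such repositories might be related, the sketch below ranks drugs by how strongly their expression signature is anti-correlated with a disease signature, on the assumption that a drug reversing the disease pattern is a repositioning candidate. The slides do not describe the actual method. The gene names, drug names, values, and the use of Pearson correlation are all assumptions made for this example.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


# Hypothetical log fold changes, disease vs. healthy (made-up values).
disease_signature = {"GENE_A": 2.0, "GENE_B": 1.5, "GENE_C": -1.0, "GENE_D": -2.5}

# Hypothetical log fold changes, treated vs. untreated (made-up values).
drug_signatures = {
    "drug_1": {"GENE_A": -1.8, "GENE_B": -1.2, "GENE_C": 0.9, "GENE_D": 2.0},
    "drug_2": {"GENE_A": 1.9, "GENE_B": 1.0, "GENE_C": -0.5, "GENE_D": -2.0},
    "drug_3": {"GENE_A": 0.1, "GENE_B": -0.2, "GENE_C": 0.3, "GENE_D": 0.0},
}


def rank_candidates(disease, drugs):
    """Sort drugs from most anti-correlated to most correlated with the
    disease signature. Assumes every drug signature covers the same genes."""
    genes = sorted(disease)
    disease_values = [disease[g] for g in genes]
    scores = {
        name: pearson(disease_values, [signature[g] for g in genes])
        for name, signature in drugs.items()
    }
    return sorted(scores.items(), key=lambda item: item[1])


ranking = rank_candidates(disease_signature, drug_signatures)
for name, score in ranking:
    print(f"{name}: {score:.2f}")
```

In this toy data drug_1 comes out on top because it pushes every gene in the opposite direction to the disease, while drug_2 mimics the disease pattern and ranks last.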
22. More...
• Genome Project
• Next-generation sequencing, e.g., whole-genome sequencing: information on our 3-billion-bp DNA code
• And many more...
23. Data Science
• A multidisciplinary science: statistics, math, computer science, machine learning, data munging/cleaning, data visualization, and domain expertise.
http://drewconway.com
Data Scientist: The Sexiest Job of the 21st Century
27. Statisticians Should...
• Have a strong foundation in statistical theory, methods, and software.
• Be expert in R and Python.
• Be familiar with data visualization and machine learning techniques.
• Know about parallel computing, combining data from disparate sources, and handling textual and streaming data.
• Engage with the real world.
• Be more innovative.
29. R
• The continued rapid growth in add-on packages.
• The near monopoly R has on the latest analytic methods.
• Its free price.
• The freedom to teach with real-world examples from outside organizations.
Previously it was difficult to get data, especially for a skripsi (undergraduate thesis).
Researchers from the two organisations obtained data on the outflow of people from Port-au-Prince following the earthquake by tracking the movement of nearly two million SIM cards in the country. They were able to accurately analyse the destination of over 600,000 people displaced from Port-au-Prince, and they made this information available to government and humanitarian organisations dealing with the crisis.