Dr.A.Kathirvel
Professor/IT
Mother Teresa Women’s University
Kodaikanal
Data Mining and Big Data Challenges and
Research Opportunities
10 Challenging Problems in Data Mining
Research
1. Developing a Unifying Theory of Data
Mining
• The current state of the art in data-mining research is too “ad hoc”
– techniques are designed for
individual problems
– no unifying theory
• Needs unifying research
– Exploration vs explanation
• Long standing theoretical issues
– How to avoid spurious
correlations?
• Deep research
– Knowledge discovery on hidden
causes?
– Similar to discovery of Newton’s
Law?
An Example (from tutorial slides by Andrew Moore):
• VC dimension. If you've got a learning
algorithm in one hand and a dataset in
the other hand, to what extent can you
decide whether the learning algorithm is
in danger of overfitting or underfitting?
– formal analysis into the fascinating
question of how overfitting can
happen,
– estimating how well an algorithm will
perform on future data, based solely
on its training-set error,
– a property (VC dimension) of the
learning algorithm. VC-dimension
thus gives an alternative to cross-
validation, called Structural Risk
Minimization (SRM), for choosing
classifiers.
– Compare: CV, SRM, AIC and BIC.
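The cross-validation alternative mentioned above can be sketched in a few lines; the toy dataset and the k-NN model family here are invented for illustration:

```python
import random

random.seed(0)
# Toy 1-D binary classification: class is 1 when x > 0.5, with 10% label noise.
xs = [random.random() for _ in range(60)]
data = [(x, int(x > 0.5) if random.random() > 0.1 else 1 - int(x > 0.5))
        for x in xs]

def knn_predict(train, x, k):
    # Majority vote among the k nearest training points.
    neighbors = sorted(train, key=lambda p: abs(p[0] - x))[:k]
    return int(sum(label for _, label in neighbors) * 2 >= k)

def cv_error(k, folds=5):
    # k-fold cross-validation: hold out each fold in turn, train on the rest.
    fold_size = len(data) // folds
    errors = 0
    for f in range(folds):
        test = data[f * fold_size:(f + 1) * fold_size]
        train = data[:f * fold_size] + data[(f + 1) * fold_size:]
        errors += sum(knn_predict(train, x, k) != y for x, y in test)
    return errors / (folds * fold_size)

# Choose the model complexity (here, k) with the lowest estimated error.
best_k = min([1, 3, 5, 7], key=cv_error)
print("chosen k:", best_k)
```

SRM would instead bound the generalization error analytically via the VC dimension, rather than re-running the model on held-out folds as above.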
2. Scaling Up for High Dimensional Data
and High Speed Streams
• Scaling up is needed
– ultra-high dimensional
classification problems
(millions or billions of features,
e.g., bio data)
– Ultra-high speed data streams
• Streams
– continuous, online process
– e.g. how to monitor network
packets for intruders?
– concept drift and environment
drift?
– RFID network and sensor
network data
Excerpt from Jian Pei’s Tutorial
http://www.cs.sfu.ca/~jpei/
3. Sequential and Time Series Data
• How to efficiently and accurately
cluster, classify and predict the
trends?
• Time series data used for
predictions are contaminated by
noise
– How to do accurate short-term
and long-term predictions?
– Signal processing techniques
introduce lags in the filtered data,
which reduces accuracy
– Key in source selection, domain
knowledge in rules, and
optimization methods
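The lag problem noted above is easy to demonstrate. This is a toy sketch (a clean sine signal and a causal moving-average filter), not a production signal-processing setup:

```python
import math

# A clean periodic signal; in practice this would be a noisy sensor series.
signal = [math.sin(2 * math.pi * t / 50) for t in range(100)]

def moving_average(xs, window):
    # Causal filter: each output uses only current and past samples,
    # so features of the signal are delayed by about (window - 1) / 2 steps.
    out = []
    for i in range(len(xs)):
        lo = max(0, i - window + 1)
        out.append(sum(xs[lo:i + 1]) / (i + 1 - lo))
    return out

smoothed = moving_average(signal, window=11)

# The filtered peak arrives several steps after the true peak: that is the lag.
true_peak = max(range(50), key=lambda i: signal[i])
smooth_peak = max(range(50), key=lambda i: smoothed[i])
lag = smooth_peak - true_peak
print("filter delays the peak by", lag, "steps")
```

A short-term predictor fed the smoothed series effectively sees the past, which is exactly the accuracy loss the slide describes.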
Real time-series data obtained from wireless sensors in the Hong Kong UST CS department hallway
4. Mining Complex Knowledge from
Complex Data
• Mining graphs
• Data that are not i.i.d. (independent and identically distributed)
– many objects are not independent of each other, and are not of a single type.
– mine the rich structure of relations among objects,
– E.g.: interlinked Web pages, social networks, metabolic networks in the cell
• Integration of data mining and knowledge inference
– The biggest gap: unable to relate the results of mining to the real-world
decisions they affect - all they can do is hand the results back to the user.
• More research on interestingness of knowledge
[Figure: a multi-relational graph linking Author, Title (Paper 1), Conference Name, and Citation (Paper 2)]
5. Data Mining in a Network Setting
• Community and Social Networks
– Linked data between emails, Web
pages, blogs, citations, sequences
and people
– Static and dynamic structural
behavior
• Mining in and for Computer
Networks
– detect anomalies (e.g., sudden
traffic spikes due to a DoS
(Denial of Service) attack)
– Need to handle 10-Gigabit Ethernet
links: (a) detect, (b) trace back,
(c) drop packets
Picture from Matthew Pirretti’s slides, Penn State
An example of packet streams (data courtesy of NCSA, UIUC)
6. Distributed Data Mining and Mining
Multi-agent Data
• Need to correlate the data seen at
the various probes (such as in a
sensor network)
• Adversary data mining:
deliberately manipulate the data to
sabotage them (e.g., make them
produce false negatives)
• Game theory may be needed for
help
• Games
[Figure: a two-player matching-pennies game tree; Player 1 (the miner) and Player 2 each choose action H or T, and each outcome carries a payoff of (1,-1) or (-1,1)]
7. Data Mining for Biological and
Environmental Problems
• New problems raise new
questions
• Large scale problems especially
so
– Biological data mining, such as
HIV vaccine design
– DNA, chemical properties, 3D
structures, and functional
properties all need to be fused
– Environmental data mining
– Mining for solving the energy
crisis
8. Data-mining-Process Related Problems
• How to automate the mining process?
– the composition of data mining
operations
– Data cleaning, with logging
capabilities
– Visualization and mining
automation
• Need a methodology: help users
avoid many data mining mistakes
– What is a canonical set of data
mining operations?
[Figure: pipeline of mining operations: Sampling → Feature Selection → Mining → …]
9. Security, Privacy and Data Integrity
• How to ensure users’ privacy
while their data are being mined?
• How to do data mining for
protection of security and privacy?
• Knowledge integrity assessment
– Data are intentionally modified
from their original version, in
order to misinform the recipients
or for privacy and security
– Development of measures to
evaluate the knowledge integrity
of a collection of
• Data
• Knowledge and patterns
http://www.cdt.org/privacy/
Headlines (Nov 21 2005)
Senate Panel Approves Data Security
Bill - The Senate Judiciary Committee on
Thursday passed legislation designed to
protect consumers against data security
failures by, among other things, requiring
companies to notify consumers when their
personal information has been
compromised. While several other
committees in both the House and Senate
have their own versions of data security
legislation, S. 1789 breaks new ground by
including provisions permitting consumers
to access their personal files …
10. Dealing with Non-static, Unbalanced
and Cost-sensitive Data
• The UCI datasets are small and
not highly unbalanced
• Real-world data are large (10^5
features), but the useful (positive)
class makes up < 1%
• There is much information on
costs and benefits, but no overall
model of profit and loss
• Data may evolve with a bias
introduced by sampling
• Each test incurs a cost
• Data extremely unbalanced
• Data change with time
[Figure: a cost-sensitive diagnosis example: tests such as temperature (39°C), pressure, blood test, cardiogram and essay each incur a cost, with unknown (?) outcomes]
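The accuracy trap with such unbalanced, cost-sensitive data can be sketched directly. The costs below (100 for a missed positive, 1 for a false alarm) are invented for illustration:

```python
# With <1% positives, a classifier that always predicts "negative"
# looks 99% accurate but is useless; cost-sensitive evaluation
# weighs each error by its cost instead.
labels = [1] * 10 + [0] * 990          # 1% positive class
always_negative = [0] * len(labels)

accuracy = sum(p == y for p, y in zip(always_negative, labels)) / len(labels)

# Assumed costs, invented for illustration: a missed positive (false
# negative) costs 100, a false alarm (false positive) costs 1.
COST_FN, COST_FP = 100, 1

def total_cost(preds, ys):
    return sum(COST_FN if (y == 1 and p == 0) else
               COST_FP if (y == 0 and p == 1) else 0
               for p, y in zip(preds, ys))

print("accuracy of 'always negative':", accuracy)             # 0.99
print("its total cost:", total_cost(always_negative, labels))  # 1000

# A model that catches all positives at the price of 50 false alarms
# is far cheaper despite lower raw accuracy.
catches_positives = [1] * 10 + [1] * 50 + [0] * 940
print("cost of the less 'accurate' model:",
      total_cost(catches_positives, labels))                   # 50
```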
New Challenges
• Privacy-preserving data mining
• Data mining over compartmentalized databases
Inducing Classifiers over Privacy
Preserved Numeric Data
[Figure: each client passes its record (age, salary) through a Randomizer before sharing, e.g. Alice’s age 30 becomes 65; the server reconstructs the age and salary distributions from the randomized values and feeds them to a decision-tree algorithm to produce the model]
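The randomization idea pictured above can be sketched as follows: each record is perturbed with zero-mean noise before it leaves the client, yet aggregate statistics remain estimable. This simplifies the real reconstruction step (the published schemes recover the full distribution, not just the mean):

```python
import random

random.seed(42)
true_ages = [random.randint(20, 60) for _ in range(10000)]

# Randomizer: add uniform noise in [-15, 15].
# The server only ever sees the randomized values.
randomized = [a + random.randint(-15, 15) for a in true_ages]

true_mean = sum(true_ages) / len(true_ages)
est_mean = sum(randomized) / len(randomized)   # noise has zero mean
print(round(true_mean, 1), round(est_mean, 1))
```

No individual randomized value reveals the true age, but the estimated mean lands very close to the true one once enough clients contribute.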
Other Recent Work
• Cryptographic approach to privacy-preserving data mining
– Lindell & Pinkas, Crypto 2000
• Privacy-Preserving discovery of association rules
– Vaidya & Clifton, KDD2002
– Evfimievski et al., KDD 2002
– Rizvi & Haritsa, VLDB 2002
[Figure: a “Frequent Traveler” rating model built over compartmentalized databases: credit agencies, criminal records, demographic data, birth and marriage records, phone and email data, held at state and local levels]
Computation over Compartmentalized
Databases
• Randomized data shipping
• Local computations followed by combination of partial models
• On-demand secure data shipping and data composition
Some Hard Problems
• Past may be a poor predictor of future
– Abrupt changes
– Wrong training examples
• Actionable patterns (principled use of domain knowledge?)
• Over-fitting vs. not missing the rare nuggets
• Richer patterns
• Simultaneous mining over multiple data types
• When to use which algorithm?
• Automatic, data-dependent selection of algorithm parameters
Discussion
• Should data mining be viewed as “rich” querying and “deeply” integrated
with database systems?
– Most current work makes little use of database functionality
• Should analytics be an integral concern of database systems?
• Issues in data mining over heterogeneous data repositories (Relationship to
the heterogeneous systems discussion)
Summary
• Developing a Unifying Theory of Data Mining
• Scaling Up for High Dimensional Data/High Speed Streams
• Mining Sequence Data and Time Series Data
• Mining Complex Knowledge from Complex Data
• Data Mining in a Network Setting
• Distributed Data Mining and Mining Multi-agent Data
• Data Mining for Biological and Environmental Problems
• Data-Mining-Process Related Problems
• Security, Privacy and Data Integrity
• Dealing with Non-static, Unbalanced and Cost-sensitive Data
Summary
• Data mining has shown promise but needs much further research
We stand on the brink of great new answers, but even more, of great new
questions -- Matt Ridley
Introduction to Big Data
What are we going to understand
• What is Big Data?
• How did we end up here?
• To whom does it matter?
• Where is the money?
• Are we ready to handle it?
• What are the concerns?
• Tools and Technologies
– Is Big Data <=> Hadoop?
Simple to start
• What is the maximum file size you have dealt with so far?
– Movies/Files/Streaming video that you have used?
– What have you observed?
• What is the maximum download speed you get?
• Simple computation
– How much time would it take just to transfer it?
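That back-of-envelope computation looks like this, with an illustrative 1 TB dataset and a 100 Mbit/s link (both numbers assumed):

```python
# Assumed numbers, for illustration only.
file_size_tb = 1          # dataset size in terabytes
link_mbps = 100           # download speed in megabits per second

size_bits = file_size_tb * 10**12 * 8         # TB -> bits (decimal units)
seconds = size_bits / (link_mbps * 10**6)     # bits / (bits per second)
print(f"{seconds / 3600:.1f} hours just to transfer")  # ~22.2 hours
```

Nearly a full day just to move the bytes, before any computation starts: this is why Big Data systems move the computation to the data instead.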
What is big data?
• “Every day, we create 2.5 quintillion bytes of data — so much that 90%
of the data in the world today has been created in the last two years
alone. This data comes from everywhere: sensors used to gather climate
information, posts to social media sites, digital pictures and videos,
purchase transaction records, and cell phone GPS signals to name a
few.”
This data is “big data.”
Huge amount of data
• There are huge volumes of data in the world:
– From the beginning of recorded time until 2003, we created 5 billion gigabytes (5 exabytes) of data
– In 2011, the same amount was created every two days
– In 2013, the same amount was created every 10 minutes
Big data spans three dimensions: Volume, Velocity and
Variety
• Volume: Enterprises are awash with ever-growing data of all types, easily amassing
terabytes—even petabytes—of information.
– Turn 12 terabytes of Tweets created each day into improved product sentiment
analysis
– Convert 350 billion annual meter readings to better predict power consumption
• Velocity: Sometimes 2 minutes is too late. For time-sensitive processes such as catching
fraud, big data must be used as it streams into your enterprise in order to maximize its
value.
– Scrutinize 5 million trade events created each day to identify potential fraud
– Analyze 500 million daily call detail records in real-time to predict customer churn
faster
– The latest I have heard is that a 10-nanosecond delay is too much.
• Variety: Big data is any type of data - structured and unstructured data such as text,
sensor data, audio, video, click streams, log files and more. New insights are found when
analyzing these data types together.
– Monitor 100’s of live video feeds from surveillance cameras to target points of
interest
– Exploit the 80% data growth in images, video and documents to improve customer
satisfaction
Finally….
‘Big Data’ is similar to ‘small data’, but bigger
… but having bigger data requires different approaches: techniques, tools, architecture
… with an aim to solve new problems, or old problems in a better way
To whom does it matter?
• Research community
• Business community: new tools, new capabilities, new infrastructure, new
business models, etc.
• Sectors such as financial services
What do the revenues look like?
The Social Layer in an Instrumented
Interconnected World
• 2+ billion people on the Web by end of 2011
• 30 billion RFID tags today (1.3B in 2005)
• 4.6 billion camera phones worldwide
• 100s of millions of GPS-enabled devices sold annually
• 76 million smart meters in 2009; 200M by 2014
• 12+ TB of tweet data every day
• 25+ TB of log data every day
• ? TB of other data every day
What does Big Data trigger?
• From “Big Data and the Web: Algorithms for Data Intensive Scalable Computing”, Ph.D. Thesis, Gianmarco
BIG DATA is not just HADOOP
• Manage & store huge volumes of any data: Hadoop File System, MapReduce
• Manage streaming data: Stream Computing
• Analyze unstructured data: Text Analytics Engine
• Structure and control data: Data Warehousing
• Integrate and govern all data sources: Integration, Data Quality, Security, Lifecycle Management, MDM
• Understand and navigate federated big data sources: Federated Discovery and Navigation
Types of tools typically used in a Big Data scenario
• Where is the processing hosted?
– Distributed servers / cloud
• Where is the data stored?
– Distributed storage (e.g., Amazon S3)
• What is the programming model?
– Distributed processing (e.g., MapReduce)
• How is the data stored and indexed?
– High-performance, schema-free databases
• What operations are performed on the data?
– Analytic/semantic processing (e.g., RDF/OWL)
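The distributed programming model named above (MapReduce) can be sketched in-process: map emits key-value pairs, a shuffle groups them by key, and reduce aggregates each group. Real frameworks run these phases across many machines; the documents here are toy data:

```python
from collections import defaultdict

def map_phase(doc):
    # Map: emit (word, 1) for every word in the document.
    for word in doc.split():
        yield (word.lower(), 1)

def shuffle(pairs):
    # Shuffle: group all emitted values by key.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    # Reduce: aggregate each key's values (here, sum the counts).
    return (key, sum(values))

docs = ["Big Data is not just Hadoop", "Hadoop is one tool"]
pairs = [p for d in docs for p in map_phase(d)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
print(counts["hadoop"], counts["is"])  # 2 2
```

Because map and reduce are pure per-key functions, the framework can partition documents and keys across nodes without changing the program.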
When is dealing with Big Data hard?
• When the operations on the data are complex:
– e.g., simple counting is not a complex problem
– Modeling and reasoning with data of different kinds can get extremely complex
• Good news with big-data:
– Often, because of the vast amount of data, modeling techniques can get simpler
(e.g., smart counting can replace complex model-based analytics)…
– …as long as we deal with the scale.
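One concrete example of such “smart counting” is the Misra-Gries sketch, a standard one-pass streaming algorithm that finds frequent items in bounded memory; the stream below is toy data:

```python
def misra_gries(stream, k):
    # Keeps at most k-1 counters, whatever the stream length.
    counters = {}
    for item in stream:
        if item in counters:
            counters[item] += 1
        elif len(counters) < k - 1:
            counters[item] = 1
        else:
            # Decrement every counter; drop any that reach zero.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    # Guaranteed superset of the items occurring more than n/k times.
    return counters

stream = ["a"] * 50 + ["b"] * 30 + list("cdefghij")
candidates = misra_gries(stream, k=3)
print(sorted(candidates))  # the heavy hitters 'a' and 'b' survive
```

No model is fitted at all: a memory-bounded counting pass extracts the frequent items, which is often all the analysis actually needs at scale.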
Why Big-Data?
• Key enablers for the appearance and growth of ‘Big-Data’ are:
– Increase in storage capabilities
– Increase in processing power
– Availability of data
BIG DATA: CHALLENGES AND
OPPORTUNITIES
WHAT IS BIG DATA?
• Big Data is a popular term used to denote the exponential growth
and availability of data, both structured and unstructured.
Very important! Why?
More data leads to better analysis enabling better decisions!
Definition of BIG DATA
• As far back as 2001, industry analyst Doug Laney (currently with Gartner)
articulated the now mainstream definition of big data as three Vs:
Volume,
Velocity and
Variety.
Let us look at each one of these items.
Definition of BIG DATA: Volume
• Factors that contribute to increase in volume of data:
– Transactions-based data stored through the years.
– Unstructured data streaming in from social media
– Ever increasing amounts of sensor and machine-to-machine data being collected
Definition of BIG DATA: Volume..
• Previously, data storage was a big issue; today it is not. But other issues
emerge, such as:
• How to determine relevance within large data volumes?
• How to use analytics to create value from relevant data.
Definition of BIG DATA: Velocity
• Data is streaming in at unprecedented speed and must be dealt with in a
timely manner.
• RFID tags, sensors and smart metering are driving the need to deal with
torrents of data in near-real time.
• Reacting quickly enough to deal with data velocity is a challenge for most
organizations.
Definition of BIG DATA: Variety
• Data today comes in all types of formats. Structured, numeric data in
traditional databases.
• Information created from line-of-business applications; Unstructured text
documents, email, video, audio, stock ticker data and financial transactions.
• Managing, merging and governing different varieties of data is something
many organizations still grapple with.
Definition of BIG DATA: Others
• Variability:
In addition to the increasing velocities and varieties of data, data flows
can be highly inconsistent with periodic peaks.
• Is something trending in social media? Daily, seasonal and event-triggered
peak data loads can be challenging to manage.
• Even more so when unstructured data is involved.
Definition of BIG DATA: Others
• Complexity:
Today's data comes from diverse sources.
• It is still an undertaking to link, match, cleanse and transform data across
systems.
• It is necessary to connect and correlate relationships, hierarchies and
multiple data linkages or our data can quickly spiral out of control.
• PRIVACY: Associated Problems
BIG DATA : BIG PROBLEMS
• The problems start right away during data acquisition, when the
data tsunami requires us to make decisions, currently in an ad hoc
manner, about:
• what data to keep?
• what to discard?
• how to store?
• how to reliably keep what we store, with the right metadata?
BIG DATA : BIG PROBLEMS (2)
• Much data today is not in structured format;
• for example, tweets and blogs are weakly structured pieces of text, while
images and video are structured for storage and display, but not for
semantic content and search.
• transforming such content into a structured format for later analysis is a
major challenge.
BIG DATA : BIG PROBLEMS (3)
• The value of data explodes when it can be linked with other data,
thus data integration is a major creator of value.
• Since most data is directly generated in digital format today, we
have the opportunity and the challenge both to influence the
creation to facilitate later linkage and to automatically link
previously created data.
• Data analysis, organization, retrieval, and modeling are other
foundational challenges. Data analysis is a clear bottleneck in many
applications, both due to lack of scalability of the underlying
algorithms and due to the complexity of the data that needs to be
analyzed.
BIG DATA : BIG PROBLEMS (4)
• Presentation of the results and its interpretation by non-technical domain
experts is crucial to extracting actionable knowledge.
What has been Achieved…
• During the last 35 years, data management principles such as:
• physical and logical independence,
• declarative querying and
• cost-based optimization have enabled the first round of business
intelligence applications and laid the foundation for managing and
analyzing Big Data today
What is to be done?
• The many novel challenges and opportunities associated with Big Data
necessitate rethinking many aspects of these data management platforms,
while retaining other desirable aspects.
• Appropriate research & investment in Big Data will lead to a new wave of
fundamental technological advances that will be embodied in the next
generations of Big Data management and analysis platforms, products and
systems.
Big Data: Opportunity
• In a broad range of application areas, data is being collected at
unprecedented scale.
• Decisions that previously were based on guesswork, or on painstakingly
constructed models of reality, can now be made based on the data itself.
• Such Big Data analysis now drives nearly every aspect of our modern
society, including mobile services, retail, manufacturing, financial
services, life sciences, and physical sciences.
Big Data: Opportunity
• Scientific research has been revolutionized by Big Data.
• The Sloan Digital Sky Survey has today become a central resource for
astronomers the world over.
• The field of Astronomy is being transformed from one where taking
pictures of the sky was a large part of an astronomer’s job to one where the
pictures are all in a database already and the astronomer’s task is to find
interesting objects and phenomena in the database.
Big Data: Opportunity
• In the biological sciences, there is now a well-established tradition of
depositing scientific data into a public repository, and also of creating
public databases for use by other scientists.
• There is an entire discipline of bioinformatics that is largely devoted to the
curation and analysis of such data.
• As technology advances, particularly with the advent of Next Generation
Sequencing, the size and number of experimental data sets available is
increasing exponentially.
Big Data: Opportunities
• Big Data has the potential to revolutionize not just research, but also
education.
• A recent detailed quantitative comparison of different approaches taken by
35 charter schools in NYC has found that one of the top five policies
correlated with measurable academic effectiveness was the use of data to
guide instruction .
• Imagine a world in which we have access to a huge database where we
collect every detailed measure of every student's academic performance
Big Data: Opportunities
• This data could be used to design the most effective approaches to
education, starting from reading, writing, and math, to advanced, college-
level, courses.
• We are far from having access to such data, but there are powerful trends
in this direction.
• In particular, there is a strong trend for massive Web deployment of
educational activities, and this will generate an increasingly large amount of
detailed data about students' performance
Big Data: Opportunities: Health Care
• It is widely believed that the use of information technology:
• can reduce the cost of healthcare
• improve its quality by making care more preventive and personalized
and basing it on more extensive (home-based) continuous monitoring
Big Data: Opportunities:Others
• Effective use of Big Data for urban planning (through fusion of high-
fidelity geographical data)
• Intelligent transportation (through analysis and visualization of live and
detailed road network data)
• Environmental modeling (through sensor networks ubiquitously
collecting data)
• Energy saving (through unveiling patterns of use)
• Smart materials (through the new materials genome initiative),
• Computational social sciences
Big Data: Opportunities:Others
• financial systemic risk analysis (through integrated analysis of a web of
contracts to find dependencies between financial entities)
• homeland security (through analysis of social networks and financial
transactions of possible terrorists)
• computer security (through analysis of logged information and other
events, known as Security Information and Event Management (SIEM))
Challenges
• The sheer size of the data, Data Volume of course, is a major challenge,
and is the one that is most easily recognized.
• However, there are others. Industry analysis companies like to point out that
there are challenges not just in Volume, but also in Variety and Velocity.
• While these three are important, this short list fails to include additional
important requirements such as privacy and usability.
THE BIG DATA PIPELINE
The FIVE PHASES
• There are five distinct phases in handling Big Data
• Acquisition/Recording
• Extraction, Cleaning/Annotation
• Integration/Aggregation/Representation
• Analysis/Modeling
• Interpretation
Data Acquisition and Recording(1/3)
• Big Data does not arise out of a vacuum: it is recorded from some data
generating source.
• Scientific experiments and simulations can easily produce petabytes of data
today.
• Much of this data is of no interest, and it can be filtered and compressed
by orders of magnitude.
• One challenge is to define these filters in such a way that they do not
discard useful information
Data Acquisition and Recording(2/3)
• The second big challenge is to automatically generate the right metadata
to describe what data is recorded and how it is recorded and measured.
• For example, in scientific experiments, considerable detail regarding
specific experimental conditions and procedures may be required to be
able to interpret the results correctly, and it is important that such metadata
be recorded with observational data. Metadata acquisition systems can
minimize the human burden in recording metadata.
Data Acquisition and Recording(3/3)
• Another important issue here is data provenance.
• Recording information about the data at its birth is not useful unless this
information can be interpreted and carried along through the data analysis
pipeline.
• For example, a processing error at one step can render subsequent analysis
useless; with suitable provenance, we can easily identify all subsequent
processing that depends on this step.
• Thus we need research both into generating suitable metadata and into data
systems that carry the provenance of metadata through data analysis
pipelines.
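A minimal sketch of carrying provenance through a pipeline (the step names and record format are invented): each step appends itself to the record's history, so the downstream products of a faulty step can be identified later:

```python
def run_step(record, name, fn):
    # Apply a transformation and record which step produced the result.
    value = fn(record["value"])
    provenance = record["provenance"] + [name]
    return {"value": value, "provenance": provenance}

record = {"value": " 42 mmHg ", "provenance": ["sensor-17"]}
record = run_step(record, "strip-whitespace", str.strip)
record = run_step(record, "parse-number", lambda s: int(s.split()[0]))

print(record)
# If "parse-number" is later found to be buggy, every record whose
# provenance contains that step can be located and re-derived.
```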
Provenance: meaning
• The place of origin or earliest known history of something.
• The history of the ownership of an object, especially when documented or
authenticated.
• Example sentence: items with a known provenance were excluded from
further scrutiny.
Information Extraction and Cleaning (1/4)
• Frequently, the information collected will not be in a format ready
for analysis.
• For example, consider the collection of electronic health records in
a hospital, comprising:
– transcribed dictations from several physicians,
– structured data from sensors,
– measurements (possibly with some associated uncertainty)
Information Extraction and Cleaning (2/4)
• Image data such as X-rays.
• We cannot leave the data in this form and still effectively analyze it.
• We require an information extraction process that pulls out the required
information from the underlying sources and expresses it in a structured
form suitable for analysis.
Information Extraction and Cleaning (3/4)
• Doing this correctly and completely is a continuing technical challenge.
• Note that this data also includes images and will in the future include
video; such extraction is often highly application dependent (e.g., what
you want to pull out of an MRI is very different from what you would pull
out of a picture of the stars, or a surveillance photo).
• In addition, due to the ubiquity of surveillance cameras and popularity of
GPS-enabled mobile phones, cameras, and other portable devices, rich and
high fidelity location and trajectory (i.e., movement in space) data can
also be extracted.
Information Extraction and Cleaning (4/4)
• We are used to thinking of Big Data as always telling us the truth, but this
is actually far from reality.
• For example, patients may choose to hide risky behavior and care givers
may sometimes mis-diagnose a condition; patients may also inaccurately
recall the name of a drug or even that they ever took it, leading to missing
information in (the history portion of) their medical record.
• Existing work on data cleaning assumes well-recognized constraints on
valid data or well-understood error models; for many emerging Big Data
domains these do not exist.
Data Integration, Aggregation, and
Representation(1/4)
• Data analysis is considerably more challenging than simply locating,
identifying, understanding, and citing data.
• For effective large-scale analysis all of this has to happen in a completely
automated manner. This requires differences in data structure and
semantics to be expressed in forms that are computer understandable, and
then “robotically” resolvable.
• There is a strong body of work in data integration that can provide some of
the answers. However, considerable additional work is required to achieve
automated error-free difference resolution.
Data Integration, Aggregation, and
Representation (2/4)
• Even for simpler analyses that depend on only one data set, there remains
an important question of suitable database design.
• Usually, there will be many alternative ways in which to store the same
information.
• Certain designs will have advantages over others for certain purposes,
and possibly drawbacks for other purposes.
Data Integration, Aggregation, and
Representation(3/4)
• Witness, for instance, the tremendous variety in the structure of
bioinformatics databases with information regarding substantially similar
entities, such as genes.
• Database design is today an art, and is carefully executed in the enterprise
context by highly-paid professionals.
• We must enable other professionals, such as domain scientists, to create
effective database designs, either through devising tools to assist them in
the design process or through forgoing the design process completely and
developing techniques so that databases can be used effectively in the
absence of intelligent database design.
Query Processing, Data Modeling, and
Analysis(1/6)
• Methods for querying and mining Big Data are fundamentally different
from traditional statistical analysis on small samples.
• Big Data is often noisy, dynamic, heterogeneous, inter-related and
untrustworthy.
• Nevertheless, even noisy Big Data could be more valuable than tiny
samples because general statistics obtained from frequent patterns and
correlation analysis usually overpower individual fluctuations and often
disclose more reliable hidden patterns and knowledge.
Query Processing, Data Modeling, and
Analysis (2/6)
• Further, interconnected Big Data forms large heterogeneous information
networks, with which information redundancy can be explored to
compensate for :
– missing data,
– to crosscheck conflicting cases,
– to validate trustworthy relationships,
– to disclose inherent clusters, and
– to uncover hidden relationships and models
Query Processing, Data Modeling, and
Analysis (3/6)
• Mining requires
Integrated, cleaned, trustworthy, and efficiently accessible data, declarative
query and mining interfaces, scalable mining algorithms, and big-data
computing environments.
• At the same time, data mining itself can also be used to help improve the
quality and trustworthiness of the data, understand its semantics, and
provide intelligent querying functions.
Query Processing, Data Modeling, and
Analysis (4/6)
• Real-life medical records have errors,
are heterogeneous, and are distributed across multiple systems.
• The value of Big Data analysis in health care, to take just one example
application domain, can only be realized if it can be applied robustly under
these difficult conditions. On the flip side, knowledge developed from data
can help in correcting errors and removing ambiguity.
Query Processing, Data Modeling, and
Analysis (5/6)
• For example, a physician may write “DVT” as the diagnosis for a patient.
This abbreviation is commonly used for both “deep vein thrombosis” and
“diverticulitis,” two very different medical conditions.
• A knowledge base constructed from related data can use associated
symptoms or medications to determine which of the two the physician meant.
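A toy sketch of that disambiguation (the symptom lists below are simplified illustrations, not clinical data): pick the expansion whose known symptoms overlap most with the patient's chart:

```python
# A tiny stand-in for the knowledge base described above.
KNOWLEDGE_BASE = {
    "deep vein thrombosis": {"leg swelling", "leg pain", "warm skin"},
    "diverticulitis": {"abdominal pain", "fever", "nausea"},
}

def expand_abbreviation(candidates, patient_symptoms):
    # Pick the condition whose known symptoms overlap most with the chart.
    return max(candidates,
               key=lambda c: len(KNOWLEDGE_BASE[c] & patient_symptoms))

chart = {"abdominal pain", "fever"}
meaning = expand_abbreviation(
    ["deep vein thrombosis", "diverticulitis"], chart)
print(meaning)  # diverticulitis
```

A production system would score many more signals (medications, lab results, context), but the principle is the same: related data resolves the ambiguity.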
Query Processing, Data Modeling, and
Analysis (6/6)
• A problem with current Big Data analysis is the lack of coordination
between database systems, which host the data and provide SQL querying,
with analytics packages that perform various forms of non-SQL processing,
such as data mining and statistical analyses.
• Today’s analysts are impeded by a tedious process of exporting data from
the database, performing a non-SQL process and bringing the data back.
This is an obstacle to carrying over the interactive elegance of the first
generation of SQL-driven OLAP systems into the data mining type of
analysis that is in increasing demand.
• A tight coupling between declarative query languages and the functions
of such packages will benefit both expressiveness and performance of the
analysis.
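The round-trip problem can be illustrated with an embedded database: here sqlite3 performs the SQL part and a non-SQL statistic runs in the same process, instead of exporting data, analyzing it elsewhere, and re-importing. The table and values are invented:

```python
import sqlite3
import statistics

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE trades (symbol TEXT, price REAL)")
conn.executemany("INSERT INTO trades VALUES (?, ?)",
                 [("X", 10.0), ("X", 12.0), ("X", 50.0), ("Y", 8.0)])

# SQL does the filtering and grouping it is good at...
prices = [p for (p,) in
          conn.execute("SELECT price FROM trades WHERE symbol = 'X'")]

# ...then a non-SQL analysis runs on the result without leaving the
# process: flag prices more than one standard deviation from the mean.
mu, sigma = statistics.mean(prices), statistics.stdev(prices)
outliers = [p for p in prices if abs(p - mu) > sigma]
print(outliers)
```

The tighter the coupling, the less data crosses the boundary; at Big Data scale, that export/import boundary is precisely the bottleneck the slide describes.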
Interpretation (1/5)
• Having the ability to analyze Big Data is of limited value if users cannot
understand the analysis. Ultimately, a decision-maker, provided with the
result of analysis, has to interpret these results.
• This interpretation cannot happen in a vacuum.
• it involves examining all the assumptions made and retracing the
analysis.
Interpretation (2/5)
• Further, there are many possible sources of error: computer systems can
have bugs, models almost always have assumptions, and results can be
based on erroneous data.
• For all of these reasons, no responsible user will cede authority to the
computer system.
• Analysts will try to understand, and verify, the results produced by the
computer.
• The computer system must make it easy for her to do so. This is particularly
a challenge with Big Data due to its complexity. There are often crucial
assumptions behind the data recorded.
Interpretation (3/5)
• There are often crucial assumptions behind the data recorded.
• Analytical pipelines can often involve multiple steps, again with
assumptions built in.
• The recent mortgage-related shock to the financial system dramatically
underscored the need for such decision-maker diligence -- rather than
accept the stated solvency of a financial institution at face value, a
decision-maker has to critically examine the many assumptions made at
multiple stages of the analysis.
83
Interpretation (4/5)
• In short, it is rarely enough to provide just the results.
• One must also provide supplementary information that explains how each result
was derived, and upon precisely what inputs it was based.
• Such supplementary information is called the provenance of the (result)
data.
84
Interpretation (5/5)
• Furthermore, with a few clicks the user should be able to drill down into each
piece of data that she sees and understand its provenance, which is a key
feature for understanding the data.
• That is, users need to be able to see not just the results, but also understand
why they are seeing those results.
• However, raw provenance, particularly regarding the phases in the analytics
pipeline, is likely to be too technical for many users to grasp completely.
• One alternative is to enable the users to “play” with the steps in the analysis –
make small changes to the pipeline, for example, or modify values for some
parameters.
85
Challenges in Big Data Analysis
Heterogeneity and Incompleteness (1/5)
• When humans consume information, a great deal of heterogeneity is
comfortably tolerated
• The nuance and richness of natural language can provide valuable depth.
• Machine analysis algorithms expect homogeneous data, and cannot
understand nuance.
• Hence data must be carefully structured as a first step in (or prior to) data
analysis.
• Example: A patient who has multiple medical procedures at a hospital. We
could create one record per medical procedure or laboratory test, one
record for the entire hospital stay, or one record for all lifetime hospital
interactions of this patient. With anything other than the first design, the
number of medical procedures and lab tests per record would be different
for each patient.
86
Challenges in Big Data Analysis
Heterogeneity and Incompleteness (2/5)
• The three design choices listed have successively less structure and,
conversely, successively greater variety.
• Greater structure is likely to be required by many (traditional) data
analysis systems.
• However, the less structured design is likely to be more effective
for many purposes – for example, questions about disease
progression over time require an expensive join operation with
the first two designs, but can be answered directly with the third.
• However, computer systems work most efficiently if they can store
multiple items that are all identical in size and structure. Efficient
representation, access, and analysis of semi-structured data require
further work.
87
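The trade-off between the flat and nested designs discussed above can be sketched in memory with toy records; the field names and values here are hypothetical:

```python
from collections import defaultdict

# Design 1: one record per procedure -- most structure, joins needed later.
per_procedure = [
    {"patient": "p1", "stay": 1, "procedure": "x-ray"},
    {"patient": "p1", "stay": 1, "procedure": "blood test"},
    {"patient": "p1", "stay": 2, "procedure": "mri"},
]

# Design 3: one record per patient -- least structure, variable-length lists.
per_patient = {
    "p1": {"stays": [{"stay": 1, "procedures": ["x-ray", "blood test"]},
                     {"stay": 2, "procedures": ["mri"]}]}
}

# Disease progression with design 1: regroup the flat rows by patient/stay,
# the in-memory analogue of the expensive join mentioned above.
grouped = defaultdict(list)
for rec in per_procedure:
    grouped[(rec["patient"], rec["stay"])].append(rec["procedure"])

# With design 3, the per-stay history is already assembled:
history = [s["procedures"] for s in per_patient["p1"]["stays"]]

assert list(grouped.values()) == history
```

Both designs answer the question; the nested one avoids the regrouping step but gives up the fixed-size records that traditional engines exploit.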
Challenges in Big Data Analysis
Heterogeneity and Incompleteness (3/5)
• Consider an electronic health record database design that has fields for birth
date, occupation, and blood type for each patient.
• What do we do if one or more of these pieces of information is not
provided by a patient?
• Obviously, the health record is still placed in the database, but with the
corresponding attribute values set to NULL.
88
Challenges in Big Data Analysis
Heterogeneity and Incompleteness (4/5)
• A data analysis that looks to classify patients by, say, occupation, must take
into account patients for which this information is not known.
• Worse, these patients with unknown occupations can be ignored in the
analysis only if we have reason to believe that they are otherwise
statistically similar to the patients with known occupation for the analysis
performed.
• For example, if unemployed patients are more likely to hide their
employment status, analysis results may be skewed: they assume a more
employed population mix than actually exists, and hence potentially one
that differs in occupation-related health profiles.
89
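The skew described above can be illustrated with a toy complete-case analysis; the survey values below are hypothetical:

```python
from collections import Counter

# Hypothetical survey; None marks an occupation the patient did not provide.
occupations = ["teacher", "teacher", "driver", None, None]

# Complete-case analysis: silently drop the unknowns.
known = [o for o in occupations if o is not None]
complete_case_share = Counter(known)["teacher"] / len(known)           # 2/3

# If withholding correlates with unemployment, the population share of
# teachers is lower than the complete-case estimate suggests.
population_share = Counter(occupations)["teacher"] / len(occupations)  # 2/5

assert complete_case_share > population_share  # the naive estimate is inflated
```

Dropping the unknowns is only safe when the missing records are statistically similar to the observed ones, which is exactly the assumption the slide warns about.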
Challenges in Big Data Analysis:
Heterogeneity and Incompleteness (5/5)
• Even after data cleaning and error correction, some incompleteness and
some errors in data are likely to remain.
• This incompleteness and these errors must be managed during data
analysis. Doing this correctly is a challenge.
• Recent work on managing probabilistic data suggests one way to make
progress.
90
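One way the probabilistic-data line of work makes progress is to carry alternative values with probabilities through the analysis instead of forcing a single hard guess. A minimal sketch, with toy records and probabilities chosen for illustration:

```python
# Each uncertain attribute is kept as (value, probability) alternatives
# instead of one hard guess (toy blood-type records after imperfect cleaning).
records = [
    [("A", 1.0)],                 # certain
    [("A", 0.7), ("B", 0.3)],     # two plausible readings survived cleaning
    [("B", 0.6), ("O", 0.4)],
]

# Expected count per blood type: residual uncertainty is carried into the
# statistic rather than being silently dropped or hardened into an error.
expected = {}
for alternatives in records:
    for value, prob in alternatives:
        expected[value] = expected.get(value, 0.0) + prob

assert abs(expected["A"] - 1.7) < 1e-9
assert abs(expected["B"] - 0.9) < 1e-9
```

A hard count would report either 2 or 1 patients of type A; the expected count of 1.7 is honest about what the data actually supports.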
Challenges in Big Data Analysis:
VOLUME (Scale)(1/9)
• The first thing anyone thinks of with Big Data is its size. After all, the word
“big” is there in the very name. Managing large and rapidly increasing
volumes of data has been a challenging issue for many decades.
• In the past, this challenge was mitigated by processors getting faster.
• But, there is a fundamental shift underway now: data volume is scaling
faster than compute resources, and CPU speeds are static.
91
Challenges in Big Data Analysis:
VOLUME (2/9)
• First, over the last five years processor technology has made a dramatic
shift -- rather than processors doubling their clock frequency every 18-
24 months, clock speeds have now largely stalled due to power constraints,
and processors are being built with increasing numbers of cores.
• In the past, large data processing systems had to worry only about
parallelism across nodes in a cluster; now, one must also deal with
parallelism within a single node.
92
Challenges in Big Data Analysis:
Volume(3/9)
• Unfortunately, parallel data processing techniques that were applied in the
past for processing data across nodes don’t directly apply to intra-node
parallelism.
• This is because the architecture looks very different; for example, many
more hardware resources, such as processor caches and processor memory
channels, are shared across the cores of a single node.
93
Challenges in Big Data Analysis:
VOLUME(4/9)
• Further, the move towards packing multiple sockets (each with 10s of
cores) adds another level of complexity for intra-node parallelism.
• Finally, with predictions of “dark silicon”, namely that power
considerations will likely prohibit us from continuously using all of the
hardware in the system, data processing systems will likely have to
actively manage the power consumption of the processor.
• These unprecedented changes require us to rethink how we design, build
and operate data processing components.
94
Challenges in Big Data Analysis:
VOLUME(5/9)
• The second dramatic shift that is underway is the move towards cloud
computing.
• Cloud computing aggregates multiple disparate workloads with varying
performance goals into very large clusters.
• Example: interactive services demand that the data processing engine
return an answer within a fixed response-time cap.
95
Challenges in Big Data Analysis:
Volume(6/9)
• This level of resource sharing is expensive.
• Large clusters require new ways of determining:
how to run and execute data processing jobs so that the goals of
each workload are met cost-effectively;
how to deal with system failures, which occur more frequently
as we operate on larger and larger clusters.
96
Challenges in Big Data Analysis
VOLUME(7/9)
• This places a premium on declarative approaches to expressing programs,
even those doing complex machine learning tasks, since global
optimization across multiple users’ programs is necessary for good
overall performance.
• Reliance on user-driven program optimizations is likely to lead to poor
cluster utilization, since users are unaware of other users’ programs.
97
Challenges in Big Data Analysis
VOLUME(8/9)
• A third dramatic shift that is underway is the transformative change of the
traditional I/O subsystem.
• For many decades, hard disk drives (HDDs) were used to store persistent
data.
• HDDs had far slower random IO performance than sequential IO
performance.
• Data processing engines formatted their data and designed their query
processing methods to “work around” this limitation.
98
Challenges in Big Data Analysis:
VOLUME(9/9)
• But, HDDs are increasingly being replaced by solid state drives today.
• Other technologies such as Phase Change Memory are around the corner.
• These newer storage technologies do not have the same large spread in
performance between the sequential and random I/O performance.
• This requires a rethinking of how we design storage subsystems for Big
data processing systems.
99
Challenges in Big Data Analysis:
Timeliness(1/5)
• The flip side of size is speed.
• The larger the data set to be processed, the longer it will take to analyze.
• The design of a system that effectively deals with size is likely also to
result in a system that can process a given size of data set faster.
• However, it is not just this speed that is usually meant when one speaks of
Velocity in the context of Big Data. Rather, there is an acquisition rate
challenge, and a timeliness challenge.
100
Challenges in Big Data Analysis:
Timeliness (2/5)
• There are many situations in which the result of the analysis is required
immediately.
• For example, if a fraudulent credit card transaction is suspected, it should
ideally be flagged before the transaction is completed – preventing the
transaction from taking place at all.
• A full analysis of a user’s purchase history is not feasible in real-time.
• We need to develop partial results in advance so that a small amount of
incremental computation with new data can be used to arrive at a quick
determination.
101
Challenges in Big Data Analysis
Timeliness(3/5)
• Given a large data set, it is often necessary to find elements in it that meet a
specified criterion.
• In the course of data analysis, this sort of search occurs repeatedly.
• Scanning the entire data set to find suitable elements is impractical.
• Instead, index structures must be created in advance to permit finding
qualifying elements quickly.
102
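A minimal sketch of the scan-versus-index contrast, over hypothetical event records:

```python
# Hypothetical event records.
events = [
    {"id": 0, "city": "Chennai"},
    {"id": 1, "city": "Delhi"},
    {"id": 2, "city": "Chennai"},
]

# Built once, in advance: an inverted index from criterion value to ids.
by_city = {}
for e in events:
    by_city.setdefault(e["city"], []).append(e["id"])

# Query time: a full scan touches every record ...
scan = [e["id"] for e in events if e["city"] == "Chennai"]
# ... while the index answers with a single lookup.
indexed = by_city.get("Chennai", [])

assert scan == indexed == [0, 2]
```

As the next slide notes, an index like this supports only one class of criterion (equality on `city`); new criteria need new index structures.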
Challenges in Big Data Analysis
Timeliness(4/5)
• The problem is that each index structure is designed to support only some
classes of criteria.
• With new analyses desired using Big Data, there are new types of criteria
specified, and a need to devise new index structures to support such
criteria.
103
Challenges in Big Data Analysis
Timeliness(5/5)
• For example, consider a traffic management system with information
regarding thousands of vehicles and local hot spots on roadways.
• The system may need to predict potential congestion points along a route
chosen by a user, and suggest alternatives.
• Doing so requires evaluating multiple spatial proximity queries working
with the trajectories of moving objects.
• New index structures are required to support such queries. Designing such
structures becomes particularly challenging when the data volume is
growing rapidly and the queries have tight response time limits.
104
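One classic index for such spatial proximity queries is a uniform grid; a minimal sketch with toy coordinates and a cell size chosen for illustration:

```python
# Uniform-grid index: bucket positions into fixed cells so a proximity
# query inspects only the neighboring cells, not every vehicle.
CELL = 10.0  # cell size chosen for illustration

def cell(x, y):
    return (int(x // CELL), int(y // CELL))

vehicles = {"v1": (3.0, 4.0), "v2": (12.0, 4.0), "v3": (55.0, 60.0)}
grid = {}
for vid, (x, y) in vehicles.items():
    grid.setdefault(cell(x, y), []).append(vid)

def nearby(x, y):
    cx, cy = cell(x, y)  # gather candidates from the 3x3 block of cells
    found = []
    for dx in (-1, 0, 1):
        for dy in (-1, 0, 1):
            found.extend(grid.get((cx + dx, cy + dy), []))
    return sorted(found)

assert nearby(5.0, 5.0) == ["v1", "v2"]  # v3 is in a distant cell
```

The hard part the slide alludes to is keeping such a structure correct and fast while vehicles continuously move and query deadlines are tight.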
Challenges in Big Data Analysis: Privacy
(1/3)
• The privacy of data is another huge concern in the context of Big Data.
• For electronic health records, there are strict laws governing what can and
cannot be done. For other data, regulations are less forceful.
• However, there is great public fear regarding the inappropriate use of
personal data, particularly through the linking of data from multiple sources.
• Managing privacy is effectively both a technical and a sociological
problem, which must be addressed jointly from both perspectives to realize
the promise of big data.
105
Challenges in Big Data Analysis:
Privacy(2/3)
• Consider, for example, data gleaned from location-based services.
• These services require a user to share his/her location with the
service provider, resulting in obvious privacy concerns.
• Note that hiding the user’s identity alone without hiding her location
would not properly address these privacy concerns.
106
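Why hiding identity alone is insufficient can be sketched with a toy pseudonymous trace: the most frequent night-time location typically reveals a home address, which can re-identify the user. The trace below is hypothetical:

```python
from collections import Counter

# A pseudonymous trace of (hour, location cell) pairs (toy data). The id is
# hidden, but the most frequent night-time cell points at a home address.
trace = [
    (1, "cell_home"), (2, "cell_home"), (9, "cell_office"),
    (14, "cell_office"), (23, "cell_home"),
]

night = [loc for hour, loc in trace if hour < 6 or hour >= 22]
likely_home = Counter(night).most_common(1)[0][0]

# Linking this cell to a street address re-identifies the "anonymous" user.
assert likely_home == "cell_home"
```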
Challenges in Big Data Analysis:
Privacy(3/3)
• There are many additional challenging research problems.
• For example, we do not know yet how to share private data while limiting
disclosure and ensuring sufficient data utility in the shared data.
• The existing paradigm of differential privacy is a very important step in
the right direction, but it unfortunately reduces information content too
much to be useful in most practical cases.
107
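The mechanism usually underlying differential privacy for count queries is Laplace noise calibrated to sensitivity/epsilon; a minimal sketch, with a toy query and a fixed random seed for reproducibility:

```python
import math
import random

# Laplace mechanism: answer a count query with zero-mean noise of scale
# sensitivity/epsilon. Smaller epsilon = stronger privacy, noisier answers.
def laplace_noise(scale, rng):
    u = rng.random() - 0.5                      # uniform on [-0.5, 0.5)
    sign = 1.0 if u >= 0 else -1.0
    return -scale * sign * math.log(1.0 - 2.0 * abs(u))

def private_count(true_count, epsilon, rng):
    sensitivity = 1.0  # adding/removing one person changes a count by <= 1
    return true_count + laplace_noise(sensitivity / epsilon, rng)

rng = random.Random(0)  # fixed seed so the sketch is reproducible
answers = [private_count(100, epsilon=0.5, rng=rng) for _ in range(1000)]
avg = sum(answers) / len(answers)

# Each answer is noisy, but the noise is zero-mean, so many answers average
# out near the truth -- the privacy/utility trade-off the slide refers to.
assert abs(avg - 100) < 1.0
```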
Challenges in Big Data Analysis: Human
Collaboration (1/3)
• There remain many patterns that humans can easily detect but computer
algorithms have a hard time finding.
• Ideally, analytics for Big Data will not be all computational – rather, it will
be designed explicitly to have a human in the loop.
• The new sub-field of visual analytics is attempting to do this, at least with
respect to the modeling and analysis phase in the pipeline.
• There is similar value to human input at all stages of the analysis pipeline.
108
Challenges in Big Data Analysis: Human
Collaboration (2/3)
• In today’s complex world, it often takes multiple experts from different
domains to really understand what is going on.
• A Big Data analysis system must support input from multiple human
experts, and shared exploration of results.
• These multiple experts may be separated in space and time when it is too
expensive to assemble an entire team together in one room.
• The data system has to accept this distributed expert input, and support
their collaboration.
109
Challenges in Big Data Analysis: Human
Collaboration (3/3)
• A popular new method of harnessing human ingenuity to solve problems is
through crowd-sourcing.
• Wikipedia, the online encyclopedia, is perhaps the best known example of
crowd-sourced data. We are relying upon information provided by unvetted
strangers.
• Most often, what they say is correct.
• However, we should expect there to be individuals who have other motives
and abilities – some may have a reason to provide false information in an
intentional attempt to mislead.
110
Conclusion(1/2)
• We have entered an era of Big Data.
• Through better analysis of the large volumes of data that are becoming
available, there is the potential for making faster advances in many scientific
disciplines and for improving the profitability and success of many enterprises.
• However, the many technical challenges described in this presentation must
be addressed before this potential can be realized fully.
111
Conclusion (2/2)
• The challenges include not just the obvious issues of scale, but also:
• heterogeneity,
• lack of structure,
• error-handling,
• privacy,
• timeliness,
• provenance,
• and visualization
• at all stages of the analysis pipeline from data acquisition to result
interpretation.
112
References(1/4)
• This presentation is mainly based on the community white paper
“Challenges and Opportunities with Big Data”, developed by
leading researchers across the United States.
113
Thank You!
Dr.A.Kathirvel, Professor & Head
Department of Information Technology
Anand Institute of Higher Technology
Chennai
Email: ayyakathir@gmail.com
Mobile: 90950 59465
117
TEST
• The three main features of big data are:
• V------, V-------,V--------
• Provenance means -----------
• The five distinct phases in handling BIG DATA are ---------,-------,--------
,---------,-----------
• Analytics for Big Data will not be all computational; it must involve -------
in the loop.
118
20CS024 Ethics in Information Technology20CS024 Ethics in Information Technology
20CS024 Ethics in Information Technology
 

Recently uploaded

The Most Attractive Pune Call Girls Manchar 8250192130 Will You Miss This Cha...
The Most Attractive Pune Call Girls Manchar 8250192130 Will You Miss This Cha...The Most Attractive Pune Call Girls Manchar 8250192130 Will You Miss This Cha...
The Most Attractive Pune Call Girls Manchar 8250192130 Will You Miss This Cha...ranjana rawat
 
(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...ranjana rawat
 
Call Girls Pimpri Chinchwad Call Me 7737669865 Budget Friendly No Advance Boo...
Call Girls Pimpri Chinchwad Call Me 7737669865 Budget Friendly No Advance Boo...Call Girls Pimpri Chinchwad Call Me 7737669865 Budget Friendly No Advance Boo...
Call Girls Pimpri Chinchwad Call Me 7737669865 Budget Friendly No Advance Boo...roncy bisnoi
 
Porous Ceramics seminar and technical writing
Porous Ceramics seminar and technical writingPorous Ceramics seminar and technical writing
Porous Ceramics seminar and technical writingrakeshbaidya232001
 
Processing & Properties of Floor and Wall Tiles.pptx
Processing & Properties of Floor and Wall Tiles.pptxProcessing & Properties of Floor and Wall Tiles.pptx
Processing & Properties of Floor and Wall Tiles.pptxpranjaldaimarysona
 
Introduction to Multiple Access Protocol.pptx
Introduction to Multiple Access Protocol.pptxIntroduction to Multiple Access Protocol.pptx
Introduction to Multiple Access Protocol.pptxupamatechverse
 
Call Girls in Nagpur Suman Call 7001035870 Meet With Nagpur Escorts
Call Girls in Nagpur Suman Call 7001035870 Meet With Nagpur EscortsCall Girls in Nagpur Suman Call 7001035870 Meet With Nagpur Escorts
Call Girls in Nagpur Suman Call 7001035870 Meet With Nagpur EscortsCall Girls in Nagpur High Profile
 
UNIT-II FMM-Flow Through Circular Conduits
UNIT-II FMM-Flow Through Circular ConduitsUNIT-II FMM-Flow Through Circular Conduits
UNIT-II FMM-Flow Through Circular Conduitsrknatarajan
 
Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...
Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...
Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...Dr.Costas Sachpazis
 
College Call Girls Nashik Nehal 7001305949 Independent Escort Service Nashik
College Call Girls Nashik Nehal 7001305949 Independent Escort Service NashikCollege Call Girls Nashik Nehal 7001305949 Independent Escort Service Nashik
College Call Girls Nashik Nehal 7001305949 Independent Escort Service NashikCall Girls in Nagpur High Profile
 
Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...
Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...
Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...Dr.Costas Sachpazis
 
SPICE PARK APR2024 ( 6,793 SPICE Models )
SPICE PARK APR2024 ( 6,793 SPICE Models )SPICE PARK APR2024 ( 6,793 SPICE Models )
SPICE PARK APR2024 ( 6,793 SPICE Models )Tsuyoshi Horigome
 
(ANJALI) Dange Chowk Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANJALI) Dange Chowk Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...(ANJALI) Dange Chowk Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANJALI) Dange Chowk Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...ranjana rawat
 
MANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLS
MANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLSMANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLS
MANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLSSIVASHANKAR N
 
Call for Papers - Educational Administration: Theory and Practice, E-ISSN: 21...
Call for Papers - Educational Administration: Theory and Practice, E-ISSN: 21...Call for Papers - Educational Administration: Theory and Practice, E-ISSN: 21...
Call for Papers - Educational Administration: Theory and Practice, E-ISSN: 21...Christo Ananth
 
Software Development Life Cycle By Team Orange (Dept. of Pharmacy)
Software Development Life Cycle By  Team Orange (Dept. of Pharmacy)Software Development Life Cycle By  Team Orange (Dept. of Pharmacy)
Software Development Life Cycle By Team Orange (Dept. of Pharmacy)Suman Mia
 
The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...
The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...
The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...ranjana rawat
 
AKTU Computer Networks notes --- Unit 3.pdf
AKTU Computer Networks notes ---  Unit 3.pdfAKTU Computer Networks notes ---  Unit 3.pdf
AKTU Computer Networks notes --- Unit 3.pdfankushspencer015
 

Recently uploaded (20)

The Most Attractive Pune Call Girls Manchar 8250192130 Will You Miss This Cha...
The Most Attractive Pune Call Girls Manchar 8250192130 Will You Miss This Cha...The Most Attractive Pune Call Girls Manchar 8250192130 Will You Miss This Cha...
The Most Attractive Pune Call Girls Manchar 8250192130 Will You Miss This Cha...
 
(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
 
Call Girls Pimpri Chinchwad Call Me 7737669865 Budget Friendly No Advance Boo...
Call Girls Pimpri Chinchwad Call Me 7737669865 Budget Friendly No Advance Boo...Call Girls Pimpri Chinchwad Call Me 7737669865 Budget Friendly No Advance Boo...
Call Girls Pimpri Chinchwad Call Me 7737669865 Budget Friendly No Advance Boo...
 
Porous Ceramics seminar and technical writing
Porous Ceramics seminar and technical writingPorous Ceramics seminar and technical writing
Porous Ceramics seminar and technical writing
 
Processing & Properties of Floor and Wall Tiles.pptx
Processing & Properties of Floor and Wall Tiles.pptxProcessing & Properties of Floor and Wall Tiles.pptx
Processing & Properties of Floor and Wall Tiles.pptx
 
DJARUM4D - SLOT GACOR ONLINE | SLOT DEMO ONLINE
DJARUM4D - SLOT GACOR ONLINE | SLOT DEMO ONLINEDJARUM4D - SLOT GACOR ONLINE | SLOT DEMO ONLINE
DJARUM4D - SLOT GACOR ONLINE | SLOT DEMO ONLINE
 
Introduction to Multiple Access Protocol.pptx
Introduction to Multiple Access Protocol.pptxIntroduction to Multiple Access Protocol.pptx
Introduction to Multiple Access Protocol.pptx
 
Call Girls in Nagpur Suman Call 7001035870 Meet With Nagpur Escorts
Call Girls in Nagpur Suman Call 7001035870 Meet With Nagpur EscortsCall Girls in Nagpur Suman Call 7001035870 Meet With Nagpur Escorts
Call Girls in Nagpur Suman Call 7001035870 Meet With Nagpur Escorts
 
UNIT-II FMM-Flow Through Circular Conduits
UNIT-II FMM-Flow Through Circular ConduitsUNIT-II FMM-Flow Through Circular Conduits
UNIT-II FMM-Flow Through Circular Conduits
 
Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...
Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...
Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...
 
College Call Girls Nashik Nehal 7001305949 Independent Escort Service Nashik
College Call Girls Nashik Nehal 7001305949 Independent Escort Service NashikCollege Call Girls Nashik Nehal 7001305949 Independent Escort Service Nashik
College Call Girls Nashik Nehal 7001305949 Independent Escort Service Nashik
 
Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...
Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...
Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...
 
Roadmap to Membership of RICS - Pathways and Routes
Roadmap to Membership of RICS - Pathways and RoutesRoadmap to Membership of RICS - Pathways and Routes
Roadmap to Membership of RICS - Pathways and Routes
 
SPICE PARK APR2024 ( 6,793 SPICE Models )
SPICE PARK APR2024 ( 6,793 SPICE Models )SPICE PARK APR2024 ( 6,793 SPICE Models )
SPICE PARK APR2024 ( 6,793 SPICE Models )
 
(ANJALI) Dange Chowk Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANJALI) Dange Chowk Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...(ANJALI) Dange Chowk Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANJALI) Dange Chowk Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
 
MANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLS
MANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLSMANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLS
MANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLS
 
Call for Papers - Educational Administration: Theory and Practice, E-ISSN: 21...
Call for Papers - Educational Administration: Theory and Practice, E-ISSN: 21...Call for Papers - Educational Administration: Theory and Practice, E-ISSN: 21...
Call for Papers - Educational Administration: Theory and Practice, E-ISSN: 21...
 
Software Development Life Cycle By Team Orange (Dept. of Pharmacy)
Software Development Life Cycle By  Team Orange (Dept. of Pharmacy)Software Development Life Cycle By  Team Orange (Dept. of Pharmacy)
Software Development Life Cycle By Team Orange (Dept. of Pharmacy)
 
The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...
The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...
The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...
 
AKTU Computer Networks notes --- Unit 3.pdf
AKTU Computer Networks notes ---  Unit 3.pdfAKTU Computer Networks notes ---  Unit 3.pdf
AKTU Computer Networks notes --- Unit 3.pdf
 

– Signal processing techniques introduce lags in the filtered data, which reduces accuracy
– Keys: source selection, domain knowledge in rules, and optimization methods
(Figure: real time-series data obtained from wireless sensors in a Hong Kong UST CS department hallway)
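The lag that filtering introduces can be seen with even the simplest smoother; a minimal sketch (illustrative only, not from the slides):

```python
def trailing_moving_average(xs, w):
    """Smooth a series with a trailing window of size w.
    The output at time t uses only xs[t-w+1 .. t], so the filtered
    value always trails turning points in the true signal."""
    out = []
    for t in range(len(xs)):
        lo = max(0, t - w + 1)
        window = xs[lo:t + 1]
        out.append(sum(window) / len(window))
    return out

# A noiseless ramp: the filtered value always trails the true value.
ramp = list(range(10))            # 0, 1, 2, ..., 9
smoothed = trailing_moving_average(ramp, 4)
print(ramp[-1], smoothed[-1])     # 9 7.5 : the filter lags the trend
```

This is the trade-off the slide alludes to: a wider window removes more noise but increases the lag, hurting short-term prediction accuracy.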
7
4. Mining Complex Knowledge from Complex Data
• Mining graphs
• Data that are not i.i.d. (independent and identically distributed)
– Many objects are not independent of each other, and are not of a single type
– Mine the rich structure of relations among objects
– E.g.: interlinked Web pages, social networks, metabolic networks in the cell
• Integration of data mining and knowledge inference
– The biggest gap: systems are unable to relate the results of mining to the real-world decisions they affect; all they can do is hand the results back to the user
• More research on the interestingness of knowledge
(Figure: a citation graph linking papers by Author, Title and Conference Name)
8
5. Data Mining in a Network Setting
• Community and social networks
– Linked data between emails, Web pages, blogs, citations, sequences and people
– Static and dynamic structural behavior
• Mining in and for computer networks
– Detect anomalies (e.g., sudden traffic spikes due to DoS (Denial of Service) attacks)
– Need to handle 10 Gigabit Ethernet links: (a) detect, (b) trace back, (c) drop packets
(Picture from Matthew Pirretti’s slides, Penn State; example packet streams courtesy of NCSA, UIUC)
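A toy illustration of the anomaly-detection idea: flag a traffic spike when a packet count far exceeds its recent baseline (the window size and threshold factor here are arbitrary choices, not from the slide):

```python
def spike_alerts(counts, window=5, k=3.0):
    """Flag time steps whose packet count exceeds the running mean
    of the previous `window` steps by more than a factor of k.
    A toy stand-in for DoS spike detection on a traffic stream."""
    alerts = []
    for t in range(window, len(counts)):
        baseline = sum(counts[t - window:t]) / window
        if counts[t] > k * baseline:
            alerts.append(t)
    return alerts

# Steady traffic around 100 packets/s, then a sudden 10x spike at t=8.
traffic = [100, 98, 103, 99, 101, 100, 102, 97, 1000, 104]
print(spike_alerts(traffic))  # [8]
```

Real systems on 10 Gigabit links must do this with sketches and sampling rather than exact per-step statistics, which is exactly the scaling challenge the slide raises.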
9
6. Distributed Data Mining and Mining Multi-agent Data
• Need to correlate the data seen at the various probes (such as in a sensor network)
• Adversarial data mining: adversaries deliberately manipulate the data to sabotage the mining (e.g., make it produce false negatives)
• Game theory may be needed to help
(Figure: a two-player game between Player 1, the miner, and Player 2, with actions H and T and outcomes (-1,1), (1,-1), (1,-1), (-1,1))
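As a sketch of the game-theoretic view, here is the miner's expected payoff in a matching-pennies game (the payoff matrix is the standard textbook one, an assumption, since the slide's matrix is only partially legible):

```python
def expected_payoff(p_heads_miner, p_heads_adversary):
    """Miner's expected payoff in matching pennies: the miner
    wins +1 when the two choices match, loses 1 otherwise."""
    p, q = p_heads_miner, p_heads_adversary
    p_match = p * q + (1 - p) * (1 - q)
    return p_match * 1 + (1 - p_match) * (-1)

# A fixed pure strategy is fully exploitable by the adversary...
print(expected_payoff(1.0, 0.0))   # -1.0: adversary always mismatches
# ...while the mixed strategy p = 0.5 guarantees payoff 0 whatever q is.
print(expected_payoff(0.5, 0.0), expected_payoff(0.5, 1.0))  # 0.0 0.0
```

The point for adversarial mining: a deterministic detector can be gamed, while a randomized strategy bounds the damage an adaptive adversary can cause.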
10
7. Data Mining for Biological and Environmental Problems
• New problems raise new questions; large-scale problems especially so
• Biological data mining, such as HIV vaccine design
– DNA, chemical properties, 3D structures, and functional properties need to be fused
• Environmental data mining
• Mining for solving the energy crisis
11
8. Data-Mining-Process Related Problems
• How to automate the mining process?
– The composition of data mining operations
– Data cleaning, with logging capabilities
– Visualization and mining automation
• Need a methodology to help users avoid common data mining mistakes
– What is a canonical set of data mining operations?
(Figure: a pipeline of operations, Sampling, then Feature Selection, then Mining, …)
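One way to read "composition of data mining operations" is as function composition over a dataset; a hypothetical sketch with toy stages standing in for the slide's Sampling, Feature Selection and Mining steps:

```python
from functools import reduce

def pipeline(*stages):
    """Compose data mining operations left to right, so that
    pipeline(sample, select_features, mine)(data) runs each stage
    in order on the output of the previous one."""
    return lambda data: reduce(lambda d, stage: stage(d), stages, data)

# Toy stages (illustrative, not a real mining library):
sample = lambda rows: rows[::2]                      # keep every other row
select_features = lambda rows: [r[:2] for r in rows] # keep first two columns
mine = lambda rows: len(rows)                        # "mining" = count rows

run = pipeline(sample, select_features, mine)
print(run([[1, 2, 3], [4, 5, 6], [7, 8, 9], [0, 1, 2]]))  # 2
```

Automating the process then becomes a question of which stage compositions are valid and how to log and replay them, which is what a canonical operation set would pin down.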
12
9. Security, Privacy and Data Integrity
• How to ensure users’ privacy while their data are being mined?
• How to do data mining for the protection of security and privacy?
• Knowledge integrity assessment
– Data may be intentionally modified from their original version, in order to misinform the recipients or for privacy and security
– Development of measures to evaluate the knowledge integrity of a collection of data, and of knowledge and patterns
(Sidebar from http://www.cdt.org/privacy/, headline dated Nov 21, 2005: “Senate Panel Approves Data Security Bill - The Senate Judiciary Committee on Thursday passed legislation designed to protect consumers against data security failures by, among other things, requiring companies to notify consumers when their personal information has been compromised. While several other committees in both the House and Senate have their own versions of data security legislation, S. 1789 breaks new ground by including provisions permitting consumers to access their personal files …”)
13
10. Dealing with Non-static, Unbalanced and Cost-sensitive Data
• The UCI datasets are small and not highly unbalanced
• Real-world data are large (10^5 features), yet often less than 1% of the examples belong to the useful (positive) class
• There is much information on costs and benefits, but no overall model of profit and loss
• Data may evolve, with a bias introduced by sampling
• Each test incurs a cost
• Data are extremely unbalanced
• Data change with time
(Example: per-patient tests, with temperature 39°C known and pressure, blood test, cardiogram, essay still unknown)
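Cost-sensitive decisions of the kind the slide describes can be made by comparing expected costs rather than raw probabilities; a minimal sketch with made-up costs:

```python
def expected_cost(p_positive, cost_fp, cost_fn):
    """Expected cost of each action for an example that is positive
    with probability p_positive. Predicting positive risks a false
    positive; predicting negative risks a false negative."""
    cost_if_predict_pos = (1 - p_positive) * cost_fp
    cost_if_predict_neg = p_positive * cost_fn
    return cost_if_predict_pos, cost_if_predict_neg

# With false negatives 20x costlier (e.g., a missed diagnosis),
# it pays to predict positive even at low probability.
pos, neg = expected_cost(p_positive=0.1, cost_fp=1.0, cost_fn=20.0)
print(pos, neg)          # 0.9 vs 2.0 -> predicting positive is cheaper
```

The break-even threshold is cost_fp / (cost_fp + cost_fn), here about 0.048, which is why a 50% decision threshold is the wrong default on highly unbalanced, cost-sensitive data.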
14
New Challenges
• Privacy-preserving data mining
• Data mining over compartmentalized databases
15
Inducing Classifiers over Privacy-Preserved Numeric Data
(Figure: records such as Alice’s “30 | 25K | …” and John’s “50 | 40K | …” each pass through a randomizer before leaving the client; for example, Alice’s age 30 becomes 65 (30 + 35). From randomized records such as “65 | 50K | …” and “35 | 60K | …”, the server reconstructs the age and salary distributions and feeds them to a decision tree algorithm to build the model.)
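A minimal sketch of the randomization idea in the figure, using zero-mean additive noise (a simplification: the actual scheme reconstructs full distributions for tree induction, not just a mean):

```python
import random

def randomize(values, spread=50):
    """Each client adds independent uniform noise in [-spread, spread]
    before its value leaves the machine (30 may become 65, as in the
    slide's example, where the noise draw happened to be +35)."""
    return [v + random.uniform(-spread, spread) for v in values]

# The server never sees true ages, yet because the noise has mean 0
# aggregate properties (here just the mean) are still recoverable.
random.seed(0)
ages = [30, 50, 25, 40, 35, 60] * 500        # many clients
noisy = randomize(ages)
true_mean = sum(ages) / len(ages)
est_mean = sum(noisy) / len(noisy)
print(round(true_mean, 1), round(est_mean, 1))  # the estimates agree closely
```

The design trade-off: a larger spread gives each individual more privacy but requires more records before the reconstructed distribution is accurate enough to induce a good classifier.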
16
Other Recent Work
• Cryptographic approach to privacy-preserving data mining
– Lindell & Pinkas, Crypto 2000
• Privacy-preserving discovery of association rules
– Vaidya & Clifton, KDD 2002
– Evfimievski et al., KDD 2002
– Rizvi & Haritsa, VLDB 2002

17
Computation over Compartmentalized Databases
(Figure: a “Frequent Traveler” rating model built from separate state and local sources: credit agencies, criminal records, demographic, birth, marriage, phone and email databases)
• Randomized data shipping
• Local computations followed by combination of partial models
• On-demand secure data shipping and data composition

18
Some Hard Problems
• The past may be a poor predictor of the future
– Abrupt changes
– Wrong training examples
• Actionable patterns (principled use of domain knowledge?)
• Over-fitting vs. not missing the rare nuggets
• Richer patterns
• Simultaneous mining over multiple data types
• When to use which algorithm?
• Automatic, data-dependent selection of algorithm parameters

19
Discussion
• Should data mining be viewed as “rich” querying and be “deeply” integrated with database systems?
– Most current work makes little use of database functionality
• Should analytics be an integral concern of database systems?
• Issues in data mining over heterogeneous data repositories (relationship to the heterogeneous-systems discussion)
20
Summary
• Developing a unifying theory of data mining
• Scaling up for high-dimensional data and high-speed streams
• Mining sequence data and time series data
• Mining complex knowledge from complex data
• Data mining in a network setting
• Distributed data mining and mining multi-agent data
• Data mining for biological and environmental problems
• Data-mining-process related problems
• Security, privacy and data integrity
• Dealing with non-static, unbalanced and cost-sensitive data

21
Summary
• Data mining has shown promise, but needs much more research

“We stand on the brink of great new answers, but even more, of great new questions” -- Matt Ridley
23
What Are We Going to Understand?
• What is Big Data?
• Why did we end up here?
• To whom does it matter?
• Where is the money?
• Are we ready to handle it?
• What are the concerns?
• Tools and technologies: is Big Data <=> Hadoop?
24
Simple to Start
• What is the maximum file size you have dealt with so far?
– Movies, files, streaming video that you have used?
– What have you observed?
• What is the maximum download speed you get?
• Simple computation: how much time does it take just to transfer the data?
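The "simple computation" invited above takes only a few lines of arithmetic (the link speed and file size are hypothetical):

```python
def transfer_time_seconds(size_gb, bandwidth_mbps):
    """Time to move size_gb gigabytes over a bandwidth_mbps
    megabit-per-second link (1 GB = 8,000 megabits; ignores
    protocol overhead, so this is a lower bound)."""
    megabits = size_gb * 8000
    return megabits / bandwidth_mbps

# Even a modest 1 TB dataset on a fast 100 Mbit/s home link:
hours = transfer_time_seconds(1000, 100) / 3600
print(round(hours, 1))   # 22.2 hours just to move the bytes
```

This back-of-the-envelope result is the point of the exercise: at big-data scale, just shipping the data dominates, which motivates moving computation to the data instead.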
25
What Is Big Data?
• “Every day, we create 2.5 quintillion bytes of data — so much that 90% of the data in the world today has been created in the last two years alone. This data comes from everywhere: sensors used to gather climate information, posts to social media sites, digital pictures and videos, purchase transaction records, and cell phone GPS signals, to name a few. This data is ‘big data.’”

26
Huge Amounts of Data
• There are huge volumes of data in the world:
– From the beginning of recorded time until 2003, we created 5 billion gigabytes (exabytes) of data
– In 2011, the same amount was created every two days
– In 2013, the same amount of data was created every 10 minutes
27
Big Data Spans Three Dimensions: Volume, Velocity and Variety
• Volume: enterprises are awash with ever-growing data of all types, easily amassing terabytes, even petabytes, of information.
– Turn 12 terabytes of tweets created each day into improved product sentiment analysis
– Convert 350 billion annual meter readings to better predict power consumption
• Velocity: sometimes 2 minutes is too late. For time-sensitive processes such as catching fraud, big data must be used as it streams into the enterprise in order to maximize its value.
– Scrutinize 5 million trade events created each day to identify potential fraud
– Analyze 500 million daily call detail records in real time to predict customer churn faster
– The latest I have heard is that a 10-nanosecond delay is too much
• Variety: big data is any type of data, structured and unstructured, such as text, sensor data, audio, video, click streams, log files and more. New insights are found when analyzing these data types together.
– Monitor hundreds of live video feeds from surveillance cameras to target points of interest
– Exploit the 80% data growth in images, video and documents to improve customer satisfaction
28
Finally…
• “Big data” is similar to “small data”, but bigger
• Having bigger data requires different approaches: techniques, tools, architecture…
• …with an aim to solve new problems, or old problems in a better way

29
To Whom Does It Matter?
• The research community
• The business community: new tools, new capabilities, new infrastructure, new business models, etc.
• Sectors such as financial services…
30
How Are Revenues Looking?
(Figure: revenue projections)
31
The Social Layer in an Instrumented, Interconnected World
• 2+ billion people on the Web by end of 2011
• 30 billion RFID tags today (1.3B in 2005)
• 4.6 billion camera phones worldwide
• 100s of millions of GPS-enabled devices sold annually
• 76 million smart meters in 2009; 200M by 2014
• 12+ TB of tweet data every day
• 25+ TB of log data every day
• ? TB of data every day

32
What Does Big Data Trigger?
• From “Big Data and the Web: Algorithms for Data Intensive Scalable Computing”, Ph.D. thesis, Gianmarco
33
Big Data Is Not Just Hadoop
• Manage and store huge volumes of any data: Hadoop File System, MapReduce
• Manage streaming data: stream computing
• Analyze unstructured data: text analytics engine
• Structure and control data: data warehousing
• Integrate and govern all data sources: integration, data quality, security, lifecycle management, MDM
• Understand and navigate federated big data sources: federated discovery and navigation
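The MapReduce model named above can be sketched in-process: map emits key-value pairs, a group-by-key step shuffles them, and reduce folds each group (a single-machine simulation of the programming model, not actual Hadoop usage):

```python
from collections import defaultdict
from itertools import chain

# Word count expressed in the MapReduce style: map emits (key, value)
# pairs, the framework groups by key, and reduce folds each group.
# Real Hadoop distributes these same phases across machines.

def map_phase(line):
    return [(word, 1) for word in line.split()]

def reduce_phase(grouped):
    return {word: sum(counts) for word, counts in grouped.items()}

lines = ["big data is not just hadoop", "big data is big"]
grouped = defaultdict(list)
for word, one in chain.from_iterable(map_phase(l) for l in lines):
    grouped[word].append(one)           # the shuffle/group-by-key step
print(reduce_phase(grouped)["big"])     # 3
```

The value of the model is that map and reduce are pure functions over independent chunks, so the framework can parallelize, retry and rebalance them without changing the program.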
34
Types of Tools Typically Used in a Big Data Scenario
• Where is the processing hosted?
– Distributed servers / cloud
• Where is the data stored?
– Distributed storage (e.g., Amazon S3)
• What is the programming model?
– Distributed processing (MapReduce)
• How is data stored and indexed?
– High-performance, schema-free databases
• What operations are performed on the data?
– Analytic/semantic processing (e.g., RDF/OWL)
35
When Dealing with Big Data Is Hard
• When the operations on the data are complex:
– E.g., simple counting is not a complex problem
– Modeling and reasoning with data of different kinds can get extremely complex
• Good news with big data:
– Often, because of the vast amount of data, modeling techniques can get simpler (e.g., smart counting can replace complex model-based analytics)…
– …as long as we can deal with the scale

36
Why Big Data?
• Key enablers for the appearance and growth of big data are:
– Increase in storage capabilities
– Increase in processing power
– Availability of data
37
Big Data: Challenges and Opportunities
38
What Is Big Data?
• Big data is a popular term used to denote the exponential growth and availability of data, both structured and unstructured.
• Very important! Why? More data leads to better analysis, enabling better decisions!

39
Definition of Big Data
• As far back as 2001, industry analyst Doug Laney (currently with Gartner) articulated the now-mainstream definition of big data as the three Vs: Volume, Velocity and Variety.
• Let us look at each one of these.
40
Definition of Big Data: Volume
• Factors that contribute to the increase in data volume:
– Transaction-based data stored through the years
– Unstructured data streaming in from social media
– Ever-increasing amounts of sensor and machine-to-machine data being collected

41
Definition of Big Data: Volume (continued)
• Previously, data storage was a big issue; today it is not. But other issues emerge, such as:
– How to determine relevance within large data volumes?
– How to use analytics to create value from relevant data?

42
Definition of Big Data: Velocity
• Data is streaming in at unprecedented speed and must be dealt with in a timely manner.
• RFID tags, sensors and smart metering are driving the need to deal with torrents of data in near-real time.
• Reacting quickly enough to deal with data velocity is a challenge for most organizations.

43
Definition of Big Data: Variety
• Data today comes in all types of formats: structured, numeric data in traditional databases; information created by line-of-business applications; unstructured text documents, email, video, audio, stock ticker data and financial transactions.
• Managing, merging and governing different varieties of data is something many organizations still grapple with.

44
Definition of Big Data: Others
• Variability: in addition to the increasing velocities and varieties of data, data flows can be highly inconsistent, with periodic peaks.
• Is something trending in social media? Daily, seasonal and event-triggered peak data loads can be challenging to manage.
• Even more so when unstructured data is involved.

45
Definition of Big Data: Others
• Complexity: today's data comes from diverse sources.
• It is still an undertaking to link, match, cleanse and transform data across systems.
• It is necessary to connect and correlate relationships, hierarchies and multiple data linkages, or the data can quickly spiral out of control.
• Privacy: an associated problem.
46
Big Data: Big Problems
• The problems start right away during data acquisition, when the data tsunami requires us to make decisions, currently in an ad hoc manner, about:
– What data to keep?
– What to discard?
– How to store it?
– How to store what we keep reliably, with the right metadata?

47
Big Data: Big Problems (2)
• Much data today is not in a structured format:
– For example, tweets and blogs are weakly structured pieces of text, while images and video are structured for storage and display, but not for semantic content and search.
• Transforming such content into a structured format for later analysis is a major challenge.

48
Big Data: Big Problems (3)
• The value of data explodes when it can be linked with other data; thus data integration is a major creator of value.
• Since most data is directly generated in digital format today, we have both the opportunity and the challenge to influence the creation to facilitate later linkage and to automatically link previously created data.
• Data analysis, organization, retrieval, and modeling are other foundational challenges. Data analysis is a clear bottleneck in many applications, both due to the lack of scalability of the underlying algorithms and due to the complexity of the data that needs to be analyzed.

49
Big Data: Big Problems (4)
• Presentation of the results and their interpretation by non-technical domain experts is crucial to extracting actionable knowledge.
50
What Has Been Achieved
• During the last 35 years, data management principles such as:
– physical and logical independence,
– declarative querying, and
– cost-based optimization
have enabled the first round of business intelligence applications and laid the foundation for managing and analyzing big data today.

51
What Is to Be Done?
• The many novel challenges and opportunities associated with big data necessitate rethinking many aspects of these data management platforms, while retaining other desirable aspects.
• Appropriate research and investment in big data will lead to a new wave of fundamental technological advances that will be embodied in the next generations of big data management and analysis platforms, products and systems.
  • 52. Big Data: Opportunity • In a broad range of application areas, data is being collected at unprecedented scale. • Decisions that previously were based on guesswork, or on painstakingly constructed models of reality, can now be made based on the data itself. • Such Big Data analysis now drives nearly every aspect of our modern society, including mobile services, retail, manufacturing, financial services, life sciences, and physical sciences. 52
  • 53. Big Data: Opportunity • Scientific research has been revolutionized by Big Data. • The Sloan Digital Sky Survey has today become a central resource for astronomers the world over. • The field of Astronomy is being transformed from one where taking pictures of the sky was a large part of an astronomer’s job to one where the pictures are all in a database already and the astronomer’s task is to find interesting objects and phenomena in the database. 53
  • 54. Big Data: Opportunity • In the biological sciences, there is now a well-established tradition of depositing scientific data into a public repository, and also of creating public databases for use by other scientists. • There is an entire discipline of bioinformatics that is largely devoted to the curation and analysis of such data. • As technology advances, particularly with the advent of Next Generation Sequencing, the size and number of experimental data sets available is increasing exponentially. 54
  • 55. Big Data: Opportunities • Big Data has the potential to revolutionize not just research, but also education. • A recent detailed quantitative comparison of different approaches taken by 35 charter schools in NYC has found that one of the top five policies correlated with measurable academic effectiveness was the use of data to guide instruction . • Imagine a world in which we have access to a huge database where we collect every detailed measure of every student's academic performance 55
  • 56. Big Data: Opportunities • This data could be used to design the most effective approaches to education, from reading, writing, and math to advanced, college-level courses. • We are far from having access to such data, but there are powerful trends in this direction. • In particular, there is a strong trend toward massive Web deployment of educational activities, and this will generate an increasingly large amount of detailed data about students' performance. 56
  • 57. Big Data: Opportunities: Health Care • It is widely believed that the use of information technology: • can reduce the cost of healthcare • can improve its quality, by making care more preventive and personalized and basing it on more extensive (home-based) continuous monitoring 57
  • 58. Big Data: Opportunities: Others • Effective use of Big Data for urban planning (through fusion of high-fidelity geographical data) • Intelligent transportation (through analysis and visualization of live and detailed road network data) • Environmental modeling (through sensor networks ubiquitously collecting data) • Energy saving (through unveiling patterns of use) • Smart materials (through the new Materials Genome Initiative) • Computational social sciences 58
  • 59. Big Data: Opportunities: Others • Financial systemic risk analysis (through integrated analysis of a web of contracts to find dependencies between financial entities) • Homeland security (through analysis of social networks and financial transactions of possible terrorists) • Computer security (through analysis of logged information and other events, known as Security Information and Event Management (SIEM)) 59
  • 60. Challenges • The sheer size of the data, its Volume, is of course a major challenge, and the one most easily recognized. • However, there are others. Industry analysis companies like to point out that there are challenges not just in Volume, but also in Variety and Velocity. • While these three are important, this short list fails to include additional important requirements such as privacy and usability. 60
  • 61. BIG DATA PIPELINE 61
  • 62. The FIVE PHASES • There are five distinct phases in handling Big Data • Acquisition/Recording • Extraction, Cleaning/Annotation • Integration/Aggregation/Representation • Analysis/Modeling • Interpretation 62
  • 63. Data Acquisition and Recording(1/3) • Big Data does not arise out of a vacuum: it is recorded from some data generating source. • Scientific experiments and simulations can easily produce petabytes of data today. • Much of this data is of no interest, and it can be filtered and compressed by orders of magnitude. • One challenge is to define these filters in such a way that they do not discard useful information 63
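The slide's acquisition-time filtering can be sketched as follows. All values and thresholds here are invented for illustration: the filter keeps only readings that deviate from a short running baseline, discarding routine values while retaining the anomalies that are usually the scientifically interesting part.

```python
from collections import deque

def filter_stream(readings, window=5, threshold=3.0):
    """Keep only readings that deviate from a short running baseline.

    A hypothetical acquisition-time filter: routine values are discarded,
    anomalous ones are kept for downstream analysis.
    """
    recent = deque(maxlen=window)
    kept = []
    for r in readings:
        if len(recent) == window:
            baseline = sum(recent) / window
            if abs(r - baseline) > threshold:
                kept.append(r)
                continue  # do not let anomalies pollute the baseline
        recent.append(r)
    return kept

# A mostly flat signal with two spikes: only the spikes survive filtering.
signal = [10.0, 10.1, 9.9, 10.0, 10.1, 25.0, 10.0, 10.1, 9.9, -4.0, 10.0]
print(filter_stream(signal))  # [25.0, -4.0]
```

The challenge the slide names is visible even in this toy: a badly chosen `threshold` or `window` would silently discard useful information.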
  • 64. Data Acquisition and Recording(2/3) • The second big challenge is to automatically generate the right metadata to describe what data is recorded and how it is recorded and measured. • For example, in scientific experiments, considerable detail regarding specific experimental conditions and procedures may be required to be able to interpret the results correctly, and it is important that such metadata be recorded with observational data. Metadata acquisition systems can minimize the human burden in recording metadata. 64
  • 65. Data Acquisition and Recording(3/3)  Another important issue here is data provenance.  Recording information about the data at its birth is not useful unless this information can be interpreted and carried along through the data analysis pipeline.  For example, a processing error at one step can render subsequent analysis useless; with suitable provenance, we can easily identify all subsequent processing that depends on this step.  Thus we need research both into generating suitable metadata and into data systems that carry the provenance of data and its metadata through data analysis pipelines. 65
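A minimal sketch of carrying provenance through a pipeline (the step names and record layout are hypothetical): each record accumulates the list of steps that produced it, so a step later found to be faulty lets us identify every affected result.

```python
def run_step(name, fn, records):
    """Apply one pipeline step and record it in each record's provenance."""
    out = []
    for rec in records:
        new = dict(rec)
        new["value"] = fn(rec["value"])
        new["provenance"] = rec["provenance"] + [name]
        out.append(new)
    return out

# Hypothetical two-step pipeline over two sensor readings.
records = [{"value": 3, "provenance": ["sensor-A"]},
           {"value": 7, "provenance": ["sensor-B"]}]
records = run_step("calibrate", lambda v: v * 1.1, records)
records = run_step("normalize", lambda v: v / 10, records)

# If "calibrate" is later found to be buggy, every result that depended
# on it can be identified from the carried provenance:
affected = [r for r in records if "calibrate" in r["provenance"]]
print(len(affected))  # 2
```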
  • 66. Provenance: meaning • The place of origin or earliest known history of something. • The history of the ownership of an object, especially when documented or authenticated. • Provenance sentence example: • Items with a known provenance were excluded from further scrutiny. 66
  • 67. Information Extraction and Cleaning (1/4)  Frequently, the information collected will not be in a format ready for analysis.  For example, consider the collection of electronic health records in a hospital, comprising: transcribed dictations from several physicians, structured data from sensor measurements (possibly with some associated uncertainty) 67
  • 68. Information Extraction and Cleaning (2/4) • Image data such as x-rays. • We cannot leave the data in this form and still effectively analyze it. • We require an information extraction process that pulls out the required information from the underlying sources and expresses it in a structured form suitable for analysis 68
  • 69. Information Extraction and Cleaning (3/4) • Doing this correctly and completely is a continuing technical challenge. • Note that this data also includes images and will in the future include video; such extraction is often highly application dependent (e.g., what you want to pull out of an MRI is very different from what you would pull out of a picture of the stars, or a surveillance photo). • In addition, due to the ubiquity of surveillance cameras and popularity of GPS-enabled mobile phones, cameras, and other portable devices, rich and high fidelity location and trajectory (i.e., movement in space) data can also be extracted. 69
  • 70. Information Extraction and Cleaning (4/4) • We are used to thinking of Big Data as always telling us the truth, but this is actually far from reality. • For example, patients may choose to hide risky behavior and caregivers may sometimes misdiagnose a condition; patients may also inaccurately recall the name of a drug or even that they ever took it, leading to missing information in (the history portion of) their medical record. • Existing work on data cleaning assumes well-recognized constraints on valid data or well-understood error models; for many emerging Big Data domains these do not exist. 70
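The "well-recognized constraints on valid data" the slide mentions can be made concrete with a toy rule-based cleaner. The rules and records below are invented (not clinical guidance); the point is that explicit constraints let errors be flagged rather than silently accepted.

```python
# Hypothetical validity constraints for a medical-record feed.
RULES = {
    "age":        lambda v: v is not None and 0 <= v <= 120,
    "heart_rate": lambda v: v is not None and 20 <= v <= 250,
}

def clean(records):
    """Split records into valid ones and ones violating a rule."""
    valid, flagged = [], []
    for rec in records:
        problems = [f for f, ok in RULES.items() if not ok(rec.get(f))]
        (flagged if problems else valid).append((rec, problems))
    return valid, flagged

records = [{"age": 54, "heart_rate": 72},
           {"age": 430, "heart_rate": 72},    # transcription error
           {"age": 61, "heart_rate": None}]   # missing measurement
valid, flagged = clean(records)
print(len(valid), len(flagged))  # 1 2
```

For emerging Big Data domains, the slide's point is exactly that such a `RULES` table often does not exist and cannot easily be written down.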
  • 71. Data Integration, Aggregation, and Representation(1/4) • Data analysis is considerably more challenging than simply locating, identifying, understanding, and citing data. • For effective large-scale analysis all of this has to happen in a completely automated manner. This requires differences in data structure and semantics to be expressed in forms that are computer understandable, and then “robotically” resolvable. • There is a strong body of work in data integration that can provide some of the answers. However, considerable additional work is required to achieve automated error-free difference resolution. 71
  • 72. Data Integration, Aggregation, and Representation (2/4) • Even for simpler analyses that depend on only one data set, there remains an important question of suitable database design. • Usually, there will be many alternative ways in which to store the same information. • Certain designs will have advantages over others for certain purposes, and possibly drawbacks for other purposes. 72
  • 73. Data Integration, Aggregation, and Representation(3/4) • Witness, for instance, the tremendous variety in the structure of bioinformatics databases with information regarding substantially similar entities, such as genes. • Database design is today an art, and is carefully executed in the enterprise context by highly paid professionals. 73
  • 74. Data Integration, Aggregation, and Representation (4/4) • We must enable other professionals, such as domain scientists, to create effective database designs, either through devising tools to assist them in the design process, or through forgoing the design process completely and developing techniques so that databases can be used effectively in the absence of intelligent database design. 74
  • 75. Query Processing, Data Modeling, and Analysis(1/6) • Methods for querying and mining Big Data are fundamentally different from traditional statistical analysis on small samples. • Big Data is often noisy, dynamic, heterogeneous, inter-related and untrustworthy. • Nevertheless, even noisy Big Data could be more valuable than tiny samples because general statistics obtained from frequent patterns and correlation analysis usually overpower individual fluctuations and often disclose more reliable hidden patterns and knowledge. 75
  • 76. Query Processing, Data Modeling, and Analysis (2/6) • Further, interconnected Big Data forms large heterogeneous information networks, with which information redundancy can be exploited: – to compensate for missing data, – to crosscheck conflicting cases, – to validate trustworthy relationships, – to disclose inherent clusters, and – to uncover hidden relationships and models 76
  • 77. Query Processing, Data Modeling, and Analysis (3/6) • Mining requires integrated, cleaned, trustworthy, and efficiently accessible data; declarative query and mining interfaces; scalable mining algorithms; and big-data computing environments. • At the same time, data mining itself can also be used to help improve the quality and trustworthiness of the data, understand its semantics, and provide intelligent querying functions. 77
  • 78. Query Processing, Data Modeling, and Analysis (4/6) • Real-life medical records have errors, are heterogeneous, and are distributed across multiple systems. • The value of Big Data analysis in health care, to take just one example application domain, can only be realized if it can be applied robustly under these difficult conditions. On the flip side, knowledge developed from data can help in correcting errors and removing ambiguity. 78
  • 79. Query Processing, Data Modeling, and Analysis (5/6) • For example, a physician may write “DVT” as the diagnosis for a patient. This abbreviation is commonly used for both “deep vein thrombosis” and “diverticulitis,” two very different medical conditions. • A knowledge base constructed from related data can use associated symptoms or medications to determine which of the two the physician meant. 79
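The DVT disambiguation on this slide can be sketched with a toy knowledge base. All entries are illustrative (not clinical data): the expansion whose known associations best overlap the rest of the patient's record wins.

```python
# A toy knowledge base mapping conditions to associated symptoms and
# medications. Entries are invented for illustration only.
KNOWLEDGE = {
    "deep vein thrombosis": {"leg swelling", "leg pain", "warfarin"},
    "diverticulitis":       {"abdominal pain", "fever", "antibiotics"},
}

def expand_abbreviation(candidates, patient_context):
    """Pick the expansion whose known associations best match the record."""
    return max(candidates, key=lambda c: len(KNOWLEDGE[c] & patient_context))

context = {"leg swelling", "warfarin"}  # gleaned from the rest of the record
print(expand_abbreviation(
    ["deep vein thrombosis", "diverticulitis"], context))
# deep vein thrombosis
```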
  • 80. Query Processing, Data Modeling, and Analysis (6/6) • A problem with current Big Data analysis is the lack of coordination between database systems, which host the data and provide SQL querying, and analytics packages that perform various forms of non-SQL processing, such as data mining and statistical analyses. • Today’s analysts are impeded by a tedious process of exporting data from the database, performing a non-SQL process, and bringing the data back. This is an obstacle to carrying over the interactive elegance of the first generation of SQL-driven OLAP systems into the data mining type of analysis that is in increasing demand. • A tight coupling between declarative query languages and the functions of such packages will benefit both the expressiveness and performance of the analysis. 80
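One concrete form such coupling can take, sketched with Python's built-in `sqlite3` module: an analytic function is registered with the database and runs inside the SQL query itself, instead of exporting rows, processing them outside, and bringing them back. The table and data are invented for illustration.

```python
import math
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE readings (sensor TEXT, value REAL)")
conn.executemany("INSERT INTO readings VALUES (?, ?)",
                 [("a", 1.0), ("a", 100.0), ("b", 2.0)])

# Register a Python function so SQL queries can call it directly:
conn.create_function("log10", 1, math.log10)

rows = conn.execute(
    "SELECT sensor, log10(value) FROM readings ORDER BY value").fetchall()
print(rows)
```

This is only a toy stand-in for the declarative-language-plus-analytics-package coupling the slide argues for, but it shows the shape: the analysis travels to the data, not the other way around.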
  • 81. Interpretation (1/5) • Having the ability to analyze Big Data is of limited value if users cannot understand the analysis. Ultimately, a decision-maker, provided with the results of analysis, has to interpret them. • This interpretation cannot happen in a vacuum. • It involves examining all the assumptions made and retracing the analysis. 81
  • 82. Interpretation (2/5) • Further, there are many possible sources of error: computer systems can have bugs, models almost always have assumptions, and results can be based on erroneous data. • For all of these reasons, no responsible user will cede authority to the computer system. • The analyst will try to understand, and verify, the results produced by the computer. • The computer system must make it easy for her to do so. This is particularly a challenge with Big Data due to its complexity. 82
  • 83. Interpretation (3/5) • There are often crucial assumptions behind the data recorded. • Analytical pipelines can often involve multiple steps, again with assumptions built in. • The recent mortgage-related shock to the financial system dramatically underscored the need for such decision-maker diligence: rather than accept the stated solvency of a financial institution at face value, a decision-maker has to critically examine the many assumptions at multiple stages of analysis. 83
  • 84. Interpretation (4/5) • In short, it is rarely enough to provide just the results. • One must provide supplementary information that explains how each result was derived, and based upon precisely what inputs. • Such supplementary information is called the provenance of the (result) data. 84
  • 85. Interpretation (5/5) • Furthermore, with a few clicks the user should be able to drill down into each piece of data that she sees and understand its provenance, which is a key feature to understanding the data. • That is, users need to be able to see not just the results, but also understand why they are seeing those results. • However, raw provenance, particularly regarding the phases in the analytics pipeline, is likely to be too technical for many users to grasp completely. • One alternative is to enable the users to “play” with the steps in the analysis – make small changes to the pipeline, for example, or modify values for some parameters. 85
  • 86. Challenges in Big Data Analysis Heterogeneity and Incompleteness (1/5) • When humans consume information, a great deal of heterogeneity is comfortably tolerated • The nuance and richness of natural language can provide valuable depth. • Machine analysis algorithms expect homogeneous data, and cannot understand nuance. • Hence data must be carefully structured as a first step in (or prior to) data analysis. • Example: A patient who has multiple medical procedures at a hospital. We could create one record per medical procedure or laboratory test, one record for the entire hospital stay, or one record for all lifetime hospital interactions of this patient. With anything other than the first design, the number of medical procedures and lab tests per record would be different for each patient. 86
  • 87. Challenges in Big Data Analysis Heterogeneity and Incompleteness (2/5) • The three design choices listed have successively less structure and, conversely, successively greater variety. • Greater structure is likely to be required by many (traditional) data analysis systems. • However, the less structured design is likely to be more effective for many purposes; for example, questions relating to disease progression over time will require an expensive join operation with the first two designs, but can be avoided with the last. • However, computer systems work most efficiently if they can store multiple items that are all identical in size and structure. Efficient representation, access, and analysis of semi-structured data require further work. 87
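The trade-off between the first and last design choices can be sketched with invented records: the per-procedure design needs a join-like grouping pass to answer a disease-progression question that the per-patient design answers directly from a single record.

```python
from collections import defaultdict

# Design 1: one record per procedure (most structure, greatest uniformity).
per_procedure = [
    {"patient": "p1", "date": "2020-01", "test": "glucose", "value": 95},
    {"patient": "p1", "date": "2021-01", "test": "glucose", "value": 130},
    {"patient": "p2", "date": "2020-06", "test": "glucose", "value": 88},
]

# Design 3: one record per patient (least structure, variable length).
per_patient = {
    "p1": [("2020-01", "glucose", 95), ("2021-01", "glucose", 130)],
    "p2": [("2020-06", "glucose", 88)],
}

# Progression for p1 under design 1 needs a grouping pass over all rows...
history = defaultdict(list)
for rec in per_procedure:
    history[rec["patient"]].append((rec["date"], rec["value"]))
print(sorted(history["p1"]))

# ...while design 3 answers directly from a single record.
print([(d, v) for d, t, v in per_patient["p1"]])
```

Both prints show the same history; the cost of producing it differs, which is the slide's point.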
  • 88. Challenges in Big Data Analysis Heterogeneity and Incompleteness (3/5) • Consider an electronic health record database design that has fields for birth date, occupation, and blood type for each patient. • What do we do if one or more of these pieces of information is not provided by a patient? • Obviously, the health record is still placed in the database, but with the corresponding attribute values being set to NULL. 88
  • 89. Challenges in Big Data Analysis Heterogeneity and Incompleteness (4/5) • A data analysis that looks to classify patients by, say, occupation, must take into account patients for which this information is not known. • Worse, these patients with unknown occupations can be ignored in the analysis only if we have reason to believe that they are otherwise statistically similar to the patients with known occupations for the analysis performed. • For example, if unemployed patients are more likely to hide their employment status, analysis results may be skewed: they would consider a more employed population mix than actually exists, and hence potentially one that has differences in occupation-related health profiles. 89
  • 90. Challenges in Big Data Analysis: Heterogeneity and Incompleteness (5/5) • Even after data cleaning and error correction, some incompleteness and some errors in data are likely to remain. • This incompleteness and these errors must be managed during data analysis. Doing this correctly is a challenge. • Recent work on managing probabilistic data suggests one way to make progress. 90
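The skew described on the previous slide can be reproduced in a few lines. The records are invented: because occupation is missing more often for unemployed patients, simply dropping the unknowns over-counts employment.

```python
# Illustrative records: the two None entries are, by construction,
# unemployed patients who chose not to report their status.
patients = [
    {"occupation": "employed"},   {"occupation": "employed"},
    {"occupation": "employed"},   {"occupation": None},
    {"occupation": None},         {"occupation": "unemployed"},
]

# The naive analysis ignores unknowns, as the slide warns against:
known = [p for p in patients if p["occupation"] is not None]
employed_share = sum(p["occupation"] == "employed" for p in known) / len(known)
print(round(employed_share, 2))  # 0.75, though the true share is 0.5
```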
  • 91. Challenges in Big Data Analysis: VOLUME (Scale)(1/9) • The first thing anyone thinks of with Big Data is its size. After all, the word “big” is there in the very name. Managing large and rapidly increasing volumes of data has been a challenging issue for many decades. • In the past, this challenge was mitigated by processors getting faster. • But, there is a fundamental shift underway now: data volume is scaling faster than compute resources, and CPU speeds are static. 91
  • 92. Challenges in Big Data Analysis: VOLUME (2/9) • First, over the last five years processor technology has made a dramatic shift: rather than processors doubling their clock cycle frequency every 18-24 months, now, due to power constraints, clock speeds have largely stalled and processors are being built with increasing numbers of cores. • In the past, large data processing systems had to worry about parallelism across nodes in a cluster; now, one has to deal with parallelism within a single node. 92
  • 93. Challenges in Big Data Analysis: Volume(3/9) • Unfortunately, parallel data processing techniques that were applied in the past for processing data across nodes don’t directly apply to intra-node parallelism. • This is because the architecture looks very different; for example, there are many more hardware resources, such as processor caches and processor memory channels, that are shared across cores in a single node. 93
  • 94. Challenges in Big Data Analysis: VOLUME(4/9) • Further, the move towards packing multiple sockets (each with tens of cores) adds another level of complexity for intra-node parallelism. • Finally, with predictions of “dark silicon”, namely that power considerations will likely prohibit us from continuously using all of the hardware in the system, data processing systems will likely have to actively manage the power consumption of the processor. • These unprecedented changes require us to rethink how we design, build and operate data processing components. 94
  • 95. Challenges in Big Data Analysis: VOLUME(5/9) • The second dramatic shift that is underway is the move towards cloud computing. • Cloud computing aggregates multiple disparate workloads with varying performance goals into very large clusters. • Example: Interactive services demand that the data processing engine return an answer within a fixed response-time cap. 95
  • 96. Challenges in Big Data Analysis: Volume(6/9) • This level of resource sharing is expensive. • Large clusters require new ways of determining: • how to run and execute data processing jobs so that we can meet the goals of each workload cost-effectively • how to deal with system failures, which occur more frequently as we operate on larger and larger clusters 96
  • 97. Challenges in Big Data Analysis VOLUME(7/9) • This places a premium on declarative approaches to expressing programs, even those doing complex machine learning tasks, since global optimization across multiple users’ programs is necessary for good overall performance. • Reliance on user-driven program optimizations is likely to lead to poor cluster utilization, since users are unaware of other users’ programs. 97
  • 98. Challenges in Big Data Analysis VOLUME(8/9) • A third dramatic shift that is underway is the transformative change of the traditional I/O subsystem. • For many decades, hard disk drives (HDDs) were used to store persistent data. • HDDs had far slower random I/O performance than sequential I/O performance. • Data processing engines formatted their data and designed their query processing methods to “work around” this limitation. 98
  • 99. Challenges in Big Data Analysis: VOLUME(9/9) • But HDDs are increasingly being replaced by solid-state drives today. • Other technologies, such as Phase Change Memory, are around the corner. • These newer storage technologies do not have the same large spread between sequential and random I/O performance. • This requires a rethinking of how we design storage subsystems for Big Data processing systems. 99
  • 100. Challenges in Big Data Analysis: Timeliness(1/5) • The flip side of size is speed. • The larger the data set to be processed, the longer it will take to analyze. • The design of a system that effectively deals with size is likely also to result in a system that can process a given size of data set faster. • However, it is not just this speed that is usually meant when one speaks of Velocity in the context of Big Data. Rather, there is an acquisition rate challenge, and a timeliness challenge. 100
  • 101. Challenges in Big Data Analysis: Timeliness (2/5) • There are many situations in which the result of the analysis is required immediately. • For example, if a fraudulent credit card transaction is suspected, it should ideally be flagged before the transaction is completed, preventing the transaction from taking place at all. • A full analysis of a user’s purchase history is not feasible in real time. • We need to develop partial results in advance so that a small amount of incremental computation with new data can be used to arrive at a quick determination. 101
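The "partial results in advance" idea can be sketched with an invented fraud score: a running per-card count and mean are maintained incrementally, so each new transaction is scored in constant time with no scan of the full purchase history at decision time. The threshold rule is a toy, not a real fraud model.

```python
profiles = {}  # card -> (transaction count, running mean amount)

def score(card, amount, factor=5.0):
    """Flag a transaction far above the cardholder's running mean,
    then update the profile incrementally (O(1) per transaction)."""
    count, mean = profiles.get(card, (0, 0.0))
    suspicious = count >= 3 and amount > factor * mean
    count += 1
    mean += (amount - mean) / count  # incremental mean update
    profiles[card] = (count, mean)
    return suspicious

for amt in [12.0, 9.0, 15.0]:       # normal history, built up in advance
    score("card-1", amt)
print(score("card-1", 400.0))       # True: far outside this card's profile
print(score("card-1", 11.0))        # False: consistent with the profile
```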
  • 102. Challenges in Big Data Analysis Timeliness(3/5) • Given a large data set, it is often necessary to find elements in it that meet a specified criterion. • In the course of data analysis, this sort of search occurs repeatedly. • Scanning the entire data set to find suitable elements is impractical. • Rather, index structures must be created in advance to permit finding qualifying elements quickly. 102
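A minimal example of such an index, over invented records: an inverted index is built in one pass, after which each lookup touches only the matching identifiers instead of scanning the whole data set.

```python
from collections import defaultdict

records = [
    {"id": 1, "tags": ["highway", "congested"]},
    {"id": 2, "tags": ["bridge"]},
    {"id": 3, "tags": ["highway"]},
]

# Build once, in a single pass over the data...
index = defaultdict(set)
for rec in records:
    for tag in rec["tags"]:
        index[tag].add(rec["id"])

# ...then each criterion lookup is a direct dictionary access.
print(sorted(index["highway"]))  # [1, 3]
```

As the next slide notes, this index supports only one class of criterion (exact tag match); range, similarity, or spatial criteria would each need a different structure.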
  • 103. Challenges in Big Data Analysis Timeliness(4/5) • The problem is that each index structure is designed to support only some classes of criteria. • With new analyses desired using Big Data, there are new types of criteria specified, and a need to devise new index structures to support such criteria. 103
  • 104. Challenges in Big Data Analysis Timeliness(5/5) • For example, consider a traffic management system with information regarding thousands of vehicles and local hot spots on roadways. • The system may need to predict potential congestion points along a route chosen by a user, and suggest alternatives. • Doing so requires evaluating multiple spatial proximity queries working with the trajectories of moving objects. • New index structures are required to support such queries. Designing such structures becomes particularly challenging when the data volume is growing rapidly and the queries have tight response time limits. 104
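A sketch of one index structure suited to such proximity queries (cell size, vehicle IDs, and coordinates are invented): a uniform grid hashes each vehicle position to a cell, and a query inspects only the query cell and its neighbours rather than every vehicle.

```python
CELL = 10.0  # grid cell size, chosen to roughly match the query radius

def cell(x, y):
    return (int(x // CELL), int(y // CELL))

def build(positions):
    """Hash every vehicle into its grid cell."""
    grid = {}
    for vid, (x, y) in positions.items():
        grid.setdefault(cell(x, y), []).append(vid)
    return grid

def nearby(grid, x, y):
    """Candidates near (x, y): only the query cell and its 8 neighbours."""
    cx, cy = cell(x, y)
    hits = []
    for dx in (-1, 0, 1):
        for dy in (-1, 0, 1):
            hits += grid.get((cx + dx, cy + dy), [])
    return sorted(hits)

positions = {"v1": (12.0, 14.0), "v2": (13.0, 90.0), "v3": (18.0, 11.0)}
print(nearby(build(positions), 15.0, 12.0))  # ['v1', 'v3']
```

The slide's real difficulty starts where this toy stops: keeping such a structure correct and fast while positions stream in rapidly and queries carry tight response-time limits.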
  • 105. Challenges in Big Data Analysis: Privacy (1/3) • The privacy of data is another huge concern in the context of Big Data. • For electronic health records, there are strict laws governing what can and cannot be done. For other data, regulations are less forceful. • However, there is great public fear regarding the inappropriate use of personal data, particularly through linking of data from multiple sources. • Managing privacy is effectively both a technical and a sociological problem, which must be addressed jointly from both perspectives to realize the promise of Big Data. 105
  • 106. Challenges in Big Data Analysis: Privacy(2/3) • Consider, for example, data gleaned from location-based services. • These new architectures require a user to share his/her location with the service provider, resulting in obvious privacy concerns. • Note that hiding the user’s identity alone without hiding her location would not properly address these privacy concerns. 106
  • 107. Challenges in Big Data Analysis: Privacy(3/3) • There are many additional challenging research problems. • For example, we do not know yet how to share private data while limiting disclosure and ensuring sufficient data utility in the shared data. • The existing paradigm of differential privacy is a very important step in the right direction, but unfortunately it reduces information content too much to be useful in most practical cases. 107
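The privacy-versus-utility trade-off the slide describes can be sketched with the standard Laplace mechanism for a counting query (sensitivity 1): the difference of two exponential draws with rate epsilon is Laplace-distributed with scale 1/epsilon, so smaller epsilon means stronger privacy but noisier, less useful answers. The counts below are invented.

```python
import random

def laplace_count(true_count, epsilon):
    """Release a count with Laplace(0, 1/epsilon) noise: the classic
    epsilon-differentially-private mechanism for a counting query."""
    noise = random.expovariate(epsilon) - random.expovariate(epsilon)
    return true_count + noise

random.seed(0)
# Strong privacy (small epsilon) gives a noisy answer; weak privacy
# (large epsilon) gives an answer close to the true count of 1000.
print(round(laplace_count(1000, epsilon=0.1)))
print(round(laplace_count(1000, epsilon=10.0)))
```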
  • 108. Challenges in Big Data Analysis: Human Collaboration (1/3) • There remain many patterns that humans can easily detect but computer algorithms have a hard time finding. • Ideally, analytics for Big Data will not be all computational; rather, it will be designed explicitly to have a human in the loop. • The new sub-field of visual analytics is attempting to do this, at least with respect to the modeling and analysis phase in the pipeline. • There is similar value to human input at all stages of the analysis pipeline. 108
  • 109. Challenges in Big Data Analysis: Human Collaboration (2/3) • In today’s complex world, it often takes multiple experts from different domains to really understand what is going on. • A Big Data analysis system must support input from multiple human experts, and shared exploration of results. • These multiple experts may be separated in space and time when it is too expensive to assemble an entire team together in one room. • The data system has to accept this distributed expert input, and support their collaboration. 109
  • 110. Challenges in Big Data Analysis: Human Collaboration (3/3) • A popular new method of harnessing human ingenuity to solve problems is through crowd-sourcing. • Wikipedia, the online encyclopedia, is perhaps the best known example of crowd-sourced data. We are relying upon information provided by unvetted strangers. • Most often, what they say is correct. • However, we should expect there to be individuals who have other motives and abilities – some may have a reason to provide false information in an intentional attempt to mislead. 110
  • 111. Conclusion(1/2) • We have entered an era of Big Data. • Through better analysis of the large volumes of data that are becoming available, there is the potential for making faster advances in many scientific disciplines and improving the profitability and success of many enterprises. • However, many of the technical challenges described in this presentation must be addressed before this potential can be realized fully. 111
  • 112. Conclusion (2/2) • The challenges include not just the obvious issues of scale, but also: • heterogeneity, • lack of structure, • error-handling, • privacy, • timeliness, • provenance, • and visualization • at all stages of the analysis pipeline, from data acquisition to result interpretation. 112
  • 113. References (1/4) • This presentation is mainly based on the community white paper “Challenges and Opportunities with Big Data” developed by leading researchers across the United States. 113
  • 114. References (2/4) • [CCC2011a] Advancing Discovery in Science and Engineering. Computing Community Consortium. Spring 2011. • [CCC2011b] Advancing Personalized Education. Computing Community Consortium. Spring 2011. • [CCC2011c] Smart Health and Wellbeing. Computing Community Consortium. Spring 2011. • [CCC2011d] A Sustainable Future. Computing Community Consortium. Summer 2011. • [DF2011] Getting Beneath the Veil of Effective Schools: Evidence from New York City. Will Dobbie, Roland G. Fryer, Jr. NBER Working Paper No. 17632. Issued Dec. 2011. 114
  • 115. References (3/4) • [Eco2011] Drowning in numbers -- Digital data will flood the planet—and help us understand it better. The Economist, Nov 18, 2011. http://www.economist.com/blogs/dailychart/2011/11/big-data-0 • [FJ+2011] Using Data for Systemic Financial Risk Management. Mark Flood, H V Jagadish, Albert Kyle, Frank Olken, and Louiqa Raschid. Proc. Fifth Biennial Conf. Innovative Data Systems Research, Jan. 2011. • [Gar2011] Pattern-Based Strategy: Getting Value from Big Data. Gartner Group press release. July 2011. Available at http://www.gartner.com/it/page.jsp?id=1731916 • [Gon2008] Understanding individual human mobility patterns. Marta C. González, César A. Hidalgo, and Albert-László Barabási. Nature 453, 779-782 (5 June 2008) • [LP+2009] Computational Social Science. David Lazer, Alex Pentland, Lada Adamic, Sinan Aral, Albert-László Barabási, Devon Brewer, Nicholas Christakis, Noshir Contractor, James Fowler, Myron Gutmann, Tony Jebara, Gary King, Michael Macy, Deb Roy, and Marshall Van Alstyne. Science 6 February 2009: 323 (5915), 721-723. 115
  • 116. References (4/4) • [McK2011] Big data: The next frontier for innovation, competition, and productivity. James Manyika, Michael Chui, Brad Brown, Jacques Bughin, Richard Dobbs, Charles Roxburgh, and Angela Hung Byers. McKinsey Global Institute. May 2011. • [MGI2011] Materials Genome Initiative for Global Competitiveness. National Science and Technology Council. June 2011. • [NPR2011a] Following the Breadcrumbs to Big Data Gold. Yuki Noguchi. National Public Radio, Nov. 29, 2011. http://www.npr.org/2011/11/29/142521910/the-digital-breadcrumbs-that-lead-to-big-data • [NPR2011b] The Search for Analysts to Make Sense of Big Data. Yuki Noguchi. National Public Radio, Nov. 30, 2011. http://www.npr.org/2011/11/30/142893065/the-search-for-analysts-to-make-sense-of-big-data • [NYT2012] The Age of Big Data. Steve Lohr. New York Times, Feb 11, 2012. http://www.nytimes.com/2012/02/12/sunday-review/big-datas-impact-in-the-world 116
  • 117. Thank You! Dr.A.Kathirvel, Professor & Head Department of Information Technology Anand Institute of Higher Technology Chennai Email: ayyakathir@gmail.com Mobile: 90950 59465 117
  • 118. TEST • The three main features of Big Data are: • V------, V-------, V-------- • Provenance means ----------- • The five distinct phases in handling Big Data are ---------, -------, --------, ---------, ----------- • Analytics for Big Data will not be all computational; must involve ------- in the loop. 118