2. Definition
Big data is the term for a collection of data
sets so large and complex that it becomes
difficult to process using on-hand database
management tools or traditional data
processing applications.
4. ABC of BIG DATA
Analytics. This solution area focuses on providing efficient analytics for
extremely large datasets. Analytics is all about gaining insight, taking
advantage of the digital universe, and turning data into high-quality
information, providing deeper insights about the business to enable better
decisions.
Bandwidth. This solution area focuses on obtaining better performance
for very fast workloads. High-bandwidth applications include high-performance
computing: the ability to perform complex analyses at
extremely high speeds; high-performance video streaming for surveillance
and mission planning; and video editing and play-out in media and
entertainment.
Content. This solution area focuses on the need to provide boundless,
secure, scalable data storage. Content solutions must enable storing
virtually unlimited amounts of data, so that enterprises can store as much
data as they want, find it when they need it, and never lose it.
5. 3 V’S of BIG DATA
Volume: Not only can each data source contain a huge volume of data,
but also the number of data sources, even for a single domain, has grown
to be in the tens of thousands.
Velocity: As a direct consequence of the rate at which data is being
collected and continuously made available, many of the data sources are
very dynamic.
Variety: Data sources (even in the same domain) are extremely
heterogeneous both at the schema level, regarding how they structure their
data, and at the instance level, regarding how they describe the same
real-world entity, exhibiting considerable variety even for substantially
similar entities.
6. Examples
The NASA Center for Climate Simulation (NCCS) stores 32 petabytes of
climate observations
Big data analysis played a large role in Barack Obama's successful 2012 reelection campaign.
eBay.com uses two data warehouses, at 7.5 petabytes and 40 petabytes, as well
as a 40 PB Hadoop cluster for search, consumer recommendations, and
merchandising.
Amazon.com handles millions of back-end operations every day, as well as
queries from more than half a million third-party sellers. The core technology
that keeps Amazon running is Linux-based, and as of 2005 it had the world’s
three largest Linux databases, with capacities of 7.8 TB, 18.5 TB, and 24.7 TB.
Walmart handles more than 1 million customer transactions every hour, which
are imported into databases estimated to contain more than 2.5 petabytes (2560
terabytes) of data – the equivalent of 167 times the information contained in all
the books in the US Library of Congress.
Facebook handles 50 billion photos from its user base.
9. Big data Integration
A lot of data growth is happening around so-called
unstructured data types. Big data integration is all about
automating the collection, organization, and analysis
of these data types.
The importance of big data integration has led to a
substantial amount of research over the past few years
on the topics of schema mapping, record linkage and
data fusion.
11. Big data vs Traditional Data Integration
The number of data sources, even for a single
domain, has grown to be in the tens of thousands.
Many of the data sources are very dynamic, as a huge
amount of newly collected data are continuously made
available.
The data sources are extremely heterogeneous in their
structure, with considerable variety even for substantially
similar entities.
The data sources are of widely differing quality, with
significant differences in the coverage, accuracy, and
timeliness of the data provided.
12. Schema Mapping
Schema mapping in a data integration system refers to
(i) creating a mediated (global) schema, and
(ii) identifying the mappings between the mediated (global)
schema and the local schemas of the data sources to
determine which (sets of) attributes contain the same
information.
13. Example
Entities like people (customers, employees), companies
(the enterprise itself, competitors, partners, suppliers),
and products (those owned by the enterprise and by its
competitors)
Relationships defined among these entities
Activities with one or more entities as actors and/or
subjects; documents can represent these activities
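As a rough illustration of schema mapping, the minimal Python sketch below maps records from two hypothetical local sources (crm_source and sales_source, with invented attribute names) onto a small mediated customer schema. It only shows the attribute-correspondence idea; how such mappings are discovered is a separate problem.

# Minimal sketch: mapping two hypothetical local source schemas onto one
# mediated (global) "customer" schema. All attribute names are illustrative.

# Mediated schema attributes: name, email, company
MAPPINGS = {
    "crm_source":   {"full_name": "name", "mail": "email", "employer": "company"},
    "sales_source": {"customer": "name", "email_addr": "email", "org": "company"},
}

def to_mediated(source_id, record):
    """Rewrite a local-schema record into the mediated schema."""
    mapping = MAPPINGS[source_id]
    return {global_attr: record[local_attr]
            for local_attr, global_attr in mapping.items()
            if local_attr in record}

# Two differently structured records carry the same information.
print(to_mediated("crm_source", {"full_name": "Ada Lovelace", "mail": "ada@example.com"}))
print(to_mediated("sales_source", {"customer": "Ada Lovelace", "email_addr": "ada@example.com"}))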
14. Record Linkage
Record linkage (RL) refers to the task of
finding records in a data set that refer to the
same entity across different data sources (e.g., data
files, books, websites, databases).
Record linkage is necessary when joining data sets
based on entities that may or may not share a common
identifier (e.g., database key, URI, National identification
number), as may be the case due to differences in
record shape, storage location, and/or curator style or
preference
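A minimal sketch of record linkage follows, assuming two sources that describe people by name and city with no shared identifier; it normalizes the fields and uses Python's standard-library SequenceMatcher as an approximate string similarity, with an illustrative threshold of 0.85.

from difflib import SequenceMatcher

def normalize(record):
    """Lower-case and strip the fields used for matching."""
    return (record["name"].strip().lower(), record["city"].strip().lower())

def similarity(a, b):
    """Approximate string similarity in [0, 1]."""
    return SequenceMatcher(None, a, b).ratio()

def is_match(rec_a, rec_b, threshold=0.85):
    """Decide whether two records likely refer to the same entity."""
    name_a, city_a = normalize(rec_a)
    name_b, city_b = normalize(rec_b)
    return similarity(name_a, name_b) >= threshold and city_a == city_b

src1 = {"name": "Jon A. Smith", "city": "Boston"}
src2 = {"name": "John A Smith", "city": "boston"}
print(is_match(src1, src2))  # likely True despite the absence of a shared key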
15. Challenge in BDI
In BDI, (i) data sources tend to be heterogeneous in
their structure and many sources (e.g., tweets, blog
posts) provide unstructured data, and
(ii) data sources are dynamic and continuously evolving.
To address the volume dimension, new techniques have
been proposed to enable parallel record linkage using
MapReduce.
Adaptive blocking is another technique that has been used
to address the volume dimension.
16. MapReduce
MapReduce is a programming model for processing
large data sets with a parallel, distributed algorithm on
a cluster.
The model is inspired by the map and reduce functions
commonly used in functional programming.
A MapReduce program is composed of
a Map() procedure that performs filtering and sorting
and a Reduce() procedure that performs a summary
operation.
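The sketch below emulates the classic word-count example in plain Python to show the roles of Map(), the shuffle/grouping step, and Reduce(); a real framework such as Hadoop would run these phases in parallel across a cluster rather than in a single process.

from collections import defaultdict

def map_phase(document):
    """Map(): tokenize the input and emit (word, 1) pairs."""
    return [(word.lower(), 1) for word in document.split()]

def shuffle(pairs):
    """Group intermediate pairs by key, as the framework does between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    """Reduce(): summarize all values seen for one key."""
    return key, sum(values)

documents = ["big data needs big tools", "data fusion and data integration"]
intermediate = [pair for doc in documents for pair in map_phase(doc)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(intermediate).items())
print(counts)  # e.g. {'big': 2, 'data': 3, ...}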
17. Adaptive Blocking
Blocking methods alleviate the quadratic pairwise-comparison
cost of record linkage by efficiently selecting approximately
similar object pairs for subsequent distance computations,
leaving out the remaining pairs as dissimilar.
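The sketch below illustrates the basic blocking idea with a fixed, hand-chosen blocking key; adaptive blocking would instead learn the blocking function from training data. Only records sharing a block become candidate pairs, so most dissimilar pairs are never compared. The key used here (first letter of the name plus the city) is purely illustrative.

from collections import defaultdict
from itertools import combinations

def blocking_key(record):
    """Hypothetical blocking key: first letter of the name plus the city."""
    return (record["name"][0].lower(), record["city"].lower())

def candidate_pairs(records):
    """Only records that share a block become candidate pairs for comparison."""
    blocks = defaultdict(list)
    for rec in records:
        blocks[blocking_key(rec)].append(rec)
    for block in blocks.values():
        yield from combinations(block, 2)

records = [
    {"name": "Jon Smith", "city": "Boston"},
    {"name": "John Smith", "city": "Boston"},
    {"name": "Ada Lovelace", "city": "London"},
]
# Instead of all 3 pairwise comparisons, only the two Boston "Smith" records are compared.
print(list(candidate_pairs(records)))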
18. Data fusion
Data fusion refers to resolving conflicts from different
sources and finding the truth that reflects the real world.
Its motivation is exactly the veracity of data: the Web has
made it easy to publish and spread false information across
multiple sources.
Data integration can be viewed as set combination,
in which the larger combined set is retained, whereas
data fusion is a set reduction technique.
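A minimal data fusion sketch follows, assuming several hypothetical sources report conflicting values for the same attribute; simple majority voting keeps the value asserted by the most sources, the simplest form of conflict resolution (real systems also weight sources by their estimated accuracy).

from collections import Counter

def fuse(values_by_source):
    """Pick the value reported by the most sources (simple majority voting)."""
    counts = Counter(values_by_source.values())
    value, votes = counts.most_common(1)[0]
    return value, votes / len(values_by_source)

# Three sources disagree on a company's headquarters; voting keeps one value.
reports = {"source_a": "Seattle", "source_b": "Seattle", "source_c": "Portland"}
print(fuse(reports))  # ('Seattle', 0.666...)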
19. Data fusion model
Level 0: Source Preprocessing.
Level 1: Object Assessment
Level 2: Situation Assessment
Level 3: Impact Assessment
Level 4: Process Refinement
Level 5: User Refinement
20. Advantages
Real-time rerouting of transportation fleets based on
weather patterns
Customer sentiment analysis based on social postings
Targeted disease therapies based on genomic data
Allocation of disaster relief supplies based on mobile
and social messages from victims
Cars driving themselves.
21. Conclusion
This seminar gives a basic insight into what big data is
and reviews state-of-the-art techniques for data
integration in addressing the new challenges raised by
big data, including volume and number of sources,
velocity, variety, and veracity. It also lists the
advantages of harnessing the potential of big data.