BIG DATA
Definition


Big data is the term for a collection of data
sets so large and complex that it becomes
difficult to process using on-hand database
management tools or traditional data
processing applications.
How big?
ABC of BIG DATA


Analytics. This solution area focuses on providing efficient analytics for
extremely large datasets. Analytics is all about gaining insight, taking
advantage of the digital universe, and turning data into high-quality
information, providing deeper insights about the business to enable better
decisions.


Bandwidth. This solution area focuses on obtaining better performance
for very fast workloads. High-bandwidth applications include
high-performance computing (the ability to perform complex analyses at
extremely high speeds), high-performance video streaming for surveillance
and mission planning, and video editing and play-out in media and
entertainment.


Content. This solution area focuses on the need to provide boundless,
secure, scalable data storage. Content solutions must enable storing
virtually unlimited amounts of data, so that enterprises can store as much
data as they want, find it when they need it, and never lose it.
3 V’S of BIG DATA


Volume: Not only can each data source contain a huge volume of data,
but also the number of data sources, even for a single domain, has grown
to be in the tens of thousands.

Velocity: As a direct consequence of the rate at which data is being
collected and continuously made available, many of the data sources are
very dynamic.



Variety: Data sources (even in the same domain) are extremely
heterogeneous, both at the schema level (how they structure their
data) and at the instance level (how they describe the same
real-world entity), exhibiting considerable variety even for substantially
similar entities.
Examples












The NASA Center for Climate Simulation (NCCS) stores 32 petabytes of
climate observations.
Big data analysis played a large role in Barack Obama's successful 2012
re-election campaign.
eBay.com uses two data warehouses, at 7.5 petabytes and 40 PB, as well as a
40 PB Hadoop cluster for search, consumer recommendations, and
merchandising (source: "Inside eBay's 90PB data warehouse").
Amazon.com handles millions of back-end operations every day, as well as
queries from more than half a million third-party sellers. The core technology
that keeps Amazon running is Linux-based, and as of 2005 they had the world's
three largest Linux databases, with capacities of 7.8 TB, 18.5 TB, and 24.7 TB.
Walmart handles more than 1 million customer transactions every hour, which
are imported into databases estimated to contain more than 2.5 petabytes (2560
terabytes) of data, the equivalent of 167 times the information contained in all
the books in the US Library of Congress.
Facebook handles 50 billion photos from its user base.
Big data Integration




Much of the data growth is happening around so-called
unstructured data types. Big data integration is all about
automating the collection, organization and analysis
of these data types.
The importance of big data integration has led to a
substantial amount of research over the past few years
on the topics of schema mapping, record linkage and
data fusion.
Structured data vs Unstructured data
Big data vs Traditional Data Integration


The number of data sources, even for a single
domain, has grown to be in the tens of thousands.

Many of the data sources are very dynamic, as a huge
amount of newly collected data is continuously made
available.

The data sources are extremely heterogeneous in their
structure, with considerable variety even for substantially
similar entities.

The data sources are of widely differing quality, with
significant differences in the coverage, accuracy and
timeliness of the data provided.
Schema Mapping
Schema mapping in a data integration system refers to
(i) creating a mediated (global) schema, and
(ii) identifying the mappings between the mediated (global)
schema and the local schemas of the data sources to
determine which (sets of) attributes contain the same
information.
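
The two steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the source names (`crm`, `webshop`), the attribute names, and the helper `to_mediated` are all hypothetical, and the mappings are written by hand rather than discovered automatically.

```python
# Step (i): a hypothetical mediated (global) schema for customer records.
MEDIATED_SCHEMA = ("name", "email", "city")

# Step (ii): hand-written mappings, source -> {local attribute -> mediated attribute}.
# Source and attribute names are illustrative assumptions.
MAPPINGS = {
    "crm": {"full_name": "name", "mail": "email", "town": "city"},
    "webshop": {"customer": "name", "email_address": "email", "city": "city"},
}


def to_mediated(source, record):
    """Translate one local record into the mediated schema."""
    mapping = MAPPINGS[source]
    translated = {attr: None for attr in MEDIATED_SCHEMA}
    for local_attr, value in record.items():
        mediated_attr = mapping.get(local_attr)
        if mediated_attr is not None:  # attributes without a mapping are dropped
            translated[mediated_attr] = value
    return translated


crm_row = {"full_name": "Ada Lovelace", "mail": "ada@example.org", "town": "London"}
shop_row = {"customer": "Ada Lovelace", "email_address": "ada@example.org", "loyalty_points": 12}

print(to_mediated("crm", crm_row))
print(to_mediated("webshop", shop_row))
```

Once both records are expressed in the mediated schema, attributes that carry the same information line up under the same name, which is what later steps such as record linkage rely on.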

Example





Entities like people (customers, employees), companies
(the enterprise itself, competitors, partners, suppliers),
products (those owned by the enterprise and its
competitors).
Defined relationships among these entities.
Activities with one or more entities as actors and/or
subjects. Documents can represent these activities.
Record Linkage




Record linkage (RL) refers to the task of
finding records that refer to the
same entity across different data sources (e.g., data
files, books, websites, databases).
Record linkage is necessary when joining data sets
based on entities that may or may not share a common
identifier (e.g., database key, URI, national identification
number), as may be the case due to differences in
record shape, storage location, and/or curator style or
preference.
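
A minimal sketch of linking records that share no common identifier, assuming the only usable evidence is a name field. The record layout, the 0.85 threshold, and the function names are illustrative assumptions; `difflib.SequenceMatcher` from the Python standard library stands in for whatever similarity measure a real system would use.

```python
from difflib import SequenceMatcher


def normalize(text):
    """Lower-case and collapse whitespace so trivial differences do not matter."""
    return " ".join(text.lower().split())


def similarity(a, b):
    """String similarity in [0, 1] computed by the standard library."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def link_records(left, right, threshold=0.85):
    """Return (left_id, right_id) pairs whose names are similar enough."""
    links = []
    for rec_a in left:
        for rec_b in right:
            if similarity(rec_a["name"], rec_b["name"]) >= threshold:
                links.append((rec_a["id"], rec_b["id"]))
    return links


source_a = [{"id": "a1", "name": "Jonathan  Smith"}, {"id": "a2", "name": "Maria Garcia"}]
source_b = [{"id": "b1", "name": "jonathan smith"}, {"id": "b2", "name": "Wei Chen"}]

print(link_records(source_a, source_b))  # [('a1', 'b1')]
```

The nested loop compares every record in one source with every record in the other, so the cost grows with the product of the two source sizes.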
Challenge in BDI






In BDI, (i) data sources tend to be heterogeneous in
their structure, and many sources (e.g., tweets, blog
posts) provide unstructured data, and
(ii) data sources are dynamic and continuously evolving.
To address the volume dimension, new techniques have
been proposed to enable parallel record linkage using
MapReduce.
Adaptive blocking is another technique that has been used to
address the volume challenge.
MapReduce






MapReduce is a programming model for processing
large data sets with a parallel, distributed algorithm on
a cluster.
The model is inspired by the map and reduce functions
commonly used in functional programming.
A MapReduce program is composed of
a Map() procedure that performs filtering and sorting
and a Reduce() procedure that performs a summary
operation.
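
The shape of a MapReduce program can be illustrated with the classic word count, run here in a single Python process. This is a sketch of the programming model only, not of any particular framework: the function names are assumptions, and the `shuffle` step that groups values by key is something a real cluster framework would perform between the two phases.

```python
from collections import defaultdict


def map_phase(document):
    """Map(): emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Group intermediate values by key (done by the framework on a real cluster)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce(): summarize all values for one key, here by summing them."""
    return key, sum(values)


documents = ["big data is big", "data integration"]
intermediate = [pair for doc in documents for pair in map_phase(doc)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(intermediate).items())
print(counts)  # {'big': 2, 'data': 2, 'is': 1, 'integration': 1}
```

Because each call to `map_phase` depends on one document only, and each call to `reduce_phase` on one key only, both phases can be spread across many machines.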
Adaptive Blocking


Comparing every pair of records becomes too expensive at big data
scale. Blocking methods alleviate this problem by efficiently
selecting approximately similar object pairs for subsequent
distance computations, leaving out the remaining pairs as dissimilar.
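
The effect of blocking on the number of comparisons can be sketched as follows. Note the simplification: this example uses a fixed, hand-picked blocking key (the first letter of a surname), whereas adaptive blocking chooses the blocking function based on the data. The record layout and function names are illustrative assumptions.

```python
from collections import defaultdict
from itertools import combinations


def blocking_key(record):
    """Hypothetical fixed blocking key: first letter of the surname, lower-cased."""
    return record["surname"][0].lower()


def candidate_pairs(records):
    """Yield only pairs of record ids that fall into the same block."""
    blocks = defaultdict(list)
    for record in records:
        blocks[blocking_key(record)].append(record["id"])
    for ids in blocks.values():
        yield from combinations(ids, 2)


records = [
    {"id": 1, "surname": "Smith"},
    {"id": 2, "surname": "Smyth"},
    {"id": 3, "surname": "Garcia"},
    {"id": 4, "surname": "Garzia"},
    {"id": 5, "surname": "Chen"},
]

pairs = list(candidate_pairs(records))
all_pairs = len(records) * (len(records) - 1) // 2
print(pairs)                                          # [(1, 2), (3, 4)]
print(f"{len(pairs)} of {all_pairs} pairs compared")  # 2 of 10 pairs compared
```

Only the two candidate pairs go on to the more expensive distance computation. The price is that true matches landing in different blocks are missed, which is why the choice of blocking key matters.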
Data fusion






Data fusion refers to resolving conflicts from different
sources and finding the truth that reflects the real world.
Its motivation is exactly the veracity of data: the Web has
made it easy to publish and spread false information across
multiple sources.
Data integration might be viewed as set combination
wherein the larger set is retained, whereas fusion is a
set reduction technique.
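
One simple way to resolve conflicts between sources is a majority vote per attribute, sketched below. This is an assumption chosen for illustration, not the method the slides prescribe: the source names, attributes, and values are hypothetical, and practical fusion systems usually also weigh how trustworthy each source is.

```python
from collections import Counter


def fuse(claims):
    """Resolve conflicts per attribute by majority vote across sources."""
    fused = {}
    attributes = {attr for claim in claims.values() for attr in claim}
    for attr in sorted(attributes):
        votes = Counter(claim[attr] for claim in claims.values() if attr in claim)
        # On a tie, most_common keeps the value that was seen first.
        fused[attr] = votes.most_common(1)[0][0]
    return fused


# Three hypothetical sources describing the same store.
claims = {
    "source_a": {"city": "Boston", "opens_at": "09:00"},
    "source_b": {"city": "Bostn", "opens_at": "09:00"},
    "source_c": {"city": "Boston"},
}

print(fuse(claims))  # {'city': 'Boston', 'opens_at': '09:00'}
```

Three conflicting records are reduced to one fused record, which matches the view of fusion as a set reduction step.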
Data fusion model







Level 0: Source Preprocessing
Level 1: Object Assessment
Level 2: Situation Assessment
Level 3: Impact Assessment
Level 4: Process Refinement
Level 5: User Refinement
Advantages








Real-time rerouting of transportation fleets based on
weather patterns
Customer sentiment analysis based on social postings
Targeted disease therapies based on genomic data
Allocation of disaster relief supplies based on mobile
and social messages from victims
Cars driving themselves.
Conclusion
This seminar gives a basic insight into what big data is
and reviews the state-of-the-art techniques for data
integration in addressing the new challenges raised by
big data, including volume and number of sources,
velocity, variety, and veracity. It also lists the
advantages of harnessing the potential of big data.

Big Data

  • 2. Definition  Big data is the term for a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications.
  • 4. ABC of BIG DATA  Analytics. This solution area focuses on providing efficient analytics for extremely large datasets. Analytics is all about gaining insight, taking advantage of the digital universe, and turning data into high-quality information, providing deeper insights about the business to enable better decisions.  Bandwidth. This solution area focuses on obtaining better performance for very fast workloads. High-bandwidth applications include highperformance computing: the ability to perform complex analyses at extremely high speeds; high-performance video streaming for surveillance and mission planning; and as video editing and play-out in media and entertainment.  Content. This solution area focuses on the need to provide boundless secure scalable data storage. Content solutions must enable storing virtually unlimited amounts of data, so that enterprises can store as much data as they want, find it when they need it, and never lose it.
  • 5. 3 V’S of BIG DATA  Volume:  Velocity: As a direct consequence of the rate at which data is being Not only can each data source contain a huge volume of data, but also the number of data sources, even for a single domain, has grown to be in the tens of thousands. collected and continuously made available,many of the data sources are very dynamic.  Variety: Data sources (even in the same domain) are extremely heterogeneous both at the schema level regarding how they structure their data and at the instance level regarding how they describe the same realworld entity, exhibiting considerable variety even for substantially similar entities.
  • 6. Examples       The NASA Center for Climate Simulation (NCCS) stores 32 petabytes of climate observations Big data analysis played a large role in  Barack Obama's successful 2012 reelection campaign. eBay.com uses two data warehouses at 7.5 petabytes and 40PB as well as a 40PB Hadoop cluster for search, consumer recommendations, and merchandising. Inside eBay’s 90PB data warehouse Amazon.com handles millions of back-end operations every day, as well as queries from more than half a million third-party sellers. The core technology that keeps Amazon running is Linux-based and as of 2005 they had the world’s three largest Linux databases, with capacities of 7.8 TB, 18.5 TB, and 24.7 TB. [ Walmart handles more than 1 million customer transactions every hour, which is imported into databases estimated to contain more than 2.5 petabytes (2560 terabytes) of data – the equivalent of 167 times the information contained in all the books in the US Library of Congress. Facebook handles 50 billion photos from its user base.
  • 7.
  • 8.
  • 9. Big data Integration   A lot of data growth is happening around these so-called unstructured data types. Big data integration is all about automation of the collection, organization and analysis of these data types. The importance of big data integration has led to a substantial amount of research over the past few years on the topics of schema mapping, record linkage and data fusion.
  • 10. Structured data vs Unstructured data
  • 11. Big data vs Traditional Data Integration  The number of data sources, even for a single domain, has grown to be in the tens of thousands.  Many of the data sources are very dynamic, as a huge amount of newly collected data are continuously made available.  The data sources are extremely heterogeneous in their structure, with considerable variety even for substantially similar entities.  The data sources are of widely differing qualities, with significant differences in the coverage, accuracy and timeliness of data provided.
  • 12. Schema Mapping  Schema mapping in a data integration system refers to (i) creating a mediated (global) schema, and (ii) identifying the mappings between the mediated (global) schema and the local schemas of the data sources to determine which (sets of) attributes contain the same information.
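
The two steps can be sketched in a few lines of Python. This is only an illustration: the source names (`crm`, `billing`), the attribute names, and the helper `to_mediated` are all hypothetical and not part of the original slides.

```python
# Step (i): a hypothetical mediated (global) schema. Attribute names are illustrative.
MEDIATED_SCHEMA = ("name", "phone", "city")

# Step (ii): hypothetical mappings from each source's local attributes
# to the mediated attribute that carries the same information.
MAPPINGS = {
    "crm": {"full_name": "name", "tel": "phone", "town": "city"},
    "billing": {"customer": "name", "phone_no": "phone", "city": "city"},
}


def to_mediated(source, record):
    """Rewrite one local record into the mediated schema.

    Mediated attributes the source does not provide are set to None;
    local attributes without a mapping are dropped.
    """
    mapping = MAPPINGS[source]
    out = dict.fromkeys(MEDIATED_SCHEMA)
    for local_attr, value in record.items():
        global_attr = mapping.get(local_attr)
        if global_attr is not None:
            out[global_attr] = value
    return out


crm_row = to_mediated("crm", {"full_name": "Ada", "tel": "555", "extra": 1})
billing_row = to_mediated("billing", {"customer": "Ada", "city": "London"})
```

In practice the mappings are usually discovered or suggested automatically; here they are written by hand to keep the idea visible.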
  • 13. Example  Entities like people (customers, employees), companies (the enterprise itself, competitors, partners, suppliers), and products (those owned by the enterprise and its competitors). Defined relationships among these entities. Activities with one or more entities as actors and/or subjects; documents can represent these activities.
  • 14. Record Linkage  Record linkage (RL) refers to the task of finding records in a data set that refer to the same entity across different data sources (e.g., data files, books, websites, databases). Record linkage is necessary when joining data sets based on entities that may or may not share a common identifier (e.g., database key, URI, national identification number), as may be the case due to differences in record shape, storage location, and/or curator style or preference.
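
A minimal sketch of record linkage without a shared identifier, assuming records are dictionaries with a `name` field and that string similarity on normalized names is an acceptable matching rule. The function names and the `0.85` threshold are illustrative assumptions; the similarity measure is `difflib.SequenceMatcher` from the Python standard library.

```python
from difflib import SequenceMatcher


def normalize(name):
    """Lowercase, drop commas and periods, and collapse whitespace."""
    return " ".join(name.lower().replace(",", " ").replace(".", " ").split())


def similarity(a, b):
    """Similarity in [0, 1] between two names after normalization."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def link_records(left, right, threshold=0.85):
    """Return index pairs (i, j) where left[i] and right[j] likely match.

    Compares every left record with every right record, so the cost is
    quadratic in the number of records.
    """
    links = []
    for i, l_rec in enumerate(left):
        for j, r_rec in enumerate(right):
            if similarity(l_rec["name"], r_rec["name"]) >= threshold:
                links.append((i, j))
    return links


left = [{"name": "Acme Corp."}, {"name": "Globex"}]
right = [{"name": "ACME corp"}, {"name": "Initech"}]
links = link_records(left, right)
```

The nested loop compares all pairs, which is exactly what stops scaling at big data volumes.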
  • 15. Challenge in BDI  In big data integration (BDI), (i) data sources tend to be heterogeneous in their structure, and many sources (e.g., tweets, blog posts) provide unstructured data, and (ii) data sources are dynamic and continuously evolving. To address the volume dimension, new techniques have been proposed to enable parallel record linkage using MapReduce. Adaptive blocking is another technique that has been used to address it.
  • 16. MapReduce  MapReduce is a programming model for processing large data sets with a parallel, distributed algorithm on a cluster. The model is inspired by the map and reduce functions commonly used in functional programming. A MapReduce program is composed of a Map() procedure that performs filtering and sorting and a Reduce() procedure that performs a summary operation.
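
The shape of a MapReduce program can be illustrated with the classic word count. The sketch below runs in a single Python process and only imitates the model; it is not a distributed implementation, and the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are chosen here for illustration.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    """Group emitted values by key, as the framework would between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: summarize all values of one key, here by summing them."""
    return key, sum(values)


def word_count(documents):
    """Run map, shuffle, and reduce over a list of documents."""
    pairs = (pair for doc in documents for pair in map_phase(doc))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


counts = word_count(["big data", "big cluster"])
```

On a real cluster the map calls run in parallel on different nodes and the grouping step moves data across the network; the per-key reduce calls are then also independent of each other.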
  • 17. Adaptive Blocking  Comparing every pair of records does not scale to big data volumes. Blocking methods alleviate this problem by efficiently selecting approximately similar object pairs for subsequent distance computations, leaving out the remaining pairs as dissimilar.
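
A minimal sketch of the blocking idea, assuming records are dictionaries with a `name` field. Note that it uses a fixed, hand-written blocking key (the first three letters of the name), which is plain blocking rather than adaptive blocking; an adaptive method would learn the blocking function from data. The names `blocking_key` and `candidate_pairs` are illustrative.

```python
from collections import defaultdict
from itertools import combinations


def blocking_key(record):
    """Hypothetical blocking key: first three letters of the lowercased name."""
    return record["name"].lower()[:3]


def candidate_pairs(records):
    """Return index pairs of records that share a block.

    Only these pairs go on to the expensive distance computation;
    pairs in different blocks are treated as dissimilar.
    """
    blocks = defaultdict(list)
    for idx, record in enumerate(records):
        blocks[blocking_key(record)].append(idx)
    pairs = []
    for members in blocks.values():
        pairs.extend(combinations(members, 2))
    return pairs


records = [
    {"name": "Acme Corp"},
    {"name": "ACME Inc"},
    {"name": "Globex"},
    {"name": "Globe Ltd"},
]
pairs = candidate_pairs(records)
```

With four records there are six possible pairs, but only two fall into a shared block, so only those two are compared. The price is that true matches whose keys differ are missed, which is why the choice of key matters.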
  • 18. Data fusion  Data fusion refers to resolving conflicts from different sources and finding the truth that reflects the real world. Its motivation is exactly the veracity of data: the Web has made it easy to publish and spread false information across multiple sources. Data integration might be viewed as set combination wherein the larger set is retained, whereas fusion is a set reduction technique.
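
As an illustration of conflict resolution, the sketch below fuses conflicting claims by simple majority voting. This is only a baseline chosen for the example, not a method named in the slides; it treats all sources as equally reliable. The claim layout `(source, entity, attribute, value)` and the function name `fuse` are assumptions.

```python
from collections import Counter


def fuse(claims):
    """Pick one value per (entity, attribute) by majority vote.

    claims: iterable of (source, entity, attribute, value) tuples.
    On a tie, the value that was seen first wins.
    """
    votes = {}
    for _source, entity, attribute, value in claims:
        votes.setdefault((entity, attribute), Counter())[value] += 1
    return {key: counter.most_common(1)[0][0] for key, counter in votes.items()}


claims = [
    ("site_a", "France", "capital", "Paris"),
    ("site_b", "France", "capital", "Lyon"),
    ("site_c", "France", "capital", "Paris"),
]
fused = fuse(claims)
```

Three conflicting claims are reduced to a single value, which matches the view of fusion as set reduction. Majority voting fails when false information is copied across many sources, so more careful approaches weight sources by estimated accuracy.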
  • 19. Data fusion model  Level 0: Source Preprocessing. Level 1: Object Assessment. Level 2: Situation Assessment. Level 3: Impact Assessment. Level 4: Process Refinement. Level 5: User Refinement.
  • 20. Advantages  Real-time rerouting of transportation fleets based on weather patterns. Customer sentiment analysis based on social postings. Targeted disease therapies based on genomic data. Allocation of disaster relief supplies based on mobile and social messages from victims. Cars driving themselves.
  • 21. Conclusion  This seminar gives a basic insight into what big data is and reviews the state-of-the-art techniques for data integration in addressing the new challenges raised by big data, including volume and number of sources, velocity, variety, and veracity. It also lists the advantages of harnessing the potential of big data.