Big Data Analytics
CE Dep, IAUSDJ.ac.ir
23-10-95 (12 Jan 2017)
Survey
 Explosive increase of global data
 In 2011, the overall volume of data created
and copied in the world was 1.8 ZB
 The term big data is mainly used to
describe enormous datasets
 Many government agencies have announced
major plans to accelerate big data research
and applications
 Big data is an abstract concept
 Different from “massive data” or “very big
data”
 In 2010, Apache Hadoop defined big data as
“datasets which could not be captured,
managed, and processed by general
computers within an acceptable scope.”
 In May 2011, McKinsey & Company defined
big data as datasets that could not be
acquired, stored, and managed by classic
database software
 Doug Laney, an analyst at META (presently
Gartner), described the challenges and
opportunities brought about by increased data
with a 3Vs model
 The increase of Volume, Velocity, and Variety
 The increase of Volume, Velocity, Variety, and
Big Data Value
 Retailers that fully utilize big data may
improve their profit by more than 60%
 Developed economies in Europe could save
over EUR 100 billion (excluding the effect
of reduced fraud, errors, and tax
differences)
 Extract wisdom from data
Data representation
 Data representation aims to make data more
meaningful for computer analysis and user
interpretation
 Levels of heterogeneity in type, structure,
semantics, organization, granularity, and
accessibility.
Redundancy reduction and data
compression
 There is a high level of redundancy in datasets
 Reducing redundancy and compressing data is
effective in cutting the indirect cost of the
entire system
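As a rough illustration (not from the slides), highly redundant records compress very well, which is why compression before storage or transmission cuts system-level cost. The record format below is hypothetical; the sketch uses Python's standard `zlib`:

```python
import zlib

# Hypothetical, highly redundant record stream (repeated sensor readings).
records = b"sensor_id=42,temp=21.5;" * 10_000

compressed = zlib.compress(records, level=6)

print(f"original:   {len(records)} bytes")
print(f"compressed: {len(compressed)} bytes")

# Lossless: decompression restores the data exactly.
assert zlib.decompress(compressed) == records
```

The more redundant the dataset, the higher the compression ratio, so the storage and transmission savings grow with redundancy.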
Data life cycle management
 Values hidden in big data depend on data
freshness
 Which data shall be stored and which data
shall be discarded.
Analytical mechanism
 Traditional RDBMSs have a rigid design and
lack scalability and expandability
 A compromise must be found between
RDBMSs and non-relational databases
Data confidentiality
 Distributed nodes and servers
 Third party for processing
Energy management
 Important from both economic and
environmental perspectives
 With the increase of data volume and analytical
demands, the processing, storage, and
transmission of big data consume more
electric energy, so we need:
 System-level power consumption control
 Management mechanisms
Expandability and scalability
 The analytical system of big data must
support present and future datasets.
 The analytical algorithm must be able to
process increasingly expanding and more
complex datasets.
Cooperation
 Analysis of big data is interdisciplinary
research, which requires experts in different
fields to cooperate
 The main objective of cloud computing is to
use huge computing and storage resources
under concentrated management.
 The development of cloud computing
provides solutions for the storage and
processing of big data.
 The distributed storage technology based on cloud computing can
effectively manage big data
 The parallel computing capacity by virtue of cloud computing can
improve the efficiency of acquisition and analyzing big data.
 Cloud computing transforms the IT
architecture
 While big data influences business decision
making.
 Big data and cloud computing have different
target customers
 Cloud computing is an advanced IT solution
 Big data focuses on business operations
 Enormous numbers of networked sensors are
embedded into various devices and machines
in the real world
 They collect various kinds of data, such as
environmental data, geographical data,
astronomical data, and logistic data
 Mobile equipment, transportation facilities,
public facilities, and home appliances could
all be data acquisition equipment in IoT
 Not only a platform for concentrated
storage of data, but also for:
 acquiring data
 managing data
 organizing data
 leveraging the data values and functions.
 Hadoop is widely used in big data
applications in the industry, e.g., spam
filtering, network searching, clickstream
analysis, and social recommendation.
 Much academic research is now based on Hadoop.
 As of June 2012, Yahoo ran Hadoop on 42,000
servers at four data centers to support its
products and services, e.g., searching and
spam filtering
 Facebook announced that its Hadoop
cluster could process 100 PB of data, growing
by 0.5 PB per day as of November 2012.
 Many companies provide commercial Hadoop
execution and/or support, including Cloudera,
IBM, MapR, EMC, and Oracle.
Enterprise data
 Amazon processes millions of terminal
operations and more than 500,000 queries
from third-party sellers per day.
 Walmart processes more than one million
customer transactions per hour, with databases
of over 2.5 PB.
 Akamai analyzes 75 million events per day
 Come from industry, agriculture, traffic,
transportation, medical care, public
departments, and families, etc.
 The data generated from IoT has the following
features:
 Large-scale data
 Heterogeneity
 Strong time and space correlation
 Effective data accounts for only a small portion of
the big data
 High-throughput bio-measurement
technologies have been developed
 The completion of the HGP (Human Genome
Project)
 One sequencing of a human genome may
generate 100–600 GB of raw data
 1.15 million human samples and 150,000
animal, plant, and microorganism samples in
China National Genebank, 2015, 30 million.
 Clinical medical care and medical R & D Data
 Pittsburgh Medical Center has stored 2 TB of such
data
 By 2015, the average data volume of every
hospital was expected to increase from 167 TB to 665 TB
 Astronomy data: 25 TB of data recorded from
1998 to 2008
 At the beginning of 2008, the ATLAS
experiment of the Large Hadron Collider (LHC) at the
European Organization for Nuclear Research
generated raw data at 2 PB/s and stored about
10 TB of processed data per year
 Mobile data.
 Etc.
 Data collection
 Data transmission
 Data pre-processing
 Log files
 Sensing
 Methods for acquiring network data (crawlers,
etc.)
 Libpcap-based packet capture: an easy-to-use library
 Zero-copy packet capture
 Monitoring software: Wireshark, SmartSniff,
WinNetCap
 Mobile equipment
 Inter-DCN transmissions
 from data source to data center
 Intra-DCN transmissions
 data communication flows within data centers
 High-speed infrastructure
The collected datasets vary with respect to noise,
redundancy, and consistency, and it is undoubtedly a
waste to store meaningless data.
Integration
 Combines data from different sources and
provides users with a uniform view of the data
Cleaning
 Data cleaning is the process of identifying
inaccurate, incomplete, or unreasonable data
and then modifying or deleting it
Redundancy elimination
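The pre-processing steps above can be sketched as a minimal cleaning pass. The record layout and value ranges here are hypothetical, purely to show the three operations (drop incomplete data, drop unreasonable data, eliminate duplicates):

```python
# Hypothetical sensor records; None marks a missing reading.
raw = [
    {"id": 1, "temp": 21.5},
    {"id": 2, "temp": None},   # incomplete  -> dropped
    {"id": 3, "temp": 999.0},  # unreasonable -> dropped
    {"id": 1, "temp": 21.5},   # duplicate    -> dropped
]

def clean(records, lo=-50.0, hi=60.0):
    seen, out = set(), []
    for r in records:
        if r["temp"] is None or not (lo <= r["temp"] <= hi):
            continue                  # cleaning: inaccurate/incomplete data
        key = (r["id"], r["temp"])
        if key in seen:
            continue                  # redundancy elimination
        seen.add(key)
        out.append(r)
    return out

print(clean(raw))  # -> [{'id': 1, 'temp': 21.5}]
```

Real pipelines would also integrate schemas from different sources before cleaning, but the filter-and-deduplicate core looks the same.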
 Direct Attached Storage (DAS)
 Various hard disks are directly connected with
servers
 Network Storage
 Network Attached Storage (NAS)
 Storage Area Network (SAN).
 Consistency
 A distributed storage system requires multiple
servers to cooperatively store data
 Availability
 The entire system should not be seriously
affected in its ability to satisfy customers’ read
and write requests
 Partition Tolerance
 The distributed storage still works well when the
network is partitioned
 Eric Brewer proposed the CAP theorem in 2000,
which states that a distributed system
cannot simultaneously meet the
requirements of consistency, availability, and
partition tolerance; at most two of the three
requirements can be satisfied simultaneously.
 Seth Gilbert and Nancy Lynch of MIT
proved the CAP theorem in 2002.
 Big data storage can be classified into three
bottom-up levels:
1. File systems: Google’s GFS, HDFS, Microsoft
Cosmos, Facebook Haystack, Taobao TFS, and FastDFS
2. Databases (NoSQL):
1. Key-value databases: Dynamo, Voldemort
2. Column-oriented databases: BigTable, Cassandra
3. Document databases (more complex than key-value):
MongoDB, SimpleDB, and CouchDB
3. Programming models:
1. MapReduce: automatic parallel processing; Sawzall
of Google, Pig Latin of Yahoo, Hive of Facebook,
and Scope of Microsoft
2. Dryad: a general-purpose distributed execution
engine for processing parallel applications;
also All-Pairs
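The MapReduce model above can be illustrated in miniature. This single-process sketch (my own, not Hadoop code) runs the classic word count through the three phases a real framework would distribute across machines:

```python
from collections import defaultdict
from itertools import chain

# Map phase: each document independently emits (word, 1) pairs.
def map_phase(doc):
    return [(word, 1) for word in doc.split()]

# Shuffle phase: group all intermediate pairs by key.
def shuffle(pairs):
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

# Reduce phase: aggregate the values of each key.
def reduce_phase(groups):
    return {key: sum(values) for key, values in groups.items()}

docs = ["big data big value", "data velocity"]
counts = reduce_phase(shuffle(chain.from_iterable(map(map_phase, docs))))
print(counts)  # -> {'big': 2, 'data': 2, 'value': 1, 'velocity': 1}
```

In Hadoop the map and reduce functions are user code, while splitting, shuffling, and fault tolerance are handled automatically by the framework.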
 Use proper statistical methods to analyze
massive data, to concentrate, extract, and
refine useful data hidden in a batch of chaotic
datasets, and to identify the inherent law of
the subject matter.
 Cluster Analysis: is a statistical method for
grouping objects, and specifically, classifying
objects according to some features.
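A minimal sketch of cluster analysis, assuming one-dimensional points and a naive k-means loop (the data values here are made up to show two well-separated groups):

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Naive k-means: group objects by distance to cluster centers."""
    random.seed(seed)
    centers = random.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            # Assign each point to its nearest center.
            i = min(range(k), key=lambda c: abs(p - centers[c]))
            clusters[i].append(p)
        # Recompute each center as the mean of its cluster.
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return sorted(centers)

data = [1.0, 1.2, 0.8, 10.0, 10.5, 9.5]
centers = kmeans(data, k=2)
print(centers)  # centers converge near 1.0 and 10.0
```

Production systems would use multi-dimensional features and a library implementation, but the assign-then-recompute iteration is the same idea.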
 Factor Analysis: basically targeted at
describing the relations among many
elements with only a few factors, i.e.,
grouping several closely related variables into
a factor; the few factors are then used to
reveal most of the information in the original
data.
 Correlation Analysis: is an analytical method
for determining the law of relations, such as
correlation, correlative dependence, and
mutual restriction, among observed
phenomena and accordingly conducting
forecast and control.
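A small sketch of correlation analysis, computing the Pearson coefficient by hand; the two series (hypothetical ad spend vs. sales) are invented for illustration:

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient between two observed series."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

ad_spend = [10, 20, 30, 40, 50]
sales    = [12, 24, 33, 41, 52]
r = pearson(ad_spend, sales)
print(f"r = {r:.3f}")  # close to +1: strong positive correlation
```

A coefficient near +1 or -1 indicates a strong linear relation that can support forecasting; a value near 0 indicates none.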
 Regression Analysis: is a mathematical tool
for revealing correlations between one
variable and several other variables.
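For the single-variable case, regression analysis reduces to ordinary least squares, which fits directly in a few lines. The hours/scores data below is fabricated so that the exact fit is y = 1 + 2x:

```python
def linear_fit(x, y):
    """Ordinary least squares for y = a + b*x (one explanatory variable)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    b = (sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
         / sum((xi - mx) ** 2 for xi in x))
    a = my - b * mx
    return a, b

hours  = [1, 2, 3, 4]
scores = [3, 5, 7, 9]  # exactly scores = 1 + 2*hours
a, b = linear_fit(hours, scores)
print(f"intercept={a:.1f}, slope={b:.1f}")  # -> intercept=1.0, slope=2.0
```

With several explanatory variables the same idea becomes multiple regression, normally solved with a linear-algebra library.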
 A/B Testing: also called bucket testing. It is a
technique for determining how to improve
target variables by comparing a control group
with test groups. Big data will require a large
number of tests to be executed and analyzed.
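A common way to judge an A/B test on a conversion metric is a two-proportion z-test. The traffic and conversion numbers below are hypothetical:

```python
from math import sqrt

def two_proportion_z(conv_a, n_a, conv_b, n_b):
    """z-statistic comparing conversion rates of control (A) vs variant (B)."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p = (conv_a + conv_b) / (n_a + n_b)          # pooled rate under H0
    se = sqrt(p * (1 - p) * (1 / n_a + 1 / n_b))  # standard error
    return (p_b - p_a) / se

# Hypothetical experiment: 2.0% vs 2.6% conversion, 10,000 users per bucket.
z = two_proportion_z(conv_a=200, n_a=10_000, conv_b=260, n_b=10_000)
print(f"z = {z:.2f}")  # |z| > 1.96 suggests significance at the 5% level
```

At big-data scale, many such tests run concurrently, so corrections for multiple comparisons become important.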
 Statistical Analysis: Statistical analysis is
based on the statistical theory, a branch of
applied mathematics.
 Data Mining Algorithms: Data mining is a
process for extracting hidden, unknown, but
potentially useful information and knowledge
from massive, incomplete, noisy, fuzzy, and
random data.
 Bloom Filter: a Bloom filter consists of a bit
array and a series of hash functions; it supports
compact set-membership tests that may yield false
positives but never false negatives.
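A minimal Bloom filter sketch (my own toy implementation, deriving the k hash functions from salted SHA-256 rather than the specialized hashes a production filter would use):

```python
import hashlib

class BloomFilter:
    """Toy Bloom filter: k hash functions over a fixed-size bit array.
    Lookups can be false positives but never false negatives."""
    def __init__(self, size=1024, k=3):
        self.size, self.k, self.bits = size, k, 0

    def _positions(self, item):
        # Derive k independent positions by salting one hash function.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def __contains__(self, item):
        return all(self.bits >> pos & 1 for pos in self._positions(item))

bf = BloomFilter()
bf.add("user@example.com")
print("user@example.com" in bf)   # True: added items are always found
print("other@example.com" in bf)  # almost surely False
```

The bit array is tiny compared to storing the items themselves, which is why Bloom filters suit big-data deduplication and cache screening.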
 Hashing: it is a method that essentially
transforms data into shorter fixed-length
numerical values or index values.
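A small illustration of hashing as described above: arbitrary-length keys map to short fixed-length values, here used as bucket indices (the keys and bucket count are arbitrary choices for the example):

```python
import hashlib

def bucket(key: str, n_buckets: int = 8) -> int:
    """Hash an arbitrary-length key to a fixed-range bucket index."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % n_buckets

for user in ["alice", "bob", "carol"]:
    print(user, "->", bucket(user))
```

The same key always lands in the same bucket, which is the property hash tables, partitioners, and shuffles rely on.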
 Index: indexing is always an effective method to
reduce the expense of disk reading and
writing, and to improve insertion, deletion,
modification, and query speeds, both in
traditional relational databases that manage
structured data and in other technologies that
manage semi-structured and unstructured
data
 Trie: also called a trie tree, a variant of hash
tree. It is mainly applied to rapid retrieval and
word frequency statistics. The main idea of a
trie is to utilize common prefixes of
character strings to reduce comparison of
character strings to the greatest extent, so as
to improve query efficiency.
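A minimal trie sketch for the word-frequency use case mentioned above: shared prefixes are stored once, so lookups walk the tree character by character instead of comparing whole strings:

```python
class Trie:
    """Trie (prefix tree) node; counts how often each word was inserted."""
    def __init__(self):
        self.children, self.count = {}, 0

    def insert(self, word):
        node = self
        for ch in word:
            node = node.children.setdefault(ch, Trie())
        node.count += 1          # word frequency statistics

    def frequency(self, word):
        node = self
        for ch in word:
            if ch not in node.children:
                return 0
            node = node.children[ch]
        return node.count

t = Trie()
for w in ["data", "data", "database", "big"]:
    t.insert(w)
print(t.frequency("data"))      # -> 2
print(t.frequency("database"))  # -> 1
print(t.frequency("day"))       # -> 0
```

Note that "data" and "database" share the path d-a-t-a, which is exactly the prefix reuse that makes tries efficient for large vocabularies.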
 Parallel Computing: compared to traditional
serial computing, parallel computing refers to
simultaneously utilizing several computing
resources to complete a computation task.
 Presently, some classic parallel computing
models include MPI (Message Passing
Interface), MapReduce, and Dryad
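The decomposition behind those models can be sketched on one machine with a process pool: split the data into independent chunks, compute partial results in parallel workers, then combine them (a stand-in for true MPI/MapReduce-style distribution):

```python
from multiprocessing import Pool

def partial_sum(chunk):
    """Worker task: each process independently sums one slice of the data."""
    return sum(chunk)

if __name__ == "__main__":
    data = list(range(1_000_000))
    # Decompose the task into 4 independent chunks.
    chunks = [data[i::4] for i in range(4)]

    # Several computing resources (worker processes) run simultaneously.
    with Pool(processes=4) as pool:
        total = sum(pool.map(partial_sum, chunks))

    print(total == sum(data))  # parallel result matches the serial one
```

The pattern works because summation is associative; MapReduce generalizes the same split-compute-combine structure across a cluster with fault tolerance added.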
1. Real-time vs. offline analysis:
Real-time analysis is mainly used in e-commerce
and finance.
Offline analysis is usually used for
applications without high requirements on
response time, e.g., machine learning,
statistical analysis, and recommendation
algorithms.
2.Analysis at different levels
Memory-level analysis: is for the case where the
total data volume is smaller than the maximum
memory of a cluster.
BI analysis: is for the case when the data scale
surpasses the memory level but may be imported
into the BI analysis environment.
Massive analysis: is for the case when the data
scale has completely surpassed the capacities of BI
products and traditional relational databases.
3. Analysis with different complexity
 R (30.7 %): R, an open source programming
language and software environment, is
designed for data mining/analysis and
visualization.
 Excel (29.8 %): Excel, a core component of
Microsoft Office, provides powerful data
processing and statistical analysis
capabilities.
 Rapid-I Rapidminer (26.7 %): Rapidminer is an
open source software used for data mining,
machine learning, and predictive analysis.
 KNIME (21.8 %): KNIME (Konstanz
Information Miner) is a user-friendly,
intelligent, open-source platform for rich data
integration, data processing, data analysis,
and data mining
 Weka/Pentaho (14.8 %): Weka, abbreviated
from Waikato Environment for Knowledge
Analysis, is a free and open-source machine
learning and data mining software written in
Java.
 Application of big data in enterprises
 Application of IoT based big data
 Application of online social network-oriented
big data
 Structure-based Applications
 Early Warning
 Real-time Monitoring
 Real-time Feedback
 Applications of healthcare and medical big
data
 prediction
 Fundamental problems of big data: There is a
compelling need for a rigorous and holistic
definition of big data, a structural model of
big data, and a formal description of big data.
 Standardization of big data: An evaluation
system of data quality and an evaluation
standard/benchmark of data computing
efficiency should be developed.
 Evolution of big data computing modes: This
includes memory mode, data flow mode,
PRAM mode, and MR mode, etc.
 Format conversion of big data
 Big data transfer
 Real-time performance of big data
 Processing of big data
 Big data management
 Searching, mining, and analysis of big data
 Integration and provenance of big data
 Big data application
 Big data privacy
 Protection of personal privacy during data acquisition
 Personal privacy data may also be leaked during storage,
transmission, and usage, even if acquired with the
permission of users
 Data quality: Data quality influences big data utilization.
Low-quality data wastes transmission and storage
resources and has poor usability.
 Big data safety mechanisms: Big data brings
challenges to data encryption due to its large scale and
high diversity.
“Patience ensures
victory.”
Thank you for
your attention.

Weitere ähnliche Inhalte

Was ist angesagt?

The Big Data Stack
The Big Data StackThe Big Data Stack
The Big Data Stack
Zubair Nabi
 

Was ist angesagt? (20)

Big Data: An Overview
Big Data: An OverviewBig Data: An Overview
Big Data: An Overview
 
Big data analytics with Apache Hadoop
Big data analytics with Apache  HadoopBig data analytics with Apache  Hadoop
Big data analytics with Apache Hadoop
 
Big data ppt
Big  data pptBig  data ppt
Big data ppt
 
Big Data - An Overview
Big Data -  An OverviewBig Data -  An Overview
Big Data - An Overview
 
Introduction to Big Data
Introduction to Big DataIntroduction to Big Data
Introduction to Big Data
 
Introduction to Big Data and Hadoop
Introduction to Big Data and HadoopIntroduction to Big Data and Hadoop
Introduction to Big Data and Hadoop
 
big data overview ppt
big data overview pptbig data overview ppt
big data overview ppt
 
Guest Lecture: Introduction to Big Data at Indian Institute of Technology
Guest Lecture: Introduction to Big Data at Indian Institute of TechnologyGuest Lecture: Introduction to Big Data at Indian Institute of Technology
Guest Lecture: Introduction to Big Data at Indian Institute of Technology
 
Big data analytics - hadoop
Big data analytics - hadoopBig data analytics - hadoop
Big data analytics - hadoop
 
Big Data vs Data Warehousing
Big Data vs Data WarehousingBig Data vs Data Warehousing
Big Data vs Data Warehousing
 
Big data introduction, Hadoop in details
Big data introduction, Hadoop in detailsBig data introduction, Hadoop in details
Big data introduction, Hadoop in details
 
Structuring Big Data
Structuring Big DataStructuring Big Data
Structuring Big Data
 
Big Data Analytics with Hadoop
Big Data Analytics with HadoopBig Data Analytics with Hadoop
Big Data Analytics with Hadoop
 
Big Data simplified
Big Data simplifiedBig Data simplified
Big Data simplified
 
Introduction to Big Data
Introduction to Big DataIntroduction to Big Data
Introduction to Big Data
 
Introduction to BIg Data and Hadoop
Introduction to BIg Data and HadoopIntroduction to BIg Data and Hadoop
Introduction to BIg Data and Hadoop
 
The Big Data Stack
The Big Data StackThe Big Data Stack
The Big Data Stack
 
Big data ppt
Big data pptBig data ppt
Big data ppt
 
Hadoop in action
Hadoop in actionHadoop in action
Hadoop in action
 
The Future Of Big Data
The Future Of Big DataThe Future Of Big Data
The Future Of Big Data
 

Ähnlich wie Big data analytics, survey r.nabati

TCS_DATA_ANALYSIS_REPORT_ADITYA
TCS_DATA_ANALYSIS_REPORT_ADITYATCS_DATA_ANALYSIS_REPORT_ADITYA
TCS_DATA_ANALYSIS_REPORT_ADITYA
Aditya Srinivasan
 

Ähnlich wie Big data analytics, survey r.nabati (20)

Big Data
Big DataBig Data
Big Data
 
Lecture 5 - Big Data and Hadoop Intro.ppt
Lecture 5 - Big Data and Hadoop Intro.pptLecture 5 - Big Data and Hadoop Intro.ppt
Lecture 5 - Big Data and Hadoop Intro.ppt
 
A Review Paper on Big Data and Hadoop for Data Science
A Review Paper on Big Data and Hadoop for Data ScienceA Review Paper on Big Data and Hadoop for Data Science
A Review Paper on Big Data and Hadoop for Data Science
 
hadoop seminar training report
hadoop seminar  training reporthadoop seminar  training report
hadoop seminar training report
 
E018142329
E018142329E018142329
E018142329
 
Big Data
Big DataBig Data
Big Data
 
U0 vqmtq3m tc=
U0 vqmtq3m tc=U0 vqmtq3m tc=
U0 vqmtq3m tc=
 
Big data mining
Big data miningBig data mining
Big data mining
 
Big data and data mining
Big data and data miningBig data and data mining
Big data and data mining
 
TCS_DATA_ANALYSIS_REPORT_ADITYA
TCS_DATA_ANALYSIS_REPORT_ADITYATCS_DATA_ANALYSIS_REPORT_ADITYA
TCS_DATA_ANALYSIS_REPORT_ADITYA
 
Big Data Analytics
Big Data AnalyticsBig Data Analytics
Big Data Analytics
 
Big data and Hadoop overview
Big data and Hadoop overviewBig data and Hadoop overview
Big data and Hadoop overview
 
Big Data & Data Mining
Big Data & Data MiningBig Data & Data Mining
Big Data & Data Mining
 
Moving Toward Big Data: Challenges, Trends and Perspectives
Moving Toward Big Data: Challenges, Trends and PerspectivesMoving Toward Big Data: Challenges, Trends and Perspectives
Moving Toward Big Data: Challenges, Trends and Perspectives
 
Big data with hadoop
Big data with hadoopBig data with hadoop
Big data with hadoop
 
Big Data Processing with Hadoop : A Review
Big Data Processing with Hadoop : A ReviewBig Data Processing with Hadoop : A Review
Big Data Processing with Hadoop : A Review
 
International Journal of Engineering Research and Development (IJERD)
International Journal of Engineering Research and Development (IJERD)International Journal of Engineering Research and Development (IJERD)
International Journal of Engineering Research and Development (IJERD)
 
Hadoop
HadoopHadoop
Hadoop
 
Big Data Summarization : Framework, Challenges and Possible Solutions
Big Data Summarization : Framework, Challenges and Possible SolutionsBig Data Summarization : Framework, Challenges and Possible Solutions
Big Data Summarization : Framework, Challenges and Possible Solutions
 
Big Data Summarization : Framework, Challenges and Possible Solutions
Big Data Summarization : Framework, Challenges and Possible SolutionsBig Data Summarization : Framework, Challenges and Possible Solutions
Big Data Summarization : Framework, Challenges and Possible Solutions
 

Mehr von nabati

Mehr von nabati (9)

Smart manufacturing through cloud based-r-nabati--dr abdulbaghi ghaderzadeh
Smart manufacturing through cloud based-r-nabati--dr abdulbaghi ghaderzadehSmart manufacturing through cloud based-r-nabati--dr abdulbaghi ghaderzadeh
Smart manufacturing through cloud based-r-nabati--dr abdulbaghi ghaderzadeh
 
Ip core -iausdj.ac.ir
Ip core -iausdj.ac.irIp core -iausdj.ac.ir
Ip core -iausdj.ac.ir
 
Introduction to R r.nabati - iausdj.ac.ir
Introduction to R   r.nabati - iausdj.ac.irIntroduction to R   r.nabati - iausdj.ac.ir
Introduction to R r.nabati - iausdj.ac.ir
 
Contiki IoT simulation
Contiki IoT simulationContiki IoT simulation
Contiki IoT simulation
 
Cloud computing standards and protocols r.nabati
Cloud computing standards and protocols r.nabatiCloud computing standards and protocols r.nabati
Cloud computing standards and protocols r.nabati
 
Internet of things (IoT) and big data- r.nabati
Internet of things (IoT) and big data- r.nabatiInternet of things (IoT) and big data- r.nabati
Internet of things (IoT) and big data- r.nabati
 
Graph theory concepts complex networks presents-rouhollah nabati
Graph theory concepts   complex networks presents-rouhollah nabatiGraph theory concepts   complex networks presents-rouhollah nabati
Graph theory concepts complex networks presents-rouhollah nabati
 
Random walks on graphs - link prediction by Rouhollah Nabati
Random walks on graphs - link prediction by Rouhollah NabatiRandom walks on graphs - link prediction by Rouhollah Nabati
Random walks on graphs - link prediction by Rouhollah Nabati
 
Introduction to latex by Rouhollah Nabati
Introduction to latex by Rouhollah NabatiIntroduction to latex by Rouhollah Nabati
Introduction to latex by Rouhollah Nabati
 

Kürzlich hochgeladen

Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Victor Rentea
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Safe Software
 

Kürzlich hochgeladen (20)

Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : Uncertainty
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
AXA XL - Insurer Innovation Award Americas 2024
AXA XL - Insurer Innovation Award Americas 2024AXA XL - Insurer Innovation Award Americas 2024
AXA XL - Insurer Innovation Award Americas 2024
 
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
 
Ransomware_Q4_2023. The report. [EN].pdf
Ransomware_Q4_2023. The report. [EN].pdfRansomware_Q4_2023. The report. [EN].pdf
Ransomware_Q4_2023. The report. [EN].pdf
 
CNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In PakistanCNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In Pakistan
 
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ..."I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
 
Spring Boot vs Quarkus the ultimate battle - DevoxxUK
Spring Boot vs Quarkus the ultimate battle - DevoxxUKSpring Boot vs Quarkus the ultimate battle - DevoxxUK
Spring Boot vs Quarkus the ultimate battle - DevoxxUK
 
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
 
MS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectorsMS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectors
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 

Big data analytics, survey r.nabati

  • 1. Big Data Analytics CE Dep, IAUSDJ.ac.ir 23-10-95(12 Jan, 2017) Survey
  • 2.
  • 3.  explosive increase of global data  In 2011 overall created and copied data volume in the world was 1.8ZB  the term of big data is mainly used to describe enormous datasets  many government agencies announced  major plans to accelerate big data research and applications
  • 4.
  • 5.  Big data is an abstract concept  Different with “massive data” or “very big data.”  In 2010, Apache Hadoop defined big data as “datasets which could not be captured, managed, and processed by general computers within an acceptable scope.”
  • 6.  in May 2011, McKinsey & Company. Big data shall mean such datasets which could not be acquired, stored, and managed by classic database software
  • 7.  Doug Laney, an analyst of META (presently Gartner)defined challenges and opportunities brought about by increased data with a 3Vs model  The increase ofVolume,Velocity, andVariety
  • 8.  The increase ofVolume,Velocity,Variety and Big DataValue
  • 9.  retailers that fully utilize big data may improve their profit by more than 60 %  developed economies in Europe could save over EUR 100 billion (which excludes the effect of reduced frauds, errors, and tax difference)  Extract wisdom from data
  • 10. Data representation  Data representation aims to make data more meaningful for computer analysis and user interpretation  Levels of heterogeneity in type, structure, semantics, organization, granularity, and accessibility.
  • 11. Redundancy reduction and data compression  There is high level of redundancy in datasets  Is effective to reduce the indirect cost of the entire system
  • 12. Data life cycle management  Values hidden in big data depend on data freshness  Which data shall be stored and which data shall be discarded.
  • 13. Analytical mechanism  Traditional RDBMSs are strictly designed with a lack of scalability and expandability  We shall find a compromising solution between RDBMSs and non-relational databases.
  • 14. Data confidentiality  Distributed nodes and servers  Third party for processing
  • 15. Energy management  From both economy and environment perspectives  With increase of data volume and analytical demands, processing, storage, and transmission of big data consume more electric energy so:  System-level power consumption control  Management mechanism.
  • 16. Expendability and scalability  The analytical system of big data must support present and future datasets.  The analytical algorithm must be able to process increasingly expanding and more complex datasets.
  • 17. Cooperation  Analysis of big data is an interdisciplinary research, which requires experts in different fields cooperate.
  • 18.
  • 19.  The main objective of cloud computing is to use huge computing and storage resources under concentrated management.  The development of cloud computing provides solutions for the storage and processing of big data.  The distributed storage technology based on cloud computing can effectively manage big data  The parallel computing capacity by virtue of cloud computing can improve the efficiency of acquisition and analyzing big data.
  • 20.  Cloud computing transforms the IT architecture  While big data influences business decision making.  Big data and cloud computing have different  target customers  Cloud computing is an advanced IT solution  Big data focusing on business operations.
  • 21.
  • 22.  Enormous amount of networking sensors are embedded into various devices and machines in the real world.  collect various kinds of data, such as environmental data, geographical data, astronomical data, and logistic data.  Mobile equipment's, transportation facilities, public facilities, and home appliances could all be data acquisition equipments in IoT
  • 23.
  • 24.  Not only is a platform for concentrated storage of data.  acquiring data  managing data  organizing data  leveraging the data values and functions.
  • 25.  Hadoop is widely used in big data applications in the industry, e.g., spam filtering, network searching, clickstream analysis, and social recommendation.  Academic research is now based on Hadoop.  In June 2012,Yahoo runs Hadoop in 42,000 servers at four data centers to support its products and services, e.g., searching and spam filtering
  • 26.  Facebook announced that their Hadoop cluster can process 100 PB data, which grew by 0.5 PB per day as in November 2012.  many companies provide Hadoop ommercial execution and/or support, including Cloudera, IBM, MapR, EMC, and Oracle.
  • 27.
  • 28. Enterprise data  Amazon processes millions of terminal operations and more than 500,000 queries from third-party sellers per day.  Walmart processes one million customer trades per hour, over 2.5PB.  Akamai analyzes 75 million events per day
  • 29.  Come from industry, agriculture, traffic, transportation, medical care, public departments, and families, etc.  The data generated from IoT has the following features:  Large-scale data  Heterogeneity  Strong time and space correlation  Effective data accounts for only a small portion of the big data
  • 30.  High-throughput bio-measurement technologies have been developed  The completion of the HGP (Human Genome Project)  One sequencing of a human genome may generate 100-600 GB of raw data.  As of 2015, the China National Genebank stored 1.15 million human samples and 150,000 animal, plant, and microorganism samples, with plans to grow to 30 million samples.
  • 31.  Clinical medical care and medical R&D data  The Pittsburgh Medical Center has stored 2 TB of such data.  By 2015, the average data volume of every hospital was expected to increase from 167 TB to 665 TB.
  • 32.  Astronomy data: 25 TB of data were recorded from 1998 to 2008.  In the beginning of 2008, the ATLAS experiment at the Large Hadron Collider (LHC) of the European Organization for Nuclear Research generated raw data at 2 PB/s and stored about 10 TB of processed data per year.  Mobile data  Etc.
  • 33.  Data collection  Data transmission  Data pre-processing
  • 34.  Log files  Sensing  Methods for acquiring network data (crawlers, etc.)  Libpcap-based packet capture: an easy-to-use library  Zero-copy packet capture  Monitoring software: Wireshark, SmartSniff, WinNetCap  Mobile equipment
  • 35.  Inter-DCN transmissions  from the data source to the data center  Intra-DCN transmissions  data communication flows within data centers  High-speed infrastructure
  • 36. The collected datasets vary with respect to noise, redundancy, and consistency, and it is undoubtedly a waste to store meaningless data. Integration  combination of data from different sources that provides users with a uniform view of the data Cleaning  a process to identify and correct inaccurate, incomplete, or unreasonable data Redundancy elimination
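The pre-processing steps above can be sketched in a few lines of Python. This is a minimal illustration on hypothetical sensor records (the field names are invented for the example), not a production pipeline:

```python
def preprocess(records):
    """Drop incomplete records and exact duplicates, keeping input order."""
    seen = set()
    cleaned = []
    for rec in records:
        # Cleaning: discard records with missing (None) fields.
        if any(v is None for v in rec.values()):
            continue
        # Redundancy elimination: skip exact duplicates.
        key = tuple(sorted(rec.items()))
        if key in seen:
            continue
        seen.add(key)
        cleaned.append(rec)
    return cleaned

raw = [
    {"id": 1, "temp": 21.5},
    {"id": 1, "temp": 21.5},   # duplicate
    {"id": 2, "temp": None},   # incomplete
    {"id": 3, "temp": 19.8},
]
cleaned = preprocess(raw)
```

Real pipelines would add schema integration and outlier handling on top of this skeleton.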
  • 37.
  • 38.  Direct Attached Storage (DAS)  Various hard disks are directly connected with servers  Network Storage  Network Attached Storage (NAS)  Storage Area Network (SAN).
  • 40.  Consistency  A distributed storage system requires multiple servers to cooperatively store data  Availability  It would be desirable if the entire system is not seriously affected in satisfying customers' read and write requests.  Partition Tolerance  The distributed storage still works well when the network is partitioned
  • 41.  Eric Brewer proposed the CAP theorem in 2000, which states that a distributed system cannot simultaneously meet the requirements of consistency, availability, and partition tolerance; at most two of the three requirements can be satisfied simultaneously.  Seth Gilbert and Nancy Lynch from MIT proved the correctness of the CAP theorem in 2002.
  • 42.
  • 43.  classified into three bottom-up levels: 1. File systems 2. Databases 3. Programming models.
  • 44.  classified into three bottom-up levels: 1. File systems  Google's GFS, HDFS, Microsoft Cosmos, Facebook Haystack, Taobao TFS, and FastDFS
  • 45.  classified into three bottom-up levels: 1. File systems 2. Databases, NoSQL 1. Key-value databases: Dynamo, Voldemort 2. Column-oriented databases: BigTable, Cassandra 3. Document databases: more complex than key-value stores; MongoDB, SimpleDB, and CouchDB
  • 46.  classified into three bottom-up levels: 1. File systems 2. Databases, NoSQL 3. Programming models 1. MapReduce: automatic parallel processing. Sawzall of Google, Pig Latin of Yahoo, Hive of Facebook, and Scope of Microsoft. 2. Dryad: a general-purpose distributed execution engine for processing parallel applications. Also All-Pairs.
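The MapReduce model listed above can be illustrated with a minimal single-process word-count sketch; real frameworks such as Hadoop distribute the map, shuffle, and reduce phases across a cluster, which this toy does not:

```python
from collections import defaultdict

def map_phase(documents):
    """Map: emit a (word, 1) pair for every word in every document."""
    for doc in documents:
        for word in doc.split():
            yield (word, 1)

def shuffle(pairs):
    """Shuffle: group all values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: sum the counts for each word."""
    return {key: sum(values) for key, values in groups.items()}

docs = ["big data", "big analytics"]
counts = reduce_phase(shuffle(map_phase(docs)))
```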
  • 47.
  • 48.
  • 49.  Use proper statistical methods to analyze massive data, to concentrate, extract, and refine useful data hidden in a batch of chaotic datasets, and to identify the inherent law of the subject matter.
  • 50.  Cluster Analysis: is a statistical method for grouping objects, and specifically, classifying objects according to some features.
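As a toy illustration of cluster analysis, the sketch below runs one-dimensional k-means in pure Python; real analyses would use multi-dimensional data and a library implementation, and the naive initialization here is only for the example:

```python
def kmeans_1d(points, k, iters=10):
    """Toy 1-D k-means: alternate nearest-centroid assignment and centroid update."""
    centroids = sorted(set(points))[:k]   # naive initialization
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            # Assign each point to its nearest centroid.
            nearest = min(range(k), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Recompute each centroid as the mean of its cluster.
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids, clusters

centroids, clusters = kmeans_1d([1.0, 1.2, 0.8, 8.0, 8.2, 7.8], k=2)
```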
  • 51.  Factor Analysis: is basically targeted at describing the relation among many elements with only a few factors, i.e., grouping several closely related variables into a factor, and the few factors are then used to reveal the most information of the original data.
  • 52.  Correlation Analysis: is an analytical method for determining the law of relations, such as correlation, correlative dependence, and mutual restriction, among observed phenomena and accordingly conducting forecast and control.
  • 53.  Regression Analysis: is a mathematical tool for revealing correlations between one variable and several other variables.
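A minimal illustration of regression analysis: ordinary least squares for one explanatory variable, in pure Python (library routines would be used in practice):

```python
def fit_line(x, y):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    # slope = covariance(x, y) / variance(x)
    slope = (sum((a - mx) * (b - my) for a, b in zip(x, y))
             / sum((a - mx) ** 2 for a in x))
    return slope, my - slope * mx

slope, intercept = fit_line([0, 1, 2, 3], [1, 3, 5, 7])
```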
  • 54.  A/B Testing: also called bucket testing. It is a technique for determining how to improve target variables by comparing a test group against a control group. Big data will require a large number of tests to be executed and analyzed.
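One common way to evaluate an A/B test, chosen here as an illustration (the slide does not prescribe a method), is a two-proportion z-test on conversion rates; the experiment numbers below are hypothetical:

```python
from math import sqrt

def two_proportion_z(conv_a, n_a, conv_b, n_b):
    """z-statistic for the difference between two conversion rates."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)        # pooled conversion rate
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    return (p_b - p_a) / se

# Hypothetical experiment: variant B converts 26 % vs 20 % for A.
z = two_proportion_z(200, 1000, 260, 1000)
```

A |z| above 1.96 is conventionally read as significant at the 5 % level.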
  • 55.  Statistical Analysis: Statistical analysis is based on the statistical theory, a branch of applied mathematics.
  • 56.  Data Mining Algorithms: Data mining is a process for extracting hidden, unknown, but potentially useful information and knowledge from massive, incomplete, noisy, fuzzy, and random data.
  • 57.  Bloom Filter: Bloom Filter consists of a series of Hash functions.
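A minimal Bloom filter sketch, deriving the "series of hash functions" from MD5 with different salts (an illustrative choice; production filters use faster non-cryptographic hashes):

```python
import hashlib

class BloomFilter:
    def __init__(self, size=1024, num_hashes=3):
        self.size = size
        self.num_hashes = num_hashes
        self.bits = [False] * size

    def _positions(self, item):
        # Derive num_hashes independent positions by salting the input.
        for i in range(self.num_hashes):
            digest = hashlib.md5(f"{i}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = True

    def might_contain(self, item):
        # False means definitely absent; True may be a false positive.
        return all(self.bits[pos] for pos in self._positions(item))
```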
  • 58.  Hashing: it is a method that essentially transforms data into shorter fixed-length numerical values or index values.
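As an illustration of such a transformation, the FNV-1a algorithm maps arbitrary bytes to a fixed-length 32-bit value:

```python
def fnv1a(data: bytes, bits: int = 32) -> int:
    """FNV-1a: a simple non-cryptographic hash producing a fixed-length value."""
    h = 0x811C9DC5                       # 32-bit FNV offset basis
    for byte in data:
        h ^= byte                        # mix in one byte
        h = (h * 0x01000193) % (1 << bits)  # multiply by the FNV prime
    return h
```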
  • 59.  Index: index is always an effective method to reduce the expense of disk reading and writing, and improve insertion, deletion, modification, and query speeds in both traditional relational databases that manage structured data, and other technologies that manage semi structured and unstructured data
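A toy secondary index over (key, row id) pairs, using binary search to keep lookups at O(log n); real databases use B-trees or hash indexes, but the idea is the same:

```python
from bisect import bisect_left, insort

class SortedIndex:
    """Toy index: sorted (key, row_id) pairs give O(log n) key lookup."""

    def __init__(self):
        self.entries = []

    def insert(self, key, row_id):
        insort(self.entries, (key, row_id))    # keep entries sorted

    def lookup(self, key):
        ids = []
        i = bisect_left(self.entries, (key,))  # first entry with this key
        while i < len(self.entries) and self.entries[i][0] == key:
            ids.append(self.entries[i][1])
            i += 1
        return ids
```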
  • 60.  Trie: also called a prefix tree, a variant of the hash tree. It is mainly applied to rapid retrieval and word frequency statistics. The main idea of the trie is to utilize common prefixes of character strings to reduce comparisons between character strings to the greatest extent, so as to improve query efficiency.
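The trie idea described above, sketched for word-frequency counting: each character walks one edge, so words sharing a prefix share nodes.

```python
class TrieNode:
    def __init__(self):
        self.children = {}   # char -> TrieNode
        self.count = 0       # word-frequency counter at word ends

class Trie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, word):
        node = self.root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.count += 1

    def frequency(self, word):
        node = self.root
        for ch in word:
            if ch not in node.children:
                return 0
            node = node.children[ch]
        return node.count
```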
  • 61.  Parallel Computing: compared to traditional serial computing, parallel computing refers to simultaneously utilizing several computing resources to complete a computation task.  Presently, some classic parallel computing models include MPI (Message Passing Interface), MapReduce, and Dryad
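A minimal data-parallel sketch: splitting a summation across worker threads with Python's concurrent.futures, as an illustrative single-machine stand-in for the MPI- or MapReduce-style parallelism named above:

```python
from concurrent.futures import ThreadPoolExecutor

def partial_sum(chunk):
    """Each worker computes the sum of its own chunk."""
    return sum(chunk)

data = list(range(1, 101))
# Split the task into four chunks of 25 numbers each.
chunks = [data[i:i + 25] for i in range(0, len(data), 25)]

with ThreadPoolExecutor(max_workers=4) as pool:
    partials = list(pool.map(partial_sum, chunks))

total = sum(partials)   # combine the partial results
```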
  • 62. 1. Real-time vs. offline analysis: real-time analysis is mainly used in E-commerce and finance. Offline analysis is usually used for applications without high requirements on response time, e.g., machine learning, statistical analysis, and recommendation algorithms.
  • 63. 2.Analysis at different levels Memory-level analysis: is for the case where the total data volume is smaller than the maximum memory of a cluster. BI analysis: is for the case when the data scale surpasses the memory level but may be imported into the BI analysis environment. Massive analysis: is for the case when the data scale has completely surpassed the capacities of BI products and traditional relational databases.
  • 64. 3. Analysis with different complexity
  • 65.  R (30.7 %): R, an open source programming language and software environment, is designed for data mining/analysis and visualization.  Excel (29.8 %): Excel, a core component of Microsoft Office, provides powerful data processing and statistical analysis capabilities.
  • 66.  Rapid-I RapidMiner (26.7 %): RapidMiner is an open-source software used for data mining, machine learning, and predictive analysis.  KNIME (21.8 %): KNIME (Konstanz Information Miner) is a user-friendly, intelligent, and open-source platform for rich data integration, data processing, data analysis, and data mining
  • 67.  Weka/Pentaho (14.8 %): Weka, abbreviated from Waikato Environment for Knowledge Analysis, is a free and open-source machine learning and data mining software written in Java.
  • 68.
  • 69.  Application of big data in enterprises  Application of IoT-based big data  Application of online social network-oriented big data  Structure-based applications  Early Warning  Real-time Monitoring  Real-time Feedback
  • 70.  Applications of healthcare and medical big data
  • 72.
  • 73.  Fundamental problems of big data: there is a compelling need for a rigorous and holistic definition of big data, a structural model of big data, and a formal description of big data.  Standardization of big data: an evaluation system of data quality and an evaluation standard/benchmark of data computing efficiency should be developed.
  • 74.  Evolution of big data computing modes:This includes memory mode, data flow mode, PRAM mode, and MR mode, etc.
  • 75.
  • 76.  Format conversion of big data  Big data transfer  Real-time performance of big data  Processing of big data
  • 77.  Big data management  Searching, mining, and analysis of big data  Integration and provenance of big data  Big data application
  • 78.  Big data privacy  Protection of personal privacy during data acquisition  Personal privacy data may also be leaked during storage, transmission, and usage, even if acquired with the permission of users  Data quality: Data quality influences big data utilization. Low quality data wastes transmission and storage resources with poor usability  Big data safety mechanism: Big data brings about challenges to data encryption due to its large scale and high diversity.
  • 80. Thank you for your attention.

Editor's notes

  1. Dryad models a computation as a directed acyclic graph (DAG). Dryad executes operations on the vertices in clusters and transmits data via data channels, including documents, TCP connections, and shared-memory FIFOs. During operation, resources in a logical operation graph are automatically mapped to physical resources.