SlideShare ist ein Scribd-Unternehmen logo
1 von 22
Downloaden Sie, um offline zu lesen
Headline Goes Here
Speaker Name or Subhead Goes Here
Hadoop & Complex (Systems) Research

Thoughts About Large Scale Data Management
GridKa School 2014, Karlsruhe, Germany | 2014-09-02	

Mirko Kämpf | Cloudera
AGENDA
1) Complex Systems Research and Big Data: Challenges
2) The role of Metadata in the Big Data world	

3) Data Management in Hadoop	

4) Cluster Federation using the ETOSHA data catalog
2
COLLECT SOME INFORMATION…
• Do you work interdisciplinary?	

• Do you collect your own data?	

• Do you share data sets often?	

• Do you contribute sub sets of results?	

• Do you integrate multiple data sets?
1: CHALLENGES
• As research projects become more complex, larger groups
need scalable collaborative technology and methods.	

• As soon as datasets become to large it is impossible for them
to be copied between partners. We need data federation
for public and private (anonymous) data.	

• Research results have to be transparent and reproducible. 

This requires not just access to research papers, but also 

access to comprehensive contextual information.
4
5
Source: http://www.idiagram.com/examples/characterisitcs.gif
across types of systems,

across scales, and thus 

across disciplines
many components
dynamic
interactions
giving rise to a
number of"
hierarchical levels
which exhibit
common behaviors
Emergent behavior that connote"
be simply inferred from the behavior "
of the components
CHARACTERISTICS OF BIG DATA
6
Source: http://api.ning.com/files/tRHkwQN7s-Xz5cxylXG004GLGJdjoPd6bVfVBwvgu*F5MwDDUCiHHdmBW-JTEz0cfJjGurJucBMTkIUNdL3jcZT8IPfNWfN9/dv1.jpg
IS IT A BIG DATA PROBLEM?
• Share and integrate measured data and simulation results from
multiple sub projects on different scales (space and time) into
one large meta model. 	

• Volume: For better statistical results large datasets are required. 	

• Velocity: Becomes more important in real world real time applications.	

• Variety: New methods lead to evolution of procedures and schema updates.	

• Veracity: Data collection tools using novel methods may not be perfect.	

7
2:THE ROLE OF METADATA
• What can be expected from a data set or: “What is in it?”
• Data set description is provided by metadata.	

• How can I get access to a particular dataset? 	

• legacy systems: FTP, NFS, and HTTP downloads	

• state of the art: cloud storage & distributed file systems (Amazon S3)	

• Location, version and schema is also metadata.
8
DEFNITION: DATA & METADATA
"
from: WIKIPEDIA!
"
Data: Data as an abstract concept can be viewed as the lowest level of
abstraction, 

from which information and then knowledge are derived.

Metadata: Metadata (metacontent) is defined as the data providing information
about 

one or more aspects of the data, such as: 

" - Location on a computer network where the data were created" "
" - Purpose of the data"
" - Time and date of creation"
" - Creator or author of the data"
" - Means of creation of the data"
" - Standards used …

9
SUMMARY"
according to Dr. J. Pomeranz in its Coursera course about metadata:"
"


(1) Metadata is a description about 

" - natural or artificial"
" - physical or digital things"
"
(2) Metadata can be: "
" - descriptive"
" - structural"
" - administrative"
"
(3) Metadata exists on a per item or per collection level"
"
(4) Metadata can be embedded or linked!
"
"
10
Metadata covers multiple aspects of a system.
PROCEDURAL & ENTITY
METADATA
• Entity Metadata is about a
document, a record, a file, a
table, a collection.	

• Procedural Metadata
describe the collection,
generation, filtering, and
transformation procedures.
11
RESEARCH APPROACH &
DATA REPRESENTATION
12
METADATA & CONTEXT
13
METADATA & CONTEXT
well managed 

in Hadoop
partially manageable

with Dublin Core
several monitoring tools
several project management tools
14
3: DATA MANAGEMENT IN HADOOP
• Hadoop is a distributed platform to store and process large data sets in parallel.

• Hadoop offers multiple processing engines:

- traditionally the Map-Reduce engine

- Machine learning with Mahout

- Graph processing with Giraph 

- Spark and GraphX for in memory processing

• Hadoop offers data management capabilities:

- HDFS for simple file storage 

- HBase for random access based on a row key

- SOLR to search in indexed datasets
15
HCATALOG / HIVE
• Define table structure for data stored in	

• simple HDFS files	

• HBase tables	

• Query such “Hive-Tables” also with Impala, Pig & MapReduce	

• Access data sets using the Kite-SDK and within Spark:	

"
" Source: https://gist.github.com/granturing/7201912
16
LIMITATIONS:
• Hive-Metastore contains primarily “technical” metadata.	

• Meaning of a data set is not managed.	

• Context and life cycle of a data set is not actively managed.	

• Processing parameters are not available in core Hadoop.	

• Hive-Metastore is shared, but only for users in one cluster.
17
4: WHAT IS ETOSHA?
• Etosha will be a distributed, decentralized, and fault tolerant
metadata service.	

• Its main purpose is publishing and linking of information 

about datasets. 	

• Etosha enables cluster federation for Hadoop. 	

• Etosha allows knowledge management while HCatalog and
the Hive-Metastore are focused on low-level technical
18
ETOSHA ARCHITECTURE
"
"
"
"
"
"
"
19
Hadoop cluster
once per Hadoop cluster
ETOSHA CONTEXT 

BROWSER
20
Context browser works across clusters.
NEXT STEPS
• Expose administrative, structural, and descriptive metadata

about Hadoop data sets via SPARQL endpoint.	

• Implement a data set discovery API to find relevant data.	

• Enable subquery delegation between Hadoop clusters.	

• Collect and publish metadata about public open datasets 

in the Etosha Data Catalog (EDC) project.
21
MANYTHANKS …	

• … to the organizers of this great event!	

• … to my friend and contributor EricTessenow.	

• … to my colleagues from Cloudera.	

• … and most importantly, to the members of the open source community!
22

Weitere ähnliche Inhalte

Was ist angesagt?

A Predictive Analytics Workflow on DICOM Images using Apache Spark with Anahi...
A Predictive Analytics Workflow on DICOM Images using Apache Spark with Anahi...A Predictive Analytics Workflow on DICOM Images using Apache Spark with Anahi...
A Predictive Analytics Workflow on DICOM Images using Apache Spark with Anahi...Databricks
 
Data Science at Scale by Sarah Guido
Data Science at Scale by Sarah GuidoData Science at Scale by Sarah Guido
Data Science at Scale by Sarah GuidoSpark Summit
 
Practical introduction to hadoop
Practical introduction to hadoopPractical introduction to hadoop
Practical introduction to hadoopinside-BigData.com
 
Uber's data science workbench
Uber's data science workbenchUber's data science workbench
Uber's data science workbenchRan Wei
 
Data Platform at Twitter: Enabling Real-time & Batch Analytics at Scale
Data Platform at Twitter: Enabling Real-time & Batch Analytics at ScaleData Platform at Twitter: Enabling Real-time & Batch Analytics at Scale
Data Platform at Twitter: Enabling Real-time & Batch Analytics at ScaleSriram Krishnan
 
Data Engineering Quick Guide
Data Engineering Quick GuideData Engineering Quick Guide
Data Engineering Quick GuideAsim Jalis
 
An Online Spark Pipeline: Semi-Supervised Learning and Automatic Retraining w...
An Online Spark Pipeline: Semi-Supervised Learning and Automatic Retraining w...An Online Spark Pipeline: Semi-Supervised Learning and Automatic Retraining w...
An Online Spark Pipeline: Semi-Supervised Learning and Automatic Retraining w...Databricks
 
Spark Summit San Francisco 2016 - Matei Zaharia Keynote: Apache Spark 2.0
Spark Summit San Francisco 2016 - Matei Zaharia Keynote: Apache Spark 2.0Spark Summit San Francisco 2016 - Matei Zaharia Keynote: Apache Spark 2.0
Spark Summit San Francisco 2016 - Matei Zaharia Keynote: Apache Spark 2.0Databricks
 
What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...
What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...
What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...Simplilearn
 
myHadoop - Hadoop-on-Demand on Traditional HPC Resources
myHadoop - Hadoop-on-Demand on Traditional HPC ResourcesmyHadoop - Hadoop-on-Demand on Traditional HPC Resources
myHadoop - Hadoop-on-Demand on Traditional HPC ResourcesSriram Krishnan
 
Data infrastructure architecture for medium size organization: tips for colle...
Data infrastructure architecture for medium size organization: tips for colle...Data infrastructure architecture for medium size organization: tips for colle...
Data infrastructure architecture for medium size organization: tips for colle...DataWorks Summit/Hadoop Summit
 
Performance of Spark vs MapReduce
Performance of Spark vs MapReducePerformance of Spark vs MapReduce
Performance of Spark vs MapReduceEdureka!
 
Opal: Simple Web Services Wrappers for Scientific Applications
Opal: Simple Web Services Wrappers for Scientific ApplicationsOpal: Simple Web Services Wrappers for Scientific Applications
Opal: Simple Web Services Wrappers for Scientific ApplicationsSriram Krishnan
 
Hadoop in the Cloud – The What, Why and How from the Experts
Hadoop in the Cloud – The What, Why and How from the ExpertsHadoop in the Cloud – The What, Why and How from the Experts
Hadoop in the Cloud – The What, Why and How from the ExpertsDataWorks Summit/Hadoop Summit
 
Hadoop vs Spark | Which One to Choose? | Hadoop Training | Spark Training | E...
Hadoop vs Spark | Which One to Choose? | Hadoop Training | Spark Training | E...Hadoop vs Spark | Which One to Choose? | Hadoop Training | Spark Training | E...
Hadoop vs Spark | Which One to Choose? | Hadoop Training | Spark Training | E...Edureka!
 

Was ist angesagt? (20)

A Predictive Analytics Workflow on DICOM Images using Apache Spark with Anahi...
A Predictive Analytics Workflow on DICOM Images using Apache Spark with Anahi...A Predictive Analytics Workflow on DICOM Images using Apache Spark with Anahi...
A Predictive Analytics Workflow on DICOM Images using Apache Spark with Anahi...
 
Data Science at Scale by Sarah Guido
Data Science at Scale by Sarah GuidoData Science at Scale by Sarah Guido
Data Science at Scale by Sarah Guido
 
Practical introduction to hadoop
Practical introduction to hadoopPractical introduction to hadoop
Practical introduction to hadoop
 
Uber's data science workbench
Uber's data science workbenchUber's data science workbench
Uber's data science workbench
 
Apache Spark in Industry
Apache Spark in IndustryApache Spark in Industry
Apache Spark in Industry
 
Big data Hadoop
Big data  Hadoop   Big data  Hadoop
Big data Hadoop
 
Data Platform at Twitter: Enabling Real-time & Batch Analytics at Scale
Data Platform at Twitter: Enabling Real-time & Batch Analytics at ScaleData Platform at Twitter: Enabling Real-time & Batch Analytics at Scale
Data Platform at Twitter: Enabling Real-time & Batch Analytics at Scale
 
Data Engineering Quick Guide
Data Engineering Quick GuideData Engineering Quick Guide
Data Engineering Quick Guide
 
An Online Spark Pipeline: Semi-Supervised Learning and Automatic Retraining w...
An Online Spark Pipeline: Semi-Supervised Learning and Automatic Retraining w...An Online Spark Pipeline: Semi-Supervised Learning and Automatic Retraining w...
An Online Spark Pipeline: Semi-Supervised Learning and Automatic Retraining w...
 
Spark Summit San Francisco 2016 - Matei Zaharia Keynote: Apache Spark 2.0
Spark Summit San Francisco 2016 - Matei Zaharia Keynote: Apache Spark 2.0Spark Summit San Francisco 2016 - Matei Zaharia Keynote: Apache Spark 2.0
Spark Summit San Francisco 2016 - Matei Zaharia Keynote: Apache Spark 2.0
 
What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...
What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...
What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...
 
Big data & Hadoop
Big data & HadoopBig data & Hadoop
Big data & Hadoop
 
myHadoop - Hadoop-on-Demand on Traditional HPC Resources
myHadoop - Hadoop-on-Demand on Traditional HPC ResourcesmyHadoop - Hadoop-on-Demand on Traditional HPC Resources
myHadoop - Hadoop-on-Demand on Traditional HPC Resources
 
Data infrastructure architecture for medium size organization: tips for colle...
Data infrastructure architecture for medium size organization: tips for colle...Data infrastructure architecture for medium size organization: tips for colle...
Data infrastructure architecture for medium size organization: tips for colle...
 
Hadoop Tutorial For Beginners
Hadoop Tutorial For BeginnersHadoop Tutorial For Beginners
Hadoop Tutorial For Beginners
 
Performance of Spark vs MapReduce
Performance of Spark vs MapReducePerformance of Spark vs MapReduce
Performance of Spark vs MapReduce
 
Opal: Simple Web Services Wrappers for Scientific Applications
Opal: Simple Web Services Wrappers for Scientific ApplicationsOpal: Simple Web Services Wrappers for Scientific Applications
Opal: Simple Web Services Wrappers for Scientific Applications
 
PPT on Hadoop
PPT on HadoopPPT on Hadoop
PPT on Hadoop
 
Hadoop in the Cloud – The What, Why and How from the Experts
Hadoop in the Cloud – The What, Why and How from the ExpertsHadoop in the Cloud – The What, Why and How from the Experts
Hadoop in the Cloud – The What, Why and How from the Experts
 
Hadoop vs Spark | Which One to Choose? | Hadoop Training | Spark Training | E...
Hadoop vs Spark | Which One to Choose? | Hadoop Training | Spark Training | E...Hadoop vs Spark | Which One to Choose? | Hadoop Training | Spark Training | E...
Hadoop vs Spark | Which One to Choose? | Hadoop Training | Spark Training | E...
 

Andere mochten auch

DevOps – what is it? Why? Is it real? How to do it?
DevOps – what is it? Why? Is it real? How to do it?DevOps – what is it? Why? Is it real? How to do it?
DevOps – what is it? Why? Is it real? How to do it?Sailaja Tennati
 
SEO Training Workshop Presentation
SEO Training Workshop PresentationSEO Training Workshop Presentation
SEO Training Workshop PresentationTim Aldiss
 
R2i Marketo Case Studies
R2i Marketo Case StudiesR2i Marketo Case Studies
R2i Marketo Case StudiesR2integrated
 
Apache Spark Machine Learning Decision Trees
Apache Spark Machine Learning Decision TreesApache Spark Machine Learning Decision Trees
Apache Spark Machine Learning Decision TreesCarol McDonald
 
Managing and monitoring large scale data transfers - Networkshop44
Managing and monitoring large scale data transfers - Networkshop44Managing and monitoring large scale data transfers - Networkshop44
Managing and monitoring large scale data transfers - Networkshop44Jisc
 
Introduction to Networkshop - Networkshop44 2016
Introduction to Networkshop - Networkshop44 2016Introduction to Networkshop - Networkshop44 2016
Introduction to Networkshop - Networkshop44 2016Jisc
 
Multiprotocol label switching (mpls) - Networkshop44
Multiprotocol label switching (mpls)  - Networkshop44Multiprotocol label switching (mpls)  - Networkshop44
Multiprotocol label switching (mpls) - Networkshop44Jisc
 

Andere mochten auch (8)

DevOps – what is it? Why? Is it real? How to do it?
DevOps – what is it? Why? Is it real? How to do it?DevOps – what is it? Why? Is it real? How to do it?
DevOps – what is it? Why? Is it real? How to do it?
 
«ПЕРВЫЙ ШАГ» - утешительные поединки
«ПЕРВЫЙ ШАГ» - утешительные поединки«ПЕРВЫЙ ШАГ» - утешительные поединки
«ПЕРВЫЙ ШАГ» - утешительные поединки
 
SEO Training Workshop Presentation
SEO Training Workshop PresentationSEO Training Workshop Presentation
SEO Training Workshop Presentation
 
R2i Marketo Case Studies
R2i Marketo Case StudiesR2i Marketo Case Studies
R2i Marketo Case Studies
 
Apache Spark Machine Learning Decision Trees
Apache Spark Machine Learning Decision TreesApache Spark Machine Learning Decision Trees
Apache Spark Machine Learning Decision Trees
 
Managing and monitoring large scale data transfers - Networkshop44
Managing and monitoring large scale data transfers - Networkshop44Managing and monitoring large scale data transfers - Networkshop44
Managing and monitoring large scale data transfers - Networkshop44
 
Introduction to Networkshop - Networkshop44 2016
Introduction to Networkshop - Networkshop44 2016Introduction to Networkshop - Networkshop44 2016
Introduction to Networkshop - Networkshop44 2016
 
Multiprotocol label switching (mpls) - Networkshop44
Multiprotocol label switching (mpls)  - Networkshop44Multiprotocol label switching (mpls)  - Networkshop44
Multiprotocol label switching (mpls) - Networkshop44
 

Ähnlich wie Hadoop & Complex Systems Research

Minimizing the Complexities of Machine Learning with Data Virtualization
Minimizing the Complexities of Machine Learning with Data VirtualizationMinimizing the Complexities of Machine Learning with Data Virtualization
Minimizing the Complexities of Machine Learning with Data VirtualizationDenodo
 
Overview of Big Data by Sunny
Overview of Big Data by SunnyOverview of Big Data by Sunny
Overview of Big Data by SunnyDignitasDigital1
 
Hadoop Maharajathi,II-M.sc.,Computer Science,Bonsecours college for women
Hadoop Maharajathi,II-M.sc.,Computer Science,Bonsecours college for womenHadoop Maharajathi,II-M.sc.,Computer Science,Bonsecours college for women
Hadoop Maharajathi,II-M.sc.,Computer Science,Bonsecours college for womenmaharajothip1
 
Introduction to Hadoop and MapReduce
Introduction to Hadoop and MapReduceIntroduction to Hadoop and MapReduce
Introduction to Hadoop and MapReduceeakasit_dpu
 
Hadoop - Architectural road map for Hadoop Ecosystem
Hadoop -  Architectural road map for Hadoop EcosystemHadoop -  Architectural road map for Hadoop Ecosystem
Hadoop - Architectural road map for Hadoop Ecosystemnallagangus
 
Teradata Loom Introductory Presentation
Teradata Loom Introductory PresentationTeradata Loom Introductory Presentation
Teradata Loom Introductory Presentationmlang222
 
Big Data Practice_Planning_steps_RK
Big Data Practice_Planning_steps_RKBig Data Practice_Planning_steps_RK
Big Data Practice_Planning_steps_RKRajesh Jayarman
 
Big Data Analytics with Hadoop
Big Data Analytics with HadoopBig Data Analytics with Hadoop
Big Data Analytics with HadoopPhilippe Julio
 
Large scale computing
Large scale computing Large scale computing
Large scale computing Bhupesh Bansal
 
Scaling up Linked Data
Scaling up Linked DataScaling up Linked Data
Scaling up Linked DataMarin Dimitrov
 
Data Science with the Help of Metadata
Data Science with the Help of MetadataData Science with the Help of Metadata
Data Science with the Help of MetadataJim Dowling
 
Hadoop/MapReduce/HDFS
Hadoop/MapReduce/HDFSHadoop/MapReduce/HDFS
Hadoop/MapReduce/HDFSpraveen bhat
 
P.Maharajothi,II-M.sc(computer science),Bon secours college for women,thanjavur.
P.Maharajothi,II-M.sc(computer science),Bon secours college for women,thanjavur.P.Maharajothi,II-M.sc(computer science),Bon secours college for women,thanjavur.
P.Maharajothi,II-M.sc(computer science),Bon secours college for women,thanjavur.MaharajothiP
 
Etosha - Data Asset Manager : Status and road map
Etosha - Data Asset Manager : Status and road mapEtosha - Data Asset Manager : Status and road map
Etosha - Data Asset Manager : Status and road mapDr. Mirko Kämpf
 
Scaling up Linked Data
Scaling up Linked DataScaling up Linked Data
Scaling up Linked DataEUCLID project
 
Module-2_HADOOP.pptx
Module-2_HADOOP.pptxModule-2_HADOOP.pptx
Module-2_HADOOP.pptxShreyasKv13
 
BIg Data Analytics-Module-2 vtu engineering.pptx
BIg Data Analytics-Module-2 vtu engineering.pptxBIg Data Analytics-Module-2 vtu engineering.pptx
BIg Data Analytics-Module-2 vtu engineering.pptxVishalBH1
 

Ähnlich wie Hadoop & Complex Systems Research (20)

Minimizing the Complexities of Machine Learning with Data Virtualization
Minimizing the Complexities of Machine Learning with Data VirtualizationMinimizing the Complexities of Machine Learning with Data Virtualization
Minimizing the Complexities of Machine Learning with Data Virtualization
 
Overview of Big Data by Sunny
Overview of Big Data by SunnyOverview of Big Data by Sunny
Overview of Big Data by Sunny
 
Hadoop Maharajathi,II-M.sc.,Computer Science,Bonsecours college for women
Hadoop Maharajathi,II-M.sc.,Computer Science,Bonsecours college for womenHadoop Maharajathi,II-M.sc.,Computer Science,Bonsecours college for women
Hadoop Maharajathi,II-M.sc.,Computer Science,Bonsecours college for women
 
Introduction to Hadoop and MapReduce
Introduction to Hadoop and MapReduceIntroduction to Hadoop and MapReduce
Introduction to Hadoop and MapReduce
 
Hadoop - Architectural road map for Hadoop Ecosystem
Hadoop -  Architectural road map for Hadoop EcosystemHadoop -  Architectural road map for Hadoop Ecosystem
Hadoop - Architectural road map for Hadoop Ecosystem
 
Apache Drill at ApacheCon2014
Apache Drill at ApacheCon2014Apache Drill at ApacheCon2014
Apache Drill at ApacheCon2014
 
Hadoop
HadoopHadoop
Hadoop
 
Teradata Loom Introductory Presentation
Teradata Loom Introductory PresentationTeradata Loom Introductory Presentation
Teradata Loom Introductory Presentation
 
Big Data Practice_Planning_steps_RK
Big Data Practice_Planning_steps_RKBig Data Practice_Planning_steps_RK
Big Data Practice_Planning_steps_RK
 
Big Data Analytics with Hadoop
Big Data Analytics with HadoopBig Data Analytics with Hadoop
Big Data Analytics with Hadoop
 
Big data applications
Big data applicationsBig data applications
Big data applications
 
Large scale computing
Large scale computing Large scale computing
Large scale computing
 
Scaling up Linked Data
Scaling up Linked DataScaling up Linked Data
Scaling up Linked Data
 
Data Science with the Help of Metadata
Data Science with the Help of MetadataData Science with the Help of Metadata
Data Science with the Help of Metadata
 
Hadoop/MapReduce/HDFS
Hadoop/MapReduce/HDFSHadoop/MapReduce/HDFS
Hadoop/MapReduce/HDFS
 
P.Maharajothi,II-M.sc(computer science),Bon secours college for women,thanjavur.
P.Maharajothi,II-M.sc(computer science),Bon secours college for women,thanjavur.P.Maharajothi,II-M.sc(computer science),Bon secours college for women,thanjavur.
P.Maharajothi,II-M.sc(computer science),Bon secours college for women,thanjavur.
 
Etosha - Data Asset Manager : Status and road map
Etosha - Data Asset Manager : Status and road mapEtosha - Data Asset Manager : Status and road map
Etosha - Data Asset Manager : Status and road map
 
Scaling up Linked Data
Scaling up Linked DataScaling up Linked Data
Scaling up Linked Data
 
Module-2_HADOOP.pptx
Module-2_HADOOP.pptxModule-2_HADOOP.pptx
Module-2_HADOOP.pptx
 
BIg Data Analytics-Module-2 vtu engineering.pptx
BIg Data Analytics-Module-2 vtu engineering.pptxBIg Data Analytics-Module-2 vtu engineering.pptx
BIg Data Analytics-Module-2 vtu engineering.pptx
 

Mehr von Dr. Mirko Kämpf

Time Series Analysis Using an Event Streaming Platform
 Time Series Analysis Using an Event Streaming Platform Time Series Analysis Using an Event Streaming Platform
Time Series Analysis Using an Event Streaming PlatformDr. Mirko Kämpf
 
IoT meets AI in the Clouds
IoT meets AI in the CloudsIoT meets AI in the Clouds
IoT meets AI in the CloudsDr. Mirko Kämpf
 
Improving computer vision models at scale (Strata Data NYC)
Improving computer vision models at scale  (Strata Data NYC)Improving computer vision models at scale  (Strata Data NYC)
Improving computer vision models at scale (Strata Data NYC)Dr. Mirko Kämpf
 
Improving computer vision models at scale presentation
Improving computer vision models at scale presentationImproving computer vision models at scale presentation
Improving computer vision models at scale presentationDr. Mirko Kämpf
 
Enterprise Metadata Integration
Enterprise Metadata IntegrationEnterprise Metadata Integration
Enterprise Metadata IntegrationDr. Mirko Kämpf
 
PCAP Graphs for Cybersecurity and System Tuning
PCAP Graphs for Cybersecurity and System TuningPCAP Graphs for Cybersecurity and System Tuning
PCAP Graphs for Cybersecurity and System TuningDr. Mirko Kämpf
 
From Events to Networks: Time Series Analysis on Scale
From Events to Networks: Time Series Analysis on ScaleFrom Events to Networks: Time Series Analysis on Scale
From Events to Networks: Time Series Analysis on ScaleDr. Mirko Kämpf
 
Apache Spark in Scientific Applications
Apache Spark in Scientific ApplicationsApache Spark in Scientific Applications
Apache Spark in Scientific ApplicationsDr. Mirko Kämpf
 
DPG Berlin - SOE 18 - talk v1.2.4
DPG Berlin - SOE 18 - talk v1.2.4DPG Berlin - SOE 18 - talk v1.2.4
DPG Berlin - SOE 18 - talk v1.2.4Dr. Mirko Kämpf
 
Information Spread in the Context of Evacuation Optimization
Information Spread in the Context of Evacuation OptimizationInformation Spread in the Context of Evacuation Optimization
Information Spread in the Context of Evacuation OptimizationDr. Mirko Kämpf
 
DPG 2014: "Context Sensitive and Time Dependent Relevance of Wikipedia Articles"
DPG 2014: "Context Sensitive and Time Dependent Relevance of Wikipedia Articles"DPG 2014: "Context Sensitive and Time Dependent Relevance of Wikipedia Articles"
DPG 2014: "Context Sensitive and Time Dependent Relevance of Wikipedia Articles"Dr. Mirko Kämpf
 

Mehr von Dr. Mirko Kämpf (11)

Time Series Analysis Using an Event Streaming Platform
 Time Series Analysis Using an Event Streaming Platform Time Series Analysis Using an Event Streaming Platform
Time Series Analysis Using an Event Streaming Platform
 
IoT meets AI in the Clouds
IoT meets AI in the CloudsIoT meets AI in the Clouds
IoT meets AI in the Clouds
 
Improving computer vision models at scale (Strata Data NYC)
Improving computer vision models at scale  (Strata Data NYC)Improving computer vision models at scale  (Strata Data NYC)
Improving computer vision models at scale (Strata Data NYC)
 
Improving computer vision models at scale presentation
Improving computer vision models at scale presentationImproving computer vision models at scale presentation
Improving computer vision models at scale presentation
 
Enterprise Metadata Integration
Enterprise Metadata IntegrationEnterprise Metadata Integration
Enterprise Metadata Integration
 
PCAP Graphs for Cybersecurity and System Tuning
PCAP Graphs for Cybersecurity and System TuningPCAP Graphs for Cybersecurity and System Tuning
PCAP Graphs for Cybersecurity and System Tuning
 
From Events to Networks: Time Series Analysis on Scale
From Events to Networks: Time Series Analysis on ScaleFrom Events to Networks: Time Series Analysis on Scale
From Events to Networks: Time Series Analysis on Scale
 
Apache Spark in Scientific Applications
Apache Spark in Scientific ApplicationsApache Spark in Scientific Applications
Apache Spark in Scientific Applications
 
DPG Berlin - SOE 18 - talk v1.2.4
DPG Berlin - SOE 18 - talk v1.2.4DPG Berlin - SOE 18 - talk v1.2.4
DPG Berlin - SOE 18 - talk v1.2.4
 
Information Spread in the Context of Evacuation Optimization
Information Spread in the Context of Evacuation OptimizationInformation Spread in the Context of Evacuation Optimization
Information Spread in the Context of Evacuation Optimization
 
DPG 2014: "Context Sensitive and Time Dependent Relevance of Wikipedia Articles"
DPG 2014: "Context Sensitive and Time Dependent Relevance of Wikipedia Articles"DPG 2014: "Context Sensitive and Time Dependent Relevance of Wikipedia Articles"
DPG 2014: "Context Sensitive and Time Dependent Relevance of Wikipedia Articles"
 

Kürzlich hochgeladen

Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 nightCheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 nightDelhi Call girls
 
Week-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interactionWeek-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interactionfulawalesam
 
Zuja dropshipping via API with DroFx.pptx
Zuja dropshipping via API with DroFx.pptxZuja dropshipping via API with DroFx.pptx
Zuja dropshipping via API with DroFx.pptxolyaivanovalion
 
Introduction-to-Machine-Learning (1).pptx
Introduction-to-Machine-Learning (1).pptxIntroduction-to-Machine-Learning (1).pptx
Introduction-to-Machine-Learning (1).pptxfirstjob4
 
Halmar dropshipping via API with DroFx
Halmar  dropshipping  via API with DroFxHalmar  dropshipping  via API with DroFx
Halmar dropshipping via API with DroFxolyaivanovalion
 
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al BarshaAl Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al BarshaAroojKhan71
 
BigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptxBigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptxolyaivanovalion
 
Generative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and MilvusGenerative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and MilvusTimothy Spann
 
Invezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signalsInvezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signalsInvezz1
 
April 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's AnalysisApril 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's Analysismanisha194592
 
Midocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFxMidocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFxolyaivanovalion
 
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779Delhi Call girls
 
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Callshivangimorya083
 
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...amitlee9823
 
Ravak dropshipping via API with DroFx.pptx
Ravak dropshipping via API with DroFx.pptxRavak dropshipping via API with DroFx.pptx
Ravak dropshipping via API with DroFx.pptxolyaivanovalion
 
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...amitlee9823
 
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...Valters Lauzums
 

Kürzlich hochgeladen (20)

Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 nightCheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
 
Week-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interactionWeek-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interaction
 
Zuja dropshipping via API with DroFx.pptx
Zuja dropshipping via API with DroFx.pptxZuja dropshipping via API with DroFx.pptx
Zuja dropshipping via API with DroFx.pptx
 
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in Kishangarh
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in  KishangarhDelhi 99530 vip 56974 Genuine Escort Service Call Girls in  Kishangarh
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in Kishangarh
 
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get CytotecAbortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
 
Introduction-to-Machine-Learning (1).pptx
Introduction-to-Machine-Learning (1).pptxIntroduction-to-Machine-Learning (1).pptx
Introduction-to-Machine-Learning (1).pptx
 
(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7
(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7
(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7
 
Halmar dropshipping via API with DroFx
Halmar  dropshipping  via API with DroFxHalmar  dropshipping  via API with DroFx
Halmar dropshipping via API with DroFx
 
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al BarshaAl Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
 
BigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptxBigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptx
 
Generative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and MilvusGenerative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and Milvus
 
Invezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signalsInvezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signals
 
April 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's AnalysisApril 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's Analysis
 
Midocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFxMidocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFx
 
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
 
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
 
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
 
Ravak dropshipping via API with DroFx.pptx
Ravak dropshipping via API with DroFx.pptxRavak dropshipping via API with DroFx.pptx
Ravak dropshipping via API with DroFx.pptx
 
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
 
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
 

Hadoop & Complex Systems Research

  • 1. Headline Goes Here Speaker Name or Subhead Goes Here Hadoop & Complex (Systems) Research
 Thoughts About Large Scale Data Management GridKa School 2014, Karlsruhe, Germany | 2014-09-02 Mirko Kämpf | Cloudera
  • 2. AGENDA 1) Complex Systems Research and Big Data: Challenges 2) The role of Metadata in the Big Data world 3) Data Management in Hadoop 4) Cluster Federation using the ETOSHA data catalog 2
  • 3. COLLECT SOME INFORMATION… • Do you work interdisciplinary? • Do you collect your own data? • Do you share data sets often? • Do you contribute sub sets of results? • Do you integrate multiple data sets?
  • 4. 1: CHALLENGES • As research projects become more complex, larger groups need scalable collaborative technology and methods. • As soon as datasets become to large it is impossible for them to be copied between partners. We need data federation for public and private (anonymous) data. • Research results have to be transparent and reproducible. 
 This requires not just access to research papers, but also 
 access to comprehensive contextual information. 4
  • 5. 5 Source: http://www.idiagram.com/examples/characterisitcs.gif across types of systems,
 across scales, and thus 
 across disciplines many components dynamic interactions giving rise to a number of" hierarchical levels which exhibit common behaviors Emergent behavior that connote" be simply inferred from the behavior " of the components
  • 6. CHARACTERISTICS OF BIG DATA 6 Source: http://api.ning.com/files/tRHkwQN7s-Xz5cxylXG004GLGJdjoPd6bVfVBwvgu*F5MwDDUCiHHdmBW-JTEz0cfJjGurJucBMTkIUNdL3jcZT8IPfNWfN9/dv1.jpg
  • 7. IS IT A BIG DATA PROBLEM? • Share and integrate measured data and simulation results from multiple sub projects on different scales (space and time) into one large meta model. • Volume: For better statistical results large datasets are required. • Velocity: Becomes more important in real world real time applications. • Variety: New methods lead to evolution of procedures and schema updates. • Veracity: Data collection tools using novel methods may not be perfect. 7
  • 8. 2:THE ROLE OF METADATA • What can be expected from a data set or: “What is in it?” • Data set description is provided by metadata. • How can I get access to a particular dataset? • legacy systems: FTP, NFS, and HTTP downloads • state of the art: cloud storage & distributed file systems (Amazon S3) • Location, version and schema is also metadata. 8
  • 9. DEFNITION: DATA & METADATA " from: WIKIPEDIA! " Data: Data as an abstract concept can be viewed as the lowest level of abstraction, 
 from which information and then knowledge are derived.
 Metadata: Metadata (metacontent) is defined as the data providing information about 
 one or more aspects of the data, such as: 
 " - Location on a computer network where the data were created" " " - Purpose of the data" " - Time and date of creation" " - Creator or author of the data" " - Means of creation of the data" " - Standards used …
 9
  • 10. SUMMARY" according to Dr. J. Pomeranz in its Coursera course about metadata:" " 
 (1) Metadata is a description about 
 " - natural or artificial" " - physical or digital things" " (2) Metadata can be: " " - descriptive" " - structural" " - administrative" " (3) Metadata exists on a per item or per collection level" " (4) Metadata can be embedded or linked! " " 10 Metadata covers multiple aspects of a system.
  • 11. PROCEDURAL & ENTITY METADATA • Entity Metadata is about a document, a record, a file, a table, a collection. • Procedural Metadata describe the collection, generation, filtering, and transformation procedures. 11
  • 12. RESEARCH APPROACH & DATA REPRESENTATION 12
  • 14. METADATA & CONTEXT well managed 
 in Hadoop partially manageable
 with Dublin Core several monitoring tools several project management tools 14
  • 15. 3: DATA MANAGEMENT IN HADOOP • Hadoop is a distributed platform to store and process large data sets in parallel.
 • Hadoop offers multiple processing engines:
 - traditionally the Map-Reduce engine
 - Machine learning with Mahout
 - Graph processing with Giraph 
 - Spark and GraphX for in memory processing
 • Hadoop offers data management capabilities:
 - HDFS for simple file storage 
 - HBase for random access based on a row key
 - SOLR to search in indexed datasets 15
  • 16. HCATALOG / HIVE • Define table structure for data stored in • simple HDFS files • HBase tables • Query such “Hive-Tables” also with Impala, Pig & MapReduce • Access data sets using the Kite-SDK and within Spark: " " Source: https://gist.github.com/granturing/7201912 16
  • 17. LIMITATIONS: • Hive-Metastore contains primarily “technical” metadata. • Meaning of a data set is not managed. • Context and life cycle of a data set is not actively managed. • Processing parameters are not available in core Hadoop. • Hive-Metastore is shared, but only for users in one cluster. 17
  • 18. 4: WHAT IS ETOSHA? • Etosha will be a distributed, decentralized, and fault tolerant metadata service. • Its main purpose is publishing and linking of information 
 about datasets. • Etosha enables cluster federation for Hadoop. • Etosha allows knowledge management while HCatalog and the Hive-Metastore are focused on low-level technical 18
  • 20. ETOSHA CONTEXT 
 BROWSER 20 Context browser works across clusters.
  • 21. NEXT STEPS • Expose administrative, structural, and descriptive metadata
 about Hadoop data sets via SPARQL endpoint. • Implement a data set discovery API to find relevant data. • Enable subquery delegation between Hadoop clusters. • Collect and publish metadata about public open datasets 
 in the Etosha Data Catalog (EDC) project. 21
  • 22. MANYTHANKS … • … to the organizers of this great event! • … to my friend and contributor EricTessenow. • … to my colleagues from Cloudera. • … and most importantly, to the members of the open source community! 22