CSIRO DIGITAL PRODUCTIVITY FLAGSHIP
Platform for Big Data Analytics and Visual Analytics:
CSIRO Use Cases
Tomasz Bednarz | Research Team Leader
23rd February 2015 | Statistical Modelling and Analysis of Big Data Workshop 2015
The ARC Centre of Excellence in Mathematical and Statistical Frontiers in Big Data, Big Models and New Insights
Project Team: Piotr Szul, Yulia Arzhaeva, Luke Domanski, Ryan Lagerstrom, Surya Nepal, John Zic, John Taylor
Platform for Big Data Analytics and Visual Analytics
CSIRO Computational Simulation Sciences TCP project, Digital Productivity Flagship
Dual use of Platform:
• Support and foster a community around Big Data processing and visualisation
• Provide computing tools and services supporting CSIRO specific Big Data
Analytics needs
What will the tools be:
• Facility (software + hardware)
• Portable VM or container image (run everywhere)
Definition
The Platform is a software solution stack (on
hardware infrastructure) that supports
development of big data analytics and
visual analytics applications.
It is:
• Scalable: given appropriate hardware, it can
scale to petabytes of data and
thousands of nodes.
• Universal: can be deployed on a variety of
computational platforms (clouds, HPC
clusters, dedicated clusters, can use
GPGPUs transparently).
• Integrated: is integrated with relevant
CSIRO systems (e.g. Digital Access Portal,
Bowen Clouds).
Isn’t Big Data a solved problem?
Can’t we just install the most popular software and be done with it?
No… for CSIRO, it is more complex
• Science and commercial users have different sets of needs
• CSIRO = many disciplines/applications = different tool requirements
• CSIRO = diverse large-scale storage facilities, discipline-specific/optimised data
cubes, HPC parallel storage systems
• CSIRO = diverse set of compute infrastructure
Why?
What does Big Data Analytics mean to Science?
Big data software survey and analysis
R Big data package survey and analysis
Conceptual Platform Design
• Planning layered architecture
– Big picture view: available software,
CSIRO Infrastructure + Science
• Plan of attack
Assessment of user requirements
• User and project group outreach
• Workshop Questionnaires and Abstracts
What have we been doing?
Understanding
Understand
• Big Data Analytics in Science?
• Scientist & CSIRO specific needs
• Tools and software landscape
Big Picture Design
• Forest from the trees
• Layering: General to Specific, extensible, clear
boundaries/responsibility/interfaces
• Portable & Interoperable: share nothing/minimum, technology adapters,
diverse infrastructure, diverse applications, extensible
Refine Design + Implementation (Plan of attack)
• Driven by real business/use cases
Goals + Progress
Tools to empower scientists
What is “Big Data” processing?
“Python is like the jazz movement in machine learning to R is like classical music.”
Definition:
Collection of data sets so large and complex that it becomes difficult
to process using traditional data processing applications (Wikipedia)
Simple, right? But the private sector has the loudest Big Data voice.
Most popular tools and resources lean heavily towards:
• Unstructured data, high number of small loosely related data elements
• Hadoop, HDFS, NoSQL, Hadoop, HDFS, NoSQL, Hadoop, HDFS… etc.
Big Data definition vs discussion?
Understand
Some science problems fit the commercial mold. Many don’t, e.g.:
• Highly regular and structured data samples
• Single large datasets of tightly coupled samples
• Streaming data from sensors
• Getting data from domain-specific data cubes
The right tools do exist, they are just not as visible in the community
• Which ones do we need?
• How do we integrate them with popular tools?
Can we still use commercially driven tools for science problems that
break the mold?
Big Data: where does science fit?
Understand
Definition:
The discovery and communication of meaningful patterns in data
(Wikipedia)
Wow, that’s broad! But the commercial world has the loudest voice again:
• Analytics = [predictions] used to recommend action or to guide decision
making rooted in business context (Wikipedia)
Fortunately, this requires tools that are also commonly used in science:
• data modeling, machine learning, optimization algorithms, visualisation etc.
Analytics definition vs discussion?
Understand
Who is REALLY doing Big Data? What are their needs?
• Application/tools
– Linear Algebra? Machine Learning? Image Processing? Text/pattern
matching/mining?
• Data
– Streaming vs Persistent+(Static||Dynamic)? Unstructured vs Structured? SQL vs
NoSQL vs Text vs Binary
• Human Workflow
– Prototype vs Production, Exploratory vs Directed, Interactive vs Batch
– Scale code+tools from Interactive+Prototype => Production+Batch
What will they need to work on? How much can we support!?
• CSIRO infrastructure: Storage + Compute
– Where is (should be) the data? Don’t move it!!
– What/Where is the compute?
• Possible?? Transparency + Interoperability + Portability over Infrastructure
– HPC + Internal Cloud + Dedicated System
The punters
want this!
Scientist and CSIRO specific needs
Understand
What is out there?
What delivers our scientists’ requirements?
Does it support CSIRO infrastructure?
How does it all fit together?
• Inter layer: Does product X work with product Y?
• Intra layer: Can data stored by A be easily abstracted/ingested by B?
Tools and Software Landscape
Understand
Data A, B, C + Infrastructure 1, 2, 4 + Tool/Software ι, β, γ + Science
App/Domain l, m, n
How to deal with Complexity!!!
1. Define the Forest
2. Map the Trees to Forest
3. Pick which Trees to keep/use
Seeing the Forest from the Trees
Design
Big Data: Petabytes
Storage of low-value data
H/W failure common
Code: frequency, graphs, machine-learning, rendering
Ingress/egress problems
Dense storage of data
Mix CPU and data
Spindle:core ratio
HPC: Petaflops
Storage for checkpointing
Surprised by H/W failure
Code: simulation, rendering
Less persistent data, ingress &
egress
Dense compute
CPU + GPU
Bandwidth to other servers
• Failure is inevitable → fault tolerance built in
• Bandwidth and IO are precious → topology-aware scheduling
• Linear scalability → massive parallelisation, minimal communication
• Hide the complexities from developers → expressive programming model
Big Data versus HPC
Understand
• Become a Big Data Excellence Centre with the
vision/mission to be a hub for big data analytics
and processing technology and provide
technical expertise in this area.
• Achieve a step change in the size of big data
problems that are being tackled in CSIRO.
• Decrease the effort and time required for CSIRO
to discover new patterns in massive datasets.
• Simplify scientists’ workflows with big data sets.
• Develop solution architectures and software
components to support specific needs of big
data processing and visualisation in CSIRO.
• Deliver a CSIRO shared “big data facility”
supporting integration and processing of data
from different data sources. This would be
more of an infrastructure project, built
together with IM&T (Bowen Clouds), for certain
types of in-house big data processing scenarios.
Vision
• Connect data analytics, simulations,
statistical modeling, image & video
analytics, machine learning, visualisation
into one stack of reusable solutions
supporting various science domains.
• Build more interactive solutions that
connect users with analytical models to
improve business decisions.
• Create new business cases.
Mission
• Uptake of the technology in
CSIRO, transforming the way we
do science.
• Contribution to Big Data
Science globally.
• International collaborations.
• Enable new discoveries.
• Reduce time to new discovery.
• Global outreach.
• External grants, engagements
with industry.
Success factors
• Data discovery
• Quantitative visualisation focus:
• Measurement on visualisation
• Uncertainty - from data to display
• Integration
• Interaction
• Views of the data
• Collaboration across virtual
environments
• Annotated 3D videos
• Augmented Reality
• Immersive Virtual Reality
• Wearables + Visual Analytics
Visual Analytics
RAVE @ NIST/USA
Platform for Big Data and Visual Analytics
Our project is oriented towards incremental, use-case driven development of
technical capabilities including skills, software and infrastructure to facilitate
scientists’ access to big data processing
Come talk to us!
https://wiki.csiro.au/display/bigdata/PBDAVA+Collaboration
Funded from CAPEX & built in collaboration with IM&T
Deployed on Bowen Cloud
16 nodes, each with:
• 128 GB RAM and 16 CPU cores
• Infiniband network
• ~100 TB of storage (planned)
Various storage options being considered: OSM/NFS, HDFS, GPFS+FPO
YARN cluster (CDH5) : Hadoop MR, Spark, h2o … (any YARN compatible
framework)
Status: storage testing
For more see:
https://wiki.csiro.au/display/ICTCRC/DP+Research+Big+Data+Cluster
The DP Research Big Data Cluster is a dedicated hardware cluster intended both to
support big data related computer science research and to provide experimental
big data processing capabilities for scientific projects within DP.
DP Big Data Cluster
DP Big Data Cluster - Architecture
[Architecture diagram. Labels: Edge Node (clients, compiler, staging, monitor); Master nodes (YARN master, HDFS master); Worker nodes (YARN worker, HDFS worker); hosts hadoop1-01-cdc, hadoop1-02-cdc, hadoop1-{03..16}-cdc; Infiniband Network; Bowen Storage (OSM/NFS, GPFS DAS); Bowen Compute; Nexus authentication; Ganglia monitor; CSIRO intranet workstations; Bragg, Pearcey.]
Hadoop
What is it?
● Apache Hadoop is a framework that allows for
the distributed processing of large data sets across
clusters of computers using simple programming
models.
● Designed to scale up from single servers to
thousands of machines, each offering local
computation and storage.
● Designed to detect and handle failures at the
application layer.
http://hadoop.apache.org
Hadoop
Components
● Hadoop components:
● Hadoop Distributed File System (HDFS)
● MapReduce
● Handles any data type
● Structured
● Unstructured
● Schema
● No schema
● High volume
● Low volume
Hadoop
Hadoop Distributed File System
● Breaks incoming files into blocks and stores them
redundantly across the cluster
● A single large file is split into blocks, and the blocks
are distributed among the nodes
● Blocks in HDFS are large – typically 128MB in size
● Files in HDFS are ‘write once’ (no random writes
allowed) and processed by the MR framework. Results are
stored back in HDFS.
● Original data file not modified during lifecycle
Hadoop
HDFS
● Data replication (to enhance reliability and
availability) – default is threefold
● HDFS optimised for large, streaming reads of
files (rather than random reads)
● A master node, the NameNode, keeps track of
the metadata: the blocks that make up a file and their
locations
Hadoop
Example
● NameNode holds metadata for files
● DataNodes hold the actual blocks
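The block splitting, threefold replication and NameNode bookkeeping described on the HDFS slides can be sketched in a few lines of Python. This is a toy model, not the HDFS implementation: the `place_blocks` helper, the DataNode names and the round-robin placement are assumptions made for illustration (real HDFS placement is rack-aware).

```python
import math

BLOCK_SIZE = 128 * 1024 * 1024  # 128 MB, the typical block size quoted on the slide
REPLICATION = 3                 # the default replication factor quoted on the slide


def place_blocks(file_size, datanodes, block_size=BLOCK_SIZE, replication=REPLICATION):
    """Toy NameNode metadata: block index -> DataNodes holding a replica."""
    if replication > len(datanodes):
        raise ValueError("need at least as many DataNodes as replicas")
    num_blocks = math.ceil(file_size / block_size)
    metadata = {}
    for block in range(num_blocks):
        # Hypothetical round-robin placement, chosen only to keep replicas distinct.
        metadata[block] = [
            datanodes[(block + r) % len(datanodes)] for r in range(replication)
        ]
    return metadata


datanodes = ["dn1", "dn2", "dn3", "dn4"]       # made-up node names
metadata = place_blocks(1024 ** 3, datanodes)  # a 1 GB file
print(len(metadata), metadata[0])
```

Under these assumptions a 1 GB file becomes 8 blocks, each held by three different DataNodes, so it occupies roughly 3 GB of raw storage.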
MapReduce
Word count example
Map: reads the text one line
at a time, splits out each
word into a separate string, and
for each word outputs the word
and a 1 to indicate it has seen the
word one time.
Shuffle: uses the word as the key,
hashing the records to reducers.
Reduce: sums up the number of
times each word was seen and
writes that together with the word
as output.
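The three phases of the word count can be imitated in plain Python on a single machine. This is only a sketch of the data flow, assuming two hypothetical reducers. The function names are illustrative and nothing here uses the Hadoop API.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word on the line."""
    return [(word, 1) for word in line.split()]


def shuffle_phase(pairs, num_reducers):
    """Shuffle: hash each key to a reducer and group the values per word."""
    partitions = [defaultdict(list) for _ in range(num_reducers)]
    for word, count in pairs:
        partitions[hash(word) % num_reducers][word].append(count)
    return partitions


def reduce_phase(partition):
    """Reduce: sum the counts collected for each word."""
    return {word: sum(counts) for word, counts in partition.items()}


def word_count(lines, num_reducers=2):
    pairs = [pair for line in lines for pair in map_phase(line)]
    result = {}
    for partition in shuffle_phase(pairs, num_reducers):
        result.update(reduce_phase(partition))
    return result


counts = word_count(["the quick brown fox", "the lazy dog", "the fox"])
print(counts["the"], counts["fox"])  # prints: 3 2
```

Because every occurrence of a word hashes to the same reducer, each reducer can sum its words independently of the others, which is what lets the real framework run reducers in parallel.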
Big Volume Processing
• Architectures
– Share nothing
– Traditional: compute + storage
• Parallel file systems
– HDFS, GPFS + FPO,
– S3, Swift, Lustre, Gluster
• Processing
– Out of core (MapReduce)
– In memory
• Scheduling:
– Yarn, Mesos
A programming model and an associated implementation for processing and generating
large data sets with a parallel, distributed algorithm on a cluster + a parallel filesystem
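[Diagram: frameworks grouped by programming model (MapReduce Model, DAG Model, Graph Model, BSP/Collective Model), with some marked as Iterative. Frameworks shown: Hadoop, HaLoop, Twister, Spark, Dryad, Stratosphere, Reef, Giraph, Hama, GraphLab, GraphX, Harp, MPI.]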
Pig
Philosophy
● Pigs eat anything
○ Input data can come in any format – popular formats, such
as tab-delimited are natively supported. Users can add
functions to support other data formats.
○ Operates on data: relational, nested, semi-structured, or
unstructured
● Pigs live anywhere
● Pigs are domestic animals
● Pigs fly
○ Pig processes data quickly.
Pig
What is it?
● Pig provides an engine for executing data flows in parallel on Hadoop
● Pig includes a language called Pig Latin for expressing data flows
● Pig Latin includes operators for many of the traditional data operations (so they
do not have to be re-implemented, as in plain Hadoop MapReduce): JOIN, ORDER BY
(sort), FILTER, FOREACH, GROUP, LOAD and STORE.
● Pig makes use of: the Hadoop Distributed File System (HDFS) and processing
system MapReduce
Why?
Faster Development (increases productivity 10x), Flexible,
Express data transformation tasks in just a few lines of code
Don’t reinvent the wheel, 10 lines of Pig Latin = ~200 lines of Java
Pig
Workflow
● A LOAD statement reads data from the file system.
● A series of transformation statements process the data.
● A STORE statement writes output to the file system or, a DUMP
statement displays output to the screen.
● Pig first validates the syntax and semantics of all
statements and executes them only when it encounters a DUMP or
STORE statement.
Pig
The whole Picture
Pig
Pig Latin
● Pig Latin is a dataflow language --> allows users to describe how data
from one or more inputs should be read, processed and stored to one or
more outputs in parallel.
● Data flows can be:
○ Linear: as in the word count example
○ Complex: multiple inputs are joined and data is split into
multiple streams to be processed by different operators
● Pig Latin script describes a directed acyclic graph (DAG) where the edges
are data flows and the nodes are operators that process the data
● Pig Latin has no if statements or for loops (= it focuses on data flow)
○ Traditional procedural and OO programming languages describe control
flow; data flow is a side effect of the program.
Pig
Running Pig / Starting Grunt
● Pig supports local mode: useful for prototyping and debugging Pig
Latin scripts. Test on small data and move to large data.
● Pig also runs in mapreduce mode: it does parsing, checking and
planning locally, but executes MapReduce jobs on Hadoop cluster (it
needs to know where NameNode and JobTracker are located).
You can execute Pig Latin statements:
● Using command line / Grunt shell
● In local mode or mapreduce mode (to
interact with HDFS on your cluster)
● Either interactively or in batch
● Embedded Pig
Pig
Data types: scalar and complex
Pig
Schemas
● Pig eats everything - a lax attitude towards schemas
● If a schema for the data is available, Pig will use it
● If a schema for the data is not available, Pig will process the data and
make its best guesses (based on how the script treats the data)
Pig
Commands
Pig
Word count example
Pig
User Defined Functions (UDF)
● Benefits
○ Use legacy code
○ Use library in scripting language
○ Leverage Hadoop for non-Java programmers
● Extensible Interface
○ Minimum effort to support another language
● Currently supported languages
○ Python
○ JavaScript
○ Ruby
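A Python UDF for Pig is an ordinary function with a declared output schema. The sketch below assumes Pig's CPython streaming mode, where the `outputSchema` decorator is imported from `pig_util`. The function name `normalise` is hypothetical, and the fallback decorator exists only so the file can be run and tested outside Pig. Check the Pig documentation for the version in use before relying on these details.

```python
try:
    # Provided by Pig when the script runs as a CPython streaming UDF.
    from pig_util import outputSchema
except ImportError:
    # Fallback for running or testing this file outside Pig (illustration only).
    def outputSchema(schema):
        def decorator(func):
            func.output_schema = schema
            return func
        return decorator


@outputSchema("word:chararray")
def normalise(word):
    """Lower-case a token and strip surrounding punctuation; None stays None."""
    if word is None:
        return None
    return word.strip(".,;:!?\"'()").lower()


print(normalise("Hello,"))
```

In a Pig Latin script, a file such as this (say `udfs.py`, a made-up name) would typically be registered with a `REGISTER ... USING streaming_python AS myfuncs;` statement and then called as `myfuncs.normalise(word)` inside a FOREACH.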
Pig
DataFu
● DataFu is a collection of user-defined functions for working with large-scale data
in Hadoop and Pig.
● This library was born out of the need for a stable, well-tested library of UDFs for
data mining and statistics.
● Used at LinkedIn in many off-line workflows for data-derived products like
"People You May Know" and "Skills". It contains functions for:
○ PageRank
○ Statistics (e.g. quantiles, median, variance, etc.)
○ Sampling (e.g. weighted, reservoir, etc.)
○ Convenience bag functions (e.g. enumerating items)
○ Convenience utility functions (e.g. assertions, etc.)
○ Set operations (intersect, union)
Pig
ABC Radio Stations and Toilets example
● We have a list of local ABC Radio
stations in Australia
● We have a list of all public toilets
across Australia
● We want to find the closest toilet to each
radio station
Demonstration of:
● Data Schemas
● Use of external libraries
● Google Maps API
https://github.com/tomaszbednarz/pig-abc-toilets
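The core of the problem, finding the nearest toilet for a station, can be sketched independently of Pig. The version below is a brute-force Python illustration using the haversine great-circle distance. The records and coordinates are made up, the field names are hypothetical, and the linked repository may solve the problem differently (it uses Pig with external libraries).

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def closest(station, toilets):
    """Return the toilet record nearest to the given station record."""
    return min(
        toilets,
        key=lambda t: haversine_km(station["lat"], station["lon"], t["lat"], t["lon"]),
    )


# Made-up records for illustration only.
stations = [{"name": "Station A", "lat": -27.47, "lon": 153.02}]
toilets = [
    {"name": "Toilet 1", "lat": -27.48, "lon": 153.03},
    {"name": "Toilet 2", "lat": -33.87, "lon": 151.21},
]

for station in stations:
    print(station["name"], "->", closest(station, toilets)["name"])
```

The brute-force search compares every station with every toilet, which is the kind of pairwise work that a parallel data flow can spread across a cluster.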
Apache Spark
Fast, general engine for large-scale data processing and analysis
• Open source, developed at UC Berkeley
• Written in Scala (functional programming language that runs in a JVM)
• Key Concepts
• Avoid the data bottleneck by distributing data when it is stored
• Bring the processing to the data
• Data is stored in memory
• Improves efficiency through (up to 100x faster):
– In-memory computing primitives
– General computation graphs
• Improves usability through:
– Rich APIs in Java, Scala, Python
– Interactive shell in Python, Scala
– Up to 2-10x less code
API
Spark
Cluster Computing
• Spark Standalone
• YARN
• Mesos
Storage
HDFS
Apache Spark
RDD (Resilient Distributed Dataset)
• RDD (Resilient Distributed Dataset)
• Resilient – if data in memory is lost, it can be recreated
• Distributed – stored in memory across the cluster
• Dataset – initial data can come from a file or be created programmatically
• RDDs are the fundamental unit of data in Spark
• Concept: Resilient Distributed Datasets (RDDs)
– Immutable collections of objects spread across a cluster
– Built through parallel transformations (map, filter, etc)
– Automatically rebuilt on failure
– Controllable persistence (e.g. caching in RAM)
From “Parallel Programming with Spark”
by Matei Zaharia, UC Berkeley
Operations
Two types: transformations and actions
Transformations (e.g. map, filter, groupBy, join, flatMap)
• Lazy operations to build RDDs from other RDDs
Actions (e.g. count, collect, reduce)
• Return a result or write it to storage
From “Parallel Programming with Spark”
by Matei Zaharia, UC Berkeley
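The split between lazy transformations and eager actions can be mimicked without Spark. The toy class below is an assumption-laden sketch, not the RDD implementation: `LazyDataset`, `load_log` and the sample log lines are invented for illustration. Transformations only record what to do, and nothing is read until an action such as `count` runs.

```python
class LazyDataset:
    """Toy stand-in for an RDD: transformations are lazy, actions compute."""

    def __init__(self, compute):
        self._compute = compute   # zero-argument callable yielding the records
        self._persist = False
        self._cached = None

    def _records(self):
        if self._cached is not None:
            return self._cached
        records = list(self._compute())
        if self._persist:
            self._cached = records
        return records

    # Transformations: return a new dataset, do no work yet.
    def map(self, fn):
        return LazyDataset(lambda: (fn(r) for r in self._records()))

    def filter(self, fn):
        return LazyDataset(lambda: (r for r in self._records() if fn(r)))

    def cache(self):
        self._persist = True
        return self

    # Actions: trigger the computation.
    def count(self):
        return len(self._records())

    def collect(self):
        return list(self._records())


reads = []  # records how often the "file" is actually read


def load_log():
    reads.append(1)
    return [
        "ERROR\tdisk\tfoo failed",
        "INFO\tnet\tok",
        "ERROR\tnet\tbar timeout",
        "ERROR\tdisk\tfoo again",
    ]


lines = LazyDataset(load_log)
errors = lines.filter(lambda s: s.startswith("ERROR"))
messages = errors.map(lambda s: s.split("\t")[2]).cache()
reads_before_action = len(reads)   # still 0: nothing has been computed yet

foo = messages.filter(lambda s: "foo" in s).count()
bar = messages.filter(lambda s: "bar" in s).count()
print(foo, bar, len(reads))  # prints: 2 1 1
```

The source is read once even though two actions ran, because the second action reuses the cached `messages`. Dropping the `cache()` call would cause a second read, since the chain of transformations is simply replayed from the source.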
RDDs can hold any type of element:
- Primitive types:
- Integers, characters, strings, etc.
- Sequence types:
- Lists, arrays, dicts, etc.
- Scala/Java Objects
- Mixed types
Apache Spark
API
http://www.slideshare.net/frodriguezolivera/apache-spark-41601032
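# `spark` is the SparkContext (named `sc` in the interactive shells)
lines = spark.textFile("hdfs://...")                      # Base RDD
errors = lines.filter(lambda s: s.startswith("ERROR"))    # Transformed RDD
messages = errors.map(lambda s: s.split('\t')[2])         # third tab-separated field
messages.cache()                                          # keep messages in memory
messages.filter(lambda s: "foo" in s).count()             # Action
messages.filter(lambda s: "bar" in s).count()
. . .
[Diagram: the Driver sends tasks to three Workers and receives results; each Worker reads one block (Block 1, Block 2, Block 3) and holds one cache (Cache 1, Cache 2, Cache 3).]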
Result: full-text search of Wikipedia in <1 sec
(vs 20 sec for on-disk data)
Result: scaled to 1 TB data in 5-7 sec
(vs 170 sec for on-disk data)
Example: Mining Console Logs
Load error messages from a log into memory, then interactively search for
patterns
From “Parallel Programming with Spark”
by Matei Zaharia, UC Berkeley
The QI group has developed an algorithm to
extract significant frames
• A single 30-day trip produces 720 hours or
180 GB of video footage; single-CPU
processing takes about 9 hours
We developed Sparkle
• Prototype integration of Spark and OpenCV
• Video reduction tool on top of Sparkle
Results
• Processing (reduction) of 256 x 0.5 GB =
128 GB of video files on bragg with SPARK-HPC
• Resources requested: 128 nodes with 4
processes per node = 512 CPU cores
• Execution time: 137 s
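The figures on this slide allow a rough throughput comparison. The calculation below is a back-of-the-envelope sketch that assumes the single-CPU figure (180 GB in about 9 hours) and the cluster run (128 GB in 137 s) used the same algorithm on comparable footage and cores, which the slide does not state.

```python
# Figures taken from the slide
single_cpu_gb = 180.0
single_cpu_hours = 9.0
cluster_gb = 128.0
cluster_seconds = 137.0
cluster_cores = 512

single_rate = single_cpu_gb / (single_cpu_hours * 3600)  # GB/s on one CPU
cluster_rate = cluster_gb / cluster_seconds              # GB/s on the cluster
speedup = cluster_rate / single_rate                     # throughput ratio
efficiency = speedup / cluster_cores                     # fraction of ideal linear scaling

print(f"single CPU: {single_rate * 1000:.1f} MB/s")
print(f"cluster:    {cluster_rate * 1000:.0f} MB/s")
print(f"speedup:    {speedup:.0f}x on {cluster_cores} cores ({efficiency:.0%} of linear)")
```

Under these assumptions the cluster run processes data roughly 168 times faster than a single CPU, about a third of ideal linear scaling on 512 cores.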
Automated Big Video Analysis
Integrated video camera systems have been installed on fishing boats to trial
24/7 fishery monitoring of tuna longline operations in Australia.
WebVR: Virtual Reality in Web Browsers
collaboration with NIST (Sandy Ressler)
SPARK-HPC
SPARK-HPC is an open-source adapter for running Spark on PBS
clusters
Well suited for compute and memory intensive applications (e.g.,
large scale machine learning)
Enables Spark computation on CSIRO HPC clusters including bragg
(128 Dual Xeon 8-core E5-2650 nodes with 384 Kepler Tesla K20
GPUs)
Open-source see: https://github.com/csirobigdata/spark-hpc
Status on CSIRO HPC Clusters:
Needs to be migrated to SLURM and redeployed
www.bdva.net
For even more discussions
Directions
• Connect Big Data and Science
• Infrastructure
• Data Provenance
• How to link data centers together
• Visual Analytics
• Real time data processing
• Internet of Things
• Art + Science: communication
• Spark + GPUs
http://devblogs.nvidia.com/parallelforall/bidmach-machine-learning-limit-gpus/
Thank you
CONTACT Tomasz Bednarz
E: tomasz.bednarz@csiro.au
T: (07) 3833 5544
CSIRO DIGITAL PRODUCTIVITY FLAGSHIP

Weitere ähnliche Inhalte

Was ist angesagt?

Big Data Science: Intro and Benefits
Big Data Science: Intro and BenefitsBig Data Science: Intro and Benefits
Big Data Science: Intro and BenefitsChandan Rajah
 
Big Data for Ag (2019)
Big Data for Ag (2019)Big Data for Ag (2019)
Big Data for Ag (2019)Benjamin Wielgosz
 
Predictive Analytics - Big Data & Artificial Intelligence
Predictive Analytics - Big Data & Artificial IntelligencePredictive Analytics - Big Data & Artificial Intelligence
Predictive Analytics - Big Data & Artificial IntelligenceManish Jain
 
Jobs Complexity
Jobs ComplexityJobs Complexity
Jobs Complexitysuresh sood
 
Building Data Science Teams, Abbreviated
Building Data Science Teams, AbbreviatedBuilding Data Science Teams, Abbreviated
Building Data Science Teams, AbbreviatedAllen Day, PhD
 
Bigdata analytics
Bigdata analyticsBigdata analytics
Bigdata analyticsSwarnaLatha177
 
On Big Data Analytics - opportunities and challenges
On Big Data Analytics - opportunities and challengesOn Big Data Analytics - opportunities and challenges
On Big Data Analytics - opportunities and challengesPetteri Alahuhta
 
MAKING SENSE OF IOT DATA W/ BIG DATA + DATA SCIENCE - CHARLES CAI
MAKING SENSE OF IOT DATA W/ BIG DATA + DATA SCIENCE - CHARLES CAIMAKING SENSE OF IOT DATA W/ BIG DATA + DATA SCIENCE - CHARLES CAI
MAKING SENSE OF IOT DATA W/ BIG DATA + DATA SCIENCE - CHARLES CAIBig Data Week
 
Big Data Landscape 2018
Big Data Landscape 2018Big Data Landscape 2018
Big Data Landscape 2018Leanne Hwee
 
Datapreneurs
DatapreneursDatapreneurs
Datapreneurssuresh sood
 
Big Data & Data Mining
Big Data & Data MiningBig Data & Data Mining
Big Data & Data MiningMd Mizanur Rahman
 
AI & Big Data Analytics : Innovation trends and use cases
AI & Big Data Analytics : Innovation trends and use casesAI & Big Data Analytics : Innovation trends and use cases
AI & Big Data Analytics : Innovation trends and use casesSarvesh Kumar
 
From Science to Data: Following a principled path to Data Science
From Science to Data: Following a principled path to Data ScienceFrom Science to Data: Following a principled path to Data Science
From Science to Data: Following a principled path to Data ScienceInstitute of Contemporary Sciences
 
How to design ai functions to the cloud native infra
How to design ai functions to the cloud native infraHow to design ai functions to the cloud native infra
How to design ai functions to the cloud native infraChun Myung Kyu
 
Top 8 Data Science Tools | Open Source Tools for Data Scientists | Edureka
Top 8 Data Science Tools | Open Source Tools for Data Scientists | EdurekaTop 8 Data Science Tools | Open Source Tools for Data Scientists | Edureka
Top 8 Data Science Tools | Open Source Tools for Data Scientists | EdurekaEdureka!
 
Big Data Introduction
Big Data IntroductionBig Data Introduction
Big Data IntroductionTiago Knoch
 

Was ist angesagt? (20)

Big Data Science: Intro and Benefits
Big Data Science: Intro and BenefitsBig Data Science: Intro and Benefits
Big Data Science: Intro and Benefits
 
Big Data for Ag (2019)
Big Data for Ag (2019)Big Data for Ag (2019)
Big Data for Ag (2019)
 
Predictive Analytics - Big Data & Artificial Intelligence
Predictive Analytics - Big Data & Artificial IntelligencePredictive Analytics - Big Data & Artificial Intelligence
Predictive Analytics - Big Data & Artificial Intelligence
 
Jobs Complexity
Jobs ComplexityJobs Complexity
Jobs Complexity
 
Big Data
Big DataBig Data
Big Data
 
Building Data Science Teams, Abbreviated
Building Data Science Teams, AbbreviatedBuilding Data Science Teams, Abbreviated
Building Data Science Teams, Abbreviated
 
Bigdata analytics
Bigdata analyticsBigdata analytics
Bigdata analytics
 
Data Activities in Austria
Data Activities in AustriaData Activities in Austria
Data Activities in Austria
 
On Big Data Analytics - opportunities and challenges
On Big Data Analytics - opportunities and challengesOn Big Data Analytics - opportunities and challenges
On Big Data Analytics - opportunities and challenges
 
Big data analysis
Big data analysisBig data analysis
Big data analysis
 
MAKING SENSE OF IOT DATA W/ BIG DATA + DATA SCIENCE - CHARLES CAI
MAKING SENSE OF IOT DATA W/ BIG DATA + DATA SCIENCE - CHARLES CAIMAKING SENSE OF IOT DATA W/ BIG DATA + DATA SCIENCE - CHARLES CAI
MAKING SENSE OF IOT DATA W/ BIG DATA + DATA SCIENCE - CHARLES CAI
 
Big Data Landscape 2018
Big Data Landscape 2018Big Data Landscape 2018
Big Data Landscape 2018
 
Datapreneurs
DatapreneursDatapreneurs
Datapreneurs
 
Big Data & Data Mining
Big Data & Data MiningBig Data & Data Mining
Big Data & Data Mining
 
AI & Big Data Analytics : Innovation trends and use cases
AI & Big Data Analytics : Innovation trends and use casesAI & Big Data Analytics : Innovation trends and use cases
AI & Big Data Analytics : Innovation trends and use cases
 
From Science to Data: Following a principled path to Data Science
From Science to Data: Following a principled path to Data ScienceFrom Science to Data: Following a principled path to Data Science
From Science to Data: Following a principled path to Data Science
 
Big Data Analytics
Big Data AnalyticsBig Data Analytics
Big Data Analytics
 
How to design ai functions to the cloud native infra
How to design ai functions to the cloud native infraHow to design ai functions to the cloud native infra
How to design ai functions to the cloud native infra
 
Top 8 Data Science Tools | Open Source Tools for Data Scientists | Edureka
Top 8 Data Science Tools | Open Source Tools for Data Scientists | EdurekaTop 8 Data Science Tools | Open Source Tools for Data Scientists | Edureka
Top 8 Data Science Tools | Open Source Tools for Data Scientists | Edureka
 
Big Data Introduction
Big Data IntroductionBig Data Introduction
Big Data Introduction
 

Andere mochten auch

Towards Visual Analytics to Improve Scientific Reading & Writing
Towards Visual Analytics to Improve Scientific Reading & WritingTowards Visual Analytics to Improve Scientific Reading & Writing
Towards Visual Analytics to Improve Scientific Reading & WritingDuygu Bektik
 
PhD Mini Viva Talk
PhD Mini Viva Talk PhD Mini Viva Talk
PhD Mini Viva Talk Duygu Bektik
 
Visual Analytics Best Practices
Visual Analytics Best PracticesVisual Analytics Best Practices
Visual Analytics Best PracticesTableau Software
 
Business analytics
Business analyticsBusiness analytics
Business analyticsSilla Rupesh
 
Building Big Data Analytics Center Of Excellence
Building Big Data Analytics Center Of Excellence Building Big Data Analytics Center Of Excellence
Building Big Data Analytics Center Of Excellence Dr. Mohan K. Bavirisetty
 

Andere mochten auch (10)

Towards Visual Analytics to Improve Scientific Reading & Writing
Towards Visual Analytics to Improve Scientific Reading & WritingTowards Visual Analytics to Improve Scientific Reading & Writing
Towards Visual Analytics to Improve Scientific Reading & Writing
 
PhD Mini Viva Talk
PhD Mini Viva Talk PhD Mini Viva Talk
PhD Mini Viva Talk
 
OpenVX 1.1 Reference Guide
OpenVX 1.1 Reference GuideOpenVX 1.1 Reference Guide
OpenVX 1.1 Reference Guide
 
OpenGL SC 2.0 Quick Reference
OpenGL SC 2.0 Quick ReferenceOpenGL SC 2.0 Quick Reference
OpenGL SC 2.0 Quick Reference
 
OpenCL 2.1 Reference Guide
OpenCL 2.1 Reference GuideOpenCL 2.1 Reference Guide
OpenCL 2.1 Reference Guide
 
WebGL 2.0 Reference Guide
WebGL 2.0 Reference GuideWebGL 2.0 Reference Guide
WebGL 2.0 Reference Guide
 
Visual Analytics Best Practices
Visual Analytics Best PracticesVisual Analytics Best Practices
Visual Analytics Best Practices
 
Vulkan 1.0 Quick Reference
Vulkan 1.0 Quick ReferenceVulkan 1.0 Quick Reference
Vulkan 1.0 Quick Reference
 
Business analytics
Business analyticsBusiness analytics
Business analytics
 
Building Big Data Analytics Center Of Excellence
Building Big Data Analytics Center Of Excellence Building Big Data Analytics Center Of Excellence
Building Big Data Analytics Center Of Excellence
 

Platform for Big Data Analytics and Visual Analytics: CSIRO use cases. February 2015

  • 1. CSIRO DIGITAL PRODUCTIVITY FLAGSHIP Platform for Big Data Analytics and Visual Analytics: CSIRO Use Cases Tomasz Bednarz | Research Team Leader 23rd February 2015 | Statistical Modelling and Analysis of Big Data Workshop 2015 The ARC Centre of Excellence in Mathematical and Statistical Frontiers in Big Data, Big Models and New Insights Project Team: Piotr Szul, Yulia Arzhaeva, Luke Domanski, Ryan Lagerstrom, Surya Nepal, John Zic, John Taylor
  • 2. Platform for Big Data Analytics and Visual Analytics. CSIRO Computational Simulation Sciences TCP project, Digital Productivity Flagship. Dual use of the platform: • Support and foster a community around Big Data processing and visualisation • Provide computing tools and services supporting CSIRO-specific Big Data analytics needs. What will the tools be: • Facility (software + hardware) • Portable VM or container image (run everywhere)
  • 3. Definition: The Platform is a software solution stack (on hardware infrastructure) that supports development of big data analytics and visual analytics applications. It is: • Scalable: given appropriate hardware, it can scale to petabytes of data and thousands of nodes. • Universal: it can be deployed on a variety of computational platforms (clouds, HPC clusters, dedicated clusters) and can use GPGPUs transparently. • Integrated: it is integrated with relevant CSIRO systems (e.g. Digital Access Portal, Bowen Clouds).
  • 4. Why? Isn’t Big Data a solved problem? Can’t we just install the most popular software and be done with it? No. For CSIRO, it is more complex: • Science and commercial users have different sets of needs • CSIRO = many disciplines/applications = different tool requirements • CSIRO = diverse large-scale storage facilities, discipline-specific/optimised data cubes, HPC parallel storage systems • CSIRO = diverse set of compute infrastructure
  • 5. Understanding: What have we been doing? What does Big Data Analytics mean to science? Big data software survey and analysis; R big data package survey and analysis. Conceptual platform design: • Planning layered architecture – Big-picture view: available software, CSIRO infrastructure + science • Plan of attack. Assessment of user requirements: • User and project group outreach • Workshop questionnaires and abstracts
  • 6. Goals + Progress: Tools to empower scientists. Understand: • Big Data Analytics in science? • Scientist and CSIRO-specific needs • Tools and software landscape. Big Picture Design: • Forest from the trees • Layering: general to specific, extensible, clear boundaries/responsibilities/interfaces • Portable and interoperable: share nothing/minimum, technology adapters, diverse infrastructure, diverse applications, extensible. Refine Design + Implementation (plan of attack): • Driven by real business/use cases
  • 7. What is “Big Data” processing? “Python is like the jazz movement in machine learning to R is like classical music.”
  • 8. Understand: Big Data definition vs discussion? Definition: a collection of data sets so large and complex that it becomes difficult to process using traditional data processing applications (Wikipedia). Simple, right? But the private sector has the loudest Big Data voice. Most popular tools and resources lean heavily towards: • Unstructured data, a high number of small, loosely related data elements • Hadoop, HDFS, NoSQL, Hadoop, HDFS, NoSQL, Hadoop, HDFS… etc.
  • 9. Understand: Big Data: where does science fit? Some science problems fit the commercial mold. Many don’t, e.g.: • Highly regular and structured data samples • Single large datasets of tightly coupled samples • Streaming data from sensors • Getting data from domain-specific data cubes. The right tools do exist, they are just not as visible in the community: • Which ones do we need? • How do we integrate them with popular tools? Can we still use commercially driven tools for science problems that break the mold?
  • 10. Understand: Analytics definition vs discussion? Definition: the discovery and communication of meaningful patterns in data (Wikipedia). Wow, that’s broad! But the commercial world has the loudest voice again: • Analytics = [predictions] used to recommend action or to guide decision making rooted in business context (Wikipedia). Fortunately, this requires tools that are also commonly used in science: • data modeling, machine learning, optimization algorithms, visualisation etc.
  • 11. Understand: Scientist and CSIRO-specific needs. The punters want this! Who is REALLY doing Big Data? What are their needs? • Applications/tools – Linear algebra? Machine learning? Image processing? Text/pattern matching/mining? • Data – Streaming vs persistent + (static || dynamic)? Unstructured vs structured? SQL vs NoSQL vs text vs binary? • Human workflow – Prototype vs production, exploratory vs directed, interactive vs batch – Scale code + tools from interactive + prototype => production + batch. What will they need to work on? How much can we support? • CSIRO infrastructure: storage + compute – Where is (should be) the data? Don’t move it! – What/where is the compute? • Possible? Transparency + interoperability + portability over infrastructure – HPC + internal cloud + dedicated system
  • 12. Understand: Tools and Software Landscape. What is out there? What delivers our scientists’ requirements? Does it support CSIRO infrastructure? How does it all fit together? • Inter-layer: does product X work with product Y? • Intra-layer: can data stored by A be easily abstracted/ingested by B?
  • 13. Design: Seeing the Forest from the Trees. Data A, B, C + Infrastructure 1, 2, 4 + Tool/Software α, β, γ + Science App/Domain l, m, n. How to deal with complexity: 1. Define the Forest 2. Map the Trees to the Forest 3. Pick which Trees to keep/use
  • 14. Design: Seeing the Forest from the Trees
  • 15. Understand: Big Data versus HPC. Big Data: petabytes; storage of low-value data; H/W failure common; code: frequency, graphs, machine learning, rendering; ingress/egress problems; dense storage of data; mix CPU and data; spindle:core ratio. HPC: petaflops; storage for checkpointing; surprised by H/W failure; code: simulation, rendering; less persistent data, ingress and egress; dense compute; CPU + GPU; bandwidth to other servers. • Failure is inevitable → fault tolerance built in • Bandwidth and I/O are precious → topology-aware scheduling • Linear scalability → massive parallelisation, minimal communication • Hide the complexities from developers → expressive programming model
  • 16. Vision • Become a Big Data Excellence Centre with the vision/mission to be a hub for big data analytics and processing technology and to provide technical expertise in this area. • Achieve a step change in the size of big data problems being tackled in CSIRO. • Decrease the effort and time required for CSIRO to discover new patterns in massive datasets. • Simplify scientists’ workflows with big data sets. • Develop solution architectures and software components to support specific needs of big data processing and visualisation in CSIRO. • Deliver a CSIRO shared “big data facility” supporting integration and processing of data from different data sources. That would be more of an infrastructure project, built together with IM&T (Bowen Clouds) for certain types of in-house big data processing scenarios.
  • 17. Mission • Connect data analytics, simulations, statistical modeling, image & video analytics, machine learning, visualisation into one stack of reusable solutions supporting various science domains. • Build more interactive solutions that connect users with analytical models to improve business decisions. • Create new business cases.
  • 18. Success factors • Uptake of the technology in CSIRO, transforming the way we do science. • Contribution to Big Data Science globally. • International collaborations. • Enable new discoveries. • Reduce time to new discovery. • Global outreach. • External grants, engagements with industry.
  • 19. Visual Analytics • Data discovery • Quantitative visualisation focus: • Measurement on visualisation • Uncertainty - from data to display • Integration • Interaction • Views of the data • Collaboration across virtual environments • Annotated 3D videos • Augmented Reality • Immersive Virtual Reality • Wearables + Visual Analytics. RAVE @ NIST/USA
  • 20. Platform for Big Data and Visual Analytics. Our project is oriented towards incremental, use-case-driven development of technical capabilities, including skills, software and infrastructure, to facilitate scientists’ access to big data processing. Come talk to us! https://wiki.csiro.au/display/bigdata/PBDAVA+Collaboration
  • 21. DP Big Data Cluster. The DP Research Big Data Cluster is a dedicated hardware cluster intended both to support big data related computer science research and to provide experimental big data processing capabilities for scientific projects within DP. Funded from CAPEX and built in collaboration with IM&T. Deployed on Bowen Cloud. 16 nodes, each with: • 128 GB RAM and 16 CPU cores • InfiniBand network • ~100 TB of storage (planned). Various storage options being considered: OSM/NFS, HDFS, GPFS+FPO. YARN cluster (CDH5): Hadoop MR, Spark, H2O … (any YARN-compatible framework). Status: storage testing. For more see: https://wiki.csiro.au/display/ICTCRC/DP+Research+Big+Data+Cluster
  • 22. DP Big Data Cluster - Architecture [diagram]. Components shown: edge node (clients, compiler, staging, monitor); master nodes (YARN master, HDFS master); worker nodes (YARN worker, HDFS worker); hosts hadoop1-01-cdc, hadoop1-02-cdc, hadoop1-{03..16}-cdc; InfiniBand network; storage (OSM/NFS, GPFS, DAS, Bowen Storage); Bowen Compute; Nexus (authentication); Ganglia (monitoring); CSIRO intranet workstations; Bragg, Pearcey.
  • 23. Hadoop: What is it? ● Apache Hadoop is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. ● Designed to scale up from single servers to thousands of machines, each offering local computation and storage. ● Designed to detect and handle failures at the application layer. http://hadoop.apache.org
  • 24. Hadoop: Components ● Hadoop components: ○ Hadoop Distributed File System (HDFS) ○ MapReduce ● Handles any data type: ○ Structured ○ Unstructured ○ Schema ○ No schema ○ High volume ○ Low volume
  • 25. Hadoop: Hadoop Distributed File System ● Breaks incoming files into blocks and stores them redundantly across the cluster ● A single large file is split into blocks, and the blocks are distributed among the nodes ● Blocks in HDFS are large, typically 128 MB in size ● Files in HDFS are ‘write once’ (no random writes allowed) and are processed by the MR framework; results are stored back in HDFS ● The original data file is not modified during its lifecycle
  • 26. Hadoop: HDFS ● Data replication (to enhance reliability and availability): the default is threefold ● HDFS is optimised for large, streaming reads of files (rather than random reads) ● A master node, the NameNode, keeps track (as metadata) of the blocks that make up a file and their locations
  • 27. Hadoop: Example ● NameNode holds metadata for files ● DataNodes hold the actual blocks
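The block and replica bookkeeping described on the HDFS slides can be sketched in plain Python. This is an illustration only, not HDFS code: the function `place_blocks`, the round-robin placement policy and the DataNode names are assumptions made for the example (real HDFS placement is rack-aware). Only the 128 MB block size and the threefold replication come from the slides.

```python
import math

BLOCK_SIZE_MB = 128   # typical HDFS block size (from the slides)
REPLICATION = 3       # default replication factor (from the slides)


def place_blocks(file_size_mb, datanodes, block_size_mb=BLOCK_SIZE_MB,
                 replication=REPLICATION):
    """Return NameNode-style metadata: block index -> list of DataNodes.

    Hypothetical round-robin placement, chosen only to keep the sketch short.
    """
    if replication > len(datanodes):
        raise ValueError("need at least as many DataNodes as replicas")
    n_blocks = math.ceil(file_size_mb / block_size_mb)
    metadata = {}
    for block in range(n_blocks):
        # Each replica of a block goes to a different node.
        metadata[block] = [datanodes[(block + r) % len(datanodes)]
                           for r in range(replication)]
    return metadata


nodes = ["dn1", "dn2", "dn3", "dn4"]   # hypothetical DataNode names
meta = place_blocks(1000, nodes)       # a 1000 MB file
print(len(meta), "blocks")             # 8 blocks (7 full + 1 partial)
print(meta[0])                         # ['dn1', 'dn2', 'dn3']
```

The dictionary plays the role of the NameNode metadata; the DataNodes would hold the block contents themselves.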
  • 28. MapReduce: Word count example • Map: reads the text one line at a time, splits each line into separate words, and for each word outputs the word and a 1 to indicate it has seen the word once. • Shuffle: uses the word as the key, hashing the records to reducers. • Reduce: sums up the number of times each word was seen and writes that count together with the word as output.
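The three phases of the word count can be mimicked in a single Python process. This is a minimal sketch of the model, not Hadoop code: the function names and the use of Python's built-in `hash()` as the partitioner are assumptions for illustration.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit (word, 1) for every word in one line of text."""
    return [(word, 1) for word in line.split()]


def shuffle_phase(pairs, n_reducers):
    """Shuffle: route each record to a reducer, using the word as the key."""
    partitions = [defaultdict(list) for _ in range(n_reducers)]
    for word, count in pairs:
        # hash() stands in for the framework's partitioner
        partitions[hash(word) % n_reducers][word].append(count)
    return partitions


def reduce_phase(partition):
    """Reduce: sum the counts collected for each word."""
    return {word: sum(counts) for word, counts in partition.items()}


def word_count(lines, n_reducers=2):
    pairs = [pair for line in lines for pair in map_phase(line)]
    result = {}
    for partition in shuffle_phase(pairs, n_reducers):
        result.update(reduce_phase(partition))
    return result


counts = word_count(["the quick brown fox", "the lazy dog", "the fox"])
print(counts["the"], counts["fox"], counts["dog"])   # 3 2 1
```

Because every occurrence of a word hashes to the same reducer, each reducer can sum its words independently of the others.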
  • 29. Big Volume Processing • Architectures: ○ Shared nothing ○ Traditional: compute + storage • Parallel file systems: ○ HDFS, GPFS + FPO ○ S3, Swift, Lustre, Gluster • Processing: ○ Out of core (MapReduce) ○ In memory • Scheduling: ○ Yarn, Mesos • A programming model and an associated implementation for processing and generating large data sets with a parallel, distributed algorithm on a cluster + a parallel filesystem • [Diagram labels] Models: MapReduce Model, DAG Model, Graph Model, BSP/Collective Model • Frameworks: Twister, Hadoop, MPI, Dryad, Spark, Giraph, Hama, GraphLab, Harp, GraphX, HaLoop, Stratosphere, Reef • Iterative
  • 30. Pig: Philosophy ● Pigs eat anything ○ Input data can come in any format; popular formats, such as tab-delimited, are natively supported. Users can add functions to support other data formats. ○ Operates on data that is relational, nested, semi-structured, or unstructured ● Pigs live anywhere ● Pigs are domestic animals ● Pigs fly ○ Pig processes data quickly.
  • 31. Pig: What is it? ● Pig provides an engine for executing data flows in parallel on Hadoop ● Pig includes a language called Pig Latin for expressing data flows ● Pig Latin includes operators for many of the traditional data operations (so they need not be re-invented, as in plain Hadoop): JOIN, ORDER BY (sort), FILTER, FOREACH, GROUP, LOAD and STORE. ● Pig makes use of the Hadoop Distributed File System (HDFS) and the MapReduce processing system • Why? ○ Faster development (increases productivity 10x) ○ Flexible ○ Express data transformation tasks in just a few lines of code ○ Don’t reinvent the wheel: 10 lines of Pig Latin = ~200 lines of Java
  • 32. Pig: Workflow ● A LOAD statement reads data from the file system. ● A series of transformation statements processes the data. ● A STORE statement writes output to the file system, or a DUMP statement displays output on the screen. ● Pig first validates the syntax and semantics of all statements, and executes them only when it encounters a DUMP or STORE statement.
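The deferred execution in this workflow (statements are recorded, and work happens only at DUMP or STORE) can be illustrated with a small Python class. This is a toy model under stated assumptions: `LazyFlow` and its methods are hypothetical names, not Pig's API, and the up-front syntax validation that Pig performs is not modelled.

```python
class LazyFlow:
    """Records a data flow; nothing runs until dump() or store() is called."""

    def __init__(self, loader, steps=()):
        self._loader = loader        # callable returning an iterable of records
        self._steps = tuple(steps)   # recorded transformations, not yet applied

    @classmethod
    def load(cls, records):
        # LOAD: remember where the data comes from, do not read it yet.
        return cls(lambda: iter(records))

    def filter(self, predicate):
        return LazyFlow(self._loader, self._steps + (("filter", predicate),))

    def foreach(self, func):
        return LazyFlow(self._loader, self._steps + (("foreach", func),))

    def _execute(self):
        data = self._loader()
        for kind, func in self._steps:
            data = filter(func, data) if kind == "filter" else map(func, data)
        return list(data)

    def dump(self):
        # DUMP: trigger execution and show the result.
        result = self._execute()
        print(result)
        return result

    def store(self, sink):
        # STORE: trigger execution; a list stands in for the output file.
        sink.extend(self._execute())


calls = []


def double(x):
    calls.append(x)   # side effect lets us observe when work happens
    return 2 * x


flow = LazyFlow.load([1, 2, 3, 4]).filter(lambda x: x % 2 == 0).foreach(double)
print(calls)          # [] because nothing has executed yet
flow.dump()           # prints [4, 8]
```

Deferring execution this way gives the engine the whole flow to plan before any data is touched.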
  • 33. Pig: The whole picture
  • 34. Pig: Pig Latin ● Pig Latin is a dataflow language: it allows users to describe how data from one or more inputs should be read, processed and stored to one or more outputs in parallel. ● Data flows can be: ○ Linear: as in the word count example ○ Complex: multiple inputs are joined, or data is split into multiple streams to be processed by different operators ● A Pig Latin script describes a directed acyclic graph (DAG) where the edges are data flows and the nodes are operators that process the data ● Pig Latin has no if statements or for loops (it focuses on data flow) ○ Traditional procedural and OO programming languages describe control flow; data flow is a side effect of the program.
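To make the DAG view concrete, a complex flow (two inputs joined, then split into two streams) can be written down as a graph and ordered with Python's standard library. The operator names below are hypothetical labels invented for this sketch; Pig's own planner is not shown.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Each operator maps to the set of operators whose output it consumes.
# Nodes are operators, edges are data flows.
flow = {
    "load_users":   set(),
    "load_clicks":  set(),
    "join":         {"load_users", "load_clicks"},   # multiple inputs joined
    "filter_au":    {"join"},                        # split: stream 1
    "filter_other": {"join"},                        # split: stream 2
    "store_au":     {"filter_au"},
    "store_other":  {"filter_other"},
}

# static_order() yields every operator after all of its inputs;
# it raises graphlib.CycleError if the graph is not acyclic.
order = list(TopologicalSorter(flow).static_order())
print(order)
```

Any order produced this way is a valid execution order, and operators with no path between them (the two filter streams here) could run in parallel.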
  • 35. Pig: Running Pig / Starting Grunt ● Pig supports local mode: useful for prototyping and debugging Pig Latin scripts. Test on small data, then move to large data. ● Pig also runs in mapreduce mode: it does parsing, checking and planning locally, but executes MapReduce jobs on the Hadoop cluster (it needs to know where the NameNode and JobTracker are located). • You can execute Pig Latin statements: ● Using the command line / Grunt shell ● In local mode or mapreduce mode (to interact with HDFS on your cluster) ● Either interactively or in batch ● Embedded Pig
  • 36. Pig: Data types (scalar and complex)
  • 37. Pig: Schemas ● Pig eats everything: it has a lax attitude towards schemas ● If a schema for the data is available, Pig will use it ● If a schema is not available, Pig will still process the data, making its best guesses based on how the script treats the data
  • 38. Pig: Commands
  • 39. Pig: Word count example
  • 40. Pig: User Defined Functions (UDF) ● Benefits ○ Use legacy code ○ Use libraries in scripting languages ○ Leverage Hadoop for non-Java programmers ● Extensible interface ○ Minimum effort to support another language ● Currently supported languages ○ Python ○ JavaScript ○ Ruby
  • 41. Pig: DataFu ● DataFu is a collection of user-defined functions for working with large-scale data in Hadoop and Pig. ● This library was born out of the need for a stable, well-tested library of UDFs for data mining and statistics. ● It is used at LinkedIn in many off-line workflows for data-derived products like "People You May Know" and "Skills". • It contains functions for: ○ PageRank ○ Statistics (e.g. quantiles, median, variance) ○ Sampling (e.g. weighted, reservoir) ○ Convenience bag functions (e.g. enumerating items) ○ Convenience utility functions (e.g. assertions) ○ Set operations (intersect, union)
  • 42. Pig: ABC Radio Stations and Toilets example ● We have a list of local ABC Radio stations in Australia ● We have a list of all public toilets across Australia ● We want to find the closest toilet to a radio station • Demonstration of: ● Data schemas ● Use of external libraries ● Google Maps API • https://github.com/tomaszbednarz/pig-abc-toilets
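The core of the "closest toilet" question is a nearest-neighbour search over latitude/longitude pairs. The sketch below shows one plausible way to do it in plain Python with the haversine great-circle distance and a brute-force scan. It is not taken from the linked repository, which uses Pig; the function names and the sample coordinates are made up for illustration.

```python
import math


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points."""
    r = 6371.0  # mean Earth radius in km
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2)
    return 2 * r * math.asin(math.sqrt(a))


def closest(station, toilets):
    """Return the (name, lat, lon) toilet nearest to the station."""
    _, slat, slon = station
    return min(toilets, key=lambda t: haversine_km(slat, slon, t[1], t[2]))


# Made-up sample records: (name, latitude, longitude)
station = ("Station A", -27.47, 153.03)
toilets = [
    ("Toilet 1", -27.50, 153.00),
    ("Toilet 2", -27.47, 153.04),
    ("Toilet 3", -33.87, 151.21),
]
print(closest(station, toilets)[0])   # Toilet 2
```

A brute-force scan compares every station with every toilet; on a cluster the same comparison would typically be expressed as a cross join followed by a per-station minimum.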
  • 43. Apache Spark • Fast, general engine for large-scale data processing and analysis • Open source, developed at UC Berkeley • Written in Scala (a functional programming language that runs on the JVM) • Key concepts: ○ Avoid the data bottleneck by distributing data when it is stored ○ Bring the processing to the data ○ Data is stored in memory • Improves efficiency (up to 100x faster) through: ○ In-memory computing primitives ○ General computation graphs • Improves usability through: ○ Rich APIs in Java, Scala, Python ○ Interactive shell in Python, Scala ○ 2-10x less code • [Stack diagram] API • Spark • Cluster Computing: Spark Standalone, YARN, Mesos • Storage: HDFS
  • 44. Apache Spark: RDD (Resilient Distributed Dataset) • Resilient: if data in memory is lost, it can be recreated • Distributed: stored in memory across the cluster • Dataset: initial data can come from a file or be created programmatically • RDDs are the fundamental unit of data in Spark • Concept: Resilient Distributed Datasets (RDDs) ○ Immutable collections of objects spread across a cluster ○ Built through parallel transformations (map, filter, etc.) ○ Automatically rebuilt on failure ○ Controllable persistence (e.g. caching in RAM) • From “Parallel Programming with Spark” by Matei Zaharia, UC Berkeley
  • 45. Operations • Two types: transformations and actions • Transformations (e.g. map, filter, groupBy, join, flatMap): ○ Lazy operations to build RDDs from other RDDs • Actions (e.g. count, collect, reduce): ○ Return a result or write it to storage • RDDs can hold any type of element: ○ Primitive types: integers, characters, strings, etc. ○ Sequence types: lists, arrays, dicts, etc. ○ Scala/Java objects ○ Mixed types • From “Parallel Programming with Spark” by Matei Zaharia, UC Berkeley
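The split between lazy transformations and eager actions, together with rebuilding lost data from lineage, can be imitated in a few lines of single-process Python. `MiniRDD` is a toy stand-in written for this sketch, not the Spark API: it has no partitions and no cluster, and `evict()` is a hypothetical helper that simulates losing the cached copy.

```python
class MiniRDD:
    """Toy, single-process stand-in for an RDD (not the Spark API)."""

    def __init__(self, compute):
        self._compute = compute   # lineage: how to (re)build the elements
        self._cache = None
        self._cached = False

    @classmethod
    def parallelize(cls, items):
        items = list(items)
        return cls(lambda: iter(items))

    # Transformations: lazy, each returns a new MiniRDD.
    def map(self, f):
        return MiniRDD(lambda: (f(x) for x in self._iterate()))

    def filter(self, pred):
        return MiniRDD(lambda: (x for x in self._iterate() if pred(x)))

    def cache(self):
        self._cached = True       # materialise on first action, then reuse
        return self

    def evict(self):
        self._cache = None        # simulate losing the in-memory copy

    def _iterate(self):
        if self._cached:
            if self._cache is None:
                self._cache = list(self._compute())   # rebuild from lineage
            return iter(self._cache)
        return self._compute()

    # Actions: force evaluation and return a result to the caller.
    def collect(self):
        return list(self._iterate())

    def count(self):
        return sum(1 for _ in self._iterate())


lines = MiniRDD.parallelize([
    "ERROR\tdb\tfoo failed",
    "INFO\tweb\tok",
    "ERROR\tweb\tbar timeout",
])
messages = (lines.filter(lambda s: s.startswith("ERROR"))
                 .map(lambda s: s.split("\t")[2])
                 .cache())
print(messages.filter(lambda s: "foo" in s).count())   # 1
messages.evict()                                       # cached data is lost
print(messages.collect())                              # ['foo failed', 'bar timeout']
```

Each object stores only the recipe for its data, which is why a lost cache can be recomputed without any extra bookkeeping.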
  • 46. Apache Spark: API • http://www.slideshare.net/frodriguezolivera/apache-spark-41601032
  • 47. Example: Mining Console Logs • Load error messages from a log into memory, then interactively search for patterns
      lines = spark.textFile("hdfs://...")                       # Base RDD
      errors = lines.filter(lambda s: s.startswith("ERROR"))     # Transformed RDD
      messages = errors.map(lambda s: s.split('\t')[2])          # third tab-separated field
      messages.cache()
      messages.filter(lambda s: "foo" in s).count()              # Action
      messages.filter(lambda s: "bar" in s).count()
      # . . .
    [Diagram: the Driver sends tasks to three Workers and receives results; each Worker holds one block (Block 1-3) and one cache (Cache 1-3)]
    • Result: full-text search of Wikipedia in <1 sec (vs 20 sec for on-disk data) • Result: scaled to 1 TB data in 5-7 sec (vs 170 sec for on-disk data) • From “Parallel Programming with Spark” by Matei Zaharia, UC Berkeley
  • 48. Automated Big Video Analysis • Integrated video camera systems have been installed on fishing boats to trial 24/7 fishery monitoring of tuna longline operations in Australia. • QI group has developed an algorithm to extract significant frames ○ A single 30-day trip produces 720 hours or 180 GB of video footage; single-CPU processing takes about 9 hours • We developed Sparkle ○ Prototype integration of Spark and OpenCV ○ Video reduction tool on top of Sparkle • Results ○ Processing (reduction) of 256 x 0.5 GB = 128 GB of video files on bragg with SPARK-HPC ○ Resources requested: 128 nodes with 4 processes per node = 512 CPU cores ○ Execution time: 137 s
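A back-of-the-envelope comparison of the figures on this slide can be worked through as follows. This is a rough estimate, not a measured result: it assumes the single-CPU throughput for the 180 GB trip applies unchanged to the 128 GB cluster run, which the slide does not state.

```python
# Figures quoted on the slide
trip_gb = 180.0            # footage from one 30-day trip
single_cpu_hours = 9.0     # single-CPU processing time for that footage
cluster_gb = 256 * 0.5     # 128 GB processed on bragg
cluster_seconds = 137.0
cores = 128 * 4            # 512 CPU cores

# Assumption: single-CPU throughput is the same for both data sets.
single_rate = trip_gb / (single_cpu_hours * 3600)   # GB/s on one CPU
est_single_seconds = cluster_gb / single_rate       # same data on one CPU
speedup = est_single_seconds / cluster_seconds
efficiency = speedup / cores

print(f"estimated single-CPU time: {est_single_seconds / 3600:.1f} h")    # 6.4 h
print(f"speedup: {speedup:.0f}x, parallel efficiency: {efficiency:.0%}")  # 168x, 33%
```

Under that assumption the 137 s run corresponds to roughly a 168x speedup over one CPU.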
  • 49. WebVR: Virtual Reality in Web Browsers • Collaboration with NIST (Sandy Ressler)
  • 50. SPARK-HPC • SPARK-HPC is an open-source adapter for running Spark on PBS clusters • Well suited for compute- and memory-intensive applications (e.g. large-scale machine learning) • Enables Spark computation on CSIRO HPC clusters, including bragg (128 dual Xeon 8-core E5-2650 nodes with 384 Kepler Tesla K20 GPUs) • Open source, see: https://github.com/csirobigdata/spark-hpc • Status on CSIRO HPC clusters: needs to be migrated to SLURM and redeployed
  • 51. www.bdva.net
  • 52. For even more discussions • Directions: • Connect Big Data and Science • Infrastructure • Data Provenance • How to link data centers together • Visual Analytics • Real time data processing • Internet of Things • Art + Science: communication • Spark + GPUs: http://devblogs.nvidia.com/parallelforall/bidmach-machine-learning-limit-gpus/
  • 53. Thank you CONTACT Tomasz Bednarz E: tomasz.bednarz@csiro.au T: (07) 3833 5544 CSIRO DIGITAL PRODUCTIVITY FLAGSHIP

Editor's notes

  1. You write a single program, similar to DryadLINQ. Distributed data sets with parallel operations on them are pretty standard; the new thing is that they can be reused across ops. Variables in the driver program can be used in parallel ops; accumulators are useful for sending information back, and cached vars are an optimization. Mention that cached vars are useful for some workloads that won’t be shown here. Mention that it’s all designed to be easy to distribute in a fault-tolerant fashion.
  2. Key idea: add “variables” to the “functions” in functional programming