Simon Metson discusses computing needs for large scientific experiments and how universities can address big data challenges. He describes using NoSQL tools for petabyte datasets from particle physics experiments and landslide modeling. Universities will need big data resources to support data-intensive research across fields and may build internal teams or outsource to cloud providers to manage increasingly large and complex datasets.
2. Outline
• Who’s this guy?
• Computing for the LHC experiments
• NoSQL tools
• Landslide modelling
• Issues arising for Universities
3. Who am I?
• Until March I was a research associate
working on computing for CMS - one of
the LHC experiments at CERN
• In March I began transitioning to
working for Cloudant as an “ecology
engineer”
• Have dealt with multi-petabyte datasets
for the last 10 years
13. People are important
• Be nice to people working on weekends
• The “cost” of a person is one place you
can make savings - e.g. by giving them
the ability to do more
• Building a suitable team is hard, takes
time and is essential for success
14. Observation
• What's interesting is that big data isn't
interesting any more
• Unless you are of a similar scale to
Google you don't need to write your
own system
• Doesn't mean it's easy, though!
• A terabyte was quite a lot 10 years ago,
now it’s commodity hardware
18. When all you have is a hammer
everything looks like a nail
19. General NoSQL
observations
• Good for startups with limited resources
and exposure to risk
• Good for large companies who build
data centres with lots of loosely related
data and large DevOps teams
• How does this fit with University
researchers?
20. LHC computing
evolution
• Our current system works, but at a high
staff cost
• Expect simplification of system, retire
bespoke components in favour of
generic tools
23. Use cases
• Needs to be usable by geographically
dispersed, non-expert field engineers
• Need expert approval step
• Need to be accessible on low end
hardware
• Need to run 1000s of simulations per
slope/storm and analyse result data
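The "expert approval step" can be sketched as a status field on a simulation-request document, in the document-database style the talk is about. This is an illustrative sketch only: the document shape, status names, and `advance` helper are assumptions, not the talk's actual system.

```python
# Hypothetical approval workflow for a simulation-request document,
# in the JSON-document style of a store like CouchDB/Cloudant.
# All field and status names are made up for illustration.

VALID_TRANSITIONS = {
    "submitted": {"approved", "rejected"},   # field engineer submits
    "approved": {"running"},                 # expert signs off
    "running": {"complete", "failed"},
}

def advance(doc, new_status, actor):
    """Move a request document to a new status, recording who did it."""
    current = doc["status"]
    if new_status not in VALID_TRANSITIONS.get(current, set()):
        raise ValueError(f"cannot go from {current!r} to {new_status!r}")
    doc["status"] = new_status
    doc.setdefault("history", []).append({"status": new_status, "by": actor})
    return doc

request = {"_id": "slope-17-storm-03", "status": "submitted"}
advance(request, "approved", actor="expert@example.org")
advance(request, "running", actor="scheduler")
```

Because the state lives in the document itself, the approval step works for geographically dispersed users on low-end hardware: any client that can read and write the document can participate.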
29. Aside: Complexity
• The above is for one storm, simulate
many
• Can easily have more stochastic
parameters, or vary them in a more fine
grained manner
• May want to compare across software
versions - standard datasets
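The combinatorics behind this slide can be made concrete with a small sketch: a few stochastic parameters, each sampled at a handful of values, multiplied across storms. The parameter names and values here are invented for illustration, not taken from the landslide model.

```python
# Why the run count explodes: storms x stochastic parameter combinations.
# Parameter names/values are hypothetical.
import itertools

storms = ["storm-a", "storm-b"]
parameters = {
    "cohesion_kpa": [5, 10, 15],
    "friction_angle_deg": [25, 30, 35],
    "water_table_m": [0.5, 1.0, 1.5, 2.0],
}

def runs(storms, parameters):
    """Yield one simulation configuration per storm x parameter combination."""
    names = list(parameters)
    for storm in storms:
        for values in itertools.product(*(parameters[n] for n in names)):
            yield {"storm": storm, **dict(zip(names, values))}

configs = list(runs(storms, parameters))
# 2 storms x (3 * 3 * 4) parameter combinations = 72 runs already;
# adding a parameter or sampling more finely multiplies this further.
```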
31. Use cases
• Needs to be usable by geographically
dispersed, non-expert field engineers
• Need expert approval step
• Need to be accessible on low end
hardware
• Need to run 1000s of simulations per
slope/storm and analyse result data
• Impossible scale with current tools/manpower
36. Big Data and
Universities
• Data intensive research will become the
norm (already is in many fields)
• Universities will need access to Big
Data resources
• Expect significant use from
nontraditional fields
• Expect new fields to emerge
37. Workflow ladder
(Diagram reconstructed as a list; the number of users increases towards the bottom of the ladder.)
• Large datasets (>100 TB), complex computation
• Large datasets (>100 TB), simple computation
  } Use Grid compute and storage exclusively
• Shared datasets (>500 GB), complex computation
• Shared datasets (10-500 GB), complex computation
• Shared datasets (10-100 GB), simple computation
  } Work on departmental resources, store resulting datasets to Grid storage
• Shared datasets (0.1-10 GB), simple computation
• Private datasets (0.1-10 GB), simple computation
  } Work on laptop/desktop machines, store resulting datasets to local or Grid storage
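One way to read the ladder is as a decision rule mapping a workload onto a resource tier. The thresholds below follow the slide's rungs; the function itself is an illustrative sketch, not software from the talk.

```python
# Hypothetical reading of the workflow ladder as a tier-selection rule.
# Sizes in GB; thresholds taken from the slide's rungs.

def tier(dataset_gb, shared, complex_compute):
    """Pick a compute/storage tier for a workload."""
    if dataset_gb > 100_000:                         # >100 TB
        return "Grid compute and storage exclusively"
    if complex_compute or (shared and dataset_gb > 10):
        return "departmental resources, results to Grid storage"
    return "laptop/desktop, results to local or Grid storage"
```

For example, a shared 50 GB dataset with simple computation lands on departmental resources, while a private 5 GB dataset stays on a laptop.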
39. Implications for the
future
• Quality and scale of Big Data resource
will have direct impact on ability of
Universities to do research at
international level
• Universities will need to provide data
intensive compute resources to
complement traditional HPC
40. Implications for the
future
• Big data clusters are architecturally
very different from HPC clusters
• Data is stateful, and so harder to
manage than HPC workloads
• Interesting legal issues arise
41. Implications for the
future
• Cost savings from SaaS vendors can
be hard to realise at an institute level
• Building DevOps teams is not
something University funding easily
supports
42. Summary
• Big data is mainstream
• Should be seen as an enabling
technology for academics
• Not trivial to adopt
• Universities need to build up teams to
support these activities, or find ways to
outsource