Simon Metson discusses computing needs for large scientific experiments and how universities can address big data challenges. He describes using NoSQL tools for petabyte datasets from particle physics experiments and landslide modeling. Universities will need big data resources to support data-intensive research across fields and may build internal teams or outsource to cloud providers to manage increasingly large and complex datasets.
2. Outline
• Who’s this guy?
• Computing for the LHC experiments
• NoSQL tools
• Landslide modelling
• Issues arising for Universities
3. Who am I?
• Until March I was a research associate
working on computing for CMS - one of
the LHC experiments at CERN
• In March I began transitioning to
working for Cloudant as an “ecology
engineer”
• Have dealt with multi-petabyte datasets
for the last 10 years
13. People are important
• Be nice to people working on weekends
• The “cost” of a person is one place you
can make savings - e.g. by giving them
the ability to do more
• Building a suitable team is hard, takes
time and is essential for success
14. Observation
• What's interesting is that big data isn't
interesting any more
• Unless you are of a similar scale to
Google you don't need to write your
own system
• Doesn't mean it's easy, though!
• A terabyte was quite a lot 10 years ago,
now it’s commodity hardware
18. When all you have is a hammer
everything looks like a nail
19. General NoSQL
observations
• Good for startups with limited resources
and exposure to risk
• Good for large companies who build
data centres with lots of loosely related
data and large DevOps teams
• How does this fit with University
researchers?
20. LHC computing
evolution
• Our current system works, but at a high
staff cost
• Expect simplification of system, retire
bespoke components in favour of
generic tools
23. Use cases
• Needs to be usable by geographically
dispersed, non-expert field engineers
• Need expert approval step
• Need to be accessible on low end
hardware
• Need to run 1000s of simulations per
slope/storm and analyse result data
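The "expert approval step" can be sketched as a status field on a simulation-request document, in the document-database style the talk is about. This is an illustrative sketch only: the document shape, status names, and `advance` helper are assumptions, not the talk's actual system.

```python
# Hypothetical approval workflow for a simulation-request document,
# in the JSON-document style of a store like CouchDB/Cloudant.
# All field and status names are made up for illustration.

VALID_TRANSITIONS = {
    "submitted": {"approved", "rejected"},   # field engineer submits
    "approved": {"running"},                 # expert signs off
    "running": {"complete", "failed"},
}

def advance(doc, new_status, actor):
    """Move a request document to a new status, recording who did it."""
    current = doc["status"]
    if new_status not in VALID_TRANSITIONS.get(current, set()):
        raise ValueError(f"cannot go from {current!r} to {new_status!r}")
    doc["status"] = new_status
    doc.setdefault("history", []).append({"status": new_status, "by": actor})
    return doc

request = {"_id": "slope-17-storm-03", "status": "submitted"}
advance(request, "approved", actor="expert@example.org")
advance(request, "running", actor="scheduler")
```

Because the state lives in the document itself, the approval step works for geographically dispersed users on low-end hardware: any client that can read and write the document can participate.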
29. Aside: Complexity
• The above is for one storm, simulate
many
• Can easily have more stochastic
parameters, or vary them in a more fine
grained manner
• May want to compare across software
versions - standard datasets
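The combinatorics behind this slide can be made concrete with a small sketch: a few stochastic parameters, each sampled at a handful of values, multiplied across storms. The parameter names and values here are invented for illustration, not taken from the landslide model.

```python
# Why the run count explodes: storms x stochastic parameter combinations.
# Parameter names/values are hypothetical.
import itertools

storms = ["storm-a", "storm-b"]
parameters = {
    "cohesion_kpa": [5, 10, 15],
    "friction_angle_deg": [25, 30, 35],
    "water_table_m": [0.5, 1.0, 1.5, 2.0],
}

def runs(storms, parameters):
    """Yield one simulation configuration per storm x parameter combination."""
    names = list(parameters)
    for storm in storms:
        for values in itertools.product(*(parameters[n] for n in names)):
            yield {"storm": storm, **dict(zip(names, values))}

configs = list(runs(storms, parameters))
# 2 storms x (3 * 3 * 4) parameter combinations = 72 runs already;
# adding a parameter or sampling more finely multiplies this further.
```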
31. Use cases
• Needs to be usable by geographically
dispersed, non-expert field engineers
• Need expert approval step
• Need to be accessible on low end
hardware
• Need to run 1000s of simulations per
slope/storm and analyse result data
• Impossible scale with current tools/manpower
36. Big Data and
Universities
• Data intensive research will become the
norm (already is in many fields)
• Universities will need access to Big
Data resources
• Expect significant use from
nontraditional fields
• Expect new fields to emerge
37. Workflow ladder
(Diagram reconstructed as a list; the number of users increases towards the bottom of the ladder.)
• Large datasets (>100 TB), complex computation
• Large datasets (>100 TB), simple computation
  } Use Grid compute and storage exclusively
• Shared datasets (>500 GB), complex computation
• Shared datasets (10-500 GB), complex computation
• Shared datasets (10-100 GB), simple computation
  } Work on departmental resources, store resulting datasets to Grid storage
• Shared datasets (0.1-10 GB), simple computation
• Private datasets (0.1-10 GB), simple computation
  } Work on laptop/desktop machines, store resulting datasets to local or Grid storage
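One way to read the ladder is as a decision rule mapping a workload onto a resource tier. The thresholds below follow the slide's rungs; the function itself is an illustrative sketch, not software from the talk.

```python
# Hypothetical reading of the workflow ladder as a tier-selection rule.
# Sizes in GB; thresholds taken from the slide's rungs.

def tier(dataset_gb, shared, complex_compute):
    """Pick a compute/storage tier for a workload."""
    if dataset_gb > 100_000:                         # >100 TB
        return "Grid compute and storage exclusively"
    if complex_compute or (shared and dataset_gb > 10):
        return "departmental resources, results to Grid storage"
    return "laptop/desktop, results to local or Grid storage"
```

For example, a shared 50 GB dataset with simple computation lands on departmental resources, while a private 5 GB dataset stays on a laptop.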
39. Implications for the
future
• Quality and scale of Big Data resource
will have direct impact on ability of
Universities to do research at
international level
• Universities will need to provide data
intensive compute resources to
complement traditional HPC
40. Implications for the
future
• Big data clusters are architecturally
very different from HPC clusters
• Data is stateful, and so harder to
manage than HPC workloads
• Interesting legal issues arise
41. Implications for the
future
• Cost savings from SaaS vendors can
be hard to realise at an institute level
• Building DevOps teams is not
something University funding easily
supports
42. Summary
• Big data is mainstream
• Should be seen as an enabling
technology for academics
• Not trivial to adopt
• Universities need to build up teams to
support these activities, or find ways to
outsource