1. Petabyte Scale Data Challenge
- Worldwide LHC Computing Grid
ASGC/Jason Shih
Computex, Jun 2nd, 2010
2. Outline
Objectives & Milestones
WLCG experiment and ASGC Tier-1 Center
Petabyte Scale Challenge
Storage Management System
System Architecture, Configuration and
Performance
3. Objectives
Building a sustainable research and collaboration infrastructure
Supporting e-Science research on data-intensive sciences and applications that require cross-disciplinary, distributed collaboration
4. ASGC Milestone
Operational since the deployment of LCG0 in 2002
ASGC CA established in 2005 (IGTF accredited the same year)
Tier-1 Center responsibility started in 2005
The federated Taiwan Tier-2 center (Taiwan Analysis Facility, TAF) is also collocated at ASGC
Representative of the EGEE e-Science Asia Federation since joining EGEE in 2004
Providing Asia Pacific Regional Operation Center (APROC) services to the region-wide WLCG/EGEE production infrastructure since 2005
Initiated the Avian Flu Drug Discovery Project in collaboration with EGEE in 2006
Started the EUAsiaGrid Project in April 2008
5. LHC First Beam – Computing at the Petascale
ATLAS: General purpose, pp, heavy ions
CMS: General purpose, pp, heavy ions
ALICE: Heavy ions, pp
LHCb: B-physics, CP violation
7. Standard Cosmology
Good model from 0.01 sec after the Big Bang
Energy, density, temperature
Supported by considerable observational evidence
Elementary Particle Physics
From the Standard Model into the unknown: towards energies of 1 TeV and beyond, the Terascale
Towards Quantum Gravity
From the unknown into the unknown...
http://www.damtp.cam.ac.uk/user/gr/public/bb_history.html
(Slide credit: Jamie Shiers, CERN; UNESCO Information Preservation debate, April 2007)
8. WLCG Timeline
First beam on LHC: Sep 10, 2008
Severe incident after ~3 weeks of operation (3.5 TeV)
9. Max CERN/T1-ASGC Point-to-Point
Inbound: 9.3 Gbps
ASGC - Introduction
1. Most reliable T1: 98.83%
2. Very highly performing and most stable site in CCRC08
Asia Pacific Regional Operation Center
A worldwide grid infrastructure:
>250 sites, 48 countries
>68,000 CPUs, >25 PetaBytes
>10,000 users, >200 VOs
>150,000 jobs/day
Best Demo Award of EGEE'07
Grid Application Platform: lightweight problem-solving framework
Avian Flu Drug Discovery
Large Hadron Collider (LHC)
10. Collaborating e-Infrastructures
TWGRID
EUAsiaGrid
Potential for linking ~80 countries
“Production” =
Reliable, sustainable, with commitments to quality of service
11. WLCG Computing Model
- The Tier Structure
Tier-0 (CERN)
Data recording
Initial data reconstruction
Data distribution
Tier-1 (11 countries)
Permanent storage
Re-processing
Analysis
Tier-2 (~130 centres)
Simulation
End-user analysis
12. Enabling Grids for E-sciencE
Archeology
Astronomy
Astrophysics
Civil Protection
Comp. Chemistry
Earth Sciences
Finance
Fusion
Geophysics
High Energy Physics
Life Sciences
Multimedia
Material Sciences
…
13. Why Petabyte? Challenges
Why Petabyte?
Experiment computing models
Comparison with conventional data management
Challenges
Performance: LAN and WAN activities
Sufficient bandwidth to the CPU farm
Eliminate uplink bottlenecks between switch tiers (see the sketch below)
Fast response to critical events
Fabric infrastructure & service level agreements
Scalability and manageability
Robust DB engine (Oracle RAC)
Knowledge base and adequate administration (training)
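The uplink concern above is basic capacity arithmetic. A minimal sketch, assuming illustrative per-node and uplink figures (none of these are ASGC's actual numbers):

```python
# Back-of-envelope check for the uplink bottleneck between switch tiers.
# All figures are illustrative assumptions, not ASGC's configuration.
GBPS_PER_WORKER = 1.0         # assumed GE link per worker node
WORKERS_PER_EDGE_SWITCH = 40  # assumed blades behind one edge switch
UPLINKS_PER_EDGE = 2          # assumed 10GE uplinks per edge switch
UPLINK_GBPS = 10.0

edge_demand = GBPS_PER_WORKER * WORKERS_PER_EDGE_SWITCH  # 40 Gbps worst case
uplink_capacity = UPLINKS_PER_EDGE * UPLINK_GBPS         # 20 Gbps
print(f"Edge demand {edge_demand:.0f} Gbps vs uplink {uplink_capacity:.0f} Gbps")
print(f"Oversubscription {edge_demand / uplink_capacity:.1f}:1")
```

Keeping this ratio close to 1:1 is what "eliminating the uplink bottleneck" amounts to in practice.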
17. WLCG Tier-1
- Defined Minimum Levels of Service
Response time refers to the maximum delay before taking action.
Mean time to repair the service is also crucial, but is covered indirectly through the required availability targets.
18. WLCG MoU & ASGC Resource Level
- Pledged Resources and Projection
Year      CPU (HEP-SPEC06)  Disk (PB)  Tape (PB)
End 2009  29.5K             2.6        2.4
MoU 2009  20K               3.0        3.0
MoU 2010  28K               3.5        3.5

[Chart: installed CPU (kSI2k) and disk/tape capacity (TB) vs. MoU pledges, 2005-2010]
19. Data Management System
CASTOR V1
CERN Advanced STORage manager
Satisfactorily served tens of thousands of requests/day per TB of disk cache
Limitation: 1M files in cache
Tape movement API not flexible
CASTOR V2
Database-centric architecture
Scheduling features
GSI and Kerberos
Resource management
Resource handling
20. CASTOR Configurations
- Current Infrastructure
Shared core services
Serving: ATLAS and CMS
Services:
Stager, NS, DLF, Repack, and LSF
DB clusters
Two DB clusters (SRM and NS)
5 services (DBs) split across the two clusters
5 Oracle instances
Total capacity: 0.63 PB and 0.7 PB for CMS and ATLAS respectively
Current usage: 63% and 44% for CMS and ATLAS
21. CASTOR Configurations (cont'd)
- Disk Cache
Disk pools & servers
Performance (IOPS):
With 0.5 kB IO size: 76.4k read and 54k write IOPS respectively
Both decrease slightly (~9%) when increasing the IO size to 4 kB
(an illustrative measurement sketch follows below)
80 disk servers (+6 more online by the end of the 3rd week of Oct)
Total capacity: 1.67 PB (0.3 PB allocated dynamically)
Current usage: 0.79 PB (~58%)
14 disk pools (8 for ATLAS, 3 for CMS, and another three for bio, SAM, and dynamic allocation)
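For context on the IOPS figures, a minimal random-read benchmark sketch. The test file path is hypothetical, and the slide's numbers came from the site's own tests, not from this script; page cache effects are also ignored here:

```python
# Minimal random-read IOPS sketch (illustrative only).
# Assumes an existing large test file on the disk server under test.
import os
import random
import time

PATH = "/castor-pool/testfile"  # hypothetical test file location
IO_SIZE = 512                   # 0.5 kB, as on the slide; try 4096 for 4 kB
DURATION = 10.0                 # seconds to run

fd = os.open(PATH, os.O_RDONLY)
size = os.fstat(fd).st_size
ops, deadline = 0, time.time() + DURATION
while time.time() < deadline:
    offset = random.randrange(0, size - IO_SIZE)
    os.pread(fd, IO_SIZE, offset)  # one random read of IO_SIZE bytes
    ops += 1
os.close(fd)
print(f"{ops / DURATION:.0f} read IOPS at {IO_SIZE} B")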
22. Disk Pool Configuration
- T1 MSS (CASTOR)
[Bar chart: installed vs. free capacity (TB, 0-450) and number of disk servers (0-16) per disk pool, covering the ATLAS, CMS, biomed, dteam, and standby pools listed on the next slide]
23. Distribution of Free Capacity
- Per Disk Server vs. per Pool
[Bar chart: free capacity (TB, 0-250) per disk pool: Standby, dteamD0T0, cmsWANOUT, cmsPrdD1T0, cmsLTD0T1, biomedD1T0, atlasStage, atlasScratchDisk, atlasPrdD1T0, atlasPrdD0T1, atlasMCTAPE, atlasMCDISK, atlasHotDisk, atlasGROUPDISK; an aggregation sketch follows below]
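The per-pool view in the chart is a roll-up of per-server free space. A minimal sketch of that aggregation, using made-up sample rows (server names and figures are hypothetical, not ASGC inventory):

```python
# Aggregate per-server free space into the per-pool view the chart shows.
from collections import defaultdict

# (disk server, pool, free TB) -- hypothetical sample rows
servers = [
    ("ds001", "atlasMCDISK", 4.2),
    ("ds002", "atlasMCDISK", 1.1),
    ("ds003", "cmsPrdD1T0",  7.9),
    ("ds004", "Standby",     16.0),
]

free_per_pool = defaultdict(float)
for _server, pool, free_tb in servers:
    free_per_pool[pool] += free_tb

for pool, free_tb in sorted(free_per_pool.items()):
    print(f"{pool:16s} {free_tb:6.1f} TB free")
```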
24. Storage Server Generation
- Drive vs. Total Capacity
[Scatter plot: number of RAID subsystems vs. total capacity per storage server generation: 6 units/238 TB, 18 units/235.5 TB, 23 units/683 TB, 37 units/741 TB]
25. CASTOR Configurations (cont'd)
- Core Service Overview

Service Type  OS Level         Release   Remark
Core          SLC 4.7/x86-64   2.1.7-19  Stager/NS/DLF
SRM           SLC 4.7/x86-64   2.7-18    3 head nodes
Disk Svr.     SLC 4.7/x86-64   2.1.7-19  80 in Q3 2009 (20+ in Q4)
Tape Svr.     SLC 4.7/32 + 64  2.1.8-8   x86-64 OS deployed
26. CASTOR Configurations (cont'd)
- CMS Disk Cache: Current Resource Level

Space Token/  Capacity/   Disk     Tape Pool/
Disk Pool     Job Limit   Servers  Capacity
cmsLTD0T1     278TB/488   9        *
cmsPrdD1T0    284TB/1560  13       -
cmsWanOut     72TB/220    4        -

* Depends on tape family.
27. CASTOR Configurations (cont'd)
- ATLAS Disk Cache: Current Resource Level

Space Token       Cap/Job Limit  Disk Servers  Tape Pool/Cap.
atlasMCDISK       163TB/790      8             -
atlasMCTAPE       38TB/80        2             atlasMCtp/39TB
atlasPrdD1T0      278TB/810      15            -
atlasPrdD0T1      61TB/210       3             atlasPrdtp/105TB
atlasGROUPDISK    19TB/40        1             -
atlasScratchDisk  28TB/80        1             -
atlasHotDisk      2TB/40         2             -
Total             950TB/1835     46            -
28. IDC Collocation
Facility installation completed on Mar 27th
Tape system delayed until after Apr 9th
Realignment
RMA of faulty parts
29. Storage Farm
~110 RAID subsystems deployed since 2003
Supporting both Tier-1 and Tier-2 storage fabric
DAS connections to front-end blade servers
Flexible switching of front-end servers based on performance requirements
4-8 Gb Fibre Channel connectivity
30. CASTOR Configurations (cont'd)
- Tape Pool

Tape Pool       Capacity (TB)/Usage  Drive Dedication  LTO3/4 Mixed
atlasMCtp       8.98/40%             N                 Y
atlasPrdtp      101/65%              N                 Y
cmsCSA08cruzet  15.6/46%             N                 N
cmsCSA08reco    5/0%                 N                 N
cmsCSAtp        639/99%              N                 Y
cmsLTtp         34.4/44%             N                 N
dteamTest       3.5/1%               N                 N
31. MSS Monitoring Services
Standard Nagios probes
NRPE + customized plugins (a plugin sketch follows below)
SMS to OSE/SM for all types of critical alarms
Availability metrics
Tape metrics (SLS)
Throughput, capacity & scheduler per VO and disk pool
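A minimal sketch of what such a customized NRPE plugin can look like, using the standard Nagios exit-code convention; the mount point and thresholds are assumptions, not ASGC's actual probe:

```python
#!/usr/bin/env python
# Sketch of a customized NRPE check: free fraction of a disk pool mount,
# reported with standard Nagios exit codes (0 OK, 1 WARN, 2 CRIT, 3 UNKNOWN).
import os
import sys

POOL_PATH = "/castor-pool"  # hypothetical mount point for one disk pool
WARN, CRIT = 0.15, 0.05     # free-space fractions (assumed thresholds)

try:
    st = os.statvfs(POOL_PATH)
except OSError as exc:
    print(f"UNKNOWN - cannot stat {POOL_PATH}: {exc}")
    sys.exit(3)

free_frac = st.f_bavail / st.f_blocks
msg = f"{POOL_PATH} {free_frac:.1%} free"
if free_frac < CRIT:
    print(f"CRITICAL - {msg}")
    sys.exit(2)
elif free_frac < WARN:
    print(f"WARNING - {msg}")
    sys.exit(1)
print(f"OK - {msg}")
sys.exit(0)
```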
32. MSS Tape System
- Expansion/Upgrade Planning
Before the incident:
8 LTO3 + 4 LTO4 drives
720 TB with LTO3
530 TB with LTO4
May 2009:
Two LTO3 drives
MES: 6 LTO4 drives at the end of May
Capacity: 1.3 PB (old, LTO3/4 mixed) + 0.8 PB (LTO4)
New S54 model introduced mid-2009
2K slots with the tiered model (see the capacity sketch below)
Required:
ALMS upgrade
Enhanced gripper
MES Q3 2009:
18 LTO4 drives
HA implementation resumed in Q4
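The capacity figures follow from standard native LTO cartridge sizes (LTO3 0.4 TB, LTO4 0.8 TB); a minimal sketch of the slot/media arithmetic, with the fill assumption called out:

```python
# Rough library-capacity arithmetic behind the slide's figures.
# Native (uncompressed) capacities are standard LTO values; assuming
# every slot holds one cartridge, which real deployments rarely do.
LTO_NATIVE_TB = {"LTO3": 0.4, "LTO4": 0.8}

def library_capacity_tb(slots, generation):
    """Total native capacity if every slot holds one cartridge."""
    return slots * LTO_NATIVE_TB[generation]

# e.g. the 2K-slot S54 frame fully populated with LTO4 media:
print(library_capacity_tb(2000, "LTO4"), "TB")  # 1600.0 TB, i.e. ~1.6 PB
```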
33. Expansion Planning
2008
0.5 PB expansion of the tape system in Q2
Met MoU target mid-Nov
1.3 MSI2k per rack based on recent E5450 processors
2009 Q1
150 SMP/QC blade servers
RAID subsystems considering 2 TB per drive
42 TB net capacity per chassis, 0.75 PB in total
2009 Q3-4
18 LTO4 drives: mid-Oct
330 Xeon QC (SMP, Intel 5450) blade servers
2nd phase tape MES: 5 LTO4 drives + HA
3rd phase tape MES: 6 LTO4 drives
ETA for 0.8 PB expansion delivery: mid-Nov
34. Computing/Storage System Infrastructure
[Network diagram: Data Center - C3 Archive Room. ASGC CASTOR2 disk farm and CASTOR2 tape servers; 20 x Quanta blades (WN); core services (CE, RB, DPM, PX, BDII, etc.); BladeCenter chassis with 64 x IBM HS20 and 142 x IBM HS21 blades (WN); DC SMR 48V/100A power with batteries. Uplinks: 2 x GE (LX) to 4F M160 (links to HK, JP Tier-2s); 2 x GE (LX) to 4F TaipeiGigaPoP-7609 (links to TW Tier-2s); 4 x GE (SX) to the ASGC distribution switch in Rack#49 (links to Tier-1 servers).]
35. Throughput of WLCG Experiments
Throughput defined as job efficiency x number of running jobs (a worked example follows below)
Characteristics of the 4 LHC experiments show that inefficiency is largely due to poor coding
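Spelled out, the metric is simply efficiency times running jobs; a tiny worked example with illustrative figures (not measurements from the slide):

```python
# The slide's throughput metric: effective throughput is job efficiency
# (roughly CPU time / wall time) times the number of running jobs.
def effective_throughput(job_efficiency, running_jobs):
    return job_efficiency * running_jobs

# 4,000 running jobs at 85% efficiency vs. 70% (e.g. poorly coded I/O):
print(effective_throughput(0.85, 4000))  # 3400.0 effective job slots
print(effective_throughput(0.70, 4000))  # 2800.0 effective job slots
```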
37. Summary
Deployed a highly scalable DM system and a performance-driven storage infrastructure
Eliminated possible complexity of the SRM abstraction layer
Resource utilization, provisioning and optimization
From PoC to production, the challenges remain:
Data Challenge, Service Challenge, CCRC08, STEP09, etc.
The motivation is also clear for medical, climate, and cosmological applications
Operation-wise:
Robust database setup
Knowledge base for fabric infrastructure operation
Fast enough event processing and documentation
Looking beyond the data management use cases in WLCG:
commonality with many other disciplines on the EGEE infrastructure
active participation in e-Science collaboration within the region