1. Petabyte Scale Data Challenge
- Worldwide LHC Computing Grid
ASGC/Jason Shih
Computex, Jun 2nd, 2010
2. Outline
Objectives & Milestones
WLCG experiment and ASGC Tier-1 Center
Petabyte Scale Challenge
Storage Management System
System Architecture, Configuration and
Performance
3. Objectives
Building a sustainable research and collaboration infrastructure
Supporting e-Science research on data-intensive sciences and applications that require cross-disciplinary, distributed collaboration
4. ASGC Milestone
Operational since the deployment of LCG0 in 2002
ASGC CA established in 2005 (IGTF accredited the same year)
Tier-1 Center responsibility started in 2005
The federated Taiwan Tier-2 center (Taiwan Analysis Facility, TAF) is also collocated at ASGC
Representative of the EGEE e-Science Asia Federation since joining EGEE in 2004
Providing Asia Pacific Regional Operation Center (APROC) services to the region-wide WLCG/EGEE production infrastructure since 2005
Initiated the Avian Flu Drug Discovery Project in collaboration with EGEE in 2006
Started the EUAsiaGrid Project in April 2008
5. LHC First Beam – Computing at the Petascale
ATLAS: General purpose, pp, heavy ions
CMS: General purpose, pp, heavy ions
ALICE: Heavy ions, pp
LHCb: B-physics, CP violation
7. Standard Cosmology
Good model from 0.01 sec after the Big Bang
Energy, density, temperature
Supported by considerable observational evidence
Elementary Particle Physics
From the Standard Model into the unknown: towards energies of 1 TeV and beyond, the Terascale
Towards Quantum Gravity
From the unknown into the unknown...
http://www.damtp.cam.ac.uk/user/gr/public/bb_history.html
(Slide credit: Jamie Shiers, CERN; UNESCO Information Preservation debate, April 2007)
8. WLCG Timeline
First beam on LHC: Sep 10, 2008
Severe incident after ~3 weeks of operation (3.5 TeV)
9. Max CERN/T1-ASGC Point-to-Point
Inbound: 9.3 Gbps
ASGC - Introduction
1. Most reliable T1: 98.83%
2. Very highly performing and most stable site in CCRC08
Asia Pacific Regional Operation Center
A worldwide grid infrastructure:
>250 sites, 48 countries
>68,000 CPUs, >25 PetaBytes
>10,000 users, >200 VOs
>150,000 jobs/day
Best Demo Award of EGEE'07
Grid Application Platform: lightweight problem-solving framework
Avian Flu Drug Discovery
Large Hadron Collider (LHC)
10. Collaborating e-Infrastructures
TWGRID
EUAsiaGrid
Potential for linking ~80 countries
“Production” =
Reliable, sustainable, with commitments to quality of service
11. WLCG Computing Model
- The Tier Structure
Tier-0 (CERN)
Data recording
Initial data reconstruction
Data distribution
Tier-1 (11 countries)
Permanent storage
Re-processing
Analysis
Tier-2 (~130 centres)
Simulation
End-user analysis
12. Enabling Grids for E-sciencE
Archeology
Astronomy
Astrophysics
Civil Protection
Comp. Chemistry
Earth Sciences
Finance
Fusion
Geophysics
High Energy Physics
Life Sciences
Multimedia
Material Sciences
…
13. Why Petabyte? Challenges
Why Petabyte?
Experiment computing models
Comparison with conventional data management
Challenges
Performance: LAN and WAN activities
Sufficient bandwidth to the CPU farm
Eliminate uplink bottlenecks between switch tiers (see the sketch below)
Fast response to critical events
Fabric infrastructure & service level agreements
Scalability and manageability
Robust DB engine (Oracle RAC)
Knowledge base and adequate administration (training)
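The uplink concern above is basic capacity arithmetic. A minimal sketch, assuming illustrative per-node and uplink figures (none of these are ASGC's actual numbers):

```python
# Back-of-envelope check for the uplink bottleneck between switch tiers.
# All figures are illustrative assumptions, not ASGC's configuration.
GBPS_PER_WORKER = 1.0         # assumed GE link per worker node
WORKERS_PER_EDGE_SWITCH = 40  # assumed blades behind one edge switch
UPLINKS_PER_EDGE = 2          # assumed 10GE uplinks per edge switch
UPLINK_GBPS = 10.0

edge_demand = GBPS_PER_WORKER * WORKERS_PER_EDGE_SWITCH  # 40 Gbps worst case
uplink_capacity = UPLINKS_PER_EDGE * UPLINK_GBPS         # 20 Gbps
print(f"Edge demand {edge_demand:.0f} Gbps vs uplink {uplink_capacity:.0f} Gbps")
print(f"Oversubscription {edge_demand / uplink_capacity:.1f}:1")
```

Keeping this ratio close to 1:1 is what "eliminating the uplink bottleneck" amounts to in practice.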
17. WLCG Tier-1
- Defined Minimum Levels of Service
Response time refers to the maximum delay before taking action.
Mean time to repair the service is also crucial, but is covered indirectly through the required availability targets.
18. WLCG MoU & ASGC Resource Level
- Pledged Resources and Projection
Year      CPU (HEP-SPEC06)  Disk (PB)  Tape (PB)
End 2009  29.5K             2.6        2.4
MoU 2009  20K               3.0        3.0
MoU 2010  28K               3.5        3.5

[Chart: installed CPU (kSI2k) and disk/tape capacity (TB) vs. MoU pledges, 2005-2010]
19. Data Management System
CASTOR V1
CERN Advanced STORage manager
Satisfactorily served tens of thousands of requests/day per TB of disk cache
Limitation: 1M files in cache
Tape movement API not flexible
CASTOR V2
Database-centric architecture
Scheduling features
GSI and Kerberos
Resource management
Resource handling
20. CASTOR Configurations
- Current Infrastructure
Shared core services
Serving: ATLAS and CMS
Services:
Stager, NS, DLF, Repack, and LSF
DB clusters
Two DB clusters (SRM and NS)
5 services (DBs) split across the two clusters
5 Oracle instances
Total capacity: 0.63 PB and 0.7 PB for CMS and ATLAS respectively
Current usage: 63% and 44% for CMS and ATLAS
21. CASTOR Configurations (cont'd)
- Disk Cache
Disk pools & servers
Performance (IOPS):
With 0.5 kB IO size: 76.4k read and 54k write IOPS respectively
Both decrease slightly (~9%) when increasing the IO size to 4 kB
(an illustrative measurement sketch follows below)
80 disk servers (+6 more online by the end of the 3rd week of Oct)
Total capacity: 1.67 PB (0.3 PB allocated dynamically)
Current usage: 0.79 PB (~58%)
14 disk pools (8 for ATLAS, 3 for CMS, and another three for bio, SAM, and dynamic allocation)
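For context on the IOPS figures, a minimal random-read benchmark sketch. The test file path is hypothetical, and the slide's numbers came from the site's own tests, not from this script; page cache effects are also ignored here:

```python
# Minimal random-read IOPS sketch (illustrative only).
# Assumes an existing large test file on the disk server under test.
import os
import random
import time

PATH = "/castor-pool/testfile"  # hypothetical test file location
IO_SIZE = 512                   # 0.5 kB, as on the slide; try 4096 for 4 kB
DURATION = 10.0                 # seconds to run

fd = os.open(PATH, os.O_RDONLY)
size = os.fstat(fd).st_size
ops, deadline = 0, time.time() + DURATION
while time.time() < deadline:
    offset = random.randrange(0, size - IO_SIZE)
    os.pread(fd, IO_SIZE, offset)  # one random read of IO_SIZE bytes
    ops += 1
os.close(fd)
print(f"{ops / DURATION:.0f} read IOPS at {IO_SIZE} B")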
22. Disk Pool Configuration
- T1 MSS (CASTOR)
[Bar chart: installed vs. free capacity (TB, 0-450) and number of disk servers (0-16) per disk pool, covering the ATLAS, CMS, biomed, dteam, and standby pools listed on the next slide]
23. Distribution of Free Capacity
- Per Disk Server vs. per Pool
[Bar chart: free capacity (TB, 0-250) per disk pool: Standby, dteamD0T0, cmsWANOUT, cmsPrdD1T0, cmsLTD0T1, biomedD1T0, atlasStage, atlasScratchDisk, atlasPrdD1T0, atlasPrdD0T1, atlasMCTAPE, atlasMCDISK, atlasHotDisk, atlasGROUPDISK; an aggregation sketch follows below]
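The per-pool view in the chart is a roll-up of per-server free space. A minimal sketch of that aggregation, using made-up sample rows (server names and figures are hypothetical, not ASGC inventory):

```python
# Aggregate per-server free space into the per-pool view the chart shows.
from collections import defaultdict

# (disk server, pool, free TB) -- hypothetical sample rows
servers = [
    ("ds001", "atlasMCDISK", 4.2),
    ("ds002", "atlasMCDISK", 1.1),
    ("ds003", "cmsPrdD1T0",  7.9),
    ("ds004", "Standby",     16.0),
]

free_per_pool = defaultdict(float)
for _server, pool, free_tb in servers:
    free_per_pool[pool] += free_tb

for pool, free_tb in sorted(free_per_pool.items()):
    print(f"{pool:16s} {free_tb:6.1f} TB free")
```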
24. Storage Server Generation
- Drive vs. Total Capacity
[Scatter plot: number of RAID subsystems vs. total capacity per storage server generation: 6 units/238 TB, 18 units/235.5 TB, 23 units/683 TB, 37 units/741 TB]
25. CASTOR Configurations (cont'd)
- Core Service Overview

Service Type  OS Level         Release   Remark
Core          SLC 4.7/x86-64   2.1.7-19  Stager/NS/DLF
SRM           SLC 4.7/x86-64   2.7-18    3 head nodes
Disk Svr.     SLC 4.7/x86-64   2.1.7-19  80 in Q3 2009 (20+ in Q4)
Tape Svr.     SLC 4.7/32 + 64  2.1.8-8   x86-64 OS deployed
26. CASTOR Configurations (cont'd)
- CMS Disk Cache: Current Resource Level

Space Token/  Capacity/   Disk     Tape Pool/
Disk Pool     Job Limit   Servers  Capacity
cmsLTD0T1     278TB/488   9        *
cmsPrdD1T0    284TB/1560  13       -
cmsWanOut     72TB/220    4        -

* Depends on tape family.
27. CASTOR Configurations (cont'd)
- ATLAS Disk Cache: Current Resource Level

Space Token       Cap/Job Limit  Disk Servers  Tape Pool/Cap.
atlasMCDISK       163TB/790      8             -
atlasMCTAPE       38TB/80        2             atlasMCtp/39TB
atlasPrdD1T0      278TB/810      15            -
atlasPrdD0T1      61TB/210       3             atlasPrdtp/105TB
atlasGROUPDISK    19TB/40        1             -
atlasScratchDisk  28TB/80        1             -
atlasHotDisk      2TB/40         2             -
Total             950TB/1835     46            -
28. IDC Collocation
Facility installation completed on Mar 27th
Tape system delayed until after Apr 9th
Realignment
RMA of faulty parts
29. Storage Farm
~110 RAID subsystems deployed since 2003
Supporting both Tier-1 and Tier-2 storage fabric
DAS connections to front-end blade servers
Flexible switching of front-end servers based on performance requirements
4-8 Gb Fibre Channel connectivity
30. CASTOR Configurations (cont'd)
- Tape Pool

Tape Pool       Capacity (TB)/Usage  Drive Dedication  LTO3/4 Mixed
atlasMCtp       8.98/40%             N                 Y
atlasPrdtp      101/65%              N                 Y
cmsCSA08cruzet  15.6/46%             N                 N
cmsCSA08reco    5/0%                 N                 N
cmsCSAtp        639/99%              N                 Y
cmsLTtp         34.4/44%             N                 N
dteamTest       3.5/1%               N                 N
31. MSS Monitoring Services
Standard Nagios probes
NRPE + customized plugins (a plugin sketch follows below)
SMS to OSE/SM for all types of critical alarms
Availability metrics
Tape metrics (SLS)
Throughput, capacity & scheduler per VO and disk pool
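A minimal sketch of what such a customized NRPE plugin can look like, using the standard Nagios exit-code convention; the mount point and thresholds are assumptions, not ASGC's actual probe:

```python
#!/usr/bin/env python
# Sketch of a customized NRPE check: free fraction of a disk pool mount,
# reported with standard Nagios exit codes (0 OK, 1 WARN, 2 CRIT, 3 UNKNOWN).
import os
import sys

POOL_PATH = "/castor-pool"  # hypothetical mount point for one disk pool
WARN, CRIT = 0.15, 0.05     # free-space fractions (assumed thresholds)

try:
    st = os.statvfs(POOL_PATH)
except OSError as exc:
    print(f"UNKNOWN - cannot stat {POOL_PATH}: {exc}")
    sys.exit(3)

free_frac = st.f_bavail / st.f_blocks
msg = f"{POOL_PATH} {free_frac:.1%} free"
if free_frac < CRIT:
    print(f"CRITICAL - {msg}")
    sys.exit(2)
elif free_frac < WARN:
    print(f"WARNING - {msg}")
    sys.exit(1)
print(f"OK - {msg}")
sys.exit(0)
```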
32. MSS Tape System
- Expansion/Upgrade Planning
Before the incident:
8 LTO3 + 4 LTO4 drives
720 TB with LTO3
530 TB with LTO4
May 2009:
Two LTO3 drives
MES: 6 LTO4 drives at the end of May
Capacity: 1.3 PB (old, LTO3/4 mixed) + 0.8 PB (LTO4)
New S54 model introduced mid-2009
2K slots with the tiered model (see the capacity sketch below)
Required:
ALMS upgrade
Enhanced gripper
MES Q3 2009:
18 LTO4 drives
HA implementation resumed in Q4
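The capacity figures follow from standard native LTO cartridge sizes (LTO3 0.4 TB, LTO4 0.8 TB); a minimal sketch of the slot/media arithmetic, with the fill assumption called out:

```python
# Rough library-capacity arithmetic behind the slide's figures.
# Native (uncompressed) capacities are standard LTO values; assuming
# every slot holds one cartridge, which real deployments rarely do.
LTO_NATIVE_TB = {"LTO3": 0.4, "LTO4": 0.8}

def library_capacity_tb(slots, generation):
    """Total native capacity if every slot holds one cartridge."""
    return slots * LTO_NATIVE_TB[generation]

# e.g. the 2K-slot S54 frame fully populated with LTO4 media:
print(library_capacity_tb(2000, "LTO4"), "TB")  # 1600.0 TB, i.e. ~1.6 PB
```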
33. Expansion Planning
2008
0.5 PB expansion of the tape system in Q2
Met MoU target mid-Nov
1.3 MSI2k per rack based on recent E5450 processors
2009 Q1
150 SMP/QC blade servers
RAID subsystems considering 2 TB per drive
42 TB net capacity per chassis, 0.75 PB in total
2009 Q3-4
18 LTO4 drives: mid-Oct
330 Xeon QC (SMP, Intel 5450) blade servers
2nd phase tape MES: 5 LTO4 drives + HA
3rd phase tape MES: 6 LTO4 drives
ETA for 0.8 PB expansion delivery: mid-Nov
34. Computing/Storage System Infrastructure
[Network diagram: Data Center - C3 Archive Room. ASGC CASTOR2 disk farm and CASTOR2 tape servers; 20 x Quanta blades (WN); core services (CE, RB, DPM, PX, BDII, etc.); BladeCenter chassis with 64 x IBM HS20 and 142 x IBM HS21 blades (WN); DC SMR 48V/100A power with batteries. Uplinks: 2 x GE (LX) to 4F M160 (links to HK, JP Tier-2s); 2 x GE (LX) to 4F TaipeiGigaPoP-7609 (links to TW Tier-2s); 4 x GE (SX) to the ASGC distribution switch in Rack#49 (links to Tier-1 servers).]
35. Throughput of WLCG Experiments
Throughput defined as job efficiency x number of running jobs (a worked example follows below)
Characteristics of the 4 LHC experiments show that inefficiency is largely due to poor coding
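Spelled out, the metric is simply efficiency times running jobs; a tiny worked example with illustrative figures (not measurements from the slide):

```python
# The slide's throughput metric: effective throughput is job efficiency
# (roughly CPU time / wall time) times the number of running jobs.
def effective_throughput(job_efficiency, running_jobs):
    return job_efficiency * running_jobs

# 4,000 running jobs at 85% efficiency vs. 70% (e.g. poorly coded I/O):
print(effective_throughput(0.85, 4000))  # 3400.0 effective job slots
print(effective_throughput(0.70, 4000))  # 2800.0 effective job slots
```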
37. Summary
Deployed a highly scalable DM system and a performance-driven storage infrastructure
Eliminated possible complexity of the SRM abstraction layer
Resource utilization, provisioning and optimization
From PoC to production, the challenges remain:
Data Challenge, Service Challenge, CCRC08, STEP09, etc.
The motivation is also clear for medical, climate, and cosmological applications
Operation-wise:
Robust database setup
Knowledge base for fabric infrastructure operation
Fast enough event processing and documentation
Looking beyond the data management use cases in WLCG:
commonality with many other disciplines on the EGEE infrastructure
active participation in e-Science collaboration within the region