
LCG The LCG Service Challenges: Ramping up the LCG Service Jamie Shiers, CERN-IT-GD-SC April 2005 LCG Service Challenges Deploying the Service Agenda: Goals and Timelines of the LCG Service Challenges; Review of SC1 and SC2; Summary of LHC Experiments' Computing Models; Outline of SC3 and SC4 (after that it's the FULL PRODUCTION SERVICE!); Plans for involving Tier2 sites in the Service Challenges; Detailed SC3 planning LCG Service Challenges Deploying the Service

Acknowledgements / Disclaimer: These slides are the work of many people (LCG Storage Management Workshop, Service Challenge Meetings etc.), plus relevant presentations I found with Google! The work behind the slides is that of many more: all those involved in the Service Challenges at the many sites. I will use LCG in the most generic sense, because that's what I know / understand (?) LCG Service Challenges Deploying the Service Partial Contribution List: James Casey: SC1/2 review; Jean-Philippe Baud: DPM and LFC; Sophie Lemaitre: LFC deployment; Peter Kunszt: FiReMan; Michael Ernst: dCache; gLite FTS: Gavin McCance

ALICE: Latchezar Betev; ATLAS: Miguel Branco; CMS: Lassi Tuura; LHCb: [email protected] LCG Service Challenges Deploying the Service LCG Service Challenges Overview: LHC will enter production (physics) in April 2007. It will generate an enormous volume of data and will require a huge amount of processing power. The LCG solution is a world-wide Grid. Many components are understood, deployed, tested.. but: unprecedented scale; the humungous challenge of getting large numbers of institutes and individuals, all with existing, sometimes conflicting commitments, to work together. LCG must be ready at full production capacity, functionality and reliability in less than 2 years from now. Issues include h/w acquisition, personnel hiring and training, vendor

rollout schedules etc. It should not limit the ability of physicists to exploit the performance of the detectors nor the LHC's physics potential, whilst being stable, reliable and easy to use. LCG Service Challenges Deploying the Service LCG Deployment Schedule: Apr05 - SC2 complete; June05 - Technical Design Report; Jul05 - SC3 Throughput Test; Sep05 - SC3 Service Phase; Dec05 - Tier-1 Network operational; Apr06 - SC4 Throughput Test; May06 - SC4 Service Phase starts; Sep06 - Initial LHC Service in stable operation; Apr07 - LHC Service commissioned. [Timeline graphic, 2005-2008: SC2, SC3 (preparation, setup, service), SC4, cosmics, LHC Service Operation, first beams, first physics, full physics run.]

LCG Service Challenges Deploying the Service Service Challenges - Purpose: Understand what it takes to operate a real grid service - run for days/weeks at a time (outside of experiment Data Challenges). Trigger/encourage the Tier1 & large Tier-2 planning - move towards real resource planning based on realistic usage patterns. Get the essential grid services ramped up to target levels of reliability, availability, scalability, end-to-end performance. Set out milestones needed to achieve goals during the service challenges. NB: this is focussed on Tier 0 - Tier 1 / large Tier 2; data management, batch production and analysis. Short term goal: by end 2004 have in place a robust and reliable data management service and support infrastructure, and robust batch job submission. From early proposal, May 2004, Ian Bird [email protected] LCG Service Challenges Deploying the Service Why Service Challenges?

To test Tier-0 - Tier-1 - Tier-2 services. Network service: sufficient bandwidth (~10 Gbit/sec), backup path, quality of service (security, help desk, error reporting, bug fixing, ..). Robust file transfer service: file servers, File Transfer Software (GridFTP), Data Management software (SRM, dCache), archiving service (tape servers, tape robots, tapes, tape drives, ..). Sustainability: weeks in a row of uninterrupted 24/7 operation; manpower implications: ~7 FTE/site; quality of service: helpdesk, error reporting, bug fixing, .. Towards a stable production environment for experiments. Kors Bos, presentation to LHCC, March 7 2005 LCG Service Challenges Deploying the Service Key Principles

Service challenges result in a series of services that exist in parallel with the baseline production service and rapidly and successively approach the production needs of LHC. Initial focus: core (data management) services; swiftly expand out to cover the full spectrum of the production and analysis chain. Must be as realistic as possible, including end-to-end testing of key experiment use-cases over extended periods with recovery from glitches and longer-term outages. Necessary resources and commitment are a pre-requisite to success! The effort should not be under-estimated! LCG Service Challenges Deploying the Service SC1 Review: SC1 did not complete its goals successfully. The milestone was: Dec04 - Service Challenge I complete: mass store (disk) to mass store (disk),

3 T1s (Lyon, Amsterdam, Chicago) (others also participated), 500 MB/sec (individually and aggregate), 2 weeks sustained. Software: GridFTP plus some scripts. We did not meet the milestone of 500MB/s for 2 weeks. We need to do these challenges to see what actually goes wrong - a lot of things do, and did, go wrong. We need better test plans for validating the infrastructure before the challenges (network throughput, disk speeds, etc). OK, so we're off to a great start with the Service Challenges... LCG Service Challenges Deploying the Service SC2 - Overview: Service Challenge 2 - throughput test from Tier-0 to Tier-1 sites. Started 14th March

Set up infrastructure to 7 sites: NIKHEF/SARA, IN2P3, FNAL, BNL, FZK, INFN, RAL. 100MB/s to each site; 500MB/s combined to all sites at the same time; 500MB/s to a few sites individually. Goal: by end March, sustained 500 MB/s at CERN. LCG Service Challenges Deploying the Service SC2 met its throughput targets: >600MB/s daily average for 10 days was achieved, midday 23rd March to midday 2nd April. Not without outages, but the system showed it could recover its rate again after outages. Load reasonably evenly divided over sites (given the network bandwidth constraints of the Tier-1 sites). LCG Service Challenges Deploying the Service Division of Data between sites:

Site     Average throughput (MB/s)   Data Moved (TB)
BNL      61                          51
FNAL     61                          51
GridKA   133                         109
IN2P3    91                          75
INFN     81                          67
RAL      72                          58
SARA     106                         88
TOTAL    600                         500

LCG Service Challenges Deploying the Service Tier-1 Network Topology LCG Service Challenges Deploying the Service Transfer Control Software (1/3): LCG RADIANT software was used to control transfers - prototype software interoperable with the gLite FTS; the plan is to move to gLite FTS for SC3. Initial promising results presented at the Lyon SC Meeting in Feb.; more details in the LCG Storage Workshop tomorrow. Run on a single node at CERN, controlling all Tier-0 to Tier-1 channels; transfers done via 3rd-party gridftp. radiant-load-generator was used to generate transfers, configured for each channel to load-balance where appropriate; it specifies the number of concurrent streams for a file transfer (normally =1 for a dedicated link) and ran from cron to keep transfer queues full.
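The radiant-load-generator itself is not shown in these slides; purely as an illustration, a minimal sketch of the kind of loop it implements - keep a fixed number of third-party GridFTP transfers in flight per channel - might look as follows. The channel names, host names and file names are invented, and the globus-url-copy invocation (with -p for parallel streams) is assumed from the standard GridFTP client rather than taken from the real tool:

```python
import subprocess
import time

# (channel name, source prefix, destination prefix, queue depth, parallel streams)
CHANNELS = [
    ("CERN-FNAL", "gsiftp://cern-gw.example.org/data/", "gsiftp://fnal-gw.example.org/data/", 10, 1),
    ("CERN-RAL",  "gsiftp://cern-gw.example.org/data/", "gsiftp://ral-gw.example.org/data/",  10, 1),
]

def submit(src, dst, streams):
    # One third-party GridFTP transfer; -p sets the number of parallel streams
    # (normally 1 on a dedicated link, as noted in the slide above).
    return subprocess.Popen(["globus-url-copy", "-p", str(streams), src, dst])

active = {name: [] for name, *_ in CHANNELS}   # transfers currently in flight per channel
counter = 0
while True:                                    # the real tool achieved this via cron
    for name, src_pfx, dst_pfx, depth, streams in CHANNELS:
        # Drop completed transfers, then top the queue back up to 'depth'.
        active[name] = [p for p in active[name] if p.poll() is None]
        while len(active[name]) < depth:
            fname = "loadgen-%06d.dat" % counter      # hypothetical test file name
            active[name].append(submit(src_pfx + fname, dst_pfx + fname, streams))
            counter += 1
    time.sleep(30)
```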

LCG Service Challenges Deploying the Service Individual site tests: overlapped with the LCG Storage Management Workshop. Sites can pick days in the next two weeks when they have the capacity: 500MB/s to disk, 60MB/s to tape. FNAL was running 500MB/s disk tests at the time. LCG Service Challenges Deploying the Service FNAL Failover: it shows that when the Starlight link was cut, our transfers fell back to our ESnet connection; our network folks rerouted it to our 1 GE link when Starlight came back. LCG Service Challenges Deploying the Service SC2 Summary: SC2 met its throughput goals - and with more sites than originally planned! A big improvement from SC1, but we still don't have something we can call a service. Monitoring is better

We see outages when they happen, and we understood why they happened - a first step towards operations guides. Some advances in infrastructure and software will happen before SC3: gLite transfer software; SRM service more widely deployed. We have to understand how to incorporate these elements. LCG Service Challenges Deploying the Service SC1/2 - Conclusions: Setting up the infrastructure and achieving reliable transfers, even at much lower data rates than needed for LHC, is complex and requires a lot of technical work + coordination. Even within one site people are working very hard & are stressed, and stressed people do not work at their best. Far from clear how this scales to SC3/SC4, let alone to the LHC production phase. Compound this with the multi-site / multi-partner issue, together with time zones etc., and you have a large non-technical component to an already tough problem (example of technical problem follows)

But the end point is fixed (time + functionality). We should be careful not to over-complicate the problem or potential solutions, and not forget there is still a humungous amount to do (much, much more than we've done). LCG Service Challenges Deploying the Service Computing Model Summary - Goals: Present the key features of the LHC experiments' Computing Models in a consistent manner; highlight the commonality; emphasize the key differences. Define these parameters in a central place (LCG web); update with change-log as required

Use these parameters as input to requirements for the Service Challenges, to enable partners (T0/T1 sites, experiments, network providers) to have a clear understanding of what is required of them; define precise terms and factors. LHC Computing Models - Summary of Key Characteristics of LHC Computing Models LCG Service Challenges Deploying the Service Where do these numbers come from? Obtained from the LHC Computing Models as reviewed in January. Part of the plan is to understand how sensitive the overall model is to variations in key parameters. Iteration with the experiments is on-going, i.e. I have tried to clarify any questions that I have had

Any mis-representation or mis-interpretation is entirely my responsibility. Sanity check: compare with numbers from the MoU Task Force. (Actually the following LCG document now uses these numbers!) http://cern.ch/LCG/documents/LHC_Computing_Resources_report.pdf LCG Service Challenges Deploying the Service LHC Parameters (Computing Models):

Year   pp beam time (s/year)   pp luminosity (cm^-2 s^-1)   HI beam time (s/year)   HI luminosity (cm^-2 s^-1)
2007   5 x 10^6                5 x 10^32                    -                       -
2008   (1.8 x) 10^7            2 x 10^33                    (2.6 x) 10^6            5 x 10^26
2009   10^7                    2 x 10^33                    10^6                    5 x 10^26
2010   10^7                    10^34                        10^6                    5 x 10^26

(Real time given in brackets above.)

LCG Service Challenges Deploying the Service LHC Schedule (Chamonix workshop): first collisions two months after first turn-on in August 2007; 32 weeks of operation, 16 weeks of shutdown, 4 weeks commissioning = 140 days physics / year (5 lunar months). LCG Service Challenges Deploying the Service Overview of pp running:

Experiment   SIM     SIM ESD   RAW     Trigger   RECO    AOD     TAG
ALICE        400KB   40KB      1MB     100Hz     200KB   50KB    10KB
ATLAS        2MB     500KB     1.6MB   200Hz     500KB   100KB   1KB
CMS          2MB     400KB     1.5MB   150Hz     250KB   50KB    10KB
LHCb         400KB   -         25KB    2KHz      75KB    25KB    1KB

LCG Service Challenges Deploying the Service Overview of Heavy Ion running:

Experiment   SIM     SIM ESD   RAW      Trigger   RECO    AOD     TAG
ALICE        300MB   2.1MB     12.5MB   100Hz     2.5MB   250KB   10KB
ATLAS        -       -         5MB      50Hz      -       -       -
CMS          -       -         7MB      50Hz      1MB     200KB   TBD
LHCb         N/A     N/A       N/A      N/A       N/A     N/A     N/A

LCG Service Challenges Deploying the Service pp / AA data rates (equal split):

Centre                      ALICE  ATLAS  CMS  LHCb   Rate into T1 (pp, MB/s)   Rate into T1 (AA, MB/s)
ASCC, Taipei                0      1      1    0      118.7                     28.2
CNAF, Italy                 1      1      1    1      205.0                     97.2
PIC, Spain                  0      1      1    1      179.0                     28.2
IN2P3, Lyon                 1      1      1    1      205.0                     97.2
GridKA, Germany             1      1      1    1      205.0                     97.2
RAL, UK                     1      1      1    1      205.0                     97.2
BNL, USA                    0      1      0    0      72.2                      11.3
FNAL, USA                   0      0      1    0      46.5                      16.9
TRIUMF, Canada              0      1      0    0      72.2                      11.3
NIKHEF/SARA, Netherlands    1      1      0    1      158.5                     80.3
Nordic Centre               1      1      0    0      98.2                      80.3
Totals                      6      10     7    6
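As a rough illustration of where numbers of this order come from, the RAW rate out of the Tier-0 follows directly from the event sizes and trigger rates in the pp table above; a back-of-envelope sketch (the number of Tier-1s per experiment is taken from the Totals row, and this deliberately ignores the ESD/AOD shares and multi-experiment sites that make the per-site rates in the table larger):

```python
# Back-of-envelope only: RAW rate out of the Tier-0 from the pp table above
# (event size x trigger rate), and a naive equal split over that experiment's
# Tier-1s.  The per-site rates in the table are larger because they also
# include each experiment's ESD and AOD shares, and most sites serve several
# experiments.
pp = {
    # experiment: (RAW event size in MB, trigger rate in Hz, number of Tier-1s)
    "ALICE": (1.0,   100,  6),
    "ATLAS": (1.6,   200, 10),
    "CMS":   (1.5,   150,  7),
    "LHCb":  (0.025, 2000, 6),
}

for exp, (size_mb, rate_hz, n_t1) in pp.items():
    raw_out = size_mb * rate_hz          # MB/s of RAW leaving the Tier-0
    per_t1 = raw_out / n_t1              # equal split of the RAW stream alone
    print("%-5s %6.1f MB/s RAW out of T0, ~%5.1f MB/s per Tier-1 (RAW only)"
          % (exp, raw_out, per_t1))
```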

LCG Service Challenges Deploying the Service Streaming: All experiments foresee RAW data streaming, but with different approaches. CMS: O(50) streams based on trigger path; classification is immutable, defined by L1+HLT. ATLAS: 4 streams based on event types - primary physics, express line, calibration, debugging and diagnostic. LHCb: >4 streams based on trigger category - B-exclusive, di-muon, D* sample, B-inclusive; streams are not created in the first pass, but during the stripping process. Not clear what is the best/right solution; probably bound to evolve in time. Francesco Forti, Pisa
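Purely to illustrate the idea of trigger-based streaming described above, a toy sketch using the four ATLAS stream names from the slide; the event flags and selection logic are invented for the example, not any experiment's actual classification:

```python
# Toy illustration of trigger-based RAW streaming, using the four ATLAS stream
# names from the slide above; the selection logic and event flags are invented.
ATLAS_STREAMS = ("primary_physics", "express", "calibration", "debug")

def assign_stream(event):
    """Route one event to a stream based on hypothetical trigger flags."""
    if event.get("corrupted"):
        return "debug"                 # debugging and diagnostic stream
    if event.get("calibration_trigger"):
        return "calibration"
    if event.get("express_priority"):
        return "express"               # express line for fast feedback
    return "primary_physics"

for ev in ({"express_priority": True}, {"calibration_trigger": True}, {}):
    print(assign_stream(ev))
```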

LCG Service Challenges Deploying the Service Reprocessing: Data need to be reprocessed several times because of improved software and more accurate calibration and alignment. Reprocessing is mainly at T1 centers; LHCb is planning on using the T0 during the shutdown - not obvious it is available. Number of passes per year: ALICE 3, ATLAS 2, CMS 2, LHCb 4. But experience shows the reprocessing requires huge effort! Use these numbers in the calculation, but 2 / year will be good going! Francesco Forti, Pisa LCG Service Challenges Deploying the Service

Base Requirements for T1s: Provisioned bandwidth comes in units of 10Gbits/sec, although this is an evolving parameter (from the reply to questions from the Computing MoU Task Force). Since then, some parameters of the Computing Models have changed; given the above quantisation, the result is relatively insensitive to small-ish changes. Important to understand the implications of multiple 10Gbit links, particularly for sites with a Heavy Ion programme; the spread of the AA distribution during the shutdown probably means 1 link is sufficient. For now, planning for 10Gbit links to all Tier1s. LCG Service Challenges Deploying the Service Tier1 SUMMARY: High data rate transfer tests Karlsruhe to CERN already underway

10G available now from Bologna to CERN. Testing of 10G lambdas from Lyon and Amsterdam can commence from July 2005. Amsterdam (and Taipei) will use the NetherLight link to CERN until GEANT2 paths are available. Testing of the Barcelona link at 10G from October 2005. Nordic distributed facility restricted to 1G until late-2006 when 10G available. RAL could operate between 2 and 4 x 1GE (subject to scheduling and NetherLight/CERN agreement) until late-2006 when 10G available; interconnection of UKLight to GEANT2 might make an earlier transition to higher capacities possible. Tier0/Tier1/Tier2 NREN Connectivity Overview, John Chevers, Project Manager, DANTE, Amsterdam 8th April 2005 LCG Service Challenges Deploying the Service T1/T2 Roles. Tier1: keep certain portions of RAW, ESD, sim ESD; full copies of AOD + TAG, calibration data; official physics group large scale data analysis; ALICE + LHCb: T1s also contribute to simulation. Tier2: keep certain portions of AOD and full copies of TAG for real + simulated data (LHCb: sim only at T2s); selected ESD samples; produce simulated data; general end-user analysis. Based on the "T1 Services for T2 Centres" document (just type this into Google). LCG Service Challenges Deploying the Service MC Data:

Parameter                     Unit        ALICE p-p   ALICE Pb-Pb   ATLAS p-p   CMS p-p   LHCb p-p
Time to reconstruct 1 event   kSI2k sec   5.4         675           15          25        2.4
Time to simulate 1 event      kSI2k sec   35          15000         100         45        50
Events/year                   Giga        1           0.1           2           1.5       20
Events SIM/year               Giga        1           0.01          0.4         1.5       4
Ratio SIM/data                %           100%        10%           20%         100%      20%

Tier2 sites offer 10 - 1000 kSI2K years. ATLAS: 16 MSI2K years over ~30 sites in 2008; CMS: 20 MSI2K years over ~20 sites in 2008.

LCG Service Challenges Deploying the Service GridPP Estimates of T2 Networking:

Experiment   No. of T1s   No. of T2s   Total T2 CPU (KSI2K)   Total T2 Disk (TB)   Average T2 CPU (KSI2K)   Average T2 Disk (TB)   Network In (Gb/s)   Network Out (Gb/s)
ALICE        6            21           13700                  2600                 652                      124                    0.010               0.600
ATLAS        10           30           16200                  6900                 540                      230                    0.140               0.034
CMS          6 to 10      25           20725                  5450                 829                      218                    1.000               0.100
LHCb         6            14           7600                   23                   543                      2                      0.008               0.008

The CMS figure of 1Gb/s into a T2 comes from the following: each T2 has ~10% of the current RECO data and 1/2 of the AOD (real + MC sample); these data are refreshed every 3 weeks, compatible with the frequency of (possible) major selection passes at T1s. See CMS Computing Model S-30 for more details (a rough numerical cross-check follows the ALICE table below). LCG Service Challenges Deploying the Service ALICE Computing Needs (Table 2.6, as posted 25 Feb. 2005):

                              T0     Sum T1s   Sum T2s   Total
CPU (MSI2K) [Peak]            7.5    13.8      13.7      35
Transient Storage (PB)        0.44   7.6       2.5       10.54
Permanent storage (PB/year)   2.3    7.5       0         9.8
Bandwidth in (Gbps)           8      2         0.075
Bandwidth out (Gbps)          6      1.5       0.27
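As promised above, a rough cross-check of the CMS 1 Gb/s per-T2 figure, using the CMS event counts and sizes from the tables earlier in this section; the identification of the MC "RECO" size with the 400KB SIM ESD figure is an assumption, and the real estimate in the CMS Computing Model (S-30) is more detailed:

```python
# Rough cross-check only; the CMS Computing Model (S-30) estimate is more detailed.
KB = 1_000
events_real = 1.5e9                 # CMS pp events/year (table above, 2008)
events_mc   = 1.5e9                 # CMS simulated events/year
reco_real, aod_real = 250 * KB, 50 * KB
reco_mc,   aod_mc   = 400 * KB, 50 * KB   # assumption: take SIM ESD as the MC 'RECO'

# One T2 hosts ~10% of RECO plus 1/2 of the AOD, for real + MC samples.
hosted = (0.1 * events_real * reco_real + 0.5 * events_real * aod_real
          + 0.1 * events_mc * reco_mc + 0.5 * events_mc * aod_mc)
refresh_period = 3 * 7 * 24 * 3600        # refreshed every 3 weeks, in seconds

rate_gbps = hosted * 8 / refresh_period / 1e9
print("hosted sample ~ %.0f TB, refresh rate ~ %.2f Gb/s" % (hosted / 1e12, rate_gbps))
# -> of order 170 TB and ~0.8 Gb/s, i.e. roughly the 1 Gb/s quoted for CMS.
```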

Service Challenge 3 - Goals and Timeline for Service Challenge 3 LCG Service Challenges Deploying the Service Service Challenge 3 - Phases. High level view: Setup phase (includes Throughput Test) - 2 weeks sustained in July 2005; obvious target: GDB of July 20th. Primary goals: 150MB/s disk-to-disk to Tier1s; 60MB/s disk (T0) to tape (T1s). Secondary goals: include a few named T2 sites (T2 -> T1 transfers); encourage the remaining T1s to start disk-to-disk transfers. Service phase: September - end 2005; start with ALICE & CMS, add ATLAS and LHCb October/November; all offline use cases except for analysis

More components: WMS, VOMS, catalogs, experiment-specific solutions. Implies a production setup (CE, SE, ...). LCG Service Challenges Deploying the Service SC3 Production Services: SC3 is a relatively small step wrt SC2 in terms of throughput! We know we can do it technology-wise, but do we have a solution that will scale? Let's make it a priority for the coming months to streamline our operations, and not just throw resources at the problem (which we don't have), whilst not forgetting the real goals of SC3, i.e. services! LCG Service Challenges Deploying the Service

SC3 Preparation Workshop This (proposed) workshop will focus on very detailed technical planning for the whole SC3 exercise. It is intended to be as interactive as possible, i.e. not presentations to an audience largely in a different (wireless) world. There will be sessions devoted to specific experiment issues, Tier1 issues, Tier2 issues as well as the general service infrastructure. Planning for SC3 has already started and will continue prior to the workshop. This is an opportunity to get together to iron out concerns and issues that cannot easily be solved by e-mail, phone conferences and/or other meetings prior to the workshop. LCG Service Challenges Deploying the Service SC3 Preparation W/S Agenda 4 x 1/2 days devoted to experiments in B160 1-009, phone conferencing possible 1 day focussing on T1/T2 issues together with output of

above, in 513 1-024, VRVS available. Dates are 13 - 15 June (Monday - Wednesday). Even though the conference room was booked tentatively in February, there was little flexibility in dates even then! LCG Service Challenges Deploying the Service SC3 Milestone Decomposition. File transfer goals: build up disk-to-disk transfer speeds to 150MB/s, with 1GB/s out of CERN (SC2 was 100MB/s, agreed by site); include tape transfer speeds of 60MB/s, with 300MB/s out of CERN. Tier1 goals: bring in additional Tier1 sites wrt SC2 (at least wrt the original plan); PIC and Nordic most likely added later: SC4?

Tier2 goals: Start to bring Tier2 sites into challenge Agree services T2s offer / require On-going plan (more later) to address this via GridPP, INFN etc. Experiment goals: Address main offline use cases except those related to analysis i.e. real data flow out of T0-T1-T2; simulation in from T2-T1 Service goals: Include CPU (to generate files) and storage Start to add additional components Catalogs, VOs, experiment-specific solutions etc, 3D involvement, Choice of software components, validation, fallback, LCG Service Challenges Deploying the Service SC3 Experiment Goals Meetings on-going to discuss goals of SC3 and experiment involvement

Focus on: first demonstrate robust infrastructure; add simulated experiment-specific usage patterns; add experiment-specific components; run experiments' offline frameworks but don't preserve data; exercise primary Use Cases except analysis (SC4). Service phase: data is preserved. Has significant implications on resources beyond file transfer services - storage, CPU, network - both at CERN and participating sites (T1/T2). May have different partners for experiment-specific tests (e.g. not all T1s). In effect, the experiments' usage of SC during the service phase = data challenge. Must be exceedingly clear on goals / responsibilities during each phase! LCG Service Challenges Deploying the Service SC3 Milestone Decomposition File transfer goals:

Build up disk-to-disk transfer speeds to 150MB/s (SC2 was 100MB/s, agreed by site); include tape transfer speeds of 60MB/s. Tier1 goals: bring in additional Tier1 sites wrt SC2; PIC and Nordic most likely added later: SC4? Tier2 goals: start to bring Tier2 sites into the challenge; agree services T2s offer / require; on-going plan (more later) to address this via GridPP, INFN etc. Experiment goals: address main offline use cases except those related to analysis, i.e. real data flow out of T0-T1-T2; simulation in from T2-T1. Service goals:

Include CPU (to generate files) and storage; start to add additional components: catalogs, VOs, experiment-specific solutions etc., 3D involvement; choice of software components, validation, fallback, ... LCG Service Challenges Deploying the Service A Simple T2 Model (N.B. this may vary from region to region). Each T2 is configured to upload MC data to, and download data via, a given T1. In case the T1 is logically unavailable, wait and retry - MC production might eventually stall. For data download, retrieve via an alternate route / T1, which may well be at lower speed, but hopefully this is rare. Data residing at a T1 other than the preferred T1 is transparently delivered through an appropriate network route

T1s are expected to have at least as good interconnectivity as to T0. Each Tier-2 is associated with a Tier-1 that is responsible for getting them set up. Services at a T2 are managed storage and reliable file transfer; DB component at the T1, user agent also at the T2. 1Gbit network connectivity - shared (less will suffice to start with, more may be needed!). LCG Service Challenges Deploying the Service Prime Tier-2 sites. For SC3 we aim for: Lancaster - RAL (ATLAS); London - RAL (CMS); ScotGrid - RAL (LHCb); Torino - CNAF (ALICE); DESY - FZK (CMS + ATLAS); US sites - FNAL (CMS). More sites are appearing!

Site             Tier1            Experiment
Bari, Italy      CNAF, Italy      CMS
Turin, Italy     CNAF, Italy      ALICE
DESY, Germany    FZK, Germany     ATLAS, CMS
Lancaster, UK    RAL, UK          ATLAS
London, UK       RAL, UK          CMS
ScotGrid, UK     RAL, UK          LHCb
US Tier2s        BNL / FNAL       ATLAS / CMS

For CMS, also Legnaro, Rome and Pisa. For ATLAS, sites will be discussed next week. Plans for a workshop in Bari end-May are advancing well; tutorials also foreseen in the UK. Responsibility lies between the T1 and T2 (+ experiments); CERN's role is limited: develop a manual "how to connect as a T2"; provide relevant s/w + installation guides; assist in workshops, training etc.
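The simple T2 model described above (upload MC data only to the preferred T1 with wait-and-retry; fall back to an alternate T1 for downloads, possibly at lower speed) can be sketched as follows; the transfer, fetch and availability calls are placeholders, not any particular tool's API:

```python
import time

def upload_mc(data_id, preferred_t1, is_available, transfer,
              retry_wait=600, max_retries=144):
    """Upload MC data only to the preferred T1; if it is unavailable, wait and
    retry (MC production may eventually stall if the outage is long)."""
    for _ in range(max_retries):
        if is_available(preferred_t1):
            return transfer(data_id, preferred_t1)
        time.sleep(retry_wait)
    return False    # give up after ~24h: production stalls, operator intervenes

def download(data_id, preferred_t1, alternate_t1s, is_available, fetch):
    """Downloads fall back to another T1 holding the data, possibly at lower
    speed, if the preferred T1 is unavailable."""
    for t1 in [preferred_t1] + list(alternate_t1s):
        if is_available(t1):
            return fetch(data_id, t1)
    raise RuntimeError("no Tier-1 currently reachable for " + data_id)
```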

Pilot also with FZK / DESY - see May HEPiX. Other interested parties: Prague, Warsaw, Moscow, ..; also foreseen for Spain, France, ... Also attacking the larger scale problem through national / regional bodies: GridPP, INFN, HEPiX, US-ATLAS, US-CMS; US sites through US-expt? LCG Service Challenges Deploying the Service Tier2 regions and coordinating bodies:

Italy (INFN): A workshop is foreseen for May, during which hands-on training on the Disk Pool Manager and File Transfer components will be held.
UK (GridPP): A coordinated effort to set up managed storage and File Transfer services is being managed through GridPP and monitored via the GridPP T2 deployment board.
Asia-Pacific (ASCC Taipei): The services offered by and to Tier2 sites will be exposed, together with a basic model for Tier2 sites, at the Service Challenge meeting held at ASCC in April 2005.
Europe (HEPiX): A similar activity will take place at HEPiX at FZK in May 2005, together with detailed technical presentations on the relevant software components.
US (US-ATLAS and US-CMS): Tier2 activities in the US are being coordinated through the corresponding experiment bodies.
Canada (TRIUMF): A Tier2 workshop will be held around the time of the Service Challenge meeting to be held at TRIUMF in November 2005.
Other sites (CERN): One or more workshops will be held to cover those Tier2 sites with no obvious regional or other coordinating body, most likely end 2005 / early 2006.

Tier2 and Base S/W Components: 1) Disk Pool Manager (of some flavour), e.g. dCache, DPM, ...; 2) gLite FTS client (and T1 services); 3) possibly also a local catalog, e.g. LFC, FiReMan, ...; 4) experiment-specific s/w and services ("agents"). 1) - 3) will be bundled with the LCG release; experiment-specific s/w will not. Tier2 Software Components - Overview of dCache, DPM, gLite FTS + LFC and FiReMan. File Transfer Software SC3, Gavin McCance, JRA1 Data Management Cluster, LCG Storage Management Workshop, April 6, 2005, CERN LCG Service Challenges Deploying the Service Reliable File Transfer Requirements: LCG created a set of requirements based on the Robust Data Transfer Service Challenge

LCG and gLite teams translated this into a detailed architecture and design document for the software and the service A prototype (radiant) was created to test out the architecture and was used in SC1 and SC2 Architecture and design have worked well for SC2 gLite FTS (File Transfer Service) is an instantiation of the same architecture and design, and is the candidate for use in SC3 Current version of FTS and SC2 radiant software are interoperable LCG Service Challenges Deploying the Service gLite FTS Summary Core FTS software is in good shape On its way to becoming software that can provide a manageable, reliable service Stress-testing underway Gaining operational experience with it early Increasing as we head to SC3 We are now understanding how the experiment frameworks can plug onto it Interested to discuss further and work with experiments

Pilot to be setup at CERN in May To allow experiments to start to evaluate / gain experience LCG Service Challenges Deploying the Service Disk Pool Manager (DPM) aims Provide a solution for the small Tier-2s in LCG-2 This implies 1 to 10 Terabytes in 2005 Focus on manageability Easy to install Easy to configure Low effort for ongoing maintenance Easy to add/remove resources Support for multiple physical partitions On one or more disk server nodes Support for different space types volatile and permanent Support for multiple replicas of a file within the disk pools
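To make the DPM concepts listed above concrete - pools spanning several file systems on one or more disk servers, volatile vs permanent space types, multiple replicas of a file - here is a toy data model. It is only an illustration of the ideas; it is not the DPM's actual schema or API, and all names are invented:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class FileSystem:
    server: str               # disk server node hosting this partition
    mount: str                # e.g. "/srv/dpm/data01"
    available: bool = True    # marked unavailable when the server is not contactable

@dataclass
class Pool:
    name: str
    space_type: str           # "volatile" or "permanent"
    filesystems: List[FileSystem] = field(default_factory=list)

@dataclass
class Replica:
    pool: str
    filesystem: FileSystem

@dataclass
class DpmFile:
    lfn: str                                               # logical name in the name server
    replicas: List[Replica] = field(default_factory=list)  # a file may have several replicas

# Example: one permanent pool spread over two disk servers, one file, one replica.
pool = Pool("perm01", "permanent", [
    FileSystem("dpmdisk01.example.org", "/srv/dpm/data01"),
    FileSystem("dpmdisk02.example.org", "/srv/dpm/data01"),
])
f = DpmFile("/dpm/example.org/home/atlas/raw/file001",
            [Replica(pool.name, pool.filesystems[0])])
```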

LCG Service Challenges Deploying the Service DPM Manageability Few daemons to install Disk Pool Manager Name Server SRM No central configuration files Disk nodes request to add themselves to the DPM Easy to remove disks and partitions Allows simple reconfiguration of the Disk Pools Administrator can temporarily remove file systems from the DPM if a disk has crashed and is being repaired DPM automatically configures a file system as unavailable when it is not contactable LCG Service Challenges Deploying the Service DPM Features DPM access via different interfaces Direct Socket interface SRM v1

SRM v2 Basic Also offer a large part of SRM v2 Advanced Global Space Reservation (next version) Namespace operations Permissions Copy and Remote Get/Put (next version) Data Access Gridftp, rfio (ROOTD, XROOTD could be easily added) DPM Catalog shares same code as LCG File Catalog Possibility to act as a Local Replica Catalog in a distributed catalog LCG Service Challenges Deploying the Service DPM Status DPNS, DPM, SRM v1 and SRM v2 (without Copy nor global space reservation) have been tested for 4 months The secure version has been tested for 6 weeks GsiFTP has been modified to interface to the DPM

RFIO interface is in final stage of development LCG Service Challenges Deploying the Service DPM - Proposal for SC3 Provide a possible solution for the small Tier-2s This implies 1 to 10 Terabytes in 2005 Focus on manageability Easy to install Easy to configure Low effort for ongoing maintenance Easy to add/remove resources Support for multiple physical partitions On one or more disk server nodes Replacement of Classic SE Only metadata operations needed (data does not need to be copied) Working with Tier2s regarding evaluation

LCG Service Challenges Deploying the Service CASTOR status & plans: The CASTOR version running in production follows the original CASTOR design, defined in 1999 - 2000 and fully deployed for production use in early 2002. Name space in Oracle, MySQL - good scalability (currently >33M files). Known limitations: the stager disk residence catalog is in memory - scales to O(200k) resident files but not beyond; no file access request throttling; no controlled resource sharing. Stager limitations have led to a deployment with many stager instances (~40 instances, each with its own disk pool) - bad sharing and load balancing, difficult to operate. LCG Service Challenges Deploying the Service CASTOR status & plans: New stager (disk pool manager): disk file residence catalog in Oracle (later also MySQL) - good scalability; externalized request scheduling - throttling and controlled resource sharing; many features targeted for easy operation and automated management

New stager status: developments started in 2003; the first version of the complete new system was ready for testing in late 2004 and has been running in the testbed for 4 months. Functionality OK. Performance: Tier-0 requirements heavily tested with satisfactory results; Tier-1 requirements tests still ongoing - request handling requires more tuning. The new stager API is not compatible with the old stager - a new SRM is required. LCG Service Challenges Deploying the Service CASTOR status & plans: New stager deployment: a roll-out plan is being prepared; it is too early to give a definite timescale. Gain operational experience with test instances (configuration, monitoring, automated management). Understand the Oracle requirements (hardware, tuning, administration). User interfaces may require more discussions with experiments: RFIO is backward compatible; stager commands are only partially backward compatible; the stager API is completely new ( http://cern.ch/castor/DOCUMENTATION/CODE/STAGE/NewAPI/index.html ). Work out a good deployment architecture exploiting the advantages of the new stager: scalable catalog and resource sharing facilities - fewer instances
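The key functional change described above is that the new stager externalizes request scheduling, enabling throttling and controlled resource sharing. Conceptually - and only conceptually, this is not CASTOR code and the policy is invented - that amounts to something like the following:

```python
from collections import defaultdict, deque

class FairShareScheduler:
    """Toy model of externalized request scheduling with throttling and
    per-group shares, in the spirit of what the new stager adds (illustrative
    only, not the actual CASTOR implementation)."""

    def __init__(self, max_active, shares):
        self.max_active = max_active          # throttle on concurrently active requests
        self.shares = shares                  # e.g. {"atlas": 0.5, "cms": 0.3, "alice": 0.2}
        self.pending = defaultdict(deque)     # per-group FIFO of waiting requests
        self.active = defaultdict(int)        # per-group count of requests being served

    def submit(self, group, request):
        self.pending[group].append(request)

    def dispatch(self):
        """Start as many requests as the throttle allows, always picking the
        group that is furthest below its configured share."""
        started = []
        while sum(self.active.values()) < self.max_active:
            waiting = [g for g in self.pending if self.pending[g]]
            if not waiting:
                break
            total = sum(self.active.values()) or 1
            group = min(waiting,
                        key=lambda g: self.active[g] / total - self.shares.get(g, 0))
            started.append((group, self.pending[group].popleft()))
            self.active[group] += 1
        return started

    def complete(self, group):
        self.active[group] -= 1
```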

LCG Service Challenges Deploying the Service SC3 proposal CASTOR version: Throughput phase Use the new stager Service phase Can use new stager, but Sharing data with production disk pools only possible if the participating experiments have been migrated New stager cannot share disk pools with an old stager instance LCG Service Challenges Deploying the Service Status of SRM definition CMS input/comments not included yet SRM v1.1 insufficient mainly lack of pinning SRM v3 not required and timescale too late Require Volatile, Permanent space; Durable not practical Global space reservation: reserve, release, update

(mandatory LHCb, useful ATLAS, ALICE). Compactspace NN. Permissions on directories mandatory - prefer based on roles and not DN (SRM integrated with VOMS desirable, but timescale?). Directory functions (except mv) should be implemented asap. Pin/unpin high priority. srmGetProtocols useful but not mandatory. Abort, suspend, resume request: all low priority. Relative paths in SURL important for ATLAS, LHCb, not for ALICE. LCG Service Challenges Deploying the Service SC3 Milestone Decomposition File transfer goals: Build up disk-to-disk transfer speeds to 150MB/s (SC2 was 100MB/s, agreed by site); include tape transfer speeds of 60MB/s. Tier1 goals:

Bring in additional Tier1 sites wrt SC2 PIC and Nordic most likely added later: SC4? Tier2 goals: Start to bring Tier2 sites into challenge Agree services T2s offer / require On-going plan (more later) to address this via GridPP, INFN etc. Experiment goals: Address main offline use cases except those related to analysis i.e. real data flow out of T0-T1-T2; simulation in from T2-T1 Service goals: Include CPU (to generate files) and storage Start to add additional components Catalogs, VOs, experiment-specific solutions etc, 3D involvement, Choice of software components, validation, fallback, LCG Service Challenges Deploying the Service Service Goals

Expect a relatively modest increase in service components: file catalog based on agreement from the Baseline Services WG; other services agreed by the BSWG. Experiment-specific components and their impact on other services, e.g. Distributed Database Services, need to be clarified as soon as possible. Similarly, requirements for processing power and storage at all sites involved (T0, T1, T2). (This is for both Service and Challenge phases: where we run the experiments' s/w and store the output!) LCG Service Challenges Deploying the Service SC3 Milestone Decomposition File transfer goals: Build up disk-to-disk transfer speeds to 150MB/s (SC2 was 100MB/s, agreed by site); include tape transfer speeds of 60MB/s. Tier1 goals: Bring in additional Tier1 sites wrt SC2; PIC and Nordic most likely added later: SC4?

Tier2 goals: Start to bring Tier2 sites into challenge Agree services T2s offer / require On-going plan (more later) to address this via GridPP, INFN etc. Experiment goals: Address main offline use cases except those related to analysis i.e. real data flow out of T0-T1-T2; simulation in from T2-T1 Service goals: Include CPU (to generate files) and storage Start to add additional components Catalogs, VOs, experiment-specific solutions etc, 3D involvement, Choice of software components, validation, fallback, LCG Service Challenges Deploying the Service SC3 pilot services

Not really restricted to SC3, except timewise. gLite FTS pilot: for experiments to start to get experience with it, integrating it with their frameworks etc. File catalogs: both LFC and FiReMan (and RLS continues for the time being). Expect these to start in May, based on s/w delivered this week (code freeze April 15) for release end May. Max one more cycle for SC3 - end June is too late for the throughput phase! LFC is already part of LCG releases. LCG Service Challenges Deploying the Service
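For the gLite FTS pilot mentioned above, framework integration essentially means submitting source/destination SURL pairs and polling job state. A minimal sketch of such a wrapper follows; the endpoint URL and SURLs are invented, and the glite-transfer-submit / glite-transfer-status command names and state strings are assumed from the gLite FTS command-line tools and may differ in detail for the pilot:

```python
import subprocess
import time

FTS_SERVICE = "https://fts-pilot.example.org:8443/fts"   # hypothetical pilot endpoint

def submit(src_surl, dst_surl):
    """Submit one transfer and return the job identifier printed by the CLI."""
    out = subprocess.run(
        ["glite-transfer-submit", "-s", FTS_SERVICE, src_surl, dst_surl],
        check=True, capture_output=True, text=True)
    return out.stdout.strip()

def wait_for(job_id, poll=60):
    """Poll until the job reaches a terminal state (state names are assumptions)."""
    while True:
        out = subprocess.run(
            ["glite-transfer-status", "-s", FTS_SERVICE, job_id],
            check=True, capture_output=True, text=True)
        state = out.stdout.strip()
        if state in ("Done", "Finished", "FinishedDirty", "Failed", "Canceled"):
            return state
        time.sleep(poll)

if __name__ == "__main__":
    job = submit("srm://t0-se.example.org/data/file001",
                 "srm://t1-se.example.org/data/file001")
    print(job, wait_for(job))
```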

Service Challenge 3 Network - CERN side. [Network diagram: T1s reached via cernh7 (192.16.160.1) at up to 4Gbps / up to 8Gbps; tapes (192.16.160.254), disks and database; General Purpose Network backbone; 44/88 machines on 192.16.160.0/22; 10Gb, 4x1Gb, 2x1Gb, 1Gb and nx1Gb links.] Routing: cernh7, one GPN backbone router and the 44-88 machines will be connected at layer 2. The 192.16.160.0/22 prefix will be shared among their interfaces. The correct routing must be configured on each one of the 44-88 machines: default towards the GPN, Tier1 prefixes towards cernh7. Security: a very strict access-list must be configured on cernh7 and the GPN backbone router. [email protected] 20050408 LCG Service Challenges Deploying the Service Historical slides from Les / Ian

2005 Sep-Dec - SC4 preparation: In parallel with the SC3 model validation period, in preparation for the first 2006 service challenge (SC4): using the 500 MByte/s test facility, test PIC and Nordic T1s and the T2s that are ready (Prague, LAL, UK, INFN, ..); build up the production facility at CERN to 3.6 GBytes/s; expand the capability at all Tier-1s to full nominal data rate. [Timeline graphic, 2005-2008: SC2, SC3, SC4, cosmics, first beams, first physics, full physics run.] LCG Service Challenges Deploying the Service Historical slides from Les / Ian 2006 Jan-Aug - SC4: SC4 - full computing model services

- Tier-0, ALL Tier-1s, all major Tier-2s operational at full target data rates (~2 GB/sec at Tier-0) - acquisition - reconstruction - recording - distribution, PLUS ESD skimming, servicing Tier-2s. Goal: stable test service for one month - April 2006. 100% Computing Model Validation Period (May-August 2006): Tier-0/1/2 full model test - all experiments - 100% nominal data rate, with processing load scaled to 2006 CPUs. LCG Service Challenges Deploying the Service SC4 Planning

Discussing a joint workshop with ARDA focussing on SC4, after the summer. Tentative dates: during the week of October 10 - 14. Clash with GDB and HEPiX? After we have concrete results from SC3? When a more concrete model for analysis has appeared? In all events, early enough for SC4 planning. LCG Service Challenges Deploying the Service Historical slides from Les / Ian 2006 Sep - LHC service available: The SC4 service becomes the permanent LHC service, available for experiments' testing, commissioning, processing of cosmic data, etc. All centres ramp up to the capacity needed at LHC startup - TWICE nominal performance. Milestone to demonstrate this 3 months before first physics data: April 2007.

LCG Service Challenges Deploying the Service Key dates for Connectivity: June05 - Technical Design Report (credibility review by LHCC). Sep05 - SC3 Service: 8-9 Tier-1s sustain 1 Gbps at Tier-1s, 5 Gbps at CERN; extended peaks at 10 Gbps at CERN and some Tier-1s. Jan06 - SC4 Setup: all Tier-1s; 10 Gbps at >5 Tier-1s, 35 Gbps at CERN. July06 - LHC Service: all Tier-1s; 10 Gbps at Tier-1s, 70 Gbps at CERN.

LCG Service Challenges Deploying the Service Key dates for Services: June05 - Technical Design Report. Sep05 - SC3 Service Phase. May06 - SC4 Service Phase. Sep06 - Initial LHC Service in stable operation. LCG Service Challenges Deploying the Service

Additional threads started to address: experiment involvement; bringing T2s into SC3; longer-term goals of bringing all T2s into the LCG Service (Challenges). The enthusiasm and support provided to these new activities is much appreciated. We have a lot of work ahead, but the problem is beginning to become tractable(?) LCG Service Challenges Deploying the Service Conclusions: To be ready to fully exploit the LHC, significant resources need to be allocated to a series of Service Challenges by all concerned parties. These challenges should be seen as an essential, on-going and long-term commitment to achieving a production LCG. The countdown has started - we are already in

(pre-)production mode Next stop: 2020
