Experience Deploying the Large Hadron Collider Computing Grid
Markus Schulz, CERN IT, GD-GIS

Overview
- LHC Computing Grid
- CERN
- Challenges for LCG
- The project
- Deployment and status
- LCG-EGEE
- Summary

CERN
- CERN, the European Organization for Nuclear Research
- Located on the Swiss-French border close to Geneva
- Funded by 20 European member states; in addition, several observer states and non-member states participate in the experimental programme
- The world's largest centre for particle physics research
- Provides infrastructure and tools (accelerators etc.)
- ~3000 employees and several thousand visiting scientists
- 50 years of history (several Nobel Prizes)
- The place where the WWW was born (1990)
- Next milestone: the LHC (Large Hadron Collider)

Challenges for the LHC Computing Grid
- The LHC (Large Hadron Collider), with 27 km of magnets, is the largest superconducting installation
- Proton beams collide at an energy of 14 TeV
- 40 million events per second from each of the 4 experiments
- After triggers and filters, 100-1000 MBytes/second remain
- Every year ~15 PetaBytes of data will be recorded
- This data has to be reconstructed and analyzed by the users
- In addition, a large computational effort is needed to produce Monte Carlo data

15 PetaBytes/year
- 15 PetaBytes/year have to be: recorded, catalogued, managed, distributed, and processed

Core Tasks
- Reconstruction: transform signals from the detector into physical properties (energy, charge, tracks, momentum, particle ID); computationally intensive with modest I/O requirements; a structured activity (production manager)
- Simulation: start from the theory and compute the response of the detector; very computationally intensive; a structured activity, but with a larger number of parallel activities
- Analysis: complex algorithms searching for similar structures to extract physics; very I/O intensive, with a large number of files involved; access to data cannot be effectively coordinated; iterative, parallel activities of hundreds of physicists

Computing Needs
- Some 100 million SPECInt2000 are needed
- A 3 GHz Pentium IV ~ 1K SPECInt2000
- O(100k) CPUs are needed

Large and distributed user community
- CERN collaborators: > 6000 users from 450 institutes
- None has all the required computing; all have access to some computing
- Europe: 267 institutes, 4603 users
- Elsewhere: 208 institutes, 1632 users
- Solution: connect all the resources into a computing grid

The LCG Project (and what it isn't)
- Mission: to prepare, deploy and operate the computing environment for the experiments to analyze the data from the LHC detectors
- Two phases:
  - Phase 1 (2002-2005): build a prototype based on existing grid middleware; deploy and run a production service; produce the Technical Design Report for the final system
  - Phase 2 (2006-2008): build and commission the initial LHC computing environment
- LCG is NOT a development project for middleware, but problem fixing is permitted (even if writing code is required)

LCG Time Line
- Timeline slide: testing with simulated event productions, experiment setup and preparation, then first data (TDR = Technical Design Report)

LCG Scale and Computing Model
- Tier-0: reconstruct (ESD); record raw data and ESD; distribute data to Tier-1
- Tier-1: data-heavy analysis; permanent, managed, grid-enabled storage (raw, analysis, ESD) and MSS; reprocessing; regional support
- Tier-2: managed disk storage; simulation; end-user analysis; parallel interactive analysis
- Data distribution: ~70 Gbits/sec
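The capacity figures quoted on the preceding slides can be cross-checked with some back-of-envelope arithmetic. The sketch below is not part of the original presentation; it only restates the numbers given above (15 PB/year, ~100 million SPECInt2000, ~1 kSI2k per 3 GHz Pentium IV, ~70 Gbit/s distribution) and derives the rest.

```python
# Back-of-envelope check of the LCG capacity figures quoted above.
# All inputs are the numbers from the slides; everything else is derived.

PB = 1e15                      # petabyte in bytes (decimal convention)

raw_data_per_year_pb = 15      # ~15 PB recorded per year
total_power_ksi2k = 100_000    # ~100 million SPECInt2000 = 100,000 kSI2k
cpu_power_ksi2k = 1            # one 3 GHz Pentium IV ~ 1 kSI2k
distribution_gbit_s = 70       # aggregate data distribution capacity

cpus_needed = total_power_ksi2k / cpu_power_ksi2k
seconds_per_year = 365 * 24 * 3600
avg_recording_mb_s = raw_data_per_year_pb * PB / seconds_per_year / 1e6
distribution_mb_s = distribution_gbit_s * 1e9 / 8 / 1e6

print(f"CPUs needed:            ~{cpus_needed:,.0f}")      # O(100k) CPUs
print(f"Average recording rate: ~{avg_recording_mb_s:.0f} MB/s")
print(f"Distribution capacity:  ~{distribution_mb_s:,.0f} MB/s (70 Gbit/s)")
```

The derived average recording rate (~500 MB/s) sits comfortably inside the 100-1000 MB/s post-filter range quoted above.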
LCG - a Collaboration
- Building and operating the LHC Grid is a collaboration between:
  - the physicists and computing specialists from the experiments
  - the projects in Europe and the US that have been developing grid middleware: the European DataGrid (EDG) and the US Virtual Data Toolkit (Globus, Condor, PPDG, iVDGL, GriPhyN)
  - the regional and national computing centres that provide resources for LHC, with some contribution from HP (Tier-2 centres)
  - the research networks
- Researchers, software engineers, service providers

LCG-2 Software
- LCG-2_1_0 core packages:
  - VDT (Globus2)
  - EDG WP1 (Resource Broker)
  - EDG WP2 (replica management tools); one central RMC and LRC for each VO, located at CERN, with an Oracle backend
  - Several bits from other WPs (config objects, information providers, packaging, ...)
  - GLUE 1.1 (information schema) plus a few essential LCG extensions
  - MDS-based information system with LCG enhancements
- Almost all components have gone through some reengineering: robustness, scalability, efficiency, adaptation to local fabrics

LCG-2 Software: Authentication, Authorization and Data Management
- Authentication and authorization:
  - Globus GSI, based on X.509 certificates
  - LCG established a trust relationship between the CAs in the project
  - Virtual Organizations (VOs); registration hosted at different sites
- Data management tools:
  - Catalogues keep track of replicas (Replica Metadata Catalog, Local Replica Catalog)
  - SRM interface for several mass storage systems and disk pools
  - Wide-area transport via GridFTP

LCG-2 Software: Information System and Workload Management
- Information system:
  - Globus MDS based, for information gathering on a site
  - LDAP plus a lightweight DB-based system for collecting data from sites
  - The LCG-BDII solved the scalability problem of the Globus2 MDS (tested with >200 sites)
  - Contains information on the capacity, capability, utilization and state of services (computing, storage, catalogues, ...)
- Workload management tools:
  - Match user requests with the resources available for a VO
  - Requirements are formulated in JDL (ClassAds)
  - User-tunable ranking of resources
  - Use the RLS and the information system
  - Keep the state of jobs and manage extension of credentials, input/output sandboxes, proxy renewal, ...
  - Interface to local batch systems via a gateway node
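To make the matching and ranking step described above concrete, here is a toy Python sketch (not the Resource Broker implementation): a request with JDL-like requirement and rank expressions is evaluated against resource records of the kind the information system publishes. The hostnames, attribute names and expressions are invented for the example.

```python
# Toy illustration of what the workload management system does conceptually:
# match a request against published resource information, then rank the
# matches with a user-tunable expression. Hostnames and attributes are
# invented for the example; this is not the Resource Broker code.

resources = [  # the kind of per-CE data the information system publishes
    {"ce": "ce01.example-tier1.org", "vo": "cms",   "free_cpus": 120, "waiting_jobs": 4},
    {"ce": "ce02.example-tier2.org", "vo": "cms",   "free_cpus": 10,  "waiting_jobs": 0},
    {"ce": "ce03.example-tier1.org", "vo": "atlas", "free_cpus": 300, "waiting_jobs": 1},
]

def match_and_rank(vo, requirements, rank):
    """Return the CEs usable by `vo` that satisfy `requirements`, best rank first."""
    candidates = [r for r in resources if r["vo"] == vo and requirements(r)]
    return sorted(candidates, key=rank, reverse=True)

# "requirements" and "rank" play the role of the JDL Requirements/Rank expressions.
requirements = lambda r: r["free_cpus"] >= 5
rank = lambda r: r["free_cpus"] - 10 * r["waiting_jobs"]

for r in match_and_rank("cms", requirements, rank):
    print(r["ce"], "rank =", rank(r))
```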
LCG-2 Software: Sandboxes
- The input sandbox is what you take with you to the node
- The output sandbox is what you get back
- Failed jobs are resubmitted

LCG Grid Deployment Area
- Goal: deploy and operate a prototype LHC computing environment
- Scope:
  - Integrate a set of middleware and coordinate and support its deployment to the regional centres
  - Provide operational services to enable running as a production-quality service
  - Provide assistance to the experiments in integrating their software and deploying it in LCG
  - Provide direct user support
- Deployment goals for LCG-2:
  - Production service for the Data Challenges in 2004, initially focused on batch production work
  - Gain experience in close collaboration between the regional centres
  - Learn how to maintain and operate a global grid
  - Focus on building a production-quality service: robustness, fault tolerance, predictability and supportability
  - Understand how LCG can be integrated into the sites' physics computing services

LCG Deployment Organisation and Collaborations (diagram slide)
- The Deployment Area Manager and the LCG Deployment Area work with the Grid Deployment Board (which advises, informs and sets policy) and its task forces
- The regional centres and the LHC experiments set requirements and participate
- Collaborative activities with the JTB, HEPiX, GGF and grid projects (EDG, Trillium, Grid3, etc.)

Implementation
- The Grid Deployment Board coordinates the partners:
  - Partners take responsibility for specific tasks (e.g. GOCs, GUS)
  - Focussed task forces as needed
  - Collaborative joint projects, via the JTB, grid projects, etc.
  - Addresses policy and operational issues that require general agreement
  - Brokered agreements on the initial shape of LCG-1 (via 5 working groups), security, and what is deployed
- CERN deployment group (~30 people):
  - Release preparation, certification, deployment and support activities
  - Integration, packaging, debugging, development of missing tools
  - Deployment coordination and support, security and VO management
  - Experiment integration and support

Operations Services
- Operations service:
  - RAL (UK) is leading a sub-project on developing operations services
  - Initial prototype: basic monitoring tools, mailing lists for problem resolution, a GDB database containing contact and site information
  - Working on defining policies for operation and responsibilities (draft document)
  - Working on grid-wide accounting
- Monitoring:
  - GridICE (a development of the DataTag Nagios-based tools)
  - GridPP job submission monitoring
  - Many more
- Deployment and operation support:
  - Hierarchical model: CERN acts as 1st-level support for the Tier-1 centres; Tier-1 centres provide 1st-level support for their associated Tier-2s

User Support
- Central model for user support; VOs provide 1st-level triage
- FZK (Germany) is leading a sub-project to develop user support services:
  - Web portal for problem reporting
  - Experiment contacts send problems through the FZK portal
  - During the data challenges the experiments used a direct channel via the GD teams
- Experiment integration support by a CERN-based group, with close collaboration during the data challenges
- Documentation: installation guides (manual and management-tool based) and rather comprehensive user guides

Security
- LCG Security Group
- LCG-1 usage rules (still used by LCG-2)
- Registration procedures and VO management; agreement to collect only a minimal amount of personal data
- Initial audit requirements are defined
- Initial incident response procedures
- Site security contacts etc. are defined
- Set of trusted CAs (including the Fermilab online KCA)
- Draft of the security policy (to be finished by the end of the year)
- Web site
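As a concrete illustration of the authorization side of GSI mentioned above, the toy sketch below shows the kind of subject-DN to account mapping a grid-mapfile expresses. The DNs and mapped accounts are invented examples, and real deployments use the Globus tooling rather than ad-hoc parsing; this is only a sketch of the idea.

```python
# Toy sketch of grid-mapfile-style authorization as used with Globus GSI:
# a certificate subject DN is mapped to a local (or pool) account.
# The entries below are invented examples, not real users.

GRIDMAP = """\
"/C=CH/O=CERN/OU=GRID/CN=Jane Doe 1234" .cms
"/C=DE/O=GermanGrid/OU=DESY/CN=John Smith" .atlas
"""

def parse_gridmap(text):
    """Parse grid-mapfile-style lines of the form: "<subject DN>" <account>."""
    mapping = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        dn, _, account = line.rpartition(" ")
        mapping[dn.strip('"')] = account
    return mapping

def authorize(subject_dn, mapping):
    """Return the mapped account, or refuse access if the DN is unknown."""
    account = mapping.get(subject_dn)
    if account is None:
        raise PermissionError(f"no grid-mapfile entry for {subject_dn}")
    return account

mapping = parse_gridmap(GRIDMAP)
print(authorize("/C=CH/O=CERN/OU=GRID/CN=Jane Doe 1234", mapping))  # -> .cms
```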
Certification and release cycles
- Diagram slide: the certification and release cycle

2003-2004 Milestones
- Project Level 1 deployment milestones for 2003:
  - July: introduce the initial publicly available LCG-1 global grid service, with 10 Tier-1 centres on 3 continents
  - November: expanded LCG-1 service with resources and functionality sufficient for the 2004 computing data challenges; additional Tier-1 centres, several Tier-2 centres and more countries; expanded resources at Tier-1s; agreed performance and reliability targets
- 2004, the "LHC Data Challenges":
  - Large-scale tests of the experiments' computing models, processing chains, grid technology readiness and operating infrastructure
  - The ALICE and CMS data challenges started at the beginning of March; LHCb and ATLAS started in May/June
  - The big challenge for this year is data: the file catalogue (millions of files), replica management, database access, and integrating all available mass storage systems (several hundred TB)

History
- Jan 2003: the GDB agreed to take VDT and EDG components
- March 2003: LCG-0, existing middleware, waiting for the EDG-2 release
- September 2003: LCG-1
  - 3 months late, hence reduced functionality
  - Extensive certification process; improved stability (RB, information system)
  - Integrated 32 sites, ~300 CPUs
  - Operated until early January; first use for production
- December 2003: LCG-2
  - Full set of functionality for the data challenges, but only the "classic SE"; first MSS integration
  - Deployed in January; data challenges started in February, i.e. testing in production
  - Large sites integrate resources into LCG (MSS and farms)
- May 2004 onwards: releases with incrementally improved services and SRM-enabled storage for disk and MSS systems

LCG1 Experience (2003)
- Integrate sites and operate a grid
- Problems:
  - Only a fraction of the personnel was available, and the software was late
  - Introduced a hierarchical support model (primary and secondary sites); it worked well for some regions, less for others
  - Installation and configuration were an issue: there was only time to package the software for the LCFGng tool (problematic); documentation was insufficient (partially compensated by travel); the manual installation procedure was documented only when new staff arrived
  - Communication and cooperation between sites needed to be established
  - Deployed MDS + EDG-BDII in a robust way: redundant regional GIISes vastly improved the scalability and robustness of the information system
  - Upgrades, especially non-backward-compatible ones, took very long
  - Not all sites showed the same dedication
  - Still some problems with the reliability of some of the core services
- A big step forward

LCG2
- Operate a large-scale production service
- Started with 8 "core" sites, each bringing significant resources and sufficient experience to react quickly:
  - Weekly core-site phone conference
  - Weekly meeting with each of the experiments
  - Weekly joint meeting of the sites and the experiments (GDA)
- Introduced a testZone for new sites:
  - LCG1 showed that ill-configured sites can affect all sites
  - Sites stay in the testZone until they have been stable for some time
- Further improved (simplified) the information system: addresses manageability, improves robustness and scalability, and allows partitioning of the grid into independent views
- Introduced a local testbed for experiment integration: rapid feedback on functionality from the experiments; triggered several changes to the RLS system
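As an illustration of how the LDAP-based information system described above is queried, here is a hedged Python sketch using the third-party ldap3 package. The BDII hostname is a placeholder, and the port (2170), the "o=grid" base DN and the GLUE 1.1 attribute names follow common LCG conventions but should be checked against an actual deployment.

```python
# Sketch of querying a BDII (LDAP) information system for computing elements.
# Requires the third-party `ldap3` package. The hostname is a placeholder;
# port 2170, the "o=grid" base and the GLUE attribute names follow the usual
# LCG conventions but may differ on a given deployment.

from ldap3 import Server, Connection, ALL

BDII_URL = "ldap://lcg-bdii.example.org:2170"   # placeholder host

server = Server(BDII_URL, get_info=ALL)
conn = Connection(server, auto_bind=True)       # anonymous bind

conn.search(
    search_base="o=grid",
    search_filter="(objectClass=GlueCE)",
    attributes=["GlueCEUniqueID", "GlueCEStateFreeCPUs", "GlueCEStateWaitingJobs"],
)

for entry in conn.entries:
    print(entry.GlueCEUniqueID, entry.GlueCEStateFreeCPUs, entry.GlueCEStateWaitingJobs)
```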
LCG2 (continued)
- Focus on integrating local resources:
  - Batch systems at CNAF, CERN and NIKHEF already integrated
  - MSS systems: CASTOR at several sites, Enstore at FNAL
- Experiment software distribution:
  - Mechanism based on a shared file system with access for privileged users
  - A tool publishes the installed software in the information system
  - Needs to be as transparent as possible (some work still to be done)
- Improved documentation and installation:
  - Sites have the choice to use LCFGng or follow a manual installation guide
  - A generic configuration description eases integration with local tools
  - The documentation includes simple tests, so sites join in a better state
  - Improved readability by going to HTML and PDF
  - Release page

Status (map slide)
- 22 countries, 63 sites: 49 in Europe, 2 in the US, 5 in Canada, 6 in Asia, 1 HP
- Coming: New Zealand, China, Korea, other HP sites (Brazil, Singapore)
- 6100 CPUs

Usage
- Hard to measure:
  - VOs "pick" services and add their own components for job submission, file catalogues, replication, ...
  - We have no central control of the resources
  - Accounting has to be improved
- File catalogues (used by 2 VOs): ~2.5 million entries

Integrating Site Resources
- The plan:
  - Provide defined grid interfaces to a grid site: storage, compute clusters, etc.
  - Integration with local systems is the site's responsibility
  - Middleware is layered over existing system installations
- But (real life):
  - Interfaces are not well defined (SRM is maybe a first?)
  - Lots of small sites require a packaged solution, including fabric management (disk pool managers, batch systems), that installs magically out of the box
  - Strive for the first view while providing the latter; but "some assembly is required", and it costs effort
- Constraints:
  - Packaging and installation are integrated with some of the middleware
  - Complex dependencies between middleware packages
  - The current software requires that many holes are punched into the sites' firewalls

Integrating Site Resources: Adding Sites
- Procedure:
  - The site contacts the deployment team or a Tier-1 centre
  - The deployment team sends a form for the contact DB and points the site to the release page
  - The site decides, after consultation, on the scope and method of installation
  - The site installs; problems are resolved via the mailing list and Tier-1 intervention
  - The site runs the initial certification tests (provided with the installation guides)
  - The site is added to the testZone information system
  - The deployment team runs certification jobs and helps the site fix problems
  - Tests are repeated and the status is published (GIIS and status pages); an internal web-based tool follows the history
  - VOs add stable sites to their RBs; sites are added to the productionZone
- Most frequent problems:
  - Missing or wrong localization of the configuration
  - Firewalls are not configured correctly; too many ports have to be opened
  - Site policies conflict with the need for access to the WNs

2004 Data Challenges
- LHC experiments use multiple grids and additional services:
  - Integration, interoperability
  - They expect some central resource negotiation concerning queue length, memory, scratch space, storage, etc.
- Service provision:
  - Planned to provide shared services (RBs, BDIIs, UIs, etc.)
  - Experiments need to augment the services on the UI and to define their super/subsets of the grid
  - Individual RB/BDII/UIs for each VO (optionally on one node)
- Scalable services for data transport are needed: DNS-switched access to GridFTP (see the sketch after this list)
- Performance issues with several tools (RLS, information system, RBs): most are understood; workarounds and fixes are implemented and part of 2_1_0
- Local resource managers (batch systems) are too smart for GLUE:
  - The GlueSchema can't express the richness of batch systems (LSF etc.)
  - The load distribution is not understandable for users (it looks wrong); the problem is understood and a workaround is in preparation
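The "DNS-switched access to GridFTP" mentioned above relies on a load-balanced alias that resolves to several transfer hosts, so clients spread naturally across them. The sketch below shows the idea with Python's standard socket module; the alias name is a placeholder and this is only an illustration of the mechanism, not LCG code.

```python
# Illustration of DNS-based switching for GridFTP access: a load-balanced
# alias resolves to several transfer nodes, and a client simply uses one of
# the addresses the resolver returns. The alias below is a placeholder.

import random
import socket

GRIDFTP_ALIAS = "gridftp.example-tier1.org"   # placeholder load-balanced alias

def pick_gridftp_host(alias, port=2811):
    """Resolve the alias and pick one of the published addresses (port 2811 is the usual GridFTP port)."""
    infos = socket.getaddrinfo(alias, port, proto=socket.IPPROTO_TCP)
    addresses = sorted({info[4][0] for info in infos})
    return random.choice(addresses)

if __name__ == "__main__":
    host = pick_gridftp_host(GRIDFTP_ALIAS)   # fails if the placeholder alias does not resolve
    print(f"would open gsiftp://{GRIDFTP_ALIAS}:2811/ via {host}")
```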
Interoperability
- Several grid infrastructures serve the LHC experiments: LCG-2/EGEE, Grid2003/OSG, NorduGrid and other national grids
- LCG has explicit goals to interoperate; this is one of the LCG service challenges
- Joint projects on storage elements, file catalogues, VO management, etc.
- Most infrastructures are VDT (or at least Globus) based; Grid2003 and LCG use the GLUE schema
- The issues are file catalogues, information schema, etc. at the technical level, plus policy and semantic issues

Developments in 2004
- General:
  - LCG-2 will be the service run in 2004; the aim is to evolve incrementally
  - The goal is to run a stable service
  - Service challenges: data transport (500 MB/s for one week; see the check after this list), jobs, interoperability
- Some functional improvements:
  - Extend access to MSS (tape systems) and managed disk pools
  - Distributed vs. replicated replica catalogues, with Oracle back-ends, to avoid reliance on single service instances
- Operational improvements:
  - Monitoring systems: move towards proactive problem finding, the ability to take sites on/offline, experiment monitoring (R-GMA), accounting
  - Control system (alarms, actions, ...)
  - A "cookbook" to cover planning, installation and operation
  - Activate regional centres to provide and support services; this has improved over time, but in general there is too little sharing of tasks
- Address integration issues: with large clusters (on non-routed networks), storage systems and operating systems, and integration with other experiments and applications
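For scale, the data-transport service challenge target quoted above (500 MB/s sustained for one week) works out to roughly 0.3 PB. The quick check below uses only the numbers from the slide, with the 15 PB/year figure from earlier for comparison.

```python
# Quick check of the 2004 data-transport service challenge target quoted
# above: 500 MB/s sustained for one week, compared to a nominal 15 PB year.

rate_mb_s = 500
seconds_per_week = 7 * 24 * 3600

total_tb = rate_mb_s * seconds_per_week / 1e6   # MB -> TB (decimal units)
print(f"~{total_tb:.0f} TB in one week "
      f"(about {total_tb / 15000:.0%} of a nominal 15 PB year)")
```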
Changing landscape
- The view of grid environments has changed in the past year:
  - From a view where all LHC sites would run a consistent and identical set of middleware
  - To a view where large sites must support many experiments, each of which has its own grid requirements
- National grid infrastructures are coming, catering to many applications and not necessarily driven by HEP requirements
- We have to focus on interoperating between potentially diverse infrastructures ("grid federations"):
  - At the moment these share the same underlying middleware, but the modes of use and the policies differ (information system, file catalogues, ...)
  - We need agreed services, interfaces and protocols
- The situation is now more complex than anticipated

LCG – EGEE
- LCG-2 will be the production service during 2004
  - It will also form the basis of the initial EGEE production service
  - It will be maintained as a stable service and will continue to be developed
- In parallel, a development service is expected in Q2 2004, based on EGEE middleware prototypes and run as a service on a subset of EGEE/LCG production sites
- The core infrastructure of the LCG and EGEE grids will be operated as a single service
  - LCG includes the US and Asia; EGEE includes other sciences
  - The ROCs support resource centres and applications, similar to the LCG primary sites; some ROCs and LCG primary sites will be merged
  - The LCG Deployment Manager will be the EGEE Operations Manager and a member of the PEB of both projects

Summary
- Deployment and operation of grid services is still uncharted territory:
  - The system straddles many administrative domains: where is the chain of command?
  - Diversity of sites is needed, while staying interoperable and secure
  - Communication: time zones, the size of the community
  - Upgrades: how do you drain a grid?
  - ...

Summary (continued)
- LCG1 in 2003 saw the first production of simulated events: 32 sites, 300 CPUs
- LCG2 has by now integrated 63 sites and 6100 CPUs
- The 2004 data challenges:
  - Evaluate whether simple computing models can be handled by the current grid middleware
  - Validate the security model
  - Understand the storage model
- Deploying and operating LCG2 gave input on:
  - Requirements for monitoring and control systems
  - Understanding management issues
  - Understanding the requirements of the sites
- Service challenges at the end of 2004:
  - Find performance and scalability issues
  - Measure in a generic way the limits of the system
  - Drive interoperability between grids
- No shortage of challenges and opportunities
