tftge dec 02

Information about tftge dec 02

Published on June 18, 2007

Author: CoolDude26

Source: authorstream.com

U.S. ATLAS Grid Production Experience
Kaushik De, University of Texas at Arlington
Troubleshooting and Fault Tolerance in Grid Environments, Chicago, December 11, 2002

U.S. ATLAS Testbed
- US-ATLAS testbed launched February 2001

Fabric Testing

Testbed Production
- Goals: Demonstrate distributed ATLAS data production, access and analysis using grid middleware and using tools developed by the testbed group
- Production (and testing) experience so far:
  - Fast simulation (Atlfast)
    - Short jobs, 5 sites used (all 8 sites certified)
    - Generated ~10 million events during two weeks in July 2002, 6000 files fully catalogued and accessible through the grid
  - Data Challenge production (Atlsim) Phase 1
    - CPU intensive - ~14 hours per job/output file
    - 3 heterogeneous sites participated: 15, 30, 300 nodes; Condor (2) and LSF; 300-1500 MHz
    - Generated 200k events, 5000 files in August 2002
  - DC Phase 2
    - ~25 hours per job, 50-60k events in January 2003
    - Pre-production testing started

General Remarks
- Tackled a large number of complex issues
  - repackaging of applications (by hand)
  - software deployment (PACMAN)
  - site verification (GridView)
  - production tools (GRAT, Grappa)
  - data management (magda)
  - VO management (BNL tools)
  - ...
- Troubleshooting
  - ignore & resubmit, check log files, databases
  - most of the troubleshooting done by tool developers - not a robust operations model!
- Fault tolerance
  - redundancy, independent verification process, concatenated logs, error handling
- Not a production environment yet - still a development testbed doing production!

Databases Used in U.S. DC1
- MySQL databases play a central role in U.S. DC1 production scripts
- Production database
  - used to track job status (filename, submitting site, processing site, job id, time started, time finished, temporary and final file locations…)
  - information is updated periodically during job
- Data management
  - to transfer input and output files using GridFTP
  - to register file locations in Magda catalogue
- Virtual Data Catalogue
  - used to define job (transformation)
  - store job parameters, random numbers
- Metadata catalogue
  - store post-production summary information
  - data provenance, physics summary...

GRAT Software
- ~50 independently executable modular scripts based on Globus and magda
- Minimal requirement on grid production site
  - Globus & Magda installed on gatekeeper
  - shared $ATLAS_SCRATCH disk for all nodes
- Automatic job submission under full user control
  - One, many or infinite sequence of jobs at one or many sites, using grid even for local submits
  - Any user from any site can submit production jobs
- Independent data management scripts to check consistency of production semi-automatically
  - query production database
  - check Globus for job completion status
  - check data catalog (magda) for output files
  - recover from many possible production failures
- Data management using magda: moving and registering output files to BNL HPSS and at replica locations on the grid

GRAT Execution Model
1. Resource Discovery
2. Partition Selection
3. Job Creation
4. Pre-stage
5. Batch Submission
6. Job Parameterization
7. Simulation
8. Post-stage
9. Cataloging
10. Monitoring

GRAT Job Scheduling
[Diagram; labels only: Create job script module, Replica storage select module, Site select module, Stage software on atlas_scratch, Move files/cleanup module, Execute Atlsim Job, Query environment, Partition select module, Scheduler, Gatekeeper, Queue, Node, ATLAS_SCRATCH, Virtual Data Catalogue, Register, Production, Magda Database]

DC1 Jobs on U.S. Grid

DC1 Production Experience
- Grid production requires robust software
- During 18 days of grid production (in August), every system died at least once
- Local experts were not always accessible (many of them on vacation)
- Examples:
  - scheduling machines died 5 times (thrice power failure, twice system hung)
  - Long network outages - multiple times
  - Gatekeeper - died at every site at least 2-3 times
  - Three databases used - production, magda and virtual data. Each inaccessible at least once!
  - Scheduled maintenance - HPSS, Magda server, LBNL hardware, LBNL Raid array…
- These outages should be expected on the grid, as we include many more sites
- We managed > 100 files/day (~75% efficiency) in spite of these stoppages!

Future Plans
- Continue production/development
  - Pileup data production (data - not cpu intensive)
  - other production/analysis use cases
- GRAT improvements
  - Use Condor-G for job submission
    - detailed plan developed working with Condor team
    - need database publication of Condor log files
    - 1 month time-scale
  - Use DAGMan for pileup production
    - nice use case - hundreds of nodes to be managed over many days or many weeks
    - 3 month time-scale
  - Migrate to Chimera
    - 6 month time-scale
  - MDS integration (using GLUE & Pippy schema)
  - Implement resource broker
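
Illustrative sketch (not from the talk): the production-database bookkeeping listed under "Databases Used in U.S. DC1", combined with the "ignore & resubmit" approach to troubleshooting, might look roughly like the Python below. This is not the GRAT code. It uses an in-memory SQLite database as a stand-in for MySQL, and the table name, column names, site names and the 24-hour timeout are all assumptions chosen for illustration; only the list of tracked fields comes from the slides.

```python
import sqlite3
from datetime import datetime, timedelta

# Hypothetical schema: the columns mirror the fields named on the slide
# (filename, submitting site, processing site, job id, start/finish times,
# temporary and final file locations). Names and types are assumptions.
SCHEMA = """
CREATE TABLE jobs (
    job_id          TEXT PRIMARY KEY,
    filename        TEXT NOT NULL,
    submitting_site TEXT,
    processing_site TEXT,
    time_started    TEXT,
    time_finished   TEXT,
    temp_location   TEXT,
    final_location  TEXT
)
"""

def open_db():
    """Create an in-memory database (SQLite stands in for MySQL here)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(SCHEMA)
    return conn

def record_start(conn, job_id, filename, submitting_site, processing_site, started):
    """Register a job when it is submitted."""
    conn.execute(
        "INSERT INTO jobs (job_id, filename, submitting_site, processing_site, time_started) "
        "VALUES (?, ?, ?, ?, ?)",
        (job_id, filename, submitting_site, processing_site, started.isoformat()),
    )

def record_finish(conn, job_id, finished, final_location):
    """Mark a job as finished and store where its output file ended up."""
    conn.execute(
        "UPDATE jobs SET time_finished = ?, final_location = ? WHERE job_id = ?",
        (finished.isoformat(), final_location, job_id),
    )

def jobs_to_resubmit(conn, now, max_hours):
    """Return ids of jobs started at least max_hours ago that never finished."""
    cutoff = now - timedelta(hours=max_hours)
    rows = conn.execute(
        "SELECT job_id, time_started FROM jobs "
        "WHERE time_finished IS NULL ORDER BY job_id"
    ).fetchall()
    return [job_id for job_id, started in rows
            if datetime.fromisoformat(started) <= cutoff]

if __name__ == "__main__":
    conn = open_db()
    t0 = datetime(2002, 8, 1)
    record_start(conn, "job1", "out1.dat", "site_a", "site_b", t0)
    record_start(conn, "job2", "out2.dat", "site_a", "site_c", t0)
    record_finish(conn, "job1", t0 + timedelta(hours=14), "final/out1.dat")
    # job2 never reported back, so it is a candidate for resubmission.
    print(jobs_to_resubmit(conn, t0 + timedelta(hours=40), max_hours=24))
```

A timeout-based check like this is only one possible consistency test; the slides also mention checking Globus for job completion status and checking the Magda catalogue for output files, which are not modelled here.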
