Storage
1) Castor / Storm
GPFS vs GridFTP block sizes - GPFS 1MB, SL3 GridFTP 64K, SL4 GridFTP 256K
Reduced GridFTP timeouts from 3600s to 3000s
CMS had problems with the GPFS s/w area - a latency issue [same as ECDF?] - partly due to a switch fault, but they migrated the s/w area to another filesystem
RAL - General slowdown needed DB intervention. LHCb RFIO failures. Need to prioritise tape mounts - Prodn >>> users.
2) CASTOR development - Better testing planned and in progress. Also logging improved.
3) dCache (SARA) - 20 GridFTP movers on thumpers cf. 6 on linux boxes. gsidcap only on the SRM node itself. Good tape metrics gathered and published on their wiki. HSM pools filled up due to orphaned files - removed from the pnfs namespace but not deleted. Caused by failed FTS transfers (timeout now increased).
... coffee time :-)
4) DPM (GRIF/LAL) - Mostly OK - some polishing required (dpm-drain was highlighted), plus Greig's monitoring work.
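A quick sanity check on the block sizes quoted under Castor / Storm above: the sketch below just counts how many GridFTP buffers fit into one GPFS block. It assumes 1MB means 1024 KiB and that the quoted GridFTP figures are per-write buffer sizes; the names (GPFS_BLOCK, GRIDFTP_BUFFERS, buffers_per_block) are made up for illustration and don't come from any of the talks.

```python
# Illustrative arithmetic only; sizes are the ones quoted in the notes.
KIB = 1024

GPFS_BLOCK = 1024 * KIB  # 1MB GPFS block (assumed to mean 1024 KiB)
GRIDFTP_BUFFERS = {      # GridFTP block sizes per OS release
    "SL3": 64 * KIB,
    "SL4": 256 * KIB,
}


def buffers_per_block(block_size: int, buffer_size: int) -> int:
    """Number of whole GridFTP buffers needed to fill one filesystem block."""
    return block_size // buffer_size


for release, size in GRIDFTP_BUFFERS.items():
    print(release, buffers_per_block(GPFS_BLOCK, size))
```

So under those assumptions an SL3 door needs 16 writes to fill a single GPFS block and an SL4 door needs 4.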
Databases
1) 'make em resilient' - have managed 99.96% availability (3.5h downtime/yr)
They have upgraded to new hardware on the oracle cluster @ Tier0
Lessons learnt from the power cut - make sure your network kit is on UPS, especially if you have machines you expect to stay up...
2) ATLAS - rely on the DB for reprocessing; it is capable of ~1k concurrent sessions. Some nice replication tricks to (and from) remote sites such as the muon calibration centres.
3) SRM issues - went more or less according to plan. Understood and corrected issues found. Otherwise - seemed to be within 'normal' load range that they were used to.
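For reference, the 99.96% availability figure in item 1 above does line up with the quoted 3.5h downtime/yr. A minimal sketch of the arithmetic, assuming a 365-day year (the function name is just for illustration):

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours, ignoring leap years


def downtime_hours_per_year(availability_percent: float) -> float:
    """Convert an availability percentage into hours of downtime per year."""
    return (1 - availability_percent / 100) * HOURS_PER_YEAR


print(round(downtime_hours_per_year(99.96), 2))
```

That works out to roughly 3.5 hours, matching the number quoted in the talk.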
 