CMS SC4 Site log for PHOENIX cluster

August - ? 2006:

14. 8. 2006 Workaround for DPM / VOMS issue

Solved the DPM / VOMS roles issue by opening the H1 pool for writing by all groups, i.e. the CMS production role's files will now end up there. Also asked the CSCS managers about a short-term upgrade of our disk capacity. Since LHCb is currently not using its storage space, we may also add that pool.

18. 8. 2006 Ordering additional disks (3.2 TB)

Vincenzo Annaloro proposed to stock up our existing file server with an additional 3.2 TB of disks, and we will go ahead with it. This will give us some flexibility for CSA06, since the real hardware upgrade will only come towards the end of the year.

21. 8. 2006 FTS testing in T2_CSCS_Load PhEDEx instance

I tested PhEDEx for three days via FTS with our STAR-CSCS channel hosted by FZK. Up to now the transfers had all been done with srmcp. The success rates were extremely sobering. Here I present some success rates based on correctly copied single files. These numbers, and also details about the errors, can be found in this CMS hypernews message.

SITE STATISTICS using FTS 18.-21. Aug

site: T1_RAL_Load (OK: 17 / FAILED: 115)   success rate: 12.9%
site: T1_CERN_Load (OK: 154 / FAILED: 273)   success rate: 36.1%
site: T1_FZK_Load (OK: 60 / FAILED: 141)   success rate: 29.9%
site: T1_CNAF_Load (OK: 1 / FAILED: 38)   success rate: 2.6%
site: T1_IN2P3_Load (OK: 0 / FAILED: 128)   success rate: 0%


site: T1_RAL_Load (OK: 8 / FAILED: 5)   success rate: 61.5%
site: T1_CERN_Load (OK: 2 / FAILED: 5)   success rate: 28.6%
site: T1_FZK_Load (OK: 11 / FAILED: 2)   success rate: 84.6%
site: T1_CNAF_Load (OK: 0 / FAILED: 1)   success rate: 0%
site: T1_IN2P3_Load (OK: 158 / FAILED: 1)   success rate: 99.4%

SITE STATISTICS using SRMCP ~week before change to FTS

site: T1_RAL_Load (OK: 51 / FAILED: 48)   success rate: 51.5%
site: T1_CERN_Load (OK: 20 / FAILED: 2)   success rate: 90.9%
site: T1_FZK_Load (OK: 122 / FAILED: 178)   success rate: 40.7%
site: T1_CNAF_Load (OK: 1 / FAILED: 90)   success rate: 1.1%
site: T1_IN2P3_Load (OK: 164 / FAILED: 19)   success rate: 89.6%
site: T1_FNAL_Load (OK: 391 / FAILED: 31)   success rate: 92.7%

Since the associated errors were so diverse and I saw no good chance of resolving them (based on the general status of successful file transfers by our service provider FZK), I decided to go back to using srmcp for now.
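The success rates in the statistics above are simply OK / (OK + FAILED) per site. As a minimal sketch of how such lines could be re-checked, assuming the line layout shown above (the function names and the regular expression are illustrative and not part of PhEDEx or of the script that produced the listing):

```python
import re

# Matches lines such as: "site: T1_RAL_Load (OK: 17 / FAILED: 115)   success rate: ..."
# An empty count (as in "OK:  / FAILED: 128") is accepted and treated as 0.
LINE_RE = re.compile(r"site:\s*(\S+)\s*\(OK:\s*(\d*)\s*/\s*FAILED:\s*(\d*)\s*\)")


def parse_site_line(line):
    """Return (site, ok, failed) for a statistics line, or None if it does not match."""
    m = LINE_RE.search(line)
    if m is None:
        return None
    site, ok, failed = m.groups()
    return site, int(ok or 0), int(failed or 0)


def success_rate(ok, failed):
    """Percentage of successfully copied files; 0.0 if there were no attempts."""
    total = ok + failed
    return 100.0 * ok / total if total else 0.0


if __name__ == "__main__":
    sample = "site: T1_RAL_Load (OK: 17 / FAILED: 115)"
    site, ok, failed = parse_site_line(sample)
    print(f"{site}: {success_rate(ok, failed):.1f}%")
```

Treating an empty count as zero is a choice made here so that lines with a missing OK value still parse.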

23. 8. 2006 DPM service breakdown

The DPM service on our SE died after about 20 gridftp processes started to use up the whole virtual memory, resulting in processes dying due to memory allocation errors. These gridftp jobs were basically idle and hanging. The machine was not swapping heavily. After killing some of the gridftp processes the system recovered. It is not clear whether the optimization of some kernel parameters as recommended by LCG had something to do with it (see this hypernews message).

Ganglia plots for Sat, Aug. 19 - Sat, Aug 26:

30. 8. 2006 DPM problem: uncontrolled dpm-gsiftp process spawning

Discovered around 16:30h that dpm-gsiftp processes were being forked off uncontrollably from the main dpm-gsiftp process. The children had no file descriptors open on files in our storage area, so they were not actively transferring. When the process count went above 900, I restarted the service. (Note: at the same time the CSCS sysadmin was formatting newly added disks on the server.) For more than a day afterwards there were again a few gsiftp processes (from atlassgm) which used up considerable virtual memory space and were obviously stuck for hours (in select() loops).
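Both this incident and the one on 23. 8. showed up as an unusual number of gridftp processes and a large virtual memory footprint. A minimal sketch of a check that could flag this earlier is given here; it assumes a Linux host with the procps `ps` command and that the processes appear under the command name `dpm-gsiftp` (an assumption taken from the wording of this entry). The function names and the alarm threshold are hypothetical.

```python
import subprocess


def summarize(ps_output, name):
    """Count processes called `name` and sum their virtual memory size (KiB).

    `ps_output` is the text printed by `ps -eo pid,vsz,comm`.
    """
    count = 0
    total_vsz_kib = 0
    for line in ps_output.splitlines():
        parts = line.split(None, 2)
        if len(parts) != 3:
            continue
        _pid, vsz, comm = parts
        if not vsz.isdigit():
            continue  # skips the header line
        if comm.strip() == name:
            count += 1
            total_vsz_kib += int(vsz)
    return count, total_vsz_kib


def snapshot(name="dpm-gsiftp"):
    """Run ps once and summarize the processes with the given command name."""
    out = subprocess.run(
        ["ps", "-eo", "pid,vsz,comm"],
        capture_output=True, text=True, check=True,
    ).stdout
    return summarize(out, name)


if __name__ == "__main__":
    MAX_PROCS = 900  # the count at which the service was restarted in this entry
    count, vsz_kib = snapshot()
    print(f"dpm-gsiftp processes: {count}, total VSZ: {vsz_kib} KiB")
    if count > MAX_PROCS:
        print("WARNING: process count above threshold")
```

Such a script could be run periodically, e.g. from cron. It only reports and does not restart anything.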


14. 9. 2006 Started FTS debugging for STAR-CSCS channel hosted at FZK

This activity is described on a separate page: FTSChannelDebugging

Until the end of September, we only got the FZK LoadTest success to about 40%, while from CERN it was always around 90%. There may still be problems with some firewall settings between FZK and CSCS...

-- DerekFeichtinger - 31 Aug 2006

