CMS SC4 Site log for PHOENIX cluster
August - ? 2006:
14. 8. 2006 Workaround for DPM / VOMS issue
Solved the DPM / VOMS roles issue by opening the H1 pool for writing by all groups, i.e. the CMS production role's files will now end up there. Also asked the CSCS managers about a short-term upgrade of our disk capacity. Since LHCb is currently not using its storage space, we may also add that pool.
18. 8. 2006 Ordering additional disks (3.2 TB)
Vincenzo Annaloro proposed to stock up our existing file server with an additional 3.2 TB of disks and we will go ahead with it.
This will give us some flexibility for CSA06, since the real hardware upgrade will only come towards the end of the year.
21. 8. 2006 FTS testing in T2_CSCS_Load PhEDEx instance
I tested PhEDEx for three days via FTS with our STAR-CSCS channel hosted by FZK. Up to now, the transfers had all been done with srmcp.
The success rates were extremely sobering. Here I present some success rates based on correctly copied single files. These numbers and details about the errors can be found in this CMS hypernews message.
SITE STATISTICS using FTS 18.-21. Aug
site: T1_RAL_Load (OK: 17 / FAILED: 115) success rate: 12.8787878787879%
site: T1_CERN_Load (OK: 154 / FAILED: 273) success rate: 36.0655737704918%
site: T1_FZK_Load (OK: 60 / FAILED: 141) success rate: 29.8507462686567%
site: T1_CNAF_Load (OK: 1 / FAILED: 38) success rate: 2.56410256410256%
site: T1_IN2P3_Load (OK: 0 / FAILED: 128) success rate: 0%
SITE STATISTICS using SRMCP 22.-23. Aug
site: T1_RAL_Load (OK: 8 / FAILED: 5) success rate: 61.5384615384615%
site: T1_CERN_Load (OK: 2 / FAILED: 5) success rate: 28.5714285714286%
site: T1_FZK_Load (OK: 11 / FAILED: 2) success rate: 84.6153846153846%
site: T1_CNAF_Load (OK: 0 / FAILED: 1) success rate: 0%
site: T1_IN2P3_Load (OK: 158 / FAILED: 1) success rate: 99.3710691823899%
SITE STATISTICS using SRMCP ~week before change to FTS
site: T1_RAL_Load (OK: 51 / FAILED: 48) success rate: 51.5151515151515%
site: T1_CERN_Load (OK: 20 / FAILED: 2) success rate: 90.9090909090909%
site: T1_FZK_Load (OK: 122 / FAILED: 178) success rate: 40.6666666666667%
site: T1_CNAF_Load (OK: 1 / FAILED: 90) success rate: 1.0989010989011%
site: T1_IN2P3_Load (OK: 164 / FAILED: 19) success rate: 89.6174863387978%
site: T1_FNAL_Load (OK: 391 / FAILED: 31) success rate: 92.6540284360189%
Since the associated errors were so diverse and I saw no good chance of resolving them (based on the general status of successful file transfers by our service provider FZK), I decided to go back to using srmcp for now.
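The success rates listed above are simply OK / (OK + FAILED) per site. A minimal sketch of that arithmetic follows, assuming the statistics are available as text lines in the format shown in this log. The function names `success_rate` and `parse_stats` are hypothetical and not part of PhEDEx or FTS; an empty OK count is treated as 0.

```python
import re

# Matches lines such as "site: T1_RAL_Load (OK: 17 / FAILED: 115) ...".
# An empty count (as in "OK: / FAILED: 128") is read as 0 (assumption).
LINE_RE = re.compile(r"site:\s*(\S+)\s*\(OK:\s*(\d*)\s*/\s*FAILED:\s*(\d*)\)")


def success_rate(ok, failed):
    """Return the percentage of successful transfers, 0.0 if there were none."""
    total = ok + failed
    return 100.0 * ok / total if total else 0.0


def parse_stats(text):
    """Map each site name found in the text to its success rate in percent."""
    rates = {}
    for match in LINE_RE.finditer(text):
        site, ok, failed = match.groups()
        rates[site] = success_rate(int(ok or 0), int(failed or 0))
    return rates


if __name__ == "__main__":
    sample = (
        "site: T1_RAL_Load (OK: 17 / FAILED: 115)\n"
        "site: T1_IN2P3_Load (OK: / FAILED: 128)\n"
    )
    for site, rate in parse_stats(sample).items():
        print(f"{site}: {rate:.1f}%")
```

Rounding to one decimal place is only a display choice here; the log keeps the full precision of the original script output.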
23. 8. 2006 DPM service breakdown
The DPM service on our SE died after about 20 gridftp processes started to use up all of the virtual memory, resulting in processes dying due to memory allocation errors. These gridftp jobs were basically idle and hanging. The machine was not swapping heavily. After killing some of the gridftp processes, the system recovered. It is not clear whether the optimization of some kernel parameters as recommended by LCG had something to do with it (see this hypernews message).
Ganglia plots for Sat, Aug. 19 - Sat, Aug 26:
30. 8. 2006 DPM problem: uncontrolled dpm-gsiftp process spawning
Discovered around 16:30 that dpm-gsiftp processes were being forked off uncontrollably from the main dpm-gsiftp process. The children had no open file descriptors for files in our storage area, so they were not actively transferring. When the process count exceeded 900, I restarted the service. (Note: at the same time the CSCS sysadmin was formatting newly added disks on the server.) For more than a day there were again a few gsiftp processes (from atlassgm) which used up considerable virtual memory and were evidently stuck for hours (in select() loops).
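A runaway process count like the one described above could be caught earlier with a small periodic check. The following is only a sketch under stated assumptions: that the processes appear under the command name `dpm-gsiftp` in `ps` output on a Linux host, and that 900 (the count at which the service was restarted in this incident) is a sensible alarm threshold. The function names are hypothetical, and the restart itself is deliberately left out.

```python
import subprocess
from collections import Counter


def count_processes(ps_output, name):
    """Count how many lines of `ps -e -o comm=` style output equal `name`."""
    counts = Counter(
        line.strip() for line in ps_output.splitlines() if line.strip()
    )
    return counts[name]


def current_count(name):
    """Ask ps for all command names on this host and count the matches."""
    result = subprocess.run(
        ["ps", "-e", "-o", "comm="],
        capture_output=True, text=True, check=True,
    )
    return count_processes(result.stdout, name)


def needs_attention(count, threshold=900):
    """True once the count exceeds the (assumed) alarm threshold."""
    return count > threshold


if __name__ == "__main__":
    n = current_count("dpm-gsiftp")  # process name taken from this log entry
    if needs_attention(n):
        print(f"WARNING: {n} dpm-gsiftp processes running")
```

Run from cron every few minutes, such a check would only report the symptom; it would not explain why the children were spawned.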
14. 9. 2006 Started FTS debugging for STAR-CSCS channel hosted at FZK
This activity is described on a separate page: FTSChannelDebugging
Until the end of September, we only got the FZK LoadTest success rate up to about 40%, while from CERN it was always around 90%. There may still be problems with some firewall settings between FZK and CSCS.
--
DerekFeichtinger - 31 Aug 2006