CMS Site Log for PHOENIX Cluster
9. 3. 2007 DPM service crash due to automatic updates requiring manual intervention
Automatic apt updates of the DPM service RPMs caused our monitoring to fail (because dpm-qryconf segfaulted). Restarting the services then failed to start the dpm service and, as it turned out, apparently corrupted the DB. Only then did we see that the RPMs had been updated and that a manual procedure was required to migrate the DB to a new schema. In our opinion these changes were not communicated with the required visibility. We had assumed that the updates in this apt repository usually cover security-relevant fixes and that no major service updates are published there.
Migrating the DB using the documentation provided by LCG was no longer possible since the DB had been corrupted. I was forced to restore the backup taken that morning. With it, the update script ran (almost) correctly, and I was able to start the service again by Friday night.
13./14. 3. 2007 Cycle-1 Week-5 Load tests
Following the instructions from D. Bonarcorsi, I started a download test from FZK using the /PhEDEx_Prod/LoadTest07_FZK/CSCS sample. The transfers performed worse than at the end of last month, with intermittent periods of failures. This may be a result of the FZK dCache SRM instability that the operators have mentioned.
Note: The PhEDEx page shows transfer speeds of ~7 MB/s.
SITE STATISTICS:
==================
first entry: 2007-03-13 14:32:59 last entry: 2007-03-15 09:50:02
site: T1_FZK_Buffer (OK: 417 / Err: 136) succ. rate: 75.4 % total: 1030.2 GB avg. rate: 4.2 MB/s = 35.5 Mb/s
Error message statistics per site:
===================================
*** ERRORS from T1_FZK_Buffer:***
85 Failed Failed on SRM get: Cannot Contact SRM Service. Error in srm__ping:
SOAP-ENV:Client - CGSI-gSOAP: Could not open connection !
33 the server sent an error response: 425 425 Can't open data connection
11 Failed Failed on SRM get: Cannot Contact SRM Service. Error in srm__ping:
SOAP-ENV:Client - CGSI-gSOAP: Error reading token data: Success
4 transfer expired in the download agent queue
1 Failed Failed on SRM get: Cannot Contact SRM Service. Error in srm__ping: service timeout.
1 Failed Cannot retrieve final message from
/var/tmp/glite-url-copy-edguser/STAR-CSCSfailed/STAR-CSCS__2007-03-14-1418_2yYlpG
1 Failed Failed on SRM get: SRM getRequestStatus timed out on get
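The derived figures in the site statistics above can be cross-checked with a few lines of Python. This is only a sketch: the function names are made up for illustration, and the unit convention (MB taken as 2^20 bytes, Mb as 10^6 bits) is an assumption that happens to roughly reproduce the reported Mb/s value. It is not confirmed by the tool that produced the report.

```python
def success_rate(ok, err):
    """Percentage of successful transfers out of all attempts."""
    return 100.0 * ok / (ok + err)


def mib_per_s_to_mbit_per_s(rate):
    """Convert MB/s to Mb/s, ASSUMING MB = 2**20 bytes and Mb = 10**6 bits."""
    return rate * 2**20 * 8 / 1e6


# Values taken from the T1_FZK_Buffer line above (OK: 417 / Err: 136).
print(round(success_rate(417, 136), 1))   # 75.4

# The report lists 4.2 MB/s = 35.5 Mb/s. A plain factor of 8 would give 33.6,
# while the assumed convention lands close to the reported value. The remaining
# gap is consistent with the MB/s figure having been rounded for display.
print(mib_per_s_to_mbit_per_s(4.2))
```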
19./20. 3. 2007 Cycle-2 Week-1 Load tests
Very stable transfers, though at a low throughput (ca. 5 MB/s average with 25-30 MB/s peaks).
An explanation is given by Artem Trunov from FZK:
"For the changing rate we sort of have explanations - the injection is kept
at 5MB/s level, so if transfers work all the time, then we get a stable
5MB/s rate, and this is what we see for CSCS for example. Then, if
transfers not working for any reasons, and then start working again, you see
a peak in rate, this is illustrated by Aachen. Since GridKa was not stable
last week, and they now have the latest patch, we may see less gridka
induced errors."
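The mechanism described in the quote can be illustrated with a toy backlog model. This is a sketch under stated assumptions, not a description of how PhEDEx schedules transfers: the function name, the link capacity of 30 MB/s, and the outage window are hypothetical values chosen for illustration. Only the 5 MB/s injection rate comes from the quote.

```python
def simulate_rates(injection, capacity, outage, steps):
    """Return the transfer rate per time step for a simple backlog model.

    injection: data injected per step (MB/s, from the quote: 5)
    capacity:  maximum transfer rate per step (hypothetical)
    outage:    time steps during which all transfers fail (hypothetical)
    """
    backlog = 0.0
    rates = []
    for t in range(steps):
        backlog += injection              # new data queued in this step
        if t in outage:
            rate = 0.0                    # transfers are failing
        else:
            rate = min(capacity, backlog) # drain as much as the link allows
        backlog -= rate
        rates.append(rate)
    return rates


rates = simulate_rates(injection=5.0, capacity=30.0, outage=range(3, 8), steps=12)
print(rates)
```

While transfers work, the rate equals the injection rate. After the outage, the accumulated backlog is drained in one burst, which shows up as a peak before the rate falls back to the injection level.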
SITE STATISTICS:
==================
first entry: 2007-03-19 02:29:38 last entry: 2007-03-20 13:27:30
site: T1_FZK_Buffer (OK: 440 / Err: 4) succ. rate: 99.1 % total: 1087.1 GB avg. rate: 3.8 MB/s = 32.0 Mb/s
*** ERRORS from T1_FZK_Buffer:***
1 Failed Getting filesize failed. the server sent an error response: 553 553 Permission denied, reason:
CacheException(rc=10006;msg=Pnfs request timed out)
1 Failed Getting filesize failed. the server sent an error response:
500 500 java.lang.reflect.InvocationTargetException: <auth>
1 Failed Getting filesize failed. an end-of-file was reached
1 Failed Transfer failed. ERROR the server sent an error response:
500 500 java.lang.reflect.InvocationTargetException: <retr>
22. 3. 2007 Transfer failures due to dCache downtime at FZK
Transfers are failing because of dCache maintenance at FZK. While the system is unavailable, it mainly produces this error message in SRM:
transfer expired in the download agent queue
-- DerekFeichtinger - 22 Mar 2007