CMS Site Log for PHOENIX Cluster

Arrow left Go to previous page / next page of CMS site log MOVED TO...

9. 3. 2007 DPM service crash due to automatic updates requiring manual intervention

Automatic apt updates of the DPM service RPMs broke our monitoring because dpm-qryconf segfaulted. Restarting the services then failed to bring up the dpm service and, as it turned out, seems to have corrupted the DB. Only then did we see that the RPMs had been updated and that a manual procedure was required to migrate the DB to a new schema. In our opinion these changes were not communicated with the required visibility. We had originally thought that the updates in this apt repository usually cover security-relevant fixes and that no major service updates are published there.

Migrating the DB using the documentation provided by LCG was no longer possible since the DB had been corrupted. I was forced to restore a backup taken that morning. With this backup the update script ran (almost) correctly, and I was able to start up the service again by Friday night.

13./14. 3. 2007 Cycle-1 Week-5 Load tests

Following the instructions from D. Bonacorsi, I started a download test from FZK using the /PhEDEx_Prod/LoadTest07_FZK/CSCS sample. The transfers performed worse than at the end of last month, with intermittent periods of failure. This may be the result of the FZK dCache SRM instability that has been mentioned by the operators.

Note: PhEDEx page shows transfer speeds of ~7 MB/s.

SITE STATISTICS:
==================
                         first entry: 2007-03-13 14:32:59      last entry: 2007-03-15 09:50:02
site: T1_FZK_Buffer (OK: 417 / Err: 136)   succ. rate: 75.4 %   total: 1030.2 GB   avg. rate: 4.2 MB/s = 35.5 Mb/s

Error message statistics per site:
===================================

 *** ERRORS from T1_FZK_Buffer:***
     85   Failed Failed on SRM get: Cannot Contact SRM Service. Error in srm__ping: 
            SOAP-ENV:Client - CGSI-gSOAP: Could not open connection !
     33   the server sent an error response: 425 425 Can't open data connection
     11   Failed Failed on SRM get: Cannot Contact SRM Service. Error in srm__ping: 
            SOAP-ENV:Client - CGSI-gSOAP: Error reading token data: Success
      4   transfer expired in the download agent queue
      1   Failed Failed on SRM get: Cannot Contact SRM Service. Error in srm__ping: service timeout.
      1   Failed Cannot retrieve final message from
           /var/tmp/glite-url-copy-edguser/STAR-CSCSfailed/STAR-CSCS__2007-03-14-1418_2yYlpG
      1   Failed Failed on SRM get: SRM getRequestStatus timed out on get
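
The success rate in the summary line follows directly from the OK and Err counts, and the per-message error counts add up to the Err total. The following is a minimal sketch of that check; the regular expression and the function name `success_rate` are hypothetical and written only for the one summary line format shown above, not taken from the tool that produced the statistics.

```python
import re

# Summary line copied from the SITE STATISTICS block of 13./14. 3. 2007.
SUMMARY = ("site: T1_FZK_Buffer (OK: 417 / Err: 136)   succ. rate: 75.4 %   "
           "total: 1030.2 GB   avg. rate: 4.2 MB/s = 35.5 Mb/s")

# Per-message error counts copied from the error listing of the same block.
ERROR_COUNTS = [85, 33, 11, 4, 1, 1, 1]

# Hypothetical pattern: matches only the "site: NAME (OK: n / Err: m)" prefix.
PATTERN = re.compile(
    r"site:\s+(?P<site>\S+)\s+\(OK:\s+(?P<ok>\d+)\s+/\s+Err:\s+(?P<err>\d+)\)"
)

def success_rate(line):
    """Return (site, ok, err, percentage of successful transfers)."""
    m = PATTERN.search(line)
    if m is None:
        raise ValueError("not a site summary line")
    ok, err = int(m.group("ok")), int(m.group("err"))
    return m.group("site"), ok, err, round(100.0 * ok / (ok + err), 1)

site, ok, err, rate = success_rate(SUMMARY)
print(site, rate)                  # T1_FZK_Buffer 75.4
print(sum(ERROR_COUNTS) == err)    # True
```

The same function applied to the Cycle-2 Week-1 summary line (OK: 440 / Err: 4) gives 99.1 %.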

cpu_report-20070314-1153.gif
network_report-20070314-1153.gif

19./20. 3. 2007 Cycle-2 Week-1 Load tests

Very stable transfers, though at a low throughput rate (ca. 5 MB/s average with 25-30 MB/s peak).

An explanation is given by Artem Trunov from FZK:

"For the changing rate we sort of have explanations - the injection is kept at 5MB/s level, so if transfers work all the time, then we get a stable 5MB/s rate, and this is what we see for CSCS for example. Then, if transfers are not working for any reasons, and then start working again, you see a peak in rate, this is illustrated by Aachen. Since GridKa was not stable last week, and they now have the latest patch, we may see less gridka induced errors."

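The mechanism described in the quote can be illustrated with a toy queue model: data is injected at a constant rate, and whatever has accumulated is drained as fast as the link allows once transfers work. This is only a sketch under stated assumptions. The function `simulate` and the link capacity of 30 MB/s are hypothetical (the capacity is chosen to match the observed peak), and the real PhEDEx scheduling is certainly more involved.

```python
def simulate(injection, capacity, link_up):
    """Return the data moved in each time step of a constantly fed queue.

    injection: data injected per step (here: MB per second)
    capacity:  most data the link can move per step while it is up
               (assumed value, not taken from the log)
    link_up:   one boolean per step, False while transfers are failing
    """
    backlog = 0.0
    moved = []
    for up in link_up:
        backlog += injection                     # new data enters the queue
        sent = min(backlog, capacity) if up else 0.0
        backlog -= sent                          # unsent data waits for later
        moved.append(sent)
    return moved

# Transfers work all the time: the rate settles at the injection rate.
stable = simulate(5.0, 30.0, [True] * 6)
print(stable)   # [5.0, 5.0, 5.0, 5.0, 5.0, 5.0]

# Transfers fail for 5 steps, then recover: the backlog drains in one peak.
outage = simulate(5.0, 30.0, [False] * 5 + [True] * 3)
print(outage)   # [0.0, 0.0, 0.0, 0.0, 0.0, 30.0, 5.0, 5.0]
```

In this model a site with no failures never exceeds the 5 MB/s injection rate, while a site recovering from an outage shows a short peak limited only by the link.
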
SITE STATISTICS:
==================
                         first entry: 2007-03-19 02:29:38      last entry: 2007-03-20 13:27:30
site: T1_FZK_Buffer (OK: 440 / Err: 4)   succ. rate: 99.1 %   total: 1087.1 GB   avg. rate: 3.8 MB/s = 32.0 Mb/s


 *** ERRORS from T1_FZK_Buffer:***
      1   Failed Getting filesize failed. the server sent an error response: 553 553 Permission denied, reason: 
           CacheException(rc=10006;msg=Pnfs request timed out)
      1   Failed Getting filesize failed. the server sent an error response: 
           500 500 java.lang.reflect.InvocationTargetException: <auth>
      1   Failed Getting filesize failed. an end-of-file was reached
      1   Failed Transfer failed. ERROR the server sent an error response:
           500 500 java.lang.reflect.InvocationTargetException: <retr>

cpu_report-20070320-1432.gif
network_report-20070320-1432.gif

22. 3. 2007 Transfer failures due to dCache downtime at FZK

Transfers are failing because of dCache maintenance at FZK. While the system is unavailable, it mainly produces this error message in SRM: transfer expired in the download agent queue.

-- DerekFeichtinger - 22 Mar 2007
