CMS CSA06 Site Log for PHOENIX Cluster


6. 10. 2006 Many failures due to permissions problem - FIXED

Came back from a few days of vacation and discovered lots of failures on our CSCS PhEDEx download agent.

It's an old problem: the directory /dpm/projects.cscs.ch/home/cms/trivcat/store/CSA06 had been created automatically during production by Carsten with the production VOMS role, while the PhEDEx agents from FZK now try to write with a "naked" proxy that just gets mapped to the group cms. In DPM the groups "cms/Role=production" and "cms" are treated as completely different groups; no hierarchy of groups is possible. Additional permissions can be set by hand using dpns-setacl, which is fine as long as the different roles constrain themselves to writing in well-separated branches of the directory tree, so only a few high-level directories need to be changed by a manager. It would be optimal if the authorization modules of the storage services understood group hierarchies... (also look at this hypernews thread)
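
The fix amounts to a manual ACL change on the top-level directory. A sketch of the kind of commands involved, assuming the standard dpns-getacl/dpns-setacl syntax (the exact entries, e.g. mask or default ACLs, may need adjusting):

   # Inspect the current ACL on the contested directory
   dpns-getacl /dpm/projects.cscs.ch/home/cms/trivcat/store/CSA06

   # Grant the plain "cms" group write access; the VOMS-role group
   # cms/Role=production remains the owning group. A default (d:) entry
   # lets newly created subdirectories inherit the permission.
   dpns-setacl -m g:cms:rwx,m:rwx /dpm/projects.cscs.ch/home/cms/trivcat/store/CSA06
   dpns-setacl -m d:g:cms:rwx /dpm/projects.cscs.ch/home/cms/trivcat/store/CSA06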

After changing the permissions of the directory, the transfers immediately started to work, although pushing our SE to the limit:


(Ganglia plots: cpu20061006.gif, network20061006.gif, gridftp20061006.gif)

The PhEDEx settings of the download agent are

   -batch-files 5
   -jobs        2
while the FTS channel parameters are set to

   Bandwidth: 0
   Nominal throughput: 0
   Number of files: 5, streams: 1
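
For reference, a minimal sketch of how these agent options typically sit in a site's PhEDEx Config file (the agent label and environment variables are illustrative, not necessarily our exact setup):

   ### AGENT LABEL=download-fzk PROGRAM=Toolkit/Transfer/FileDownload
    -db          ${PHEDEX_DBPARAM}
    -nodes       ${PHEDEX_NODE}
    -backend     SRM
    -batch-files 5
    -jobs        2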

8. 10. 2006: Success rate for the transfers from FZK is basically 100% (the repeated failures yesterday were due to a wrong file length value in the central DB)! Peak transfer rates for some single files are ~150 Mb/s (≈ 19 MB/s).

9. 10. 2006 Strange myproxy problem - FIXED

A configuration problem on the FTS server at FZK caused an error condition in which all of our download transfers failed because no myproxy certificate could be obtained. I was not able to reproduce the failure when checking the certificate delegation myself with the myproxy tools:

$> myproxy-info -d -s myproxy-fts.cern.ch
username: /O=GRID-FR/C=CH/O=CSCS/OU=CC-LCG/CN=Derek Feichtinger
owner: /O=GRID-FR/C=CH/O=CSCS/OU=CC-LCG/CN=Derek Feichtinger
  timeleft: 719:53:34  (30.0 days)

$> myproxy-get-delegation -v -l "/O=GRID-FR/C=CH/O=CSCS/OU=CC-LCG/CN=Derek 
Feichtinger" -s myproxy-fts.cern.ch -o /tmp/gagaga
Enter MyProxy pass phrase:
Socket bound to port 20002.

server name: /C=CH/O=CERN/OU=GRID/CN=host/lxb0728.cern.ch
checking if server name matches "myproxy@lxb0728.cern.ch"
server name does not match
checking if server name matches "host@lxb0728.cern.ch"
server name accepted
A proxy has been received for user /O=GRID-FR/C=CH/O=CSCS/OU=CC-LCG/CN=Derek 
Feichtinger in /tmp/gagaga

Martin Litmaath suggested that the problem was due to a missing setting in the environment on the VO agent node at FZK: MYPROXY_TCP_PORT_RANGE 20000,25000. This solved it, so I restarted our PhEDEx download agent.
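
The fix on the VO agent node amounts to exporting the variable in the agents' environment before they are started; a minimal sketch, assuming a Bourne-type shell profile (presumably the point is that the ports bound by the myproxy client then fall inside the range the firewall allows):

   # environment of the FTS VO agents at FZK (profile location is site-specific)
   export MYPROXY_TCP_PORT_RANGE="20000,25000"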

The current downloads are quite efficient. As can be seen from the following ganglia plots, the transfers have a constant throughput. The CPU is also used much more efficiently than last month (maybe because we now have more disks in the machine and I opened up the pools?).


(Ganglia plots: cpu20061011.gif, network20061011.gif, gridftp20061011.gif)

19. 10. 2006 Transfers stable - about 1 TB/week

The transfers are going pretty smoothly. Yesterday the CMS pool on our SE had only 500 GB left, but following a message from Ian Fisk I was able to erase the /store/CSA06/CSA06-102-os-minbias-0 data set.

The CMS pool graph below shows that we are currently getting data from FZK rather stably at about 1 TB/week.
(Ganglia plot: dpm_diskfree_cms20061019.gif)

SITE STATISTICS:  for 2006-10-14 00:09:27  until 2006-10-20 10:08:49
==================
site: T1_FZK_Buffer (OK: 328 / FAILED: 28)   success rate: 92.1348314606742%
site: T1_FNAL_Buffer (OK: 44 / FAILED: 12)   success rate: 78.5714285714286%

Errors:
      1 T1_FZK_Buffer:  Failed Failed on SRM get: SRM getRequestStatus timed out on get;
      2 T1_FZK_Buffer:  Failed Failed on SRM put: Failed SRM put on 
              httpg://se01-lcg.projects.cscs.ch:8443/srm/managerv1 ; Error is File exists;
     10 T1_FNAL_Buffer:  Canceled (null);
     25 T1_FZK_Buffer:  Failed Failed on SRM get: Cannot Contact SRM Service. Error in srm__ping: 
                      SOAP-ENV:Client - CGSI-gSOAP: Error reading token data: Success;
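
For the curious: these per-site success rates are just simple counting over the transfer records. As an illustration (the input format here is hypothetical: one line per transfer attempt, "<site> OK|FAILED <details>"), something like the following does it:

   awk '{ tot[$1]++; if ($2 == "OK") ok[$1]++ }
        END { for (s in tot)
                 printf "site: %s (OK: %d / FAILED: %d)   success rate: %g%%\n",
                        s, ok[s], tot[s] - ok[s], 100 * ok[s] / tot[s] }' transfers.log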

26. 10. 2006 Very bad transfer quality from FZK

Since approximately the time of the dCache update at FZK, the transfer quality has been consistently bad. I can see that all transfers from FZK share this problem. Notified the FZK admins.

SITE STATISTICS: 2006-10-26 05:20:53  to  2006-10-26 12:41:29
==================
site: T1_FZK_Buffer (OK: 1 / FAILED: 16)   success rate: 5.88235294117647%
site: T1_FNAL_Buffer (OK: 12 / FAILED: 13)   success rate: 48%

Errors:
      1 T1_FZK_Buffer:  Failed Failed on SRM get: SRM getRequestStatus timed out on get;
      4 T1_FZK_Buffer:  Canceled (null);
      8 T1_FZK_Buffer:  Failed Transfer failed. ERROR the server sent an error response:
               425 425 Can't open data connection. timed out() failed.;
      7 T1_FNAL_Buffer:  Canceled (null);

Note: the "425 Can't open data connection" error is again the most prominent one.

29. 10. 2006 SE services breakdown - still trouble with FZK transfers

This morning around 5 am, the SE seems to have run out of memory again and was unable to spawn any new processes. Monitoring information from 5 am until about noon is missing, which suggests that not even the ganglia data collection and sending processes could be launched. Based on the two following monitoring graphs, it again looks as if a small number of processes hijacked the whole memory.


(Ganglia plots: memory20061029.gif, load_fifteen-20061029.gif)

The system recovered by itself, but the SE services had died.

Apart from that, we still have severe problems with transfers from FZK. The admins there told us that they have difficulties with the gridftp doors of their new dCache version. We are getting almost no data at the moment.

30. 10. 2006 SE services restarted, again huge memory consumption at night

23:30h: There is again a dpm.ftpd process with huge memory consumption (2.79 GB, much of it swap space). The local process owner is cms016. Looking at the process reveals that it holds a read-mode file descriptor to a CMS file of length 2839158582 bytes; the fd points to /storage/cms/cms/2006-10-28/2C4599BC-1861-DB11-B16E-0002B3D8E597.root.853039.0. There are a number of established sockets to FZK: se01-lcg.projects.cscs.ch:20009->f01-015-107-e.gridka.de:20014. I still need to find out how to map such a disk file to the CMS LFN in the DPM DB (a sketch of the lookup follows the transfer details below). In the PhEDEx web monitoring pages (transfer details) I can see an active transfer from CSCS to FZK; its file length matches the above job exactly:

lfn=/store/CSA06/CSA06-105-os-minbias-0/RECO/CMSSW_1_0_5-RECO-H746ee88eddaa52306cd016b2f689e370/1021/
             2C4599BC-1861-DB11-B16E-0002B3D8E597.root
size=2839158582
errors=0
from_pfn=srm://se01-lcg.projects.cscs.ch:8443/srm/managerv1?SFN=/dpm/projects.cscs.ch/home/cms/trivcat/store/
                      CSA06/CSA06-105-os-minbias-0/RECO/CMSSW_1_0_5-RECO-H746ee88eddaa52306cd016b2f689e370/
                      1021/2C4599BC-1861-DB11-B16E-0002B3D8E597.root
to_pfn=srm://gridka-dCache.fzk.de:8443/srm/managerv1?SFN=/pnfs/gridka.de/cms/disk-only/Prod/store/CSA06/
                     CSA06-105-os-minbias-0/RECO/CMSSW_1_0_5-RECO-H746ee88eddaa52306cd016b2f689e370/
                     1021/2C4599BC-1861-DB11-B16E-0002B3D8E597.root
time_xfer_start=2006-10-30 18:49:09 UTC
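
On the LFN mapping mentioned above: in the DPNS MySQL database the replica's SFN can be joined to the file metadata. A sketch, assuming the standard cns_db schema shipped with DPM (Cns_file_metadata.name only gives the basename; recovering the full logical path would require walking up parent_fileid):

   # map the disk-node SFN back to the registered file entry (illustrative)
   mysql cns_db -e "SELECT m.name, r.sfn FROM Cns_file_replica r \
       JOIN Cns_file_metadata m ON m.fileid = r.fileid \
       WHERE r.sfn LIKE '%2C4599BC-1861-DB11-B16E-0002B3D8E597%';"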

The transfer process is completely dormant; I see no activity. I killed it as a safety measure. So it seems the problematic processes are hanging dpm.ftpd processes exporting files to the outside, with a big part of the file memory-mapped. Obviously the timeout set by the receiving end is much too high. It is not clear why the transfer got stuck.
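
For the record, a sketch of how such a hanging dpm.ftpd process can be inspected with standard Linux tools (the PID is illustrative):

   ps -eo pid,user,vsz,comm | grep ftpd   # find the dpm.ftpd process with the huge footprint
   ls -l /proc/12345/fd                   # open fds -> the .root file on /storage
   netstat -tnp | grep 12345              # established data sockets -> gridka.de
   kill 12345                             # remove the hung transfer process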

1. 11. 2006 FZK seems ok again, good transfer quality from FNAL

FZK site admins reported that their dCache pool is working again. There were only 5 PhEDEx transfers during the night. They were all successful, but at low transfer rates. Our best source is now FNAL, though with equally low transfer rates. At least the success rates are now reassuring.

SITE STATISTICS:
==================
                         start: 2006-10-31 21:21:06      end: 2006-11-01 12:35:36
site: T1_FZK_Buffer (OK: 5 / FAILED: 0)   success rate: 100%   avg. rate: 4.01 Mb/s
site: T1_FNAL_Buffer (OK: 95 / FAILED: 0)   success rate: 100%   avg. rate: 3.84 Mb/s

Currently we have no JobRobot information. This may be because the needed samples are not yet on site, because the CE disappeared from the BDII for some time, or because of problems with the JobRobot itself. As soon as we have the needed samples, I will investigate.



-- DerekFeichtinger - 1 Nov 2006

Topic attachments:

   cpu20061006.gif                12.3 K   2006-10-06
   cpu20061011.gif                11.9 K   2006-10-11
   dpm_diskfree_cms20061019.gif    9.9 K   2006-10-19
   gridftp20061006.gif             9.7 K   2006-10-06
   gridftp20061011.gif             9.7 K   2006-10-11
   load_fifteen-20061029.gif      10.0 K   2006-10-29
   memory20061029.gif             12.7 K   2006-10-29
   network20061006.gif            12.5 K   2006-10-06
   network20061011.gif            15.5 K   2006-10-11