CMS Site Log for PHOENIX Cluster

2. 4. 2007 Load Tests on new dCache SE

We brought the new dCache-based SE into production at the end of last week. The STAR-CSCS FTS channel hosted by FZK has now been extended to also allow transfers to the new SRM endpoint. The SRM transfer quality looks good, though the actual rate is not overwhelming. Only the T1_ASGC transfers have problems, and their errors uniformly seem to indicate a problem on the T1 side.

Statistics for the last 12 hours:

                         first entry: 2007-04-02 08:29:00      last entry: 2007-04-02 12:02:17
site: T1_FZK_Buffer (OK: 54 / Err: 0)   succ. rate: 100.0 %   total: 133.2 GB   avg. rate: 6.4 MB/s = 53.8 Mb/s
site: T1_RAL_Buffer (OK: 1 / Err: 0)   succ. rate: 100.0 %   total: 2.6 GB   avg. rate: 0.8 MB/s = 7.0 Mb/s
site: T1_ASGC_Buffer (OK: 0 / Err: 42)   succ. rate: 0.0 %   total: 0.0 GB
site: T1_FNAL_Buffer (OK: 9 / Err: 0)   succ. rate: 100.0 %   total: 24.6 GB   avg. rate: 1.3 MB/s = 10.8 Mb/s

 *** ERRORS from T1_ASGC_Buffer:***
     42   Failed SRM get on httpg://castorsc.grid.sinica.edu.tw:8443/srm/managerv1 ;id=[id]call. Error is specified file(s) does
            not exist

I can sometimes see big transfers between the dCache pool nodes. Either a door on one node is actually saving a file on the other node, or it is some kind of redistribution (probably the former, I guess). This wastes bandwidth and should be studied.

3. 4. 2007 failing transfers, SLC4 nodes

The transfers of the last 12 hours look uniformly bad. All failed transfers except those from ASGC show the same error: transfer expired in the download agent queue. This seems to be a PhEDEx problem. According to a message from Chia Ming Kuo on the list, the ASGC error is due to their migration from CASTOR to CASTOR2.

                         first entry: 2007-04-03 00:06:51      last entry: 2007-04-03 07:44:51
site: T1_FZK_Buffer (OK: 29 / Err: 386)   succ. rate: 7.0 %   total: 71.4 GB   avg. rate: 1.9 MB/s = 16.2 Mb/s
site: T1_RAL_Buffer (OK: 25 / Err: 193)   succ. rate: 11.5 %   total: 64.1 GB   avg. rate: 2.9 MB/s = 24.2 Mb/s
site: T1_ASGC_Buffer (OK: 0 / Err: 33)   succ. rate: 0.0 %   total: 0.0 GB
site: T1_FNAL_Buffer (OK: 20 / Err: 298)   succ. rate: 6.3 %   total: 54.8 GB   avg. rate: 1.7 MB/s = 14.4 Mb/s

 *** ERRORS from T1_FZK_Buffer:***
    386   transfer expired in the download agent queue
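
As a cross-check of the summary lines above, the success rate can be recomputed from the OK/Err counts. This is only a minimal sketch: the regular expression and the function name `success_rate` are my own assumptions based on the line layout shown in this log, not part of the script that produced the statistics.

```python
import re

# Pattern for one "site:" summary line as printed in this log
# (assumed layout: "site: NAME (OK: n / Err: m) ...").
SITE_LINE = re.compile(
    r"site:\s*(?P<site>\S+)\s*\(OK:\s*(?P<ok>\d+)\s*/\s*Err:\s*(?P<err>\d+)\)"
)


def success_rate(line):
    """Return (site, ok, err, rate_percent) for a summary line, or None."""
    match = SITE_LINE.search(line)
    if match is None:
        return None
    ok = int(match.group("ok"))
    err = int(match.group("err"))
    attempts = ok + err
    # Avoid a division by zero for sites without any transfer attempt.
    rate = 100.0 * ok / attempts if attempts else 0.0
    return match.group("site"), ok, err, rate


if __name__ == "__main__":
    sample = ("site: T1_FZK_Buffer (OK: 29 / Err: 386)   succ. rate: 7.0 %   "
              "total: 71.4 GB   avg. rate: 1.9 MB/s = 16.2 Mb/s")
    site, ok, err, rate = success_rate(sample)
    print(f"{site}: {ok} ok, {err} failed, {rate:.1f} % success")
```

For the FZK line of 3. 4. 2007 this gives 29 / (29 + 386), which rounds to the 7.0 % shown in the table.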

We will now try to bring the new SLC4 nodes into the system. This is problematic for CMS, because the new software area has not been set up correctly yet. Peter Elmer has notified us that we would be a test site for this kind of operation.

5. 4. 2007 New PhEDEx FileDownload from CVS

I installed a new FileDownload version from CVS (Rev. 1.103), and this seems to have taken care of the many _transfer expired_ errors. But the number of transfers and the rates are not very exciting:

                         first entry: 2007-04-05 05:15:25      last entry: 2007-04-05 11:18:35
site: T1_FZK_Buffer (OK: 14 / Err: 0)   succ. rate: 100.0 %   total: 34.6 GB   avg. rate: 3.7 MB/s = 31.3 Mb/s
site: T1_RAL_Buffer (OK: 12 / Err: 0)   succ. rate: 100.0 %   total: 30.8 GB   avg. rate: 3.0 MB/s = 25.2 Mb/s
site: T1_ASGC_Buffer (OK: 7 / Err: 3)   succ. rate: 70.0 %   total: 19.7 GB   avg. rate: 1.0 MB/s = 8.4 Mb/s
site: T1_FNAL_Buffer (OK: 10 / Err: 0)   succ. rate: 100.0 %   total: 27.5 GB   avg. rate: 1.4 MB/s = 12.1 Mb/s

This morning one of our dCache pool nodes died. It simply stopped reacting, although the remote management was still working. This seems to have happened just when the weekly disk scrubbing was running. Tom restarted the scrubbing in the afternoon to see whether it is a persistent problem connected to it, but the run finished ok, so it is not clear what the problem was. There were no messages in the logs. Scrubbing takes place in the kernel, so a bad problem there could show this behavior.

cpu_report-se02-lcg20070405-1404.gif load_report-se02-lcg20070405-1404.gif

cpu_report-se03-lcg20070405-1404.gif load_report-se03-lcg20070405-1404.gif

The network load:

network_report-se02-lcg20070405-1404.gif network_report-se03-lcg20070405-1404.gif

The PhEDEx load tests continued towards the pool node that was still working, se03-lcg. But the failure had a major impact on the worker nodes:

worker-CPU.gif worker-load.gif

11. 4. 2007 Huge Memory Leak in FileDownload

Spent almost the whole day trying to chase a memory leak together with Ricky Egeland. Our FileDownload agent grew by 128 kB every 15 seconds (or, in another phase, by 1 MB every 125 seconds, a very similar overall rate). I had started a thread about it here on HyperNews. Tests showed that this happens with the new version from CVS as well as with the release version.
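
To see how similar the two growth phases are, the step sizes can be converted to a common rate. This is a back-of-the-envelope sketch that assumes binary units (1 kB = 1024 bytes, 1 MB = 1024 kB) and a constant leak; the helper name `growth_per_day_mb` is only for illustration.

```python
KB = 1024        # bytes per kB (assumption: binary units)
MB = 1024 * KB   # bytes per MB
SECONDS_PER_DAY = 86400


def growth_per_day_mb(step_bytes, interval_s):
    """Memory growth in MB per day for a leak of step_bytes every interval_s."""
    return step_bytes / interval_s * SECONDS_PER_DAY / MB


# Phase A: 128 kB every 15 seconds.
phase_a = growth_per_day_mb(128 * KB, 15)
# Phase B: 1 MB every 125 seconds.
phase_b = growth_per_day_mb(1 * MB, 125)

if __name__ == "__main__":
    print(f"phase A: {phase_a:.1f} MB/day")
    print(f"phase B: {phase_b:.1f} MB/day")
    print(f"ratio B/A: {phase_b / phase_a:.2f}")
```

Under these assumptions both phases come out at roughly 700 MB per day (about 720 and 691), so an agent left running would gain on the order of several GB per week.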

The memory growth does not result from a growing $self structure of the agent; this much I was able to observe. There was not enough time to follow this up more thoroughly, and soon afterwards the PhEDEx tests were stopped, because there are also a number of DB hang problems around. It seems that the _transfer expired_ messages are connected with these.

Transfer download rates were very low over all these days.

25. 4. 2007 LoadTests started again

After the successful DB migration at CERN, the transfers have started again.

Our two dCache nodes are currently supplying FZK with ~20 MB/s of data with no noticeable CPU load. I just show the plot for se02:



-- DerekFeichtinger - 25 Apr 2007

