
CMS SC4 Site log for PHOENIX cluster

June-July 2006:

MOVED TO... Go to next page of CMS site log

31. 5. 2006 srmcp problems, FTS request

Lots of intermittent problems of all sorts with the PhEDEx LoadTests and srmcp. The problems are notoriously difficult to debug (I started writing some scripts, but I do not want to go into details here). Followed Lassi's suggestion to try out the FTS backend and asked for additional FTS channels at the GGUS portal (we currently have only a channel to/from FZK; link to the GGUS Ticket).

I started the Dev agents with the FTS backend, but since FZK is not using this yet, there's nothing to be seen :-(

4.-5. 6. 2006 SE overload

The PhEDEx LoadTests (with the srmcp backend) create so much load on our single file server that it affects running jobs accessing data sets via the rfio/POSIX protocol. Several CMS integration test jobs had timed out the previous day. As the Ganglia CPU graph shows, the SE's CPU spends most of its time waiting for I/O. A test rfcp of a data file to the UI under this load yielded an effective download rate of only about 90 KB/s! Every srmcp uses up to 10 gridftp streams (the default). I tried to reduce the load by setting the streams per transfer to 2, which alleviated the situation only slightly.
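The stream arithmetic behind this overload can be sketched as follows (numbers from the text; the helper name is mine, not part of any PhEDEx or srmcp tooling):

```python
def total_gridftp_streams(parallel_transfers: int, streams_per_transfer: int) -> int:
    """Aggregate number of gridftp streams hitting the single file server."""
    return parallel_transfers * streams_per_transfer

# srmcp default: 10 streams per transfer. With e.g. 5 parallel transfers
# that is already 50 concurrent streams on one SE; dropping to 2 streams
# per transfer only reduces this to 10.
print(total_gridftp_streams(5, 10))  # 50
print(total_gridftp_streams(5, 2))   # 10
```

This makes clear why reducing the streams per transfer alone helps only marginally: the number of parallel transfers is the other, uncontrolled factor.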

The maximum observed data rate from/to the SE was generated by the LoadTests and amounted to ~30 MB/s.

I stopped the LoadTests around 24:00 to allow the CMS integration test jobs to work undisturbed. I/O by these jobs seemed rather low: < 100 kB/s per job. Since the CPU load on a node running one of these jobs is close to 100%, this low rate is apparently not due to any performance issue of the SE; the jobs are CPU limited. Restarted the LoadTests at noon. More I/O-demanding analysis jobs can be expected to run into trouble with our current storage solution (however, the first significant upgrade to the cluster, planned for this year, will repair these deficiencies).

The JobRobot analysis by Oliver Gutsche looks good for this cluster (312 jobs successfully run, exit status 0).

Note: The number of analysis jobs can be estimated from the number of rfiod processes, and the number of LoadTest transfers from the number of gridftp processes (I added both to the standard Ganglia metrics for the SE). Currently, only CMS jobs seem to use direct access via rfio.

This allows an estimate of the I/O demands based on the plots. There were ~20 parallel jobs generating ~2 MB/s of traffic from the SE, i.e. ~100 kB/s per job. This matches what I saw when looking directly at some jobs (see above).
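The per-job estimate above is just aggregate rate divided by job count; as a small sketch (function name is mine):

```python
def per_job_rate_kbs(aggregate_rate_mbs: float, n_jobs: int) -> float:
    """Estimate per-job I/O (kB/s) from aggregate SE traffic and job count."""
    return aggregate_rate_mbs * 1000.0 / n_jobs

# ~20 parallel jobs generating ~2 MB/s from the SE:
print(per_job_rate_kbs(2.0, 20))  # 100.0 kB/s per job
```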


cpu0604.gif
gridftp0604.gif
rfio0604.gif

6.-7. 6. 2006 SE overload + first attempts to fix it, CMSSW RFIO timeout problem

Again, three jobs got stuck, probably because of SE overload (lots of walltime, no cpu time, SE mostly in CPU WAIT state). There were almost 200 gridftp processes running wild on the SE at one time today.

I used the -jobs switch of the PhEDEx LoadTest download agent to reduce the number of parallel downloads, setting it to 2 (the default was 5). But how can I control the number of parallel downloads from our center to the other centers? Those transfers are operated by the other centers.

TIP (from Lassi): The -jobs switch must appear after the -backend switch.

As discussed in this hypernews thread, there is currently no easy way for a site to exercise control over the transfer load, especially for transfers initiated by other sites ("exports" in PhEDEx lingo). Sigh! This functionality should have been implemented at the level of the lower middleware, e.g. the FTP servers. Lassi proposed to regulate transfers to our site via PhEDEx, and he set the limit for now to 10 MB/s per link.
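How PhEDEx enforces its per-link limit is internal to PhEDEx; conceptually, such a per-link cap resembles a token-bucket limiter. A toy sketch (all names mine, purely illustrative):

```python
class TokenBucket:
    """Toy token-bucket limiter: a transfer chunk is refused once the
    per-second byte budget (e.g. 10 MB/s per link) is exhausted."""

    def __init__(self, rate_bytes_per_s: float):
        self.rate = rate_bytes_per_s
        self.tokens = rate_bytes_per_s  # start with one second's budget

    def try_send(self, nbytes: int) -> bool:
        if nbytes <= self.tokens:
            self.tokens -= nbytes
            return True
        return False  # over budget: caller must wait for a tick

    def tick(self, seconds: float = 1.0) -> None:
        """Refill the budget as time passes, capped at one second's worth."""
        self.tokens = min(self.rate, self.tokens + self.rate * seconds)

link = TokenBucket(10 * 1024 * 1024)   # the 10 MB/s per-link cap from the text
print(link.try_send(8 * 1024 * 1024))  # True: within this second's budget
print(link.try_send(8 * 1024 * 1024))  # False: would exceed 10 MB this second
```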

All integration jobs that started this night failed because of an RFIO timeout followed by a segmentation fault. The error log of a typical job shows:

Use of tracked parameters for the message service is deprecated.
The .cfg file should be modified to make this untracked.
SysError in <TRFIOFile::ReadBuffer>: error reading from
   file /dpm/projects.cscs.ch/home/cms/trivcat/store/preprod/2006/05/05/PreProdR3Minbias/0000/
   920AFCA1-22DA-DA11-9151-003048723767.root (Timed out)

Error in <TRFIOFile::ReadBuffer>: error reading all requested bytes from
    file /dpm/projects.cscs.ch/home/cms/trivcat/store/preprod/2006/05/05/PreProdR3Minbias/0000/920A
    FCA1-22DA-DA11-9151-003048723767.root, got 19403 of 17981

 *** Break *** segmentation violation

I filed this bug in the CMS bugs Savannah project (almost the same bug was posted by Tommaso Boccali to the CMS Framework project; where does it correctly belong now?).

The recipes by which we tried to reduce the load on our SE failed: 23 parallel exports to CNAF managed to overload it again, so that the SE's CPU was again in CPU_WAIT most of the time. Eventually I shut down the LoadTests for the rest of the day.


cpu0607.gif


network0607.gif

8. 6. 2006 stopped LoadTest exports (currently only way to reduce SE overload)

Decided to run only the LoadTests download agent, not the export agents, since we have no control over exports (short of denying them completely). There ought to be a way for a site to control the I/O bandwidth for requests from other sites. Even though our current SE solution is surely inadequate (just one big file server), the principal problems are in the middleware.

If too many jobs try to access the same disks simultaneously, the disks become extremely inefficient, even though they could easily deliver the same I/O with a more sequential access pattern.

Having on the order of 100 parallel gridftps for the WAN transfers plus all the accesses from the WNs is just too much. Why must a site download 23 files in parallel (as happened yesterday) instead of using a more sequential approach? Parallelism should enter at the level of parallel streams, and both sites should have some control over their number, e.g. by setting an upper limit.
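The upper-limit idea argued for here can be sketched with a bounded semaphore; this is a generic illustration of capping concurrency, not PhEDEx or gridftp code (all names are mine):

```python
import threading
import time

MAX_PARALLEL = 2                 # upper limit a site could impose on transfers
slots = threading.BoundedSemaphore(MAX_PARALLEL)
active, peak = 0, 0
lock = threading.Lock()

def download(fileno: int) -> None:
    """Stand-in for one file transfer; real work would go where the sleep is."""
    global active, peak
    with slots:                  # blocks once MAX_PARALLEL transfers are running
        with lock:
            active += 1
            peak = max(peak, active)
        time.sleep(0.01)         # placeholder for the actual transfer
        with lock:
            active -= 1

# 23 requested files (the number from yesterday), but never more than
# MAX_PARALLEL in flight at once:
threads = [threading.Thread(target=download, args=(i,)) for i in range(23)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(peak <= MAX_PARALLEL)  # True
```

The remaining 21 requests simply queue up behind the semaphore, which is exactly the "more sequential approach" being asked for.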

9. 6. 2006 CE overload due to many blocked job slots

The JobRobot throws jobs at us faster than we can currently handle them. The nodes are today still working on files from three days ago. One recurring problem is jobs from other users/experiments that end up blocking the queues because they are waiting for some I/O from off-site, leaving the CPU completely idle (similar in effect to the cmsRun jobs failing and hanging because of the RFIO timeout, q.v. yesterday). I now have scripts which detect them, but removing the jobs is a manual process, i.e. notifying the sysadmin, who then has to investigate to whom the jobs belong, etc. :-(
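The detection heuristic those scripts use (lots of walltime, essentially no CPU time) can be sketched like this; the thresholds and the sample job records are mine, chosen for illustration:

```python
def is_stuck(walltime_s: int, cputime_s: int,
             min_walltime_s: int = 3600, cpu_fraction: float = 0.01) -> bool:
    """Flag a job with lots of accumulated walltime but almost no CPU time,
    the signature of a job blocked on off-site I/O."""
    if walltime_s < min_walltime_s:
        return False             # too young to judge
    return cputime_s < cpu_fraction * walltime_s

# (walltime, cputime) pairs in seconds, as one might scrape from the batch system:
jobs = {
    "job1": (7200, 7100),        # busy: CPU time tracks walltime
    "job2": (86400, 40),         # a day of walltime, 40 s of CPU -> stuck
    "job3": (600, 5),            # too young to flag
}
stuck = [name for name, (w, c) in jobs.items() if is_stuck(w, c)]
print(stuck)  # ['job2']
```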

Thinking of reconfiguring the queues... Why do we have to reserve whole CPUs for dteam and the experiments' software managers? We could assign an extra virtual slot on these nodes. Then both CPUs of these nodes would be continually busy, and if a third job gets started for these special users, the node will be a bit inefficient. But this is still better than almost continually wasting 5 CPUs.

It was clear from the start that our cluster is too weak in its current form, and the upgrade will still take several months. Still, it is good to build up experience with the technology, especially as it helps with planning the layout of the upgrade.

Some observations regarding the current JobRobot jobs: two jobs were running on this node; the average job's running time is 30 min.
wn07-cpu0609.gif

Around 20h all JobRobots disappeared from our CE, so it seems that Oliver Gutsche has taken us out of the Robot for now.

16. 6. 2006 High job success rates, but still need to restrict PhEDEx LoadTests

Spent several days at the T2 workshop.

We have a high success rate for the JobRobot jobs, but due to the current lack of prioritization of these jobs it needs some manual tweaking of the queues (blocking other jobs that run for 48h on the CMS queues; usually done after communicating with the job owners). As can be seen from the ARDA Dashboard page, 106 out of 107 jobs finished successfully on June 14th-15th; one is lost in action somewhere (unknown status).

I reconfigured PhEDEx today to allow exports to FZK. Centers with a less aggressive download policy (fewer streams) may be OK for the SE; some centers use the LoadTests as a throughput test, so they try to stress the system a lot. There is still no way for a site to exercise control over its exports, except if it is using FTS. We still have only a dedicated channel to FZK, and the GGUS Ticket's resolution still does not answer who is responsible for managing the channels into a Tier-2.

TIP 19.6.2006: Andreas Heiss from FZK replied that they will install a STAR-CSCS FTS channel (i.e. a catch-all channel for transfers from any source to CSCS). For this, glite3.0 services are necessary, and they have just now upgraded. Silke Halstenberg is in charge of their setup and is currently debugging it. So we still need to wait a few days.

Number of jobs in the CMS queue over the last week (since most of them are from the JobRobot, this gives some insight into its current submission behavior):
queuecms0617.gif
network0617.gif

20. 6. 2006 Hypernews message from Lassi regarding the current T0/T1 transfer situation.

21. 6. 2006 SC4 Technical Meeting, cannot get new test sample due to limited disk space

Link to talks at the SC4 Technical Meeting.

Lassi announced a new sample available via PhEDEx. Since it has a size of roughly 2 TB, we cannot get it (current disk space left for CMS is ~1.2 TB).
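A pre-subscription check of the kind implied here is trivial but worth automating; a minimal sketch (numbers from the text, the headroom fraction and all names are my own):

```python
def can_subscribe(sample_size_tb: float, free_tb: float,
                  headroom: float = 0.1) -> bool:
    """Check whether a dataset subscription fits into the remaining space,
    keeping a fractional headroom free for other activity."""
    return sample_size_tb <= free_tb * (1.0 - headroom)

# The ~2 TB sample against our ~1.2 TB of free CMS space:
print(can_subscribe(2.0, 1.2))  # False
```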

22-26. 6. 2006 upgrade to glite3.0

The upgrade was announced for 22-23 June. Due to an R-GMA problem, the cluster was only ready again on Monday evening, 26 June.

30. 6. 2006 stuck jobs due to a defective RB and a design flaw in the job bootstrapper, huge number of JobRobot jobs

Found a number of stuck jobs. As usual, they were waiting for I/O from the WAN. In this case the jobs were in the bootstrap phase and tried to retrieve their input sandbox from the egee-rb-04.cnaf.infn.it resource broker. The broker was not able to serve the files, and the jobs kept hanging forever in globus-url-copy. This seems a design flaw: there should be a timeout after which the job gives up and releases the resource (notified Andrea Sciaba; also see this hn thread).

Due to the stuck queue, many JobRobot jobs piled up (aggravated by the fact that Oliver Gutsche had to restart the Robot yesterday, which seemingly made it forget the previously sent jobs).

2. 8. 2006 STATUS summary (and DPM / VOMS roles issue)

PhEDEx: Still running with srmcp (had done some manual FTS tests with FZK, but the priority in integration was to get T0-T1 transfers working first). Good downloads from FNAL and IN2P3; bad results for FZK and RAL. Need to start investigating again. Downloads from our site are limited to T1_FZK_Load and T1_FZK_Buffer to prevent excessive traffic.

JobRobot: Jobs often get killed because the site cannot handle them in time. This is due to regular jobs running for long times on the same resources. I currently do not prioritize JobRobot jobs any more.

Production: Even though CMSSW_0_8_1 is installed and working (also fixed the DPM/rfio issue that had popped up again), we are blocked by another DPM problem. The new VOMS-based proxy certificates allow users to assume authorization roles. These roles get mapped directly to a DPM user group ("cms/Role=production" in this case). But a group-dedicated pool in DPM only allows members of the dedicated group to write into it, and this is the way DPM gets deployed at multi-VO SEs - one VO per pool. So even if DPM ACLs are used to grant another group write access to a directory of the pool, that group is still prevented from creating files because of the disk quota, which seems to be zero for all groups except the principal one. The production group is still allowed to delete files, but that is not much use. Jean-Philippe Baud and Sophie Lemaitre from LCG wrote in reply to my requests that it may well take until the end of the year before work to resolve these issues is done (q.v. also this hypernews thread).


-- DerekFeichtinger - 02 Aug 2006

Topic attachments
Attachment            Size    Date         Who
cpu0604.gif           13.2 K  2006-06-05   DerekFeichtinger
cpu0607.gif           12.4 K  2006-06-07   DerekFeichtinger
gridftp0604.gif       10.7 K  2006-06-05   DerekFeichtinger
network0604.gif        5.8 K  2006-06-05   DerekFeichtinger
network0607.gif       12.5 K  2006-06-07   DerekFeichtinger
network0617.gif        5.4 K  2006-06-21   DerekFeichtinger
queuecms0617.gif       4.1 K  2006-06-17   DerekFeichtinger
rfio0604.gif          10.8 K  2006-06-05   DerekFeichtinger
wn07-cpu0609.gif      11.1 K  2006-06-09   DerekFeichtinger
wn07-network0609.gif  12.4 K  2006-06-09   DerekFeichtinger