CMS SC4 Site log for PHOENIX cluster

31. 5. 2006 srmcp problems, FTS request

Lots of intermittent problems of all sorts with the PhEDEx LoadTests and srmcp. They are notoriously difficult to debug because of their intermittent nature (I started writing some scripts to help with this, but I do not want to go into details here). Followed Lassi's suggestion to try out the FTS backend and asked for additional FTS channels through the GGUS portal (we currently have channels only to/from FZK, Link to the GGUS Ticket).
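I will not reproduce those debugging scripts here, but a minimal sketch of the kind of retry/diagnostic wrapper one could put around srmcp looks like the following (the SURLs, log path, retry count and sleep interval are placeholders, not our actual setup; the only assumption about srmcp itself is that it exits non-zero on failure):

#!/usr/bin/env python
# Minimal sketch of a retry/diagnostic wrapper around srmcp.
# SRC, DST and LOG are placeholders, not our production values.
import subprocess
import time

SRC = "srm://remote-se.example.org:8443/srm/managerv1?SFN=/store/test/file.root"  # placeholder
DST = "file:////tmp/srmcp-test.root"                                              # placeholder
LOG = "/tmp/srmcp-debug.log"
MAX_TRIES = 3

def try_once():
    """Run one srmcp transfer; return (success, duration in s, combined output)."""
    start = time.time()
    proc = subprocess.run(["srmcp", SRC, DST], stdout=subprocess.PIPE,
                          stderr=subprocess.STDOUT, text=True)
    return proc.returncode == 0, time.time() - start, proc.stdout

with open(LOG, "a") as log:
    for attempt in range(1, MAX_TRIES + 1):
        ok, seconds, output = try_once()
        log.write("attempt %d: %s after %.1f s\n%s\n"
                  % (attempt, "OK" if ok else "FAILED", seconds, output))
        if ok:
            break
        time.sleep(60)  # pause before retrying, since the failures are intermittent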

I started the Dev agents with the FTS backend, but since FZK is not using it yet, there is nothing to be seen so far.

4.-5. 6. 2006 SE overload

The PhEDEx LoadTests (with the srmcp backend) create so much load on our single file server that they affect running jobs accessing data sets via the rfio/POSIX protocol. Several CMS integration test jobs had timed out the previous day. As shown by the Ganglia CPU graph, the SE's CPU spends most of its time waiting for I/O. A test rfcp of a data file to the UI under this load yielded an effective download rate of only about 90 KB/s! Every srmcp uses up to 10 gridftp streams (the default). I tried to reduce the load by setting the streams per transfer to 2, which alleviated the situation only slightly.
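The 90 KB/s number came from simply timing an rfcp download to the UI; a measurement of that kind can be repeated with something like the following (the source path is a placeholder, and rfcp must be available on the UI):

#!/usr/bin/env python
# Rough measurement of the effective rfcp download rate under load.
# The source path is a placeholder file on the SE.
import os
import subprocess
import time

SRC = "/dpm/projects.cscs.ch/home/cms/trivcat/store/test/somefile.root"  # placeholder
DST = "/tmp/rfcp-test.root"

start = time.time()
subprocess.run(["rfcp", SRC, DST], check=True)
elapsed = time.time() - start

size_kb = os.path.getsize(DST) / 1024.0
print("transferred %.0f KB in %.1f s -> %.1f KB/s" % (size_kb, elapsed, size_kb / elapsed))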

The maximum data rate observed to/from the SE was generated by the LoadTests and amounted to ~30 MB/s.

I stopped the LoadTests around midnight to allow the CMS integration test jobs to work undisturbed. The I/O of these jobs seemed rather low: < 100 KB/s per job. Since the CPU load on a node running one of these jobs is close to 100%, it seems that this low rate is not due to any performance issue of the SE, but that the jobs are CPU limited. Restarted the LoadTests at noon. It is to be expected that more I/O-demanding analysis jobs will run into trouble with our current storage solution (however, the first significant upgrade to the cluster, planned for this year, should address these deficiencies).

The JobRobot analysis by Oliver Gutsche looks good for this cluster (312 jobs successfully run, exit status 0).

Note: The number of analysis jobs can be estimated from the number of rfiod processes, and the number of LoadTest transfers from the number of gridftp processes (I added both counts to the normal Ganglia metrics for the SE). Currently, only CMS jobs seem to access data directly via rfio.

This allows an estimate of the I/O demands based on the plots. There were ~20 parallel jobs generating ~2 MB/s of transfers from the SE, i.e. ~100 kB/s per job. This matches what I saw when looking directly at some jobs (see above).
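The extra Ganglia metrics mentioned in the note are just process counts pushed via gmetric; a minimal sketch of such a collector is given below (the process name patterns and metric names are illustrative, not necessarily what is actually configured on the SE):

#!/usr/bin/env python
# Sketch of a collector that feeds rfiod and gridftp process counts into
# Ganglia.  Patterns and metric names are illustrative; gmetric is the
# standard Ganglia command-line injector and is assumed to be installed.
import subprocess

def count_procs(pattern):
    """Count processes whose command line matches the pattern (pgrep -f -c)."""
    proc = subprocess.run(["pgrep", "-f", "-c", pattern],
                          stdout=subprocess.PIPE, text=True)
    return int(proc.stdout.strip() or 0)

def publish(name, value):
    """Push one integer metric to Ganglia via gmetric."""
    subprocess.run(["gmetric", "--name", name, "--value", str(value),
                    "--type", "uint16", "--units", "processes"], check=True)

publish("rfiod_count", count_procs("rfiod"))      # ~ number of jobs reading via rfio
publish("gridftp_count", count_procs("gridftp"))  # ~ number of active WAN transfers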


Figures: cpu0604.gif, gridftp0604.gif, rfio0604.gif (Ganglia plots for the SE)

6.-7. 6. 2006 SE overload + first attempts to fix it, CMSSW RFIO timeout problem

Again, three jobs got stuck, probably because of SE overload (lots of walltime, no CPU time, SE mostly in CPU wait state). There were almost 200 gridftp processes running wild on the SE at one time today.

I used the -jobs switch of the PhEDEx LoadTest download agent to reduce the number of parallel downloads and set it to 2 (the default was 5). But how can I control the number of parallel transfers from our center to the other centers? Those download agents are operated by the other centers.

TIP (from Lassi): The -jobs switch must appear after the -backend switch.

As discussed in this hypernews thread, there is currently no easy way for a site to exercise control over the transfer load, especially for transfers initiated by the other sites (exports in PhEDEx lingo). Sigh! This functionality should have been implemented at the level of the lower middleware, e.g. the FTP servers. Lassi proposed to regulate the transfers to our site via PhEDEx and set the limit for now to 10 MB/s per link.

All integration jobs that started this night failed because of an RFIO timeout followed by a segmentation fault. The error log of a typical job shows:

Use of tracked parameters for the message service is deprecated.
The .cfg file should be modified to make this untracked.
SysError in <TRFIOFile::ReadBuffer>: error reading from file
   /dpm/projects.cscs.ch/home/cms/trivcat/store/preprod/2006/05/05/PreProdR3Minbias/0000/920AFCA1-22DA-DA11-9151-003048723767.root
   (Timed out)

Error in <TRFIOFile::ReadBuffer>: error reading all requested bytes from file
   /dpm/projects.cscs.ch/home/cms/trivcat/store/preprod/2006/05/05/PreProdR3Minbias/0000/920AFCA1-22DA-DA11-9151-003048723767.root,
   got 19403 of 17981

 *** Break *** segmentation violation

I filed this bug in the CMS bugs Savannah project (almost the same bug was posted by Tommaso Boccali to the CMS Framework project; where does it correctly belong now?).

The measures by which we tried to reduce the load on our SE failed: 23 parallel exports to CNAF managed to overload it again, so that the SE's CPU was once more in CPU wait most of the time. Eventually I shut down the LoadTests for the rest of the day.


Figures: cpu0607.gif, network0607.gif (Ganglia plots for the SE)

8. 6. 2006 stopped LoadTest exports (currently the only way to reduce SE overload)

Decided to run only the LoadTest download agent, but not the export agents, since we have no control over exports (other than denying them completely). There ought to be a way for a site to control the I/O bandwidth consumed by requests from other sites. Even though our current SE solution is surely inadequate (just one big file server), the principal problems lie in the middleware.

If too many jobs try to access the same disks simultaneously, the disks get extremely inefficient, even though they could easily provide the I/O with a more sequential access pattern.

Having on the order of 100 parallel gridftp transfers for the WAN plus all the accesses from the WNs is simply too much. Why must a site download 23 files in parallel (as happened yesterday) instead of using a more sequential approach? The parallelism should enter at the level of parallel streams, and both sites should have some control over their number, e.g. by setting an upper limit (see the sketch below).
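To make the point concrete: the kind of cap I have in mind is nothing more than a bounded number of simultaneous transfer slots on the importing side. A toy illustration follows; this is not PhEDEx or FTS code, and the endpoints and transfer command are placeholders.

#!/usr/bin/env python
# Toy illustration of a per-site cap on concurrent transfers, as argued for
# above.  Not PhEDEx or FTS code; endpoints and the transfer command are
# placeholders.
import subprocess
import threading

MAX_PARALLEL = 2  # upper limit the importing site would be allowed to set
slots = threading.BoundedSemaphore(MAX_PARALLEL)

def transfer(src, dst):
    """Run one transfer, but only while holding one of the limited slots."""
    with slots:
        subprocess.run(["srmcp", src, dst])  # placeholder transfer command

pairs = [("srm://remote.example.org/loadtest/file%02d" % i,
          "srm://se01.example.org/loadtest/file%02d" % i) for i in range(23)]

threads = [threading.Thread(target=transfer, args=pair) for pair in pairs]
for t in threads:
    t.start()
for t in threads:
    t.join()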

9. 6. 2006 CE overload due to many blocked job slots

The JobRobot throws jobs at us faster than we can handle them at the moment. The nodes are today still working on files from three days ago. One recurring problem is jobs from other users/experiments that end up blocking the queues because they are waiting for some off-site I/O, leaving the CPU completely idle (similar in effect to the cmsRun jobs failing and hanging because of the RFIO timeout, q.v. yesterday). I now have scripts which detect them (a simplified version is sketched below), but removing the jobs is a manual process, i.e. notifying the sysadmin, who then has to investigate to whom the jobs belong, and so on.
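The detection essentially just compares consumed CPU time with elapsed wall time for each running job. A simplified version of the idea, assuming a PBS/Torque-style qstat -f output with resources_used.cput and resources_used.walltime fields (the thresholds are arbitrary, not our tuned values):

#!/usr/bin/env python
# Simplified detector for jobs that block a slot while doing no work: lots of
# walltime, almost no CPU time.  Assumes PBS/Torque-style `qstat -f` output;
# thresholds are arbitrary example values.
import re
import subprocess

MIN_WALL_H = 2.0         # ignore jobs that have not been running long
MAX_CPU_FRACTION = 0.05  # below this cput/walltime ratio a job looks stuck

def to_hours(hms):
    """Convert an H:MM:SS string to hours."""
    h, m, s = (int(x) for x in hms.split(":"))
    return h + m / 60.0 + s / 3600.0

out = subprocess.run(["qstat", "-f"], stdout=subprocess.PIPE, text=True).stdout
for block in out.split("\n\n"):
    jobid = re.search(r"Job Id:\s*(\S+)", block)
    cput = re.search(r"resources_used\.cput = (\S+)", block)
    wall = re.search(r"resources_used\.walltime = (\S+)", block)
    if not (jobid and cput and wall):
        continue
    cpu_h, wall_h = to_hours(cput.group(1)), to_hours(wall.group(1))
    if wall_h > MIN_WALL_H and cpu_h < MAX_CPU_FRACTION * wall_h:
        print("suspicious job %s: %.1f h walltime, %.2f h CPU"
              % (jobid.group(1), wall_h, cpu_h))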

Thinking of reconfiguring the queues... Why do we have to reserve whole CPUs for dteam and the experiments' software managers? We could assign an extra virtual slot on these nodes. Then both CPUs of these nodes would be continually busy, and if a third job gets started for these special users, the node will be a bit inefficient. But this is still better than almost continually wasting 5 CPUs.

It was clear from the start that our cluster is too weak in its current form, and the upgrade will still take several months. Still, it is good to build up experience with the technology, especially as it helps with planning the layout of the upgrade.

Some observations regarding the current JobRobot jobs: two jobs were running on this node; the average job running time is 30 min.
Figure: wn07-cpu0609.gif

Around 20:00 all JobRobot jobs disappeared from our CE, so it seems that Oliver Gutsche has taken us out of the Robot for now.

16. 6. 2006 High job success rates, still restricting PhEDEx LoadTests

Spent several days at the T2 workshop.

We have a high success rate for the JobRobot jobs, but since these jobs currently get no prioritization, this requires some manual tweaking of the queues (blocking other jobs that run for 48 h on the CMS queues, usually after communicating with the job owners). As can be seen from the ARDA Dashboard page, 106 out of 107 jobs finished successfully on June 14th-15th; one is lost in action somewhere (unknown status).

I reconfigured PhEDEx today to allow exports to FZK. Centers with a less aggressive download policy (fewer streams) may be OK for the SE; some of the centers use the LoadTests as a throughput test, so they try to stress the system a lot. There is still no way for a site to exercise control over its exports, except when using FTS. We still have a dedicated channel only to/from FZK, and the GGUS ticket's resolution still does not answer who is responsible for managing the channels into a Tier-2.

-- DerekFeichtinger - 16 Jun 2006

Topic attachments
Attachment            Size     Date              Who
cpu0604.gif           13.2 K   2006-06-05 14:14  DerekFeichtinger
cpu0607.gif           12.4 K   2006-06-07 14:38  DerekFeichtinger
gridftp0604.gif       10.7 K   2006-06-05 14:15  DerekFeichtinger
network0604.gif        5.8 K   2006-06-05 14:15  DerekFeichtinger
network0607.gif       12.5 K   2006-06-07 14:39  DerekFeichtinger
rfio0604.gif          10.8 K   2006-06-05 14:16  DerekFeichtinger
wn07-cpu0609.gif      11.1 K   2006-06-09 08:32  DerekFeichtinger
wn07-network0609.gif  12.4 K   2006-06-09 08:31  DerekFeichtinger