01. 02. 2008 Comparison of dashboard monitoring output and CSCS local monitoring.

A first comparison was done in October (q.v. CMSSiteLog8).

Dashboard plots

The new dashboard includes functionality that should show which jobs were actually running at a given time, rather than only categorizing jobs by their submission time.

Result of a dashboard query for what ran on ce01.lcg.cscs.ch this morning from 3:20h to 4:20h (link to the query):


Sorted according to users (link to the query)


Local ganglia monitoring

The local monitoring is done with ganglia/gmetad, so the history is kept in the usual round-robin databases with different integration slices (accurate for the last hour, then integrated into the day RRD, etc.). It is therefore not very exact for older historic data or for short processes. I wanted to look at job robot jobs here, which are short processes. It turns out that some of the ganglia information is not reliable for these, because the job robot jobs run for about 1 min while the sampling interval for the CE job counting was 5 min. The log excerpts further below are exact, however, and the movers and I/O measurements are sampled at 1 min intervals.
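To make the sampling argument above concrete, here is a minimal sketch of how many short jobs a periodic sampler can see at all. The job start times (one every 10 s for an hour) and the exact 60 s duration are illustrative assumptions, not measured values, and the function names are made up for this example.

```python
def is_sampled(start_s, duration_s, period_s):
    """Return True if a sampling tick (at multiples of period_s)
    falls inside the job's lifetime [start_s, start_s + duration_s)."""
    # First tick at or after the job start, using integer ceiling division.
    first_tick = -(-start_s // period_s) * period_s
    return first_tick < start_s + duration_s


def fraction_sampled(starts, duration_s, period_s):
    """Fraction of jobs that are alive during at least one sampling tick."""
    starts = list(starts)
    seen = sum(is_sampled(s, duration_s, period_s) for s in starts)
    return seen / len(starts)


# Hypothetical workload: a 60 s job starting every 10 s over one hour.
starts = range(0, 3600, 10)
frac_5min = fraction_sampled(starts, 60, 300)  # CE job counting, 5 min interval
frac_1min = fraction_sampled(starts, 60, 60)   # movers and I/O, 1 min interval
print(frac_5min, frac_1min)
```

Under these assumptions only about a fifth of the jobs coincide with a 5 min sample, while every job is seen at a 1 min interval, which matches the observation that the CE job counter misses most job robot jobs.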

Number of jobs on the CE:


The number of active analysis jobs can be obtained indirectly by looking at the number of active CMS LAN transfers.


For completeness, here are the I/O rates on the currently active file servers (stacked), from which one can estimate the I/O. Note that the CMS spike of short jobs (characteristic of the job robot) is reflected in the "bytes out" graph.


Values from CE gatekeeper logs

There were 50 jobs from Vincenzo Miccio which mainly seem to be responsible for the peak seen between 3:30h and 4:10h.

02/01/2008 02:23:42 cms016 Exit_status=0 cput=00:03:49 mem=342868kb vmem=554140kb walltime=00:14:49

02/01/2008 03:35:56 cms016 Exit_status=0 cput=00:04:19 mem=351268kb vmem=559636kb walltime=00:05:31
02/01/2008 05:28:45 cms016 Exit_status=0 cput=00:03:48 mem=341348kb vmem=541540kb walltime=00:05:16

These are the jobs that Stefano Belforte ran between 2h and 5h (102 jobs). The walltimes are very short. Since we sampled the ganglia CE job information at 5 min intervals, most of these jobs were missed by our CE monitoring. The movers and I/O are measured at 1 min intervals and are therefore more accurate.

02/01/2008 02:32:22 cms026 Exit_status=0 cput=00:01:07 mem=141136kb vmem=299408kb walltime=00:02:19
02/01/2008 04:00:03 cms026 Exit_status=0 cput=00:01:07 mem=132032kb vmem=274964kb walltime=00:01:57
02/01/2008 05:51:03 cms026 Exit_status=0 cput=00:01:07 mem=130608kb vmem=288768kb walltime=00:01:45
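Counts and walltimes like the ones quoted above can be pulled out of such excerpts with a few lines of Python. This is only a sketch: it assumes every line has exactly the layout shown in the excerpts (date, time, local account, then key=value fields), and the names `LINE_RE` and `summarize` are invented for this example.

```python
import re
from collections import defaultdict

# Assumed line layout, taken from the excerpts above:
# MM/DD/YYYY HH:MM:SS <account> Exit_status=... cput=... mem=... vmem=... walltime=HH:MM:SS
LINE_RE = re.compile(
    r"^(?P<date>\d{2}/\d{2}/\d{4}) (?P<time>\d{2}:\d{2}:\d{2}) (?P<account>\S+) "
    r".*?walltime=(?P<walltime>\d+:\d{2}:\d{2})"
)


def hms_to_seconds(hms):
    """Convert an HH:MM:SS string to seconds."""
    h, m, s = (int(part) for part in hms.split(":"))
    return h * 3600 + m * 60 + s


def summarize(lines):
    """Return {account: (job_count, mean_walltime_seconds)} for matching lines."""
    walltimes = defaultdict(list)
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match:  # silently skip lines that do not fit the assumed layout
            walltimes[match["account"]].append(hms_to_seconds(match["walltime"]))
    return {acct: (len(w), sum(w) / len(w)) for acct, w in walltimes.items()}


sample = [
    "02/01/2008 03:35:56 cms016 Exit_status=0 cput=00:04:19 mem=351268kb vmem=559636kb walltime=00:05:31",
    "02/01/2008 02:32:22 cms026 Exit_status=0 cput=00:01:07 mem=141136kb vmem=299408kb walltime=00:02:19",
    "02/01/2008 04:00:03 cms026 Exit_status=0 cput=00:01:07 mem=132032kb vmem=274964kb walltime=00:01:57",
]
for account, (count, mean_wall) in sorted(summarize(sample).items()):
    print(account, count, round(mean_wall))
```

Run over the full log for the period, the per-account counts give the numbers to compare against the dashboard.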


Job numbers for V. Miccio and S. Belforte are 150 and 204 as reported by the dashboard, versus 50 and 102 from the gatekeeper logs. It is curious that the numbers are exactly trebled and doubled, respectively. Could there be some double counting?

Julia Andreeva from ARDA/CMS explained that this query (with the checkbox "all jobs regardless" enabled) does not actually take the second entered time value into account at all. It simply counts the jobs from the first time value to the present. This should be explained better in the dashboard Web-UI.
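The effect of ignoring the second time value can be illustrated with a toy example. This is not the dashboard's query code; the job timestamps and function names below are hypothetical and serve only to show how an open-ended count exceeds a windowed one.

```python
from datetime import datetime


def count_in_window(job_times, start, end):
    """Count jobs whose timestamp lies in [start, end]."""
    return sum(start <= t <= end for t in job_times)


def count_since(job_times, start, now):
    """Count jobs from start up to the time of the query (end value ignored)."""
    return count_in_window(job_times, start, now)


day = (2008, 2, 1)
# Hypothetical job timestamps, NOT taken from the real logs.
jobs = [datetime(*day, h, m) for h, m in [(3, 30), (3, 45), (4, 0), (6, 0), (8, 0)]]

start = datetime(*day, 3, 20)
end = datetime(*day, 4, 20)
query_time = datetime(*day, 9, 0)  # assumed moment at which the query is run

print(count_in_window(jobs, start, end))     # intended: jobs within the hour
print(count_since(jobs, start, query_time))  # described behaviour: everything since 3:20h
```

The later the query is run, the larger the second number becomes, so the same query can return different job counts over the course of the day.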

