
How to check the CSCS Tier-2 status for CMS site contacts / site managers

This is a short routine that the responsible CMS site contact should perform once a day. Some of these checks can and should be automated at some point, but the manual check does not take much time and will increase your understanding of the system.

All the basic information and links can be found on our main monitoring page: PhoenixMonOverview. The following list tells you what to look at on this page.

  1. Look at the three pie charts for the worker nodes, service nodes, and the file servers.
    The service node and file server pie charts must show no black segments (black means nodes are down). A few worker nodes being down is less critical, but you may still want to contact the site admins.
  2. Check all SAM tests using the links towards the top of the page, the CMS SAM tests being the most important ones for us.
  3. Check the graphs for running and queued jobs.
    You should only see queued CMS jobs if the cluster is filled with running jobs. If jobs stay in the queue despite free slots on the cluster, something is wrong with the scheduling.
  4. Check the free storage space graph for CMS, and take note of the trend shown over the last week.
    You can check how much space is taken up by users and datasets by using the links below the Storage Element section.
  5. Take a look at the graphs for the dCache movers. If you see a large number of queued movers (especially if it is still growing), you may want to notify the CSCS admins. In case of problems you may also want to look at the Pool Transfer Queues, Active Transfers, and Detailed Tape Transfer Queue (don't be misled by this name - it applies to disk transfer problems, too) in the dCache GUI.
  6. Check PhEDEx by looking at the log analyzer output on the PhEDEx download and export pages (links are located below Networking and File Transfers).
    • If there is zero activity, make sure that the PhEDEx processes are up.
    • If there are lots of transfer errors, try to analyze them based on what you see in the log analyzer, then post a support request on Savannah (assign it to the cmscompinfrasup-datatransfer group) or contact the responsible site admins directly.
  7. Check whether there are any pending dataset requests (there is a link to the correct page below the Storage Element section).
    The decision whether to approve a request must be based on the available space and on policy.
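
As a starting point for automating parts of this routine, the job queue check (step 3) and the mover queue check (step 5) could be sketched roughly as follows. This is only an illustration: the class, the function names, and the mover threshold of 100 are hypothetical and not part of any existing monitoring tool, and the numbers would have to be read off the monitoring page or taken from whatever source feeds its graphs.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class ClusterSnapshot:
    """Hypothetical container for numbers read from the job graphs."""
    total_slots: int
    running_jobs: int
    queued_cms_jobs: int


def scheduling_looks_wrong(snap: ClusterSnapshot) -> bool:
    """Step 3: CMS jobs are queued although the cluster still has free slots."""
    free_slots = snap.total_slots - snap.running_jobs
    return snap.queued_cms_jobs > 0 and free_slots > 0


def movers_growing(queued_movers: Sequence[int], threshold: int) -> bool:
    """Step 5: the latest sample is above the threshold and above the first sample.

    queued_movers is a time-ordered series of queued mover counts.
    """
    if len(queued_movers) < 2:
        return False
    return queued_movers[-1] > threshold and queued_movers[-1] > queued_movers[0]


def daily_warnings(snap: ClusterSnapshot,
                   queued_movers: Sequence[int],
                   mover_threshold: int = 100) -> List[str]:
    """Collect human-readable warnings; the threshold is an assumed value."""
    warnings = []
    if scheduling_looks_wrong(snap):
        warnings.append("CMS jobs are queued despite free slots: check the scheduling")
    if movers_growing(queued_movers, mover_threshold):
        warnings.append("Many queued dCache movers and still growing: notify the CSCS admins")
    return warnings


if __name__ == "__main__":
    snapshot = ClusterSnapshot(total_slots=100, running_jobs=80, queued_cms_jobs=5)
    for line in daily_warnings(snapshot, queued_movers=[10, 50, 200]):
        print(line)
```

The checks are kept as pure functions on plain numbers so that they can be tried out by hand first and connected to a real data source later.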

-- DerekFeichtinger - 27 Nov 2008

Topic revision: r4 - 2009-02-05 - DerekFeichtinger
 