Solutions to Operational Issues

Nodes drained due to Jobs not ending with signals

This can happen for various reasons, e.g. when processes are blocked in kernel space and therefore do not react to signals, or when the Slurm controller is so busy that communication fails. Such occurrences often come in bunches, so a large number of nodes may be affected at once. The timeout before a node is flagged as problematic is already 3 minutes, so increasing it further probably does not make sense.

  1. Check on at least one of the affected nodes that it is healthy and responsive (e.g., run the nhc health check utility, or list a path under /pnfs/psi.ch/cms/trivcat/store).
  2. If the nodes are ok, run a command like the following on the Slurm master node to return them to service:
     scontrol update NodeName=t3wn[56,71,73] State=RESUME Reason=""
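
When many nodes are drained at once, the two steps above can be combined into a small helper. The following is only a sketch under assumptions that are not stated on this page: passwordless root ssh from the Slurm master to the worker nodes, nhc being on the PATH of the nodes, and nhc returning a non-zero exit code when a check fails. The function names check_node and healthy_nodes are made up for illustration.

```shell
#!/bin/sh
# Sketch: find out which of the given nodes pass a basic responsiveness check.
# ASSUMPTIONS: passwordless ssh to the nodes, nhc on the remote PATH,
# nhc exits non-zero on failure.

# check_node NODE: succeed only if the node answers via ssh, nhc passes,
# and the dCache mount can be listed.
check_node() {
    ssh -o ConnectTimeout=10 "$1" \
        'nhc >/dev/null && ls /pnfs/psi.ch/cms/trivcat/store >/dev/null'
}

# healthy_nodes NODE...: print a comma-separated list of the nodes that
# pass check_node; failing nodes are reported on stderr.
healthy_nodes() {
    ok=""
    for node in "$@"; do
        # </dev/null keeps ssh from swallowing the caller's stdin
        if check_node "$node" </dev/null; then
            ok="${ok:+$ok,}$node"
        else
            echo "skipping $node: health check failed" >&2
        fi
    done
    echo "$ok"
}
```

A possible use on the Slurm master, resuming only the nodes that passed:

     nodes=$(healthy_nodes t3wn56 t3wn71 t3wn73)
     [ -n "$nodes" ] && scontrol update NodeName="$nodes" State=RESUME Reason=""

The drained nodes and the reasons recorded for them can be listed beforehand with sinfo -R.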
    

Users banned by fail2ban on one of the UI nodes

We run fail2ban on the UI nodes to provide at least minimal protection against brute-force attacks. If a genuine user reports via our admin list that they have been banned, identify the user and the source address from the journal logs, then check whether that address has indeed been banned.

There is a small helper script for this in root's home directory (/root) on the UI nodes: is-user-banned.sh. Run it with the user name as its argument:

/root/is-user-banned.sh noehte_l
   banned: 129.129.71.65
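
If the helper script is not at hand, the source address can also be pulled from the journal manually. The sketch below is not the content of is-user-banned.sh (which is not reproduced on this page); it is a hypothetical function, failed_sources, that assumes the standard OpenSSH "Failed password for USER from ADDRESS port ..." log line format.

```shell
#!/bin/sh
# failed_sources USER: read sshd log lines on stdin and print every distinct
# source address that produced failed password logins for USER.
# ASSUMPTION: standard OpenSSH "Failed password for ..." log messages.
failed_sources() {
    sed -n "s/.*Failed password for \(invalid user \)\{0,1\}$1 from \([^ ]*\) port .*/\2/p" |
        sort -u
}
```

It could be fed from the journal like this (the unit name sshd is an assumption and may differ between distributions):

     journalctl -u sshd --since today | failed_sources noehte_l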

To unban an address, run

fail2ban-client unban 129.129.71.65
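
To double-check the ban state before or after unbanning, the jail status can be inspected directly. This is a minimal sketch, assuming the relevant jail is called sshd (the jail name is not given on this page) and that the status output contains the usual "Banned IP list:" line; is_banned is a hypothetical helper name.

```shell
#!/bin/sh
# is_banned IP: read the output of "fail2ban-client status <jail>" on stdin
# and succeed if IP appears in the "Banned IP list:" line.
is_banned() {
    grep 'Banned IP list:' |   # keep only the line listing banned addresses
        tr -s ' \t' '\n' |     # one token per line
        grep -qxF "$1"         # exact, literal match of the address
}
```

Possible use, assuming the jail name sshd:

     fail2ban-client status sshd | is_banned 129.129.71.65 && echo "still banned"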

-- DerekFeichtinger - 2023-08-15

Topic revision: r1 - 2023-08-15 - DerekFeichtinger
 