Solutions to Operational Issues
Nodes drained due to jobs not ending on signals
This can happen for various reasons, e.g., when the processes are blocked by tasks in kernel space, or when the Slurm controller is so busy that communication is not possible. Such occurrences often come in bunches, so a large number of nodes may be affected. The timeout before a node is flagged as problematic is already 3 minutes, so increasing it further probably does not make sense.
- Check on at least one of the nodes whether the node is ok and responsive (e.g., run the nhc health check utility, or list some path from /pnfs/psi.ch/cms/trivcat/store).
- If the nodes are ok, use a command like the following from the slurm master node to put them back to work:
scontrol update NodeName=t3wn[56,71,73] State=RESUME Reason=""
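Before resuming, it helps to see which nodes are drained and why; sinfo -R lists the drain Reason. A minimal sketch for looping the health check over the affected nodes follows; the expand_hostlist helper is a hypothetical convenience for illustration (on the Slurm master, scontrol show hostnames 't3wn[56,71,73]' performs the proper expansion):

```shell
# List drained nodes together with the Reason that drained them:
#   sinfo -R --states=drain

# Hypothetical helper: expand a simple Slurm hostlist (comma lists only,
# no ranges) so the health check can be looped over each node. On the
# master, "scontrol show hostnames 't3wn[56,71,73]'" does this properly.
expand_hostlist() {
  case "$1" in
    *\[*\]*)
      prefix=${1%%\[*}           # part before the bracket, e.g. t3wn
      inner=${1#*\[}             # strip up to and including the "["
      inner=${inner%\]}          # strip the trailing "]"
      printf '%s\n' "$inner" | tr ',' '\n' \
        | while read -r n; do printf '%s%s\n' "$prefix" "$n"; done
      ;;
    *) printf '%s\n' "$1" ;;     # plain hostname, pass through
  esac
}

nodes=$(expand_hostlist 't3wn[56,71,73]')
# for n in $nodes; do ssh "$n" nhc; done   # run the health check on each node
```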
Users banned by fail2ban on one of the UI nodes
We run fail2ban on the UI nodes to provide at least minimal protection against brute-force attacks. If a genuine user reports via our admin list that they have been banned, identify the user and source address from the journal logs, and then check whether that address has indeed been banned.
There is a small helper script for this in the /root home directory of the UI nodes: is-user-banned.sh. Run it with the user name like this:
/root/is-user-banned.sh noehte_l
banned: 129.129.71.65
To unban an address, execute
fail2ban-client unban 129.129.71.65
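For illustration, here is a minimal sketch of the kind of lookup such a script might perform: collect the source addresses a user connected from (as seen in the sshd journal) and check each against fail2ban's ban list. The sample log lines, jail name (sshd), and parsing below are assumptions for this sketch, not the actual script; in real use the input would come from journalctl and fail2ban-client rather than sample variables.

```shell
user="noehte_l"

# Sample journal lines; in reality: journalctl -u sshd (format varies by system)
journal='Aug 15 10:01:02 t3ui01 sshd[123]: Failed password for noehte_l from 129.129.71.65 port 4242 ssh2
Aug 15 10:01:09 t3ui01 sshd[124]: Accepted password for other_u from 192.0.2.10 port 4243 ssh2'

# Sample ban list; in reality: fail2ban-client status sshd
banned='129.129.71.65'

# Source addresses this user connected from (field after "from" on matching lines)
addrs=$(printf '%s\n' "$journal" \
  | awk -v u="$user" '$0 ~ (" for " u " from ") {
      for (i = 1; i <= NF; i++) if ($i == "from") print $(i+1)
    }' | sort -u)

# Report any of those addresses that appear in the ban list
result=$(for a in $addrs; do
  printf '%s\n' "$banned" | grep -qx "$a" && echo "banned: $a"
done)
echo "$result"
```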
--
DerekFeichtinger - 2023-08-15