<!--
keep this as a security measure:
#uncomment if the subject should only be modifiable by the listed groups
# * Set ALLOWTOPICCHANGE = Main.TWikiAdminGroup,Main.CMSAdminGroup
# * Set ALLOWTOPICRENAME = Main.TWikiAdminGroup,Main.CMSAdminGroup
#uncomment this if you want the page only be viewable by the listed groups
# * Set ALLOWTOPICVIEW = Main.TWikiAdminGroup,Main.CMSAdminGroup,Main.CMSAdminReaderGroup
-->
%TOC%

---+ Solutions to Operational Issues

---++ Daily check of Slurm

Even though many problems are signalled by Icinga mails, I also prefer to have a look at the Slurm master (t3slurm.psi.ch) each morning. These are the kinds of queries I usually run to get a short overview:
<pre>
sinfo -o "%.12P %.16F %.16C %.14l %.16L %.12G %N"
sinfo -R
squeue -o "%.18i %.8P %.15j %.12u (%a) %.2t %.10M %.6Dn %.6Ccpu %A %.20S %.30R" | less
sprio -l -S Y | less   # only in case of need
</pre>

To make investigations easier, these alias definitions exist for the root account on t3slurm:
<pre>
alias df_sinfo='sinfo -o "%.12P %.16F %.16C %.14l %.16L %.12G %N"'
alias df_squeue='squeue -o "%.18i %.2t %.8P %.15j %.12u (%7a) %.10M %.6Dn %.6Ccpu %.10m %.20S %.20R %Z"'
alias df_sacct='sacct --format="JobID%16,User%12,State%16,partition,time,elapsed,TotalCPU,UserCPU,ReqMem,MaxRss,MaxVMSize,ncpus,nnodes,reqcpus,reqnode,Start,End,NodeList"'
</pre>

The =df_sacct= alias is especially useful for quickly investigating past jobs.

---++ Slurm: Nodes drained due to Jobs not ending with signals

This can happen for various reasons, e.g. when processes are blocked by tasks in kernel space, or when the slurm controller is so busy that communication is not possible. Such occurrences often happen in bunches, so a large number of nodes may be affected. The timeout before a node is flagged as problematic is already 3 minutes, so enlarging it probably does not make sense.

   1. Check on at least one of the nodes whether it is ok and responsive (e.g. run the =nhc= health check utility, list some path under =/pnfs/psi.ch/cms/trivcat/store=).
   1. If the nodes are ok, use a command like the following on the slurm master node to put them back to work:
<pre>
scontrol update NodeName=t3wn[56,71,73] State=RESUME Reason=""
</pre>

---++ Users banned by fail2ban on one of the UI nodes

We run fail2ban on the UI nodes for a minimum of protection against brute force attacks. If a genuine user reports via our admin list that they have been banned, you need to identify the user and source address from the journal logs, and then check whether that address has indeed been banned. Even better, the users themselves may be able to provide the address, but frequently they do not know it.

There is a little script to help with discovering the address associated with the banning of a particular user. It is found on the UI nodes at =/root/bin/is-user-banned.sh=. You run it with the user name like this:
<pre>
/root/bin/is-user-banned noehte_l
banned: 129.129.71.65
</pre>

The script greps through the sshd logs for lines where the user failed authentication and composes a host list. It then checks whether those host addresses are on fail2ban's banned list. This can fail if the failure happened a longer time ago, or if another account from the same IP address triggered the banning. You can use =-s 30= to extend the log search to 30 days back.

If you know the IP address, you can check whether it is banned by querying fail2ban explicitly:
<pre>
fail2ban-client banned 81.221.220.179
[['sshd']]
</pre>

In the above case, the address is indeed banned from ssh (it is in the 'sshd' jail). To unban an address, execute:
<pre>
fail2ban-client unban 129.129.71.65
</pre>

-- Main.DerekFeichtinger - 2023-08-15
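As a usage illustration for the =df_sacct= alias from the daily-check section, one might restrict the query to a single user and a start time. This is only a sketch: =-u= and =-S= are standard sacct options, but the user name is a placeholder and the snippet merely prints the command, since sacct is only available on the slurm master.
<pre>
# Sketch: compose an sacct invocation for one user's jobs since yesterday.
# "exampleuser" is a hypothetical account name; GNU date is assumed.
user=exampleuser
since=$(date -d 'yesterday' +%Y-%m-%d)

# Print (rather than run) the command; on t3slurm, df_sacct expands to the
# long --format list shown above, and -u/-S can be appended the same way.
echo sacct -u "$user" -S "$since" --format="JobID%16,User%12,State%16,Start,End,NodeList"
</pre>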
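When many nodes are drained at once, typing the node list by hand is tedious. Here is a hedged sketch (not an official procedure) of how one could assemble the =scontrol= resume command from drained-node names; a canned sample stands in for real =sinfo -R -h -o "%E|%N"= output (reason and node list separated by a pipe), so the field layout is an assumption.
<pre>
# Sketch: build the scontrol RESUME command from drained-node names.
# The sample stands in for `sinfo -R -h -o "%E|%N"` output.
sample='Kill task failed|t3wn56
Kill task failed|t3wn[71,73]'

# Collect the node-list column and join the entries with commas.
nodes=$(printf '%s\n' "$sample" | cut -d'|' -f2 | paste -sd, -)

# Print the command one would run on the slurm master (do not execute here).
echo scontrol update NodeName="$nodes" State=RESUME 'Reason=""'
</pre>
Only resume nodes after the health check in step 1 has passed.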
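To make the fail2ban section more concrete, the following is a rough sketch of the kind of log filtering that =is-user-banned.sh= performs; the actual script may differ, and the log lines here are fabricated samples in standard sshd format.
<pre>
# Sketch: extract source IPs from sshd "Failed password" lines for one user.
# The sample log stands in for real `journalctl` / sshd log output.
user=noehte_l
sample_log='Aug 15 07:00:01 t3ui01 sshd[123]: Failed password for noehte_l from 129.129.71.65 port 4242 ssh2
Aug 15 07:00:05 t3ui01 sshd[124]: Failed password for other_u from 10.0.0.9 port 4243 ssh2'

# Match only the target user's failures and print the word after "from".
ips=$(printf '%s\n' "$sample_log" \
      | awk -v u="$user" '$0 ~ ("Failed password for " u " ") {
          for (i = 1; i <= NF; i++) if ($i == "from") print $(i+1)
        }' | sort -u)

echo "$ips"
# On a real UI node one would then check each address:
#   for ip in $ips; do fail2ban-client banned "$ip"; done
</pre>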
Topic revision: r5 - 2023-09-02 - DerekFeichtinger