ATLAS has to check how CreamCE behaves and also enable SAM tests
AOB
Attendees
ATLAS: Gianfranco Sciacca, Marc Goulette, Sigve Haug, Szymon Gadomski
CMS: Derek Feichtinger
LHCb: Roland Bernet
CSCS: Fotis Georgatos, Peter Oettl
Minutes
Report on unscheduled downtime (FG)
Troublesome situation due to various Lustre instabilities
complexity/size of experiment-software aggravates Lustre risks
VO reps realized the issue and asked what we can do about it
CSCS has placed purchase orders for new controller hardware
CSCS recommends verifying AND rethinking the exp-software dirs
DF:
probably the longest emergency downtime we have ever experienced
VO-contacts were not aware that there are 4-5 Lustre failovers / month
if only the Lustre shared scratch were affected, we could just wipe and rebuild it. All running jobs would be lost, but the downtime would only be some hours. Rebuilding the SW area can take days and also involves work by central operations people. So we should separate the exp SW from the scratch. (Also, Lustre is not ideal for storing huge amounts of tiny files, as found in the exp SW area.)
many sites had similar experiences; they went back to NFS
CSCS management (MDL and/or DU) has to put pressure on Sun. The system is severely affecting our operations and consuming excessive admin time to keep it stable.