CHIPP Computing Board Meeting of Thu, July 1st 2010

The meeting will be held from 10-13h at ETH Zurich (Main Building) in Room HG E 33.5 (equipped with a projector)

Agenda:

Please add any other subject you would like to have discussed!

Main items

  • NEWS from the CHIPP executive board meeting (CG).
  • communication channels: We currently use email lists, chat, a trouble-ticket tracker, and the wiki for collaborating. Some of the lists have fallen into decay, as have certain parts of the T2 wiki. This is partly due to the changing environment and the changing manpower over recent years. We should strive to clean up our tools and end up with a few efficient guidelines and tools.
    • In particular, I think we would all profit from guidelines on how and where to communicate VO requirements to the CSCS admins. There should be an easy way for all of us to look up this information, and I think that having a wiki area per VO for this would be really beneficial (CommunicationChannelsProposal).
    • Need to decide how to proceed with the current wiki structure (remember that the wiki is now split into two webs and some pages are obsolete; the structure needs a clean-up).
  • Old WNs removal from the machine room. A few words about when we can give away the old WNs, which amounts to deciding when we cross the point of no return in the migration to PhaseC.
    • Sigve: Andres Aeschlimann from Bern ID, who will arrange the transport, has been in Manno talking to Michele and Pablo, I think. To my knowledge, the pick-up can happen soon. I guess the point of no return has already been crossed.
  • some first ideas on how to evolve the system over the next 2 years. Please take a look and join the PhaseDRequirementsDiscussion.
    • Tier-2 resource planning for 2011/2012: what are the experiments' needs in terms of CPU power, job slots, and data storage (local group spaces, production spaces, etc.)? Please: each experiment's reps should supply their preferences (according to present knowledge) in ONE slide.
  • experiments' job I/O requirements: How VO applications read from the SE storage is important to know in order to define policies on the scheduling mechanisms in the SE. VOs may also offer the possibility to influence job behavior through site-local configuration files (a small dcap access sketch follows this list).
    • Derek: CMS: Can run using direct dcap access. At the FZK-KIT meeting the ATLAS people told me that ATLAS applications can now also run via direct dcap. Our original design target was about 3 MB/s, but we know that some jobs may go up to 10 MB/s. Leo, the author of some CMSSW benchmarking tools, estimates 5-10 MB/s for optimised simple analysis jobs.
    • Szymon: ATLAS jobs can also use various protocols, like dcap and rfio. The official bandwidth requirement is 10 MB/s per core. Most jobs currently submitted to a T2 would not read data faster than ~5 MB/s (Athena framework reading AOD data format). Some jobs seen at the T3 in Geneva can read data much faster; they are I/O limited even at 50 MB/s. Such jobs, which use only ROOT rather than Athena, will likely move upstream in the near future, i.e. they will appear more and more at T2 sites. On the other hand, one should not forget that we also have simulation jobs, which are CPU limited and have much lower I/O requirements.
    • Roland: LHCb runs mainly simulation jobs at CSCS, so I/O requirements are minimal.
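
As a reference for this discussion, a minimal sketch of direct dcap access from PyROOT. It assumes ROOT was built with dCache support (libdcap available on the worker node); the door host, path, and tree name are hypothetical placeholders, not our actual configuration.

# Minimal sketch: read a file from the SE via direct dcap access.
# Assumes ROOT with dCache support and libdcap on the worker node;
# the URL and tree name below are hypothetical placeholders.
import ROOT

url = "dcap://se-door.example.ch:22125/pnfs/example.ch/data/somefile.root"

f = ROOT.TFile.Open(url)   # ROOT dispatches dcap:// URLs to its dCache plugin
if not f or f.IsZombie():
    raise RuntimeError("could not open %s" % url)

tree = f.Get("Events")     # tree name is experiment-specific
print("entries: %d" % tree.GetEntries())
f.Close()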

Further items

  • CREAM-CE: status from each VO; when will they be ready to use it in production?
    • Derek: CMS: Not all workflows run on the CREAM-CE yet, but we expect the move within the next few months. No fixed date, I am sorry.
    • Szymon: ATLAS: the lcg-CE is still required. Having a CREAM-CE in parallel is useful for testing.
    • Roland: LHCb: CREAM-CE is no problem for LHCb. Direct submission is not yet supported; currently we use CREAM-CEs through the WMS.
  • policies for VOs. Some VOs have until now had the right to use (more or less) spare cycles. We offered queues with a limited number of slots and smaller maximum runtimes. Will we continue to do that, and what policies do we want to set up? We should review the CPU/disk shares of the cluster (Maui fair shares, CPU reservation, and disk usage)
  • monitoring & accounting: What local monitoring and accounting should/can be offered? What would be useful to you?
    • Derek: The ability to see the current job queue and the mapping of local names to grid DNs is important for a first diagnosis of our jobs (good CMS link to some solutions; a rough sketch of such a lookup follows this list). Plots showing the history of jobs and movers per VO also help a lot. Most of this is already in place. Being able to monitor the average throughput per gridftp mover would be very useful for judging stage-out quality.
    • Szymon: According to Marc Goulette, our contact between CSCS and the German Cloud in ATLAS, the existing monitoring is sufficient.
    • Roland: Existing monitoring is sufficient for LHCb.
  • What do the VOs desire in regard to local UIs at CSCS? These could allow working at the Tier-2 in a more Tier-3-like way, with local dcap access to data for tests and debugging. What local resources (home directory space, etc.) would users require to benefit from this?
    • Derek: At the PSI T3, users have a staggering 100 GB of backed-up home space. One probably need not supply that much, but something on the order of at least 30 GB seems reasonable. I think this could be very interesting for Swiss CMS users, and about 40 users may want to take advantage of it.
    • Szymon: In Geneva we have up to 100 GB per user of backed-up home space. We have 110 TB on NFS and 70 TB in a Storage Element. This is more than enough for T3-like work. We do not think we also need T3 functionality at our T2, i.e. at CSCS. People should work via grid tools, which seem reliable and usable enough. We would see local login only as a fallback; so far it has not been needed. We don't have accounts at CSCS.
    • Roland: As LHCb is mainly running simulation jobs at CSCS, there is no real need for user accounts.
  • New security policy at CSCS. We want to completely disable password authentication, delete user accounts outside the UI, and also block SSH access from the outside to most of the machines (except the UI and the VO boxes). We would like your feedback on this before implementing it.
    • Derek: I think it is a good idea to strengthen the security of these systems, but I wonder whether enforcing pubkey-only access will be a good thing. Users will be forced to log in from machines where their key is stored. The passphrase securing the key is out of your control, and if the key lives on a machine at the user's home site, you also cannot control the quality of the security there. I think a login node at CSCS with password-based login is better: there you can enforce quality passwords (via PAM) and you may enforce a policy of password changes every few months.
    • Szymon: I did not know user accounts outside the UI even existed in Manno. No problem for us to close them. About the UI, see the point above.
    • Roland: No problem at all for LHCb.
  • Pre-production system. We want to deploy a complete pre-production system, where we can apply changes before doing so in production, all in virtual machines on mostly PhaseB hardware. We would like to know how "public" it should be (for example, can we have SAM test jobs there?), and we are open to any other suggestions that may arise.
    • Derek: I think getting VO testing on the pre-production service will be difficult. For ops-based jobs, you need to talk to the German NGI-DE people. Based on our experience of the last years, having at least a dCache testing environment for upgrades is very desirable. For the rest of the components I do not really know... you have more experience with them. But at least in the past, the main problems came from the SE.
    • Szymon: I agree that getting VO-testing on a pre-production system is unlikely. I think it is up to the CSCS team to judge how useful such a system would be and what functionality it should have.
    • Roland: I don't think LHCb has any manpower for testing pre-production systems. LHCb SAM jobs are only sent to production sites.
  • Long-term operation of the Tier-3s: resources and funding issues. What model do the three experiments follow over the next 5 years? Each experiment shall briefly present its ideas.
    • Szymon: The ATLAS T3 in Geneva is close to its limits of power, space, and manpower. However, the hardware will be renewed, meaning an increase in CPU power. Storage space will likely increase as well (this year from 170 to 200 TB, next year maybe another 30 to 60 TB). The facility is now shared between the ATLAS and Neutrino groups.
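
Regarding the job-queue and DN-mapping monitoring mentioned above, a rough sketch of what such a lookup could do. It assumes a Torque/PBS qstat and a standard grid-mapfile; the paths, the column layout, and the pool-account naming are site assumptions, and an exact per-job mapping would really need the gridmapdir rather than the grid-mapfile.

# Sketch: list running jobs per local account with a candidate grid DN.
# Assumes plain Torque/PBS `qstat` output and a standard
# /etc/grid-security/grid-mapfile; pool accounts (e.g. cms001) are
# matched to their VO prefix, so this gives a VO-level view only.
import re
import subprocess
from collections import Counter

def load_gridmap(path="/etc/grid-security/grid-mapfile"):
    """Return {account_or_prefix: [DNs]} parsed from a grid-mapfile."""
    mapping = {}
    with open(path) as fh:
        for line in fh:
            m = re.match(r'"(?P<dn>[^"]+)"\s+\.?(?P<acct>\S+)', line)
            if m:
                mapping.setdefault(m.group("acct"), []).append(m.group("dn"))
    return mapping

def running_jobs_per_user():
    """Count running jobs per local user from plain `qstat` output."""
    out = subprocess.check_output(["qstat"]).decode()
    users = Counter()
    for line in out.splitlines()[2:]:          # skip the two header lines
        cols = line.split()
        if len(cols) >= 6 and cols[4] == "R":  # state column: R = running
            users[cols[2]] += 1                # user name column
    return users

if __name__ == "__main__":
    gridmap = load_gridmap()
    for user, njobs in running_jobs_per_user().most_common():
        prefix = user.rstrip("0123456789")     # cms001 -> cms (assumption)
        dns = gridmap.get(user) or gridmap.get(prefix) or ["(no DN found)"]
        print("%-12s %4d running  %s" % (user, njobs, dns[0]))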


Meeting minutes

NOTE: Action items are written in bold.

Communication channels

Basis: CommunicationChannelsProposal

  • Very long discussion, mainly about TWiki technicalities and hosting
    • Michele said that, if possible, they would like to outsource the wiki hosting, and maybe even move to a different system. Pablo and Derek point to the large amount of information in the present TWiki, which may make a migration a big effort.
    • All experiment contacts confirm that TWiki is used extensively in their collaborations, and there is a preference for staying with that technology.
    • Derek mentions that Foswiki (a free fork of TWiki) is now the basis of PSI's intranet. PSI has 1.5 FTEs mainly working on the system, but currently there is no usable extranet wiki hosted by PSI.
    • Derek: It is important to have front pages for external users and admins, where they can get general information on the T2 and links to local monitoring. In CMS, physics groups are associated with a small number of Tier-2s, and they will profit from these pages. E.g. it would be nice to see from the main page when downtimes are planned, or whether something is amiss.
    • All agreed that it makes sense to have at least one page per VO where the VO's requirements are collected, so that CSCS admins can easily refer to them.
      • Pablo and Peter mentioned a number of facts they would like from every VO. They should provide the VOs with a list of these requirements.
    • What we almost totally neglected to discuss is the great benefit the wiki has had internally as a source of admin documentation. As Pablo mentioned, when googling for problems one is often even referred to our wiki. This has given our project some visibility and has also been beneficial for establishing contacts within the community. I (Derek) would very much like us to continue using this as the primary documentation tool. Private notes and Word documents are lost to the others and are not easily searchable.
  • Mailing lists:
    • CMS would like a CSCS list where downtimes and other T2-relevant news are specifically announced. We will probably just subscribe our Tier-3 list to it, since the users will largely be the same.
    • ATLAS has fewer direct user-Tier-2 contacts, but as Szymon pointed out, this may change, since ATLAS also has the concept of physics-group space at dedicated Tier-2s.
    • Since LHCb does not run normal user jobs on the T2s, they have no requirements for end-user notification.
  • EVO / phone meetings
    • proposal by CSCS to have a biweekly phone meeting between CSCS admins and VO contacts for direct discussion
    • agreement on trying out this option. CSCS will set up a Doodle poll for the specific dates.
    • The day before each meeting, we will decide by mail whether there is something to discuss, with the option of cancelling the meeting.
    • Meeting notes must be placed on the wiki
  • Use of the CSCS tracker
  • IRC chat
    • CSCS would like to move some of the discussion to the phone meetings (see above).
    • We will keep the IRC chat for interactive work

Old WNs (PhaseB) removal from the machine room

  • Bern will take 5 blade centers. 3 will go into the central Bern UBELIX cluster
  • Geneva takes 1 blade center
  • letters of transfer and documents for SBF have been prepared
  • Michele: CSCS feels confident about the stability of PhaseC and would like to decommission PhaseB for good and move the racks out by the 2nd week of July
    • The VO site contacts confirm that the cluster has run stably for the last 3 weeks, so decommissioning is OK
  • Specifics will be arranged offline between CSCS and Uni Bern and Uni Geneva

job I/O requirements and data access patterns

  • ATLAS and CMS think that the future system should be designed for 10 MB/s per job (knowing that some jobs can reach significantly higher rates).
  • As a mitigating factor, normally only a fraction of the slots will be used by high-I/O analysis jobs, so let's target a system able to supply half the nodes with the above rate (a back-of-the-envelope check follows below).
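
As a back-of-the-envelope check of that target, a small sketch; the slot count below is purely illustrative and not an agreed number:

# Rough aggregate SE bandwidth implied by the sizing target above.
# The slot count is an illustrative assumption; plug in real numbers.
slots = 1000              # assumed total job slots on the cluster
analysis_fraction = 0.5   # target: supply half the slots at full rate
rate_mb_s = 10            # agreed per-job design rate in MB/s

aggregate_mb_s = slots * analysis_fraction * rate_mb_s
print("required aggregate SE throughput: %.1f GB/s" % (aggregate_mb_s / 1000.0))
# with these assumptions: 1000 * 0.5 * 10 MB/s = 5 GB/s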

Future evolution of the system

  • Experiments would like 300 TB of additional storage (150 TB each for ATLAS and CMS)
  • Thumpers will need to be replaced. They currently supply 27*17.5 TB = 472.5 TB of our current storage
    • So, to reach our targets, we would need to procure ~800 TB of storage!!! Ouch!
    • We need to make sure that we can reach the needed throughput (see above) per job slot.
      • We need to test how many clients an InfiniBand-connected Thor can supply with sufficient I/O (a simple per-client read test is sketched after this list).
  • If possible, we would also like to scale up to 50% more cores
  • Power consumption is a constraint at CSCS. CSCS will supply an estimate of the costs and of the limitations due to power constraints by August (because we need to hand in the proposal)
  • CSCS also mentions potential costs due to the move of the system to the new center in 2012
  • Pablo: LHCb will not use more storage for now. All disks on the Thors should be dedicated to ATLAS and CMS; if LHCb needs space at some point, we will try to clean up, reduce the pools, and free some space
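
For the Thor client test mentioned above, a minimal sketch measuring the sequential read rate of a single client; the mount path is a placeholder, and a real test would run one instance per client node while scaling the number of concurrent clients:

# Sketch: sequential read throughput of one client, in MB/s.
# Run one instance per client node against a file served by the Thor
# under test; the path below is a placeholder. Use a file larger than
# the client's RAM (or drop caches first) so the page cache does not
# inflate the result.
import sys
import time

CHUNK = 4 * 1024 * 1024  # read in 4 MiB chunks

def read_throughput(path):
    total = 0
    start = time.time()
    with open(path, "rb") as fh:
        while True:
            buf = fh.read(CHUNK)
            if not buf:
                break
            total += len(buf)
    return total / (1024.0 * 1024.0) / (time.time() - start)

if __name__ == "__main__":
    path = sys.argv[1] if len(sys.argv) > 1 else "/mnt/thor-test/bigfile.dat"
    print("read throughput: %.1f MB/s" % read_throughput(path))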

Long-term operation of Tier-3s (manpower)

  • Each experiment VO can apply for one FTE financed for Tier-3 and related Swiss HEP Grid tasks
  • ATLAS Bern has already hired a sysadmin; CMS has just published a job offer for the PSI Tier-3
  • LHCb has not yet applied. Roland to enquire with U. Straumann.

policies for other guest VOs

  • We will continue to allow guest VOs to use spare cycles
  • Limit the maximal job length to 4 hours and the maximal number of running jobs to 100. Pablo: In my notes I have written 200... what should we do?
  • Guest VOs must apply to CHIPP via the CHIPP Computing Board, where a decision on admitting them is taken
  • Guest VOs must define one VO site contact who can be reached by the CSCS admins and who can liaise between CSCS and their users.

local UIs for users

  • only CMS sees a big immediate benefit for its users.
  • ATLAS sees its users accessing CSCS only via standard grid tools.
  • again, LHCb has no need of this, because user jobs run at the Tier-1s
  • We decide to deprioritize this issue.

New security policy at CSCS

  • Since in the immediate future only site contacts will have local accounts at CSCS (no normal users), we are OK with any security scheme CSCS proposes.

Preproduction system

  • Pablo mentioned that he ran a similar system at his last job and feels that it is doable without undue effort
  • VO testing (SAM tests) may be difficult or almost impossible to obtain for this
  • Derek mentions the Pre-Production Service that was run at CERN and some EGEE sites and will try to put them in contact with CSCS

CSCS Tier-2 users day

  • the interested "users" will mainly be the VO site contacts, because the end users are rather abstracted from CSCS via the grid. Without added benefits (like local UIs) that differentiate CSCS from the other grid sites open to our users, this will hardly change
  • therefore, the audience overlaps almost completely with the membership of the Computing Board, and we need not hold another meeting so soon after today's
  • We decide to keep the date (August 26th) reserved for further follow-up discussions, should we need them. To be confirmed by the end of July.
  • The user meeting will be moved to October and held in Bern. Time slot to be defined.

AOB

  • Michele to be added to CHIPP Computing Board mailing list
  • Pablo: Agreed that, in case a user runs jobs that cause problems for other VOs or for cluster stability, CSCS is allowed to cancel his/her jobs and temporarily ban the user. The user's VO representative should be informed, and also the user, if his/her email address is easy to find.
  • Pablo: Derek asked CSCS to create a webpage identifying the real users together with the number of running jobs they have and, if possible, how intensively they are using the storage (for example, the number of movers they use with dcap).
  • Peter: TWiki hosting
    • Derek will ask for an offer at PSI. Derek: PSI cannot take this over; they are concentrating on the wiki for the intranet. The responsible group told me that they cannot provide this as a service similar to the intranet wiki due to manpower.
    • Christoph will ask for an offer at CERN