Phenix updated installation and configuration status
- Minutes of the phone call held on 13 February 2007
- Participants:
- Sergio Maffioletti
- Alessandro Usai
- Tom Guptil
- Derek Feichtinger
- Sigve Haug
- Zhiling Chen
Status of installation and configuration
- WN (X2200)
- SLC 3.0.8 [ok]
- LCG/gLite [ok]
- WN integrated into old LRMS [ok]
- all WNs could be integrated in a short time (sharing /apps via NFS)
- CE (X4200)
- SLC4 [ok]
- for the time being we agreed to keep using the old ce01-lcg with Torque as the LRMS
- SGE integration will be tested; the plan is to put it into production as soon as it is stable
- NorduGrid will also have to be checked and tested
- Problems encountered, solutions and workarounds
- SLC 3.0.6 does not work (missing controller drivers)
- SLC4 works, but installing the LCG/gLite software is error-prone
- Thumper installation [ok], but we still need to test the ZFS functionality
- Tom proposed changing the current RAID configuration to use only 1 parity disk per RAID group; this would give an additional 4 TB at the expense of reliability [still to be decided]
- Sun N1 is not suitable for cluster management, therefore we will use cfengine
- the plan is to run all cluster management services on one X2200 under Linux (possibly with Solaris in a virtual machine)
- 2 X4200s --> should become free
- Tests (tentative dates)
- Reliability tests on Thumpers (Tom + Alessandro) --> 12 - 16 February
- Performance tests from WNs to Thumpers via dCache --> 14 - 21 February
- Test different configurations of ZFS and dCache --> 12 - 23 February
- Organisation of the dCache tests
- functionality tests
- VO codes
- local load tests (mainly dcap):
- writing files in parallel from multiple nodes
- reading same file from multiple nodes
- trying to write file that is being written by another process
- erasing file that is being read by another process
- measure I/O rates as a function of the number of parallel clients (see the test sketch after this list)
- WAN protocol tests (SRM, gridftp)
- CMS PhEDEx transfers
- Storage access profile of CMS jobs -> they will use the dcap protocol
- Storage access profile of Atlas jobs -> for those using ARC, access is mainly through SRM and/or GridFTP
- each VO should prepare its own specific tests
- A general test suite (local and WAN tests) will be prepared by Derek
- Sigve will forward the test description to the Atlas contact to check whether Atlas needs additional tests
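- A minimal sketch of the local dcap load test (our illustration, not an agreed procedure): it assumes the dCache client-only package provides dccp on the WNs; the dcap door, PNFS path, file size and client count are placeholders to be adapted
<verbatim>
#!/usr/bin/env python
# Sketch of the local dcap load test described above.
# Assumes the dCache client-only package provides 'dccp' on the WNs and
# that the pools are reachable through a dcap door on the new SE.
# Door, PNFS path, file size and client count are placeholders.

import os
import subprocess
import sys
import tempfile
import time

DCAP_DOOR = "dcap://se02-lcg.projects.cscs.ch"             # assumed dcap door
PNFS_DIR = "/pnfs/projects.cscs.ch/data/dteam/loadtest"    # assumed test area
FILE_MB = 512                                              # size of each test file
CLIENTS = int(sys.argv[1]) if len(sys.argv) > 1 else 4     # parallel clients


def make_local_file(mb):
    """Create a local scratch file of 'mb' megabytes."""
    fd, path = tempfile.mkstemp()
    with os.fdopen(fd, "wb") as f:
        for _ in range(mb):
            f.write(b"\0" * 1024 * 1024)
    return path


def run_parallel(cmds):
    """Start all commands at once, wait for them, return elapsed seconds."""
    start = time.time()
    procs = [subprocess.Popen(c) for c in cmds]
    codes = [p.wait() for p in procs]
    if any(codes):
        print("WARNING: some transfers failed: %s" % codes)
    return time.time() - start


local = make_local_file(FILE_MB)

# 1) writing files in parallel from multiple clients
writes = [["dccp", local, "%s%s/write_%d" % (DCAP_DOOR, PNFS_DIR, i)]
          for i in range(CLIENTS)]
elapsed = run_parallel(writes)
print("write: %d clients, %.1f MB/s aggregate" % (CLIENTS, CLIENTS * FILE_MB / elapsed))

# 2) reading the same file from multiple clients
reads = [["dccp", "%s%s/write_0" % (DCAP_DOOR, PNFS_DIR), "/dev/null"]
         for _ in range(CLIENTS)]
elapsed = run_parallel(reads)
print("read : %d clients, %.1f MB/s aggregate" % (CLIENTS, CLIENTS * FILE_MB / elapsed))

os.unlink(local)
</verbatim>
- Running the script (hypothetical name dcap_loadtest.py) with increasing client counts from several WNs at once would give the I/O rate versus number of parallel clients asked for above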
SE dCache configuration scenario
- PNFS + PostgreSQL DB on a fat node = 1 X4200
- SRM + dCache domains + LCG/gLite software on a standard WN = 1 X2200
- GridFTP + a few dCache modules (still to be checked) = 2 X4500
- we still need to understand the proper scenario
- We all agree on such a configuration
- things to be checked:
- is it necessary to mount PNFS on the Thumpers? Apparently yes, if the Thumper runs GridFTP (thanks to Lionel Schwarz)
- what do the WNs need in order to use the dcap protocol to access the dCache pools? Apparently only the client-only dCache package (Alessandro will check)
Planning the migration of DPM data to the dCache pools
- Migrate DPM data to the new SE (se02-lcg.projects.cscs.ch)
- Users will have to migrate their data and update the catalogue (see the sketch below)
- With the introduction of the new SE as the default CSCS/CHIPP SE, we will have to change a few settings in FTS and in the IS (Alessandro will check)
- Derek and Sigve will check what needs to be done for the CMS and Atlas VOs to support the new SE
- The current se01-lcg will be kept as a backup for an initial period and then converted into a dCache pool
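- A rough sketch of a possible per-file migration (assumption: the standard lcg-utils commands lcg-rep, lcg-lr and lcg-del are available and the files are registered in the LFC); the VO and the LFN list file are placeholders, and the actual procedure still has to be agreed
<verbatim>
#!/usr/bin/env python
# Sketch of a per-file DPM -> dCache migration, assuming the standard
# lcg-utils commands (lcg-rep, lcg-lr, lcg-del), the LFC as catalogue
# and a valid grid proxy. The VO name and the input LFN list are
# placeholders; the real procedure is still to be defined.

import subprocess
import sys

VO = "dteam"                             # placeholder VO
OLD_SE = "se01-lcg.projects.cscs.ch"     # current DPM
NEW_SE = "se02-lcg.projects.cscs.ch"     # new dCache SE

# input file: one LFN per line, e.g. lfn:/grid/dteam/some/file
lfns = [line.strip() for line in open(sys.argv[1]) if line.strip()]


def run(cmd):
    print(" ".join(cmd))
    return subprocess.call(cmd)


for lfn in lfns:
    # 1) copy the file to the new SE and register the new replica in the catalogue
    if run(["lcg-rep", "--vo", VO, "-d", NEW_SE, lfn]) != 0:
        print("SKIP (replication failed): " + lfn)
        continue
    # 2) list all replicas and remove the one still sitting on the old DPM
    out = subprocess.Popen(["lcg-lr", "--vo", VO, lfn], stdout=subprocess.PIPE,
                           universal_newlines=True).communicate()[0]
    for surl in out.split():
        if OLD_SE in surl:
            run(["lcg-del", "--vo", VO, surl])  # deletes and unregisters old replica
</verbatim>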
VO Disk space shares:
- Agreed to have a filesystem <-> VO mapping as was done in Phenix I
- Accepted proposal:
- Each VO will have access to both Thumpers
- Atlas = 2 x 6 TB
- CMS = 2 x 6 TB
- LHCb = 1 x 0.5 TB
- Hone = 1 x 0.5 TB
- dteam = 1 x 0.5 TB
- spare = 2 x 3.5 TB
- the spare space will be available to all VOs on request
- we may also take space from dteam
- LHCb should agree to having only 0.5 TB initially (Derek will contact them)
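- Rough cross-check of these shares (our arithmetic, not an agreed figure): each Thumper carries 6 TB (Atlas) + 6 TB (CMS) + 3.5 TB (spare) plus 0.5-1.5 TB for the small VOs, i.e. roughly 16-17 TB, consistent with the 16-18 TB of usable space foreseen in the ZFS proposals at the end of these minutes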
What bandwidth can we expect:
- WN = 1 Gbit/s link
- Thumper = 4 x 1 Gbit/s links, trunked
- from CSCS to Karlsruhe -> 20 MB/s should be guaranteed (CSCS will check)
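- Rough numbers (theoretical maxima, assuming standard gigabit Ethernet): 1 Gbit/s corresponds to about 125 MB/s per WN, and the 4 x 1 Gbit/s trunk to at most about 500 MB/s per Thumper, so a handful of WNs streaming at full rate can already saturate one Thumper; 20 MB/s sustained to Karlsruhe is roughly 1.7 TB/day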
VO CPU shares based on queue priority, including the nordugrid- queues
- For the time being we will keep the configuration of the queues as they are
- We will observe the behavior of the queues
- When we migrate to the new SGE-based CE, we will address the fair-share issue
Integration with Phenix 1 cluster
- integration of WNs
- Agreed to integrate 10 WNs once the installation is stable
Deadlines
- We can make it by the end of February
- Next week we will have more info
AOB
- update the UI machine (strange Java exception errors)?
- The UI will be re-installed in the next two weeks
- Proposal to migrate it to a server-class box (to gain reliability)
- VOBoxes
- should we reinstall them as true LCG VO-Boxes? This would provide gsissh and easier MyProxy management
- We are planning to migrate these boxes anyway
- responsibility for the TWiki areas (CSCS will take care of this)
- Create 1 page per VO
- add a page with logs of problems
- VOBoxes page with info about how to start services
Summary of the configuration
- 1 X2200 = cluster management system
- 1 X2200 = SRM + dCache domains + LCG SE-related software
- 1 X4200 = PNFS + PostgreSQL
- 1 X4500 Thumper = GridFTP door + dCache pool node
- 1 X4500 Thumper = GridFTP door + dCache pool node
- ZFS configuration (Proposal):
- 1 Thumper to be tested with 4 RAID groups using 2 parity disks each = 16 TB usable + 4 spare disks
- 1 Thumper to be tested with 4 RAID groups using 1 parity disk each = 18 TB usable + 4 spare disks (capacity arithmetic at the end of this list)
- 1 Filesystem per VO per Thumper
- each VO gets space on both Thumpers
- Thumper dCache configuration
- each Thumper will have 1 dCache pool per VO/FS (CMS, Atlas, dteam)
- one Thumper will also have filesystems for LHCb and hone
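- Capacity arithmetic behind the two ZFS proposals (assuming the X4500's 48 x 500 GB disks and 10-disk RAID groups, which is our reading, not a recorded decision): with 2 parity disks per group, 4 x 8 data disks x 0.5 TB = 16 TB usable; with 1 parity disk per group, 4 x 9 x 0.5 TB = 18 TB; in both cases 4 disks remain as hot spares and the last few disks stay outside the data pool (e.g. for the OS)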
--
SergioMaffioletti - 13 Feb 2007