
Activities Overview from September 2010 to August 2011

This page describes the most important activities performed on the Phoenix cluster from Summer 2010 to Summer 2011.

Achieved stability and high availability on PhaseC

The LCG software is distributed among many different services that need to be installed on tens of machines. Modern servers have multiple cores and larger amounts of memory and disk, so it would be a waste to dedicate one server to each service. Therefore, we decided to virtualize many of the middleware pieces. Because of disk latencies on virtual machines, storage was not virtualized, but other services were, such as:

  • VObox (one per VO), MON (for accounting), Ganglia (monitoring), a User Interface, and a security bastion host.
Also, a number of critical systems were deployed using different High Availability mechanisms, again making use of virtualization techniques. This means that, even if a problem arises on one server, another one should take over the same task:
  • LCG-CE, the old gLite computing element: two independent servers, both accepting jobs and sharing the incoming load (active/active).
  • CREAM-CE, the new gLite computing element: again, two independent servers (active/active).
  • ARC-CE, the NorduGrid computing element: two independent servers (active/active).
  • PBS/batch server, together with the scheduler, in active/passive mode: if the active node fails, the passive one takes over.
  • BDII (the information system), with three instances behind a DNS round-robin alias (active/active).
In total, 18 production servers were deployed on 5 physical hosts.
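
As an illustration of the active/passive idea, the following minimal sketch checks whether the active batch server still answers on its service port and, if not, suggests promoting the standby. The host names are hypothetical and the real Phoenix setup relies on the HA mechanisms described above, not on a script like this:

    # Minimal active/passive health-check sketch (illustrative only;
    # the host names below are hypothetical, not real Phoenix machines).
    import socket

    ACTIVE = "pbs01.example.org"    # hypothetical active PBS/batch server
    PASSIVE = "pbs02.example.org"   # hypothetical standby server
    PORT = 15001                    # default pbs_server port (adjust if different)

    def is_alive(host, port, timeout=5):
        """Return True if a TCP connection to host:port succeeds."""
        try:
            socket.create_connection((host, port), timeout).close()
            return True
        except socket.error:
            return False

    if is_alive(ACTIVE, PORT):
        print("active server healthy, nothing to do")
    elif is_alive(PASSIVE, PORT):
        print("active server down: promote the passive server")
    else:
        print("both servers unreachable: alert the administrators")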

Other services in PhaseC were not virtualized because they deliver high throughput:

  • Central storage, with dCache, consisted of 38 pool nodes (data) and 2 service nodes (metadata and other services).
  • Scratch filesystem, with Lustre, consisted of almost 400 spindles connected to 8 IO servers and 2 metadata servers.
  • NFS service, for the experiment software, consisted of 2 servers in High Availability (active/passive).

Old Worker Nodes from PhaseB distributed among Swiss universities

PhaseB worker nodes were decommissioned in Summer 2010 and replaced by Sun blades in PhaseC. The storage was not replaced at that time, because it was expected to last one or two more years. This left a significant amount of computing resources that could be used elsewhere, so it was decided to split them among the WLCG universities in Switzerland: Bern and Geneva.

This has proven to be a well-received and very useful activity for the universities, and we will continue to offer them decommissioned CSCS hardware if they are interested.

Moab scheduler and Torque support

One of the key elements of a cluster is the batch system. It is composed of the resource manager (Torque) and the scheduler (Maui). We now have a three-year license for Moab (Maui's replacement) and Torque support from Adaptive Computing:

  • Maui is the entry-level free scheduler, with which we had many problems, especially because of the complexity introduced by diskless nodes, different VOs, fairshares and reservations. It is a black box in which certain errors are impossible to trace. Moab, its commercial replacement, not only provides better features such as High Availability, but also comes with a very helpful support contract (the fairshare idea is sketched after this list).
  • Torque is a free tool that comes with no support by default. Adaptive Computing included Torque support with Moab, which in the end proved very helpful too.
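
To illustrate the fairshare mechanism mentioned above, the sketch below computes the kind of decayed historical usage that schedulers such as Maui/Moab typically weigh into job priorities. The decay factor and CPU-hour figures are made-up example values, not Phoenix settings:

    # Illustrative fairshare sketch: recent usage counts more than old usage.
    # Decay factor and CPU-hour figures are hypothetical example values.
    def fairshare_usage(per_window_usage, decay=0.8):
        """per_window_usage[0] is the most recent window (e.g. one day of
        CPU hours); older windows are damped by decay**i."""
        return sum(u * decay ** i for i, u in enumerate(per_window_usage))

    # A VO that consumed 1000, 500 and 2000 CPU hours over the last three days:
    print(fairshare_usage([1000, 500, 2000]))   # 1000 + 400 + 1280 = 2680.0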

Reinstallation of Solaris pool nodes to Linux

Historically, we had old dCache storage nodes installed with Solaris. Running two different operating systems (Scientific Linux and Solaris) increased the system administration workload. Solaris also lacked InfiniBand support, and the dCache code on Solaris often had problems because it received little testing from other LCG sites, which mostly run Linux. To move away from Solaris we had to migrate all the data off those servers and gradually install Linux on them, while everything remained in production.

NFS servers replacement

The old NFS infrastructure was becoming faulty. The introduction of two new physical machines connected over InfiniBand has clearly paid off in both stability and performance.

PhaseD deployment

In March 2011, a new extension of Phoenix was purchased and deployed. With those resources we were able to meet the 2011 WLCG pledges. The PhaseD extension consisted of:

  • 10 worker nodes from Dalco, each providing 24 AMD cores, 3 GB of RAM per core, an InfiniBand connection and a regular local disk. The computing power of Phoenix was hence increased by 2000 HEP-SPEC06.
  • 250 TB of central storage provided by two IBM DS3500 Fibre Channel controllers, each attached to 90 disks of 2 TB. Each controller was connected to 2 IO servers running dCache, providing up to 2 GB/s of bandwidth each.
  • 96 TB of scratch filesystem, using the same technology (a DS3500 controller) but attached to 3 IO servers running GPFS, providing up to 2.4 GB/s of bandwidth to the worker nodes.
Given the number of problems related to Lustre, and the uncertain continuity of this product after Oracle's acquisition of Sun, GPFS was installed as an alternative scratch filesystem in PhaseD. We are running both filesystems (Lustre and GPFS) in parallel until we decide whether to move to GPFS only.
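
As a quick back-of-the-envelope check of the PhaseD figures listed above (the purchase documents, not this sketch, remain the authoritative source):

    # Back-of-the-envelope check of the PhaseD numbers quoted above.
    nodes, cores_per_node = 10, 24
    added_hs06 = 2000
    new_cores = nodes * cores_per_node       # 240 new cores
    print(added_hs06 / float(new_cores))     # ~8.3 HEP-SPEC06 per core

    central_tb, scratch_tb = 250, 96
    print(central_tb + scratch_tb)           # 346 TB of new disk capacity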

Pre-Production virtual cluster

Grid middleware (and software from other sources) is not always reliable, and blindly upgrading a production system is not a good idea. Hence, we deployed a virtual environment with a complete Grid cluster, where we test software before putting it into production. It consists of 2 physical machines (similar to the ones used in production) that currently host around 20 virtual machines. It may not be adequate for stress tests or hardware configuration, but it has already proven very useful for day-to-day operation.

New EMI services installation and configuration

The introduction of EGI as a replacement for the EGEE organization has brought new services that replace old ones, in order to converge gLite with the NorduGrid and UNICORE middlewares. So-called "Early Adopter" teams were set up for each piece of software that EMI (the European Middleware Initiative) wants to release, before it is considered production-ready.

  • CSCS is part of four of those Early Adopter groups (CREAM, Argus, APEL and WN), helping the global community by reporting possible bugs in this new software.
  • Argus and the related gLite security components, the new centrally managed security services, were installed as two new services (Argus in High Availability) and also on the Worker Nodes.
  • LCG-CE, the old computing element, was decommissioned in favor of the well-proven CREAM CE.
  • APEL, the EMI accounting service, was also installed, replacing the old MON box.

Preparation for the move to Lugano started

Migrating our current physical infrastructure to the new building in Lugano in May 2012 presents a large set of challenges that need to be addressed soon. Therefore, CSCS has already started planning the procedure to follow. The conditions in the new building differ from the current ones:

  • There will be no room-level cold air: everything will be cooled with water coming from Lake Lugano. Isolated "islands" will be put in place, where all the heat exchange happens: inside each island there are two cold-air corridors from which the machines draw air, warm it up, and expel it into a central warm corridor. Cooling devices then take that warm air, cool it down with water, and push it back into the front corridors. This creates a few constraints: a desirable heat density of 10 kW per rack (20 kW maximum), identical rack sizes, and limited free space in front of the racks (see the sketch after this list).
  • Worker nodes will not be on an Uninterruptible Power Supply.
  • A new application will be used to manage all machines and cables inside the machine room.
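
The rack-density constraint above can be checked with simple arithmetic. The sketch below is illustrative only: the per-node power draw is a hypothetical figure, not a measured Phoenix value.

    # Sketch of a rack power-density check against the 10 kW desirable /
    # 20 kW maximum limits mentioned above.
    WATTS_PER_NODE = 350.0        # hypothetical average draw per worker node
    DESIRABLE_KW, MAX_KW = 10.0, 20.0

    def rack_load_kw(nodes_in_rack):
        """Total rack load in kW for a given number of nodes."""
        return nodes_in_rack * WATTS_PER_NODE / 1000.0

    for n in (20, 30, 60):
        load = rack_load_kw(n)
        if load <= DESIRABLE_KW:
            status = "ok"
        elif load <= MAX_KW:
            status = "above desirable density"
        else:
            status = "over the 20 kW limit"
        print(n, "nodes:", load, "kW ->", status)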

In addition, the moving process will be delicate and complicated. We face the risk of data loss, especially with old disks. There is hardware to be decommissioned that does not make sense to move; replacements will have to be purchased and put in place beforehand. In the end, the move is a complex logistical problem that needs to be addressed as soon as possible.

Other activities of interest

  • TWiki, our documentation platform, was migrated to a commercially supported server and renamed to wiki.chipp.ch to reflect the ownership of the documentation. The old documents were also cleaned up and reorganized into a tree structure.
  • Security was enhanced by allowing SSH access with keys only, for both users and administrators, and by reviewing the firewall rules of all services. In addition, several software upgrades, including the kernel, were performed for security reasons.
  • Queue closure times for maintenance windows were optimized, reducing the closure from three days for all VOs to one day (two for LHCb), based on real job durations instead of theoretical ones, hence increasing production hours (see the sketch after this list).
  • Lustre was reformatted with the ext4 filesystem backend to improve performance.
  • CHIPP Tier-3 system administrators were introduced to the procedures generally followed in Phoenix operations, so that they can act as backup in case of an emergency.
  • Two new routers were deployed in HA mode to separate InfiniBand and Ethernet traffic from the same physical network and simplify the internal networking software stacks. The actual network split is not yet finished and is still under testing.
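
The queue-closure optimization mentioned in the list above follows a simple rule: close each VO's queue only as far ahead of the maintenance window as its longest observed jobs require. A minimal sketch (the job durations are illustrative, not measured Phoenix values):

    # Sketch of per-VO queue closure times before a maintenance window.
    # The maximum job durations below are illustrative, not measured values.
    from datetime import datetime, timedelta

    maintenance_start = datetime(2011, 9, 20, 8, 0)
    longest_job = {
        "atlas": timedelta(days=1),
        "cms":   timedelta(days=1),
        "lhcb":  timedelta(days=2),   # LHCb jobs run longer, so close earlier
    }

    for vo, duration in sorted(longest_job.items()):
        print(vo, "queue closes at", maintenance_start - duration)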

-- PabloFernandez - 2011-09-06
