
CMS VObox

Machine name cmsvobox.lcg.cscs.ch

Firewall requirements

Port       Open to                                          Reason
80/tcp     *                                                access to our custom PhEDEx monitoring pages
3128/tcp   worker nodes                                     access from WNs to FroNtier squid proxy
3401/udp   128.142.202.212, 128.142.137.39, 131.225.209.5   central SNMP monitoring of FroNtier service
1975/tcp   *                                                gsissh access for team members
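For illustration, a possible set of iptables rules matching the table above (a sketch only: it assumes iptables is used on the box, and the worker-node subnet 148.187.64.0/22 is an assumption borrowed from the FroNtier NET_LOCAL setting further below):

iptables -A INPUT -p tcp --dport 80   -j ACCEPT                      # monitoring pages
iptables -A INPUT -p tcp --dport 1975 -j ACCEPT                      # gsissh
iptables -A INPUT -p tcp --dport 3128 -s 148.187.64.0/22 -j ACCEPT   # FroNtier squid, worker nodes (assumed subnet)
iptables -A INPUT -p udp --dport 3401 -s 128.142.202.212 -j ACCEPT   # central SNMP monitoring of FroNtier
iptables -A INPUT -p udp --dport 3401 -s 128.142.137.39  -j ACCEPT
iptables -A INPUT -p udp --dport 3401 -s 131.225.209.5   -j ACCEPT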



access to mounted PNFS required

The CMS site contacts need access to the mounted PNFS area in order to check the completeness of data sets on the SE. Also, a part of our local monitoring (cron job at /etc/cron.d/cms-se-spacecounter) requires that PNFS is mounted.
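A minimal sketch of a pre-check such a cron job could run, assuming the PNFS area is mounted under /pnfs/lcg.cscs.ch (the exact mount point is an assumption):

# sketch: abort if the (assumed) PNFS mount point is not available
if ! mountpoint -q /pnfs/lcg.cscs.ch; then
    echo "PNFS is not mounted - cms-se-spacecounter cannot run" >&2
    exit 1
fi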

Regular Maintenance work

Renew myproxy certificates (once per month)

Renew the myproxy certificates used for authentication against our local SE and the FTS service, as described below in the PhEDEx authentication/authorization section.

Copy in new phedex user certificate (once a year)

The phedex user certificate is identical to the host certificate of the machine (cmsvobox.lcg.cscs.ch). The certificate is under the control of the cfengine system, so you need to replace it there, or it will be overwritten again by cfengine (ask the local admins to do this task).
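A rough sketch of the manual replacement, assuming the standard host certificate location and that the phedex user keeps the credentials as usercert.pem/userkey.pem in ~phedex/.globus (both paths are assumptions; the authoritative copy lives in cfengine):

# sketch only: copy the renewed certificate/key into the phedex user's .globus directory
cp /etc/grid-security/hostcert.pem /home/phedex/.globus/usercert.pem
cp /etc/grid-security/hostkey.pem  /home/phedex/.globus/userkey.pem
chown phedex:phedex /home/phedex/.globus/usercert.pem /home/phedex/.globus/userkey.pem
chmod 644 /home/phedex/.globus/usercert.pem
chmod 400 /home/phedex/.globus/userkey.pem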

Emergency measures

in case of SE breakdown

You should stop all PhEDEx services. For a prolonged stop, a notification about the condition should be sent to the PhEDEx mailing list hn-cms-phedex@cern.ch, identifying our node name T2_CH_CSCS together with an estimate of the expected downtime.
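Concretely, stopping the currently active instances means running the init scripts described in the Services section below (as the phedex user):

/home/phedex/init.d/phedex_Prod stop
/home/phedex/init.d/phedex_Debug stop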

Services

Our CMS VObox runs these services

  • PhEDEx data management service
  • FroNtier data base caching service (a squid proxy)
  • Web Server for displaying monitoring information

PhEDEx

Service startup, stop and status

Note that PhEDEx is run by the phedex user! I wrote some custom init scripts which make these steps simpler than in the original.

We run multiple instances of PhEDEx. One is for Production transfers, and one or more others are for load tests and development. Currently, these instances are active:

  • Debug
  • Prod

  • startup scripts at /home/phedex/init.d/phedex_[instance] (start|status|stop). The init script will check for a valid service certificate before startup! Example:
    /home/phedex/init.d/phedex_Prod start
    /home/phedex/init.d/phedex_Debug start
  • also check status via the central web site http://cmsdoc.cern.ch/cms/aprom/phedex/prod/Components::Status?view=global
  • make sure that there is still a valid proxy available to PhEDEx (a cron-style expiry check is sketched after this list):
    voms-proxy-info -all -file /home/phedex/gridcert/proxy.cert
    If not, a CMS site admin must renew it.
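A rough cron-style sketch of such a check (the threshold and notification address are assumptions):

# sketch: warn if the PhEDEx proxy has less than 3 days of lifetime left
left=$(voms-proxy-info -file /home/phedex/gridcert/proxy.cert -timeleft 2>/dev/null)
if [ "${left:-0}" -lt 259200 ]; then
    echo "PhEDEx proxy on cmsvobox expires in ${left:-0} seconds" \
      | mail -s "PhEDEx proxy about to expire" cms-site-admins@example.org   # hypothetical address
fi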

Installation

For instructions refer to the central Twiki documentation (old information can be found in the PhEDEx CVS Repository). The new installations are all apt/rpm based. Up-to-date instructions for every update are always published on the PhEDEx CMS hypernews forums (search for "Update to phedex").

Missing RPMs for the slc5_amd64_gcc434 installation of version 3_3_1 that I had to add:

  • tk.x86_64
  • compat-libstdc++-33.x86_64
  • mesa-libGLU.x86_64

Configuration

The PhEDEx configuration can be found in ~phedex/config:

  • DBParam.CSCS: Passwords needed for accessing the central database. We receive these encrypted from cms-phedex-admins@cern.ch. The file contains one section for every PhEDEx instance (Prod, Dev, ...)
  • SITECONF/T2_CH_CSCS/PhEDEx/Config*: Configuration definitions for the PhEDEx instances (including load tests)
  • SITECONF/T2_CH_CSCS/PhEDEx/storage.xml: defines the trivial file catalog mappings
  • SITECONF/T2_CH_CSCS/PhEDEx/FileDownload*: site specific scripts called by the download agent
  • SITECONF/T2_CH_CSCS/PhEDEx/fts.map: mapping of SRM endpoints to FTS servers (q.v. CERN Twiki)
The SITECONF area is checked in to the central CMS CVS repository.

There is a symbolic link /home/phedex/PHEDEX which points to the active PhEDEx distribution, so that the configuration files need not be changed with every update (though the link needs to be reset).
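For example, after installing a new release the link would be repointed with something like (the release path shown is hypothetical):

# repoint the PHEDEX link to the freshly installed release (path is an assumption)
ln -sfn /home/phedex/sw/slc5_amd64_gcc434/cms/PHEDEX/PHEDEX_3_3_1 /home/phedex/PHEDEX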

DB access

PhEDEx relies on a central Oracle database at CERN. The passwords for accessing it are stored in DBParam.CSCS (q.v. the Configuration section above).

Authentication/Authorization

Useful Link: VOMS Proxy renewal for PhEDEx

for local SE access

The PhEDEx daemons require a valid grid proxy in /home/phedex/gridcert/proxy.cert to transfer files. This certificate is renewed through a cron job running myproxy-get-delegation and a special form of the voms-proxy-init command. The administrator needs to deposit a long-lived proxy of their own on the myproxy server, which is used to continually renew the local certificate. The phedex user fetches the credentials from the myproxy server using its own service certificate (DN: phedex/cmsvobox.lcg.cscs.ch), which is stored as a typical user certificate in the ~phedex/.globus directory.
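A rough sketch of the delegation-retrieval step such a cron job performs (the exact cron entry and the "special form" of the voms-proxy-init call are not reproduced here; compare the test sequence further below):

# sketch: refresh /home/phedex/gridcert/proxy.cert as the phedex user
unset X509_USER_PROXY        # make sure the service credentials from ~/.globus are used
voms-proxy-init              # short-lived proxy of the service certificate (plain form shown here)
myproxy-get-delegation -s myproxy.cern.ch -l cscs_phedex \
    -a /home/phedex/gridcert/proxy.cert -o /home/phedex/gridcert/proxy.cert.new \
  && mv /home/phedex/gridcert/proxy.cert.new /home/phedex/gridcert/proxy.cert \
  && chmod 600 /home/phedex/gridcert/proxy.cert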

NOTE: In order to renew a proxy certificate from a myproxy server with authentication through a different grid proxy (the service certificate), you need to have the DN of this grid proxy entered into the myproxy server's configuration, i.e. you need to contact the responsible admins for the myproxy.cern.ch server if the hostname of the cmsvobox changes! Write a mail to Helpdesk@cern.ch.

As a phedex administrator user (member of the cmsadmin unix group, "dfeich" in my case), do the following from the UI using your own account:

voms-proxy-init -voms cms
myproxyserver=myproxy.cern.ch
myproxy-init -s $myproxyserver -l cscs_phedex -x \
     -R "/DC=com/DC=quovadisglobal/DC=grid/DC=switch/DC=hosts/C=CH/ST=Zuerich/L=Zuerich/O=ETH Zuerich/CN=cmsvobox.lcg.cscs.ch" -c 720
scp /tmp/x509up_u$(id -u) phedex@cmsvobox:/home/phedex/gridcert/proxy.cert.admin
#  for testing, you can try
myproxy-info -s $myproxyserver -l cscs_phedex

As the phedex user do

cp ~/gridcert/proxy.cert.admin ~/gridcert/proxy.cert
chmod 600 ~/gridcert/proxy.cert

You should test whether the renewal of the certificate works for the phedex user:

unset X509_USER_PROXY # make sure that the service credentials from ~/.globus are used!
voms-proxy-init  # initializes the service proxy cert that is allowed to retrieve the user cert
myproxyserver=myproxy.cern.ch
myproxy-get-delegation -s $myproxyserver -v -l cscs_phedex \
          -a /home/phedex/gridcert/proxy.cert -o /tmp/gagatest

export X509_USER_PROXY=/tmp/gagatest
srm-get-metadata srm://storage01.lcg.cscs.ch:8443/srm/managerv1?SFN=/pnfs/lcg.cscs.ch/cms
rm /tmp/gagatest

Note: The certificate renewal can only be done if the renewing machine has been added to the list of authorized renewers in the configuration of the myproxy server. So it is necessary to contact the administrator of that server whenever the name of the machine doing the renewal (i.e. the vobox) or the used service certificate changes.

for FTS channels (mostly obsolete except when not using delegation mode)

NOTE: This is only required when using the download agent without delegation mode. Delegation is standard now and should always be used. Not using delegation mode involves giving two additional options to the agent:

-myproxy myproxy-fts.cern.ch -passfile /home/phedex/config/ftspass 

Since the introduction of FTS we need yet another myproxy certificate, which the FTS service uses to drive the transfers:

With your admin identity (dfeich in my case), create a myproxy certificate and specify a passphrase for retrieval. Put the passphrase into the /home/phedex/config/ftspass file on the VO-Box. Make sure it only has rw- permissions for the phedex user.

myproxy-init -d -s myproxy-fts.cern.ch -c 720
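On the VO-Box, a minimal sketch of creating the passfile with the required permissions (the passphrase string is a placeholder for the one chosen above):

# as the phedex user on the VO-Box
echo 'THE-CHOSEN-PASSPHRASE' > /home/phedex/config/ftspass   # placeholder passphrase
chmod 600 /home/phedex/config/ftspass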

Check from some other box (e.g. the VO-Box) that you can receive a delegated proxy. (Note: it is strange that I have to explicitly use my certificate subject as the username; the -d flag should automatically retrieve it from the grid proxy in use, but on the VO-Box this was necessary):

export X509_USER_PROXY=/home/phedex/gridcert/proxy.cert
mycertsubject="/DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=dfeich/CN=613756/CN=Derek Feichtinger"
myproxy-info -s myproxy-fts.cern.ch -l "$mycertsubject"
myproxy-get-delegation -v -l "$mycertsubject" -s myproxy-fts.cern.ch \
                                      -o /tmp/gagaga -S < ~/config/ftspass
rm  /tmp/gagaga

You can check whether the PhEDEx FTS backend works with this certificate by doing something like

echo 'srm://gridka-dCache.fzk.de:8443/srm/managerv1?SFN=/pnfs/gridka.de/cms/disk-only/LoadTest/LoadTest_T1_FZK_036' \
     'srm://storage01.lcg.cscs.ch:8443/srm/managerv1?SFN=/pnfs/lcg.cscs.ch/home/cms/local_tests/my-local-test.tst' \
  > copyjob.tst

ftscp -copyjobfile=copyjob.tst -passfile=/home/phedex/config/ftspass \
   -server=https://fts2-fzk.gridka.de:8443/glite-data-transfer-fts/services/FileTransfer \
   -report=/tmp/test-ftscp-report.txt

Dependencies on external services

  • central PhEDEx services
  • myproxy service (currently from CERN: myproxy.cern.ch and myproxy-fts.cern.ch) for retrieval of the grid-proxy used to operate the file transfers
  • SRM/gridFTP/rfio service from se01-lcg to store files on our SE
  • FTS channels which GridKa (FZK) is running for us

Monitoring / Testing

The PhEDEx web site contains monitoring pages and some error information; have a look there. It also exposes a CGI data service (datasvc) for specific queries; see the live examples below in this chapter.
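For example, the datasvc URLs listed below can be queried directly from the command line (this is only an illustration; the CERN CA certificates may need to be installed for the server certificate check to pass):

# fetch the group space usage for our node from the datasvc
curl -s 'https://cmsweb.cern.ch/phedex/datasvc/xml/prod/groupusage?node=T2_CH_CSCS'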

To investigate download errors, take a look at the ~/state/{PhEDEx}/incoming/download directory. Every running transfer and all failed transfers write their full logs to files there.

Space usage grouped by PhEDEx group

https://cmsweb.cern.ch/phedex/datasvc/xml/prod/groupusage?node=T2_CH_CSCS

Pending transfers ( subscriptions )

https://cmsweb.cern.ch/phedex/datasvc/xml/prod/subscriptions?node=T2_CH_CSCS

(active transfers have time_start not empty)

https://fts3-cms.cern.ch:8449/fts3/ftsmon/#/?vo=cms&source_se=&dest_se=srm:%2F%2Fstorage01.lcg.cscs.ch&time_window=24

Pending transfers stats TO CSCS

https://cmsweb.cern.ch/phedex/datasvc/xml/prod/transferqueuestats?to=T2_CH_CSCS

Pending transfers stats FROM CSCS

https://cmsweb.cern.ch/phedex/datasvc/xml/prod/transferqueuestats?from=T2_CH_CSCS

Transfers archive

https://cmsweb.cern.ch/phedex/datasvc/xml/prod/transferrequests?node=T2_CH_CSCS

Recent transfer errors

The following directory and files are not required by the PhEDEx service itself, but they are useful for checking whether there were recent transfer errors:
/home/phedex/ErrorSiteQueries
/home/phedex/ErrorSiteQueries/ErrorSiteQuery_dst_T2_CH_CSCS.sh
/home/phedex/ErrorSiteQueries/ErrorSiteQuery_src_T2_CH_CSCS.sh
where:
# cat  /home/phedex/ErrorSiteQueries/ErrorSiteQuery_dst_T2_CH_CSCS.sh
source /home/phedex/PHEDEX/etc/profile.d/env.sh && /home/phedex/PHEDEX/Utilities/ErrorSiteQuery --db /home/phedex/config/DBParam.CSCS:Prod/CSCS -m 1000 -s "-1 days"  --dst T2_CH_CSCS
and:
# cat /home/phedex/ErrorSiteQueries/ErrorSiteQuery_src_T2_CH_CSCS.sh
source /home/phedex/PHEDEX/etc/profile.d/env.sh && /home/phedex/PHEDEX/Utilities/ErrorSiteQuery --db /home/phedex/config/DBParam.CSCS:Prod/CSCS -m 1000 -s "-1 days"  --src T2_CH_CSCS
Example of an error condition:
# /home/phedex/ErrorSiteQueries/ErrorSiteQuery_src_T2_CH_CSCS.sh
2015-05-13 11:57:58: ErrorSiteQuery[25195]: (re)connecting to database
2015-05-13 11:57:58: ErrorSiteQuery[25195]: disconnected from database
Results starting from date 1431431878  Tue May 12 13:57:58 2015
Number of results: 5 (of max 1000)
**** from T2_CH_CSCS to T1_IT_CNAF_Disk:
      1   TRANSFER  Transfer canceled because the gsiftp performance marker timeout of 600 seconds has been exceeded, or all performance markers during that period indicated zero bytes transferred
**** from T2_CH_CSCS to T1_DE_KIT_Disk:
      1   TRANSFER  globus_ftp_client: the server responded with an error 451 Transfer was forcefully killed
**** from T2_CH_CSCS to T1_US_FNAL_Disk:
      3   TRANSFER  Transfer canceled because the gsiftp performance marker timeout of 360 seconds has been exceeded, or all performance markers during that period indicated zero bytes transferred

Recent transfer errors by Nagios

Instead of manually checking the file transfer errors, you can use Nagios and check_execgrep.pl:
# cat /home/phedex/ErrorSiteQueries/ErrorSiteQuery_src_T2_CH_CSCS.sh.possible.nagios.check 
command[check_PhEDEx_src_T2_CH_CSCS]=/opt/cscs/libexec/nagios-plugins/check_execgrep.pl  --contains NO --warning "Number of results: 0 " --critical "ErrorSiteQuery"  --command /home/phedex/ErrorSiteQueries/ErrorSiteQuery_src_T2_CH_CSCS.sh

# cat /home/phedex/ErrorSiteQueries/ErrorSiteQuery_dst_T2_CH_CSCS.sh.possible.nagios.check
command[check_PhEDEx_dst_T2_CH_CSCS]=/opt/cscs/libexec/nagios-plugins/check_execgrep.pl  --contains NO --warning "Number of results: 0 " --critical "ErrorSiteQuery"  --command /home/phedex/ErrorSiteQueries/ErrorSiteQuery_dst_T2_CH_CSCS.sh

# chown -R root.nagios /home/phedex/ErrorSiteQueries

Load tests for constant checks on the infrastructure

In the Debug instance of PhEDEx there is a constant small stream of load test transfers, which is used to keep a check on whether everything works correctly between the sites. All direct links are exercised in this way, i.e. for a T2 this means all the links from and to the different T1s. Each site has a specific source data set for this purpose (/pnfs/lcg.cscs.ch/cms/trivcat/store/phedex_monarctest/monarctest_CSCS-DISK1 in the case of our site). The files will be stored with random names in a part of the CMS tree at the destination site. This part of the tree should be erased regularly to prevent the storage from filling up. At CSCS we use for this purpose a cron job running on the dCache admin node (because a direct rm in the pnfs namespace is easiest there); a rough sketch follows below:
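This is only a sketch of what such a cron job could look like: the destination path of the incoming load-test files and the retention period are assumptions, and the real path has to be taken from our trivial file catalog (storage.xml).

# /etc/cron.d/cms-loadtest-cleanup (sketch): remove load test files older than 7 days
0 3 * * * root find /pnfs/lcg.cscs.ch/cms/trivcat/store/PhEDEx_LoadTest07 -type f -mtime +7 -exec rm -f {} \;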

Links to Presentations about PhEDEx

FroNtier

FroNtier is a squid HTTP proxy server used to cache queries to the central databases, thereby reducing their load.

Service startup and stop

The service is running under the dbfrontier user.

  • startup script at /etc/init.d/frontier-squid (start|status|stop); requires root access

Installation

For installation instructions refer to the CERN Frontier twiki page. The installation is located in /home/dbfrontier/frontier.

The /etc/init.d/frontier-squid service starter is installed manually from the install directory.

NOTE that the /home/dbfrontier/frontier/frontier-cache/squid/var/cache directory needs to be on a performant storage medium, i.e. not part of the VM's main disk image. So we usually have it on a separately mounted hard-disk partition.

Configuration

Configuration is done at installation time.

  • The configuration file is located at /home/dbfrontier/frontier/frontier-cache/squid/etc/squid.conf. It is built by a non-standard interactive configure in the installation step.
  • Main parameters to look out for are the NET_LOCAL, cache_mem, and cache_dir options. In the installation step a file named Makefile.conf.inc will be generated with your settings. It should hold something like
FRONTIER_DIR=/home/dbfrontier/frontier
export PORT_ROOT=/home/dbfrontier/frontier_squid-3.0rc2
export FRONTIER_USER=dbfrontier
export FRONTIER_GROUP=dbfrontier
export FRONTIER_NET_LOCAL='148.187.64.0/22'
export FRONTIER_CACHE_MEM=512
export FRONTIER_CACHE_DIR=40960
  • A cleanup/logrotate cron job is installed in /etc/cron.d/frontier.
Since the access logs can fill up the space rather fast, we usually run squid with the access logs turned off. This must be done after installation in the main config file.
access_log none 

Log files

To be found in /home/dbfrontier/frontier/frontier-cache/squid/var/logs

Testing

For a test of our installation, do

$> export http_proxy=http://${CMSVOBOX}:3128
$> cd /home/dbfrontier/test
$> ./fnget-new.py  --url=http://cmsfrontier.cern.ch:8000/Frontier/Frontier --sql="select 1 from dual"
$> ./fnget-new.py  --url=http://cmsfrontier.cern.ch:8000/Frontier/Frontier --sql="select 1 from dual"

This test is described on the installation page (see above)

Execute the last command multiple times, check the output and look into the logfile at /home/dbfrontier/frontier/frontier-cache/squid/var/logs/access.log. The first invocation should leave a line with something like this in the log:

 TCP_MISS/200 824 GET http://cmsfrontier.cern.....

for subsequent calls you should find:

 TCP_MEM_HIT  ...        ...   - NONE/- text/xml

Monitoring Frontier

There is a central monitoring page. The contact person is Barry Blumenfeld (bjb@hep.pha.jhu.edu).

The machines which are allowed to SNMP monitor the service must get defined in the configuration file like this:

acl HOST_MONITOR src 128.142.137.39/255.255.255.255

I was able to test the SNMP monitoring using the snmptest command from net-snmp-utils (the variable name was obtained from a packet capture):

$> snmptest -c public  -v 2c localhost:3401
Variable: 1.3.6.1.4.1.3495.1.3.2.1.1
Variable:
Received Get Response from 127.0.0.1
requestid 0x19ABDB87 errstat 0x0 errindex 0x0
SNMPv2-SMI::enterprises.3495.1.3.2.1.1 = Counter32: 1515491

Web Server

A standard Apache installation is used for displaying PhEDEx and PNFS monitoring information.

System backups

Tom Guptill has set up new and more reliable backup routines. Details need to be filled in here.

At least a backup of these directories is required:

Directory   Purpose
/home       phedex, frontier
/etc        frontier/squid configuration, some cron definitions
/var/www    some output from monitoring pages
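Until the new routines are documented here, a minimal sketch of such a backup (the destination host is hypothetical):

# sketch: mirror the directories from the table above to a backup host (hostname is an assumption)
rsync -aR /home /etc /var/www backuphost.lcg.cscs.ch:/backup/cmsvobox/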

-- DerekFeichtinger - 17 Jan 2008

  • cmsvobox.log: installation log (attached)

Topic attachments

  • cmsvobox.log (3.8 K, 2006-05-12, DerekFeichtinger): installation log
  • phedex-REgeland-acat2009.pdf (5450.6 K, 2009-04-23, DerekFeichtinger): PhEDEx ACAT2008 slides by Ricky Egeland