Symptoms
Summary: gridftp file read times out
Occurrences
At what times did this problem occur (used to estimate frequency):
Observations
A file cannot be read by gridftp (and therefore also by SRM). The command to read it just hangs forever, the target file on local disk remains at 0 file size.
globus-url-copy -v gsiftp://t3fs02.psi.ch:2811///pnfs/psi.ch/cms/trivcat/store/user/pnef/fastsim_files/CH_V2.7_set5_pp_ttbarw+_jet_soup_5000_1_def_scale_fastsim.root file:///tmp/derek1
Source: gsiftp://t3fs02.psi.ch:2811///pnfs/psi.ch/cms/trivcat/store/user/pnef/fastsim_files/
Dest: file:///tmp/
CH_V2.7_set5_pp_ttbarw+_jet_soup_5000_1_def_scale_fastsim.root -> derek1
(hangs forever)
Checking whether the file exists
echo /pnfs/psi.ch/cms/trivcat/store/user/pnef/fastsim_files/CH_V2.7_set5_pp_ttbarw+_jet_soup_5000_1_def_scale_fastsim.root | dc_get_ID_from_pnfsnamelist.sh |dc_get_cacheinfo_from_IDlist.sh
0002000000000000002ABBD0 t3fs02_cms
On the fileserver, the file can be listed:
Checking whether this is a problem of this particular pool
Does writing to the pool work?
Creating of a new file on that particular pool
globus-url-copy file:/tmp/derek2 gsiftp://t3fs02.psi.ch:2811//pnfs/psi.ch/cms/testing/derek20090507a
# where is the file located?
echo /pnfs/psi.ch/cms/testing/derek20090507a |dc_get_ID_from_pnfsnamelist.sh |dc_get_cacheinfo_from_IDlist.sh
0002000000000000002C8B10 t3fs04_cms_1
# we need to move it to the affected pool t3fs02_cms
echo 0002000000000000002C8B10 > IDlist
dc_ppcopy_files.sh t3fs04_cms_1 t3fs02_cms IDlist
dc_get_cacheinfo_from_IDlist.sh IDlist
0002000000000000002C8B10 t3fs02_cms,t3fs04_cms_1
# note: trying to read the file now worked, because it can get it from t3fs04_cms_1
# now we remove it from t3fs04_cms_1
dc_rep_rm_list.sh t3fs04_cms_1 IDlist
dc_get_cacheinfo_from_IDlist.sh IDlist
0002000000000000002C8B10 t3fs02_cms
Now we test whether the new file can be read from the pool
globus-url-copy -v gsiftp://t3fs02.psi.ch:2811///pnfs/psi.ch/cms/testing/derek20090507a file:///tmp/derek7
Source: gsiftp://t3fs02.psi.ch:2811///pnfs/psi.ch/cms/testing/
Dest: file:///tmp/
derek20090507a -> derek7
(hangs forever)
There is one log line in the log of the gridftp door:
07 May 2009 12:00:06 Socket OPEN (ACCEPT) remote = /193.40.150.123:34639 local = /192.33.123.42:2811
Other files on this pool all show the same behavior.
Trying to transfer a file from the second pool on this machine also ended up with a hanging transfer! So, the condition not specific to a single pool cell.
Logfiles
The only unexplained error I see in
t3fs02Domain.log
, happened two days ago. Unclear whether it is connected.
06 May 2009 19:25:52 Socket OPEN remote = t3wn05.psi.ch/192.33.123.85:47355 local = /192.33.123.42:36669
05/06 19:26:52 Cell(t3fs02_cms@t3fs02Domain) : Exception in runIO for : 0002000000000000002B3288 CacheException(rc=666;msg=Checksum error client=1:c2e0dfc3;file=1:82d23fea)
05/06 19:26:52 Cell(t3fs02_cms@t3fs02Domain) : CacheException(rc=666;msg=Checksum error client=1:c2e0dfc3;file=1:82d23fea)
05/06 19:26:52 Cell(t3fs02_cms@t3fs02Domain) : at diskCacheV111.pools.ChecksumModuleV1.setMoverChecksums(ChecksumModuleV1.java:120)
05/06 19:26:52 Cell(t3fs02_cms@t3fs02Domain) : at diskCacheV111.pools.MultiProtocolPoolV3$RepositoryIoHandler.run(MultiProtocolPoolV3.java:1705)
05/06 19:26:52 Cell(t3fs02_cms@t3fs02Domain) : at diskCacheV111.util.SimpleJobScheduler$SJob.run(SimpleJobScheduler.java:109)
05/06 19:26:52 Cell(t3fs02_cms@t3fs02Domain) : at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:885)
05/06 19:26:52 Cell(t3fs02_cms@t3fs02Domain) : at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:907)
05/06 19:26:52 Cell(t3fs02_cms@t3fs02Domain) : at java.lang.Thread.run(Thread.java:619)
05/06 19:26:52 Cell(t3fs02_cms@t3fs02Domain) : Storing incomplete file : 0002000000000000002B3288 with 1476960
05/06 19:26:52 Cell(t3fs02_cms@t3fs02Domain) : Stacked Exception (Original) for 0002000000000000002B3288 : CacheException(rc=666;msg=Checksum error client=1:c2e0dfc3;file=1:82d23fea)
05/06 19:26:52 Cell(t3fs02_cms@t3fs02Domain) : Stacked Throwable (Resulting) for 0002000000000000002B3288 : CacheException(rc=666;msg=Checksum error client=1:c2e0dfc3;file=1:82d23fea)
05/06 19:26:52 Cell(t3fs02_cms@t3fs02Domain) : CacheException(rc=666;msg=Checksum error client=1:c2e0dfc3;file=1:82d23fea)
05/06 19:26:52 Cell(t3fs02_cms@t3fs02Domain) : at diskCacheV111.pools.ChecksumModuleV1.setMoverChecksums(ChecksumModuleV1.java:120)
05/06 19:26:52 Cell(t3fs02_cms@t3fs02Domain) : at diskCacheV111.pools.MultiProtocolPoolV3$RepositoryIoHandler.run(MultiProtocolPoolV3.java:1912)
05/06 19:26:52 Cell(t3fs02_cms@t3fs02Domain) : at diskCacheV111.util.SimpleJobScheduler$SJob.run(SimpleJobScheduler.java:109)
05/06 19:26:52 Cell(t3fs02_cms@t3fs02Domain) : at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:885)
05/06 19:26:52 Cell(t3fs02_cms@t3fs02Domain) : at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:907)
05/06 19:26:52 Cell(t3fs02_cms@t3fs02Domain) : at java.lang.Thread.run(Thread.java:619)
06 May 2009 19:27:22 Socket CLOSE remote = t3wn05.psi.ch/192.33.123.85:47355 local = /192.33.123.42:36669
06 May 2009 19:27:44 remove entry for: 0002000000000000002B3288
In
/var/adm/messages.0
there are signs of a disk problem!
May 5 06:07:55 t3fs02 scsi: [ID 107833 kern.warning] WARNING: /pci@1,0/pci1022,7458@4/pci11ab,11ab@1/disk@3,0 (sd30):
May 5 06:07:55 t3fs02 Error for Command: write(10) Error Level: Retryable
May 5 06:07:55 t3fs02 scsi: [ID 107833 kern.notice] Requested Block: 326265585 Error Block: 326265585
May 5 06:07:55 t3fs02 scsi: [ID 107833 kern.notice] Vendor: ATA Serial Number:
May 5 06:07:55 t3fs02 scsi: [ID 107833 kern.notice] Sense Key: Aborted_Command
May 5 06:07:55 t3fs02 scsi: [ID 107833 kern.notice] ASC: 0x0 (no additional sense info), ASCQ: 0x0, FRU: 0x0
May 5 06:07:55 t3fs02 scsi: [ID 107833 kern.warning] WARNING: /pci@1,0/pci1022,7458@4/pci11ab,11ab@1/disk@3,0 (sd30):
May 5 06:07:55 t3fs02 Error for Command: write(10) Error Level: Retryable
May 5 06:07:55 t3fs02 scsi: [ID 107833 kern.notice] Requested Block: 326265329 Error Block: 326265329
May 5 06:07:55 t3fs02 scsi: [ID 107833 kern.notice] Vendor: ATA Serial Number:
May 5 06:07:55 t3fs02 scsi: [ID 107833 kern.notice] Sense Key: Aborted_Command
May 5 06:07:55 t3fs02 scsi: [ID 107833 kern.notice] ASC: 0x0 (no additional sense info), ASCQ: 0x0, FRU: 0x0
...
...
Apr 27 18:01:12 t3fs02 scsi: [ID 107833 kern.warning] WARNING: /pci@1,0/pci1022,7458@3/pci11ab,11ab@1/disk@1,0 (sd20):
Apr 27 18:01:12 t3fs02 Error for Command: read(10) Error Level: Retryable
Apr 27 18:01:12 t3fs02 scsi: [ID 107833 kern.notice] Requested Block: 380806823 Error Block: 380806823
Apr 27 18:01:12 t3fs02 scsi: [ID 107833 kern.notice] Vendor: ATA Serial Number:
Apr 27 18:01:12 t3fs02 scsi: [ID 107833 kern.notice] Sense Key: Aborted_Command
Apr 27 18:01:12 t3fs02 scsi: [ID 107833 kern.notice] ASC: 0x0 (no additional sense info), ASCQ: 0x0, FRU: 0x0
...
...
Apr 1 07:17:02 t3fs02 scsi: [ID 107833 kern.warning] WARNING: /pci@2,0/pci1022,7458@8/pci11ab,11ab@1/disk@4,0 (sd47):
Apr 1 07:17:02 t3fs02 Error for Command: write(10) Error Level: Retryable
Apr 1 07:17:02 t3fs02 scsi: [ID 107833 kern.notice] Requested Block: 239734606 Error Block: 239734606
Apr 1 07:17:02 t3fs02 scsi: [ID 107833 kern.notice] Vendor: ATA Serial Number:
Apr 1 07:17:02 t3fs02 scsi: [ID 107833 kern.notice] Sense Key: Aborted_Command
Apr 1 07:17:02 t3fs02 scsi: [ID 107833 kern.notice] ASC: 0x0 (no additional sense info), ASCQ: 0x0, FRU: 0x0
Apr 1 07:17:02 t3fs02 scsi: [ID 107833 kern.warning] WARNING: /pci@2,0/pci1022,7458@8/pci11ab,11ab@1/disk@4,0 (sd47):
Apr 1 07:17:02 t3fs02 Error for Command: write(10) Error Level: Retryable
Apr 1 07:17:02 t3fs02 scsi: [ID 107833 kern.notice] Requested Block: 239734862 Error Block: 239734862
Apr 1 07:17:02 t3fs02 scsi: [ID 107833 kern.notice] Vendor: ATA Serial Number:
Apr 1 07:17:02 t3fs02 scsi: [ID 107833 kern.notice] Sense Key: Aborted_Command
Apr 1 07:17:02 t3fs02 scsi: [ID 107833 kern.notice] ASC: 0x0 (no additional sense info), ASCQ: 0x0, FRU: 0x0
Solution or Workaround
Monitoring for this condition
--
DerekFeichtinger - 07 May 2009