Tags:
create new tag
view all tags

IssueDcacheRead

Symptoms

Summary: gridftp (and srmcp) file read hangs forever

Occurrences

At what times did this problem occur (used to estimate frequency):

2009-05-07

Observations

This happened on a dcache production-1-8-0-15p11(1.17) installation.

A file cannot be read by gridftp (and therefore also by SRM). The command to read it just hangs forever, the target file on local disk remains at 0 file size.

globus-url-copy -v gsiftp://t3fs02.psi.ch:2811///pnfs/psi.ch/cms/trivcat/store/user/pnef/fastsim_files/CH_V2.7_set5_pp_ttbarw+_jet_soup_5000_1_def_scale_fastsim.root file:///tmp/derek1

Source: gsiftp://t3fs02.psi.ch:2811///pnfs/psi.ch/cms/trivcat/store/user/pnef/fastsim_files/
Dest:   file:///tmp/
  CH_V2.7_set5_pp_ttbarw+_jet_soup_5000_1_def_scale_fastsim.root  ->  derek1
(hangs forever)

Checking whether the file exists

echo /pnfs/psi.ch/cms/trivcat/store/user/pnef/fastsim_files/CH_V2.7_set5_pp_ttbarw+_jet_soup_5000_1_def_scale_fastsim.root | dc_get_ID_from_pnfsnamelist.sh |dc_get_cacheinfo_from_IDlist.sh

0002000000000000002ABBD0 t3fs02_cms

On the fileserver, the file can be listed:

Checking whether this is a problem of this particular pool

Does writing to the pool work?

Creating of a new file on that particular pool

globus-url-copy file:/tmp/derek2 gsiftp://t3fs02.psi.ch:2811//pnfs/psi.ch/cms/testing/derek20090507a
# where is the file located?
echo /pnfs/psi.ch/cms/testing/derek20090507a |dc_get_ID_from_pnfsnamelist.sh |dc_get_cacheinfo_from_IDlist.sh
0002000000000000002C8B10 t3fs04_cms_1

# we need to move it to the affected pool t3fs02_cms
echo 0002000000000000002C8B10 > IDlist
dc_ppcopy_files.sh t3fs04_cms_1 t3fs02_cms IDlist
dc_get_cacheinfo_from_IDlist.sh IDlist
0002000000000000002C8B10 t3fs02_cms,t3fs04_cms_1

# note: trying to read the file now worked, because it can get it from t3fs04_cms_1

# now we remove it from t3fs04_cms_1
dc_rep_rm_list.sh t3fs04_cms_1 IDlist
dc_get_cacheinfo_from_IDlist.sh IDlist
0002000000000000002C8B10 t3fs02_cms

Now we test whether the new file can be read from the pool

globus-url-copy -v gsiftp://t3fs02.psi.ch:2811///pnfs/psi.ch/cms/testing/derek20090507a file:///tmp/derek7
Source: gsiftp://t3fs02.psi.ch:2811///pnfs/psi.ch/cms/testing/
Dest:   file:///tmp/
  derek20090507a  ->  derek7
(hangs forever)

There is one log line in the log of the gridftp door:

07 May 2009 12:00:06 Socket OPEN (ACCEPT) remote = /193.40.150.123:34639 local = /192.33.123.42:2811

Other files on this pool all show the same behavior.

Trying to transfer a file from the second pool on this machine also ended up with a hanging transfer! So, the condition not specific to a single pool cell.

I was able to trigger a p2p copy of a file to another pool. Regrettably I did not check whether a dcap read on the pool works.

Logfiles

The only unexplained error I see in t3fs02Domain.log, happened two days ago. Unclear whether it is connected.

06 May 2009 19:25:52 Socket OPEN remote = t3wn05.psi.ch/192.33.123.85:47355 local = /192.33.123.42:36669
05/06 19:26:52 Cell(t3fs02_cms@t3fs02Domain) : Exception in runIO for : 0002000000000000002B3288 CacheException(rc=666;msg=Checksum error client=1:c2e0dfc3;file=1:82d23fea)
05/06 19:26:52 Cell(t3fs02_cms@t3fs02Domain) : CacheException(rc=666;msg=Checksum error client=1:c2e0dfc3;file=1:82d23fea)
05/06 19:26:52 Cell(t3fs02_cms@t3fs02Domain) :  at diskCacheV111.pools.ChecksumModuleV1.setMoverChecksums(ChecksumModuleV1.java:120)
05/06 19:26:52 Cell(t3fs02_cms@t3fs02Domain) :  at diskCacheV111.pools.MultiProtocolPoolV3$RepositoryIoHandler.run(MultiProtocolPoolV3.java:1705)
05/06 19:26:52 Cell(t3fs02_cms@t3fs02Domain) :  at diskCacheV111.util.SimpleJobScheduler$SJob.run(SimpleJobScheduler.java:109)
05/06 19:26:52 Cell(t3fs02_cms@t3fs02Domain) :  at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:885)
05/06 19:26:52 Cell(t3fs02_cms@t3fs02Domain) :  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:907)
05/06 19:26:52 Cell(t3fs02_cms@t3fs02Domain) :  at java.lang.Thread.run(Thread.java:619)
05/06 19:26:52 Cell(t3fs02_cms@t3fs02Domain) : Storing incomplete file : 0002000000000000002B3288 with 1476960
05/06 19:26:52 Cell(t3fs02_cms@t3fs02Domain) : Stacked Exception (Original) for 0002000000000000002B3288 : CacheException(rc=666;msg=Checksum error client=1:c2e0dfc3;file=1:82d23fea)
05/06 19:26:52 Cell(t3fs02_cms@t3fs02Domain) : Stacked Throwable (Resulting) for 0002000000000000002B3288 : CacheException(rc=666;msg=Checksum error client=1:c2e0dfc3;file=1:82d23fea)
05/06 19:26:52 Cell(t3fs02_cms@t3fs02Domain) : CacheException(rc=666;msg=Checksum error client=1:c2e0dfc3;file=1:82d23fea)
05/06 19:26:52 Cell(t3fs02_cms@t3fs02Domain) :  at diskCacheV111.pools.ChecksumModuleV1.setMoverChecksums(ChecksumModuleV1.java:120)
05/06 19:26:52 Cell(t3fs02_cms@t3fs02Domain) :  at diskCacheV111.pools.MultiProtocolPoolV3$RepositoryIoHandler.run(MultiProtocolPoolV3.java:1912)
05/06 19:26:52 Cell(t3fs02_cms@t3fs02Domain) :  at diskCacheV111.util.SimpleJobScheduler$SJob.run(SimpleJobScheduler.java:109)
05/06 19:26:52 Cell(t3fs02_cms@t3fs02Domain) :  at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:885)
05/06 19:26:52 Cell(t3fs02_cms@t3fs02Domain) :  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:907)
05/06 19:26:52 Cell(t3fs02_cms@t3fs02Domain) :  at java.lang.Thread.run(Thread.java:619)
06 May 2009 19:27:22 Socket CLOSE remote = t3wn05.psi.ch/192.33.123.85:47355 local = /192.33.123.42:36669
06 May 2009 19:27:44 remove entry for: 0002000000000000002B3288

In /var/adm/messages.0 there are signs of a disk problem!

May  5 06:07:55 t3fs02 scsi: [ID 107833 kern.warning] WARNING: /pci@1,0/pci1022,7458@4/pci11ab,11ab@1/disk@3,0 (sd30):
May  5 06:07:55 t3fs02  Error for Command: write(10)               Error Level: Retryable
May  5 06:07:55 t3fs02 scsi: [ID 107833 kern.notice]    Requested Block: 326265585                 Error Block: 326265585
May  5 06:07:55 t3fs02 scsi: [ID 107833 kern.notice]    Vendor: ATA                                Serial Number:
May  5 06:07:55 t3fs02 scsi: [ID 107833 kern.notice]    Sense Key: Aborted_Command
May  5 06:07:55 t3fs02 scsi: [ID 107833 kern.notice]    ASC: 0x0 (no additional sense info), ASCQ: 0x0, FRU: 0x0
May  5 06:07:55 t3fs02 scsi: [ID 107833 kern.warning] WARNING: /pci@1,0/pci1022,7458@4/pci11ab,11ab@1/disk@3,0 (sd30):
May  5 06:07:55 t3fs02  Error for Command: write(10)               Error Level: Retryable
May  5 06:07:55 t3fs02 scsi: [ID 107833 kern.notice]    Requested Block: 326265329                 Error Block: 326265329
May  5 06:07:55 t3fs02 scsi: [ID 107833 kern.notice]    Vendor: ATA                                Serial Number:
May  5 06:07:55 t3fs02 scsi: [ID 107833 kern.notice]    Sense Key: Aborted_Command
May  5 06:07:55 t3fs02 scsi: [ID 107833 kern.notice]    ASC: 0x0 (no additional sense info), ASCQ: 0x0, FRU: 0x0
...
...
Apr 27 18:01:12 t3fs02 scsi: [ID 107833 kern.warning] WARNING: /pci@1,0/pci1022,7458@3/pci11ab,11ab@1/disk@1,0 (sd20):
Apr 27 18:01:12 t3fs02  Error for Command: read(10)                Error Level: Retryable
Apr 27 18:01:12 t3fs02 scsi: [ID 107833 kern.notice]    Requested Block: 380806823                 Error Block: 380806823
Apr 27 18:01:12 t3fs02 scsi: [ID 107833 kern.notice]    Vendor: ATA                                Serial Number:
Apr 27 18:01:12 t3fs02 scsi: [ID 107833 kern.notice]    Sense Key: Aborted_Command
Apr 27 18:01:12 t3fs02 scsi: [ID 107833 kern.notice]    ASC: 0x0 (no additional sense info), ASCQ: 0x0, FRU: 0x0
...
...
Apr  1 07:17:02 t3fs02 scsi: [ID 107833 kern.warning] WARNING: /pci@2,0/pci1022,7458@8/pci11ab,11ab@1/disk@4,0 (sd47):
Apr  1 07:17:02 t3fs02  Error for Command: write(10)               Error Level: Retryable
Apr  1 07:17:02 t3fs02 scsi: [ID 107833 kern.notice]    Requested Block: 239734606                 Error Block: 239734606
Apr  1 07:17:02 t3fs02 scsi: [ID 107833 kern.notice]    Vendor: ATA                                Serial Number:
Apr  1 07:17:02 t3fs02 scsi: [ID 107833 kern.notice]    Sense Key: Aborted_Command
Apr  1 07:17:02 t3fs02 scsi: [ID 107833 kern.notice]    ASC: 0x0 (no additional sense info), ASCQ: 0x0, FRU: 0x0
Apr  1 07:17:02 t3fs02 scsi: [ID 107833 kern.warning] WARNING: /pci@2,0/pci1022,7458@8/pci11ab,11ab@1/disk@4,0 (sd47):
Apr  1 07:17:02 t3fs02  Error for Command: write(10)               Error Level: Retryable
Apr  1 07:17:02 t3fs02 scsi: [ID 107833 kern.notice]    Requested Block: 239734862                 Error Block: 239734862
Apr  1 07:17:02 t3fs02 scsi: [ID 107833 kern.notice]    Vendor: ATA                                Serial Number:
Apr  1 07:17:02 t3fs02 scsi: [ID 107833 kern.notice]    Sense Key: Aborted_Command
Apr  1 07:17:02 t3fs02 scsi: [ID 107833 kern.notice]    ASC: 0x0 (no additional sense info), ASCQ: 0x0, FRU: 0x0

Solution or Workaround

Pools need to be restarted

Set the pools to read only:

dc_set_pools_readonly.sh cms_broken_pools.lst

Restart the pools on the affected node t3fs02 (note that a disabled pool will be enabled after restart!):

/opt/d-cache/bin/dcache stop pool
/opt/d-cache/bin/dcache start pool

Test whether a file can be read by gridftp:

globus-url-copy -v gsiftp://t3fs02.psi.ch:2811///pnfs/psi.ch/cms/testing/derek20090507a file:///tmp/derek103
Source: gsiftp://t3fs02.psi.ch:2811///pnfs/psi.ch/cms/testing/
Dest:   file:///tmp/
  derek20090507a  ->  derek103

ls -l /tmp/derek103
-rw-r--r--  1 feichtinger cms 51200 May  7 16:00 /tmp/derek103

OK

Set the pools to read/write, again.

dc_set_pools_readonly.sh -n cms_broken_pools.lst

Monitoring for this condition

-- DerekFeichtinger - 07 May 2009

IssueForm
Affected Service gsiftp
Symptom summary gridftp (and srmcp) file read hangs forever
Reason Understood no
Solution Exists workaround
Edit | Attach | Watch | Print version | History: r3 < r2 < r1 | Backlinks | Raw View | Raw edit | More topic actions
Topic revision: r3 - 2009-05-07 - DerekFeichtinger
 
This site is powered by the TWiki collaboration platform Powered by Perl This site is powered by the TWiki collaboration platformCopyright © 2008-2024 by the contributing authors. All material on this collaboration platform is the property of the contributing authors.
Ideas, requests, problems regarding TWiki? Send feedback