<!-- keep this as a security measure: #uncomment if the subject should only be modifiable by the listed groups # * Set ALLOWTOPICCHANGE = Main.TWikiAdminGroup,Main.CMSAdminGroup # * Set ALLOWTOPICRENAME = Main.TWikiAdminGroup,Main.CMSAdminGroup #uncomment this if you want the page only be viewable by the listed groups # * Set ALLOWTOPICVIEW = Main.TWikiAdminGroup,Main.CMSAdminGroup --> ---+!! %TOPIC% %TOC% ---++ Symptoms Summary: %FORMFIELD{"Symptom summary"}% ---++ Occurrences At what times did this problem occur (used to estimate frequency): | 2009-05-07 | ---++ Observations <!-- #collect here the information which may help to better understand the state of the system or services, e.g. #log excerpts, strace output, etc. #this also may help to identify the problem if similar conditions arise again --> This happened on a *dcache production-1-8-0-15p11(1.17)* installation. A file cannot be read by gridftp (and therefore also by SRM). The command to read it just hangs forever, the target file on local disk remains at 0 file size. <pre> globus-url-copy -v gsiftp://t3fs02.psi.ch:2811///pnfs/psi.ch/cms/trivcat/store/user/pnef/fastsim_files/CH_V2.7_set5_pp_ttbarw+_jet_soup_5000_1_def_scale_fastsim.root file:///tmp/derek1 Source: gsiftp://t3fs02.psi.ch:2811///pnfs/psi.ch/cms/trivcat/store/user/pnef/fastsim_files/ Dest: file:///tmp/ CH_V2.7_set5_pp_ttbarw+_jet_soup_5000_1_def_scale_fastsim.root -> derek1 (hangs forever) </pre> Checking whether the file exists <pre> echo /pnfs/psi.ch/cms/trivcat/store/user/pnef/fastsim_files/CH_V2.7_set5_pp_ttbarw+_jet_soup_5000_1_def_scale_fastsim.root | dc_get_ID_from_pnfsnamelist.sh |dc_get_cacheinfo_from_IDlist.sh 0002000000000000002ABBD0 t3fs02_cms </pre> On the fileserver, the file can be listed: ---+++ Checking whether this is a problem of this particular pool Does writing to the pool work? Creating of a new file on that particular pool <pre> globus-url-copy file:/tmp/derek2 gsiftp://t3fs02.psi.ch:2811//pnfs/psi.ch/cms/testing/derek20090507a # where is the file located? echo /pnfs/psi.ch/cms/testing/derek20090507a |dc_get_ID_from_pnfsnamelist.sh |dc_get_cacheinfo_from_IDlist.sh 0002000000000000002C8B10 t3fs04_cms_1 # we need to move it to the affected pool t3fs02_cms echo 0002000000000000002C8B10 > IDlist dc_ppcopy_files.sh t3fs04_cms_1 t3fs02_cms IDlist dc_get_cacheinfo_from_IDlist.sh IDlist 0002000000000000002C8B10 t3fs02_cms,t3fs04_cms_1 # note: trying to read the file now worked, because it can get it from t3fs04_cms_1 # now we remove it from t3fs04_cms_1 dc_rep_rm_list.sh t3fs04_cms_1 IDlist dc_get_cacheinfo_from_IDlist.sh IDlist 0002000000000000002C8B10 t3fs02_cms </pre> Now we test whether the new file can be read from the pool <pre> globus-url-copy -v gsiftp://t3fs02.psi.ch:2811///pnfs/psi.ch/cms/testing/derek20090507a file:///tmp/derek7 Source: gsiftp://t3fs02.psi.ch:2811///pnfs/psi.ch/cms/testing/ Dest: file:///tmp/ derek20090507a -> derek7 (hangs forever) </pre> There is one log line in the log of the gridftp door: <pre> 07 May 2009 12:00:06 Socket OPEN (ACCEPT) remote = /193.40.150.123:34639 local = /192.33.123.42:2811 </pre> Other files on this pool all show the same behavior. Trying to transfer a file from the second pool on this machine also ended up with a hanging transfer! So, the condition not specific to a single pool cell. I was able to trigger a p2p copy of a file to another pool. Regrettably I did not check whether a dcap read on the pool works. ---+++ Logfiles The only unexplained error I see in =t3fs02Domain.log=, happened two days ago. Unclear whether it is connected. <pre %FILESTYLE%> 06 May 2009 19:25:52 Socket OPEN remote = t3wn05.psi.ch/192.33.123.85:47355 local = /192.33.123.42:36669 05/06 19:26:52 Cell(t3fs02_cms@t3fs02Domain) : Exception in runIO for : 0002000000000000002B3288 CacheException(rc=666;msg=Checksum error client=1:c2e0dfc3;file=1:82d23fea) 05/06 19:26:52 Cell(t3fs02_cms@t3fs02Domain) : CacheException(rc=666;msg=Checksum error client=1:c2e0dfc3;file=1:82d23fea) 05/06 19:26:52 Cell(t3fs02_cms@t3fs02Domain) : at diskCacheV111.pools.ChecksumModuleV1.setMoverChecksums(ChecksumModuleV1.java:120) 05/06 19:26:52 Cell(t3fs02_cms@t3fs02Domain) : at diskCacheV111.pools.MultiProtocolPoolV3$RepositoryIoHandler.run(MultiProtocolPoolV3.java:1705) 05/06 19:26:52 Cell(t3fs02_cms@t3fs02Domain) : at diskCacheV111.util.SimpleJobScheduler$SJob.run(SimpleJobScheduler.java:109) 05/06 19:26:52 Cell(t3fs02_cms@t3fs02Domain) : at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:885) 05/06 19:26:52 Cell(t3fs02_cms@t3fs02Domain) : at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:907) 05/06 19:26:52 Cell(t3fs02_cms@t3fs02Domain) : at java.lang.Thread.run(Thread.java:619) 05/06 19:26:52 Cell(t3fs02_cms@t3fs02Domain) : Storing incomplete file : 0002000000000000002B3288 with 1476960 05/06 19:26:52 Cell(t3fs02_cms@t3fs02Domain) : Stacked Exception (Original) for 0002000000000000002B3288 : CacheException(rc=666;msg=Checksum error client=1:c2e0dfc3;file=1:82d23fea) 05/06 19:26:52 Cell(t3fs02_cms@t3fs02Domain) : Stacked Throwable (Resulting) for 0002000000000000002B3288 : CacheException(rc=666;msg=Checksum error client=1:c2e0dfc3;file=1:82d23fea) 05/06 19:26:52 Cell(t3fs02_cms@t3fs02Domain) : CacheException(rc=666;msg=Checksum error client=1:c2e0dfc3;file=1:82d23fea) 05/06 19:26:52 Cell(t3fs02_cms@t3fs02Domain) : at diskCacheV111.pools.ChecksumModuleV1.setMoverChecksums(ChecksumModuleV1.java:120) 05/06 19:26:52 Cell(t3fs02_cms@t3fs02Domain) : at diskCacheV111.pools.MultiProtocolPoolV3$RepositoryIoHandler.run(MultiProtocolPoolV3.java:1912) 05/06 19:26:52 Cell(t3fs02_cms@t3fs02Domain) : at diskCacheV111.util.SimpleJobScheduler$SJob.run(SimpleJobScheduler.java:109) 05/06 19:26:52 Cell(t3fs02_cms@t3fs02Domain) : at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:885) 05/06 19:26:52 Cell(t3fs02_cms@t3fs02Domain) : at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:907) 05/06 19:26:52 Cell(t3fs02_cms@t3fs02Domain) : at java.lang.Thread.run(Thread.java:619) 06 May 2009 19:27:22 Socket CLOSE remote = t3wn05.psi.ch/192.33.123.85:47355 local = /192.33.123.42:36669 06 May 2009 19:27:44 remove entry for: 0002000000000000002B3288 </pre> In =/var/adm/messages.0= there are signs of a disk problem! <pre %FILESTYLE%> May 5 06:07:55 t3fs02 scsi: [ID 107833 kern.warning] WARNING: /pci@1,0/pci1022,7458@4/pci11ab,11ab@1/disk@3,0 (sd30): May 5 06:07:55 t3fs02 Error for Command: write(10) Error Level: Retryable May 5 06:07:55 t3fs02 scsi: [ID 107833 kern.notice] Requested Block: 326265585 Error Block: 326265585 May 5 06:07:55 t3fs02 scsi: [ID 107833 kern.notice] Vendor: ATA Serial Number: May 5 06:07:55 t3fs02 scsi: [ID 107833 kern.notice] Sense Key: Aborted_Command May 5 06:07:55 t3fs02 scsi: [ID 107833 kern.notice] ASC: 0x0 (no additional sense info), ASCQ: 0x0, FRU: 0x0 May 5 06:07:55 t3fs02 scsi: [ID 107833 kern.warning] WARNING: /pci@1,0/pci1022,7458@4/pci11ab,11ab@1/disk@3,0 (sd30): May 5 06:07:55 t3fs02 Error for Command: write(10) Error Level: Retryable May 5 06:07:55 t3fs02 scsi: [ID 107833 kern.notice] Requested Block: 326265329 Error Block: 326265329 May 5 06:07:55 t3fs02 scsi: [ID 107833 kern.notice] Vendor: ATA Serial Number: May 5 06:07:55 t3fs02 scsi: [ID 107833 kern.notice] Sense Key: Aborted_Command May 5 06:07:55 t3fs02 scsi: [ID 107833 kern.notice] ASC: 0x0 (no additional sense info), ASCQ: 0x0, FRU: 0x0 ... ... Apr 27 18:01:12 t3fs02 scsi: [ID 107833 kern.warning] WARNING: /pci@1,0/pci1022,7458@3/pci11ab,11ab@1/disk@1,0 (sd20): Apr 27 18:01:12 t3fs02 Error for Command: read(10) Error Level: Retryable Apr 27 18:01:12 t3fs02 scsi: [ID 107833 kern.notice] Requested Block: 380806823 Error Block: 380806823 Apr 27 18:01:12 t3fs02 scsi: [ID 107833 kern.notice] Vendor: ATA Serial Number: Apr 27 18:01:12 t3fs02 scsi: [ID 107833 kern.notice] Sense Key: Aborted_Command Apr 27 18:01:12 t3fs02 scsi: [ID 107833 kern.notice] ASC: 0x0 (no additional sense info), ASCQ: 0x0, FRU: 0x0 ... ... Apr 1 07:17:02 t3fs02 scsi: [ID 107833 kern.warning] WARNING: /pci@2,0/pci1022,7458@8/pci11ab,11ab@1/disk@4,0 (sd47): Apr 1 07:17:02 t3fs02 Error for Command: write(10) Error Level: Retryable Apr 1 07:17:02 t3fs02 scsi: [ID 107833 kern.notice] Requested Block: 239734606 Error Block: 239734606 Apr 1 07:17:02 t3fs02 scsi: [ID 107833 kern.notice] Vendor: ATA Serial Number: Apr 1 07:17:02 t3fs02 scsi: [ID 107833 kern.notice] Sense Key: Aborted_Command Apr 1 07:17:02 t3fs02 scsi: [ID 107833 kern.notice] ASC: 0x0 (no additional sense info), ASCQ: 0x0, FRU: 0x0 Apr 1 07:17:02 t3fs02 scsi: [ID 107833 kern.warning] WARNING: /pci@2,0/pci1022,7458@8/pci11ab,11ab@1/disk@4,0 (sd47): Apr 1 07:17:02 t3fs02 Error for Command: write(10) Error Level: Retryable Apr 1 07:17:02 t3fs02 scsi: [ID 107833 kern.notice] Requested Block: 239734862 Error Block: 239734862 Apr 1 07:17:02 t3fs02 scsi: [ID 107833 kern.notice] Vendor: ATA Serial Number: Apr 1 07:17:02 t3fs02 scsi: [ID 107833 kern.notice] Sense Key: Aborted_Command Apr 1 07:17:02 t3fs02 scsi: [ID 107833 kern.notice] ASC: 0x0 (no additional sense info), ASCQ: 0x0, FRU: 0x0 </pre> ---++ Solution or Workaround ---+++ Pools need to be restarted Set the pools to read only: <pre> dc_set_pools_readonly.sh cms_broken_pools.lst </pre> Restart the pools on the affected node t3fs02 (note that a disabled pool will be enabled after restart!): <pre> /opt/d-cache/bin/dcache stop pool /opt/d-cache/bin/dcache start pool </pre> Test whether a file can be read by gridftp: <pre> globus-url-copy -v gsiftp://t3fs02.psi.ch:2811///pnfs/psi.ch/cms/testing/derek20090507a file:///tmp/derek103 Source: gsiftp://t3fs02.psi.ch:2811///pnfs/psi.ch/cms/testing/ Dest: file:///tmp/ derek20090507a -> derek103 ls -l /tmp/derek103 -rw-r--r-- 1 feichtinger cms 51200 May 7 16:00 /tmp/derek103 </pre> *OK* Set the pools to read/write, again. <pre> dc_set_pools_readonly.sh -n cms_broken_pools.lst </pre> ---++ Monitoring for this condition <!-- #how can this condition be recognized automatically, if at all? --> -- Main.DerekFeichtinger - 07 May 2009
IssueForm
Affected Service
gsiftp
Symptom summary
gridftp (and srmcp) file read hangs forever
Reason Understood
no
Solution Exists
workaround
This topic: CmsTier3
>
WebHome
>
AdminArea
>
IssueDcacheRead
Topic revision: r3 - 2009-05-07 - DerekFeichtinger
Copyright © 2008-2024 by the contributing authors. All material on this collaboration platform is the property of the contributing authors.
Ideas, requests, problems regarding TWiki?
Send feedback