<!-- keep this as a security measure:
#uncomment if the subject should only be modifiable by the listed groups
# * Set ALLOWTOPICCHANGE = Main.TWikiAdminGroup,Main.CMSAdminGroup
# * Set ALLOWTOPICRENAME = Main.TWikiAdminGroup,Main.CMSAdminGroup
#uncomment this if you want the page only be viewable by the listed groups
# * Set ALLOWTOPICVIEW = Main.TWikiAdminGroup,Main.CMSAdminGroup
-->

---+!! %TOPIC%
%TOC%

---++ Symptoms
Summary: %FORMFIELD{"Symptom summary"}%

---++ Occurrences
At what times did this problem occur (used to estimate frequency):
| 2009-05-07 |

---++ Observations
<!--
#collect here the information which may help to better understand the state of the system or services, e.g.
#log excerpts, strace output, etc.
#this also may help to identify the problem if similar conditions arise again
-->
This happened on a *dcache production-1-8-0-15p11(1.17)* installation.

A file cannot be read by gridftp (and therefore also by SRM). The command to read it just hangs forever, the target file on local disk remains at 0 file size.
<pre>
globus-url-copy -v gsiftp://t3fs02.psi.ch:2811///pnfs/psi.ch/cms/trivcat/store/user/pnef/fastsim_files/CH_V2.7_set5_pp_ttbarw+_jet_soup_5000_1_def_scale_fastsim.root file:///tmp/derek1
Source: gsiftp://t3fs02.psi.ch:2811///pnfs/psi.ch/cms/trivcat/store/user/pnef/fastsim_files/
Dest: file:///tmp/
CH_V2.7_set5_pp_ttbarw+_jet_soup_5000_1_def_scale_fastsim.root -> derek1
(hangs forever)
</pre>

Checking whether the file exists
<pre>
echo /pnfs/psi.ch/cms/trivcat/store/user/pnef/fastsim_files/CH_V2.7_set5_pp_ttbarw+_jet_soup_5000_1_def_scale_fastsim.root | dc_get_ID_from_pnfsnamelist.sh |dc_get_cacheinfo_from_IDlist.sh
0002000000000000002ABBD0 t3fs02_cms
</pre>

On the fileserver, the file can be listed:

---+++ Checking whether this is a problem of this particular pool
Does writing to the pool work?
Creating a new file on that particular pool:
<pre>
globus-url-copy file:/tmp/derek2 gsiftp://t3fs02.psi.ch:2811//pnfs/psi.ch/cms/testing/derek20090507a

# where is the file located?
echo /pnfs/psi.ch/cms/testing/derek20090507a |dc_get_ID_from_pnfsnamelist.sh |dc_get_cacheinfo_from_IDlist.sh
0002000000000000002C8B10 t3fs04_cms_1

# we need to move it to the affected pool t3fs02_cms
echo 0002000000000000002C8B10 > IDlist
dc_ppcopy_files.sh t3fs04_cms_1 t3fs02_cms IDlist

dc_get_cacheinfo_from_IDlist.sh IDlist
0002000000000000002C8B10 t3fs02_cms,t3fs04_cms_1

# note: trying to read the file now worked, because it can get it from t3fs04_cms_1
# now we remove it from t3fs04_cms_1
dc_rep_rm_list.sh t3fs04_cms_1 IDlist

dc_get_cacheinfo_from_IDlist.sh IDlist
0002000000000000002C8B10 t3fs02_cms
</pre>

Now we test whether the new file can be read from the pool:
<pre>
globus-url-copy -v gsiftp://t3fs02.psi.ch:2811///pnfs/psi.ch/cms/testing/derek20090507a file:///tmp/derek7
Source: gsiftp://t3fs02.psi.ch:2811///pnfs/psi.ch/cms/testing/
Dest: file:///tmp/
derek20090507a -> derek7
(hangs forever)
</pre>

There is one log line in the log of the gridftp door:
<pre>
07 May 2009 12:00:06 Socket OPEN (ACCEPT) remote = /193.40.150.123:34639 local = /192.33.123.42:2811
</pre>

Other files on this pool all show the same behavior. Trying to transfer a file from the second pool on this machine also ended up with a hanging transfer! So the condition is not specific to a single pool cell.

I was able to trigger a p2p copy of a file to another pool. Regrettably, I did not check whether a dcap read on the pool works.

---+++ Logfiles
The only unexplained error I see in =t3fs02Domain.log= happened on 06 May, the day before this incident. It is unclear whether it is connected.
<pre %FILESTYLE%>
06 May 2009 19:25:52 Socket OPEN remote = t3wn05.psi.ch/192.33.123.85:47355 local = /192.33.123.42:36669
05/06 19:26:52 Cell(t3fs02_cms@t3fs02Domain) : Exception in runIO for : 0002000000000000002B3288 CacheException(rc=666;msg=Checksum error client=1:c2e0dfc3;file=1:82d23fea)
05/06 19:26:52 Cell(t3fs02_cms@t3fs02Domain) : CacheException(rc=666;msg=Checksum error client=1:c2e0dfc3;file=1:82d23fea)
05/06 19:26:52 Cell(t3fs02_cms@t3fs02Domain) : at diskCacheV111.pools.ChecksumModuleV1.setMoverChecksums(ChecksumModuleV1.java:120)
05/06 19:26:52 Cell(t3fs02_cms@t3fs02Domain) : at diskCacheV111.pools.MultiProtocolPoolV3$RepositoryIoHandler.run(MultiProtocolPoolV3.java:1705)
05/06 19:26:52 Cell(t3fs02_cms@t3fs02Domain) : at diskCacheV111.util.SimpleJobScheduler$SJob.run(SimpleJobScheduler.java:109)
05/06 19:26:52 Cell(t3fs02_cms@t3fs02Domain) : at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:885)
05/06 19:26:52 Cell(t3fs02_cms@t3fs02Domain) : at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:907)
05/06 19:26:52 Cell(t3fs02_cms@t3fs02Domain) : at java.lang.Thread.run(Thread.java:619)
05/06 19:26:52 Cell(t3fs02_cms@t3fs02Domain) : Storing incomplete file : 0002000000000000002B3288 with 1476960
05/06 19:26:52 Cell(t3fs02_cms@t3fs02Domain) : Stacked Exception (Original) for 0002000000000000002B3288 : CacheException(rc=666;msg=Checksum error client=1:c2e0dfc3;file=1:82d23fea)
05/06 19:26:52 Cell(t3fs02_cms@t3fs02Domain) : Stacked Throwable (Resulting) for 0002000000000000002B3288 : CacheException(rc=666;msg=Checksum error client=1:c2e0dfc3;file=1:82d23fea)
05/06 19:26:52 Cell(t3fs02_cms@t3fs02Domain) : CacheException(rc=666;msg=Checksum error client=1:c2e0dfc3;file=1:82d23fea)
05/06 19:26:52 Cell(t3fs02_cms@t3fs02Domain) : at diskCacheV111.pools.ChecksumModuleV1.setMoverChecksums(ChecksumModuleV1.java:120)
05/06 19:26:52 Cell(t3fs02_cms@t3fs02Domain) : at diskCacheV111.pools.MultiProtocolPoolV3$RepositoryIoHandler.run(MultiProtocolPoolV3.java:1912)
05/06 19:26:52 Cell(t3fs02_cms@t3fs02Domain) : at diskCacheV111.util.SimpleJobScheduler$SJob.run(SimpleJobScheduler.java:109)
05/06 19:26:52 Cell(t3fs02_cms@t3fs02Domain) : at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:885)
05/06 19:26:52 Cell(t3fs02_cms@t3fs02Domain) : at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:907)
05/06 19:26:52 Cell(t3fs02_cms@t3fs02Domain) : at java.lang.Thread.run(Thread.java:619)
06 May 2009 19:27:22 Socket CLOSE remote = t3wn05.psi.ch/192.33.123.85:47355 local = /192.33.123.42:36669
06 May 2009 19:27:44 remove entry for: 0002000000000000002B3288
</pre>

In =/var/adm/messages.0= there are signs of a disk problem!
<pre %FILESTYLE%>
May 5 06:07:55 t3fs02 scsi: [ID 107833 kern.warning] WARNING: /pci@1,0/pci1022,7458@4/pci11ab,11ab@1/disk@3,0 (sd30):
May 5 06:07:55 t3fs02 Error for Command: write(10) Error Level: Retryable
May 5 06:07:55 t3fs02 scsi: [ID 107833 kern.notice] Requested Block: 326265585 Error Block: 326265585
May 5 06:07:55 t3fs02 scsi: [ID 107833 kern.notice] Vendor: ATA Serial Number:
May 5 06:07:55 t3fs02 scsi: [ID 107833 kern.notice] Sense Key: Aborted_Command
May 5 06:07:55 t3fs02 scsi: [ID 107833 kern.notice] ASC: 0x0 (no additional sense info), ASCQ: 0x0, FRU: 0x0
May 5 06:07:55 t3fs02 scsi: [ID 107833 kern.warning] WARNING: /pci@1,0/pci1022,7458@4/pci11ab,11ab@1/disk@3,0 (sd30):
May 5 06:07:55 t3fs02 Error for Command: write(10) Error Level: Retryable
May 5 06:07:55 t3fs02 scsi: [ID 107833 kern.notice] Requested Block: 326265329 Error Block: 326265329
May 5 06:07:55 t3fs02 scsi: [ID 107833 kern.notice] Vendor: ATA Serial Number:
May 5 06:07:55 t3fs02 scsi: [ID 107833 kern.notice] Sense Key: Aborted_Command
May 5 06:07:55 t3fs02 scsi: [ID 107833 kern.notice] ASC: 0x0 (no additional sense info), ASCQ: 0x0, FRU: 0x0
...
...
Apr 27 18:01:12 t3fs02 scsi: [ID 107833 kern.warning] WARNING: /pci@1,0/pci1022,7458@3/pci11ab,11ab@1/disk@1,0 (sd20):
Apr 27 18:01:12 t3fs02 Error for Command: read(10) Error Level: Retryable
Apr 27 18:01:12 t3fs02 scsi: [ID 107833 kern.notice] Requested Block: 380806823 Error Block: 380806823
Apr 27 18:01:12 t3fs02 scsi: [ID 107833 kern.notice] Vendor: ATA Serial Number:
Apr 27 18:01:12 t3fs02 scsi: [ID 107833 kern.notice] Sense Key: Aborted_Command
Apr 27 18:01:12 t3fs02 scsi: [ID 107833 kern.notice] ASC: 0x0 (no additional sense info), ASCQ: 0x0, FRU: 0x0
...
...
Apr 1 07:17:02 t3fs02 scsi: [ID 107833 kern.warning] WARNING: /pci@2,0/pci1022,7458@8/pci11ab,11ab@1/disk@4,0 (sd47):
Apr 1 07:17:02 t3fs02 Error for Command: write(10) Error Level: Retryable
Apr 1 07:17:02 t3fs02 scsi: [ID 107833 kern.notice] Requested Block: 239734606 Error Block: 239734606
Apr 1 07:17:02 t3fs02 scsi: [ID 107833 kern.notice] Vendor: ATA Serial Number:
Apr 1 07:17:02 t3fs02 scsi: [ID 107833 kern.notice] Sense Key: Aborted_Command
Apr 1 07:17:02 t3fs02 scsi: [ID 107833 kern.notice] ASC: 0x0 (no additional sense info), ASCQ: 0x0, FRU: 0x0
Apr 1 07:17:02 t3fs02 scsi: [ID 107833 kern.warning] WARNING: /pci@2,0/pci1022,7458@8/pci11ab,11ab@1/disk@4,0 (sd47):
Apr 1 07:17:02 t3fs02 Error for Command: write(10) Error Level: Retryable
Apr 1 07:17:02 t3fs02 scsi: [ID 107833 kern.notice] Requested Block: 239734862 Error Block: 239734862
Apr 1 07:17:02 t3fs02 scsi: [ID 107833 kern.notice] Vendor: ATA Serial Number:
Apr 1 07:17:02 t3fs02 scsi: [ID 107833 kern.notice] Sense Key: Aborted_Command
Apr 1 07:17:02 t3fs02 scsi: [ID 107833 kern.notice] ASC: 0x0 (no additional sense info), ASCQ: 0x0, FRU: 0x0
</pre>

---++ Solution or Workaround

---+++ Pools need to be restarted
Set the pools to read only:
<pre>
dc_set_pools_readonly.sh cms_broken_pools.lst
</pre>

Restart the pools on the affected node t3fs02 (note that a disabled pool will be enabled after restart!):
<pre>
/opt/d-cache/bin/dcache stop pool
/opt/d-cache/bin/dcache start pool
</pre>

Test whether a file can be read by gridftp:
<pre>
globus-url-copy -v gsiftp://t3fs02.psi.ch:2811///pnfs/psi.ch/cms/testing/derek20090507a file:///tmp/derek103
Source: gsiftp://t3fs02.psi.ch:2811///pnfs/psi.ch/cms/testing/
Dest: file:///tmp/
derek20090507a -> derek103

ls -l /tmp/derek103
-rw-r--r-- 1 feichtinger cms 51200 May 7 16:00 /tmp/derek103
</pre>
*OK*

Set the pools to read/write, again.
<pre>
dc_set_pools_readonly.sh -n cms_broken_pools.lst
</pre>

---++ Monitoring for this condition
<!--
#how can this condition be recognized automatically, if at all?
-->

-- Main.DerekFeichtinger - 07 May 2009
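The monitoring question above is left open. One possible approach, given that the symptom is a transfer that never returns, is a periodic read probe that runs under a time limit. The following is only a sketch under stated assumptions: it assumes GNU coreutils =timeout= is available on the probing host, and =probe_transfer= and =PROBE_TIMEOUT= are hypothetical names that are not part of dCache or of the =dc_*= helper scripts.

```shell
#!/bin/sh
# Sketch of a hang detector (assumption: GNU coreutils `timeout` is installed).
# probe_transfer and PROBE_TIMEOUT are illustrative names, not existing tools.

PROBE_TIMEOUT=${PROBE_TIMEOUT:-60}   # seconds before a transfer counts as hung

# probe_transfer CMD [ARGS...]
# Runs CMD under the time limit and prints one of OK, FAILED or HANG.
# Returns 0 for OK, 1 for FAILED, 2 for HANG.
probe_transfer() {
    timeout "$PROBE_TIMEOUT" "$@" >/dev/null 2>&1
    rc=$?
    case $rc in
        0)   echo OK;     return 0 ;;
        124) echo HANG;   return 2 ;;   # GNU timeout exits with 124 when the limit is reached
        *)   echo FAILED; return 1 ;;
    esac
}

# Example call (not executed here), reading the test file used on this page:
#   probe_transfer globus-url-copy \
#       gsiftp://t3fs02.psi.ch:2811///pnfs/psi.ch/cms/testing/derek20090507a \
#       file:///tmp/probe.$$
```

A cron job could call the probe for one test file per pool node and raise an alert on =HANG=. Separating =HANG= from =FAILED= keeps this particular condition distinguishable from ordinary transfer errors such as an expired proxy.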
| *IssueForm* ||
| Affected Service | gsiftp |
| Symptom summary | gridftp (and srmcp) file read hangs forever |
| Reason Understood | no |
| Solution Exists | workaround |
Topic revision: r3 - 2009-05-07 - DerekFeichtinger
Copyright © 2008-2024 by the contributing authors. All material on this collaboration platform is the property of the contributing authors.