Go to
previous page /
next page of Tier3 site log
21. 08. 2013 t3fs13 dCache pools found unresponsive
Today since ~19:30 the
t3fs13
were unresponsive, I noticed it because the
SRM tests constantly executed by Nagios vs each pool were in error just concerning the
t3fs13
pools;
a
show pinboard
command in one of those pool was exiting with
Request timed out after 10 seconds while sending message to [>t3fs13_cms_6@local]
;
the
t3fs13
cells responsiveness was unusually slow as here reported:
...
t3fs09_ops t3fs09-Domain-pool 0 37 21 msec 03/28 14:51:33
t3fs10_cms t3fs10-Domain-pool 0 74 18 msec 03/28 14:54:34
t3fs10_cms_1 t3fs10-Domain-pool 0 45 21 msec 03/28 14:54:3
t3fs10_ops t3fs10-Domain-pool 0 37 21 msec 03/28 14:54:39
t3fs11_cms t3fs11-Domain-pool 0 48 22 msec 08/08 08:31:20
t3fs11_cms_1 t3fs11-Domain-pool 0 43 22 msec 08/08 08:31:2
t3fs11_ops t3fs11-Domain-pool 0 37 22 msec 08/08 08:31:28
t3fs13_cms t3fs13-Domain-pool 0 92 48538 msec 08/16 16:44:14 <--------- look the ms delay
t3fs13_cms_0 t3fs13-Domain-pool 0 45 49971 msec 08/16 16:44:21
t3fs13_cms_1 t3fs13-Domain-pool 0 81 48531 msec 08/16 16:44:17
t3fs13_cms_10 t3fs13-Domain-pool 6 61 27393 msec 08/16 16:44:21
t3fs13_cms_11 t3fs13-Domain-pool 8 78 27191 msec 08/16 16:44:21
t3fs13_cms_2 t3fs13-Domain-pool 4 78 27192 msec 08/16 16:44:18
t3fs13_cms_3 t3fs13-Domain-pool 11 103 27191 msec 08/16 16:44:19
t3fs13_cms_4 t3fs13-Domain-pool 6 95 18211 msec 08/16 16:44:19
t3fs13_cms_5 t3fs13-Domain-pool 5 70 27190 msec 08/16 16:44:20
t3fs13_cms_6 t3fs13-Domain-pool 8 69 18213 msec 08/16 16:44:20
t3fs13_cms_7 t3fs13-Domain-pool 0 72 8618 msec 08/16 16:44:20
t3fs13_cms_8 t3fs13-Domain-pool 13 60 18212 msec 08/16 16:44:20
t3fs13_cms_9 t3fs13-Domain-pool 13 63 18211 msec 08/16 16:44:2
t3fs13_ops t3fs13-Domain-pool 0 44 27190 msec 08/16 16:44:21
t3fs13_ops_1 t3fs13-Domain-pool 3 44 27394 msec 08/16 16:44:21
t3fs13_ops_10 t3fs13-Domain-pool 0 44 48537 msec 08/16 16:44:23
t3fs13_ops_11 t3fs13-Domain-pool 0 44 48737 msec 08/16 16:44:23
t3fs13_ops_2 t3fs13-Domain-pool 0 44 49961 msec 08/16 16:44:22
t3fs13_ops_3 t3fs13-Domain-pool 0 44 27190 msec 08/16 16:44:22
t3fs13_ops_4 t3fs13-Domain-pool 0 44 49963 msec 08/16 16:44:22
t3fs13_ops_5 t3fs13-Domain-pool 0 44 49963 msec 08/16 16:44:22
t3fs13_ops_6 t3fs13-Domain-pool 0 44 49962 msec 08/16 16:44:22
t3fs13_ops_7 t3fs13-Domain-pool 0 44 45155 msec 08/16 16:44:23
t3fs13_ops_8 t3fs13-Domain-pool 0 44 49964 msec 08/16 16:44:
t3fs13_ops_9 t3fs13-Domain-pool 0 44 49965 msec 08/16 16:44: <---------------
t3fs14_cms t3fs14-Domain-pool 0 64 25 msec 08/16 16:52:05
t3fs14_cms_0 t3fs14-Domain-pool 0 45 24 msec 08/16 16:52:13
t3fs14_cms_1 t3fs14-Domain-pool 0 58 24 msec 08/16 16:52:08
t3fs14_cms_10 t3fs14-Domain-pool 0 57 25 msec 08/16 16:52:12 2.2.9(18131)
...
to fix I ran:
[root@t3fs13 dcache]# dcache restart t3fs13-Domain-pool
but then the problem appeared again 1h later: this time I've found this exception in the logs and all the pools went in
OFFLINE
mode:
[root@t3fs13 dcache]# grep --color java.nio.HeapByteBuffer t3fs13-Domain-pool.log -A 14
...
21 Aug 2013 22:24:35 (t3fs13_cms_3) [DCap-t3se01--1531425 PoolDeliverFile 00007629A65D9DA64BC8BEC959BB96C19D3E] Problem in command block : RequestBlock [Size=24 Code=11 Buffer=java.nio.HeapByteBuffer[pos=24 lim=24 cap=16384]
java.lang.IllegalArgumentException: null
at java.nio.Buffer.limit(Buffer.java:247) ~[na:1.6.0_37]
at org.dcache.pool.movers.DCapProtocol_3_nio.doTheRead(DCapProtocol_3_nio.java:1083) ~[dcache-dcap-2.2.9.jar:2.2.9]
at org.dcache.pool.movers.DCapProtocol_3_nio.runIO(DCapProtocol_3_nio.java:606) ~[dcache-dcap-2.2.9.jar:2.2.9]
at org.dcache.pool.classic.PoolIOReadTransfer.transfer(PoolIOReadTransfer.java:58) [dcache-core-2.2.9.jar:2.2.9]
at org.dcache.pool.classic.LegacyMoverExecutorService$MoverTask.run(LegacyMoverExecutorService.java:66) [dcache-core-2.2.9.jar:2.2.9]
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886) [na:1.6.0_37]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908) [na:1.6.0_37]
at java.lang.Thread.run(Thread.java:662) [na:1.6.0_37]
21 Aug 2013 22:24:35 (t3fs13_cms_3) [DCap-t3se01--1531425 PoolDeliverFile 00007629A65D9DA64BC8BEC959BB96C19D3E] (Transfer finished : 0 bytes in 1 seconds)
21 Aug 2013 22:24:37 (t3fs13_cms_3) [DCap-t3se01--1531425 PoolDeliverFile 00007629A65D9DA64BC8BEC959BB96C19D3E] DCapProtocol_3 (nio) created $Id: CMSTier3Log43.txt,v 1.2 2013/08/21 22:07:34 fabiom Exp $
21 Aug 2013 22:24:37 (t3fs13_cms_6) [DCap-t3se01--1531462 PoolDeliverFile 00008AFD66ACE04949189937FF8632D4A229] DCapProtocol_3 (nio) created $Id: CMSTier3Log43.txt,v 1.2 2013/08/21 22:07:34 fabiom Exp $
21 Aug 2013 22:24:37 (t3fs13_cms_3) [DCap-t3se01--1531425 PoolDeliverFile 00007629A65D9DA64BC8BEC959BB96C19D3E] Setup : Defaults Buffer Sizes : send=262144;recv=262144;io=262144
21 Aug 2013 22:24:37 (t3fs13_cms_3) [DCap-t3se01--1531425 PoolDeliverFile 00007629A65D9DA64BC8BEC959BB96C19D3E] Setup : Max Buffer Sizes : send=1048576;recv=1048576;io=1048576
--
21 Aug 2013 22:24:37 (t3fs13_cms_3) [DCap-t3se01--1531425 PoolDeliverFile 00007629A65D9DA64BC8BEC959BB96C19D3E] Problem in command block : RequestBlock [Size=24 Code=11 Buffer=java.nio.HeapByteBuffer[pos=24 lim=24 cap=16384]
java.lang.IllegalArgumentException: null
at java.nio.Buffer.limit(Buffer.java:247) ~[na:1.6.0_37]
at org.dcache.pool.movers.DCapProtocol_3_nio.doTheRead(DCapProtocol_3_nio.java:1083) ~[dcache-dcap-2.2.9.jar:2.2.9]
at org.dcache.pool.movers.DCapProtocol_3_nio.runIO(DCapProtocol_3_nio.java:606) ~[dcache-dcap-2.2.9.jar:2.2.9]
at org.dcache.pool.classic.PoolIOReadTransfer.transfer(PoolIOReadTransfer.java:58) [dcache-core-2.2.9.jar:2.2.9]
at org.dcache.pool.classic.LegacyMoverExecutorService$MoverTask.run(LegacyMoverExecutorService.java:66) [dcache-core-2.2.9.jar:2.2.9]
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886) [na:1.6.0_37]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908) [na:1.6.0_37]
at java.lang.Thread.run(Thread.java:662) [na:1.6.0_37]
21 Aug 2013 22:24:39 (t3fs13_cms_3) [DCap-t3se01--1531425 PoolDeliverFile 00007629A65D9DA64BC8BEC959BB96C19D3E] (Transfer finished : 0 bytes in 1 seconds)
21 Aug 2013 22:24:41 (t3fs13_cms_3) [DCap-t3se01--1531425 PoolDeliverFile 00007629A65D9DA64BC8BEC959BB96C19D3E] DCapProtocol_3 (nio) created $Id: CMSTier3Log43.txt,v 1.2 2013/08/21 22:07:34 fabiom Exp $
21 Aug 2013 22:24:41 (t3fs13_cms_3) [DCap-t3se01--1531425 PoolDeliverFile 00007629A65D9DA64BC8BEC959BB96C19D3E] Setup : Defaults Buffer Sizes : send=262144;recv=262144;io=262144
21 Aug 2013 22:24:41 (t3fs13_cms_3) [DCap-t3se01--1531425 PoolDeliverFile 00007629A65D9DA64BC8BEC959BB96C19D3E] Setup : Max Buffer Sizes : send=1048576;recv=1048576;io=1048576
21 Aug 2013 22:24:41 (t3fs13_cms_3) [DCap-t3se01--1531425 PoolDeliverFile 00007629A65D9DA64BC8BEC959BB96C19D3E] ioError = -1
--
21 Aug 2013 22:24:41 (t3fs13_cms_3) [DCap-t3se01--1531425 PoolDeliverFile 00007629A65D9DA64BC8BEC959BB96C19D3E] Problem in command block : RequestBlock [Size=24 Code=11 Buffer=java.nio.HeapByteBuffer[pos=24 lim=24 cap=16384]
java.lang.IllegalArgumentException: null
at java.nio.Buffer.limit(Buffer.java:247) ~[na:1.6.0_37]
at org.dcache.pool.movers.DCapProtocol_3_nio.doTheRead(DCapProtocol_3_nio.java:1083) ~[dcache-dcap-2.2.9.jar:2.2.9]
at org.dcache.pool.movers.DCapProtocol_3_nio.runIO(DCapProtocol_3_nio.java:606) ~[dcache-dcap-2.2.9.jar:2.2.9]
at org.dcache.pool.classic.PoolIOReadTransfer.transfer(PoolIOReadTransfer.java:58) [dcache-core-2.2.9.jar:2.2.9]
at org.dcache.pool.classic.LegacyMoverExecutorService$MoverTask.run(LegacyMoverExecutorService.java:66) [dcache-core-2.2.9.jar:2.2.9]
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886) [na:1.6.0_37]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908) [na:1.6.0_37]
at java.lang.Thread.run(Thread.java:662) [na:1.6.0_37]
[root@t3fs13 dcache]# elinks
http://t3se01:2288
...
t3fs11_ops t3fs11-Domain-pool 0 37 19 msec 08/08 08:31:28 2.2.9(18131)
t3fs13_cms t3fs13-Domain-pool OFFLINE
t3fs13_cms_0 t3fs13-Domain-pool 0 45 19373 msec 08/21 22:15:27 2.2.9(18131)
t3fs13_cms_1 t3fs13-Domain-pool OFFLINE
t3fs13_cms_10 t3fs13-Domain-pool OFFLINE
t3fs13_cms_11 t3fs13-Domain-pool OFFLINE
t3fs13_cms_2 t3fs13-Domain-pool 0 51 19372 msec 08/21 22:15:23 2.2.9(18131)
t3fs13_cms_3 t3fs13-Domain-pool OFFLINE
t3fs13_cms_4 t3fs13-Domain-pool OFFLINE
t3fs13_cms_5 t3fs13-Domain-pool OFFLINE
t3fs13_cms_6 t3fs13-Domain-pool OFFLINE
t3fs13_cms_7 t3fs13-Domain-pool OFFLINE
t3fs13_cms_8 t3fs13-Domain-pool OFFLINE
t3fs13_cms_9 t3fs13-Domain-pool OFFLINE
t3fs13_ops t3fs13-Domain-pool 0 44 19373 msec 08/21 22:15:28 2.2.9(18131)
t3fs13_ops_1 t3fs13-Domain-pool 0 44 19373 msec 08/21 22:15:28 2.2.9(18131)
t3fs13_ops_10 t3fs13-Domain-pool 0 44 19372 msec 08/21 22:15:34 2.2.9(18131)
t3fs13_ops_11 t3fs13-Domain-pool 0 44 19373 msec 08/21 22:15:36 2.2.9(18131)
t3fs13_ops_2 t3fs13-Domain-pool 0 44 19373 msec 08/21 22:15:29 2.2.9(18131)
t3fs13_ops_3 t3fs13-Domain-pool 0 44 18567 msec 08/21 22:15:29 2.2.9(18131)
t3fs13_ops_4 t3fs13-Domain-pool 0 44 18568 msec 08/21 22:15:29 2.2.9(18131)
t3fs13_ops_5 t3fs13-Domain-pool 0 44 25278 msec 08/21 22:15:29 2.2.9(18131)
t3fs13_ops_6 t3fs13-Domain-pool 0 44 18568 msec 08/21 22:15:30 2.2.9(18131)
t3fs13_ops_7 t3fs13-Domain-pool 0 44 18568 msec 08/21 22:15:30 2.2.9(18131)
t3fs13_ops_8 t3fs13-Domain-pool 0 44 18568 msec 08/21 22:15:30 2.2.9(18131)
t3fs13_ops_9 t3fs13-Domain-pool 0 44 18568 msec 08/21 22:15:31 2.2.9(18131)
t3fs14_cms t3fs14-Domain-pool 0 65 24 msec 08/16 16:52:05 2.2.9(18131)
...
[root@t3fs13 dcache]# dcache status
DOMAIN STATUS PID USER
t3fs13-Domain-pool running 946 dcache
t3fs13-Domain-dcap running 5267 dcache
t3fs13-Domain-gridftp running 5339 dcache
t3fs13-Domain-gsidcap running 5406 dcache
A heap dump file, /var/log/dcache/t3fs13-Domain-pool-oom.hprof, was found
for domain t3fs13-Domain-pool. The file was generated at 2013-08-21
22:14:23.615147040 +0200 as a result of an out of memory failure in the
domain.
The dump contains debugging information that may help a developer
determine the cause of high memory usage. Please note that the dump may
contain confidential information.
As long as the file exists no other dumps will be generated on out of
memory failures. Please move or delete
/var/log/dcache/t3fs13-Domain-pool-oom.hprof. Consider increasing the
dcache.java.memory.heap property. The current value is 1024m.
I doubled
dcache.java.memory.heap
in
/etc/dcache/dcache.conf
and completely restarted dcache on
t3fs13
Go to
previous page /
next page of Tier3 site log