Tags:
create new tag
view all tags

Arrow left Go to previous page / next page of Tier3 site log MOVED TO...

21. 08. 2013 t3fs13 dCache pools found unresponsive

Today since ~19:30 the t3fs13 were unresponsive, I noticed it because the SRM tests constantly executed by Nagios vs each pool were in error just concerning the t3fs13 pools;

a show pinboard command in one of those pool was exiting with Request timed out after 10 seconds while sending message to [>t3fs13_cms_6@local] ;

the t3fs13 cells responsiveness was unusually slow as here reported:

   ...
   t3fs09_ops      t3fs09-Domain-pool            0  37  21 msec    03/28 14:51:33 
   t3fs10_cms      t3fs10-Domain-pool            0  74  18 msec    03/28 14:54:34
   t3fs10_cms_1    t3fs10-Domain-pool            0  45  21 msec    03/28 14:54:3
   t3fs10_ops      t3fs10-Domain-pool            0  37  21 msec    03/28 14:54:39 
   t3fs11_cms      t3fs11-Domain-pool            0  48  22 msec    08/08 08:31:20 
   t3fs11_cms_1    t3fs11-Domain-pool            0  43  22 msec    08/08 08:31:2
   t3fs11_ops      t3fs11-Domain-pool            0  37  22 msec    08/08 08:31:28 
   t3fs13_cms      t3fs13-Domain-pool            0  92  48538 msec 08/16 16:44:14  <--------- look the ms delay 
   t3fs13_cms_0    t3fs13-Domain-pool            0  45  49971 msec 08/16 16:44:21 
   t3fs13_cms_1    t3fs13-Domain-pool            0  81  48531 msec 08/16 16:44:17 
   t3fs13_cms_10   t3fs13-Domain-pool            6  61  27393 msec 08/16 16:44:21 
   t3fs13_cms_11   t3fs13-Domain-pool            8  78  27191 msec 08/16 16:44:21
   t3fs13_cms_2    t3fs13-Domain-pool            4  78  27192 msec 08/16 16:44:18 
   t3fs13_cms_3    t3fs13-Domain-pool            11 103 27191 msec 08/16 16:44:19                                                                                                             
   t3fs13_cms_4    t3fs13-Domain-pool            6  95  18211 msec 08/16 16:44:19 
   t3fs13_cms_5    t3fs13-Domain-pool            5  70  27190 msec 08/16 16:44:20 
   t3fs13_cms_6    t3fs13-Domain-pool            8  69  18213 msec 08/16 16:44:20 
   t3fs13_cms_7    t3fs13-Domain-pool            0  72  8618 msec  08/16 16:44:20 
   t3fs13_cms_8    t3fs13-Domain-pool            13 60  18212 msec 08/16 16:44:20                  
   t3fs13_cms_9    t3fs13-Domain-pool            13 63  18211 msec 08/16 16:44:2
   t3fs13_ops      t3fs13-Domain-pool            0  44  27190 msec 08/16 16:44:21   
   t3fs13_ops_1    t3fs13-Domain-pool            3  44  27394 msec 08/16 16:44:21 
   t3fs13_ops_10   t3fs13-Domain-pool            0  44  48537 msec 08/16 16:44:23 
   t3fs13_ops_11   t3fs13-Domain-pool            0  44  48737 msec 08/16 16:44:23      
   t3fs13_ops_2    t3fs13-Domain-pool            0  44  49961 msec 08/16 16:44:22 
   t3fs13_ops_3    t3fs13-Domain-pool            0  44  27190 msec 08/16 16:44:22 
   t3fs13_ops_4    t3fs13-Domain-pool            0  44  49963 msec 08/16 16:44:22                                                                                                            
   t3fs13_ops_5    t3fs13-Domain-pool            0  44  49963 msec 08/16 16:44:22 
   t3fs13_ops_6    t3fs13-Domain-pool            0  44  49962 msec 08/16 16:44:22 
   t3fs13_ops_7    t3fs13-Domain-pool            0  44  45155 msec 08/16 16:44:23                                                                                                       
   t3fs13_ops_8    t3fs13-Domain-pool            0  44  49964 msec 08/16 16:44:
   t3fs13_ops_9    t3fs13-Domain-pool            0  44  49965 msec 08/16 16:44: <---------------
   t3fs14_cms      t3fs14-Domain-pool            0  64  25 msec    08/16 16:52:05 
   t3fs14_cms_0    t3fs14-Domain-pool            0  45  24 msec    08/16 16:52:13 
   t3fs14_cms_1    t3fs14-Domain-pool            0  58  24 msec    08/16 16:52:08 
   t3fs14_cms_10   t3fs14-Domain-pool            0  57  25 msec    08/16 16:52:12 2.2.9(18131)         
   ...

to fix I ran: [root@t3fs13 dcache]# dcache restart t3fs13-Domain-pool
but then the problem appeared again 1h later: this time I've found this exception in the logs and all the pools went in OFFLINE mode:

[root@t3fs13 dcache]# grep --color java.nio.HeapByteBuffer t3fs13-Domain-pool.log -A 14
...
 21 Aug 2013 22:24:35 (t3fs13_cms_3) [DCap-t3se01--1531425 PoolDeliverFile 00007629A65D9DA64BC8BEC959BB96C19D3E] Problem in command block : RequestBlock [Size=24 Code=11 Buffer=java.nio.HeapByteBuffer[pos=24 lim=24 cap=16384]
java.lang.IllegalArgumentException: null
	at java.nio.Buffer.limit(Buffer.java:247) ~[na:1.6.0_37]
	at org.dcache.pool.movers.DCapProtocol_3_nio.doTheRead(DCapProtocol_3_nio.java:1083) ~[dcache-dcap-2.2.9.jar:2.2.9]
	at org.dcache.pool.movers.DCapProtocol_3_nio.runIO(DCapProtocol_3_nio.java:606) ~[dcache-dcap-2.2.9.jar:2.2.9]
	at org.dcache.pool.classic.PoolIOReadTransfer.transfer(PoolIOReadTransfer.java:58) [dcache-core-2.2.9.jar:2.2.9]
	at org.dcache.pool.classic.LegacyMoverExecutorService$MoverTask.run(LegacyMoverExecutorService.java:66) [dcache-core-2.2.9.jar:2.2.9]
	at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886) [na:1.6.0_37]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908) [na:1.6.0_37]
	at java.lang.Thread.run(Thread.java:662) [na:1.6.0_37]
21 Aug 2013 22:24:35 (t3fs13_cms_3) [DCap-t3se01--1531425 PoolDeliverFile 00007629A65D9DA64BC8BEC959BB96C19D3E] (Transfer finished : 0 bytes in 1 seconds) 
21 Aug 2013 22:24:37 (t3fs13_cms_3) [DCap-t3se01--1531425 PoolDeliverFile 00007629A65D9DA64BC8BEC959BB96C19D3E] DCapProtocol_3 (nio) created $Id: CMSTier3Log43.txt,v 1.2 2013/08/21 22:07:34 fabiom Exp $
21 Aug 2013 22:24:37 (t3fs13_cms_6) [DCap-t3se01--1531462 PoolDeliverFile 00008AFD66ACE04949189937FF8632D4A229] DCapProtocol_3 (nio) created $Id: CMSTier3Log43.txt,v 1.2 2013/08/21 22:07:34 fabiom Exp $
21 Aug 2013 22:24:37 (t3fs13_cms_3) [DCap-t3se01--1531425 PoolDeliverFile 00007629A65D9DA64BC8BEC959BB96C19D3E] Setup : Defaults Buffer Sizes  : send=262144;recv=262144;io=262144
21 Aug 2013 22:24:37 (t3fs13_cms_3) [DCap-t3se01--1531425 PoolDeliverFile 00007629A65D9DA64BC8BEC959BB96C19D3E] Setup : Max Buffer Sizes       : send=1048576;recv=1048576;io=1048576
--
21 Aug 2013 22:24:37 (t3fs13_cms_3) [DCap-t3se01--1531425 PoolDeliverFile 00007629A65D9DA64BC8BEC959BB96C19D3E] Problem in command block : RequestBlock [Size=24 Code=11 Buffer=java.nio.HeapByteBuffer[pos=24 lim=24 cap=16384]
java.lang.IllegalArgumentException: null
	at java.nio.Buffer.limit(Buffer.java:247) ~[na:1.6.0_37]
	at org.dcache.pool.movers.DCapProtocol_3_nio.doTheRead(DCapProtocol_3_nio.java:1083) ~[dcache-dcap-2.2.9.jar:2.2.9]
	at org.dcache.pool.movers.DCapProtocol_3_nio.runIO(DCapProtocol_3_nio.java:606) ~[dcache-dcap-2.2.9.jar:2.2.9]
	at org.dcache.pool.classic.PoolIOReadTransfer.transfer(PoolIOReadTransfer.java:58) [dcache-core-2.2.9.jar:2.2.9]
	at org.dcache.pool.classic.LegacyMoverExecutorService$MoverTask.run(LegacyMoverExecutorService.java:66) [dcache-core-2.2.9.jar:2.2.9]
	at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886) [na:1.6.0_37]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908) [na:1.6.0_37]
	at java.lang.Thread.run(Thread.java:662) [na:1.6.0_37]
21 Aug 2013 22:24:39 (t3fs13_cms_3) [DCap-t3se01--1531425 PoolDeliverFile 00007629A65D9DA64BC8BEC959BB96C19D3E] (Transfer finished : 0 bytes in 1 seconds) 
21 Aug 2013 22:24:41 (t3fs13_cms_3) [DCap-t3se01--1531425 PoolDeliverFile 00007629A65D9DA64BC8BEC959BB96C19D3E] DCapProtocol_3 (nio) created $Id: CMSTier3Log43.txt,v 1.2 2013/08/21 22:07:34 fabiom Exp $
21 Aug 2013 22:24:41 (t3fs13_cms_3) [DCap-t3se01--1531425 PoolDeliverFile 00007629A65D9DA64BC8BEC959BB96C19D3E] Setup : Defaults Buffer Sizes  : send=262144;recv=262144;io=262144
21 Aug 2013 22:24:41 (t3fs13_cms_3) [DCap-t3se01--1531425 PoolDeliverFile 00007629A65D9DA64BC8BEC959BB96C19D3E] Setup : Max Buffer Sizes       : send=1048576;recv=1048576;io=1048576
21 Aug 2013 22:24:41 (t3fs13_cms_3) [DCap-t3se01--1531425 PoolDeliverFile 00007629A65D9DA64BC8BEC959BB96C19D3E] ioError = -1
--
21 Aug 2013 22:24:41 (t3fs13_cms_3) [DCap-t3se01--1531425 PoolDeliverFile 00007629A65D9DA64BC8BEC959BB96C19D3E] Problem in command block : RequestBlock [Size=24 Code=11 Buffer=java.nio.HeapByteBuffer[pos=24 lim=24 cap=16384]
java.lang.IllegalArgumentException: null
	at java.nio.Buffer.limit(Buffer.java:247) ~[na:1.6.0_37]
	at org.dcache.pool.movers.DCapProtocol_3_nio.doTheRead(DCapProtocol_3_nio.java:1083) ~[dcache-dcap-2.2.9.jar:2.2.9]
	at org.dcache.pool.movers.DCapProtocol_3_nio.runIO(DCapProtocol_3_nio.java:606) ~[dcache-dcap-2.2.9.jar:2.2.9]
	at org.dcache.pool.classic.PoolIOReadTransfer.transfer(PoolIOReadTransfer.java:58) [dcache-core-2.2.9.jar:2.2.9]
	at org.dcache.pool.classic.LegacyMoverExecutorService$MoverTask.run(LegacyMoverExecutorService.java:66) [dcache-core-2.2.9.jar:2.2.9]
	at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886) [na:1.6.0_37]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908) [na:1.6.0_37]
	at java.lang.Thread.run(Thread.java:662) [na:1.6.0_37] 
[root@t3fs13 dcache]# elinks http://t3se01:2288
...
  t3fs11_ops      t3fs11-Domain-pool            0  37  19 msec    08/08 08:31:28 2.2.9(18131)                                                                                                                                               
   t3fs13_cms      t3fs13-Domain-pool            OFFLINE                                                                                                                                                                                     
   t3fs13_cms_0    t3fs13-Domain-pool            0  45  19373 msec 08/21 22:15:27 2.2.9(18131)                                                                                                                                               
   t3fs13_cms_1    t3fs13-Domain-pool            OFFLINE                                                                                                                                                                                     
   t3fs13_cms_10   t3fs13-Domain-pool            OFFLINE                                                                                                                                                                                     
   t3fs13_cms_11   t3fs13-Domain-pool            OFFLINE                                                                                                                                                                                     
   t3fs13_cms_2    t3fs13-Domain-pool            0  51  19372 msec 08/21 22:15:23 2.2.9(18131)                                                                                                                                               
   t3fs13_cms_3    t3fs13-Domain-pool            OFFLINE                                                                                                                                                                                     
   t3fs13_cms_4    t3fs13-Domain-pool            OFFLINE                                                                                                                                                                                     
   t3fs13_cms_5    t3fs13-Domain-pool            OFFLINE                                                                                                                                                                                     
   t3fs13_cms_6    t3fs13-Domain-pool            OFFLINE                                                                                                                                                                                     
   t3fs13_cms_7    t3fs13-Domain-pool            OFFLINE                                                                                                                                                                                     
   t3fs13_cms_8    t3fs13-Domain-pool            OFFLINE                                                                                                                                                                                     
   t3fs13_cms_9    t3fs13-Domain-pool            OFFLINE                                                                                                                                                                                     
   t3fs13_ops      t3fs13-Domain-pool            0  44  19373 msec 08/21 22:15:28 2.2.9(18131)                                                                                                                                               
   t3fs13_ops_1    t3fs13-Domain-pool            0  44  19373 msec 08/21 22:15:28 2.2.9(18131)                                                                                                                                               
   t3fs13_ops_10   t3fs13-Domain-pool            0  44  19372 msec 08/21 22:15:34 2.2.9(18131)                                                                                                                                               
   t3fs13_ops_11   t3fs13-Domain-pool            0  44  19373 msec 08/21 22:15:36 2.2.9(18131)                                                                                                                                               
   t3fs13_ops_2    t3fs13-Domain-pool            0  44  19373 msec 08/21 22:15:29 2.2.9(18131)                                                                                                                                               
   t3fs13_ops_3    t3fs13-Domain-pool            0  44  18567 msec 08/21 22:15:29 2.2.9(18131)                                                                                                                                               
   t3fs13_ops_4    t3fs13-Domain-pool            0  44  18568 msec 08/21 22:15:29 2.2.9(18131)                                                                                                                                               
   t3fs13_ops_5    t3fs13-Domain-pool            0  44  25278 msec 08/21 22:15:29 2.2.9(18131)                                                                                                                                               
   t3fs13_ops_6    t3fs13-Domain-pool            0  44  18568 msec 08/21 22:15:30 2.2.9(18131)                                                                                                                                               
   t3fs13_ops_7    t3fs13-Domain-pool            0  44  18568 msec 08/21 22:15:30 2.2.9(18131)                                                                                                                                               
   t3fs13_ops_8    t3fs13-Domain-pool            0  44  18568 msec 08/21 22:15:30 2.2.9(18131)                                                                                                                                               
   t3fs13_ops_9    t3fs13-Domain-pool            0  44  18568 msec 08/21 22:15:31 2.2.9(18131)                                                                                                                                               
   t3fs14_cms      t3fs14-Domain-pool            0  65  24 msec    08/16 16:52:05 2.2.9(18131)               
...
[root@t3fs13 dcache]# dcache status
DOMAIN                STATUS  PID  USER   
t3fs13-Domain-pool    running 946  dcache 
t3fs13-Domain-dcap    running 5267 dcache 
t3fs13-Domain-gridftp running 5339 dcache 
t3fs13-Domain-gsidcap running 5406 dcache 
A heap dump file, /var/log/dcache/t3fs13-Domain-pool-oom.hprof, was found
for domain t3fs13-Domain-pool. The file was generated at 2013-08-21
22:14:23.615147040 +0200 as a result of an out of memory failure in the
domain.

The dump contains debugging information that may help a developer
determine the cause of high memory usage. Please note that the dump may
contain confidential information.

As long as the file exists no other dumps will be generated on out of
memory failures. Please move or delete
/var/log/dcache/t3fs13-Domain-pool-oom.hprof. Consider increasing the
dcache.java.memory.heap property. The current value is 1024m.
I doubled dcache.java.memory.heap in /etc/dcache/dcache.conf and completely restarted dcache on t3fs13

Arrow left Go to previous page / next page of Tier3 site log MOVED TO...

Edit | Attach | Watch | Print version | History: r2 < r1 | Backlinks | Raw View | Raw edit | More topic actions
Topic revision: r2 - 2013-08-21 - FabioMartinelli
 
This site is powered by the TWiki collaboration platform Powered by Perl This site is powered by the TWiki collaboration platformCopyright © 2008-2024 by the contributing authors. All material on this collaboration platform is the property of the contributing authors.
Ideas, requests, problems regarding TWiki? Send feedback