Tags:
create new tag
view all tags

IssueDcachePoolHangs

Symptoms

Summary: connections timeout. Pool shows 0 size. Pool cannot be shut down even with kill -9. OS update required

Occurrences

At what times did this problem occur (used to estimate frequency):

2009-12-09
2009-12-10

Observations

Observations Case 1

Excerpt from ps
    root  3302     1   0   Nov 19 ?           0:00 /bin/sh /opt/d-cache/jobs/pool -pool=t3fs05 start
    root  3304  3302   0   Nov 19 ?         454:56 /usr/bin/java -server -Xmx512m -XX:MaxDirectMemorySize=512m -Dsun.net.inetaddr.

Open files

root@t3fs05 # lsof -p 3304
COMMAND  PID USER   FD   TYPE  DEVICE SIZE/OFF     NODE NAME
java    3304 root  cwd   VDIR   85,10      512   271560 /root/t3solaris-tools
java    3304 root  txt   VREG   85,10    64476    27345 /usr/jdk/instances/jdk1.5.0/bin/java
java    3304 root  txt   VREG   85,10     9552    97630 /lib/libsendfile.so.1
java    3304 root  txt   VREG   85,10    46428    97581 /lib/libaio.so.1
java    3304 root  txt   VREG   85,10    46356    97624 /lib/librt.so.1
java    3304 root  txt   VREG   85,10    23592    29242 /usr/jdk/instances/jdk1.5.0/jre/lib/i386/libnio.so
java    3304 root  txt   VREG   85,10   314380   117476 /usr/lib/security/pkcs11_softtoken.so.1
java    3304 root  txt   VREG   85,10    41264   115718 /usr/lib/libcryptoutil.so.1
java    3304 root  txt   VREG   85,10   131396   110313 /usr/lib/libpkcs11.so.1
java    3304 root  txt   VREG   85,10    47532    29227 /usr/jdk/instances/jdk1.5.0/jre/lib/i386/libj2pkcs11.so
java    3304 root  txt   VREG   85,10  2461878   271819 /opt/d-cache/classes/toplink/toplink-essentials.jar
java    3304 root  txt   VREG   85,10    55724    29241 /usr/jdk/instances/jdk1.5.0/jre/lib/i386/libnet.so
java    3304 root  txt   VREG   85,10   163580   271820 /opt/d-cache/classes/xrootd-tokenauthz.jar
java    3304 root  txt   VREG   85,10   476636   271818 /opt/d-cache/classes/terapaths/xsdlib.jar
java    3304 root  txt   VREG   85,10  5566952   271810 /opt/d-cache/classes/terapaths/jaxrpc-impl.jar
java    3304 root  txt   VREG   85,10   150333   271817 /opt/d-cache/classes/terapaths/terapathsExampleJavaClient.jar
java    3304 root  txt   VREG   85,10    31204   271814 /opt/d-cache/classes/terapaths/relaxngDatatype.jar
java    3304 root  txt   VREG   85,10   270394   271813 /opt/d-cache/classes/terapaths/mail.jar
java    3304 root  txt   VREG   85,10    49672   271812 /opt/d-cache/classes/terapaths/jsr173_api.jar
java    3304 root  txt   VREG   85,10   618186   271816 /opt/d-cache/classes/terapaths/saaj-impl.jar
java    3304 root  txt   VREG   85,10    36298   271815 /opt/d-cache/classes/terapaths/saaj-api.jar
java    3304 root  txt   VREG   85,10    40820   271811 /opt/d-cache/classes/terapaths/jaxrpc-spi.jar
java    3304 root  txt   VREG   85,10    57375   271809 /opt/d-cache/classes/terapaths/jaxrpc-api.jar
java    3304 root  txt   VREG   85,10   217152   271808 /opt/d-cache/classes/terapaths/jaxp-api.jar
java    3304 root  txt   VREG   85,10     7700   271807 /opt/d-cache/classes/terapaths/jax-qname.jar
java    3304 root  txt   VREG   85,10    44623   271806 /opt/d-cache/classes/terapaths/activation.jar
java    3304 root  txt   VREG   85,10   593190   271805 /opt/d-cache/classes/terapaths/FastInfoset.jar
java    3304 root  txt   VREG   85,10  1197052   271804 /opt/d-cache/classes/srm.jar
java    3304 root  txt   VREG   85,10     1409   271803 /opt/d-cache/classes/srm-tomcat.jar
java    3304 root  txt   VREG   85,10   476940   271801 /opt/d-cache/classes/spring/spring-context.jar
java    3304 root  txt   VREG   85,10   488282   271800 /opt/d-cache/classes/spring/spring-beans.jar
java    3304 root  txt   VREG   85,10  1070477   271798 /opt/d-cache/classes/security/bcprov-jdk15-133.jar
java    3304 root  txt   VREG   85,10   572551   271797 /opt/d-cache/classes/saxon.jar
java    3304 root  txt   VREG   85,10   553805   271796 /opt/d-cache/classes/opensaml/xmltooling-1.0.1.jar
java    3304 root  txt   VREG   85,10  3149655   271794 /opt/d-cache/classes/opensaml/xalan-2.6.0.jar
java    3304 root  txt   VREG   85,10   285491   271802 /opt/d-cache/classes/spring/spring-core.jar
java    3304 root  txt   VREG   85,10     3854   271799 /opt/d-cache/classes/smc/statemap.jar
java    3304 root  txt   VREG   85,10   415492   271795 /opt/d-cache/classes/opensaml/xml-security-1.4.1.jar
java    3304 root  txt   VREG   85,10   392124   271793 /opt/d-cache/classes/opensaml/velocity-1.5.jar
java    3304 root  txt   VREG   85,10     7477   271792 /opt/d-cache/classes/opensaml/slf4j-simple-1.5.0.jar
java    3304 root  txt   VREG   85,10    16591   271791 /opt/d-cache/classes/opensaml/slf4j-api-1.5.0.jar
java    3304 root  txt   VREG   85,10   121144   271790 /opt/d-cache/classes/opensaml/openws-1.1.0.jar
java    3304 root  txt   VREG   85,10  1242124   271789 /opt/d-cache/classes/opensaml/opensaml-2.2.0.jar
java    3304 root  txt   VREG   85,10   105026   271788 /opt/d-cache/classes/opensaml/opensaml-1.0.1.jar
java    3304 root  txt   VREG   85,10   531326   271787 /opt/d-cache/classes/opensaml/joda-time-1.5.2.jar
java    3304 root  txt   VREG   85,10   207723   271786 /opt/d-cache/classes/opensaml/commons-lang-2.1.jar
java    3304 root  txt   VREG   85,10   559366   271785 /opt/d-cache/classes/opensaml/commons-collections-3.1.jar
java    3304 root  txt   VREG   85,10   352668   271784 /opt/d-cache/classes/log4j/log4j-1.2.8.jar
java    3304 root  txt   VREG   85,10   914243   271783 /opt/d-cache/classes/jpox/jpox-rdbms-1.2.2.jar
java    3304 root  txt   VREG   85,10  1503051   271782 /opt/d-cache/classes/jpox/jpox-core-1.2.2.jar
java    3304 root  txt   VREG   85,10    82512   271781 /opt/d-cache/classes/jpox/jdo2-api-2.0.jar
java    3304 root  txt   VREG   85,10    87401   271780 /opt/d-cache/classes/jline/jline-0.9.92.jar
java    3304 root  txt   VREG   85,10   150794   271779 /opt/d-cache/classes/jdom/jdom.jar
java    3304 root  txt   VREG   85,10   402505   271778 /opt/d-cache/classes/jdbc-drivers/postgresql-8.1-405.jdbc3.jar
java    3304 root  txt   VREG   85,10    43080   271777 /opt/d-cache/classes/jdbc-drivers/commons-pool.jar
java    3304 root  txt   VREG   85,10   107465   271776 /opt/d-cache/classes/jdbc-drivers/commons-dbcp.jar
java    3304 root  txt   VREG   85,10    29872   271775 /opt/d-cache/classes/javatunnel.jar
java    3304 root  txt   VREG   85,10    30663   271774 /opt/d-cache/classes/infoDynamicSE.jar
java    3304 root  txt   VREG   85,10   170704   271773 /opt/d-cache/classes/info.jar
java    3304 root  txt   VREG   85,10    71241   271772 /opt/d-cache/classes/gplazma/privilege-xacml-2.2.3.jar
java    3304 root  txt   VREG   85,10    89835   271771 /opt/d-cache/classes/gplazma/privilege-1.0.1.4.jar
java    3304 root  txt   VREG   85,10   327603   271770 /opt/d-cache/classes/gplazma/mail.jar
java    3304 root  txt   VREG   85,10    77904   271769 /opt/d-cache/classes/gplazma/gplazma.jar
java    3304 root  txt   VREG   85,10    23593   271768 /opt/d-cache/classes/gplazma/anam.jar
java    3304 root  txt   VREG   85,10     6357   271767 /opt/d-cache/classes/gplazma/alldepends.jar
java    3304 root  txt   VREG   85,10    82297   271766 /opt/d-cache/classes/gplazma/activation.jar
java    3304 root  txt   VREG   85,10    40809   271765 /opt/d-cache/classes/glue/servlet.jar
java    3304 root  txt   VREG   85,10   463471   271764 /opt/d-cache/classes/glue/jsse.jar
java    3304 root  txt   VREG   85,10     3098   271763 /opt/d-cache/classes/glue/jnet.jar
java    3304 root  txt   VREG   85,10     7637   271762 /opt/d-cache/classes/glue/jcert.jar
java    3304 root  txt   VREG   85,10    13407   271761 /opt/d-cache/classes/glue/dom.jar
java    3304 root  txt   VREG   85,10   259265   271760 /opt/d-cache/classes/glue/collections.jar
java    3304 root  txt   VREG   85,10   633873   271759 /opt/d-cache/classes/glue/GLUE-STD.jar
java    3304 root  txt   VREG   85,10   125485   271758 /opt/d-cache/classes/glite/vomsjapi.jar
java    3304 root  txt   VREG   85,10    61580   271757 /opt/d-cache/classes/glite/glite-security-util-java.jar
java    3304 root  txt   VREG   85,10    40737   271756 /opt/d-cache/classes/glite/glite-security-trustmanager-1.8.16-1.norefresh.jar
java    3304 root  txt   VREG   85,10   573145   271751 /opt/d-cache/classes/dcache-srm.jar
java    3304 root  txt   VREG   85,10   165821   271750 /opt/d-cache/classes/dcache-replica.jar
java    3304 root  txt   VREG   85,10   610790   271749 /opt/d-cache/classes/dbcp/c3p0-0.9.1.2.jar
java    3304 root  txt   VREG   85,10   352668   271747 /opt/d-cache/classes/cog/log4j-1.2.8.jar
java    3304 root  txt   VREG   85,10   121070   271745 /opt/d-cache/classes/cog/junit.jar
java    3304 root  txt   VREG   85,10     6567   271744 /opt/d-cache/classes/cog/jgss.jar
java    3304 root  txt   VREG   85,10    31605   271740 /opt/d-cache/classes/cog/commons-logging.jar
java    3304 root  txt   VREG   85,10     2949   271739 /opt/d-cache/classes/cog/cog-url.jar
java    3304 root  txt   VREG   85,10    23037   271738 /opt/d-cache/classes/cog/cog-tomcat.jar
java    3304 root  txt   VREG   85,10   366255   271743 /opt/d-cache/classes/cog/cryptix32.jar
java    3304 root  txt   VREG   85,10    80392   271741 /opt/d-cache/classes/cog/cryptix-asn1.jar
java    3304 root  txt   VREG   85,10     3608   271742 /opt/d-cache/classes/cog/cryptix.jar
java    3304 root  txt   VREG   85,10   191354   271748 /opt/d-cache/classes/cog/puretls.jar
java    3304 root  txt   VREG   85,10   358180   271746 /opt/d-cache/classes/cog/log4j-1.2.13.jar
java    3304 root  txt   VREG   85,10   689716   271736 /opt/d-cache/classes/cog/cog-jglobus.jar
java    3304 root  txt   VREG   85,10    31064   271735 /opt/d-cache/classes/cog/cog-axis.jar
java    3304 root  txt   VREG   85,10    91131   271734 /opt/d-cache/classes/chimera/portmap.jar
java    3304 root  txt   VREG   85,10    84788   271733 /opt/d-cache/classes/chimera/oncrpc.jar
java    3304 root  txt   VREG   85,10    31605   271732 /opt/d-cache/classes/chimera/commons-logging.jar
java    3304 root  txt   VREG   85,10    30117   271731 /opt/d-cache/classes/chimera/commons-cli-1.0.jar
java    3304 root  txt   VREG   85,10   179052   271730 /opt/d-cache/classes/chimera/chimera-provider.jar
java    3304 root  txt   VREG   85,10   640592   271729 /opt/d-cache/classes/chimera/chimera-nfs.jar
java    3304 root  txt   VREG   85,10   128289   271728 /opt/d-cache/classes/chimera/chimera-core.jar
java    3304 root  txt   VREG   85,10   108000   271727 /opt/d-cache/classes/chimera/chimera-acl.jar
java    3304 root  txt   VREG   85,10  1157968   271723 /opt/d-cache/classes/berkeleyDB/je-3.2.76.jar
java    3304 root    0r  VCHR    13,2           6815752 /devices/pseudo/mm@0:null
java    3304 root    1w  VCHR    13,2           6815752 /devices/pseudo/mm@0:null
java    3304 root    2w  VCHR    13,2           6815752 /devices/pseudo/mm@0:null
java    3304 root    3u  VCHR   13,12           6815772 /devices/pseudo/mm@0:zero
java    3304 root    4r  DOOR              0t0       53 /var/run/name_service_door (door to nscd[176]) (FA:->0xffffffffaff29ec0)
java    3304 root    5r  VCHR   149,0          78118916 /devices/pseudo/random@0:random
java    3304 root    6r  VCHR   149,1          78118918 /devices/pseudo/random@0:urandom
java    3304 root    7w  VREG   85,10        0   338286 /tmp/trace.out
java    3304 root    8u                                 unknown file system type (zfs), v_op: 0xffffffffa977f200
java    3304 root   11u  IPv4                       TCP no TCP/UDP/IP information available
java    3304 root   15u  0011 105,209      0t0 55050244 /devices/pseudo/tl@0:ticots->(socketpair: 0xd1) (0xffffffffb02b1560)
java    3304 root   16u                                 unknown file system type (zfs), v_op: 0xffffffffa977f200
java    3304 root   17u                                 unknown file system type (zfs), v_op: 0xffffffffa977f200
java    3304 root   18u                                 unknown file system type (zfs), v_op: 0xffffffffa977f200
java    3304 root   19r  VCHR   149,1          78118918 /devices/pseudo/random@0:urandom
java    3304 root   20u  IPv4                       TCP no TCP/UDP/IP information available
java    3304 root   22u  IPv4                       TCP no TCP/UDP/IP information available
java    3304 root   23u  IPv4                       TCP no TCP/UDP/IP information available
java    3304 root   24u  IPv4                       TCP no TCP/UDP/IP information available
java    3304 root   25u  IPv4                       TCP no TCP/UDP/IP information available
java    3304 root   26u  IPv4                       TCP no TCP/UDP/IP information available
java    3304 root   27u  IPv4                       TCP no TCP/UDP/IP information available
java    3304 root   28u  IPv4                       TCP no TCP/UDP/IP information available
java    3304 root   29u  IPv4                       TCP no TCP/UDP/IP information available
java    3304 root   30u  IPv4                       TCP no TCP/UDP/IP information available
java    3304 root   31u  IPv4                       TCP no TCP/UDP/IP information available
java    3304 root   32u  IPv4                       TCP no TCP/UDP/IP information available
java    3304 root   33u  IPv4                       TCP no TCP/UDP/IP information available

zpool status shows no problem with the zfs pools. The system logs show occasional problems with the marvell controller, but no entry for the time when this problem started. The free space monitoring shows that the problem started around 3:30 am, because we see there a sharp decrease in free space (lost the pool).

Cannot get a stack trace or attach through truss

root@t3fs05 # pstack 3304
pstack: cannot examine 3304: no such process

root@t3fs05 # truss -p 3304
truss: no such process: 3304

Some information on the state may be in this thread: http://opensolaris.org/jive/thread.jspa?threadID=25988 on unkillable processes. But they do not have a cure for it.

A normal shutdown/reboot did not work

shutdown -i6 -g0

Shutdown started.    Wed Dec  9 09:46:04 CET 2009

I can see that the shutdown process is hanging in the sync process:

pgrep -fl shutd
 4250 /sbin/sh /usr/sbin/shutdown -i6 -g0

truss -p 4250
waitid(P_PID, 4256, 0x08047B10, WEXITED|WTRAPPED|WNOWAIT) (sleeping...)

ps -ef | grep 4256
    root  4256  4250   0 09:46:04 pts/1       0:00 /sbin/sync
    root  4261   645   0 10:00:06 console     0:00 grep 4256

As a harsher measure, I then issued a reboot command. Even this hung. I then tried from the ILOM a stop /SYS. The still active console showed a confirmation of having received the shutdown command.

WARNING: Power off requested from power button or SC, powering down the system!

Shutdown started.    Wed Dec  9 10:13:10 CET 2009
WARNING: Failed to shut down the system!

So, the only remaining option is a hard stop of the system by turning off the power: stop -force /SYS

Observations Case 2

Based on the free storage monitor, the pool went offline around 17h on 2009-12-09. The gridftp log was the last log written by dcache. It is full of errors: gridftp-t3fs05Domain.log-20091210.

root@t3fs05 # /opt/d-cache/bin/dcache status
Domain               Status      Pid Log file
t3fs05Domain         running     872 /var/log/dcache/t3fs05Domain.log
dcap-t3fs05Domain    running     921 /var/log/dcache/dcap-t3fs05Domain.log
gridftp-t3fs05Domain running     970 /var/log/dcache/gridftp-t3fs05Domain.log
gsidcap-t3fs05Domain running    1022 /var/log/dcache/gsidcap-t3fs05Domain.log


/opt/d-cache/bin/dcache stop
Stopping gsidcap-t3fs05Domain (pid=1022) 0 Done
Stopping gridftp-t3fs05Domain (pid=970) 0 Done
Stopping dcap-t3fs05Domain (pid=921) 0 Done
Stopping t3fs05Domain (pid=872) 0 1 2 3 4 5 6

root@t3fs05 # truss -p 872
truss: unanticipated system error: 872

A google search for "truss: unanticipated system error" brought up this blog entry: http://blog.thilelli.net/post/2008/09/12/Processes-Can-Not-Be-Killed-Upon-Hanging-In-sendfilev

Further note: On the backplane of the X4500 the Warning LEDs for the chassis were on, also the white locator LED was on. On the front there was nothing

I cannot get get access to the Web frontend of the ILOM. Also, after a few tries, I cannot get access to the ILOM through the CLI. In both cases I just end up on the login. ... after a long time, I got access to the CLI, and then also to the web front end, I did a reset /SP on it.

-> reset /SP
Are you sure you want to reset /SP (y/n)? y
Performing hard reset on /SP failed
reset: No such object value

Then I tried a reset over the web front end (maintenance tab), which worked.

So, it also seems that the ILOM was in a bad state, which already had happened with the older ILOM versions on the X4150. Probably it will pay off to reset the ILOM regularly or to update it.

Solution or Workaround

A system restart is needed to bring the machine again into a clean state. But this does not really solve the underlying problem. An update of the OS is required (I installed Solaris 10u8).

Monitoring for this condition

A sharp decline in the available space is a symptom. One could also try to ping cells (would need to confirm).

-- DerekFeichtinger - 2009-12-09

IssueForm
Affected Service dcache pool
Symptom summary connections timeout. Pool shows 0 size. Pool cannot be shut down even with kill -9. OS update required
Reason Understood yes
Solution Exists yes
Obsolete no
Topic attachments
I Attachment History Action Size Date Who Comment
Unknown file formatlog-20091210 gridftp-t3fs05Domain.log-20091210 r1 manage 11.4 K 2009-12-10 - 08:31 DerekFeichtinger gridftp door log
Edit | Attach | Watch | Print version | History: r5 < r4 < r3 < r2 < r1 | Backlinks | Raw View | Raw edit | More topic actions
Topic revision: r5 - 2009-12-12 - DerekFeichtinger
 
This site is powered by the TWiki collaboration platform Powered by Perl This site is powered by the TWiki collaboration platformCopyright © 2008-2024 by the contributing authors. All material on this collaboration platform is the property of the contributing authors.
Ideas, requests, problems regarding TWiki? Send feedback