Symptoms
Summary: connections timeout. Pool shows 0 size. Pool cannot be shut down even with kill -9. OS update required
Occurrences
At what times did this problem occur (used to estimate frequency):
Observations
Observations Case 1
Excerpt from ps
root 3302 1 0 Nov 19 ? 0:00 /bin/sh /opt/d-cache/jobs/pool -pool=t3fs05 start
root 3304 3302 0 Nov 19 ? 454:56 /usr/bin/java -server -Xmx512m -XX:MaxDirectMemorySize=512m -Dsun.net.inetaddr.
Open files
root@t3fs05 # lsof -p 3304
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
java 3304 root cwd VDIR 85,10 512 271560 /root/t3solaris-tools
java 3304 root txt VREG 85,10 64476 27345 /usr/jdk/instances/jdk1.5.0/bin/java
java 3304 root txt VREG 85,10 9552 97630 /lib/libsendfile.so.1
java 3304 root txt VREG 85,10 46428 97581 /lib/libaio.so.1
java 3304 root txt VREG 85,10 46356 97624 /lib/librt.so.1
java 3304 root txt VREG 85,10 23592 29242 /usr/jdk/instances/jdk1.5.0/jre/lib/i386/libnio.so
java 3304 root txt VREG 85,10 314380 117476 /usr/lib/security/pkcs11_softtoken.so.1
java 3304 root txt VREG 85,10 41264 115718 /usr/lib/libcryptoutil.so.1
java 3304 root txt VREG 85,10 131396 110313 /usr/lib/libpkcs11.so.1
java 3304 root txt VREG 85,10 47532 29227 /usr/jdk/instances/jdk1.5.0/jre/lib/i386/libj2pkcs11.so
java 3304 root txt VREG 85,10 2461878 271819 /opt/d-cache/classes/toplink/toplink-essentials.jar
java 3304 root txt VREG 85,10 55724 29241 /usr/jdk/instances/jdk1.5.0/jre/lib/i386/libnet.so
java 3304 root txt VREG 85,10 163580 271820 /opt/d-cache/classes/xrootd-tokenauthz.jar
java 3304 root txt VREG 85,10 476636 271818 /opt/d-cache/classes/terapaths/xsdlib.jar
java 3304 root txt VREG 85,10 5566952 271810 /opt/d-cache/classes/terapaths/jaxrpc-impl.jar
java 3304 root txt VREG 85,10 150333 271817 /opt/d-cache/classes/terapaths/terapathsExampleJavaClient.jar
java 3304 root txt VREG 85,10 31204 271814 /opt/d-cache/classes/terapaths/relaxngDatatype.jar
java 3304 root txt VREG 85,10 270394 271813 /opt/d-cache/classes/terapaths/mail.jar
java 3304 root txt VREG 85,10 49672 271812 /opt/d-cache/classes/terapaths/jsr173_api.jar
java 3304 root txt VREG 85,10 618186 271816 /opt/d-cache/classes/terapaths/saaj-impl.jar
java 3304 root txt VREG 85,10 36298 271815 /opt/d-cache/classes/terapaths/saaj-api.jar
java 3304 root txt VREG 85,10 40820 271811 /opt/d-cache/classes/terapaths/jaxrpc-spi.jar
java 3304 root txt VREG 85,10 57375 271809 /opt/d-cache/classes/terapaths/jaxrpc-api.jar
java 3304 root txt VREG 85,10 217152 271808 /opt/d-cache/classes/terapaths/jaxp-api.jar
java 3304 root txt VREG 85,10 7700 271807 /opt/d-cache/classes/terapaths/jax-qname.jar
java 3304 root txt VREG 85,10 44623 271806 /opt/d-cache/classes/terapaths/activation.jar
java 3304 root txt VREG 85,10 593190 271805 /opt/d-cache/classes/terapaths/FastInfoset.jar
java 3304 root txt VREG 85,10 1197052 271804 /opt/d-cache/classes/srm.jar
java 3304 root txt VREG 85,10 1409 271803 /opt/d-cache/classes/srm-tomcat.jar
java 3304 root txt VREG 85,10 476940 271801 /opt/d-cache/classes/spring/spring-context.jar
java 3304 root txt VREG 85,10 488282 271800 /opt/d-cache/classes/spring/spring-beans.jar
java 3304 root txt VREG 85,10 1070477 271798 /opt/d-cache/classes/security/bcprov-jdk15-133.jar
java 3304 root txt VREG 85,10 572551 271797 /opt/d-cache/classes/saxon.jar
java 3304 root txt VREG 85,10 553805 271796 /opt/d-cache/classes/opensaml/xmltooling-1.0.1.jar
java 3304 root txt VREG 85,10 3149655 271794 /opt/d-cache/classes/opensaml/xalan-2.6.0.jar
java 3304 root txt VREG 85,10 285491 271802 /opt/d-cache/classes/spring/spring-core.jar
java 3304 root txt VREG 85,10 3854 271799 /opt/d-cache/classes/smc/statemap.jar
java 3304 root txt VREG 85,10 415492 271795 /opt/d-cache/classes/opensaml/xml-security-1.4.1.jar
java 3304 root txt VREG 85,10 392124 271793 /opt/d-cache/classes/opensaml/velocity-1.5.jar
java 3304 root txt VREG 85,10 7477 271792 /opt/d-cache/classes/opensaml/slf4j-simple-1.5.0.jar
java 3304 root txt VREG 85,10 16591 271791 /opt/d-cache/classes/opensaml/slf4j-api-1.5.0.jar
java 3304 root txt VREG 85,10 121144 271790 /opt/d-cache/classes/opensaml/openws-1.1.0.jar
java 3304 root txt VREG 85,10 1242124 271789 /opt/d-cache/classes/opensaml/opensaml-2.2.0.jar
java 3304 root txt VREG 85,10 105026 271788 /opt/d-cache/classes/opensaml/opensaml-1.0.1.jar
java 3304 root txt VREG 85,10 531326 271787 /opt/d-cache/classes/opensaml/joda-time-1.5.2.jar
java 3304 root txt VREG 85,10 207723 271786 /opt/d-cache/classes/opensaml/commons-lang-2.1.jar
java 3304 root txt VREG 85,10 559366 271785 /opt/d-cache/classes/opensaml/commons-collections-3.1.jar
java 3304 root txt VREG 85,10 352668 271784 /opt/d-cache/classes/log4j/log4j-1.2.8.jar
java 3304 root txt VREG 85,10 914243 271783 /opt/d-cache/classes/jpox/jpox-rdbms-1.2.2.jar
java 3304 root txt VREG 85,10 1503051 271782 /opt/d-cache/classes/jpox/jpox-core-1.2.2.jar
java 3304 root txt VREG 85,10 82512 271781 /opt/d-cache/classes/jpox/jdo2-api-2.0.jar
java 3304 root txt VREG 85,10 87401 271780 /opt/d-cache/classes/jline/jline-0.9.92.jar
java 3304 root txt VREG 85,10 150794 271779 /opt/d-cache/classes/jdom/jdom.jar
java 3304 root txt VREG 85,10 402505 271778 /opt/d-cache/classes/jdbc-drivers/postgresql-8.1-405.jdbc3.jar
java 3304 root txt VREG 85,10 43080 271777 /opt/d-cache/classes/jdbc-drivers/commons-pool.jar
java 3304 root txt VREG 85,10 107465 271776 /opt/d-cache/classes/jdbc-drivers/commons-dbcp.jar
java 3304 root txt VREG 85,10 29872 271775 /opt/d-cache/classes/javatunnel.jar
java 3304 root txt VREG 85,10 30663 271774 /opt/d-cache/classes/infoDynamicSE.jar
java 3304 root txt VREG 85,10 170704 271773 /opt/d-cache/classes/info.jar
java 3304 root txt VREG 85,10 71241 271772 /opt/d-cache/classes/gplazma/privilege-xacml-2.2.3.jar
java 3304 root txt VREG 85,10 89835 271771 /opt/d-cache/classes/gplazma/privilege-1.0.1.4.jar
java 3304 root txt VREG 85,10 327603 271770 /opt/d-cache/classes/gplazma/mail.jar
java 3304 root txt VREG 85,10 77904 271769 /opt/d-cache/classes/gplazma/gplazma.jar
java 3304 root txt VREG 85,10 23593 271768 /opt/d-cache/classes/gplazma/anam.jar
java 3304 root txt VREG 85,10 6357 271767 /opt/d-cache/classes/gplazma/alldepends.jar
java 3304 root txt VREG 85,10 82297 271766 /opt/d-cache/classes/gplazma/activation.jar
java 3304 root txt VREG 85,10 40809 271765 /opt/d-cache/classes/glue/servlet.jar
java 3304 root txt VREG 85,10 463471 271764 /opt/d-cache/classes/glue/jsse.jar
java 3304 root txt VREG 85,10 3098 271763 /opt/d-cache/classes/glue/jnet.jar
java 3304 root txt VREG 85,10 7637 271762 /opt/d-cache/classes/glue/jcert.jar
java 3304 root txt VREG 85,10 13407 271761 /opt/d-cache/classes/glue/dom.jar
java 3304 root txt VREG 85,10 259265 271760 /opt/d-cache/classes/glue/collections.jar
java 3304 root txt VREG 85,10 633873 271759 /opt/d-cache/classes/glue/GLUE-STD.jar
java 3304 root txt VREG 85,10 125485 271758 /opt/d-cache/classes/glite/vomsjapi.jar
java 3304 root txt VREG 85,10 61580 271757 /opt/d-cache/classes/glite/glite-security-util-java.jar
java 3304 root txt VREG 85,10 40737 271756 /opt/d-cache/classes/glite/glite-security-trustmanager-1.8.16-1.norefresh.jar
java 3304 root txt VREG 85,10 573145 271751 /opt/d-cache/classes/dcache-srm.jar
java 3304 root txt VREG 85,10 165821 271750 /opt/d-cache/classes/dcache-replica.jar
java 3304 root txt VREG 85,10 610790 271749 /opt/d-cache/classes/dbcp/c3p0-0.9.1.2.jar
java 3304 root txt VREG 85,10 352668 271747 /opt/d-cache/classes/cog/log4j-1.2.8.jar
java 3304 root txt VREG 85,10 121070 271745 /opt/d-cache/classes/cog/junit.jar
java 3304 root txt VREG 85,10 6567 271744 /opt/d-cache/classes/cog/jgss.jar
java 3304 root txt VREG 85,10 31605 271740 /opt/d-cache/classes/cog/commons-logging.jar
java 3304 root txt VREG 85,10 2949 271739 /opt/d-cache/classes/cog/cog-url.jar
java 3304 root txt VREG 85,10 23037 271738 /opt/d-cache/classes/cog/cog-tomcat.jar
java 3304 root txt VREG 85,10 366255 271743 /opt/d-cache/classes/cog/cryptix32.jar
java 3304 root txt VREG 85,10 80392 271741 /opt/d-cache/classes/cog/cryptix-asn1.jar
java 3304 root txt VREG 85,10 3608 271742 /opt/d-cache/classes/cog/cryptix.jar
java 3304 root txt VREG 85,10 191354 271748 /opt/d-cache/classes/cog/puretls.jar
java 3304 root txt VREG 85,10 358180 271746 /opt/d-cache/classes/cog/log4j-1.2.13.jar
java 3304 root txt VREG 85,10 689716 271736 /opt/d-cache/classes/cog/cog-jglobus.jar
java 3304 root txt VREG 85,10 31064 271735 /opt/d-cache/classes/cog/cog-axis.jar
java 3304 root txt VREG 85,10 91131 271734 /opt/d-cache/classes/chimera/portmap.jar
java 3304 root txt VREG 85,10 84788 271733 /opt/d-cache/classes/chimera/oncrpc.jar
java 3304 root txt VREG 85,10 31605 271732 /opt/d-cache/classes/chimera/commons-logging.jar
java 3304 root txt VREG 85,10 30117 271731 /opt/d-cache/classes/chimera/commons-cli-1.0.jar
java 3304 root txt VREG 85,10 179052 271730 /opt/d-cache/classes/chimera/chimera-provider.jar
java 3304 root txt VREG 85,10 640592 271729 /opt/d-cache/classes/chimera/chimera-nfs.jar
java 3304 root txt VREG 85,10 128289 271728 /opt/d-cache/classes/chimera/chimera-core.jar
java 3304 root txt VREG 85,10 108000 271727 /opt/d-cache/classes/chimera/chimera-acl.jar
java 3304 root txt VREG 85,10 1157968 271723 /opt/d-cache/classes/berkeleyDB/je-3.2.76.jar
java 3304 root 0r VCHR 13,2 6815752 /devices/pseudo/mm@0:null
java 3304 root 1w VCHR 13,2 6815752 /devices/pseudo/mm@0:null
java 3304 root 2w VCHR 13,2 6815752 /devices/pseudo/mm@0:null
java 3304 root 3u VCHR 13,12 6815772 /devices/pseudo/mm@0:zero
java 3304 root 4r DOOR 0t0 53 /var/run/name_service_door (door to nscd[176]) (FA:->0xffffffffaff29ec0)
java 3304 root 5r VCHR 149,0 78118916 /devices/pseudo/random@0:random
java 3304 root 6r VCHR 149,1 78118918 /devices/pseudo/random@0:urandom
java 3304 root 7w VREG 85,10 0 338286 /tmp/trace.out
java 3304 root 8u unknown file system type (zfs), v_op: 0xffffffffa977f200
java 3304 root 11u IPv4 TCP no TCP/UDP/IP information available
java 3304 root 15u 0011 105,209 0t0 55050244 /devices/pseudo/tl@0:ticots->(socketpair: 0xd1) (0xffffffffb02b1560)
java 3304 root 16u unknown file system type (zfs), v_op: 0xffffffffa977f200
java 3304 root 17u unknown file system type (zfs), v_op: 0xffffffffa977f200
java 3304 root 18u unknown file system type (zfs), v_op: 0xffffffffa977f200
java 3304 root 19r VCHR 149,1 78118918 /devices/pseudo/random@0:urandom
java 3304 root 20u IPv4 TCP no TCP/UDP/IP information available
java 3304 root 22u IPv4 TCP no TCP/UDP/IP information available
java 3304 root 23u IPv4 TCP no TCP/UDP/IP information available
java 3304 root 24u IPv4 TCP no TCP/UDP/IP information available
java 3304 root 25u IPv4 TCP no TCP/UDP/IP information available
java 3304 root 26u IPv4 TCP no TCP/UDP/IP information available
java 3304 root 27u IPv4 TCP no TCP/UDP/IP information available
java 3304 root 28u IPv4 TCP no TCP/UDP/IP information available
java 3304 root 29u IPv4 TCP no TCP/UDP/IP information available
java 3304 root 30u IPv4 TCP no TCP/UDP/IP information available
java 3304 root 31u IPv4 TCP no TCP/UDP/IP information available
java 3304 root 32u IPv4 TCP no TCP/UDP/IP information available
java 3304 root 33u IPv4 TCP no TCP/UDP/IP information available
zpool status
shows no problem with the zfs pools. The system logs show occasional problems with the marvell controller, but no entry for the time when this problem started.
The free space monitoring shows that the problem started around 3:30 am, because we see there a sharp decrease in free space (lost the pool).
Cannot get a stack trace or attach through truss
root@t3fs05 # pstack 3304
pstack: cannot examine 3304: no such process
root@t3fs05 # truss -p 3304
truss: no such process: 3304
Some information on the state may be in this thread:
http://opensolaris.org/jive/thread.jspa?threadID=25988 on unkillable processes. But they do not have a cure for it.
A normal shutdown/reboot did not work
shutdown -i6 -g0
Shutdown started. Wed Dec 9 09:46:04 CET 2009
I can see that the shutdown process is hanging in the sync process:
pgrep -fl shutd
4250 /sbin/sh /usr/sbin/shutdown -i6 -g0
truss -p 4250
waitid(P_PID, 4256, 0x08047B10, WEXITED|WTRAPPED|WNOWAIT) (sleeping...)
ps -ef | grep 4256
root 4256 4250 0 09:46:04 pts/1 0:00 /sbin/sync
root 4261 645 0 10:00:06 console 0:00 grep 4256
As a harsher measure, I then issued a
reboot
command. Even this hung. I then tried from the ILOM a
stop /SYS
. The still active console showed a confirmation of having received the shutdown command.
WARNING: Power off requested from power button or SC, powering down the system!
Shutdown started. Wed Dec 9 10:13:10 CET 2009
WARNING: Failed to shut down the system!
So, the only remaining option is a hard stop of the system by turning off the power:
stop -force /SYS
Observations Case 2
Based on the free storage monitor, the pool went offline around 17h on 2009-12-09.
The gridftp log was the last log written by dcache. It is full of errors:
gridftp-t3fs05Domain.log-20091210.
root@t3fs05 # /opt/d-cache/bin/dcache status
Domain Status Pid Log file
t3fs05Domain running 872 /var/log/dcache/t3fs05Domain.log
dcap-t3fs05Domain running 921 /var/log/dcache/dcap-t3fs05Domain.log
gridftp-t3fs05Domain running 970 /var/log/dcache/gridftp-t3fs05Domain.log
gsidcap-t3fs05Domain running 1022 /var/log/dcache/gsidcap-t3fs05Domain.log
/opt/d-cache/bin/dcache stop
Stopping gsidcap-t3fs05Domain (pid=1022) 0 Done
Stopping gridftp-t3fs05Domain (pid=970) 0 Done
Stopping dcap-t3fs05Domain (pid=921) 0 Done
Stopping t3fs05Domain (pid=872) 0 1 2 3 4 5 6
root@t3fs05 # truss -p 872
truss: unanticipated system error: 872
A google search for "truss: unanticipated system error" brought up this blog entry:
http://blog.thilelli.net/post/2008/09/12/Processes-Can-Not-Be-Killed-Upon-Hanging-In-sendfilev
Further note:
On the backplane of the X4500 the Warning LEDs for the chassis were on, also the white locator LED was on. On the front there was nothing
I cannot get get access to the Web frontend of the ILOM. Also, after a few tries, I cannot get access to the ILOM through the CLI. In both cases I just end up on the login.
... after a long time, I got access to the CLI, and then also to the web front end, I did a
reset /SP
on it.
-> reset /SP
Are you sure you want to reset /SP (y/n)? y
Performing hard reset on /SP failed
reset: No such object value
Then I tried a reset over the web front end (maintenance tab), which worked.
So, it also seems that the ILOM was in a bad state, which already had happened with the older ILOM versions on the X4150. Probably it will pay off to reset the ILOM regularly or to update it.
Solution or Workaround
A system restart is needed to bring the machine again into a clean state. But this does not really solve the underlying problem. An update of the OS is required (I installed Solaris 10u8).
Monitoring for this condition
A sharp decline in the available space is a symptom. One could also try to ping cells (would need to confirm).
--
DerekFeichtinger - 2009-12-09