<!-- keep this as a security measure: #uncomment if the subject should only be modifiable by the listed groups # * Set ALLOWTOPICCHANGE = Main.TWikiAdminGroup,Main.CMSAdminGroup # * Set ALLOWTOPICRENAME = Main.TWikiAdminGroup,Main.CMSAdminGroup #uncomment this if you want the page only be viewable by the listed groups # * Set ALLOWTOPICVIEW = Main.TWikiAdminGroup,Main.CMSAdminGroup --> ---+ !!Solaris Installations * [[http://www.zagbot.com/solaris.html][excellent collection of Solaris commands by zagbot]] We use *Solaris 10* as our default. We had tested !OpenSolaris installations in 2008 (!OpenSolaris 2008.05, but decided to stay with the standard systems) %TOC% ---+ System installation We use a jumpstart server to install raw Solaris10 from the network (Look [[Trash.CmsTier3SolarisJumpstartServer#how_to_configure_a_client][here]] for instructions). Some basic OS configuration tasks like ZFS setup and basic puppet configuration need to be done manually as described in the next chapter. But most other configurations tasks are then handled by the *puppet* system. ---+ Oracle SunOS 5.11 changes ---++ Networking <strike>dladm show-dev</strike> now is =dladm show-link= ---+ Solaris 10 Look at the preinstalled system's version <pre> cat /etc/release Solaris 10 11/06 s10x_u3wos_10 X86 Copyright 2006 Sun Microsystems, Inc. All Rights Reserved. Use is subject to license terms. Assembled 14 November 2006 </pre> use the =showrev= command <pre> showrev Hostname: t3fs04 Hostid: 6c2b90b Release: 5.10 Kernel architecture: i86pc Application architecture: i386 Hardware provider: Domain: Kernel version: SunOS 5.10 Generic_127112-10 </pre> ---++ Create a root home directory Solaris10 regrettably uses "/" as root's homedir. 
Edit the root entry in =/etc/passwd=
<pre %FILESTYLE%>
root:x:0:0:Super-User:/root:/sbin/sh
</pre>

Create the directory and move some files to it
<pre>
mkdir /root
chmod 700 /root
chown root:root /root
cd /
mv .Xauthority .bash_history .ssh /root/
</pre>

---++ getting ssh access as root and passwordless root access from the admin node

Default installations usually do not allow remote SSH access for root. You need to
   * edit =/etc/ssh/sshd_config= to allow root access (=PermitRootLogin yes= option)
   * restart the ssh service <pre>svcadm restart svc:/network/ssh:default</pre>

From the *admin node*, copy the SSH key to the authorized keys
<pre>
ssh-copy-id -i /root/.ssh/id_dsa root@NEWMACHINE
</pre>

*Log out and in again to the new server to have the correct environment for the next steps*

---++ getting remote access to the console via the ILOM

On some x4500 systems this worked out of the box, on others there were problems. Frequently, the console works only up to the point where the grub boot prompt appears. Remote access to the console through the ILOM (=start /SP/console=) is very desirable, because it shows important messages, such as the ones printed when patches are applied after a system reboot.

How to get the console to work
   * comment out the splashimage line in the =/boot/grub/menu.lst= file
   * Edit the =/boot/solaris/bootenv.rc= file. This file contains the boot environment settings. Some of the values can also be seen by using the =eeprom= command as =root=. On all Solaris machines where the redirection works, I can see in this file a setting <pre>setprop console 'ttya'</pre> A machine with failing redirection had this property set to ='text'=.

Note: The issue is much more complicated than this, and various hints to the problem can be gained from an _ELOM on X4150_ discussion on [[http://opensolaris.org/jive/thread.jspa?threadID=55657][this OpenSolaris weblink]] and this nice [[http://wiki.csclub.uwaterloo.ca/Console_Configuration][wiki from uwaterloo]].
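
The check of the =console= property described above can be scripted. The following is only a minimal sketch: the function name =console_setting= is made up for this example, and it assumes that the property is written as a single =setprop console '...'= line, as in the setting quoted above.

```shell
#!/bin/sh
# Print the value of the "console" property from a bootenv.rc-style file.
# Assumption: the property is stored as one line "setprop console 'VALUE'".
# The file path is an argument, so the check can also be run on a copy.
console_setting() {
    awk '$1 == "setprop" && $2 == "console" { print $3 }' "$1" | tr -d "'\""
}

# Usage on a node (working machines printed ttya, a failing one printed text):
#   console_setting /boot/solaris/bootenv.rc
```

The function only reads the file. Changing the property is still done by hand as described above.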
---++ Solaris 10 Services ---+++ Solaris 10 Services to disable <pre> for n in " telnet ftp rlogin svc:/network/shell:kshell svc:/network/shell:default finger sendmail smtp "; do svcadm disable $n ; done </pre> Disable the font server through inetd: <pre> /usr/sbin/inetadm -d svc:/application/x11/xfs:default </pre> Check the open ports with netstat! There should be nothing left except for SSH: <pre> netstat -a TCP: IPv6 Local Address Remote Address Swind Send-Q Rwind Recv-Q State If --------------------------------- --------------------------------- ----- ------ ----- ------ ----------- ----- *.* *.* 0 0 49152 0 IDLE *.ssh *.* 0 0 49152 0 LISTEN SCTP: Local Address Remote Address Swind Send-Q Rwind Recv-Q StrsI/O State ------------------------------- ------------------------------- ------ ------ ------ ------ ------- ----------- 0.0.0.0 0.0.0.0 0 0 102400 0 32/32 CLOSED Active UNIX domain sockets Address Type Vnode Conn Local Addr Remote Addr ffffffffb0ad23a8 stream-ord 00000000 00000000 ffffffffb0c76c88 stream-ord 00000000 00000000 ffffffffaf41bc80 stream-ord ffffffffae9a89c0 00000000 /var/run/.inetd.uds </pre> ---++ Mounting our minimal shared SW area (mounting a Linux NFS share on Solaris) <pre> mkdir /mnt/master mount -o vers=3 t3admin01:/master /mnt/master </pre> Note: The =-o vers=3= is needed for Solaris to understand the older protocol used on Linux. ---++ geting solaris packages ---+++ Basic Solaris Package management This is a list of examples for the basic Solaris package management commands, i.e. using the native Solaris commands <pre> pkgadd -d ./SUNWhd-1.01.pkg # install a package given by a file pkginfo # lists all installed packages pkginfo -l CSWpuppet # lists detailed package information for this package pkgchk -l -p /usr/lib/values-Xa.o # list packages to which this local file belongs and check against package pkgchk -l CSWpuppet | grep Pathname # lists files belonging to the named package pkgrm CSWpuppet # remove a package. 
Dependencies will be checked </pre> ---+++ from OpenCSW *THIS SECTION NEEDS TO BE REVISED. But opencsw is still very much alive* - Derek 2013-04-30 This follows instructions from http://www.opencsw.org/ 1 Get the pkg-get utility and transfer it to the new node <pre> wget http://www.opencsw.org/pkg_get.pkg wget ftp://ftp.ibiblio.org/pub/mirrors/opencsw/wget-i386.bin </pre> 1 Install the package <pre> pkgadd -d pkg_get.pkg </pre> 1 Edit the pkg-get configuration =/opt/csw/etc/pkg-get.conf= to select our nearby mirror <pre> url=http://mirror.switch.ch/ftp/mirror/opencsw/current </pre> 1 if pkg-get is not able to download a wget itself, copy a static one from ftp://ftp.ibiblio.org/pub/mirrors/opencsw/wget-i386.bin to the node <pre> scp wget-i386.bin newnode:/opt/csw/bin/wget </pre> 1 Install the gnupg suite. This will take some time <pre> /opt/csw/bin/pkg-get install gnupg textutils </pre> 1 Import the opencsw key and set the trust on it <pre> /opt/csw/bin/gpg --keyserver pgp.mit.edu --recv-keys E12E9D2F /opt/csw/bin/gpg --edit-key E12E9D2F trust # in the command line interface of gpg # choose "trust ultimately" </pre> from here on you can go to the puppet installation. ---+++ from Blastwave Blastwave has gone... as a result of Oracle's policies[[http://permalink.gmane.org/gmane.os.solaris.pca/3037][Denis Clarke threw the towel]]: <strike> Note: * there is a competing effort to the blastwave releases hosted at http://www.opencsw.org/ * many packages can also be obtained from http://sunfreeware.com/ I follow here the instructions from the [[http://www.blastwave.org/jir/blastwave.fam][Blastwave main page]]: <pre> pkgadd -d http://blastwave.network.com/csw/pkgutil_`/sbin/uname -p`.pkg /opt/csw/bin/pkgutil --catalog /opt/csw/bin/pkgutil --install gnupg textutils </pre> Import the blastwave gpg key to check the packages that you download! 
<pre> /opt/csw/bin/gpg --keyserver pgp.mit.edu --recv-keys A1999E90 /opt/csw/bin/gpg --list-keys //.gnupg/pubring.gpg -------------------- pub 1024D/A1999E90 2008-08-17 [expires: 2011-08-17] uid Blastwave Software (Blastwave.org Inc.) <software@blastwave.org> sub 2048g/E4845389 2008-08-17 [expires: 2011-08-17] /opt/csw/bin/gpg --edit-key A1999E90 ... Command> trust ... Your decision? 5 (trust ultimately) Do you really want to set this key to ultimate trust? (y/N) y ... </pre> Edit the pkgutil configuration file =/etc/opt/csw/pkgutil.conf=. We are using the default mirror with the *unstable* release (access to newer packages). <pre %FILESTYLE%> ... # To enable use of gpg or md5, uncomment these # NOTE: it doesn't make sense to use md5 but not gpg so your options should be: # 1. both disabled, 2. gpg enabled, 3. both enabled # Default: false, false use_gpg=true use_md5=true ... </pre> Again fetch the catalog, and now it will be checked with gpg <pre> /opt/csw/bin/pkgutil --catalog </pre> Now, we need to enter the paths to these tools into the configuration 1 add the path =/opt/csw/bin= in =/etc/default/login= <pre> PATH=/opt/csw/bin:/usr/bin: SUPATH=/sbin:/usr/sbin:/opt/csw/bin:/usr/bin </pre> 1 update the MANPATH in =/etc/profile= by adding these lines<pre> MANPATH=/usr/share/man:/opt/csw/share/man export MANPATH </pre> Log out and in again to test whether you now have the =pkgutil= utility in the path and can read its man page. </strike> ---+++ How to install packages with pkgutil Searching for a package using a substring (the -a flag shows available packages) <pre> pkgutil -a sed gsed CSWgsed 4.2.1,REV=2009.07.01 432.6 KB </pre> As a minimum we need to fetch gsed and rsync <pre> pkgutil -i gsed rsync </pre> Useful are also * CSWshutils - Gnu coreutils. They will be installed with a "g" prefix to all commands * top ---++ Install Puppet *Note: It is necessary to pick a Puppet client that works with our puppet server version. 
Changes are sometimes erratic even with minor version upgrades and will break the current functionality and manifests. As of this writing I recommend retrieving puppet 0.24.8 from the /master/Solaris area of t3admin01.*

If you want to install facter and puppet from our local area, you first need to use pkg-get to satisfy the dependencies:
<pre>
/opt/csw/bin/pkg-get -i ruby

# now install our standard puppet version
pkgadd -d /mnt/master/Solaris/facter-1.5.7\,REV\=2009.11.16-SunOS5.9-all-CSW.pkg
pkgadd -d /mnt/master/Solaris/puppet-0.24.8,REV=2009.03.23-SunOS5.10-all-CSW.pkg
</pre>

Immediately disable the puppet service that is started up automatically by the package
<pre>
svcadm disable puppetd
</pre>

Put this into the =/etc/puppet/puppet.conf= file
<pre %FILESTYLE%>
[main]
environment = DerekDevelopment

[puppetd]
factsync = true
#pluginsync = true
server = psi-puppet1.psi.ch
#evaltrace = true
</pre>

Put a correct config in a =/etc/sysconfig/psi= file
<pre %FILESTYLE%>
ZONE=Tier3
SET=Solaris
ROLE=dcachefs
</pre>

Be sure that you have entered the new node into the main puppet node manifest for the Tier-3.

Run puppet (from version 0.25 on, puppetd resides in sbin)
<pre>
/opt/csw/sbin/puppetd -v -t
</pre>

---++ Patching

   * For finding information on patches: http://sunsolve.sun.com/patchfinder/
   * *do this only from the system console or you may not see important messages, especially if a system restart is necessary*
   * some patches are dangerous (see below) and can render the system unbootable (seldom, but it happened!)

---+++ System Registration

Copy the registration template =/usr/lib/breg/data/RegistrationProfile.properties= and edit it (you need to get yourself a SUN account at http://sunsolve.sun.com/).
<pre>
scp -p /root/clusteradmin/profiles/solarisfs/common/root/RegistrationProfile.properties NEWNODE:/root/
</pre>

By convention, I install one in the =/root/= folder. So, you may use my existing one on the cluster.
Register the machine <pre> cd /root sconadm register -a -r ./RegistrationProfile.properties sconadm is running Authenticating user ... finish registration! </pre> this can fail with a long java backtrace if there is a problem with the hostname or domainname. The java framework somehow wants to connect back to a local socket, and this fails . More information may be gathered from a log file written at =/tmp/basicreg${DATE}*.log= The patching policy can be seen and changed like this: <pre> smpatch get ... patchpro.install.types - rebootafter:reconfigafter:standard ... smpatch set patchpro.install.types=rebootafter:reconfigafter:standard </pre> ---+++ patch information gathering You can list the revision information of already applied patches using the =showrev= command: <pre> showrev -p |head -3 Patch: 118344-14 Obsoletes: 122397-01 Requires: Incompatibles: Packages: SUNWcsu, SUNWcsl, SUNWckr, SUNWhea, SUNWarc, SUNWfmd Patch: 118368-04 Obsoletes: Requires: Incompatibles: Packages: SUNWcsu </pre> This command analyzes which patches are available for this machine <pre> smpatch analyze ... 138110-01 SunOS 5.10_x86: ata driver patch ... </pre> One can also analyze which dependencies need to be installed for a specific patch (or multiple patches. Multiple -i options are supported or one can pass a file with a list). <pre> smpatch analyze -i 138110-01 You have new messages. 
To retrieve: smpatch messages [-a] 127128-11 SunOS 5.10_x86: kernel patch 138110-01 SunOS 5.10_x86: ata driver patch # with a listfile smpatch analyze -x idlist=patch-list-20081215.lst 125556-01 SunOS 5.10_x86: patch behavior patch 138270-02 SunOS 5.10_x86: devfs patch 127128-11 SunOS 5.10_x86: kernel patch 137140-06 SunOS 5.10_x86: aggr patch 119255-59 SunOS 5.10_x86: Install and Patch Utilities Patch 137122-03 SunOS 5.10_x86: e1000g driver patch 138110-01 SunOS 5.10_x86: ata driver patch </pre> You can see the README of a specific patch by executing <pre> smpatch download -t -i 139580-01 </pre> ---+++ patch installation * patches are downloaded to =/var/sadm/spool= by default. * *logs from install attempts and READMEs are available in* =/var/sadm/patch/PATCHNUMBER/= directories <pre> smpatch update # without additional argument applies all available patches smpatch update -x idlist=patch-list-20081215.lst # updates specified by list smpatch update -i patch_id1 -i patch_id1 # updates by single patch IDs </pre> It may be that the patches require a system restart or at least going to maintenance mode to be installed. It is easiest to log in on the remote console and initiate a init 0, so one still can follow the progress, e.g. <pre> init 0 root@t3fs03 # svc.startd: The system is coming down. Please wait. svc.startd: 94 system services are now being stopped. Dec 15 11:35:06 t3fs03 syslogd: going down on signal 15 Dec 15 11:35:07 rpc.metad: Terminated Installing updates Installing update 138270-02 Succeeded Installing update 127128-11 Succeeded ........ </pre> %RED% *Note: Careful with kernel patches. The patch 137138-09 from SUN rendered two systems unbootable (q.v. 
this [[IssueSolarisPatchFail1][Issue report]]).* %ENDCOLOR%

---++ Solaris Filesystem configuration (UFS, ZFS), disk partitioning, RAID

---+++ The hd and hdadm tool

The *hd* tool is very useful to get an overview of the device mappings to physical disks. One can get it from the _X4500 Solaris tools CD_ and install it with
<pre>
pkgadd -d ./SUNWhd-1.01.pkg
</pre>

hd usage:
<pre>
hd       # show mappings and ASCII-graphical slot display
hd -q    # sequence in drive slot order. 1 & 2 are bootable disks
hd -l    # list in phys order (1 spc separated line). 1 & 2 are bootable disks
hd -j    # shows controller device files (PCI) and mapping to c1, c2, c3 ....
hd -w /pci@0,0/pci1022,7458@1/pci11ab,11ab@1/disk@2,0   # maps pci dev path to disk
hd -r    # read out SMART information (detailed)
hd -R    # read out SMART information (terse)
</pre>

In a very interesting [[http://www.cuddletech.com/blog/pivot/entry.php?id=993][blog entry]] Ben Rockwood points to the possibility of *reading out the SMART data of all disks* with =hd -r= (detailed) or =hd -R= (terse). In his experience the *(1.) Raw read error rate* and the *(5.) Reallocated sector count* are the best indicators for disks to be replaced. He also points to an interesting [[http://labs.google.com/papers/disk_failures.pdf][paper on disk failure statistics by google]].

The *hdadm* tool can be used to manage disks and bring them online/offline.
<pre>
hdadm display    # displays info on all drives
hdadm offline slot 11
hdadm offline disk c0t0
hdadm offline row3
hdadm offline col3
hdadm online all
</pre>

---+++ Disk exchange

The =cfgadm= command can be used to manage the hardware configuration (e.g. the _connected_ state of devices). This may be necessary when a disk is exchanged and ends up in disconnected state. It may require something like this (from http://osdir.com/mlos.solaris.opensolaris.storage.general/2008-07/msg00122.html):
<pre>
# cfgadm -c configure sata1/3
</pre>

A lot of examples can be found on our page of FileserverDiskProblems.
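
To find candidates for such a =cfgadm -c configure= call after a disk exchange, the attachment point listing can be filtered. This is only a sketch: the function name =unconfigured_sata= is made up, and it assumes the usual =cfgadm= column order (Ap_Id, Type, Receptacle, Occupant, Condition). Slots without a disk will show up as well, so the result still needs a human look.

```shell
#!/bin/sh
# Read "cfgadm" output on stdin and print every sata attachment point that is
# not in the "connected configured" state, together with its two state columns.
# Assumption: columns are Ap_Id, Type, Receptacle, Occupant, Condition.
unconfigured_sata() {
    awk '$1 ~ /^sata/ && ($3 != "connected" || $4 != "configured") { print $1, $3, $4 }'
}

# Usage on a node:
#   cfgadm | unconfigured_sata
```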
---+++ ZFS pool setup

   * [[http://blogs.sun.com/relling/entry/zfs_raid_recommendations_space_performance][excellent article on ZFS raid setups and relations to performance and data safety]]

Delete the original empty zpool: =zpool destroy zpool1=

Run our "setup_zpool.sh" script, which identifies the boot disks and prints the commands needed to create our required zpool structure. Example:
<pre>
setup_zpool.sh
# system contains 48 disks
# boot disks: c5t0 c5t4
# spare disks: c0t3 c0t7
# num of pool disks: 44
zpool create -f data1 raidz1 c4t0d0 c4t4d0 c7t0d0 c7t4d0 c6t0d0 c6t4d0 c1t0d0 c1t4d0 c0t0d0
zpool add -f data1 raidz c0t4d0 c5t1d0 c5t5d0 c4t1d0 c4t5d0 c7t1d0 c7t5d0 c6t1d0 c6t5d0
zpool add -f data1 raidz c1t1d0 c1t5d0 c0t1d0 c0t5d0 c5t2d0 c5t6d0 c4t2d0 c4t6d0 c7t2d0
zpool add -f data1 raidz c7t6d0 c6t2d0 c6t6d0 c1t2d0 c1t6d0 c0t2d0 c0t6d0 c5t3d0 c5t7d0
zpool add -f data1 raidz c4t3d0 c4t7d0 c7t3d0 c7t7d0 c6t3d0 c6t7d0 c1t3d0 c1t7d0
zpool add -f data1 spare c0t3d0 c0t7d0
</pre>

---+++ monitoring disk I/O

You can use the following commands to monitor I/O broken down to single disks
<pre>
zpool iostat -v
iostat -nxz 5
</pre>

---+++ soon obsolete: Custom UFS partitioning and RAID setup for X4500

How to partition an X4500 and set up RAID mirroring can be found on the InstallationSolarisPartitioning page.

---+++ soon obsolete: UFS Boot partition information of preinstalled Solaris10 X4500

The boot image in the preinstalled X4500 Solaris 10 installations of this version is not on a ZFS file system (and I think putting it there would not add much value anyway).
This is the list of mounted partitions as seen in =/etc/vfstab= (and =/etc/mnttab=), among them the root partition, which sits on an *ufs* file system: Contents of =/etc/vfstab= on a SUN Solaris 10 preinstalled X4500: <pre %FILESTYLE%> #device device mount FS fsck mount mount #to mount to fsck point type pass at boot options # fd - /dev/fd fd - no - /proc - /proc proc - no - /dev/md/dsk/d10 /dev/md/rdsk/d10 / ufs 1 no - /devices - /devices devfs - no - ctfs - /system/contract ctfs - no - objfs - /system/object objfs - no - /dev/md/dsk/d20 - - swap - no - /dev/md/dsk/d30 /dev/md/rdsk/d30 /var ufs 1 no - </pre> Here is a good document on the so-called soft-partitions (kind of logical volumes): http://sysunconfig.net/unixtips/soft-partitions.html To find out more about these so called _meta devices_, one can use the =metastat= command: <pre> metastat d10: Mirror Submirror 0: d11 State: Okay Submirror 1: d12 State: Okay Pass: 1 Read option: roundrobin (default) Write option: parallel (default) Size: 22539195 blocks (10 GB) d11: Submirror of d10 State: Okay Size: 22539195 blocks (10 GB) Stripe 0: Device Start Block Dbase State Reloc Hot Spare c5t0d0s0 0 No Okay Yes d12: Submirror of d10 State: Okay Size: 22539195 blocks (10 GB) Stripe 0: Device Start Block Dbase State Reloc Hot Spare c5t4d0s0 0 No Okay Yes Device Relocation Information: Device Reloc Device ID c5t0d0 Yes id1,sd@SATA_____HITACHI_HUA7250S______GTF402P6G3NM6F c5t4d0 Yes id1,sd@SATA_____HITACHI_HUA7250S______GTF402P6G3KPYF </pre> The information in the last few lines points to the physical disks on which the file systems reside. This can be correlated with the information given by the *hd* utility, and we can see that (naturally) these two devices are the boot disks (same is found for the d30 meta device, here). ---++ Changing the hostname If the hostname has been configured wrongly, you may need to make changes to all of these files. This is regrettably much harder than in Linux. 
   * =/etc/hosts=
   * =/etc/nodename=
   * =/etc/hostname.*=
   * =/etc/inet/ipnodes=

---++ Networking

---+++ Network trunking

Note: Trunking must also be configured on the switch; otherwise only outgoing traffic profits from it (because for outgoing traffic the OS decides which interface to use). If the switch is set to do static trunking, one can run into problems while the system is in the boot phase, because at that point it is not yet able to support the trunking. So, it can happen that the system cannot receive answers from DHCP, because the switch sends them in on another interface than the one the request came from. This problem can be solved by using *LACP (Link Aggregation Control Protocol)*, where the host tells the switch when to trunk the interfaces. The switch must be configured for this!

This service should be online: =svc:/network/physical:default=

Detach and erase the configuration for the single interface
<pre>
dladm show-dev
ifconfig nge0 unplumb
rm -f /etc/hostname.nge0
rm -f /etc/dhcp.nge0
</pre>

Create the aggregated network interface. There is a bug in the dladm on our nodes, even after many system updates have been installed:
<pre>
dladm create-aggr -P L2,L3,L4 --lacp-mode=active -d e1000g0 -d e1000g1 -d e1000g2 -d e1000g3 1
Segmentation Fault (core dumped)
</pre>

The trunking works if the hash function policy only uses L3,L4 or L2,L4. The =--lacp-mode=active= flag turns on LACP mode (see above) for the trunking.
<pre> dladm create-aggr -P L3,L4 --lacp-mode=active -d e1000g0 -d e1000g1 -d e1000g2 -d e1000g3 1 </pre> NOTE: One can also change the properties of an existing aggregated link <pre> dladm modify-aggr --lacp-mode=active 1 </pre> One can list the details on the trunked interfaces <pre> dladm show-aggr -L key: 1 (0x0001) policy: L3,L4 address: 0:14:4f:a6:df:e4 (auto) LACP mode: active LACP timer: short device activity timeout aggregatable sync coll dist defaulted expired e1000g0 active short yes yes yes yes no no e1000g1 active short yes yes yes yes no no e1000g2 active short yes yes yes yes no no e1000g3 active short yes yes yes yes no no </pre> To select a hardcoded IP/hostname configuration, make sure that =/etc/inet/hosts= (DANGER: =/etc/hosts= is just a symbolic link to that!) contains the local node's IP and hostname <pre> echo "[my hostname" > /etc/hostname.aggr1 echo 192.33.123.1 > /etc/defaultrouter </pre> Also put the correct information into =/etc/netmasks=. <pre %FILESTYLE%> 192.33.123.0 255.255.255.0 </pre> Activate the interface and restart the service to obtain the correct network settings. <pre> ifconfig aggr1 plumb svcadm restart svc:/network/physical:default </pre> Our aggr.sh script will take care of correctly trunking on both Solaris 10 and !OpenSolaris systems. ---+++ TCP tuning Explanation of the Solaris TCP parameters (older, from 2002): http://developers.sun.com/solaris/articles/tuning_for_streaming.html also look at the reference manual: http://docs.sun.com/app/docs/doc/806-4015/6jd4gh8fn?a=view From the [[http://www-didc.lbl.gov/TCP-tuning/Solaris.html][TCP Tuning Guide]]: <pre %FILESTYLE%> For Solaris create a boot script similar to this (e.g.: /etc/rc2.d/S99ndd) #!/bin/sh # increase max tcp window # The maximum buffer size in bytes. It controls how large the send and receive buffers # can be set to by an application using setsockopt(3SOCKET). 
ndd -set /dev/tcp tcp_max_buf 4194304 # The maximum value of TCP congestion window (cwnd) in bytes. ndd -set /dev/tcp tcp_cwnd_max 2097152 # The default (???) send/receive window size in bytes (i.e. what you get if the application does not ask for more) ndd -set /dev/tcp tcp_xmit_hiwat 65536 ndd -set /dev/tcp tcp_recv_hiwat 65536 </pre> Note: The 64kB for the hiwat values are too small for WAN transfers over gigabit links. Here are the default values on one of our Solaris10 Thumpers: <pre> root@t3fs01 $ ndd -get /dev/tcp tcp_max_buf 1048576 root@t3fs01 $ ndd -get /dev/tcp tcp_cwnd_max 1048576 root@t3fs01 $ ndd -get /dev/tcp tcp_xmit_hiwat 49152 root@t3fs01 $ ndd -get /dev/tcp tcp_recv_hiwat 49152 </pre> This effectively limits the tcp window to 48 kbyte!!!! ---+++ How to fix the recurring t3fs06 links down issue Sometimes during an unexpected NFS load vs NFS =t3fs06= its links goes down and that locks all the activities at T3; if that happens Nagios will report it and the status will be in =t3fs06=: <pre> dladm show-dev e1000g0 link: unknown speed: 1000 Mbps duplex: full e1000g1 link: unknown speed: 1000 Mbps duplex: full e1000g2 link: unknown speed: 1000 Mbps duplex: full e1000g3 link: unknown speed: 1000 Mbps duplex: full </pre> To recover try: <pre> # ifconfig aggr1 unplumb # ifconfig aggr1 plumb # svcadm restart svc:/network/physical:default </pre> Check that now the link is like: <pre> bash-4.2# dladm show-aggr key: 1 (0x0001) policy: L3,L4 address: 0:14:4f:a6:e1:8c (auto) device address speed duplex link state e1000g0 0:14:4f:a6:e1:8c 1000 Mbps full up attached e1000g1 0:14:4f:a6:e1:8d 1000 Mbps full up attached e1000g2 0:14:4f:a6:e1:8e 1000 Mbps full up attached e1000g3 0:14:4f:a6:e1:8f 1000 Mbps full up attached </pre> ---++ Syslog Solaris 10 machines come with a syslog setup where the =/etc/syslog.conf= file is processed by m4, before it is fed to the syslogd. 
If a host _loghost_ is defined inside of =/etc/hosts=, and this host resolves to one of the IP addresses of the local machine, the LOGHOST variable will get defined, and the various conditionals will resolve to logging to local files.

---++ Cleanly shutting down

Shutdown:
   * *shutdown -i5 -g0* This brings the machine into state 5, a state in which the machine can be turned off, and then powers it off.

Rebooting:
   * *init 6:* Use of =init 6= gives the cleanest and most orderly reboot (init informs svc.startd of the runlevel change and will move to the appropriate milestone)
   * *shutdown -y -g0 -i6 "message"* will invoke init as well as give you a grace period and messages to users

*NOTE: halt, reboot, poweroff* will not run any of the shutdown scripts and should be a last resort.

---++ Local Firewall settings

Look at the [[https://twiki.cscs.ch/twiki/bin/view/LCGTier2/SolarisFirewall][Tier-2 documentation]]. At PSI we have a firewall in front of the Tier-3.

---++ Performance monitoring

---+++ Memory monitoring

Good article on [[http://oraclepoint.com/oralife/2011/02/09/different-ways-to-check-memory-usage-on-solaris-server/][oralife blog]]

---++++ mdb

Nice overview of memory usage
<pre>
root@t3fs05 $ echo ::memstat | mdb -k
Page Summary                Pages                MB  %Tot
------------     ----------------  ----------------  ----
Kernel                     738360              2884   18%
ZFS File Data               11856                46    0%
Anon                        19434                75    0%
Exec and libs                2583                10    0%
Page cache                    206                 0    0%
Free (cachelist)             6218                24    0%
Free (freelist)           3282408             12821   81%

Total                     4061065             15863
Physical                  3975282             15528
</pre>

   * Kernel: kernel pages
   * Anon: anonymous pages (such as stack, heap, shared mem etc)
   * Exec and libs: executables and libraries
   * Page cache: file cache
   * Free (cachelist) + Free (freelist) = freemem (the value in the =free= column of =vmstat=)

---++++ prstat

Like top. A nice overview of per-user statistics can be obtained with =prstat -a=

---++++ ps

Sorts processes according to mem usage
<pre>
ps -efo pmem,uid,pid,ppid,pcpu,comm | sort -r
</pre>
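
The plain =sort -r= above compares the lines as text. A numeric sort on the first column is more robust when the memory percentages have different numbers of digits. The following is a small sketch of that variation; the function name =top_mem= is made up for this example.

```shell
#!/bin/sh
# Read ps output (memory percentage in the first column) on stdin and print
# the N lines with the highest memory usage (default: 10).
# The header line counts as 0 in a numeric sort and therefore ends up last.
top_mem() {
    sort -k1,1nr | head -n "${1:-10}"
}

# Usage on a node:
#   ps -efo pmem,uid,pid,ppid,pcpu,comm | top_mem 5
```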
---++++ pmap Like in Linux <pre> /usr/proc/bin/pmap -x <process-id> </pre> ---+++ Dtrace * http://www.solarisinternals.com/wiki/index.php/DTraceToolkit * http://www.prefetch.net/articles/solaris.dtracetopten.html * http://www.brendangregg.com/dtrace.html Useful dtracetoolkit commands: * iotop - display top disk I/O events by process (list for every disk) * rwsnoop: measuring reads and writes at the application level * dtruss -p pid: truss equivalent * bitesize.d: histogram of sizes of disk events per process as sent by the block I/O driver * seeksize.d: histograms disk event seeks by process * iopattern: differentiates sequential from random i/o * iosnoop: lists processes doing IO (limited utility on ZFS because no filenames are given) * opensnoop: snoops file open calls ---+++ SAR You must install the SUNWaccu and SUNWaccr (service) packages. Then, activate the service <pre> svcadm enable svc:/system/sar:default </pre> The data taking is done per cron job as the sys user: Uncomment these lines if necessary <pre> cat /var/spool/cron/crontabs/sys #ident "@(#)sys 1.5 92/07/14 SMI" /* SVr4.0 1.2 */ # # The sys crontab should be used to do performance collection. See cron # and performance manual pages for details on startup. # 0 * * * 0-6 /usr/lib/sa/sa1 20,40 8-17 * * 1-5 /usr/lib/sa/sa1 5 18 * * 1-5 /usr/lib/sa/sa2 -s 8:00 -e 18:01 -i 1200 -A </pre> ---+++ Misc * prstat * intrstat ---+ Links with documentation and Howtos * [[http://spiralbound.net/2005/05/10/how-to-copy-a-solaris-boot-drive-to-a-disk-with-a-different-partition-layout][Copy a Solaris Boot Drive to a New Disk]] (alternative source: [[http://www.ignorantinc.com/www/ignorantwiki/index.php/Solaris_Wiki:Procedures/copy_boot_disk][Solaris Wiki]]) ------------------------------------------------------------- ---+ !OpenSolaris 2008.05 *OBSOLETE* ---++ Basic OS installation !OpenSolaris was installed by attaching a USB DVD drive with the !OpenSolaris distribution to the machine and booting from it. 
SUN ILOM (works on X4500): =cd SP/console ; start= or =start /SP/console=

On x4500:<br>
<pre>
cat /etc/release
                         OpenSolaris 2008.05 snv_86_rc3 X86
           Copyright 2008 Sun Microsystems, Inc.  All Rights Reserved.
                        Use is subject to license terms.
                             Assembled 26 April 2008
</pre>

Where did I get this from?<br>
SunOS Release 5.10 Version Generic_127112-10 64-bit<br>
Marvell Serial ATA Adapter - Driver Version 3.6.1.0-5<br>
On SunOS unknown 5.10 Generic_127112-10 i86pc i386 i86pc

---++ Booting into single user mode

In the grub screen, edit the kernel options and add a "-s" flag. Then boot with "b".

---++ getting passwordless root access from the admin node

   * first edit =/etc/ssh/sshd_config= to allow root access (=PermitRootLogin yes= option) and restart the ssh service.
   * !OpenSolaris may require turning off the pam_roles module, because in =/etc/user_attr= root is defined as a role. This is different from what we find on a standard Solaris 10 installation. Definitions for the =/etc/pam.conf= file:
<pre %FILESTYLE%>
# PSI modification to allow interactive root user remote login
sshd-pubkey auth requisite pam_authtok_get.so.1
sshd-pubkey auth required pam_dhkeys.so.1
sshd-pubkey auth required pam_unix_cred.so.1
sshd-pubkey auth required pam_unix_auth.so.1
# sshd-pubkey account requisite pam_roles.so.1 allow_remote # allow_remote is not enough because auser information is missing
sshd-pubkey account required pam_unix_account.so.1
</pre>

Note: one may also want to define this for the _sshd-kbdint_ module (keyboard password based login). One could also drop the definition of root as a role.
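
Whether root is defined as a role can be checked before touching the PAM configuration. This is only a sketch: the function name =root_is_role= is made up, and it assumes that the role shows up as =type=role= in the root entry of =/etc/user_attr=.

```shell
#!/bin/sh
# Succeed (exit status 0) if the root entry of a user_attr-style file
# carries "type=role". The file path is an argument, so a copy can be tested.
root_is_role() {
    grep '^root:' "$1" | grep -q 'type=role'
}

# Usage on a node:
#   root_is_role /etc/user_attr && echo "root is a role"
# Dropping the role definition (the alternative mentioned above) is usually
# done with:  rolemod -K type=normal root
# This was not tested on our nodes, so treat it as an assumption.
```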
---++ Package management installing a standard SUN package from file using the default package manager <pre> pkgadd -d ./SUNWhd-1.01.pkg pkginfo # lists all installed packages pkginfo -l CSWpuppet # lists detailed package information for this package pkgchk -l -p /usr/lib/values-Xa.o # list packages to which this local file belongs and check against package pkgchk -l CSWpuppet | grep Pathname # lists files belonging to the named package </pre> ---+++ Obtaining free additional packages from Blastwave *NOTE: BLASTWAVE SEEMS NOT VERY ACTIVE ANY MORE: MOST DEVELOPERS HAVE GONE TO* http://www.opencsw.org/!!!!! 1 Install the =pkg-get= tool from blastwave (see [[http://www.blastwave.org/jir/blastwave.fam][blastwave web page]]) <pre> pkgadd -d http://www.blastwave.org/pkg_get.pkg</pre> You may have to download the file by hand according to instructions at the blastwave page, if the weblink fails. 1 You may need to modify the repository url in =/opt/csw/etc/pkg-get.conf= to point to a valid mirror. We use <pre>url=http://blastwave.network.com/csw/stable</pre> 1 add the path =/opt/csw/bin= in =/etc/default/login= <pre> PATH=/opt/csw/bin:/usr/bin: SUPATH=/sbin:/usr/sbin:/opt/csw/bin:/usr/bin </pre> 1 update the MANPATH in =/etc/profile= <pre> MANPATH=/usr/share/man:/opt/csw/share/man export MANPATH </pre> pkg-get example usage: <pre> pkg-get -D regexp # searches for a package pkg-get install top # installs package "top" pkginfo CSWtop # shows info about a package </pre> special pkg messages are saved in logfiles in =/var/sadm/install/logs/= ---++++ Packages to install <pre> pkg-get -i gsed </pre> ---+++ the new package manager 'pkg' (not used by us) A new package manager *pkg* is in development and it seems that it used for some packages. It is orthogonal to the other package management and this is a little disconcerting. We do not use it. 
Still, here is some example usage:
<pre>
# pfexec is like sudo and is often used for issuing the following commands as another user
pkg refresh
pkg list [fmri-pattern]   # list packages (very slow)
pkg info [fmri-pattern]   # list package information (long format)
pkg authority             # displays repository info
pkg search gnu            # search for packages with this substring
pkg contents SUNWgnu-coreutils
pkg install SUNWvim
</pre>

Packages get downloaded from opensolaris servers via http (look at http://pkg.opensolaris.org/status).

---++ ZFS file system and disk management

Very nice links
   * [[http://www.solarisinternals.com/wiki/index.php/ZFS_Best_Practices_Guide][ZFS_Best_Practices_Guide]]
   * [[http://www.solarisinternals.com/wiki/index.php/ZFS_Configuration_Guide][ZFS_Configuration_Guide]]
   * [[http://initialprogramload.blogspot.ch/2008/07/how-solaris-disk-device-names-work.html][How solaris disk device names work]]

How to determine what ZFS features are available on a system:
<pre>
zfs upgrade -v
The following filesystem versions are supported:

VER  DESCRIPTION
---  --------------------------------------------------------
 1   Initial ZFS filesystem version
 2   Enhanced directory entries
 3   Case insensitive and File system unique identifer (FUID)
 4   userquota, groupquota properties

For more information on a particular version, including supported releases, see:

http://www.opensolaris.org/os/community/zfs/version/zpl/N

Where 'N' is the version number.
</pre>

On !OpenSolaris the root partition already sits on a zpool:
<pre>
mount
/ on rpool/ROOT/opensolaris read/write/setuid/devices/nonbmand/exec/xattr/atime/dev=2d90002

zfs list
rpool                               2.92G  450G    55K  /rpool
rpool@install                         16K     -    55K  -
rpool/ROOT                          2.92G  450G    18K  /rpool/ROOT
rpool/ROOT@install                    15K     -    18K  -
rpool/ROOT/opensolaris              2.92G  450G  2.33G  legacy
rpool/ROOT/opensolaris@install       196M     -  2.22G  -
rpool/ROOT/opensolaris/opt           412M  450G   411M  /opt
rpool/ROOT/opensolaris/opt@install   118K     -  3.60M  -
rpool/export                         540K  450G    19K  /export
rpool/export@install                  15K     -    19K  -
rpool/export/home                    506K  450G   488K  /export/home
rpool/export/home@install             18K     -    21K  -
</pre>

Locating the boot disk
<pre>
cfgadm | grep sata3/0   # x4500: for determining boot disk
</pre>

All device files are located under =/dev/dsk/=

The *hd* tool is very useful to get an overview of the device mappings to physical disks. One can get it from the Solaris tools CD by using
<pre>
pkgadd -d ./SUNWhd-1.01.pkg
</pre>

hd usage:
<pre>
hd      # show mappings and ASCII-graphical slot display
hd -q   # sequence in drive slot order. 1 & 2 are bootable disks
hd -l   # list in phys order (one space-separated line). 1 & 2 are bootable disks
hd -j   # shows controller device files (PCI) and mapping to c1, c2, c3 ....
hd -w /pci@0,0/pci1022,7458@1/pci11ab,11ab@1/disk@2,0   # maps pci dev path to disk
</pre>

ZFS command examples
<pre>
zfs list                     # shows mountpoints
zpool status -v [poolname]   # status + device. This also will show corrupted files, e.g.
# after a resilver
zpool status -x              # health of all pools
zpool online tank c1t0d0     # bring a pool device online
zpool iostat
zpool iostat rpool 2
zpool iostat -v              # device specific information
zpool import                 # lists exported pools ready to be imported
zpool import [-f] zpool1
zpool upgrade zpool1
zpool destroy zpool1
zpool create tank mirror c1d0 c2d0 mirror c3d0 c4d0
zpool create tank raidz c1t0d0 c2t0d0 c3t0d0 c4t0d0 c5t0d0
zpool create pool mirror c0d0 c1d0 spare c2d0 c3d0   # mirror with two hot spares
</pre>

---++ Networking

The solarisfs profile which can be synced from the admin node contains a script for setting up an aggregated network device. Since it will detach the current network connections, it should be run either in batch mode (e.g. using the =at= service with =at now=) or inside a console session.
<pre>
bash aggr.sh -f
</pre>

For dhcp to work correctly, one needs two files
   * =/etc/hostname.[interface]= makes the interface persistent and may contain the hostname
   * =/etc/dhcp.[interface]= is needed to get the dhcp parameters to be used for the local configuration

A little command overview:
<pre>
ifconfig -a
dladm show-dev e1000g0
dladm show-link [-s] [linkname]

# disable nwam and enable default
svcadm disable svc:/network/physical:nwam
svcadm enable svc:/network/physical:default

# must have a /etc/hostname.aggr1 to make
# aggregate persistent
ifconfig e1000g1 unplumb
dladm create-aggr -l e1000g0 ...
aggr1
dladm show-aggr [-L] [-x] [-s]
dladm delete-aggr aggr1
ifconfig aggr1 plumb        # aggr seems to get MAC of first link
ifconfig aggr1 dhcp start

netstat -D                  # shows state of dhcp per interface
# dhcp config in /etc/default/dhcpagent
dhcpinfo -i e1000g0 12
dhcpinfo -i e1000g0 Hostname

# How to control inetd related services:
inetadm                                        # lists all services and their status
inetadm -l svc:/network/rpc/smserver:default   # shows config details for that service
</pre>

---+++ Network routing table

*CAREFUL:* The file =/etc/defaultrouter= can be used to hardcode a default route (just put the IP of the target machine in). If this file contains a setting, it will be added to the settings that are obtained by DHCP, and *you may end up with 2 default routes*. This can lead to unpredictable results if the hardcoded IP points to a wrong place.

Some commands to change routes:
<pre>
route add 192.33.123.0/24 192.33.123.44 -interface   # adding a route with gw being
                                                      # machine's own interface
route add default 192.33.123.1
route delete default 192.33.123.41                    # deleting a gateway default route
</pre>

---+++ DNS

Do not forget to modify the =/etc/nsswitch.conf= file to use DNS information. Here you can define whether only file based information (=/etc/hosts=) or also the resolver is used.

---+++ How to enable jumbo frames

Refer to the Solaris driver's *e1000g* man page. This is not yet tested!
<pre>
# You can look at the present frame size with
ndd -get /dev/e1000g0 max_frame_size
1514

ifconfig e1000g0 mtu 16128   # enable jumbo frames

# It seems that one can also add the following line to the /etc/hostname.aggr1 file
mtu 16128
</pre>

It might be that one has to change by hand the =MaxFrameSize= line in the driver's configuration file =/kernel/drv/e1000g.conf=, which by default contains the following.
<pre %FILESTYLE%>
MaxFrameSize=0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0;
# 0 is for normal ethernet frames.
# 1 is for upto 4k size frames.
# 2 is for upto 8k size frames.
# 3 is for upto 16k size frames.
# These are maximum frame limits, not the actual ethernet frame
# size. Your actual ethernet frame size would be determined by
# protocol stack configuration (please refer to ndd command man pages)
# For Jumbo Frame Support (9k ethernet packet)
# use 3 (upto 16k size frames)
</pre>

Note: On a freshly installed =Solaris 10 10/08 s10x_u6wos_07b X86= system the ndd command did not accept the max_frame_size parameter. Need to check why this is missing.
<pre>
ndd -get /dev/e1000g0 max_frame_size
operation failed: Invalid argument
</pre>

%INCLUDE{"ToDoItem" what="Look at jumbo frame support for file servers" who="DerekFeichtinger" priority="5"}%

---++ Services

Service management is done with the =svcadm=, =svcs= and =svcprop= commands.

You can get a list of service identifiers + description by doing
<pre>
svcs -o FMRI,DESC
</pre>

Example Usage:
<pre>
svcs                   # lists enabled services (use svcs -a to include disabled ones)
svcs -xv               # diagnose service startup problems
svcs -l gdm            # lists details of service specified by pattern or _FMRI_ ID
svcs -d gdm            # lists all other services on which this service depends
svcs -D fc-cache       # lists all services depending on the given service
svcs -p gdm            # list all pids associated with the named service

svcadm enable gdm      # enable a service
svcadm enable -t gdm   # temporarily enable a service
svcadm disable gdm     # disable a service
svcadm disable -t gdm  # temporarily disable a service
svcadm restart gdm
svcadm clear gdm       # if service is in maintenance state, signal that we have repaired and can start

svcprop gdm            # lists service configuration properties for services framework
</pre>

---+++ Services to disable

Disable these services
<pre>
for n in finger sendmail ppd-cache-update ogl-select gdm; do svcadm disable $n ; done
</pre>

---++ Patches and updates

*!OpenSolaris has no real patch service.* There is a GUI application that can be used to get an overview of packages that can be updated: =packagemanager=. One may also have a look at the =image-update [-nvq]= subcommand of =pkg=.
---++ Compiling with gcc

*Note: Careful! The many updates needed to get gcc working on this installation rendered the system unbootable.* This derived from an update to ZFS, after which the bootloader could no longer read the old zfs root partition.

Make sure that package CSWgcc3 is installed.

Set up the environment
<pre>
export PATH=$PATH:/opt/csw/gcc3/bin
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/opt/csw/gcc3/lib
</pre>

I needed to install the SUNWarc package to get gcc working for !OpenSolaris. This *requires using the new package manager*. It is not very nice that we need to rely on two different package management systems. (tip from http://wiki.csclub.uwaterloo.ca/OpenSolaris)
<pre>
pkg install SUNWarc SUNWsfwhea SUNWhea SUNWtoo
</pre>

This took quite some time, since a number of packages needed to be fetched from the opensolaris web site. I wonder whether we can mirror such packages locally.

The error had been:
<pre>
gcc test.c
ld: fatal: file values-Xa.o: open failed: No such file or directory
ld: fatal: File processing errors. No output written to a.out
collect2: ld returned 1 exit status
</pre>

---++ OpenSolaris boot problem after update

After some update (I do not know whether it was an automatic update or one made by me some days ago), the system was no longer able to boot. The root file system had been installed on a ZFS fs by the OS installation process, and there was no obvious place where one could have selected the boot partition FS as an option. One of the updates seemingly introduced an incompatibility between the bootloader and that version of ZFS. This is the related [[http://bugs.opensolaris.org/view_bug.do;jsessionid=ff095ff1f8ecb7ffffffffcaa1b88f971ad19?bug_id=6534519][OpenSolaris Bug 6534519]]. I was not able to rescue the system as described on that page (though I did not manage to follow the instructions completely). I reinstalled !OpenSolaris.
A further bad point is that I cannot get the ILOM serial console redirection to work with !OpenSolaris. It works for the BIOS startup, but fails before grub comes up.

After the reinstallation, I did a reboot, and the machine (using the graphical ILOM redirection) always came back with a black screen and a "0037" BIOS message in the lower right corner. I found a [[http://fixunix.com/sun/264237-x4100-ilom-hangs-0037-a.html][Forum thread]] on a similar problem. And indeed, powering the machine off for some time and rebooting brought back the !OpenSolaris login prompt.

-- Main.DerekFeichtinger - 05 Aug 2008
Topic revision: r90 - 2013-10-08 - FabioMartinelli
Copyright © 2008-2024 by the contributing authors. All material on this collaboration platform is the property of the contributing authors.