That when you’re working on an Enterprise-class Solaris system, you should check that you’re really booting from the disk you think you’re booting from.
I ran into this problem while patching an old E10k domain. During the patch process, the first thing we do is take a snapshot of many system configuration files (such as /etc/vfstab) and capture the output of several commands (such as vxprint -htA).
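A rough sketch of that first step, written in Python for illustration only. The function name snapshot(), the destination directory and the output file names are all made up here; the post does not say what tooling was actually used.

```python
import shutil
import subprocess
from pathlib import Path


def snapshot(dest, files, commands):
    """Copy config files and save command output under dest.

    files:    iterable of file paths to copy.
    commands: dict mapping an output file name to an argv list.
    Returns the list of paths that were written.
    """
    dest = Path(dest)
    dest.mkdir(parents=True, exist_ok=True)
    written = []
    for src in files:
        src = Path(src)
        # Flatten /etc/vfstab to etc_vfstab so copies do not collide.
        parts = src.parts[1:] if src.is_absolute() else src.parts
        target = dest / "_".join(parts)
        shutil.copy2(src, target)
        written.append(target)
    for name, argv in commands.items():
        # Keep stderr too: an error message is also worth recording.
        result = subprocess.run(argv, capture_output=True, text=True, check=False)
        target = dest / name
        target.write_text(result.stdout + result.stderr)
        written.append(target)
    return written


# Hypothetical usage (paths and file names are assumptions):
# snapshot("/var/tmp/prepatch",
#          ["/etc/vfstab", "/etc/system"],
#          {"vxprint-htA.out": ["vxprint", "-htA"]})
```

The point is only that both the files and the command output end up somewhere that survives the patch run, so they can be compared against the system afterwards.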
The second thing that we do is to break the root mirror by detaching the plex called “rootvol-02.” This unpatched copy of root gives us something to fall back on if things go terribly wrong. Granted, before using such a copy, we would have to boot from the network in order to manually un-encapsulate that copy… Perhaps we should be manually un-encapsulating it before we patch, but I digress…
I patched the machine as normal, and rebooted. Then, “panic” set in:
WARNING: Error writing ufs log state
WARNING: ufs log for / changed state to Error
WARNING: Please umount(1M) / and run fsck(1M)
WARNING: Error writing master during ufs log roll
WARNING: ufs log for / changed state to Error
WARNING: Please umount(1M) / and run fsck(1M)
Cannot mount root on /pseudo/vxio@0:0 fstype ufs

panic[cpu24]/thread=140a000: vfs_mountroot: cannot mount root

0000000001409970 genunix:vfs_mountroot+70 (0, 0, 0, 200, 145ba30, 0)
  %l0-3: 000000000144f400 000000000144f400 0000000000002000 0000000001496690
  %l4-7: 000000000149c400 0000000001412e80 000000000144fc00 0000000001452c00
0000000001409a20 genunix:main+90 (1409ba0, f105bd68, 1409ec0, 38f84d, 2000, 350)
  %l0-3: 0000000000000001 000000000140a000 0000000001414028 0000000000000000
  %l4-7: 0000000078002000 0000000000392000 00000000014a41a0 00000000010688c8

skipping system dump - no dump device configured
rebooting...
Resetting...
This is what you normally see if you try to boot from a stale VxVM mirror copy. It is caused by the following:
1. The plex has been dissociated from the volume “rootvol”
2. /etc/vfstab still references /dev/vx/dsk/rootvol as the root device
3. /etc/system still has all sorts of references to root living on a VxVM device.
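Conditions 2 and 3 can be checked on a copy of the files with a few lines of Python. This is a minimal sketch, not a complete audit: the function names are invented for illustration, and treating any non-comment /etc/system line that mentions “vxio” as a VxVM root reference is an assumption (the panic message above shows root being mounted from /pseudo/vxio@0:0, which is why that string was picked).

```python
def root_device(vfstab_text):
    """Return the 'device to mount' field of the / entry, or None."""
    for line in vfstab_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split()
        # vfstab columns: device to mount, device to fsck, mount point, ...
        if len(fields) >= 3 and fields[2] == "/":
            return fields[0]
    return None


def vxvm_root_references(vfstab_text, system_text):
    """List the settings that tie the root filesystem to VxVM."""
    found = []
    dev = root_device(vfstab_text)
    if dev and dev.startswith("/dev/vx/"):
        found.append("vfstab: / is mounted from " + dev)
    for line in system_text.splitlines():
        line = line.strip()
        # A leading '*' marks a comment in /etc/system.
        if not line or line.startswith("*"):
            continue
        if "vxio" in line:  # assumption: 'vxio' marks a VxVM root setting
            found.append("system: " + line)
    return found
```

If this returns a non-empty list for a disk whose plex is no longer part of rootvol, that disk cannot mount root as it stands, which matches the panic above.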
And this struck me as odd, since the system should have booted from the patched root plex. You know, the one that was still valid. Before I went into panic mode, I decided to poke around a little, and discovered that we just had some wires crossed:
SUNW,Ultra-Enterprise-10000, using Network Console
OpenBoot 3.2.181, 4096 MB memory installed, Serial #10921789.
Ethernet address 0:0:be:a6:a7:3d, Host ID: 80a6a73d.

ok devalias
vx-disk02       /sbus@58,0/QLGC,isp@0,10000/sd@4,0:a
vx-disk01       /sbus@58,0/QLGC,isp@0,10000/sd@0,0:a
disk            /sbus@5d,0/SUNW,socal@1,0/sf@1,0/ssd@0,0:a
net             /sbus@5d,0/SUNW,qfe@0,8c10000
ttya            /ssp-serial
ssa_b_example   /sbus@40,0/SUNW,soc@0,0/SUNW,pln@b0000000,XXXXXX/SUNW,ssd@0,0:a
ssa_a_example   /sbus@40,0/SUNW,soc@0,0/SUNW,pln@a0000000,XXXXXX/SUNW,ssd@0,0:a
isp_example     /sbus@40,0/QLGC,isp@0,10000/sd@0,0
net_example     /sbus@40,0/qec@0,20000/qe@0,0
name            aliases
ok printenv boot-device
boot-device = vx-disk01 net
A cursory review of the pre-mirror-split vxprint -ht output showed this:
dm disk01       c0t0d0s2     sliced   2888     71124291 -
dm disk02       c0t4d0s2     sliced   2888     71124291 -

v  rootvol      -            ENABLED  ACTIVE   62737524 ROUND     -        root
pl rootvol-01   rootvol      ENABLED  ACTIVE   62737524 CONCAT    -        RW
sd disk02-01    rootvol-01   disk02   0        62737524 0         c0t4d0   ENA
pl rootvol-02   rootvol      ENABLED  ACTIVE   62737524 CONCAT    -        RW
sd disk01-01    rootvol-02   disk01   0        62737524 0         c0t0d0   ENA
So, our boot disk is vx-disk01 == c0t0d0s2 == rootvol-02. Which is the stale copy we split off!
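The same cross-check can be done mechanically against the saved vxprint -ht output before the mirror is split. The sketch below is an illustration under one assumption taken from this system: that each boot alias is the VxVM disk media name with “vx-” in front (vx-disk01 for disk01). The function names are hypothetical.

```python
def plexes_by_disk(vxprint_text):
    """Map each disk media name to the plexes with a subdisk on it.

    Uses the 'sd' records of 'vxprint -ht', whose leading fields are:
    sd NAME PLEX DISK ...
    """
    mapping = {}
    for line in vxprint_text.splitlines():
        fields = line.split()
        if len(fields) >= 4 and fields[0] == "sd":
            mapping.setdefault(fields[3], set()).add(fields[2])
    return mapping


def boot_plexes(boot_device, vxprint_text, prefix="vx-"):
    """Return (alias, plexes) for the first entry in boot-device.

    Assumes the alias is the disk media name prefixed with 'vx-'.
    """
    alias = boot_device.split()[0]
    disk = alias[len(prefix):] if alias.startswith(prefix) else alias
    return alias, sorted(plexes_by_disk(vxprint_text).get(disk, ()))
```

Fed the boot-device value “vx-disk01 net” and the vxprint records above, this returns rootvol-02 for vx-disk01, which is exactly the plex that was about to be detached.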
The fix was simple. All I had to do was this:
ok boot vx-disk02 -s
Boot device: /sbus@58,0/QLGC,isp@0,10000/sd@4,0:a  File and args: -s
sorry, variable 'scsi_option' is not defined in the 'kernel'
SunOS Release 5.9 Version Generic_118558-21 64-bit
Copyright 1983-2003 Sun Microsystems, Inc. All rights reserved.
Use is subject to license terms.
I really didn’t need the “-s” argument on the end, but it gave me the opportunity to stop the boot half-way up if I needed to.
The moral of the story is that part of being a good systems administrator is preparation. It is very easy to start working on a system, thinking “Oh, I don’t need to make backups of this or that setting/config file, because I’m just making minor changes.” Seeing this message could easily have led me to believe that the root filesystem was corrupted to the point of needing to restore the machine from tape. It could have led to hours of downtime, and an irate customer. But, since I’d prepared for things like that to go wrong, I had the system up and running again at the cost of only one extra reboot.