Hostnames, directory names, etc. have been changed to protect my job.
The situation, in brief:
A Solaris 9 system that is serving as an SAP on Oracle database server. The system also acts as an NFS server to SAP application servers. This specific system is currently in use as a test system for end-of-fiscal-year processing. As a result, it gets reconfigured often. Many times, in a hurry.
The problem, in brief:
The NFS server is only exporting the LAST entry in /etc/dfs/dfstab. Or more accurately, the NFS server only exports one filesystem at a time.
Yesterday, I got a request from the SAP BASIS group to reconfigure this server in a hurry. One UFS filesystem (on Veritas Volume Manager) needed to be grown, and a second directory needed to be exported via NFS. I added the new filesystem to /etc/dfs/dfstab as usual, and as usual, I ran the shareall command. Then I logged into the application server, added the filesystem to my automount map, restarted autofs, and tried to test the new mountpoint:
[root@client root]# ls /mnt/b
Permission denied
Permission denied. Hmm. Running showmount -e on the server showed the following:
[steelmi1@server steelmi1]$ showmount -e
export list for server:
/export/a
Which is interesting, because the /etc/dfs/dfstab reads thusly:
[steelmi1@server steelmi1]$ cat /etc/dfs/dfstab
share -F nfs /export/b
share -F nfs /export/a
To make sure that I hadn’t malformed the /export/b entry, I reversed the two entries in /etc/dfs/dfstab and ran shareall again, so that it looked like this:
[steelmi1@server steelmi1]$ cat /etc/dfs/dfstab
share -F nfs /export/a
share -F nfs /export/b
[steelmi1@server steelmi1]$ showmount -e
export list for server:
/export/b
So obviously, the entry isn’t malformed.
The next thing I tried was running the individual share commands. No matter which one I ran, it would export that filesystem, and remove the others.
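A quick way to spot this kind of mismatch is to compare what dfstab asks for against what the server actually exports. The following is a minimal sketch, not something I ran at the time. The function name missing_exports is hypothetical, and it assumes each share line ends with the pathname (no trailing options) and that the export list is a saved copy of showmount -e output with one path per line after the header.

```shell
# missing_exports DFSTAB EXPORTS   (hypothetical helper)
#   DFSTAB  - a file in /etc/dfs/dfstab format
#   EXPORTS - saved output of `showmount -e`
# Prints every pathname that has a share line but is not in the export list.
missing_exports() {
    awk '
        # First file (the export list): remember every line that starts with a path.
        FILENAME == ARGV[1] { if ($1 ~ /^\//) exported[$1] = 1; next }
        # Second file (dfstab): assume the pathname is the last field of a share line.
        $1 == "share" && !($NF in exported) { print $NF }
    ' "$2" "$1"
}
```

On the server this could be used as: showmount -e > /tmp/exports; missing_exports /etc/dfs/dfstab /tmp/exports. With the first dfstab shown above it would print /export/b.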
Now, we UNIX administrators are loath to admit that our systems sometimes need to be rebooted. I ran uptime to see how long this machine had been up, and I saw something strange:
[steelmi1@server ~]$ uptime
1:32pm up 3 users, load average: 0.41, 0.23, 0.15
This is what the output of uptime normally looks like:
[steelmi1@server ~]$ uptime
1:32pm up 13 day(s), 21:51, 3 users, load average: 0.41, 0.23, 0.15
So, the “uptime” field was missing from the output. I knew that during end-of-year testing, we often disable the Network Time Protocol daemon, then set the system clock several months forward to more accurately simulate what will happen during closing. I took a quick look at the /var/adm/wtmpx and /var/adm/utmpx files to see if there were any obvious problems with them. First, I copied them to /root, and operated on the copies:
[root@server root]# cp /var/adm/*tmpx /root
[root@server root]# /usr/lib/acct/fwtmp < wtmpx | grep time
old time 0 3 0000 0000 1151077407 0 0 0 Fri Jun 23 10:43:27 2006
new time 0 4 0000 0000 1163001000 0 0 0 Wed Nov 8 09:50:00 2006
old time 0 3 0000 0000 1166355898 0 0 0 Sun Dec 17 05:44:58 2006
new time 0 4 0000 0000 1154431320 0 0 0 Tue Aug 1 06:22:00 2006
old time 0 3 0000 0000 1154956653 0 0 0 Mon Aug 7 08:17:33 2006
new time 0 4 0000 0000 1160226960 0 0 0 Sat Oct 7 08:16:00 2006
old time 0 3 0000 0000 1160227423 0 0 0 Sat Oct 7 08:23:43 2006
new time 0 4 0000 0000 1159881720 0 0 0 Tue Oct 3 08:22:00 2006
[root@server root]# /usr/lib/acct/fwtmp < utmpx
system boot 0 2 0000 0000 1163450999 0 0 0 Mon Nov 13 14:49:59 2006
run-level 3 0 1 0063 0123 1163451051 0 0 0 Mon Nov 13 14:50:51 2006
(OUTPUT TRUNCATED)
[root@server root]# date
Thu Oct 26 13:54:16 CDT 2006
So, since 23 June, the date has been set and reset several times, forward and backward. In fact, the last recorded reboot is about 18 days into the future. As far as the system is concerned, uptime is a negative value!
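The arithmetic can be checked mechanically from the fwtmp text. This is a rough sketch under a few assumptions: the epoch timestamp is the seventh whitespace-separated field, as in the listing above; the shell supports $(( )) arithmetic (bash or ksh, not the legacy Solaris /bin/sh); and the function names clock_steps and boot_skew are my own, purely for illustration.

```shell
# clock_steps: read fwtmp text on stdin and print the size and direction
# of each "old time" / "new time" pair, in days.
clock_steps() {
    awk '
        $1 == "old" && $2 == "time" { old = $7 }
        $1 == "new" && $2 == "time" {
            step = $7 - old
            dir = (step < 0) ? "backward" : "forward"
            if (step < 0) step = -step
            printf "%s %.1f day(s)\n", dir, step / 86400
        }
    '
}

# boot_skew BOOT_EPOCH NOW_EPOCH: report whether the recorded boot time
# lies in the past (normal) or in the future (negative uptime).
boot_skew() {
    diff=$(( $2 - $1 ))
    if [ "$diff" -lt 0 ]; then
        echo "boot record is $(( (0 - diff) / 86400 )) day(s) in the future"
    else
        echo "up $(( diff / 86400 )) day(s)"
    fi
}
```

Feeding the grep output above into clock_steps gives steps of roughly 138 days forward, 138 days back, 61 days forward, and 4 days back. For boot_skew, the boot record is 1163450999; converting the date output (13:54:16 CDT, assuming CDT is UTC-5) gives 1161888856, so boot_skew 1163450999 1161888856 reports the boot record as 18 days in the future.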
The fix, in brief:
With that mystery solved, and with /var/adm/utmpx already backed up, I decided to clear utmpx out, and try running shareall again.
[root@server root]# > /var/adm/utmpx
[root@server root]# shareall; showmount -e
export list for server:
/export/b
/export/a
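For anyone repeating this, the backup-then-truncate step is worth wrapping so the truncation cannot happen unless the copy succeeded. A minimal sketch, assuming a POSIX shell; the function name reset_login_db and the .bak suffix are hypothetical, and truncating utmpx discards the records of current logins, so treat it as a last resort on a test box.

```shell
# reset_login_db FILE BACKUP_DIR   (hypothetical helper)
# Copy FILE into BACKUP_DIR, verify the copy byte for byte, then truncate FILE.
# Returns non-zero and leaves FILE untouched if the backup fails.
reset_login_db() {
    file=$1
    backup="$2/$(basename "$file").bak"
    cp -p "$file" "$backup" || return 1   # keep timestamps and mode on the copy
    cmp -s "$file" "$backup" || return 1  # refuse to truncate on a bad copy
    : > "$file"                           # same effect as "> file" in the session above
}
```

The session above would then become: reset_login_db /var/adm/utmpx /root && shareall; showmount -e.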
I do not know why share cares about the system’s uptime. I’m hoping to have time to go poke around at the OpenSolaris source code to figure out exactly where things went wrong. As it stands, I don’t even know if I could reproduce this problem by changing the system date forward several days or weeks, then rebooting, setting the clock backwards, and then trying to export multiple filesystems. It would be worth finding out at some point.