Bored as toast

I’m in Atlanta for SAP training this week. Amy came down with me on Sunday, but she went home this morning. The end result is that I’m bored and uninspired. I’ve exhausted my ration of Cadbury’s Royal Dark for the week, and I’m down to orange juice and Dogfish Head 90 Minute IPA to stave off dehydration. As our Mr. Jerome noted, thirst is a dangerous thing. Now, the 90 Minute IPA is quite tasty, but the amazing hop presence and slightly elevated ethanol content mean that one has to take it easy with this stuff.

The SAP training is going well. I’ve seen a lot of the material we’re covering before in the real world, but I haven’t usually understood what I was doing or why. This class has done a good job of filling in the gaps in my understanding. I know that I still have a long way to go in understanding why Jerry made some of the decisions he did in designing this product, but that doesn’t make it any less impressive. SAP is a good system, but it’s large and complex. There is a lot to learn about it, and a person so inclined could make a career of it. To a large extent, I have. With the exception of the one year I worked at the NASA Integrated Services Network, I have been supporting SAP systems in one way or another since 1998. With any luck, we’ll keep at it. It seems to pay well, and the hours are mostly good.

Published on June 28, 2007 at 2:04 am

Question:

Is a beer a character device or a block device?

Discuss. Not too heatedly…

Published on February 9, 2007 at 6:37 am

Some observations on Java

For the last week, I’ve gotten a bit of a crash course in Java. Java, at this point, is more than ten years old, but I’ve avoided it for as long as I could.

All that I can say is, FINALLY!!! We now have a programming language that makes it harder to find all of the required dependencies than C does!

All I was trying to do was to recompile a few classes! There are no makefiles. There are no hints or clues as to where to look. I found myself doing things like this, far too often:

for i in `find . -name '*.jar' -print`
do
    export CLASSPATH=$CLASSPATH:${i}
done

javac blah.java

I try to stay away from this sort of brute force compilation because too much can go wrong (multiple, incompatible versions of the same .jar comes immediately to mind). In this case, I was lucky. I eventually got everything to compile, but I was still angry after it did.
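If I have to do the brute-force thing anyway, a slightly more careful version can at least warn me about the duplicate-jar problem. This is only a sketch under a few assumptions: `build_classpath` is a hypothetical helper name, it treats two jars with the same file name as "the same .jar" (probably two versions of one library), it only warns rather than choosing a winner, and it does not handle paths containing spaces.

```shell
#!/bin/sh
# Hypothetical helper: collect every .jar under a directory into a
# CLASSPATH-style string, warning on stderr when the same jar file name
# appears more than once (often two versions of the same library).
build_classpath() {
    dir=${1:-.}
    cp=""
    seen=""
    # Quote the pattern so find, not the shell, expands '*.jar'.
    for jar in $(find "$dir" -name '*.jar' -print | sort); do
        base=$(basename "$jar")
        case " $seen " in
            *" $base "*) echo "warning: duplicate jar name: $base ($jar)" >&2 ;;
        esac
        seen="$seen $base"
        # Append with a ':' separator, except before the first entry.
        cp="${cp:+$cp:}$jar"
    done
    printf '%s\n' "$cp"
}
```

Usage would be something like `CLASSPATH=$(build_classpath .); export CLASSPATH` followed by `javac blah.java`. It still picks up every jar it finds; the warnings just tell you where to look when the compile (or the runtime) misbehaves.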

Published on February 5, 2007 at 8:11 pm

Notable news…

Several interesting developments lately…

1. In mid-November, I’ll be changing jobs. Working in the same building, for the same company, just in a different capacity.
2. Saturday, Amy and I celebrated our 6th wedding anniversary.
3. Sierra Nevada Pale Ale on draft in Alabama!
4. Fuller’s ESB reappears in TN… Albeit in smaller 11.2 oz (330 ml) bottles. 330s are popular in Europe, but they make me mad. Still, I’m glad to have been able to lay in a small supply. There are no hops but Fuggles. And Fuller’s ESB is their prophet.

Published on October 11, 2006 at 2:09 am

The Mysterious ‘shareall’ problem…

Hostnames, directory names, etc have been changed to protect my job.

The situation, in brief:

A Solaris 9 system that is serving as an SAP on Oracle database server. The system also acts as an NFS server to SAP application servers. This specific system is currently in use as a test system for end-of-fiscal-year processing. As a result, it gets reconfigured often. Many times, in a hurry.

The problem, in brief:

The NFS server is only exporting the LAST entry in /etc/dfs/dfstab. Or more accurately, the NFS server only exports one filesystem at a time.

Yesterday, I got a request from the SAP BASIS group to reconfigure this server in a hurry. One UFS filesystem (on Veritas Volume Manager) needed to be grown, and a second directory needed to be exported via NFS. I added the new filesystem into /etc/dfs/dfstab as usual, and as usual, I ran the shareall command. Then I logged into the application server, added the filesystem into my automount map, restarted autofs, and tried to test the new mountpoint:

[root@client root]# ls /mnt/b
Permission denied

Permission denied. Hmm. Running showmount -e on the server showed the following:

[steelmi1@server steelmi1]$ showmount -e
export list for server:
/export/a

Which is interesting, because the /etc/dfs/dfstab reads thusly:

[steelmi1@server steelmi1]$ cat /etc/dfs/dfstab
share -F nfs /export/b
share -F nfs /export/a

In order to make sure that I hadn’t malformed the /export/b entry, I reversed the two entries in /etc/dfs/dfstab so that it now looks like this:

[steelmi1@server steelmi1]$ cat /etc/dfs/dfstab
share -F nfs /export/a
share -F nfs /export/b
[steelmi1@server steelmi1]$ showmount -e
export list for server:
/export/b

So obviously, the entry isn’t malformed.

The next thing I tried was running the individual share commands. No matter which one I ran, it would export that filesystem, and remove the others.
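When dfstab and the live export list disagree like this, a quick mechanical comparison beats eyeballing two listings. A minimal sketch, assuming you have saved the dfstab and the `showmount -e` output to files; `missing_exports` is a hypothetical name, and it assumes each share line ends with the pathname (no optional resource name after it) and that showmount prints one path per line after its header.

```shell
#!/bin/sh
# Hypothetical helper: print paths requested by "share" lines in a
# dfstab-style file ($1) that do not appear in saved `showmount -e`
# output ($2). Note: the FNR == NR trick assumes $1 is not empty.
missing_exports() {
    awk '
        FNR == NR { if ($1 == "share") want[$NF] = 1; next }  # dfstab
        FNR > 1   { have[$1] = 1 }                            # skip header
        END       { for (p in want) if (!(p in have)) print p }
    ' "$1" "$2" | sort
}
```

On this server it would have printed /export/b with the original dfstab ordering, and nothing at all once shareall was behaving again.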

Now, we UNIX administrators are loath to admit that our systems sometimes need to be rebooted. I ran uptime to see how long this machine had been up, and I saw something strange:

[steelmi1@server ~]$ uptime
1:32pm up 3 users, load average: 0.41, 0.23, 0.15

This is what the output of uptime normally looks like:

[steelmi1@server ~]$ uptime
1:32pm up 13 day(s), 21:51, 3 users, load average: 0.41, 0.23, 0.15

So, the “uptime” field was missing from the output. I knew that during end-of-year testing, we often disable the Network Time Protocol daemon, then set the system clock several months forward to more accurately simulate what will happen during closing. I took a quick look at the /var/adm/wtmpx and /var/adm/utmpx files to see if there were any obvious problems with them. First, I copied them to /root, and operated on the copies:

[root@server root]# cp /var/adm/*tmpx /root
[root@server root]# /usr/lib/acct/fwtmp < wtmpx | grep time
old time 0 3 0000 0000 1151077407 0 0 0 Fri Jun 23 10:43:27 2006
new time 0 4 0000 0000 1163001000 0 0 0 Wed Nov 8 09:50:00 2006
old time 0 3 0000 0000 1166355898 0 0 0 Sun Dec 17 05:44:58 2006
new time 0 4 0000 0000 1154431320 0 0 0 Tue Aug 1 06:22:00 2006
old time 0 3 0000 0000 1154956653 0 0 0 Mon Aug 7 08:17:33 2006
new time 0 4 0000 0000 1160226960 0 0 0 Sat Oct 7 08:16:00 2006
old time 0 3 0000 0000 1160227423 0 0 0 Sat Oct 7 08:23:43 2006
new time 0 4 0000 0000 1159881720 0 0 0 Tue Oct 3 08:22:00 2006
[root@server root]# /usr/lib/acct/fwtmp < utmpx
system boot 0 2 0000 0000 1163450999 0 0 0 Mon Nov 13 14:49:59 2006
run-level 3 0 1 0063 0123 1163451051 0 0 0 Mon Nov 13 14:50:51 2006
(OUTPUT TRUNCATED)
[root@server root]# date
Thu Oct 26 13:54:16 CDT 2006

So, since 23 June, the date has been set and reset several times, forward and backward. In fact, the last recorded reboot (Mon Nov 13) is about 18 days into the future relative to the current date (Thu Oct 26). As far as the system is concerned, uptime is a negative value!
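Spotting the backward jumps by eye works for eight lines, but on a busier wtmpx a small filter helps. This is a sketch, not a tested tool: `find_backward_clock_steps` is a hypothetical name, and it assumes the epoch-seconds value is the 7th whitespace-separated field of each "old time"/"new time" line, as in the fwtmp output above.

```shell
#!/bin/sh
# Hypothetical filter: read fwtmp text output on stdin and report each
# "old time" -> "new time" pair in which the clock was set backwards.
# Assumes field 7 holds the epoch seconds, as in the listing above.
find_backward_clock_steps() {
    awk '
        $1 == "old" && $2 == "time" { old = $7; next }
        $1 == "new" && $2 == "time" && old != "" {
            delta = $7 - old
            if (delta < 0)
                printf "clock set back %d seconds (%.1f days) at record %d\n", -delta, -delta / 86400, NR
            old = ""
        }
    '
}
```

Used as `/usr/lib/acct/fwtmp < wtmpx | find_backward_clock_steps`, the wtmpx excerpt above should yield two hits: about 138 days backwards (Dec 17 to Aug 1) and about 4 days backwards (Oct 7 to Oct 3).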

The Fix, in brief:

With that mystery solved, and with /var/adm/utmpx already backed up, I decided to clear utmpx out, and try running shareall again.

[root@server root]# > /var/adm/utmpx
[root@server root]# shareall; showmount -e
export list for server:
/export/b
/export/a

I do not know why share cares about the system’s uptime. I’m hoping to have time to go poke around at the OpenSolaris source code to figure out exactly where things went wrong. As it stands, I don’t even know if I could reproduce this problem by changing the system date forward several days or weeks, then rebooting, setting the clock backwards, and then trying to export multiple filesystems. It would be worth finding out at some point.

Published on August 31, 2006 at 3:21 pm

Shared Memory Settings for Solaris 10 w/o Zones

Here is how I set up the shared memory parameters for Oracle on Solaris 10.

1. Create a new “project” for oracle with the appropriate resource manager controls:

# projadd -U oracle -p 108 -G dba -c "Oracle Project" -K "project.max-sem-ids=(priv,200,deny)" -K "process.max-sem-nsems=(priv,512,deny)" -K "project.max-shm-memory=(priv,4294967295,deny)" -K "project.max-shm-ids=(priv,200,deny)" oracle

2. Create the following entry in /etc/user_attr to set the oracle user’s default project to the oracle project we created in step 1. This will make sure that when the oracle user logs in and starts any new processes, they will be assigned the oracle project.

oracle::::project=oracle

3. Reassociate all of oracle’s currently running processes with the oracle project.

# for i in `pgrep -u oracle`; do newtask -v -p oracle -c $i; done

This works without a reboot.

Also, the prstat command is now aware of zones and projects… So “top” is now obsolete on Solaris 10. Running prstat -J will show you per-project stats.
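The 4294967295 in the projadd command is 2^32 - 1, one byte short of 4 GiB. If you need a different shared memory ceiling, letting the shell do the arithmetic avoids fat-fingering a ten-digit number. A small sketch; `gib_to_bytes` and `shm_memory_control` are hypothetical helper names, and it assumes a shell with 64-bit arithmetic (bash and dash qualify).

```shell
#!/bin/sh
# Hypothetical helpers for sizing project.max-shm-memory.

# Convert a size in GiB to bytes (1 GiB = 1024^3 bytes).
gib_to_bytes() {
    echo $(( $1 * 1024 * 1024 * 1024 ))
}

# Print a -K value in the same (privilege,value,action) form
# used in the projadd command above.
shm_memory_control() {
    printf 'project.max-shm-memory=(priv,%s,deny)\n' "$(gib_to_bytes "$1")"
}
```

For example, `shm_memory_control 8` prints `project.max-shm-memory=(priv,8589934592,deny)`, which can be pasted after `-K` (quoted, because of the parentheses).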

Published on July 11, 2006 at 7:02 pm

Moving from Emulex lpfc driver to Sun’s emlxs.

The emlxs driver seems to be part of Sun’s SAN Foundation Software, and does not require static bindings like the lpfc driver did (no more lpfc.conf, yay!).

Here is what I did to install the Sun StorEdge SFS 4.4.10 driver on an E10k domain with an LP9000 card:

  1. pkgrm lpfc
  2. Reboot
  3. Install the StorEdge SAN Foundation Software 4.4.10, using the install_it script.
  4. Reboot
  5. Run cfgadm -la to identify the controller number. In our case, it was c6.
  6. Use cfgadm to configure the devices on the appropriate controller. In this case, the command was:
        # cfgadm -c configure c6
  7. Use format to label the disk.

I used Sun document 817-3672-11 to figure out how to configure the driver.
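On a domain with several controllers, picking the Fibre Channel one out of `cfgadm -la` by eye is error-prone. A sketch of a filter, assuming the usual five columns (Ap_Id, Type, Receptacle, Occupant, Condition) and that FC controllers report a Type beginning with "fc" (such as fc-fabric). `fc_controllers` is a hypothetical name, and the sample listing in the test is made up, not captured from our E10k.

```shell
#!/bin/sh
# Hypothetical filter: read `cfgadm -la` output on stdin and print the
# Ap_Id of each Fibre Channel controller. Skips the header line and the
# per-device entries (Ap_Ids containing "::").
fc_controllers() {
    awk 'NR > 1 && $2 ~ /^fc/ && $1 !~ /::/ { print $1 }'
}
```

Something like `cfgadm -la | fc_controllers` should then print the controller to hand to `cfgadm -c configure`.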

Published at 6:57 pm

E25k Exploded Diagram

Here is a decent diagram from the non-login side of Sunsolve that shows an exploded view of the E25k, with parts labeled.

Note that the IO Board is labeled “hsPCI Assembly.” The original diagram can be found on this page.

Published on June 26, 2006 at 2:13 am

Sun High-End Server Administration

Introduction

I am going to attempt to explain a little bit about system administration on Sun Microsystems’ current high-end server offerings, the SunFire 15000 and the SunFire E25k. They are mostly the same hardware. The primary exception is the presence of UltraSPARC IV dual-core processors in the E25k, while the 15k uses the older, single-core UltraSPARC III chips.

You may have noticed the difference in nomenclature. 15000 versus E25k. This is primarily a hold-over from Sun’s previous generation high-end system, the Enterprise 10000. This system was typically called the “E10k,” though the badges on the front read “Enterprise 10000.” When the SunFire moniker came around, Sun dropped the “Enterprise” nomenclature completely. Still, most people couldn’t get out of the habit of calling the SunFire 15000 “E15k.” Going forward with the E25k, Sun as much as admitted defeat, and simply called the system “E25k.” It is much easier to say anyway.

I’m writing this article mostly from memory; however, I may paste in a few sanitized examples from running hardware. As far as I know, I’m not revealing any information that you couldn’t otherwise get by going to http://docs.sun.com and looking up the appropriate information.

Specs

First and foremost, I’d like to say that these machines are impressive. They are large. They are loud. They have a lot of blinking lights. They use a lot of power.

The E25k is newer and more capable, so I will list its specs instead of the 15k. However, the 15k specs are not much different.

Maximum number of CPUs          72 dual-core UltraSPARC IV+
CPU speed                       1.5 GHz
L2 cache per CPU                2 MB
L3 cache per CPU                32 MB
Maximum RAM                     1152 GB
Power supplies                  12, each 30 A single-phase, 200-240 VAC
Input power                     Approx. 26,000 W
Heat output/cooling required    Approx. 90,000 BTU/h
Weight                          2513.7 lb

… and that is just the system cabinet. That configuration contains no disks to speak of…

There are actually 4 small disks, but they are not directly usable by the system. I’ll explain later…
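The heat figure in the spec list is essentially the input power in different units: nearly all of the electrical power a machine draws ends up as heat, and 1 W is about 3.412 BTU/h. As a quick sanity check on the numbers (my arithmetic, not a Sun figure):

```latex
26{,}000\ \mathrm{W} \times 3.412\ \frac{\mathrm{BTU/h}}{\mathrm{W}} \approx 88{,}700\ \mathrm{BTU/h} \approx 90{,}000\ \mathrm{BTU/h}
```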

Layout

The proper term for the system cabinet is “platform,” though we often use the word “frame.” Sun uses “platform” in their documentation, so I will do the same.

The platform has 18 slots for “expander” boards. Nine in the front, and nine in the back. The brain of the box is the cross-bar switched centerplane. The centerplane is properly named the “Sun Fireplane Interconnect,” and is pretty amazing on its own. The peak bandwidth of the centerplane is 43 Gigabytes per second (gigabytes NOT gigabits).

The centerplane board was one of the biggest weaknesses of the older E10k design. E10k “system” boards, containing CPU, memory, and I/O were plugged directly into the E10k centerplane. We had to change the centerplane in one E10k twice, due directly to bent pins on the centerplane. Both times, it was a major operation, and required expenditure of the biggest resource these machines are designed to prevent: downtime.

The E25k expander boards (EX) solve this problem by acting as a sort of middle board. The expander plugs into the centerplane. The system boards (SB), containing CPU and RAM, and the I/O boards (IO), containing 4 individual PCI cassettes each, plug into the expander boards.

Expander boards in the platform are numbered EX0 – EX17. System boards in the platform are numbered SB0 – SB17. I/O boards in the platform are numbered IO0 – IO17.

The way it works is that IO0 and SB0 both plug into EX0, which plugs into the centerplane. EX17 is directly behind SB0. The slots are numbered from right to left.

FRONT                                BACK
<---------------------------------------->
SB0                                  SB17
    ---> EX0 ---> CP <--- EX17 <---
IO0                                  IO17
<---------------------------------------->

Each SB holds 4 x dual-core CPUs and up to 64 GB of RAM.
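Multiplying those per-board figures by the 18 system board slots reproduces the platform maximums in the spec list:

```latex
\begin{aligned}
18\ \text{SB} \times 4\ \tfrac{\text{CPUs}}{\text{SB}} &= 72\ \text{CPUs} \\
72\ \text{CPUs} \times 2\ \tfrac{\text{cores}}{\text{CPU}} &= 144\ \text{cores} \\
18\ \text{SB} \times 64\ \tfrac{\text{GB}}{\text{SB}} &= 1152\ \text{GB}
\end{aligned}
```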

Each IO board contains 2 independent I/O controllers, and each controller provides two PCI buses. So there are 4 PCI slots on each IO board, one per bus. Each PCI slot takes the form of a cassette that you can eject (individually) while the system is running (with some limitations).

Dynamic System Domains

A Dynamic System Domain is a physically distinct Solaris system carved from inside the platform.
It is common to just call them “domains.” Each domain requires:

  • At least one SB.
  • At least one IO.
  • Some sort of disk controller connected to one of the 4 PCI slots in the IO.
  • Some sort of NIC connected to one of the 4 PCI slots in the IO.
  • A distinct hostname.
  • Its very own Solaris instance. I know that at least Solaris 9 and 10 are supported. I’m not sure about Solaris 8 any more.

A domain can be as small as one system board, or as many as 18. This also means that you can have a minimum of one domain in the platform and a maximum of 18… If you wanted all 72 CPUs (144 cores) in one big system, you would make a single big domain.

They’re called dynamic because (with some limitations) you can add boards to or remove them from a running domain without shutting it down. When you need more CPUs, you simply attach another board to the domain. This is achieved through a process called Dynamic Reconfiguration.

This brings us to one limitation of the E25k. Domain granularity. You have to keep all of the memory and CPUs on a single SB in the same domain. You cannot split the system board. The same goes for IO boards. All 4 PCI slots MUST go into the same domain. You cannot allocate 2 PCI slots for DOMAIN A and 2 PCI slots for DOMAIN B.

System Controllers (SCs)

Each platform has two built-in System Controllers (SC0 and SC1). They are configured to be redundant. One is called MAIN and the other is SPARE. You can force fail-over between them if you need to work on one for any reason. The SCs are basically Sun Ultra10 workstations, built into the platform. There are special slots in the centerplane for the SCs, one in front, one in back. They handle things like:

  • Create/Delete domains
  • Facilitate Dynamic Reconfiguration
  • Provide Power On Self Test (POST) to domains
  • Provide virtual Keyswitch to domains
  • Report on platform configuration
  • Control platform fan speeds
  • Platform startup/shutdown
  • Provide domain firmware (OBP)
  • etc…

Conclusion

I’ve given a brief overview of some of the E25k basics, but we’ve only scratched the surface. In Part II, I’ll go into more detail on how to use the System Controllers to perform several platform operations. You can find more information at Sun’s web site http://www.sun.com or at Sunsolve. Sunsolve may require a service contract to see the Sun System Handbook these days, but it might not. It would be worth a try anyway because the Sun System Handbook is full of useful information.

Published at 1:17 am

It is probably worth remembering and documenting…

That when you’re working on an Enterprise class Solaris system, you should check that you’re booting off of the disk you really think you’re booting off of.

I ran into this problem while patching an old E10k domain. During the patch process, the first thing we do is to take a snap-shot of many system configuration files (such as /etc/vfstab) and capture the output of running several commands (such as vxprint -htA).

The second thing that we do is to break the root mirror by detaching the plex called “rootvol-02.” This un-patched copy of root gives us a path to fall-back to, if things go terribly wrong. Granted, before using such a copy, we would have to boot from the network in order to manually un-encapsulate that copy… Perhaps we should be manually un-encapsulating it before we patch, but I digress…

I patched the machine as normal, and rebooted. Then, “panic” set in:

WARNING: Error writing ufs log state
WARNING: ufs log for / changed state to Error
WARNING: Please umount(1M) / and run fsck(1M)
WARNING: Error writing master during ufs log roll
WARNING: ufs log for / changed state to Error
WARNING: Please umount(1M) / and run fsck(1M)
Cannot mount root on /pseudo/vxio@0:0 fstype ufs

panic[cpu24]/thread=140a000: vfs_mountroot: cannot mount root

0000000001409970 genunix:vfs_mountroot+70 (0, 0, 0, 200, 145ba30, 0)
  %l0-3: 000000000144f400 000000000144f400 0000000000002000 0000000001496690
  %l4-7: 000000000149c400 0000000001412e80 000000000144fc00 0000000001452c00
0000000001409a20 genunix:main+90 (1409ba0, f105bd68, 1409ec0, 38f84d, 2000, 350)
  %l0-3: 0000000000000001 000000000140a000 0000000001414028 0000000000000000
  %l4-7: 0000000078002000 0000000000392000 00000000014a41a0 00000000010688c8

skipping system dump - no dump device configured
rebooting...
Resetting...

This is what you normally see if you try to boot from a stale VxVM mirror copy. It is caused by the following:
1. The plex has been dis-associated from the volume “rootvol”
2. /etc/vfstab still references /dev/vx/dsk/rootvol as the root device
3. /etc/system still has all sorts of references to root living on a VxVM device.

And this struck me as odd, since the system should have booted from the patched root plex. You know, the one that was still valid. Before I went into panic mode, I decided to poke around a little, and discovered that we just had some wires crossed:

SUNW,Ultra-Enterprise-10000, using Network Console
OpenBoot 3.2.181, 4096 MB memory installed, Serial #10921789.
Ethernet address 0:0:be:a6:a7:3d, Host ID: 80a6a73d.

ok devalias
vx-disk02                /sbus@58,0/QLGC,isp@0,10000/sd@4,0:a
vx-disk01                /sbus@58,0/QLGC,isp@0,10000/sd@0,0:a
disk                     /sbus@5d,0/SUNW,socal@1,0/sf@1,0/ssd@0,0:a
net                      /sbus@5d,0/SUNW,qfe@0,8c10000
ttya                     /ssp-serial
ssa_b_example            /sbus@40,0/SUNW,soc@0,0/SUNW,pln@b0000000,XXXXXX/SUNW,ssd@0,0:a
ssa_a_example            /sbus@40,0/SUNW,soc@0,0/SUNW,pln@a0000000,XXXXXX/SUNW,ssd@0,0:a
isp_example              /sbus@40,0/QLGC,isp@0,10000/sd@0,0
net_example              /sbus@40,0/qec@0,20000/qe@0,0
name                     aliases
ok printenv boot-device
boot-device =         vx-disk01 net

A cursory review of the pre-mirror-split vxprint -ht output showed this:

dm disk01       c0t0d0s2     sliced   2888     71124291 -
dm disk02       c0t4d0s2     sliced   2888     71124291 -

v  rootvol      -            ENABLED  ACTIVE   62737524 ROUND     -        root
pl rootvol-01   rootvol      ENABLED  ACTIVE   62737524 CONCAT    -        RW
sd disk02-01    rootvol-01   disk02   0        62737524 0         c0t4d0   ENA
pl rootvol-02   rootvol      ENABLED  ACTIVE   62737524 CONCAT    -        RW
sd disk01-01    rootvol-02   disk01   0        62737524 0         c0t0d0   ENA

So our boot disk is vx-disk01 == c0t0d0s2 == rootvol-02, which is the stale copy we split off!
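A check like this is cheap to do before the mirror split rather than after the panic. The sketch below resolves boot-device aliases against a saved copy of the devalias listing and pulls out the SCSI target from each disk path. Assumptions: `resolve_boot_devices` is a hypothetical name; the listing has the alias in column 1 and the path in column 2, as above; the target is the hex number after the last "sd@" or "ssd@". The controller number (the c0 in c0t0d0) is not in the OpenBoot path and has to come from elsewhere, such as /etc/path_to_inst or the /dev/dsk links. It uses `$( )`, so run it under ksh or bash rather than the old Bourne /bin/sh.

```shell
#!/bin/sh
# Hypothetical helper: resolve OBP aliases ($2...) against a saved
# `devalias` listing ($1) and report the SCSI target of disk paths.
resolve_boot_devices() {
    aliases_file=$1
    shift
    for name in "$@"; do
        path=$(awk -v a="$name" '$1 == a { print $2; exit }' "$aliases_file")
        if [ -z "$path" ]; then
            echo "$name: no such alias"
            continue
        fi
        # Extract the hex digits between the last "sd@"/"ssd@" and the comma.
        target=$(printf '%s\n' "$path" | sed -n 's/.*\/s*sd@\([0-9a-f]*\),.*/\1/p')
        if [ -n "$target" ]; then
            echo "$name -> $path (SCSI target $target)"
        else
            echo "$name -> $path"
        fi
    done
}
```

Fed the listing above with `vx-disk01 net`, it reports target 0 for vx-disk01, which lines up with c0t0d0, disk01, and therefore rootvol-02 in the vxprint output.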

The fix was simple. All I had to do was this:

ok boot vx-disk02 -s
Boot device: /sbus@58,0/QLGC,isp@0,10000/sd@4,0:a  File and args: -s
sorry, variable 'scsi_option' is not defined in the 'kernel'
SunOS Release 5.9 Version Generic_118558-21 64-bit
Copyright 1983-2003 Sun Microsystems, Inc.  All rights reserved.
Use is subject to license terms.

I really didn’t need the “-s” argument on the end, but it gave me the opportunity to stop the boot half-way up if I needed to.

The moral of the story is that part of being a good systems administrator is preparation. It is very easy to start working on a system, thinking “Oh, I don’t need to make backups of this or that setting/config file, because I’m just making minor changes.” Seeing this message could easily have led me to believe that the root filesystem was corrupted to the point of needing to restore the machine from tape. It could have led to hours of downtime, and an irate customer. But, since I’d prepared for things like that to go wrong, I had the system up and running again at the cost of only one extra reboot.

Published on June 8, 2006 at 3:41 pm

When your only tool is a hammer…

As a UNIX systems administrator, I’ve picked up bits and pieces of probably a dozen different programming languages. I try to avoid the programming-language-elitism that I see in other system administrators:

  • “If it can’t be done in <XYZ programming language>, it isn’t worth doing!” (Where XYZ mostly == C)
  • “<XYZ programming language> sucks!” (Where XYZ mostly == Java)

I’m most well versed (I won’t say “skilled”) in Perl and Bourne shell. But some times, another language that I’m not so used to is simply a better tool for the job at hand. So how do I decide which one to use? I have a few guidelines:

  • For quick-and-dirty one-timers, I usually use Bash + sed + awk. I have talked about awk before, but I prefer not to dig into it any more deeply than I already have.
  • Solaris start-up scripts force you to use old-school Bourne shell. Bash won’t work here due to the way Solaris executes its startup scripts:

    for i in script1 script2 ...
    do
    /bin/sh $i start
    done
  • Anything that requires arithmetic, processing more than one input parameter, or almost any kind of string comparison, usually gets an automatic upgrade to Perl.
  • Interacting with applications in OS X requires AppleScript
  • Interacting with Solaris/UNIX system libraries obviously requires C. Even though I’m not very good at C yet, some times it is actually quicker for me to get things done with it than with Perl, due simply to the fact that I can call C libraries that I know are already on the system. Solaris has been shipping with Perl since at least version 8, but some modules that I might be tempted to use are only available from CPAN, and thus are not installed on all of my systems.
  • Anything that needs to run fast is normally C.
  • Almost anything that needs to be written fast is normally Perl when I have a choice.
  • String processing/manipulation is abysmal in C, and I prefer using Perl.
  • Occasionally, I have been known to resort to doing very demented things like writing Perl code that does nothing except output Bourne shell code.
  • Data structures in Perl are abysmal, so for anything more complex than a simple array or hash, I usually revert to C. Even if it is more painful to write, it is easier for me to understand.
  • Almost anything that I’m forced to write for Windows is done in Visual Basic Scripting Edition. I prefer not having to do this, and avoid it when possible.
  • Simple web applications might get Perl. Lately, they’re Ruby-on-Rails, which I love.
  • I try to avoid building GUI programs on UNIX when possible. I’ve used Perl-Tk before. I like the look of KDE applications, but since KDE isn’t reliably available on all of my Solaris systems, this isn’t usually a good choice. Writing directly for the X Windows protocol isn’t a good idea. Ever. Generally, I try to make things into web applications to avoid writing UNIX GUIs.
  • GUI programs for OS X can be AppleScript or Objective C/Cocoa. Since I know the Cocoa framework is always there, it eliminates the problems normally associated with GUI programming under UNIX. If only Sun would adopt Apple’s GUI, running on top of a Solaris core, I’d be happy.
  • If the problem is well-suited to an object oriented solution (not all of them are, contrary to what CIS students learn at University these days), I might use Ruby for fast written programs and C++ for fast running ones. I know how to spell Java, and I’m quite fond of the coffee from there, but I don’t know much else.
  • SPARCv9 Assembler: If it can’t be done in SPARCv9 Assembler… Oh who am I kidding…
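Since the startup scripts get handed to /bin/sh, the safe habit is to write them in plain Bourne syntax: no `[[ ]]`, no arrays, and backticks rather than `$( )`, which the old Bourne shell does not understand. Here is a minimal skeleton under those constraints. "exampled" is a made-up service name and the actions only echo; the case statement is wrapped in a hypothetical function `rc_example` only so it can be exercised without installing it. A real /etc/init.d script would run the case at top level and exit with its status.

```shell
#!/bin/sh
# Hypothetical rc script skeleton using only Bourne shell syntax.
# "exampled" is a placeholder; start/stop just echo here.
rc_example() {
    case "$1" in
    start)
        echo "Starting exampled"
        ;;
    stop)
        echo "Stopping exampled"
        ;;
    *)
        echo "Usage: $0 { start | stop }"
        return 1
        ;;
    esac
}
```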

Published on May 19, 2006 at 2:44 pm

Busy Weekend

It has been a fairly productive weekend. Amy and I did a good deal of much-needed house work, but we got that done pretty quickly. Amy had some Mary-Kay seller come out to the house to do some sort of something.

I took this as an opportunity to run off to Swan Creek Shooting Range for some trigger-time. I put 200 rounds of .45 ACP down-range, and turned my target, a cardboard box with one of those fluorescent targets taped to it, into Swiss cheese at about 18-20 yards. The target told a story, and the story was this: “I may not be able to shoot very well, but I have a lot of ammo.”

I had quite a lot (8-10?) of fail-to-feed-last-round stoppages with both the Kimber and Wilson Combat magazines. The bullets were getting pinned nose-up to the top of the chamber. It seems like it may have stopped during the last 50 rounds or so. That makes a total of 600 rounds through the Kimber. Hopefully it is “broken-in” now, and I won’t see any more.

Grabbed some coffee at Starbucks in Athens on the way. A cup of Italian and a pound of Arabian Mocha Sanani. I really like Yemeni Mokka coffees. No “blends” for me, thank you! I made a pot of it today. The beans were quite oily, and the cup was excellent even out of my cheap auto-drip brewer.

Stopped by and visited my uncle and his family on the way home. I rarely get to see them, but as I was passing by their house on my way home, and they were outside, it would have been rude not to.

Amy and I watched two terrible films. “Jackie Chan is the Prisoner” and “Flash Gordon.” Both of them were absolute stinkers.

We also rented a 10′x10′ storage room to move some of Amy’s school stuff into. We relocated a truck load of boxes from our garage, which will help my state of mind tremendously.

Right now, I’m trying out Fedora Core 5 on a laptop. I’m just curious to see what they’ve changed with this rev. I’ll be happy when either Solaris 10 becomes really useful as a desktop OS (I doubt this will ever happen) or I can get OS X that runs on a Dell.

Today was the first day in two weeks that I’ve been out of bed after 6:00 AM. I slept-in until 8:30. Tomorrow, it is back to 5:30 AM.

Right now, I’m off to read some Heinlein.

Published on May 14, 2006 at 8:47 pm

How I fixed my corrupted Entourage Database

Entourage is Microsoft’s version of Outlook for the Mac.

Yesterday, when I came into work, I found that my G5 was locked up. Hmm. The last thing I did Friday was to kick off a Norton Antivirus scan. I don’t normally do this, but I thought that I might, just in the interest of not spreading any virii that might be hiding in my mail file to PCs, etc.

So, the G5 was locked up. Cycling the power revealed that I had some pretty bad filesystem corruption on both partitions. After some trickery with reformatting my second partition, and using Carbon Copy Cloner to make that partition bootable, I got my root partition properly through fsck_hfs.

So, most of everything seems to be fine except my Entourage database. Running the rebuild tool revealed I/O errors reading the file. So. I performed the requisite Google searches, and came across this.

Once I located the file on the hard disk, I ran it through the usual paces to see what was there… cat, strings, cp, and several others all seemed to give an I/O error after reading about 18 MB of an 837 MB file. However, running tail -100 Database showed me that there was still data accessible at the end of the file. So, the whole thing wasn’t corrupt, just some of it. I just have to be tricksy enough to be able to get at it. To make a long story short, here is what I did:

# dd if=Database of=database.new bs=512 conv=noerror,sync
# mv Database Database.hosed
# mv database.new Database

Start the Entourage Database tool.
Rebuild your database.
Most of your mail is hopefully recovered.
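The `conv=noerror,sync` pair is what makes the dd trick work, and it has a side effect worth knowing about: every short or unreadable input block is padded with NUL bytes up to the block size, so the copy can come out larger than the original, and the unreadable stretches come back as zeros rather than as your mail. The sketch below shows the padding half of that on a harmless 1000-byte file; `demo_dd_sync_padding` is a hypothetical name and it does not simulate real I/O errors.

```shell
#!/bin/sh
# Hypothetical demo: copy a 1000-byte file with bs=512 conv=noerror,sync
# and print the size of the copy. The final 488-byte block is padded with
# NULs to a full 512 bytes, so the copy is 1024 bytes long.
demo_dd_sync_padding() {
    dir=$(mktemp -d)
    dd if=/dev/urandom of="$dir/in" bs=1000 count=1 2>/dev/null
    dd if="$dir/in" of="$dir/out" bs=512 conv=noerror,sync 2>/dev/null
    wc -c < "$dir/out" | tr -d ' '
    rm -rf "$dir"
}
```

The small 512-byte block size is a reasonable trade-off here: each bad read costs at most one 512-byte block of zeros, at the price of a slower copy. Entourage's rebuild tool then has to cope with those zeroed holes.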

Published on February 14, 2006 at 10:00 pm

Tech Time

One of the interesting things about UNIX is that it is an anthropologist’s dream. It is full of artifacts from 35+ years of hundreds of programmers hacking away. One of the things I have always liked about UNIX is the fact that it is old, arcane, and inscrutable. I’ve been working with UNIX systems for more than ten years now, and I’m always surprised by the fact that I’ve only scratched the surface of what is there. Even after ten years, there is always something new to learn.

For instance, on many UNIX and probably most Linux systems too, you will find this in /usr/include/sys/fs/ufs_fs.h:

#define FS_MAGIC 0x011954

This would be the “Magic Number” stamped on a UFS filesystem so that other programs wanting to interact with that filesystem would know what type of filesystem they were dealing with. So, where did the number 011954 come from? Well, it is apparently the birthday of one of the programmers who worked on the filesystem in the early 1980s: 01-19-54. More than 20 years later it is still there, and it will probably remain there until everyone stops supporting UFS. As of Solaris 10, Sun still uses UFS as its default filesystem. Once ZFS is integrated into Solaris production releases, this will hopefully change. Mac OS 10.4 uses HFS+ but will still read UFS volumes. Linux uses a variety of different filesystems, depending on the distribution, but can also read UFS.
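Note that the date is only visible when you write the constant in hexadecimal; to a program comparing integers it is just 72020. A tiny illustration in shell (the helper names are mine, for illustration only):

```shell
#!/bin/sh
# FS_MAGIC as defined in ufs_fs.h above.
FS_MAGIC=0x011954

# The value a program actually compares against, in decimal.
magic_decimal() { printf '%d\n' "$FS_MAGIC"; }

# The same value printed as six hex digits: 01 19 54, i.e. 01-19-54.
magic_hex_digits() { printf '%06x\n' "$FS_MAGIC"; }
```

`magic_decimal` prints 72020 and `magic_hex_digits` prints 011954.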

Another interesting artifact is the way that the shell is responsible for expanding things like “*”.

Let’s say you have a directory with three files in it (file1, file2, file3) and three directories (dir1, dir2, dir3). You want to remove the files, so you type:

rm *

You know that rm won’t remove the directories without the -r option. Here’s what happens:

BASH: “Hmm. I need to figure out what * means, so I can pass that to rm… Let’s see, there’s file1, file2, file3, dir1, dir2, dir3. Oh, Mr. RM! “

RM: “You have some files for me to delete, yes? Here I am, brain the size of a planet…”

BASH: “Yes. dir1, dir2, dir3, file1, file2, file3. They are in alphabetical order for you.”

RM: “Thank you kindly. remove:

dir1 (Can’t. It is a directory)

dir2 (Can’t. It is a directory)

dir3 (Can’t. It is a directory)

file1 (ok)

file2 (ok)

file3 (ok)

DONE!

Now, this is fine until we have a file named “-r”:


$ touch -- -r
$ ls -l
total 0
-rw-r--r--    1 steelmi1 steelmi1        0 Dec  4 11:03 -r
drwxr-xr-x    2 steelmi1 steelmi1       68 Dec  4 11:03 dir1
drwxr-xr-x    2 steelmi1 steelmi1       68 Dec  4 11:03 dir2
drwxr-xr-x    2 steelmi1 steelmi1       68 Dec  4 11:03 dir3
-rw-r--r--    1 steelmi1 steelmi1        0 Dec  4 11:04 file1
-rw-r--r--    1 steelmi1 steelmi1        0 Dec  4 11:04 file2
-rw-r--r--    1 steelmi1 steelmi1        0 Dec  4 11:04 file3
$ rm *

BASH: “Hmm. I need to figure out what * means, so I can pass that to rm… Let’s see, there’s -r, file1, file2, file3, dir1, dir2, dir3. Oh, Mr. RM!”

RM: “You have some files for me to delete, yes? Here I am, brain the size of a planet…”

BASH: “Yes. -r, dir1, dir2, dir3, file1, file2, file3. They are in alphabetical order for you.”

RM: “Thank you kindly. remove:

-r (Hey, that’s an option! I’d better do what it says from now on!)

dir1 (ok, -r was specified!)

dir2 (ok, -r was specified!)

dir3 (ok, -r was specified!)

file1 (ok)

file2 (ok)

file3 (ok)

DONE!

    $ ls -l
    total 0
    -rw-r--r--    1 steelmi1 steelmi1        0 Dec  4 11:03 -r
    $


So, as you can see, when you use the “*” wildcard, your SHELL expands it, not the command you think you’re passing “*” to. And the result of this can be deadly. With a file called “-r” in our directory, rm * just wiped out directories it shouldn’t have been able to touch. Incredible.
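One way to defuse the trap is to tell rm explicitly where its options end. The sketch below assumes a POSIX shell with mktemp available; it rebuilds the example directory in a scratch location and uses `--`, the standard end-of-options marker, so that an expanded name like “-r” is only ever treated as a file.

```shell
#!/bin/sh
# Rebuild the example in a scratch directory.
demo=$(mktemp -d) || exit 1
cd "$demo" || exit 1
mkdir dir1 dir2 dir3
touch file1 file2 file3
touch -- -r                  # "--" ends option parsing, so touch creates a file named "-r"

# "--" ends option parsing for rm too: every name the shell expands "*" into,
# "-r" included, is an operand to delete, never an option.
rm -- * 2>/dev/null || true  # rm still refuses dir1..dir3, so it exits non-zero

ls                           # only dir1 dir2 dir3 should remain
```

An equivalent habit is `rm ./*`: every expanded name then begins with “./”, so none of them can be mistaken for an option.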

Published in: on December 4, 2005 at 8:08 pm Leave a Comment

AWKward situations

Just wanted to put up this link to an article I wrote for SALUG showing some of the uses a UNIX Systems Administrator has for the AWK programming language.

Probably not interesting to non-technical types.

Published in: on November 18, 2005 at 2:07 pm Leave a Comment

A product WAY ahead of its time and a bit of nostalgia…

From Byte Magazine in April of 1995: The AeroComm GoPrint.

This was a wireless print server that I installed for the Catholic Office of Religious Education in Mobile some time between 1995 and 1998, while I worked for Computer Technical Services. Another uninteresting story about ORE is that about two years ago, I found out my first grade teacher was working there…

At any rate, these little things were cool. In 1995, they could create a small, server-less printer “network.” You’d connect the base station to the parallel port on your printer, and connect a unit to the parallel port of each of your PCs. It seems like you could connect up to 6 PCs at the time. The thing knew how to spool print jobs from all 6 PCs, and keep them separate. Not only that, but the thing worked with DOS! Keep in mind Windows 95 had just come out, and was not being widely adopted at this point, so you could forget about networks. If you wanted a network back in the DOS days, you were most likely talking about Novell, and that was a very costly option a small office like the ORE could just not afford. Back to the point.. the damn thing just worked. No muss, no fuss.

The only printer sharing we’d done prior to this was one of those crappy A/B parallel switch boxes hooked up backwards. The box normally worked to switch 2 printers to 1 PC. By plugging in 2 PCs and 1 printer, users were able to switch the printer back and forth between themselves. If you were on PC A, and the switch was set to B, your print job just landed in the bit-bucket. As I recall, DOS didn’t do much to check that there was actually a printer on the other end of that parallel port…

This was WELL before 802.11b had ever been heard of.

In fact, if I’m not wrong, this was before I’d ever set up a network, let alone heard of a print server. So, this thing was cool, and way ahead of its time. I don’t know if it sold very well. I think it was pretty expensive, but still much less than buying a server for Novell, licensing the software, buying ethernet (or Token Ring… eww) cards for every PC in the place, having cable drops made, and paying someone to put it all together.

Here’s a pic of one of the client side devices.

Published in: on at 1:19 pm Leave a Comment

Travelogue

May 8-12 Newark, California: SunUP Network Forum

Wednesday

SunUP didn’t start until 09:00 today. Missed the shuttle bus. No big deal. The campus is only a few miles from the hotel. Got delayed in toll-booth traffic for a few minutes. Drove straight to the right building, no wrong turns. This place is MUCH easier to get around than New York.

The first talk this morning was about the Sun-Fujitsu relationship. Basically it amounts to a way for Sun and Fujitsu to trade some technology for a few years, until they can find a way to screw each other. Diplomacy is the art of saying “Nice Doggy” until you can find a big stick.

Then came the Solaris 10 migration “Lessons Learned” session. This was possibly the best talk of the conference. Some of the highlights were:

  • Most applications experience a performance gain simply by upgrading to Solaris 10.
  • The IP Stack has been rewritten to vastly improve performance.
  • A “Container” == A Zone + Resource Management.
  • “Whole Root” local zones!!!
  • All Zones in a domain share the same process table. So a fork-bomb in any local zone will crash the global zone. I knew this already, but it is nice to see Sun admit to it.
  • Memory leaks in a local zone can also take down the global zone. I didn’t know that, but I suspected.

It would be nice if Sun were actually able to get Zones to be as fine-grained and self-sufficient as LPARs on an IBM mainframe, but they have a LONG way to go.

This would have been the most appropriate discussion to bring up some of the gripes I had about Solaris, but it didn’t seem right to voice them to the guy who migrated his datacenter to Solaris 10. It would have been REALLY nice to have had access to an actual Solaris Engineer.

Then we talked about the new DIMM replacement policy. Most sites like to replace DIMMs that are throwing correctable memory errors, under the assumption that soft errors will lead to hard errors. Sun did some research and found that 70% of the DIMMs replaced for correctable errors were replaced on ‘suspicion’ of being bad. They collected 800 of these DIMMs and ran them all for 5 months under heavy load. At the end of that 5-month period, they hadn’t seen a single uncorrectable error (read: system panic). I know that we replaced a LOT of them on our E10k machines in the first two years I was here.

The new policy is to replace a DIMM only if it has thrown 24 errors over 24 hours. I’m not sure how this meshes with the new Memory Page Retirement functionality that was introduced in Solaris 10, then back-ported to Solaris 9 and Solaris 8. It seems like MPR would retire pages of memory (essentially a “bad block map” for RAM) before they hit that threshold of 24 in 24, and you’d never see enough errors to replace a failing DIMM. They had a customer testimonial, and the guy said that they don’t bother replacing a DIMM until the memory error is logged as persistent. That is how we’ve treated them for the most part over the last few years, anyway.
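The 24-in-24 rule is easy to state but fiddly to check by eye. Below is a minimal sketch, assuming you already have one DIMM’s correctable-error timestamps as sorted epoch seconds, one per line (pulling them out of your logs is left out). The function name dimm_check is hypothetical, not a Sun tool, and treating “24 hours” as a strict sliding window is my reading of the policy.

```shell
#!/bin/sh
# dimm_check: hypothetical helper, not part of Solaris or Explorer.
# Reads sorted epoch timestamps (one per line) for a single DIMM on stdin and
# prints "replace" if any 24 of them fall within 24 hours, otherwise "keep".
dimm_check() {
    awk -v limit=24 -v window=86400 '
        { t[NR] = $1 }
        END {
            # Compare each error with the one (limit - 1) positions earlier:
            # if limit consecutive errors span less than the window, replace.
            for (i = limit; i <= NR; i++)
                if (t[i] - t[i - limit + 1] < window) {
                    print "replace"
                    exit
                }
            print "keep"
        }'
}
```

Usage would be `dimm_check < errors.txt`, where errors.txt is a hypothetical file of timestamps for one DIMM.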

Sun also suggested the new cediag. This new and presumably useful tool does not ship with the OS, but instead with version 5.0 of the Explorer package. Speaking of which, why isn’t Explorer part of the OS by now??

The only choices for technical break-out sessions were “Capacity Management” and “Disaster Recovery.” I stayed for the DR discussion. It wasn’t very useful unfortunately. That being said, I’d like to see more break-out sessions next time, particularly ones with Solaris engineers.

The next discussion was on Time-Dependent Reliability (snooze). The guy giving the talk was so far over the heads of the audience it wasn’t funny. The crux of his argument was that MTBF is a poor tool for reliability analysis.

The last thing we did was to plan the next meeting. Hopefully, it will be at Sun’s Broomfield campus. Fat Tire is plentiful near Broomfield because the brewery is less than an hour away. I’ve done the tour, and quite enjoyed it.

Wednesday night, I had dinner with Stephen. As good as it was to see Shannon, it was better to see Stephen because I did get to hang out with Shannon and So Jung over Christmas. Stephen, I hadn’t seen since one week before I got married, very near five years. Stephen didn’t have long. Something about Google working him to death, I suspect. Still, it is incredible to me that with real friends, the passage of time evaporates when you get together. It has been eleven years since high school, and it just didn’t matter. I really appreciate that, since it reassures me that I made the right choices in friends so long ago. We ate at the same steak place I had eaten at on the first night. I had two Lagunitas India Pale Ales which claim to be made with 65 different malts and 43 different types of hops. That is incredible. Needless to say, the first one was so good, I had to have a second. Stephen had to leave early, but it was so good to hang out with him that I didn’t care. Hopefully, I’ll get to go back some time.

Thursday

Got up at 09:30. Checked out just before 10:30. Took I-880 toward San Jose. I only missed one turn going into the airport, mostly due to construction around the airport. Flight was supposed to depart at 12:15 PDT. We had to wait on the plane at the terminal for an hour while they fixed the plane with duct tape. Seriously. Ok, ok… so the problem was that one of the overhead bins came unhinged, and they had to tape it closed. I really didn’t think I’d make my flight from DFW to HSV, and I was certain my luggage wouldn’t. Fortunately, I got to the gate just as boarding was starting. My luggage also made it to HSV unharmed. All-in-all, long, boring, and full flights, but safe ones. I got to Huntsvegas at about 20:15, made it home by 21:00.

Published in: on May 16, 2005 at 5:58 pm Leave a Comment

Travelogue

May 8-12 Newark, California: SunUP Network Forum

Tuesday

Set the clock for 06:30. Then 06:45. Then 06:50. Amy called at 06:46. Got up. At least I didn’t have to iron anything.

Got the shuttle-bus to Sun’s Menlo Park campus. Like everything here, it was
beautiful. Nice view of the bay and of the mountains. Nice green trees. Just
like everything here.

Signed in with about 40 or 50 other people. They had packets on the table with everyone’s names on them. When you got your packet, they handed out Mikasa crystal wine stoppers in the shape of grapes. This is wine country, I suppose, but that was the strangest schwag I’ve gotten since EMC handed out toenail clippers in 1999. Once you got in the door, they had a table full of circa-1999 Sun Blueprints books for grabs. I snagged 6 different ones. A nice addition to the book shelf, but they are all hopelessly out-of-date and mostly useless.

This was a long freaking day. We started at 08:30 PDT, and didn’t really
stop until 18:30 PDT. By about 15:45, I was ready to start throwing things.
Really, people. If you have this much junk to cover, make it a three day
conference!

As a result of the long time sitting, I began to think of ways Sun annoys me.

  1. Solaris excluded, Sun isn’t any good at making software. They have a
    lot of tools that are either half-finished, half-useful, half-tested, or
    half-assed in some other way I haven’t listed yet. Worse than that, they
    have a lot of products with overlapping functionality, and don’t really seem
    to understand the concept of code reusability. I really wish they’d get
    their act together, particularly with systems management software. They
    spent probably three hours out of the day, showing off several new offerings
    that they were really proud of, but were nothing more than an extension of
    the current mess they have (or in a couple of cases, completely new
    messes). One dude told us about this new program going by the moniker SMC. I
    don’t remember what SMC stands for in this case because Sun already has
    at least three other products called SMC.
  2. Product names. Is it Netscape Directory Server? Or maybe iPlanet. Or
    maybe SunOne. Uh… how about Java Directory Server? Sun Management Center
    (SunMC, not to be confused with Solaris Management Console: SMC) apparently
    used to be called Symon. Netconnect is now being renamed into two different
    products (that only half-work at this point). PLEASE stop renaming things.
    Or at very least, remove all of the old names from the documentation.
  3. Java AWT (Abstract Window Toolkit)/Swing. It is old.
    It is slow. It looks like it has been beaten with an ugly
    stick. When most people say “Java sucks!” they are talking
    about AWT/Swing applications. For some reason, Sun has
    chosen to make all of their systems management applications
    with Swing, rather than native tools using C and probably
    GTK (since they are now shipping Gnome). These apps look
    like they were written in 1989. They are way too slow to
    be useful to anyone. They only work half the time. If you
    need to update the Java Virtual Machine on your box, they
    probably won’t work at all. If you need to run them over
    a remote X-Windows session, the best thing to do is forget
    it. Curiously enough, Sun are not the only guilty party
    here. Veritas has recently perpetrated this with their
    NetBackup admin client. I think that Apple did it right
    by providing a Java interface into Aqua. As a result, you
    can write Java apps that at least look like native
    applications, though they may not always behave like
    native apps. From what few I’ve seen, they actually perform like native
    apps too. What Sun should do is hire some GUI designers from Cupertino
    to work on a replacement for Swing (which was a replacement for the AWT), or
    just freaking license the Java + Aqua layers. Yeah, right. I know that UNIX is predominantly a text-oriented system. I am
    fine with that. However, if you’re going to provide GUI tools and force us to use them for some tasks, PLEASE make them
    usable.

    In a lot of ways, I think that Apple has ruined things for other UNIX vendors, by proving that UNIX can look good and be functional.

  4. Could you guys please make better LDAP server and client configuration
    tools?! This is one thing Microsoft has got right. It is really easy to
    configure a secured Active Directory server, and connect clients to it,
    without ever passing clear text passwords over the wire. Replication to
    redundant servers is apparently not very difficult either. They have had
    this working well enough since NT 4.0, and you haven’t. Period. If system administrators can manage to figure out exactly what documentation they need for the Directory Server, it is possible
    to get an LDAP server running, with clients authenticating to it in
    probably about 48 hours if you’ve never done it before. Now try adding TLS.
    Good-freaking-luck. We don’t need to go dinking around with ten different
    tools for creating self-signed certificates, etc. You should assume that:

    1. We need Transport Layer Security by default. These days, it is
      NOT acceptable to pass clear-text passwords over the wire, unless I
      specifically tell you to.
    2. Unless I tell you differently, self-signed certificates are OK. Ask
      me if I have a “Real” cert, and if not, CREATE A FREAKING CERTIFICATE
      AUTHORITY. Make the admin tools smart enough to push the proper trusts
      out to the clients when I initialize them. Instructions for working
      with self-signed certificates that say things like: “Open the Netscape
      Web Browser” are NOT ACCEPTABLE. This needs to be automated, easy, and
      above all needs to “just work.”
    3. pam_ldap needs to be smart enough to allow RSA key authentication
      for password-less logins over SSH. You might be able to talk me out of
      that one. It would at least be nice if a system administrator could
      allow that for specific accounts.
    4. Kerberos integration should be documented and as easy to implement as it is on Windows (nearly invisible).

    In other words right now, the Directory Server that ships with Solaris is just a tool. Sun needs to evolve a little bit by providing the tool integrated with the design, configuration, and deployment tools to make it useful quickly.

  5. Jumpstart needs to be updated to be smart enough to
    use DHCP. There is NO excuse for this. And “Go download JET” isn’t a good answer either, unless JET both grows up and gets shipped with the OS. Again, this needs to be automated, easy, and just work. We have to use this tool often, and have committed significant time to customizing it for our environment. We shouldn’t have to dance around RARP any more, since DHCP has been the
    standard for at LEAST 10 years.

The patch management discussion nearly drove me over a cliff. They are
trying to make better tools, but it looks like they are worse and that
is really a shame. I can’t imagine that people are going to want to use
these tools unless something dramatic happens. If Sun are planning to charge for these tools, I will laugh.

The most interesting quote of the day was “You cannot manage
availability. Availability is a result.” That is simple, but quite
profound, and I’m glad they are thinking along those lines. Also
interesting is the statistic they gave of the % chance of a system
administrator inadvertently causing an unplanned outage: 1 in 200. So,
every time I log into a machine, I have a .5% chance of causing down time
such as accidentally rebooting production instead of a development machine,
etc. Excellent.
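That 1-in-200 figure adds up. A back-of-the-envelope sketch, assuming (my assumption, not Sun’s) that every login is an independent trial carrying the same 0.5% risk:

```latex
P(\text{at least one outage in } n \text{ logins}) = 1 - \left(1 - \tfrac{1}{200}\right)^{n},
\qquad n = 100:\quad 1 - 0.995^{100} \approx 1 - 0.606 \approx 0.39
```

So a hundred logins would carry roughly a 39% chance of at least one self-inflicted outage, if the independence assumption held.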

Over all, I left the first day of the conference a LOT more agitated than
when I went in. Hopefully I will get some opportunity to provide input
(read vent) about some of these things tomorrow.

We did get a chance to tour the iForce center today. That was pretty
neat, but I already have an E25k, and they could have done the tour in 20
minutes instead of more than an hour. It got old fast.

I skipped the free dinner, in favor of going to the Apple
Store in Palo Alto. It wasn’t worth the time. I was very
disappointed in it, as the Mac Resource in Huntsville is
way better than this place. Ate at PF Chang’s in Palo Alto.
Had the Orange Peel Shrimp and a Fat Tire.
All I ask of life is a plate of shrimp (or maybe oysters) and enough Fat Tire
to choke a goat. Maybe one day, New Belgium will expand
enough to be able to ship to Alabama. For that matter,
maybe one day Alabama will change their beer laws to make
that worthwhile.

Got Amy a present, then headed back to the hotel.

Published in: on at 5:44 pm Leave a Comment

This is exactly why I bought a Mac…

Published in: on February 25, 2005 at 1:57 pm Leave a Comment

Input devices

I was reading a link Jeff posted on his web log about IBM
“clicky” keyboards. These things seem to be really popular among people who have to sit at a
computer all day. I’ve never really liked them, though. I’ve been using a Sun Type 6 USB UNIX keyboard for about four years now. I bought it when I got my first PC that had a BIOS smart enough to use a USB keyboard.

Today, I’ve got it plugged into my PowerMac G5. The G5 came with its own keyboard, which seems to be solidly built. Over the last two weeks, I’ve been using that one, trying to get used to it. I gave up today and plugged the Sun one back in. The Apple keyboard is quite compact compared to the Sun, which is what prompted me to try it. In the end, I found the Apple keys to have too much tension and travel too far for my taste. That and the fact that the Control key isn’t where it is supposed to be.
The thing just wasn’t comfortable.

The Sun type 6 UNIX has the control key situated to the left of the “A” key. I’ve come to depend on having it there. Also, the “~” key is above the backspace. Here is a layout of the Type 6 US keyboard for reference. The
escape key is about where you’d expect it to be on a normal PC keyboard. I use vi for my text editor, and have found that I like the placement of the type 6 UNIX escape key better.

Incidentally, more keys on the Type 6 work out of the box on my G5 than when it is plugged into an actual Sun workstation. The volume and power keys in the upper-right work on the G5 with no configuration. I’ve never seen the volume ones work on a Sun. The Help key in the upper-left works, though I almost never use it, other than to see if it works. There is also the enigmatic “Any” key, to the right of the Help key. It has no label on the key, and I’ve never seen it do anything.

The type 6 UNIX keyboard has the added benefit of driving everyone else who tries to use it crazy. Every one who tries to use my machine at work complains about it. Of course, I make it a point to complain about “normal” keyboards whenever I work on someone else’s machine at work. “Hey! Your ‘Control’ is in the wrong place!” I guess I take great joy in being comfortable on the keyboard no one else likes.

The only trouble I have with it is that:

  1. Even though OS X understands that the Meta key == the Apple Key, Open Firmware doesn’t seem to. If I need to use a key sequence at boot up (for instance CMD+S to boot the G5 into Single User Mode), it doesn’t work. I have to plug an Apple keyboard in to make this work. Bummer, but I almost never need to do that anyway.
  2. I can’t get the volume keys working under Windows. I don’t use Windows very often, so not too big a deal.
  3. On the Sun Type 5 keyboard, you plugged your mouse into a port on the keyboard. I liked that. With the Type 6, Sun moved away from that, even though they made the move to USB. It would have been easy to put a small USB hub into the keyboard. Indeed, the Apple one (and many others) already do. Presumably, this was to save cost. Let me give Sun a hint. People who buy $20,000+ servers don’t notice that they’re paying an extra $25 for the USB hub in the keyboard. Granted, most of them don’t need it like a desktop user would, but it would be nice to have.
  4. It holds up well, but it is not very sturdy.

The Type 6 USB mouse is good too, but still lacking a few features, namely:

  1. SCROLL WHEEL!
  2. Optical, like the old Sun Mice, except without the need for the metal mouse pad.

Sun, perhaps a Type 7 USB UNIX keyboard and mouse are in order?

Published in: on February 21, 2005 at 3:39 pm Leave a Comment

Tech Time

So, today we’re going to talk about automated network installs of the Solaris Operating Environment using Sun’s Jumpstart
framework.

I’ve done Jumpstart installs for years. The stuff you need to do automated network Solaris installs has shipped with Solaris
for a long time now. It takes a while to get set up and properly configured for your environment, particularly if you do it
right. It has several advantages over the manual-take-the-CD-ROM-to-the-server-and-sit-at-the-keyboard install. To wit:

  1. Usually slightly faster than a CD install, since you don’t have to swap discs.
  2. Once it is set up, you can build a lot of machines in a short time.
  3. Once it is set up, you can build a lot of machines over a period of several years that are configured EXACTLY the same.
  4. You can deploy security hardening at OS install time.
  5. etc.

The way this normally works is that you tell the Jumpstart server a few things about the machine you’re building, such as IP
Address, Ethernet MAC Address, disk partitioning, etc. Then you tell the client machine to “boot net - install” at the OBP.

The client then sends a RARP request to the Ethernet broadcast address, and your server responds with an IP Address. The client
then sends a bootparam request, then the server responds with information on a location that the client can TFTP his kernel and
where the client should try to mount his root filesystem from.

The trouble with all of this is that RARP can’t cross subnets; it is not a routed protocol. That alone would be fine, except that a RARP reply carries only an IP address, not a netmask, so the client falls back on the default mask for a class A, B, or C network. For instance, if you have a network numbered
192.168.132.0 with a netmask of 255.255.254.0 (192.168.132.0 – 192.168.133.255), your server is 192.168.132.1, and your
client is 192.168.133.1, Jumpstart breaks: the client assumes the class C mask 255.255.255.0 and concludes that the server is on a different subnet.
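To see why the /23 case breaks, it helps to do the masking by hand. The sketch below uses only POSIX sh arithmetic; the function names are hypothetical, chosen for illustration. It computes the network address for an IP and a netmask, plus the classful default mask a client would fall back on when all it has is an IP address. Class D and E addresses are deliberately not handled.

```shell
#!/bin/sh
# ip2int: dotted quad -> 32-bit integer
ip2int() {
    a=${1%%.*}; rest=${1#*.}
    b=${rest%%.*}; rest=${rest#*.}
    c=${rest%%.*}; d=${rest#*.}
    echo $(( (a << 24) | (b << 16) | (c << 8) | d ))
}

# int2ip: 32-bit integer -> dotted quad
int2ip() {
    echo "$(( ($1 >> 24) & 255 )).$(( ($1 >> 16) & 255 )).$(( ($1 >> 8) & 255 )).$(( $1 & 255 ))"
}

# network IP MASK: bitwise AND of the address and the netmask
network() {
    int2ip $(( $(ip2int "$1") & $(ip2int "$2") ))
}

# classful_mask IP: default mask implied by the first octet (class A, B or C only)
classful_mask() {
    first=${1%%.*}
    if [ "$first" -lt 128 ]; then
        echo 255.0.0.0
    elif [ "$first" -lt 192 ]; then
        echo 255.255.0.0
    else
        echo 255.255.255.0
    fi
}

network 192.168.133.1 255.255.254.0                      # 192.168.132.0: the real /23
network 192.168.133.1 "$(classful_mask 192.168.133.1)"   # 192.168.133.0: client's class C guess
network 192.168.132.1 "$(classful_mask 192.168.132.1)"   # 192.168.132.0: server under the same guess
```

With the real /23 mask both hosts land on 192.168.132.0; with the class C guess they land on different networks, which is exactly where the client gives up.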

The conversation goes like this:

Client: HELP! Somebody tell me what the IP address for 00:0d:93:36:bb:06 is!

Server: Sure. It is 192.168.133.1! Tell’em 192.168.132.1 sent ya!

Client: Uh, Thanks. You know where a guy like me might find a kernel around here?

Server: Sure! You can TFTP a kernel from 192.168.132.1!

Client: Yum! (Starts loading kernel) Where can I mount a root filesystem from?

Server: Why, 192.168.132.1:/path/to/miniroot, of course!

Client: 192.168.132.1?! That is on a different subnet than the one I’m on! (Panic, crash dump)

Server: Help! Help! I’m being repressed!

The temporary fix was of course to Jumpstart the client using an IP address in the 192.168.132.0 – 192.168.132.255 range, just
as if it had been in a class C. This worked as expected.

The long term solution is to get Jumpstart working with DHCP instead of RARP, since DHCP sends netmask information. However,
this does present some problems as well:

  1. None of the Windows machines in our network know anything at all about RARP. So, we could keep in.rarpd running without
    worrying that it might interfere with an unsuspecting PC server. PC servers do know about DHCP, so we’ll have to pay
    more attention to what we put in there or risk freaking out the squares.
  2. Sun’s DHCP server is rather cryptic to set up.
  3. The few older Sun boxes we have such as e450’s and e4500’s have OBP versions that don’t understand DHCP, so we may still
    have to RARP those if we ever need to upgrade the OS.
  4. Sun’s Jumpstart + DHCP documentation isn’t very good. Some of the older docs I have use words like “undocumented” a
    lot.

Published in: on February 20, 2005 at 3:40 pm Leave a Comment

Meanwhile, back at the farm…

It has been pretty busy at work this week. I installed Tru64 patches on four actual DEC Alpha Servers that have been running since probably about 2 years after I got out of high school. These machines have actual Digital Equipment Corporation branding on them. We have two Alpha Server 8400’s and four Alpha Server 4100’s. All of them were made before Compaq killed DEC. These machines are generally tough. I like them, but they are old. As for Tru64, I have a few gripes, namely:

  1. Startup scripts live in /sbin/init.d and are linked to /sbin/rcX.d. Perhaps this isn’t “wrong,” since the scripts are executables, and executables typically go under directories that have “bin” in them, not ones that have “etc” in them. Sun and others, who use /etc/init.d and /etc/rcX.d, probably made a strange choice here, but then again, Sun has also been known to do things like “/usr/lib/sendmail.” HP-UX, coincidentally, does it much the same way Tru64 does.
  2. LSM + AdvFS. LSM == Veritas Volume Manager. I don’t know why DEC didn’t just call it that. And AdvFS (Advanced File System) is an abomination. First, you take your physical hard drives, and carve out what ever VxVM volumes you want. You have to go through all the same steps you would using VxVM, EXCEPT… replace the characters (vx) with the characters (vol) in almost all of the commands:
    • vxdisk list == voldisk list
    • vxprint -htg rootdg == volprint -htg rootdg
    • etc . . .

    After all of this, you create AdvFS “File Domains” on top of the VxVM volume. A “File Domain” is almost the same thing as a VxVM “Disk Group.” It is a pool of space you carve up to build filesystems on top of. A File Domain has a 1:1 relationship with the LSM/VxVM volume it was created on top of. If the LSM/VxVM volume was 20 GB in size, the File Domain (which you give a name, like “ora-domain”) will be 20 GB.

    From the File Domain, you create File Sets. You can create many file sets inside a domain. A file set is mostly the same thing as a file system. Except, all file sets inside a domain SHARE the domain’s space.

    Take a look at this ‘df’ output. Notice that file sets are referenced as domain#fileset in the left-most column. This is how they appear in /etc/fstab too.

    [root@batman /]# df -h
    Filesystem            Size  Used Avail Use% Mounted on
    root_domain#root      256M  190M   54M  78% /
    usr_domain#usr         17G  2.2G   13G  15% /usr
    usr_domain#var         17G  1.5G   13G  10% /var
    oradomain#home        102G   16G  6.1G  72% /oracle/home
    oradomain#dbf_03      102G   28G  6.1G  82% /oracle/dbf_03
    db26domain#dbf_03      68G   31G  9.6G  76% /oracle/dbf_03/db26
    oradomain#dbf_04      102G   52G  6.1G  90% /oracle/dbf_04
    db26domain#dbf_04      68G   27G  9.6G  74% /oracle/dbf_04/db26
    
    [root@batman /]#
    

    Notice the /usr and /var mount points. They are both carved from the same domain, usr_domain. Notice how both of them, even though they are mounted in different places, show the same space under “Size” and the same space under “Avail.” However, the kernel is at least smart enough to track how much space each file set uses, independently of other file sets. By the way, the server really isn’t called batman.

  3. Patches. Patches take FOREVER to install on these machines. The ‘dupatch’ utility is apparently a shell script, and not a very efficient one at that. The more patches you have installed on your system, the longer it takes to install new ones. Right now, it takes about 58 minutes to install a single (1) 3.5 MB patch. I have a case opened with HP, but they are not interested in fixing it. Perhaps it works better with more modern (read faster) Alphas.
  4. Support from vendors. No one knows these machines any more. The HP guy we had come in to replace a fan told me he took a class on them in 1995. Software vendors who support them are getting fewer and further between. When we migrated these machines to our SAN, it took EMC weeks to find ONE guy in their organization who knew Tru64. Also, it doesn’t help that HP do not seem to be keeping parts for these things in our local parts depot. It took two days to get the fan replacement. We found out, some two years ago, that when this fan dies, it takes the whole machine with it. So it is good that this was a preemptive replacement of a squeaky fan, and not an emergency.
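The AdvFS shared-space behavior from item 2 is easier to see if you group the df output by domain. Here is a minimal sketch, assuming `df -k` output with size, used and available in columns 2 through 4 and each filesystem on a single line (some df implementations wrap long names onto a second line, which this does not handle). The function name advfs_domains is hypothetical. Because every file set in a domain reports the same Avail, the sketch takes it from whichever file set it sees last and sums Used across the file sets.

```shell
#!/bin/sh
# advfs_domains: hypothetical helper that summarizes df -k output per AdvFS domain.
# Only lines whose first column looks like domain#fileset are counted.
advfs_domains() {
    awk 'NR > 1 && index($1, "#") {
            split($1, part, "#")
            dom = part[1]
            used[dom] += $3     # each file set tracks its own usage
            avail[dom] = $4     # free space is shared by the whole domain
            sets[dom]++
        }
        END {
            for (dom in used)
                printf "%s filesets=%d used_kb=%d avail_kb=%d\n", dom, sets[dom], used[dom], avail[dom]
        }'
}
```

Usage would be `df -k | advfs_domains | sort`; the sort is there because awk does not promise any particular order when walking an array.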

None of these things matter. These are the last Alphas I will work with. Tru64 is going away. HP doesn’t even really know what chip they’re going to be using on their next line of big UNIX servers, as Itanium doesn’t seem to be doing well.

Besides that, we have a Sun E25k ordered that should have shipped yesterday. 72 CPUs, 288 GB of RAM, nice. This will be our second one (actually, the first is a 15k, but we’re buying the 25k with UltraSparc III CPUs, same as the 15k). The one Sun E25k will replace all six Alphas, plus several more Oracle database servers, mostly old Sun e450s and e4500s. In all, we are trading-in (read dumping) about ten machines for this one monster. The datacenter footprint will be greatly reduced, and more importantly, I’ll only have Solaris and Linux to deal with. I’ve had to push out so many special-case, one-off configurations for the Alphas that I’m really sick of them. As nostalgic as it might be to run these old things, I’ll be glad to see the end of them.

Published in: on February 5, 2005 at 11:08 am Leave a Comment

So this is the new year? (Hat-tip: "Death Cab for Cutie")

I don’t do New Year Resolutions. So, this year will be no exception. However, it is not all that uncommon for me to set a few goals from time to time. So this year, I think that I’m going to dig up some books by and about the founding fathers of our country. The goal being mostly to remind myself that the people who founded this country were all right-wing-kook-extremists, just like me. As the post-election left-wing meltdown continues, with liberals screeching about how “Red State” people are ignorant morons, it will be nice to read the words of these intelligent men, and to remember that they are largely responsible for me
thinking the way I do. I am a conservative, I do believe in God, I am patriotic. I am not an uneducated simpleton fool. I sense that I’ll probably write two or three screeds about this in the coming year.

Specifically, I want to look through the Federalist Papers, some Washington, some Jefferson, some Ben Franklin (particularly, I’d like to reread his autobiography), and brush up on some of the minor/obscure founding documents like the Constitution and the Declaration of Independence. Also, I intend to finish the 9/11 commission report.

Also, for career development this year, I’d really like to learn to write useful code in C. As a full-time UNIX systems administrator, I am often ashamed of the fact that I can’t program in C. Perl is great, and as a rule has gotten me out of (and also into) many sticky situations, but I really should learn C.

Finally, I’m really going to try to write more. I will probably use this LiveJournal thing as the medium, since it is easy enough. Nobody but Chris will read it anyway, and the only reason he will is that he gets all of my posts emailed to him automagically. Chris, read on. I promise it will be hella boring.

Some things I want to write about:

  • Chapter 8 in the new Ann Coulter book, and how she isn’t quite right.
  • Good/Bad points in the new Bill O’Reilly book.
  • Dr. Strangeconservative, or How I Learned to Stop Worrying and Love the US.
  • The current unofficial charter of the United Nations
  • Canada: Frozen Bombing Range of the North
  • Linux Hippies are ruining it for the rest of us
  • My favorite line to use at parties: “I’m slightly to the right of Rush Limbaugh.” That one always gets GREAT responses. :)
  • My prediction that the “Half-Blood Prince” in J.K. Rowling’s new book is . . . Hagrid.
  • The real American Idiots: Anna Nicole, Paris Hilton, Reality TV, MTV, etc.
  • My life-change from Linux+Windows on PC to OS X on Apple G5, how I have adjusted, and if it was worth it.
  • Several other topics I can’t remember right now.

I had no champagne for the new year, I’m afraid. I had a nice bottle of Chimay Grande Réserve (a.k.a. Chimay Blue) that I was going to enjoy, but I decided to save it. At $9 per bottle, Chimay Blue comes packaged like champagne (750 ml, cork finished, wire bale) but tastes better than any champagne I’ve ever had, and it is WAY less expensive. Veuve Clicquot is probably the best-tasting champagne I’ve ever had, and it currently runs about $45 for the same 750 ml bottle. Also, I have one bottle of Left Hand Imperial Stout that I am saving for a cold night. I am worried that winter may be over, though. We had those two days just before Christmas when the high was in the 20s. Right now, we’re having upper 60s. Happy January in North Alabama.

I got several new books for Christmas, including the Bill O’Reilly book, Who’s Looking Out for You? Anyone who thinks that O’Reilly is a conservative after reading this book should go have a mental exam or perhaps go look at a dictionary. This book was closer to a “Self Help” book than anything I have ever read before. As it turns out, I like Bill’s writing style much better than his interviewing/commentating style on his TV show (I’ve never listened to his radio show). Perhaps it is because he’s not interrupting someone else’s every third word. I’ll talk more about the book in another entry later because I thought it was actually good and made some points I hadn’t thought about.

Oh, and I swore off Slashdot just over a month ago. Annoying prats. So far, I have done well: I’ve been there fewer than three times in the last month, and I haven’t missed it, per se. But I do miss having a good source of computer geek news updated several times per day, and I haven’t found a good replacement yet. We’ll see how long this lasts… If not for their agenda of left-wing politics and slamming any company out there that has the nerve to actually try to (gasp) make money, I’d still be a many-times-a-day visitor.

In summary, happy new year to you all. Really, I plan to live 2005 just like I lived 2004. Keep moving forward, doing what I do, brewing a few beers, fixing a couple of computers, and generally enjoying life in North Alabama with my wife.

Backups

So, it is 0100, and I am fighting with Veritas NetBackup. This does not make me happy. Someone should explain to me why my restore job is “Queued” when I have 10 tape drives and only 5 of them are in use. The Dr. Pepper cans are stacking up.

Published on December 12, 2004 at 1:02 am