May 8-12 Newark, California: SunUP Network Forum
Wednesday
SunUP didn’t start until 09:00 today. Missed the shuttle bus. No big deal. The campus is only a few miles from the hotel. Got delayed in toll-booth traffic for a few minutes. Drove straight to the right building, no wrong turns. This place is MUCH easier to get around than New York.
The first talk this morning was about the Sun-Fujitsu relationship. Basically it amounts to a way for Sun and Fujitsu to trade some technology for a few years, until they can find a way to screw each other. Diplomacy is the art of saying “Nice Doggy” until you can find a big stick.
Then came the Solaris 10 migration “Lessons Learned” session. This was possibly the best talk of the conference. Some of the highlights were:
- Most applications experience a performance gain simply by upgrading to Solaris 10.
- The IP Stack has been rewritten to vastly improve performance.
- A “Container” == A Zone + Resource Management.
- “Whole Root” local zones!!!
- All Zones in a domain share the same process table. So a fork-bomb in any local zone will crash the global zone. I knew this already, but it is nice to see Sun admit to it.
- Memory leaks in a local zone can also take down the global zone. I didn’t know that, but I suspected.
It would be nice if Sun were actually able to get Zones to be as fine-grained and self-sufficient as LPARs on an IBM mainframe, but they have a LONG way to go.
This would have been the most appropriate discussion to bring up some of the gripes I had about Solaris, but it didn’t seem right to voice them to the guy who migrated his datacenter to Solaris 10. It would have been REALLY nice to have had access to an actual Solaris Engineer.
Then we talked about the new DIMM replacement policy. Most sites like to replace DIMMs that are throwing correctable memory errors, under the assumption that soft errors will lead to hard errors. Sun did some research, and found that 70% of these DIMMs were replaced on ‘suspicion’ of being bad. They collected 800 DIMMs that were throwing correctable errors, and ran them all for 5 months under heavy load. At the end of that 5-month period, they hadn’t seen a single non-correctable error (read: system panic). I know that we replaced a LOT of them on our E10k machines in the first two years I was here.
The new policy is to replace a DIMM only if it has thrown 24 errors over 24 hours. I’m not sure how this meshes with the new Memory Page Retirement functionality that was introduced in Solaris 10, then back-ported to Solaris 9 and Solaris 8. It seems like MPR would retire pages of memory (essentially a “bad block map” for RAM) before they hit that threshold of 24 in 24, and you’d never see enough errors to replace a failing DIMM. They had a customer testimonial, and the guy said that they don’t bother replacing a DIMM until the memory error is logged as persistent. That is how we’ve treated them for the most part over the last few years, anyway.
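To make the “24 in 24” rule concrete, here is a minimal Python sketch of how such a threshold could be checked against a list of correctable-error timestamps. It is my own illustration, not Sun's tooling: the function name `should_replace`, its parameters, and the sample dates are all hypothetical, and it does not model Memory Page Retirement at all.

```python
from collections import deque
from datetime import datetime, timedelta


def should_replace(error_times, threshold=24, window=timedelta(hours=24)):
    """Return True if any `window`-long span holds at least `threshold` errors.

    Hypothetical helper: the defaults mirror the "24 errors over 24 hours"
    rule described in the talk.
    """
    recent = deque()
    for t in sorted(error_times):
        recent.append(t)
        # Drop errors that are older than the window ending at t.
        while t - recent[0] > window:
            recent.popleft()
        if len(recent) >= threshold:
            return True
    return False


# Arbitrary start time, chosen only for the example.
start = datetime(2000, 1, 1)

# 24 errors, one every 30 minutes: all land inside a single 24-hour window.
burst = [start + timedelta(minutes=30 * i) for i in range(24)]

# 48 errors, one every 2 hours: never more than 13 in any 24-hour window.
slow = [start + timedelta(hours=2 * i) for i in range(48)]

print(should_replace(burst))
print(should_replace(slow))
```

The slow trickle has twice as many errors in total but never trips the threshold, which is the pattern I would expect if pages were being retired along the way.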
Sun also suggested the new cediag. This new and presumably useful tool does not ship with the OS, but instead with the 5.0 version of the explorer package. Speaking of which, why isn’t explorer part of the OS by now??
The only choices for technical break-out sessions were “Capacity Management” and “Disaster Recovery.” I stayed for the DR discussion. It wasn’t very useful unfortunately. That being said, I’d like to see more break-out sessions next time, particularly ones with Solaris engineers.
The next discussion was on Time Dependent Reliability (snooze). The guy giving the talk was so far above the heads of the audience it wasn’t funny. The crux of his argument was that MTBF is a poor tool for reliability analysis.
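For what it's worth, here is one way to see why a single MTBF number can mislead. This is my own sketch, not the speaker's material: it assumes textbook exponential and Weibull lifetime models, and the function names and the 100,000-hour figure are made up for illustration. All three curves below have the same mean life, yet the chance of surviving the first year differs a lot depending on whether the failure rate is constant, falling (infant mortality) or rising (wear-out).

```python
import math


def survival_exponential(t, mtbf):
    """Probability of surviving to time t with a constant failure rate."""
    return math.exp(-t / mtbf)


def survival_weibull(t, mean_life, shape):
    """Probability of surviving to time t under a Weibull lifetime model.

    The scale is chosen so the mean life equals `mean_life`, using
    mean = scale * Gamma(1 + 1/shape).
    """
    scale = mean_life / math.gamma(1 + 1 / shape)
    return math.exp(-((t / scale) ** shape))


mtbf = 100_000.0   # hours; hypothetical figure for the example
one_year = 8_760.0  # hours

print("constant rate:   ", survival_exponential(one_year, mtbf))
print("infant mortality:", survival_weibull(one_year, mtbf, shape=0.5))
print("wear-out:        ", survival_weibull(one_year, mtbf, shape=3.0))
```

With shape 1 the Weibull model reduces to the exponential one, so the MTBF figure only tells the whole story when the failure rate really is constant over time.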
The last thing we did was to plan the next meeting. Hopefully, it will be at Sun’s Broomfield campus. Fat Tire is plentiful near Broomfield because the brewery is less than an hour away. I’ve done the tour, and quite enjoyed it.
Wednesday night, I had dinner with Stephen. As good as it was to see Shannon, it was better to see Stephen because I did get to hang out with Shannon and So Jung over Christmas. Stephen, I hadn’t seen since one week before I got married, very near five years. Stephen didn’t have long. Something about Google working him to death, I suspect. Still, it is incredible to me that with real friends, the passage of time evaporates when you get together. It has been eleven years since high school, and it just didn’t matter. I really appreciate that, since it reassures me that I made the right choices in friends so long ago. We ate at the same steak place I had eaten at on the first night. I had two Lagunitas India Pale Ales which claim to be made with 65 different malts and 43 different types of hops. That is incredible. Needless to say, the first one was so good, I had to have a second. Stephen had to leave early, but it was so good to hang out with him that I didn’t care. Hopefully, I’ll get to go back some time.
Thursday
Got up at 09:30. Checked out just before 10:30. On I-880 toward San Jose. I only missed one turn going into the airport, mostly due to construction around the airport. Flight was supposed to depart at 12:15 PDT. We had to wait on the plane at the terminal for an hour, while they fixed the plane with duct tape. Seriously. Ok, ok… so the problem was that one of the overhead bins came unhinged, and they had to tape it closed. I really didn’t think I’d make my flight from DFW to HSV, and I was certain my luggage wouldn’t. Fortunately, I got to the gate just as boarding was starting. My luggage also made it to HSV unharmed. All-in-all, long, boring, and full flights, but safe ones. I got to Huntsvegas at about 20:15, made it home by 21:00.