
SystemsUpgradeManagement

See also SystemsManagement, DesksRoomsAndEquipment, UpdatingMentalModelsofSoftwareDevelopment

Warning! Technical deep diving follows! You have been warned!

From the SystemsManagement page:

I'll share one more fascination with you despite it being hard to limit myself to just three. I'm fascinated with Systems Management. One puzzle that is tugging on me is when to apply software maintenance or upgrades. I'm seeing a lot of companies apply updates to systems without any controls. Today's systems seem to me to be fragile. They look like they are held together with bubble gum and baling wire. I would love to talk with you about your beliefs about when to apply maintenance and upgrades to software. --SteveSmith

System management has always been fragile. IBM MVS maintenance got pretty scientific, but arcane as heck. All the PTF / upgrade management via SMP put the burden on composers of SMP distributions.

Any system that supports dynamic linkage - DLLs in Windows, load modules in MVS, shared segments in Unix, libraries in Macintosh - complicates correct update management. Application A needs DLL X, and later, so does application B. App A gets upgraded, dragging in DLL X version 2. App B goes splat. System management over time is not a science but an art. Various retro-compatibility schemes and co-existence schemes come and go. Fundamentally, there is no realistic inventory with reference counting. Each incoming unit would need to declare its references, with requested versions, that could be used to analyze and manage impacts.
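
For concreteness, here's a rough sketch in Python of the kind of reference-counted inventory that's missing. The Inventory class and its install/uninstall names are invented for illustration, not any real system's API:

    from collections import defaultdict

    class Inventory:
        """Toy reference-counted inventory of shared components (illustrative only)."""
        def __init__(self):
            self.refs = defaultdict(set)   # component name -> apps referencing it

        def install(self, app, components):
            # Record that 'app' depends on each named shared component.
            for comp in components:
                self.refs[comp].add(app)

        def uninstall(self, app):
            # Drop this app's references; report components nothing needs anymore.
            orphans = []
            for comp, users in list(self.refs.items()):
                users.discard(app)
                if not users:
                    orphans.append(comp)
                    del self.refs[comp]
            return orphans

    inv = Inventory()
    inv.install("AppA", ["DLL_X"])
    inv.install("AppB", ["DLL_X"])
    print(inv.uninstall("AppA"))   # [] -- DLL_X is still needed by AppB
    print(inv.uninstall("AppB"))   # ['DLL_X'] -- now it is safe to remove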

Apple Macintosh shared library management comes the closest of any scheme I know to implementing version match-up at load time - each dynamic bind has a "version as compiled" with the component name. The component has an invitation list: "I support version n thru m of component xyz interfaces". If the two don't match, a further search is made. Unfortunately [at least prior to Mac OS X] Mac OS had no way to specify library search order.
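
A toy model of that match-up, sketched in Python - the resolve and compatible functions are invented stand-ins for the idea, not Apple's actual loader logic:

    def compatible(binding_version, lib_min, lib_max):
        # The binding carries the version it was compiled against; the library
        # advertises the range of interface versions it supports.
        return lib_min <= binding_version <= lib_max

    def resolve(name, binding_version, candidates):
        # candidates: list of (library name, min supported, max supported).
        # Return the first compatible library, else None (a real loader would
        # keep searching elsewhere).
        for lib_name, lo, hi in candidates:
            if lib_name == name and compatible(binding_version, lo, hi):
                return (lib_name, lo, hi)
        return None

    libs = [("xyz", (1, 0), (1, 2)), ("xyz", (2, 0), (2, 3))]
    print(resolve("xyz", (1, 1), libs))   # picks the copy supporting 1.0 thru 1.2
    print(resolve("xyz", (3, 0), libs))   # None -- no installed copy is compatible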

Windows "scout's honor" Install / Uninstall list via registry kind of attacks the problem, but is not aware of reference counts > 1 - ie: more than one package shares DLL XYZ. Uninstalling is not an exact science.

No wonder systems with 128+ megabytes of memory and a 30+ MB loaded system image get fragile!

So, I urge consideration for:

  • A library partitioning capability.
  • Reference counting of dependencies as they come and go.
  • Version awareness both at system update and at each run-time binding.
  • Tools and repositories to report collision impacts before they happen (a rough sketch of such a pre-install check follows this list).
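
A rough sketch of the pre-install "what if?" check from the last bullet, in Python; collision_report and its data shapes are made up for illustration:

    def collision_report(installed, incoming):
        # installed: {app: {component: version}}; incoming: {component: new version}.
        # List every app currently bound to a different version of a shared
        # component the incoming package would drag in.
        impacts = []
        for app, deps in installed.items():
            for comp, old in deps.items():
                new = incoming.get(comp)
                if new is not None and new != old:
                    impacts.append((app, comp, old, new))
        return impacts

    installed = {"AppB": {"DLL_X": "1.0"}}
    print(collision_report(installed, {"DLL_X": "2.0"}))
    # [('AppB', 'DLL_X', '1.0', '2.0')] -- AppB goes splat unless 2.0 is compatible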

BobLee 2002.04.15


Fun! Thank you for sharing.

First, isn't it great that the acronym PTF stands for Program Temporary Fix? One of my early lessons in mainframes was that temporary could mean forever.

You mentioned that SMP places a burden on its distributors. I thought that it was a good tradeoff. In my experience, it resulted in significantly more reliable systems. We went from prayer meetings around the console to days, then weeks, then months, then, in some cases, years without IPLs (reboots) or interruptions. As arcane as I find mainframes to be today, the Unix and, especially, Windows servers still aren't even close in terms of uptime. I think it's because mainframers learned and applied systems management lessons. We still seem to be learning those lessons in Unix and Windows environments. What do you think?

Next discussion: DLLs. SteveSmith 2002.04.15


IBM Mainframes were explicitly designed for Commercial DP applications. The essence of DP applications is continued processability of data. UNIX & PCs, on the other hand, were designed for Scientific processing - get the answer, then forget it. A system outage is a nuisance for Scientific apps, but a disaster for DP. You go bankrupt if you don't make payroll! The years from 1965 to 1990 honed OS/360 -> OS/MVT -> OS/MVS -> MVS/370 -> ???/390. Interestingly, much of the low-level tape I/O recovery logic in OS/360 was ported from the 707x line. It was loaded with pragmatic decisions that got fragile as performance & capacity increased. On the other hand, disk access was mostly concurrent and built pretty robustly.

I do recall bumbling into a loophole in 1975: I was testing read-only from our test machine against the live agent's master file that was in use by the production policy writing system (shared DASD). It caused I/O errors on the production system - the prod folks had zapped-off the shared bit for that pack to squeeze just a little faster access in their seek/read programs. My seeks from the other CPU moved the heads and confused the channel programming all to heck. Taught me a thing or two about dubious optimizations!

BobLee 2002.04.15


I like how you framed the difference between the requirements of a commercial and a scientific application. That frame made a lot of sense before the advent of crucial client/server applications, such as Exchange or Notes, which in many companies seem to possess the requirements of the commercial applications you described. These applications are now vital components of the communication infrastructure of companies, especially geographically dispersed companies. From what I can see, these systems aren't managed as well in terms of uptime requirements as mainframes. For instance, I rarely if ever see change control in the Unix and Windows environments. I'm curious whether you think that the same type of systems management that was used on mainframes is needed for servers. I'm searching for ideas about what's appropriate. SteveSmith 2002.04.16


I think the problem is less simple than that. Mainframes had the huge advantage of being centralized - single system image.

When I was with Concord Communications in late 1995, they were developing the Network Health software tool. It was predicated on the fact that in any large company, the network topology never holds still long enough to be mapped by humans. The number of desktop installs / uninstalls / upgrades per day ensures that network testing coverage is an oxymoron. Every one of those boxes is a unique operating system instance, rarely holding still. Network Health silently discovered topology, polled the SNMP MIBs, and accumulated statistics that let it filter out the "Top n Problem Nodes" on a daily basis, plus emergency warnings on outages exceeding thresholds. That kind of sophistication, even at a reasonably crude level, allows mere mortals to cope with such a moving target.
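
A crude sketch of that "Top n Problem Nodes" filtering, in Python - purely illustrative, not Concord's actual algorithm or thresholds:

    def top_problem_nodes(error_counts, n=3, emergency_threshold=1000):
        # error_counts: {node: errors accumulated since the last report}.
        ranked = sorted(error_counts.items(), key=lambda kv: kv[1], reverse=True)
        emergencies = [node for node, errs in ranked if errs >= emergency_threshold]
        return ranked[:n], emergencies

    counts = {"router-3": 1200, "desktop-17": 40, "server-2": 310, "printer-9": 5}
    top, alarms = top_problem_nodes(counts)
    print(top)     # [('router-3', 1200), ('server-2', 310), ('desktop-17', 40)]
    print(alarms)  # ['router-3'] -- over the emergency threshold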

Inside Unix & Windows, it's kind of pig heaven from what I've seen - very little of the library segregation that's prominent on big blue iron. The consequence of this lack of isolation is that "try it first" is difficult. There is a tremendous amount of production / test isolation and production migration sophistication in mainframes that was earned the hard way, and will probably be learned that way again.

When Microsoft introduced MTS, I felt a blinding flash of "That's CICS, all over again!" -- I could name the logical functional service blocks from memory. I think one of those old subsystem block diagrams from 1980 would fit perfectly, though some of the acronyms have changed.

I wouldn't throw away those IBM Systems Journals from 1970-1990, they're coming around again!

BobLee 2002.04.17


"There is a tremendous amount of production / test isolation and production migration sophistication in mainframes that was earned the hard way, and will probably be learned that way again." Amen.

Your above comment tells me that first the system architects have to learn the lesson (understand new requirements). Then, and only then, will they understand the need to extend the architecture to create isolation.

I suspect the same kind of process happens with the people who run the systems. First, they and their customers must experience some painful lessons so that it is clear that their existing technology is not working effectively. Then, and only then, will they accept that change control and new technology can help relieve the pain.

My understanding is that with ME and XP, Microsoft has isolated DLL changes by system and application. There are some articles on the Microsoft web site about the isolation, but doing that kind of research is low on my priority list.

SteveSmith 2002.04.18


One nice thing about MacOS since it moved to PowerPC, mumble years ago, is that shared libraries (the equivalent of DLLs) are (1) versioned and (2) named independently of the file name. There are resources internal to the file that contain the name the run-time linker uses, the version of that library, and which other versions that library is compatible with. Not easy to explain without drawing a diagram or two.

#2 means that a user could rename shared library files as much as he wants, and everything would still work, as long as they stayed in the extensions folder.

#1 and #2 mean that two applications using two versions of the 'same' library can each install it into the extensions folder, and each would run the version appropriate to that application.

Windows XP is attempting to do something similar now, but only for applications built to take advantage of it, and many apps are breaking on XP whether they try to use this library versioning or not.

Other than WinXP's half-baked imitation, I'm not aware of any other operating system that versions the shared libraries this way.

KeithRay 2002.04.19


That's about the state of the art. Java's run-time may be smarter - I haven't drilled into it - but the problem with these schemes is the complete lack of a viable purge algorithm that doesn't cause smoke to rise from your ears afterwards.

Sooner or later, the accretion of DLLs, shared libraries, etc. causes the search scheme to bog down, performance dips, and you feel a real desire to clean out unreachable loads. There is no tooling to evaluate the reference tree, either to report on it or to back a purge algorithm. The other problem is that all the eggs are in one basket - a single level of "what's present" without support for "under trial" or "backed out until we find the problem". We used to manage availability pretty easily on big iron with library staging & renaming.
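
A sketch of what the missing purge tooling might do, in Python: walk the reference tree from the installed applications and flag anything nothing still reaches (names and data are invented):

    def unreachable_libraries(apps, depends_on):
        # apps: installed application names (the roots of the reference tree).
        # depends_on: {name: [names it loads]} for apps and libraries alike.
        reachable, stack = set(), list(apps)
        while stack:
            node = stack.pop()
            if node in reachable:
                continue
            reachable.add(node)
            stack.extend(depends_on.get(node, []))
        everything = set(depends_on) | {d for deps in depends_on.values() for d in deps}
        return sorted(everything - reachable - set(apps))

    deps = {"AppA": ["libNew"], "libNew": [], "libOld": []}
    print(unreachable_libraries(["AppA"], deps))   # ['libOld'] -- a purge candidate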

The ability to partition an application or system and its run-time components was sacrificed historically to the scarcity of disk space. Better to use the shared copy of the MSVC runtime DLL - although that has been upgraded incompatibly time after time. When application X runs, it may have been affected by any subsequent installs - especially "free Beta Trials" that carried non-standard runtime libraries newer than the official, managed ones. Unintended runtime changes can mysteriously "untest" stable installed software this way. BobLee 2002.04.19


Java is a little smarter and dumber. It specifies rules for binary compatibility across versions of a class [e.g., a newer class could have more methods, but has to have all the old methods to be compatible - much more flexible than C++ binary compatibility, where new methods could not be added unless they were static/non-virtual], and the class loader enforces those rules. 'Smarter' in that if a Java application is trying to load a class and it isn't binary compatible, the class loader will throw an exception (and probably bring down the whole Java app unless the writer was specifically checking for class-load failures). 'Dumber' applies to not having a defined way of searching for more than one version of a class - the class loader will not search for a compatible version of a class; it will just try to load the first one with the correct class name and package name that it finds. KeithRay 2002.04.20
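
A toy Python rendering of that rule - a newer version passes only if it still offers every method the old one had. Java's real checks are richer; this is just the shape of them, with invented names:

    class IncompatibleClassChangeError(Exception):
        pass

    def check_compatible(old_methods, new_methods):
        # A newer version must still offer every method the old one had.
        missing = set(old_methods) - set(new_methods)
        if missing:
            # Roughly analogous to the class loader throwing at load time.
            raise IncompatibleClassChangeError("missing methods: %s" % sorted(missing))
        return True

    print(check_compatible({"open", "read"}, {"open", "read", "close"}))  # True
    try:
        check_compatible({"open", "read"}, {"open"})
    except IncompatibleClassChangeError as err:
        print(err)   # missing methods: ['read']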


It all goes to prove once more that fools can create systems that wise folks cannot keep track of. The fools are always ahead in this game, unless the wise ones are wise enough to set limits on foolishness. - JerryWeinberg 02.04.20

Never strive for foolproof - fools are too ingenious! Whenever you assume foolproof instead of coping with excess foolishness, you're in trouble.

BobLee 2002.04.21


I have spent the past few years working on some aspects of this problem for Linux. Specifically, Linux has a tool for managing packages (RPM, the RedHat Package Manager), and I was able to extend it in two ways.

1) I added support for a 'manifest file'. You can now put a text-file list of packages into your Version Control System, and a given machine will look at the list and make its installed package set match it. Now we have CM for binaries. (A small sketch of the idea appears below, after item 2.)

2) I added additional dependency support. Some packages will not work unless other packages which they use are installed on the system. RPM tracks these dependencies, and when you install new packages it tries to ensure you have a complete system.

The package manager always knew about binary dependencies, since it could ask the operating system (via ldd) which shared libraries a file uses. I added dependency support for all kinds of interpreted/semi-interpreted languages (Perl, Shell, Java, Python, HTML, SQL). I can even track build-time dependencies (make files are just shell scripts).
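
Here is a rough sketch, in Python, of what the manifest diff in item 1 amounts to: compute what to install and what to remove so the machine matches the list. The package names echo the example further down; nothing here is RPM's real interface:

    def plan_changes(manifest, installed):
        # manifest, installed: sets of name-version-release strings.
        # Return (packages to install, packages to remove) so the machine
        # ends up matching the manifest exactly.
        return sorted(manifest - installed), sorted(installed - manifest)

    manifest = {"procmail-3.13.1-2", "time-1.7-9", "libtermcap-2.0.8-13"}
    installed = {"time-1.7-9", "libghttp-devel-1.0.2-3"}
    print(plan_changes(manifest, installed))
    # (['libtermcap-2.0.8-13', 'procmail-3.13.1-2'], ['libghttp-devel-1.0.2-3'])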

Anyway, to make a long story short: after 4 years of working with the RPM maintainer at RedHat and using this tool at various jobs, I cannot get people to think about release management issues.

KenEstes 2002.06.27


Ken, I like the idea of the manifest, but how does it accommodate incremental changes? Ie: Java VM level X.Y presumes Java class libraries 1.3.n today, but a new Java servlet comes in wanting 1.4.1 while a second servlet wants 1.3.n. How and when is the concurrent conflict resolved? Is it resolved at a pre-install "What if?"?

When the last applet stops needing Java 1.1 libraries, do you know it? Can you discover it?

I see a lot of "right now" tools, but I don't see "down the road" tools as clearly.

How does a 3rd party, ie: user, query the manifest productively?

--BobLee 2002.06.28




Bob,

You have asked many unresolved questions.

1) The manifest is strictly a list of packages with their version numbers. It might look like this (in part):

    procmail-3.13.1-2
    libghttp-devel-1.0.2-3
    time-1.7-9
    libtermcap-2.0.8-13

You use it to install all the packages you want on your system. Since it is a text file living in Version Control, you can now roll back your system to any previous configuration of files. It is also easy to see "what was installed on machine Y three months ago".

2) When you create a package you have an opportunity to add dependencies to it manually. You can say things like:

    requires: libghttp-devel >= 1.0.2, libtermcap = 2.0.8
    provides: httpd-server

Here you are using package names on the requires line to make sure that packages with the correct versions are installed.

What if you specify the dependency as above but in fact it breaks with libghttp-devel = 1.0.4? Well, you are in trouble; this is a tracking system, not a prophet. Where do these dependencies come from? I often get the manual dependencies from the documentation of the new application.

The provides line is more abstract. It is describing a type of "service" so that I can specify different packages which provide the same service and ensure that at least one of them is installed. Dependencies are just strings and I can make up any strings I want.
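
A small Python sketch of how such requires/provides strings might be checked; the data, the tuple version encoding, and the satisfied function are all made up - RPM's actual matching is more involved:

    installed = {
        "libghttp-devel": (1, 0, 2),
        "libtermcap": (2, 0, 8),
        "apache": (1, 3, 20),
    }
    virtual = {"httpd-server": "apache"}   # capability string -> a package providing it

    def satisfied(name, op, version):
        # A requirement is met by a real package at an acceptable version,
        # or by any installed package that claims to provide the capability.
        if name in virtual:
            return virtual[name] in installed
        have = installed.get(name)
        if have is None:
            return False
        return have >= version if op == ">=" else have == version

    print(satisfied("libghttp-devel", ">=", (1, 0, 2)))   # True
    print(satisfied("libtermcap", "=", (2, 0, 8)))        # True
    print(satisfied("httpd-server", ">=", (0,)))          # True -- apache provides it
    print(satisfied("cobol", ">=", (1,)))                 # False -- unmet (or ignorable)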

There are more than manual dependencies. I have a Java dependency engine which will go groveling through your class files and tell me which other classes (and methods within a class) need to be present for this code to run. These dependencies get added to the package at package construction time. The engine will also find Java comments of a certain type and assume that the developers put manual dependencies into the code. These dependencies might encode things like "authentication-protocol=3.0" or "file-format=4.7".

Conflict resolution.

When you go to install a new manifest file (upgrade a machine's set of packages), the installer may complain that you do not have the dependencies for the install. You may have to find them (what is the name of the package which gives me a shared library called libwwwutils.so?) or ignore them (this "depends on cobol", but that just means they gave us some cobol hooks; we do not use cobol, so we do not depend on it). Analyzing the dependencies is very manual and very centralized. I often have to repackage other people's packages to get the dependencies to work out better. However, my system has some hope of tracking incompatible states.



> When the last applet stops needing Java 1.1 libraries, do you know it? Can you discover it?

You can query the system, 'whatrequires libtermcap-2.0.8-13'
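
For illustration, a toy whatrequires in Python: invert the requires edges and ask who still needs a thing before retiring it (not RPM's real query machinery; the package names are invented):

    def whatrequires(target, requires):
        # requires: {package: set of things it declares it needs}.
        return sorted(pkg for pkg, needs in requires.items() if target in needs)

    requires = {
        "old-applet-1.0-1": {"java-libs-1.1"},
        "new-servlet-2.3-4": {"java-libs-1.4.1"},
    }
    print(whatrequires("java-libs-1.1", requires))   # ['old-applet-1.0-1']
    print(whatrequires("java-libs-1.3", requires))   # [] -- nothing left needs it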

> How does a 3rd party, ie: user, query the manifest productively?

This is a problem, but often it can be solved by adding manual dependencies to the package at build time. Again I assume that the enterprise has a centralized package management function.

The RedHat person I know was working on some fancy dependency resolution stuff, with clever algorithms to query centralized databases.

KenEstes buildmaster to the stars 2002.06.28


Then the good news is that there's a consumer-wide market for systematic version control analysis.

Issues:

  • to what extent is the info in the manifest "scout's honor" as opposed to derived mechanically? (Can a mis-typed 1.3.8.4a really be meant to be 1.3.4.8a, etc.?) When Micro$oft introduced the "registry database", it was intended to solve things just like this.
  • what does it need to become pro-active (watch out for...) rather than reactive (oops... now let's backscan the version control logs...)?
  • support for "what if", "rollback", "commit" concepts (a rough sketch follows this list)
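
A rough sketch of that what-if / rollback / commit idea, in Python; the StagedInstall class and its methods are invented for illustration:

    import copy

    class StagedInstall:
        """Stage changes against a copy; inspect, then commit or throw it away."""
        def __init__(self, system_state):
            self.live = system_state
            self.staged = copy.deepcopy(system_state)   # trial copy

        def install(self, package, version):
            self.staged[package] = version

        def what_if(self):
            # Report what would change, before touching the live system.
            return {p: (self.live.get(p), v)
                    for p, v in self.staged.items() if self.live.get(p) != v}

        def commit(self):
            self.live.clear()
            self.live.update(self.staged)

        def rollback(self):
            self.staged = copy.deepcopy(self.live)

    trial = StagedInstall({"DLL_X": "1.0"})
    trial.install("DLL_X", "2.0")
    print(trial.what_if())   # {'DLL_X': ('1.0', '2.0')}
    trial.rollback()
    print(trial.what_if())   # {} -- nothing staged; the live system was never touched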

MS Registry suffers from:

  • It's just 1-deep (history is lost with updates): "What was this when it worked last Tuesday?"
  • It's arcane and virtually untestable
  • Any smart-dumb fool can break it with regedit
  • There's no "undo", no trial-mode, no rollback.
  • The registry is an ACTIVE part of run-time components; it bypasses the version/release stamp controls, and is largely invisible.
  • It's become the most popular resting place for PTFs (Program Temporary Fixes). Look at the registry for a patched MS DLL component. The "PTFs" rarely get cleaned out after the fix has been made in source, leaving scary "Don't touch that!!!" clutter behind.
  • Testing coverage for a DLL that interacts with the registry, or invokes another DLL that might, is not manageable or predictable over a wide customer base.

Nice problems to chew on. But there's innovation waiting to be made & sold out there!

--BobLee 2002.06.28




RPM is a database. It is easy to run all kinds of queries to get the data you want out of it. You can easily generate the current manifest by a standard query. RPM does not have a version control system built in, so historical data is unavailable. I use scripts to get my manifests from version control and load the new set of packages from the manifest. Thus my procedures keep the version control in sync with the current state of the machine.

The MS registry seems to take the place of all the Unix text files which are used for configuration. I find it easier not to use RPM for managing these files. I keep all configuration files in Version Control and treat the manifest file as just another system configuration file.

> Then the good news is that there's a consumer-wide market for systematic version control analysis.

I do not see this at all. The maintainer of RPM feels disenfranchised from the rest of RedHat even though both he and I believe he is doing some work of critical importance to the company.

I just had an interview with people who wanted to use my expertise and promised me that "release management is not a second class citizen here". They want exactly what I have developed over the last few years, and I am one of the few people with intimate RPM knowledge and contacts. I got lowballed on that contract; they did not even offer the going rate for the area.

Future Research


The maintainer of RPM has some really interesting (but a bit far-out) research ideas. I do not claim to understand everything he is working on, but he is smart, and these may be fruitful areas of research for someone:

1) He claims package management problems (the hard ones) are exactly the same as managing a key-ring for public key cryptography. He wants to solve both problems at the same time and have a package manager and a key-ring manager.

2) He finds that for management work the complete dependency graph is too complicated for analysis. He has a notion of "primary dependency", so that each package depends on at most one other package. This simplified view makes some very interesting graphs of the full RedHat install which seem very "intuitive" and useful.
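
A tiny Python sketch of that simplification: collapse the full dependency graph to at most one parent per package. The parent choice here (alphabetically first) is arbitrary and just stands in for whatever heuristic he actually uses:

    def primary_forest(full_deps):
        # full_deps: {package: set of packages it depends on}.
        # Keep at most one parent per package.
        return {pkg: (min(deps) if deps else None) for pkg, deps in full_deps.items()}

    full = {"mail-client": {"libssl", "libc"}, "libssl": {"libc"}, "libc": set()}
    print(primary_forest(full))
    # {'mail-client': 'libc', 'libssl': 'libc', 'libc': None}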

3) There is some interest in combining the RPM backend directly to the CVS Version Control front end which would do away with packages as an intermediate form and make the version control repository primary for all storage.

KenEstes 2002.06.29


All this stuff I wrote about systems upgrades will certainly be confusing to the uninitiated. I want to say a few thoughts, as clearly as possible, which encompass my views on the subject.

1) There is a large portion of CM which is a bookkeeping exercise. Much more data than most people are aware of should probably be tracked in one or more databases. This is strictly a Computer Science problem; its principles are fairly general and apply to all projects, and I have a bunch of experience and thoughts (internalized models) about this.

2) Systems need to be designed so that they can be upgraded. Too few projects think about how the system will be evolved as it changes. This is a problem of how the system is architected, but it is often ignored. Solutions in this space need to focus closely on 'how things really get done around here': what is easy to upgrade in this organization, what management is likely to change its mind about, and what kinds of errors we are likely to make during an upgrade or as a result of a bad rollout. This problem is hard because good solutions end up being specific to the organization they were designed for. General principles become more scarce here.

3) Project management needs to plan reasonable strategies for managing the upgrades. How do we communicate changes? How do we baseline parts of a system? This is a really hard management problem, and I have never been involved in an organization which takes this seriously. If this problem is not addressed, then solutions in the other two spaces lose their power. In fact, I suspect that if this problem is addressed successfully, the need for both 1 and 2 becomes dramatically smaller. (I just got Dwayne Phillips's book "The Software Project Manager's Handbook" and I hope to learn more about this space. Would someone please twist his arm and 'make him come' (inflict travel upon him) to AYE?)

KenEstes 2002.06.29


Ken, that's a lot of System Management above. I think you've got a pretty good outline of the problem relative to a single platform.

One of my revelations with distributed applications [that surfaced while supporting 1000+ mainframe customers in 1990-1995] is that the configuration management done by vendors is necessarily toothless:

  • Vendors can't test the infrastructures customers will have - now or in the future - unless the customers are captive in-house folks with no installation options at all. (Corporate folk brew up system images & download them.)
  • You can't imagine all the customizations and accidents that you can encounter on a customer's platform - the permutations are wickedly variant.
  • You can't expect customers to stop everything and install your latest [or even N-2] release in any convenient timeframe. We got support calls for some very retro releases of OUR product, and those were mixed with some pretty retro operating system releases, too!

Since 1995, I've been on both the vendor and customer side of the SM problem, and I have to say nothing makes it easy for anyone to get it right. Too many times it was cheaper in salary expense to dump an old box and all its software and install clean, new boxes with current releases - getting there incrementally was too disruptive.

What's everybody's experience with "Oh, just [re-]install X, Y, and Z and you'll be cool"? My experience is that I usually lose productivity for about 2 weeks each time I upgrade MS Word, a development IDE, a browser, an e-mailer, etc. After about 2 weeks I usually get back to just about as good as before it was enhanced!!!

--BobLee 2002.06.30


In the mid-90s I did some work for a customer service organization that was supporting multiple financial products with a vast array of client/server and mainframe applications. (Wasn't Fidelity with Bob, 'twas a different mutual fund complex.)

They had approached the SM problem by creating software tiers for desktop machines (I don't remember this exactly, so this is quite approximate):

Base tier had the OS and office productivity bundle.

Tier 1 had the applications that were common to 90% of the agents.

Tier 2 had another bundle of applications used by a large percentage of the agents.

Then they had a bunch of specialized tiers used by groups who supported particular products or functions (e.g., life underwriting).

The tier system helped with isolation for testing. They used Tivoli to poll status and push updates to the desktop. They still had a ton of problems, though, keeping track of which machine was where and getting permissions set up for access to sensitive data. And it was expensive.

Of course the answer to "expensive," given by someone who had been away from technology for a good long while, was to update the desktop only once a year. Of course that didn't sit well with the business, because they launched new products several times a year and had to respond to regulatory changes at unpredictable intervals.

Which brings me to another point I've been pondering recently:

In many companies, the people making strategic decisions that involve technology are way up in the organization... they've moved up the management ladder, and haven't really written (or seen) any code for 15 - 20 years.

And building software these days isn't always like it was 20 years ago. Now developers are likely to stitch together commercial components and add some application code on top of that. The difficulty is around the interfaces and in working with code that you can't see, don't have access to, and have to rely on someone else to debug and fix (in their own sweet time, of course).

I keep running into top managers who are working out of the mental model of what it was like writing code 15 - 20 years ago.

How do we help managers who have been away from development for a long time update their mental model of developing software? What does every manager need to know about the current realities of developing software?

(I'll split this onto another page, UpdatingMentalModelsofSoftwareDevelopment)

EstherDerby 09.02.2002

