ProductionPracticesCheckList

Reports for upper management, prepared periodically (monthly?):

*) Time that the service was unavailable to our customers. This includes external network problems, service degradation, reduction in the number of services which are available. This report is needed by senior managment to determine the availibility of the production environment. All measures should be made in a way which is meaningful to the business. If outages and service inturuptions are frequent it may be useful to bin the data into meaningful catagories. for example:

occasional transient failures
(failures lasting less then one minute but less

than once an hour)
occasional small outages (less then one hour duration,
less frequently then once every three months)
systemic small outages (less then one hour duration,
more frequently then once every three months)
large outages (more then one hour duration)

*) Costs for each product/release in production. This reports should be prepared in the same units which the business group uses (dollars per customer, dollars per transaction per customer). This report includes all maintenance costs for operating the finished system including: software licenses, hardware, storage media, bandwidth, system administration time (broken down into configuration, monitoring and manual actions). This report shall be used to determine our break even costs for our services and shall be used to determine which releases of our software are cheapest to operate and what costs need to be tackled in the next version of the software. This is for both upper managment and the development organization. The most compelling arguments for changes in software is costs. Only the production environment can track the real costs ofthe software. No other organziation knows the total amount of hardware needed for each release nor the people cost for manual adminstration tasks.

*) Capacity limits and growth plans. Production must monitor the usage and growth rates of all scarce resources (network bandwidth, IO bandwidth, CPU activity, storage usage,) to ensure that adequate resource remain available for the next purchasing period. Production must work with the business units to ensure that they are able to handle the projected growth in existing products as well as any new products under development. Production needs to work with development to ensure that applications can collect the data needed for accurate utilization measures. The real utilzation of equipment must be measured to find bottlnecks and to accuratly reflect the usage of the system (Search for Purple Tornado documents which is a good reference for best practices)

*) Monitor recommended OS Patches, Security Upgrades and prepare a report detailing the time from when the advisory was made public to the time when the fix was deployed across our servers. The average time to deploy a fix should decrease with experience and upper management should set a goal for this metric. Few organizations currently track vendor patches and this is a large risk. Security problems are discovered regularly and it is always possible that new bugs in software will corrupt data and damage files. Only by assigning people to monitor vendors errata annoucements can an organization protect itsself from major known risks in third party software. The Managment of the production organization needs to have a clear idea what their time to deploy these 'bugfix' upgrades are so that they can determine if this is an acceptable risk.

Actions which are productions responsibility:

*) Track the serial numbers of hardware so that:

$) problem components or vendors who deliver faulty hardware can be
detected and removed from production.
$) individual machines can be located quickly.

*) Track the licenses to be sure that they have not expired and that we have enough licenses for our expected needs.

*) Releases: All releases shall be able to be backed out. We should be able to rollback any release which has severe problems. Production must set detailed installation standards.

*) Backups: The backup schedule must be published. The ability to recover from backed up media must be tested regularly. All new production software must detail an appropriate backup plan for all data (customer data, internal data, code).

*) Monitoring and Logging: Production must specify a standard to record program driven events. The standard must cover security violations, catastrophic failures in the application, measures of resource utilization, performance of code modules. Production must monitor the production system.

*) Ensure that there is a clear path for escalation of problems including a trouble tickets system. Ensure that QA is involved in testing issues of concern to production.

*) Check that the /tmp and other file systems have enough space to perform their work.

*) Acceptance Criterion: Production will publish the requirements for new products/releases to be accepted by production. This will include manual for installation configuration, and monitoring, developers training the SA's in how to use the system in a simulated environment, on call databases for contacting developers in the event of product failure, analysis of the single points of failure in the product, changes to the product since the last release.

*) Change Control Board: Set up a board of effected people to schedule any changes in the production environment. This Change Control Board should be organized similar to the software development change control board which rules on changes to requirements and design of the project.

*) Reboots: nearly all applications and OS leak and fragment data when run continuously. Production should encourage developers to design their code so that the machines can be rebooted regularly (monthly?) to prevent degradation of performance. This will also help to ensure that in the event of a power outage all the systems will come up cleanly since all reboot sequences have been tested regularly.

*) Network Policies: Publish a document explaining network architecture/maintenance issues that developers need to be concerned about. This document should include:

$) How to request a port to be reserved for a new application.
$) where to get a list of currently used ports in the production environment,
$) where to find the current list of DNS servers and time servers.
$) How to request an IP address for a new machine.
$) Who is responsible for downloading the current list of root IP servers and how frequently this is done.
$) Our policy for checking accuracy of DNS entries ("we run

nslint from the Network Research Group at the Lawrence Berkeley Laboratory to ensure that there are no errors in our DNS data. We consider any output from nslint to be an error this includes stylistic warnings").

$) Issues related to the reliability of DNS servers in

production: some application hard code IP addresses as well as hostnames in their configuration just in case the DNS fails, is this allowed? Do we distribute static host tables (generated from the DNS database via dnsutl) to production machines on a periodic basis (monthly) so that in the event that DNS fails there is a reasonable backup on each machine?

$) Subnetting issues: are the Database and the webservers on the same subnet? When should direct connect be requested. Do developers need to inform production of the traffic expected between machines?
$) What does the Firewall block between production and the rest of the world?
$) Are there firewalls between different subnets of production.
$) What are the security issues between development and production networks.

*) OS Installation. All OS installations should be standard from a small number of approved configuration. The configurations will specify the components of the system (CPU type, number and size of drives, Network interfaces) as well as the OS version OS patches and installed base software. All specifications shall be stored in version control and there shall be a process where developers can modify or add specifications. QA and Development must be able to get machines setup in configurations which mirror the production installation. Development must be able to get machines setup in configurations which are nolonger considered 'current' but were current at one time.

*) A security document shall be written for developers. This will clearly state the policy for design of software and outline such issues as:

$) What sort of access do developers get on the production machines? Usually developers can log in and examine their jobs but not restart them or change configuration or data files. What measures will enforce the policy.
$) Production Ids: application processes will need to be configured to run under and ID this ID is not the ID of a person but of the application. Authentication issues for this id need to be address as well as general configuration issues. Does this ID have a home directory? How do authorized users "become" this id (sudo)? What is the convention for the password field ('*nopasswd*') what happens to mail generated for this id (cron jobs often send mail back to their owners)? How are new ID's created? Who is the responsible party if there are issues with the application or its id?
$) Will their be different classes of production SA? If some SA's are not expected to have all the privileges of others then the application needs to be designed so that some useful maintenance can be performed by the less trusted SA's.
$) What are the means for moving data between production machines and between production and development (scp). Authentication mechanisms often make automated file transfer difficult.
$) How can remote jobs be started between machines? Especially of interest are production Ids starting jobs on other machines and the starting of root programs on other machines for maintenance and software deployment issues. Authentication mechanisms often make this difficult.
$) What are the directory conventions for non OS software (/usr/local/, /opt/production). There should be a statement discouraging all files which do not match the pattern:
^[a-zA-Z-09/-_.]+$
The use of characters which are metacharaters in Unix tools shall be discouraged. In particular Java requires some classes to have names with an embedded '$' this is allow for those Java classes but discouraged in all other cases. No file names should ever be the same as standard utilities.
$) How should applications log security related data to the bastion host.
$) security audit standards.

*) A regression test should be constructed to test all complex configuration files particularly firewall and webservers. This set of tests shall provide a clear warning when the configuration is set to an unsafe state. It is expected that overtime the interactions between the various settings will have unexpected consequences. Production needs a means of ensuring that certain "invariants" are always met even when the configuration file becomes complex and confusing to edit. Thus error messages such as:

SMTP is acessable from outside the network

are better then errors like

port 25 is not blocked

*) Production shall regularly review/audit (monthly):

These reviews are intended to ensure that the data is understandable and correct.

$) review all configuration files
$) review the list of installed software

These reviews are security related and may require checking after the machine as been booted from media which is known to be 'clean'.

$) check the permissions and owner ship on all directories
$) check a cryptographic checksum of all executable and static
data files.
$) check the pathnames of all files for: characters outside of
the standard, standard commands installed in nonstandard places.
$) check for world writable directories
$) check for device files outside of the /dev directory

written by KenEstes (Late spring of 2001)

Ken I would recommend breaking this down modularly. It's too big a chunk of detail for anyone other than a sysadmin/netadmin/DBA to chew on.

Suggestions for subsetting this into smaller topics:

Classify into "production" subareas. (Data center, Network, Remote, Desktop...)
Divide tasks into:
- Attributes of interest to monitor (core / optional)
- Monitoring processes (capture methods & tooling)
- Data reduction analysis techniques (threshholds, trends, hot spots...)
- Presentation audiences (not simply "senior management" - that dog won't hunt!)
- Presentation media (Web, reports, on-request, ...)
- Responsibilities for above implied tasks
- Cost / benefit criteria - not all will implement all of above for everything. It's a risk management & cost / benefit decision.
- How to sell this to those with the money.
- How to grow this incrementally.

To capture management interest, it needs to be presented and justified in management's terminology (risk + cost / benefit). This is a job for more than one set of eyes!

--BobLee 2002.10.21

I saved all the useful comments you emailed me last year, right after I got back from AYE. What I remember most is your comment that there are an awful lot of reports. Who will read them all? When will we have time to make them all? I think you are right that this kind of thinking promotes beauracracy. However I do believe that there is much wisdom in this list. I am just not sure how to see it addressed.

--KenEstes 2002.10.21

You're right. I just looked back at the dusty email archives and saw that I repeated myself. (At least I was consistent.) Do we need a production rant session at AYE? SteveSmith's pieces on system upgrades trend there... Catch you in 2 weeks! --BobLee 2002.10.21

Ken, Keith, Bob,

It seems to me the information above isn't that complex or burdensome. Here's what I see:

Four reports to management. Not too burdensome. The fourth seems ad hoc, not monthly, as I expect change recommendations wouldn't be regular. I may be wrong.
A mix of processes, policies and data tracking that are production's responsibility. I would think these are separate documents, as they would be used and revised in very different ways and cycles.

What might I be missing?

- BeckyWinant 2002.10.22

Hi Ken,

hope you don't fall off you chair in suprise at this email. I was going through some stuff and really had to get this back to you.

Cheers,

Martin.

Production Group - comments.

(1). Overall, very very good, and like some of the others I have some additions and amendments to make:

(2). Production needs to publiscise HOW it works and what its goals are, and therefore WHY it needs certain paperwork and standards. Without this, jaded development groups are not going to willingly buy into much of the crucial process management stuff that is needed. It also needs to publish where it is intending to go, and therefore why certain additional things are going to be required in the future. A website is not enough, senior management must openly buy into the standards and promote it. A slogan like 'IT health is measured by the health of the production site' or something like that is worth thinking about.

(3). Administrative burden to implement much of the processes needs to be recognised as a cost of doing business and be factoreed into a development schedules and timelines. It also needs to be recognised that aggressive automation is key to managing the costs and that a group be established tasked with bringing down the costs and increasing reliability. It is arguable where the responsibility for this group lies, it really depends on the organisations size and structure.

(4). It seems like a truism, but all production code must be supported somewhere. Unsupported code is a huge issue and a report of unsupported code needs to be regularly reviewed and an action plan for each item agreed. Resp: Dev and Production.

(5). Monthly reboots - how about 'agree on a reboot schedule and implement it' with the rule that every machine must be rebooted once every xxx months. Also - the startup decks for the machines need to be checked every reboot. A key requirement is that a routine machine reboot must be possible without developers being contacted.

(6). No production application can be machine hostname dependant, or IP address dependant. Essentially an application must be portable to a new machine with minimal or no change required, even if the machine is larger (not smaller) than the current one.

(7). I totally agree on the root / prod ID policy - I think you wrote this based on what we did at MS!

(8). Something needs to be added to cover maintenance scripts for production applications. Who maintains them, when they run, how they are changed etc.

(9). Something needs to be added to cover alert management - how tickets are generated and reported and an agreed policy for escalation of the alerts within a pre-agreed time.

(10). Depending on company size and policy - disaster recovery procedures and test plans need to be worked out. Creation of this is a joint PROD/Dev thing but ownership for maintenance of the plan and rehearsals is with Production but with Dev imput and monitoring.

(11). Something needs to be added to cover outage post mortems - essentially when is a blip considered an outage and what are the procedures after the outage to go over what happened and what needs to be done in future to prevent this. This was one of the best things I learned at mail, and was very well run. You probably never saw this though.

I hope this helps!

Cheers,

Martin.

Martin (Pepper),

As the one person who worked with me on the two jobs which influenced this document the most I value your contributions highly.

However this article was written nearly two years ago and I am slowly moving away from such a skills based view of the world. Originally I was thinking that such a list would be helpful for an organization where the development was done according to a well defined process (say CMM or as described in Steve McConnell's "Project Management Survival ). Thus most of your concerns about developer buy in or development maintenence of code are assumed to be understood.

It seems that most organizations are too far from such issues to begin to discuss this in a reasonable manner. I have been struggling to integrate the ideas of "congruent management" into my beliefs about process as an organizing principle (I never viewed it as a management tool for controlling programmers, though this seems to be the most common interpretation.).

I have been thinking recently about the emotional and personality driven issues with process.

How much of a typical process is really just a list of organizational preferences of the process writer? For example I do "continuous improvement" on most of my daily chores. I am always looking for shortcuts on the way to work or examining my beliefs about how laundry should be done. Perhaps this idea of process is for people like me who would do it anyway.

What percentage of well run project really follow any process? It seems to me that most work would fall under "exceptions to the process as written", "clearly this is the intent of the process". If people really do not follow processes then how can we tell the difference between two groups who claim that the process is not applicable to them and they need to do something different. One group is lazy and looking for an excuse to ignore the process, and the other group has a real example which does not fit in the rigidly defined process.

--KenEstes 2002.11.21

I'm inclined to think that a skills based view of the world is NecessaryButNotSufficient for success of an individual, team, department, or organization. Personality based points of view are likewise NecessaryButNotSufficient. I have interest in exploring both the various points of view that contribute to success, and the details within any given point of view. As such I think your list, Ken, of process areas that matter is an interesting and useful contribution. I think that the observation that skills in process areas is not all that matters is at least as useful.

Personally, I become suspicious when any one point of view is expressed as the thing. "All you need is love." Well, no. Necessary, but not sufficient, I think. "All you need is skills." Again, necessary, but not sufficient. "All you need is process?" "All you need is resources?" "All you need is a problem to solve?"

There are at least three things being said when a candidate solution shows up phrased as: "All you need is . . . whatever."

Whatever is a candidate solution.
It's also an assertion that whatever is a complete candidate solution, thus "all."
It's also an assertion that a solution vs. an improvement is to be had.

I think there are maybe three useful heuristics for acting when something is explained as "All you need . . . ":

Ignore the "all" part, and the solution part. Believing in a universal, exclusive solution is probably a mistake. The other stuff - skills, process, whatever - is probably there for a reason. So reframe "All you need is . . . " into "Something that might help is . . . "
Ask: "Would some (more) of this help out?"
Ask: "Would some (more) of this help out more than the other candidate things to do?"

Personally, I think that a purely skills based view of the world is a mistake. I also think that your list is useful, as are Martin's additions. Whether your list, or any other, is sufficient in itself for a project or team to succeed is easy - no. But it may be a candidate improvement. Lists of processes capture both experience and insight about that experience. So they're useful to folks who haven't done such stuff before, or haven't understood what they experienced. That's no small gift for the well-intentioned but clueless. I know. That's me a lot of the time.

-JimBullock, 2002.11.21

Thanks Ken. I followed the links, some very interesting stuff out there. Some comments:

- My method for dealing with the check writing connundrum is simple. I borrowed from a friend of mine who always gets this sort of horrible administrivia done right away. I once remarked to him how impressed I was at how hard working he was, and he laughed out loud and said that no, in fact he was a deeply lazy person. Intrigued, I probed further, and his explanation was simple: He likes to be lazy, continuously thinking about and maybe starting to work on a dull chore and putting it down again is very inefficient and wastes a lot of effort, therefore doing it right away allows him to spend more time being lazy!

- Sometimes this is even not enough, and the next step is to remember how good I felt after having done a particlar chore in the past, and that pleasant feeling tempts me into doing it again. Using endorphins can be a powerful ally.

- Taking this further bring me back to the production group responsibility checklist. Essentially a lot of production work is really dull, chore ridden stuff and with the added kicker that failure to perform some important check runs the risk that if it does indeed go wrong, then the consequences are a severe reprimand or worse.

Having worked in that sort of environment a lot in recent jobs, has got me to thinking about how to manage this effectively. My response is to first recognise that a lot of the work is in fact dull. After that, the interesting part becomes how to put a workable system in place to ensure that routine checks are performed smartly and well. Some of that led to Checkman the automated job checker facility but this only goes so far in that only some tasks can be automatically checked and verified. So this leads to the 'how do I do routine (manual) tasks promptly? and the only answer I could come up with was the 'define the minimum taskset that gets the job done' and perform that, knowing that this is probably the most efficient way to work, and hence be maximally lazy.

This is really what the production checklist is all about, it is supposed to be a 'minimum set' of tasks. The stuff on there is based on bitter experience, failure to do these things invariably leads to more screw-ups, sleepless nights, ritual beatings etc etc. The tricky thing is that a lot of the tasks cross departmental barriers (for example turnover management) where the pain tends to be felt more in one area but the workload for the task is spread unevenly in another area. This is a very hard problem to solve, as there are several emotional and communication issues to resolve to get these issues 'visualised' in everyones head and hence to get them tackled.

- I chose 'visualised' carefully. Having done a lot of software development, I have pondered what the heck I would have done as a job instead if computers had not been invented when I happened to be around. Lets say its the 14th century in England and apart from avoiding the plague, how would I keep myself motivated? I figure that the most likely profession I would have fallen into was building cathedrals. They are complex, lots of interconnecting parts, pushing technology to the limits, complex project management needed, demanding customers, limited budgets and with the possibility of eternal damnation (or boiling oil) for screwups to haunt you! It is in fact just like software development today but with one crucial difference - visibility. When building something physical, a lot of the project is so 'obvious' to everyone working on it that communication is performed seamlessly - people can see how the task is progressing and what the problems and bottlenecks are and what needs to be done next. Also obvious is how certain parts of the task need to be performed, nobody is going to try to build the roof before the walls are constructed, and heaven help you if you forget the flying buttresses too.

With software development this visual obviousness is largely lost and hence being the creatures that we are, we all come up with our own definitions of how we think the job should be done (because it is interesting to do this, and better fun to pontificate and sound clever than actually do the work). This is how I see software development today. It is absolutely dedevilled by this problem and the more we try to be clever and come up with a better method to do the job, so someone else comes up with an equally compelling alternative theory. Trying to speed up the process, with various 'process improvement' practices simply speeds up the arms race of competing ideas. Maybe that is me being too cynical, but I think there is a lot of truth in this.

The answer I think, lies in bringing back the visualisation into the process, trying to make the process be more like building a cathedral, rather than running around in a hall of mirrors bouncing off the walls. Instead of asking 'how' to do something like a software project, we should instead recognise that there are very many ways to do the project and we should instead be asking 'why is this particular subtask in the project *really* needed' - and if the answer is 'because if it is not done, then this breaks' we are onto something. What I think we are onto is a dependency link Notice I did not say dependency tree, as I have no idea how these links structure themselves and how this varies from problem to problem. Next step is to optimise this set of dependencies so that the minimum is done to get the task completed without things breaking, in other words to be optimally lazy. Of course, recognise that a problem only needs to be slightly different from another problem, for the optimisation process to come up with a very different 'best practice' for each problem. The final thing to add to this, is to recognise that you should not try to go up that 'eternal ladder' from another post (I loved the visualisation) to optimise the workload, but instead limit yourself to a short step ladder and after a very few steps you simply state 'this is good enough' and just get the darn thing done, confident that you are being pretty efficient about it at least.

I hope I have not rehashed old thinking on the subject and that this is of use to people. Feel free to post this if you think it is worth it.

Thanks,

Martin.

One final thing to add on visualisation is that screwups in the production process are normally highly visual, whereas in the development process they are normally not, or can be easily covered up!

Cheers,

Martin.

KenEstes 2002.12.03 Posting for Martin Pepper

Updated: Tuesday, December 3, 2002