Book review: The Phoenix Project

Recently read The Phoenix Project – an excellent story. The book centers on a company’s transformation from a dysfunctional organization into one that uses technology effectively. There are many good lessons for anyone working in technology. The book reflects (though perhaps exaggerates a little) the common realities of working in the tech world. Many comparisons are made to manufacturing, which is a highly optimized science today – something I’ve witnessed in my occasional visits to the Industry 4.0 Discord server. It seems IT (and I would lump product development and most knowledge work into that as well) is still an immature science – at least in most organizations.

Some Notes/Quotes:

  • we’ve been meaning to upgrade that SAN firmware for years …
  • (ops have a) deep suspicion of developers breaking things, then disappearing
  • only thing more dangerous than a developer is a developer conspiring with security
  • everyone thinks that the real way to get work done is to just do it
  • for years I’ve been trying to get people to use our change management process and tools
  • three management movements
    • theory of constraints
      • improvements made anywhere besides the bottleneck are an illusion
    • lean production
    • total quality management
  • your job as VP of IT operations is to ensure the fast, predictable and uninterrupted flow of planned work that delivers value to the business while minimizing the impact and disruption of unplanned work, so you can provide stable, predictable, and secure IT service.
  • control the release of work … ensure that your most constrained resources are doing only the work that serves the entire system, not just one silo.
  • four types of work
    • business projects
    • internal projects
    • changes
    • unplanned work
  • work center: machine, man, method, measure
  • repetition creates habits, and habits are what enable mastery. Studies have shown that practicing 5 minutes daily is better than practicing once a week for three hours
  • our goal is to maximize flow
  • everyone needs idle or slack time. If no one has slack time, WIP gets stuck in the system
  • in any system of work, the theoretical ideal is a single-piece flow, which maximizes throughput and minimizes variance. You get there by continually reducing batch sizes.
  • the flow of work goes in one direction only: forward
  • learning is not compulsory … neither is survival. – Deming
  • without a doubt, the best times for technology are ahead of us, not behind us. There’s never been a better time to be in the technology field, and to be a lifelong learner.
  • 3 ways:
    • flow - work moves quickly from left → right
    • feedback - shared from right → left
      • shared goals means shared pain
    • culture of continuous experimentation and learning, habits, repetition → mastery
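The single-piece-flow note above can be illustrated with a toy calculation. A minimal sketch (the function name and numbers are my own, not from the book): if items are only handed off when their whole batch is done, shrinking the batch size cuts the average time until any given item is finished, even though total work is unchanged.

```python
def avg_lead_time(n_items, batch_size, unit_time=1.0):
    """Items are released only when their whole batch is done.
    Returns the average time until an item is finished."""
    finish = []
    t = 0.0
    for start in range(0, n_items, batch_size):
        size = min(batch_size, n_items - start)
        t += size * unit_time      # work through the whole batch
        finish += [t] * size       # the batch is released together
    return sum(finish) / n_items

for b in (12, 4, 1):
    print(b, avg_lead_time(12, b))
# batch of 12 → 12.0, batches of 4 → 8.0, single-piece flow → 6.5
```

Same twelve units of work in every case; only the batch size changes, and single-piece flow is the limiting best case.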

A few more quotes:

  • Improving daily work is even more important than doing daily work.
  • Being able to take needless work out of the system is more important than being able to put more work into the system
  • We need to create a culture that reinforces the value of taking risks and learning from failure and the need for repetition and practice to create mastery.
  • until code is in production, no value is actually being generated, because it’s merely WIP stuck in the system.
  • Left unchecked, technical debt will ensure that the only work that gets done is unplanned work!
  • Remember, outcomes are what matter — not the process, not controls, or, for that matter, what work you complete.

Also reviewed the Unicorn Project by the same author.

But why? If it’s not broken and if it’s valuable to the company, do you really need to upgrade it?

Yes, security is a good reason to do so, but you may also be trading old bugs for new bugs. And this doesn’t seem to jibe with your other bullets, as simply upgrading a thing as a piece of work doesn’t, on the surface, bring any value.

This is often quoted as the “get shit done” methodology. The best way to organize and accomplish work in technology is to hire good people and let them do the work. Often it’s proposed as an alternative to the massively over-designed agile processes which larger companies embrace.

These two remind me of this video by an agile guy, which actually does make a good point (although it doesn’t require the use of agile to implement): The resource utilization trap - YouTube

Similarly, this is why construction workers often look like they’re just sitting around. It’s actually more efficient, both in project completion time and in cost, for highly serialized processes (like construction) to have lots of slack.
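The slack/utilization point can be seen in a tiny queue simulation. A minimal sketch (function name and parameters are my own): one worker, jobs arriving at random; pushing the worker’s utilization from 70% toward 95% makes the average wait for a job to even get started blow up, which is exactly why WIP gets stuck when no one has slack.

```python
import random

def avg_wait(utilization, n_jobs=20000, seed=1):
    """Single worker, Poisson arrivals at rate 1, exponential service
    with mean `utilization`. Returns the mean time a job sits waiting
    before work on it starts."""
    rng = random.Random(seed)
    t_arrive = 0.0
    t_free = 0.0        # when the worker next becomes idle
    total_wait = 0.0
    for _ in range(n_jobs):
        t_arrive += rng.expovariate(1.0)         # next job arrives
        start = max(t_arrive, t_free)            # wait if worker busy
        total_wait += start - t_arrive
        t_free = start + rng.expovariate(1.0 / utilization)
    return total_wait / n_jobs

print(avg_wait(0.70))   # modest waits
print(avg_wait(0.95))   # waits many times longer
```

The queueing-theory prediction (wait ∝ utilization / (1 − utilization)) says the jump from 70% to 95% busy multiplies delay by roughly an order of magnitude, and the simulation agrees.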

This is why I like kanban in comparison to all the other agile things. There’s a big list of the things that need to get done, and as people free up from work they pull the next task on the list that they can contribute to. Sometimes it doesn’t work – maybe there’s a big unfun task that needs to get done and someone has to do it, so a manager may have to step in and direct someone on what to pull – but most of the time it works very well.
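The pull model above fits in a few lines of code. A minimal sketch (the task names, `wip_limit`, and the `pull` helper are all made up for illustration): freed-up workers pull from the backlog themselves, and a WIP limit keeps the team from starting more than it can finish.

```python
from collections import deque

backlog = deque(["task A", "task B", "big unfun task", "task C"])
wip_limit = 2
in_progress = []

def pull(worker):
    """A freed-up worker pulls the next backlog item,
    respecting the team's WIP limit."""
    if backlog and len(in_progress) < wip_limit:
        task = backlog.popleft()
        in_progress.append((worker, task))
        return task
    return None  # nothing to pull, or WIP limit reached

print(pull("alice"))   # task A
print(pull("bob"))     # task B
print(pull("carol"))   # None – WIP limit reached, carol stays slack
```

The manager-steps-in case from above is just a targeted pull: hand someone the big unfun task instead of whatever is next on the list.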

This one seems very specific to the tech industry. Other industries have other kinds of debt (like physical plant investment) but usually these are very easy for a human to perceive.

Good perspectives. Based on our past discussions, I think our views on updates probably differ some. I feel that with modern CI and testing methodologies, most projects/products generally get better with time, and you are at more risk trying to take one large jump than multiple smaller jumps. The reason is that the smaller jumps are better tested by the developers. Support is the other consideration – especially in OSS projects, it is generally easier to get support for the latest version, where the developers are working, than for a version that is several years old. So let’s put this into a framework of risks:

  1. an update will introduce new bugs
  2. will we have trouble updating if we wait and take a big jump in versions?
  3. what level of support can we get for older versions?

Frankly, I’m amazed at the stability of modern OSS – Arch Linux gives me almost no problems. Gitea, Caddy, Grafana, Influxdb, Libreoffice, browsers, etc. have all been updated many times with no problems. There is no one-size-fits-all answer, so there are times when updating makes sense and times when it does not, but generally I favor updating today. Sometimes, though, the stakes are so high you can’t take even small risks – I don’t jump to update firmware in my car, appliances, etc.

In the SAN story, the system had crashed again and there was only one overworked person in the company who knew how to get it going again. So I think the point was the system was buggy and should have been fixed long ago, but everyone was too busy fighting fires to do any preventative work.

The resource utilization video is great! I’ve had good experiences with Kanban as well. Github recently rev’d their projects feature and it’s looking pretty nice. I’ve mostly used Trello in the past.

Are there any Kanban tools you can recommend, or do you use whiteboard?

Mostly I just keep a trello board for myself and use it in a kanban style. All the formal tools I’ve ever been exposed to for agile scheduling are way too rigid and just end up taking too much effort to deal with.


I think my views are mostly shaped by big corporate implementations and that mindset. For better or worse, I’m still not a fan of the rolling-release concept for mission-critical systems; there just end up being too many possible combinations of software for me to feel comfortable.

For OSS projects, though, I definitely understand and agree that getting support for the latest release is always easiest. So for many things, rolling release can make sense. I’ve just been burned too many times by running the latest software and finding horrible bugs, so I’m very gun-shy about it now.


Good points – we can add a few more considerations to our framework:

  • OSS vs Commercial SW (different support models, expectations)
  • Cost/impact of outages or downtime
  • Quality of the process to create the SW
    • does it use a modern, more reliable language?
    • is it well tested (unit/e2e tests run automatically in CI)?

I feel that on the third point, there have been big improvements in recent years.
