Deets

The Goal by Eliyahu M Goldratt
ISBN-13: 978-0884271956

Review

Book cover

What is The Goal? Better put: what is the goal of a company? The book answers the question straightforwardly with “to make money”, but it takes the protagonist some time to get there.

The Goal is a book about Systems Thinking, approached via improving assembly line processes in manufacturing. Assembly line operations in turn are based around repeatable operations. A manufacturing company is able to make money when it sells manufactured goods. The role of the factory is to produce those goods as quickly as possible so that the company can hold low inventories and therefore respond to changing needs. The Goal is a study of how to improve the throughput of a factory.

What value do assembly line operations have in a software engineering job? Software Engineering is based around projects, but many operations are themselves repeatable. Triaging issues, preparing for releases of new software, getting high touch documents out for specific issues: these are all repeatable processes. Software Engineering emphasizes the unique, and the act of creating something that can be written once before being used in multiple places, but there are still many places where repeated operations arise. Moving from the open space of approaching each problem as new to having a playbook to follow each time is a normal process of maturation within an organization. The Goal focuses on how to optimize that process.

The book itself is very interesting. It’s written as a story where a supporting character, an enigmatic, brilliant professor, is an obvious stand-in for the author. The protagonist and supporting characters within the company are well meaning, but sometimes confused by basic concepts. Overall it’s a cute, easy to read structure. It becomes annoying when an insight is refined over the course of several chapters, but the style is good for easy nighttime reading.

One thing I didn’t understand going in, and wish I had, was that it doesn’t cover the unique, high touch, low repeatability projects that are much more common in software engineering. These are actually covered in a later book called “[[The Goal (Novel)#Critical Chain Synopsis|Critical Chain]]”, but it feels much more handwavy than this one. I’d say that this book is worth reading and a synopsis of Critical Chain is a reasonable follow up.

Key Takeaways

The Goal focuses on gradually introducing the characters to the Theory of Constraints (TOC), where the ability of a (manufacturing) company to make money is determined by the assembly line’s primary limiting factor, referred to as a constraint. TOC focuses on measurements that expose this constraint, as well as how to improve on it.

The three measurements that matter for TOC are: Throughput, Inventory, and Operational Expense. Each has definitions which may not match conventional wisdom and are important to understand.

Throughput is the rate at which a system generates money through sales, which ties the work the factory does to the larger picture. The “through sales” piece is important as well: production of goods does not count unless it ends in a sale, and excessive production of unsold goods actually has negative value due to the costs of inventory.

Inventory is the amount of money the system has invested in things it intends to sell. This includes raw materials, items in progress, and the final finished goods. If you view a factory process as adding value by transforming raw materials (inputs) into finished goods worth more (outputs), then it follows that inventory is the amount of money stuck in the system that is released by selling the final goods.

Operational Expenses are the money required to turn inventory into throughput. This may be one time expenses, such as the purchase of a new piece of manufacturing equipment. This may also be ongoing expenses such as salaries, utility bills, rental costs, etc.

The Theory of Constraints seeks improvement via increasing throughput while simultaneously reducing both inventory and operational expenses. This is a continuous process, which needs constant adjustment and attention, in order to achieve maximal efficiency. Measurements are necessary in order to accurately discover improvements, which also implies that an accurate assessment of the chain of dependencies needs to be available as well.

Describing the chain of dependencies involves being able to show all the stages involved in a process and which stages require prior stages to complete in advance. Once the stages and dependencies between them are clear, it is possible to measure how much time is spent in each stage, how work builds up before each stage, and identify bottlenecks.

The key takeaway of The Goal is the process for improving throughput. It starts with measurement, which makes sense because you cannot improve anything until you are able to accurately describe the chain of dependencies and measure each step.

Once all the data is available, The Goal offers a 5 step process to achieve its objective of increasing throughput while decreasing inventory and operational expenses:

1. IDENTIFY the system’s constraint.
2. Decide how to EXPLOIT the system’s constraint.
3. SUBORDINATE everything else to the above decisions.
4. ELEVATE the system’s constraint.
5. If in the previous steps a constraint has been broken, go back to step 1, but do not allow inertia to cause a system constraint.

Identifying the constraint involves finding the worst chokepoint that limits the throughput. This typically involves looking for places with multiple inputs, or which take a long time to proceed through that stage. The stage that limits the rest of the system is the chokepoint.

Exploiting the system’s constraint means figuring out how to work around the limitation of that worst stage. This is not the same as speeding up the stage itself, although that should always be considered first if it is possible. Exploiting means ensuring that the constraint’s time is never wasted. Solutions for this may involve doing quality checks on inputs before the stage rather than after, so that bad parts are rejected in advance. They may also involve running this stage continuously to ensure downstream stages have access to outputs. Solutions here are going to depend on the nature of the stage, but in general what matters is that whenever this stage is in use, its time is not wasted.

Subordination means that all of the other stages need to make decisions to support maximizing the throughput of the constraint stage. The use of throughput is intentional here: having excess inventory build up before the bottleneck adds to storage costs and should be avoided when possible.

Elevation of the constraint is the next step. Once the adjustments have been made, if higher throughput is still needed, additional capacity can be invested in. Capacity refers to horizontal scaling of the stage and may involve additional equipment or staff to do more of the stage at once. It may also involve finding outside help with the stage.

The final step is continuous improvement. The system must be monitored to see if another stage becomes the primary constraint. If so, the same process needs to be followed to resolve that step.
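As a rough sketch of how identify, elevate, and repeat fit together, here is a toy model of my own (not something from the book). The `Stage` class, the stage names, and the capacities are all made up for illustration, it assumes a simple serial line, and the exploit and subordinate steps are left out because they depend on the specifics of the stage.

```python
from dataclasses import dataclass


@dataclass
class Stage:
    name: str
    units_per_hour: float  # hypothetical measured capacity of one station
    stations: int = 1      # parallel stations doing this stage

    @property
    def capacity(self) -> float:
        return self.units_per_hour * self.stations


def identify_constraint(stages: list[Stage]) -> Stage:
    """Step 1: in a serial line, the lowest-capacity stage limits everything."""
    return min(stages, key=lambda s: s.capacity)


def line_throughput(stages: list[Stage]) -> float:
    """A serial line can ship no faster than its slowest stage."""
    return identify_constraint(stages).capacity


def elevate(stage: Stage) -> None:
    """Step 4: add capacity by scaling the stage horizontally."""
    stage.stations += 1


# Made-up numbers for a three-stage line.
line = [
    Stage("cutting", 12),
    Stage("heat treat", 5),
    Stage("assembly", 8),
]

history = []
for _ in range(3):
    constraint = identify_constraint(line)          # step 1
    history.append((constraint.name, line_throughput(line)))
    elevate(constraint)                             # step 4
    # Step 5: loop back and re-identify, since the constraint may have moved.

print(history)
```

With these numbers the constraint starts at heat treat, moves to assembly once heat treat gets a second station, and then moves back. That is the point of the last step: breaking one constraint just reveals the next one.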

Drum, Buffer, Rope

One practical outcome of the TOC is a Drum, Buffer, Rope system. The purpose of Drum, Buffer, Rope is to establish a framework for how to implement the 5 optimization steps.

In a system like this the primary constraint becomes the Drum, which beats out the pace at which the rest of the system moves.

Buffer is what you put in front of the primary constraint in order to ensure it always has work to do and is never idle. Buffer can take the form of inputs to the constraint stage, but it is often referred to as time: Having time built in to ensure that upstream stages feeding into the constraint are able to produce enough outputs, even as the pace of output varies over time, should ensure that there will be enough material inputs in a successful system.

Rope is what lets the Drum set the pace and keep the inputs from the Buffer at a reasonable amount. Rope is the signal from the Drum to the first operation(s) in the assembly line that authorizes the release of new material into the first stages. By having the Drum signal via the Rope, it is possible to prevent a build-up of excess inventory.
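To see why the Rope matters, here is a minimal simulation, assuming a two-stage line where the first stage is much faster than the Drum. The function name, rates, and buffer target are all hypothetical numbers I picked for the example.

```python
def simulate(hours: int, use_rope: bool, buffer_target: int = 4) -> tuple[int, int]:
    """Return (units finished by the drum, units left waiting in the buffer)."""
    upstream_rate = 5  # hypothetical: units/hour the first stage could release
    drum_rate = 2      # hypothetical: units/hour the constraint can process
    buffer = 0
    finished = 0
    for _ in range(hours):
        if use_rope:
            # Rope: release only enough material to refill the buffer.
            release = min(upstream_rate, max(0, buffer_target - buffer))
        else:
            # No rope: the first stage runs flat out regardless of the drum.
            release = upstream_rate
        buffer += release
        # Drum: the constraint works from the buffer at its own pace.
        processed = min(drum_rate, buffer)
        buffer -= processed
        finished += processed
    return finished, buffer


with_rope = simulate(8, use_rope=True)
without_rope = simulate(8, use_rope=False)
print(with_rope, without_rope)
```

In both runs the Drum finishes the same number of units, because it is the constraint either way. The difference is the pile of inventory waiting in front of it, which keeps growing when nothing ties the release of material to the Drum’s pace.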

Why Throughput?

Throughput is the measurement TOC cares most about because, of the things an assembly line is able to influence, it matters most to the company’s goal of making money. Traditional values like profit and return on investment don’t measure what it takes to keep things running, since you can have profit and positive ROI but still go bankrupt. Cash flow is the traditional measurement of what it takes to keep the lights on.

The Theory of Constraints posits that inventory and operating expenses are negatives, but necessary. Throughput is what contributes to ensuring there is steady cash flow, by starting with raw goods and ending with sales. Given enough cash flow, and financial discipline to keep expenses below revenue, profits will result.

Statistical Fluctuations are the enemy of consistent throughput. Any stage is going to have a range of time it takes to produce an output, i.e., a stage does not take “10 minutes”, it may take between “8 - 12 minutes”. Over a number of stages these variations compound and cause excessive inventory to accumulate, even without a complete stop in a single stage. By limiting the amount of inventory to be held before each stage, statistical fluctuations can be evened out and work can be kept at a consistent pace.
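The effect of fluctuations is easier to believe after simulating it. This is a minimal sketch, assuming each stage’s capacity in a round is a die roll between 1 and 6 and that a stage can only work on what is already waiting in front of it. The `run_line` function and the optional `wip_limit` parameter are names I made up for the example.

```python
import random
from typing import Optional


def run_line(stages: int, rounds: int, seed: int = 0,
             wip_limit: Optional[int] = None) -> dict:
    """Simulate a serial line where every stage's capacity fluctuates."""
    rng = random.Random(seed)
    wip = [0] * stages     # work waiting in front of each stage (index 0 unused)
    rolled = [0] * stages  # total capacity each stage had available
    shipped = 0
    for _ in range(rounds):
        for i in range(stages):
            roll = rng.randint(1, 6)
            rolled[i] += roll
            # The first stage draws on unlimited raw material; every other
            # stage can only process what has already reached it.
            moved = roll if i == 0 else min(roll, wip[i])
            last = i == stages - 1
            if wip_limit is not None and not last:
                # Stop producing once the next queue is full.
                moved = min(moved, wip_limit - wip[i + 1])
            if i > 0:
                wip[i] -= moved
            if last:
                shipped += moved
            else:
                wip[i + 1] += moved
    return {"shipped": shipped, "rolled": rolled, "wip": wip}


unlimited = run_line(stages=5, rounds=1000)
limited = run_line(stages=5, rounds=1000, wip_limit=6)
print(unlimited["shipped"], unlimited["wip"])
print(limited["shipped"], limited["wip"])
```

Every stage has the same average capacity, yet the line can never ship more than its unluckiest stage had available, and a slow roll upstream cannot be made up later unless work happens to be waiting. Capping the queues does not remove the fluctuations, but it bounds how much inventory they are allowed to create.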

Kanban

I’m admittedly not too familiar with the history of manufacturing. I’d heard of Toyota’s process of asking the 5 Whys, and Henry Ford’s assembly line for the Model T. But until reading the end of the book the tie-in to Kanban boards had eluded me.

Eli Goldratt draws a clear line between TOC and both of the storied car companies. Henry Ford understood the basic principles of throughput and would limit the space provided between stages for items to accumulate. If workers at a particular stage didn’t have space to put their output product, they would have to stop working. If this happens for long enough, it becomes clear at a glance what the issue is. An elegant feedback loop mechanism, without a need for a computer at all!

Toyota would further refine that with the Kanban system. In the Kanban system each space in between stages has all the inputs defined, including all the subcomponents such as screws, bolts, etc. There is a maximum number of each item allowed in each space, and a physical card that specifies what the item is and the space it belongs to. When an item is taken off the shelf, say, a box containing 50 screws for front bumper assembly, the card for that is taken and sent to the start of the screw creation process. Receiving the card triggers the process to fill a new box with screws, and specifies where to send the final product to. This improves on the Ford process by being finer grained and better defined.

Both manufacturing processes focus on throughput. By ensuring that there is a consistent pace of progress, with feedback if issues occur, final products come out at a reliable pace. By achieving consistency in pace, it is also possible to better enforce quality standards since the pace is consistently set to allow for making a set number of products rather than suddenly needing to speed up and create more.

These same principles would later apply to Kanban Boards in software engineering. Under a Kanban system the work stages are much more limited, typically something like: Backlog, Next, In Progress, Under Review, and Complete. But the core Kanban principle, that there can only be so many cards in the core sections of Next, In Progress, and Under Review, is what contributes to the throughput. By preventing too many items from piling up, engineers are able to focus on what they have. If the backlog becomes too long and progress is stalled, say due to cards in “Under Review” not transitioning to “Complete”, it’s possible to investigate and intervene to keep the expected amounts of work at the set pace. What’s interesting here is that manufacturing is typically broken down into low touch (low creative thought), repetitive stages. Software Engineering is typically high touch, with less repetition. However, very similar principles apply to ensuring that there is enough throughput to create end products.
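The card limit is simple enough to sketch in a few lines. This is a toy board of my own, not any real tool’s API: the `KanbanBoard` and `WipLimitExceeded` names and the specific limits are assumptions for illustration.

```python
class WipLimitExceeded(Exception):
    """Raised when a card is moved into a column that is already full."""


class KanbanBoard:
    COLUMNS = ["Backlog", "Next", "In Progress", "Under Review", "Complete"]

    def __init__(self, limits: dict[str, int]):
        # Columns without an entry in `limits` (Backlog, Complete) are unbounded.
        self.limits = limits
        self.columns: dict[str, list[str]] = {name: [] for name in self.COLUMNS}

    def add(self, card: str) -> None:
        self.columns["Backlog"].append(card)

    def move(self, card: str, src: str, dst: str) -> None:
        limit = self.limits.get(dst)
        if limit is not None and len(self.columns[dst]) >= limit:
            # Like Ford's full shelf: no room downstream means stop and look.
            raise WipLimitExceeded(f"{dst} is full ({limit} cards)")
        self.columns[src].remove(card)
        self.columns[dst].append(card)


# Hypothetical limits for the three core columns.
board = KanbanBoard({"Next": 3, "In Progress": 2, "Under Review": 2})
for card in ["A", "B", "C"]:
    board.add(card)

board.move("A", "Backlog", "In Progress")
board.move("B", "Backlog", "In Progress")
try:
    board.move("C", "Backlog", "In Progress")
    blocked = False
except WipLimitExceeded:
    blocked = True

print(blocked, board.columns)
```

The third card is refused and stays in the Backlog, which is the software equivalent of the worker who has nowhere to put their output. The refusal is the signal to go look at why “In Progress” is not draining.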

Critical Chain Synopsis

Critical Chain is the follow-on book to The Goal and is focused around projects like Software Engineering, which are more high touch.

While I didn’t read Critical Chain, I did read through some synopses, which seemed to explain the key goals as follows:

1. Define the project with a detailed design and deliverables.
2. Set timelines for deliverables, but do not add the traditional buffer to each deliverable.
3. Add an amount of buffer to the very end.

I feel like the first point makes a lot of sense, since I agree that most software design is done up front. The other points don’t resonate as clearly with me. In my experience, ensuring that deliverables are broken up enough to be quickly delivered (i.e., 2 weeks or less, ideally 1 week or less per deliverable), and have sufficient testing to be trustworthy, is reasonable enough to set a pace for a project.

Making This Concrete

To tie this together, I’ll try to put together an example of how this can all work together. One repeatable process comes up in SaaS: Post-Mortem after an availability incident with a service. A service has degraded availability or goes down entirely, and customers would like to know what happened.

Before getting to that process of course, the issue also needs to be fixed. Across a replicated service this may involve rolling out a fix to multiple places.

flowchart TD
    subgraph Fix
        Understand[Understand the issue]
        WriteFix[Write the fix]
        TestFix[Test the fix]
        Deploy[Begin Deploying the fix]
        DeployAlpha[Node Alpha]
        DeployBravo[Node Bravo]
        DeployCharlie[Node Charlie]
        ValidateFix[Validate Fix resolves issue in prod]
        Understand --> WriteFix --> TestFix --> Deploy
        Deploy --> DeployAlpha --> ValidateFix
        Deploy --> DeployBravo --> ValidateFix
        Deploy --> DeployCharlie --> ValidateFix
    end
    subgraph PostMortem
        Notes[Begin gathering facts]
        UnderstandCause[Understand the root cause]
        DescribeFix[Describe the fix]
        Analysis[Analyze what went right, wrong, where things could have been worse]
        Polish[Polish final report]
        Notes --> UnderstandCause --> DescribeFix --> Analysis --> Polish
        Understand --> UnderstandCause
        WriteFix --> DescribeFix
    end
    Publish[Publish Post Mortem]
    Fix --> Publish
    PostMortem --> Publish

Flowchart showing the process of an issue going from happening to resolved with a published Post-Mortem.

The flowchart above is a simplified example of how this might look. In order to get to publishing the Post-Mortem, the fix needs to be successfully rolled out. In order to do that, the issue needs to be understood. But this layout also shows that there are two parallel tracks that can be worked on. While the fix is being tested or rolled out, the Post-Mortem document can be worked on. Even before then, different parts of it can be started. If there are multiple people, or even one person switching between tasks, this does not have to be a strictly linear “fix, then write the Post-Mortem document” sequence.
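To put rough numbers on the benefit of the parallel tracks, here is a small sketch that walks the same dependency graph and compares doing every step back to back against starting each step as soon as its inputs are ready. The hour estimates are entirely made up, and it assumes there are enough people to work both tracks (and all three node deployments) at once.

```python
from graphlib import TopologicalSorter

# Hypothetical duration of each step, in hours.
hours = {
    "Understand": 2, "WriteFix": 3, "TestFix": 2, "Deploy": 1,
    "DeployAlpha": 1, "DeployBravo": 1, "DeployCharlie": 1, "ValidateFix": 1,
    "Notes": 1, "UnderstandCause": 2, "DescribeFix": 1, "Analysis": 2,
    "Polish": 1, "Publish": 1,
}

# step -> the steps that must finish first (edges taken from the flowchart,
# with the two subgraph-level arrows attached to each track's final step).
depends_on = {
    "WriteFix": {"Understand"},
    "TestFix": {"WriteFix"},
    "Deploy": {"TestFix"},
    "DeployAlpha": {"Deploy"},
    "DeployBravo": {"Deploy"},
    "DeployCharlie": {"Deploy"},
    "ValidateFix": {"DeployAlpha", "DeployBravo", "DeployCharlie"},
    "UnderstandCause": {"Notes", "Understand"},
    "DescribeFix": {"UnderstandCause", "WriteFix"},
    "Analysis": {"DescribeFix"},
    "Polish": {"Analysis"},
    "Publish": {"ValidateFix", "Polish"},
}

# Earliest finish time of each step: it starts when its slowest input is done.
finish = {}
for step in TopologicalSorter(depends_on).static_order():
    start = max((finish[dep] for dep in depends_on.get(step, ())), default=0)
    finish[step] = start + hours[step]

serial_hours = sum(hours.values())  # one step at a time, in order
parallel_hours = finish["Publish"]  # both tracks worked side by side
print(serial_hours, parallel_hours)
```

With these estimates the fix track is the longer path, so the Post-Mortem writing finishes before the fix is validated and adds no time to publishing. Swapping in measured durations from real incidents would show which track is actually the constraint.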

Over the course of a few availability incidents the process can be measured at each stage. Improvements can be looked at to speed up each stage as well as how to start working on a stage when inputs are available.

The connection to not allowing excessive inventory to build up is less direct here. In one sense, preventing a build-up of inventory means not allowing a single input to a stage to linger for too long, i.e., starting to write about the fix as soon as the fix is done, measuring how long it takes to understand an issue, create a fix, etc., and asking how to improve on those times. If there are multiple availability incidents (yikes!), the tie-in is clearer: limit how many issues can be worked on at a time to ensure earlier ones complete.

The end stage needs a tie-in as well. There are no sales associated with a Post-Mortem document, but these documents only have value if someone is consuming them. Other markers, such as view count or write-ins to ask questions, can be used to ensure that the document output is useful. If no one is viewing or asking about the Post-Mortem document, that itself is a signal that it may not be needed. Likewise, if the service’s outage doesn’t affect anyone, that is another signal about the value of the service itself. All of these can and should be measured in a process of constant feedback.

Closing Thoughts

Overall I’d recommend the book. It was an interesting read for understanding assembly line operations, which do apply to subsets of SWE like CI. Where it fell apart for me was: a) It doesn’t cover “project environments”, which are covered in Critical Chain. It’s true that it helps explain them, and it’s true that I now understand where the Kanban board system came from, but it still feels like it is missing some crucial steps for applying this book to project work in a manner that is super useful. b) I feel like I didn’t understand how to better measure and weigh different projects. I hope that [[How to Measure Anything]] will help with that more, and it makes sense in retrospect that this wasn’t a goal of the book, but it still bums me out.
