The Start()/Stop() pattern


Several articles with ideas on structuring Go apps to make them easy to test:

It is really handy if you can start most of your app in a test, especially if it provides some type of server functionality. To do this, split most of the app logic out into a Start() function and make main() a thin wrapper around it. If you start the app in a test, you also need a way to clean it up so you can start it again as needed, so you should also have a Stop() function. To avoid global state, these functions should be methods of a type, or Start() should return some type of context that Stop() can use.
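Here is a minimal sketch of that idea, assuming a small HTTP app. The App type, NewApp(), the /ping route, and pingOnce() are all made up for illustration. In a real program main() would just build the App, call Start(), and wire Stop() to a signal handler. Here main() plays the role of the test so the sketch terminates on its own.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"io"
	"net"
	"net/http"
)

// App is a hypothetical application type. All state lives here, not in globals.
type App struct {
	ln  net.Listener
	srv *http.Server
}

// NewApp binds the listener up front so callers (and tests) know the address.
func NewApp(addr string) (*App, error) {
	ln, err := net.Listen("tcp", addr)
	if err != nil {
		return nil, err
	}
	mux := http.NewServeMux()
	mux.HandleFunc("/ping", func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprint(w, "pong")
	})
	return &App{ln: ln, srv: &http.Server{Handler: mux}}, nil
}

// Addr reports the address the app is listening on.
func (a *App) Addr() string { return a.ln.Addr().String() }

// Start serves requests until Stop is called or the server fails.
func (a *App) Start() error {
	err := a.srv.Serve(a.ln)
	if errors.Is(err, http.ErrServerClosed) {
		return nil // a requested shutdown is not a failure
	}
	return err
}

// Stop shuts the server down, waiting for in-flight requests to finish.
func (a *App) Stop() {
	// The error is ignored to keep the sketch short; real code might log it.
	_ = a.srv.Shutdown(context.Background())
}

// pingOnce does what a test might do: start the app, use it, stop it.
func pingOnce() (string, error) {
	app, err := NewApp("127.0.0.1:0") // port 0 lets the OS pick a free port
	if err != nil {
		return "", err
	}
	done := make(chan error, 1)
	go func() { done <- app.Start() }()

	resp, err := http.Get("http://" + app.Addr() + "/ping")
	if err != nil {
		app.Stop()
		return "", err
	}
	body, readErr := io.ReadAll(resp.Body)
	resp.Body.Close()

	app.Stop()
	if err := <-done; err != nil {
		return "", err
	}
	return string(body), readErr
}

func main() {
	// Run twice to show the app can be started again after a Stop().
	for i := 0; i < 2; i++ {
		body, err := pingOnce()
		fmt.Println(body, err)
	}
}
```

Because nothing is global, each call to pingOnce() gets a fresh App on its own port, which is what makes repeated start/stop cycles in a test suite safe.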

If you need Stop() at the app level, you also need it throughout your app. Everything that opens some resource needs to be cleaned up on shutdown. This is a good constraint – you should clean up after yourself.

Once you have Start()/Stop(), you may as well use oklog/run to manage the lifecycle if your app has multiple parts that run as long-running concurrent processes.
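To give a feel for what a run group does with Start()/Stop() pairs, here is a simplified stand-in written with only the standard library. It is not the oklog/run code. The real package (github.com/oklog/run) exposes the same shape, a Group with Add(execute, interrupt) and Run(). The group, worker, demo, and errSensor names below are made up for this sketch.

```go
package main

import (
	"errors"
	"fmt"
)

// actor pairs a blocking execute function with an interrupt function.
type actor struct {
	execute   func() error
	interrupt func(error)
}

// group is a simplified stand-in for a run group (illustration only).
type group struct{ actors []actor }

func (g *group) add(execute func() error, interrupt func(error)) {
	g.actors = append(g.actors, actor{execute, interrupt})
}

// run starts every actor in parallel, waits for the first one to return,
// interrupts all of them, waits for the rest to exit, and returns the
// first error.
func (g *group) run() error {
	if len(g.actors) == 0 {
		return nil
	}
	errs := make(chan error, len(g.actors))
	for _, a := range g.actors {
		go func(a actor) { errs <- a.execute() }(a)
	}
	first := <-errs
	for _, a := range g.actors {
		a.interrupt(first)
	}
	for i := 1; i < len(g.actors); i++ {
		<-errs
	}
	return first
}

// worker is a hypothetical long-running process.
type worker struct{ stop chan error }

func newWorker() *worker { return &worker{stop: make(chan error, 1)} }

// Start blocks until Stop is called.
func (w *worker) Start() error { return <-w.stop }

// Stop never blocks; repeated calls are ignored.
func (w *worker) Stop(err error) {
	select {
	case w.stop <- err:
	default: // already stopping
	}
}

var errSensor = errors.New("sensor: read failed")

// demo runs two healthy workers next to one actor that fails right away.
func demo() error {
	var g group
	api, store := newWorker(), newWorker()
	g.add(api.Start, api.Stop)
	g.add(store.Start, store.Stop)
	g.add(func() error { return errSensor }, func(error) {})
	return g.run()
}

func main() {
	fmt.Println("group exited:", demo())
}
```

The failing actor brings the whole group down in an orderly way: both workers get their Stop() called, and run() only returns after every Start() has returned.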

Once you decide to test your app, a lot of things get better because they have to. Testing is not just about finding bugs; it also encourages better structure and more discipline around things like lifecycle.

As I work through lifecycle code in Simple IoT, I’m coming to a few conclusions:

  1. You really want the scenario where you can simply start all processes in parallel without any explicit ordering requirements. The constraint of oklog/run not providing ordering is exactly what you want in the end.
  2. If there are dependencies between processes (e.g., a database needs to be initialized), have the dependent process simply check whether the database is ready. If not, it waits until it is, perhaps with a timeout. This moves dependency management into the processes where it belongs instead of in the startup code.
  3. This concept applies at the system level (systemd units) and inside applications when starting threads/goroutines.
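The "check if ready, otherwise wait with a timeout" idea can be sketched roughly as follows. The database type, waitReady(), and demo() are hypothetical names, and the polling interval is an arbitrary choice for illustration.

```go
package main

import (
	"context"
	"fmt"
	"sync/atomic"
	"time"
)

// database is a hypothetical dependency that takes a while to initialize.
type database struct{ ready int32 }

func (d *database) initialize(delay time.Duration) {
	time.Sleep(delay) // stands in for migrations, opening files, etc.
	atomic.StoreInt32(&d.ready, 1)
}

// Ready reports whether initialization has finished.
func (d *database) Ready() bool { return atomic.LoadInt32(&d.ready) == 1 }

// waitReady polls ready() until it returns true or ctx is done.
func waitReady(ctx context.Context, ready func() bool, interval time.Duration) error {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		if ready() {
			return nil
		}
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-ticker.C:
		}
	}
}

// demo starts the database and its dependent in parallel, with no ordering
// in the startup code. The dependent does its own waiting.
func demo(initMs, timeoutMs int) error {
	db := &database{}
	go db.initialize(time.Duration(initMs) * time.Millisecond)

	ctx, cancel := context.WithTimeout(context.Background(),
		time.Duration(timeoutMs)*time.Millisecond)
	defer cancel()
	return waitReady(ctx, db.Ready, 5*time.Millisecond)
}

func main() {
	fmt.Println("fast database:", demo(20, 2000))
	fmt.Println("slow database:", demo(2000, 50))
}
```

The startup code never needs to know that one process depends on the other. The dependent decides for itself how long it is willing to wait and what to do when the timeout expires.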

There are many benefits to this:

  1. We don’t have to debug race conditions on startup/shutdown.
  2. Error handling gets pushed into the processes where it belongs. The process knows what it needs and can respond to an unavailable resource better than the process orchestration mechanism can. Thus, the system becomes more resilient.

NATS works very well for this: if the server is not available, the NATS client buffers messages until it is, so you don’t even have to make sure the server is running before starting the clients. Additionally, with JetStream, you can buffer messages as needed as clients come and go.

Caddy uses a form of the Start()/Stop() pattern at the top level:

As I work with this more, it seems any long-running process can benefit from this model.

What exactly is this model? Every process is defined by the following functions:

  • Start() error
  • Stop(error)

Start() is called to start the process and does not return until the process is stopped or a critical error occurs.

Stop() is called to shut down the process. It returns immediately. If you want to wait until the process has shut down, wait for Start() to return.

The details of this implementation are subtle, but important. Why not have Stop() return the status of the stop operation? Stop() is not the only thing that can stop a process, so it is more consistent to handle that in the return of Start(), as that is where things are happening anyway.
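A rough sketch of this contract in Go might look like the following. The Process interface name, the worker type, and the two error values are assumptions for illustration. Returning the error passed to Stop() from Start() is just one possible choice, since the description above does not pin that down.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
	"time"
)

// Process is the lifecycle contract described above (the name is hypothetical).
type Process interface {
	Start() error   // blocks until stopped or a critical error occurs
	Stop(err error) // requests shutdown and returns immediately
}

// worker is a hypothetical process that does one unit of work per tick.
type worker struct {
	work   func() error // a non-nil error is treated as critical
	stop   chan struct{}
	once   sync.Once
	reason error
}

var _ Process = (*worker)(nil) // compile-time check of the contract

func newWorker(work func() error) *worker {
	return &worker{work: work, stop: make(chan struct{})}
}

func (w *worker) Start() error {
	ticker := time.NewTicker(time.Millisecond)
	defer ticker.Stop()
	for {
		select {
		case <-w.stop:
			return w.reason // assumption: report why we were stopped
		case <-ticker.C:
			if err := w.work(); err != nil {
				return err // Start returns without Stop ever being called
			}
		}
	}
}

// Stop records the reason and signals Start. It never blocks, and
// calling it more than once is harmless.
func (w *worker) Stop(err error) {
	w.once.Do(func() {
		w.reason = err
		close(w.stop)
	})
}

var (
	errRequested = errors.New("shutdown requested")
	errSensor    = errors.New("sensor read failed")
)

// stopByCaller shows the caller owning the goroutine and waiting for
// shutdown by waiting for Start to return.
func stopByCaller() error {
	w := newWorker(func() error { return nil })
	done := make(chan error, 1)
	go func() { done <- w.Start() }()
	w.Stop(errRequested)
	return <-done
}

// stopByFailure shows a process ending on its own critical error.
func stopByFailure() error {
	w := newWorker(func() error { return errSensor })
	return w.Start()
}

func main() {
	fmt.Println("stopped by caller:", stopByCaller())
	fmt.Println("stopped by failure:", stopByFailure())
}
```

Both ways of ending the process surface in the same place, the return value of Start(). The only goroutine and the only channel in the usage code belong to the caller, not to the worker's API.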

For Go programming, this allows us to follow best practices such as:

  • The caller should handle concurrency, not the library.
  • Don’t pass channels around in APIs.