The Start()/Stop() pattern

Several articles with ideas on structuring Go apps to make them easy to test:

It is really handy if you can start most of your app in a test – especially if it provides some type of server functionality. To do this, split most of the app logic into a Start() function, so that main() is just a thin wrapper around it. If you start the app in a test, you also need a way to clean it up so you can start it again as needed, so you should also have a Stop() function. To avoid global state, these functions should be methods on a type, or Start() should return some kind of context that Stop() can use.

If you need Stop() at the app level, you also need it throughout your app. Everything that opens some resource needs to be cleaned up on shutdown. This is a good constraint – you should clean up after yourself.

Once you have Start()/Stop(), then you may as well use oklog/run to manage this if you have multiple parts that have long-running concurrent processes.

Once you decide to test your app, a lot of things get better because they have to. Testing is not just about finding bugs, but also encourages better structure and more discipline around things like lifecycle.

As I work through lifecycle code in Simple IoT, I’m coming to a few conclusions:

  1. You really want the scenario where you can simply start all processes in parallel without any explicit ordering requirements. The constraint of oklog/run not providing ordering is exactly what you want in the end.
  2. If there are dependencies between processes (ex: a database needs to be initialized), have the dependent process simply check whether the database is ready. If not, wait until it is – perhaps with a timeout. This moves dependency management into the processes, where it belongs, instead of into the startup code.
  3. This concept applies at the system level (SystemD units) and inside applications when starting threads/goroutines.

There are many benefits to this:

  1. We don’t have to debug race conditions on startup/shutdown.
  2. Error handling gets pushed into the processes where it belongs. The process knows what it needs and can better respond to situations where a resource is not available than the process orchestration mechanism. Thus, the system will become more resilient.

NATS works very well for this – if the server is not available, the NATS client buffers messages until it is, so you don’t even have to make sure the server is running before starting the clients. Additionally, with JetStream, you can buffer messages as needed as clients come and go.

Caddy uses a form of the Start()/Stop() pattern at the top level:

As I work with this more, it seems any long running process can benefit from this model.

What exactly is this model? Every process is defined by the following functions:

  • Start() error
  • Stop(error)

Start() is called to start the process and does not return until the process is stopped or a critical error occurs.

Stop() is called to shut down the process. It returns immediately. If you want to wait until the process has shut down, wait for Start() to return.

The details of this implementation are subtle, but important. Why not have Stop() return the status of the stop operation? Stop() is not the only thing that can stop a process, so it is more consistent to handle that in the return of Start(), as that is where things are happening anyway.

For Go programming, this allows us to follow best practices such as:

  • the caller should handle concurrency, not the library
  • don’t pass channels around in APIs

In the process of switching the SIOT store to SQLite, I’ve been trying to improve the SIOT client API and harden the lifecycle handling of clients. The idea is that the experience writing clients should be something like:

  • create a struct that represents your client config with tags that map to point types
  • create a client that complies with the client interface
  • if a new node of the client type appears in the store, a new client will automatically be created and you will be handed the config (no manual node/point decoding required).
  • two levels of nodes are supported – an example is a rule, where the top-level node is the rule and it has condition/action child nodes.
  • if the config is updated in the store, the client will get sent the new points
  • if any of the nodes in the client config are removed, then the right things happen.

All this is event driven and potentially distributed (points can come from anywhere through NATS). Lifecycle code is hard as it’s easy to let race conditions slip in. A few things I’ve learned through wrangling with this:

  • Go is really good at concurrent code if you do it right.
  • It is really hard to write concurrent code, even in Go, if you don’t do it right.
  • concurrent code is hard (did I already mention that?)
  • patterns are extremely important – most of us can’t afford to re-invent new patterns for every use case
  • Mutexes should be avoided unless you have performance requirements.
  • Synchronizing access to a component’s state through channels works really well.
  • Synchronous APIs work well in Go.
  • The Start()/Stop() pattern is amazing.
  • Start() should have most of the logic including the tear-down logic.
  • Stop() typically should just send a signal to Start() over a channel.
  • Start() should have a select{} statement that synchronizes access to all state.

If you follow the Start()/Stop() pattern, then most functionality lives in a simple, straight-line Start() function that starts when the module starts and returns when it is finished. This is easy to manage and monitor – you always know the state of things: is it running or not? There are no callbacks, no channels in APIs, etc.

As an example, consider these two modules that are used to manage clients and automate most of the lifecycle concerns:

This code took quite a few iterations to get right. The result is fairly simple – how things should be. Simple is hard.

This talk is really good – really helped me down the path to writing better concurrent Go code.

cc @bminer

When he talked about N chans vs. 1 chan, it reminded me of the event-loop pattern, similar to what Node.js uses under the hood.