IoT system reliability

cbrake · October 24, 2020, 7:02pm

One of the IoT systems I’m involved in started as a Meteor project and thus uses MongoDB for a database. Even though I’m not using Meteor in new projects, Meteor has worked out quite well as a web application solution. We have used other companies to host MongoDB – initially mLab, and now MongoDB Atlas.

We’ve had several database-related problems in the last year:

At one point, our Go backend process was opening new connections at a very rapid rate and would eventually exhaust the file descriptors available on the machine. This appeared to be caused by some type of networking problem between our server and the MongoDB cluster.
Recently, we completely lost the network connection between our server (hosted by Linode) and the MongoDB cluster for about 30 minutes at a time (this happened twice). We could communicate with both the server and the DB cluster from a workstation, but they could not talk to each other. A backup server could also talk to the DB during the outage. Without a database, our entire system basically goes down.

So, while a hosted MongoDB cluster with redundant nodes (such that if one fails, the other keeps going) is a nice idea, the failures we’ve had to date are related to the connection between the server and these nodes, not to a DB or server failure. Because the cluster nodes are in the same data center, a networking problem is likely to affect the entire cluster. These problems are very difficult to debug, as we don’t control the Internet and have very little visibility into the networking infrastructure of the server and DB hosting services.

To date, we would have had a more reliable system if we had simply hosted the database on the main cloud server, where there is no network between the application and the database to fail. If the server ever died, we would have to rebuild it, but with a good Ansible process, that should only take 30 minutes (which is acceptable downtime in most systems), and I’ve yet to have a cloud server die on me. Typically, if there are problems with a cloud server, the hosting provider simply migrates the cloud instance to another machine.

To preserve our existing architecture, we will likely move toward running two servers.

Some thoughts on this:

simple is often better than complex
distributed systems are hard
it is important to own your own platform – otherwise you can’t do much when there are problems
question if you really need a DB cluster if you have good backups and restore procedures
this seems related to the end-to-end argument – building extreme reliability into one part of the system does very little if the system is not end-to-end reliable
the problems you run into in life are rarely the ones you imagined or predicted

These experiences continue to refine and reinforce the vision I have for Simple IoT:

simple is better than complex
local databases are better than remote for small/mid-size deployments where the data easily fits on one machine
redundancy is overrated if you can tolerate a little downtime to rebuild
redundancy only really works if your machines are in separate data centers

The following are some thoughts on the SIOT architecture as it continues to be defined:

each SIOT instance will have its own embedded database (by default – an external DB could be used for systems of larger scale)
a SIOT instance can run in the cloud or on edge devices
the synchronization mechanism between cloud-cloud or cloud-edge instances is the same; an edge instance simply synchronizes less of the data graph
if you want redundancy in the cloud or at the edge, simply add more SIOT instances
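
To illustrate what "synchronizes less of the data graph" could mean, here is a rough sketch that treats the data as a tree and lets each instance sync the subtree below a chosen node. The `Node` type, `subtreeIDs`, and the node names are all hypothetical; this is an assumption about the shape of the idea, not the actual SIOT data model or sync protocol.

```go
package main

import "fmt"

// Node is a hypothetical graph node, not the real Simple IoT schema.
type Node struct {
	ID       string
	Children []*Node
}

// subtreeIDs returns the ID of root and of everything below it,
// depth-first. An instance would sync only the nodes returned here.
func subtreeIDs(root *Node) []string {
	if root == nil {
		return nil
	}
	ids := []string{root.ID}
	for _, c := range root.Children {
		ids = append(ids, subtreeIDs(c)...)
	}
	return ids
}

func main() {
	gw := &Node{ID: "gateway-1", Children: []*Node{
		{ID: "sensor-a"},
		{ID: "sensor-b"},
	}}
	root := &Node{ID: "org", Children: []*Node{gw, {ID: "gateway-2"}}}

	// A cloud instance syncs from the top; an edge instance syncs
	// only the part of the graph rooted at its own gateway.
	fmt.Println("cloud syncs:", subtreeIDs(root))
	fmt.Println("edge syncs: ", subtreeIDs(gw))
}
```

The same function serves both cases; only the starting node differs, which is the property the list above is after.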

This is optimized for the long tail of IoT, because most of us are not Google. Data needed to manage even 100,000 devices can easily fit on one machine. With a moderate 20GB of storage, that is 200KB per device, which seems like it should be plenty. And, if you do happen to reach Google scale, then swap out the embedded data store for something that scales better.
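
The per-device budget is simple division, so it is easy to check with your own numbers. A quick sketch, assuming decimal units (1GB = 10^9 bytes); the function name `perDeviceBytes` is made up for this example.

```go
package main

import "fmt"

// perDeviceBytes returns the storage budget per device in bytes.
func perDeviceBytes(totalBytes, devices int64) int64 {
	return totalBytes / devices
}

func main() {
	const totalBytes = 20 * 1000 * 1000 * 1000 // 20GB, decimal units
	const devices = 100_000

	per := perDeviceBytes(totalBytes, devices)
	fmt.Printf("%d KB per device\n", per/1000) // prints "200 KB per device"
}
```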