Book review: Designing Data-Intensive Applications

This book is highly recommended by my associate Blake, so I am starting to work through it.

https://dataintensive.net/

Two quotes early in the book caught my attention:

Technology is a powerful force in our society. Data, software, and communication can be used for bad: to entrench unfair power structures, to undermine human rights, and to protect vested interests. But they can also be used for good: to make underrepresented people’s voices heard, to create opportunities for everyone, and to avert disasters. This book is dedicated to everyone working toward the good.

(I’ve had similar thoughts while observing the past few years.)

Computing is pop culture. […] Pop culture holds a disdain for history. Pop culture is all about identity and feeling like you’re participating. It has nothing to do with cooperation, the past or the future—it’s living in the present. I think the same is true of most people who write code for money. They have no idea where [their culture came from]. —Alan Kay, in interview with Dr Dobb’s Journal (2012)

If you are considering event-driven architectures for your applications, this book is gold. It has completely transformed the way I think about databases. For those who want a short teaser, Martin discusses a few concepts in his book in this presentation: "Turning the database inside out with Apache Samza" by Martin Kleppmann - YouTube


Thanks @bminer for the presentation link, and welcome to the TMPDIR forum!

One of my goals with the TMPDIR community is to gain insight from people over a wide breadth of experiences (hardware, software, data, UI, mechanical, ML, etc), instead of a community narrowly focused on only one project/skill. So nice to have someone here with some experience in the data realm.

Currently reading through the section on Graph-like models. A few quotes:

For example, Facebook maintains a single graph with many different types of vertices and edges: vertices represent people, locations, events, checkins, and comments made by users; edges indicate which people are friends with each other, which checkin happened in which location, who commented on which post, who attended which event, and so on [35].

Property Graphs

In the property graph model, each vertex consists of:

  • A unique identifier
  • A set of outgoing edges
  • A set of incoming edges
  • A collection of properties (key-value pairs)

Each edge consists of:

  • A unique identifier
  • The vertex at which the edge starts (the tail vertex)
  • The vertex at which the edge ends (the head vertex)
  • A label to describe the kind of relationship between the two vertices
  • A collection of properties (key-value pairs)

Some important aspects of this model are:

  1. Any vertex can have an edge connecting it with any other vertex. There is no schema that restricts which kinds of things can or cannot be associated.
  2. Given any vertex, you can efficiently find both its incoming and its outgoing edges, and thus traverse the graph—i.e., follow a path through a chain of vertices—both forward and backward. (That’s why Example 2-2 has indexes on both the tail_vertex and head_vertex columns.)
  3. By using different labels for different kinds of relationships, you can store several different kinds of information in a single graph, while still maintaining a clean data model.

This is very similar to the Simple IoT data model. Below are the data structures used:

// Node represents the state of a device. UUID is recommended
// for ID to prevent collisions in distributed instances.
type Node struct {                                                                                                                                                                   
        ID     string
        Type   string
        Points Points
}

// Edge is used to describe the relationship                                                                                                                                         
// between two nodes                                                                                                                                                                 
type Edge struct {                                                                                                                                                                   
        ID     string
        Up     string
        Down   string
        Points Points
        Hash   []byte
} 

In this case we are using an array of Points to represent the properties instead of key/value pairs.

// Point is a flexible data structure that can be used to represent                                                                                                                  
// a sensor value or a configuration parameter.                                                                                                                                      
// ID, Type, and Index uniquely identify a point in a device                                                                                                                         
type Point struct {                                                                                                                                                                  
        // ID of the sensor that provided the point                                                                                                                                  
        ID string
                                                                                                                                                                                     
        // Type of point (voltage, current, key, etc)                                                                                                                                
        Type string
                                                                                                                                                                                     
        // Index is used to specify a position in an array such as                                                                                                                   
        // which pump, temp sensor, etc.                                                                                                                                             
        Index int
                                                                                                                                                                                     
        // Time the point was taken                                                                                                                                                  
        Time time.Time
                                                                                                                                                                                     
        // Duration over which the point was taken. This is useful                                                                                                                   
        // for averaged values to know what time period the value applies                                                                                                            
        // to.                                                                                                                                                                       
        Duration time.Duration
                                                                                                                                                                                     
        // Average OR                                                                                                                                                                
        // Instantaneous analog or digital value of the point.                                                                                                                       
        // 0 and 1 are used to represent digital values                                                                                                                              
        Value float64
                                                                                                                                                                                     
        // Optional text value of the point for data that is best represented                                                                                                        
        // as a string rather than a number.                                                                                                                                         
        Text string
                                                                                                                                                                                     
        // statistical values that may be calculated over the duration of the point                                                                                                  
        Min float64
        Max float64
}             

The Point data structure is working out very well. Most points only use a few of the fields, but having the extra fields typically does not cost much with encoding formats like protobuf. The time field is especially critical for sensor data and synchronization. Even for configuration values, it is very handy to know when the value was last changed. And, if your data properties are all expressed as points, it is very simple to capture all data changes in a historian like InfluxDB. The Point.Index field also allows us to express arrays in a property. Simply use the same Point.Type for a series of points and increment Point.Index to express the point position in an array. These points can be easily turned into an array when it comes time to use the data.
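To make the Point.Index idea concrete, here is a minimal sketch of turning indexed points back into an array. The Point type is trimmed to the fields needed, and valuesOfType is a hypothetical helper name for illustration, not a function from the Simple IoT code base.

```go
package main

import (
	"fmt"
	"sort"
)

// Point is a trimmed-down copy of the Point struct above, keeping only
// the fields this sketch needs.
type Point struct {
	Type  string
	Index int
	Value float64
}

// valuesOfType is a hypothetical helper: it collects every point of the
// given type and returns the values ordered by Point.Index.
func valuesOfType(points []Point, typ string) []float64 {
	var matched []Point
	for _, p := range points {
		if p.Type == typ {
			matched = append(matched, p)
		}
	}

	// Points may arrive in any order, so sort by Index before extracting.
	sort.Slice(matched, func(i, j int) bool {
		return matched[i].Index < matched[j].Index
	})

	values := make([]float64, 0, len(matched))
	for _, p := range matched {
		values = append(values, p.Value)
	}
	return values
}

func main() {
	points := []Point{
		{Type: "temp", Index: 2, Value: 23.5},
		{Type: "voltage", Index: 0, Value: 12.1},
		{Type: "temp", Index: 0, Value: 21.0},
		{Type: "temp", Index: 1, Value: 22.25},
	}
	fmt.Println(valuesOfType(points, "temp"))
}
```

Note this sketch ignores gaps in the index sequence; a real implementation would need to decide whether a missing index means a zero value or an error.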

One thing I had not considered yet is adding a Label or Type field to the Edge struct. This would allow us to describe different types of relationships – may be useful in the future.
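A rough sketch of what that could look like, purely as a thought experiment: the Type field, the relationship names ("contains", "notifies"), and the downstream helper are all hypothetical and not part of the current Edge struct.

```go
package main

import "fmt"

// Edge is a trimmed-down copy of the Edge struct above with a
// hypothetical Type field added to label the relationship.
type Edge struct {
	ID   string
	Up   string
	Down string
	Type string // hypothetical: kind of relationship between Up and Down
}

// downstream is a hypothetical helper that returns the IDs of nodes
// directly below nodeID that are connected by edges of the given type.
func downstream(edges []Edge, nodeID, edgeType string) []string {
	var ids []string
	for _, e := range edges {
		if e.Up == nodeID && e.Type == edgeType {
			ids = append(ids, e.Down)
		}
	}
	return ids
}

func main() {
	edges := []Edge{
		{ID: "e1", Up: "site", Down: "pump1", Type: "contains"},
		{ID: "e2", Up: "site", Down: "pump2", Type: "contains"},
		{ID: "e3", Up: "site", Down: "user1", Type: "notifies"},
	}
	fmt.Println(downstream(edges, "site", "contains"))
	fmt.Println(downstream(edges, "site", "notifies"))
}
```

With labels, the same pair of nodes could be connected by more than one kind of relationship, which matches point 3 in the property graph description above.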

Thanks for the warm welcome, Cliff!

From Chapter 4:

In a database, the process that writes to the database encodes the data, and the process that reads from the database decodes it. There may just be a single process accessing the database, in which case the reader is simply a later version of the same process—in that case you can think of storing something in the database as sending a message to your future self.

Backward compatibility is clearly necessary here; otherwise your future self won’t be able to decode what you previously wrote.

In general, it’s common for several different processes to be accessing a database at the same time. Those processes might be several different applications or services, or they may simply be several instances of the same service (running in parallel for scalability or fault tolerance). Either way, in an environment where the application is changing, it is likely that some processes accessing the database will be running newer code and some will be running older code—for example because a new version is currently being deployed in a rolling upgrade, so some instances have been updated while others haven’t yet.

This means that a value in the database may be written by a newer version of the code, and subsequently read by an older version of the code that is still running. Thus, forward compatibility is also often required for databases.

However, there is an additional snag. Say you add a field to a record schema, and the newer code writes a value for that new field to the database. Subsequently, an older version of the code (which doesn’t yet know about the new field) reads the record, updates it, and writes it back. In this situation, the desirable behavior is usually for the old code to keep the new field intact, even though it couldn’t be interpreted.

The encoding formats discussed previously support such preservation of unknown fields, but sometimes you need to take care at an application level, as illustrated in Figure 4-7. For example, if you decode a database value into model objects in the application, and later reencode those model objects, the unknown field might be lost in that translation process. Solving this is not a hard problem; you just need to be aware of it.

The SIOT Node/Edge/Point database structures are one way to accomplish this. Because the data structures are generic, any version of software can read/write/store/synchronize/transmit them. Yes, it is a little more work to translate these structures into usable structures in application code, but that is the tradeoff with distributed systems. Our job is not to make things easy, but to keep things simple and make them work. Boilerplate is not the enemy.

Regarding the last statement in Kleppmann’s quote above – SIOT also solves this – at least at the node level. Nodes are only modified by writing points. The new points are simply merged with the existing points, and then the updated node is written. There is no way for an update to “delete” a point. So if a new point type is written by a new version of software, it will be synchronized/stored/etc. by all SIOT versions and cannot be deleted by older versions of the software – it will simply be ignored.
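Here is a minimal sketch of the merge idea, not the actual SIOT implementation. It assumes points are keyed by ID, Type, and Index (as the Point comment above states) and, as a further assumption on my part, that the point with the newer or equal timestamp wins when keys match. mergePoints and pointKey are hypothetical names.

```go
package main

import (
	"fmt"
	"time"
)

// Point is a trimmed-down copy of the Point struct above.
type Point struct {
	ID    string
	Type  string
	Index int
	Time  time.Time
	Value float64
}

// pointKey holds the fields that uniquely identify a point in a node.
type pointKey struct {
	ID, Type string
	Index    int
}

// mergePoints returns the existing points with the incoming points merged
// in. A matching point is replaced only if the incoming one is not older
// (assumed rule); unknown points are appended. Nothing is ever removed.
func mergePoints(existing, incoming []Point) []Point {
	merged := make([]Point, len(existing))
	copy(merged, existing)

	pos := make(map[pointKey]int, len(merged))
	for i, p := range merged {
		pos[pointKey{p.ID, p.Type, p.Index}] = i
	}

	for _, p := range incoming {
		k := pointKey{p.ID, p.Type, p.Index}
		if i, ok := pos[k]; ok {
			if !p.Time.Before(merged[i].Time) {
				merged[i] = p
			}
			continue
		}
		pos[k] = len(merged)
		merged = append(merged, p)
	}
	return merged
}

func main() {
	now := time.Now()
	// "newFeature" was written by a newer software version.
	existing := []Point{{Type: "newFeature", Time: now, Value: 1}}
	// An older version only knows about "voltage" and writes that.
	update := []Point{{Type: "voltage", Time: now.Add(time.Second), Value: 3.3}}

	for _, p := range mergePoints(existing, update) {
		fmt.Printf("%s=%v\n", p.Type, p.Value)
	}
}
```

Because the old software only ever submits the points it knows about, and the merge never drops anything, the "newFeature" point survives the update untouched.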

In distributed actor frameworks, this programming model is used to scale an application across multiple nodes. The same message-passing mechanism is used, no matter whether the sender and recipient are on the same node or different nodes. If they are on different nodes, the message is transparently encoded into a byte sequence, sent over the network, and decoded on the other side.

Erlang is probably what most of us think of with regard to the actor model. Using the same message-passing mechanism (NATS) has been a great benefit in the SIOT application. It allows the system to be easily extended by adding additional applications on the same machine or different machines. Erlang’s model is likely a lot more convenient than the simplified data model NATS uses, but there are drawbacks to convenience:

In Erlang OTP it is surprisingly hard to make changes to record schemas (despite the system having many features designed for high availability); rolling upgrades are possible but need to be planned carefully [53]. An experimental new maps datatype (a JSON-like structure, introduced in Erlang R17 in 2014) may make this easier in the future [54].

For a lightweight IoT system like SIOT, it seems like forward/backward compatibility is a key consideration – especially with the potential to have thousands of instances in the system and the difficulty of updating systems over slow networks like Cat-M. If forward/backward compatibility is important, then simple data models are needed.

For a successful technology, reality must take precedence over public relations, for nature cannot be fooled. – Richard Feynman, Rogers Commission Report (1986)

As multi-leader replication is a somewhat retrofitted feature in many databases, there are often subtle configuration pitfalls and surprising interactions with other database features. For example, autoincrementing keys, triggers, and integrity constraints can be problematic. For this reason, multi-leader replication is often considered dangerous territory that should be avoided if possible [28].

An IoT system is inherently a multi-leader replication scenario – especially if configuration changes can be made at the edge or in the cloud. At a minimum, you typically have sensor values changing at the edge, and configuration values changing in the cloud, and data needs to flow both ways.

While wondering if databases are crash resistant, I found the following article. ALICE looks like a useful tool for testing applications that write to files.

Some more quotes from the book:

If you can avoid opening Pandora’s box and simply keep things on a single machine, it is generally worth doing so.

Not all systems make a clear distinction between systems of record and derived data in their architecture, but it’s a very helpful distinction to make, because it clarifies the dataflow through your system: it makes explicit which parts of the system have which inputs and which outputs, and how they depend on each other.

A system cannot be successful if it is too strongly influenced by a single person. Once the initial design is complete and fairly robust, the real test begins as people with many different viewpoints undertake their own experiments. – Donald Knuth

From a purist’s point of view, it may seem that this careful modeling and import is desirable, because it means users of the database have better-quality data to work with. However, in practice, it appears that simply making data available quickly—even if it is in a quirky, difficult-to-use, raw format—is often more valuable than trying to decide on the ideal data model up front.

A complex system that works is invariably found to have evolved from a simple system that works. The inverse proposition also appears to be true: A complex system designed from scratch never works and cannot be made to work. – John Gall, Systemantics (1975)