Hash Algorithms

Some ideas on improving the hash algorithm in Simple IoT:

Since we are not trying to protect against intentional modifications, it seems XOR’ing 32-bit CRCs will give us some nice properties to allow efficient updates.

This is interesting:

Rolling checksum implementation

The rolling checksum is implemented by combining checksums of every page together. When a page is written, LiteFS will compute the CRC64 of the page number and the page data and XOR them into the rolling checksum. It will also compute this same page checksum for the old page data and XOR that value out of the rolling checksum.

This approach gives us strong guarantees about the exact byte contents of the database at every transaction and it is fast to compute. As XOR is associative, it is also possible to compute on a raw database file from scratch to ensure consistency.

This is very similar to the approach we used in SimpleIoT where we XOR 32-bit checksums:

https://docs.simpleiot.org/docs/ref/sync.html#hash-algorithm

Hash Algorithm

We don’t need cryptographic level hashes as we are not trying to protect against malicious actors, but rather provide a secondary check to ensure all data has been synchronized. Normally, all data will be sent via points as it is changes and if all points are received, the Hash is not needed. Therefore, we want to prioritize performance and efficiency over hash strength. The XOR function has some interesting properties:

  • Commutative: A ⊕ B = B ⊕ A (the ability to process elements in any order and get the same answer)
  • Associative: A ⊕ (B ⊕ C) = (A ⊕ B) ⊕ C (we can group operations in any order)
  • Identity: A ⊕ 0 = A
  • Self-Inverse: A ⊕ A = 0 (we can back out an input value by simply applying it again)

See hash_test.go for tests of the XOR concept.

Point CRC

Point CRCs are calculated using the crc-32 of the following point fields:

  • Time
  • Type
  • Key
  • Text
  • Value

Updating the Node Hash

  • edge or node points received
    • for points updated
      • back out previous point CRC
      • add in new point CRC
    • update upstream hash values (stops at device node)
      • create cache of all upstream edges to root
      • for each upstream edge, back out old hash, and xor in new hash
      • write all updated edge hash fields

Currently, Simple IoT only supports syncing an instance to a subset of the upstream’s data. This is one thing that makes the Simple IoT synchronization task more difficult – we are typically only sync’ing a subset of the data, not the entire dataset.

Long term I would like to support cloud peers for redundancy, so very interested in the LiteFS work.

XOR checksums are wonderful things. In this case, we want to exclude the edge points from the hash if we are comparing the root node, so it’s simple to remove these from the hash value by XOR’ing them in again.

	if nodeLocal.ID == up.rootLocal.ID {
		// we need to back out the edge points from the hash as don't want to sync those
		for _, p := range nodeUp.EdgePoints {
			nodeUp.Hash ^= p.CRC()
		}

		for _, p := range nodeLocal.EdgePoints {
			nodeLocal.Hash ^= p.CRC()
		}
	}