I recently spent some time learning about the practical implementation details of Merkle trees. One thing that I found underdocumented and required some thinking was the concept of “second preimage attacks” against Merkle trees. This is an attack which requires changes to the naive design for implementing this tree. I am documenting it here in hopes that it will be more clear to others implementing it.

A Merkle tree is a type of tree in which every leaf node is the hash of a block of input data. Every branch node is the hash of the concatenation of its child values.

Merkle tree

Picture of a Merkle Tree. Taken from https://en.wikipedia.org/wiki/File:Hash_Tree.svg and licensed under CC0. This example is vulnerable to a second preimage attack.

Merkle trees are useful because they allow for the efficient verification of large amounts of data. To verify, from the picture above, that L1, L2, L3, and L4 were all received successfully, the Top Hash is all that is needed. This is bog-standard hashing and not impressive. Where this becomes valuable is if you start with the Top Hash and want to validate the contents of L1 through L4. Making it more complex, each of the data blocks can be received independently of one another. This is a common situation in file sharing protocols such as Bitorrent and DC++.

To check each of the data blocks a validator needs to know their corresponding leaf node from the tree. L1 corresponds to Hash 0-0 such that if you know the value of Hash 0-0 the validator knows they received L1 successfully. However all that the validator is starting with is the Top Hash. With the Top Hash however you can work your way down the tree. Assuming the hash algorithm is a cryptographic hash, and collisions are infeasible, it is possible to validate Hash 0 and Hash 1 by checking that the hash of their concatenation matches the Top Hash. By repeating this it is possible to work ones way down the tree, even in environments where the validator may not trust the other party telling it the branch and leaf hash values. Other clever optimizations exist, a reader can check Wikipedia or read a textbook for more information.

Second Preimage Attack

All this brings us to the second preimage attack. A second preimage attack is as follows: Given a hash function H() and two inputs x and x' where x != x', H(x) = H(x'). Hash functions such as SHA-256 have second preimage resistance, the discovery of x and x' being difficult, as a core property.

If a Merkle tree is implemented naively, with a single function such as SHA-256 being used for every node, it is vulnerable to a second preimage attack. That is the implementation as shown in the picture above. Assume for the purposes of this attack the attacker knows every value in the Merkle tree: top hash, hash 0, hash 1, hash 0-0, etc. An adversary can now create a new data block E1 = Hash 0-0 + Hash 0-1.

graph BT     TopHash("Top Hash: Hash_F(Hash 0 || Hash 1)")     Hash0("Hash 0: Hash_F(Hash 0-0 || Hash 0-1)")     Hash1(Hash 1)     Hash1 --> TopHash     Hash10("Hash 1-0: Hash_F(L3)")     Hash11("Hash 1-1: Hash_F(L4)")     Hash10 --> Hash1     Hash11 --> Hash1     Hash0 --> TopHash     subgraph Data Blocks         E1["E1: Hash 0-0 || Hash 0-1"] L3 L4     end     L3 --> Hash10     L4 --> Hash11     E1 --> Hash0

An illustration of the attack. “||” is used to represent concatenation and Hash_F represents the hash function.

When a victim comes looking for the tree values that correspond to top hash the adversary can provide Hash 0, Hash 1, Hash 1-0 and Hash 1-1. These values will all be validated as being under top hash. The trap is now sprung. Instead of providing Hash 0-0 and Hash 0-1 the adversary claims that Hash 0 is a leaf node. There is no information about the depth of a tree or leaf vs. branch nodes stored in a naive Merkle tree implementation. The attacker then provides E1, L3, and L4 as the data blocks. L1 and L2 have been lost and a second preimage attack has occurred. It is possible to also carry this out to not provide L3 and L4 as well, allowing for an attacker to carry out a complete second preimage attack with no knowledge of the L data blocks.

Defenses

There are several possible defenses against a second preimage attack. Some of these are as follows:

  • Mix information about the node type into the hash: This is what is done for Certificate Transparency (https://datatracker.ietf.org/doc/html/rfc9162) where 0x00 is prepended to leaf nodes. 0x01 is prepended to branch nodes.
    • I like this approach because it is straightforward to implement. See below for a description of why it works.
  • Require a defined depth: Agree upon a depth that the tree must have. A data block is valid if and only if the nodes leading to the leaf, which has the hash of the data block, are of the defined depth.
    • This works for trees that can be of a uniform depth, i.e. chunks of a large file. This does not work as well for other situations, for example if the Merkle tree is over a filesystem and branch nodes represent folders that can be arbitrarily nested.

Why does prepending information work?

Under Defenses I stated that prepending information on if a node is a leaf or branch fixes the second preimage attack vulnerability. This section explores why that is the case.

For this section I will define two new hash functions, based on the prior hash functions.

  • Hash_L for the Leaf nodes. This is defined as Hash_L = Hash('A' || Input)
  • Hash_B for the Branch nodes. This is defined as Hash_B = Hash('B' || Input)

A simple tree now looks like the following:

graph BT     TopHash("Top Hash: Hash_B(Hash 0 || Hash 1)")     Hash0("Hash 0: Hash_B(Hash 0-0 || Hash 0-1)")     Hash1("Hash 1: Hash_B(Hash 1-0 || Hash 1-1)")     Hash1 --> TopHash     Hash10("Hash 1-0: Hash_L(L3)")     Hash11("Hash 1-1: Hash_L(L4)")     Hash00("Hash 0-0: Hash_L(L1)")     Hash01("Hash 0-1: Hash_L(L2)")     Hash00 --> Hash0     Hash01 --> Hash0     Hash10 --> Hash1     Hash11 --> Hash1     Hash0 --> TopHash     subgraph Data Blocks         L1         L2         L3         L4     end     L1 --> Hash00     L2 --> Hash01     L3 --> Hash10     L4 --> Hash11

Binary Merkle tree with 4 data elements. Each hash is either Hash_L for leaf nodes or Hash_B for branch nodes.

To make this example more real, I will assign values to each of the nodes and calculate the hashes based on that. I will use SHA-256 as the hash function of choice.

  • L1 = “Foo”
  • L2 = “Bar”
  • L3 = “Baz”
  • L4 = “Qux”
  • Hash 0-0 = Hash_L(L1) = SHA-256("AFoo") = c394c83b94a489485a0ea7dbfd43d6b18a7cd6ef7a4d607f5c667e978a232d07
  • Hash 0-1 = Hash_L(L2) = SHA-256("ABar") = 2fa32a6f44e6ed4e6357e25c7fae397d51028cbb1ea527f1831439eaeefdb64a
  • Hash 1-0 = Hash_L(L3) = SHA-256("ABaz") = bf2d89291058bc1e2e5cd3255dcc3d0d39eb1a54623c268e9adf892b903f38e7
  • Hash 1-1 = Hash_L(L4) = SHA-256("AQux") = eb5e40d6a93d1db9ab45db84465bd5809064741588415d46c817b3bdd338ce1f
  • Hash 0 = Hash_B( Hash 0-0 || Hash 0-1 ) = SHA-256("Bc394c83b94a489485a0ea7dbfd43d6b18a7cd6ef7a4d607f5c667e978a232d072fa32a6f44e6ed4e6357e25c7fae397d51028cbb1ea527f1831439eaeefdb64a") = ca10d813fdb6e755f9a0984b53a7600c49b2fc6aaba028ddc90b8d4a7b766242
  • Hash 1 = Hash_B( Hash 1-0 || Hash 1-1 ) = SHA-256("Bbf2d89291058bc1e2e5cd3255dcc3d0d39eb1a54623c268e9adf892b903f38e7eb5e40d6a93d1db9ab45db84465bd5809064741588415d46c817b3bdd338ce1f") = a9d8f95ea21aff1147359ca4a040f39aa21fc951e942466b0861bf14486db4f5
  • Top Hash = Hash_B( Hash 0 || Hash 1 ) = SHA-256("Bca10d813fdb6e755f9a0984b53a7600c49b2fc6aaba028ddc90b8d4a7b766242a9d8f95ea21aff1147359ca4a040f39aa21fc951e942466b0861bf14486db4f5") = c21243ead9185756bac406d7c6132479c27a0f3bfbd378ffbd28b3596ba7a313

So now the Top Hash is known to be c21243ead<etc>. What happens when an attacker creates E1?

graph BT     TopHash("Top Hash: Hash_B(Hash 0 || Hash 1)")     Hash0("Hash 0: Hash_L(Hash 0-0 || Hash 0-1)")     Hash1("Hash 1: Hash_B(Hash 1-0 || Hash 1-1)")     Hash1 --> TopHash     Hash10("Hash 1-0: Hash_L(L3)")     Hash11("Hash 1-1: Hash_L(L4)")     Hash10 --> Hash1     Hash11 --> Hash1     Hash0 --> TopHash     subgraph Data Blocks         E1["E1: Hash 0-0 || Hash 0-1"] L3 L4     end     L3 --> Hash10     L4 --> Hash11     E1 --> Hash0

Attempting the second preimage attack with the Hash_B and Hash_L functions.

The important part to note here is that Hash 0 no longer has child nodes. That means that it’s hash function has changed from Hash_B to Hash_L. This in turn means that the new value of Hash 0 is SHA-256("Ac394c83b94a489485a0ea7dbfd43d6b18a7cd6ef7a4d607f5c667e978a232d072fa32a6f44e6ed4e6357e25c7fae397d51028cbb1ea527f1831439eaeefdb64a") = 1d0048fc197f5848147a93f19ef7ce1107e2df37ca0d2b6c5cc2c2ada004d5aa. This is different from the original Hash 0 and will fail a verification when combined with the Hash 1 value to check against Top Hash. This simple change has removed the attack vector.

References