Merklist¶
A definition for the hash of a list that is robust to arbitrary partitioning
Matrix multiplication's associativity and non-commutativity properties provide a natural definition for a cryptographic hash / digest / summary of an ordered list of elements while preserving concatenation operations. Due to the non-commutativity property, lists that differ in element order result in a different summary. Due to the associativity property, arbitrarily divided adjacent sub-lists can be summarized independently and combined to find the summary of their concatenation in one operation. This definition provides exactly the properties needed to define a list, and does not impose any unnecessary structure that could cause two equivalent lists to produce different summaries. The name Merklist is intended to be reminiscent of other hash-based data structures like Merkle Tree and Merklix Tree.
Definition¶
This definition of a hash of a list of elements is pretty simple:
- A list element is an arbitrary buffer of bytes. Any length, any content. Just bytes.
- A list, then, is a sequence of such elements.
- The hash of a list element is the cryptographic hash of its bytes, formatted into a square matrix with byte elements. (More details later.)
- The hash of a list is reduction by matrix multiplication of the hashes of all the list elements in the same order as they appear in the list.
- The hash of a list with 0 elements is the identity matrix.
This construction has a couple of notable consequences:
- The hash of a list with only one item is just the hash of the item itself.
- You can calculate the hash of any list concatenated with a copy of itself by matrix multiplication of the hash with itself. This works for single elements as well as arbitrarily long lists.
- A list can have multiple copies of the same list item, and swapping them does not affect the list hash. Consider how swapping the first two elements in [1, 1, 2] has no discernible effect.
- The hash of the concatenation of two lists is the matrix multiplication of their hashes.
- Concatenating a list with a list of 0 elements yields the same hash.
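The consequences listed above can be checked directly. The following is a minimal, self-contained sketch of the definition (SHA-512 digest reshaped to an 8×8 `uint8` matrix, reduced by matrix multiplication). The helper names `element_hash` and `list_hash` are illustrative only and are not the names used in the rest of this post.

```python
import hashlib
from functools import reduce

import numpy as np


def element_hash(data: bytes) -> np.ndarray:
    """SHA-512 digest of `data`, viewed as an 8x8 matrix of uint8."""
    digest = hashlib.sha512(data).digest()  # always 64 bytes
    return np.frombuffer(digest, dtype=np.uint8).reshape(8, 8)


def list_hash(elements) -> np.ndarray:
    """Reduce the element hashes by matrix multiplication, in list order."""
    identity = np.identity(8, dtype=np.uint8)  # hash of the empty list
    return reduce(np.matmul, (element_hash(e) for e in elements), identity)


xs = [b"1", b"1", b"2"]
ys = [b"3", b"4"]

# A one-element list hashes to the element hash itself.
assert np.array_equal(list_hash([b"1"]), element_hash(b"1"))
# Concatenating lists corresponds to multiplying their hashes.
assert np.array_equal(list_hash(xs + ys), np.matmul(list_hash(xs), list_hash(ys)))
# Concatenating with the empty list changes nothing.
assert np.array_equal(list_hash(xs + []), list_hash(xs))
# Doubling a list corresponds to squaring its hash.
assert np.array_equal(list_hash(xs + xs), np.matmul(list_hash(xs), list_hash(xs)))
```

Starting the reduction from the identity matrix is what makes the empty list a valid input here.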
Let's explore this definition in more detail with a simple implementation in Python and NumPy.
    # Setup and imports
    import hashlib
    import numpy as np
    from functools import reduce

    def assert_equal(a, b):
        return np.testing.assert_equal(a, b)

    def assert_not_equal(a, b):
        return np.testing.assert_raises(AssertionError, np.testing.assert_equal, a, b)
The hash of a list element - hash_m/1¶
The function hash_m/1 takes a buffer of bytes as its first argument and returns the sha512 hash of the bytes formatted as an 8×8 2-d array of 8-bit unsigned integers with wrapping overflow. We define this to be the hash of the list element. Based on a shallow wikipedia dive, someone familiar with linear algebra might say it's a matrix ring, $R_{256}^{8×8}$. Not coincidentally, sha512 outputs 512 bits = 64 bytes, which is exactly enough to fill an 8×8 array of bytes. How convenient. (In fact, that might even be the primary reason why I chose sha512!)
    def hash_m(e):
        # hash the bytes e, convert the digest into a list of 64 bytes
        hash_bytes = list(hashlib.sha512(e).digest())[:64]
        # convert the digest bytes into a numpy array with the appropriate data type and shape
        return np.array(hash_bytes, dtype=np.uint8).reshape((8, 8))
8×8 seems big compared to 3×3 or 4×4 matrixes. The values are as random as you might expect a cryptographic hash to be, and range from 0-255:
    print(hash_m(b"Hello A"))
    print()
    print(hash_m(b"Hello B"))

    [[ 14 184 108 217 131 164 222  93]
     [132 227  82 144 111 178 195 109]
     [ 25 250 155  17 131 183 151 217]
     [212  60 138  36   0  60 115 181]
     [ 51   0  87  43  93 252  56  61]
     [108 239 175 222  23 142  41 216]
     [203  98 234  13  65 169 255 240]
     [ 46 127  15 167 112 153 222  94]]

    [[ 63 144 188   5  57 146  32  56]
     [ 27 189  98 140 113 194  70  87]
     [115  21 136  27 116 167  85  48]
     [ 29 162 119  29 104  32 145 241]
     [166 197  57 165 132 213  50 202]
     [ 48  71  33  19 230  26  58 164]
     [242 172  65 202 193  50 193 141]
     [206 110 165 129  52 132 250  73]]
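The "wrapping overflow" in the description of hash_m/1 means that all arithmetic on these matrices is effectively done modulo 256. As a small sketch of what that implies for the products used later: multiplying two `uint8` matrices directly should give the same result as multiplying them in wide integers and reducing modulo 256 at the end. The helper name `hash_matrix` is an assumption made for this example.

```python
import hashlib

import numpy as np


def hash_matrix(data: bytes) -> np.ndarray:
    """SHA-512 digest of `data` as an 8x8 uint8 matrix (illustrative helper)."""
    digest = hashlib.sha512(data).digest()
    return np.frombuffer(digest, dtype=np.uint8).reshape(8, 8)


a = hash_matrix(b"Hello A")
b = hash_matrix(b"Hello B")

# uint8 product: every intermediate product and sum silently wraps at 256.
wrapped = np.matmul(a, b)

# The same product in 64-bit integers, reduced modulo 256 at the end.
exact = np.matmul(a.astype(np.int64), b.astype(np.int64))
reduced = (exact % 256).astype(np.uint8)

assert wrapped.dtype == np.uint8
assert np.array_equal(wrapped, reduced)
```

Because the wrapped and reduced results agree, the summary stays a fixed 64 bytes no matter how many elements are multiplied together.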
The hash of a list - mul_m/2¶
Ok so we've got our element hashes, how do we combine them to construct the hash of a list? We defined the hash of the list to be reduction by matrix multiplication of the hash of each element:
    def mul_m(he1, he2):
        # just, like, multiply them
        return np.matmul(he1, he2, dtype=np.uint8)
Consider an example:
    # `elements` is a list of 3 elements
    elements = [b"A", b"Hello", b"World"]
    # first, make a new list with the hash of each element
    element_hashes = [hash_m(e) for e in elements]
    # get the hash of the list by reducing the hashes by matrix multiplication
    list_hash1 = mul_m(mul_m(element_hashes[0], element_hashes[1]), element_hashes[2])
    # an alternative way to write the reduction
    list_hash2 = reduce(mul_m, element_hashes)
    # check that these alternative spellings are equivalent
    assert_equal(list_hash1, list_hash2)
The elements, their hashes, and the hash of the full list:

    print("List of elements:")
    print(elements)
    print()
    print("Hash of each element:")
    print(element_hashes)
    print()
    print("Hash of full list:")
    print(list_hash1)

    List of elements:
    [b'A', b'Hello', b'World']

    Hash of each element:
    [array([[ 33, 180, 244, 189, 158, 100, 237,  53],
           [ 92,  62, 182, 118, 162, 142, 190, 218],
           [246, 216, 241, 123, 220,  54,  89, 149],
           [179,  25,   9, 113,  83,   4,  64, 128],
           [ 81, 107, 208, 131, 191, 204, 230,  97],
           [ 33, 163,   7,  38,  70, 153,  76, 132],
           [ 48, 204,  56,  43, 141, 197,  67, 232],
           [ 72, 128,  24,  59, 248,  86, 207, 245]], dtype=uint8), array([[ 54,  21, 248,  12, 157,  41,  62, 215],
           [ 64,  38, 135, 249,  75,  34, 213, 142],
           [ 82, 155, 140, 199, 145, 111, 143, 172],
           [127, 221, 247, 251, 213, 175,  76, 247],
           [119, 211, 215, 149, 167, 160,  10,  22],
           [191, 126, 127,  63, 185,  86,  30, 233],
           [186, 174,  72,  13, 169, 254, 122,  24],
           [118, 158, 113, 136, 107,   3, 243,  21]], dtype=uint8), array([[142, 167, 115, 147, 164,  42, 184, 250],
           [146,  80,  15, 176, 119, 169,  80, 156],
           [195,  43, 201,  94, 114, 113,  46, 250],
           [ 17, 110, 218, 242, 237, 250, 227,  79],
           [187, 104,  46, 253, 214, 197, 221,  19],
           [193,  23, 224, 139, 212, 170, 239, 113],
           [ 41,  29, 138, 172, 226, 248, 144,  39],
           [ 48, 129, 208, 103, 124,  22, 223,  15]], dtype=uint8)]

    Hash of full list:
    [[178 188  57 157  60 136 190 127]
     [ 40 234 254 224  38  46 250  52]
     [156  72 193 136 219  98  28   4]
     [197   2  43 132 132 232 254 198]
     [ 93  64 113 215   2 246 130 192]
     [ 91 107  85  13 149  60  19 173]
     [ 84  77 244  98   0 239 123  17]
     [ 58 112  98 250 163  20  27   6]]
What does this give us? Generally speaking, multiplying two square matrixes $M_1×M_2$ gives us at least these two properties:
- Associativity - Associativity enables you to reduce a computation using any partitioning because all partitionings yield the same result. Addition is associative $(1+2)+3 = 1+(2+3)$, subtraction is not $(5-3)-2\neq5-(3-2)$. (Associative property)
- Non-Commutativity - Commutativity allows you to swap elements without affecting the result. Addition is commutative $1+2 = 2+1$, but division is not $1\div2 \neq2\div1$. And neither is matrix multiplication. (Commutative property)
This is an unusual combination of properties for an operation. It's at least not a combination encountered in introductory algebra:
operation | associative | commutative |
---|---|---|
$a+b$ | ✅ | ✅ |
$a*b$ | ✅ | ✅ |
$a-b$ | ❌ | ❌ |
$a/b$ | ❌ | ❌ |
$a^b$ | ❌ | ❌ |
$M×M$ | ✅ | ❌ |
Upon consideration, these are the exact properties that one would want in order to define the hash of a list of items. Non-commutativity enables the order of elements in the list to be well defined, since swapping different elements produces a different hash. Associativity enables calculating the hash of the list by performing the reduction operations in any order, and you still get the same hash.
Let's sanity-check that these properties hold for the construction described above.
Associativity¶
If it's associative, we should get the same hash if we rearrange the parentheses so that the reduction is performed in a different order. That is: $((e1 × e2) × e3) = (e1 × (e2 × e3))$
    e1 = hash_m(b"Hello A")
    e2 = hash_m(b"Hello B")
    e3 = hash_m(b"Hello C")

    x = np.matmul(np.matmul(e1, e2), e3)
    y = np.matmul(e1, np.matmul(e2, e3))
    # observe that they produce the same summary
    assert_equal(x, y)
Printed for comparison:

    print(x)
    print()
    print(y)

    [[ 58  12 144 134 100 158 159  51]
     [ 73 206 202 190  87  79 223   2]
     [210 122 142 117  37 148 106  45]
     [175 146 187 223 235 171  64 226]
     [149  85 203  87  92 251 243 206]
     [ 18 252 160 103 125 251 181 133]
     [191 132 220 104 213 154  34 154]
     [127 197  95  87 166   3  22   3]]

    [[ 58  12 144 134 100 158 159  51]
     [ 73 206 202 190  87  79 223   2]
     [210 122 142 117  37 148 106  45]
     [175 146 187 223 235 171  64 226]
     [149  85 203  87  92 251 243 206]
     [ 18 252 160 103 125 251 181 133]
     [191 132 220 104 213 154  34 154]
     [127 197  95  87 166   3  22   3]]
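Associativity is what makes the summary robust to arbitrary partitioning. As a sketch of that claim, the block below cuts a 100-element list at random points, summarizes each adjacent sub-list independently, and checks that combining the sub-list summaries always reproduces the summary of the whole list. The helpers `hash_matrix` and `combine`, the element contents, and the fixed seed are all assumptions made for this example.

```python
import hashlib
import random
from functools import reduce

import numpy as np


def hash_matrix(data: bytes) -> np.ndarray:
    """SHA-512 digest of `data` as an 8x8 uint8 matrix (illustrative helper)."""
    digest = hashlib.sha512(data).digest()
    return np.frombuffer(digest, dtype=np.uint8).reshape(8, 8)


def combine(hashes) -> np.ndarray:
    """Multiply hashes left to right, starting from the identity."""
    return reduce(np.matmul, hashes, np.identity(8, dtype=np.uint8))


elements = [f"element {i}".encode() for i in range(100)]
whole = combine(hash_matrix(e) for e in elements)

rng = random.Random(0)  # fixed seed so the run is repeatable
for _ in range(20):
    # Pick between 1 and 10 distinct cut points strictly inside the list.
    cuts = sorted(rng.sample(range(1, len(elements)), k=rng.randint(1, 10)))
    bounds = [0, *cuts, len(elements)]
    # Summarize each adjacent sub-list independently ...
    chunk_hashes = [
        combine(hash_matrix(e) for e in elements[lo:hi])
        for lo, hi in zip(bounds, bounds[1:])
    ]
    # ... then combine the summaries; the partitioning must not matter.
    assert np.array_equal(combine(chunk_hashes), whole)
```

The sub-list summaries could just as well be computed on different machines or at different times, as long as they are combined in list order.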
Non-Commutativity¶
If it's not commutative, then swapping different elements should produce a different hash. That is, $e1 × e2 \ne e2 × e1$:
    x = np.matmul(e1, e2)
    y = np.matmul(e2, e1)
    # observe that they produce different summaries
    assert_not_equal(x, y)
Printed for comparison:

    print(x)
    print()
    print(y)

    [[ 87  79 149 131 148 247 195  90]
     [249  84 195  58 142 133 211  15]
     [177  93  69 254 240 234  97  37]
     [ 46  84  76 253  55 200  43 236]
     [ 21  84  99 157  55 148 170   2]
     [168 123   6 250  64 144  54 242]
     [230  78 164  76  30  29 214  68]
     [ 47 183 156 239 157 177 192 184]]

    [[149  18 239 238  84 188 191 109]
     [239 150 214 235  59 161   9 133]
     [ 89 174  59  14  70 113 124 243]
     [ 66 113 176 124 227 247  17  25]
     [247 138 152 181 177 143 184  97]
     [113 249 199 153 154  75  45 105]
     [121 201 225  42 249 213 180 244]
     [ 85  31  72  28 181 182 140 176]]
Other functions¶
    # Create a list of 1024 elements and reduce them one by one
    list1 = [hash_m(b"A") for _ in range(0, 1024)]
    hash1 = reduce(mul_m, list1)

    # Take a starting element and square/double it 10 times.
    # 1 starting element doubled 10 times = 2**10 = 1024 elements
    hash2 = reduce((lambda m, _: mul_m(m, m)), range(0, 10), hash_m(b"A"))

    # Observe that these two methods of calculating the hash have the same result
    assert_equal(hash1, hash2)

    # let's call it double
    def double_m(m, d=1):
        return reduce((lambda m, _: mul_m(m, m)), range(0, d), m)

    assert_equal(hash1, double_m(hash_m(b"A"), 10))

    def identity_m():
        return np.identity(8, dtype=np.uint8)

    # generalize double_m to any length, not just doublings, performed in O(log(n)) matmuls
    def repeat_m(m, n):
        res = identity_m()
        while n > 0:
            # concatenate the current doubling iff the bit representing this doubling is set
            if n & 1:
                res = mul_m(res, m)
            n >>= 1
            m = mul_m(m, m)  # double matrix m
        return res

    # repeat_m can do the same as double_m
    assert_equal(hash1, repeat_m(hash_m(b"A"), 1024))

    # but it can also repeat any number of times
    hash3 = reduce(mul_m, (hash_m(b"A") for _ in range(0, 3309)))
    assert_equal(hash3, repeat_m(hash_m(b"A"), 3309))

    # Even returns a sensible result when requesting 0 elements
    assert_equal(identity_m(), repeat_m(hash_m(b"A"), 0))

    # make helper for reducing an iterable of hashes
    def reduce_m(am):
        return reduce(mul_m, am)
    print(hash1)
    print()
    print(hash2)

    [[ 68 252 159   3  14  52 199 199]
     [136 124   6  34  58 174 206  54]
     [  3 234   2  13 120 240   7 163]
     [102  47  66  61  87 234 246  72]
     [ 19 135  80 115  75 242 242   5]
     [244 165 250  28  76  43 188 254]
     [233  46 187  39 151 241 175 130]
     [132 138   6 215  20 132  89  33]]

    [[ 68 252 159   3  14  52 199 199]
     [136 124   6  34  58 174 206  54]
     [  3 234   2  13 120 240   7 163]
     [102  47  66  61  87 234 246  72]
     [ 19 135  80 115  75 242 242   5]
     [244 165 250  28  76  43 188 254]
     [233  46 187  39 151 241 175 130]
     [132 138   6 215  20 132  89  33]]
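To make the cost difference concrete, here is a sketch of a counting variant of the repeated-squaring idea. It is not the `repeat_m` above: the name `repeat_counting` is hypothetical, it returns the number of `np.matmul` calls alongside the result, and it skips the final squaring whose result would go unused.

```python
import hashlib
from functools import reduce

import numpy as np


def hash_matrix(data: bytes) -> np.ndarray:
    """SHA-512 digest of `data` as an 8x8 uint8 matrix (illustrative helper)."""
    digest = hashlib.sha512(data).digest()
    return np.frombuffer(digest, dtype=np.uint8).reshape(8, 8)


def repeat_counting(m: np.ndarray, n: int):
    """Return (hash of m repeated n times, number of matmul calls used)."""
    result = np.identity(8, dtype=np.uint8)
    calls = 0
    while n > 0:
        if n & 1:  # this power of two is part of n
            result = np.matmul(result, m)
            calls += 1
        n >>= 1
        if n:  # skip the final squaring, whose result would be unused
            m = np.matmul(m, m)
            calls += 1
    return result, calls


a = hash_matrix(b"A")
n = 3309

fast, calls = repeat_counting(a, n)
slow = reduce(np.matmul, [a] * n)  # n - 1 multiplications

assert np.array_equal(fast, slow)
assert calls <= 2 * n.bit_length()
print(f"{n - 1} matmuls naively, {calls} with repeated squaring")
```

The bound of `2 * n.bit_length()` calls comes from at most one squaring and one accumulation per bit of `n`.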
Fun with associativity¶
Does the hash of a list change even when swapping two elements in the middle of a very long list?
    a = hash_m(b"A")
    b = hash_m(b"B")
    a499 = repeat_m(a, 499)
    a500 = repeat_m(a, 500)

    # this should work because they're all a's
    assert_equal(reduce_m([a, a499]), a500)
    assert_equal(reduce_m([a499, a]), a500)

    # these are lists of 1000 elements: 999 copies of a, with one b at position 500 (x) or 501 (y)
    x = reduce_m([a499, b, a500])
    y = reduce_m([a500, b, a499])

    # shifting the b by one element changed the hash
    assert_not_equal(x, y)
Flex that associativity - this statement is true and equivalent to the assertion below:
$(a × (a499 × b × a500) × (a500 × b × a499) × a) = (a500 × b × (a500 × a500) × b × a500)$
assert_equal(reduce_m([a, x, y, a]), reduce_m([a500, b, repeat_m(a500, 2), b, a500]))
Unknowns¶
This appears to me to be a reasonable way to define the hash of a list. The mathematical definition of a list aligns very nicely with the properties offered by matrix multiplication. But is it appropriate to use for the same things that a Merkle Tree would be? The big questions are related to the valuable properties of hash functions:
- Given a merklist summary but not the elements, is it possible to produce a different list of elements that hash to the same summary? (~preimage resistance)
- Given a merklist summary or sublist summaries of it, can you derive the hashes of elements or their order?
- Is it possible to predictably alter the merklist summary by concatenating it with some other sublist of real elements?
- Are there other desirable security properties that would be valuable for a list hash?
- Is there a better choice of hash function as a primitive than sha512?
- Is there a better choice of reduction function that still retains associativity+non-commutativity than simple matmul?
- Is there a more appropriate size than an 8x8 matrix / 64 bytes to represent merklist summaries?
Matrixes are well-studied objects, so perhaps such information is already known. If you know something about deriving preimages of products in the matrix ring $R_{256}^{8×8}$, I would be very interested to know.
What's next?¶
***If** this construction has the appropriate security properties*, it seems to be a better merkle tree in all respects. Any use of a merkle tree could be replaced with this, and it could enable use-cases where merkle trees aren't useful. Some examples of what I think might be possible:
- Using a Merklist with a sublist summary tree structure enables creating a $O(1)$-sized 'Merklist Proof' that can verify the addition and subtraction of any number of elements at any single point in the list using only $O(\log N)$ time and $O(\log N)$ static space. As a bonus the proof generator and verifier can have totally different tree structures and can still communicate the proof successfully.
- Using a Merklist summary tree you can create a consistent hash of any ordered key-value store (like a btree) that can be maintained incrementally inline with regular node updates, e.g. as part of a LSM-tree. This could facilitate verification and sync between database replicas.
- The sublist summary tree structure can be as dense or sparse as desired. You could summarize down to pairs of elements akin to a merkle tree, but you could also summarize a compressed sublist of hundreds or even millions of elements with a single hash. Of course, calculating or verifying a proof of changes to the middle of that sublist would require rehashing the whole sublist, but this turns it from a fixed structure into a tuneable parameter.
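- If all possible elements had an easily computable inverse, that would enable "subtracting" an element by inserting its inverse in front of it. The element hashes would then live in the group of invertible matrices rather than merely in the matrix ring (it could not become a field, since matrix multiplication is not commutative), and that might have interesting implications.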
  - For example, you could define a cryptographically-secure rolling hash where advancing either end can be calculated in $O(1)$ time and space.
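As a rough sketch of the sublist summary tree idea from the list above, the block below keeps a binary tree of sub-list summaries so that replacing one element only recomputes the summaries on the path to the root. This is an illustration under assumptions, not the 'Merklist Proof' construction itself: the class `SummaryTree`, its methods, and the fixed-length, power-of-two padded layout are all hypothetical choices made for the example, and nothing here says anything about the security questions raised earlier.

```python
import hashlib
from functools import reduce

import numpy as np

IDENTITY = np.identity(8, dtype=np.uint8)  # hash of the empty list


def hash_matrix(data: bytes) -> np.ndarray:
    """SHA-512 digest of `data` as an 8x8 uint8 matrix (illustrative helper)."""
    digest = hashlib.sha512(data).digest()
    return np.frombuffer(digest, dtype=np.uint8).reshape(8, 8)


class SummaryTree:
    """Binary tree of sub-list summaries over a fixed-length list (hypothetical)."""

    def __init__(self, elements):
        self.size = 1
        while self.size < len(elements):
            self.size *= 2
        # nodes[1] is the root and leaves live at nodes[size:]. Padding
        # leaves hold the identity, so they do not affect any product.
        self.nodes = [IDENTITY] * (2 * self.size)
        for i, e in enumerate(elements):
            self.nodes[self.size + i] = hash_matrix(e)
        for i in range(self.size - 1, 0, -1):
            self.nodes[i] = np.matmul(self.nodes[2 * i], self.nodes[2 * i + 1])

    def root(self) -> np.ndarray:
        """Summary of the whole list."""
        return self.nodes[1]

    def replace(self, index: int, element: bytes) -> None:
        """Replace one element and refresh the O(log N) summaries above it."""
        i = self.size + index
        self.nodes[i] = hash_matrix(element)
        i //= 2
        while i >= 1:
            # Left child times right child keeps the list order intact.
            self.nodes[i] = np.matmul(self.nodes[2 * i], self.nodes[2 * i + 1])
            i //= 2


items = [f"row {i}".encode() for i in range(1000)]
tree = SummaryTree(items)

# Change one element in the middle and update the tree incrementally.
items[617] = b"changed"
tree.replace(617, b"changed")

# The incrementally maintained root matches a full rehash of the list.
flat = reduce(np.matmul, (hash_matrix(e) for e in items), IDENTITY)
assert np.array_equal(tree.root(), flat)
```

Because of associativity, a differently shaped tree (wider nodes, unbalanced, or with whole sublists collapsed into one leaf) would arrive at the same root summary.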
To be continued...