

Merklist Tree

A Merklist Tree uses the associativity and non-commutativity of matrix multiplication to construct a digest (summary) of an ordered list of elements, such that mutations to the list can be computed, verified, and stored using $O(\log N)$ time and space. Because of associativity, adjacent sub-lists divided at arbitrary points can be summarized independently and then combined to find the summary of their concatenation.

Construction

In [1]:
# setup

import hashlib
import numpy as np
from functools import reduce

def assert_equal(a, b):
    return np.testing.assert_equal(a, b)

def assert_not_equal(a, b):
    return np.testing.assert_raises(AssertionError, np.testing.assert_equal, a, b)

hash_m/1

The function hash_m/1 takes a buffer of bytes as its only argument and returns the sha512 hash of the bytes, formatted as an 8×8 2-d array of 8-bit unsigned integers with wrapping overflow. Based on a shallow Wikipedia dive, someone familiar with linear algebra might call this a matrix ring, $R_{256}^{8 \times 8}$: the 8×8 matrices whose entries are integers modulo 256. Not coincidentally, sha512 outputs 512 bits = 64 bytes = an 8×8 array of bytes. How convenient. That might even be the primary reason why I chose sha512.

In [2]:
def hash_m(s):
    hash_bytes = list(hashlib.sha512(s).digest())[:64]
    return np.array(hash_bytes, dtype=np.uint8).reshape((8,8))
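As a quick sanity check of the description above, the following sketch confirms that the matrix is nothing more than the 64 digest bytes laid out row by row. It repeats the definition of hash_m so that it runs on its own; nothing else is assumed.

```python
import hashlib
import numpy as np

def hash_m(s):
    # same definition as in the cell above, repeated so this block is self-contained
    hash_bytes = list(hashlib.sha512(s).digest())[:64]
    return np.array(hash_bytes, dtype=np.uint8).reshape((8, 8))

digest = hashlib.sha512(b"Hello A").digest()
m = hash_m(b"Hello A")

# 64 bytes of digest become an 8x8 matrix of unsigned bytes
assert len(digest) == 64
assert m.shape == (8, 8) and m.dtype == np.uint8

# flattening the matrix row by row gives back the digest unchanged
assert m.tobytes() == digest

# the first row is the first 8 bytes of the digest
assert m[0].tobytes() == digest[:8]
```

Since sha512 always returns exactly 64 bytes, the `[:64]` slice in hash_m is a no-op; it only documents the expected length.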

8×8 seems big compared to 3×3 or 4×4 matrices. The values are as random as you might expect a cryptographic hash to be:

In [3]:
print(hash_m(b"Hello A"))
print()
print(hash_m(b"Hello B"))
[[ 14 184 108 217 131 164 222  93]
 [132 227  82 144 111 178 195 109]
 [ 25 250 155  17 131 183 151 217]
 [212  60 138  36   0  60 115 181]
 [ 51   0  87  43  93 252  56  61]
 [108 239 175 222  23 142  41 216]
 [203  98 234  13  65 169 255 240]
 [ 46 127  15 167 112 153 222  94]]

[[ 63 144 188   5  57 146  32  56]
 [ 27 189  98 140 113 194  70  87]
 [115  21 136  27 116 167  85  48]
 [ 29 162 119  29 104  32 145 241]
 [166 197  57 165 132 213  50 202]
 [ 48  71  33  19 230  26  58 164]
 [242 172  65 202 193  50 193 141]
 [206 110 165 129  52 132 250  73]]

List

OK, so we've formatted hashes of bytes into matrices, but we haven't actually done anything with them yet.

Consider a list of arbitrarily many arbitrary byte buffers. Define the 'hash' of the list to be the reduction, by matrix multiplication in list order, of the hashes of the individual byte buffers.

In [4]:
def mul_m(hm1, hm2):
    return np.matmul(hm1, hm2, dtype=np.uint8)
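The "wrapping overflow" in mul_m is what makes this arithmetic modulo 256. The sketch below checks that claim by computing the same product in 64-bit integers and reducing modulo 256 afterwards. It repeats hash_m and mul_m so it runs on its own; the variable names `wrapped` and `wide` are only illustrative.

```python
import hashlib
import numpy as np

def hash_m(s):
    # same definition as above
    hash_bytes = list(hashlib.sha512(s).digest())[:64]
    return np.array(hash_bytes, dtype=np.uint8).reshape((8, 8))

def mul_m(hm1, hm2):
    # same definition as above: uint8 matrix product with wrapping overflow
    return np.matmul(hm1, hm2, dtype=np.uint8)

a = hash_m(b"Hello A")
b = hash_m(b"Hello B")

# product computed directly in uint8, overflow wraps silently
wrapped = mul_m(a, b)

# the same product in wide integers (no overflow possible: 8 * 255 * 255 fits easily),
# reduced modulo 256 only at the end
wide = (a.astype(np.int64) @ b.astype(np.int64)) % 256

assert wrapped.dtype == np.uint8
assert np.array_equal(wrapped, wide.astype(np.uint8))
```

Because reduction modulo 256 is compatible with addition and multiplication, it does not matter whether the reduction happens at every step or only at the end, which is why the uint8 product inherits associativity from ordinary matrix multiplication.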

Consider an example:

In [5]:
# list1 contains 3 elements
list1 = [b"A", b"Hello", b"World"]
# first hash each element
hashes1 = [hash_m(e) for e in list1]
# get the hash of the list by reducing the hashes by matrix multiplication
hash1 = mul_m(mul_m(hashes1[0], hashes1[1]), hashes1[2])
# an alternative way to write the reduction
hash2 = reduce(mul_m, hashes1)
In [6]:
print("List of byte buffers:")
print(list1)
print("\nHashes of byte buffers:")
print(hashes1)
print("\nHash of full list:")
print(hash1)
assert_equal(hash1, hash2)
List of byte buffers:
[b'A', b'Hello', b'World']

Hashes of byte buffers:
[array([[ 33, 180, 244, 189, 158, 100, 237,  53],
       [ 92,  62, 182, 118, 162, 142, 190, 218],
       [246, 216, 241, 123, 220,  54,  89, 149],
       [179,  25,   9, 113,  83,   4,  64, 128],
       [ 81, 107, 208, 131, 191, 204, 230,  97],
       [ 33, 163,   7,  38,  70, 153,  76, 132],
       [ 48, 204,  56,  43, 141, 197,  67, 232],
       [ 72, 128,  24,  59, 248,  86, 207, 245]], dtype=uint8), array([[ 54,  21, 248,  12, 157,  41,  62, 215],
       [ 64,  38, 135, 249,  75,  34, 213, 142],
       [ 82, 155, 140, 199, 145, 111, 143, 172],
       [127, 221, 247, 251, 213, 175,  76, 247],
       [119, 211, 215, 149, 167, 160,  10,  22],
       [191, 126, 127,  63, 185,  86,  30, 233],
       [186, 174,  72,  13, 169, 254, 122,  24],
       [118, 158, 113, 136, 107,   3, 243,  21]], dtype=uint8), array([[142, 167, 115, 147, 164,  42, 184, 250],
       [146,  80,  15, 176, 119, 169,  80, 156],
       [195,  43, 201,  94, 114, 113,  46, 250],
       [ 17, 110, 218, 242, 237, 250, 227,  79],
       [187, 104,  46, 253, 214, 197, 221,  19],
       [193,  23, 224, 139, 212, 170, 239, 113],
       [ 41,  29, 138, 172, 226, 248, 144,  39],
       [ 48, 129, 208, 103, 124,  22, 223,  15]], dtype=uint8)]

Hash of full list:
[[178 188  57 157  60 136 190 127]
 [ 40 234 254 224  38  46 250  52]
 [156  72 193 136 219  98  28   4]
 [197   2  43 132 132 232 254 198]
 [ 93  64 113 215   2 246 130 192]
 [ 91 107  85  13 149  60  19 173]
 [ 84  77 244  98   0 239 123  17]
 [ 58 112  98 250 163  20  27   6]]

What does this give us? Generally speaking, multiplying two matrices $M_1 \times M_2$ gives us at least these two properties:

  • Associativity - Associativity enables you to reduce a computation using any partitioning because all partitionings yield the same result. Addition is associative $(1+2)+3 = 1+(2+3)$, subtraction is not $(5-3)-2\neq5-(3-2)$. (Associative property)
  • Non-Commutativity - Commutativity allows you to swap elements without affecting the result. Addition is commutative $1+2 = 2+1$, but division is not $1\div2 \neq2\div1$. And neither is matrix multiplication. (Commutative property)

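This is an unusual combination of properties; at least, it is not a combination found among the usual arithmetic operations:

| operation | associative | commutative |
|-----------|-------------|-------------|
| +         | yes         | yes         |
| *         | yes         | yes         |
| -         | no          | no          |
| /         | no          | no          |
| exp       | no          | no          |
| M×M       | yes         | no          |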

Upon consideration, these are exactly the properties one would want in order to define the hash of a list of items. Non-commutativity makes the hash depend on the order of the elements, since swapping them (in general) produces a different hash. Associativity enables caching the summary of an arbitrary sublist; doing this hierarchically on a huge list enables an algorithm to calculate the hash of any sublist at the cost of $O(\log N)$ time and space.
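One way the hierarchical caching could look is a segment tree whose nodes store the product of the leaves beneath them. This is only a sketch under that assumption, not necessarily the structure this notebook is heading towards; the class name `MerklistSegmentTree` and its methods are hypothetical. It supports replacing one element and querying the hash of any sublist, each with $O(\log N)$ matrix multiplications.

```python
import hashlib
from functools import reduce
import numpy as np

def hash_m(s):
    return np.array(list(hashlib.sha512(s).digest()), dtype=np.uint8).reshape((8, 8))

def mul_m(hm1, hm2):
    return np.matmul(hm1, hm2, dtype=np.uint8)

def identity_m():
    return np.identity(8, dtype=np.uint8)

class MerklistSegmentTree:
    """Hypothetical helper: node i caches the ordered product of the leaves below it."""

    def __init__(self, items):
        self.size = 1
        while self.size < max(len(items), 1):
            self.size *= 2
        # leaves live at [size, 2*size); unused leaves hold the identity matrix
        self.node = [identity_m() for _ in range(2 * self.size)]
        for i, item in enumerate(items):
            self.node[self.size + i] = hash_m(item)
        for i in range(self.size - 1, 0, -1):
            self.node[i] = mul_m(self.node[2 * i], self.node[2 * i + 1])

    def update(self, index, item):
        """Replace one element and recompute only its ancestors."""
        i = self.size + index
        self.node[i] = hash_m(item)
        i //= 2
        while i >= 1:
            self.node[i] = mul_m(self.node[2 * i], self.node[2 * i + 1])
            i //= 2

    def query(self, lo, hi):
        """Hash of items[lo:hi]; left and right parts are kept in list order."""
        left, right = identity_m(), identity_m()
        lo += self.size
        hi += self.size
        while lo < hi:
            if lo & 1:
                left = mul_m(left, self.node[lo])
                lo += 1
            if hi & 1:
                hi -= 1
                right = mul_m(self.node[hi], right)
            lo //= 2
            hi //= 2
        return mul_m(left, right)

# usage: compare against the naive left-to-right reduction
items = [b"item %d" % i for i in range(100)]
tree = MerklistSegmentTree(items)
naive_sub = reduce(mul_m, (hash_m(e) for e in items[17:63]))
assert np.array_equal(tree.query(17, 63), naive_sub)

tree.update(40, b"changed")
items[40] = b"changed"
naive_all = reduce(mul_m, (hash_m(e) for e in items))
assert np.array_equal(tree.query(0, 100), naive_all)
```

The query has to accumulate the left and right halves separately because the operation is not commutative; associativity is what allows the cached block products to be reused at all.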

Associativity

In [7]:
f1 = hash_m(b"Hello A")
f2 = hash_m(b"Hello B")
f3 = hash_m(b"Hello C")
In [8]:
# x is calculated by association ((f1 × f2) × f3)
x = np.matmul(np.matmul(f1, f2), f3)

# y is calculated by association (f1 × (f2 × f3))
y = np.matmul(f1, np.matmul(f2, f3))

# observe that they produce the same result
assert_equal(x, y)
In [9]:
print(x)
print()
print(y)
[[ 58  12 144 134 100 158 159  51]
 [ 73 206 202 190  87  79 223   2]
 [210 122 142 117  37 148 106  45]
 [175 146 187 223 235 171  64 226]
 [149  85 203  87  92 251 243 206]
 [ 18 252 160 103 125 251 181 133]
 [191 132 220 104 213 154  34 154]
 [127 197  95  87 166   3  22   3]]

[[ 58  12 144 134 100 158 159  51]
 [ 73 206 202 190  87  79 223   2]
 [210 122 142 117  37 148 106  45]
 [175 146 187 223 235 171  64 226]
 [149  85 203  87  92 251 243 206]
 [ 18 252 160 103 125 251 181 133]
 [191 132 220 104 213 154  34 154]
 [127 197  95  87 166   3  22   3]]
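The same property extends from three matrices to whole lists: cutting a list at any point, hashing the two halves independently, and multiplying the two summaries gives the hash of the full list. A minimal sketch, with hash_m and mul_m repeated so it runs on its own and an arbitrary example list:

```python
import hashlib
from functools import reduce
import numpy as np

def hash_m(s):
    return np.array(list(hashlib.sha512(s).digest()), dtype=np.uint8).reshape((8, 8))

def mul_m(hm1, hm2):
    return np.matmul(hm1, hm2, dtype=np.uint8)

items = [b"A", b"Hello", b"World", b"foo", b"bar"]
hashes = [hash_m(e) for e in items]

# hash of the whole list, reduced left to right
whole = reduce(mul_m, hashes)

# every possible split point yields the same combined hash
for k in range(1, len(hashes)):
    left = reduce(mul_m, hashes[:k])
    right = reduce(mul_m, hashes[k:])
    assert np.array_equal(mul_m(left, right), whole)
```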

Non-Commutativity

In [10]:
# x is f1 × f2
x = np.matmul(f1, f2)

# y is f2 × f1
y = np.matmul(f2, f1)

# observe that they produce different results
assert_not_equal(x, y)
In [11]:
print(x)
print()
print(y)
[[ 87  79 149 131 148 247 195  90]
 [249  84 195  58 142 133 211  15]
 [177  93  69 254 240 234  97  37]
 [ 46  84  76 253  55 200  43 236]
 [ 21  84  99 157  55 148 170   2]
 [168 123   6 250  64 144  54 242]
 [230  78 164  76  30  29 214  68]
 [ 47 183 156 239 157 177 192 184]]

[[149  18 239 238  84 188 191 109]
 [239 150 214 235  59 161   9 133]
 [ 89 174  59  14  70 113 124 243]
 [ 66 113 176 124 227 247  17  25]
 [247 138 152 181 177 143 184  97]
 [113 249 199 153 154  75  45 105]
 [121 201 225  42 249 213 180 244]
 [ 85  31  72  28 181 182 140 176]]

Other functions

In [12]:
# Create a list of 1024 elements and reduce them one by one
list1 = [hash_m(b"A") for _ in range(0, 1024)]
hash1 = reduce(mul_m, list1)

# Take a starting element and square it 10 times. Each squaring doubles the list length, so 1 element after 10 doublings = 2**10 = 1024 elements
hash2 = reduce((lambda m, _ : mul_m(m, m)), range(0, 10), hash_m(b"A"))

# Observe that these two methods of calculating the hash have the same result
assert_equal(hash1, hash2)

# let's call it double
def double_m(m, d=1):
    return reduce((lambda m, _ : mul_m(m, m)), range(0, d), m)

assert_equal(hash1, double_m(hash_m(b"A"), 10))

def identity_m():
    return np.identity(8, dtype=np.uint8)

# generalize to any length, not just doublings, performed in O(log(n)) matmuls
def repeat_m(m, n):
    res = identity_m()
    while n > 0:
        # concatenate the current doubling iff the bit representing this doubling is set
        # (all factors are powers of the same m, so they commute with each other)
        if n & 1:
            res = mul_m(res, m)
        n >>= 1
        m = mul_m(m, m) # double matrix m
    return res

# repeat_m can do the same as double_m
assert_equal(hash1, repeat_m(hash_m(b"A"), 1024))

# but it can also repeat any number of times
hash3 = reduce(mul_m, (hash_m(b"A") for _ in range(0, 3309)))
assert_equal(hash3, repeat_m(hash_m(b"A"), 3309))

# Even returns a sensible result when requesting 0 elements
assert_equal(identity_m(), repeat_m(hash_m(b"A"), 0))

# make helper for reducing an iterable of hashes
def reduce_m(am):
    return reduce(mul_m, am)
In [13]:
print(hash1)
print()
print(hash2)
print()
print(np.identity(8,"B"))
[[ 68 252 159   3  14  52 199 199]
 [136 124   6  34  58 174 206  54]
 [  3 234   2  13 120 240   7 163]
 [102  47  66  61  87 234 246  72]
 [ 19 135  80 115  75 242 242   5]
 [244 165 250  28  76  43 188 254]
 [233  46 187  39 151 241 175 130]
 [132 138   6 215  20 132  89  33]]

[[ 68 252 159   3  14  52 199 199]
 [136 124   6  34  58 174 206  54]
 [  3 234   2  13 120 240   7 163]
 [102  47  66  61  87 234 246  72]
 [ 19 135  80 115  75 242 242   5]
 [244 165 250  28  76  43 188 254]
 [233  46 187  39 151 241 175 130]
 [132 138   6 215  20 132  89  33]]

[[1 0 0 0 0 0 0 0]
 [0 1 0 0 0 0 0 0]
 [0 0 1 0 0 0 0 0]
 [0 0 0 1 0 0 0 0]
 [0 0 0 0 1 0 0 0]
 [0 0 0 0 0 1 0 0]
 [0 0 0 0 0 0 1 0]
 [0 0 0 0 0 0 0 1]]

Exploring

In [14]:
a = hash_m(b"A")
b = hash_m(b"B")

a499 = repeat_m(a, 499)
a500 = repeat_m(a, 500)

# this should work because they're all a's
assert_equal(reduce_m([a, a499]), a500)
assert_equal(reduce_m([a499, a]), a500)

# these are lists of 1000 elements: 999 copies of a, plus one b at position 500 (x) or 501 (y)
x = reduce_m([a499, b, a500])
y = reduce_m([a500, b, a499])

# shifting the b by one element changed the hash
assert_not_equal(x, y)

Flex that associativity: (a × (a499 × b × a500) × (a500 × b × a499) × a) = (a500 × b × (a500 × a500) × b × a500)

In [15]:
assert_equal(reduce_m([a, x, y, a]), reduce_m([a500, b, repeat_m(a500, 2), b, a500]))

"Merklist"?

Merklist ~ Merklix ~ Merkle

"Tree"?

OK great, so now we can construct a hash for a list that always produces the same hash for the same list, independent of which pairs in the list are reduced first. I think this enables verifiably adding or removing any number of elements at any point in the list with only $O(\log N)$ additional time and space, but what does that look like specifically? Or alternatively, where does the "Tree" part of "Merklist Tree" come in?
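As a first step towards an answer, here is a minimal sketch of the algebra behind an insertion or removal, assuming the summaries of the untouched prefix and suffix are already available from some cache. The helper names `list_hash` and `splice_hash` are hypothetical. The sketch does not by itself achieve the $O(\log N)$ bound; that depends on how the prefix and suffix summaries are stored and looked up.

```python
import hashlib
from functools import reduce
import numpy as np

def hash_m(s):
    return np.array(list(hashlib.sha512(s).digest()), dtype=np.uint8).reshape((8, 8))

def mul_m(hm1, hm2):
    return np.matmul(hm1, hm2, dtype=np.uint8)

def list_hash(items):
    # hypothetical helper: hash of a (possibly empty) list, identity for the empty list
    return reduce(mul_m, (hash_m(e) for e in items), np.identity(8, dtype=np.uint8))

def splice_hash(prefix_hash, new_items, suffix_hash):
    # hypothetical helper: hash of prefix + new_items + suffix, using only the
    # two cached summaries and the hashes of the new elements
    return mul_m(mul_m(prefix_hash, list_hash(new_items)), suffix_hash)

items = [b"item %d" % i for i in range(10)]

# insertion: two new elements placed between positions 3 and 4
inserted = splice_hash(list_hash(items[:4]), [b"new 1", b"new 2"], list_hash(items[4:]))
assert np.array_equal(inserted, list_hash(items[:4] + [b"new 1", b"new 2"] + items[4:]))

# removal: drop elements 3 and 4 by splicing in nothing
removed = splice_hash(list_hash(items[:3]), [], list_hash(items[5:]))
assert np.array_equal(removed, list_hash(items[:3] + items[5:]))
```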

To be continued...

Conclusion / Security

This appears to me to be a reasonable way to define the hash of a list. Definitionally, it needs to preserve the order of elements; this is provided by the non-commutativity property. For efficiency, it would be nice if any two sublists can have a known equality; this is provided by the associativity property. It's not clear to me under what circumstances any information about what is contained in the list could be derived from the hash.

Obviously, being "not clear to me how" is not a proof of impossibility. Matrices are well-studied objects, so perhaps such information is already known. If you know something about deriving preimages in the matrix ring $R_{256}^{8 \times 8}$, I would be very interested to know.

If this construction has the appropriate security properties, it seems to be a better Merkle tree in all respects. Any use of a Merkle tree could be replaced with this, and it would enable many more use-cases where Merkle trees are not applicable. For example, this would allow you to track the hash of a B-tree-like structure over time with no additional cost (asymptotically). Of course, these ideas are putting the cart before the horse; we need to know more about its properties first.