
Merklist

A definition for the hash of a list that is robust to arbitrary partitioning

  • toc: true
  • categories: [merklist]

Matrix multiplication's associativity and non-commutativity properties provide a natural definition for a cryptographic hash / digest / summary of an ordered list of elements while preserving concatenation operations. Due to the non-commutativity property, lists that differ in element order result in a different summary. Due to the associativity property, arbitrarily divided adjacent sub-lists can be summarized independently and combined to find the summary of their concatenation in one operation. This definition provides exactly the properties needed to define a list, and does not impose any unnecessary structure that could cause two equivalent lists to produce different summaries. The name Merklist is intended to be reminiscent of other hash-based data structures like the Merkle Tree and Merklix Tree.

Definition

This definition of a hash of a list of elements is pretty simple:

  • A list element is an arbitrary buffer of bytes. Any length, any content. Just bytes.
  • A list, then, is a sequence of such elements.
  • The hash of a list element is the cryptographic hash of its bytes, formatted into a square matrix with byte elements. (More details later.)
  • The hash of a list is the reduction by matrix multiplication of the hashes of all its elements, in the same order as they appear in the list.
  • The hash of a list with 0 elements is the identity matrix.

This construction has a few notable consequences:

  • The hash of a list with only one item is just the hash of the item itself.
  • You can calculate the hash of any list concatenated with a copy of itself by matrix multiplication of the hash with itself. This works for single elements as well as arbitrarily long lists.
  • A list can have multiple copies of the same list item, and swapping them does not affect the list hash. Consider how swapping the first two elements in [1, 1, 2] has no discernible effect.
  • The hash of the concatenation of two lists is the matrix multiplication of their hashes.
  • Concatenating a list with a list of 0 elements yields the same hash.

Let's explore this definition in more detail with a simple implementation in Python + NumPy.

In [1]:
#collapse-hide
# Setup and imports
import hashlib
import numpy as np
from functools import reduce

def assert_equal(a, b):
    return np.testing.assert_equal(a, b)

def assert_not_equal(a, b):
    return np.testing.assert_raises(AssertionError, np.testing.assert_equal, a, b)

The hash of a list element - hash_m/1

The function hash_m/1 takes a buffer of bytes as its only argument, and returns the sha512 hash of those bytes formatted as an 8×8 2-d array of 8-bit unsigned integers (with wrapping overflow). We define this to be the hash of the list element. Based on a shallow Wikipedia dive, someone familiar with linear algebra might say it's a matrix ring, $R_{256}^{8×8}$. Not coincidentally, sha512 outputs 512 bits = 64 bytes = an 8×8 array of bytes. How convenient. (In fact, that might even be the primary reason why I chose sha512!)

In [2]:
def hash_m(e):
    hash_bytes = list(hashlib.sha512(e).digest())[:64]          # hash the bytes e, convert the digest into a list of 64 bytes
    return np.array(hash_bytes, dtype=np.uint8).reshape((8,8))  # convert the digest bytes into a numpy array with the appropriate data type and shape

8×8 seems big compared to 3×3 or 4×4 matrices. The values are as random as you might expect from a cryptographic hash, and range from 0 to 255:

In [3]:
#collapse-hide
print(hash_m(b"Hello A"))
print()
print(hash_m(b"Hello B"))
[[ 14 184 108 217 131 164 222  93]
 [132 227  82 144 111 178 195 109]
 [ 25 250 155  17 131 183 151 217]
 [212  60 138  36   0  60 115 181]
 [ 51   0  87  43  93 252  56  61]
 [108 239 175 222  23 142  41 216]
 [203  98 234  13  65 169 255 240]
 [ 46 127  15 167 112 153 222  94]]

[[ 63 144 188   5  57 146  32  56]
 [ 27 189  98 140 113 194  70  87]
 [115  21 136  27 116 167  85  48]
 [ 29 162 119  29 104  32 145 241]
 [166 197  57 165 132 213  50 202]
 [ 48  71  33  19 230  26  58 164]
 [242 172  65 202 193  50 193 141]
 [206 110 165 129  52 132 250  73]]

The hash of a list - mul_m/2

OK, so we've got our element hashes; how do we combine them to construct the hash of a list? We defined the hash of the list to be the reduction by matrix multiplication of the hash of each element:

In [4]:
def mul_m(he1, he2):
    return np.matmul(he1, he2, dtype=np.uint8) # just, like, multiply them (uint8 arithmetic wraps mod 256)

Consider an example:

In [17]:
#
# `elements` is a list of 3 elements
elements = [b"A", b"Hello", b"World"]
# first, make a new list with the hash of each element
element_hashes = [hash_m(e) for e in elements]
# get the hash of the list by reducing the hashes by matrix multiplication
list_hash1 = mul_m(mul_m(element_hashes[0], element_hashes[1]), element_hashes[2])
# an alternative way to write the reduction
list_hash2 = reduce(mul_m, element_hashes)
# check that these alternative spellings are equivalent
assert_equal(list_hash1, list_hash2)

Expand the section below to see the output

In [18]:
#collapse-hide
#collapse-output
print("List of elements:")
print(elements)
print()
print("Hash of each element:")
print(element_hashes)
print()
print("Hash of full list:")
print(list_hash1)
# Expand the section below to see the output
List of elements:
[b'A', b'Hello', b'World']

Hash of each element:
[array([[ 33, 180, 244, 189, 158, 100, 237,  53],
       [ 92,  62, 182, 118, 162, 142, 190, 218],
       [246, 216, 241, 123, 220,  54,  89, 149],
       [179,  25,   9, 113,  83,   4,  64, 128],
       [ 81, 107, 208, 131, 191, 204, 230,  97],
       [ 33, 163,   7,  38,  70, 153,  76, 132],
       [ 48, 204,  56,  43, 141, 197,  67, 232],
       [ 72, 128,  24,  59, 248,  86, 207, 245]], dtype=uint8), array([[ 54,  21, 248,  12, 157,  41,  62, 215],
       [ 64,  38, 135, 249,  75,  34, 213, 142],
       [ 82, 155, 140, 199, 145, 111, 143, 172],
       [127, 221, 247, 251, 213, 175,  76, 247],
       [119, 211, 215, 149, 167, 160,  10,  22],
       [191, 126, 127,  63, 185,  86,  30, 233],
       [186, 174,  72,  13, 169, 254, 122,  24],
       [118, 158, 113, 136, 107,   3, 243,  21]], dtype=uint8), array([[142, 167, 115, 147, 164,  42, 184, 250],
       [146,  80,  15, 176, 119, 169,  80, 156],
       [195,  43, 201,  94, 114, 113,  46, 250],
       [ 17, 110, 218, 242, 237, 250, 227,  79],
       [187, 104,  46, 253, 214, 197, 221,  19],
       [193,  23, 224, 139, 212, 170, 239, 113],
       [ 41,  29, 138, 172, 226, 248, 144,  39],
       [ 48, 129, 208, 103, 124,  22, 223,  15]], dtype=uint8)]

Hash of full list:
[[178 188  57 157  60 136 190 127]
 [ 40 234 254 224  38  46 250  52]
 [156  72 193 136 219  98  28   4]
 [197   2  43 132 132 232 254 198]
 [ 93  64 113 215   2 246 130 192]
 [ 91 107  85  13 149  60  19 173]
 [ 84  77 244  98   0 239 123  17]
 [ 58 112  98 250 163  20  27   6]]

What does this give us? Generally speaking, multiplying two square matrices $M_1 × M_2$ gives us at least these two properties:

  • Associativity - Associativity lets you reduce a computation using any partitioning, because all partitionings yield the same result. Addition is associative, $(1+2)+3 = 1+(2+3)$; subtraction is not, $(5-3)-2 \neq 5-(3-2)$. (Associative property)
  • Non-Commutativity - Commutativity allows you to swap elements without affecting the result. Addition is commutative, $1+2 = 2+1$, but division is not, $1 \div 2 \neq 2 \div 1$. Neither is matrix multiplication. (Commutative property)

This is an unusual combination of properties for an operation. It's at least not a combination encountered in introductory algebra:

              associative   commutative
$a+b$              ✓             ✓
$a*b$              ✓             ✓
$a-b$              ✗             ✗
$a/b$              ✗             ✗
$a^b$              ✗             ✗
$M×M$              ✓             ✗

Upon consideration, these are exactly the properties one would want in order to define the hash of a list of items. Non-commutativity makes the order of elements in the list well defined, since swapping different elements produces a different hash. Associativity means the reduction can be performed in any grouping, and you still get the same hash.

Let's sanity-check that these properties hold for the construction described above.

Associativity

If it's associative, we should get the same hash when we rearrange the parentheses to indicate a different reduction order. That is: $((e1 × e2) × e3) = (e1 × (e2 × e3))$

In [7]:
e1 = hash_m(b"Hello A")
e2 = hash_m(b"Hello B")
e3 = hash_m(b"Hello C")
In [8]:
x = np.matmul(np.matmul(e1, e2), e3)
y = np.matmul(e1, np.matmul(e2, e3))

# observe that they produce the same summary
assert_equal(x, y)

Expand the sections below to see a comparison

In [9]:
#collapse-hide
#collapse-output
print(x)
print()
print(y)
[[ 58  12 144 134 100 158 159  51]
 [ 73 206 202 190  87  79 223   2]
 [210 122 142 117  37 148 106  45]
 [175 146 187 223 235 171  64 226]
 [149  85 203  87  92 251 243 206]
 [ 18 252 160 103 125 251 181 133]
 [191 132 220 104 213 154  34 154]
 [127 197  95  87 166   3  22   3]]

[[ 58  12 144 134 100 158 159  51]
 [ 73 206 202 190  87  79 223   2]
 [210 122 142 117  37 148 106  45]
 [175 146 187 223 235 171  64 226]
 [149  85 203  87  92 251 243 206]
 [ 18 252 160 103 125 251 181 133]
 [191 132 220 104 213 154  34 154]
 [127 197  95  87 166   3  22   3]]

Non-Commutativity

If it's not commutative, then swapping different elements should produce a different hash. That is, $e1 × e2 \ne e2 × e1$:

In [10]:
x = np.matmul(e1, e2)
y = np.matmul(e2, e1)

# observe that they produce different summaries
assert_not_equal(x, y)

Expand the sections below to see a comparison

In [11]:
#collapse-hide
#collapse-output
print(x)
print()
print(y)
[[ 87  79 149 131 148 247 195  90]
 [249  84 195  58 142 133 211  15]
 [177  93  69 254 240 234  97  37]
 [ 46  84  76 253  55 200  43 236]
 [ 21  84  99 157  55 148 170   2]
 [168 123   6 250  64 144  54 242]
 [230  78 164  76  30  29 214  68]
 [ 47 183 156 239 157 177 192 184]]

[[149  18 239 238  84 188 191 109]
 [239 150 214 235  59 161   9 133]
 [ 89 174  59  14  70 113 124 243]
 [ 66 113 176 124 227 247  17  25]
 [247 138 152 181 177 143 184  97]
 [113 249 199 153 154  75  45 105]
 [121 201 225  42 249 213 180 244]
 [ 85  31  72  28 181 182 140 176]]

Other functions

In [12]:
# Create a list of 1024 elements and reduce them one by one
list1 = [hash_m(b"A") for _ in range(0, 1024)]
hash1 = reduce(mul_m, list1)

# Take a starting element and square/double it 10 times. With 1 starting element over 10 doublings = 1024 elements
hash2 = reduce((lambda m, _ : mul_m(m, m)), range(0, 10), hash_m(b"A"))

# Observe that these two methods of calculating the hash have the same result
assert_equal(hash1, hash2)

# let's call it double_m
def double_m(m, d=1):
    return reduce((lambda m, _ : mul_m(m, m)), range(0, d), m)

assert_equal(hash1, double_m(hash_m(b"A"), 10))

def identity_m():
    return np.identity(8, dtype=np.uint8)

# generalize double_m to repeat m any number of times, not just doublings, in O(log n) matmuls
def repeat_m(m, n):
    res = identity_m()
    while n > 0:
        # concatenate the current doubling iff the bit representing this doubling is set
        if n & 1:
            res = mul_m(res, m)
        n >>= 1
        m = mul_m(m, m) # double matrix m
    return res

# repeat_m can do the same as double_m
assert_equal(hash1, repeat_m(hash_m(b"A"), 1024))

# but it can also repeat any number of times
hash3 = reduce(mul_m, (hash_m(b"A") for _ in range(0, 3309)))
assert_equal(hash3, repeat_m(hash_m(b"A"), 3309))

# Even returns a sensible result when requesting 0 elements
assert_equal(identity_m(), repeat_m(hash_m(b"A"), 0))

# make helper for reducing an iterable of hashes
def reduce_m(am):
    return reduce(mul_m, am)
In [13]:
print(hash1)
print()
print(hash2)
[[ 68 252 159   3  14  52 199 199]
 [136 124   6  34  58 174 206  54]
 [  3 234   2  13 120 240   7 163]
 [102  47  66  61  87 234 246  72]
 [ 19 135  80 115  75 242 242   5]
 [244 165 250  28  76  43 188 254]
 [233  46 187  39 151 241 175 130]
 [132 138   6 215  20 132  89  33]]

[[ 68 252 159   3  14  52 199 199]
 [136 124   6  34  58 174 206  54]
 [  3 234   2  13 120 240   7 163]
 [102  47  66  61  87 234 246  72]
 [ 19 135  80 115  75 242 242   5]
 [244 165 250  28  76  43 188 254]
 [233  46 187  39 151 241 175 130]
 [132 138   6 215  20 132  89  33]]
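
As a quick sketch (reusing the hash_m, mul_m, reduce_m, and identity_m helpers defined above, with made-up element values), we can also check the concatenation consequences from the Definition section directly: the hash of the concatenation of two lists is the matrix product of their hashes, and concatenating a list of 0 elements changes nothing.

In [ ]:
# A quick check of the concatenation consequences from the Definition section,
# reusing hash_m, mul_m, reduce_m, and identity_m from above.
left = [b"A", b"Hello"]
right = [b"World", b"!"]
hash_left = reduce_m([hash_m(e) for e in left])
hash_right = reduce_m([hash_m(e) for e in right])
hash_concat = reduce_m([hash_m(e) for e in left + right])

# the hash of the concatenation of two lists is the matrix product of their hashes
assert_equal(hash_concat, mul_m(hash_left, hash_right))

# concatenating a list of 0 elements (hash = identity matrix) on either side changes nothing
assert_equal(hash_concat, mul_m(hash_concat, identity_m()))
assert_equal(hash_concat, mul_m(identity_m(), hash_concat))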

Fun with associativity

Does the hash of a list change even when swapping two elements in the middle of a very long list?

In [14]:
a = hash_m(b"A")
b = hash_m(b"B")

a499 = repeat_m(a, 499)
a500 = repeat_m(a, 500)

# this should work because they're all a's
assert_equal(reduce_m([a, a499]), a500)
assert_equal(reduce_m([a499, a]), a500)

# these are lists of 999 elements of a, with one b at position 500 (x) or 501 (y)
x = reduce_m([a499, b, a500])
y = reduce_m([a500, b, a499])

# shifting the b by one element changed the hash
assert_not_equal(x, y)

Flex that associativity - this statement is true and equivalent to the assertion below:

$(a × (a499 × b × a500) × (a500 × b × a499) × a) = (a500 × b × (a500 × a500) × b × a500)$

In [15]:
assert_equal(reduce_m([a, x, y, a]), reduce_m([a500, b, repeat_m(a500, 2), b, a500]))
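
The same associativity is what makes the summary robust to arbitrary partitioning, as promised at the top. A small sketch (again reusing the helpers above, with made-up element values): chunk a list however you like, summarize each chunk independently, and the reduction of the chunk summaries is the hash of the whole list.

In [ ]:
# Partition the same list two different ways, summarize each chunk independently,
# and reduce the chunk summaries: both give the hash of the whole list.
elements10 = [str(i).encode() for i in range(10)]
hashes10 = [hash_m(e) for e in elements10]

# dense chunking (pairs, like a merkle tree) vs sparse chunking (7 elements + 3 elements)
dense_chunks = [hashes10[i:i+2] for i in range(0, 10, 2)]
sparse_chunks = [hashes10[:7], hashes10[7:]]

dense_summary = reduce_m([reduce_m(c) for c in dense_chunks])
sparse_summary = reduce_m([reduce_m(c) for c in sparse_chunks])

assert_equal(dense_summary, sparse_summary)
assert_equal(dense_summary, reduce_m(hashes10))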

Unknowns

This appears to me to be a reasonable way to define the hash of a list. The mathematical definition of a list aligns very nicely with the properties offered by matrix multiplication. But is it appropriate to use for the same things a Merkle Tree would be used for? The big questions are related to the valuable properties of hash functions:

  • Given a merklist summary but not the elements, is it possible to produce a different list of elements that hash to the same summary? (~preimage resistance)
  • Given a merklist summary or sublist summaries of it, can you derive the hashes of elements or their order?
  • Is it possible to predictably alter the merklist summary by concatenating it with some other sublist of real elements?
  • Are there other desirable security properties that would be valuable for a list hash?
  • Is there a better choice of hash function as a primitive than sha512?
  • Is there a better choice of reduction function than simple matmul that still retains associativity and non-commutativity?
  • Is there a more appropriate size than an 8×8 matrix / 64 bytes to represent merklist summaries?

Matrices are well-studied objects, so perhaps this information is already known. If you know something about deriving preimages of products in the matrix ring $R_{256}^{8×8}$, I would be very interested to hear about it.

What's next?

If this construction has the appropriate security properties, it seems to be a better Merkle Tree in all respects. Any use of a Merkle Tree could be replaced with this, and it could enable use-cases where Merkle Trees aren't useful. Some examples of what I think might be possible:

  • Using a Merklist with a sublist summary tree structure enables creating an $O(1)$-sized 'Merklist Proof' that can verify the addition and subtraction of any number of elements at any single point in the list using only $O(\log N)$ time and $O(\log N)$ static space. As a bonus, the proof generator and verifier can have totally different tree structures and still communicate the proof successfully.
  • Using a Merklist summary tree you can create a consistent hash of any ordered key-value store (like a btree) that can be maintained incrementally inline with regular node updates, e.g. as part of a LSM-tree. This could facilitate verification and sync between database replicas.
  • The sublist summary tree structure can be as dense or sparse as desired. You could summarize down to pairs of elements akin to a merkle tree, but you could also summarize a compressed sublist of hundreds or even millions of elements with a single hash. Of course, calculating or verifying a proof of changes to the middle of that sublist would require rehashing the whole sublist, but this turns it from a fixed structure into a tuneable parameter.
  • If all possible elements had an easily calculable inverse, that would enable "subtracting" an element by inserting its inverse in front of it. That would essentially upgrade the multiplicative structure of summaries from a monoid to a group, and might have interesting implications.
    • For example you could define a cryptographically-secure rolling hash where advancing either end can be calculated in O(1) time and space (see the sketch after this list).
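
To make that last idea slightly more concrete, here is a hypothetical sketch of the rolling-hash direction, not part of the construction above. It assumes the departing element's hash matrix happens to be invertible mod 256 (which requires an odd determinant, true for only roughly 29% of random matrices), and it leans on sympy's Matrix.inv_mod purely to compute the modular inverse; the element values and the try_inverse_m helper are made up for illustration.

In [ ]:
# Hypothetical sketch of the rolling-hash idea (names and element values are made up).
# An 8x8 matrix is invertible mod 256 only when its determinant is odd, so not every
# element qualifies; sympy's inv_mod is used here just to compute the modular inverse.
import sympy

def try_inverse_m(m):
    try:
        inv = sympy.Matrix(m.tolist()).inv_mod(256)
    except Exception:  # not invertible mod 256 (even determinant)
        return None
    return np.array([[int(x) for x in row] for row in inv.tolist()], dtype=np.uint8)

# find a few elements whose hashes happen to be invertible
invertible = [e for e in (str(i).encode() for i in range(100))
              if try_inverse_m(hash_m(e)) is not None][:4]
w1, w2, w3, w4 = invertible

# hash of the 3-element window [w1, w2, w3]
window_hash = reduce_m([hash_m(w) for w in [w1, w2, w3]])

# advance the window: drop w1 from the front, append w4 at the back,
# each with a single constant-size matrix multiplication
advanced = mul_m(try_inverse_m(hash_m(w1)), mul_m(window_hash, hash_m(w4)))
assert_equal(advanced, reduce_m([hash_m(w) for w in [w2, w3, w4]]))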

To be continued...