Second pass on Merklist

2021-07-07 19:54:56 +00:00
parent 2dcbc90919
commit ba78bd0834
5 changed files with 87 additions and 111 deletions
@@ -0,0 +1 @@
+.ipynb_checkpoints
@@ -5,8 +5,8 @@
   "id": "2bdec887-ee29-4bef-8978-88a81940f7bc",
   "metadata": {},
   "source": [
-    "# Merklist Tree\n",
-    "Using matrix multiplication's associativity and non-commutativity to construct a digest / summary of an ordered list of elements where mutations to the list can be computed, verified, and stored using $O(log(N))$ time and space. Due to the associativity property, arbitrarily divided adjacent sub-lists can be summarized independently and combined to find the summary of their concatenation."
+    "# Merklist\n",
+    "Matrix multiplication's associativity and non-commutativity provide a natural definition of a cryptographic hash / digest / summary of an ordered list of elements. Due to the non-commutativity property, lists that only differ in element order result in a different summary. Due to the associativity property, arbitrarily divided adjacent sub-lists can be summarized independently and combined to quickly find the summary of their concatenation. This definition provides exactly the properties needed to define a list, and does not impose any unnecessary structure that could cause two equivalent lists to produce different summaries. The name *Merklist* is intended to be reminiscent of other hash-based data structures like [Merkle Tree](https://en.wikipedia.org/wiki/Merkle_tree) and [Merklix Tree](https://www.deadalnix.me/2016/09/24/introducing-merklix-tree-as-an-unordered-merkle-tree-on-steroid/)."
   ]
  },
  {
@@ -14,7 +14,25 @@
   "id": "3f17d376-b03f-498b-a794-ea566e0b63f7",
   "metadata": {},
   "source": [
-    "## Construction"
+    "## Definition\n",
+    "\n",
+    "This definition of a hash of a list of elements is pretty simple:\n",
+    "\n",
+    "* A **list element** is an arbitrary buffer of bytes. Any length, any content. Just bytes.\n",
+    "* A **list**, then, is a sequence of such elements.\n",
+    "* The **hash of a list element** is the cryptographic hash of its bytes, formatted into a square matrix with byte elements. (More details later.)\n",
+    "* The **hash of a list** is reduction by matrix multiplication of the hashes of all the list elements in the same order as they appear in the list.\n",
+    "* The **hash of a list with 0 elements** is the identity matrix.\n",
+    "\n",
+    "This construction has a few notable consequences:\n",
+    "\n",
+    "* The hash of a list with only one item is just the hash of the item itself.\n",
+    "* You can calculate the hash of any list concatenated with itself by matrix multiplication of the hash with itself. This works for single elements as well as arbitrarily long lists.\n",
+    "* A list can have multiple copies of the same list item, and swapping them does not affect the list hash. Consider how swapping the first two elements in `[1, 1, 2]` doesn't change it.\n",
+    "* Concatenating two lists is accomplished by matrix multiplication of their hashes, in the correct order.\n",
+    "* Appending or prepending lists of 0 elements yields the same hash, as expected.\n",
+    "\n",
+    "Let's explore this definition in more detail with a simple implementation in python+numpy."
   ]
  },
  {
@@ -22,9 +40,6 @@
   "execution_count": 1,
   "id": "99b521d8-1c66-49d7-98e9-6fa1d8d7c18f",
   "metadata": {
-    "jupyter": {
-     "source_hidden": true
-    },
    "tags": []
   },
   "outputs": [],
@@ -47,20 +62,22 @@
   "id": "fc1306b8-5e89-460a-997c-c9464c16615d",
   "metadata": {},
   "source": [
-    "### hash_m/1\n",
-    "The function `hash_m/1` takes a buffer of bytes as its first argument, and returns the sha512 hash of the bytes formatted as an 8×8 2-d array of 8-bit unsigned integers with wrapping overflow. Based on a shallow wikipedia dive, someone familiar with linear algebra might say it's a [matrix ring](https://en.wikipedia.org/wiki/Matrix_ring), $R_{256}^{8×8}$. Not coincidentally, sha512 outputs 512 bits = 64 bytes = 8 * 8 array of bytes, how convenient. That might even be the primary reason why I chose sha512."
+    "### The hash of a list element - `hash_m/1`\n",
+    "The function `hash_m/1` takes a buffer of bytes as its first argument, and returns the sha512 hash of the bytes formatted as an 8×8 2-d array of 8-bit unsigned integers with wrapping overflow. **This is the hash of a list element consisting of those bytes.** Based on a shallow Wikipedia dive, someone familiar with linear algebra might say it's a [matrix ring](https://en.wikipedia.org/wiki/Matrix_ring), $R_{256}^{8×8}$. Not coincidentally, sha512 outputs 512 bits = 64 bytes = an 8×8 array of bytes. How convenient. (In fact, that might even be the primary reason why I chose sha512!)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "3ccc7fdc-fa6a-48e3-accb-3c1070b4559c",
-   "metadata": {},
+   "metadata": {
+    "tags": []
+   },
   "outputs": [],
   "source": [
-    "def hash_m(s):\n",
-    "    hash_bytes = list(hashlib.sha512(s).digest())[:64]\n",
-    "    return np.array(hash_bytes, dtype=np.uint8).reshape((8,8))"
+    "def hash_m(e):\n",
+    "    hash_bytes = list(hashlib.sha512(e).digest())[:64]          # hash the bytes e, convert the digest into a list of 64 bytes\n",
+    "    return np.array(hash_bytes, dtype=np.uint8).reshape((8,8))  # convert the digest bytes into a numpy array with the appropriate data type and shape"
   ]
  },
  {
@@ -73,7 +90,7 @@
    "tags": []
   },
   "source": [
-    "8×8 seems big compared to 3×3 or 4×4 matrixes. The values are as random as you might expect a cryptographic hash to be:"
+    "8×8 seems big compared to 3×3 or 4×4 matrixes. The values are as random as you might expect a cryptographic hash to be, and range from 0 to 255:"
   ]
  },
  {
@@ -81,9 +98,6 @@
   "execution_count": 3,
   "id": "65aa7c7a-25d5-4971-8780-661f367e45ab",
   "metadata": {
-    "jupyter": {
-     "source_hidden": true
-    },
    "slideshow": {
     "slide_type": "skip"
    },
@@ -125,10 +139,8 @@
   "id": "c0c37110-b38d-4420-adf9-11ff5c5cd590",
   "metadata": {},
   "source": [
-    "## List\n",
-    "Ok so we've formatted hashes of bytes into matrixes, but we haven't actually done anything with them yet.\n",
-    "\n",
-    "Consider a list of arbitrarily many arbitrary byte buffers. **Define the 'hash' of the list to be reduction by matrix multiplication of the hash of each byte buffer.**"
+    "### The hash of a list - `mul_m/2`\n",
+    "OK, so we've got our element hashes; how do we combine them to construct the hash of a list? We defined the hash of a list to be the reduction by matrix multiplication of the hashes of its elements, in list order:"
   ]
  },
  {
@@ -138,8 +150,8 @@
   "metadata": {},
   "outputs": [],
   "source": [
-    "def mul_m(hm1, hm2):\n",
-    "    return np.matmul(hm1, hm2, dtype=np.uint8)"
+    "def mul_m(he1, he2):\n",
+    "    return np.matmul(he1, he2, dtype=np.uint8) # just, like, multiply them"
   ]
  },
  {
@@ -158,13 +170,13 @@
   "outputs": [],
   "source": [
    "# list1 contains 3 elements\n",
-    "list1 = [b\"A\", b\"Hello\", b\"World\"]\n",
+    "elements = [b\"A\", b\"Hello\", b\"World\"]\n",
    "# first hash each element\n",
-    "hashes1 = [hash_m(e) for e in list1]\n",
+    "element_hashes = [hash_m(e) for e in elements]\n",
    "# get the hash of the list by reducing the hashes by matrix multiplication\n",
-    "hash1 = mul_m(mul_m(hashes1[0], hashes1[1]), hashes1[2])\n",
+    "list_hash1 = mul_m(mul_m(element_hashes[0], element_hashes[1]), element_hashes[2])\n",
    "# an alternative way to write the reduction\n",
-    "hash2 = reduce(mul_m, hashes1)"
+    "list_hash2 = reduce(mul_m, element_hashes)"
   ]
  },
  {
@@ -172,9 +184,6 @@
   "execution_count": 6,
   "id": "694b4727-621e-4c1b-a2af-99296a8e664a",
   "metadata": {
-    "jupyter": {
-     "source_hidden": true
-    },
    "tags": []
   },
   "outputs": [
@@ -182,10 +191,10 @@
     "name": "stdout",
     "output_type": "stream",
     "text": [
-      "List of byte buffers:\n",
+      "List of elements:\n",
      "[b'A', b'Hello', b'World']\n",
      "\n",
-      "Hashes of byte buffers:\n",
+      "Hash of each element:\n",
      "[array([[ 33, 180, 244, 189, 158, 100, 237,  53],\n",
      "       [ 92,  62, 182, 118, 162, 142, 190, 218],\n",
      "       [246, 216, 241, 123, 220,  54,  89, 149],\n",
@@ -222,13 +231,13 @@
    }
   ],
   "source": [
-    "print(\"List of byte buffers:\")\n",
-    "print(list1)\n",
-    "print(\"\\nHashes of byte buffers:\")\n",
-    "print(hashes1)\n",
+    "print(\"List of elements:\")\n",
+    "print(elements)\n",
+    "print(\"\\nHash of each element:\")\n",
+    "print(element_hashes)\n",
    "print(\"\\nHash of full list:\")\n",
-    "print(hash1)\n",
-    "assert_equal(hash1, hash2)"
+    "print(list_hash1)\n",
+    "assert_equal(list_hash1, list_hash2)"
   ]
  },
  {
@@ -236,12 +245,12 @@
   "id": "de064a80-208d-4850-b95e-c5a707f7f3b3",
   "metadata": {},
   "source": [
-    "What does this give us? Generally speaking, multiplying two matrixes $M_1×M_2$ gives us at least these two properties:\n",
+    "What does this give us? Generally speaking, multiplying two square matrixes $M_1×M_2$ gives us at least these two properties:\n",
    "\n",
    "* [Associativity](#Associativity) - Associativity enables you to reduce a computation using any partitioning because all partitionings yield the same result. Addition is associative $(1+2)+3 = 1+(2+3)$, subtraction is not $(5-3)-2\\neq5-(3-2)$. ([Associative property](https://en.wikipedia.org/wiki/Associative_property))\n",
    "* [Non-Commutativity](#Non-Commutativity) - Commutativity allows you to swap elements without affecting the result. Addition is commutative $1+2 = 2+1$, but division is not $1\\div2 \\neq2\\div1$. And neither is matrix multiplication. ([Commutative property](https://en.wikipedia.org/wiki/Commutative_property))\n",
    "\n",
-    "This is an unusual combination of properties, at least not a combination encountered under normal algebra operations:\n",
+    "This is an unusual combination of properties for an operation, or at least not one found among the usual algebraic operations:\n",
    "\n",
    "|     | associative | commutative |\n",
    "| --- | ---         | ---         |\n",
@@ -252,7 +261,9 @@
    "| exp | ❌ | ❌ |\n",
    "| M×M | ✅ | ❌ |\n",
    "\n",
-    "Upon consideration, these are the exact properties that one would want in order to define the hash of a list of items. Non-commutativity enables the order of elements in the list to be defined, since swapping them produces a different hash. Associativity enables caching the summary of an arbitrary sublist; doing this heirarchally on a huge list enables an algorithm to calculate the hash of any sublist at the cost of `O(log(N))` time and space."
+    "Upon consideration, these are the exact properties that one would want in order to define the hash of a list of items. Non-commutativity enables the order of elements in the list to be well-defined, since swapping different elements produces a different hash. Associativity enables caching the summary of an arbitrary sublist; I expect that doing this hierarchically on a huge list enables an algorithm to calculate the hash of any sublist at the cost of `O(log(N))` time and space.\n",
+    "\n",
+    "Let's sanity-check that these properties can hold for the construction described above."
   ]
  },
  {
@@ -301,9 +312,6 @@
   "execution_count": 9,
   "id": "b7a1906d-524c-4339-920a-978a0385d6cc",
   "metadata": {
-    "jupyter": {
-     "source_hidden": true
-    },
    "tags": []
   },
   "outputs": [
@@ -369,9 +377,6 @@
   "execution_count": 11,
   "id": "7f833e44-79d8-4c98-af41-0c915bee66ed",
   "metadata": {
-    "jupyter": {
-     "source_hidden": true
-    },
    "tags": []
   },
   "outputs": [
@@ -471,9 +476,6 @@
   "execution_count": 13,
   "id": "84738470-61c9-44b5-b6b7-9971a02547bd",
   "metadata": {
-    "jupyter": {
-     "source_hidden": true
-    },
    "tags": []
   },
   "outputs": [
@@ -520,10 +522,12 @@
  },
  {
   "cell_type": "markdown",
-   "id": "b2a7b1fd-6d75-4790-ad46-b97767c4f98c",
+   "id": "f66e8f69-260c-40ca-bf26-306a85582ad6",
   "metadata": {},
   "source": [
-    "# Exploring"
+    "# Fun with associativity\n",
+    "\n",
+    "Does the hash of a list change even when swapping two elements in the middle of a very long list?"
   ]
  },
  {
@@ -571,12 +575,23 @@
  },
  {
   "cell_type": "markdown",
-   "id": "da076abe-bd14-4d5d-98be-2c8980e538e5",
+   "id": "6cc30cb3-8079-4f8a-9b7c-a3b4f7e384a3",
   "metadata": {},
   "source": [
-    "## \"Merklist\"?\n",
+    "# Unknowns\n",
    "\n",
-    "Merklist ~ [Merklix](https://www.deadalnix.me/2016/09/24/introducing-merklix-tree-as-an-unordered-merkle-tree-on-steroid/) ~ [Merkle](https://en.wikipedia.org/wiki/Merkle_tree)"
+    "This appears to me to be a reasonable way to define the hash of a list. The mathematical definition of a list aligns very nicely with the properties offered by matrix multiplication. But is it appropriate to use for the same things that a Merkle Tree would be? The big questions are related to the valuable properties of hash functions:\n",
+    "\n",
+    "* Given a merklist summary or sublist summaries of it, can you derive the hashes of elements or their order? (Elements themselves are protected by the preimage resistance of the underlying hash function.)\n",
+    "    * If yes, when is that a problem?\n",
+    "* Given a merklist summary but not the elements, is it possible to produce a different list of elements that hash to the same summary? (~preimage resistance)\n",
+    "* Is it possible to predictably alter the merklist summary by concatenating it with some other sublist of real elements?\n",
+    "* Are there other desirable security properties that would be valuable for a list hash?\n",
+    "* Is there a better choice of hash function as a primitive than sha512?\n",
+    "* Is there a better choice of reduction function that still retains associativity+non-commutativity than simple matmul?\n",
+    "* Is there a more appropriate size than an 8x8 matrix / 64 bytes to represent merklist summaries?\n",
+    "\n",
+    "Matrixes are well-studied objects, so perhaps such information is already known. If *you* know something about deriving the preimage of a product of elements of a [matrix ring](https://en.wikipedia.org/wiki/Matrix_ring), $R_{256}^{8×8}$, I would be very interested to know."
   ]
  },
  {
@@ -584,24 +599,18 @@
   "id": "4c4d4a83-8e2e-46d7-b2e3-2d59ba9c9e8c",
   "metadata": {},
   "source": [
-    "# \"Tree\"?\n",
-    "Ok great, so now we can construct a hash for a list that always produces the same hash for the same list, independent of which pairs in the list are reduced first. I think this enables verifyably adding or removing any number of elements at any point in the list with only $O(log(N))$ additional time and space, but what does that look like specifically? Or alternatively, where does the \"Tree\" part of \"Merlist Tree\" come in?\n",
+    "# What's next?\n",
+    "\n",
+    "***If** this construction has the appropriate security properties*, it seems to be a better merkle tree in all respects. Any use of a merkle tree could be replaced with this, and it could enable use-cases where merkle trees aren't useful. Some examples of what I think might be possible:\n",
+    "\n",
+    "* Using a Merklist with a sublist summary tree structure enables creating a $O(1)$-sized 'Merklist Proof' that can verify the addition and subtraction of any number of elements at any single point in the list using only $O(log(N))$ time and $O(log(N))$ static space. As a bonus the proof generator and verifier can have totally different tree structures and can still communicate the proof successfully.\n",
+    "* Using a Merklist summary tree you can create a consistent hash of any ordered key-value store (like a btree) that can be maintained incrementally inline with regular node updates, e.g. as part of an [LSM-tree](https://en.wikipedia.org/wiki/Log-structured_merge-tree). This could facilitate verification and sync between database replicas.\n",
+    "* The sublist summary tree structure can be as dense or sparse as desired. You could summarize down to pairs of elements akin to a merkle tree, but you could also summarize a compressed sublist of hundreds or even millions of elements with a single hash. Of course, calculating or verifying a proof of changes to the middle of that sublist would require rehashing the whole sublist, but this turns it from a fixed structure into a tuneable parameter.\n",
+    "* If all possible elements had an easily calculable inverse, that would enable \"subtracting\" an element by inserting its inverse next to it. That would restrict element hashes to invertible matrixes, which form a group under multiplication (not a field, since multiplication stays non-commutative), and might have interesting implications.\n",
+    "    * For example you could define a cryptographically-secure rolling hash where advancing either end can be calculated in `O(1)` time.\n",
    "\n",
    "To be continued..."
   ]
-  },
-  {
-   "cell_type": "markdown",
-   "id": "6cc30cb3-8079-4f8a-9b7c-a3b4f7e384a3",
-   "metadata": {},
-   "source": [
-    "# Conclusion / Security\n",
-    "This appears to me to be a reasonable way to define the hash of a list. Definitionally, it needs to preserve the order of elements; this is provided by the non-commutativity property. For efficiency, it would be nice if any two sublists can have a known equality; this is provided by the associativity property. It's not clear to me under what circumstances any information about what is contained in the list could be derived from the hash.\n",
-    "\n",
-    "Obviously, being \"not clear to me how\" is not a proof of impossibility. Matrixes are well-studied objects, perhaps such information is already known. If *you* know something about deriving the preimage of a [matrix ring](https://en.wikipedia.org/wiki/Matrix_ring), $R_{256}^{8×8}$, I would be very interested to know.\n",
-    "\n",
-    "*If* this construction has the appropriate security properties, it seems to be a better merkle tree in all respects. Any use of a merkle tree could be replaced with this, and it would enable many more use-cases where merkle trees are not applicable. For example, this would allow you to track the hash of a btree-like structure over time with no additional cost (asymptotically). Of course, these ideas are putting the cart before the horse; we need to know more about its properties first."
-   ]
  }
 ],
 "metadata": {
@@ -1,12 +1,15 @@
 {
 "cells": [
  {
-   "cell_type": "code",
-   "execution_count": null,
-   "id": "4d4564a4-8af4-4cd2-a085-ec4484c83dbf",
+   "cell_type": "markdown",
+   "id": "d6b5f16e-76a4-473f-a8cd-efd532f8673f",
   "metadata": {},
-   "outputs": [],
-   "source": []
+   "source": [
+    "# Merklist Tree\n",
+    "\n",
+    "Part 2 of the Merklist idea. Constructing a tree structure of summarized sublists, so that mutations to the list can be computed, verified, and stored using $O(log(N))$ time and space.\n",
+    "\n"
+   ]
  }
 ],
 "metadata": {
@@ -1,37 +0,0 @@
-{
- "cells": [
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "\"hello\""
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "12+4"
-   ]
-  }
- ],
- "metadata": {
-  "kernelspec": {
-   "display_name": "Julia 1.5.3",
-   "language": "julia",
-   "name": "julia-1.5"
-  },
-  "language_info": {
-   "file_extension": ".jl",
-   "mimetype": "application/julia",
-   "name": "julia",
-   "version": "1.5.3"
-  }
- },
- "nbformat": 4,
- "nbformat_minor": 4
-}