An Inside Look at Clustering Methods: The Patoshi Pattern
Blockchain technology carries the promise of an open, permissionless, and censorship-resistant database. Yet after 15 years of development, accessing and interpreting blockchain data remains a challenge. At Elementus, we take a uniquely scientific approach. Using an array of proprietary algorithms and deductive techniques, we accurately link cryptocurrency accounts with real-world entities and thereby chart an ever-growing map of the blockchain ecosystem. This “blockchain cartography” ultimately allows our customers to make better, faster, and more informed decisions.
With the launch of our technical blog, our first post will be on a classic example of Bitcoin clustering. It is a phenomenon known as the Patoshi pattern, and can be used to identify Satoshi Nakamoto’s stash of more than 1 million Bitcoin.
The Patoshi Pattern
This is ultimately a story about the little-known extranonce. As a quick reminder, Bitcoin miners validate transactions by finding a nonce, which is a number that combines with those transactions to produce a sufficiently small hash value. It turns out that such a nonce might not exist; when this occurs, a secondary value called the extranonce is incremented and the miner restarts their search.
Ten years ago, security researchers discovered a surprising pattern in the distribution of extranonce values. The pattern is a result of several privacy flaws in the original Bitcoin client, and it can be used to infer the identities of early miners. See the plot below of the first 10,000 extranonces, and note that they lie along straight lines of varying slopes. This is the Patoshi pattern:
Details within the source code imply that each miner generates an approximately-linear sequence of extranonces, and therefore, each linear segment in the above figure corresponds to the same miner. (It’s improbable that two miners began at the exact same time.) The steeper segments are particularly notable, due to their density and the fact that each begins precisely when the previous ends. There is more to be said, but this consistency and coordination ultimately lead to the conclusion that all of the steep segments correspond to a single miner. By referencing against a few widely known attributions, this miner can be identified as Satoshi Nakamoto.
Although clustering these extranonces by hand (or rather, by eye) is fairly straightforward, automating the process is another matter. It’s tempting to begin by calculating a slope for the steeper lines, but this immediately leads to several problems. First is that the segments are not formed contiguously but are interwoven with those of other miners. So to calculate the slope, say by running a linear regression, we need to have already isolated the data points that we’re looking for! Even if the slope is known, there is still the issue of determining exactly where the desired lines are located. These troubles can be overcome by applying a simple geometric idea.
The Space of Lines
There are many ways to describe a line, but the most common is by an equation of the form y = mx + b. Here, m and b are parameters that determine the line. Therefore the space of all possible lines in the plane, i.e line space, can be identified with the parameter space of all values for m and b. It’s another 2-dimensional plane! What matters is how the two planes relate to one another.
For example, the line given by y = 2x + 3 corresponds to the point (2,3) in line space. By considering all of the lines through a chosen point in the xy-plane, a curve is formed in the line space. Therefore points in the xy-plane correspond, via the family of straight lines through them, to special curves in the line space.
[ For technical readers looking to recreate our results, you may realize that the “curve” described above should actually be a straight line, whereas ours appears sinusoidal. It turns out that (m,b) coordinates have some undesirable properties, which can be mitigated by instead using the alternate equation for a line x*cos(t)+ y*sin(t) = d. Here the parameters that determine the line are (t,d). In our line space plots, the horizontal axis is t and the vertical axis is d. ]
When this is repeated for multiple points in the xy-plane, we end up with multiple curves in the line space. Importantly, each pair of curves will intersect at exactly one point, which corresponds to the unique line passing through both of those points in the xy-plane. This relationship between collinearity in the xy-plane and intersections in linespace extends to any number of points.
The idea is then to transform the collection of extranonce data points into the line space, and extract the resulting intersections. The number of curves at each intersection indicates the number of original data points along the corresponding line, and so we expect a small number of high-density intersections. A density plot for a small sample of data is shown in figure 4. Note that there are 4 primary intersection regions, which reflect the four dense linear segments in the original data.
The lines of interest can now be directly identified with your clustering algorithm of choice. Selecting the corresponding extranonce values takes some additional work, but can also be accomplished using ordinary methods. This leads us to more than 1 million Bitcoin in mining rewards, unspent to this day.
This is just a glimpse of the methods behind Elementus’ data. As the blockchain ecosystem grows at a rapid pace, so too do our techniques for mapping it out. Millions of new accounts and transactions are processed every day and we strive to make this information accessible and transparent to our users. Imagine embarking on a journey, but only having scattered pieces of your map. At Elementus, we’re on a mission to help deliver the full picture and fulfill the promise of blockchain technology.
 The total space is a little more complicated, but with minor restrictions we can regard it as a plane