A new hash algorithm for Git

Posted Feb 3, 2020 18:37 UTC (Mon) by Cyberax (✭ supporter ✭, #52523)
Parent article: A new hash algorithm for Git

> There is, of course, a way to unambiguously give a hash value in the new Git code, and they can even be mixed on the command line; this example comes from the transition document
One trick that worked for me in a similar case was to switch the encoding. SHA-1 is encoded as hex digits; we can simply switch SHA-256 to be encoded with the letters "g" to "v", so the two will be immediately distinguishable.
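
A minimal sketch of that idea in Python, assuming the mapping 0 -> "g" through f -> "v" (the comment does not spell out the exact mapping, and the function names here are made up for illustration):

```python
import hashlib

HEX_DIGITS = "0123456789abcdef"
SHIFTED_DIGITS = "ghijklmnopqrstuv"  # assumed mapping: 0 -> g, ..., f -> v

# Translation table from ordinary hex digits to the shifted alphabet.
TO_SHIFTED = str.maketrans(HEX_DIGITS, SHIFTED_DIGITS)


def sha256_shifted(data: bytes) -> str:
    """Return the SHA-256 digest of data, written in the shifted alphabet."""
    return hashlib.sha256(data).hexdigest().translate(TO_SHIFTED)


def looks_like_shifted(name: str) -> bool:
    """True if every character of name belongs to the shifted alphabet."""
    return len(name) > 0 and all(c in SHIFTED_DIGITS for c in name)


print(sha256_shifted(b"hello"))
```

Because the two alphabets share no characters, even a short abbreviation would show which encoding it belongs to.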


A new hash algorithm for Git

Posted Feb 3, 2020 18:41 UTC (Mon) by juliank (guest, #45896) [Link] (14 responses)

This sounds like a totally reasonable thing to do.

A new hash algorithm for Git

Posted Feb 3, 2020 23:48 UTC (Mon) by dsommers (subscriber, #55274) [Link] (13 responses)

No, not really. What josh suggests, prefixing the string, makes more sense.

* performance: Replacing characters in strings is more CPU intensive than just skipping a single byte and continuing to use standard functions/libraries. This becomes more evident when considering large repositories like the Linux kernel.

* future compatibility: Shifting the a-f chars to another set of 6 letters will only work 3 more times if only lowercase letters are considered: 6 letters (a-f) * 4 shifts = 24. So at the 5th change, something new must be done to avoid breaking compatibility. Of course the counter argument is "how often will such new algorithms occur in reality?"; but none of us really knows that for sure, just as we don't know how long a git repository will live and be accessed.

From this article (I've not paid attention to discussions in the git community), it seems like they account for the possibility of changing it again later on. So allowing a prefix or suffix of just one letter makes it possible to change algorithms 26 times, with no performance loss (except the "skip one byte" operation when evaluating the hash). If that is too little, 3 letters give the possibility for 17576 changes, which is probably enough for most of us alive today; using 4 letters increases that once again to an even more insane number.

But say you then settle for a 4-letter prefix (456,976 possibilities) ... then you're not that far away from {sha256}, which is 8 characters, with a basically unlimited number of algorithm changes. What is inside the {} can be any length while containing a good description of what kind of algorithm is in use, without needing to look up that "AAAC" means SHA512.
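
The counts above follow from 26 lowercase letters per prefix position. A quick check in Python (the function name is hypothetical, and only lowercase ASCII letters are assumed):

```python
import string

ALPHABET_SIZE = len(string.ascii_lowercase)  # 26 lowercase letters


def prefix_capacity(length: int) -> int:
    """Number of distinct prefixes made of `length` lowercase letters."""
    return ALPHABET_SIZE ** length


for length in (1, 3, 4):
    print(length, prefix_capacity(length))
```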

A new hash algorithm for Git

Posted Feb 4, 2020 0:10 UTC (Tue) by Cyberax (✭ supporter ✭, #52523) [Link]

A prefix would also work, but let's limit it to 1 letter. This would realistically give more than enough coding space to last until git is no longer useful. It can also be extended to two characters later if needed.

A new hash algorithm for Git

Posted Feb 4, 2020 5:56 UTC (Tue) by eru (subscriber, #2753) [Link] (4 responses)

One prefix letter would allow signifying only 20 possible hash algorithms, because you should avoid [a-f], which can start a valid hash value in the current scheme.
But that would probably be enough; it is unlikely that the hash changes more than once in a decade...
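
A small sketch of that constraint, assuming lowercase prefix letters and unprefixed legacy SHA-1 names (`split_prefix` is a hypothetical helper, not anything Git provides):

```python
import string

HEX_LETTERS = set("abcdef")

# Letters that can never begin a plain hex object name.
USABLE_PREFIXES = [c for c in string.ascii_lowercase if c not in HEX_LETTERS]


def split_prefix(name: str):
    """Split an object name into (prefix letter or None, remaining digits)."""
    if name and name[0] in USABLE_PREFIXES:
        return name[0], name[1:]
    return None, name


print(len(USABLE_PREFIXES))
print(split_prefix("habc123"))
print(split_prefix("abc123"))
```

A name starting with a digit or with a-f is treated as an old-style name, so existing hashes would keep working unchanged.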

A new hash algorithm for Git

Posted Feb 4, 2020 6:06 UTC (Tue) by Cyberax (✭ supporter ✭, #52523) [Link]

You can then use the UTF-8-like encoding, reserving 1 bit for "next byte continues the encoding ID" flag. So you can extend it indefinitely.
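
One way to read this is as a variable-length integer with a continuation bit. The sketch below assumes 7 payload bits per byte, least significant group first, with the high bit meaning "another byte follows"; the exact layout is an assumption, as the comment does not specify one.

```python
from typing import Tuple


def encode_id(value: int) -> bytes:
    """Encode a non-negative algorithm ID, 7 bits per byte, low bits first."""
    if value < 0:
        raise ValueError("algorithm ID must be non-negative")
    out = bytearray()
    while True:
        low7 = value & 0x7F
        value >>= 7
        if value:
            out.append(low7 | 0x80)  # high bit set: more bytes follow
        else:
            out.append(low7)         # high bit clear: last byte
            return bytes(out)


def decode_id(buf: bytes) -> Tuple[int, int]:
    """Decode an ID from the start of buf; return (value, bytes consumed)."""
    value = 0
    shift = 0
    for index, byte in enumerate(buf):
        value |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return value, index + 1
        shift += 7
    raise ValueError("truncated algorithm ID")


print(encode_id(5).hex(), encode_id(300).hex())
```

IDs below 128 still take a single byte, so the common case costs nothing extra.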

A new hash algorithm for Git

Posted Feb 4, 2020 16:07 UTC (Tue) by mathstuf (subscriber, #69389) [Link] (2 responses)

There are also uppercase letters if we really get desperate :) .

A new hash algorithm for Git

Posted Feb 5, 2020 2:56 UTC (Wed) by NYKevin (subscriber, #129325) [Link] (1 responses)

So, SHA-1 gets hex, and SHA-2 gets Base64? That'd be entertaining to watch.

A new hash algorithm for Git

Posted Feb 5, 2020 3:32 UTC (Wed) by Cyberax (✭ supporter ✭, #52523) [Link]

It's still base16, just with shifted letters.

> That'd be entertaining to watch.
Why?

A new hash algorithm for Git

Posted Feb 4, 2020 19:40 UTC (Tue) by quotemstr (subscriber, #45331) [Link]

> Doing char replacing in strings is more CPU intensive than just skipping one single byte and continue using standard functions/libraries

Hash functions don't operate on the hex encoding of the hash digest. If you need to parse base-16 to binary anyway, there's no penalty arising from choosing an alternate set of characters to represent that base-16 value.
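
To illustrate the point, here is a sketch of a table-driven base-16 decoder in which the alphabet is just a lookup table. The "g" to "v" alphabet and all names below are assumptions for the example, not Git code.

```python
HEX_ALPHABET = "0123456789abcdef"
ALT_ALPHABET = "ghijklmnopqrstuv"  # assumed alternate digit set


def make_table(alphabet: str) -> dict:
    """Map each digit character to its 4-bit value."""
    return {ch: value for value, ch in enumerate(alphabet)}


def decode_base16(text: str, table: dict) -> bytes:
    """Decode base-16 text into bytes using the given digit table."""
    if len(text) % 2:
        raise ValueError("odd number of digits")
    out = bytearray()
    for i in range(0, len(text), 2):
        out.append((table[text[i]] << 4) | table[text[i + 1]])
    return bytes(out)


HEX_TABLE = make_table(HEX_ALPHABET)
ALT_TABLE = make_table(ALT_ALPHABET)

print(decode_base16("ff00", HEX_TABLE) == decode_base16("vvgg", ALT_TABLE))
```

The decoding loop is identical for both alphabets; only the table contents differ.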

A new hash algorithm for Git

Posted Feb 5, 2020 15:00 UTC (Wed) by obi (guest, #5784) [Link] (5 responses)

The protocol labs people seem to use a function + length prefix, see: https://multiformats.io/multihash/
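
A simplified sketch of that layout: one byte for the function code, one byte for the digest length, then the digest. The codes 0x11 (sha1) and 0x12 (sha2-256) come from the multihash table; real multihash uses unsigned varints for both fields, and this sketch only handles values below 128, where the varint is a single byte.

```python
import hashlib

SHA1_CODE = 0x11      # multihash code for sha1
SHA2_256_CODE = 0x12  # multihash code for sha2-256


def multihash(code: int, digest: bytes) -> bytes:
    """Prefix digest with its function code and length (single-byte only)."""
    if code >= 0x80 or len(digest) >= 0x80:
        raise ValueError("this sketch only handles single-byte varints")
    return bytes([code, len(digest)]) + digest


def parse_multihash(buf: bytes):
    """Split a single-byte-varint multihash into (code, digest)."""
    if len(buf) < 2:
        raise ValueError("too short")
    code, length = buf[0], buf[1]
    digest = buf[2:]
    if len(digest) != length:
        raise ValueError("digest length does not match length field")
    return code, digest


mh = multihash(SHA2_256_CODE, hashlib.sha256(b"hello").digest())
print(mh.hex())
```

In hex, every SHA-256 value in this form begins with the same four characters, "1220".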

A new hash algorithm for Git

Posted Feb 5, 2020 17:26 UTC (Wed) by juliank (guest, #45896) [Link]

This, this is what people should be using, yes.

A new hash algorithm for Git

Posted Feb 6, 2020 0:41 UTC (Thu) by pj (subscriber, #4506) [Link] (2 responses)

I was going to suggest just this thing. Also, is there any reason a repo has to allow/use only one hash function? The hashes are metadata, and sure it'll be some CPU to recalc, but it would seem totally possible to have parallel metadata objects without touching the data objects (the files, commit messages, etc) at all.

A new hash algorithm for Git

Posted Feb 7, 2020 22:14 UTC (Fri) by Jandar (subscriber, #85683) [Link] (1 responses)

The hashes are not metadata, they are the identifier of the data.

A new hash algorithm for Git

Posted Feb 7, 2020 23:17 UTC (Fri) by Jandar (subscriber, #85683) [Link]

On reading this myself again, I think the better wording would be:

The hash is like an inode-number for a file.

A new hash algorithm for Git

Posted Feb 6, 2020 9:50 UTC (Thu) by ivyl (subscriber, #88764) [Link]

This looks neat, but it makes using the shortened hashes much more challenging, as each of them will start with a fixed prefix.
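
A rough illustration of the cost, assuming hex output, a 7-character abbreviation (a length picked for this example), and the 4-character "1220" prefix that a SHA-256 multihash has in hex:

```python
HEX_BASE = 16
MULTIHASH_SHA256_PREFIX = "1220"  # function code 0x12, length 0x20, in hex


def distinct_abbreviations(abbrev_len: int, fixed_prefix_len: int) -> int:
    """Count the distinct abbreviations once the fixed prefix is discounted."""
    free_chars = max(0, abbrev_len - fixed_prefix_len)
    return HEX_BASE ** free_chars


print(distinct_abbreviations(7, 0))
print(distinct_abbreviations(7, len(MULTIHASH_SHA256_PREFIX)))
```

With the fixed prefix, only three of the seven characters vary, so abbreviations would have to be four characters longer to be equally selective.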

A new hash algorithm for Git

Posted Feb 3, 2020 21:44 UTC (Mon) by josh (subscriber, #17465) [Link] (2 responses)

Or just add a single special character at the beginning, like a capital H. (Using a letter will make sure that people's "select by word" mechanisms pick it up.)

A new hash algorithm for Git

Posted Feb 4, 2020 17:13 UTC (Tue) by excors (subscriber, #95769) [Link] (1 responses)

Gerrit already uses a SHA-1 prefixed with "I" for its Change-Id (a persistent identifier of a patch). Are there any other popular Git-related tools that use a similar pattern? If Git started adding its own prefix letters, it would be nice to avoid ambiguity with them.

A new hash algorithm for Git

Posted Feb 5, 2020 11:55 UTC (Wed) by mgedmin (subscriber, #34497) [Link]

Not sure if this counts, but 'git describe' prefixes the commit hash with a 'g', in the third part of its '<tag>-<number>-g<sha1>' output.
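
For reference, a hedged sketch of pulling that form apart with a regular expression. It assumes the plain '<tag>-<number>-g<sha1>' shape with a lowercase hex abbreviation and ignores variants such as a trailing "-dirty"; the helper name is made up.

```python
import re

# <tag>-<number>-g<abbreviated hash>; the tag itself may contain hyphens.
DESCRIBE_RE = re.compile(r"^(?P<tag>.+)-(?P<count>\d+)-g(?P<abbrev>[0-9a-f]+)$")


def parse_describe(text: str):
    """Return (tag, commits since tag, abbreviated hash), or None."""
    match = DESCRIBE_RE.match(text)
    if match is None:
        return None
    return match.group("tag"), int(match.group("count")), match.group("abbrev")


print(parse_describe("v5.5-123-g1a2b3c4"))
```

Any new prefix letter for hashes would need to stay distinguishable from this existing "g".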

A new hash algorithm for Git

Posted Feb 4, 2020 14:58 UTC (Tue) by ballombe (subscriber, #9523) [Link] (1 responses)

Does not work in general:
12345678 is perfectly valid in both notations.

A new hash algorithm for Git

Posted Feb 4, 2020 15:51 UTC (Tue) by willy (subscriber, #9762) [Link]

He didn't say "use the digits 0123456789ghijkl". He said "use the digits ghijklmnopqrstuv".

A new hash algorithm for Git

Posted Feb 4, 2020 23:09 UTC (Tue) by flussence (guest, #85566) [Link]

A modest proposal: prefix the SHA-256 with "$5$rounds=1$".

