
A new hash algorithm for Git

By Jonathan Corbet
February 3, 2020
The Git source-code management system is famously built on the SHA‑1 hashing algorithm, which has become an increasingly weak foundation over the years. SHA‑1 is now considered to be broken and, despite the fact that it does not yet seem to be so broken that it could be used to compromise Git repositories, users are increasingly worried about its security. The good news is that work on moving Git past SHA‑1 has been underway for some time, and is slowly coming to fruition; there is a version of the code that can be looked at now.

How Git works, simplified

To understand why SHA‑1 matters to Git, it helps to have an idea of how the underlying Git database works. What follows is an oversimplified view of how Git manages objects that can be skipped by readers who are already familiar with this material.

Git is often described as being built on a content-addressable filesystem — one where you can look up an object if you know that object's contents. That may not seem particularly useful, but there's more than one way to "know" those contents. In particular, you can substitute a cryptographic hash for the contents themselves; that hash is rather easier to work with and has some other useful properties.

Git stores a number of object types, using SHA‑1 hashes to identify them. So, for example, the SHA‑1 hash of drivers/block/floppy.c in a 5.6-merge-window kernel, as calculated by Git, is 485865fd0412e40d041e861506bb3ac11a3a91e3. Conceptually, at least, Git will store that version of floppy.c in a file, using that hash as its name; early versions of Git actually did that. If somebody makes a change to floppy.c, even just removing an extra space from the end of a line, the result will have a completely different SHA‑1 hash and will be stored under a different name.
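
The hash is not taken over the raw file contents alone; Git prepends a small header giving the object's type and size. A minimal sketch of the blob calculation, equivalent to what `git hash-object` computes:

```python
import hashlib

def git_blob_sha1(data: bytes) -> str:
    # Git hashes "blob <size>\0" followed by the contents, not the
    # bare contents themselves.
    header = b"blob %d\x00" % len(data)
    return hashlib.sha1(header + data).hexdigest()

print(git_blob_sha1(b"hello\n"))
# → ce013625030ba8dba906f756967f9e9ca394464a, the same name that
#   "echo hello | git hash-object --stdin" reports

# Even a one-character change yields a completely unrelated name:
print(git_blob_sha1(b"hello \n"))
```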

A Git repository is thus full of objects (often called "blobs") with SHA‑1 names; since a new one is created for each revision of a file, they tend to proliferate. Your editor's kernel repository currently contains 8,647,655 objects. But blobs are not the only types of objects stored in a Git repository.

An individual file object holds a particular set of contents, but it has no information about where that file appears in the repository hierarchy. If floppy.c is moved to drivers/staging someday, its hash will remain the same, so its representation in the Git object database will not change. Keeping track of how files are organized into a directory hierarchy is the job of a "tree" object. Any given tree object can be thought of as a collection of blobs (each identified by its SHA‑1 hash, of course) associated with their location in the directory tree. As one might expect, a tree object has an SHA‑1 hash of its own that is used to store it in the repository.
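
A tree object can be sketched the same way: each entry records a mode, a name, and the raw 20-byte hash of the blob (or subtree) it points to. This simplified version glosses over Git's exact entry-ordering rule for subdirectories:

```python
import hashlib

def git_tree_sha1(entries) -> str:
    # entries: (mode, name, hex hash) tuples, e.g.
    # ("100644", "floppy.c", "485865fd...").  Each entry is encoded
    # as "<mode> <name>\0" plus the 20 raw hash bytes; the body then
    # gets a "tree <size>\0" header, just as blobs do.
    body = b""
    for mode, name, hex_hash in sorted(entries, key=lambda e: e[1]):
        body += mode.encode() + b" " + name.encode() + b"\x00"
        body += bytes.fromhex(hex_hash)
    return hashlib.sha1(b"tree %d\x00" % len(body) + body).hexdigest()
```

Since the tree body contains the hashes of its entries, changing any file changes every tree above it, all the way to the top of the repository.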

Finally, a "commit" object records the state of the repository at a particular point in time. A commit contains some metadata (committer, date, etc.) along with the SHA‑1 hash of a tree object reflecting the current state of the repository. With that information, Git can check out the repository at a given commit, reproducing the state of the files in the repository at that point. Importantly, a commit also contains the hash of the previous commit (or multiple commits in the case of a merge); it thus records not just the state of the repository, but the previous state, making it possible to determine exactly what changed.
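
A commit object, similarly sketched (the author and committer strings would normally carry timestamps and timezones, omitted here for brevity):

```python
import hashlib

def git_commit_sha1(tree, parents, author, committer, message) -> str:
    # A commit body names its tree, zero or more parent commits, the
    # author and committer, and the log message; "commit <size>\0"
    # is prepended before hashing, as with every object type.
    lines = ["tree " + tree]
    lines += ["parent " + p for p in parents]
    lines += ["author " + author, "committer " + committer, "", message]
    body = "\n".join(lines).encode()
    return hashlib.sha1(b"commit %d\x00" % len(body) + body).hexdigest()
```

Because the parent hashes are part of the hashed body, two commits pointing at identical trees but reached by different histories still get different names.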

Commits, too, have SHA‑1 hashes, and the hash of the previous commit (or commits) is included in that calculation. If two chains of development end up with the same file contents, the resulting commits will still have different hashes. Thus, unlike some other source-code management systems, Git does not (conceptually, at least) record "deltas" from one revision to the next. It thus forms a sort of blockchain, with each block containing the state of the repository at a given commit.

Why hash security matters

The compromise of kernel.org in 2011 created a fair amount of concern about the security of the kernel source repository. If an attacker were able to put a backdoor into the kernel code, the result could be the eventual compromise of vast numbers of deployed systems. Malicious code placed into the kernel's build system could be run behind any number of corporate and government firewalls. It was not a pleasant scenario but, thanks to the use of Git, it was also not a particularly likely one.

Let us imagine that some attacker has gained control of kernel.org and wants to place some evil code into floppy.c — something unspeakable like a change that replaces random sectors with segments from Rick Astley videos, say. Somehow this change would have to be incorporated into the repository so that it would be included in subsequent pulls. But the change to floppy.c changes its SHA‑1 hash; that, in turn, will change every tree object containing the evil floppy.c and every commit that includes it as well. The head commit for the repository would certainly change, as would older ones if the attacker tried to make the change appear to have happened in the distant past.

Somewhere out there is certainly some developer who actually memorizes SHA‑1 hashes and would immediately notice a change like that. The rest of us probably would not, but Git will. The distributed nature of Git means that there are many copies of the repository out there; as soon as a developer tries to pull from or push to the corrupted repository, the operation will fail due to the mismatched hashes between the two repositories and the corruption will come to light.

Repository integrity is also protected by signed tags, which include the hash for a specific commit and a cryptographic signature. The chain of hashes leading up to a given tag cannot be changed without invalidating the tag itself. The use of signed tags is not universal in the kernel community (and rare to nonexistent in many other projects), but mainline kernel releases are signed that way. When one sees Linus Torvalds's signature on a tag, one knows that the repository is in the state he intended when the tag was applied.

All of this depends on the strength of the hash used, though. If our attacker is able to modify floppy.c in such a way that its SHA‑1 hash does not change, that modification could well go undetected. That is why the news of SHA‑1 hash collisions creates concern; if SHA‑1 cannot be trusted to detect hostile changes, then it is no longer assuring the integrity of the repository.

The world has not ended yet, fortunately. It is still reasonably expensive to create any sort of SHA‑1 hash collision at all; creating a new version of floppy.c with the same hash would be hard. An attacker would not just have to do that, though; this new version would have to contain the desired hostile code, still function as a working floppy driver, and not look like an obfuscated C code contest entry (at least not more than it already does). Creating such a beast is probably still infeasible. But the writing is clearly on the wall; the time when SHA‑1 is too weak for Git is rapidly approaching.

Moving to a stronger hash

Back in the early days of Git, Torvalds was unconcerned about the possibility of SHA‑1 being broken; as a result, he never designed in the ability to switch to a different hash; SHA‑1 is fundamental to how Git operates. As of 2017, the Git code was full of declarations like:

    unsigned char sha1[20];

In other words, the type of the hash was deeply wired into the code, and it was assumed that hashes would fit into a 20-byte array.

At that time, Git developer brian m. carlson was already at work to separate the Git core from the specific hash being used; indeed, he had been working on it since 2014. It was unclear what hash might eventually replace SHA‑1, but it was possible to create an abstract type for object hashes that would hide that detail. At this point, that work is done and merged.
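
The effect of that abstraction can be sketched in Python terms (the names here are illustrative, not Git's; the real interface is a C structure carrying sizes and function pointers):

```python
import hashlib
from dataclasses import dataclass
from typing import Callable

@dataclass(frozen=True)
class HashAlgo:
    # A stand-in for Git's hash abstraction: code asks the algorithm
    # object for its sizes and hashing routine instead of assuming a
    # 20-byte SHA-1 everywhere.
    name: str
    rawsz: int     # binary hash size, in bytes
    hexsz: int     # hex representation size
    new: Callable  # hash-context constructor

SHA1 = HashAlgo("sha1", 20, 40, hashlib.sha1)
SHA256 = HashAlgo("sha256", 32, 64, hashlib.sha256)

def object_id(algo: HashAlgo, obj_type: bytes, data: bytes) -> str:
    # The same object-hashing scheme, parameterized by algorithm.
    header = obj_type + b" %d\x00" % len(data)
    return algo.new(header + data).hexdigest()
```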

The decision on a replacement hash algorithm was made in 2018. A number of possibilities were considered, but the Git community settled on SHA‑256 as the next-generation Git hash. The commit enshrining that choice cites its relatively long history, wide support, and good performance. The community has also decided on (and mostly implemented) a transition plan that is well documented; most of what follows is shamelessly cribbed from that file.

With the hash algorithm abstracted out of the core Git code, the transition is, on the surface, relatively easy. A new version of Git can be made with a different hash algorithm, along with a tool that will convert a repository from the old hash to the new. With a simple command like:

    git convert-repo --to-hash=sha-256 --frobnicate-blobs --climb-subtrees \
        --liability-waiver=none --use-shovels --carbon-offsets

a user can leave SHA‑1 behind (note that the specific command-line options may differ). There is only one problem with this plan, though: most Git repositories do not operate in a vacuum. This sort of flag-day conversion might work for a tiny project, but it's not going to work well for a project like the kernel. So Git needs to be able to work with both SHA‑1 and SHA‑256 hashes for the foreseeable future. There are a number of implications to this requirement that make themselves felt throughout the system.

One of the transition design goals is that SHA‑256 repositories should be able to interoperate with SHA‑1 repositories managed by older versions of Git. If kernel.org updates to the new format, developers running older versions should still be able to pull from (and push to) that site. That will only happen if Git continues to track the SHA‑1 hashes for each object indefinitely.

For blobs, this tracking will happen through the maintenance of a set of translation tables; given a hash generated with one algorithm, Git will be able to look up the corresponding hash from the other. Needless to say, this lookup will only succeed for objects that are actually in the repository. These translation tables will be maintained in the "pack files" that hold most objects in a contemporary Git repository. There will be a separate table for "loose objects" that are stored as separate files rather than in packs; the cost of lookups in that table is seen as being high enough that measures need to be taken to minimize the number of loose objects in any given repository.
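
The idea can be sketched as a bidirectional map (purely illustrative; Git keeps these tables alongside its pack index files, not in in-memory dicts):

```python
class TranslationTable:
    """Maps each object's SHA-1 name to its SHA-256 name and back."""

    def __init__(self):
        self.to_sha256 = {}
        self.to_sha1 = {}

    def record(self, sha1_hex, sha256_hex):
        self.to_sha256[sha1_hex] = sha256_hex
        self.to_sha1[sha256_hex] = sha1_hex

    def translate(self, hex_hash):
        # Succeeds only for objects actually present in the repository;
        # anything else yields None.
        return self.to_sha256.get(hex_hash) or self.to_sha1.get(hex_hash)
```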

The handling of other object types is a bit more complicated. An SHA‑1 tree object, for example, must contain SHA‑1 hashes for the objects in the tree. So if such a tree object is requested, Git will have to locate the SHA‑256 version, then translate all the object hashes contained within it before returning it. Similar translations will be required for commits. Signed tags will contain both hashes.
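
Serving a SHA‑1 view of a SHA‑256 commit thus amounts to rewriting the embedded hashes through the table before the object goes out the door. Roughly, working on the text form of a commit and a hypothetical mapping dict:

```python
import re

# A full SHA-256 hash is 64 hex digits; match only whole hashes.
SHA256_HEX = re.compile(r"\b[0-9a-f]{64}\b")

def commit_as_sha1(commit_body: str, sha256_to_sha1: dict) -> str:
    # Replace every SHA-256 hash in the "tree" and "parent" lines
    # with its SHA-1 equivalent from the translation table.
    return SHA256_HEX.sub(lambda m: sha256_to_sha1[m.group(0)], commit_body)
```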

With this machinery in place, Git installations will be interoperable during the transition. Eventually, all users will have upgraded to SHA‑256-capable versions of Git, at which point repository owners could begin turning off the SHA‑1 capability and removing the translation tables. The transition will, at that point, be complete.

Some inconvenient details

There are likely to be some glitches along the way, naturally. One of them is a simple human-factors problem: when a user supplies a hash value, should it be interpreted as SHA‑1 or SHA‑256? In some cases, it's unambiguous; SHA‑1 hashes are 160 bits wide, so a 256-bit hash must be SHA‑256, for example. But a shorter hash could be either, since hashes can be (and often are) abbreviated. The transition document describes a multi-phase process during which the interpretation of hash values would change, but most users are unlikely to go through that process.
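
Only full-length hashes are self-identifying; the ambiguity lies entirely in abbreviations. A sketch of the distinction (not Git's actual parsing code):

```python
def classify_hash(hex_hash: str) -> str:
    # 40 hex digits (160 bits) can only be SHA-1; 64 digits
    # (256 bits) can only be SHA-256.  Anything shorter could be
    # an abbreviation of either.
    if len(hex_hash) == 40:
        return "sha1"
    if len(hex_hash) == 64:
        return "sha256"
    return "ambiguous"

print(classify_hash("485865fd0412e40d041e861506bb3ac11a3a91e3"))  # sha1
print(classify_hash("485865fd"))                                  # ambiguous
```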

There is, of course, a way to unambiguously give a hash value in the new Git code, and they can even be mixed on the command line; this example comes from the transition document:

    git --output-format=sha1 log abac87a^{sha1}..f787cac^{sha256}

For a Git user interface this is relatively straightforward and concise, but one can still imagine that users might tire of it relatively quickly. The obvious solution to this sort of bracket fatigue is to fully transition a project to SHA‑256 as quickly as possible.

There is another issue out there, though: there are a lot of SHA‑1 hash values in the wild. The kernel repository currently contains over 40,000 commits with a Fixes: tag; each one of those includes an SHA‑1 hash. These hash values also can be found in bug-tracker histories, release announcements, vulnerability disclosures, and more. In a repository without SHA‑1 compatibility, all of those hashes will become meaningless. To address this issue, one can imagine that the Git developers may eventually add a mode where translations for old SHA‑1 hashes remain in the repository, but no SHA‑1 hashes for new objects are added.

Current state

Much of the work to implement the SHA‑256 transition has been done, but it remains in a relatively unstable state and most of it is not even being actively tested yet. In mid-January, carlson posted the first part of this transition code, which clearly only solves part of the problem:

First, it contains the pieces necessary to set up repositories and write _but not read_ extensions.objectFormat. In other words, you can create a SHA‑256 repository, but will be unable to read it.

The value of write-only repositories is generally agreed to be relatively low; not even SCCS was so limited. Carlson's purpose in posting the code at this stage is to try to reveal any core issues that will be harder to change as the work progresses. Developers who are interested in where Git is going may well want to take a close look at this code; converting their working repositories over is not recommended, though.

As it turns out, carlson's work goes well beyond what has been put out for testing now; he will post it when he is ready, but really curious people can see it now in his GitHub repository. This work is unlikely to land on the systems of most Git users for some time yet, but it is good to know that it is getting close to ready. The Git developers (carlson in particular) have quietly been working on this project for years; we will all benefit from it.


A new hash algorithm for Git

Posted Feb 3, 2020 18:15 UTC (Mon) by IanKelling (subscriber, #89418) [Link] (1 responses)

Great article. I'd love to see a similar one about GPG and SHA-1.

A new hash algorithm for Git

Posted Feb 3, 2020 18:34 UTC (Mon) by zdavatz (guest, #70954) [Link]

Great article, thank you!

A new hash algorithm for Git

Posted Feb 3, 2020 18:37 UTC (Mon) by Cyberax (✭ supporter ✭, #52523) [Link] (21 responses)

> There is, of course, a way to unambiguously give a hash value in the new Git code, and they can even be mixed on the command line; this example comes from the transition document
One trick that worked for me in a similar case was to switch the encoding. SHA-1 is encoded as hex numbers, we can simply switch SHA-256 to be encoded as letters "g" to "v", so they will be immediately recognizable.
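
The suggestion amounts to keeping base 16 but moving its digit set out of the way of hex entirely; a quick sketch:

```python
# Encode SHA-256 values in base 16, but with the digits "g"-"v"
# standing in for "0"-"9" and "a"-"f", so a shifted hash can never
# be mistaken for a hex-encoded SHA-1.
SHIFT = str.maketrans("0123456789abcdef", "ghijklmnopqrstuv")

def shifted_hex(sha256_hex: str) -> str:
    return sha256_hex.translate(SHIFT)

print(shifted_hex("cafe0123"))  # → "sqvughij"
```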

A new hash algorithm for Git

Posted Feb 3, 2020 18:41 UTC (Mon) by juliank (guest, #45896) [Link] (14 responses)

This sounds like a totally reasonable thing to do.

A new hash algorithm for Git

Posted Feb 3, 2020 23:48 UTC (Mon) by dsommers (subscriber, #55274) [Link] (13 responses)

No, not really. What josh suggests, prefixing the string, makes more sense.

* performance: Doing char replacement in strings is more CPU intensive than just skipping one single byte and continuing to use standard functions/libraries. This gets more evident when considering large repositories like the Linux kernel.

* future compatibility: Shifting the a-f chars to another set of 6 letters will only work 3 more times if only considering lower-case letters - 6 letters (a-f) * 4 shifts = 24. So at the 5th change, something new must be done to avoid breaking compatibility. Of course the counter argument is "how often will such new algorithms occur in reality?"; but none of us really knows that for sure - just as we don't know how long a git repository will live and be accessed.

From this article (I've not paid attention to discussions in the git community), it seems like they account for the possibility of changing it again later on. So having a prefix possibility with just one prefix or suffix letter makes it possible to change algorithms 26 times, with no performance loss (except the "skip one byte" operation when evaluating the hash). If that is too little, 3 letters gives the possibility of 17,576 changes; which is probably enough for most of us alive today - but using 4 letters increases that once again to an even more insane number.

But say you then settle for a 4-letter prefix (456,976 possibilities) ... then you're not that far away from {sha256}, which is 8 letters, with basically an unlimited number of algorithm changes. What is inside the {} can be any length while containing a good description of what kind of algorithm is in use, without needing to look up that "AAAC" means SHA512.

A new hash algorithm for Git

Posted Feb 4, 2020 0:10 UTC (Tue) by Cyberax (✭ supporter ✭, #52523) [Link]

A prefix would also work, but let's limit it to 1 letter. This would realistically give more than enough coding space to last until git is no longer useful. It can also be extended to two characters later if needed.

A new hash algorithm for Git

Posted Feb 4, 2020 5:56 UTC (Tue) by eru (subscriber, #2753) [Link] (4 responses)

One prefix letter would allow signifying only 20 possible hash algorithms, because you should avoid [a-f], which can start a valid hash value in the current scheme.
But probably that would be enough; it is unlikely the hash changes more than once in a decade...

A new hash algorithm for Git

Posted Feb 4, 2020 6:06 UTC (Tue) by Cyberax (✭ supporter ✭, #52523) [Link]

You can then use the UTF-8-like encoding, reserving 1 bit for "next byte continues the encoding ID" flag. So you can extend it indefinitely.

A new hash algorithm for Git

Posted Feb 4, 2020 16:07 UTC (Tue) by mathstuf (subscriber, #69389) [Link] (2 responses)

There are also uppercase letters if we really get desperate :) .

A new hash algorithm for Git

Posted Feb 5, 2020 2:56 UTC (Wed) by NYKevin (subscriber, #129325) [Link] (1 responses)

So, SHA-1 gets hex, and SHA-2 gets Base64? That'd be entertaining to watch.

A new hash algorithm for Git

Posted Feb 5, 2020 3:32 UTC (Wed) by Cyberax (✭ supporter ✭, #52523) [Link]

It's still base16, just with shifted letters.

> That'd be entertaining to watch.
Why?

A new hash algorithm for Git

Posted Feb 4, 2020 19:40 UTC (Tue) by quotemstr (subscriber, #45331) [Link]

> Doing char replacing in strings is more CPU intensive than just skipping one single byte and continue using standard functions/libraries

Hash functions don't operate on the hex encoding of the hash digest. If you need to parse base-16 to binary anyway, there's no penalty arising from choosing an alternate set of characters to represent that base-16 value.

A new hash algorithm for Git

Posted Feb 5, 2020 15:00 UTC (Wed) by obi (guest, #5784) [Link] (5 responses)

The Protocol Labs people seem to use a function + length prefix; see: https://multiformats.io/multihash/

A new hash algorithm for Git

Posted Feb 5, 2020 17:26 UTC (Wed) by juliank (guest, #45896) [Link]

This, this is what people should be using, yes.

A new hash algorithm for Git

Posted Feb 6, 2020 0:41 UTC (Thu) by pj (subscriber, #4506) [Link] (2 responses)

I was going to suggest just this thing. Also, is there any reason a repo has to allow/use only one hash function? The hashes are metadata, and sure it'll be some CPU to recalc, but it would seem totally possible to have parallel metadata objects without touching the data objects (the files, commit messages, etc) at all.

A new hash algorithm for Git

Posted Feb 7, 2020 22:14 UTC (Fri) by Jandar (subscriber, #85683) [Link] (1 responses)

The hashes are not metadata, they are the identifier of the data.

A new hash algorithm for Git

Posted Feb 7, 2020 23:17 UTC (Fri) by Jandar (subscriber, #85683) [Link]

On reading this myself again, I think the better wording would be:

The hash is like an inode-number for a file.

A new hash algorithm for Git

Posted Feb 6, 2020 9:50 UTC (Thu) by ivyl (subscriber, #88764) [Link]

This looks neat but makes using shortened hashes much more challenging, as each of them will start with a fixed prefix.

A new hash algorithm for Git

Posted Feb 3, 2020 21:44 UTC (Mon) by josh (subscriber, #17465) [Link] (2 responses)

Or just add a single special character at the beginning, like a capital H. (Using a letter will make sure that people's "select by word" mechanisms pick it up.)

A new hash algorithm for Git

Posted Feb 4, 2020 17:13 UTC (Tue) by excors (subscriber, #95769) [Link] (1 responses)

Gerrit already uses a SHA-1 prefixed with "I" for its Change-Id (a persistent identifier of a patch). Are there any other popular Git-related tools that use a similar pattern? If Git started adding its own prefix letters, it would be nice to avoid ambiguity with them.

A new hash algorithm for Git

Posted Feb 5, 2020 11:55 UTC (Wed) by mgedmin (subscriber, #34497) [Link]

Not sure if this counts, but 'git describe' prefixes the commit hash with a 'g', in the third part of its '<tag>-<number>-g<sha1>' output.

A new hash algorithm for Git

Posted Feb 4, 2020 14:58 UTC (Tue) by ballombe (subscriber, #9523) [Link] (1 responses)

Does not work in general:
12345678 is perfectly valid in both notations.

A new hash algorithm for Git

Posted Feb 4, 2020 15:51 UTC (Tue) by willy (subscriber, #9762) [Link]

He didn't say "use the digits 0123456789ghijkl". He said "use the digits ghijklmnopqrstuv".

A new hash algorithm for Git

Posted Feb 4, 2020 23:09 UTC (Tue) by flussence (guest, #85566) [Link]

A modest proposal: prefix the SHA-256 with "$5$rounds=1$".

A new hash algorithm for Git

Posted Feb 3, 2020 18:52 UTC (Mon) by meyert (subscriber, #32097) [Link] (29 responses)

I wonder if the much-increased complexity is really worth it, given a very theoretical hash collision.

A new hash algorithm for Git

Posted Feb 3, 2020 19:03 UTC (Mon) by martin.langhoff (subscriber, #61417) [Link] (19 responses)

At some point, it will be worthwhile. We don't know exactly when that'll be, but the trick is to do it _before_ that inflection point.

A new hash algorithm for Git

Posted Feb 3, 2020 19:58 UTC (Mon) by mirabilos (subscriber, #84359) [Link] (18 responses)

By then, SHA-256 will be broken as well. SHA-2 uses the same underlying structure as SHA-1 and is more secure almost only due to its length. Anything new deployed now should use SHA-3 (Keccak) right from the start. The comparison with OpenPGP also falls short; people can choose the hash algorithm there (even though gpg2 --version shows there's no SHA-3 yet).

Also, I wonder, will I be able to verify old signed commits and tags after the transition is complete? Doesn’t seem so…

A new hash algorithm for Git

Posted Feb 3, 2020 22:02 UTC (Mon) by Cyberax (✭ supporter ✭, #52523) [Link] (4 responses)

The best attacks on SHA-1 reduce complexity from 2^80 (still unfeasible to brute-force) to 2^68 (just barely feasible). That's about 2^12 times speedup.

SHA-256 has 2^128 collision probability to start with, any realistic attacks won't lower the complexity below 2^100 (WAY outside of possible attacks).

A new hash algorithm for Git

Posted Feb 4, 2020 2:47 UTC (Tue) by wahern (subscriber, #37304) [Link] (3 responses)

The recently published SHAttered attack (https://shattered.it/) took ~2^63 computations. That said, AFAIU none of the recent SHA-1 attacks carry over to SHA-256. And other than length extension attacks, I don't think the Merkle–Damgård construction is considered fundamentally broken; it's just well analyzed.

A new hash algorithm for Git

Posted Feb 4, 2020 11:32 UTC (Tue) by heftig (subscriber, #73632) [Link] (2 responses)

Since Git prefixes an object with its length before hashing it, does length extension still apply?

A new hash algorithm for Git

Posted Feb 5, 2020 9:19 UTC (Wed) by bmenrigh (subscriber, #63018) [Link] (1 responses)

Well, no (and yes). Depending on the specific details of the length prefix, you may be able to extend the hash with a longer message whose length is interpreted using part of the start of the existing message. In this way you would be limited to only certain pre-determined lengths. I don't know the details of how git does things to comment further.

A new hash algorithm for Git

Posted Feb 5, 2020 15:02 UTC (Wed) by johill (subscriber, #25196) [Link]

I don't think this is possible - git prefixes with e.g. "blob 1234\0" (yes, the ascii text) for a 1234-byte blob. It then looks for the first \0 and you can't get it to treat that as a valid length digit.

A new hash algorithm for Git

Posted Feb 4, 2020 2:16 UTC (Tue) by KaiRo (subscriber, #1987) [Link] (8 responses)

For signed commits or other signatures, the other question is quantum-safety of the signatures themselves, which is also probably not ensured right now. I'm actually a bit more worried about switching to quantum-safe asymmetric crypto than about those hash collisions, but both are somewhat worrisome.

A new hash algorithm for Git

Posted Feb 4, 2020 2:22 UTC (Tue) by mirabilos (subscriber, #84359) [Link] (7 responses)

Quantum what?

I was just wondering because the signature is over the SHA-1 hash.

A new hash algorithm for Git

Posted Feb 4, 2020 2:39 UTC (Tue) by KaiRo (subscriber, #1987) [Link] (6 responses)

"Signature" usually means that you sign some arbitrary data (in this case a SHA-1 hash) using some asymmetric crypto key material (in this case usually some RSA variant). RSA and other asymmetric crypto algorithms commonly used nowadays are not safe from being cracked by quantum computers once we have some with enough capacity. That puts all signatures, identification, and encryption based on those algorithms at risk once we have those kinds of quantum computers, so where we use those we will need to find solutions for that (quantum-safe algorithms are in development or testing right now but not finalized AFAIK). The common hash algorithms have no big issues with that, so it doesn't affect git itself directly, but it certainly does or will affect the signatures of signed commits.

A new hash algorithm for Git

Posted Feb 4, 2020 19:57 UTC (Tue) by mirabilos (subscriber, #84359) [Link] (5 responses)

I know what a signature is and all that, but I absolutely don’t get where you are going with this.

When I currently have a signed commit…

-----BEGIN cutting here may damage your screen surface-----
$ git cat-file -p HEAD
tree 937122472a792ada03309a60b7a31e02a29aa764
parent 53861b4a1544c7c8825f1414c37c9694c84c5d92
author mirabilos <m@mirbsd.org> 1580771045 +0100
committer mirabilos <mirabilos@evolvis.org> 1580771470 +0100
gpgsig -----BEGIN PGP SIGNATURE-----
Comment: ☃ ЦΤℱ—8 ☕☂☄

iQIcBAABCQAGBQJeOKiPAAoJEIlQwYleuNOzSzYP/3xowIYpxJwuHfdP8oRekbSZ
eVI9mO5g8KC+SUe5oGCbocH478pBUp5AOYlFGL0awetklijRmF+EeYp+a1IluCww
GD2pSPFCpxSjScERlED5YYpfaaw1XEutoGHYQNMAUQhlRMzS8NwhGJjTuoIbvE4X
hMntoMtDM7sPJ3CIADIoYzXIcdaqsELvqptuvNdo9S/PIyR6OFWhpF68Qn+SILqk
N+fOA/KpgQLsRmMEVy3YtqmMdToYXoP3m4ec0/QSoN90QVrO9ZnVG2+0f9yeEiVn
xEWiaSSsz5vtniBLzOvQ6FeE0h08ZsQi9dcTj8aq3tDtUJb2sQi6q79Gl5StmfHI
8HN9q8ZQP/Vh8kIT5z3lcuNnb3y7sc90ZzY5i7Q2YwfKNbJ5mAEMvSgzBxcrDflR
/kjUJcXJg98IzJsWbE3k9gRc9yatqKQii0GiaxID13fCfl++4klJrFMEyoTdhta4
5a7vGa6OuHr+MWsT+35yQsR6Mt1DnMY2oNArTgWG3DfNQK8zb7rIExPbuV6pLP2O
X67ZCVSHwRTrLWnDHjSuQH4Hfoibq96Ga9wJwEjw0+sWKzg4CgvQH6L+UiXIZO0/
2+hhF507WUCKh8Uit2nrRsGhVnXJrI5QZsD857oAifcBFslbTLwTCkj+3gccHxwH
A/BAeG4zN0JrdvMzx0pN
=9w0P
-----END PGP SIGNATURE-----

erm yes, the symlink…
-----END cutting here may damage your screen surface-----

… or tag…

-----BEGIN cutting here may damage your screen surface-----
$ git cat-file -p mksh-57-6
object 3ece4d6c67f32b8e2b9b00900d05cc06c658fc87
type commit
tag mksh-57-6
tagger mirabilos <mirabilos@evolvis.org> 1580771932 +0100

mksh (57-6) unstable; urgency=low
-----BEGIN PGP SIGNATURE-----
Comment: ☃ ЦΤℱ—8 ☕☂☄

iQIcBAABCQAGBQJeOKpdAAoJEIlQwYleuNOzE3EP/1Qu6w3ZnelCbTcR0/lR1QaH
qisRANlIKYq0MVDOmhzGZ4m6/ri9b2njI16x0R3otaIT2QfG2ldj8U/Sq7Vpm6Xb
uTpMluMzFj6sungPYOCvgbDVcVqt4+qCAwtFL5Lt2gpfN45KwYO0RdrSCY8wFD3N
TO3Wq7M3DXt99F9mMY/L+XfvbpDAMzjCEK0tgTAal4QWnnb7V2Y1bVnZjos5XZTV
hWW4kJMqBp2Hf99KLqnjijfPgZkqbSMYKy14Nsqo1cSujwPpOH2MgDbyuun1SuSA
K6U0JT1iyIsL/ixkCx8vi6ejIGGQXXpGEq4K4RA3Wc4ALB/FWC9Y2MrCEExG0wEV
tDkto90sbD6Nymnii1apG2Q7aSyDNDjsiRT2tzYN2S5EzItYtV0V8ZXoxiYk/c/Z
ttAcdXxh8R4+5p3yNYwAjTSzZe8ohvgHFXoAUGVpk7g9oArlNiJmqkrW3BGdFrCb
gH0h4UpiXr3pgnlPi247alGT18Xly5cBX3CbjORGDNsUDZoGPLlVuyW46PaRel3V
P8BODtOoFkoK7JyFCRP70Z97vQig+L9nbN5tf50haYlxhO7oOSU7RzQJxgv2tLza
AT0bg6Wfs4I9VV/MjocIirwrbihZY1gMgURgad5PdoNjoyNy+vd6OKMFQm1i/eUF
hGIwKngrue1A9RMKPaCG
=JPiZ
-----END PGP SIGNATURE-----
-----END cutting here may damage your screen surface-----

… these hardcode the SHA-1 hashes. These are, thus, needed to verify the signature. This also cannot be rewritten.

As a user, I’d expect that, after full git conversion to a new hash, I’ll still be able to verify these. That was the question.

A new hash algorithm for Git

Posted Feb 5, 2020 9:50 UTC (Wed) by geert (subscriber, #98403) [Link] (2 responses)

So you have to keep the translation table from SHA-1 to SHA-256 for all old objects.

Cf. "To address this issue, one can imagine that the Git developers may eventually add a mode where translations for old SHA‑1 hashes remain in the repository, but no SHA‑1 hashes for new objects are added."

A new hash algorithm for Git

Posted Feb 13, 2020 23:57 UTC (Thu) by floppus (guest, #137245) [Link] (1 responses)

But to verify a signature, you don't just need a translation table for all the old objects, you actually need the old objects themselves (or be able to somehow reconstruct the original objects, byte for byte.)

Otherwise, assuming you trust the person who generated the signature but you don't trust the contents of the git repository, you have no way of knowing that the commit you're looking at actually corresponds to the same source tree that the person signed.

A new hash algorithm for Git

Posted Feb 14, 2020 1:20 UTC (Fri) by excors (subscriber, #95769) [Link]

Perhaps the translation table should be signed. After the repository is converted to SHA-256, someone trusted (e.g. Linus Torvalds) can run a tool that scans through the translation table, fetches the object corresponding to each SHA-256, computes its SHA-1, verifies the translation table says the same SHA-1, then signs the translation.

Then a regular user can verify a commit (identified by its SHA-256) by using the signed translation table to find the corresponding SHA-1 and checking the committer's signature of that SHA-1. That avoids the performance cost of having to fetch the entire object from disk to compute its SHA-1 before checking the signature, while avoiding the danger of a falsified translation table that tries to link the signed SHA-1 to a totally different commit that doesn't actually match that SHA-1.

As a bonus, if SHA-1 gets completely broken in the future, I think the repository would remain secure. If a future attacker can manufacture a commit whose SHA-1 matches an old signed commit, they could try to insert that commit into the repository with a valid translation table entry (containing the colliding SHA-1 and a new non-colliding SHA-256) and reuse the old commit's signature on their new commit (since it's only signing the SHA-1). If the translation table was unsigned, the attacker could succeed. But if it was signed, there's no way to insert the new translation table entry without tricking Linus into signing the new table. And Linus can avoid being tricked if he simply stops signing any new translation tables beyond the point when SHA-1 gets completely broken (which should be many years away).

A new hash algorithm for Git

Posted Feb 5, 2020 19:40 UTC (Wed) by KaiRo (subscriber, #1987) [Link] (1 responses)

The problem is that with quantum computers you actually can forge signatures; that is, you actually _can_ rewrite those things, and/or make something such as a git commit appear to verify as coming from someone when it is really from someone else, at least with the current (RSA) mechanisms. We will need new, quantum-safe signatures in the future. Unfortunately, the security community has not yet settled on widely accepted algorithms for that, though there are developments in this area.

The actual hash that is signed is a different topic. You should be able to verify those signed hashes as long as the original hash is available (part of what the original article is about) and the signature can be trusted (which may not be the case forever, as I was pointing to).

A new hash algorithm for Git

Posted Feb 6, 2020 15:40 UTC (Thu) by luto (subscriber, #39314) [Link]

There are several excellent hash-based signature algorithms that appear to be fully secure against quantum attack. They don’t extend to encryption or to key exchange, so they are not full RSA replacements.

Maybe Skip SHA-3

Posted Feb 4, 2020 8:19 UTC (Tue) by tialaramex (subscriber, #21167) [Link]

Adam Langley suggests sticking with the SHA-2 family while things shake out in the relatively new frontier that is Keccak-style algorithms.

https://www.imperialviolet.org/2017/05/31/skipsha3.html

SHA-3 is significantly slower than SHA-2, which is already very slow for a hash (if we didn't need a cryptographic hash, there are lots of very fast hashes used elsewhere). So it's a big penalty, and you aren't buying, say, future-proofing, because SHA-3 was agreed on well before the dust settled on how to do this style of hash: there are currently half a dozen like it, all seemingly secure, most faster, none standardised. This isn't like AES, where the rough direction is understood and you can now buy hardware that accelerates it, so that not doing AES ends up slower because you lose the hardware assist.

Langley recommends SHA-512/256 (note for those unfamiliar: this is literally the name of one hash, not two different hashes you can pick from) if you care about length-extension attacks; otherwise SHA-256 is fine. The reason for SHA-512/256 is that its output isn't the entire internal state, only half of it, so a length extension fails, and it needs only the same size structure to store the hash as SHA-256 (but it is slower).
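Python's `hashlib` can illustrate the point, assuming the underlying OpenSSL build exposes the `sha512_256` algorithm (not every build does); the names and availability check below are standard `hashlib` usage:

```python
import hashlib

data = b"example object"

# SHA-256 emits its entire 32-byte internal state, which is what makes
# length-extension attacks possible against it.
h256 = hashlib.sha256(data).digest()
assert len(h256) == 32

# SHA-512/256 (one algorithm, not a choice of two) runs the SHA-512
# compression function with different initial values and emits only half
# of the 64-byte state, so a length extension fails. Same storage size
# as SHA-256.
if "sha512_256" in hashlib.algorithms_available:
    h512_256 = hashlib.new("sha512_256", data).digest()
    assert len(h512_256) == len(h256) == 32
    # Simply truncating SHA-512 is NOT SHA-512/256: the standard
    # variant uses distinct initial hash values, so the digests differ.
    assert h512_256 != hashlib.sha512(data).digest()[:32]
```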

A new hash algorithm for Git

Posted Feb 4, 2020 10:41 UTC (Tue) by epa (subscriber, #39769) [Link] (1 responses)

Also, I wonder, will I be able to verify old signed commits and tags after the transition is complete?
Perhaps you would be able to verify them slowly by recomputing the SHA-1 hashes of each object from scratch, even if they aren't stored in the repository.

A new hash algorithm for Git

Posted Feb 5, 2020 17:06 UTC (Wed) by droundy (subscriber, #4559) [Link]

That's my impression from the article, and why the compatibility mode is slow.

A new hash algorithm for Git

Posted Feb 5, 2020 8:26 UTC (Wed) by bluss (subscriber, #47454) [Link]

> By then, SHA-256 will be broken as well. SHA-2 uses the same underlying structure as SHA-1 and is almost only more secure due to its length

It's true the structure and construction are the same, but SHA-2 has a bigger state and much more involved mixing steps per round; those, not the digest length, are the biggest differences versus SHA-1.

A new hash algorithm for Git

Posted Feb 3, 2020 20:00 UTC (Mon) by chfisher (subscriber, #106449) [Link]

Since it is generally conceded that the question is not "if SHA-1 will be compromised" but "when will SHA-1 be compromised", it behooves us as developers to move to a more secure option BEFORE that compromise occurs, since an exploit that successfully infects the kernel would have such wide ranging (and expensive) implications.

A new hash algorithm for Git

Posted Feb 3, 2020 20:15 UTC (Mon) by dkg (subscriber, #55359) [Link] (7 responses)

A hash collision for SHA-1 is not theoretical at all. Rather, it is within reach of a moderately funded attacker, on the order of $100K, and has been practically demonstrated by a university+corporate team. The price is expected only to fall.

The authors of the recent "SHA-ttered" collision have this to say about git:

GIT strongly relies on SHA-1 for the identification and integrity checking of all file objects and commits. It is essentially possible to create two GIT repositories with the same head commit hash and different contents, say a benign source code and a backdoored one. An attacker could potentially selectively serve either repository to targeted users. This will require attackers to compute their own collision.

Note that this weakness in git means that even git signatures made with strong modern crypto are vulnerable, because they are signing objects that refer to other objects only by their SHA-1 digest.

For instance, when signing tags, the signed tag itself cannot be replaced, but the thing that the tag points to can be replaced without invalidating the signature.

Kudos to carlson for working on this; it's a shame that this kind of maintenance work never seems to get prioritized by projects until there is a fire that needs putting out. It would have been better if we had completed this transition years ago.

A new hash algorithm for Git

Posted Feb 3, 2020 20:27 UTC (Mon) by walters (subscriber, #7396) [Link]

See also https://github.com/cgwalters/git-evtag for a stronger signed tag.

A new hash algorithm for Git

Posted Feb 3, 2020 21:11 UTC (Mon) by martin.langhoff (subscriber, #61417) [Link] (4 responses)

It's been demonstrated on a pair of PDF files. The format is pretty opaque to the typical end user, and the "good" file was pre-doctored. These are very artificial conditions.

As many have pointed out, including this article, current attacks cannot match the SHA-1 of an existing file that wasn't built to facilitate the attack in the first place; attackers have to add a bunch of "random" data to get to a collision. For a code file, which is the typical content of git, that's pretty "visible".

A new hash algorithm for Git

Posted Feb 3, 2020 21:49 UTC (Mon) by dkg (subscriber, #55359) [Link] (2 responses)

I recommend reading Joey Hess's discussion from 2011 (in particular the discussion in the comments) for why the legibility of the commit messages and code objects typically covered by git is not necessarily sufficient: other stuff can be included in the hashes that won't be visible to normal end users. (maybe this was fixed in the last decade? i haven't tested recently)

Even if it were somehow true that git hashes only cover the things that are directly exposed to the user, "git history is cryptographically strong for repositories that contain only human-readable code" is a significant reduction in scope from "git history is cryptographically strong". I don't think we want to make that reduction, and i know of no repositories (and no tooling) that would deliberately enforce that kind of limitation for the sake of retaining cryptographic strength of the git history.

Also, many "code only" repositories contain the occasional binary graphic file (screenshot, logo etc), firmware, test corpus, etc, all of which could be used to hide the "tumor" needed for this kind of collision attack.

This needs fixing, and we've known it needed fixing for nearly as long as git has been around. Why advance an argument that seems like it would only help to delay getting a fix deployed?

A new hash algorithm for Git

Posted Feb 3, 2020 23:39 UTC (Mon) by martin.langhoff (subscriber, #61417) [Link]

To be clear, this is progress, and progress is needed. At some point, SHA-1 will be truly broken in a "useful" way, with real life impact, and we better have made the transition by then.

It's not known to be broken today in a useful, usable way, for the typical uses of git.

A new hash algorithm for Git

Posted Feb 4, 2020 15:25 UTC (Tue) by joey (guest, #328) [Link]

It has not been fixed.

A new hash algorithm for Git

Posted Feb 3, 2020 22:03 UTC (Mon) by khim (subscriber, #9252) [Link]

I don't know where you got the notion that matching the hash of an existing file that wasn't built to facilitate the attack is even remotely possible.

Not even MD4 is broken for preimage attacks in practice. MD4 was "broken" with a preimage attack of complexity 2¹⁰², which is really worrying: maybe in a few more years, with some ASICs… maybe. Very unlikely, though: very few entities could spend literally trillions of dollars just to show that an old, almost completely forgotten hash is no longer useful. There is a theoretical preimage attack on MD5 of complexity 2¹²³⋅⁴, but when you recall that there are 2¹²⁸ MD5 hashes in total… that's a pretty trivial improvement. SHA-1 currently has no theoretical preimage attack at all (though there are a few for "reduced" versions, which means we'll see something for the full one soon enough).

So no, don't expect preimage attack on SHA-1 to happen in your lifetime… unless you plan to live for 300 years.

Now, collision attacks are pretty easy for MD4 and MD5, and relatively easy for SHA-1 (tens of thousands of dollars), but they all require the attacker to plant a "bomb" in the "good" repo. These are still nasty enough to worry about, but as you can guess, the urgency is quite low: I still think it's cheaper to just submit a dozen patches with subtle buffer overflows and get one of them accepted than to generate such a collision. But the price goes down each year…

A new hash algorithm for Git

Posted Feb 3, 2020 21:56 UTC (Mon) by dkg (subscriber, #55359) [Link]

I should also mention that the "shambles" attack published in January 2020 claims costs of $11K (USD) for an arbitrary collision and $45K (USD) for a chosen-prefix collision.

A new hash algorithm for Git

Posted Feb 3, 2020 22:02 UTC (Mon) by josh (subscriber, #17465) [Link] (4 responses)

The decision on SHA-256 as the successor was made back in 2018. I wonder if the rationale holds as strongly now as it did then? There are several new candidates with substantially higher performance than SHA-256, and in particular a couple that have the advantage of supporting parallel hashing for large blocks of data, notably BLAKE3.

(I *don't* want to bikeshed the hash selection here. But I wonder if that hash selection might be worth benchmarking and re-evaluating now that the infrastructure is ready.)

A new hash algorithm for Git

Posted Feb 4, 2020 2:07 UTC (Tue) by KaiRo (subscriber, #1987) [Link]

I've wondered about that as well. SHA-256 has good hardware support right now, but SHA-3/Keccak or even the very new BLAKE3 would technically be better, though it will take some time until the latter especially is supported in hardware; probably still before SHA-1 collisions become a practical problem in git repos, though. How flexible is the code in that patch for moving right to an even newer hash algorithm?

A new hash algorithm for Git

Posted Feb 4, 2020 9:07 UTC (Tue) by jwilk (subscriber, #63328) [Link]

A new hash algorithm for Git

Posted Feb 4, 2020 14:40 UTC (Tue) by cesarb (subscriber, #6266) [Link]

BLAKE3 might have another potential advantage for Git: due to its tree structure, it could allow breaking large blobs into small pieces which can be hashed independently, without changing the final hash. This might help with some of the issues Git has with large files in a repository.
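The tree structure being described is essentially a Merkle tree. The toy construction below is not BLAKE3 (which specifies its own chunk size, keying, and flags), but it shows the property in question: fixed-size chunks are hashed independently (so they could be hashed in parallel), and the final hash depends only on the chunk hashes. The chunk size here is an arbitrary choice for illustration.

```python
import hashlib

CHUNK = 1024  # illustrative chunk size, not BLAKE3's

def tree_hash(data: bytes) -> str:
    # Hash each chunk independently; these leaf hashes could be
    # computed in parallel, or cached per chunk for large blobs.
    leaves = [hashlib.sha256(data[i:i + CHUNK]).digest()
              for i in range(0, max(len(data), 1), CHUNK)]
    # Combine pairs of hashes until one root hash remains.
    while len(leaves) > 1:
        leaves = [hashlib.sha256(b"".join(leaves[i:i + 2])).digest()
                  for i in range(0, len(leaves), 2)]
    return leaves[0].hex()
```

With such a scheme, changing one byte of a large blob only requires re-hashing the affected chunk and the path of internal nodes above it, rather than the whole file.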

A new hash algorithm for Git

Posted Feb 5, 2020 15:17 UTC (Wed) by smoogen (subscriber, #97) [Link]

There are two items:
1. Like a game of rock-paper-scissors, there is no one right choice that 'wins'. You choose shasha over chachacha and then find out that both are weak against an attack that plugh294 isn't. However, plugh294 is weak against an attack that shasha resists well and chachacha handles tolerably... etc. etc.
2. Because of that, and because the attacks keep getting better, any choice you make at time X will look bad at time X+1. This leads a lot of projects to 'jump to the latest findings', switching their crypto or checksums or signing to the latest thing that was written to be stronger than whatever was chosen at time X. However, also due to point 1, you end up finding that you have to keep hopping.

In the end, you just have to choose something and implement it, knowing that you will have to choose something else again in 2-3 years and implement that. There is no 'right' choice; there is just an infinite supply of 'wrong' ones, some more wrong and some less.

A new hash algorithm for Git

Posted Feb 3, 2020 22:10 UTC (Mon) by newren (guest, #5160) [Link]

It's worth noting that Git already transitioned away from SHA1 to SHA1DC (SHA-1 with detection of collisions), using https://github.com/cr-marcstevens/sha1collisiondetection. This was done about 3 years ago, and prevents a lot of the existing sha1 attacks, including even the recent sha-mbles stuff (see e.g. https://lore.kernel.org/git/20200107203147.r33c5plp5g7pmx...)

All that said, I'm glad Brian is doing such great work in transitioning the codebase over to a newer hash algorithm. It's a huge pile of work, and I'm glad he's been tackling it.

A new hash algorithm for Git

Posted Feb 4, 2020 5:21 UTC (Tue) by pabs (subscriber, #43278) [Link] (1 responses)

I wonder what other changes should be added when changing the format of git repositories.

For example: I would like to see restic/borg style rolling chunking, for more efficient storage of large files.

A new hash algorithm for Git

Posted Feb 5, 2020 11:54 UTC (Wed) by nix (subscriber, #2304) [Link]

Note that various git-using projects already do this: e.g. both that and slicing objects into smaller pieces (at points identified by an rsync-style rolling hash) are more or less the core of bup. (Well, that and extensive Bloom filtering to make massive repositories fast enough to use while not having to access their packfiles or indexes more than strictly necessary.)
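The rolling-hash chunking mentioned above can be sketched minimally. This is not bup's or restic's actual algorithm (both use stronger rolling hashes and min/max chunk bounds); it is a toy content-defined chunker using a plain rolling sum, with window size and boundary mask chosen arbitrarily for illustration:

```python
def chunk(data: bytes, window: int = 48, mask: int = 0x0FFF):
    """Split data at content-defined boundaries chosen by a rolling sum
    over a sliding window. Because boundaries depend only on local
    content, inserting bytes near the start of a file disturbs only
    nearby chunks instead of shifting every subsequent chunk."""
    pieces, start, rolling = [], 0, 0
    for i, byte in enumerate(data):
        rolling += byte
        if i >= window:
            rolling -= data[i - window]       # slide the window forward
        if i + 1 - start >= window and (rolling & mask) == mask:
            pieces.append(data[start:i + 1])  # boundary found
            start = i + 1
    if start < len(data):
        pieces.append(data[start:])
    return pieces
```

Each chunk would then be stored as its own object (named by its hash), so unchanged chunks of a large file deduplicate for free.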

A new hash algorithm for Git

Posted Feb 4, 2020 11:45 UTC (Tue) by keeperofdakeys (guest, #82635) [Link] (2 responses)

It's worth pointing out that the current collisions rely on inserting or appending an arbitrary amount of data to create the collision. Git stores both a type and a size in each object's header, so it's much harder to successfully create a malicious object with the same hash as another, compared to existing attacks.

https://marc.info/?l=git&m=148787047422954
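The type-and-size binding referred to here comes from the way Git names objects: the hash covers a `"<type> <size>\0"` header prepended to the content, not the raw file bytes alone. A small sketch of that computation:

```python
import hashlib

def git_object_id(obj_type: str, data: bytes, algo: str = "sha1") -> str:
    """Compute a Git object id: hash of '<type> <size>\\0' + content."""
    header = f"{obj_type} {len(data)}".encode() + b"\x00"
    return hashlib.new(algo, header + data).hexdigest()

# Matches `git hash-object` for a file containing "hello\n":
print(git_object_id("blob", b"hello\n"))
# ce013625030ba8dba906f756967f9e9ca394464a
```

Because the declared size is part of the hashed bytes, a colliding object must also declare a matching length, which is the constraint being debated in this subthread.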

A new hash algorithm for Git

Posted Feb 4, 2020 20:40 UTC (Tue) by tialaramex (subscriber, #21167) [Link] (1 responses)

Basically, no.

Collisions are not a second-preimage attack. The bad guys create two blobs which are the same size and have the same hash, but are different. They get to show you either blob, then trick you by substituting the other one, which you'll believe is the same because it has the same SHA-1.

An attacker would need to target git specifically, yes, but it isn't particularly more difficult as a result of tracking size and type.

A new hash algorithm for Git

Posted Feb 5, 2020 15:44 UTC (Wed) by iabervon (subscriber, #722) [Link]

You can think of the collision attacks as a special kind of back door an attacker could try to add to your code: if their object passes code review and becomes relevant to the project, they can replace it with an entirely different object. Their object will contain a bunch of bytes that have no obvious purpose, which constitute the back door. The question is whether you review all bytes of all objects, rather than putting the file through a program that interprets and displays some bytes and ignores others.

Would your project notice unmotivated color table entries in an image and ask why it was done in such an unintuitive way? Would you go through the layout logic in a PDF, rather than just looking at it?

A new hash algorithm for Git

Posted Feb 4, 2020 15:04 UTC (Tue) by osma (subscriber, #6912) [Link] (3 responses)

I wonder if it would make sense to use a combination of SHA-1 and SHA-256 for the new hash, just concatenating them together. I know this is not much more secure than either hash alone in cryptographic terms, but then the shortened commit IDs would remain the same, and existing references in, say, commit messages and other sources would still be valid prefixes of the new hashes.
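As a sketch of the suggestion (purely illustrative, not anything Git implements): name objects by the SHA-1 and SHA-256 digests concatenated, so every existing abbreviated SHA-1 remains a valid prefix of the new id. Note the caveat the commenter concedes: any abbreviation of 40 hex characters or fewer is only as collision-resistant as SHA-1 itself.

```python
import hashlib

def concat_id(data: bytes) -> str:
    """A 104-hex-character id: 40 chars of SHA-1 + 64 chars of SHA-256."""
    return hashlib.sha1(data).hexdigest() + hashlib.sha256(data).hexdigest()

obj = b"blob 6\x00hello\n"
# Old short ids are prefixes of the new id, so existing references
# in commit messages etc. would still resolve.
assert concat_id(obj).startswith(hashlib.sha1(obj).hexdigest())
assert len(concat_id(obj)) == 40 + 64
```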

A new hash algorithm for Git

Posted Feb 4, 2020 20:18 UTC (Tue) by Hattifnattar (subscriber, #93737) [Link]

Wow! This is a brilliant idea! I am not sure it will be adopted, though...

A new hash algorithm for Git

Posted Feb 4, 2020 21:20 UTC (Tue) by meuh (guest, #22042) [Link]

+1, I like that.

I've not found this suggestion being rejected in https://github.com/git/git/blob/v2.25.0/Documentation/tec... but I would assume there's a catch !

A new hash algorithm for Git

Posted Feb 5, 2020 20:11 UTC (Wed) by meuh (guest, #22042) [Link]

Note: this was suggested already by epa in the previous article's comments: see https://lwn.net/Articles/716001/

A new hash algorithm for Git

Posted Feb 5, 2020 11:25 UTC (Wed) by unixbhaskar (guest, #44758) [Link] (1 responses)

Nice writeup, Jon. Thanks! It seems the change will overhaul git for good reason.

A new hash algorithm for Git

Posted Feb 6, 2020 19:59 UTC (Thu) by kpfleming (subscriber, #23250) [Link]

You. Said. Blockchain.

You did it Jon; now the ICO vultures and others will be all over this site looking for their next HODL opportunity.


Copyright © 2020, Eklektix, Inc.
This article may be redistributed under the terms of the Creative Commons CC BY-SA 4.0 license
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds