Thursday, September 23, 2010

How about that Skein hash?

Okay, nerdy post.

The Novacut video editor will "save" the edits as JSON using a simple graph-based description.  As it will be a distributed video editor, it's very important that the video and audio source files (produced by your HDSLR camera and digital audio recorder) be referenced in a globally unique way.

The solution is easy: reference these media files by their content hash.  The question is, which hash?  After some research and testing, I'm leaning toward Skein, specifically skein-512 with a 240-bit digest.  But I would love some feedback on this.  I would especially love some feedback from my former freeIPA teammates at Red Hat because, well, you people are security rock stars.  And opinionated.  Yes, Simo, I'm looking at you!  So Rob, Pavel, Martin, John, Dmitri, Simo, Stephen, what do you think?  My own rationale goes something like this:
  • The hash needs an extremely long useful life: the video edit description is designed specifically for remixing, so these hashes will become the keepers of a (hopefully) large body of read/write culture
  • At the same time, the hash should have a reasonably small digest size so it's URL friendly, easy to use in many contexts
  • Ideally the hash would have a digest size that is a multiple of 40-bits so it can be cleanly base32 encoded (I'm avoiding base64 so I can use the hash to name files, even on case-insensitive file systems)
  • sha1 (40 * 4 = 160bits) is already considered pretty broken, so that doesn't sound future proof to me
  • skein-512 is fast, has a conservative design with a large 512-bit internal state, and can produce any digest size desired (so we just pick our favorite multiple of 40-bits)
  • A 240-bit digest means a birthday collision takes about 2**120 operations, which is darn close to the 2**128 brute-force cost that gives me a warm fuzzy feeling about anything security-related
  • When we base32-encode a 240-bit digest, we get a 48-character string, which is still short enough to be fairly URL friendly
What do people think about Skein?  What do people think about the 240-bit digest size?  Should I play it safe and use a 280-bit digest?  Should I choose a shorter, even more URL friendly 200-bit digest?  Or did I nail it with 240-bits?
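To make the size trade-off concrete, here is a quick Python sketch that prints the base32 length and the birthday bound for each candidate digest size. Skein is not in the Python standard library, so this uses hashlib.blake2b purely as a stand-in that also supports selectable digest sizes; the function name b32_digest is made up for illustration.

```python
import base64
import hashlib


def b32_digest(data, bits):
    """Return the base32 string for a digest of the given size in bits.

    blake2b is only a stand-in here (Skein is not in hashlib); it is used
    because it also supports arbitrary digest sizes up to 512 bits.
    """
    if bits % 40 != 0:
        # Only multiples of 40 bits (5 bytes) base32-encode without padding.
        raise ValueError("digest size must be a multiple of 40 bits")
    digest = hashlib.blake2b(data, digest_size=bits // 8).digest()
    return base64.b32encode(digest).decode("ascii")


for bits in (160, 200, 240, 280):
    encoded = b32_digest(b"example media bytes", bits)
    # Each base32 character carries 5 bits; the birthday bound is 2**(bits/2).
    print(bits, len(encoded), "birthday bound 2**%d" % (bits // 2))
```

Whatever the hash, the encoded length is just the digest size divided by 5, so each extra 40 bits costs 8 more characters in the URL.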

The only worry I have about Skein is that the rotational constants might be changed again, which would be quite disruptive.  Not impossible to deal with (the editing format should really have a graceful way to migrate to a different hash anyway), but the timing would suck.
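One way such a graceful migration could work is to tag each identifier with the name of the hash that produced it, so old and new hashes can coexist. This is only a sketch under my own assumptions: the "marker-digest" format and the names ALGORITHMS, make_id and verify_id are hypothetical, not anything in dmedia or Novacut, and blake2b with a 240-bit digest stands in for whatever hash comes later.

```python
import base64
import hashlib

# Hypothetical registry: marker -> function returning the raw digest bytes.
ALGORITHMS = {
    "sha1": lambda data: hashlib.sha1(data).digest(),
    # Stand-in for a future 240-bit hash such as Skein (not in hashlib).
    "blake2b240": lambda data: hashlib.blake2b(data, digest_size=30).digest(),
}


def make_id(algo, data):
    """Build an identifier like 'sha1-<base32 digest>'."""
    digest = ALGORITHMS[algo](data)
    return algo + "-" + base64.b32encode(digest).decode("ascii")


def verify_id(identifier, data):
    """Check data against an identifier, using the hash its marker names."""
    algo, _, _encoded = identifier.partition("-")
    if algo not in ALGORITHMS:
        raise ValueError("unknown hash marker: %r" % algo)
    return make_id(algo, data) == identifier
```

The separator is a hyphen because it never appears in base32 output and is safe in both URLs and file names.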


I saw on Bruce Schneier's blog that a constant will be changed in Threefish, the block-cipher used by Skein:

"Even with the attack, Threefish has a good security margin. Also, the attack doesn't affect Skein. But changing one constant in the algorithm's key schedule makes the attack impossible. NIST has said they're allowing second-round tweaks, so we're going to make the change. It won't affect any performance numbers or obviate any other cryptanalytic results -- but the best attack would be 33 out of 72 rounds."

As this change will change the value of Skein hashes, I'll wait to use Skein in dmedia till after the change. Hopefully it will be completed soon. In the meantime, I'll use a base32-encoded sha1 hash in dmedia. Depending on how many adventurous beta-testers we have, I may not provide a sha1 to skein migration path.
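For the curious, a base32-encoded sha1 of a media file can be computed with nothing but the Python standard library. This is a minimal sketch rather than the actual dmedia code; the function name hash_file and the 1 MiB chunk size are my own assumptions.

```python
import base64
import hashlib


def hash_file(path, chunk_size=1024 * 1024):
    """Return the base32-encoded sha1 of the file at path."""
    h = hashlib.sha1()
    with open(path, "rb") as fp:
        # Read in chunks so large video files are never loaded whole.
        for chunk in iter(lambda: fp.read(chunk_size), b""):
            h.update(chunk)
    return base64.b32encode(h.digest()).decode("ascii")
```

A 160-bit sha1 digest encodes to 32 base32 characters, all uppercase letters and digits, so the result is safe as a file name on case-insensitive file systems.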

Also, thanks for your input, Simo!


  1. Hi Jason,
    I think the best course of action would be to make the hash "pluggable": prefix your hashes with a marker that identifies which hash is being used, and allow for future changes. This way you will have to make little or no modification to your software should you need to change the hash for whatever reason.
    This is what we do in a few places already. For example, in Directory Server, SHA hashes are stored with {SHA} as a prefix of the base64 string. A prefix of 5 to 10 characters should be short enough not to cause you URL issues.

    Have fun with this project, it looks really awesome.


  2. Simo,

    Thanks for your feedback! Yeah, something along the lines of the markers you suggested would be a good way to transition to another hash if needed.

    Do you have any opinions on the digest size? Is 240-bits enough if I want this name-space to have a very long life?

    And hey, how about that Red Hat stock price! Freakin' amazing!

  3. 240 bits is a huge size. And if I am not wrong you are not using this hash for cryptographic purposes but merely as a way to uniquely identify video chunks.
    So unless you expect someone to try to maliciously match hashes for some reason, I think it is even more than you really need. However if you want to match SHA2 you could go up to 256 bits :)

  4. Thinking ahead to Novacut, how do you want to handle the idea that a single "take" may be represented in many different files with different contents -- different encoding formats or bitrates, different trim lengths, different color corrections? They will all have different identifiers in the DB because they will have different hashes, but it would be nice if the DB can help us identify different versions of the same take.

    Maybe an optional 'takename' attribute should be added to all assets. Or maybe allow different productions to define custom attributes that they can associate with every asset when they import.

    Bob Nolty
    (I just learned about your project today via!)