Discussion:
Some thoughts on hard-to-compute hashes
Anthony Ferrara
2011-07-01 22:36:11 UTC
Hello all,

Let me preface this by saying that I'm no expert in cryptography. I
am pretty good with math and the algorithms behind the ciphers, but
I've never significantly studied them. I understand how and why they
work, but I'm not versed in the intricacies of their design or
testing. So I am posting this question to see if this thought is even
worth pursuing.

So I was conversing with Solardiz on a Drupal topic
(http://drupal.org/node/1201444#comment-4675994), and we were
discussing scrypt hashing and how there are really two forms of the
algorithm: one that uses a lot of memory, and one that uses a lot of
logic (as a tradeoff). The logic-heavy version, while significantly
more complicated and slower, could potentially be run on a massively
parallel GPU (not by parallelizing the algorithm, just by running
multiple hashes at the same time), while the memory-bound version is
not particularly suited to high parallelism due to its memory demands.

So that started me thinking: what can a general-purpose CPU do
efficiently that neither an ASIC nor a GPU could do efficiently
(at a reasonable cost, at least)? The answer I came up with was
branching logic. Modern CPUs have lots of die space devoted to
branch prediction and prefetching. Therefore, a CPU should be more
efficient (from my limited view, at least) at running a hashing
algorithm with significant branching than either a GPU or an
inexpensive ASIC (since intelligent branching logic adds complexity
and a significant number of transistors to the design).

So, if we take a memory-hard algorithm such as the one used by scrypt
and add branching logic inside the smix() function (for example), we
could theoretically make the algorithm favor general-purpose CPUs.

For example, the current smix function is:

X <-- B
For i = 0 ... N - 1
    V_i <-- X
    X <-- H(X)
For i = 0 ... N - 1
    j <-- Integerify(X) mod N
    X <-- H(X xor V_j)

What if we changed the second loop to something such as:

For i = 0 ... N - 1
    j <-- Integerify(X) mod N
    If j > (N / B)
        X <-- H(X xor V_j)
    Elseif j > (N / 2B)
        V_i <-- H(X xor V_j)
    Elseif j > (N / 4B)
        V_j <-- H(X xor V_i)
    Else
        X <-- H(X xor V_i)

Where B is a "branching factor" with N > 4B > 4. The branching factor
adjusts the probability that each branch gets executed (with B=2
giving 50%, 25%, 12.5%, and 12.5%). Increasing this factor should
adjust the processor's ability to optimize the branches (B=2 would
favor a prefetch of all branches, while large B values would favor
prediction).

The exact contents of the branches are open; I just wanted something
meaningful to demonstrate the concept.
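
To make this concrete, here is a rough C sketch of that second loop.
The block_t type, H(), and Integerify() here are hypothetical
stand-ins for the real scrypt internals, not an actual implementation:

#include <stdint.h>

#define BLOCK_WORDS 16  /* stand-in block size */

typedef struct { uint32_t w[BLOCK_WORDS]; } block_t;

/* Placeholders: a real implementation would use scrypt's BlockMix
 * and Integerify; these only show where they plug in. */
extern void H(block_t *out, const block_t *in);
extern uint64_t Integerify(const block_t *x);

static void xor_blocks(block_t *out, const block_t *a, const block_t *b)
{
    for (int k = 0; k < BLOCK_WORDS; k++)
        out->w[k] = a->w[k] ^ b->w[k];
}

/* Second loop of smix() with the proposed branching; B > 1 is the
 * branching factor. */
void smix_branchy(block_t *X, block_t *V, uint64_t N, uint64_t B)
{
    block_t t;
    for (uint64_t i = 0; i < N; i++) {
        uint64_t j = Integerify(X) % N;
        if (j > N / B) {                /* taken ~ (1 - 1/B) of the time */
            xor_blocks(&t, X, &V[j]);
            H(X, &t);
        } else if (j > N / (2 * B)) {
            xor_blocks(&t, X, &V[j]);
            H(&V[i], &t);
        } else if (j > N / (4 * B)) {
            xor_blocks(&t, X, &V[i]);
            H(&V[j], &t);
        } else {
            xor_blocks(&t, X, &V[i]);
            H(X, &t);
        }
    }
}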

Just wondering whether this idea is interesting enough to do some
testing and modeling on, or whether it's been done before and isn't
worth it (or isn't worth it for other reasons).

Thanks,

Anthony
Solar Designer
2011-07-12 08:16:48 UTC
Hi Anthony,

I'm sorry for not commenting on this sooner.
Post by Anthony Ferrara
So that started me thinking: what can a general-purpose CPU do
efficiently that neither an ASIC nor a GPU could do efficiently
(at a reasonable cost, at least)? The answer I came up with was
branching logic. Modern CPUs have lots of die space devoted to
branch prediction and prefetching. Therefore, a CPU should be more
efficient (from my limited view, at least) at running a hashing
algorithm with significant branching than either a GPU or an
inexpensive ASIC (since intelligent branching logic adds complexity
and a significant number of transistors to the design).
You're right about GPUs, but probably mostly not about ASICs. I'd
expect an ASIC not to have to implement the equivalent of a CPU's branch
predictor/prefetcher, but instead just have a hard-coded state machine
specific to your algorithm. On the other hand, if your code is large
and its different portions are dissimilar, then you might force an
optimal ASIC to effectively be a special-purpose CPU. The specific
example you provided doesn't do that, though.
Post by Anthony Ferrara
Just wondering whether this idea is interesting enough to do some
testing and modeling on, or whether it's been done before and isn't
worth it (or isn't worth it for other reasons).
It's been known before that branching is more unfriendly to stream
processors such as those in GPUs than it is to CPUs. I don't know if it's
been proposed for a KDF or a password hashing method before or not.

My concern is that it's somewhat unfriendly to CPUs as well, so we might
not be making optimal use of the CPUs' execution units (leaving them
idle while code is being fetched into L1 cache). If we consider attacks
with GPUs only (and not ASICs), then we could fit all the code in L1
cache and still achieve GPU-unfriendliness with branches. We'd have to
include some amount of parallelism, though (enough for CPUs), and we'd
rely on the CPU's branch predictor to be no worse than some minimum we
design for. Also, we'd need to keep the number of different branch
targets small enough that we don't fill up the Pentium 4 trace cache or
the like.

Overall, yes, this makes sense.

We've been considering another route to GPU-unfriendliness so far:
arbitrary memory accesses, like in Blowfish. I think the two
approaches may be combined.
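
For a concrete picture of those accesses, here is Blowfish's F()
function in C; every S-box index is derived from the data word
itself, so the access pattern can't be predicted ahead of time (the S
array is assumed to be already initialized by the key schedule):

#include <stdint.h>

/* Four 256-entry key-dependent S-boxes, assumed already filled in
 * by the Blowfish key schedule. */
extern uint32_t S[4][256];

/* Blowfish's F function: each of the four lookups is indexed by a
 * byte of the data itself, giving data-dependent memory accesses. */
static uint32_t F(uint32_t x)
{
    return ((S[0][x >> 24] + S[1][(x >> 16) & 0xff])
            ^ S[2][(x >> 8) & 0xff]) + S[3][x & 0xff];
}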

Alexander
Solar Designer
2011-07-14 17:40:57 UTC
Anthony,
Post by Anthony Ferrara
For i = 0 ... N - 1
    j <-- Integerify(X) mod N
    If j > (N / B)
        X <-- H(X xor V_j)
    Elseif j > (N / 2B)
        V_i <-- H(X xor V_j)
    Elseif j > (N / 4B)
        V_j <-- H(X xor V_i)
    Else
        X <-- H(X xor V_i)
In your example, the branching may be avoided by turning the "j > (N / B)",
etc. comparison results into all-zero and all-ones bitmasks (which is
trivial to do without branching), then applying those in expressions on
X, V_j, V_i variables. H() is then invoked from just one place,
unconditionally. Its output is similarly put into the right variable or
memory location using several expressions involving ANDs with the
bitmasks. So all of X, V_i, and V_j are written to, but only one of
those writes actually changes the value at the target location.
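
As a rough illustration of that transformation (reusing the
hypothetical block_t, H(), and Integerify() from the sketch earlier
in the thread), one iteration might look like this; all three
possible targets are written, but the masks ensure only one write
changes anything:

/* merge(dst, src, m): dst is unchanged where m == 0 and takes src
 * where m == all-ones. */
static void merge(block_t *dst, const block_t *src, uint32_t m)
{
    for (int k = 0; k < BLOCK_WORDS; k++)
        dst->w[k] = (dst->w[k] & ~m) | (src->w[k] & m);
}

void smix_branchless_iter(block_t *X, block_t *V, uint64_t N,
                          uint64_t B, uint64_t i)
{
    uint64_t j = Integerify(X) % N;

    /* comparison results as all-zero / all-ones bitmasks */
    uint32_t m0 = (uint32_t)0 - (uint32_t)(j > N / B);
    uint32_t m1 = (uint32_t)0 - (uint32_t)(j > N / (2 * B));
    uint32_t m2 = (uint32_t)0 - (uint32_t)(j > N / (4 * B));

    /* mutually exclusive case masks (the conditions nest) */
    uint32_t k0 = m0;         /* X   <-- H(X xor V_j) */
    uint32_t k1 = m1 & ~m0;   /* V_i <-- H(X xor V_j) */
    uint32_t k2 = m2 & ~m1;   /* V_j <-- H(X xor V_i) */
    uint32_t k3 = ~m2;        /* X   <-- H(X xor V_i) */

    /* select the second XOR input without branching: V_j in the top
     * two cases (mask m1), V_i otherwise */
    block_t in, out;
    for (int k = 0; k < BLOCK_WORDS; k++)
        in.w[k] = X->w[k] ^ ((V[j].w[k] & m1) | (V[i].w[k] & ~m1));

    H(&out, &in);             /* H() invoked from one place only */

    /* all three targets are written; only one actually changes */
    merge(X, &out, k0 | k3);
    merge(&V[i], &out, k1);
    merge(&V[j], &out, k2);
}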

Here's another example of a branching algorithm turned into a
non-branching one:

http://lists.randombit.net/pipermail/cryptography/2010-September/000184.html

So it'll take some more effort to make branch-avoidance costly - e.g.,
you could use very different H() functions in the different cases, so
a branchless implementation would have to compute each of them,
whereas a branching one would compute just the one needed for the
current iteration. In that variation of your example, you'd achieve a
4x slowdown for a GPU that has to avoid branching.
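
A sketch of that variation, with four hypothetical and deliberately
dissimilar hash functions H0()..H3(); a branching implementation
computes exactly one per iteration, while a branch-avoiding one must
compute all four and mask-select the result:

/* Four hypothetical, deliberately different hash functions. */
extern void H0(block_t *out, const block_t *in);
extern void H1(block_t *out, const block_t *in);
extern void H2(block_t *out, const block_t *in);
extern void H3(block_t *out, const block_t *in);

/* Branching defender: one H per call. A branch-avoiding (e.g. GPU)
 * implementation would have to run all four and select with
 * bitmasks -- roughly a 4x slowdown. */
static void H_select(block_t *out, const block_t *in,
                     uint64_t j, uint64_t N, uint64_t B)
{
    if (j > N / B)             H0(out, in);
    else if (j > N / (2 * B))  H1(out, in);
    else if (j > N / (4 * B))  H2(out, in);
    else                       H3(out, in);
}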

Alexander
Anthony Ferrara
2011-07-14 19:45:32 UTC
That's quite fair. I was just using that as a demonstration of the
concept.

But that was a very interesting read. The technique makes perfect
sense and is intuitive once explained, but it's still quite
interesting to see spelled out.

Thanks for the input,

Anthony
Post by Solar Designer
Anthony,
Post by Anthony Ferrara
For i = 0 ... N - 1
    j <-- Integerify(X) mod N
    If j > (N / B)
        X <-- H(X xor V_j)
    Elseif j > (N / 2B)
        V_i <-- H(X xor V_j)
    Elseif j > (N / 4B)
        V_j <-- H(X xor V_i)
    Else
        X <-- H(X xor V_i)
In your example, the branching may be avoided by turning the "j > (N / B)",
etc. comparison results into all-zero and all-ones bitmasks (which is
trivial to do without branching), then applying those in expressions on
X, V_j, V_i variables.  H() is then invoked from just one place,
unconditionally.  Its output is similarly put into the right variable or
memory location using several expressions involving ANDs with the
bitmasks.  So all of X, V_i, and V_j are written to, but only one of
those writes actually changes the value at the target location.
http://lists.randombit.net/pipermail/cryptography/2010-September/000184.html
So it'll take some more effort to make branch-avoidance costly - e.g.,
you could use very different H() functions in the different cases, so
a branchless implementation would have to compute each of them,
whereas a branching one would compute just the one needed for the
current iteration. In that variation of your example, you'd achieve a
4x slowdown for a GPU that has to avoid branching.
Alexander