using scrypt for user authentication

Post by Solar Designer
1. Use such settings that scrypt doesn't use more than, say, 1 MB of
memory. Is 1 MB way too low? Is scrypt at this setting significantly
better than bcrypt or not?

According to Colin Percival's BSDCan2009 paper the amortized cost (chip
area times time) of scrypt is (at least) 1024 N^2 r^2 p s t where
parameters N and r determine the size of memory (1024 N r + O(r) bits), p
is a paralellization parameter and s and t are unit costs of storage and
computation.

The paper claims the cost of scrypt with (N, r, p) = (2^14, 8, 1) is
approximately 35 times higher than the cost of bcrypt with cost = 11 while
the time needed to compute both of those functions on a general-purpose
CPU is comparable. These ratios are probably quite stable even when
hardware evolves and unit costs (s, t) change.

The aformentioned parameters (N = 2^14, r = 8) correspond to 16 MiB of RAM
if my calculation is correct. In order to reduce memory consumption to
1 MiB you would have to reduce the product of N and r 16-fold. p can be
increased from 1 to 16 now but the overall cost would still be reduced by
a factor of 16 because its dependence on N and r is quadratic.

Such a change would degrade the strength of scrypt almost to the level of
bcrypt. On customized hardware. On the other hand, it would probably use
enough memory and memory bandwidth to choke GPUs and other hardware that
has not been explicitly designed to crack it.

--
Pavel Kankovsky aka Peak / Jeremiah 9:21 \
"For death is come up into our MS Windows(tm)..." \ 21st century edition /

Solar Designer

2012-09-02 15:13:37 UTC

Since this thread is so old, I'll over-quote, then add the new stuff

Post by Pavel Kankovsky

Post by Solar Designer
1. Use such settings that scrypt doesn't use more than, say, 1 MB of
memory. Is 1 MB way too low? Is scrypt at this setting significantly
better than bcrypt or not?

According to Colin Percival's BSDCan2009 paper the amortized cost (chip
area times time) of scrypt is (at least) 1024 N^2 r^2 p s t where
parameters N and r determine the size of memory (1024 N r + O(r) bits), p
is a paralellization parameter and s and t are unit costs of storage and
computation.
The paper claims the cost of scrypt with (N, r, p) = (2^14, 8, 1) is
approximately 35 times higher than the cost of bcrypt with cost = 11 while
the time needed to compute both of those functions on a general-purpose
CPU is comparable. These ratios are probably quite stable even when
hardware evolves and unit costs (s, t) change.
The aformentioned parameters (N = 2^14, r = 8) correspond to 16 MiB of RAM
if my calculation is correct. In order to reduce memory consumption to
1 MiB you would have to reduce the product of N and r 16-fold. p can be
increased from 1 to 16 now but the overall cost would still be reduced by
a factor of 16 because its dependence on N and r is quadratic.
Such a change would degrade the strength of scrypt almost to the level of
bcrypt. On customized hardware. On the other hand, it would probably use
enough memory and memory bandwidth to choke GPUs and other hardware that
has not been explicitly designed to crack it.

We got a curious real-world data point due to Litecoin using scrypt with
(2^10, 1, 1), which corresponds to 128 KiB. I guess the (misguided?)
intent was to use L2 cache instead of RAM. IIRC, scrypt was designed
not to bump into RAM bandwidth on typical computers anyway, so there was
little point in limiting it to just L2 cache. Well, perhaps it'd bump
into RAM bandwidth with many CPU cores running instances of scrypt at once.
(How many concurrent instances would this take?)

A result is that GPU miners for Litecoin now exist, and they outperform
CPUs roughly by a factor of 10 per-chip:

https://github.com/litecoin-project/litecoin/wiki/Mining-hardware-comparison

Yes, they do optionally use the time-memory tradeoff previously
mentioned in here:

http://www.openwall.com/lists/crypt-dev/2011/07/01/1
http://www.openwall.com/lists/crypt-dev/2011/07/01/2

although according to the documentation for cgminer "performance peaks
at a gap of 2", which corresponds to very light use of recomputation
(avoiding storage of every other lookup table element). Thus, perhaps
part of the 10x speedup comes not from the GPUs' computing power, but
rather from GPU cards' memory subsystem being on par with or faster than
CPUs' L2 cache for scrypt's access pattern (entire cache lines being
requested) and from the greater concurrency of such accesses.

For comparison, current implementations of bcrypt on CPU and GPU achieve
roughly the same speed per-chip (no advantage from GPUs yet, except that
it may be cheaper to build hobbyist multi-GPU than multi-CPU systems).

Thus, we can see that scrypt with improper settings can be worse
than bcrypt for a very realistic attack. (For attacks with ASICs,
things may be different, but attacks with GPUs are relevant too.)

Curiously, bcrypt uses only 4 KiB, which is less than Litecoin scrypt's
128 KiB. However, bcrypt's use of L1 (rather than L2) cache on CPUs
provides an advantage to CPU implementations, and its access pattern
(random 32-bit accesses) makes attempts to use L2+ cache or RAM (for
many concurrent instances of bcrypt) very inefficient. On GPUs, best
performance at bcrypt is currently achieved by using local memory (very
limited in size) and keeping many of the execution units idle (since
there's not enough local memory to use them all). Trying to use global
memory (partially cached) slows things down. (A local+global mixed
implementation might be a little bit faster than local alone, though.
This was briefly discussed on john-dev.) Credit: the experiments with
bcrypt on GPU were performed by Sayantan Datta.

It is plausible that scrypt at 1 MiB is comparable to bcrypt in terms of
attacks with GPUs, but this is not certain. scrypt at 1 MiB might as
well be somewhat weaker than bcrypt in this respect.

Overall, I think scrypt at 1 MiB is not a good replacement for bcrypt
currently. We need bigger settings. (And if we go for much bigger
settings, this may imply a need to limit concurrency when scrypt is used
on authentication servers.)

We might also want to have a tradeoff-defeating variation of scrypt, as
Anthony Ferrara suggested or otherwise. It appears that this would make
scrypt maybe a factor of 2 more resistant to attacks with current GPUs
at Litecoin's low settings and perhaps a lot more at bigger settings
(where the total GPU card's RAM size is more easily hit if the tradeoff
is not used in an attack). Maybe this property should be configurable
(a Boolean parameter, to be encoded along with the resulting encrypted
data or password hash).

As usual, I'd appreciate any comments.

Alexander

Colin Percival

2012-09-10 00:52:12 UTC

Post by Pavel Kankovsky
The paper claims the cost of scrypt with (N, r, p) = (2^14, 8, 1) is
approximately 35 times higher than the cost of bcrypt with cost = 11

... if you're building ASICs to crack passwords.

Post by Solar Designer
We got a curious real-world data point due to Litecoin using scrypt with
(2^10, 1, 1), which corresponds to 128 KiB. I guess the (misguided?)
intent was to use L2 cache instead of RAM.

Yes, that definitely qualifies as "misguided".

Post by Solar Designer
IIRC, scrypt was designed
not to bump into RAM bandwidth on typical computers anyway

Correct.

Post by Solar Designer
Thus, we can see that scrypt with improper settings can be worse
than bcrypt for a very realistic attack. (For attacks with ASICs,
things may be different, but attacks with GPUs are relevant too.)

This doesn't surprise me at all. The idea behind the design of scrypt
is to make the memory-time product the deciding factor in cost -- and
to stay far away from other bottlenecks on modern hardware (e.g., L1
cache bandwidth) so that we can use as much RAM as possible.

In some ways -- available parallelism and RAM:compute die area ratio --
GPUs fall somewhere between CPU and ASIC, but on "random access to small
amounts of RAM" I'm not at all surprised to find that they are further
in the "CPU" direction than CPUs themselves.

Post by Solar Designer
Overall, I think scrypt at 1 MiB is not a good replacement for bcrypt
currently. We need bigger settings.

I agree.

Post by Solar Designer
(And if we go for much bigger
settings, this may imply a need to limit concurrency when scrypt is used
on authentication servers.)

What sort of authentication servers are you running where you only have
1 MB of RAM per CPU core?

Post by Solar Designer
We might also want to have a tradeoff-defeating variation of scrypt, as
Anthony Ferrara suggested or otherwise. It appears that this would make
scrypt maybe a factor of 2 more resistant to attacks with current GPUs
at Litecoin's low settings and perhaps a lot more at bigger settings
(where the total GPU card's RAM size is more easily hit if the tradeoff
is not used in an attack). Maybe this property should be configurable
(a Boolean parameter, to be encoded along with the resulting encrypted
data or password hash).

This seems like an attempt to patch over the problem of "scrypt not being
used as intended".

--
Colin Percival
Security Officer Emeritus, FreeBSD | The power to serve
Founder, Tarsnap | www.tarsnap.com | Online backups for the truly paranoid

Solar Designer

2012-09-10 01:24:14 UTC

Post by Colin Percival
In some ways -- available parallelism and RAM:compute die area ratio --
GPUs fall somewhere between CPU and ASIC, but on "random access to small
amounts of RAM" I'm not at all surprised to find that they are further
in the "CPU" direction than CPUs themselves.

Yes, except that it's random access to cache line sized blocks of data,
unlike what we have e.g. in bcrypt.

Post by Solar Designer
(And if we go for much bigger
settings, this may imply a need to limit concurrency when scrypt is used
on authentication servers.)

What sort of authentication servers are you running where you only have
1 MB of RAM per CPU core?

Not per CPU core, but maybe per concurrent authentication attempt - if
concurrency is not limited. If we simply introduce scrypt as a
supported crypt(3) flavor in an OS, then its memory usage needs to be
sane in the event of occasional spikes in concurrent authentication
attempts, including on rather small systems (e.g., 256 MB RAM VPSes).

A solution to this could be limiting concurrency - perhaps to the number
of CPU cores, as you suggest. I mentioned some approaches to this here:

http://www.openwall.com/lists/crypt-dev/2011/05/12/4

I'd appreciate your comments on this.

For dedicated authentication servers at some specific organization, it
can be a custom interface, so limiting the concurrency is easier - and
much larger amounts of RAM may be used anyway.

This seems like an attempt to patch over the problem of "scrypt not being
used as intended".

I am not proposing this as an alternative to using larger memory sizes
(as intended). I am proposing it for use along with scrypt's intended
settings. Due to the tradeoff, GPUs may be potentially reasonably
usable for attacking scrypt even with large memory settings. If a user
of scrypt (e.g., a sysadmin) doesn't expect to need to crack scrypt on
GPUs (and most won't), the user will be able to turn the knob and make
scrypt more GPU-unfriendly. Note that since GPUs are only efficient
when a lot of parallelism is available, they're likely not reasonably
usable for defensive scrypt computation anyway (unless "p" is set very
high in advance).

Thank you for your helpful comments!

Alexander

Colin Percival

2012-09-10 07:46:54 UTC

Post by Colin Percival
What sort of authentication servers are you running where you only have
1 MB of RAM per CPU core?

If you're seeing enough concurrent authentication attempts that using
16 MB for each of them comes close to eating up your 256 MB of RAM, odds
are that they're simply never going to finish due to CPU utilization
alone...

Post by Solar Designer
A solution to this could be limiting concurrency - perhaps to the number
http://www.openwall.com/lists/crypt-dev/2011/05/12/4
I'd appreciate your comments on this.
For dedicated authentication servers at some specific organization, it
can be a custom interface, so limiting the concurrency is easier - and
much larger amounts of RAM may be used anyway.

I admit that I haven't given much thought to the details of integrating
scrypt into crypt(3), beyond noting that the "100 ms" parameters yield a
memory usage of 16 MB which seems unlikely to impose an excessive strain
on many systems.

The case which interests me more is the large websites which make a habit
of leaking millions of hashed passwords, and I would expect them to set
up internal scrypt-computation services.

This seems like an attempt to patch over the problem of "scrypt not being
used as intended".

Point taken... although one supposes that a GPU might be a solution to your
hypothesized denial-of-service attack problem. :-)

--
Colin Percival
Security Officer Emeritus, FreeBSD | The power to serve
Founder, Tarsnap | www.tarsnap.com | Online backups for the truly paranoid

Solar Designer

2012-09-10 09:29:35 UTC

Post by Colin Percival
If you're seeing enough concurrent authentication attempts that using
16 MB for each of them comes close to eating up your 256 MB of RAM, odds
are that they're simply never going to finish due to CPU utilization
alone...

This depends on duration of the spike in concurrent authentication
attempts. For example, at 1 MB, if 50 concurrent attempts arrive, but
are not followed by any more for at least a few seconds, the system will
survive with no long-term impact on other running services. At 16 MB,
it will temporarily fully run out of memory, so other services may be
impacted (requiring restart).

I admit this may be a contrived example and possibly not the most
typical case, yet it is realistic. Those many virtualized systems with
relatively small memory limits do fail on out-of-memory conditions once
in a while. Replacing e.g. bcrypt with scrypt at 16 MB will increase
the frequency of such failures.

Anyhow, even if you convince me that scrypt at 16 MB is OK for crypt(3),
we'd have a hard time convincing others of it. Besides, I'd consider
using more than 16 MB if we limit concurrency.

Post by Colin Percival
The case which interests me more is the large websites which make a habit
of leaking millions of hashed passwords, and I would expect them to set
up internal scrypt-computation services.

Yes, this interests me as well. With proper concurrency limits, we
could even use gigabytes of RAM per scrypt instance there. However,
then it becomes challenging to keep the runtime per password hashed
sane. Considering memory bandwidth, we might need to stay at up to
1 GB per instance currently - or less if we don't let one hash
computation fully use the memory bandwidth, to let other CPU cores
use the same memory bus in parallel. That's a pity.

Post by Solar Designer
We might also want to have a tradeoff-defeating variation of scrypt

[...]

Post by Solar Designer
I am not proposing this as an alternative to using larger memory sizes
(as intended). I am proposing it for use along with scrypt's intended
settings. Due to the tradeoff, GPUs may be potentially reasonably
usable for attacking scrypt even with large memory settings. If a user
of scrypt (e.g., a sysadmin) doesn't expect to need to crack scrypt on
GPUs (and most won't), the user will be able to turn the knob and make
scrypt more GPU-unfriendly. Note that since GPUs are only efficient
when a lot of parallelism is available, they're likely not reasonably
usable for defensive scrypt computation anyway (unless "p" is set very
high in advance).

Point taken... although one supposes that a GPU might be a solution to your
hypothesized denial-of-service attack problem. :-)

GPU cards are actually within consideration for "large websites" like
you mentioned, specifically for higher memory bandwidth if we're talking
scrypt - but there are _many_ challenges in using them like that.

Thanks,

Alexander

Colin Percival

2012-09-10 10:55:22 UTC

True.

Post by Solar Designer
[...]
Anyhow, even if you convince me that scrypt at 16 MB is OK for crypt(3),
we'd have a hard time convincing others of it. Besides, I'd consider
using more than 16 MB if we limit concurrency.

I said 16 MB because that's what you end up using given the parameters
(N, r, p) = (2^14, 8, 1) which I mentioned in my paper as being good for
100 ms of KDF computation. Obviously that can be adjusted, but you can't
use more than 16 MB of RAM unless you spend longer than 100 ms (or have a
faster CPU than what I had in my laptop 3 years ago -- I'm guessing that
we're at 32 MB of RAM for 100 ms by now).

Depending on your value of "sane". There's a reason why my paper gives
examples of 100 ms for interactive logins and 5 s for file encryption.

--
Colin Percival
Security Officer Emeritus, FreeBSD | The power to serve
Founder, Tarsnap | www.tarsnap.com | Online backups for the truly paranoid

Solar Designer

2012-09-30 11:25:16 UTC

Colin,

Post by Colin Percival
I said 16 MB because that's what you end up using given the parameters
(N, r, p) = (2^14, 8, 1) which I mentioned in my paper as being good for
100 ms of KDF computation. Obviously that can be adjusted, but you can't
use more than 16 MB of RAM unless you spend longer than 100 ms (or have a
faster CPU than what I had in my laptop 3 years ago -- I'm guessing that
we're at 32 MB of RAM for 100 ms by now).

You're right about the 16 MB / 32 MB, and I agree that 100 ms is a sane
time to spend hashing as it relates to user experience with interactive
logins. However, based on some back of the envelope calculation I did,
100 ms might not be affordable when the number of users is very large
(millions), logins are frequent (millions per day), and accounts are of
relatively low value. I think reasonable values are in the range 1 ms
to 100 ms.

Anyhow, even at 100 ms the 32 MB upper limit on what we can use with
scrypt bothers me. I want it to be a lot higher. If we consider the
memory bandwidth as the hard limiting factor on how much memory we can
use in 100 ms, it'd be a lot more than 32 MB.

Trying to see how much further we can go by replacing Salsa20/8 with
something else (or simply reducing the number of rounds), I tried
placing a "return 0;" at the start of salsa20_8(). With this, I am able
to fit 64 MB in 100 ms. Not that much of a change. :-(

This is on an older server, though (Xeon E5420's, and memory bandwidth
corresponding to that epoch). I'll re-test on a newer one.

If the desired duration is 1 ms, scrypt is unreasonable to use at all -
at those settings, it's weaker than bcrypt.

Do you have thoughts on a possible revision of scrypt to allow it to use
more memory at these low durations?

I think it should be possible to have a memory-hard function that would
let us use hundreds of MB at 100 ms and a few MB at 1 ms, on currently
typical server hardware (memory bandwidth of ~5 to 10 GB/sec).
Scalability will be poor, though (that is, if we run multiple instances
on the different CPU cores, they'll compete for the memory bandwidth as
we'd have to approach it closely even in just one instance).

Could something like Elias Yarrkov's zen32(), but adjusted to access
entire cache lines, be it? What do you think of it? I guess you've
seen the postings on the cryptography list. For those who haven't,
we're talking about the KDF portion of
http://cipherdev.org/dhbitty.html

Thanks,

Alexander

Solar Designer

2012-10-01 06:56:33 UTC

Post by Solar Designer
Trying to see how much further we can go by replacing Salsa20/8 with
something else (or simply reducing the number of rounds), I tried
placing a "return 0;" at the start of salsa20_8(). With this, I am able
to fit 64 MB in 100 ms. Not that much of a change. :-(

Correction: once I've excluded some overhead and system time from my
measurements, I am able to go to 256 MB with salsa20_8() NOP'ed out, yet
stay under 100 ms. I am not sure if avoiding the system time would be
possible in actual use, though - I just presume that it might be.

This is (2^18, 8, 1) with salsa20_8() NOP'ed out on E5649 (2.53 GHz):

real 0m0.240s
user 0m0.080s
sys 0m0.159s

At 2 rounds of Salsa20 (instead of 8), I am still only able to go for 64 MB:

(2^16, 8, 1), 2 rounds, same CPU:

real 0m0.124s
user 0m0.087s
sys 0m0.037s

(2^15, 8, 1), 8 rounds (default), same CPU:

real 0m0.121s
user 0m0.100s
sys 0m0.020s

I also did some testing with other values of r, with higher p, with
"#pragma omp parallel for" for the p loop, and with concurrent instances
(without parallelization within the instances) - for up to 24 (this
machine has two E5649's, so 12 cores, 24 logical).

As expected, scalability is quite poor, except at relatively low working
set sizes (such as fitting in the 12 MB L3 cache). I don't have time to
properly process and post all of the results right now.

I was compiling the SSE2 intrinsics code for x86-64 using gcc 4.6.2 with
"-march=native -O3 -fomit-frame-pointer".

Alexander

Solar Designer

2012-11-16 03:47:15 UTC

Post by Solar Designer
Correction: once I've excluded some overhead and system time from my
measurements, I am able to go to 256 MB with salsa20_8() NOP'ed out, yet
stay under 100 ms. I am not sure if avoiding the system time would be
possible in actual use, though - I just presume that it might be.
real 0m0.240s
user 0m0.080s
sys 0m0.159s

I think I need to clarify:

Fully NOP'ing out salsa20_8() resulted in a non-random access pattern to
V, which made unrealistically good use of caches. So it tells us how
fast the remaining code can work when accessing caches rather than RAM.
It does not mean we can actually use 256 MB in under 100ms, even with a
very fast data mixing function, if that function actually works.

Post by Solar Designer
real 0m0.124s
user 0m0.087s
sys 0m0.037s
real 0m0.121s
user 0m0.100s
sys 0m0.020s

Alexander

Colin Percival

2012-10-01 20:49:15 UTC

Post by Solar Designer
You're right about the 16 MB / 32 MB, and I agree that 100 ms is a sane
time to spend hashing as it relates to user experience with interactive
logins. However, based on some back of the envelope calculation I did,
100 ms might not be affordable when the number of users is very large
(millions), logins are frequent (millions per day), and accounts are of
relatively low value. I think reasonable values are in the range 1 ms
to 100 ms.

Here's my back-of-the-envelope calculation: Assume 100 ms / 32 MB, a 25%
load factor, and we're using EC2 c1.medium instances at $0.165/hour (which
is far more than you'd pay setting up a cluster). That's 2 cpus * 36000
hashes per cpu-hour * 25% load factor = 18000 hashes per hour, or slightly
under 10 microdollars per hash. I can't think of any website I enter a
password into more than five times a day -- most are far less than that,
since like most people, I stay cookied on twitter/facebook/etc. -- so I
would cost at most $0.02 per year.

For typical sites, where I stay cookied for weeks at a time, it would be
under $0.001 per year; or even less if they set up their own password hash
cluster instead of renting EC2 instances by the hour. I'm having trouble
imagining a company which can't afford this.

Post by Solar Designer
Anyhow, even at 100 ms the 32 MB upper limit on what we can use with
scrypt bothers me. I want it to be a lot higher. If we consider the
memory bandwidth as the hard limiting factor on how much memory we can
use in 100 ms, it'd be a lot more than 32 MB.

Agreed. It's possible that I erred too far on the side of caution in using
salsa20/8 in the construction; at Passwords'12 (Oslo, early December) I'm
hoping I'll get a chance to corner Joan Daemen and get some input from him
about this -- I wouldn't be surprised if a reduced-round version of Keccak
turned out to be optimal for this.

Post by Solar Designer
If the desired duration is 1 ms, scrypt is unreasonable to use at all -
at those settings, it's weaker than bcrypt.

Agreed.

Post by Solar Designer
Do you have thoughts on a possible revision of scrypt to allow it to use
more memory at these low durations?

I don't want to retroactively redefine scrypt, but that doesn't mean that
we can't discuss a possible replacement for it. :-)

--
Colin Percival
Security Officer Emeritus, FreeBSD | The power to serve
Founder, Tarsnap | www.tarsnap.com | Online backups for the truly paranoid

Solar Designer

2012-10-04 01:13:44 UTC

Colin,

First of all, thank you for these cost estimates!

Post by Colin Percival
Here's my back-of-the-envelope calculation: Assume 100 ms / 32 MB, a 25%
load factor, and we're using EC2 c1.medium instances at $0.165/hour (which
is far more than you'd pay setting up a cluster). That's 2 cpus * 36000
hashes per cpu-hour * 25% load factor = 18000 hashes per hour, or slightly
under 10 microdollars per hash. I can't think of any website I enter a
password into more than five times a day -- most are far less than that,
since like most people, I stay cookied on twitter/facebook/etc. -- so I
would cost at most $0.02 per year.

$0.02 per user per share is actually quite a lot for some orgs. Note
that this is not 100% of what a user costs, but it's _extra_ cost for
better password security (which might not provide any ROI, unless the
org brags about it for good publicity).

Let's consider Facebook. Wikipedia says they have 955 million of active
users as of June 2012, and their 2011 revenue was $3.71 billion. That's
about $4 per user, so $0.02 would be 0.5% - I'd say it's quite a lot.
Now let's consider their profit rather than revenue. The data I was
able to find quickly (by looking up the FB ticker) is 46.77B market
capitalization and 121.87 P/E, which gives the earnings estimate at 384M.
(I arrive at the same number going from EPS 0.18 and share count at 2.14B
as well.) So that's only about $0.40 per user per year. $0.02 would
thus be 5% - truly a lot! Granted, Facebook is going to try to improve
their monetization a lot, and investors expect them to (which is why P/E
is so high) - yet I guess the current low earnings affect what the
company would be willing to spend.

I think that 25% load factor (average / worst-spike) is optimistic.
Note that we need to consider even very short duration spikes.

The expectation that scrypt will scale to 2 CPU cores perfectly is also
optimistic. Well, it might scale to 2, but likely not much/any further,
unless we specifically revise it for scalability or/and have it use
caches rather than DRAM on a shared bus. In my testing on 12-core
(24 logical), it only scaled to 2x or 3x the throughput of a single
instance at best (and in some of the tests it did not scale at all -
that is, same throughput for 1 or 24 instances was seen), unless I
reduced its working set size to fit in L3 or smaller. Scalability can
be improved by increasing the Salsa20 round count, but then memory usage
(size) in under 100 ms becomes even lower. These trade-offs are rather
fundamental, I think.

Post by Colin Percival
For typical sites, where I stay cookied for weeks at a time, it would be
under $0.001 per year; or even less if they set up their own password hash
cluster instead of renting EC2 instances by the hour. I'm having trouble
imagining a company which can't afford this.

OK. They may be able to afford this. I am not saying that they can't.
Rather, I guess they'd choose not to spend a non-negligible amount of
money (in absolute terms, as well as a percentage of earnings per user)
on something with no or little expected ROI.

As discussed off-list, we need a hashing method reasonably usable (and
consistently stronger than bcrypt) for the 1 ms to 100 ms range, which I
think is the reasonable range to support for user authentication.
Supporting 1 ms will ease adoption.

Cool!

A concern: let's not forget that we should also use the CPU efficiently.
This probably means having and using more parallelism at instruction and
SIMD level.

I think that a single instance of Salsa20 (with whatever round count)
doesn't use a modern CPU fully. It does use 128-bit vectors, but I
think there's not enough parallelism between instructions to fully use
the available execution units and fully hide instruction latencies.
Also, as you're aware, we'll need to support 256-bit soon.

Post by Solar Designer
Do you have thoughts on a possible revision of scrypt to allow it to use
more memory at these low durations?

I don't want to retroactively redefine scrypt, but that doesn't mean that
we can't discuss a possible replacement for it. :-)

Sounds good.

I was thinking that maybe we could extend rather than redefine or
replace scrypt. That is, add more parameters to scrypt. With some
values for the additional parameters, it'd be the current scrypt.

Thanks again,

Alexander

Solar Designer

2012-09-10 02:27:44 UTC

Post by Solar Designer
We might also want to have a tradeoff-defeating variation of scrypt, as
Anthony Ferrara suggested or otherwise.

For context, here's Anthony Ferrara's comment that I am referring to:

http://drupal.org/node/1201444#comment-4675994

Here's an attack on Anthony's tradeoff defeater ("simply writing to V_j
in the second loop"):

Assume that we don't want to use the tradeoff to its fullest, but rather
just halve our memory needs. Without the tradeoff defeater, this is
accomplished by storing every other V_j in an array that is twice
smaller than original. The odd-indexed V_j's are then recomputed as
needed from the preceding values that we did store.

Now, with the tradeoff defeater we can't recompute these values like
that because the preceding value might have been modified by prior
iterations of the second loop. Or can we? Actually, quite often we
can: by the time the second loop terminates, it will have modified only
about 63% of V elements. And only 39% when we're at N/2 iterations.

Of course, "quite often" is not sufficient for proper computation of the
entire scrypt, so we do need to store the entire V array (not just its
half) somewhere. However, we may be able to arrange things such that
one half of V is "hot" (accessed more frequently) and the other is
"cold" (accessed less frequently). The other half may then be stored in
slower memory (e.g., off-chip).

Similarly, we can't store odd-indexed modified V_j values into the hot
half (so we'll have to store/load them from the cold one). Or can we?
Actually, we can, by agreeing that the hot half stores either of the
two values in each element (and the cold half stores the other).

Overall, we need two one-bit flags per element in the hot half, or one
extra bit per V element on average (two for the hot half and none for
the cold half). The flags themselves need to be in hot memory. Yet our
overall hot memory needs have almost halved, like they would without the
tradeoff defeater.

The possible flag values may be:

00 - original even-indexed value from the first loop is stored
01 - modified even-indexed value is stored (can't use for recomputation)
10 - odd-indexed value is stored (the even-indexed one is in cold)

Does this mean that Anthony's tradeoff defeater is not worthwhile? Not
quite. The cold memory is nevertheless accessed pretty often - just not
as often as the hot portion. So there is some reduction in attack speed
due to the defeater even in this special case. Then, if we try to
reduce our hot memory needs by more than a factor of 2, the corresponding
revisions of the attack described above become a lot less efficient
(mostly in terms of frequency of accesses to the cold portion).

Yet we could enhance the tradeoff defeater to make the attack less
efficient: write into more than one V element per loop iteration and/or
write into such V elements that more of them are written to (perhaps
100% instead of just 63% by loop end).

Alexander

Solar Designer

2013-03-17 04:03:21 UTC

Colin, Anthony, all -

The message I am following up on is the one where I described an attack
on Anthony's scrypt time-memory tradeoff defeater:

http://www.openwall.com/lists/crypt-dev/2012/09/10/3

I'll skip most of its content now, and only comment on my attempt to
actually implement a time-memory tradeoff defeater of this sort (despite
of the possible attack).

Post by Solar Designer
http://drupal.org/node/1201444#comment-4675994
Here's an attack on Anthony's tradeoff defeater ("simply writing to V_j

[... lots of text skipped ...]

Post by Solar Designer
Does this mean that Anthony's tradeoff defeater is not worthwhile? Not
quite. The cold memory is nevertheless accessed pretty often - just not
as often as the hot portion. So there is some reduction in attack speed
due to the defeater even in this special case. Then, if we try to
reduce our hot memory needs by more than a factor of 2, the corresponding
revisions of the attack described above become a lot less efficient
(mostly in terms of frequency of accesses to the cold portion).
Yet we could enhance the tradeoff defeater to make the attack less
efficient: write into more than one V element per loop iteration and/or
write into such V elements that more of them are written to (perhaps
100% instead of just 63% by loop end).

So I've tried to implement an enhancement like what I described above:
writing not only to V_j, but also to V_i (with another value).
Unfortunately, it turns out that BlockMix's shuffling(*) gets in the way
of efficient implementations, and _possibly_ the inefficiency mostly hurts
cleaner CPU code, as opposed to hacks and ASICs. Specifically, when the
V element we write into is or might be the same as V_j, in a clean
implementation we have to postpone those writes until the BlockMix
completes, and this means re-reading the same values from memory
(hopefully, from L1 cache) rather than simply storing register values
into two locations at once (such as X and V_j). In order to do the
latter without introducing extra checks, I had to limit the writes to
elements that are definitely not same as V_j; I chose (~j & (N - 1)) for
their indices.

(*) which I am still looking for an explanation of:
http://mail.tarsnap.com/scrypt/msg00059.html

I've attached a diff between two of my code revisions. escrypt-23
implements the official scrypt, with only some optimizations to
crypto_scrypt-sse.c. escrypt-24 renames the function to escrypt and
adds a Boolean defeat_tmto parameter. When it is zero, the function
returns the same values as the official scrypt does. When it is
non-zero, the trivial TMTO defeater as described above is enabled.
It is implemented in all three versions of the code (-ref, -nosse,
-sse), although only the -sse revision is (partially) optimized (reuses
register values instead of copying memory). (Checking defeat_tmto
outside of the loop and then using one of two specialized loop versions
is an obvious further optimization.)

I may consider adding a TMTO defeater by means of changes to the first
loop as well - an approach needed for my "ROM-port hard functions" idea
anyway, because the ROM can't be written to in the second loop.

I'd appreciate any comments.

For testing:
MD5 of output of "tests" with defeat_tmto = 0:
4455b1ce0529e7f877de53f24ff78bec
MD5 of output of "tests" with defeat_tmto = 1:
43f6df099d2d6dec402960ae41ee1d08

Alexander

Solar Designer

2013-03-17 21:53:04 UTC

Colin, Anthony, all -

Post by Solar Designer
writing not only to V_j, but also to V_i (with another value).
Unfortunately, it turns out that BlockMix's shuffling(*) gets in the way
of efficient implementations, and _possibly_ the inefficiency mostly hurts
cleaner CPU code, as opposed to hacks and ASICs. Specifically, when the
V element we write into is or might be the same as V_j, in a clean
implementation we have to postpone those writes until the BlockMix
completes, and this means re-reading the same values from memory
(hopefully, from L1 cache) rather than simply storing register values
into two locations at once (such as X and V_j).

Essentially, we'd have a hard to optimize out blkcpy(), which is bad in
that we're not doing any crypto on the CPU while it runs. I think we
prefer to have a fine mix of crypto code and memory accesses, which makes
greater use of CPU resources.

Post by Solar Designer
In order to do the
latter without introducing extra checks, I had to limit the writes to
elements that are definitely not same as V_j; I chose (~j & (N - 1)) for
their indices.

Here's what I think is a better idea: instead of "NOT j (mod N)", use
"j XOR 1" (assumes N >= 2). This way, with halved r (e.g., moving from
r=8 to r=4 when enabling the TMTO defeater) we maintain the same TLB
pressure (since the V element to update is then on the same memory page,
if r was such that one V element fit in a page at all). As a bonus,
this avoids the extra "& (N - 1)".

Also, rather than merely overwrite V elements, we can use them first.
I've implemented XOR'ing in of the adjacent V element into BlockMix
output (in SMix's second loop only, indeed). Although I see no shortcut
from not doing this, I think it's a good thing to do, in part because
the CPU might be fetching some or all of this data into cache anyway
when we initiate the writes. That's because our writes are currently
of 16 bytes per instruction, whereas cache lines are currently 64 bytes.
Before we've issued 4 such write instructions, the CPU might already be
fetching the rest of the cache line in. I guess there's logic in the
CPU to avoid or reduce such needless activity when the writes are
performed with nearly adjacent instructions, but I am not sure if ours
are always grouped together closely enough as we're mixing crypto code
and memory accesses. (BTW, this may also affect the first loop in SMix,
which fills in the array for the first time.)

New revision is attached. Comments are welcome.

New MD5 of output of "tests" with defeat_tmto = 1:
1720dceef20e9d63dd526b717aa6d947

Alexander

Solar Designer

2013-03-18 06:40:16 UTC

Post by Solar Designer
Unfortunately, it turns out that BlockMix's shuffling(*) gets in the way
of efficient implementations, and _possibly_ the inefficiency mostly hurts
cleaner CPU code, as opposed to hacks and ASICs. Specifically, when the
V element we write into is or might be the same as V_j, in a clean
implementation we have to postpone those writes until the BlockMix
completes, and this means re-reading the same values from memory
(hopefully, from L1 cache) rather than simply storing register values
into two locations at once (such as X and V_j).

The above applies when trying to store (or XOR) BlockMix output into a V
element, but that in itself is problematic: if V_j on the very next
iteration of the SMix loop is the same as the element we just wrote
into, the XORs may cancel each other, in some cases resulting in really
nasty consequences. The previous TMTO defeater patches I posted in the
last 2 days suffer from this problem. Oops. I hope this is taken for
granted, but to be on the safe side let me say that none of these
early/experimental TMTO defeater patches are meant for actual use.

Now, I could fix this by adding even more complexity, such as to avoid
repeating j's (detect repeats and XOR with 1 if so, which may be done in
a branch-less fashion). However, I chose to rethink the whole approach
instead, and realized that Anthony's original example did not have this
problem due to storage of the block value from just prior to a BlockMix.

A reason why I did not try implementing Anthony's original example right
away was that it was incompatible with an optimization I had made in
crypto_scrypt-sse.c, where the XORs were re-ordered. Now I went for the
extra implementation complexity in this version of the code,
implementing a separate BlockMix function with the original order of
XORs so that the right value could be written back into V_j. This
appears to work well, and is fast. -ref and -nosse implement Anthony's
original example without such extra complexity.

New revision is attached. Comments are welcome.

New MD5 of output of "tests" with defeat_tmto = 1:
2843045848a204727c8dd7677bd6b8e3

Because of all the partial bypasses of this kind of TMTO defeater (that
I've documented in separate postings), I am likely to also proceed to
implement a defeater by means of changes to the first SMix loop. Then
the question will be whether to keep the defeater in the second loop as
well or not.

Alexander

Solar Designer

2013-03-17 22:28:02 UTC

Post by Solar Designer
http://drupal.org/node/1201444#comment-4675994
Here's an attack on Anthony's tradeoff defeater ("simply writing to V_j
Assume that we don't want to use the tradeoff to its fullest, but rather
just halve our memory needs. Without the tradeoff defeater, this is
accomplished by storing every other V_j in an array that is twice
smaller than original. The odd-indexed V_j's are then recomputed as
needed from the preceding values that we did store.
Now, with the tradeoff defeater we can't recompute these values like
that because the preceding value might have been modified by prior
iterations of the second loop. Or can we? Actually, quite often we
can: by the time the second loop terminates, it will have modified only
about 63% of V elements. And only 39% when we're at N/2 iterations.
Of course, "quite often" is not sufficient for proper computation of the
entire scrypt, so we do need to store the entire V array (not just its
half) somewhere.

... and then I went on to describe the hot/cold approach:
http://www.openwall.com/lists/crypt-dev/2012/09/10/3

However, since we only write into about 63% of V elements, we only have
to store this many. We may index them via a hash table (yes, this means
we'll need some more memory) and we'll also need to store some elements
of the original V array (as computed by SMix's first loop) - e.g., every
8th element. This way, we'll reduce our memory needs to about 76% of
original (not counting memory consumed by the hash table, which might
not be much) and pay for it by doing 8x more computation in SMix's
second loop, or 4.5x more computation total (not including hash table
lookups overhead). In some cases, this tradeoff may be worthwhile (when
the attacker is constrained to a device with a disproportionate amount
of computing power for the device's memory size and/or bandwidth).

Additionally, if the attacker runs multiple such instances in parallel
and (de)synchronizes them in a smart way, their average memory usage per
instance will be less than the 76% figure. If we consider SMix's second
loop only, it may be about 52% of original. If we consider both loops
and assume they take equal time to run, it may be 32% of original. So
the attacker would reduce memory needs by a factor of 3 with only a 4.5x
increase in computation. This may be worthwhile on even more devices
than the example above.

The (de)synchronization trick is also applicable to the original scrypt
(although it makes less of a difference there), where SMix's first loop
does not need all of the memory at once. Only SMix's second loop was
included in Colin's attack cost estimates, but for practical purposes I
like to consider both loops.

Alexander

Solar Designer

2013-11-20 05:57:57 UTC

Hi,

Attached is a tiny C program demonstrating a time-memory tradeoff still
practical with Anthony Ferrara's proposed change to scrypt. The
tradeoff is a lot less beneficial to an attacker, if at all (and in
fewer cases) than that possible with the original scrypt.

The basic idea is trivial: instead of storing some of the intermediate
computation results, store the paths leading to them from other
intermediate results. So we replace large values with smaller indices.

As implemented, the program uses two TMTO factors: 5 for SMix's first
loop (tunable) and 2 for SMix's second loop (embedded in the algorithm).
This results in an overall reduction of modified scrypt's memory usage
to 70% of full (as 20% by first loop and separate 50% by second loop),
plus a little bit of overhead for pointers, indices, and counters (2%
total at scrypt's typical r=8 and assuming 32-bit size of these pointer
and integer values - although they can as well be smaller). The
performance hit is less than a factor of 2. By increasing the TMTO
factor for the first loop, the total memory usage may be brought to just
a little over 50% of full - but the performance hit would be much higher.

If someone generalizes this algorithm to higher TMTO factors for the
second loop, I'd appreciate that. I think this is possible, but it may
be more tricky and I don't know what the costs would be (I am curious).

Here's sample output of the program:

$ ./sim-tmto
scrypt classic:
B' = 8c126669
t_cost = 512
m_cost = 256
scrypt TMTO = 5:
B' = 8c126669
t_cost = 1010
m_cost = 52
scrypt ircmaxell:
B' = 24efce90
t_cost = 512
m_cost = 256
scrypt ircmaxell TMTO1 = 5, TMTO2 = 2:
B' = 24efce90
t_cost = 978
m_cost = 180 elements + 256 indices (363 alloc + 256 ptrs) + 512 counters

Assuming element size 1024 bytes (which it is in scrypt with r=8) and
32-bit pointers and integers, and including even the memory allocation
fragmentation overhead (which is avoidable), we get:

(180*1024 + (363+256+512)*4) / (256*1024) = 72% memory usage
978/512 = 191% CPU time usage (counting only BlockMix invocations)

Overall area-time is increased by a factor of 1.376 - so not beneficial
for ASICs at all (but could be e.g. for GPUs). In the case of original
scrypt, overall area-time is instead decreased by up to a constant
factor (as discussed in here earlier), so benefits ASICs.

Indeed, this is not a proof that there's no more efficient TMTO for this
modified scrypt. There might be.

Alexander

P.S. Quoted below is a previous message I had posted to this thread,
describing other TMTO attacks on the same modified scrypt. Those were
less efficient, and I never tried implementing them.

http://www.openwall.com/lists/crypt-dev/2012/09/10/3
However, since we only write into about 63% of V elements, we only have
to store this many. We may index them via a hash table (yes, this means
we'll need some more memory) and we'll also need to store some elements
of the original V array (as computed by SMix's first loop) - e.g., every
8th element. This way, we'll reduce our memory needs to about 76% of
original (not counting memory consumed by the hash table, which might
not be much) and pay for it by doing 8x more computation in SMix's
second loop, or 4.5x more computation total (not including hash table
lookups overhead). In some cases, this tradeoff may be worthwhile (when
the attacker is constrained to a device with a disproportionate amount
of computing power for the device's memory size and/or bandwidth).
Additionally, if the attacker runs multiple such instances in parallel
and (de)synchronizes them in a smart way, their average memory usage per
instance will be less than the 76% figure. If we consider SMix's second
loop only, it may be about 52% of original. If we consider both loops
and assume they take equal time to run, it may be 32% of original. So
the attacker would reduce memory needs by a factor of 3 with only a 4.5x
increase in computation. This may be worthwhile on even more devices
than the example above.
The (de)synchronization trick is also applicable to the original scrypt
(although it makes less of a difference there), where SMix's first loop
does not need all of the memory at once. Only SMix's second loop was
included in Colin's attack cost estimates, but for practical purposes I
like to consider both loops.
Alexander

Solar Designer

2013-11-25 07:38:57 UTC

Post by Solar Designer
B' = 24efce90
t_cost = 978
m_cost = 180 elements + 256 indices (363 alloc + 256 ptrs) + 512 counters
Assuming element size 1024 bytes (which it is in scrypt with r=8) and
32-bit pointers and integers, and including even the memory allocation
(180*1024 + (363+256+512)*4) / (256*1024) = 72% memory usage
978/512 = 191% CPU time usage (counting only BlockMix invocations)

In the memory usage calculation above, I forgot to account for stack
usage by possible recursive calls to getVrec():

uint32_t getVrec(uint32_t i, uint32_t j)
{
[...]
return Y ^ getVrec(ii, jj);

The good news (for my analysis) is that this recursion was easily
avoidable - which I did in the code revision attached to this message.
The program output remained exactly the same, and the memory usage
calculation above is now valid.

Alexander

Solar Designer

2013-12-31 04:56:58 UTC

Colin, all -

Post by Solar Designer
We got a curious real-world data point due to Litecoin using scrypt with
(2^10, 1, 1), which corresponds to 128 KiB.

[...]

Post by Solar Designer
Curiously, bcrypt uses only 4 KiB, which is less than Litecoin scrypt's
128 KiB. However, bcrypt's use of L1 (rather than L2) cache on CPUs
provides an advantage to CPU implementations, and its access pattern
(random 32-bit accesses) makes attempts to use L2+ cache or RAM (for
many concurrent instances of bcrypt) very inefficient. On GPUs, best
performance at bcrypt is currently achieved by using local memory (very
limited in size) and keeping many of the execution units idle (since
there's not enough local memory to use them all). Trying to use global
memory (partially cached) slows things down. (A local+global mixed
implementation might be a little bit faster than local alone, though.
This was briefly discussed on john-dev.) Credit: the experiments with
bcrypt on GPU were performed by Sayantan Datta.
It is plausible that scrypt at 1 MiB is comparable to bcrypt in terms of
attacks with GPUs, but this is not certain. scrypt at 1 MiB might as
well be somewhat weaker than bcrypt in this respect.
Overall, I think scrypt at 1 MiB is not a good replacement for bcrypt
currently. We need bigger settings. (And if we go for much bigger
settings, this may imply a need to limit concurrency when scrypt is used
on authentication servers.)

YACoin (or YAC for short), a new alt-coin, is using revised scrypt (with
Keccak and ChaCha20 instead of SHA-256 and Salsa20) with N increasing
over time (it's tied to real time). Currently YAC is at N=2^15
(Nfactor=14), which means 4 MiB, and here's how it will be increasing:

https://bitcointalk.org/index.php?topic=206577.msg2162620#msg2162620

At the current 4 MiB, YAC is about as fast to mine on CPU as on GPU:

http://yacoinwiki.tk/index.php/Mining_Hardware_Comparison

In June 2014, we'll see numbers for 8 MiB there. Indeed, we may run
benchmarks now, but maybe implementations optimized/tuned for the
increased N will appear along with the monetary incentive. And yes, YAC
is worth some Bitcoins now, which are worth some dollars.

So it appears that in terms of attacks with current GPUs scrypt at 4 MiB
is comparable to bcrypt at 4 KiB. In other words, defensive use of
scrypt needs 1000x more memory to provide same security against GPUs.

That's with r=1 and p=1. With higher r, scrypt might be more
GPU-friendly (larger sequential accesses to off-chip memory).

And yes, GPUs still provide no speedup over CPUs at bcrypt cracking,
neither in absolute terms, nor per-Watt:

http://www.openwall.com/presentations/Passwords13-Energy-Efficient-Cracking/

These results (for GPUs) have been confirmed by bcrypt implementations
in hashcat and hashkill (by their respective authors), besides in JtR.

Alexander

Solar Designer

2014-03-12 19:13:27 UTC

Post by Solar Designer
YACoin (or YAC for short), a new alt-coin, is using revised scrypt (with
Keccak and ChaCha20 instead of SHA-256 and Salsa20) with N increasing
over time (it's tied to real time). Currently YAC is at N=2^15
https://bitcointalk.org/index.php?topic=206577.msg2162620#msg2162620
http://yacoinwiki.tk/index.php/Mining_Hardware_Comparison

Earlier today, I noticed that a surprisingly fast GPU miner was added to
that wiki page, doing 5.04 kh/s on overclocked GTX 780. This is several
times faster than what CPUs do, and twice faster than best speeds
reported for AMD GPUs. This is also surprising in that AMD GPUs
generally outperform NVIDIA's at symmetric crypto.

So I went to verify, since luckily we have a GTX TITAN. I did a git
checkout of CudaMiner, built it, figured out how to use it, and after a
little bit of tuning got an even higher speed: I added it to the wiki as
6.83 kh/s, although actually it changes with time a bit (the highest I
saw was 6.90 kh/s), maybe because of boost or/and pool delays. And yes,
the shares are being accepted by the pool just fine, so this is for real.

This is more than twice faster than the previous best YAC mining speed I
had for CPUs (also added to that wiki page) - ~3 kh/s on 2x E5-2670 (a
total of 16 cores, 32 hardware threads). Thus, there's a 4x speedup
per-chip, when comparing the TITAN vs. one E5-2670. (And the TITAN is
actually cheaper than one E5-2670.)

It's also important to note that only 2.5 GiB out of 6 GiB of TITAN's
memory is in use. (I could make the miner use more memory with other
settings, but this is what appeared to provide optimal speed now, not
needing to use more memory yet.) Thus, I think the GPU/CPU speed ratio
would remain about the same when YAC moves to higher Nfactor, for 8 MiB.

And the TITAN is likely to remain faster than its contemporary CPUs even
at 16 MiB, and possibly at some larger sizes as well. The current 4x
performance gap is just too much to be fully removed by the move from
8 MiB to 16 MiB. (Perhaps only a fraction of it would be removed.)

BTW, this is with "lookup gap" (TMTO factor) of 6. (I tried 5 and 7,
and either of these is slightly slower.) Previously, scrypt-based
cryptocurrency miners typically found the optimal "lookup gap" to be 2.

Post by Solar Designer
That's with r=1 and p=1. With higher r, scrypt might be more
GPU-friendly (larger sequential accesses to off-chip memory).

Alexander

Solar Designer

2014-03-13 19:05:14 UTC