Discussion:
alternative approach
Solar Designer
2011-05-09 15:39:07 UTC
Permalink
Hi,

Initially, I proposed that we try to design a hashing method that could
be implemented near-optimally on all of: CPU (including SIMD), GPU, and
FPGA (and ASIC) - perhaps with different locally-configurable parameter
values, though. It would also require a configurable amount of memory.

I am not giving up on that idea, however:

It alone assumes that an attacker will always have the most suitable
hardware anyway. While we definitely must not rely on an attacker
having no access to specialized hardware, we may also want to consider
that in practice many attackers will in fact happen to be limited in the
kinds of hardware available to them. Someone may readily have only
their own gaming computer, someone else may readily have a GPU farm, and
yet another person may control a botnet built of typical end-user PCs
(most of them without decent GPUs and proper drivers), etc. Thus,
having to get the most suitable hardware would be extra cost to them.

Here's an alternative approach that we could consider:

Include several components in our hashing method - e.g., we may have a
component well-suited for CPU (maybe no SIMD?) and FPGA/ASIC, but not
for GPU (bcrypt is close to being such a thing), we may have a component
well-suited for all of these, and we may have a component that requires
FPGA/ASIC for efficient implementation. The relative weights of these
(iteration counts) may be configurable (and encoded along with the
hashes) depending on what kinds of hardware we have (or expect to have)
in servers at a particular site. By configuring the GPU-unfriendly
component to have significant weight, we make life harder (costs higher
or success rate lower) for that gamer with a graphics card and for that
attacker who readily has a GPU farm. By also using a CPU and GPU
unfriendly component, we make the botnet less effective.

The hashing method that is near-optimal on all platforms then becomes just
a special case of the above - just configure the specialized components to
have very low weight (or even zero).
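To make the weighting concrete, here is a minimal C sketch of the idea, with toy stand-in components and a 64-bit state - all names, constants, and the mixing function are illustrative, not a proposed design. Each component gets an iteration count as its weight, and a zero weight disables it entirely:

```c
#include <stdint.h>

/* Toy stand-in mixing step - NOT actual cryptography. */
static uint64_t mix(uint64_t h, uint64_t v)
{
	h ^= v;
	return h * 0x100000001b3ULL;
}

static uint64_t comp_cpu(uint64_t s)  { return mix(s, 0x1111); } /* CPU/FPGA-friendly */
static uint64_t comp_all(uint64_t s)  { return mix(s, 0x2222); } /* friendly to all */
static uint64_t comp_fpga(uint64_t s) { return mix(s, 0x3333); } /* FPGA/ASIC-oriented */

/* Apply each component for its configured weight (iteration count).
 * A weight of zero disables that component entirely. */
static uint64_t weighted_kdf(uint64_t seed, const unsigned w[3])
{
	uint64_t (* const comp[3])(uint64_t) =
	    { comp_cpu, comp_all, comp_fpga };
	uint64_t s = seed;
	int c;
	unsigned i;

	for (c = 0; c < 3; c++)
		for (i = 0; i < w[c]; i++)
			s = comp[c](s);
	return s;
}
```

The weights would be encoded along with the hashes, so each site can tune them to its hardware.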

A major drawback is complexity. Another drawback is that we might end
up using many ciphers/hashes, which might make it difficult to get this
solution accepted in places that care about or have to use certified
crypto only, or to obtain due government certification for products
implementing the hashing method. On the other hand, we may specifically
design the components using ciphers that would be acceptable in
different places - e.g., use SHA-2 or AES in one component and GOST in
another - then the unacceptable cipher could be disabled for use in
specific organizations or for product certification by setting the
corresponding component's weight to zero. (Of course, doing so goes
against security...)

Overall, the number of criteria affecting our design decisions becomes
rather large, which is unfortunate.

I'd appreciate any comments.

Alexander
Yuri Gonzaga
2011-05-11 14:22:38 UTC
Permalink
Hi, Alexander

It is complicated to propose a new method that doesn't use a combination of
well-known and accepted hashes/ciphers, because certification and
acceptance take time to happen, coming only after appreciation by the
scientific and commercial communities.
Thus, it is better for our project to use validated hashes/ciphers in order
to achieve acceptance. Also, the time of GSoC will be better spent if we
focus on implementing these blocks instead of proposing a new method.
So, my suggestion is that you define 3 components: one for the CPU,
another for the GPU, and another for the FPGA.
The result of each component will be combined into a final result. For
example, do an XOR operation on the 3 results, or use one output as input
to the next component.
Finally, as you are the expert in this area, I think you should establish
these details so I can go on with the project implementation.

Regards,

Yuri Gonzaga
Solar Designer
2011-05-11 19:07:14 UTC
Permalink
Hi Yuri,

Is my understanding of your reply correct that you agree it is a good
idea to use the "alternative approach", but you propose that when doing
so we limit ourselves to "well-known and acceptable hashes/ciphers" as
our building blocks?
Post by Yuri Gonzaga
It is complicated to propose a new method that doesn't use a combination of
well-known and accepted hashes/ciphers, because certification and
acceptance take time to happen, coming only after appreciation by the
scientific and commercial communities.
Thus, it is better for our project to use validated hashes/ciphers in order
to achieve acceptance.
Right. However, most modern cryptographic primitives were designed for
efficient software implementation (as well as hardware implementation).
So I'm afraid there isn't a truly software-unfriendly modern hash/cipher
that has received much scrutiny and government certification.

Blowfish (and perhaps Twofish) might be the closest we can get - it has
received a lot of scrutiny, but no government certification. It is
friendly to 32-bit CPUs, but not to typical SIMD architectures (which
lack vectorized array lookups, for good reasons).

Thus, we may choose to use a component based on Blowfish (such as
Eksblowfish or the full bcrypt) or even one based on our own Blowfish
variation (more unfriendly to software implementation and requiring less
logic in hardware), but include proof that cryptographic security of our
entire KDF is not dependent on that of this component. That is, even if
a gaping cryptographic weakness is found in this component, the worst
impact would be faster testing of a candidate key in a brute-force
attack - partially or fully bypassing computation of this "non-crypto"
component. Of course, we wouldn't expect such bypass to be actually
practical, which is why we'd include this component.

(We may use a similar approach to gain acceptance for a variation of
scrypt, formally relying on SHA-256 and PBKDF2 rather than on its
Salsa20/8 component.)
Post by Yuri Gonzaga
Also, the time of GSoC will be better spent if we
focus on implementing these blocks instead of proposing a new method.
Maybe, or maybe not. We definitely need to have something usable and
useful by the end of summer. We've already decided to include an FPGA
implementation and an optional FPGA-based component in our KDF. No
matter what we choose, if we implement this at all and it works, it will
be "usable". As to it being "useful", it will need to be somehow better
than what exists now, for some uses and in some aspects. I see two ways
to make it useful in this sense:

1. Have our KDF, when running on one CPU plus one FPGA, be significantly
more resistant against offline brute-force attacks by just CPUs (no
FPGAs) and/or by GPUs, compared to an existing KDF implemented on a CPU
only. By "significantly" I mean something like a 10x difference in
attack speed (assuming fixed $ cost of the attacker's equipment), and
preferably more than 10x.

2. Include and use a host-unreadable local parameter on the FPGA board.
For this, we need to figure out if Pico Computing's boards are readily
usable to achieve this task or not... If not yet, then maybe we won't
have this by the end of summer (although it'd be a pity).

If we're 100% confident that we'd achieve one of these by the end of
summer, then the other is optional (for the summer). However, at least
one of these is a must to achieve by the end of the summer.

Now, while we could wish FPGAs were some magic high-performance beasts,
definitely much faster than CPUs when properly programmed, they are not.
If we pick a software-friendly cryptographic primitive, it could turn
out that our optimized implementation of it on a typical server CPU
achieves comparable performance to our optimized implementation on our
FPGA board (one we could offer for installation into servers).

Here are some performance numbers for DES, which was originally designed
for hardware implementation, but then had efficient software
implementations developed for it:

http://www.sump.org/projects/password/

The performance number given for "AMD64 3000+" is incorrect, please
disregard it. See the comments (some of them posted by me) for correct
performance numbers for CPU implementations.

This shows that XC3S1000-4 achieves a speed similar to a quad-core CPU
of 2008. Core i7-2600 (this year's quad-core CPU) is twice as fast - it
does over 20M DES-based crypt(3) keys per second with my current code.

Thus, apparently XC3S5000-4, which is reported to theoretically achieve
45 MKeys/sec in a table on that web page, is twice as fast as a modern
quad-core CPU. However, for our project a mere 2x performance gain of
FPGA vs. CPU is not good enough to achieve the #1 goal mentioned above.

(I don't know how FPGAs on Pico Computing's boards meant for production
use in servers compare to a XC3S5000-4. I'd appreciate info on that.)

Overall, the above makes me think that we may in fact need to design and
implement a software-unfriendly component. Definitely more than DES
(which can be made to use arbitrarily wide SIMD vectors in software with
a bitslice implementation), and perhaps more than Blowfish. With
original Blowfish, we might achieve a 4x "boost" in software
unfriendliness vs. DES (since it only uses 32-bit operations, whereas
modern CPUs can do 128-bit almost as fast), which might give us a total
"FPGA boost" vs. CPUs of 8x (relying on speeds for XC3S5000-4 vs. Core
i7-2600 mentioned above). However, I'd prefer something better than 8x
(although 8x is already reasonable for our project), and I am not
certain of it (full/original Blowfish probably consumes more FPGA logic
than DES does, which I did not take into account in these calculations,
so we'd actually fit fewer Blowfish cores in our FPGA chip; also, there
are differences in pipelining ability, which is nearly perfect for DES).
Post by Yuri Gonzaga
So, my suggestion is that you define 3 components: one for the CPU,
another for the GPU, and another for the FPGA.
Maybe. I suggest that we omit a GPU-specific component for now, for a
variety of reasons.
Post by Yuri Gonzaga
The result of each component will be combined into a final result. For
example, do an XOR operation on the 3 results
Yes, however XOR only works well when the components' outputs are of the
same size and all are "random enough" - otherwise the combined output
might reveal some bits of individual outputs, which would allow for
faster brute-force attacks (on those revealed outputs separately).
Thus, we'd want to feed the components' outputs into a generic KDF
(relatively fast). We may even include the original password and salt
along with those component outputs. This gives us proof that the
security of our entire construct depends on that final KDF only.
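That combining step can be sketched in C as follows - with a toy 64-bit FNV-1a hash standing in for the final "generic KDF" (the real construct would use PBKDF2/HMAC or similar), and fixed two-component output for brevity. The point is structural: hash password, salt, and every component output together rather than XOR the outputs:

```c
#include <stdint.h>
#include <string.h>

/* Toy 64-bit FNV-1a standing in for the final "generic KDF". */
static uint64_t toy_kdf(const uint8_t *p, size_t n)
{
	uint64_t h = 0xcbf29ce484222325ULL;
	while (n--) {
		h ^= *p++;
		h *= 0x100000001b3ULL;
	}
	return h;
}

/* Feed password, salt, and all component outputs into the final KDF,
 * so overall security rests on this step rather than on XOR of
 * possibly-biased component outputs.  Sketch: assumes pw and salt
 * are short enough to fit the buffer. */
static uint64_t combine(const char *pw, const char *salt,
    uint64_t out1, uint64_t out2)
{
	uint8_t buf[80];
	size_t pl = strlen(pw), sl = strlen(salt), n = 0;
	int i;

	memcpy(buf + n, pw, pl); n += pl;
	memcpy(buf + n, salt, sl); n += sl;
	for (i = 0; i < 8; i++)		/* serialize component outputs */
		buf[n++] = (uint8_t)(out1 >> (8 * i));
	for (i = 0; i < 8; i++)
		buf[n++] = (uint8_t)(out2 >> (8 * i));
	return toy_kdf(buf, n);
}
```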
Post by Yuri Gonzaga
or use one output as input to the next component.
This reduces parallelism for actual use while not affecting parallelism
available for attacks. Not good.
Post by Yuri Gonzaga
Finally, as you are the expert in this area, I think you should establish
these details so I can go on with the project implementation.
Yes, I intend to define the project more specifically, but I am trying
to get you involved in the design decisions and to let you learn new
things. Also, some design decisions depend on what they'd mean for
implementation. So I might ask you to try implementing different things
before we decide on which of those to use.

Thanks,

Alexander
David Hulton
2011-05-11 19:31:00 UTC
Permalink
Alexander & Yuri,

I think that DES would be our best bet as far as certified ciphers go.
We're able to do on the order of 22 billion DES operations/sec on our
current M-501 modules (we are currently able to do over 1 trillion
keys/sec on our 48-board 5U clusters) and definitely in the billions
range on the E-101 board that we're sending to Yuri. If we combine a
PBKDF2-like iteration algorithm with DES/3DES, we might be able to get
away with something that conforms closely enough to the standards out
there to be usable, allows us to maintain a full pipeline with the DES
implementation (for efficient operation in server mode), and gives us a
10x+ advantage over GPUs/CPUs. I haven't seen many benchmarks for GPUs
and DES, but the closest I've seen, for Lanman, is still in the hundreds
of millions at best.

I have a feeling that going with DES puts us on the right track, and
one other benefit is that if this were ever ASICed, we would immediately
have another 6x-10x advantage and major cost savings on larger-scale
deployments.

One option that comes to mind is running something like 16 or 32 (or any
higher convenient multiple of 16) PBKDF2 operations in parallel (maybe
each with a different salt?) and then XORing all of the results together
at the end. This would give us some degree of parallelism (if we have a
16-stage DES pipeline, we can have all of our PBKDF2 instances running
efficiently in server mode, possibly across multiple DES cores depending
on the parallelism setting) while also providing a tunable tradeoff with
the sequential running time.
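The parallel-lanes idea can be sketched like this in C. Here `lane()` is a toy PRF iteration standing in for one real PBKDF2 instance (the constants are illustrative); the lanes have no data dependencies on each other, so they can fill a DES-style pipeline or spread across cores, while the per-lane iteration count remains the sequential cost knob:

```c
#include <stdint.h>

/* Toy PRF iteration standing in for one PBKDF2 instance. */
static uint64_t lane(const char *pw, uint64_t salt, unsigned iters)
{
	uint64_t h = salt ^ 0xcbf29ce484222325ULL;
	const char *p;

	for (p = pw; *p; p++) {		/* absorb the password */
		h ^= (uint8_t)*p;
		h *= 0x100000001b3ULL;
	}
	while (iters--) {		/* sequential cost knob */
		h ^= h >> 33;
		h *= 0xff51afd7ed558ccdULL;
	}
	return h;
}

/* Run nlanes independent instances (distinct salts) and XOR the
 * results.  The lanes are independent, hence pipeline/core friendly. */
static uint64_t combine_lanes(const char *pw, const uint64_t *salts,
    int nlanes, unsigned iters)
{
	uint64_t acc = 0;
	int i;

	for (i = 0; i < nlanes; i++)
		acc ^= lane(pw, salts[i], iters);
	return acc;
}
```

Note the distinct salts matter: two lanes with identical salts cancel under XOR, which is one reason a final generic KDF over the lane outputs would still be prudent.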

Just some thoughts.. I think part of the problem is that most
mainstream algorithms designed in the past 20 years have been heavily
targeted to run efficiently in software rather than gates, so
unfortunately we may need to go to the past to take advantage of this.
The other algorithms that I think would run much more efficiently would
be ones that do lots of smaller bit operations, like LFSR-based
algorithms or possibly FFT-like hashes...

-David
Solar Designer
2011-05-12 09:51:54 UTC
Permalink
David, Yuri -
Post by David Hulton
I think that DES would be our best bet as far as certified ciphers.
I think that technically DES would be great for a component that we want
to be optimal for both FPGA/ASIC and CPU, if we design our KDF such that
bitslice implementations of DES are convenient. Even though DES was
originally designed for hardware implementation, with bitslicing (when
that is practical) it is also software-friendly.

...I just found some pseudo-code for a bitslice DES based crypt(3) like
function, which I wrote in 1998 (according to the file timestamp). I'll
post it separately.

Some drawbacks of DES are:

- A lot of people have heard that DES has been "broken", even though it
hasn't - it's just that its 56-bit key size is insufficient.

- DES has been formally obsoleted, in favor of AES. (As an excuse, we
can build upon 3DES and say that we're using that...)

- The key size (and also block size) is in fact inadequate; we'll need
to use DES such that this does not affect our KDF's properties.

- For the "alternative approach", DES is too software-friendly (but it
is also extremely hardware-friendly).
Post by David Hulton
We're able to do on the order of around 22 billion DES operations/sec
on our current M-501 modules (we currently are able to do over 1
trillion keys/sec on our 48-board 5U clusters)
That's impressive.

http://picocomputing.com/m_series.html says "Virtex-6 LX FPGA", "Up to
six M-501 modules per board". Is the 22 billion figure for one or for
six FPGAs?
Post by David Hulton
and definitely in the
billions range on the E-101 board that we're sending to Yuri.
Sounds good.

http://picocomputing.com/e_series.html says it's "Spartan-6 LX45 FPGA".

Yuri - the numbers I posted in here before were for DES-based crypt(3),
which includes 25 iterations of modified-DES. So you need to multiply
them by 25 to compare against those posted by David above. Thus, if
Core i7-2600 does 20.5M DES-based crypts per second (and it does that
here), this means that it does 512M modified-DES encryptions per second.
For plain DES (not modified), it would do approx. 550M per second in a
similar loop (no key setup in the loop, just iterated DES encryption).
More efficient parallelization over the multiple cores could provide
another 10% speedup (my 20.5M figure is for thread-safe code and OpenMP,
which has overhead).

That's still 35-40 times slower than the 22 billion mentioned by David,
which is good.
Post by David Hulton
If we
combined PBKDF2-like iteration algorithm with DES/3DES we might be
able to get away with something that conforms close enough to the
standards out there to be usable, allows us to maintain a full
pipeline with the DES implementation (for efficient operation in
server mode) and give us a 10x+ advantage over GPU's/CPU's.
This makes sense. As a bonus, since bitslice DES is also CPU-friendly,
we could try to design our KDF such that we treat the FPGA and the CPU
almost the same (at high level).

However, if 10x (or even 40x?) is possible in this way, this leads me to
think that we could achieve 100x+ with a truly CPU-unfriendly component
for the FPGA.
Post by David Hulton
I haven't
seen many benchmarks for GPU's and DES but the closest I've seen for
Lanman is still in the hundreds of millions at best.
Disclaimer: I am not into GPU programming (yet). This is why I use
words such as "apparently" below.

Yes, DES is somewhat unfriendly to GPUs so far, but this is somewhat
likely to change. Bitslice implementations of DES with the S-box
expressions published so far need up to approx. 20 registers, which is
apparently more than what typical GPUs have available per processing
element. Also, apparently accessing a temporary variable spilled to
memory is relatively slow (compared to accessing the L1 cache in a CPU).

One expected change is that we (Openwall) are going to release more
GPU-friendly DES S-box expressions in about a month from now. These
match the instruction set of high-end ATI/AMD GPUs (as well as AMD XOP
and PPC AltiVec, which is why they'd get into JtR this soon), and they
will need fewer registers.

Another change that may or may not happen is GPUs getting more registers
or/and faster access to temporary variables. This sounds quite
realistic to me, but I am not an expert in this area. Based on what I
read (about the implementation difficulty for DES on GPUs), I wouldn't
be surprised if a mere 2x increase in register count results in a 10x
speedup for DES on GPUs.
Post by David Hulton
I have a feeling that going with DES puts us on the right track and
one other benefit is that if this was ever ASICed we would immediately
have another 6x-10x advantage and major cost savings on larger scale
deployments.
This makes sense. Wouldn't ASIC provide similar advantage for
Blowfish-like constructs, though?
Post by David Hulton
One option that comes to mind is doing something like 16 or 32 (or any
other higher convenient multiple of 16) number of PBKDF2 operations in
parallel (maybe each with different salts?) and then xor all of the
results together at the end. This would give us some degree of
parallelism (if we have a 16-stage DES pipeline we can have all of our
PBKDF2 instances running efficiently in server mode possibly across
multiple DES cores depending on the parallelism setting) and also
providing a tuneable tradeoff with the sequential running time.
This makes sense. I think the degree of parallelism would need to be
far greater, though. For the CPU, it will need to be at least 128
(since we'll do bitslice).

But I also need to post my thoughts on us using a sequential memory-hard
algorithm as defined in the scrypt paper. Maybe that is what will
occupy the CPU.
Post by David Hulton
Just some thoughts.. I think part of the problem is that most
mainstream algorithms designed in the past 20 years have been heavily
targeted to run efficiently in software rather than gates so
Right.
Post by David Hulton
unfortunately we may need to go to the past to take advantage of this.
Yes, or modify/design something new.
Post by David Hulton
The other algorithms that I think would run much more
efficiently would be ones that do lots of smaller bit operations, like
LFSR-based algorithms
These might be too weak.
Post by David Hulton
or possibly FFT-like hashes...
I am not familiar with these.

I actually thought of modifying Blowfish in specific ways for this
project. I am likely to write about this separately.

Thank you!

Alexander
Solar Designer
2011-05-12 10:06:53 UTC
Permalink
Post by Solar Designer
...I just found some pseudo-code for a bitslice DES based crypt(3) like
function, which I wrote in 1998 (according to the file timestamp). I'll
post it separately.
Here it is, with no changes:

---
int N = sizeof(word) * 8;
word B[64], K[56];
int rounds, i;
int64 salt;
int56 k;
int64 hash[2];

decode(&rounds, &salt);

k = get7();
for (i = 0; i < N; i++) {
	B{i} = salt * N | i;
	K{i} = k;
}

do {
	bitslice(&B, K);
	k = get7();
	for (i = 0; i < N; i++)
		K{i} = k ^ B{i};
} while (k);

for (i = 0; i < rounds; i++)
	bitslice(&B, K);

hash[0] = 0;
for (i = 0; i < N; i++)
	hash[0] ^= B{i};

bitslice(&B, K);

hash[1] = 0;
for (i = 0; i < N; i++)
	hash[1] ^= B{i};

encode(rounds, salt, hash);

The curly braces refer to bit layers, e.g. B{i} means i'th bit of every
element of B[].

get7() reads the next 7 characters of the input password/passphrase.

bitslice() is a bitslice implementation of DES. Since the same value of
K is used in multiple calls to bitslice(), the DES key schedule setup
may actually be out of the loop. This is not shown above for simplicity.
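For clarity, one plausible C rendering of the bit-layer assignment B{i} = k (assuming 64-bit words and 64 elements, matching the pseudo-code's dimensions) is:

```c
#include <stdint.h>

/* B{i} = k: for every element j, set bit i of B[j] to bit j of k. */
static void set_layer(uint64_t B[64], int i, uint64_t k)
{
	int j;

	for (j = 0; j < 64; j++)
		B[j] = (B[j] & ~(1ULL << i)) | (((k >> j) & 1ULL) << i);
}

/* Read layer i back: bit j of the result is bit i of B[j]. */
static uint64_t get_layer(const uint64_t B[64], int i)
{
	uint64_t k = 0;
	int j;

	for (j = 0; j < 64; j++)
		k |= ((B[j] >> i) & 1ULL) << j;
	return k;
}
```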

Alexander
Solar Designer
2011-05-12 10:51:40 UTC
Permalink
David, Yuri -
Post by Solar Designer
I actually thought of modifying Blowfish in specific ways for this
project. I am likely to write about this separately.
The primary idea is to use smaller S-boxes. Normally, Blowfish uses
four 8-to-32 S-boxes. We could reduce their size, maybe to 4-to-16.
This would make software implementations half as efficient (fewer bits
of data handled per instruction) while greatly reducing the amount of
logic (or Block RAM). The size of the S-boxes would drop from 4 KB to
128 bytes (a 32x reduction). So we'd fit a lot more of these reduced
Blowfish cores into a chip.
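A minimal C sketch of such a cut-down cipher follows. The S-boxes are held static here for brevity, whereas the real Eksblowfish-style construction would keep them key-dependent and ever-changing; the round count, the toy initialization, and the 16-bit halves are all illustrative. Note the table cost: 4 boxes x 16 entries x 2 bytes = 128 bytes, versus Blowfish's 4 KB:

```c
#include <stdint.h>

/* Four 4-to-16 S-boxes: 4 * 16 * 2 bytes = 128 bytes total (vs. 4 KB). */
static uint16_t S[4][16];
static uint16_t P[8];		/* round subkeys */

/* Toy LCG fill, standing in for real key-dependent S-box setup. */
static void init_toy_state(void)
{
	uint32_t x = 0x12345678;
	int i, j;

	for (i = 0; i < 4; i++)
		for (j = 0; j < 16; j++)
			S[i][j] = (uint16_t)(x = x * 1103515245 + 12345);
	for (i = 0; i < 8; i++)
		P[i] = (uint16_t)(x = x * 1103515245 + 12345);
}

/* Mirror Blowfish's F: add, XOR, add over the four 4-bit lookups. */
static uint16_t F(uint16_t x)
{
	return (uint16_t)(((uint16_t)(S[0][x >> 12] + S[1][(x >> 8) & 0xf])
	    ^ S[2][(x >> 4) & 0xf]) + S[3][x & 0xf]);
}

static void encrypt(uint16_t *L, uint16_t *R)
{
	int i;

	for (i = 0; i < 8; i++) {
		uint16_t l = *L ^ P[i];
		uint16_t r = *R ^ F(l);
		*L = r;		/* swap halves */
		*R = l;
	}
}

static void decrypt(uint16_t *L, uint16_t *R)
{
	int i;

	for (i = 7; i >= 0; i--) {
		uint16_t l = *R;	/* undo the swap */
		uint16_t r = *L;
		*R = r ^ F(l);
		*L = l ^ P[i];
	}
}
```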

I think 4-bit indices are close to the minimum we can use. With even
smaller indices, we allow for bitslice implementations in software that
keep the arrays in registers (or load them into registers when needed)
and use not-too-many bitwise operations to extract the right elements.
For 4-bit, it takes 16 registers (or loads) per S-box and at least 4
instructions (when there's a bit select instruction like selb, vsel,
vpcmov, BFI_INT) or 12 (with and-not) or 16 instructions. Oh, there
will also be plenty of instructions needed to spread/duplicate our
index bits over the SIMD vectors before they're usable for bit selects.
This is probably enough to kill any advantage of using a bitslice
implementation unless the vectors are very wide yet fast.
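For reference, one fully bitslice form of such a lookup: one output bit of a 4-bit-index S-box, computed for 64 lanes at once with a tree of 8 + 4 + 2 + 1 = 15 bit-select operations (a sketch; the lane count and types are illustrative):

```c
#include <stdint.h>

/* Bit select: where a bit of c is 1, take a's bit, else b's
 * (maps to selb / vsel / vpcmov / BFI_INT). */
static uint64_t sel(uint64_t c, uint64_t a, uint64_t b)
{
	return b ^ (c & (a ^ b));
}

/* One output bit of a 4-bit-index S-box, for 64 lanes at once.
 * x[b] holds bit b of each lane's index; leaf[v] is all-ones if
 * table entry v has this output bit set, else all-zeroes.
 * Total: 8 + 4 + 2 + 1 = 15 selects. */
static uint64_t sbox_bit(const uint64_t x[4], const uint64_t leaf[16])
{
	uint64_t t0[8], t1[4], t2[2];
	int v;

	for (v = 0; v < 8; v++)
		t0[v] = sel(x[0], leaf[2 * v + 1], leaf[2 * v]);
	for (v = 0; v < 4; v++)
		t1[v] = sel(x[1], t0[2 * v + 1], t0[2 * v]);
	for (v = 0; v < 2; v++)
		t2[v] = sel(x[2], t1[2 * v + 1], t1[2 * v]);
	return sel(x[3], t2[1], t2[0]);
}
```

Each call performs 64 parallel 4-to-1-bit lookups; a 4-to-8 S-box needs eight such trees, one per output bit.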

The S-boxes themselves will be ever-changing, like they are in Blowfish
normally. This prevents DES-like bitslice implementations.

Also, Blowfish uses XORs (no-carry addition) and ADDs (all-carry
addition). In hardware, we can instead use partial-carry addition -
e.g., do carry for every other bit. A non-bitslice software
implementation would need to use an extra instruction there, whereas
we save logic.

Finally, we can add bit permutations. These are free in hardware, but
are costly for non-bitslice software implementations. (Some modern
instruction sets include arbitrary byte permutation instructions, but
not arbitrary bit permutations.)
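Both tricks can be sketched in a few lines of C. The exact carry mask and the particular permutation below are illustrative choices, not a proposal:

```c
#include <stdint.h>

#define PC_MASK 0x55555555u	/* generate carries at even bits only */

/* Partial-carry addition: every other bit produces a carry into the
 * next position, but carry chains stop there (no full ripple).
 * Hardware saves carry logic; software needs an extra instruction. */
static uint32_t pc_add(uint32_t a, uint32_t b)
{
	uint32_t c = (a & b & PC_MASK) << 1;	/* one level of carries */
	return a ^ b ^ c;
}

/* A fixed bit permutation - free wiring in hardware, but costly for a
 * non-bitslice software implementation.  Here: rotate the bits of each
 * byte right by 3, as one arbitrary example. */
static uint32_t bit_perm(uint32_t x)
{
	uint32_t lo = x & 0x07070707u;	/* bits 0..2 of each byte */
	uint32_t hi = x & 0xf8f8f8f8u;	/* bits 3..7 of each byte */
	return (lo << 5) | (hi >> 3);
}
```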

We'd need to do some testing of the resulting component, such as for
collision resistance, but we'd call it "non-crypto". We may use a
software implementation for most of such testing (slow, but that's OK).

As usual, comments are welcome.

Alexander
Solar Designer
2011-05-13 23:47:06 UTC
Permalink
David, Yuri -

I've experimented with this today. Attached is a program that
demonstrates a heavily cut-down Eksblowfish-like algorithm.
Post by Solar Designer
The primary idea is to use smaller S-boxes. Normally, Blowfish uses
four 8-to-32 S-boxes. We could reduce their size maybe to 4-to-16.
Reduced to 4-to-8 in the attached proof-of-concept program. The number
of S-boxes is reduced from 4 to 2.

The size of S-boxes is 256 bits (2x16x8). Besides the changing S-boxes,
there's 16 bits of internal state in the L and R variables (8-bit each).

I wonder if it's possible to synthesize this into 68 logic cells on
Virtex-6 (we have 272 bits of state, and new Xilinx logic cells have 4
output registers each). Yuri - can you give this a try? Please note
that Nrounds can be made as low as 2 if needed, eliminating that loop,
or instead it can be made high, making the outer loop's logic
non-performance-critical. Whatever works best. I did my testing with
both settings (high Nrounds gives slightly better properties).

David - for comparison, how many logic cells do the DES cores use?

Since they're pipelined, we'll need to divide by 16 to compare against
an iterated implementation... or we'll need to make a pipelined
implementation as well (and take the number of pipeline stages into
consideration when comparing). With the internal state reduced to 272
bits, I think a pipelined implementation is possible.
Post by Solar Designer
I think 4-bit indices are close to the minimum we can use. With even
smaller indices, we allow for bitslice implementations in software that
keep the arrays in registers (or load them into registers when needed)
and use not-too-many bitwise operations to extract the right elements.
For 4-bit, it takes 16 registers (or loads) per S-box and at least 4
instructions (when there's a bit select instruction like selb, vsel,
vpcmov, BFI_INT) or 12 (with and-not) or 16 instructions. Oh, there
will also be plenty of instructions needed to spread/duplicate our
index bits over the SIMD vectors before they're usable for bit selects.
When writing the above, I had a partially-bitslice implementation in
mind, and I got the numbers wrong anyway. So please disregard this.

With a fully bitslice implementation, it'd take at least 15 instructions
(assuming the most appropriate instruction set with 3 inputs per
instruction) to combine the 16 values, but there won't be any need to
spread/duplicate the index bits. Those 15+ instructions will give us
parallel 4-to-1 bit lookups (as many of them as our SIMD vector width
permits, e.g. 128). We'll need to do this 8 times for one 4-to-8 S-box
lookup. This gives us a slowdown estimate of 120x for each of the
instances being computed in parallel, compared to a purely sequential
software implementation with explicit array lookups. If we're able to
compute not just 128, but, say, 256 instances in parallel (e.g., on a
future processor implementing AMD XOP efficiently), then we might have
some speedup. There's another maybe 50% slowdown because of having to
load a half of the 256 values into registers, though, since we don't
expect our CPU to have 128+ registers and since on x86 instruction set
extensions only one input operand may reside in memory. (And on RISC
architectures, none can.) So perhaps the vectors would need to be wider
than 256 bits for any speedup due to bitslicing. On the other hand,
bitslicing allows for optimizing the partial-carry adds.
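A sketch of that fully bitslice lookup (my illustration, with assumptions: `slice` is a 64-bit scalar standing in for a wider SIMD vector, and sel() is assumed to compile to one instruction on architectures with selb/vsel/vpcmov/BFI_INT):

```c
#include <stdint.h>

typedef uint64_t slice; /* one lane per instance; real code would use 128+-bit vectors */

/* Bit select: for each bit position, pick a where the mask bit is 1,
 * else b. One instruction where a select instruction exists. */
static slice sel(slice m, slice a, slice b)
{
	return b ^ (m & (a ^ b));
}

/* Bitslice 16-to-1 lookup: v[0..15] hold one bit of each of the 16
 * table entries across 64 parallel instances; i0..i3 are the bitsliced
 * index bits (i0 least significant). A tree of 8 + 4 + 2 + 1 = 15
 * selects, matching the 15-instruction estimate above. */
static slice lookup16(const slice v[16], slice i0, slice i1, slice i2, slice i3)
{
	slice t[8];
	int k;

	for (k = 0; k < 8; k++)
		t[k] = sel(i0, v[2 * k + 1], v[2 * k]);
	for (k = 0; k < 4; k++)
		t[k] = sel(i1, t[2 * k + 1], t[2 * k]);
	for (k = 0; k < 2; k++)
		t[k] = sel(i2, t[2 * k + 1], t[2 * k]);
	return sel(i3, t[1], t[0]);
}
```

For one 4-to-8 S-box lookup this would run 8 times (once per output bit), consistent with the estimate in the text.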
Post by Solar Designer
The S-boxes themselves will be ever-changing, like they are in Blowfish
normally. This prevents DES-like bitslice implementations.
...However, there's some similarity to DES:

With DES, we directly put the constant S-boxes into LUTs. With changing
S-boxes, we instead use the LUTs to select the right outputs, emulating
table lookups much like a bitslice software implementation would. Thus,
in either case (DES or this Blowfish-like construction) a bitslice
implementation in software would use the same approach that a hardware
implementation does.

This makes me think that we might not actually gain much (or anything)
by doing this, compared to just using DES. But this needs further
research and testing.

If there's a reduction in logic cell count (compared to that for DES),
then this is what provides advantage - we'll have many more cores per
FPGA chip. If not, then maybe the current unfriendliness to bitslice
implementations (needing relatively uncommon instructions and frequent
loads from L1 cache) provides some advantage.
Post by Solar Designer
Also, Blowfish uses XORs (no-carry addition) and ADDs (all-carry
addition). In hardware, we can instead use partial-carry addition -
e.g., do carry for every other bit. A non-bitslice software
implementation would need to use an extra instruction there, whereas
we save logic.
Here's what this looks like in a non-bitslice software implementation:

#define pcadd(a, b, mask) \
(((a) ^ (b)) + (((a) & (b) & (mask)) << 1))

The mask is constant (since we optimize for hardware implementation).

That's 5 instructions instead of 1, whereas the cost in FPGA should be
the same. Or did I miss a way to do this in software more optimally
(other than bitslice)?
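A quick check of the macro's two mask extremes (my addition, easy to verify by hand) pins down what the mask does: mask = 0 generates no carries at all and degenerates to plain XOR, while an all-ones mask propagates every carry and gives a full add. The 0x55 mask used above sits in between, carrying at every other bit:

```c
#include <stdint.h>

/* The pcadd() macro from above: partial-carry addition, where carries
 * are generated only at the bit positions set in mask. */
#define pcadd(a, b, mask) \
	(((a) ^ (b)) + (((a) & (b) & (mask)) << 1))

/* Endpoints, for 8-bit a and b held in wider unsigned variables:
 *   pcadd(a, b, 0x00) == a ^ b   (no carries)
 *   pcadd(a, b, 0xff) == a + b   (all carries: a + b = (a^b) + 2*(a&b)) */
static unsigned int pcadd_demo(unsigned int a, unsigned int b,
                               unsigned int mask)
{
	return pcadd(a, b, mask);
}
```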
Post by Solar Designer
Finally, we can add bit permutations. These are free in hardware, but
are costly for non-bitslice software implementations. (Some modern
instruction sets include arbitrary byte permutation instructions, but
not arbitrary bit permutations.)
I haven't tried this yet.
Post by Solar Designer
We'd need to do some testing of the resulting component, such as for
collision resistance, but we'd call it "non-crypto". We may use a
software implementation for most of such testing (slow, but that's OK).
I did some testing of this kind. The results are acceptable.

$ gcc bflike.c -o bflike -O2 -fomit-frame-pointer -funroll-loops -Wall
$ ./bflike < w | sort -u | wc -l
ntot/neq = 255.83
65536
$ wc -l w
65536 w

No full internal state collisions for the 65536 inputs here. Also, the
code does some sanity checks.

Yuri - here's a simplified version of encrypt(), with the sanity checks
and encryption mode removed (leaving only the slow key setup mode), for
you to try to synthesize:

static void slow_key_setup(unsigned int n)
{
	word *p;
	word l = 0;
	word r = 0;

	do {
		p = &s[0][0];
		do {
			int i;
			for (i = 0; i < Nrounds / 2; i++) {
				r ^= pcadd(s[0][l & 0xf], s[1][l >> 4], 0x55);
				l ^= pcadd(s[0][r & 0xf], s[1][r >> 4], 0xaa);
			}
			*p++ ^= l;
			*p++ ^= r;
		} while (p < &s[1][15]);
	} while (--n);
}

Thanks,

Alexander
Solar Designer
2011-05-14 21:22:43 UTC
Permalink
David, Yuri -
Post by Solar Designer
I've experimented with this today. Attached is a program that
demonstrates a heavily cut-down Eksblowfish-like algorithm.
I've attached a new revision of it. This one compensates for my removal
of Blowfish's P array better. Although the previous revision did not
have full internal state collisions for the inputs I was throwing at it,
it would produce a bit fewer than expected 16-bit "encrypted" outputs -
e.g., for 65536 different inputs, it would typically produce between
41100 and 41400 different outputs, whereas ideally it'd produce around
41427 of them. The new revision correctly produces slightly fewer or
slightly more than 41427 different outputs on this test (with different
sets of inputs).

This expected number may be calculated as (1-1/e)*65536, which assumes
uniform distribution of outputs when encrypting a constant with
different keys, and that 65536 is large enough for this approximation to
be valid (it is).

I also used the following little program (which I wrote many years ago)
to calculate the expected number of different outputs of a certain size
(in bits) for an arbitrary number of different inputs:

#include <math.h>
#include <stdio.h>
#include <stdlib.h>

long double v, x, y;
int n;

int main(int argc, char **argv)
{
	v = pow(2.0, atof(argv[1]));
	x = 0; n = atoi(argv[2]);
	y = v * (1.0 - pow((v - 1.0) / v, n));
	while (n--) x = x * ((v - 1.0) / v) + 1.0;
	printf("%Lf\n%Lf\n", x, y);
	return 0;
}

It calculates the same thing twice, in different ways, out of paranoia.

For 16-bit outputs, this gives 65504 for 500000, or 65535.98 for 1000000
inputs. The new revision of bflike.c passes these tests, too - e.g.,
here are two invocations of it on two different sets of 500000 inputs:

$ ./bflike < w500k1 | sort -u | wc -l
ntot/neq = 256.01
65505
$ ./bflike < w500k2 | sort -u | wc -l
ntot/neq = 256.00
65512

The ntot/neq is another sanity check. (Should be close to 256.00.)
Also, there's a check for looping internal state - if this were
detected, a message would be printed to stderr. (It would be a problem
because it'd allow an attacker to bypass further computation after
similarly detecting a loop.)
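A loop check of this kind can be sketched with Floyd's cycle finding; the byte-wide step() below is a made-up stand-in for the real state update in bflike.c (which operates on the full 272-bit state):

```c
#include <stdint.h>

/* Toy single-byte state update (made up for illustration): a
 * full-period LCG mod 256, so the whole state space is one cycle. */
static uint8_t step(uint8_t x)
{
	return (uint8_t)(x * 37 + 1);
}

/* Floyd's tortoise-and-hare: returns the step count at which the two
 * walkers first meet (a cycle was entered), or 0 if none is found
 * within 'limit' steps. A looping internal state is the problem
 * described above: an attacker who detects the loop can skip the
 * remaining iterations. */
static unsigned long detect_loop(uint8_t start, unsigned long limit)
{
	uint8_t slow = start, fast = start;
	unsigned long n;

	for (n = 1; n <= limit; n++) {
		slow = step(slow);
		fast = step(step(fast));
		if (slow == fast)
			return n;
	}
	return 0;
}
```

With this toy permutation the walkers meet after exactly 256 steps (the cycle length); the real check only has to flag a repeat well before the configured iteration count.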

I am using those 16-bit outputs for testing only. If we end up actually
using this algorithm, we'd be passing its entire internal state into
something like SHA-256. (Unless it is somehow difficult or costly to
route those internal state bits to a SHA-256 core or to have them sent
to the host system.) So the previous revision would have been fine as
well. But I prefer to have some safety margin.

I also introduced optional bit order reversals, which don't affect the
algorithm's properties significantly. These need more thought - we'd
want to use them in such places that a sequential software
implementation wouldn't move them out of the loop (such as by
re-ordering array elements if we reverse the bits in its index).
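For reference, a plain-C 8-bit bit order reversal (my sketch, not code from bflike.c) takes several shift/mask steps per value, whereas in hardware it is free wiring, like the other permutations:

```c
#include <stdint.h>

/* Reverse the bit order of an 8-bit value using the usual three-step
 * swap ladder: adjacent bits, then bit pairs, then nibbles. */
static uint8_t bitrev8(uint8_t x)
{
	x = (uint8_t)(((x & 0x55) << 1) | ((x >> 1) & 0x55)); /* swap adjacent bits */
	x = (uint8_t)(((x & 0x33) << 2) | ((x >> 2) & 0x33)); /* swap bit pairs */
	x = (uint8_t)((x << 4) | (x >> 4));                   /* swap nibbles */
	return x;
}
```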

Yuri - my request to you remains the same. You don't need to update to
this new revision, you may use the old one. I merely want to get an
idea of the logic cell count, as well as of why a certain number of
logic cells is needed to implement something like this (some sort of
diagram would help). I think you could want to implement the two rounds
at once (including their four S-box lookups), or maybe even four at once
(assuming Nrounds set to 4 or larger in the program). I think it is
possible to compute many rounds at once (per one clock cycle) as long as
the S-boxes don't change during them (they only change after Nrounds is
reached). Of course, this should result in a higher signal propagation
delay and thus lower maximum clock rate, but that's fine. Whatever turns
out to make efficient use of resources. Then we'd need to compare this
against a round of DES.

Thanks,

Alexander
Yuri Gonzaga
2011-05-19 21:32:33 UTC
Permalink
Hi!

I made a first attempt. I think I spent more time than expected, mainly
because of the adaptation to the Xilinx dev environment.

The design is composed of a state machine that reads the "n" input,
calculates the rounds, and then outputs "l", "r", and "s" (byte by byte).
I added a `define directive to parameterize how many rounds are done per
single clock cycle.
The state machine has 8 states, but only 2 are related to round
calculation.

For NROUNDS = 2, synthesis results (Target Device: Virtex6
xc6vlx75t-3ff484):

Device Utilization Summary (estimated values)

Logic Utilization                   Used   Available   Utilization
Number of Slice Registers             35       93120            0%
Number of Slice LUTs                 131       46560            0%
Number of fully used LUT-FF pairs     35         131           26%
Number of bonded IOBs                 14         240            5%
Number of BUFG/BUFGCTRLs               2          32            6%

Maximum frequency: 261.356MHz

For NROUNDS = 4 (same target device):

Device Utilization Summary (estimated values)

Logic Utilization                   Used   Available   Utilization
Number of Slice Registers             35       93120            0%
Number of Slice LUTs                 183       46560            0%
Number of fully used LUT-FF pairs     35         183           19%
Number of bonded IOBs                 14         240            5%
Number of BUFG/BUFGCTRLs               2          32            6%

Maximum frequency: 147.432MHz

About the pipelining, how can we deal with the fact that there are
dependencies between r, l and s in calculations?
Will each stage have to store locally r, l and s?

The verilog code is attached, including simulation.

Regards,

Yuri Gonzaga
Solar Designer
2011-05-20 01:16:24 UTC
Permalink
Yuri, David -
Post by Yuri Gonzaga
I did the first attempt. I think I spent more time than I expected mainly
because the adaptation to Xilinx dev environment.
These results are very helpful, and I am pleased that you did this soon
enough considering the dev environment change. Thank you!
Post by Yuri Gonzaga
Device Utilization Summary (estimated values)
I looked at the HTML version of your message to figure this out.
(Normally, I read text/plain sections only.)
Post by Yuri Gonzaga
Logic Utilization
Used
Available
Utilization
Number of Slice Registers
35
93120
0%
Number of Slice LUTs
131
46560
0%
I find this puzzling. We have 272 bits of internal state (not including
auxiliary variables for loop control, etc.) How do they fit in 35
registers? http://www.1-core.com/library/digital/fpga-logic-cells/ says
we have four 1-bit registers per slice in a Virtex-5. Has something
changed in Virtex-6? (I doubt it.) Or is this a terminology difference?
(I find this more likely.) Yuri, David - can one of you explain this to
me, please?

s[] is declared as:

reg [7:0] s [31:0];

Shouldn't this consume 256 registers right away?

As to 131 LUTs, this is believable.
Post by Yuri Gonzaga
Number of fully used LUT-FF pairs
35
131
26%
The same info repeated - 35 flip-flops, but 131 LUTs. Since there's an
equal number of flip-flops (aka registers?) and LUTs per slice, this
means that we're only using 26% of the flip-flops that we could have.

...but I am puzzled why we need so few of them. I'd expect that we'd
have unused LUTs, not unused flip-flops (because we'd need more of them
than LUTs).
Post by Yuri Gonzaga
Number of bonded IOBs
14
240
5%
Number of BUFG/BUFGCTRLs
2
32
6%
What are these? Could we reasonably use more of them by redefining our
Blowfish-like algorithm slightly?
Post by Yuri Gonzaga
Maximum frequency: 261.356MHz
Assuming that we're LUT-bound and we can make use of all LUTs with this
(which might not be true), this gives us:

46560/131*261.356 = 92891

That's 93 billion two-round iterations per second (assuming sufficient
parallelism, which we'll provide).

We need to compare this against a CPU implementation. (And maybe also
against GPU, but that's trickier.) For a quad-core CPU at 3.5 GHz
capable of 3 instructions per cycle (on each core) running a
non-bitslice implementation, assuming 30 instructions per 2 rounds (the
best gcc-generated code I was able to get for an unrolled loop):

3500*4*3/30 = 1400

That's 1.4 billion two-round iterations per second. So the speedup is:

92891/1400 = 66.35

More optimal code might use 24 instructions per 2 rounds (with 3-operand
instructions if those are supported by the arch, which saves some moves):

3500*4*3/24 = 1750
92891/1750 = 53.08

For DES, we were getting a number of around 40 here - but comparing
actual performance (both FPGA and CPU), not theoretical.

So we have some (theoretical) improvement over DES, but it's not as good
as I would have liked it to be (I wanted to see 100+ here). Some
further improvement is possible by adding bit permutations.

Is "Virtex6 xc6vlx75t-3ff484" the same as the device that David
mentioned getting 22 billion of DES encryptions per second for, though?
If it's of a different size or speed grade, that could easily turn my
66x estimate into 100x or 50x (for the actual device).
This is also very helpful, thanks!
Post by Yuri Gonzaga
Number of Slice Registers
35
93120
0%
No more registers needed. Makes sense.
(But the low number is just as puzzling.)
Post by Yuri Gonzaga
Number of Slice LUTs
183
46560
0%
More LUTs, but not 2x more. Also makes sense. The relatively small
increase in LUT count means that with a 2-round implementation, we waste
most LUTs on the state machine, right?
Post by Yuri Gonzaga
Maximum frequency: 147.432MHz
This gives:

46560/183*147.432*2 = 75021

75 billion of 2-round iterations per second. Slightly slower than for
the 2-round implementation. But this is worth another try with our
closer-to-final Blowfish-like algorithm, if we choose to use one. On
the other hand, the numbers so far suggest that 2-round is likely
better, and that if we simplify the state machine logic (possible or
not?) it will be better yet.
Post by Yuri Gonzaga
About the pipelining, how can we deal with the fact that there are
dependencies between r, l and s in calculations?
Will each stage have to store locally r, l and s?
Yes. We'll need the full set of registers for these for each pipeline
stage. If right now we're LUT- rather than register-bound (which is
puzzling), then this makes sense (for a pipeline with 4 stages or so).
Post by Yuri Gonzaga
The verilog code is attached, including simulation.
`define PCADD(a, b, mask) (((a) ^ (b)) + (((a) & (b) & (mask)) << 1))
Are you sure this is good enough for the compiler to do the smart thing,
realizing that we have an adder with some holes punched in it?
Post by Yuri Gonzaga
s[0] = 8'ha6;
s[1] = 8'hac;
s[2] = 8'hdb;
s[3] = 8'hb7;
...
Post by Yuri Gonzaga
s[28] = 8'ha5;
s[29] = 8'h29;
s[30] = 8'h40;
s[31] = 8'h3e;
Do these initial value assignments consume any logic? Perhaps they do.
If so, perhaps you can save some logic by having the initial state of
s[] uploaded from the host?

Did you verify correctness of your implementation in any way? (Perhaps
by making sure it produces the same outputs for the same inputs as the C
implementation does.)

Thanks again,

Alexander
Yuri Gonzaga
2011-05-20 03:07:05 UTC
Permalink
Post by Solar Designer
I looked at the HTML version of your message to figure this out.
(Normally, I read text/plain sections only.)
Oh! I am sorry. Next time I will send plain text only.
It was easy to copy and paste the results table from Xilinx ISE.

I have to say that I am not used to Xilinx terminology.
So it is better to wait for David to explain the relation between
these slices.

Even so, I did some research on Google.
According to [1], in section "CLBs, Slices, and LUTs", one Virtex-6 slice
has 4 LUTs and 8 flip-flops.
It looks like one LUT has up to 2 flip-flops.
So, Slice Registers = LUT-FF pairs = 35 may mean that 35 LUTs were used as
registers and all of them used 2 flip-flops.
Following this idea, that adds up to 70 flip-flops, which is still fewer
than the 256 expected.

Maybe some of the registers were merged into logic in the 131 LUTs.
I remember the synthesizer log warning about the possibility of latch
inference.
In this case, some bits of s, l, and r could be implemented in LUTs instead
of flip-flops.

IOBs are Input-Output Blocks and are affected only by the number of inputs
and outputs at the FPGA level.
I guess BUFG/BUFGCTRLs are buffers used to make global signals more stable.
In this project, 2 are used: one for the clock and the other for the reset
signal.

Post by Solar Designer
The relatively small increase in LUT count means that with a 2-round
implementation, we waste most LUTs on the state machine, right?
I am not sure. Maybe some experimentation could clarify this question.
For example, remove part of the state machine and see the impact on the
result.

Post by Solar Designer
On the other hand, the numbers so far suggest that 2-round is likely
better, and that if we simplify the state machine logic (possible or
not?) it will be better yet.
I could try to optimize, but I don't know if I can achieve much more
resource saving.

Post by Solar Designer
Are you sure this is good enough for the compiler to do the smart thing,
realizing that we have an adder with some holes punched in it?
And what do you suggest? Finding an equivalent but simpler function?
In any case, the synthesizer already does optimizations in logic (or at
least tries to).

Post by Solar Designer
Do these initial value assignments consume any logic?
Yes.

Post by Solar Designer
If so, perhaps you can save some logic by having the initial state of
s[] uploaded from the host?
It is possible. However, for each calculation it will be necessary to
upload them.
Unless we upload just once and maintain them stored in some registers.
Another possibility is to write them to a ROM (depending on board
availability) and read from it.

Post by Solar Designer
Did you verify correctness of your implementation in any way? (Perhaps
by making sure it produces the same outputs for the same inputs as the C
implementation does.)
Yes. The testbench.v I sent has this goal.
I did exactly what you said.

[1]
http://www.dinigroup.com/product/data/DN-DualV6-PCIe-4/files/Virtex6_Overview_v11.pdf

Yuri Gonzaga
Solar Designer
2011-05-20 04:34:50 UTC
Permalink
Post by Yuri Gonzaga
Even so, I did some research on Google.
According to [1], in section "CLBs, Slices, and LUTs", One Virtex-6 slice
has 4 LUTs/8 flip-flops.
It looks like one LUT has up to 2 flip-flops.
Yes. This appears to be a 2x improvement over Virtex-5.
Post by Yuri Gonzaga
Maybe some of the registers were merged into logic in the 131 LUTs.
It appears so. This "Virtex-6 Family Overview" you found says that
"between 25-50% of all slices can also use their LUTs as distributed
64-bit RAM". Per the table on page 2, for the device you synthesized
for, this percentage appears to be at around 35.9%.

Also, it appears that you synthesized for the smallest Virtex-6 device,
with 11,640 slices. The largest described in that overview is 6.4 times
larger. If my math is right, assuming the 66x figure from my previous
e-mail, it could give us a 420x advantage over a quad-core CPU, or as
much as 2500x for Pico's 6-chip board.

Would it be possible to generate some sort of diagram, to easily see how
those 131 LUTs are used?

That table on page 2 also gives the number of 36 Kbit Block RAMs - from
156 to 1064. This should let us implement this many full bcrypt cores.
I think we should try to do just that, and perhaps we'd have plenty of
LUTs left for smaller/other cores (bflike and/or DES). So please
proceed to implement bcrypt's performance-critical loop (as discussed
previously) for Virtex-6 using Block RAM.

Finally, there are lots of DSP slices - so many that it feels wrong not
to use them - from 288 to 2016. Each DSP can process 48-bit data at
600 MHz. It is unclear from this overview how much memory and by what
means the DSPs can access, though. If they can access an equivalent of
arbitrary 48-bit registers (many of them), then 2016 of those DSPs could
outperform a quad-core dual-issue 128-bit SIMD CPU at bitslice DES by a
factor of 15.
Post by Yuri Gonzaga
Post by Solar Designer
The relatively small
increase in LUT count means that with a 2-round implementation, we waste
most LUTs on the state machine, right?
I am not sure. Maybe doing some experimentations could clarify this
question.
For example, remove part of state machine and see the impact in the result.
Yes, you could try. Another guess: many LUTs are wasted on the s[]
initial values. Those values could theoretically be packed in just four
LUTs, but that would probably not allow for them to be read out quickly.
So perhaps more LUTs are wasted for that reason.
Post by Yuri Gonzaga
Post by Solar Designer
Are you sure this is good enough for the compiler to do the smart thing,
realizing that we have an adder with some holes punched in it?
And what do you suggest? Find an equivalent function but simpler?
No, I thought you'd show this partial carry addition bit-by-bit, without
redundancy that you expect the synthesizer to detect and optimize out.
But maybe this is not needed. If we could see (and understand) a diagram
of the generated circuit, we'd know for sure. Or you can try both ways
and compare the LUT counts.

Oh, here's a simpler test: try replacing pcadd() with a simple addition
or simple XOR. If the synthesizer was smart enough, this should not
change the LUT count. If this does reduce the LUT count, then perhaps
there was room for improvement.
Post by Yuri Gonzaga
In any case, the synthesizer already does optimizations in logic (or at
least, tries to).
Yes, perhaps it does that well enough.
Post by Yuri Gonzaga
Post by Solar Designer
Do these initial value assignments consume any logic?
Yes.
Post by Solar Designer
If so, perhaps you can save some logic by having the initial state of
s[] uploaded from the host?
It is possible. However for each calculation it will be necessary to upload
them.
And that's fine. We'll need to be supplying them anyway - if not from
the host, then from a SHA-256 core (or the like) implemented in the FPGA.

Also, Niter of 512 that I had in my code was just for testing. In actual
use, the iteration counts would be a lot higher - perhaps about 2 million
(to compute our hashes in 10ms at 200 MHz). Uploading some data every
10ms is fine. We'd need to do it anyway, the only question is how much
data - only input to a SHA-256 core implemented in FPGA (which would in
turn feed those other components) or the output of SHA-256 or SHA-512
running on the host (inputs to each of the thousands of cores individually).
I think that either approach would work.

BTW, I did some testing of bflike.c with millions of iterations -
detected no state repeats.
Post by Yuri Gonzaga
Unless we upload just once and maintain them stored in some registers.
No, you're missing the fact that we actually need to provide inputs to
those cores. L and R values are not sufficient - they're only 16-bit
together, so there will be lots of repeated values of them across the
cores and between iterations. We want our cores to be computing
different things, not the same thing many times (which an attacker would
optimize out).

Oh, if we used different initial values of s[] for each core, then we
could possibly send/receive just the 16-bit blocks (and accept having
some collisions). Yet this feels worse than sending/receiving s[].
Post by Yuri Gonzaga
I did exactly what you said.
Great!
Post by Yuri Gonzaga
[1]
http://www.dinigroup.com/product/data/DN-DualV6-PCIe-4/files/Virtex6_Overview_v11.pdf
Thanks,

Alexander
Solar Designer
2011-05-20 04:51:21 UTC
Permalink
David, Yuri -
Post by Solar Designer
Also, it appears that you synthesized for the smallest Virtex-6 device,
with 11,640 slices. The largest described in that overview is 6.4 times
larger. If my math is right, assuming the 66x figure from my previous
e-mail, it could give us a 420x advantage over a quad-core CPU, or as
much as 2500x for Pico's 6-chip board.
Oh, I was wrong about the largest. I looked at the last line in the table,
but that's not the largest device. The largest in terms of slice count
(XC6VLX760) is 10x larger than the smallest, but it is not the largest
in terms of Block RAMs and DSPs. Also, it lacks PCIe interface blocks,
which could be relevant to us.

David - which specific Virtex-6 devices are we going to use in production?

Alexander
David Hulton
2011-05-21 12:32:06 UTC
Permalink
Alexander, Yuri -

I think that we'd probably end up using a V6LX240T because it's not
very cost effective to go to anything larger. We will be coming out
with a V7 board later this year which will have closer to 500k LUTs
but it won't be available until after this project is over.

One thing we might want to consider is making this pack into LUTs the
most efficiently. I believe that for the DES S-boxes it only requires
4 6-input LUTs per S-box. I'm not sure how this compares to the
modified BF implementation (sorry I haven't had a chance to review
this much because I'm out of the country at the moment, btw are you
going to PH-Neutral?), but it would be good to double check how many
LUTs are used and see if there's a more optimal way to structure it
that still implements the similar amount of security.

-David
Post by Solar Designer
David, Yuri -
Post by Solar Designer
Also, it appears that you synthesized for the smallest Virtex-6 device,
with 11,640 slices.  The largest described in that overview is 6.4 times
larger.  If my math is right, assuming the 66x figure from my previous
e-mail, it could give us a 420x advantage over a quad-core CPU, or as
much as 2500x for Pico's 6-chip board.
Oh, I was wrong about the largest.  I looked at the last line in the table,
but that's not the largest device.  The largest in terms of slice count
(XC6VLX760) is 10x larger than the smallest, but it is not the largest
in terms of Block RAMs and DSPs.  Also, it lacks PCIe interface blocks,
which could be relevant to us.
David - which specific Virtex-6 devices are we going to use in production?
Alexander
Solar Designer
2011-05-21 22:36:51 UTC
Permalink
David, Yuri -
Post by David Hulton
I think that we'd probably end up using a V6LX240T because it's not
very cost effective to go to anything larger. We will be coming out
with a V7 board later this year which will have closer to 500k LUTs
but it won't be available until after this project is over.
OK. Did your DES speed numbers (such as 22 billion/second/chip) apply
to XC6VLX240T or to something else? This is important for us to know
such that we can compare DES against the alternatives we're considering.
Post by David Hulton
One thing we might want to consider is making this pack into LUTs the
most efficiently. I believe that for the DES S-boxes it only requires
4 6-input LUTs per S-box.
Yes, DES is great in that respect.

However, DES is probably larger than my bflike thing, so we'd have fewer
cores. I'd appreciate it if you post some info on that - e.g., some
synthesis results (preferably directly comparable to what Yuri posted
for bflike) for one fully-pipelined DES core. Alternatively, I need the
number of fully-pipelined DES cores that you manage to fit in a
particular Xilinx chip. There must be some overhead to manage those
cores, though (feed them with inputs, process their outputs) - I'd
appreciate info on that as well (what percentage of the logic is spent
on that?)

Also, DES is more software-friendly (with bitslice implementations).

Given the numbers Yuri posted, it appears that a XC6VLX240T would
outperform Core i7-2600 at bflike by a factor of 200. Isn't this 5x
better than the 40x we had for DES? This ignores the overhead, though.
But on the other hand, there's further room for improvement (add bit
permutations, which will slow down software).
Post by David Hulton
I'm not sure how this compares to the
modified BF implementation (sorry I haven't had a chance to review
this much because I'm out of the country at the moment, btw are you
going to PH-Neutral?), but it would be good to double check how many
LUTs are used and see if there's a more optimal way to structure it
that still implements the similar amount of security.
That's what we're trying to do, and we'd appreciate your help with it -
specifically, more info on your DES cores, and some advice on
generating and understanding a circuit diagram or the like.

No, I am not going to PH-Neutral. Have fun there!

Alexander
David Hulton
2011-05-30 17:52:44 UTC
Permalink
Alexander,
OK.  Did your DES speed numbers (such as 22 billion/second/chip) apply
to XC6VLX240T or to something else?  This is important for us to know
such that we can compare DES against the alternatives we're considering.
Yes, that's what's running on a single XC6LX240T.
However, DES is probably larger than my bflike thing, so we'd have fewer
cores.  I'd appreciate it if you post some info on that - e.g., some
synthesis results (preferably directly comparable to what Yuri posted
for bflike) for one fully-pipelined DES core.  Alternatively, I need the
number of fully-pipelined DES cores that you manage to fit in a
particular Xilinx chip.  There must be some overhead to manage those
cores, though (feed them with inputs, process their outputs) - I'd
appreciate info on that as well (what percentage of the logic is spent
on that?)
There is a little bit of overhead but not much, most of the space is
taken up by the S-Boxes. We currently have 36 fully pipelined DES
cores on our 22 billion/sec image. Each core runs at 600MHz. We could
fit more cores if we ran it at a slower speed but that's the best for
doing the high speed routing (and from what we've found is the best
tradeoff for the current design we have).
Also, DES is more software-friendly (with bitslice implementations).
Given the numbers Yuri posted, it appears that a XC6VLX240T would
outperform Core i7-2600 at bflike by a factor of 200.  Isn't this 5x
better than the 40x we had for DES?  This ignores the overhead, though.
But on the other hand, there's further room for improvement (add bit
permutations, which will slow down software).
That would be great if we could get a better advantage than DES with
this bflike algorithm. Was this verified by actually synthesizing
something close to the proposed algorithm?
That's what we're trying to do, and we'd appreciate your help with it -
specifically, more info on your DES cores, and some advice on
generating and understanding a circuit diagram or the like.
Sure thing. Do we currently have a basic implementation to work from?
I think the best thing to do at this point is to synthesize and try
tweaking to see which design is the most space/speed efficient.

-David
Yuri Gonzaga
2011-05-31 02:42:55 UTC
Permalink
Hi, there.

I synthesized again the bflike.
Now, it is for the Spartan-6 xc6slx45, the same device as the Pico E-101.
I got the following results (This time I could generate a report with more
details).

I could generate a schematic view as well (available on http://bit.ly/koMcxo),
but it is very big and difficult to track. I don't know if it will help.

Post by Solar Designer
No, I thought you'd show this partial carry addition bit-by-bit, without
redundancy that you expect the synthesizer to detect and optimize out.
But maybe this is not needed. If we could see (and understand) a diagram
of the generated circuit, we'd know for sure. Or you can try both ways
and compare the LUT counts.
Oh, here's a simpler test: try replacing pcadd() with a simple addition
or simple XOR. If the synthesizer was smart enough, this should not
change the LUT count. If this does reduce the LUT count, then perhaps
there was room for improvement.
I replaced pcadd() with the simpler "a ^ b ^ mask" and it really reduced the LUT
count.
However, I didn't figure out how to improve pcadd() to do the right
thing using fewer LUTs.
All my attempts got wrong results in simulation.

Moreover, I am sending the Verilog code again because the last one I attached
has an error.

Regards,

Yuri.

-------------------------------------------------------------------------------------------------------

Slice Logic Utilization                  |  Used | Available | Utilization
-----------------------------------------+-------+-----------+------------
Number of Slice Registers                |    35 |    54,576 |          1%
  Number used as Flip Flops              |     3 |           |
  Number used as Latches                 |    32 |           |
Number of Slice LUTs                     |   105 |    27,288 |          1%
  Number used as logic                   |   105 |    27,288 |          1%
    Number using O6 output only          |    91 |           |
    Number using O5 and O6               |    14 |           |
Number of occupied Slices                |    32 |     6,822 |          1%
Number of LUT Flip Flop pairs used       |   109 |           |
  Number with an unused Flip Flop        |    75 |       109 |         68%
  Number with an unused LUT              |     4 |       109 |          3%
  Number of fully used LUT-FF pairs      |    30 |       109 |         27%
Number of unique control sets            |     2 |           |
Number of slice register sites lost      |     5 |    54,576 |          1%
  to control set restrictions            |       |           |
Number of bonded IOBs                    |    14 |       316 |          4%
Number of BUFG/BUFGMUXs                  |     2 |        16 |         12%
  Number used as BUFGs                   |     2 |           |

Maximum Frequency: 76.940MHz
Solar Designer
2011-06-04 22:05:10 UTC
Permalink
Post by Yuri Gonzaga
I synthesized again the bflike.
Now, it is for the Spartan-6 xc6slx45, the same device as the Pico E-101.
I got the following results (This time I could generate a report with more
details).
I could generate a schematic view as well (available on http://bit.ly/koMcxo),
but it is very big and difficult to track. I don't know if it will help.
Thanks! This is what I wanted to see, but you're right - it's so large
that it's difficult to figure anything out from it. I think we'll need
to generate such schematic views for much smaller pieces of Verilog
code, not for the entire thing at once.
Post by Yuri Gonzaga
Post by Solar Designer
Oh, here's a simpler test: try replacing pcadd() with a simple addition
or simple XOR. If the synthesizer was smart enough, this should not
change the LUT count. If this does reduce the LUT count, then perhaps
there was room for improvement.
I replaced pcadd() with the simpler "a ^ b ^ mask" and it really reduced the LUT
count.
How large was the reduction? Can you also try simple "a ^ b" (no mask)
and simple "a + b", and report the LUT counts for all four (original
full pcadd(), your "a ^ b ^ mask", and these two I suggested)? These
numbers might give us some hints.
Post by Yuri Gonzaga
However, I didn't figure out how to improve pcadd() to do the right
thing using fewer LUTs.
All my attempts got wrong results in simulation.
Can you possibly generate a schematic view for a pcadd() alone, with no
or little other logic (just enough to make sure parts of pcadd() aren't
optimized out as unused)? And for other variations of it as suggested
above. Then try to compare those.
Post by Yuri Gonzaga
Number of Slice LUTs 105 27,288 1%
It was 131 LUTs for your old code synthesized for Virtex-6. Why has
this reduced to 105 now? Is this for your simplified pcadd() (which
doesn't actually do the right thing)?
Post by Yuri Gonzaga
Maximum Frequency: 76.940MHz
Somehow this is half of what you reported for EksBlowfish.
Sure, we're doing two rounds at once here, but the rounds are roughly
half as complex as Blowfish's (two S-box lookups vs. four). Are lookups
from distributed memory (in LUTs) slower than those from BlockRAM? Or
does pcadd() as currently synthesized have higher propagation delays
than Blowfish's simple xor and add? Or is it the reduction in
parallelism (our two rounds are sequential, whereas Blowfish's four
S-box lookups for one round could theoretically occur in parallel)?
(Perhaps you don't have answers to these questions. I am just thinking
aloud.)

Anyway, I'd expect a comparable LUT count for EksBlowfish as well -
perhaps just a few times higher.

Thanks,

Alexander
Yuri Gonzaga
2011-06-06 23:10:27 UTC
Permalink
Post by Solar Designer
How large was the reduction? Can you also try simple "a ^ b" (no mask)
and simple "a + b", and report the LUT counts for all four (original
full pcadd(), your "a ^ b ^ mask", and these two I suggested)? These
numbers might give us some hints.
OK. I have uploaded a PDF file reporting the results for the 4 cases here:
http://bit.ly/iLaZQB

Post by Solar Designer
It was 131 LUTs for your old code synthesized for Virtex-6. Why has
this reduced to 105 now? Is this for your simplified pcadd() (which
doesn't actually do the right thing)?
I don't know. Maybe the synthesizer is able to optimize better for this
device.
I used the original pcadd() approach.

Yuri.
Solar Designer
2011-06-07 00:22:21 UTC
Permalink
Yuri -
Post by Yuri Gonzaga
Post by Solar Designer
How large was the reduction? Can you also try simple "a ^ b" (no mask)
and simple "a + b", and report the LUT counts for all four (original
full pcadd(), your "a ^ b ^ mask", and these two I suggested)? These
numbers might give us some hints.
http://bit.ly/iLaZQB
This shows two things:

1. The smallest LUT count is for the original pcadd(), contrary to what
you had said.

2. All four LUT counts are a lot higher than those you had reported for
bflike. Are they for something else?

Can you perhaps share the corresponding four pieces of code as well?

Thanks,

Alexander
Yuri Gonzaga
2011-06-13 22:55:27 UTC
Permalink
Post by Solar Designer
2. All four LUT counts are a lot higher than those you had reported for
bflike. Are they for something else?
While spending some time answering this question, I figured out that the first
result (the lower LUT count) was for the wrong version of the Verilog code.
So, the correct result is the one with the higher LUT count.

Post by Solar Designer
1. The smallest LUT count is for the original pcadd(), contrary to what
you had said.
Yes. I ran it again and the results are still the same. So, please, forget
previous ones.

Can you perhaps share the corresponding four pieces of code as well?


`define NROUNDS_DIVIDED_BY_2 1
`define PCADD(a, b, mask) (((a) ^ (b)) + (((a) & (b) & (mask)) << 1))
//`define PCADD(a, b, mask) a^b^mask
//`define PCADD(a, b, mask) a^b
//`define PCADD(a, b, mask) a+b

(...)

for(i = 0; i < `NROUNDS_DIVIDED_BY_2; i=i+1) begin
r = r ^ `PCADD(s[l[3:0]],s[16+(l>>4)], 8'h55);
l = l ^ `PCADD(s[r[3:0]],s[16+(r>>4)], 8'haa);
end


Regards,

Yuri
Solar Designer
2011-06-13 23:33:32 UTC
Permalink
Yuri -
Post by Yuri Gonzaga
Post by Solar Designer
2. All four LUT counts are a lot higher than those you had reported for
bflike. Are they for something else?
While spending some time answering this question, I figured out that the first
result (the lower LUT count) was for the wrong version of the Verilog code.
So, the correct result is the one with the higher LUT count.
This is confusing. Can you please non-ambiguously match code versions
to LUT counts, in one message you post in here? And explain why all
four LUT counts - for different kinds of pcadd() - are several times
higher than what you had reported for bflike before.

We can't work with unreliable, inconsistent, and self-contradictory
results like this.
Post by Yuri Gonzaga
I ran it again and the results are still the same. So, please, forget
previous ones.
I have no problem forgetting the previous results, but I need new ones.
Post by Yuri Gonzaga
Can you perhaps share the corresponding four pieces of code as well?
`define NROUNDS_DIVIDED_BY_2 1
`define PCADD(a, b, mask) (((a) ^ (b)) + (((a) & (b) & (mask)) << 1))
//`define PCADD(a, b, mask) a^b^mask
//`define PCADD(a, b, mask) a^b
//`define PCADD(a, b, mask) a+b
OK. Perhaps it'd be a good idea not to omit the parentheses around (a), etc.
This shouldn't matter given your specific uses of PCADD(), but for
reliable testing we need to be making only the changes that we want to
see the effect of.
Post by Yuri Gonzaga
(...)
for(i = 0; i < `NROUNDS_DIVIDED_BY_2; i=i+1) begin
r = r ^ `PCADD(s[l[3:0]],s[16+(l>>4)], 8'h55);
l = l ^ `PCADD(s[r[3:0]],s[16+(r>>4)], 8'haa);
end
OK. But I am still confused what entire Verilog source files your four
results applied to, and why they were several times higher than those
you had reported for bflike before.

Summary: please provide reliable results now.

If the LUT counts of 105 (Spartan-6) and 135 (Virtex-6) for bflike were
wrong, then please explain why.

Thanks,

Alexander
Yuri Gonzaga
2011-06-14 00:59:26 UTC
Permalink
OK. I will explain better.

The first code I sent gave wrong results (131 LUTs) because I made
modifications to the code after simulating it.
So, the next time I worked on this project, I realized those modifications had
broken it.
Then I corrected this and sent new code to the mailing list.
After the correction, the results became different (938 LUTs).
And this is the current status. The last PDF has the results for the
correct code.

Regards,

Yuri
Solar Designer
2011-06-14 01:34:07 UTC
Permalink
Yuri -
Post by Yuri Gonzaga
The first code I sent gave wrong results (131 LUTs) because I made
modifications to the code after simulating it.
In other words, you synthesized, tested for correct operation, then made
incorrect changes/optimizations, synthesized again, did not test,
reported the results for the latter revision of the code (non-working).
Is this the correct understanding?
Post by Yuri Gonzaga
So, the next time I worked on this project, I realized those modifications had
broken it.
Then I corrected this and sent new code to the mailing list.
After the correction, the results became different (938 LUTs).
And this is the current status. The last PDF has the results for the
correct code.
OK. That PDF shows the smallest LUT count for the full pcadd(), as
opposed to the various alternatives. This is inconsistent with your
earlier statement that you got reduced LUT count when you tried
replacing pcadd() with the shorter expressions. Which is right?

Here's what I think we should start doing:

Each time you report synthesis results, please provide two files: one
file with full source code (something-someversion.v) and the other with
synthesis results (something-someversion.html or .pdf or whatever).
If you need to provide 4 results - such as for different pcadd()'s -
provide a total of 8 files.

This involves some overhead, but our approach so far is obviously not
reliable enough.

And please use our wiki for file uploads. For example, create this page:

http://openwall.info/wiki/crypt-dev/files

Then your uploaded files will fall under crypt-dev (DokuWiki places them
one level above the page being edited).

How does this sound to you?

Thanks,

Alexander
Yuri Gonzaga
2011-06-16 20:09:16 UTC
Permalink
Post by Solar Designer
In other words, you synthesized, tested for correct operation, then made
incorrect changes/optimizations, synthesized again, did not test,
reported the results for the latter revision of the code (non-working).
Is this the correct understanding?
Yes. Exactly!

Post by Solar Designer
OK. That PDF shows the smallest LUT count for the full pcadd(), as
opposed to the various alternatives. This is inconsistent with your
earlier statement that you got reduced LUT count when you tried
replacing pcadd() with the shorter expressions. Which is right?
The earlier statement is inconsistent because I used the source code with
the error.
So, the right results are those reported recently.
Post by Solar Designer
Each time you report synthesis results, please provide two files: one
file with full source code (something-someversion.v) and the other with
synthesis results (something-someversion.html or .pdf or whatever).
If you need to provide 4 results - such as for different pcadd()'s -
provide a total of 8 files.
This involves some overhead, but our approach so far is obviously not
reliable enough.
http://openwall.info/wiki/crypt-dev/files
Then your uploaded files will fall under crypt-dev (DokuWiki places them
one level above the page being edited).
How does this sound to you?
Sounds great!
I created the wiki page you suggested and uploaded the 8 files (sources +
PDFs) for bflike.
Please take a look and tell me whether everything is right.

Thank you!

---
Yuri Gonzaga Gonçalves da Costa

Solar Designer
2011-06-04 23:32:27 UTC
Permalink
David,
Post by David Hulton
There is a little bit of overhead but not much, most of the space is
taken up by the S-Boxes. We currently have 36 fully pipelined DES
cores on our 22 billion/sec image. Each core runs at 600MHz.
Wow. This is where the constant S-boxes and ability to do the 8 S-box
lookups in parallel help. We're not going to achieve clock rates this
high for our Blowfish-like stuff... although I might try to come up with
a Blowfish-like construct having more parallelism in it.
Post by David Hulton
Post by Solar Designer
Given the numbers Yuri posted, it appears that a XC6VLX240T would
outperform Core i7-2600 at bflike by a factor of 200.  Isn't this 5x
better than the 40x we had for DES?  This ignores the overhead, though.
But on the other hand, there's further room for improvement (add bit
permutations, which will slow down software).
That would be great if we could get a better advantage than DES with
this bflike algorithm. Was this verified by actually synthesizing
something close to the proposed algorithm?
Yes, by looking at device utilization by one bflike core and concluding
that we'd be able to fit hundreds of those cores per chip. However,
this ignores overhead (routing, combining the cores' outputs).

Yuri posted the synthesis results in here - I'd appreciate it if you
take a look and provide your comments.
Post by David Hulton
Post by Solar Designer
That's what we're trying to do, and we'd appreciate your help with it -
specifically, more info on your DES cores, and some advice on
generating and understanding a circuit diagram or the like.
Sure thing. Do we currently have a basic implementation to work from?
Yes, we have this bflike thing in C and Verilog. It is an early draft.
Post by David Hulton
I think the best thing to do at this point is to synthesize and try
tweaking to see which design is the most space/speed efficient.
That's what we're doing.

We might end up with two kinds of bflike cores: those using BlockRAMs
and those using the remaining logic (after we're out of BlockRAMs).

Thanks,

Alexander
David Hulton
2011-05-12 17:44:17 UTC
Permalink
Alexander & Yuri,
Post by Solar Designer
- A lot of people have heard that DES has been "broken", even though it
hasn't - it's just that its 56-bit key size is insufficient.
True. I think we can probably get around this by using 3DES or doing
some sort of similar "key extension" to get more than the 56-bits out
of it. I think that aside from the small key size it's actually quite
a secure algorithm and has had a much longer time to be peer-reviewed
than most other algorithms.
Post by Solar Designer
- DES has been formally obsoleted, in favor of AES.  (As an excuse, we
can build upon 3DES and say that we're using that...)
- The key size (and also block size) is in fact inadequate; we'll need
to use DES such that this does not affect our KDF's properties.
- For the "alternative approach", DES is too software-friendly (but it
is also extremely hardware-friendly).
True. It's one tradeoff that we make by making the operations
parallelizable, but for an attacker this will always be the case since
they will be able to parallelize even fully sequential algorithms... I
think the best thing that we can do is make it so that the server side
can make use of parallelism as well so we can get away with using much
higher iteration counts and make better use of the server resources
when computing hashes legitimately....

One thought experiment is what if instead of doing the 25 iterations
of DES for crypt() they did 128 and made it parallelizable? This would
probably end up being faster to compute for the server and would also
slow down the attacker at the same time...
Post by Solar Designer
Post by David Hulton
We're able to do on the order of around 22 billion DES operations/sec
on our current M-501 modules (we currently are able to do over 1
trillion keys/sec on our 48-board 5U clusters)
That's impressive.
http://picocomputing.com/m_series.html says "Virtex-6 LX FPGA", "Up to
six M-501 modules per board".  Is the 22 billion figure for one or for
six FPGAs?
It's for a single M-501. One EX-500 with 6 M-501's is around 22
billion * 6 = 132 billion. Then we put 8 EX-500 backplanes in a
cluster to get the 132 billion * 8 = 1.056 trillion.
Post by Solar Designer
Yuri - the numbers I posted in here before were for DES-based crypt(3),
which includes 25 iterations of modified-DES.  So you need to multiply
them by 25 to compare against those posted by David above.  Thus, if
Core i7-2600 does 20.5M DES-based crypts per second (and it does that
here), this means that it does 512M modified-DES encryptions per second.
For plain DES (not modified), it would do approx. 550M per second in a
similar loop (no key setup in the loop, just iterated DES encryption).
More efficient parallelization over the multiple cores could provide
another 10% speedup (my 20.5M figure is for thread-safe code and OpenMP,
which has overhead).
That's still 35-40 times slower than the 22 billion mentioned by David,
which is good.
I have a feeling that by just adapting our current DES design (which
is really just a modified version of the fully pipelined DES core
provided by asics.ws/opencores.org) we should be able to get somewhere
around 2-4 billion on the E-101...
Post by Solar Designer
This makes sense.  As a bonus, since bitslice DES is also CPU-friendly,
we could try to design our KDF such that we treat the FPGA and the CPU
almost the same (at high level).
However, if 10x (or even 40x?) is possible in this way, this leads me to
think that we could achieve 100x+ with a truly CPU-unfriendly component
for the FPGA.
That's true. We would have to stray a bit far from the certified
algorithms but we could definitely put in some CPU/GPU roadblocks...
Post by Solar Designer
Disclaimer: I am not into GPU programming (yet).  This is why I use
words such as "apparently" below.
Yes, DES is somewhat unfriendly to GPUs so far, but this is somewhat
likely to change.  Bitslice implementations of DES with the S-box
expressions published so far need up to approx. 20 registers, which is
apparently more than what typical GPUs have available per processing
element.  Also, apparently accessing a temporary variable spilled to
memory is relatively slow (compared to accessing the L1 cache in a CPU).
One expected change is that we (Openwall) are going to release more
GPU-friendly DES S-box expressions in about a month from now.  These
match the instruction set of high-end ATI/AMD GPUs (as well as AMD XOP
and PPC AltiVec, which is why they'd get into JtR this soon), and they
will need fewer registers.
Another change that may or may not happen is GPUs getting more registers
and/or faster access to temporary variables.  This sounds quite
realistic to me, but I am not an expert in this area.  Based on what I
read (about the implementation difficulty for DES on GPUs), I wouldn't
be surprised if a mere 2x increase in register count results in a 10x
speedup for DES on GPUs.
Yeah, this is true and we'll need to look into this a bit more to
actually get a feel for how fast GPUs are at this right now and where
they'll be in the near future...
Post by Solar Designer
This makes sense.  Wouldn't ASIC provide similar advantage for
Blowfish-like constructs, though?
Yeah it would and it would be good for us to do a speed comparison
with a fully pipelined version of blowfish because I think we'll see a
lot better performance if it's implemented for speed.
Post by Solar Designer
This makes sense.  I think the degree of parallelism would need to be
far greater, though.  For the CPU, it will need to be at least 128
(since we'll do bitslice).
But I also need to post my thoughts on us using a sequential memory-hard
algorithm as defined in the scrypt paper.  Maybe that is what will
occupy the CPU.
I agree. For something like this it may be worth using larger S-boxes
(even of a size that would take up a full or half BlockRAM) so they
can't be bitsliced efficiently, forcing the CPU/GPU to go to
cache/memory. The only issue with this is making the algorithm work
efficiently and securely with a smaller number of S-boxes, because
BlockRAM resources are somewhat limited (the LX45 has 116 18Kbit
BlockRAMs)...

-David
Yuri Gonzaga
2011-05-12 23:48:35 UTC
Permalink
I am following all this discussion and trying to understand as much as I
can.
So far, I like the direction this is taking toward a specific method, and the
fact that we can leave GPU approaches out, since I don't know anything about
them.

What do you mean to a "non-crypto" component?

Thank you for letting me participate and have a say in the design decisions.
I will read all these emails again and see if I can make any more comments.

Yuri Gonzaga
Solar Designer
2011-05-14 00:40:43 UTC
Permalink
Yuri -
Post by Yuri Gonzaga
I am following all this discussion and trying to understand as much as I
can.
Great.
Post by Yuri Gonzaga
So far, I like the direction this is taking toward a specific method, and the
fact that we can leave GPU approaches out, since I don't know anything about
them.
I don't think I had ever proposed us implementing a GPU-specific
component this summer. What I did propose initially was to design a
hashing method that could be implemented near-optimally on all common
kinds of hardware, including CPU, GPU (theoretically), and FPGA/ASIC,
with possible use of GPUs in servers in mind. However, we're trying to
see if an "alternative approach" with a component optimized specifically
for FPGAs makes more sense currently. It probably does, because
attackers who readily have GPUs are more numerous than those who readily
have FPGAs, whereas servers have neither (it's an add-on board either
way). And, yes, we need to consider our experience, too.
Post by Yuri Gonzaga
What do you mean to a "non-crypto" component?
Did you mean "by", not "to"? If so, I mean that we call it "non-crypto"
and show that the cryptographic strength of our hashing method does not
depend on this component.
Post by Yuri Gonzaga
Thank you for letting me participate and have a say in the design decisions.
I will read all these emails again and see if I can make any more comments.
You're welcome, and thanks for your comments. If anything is unclear to
you, please feel free to ask.

Alexander
Solar Designer
2011-05-13 17:51:17 UTC
Permalink
David,
Post by Solar Designer
http://picocomputing.com/m_series.html says "Virtex-6 LX FPGA", "Up to
six M-501 modules per board".
...
Post by Solar Designer
http://picocomputing.com/e_series.html says it's "Spartan-6 LX45 FPGA".
I am trying to take the available logic cells into account when I
consider different approaches/designs. I am looking at this web page:

http://www.1-core.com/library/digital/fpga-logic-cells/

This shows two kinds of logic cells available in different Xilinx FPGAs -
older ones (with two 4-input LUTs per cell) and newer ones (with four
6-input LUTs per cell). My understanding is that Virtex-6 uses the
latter. But what about Spartan-6, which will be in Yuri's board?

Also, how many logic cells do these FPGAs have?

As to the BlockRAMs, are they 36 kilobit each (not kilobyte), with two
I/O ports each (thus, usable as 2x18 kbit when only one port is needed)?

How many BlockRAMs do these specific FPGAs have?

(Of course, I could find this out on my own, but I thought it's quicker
to ask, and it's beneficial to have this mentioned in here anyway.)

Thanks,

Alexander
Solar Designer
2011-06-04 23:14:46 UTC
Permalink
David, Yuri -

I made some rather naive comments on DES on GPUs last month. I've since
discussed the matter with some people more familiar with GPUs, so here's
an update.

First, the best benchmark result I am currently aware of is for MTY, a
tripcode finder, doing 250 Mhashes/sec on ATI 5970. This translates to
6.25 billion of DES block encryptions per second. So it is still 3+
times slower than the 22 billion on Virtex-6, but it's by far not as bad
as the numbers for GPUs that we had before. Our new DES S-box
expressions will improve this number (for the GPU) some further.

http://en.wikipedia.org/wiki/Tripcode
http://sourceforge.jp/projects/naniya/
http://harry.lu/mty/
Post by Solar Designer
Disclaimer: I am not into GPU programming (yet). This is why I use
words such as "apparently" below.
Yes, DES is somewhat unfriendly to GPUs so far, but this is somewhat
likely to change. Bitslice implementations of DES with the S-box
expressions published so far need up to approx. 20 registers, which is
apparently more than what typical GPUs have available per processing
element. Also, apparently accessing a temporary variable spilled to
memory is relatively slow (compared to accessing the L1 cache in a CPU).
I've since read up on this a bit, mostly with Nvidia focus (even though
AMD/ATI cards are apparently more suitable for the task). The number
of available registers per stream processor is actually pretty high
(thousands), but instruction latencies are so large that you have to run
a large number of simultaneous threads in order to hide those latencies.
This is why the number of registers actually available per thread turns
out to be pretty low (tens). Memory accesses are indeed slow, although
that differs by memory type a lot. On Nvidia, there's so-called shared
memory, which is nearly as fast as registers, but there's very little of
it (its size per stream processor may even be less than the combined
size of the stream processor's registers). Other types of memory have
huge latencies (hundreds of cycles).

With an architecture like this, not only the temporary values (such as
those computed inside DES S-box expressions), but also DES blocks and
keys need to be stored in registers (or in Nvidia's shared memory, but
there's not more of it). This gives us something like 64+56+20 = 140
registers needed per bitslice DES instance. But there are not this many
on Nvidia cards for the otherwise optimal number of threads to run.

On high-end ATI cards, according to the author of MTY, slowdown starts
to occur when a thread needs more than 49 registers, but MTY exceeds
this number anyway (no better choice). Yet, as stated above, its
overall performance is nevertheless very good (comparable to 8 billion
hashes/second reported for MD5, which is much more GPU-friendly).
Post by Solar Designer
Another change that may or may not happen is GPUs getting more registers
and/or faster access to temporary variables. This sounds quite
realistic to me, but I am not an expert in this area. Based on what I
read (about the implementation difficulty for DES on GPUs), I wouldn't
be surprised if a mere 2x increase in register count results in a 10x
speedup for DES on GPUs.
I think I was wrong about the above. If not having enough registers
results in a 10x performance hit and having 2x more registers would
avoid that, then perhaps the performance hit could be reduced from 10x
to 2x by simply running half as many threads. But I might be wrong about
that as well...

Alexander