EksBlowfish with RAM results

Discussion:

Yuri Gonzaga

2011-05-31 03:40:23 UTC

Hello,

I am reporting the synthesis results for EksBlowfish using RAM on the pico e101
device (Spartan-6 xc6slx45).

Regards,

Yuri.

Slice Logic Utilization                  |  Used | Available
-----------------------------------------+-------+----------
Number of Slice Registers                | 2,506 |    54,576
Number used as Flip Flops                | 2,506 |
Number of Slice LUTs                     | 8,059 |    27,288
Number used as logic                     | 8,055 |    27,288
Number using O6 output only              | 7,509 |
Number using O5 output only              |     1 |
Number using O5 and O6                   |   545 |
Number used as Memory                    |     0 |     6,408
Number used exclusively as route-thrus   |     4 |
Number with same-slice register load     |     4 |
Number of occupied Slices                | 2,611 |     6,822
Number of LUT Flip Flop pairs used       | 8,142 |
Number with an unused Flip Flop          | 5,647 |     8,142
Number with an unused LUT                |    83 |     8,142
Number of fully used LUT-FF pairs        | 2,412 |     8,142
Number of unique control sets            |    21 |
Number of slice register sites lost      |    86 |    54,576
  to control set restrictions            |       |
Number of bonded IOBs                    |   115 |       316
Number of RAMB16BWERs                    |     4 |       116
Number of BUFG/BUFGMUXs                  |     1 |        16
Number used as BUFGs                     |     1 |

Maximum Frequency: 143.429MHz

Solar Designer

2011-06-04 21:39:51 UTC

Hi Yuri,

I am sorry that I am somewhat late to comment on this...

Post by Yuri Gonzaga
I am reporting the synthesis results for EksBlowfish using RAM on the pico e101
device (Spartan-6 xc6slx45).

Great! Can you post the code, please?

Post by Yuri Gonzaga
Slice Logic Utilization | Used | Available

...

Post by Yuri Gonzaga
Number of Slice LUTs 8,059 27,288

Why so many LUTs?

Post by Yuri Gonzaga
Number of occupied Slices 2,611 6,822

This is a consequence of the above...

So we'd fit only 3 EksBlowfish cores into that Spartan-6 chip, resulting
in worse than CPU performance. This shouldn't be so.

Post by Yuri Gonzaga
Number of RAMB16BWERs 4 116

This makes more sense, but still I'd expect only 2 of these used. Why 4?
Are you keeping the initial constants in separate BlockRAMs? Or maybe P?
If so, the initial constants should be uploaded by the host, and P
should be in registers.

Anyhow, even at 4 16+2 Kbit BlockRAMs used per EksBlowfish, we could fit
29 cores in that chip. So we're clearly LUT-bound now, in terms of the
number of cores we can fit. Please look into reducing the LUT count,
which I think should be possible to reduce by a factor of 10 or more.
I really don't understand why you have so many LUTs used.
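
As a rough sketch of the arithmetic behind the "LUT-bound" conclusion, the per-resource limits can be computed directly from the report. The function name `cores_that_fit` is made up for illustration, and the estimate assumes independent cores with no shared host-interface logic and no routing overhead.

```python
# Rough per-resource core-count estimate from the synthesis report above.
# Assumption: cores are independent and nothing else occupies the chip.

AVAILABLE = {"slice_luts": 27288, "ramb16bwer": 116}
USED_PER_CORE = {"slice_luts": 8059, "ramb16bwer": 4}


def cores_that_fit(available, used_per_core):
    """Return (max cores, limiting resource, per-resource limits)."""
    limits = {name: available[name] // used_per_core[name] for name in used_per_core}
    bottleneck = min(limits, key=limits.get)
    return limits[bottleneck], bottleneck, limits


cores, bottleneck, limits = cores_that_fit(AVAILABLE, USED_PER_CORE)
print(limits)             # per-resource limits
print(cores, bottleneck)  # overall limit and the resource that sets it
```

With these inputs the LUT limit is 3 cores and the BlockRAM limit is 29, which matches the figures quoted in this message.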

Thanks,

Alexander

Solar Designer

2011-06-04 22:21:33 UTC

Yuri -

Post by Solar Designer

Post by Yuri Gonzaga
Number of RAMB16BWERs 4 116

This makes more sense, but still I'd expect only 2 of these used. Why 4?

Oh, perhaps you're using a 16+2 Kbit BlockRAM per S-box in order to have
enough read ports for the four S-box lookups to occur in parallel. Right?

If so, this wastes half the BlockRAM space, so we'll possibly need to
switch to two 16+2 Kbit BlockRAMs per EksBlowfish core for actual use
(and have less parallelism inside those cores) in order to have more
cores per chip. Possibly this will provide an overall improvement.
It is worth testing both approaches.
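
The space trade-off can be sketched numerically. This is only an illustration: it counts the 16 Kbit data portion of each BlockRAM, ignores the 2 Kbit of parity, and the helper name `layout` is hypothetical.

```python
# Space used by Blowfish S-boxes under two BlockRAM layouts.
# Assumption: only the 16 Kbit data portion of each BlockRAM is counted.

SBOX_ENTRIES = 256
SBOX_WIDTH = 32
NUM_SBOXES = 4
BRAM_DATA_BITS = 16 * 1024

SBOX_BITS = SBOX_ENTRIES * SBOX_WIDTH  # bits in one S-box


def layout(sboxes_per_bram):
    """Return (BlockRAMs per core, fraction of each BlockRAM's data bits used)."""
    brams = NUM_SBOXES // sboxes_per_bram
    utilization = sboxes_per_bram * SBOX_BITS / BRAM_DATA_BITS
    return brams, utilization


for n in (1, 2):
    brams, utilization = layout(n)
    print(f"{n} S-box(es) per BlockRAM: {brams} BlockRAMs, {utilization:.0%} used")
```

One S-box per BlockRAM leaves half of each block empty, while two per BlockRAM fills the data portion completely.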

Alexander

Solar Designer

2011-06-09 17:02:48 UTC

Yuri -

Post by Solar Designer
Oh, perhaps you're using a 16+2 Kbit BlockRAM per S-box in order to have
enough read ports for the four S-box lookups to occur in parallel. Right?

I am now looking at these "lite" guides for Virtex-6 and Spartan-6:

http://www.eetimes.com/design/programmable-logic/4015235/Xilinx-Virtex-6-FPGA-User-Guide-Lite?pageNumber=2

http://www.eetimes.com/design/programmable-logic/4015237/Xilinx-Spartan-6-FPGA-User-Guide-Lite?pageNumber=3

It appears that in both cases it should be possible to do all four
Blowfish S-box lookups in parallel, without wasting any RAM. Initially,
I thought that we only had two read ports with Virtex-6's 36 Kbit
BlockRAMs, and that this was reduced to one read port when we choose to
have 18 Kbit BlockRAM "halves" instead. However, these "lite" guides
suggest that we have two read ports per 18 Kbit BlockRAMs as well.

Well, the Virtex-6 one is somewhat unclear on that. With Spartan-6,
everything is half the size, whereas the port count stays the same, so
it is more obvious that we can do it on Spartan-6.

You have:

module RAM
#(parameter DATA_WIDTH=32, parameter ADDR_WIDTH=10)

Maybe you can try ADDR_WIDTH=9 and try to have dual read ports for that?
Then try to synthesize for Spartan-6 (must work) and Virtex-6 (might work).
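
A behavioral sketch in Python (not Verilog, and not a claim about what the synthesizer will infer) of how four S-box lookups could map onto two 512-word RAMs with two read ports each. The class and function names are hypothetical, and the S-box contents are random placeholders rather than the real Blowfish constants.

```python
import random

ADDR_WIDTH = 9   # 2**9 = 512 words per RAM, as suggested above
DATA_WIDTH = 32
MASK32 = (1 << DATA_WIDTH) - 1


class DualReadRam:
    """Behavioral model of one RAM with two independent read ports."""

    def __init__(self, words):
        assert len(words) == 1 << ADDR_WIDTH
        self.words = list(words)

    def read2(self, addr_a, addr_b):
        # Both reads are issued in the same modeled clock cycle.
        return self.words[addr_a], self.words[addr_b]


def split_bytes(x):
    return (x >> 24) & 0xFF, (x >> 16) & 0xFF, (x >> 8) & 0xFF, x & 0xFF


def f_reference(sboxes, x):
    """Blowfish F function using four separate 256-entry S-boxes."""
    a, b, c, d = split_bytes(x)
    t = (sboxes[0][a] + sboxes[1][b]) & MASK32
    return ((t ^ sboxes[2][c]) + sboxes[3][d]) & MASK32


def f_packed(ram01, ram23, x):
    """Same F, with S0/S1 packed into one RAM and S2/S3 into another."""
    a, b, c, d = split_bytes(x)
    # The ninth address bit selects which S-box of the pair is read.
    s0, s1 = ram01.read2(a, 0x100 | b)
    s2, s3 = ram23.read2(c, 0x100 | d)
    t = (s0 + s1) & MASK32
    return ((t ^ s2) + s3) & MASK32


rng = random.Random(0)  # placeholder contents, NOT the real Blowfish constants
sboxes = [[rng.getrandbits(DATA_WIDTH) for _ in range(256)] for _ in range(4)]
ram01 = DualReadRam(sboxes[0] + sboxes[1])
ram23 = DualReadRam(sboxes[2] + sboxes[3])

x = 0x01234567
print(f_packed(ram01, ram23, x) == f_reference(sboxes, x))  # both paths agree
```

In this model each 512 x 32 RAM is exactly 16 Kbit, so no data bits are left unused; whether real hardware offers the two read ports is the open question discussed above.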

Another curious detail:

"An optional output data pipeline register allows higher clock rates at
the cost of an extra cycle of latency."

I think we'll need to make use of this, by including at least two
EksBlowfish instances per state machine.

Thanks,

Alexander

Yuri Gonzaga

2011-06-06 22:24:57 UTC

Post by Solar Designer
Great! Can you post the code, please?

Yes. It is attached.

Post by Solar Designer
Why so many LUTs?

Please look into reducing the LUT count,
which I think should be possible to reduce by a factor of 10 or more.
I really don't understand why you have so many LUTs used.

I don't know. Maybe further experiments will answer this question.
I will try to reduce it, but I don't know if I can achieve a factor of 10.

Post by Solar Designer

Post by Yuri Gonzaga
Number of RAMB16BWERs 4 116

This makes more sense, but still I'd expect only 2 of these used. Why 4?
Are you keeping the initial constants in separate BlockRAMs? Or maybe P?
If so, the initial constants should be uploaded by the host, and P
should be in registers.

I think it is 4 because I am storing initial constants in a ROM and the
synthesizer implements this ROM with RAM.

Post by Solar Designer
Oh, perhaps you're using a 16+2 Kbit BlockRAM per S-box in order to have
enough read ports for the four S-box lookups to occur in parallel. Right?

I'm not doing that.

Thank you too for the feedback.

Yuri

Solar Designer

2011-06-07 01:06:59 UTC

Yuri -

Post by Yuri Gonzaga
Yes. It is attached.

Even though I don't really know Verilog, I took a look at the code,
thanks. It appears that you spend 5 clock cycles per Blowfish round:

1. L xor P[n]
2. read two S-box elements
3. process them, read two more S-box elements
4. process them
5. swap L and R, move to next round

I think you could reduce this to three (while still using two read ports
to BlockRAM only):

1. L xor P[n], read two S-box elements
2. process them, read two more S-box elements
3. process them, swap L and R, move to next round

or to 4 per two rounds:

1. L xor P[n], read two S-box elements
2. process them, read two more S-box elements
3. process them, R xor P[n], read two S-box elements
4. process them, read two more S-box elements, move to next round

I understand that you probably didn't optimize for speed yet, and maybe
I am missing something.
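
For comparison, the three schedules above can be turned into cycle counts for the 16 rounds of one Blowfish block encryption. This is a back-of-the-envelope sketch that ignores the final P-array XORs and any state-machine overhead; the schedule labels are invented for illustration.

```python
ROUNDS = 16  # Blowfish rounds per block encryption

# label -> (clock cycles, rounds covered by those cycles)
SCHEDULES = {
    "original (5 cycles per round)": (5, 1),
    "merged (3 cycles per round)": (3, 1),
    "paired (4 cycles per two rounds)": (4, 2),
}


def cycles_per_block(cycles, rounds_covered, rounds=ROUNDS):
    """Cycles spent in the round loop for one block."""
    assert rounds % rounds_covered == 0
    return cycles * (rounds // rounds_covered)


results = {label: cycles_per_block(*s) for label, s in SCHEDULES.items()}
baseline = results["original (5 cycles per round)"]
for label, cycles in results.items():
    print(f"{label}: {cycles} cycles, {baseline / cycles:.2f}x vs original")
```

Under these assumptions the round loop drops from 80 cycles to 48 or 32 per block, provided the clock rate does not change.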

If it turns out that LUTs and not BlockRAMs are the scarce resource
(that is, if we fail to reduce the LUT count needed substantially), then
it'd make sense to waste half the BlockRAM space in order to use four
read ports per EksBlowfish core. Then you should be able to do two
rounds per two cycles.

In fact, I am considering a larger EksBlowfish-like component, which
would fully use the BlockRAMs in one clock cycle - simply make the
S-boxes 9-to-36 (instead of Blowfish's 8-to-32). So if you test the
above approach for the original EksBlowfish (which is directly useful
e.g. for JtR), we may reuse it for ours as well (for our new hashing
method).
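
A quick check of why a 9-to-36 S-box fits a BlockRAM exactly. The sketch assumes that the 2 Kbit of parity can hold ordinary data in the 36-bit-wide configuration; the helper name `sbox_bits` is hypothetical.

```python
# Compare S-box sizes against one 16+2 Kbit BlockRAM.
# Assumption: parity bits are usable as ordinary data bits.

BRAM_BITS = (16 + 2) * 1024


def sbox_bits(index_bits, output_bits):
    """Total bits in an S-box with 2**index_bits entries of output_bits each."""
    return (1 << index_bits) * output_bits


SHAPES = {"Blowfish 8-to-32": (8, 32), "proposed 9-to-36": (9, 36)}
for label, (index_bits, output_bits) in SHAPES.items():
    bits = sbox_bits(index_bits, output_bits)
    print(f"{label}: {bits} bits, {bits / BRAM_BITS:.0%} of one BlockRAM")
```

The 9-to-36 shape comes to 18,432 bits, which is exactly 18 Kbit, while a standard Blowfish S-box uses less than half of that.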

Post by Yuri Gonzaga
I don't know. Maybe further experiments will answer this question.
I will try to reduce it, but I don't know if I can achieve a factor of 10.

The first thing to do is exclude all code that is not in the performance
critical loop, as discussed before.

Post by Yuri Gonzaga
I think it is 4 because I am storing initial constants in a ROM and the
synthesizer implements this ROM with RAM.

Yes, this must be the case.

Thanks,

Alexander

Yuri Gonzaga

2011-06-14 00:36:54 UTC

Post by Solar Designer
The first thing to do is exclude all code that is not in the performance
critical loop, as discussed before.

Done!

Post by Solar Designer
Even though I don't really know Verilog, I took a look at the code,
1. L xor P[n]
2. read two S-box elements
3. process them, read two more S-box elements
4. process them
5. swap L and R, move to next round
I think you could reduce this to three (while still using two read ports
1. L xor P[n], read two S-box elements
2. process them, read two more S-box elements
3. process them, swap L and R, move to next round

Done! But, as the following results (http://bit.ly/mOP15y) state, the LUT
resources tend to grow.

Regards,

Yuri

Solar Designer

2011-06-14 01:41:56 UTC

Yuri -

Post by Yuri Gonzaga
Done! But, as the following results (http://bit.ly/mOP15y) state, the LUT
resources tend to grow.

The file hosting service you used this time made me wait a minute before
letting me download. Next time, please use our wiki instead.

So I downloaded a PDF file with some synthesis results. I can't make
any use of them for my decision-making and guidance to you on further
work because I don't have the corresponding source code.

You say that the LUT resources tend to grow - but I don't see that.
I have no idea what you're comparing against what else when you make
that statement.

So, as I wrote in my previous message (in another thread), please
provide 4 files at once (say, in one zip archive): two versions of the
source code and two corresponding synthesis results files, unambiguously
named such that I can match them against the source code.

Thanks,

Alexander

Yuri Gonzaga

2011-06-16 20:18:15 UTC

Post by Solar Designer
The file hosting service you used this time made me wait a minute before
letting me download. Next time, please use our wiki instead.

I am really sorry for that.
Thank you for guiding me to use a more organized approach.

Post by Solar Designer
So I downloaded a PDF file with some synthesis results. I can't make
any use of them for my decision-making and guidance to you on further
work because I don't have the corresponding source code.

My mistake! I have uploaded the source code and results to that wiki page.
Sorry again!

Post by Solar Designer
You say that the LUT resources tend to grow - but I don't see that.
I have no idea what you're comparing against what else when you make
that statement.

I am comparing the new version (eksblowfish-loop) to the old one
(eksblowfish-setup).
The new one uses somewhat more LUTs than the old one.
On the other hand, it reduces the number of RAM blocks to 2.

Post by Solar Designer
So, as I wrote in my previous message (in another thread), please
provide 4 files at once (say, in one zip archive): two versions of the
source code and two corresponding synthesis results files, unambiguously
named such that I can match them against the source code.

OK. They are on the wiki page (eksblowfish-setup and eksblowfish-loop).

Thank you.

Yuri Gonzaga

Yuri Gonzaga

2011-06-27 19:04:48 UTC

Hi,

I made some improvements to the HDL code, which reduced the FPGA resource
usage.

- Slice registers: from 2503 to 1145;
- Slice LUTs: from 8576 to 2649;
- Occupied slices: from 2948 to 809.

This comparison is between the new version (eksblowfish-loop-2) and
the previous one (eksblowfish-loop).
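
To put these numbers in terms of the earlier factor-of-10 goal, here is a small sketch that computes the reduction factors and a LUT-only estimate of cores per chip. It reuses the 27,288 available Slice LUTs from the first report and ignores every other resource and any shared logic.

```python
# Reduction factors between eksblowfish-loop and eksblowfish-loop-2.
# Assumption: LUTs alone limit the number of cores on the xc6slx45.

BEFORE = {"slice_registers": 2503, "slice_luts": 8576, "occupied_slices": 2948}
AFTER = {"slice_registers": 1145, "slice_luts": 2649, "occupied_slices": 809}
AVAILABLE_LUTS = 27288

factors = {name: BEFORE[name] / AFTER[name] for name in BEFORE}
cores_before = AVAILABLE_LUTS // BEFORE["slice_luts"]
cores_after = AVAILABLE_LUTS // AFTER["slice_luts"]

for name, factor in factors.items():
    print(f"{name}: reduced by a factor of {factor:.2f}")
print(f"cores by LUT count: {cores_before} before, {cores_after} after")
```

By this estimate the LUT count fell by a little over a factor of 3, and the LUT-only limit rises from 3 to 10 cores.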

One more RAM block was inferred by the synthesizer to implement one of my
control registers. So the total number of RAM blocks is 3: 2 RAMB16BWERs
and 1 RAMB8BWER.

The new code and the detailed report were uploaded to
http://openwall.info/wiki/crypt-dev/files

Best Regards,

---
Yuri Gonzaga Gonçalves da Costa

Solar Designer

2011-06-27 19:44:06 UTC

Hi Yuri,

Post by Yuri Gonzaga
I made some improvements to the HDL code, which reduced the FPGA resource
usage.
- Slice registers: from 2503 to 1145;
- Slice LUTs: from 8576 to 2649;
- Occupied slices: from 2948 to 809.
This comparison is between the new version (eksblowfish-loop-2) and
the previous one (eksblowfish-loop).

This is a good improvement. Still not quite enough, though.

Overall, those much worse than expected LUT counts for Eksblowfish and
especially for bflike suggest that we could consider DES more seriously.

Post by Yuri Gonzaga
One more RAM block was inferred by the synthesizer to implement one of my
control registers. So the total number of RAM blocks is 3: 2 RAMB16BWERs
and 1 RAMB8BWER.
The new code and the detailed report were uploaded to
http://openwall.info/wiki/crypt-dev/files

I'll take a look.

Thanks,

Alexander