Yuri's Status Report - #14 of 15
Yuri Gonzaga
2011-08-16 03:50:16 UTC
Hi all,

- Accomplishments:
  - Went back to the Pico E-101 to run 4 cores of the Eksblowfish loop in
    parallel.
  - The 4 cores running in parallel to compute the same benchmark hash
    example took 99.312 s on my Windows XP virtual machine, as in the
    other examples;
  - Files available at http://openwall.info/wiki/crypt-dev/files
- Priorities:
  - JtR's Eksblowfish loop multicore integration, compilation, execution,
    and comparison.


Thanks,

Yuri
Solar Designer
2011-08-16 10:00:49 UTC
Hi Yuri,
Post by Yuri Gonzaga
- Went back to the Pico E-101 to run 4 cores of the Eksblowfish loop in
  parallel.
- The 4 cores running in parallel to compute the same benchmark hash
  example took 99.312 s on my Windows XP virtual machine, as in the
  other examples;
If I understand what the benchmark does correctly, this is awfully slow -
over 1000 times slower than a 1 GHz CPU. Also, it is unclear to me
how this corresponds to another benchmark result you mentioned before:

http://www.openwall.com/lists/crypt-dev/2011/07/05/1

"- This was executed 10 times sequentially and waste 234.938 seconds
of execution time"

So previously you had 23 seconds per Eksblowfish main loop invocation
(which is also unacceptably slow). Now you report 99 seconds for
_parallel_ execution of 4 instances of the loop, yet you say you're getting
the same speed (or maybe I misinterpret what you wrote). These numbers
just do not agree: for parallel execution, you should have seen about the
same total runtime for 4 instances as you did for 1 instance, but it now
runs 4 times longer, as if your instances were running sequentially rather
than in parallel.
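
To put numbers on it: 234.938 / 10 is about 23.5 seconds per invocation
before, and 99.312 / 4 is about 24.8 seconds per instance now - essentially
the same per-invocation time, which is exactly what sequential execution
would produce.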

I asked you some questions in:

http://www.openwall.com/lists/crypt-dev/2011/07/05/2

to which you never replied. Perhaps I should have insisted on a reply,
so we'd catch the unacceptable performance sooner. Somehow I thought
that such execution times would apply to something like 1000 invocations
of the Eksblowfish loop (you did mention just 10, though...) or to a
much higher "cost" setting (but you appear to keep it at 5 in the code
that you uploaded to the wiki). Also, I was hoping that JtR integration
would be ready soon and we'd see the actual performance from there.

The code in eksblowfish-loop-interface.zip and
4-eksblowfish-loop-cores-pico-e101.zip looks like you do just 10
invocations of Eksblowfish at cost=5 in the former and just 4 of them
in the latter (with an attempt to do them in parallel, which may or may
not be successful - we'll need to figure this out too).

So where does the extreme performance loss come from? A ridiculously
low clock rate, like 1 MHz or below? Or am I misreading things?

You need to use a reasonable clock rate, comparable to what we'd
actually use, to validate that your design works as intended in hardware.
Post by Yuri Gonzaga
- Files available at http://openwall.info/wiki/crypt-dev/files
Thanks, I downloaded your latest and took a look.
Post by Yuri Gonzaga
- JtR's eksblowfish loop multicore integration, compilation, execution
and comparison.
While this is blocked waiting for access to the remote machine, please
figure out what goes on with the horrible performance.

On a 1 GHz CPU, Eksblowfish runs at something between 100 and 200
invocations per second, at cost=5. You're reporting it running at 23
seconds, which is thus 2000 to 5000 times slower. Indeed, the clock
rate is lower, but maybe only by a factor of 10 (I am assuming that you
run this at 100 MHz or so). This is partially compensated by the
reduced number of clock cycles per Blowfish round. On a CPU, it can be
something between 5 and 10 cycles per Blowfish round:

http://www.schneier.com/blowfish-speed.html

This gives 9 for the original Pentium, I am getting 5.5 for the code in
JtR on newer CPUs. On an FPGA, you should have between 1 and 5 clock
cycles per Blowfish round. IIRC, the first implementation with BlockRAMs
that you did was meant to provide 5 cycles per round, and we discussed
how to reduce that. Even if we take 5 cycles per round for both the CPU
and the FPGA, this gives us only a 10x slowdown per core because of the
clock rate difference alone (an Eksblowfish core in an FPGA vs. code
running on a CPU core). (And we'd compensate for that by having
something like 100 cores in a larger FPGA chip, as well as by reducing
the number of clock cycles per round, which doesn't have to be as bad as 5.)
But you're getting a ridiculous 2000 to 5000 times slowdown instead?

As to LUT count, it appears that you'd fit only 7 cores in the E-101
board's Spartan-6. That's obviously too few, but we'll target much
larger chips and we'll need to optimize the LUT count. For initial
testing, even 4 or 7 cores will work, but we need to get sane performance
from them, and we need them to run in parallel for real (as demonstrated
by the total execution time not increasing when you add further
meant-to-be-parallel invocations).
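For example (an illustrative check, not required numbers): time 1, 2 and 4
simultaneous invocations at the same cost setting; if the cores are truly
parallel, the three totals should come out nearly identical, whereas
sequential execution would roughly double and then quadruple them.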

Please comment ASAP.

Thanks,

Alexander
Yuri Gonzaga
2011-08-17 04:29:35 UTC
Post by Solar Designer
So previously you had 23 seconds per Eksblowfish main loop invocation
(which is also unacceptably slow). Now you report 99 seconds for
_parallel_ execution of 4 instances of the loop, yet you say you're getting
the same speed (or maybe I misinterpret what you wrote). These numbers
just do not agree: for parallel execution, you should have seen about the
same total runtime for 4 instances as you did for 1 instance, but it now
runs 4 times longer, as if your instances were running sequentially rather
than in parallel.
I did some experiments and I think I figured out what is wrong.
We have to consider the time spent on the write and read operations.
This time is constant per loop invocation in each case.
I measured and found:

- Sequential case: 22.35 seconds spent on writing and reading for one
  loop invocation
- Parallel case: 26.75 seconds spent on writing and reading for one
  loop invocation (corresponds to one core)

The parallel case is higher because it has one extra operation during the
write and read cycles in order to select the appropriate core.

Using that same example (cost = 5), the processing time (between writing and
reading) is equivalent in both cases, about 0.06 seconds.

So, the parallel case will be better than the sequential one only when it
overcomes its higher overhead on the write and read operations.
Considering the exponential growth with cost, and contrasting 4 parallel
cores against 4 sequential loop invocations, I did some calculations and
found that cost = 18 should be the break-even case.
Then, I ran it on the board to check this conclusion.
With cost = 18, and 4 cores vs. 4 sequential invocations, I got:

- Sequential total time: ~ 33 minutes
- Parallel total time: ~ 9 minutes

So, this "theory" seems to be ok.
Higher values of cost should improve more the gain of parallel compared to
sequential approach.
More cores, of course, should improve it as well. But, unfortunately , I
could fit only 4 cores in e101's FPGA.
Post by Solar Designer
http://www.openwall.com/lists/crypt-dev/2011/07/05/2
to which you never replied. Perhaps I should have insisted on a reply,
so we'd catch the unacceptable performance sooner.
The following questions?

Post by Solar Designer
Any info on the performance? This is one Eksblowfish core running in
the FPGA right? If so, it got to be several times slower than native
(compiled) code on the CPU, right?
Post by Solar Designer
Somehow I thought
that such execution times would apply to something like 1000 invocations
of the Eksblowfish loop (you did mention just 10, though...) or to a
much higher "cost" setting (but you appear to keep it at 5 in the code
that you uploaded to the wiki).
We can run with these parameters to see what happens, mainly on the other
board, which has a higher clock frequency and can fit more cores.

Post by Solar Designer
The code in eksblowfish-loop-interface.zip and
4-eksblowfish-loop-cores-pico-e101.zip looks like you do just 10
invocations of Eksblowfish at cost=5 in the former and just 4 of them
in the latter
Right.

Post by Solar Designer
So where does the extreme performance loss come from? A ridiculously
low clock rate, like 1 MHz or below? Or am I misreading things?
You need to use a reasonable clock rate, comparable to what we'd
actually use, to validate that your design works as intended in hardware.
The clock rate is the one provided by the E-101 board, 48 MHz.

Post by Solar Designer
On a 1 GHz CPU, Eksblowfish runs at something between 100 and 200
invocations per second, at cost=5. You're reporting it running at 23
seconds, which is thus 2000 to 5000 times slower. Indeed, the clock
rate is lower, but maybe only by a factor of 10 (I am assuming that you
run this at 100 MHz or so).
48 MHz on the Pico E-101.
Post by Solar Designer
This is partially compensated by the
reduced number of clock cycles per Blowfish round. On a CPU, it can be
something between 5 and 10 cycles per Blowfish round:
http://www.schneier.com/blowfish-speed.html
This gives 9 for the original Pentium, I am getting 5.5 for the code in
JtR on newer CPUs. On an FPGA, you should have between 1 and 5 clock
cycles per Blowfish round. IIRC, the first implementation with BlockRAMs
that you did was meant to provide 5 cycles per round, and we discussed
how to reduce that. Even if we take 5 cycles per round for both the CPU
and the FPGA, this gives us only a 10x slowdown per core because of the
clock rate difference alone (an Eksblowfish core in an FPGA vs. code
running on a CPU core). (And we'd compensate for that by having
something like 100 cores in a larger FPGA chip, as well as by reducing
the number of clock cycles per round, which doesn't have to be as bad as 5.)
But you're getting a ridiculous 2000 to 5000 times slowdown instead?
Let's see what we can do to achieve this goal.


Post by Solar Designer
As to LUT count, it appears that you'd fit only 7 cores in the E-101
board's Spartan-6.
4 cores, although they (plus the Pico bus internal logic) occupy only 65%,
which would suggest room for more cores.
But if I add just one more core, the tool is no longer able to fit everything
in that FPGA.
Post by Solar Designer
That's obviously too few, but we'll target much
larger chips and we'll need to optimize the LUT count.
In this case, we will need to revisit the LUT count optimization step.

Thanks,

---
Yuri Gonzaga Gonçalves da Costa
-------------------------------------------------------------
Master's student in Informatics
LASID - Laboratório de Sistemas Digitais
Universidade Federal da Paraíba
Solar Designer
2011-08-17 15:17:28 UTC
Yuri -
Post by Yuri Gonzaga
I did some experiments and I think I figured out what is wrong.
We have to consider the time spent on the write and read operations.
Right. It appears that picoXfaceP->WriteDevice() is very slow, then -
maybe it involves bidirectional communication, to make sure the data was
actually received at the other end.

Would it be possible for you to invoke picoXfaceP->WriteDevice() just
once, or just once per core, on the entire block of data? Perhaps start
by turning this loop:

for (int i = 0; i < 1024; i++) {
    if (picoXfaceP->WriteDevice(7, &sBoxes[i], 4) < 0) {
        printf("ERRO NA ESCRITA. ABORTANDO\n");
        exit(EXIT_FAILURE);
    }
}

into one call to picoXfaceP->WriteDevice(), for the 4 KB of data.
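
Something like this, perhaps (untested, and assuming WriteDevice() accepts
an arbitrary byte count and that the core consumes the words in the same
order as the current one-word writes):

if (picoXfaceP->WriteDevice(7, sBoxes, 1024 * sizeof(sBoxes[0])) < 0) {
    /* same error handling as before */
    printf("ERRO NA ESCRITA. ABORTANDO\n");
    exit(EXIT_FAILURE);
}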

On the other hand, if this is somehow time-consuming for you (would take
more than a day to implement) and the API for M-501 under Linux is very
different (I don't know if it is), then don't bother.

But we definitely do need to combine writes/reads from the FPGA board
into larger chunks for good performance.
Post by Yuri Gonzaga
Using that same example (cost = 5), the processing time (between writing and
reading) is equivalent in both cases, about 0.06 seconds.
This is a lot slower than desired, but it is reasonable. It is very
similar to the performance of the original Pentium at 100 MHz. This makes
sense given that the board is running at 48 MHz and needs maybe half as
many clock cycles per Blowfish round. So running this on 4 cores
you'll achieve roughly the speed of a Pentium 2 at 350 MHz. Obviously, we
need a more optimal implementation (fewer cycles per Blowfish round,
fewer FPGA resources used per core), a much bigger FPGA, and a faster
clock rate for this to make any sense for practical use. I hope we'll
get much better speed on M-501 once you're able to access it again.
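
As a rough cross-check (this assumes the standard Eksblowfish main loop
structure: 2^cost iterations, each doing two ExpandKey passes of 521
Blowfish block encryptions of 16 rounds each): at cost=5 that is
32 * 2 * 521 * 16 = 533,504 rounds, and 0.06 s at 48 MHz is about 2.9
million cycles, i.e. roughly 5.4 cycles per Blowfish round - consistent
with the ~5 cycles/round of the original BlockRAM design.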
Post by Yuri Gonzaga
- Sequential total time: ~ 33 minutes
- Parallel total time: ~ 9 minutes
These numbers looked reasonable to me at first, but then I did some math
and they don't agree with the 0.06 seconds figure for cost=5 that you
gave above. Specifically:

33 * 60 / (2 ^ (18 - 5)) = 0.24

I expected to see something close to 0.06. Why is it 4 times slower
here? The difference between sequential and parallel times suggests
that the reads/writes overhead is indeed pretty low at cost=18, so this
overhead does not explain the 0.06 vs. 0.24 discrepancy.

Do you have an explanation?
Post by Yuri Gonzaga
Post by Solar Designer
http://www.openwall.com/lists/crypt-dev/2011/07/05/2
to which you never replied. Perhaps I should have insisted on a reply,
so we'd catch the unacceptable performance sooner.
The following questions?
Post by Solar Designer
Any info on the performance? This is one Eksblowfish core running in
the FPGA right? If so, it got to be several times slower than native
(compiled) code on the CPU, right?
Yes, these are what I meant above. You've provided the answers now.
Post by Yuri Gonzaga
4 cores, although they (plus the Pico bus internal logic) occupy only 65%,
which would suggest room for more cores.
But if I add just one more core, the tool is no longer able to fit everything
in that FPGA.
Any idea why not? What specific error message does it give when you try
to fit 5 cores?
Post by Yuri Gonzaga
In this case, we will need to revisit the LUT count optimization step.
Yes, we'll need to revisit it, but perhaps you should start by reducing
the number of clock cycles per Blowfish round. I think you can do 2
cycles/round while using the same number of BlockRAMs, and 1 cycle/round
when you use twice as many BlockRAMs (acceptable if LUTs and not BlockRAMs
are the scarce resource anyway).

Thanks,

Alexander
Yuri Gonzaga
2011-08-18 05:46:10 UTC
Post by Solar Designer
Would it be possible for you to invoke picoXfaceP->WriteDevice() just
once, or just once per core, on the entire block of data? Perhaps start
by turning this loop:

for (int i = 0; i < 1024; i++) {
    if (picoXfaceP->WriteDevice(7, &sBoxes[i], 4) < 0) {
        printf("ERRO NA ESCRITA. ABORTANDO\n");
        exit(EXIT_FAILURE);
    }
}

into one call to picoXfaceP->WriteDevice(), for the 4 KB of data.
I made some attempts at calling WriteDevice() passing a block of data. It
isn't working: it causes a wrong result to be returned.
Maybe I don't know how to call that function properly, or there is some
problem with byte ordering.
In fact, it works passing 4 bytes at a time, but apparently not with larger
blocks.
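
A sketch of the kind of thing I mean to try for the byte-ordering question
(hypothetical code; it assumes the 4-bytes-at-a-time path that works today
defines the word order the core expects):

unsigned char buf[1024 * 4];
for (int i = 0; i < 1024; i++) {
    unsigned int w = (unsigned int)sBoxes[i];
    /* least significant byte first; reverse these four lines if the core
       turns out to expect big-endian words */
    buf[4 * i + 0] = w & 0xff;
    buf[4 * i + 1] = (w >> 8) & 0xff;
    buf[4 * i + 2] = (w >> 16) & 0xff;
    buf[4 * i + 3] = (w >> 24) & 0xff;
}
if (picoXfaceP->WriteDevice(7, buf, sizeof(buf)) < 0) {
    printf("ERRO NA ESCRITA. ABORTANDO\n");
    exit(EXIT_FAILURE);
}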
Post by Solar Designer
On the other hand, if this is somehow time-consuming for you (would take
more than a day to implement) and the API for M-501 under Linux is very
different (I don't know if it is), then don't bother.
It is very similar.
Post by Solar Designer
Post by Yuri Gonzaga
- Sequential total time: ~ 33 minutes
- Parallel total time: ~ 9 minutes
These numbers looked reasonable to me at first, but then I did some math
and they don't agree with the 0.06 seconds figure for cost=5 that you
gave above. Specifically:
33 * 60 / (2 ^ (18 - 5)) = 0.24
I expected to see something close to 0.06. Why is it 4 times slower
here? The difference between sequential and parallel times suggests
that the reads/writes overhead is indeed pretty low at cost=18, so this
overhead does not explain the 0.06 vs. 0.24 discrepancy.
Do you have an explanation?
Could you please explain your math in more detail?
I am sorry, but I didn't understand how you arrived at it.
Post by Solar Designer
Post by Yuri Gonzaga
4 cores, although they (plus the Pico bus internal logic) occupy only 65%,
which would suggest room for more cores.
But if I add just one more core, the tool is no longer able to fit everything
in that FPGA.
Any idea why not? What specific error message does it give when you try
to fit 5 cores?
The first error is:

"ERROR:Place:543 - This design does not fit into the number of slices
available
in this device due to the complexity of the design and/or constraints."

And it gives the following guidance:

"Please evaluate the following:

- If there are user-defined constraints or area groups:
Please look at the "User-defined constraints" section below to
determine
what constraints might be impacting the fitting of this design.
Evaluate if they can be moved, removed or resized to allow for fitting.
Verify that they do not overlap or conflict with clock region
restrictions.
See the clock region reports in the MAP log file (*map) for more
details
on clock region usage.

- If there is difficulty in placing LUTs:
Try using the MAP LUT Combining Option (map lc area|auto|off).

- If there is difficulty in placing FFs:
Evaluate the number and configuration of the control sets in your
design."

Regards,

---
Yuri
Solar Designer
2011-08-18 08:25:54 UTC
Yuri, David -
Post by Yuri Gonzaga
I made some attempts at calling WriteDevice() passing a block of data. It
isn't working: it causes a wrong result to be returned.
Maybe I don't know how to call that function properly, or there is some
problem with byte ordering.
In fact, it works passing 4 bytes at a time, but apparently not with larger
blocks.
Does this apply to E-101 only or also to M-501? I notice that in your
changes to the JtR tree, you call drv->WriteDeviceAbsolute() with sizes
larger than 4 bytes. I guess this is untested yet, but you're hoping
that it'll work. Correct?

And indeed for decent performance you'll need sizes not merely larger
than 4 bytes, but rather you need to send/receive the entire blob of
around 4.5 KB in size in one call.
Post by Yuri Gonzaga
Post by Solar Designer
Post by Yuri Gonzaga
- Sequential total time: ~ 33 minutes
- Parallel total time: ~ 9 minutes
These numbers looked reasonable to me at first, but then I did some math
and they don't agree with the 0.06 seconds figure for cost=5 that you
gave above. Specifically:
33 * 60 / (2 ^ (18 - 5)) = 0.24
I expected to see something close to 0.06. Why is it 4 times slower
here? The difference between sequential and parallel times suggests
that the reads/writes overhead is indeed pretty low at cost=18, so this
overhead does not explain the 0.06 vs. 0.24 discrepancy.
Do you have an explanation?
Could you please explain better your math?
Oh, I missed an important detail that explains it all: "4 sequential
invocations". Somehow I lost this "4", treating 33 minutes as time for
one invocation at cost=18. This explains the 0.24 vs. 0.06 difference.
Post by Yuri Gonzaga
Post by Solar Designer
What specific error message does it give when you try to fit 5 cores?
"ERROR:Place:543 - This design does not fit into the number of slices
available
in this device due to the complexity of the design and/or constraints."
David - any comments on this? Per the synthesis report for 4 cores, it
appears to me that we'd have sufficient resources for 6 cores. I am
referring to 4-eksblowfish-loop-cores-pico-e101.zip available here:

http://openwall.info/wiki/crypt-dev/files

and specifically to 4-eksblowfish-loop-cores-pico-e101.pdf inside that
.zip archive. It says we're using 65% of the slices with 4 cores. Yet
adding a fifth core fails. Is this because of some routing constraints?
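(A crude estimate from that report: 65% / 4 is about 16% of the slices per
core even with the Pico bus logic counted against the cores, so 6 cores
would come to roughly 97% - tight, but within the raw slice count, which is
why a failure already at 5 cores looks more like a placement/packing issue
than a capacity one.)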

Looks like another thing to consider while optimizing each core, then.

Thanks,

Alexander
Yuri Gonzaga
2011-08-19 05:18:45 UTC
Post by Solar Designer
Does this apply to E-101 only or also to M-501?
I don't know yet. This answer will have to wait for the bitstream
generation.
Post by Solar Designer
I notice that in your
changes to the JtR tree, you call drv->WriteDeviceAbsolute() with sizes
larger than 4 bytes. I guess this is untested yet, but you're hoping
that it'll work. Correct?
Right. It is untested. My intention is to transfer a bigger block of data at
a time.
Post by Solar Designer
And indeed for decent performance you'll need sizes not merely larger
than 4 bytes, but rather you need to send/receive the entire blob of
around 4.5 KB in size in one call.
I will have to change the loop's Verilog implementation to receive and send
everything together.

Regards,

Yuri
David Hulton
2011-08-22 19:09:31 UTC
You should have much lower latency on the M501 for writes/reads, but
you'll definitely want to be transferring your data in larger chunks
(instead of calling WriteDevice multiple times, call it once with a
large length). This will take advantage of bus mastering and bursting
on the bus, and if you use a FIFO on the other end it should reduce
your latency almost to zero, provided you code it so the core is
never waiting for the software to fill the FIFO.

Also, looking at your Manager.v code, you have multiple modules
outputting to the same PicoDataOut signal. You should create a
separate PicoDataOut wire for each core and OR them all together; this
might be causing issues with your build... I've attached a patched
Manager.v that tries to instantiate 6 cores. It also seems like a lot
of the logic is probably used up by larger resources that could be
shared (since they are used in different states of the state machine).
I would recommend breaking the design up into modules that perform the
32-bit ADDs and other more resource-intensive operations (you could
also use a DSP48 block for the 32-bit ADD, for example), and then
having the different parts of the state machine that need a 32-bit ADD
use that module instead of writing c <= a + b directly, because the
tools usually aren't very good at realizing that all of the ADDs can
be performed with a shared resource. I would also look into the
possibility of a fully or partially unrolled pipeline design for the
LX240, since there's a lot more logic you can make use of.

-David
Post by Yuri Gonzaga
Post by Solar Designer
Does this apply to E-101 only or also to M-501?
I don't know yet. This answer will have to wait for the bitstream
generation.
Post by Solar Designer
I notice that in your
changes to the JtR tree, you call drv->WriteDeviceAbsolute() with sizes
larger than 4 bytes.  I guess this is untested yet, but you're hoping
that it'll work.  Correct?
Right. It is untested. My intention is to transfer a bigger block of data at
a time.
Post by Solar Designer
And indeed for decent performance you'll need sizes not merely larger
than 4 bytes, but rather you need to send/receive the entire blob of
around 4.5 KB in size in one call.
I will have to change the loop's Verilog implementation to receive and send
everything together.
Regards,
Yuri