Example: Nexus Proof of Work Function

This example discusses the compiler-autopipelining of a rather large cryptographic hash function.

Discord user Wolf9466 created and shared a highly pipelined hand written Verilog implementation in addition to some test C code. It is the test C code which is synthesized as PipelineC and compared to the Verilog implementation.

Some information on the hash function from Wolf9466: The cryptocurrency Nexus uses a SK1024 proof-of-work which consists of two Skein block processes and three Keccak block processes. It's run on the block header - Keccak-1024(Skein-1024(data))

Original C code test for the Verilog NXSTest.c

Modified C code for synthesis NXSTest_syn.c

C Code Changes for Synthesis

Pointers to fixed size arrays were replaced with fixed size arrays passed by value. This include the creation of structs to return when this was necessary. Run-time constant values as variables were replaced with macro unrolling, and inlining of some functions and some bulky bit manipulations ORing bits together were replaced by compiler built in bit manipulations functions. (need better constant propagation across function boundaries and graph partitioning/pattern matching to use as originally written - anyone want to help with graph optimizations?). memcpy and memset replaced with some arrays.h macros for copying array contents. Finally some function inner loops were made into isolated single iteration functions: for synthesis time/quality of results.

Main pipeline structure

#pragma MAIN_MHZ DoNXSTest 850.0 // Timing goal
// Two input port u64 arrays, and one u64 output
uint64_t DoNXSTest(uint64_t State[26], uint64_t Key[17])
{
  uint64_t Nonce = 0x00000001FCAFC045ULL;
  uint64_t P[16];
  
  // loads low 8 qwords
  //memcpy(P, State + 16, 64);
  uint32_t i;
  for(i = 0; i < 8; i+=1) P[i] = State[i+2];
  
  P[8] = State[24];
  P[9] = State[25];
  P[10] = Nonce;
  
  for(i = 11; i < 16; i+=1) P[i] = 0ULL;
  
  // T constants inside func
  SkeinRoundTest_t srt = SkeinRoundTest0(P, Key); 
  P = srt.State;
  Key = srt.Key;
  
  for(i = 0; i < 8; i+=1) Key[i] = P[i] ^ State[16 + i];
  
  Key[8] = P[8] ^ State[24];
  Key[9] = P[9] ^ State[25];
  Key[10] = P[10] ^ Nonce;
	
  for(i = 11; i < 16; i+=1) Key[i] = P[i];
  
  // Clear P, the state argument to the final block is zero.
  //memset(P, 0x00, 128);
  ARRAY_SET(P, 0x00, 16)
  Key[16] = SKEIN_KS_PARITY;
  for(i = 0; i < 16; i+=1) Key[16] ^= Key[i];

  // T constants inside func
  SkeinRoundTest_t srt = SkeinRoundTest1(P, Key);
  P = srt.State;
  Key = srt.Key;

  uint64_t KeccakState[25];
  // memset(KeccakState, 0x00, 200);
  ARRAY_SET(KeccakState, 0x00, 25)
  // Copying qwords 0 - 8 (inclusive, so 9 qwords).
  // Note this is technically an XOR operation, just
  // with zero in this instance.
  // memcpy(KeccakState, P, 72);
  ARRAY_COPY(KeccakState, P, 9) 
  
  keccackf_t k = keccakf(KeccakState);
  KeccakState = k.st;
  
  for(i = 0; i < 7; i+=1) KeccakState[i] ^= P[i + 9];

  KeccakState[7] ^= 0x0000000000000005ULL;
  KeccakState[8] ^= 0x8000000000000000ULL;
  
  k = keccakf(KeccakState);
  KeccakState = k.st;

  k = keccakf(KeccakState);
  KeccakState = k.st;
  
  return(KeccakState[6]);
}

Data flows from inputs to the output return value. This is a pure function and all function inputs and outputs are passed by value.

The PipelineC tool produces a (currently crude) visual aid 'pipeline map' for identifying the critical path delay through your code. Being text based it is too large to show here but it can be seen that the two SkeinRoundTest blocks are ~1/2 of the datapath total length. Followed by the three keccakf blocks which take up the later ~1/2 of the datapath.

The comb. logic path

  Source:                 Key_input_reg_reg[2][0]/C
  Destination:            return_output_output_reg_reg[45]/D
  Data Path Delay:        208.747ns  (logic 79.923ns (38.287%)  route 128.824ns (61.713%))
  Logic Levels:           904  (CARRY8=361 LUT1=1 LUT2=26 LUT3=41 LUT4=252 LUT5=152 LUT6=71)

It is this comb. logic that the PipelineC compiler begins to autopipeline.

The tool uses information about the synthesized comb. delays of various submodules/functions in the design to insert pipelining registers.

SkeinMix8 Path delay (ns): 1.181 = 846.74 MHz
keccakf_iter Path delay (ns): 1.043 = 958.77MHz
SkeinEvenRound Path delay (ns): 2.804 = 356.63 MHz
SkeinOddRound Path delay (ns): 2.823 = 354.233 MHz
keccakf Path delay (ns): 30.107 = 33.214 MHz
SkeinRoundTest0 Path delay (ns): 59.98 = 16.672 MHz
SkeinRoundTest1 Path delay (ns): 60.102 = 16.638 MHz
DoNXSTest Path delay (ns): 208.903 = 4.78 MHz

Below are the results from the fast coarse grain pipelining throughput sweep:

================== Beginning Throughput Sweep ================================
Function: DoNXSTest Target MHz: 850.0
Setting all instances to comb. logic to start...
Starting with blank sweep state...
...determining slicing information for each main function...
Starting coarse sweep...
Slicing up pipeline stages...
DoNXSTest : 176 clocks latency, sliced coarsely...
 Clock Goal (MHz): 850.0 , Current MHz: 418.93590280687056 ( 2.387 ns)
DoNXSTest : 220 clocks latency, sliced coarsely...
 Clock Goal (MHz): 850.0 , Current MHz: 460.617227084293 ( 2.171 ns)
DoNXSTest : 263 clocks latency, sliced coarsely...
 Clock Goal (MHz): 850.0 , Current MHz: 479.84644913627636 ( 2.084 ns)
DoNXSTest : 305 clocks latency, sliced coarsely...
 Clock Goal (MHz): 850.0 , Current MHz: 535.6186395286556 ( 1.867 ns)
DoNXSTest : 346 clocks latency, sliced coarsely...
 Clock Goal (MHz): 850.0 , Current MHz: 537.9236148466917 ( 1.859 ns)
DoNXSTest : 386 clocks latency, sliced coarsely...
 Clock Goal (MHz): 850.0 , Current MHz: 537.9236148466917 ( 1.859 ns)
DoNXSTest : 425 clocks latency, sliced coarsely...
 Clock Goal (MHz): 850.0 , Current MHz: 541.4185165132648 ( 1.847 ns)
DoNXSTest : 463 clocks latency, sliced coarsely...
 Clock Goal (MHz): 850.0 , Current MHz: 631.7119393556538 ( 1.583 ns)
DoNXSTest : 500 clocks latency, sliced coarsely...
 Clock Goal (MHz): 850.0 , Current MHz: 590.6674542232723 ( 1.693 ns)
DoNXSTest : 536 clocks latency, sliced coarsely...
 Clock Goal (MHz): 850.0 , Current MHz: 590.6674542232723 ( 1.693 ns)
DoNXSTest : 571 clocks latency, sliced coarsely...
 Clock Goal (MHz): 850.0 , Current MHz: 696.8641114982578 ( 1.435 ns) *
DoNXSTest : 605 clocks latency, sliced coarsely...
 Clock Goal (MHz): 850.0 , Current MHz: 696.8641114982578 ( 1.435 ns)
DoNXSTest : 638 clocks latency, sliced coarsely...
 Clock Goal (MHz): 850.0 , Current MHz: 696.8641114982578 ( 1.435 ns)

The tool is able to pipeline the original 208.903 ns path up to 208.903/1.435 ns ~= 145x faster throughput, 696.86 MHz, 571 clocks deep pipeline.

Timing Report:

Slack (VIOLATED) :        -0.259ns  (required time - arrival time)
  Source:                 DoNXSTest/SkeinRoundTest0_NXSTest_syn_c_l446_c26_1a99.../C
  Destination:            DoNXSTest/SkeinRoundTest0_NXSTest_syn_c_l446_c26_1a99.../D
  Requirement:            1.176ns  (clk_850p0 rise@1.176ns - clk_850p0 rise@0.000ns)
  Data Path Delay:        1.280ns  (logic 0.319ns (24.922%)  route 0.961ns (75.078%))
  Logic Levels:           5  (LUT2=1 LUT4=1 LUT6=3)

Utilization:

+------+--------+-------+
|      |Cell    |Count  |
+------+--------+-------+
|1     |BUFG    |      1|
|2     |CARRY8  |  18112|
|3     |LUT1    |    146|
|4     |LUT2    | 253168|
|5     |LUT3    | 186722|
|6     |LUT4    |  58915|
|7     |LUT5    |  41453|
|8     |LUT6    |   9659|
|9     |SRL16E  |  42699|
|10    |SRLC32E |   5092|
|11    |FDRE    | 862090|
|12    |IBUF    |   2753|
|13    |OBUF    |     64|
+------+--------+-------+
Total LUTs: 550063

Verilog Hand Tuned Pipeline Synthesis Results

Part: xcvu9p-flgb2104-2-i

localparam TOTALSTAGES = (SKEINBLKSTAGES * 2) + (KECCAKBLKSTAGES * 3) + 2;

Wolf9466 designed a pipeline that splits the two Skein block processes and three Keccak block processes in total across 390 pipeline stages. Individual iterations of block processes we identified as being small enough to fit into single clock cycles. From there an additional pipeline stage or two were added for intermediate operations that didnt fit evenly into other stages.

Part: xcvu33p-fsvh2104-2L-e

Timing Report:

An fmax of roughly ~846MHz = ~1.182ns period.

wtiming

Utilization

+----------------------------+--------+-------+-----------+-------+
|          Site Type         |  Used  | Fixed | Available | Util% |
+----------------------------+--------+-------+-----------+-------+
| CLB LUTs*                  | 333899 |     0 |    439680 | 75.94 |
|   LUT as Logic             | 331829 |     0 |    439680 | 75.47 |
|   LUT as Memory            |   2070 |     0 |    205440 |  1.01 |
|     LUT as Distributed RAM |      0 |     0 |           |       |
|     LUT as Shift Register  |   2070 |     0 |           |       |
| CLB Registers              | 529262 |     0 |    879360 | 60.19 |
|   Register as Flip Flop    | 529262 |     0 |    879360 | 60.19 |
|   Register as Latch        |      0 |     0 |    879360 |  0.00 |
| CARRY8                     |  15280 |     0 |     54960 | 27.80 |
| F7 Muxes                   |      0 |     0 |    219840 |  0.00 |
| F8 Muxes                   |      0 |     0 |    109920 |  0.00 |
| F9 Muxes                   |      0 |     0 |     54960 |  0.00 |
+----------------------------+--------+-------+-----------+-------+

Design Comparison PipelineC vs. Verilog

Timing Paths:

Verilog: CARRY8=8 LUT2=1 path
     Likely a single u64 add. Reports more logic delay but less route delay compare to PipelineC path.
     FMAX ~= 846MHz
PipelineC: LUT2=1 LUT4=1 LUT6=3 path
     Reports less logic delay than the Verilog, but significantly more route delay.
     FMAX ~= 696MHz

Utilization:

            Verilog | PipelineC
Total LUTs:  333899 | 550063
CARRY8:      15280  | 18112
Registers:   529262 | 862090

TODO compare util and timing paths. Point out where tool need improvement, reach out for project specific help.

Example: Nexus Proof of Work Function

C Code Changes for Synthesis

Main pipeline structure

Verilog Hand Tuned Pipeline Synthesis Results

Design Comparison PipelineC vs. Verilog

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Clone this wiki locally