Skip to content

Example: Nexus Proof of Work Function

Julian Kemmerer edited this page Apr 6, 2021 · 47 revisions

This example discusses the compiler-autopipelining of a rather large cryptographic hash function.

Discord user Wolf9466 created and shared a highly pipelined hand written Verilog implementation in addition to some test C code. It is the test C code which is synthesized as PipelineC and compared to the Verilog implementation.

Some information on the hash function from Wolf9466: The cryptocurrency Nexus uses a SK1024 proof-of-work which consists of two Skein block processes and three Keccak block processes. It's run on the block header - Keccak-1024(Skein-1024(data))

Original C code test for the Verilog NXSTest.c

Modified C code for synthesis NXSTest_syn.c

C Code Changes for Synthesis

Pointers to fixed size arrays were replaced with fixed size arrays passed by value. This include the creation of structs to return when this was necessary. Run-time constant values as variables were replaced with macro unrolling, and inlining of some functions and some bulky bit manipulations ORing bits together were replaced by compiler built in bit manipulations functions. (need better constant propagation across function boundaries and graph partitioning/pattern matching to use as originally written - anyone want to help with graph optimizations?). memcpy and memset replaced with some arrays.h macros for copying array contents. Finally some function inner loops were made into isolated single iteration functions: for synthesis time/quality of results.

Main pipeline structure

#pragma MAIN_MHZ DoNXSTest 850.0 // Timing goal
// Two input port u64 arrays, and one u64 output
uint64_t DoNXSTest(uint64_t State[26], uint64_t Key[17])
{
  uint64_t Nonce = 0x00000001FCAFC045ULL;
  uint64_t P[16];
  
  // loads low 8 qwords
  //memcpy(P, State + 16, 64);
  uint32_t i;
  for(i = 0; i < 8; i+=1) P[i] = State[i+2];
  
  P[8] = State[24];
  P[9] = State[25];
  P[10] = Nonce;
  
  for(i = 11; i < 16; i+=1) P[i] = 0ULL;
  
  // T constants inside func
  SkeinRoundTest_t srt = SkeinRoundTest0(P, Key); 
  P = srt.State;
  Key = srt.Key;
  
  for(i = 0; i < 8; i+=1) Key[i] = P[i] ^ State[16 + i];
  
  Key[8] = P[8] ^ State[24];
  Key[9] = P[9] ^ State[25];
  Key[10] = P[10] ^ Nonce;
	
  for(i = 11; i < 16; i+=1) Key[i] = P[i];
  
  // Clear P, the state argument to the final block is zero.
  //memset(P, 0x00, 128);
  ARRAY_SET(P, 0x00, 16)
  Key[16] = SKEIN_KS_PARITY;
  for(i = 0; i < 16; i+=1) Key[16] ^= Key[i];

  // T constants inside func
  SkeinRoundTest_t srt = SkeinRoundTest1(P, Key);
  P = srt.State;
  Key = srt.Key;

  uint64_t KeccakState[25];
  // memset(KeccakState, 0x00, 200);
  ARRAY_SET(KeccakState, 0x00, 25)
  // Copying qwords 0 - 8 (inclusive, so 9 qwords).
  // Note this is technically an XOR operation, just
  // with zero in this instance.
  // memcpy(KeccakState, P, 72);
  ARRAY_COPY(KeccakState, P, 9) 
  
  keccackf_t k = keccakf(KeccakState);
  KeccakState = k.st;
  
  for(i = 0; i < 7; i+=1) KeccakState[i] ^= P[i + 9];

  KeccakState[7] ^= 0x0000000000000005ULL;
  KeccakState[8] ^= 0x8000000000000000ULL;
  
  k = keccakf(KeccakState);
  KeccakState = k.st;

  k = keccakf(KeccakState);
  KeccakState = k.st;
  
  return(KeccakState[6]);
}

Data flows from inputs to the output return value. This is a pure function and all function inputs and outputs are passed by value.

The PipelineC tool produces a (currently crude) visual aid 'pipeline map' for identifying the critical path delay through your code. Being text based it is too large to show here but it can be seen that the two SkeinRoundTest blocks are ~1/2 of the datapath total length. Followed by the three keccakf blocks which take up the later ~1/2 of the datapath.

The comb. logic path

  Source:                 Key_input_reg_reg[2][0]/C
  Destination:            return_output_output_reg_reg[45]/D
  Data Path Delay:        208.747ns  (logic 79.923ns (38.287%)  route 128.824ns (61.713%))
  Logic Levels:           904  (CARRY8=361 LUT1=1 LUT2=26 LUT3=41 LUT4=252 LUT5=152 LUT6=71)

It is this comb. logic that the PipelineC compiler begins to autopipeline.

Synthesis Results

Part: xcvu9p-flgb2104-2-i

Verilog Hand Tuned Pipeline

localparam TOTALSTAGES = (SKEINBLKSTAGES * 2) + (KECCAKBLKSTAGES * 3) + 2;

Wolf9466 designed a pipeline that splits the two Skein block processes and three Keccak block processes in total across 390 pipeline stages. Individual iterations of block processes we identified as being small enough to fit into single clock cycles. From there an additional pipeline stage or two were added for intermediate operations that didnt fit evenly into other stages.

Part: xcvu33p-fsvh2104-2L-e

Timing Report:

An fmax of roughly ~846MHz = ~1.182ns period.

wtiming

Utilization

+----------------------------+--------+-------+-----------+-------+
|          Site Type         |  Used  | Fixed | Available | Util% |
+----------------------------+--------+-------+-----------+-------+
| CLB LUTs*                  | 333899 |     0 |    439680 | 75.94 |
|   LUT as Logic             | 331829 |     0 |    439680 | 75.47 |
|   LUT as Memory            |   2070 |     0 |    205440 |  1.01 |
|     LUT as Distributed RAM |      0 |     0 |           |       |
|     LUT as Shift Register  |   2070 |     0 |           |       |
| CLB Registers              | 529262 |     0 |    879360 | 60.19 |
|   Register as Flip Flop    | 529262 |     0 |    879360 | 60.19 |
|   Register as Latch        |      0 |     0 |    879360 |  0.00 |
| CARRY8                     |  15280 |     0 |     54960 | 27.80 |
| F7 Muxes                   |      0 |     0 |    219840 |  0.00 |
| F8 Muxes                   |      0 |     0 |    109920 |  0.00 |
| F9 Muxes                   |      0 |     0 |     54960 |  0.00 |
+----------------------------+--------+-------+-----------+-------+

Design Comparison PipelineC vs. Verilog

TODO

Clone this wiki locally