( ESNUG 486 Item 2 ) -------------------------------------------- [10/26/10]

From: Tim Koeppe <tim.koeppe=user domain=nsn got calm>
Subject: Mentor CatapultC user on control logic synthesis and AC Channels

Hi, John,

When I first tested Catapult, I was using it for DSP applications (e.g. 
loop-based with a fixed loop count).  But the Catapult release in mid-to-late 
2009 met my needs better due to Mentor's improvements in control logic 
synthesis, which I discuss below.

My initial reason for using Catapult was to avoid wasted time doing 
RTL-simulations in a timed environment.  I wanted to have a very, very fast 
untimed C simulation, then skip the painful RTL coding and verification 
loop (code RTL, simulate, change the code, simulate, and so on).

Catapult supports both SystemC and C++, but I am currently designing using 
C++.  For my test case, I chose a memory management application.  It would 
be really hard to code it in RTL because I would have to do something like 
a table walk (e.g. searching through a memory for values) and deal with the 
memory latency.  But when I code it in C++, it's just one line of code and 
I don't have to deal with latencies.
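
To illustrate the point, here is a minimal sketch of an untimed table walk 
in C++.  The names table_walk and TABLE_SIZE are hypothetical and are not 
taken from the design described here.  Note that no memory latency appears 
anywhere in the source; that is left to the synthesis tool.

```cpp
#include <cstdint>

// Hypothetical table size, chosen only for illustration.
const int TABLE_SIZE = 16;

// Untimed table walk: return the index of the first entry equal to `key`,
// or -1 if no entry matches.
int table_walk(const std::uint32_t table[TABLE_SIZE], std::uint32_t key)
{
    for (int i = 0; i < TABLE_SIZE; i++) {
        if (table[i] == key) return i;
    }
    return -1;
}
```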

DESIGNING HARDWARE IN C++ INSTEAD OF RTL 

It took me about a month to understand the concepts associated with coding 
hardware design in C++.  For example, when you do RTL hardware design you 
explicitly code parallelism.  When I moved to C, I needed to get a feeling 
for writing sequential descriptions.  Catapult synthesizes high throughput 
designs by inferring parallelism from the sequential descriptions, so I 
didn't need to hard-code those design aspects anymore.  It took time to 
rewire my thinking.

I also had to understand the concepts of writing code with I/Os working in 
parallel (e.g. channels, hierarchy).

I don't have charts comparing Catapult's automatically generated RTL and 
hand-written RTL;  I didn't write the RTL code for this application.  
However, based on my experience writing RTL, I estimate that from start to 
finish, it's about 5 to 10 times faster to use Catapult than to do hand 
coding for a given application.

         Design size:  500 lines of C++ code, plus 300 additional
                       lines for the testbench. 

     Catapult Output:  13,000 lines of RTL code. 

    Catapult Runtime:  10 minutes for the conversion.

The time savings is not just about automatically converting the C-code to 
RTL versus writing it manually.  The other powerful thing is Catapult's 
verification flow.  Catapult offers an environment for quick verification 
of the generated code - it automatically verifies the RTL without me having 
to write an RTL-testbench.  I found it very easy to verify my design with 
Catapult's verification flow since I needed a C testbench to verify my C 
code anyway.  I can get very fast 'pass or fail' results.  Plus since I 
have pure C++, there are fewer details in the source.  It runs extremely 
fast and there are fewer things to verify.
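
As a generic sketch of the kind of self-checking C++ testbench meant here, 
the example below drives a small function with stimulus and reports 'pass 
or fail'.  The design under test (dut_clamp8) and the test vectors are 
hypothetical, and the sketch does not show Catapult's own verification 
flow, only the C-level testbench style that such a flow can reuse.

```cpp
#include <cstdio>

// Hypothetical design under test: clamp an input to the range 0..255.
int dut_clamp8(int x)
{
    if (x < 0)   return 0;
    if (x > 255) return 255;
    return x;
}

// Self-checking testbench: apply stimulus, compare against expected values,
// and return the number of mismatches (0 means pass).
int run_testbench()
{
    const int stimulus[] = { -5, 0, 100, 255, 300 };
    const int expected[] = {  0, 0, 100, 255, 255 };
    int errors = 0;
    for (int i = 0; i < 5; i++) {
        int got = dut_clamp8(stimulus[i]);
        if (got != expected[i]) {
            std::printf("FAIL: in=%d got=%d expected=%d\n",
                        stimulus[i], got, expected[i]);
            errors++;
        }
    }
    std::printf("%s\n", errors == 0 ? "PASS" : "FAIL");
    return errors;
}
```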

Another powerful thing...imagine the following:  You create a high speed 
design in RTL with some memories and associated state machines.  After 
doing logic synthesis and place and route and some iterations, you realize 
that you have to change your pipeline.  You have to buffer your outputs at 
the memories because the clock to output of the memories is quite long.
If you do this by hand in RTL - it will take a lot of time, but with
Catapult - it's just one click.

CONTROL LOGIC SYNTHESIS

We had a buffer handler, which handled an external SRAM which held buffered 
data.  Our handler had multiple queues, each with a minimum assured buffer 
size.  A queue can 'borrow' some space as needed.  This queue will lose the 
borrowed space if another queue has a request for memory and the used buffer 
size of this queue is smaller than the assured size.  The RAM is organized 
in slices and each slice is filled without gaps, which lets us easily 
change parameters such as the number of queues, SRAM size....

When we had Catapult 2008a (without control logic synthesis), we used a 
'blocking read' approach, where basically we stopped portions of the design 
from running.  This was very hard to implement, and we ended up with a gap 
of 11 clock cycles to switch to the next descriptor.

Then last year, Mentor added control logic synthesis and 'non-blocking 
reads' in Catapult C 2009a.  It uses a special channel with a non-blocking 
method, so while the design is waiting for data to arrive in the FIFO, it 
is not stopping the rest of the design from running.

Below is an example of our C++ code that we used to apply control logic 
synthesis for an external memory using the 2009a version of Catapult.  It 
includes syntax for a non-blocking method, and our gap to switch to the 
next descriptor was only 4 clock cycles.

  void ext_mem ( ac_channel<ext_mem_ch_t>   &ext_mem_ch,
                 ac_channel< ac_int<32,0> > &ch_ext_mem_pkt_len_rd,
                 ac_channel< ac_int<32,0> > &rd_fifo,
                 ac_channel< ac_int<32,0> > &wr_fifo,
                 ac_int<32,0> ext_sram[DWORDS_PER_BUFFER * C_PTR_CNT] )
  {
    ext_mem_ch_t rd_str;
    // Check for pending descriptors before reading (new in 2009a)
    while (ac_int<5,false>(ext_mem_ch.size()) > 0)
    {
      rd_str = ext_mem_ch.read();
      unsigned int     text_sram_adr = rd_str.adr;
      ac_int<12,0>     tc_copy       = rd_str.c_copy;
      ac_int<2,0>      tcmd          = rd_str.cmd;
      ac_int<12,false> cl;
      unsigned int     help;
      ext_mem_copy_loop: for (cl = 0; ; cl++)
      {
        switch (tcmd)
        {
          case C_WR : help = wr_fifo.read();
                      ext_sram[text_sram_adr++] = help;
                      break;
          case C_RD : help = ext_sram[text_sram_adr++];
                      rd_fifo.write(help);
                      break;
          default   : help = ext_sram[text_sram_adr++];
                      ch_ext_mem_pkt_len_rd.write(help);
        } //switch
        if (cl == tc_copy) break;
      } //for
    } //while
  }
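
The listing above needs Mentor's Algorithmic C headers to compile.  To try 
out the check-size-then-read pattern on its own, here is a minimal sketch 
that uses a simplified stand-in FIFO class.  The names toy_channel and 
drain_available are hypothetical; the stand-in only mimics the size(), 
read() and write() calls used above and is not the real ac_channel.

```cpp
#include <cstddef>
#include <deque>

// Simplified stand-in for a FIFO channel, for illustration only.
// It is NOT Mentor's ac_channel; it only mimics size()/read()/write().
template <typename T>
class toy_channel {
    std::deque<T> fifo_;
public:
    std::size_t size() const { return fifo_.size(); }
    void write(const T &v)   { fifo_.push_back(v); }
    T read()                 { T v = fifo_.front(); fifo_.pop_front(); return v; }
};

// Non-blocking style: copy whatever is available right now and return.
// Returns the number of words copied from `in` to `out`.
int drain_available(toy_channel<int> &in, toy_channel<int> &out)
{
    int n = 0;
    while (in.size() > 0) {      // check first, so read() never has to wait
        out.write(in.read());
        n++;
    }
    return n;
}
```

Calling drain_available() on an empty channel returns 0 right away instead 
of waiting, which is the behavior the while-condition in ext_mem() relies on.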

HIERARCHY

For a high throughput design, I use hierarchy to get hardware which is 
working in parallel.  A simple example:

     aa = a + b;
     loop: for (i=0; i<10; i++) field[i] = a + b + i;
     dd = aa + b;

With RTL design, the operation:  dd = ... could be done while the loop is 
active, but, in a single block, Catapult cannot run code outside a loop as 
long as the loop is active.  So if I need more performance, I must map the 
loop to hierarchy.
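
As a rough source-level sketch of what mapping the loop to hierarchy can 
look like, the loop from the example above is moved into its own function, 
which then becomes a candidate for a separate block.  The function names 
fill_field and top are hypothetical, and the tool settings that actually 
turn a function into a hierarchical block are not shown here.

```cpp
// The loop lives in its own function so it can become its own block.
void fill_field(int a, int b, int field[10])
{
    for (int i = 0; i < 10; i++) field[i] = a + b + i;
}

// Top-level function: same sequential behavior as the single-block
// version; any overlap of fill_field() with the computation of dd is
// decided by the synthesis tool, not by this source code.
int top(int a, int b, int field[10])
{
    int aa = a + b;
    fill_field(a, b, field);
    int dd = aa + b;
    return dd;
}
```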

There are 2 ways I can use Catapult C to design with hierarchy.  The first 
is 'out-of-order array access,' which results in ping pong mirror memories 
with the same data.  The second is 'in-order array access,' which has fast, 
automatic streaming of arrays shared between blocks that just go through a 
pipe.  We can use this type of hierarchy for items such as external memory 
functions when:

     - Indexing the blocks in the same order
     - Indexing unconditional (matched data rates)
     - Algorithmic C Channel class;  Streaming behavior coded directly into 
       the algorithm.

AC CHANNELS

Catapult has Algorithmic C Channel classes (ac_channels), with streaming 
behavior coded directly into the algorithm.  I use channels to implement 
the block interface when doing hierarchical design with multiple blocks.

This is what AC channels look like when you write them in C++.  The sample 
below was for a complex, hierarchical, external SRAM controller (i.e. 
non-datapath) design.

  void ext_mem( ac_channel<ext_mem_ch_t>   &ext_mem_ch,
                ac_channel< ac_int<32,0> > &ch_ext_mem_pkt_len_rd,
                ac_channel< ac_int<32,0> > &rd_fifo,
                ac_channel< ac_int<32,0> > &wr_fifo,
                ac_int<32,0> ext_sram[DWORDS_PER_BUFFER * C_PTR_CNT]);

  void buf_handler_dh( ac_channel<ext_mem_ch_t>   &ext_mem_ch,
                       ac_channel<ch_cmd_t>       &ch_cmd,
                       ac_int<C_NEEDED_BITS_4_DW_IN_QUEUE,0> dw_used_in_queue[C_QUEUE_CNT],
                       ac_channel< ac_int<32,0> > &ch_ext_mem_pkt_len_rd,
                       ac_channel< ac_int<32,0> > &ch_pkt_len);

  //###### pragma hls_design top #######
  void buf_handler ( ac_channel<ch_cmd_t>       &ch_cmd,
                     ac_channel< ac_int<32,0> > &rd_fifo,
                     ac_channel< ac_int<32,0> > &wr_fifo,
                     ac_int<C_NEEDED_BITS_4_DW_IN_QUEUE,0> dw_used_in_queue[C_QUEUE_CNT],
                     ac_channel< ac_int<32,0> > &ch_ext_mem_pkt_len_rd,
                     ac_channel< ac_int<32,0> > &ch_pkt_len )
  {
    // SRAM array and the internal channel connecting the two sub-blocks
    static ac_int<32,0> ext_sram[DWORDS_PER_BUFFER * C_PTR_CNT];
    static bool uninit_ext_sram = 
      ac::init_array<AC_VAL_DC>(ext_sram, DWORDS_PER_BUFFER * C_PTR_CNT);
    static ac_channel<ext_mem_ch_t> ext_mem_ch;

    buf_handler_dh( ext_mem_ch, ch_cmd, dw_used_in_queue,
                    ch_ext_mem_pkt_len_rd, ch_pkt_len);
    ext_mem(ext_mem_ch, ch_ext_mem_pkt_len_rd, rd_fifo, wr_fifo, ext_sram);
  }

CATAPULT DESIGN ANALYSIS

It's important to make sure your C++ code is clean before you synthesize 
it, just as you perform linting on your RTL code.  Otherwise you risk 
seeing false errors or a testbench that passes when it should fail.  
Catapult makes it straightforward to check for these issues via built-in 
linting for common mistakes.

   - Static checks.  Catapult checks the source code during compilation, 
     for example reporting uninitialized values.
   - Runtime checks.  Catapult performs runtime checks, for example 
     reporting out-of-bounds array accesses.
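
For readers who want to see what a runtime out-of-bounds report looks like 
in ordinary C++, the sketch below uses the standard library's 
bounds-checked at() accessor.  This is generic C++ and only an analogy; it 
does not show how Catapult implements its own checks, and is_out_of_bounds 
is a hypothetical helper name.

```cpp
#include <array>
#include <cstddef>
#include <stdexcept>

// Returns true if reading `index` from an 8-entry buffer is out of bounds.
// std::array::at() throws std::out_of_range instead of silently reading
// past the end of the array.
bool is_out_of_bounds(std::size_t index)
{
    std::array<int, 8> buf{};
    try {
        (void)buf.at(index);
        return false;
    } catch (const std::out_of_range &) {
        return true;
    }
}
```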

You need coverage analysis for C verification just as you do during RTL 
verification.  Catapult also performs coverage analysis, reporting how 
well each branch of the design has been exercised by your testbench.  It's 
completely automated so you don't have to learn another tool.

DRAWBACKS

The drawbacks I found were mostly the trade-offs related to high level 
synthesis versus hand-coded RTL design.

   1. In the same way that it's hard to debug a gate level netlist after 
      Synopsys DC, it's also difficult to debug and manually change the 
      RTL code that Catapult generates.  Mentor does help you better 
      understand the generated RTL through its RTL schematic viewer 
      (block diagrams), where we can cross probe between the RTL and the 
      C source file.  Mentor also added a new naming convention 
      recently, where the objects (signals, operators) are named in a 
      consistent and predictable way.

   2. There are some tradeoffs between the C language possibilities and 
      the parallel HW.  I'd like Catapult to add a construct like the 
      threads in multi-tasking operating systems; right now it doesn't 
      have one.

When we initially decided to move to Mentor Catapult C high level 
synthesis, we had questions about using C++ for hardware design, since C++ 
is a sequential language.  For example, how would we handle parallelism, 
and transition from untimed to timed code and handle external interfaces?

On the flip side, Catapult was widely used, C++ has unbeaten simulation 
speed compared to RTL, and this way our hardware / software people would be 
talking one language.  Also, we liked Catapult's built-in verification steps.

In the end we learned that our results with Catapult depend on our coding 
style and that HDL design experience is important to have.

    - Tim Koeppe
      Nokia Siemens Networks GmbH                Greifswald, Germany