( ESNUG 486 Item 2 ) -------------------------------------------- [10/26/10]
From: Tim Koeppe <tim.koeppe=user domain=nsn got calm>
Subject: Mentor CatapultC user on control logic synthesis and AC Channels
Hi, John,
When I first tested Catapult, I was using it for DSP applications (e.g.
loop-based with a fixed loop count). But the Catapult release in mid-to-late
2009 met my needs better due to Mentor's improvements in control logic
synthesis, which I discuss below.
My initial reason for using Catapult was to avoid wasted time doing
RTL-simulations in a timed environment. I wanted to have a very, very fast
untimed C simulation, then skip the painful RTL coding and verification
loop (code RTL, simulate, change the code, simulate, and so on).
Catapult supports both SystemC and C++, but I am currently designing using
C++. For my test case, I chose a memory management application. It would
be really hard to code it in RTL because I would have to do something like
a table walk (e.g. searching through a memory for values) and deal with the
memory latency. But when I code it in C++, it's just one line of code and
I don't have to deal with latencies.
DESIGNING HARDWARE IN C++ INSTEAD OF RTL
It took me about a month to understand the concepts associated with coding
hardware design in C++. For example, when you do RTL hardware design you
explicitly code parallelism. When I moved to C, I needed to get a feeling
for writing sequential descriptions. Catapult synthesizes high throughput
designs by inferring parallelism from the sequential descriptions, so I
didn't need to hard-code those design aspects anymore. It took time to
rewire my thinking.
I also had to understand the concepts of writing code with I/Os working in
parallel (e.g. channels, hierarchy).
I don't have charts comparing Catapult's automatically generated RTL and
hand-written RTL; I didn't write the RTL code for this application.
However, based on my experience writing RTL, I estimate that from start to
finish, it's about 5 to 10 times faster to use Catapult than to do hand
coding for a given application.
Design size: 500 lines of C++ code, plus 300 additional
lines for the testbench.
Catapult Output: 13,000 lines of RTL code.
Catapult Runtime: 10 minutes for the conversion.
The time savings is not just about automatically converting the C-code to
RTL versus writing it manually. The other powerful thing is Catapult's
verification flow. Catapult offers an environment for quick verification
of the generated code - it automatically verifies the RTL without me having
to write an RTL-testbench. I found it very easy to verify my design with
Catapult's verification flow since I needed a C testbench to verify my C
code anyway. I can get very fast 'pass or fail' results. Plus since I
have pure C++, there are fewer details in the source. It runs extremely
fast and there are fewer things to verify.
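For readers new to this flow, a self-checking C++ testbench can be as small
as the sketch below. It is only an illustration: scale_and_clip, golden and
run_testbench are hypothetical names, not part of the buffer handler design,
and plain int stands in for Catapult's bit-accurate types.

```cpp
#include <cassert>
#include <cstdio>

// Hypothetical design under test: the function that would go to synthesis.
int scale_and_clip(int x) {
    int y = x * 3;
    if (y > 255) y = 255;
    if (y < 0)   y = 0;
    return y;
}

// Hypothetical golden reference, written independently of the design.
int golden(int x) {
    long y = 3L * x;
    return y > 255 ? 255 : (y < 0 ? 0 : static_cast<int>(y));
}

// Drives the design with stimulus and compares against the reference.
// Returns the number of mismatches, so 0 means 'pass'.
int run_testbench() {
    int errors = 0;
    for (int x = -100; x <= 200; ++x) {
        if (scale_and_clip(x) != golden(x)) {
            std::printf("MISMATCH at x=%d\n", x);
            ++errors;
        }
    }
    return errors;
}
```

A main() that returns run_testbench() gives the 'pass or fail' answer as the
process exit code.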
Another powerful thing...imagine the following: You create a high speed
design in RTL with some memories and associated state machines. After
doing logic synthesis and place and route and some iterations, you realize
that you have to change your pipeline. You have to buffer your outputs at
the memories because the clock to output of the memories is quite long.
If you do this by hand in RTL - it will take a lot of time, but with
Catapult - it's just one click.
CONTROL LOGIC SYNTHESIS
We had a buffer handler, which handled an external SRAM which held buffered
data. Our handler had multiple queues, each with a minimum assured buffer
size. A queue can 'borrow' some space as needed. This queue will lose the
borrowed space if another queue has a request for memory and the used buffer
size of this queue is smaller than the assured size. The RAM is organized
in slices and each slice is filled without gaps, which lets us easily
change parameters such as the number of queues, the SRAM size, and so on.
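The borrowing rule can be read in more than one way. The sketch below shows
one possible reading, as plain bookkeeping only: a queue may take any free
slice, and a queue that is still below its assured size may reclaim a slice
from a queue that is above its own. The names Queue, BufferPool and request
are illustrative and not taken from the actual design, and the handling of
the data stored in a reclaimed slice is not modeled.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct Queue {
    unsigned assured;  // minimum guaranteed number of slices
    unsigned used;     // slices currently held by this queue
};

struct BufferPool {
    std::vector<Queue> queues;
    unsigned free_slices;      // slices not held by any queue

    // Try to give one slice to queue q. Returns true on success.
    bool request(std::size_t q) {
        assert(q < queues.size());
        if (free_slices > 0) {             // free space: grant it, possibly as a loan
            --free_slices;
            ++queues[q].used;
            return true;
        }
        if (queues[q].used >= queues[q].assured)
            return false;                  // a borrower cannot force a reclaim
        // Requester is below its assured size: take a slice back from a borrower.
        for (std::size_t i = 0; i < queues.size(); ++i) {
            if (i != q && queues[i].used > queues[i].assured) {
                --queues[i].used;
                ++queues[q].used;
                return true;
            }
        }
        return false;
    }
};
```

With two queues of assured size 2 and 4 free slices, queue 0 can grow to 4
slices, and later requests from queue 1 pull it back down to 2.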
When we had Catapult 2008a (without control logic synthesis), we used a
'blocking read' approach instead of control logic synthesis, where
basically we stopped portions of the design from running. This was very
hard to implement, and we ended up with a gap of 11 clock cycles.
Then last year, Mentor added control logic synthesis and 'non-blocking
reads' in Catapult C 2009a. It uses a special channel, which is a
non-blocking method, so while the hardware is waiting for data to arrive in
the FIFO, it does not stall the rest of the design.
Below is an example of our C++ code that we used to apply control logic
synthesis for an external memory using the 2009a version of Catapult. It
includes syntax for a non-blocking method, and our gap to switch to the
next descriptor was only 4 clock cycles.
void ext_mem ( ac_channel<ext_mem_ch_t>   &ext_mem_ch,
               ac_channel< ac_int<32,0> > &ch_ext_mem_pkt_len_rd,
               ac_channel< ac_int<32,0> > &rd_fifo,
               ac_channel< ac_int<32,0> > &wr_fifo,
               ac_int<32,0> ext_sram[DWORDS_PER_BUFFER * C_PTR_CNT] )
{ ext_mem_ch_t rd_str ;
  while(ac_int<5,false>(ext_mem_ch.size()) > 0)   //new in 2009a
  { rd_str = ext_mem_ch.read();
    unsigned int text_sram_adr = rd_str.adr;
    ac_int<12,0> tc_copy = rd_str.c_copy; ac_int<2,0> tcmd = rd_str.cmd;
    ac_int<12,false> cl; unsigned int help;
    ext_mem_copy_loop: for (cl = 0; ; cl++)
    { switch (tcmd)
      { case C_WR : help = wr_fifo.read();
                    ext_sram[text_sram_adr++] = help;
                    break;
        case C_RD : help = ext_sram[text_sram_adr++];
                    rd_fifo.write(help);
                    break;
        default   : help = ext_sram[text_sram_adr++];
                    ch_ext_mem_pkt_len_rd.write(help);
      } //switch
      if (cl == tc_copy) break;
    }//for
  }//while
}
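The polling pattern in this listing (test size() before calling read()) can
be tried on a host compiler without the Mentor headers. The sketch below is
a simplified illustration under stated assumptions: chan is a hypothetical
stand-in built on std::deque, not the real ac_channel class, and cmd_t and
drain are made-up names.

```cpp
#include <cassert>
#include <deque>

// Hypothetical stand-in for ac_channel, for host-side experiments only.
template <typename T>
class chan {
    std::deque<T> q_;
public:
    unsigned size() const { return static_cast<unsigned>(q_.size()); }
    void write(const T& v) { q_.push_back(v); }
    T read() {
        assert(!q_.empty());   // a blocking read would stall here instead
        T v = q_.front();
        q_.pop_front();
        return v;
    }
};

struct cmd_t { unsigned adr; unsigned count; };  // hypothetical descriptor

// Handles every pending descriptor and returns at once if there is none,
// so the caller is never stalled waiting for a command.
unsigned drain(chan<cmd_t>& cmds, chan<unsigned>& out, const unsigned* mem) {
    unsigned handled = 0;
    while (cmds.size() > 0) {                 // poll instead of blocking
        cmd_t c = cmds.read();
        for (unsigned i = 0; i < c.count; ++i)
            out.write(mem[c.adr + i]);        // stream the requested words
        ++handled;
    }
    return handled;
}
```

Calling drain() on an empty command channel returns 0 immediately, which is
the behavior the size() test is there to provide.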
HIERARCHY
For a high throughput design, I use hierarchy to get hardware which is
working in parallel. A simple example:
    aa = a + b;
    loop: for (i=0; i<10; i++) field[i] = a + b + i;
    dd = aa + b;
With RTL design, the operation: dd = ... could be done when the loop is
active, but, in a single block, Catapult cannot run code outside a loop as
long as the loop is active. So if I need more performance, I must map the
loop to hierarchy.
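A rough sketch of what that restructuring could look like in plain C++ is
shown here. The loop moves into its own function and streams its results
through a channel, leaving the straight-line arithmetic in a separate
function. The names fifo, fill_block, sum_block and top are hypothetical,
fifo is a std::deque stand-in rather than ac_channel, and the tool settings
that actually map each function to its own hardware block are not shown.

```cpp
#include <cassert>
#include <deque>

// Hypothetical stand-in for a streaming channel between two blocks.
template <typename T>
struct fifo {
    std::deque<T> q;
    void write(const T& v) { q.push_back(v); }
    T read() {
        assert(!q.empty());
        T v = q.front();
        q.pop_front();
        return v;
    }
};

// Block 1: the loop on its own, so it can become a separate block.
void fill_block(int a, int b, fifo<int>& field_out) {
    for (int i = 0; i < 10; i++)
        field_out.write(a + b + i);   // was: field[i] = a + b + i
}

// Block 2: the code around the loop, with no dependence on the loop.
int sum_block(int a, int b) {
    int aa = a + b;
    return aa + b;                    // dd
}

// Top level: the sequential C++ still calls the blocks one after another.
int top(int a, int b, fifo<int>& field_out) {
    fill_block(a, b, field_out);
    return sum_block(a, b);
}
```

In the C++ simulation the two calls still run in sequence; the point of the
split is that neither function contains work that has to wait on the other.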
There are 2 ways I can use Catapult C to design with hierarchy. The first
is 'out-of-order array access,' which results in ping pong mirror memories
with the same data. The second is 'In-order array access,' which has fast,
automatic streaming of arrays shared between blocks that just go through a
pipe. We can use this type of hierarchy for items such as external memory
functions when:
- Indexing the blocks in the same order
- Indexing unconditional (matched data rates)
- Algorithmic C Channel class; Streaming behavior coded directly into
the algorithm.
AC CHANNELS
Catapult has Algorithmic C Channel classes (ac_channels), with streaming
behavior coded directly into the algorithm. I use channels to implement
the block interface when doing hierarchical design with multiple blocks.
This is what AC channels look like when you write them in C++. The sample
below was for a complex, hierarchical, external SRAM controller (i.e.
Non-datapath) design.
void ext_mem( ac_channel<ext_mem_ch_t>   &ext_mem_ch,
              ac_channel< ac_int<32,0> > &ch_ext_mem_pkt_len_rd,
              ac_channel< ac_int<32,0> > &rd_fifo,
              ac_channel< ac_int<32,0> > &wr_fifo,
              ac_int<32,0> ext_sram[DWORDS_PER_BUFFER * C_PTR_CNT]);

void buf_handler_dh( ac_channel<ext_mem_ch_t> &ext_mem_ch,
       ac_channel<ch_cmd_t> &ch_cmd,
       ac_int<C_NEEDED_BITS_4_DW_IN_QUEUE,0> dw_used_in_queue[C_QUEUE_CNT],
       ac_channel< ac_int<32,0> > &ch_ext_mem_pkt_len_rd,
       ac_channel< ac_int<32,0> > &ch_pkt_len);

//###### pragma hls_design top #######
void buf_handler ( ac_channel<ch_cmd_t> &ch_cmd,
       ac_channel< ac_int<32,0> > &rd_fifo,
       ac_channel< ac_int<32,0> > &wr_fifo,
       ac_int<C_NEEDED_BITS_4_DW_IN_QUEUE,0> dw_used_in_queue[C_QUEUE_CNT],
       ac_channel< ac_int<32,0> > &ch_ext_mem_pkt_len_rd,
       ac_channel< ac_int<32,0> > &ch_pkt_len )
{
  static ac_int<32,0> ext_sram[DWORDS_PER_BUFFER * C_PTR_CNT];
  static bool uninit_ext_sram =
     ac::init_array<AC_VAL_DC>(ext_sram,DWORDS_PER_BUFFER * C_PTR_CNT);
  static ac_channel<ext_mem_ch_t> ext_mem_ch;
  buf_handler_dh( ext_mem_ch, ch_cmd, dw_used_in_queue,
                  ch_ext_mem_pkt_len_rd, ch_pkt_len);
  ext_mem(ext_mem_ch, ch_ext_mem_pkt_len_rd, rd_fifo, wr_fifo, ext_sram);
}
CATAPULT DESIGN ANALYSIS
It's important to make sure your C++ code is clean before you synthesize
it, just as you perform linting on your RTL code. Otherwise you risk
exhibiting false errors or falsely passing the testbench. Catapult makes
it straightforward to check for these issues via built-in linting for
common mistakes.
- Static checks. Catapult checks the source code during compilation,
such as reporting of uninitialized values.
- Runtime checks. Catapult performs runtime checks such as reporting
out-of-bound array access.
You need coverage analysis for C verification just as you do during RTL
verification. Catapult also does a coverage analysis verification check,
reporting how well each branch of the design has been exercised in your
testbench. It's completely automated so you don't have to learn another
tool.
DRAWBACKS
The drawbacks I found were mostly the trade-offs related to high level
synthesis versus hand-coded RTL design.
1. In the same way that it's hard to debug a gate level netlist after
Synopsys DC, it's also difficult to debug and manually change the
RTL code that Catapult generates. Mentor does help you better
understand the generated RTL through its RTL schematic viewer
(block diagrams), where we can cross probe between the RTL and the
C source file. Mentor also added a new naming convention
recently, where the objects (signals, operators) are named in a
consistent and predictable way.
2. There are some tradeoffs between the C language possibilities and
the parallel HW. I'd like Catapult to add something like the threads
in multi-tasking operating systems; right now it doesn't do so.
When we initially decided to move to Mentor Catapult C high level
synthesis, we had questions about using C++ for hardware design, since C++
is a sequential language. For example, how would we handle parallelism,
and transition from untimed to timed code and handle external interfaces?
On the flip side, Catapult was widely used, C++ has unbeaten simulation
time over RTL, and this way our hardware / software people would be talking
one language. Also, we liked Catapult's built-in verification steps.
In the end we learned that our results with Catapult depend on our coding
style and that HDL design experience is important to have.
- Tim Koeppe
Nokia Siemens Networks GmbH Greifswald, Germany