( ESNUG 541 Item 4 ) -------------------------------------------- [05/23/14]
From: [ Boris Hristov of Ciena ]
Subject: Ciena user reviews Calypto Catapult HLS design and verification
Hi, John,
I am an ASIC designer at Ciena and also responsible for digital ASIC
methodology. For the DSP portion of our ASICs, which accounts for about
1/2 of the digital core of our recent WaveLogic 3 (WL3) ASIC, we used the
Calypto Catapult High Level Synthesis (HLS) tool in our production flow.
I wanted to share with your readers our methodology/feedback on Calypto.
For confidentiality reasons, I cannot provide the usual timing/area/power
numbers for our blocks. However, since Ciena's WL3 chips are coherent
optical processors used in networking, I can say performance and power are
critical for us. Our target bit-rates are 100 Gbit/sec or higher.
---- ---- ---- ---- ---- ---- ----
INITIAL EVAL -- CATAPULT VS HAND CODED RTL
I ran our initial Catapult evaluation about 2 years ago, against a block
that I had previously hand-coded in RTL.
I picked this particular block because it is representative of the types of
DSP designs we have and I was very familiar with it. Catapult has a
scheduler to help visualize the loop and change parameters for latency and
throughput. I wanted to be able to explore various implementations and get
quick feedback on actual implementation area, timing and power.
Block A
Description:
Single loop, highly parallel pipeline design with 48 total
operations: 32 multipliers, 16 adders.
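The post does not say what Block A computes. Purely as an illustration of how a single loop reaches this operation count when fully unrolled, the sketch below assumes a hypothetical kernel of eight complex multiplications: each one needs 4 real multiplies and 2 real add/subtracts, giving 32 multipliers and 16 adders in total. The type and function names are invented for the example.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Hypothetical fixed-point complex sample types (not from the original post).
struct Cplx16 {
    std::int16_t re;
    std::int16_t im;
};
struct Cplx64 {
    std::int64_t re;
    std::int64_t im;
};

constexpr std::size_t kLanes = 8;  // assumed: 8 complex products per call

// Each iteration uses 4 multiplies and 2 add/subtracts, so a fully
// unrolled version of this loop needs 8*4 = 32 multipliers and
// 8*2 = 16 adders.
std::array<Cplx64, kLanes> complex_mul_block(
    const std::array<Cplx16, kLanes>& a,
    const std::array<Cplx16, kLanes>& b) {
    std::array<Cplx64, kLanes> y{};
    for (std::size_t i = 0; i < kLanes; ++i) {
        const std::int64_t ar = a[i].re, ai = a[i].im;
        const std::int64_t br = b[i].re, bi = b[i].im;
        y[i].re = ar * br - ai * bi;
        y[i].im = ar * bi + ai * br;
    }
    return y;
}
```

Whether the loop is built with 32 parallel multipliers or fewer shared ones is exactly the kind of choice an HLS scheduler exposes.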
Catapult's QoR Results:
It's important to note that Catapult gives feedback based on your target
ASIC libraries.
- Timing - Catapult's RTL matched my RTL code timing for throughput
and latency.
- Area - Catapult met my area requirements, and was within 10%.
- Power - Catapult does power estimation, and we used it to compare
relative power between different implementations.
Time savings:
- RTL hand-coding: Took me 1 week to hand code this one RTL block.
- Catapult HLS: It took me 2 days to set up Catapult the first time
(with the help of their FAE), then only minutes to generate several
implementation cases for the block.
Catapult passed our evaluation and we purchased it. Its major benefits to
us were: 1) a fast design exploration of the possible solution space, 2) the
visualization from the scheduler and 3) the potential verification time
savings -- more on this when I discuss our production flow.
---- ---- ---- ---- ---- ---- ----
CATAPULT PRODUCTION DESIGN RESULTS
Next, we used Catapult for two types of designs in our production blocks.
All instances of variants of these two blocks accounted for about 50% of our
total production ASIC gate count.
Block B & Block C
Descriptions:
- Block B was an FFT (Fast Fourier Transform)
- Block C was a Digital Filter
- The two blocks had between ~80 and 5,000 operations for various
implementations
- In most cases blocks were integrated into the design as a triplet,
but were each designed separately with their own parameters.
- We had 16 total instances across our design, with different
parameters for each instance.
Catapult's QoR Results:
- Timing - Met tight performance targets for all instances of
each block.
- Area - met our objectives for all instances of each block.
- Power - We focused on sub-chip (sub-system) level power targets
and once we had selected the block implementation we did not use
Catapult power numbers later in the flow.
Time savings:
- Using Catapult saved us substantial time here. We created
only a few C++ models and synthesized RTL variants by changing
our Catapult parameters.
- Time to synthesize each RTL block: Varied from 20 min to 2 hours.
- We did not RTL hand-code these blocks; however based on experience,
I expect hand-coding each instance according to the specified
parameters would have taken approximately 1 week for each variant.
Verification:
- We still simulated Catapult's RTL output using a digital simulator
running in a System Verilog test bench. Because the RTL was
machine-generated rather than hand-coded, we focused on integration
simulations, but we also ran C to RTL equivalency simulations as
part of the System Verilog test plan.
- We primarily used the scheduler visualization for throughput/latency
and resource sharing analysis, and Catapult itself to synthesize our
SystemC to RTL. The synthesis gave us timing, area and power
estimates that were very useful for comparing implementations.
[Figure: Catapult schedule snapshot, low latency Inverse FFT implementation]
[Figure: Catapult schedule snapshot, high latency Inverse FFT implementation]
Using the schedule visualizer we can see each operation use and resource
sharing -- depending on the particular implementation parameters. From the
Catapult snapshots above it was immediately visible that our Low Latency
implementation had many more parallel operations while our High Latency
implementation had much more resource reuse -- hence a smaller area design.
This kind of information is crucial for selecting our best architecture on
a case-by-case basis.
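The trade-off between parallel operations and resource reuse can be sketched with a toy model. This is not Catapult's scheduler and not our FFT; it is a hypothetical 8-term dot product where the same multiplies are mapped onto a chosen number of multipliers, so the result stays the same while the multiplier count and cycle count move in opposite directions. All names are assumptions made for the example.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

constexpr std::size_t kTerms = 8;
using Vec = std::array<std::int32_t, kTerms>;

struct Result {
    std::int64_t value;    // computed dot product
    unsigned multipliers;  // multiplier instances this schedule uses
    unsigned cycles;       // cycles spent issuing the multiplies
};

// Map the kTerms multiplies onto `mults` multipliers. mults == kTerms
// models the fully parallel (low latency) case; mults == 1 models full
// resource sharing (high latency, small area).
Result dot_scheduled(const Vec& a, const Vec& b, unsigned mults) {
    assert(mults > 0);
    Result r{0, mults, 0};
    for (std::size_t base = 0; base < kTerms; base += mults) {
        // One cycle: up to `mults` products are formed side by side.
        for (std::size_t i = base; i < base + mults && i < kTerms; ++i) {
            r.value += static_cast<std::int64_t>(a[i]) * b[i];
        }
        ++r.cycles;
    }
    return r;
}
```

Calling `dot_scheduled(a, b, 8)` and `dot_scheduled(a, b, 1)` returns the same `value`, but the first reports 8 multipliers and 1 cycle and the second 1 multiplier and 8 cycles.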
There is a "sc_verify" flow for C-to-RTL equivalency simulations that we
did not try since we used our traditional System Verilog flow. Mentor
Questa was running underneath "sc_verify" for all our RTL, so we could
use its code-coverage and functional coverage. (Questa is built-in to
the Catapult tool suite.)
I also liked its interface synthesis. Once you select the type of interface
in Catapult, the RTL interface for our DSP block is synthesized as well,
in line with the throughput pragmas.
---- ---- ---- ---- ---- ---- ----
OUR HLS/RTL/VERIFICATION PRODUCTION DESIGN METHODOLOGY
We have modified our design methodology since we added Catapult to our
production flow. Below is our current approach:
1. All our DSP models are written in C++.
2. System designers work on our untimed system algorithms in C++.
These high level designs are primarily architectural. They use
MATLAB and C testbenches to test our C++ models.
3. Our DSP development group modifies the C++ for the DSP sub-models which
get implemented inside our ASIC. These C++ models are more
detailed, and a bit closer to hardware than the system models,
but are still untimed and implementation generic. It is
useful to have a single C++ model for our system simulations
and Catapult implementation whenever possible.
4. We simulate our system-level C++ models using an in-house
simulator.
5. We use Calypto's Catapult high level synthesis to synthesize
our C++ into RTL (Verilog).
6. We verify the Verilog RTL code using standard simulators and a
System Verilog testbench.
7. We lint the RTL that Catapult produces as we do for any RTL code.
8. We run the linted and functionally verified RTL through our
standard RTL synthesis flow.
9. We are a fabless semiconductor company, so we then send our
gate-level design to our ASIC vendor for physical implementation.
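To illustrate the idea in step 3 of keeping a single C++ model for both system simulation and implementation, here is a minimal sketch that assumes the model is templated on its sample type. The filter, the name `fir`, and the tap count are hypothetical. In an HLS flow the integer instantiation would typically be replaced by a bit-accurate type (for example the Algorithmic C datatypes); plain integers are used here only so the sketch compiles with a standard toolchain.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

constexpr std::size_t kTaps = 4;  // assumed tap count for the example

// One untimed model, parameterized on the sample type. The system team
// could instantiate it with double; the implementation team could
// instantiate it with an integer or bit-accurate fixed-point type.
template <typename T>
T fir(const std::array<T, kTaps>& window,
      const std::array<T, kTaps>& coeffs) {
    T acc = T(0);
    for (std::size_t i = 0; i < kTaps; ++i) {
        acc += window[i] * coeffs[i];
    }
    return acc;
}
```

A C testbench can then run the same stimulus through `fir<double>` and `fir<std::int64_t>` and compare the two, which is a cheap check that the hardware-oriented version still tracks the system model.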
---- ---- ---- ---- ---- ---- ----
CHANGES IN HOW WE DESIGN AND VERIFY
Below is where we've been evolving our methodology to include C-to-Verilog
high level synthesis:
- Our RTL designers now get involved earlier. Our RTL folks work
closely with the system C++ team to make sure that the design will
be implementable. As we get into the interfaces and data
structure, our RTL designers also need to make sure that the
models are testable.
We are actually seeing RTL guys modifying C code as they get it from DSP
guys, but it is still a collaborative effort of the DSP and RTL person on
each block. Ultimately the DSP engineer is responsible for the algorithm and
the RTL engineer for the gate-level implementation.
- We get our RTL designed faster. In general, for a well understood
block that would take 1 week to develop by hand-coding RTL, it will
take Catapult 1 day to finalize RTL implementation, including all
the design exploration.
- We do more design exploration. For example, two major parameters
that Catapult can control during scheduling are throughput and
latency. Catapult can graphically represent the logic for every
tick of the loop, so we can analyze the resource sharing of the
algorithms. Every time we change one of the parameters, we get a
new design with estimated timing, area, and power, based on our
target library.
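A back-of-the-envelope version of this exploration can be written down directly. The sketch below is not a Catapult feature; it uses the standard lower bound that a pipelined loop with N multiplies per iteration, accepting a new iteration every II cycles, needs at least ceil(N / II) multipliers. The struct and function names are assumptions for the example, and real area and timing numbers would still have to come from the tool and the target library.

```cpp
#include <cassert>
#include <vector>

struct DesignPoint {
    unsigned initiation_interval;  // cycles between new loop iterations
    unsigned multipliers;          // minimum multipliers at that interval
};

// Each multiplier can serve `ii` operations per iteration window, so at
// least ceil(mult_ops / ii) multipliers are required.
unsigned multipliers_needed(unsigned mult_ops, unsigned ii) {
    assert(ii > 0);
    return (mult_ops + ii - 1) / ii;
}

// Sweep the throughput parameter and record one design point per value.
std::vector<DesignPoint> sweep(unsigned mult_ops, unsigned max_ii) {
    std::vector<DesignPoint> points;
    for (unsigned ii = 1; ii <= max_ii; ++ii) {
        points.push_back({ii, multipliers_needed(mult_ops, ii)});
    }
    return points;
}
```

For a loop with 32 multiplies, `sweep(32, 4)` gives 32, 16, 11 and 8 multipliers for intervals 1 through 4.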
We raised the level of abstraction for the bulk of our designs with C even
before we used Catapult. With Catapult we can now get to RTL implementation
much faster and explore various implementation characteristics much more
easily.
From what I'm hearing about their SLEC, it'll let us move our verification
efforts up. Our RTL test benches are not used for C-to-RTL equivalency
testing any more, but for sub-system integration testing.
---- ---- ---- ---- ---- ---- ----
TESTBENCHES, SLEC, ASSERTIONS, POWER
Automated C Testbenches and Optimal Stimulus Generation
Catapult has a "sc_verify" flow to aid the DSP block C verification. It's
unusual to not use it for C-to-RTL equivalency (formal methods are more
effective for proving equivalency) -- but instead to use it for C block
verification with Mentor inFact graph-based stimulus generation.
InFact allows for graphical representation of this stimulus and the various
interactions to be investigated -- with a minimal set of test cases created
to fully cover your desired stimulus using an automatically generated
Catapult "sc_verify" SystemC testbench.
The key value of inFact is that it can generate random stimulus for
C/SystemC-based verification, using constraints in a similar way to SV
constrained random, with the increased efficiency of its coverage closure
algorithms, and then report the coverage achieved.
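For readers who have not seen constrained random stimulus at the C level, here is a minimal sketch using only the C++ standard library. It is not the inFact API and does not reproduce its graph-based coverage closure; it only shows the basic loop of drawing constrained values, sampling coverage bins, and stopping once coverage is closed. The constraint (even values only) and the four bins are assumptions for the example.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <random>

// Hypothetical coverage model: four value ranges that the stimulus
// should hit, [0,63] [64,127] [128,191] [192,255].
struct Coverage {
    std::array<bool, 4> bin_hit{};
    void sample(int x) { bin_hit[static_cast<std::size_t>(x / 64)] = true; }
    bool closed() const {
        for (bool hit : bin_hit) {
            if (!hit) return false;
        }
        return true;
    }
};

// Draw constrained random values until every bin is hit or the budget
// is spent. Returns the number of stimulus values used.
unsigned run_until_covered(Coverage& cov, unsigned seed, unsigned budget) {
    std::mt19937 rng(seed);
    std::uniform_int_distribution<int> half(0, 127);
    unsigned used = 0;
    while (used < budget && !cov.closed()) {
        const int x = 2 * half(rng);  // constraint: even, in [0, 254]
        assert(x >= 0 && x <= 254);
        cov.sample(x);
        ++used;
    }
    return used;
}
```

Plain random draws like this usually need more samples than the number of bins; the appeal of a coverage-driven generator is getting closer to that minimum.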
In addition, the same graph used for the C models in the "sc_verify" flow
can be targeted at the RTL if desired, with consistent stimulus applied in
each case. Even with formal methods used for equivalency the same graph-
based stimulus could still be useful for debugging differences if detected.
C-to-RTL formal equivalency
Formal equivalency checking is a great productivity enhancer for both
hand-written and synthesized RTL. If one can formally verify the blocks at C
level and check formal equivalency with RTL, there is no more need to create
individual block-level testbenches. This could greatly reduce the amount of
System Verilog co-simulations required.
Calypto SLEC is particularly interesting because of its obvious tight hooks
with Catapult.
- Catapult automatically produces the entire set up for SLEC, such
as interface properties, latency/throughput, loop-related
information, intermediate equivalence points.
- We put pragmas in our C++ code that Catapult synthesizes, with
specific instructions to guide the synthesis. SLEC understands
these pragmas.
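As a rough picture of what pragma-annotated C++ looks like, the sketch below marks a hypothetical top-level function and asks for its loop to be unrolled. The pragma spellings are written from memory and should be treated as assumptions to check against the Catapult documentation; the function itself is invented for the example. A standard C++ compiler ignores pragmas it does not recognize (possibly with a warning), so the same source still runs in a plain C testbench.

```cpp
#include <cstdint>

constexpr int kSamples = 4;  // assumed block size for the example

// Assumed pragma: marks this function as the top of the synthesized design.
#pragma hls_design top
std::int64_t energy4(const std::int16_t x[kSamples]) {
    std::int64_t acc = 0;
    // Assumed pragma: requests full unrolling, i.e. one multiplier per
    // sample instead of one shared multiplier.
#pragma hls_unroll yes
    for (int i = 0; i < kSamples; ++i) {
        acc += static_cast<std::int64_t>(x[i]) * x[i];
    }
    return acc;
}
```

Because the guidance lives in the source, a downstream tool that reads the same pragmas sees the same intent as the synthesis run did.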
There will always be some RTL that doesn't have a C model, so RTL simulation
will still be required, but mostly for SoC/block integration.
C-level assertion-based verification
Over the past decade, RTL verification has shifted from primarily directed
tests and random simulation to also include assertion-based verification.
As we shift our design focus to C++, we want to use the same rigor as was
used on the RTL. We began using assertion-based verification at the C level
to increase our confidence in the C++ code.
After you code your assertions/properties and coverage points into your
C model, HLS tools can synthesize and propagate them all into your RTL in
standard SVA or OVL formats. You can then not only use these properties at
the C level for formal proof or dynamic simulation, but can benefit from
having them automatically embedded in the RTL, rather than having to create
them manually.
With this flow, our DSP blocks are IP that has two functionally identical
"views": C and RTL, where a single C view can have multiple functionally
equivalent RTL implementations.
The assertions and coverage points increase the quality of our IP blocks.
Plus the assertions travel with the IP and provide white box visibility into
the IP during system verification -- regardless of what "view" is used for
the particular testing.
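A minimal sketch of what an assertion and coverage points embedded in a C model can look like, using only the standard `assert` macro and hand-rolled counters. The saturating adder, the counter names, and the property are hypothetical; how an HLS tool maps such constructs to SVA or OVL in the RTL is tool-specific and not shown here.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical coverage counters that travel with the model.
struct SatAddCoverage {
    unsigned saturated_high = 0;
    unsigned saturated_low = 0;
    unsigned in_range = 0;
};

// Saturating 16-bit add with an embedded property and coverage points.
std::int16_t sat_add16(std::int16_t a, std::int16_t b, SatAddCoverage& cov) {
    const std::int32_t wide = static_cast<std::int32_t>(a) + b;
    std::int32_t clipped = wide;
    if (wide > INT16_MAX) {
        clipped = INT16_MAX;
        ++cov.saturated_high;  // coverage point: positive saturation seen
    } else if (wide < INT16_MIN) {
        clipped = INT16_MIN;
        ++cov.saturated_low;   // coverage point: negative saturation seen
    } else {
        ++cov.in_range;        // coverage point: no saturation
    }
    // Property: the result always fits the 16-bit output.
    assert(clipped >= INT16_MIN && clipped <= INT16_MAX);
    return static_cast<std::int16_t>(clipped);
}
```

After a test run, any counter still at zero points to a corner of the block that the stimulus never exercised.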
C level power optimization
Ciena is looking into using C-level power optimization tools for things
beyond algorithm and bit-width optimizations: optimizations that are
propagated to the RTL in the form of clock-gating enables and memory access
optimizations.
---- ---- ---- ---- ---- ---- ----
WHAT'S MISSING FROM CATAPULT
When you have a lot of test cases, assertions, and functional coverage, it
can be hard to visualize them -- or measure and track coverage. I would
love to see a graphical representation of the merged functional coverage
at the C level. You can do it now by synthesizing RTL and co-simulating in
a "sc_verify" flow and collecting coverage in Questa, but I'd like to have
a pure "sc_verify" C functional coverage flow.
Although Catapult can currently tackle most designs using a hierarchical
approach, I'd be happier if it could handle even larger flat designs.
---- ---- ---- ---- ---- ---- ----
CONCLUSION
It is hard to quantify exactly, but I estimate Catapult cut our overall
development effort by 25% for our DSP block level designs.
In addition, I expect some further savings in verification time with SLEC,
though it's too early for me to predict that.
Looking forward, we hope to expand our HLS use to synthesize almost all of
our DSP block level designs. We are also looking into using TLM interfaces
to build even deeper hierarchy in C.
- Boris Hristov
Ciena Corp. Ottawa, Canada