( ESNUG 485 Item 4 ) -------------------------------------------- [06/08/10]

Subject: (ESNUG 482 #6) User experiences with HLS and AutoESL's AutoPilot

> We purchased AutoESL's AutoPilot in 2008 to implement some of the time-
> consuming cores in our software into FPGA hardware for the runtime
> speed-up improvements.  We found this can often accelerate our SW
> runtimes by 2-3 orders of magnitude.
>
>     - Ningyi Xu
>       Microsoft Research Asia                    Beijing, China


From: [ The Mouse That Roared ]

Hi John,

Please keep me anonymous.

Thought I'd pass along some of our results using AutoESL's AutoPilot for
high level synthesis (HLS) -- or what you like to call "C synthesis". 

We've been tracking HLS claims for several years, but only recently has it 
begun to look like a viable option for chip design.  I've check-pointed HLS
in the past, and generally returned to RTL, since the QoR with direct RTL
is much higher.  (We're in the high-volume chip business and really can't
accept a 10% overhead.) 

We recently ran a benchmark through AutoESL's AutoPilot -- which takes an 
untimed C model (in C/C++/SystemC) and generates various fixed architectures 
of our design with timing.  We already had a VHDL RTL version of our design
along with a fixed-point C model of it (still module level though) to
compare against.

During our initial evaluation, we gave the AutoESL AE the C model for our 
design.  The AutoESL team was ultimately able to automatically generate RTL 
that beat the QoR for our optimized hand-coded RTL in terms of performance, 
area and power -- the most significant was 12% lower power in the design. 

We did a comparison on the final design with Synopsys Design Compiler and 
Magma Talus P&R where we built both designs in-house from the source
(hand-coded VHDL RTL and the AutoPilot-generated RTL), then measured power
with real activity vectors. 

Block A: AutoESL results vs. hand-coded (both results after running through 
Synopsys Design Compiler + Talus P&R) 

                 AutoPilot 1st result         AutoPilot final result
                 --------------------         ----------------------
    Area             2% larger                       1% smaller
    Power            2% higher                      12% lower
    Latency         40% higher                       6% lower

                Original design:  800 lines of ANSI C code
              Hand-coded design:  4000 lines of VHDL RTL code
       AutoESL-generated design:  100,000 lines of Verilog RTL code 

There were two significant things about the benchmark.  

 1. First, the initial results from AutoESL were in the ballpark but 
    slightly worse in all categories.  But after the AutoESL team tweaked 
    some of the design constraints (and I'm assuming improved some of 
    AutoPilot's internal optimizations), they were able to significantly 
    improve on the result. 

 2. The second item that was surprising was there was very little change to 
    the C model for AutoESL to get the final results; it was more about 
    properly constraining the design.  They limited the design changes to 
    replacing some functions with constant coefficients, partitioning a few 
    functions into smaller blocks, and putting in place some input/output 
    packing functions.  I would estimate that less than 5% of the code 
    changed, and for the most part the core arithmetic code stayed the same. 

I don't have a full accounting of all the time spent, but in this early 
ramp-up period, I would think we spent similar amounts of time generating 
the C model version of the design as writing our hand-coded VHDL RTL model.
Our hand-coded VHDL RTL included a considerable amount of configurability
with generics, variable arrays, and generate statements.  We were trying
to make our VHDL as configurable as possible to allow better architecture
decisions, and hence the number of lines of code is probably longer than
if we had simply coded the final design architecture.  But the cool thing
we found out after all the front-end work was done with the C model and
AutoPilot, was that we could use AutoPilot to search a large design space
with very little additional effort.  AutoESL handed over all the
scripts containing constraints and directives and we were able to run
the design in-house. 

Our hand-coded RTL was based on concrete specs for throughput and latency, 
and had a particular architecture in mind.  AutoESL's generated Verilog RTL
design was more flexible, and allowed us to quickly answer these questions: 

   - what happens if we relax the latency constraint?
   - what if we want to halve the clock rate, and run more things in 
     parallel?
   - what if we want to double the clock rate, and halve the parallelism?

The tool payback starts to become very apparent at this point.  In the case 
of the hand-coded RTL, we could make some architecture changes because of 
the configurability of the VHDL code, but it invariably required some 
pipelining adjustments, as well as another round of verification simulation 
to get the design working again.  AutoESL allows us to change clock period, 
pipelining, and immediately spit out a new working design with the new 
constraints. 
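
To make that kind of what-if concrete, here is a back-of-the-envelope
sketch in C++.  It is not AutoPilot output or an AutoPilot API; the
DesignPoint struct, its fields, and both formulas are assumptions chosen
purely for illustration (each lane accepts one sample per clock, and block
latency is pipeline fill plus ceil(n / lanes) cycles).

```cpp
#include <cassert>

// Hypothetical design point; none of these names come from AutoPilot.
struct DesignPoint {
    double clock_mhz;   // target clock rate in MHz
    int    lanes;       // number of parallel datapath copies
    int    pipe_depth;  // pipeline stages, i.e. cycles needed to fill the pipe
};

// Sustained throughput in Msamples/s, assuming one sample per lane per clock.
double throughput_msps(const DesignPoint& d) {
    return d.clock_mhz * d.lanes;
}

// Time in ns to push a block of n samples through the design:
// pipeline fill plus ceil(n / lanes) issue cycles, times the clock period.
double block_latency_ns(const DesignPoint& d, int n) {
    assert(d.lanes > 0 && d.clock_mhz > 0.0);
    int cycles = d.pipe_depth + (n + d.lanes - 1) / d.lanes;
    return cycles * 1000.0 / d.clock_mhz;
}
```

Under these assumptions, a 200 MHz design with 4 lanes and an 8-stage
pipeline, and a 400 MHz design with 2 lanes and a 16-stage pipeline, both
sustain 800 Msamples/s and both take 120 ns for a 64-sample block.  An
estimate like this is only a sanity check to hold the tool's reported
numbers against, not a substitute for running the design.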

There was a 25X difference in lines of RTL code generated by AutoPilot 
versus our hand-coded VHDL design.  It appears that AutoESL's automatically 
generated Verilog RTL code is much closer to a netlist-style approach than 
hand-coded RTL. 


Some gotchas we found using HLS and AutoPilot:

 1. One interesting point we have found in the HLS development strategy
    is that the ability to make sure the code is clean from errors is very
    important.  Un-initialized variables are a big problem, since they can
    produce problems in downstream verification.  

 2. One of our requirements is to be able to run Valgrind on our original
    C/C++ source code in order to track down the problems before the code goes
    through HLS.  Initially, our plan for code development was to use the
    standard SystemC fixed-point datatypes (sc_fixed, sc_int) for 
    detailed fixed-point design.  But after spending a considerable amount
    of time compiling in the SystemC libraries into our designs, we found
    too many problems.  For example, once you include the systemc.h header
    files, it includes many more header files under the hood.  We had
    several internal conflicts on datatypes.  Furthermore,
    we could never get Valgrind to work, since it reported too many
    problems in the SystemC internals for us to even begin to debug our
    code.

 3. Ultimately, we have standardized on the SystemC formats for our coding
    style, but we are actually mapping directly to AutoESL's implementation
    of the datatypes.  For example, we can write the C code with:

                            sc_fixed<10,2>

    but it is remapped to the

                            ap_fixed<10,2>

    with the correct include files.  With the AP types, we can easily get
    the code Valgrind clean, and it matches with the synthesizable code
    that AutoPilot will generate.  Since AutoPilot supports all 3 languages
    (C, C++ and SystemC), we can use this kind of hybrid approach to our
    high level synthesis language input.

 4. The key to getting a good design is knowing your design targets and 
    getting the constraints set properly.  My experience is that a loosely 
    constrained design in Design Compiler still gets a reasonable design 
    output.  However, AutoPilot can drastically change the microarchitecture
    (pipelining, for example), so it is important to constrain designs
    judiciously in order to get expected results, especially for latency and
    throughput.  We have seen results where you get a larger design and
    higher latency if you don't have the constraints set properly.

 5. Furthermore, knowing the expected outcome is important to know when to 
    stop exploring the design.  In our case, we had a hand-coded RTL design
    to compare against, but in the future the HLS output is the only 
    thing you have.  Our conclusion is that HLS doesn't relieve the designer
    from understanding the intimate details of the block; rather HLS is a 
    way to not have to worry about the details of pipelining and RTL 
    construction, while focusing on architecture tradeoffs.  You still need 
    to at least have a back of the envelope estimate for the design to 
    sanity check the results.

 6. We need to instrument some non-synthesizable constructs for debug and 
    design creation in our code.  Right now in AutoPilot, the
    non-synthesizable constructs cannot be in the code, and we need to use
    #ifdef statements to hide those statements.  That leads to more
    cumbersome code, and we lose some C++ design styles (abstract base
    classes, automatic hierarchy parsing, file I/O).  It would be much
    better for development if AutoPilot could automatically ignore
    non-synthesizable statements.

 7. Fundamentally, the output RTL from the tool is very close to a netlist 
    design, so it will be very challenging to track down a bug in 
    AutoPilot's RTL.  Its cross-probing ability is very limited at this 
    point, and ECOs will be challenging.  It is likely that bug fixes will 
    require a full resynthesis - which can be a tough problem when an entire
    chip is close to tapeout, and we need just an ECO to fix a small 
    problem.  This will re-open the netlist, RTL synthesis, P&R, etc. 

 8. The HLS methodology requires that C-to-RTL formal verification mature
    very quickly.  Right now, we assume the output of AutoPilot is correct
    by construction, and verify it with directed simulations of the C
    models against the output RTL.  In order for HLS to become mainstream,
    formally verifying the original C model against the RTL is critical in
    order to trust that the design will work under all conditions.
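
A minimal sketch of the coding style from gotchas 1, 3 and 6, under stated
assumptions.  The header name ap_fixed.h, the to_double() member, and the
USE_AP_TYPES and HLS_BUILD macros are assumptions for illustration; check
them against your own tool installation.  The stand-in sc_fixed below is a
hypothetical toy (truncating, no overflow handling) that exists only so the
example compiles without SystemC or vendor headers.  It is not our actual
code, and the accumulator is a double only to keep the host-side example
short.

```cpp
#include <cassert>
#include <cmath>
#include <cstdio>

#ifdef USE_AP_TYPES
// Assumption: the tool ships ap_fixed.h declaring ap_fixed<W, I>.
#include "ap_fixed.h"
template <int W, int I> using sc_fixed = ap_fixed<W, I>;
#else
// Hypothetical stand-in: W = total bits, I = integer bits (sign included).
template <int W, int I>
struct sc_fixed {
    static_assert(W > I && W <= 32, "toy type needs a fraction and <= 32 bits");
    long long raw = 0;  // always initialized, so no undefined reads
    sc_fixed() = default;
    // Quantize by truncating toward minus infinity to W - I fractional bits.
    sc_fixed(double v)
        : raw(static_cast<long long>(std::floor(std::ldexp(v, W - I)))) {}
    double to_double() const {
        return std::ldexp(static_cast<double>(raw), -(W - I));
    }
};
#endif

// Written in the SystemC style; remapped by the include choice above.
typedef sc_fixed<10, 2> coef_t;  // 10 bits total, 2 integer, 8 fractional

// Sum four coefficients on the host side.
double sum4(const coef_t c[4]) {
    double acc = 0.0;  // explicit initialization (gotcha 1)
    for (int i = 0; i < 4; ++i) {
        acc += c[i].to_double();
#ifndef HLS_BUILD  // hypothetical macro, defined only for synthesis runs
        std::printf("tap %d: acc = %f\n", i, acc);  // debug I/O (gotcha 6)
#endif
    }
    return acc;
}
```

With the stand-in type, coef_t(0.3) quantizes to 76/256 = 0.296875, so the
effect of the 8 fractional bits shows up in a plain C++ build that can be
run under Valgrind before anything goes through synthesis.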

Overall we've had a good experience with AutoPilot.  Our evaluation was
strong enough that we purchased AutoPilot to expand on our C level design
methodology.  We are in the process of building several different types of
designs to push the HLS methodology, and ensure an end-to-end methodology
is possible, including making certain that all the backend results are
unaffected.

    - [ The Mouse That Roared ]