( ESNUG 375 Item 12 ) ------------------------------------------- [06/28/01]

Subject: ( ESNUG 374 #3 ) CynLib C Benchmarks 2X To 5X Faster Than SystemC

> Here are the results of our experiments with this benchmark including a
> simulation compiled without compiler optimization:
>
>   ESNUG 373 SystemC code [1] : ############## 142.4 sec
>   ESNUG 373 SystemC code [2] : ###################### 224.9 sec
>   ARM SystemC code [1]       : ### 36.6 sec
>   Verilog equivalent [3]     : ######################## 238.0 sec
>
>   [1] SystemC 1.2; "gcc -g -O3 -march=i686"; 550MHz Pentium III
>   [2] SystemC 1.2; "gcc -g -O0"; 550MHz Pentium III
>   [3] VerilogXL 3.2; "verilog +turbo"; 550MHz Pentium III
>
> This suggests that the ESNUG 373 #2 results were generated using an
> unoptimized compilation.
>
>     - Jon Connell
>       ARM


From: Bernard Deadman <bdeadman@sdvinc.com>

John,

Comparing SystemC with any event-driven Verilog simulator is unfair.
SystemC is "cycle accurate", so why not compare it with a cycle-based
Verilog simulator, which gives you the same level of precision?  I've
never tried it, but my guess is the performance of SystemC would look
pretty sick in that comparison.

Overall, SystemC *is* slow - the way to get real performance is to jettison
a lot of the simulation kernel and get closer to raw C++.  My view is
SystemC and all of the other class libraries are too heavy.  If you want
great simulation performance it's time for a 'lite' version, with a minimal
set of classes, that lets the user add in the extra stuff he needs rather
than burdening the SystemC core with a *ton* of stuff most people don't
use.  A competent user can add in just the extra bits he actually needs.

    - Bernard Deadman
      SDV, Inc.                                  Austin, TX

         ----    ----    ----    ----    ----    ----   ----

From: John Sanguinetti <jws@forteds.com>

Hi, John,

The SystemC benchmark from [ Emperor ] and Jon Connell's optimized version
beg further analysis, and a comparison with the same code in C++/Cynlib.

The original SystemC code from [ Emperor ] has a glaring inefficiency in it:

        ...
        if (load) {
          sc_bool_vector tmp(9);
          tmp = (sc_bv<9>) count_nxt;
          save_count_out = count_nxt;
          parity_out.write(tmp.xor_reduce());
          ...
        }

The temporary variable tmp is created and destroyed every time through this
block (nearly every clock cycle).  If you make that a class member variable
instead of an automatic, then this code runs about 3.7x faster using
Connell's test bench.
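
The effect described above can be sketched in plain C++ without SystemC.
This is only an illustration under stated assumptions: BitVec is a
hypothetical stand-in (a std::vector<bool>) for a heap-backed bit vector,
and ParityAutomatic / ParityMember are made-up names, not anything from
the benchmark.  Variant A constructs the temporary on every call, variant B
constructs it once as a member and reuses it.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a heap-backed bit vector. It is NOT the
// SystemC class; it only mimics "constructing one allocates memory".
using BitVec = std::vector<bool>;

// XOR of all bits in v.
inline bool xor_reduce(const BitVec& v) {
    bool p = false;
    for (bool b : v) p ^= b;
    return p;
}

// Copy the low v.size() bits of value into an already-sized vector.
inline void load_bits(BitVec& v, unsigned value) {
    assert(v.size() <= 32);
    for (std::size_t i = 0; i < v.size(); ++i) v[i] = ((value >> i) & 1u) != 0;
}

// Variant A: the temporary is an automatic, built and torn down per call.
struct ParityAutomatic {
    bool step(unsigned count_nxt) const {
        BitVec tmp(9);              // allocated and freed on every call
        load_bits(tmp, count_nxt);
        return xor_reduce(tmp);
    }
};

// Variant B: the temporary is a member, built once and reused.
struct ParityMember {
    BitVec tmp = BitVec(9);         // allocated once, at construction
    bool step(unsigned count_nxt) {
        load_bits(tmp, count_nxt);
        return xor_reduce(tmp);
    }
};
```

Both variants return the same parity for the same input; only the
allocation pattern differs.  How much that matters in a real simulation
depends on the library and compiler, so treat the 3.7x figure as specific
to the measurement reported here.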

If we look at what Connell did, you can see he did three optimizations:

  1) As he said, he switched the sc_bv<9> in the original to sc_uint<9>.
     The switch to sc_uint makes about an 18% improvement over the original.

  2) He replaced the call to xor_reduce with his own code to do it.
     This:

          sc_bool_vector tmp(9);
          tmp = (sc_bv<9>) count_nxt;
          parity_out.write(tmp.xor_reduce());

     Was replaced by this:

         for (int ii = 0; ii < 10; ii++) tmp ^= count_nxt[ii];
         parity_out.write(tmp);

     This is the big winner, because by making tmp a member variable of 
     type bool, instead of an automatic of type sc_bool_vector, he 
     eliminated the creation and destruction of an sc_bool_vector on every
     clock cycle.  This makes the code run 4.3x faster than the result of 
     optimization (1).

  3) He replaced

        if (up == 0 && down == 0)
          count_nxt = data_in;
        else if (up == 0 && down == 1)
          count_nxt = cnt_dn;
        else if (up == 1 && down == 0)
          count_nxt = cnt_up;
        else if (up == 1 && down == 1)
          load = 0;

     with

       int mode = (up << 1) | down;
       switch (mode) {
         case 0: count_nxt = data_in.read(); break;
         case 1: count_nxt = cnt_dn; break;
         case 2: count_nxt = cnt_up; break;
         default: load = 0; break;
       }

     This new code looks more like the original Verilog, but it doesn't 
     have much of an effect on the performance.  It only improves the speed
     by about 4% over optimizations (1) & (2).
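
As a sketch of the hand-written reduction in optimization (2), here is a
parity (XOR-reduce) over a plain integer in standard C++.  The names
parity_loop and parity_fold are illustrative only and come from neither
SystemC nor Connell's code.  The quoted fragment does not show where tmp
is cleared; in this sketch the accumulator is a local that starts at 0 on
every call, and the bit width is an explicit parameter.

```cpp
#include <cassert>
#include <cstdint>

// Parity (XOR of all bits) of the low `width` bits, one bit at a time.
inline bool parity_loop(std::uint32_t value, int width) {
    assert(width >= 0 && width <= 32);
    bool p = false;                        // starts at 0 on every call
    for (int i = 0; i < width; ++i) p ^= ((value >> i) & 1u) != 0;
    return p;
}

// Same result without a per-bit loop: fold the word onto itself.
inline bool parity_fold(std::uint32_t value, int width) {
    assert(width >= 0 && width <= 32);
    if (width < 32) value &= (std::uint32_t{1} << width) - 1u;  // keep low bits
    value ^= value >> 16;
    value ^= value >> 8;
    value ^= value >> 4;
    value ^= value >> 2;
    value ^= value >> 1;
    return (value & 1u) != 0;
}
```

Neither function creates a temporary object, which is the property the
text credits for the speedup.  The folding variant is just an alternative
formulation and is not something the original posts measured.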

So the results of running the original and the above three optimizations on
my 400 MHz Sun e/450 are:

                 original    opt 1     opt 1+2    opt 1+2+3

      cpu time   522 sec.    442 sec.  103 sec.   99 sec.


Now to the interesting part.  Why was that temporary variable there in the
first place?  It was there because you can't apply the xor_reduce() function
to an sc_uint<>, but only to an sc_bool_vector (or sc_bv<>), so the
assignment was a convenient way to cast count_nxt to the needed type.  Note
that Jon Connell solved the problem by just not using the built-in
xor_reduce function.  The real culprit here is that SystemC has too many
data types, and not all functions are available for all of them.  If you
write your code in a reasonably natural way, like [ Emperor ] did, you may
very well get unreasonable results.


Now compare the above with the same code written in C++ using Cynlib and the
(free) Cyn++ preprocessor:

  Module up_down (In<1> clk, In<1> up, In<1> down,
                  In<9> data_in, Out<1> parity_out, Out<1> carry_out,
                  Out<1> borrow_out, Out<9> count_out)
    
     Uint<9> count_nxt;
     Uint<10> cnt_up, cnt_dn;
     Uint<1> load;
   
     Always (Posedge(clk))
         cnt_dn = count_out - 5;
         cnt_up = count_out + 3;
         load = 1;
         switch( (up,down) ) {
             case 0: count_nxt = data_in; break;
             case 1: count_nxt = cnt_dn; break;
             case 2: count_nxt = cnt_up; break;
             case 3: load = 0; break;
         }
         if( load ) {
             parity_out <<= CynRedXor(count_nxt);
             carry_out  <<= up&cnt_up(9);
             borrow_out <<= down&cnt_dn(9);
             count_out  <<= count_nxt;
         }
     EndAlways
  EndModule
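
For readers without Cynlib installed, the behavior of the module above can
be approximated in plain C++.  This is a reference sketch, not Cynlib
code, and it rests on assumptions about the listing: that `<<=` is a
registered (clocked) assignment, that cnt_up(9) selects bit 9, and that
(up,down) concatenates the two bits with up as the high bit.  UpDownModel
and posedge are hypothetical names.

```cpp
#include <cassert>
#include <cstdint>

// Plain C++ model of the up/down counter; one call = one rising clock edge.
struct UpDownModel {
    std::uint16_t count_out = 0;   // 9-bit registered output
    bool parity_out = false;
    bool carry_out = false;
    bool borrow_out = false;
    std::uint16_t count_nxt = 0;   // 9 bits, keeps its value between clocks

    void posedge(bool up, bool down, std::uint16_t data_in) {
        assert(data_in < 512);     // data_in is a 9-bit input
        // 10-bit intermediate results, as in Uint<10> cnt_up, cnt_dn.
        const std::uint16_t cnt_dn =
            static_cast<std::uint16_t>((count_out - 5u) & 0x3FFu);
        const std::uint16_t cnt_up =
            static_cast<std::uint16_t>((count_out + 3u) & 0x3FFu);
        bool load = true;
        const int mode = (up ? 2 : 0) | (down ? 1 : 0);
        switch (mode) {
            case 0: count_nxt = static_cast<std::uint16_t>(data_in & 0x1FFu); break;
            case 1: count_nxt = static_cast<std::uint16_t>(cnt_dn & 0x1FFu); break;
            case 2: count_nxt = static_cast<std::uint16_t>(cnt_up & 0x1FFu); break;
            default: load = false; break;   // up and down both set: hold
        }
        if (load) {
            bool p = false;                 // XOR-reduce the 9 bits
            for (int i = 0; i < 9; ++i) p ^= ((count_nxt >> i) & 1u) != 0;
            parity_out = p;
            carry_out = up && (((cnt_up >> 9) & 1u) != 0);
            borrow_out = down && (((cnt_dn >> 9) & 1u) != 0);
            count_out = count_nxt;
        }
    }
};
```

Because count_out is read before it is written inside posedge, a simple
sequential update gives the same values a registered assignment would in
this single-module case.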

This is at least as easy to read as the Verilog (or Superlog) and just as
concise.  Running Jon Connell's testbench with this code took 32 seconds,
about 3x faster than the 99 seconds of the fully optimized SystemC above.
I ran some other testbenches to see how SystemC and Cynlib scale (since
that was one of Emperor's complaints), and I got the following results:

                      SystemC               SystemC
     Instances    (Emperor's code)   (Connell's patched code)     Cynlib

     1 ( 1M cycles)     42 sec                21 sec                4 sec
    10 ( 1M cycles)    231                    56                   16
   100 ( 1M cycles)   2164                   391                  173
  1000 (10K cycles)    218                    43                   17

Note that the cost per instance stays roughly flat in all three once you
get past about 10 instances, so scaling with size is not really a problem.
You can see that Cynlib runs between 2 and 5 times faster than the
optimized SystemC does on this model.
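
One way to check the scaling claim is to normalize each row of the table
to CPU time per simulated instance-cycle.  The sketch below does that
arithmetic in C++ using only the numbers from the table; Row, kRows and
the two helper functions are illustrative names.

```cpp
#include <array>
#include <cassert>

// One row of the scaling table above.
struct Row {
    double instances;
    double cycles;
    double emperor_s;   // SystemC, Emperor's code (CPU seconds)
    double connell_s;   // SystemC, Connell's patched code (CPU seconds)
    double cynlib_s;    // Cynlib (CPU seconds)
};

const std::array<Row, 4> kRows = {{
    {1.0,    1e6,   42.0,  21.0,   4.0},
    {10.0,   1e6,  231.0,  56.0,  16.0},
    {100.0,  1e6, 2164.0, 391.0, 173.0},
    {1000.0, 1e4,  218.0,  43.0,  17.0},
}};

// Microseconds of CPU time per simulated instance-cycle.
inline double us_per_instance_cycle(double seconds, const Row& r) {
    assert(r.instances > 0.0 && r.cycles > 0.0);
    return seconds * 1e6 / (r.instances * r.cycles);
}

// How many times faster Cynlib is than the patched SystemC for a row.
inline double cynlib_speedup(const Row& r) {
    assert(r.cynlib_s > 0.0);
    return r.connell_s / r.cynlib_s;
}
```

With these numbers, Cynlib works out to 4 us per instance-cycle for a
single instance and about 1.6 to 1.7 us for 10, 100 and 1000 instances,
while the speedup over the patched SystemC ranges from roughly 2.3x to
5.3x across the four rows.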

I should also note that Emperor's comments about SystemC taking a long time
to compile are accurate.  The 100 instance version took 2,460 seconds (41
minutes!) to compile at -O3.  Cynlib took 24 seconds.

So the Emperor's C conclusions don't hold up, at least for Cynlib C:

  Speed - Cynlib is 2x-5x faster than SystemC, and from these 
          measurements appears faster than Superlog.

  Reliability - Cyn++ is at least as easy to read as Verilog, and has 
                the advantage of having C expression semantics.

  Quality - I think Emperor meant debuggability here, and he's right, 
            SystemC is truly difficult to debug.  Cynlib has two tools
            to help here, Cyngdb being free for run-time debugging, and
            Cyntax, a Forte product, for code analysis.

  Software - Cynlib is just C++, so as he said, running a hardware 
             model with the code to drive it is simple.

  IP - We don't know what IP providers will write their models in, but
       by far the largest amount of IP available is not hardware models
       in Verilog but algorithms in C.  It's a lot easier to incorporate
       a C algorithm in Cynlib (or SystemC) than in Verilog.

  Freeware - It's hard to justify the price of a commercial HDL
             simulator when you can do the same job, with the same
             effectiveness, for free.  Of course you've got to buy design
             tools, but by using Cynlib and C++ you don't have to pay for
             simulation, allowing you to spend your tool budget on better
             productivity tools.

Finally, Wilson Snyder's comments about using SystemC to code at an RTL
level are on the mark.  It just is not very easy, and there really isn't
much point.  Cynlib is a lot easier, due in part to Cyn++ but also due to
the more rational data types and port semantics.  However, where both
SystemC and Cynlib are in their element is when elaborating a hardware
implementation from a C algorithm.  If you come from the top down, using
C++ for hardware description is quite reasonable, particularly if you use
Cynlib.

    - John Sanguinetti, CTO
      Forte Design Systems

         ----    ----    ----    ----    ----    ----   ----

> I am sure over time the SystemC kernel will vastly improve.  They haven't
> optimized the pin interconnect.  They aren't optimizing between modules.
> They aren't inlining modules.
>
>     - Wilson Snyder


From: "Janick Bergeron" <janick@qualis.com>

Hi, John,

Although that would be a great benefit, I'm wondering what would be the
motivation for *anyone* to work on improving the SystemC kernel?  While EDA
companies can hope to recoup their investment in improving the performance
of their Verilog/VHDL simulators, there is no such financial benefit in
improving SystemC's performance.  Quite the contrary: you'd be cannibalizing
your own VCS simulator sales.

It's not going to come from corporate users either.  They are paid to
engineer products that will generate revenue for their employers.  Any
improvement to a free/opensource tool is going to remain proprietary to a
company as a competitive advantage.

Such is the burden of free/opensource tools.

The only way I currently see SystemC's performance being improved is through
a model similar to Linux: self-motivated hackers, working on their own time,
who would get a kick out of having their modifications included in the
"official" release.  SystemC doesn't have that going for it, but I do expect
to see a lot of "Improving SystemC" papers from students in the near future.

    - Janick Bergeron
      Qualis Design

