( ESNUG 587 Item 5 ) ---------------------------------------------- [07/22/19]
EDITOR'S NOTE: I love love love deep technical "how to" and especially
"how NOT to" user posts like this! Now that's good DeepChip! - John
Subject: 12 good and 4 bad switches in new Genus/Innovus/Tempus 19.1 flow
Then we compared power, performance, area (PPA), and runtime across
5 flows using the TSMC CLN7FF library:
Tools Flow Name Comment
---------------------------------- --------------- ------------------
DC-> test ins -> ICC2 -> PT SNPS-All our old SNPS flow
SNPS Fusion Compiler -> PT SNPS-New new SNPS flow
DC-> test ins -> Innovus -> PT Innovus-PT old Innovus flow
DC-> test ins -> Innovus -> Tempus Innovus-Tempus old Innovus+Tempus
Genus-> Modus -> Innovus -> Tempus CDNS-All CDNS only flow
Our goal was to have "Mongo" with 3.0 M inst, a number of ARM cores and
hard macros, and some very tight power requirements reach 3.2 GHz (or
better) in TSMC CLN7FF.
    Flows             Best Freq     TNS left     Total      TAT
                      Achieved      on table     Power
    --------------    ---------     --------     --------   ---------
    SNPS-All          2.87 GHz       97 nsec     1,838 mW   14.7 days
    SNPS-New          2.67 GHz      165 nsec     1,923 mW   12.4 days
    Innovus-PT        3.06 GHz       44 nsec     1,720 mW   11.7 days
    Innovus-Tempus    3.12 GHz       24 nsec     1,667 mW    9.8 days
    CDNS-All          3.22 GHz        0 nsec     1,586 mW    8.2 days
What we found is the "CDNS-All" flow consistently gave us better PPA in
the shortest runtime of any of the flows.
- from Benchmark of DC-ICC2 vs Fusion Compiler vs Genus-Innovus
From: [Ralph, an adult Mutant Ninja Turtle]
Hi, John,
We saw similar trends when we went from DC/ICC2 over to Genus/Innovus 18.1.
Right now we've just switched over to Genus/Innovus/Tempus 19.1, and here's
how we made it work. First, here are the four new parts of the CDNS 19.1
flow in a nutshell.
- new physical restructuring with iSpatial
- new Mux and Datapath Restructuring
- new Machine Learning based optimization
- new Tempus ECO that skips post-route optimization
Cadence has this idea of common engines. They used to have common placement
and common routing engines across all the tools. With CDNS 19.1, we get
"iSpatial", which is what Cadence marketing calls putting GigaOpt everywhere
in their flow so there's one common optimization engine everywhere, too.
(GigaOpt is the optimization engine originally in Innovus. Now it's in
Genus and Tempus, too.)
With iSpatial we see a ~1.8x runtime speed-up for our full 19.1 flow vs. the
old 18.1 full flow. The new iSpatial predicted our area and our power exactly
(because it's GigaOpt moved upstream into Genus), and its timing was better,
too. This better timing was used by our RTL team to tune the RTL. So when
this better 19.1 data is taken to Innovus, Innovus doesn't have to do
placement optimization again. Innovus 19.1 goes purely incremental after
that, which is what gets us that 1.8x runtime speed-up.
To enable iSpatial we needed the following switches:
set_db limited_access_feature {ispatial 214480224}
set_db opt_spatial_effort extreme
syn_opt -spatial
This is what we're using now because it's what the Cadence folks told us to
do. Are other users using different switches to turn on iSpatial?
---- ---- ---- ---- ---- ---- ----
On top of a 1.8x speed up, we also saw better QOR overall with iSpatial from
 - Early Clocking: since we now get GigaOpt within Genus, it now uses
   the early clocking flow in synthesis with useful skew. The clock
   gates and skewed macros now start to show up early in synthesis.
 - Physical Restructuring: the Genus mapper now gets to use GigaOpt to
   make early (and clever) physical optimizations during RTL synthesis.
   Let's say our design has a cascade of adders 6 deep. The old 18.1
   Genus can't see the whole cone of logic it's working on. But with
   19.1, it can see the whole cone and restructure it optimally.
With this we are getting gains in QOR and, more importantly, in congestion.
                      18.1 Genus/Innovus    19.1 iSpatial flow
                      ------------------    ------------------
    Timing WNS/TNS    -88 psec/-265 nsec    -32 psec/-93 nsec
    Power             1,245 mW              1,173 mW
    TTR/TAT           8.12 days             4.78 days
Here are the new switches to enable this restructuring:
set_db limited_access_feature {ispatial_restructuring 439424160}
set_db opt_spatial_restructuring true
---- ---- ---- ---- ---- ---- ----
IMPORTANT: DO NOT RERUN INNOVUS
This new 19.1 restructuring flow above puts Innovus in incremental mode.
But you might lose the TAT benefit if you re-start Innovus again. So the
user has to watch out for that. You have to make sure you set your Innovus
place optimization to the much faster incremental version:
setPlaceMode -place_global_exp_skip_gp true
place_opt_design
Cadence says as long as the Genus dB is handed over to the Innovus backend
dB, we will get this incremental behavior in Innovus. (This works because
the GigaOpt stage is already done and the dB's automatically sense that.)
But it must be a dB handoff! If you DON'T do a dB handoff, then you MUST
use the two switches above.
---- ---- ---- ---- ---- ---- ----
GENUS 19.1 NOW HAS MUX AND DATAPATH RESTRUCTURING IN RTL SYNTHESIS
We are using the 19.1 based Genus release. Our previous 18.1 methodology
used a low-level RTL coding style. This is pretty common and most synthesis
tools handle it well enough. However, we are now doing chips with a lot of
machine learning logic content in them that requires the use of high-level
RTL logic coded in SystemVerilog. For example:
    max = 0;
    for (int i = 0; i < 4; i++) begin
      if (max < array[i]) begin
        max = array[i];
      end
    end
A normal unrolling of the above loop expands to a serial chain of
compare-and-select stages (3 levels for 4 iterations). But the better
solution is to expand it into a balanced comparison tree (2 levels for
4 iterations). For 4 iterations, this saves us 1 level of logic. Now
imagine this with a loop that's 16 iterations deep. It would save 11
logic levels (15 serial levels vs. 4 tree levels). This trade-off
between levels of logic, area, power, and timing is crucial for SV chips.
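The level math above can be sketched in a few lines of Python. (This is our
own back-of-envelope check of the serial-vs-tree depth counts, not anything
from Genus; the function names are made up.)

```python
import math

def serial_levels(n):
    # Unrolled loop: each of the n-1 remaining iterations adds one
    # compare-and-select stage in series.
    return n - 1

def tree_levels(n):
    # Balanced tree: pairwise maxes, so depth is log2(n) rounded up.
    return math.ceil(math.log2(n))

for n in (4, 16):
    saved = serial_levels(n) - tree_levels(n)
    print(f"{n}-way max: serial={serial_levels(n)} levels, "
          f"tree={tree_levels(n)} levels, saved={saved}")
```

For n=4 this gives the 1-level saving, and for n=16 the 11-level saving
quoted above.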
We turned this on in Genus 19.1 with the DP Turbo switch:
set_attr dp_opt_turbo true
We checked that the decisions 19.1 took were correct, and it didn't need
any guidance to get the datapath right.
Timing is important at 5nm, but our key QOR focus is on power. Getting
the datapath and MUX structure just right is important. This DP Turbo
switch is new. (I think Cadence calls this Compus. I don't think they've
announced it yet.) We like the fact that just one switch enables this,
without too many options.
Cadence gave us a way to measure path depths:
    Path Depth Range    # of Paths    # of Paths w/ 19.1
                        w/ 18.1       plus high level opto
    ----------------    ----------    --------------------
     0-4  levels          22,349           22,785
     5-9  levels          34,987           35,955
    10-14 levels          46,876           46,903
    15-19 levels          43,875           44,329
    20-24 levels          74,892           75,309
    25-29 levels          54,985           60,783
    30-34 levels          18,904           24,509
    35-39 levels           7,854            1,023
    40-44 levels           5,689               23
    45-50 levels           1,208                0
Notice with 19.1 the path depth shrinks above 35 levels.
Above 35 levels with 18.1 there are 7,854 + 5,689 + 1,208 = 14,751 paths.
Above 35 levels with 19.1 there are 1,023 + 23 + 0 = 1,046 paths.
What 19.1 is doing is MUX and datapath restructuring to shave roughly
15 levels off the deepest paths.
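To double-check the bin arithmetic from the table, here's a quick Python
tally. (This is just our own sanity check on the reported numbers, nothing
from Cadence.)

```python
# Path-depth histogram from the table above: (range, 18.1 count, 19.1 count)
histogram = [
    ("0-4",   22349, 22785), ("5-9",   34987, 35955),
    ("10-14", 46876, 46903), ("15-19", 43875, 44329),
    ("20-24", 74892, 75309), ("25-29", 54985, 60783),
    ("30-34", 18904, 24509), ("35-39",  7854,  1023),
    ("40-44",  5689,    23), ("45-50",  1208,     0),
]

def paths_above(min_level, col):
    # Sum the path counts for every bin starting at min_level or deeper;
    # col 1 = 18.1 counts, col 2 = 19.1 counts.
    return sum(row[col] for row in histogram
               if int(row[0].split("-")[0]) >= min_level)

print("deep paths w/ 18.1:", paths_above(35, 1))  # 14751
print("deep paths w/ 19.1:", paths_above(35, 2))  # 1046
```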
We were using Genus for our high end CPU QOR. With 19.1 we now have one
tool for both CPU style designs and other designs. (As compared to SNPS
which uses Fusion Compiler for CPU and DC-NXT for all the non-CPU design;
and FC and DC-NXT have different UI's with different unique commands;
whereas CDNS 19.1 has the same commands throughout.)
---- ---- ---- ---- ---- ---- ----
CDNS 19.1 NOW HAS MACHINE LEARNING
We got new code from Cadence where they brag about having Machine Learning.
We found that it fixed their pre-route to post-route correlation problem.
It used to be, when we went from 16nm to 7nm, 18.1's pre-route estimates
came in 30% lower than post-route. With ICC2, the pre-route estimates were
also 30% lower.
                 19.1 without ML     19.1 with ML
                 ----------------    -----------------
    WNS/TNS      -9 psec/-50 psec    -10 psec/-48 psec
    Power QOR    945 mW              911 mW
    TTR/TAT      2.3 days            2.2 days
With CDNS 19.1 trained, we got pre-route to post-route aligned to within
~5%; and that was close enough that it gave us another 3.5% power savings.
What's also good is all our ML training design data resides on our own
network, and we feel secure with that.
We enabled it using the following:
setMachineLearningMode -training net_cell_delay
python <training_package_path>/run.py -mode train \
-type net \
-datadir <data_dir> \
-outdir <model_dir> \
-log train.log
setMachineLearningMode -deployment
As you can see the numbers weren't all that different with or without ML;
but fixing that pre-route to post-route 30% lowball correlation problem is
why we like the new ML in 19.1.
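We don't know what's inside Cadence's model, but the flavor of that
lowball-correlation fix can be sketched as a toy correction-factor fit.
Everything below (the delay numbers, the single-factor model) is our own
made-up illustration, not Cadence's ML:

```python
# Toy sketch: learn one correction factor so pre-route delay estimates
# line up with post-route measurements. Purely illustrative; the real
# ML model is Cadence's and surely far richer than a single scale factor.

# Hypothetical net delays in psec; pre-route lowballs post-route by 30%.
pre_route  = [70.0, 140.0, 35.0, 210.0, 105.0]
post_route = [100.0, 200.0, 50.0, 300.0, 150.0]

# Least-squares fit of a single scale factor k minimizing
# sum((k * pre - post)^2), which gives k = sum(pre*post) / sum(pre^2).
k = (sum(p * q for p, q in zip(pre_route, post_route))
     / sum(p * p for p in pre_route))

corrected = [k * p for p in pre_route]
worst_err = max(abs(c - q) / q for c, q in zip(corrected, post_route))
print(f"learned factor k = {k:.3f}, worst error = {worst_err:.1%}")
```

With a consistent 30% lowball, the fit recovers k = 1/0.7 and the corrected
estimates land right on the post-route numbers.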
---- ---- ---- ---- ---- ---- ----
NEW 19.1 TEMPUS ECO SKIPS POST-ROUTE OPTIMIZATION
Cadence 19.1 is proposing PBA based optimization. (They're saying skip the
post-route optimization step altogether.) Since Tempus is integrated into
Innovus, with Tempus ECO we now get PBA based optimization straight inside
Innovus. Looks like they strengthened their timing driven routing.
With Machine Learning we get such good pre-route to post-route correlation,
we can directly move on to Tempus ECO power optimization as the final step
in our flow. All PBA based. The switch for it is:
signoffOptDesign -setup -hold -leakage -dynamic -drv -area
The above command got us PBA based optimization and replaced the old one:
optDesign -postRoute <-- BAD OLD COMMAND! DO NOT USE!!!
Important: it is a must that you specify the process node setting.
setDesignMode -process 5
It's the one setting to make sure 5nm is enabled properly in 19.1.
---- ---- ---- ---- ---- ---- ----
BROKEN/DANGEROUS STUFF IN 19.1
We tried some other stuff that didn't get us much gain, and we would like
to see if other users are able to make it work.
1.) We tried hold aware scan chain reorder after inserting clocks.
setScanReorderMode -holdAware true
scanReorder
     This didn't work well in reducing hold violations. Cadence needs
     to fix this. It's just not good right now. On one design it messed
     up our scan chains; it improved hold, but the design was unroutable.
     On another design it didn't change hold much.
2.) We have some glitch power left in the design to be optimized. We
     tried the Joules engine for glitch power optimization. It does a
     good job calculating and reporting it, but we need it optimized.
     It literally cannot optimize it! Cadence says an optimizing Joules
     is at least 3 to 6 months away.
3.) We had some functional ECO that might touch the clock network and
we ended up fixing transition violations on those clock nets
ourselves. Cadence told us about this command:
ccopt_pro
     That command was not good. It touched and rebuilt our whole
     clock tree, whereas we just wanted part of the clock network
     optimized. STAY AWAY FROM THIS!
4.) We also tried to do a restricted metal-only ECO limited to 4 layers
     using the following:
     ecoRoute -modifyOnlyLayers 5-8
     It said it changed only those 4 layers, but it also touched the vias
     between layer 4 and layer 5, which cost us an extra layer change.
---- ---- ---- ---- ---- ---- ----
WE HAVE NOT TRIED VIRTUS YET
We are a Tempus/Voltus signoff flow. We are worried about true IR-aware
signoff at 5nm and would like to try out Virtus in this rev of our chip.
We would like to find out from vectorless analysis which are our really
sensitive paths. (Apache isn't truly vectorless yet.) Fingers crossed.
---- ---- ---- ---- ---- ---- ----
CONCLUSION
Our best high level 19.1 recipe (right now) seems to involve
- new physical restructuring with iSpatial
- new Mux and Datapath Restructuring
- new Machine Learning based optimization
- new Tempus ECO that skips post-route optimization
Would like Cadence to fix their hold aware scan chain reorder problem and
to get glitch power optimization working in Joules.
- [Ralph, an adult Mutant Ninja Turtle]
---- ---- ---- ---- ---- ---- ----
Related Articles
User benchmarks DC-ICC2 vs Fusion Compiler vs Genus-Innovus flows
Genus RTL synthesis gaining traction vs. DC is #4 of Best of 2017
After 16nm benchmark, 7nm user swaps out DC-Graphical for Genus-RTL
ICC2 patch rev, Innovus penetration, and the 10nm layout problem
Aart's SUE RIVALS policy backfires horribly on core SNPS patents