The 14 metrics - plus their gotchas - used to select an emulator

( ESNUG 522 Item 3 ) -------------------------------------------- [04/18/13]

From: [ Jim Hogan of Vista Ventures LLC ]
Subject: The 14 metrics - plus their gotchas - used to select an emulator

Hi, John,

Here are the 14 major metrics that I feel a design team must consider when
deciding on what specific emulator to use on their project (if any):

     1.  Price/Gate
     2.  Initialization and Dedicated Support
     3.  Capacity
     4.  Primary Target Designs
     5.  Speed Range
     6.  Partitioning
     7.  Compile Time
     8.  Visibility
     9.  Debug
    10.  Virtual Platform API
    11.  Transactor Availability
    12.  Verification Language and Native Support
    13.  Number of Users
    14.  Memory Capacity

Below I explain in detail the impact and pitfalls each metric will have on
your team's emulation decision.

    - Jim Hogan
      Vista Ventures, LLC                        Los Gatos, CA

         ----    ----    ----    ----    ----    ----   ----

 1. Price/Gate 

    The actual cost of an emulator typically runs from 1-5 cents-per-gate
    for higher capacity emulators -- both processor-based and FPGA-based.
    There is usually some recurring cost for software and maintenance.

    The lower capacity FPGA-based emulators are typically priced as
    separate HW and SW components.  The HW consists of off-the-shelf FPGA
    prototyping boards that typically cost 0.25 to 1 cent-per-gate.  The
    SW that provides emulation features on top of the prototyping boards
    is typically priced like software simulators.
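
    A back-of-the-envelope sketch of the hardware arithmetic above, in
    Python.  The 20 M gate design size and the function name are
    assumptions for illustration only; recurring software and maintenance
    costs are not modeled.

```python
def hardware_cost_usd(gates, cents_per_gate):
    """Hardware price in dollars for a given gate count and cents-per-gate."""
    return gates * cents_per_gate / 100.0

GATES = 20_000_000  # hypothetical 20 M gate design

# Higher capacity emulators: 1 to 5 cents-per-gate (range from the text).
emu_low = hardware_cost_usd(GATES, 1.0)
emu_high = hardware_cost_usd(GATES, 5.0)

# Off-the-shelf FPGA prototyping boards: 0.25 to 1 cent-per-gate.
board_low = hardware_cost_usd(GATES, 0.25)
board_high = hardware_cost_usd(GATES, 1.0)

print(f"higher capacity emulator: ${emu_low:,.0f} to ${emu_high:,.0f}")
print(f"FPGA prototyping boards:  ${board_low:,.0f} to ${board_high:,.0f}")
```

    The board figures cover hardware only; the emulation SW on top is
    priced separately, so the two ranges are not directly comparable.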


 2. Initialization and Dedicated Support 

    When assessing an emulator's total cost of ownership, the cost of
    dedicated human support is a key factor, especially as it is an
    ongoing expense.  The large processor-based emulators virtually
    always require at least one support person, if not a team, dedicated
    to emulator support.  It's not due to any inherent issues with
    emulators; it's just the sheer scope and magnitude of the
    verification projects being tackled.

    The large FPGA-based emulators' need for dedicated support is similar,
    also depending on the size and complexity of the emulated designs
    and the number of end users involved.  

    Further, the time to initialize an emulator and set up models can take
    6 months or more.  Transactors may need to be developed to connect
    test benches to the design-under-test (DUT).  They are also needed to
    connect host-based virtual platforms to emulators and FPGA prototyping
    systems.  This is part of a growing trend to improve the accuracy and
    performance of virtual platforms for performance modeling and pre-
    silicon software development.

    The most complex transactors - such as PCIe transactors - can take
    more than a year to develop.  They can be the source of functionality
    and performance bugs that can delay the emulation project even longer.
    And test bench re-partitioning between the host and emulator may also
    be required to achieve acceptable performance.  


 3. Capacity

    There are subtle issues around an emulator's capacity; a given capacity
    can be delivered in multiple ways.  First, there is the straightforward
    capacity measurement in terms of total number of gates: that figure for
    emulators currently ranges from 2 million to 2 billion gates.  

    Second, there is the granularity in terms of the number of devices
    (ASICs or FPGAs), boards, or boxes that are used to reach a given
    capacity.

    Processor-based emulators are architected to provide capacity in a
    seamless way, where it looks monolithic to users.  With FPGA-based
    emulators, if it is a vertically integrated box, it should also appear
    monolithic to a user.  Synopsys-EVE reaches its 1 B gate capacity
    in a monolithic way by connecting multiple emulator boxes.

    A vendor may hide the emulator granularity from the customer, but higher
    granularity generates more communication overhead, which generally
    degrades performance.  This is true whether the emulator is processor-
    based or FPGA-based.  For instance, if you were emulating 10 M gates
    and it fit on a single FPGA, it could run approximately 5X faster than
    if the same 10 M gate capacity were to be divided over multiple FPGAs.

    Low-to-mid FPGA-based emulators typically expose more granularity to
    the end user.  However, this is offset by the fact that these boxes
    tend to take advantage of the increased capacity of newer FPGA devices
    more quickly than the mid-to-high-capacity FPGA-based emulators.  

    FPGAs are among the first chips manufactured on the leading edge of
    Moore's Law.  New FPGA product cycles run 12 to 18 months.  In
    contrast, custom chip cycles are at least 4 years, and the cost of
    development, which has generally been increasing for custom ASICs,
    is borne by the custom-processor-based emulation vendor.


 4. Primary Target Designs

    SoCs are the dominant workhorse for systems companies today.  The
    target design sweet spot for each emulator type is typically defined
    by the capacity of the emulator as it relates to your design size.  

    As mentioned earlier, complexity is one of the factors driving more
    mainstream designs into requiring emulation.  Control complexity can
    require many verification cycles.  Emulation applications range from
    CPUs, GPUs, application processors such as video, audio, security
    and datapath, to IP blocks and subsystems.


 5. Speed Range

    Emulator speed is measured in cycles-per-second.  Processor-based
    speeds range from 100 K to 4 M cycles/sec, while FPGA-based range
    from 500 K to 50 M cycles/sec, depending on the number of devices.


 6. Partitioning

    Partitioning is harder than it sounds.  Partitioning would be easy if
    your design could be broken up into latency insensitive blocks that
    would eliminate any strict latency requirements between partitions.
    But that's hardly ever the case.

    - Processor-based emulators have many processing units, so large
      designs must be partitioned across these units.  Processor-based
      systems basically run software using many cores, and the
      partitioning is more or less transparent to the user.  

    - FPGA-based systems are hardware-centric, based on multiple FPGAs.
      Due to the FPGA boundaries you must first split the design into
      reasonably sized pieces, so that each sub-system will fit on one
      FPGA device.  Unfortunately, you then usually end up needing more
      logical connections between partitioned sections of the design
      than there are physical wires between FPGAs to make those
      connections!  These logic connections can exceed the physical
      connections by 2X to 100X.  

      So when you stitch your partitioned design elements together to
      connect them, rather than just doing a simple logical connection,
      you must multiplex signal pins over the FPGA connections.  This adds
      substantially to complexity, particularly if this process is not
      completely automated.  The entire process takes additional time and
      effort that opens the possibility for errors to be injected in
      partitioning or reconnecting.

      A deeply partitioned design can be very difficult to debug manually
      or semi-automatically.  Design groups sometimes give up on the task
      because they can't get it right.  Certain debug features don't work
      as well in partitioned systems; debugging a single FPGA is easier.

    This debug concern applies only to FPGA-based emulators; the
    processor-based emulators have complete visibility across all of
    their processors.

    Most processor-based and larger FPGA-based emulation vendors try to
    automatically take care of partitioning such that the end user doesn't
    have to deal with it.  

      - Automatic partitioning is rule-based and is assumed to be
        correct-by-construction.  If you have a problem with your
        partitioning, it's a support call to the emulation vendor.

      - The greatest partitioning risk occurs in the smaller FPGA-based
        emulator group.  Their (in)ability to partition should be
        carefully assessed -- particularly when partitioning across more
        than two FPGAs.  Some of these vendors use an independent
        partitioning tool from Auspy; however, partitioning is
        inherently never completely push button.

    Thankfully, Moore's Law has now made it possible to have 30 M gates
    in just two FPGAs.  This neutralizes the partitioning issue for a
    substantial fraction of emulation projects.  If no messy multi-part
    partitioning is needed, there's no problem.  A single partition
    between only two devices is straightforward.
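
    The pin multiplexing described in the FPGA bullet above can be made
    concrete with a small Python sketch.  It uses a simplified model that
    is an assumption, not something from a vendor: every logical signal
    must cross the link once per design cycle, so the emulated clock is
    bounded by the link rate divided by the multiplexing ratio.  The wire
    count and link rate are hypothetical.

```python
import math

def mux_ratio(logical_signals, physical_wires):
    """Logical signals that must time-share each physical wire."""
    return math.ceil(logical_signals / physical_wires)

def max_design_clock_hz(link_rate_hz, ratio):
    """Upper bound on the emulated clock under the simplified model:
    each multiplexed signal crosses the link once per design cycle."""
    return link_rate_hz / ratio

PHYSICAL_WIRES = 1_000        # hypothetical wires between two FPGAs
LINK_RATE_HZ = 200_000_000    # hypothetical per-wire transfer rate

for excess in (2, 10, 100):   # the 2X to 100X range from the text
    logical = PHYSICAL_WIRES * excess
    ratio = mux_ratio(logical, PHYSICAL_WIRES)
    clock = max_design_clock_hz(LINK_RATE_HZ, ratio)
    print(f"{excess:>3}X excess: mux {ratio}:1, "
          f"design clock <= {clock / 1e6:.0f} MHz")
```

    Even in this idealized model, the clock bound drops in direct
    proportion to how far the logical connections exceed the physical
    wires, which is why the quality of the partition matters so much.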


 7. Compile Time

    Emulator compile time is the total time to prep a job for execution on
    your emulation system, including synthesis and routing.  For FPGA-based
    emulators, compilation time is primarily determined by the FPGA routing
    tools.  Further, some FPGA-based emulators are starting to provide
    hierarchical routing -- which enables incremental compiles.  

    Another variable for compile time is whether your design partitioning
    between the devices is automated or not.  

    - For processor-based emulators this design partitioning is
      automated and quite complete, and that time is included in the
      compile time.  The compile runs on a single workstation.

    - For FPGA-based emulators partitioning can be quite variable; your
      total compile time can more than double if the partitioning is
      semi-automated and requires user interaction.

    Another factor for compile time is whether it's parallelizable - i.e.
    whether it can be broken into independent jobs to be run on separate
    workstations.  For FPGA-based emulators, once your partitioning is
    complete, your FPGA compilation can be sped up significantly by running
    each FPGA compile on a separate workstation.  That's not the case for
    processor-based emulators.
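
    A rough Python sketch of why parallel FPGA compiles help.  The
    per-FPGA compile times are made-up numbers, and the longest-job-first
    scheduling is an assumed policy for illustration; partitioning time
    itself is not included.

```python
def serial_compile_hours(per_fpga_hours):
    """All FPGA compiles run back to back on one workstation."""
    return sum(per_fpga_hours)

def parallel_compile_hours(per_fpga_hours, workstations):
    """Greedy schedule: longest job first, each job placed on the
    workstation with the least work so far.  Returns wall-clock hours."""
    loads = [0.0] * workstations
    for hours in sorted(per_fpga_hours, reverse=True):
        idx = loads.index(min(loads))
        loads[idx] += hours
    return max(loads)

# Hypothetical compile times (hours) for a design split over 8 FPGAs.
fpga_hours = [3.0, 2.5, 4.0, 1.5, 2.0, 3.5, 2.5, 3.0]

print("1 workstation: ", serial_compile_hours(fpga_hours), "hours")
print("4 workstations:", parallel_compile_hours(fpga_hours, 4), "hours")
print("8 workstations:", parallel_compile_hours(fpga_hours, 8), "hours")
```

    With one workstation per FPGA, the wall-clock time collapses to the
    slowest single FPGA compile, so one hard-to-route FPGA sets the floor.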


 8. Visibility

    Visibility is the ability to see signals inside a design.  You have
    full visibility with SW simulation, because simulation is software and
    the program state is inside the simulator; it is very flexible and you
    can log any state.

    For FPGA-based emulators, there's static and dynamic visibility.
    Static visibility refers to signal probes that are defined at compile
    time.  They require FPGA resources and run fast but usually cover only
    a small subset of a design's signals.  If you need to change that
    subset of signals, you need to recompile.  Dynamic visibility refers
    to signal probes that do not need to be defined at compile time.
    These do not require additional FPGA resources and run much slower than
    static probes, but cover a much larger subset of a design's signals.

    - The signal visibility with processor-based emulators is basically
      the same as simulation.  That's because all signal states reside
      as a software addressable register somewhere in the emulation
      processor array.

    - With FPGA-based emulators, to see the internal signals, you must
      route them out through some type of multiplexing network to the
      pins.  This adds physical gate overhead whenever you take a signal
      and connect to it through an I/O multiplexer to the pins of the 
      FPGA.  The multiplexer and wires create trees of logic that add
      area overhead.  

      Additionally, those gates and wires add a performance overhead; 
      every cycle requires capturing the state and eventually sending 
      it to a host for storage.  If you try to probe every signal, your
      effective design capacity can go down by a factor of 2-5.  You 
      must navigate this carefully, or your design won't fit.
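
    The capacity hit from heavy probing can be checked with simple
    arithmetic before committing to a probe plan.  In this Python sketch
    the raw capacity and design size are hypothetical, and treating the
    overhead as a single divisor is a simplifying assumption.

```python
def effective_capacity_gates(raw_gates, probe_overhead_factor):
    """Design capacity left once probe logic consumes FPGA resources."""
    return raw_gates / probe_overhead_factor

def design_fits(design_gates, raw_gates, probe_overhead_factor):
    """True if the design still fits after the probe overhead."""
    usable = effective_capacity_gates(raw_gates, probe_overhead_factor)
    return design_gates <= usable

RAW_GATES = 30_000_000      # hypothetical raw emulator capacity
DESIGN_GATES = 12_000_000   # hypothetical design size

for factor in (1, 2, 5):    # 1 = no probes; 2 and 5 = range from the text
    usable = effective_capacity_gates(RAW_GATES, factor)
    fits = design_fits(DESIGN_GATES, RAW_GATES, factor)
    print(f"overhead {factor}X: {usable / 1e6:.0f} M gates usable, "
          f"fits: {fits}")
```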

    However, the signal visibility gap is narrowing between processor-based
    and FPGA-based emulators based on two key factors:

    - FPGA capacity improvements provide increased capacity for static
      probing inside of FPGAs.

    - Xilinx has built a feature in their FPGAs which offers dynamic
      probes for register states.  This Xilinx feature enables better
      multiplexing in the chip to access the signals without overhead.

      With effort, the individual emulator vendors can all but eliminate
      the area overhead, which then means less time recompiling to be
      able to see a different set of signals.  Some emulator vendors are
      taking advantage of this.

    The bottom line is that emulation vendors can be assessed on the area
    and performance hit they incur for any given number of probes, as well
    as whether they offer dynamic probing.


 9. Debug

    With the exponential rise in complexity, debug is essential - as you can
    see in the chart below, teams spend about 1.6X as much time debugging
    (42%) as they do developing testbenches (26%) or writing/running tests
    (26%).

                 
    A robust debug capability is critical, and processor-based emulators
    today have debug capabilities that approach that of SW simulator debug.

    Fundamental emulator debug capabilities can include:

      - Breakpoints.  The ability to pause an emulation run based
        on event triggers.
      - Assertions support.  It flags when assertions, or logic
        statements that define the intended behavior of a design,
        are violated.
      - Simulation hot-swap.  The ability to automatically transfer
        execution to a connected simulator for more in-depth debug,
        in the form of greater visibility and control.
      - Software debug.  It can run software debuggers on the
        embedded code being executed by the processor(s).

    Debug is an area undergoing significant innovation and should be
    assessed beyond the simple list I show in the comparison chart.


10. Virtual Platform API

    Hybrid platforms consist of emulators co-simulating with virtual
    platforms, where a virtual platform simulates whole chips on a
    workstation by eliminating detail from the hardware, and plugging
    C-models together.

    Given the prevalence of virtual platforms, it is important for the
    emulator to have a standard virtual platform API, so that if an
    engineer doesn't have a virtual platform C-model for a component,
    they can plug the component into the emulator instead.  

    For example, you may have an RTL USB 3.0, but no C model.  An emulator
    with a virtual platform API could allow you to co-simulate with your
    emulator running the USB RTL at up to 4 M cycles/sec.  In contrast,
    the RTL might run at only 5 K cycles/sec in a SW simulator.
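
    To put the two speeds from the USB example side by side, here is a
    short Python calculation.  The 1 billion cycle workload is a
    hypothetical figure chosen only to make the wall-clock difference
    visible.

```python
def wall_clock_seconds(cycles, cycles_per_sec):
    """Time to run a fixed number of design cycles at a given speed."""
    return cycles / cycles_per_sec

EMULATOR_CPS = 4_000_000     # "up to 4 M cycles/sec" from the text
SIMULATOR_CPS = 5_000        # "5 K cycles/sec" from the text
CYCLES = 1_000_000_000       # hypothetical workload

speedup = EMULATOR_CPS / SIMULATOR_CPS
emu_seconds = wall_clock_seconds(CYCLES, EMULATOR_CPS)
sim_seconds = wall_clock_seconds(CYCLES, SIMULATOR_CPS)

print(f"speedup: {speedup:.0f}X")
print(f"emulator:  {emu_seconds / 60:.1f} minutes")
print(f"simulator: {sim_seconds / 3600:.1f} hours")
```

    The 4 M cycles/sec figure is an upper bound, so the real gap depends
    on how much traffic the virtual platform link has to carry.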


11. Transactor Availability

    Transactors facilitate the communication between the emulator and
    other platforms with different levels of abstraction.  Transactors
    are critical to making emulators work in that they remove the speed
    bottlenecks associated with co-emulation.

    - Transactors convert transactions coming from outside the emulator
      to bit-level interfaces inside the emulator and vice versa; as 
      emulators move to the mainstream, they need to communicate with 
      simulators and virtual platforms more frequently.

    - Transactors also serve as interfaces when you connect inputs and
      outputs to the I/O of a design (SoC) under test (DUT) in an
      emulator.  The transactors let you go between I/Os on the DUT to
      anything communicating with it on the host simulator, C testbench 
      or possibly another I/O board bringing in a live connection, such
      as Ethernet.  An example is shown below.

             
                        Source: Mentor Graphics

    - One definition of a transactor is anything that moves something
      from one abstraction level to another.  For example, in the
      virtual platform space, transaction level modeling (TLM) is
      standard, whereas in the RTL space, bit-level interfaces are the
      norm.

    - Hardware transactors eliminate traffic over co-emulation links to
      optimize performance.  They do this by translating a compound
      transaction into a bit level handshake exploded in space and time
      on the emulator where it's more efficiently handled.  For example,
      a single transaction over the co-emulation link to "move a
      128-word block of memory from location A to location B" is
      exploded into a sequence of 128 bit-level interface operations in
      the emulator.  
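
    The block-move example in the last bullet can be sketched as a toy
    Python model.  The class and function names, the addresses, and the
    list used as memory are all hypothetical; a real transactor is
    synthesized hardware driving a bit-level interface, not Python.

```python
from dataclasses import dataclass

@dataclass
class BlockMove:
    """One compound transaction as sent over the co-emulation link."""
    src: int     # word address of location A
    dst: int     # word address of location B
    words: int   # block length in words

def expand(txn):
    """Emulator-side expansion: one (src, dst) word copy per word."""
    return [(txn.src + i, txn.dst + i) for i in range(txn.words)]

def run_on_emulator(memory, txn):
    """Apply the expanded operations to a toy memory; return the count."""
    ops = expand(txn)
    for src_addr, dst_addr in ops:
        memory[dst_addr] = memory[src_addr]
    return len(ops)

memory = list(range(512))                    # toy word-addressed memory
txn = BlockMove(src=0, dst=256, words=128)   # hypothetical addresses

link_messages = 1                            # the single compound transaction
emulator_ops = run_on_emulator(memory, txn)
print(f"link messages: {link_messages}, emulator operations: {emulator_ops}")
```

    The point of the sketch is the ratio: one message crosses the slow
    link, while the 128 low-level operations stay inside the emulator.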

    The overall emulator performance is very sensitive to transactor
    quality.  For example, an optimized transactor may run at
    1 M cycles/sec compared with only 100 K cycles/sec for an
    unoptimized transactor -- a 10X performance delta. 

    The individual emulator vendors provide off-the-shelf transactor
    libraries for standard interfaces such as memory and high speed I/O;
    these standard transactors can be used for different designs.  In
    addition, custom transactors must often be developed for each design.

    Some projects may require 10-20 transactors, of which 5-10 may be
    custom, ranging from UARTs to proprietary system bus interconnects.
    End-to-end performance optimization of these custom transactors is
    hard -- the time can range from 1 to 12 months.  Furthermore,
    different transactor versions can be required to support multiple
    emulation platforms.  


12. Verification Language and Native Support

    Regardless of transactor quality, you must sometimes move selected
    components, models, or testbenches over to a SW simulator.  To do
    this efficiently, your emulator should ideally support your existing
    languages.

    - Processor-based emulators typically have broad native support for
      verification languages.  This is because the emulation vendors
      often also provide simulators -- an inherent advantage.

    - FPGA-based emulators don't natively support languages such as C++
      or SystemC; they are typically limited to synthesizable Verilog and
      VHDL.  The drawback is that developing synthesizable Verilog/VHDL
      takes more time than writing in a higher level language.

    Another item that can severely limit emulator performance: test
    bench components that run too slowly on the host or generate too much
    traffic on the link between the host and the emulator.  Synthesizable
    test benches running in the emulator can reduce a host-emulator
    bottleneck by 5X to 10X.


13. Number of Users

    The bigger an SoC, generally the larger the engineering team and the
    more geographically dispersed it is.  Therefore, the maximum number
    of users that can run an emulation at the same time on partitioned
    elements of the SoC should be taken into consideration.


14. Memory Capacity

    Processor-based emulators have a high memory capacity, up to 1 TB.
    This memory is used more like the memory in a simulator; mapping DUT
    memories is transparent and usually not an issue.

    In contrast, DUT memory on FPGA-based emulators is more explicitly
    mapped to hard macro memory blocks in the FPGA devices.  The largest
    FPGA devices have about 50 M bits per device; this capacity is usually
    enough for the memory required in DUT partitions.  Sometimes a DUT
    memory does not map well into the FPGA device memory, in which case
    the FPGA-based emulator can use special constructs to map the DUT
    memory into on-board DRAM.  It is important to know ahead of time
    whether an FPGA-based emulator has memory configuration limitations.
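
    A first-pass capacity check along these lines can be sketched in
    Python.  The memory names and sizes are hypothetical, and the check
    looks only at total bits; real mapping also depends on port counts,
    widths, and vendor-specific constructs that this sketch ignores.

```python
FPGA_BLOCK_RAM_BITS = 50_000_000   # "about 50 M bits per device" (text)

def map_memories(memories, budget_bits=FPGA_BLOCK_RAM_BITS):
    """Greedy first pass: smallest memories go to FPGA block RAM until
    the budget runs out; anything that does not fit is flagged for DRAM."""
    placement = {}
    remaining = budget_bits
    by_size = sorted(memories.items(), key=lambda kv: kv[1][0] * kv[1][1])
    for name, (depth, width) in by_size:
        bits = depth * width
        if bits <= remaining:
            placement[name] = "block_ram"
            remaining -= bits
        else:
            placement[name] = "dram"
    return placement

# Hypothetical DUT memories in one partition: name -> (depth, width in bits)
memories = {
    "l1_cache": (8_192, 64),           # 524,288 bits
    "frame_buffer": (2_097_152, 32),   # 67,108,864 bits
    "fifo": (1_024, 128),              # 131,072 bits
}

for name, target in map_memories(memories).items():
    print(f"{name}: {target}")
```

    Anything flagged for DRAM here is a memory to raise with the vendor
    before the project starts.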

         ----    ----    ----    ----    ----    ----   ----

Related Articles

  Jim Hogan explains the 2013 market drivers for HW emulation growth
  The science of SW simulators, acceleration, prototyping, emulation
  Hogan compares Palladium, Veloce, EVE ZeBu, Aldec, Bluespec, Dini
