A Novel Approach to Software Defined FPGA Computing
Software Centric Methodology
The overall process of implementing a design using QuickPlay is straightforward:
1. Develop a C/C++ functional model of the hardware engine.
2. Verify the functional model with standard C/C++ debug tools.
3. Specify the target FPGA platform and I/O interfaces (PCIe, Ethernet, DDR, QDR, etc.).
4. Compile and build the hardware engine.
The process seems simple; but if it is to work seamlessly, it is critical that the generated hardware engine be guaranteed to function identically to the original software model. In other words, the functional model must be deterministic so that, no matter how fast the hardware implementation runs, both hardware and software executions will yield the exact same results.
Unfortunately, most parallel systems suffer from nondeterministic execution. Multithreaded software execution, for example, depends on the CPU, on the OS and on nonrelated processes running on the same host. Multiple runs of the same multithreaded program can have different behaviors. Such nondeterminism in hardware would be a nightmare, as it would require debugging the hardware engine itself, at the electrical waveform level, and thus would defeat the purpose of a tool aimed at abstracting hardware to software developers. QuickPlay uses an intuitive dataflow model that mathematically guarantees deterministic execution, regardless of the execution engine. The model consists of concurrent functions, called kernels, communicating with streaming channels. It thus correlates well with how a software developer might sketch an application on a whiteboard. To guarantee deterministic behavior, the kernels must communicate with each other in a way that prevents data hazards, such as race conditions and deadlocks. This is achieved with streaming channels that are
2) blocking read and blocking write, and
Those are the characteristics of a Kahn Process Network (KPN), the computation model on which PLDA built QuickPlay. Figure 2 shows a QuickPlay design example illustrating the KPN model.
The contents of any kernel can be arbitrary C/C++ code, third-party IP or even HDL code (for the hardware designers). QuickPlay then features a straightforward design ow (Figure 3). Let’s take a closer look at each step of the QuickPlay design process.
Step 1: Pure software design. At this stage you create your FPGA design by adding and connecting processing kernels in C and by specifying the communication channels with your host software. QuickPlay’s Eclipse-based integrated development environment (IDE) provides a C/ C++ library with a simple API to create kernels, streams, streaming ports and memory ports, and to read and write to and from streaming ports and memory ports. In addition, the QuickPlay IDE provides an intuitive graphical editor that allows you to drag and drop kernels and other design elements and to draw streams.
Step 2: Functional verication. In this step, the focus is on making sure that the software model written in Step 1 works correctly. You do this by compiling the software model on the desktop, executing it with a test program that sends data to the inputs, and verifying the correctness of the outputs. The software model of the FPGA design is executed in parallel, with a distinct thread for each kernel to mimic the parallelism of the actual hardware implementation. You would then debug your software model using standard software debug techniques and tools such as breakpoints, watchpoints, step-by-step execution and printf. (You will probably want to run more tests once the implementation is in hardware; we’ll deal with that shortly.) From a design ow standpoint, this is where you do all of your verication. Once you are done with this debug phase and have fixed all functional issues, you will not need any further debugging at the hardware level. It’s important to remember that the functional model involves none of the hardware infrastructure elements. In the example above, the focus is on a simple, two-function model; none of the system aspects added in Figure 1 (such as the communication components, the control plane, and clocking and resets) are in play during this modeling and verification phase.
Step 3: Hardware generation. This step generates the FPGA hardware from your software model. It involves three simple actions:
Step 4: System execution. This is similar to the execution of the functional model in Step 2 (functional verification), except that now, while the host application still runs in software, the FPGA design runs on the selected FPGA board. This means that you can stream real data in and out of the FPGA board and thereby benet from additional verication coverage of your function. Because this will run so much faster, and because you can use live data sources, you are likely to run many more tests at this stage than you could during functional verication.
Step 5: System debug. Because you’re running so many more tests now than you were doing during the functional verication phase, you’re likely to uncover functional bugs that weren’t uncovered in Step 2. So how do you debug now? As already noted, you never have to debug at the hardware level, even if a bug is discovered after executing a function in hardware. Because QuickPlay guarantees functional equivalence between the software model and the hardware implementation, any bug in the hardware version has to exist in the software version as well. This is why you don’t need to debug in hardware; you can debug exclusively in the software domain. Once you have identied the test sequence that failed in hardware, QuickPlay can capture the sequence of events at the input of the design that generated the faulty operation and replay it back into the software environment, where you can now do your debug and identify the source of the bug using the Eclipse debugger. This is possible because QuickPlay automatically provisions hardware with infrastructure for observing all of the critical points of the design. You can disable this infrastructure to free up valuable hardware real estate. Figure 4 shows the example system with added debug circuitry. Without QuickPlay, some sort of debug infrastructure would have to be inserted and managed by hand; with QuickPlay, this all becomes automatic and transparent to the software developer.
The overall process is to model in software, then build the system and test in hardware. If there are any bugs, import the failing test sequences back into the software environment, debug there, x the source code and then repeat the process. This represents a dramatic productivity improvement over traditional flows.
Step 6 (optional): System optimization. Once you have completed the debug phase, you have a functional design that operates on the FPGA board correctly. You may want to make some performance optimizations, however, and this is the proper time to do that, as you already know that your system is running correctly. The rst optimization you should consider is to rene your functional model. There are probably additional concurrency opportunities available; for example, you might try decomposing or refactoring functions in a different way. At this level, optimizations can yield spectacular performance improvements. Needless to say, doing so with a VHDL or Verilog design would require signicant time, whereas doing the modications in C would be a quick and straightforward process. Second, you may want to try a different FPGA board with a faster FPGA. Because the mapping from the functional model to the board is so easy, it’s a simple matter to try a variety of boards in order to select the optimal one. The third optimization has to do with the hardware kernels that QuickPlay creates via high-level synthesis. While the resulting hardware is guaranteed to operate correctly and efciently, it may not operate as efciently as hardware handcrafted by a hardware engineer. At this stage, you have several options: You can optimize your code and tune QuickPlay HLS settings to improve the generated hardware, use Vivado HLS to generate more-eficient hardware, or have a hardware designer handcraft the most critical blocks in HDL. None of these optimization steps is mandatory, but they provide options when you need better-performing hardware and have limited hardware design resources available. A hardware engineer may be able to help with these optimizations. Once you have made any of these changes, simply repeat the build process.
A Universal Streaming Conduit
QuickPlay provides a universal streaming API that entirely abstracts away the underlying physical communication protocol. Streaming data is received via the ReadStream() function and is sent out using the WriteStream() function. Those functions can be used to send and receive data between kernels, to embedded or board-level memory, or to an embedded or external host CPU, thus providing broad architectural exibility with no need for the developer to comprehend or manage the underlying low-level protocols.
The selected protocol determines the hardware through which that data arrives and departs. At present, QuickPlay supports ARM® AMBA® AXI4-Stream, DDR3, PCIe (with DMA) and TCP/IP; more protocols are being added and will be added as demand dictates. Selecting the desired protocol sets up not only the hardware needed to implement the protocol, but also the software stacks required to support the higher protocol layers, as shown in Figure 5.
QuickPlay manages the exact implementation of these reads and writes (size, alignment, marshaling, etc.). The most important characteristic of the ReadStream() and WriteStream() statements is that they are blocking: When either statement is encountered, execution will not pass to the next statement until all of the expected data has been read or written. This is important for realizing the determinism of the algorithm.
The “binding” between the generic ReadStream() and WriteStream() statements and the actual underlying protocol hardware occurs at runtime via the QuickPlay Library. This not only prevents the communication details from cluttering up the software program, but also provides modularity and portability. The communication protocol can easily be changed without requiring any changes to the actual kernel code or host software. The ReadStream() and WriteStream() statements will automatically bind to whichever protocol has been selected, with no effect on program semantics. As a result of the abstraction that QuickPlay provides, the software algorithms remain pure, focusing solely on data manipulation in a manner that’s completely independent of the underlying communication details.
Production Quality Output
Depending on the HLS tool being used, results might be improved by learning coding styles that result in more efficient hardware generation, but that is optional. While in other situations the hardware platform you use may be viewed simply as a prototyping vehicle, the systems you create using QuickPlay are production-worthy. Going from a purely software implementation to a hardware-assisted or hardware-only implementation traditionally takes months. QuickPlay reduces that time to days.
The QuickPlay methodology achieves the longsought goal of allowing software engineers to create hardware implementations of all or portions of their application. By working in their familiar domain, software engineers can make use of custom hardware as needed, automatically generating hardware-augmented applications that operate more efficiently and can be production-ready months ahead of handcrafted designs
Phone: +44 (0) 7462 148 989
Address: Galloway close,
RG22 6SX Basingstoke, UK