Video is going handheld. Consumer-electronics manufacturers are rushing past handheld DVD players to compete against the Video iPod. Handheld receivers for the emerging DVB (digital-video-broadcast) standards are on everyone's drawing boards. Video approaching VGA (video-graphics-array) resolution is flickering to life in cell-phone handsets. And in every one of these markets, the most forward-looking pundits are predicting the arrival of high-definition handheld devices.
All of this could be an obvious boon to consumer-electronics and handset manufacturers, persuading consumers to rush out and buy another multihundred-dollar toy and again replace their cell phone while they are at it. But the growing demand for decent video on handheld devices is posing a serious challenge to system designers, and it is confronting system architects with a too-large menu of hard choices.
The OEM's dilemma
The first decision an end-system designer must face is how much work to do. In many of these markets, you can license a complete system design, including the user-interface software, board film, and enclosures, and simply arrange to have an OEM manufacture it. That approach involves minimal development cost, but the only opportunity for differentiation is a clever choice of a reference-design vendor. At the other extreme, you can start with IP (intellectual-property) cores and design an SOC (system on chip), design a board, select peripherals, and create the system from an almost-clean sheet of paper. This approach offers the greatest opportunity for differentiation (without any guarantee of success unless your organization happens to have deep expertise in digital-video processing), but it could take 50 man-years.
"Customers today aren't looking for reference designs as much as they are looking for complete solutions," says Josh Kablotsky, engineering fellow at Analog Devices Inc. "They want to push all the complexity back onto their vendors, but, at the same time, they want the ability to differentiate."
It is an understandable position. The connections between algorithms, system resources, and the end-user experience are highly complex and have their roots in the deepest levels of system implementation, in both hardware and software. Except for a handful of vendors that have long experience in digital-video systems, much of the expertise is inaccessible. It makes sense to ask semiconductor suppliers, which have attempted to position themselves as experts, to solve these problems.
But this stance has profound implications for both the system vendors and the semiconductor vendors. For the systems vendors, it means they must accept the design of their device as a black box, with only a few design knobs and a promised road map on which to build a market position. They must be able to estimate from demos and superficial descriptions just how far they can turn the knobs and in just which direction the road map is likely to turn.
For the semiconductor vendors, the growing importance of the reference design means that their cherished architecture, often the crown jewel of a senior design team, is all but irrelevant to the customer. "Customers just want a solution to their problem. They don't make choices based on architecture," Kablotsky says. The predicament also means that, to play in these markets, a vendor must provide a complete product, including hardware, codecs, application software, operating systems, and user-interface tools. The primary products customers will get from semiconductor vendors are the APIs (application-programming interfaces) of the software modules, not the underlying hardware. Architecture indirectly becomes an issue in system design depending on the chip developer's ability to attract software partners and to provide powerful APIs.
The sophisticated OEM
A look inside a sophisticated OEM provides an illustration of this delicate balancing act. Fast Forward Video has been building JPEG-based digital-video-recording subsystems since 1999. The company designs at the board level, eschewing FPGAs or ASICs in favor of standard-product ICs whenever possible. The company differentiates itself not with custom hardware but with substantial expertise in video-processing algorithms. This approach is no mean trick when you're performing it on someone else's chip. "The No. 1 issue is keeping the flow of data between the outside world and the disk drive unimpeded," explains Fast Forward's president, Paul DeKeyser. This issue forces the company to delve into the internals of any JPEG hardware it evaluates. "The second is to have the right codec for the desired results." This need dictates that the company must use JPEG-2000 codecs. "Any scheme that uses intraframe encoding has image problems unless you can do multipass encoding," DeKeyser observes.
One of the greatest difficulties is matching the highly variable data rates into and out of the codecs to the plodding but sporadic pace of the disk drive. "We started out with an LSI Logic chip set," says DeKeyser. "It did JPEG, but the data rate was so highly variable that we had to buffer it." Now, the company uses Zoran chips. "With JPEG-2000, we have the concept of metered data rates; the chip keeps track of a target data rate and works to keep its average close to the target. That is one of the strengths that attracted us to Zoran's chips," he says.
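The metering DeKeyser describes can be pictured as a feedback loop. The following is a minimal Python sketch, not Zoran's algorithm: it assumes a toy codec model in which a frame's size equals its complexity divided by a quantizer setting, and the names `metered_sizes`, `window`, and `q0` are hypothetical. After each frame, the loop tracks how far total output has run ahead of the target and picks the next quantizer so that the surplus is repaid over a few frames.

```python
def metered_sizes(complexities, target, window=8, q0=1.0):
    """Return per-frame sizes under a simple metered-rate controller.

    Assumed toy model (not any real chip's behavior): size = complexity / q.
    Complexities must be positive.
    """
    sizes = []
    surplus = 0.0  # bytes emitted so far beyond n * target
    q = q0
    for c in complexities:
        size = c / q
        sizes.append(size)
        surplus += size - target
        # Aim to repay the surplus over `window` frames, but never ask for
        # less than a tenth of the target in a single frame.
        desired = max(target - surplus / window, 0.1 * target)
        # Predict that the next frame will be as complex as this one.
        q = c / desired
    return sizes


if __name__ == "__main__":
    sizes = metered_sizes([20000.0] * 200, target=10000.0)
    print(sizes[0], sum(sizes) / len(sizes))
```

With constant complexity, the first frame overshoots and every later frame comes in slightly under the target, so the running average settles toward the target. Real content varies from frame to frame, so the prediction step would be wrong more often and a downstream buffer would still have to absorb the error.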
Fast Forward's knowledge of the chip has become intimate, reaching into parts of the JPEG engine that many users may not realize are user-programmable. To balance data rate against subjective image quality, the company experiments with such innards as Huffman-coding tables, for example. "This becomes a sort of tribal knowledge: what register settings will work best for a particular chip," says DeKeyser. But extracting this kind of knowledge takes experience. And that experience is unavailable at the beginning of a design, when chip selection must happen. So, a design team must be able to "tease an evaluation out of a demo," as DeKeyser puts it. That holds even if a demo looks terrible because the chip vendor doesn't understand the importance of data converters or looks wonderful because the vendor has tuned it around one sample bit stream. The design team must estimate not only the quality of a chip, but also how far team members can optimize the chip before the architecture runs out of steam. It takes intuition and communication with other people in the close-knit video community who have used the chip. "We would never use a brand-new device," DeKeyser declares.
Levels of abstraction
With customers varying from those who want a turnkey system to those, like Fast Forward, who manipulate hardware registers, silicon vendors face having to provide not only full systems built on their chips, but also multiple levels of access to these reference designs. Some users may see the reference design as a complete system and want access only to the user interface. Others may see the platform as a collection of APIs running on a host CPU, perhaps complying with the Khronos Group's OpenMax royalty-free, cross-platform API, so that they can add or remove codecs in a standard way. Still others may want to understand the fine structure of application and codec tasks so they can, for instance, manipulate codec parameters or reorganize data flows to and from DRAM. And others, like Fast Forward, may want to understand the operation of individual hardware blocks at a register level. This diversity of needs can create mind-bending customer-support and design-update issues for a vendor, especially if it relies primarily on partners for codecs and applications. A silicon vendor may find itself counseling a customer on optimizing a codec that a third party in another country wrote.
The viability of providing different levels of access to a reference design, the effectiveness with which the user can manipulate the design, and the range over which they can configure the design are all functions of the underlying architecture. Some architectures target one specific market and are impractical for any other use. Some attempt to balance two mutually antagonistic goals, flexibility and efficiency, covering a range of end products with a single hardware platform. And others make strong commitments to flexibility and scalability, even at the expense of energy efficiency and cost.
The rifle shot
Some markets are large and stable enough that it makes sense to offer SOC designs just for them. The emergence of mobile-video-broadcast standards in some Asian markets is a case in point. Renesas recently announced the CPU-centric SH-MobileL3V SOC, which finds use as a handset application processor in DVB applications (Figure 1). According to Brian Davis, director of business development at Renesas, the device uses the SH-X processor core, which in turn is based on the superscalar SH-4. The SH-X adds a DSP subsystem to handle audio codecs in software. Even though the SOC targets mobile television in just two geographic markets, the range of audio codecs in use even within this narrow space still demands a software approach, according to Davis. The SH-X provides more than enough head room for an operating system, application code, and software-audio codecs.
In contrast, Renesas judged that a pure-hardware approach was feasible for video because the device needed to support only the MPEG-4 and H.264 video profiles. It also deemed a software approach to video decoders and still-image codecs to be prohibitively damaging to battery life. Hence, the design clusters hardware blocks for video, still-camera processing, and peripheral control around the CPU, and the proprietary Super Hyway silicon bus interconnects them. "The video decoders are implemented in hardware that is like a series of state machines with built-in math functions," Davis explains. "There are enough similar functions between MPEG-4 and H.264 that, with careful design, they can share some of the state machines. And our studies show that, with additional host software, we can cover Windows VC-1 using the same blocks."
Relying primarily on hardware gives Renesas significant energy efficiency. Power numbers are notoriously difficult to compare because so many parameters can have a first-order impact on energy consumption. For instance, does a given power figure include video scaling and color-space conversion? Does it include the display drivers? According to Davis, the SOC can perform the audio/video-decoding and display-preparation jobs in CIF (common-intermediate-format) resolution at approximately 200 mW.
Renesas exposes this architecture to customers at two levels. In Japan, the chip, which now serves as the basis of a merged application/baseband design with NTT, is available as a complete reference design. In Korea, in contrast, Renesas works with system-integration partners to tailor the reference design to local needs.
There can be many variations on this theme. Qualcomm, for example, has approached a similar problem, a line of cellular handsets with video-gaming capability, with its own set of CPU-plus-hardware architectures. But, unlike Renesas, which aimed a chip at a narrow range of markets, Qualcomm designed one SOC for each price and performance point in the handset-gaming market. "Ideally, at different performance levels, you would make different architectural trade-offs," explains Dave Ligon, senior product manager at the handset giant. "We divide this market into three segments: the multimedia phone, which is primarily cost-sensitive; the enhanced phone; and the convergence platform, in which the gaming requirements are stringent."
Qualcomm's chosen architectures directly reflect these three segments. The multimedia phone, for example, implements the platform's application-DSP core for geometry processing and an ARM core for software-based rasterization. Using lots of properly organized local memory makes the performance goals achievable at this level.
The enhanced platform employs hardware rasterization. A z-buffer, along with other hardware-based early-exit strategies, ensures that only visible pixels pass through the pipeline. The geometry-DSP cores connect directly to the rasterization engine to eliminate the energy that a bus or a shared-memory approach would consume. The raster engine, in turn, drives a mobile-display processor that, in addition to creating signals for a small LCD, implements simple 2-D operations.
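As an illustration of the early-exit idea, here is a minimal software sketch of a depth test placed ahead of the expensive per-pixel work. It is not a description of Qualcomm's hardware; the function name `rasterize`, the fragment tuple layout, and the `shade` callback are assumptions made for the example.

```python
def rasterize(fragments, width, height, shade):
    """Depth-test fragments before shading them.

    fragments: iterable of (x, y, z, payload); a smaller z is nearer the viewer.
    shade: callable standing in for the costly per-pixel work.
    Returns the color buffer and the number of fragments actually shaded.
    """
    depth = [[float("inf")] * width for _ in range(height)]
    color = [[None] * width for _ in range(height)]
    shaded = 0
    for x, y, z, payload in fragments:
        if z >= depth[y][x]:
            continue  # early exit: a nearer fragment already owns this pixel
        depth[y][x] = z
        color[y][x] = shade(payload)
        shaded += 1
    return color, shaded


if __name__ == "__main__":
    frags = [(0, 0, 1.0, "near"), (0, 0, 2.0, "far"), (1, 0, 5.0, "bg")]
    print(rasterize(frags, width=2, height=1, shade=str.upper))
```

The saving depends on submission order: a hidden fragment that arrives before the surface covering it is still shaded, which is one reason a depth test is usually combined with other early-exit strategies.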
At the high end, the convergence platform employs a derivative of the ATI 2300 graphics core in combination with an enhanced version of the mobile-display processor. "The convergence platform gives a bit higher graphics performance than the Silicon Graphics Octane workstation that was the current hot product when I worked at Silicon Graphics," Ligon says. "And this is a single system in package in a handset."
In practice, the graphics APIs largely conceal the significant hardware differences among the three platforms, so game developers can almost regard the three as one platform, he says. As it moves up the product line, an OEM would see additional features but no fundamental changes in software interfaces. This approach can offer the ultimate in energy efficiency over a range of tasks. But, as Qualcomm's three distinct chips indicate, flexibility means the need for additional chip designs. For a cash-rich company with intimate customer relationships that make it confident of hitting a huge market, that trade-off is a good one. But not everyone is in that situation.
There is a natural evolution from a purely CPU-centric architecture in which hard-wired engines surround a single programmable core to a more flexible architecture in which the specialized engines themselves become programmable. Some current thinking that went into Texas Instruments' DaVinci platform illustrates this first step. "There are places for hard-wired blocks and places for programmable blocks," observes TI fellow Ray Simar. "It's important not to start with a preconceived notion of architecture but to start with an understanding of the application's requirements and how they will evolve over time. Then, think about architectures."
The cost of adding some level of programmability to functional blocks has been slowly edging down. "Power has been getting to be a harder problem for everyone," Simar says. "But with aggressive power-management techniques and careful memory architecture, the energy consumption of a programmable block may be getting closer to that of a hard-wired block."
Simar also points out that a spectrum of programmability occurs within blocks, depending on the granularity of the computation and the number of modes in which the block must function. In some cases (motion-estimation search and compare, for example), an operation is so close to a fixed data-flow model that register-programmed state machines can cover all the application's needs. In other cases, so many variations on an operation may exist or a function may be so data-dependent that only a device with stored programs and a program counter can be flexible enough to keep up. TI has employed the whole range of options beside its DSP cores.
One function that many designers think of as fixed but that is increasingly demanding programmability is DMA. Data movement within a multimedia subsystem can be highly complex, data-dependent, and variable. Simply routing data blocks into and out of DRAM would in many cases result in catastrophe. TI attacks this problem with a combination of embedded SRAMs, which are large enough to hold a full working set for the function the block performs; flow-through architectures when possible; and programmable-DMA controllers to match DRAM-traffic patterns with the needs of both functional blocks and DDR DRAMs.
The same style of analysis (beginning with use cases, identifying tasks, and partitioning them into hard, configurable, or software-driven blocks) can lead architects to different conclusions. Nvidia's GoForce product line, for example, has hard-wired engines, but the bulk of the work falls upon Tensilica-derived programmable-DSP cores with enhanced instruction sets. The DSP cores themselves repeat the same pattern in microcosm; Nvidia implements the instruction enhancements in fixed hardware.
"You have to be very clever about what goes into hardware," says Geoff Ballew, Nvidia's director of product marketing. On the one hand, fixed hardware delivers the best energy efficiency: better than the augmented DSP blocks and far better than a conventional DSP core. On the other hand, you'd better choose wisely. Sometimes, with a little more work, you can reuse a block that appears to be fixed-function across a number of codecs, for instance.
Part of the problem is that no good tools exist for energy profiling at the system level. It's a challenge to identify which blocks will be the major energy consumers. Once you identify those blocks, Nvidia further decomposes them and employs dynamic, fine-grained clock gating to switch on and off groups of circuits within the block, minimizing energy consumption.
Deciding how much flexibility you need is difficult, according to Steve Barlow, senior director of engineering at Broadcom. At resolutions below CIF (or D1 resolution in a 65-nm process), for example, you can acceptably deal with most functions in programmable hardware. Above that level (as the functions of set-top boxes begin to converge with those of mobile-video devices, for example), hard-wired engines become necessary. Broadcom's approach is in many ways similar to Nvidia's: programmable processors enhanced with special hardware. In Broadcom's case, the processor is a proprietary CPU core with a 16-element vector processor and a 2-D register file, which Broadcom acquisition Alpha Mosaic developed. This architecture was sufficiently robust to allow a single chip design to support MPEG-4 and, with only software changes, adapt to the demand for H.264. But for set-top-box levels of performance, Broadcom needs to augment the architecture with more fixed-hardware blocks. "There are some functions, such as the CABAC [context-adaptive-binary-arithmetic-coding] function in the H.264 main profile, that simply don't map well onto vector hardware. At some point, you are up against the limit of what semiconductor technology can do, and you have to go hard-wired," says Barlow.
Absorbing the functions
In an interesting bit of evolution, some multimedia architectures have picked up this concept of hard-wired functions within a DSP core and used them not to build specialized blocks, but to build a single powerful DSP core that can itself handle all the media tasks in a handheld system. This approach relies on careful study of the algorithms necessary for video processing, identification of the hot spots within those algorithms, and development of instruction-set extensions to break up the hot spots.
"Applying systems expertise to crafting the microarchitecture of the DSP core can give anywhere from double to 20 times the performance on these algorithms," says Analog Devices' Kablotsky. And that performance is enough to handle, for example, CIF-level H.264 decoding entirely in the Blackfin DSP core with about 20-mW power consumption. However, centering computationally intensive tasks on the DSP core invites two problems: It increases the number of instruction fetches, or the amount of wasted energy, necessary to complete the tasks, and it increases the core-clock rate, again increasing energy consumption. Kablotsky argues that creating instructions tailored to the algorithms helps with both of these issues. By combining large numbers of operations (the entirety of an inner loop, for example) into a single new instruction, you can reduce the number of required fetches. By developing extensions to the microarchitecture to exploit data-level parallelism through the use of SIMD (single-instruction-multiple-data) instructions, for example, you can slash the clock rate.
DSP licenser Ceva has taken a similar approach. In this case, Ceva based the microarchitecture evolution on an acquisition that brought the company ownership of the FST (fast-subspace-tracking) algorithms for video processing. According to Ceva's vice president of sales, Issachar Ohana, the FST software dramatically speeds video operations. By incorporating instruction extensions for the FST code, Ceva developed the Ceva-X core, which Ohana claims can perform 30-frame/sec H.264 decoding at resolutions as high as D1 without external hardware assistance. "The combination of these algorithms and the tuning of the core to execute them give us energy consumption competitive with that of hardware," he says.
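To make the instruction-extension argument concrete, the sketch below counts modeled instruction issues for a sum-of-absolute-differences loop, a typical motion-estimation hot spot. It is an illustrative model only, not the Blackfin or Ceva-X instruction set: the five-instructions-per-pixel figure for the scalar loop, the eight-lane fused instruction, and the function names are all assumptions.

```python
def sad_scalar(a, b):
    """Sum of absolute differences, one pixel per loop iteration."""
    total, issued = 0, 0
    for x, y in zip(a, b):
        total += abs(x - y)
        issued += 5  # assumed: two loads, subtract, absolute value, accumulate
    return total, issued


def sad_fused(a, b, lanes=8):
    """Same result, modeling one hypothetical fused SIMD instruction per group."""
    total, issued = 0, 0
    for i in range(0, len(a), lanes):
        total += sum(abs(x - y) for x, y in zip(a[i:i + lanes], b[i:i + lanes]))
        issued += 1  # the whole inner-loop body for `lanes` pixels in one fetch
    return total, issued


if __name__ == "__main__":
    block_a = list(range(256))  # stand-in for a 16x16 block of pixels
    block_b = [0] * 256
    print(sad_scalar(block_a, block_b), sad_fused(block_a, block_b))
```

Under these assumptions, a 16x16 block needs 1280 scalar instruction issues but only 32 fused ones. Fewer issues means fewer fetches for the same work, and the same throughput becomes reachable at a lower clock rate. Loop-control overhead is ignored in both counts.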
Having all the computation flowing through a single DSP core simplifies things. But the big advantage, according to Kablotsky, is that it presents a single, C-level-programming model to developers. It presents all the features that a system OEM might want to customize and all the hooks a codec developer or highly skilled video house might want to tweak as source code for a microarchitecture. This approach eases both modification and extension of the system architecture, at least within the range of the DSP's computing power and the battery-life tolerance of the user.
A third example comes from the opposite end of the processing-core world: ARM. Keith Clarke, vice president of technology marketing at the CPU giant, walked through the evolution of the single CPU as multimedia processor. The ARM7 by itself is adequate for some audio applications. But add the 16-bit saturating arithmetic instructions and increased speed of the ARM9, and you get not just audio, but also MPEG-4 QCIF (quarter-CIF) encoding at 15 frames/sec at approximately 80 MHz. Add speed and the SIMD instructions of the V6 instruction-set architecture on the ARM11, and you can achieve VGA-resolution H.264 encoding. Move one step further to the Cortex A8 and the Neon accelerator with its 64-bit SIMD architecture, and you can perform 30-frame/sec MPEG-4, VGA encoding in about half the cycles that the ARM11 would require. That task in real time requires about 300 MHz. To make these options more realizable to users, ARM is now prototyping a parallelizing compiler that can extract data parallelism and employ the SIMD hardware to exploit it.
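The clock figures Clarke quotes can be turned into a rough per-pixel cycle budget, which is often the first sanity check when sizing a core for a codec. The helper below is a back-of-the-envelope sketch; the function name is hypothetical, the QCIF frame size of 176x144 is the standard definition rather than a figure from the article, and the calculation assumes the whole clock is available to the encoder.

```python
def cycles_per_pixel(clock_hz, width, height, fps):
    """Average core cycles available for each pixel of each frame."""
    return clock_hz / (width * height * fps)


if __name__ == "__main__":
    # Cortex-A8 with Neon: 30-frame/sec VGA MPEG-4 encoding at about 300 MHz
    print(round(cycles_per_pixel(300e6, 640, 480, 30), 2))
    # ARM9: 15-frame/sec QCIF MPEG-4 encoding at about 80 MHz
    print(round(cycles_per_pixel(80e6, 176, 144, 15), 2))
```

The two results, roughly 33 and 210 cycles per pixel, come from different cores and different workloads, so they are not a like-for-like comparison. They do show how much less work per pixel the SIMD-equipped core must get by on to reach VGA resolution.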
A highly optimized central DSP or CPU core is not necessarily the end of the road. Analog Devices, ARM, and Ceva all suggest that multiple cores might be necessary for the multiple high-definition data streams in set-top boxes, for example. Other vendors, such as Cradle Technologies, are also looking at a multicore approach based on customized DSPs. The company offers an asymmetric cluster of fully programmable DSP cores that reside around a shared central memory. Having the right extensions on the right DSP core and providing that core's local memory with the right data becomes the essence of system architecture. Designers with an intimate understanding of both the use scenarios and the algorithms must make all these decisions, according to Cradle's vice president of applications, Bruce Schulman.
According to Schulman, one critical step is to partition the application into large enough blocks. If too little processing takes place in each block, too much traffic will pass through the shared on-chip memory, and system throughput will suffer. If too much processing takes place in each block, the chip will be unable to exploit the inherent task parallelism in the application. Another critical step is to study the flow of data through functional blocks and conduct prefetching under program control so that a computing block is never waiting for data and so that the chip efficiently uses external DRAM.
All of these features give the architecture significant flexibility across a range of systems with similar applications. However, only the stout of heart will undertake programming the individual DSP cores. For the most part, customers relate to the Cradle architecture through its APIs.
A systematic approach
These architectures represent the best analyses of their designers, plus a lot of vested interest, institutional inertia, and tradition. Vendors tend to start with the architectural biases that made them rich in the first place. Yet, from the tremendous variety, patterns emerge that suggest that a systematic approach to video-rich-multimedia architectural design is possible. Any such approach would start with an intimate relationship with the codec and application developers, whether they are in-house or third party. Without this knowledge, you can't accomplish much. With it, system architects can begin with a task-level model of the system, complete with the data flows for expected use patterns.
From this model, the architects should first extract the computational or data-moving hot spots: those code sequences that will consume the most time, energy, or both. Using conventional approaches such as pipelining, SIMD engines, and state machines, designers can tame these hot spots with dedicated hardware. These accelerators can just stay on paper for the moment; designers need not implement them in any particular fashion, but they reduce the estimated execution times and energy consumptions of the code sequences they accelerate.
Next, architects can partition the overall tasks into large blocks in such a way that the overall system requires minimum bandwidth between blocks. Motion estimation or Huffman decoding, for example, might be blocks. This step is vital: In today's technology, computation doesn't cost much, but bandwidth feasts on energy. When you have identified all the blocks, you can organize them based on the data flows and decide which ones you can string into pipelines, which you can arrange in parallel, and which are stubbornly sequential.
You then group the blocks, along with the accelerators you created for them, and map them onto the fewest processing sites that can meet both the throughput and the energy requirements of the system. If one mighty ARM core with a number of extended instructions can do the job at an acceptable clock frequency, you're done. If it requires a dozen little DSPs, each with its own accelerator, so be it. The idea is to map the blocks onto the smallest amount of computational resource necessary. If you have the time, you can now perform an optimization step, looking at all the accelerators that happen to land on each processing site to see whether they can share hardware without resource contention. If so, you might merge two or more accelerators into one slightly more flexible one. Finally, you outfit these processing sites, whether they are CPUs, DSPs, or blocks of dedicated hardware, with local memory and connect them to each other and to central DRAM in whatever manner is necessary to meet the required data-flow bandwidth.
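The mapping step can be sketched as a packing problem. The example below uses first-fit decreasing, a common bin-packing heuristic swapped in here for illustration; it is not a method the article attributes to any vendor. It considers only a per-block clock requirement, so it ignores energy, inter-block bandwidth, and data-flow ordering, and the block names, MHz figures, and 300-MHz site capacity are invented.

```python
def map_blocks(block_mhz, site_capacity_mhz):
    """Assign blocks to as few identical processing sites as the heuristic finds.

    block_mhz: dict of block name -> estimated clock demand after acceleration.
    Returns a list of sites, each a dict with its total load and block names.
    """
    sites = []
    # Place the most demanding blocks first (first-fit decreasing).
    for name, mhz in sorted(block_mhz.items(), key=lambda kv: kv[1], reverse=True):
        if mhz > site_capacity_mhz:
            raise ValueError(f"{name} needs more than one site can provide")
        for site in sites:
            if site["load"] + mhz <= site_capacity_mhz:
                site["load"] += mhz
                site["blocks"].append(name)
                break
        else:
            # No existing site has room, so open a new one.
            sites.append({"load": mhz, "blocks": [name]})
    return sites


if __name__ == "__main__":
    blocks = {
        "motion_estimation": 180,
        "entropy_coding": 120,
        "transform": 90,
        "display_prep": 60,
        "audio": 40,
    }
    for site in map_blocks(blocks, site_capacity_mhz=300):
        print(site)
```

In this made-up case the five blocks land on two sites. A fuller model would also weigh the traffic between blocks placed on different sites, since that is where the energy goes.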
Although no one is likely using this explicit approach, it organizes and describes the tangle of individual decisions that make up a multimedia architect's job, and it explains the blossoming garden of architectures that is rapidly appearing in the handheld-rich-video market.