| Home | Research | Courses | Staff | Alumni |
|
Modern electronic systems grow in complexity, and are combinations of different types of components: microprocessors, signal processors, RAM/ROM, digital logic and analogue circuits. At the same time, designers meet demands for shorter development time and lower development cost. This incites the use of higher abstraction levels, so-called system level design. System level design requires system level tools that simultaneously handle both hardware and software, for modeling, partitioning, verification and synthesis of the complete system. Hardware/Software Codesign covers all of these topics, and as such draws on the experience from many different technological communities. |
![]() Can you see the integrated circuit? |
Data Transfer and Storage Exploration
Many integrated circuit systems, particularly in the multi-media and telecom domains, are inherently data dominant. For this class of applications, data transfer and storage largely determine cost and performance parameters. This is the case for chip size, since large memories are usually needed, performance, since accessing the memories may very well be the main bottleneck, and power consumption, since the memories and buses consume large quantities of energy. Even for systems with caches, the overall storage requirement has vital impact on the performance and power consumption, since it greatly influences the number of slow and power expensive cache misses. For the system development process, the designer must hence concentrate first on exploring the data transfer and storage to achieve a cost optimized end product.
Multi-processor system-on-chip
The next generation of embedded systems will be dominated by industrial and consumer devices, which are able to deliver communications and rich, scalable multimedia content anytime, anywhere. These wireless communication and multimedia applications (e.g. 3D games, advanced medical observation and supervision systems) will lead to the creation of extremely complex and dynamic code with huge resource requirements. Such systems can typically not be handled by a single micro-processor. Instead, a System on Chip with multiple processing elements, an MPSoC, is required. The processing elements will be heterogeneous in nature, and can be anything from a standard processor to more specialized or even field programmable hardware units. The key characteristics of the applications will be the intensive computation, the large data transfer and storage requirement, and the need for efficient resource management. In such systems it is essential to optimize the utilization of computational resources both statically at design time, and dynamically at run-time.
We are cooperated closely with IMEC in Leuven, Belgium, regarding research in this domain. More information about the topic and the results of our work can be found at the following places:
Below you will find abstracs of a number of doctoral thesises defended in this research field at the CAS group.
Hierarchical Memory Size Estimation for Loop Transformation and Data Memory Platform Exploration
In today's embedded systems, the memory hierarchy is rapidly becoming a major bottleneck in terms of power, performance and area, due to the very large amount of (memory related) data need to be transferred and stored (temporarily). This is especially the case for portable multi-media applications systems. These applications are characterized by deep loop nests and multi-dimensional arrays at the high level. Due to the dramatically increasing size and complexity of system-on-a-chip (SoC) designs and stringent time-to-market requirement, the methodology and tools for chip design must be raised to the system level. Early analysis tools are particularly critical in enabling SoC designers to take full advantage of the many architectural options available. For memory optimization, the early high level techniques aim either to design an optimal memory platform for a given application or to optimize the application code in order to take advantage of the memory platform features, or even both. Loop transformation is such an important high level optimization technique. It modifies the execution order of loops and statements without changing the application functionality. Existing loop transformation algorithms are all performed based either on reduction of data access lifetime and on improvement in data locality and regularity to steer selection of loop transformations. These are, however, very abstract cost functions which do not represent the exact memory size requirement of the arrays and how the data will be mapped onto the memory platform later on. Existing algorithms all result in one final loop transformation solution. As different loop transformations may result in optimal utilization for different memory platform instances, ad-hoc decisions at this stage without estimating their impact on the actual hierarchy utilization can lead to a final sub-optimal solution. An evaluation of later design stages' effort is hence required. On the other hand, there usually exist a huge number of loop transformation possibilities, the estimation is required to be performed repeatedly and its computation time of the estimation technique also becomes critical to make it useful during the loop transformation search space exploration.
This dissertation proposes a memory footprint estimation methodology. An intra-array memory footprint estimation is performed first followed by an inter-array estimation. In order to achieve a fast estimate to make it useful repeatedly during the early high level search space exploration, several techniques have been introduced. A fast intra-array memory footprint estimation is performed at the iteration domain based on the maximal lifetime of data accesses, which is defined by the maximal dependency vector. Two approaches, an ILP formulation and vertexes approach, have been introduced for achieving a fast maximal dependency vector calculation. The fast inter-array estimation has been achieved based on several Hanoi tower based approaches.
A hierarchical memory size estimation methodology has also been proposed in this dissertation. It estimates the influence of any given sequence of loop transformation instances on the mapping of application data onto a hierarchical memory platform. As the exact memory platform instantiation is often not yet defined at this high level design stage, a platform independent estimation is introduced with a Pareto curve output for each loop transformation instance. It can steer the designer or an automatic steering tool to select all the interesting loop transformation instances that might later lead to low power data mapping for any of the many possible memory hierarchy instances. This is useful when the memory platform is not defined yet, or for a given memory hierarchy instance. It also allows to find the most appropriate low power memory hierarchy instance by performing an early power estimation of different memory hierarchy instances. Initially the source code is used as input for estimation, resulting in an initial approach. However, performing the estimation repeatedly from the source code is too slow for the large loop transformation search space exploration. An incremental approach, based on local updating of the previous result, is thus introduced to handle sequences of different loop transformations. Several advanced techniques have also been used on these two approaches in order to perform a fast estimation, such as bounding box geometrical model based data reuse analysis, platform independent memory hierarchy layer assignment estimation, fast intra- and inter-array memory footprint estimation.
The feasibility and usefulness of the methodologies are substantiated using several representative real-life application demonstrators. It shows for instance that the fast memory footprint estimation can be two order of magnitude faster than compared techniques while still achieving fairly accurate estimation result. For hierarchical memory size estimation methodology, the initial approach is two order of magnitude faster than the compared technique and the incremental approach is another two order of magnitude faster than the initial approach, which can just take a few milliseconds. The fast computation time of the incremental approach make it feasible to be used repeatedly during the loop transformation exploration over a very large number of possibilities. Furthermore, prototype CAD tools has been developed that includes mast parts of the methodologies.
[]Hierarchical Memory Size Estimation for Loop Transformation and Data Memory Platform Exploration. Doctoral thesis at NTNU, 2007:63, Qubo Hu
Storage Requirement Estimation and Optimization for Data Intensive Applications
Many integrated circuit systems, especially in the multi-media and telecom domains, are inherently data dominant. For this class of applications, data transfer and storage largely determine cost and performance parameters. This is the case for chip size, since large memories are usually needed. It is however also the situation for performance. Accessing the memories is in more and more cases the main bottleneck and the cache behavior is for a significant part determined by “capacity misses” which are related to size issues. Finally, it is the case for power consumption, since the memories and buses consume large quantities of energy and the “effective data size” influences their capacitive loading in case of RAMs and their miss-related activity in case of caches. In both cases, the power consumption per access is also heavily affected. During the system development process, the designer must hence concentrate first on exploring the data transfer and storage to produce a cost-optimized end product. At the system level, no detailed information is available regarding the size of the memories required for storing data in the alternative realizations of an application. To guide the designer and help in selecting the best solution, estimation techniques for the storage requirements are needed, very early in the system design trajectory. The simplest estimates use the maximal size of the intermediate array data as declared in the application code. This is however not representative for the effective size required for their storage during the actual execution since arrays and parts of arrays may not be alive simultaneously. An array element is alive from the moment it is written, or produced, and until it is read for the last time. This last read is said to consume the element. To achieve accurate estimates, the so-called in-place mapping opportunity generated by these non-overlapping lifetimes must be taken into account. For scalars, a relatively simple lifetime analysis suffices, but for arrays, this is extremely complex due to the huge number of signals and the often very complex interdependencies between them.
For our target classes of data dominant applications the high-level description is typically characterized by large multi-dimensional nested loops and arrays. Within the loop nests statements access the arrays using read and write operations. At the beginning of the design process, no information about the execution order of these loops is available, except what is given from the data dependencies between the statements in the code. As the process progresses, the designer makes decisions that gradually fix the ordering, until the full execution ordering is known. This execution ordering determines the order in which array elements are accessed and hence the lifetimes of the array elements. Since these lifetimes in turn influence the in-place mapping opportunity, the storage requirements of the arrays within the loop nests is largely determined by the execution ordering. To guide the designer it is therefore essential to have storage requirement estimates that can take the available partially fixed execution ordering into account during the exploration of the implementation solution space. Previous work has either not taken execution ordering into account at all, resulting in large overestimates, or required a fully specified ordering. In the last case, a time consuming full exploration of all possible alternative orderings of the unfixed loop dimensions is needed. This is infeasible during the early system design steps where fast feedback is needed to be able to explore the huge solution space.
The storage requirement estimation methodology proposed in this doctoral thesis solves these important design problems. The methodology is divided into four steps. In the first step, a data-flow graph is generated that reflects the data dependencies in the application code. The array accesses and the dependencies between them are described using a polyhedral model. The second step places the polyhedral descriptions of the array accesses and their dependencies in a so-called common iteration space. A worst-case and best-case placement may be performed, both taking available execution ordering into account during the placement. The third step estimates the upper and lower bounds on the storage requirement of individual data dependencies in the code, taking into account the available execution ordering. As the execution ordering is gradually fixed, the upper and lower bounds on the data dependencies converge. This is a very useful and unique property of the methodology. Finally, simultaneously alive data dependencies are detected. Their combined maximal size at any time during execution equals the total storage requirement of the application. An important part of the estimation technique utilizes loop ordering guidance to estimate upper and lower bounds on dependency sizes. These guiding principles and the proof of their validity are together with the general estimation methodology important contributions of this thesis. The guiding principles can be used for high-level synthesis independently of the storage requirement estimation methodology.
The feasibility and usefulness of the methodology are substantiated using several representative application demonstrators. It is for instance shown how the designer is guided into reducing the memory size of the major arrays in the MPEG-4 Motion Estimation Kernel from 262400 to 257 memory locations. Similar results are achieved for a Cavity Detection algorithm. Applying the methodology on an Updating Singular Value Decomposition algorithm, it is also demonstrated how estimation feedback during global loop reorganization can approximately halve the application's storage requirement. Furthermore, a prototype CAD tool has been developed that includes major parts of the storage requirement estimation and optimization methodology. Using manually generated design examples the tool proved the feasibility of the techniques and in particular showed that run times on computers will be short, in the order of seconds even for substantial applications.
[]Storage Requirement Estimation and Optimization for Data Intensive Applications. Fys.el.-rapport:2001:24, Per Gunnar Kjeldsberg
