Ultra-Low Energy Domain-Specific Instruction-Set Processors - Javed Absar, Francky Catthoor, Murali Jayapala, Angeliki Kritikakou, Andy Lambrechts, Praveen Raghavan

Ultra-Low Energy Domain-Specific Instruction-Set Processors (eBook)

eBook Download: PDF
2010
XXII, 406 pages
Springer Netherlands (publisher)
978-90-481-9528-2 (ISBN)
System requirements
149.79 incl. VAT
  • Download available immediately

Modern consumers carry many electronic devices, such as a mobile phone, digital camera, GPS receiver, PDA and MP3 player. The functionality of each of these devices has evolved considerably in recent years, with a steep increase in both the number of features and the quality of the services they provide. However, providing the compute power required to support (an uncompromised combination of) all this functionality is highly non-trivial. Designing processors that meet the demanding requirements of future mobile devices requires optimization of the embedded system in general and of the embedded processors in particular, as they should strike the right balance between flexibility, energy efficiency and performance. In general, a designer will try to minimize the energy consumption (as far as needed) for a given performance, with sufficient flexibility. Achieving this goal is already complex when looking at the processor in isolation, but in reality the processor is a single component in a more complex system. To design such a complex system successfully, critical decisions during the design of each individual component should take into account their effect on the other parts, with the clear goal of moving towards a global Pareto optimum in the complete multi-dimensional exploration space.

Within this complex, global design of battery-operated embedded systems, the focus of Ultra-Low Energy Domain-Specific Instruction-Set Processors is on the energy-aware architecture exploration of domain-specific instruction-set processors and the co-optimization of the datapath architecture, foreground memory and instruction memory organisation, with a link to the required mapping techniques or compiler steps in the early stages of the design. By performing an extensive energy breakdown experiment for a complete embedded platform, both energy and performance bottlenecks have been identified, together with the important relations between the different components. Based on this knowledge, architecture extensions are proposed for all the bottlenecks.
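To make the notion of a global Pareto optimum in a multi-dimensional exploration space a little more concrete, the following minimal sketch (not taken from the book; the configuration names, metric choices and cost numbers are invented for illustration) shows how a set of candidate processor configurations can be filtered down to the Pareto-optimal trade-off points when energy, cycle count and inflexibility are all minimized.

```python
# Hypothetical sketch of Pareto filtering in a multi-dimensional design-space
# exploration. All metrics are "lower is better"; the data below is invented.

from typing import Dict, List, Tuple

Candidate = Tuple[str, Dict[str, float]]  # (name, {metric: normalized cost})

def dominates(a: Dict[str, float], b: Dict[str, float]) -> bool:
    """True if point 'a' is at least as good as 'b' on every metric
    and strictly better on at least one."""
    return all(a[m] <= b[m] for m in a) and any(a[m] < b[m] for m in a)

def pareto_front(candidates: List[Candidate]) -> List[Candidate]:
    """Keep only the candidates that no other candidate dominates."""
    return [
        (name, costs)
        for name, costs in candidates
        if not any(dominates(other, costs)
                   for _, other in candidates if other is not costs)
    ]

if __name__ == "__main__":
    # Hypothetical architecture options with normalized costs.
    options: List[Candidate] = [
        ("RISC baseline",   {"energy": 1.00, "cycles": 1.00, "inflexibility": 0.20}),
        ("clustered VLIW",  {"energy": 0.55, "cycles": 0.60, "inflexibility": 0.40}),
        ("CGRA-style",      {"energy": 0.45, "cycles": 0.50, "inflexibility": 0.70}),
        ("naive wide VLIW", {"energy": 0.90, "cycles": 0.65, "inflexibility": 0.45}),
    ]
    for name, costs in pareto_front(options):
        print(name, costs)
```

In this toy example the "naive wide VLIW" point is dominated by the "clustered VLIW" point on all three metrics and is therefore discarded, while the remaining options each represent a genuine trade-off along the front.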



Preface 5
Contents 8
Glossary and Acronyms 17
Chapter 1: Introduction 20
1.1 Context 20
1.1.1 Processor design: a game of many trade-offs 22
1.1.2 High level trade-off target 25
1.2 Focus of this book 26
1.3 Overview of the main differentiating elements 29
1.4 Structure of this book 32
Chapter 2: Global State-of-the-Art Overview 35
2.1 Architectural components and mapping 36
2.1.1 Processor core 36
2.1.1.1 The FUs, slots and PEs of the datapath 37
2.1.1.2 Foreground memory (or register files) 38
2.1.1.3 Processor pipelining 39
2.1.1.4 Issue logic 40
2.1.1.5 Overview of state-of-the-art processor classes 40
2.1.2 Data memory hierarchy 43
2.1.3 Instruction/configuration memory organization 43
2.1.4 Inter-core communication architecture 44
2.2 Platform architecture exploration 44
2.2.1 Exploration strategy 45
2.2.2 Criteria/cost metric 46
2.2.2.1 Performance 46
2.2.2.2 Energy consumption 46
2.2.2.3 Area 48
2.2.2.4 Design effort 48
2.2.2.5 Flexibility 49
2.2.3 Evaluation method 49
2.3 Conclusion and key messages of this chapter 50
Chapter 3: Energy Consumption Breakdown and Requirements for an Embedded Platform 51
3.1 Platform view: a processor is part of a system 52
3.2 A video platform case study 53
3.2.1 Video encoder/decoder description and context 54
3.2.1.1 Driver application 54
3.2.1.2 Embedded platform description 55
3.2.1.3 Inter-tile communication architecture 56
3.2.1.4 Mapping the application to the architecture 57
3.2.2 Experimental results for platform components 58
3.2.2.1 Experimental procedure 58
3.2.2.2 Embedded processor datapath logic 59
3.2.2.3 Datapath pipeline registers 60
3.2.2.4 Data and instruction memory hierarchy 61
3.2.2.5 Inter-tile communication architecture 62
3.2.3 Power breakdown analysis 63
3.2.4 Conclusions for the platform case study 67
3.3 Embedded processor case study 67
3.3.1 Scope of the case study 68
3.3.2 Processor styles 69
3.3.2.1 Software pipelining (e.g. modulo scheduling) 70
3.3.2.2 Clustering (clustered VLIW) 70
3.3.2.3 Coarse-grained reconfigurable architecture 71
3.3.2.4 SIMD or sub-word parallelism 73
3.3.2.5 Custom instructions and/or FUs 73
3.3.2.6 Optimized data memory hierarchy 74
3.3.2.7 Hybrid combinations 74
3.3.3 Focus of the experiments 74
3.3.4 Experimental results for the processor case study 75
3.3.4.1 RISC 76
3.3.4.2 Centralized VLIW 76
3.3.4.3 Clustered VLIW 78
3.3.4.4 Coarse-grained architectures 80
3.3.5 Conclusions for the processor case study 81
3.4 High level architecture requirements 81
3.5 Architecture exploration and trends 83
3.5.1 Interconnect scaling in future technologies 83
3.5.2 Representative architecture exploration examples: What are the bottlenecks? 84
3.6 Architecture optimization of different platform components 86
3.6.1 Algorithm design 86
3.6.2 Data memory hierarchy 87
3.6.3 Foreground memory organization 88
3.6.4 Instruction/Configuration Memory Organization (ICMO) 91
3.6.5 Datapath parallelism 93
3.6.6 Datapath–address path 95
3.7 Putting it together: FEENECS template 96
3.8 Comparison to related work 98
3.9 Conclusions and key messages of this chapter 98
Chapter 4: Overall Framework for Exploration 100
4.1 Introduction and motivation 100
4.2 Compiler and simulator flow 103
4.2.1 Memory architecture subsystem 104
4.2.1.1 Data memory hierarchy 105
4.2.1.2 Instruction/Configuration Memory Organization/Hierarchy (ICMO) 106
4.2.2 Processor core subsystem 107
4.2.2.1 Processor datapath 107
4.2.2.2 Register File/Foreground Memory Organization 108
4.2.3 Platform dependent loop transformations 110
4.3 Energy estimation flow (power model) 111
4.4 Comparison to related work 113
4.5 Architecture exploration for various algorithms 116
4.5.1 Exploration space of key parameters 116
4.5.2 Trends in exploration space 118
4.5.2.1 IPC trends 126
4.5.2.2 Loop buffers and their impact on Instruction Memory Hierarchy/Organization 127
4.5.2.3 Exploration time 129
4.6 Conclusion and key messages of this chapter 130
Chapter 5: Clustered L0 (Loop) Buffer Organization and Combination with Data Clusters 131
5.1 Introduction and motivation 132
5.2 Distributed L0 buffer organization 132
5.2.1 Filling distributed L0 buffers 134
5.2.2 Regulating access 135
5.2.3 Indexing into L0 buffer partitions 136
5.2.4 Fetching from L0 buffers or L1 cache 137
5.3 An illustration 137
5.4 Architectural evaluation 139
5.4.1 Energy reduction due to clustering 141
5.4.2 Proposed organization versus centralized organizations 144
5.4.3 Performance issues 146
5.5 Comparison to related work 147
5.6 Combining L0 instruction and data clusters 149
5.6.1 Data clustering 150
5.6.2 Data clustering followed by L0 clustering 151
5.6.3 Simulation results 152
5.6.4 VLIW Variants 155
5.7 Conclusions and key messages of this chapter 156
Chapter 6: Multi-threading in Uni-threaded Processor 158
6.1 Introduction 158
6.2 Need for light weight multi-threading 161
6.3 Proposed multi-threading architecture 164
6.3.1 Extending a uni-processor for multi-threading 164
6.3.1.1 Software counter based loop controller 165
6.3.1.2 Hardware counter based loop controller 166
6.3.1.3 Running multiple loops in parallel 167
6.4 Compilation support potential 169
6.5 Comparison to related work 171
6.6 Experimental results 175
6.6.1 Experimental platform setup 175
6.6.2 Benchmarks and base architectures used 176
6.6.3 Energy and performance analysis 177
6.7 Conclusion and key messages of this chapter 180
Chapter 7: Handling Irregular Indexed Arrays and Dynamically Accessed Data on Scratchpad Memory Organisations 181
7.1 Introduction 182
7.2 Motivating example for irregular indexing 183
7.3 Related work on irregular indexed array handling 184
7.4 Regular and irregular arrays 185
7.5 Cost model for data transfer 186
7.6 SPM mapping algorithm 187
7.6.1 Illustrating example 187
7.6.2 Search-space exploration algorithm 188
7.7 Experiments and results 191
7.8 Handling dynamic data structures on scratchpad memory organisations 193
7.9 Related work on dynamic data structure access 194
7.10 Dynamic referencing: locality optimization 195
7.10.1 Independent reference model 197
7.10.2 Comparison of DM-cache with SPM 199
7.10.3 Optimal mapping on SPM: results 201
7.11 Dynamic organization: locality optimization 205
7.11.1 MST using binary heap 206
7.11.2 Ultra dynamic data organization 207
7.12 Conclusion and key messages of this chapter 211
Chapter 8: An Asymmetrical Register File: The VWR 213
8.1 Introduction 213
8.2 High level motivation 217
8.3 Proposed micro-architecture of VWR 218
8.3.1 Data (background) memory organization and interface 219
8.3.2 Foreground memory organization 220
8.3.3 Connectivity between VWR and datapath 222
8.3.4 Layout aspects of VWR in a standard-cell based design 223
8.3.5 Custom design circuit/micro-architecture and layout 225
8.4 VWR operation 228
8.5 Comparison to related work 231
8.6 Experimental results on DSP benchmarks 233
8.6.1 Experimental setup 233
8.6.2 Benchmarks and energy savings 234
8.7 Conclusion and key messages of this chapter 236
Chapter 9: Exploiting Word-Width Information During Mapping 237
9.1 Word-width variation in applications 237
9.1.1 Fixed point refinement 239
9.1.2 Word-width variation in applications 243
9.2 Word-width aware energy models 244
9.2.1 Varying word-width or dynamic range 245
9.2.2 Use-cases for word-width aware energy models 246
9.2.3 Example of word-width aware energy estimation 247
9.3 Exploiting word-width variation in mapping 248
9.3.1 Assignment 249
9.3.1.1 Concept 249
9.3.1.2 Expected gains 250
9.3.2 Scheduling 251
9.3.2.1 Concept 251
9.3.2.2 Expected gains 252
9.3.3 ISA selection 255
9.3.3.1 Concept 255
9.3.3.2 Expected gains 255
9.3.4 Data parallelization 256
9.3.4.1 Concept 257
9.3.4.2 Expected gains 258
9.4 Software SIMD 259
9.4.1 Hardware SIMD vs Software SIMD 259
9.4.2 Enabling SIMD without hardware separation 261
9.4.2.1 Corrective operations to preserve data boundaries 262
9.4.2.2 Software SIMD on a Hardware SIMD capable datapath 273
9.4.3 Case study 1: Homogeneous Software SIMD exploration for a Hardware SIMD capable RISC 273
9.4.4 Case study 2: Software SIMD exploration, including corrective operations, for a VLIW processor 278
9.5 Comparison to related work 282
9.6 Conclusions and key messages of this chapter 286
Chapter 10: Strength Reduction of Multipliers 288
10.1 Multiplier strength reduction: Motivation 289
10.2 Constant multiplications: A relevant sub-set 290
10.2.1 Types of multiplications 291
10.2.2 Motivating example 294
10.3 Systematic description of the global exploration/conversion space 297
10.3.1 Primitive conversion methods 298
10.3.1.1 Bitwise (or parallel) method 298
10.3.1.2 Recursive (or sequential) method 299
10.3.2 Partial conversion methods 300
10.3.2.1 Multiplicative factoring 301
10.3.2.2 Additive factoring (word splitting) 302
10.3.3 Coding 303
10.3.4 Modifying the instruction-set 304
10.3.5 Optimization techniques 306
10.3.6 Implementation cost vs. operator accuracy trade-off 307
10.3.6.1 Trading off accuracy with performance 308
10.3.6.2 Preventing width expansion of multiplication results 309
10.3.7 Cost-aware search over conversion space 313
10.4 Experimental results 314
10.4.1 Experimental procedure 315
10.4.2 IDCT kernel (part of MPEG2 decoder) 315
10.4.3 FFT kernel, including accuracy trade-offs 317
10.4.4 DWT kernel, part of architecture exploration 320
10.4.5 Online biotechnology monitoring application 323
10.4.6 Potential improvements of the strength reduction 324
10.4.6.1 Loop Buffer with Local Controller 324
10.4.6.2 Link between SSA, CSD and performance 324
10.4.6.3 Multiple precision MUL operations 325
10.5 Comparison to related work 325
10.6 Conclusions and key messages of this chapter 326
Chapter 11: Bioimaging ASIP benchmark study 328
11.1 Bioimaging application and quantisation 329
11.2 Effective constant multiplication realisation with shift and adds 335
11.3 Architecture exploration for scalar ASIP-VLIW options 344
11.3.1 Constant multiplication FU mapping: Specific SA and SAS options 351
11.3.2 FUs for the Generic SAs 352
11.3.3 Cost-effective mapping of detection algorithm 356
11.4 Data-path architecture exploration for data-parallel ASIP options 359
11.5 Background and foreground memory organisation for SoftSIMD ASIP 366
11.5.1 Basic proposal for 2D array access scheme 366
11.5.2 Overall schedule for SoftSIMD option 369
11.6 Energy results and discussion 371
11.6.1 Data path energy estimation for critical Gauss loop of scalar ASIP 372
11.6.2 Data path energy estimation for critical Gauss loops of SoftSIMD ASIP 374
11.6.3 Data path energy estimation for overall Detection algorithm 376
11.6.4 Energy modeling for SRAM and VWR contribution 378
11.6.5 Memory energy contributions 380
11.6.6 Global performance and energy results for options 381
11.7 Conclusions and key messages of this chapter 385
Chapter 12: Conclusions 386
12.1 Related work overview 386
12.2 Ultra low energy architecture exploration 387
12.3 Main energy-efficient platform components 388
Bibliography 391

Publication date (per publisher) 5 August 2010
Series Embedded Systems
Additional information XXII, 406 p.
Place of publication Dordrecht
Language English
Subject areas Mathematics / Computer Science > Computer Science > Theory / Studies
Engineering > Electrical Engineering / Power Engineering
Keywords algorithms • Biomedical Processing • Compiler • digital signal processor • Embedded • Embedded Systems • Low Energy • Low Power • Low Power Design • Multi-Media Processing • Processor • Processor architecture • Programmable • Scratch • wireless communications
ISBN-10 90-481-9528-4 / 9048195284
ISBN-13 978-90-481-9528-2 / 9789048195282
PDF (watermarked)
Size: 11.3 MB

DRM: Digital watermark
This eBook contains a digital watermark and is therefore personalized for you. If the eBook is improperly passed on to third parties, it can be traced back to the source.

File format: PDF (Portable Document Format)
With its fixed page layout, PDF is particularly suitable for technical books with columns, tables and figures. A PDF can be displayed on almost all devices, but is only of limited use on small displays (smartphone, eReader).

System requirements:
PC/Mac: You can read this eBook on a PC or Mac. You need a PDF viewer, e.g. Adobe Reader or Adobe Digital Editions.
eReader: This eBook can be read with (almost) all eBook readers. However, it is not compatible with the Amazon Kindle.
Smartphone/Tablet: Whether Apple or Android, you can read this eBook. You need a PDF viewer, e.g. the free Adobe Digital Editions app.

Additional feature: online reading
In addition to downloading it, you can also read this eBook online in your web browser.

Buying eBooks from abroad
For tax law reasons we can only sell eBooks within Germany and Switzerland. Regrettably, we cannot fulfil eBook orders from other countries.
