The Art of Writing Efficient Software
Principles and Techniques
Version 1.0.1
ralf.holly@approxion.com
The Art of Writing Efficient Software Copyright © 2013 Ralf Holly
"Having lost sight of our goals,
we redouble our efforts."
-- Mark Twain
Efficiency defined
Effectiveness:
► Means doing the right things
► Fulfills the functional requirements
► Example: sorting an array; the algorithm is effective if the array is sorted afterwards
Efficiency:
► Means doing the right things as well as possible
► "Well" means using as few resources as possible
► Example: using Quicksort (instead of Bubblesort)
► Space efficiency ("footprint"):
  ► Use as little memory as possible
► Run-time efficiency ("performance"):
  ► Use as little execution time as possible
Why is efficiency important?
Embedded systems
► Fixed requirements in terms of memory consumption, execution time, and cost
► Mass production: car control units, mobile phones, smart cards
► Efficient software yields efficiency in terms of energy consumption (battery lifetime)
User experience
► Slow software sucks
► Especially: games, user interfaces
⇨ Efficiency is an important sales factor!
An example
► Task: implement a fast 'memfill' routine
    void MemFill(uint8_t* p, uint16_t len, uint8_t fill) {
        ...
    }
► Optimize iteratively
► Development platform: Renesas H8/300H (16/32 bit)
► Toolchain: HEW C/C++ 5.03, "optimized for speed"
Version 1 -- Simple 'for' loop
    void MemFill1(uint8_t* p, uint16_t len, uint8_t fill) {
        uint16_t i;
        for (i = 0; i < len; i++) {
            *p++ = fill;
        }
    }
Version 1 -- Simple 'for' loop
Results show consumed CPU cycles on H8/300H:
len             1      5     10     20     30     50    100    200    500   1000  10000
Version 1      86    136    208    346    486    766   1466   2866   7066  14066 140068
Version 2 -- Simple 'while' loop
    void MemFill2(uint8_t* p, uint16_t len, uint8_t fill)
    {
        while (len-- > 0) {
            *p++ = fill;
        }
    }
Version 2 -- Simple 'while' loop
CPU cycles on H8/300H:
len             1      5     10     20     30     50    100    200    500   1000  10000
Version 1      86    136    208    346    486    766   1466   2866   7066  14066 140068
Version 2     104    170    248    410    570    890   1690   3290   8090  16090 160088
Lesson: "Don't assume anything!"
Version 3 -- Word access
    void MemFill3(uint8_t* p, uint16_t len, uint8_t fill) {
        uint16_t fill2 = (uint16_t)((fill << 8) | fill);  /* fill byte in both halves */
        uint16_t len2;
        if (len == 0) return;
        /* Align p to a word boundary first */
        if ((uintptr_t)p & 1) { *p++ = fill; --len; }
        len2 = len >> 1;                                   /* number of whole words */
        while (len2-- > 0) {
            *(uint16_t*)p = fill2;
            p += 2;
        }
        if (len & 1)                                       /* trailing odd byte */
            *p = fill;
    }
Downside: Need to take care of special cases
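The special cases named above (odd start address, odd length, zero length) are easy to get wrong, so it may be worth checking a word-wise fill exhaustively against the expected memory image. The following is a minimal sketch, not the author's test code: `fill_wordwise` and `fill_matches_reference` are hypothetical names, and the word store goes through `memcpy` (an assumption made here so the example stays well-defined on hosts with strict aliasing rules).

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Word-wise fill in the spirit of version 3 (hypothetical helper name). */
static void fill_wordwise(uint8_t *p, uint16_t len, uint8_t fill) {
    uint16_t fill2 = (uint16_t)(((uint16_t)fill << 8) | fill);
    uint16_t len2;
    assert(p != NULL || len == 0);
    if (len == 0) return;
    if ((uintptr_t)p & 1) { *p++ = fill; --len; }  /* align to a word boundary */
    len2 = len >> 1;
    while (len2-- > 0) {
        memcpy(p, &fill2, sizeof fill2);           /* one 16-bit store */
        p += 2;
    }
    if (len & 1) *p = fill;                        /* trailing odd byte */
}

/* Returns 1 if every start offset (0..3) and length (0..40) fills exactly
   the requested range and leaves all surrounding bytes untouched. */
static int fill_matches_reference(void) {
    static uint16_t storage[24];                   /* 48 bytes, word-aligned */
    uint8_t *buf = (uint8_t *)storage;
    unsigned offset, len, i;
    for (offset = 0; offset < 4; ++offset) {
        for (len = 0; len <= 40; ++len) {
            memset(buf, 0xAA, sizeof storage);
            fill_wordwise(buf + offset, (uint16_t)len, 0x55);
            for (i = 0; i < sizeof storage; ++i) {
                uint8_t expected = (i >= offset && i < offset + len) ? 0x55 : 0xAA;
                if (buf[i] != expected) return 0;
            }
        }
    }
    return 1;
}
```

Running `fill_matches_reference()` after every change to the fill routine would catch an off-by-one in the alignment or tail handling before any cycle counting starts.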
Version 3 -- Word access
CPU cycles on H8/300H:
len             1      5     10     20     30     50    100    200    500   1000  10000
Version 1      86    136    208    346    486    766   1466   2866   7066  14066 140068
Version 2     104    170    248    410    570    890   1690   3290   8090  16090 160088
Version 3     132    166    200    282    362    522    922   1722   4122   8122  80120
Loop unrolling
    /* Original loop */
    for (i = 0; i < n; ++i) {
        /* do something */
    }

    /* Unrolled by a factor of 8 */
    for (i = 0; i < n / 8; ++i) {
        /* do something */
        /* do something */
        /* do something */
        /* do something */
        /* do something */
        /* do something */
        /* do something */
        /* do something */
    }
    /* Remaining n % 8 iterations */
    for (i = 0; i < n % 8; ++i) {
        /* do something */
    }
Goal: Reduce loop overhead
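As a concrete instance of the unrolling pattern above, here is a small sketch that sums a byte array with the body repeated eight times per pass. The function name `sum_unrolled` is made up for illustration; whether this is actually faster than the plain loop depends on the compiler and target, so it would have to be measured.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Sums n bytes; the main loop handles 8 elements per iteration,
   the second loop handles the remaining n % 8 elements. */
uint32_t sum_unrolled(const uint8_t *data, size_t n) {
    uint32_t sum = 0;
    size_t i;
    assert(data != NULL || n == 0);
    for (i = 0; i < n / 8; ++i) {
        sum += data[0];
        sum += data[1];
        sum += data[2];
        sum += data[3];
        sum += data[4];
        sum += data[5];
        sum += data[6];
        sum += data[7];
        data += 8;
    }
    for (i = 0; i < n % 8; ++i) {
        sum += *data++;
    }
    return sum;
}
```

The loop counter is now compared and incremented once per eight additions instead of once per addition, which is where the reduced overhead comes from.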
Loop unrolling (Duff's Device)
    /* Note: n must be > 0; for n == 0 the body would run 8 times */
    int i = (n + 7) / 8;
    switch (n % 8) {
        case 0: do { /* do something */;
        case 7:      /* do something */;
        case 6:      /* do something */;
        case 5:      /* do something */;
        case 4:      /* do something */;
        case 3:      /* do something */;
        case 2:      /* do something */;
        case 1:      /* do something */;
                } while (--i > 0);
    }
► Tom Duff (1983, Lucasfilm)
► http://www.lysator.liu.se/c/duffs-device.html
► Is valid ANSI C
► "Reusable unrolling"
Loop unrolling (Duff's Device)
Let's put Duff's Device in a macro:
    /* duffTimes must be > 0 */
    #define DUFF_DEVICE(duffTimes, duffAction)  \
        do {                                    \
            uint16_t n = (duffTimes);           \
            uint16_t i = (n + 7) >> 3;          \
            switch (n & 7) {                    \
                case 0: do { duffAction;        \
                case 7:      duffAction;        \
                case 6:      duffAction;        \
                case 5:      duffAction;        \
                case 4:      duffAction;        \
                case 3:      duffAction;        \
                case 2:      duffAction;        \
                case 1:      duffAction;        \
                        } while (--i > 0);      \
            }                                   \
        } while (0)
Version 4 -- Word access + Duff's Device
    void MemFill4(uint8_t* p, uint16_t len, uint8_t fill) {
        uint16_t len2;
        if (len == 0) return;
        if ((uintptr_t)p & 1) { *p++ = fill; --len; }
        len2 = len >> 1;
        if (len2 != 0) {   /* DUFF_DEVICE requires a non-zero count */
            uint16_t fill2 = (uint16_t)((fill << 8) | fill);
            DUFF_DEVICE(len2, *(uint16_t*)p = fill2; p += 2;);
        }
        if (len & 1)
            *p = fill;
    }
Version 4 -- Word access + Duff's Device
CPU cycles on H8/300H:
len             1      5     10     20     30     50    100    200    500   1000  10000
Version 1      86    136    208    346    486    766   1466   2866   7066  14066 140068
Version 2     104    170    248    410    570    890   1690   3290   8090  16090 160088
Version 3     132    166    200    282    362    522    922   1722   4122   8122  80120
Version 4     150    224    248    282    312    378    552    888   1902   3588  33962
Version 5 -- Word access + Duff's Device + small length optimization
    void MemFill5(uint8_t* p, uint16_t len, uint8_t fill) {
        switch (len) {
            case 10: *p++ = fill;
            case 9:  *p++ = fill;
            case 8:  *p++ = fill;
            case 7:  *p++ = fill;
            case 6:  *p++ = fill;
            case 5:  *p++ = fill;
            case 4:  *p++ = fill;
            case 3:  *p++ = fill;
            case 2:  *p++ = fill;
            case 1:  *p = fill;
            case 0:  return;
            default:
                ;
        }
        ... rest as version 4 ...
Special treatment (similar to unrolling) for len <= 10
Version 5 -- Word access + Duff's Device + small length optimization
CPU cycles on H8/300H:
len             1      5     10     20     30     50    100    200    500   1000  10000
Version 1      86    136    208    346    486    766   1466   2866   7066  14066 140068
Version 2     104    170    248    410    570    890   1690   3290   8090  16090 160088
Version 3     132    166    200    282    362    522    922   1722   4122   8122  80120
Version 4     150    224    248    282    312    378    552    888   1902   3588  33962
Version 5     182    208    236    312    342    412    602    962   2052   3862  36480
This was a bad idea: now our code is so complicated that the optimizer gives up!
Version 6 -- Assembler
► Based on assembly output of version 5
► Optimized use of CPU registers
► Removed redundant store/load of some registers (push/pop)
► Removed redundant library call
► Removed redundant instruction in Duff's Device
Version 6 -- Assembler
CPU cycles on H8/300H (last row: speedup of version 6 over version 1):
len             1      5     10     20     30     50    100    200    500   1000  10000
Version 1      86    136    208    346    486    766   1466   2866   7066  14066 140068
Version 2     104    170    248    410    570    890   1690   3290   8090  16090 160088
Version 3     132    166    200    282    362    522    922   1722   4122   8122  80120
Version 4     150    224    248    282    312    378    552    888   1902   3588  33962
Version 5     182    208    236    312    342    412    602    962   2052   3862  36480
Version 6      94    108    138    186    216    282    456    792   1806   3492  33864
Speedup       0.9    1.3    1.5    1.9    2.3    2.7    3.2    3.6    3.9    4.0    4.1
Summary
Run-time improvements (speedup factor)
► 1.5 (len = 10)
► 3.2 (len = 100)
► 4.0 (len = 1000)
Biggest contribution: word access and loop unrolling
► Just re-implementing slow code in assembly language doesn't cut it!
Complexity increases
► SLOC
  before: 2 lines of straightforward C code
  after: ~100 lines of assembly code
► Cyclomatic complexity (McCabe factor)
  before: 2
  after: 20

Principles of Efficient Software
Principles vs. Techniques
Principles
► Universal applicability
► Easy to grasp
► Unspecific, hard to apply
► Principles often sound like truisms
Techniques
► More or less easy to grasp
► Specific, more or less easy to apply
► Limited applicability (dependent on context)
Principle 1: Don't optimize (yet)
► D. Knuth: "Premature optimization is the root of all evil"
► What he meant: focus on correctness and maintainability first
► Code tuning is time-consuming, risky, and expensive
► Maintainability and testability are reduced
► But: efficiency must be a requirements issue
  ► Difficult to add later
  ► Affects the choice of hardware, programming language, and architecture
► Measure and track efficiency from the early stages of development
Principle 2: Understand the system
Thorough understanding of the system is a prerequisite for efficiency
► How does the system work?
► How does it interact with other systems?
► What are the major use cases?
► Are there any real-time constraints?
► What is the system doing most of the time?
Principle 3: Measure, measure, measure...
► Don't assume inefficiency, prove it
► Be wary of old wives' tales
  ► Switch-case is faster than if-else
  ► Bitfields are more efficient than explicit ANDing/ORing
  ► Today, compiled code is better/worse than hand-written assembly code
► Inspect assembly output and count cycles
  ► Risky: prefetching, pipelining, caching are context-dependent
► Use a profiler
  ► Part of many toolchains (e.g. gprof, Lauterbach debugger)
  ► Finds hotspots/critical paths
► Every step of optimization must be measured
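Where no profiler or cycle counter is at hand, a crude timing harness can still back up or refute an assumption. The sketch below uses the standard `clock()` function on a host system; `time_fill`, `fill_fn`, and `fill_bytes` are hypothetical names, and on an embedded target one would more likely read a hardware timer or count cycles in a simulator instead.

```c
#include <assert.h>
#include <stdint.h>
#include <time.h>

/* Signature shared by all fill variants under test. */
typedef void (*fill_fn)(uint8_t *p, uint16_t len, uint8_t fill);

/* Baseline variant: simple byte-wise loop. */
void fill_bytes(uint8_t *p, uint16_t len, uint8_t fill) {
    while (len-- > 0) {
        *p++ = fill;
    }
}

/* Calls fn `reps` times and returns the consumed CPU time in seconds.
   The fill value changes per call so the work is not trivially redundant. */
double time_fill(fill_fn fn, uint8_t *buf, uint16_t len, unsigned long reps) {
    clock_t start, end;
    unsigned long r;
    assert(fn != NULL && buf != NULL);
    start = clock();
    for (r = 0; r < reps; ++r) {
        fn(buf, len, (uint8_t)r);
    }
    end = clock();
    return (double)(end - start) / CLOCKS_PER_SEC;
}
```

Calling `time_fill` once per candidate with identical buffer, length, and repetition count gives numbers that can be compared directly; the repetition count should be large enough that the result is well above the resolution of `clock()`.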
Intermezzo: SmartTime
      %   Hit     Func     Func     Func     Func   FChild   FChild   FChild   FChild  Func  Func
Runtime Count      Sum      Min      Max     Mean      Sum      Min      Max     Mean    ID  Name
---------------------------------------------------------------------------------------------------
  21.35   722   76.567    0.090    0.117    0.099  113.022    0.152    0.161    0.152  1202  ReadHeaderShort
  10.02    26   35.926    0.197    4.177    1.380  148.249    0.807   17.282    5.701  1218  GetChildByFID
   9.85  1060   35.308    0.027    0.108    0.027   35.792    0.027    0.125    0.027   588  pobjGetObjectHeader
   5.60     6   20.096    3.039    3.505    3.343   20.096    3.039    3.505    3.343   149  vHALWriteBlock
   4.53   136   16.242    0.108    0.170    0.117   20.500    0.143    0.206    0.143   529  MmFileRef2Ptr
   4.35   771   15.615    0.018    0.027    0.018   15.615    0.018    0.027    0.018   669  GetDataShort
   3.47     1   12.450   12.450   12.450   12.450  358.599  358.599  358.599  358.599    10  ROOT
   3.21   129   11.518    0.081    0.099    0.081   17.094    0.125    0.134    0.125  1200  ReadHeaderByte
   2.15    50    7.709    0.143    0.179    0.152   28.845    0.574    0.583    0.574  1197  GetFileSize
   2.01    32    7.216    0.090    0.332    0.224  286.424    0.090  128.216    8.946   925  InvokeBasic
   1.92    34    6.875    0.179    0.215    0.197   48.350    1.416    1.443    1.416  1196  peeGetFileBodyOffset
   1.83    73    6.570    0.081    0.099    0.090   18.187    0.242    0.278    0.242  1199  ReadHeaderByteUse
   1.19    29    4.258    0.134    0.152    0.143    6.266    0.197    0.260    0.215   917  Return
   1.06    25    3.810    0.125    0.188    0.152  283.484    0.305  128.395   11.339   932  InvokeExec
   0.98     1    3.532    3.532    3.532    3.532    3.532    3.532    3.532    3.532   139  UART_SendByteAndR
   0.90     1    3.227    3.227    3.227    3.227    4.912    4.912    4.912    4.912  1213  SearchADF
   0.75    40    2.698    0.018    0.583    0.063    2.698    0.018    0.583    0.063   690  MemFill2RAM
   0.75    52    2.698    0.045    0.054    0.045    3.254    0.054    0.063    0.054   858  GetCPEntryShort
   0.74    21    2.644    0.117    0.134    0.125   15.561    0.170   12.092    0.735   508  MmTaCommitAllTransa
   0.73   231    2.617    0.009    0.018    0.009    2.617    0.009    0.018    0.009   666  GetDataByte
   0.68    17    2.429    0.134    0.152    0.134    5.764    0.179    0.565    0.332   896  GetstaticExec
   0.61     2    2.187    1.094    1.094    1.094    2.187    1.094    1.094    1.094   648  MmSegCalcSatEdcHelper
   0.60     1    2.151    2.151    2.151    2.151    2.366    2.366    2.366    2.366   806  TM_SendDataSWget
Principle 4: Exploit concurrency
How not to cook spaghetti with tomato sauce:
Step                                   Time (min)  Total time (min)
Peel and chop onions                            1                 1
Peel and chop garlic                            1                 2
Heat olive oil in pan                           5                 7
Steam onions/garlic                             3                10
Add canned tomatoes                             1                11
Season with salt/pepper and taste               1                12
Simmer tomato sauce                            15                27
Grind parmesan                                  2                29
Bring salted water to boil                     10                39
Boil spaghetti                                 10                49
Prepare                                         2                51
Principle 4: Exploit concurrency
This tree shows the dependencies (duration in minutes in parentheses):
Prepare (2)
  Boil spaghetti (10)
    Bring salted water to boil (10)
  Simmer tomato sauce (15)
    Season with salt/pepper and taste (1)
      Add canned tomatoes (1)
        Steam onions/garlic (3)
          Heat olive oil in pan (5)
          Peel/chop onions (1)
          Peel/chop garlic (1)
  Grind parmesan (2)
Principle 4: Exploit concurrency
[Figure: timeline of the parallelized steps (Chop onions/garlic, Heat oil, Steam onions/garlic, Canned tomatoes, Seasoning/tasting, Simmer sauce, Grind parmesan, Boil water, Boil spaghetti, Prepare), ending at minute 27. Blue: chef waits; orange: chef works.]
Chef's load: 26%
The critical path starts with heating oil. Doable in 27 mins
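The 27-minute figure can be reproduced by computing the longest dependency chain through the recipe. The sketch below is only an illustration: the `Step` structure and function names are invented, and the dependency edges are an assumption read off the slide's tree (steaming waits for the hot oil and the chopped onions and garlic; "Prepare" waits for sauce, pasta, and parmesan).

```c
#include <assert.h>

enum { MAX_DEPS = 3 };

typedef struct {
    const char *name;
    int minutes;
    int deps[MAX_DEPS];   /* indices of prerequisite steps, -1 = unused */
} Step;

/* Ordered so that every prerequisite has a lower index than its dependent. */
static const Step STEPS[] = {
    /*  0 */ { "Peel and chop onions",        1, { -1, -1, -1 } },
    /*  1 */ { "Peel and chop garlic",        1, { -1, -1, -1 } },
    /*  2 */ { "Heat olive oil in pan",       5, { -1, -1, -1 } },
    /*  3 */ { "Steam onions/garlic",         3, {  0,  1,  2 } },
    /*  4 */ { "Add canned tomatoes",         1, {  3, -1, -1 } },
    /*  5 */ { "Season and taste",            1, {  4, -1, -1 } },
    /*  6 */ { "Simmer tomato sauce",        15, {  5, -1, -1 } },
    /*  7 */ { "Grind parmesan",              2, { -1, -1, -1 } },
    /*  8 */ { "Bring salted water to boil", 10, { -1, -1, -1 } },
    /*  9 */ { "Boil spaghetti",             10, {  8, -1, -1 } },
    /* 10 */ { "Prepare",                     2, {  6,  7,  9 } },
};
enum { STEP_COUNT = sizeof STEPS / sizeof STEPS[0] };

/* Total time if the steps are executed strictly one after another. */
int sequential_minutes(void) {
    int i, total = 0;
    for (i = 0; i < STEP_COUNT; ++i) total += STEPS[i].minutes;
    return total;
}

/* Length of the critical path: each step starts as soon as all of its
   prerequisites have finished (unlimited parallelism assumed). */
int critical_path_minutes(void) {
    int finish[STEP_COUNT];
    int i, d, longest = 0;
    for (i = 0; i < STEP_COUNT; ++i) {
        int start = 0;
        for (d = 0; d < MAX_DEPS; ++d) {
            int dep = STEPS[i].deps[d];
            if (dep >= 0) {
                assert(dep < i);          /* relies on the ordering above */
                if (finish[dep] > start) start = finish[dep];
            }
        }
        finish[i] = start + STEPS[i].minutes;
        if (finish[i] > longest) longest = finish[i];
    }
    return longest;
}
```

With these edges, `sequential_minutes()` yields 51 and `critical_path_minutes()` yields 27, matching the two slides. Shortening any step that is not on the oil-sauce-simmer chain would leave the 27 minutes unchanged.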
Principle 4: Exploit concurrency
Divide processes into independent steps
Execute independent steps in parallel
► Assign tasks to worker threads
► Example: C# BackgroundWorker class
► Increases "liveliness" of the system
  ► User input
  ► Network and file I/O
  ► Calculations
► Performance gain with multi-core systems
Only optimize along the performance-critical path!
Principle 5: Look for alternative algorithms/designs
► Optimization is a top-down process
► Code-level optimization often leads to "fast slow code"
► Best optimizations stem from key insights
► Example: quicksort vs. bubblesort
► Example: computation of greatest common divisor
Principle 5: Look for alternative algorithms/designs
    int gcdSimple(int a, int b) {
        int i;
        if (a < b) {    // Ensure a >= b
            i = b;
            b = a;
            a = i;
        }
        for (i = b; i > 0; --i) {
            if (   a % i == 0
                && b % i == 0 ) {
                return i;
            }
        }
        return 0;
    }
Brute-force approach
► Straightforward and obvious
► Up to 'b' loop iterations
► Up to 2 x 'b' integer operations
Euclid's key insight (~300 BC)
► gcd(a, b) == gcd(b, a % b)
Principle 5: Look for alternative algorithms/designs
    /* Precondition: a > 0 and b > 0 (b == 0 would divide by zero) */
    int gcdEuclid(int a, int b) {
        int i;
        if (a < b) {    // Ensure a >= b
            i = b;
            b = a;
            a = i;
        }
        for (;;) {
            i = a % b;
            if (i == 0)
                return b;
            a = b;
            b = i;
        }
    }
Number of integer operations:
a              64     314,159   23,456,472
b              54     271,828    2,324,328
gcdSimple      58     271,829    2,324,323
gcdEuclid       4           9           12
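The gcdEuclid row of the table can be checked by counting the modulo operations directly. The following is a small instrumented variant written for this purpose; `gcd_counted` and its `ops` out-parameter are not from the slides, and `long` is used so that the third test pair fits on targets with 16-bit `int`.

```c
#include <assert.h>
#include <stddef.h>

/* Euclid's algorithm; *ops receives the number of modulo operations.
   Precondition: a > 0 and b > 0. */
long gcd_counted(long a, long b, unsigned *ops) {
    long r;
    assert(a > 0 && b > 0 && ops != NULL);
    *ops = 0;
    for (;;) {
        r = a % b;
        ++*ops;
        if (r == 0) {
            return b;
        }
        a = b;
        b = r;
    }
}
```

For the three input pairs in the table this counts 4, 9, and 12 modulo operations, the same values as in the gcdEuclid row.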
Principle 6: Differentiate between normal and worst case
Different requirements
► Normal case: efficient
► Worst case: "only" correct; efficiency unimportant
Be careful with generic code and abstractions
► Increase maintainability
► Frequently decrease efficiency
Special case treatment
► Often ugly
► Often very efficient
Examples
► Speculative execution and branch prediction
► 2G SIM SELECT command
Intermezzo: 2G SELECT
SELECT command
► Selects a file/folder on a SIM card (smart card)
► Returns SIM card status information (access rights, PIN status, free memory)
GET RESPONSE command
► Transport layer command of T=0 protocol
► Generic command to fetch response data from a previous command to the SIM card
"Measure, measure, measure" principle
► Time to select a file: ~10 ms
► Time to build status information: ~10 - 90 ms
► SELECT command is issued a lot by mobile handsets
► In 90% of all cases, the handset doesn't want status info and hence sends no GET RESPONSE!
Intermezzo: 2G SELECT
"Differentiate between normal and worst case" principle
Modified SELECT command
► Selects file/folder as usual
► Doesn't build status information
► But remembers that SELECT was issued as last command
GET RESPONSE
► If last command was SELECT, build status information "just-in-time"
► Return status information to handset
Result:
► Worst case: GET RESPONSE 10 - 90 ms slower (but GET RESPONSE is not used a lot)
► Normal case: SELECT is 2 - 10 x faster
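The deferred ("just-in-time") status construction described above might look roughly like the following. This is a sketch only: `CardState`, `cmd_select`, `cmd_other`, `cmd_get_response`, and `build_status` are invented names, the status is reduced to a single byte, and the expensive step is replaced by a counter so that its invocations can be observed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct {
    uint16_t selected_fid;    /* currently selected file ID */
    int      status_pending;  /* 1 if the last command was SELECT */
    unsigned status_builds;   /* how often the expensive step actually ran */
} CardState;

/* Stand-in for the expensive status construction (10 - 90 ms on the card). */
static uint8_t build_status(CardState *cs) {
    cs->status_builds++;
    return (uint8_t)(cs->selected_fid & 0xFF);
}

/* SELECT: select the file, but only remember that a status may be requested. */
void cmd_select(CardState *cs, uint16_t fid) {
    assert(cs != NULL);
    cs->selected_fid = fid;
    cs->status_pending = 1;
}

/* Any other command invalidates the pending SELECT response. */
void cmd_other(CardState *cs) {
    assert(cs != NULL);
    cs->status_pending = 0;
}

/* GET RESPONSE: build the status now if the last command was SELECT.
   Returns 1 if response data was produced, 0 otherwise. */
int cmd_get_response(CardState *cs, uint8_t *status_out) {
    assert(cs != NULL && status_out != NULL);
    if (!cs->status_pending) {
        return 0;
    }
    *status_out = build_status(cs);
    cs->status_pending = 0;
    return 1;
}
```

In this arrangement any number of SELECT commands without a following GET RESPONSE never triggers `build_status`, which is the normal case the slides optimize for.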
Principle 7: Fight high multiplicity
Execution time
► Code that is executed frequently
► Examples for techniques
  ► Loop unrolling
  ► Inlining
Footprint
► Redundancy in code and data
► Examples for techniques
  ► Factor out common code (base classes, subroutines)
  ► Data compression
How to detect
► Profiler, run-time measurements
► Analyze map file
► ZIP data, measure compression rate
Principle 8: Caching
Keep data that is frequently accessed in memory
Data that is difficult to access
► Read cache
  ► Example: RAM, browser cache
► Write cache
  ► Example: Non-volatile memory
Data that is difficult to compute
► sin(), log(), data scaling and conversion
Verify efficacy by measuring cache hit rate!
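A cache for hard-to-compute values, together with the hit-rate measurement called for above, can be sketched as follows. Everything here is illustrative: the direct-mapped layout, the 16-slot size, and the names `Cache`, `cached_lookup`, `hit_rate`, and `expensive` are assumptions, and `expensive` is a trivial stand-in for a genuinely costly computation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { CACHE_SLOTS = 16 };

typedef struct {
    uint16_t key[CACHE_SLOTS];
    uint32_t value[CACHE_SLOTS];
    uint8_t  valid[CACHE_SLOTS];
    unsigned long hits;
    unsigned long misses;
} Cache;

/* Stand-in for a computation that is expensive on the real system. */
static uint32_t expensive(uint16_t x) {
    return (uint32_t)x * x;
}

/* Direct-mapped lookup: each key maps to exactly one slot (key % CACHE_SLOTS). */
uint32_t cached_lookup(Cache *c, uint16_t key) {
    unsigned slot = key % CACHE_SLOTS;
    assert(c != NULL);
    if (c->valid[slot] && c->key[slot] == key) {
        c->hits++;
        return c->value[slot];
    }
    c->misses++;
    c->key[slot] = key;
    c->value[slot] = expensive(key);
    c->valid[slot] = 1;
    return c->value[slot];
}

/* Fraction of lookups served from the cache (0.0 if nothing was looked up). */
double hit_rate(const Cache *c) {
    unsigned long total;
    assert(c != NULL);
    total = c->hits + c->misses;
    return total ? (double)c->hits / (double)total : 0.0;
}
```

A low hit rate under a realistic workload would indicate that the cache costs memory without saving time, for example because keys that share a slot keep evicting each other.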
Principle 9: Precompute results
Perform computations long before results are needed
► At system start-up
► At compile-time
Use look-up tables
► Example: CRC16 checksum algorithm
Principle 9: Precompute results
    uint16_t SimpleCRC16(uint8_t value, uint16_t crcin) {
        uint16_t k = (((crcin >> 8) ^ value) & 255) << 8;
        uint16_t crc = 0;
        uint16_t bits = 8;
        while (bits--) {
            if ((crc ^ k) & 0x8000)
                crc = (crc << 1) ^ 0x1021;
            else
                crc <<= 1;
            k <<= 1;
        }
        return ((crcin << 8) ^ crc);
    }
Principle 9: Precompute results
    /* 32 rows x 8 entries = 256 entries; middle rows elided */
    const uint16_t CRC16_TABLE[] = {
        0x0000, 0x1021, 0x2042, 0x3063, 0x4084, 0x50A5, 0x60C6, 0x70E7,
        0x8108, 0x9129, 0xA14A, 0xB16B, 0xC18C, 0xD1AD, 0xE1CE, 0xF1EF,
        0x1231, 0x0210, 0x3273, 0x2252, 0x52B5, 0x4294, 0x72F7, 0x62D6,
        :       :       :       :       :       :       :       :
        0x7C26, 0x6C07, 0x5C64, 0x4C45, 0x3CA2, 0x2C83, 0x1CE0, 0x0CC1,
        0xEF1F, 0xFF3E, 0xCF5D, 0xDF7C, 0xAF9B, 0xBFBA, 0x8FD9, 0x9FF8,
        0x6E17, 0x7E36, 0x4E55, 0x5E74, 0x2E93, 0x3EB2, 0x0ED1, 0x1EF0
    };
    uint16_t LookupCRC16(uint8_t value, uint16_t crcin)
    {
        return (uint16_t)((crcin << 8) ^
            CRC16_TABLE[(uint8_t)((crcin >> 8) ^ (value))]);
    }
Improvement: more than a factor of 6
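The table does not have to be typed in by hand; it can also be computed once at system start-up, which is the other precomputation option named earlier. The sketch below does this for polynomial 0x1021 and then uses the table for a buffer-wise update with the same per-byte step as `LookupCRC16`. The names `crc16_init`, `crc16_update`, and `crc16_table` are assumptions for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

static uint16_t crc16_table[256];
static int crc16_table_ready = 0;

/* Computes all 256 table entries bit by bit; call once at start-up. */
void crc16_init(void) {
    unsigned i, bit;
    for (i = 0; i < 256; ++i) {
        uint16_t crc = (uint16_t)(i << 8);
        for (bit = 0; bit < 8; ++bit) {
            if (crc & 0x8000) {
                crc = (uint16_t)((crc << 1) ^ 0x1021);
            } else {
                crc = (uint16_t)(crc << 1);
            }
        }
        crc16_table[i] = crc;
    }
    crc16_table_ready = 1;
}

/* Table-driven CRC update over a buffer: one lookup per input byte. */
uint16_t crc16_update(uint16_t crc, const uint8_t *data, size_t len) {
    assert(crc16_table_ready);
    assert(data != NULL || len == 0);
    while (len--) {
        uint8_t index = (uint8_t)((crc >> 8) ^ *data++);
        crc = (uint16_t)((crc << 8) ^ crc16_table[index]);
    }
    return crc;
}
```

The generated entries agree with the values shown on the slide (for example index 1 gives 0x1021 and index 255 gives 0x1EF0). Start-up generation trades 512 bytes of RAM and a short initialization delay for not storing the table in ROM; which is preferable depends on the target.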
Principle 10: Exploit the processor's architecture
Word-wise data processing
► Refer to MemFill example
Natural byte-ordering ("endianness")
► Frequently, protocols use a byte-ordering that differs from the byte-ordering of the target architecture
► No problem, as long as data is only stored
► If data is also used internally (for computations), introduce an endianness conversion layer:
  ► When data enters the system: convert external endianness → internal endianness
  ► Perform computations with internal endianness
  ► When data leaves the system: convert internal endianness → external endianness
Use portable integer types
► C99 stdint.h
► Exact-width types to store data (e.g. uint8_t)
► Fastest minimum-width types for computations (e.g. uint_fast8_t)
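A conversion layer of the kind described above can be as small as a pair of load/store helpers. The sketch assumes, purely for illustration, that the external format is 16-bit big-endian; `load_be16`, `store_be16`, and `increment_be16_field` are hypothetical names. Because the helpers work byte by byte, they behave the same on little-endian and big-endian hosts.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* External (big-endian) bytes -> internal host representation. */
uint16_t load_be16(const uint8_t *p) {
    assert(p != NULL);
    return (uint16_t)(((uint16_t)p[0] << 8) | p[1]);
}

/* Internal host representation -> external (big-endian) bytes. */
void store_be16(uint8_t *p, uint16_t v) {
    assert(p != NULL);
    p[0] = (uint8_t)(v >> 8);
    p[1] = (uint8_t)(v & 0xFF);
}

/* Example computation: convert on entry, compute natively, convert on exit. */
void increment_be16_field(uint8_t *field) {
    uint16_t v = load_be16(field);
    v = (uint16_t)(v + 1);
    store_be16(field, v);
}
```

All arithmetic between the two conversions runs on native integers, so the conversion cost is paid once per value rather than once per operation.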
Principle 11: Recode in assembly language
After all other possibilities have been tried:
► No compiler/optimizer is perfect
► Jumps to arbitrary locations possible
► Stack manipulations
► Self-modifying code
► Some instructions have no high-level language equivalent:
    ADC           ; add with carry
    ROL, ROR      ; rotate left, right
    JC, JNE, JPL  ; branch based on flags
    DIV           ; div and mod at the same time
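Before dropping to assembly for instructions like ROL or DIV, it can be worth checking what the compiler makes of the closest C formulation. The sketch below shows two such formulations: a shift/OR rotate idiom, which compilers may or may not translate into a single rotate instruction, and the standard `div()` function from `<stdlib.h>`, which returns quotient and remainder together. `rotl16` and `divmod` are illustrative names; inspecting the generated assembly is the only way to know what a given toolchain emits.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Rotates x left by n bits; n is reduced modulo 16 so n == 0 and n == 16 are safe. */
uint16_t rotl16(uint16_t x, unsigned n) {
    n &= 15u;
    return (uint16_t)(((uint32_t)x << n) | ((uint32_t)x >> ((16u - n) & 15u)));
}

/* Obtains quotient and remainder from one library call. */
void divmod(int num, int den, int *quot, int *rem) {
    div_t r;
    assert(den != 0 && quot != NULL && rem != NULL);
    r = div(num, den);
    *quot = r.quot;
    *rem = r.rem;
}
```

If the assembly output shows a loop or a library call where a single instruction would do, that is a concrete, measured argument for a hand-written routine.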
Principle 11: Recode in assembly language
Important: Don't think like a compiler!
Instruction-level optimizations are good
► Analyze generated assembly code
► Remove redundant instructions
► Replace inefficient instructions with more efficient instructions
But so-called "local optimization" is much better
► View instructions as tools/building blocks
► Combine these building blocks in an efficient (creative!) way
Intermezzo: Local optimization
Michael Abrash's programming challenge (ca. 1991)
► Write a function that finds the smallest/biggest value in an array
► Use less than 24 bytes of memory
► x86 assembly language (at the time: 16-bit)
    unsigned int FindHigh(int len, unsigned int* buffer);
    unsigned int FindLow(int len, unsigned int* buffer);
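For reference, a straightforward C version of the two functions in the challenge is sketched below. It is not part of the challenge or its solutions and makes no attempt at the size limit; it merely pins down the expected behaviour (assuming `len > 0`) against which an assembly version could be tested.

```c
#include <assert.h>
#include <stddef.h>

/* Returns the smallest value in buffer[0..len-1]; requires len > 0. */
unsigned int FindLow(int len, unsigned int *buffer) {
    unsigned int low;
    int i;
    assert(len > 0 && buffer != NULL);
    low = buffer[0];
    for (i = 1; i < len; ++i) {
        if (buffer[i] < low) {
            low = buffer[i];
        }
    }
    return low;
}

/* Returns the biggest value in buffer[0..len-1]; requires len > 0. */
unsigned int FindHigh(int len, unsigned int *buffer) {
    unsigned int high;
    int i;
    assert(len > 0 && buffer != NULL);
    high = buffer[0];
    for (i = 1; i < len; ++i) {
        if (buffer[i] > high) {
            high = buffer[i];
        }
    }
    return high;
}
```

The two bodies differ only in the direction of one comparison, which hints at why sharing nearly all of the code between both functions is possible at all.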
Intermezzo: Local optimization
    _FindLow:   pop  ax          ; get return address
                pop  dx          ; get len
                pop  bx          ; get data pointer
                push bx
                push dx
                push ax
    save:       mov  ax, [bx]    ; store min
    top:        cmp  ax, [bx]    ; compare current val to min
                ja   save        ; if smaller, save new min
                inc  bx          ; advance to next val, 1st byte
                inc  bx          ; advance to next val, 2nd byte
                dec  dx          ; decrement loop counter
                jnz  top         ; next iteration
                ret
Nice, for sure...
Intermezzo: Local optimization
But he wanted both functions in 24 bytes!
    _FindHigh:  db   0b9h        ; opcode of 'mov cx, imm16'; the next two bytes (31 C9) become its operand
    _FindLow:   xor  cx, cx      ; 31 C9
                pop  ax          ; get return address
                pop  dx          ; get len
                pop  bx          ; get data pointer
                push bx
                push dx
                push ax
    save:       mov  ax, [bx]
    top:        cmp  ax, [bx]
                jcxz around      ; depending on compare mode
                cmc              ; invert compare result
    around:     ja   save
                inc  bx
                inc  bx
                dec  dx
                jnz  top
                ret
Study this carefully! There is a lot to be learned!
Summary
There will always be a need for efficient software
Efficiency must be a requirements issue
Correctness and maintainability usually have higher priority
Prove that an optimization is necessary
Prove that optimization works
Prerequisites: knowledge and a systematic approach
For outstanding efficiency: creativity and passion required
Michael Abrash: "The best optimizer is between your ears"
http://www.approxion.com

Understanding the Laravel MVC ArchitecturePixlogix Infotech
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptxHampshireHUG
 


The Art of Writing Efficient Software Principles

  • 1. The Art of Writing Efficient Software Principles and Techniques Version 1.0.1 ralf.holly@approxion.com
  • 2. The Art of Writing Efficient Software Copyright © 2013 Ralf Holly “Having lost sight of our goals, we redouble our efforts.” -- Mark Twain
  • 3. Efficiency defined
    Effectiveness:
    ► Means doing the right things
    ► Fulfill functional requirements
    ► Example: sorting an array; the algorithm is effective if the array is sorted afterwards
    Efficiency:
    ► Means doing the right things as well as possible
    ► "Well" means using as few resources as possible
    ► Example: using Quicksort (instead of Bubblesort)
    ► Space efficiency ("footprint"):
      ► Use as little memory as possible
    ► Run-time efficiency ("performance"):
      ► Use as little execution time as possible
  • 4. Why is efficiency important?
    Embedded systems
    ► Fixed requirements in terms of memory consumption/execution time/cost
    ► Mass production: car control units, mobile phones, smart cards
    ► Efficient software yields efficiency in terms of energy consumption (battery lifetime)
    User experience
    ► Slow software sucks
    ► Especially: games, user interfaces
    ⇒ Efficiency is an important sales factor!
  • 5. An example
    ► Task: implement a fast 'memfill' routine
        void MemFill(uint8_t* p, uint16_t len, uint8_t fill) {
            ...
        }
    ► Optimize iteratively
    ► Development platform: Renesas H8/300H (16/32 bit)
    ► Toolchain: HEW C/C++ 5.03, "optimized for speed"
  • 6. Version 1 -- Simple 'for' loop
        void MemFill1(uint8_t* p, uint16_t len, uint8_t fill) {
            uint16_t i;
            for (i = 0; i < len; i++) {
                *p++ = fill;
            }
        }
  • 7. Version 1 -- Simple 'for' loop
      len           1      5     10     20     30     50    100    200    500   1000  10000
      V1           86    136    208    346    486    766   1466   2866   7066  14066 140068
    Results show consumed CPU cycles on H8/300H (one row per version)
  • 8. Version 2 -- Simple 'while' loop
        void MemFill2(uint8_t* p, uint16_t len, uint8_t fill) {
            while (len-- > 0) {
                *p++ = fill;
            }
        }
  • 9. Version 2 -- Simple 'while' loop
      len           1      5     10     20     30     50    100    200    500   1000  10000
      V1           86    136    208    346    486    766   1466   2866   7066  14066 140068
      V2          104    170    248    410    570    890   1690   3290   8090  16090 160088
    Lesson: "Don't assume anything!"
  • 10. Version 3 -- Word access
        void MemFill3(uint8_t* p, uint16_t len, uint8_t fill) {
            uint16_t fill2 = fill << 8 | fill;
            uint16_t len2;
            if (len == 0) return;
            if ((uint8_t)p & 1) {
                *p++ = fill;
                --len;
            }
            len2 = len >> 1;
            while (len2-- > 0) {
                *(uint16_t*)p = fill2;
                p += 2;
            }
            if (len & 1) *p = fill;
        }
    Downside: Need to take care of special cases
  • 11. Version 3 -- Word access
      len           1      5     10     20     30     50    100    200    500   1000  10000
      V1           86    136    208    346    486    766   1466   2866   7066  14066 140068
      V2          104    170    248    410    570    890   1690   3290   8090  16090 160088
      V3          132    166    200    282    362    522    922   1722   4122   8122  80120
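The word-access idea of Version 3 can also be written without the target-specific pointer casts. The following is a minimal sketch, not the author's code: the name `MemFillWords` is hypothetical, the alignment test uses `uintptr_t`, and the 16-bit store goes through `memcpy`, which compilers typically turn into a single word store.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical portable variant of the word-access fill (not from the slides). */
void MemFillWords(uint8_t* p, uint16_t len, uint8_t fill) {
    /* Both bytes of the word carry the fill value, so byte order does not matter. */
    uint16_t fill2 = (uint16_t)(((uint16_t)fill << 8) | fill);
    uint16_t words;

    if (len == 0) return;

    /* Special case 1: odd start address, write one byte to reach alignment. */
    if ((uintptr_t)p & 1u) {
        *p++ = fill;
        --len;
    }

    /* Aligned middle part: two bytes per iteration. */
    for (words = len >> 1; words > 0; --words) {
        memcpy(p, &fill2, sizeof fill2);
        p += 2;
    }

    /* Special case 2: one byte left over if the remaining length was odd. */
    if (len & 1u) *p = fill;
}
```

Whether this beats the byte loop still has to be measured on the actual target, as the cycle counts above show for small lengths.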
  • 12. Loop unrolling
        for (i = 0; i < n; ++i) {
            /* do something */
        }

        for (i = 0; i < n / 8; ++i) {
            /* do something */
            /* do something */
            /* do something */
            /* do something */
            /* do something */
            /* do something */
            /* do something */
            /* do something */
        }
        for (i = 0; i < n % 8; ++i) {
            /* do something */
        }
    Goal: Reduce loop overhead
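As a concrete instance of the pattern above, here is a small sketch that sums a byte array with an eightfold unrolled main loop followed by a remainder loop. The function name `SumUnrolled` is an assumption made for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical example: sum n bytes, loop body unrolled 8 times. */
uint32_t SumUnrolled(const uint8_t* data, size_t n) {
    uint32_t sum = 0;
    size_t i;

    /* Main loop: n / 8 passes, eight elements per pass,
       so the loop counter is tested once per eight additions. */
    for (i = 0; i < n / 8; ++i) {
        sum += data[0];
        sum += data[1];
        sum += data[2];
        sum += data[3];
        sum += data[4];
        sum += data[5];
        sum += data[6];
        sum += data[7];
        data += 8;
    }

    /* Remainder loop: the n % 8 elements that did not fit a full pass. */
    for (i = 0; i < n % 8; ++i) {
        sum += *data++;
    }
    return sum;
}
```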
  • 13. Loop unrolling (Duff's Device)
        int i = (n + 7) / 8;
        switch (n % 8) {
        case 0: do { /* do something */;
        case 7:      /* do something */;
        case 6:      /* do something */;
        case 5:      /* do something */;
        case 4:      /* do something */;
        case 3:      /* do something */;
        case 2:      /* do something */;
        case 1:      /* do something */;
                } while (--i > 0);
        }
    ► Tom Duff (1983, Lucasfilm)
    ► http://www.lysator.liu.se/c/duffs-device.html
    ► is valid ANSI C
    ► "reusable unrolling"
  • 14. Loop unrolling (Duff's Device)
        #define DUFF_DEVICE(duffTimes, duffAction) \
            do {                                   \
                U16 n = (duffTimes);               \
                U16 i = (n + 7) >> 3;              \
                switch (n & 7) {                   \
                case 0: do { duffAction;           \
                case 7:      duffAction;           \
                case 6:      duffAction;           \
                case 5:      duffAction;           \
                case 4:      duffAction;           \
                case 3:      duffAction;           \
                case 2:      duffAction;           \
                case 1:      duffAction;           \
                        } while (--i > 0);         \
                }                                  \
            } while (0)
    Let's put Duff's Device in a macro
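One detail of the device is easy to miss: for a count of zero the `switch` jumps to `case 0` and the body still runs, so callers have to guard against it (Version 4 does this with `if (len2 != 0)`). A minimal sketch of a byte fill built on the device, with the guard inside the function; the name `FillDuff` is hypothetical:

```c
#include <stdint.h>

/* Hypothetical example: byte fill using Duff's Device with an explicit zero guard. */
void FillDuff(uint8_t* p, uint16_t n, uint8_t fill) {
    uint16_t passes;

    /* Without this guard, n == 0 would enter at 'case 0' and write anyway. */
    if (n == 0) return;

    passes = (uint16_t)((n + 7u) >> 3);  /* number of trips through the do-while */
    switch (n & 7u) {                    /* jump into the middle of the first trip */
    case 0: do { *p++ = fill;
    case 7:      *p++ = fill;
    case 6:      *p++ = fill;
    case 5:      *p++ = fill;
    case 4:      *p++ = fill;
    case 3:      *p++ = fill;
    case 2:      *p++ = fill;
    case 1:      *p++ = fill;
            } while (--passes > 0);
    }
}
```

For `n = 9` the switch enters at `case 1`, writes one byte, and then makes one full pass of eight.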
  • 15. Version 4 -- Word access + Duff's Device
        void MemFill4(uint8_t* p, uint16_t len, uint8_t fill) {
            uint16_t len2;
            if (len == 0) return;
            if ((uint8_t)p & 1) {
                *p++ = fill;
                --len;
            }
            len2 = len >> 1;
            if (len2 != 0) {
                uint16_t fill2 = fill << 8 | fill;
                DUFF_DEVICE(len2, *(uint16_t*)p = fill2; p += 2;);
            }
            if (len & 1) *p = fill;
        }
  • 16. Version 4 -- Word access + Duff's Device
      len           1      5     10     20     30     50    100    200    500   1000  10000
      V1           86    136    208    346    486    766   1466   2866   7066  14066 140068
      V2          104    170    248    410    570    890   1690   3290   8090  16090 160088
      V3          132    166    200    282    362    522    922   1722   4122   8122  80120
      V4          150    224    248    282    312    378    552    888   1902   3588  33962
  • 17. Version 5 -- Word access + Duff's Device + small length optimization
        void MemFill5(uint8_t* p, uint16_t len, uint8_t fill) {
            switch (len) {
            case 10: *p++ = fill;
            case 9:  *p++ = fill;
            case 8:  *p++ = fill;
            case 7:  *p++ = fill;
            case 6:  *p++ = fill;
            case 5:  *p++ = fill;
            case 4:  *p++ = fill;
            case 3:  *p++ = fill;
            case 2:  *p++ = fill;
            case 1:  *p = fill;
            case 0:  return;
            default: ;
            };
            ... rest as version 4 ...
    Special treatment (similar to unrolling) for len <= 10
  • 18. Version 5 -- Word access + Duff's Device + small length optimization
      len           1      5     10     20     30     50    100    200    500   1000  10000
      V1           86    136    208    346    486    766   1466   2866   7066  14066 140068
      V2          104    170    248    410    570    890   1690   3290   8090  16090 160088
      V3          132    166    200    282    362    522    922   1722   4122   8122  80120
      V4          150    224    248    282    312    378    552    888   1902   3588  33962
      V5          182    208    236    312    342    412    602    962   2052   3862  36480
    This was a bad idea: now our code is so complicated that the optimizer gives up!
  • 19. Version 6 -- Assembler
    ► Based on assembly output of version 5
    ► Optimized use of CPU registers
    ► Removed redundant store/load of some registers (push/pop)
    ► Removed redundant library call
    ► Removed redundant instruction in Duff's Device
  • 20. Version 6 -- Assembler
      len           1      5     10     20     30     50    100    200    500   1000  10000
      V1           86    136    208    346    486    766   1466   2866   7066  14066 140068
      V2          104    170    248    410    570    890   1690   3290   8090  16090 160088
      V3          132    166    200    282    362    522    922   1722   4122   8122  80120
      V4          150    224    248    282    312    378    552    888   1902   3588  33962
      V5          182    208    236    312    342    412    602    962   2052   3862  36480
      V6           94    108    138    186    216    282    456    792   1806   3492  33864
      Speedup     0.9    1.3    1.5    1.9    2.3    2.7    3.2    3.6    3.9    4.0    4.1
    (Speedup = cycles of V1 divided by cycles of V6)
  • 21. Summary
    Run-time improvements
    ► 1.5 (len = 10)
    ► 3.2 (len = 100)
    ► 4.0 (len = 1000)
    Biggest contribution: word access and loop unrolling
    ► Just re-implementing slow code in assembly language doesn't cut it!
    Complexity increases
    ► SLOC
      before: 2 lines of straightforward C code
      afterwards: ~100 lines of assembly code
    ► Cyclomatic complexity (McCabe factor)
      before: 2
      after: 20
  • 23. Principles vs. Techniques
    Principles
    ► Universal applicability
    ► Easy to grasp
    ► Unspecific, hard to apply
    ► Principles often sound like truisms
    Techniques
    ► More or less easy to grasp
    ► Specific, more or less easy to apply
    ► Limited applicability (dependent on context)
  • 24. Principle 1: Don't optimize (yet)
    ► D. Knuth: "Premature optimization is the root of all evil"
    ► What he meant: focus on correctness, maintainability
    ► Code tuning is time-consuming, risky, and expensive
    ► Maintainability and testability are reduced
    ► But: efficiency must be a requirements issue
      ► Difficult to add later
      ► Choice of hardware/programming language/architecture
      ► Measure/track efficiency already in early stages of development
  • 25. Principle 2: Understand the system
    Thorough understanding of the system is a prerequisite for efficiency
    ► How does the system work?
    ► How does it interact with other systems?
    ► What are the major use cases?
    ► Are there any real-time constraints?
    ► What is the system doing most of the time?
  • 26. Principle 3: Measure, measure, measure...
    ► Don't assume inefficiency, prove it
    ► Be wary of old wives' tales
      ► Switch-case is faster than if-else
      ► Bitfields are more efficient than explicit ANDing/ORing
      ► Today, compiled code is better/worse than hand-written assembly code
    ► Inspect assembly output and count cycles
      ► Risky: prefetching, pipelining, caching are context-dependent
    ► Use a profiler
      ► Part of many toolchains (e.g. gprof, Lauterbach Debugger)
      ► Finds hotspots/critical paths
    ► Every step of optimization must be measured
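Where no cycle counter or profiler is at hand, a crude timing harness is often enough to compare two variants. The sketch below is an assumption-laden stand-in for the cycle counts used in the slides: it uses the standard `clock()` function on a host system, and the names `MeasureSeconds` and `FillLoop` are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>
#include <time.h>

/* Signature shared by the fill routines under test. */
typedef void (*FillFn)(uint8_t* p, uint16_t len, uint8_t fill);

/* Reference implementation to measure (same shape as Version 1). */
void FillLoop(uint8_t* p, uint16_t len, uint8_t fill) {
    uint16_t i;
    for (i = 0; i < len; i++) {
        *p++ = fill;
    }
}

/* Run 'fn' 'repeats' times and return the consumed CPU time in seconds.
   The fill value changes per call so the work is not trivially redundant. */
double MeasureSeconds(FillFn fn, uint8_t* buf, uint16_t len, unsigned long repeats) {
    unsigned long r;
    clock_t start = clock();
    for (r = 0; r < repeats; ++r) {
        fn(buf, len, (uint8_t)r);
    }
    return (double)(clock() - start) / CLOCKS_PER_SEC;
}
```

A typical use is to call `MeasureSeconds` once per candidate with the same buffer, length, and repeat count, and to compare the results. The resolution of `clock()` is coarse, so the repeat count has to be large enough for the differences to show.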
  • 27. Intermezzo: SmartTime
          %   Hit    Func   Func   Func   Func   FChild  FChild  FChild  FChild  Func  Func
    Runtime Count     Sum    Min    Max   Mean      Sum     Min     Max    Mean    ID  Name
    ---------------------------------------------------------------------------------------------------
      21.35   722  76.567  0.090  0.117  0.099  113.022   0.152   0.161   0.152  1202  ReadHeaderShort
      10.02    26  35.926  0.197  4.177  1.380  148.249   0.807  17.282   5.701  1218  GetChildByFID
       9.85  1060  35.308  0.027  0.108  0.027   35.792   0.027   0.125   0.027   588  pobjGetObjectHeader
       5.60     6  20.096  3.039  3.505  3.343   20.096   3.039   3.505   3.343   149  vHALWriteBlock
       4.53   136  16.242  0.108  0.170  0.117   20.500   0.143   0.206   0.143   529  MmFileRef2Ptr
       4.35   771  15.615  0.018  0.027  0.018   15.615   0.018   0.027   0.018   669  GetDataShort
       3.47     1  12.450 12.450 12.450 12.450  358.599 358.599 358.599 358.599    10  ROOT
       3.21   129  11.518  0.081  0.099  0.081   17.094   0.125   0.134   0.125  1200  ReadHeaderByte
       2.15    50   7.709  0.143  0.179  0.152   28.845   0.574   0.583   0.574  1197  GetFileSize
       2.01    32   7.216  0.090  0.332  0.224  286.424   0.090 128.216   8.946   925  InvokeBasic
       1.92    34   6.875  0.179  0.215  0.197   48.350   1.416   1.443   1.416  1196  peeGetFileBodyOffset
       1.83    73   6.570  0.081  0.099  0.090   18.187   0.242   0.278   0.242  1199  ReadHeaderByteUse
       1.19    29   4.258  0.134  0.152  0.143    6.266   0.197   0.260   0.215   917  Return
       1.06    25   3.810  0.125  0.188  0.152  283.484   0.305 128.395  11.339   932  InvokeExec
       0.98     1   3.532  3.532  3.532  3.532    3.532   3.532   3.532   3.532   139  UART_SendByteAndR
       0.90     1   3.227  3.227  3.227  3.227    4.912   4.912   4.912   4.912  1213  SearchADF
       0.75    40   2.698  0.018  0.583  0.063    2.698   0.018   0.583   0.063   690  MemFill2RAM
       0.75    52   2.698  0.045  0.054  0.045    3.254   0.054   0.063   0.054   858  GetCPEntryShort
       0.74    21   2.644  0.117  0.134  0.125   15.561   0.170  12.092   0.735   508  MmTaCommitAllTransa
       0.73   231   2.617  0.009  0.018  0.009    2.617   0.009   0.018   0.009   666  GetDataByte
       0.68    17   2.429  0.134  0.152  0.134    5.764   0.179   0.565   0.332   896  GetstaticExec
       0.61     2   2.187  1.094  1.094  1.094    2.187   1.094   1.094   1.094   648  MmSegCalcSatEdcHelper
       0.60     1   2.151  2.151  2.151  2.151    2.366   2.366   2.366   2.366   806  TM_SendDataSWget

  • 28. Principle 4: Exploit concurrency
    How not to cook spaghetti with tomato sauce:
      Step                                  Time  Total Time
      Peel and chop onions                     1           1
      Peel and chop garlic                     1           2
      Heat olive oil in pan                    5           7
      Steam onions/garlic                      3          10
      Add canned tomatoes                      1          11
      Season with salt/pepper and tasting      1          12
      Simmer tomato sauce                     15          27
      Grind parmesan                           2          29
      Bring salted water to boil              10          39
      Boil spaghetti                          10          49
      Prepare                                  2          51
  • 29. Principle 4: Exploit concurrency
      Prepare (2)
        Boil spaghetti (10)
          Bring salted water to boil (10)
        Simmer tomato sauce (15)
          Season with salt/pepper and tasting (1)
            Add canned tomatoes (1)
              Steam onions/garlic (3)
                Heat olive oil in pan (5)
                Peel/chop onions (1)
                Peel/chop garlic (1)
        Grind parmesan (2)
    This tree shows the dependencies
  • 30. Principle 4: Exploit concurrency
    [Gantt chart of the parallel schedule: Prepare, Boil spaghetti, Boil water, Simmer sauce, Heat oil, Chop onions/garlic, Steam onions/garlic, Grind parmesan, Canned tomatoes, Seasoning/tasting; blue: chef waits, orange: chef works]
    Chef's load: 26%
    The critical path starts with heating oil. Doable in 27 mins
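The 27 minutes can be checked by computing the longest path through the dependency tree. The sketch below encodes the steps and durations from the recipe; the dependency links are one reading of the flattened tree on the previous slide and should be treated as an assumption, as should all identifiers. It also assumes that independent steps can always run in parallel.

```c
#include <stddef.h>

enum { NUM_STEPS = 11, NO_SUCCESSOR = -1 };

typedef struct {
    const char* name;
    int minutes;
    int successor;  /* index of the step that has to wait for this one */
} Step;

static const Step STEPS[NUM_STEPS] = {
    /*  0 */ { "Prepare",                     2, NO_SUCCESSOR },
    /*  1 */ { "Boil spaghetti",             10, 0 },
    /*  2 */ { "Bring salted water to boil", 10, 1 },
    /*  3 */ { "Simmer tomato sauce",        15, 0 },
    /*  4 */ { "Season and taste",            1, 3 },
    /*  5 */ { "Add canned tomatoes",         1, 4 },
    /*  6 */ { "Steam onions/garlic",         3, 5 },
    /*  7 */ { "Heat olive oil in pan",       5, 6 },
    /*  8 */ { "Peel/chop onions",            1, 6 },
    /*  9 */ { "Peel/chop garlic",            1, 6 },
    /* 10 */ { "Grind parmesan",              2, 0 },
};

/* Earliest finish time of step s: its own duration plus the latest
   finish time among all steps it has to wait for. */
int EarliestFinish(int s) {
    int latest = 0;
    int i;
    for (i = 0; i < NUM_STEPS; ++i) {
        if (STEPS[i].successor == s) {
            int f = EarliestFinish(i);
            if (f > latest) latest = f;
        }
    }
    return latest + STEPS[s].minutes;
}

/* Total time when every step is executed one after the other. */
int SequentialTotal(void) {
    int total = 0;
    int i;
    for (i = 0; i < NUM_STEPS; ++i) total += STEPS[i].minutes;
    return total;
}
```

`EarliestFinish(0)` follows the chain heat oil, steam, tomatoes, season, simmer, prepare (5 + 3 + 1 + 1 + 15 + 2), while `SequentialTotal()` adds up all eleven steps.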
  • 31. Principle 4: Exploit concurrency
    Divide processes into independent steps
    Execute independent steps in parallel
    ► Assign tasks to worker threads
      ► Example: C# BackgroundWorker class
    ► Increases "liveliness" of the system
      ► User input
      ► Network and file I/O
      ► Calculations
    ► Performance gain with multi-core systems
    Only optimize along the performance-critical path!
  • 32. Principle 5: Look for alternative algorithms/designs
    ► Optimization is a top-down process
    ► Code-level optimization often leads to "fast slow code"
    ► Best optimizations stem from key insights
      ► Example: quicksort vs. bubblesort
      ► Example: computation of greatest common divisor
  • 33. Principle 5: Look for alternative algorithms/designs
        int gcdSimple(int a, int b) {
            int i;
            if (a < b) { // Ensure a >= b
                i = b;
                b = a;
                a = i;
            }
            for (i = b; i > 0; --i) {
                if ( a % i == 0 && b % i == 0 ) {
                    return i;
                }
            }
            return 0;
        }
    Brute-force approach
    ► Straightforward and obvious
    ► Up to 'b' loop iterations
    ► Up to 2 x 'b' integer operations
    Euclid's key insight (~300 BC)
    ► gcd(a, b) == gcd(b, a % b)
  • 34. Principle 5: Look for alternative algorithms/designs
        int gcdEuclid(int a, int b) {
            int i;
            if (a < b) { // Ensure a >= b
                i = b;
                b = a;
                a = i;
            }
            for (;;) {
                i = a % b;
                if (i == 0) return b;
                a = b;
                b = i;
            }
        }
    Number of integer operations:
      a                 64    314,159   23,456,472
      b                 54    271,828    2,324,328
      gcdSimple         58    271,829    2,324,323
      gcdEuclid          4          9           12
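The operation counts in the table can be reproduced by routing every `%` through a counting helper. The sketch below does that for both functions; it assumes positive inputs (as the slides do), uses `int32_t` so the largest example fits on 16-bit targets as well, and all identifiers ending in `Counted` as well as `g_modOps` are hypothetical.

```c
#include <stdint.h>

/* Number of modulo operations performed since the last reset. */
unsigned long g_modOps = 0;

static int32_t CountedMod(int32_t a, int32_t b) {
    ++g_modOps;
    return a % b;
}

/* Brute force: try every candidate from b downwards. */
int32_t GcdSimpleCounted(int32_t a, int32_t b) {
    int32_t i;
    if (a < b) { i = b; b = a; a = i; }  /* ensure a >= b */
    for (i = b; i > 0; --i) {
        /* The second modulo only runs if the first one was zero. */
        if (CountedMod(a, i) == 0 && CountedMod(b, i) == 0) {
            return i;
        }
    }
    return 0;
}

/* Euclid: gcd(a, b) == gcd(b, a % b). Requires b > 0. */
int32_t GcdEuclidCounted(int32_t a, int32_t b) {
    int32_t i;
    if (a < b) { i = b; b = a; a = i; }  /* ensure a >= b */
    for (;;) {
        i = CountedMod(a, b);
        if (i == 0) return b;
        a = b;
        b = i;
    }
}
```

For `a = 64, b = 54` the brute-force version performs 53 first-operand checks plus 5 second-operand checks (for the candidates 32, 16, 8, 4, and 2), which gives the 58 from the table; Euclid needs 4.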
  • 35. Principle 6: Differentiate between normal and worst case
    Different requirements
    ► Normal case: efficient
    ► Worst case: "only" correct; efficiency unimportant
    Be careful with generic code and abstractions
    ► Increase maintainability
    ► Frequently decrease efficiency
    Special case treatment
    ► Often ugly
    ► Often very efficient
    Examples
    ► Speculative execution and branch prediction
    ► 2G SIM SELECT command
  • 36. Intermezzo: 2G SELECT
    SELECT command
    ► Selects a file/folder on a SIM card (smart card)
    ► Returns SIM card status information (access rights, PIN status, free memory)
    GET RESPONSE command
    ► Transport layer command of T=0 protocol
    ► Generic command to fetch response data from a previous command to the SIM card
    "Measure, measure, measure" principle
    ► Time to select a file: ~10 ms
    ► Time to build status information: ~10 - 90 ms
    ► SELECT command is issued a lot by mobile handsets
    ► In 90% of all cases, handset doesn't want status info and hence sends no GET RESPONSE!
  • 37. Intermezzo: 2G SELECT
    "Differentiate between normal and worst case" principle
    Modified SELECT command
    ► Selects file/folder as usual
    ► Doesn't build status information
    ► But remembers that SELECT was issued as last command
    GET RESPONSE
    ► If last command was SELECT, build status information "just-in-time"
    ► Return status information to handset
    Result:
    ► Worst case: GET RESPONSE 10 - 90 ms slower (but GET RESPONSE is not used a lot)
    ► Normal case: SELECT is 2 - 10 x faster
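The modified SELECT boils down to a "pending" flag and a lazily built response. The following sketch illustrates that control flow only; the types, function names, and the four-byte status layout are invented for illustration and do not reflect a real SIM card operating system.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical card state; not taken from a real implementation. */
typedef struct {
    uint16_t selectedFile;
    bool     statusPending;  /* last command was SELECT, status not built yet */
    unsigned statusBuilds;   /* instrumentation: how often the slow path ran */
} CardState;

/* Stand-in for the expensive part (~10 - 90 ms in the slides). */
static void BuildStatusInfo(CardState* cs, uint8_t out[4]) {
    ++cs->statusBuilds;
    out[0] = (uint8_t)(cs->selectedFile >> 8);
    out[1] = (uint8_t)(cs->selectedFile & 0xFFu);
    out[2] = 0;
    out[3] = 0;
}

/* Normal case: select the file and only remember that status may be wanted. */
void CmdSelect(CardState* cs, uint16_t fileId) {
    cs->selectedFile = fileId;
    cs->statusPending = true;
}

/* Any other command invalidates the pending SELECT response. */
void CmdOther(CardState* cs) {
    cs->statusPending = false;
}

/* Worst case: build the status information just in time.
   Returns false if there is nothing to fetch. */
bool CmdGetResponse(CardState* cs, uint8_t out[4]) {
    if (!cs->statusPending) return false;
    BuildStatusInfo(cs, out);
    cs->statusPending = false;
    return true;
}
```

With this structure a handset that issues three SELECT commands and one GET RESPONSE pays for the status information once instead of three times.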
  • 38. Principle 7: Fight high multiplicity
    Execution time
    ► Code that is executed frequently
    ► Examples for techniques
      ► Loop unrolling
      ► Inlining
    Footprint
    ► Redundancy in code and data
    ► Examples for techniques
      ► Factor out common code (base classes, subroutines)
      ► Data compression
    How to detect
    ► Profiler, run-time measurements
    ► Analyze map file
    ► ZIP data, measure compression rate
  • 39. Principle 8: Caching
    Keep data that is frequently accessed in memory
    Data that is difficult to access
    ► Read cache
      ► Example: RAM, browser cache
    ► Write cache
      ► Example: Non-volatile memory
    Data that is difficult to compute
    ► sin(), log(), data scaling and conversion
    Verify efficacy by measuring cache hit-rate!
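A cache for data that is difficult to compute can be as small as a direct-mapped table in front of the expensive function, with two counters to measure the hit rate. The sketch below uses a deliberately slow integer square root as the expensive function; every name in it is hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

enum { CACHE_SLOTS = 16 };

typedef struct {
    bool     valid;
    uint32_t key;
    uint32_t value;
} CacheSlot;

static CacheSlot g_cache[CACHE_SLOTS];
unsigned long g_hits = 0;
unsigned long g_misses = 0;

/* Stand-in for an expensive computation: integer square root by linear search. */
static uint32_t SlowIsqrt(uint32_t x) {
    uint32_t r = 0;
    while ((uint64_t)(r + 1) * (r + 1) <= x) ++r;
    return r;
}

/* Direct-mapped cache: each key maps to exactly one slot. */
uint32_t CachedIsqrt(uint32_t x) {
    CacheSlot* slot = &g_cache[x % CACHE_SLOTS];
    if (slot->valid && slot->key == x) {
        ++g_hits;
        return slot->value;
    }
    ++g_misses;
    slot->valid = true;
    slot->key = x;
    slot->value = SlowIsqrt(x);
    return slot->value;
}

/* Fraction of lookups served from the cache (0.0 if nothing was looked up). */
double HitRate(void) {
    unsigned long total = g_hits + g_misses;
    return total ? (double)g_hits / (double)total : 0.0;
}
```

If `HitRate()` stays low under a realistic workload, the cache costs memory without saving time and is better removed or resized.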
  • 40. Principle 9: Precompute results
    Perform computations long before results are needed
    ► At system start-up
    ► At compile-time
    Use look-up tables
    ► Example: CRC16 checksum algorithm
  • 41. Principle 9: Precompute results
        uint16_t SimpleCRC16(uint8_t value, uint16_t crcin) {
            uint16_t k = (((crcin >> 8) ^ value) & 255) << 8;
            uint16_t crc = 0;
            uint16_t bits = 8;
            while (bits--) {
                if (( crc ^ k ) & 0x8000)
                    crc = (crc << 1) ^ 0x1021;
                else
                    crc <<= 1;
                k <<= 1;
            }
            return ((crcin << 8) ^ crc);
        }
  • 42. Principle 9: Precompute results
        const uint16_t CRC16_TABLE[] = {
            0x0000, 0x1021, 0x2042, 0x3063, 0x4084, 0x50A5, 0x60C6, 0x70E7,
            0x8108, 0x9129, 0xA14A, 0xB16B, 0xC18C, 0xD1AD, 0xE1CE, 0xF1EF,
            0x1231, 0x0210, 0x3273, 0x2252, 0x52B5, 0x4294, 0x72F7, 0x62D6,
              :       :       :       :       :       :       :       :       (32 x)
            0x7C26, 0x6C07, 0x5C64, 0x4C45, 0x3CA2, 0x2C83, 0x1CE0, 0x0CC1,
            0xEF1F, 0xFF3E, 0xCF5D, 0xDF7C, 0xAF9B, 0xBFBA, 0x8FD9, 0x9FF8,
            0x6E17, 0x7E36, 0x4E55, 0x5E74, 0x2E93, 0x3EB2, 0x0ED1, 0x1EF0
        };
        U16 LookupCRC16(U08 value, U16 crcin) {
            return (U16)((crcin << 8) ^ CRC16_TABLE[(U08)((crcin >> 8) ^ (value))]);
        }
    Improvement > factor 6
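The table does not have to be typed in by hand: entry `i` equals `SimpleCRC16(i, 0)`, so it can be filled at system start-up. The sketch below shows this under a few assumptions: `InitCrcTable`, `g_crcTable`, and `Crc16Buffer` are hypothetical names, the table lives in RAM instead of ROM, and `int` is assumed to be wider than 16 bits for the shift expressions.

```c
#include <stddef.h>
#include <stdint.h>

/* Bit-by-bit update for one input byte (polynomial 0x1021, MSB first). */
uint16_t SimpleCRC16(uint8_t value, uint16_t crcin) {
    uint16_t k = (uint16_t)((((crcin >> 8) ^ value) & 255) << 8);
    uint16_t crc = 0;
    uint16_t bits = 8;
    while (bits--) {
        if ((crc ^ k) & 0x8000)
            crc = (uint16_t)((crc << 1) ^ 0x1021);
        else
            crc = (uint16_t)(crc << 1);
        k = (uint16_t)(k << 1);
    }
    return (uint16_t)((crcin << 8) ^ crc);
}

/* Look-up table, filled once at start-up (hypothetical RAM variant). */
static uint16_t g_crcTable[256];

void InitCrcTable(void) {
    unsigned i;
    for (i = 0; i < 256; ++i) {
        /* With crcin == 0 the result depends on the index byte only. */
        g_crcTable[i] = SimpleCRC16((uint8_t)i, 0);
    }
}

/* Table-driven update for one input byte: one shift, one XOR, one look-up. */
uint16_t LookupCRC16(uint8_t value, uint16_t crcin) {
    return (uint16_t)((crcin << 8) ^ g_crcTable[(uint8_t)((crcin >> 8) ^ value)]);
}

/* Convenience wrapper: run the table-driven update over a whole buffer. */
uint16_t Crc16Buffer(const uint8_t* data, size_t len, uint16_t crc) {
    while (len--) {
        crc = LookupCRC16(*data++, crc);
    }
    return crc;
}
```

`InitCrcTable()` has to run before the first call to `LookupCRC16`. The generated entries can be checked against the values printed on the slide, for example `g_crcTable[1] == 0x1021` and `g_crcTable[255] == 0x1EF0`.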
  • 43. Principle 10: Exploit the processor's architecture
    Word-wise data processing
    ► Refer to MemFill example
    Natural byte-ordering ("endianness")
    ► Frequently, protocols use a byte-ordering that is different from the byte-ordering of the target architecture
    ► No problem, as long as data is only stored
    ► If data is also used internally (for computations), introduce an endianness conversion layer:
      ► When data enters the system: convert external endianness → internal endianness
      ► Perform computations with internal endianness
      ► When data leaves the system: convert internal endianness → external endianness
    Use portable integer types
    ► C99 stdint.h
    ► Exact-width types to store data (e.g. uint8_t)
    ► Fastest minimum-width types for computations (e.g. uint_fast8_t)
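A conversion layer of this kind can be very thin. The sketch below assumes a protocol that stores 16-bit fields in big-endian order; the helper names are hypothetical. Data is converted once on entry, the computation runs on a native `uint_fast16_t`, and the result is converted back on exit.

```c
#include <stdint.h>

/* External (big-endian) byte sequence -> internal native value. */
uint16_t ReadBE16(const uint8_t* p) {
    return (uint16_t)(((uint16_t)p[0] << 8) | p[1]);
}

/* Internal native value -> external (big-endian) byte sequence. */
void WriteBE16(uint8_t* p, uint16_t v) {
    p[0] = (uint8_t)(v >> 8);
    p[1] = (uint8_t)(v & 0xFFu);
}

/* Example computation: add 'delta' to a big-endian field in a message buffer.
   The arithmetic uses a fastest minimum-width type, the storage stays exact-width. */
void AddToBE16Field(uint8_t* field, uint16_t delta) {
    uint_fast16_t v = ReadBE16(field);  /* enter the system */
    v = (v + delta) & 0xFFFFu;          /* compute in native representation */
    WriteBE16(field, (uint16_t)v);      /* leave the system */
}
```

Because the helpers work on individual bytes, they behave the same on little-endian and big-endian targets.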
  • 44. Principle 11: Recode in assembly language
    After all other possibilities have been tried:
    ► No compiler/optimizer is perfect
    ► Jumps to arbitrary locations possible
    ► Stack manipulations
    ► Self-modifying code
    ► Some instructions have no high-level language equivalent:
        ADC          ; add with carry
        ROL, ROR     ; rotate left, right
        JC, JNE, JPL ; branch based on flags
        DIV          ; div and mod at the same time
  • 45. Principle 11: Recode in assembly language
    Important: Don't think like a compiler!
    Instruction-level optimizations are good
    ► Analyze generated assembly code
    ► Remove redundant instructions
    ► Replace inefficient instructions with more efficient instructions
    But so-called "local optimization" is much better
    ► View instructions as tools/building blocks
    ► Combine these building blocks in an efficient (creative!) way
  • 46. Intermezzo: Local optimization
    Michael Abrash's programming challenge (ca. 1991)
    ► Write a function that finds the smallest/biggest value in an array
    ► Use less than 24 bytes of memory
    ► x86 assembly language (at the time: 16-bit)
        unsigned int FindHigh(int len, unsigned int* buffer);
        unsigned int FindLow(int len, unsigned int* buffer);
  • 47. Intermezzo: Local optimization
        _FindLow:   pop  ax         ; get return address
                    pop  dx         ; get len
                    pop  bx         ; get data pointer
                    push bx
                    push dx
                    push ax
        save:       mov  ax, [bx]   ; store min
        top:        cmp  ax, [bx]   ; compare current val to min
                    ja   save       ; if smaller, save new min
                    inc  bx         ; advance to next val, 1st byte
                    inc  bx         ; advance to next val, 2nd byte
                    dec  dx         ; decrement loop counter
                    jnz  top        ; next iteration
                    ret
    Nice, for sure...
  • 48. Intermezzo: Local optimization
        _FindHigh:  db   0b9h       ; first byte of mov cx, 31C9
        _FindLow:   xor  cx, cx     ; 31 C9
                    pop  ax         ; get return address
                    pop  dx         ; get len
                    pop  bx         ; get data pointer
                    push bx
                    push dx
                    push ax
        save:       mov  ax, [bx]
        top:        cmp  ax, [bx]
                    jcxz around     ; depending on compare mode
                    cmc             ; invert compare result
        around:     ja   save
                    inc  bx
                    inc  bx
                    dec  dx
                    jnz  top
                    ret
    But he wanted both functions in 24 bytes!
    Study this carefully! There is a lot to be learned!
  • 49. Summary
    There will always be a need for efficient software
    Efficiency must be a requirements issue
    Correctness and maintainability usually have higher priority
    Prove that an optimization is necessary
    Prove that optimization works
    Prerequisites: knowledge and a systematic approach
    For outstanding efficiency: creativity and passion required
    Michael Abrash: “The best optimizer is between your ears”