The document describes an algorithm called Extreme DXT Compression for compressing textures into DXT1 and DXT5 formats. It uses SSE2 and SSSE3 instructions for high performance and produces quality comparable to the Real-Time DXT Compression algorithm but with roughly 300% better performance. The algorithm tightly packs data, processes two 4x4 blocks at once, and minimizes comparisons, jumps and loops to optimize for processors like the Core 2 Duo.
Extreme DXT Compression
Peter Uličiansky
Cauldron, Ltd.
Overview
• Simple highly optimized algorithm
• Uses SSE2 and SSSE3 for maximum performance
• Quality comparable to “Real-Time DXT Compression” algorithm
• Performance roughly 300%
• What’s identical to “Real-Time DXT Compression”
o Only non-transparent compression scheme for DXT1
o Only six intermediate alpha values compression scheme for DXT5
o Uses bounding box method for representative color and alpha values
• Computes color and alpha indices by division (fixed point multiplication)
o Uses lookup tables for color/alpha dividers
$$\mathrm{ColorIndex} = 4 \cdot \frac{(R - R_{min}) + (G - G_{min}) + (B - B_{min})}{(R_{max} - R_{min}) + (G_{max} - G_{min}) + (B_{max} - B_{min})}$$
$$\mathrm{AlphaIndex} = 8 \cdot \frac{A - A_{min}}{A_{max} - A_{min}}$$
• Converts natural index ordering to DXT index ordering by lookup tables
o Tightly packs natural indices first
o Then converts four color indices at once/two alpha indices at once
• Just two functions (CompressImageDXT1, CompressImageDXT5)
o Saves function call overhead
• No comparisons, jumps, loops (except height/width loops)
• Processes two 4x4 blocks at once
o Better utilization of registers
o Hides instruction latency in some places
o No need to “extract block” first
• Constant/temporary data just 24 * 16 = 384 bytes
• Lookup tables just 3072 + 1024 + 256 + 1280 = 5632 bytes
• Although some parts of the DXT1/DXT5 compression algorithms are identical,
different instruction ordering is crucial for maximum performance
• Code is optimized for Core 2 Duo, so Pentium 4 performance is not optimal
(there is not much point in optimizing for Pentium 4 these days)
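The natural-to-DXT index conversion listed in the overview can be illustrated with a small table builder. This is only a sketch: the slides do not show the contents of COLOR_INDICES_TABLE, so the mapping below assumes that color0 holds the maximum and color1 the minimum, and the names `NATURAL_TO_DXT1` and `build_color_index_table` are hypothetical.

```c
#include <stdint.h>

/* Hypothetical mapping (not taken from the author's table). Natural indices
   run 0 = min ... 3 = max; DXT1 palette order, assuming color0 = max and
   color1 = min, is:
     natural 0 (min) -> DXT 1 (color1)
     natural 1       -> DXT 3 (1/3 color0 + 2/3 color1)
     natural 2       -> DXT 2 (2/3 color0 + 1/3 color1)
     natural 3 (max) -> DXT 0 (color0) */
static const uint8_t NATURAL_TO_DXT1[4] = {1, 3, 2, 0};

/* Fills a 256-entry table so that a single lookup converts four tightly
   packed 2-bit indices (one byte) at once. */
void build_color_index_table(uint8_t table[256])
{
    for (int packed = 0; packed < 256; ++packed) {
        uint8_t out = 0;
        for (int i = 0; i < 4; ++i) {
            int natural = (packed >> (2 * i)) & 3;
            out |= (uint8_t)(NATURAL_TO_DXT1[natural] << (2 * i));
        }
        table[packed] = out;
    }
}
```

The loops run once at start-up, so the per-block code can stay free of branches: converting one row of indices becomes a single byte lookup.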
Color Compression Comparison
[Figure panels: Original image, Extreme DXT Comp., Real-Time DXT Comp.]
Alpha Compression Comparison
[Figure panels: Original image, Extreme DXT Comp., Real-Time DXT Comp.]
Performance
• 256x256 texture graphs show maximum possible performance of the algorithms
(all used data can fit and is already prepared in the cache memory)
• 4096x4096 texture graphs show more realistic performance
(source data cannot fit or is not already in the cache memory)
• The 256x256 Lena image was used for the 256x256 texture performance tests
• The same image was tiled 16x16 to create the 4096x4096 texture for the 4096x4096
texture performance tests
• The blue channel was replicated to the alpha channel for the DXT5 tests
• The DXT1 compression creates correct results regardless of the alpha information
in the source texture and never outputs transparent pixels
The Algorithm
Read 4x4 pixel block (movdqa)
Pixel03 Pixel02 Pixel01 Pixel00
Pixel13 Pixel12 Pixel11 Pixel10
Pixel23 Pixel22 Pixel21 Pixel20
Pixel33 Pixel32 Pixel31 Pixel30
Compute bounding box and store minimum (movdqa, pmaxub, pminub, pshufd)
Max Max Max Max
Min Min Min Min
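The bounding box step can be sketched with SSE2 intrinsics. This is an illustration under stated assumptions, not the author's assembly: the function name `block_bounds_sse2`, its parameters, and the use of unaligned loads (the slides use movdqa on aligned data) are choices made for this example.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Computes the per-channel minimum and maximum of a 4x4 block of 32-bit
   pixels. `pixels` points at the top-left pixel, `stride_bytes` is the
   distance between rows. Hypothetical helper for illustration only. */
void block_bounds_sse2(const uint8_t *pixels, size_t stride_bytes,
                       uint8_t min_out[4], uint8_t max_out[4])
{
    __m128i row0 = _mm_loadu_si128((const __m128i *)(pixels + 0 * stride_bytes));
    __m128i row1 = _mm_loadu_si128((const __m128i *)(pixels + 1 * stride_bytes));
    __m128i row2 = _mm_loadu_si128((const __m128i *)(pixels + 2 * stride_bytes));
    __m128i row3 = _mm_loadu_si128((const __m128i *)(pixels + 3 * stride_bytes));

    /* Vertical reduction: four rows down to one (pmaxub / pminub). */
    __m128i mx = _mm_max_epu8(_mm_max_epu8(row0, row1), _mm_max_epu8(row2, row3));
    __m128i mn = _mm_min_epu8(_mm_min_epu8(row0, row1), _mm_min_epu8(row2, row3));

    /* Horizontal reduction across the four pixels (pshufd). Afterwards every
       32-bit lane holds the same Max (or Min) value. */
    mx = _mm_max_epu8(mx, _mm_shuffle_epi32(mx, _MM_SHUFFLE(2, 3, 0, 1)));
    mx = _mm_max_epu8(mx, _mm_shuffle_epi32(mx, _MM_SHUFFLE(1, 0, 3, 2)));
    mn = _mm_min_epu8(mn, _mm_shuffle_epi32(mn, _MM_SHUFFLE(2, 3, 0, 1)));
    mn = _mm_min_epu8(mn, _mm_shuffle_epi32(mn, _MM_SHUFFLE(1, 0, 3, 2)));

    uint32_t mx32 = (uint32_t)_mm_cvtsi128_si32(mx);
    uint32_t mn32 = (uint32_t)_mm_cvtsi128_si32(mn);
    memcpy(max_out, &mx32, 4);
    memcpy(min_out, &mn32, 4);
}
```

Because the byte-wise min and max treat all four channels alike, alpha bounds come for free in the DXT5 case.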
Compute and store range (movdqa, punpcklbw, psubw, movq)
Range Range Range Range Range Range Range Range
Inset bounding box and interleave max’/min’ values (psrlw, psubw, paddw, punpcklwd)
Min’ Max’ Min’ Max’ Min’ Max’ Min’ Max’
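A scalar sketch of the range and inset steps is given below. The slides name only the instructions (psrlw, psubw, paddw), so the shift amount is an assumption: `INSET_SHIFT = 4` follows the color inset used in the "Real-Time DXT Compression" paper, and the function name `inset_bounds` is hypothetical.

```c
#include <stdint.h>

/* Assumed shift amount; the slides do not state the value used. */
enum { INSET_SHIFT = 4 };

/* Shrinks the bounding box by a fraction of its range on each side.
   All four channels are treated alike in this sketch. */
void inset_bounds(const uint8_t min_in[4], const uint8_t max_in[4],
                  uint8_t min_out[4], uint8_t max_out[4])
{
    for (int c = 0; c < 4; ++c) {
        int range = max_in[c] - min_in[c];   /* Range  */
        int inset = range >> INSET_SHIFT;    /* psrlw  */
        min_out[c] = (uint8_t)(min_in[c] + inset);  /* Min' (paddw) */
        max_out[c] = (uint8_t)(max_in[c] - inset);  /* Max' (psubw) */
    }
}
```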
Shift and mask max’/min’ values as needed in the DXT block (pmullw, pand, movdqa)
Min’ Max’ Min’ Max’ Min’ Max’ Min’ Max’
Pack and store max’/min’ values to the DXT block (mov, shr, or)
Min’ Max’ Min’ Max’
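The shift-and-mask step replaces per-channel shifts with one multiply and one AND. The scalar sketch below uses the 16-bit scale and mask words that appear in SSE2_BOUNDS_SCALE and SSE2_BOUNDS_MASK in the listing; which word belongs to which channel (blue, green, red) and how the results are finally combined are assumptions, and `pack_rgb565` is a hypothetical name.

```c
#include <stdint.h>

/* Converts an 8-bit-per-channel color to 5:6:5 using multiply + mask.
   Channel assignment of the scale/mask pairs is assumed, not confirmed. */
uint16_t pack_rgb565(uint8_t r, uint8_t g, uint8_t b)
{
    uint16_t blue  = (uint16_t)((b * 0x0020) & 0x1F00); /* equals (b >> 3) << 8  */
    uint16_t green = (uint16_t)((g * 0x0008) & 0x07E0); /* equals (g >> 2) << 5  */
    uint16_t red   = (uint16_t)((r * 0x0100) & 0xF800); /* equals (r >> 3) << 11 */

    /* Green and red are already in place; blue is moved down by 8 bits
       when the three parts are merged (the mov, shr, or step). */
    return (uint16_t)(red | green | (blue >> 8));
}
```

A multiply by a power of two is a left shift, so a single multiply with per-lane constants can apply a different shift to every channel at once.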
Load 4x4 pixel block again, subtract minimum, prepare for the division
(SSSE3: movdqa, psubb, pmaddubsw, phaddw)
(SSE2: movdqa, psubb, pand, pmaddwd, psrlw, psllw, paddw, packssdw)
DXT1
8(R+G+B)13 8(R+G+B)12 8(R+G+B)11 8(R+G+B)10 8(R+G+B)03 8(R+G+B)02 8(R+G+B)01 8(R+G+B)00
8(R+G+B)33 8(R+G+B)32 8(R+G+B)31 8(R+G+B)30 8(R+G+B)23 8(R+G+B)22 8(R+G+B)21 8(R+G+B)20
DXT5
8A03 8(R+G+B)03 8A02 8(R+G+B)02 8A01 8(R+G+B)01 8A00 8(R+G+B)00
8A13 8(R+G+B)13 8A12 8(R+G+B)12 8A11 8(R+G+B)11 8A10 8(R+G+B)10
8A23 8(R+G+B)23 8A22 8(R+G+B)22 8A21 8(R+G+B)21 8A20 8(R+G+B)20
8A33 8(R+G+B)33 8A32 8(R+G+B)32 8A31 8(R+G+B)31 8A30 8(R+G+B)30
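For the DXT1 case, the values in the diagram can be written in scalar form as follows. This is a sketch, assuming 32-bit pixels with alpha in the fourth byte; the name `prepare_color_values` and the flat 64-byte block layout are hypothetical.

```c
#include <stdint.h>

/* For each of the 16 pixels, subtracts the per-channel minimum and stores
   8 * ((R - Rmin) + (G - Gmin) + (B - Bmin)), the dividend for the index
   division. `pixels` holds 16 pixels of 4 bytes each; byte 3 (assumed to
   be alpha) is ignored here. */
void prepare_color_values(const uint8_t *pixels, const uint8_t min_px[4],
                          uint16_t values[16])
{
    for (int i = 0; i < 16; ++i) {
        const uint8_t *p = pixels + 4 * i;
        int sum = (p[0] - min_px[0]) + (p[1] - min_px[1]) + (p[2] - min_px[2]);
        values[i] = (uint16_t)(8 * sum);  /* at most 8 * 765 = 6120 */
    }
}
```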
Prepare dividers according to the range (mov, add, or, movd, pshufd)
DXT1
ColorDivider ColorDivider ColorDivider ColorDivider ColorDivider ColorDivider ColorDivider ColorDivider
DXT5
AlphaDivider ColorDivider AlphaDivider ColorDivider AlphaDivider ColorDivider AlphaDivider ColorDivider
Perform the division (fixed point multiplication) to get indices (pmulhw)
DXT1
ColorIndex13 ColorIndex12 ColorIndex11 ColorIndex10 ColorIndex03 ColorIndex02 ColorIndex01 ColorIndex00
ColorIndex33 ColorIndex32 ColorIndex31 ColorIndex30 ColorIndex23 ColorIndex22 ColorIndex21 ColorIndex20
DXT5
AlphaIndex03 ColorIndex03 AlphaIndex02 ColorIndex02 AlphaIndex01 ColorIndex01 AlphaIndex00 ColorIndex00
AlphaIndex13 ColorIndex13 AlphaIndex12 ColorIndex12 AlphaIndex11 ColorIndex11 AlphaIndex10 ColorIndex10
AlphaIndex23 ColorIndex23 AlphaIndex22 ColorIndex22 AlphaIndex21 ColorIndex21 AlphaIndex20 ColorIndex20
AlphaIndex33 ColorIndex33 AlphaIndex32 ColorIndex32 AlphaIndex31 ColorIndex31 AlphaIndex30 ColorIndex30
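Division by fixed point multiplication can be sketched in scalar C as below. The contents of COLOR_DIVIDER_TABLE and ALPHA_DIVIDER_TABLE are not shown in the slides, so `make_divider` is a hypothetical way to derive one entry; it picks the largest multiplier that still keeps the top index at `levels - 1`.

```c
#include <stdint.h>

/* Hypothetical divider for a given range (not the author's table).
   `levels` is 4 for color and 8 for alpha; the factor 8 in the denominator
   matches the pre-scaled dividend 8 * (value - min). */
uint32_t make_divider(uint32_t levels, uint32_t range)
{
    if (range == 0)
        return 0;  /* flat block: every index becomes 0 */
    return ((levels << 16) - 1) / (8 * range);
}

/* Multiply and keep the high 16 bits. For non-negative operands below
   32768 this is the same arithmetic as one lane of pmulhw. */
uint32_t fixed_point_index(uint32_t value8, uint32_t divider)
{
    return (value8 * divider) >> 16;
}
```

For example, with a color range of 30 the divider is 1092, and a pixel whose channel differences sum to 20 gets index (160 * 1092) >> 16 = 2, matching the truncated value of 4 * 20 / 30.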
Pack indices together and store them to the temporary buffer
(SSSE3: packuswb, pshufb, pmaddubsw, pmaddwd, movdqa)
(SSE2: pshuflw, pshufhw, pmaddwd, packssdw, movdqa)
DXT1
ColorIndex33…30 ColorIndex23…20 ColorIndex13…10 ColorIndex03…00
DXT5
AlphaIndex13…10 ColorIndex13…10 AlphaIndex03…00 ColorIndex03…00
AlphaIndex33…30 ColorIndex33…30 AlphaIndex23…20 ColorIndex23…20
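Packing the color indices amounts to a weighted sum per row. The weights 1, 4, 16, 64 below mirror the words in SSE2_INDICES_SCALE_0 from the listing; the function name `pack_color_indices` and the row-major input layout are assumptions made for this scalar sketch.

```c
#include <stdint.h>

/* Packs 16 two-bit natural indices (row-major, one per byte) into 32 bits:
   four indices per byte, row 0 in the lowest byte. */
uint32_t pack_color_indices(const uint8_t indices[16])
{
    static const uint32_t weights[4] = {1, 4, 16, 64};
    uint32_t packed = 0;
    for (int row = 0; row < 4; ++row) {
        uint32_t row_byte = 0;
        for (int col = 0; col < 4; ++col)
            row_byte += indices[4 * row + col] * weights[col];  /* multiply-add */
        packed |= row_byte << (8 * row);
    }
    return packed;
}
```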
Convert packed indices to final DXT indices and store them to the DXT block (mov, or)
Set3 Set2 Set1 Set0 Min’ Max’ Set2 Set1 Set0 Min’ Max’
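For reference, the standard 8-byte DXT1 block stores two 16-bit 5:6:5 colors followed by 32 bits of indices, all little-endian. The writer below is a minimal sketch with a hypothetical name, `write_dxt1_block`; the author's code stores the pieces directly with mov and or instead.

```c
#include <stdint.h>

/* Writes one DXT1 block. Storing color0 > color1 selects the four-color
   (non-transparent) mode, which is the only mode this algorithm uses. */
void write_dxt1_block(uint8_t out[8], uint16_t color0, uint16_t color1,
                      uint32_t indices)
{
    out[0] = (uint8_t)(color0 & 0xFF);
    out[1] = (uint8_t)(color0 >> 8);
    out[2] = (uint8_t)(color1 & 0xFF);
    out[3] = (uint8_t)(color1 >> 8);
    out[4] = (uint8_t)(indices & 0xFF);          /* row 0 */
    out[5] = (uint8_t)((indices >> 8) & 0xFF);   /* row 1 */
    out[6] = (uint8_t)((indices >> 16) & 0xFF);  /* row 2 */
    out[7] = (uint8_t)(indices >> 24);           /* row 3 */
}
```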
/*************************************************************************************************************
Extreme DXT Compression
Copyright (C) 2008 Cauldron, Ltd.
Written by Peter Uličiansky
Microsoft Public License (Ms-PL)
This license governs use of the accompanying software.
If you use the software, you accept this license.
If you do not accept the license, do not use the software.
1. Definitions
The terms "reproduce," "reproduction," "derivative works," and "distribution" have the same meaning here as
under U.S. copyright law. A "contribution" is the original software, or any additions or changes to the
software. A "contributor" is any person that distributes its contribution under this license. "Licensed
patents" are a contributor's patent claims that read directly on its contribution.
2. Grant of Rights
(A) Copyright Grant- Subject to the terms of this license, including the license conditions and limitations in
section 3, each contributor grants you a non-exclusive, worldwide, royalty-free copyright license to reproduce
its contribution, prepare derivative works of its contribution, and distribute its contribution or any
derivative works that you create.
(B) Patent Grant- Subject to the terms of this license, including the license conditions and limitations in
section 3, each contributor grants you a non-exclusive, worldwide, royalty-free license under its licensed
patents to make, have made, use, sell, offer for sale, import, and/or otherwise dispose of its contribution in
the software or derivative works of the contribution in the software.
3. Conditions and Limitations
(A) No Trademark License- This license does not grant you rights to use any contributors' name, logo, or
trademarks.
(B) If you bring a patent claim against any contributor over patents that you claim are infringed by the
software, your patent license from such contributor to the software ends automatically.
(C) If you distribute any portion of the software, you must retain all copyright, patent, trademark, and
attribution notices that are present in the software.
(D) If you distribute any portion of the software in source code form, you may do so only under this license
by including a complete copy of this license with your distribution. If you distribute any portion of the
software in compiled or object code form, you may only do so under a license that complies with this license.
(E) The software is licensed "as-is." You bear the risk of using it. The contributors give no express
warranties, guarantees, or conditions. You may have additional consumer rights under your local laws which
this license cannot change. To the extent permitted under your local laws, the contributors exclude the
implied warranties of merchantability, fitness for a particular purpose and non-infringement.
*************************************************************************************************************/
DWORD COLOR_DIVIDER_TABLE[768];
DWORD ALPHA_DIVIDER_TABLE[256];
BYTE COLOR_INDICES_TABLE[256];
WORD ALPHA_INDICES_TABLE[640];
__declspec(align(16)) const BYTE SSE2_BYTE_0 [1 * 16] =
{0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00};
__declspec(align(16)) const BYTE SSE2_WORD_1 [1 * 16] =
{0x01,0x00,0x01,0x00,0x01,0x00,0x01,0x00,0x01,0x00,0x01,0x00,0x01,0x00,0x01,0x00};
__declspec(align(16)) const BYTE SSE2_WORD_8 [1 * 16] =
{0x08,0x00,0x08,0x00,0x08,0x00,0x08,0x00,0x08,0x00,0x08,0x00,0x08,0x00,0x08,0x00};
__declspec(align(16)) const BYTE SSE2_BOUNDS_MASK [1 * 16] =
{0x00,0x1F,0x00,0x1F,0xE0,0x07,0xE0,0x07,0x00,0xF8,0x00,0xF8,0x00,0xFF,0xFF,0x00};
__declspec(align(16)) const BYTE SSE2_BOUNDS_SCALE [1 * 16] =
{0x20,0x00,0x20,0x00,0x08,0x00,0x08,0x00,0x00,0x01,0x00,0x01,0x00,0x01,0x01,0x00};
__declspec(align(16)) const BYTE SSE2_INDICES_MASK_0 [1 * 16] =
{0xFF,0x00,0xFF,0x00,0xFF,0x00,0xFF,0x00,0xFF,0x00,0xFF,0x00,0xFF,0x00,0xFF,0x00};
__declspec(align(16)) const BYTE SSE2_INDICES_MASK_1 [1 * 16] =
{0x00,0xFF,0x00,0x00,0x00,0xFF,0x00,0x00,0x00,0xFF,0x00,0x00,0x00,0xFF,0x00,0x00};
__declspec(align(16)) const BYTE SSE2_INDICES_MASK_2 [1 * 16] =
{0x08,0x08,0x08,0x00,0x08,0x08,0x08,0x00,0x08,0x08,0x08,0x00,0x08,0x08,0x08,0x00};
__declspec(align(16)) const BYTE SSE2_INDICES_SCALE_0[1 * 16] =
{0x01,0x00,0x04,0x00,0x10,0x00,0x40,0x00,0x01,0x00,0x04,0x00,0x10,0x00,0x40,0x00};
__declspec(align(16)) const BYTE SSE2_INDICES_SCALE_1[1 * 16] =
{0x01,0x00,0x04,0x00,0x01,0x00,0x08,0x00,0x10,0x00,0x40,0x00,0x00,0x01,0x00,0x08};
__declspec(align(16)) const BYTE SSE2_INDICES_SCALE_2[1 * 16] =
{0x01,0x04,0x10,0x40,0x01,0x04,0x10,0x40,0x01,0x04,0x10,0x40,0x01,0x04,0x10,0x40};
__declspec(align(16)) const BYTE SSE2_INDICES_SCALE_3[1 * 16] =
{0x01,0x04,0x01,0x04,0x01,0x08,0x01,0x08,0x01,0x04,0x01,0x04,0x01,0x08,0x01,0x08};
__declspec(align(16)) const BYTE SSE2_INDICES_SCALE_4[1 * 16] =
{0x01,0x00,0x10,0x00,0x01,0x00,0x00,0x01,0x01,0x00,0x10,0x00,0x01,0x00,0x00,0x01};
__declspec(align(16)) const BYTE SSE2_INDICES_SHUFFLE[1 * 16] =
{0x00,0x02,0x04,0x06,0x01,0x03,0x05,0x07,0x08,0x0A,0x0C,0x0E,0x09,0x0B,0x0D,0x0F};
__declspec(align(16)) BYTE sse2_minimum[2 * 16];
__declspec(align(16)) BYTE sse2_range [2 * 16];
__declspec(align(16)) BYTE sse2_bounds [2 * 16];
__declspec(align(16)) BYTE sse2_indices[4 * 16];