GACOP JACCA: Jornadas de Arquitectura para el Cálculo y Comunicaciones Avanzadas, February 2004

email: gbernabe@ditec.um.es

Optimization of the Wavelet Transform for Uniprocessor Architectures

Gregorio Bernabé García

Depto. Ingeniería y Tecnología de Computadores

Universidad de Murcia

30071 Murcia (SPAIN)

The 3D-FWT Encoder

SOURCE VIDEO DATA -> 3-D FWT -> THRESHOLDING -> QUANTIZER -> ENTROPY ENCODER -> COMPRESSED VIDEO DATA
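
The thresholding and quantizer stages of this pipeline can be pictured with a minimal sketch. It assumes hard thresholding followed by a uniform scalar quantizer, which the slides do not specify; the function name `threshold_quantize` and the parameters `t` and `step` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not from the slides): hard-threshold, then
 * uniformly quantize. Coefficients with |x| < t become 0; the others
 * are mapped to the index of the nearest multiple of step. */
void threshold_quantize(const float *coef, int *q, size_t n,
                        float t, float step)
{
    assert(step > 0.0f);
    for (size_t i = 0; i < n; i++) {
        float x = coef[i];
        float mag = x < 0.0f ? -x : x;
        if (mag < t) {
            q[i] = 0;                          /* thresholded away */
        } else {
            int idx = (int)(mag / step + 0.5f); /* round to nearest */
            q[i] = x < 0.0f ? -idx : idx;
        }
    }
}
```

A larger `t` produces more zeros for the entropy encoder to compress, at the cost of discarding more detail.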

Proposal (I)

OBJECTIVES

– Increase the compression rate while maintaining the video quality

SEVERAL IMPROVEMENTS IN THE QUANTIZATION AND THE ENTROPY ENCODER

– 3D-Conscious Run-Length
– Hexadecimal coding
– Arithmetic coding
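
To give a feel for why run-length coding helps after quantization, here is a minimal sketch of plain zero run-length coding. It is not the 3D-conscious variant named on the slide, whose details are not given here; the function name `rle_zeros` and the (zeros, value) pair format are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Plain zero run-length coding (illustrative, not the 3D-conscious
 * scheme): every nonzero value is written as the pair
 * (number of zeros before it, value). Trailing zeros are written as
 * a final pair (number of zeros, 0).
 * out needs room for 2 * (n + 1) ints. Returns the ints written. */
size_t rle_zeros(const int *q, size_t n, int *out)
{
    size_t w = 0;
    int run = 0;
    assert(q != NULL && out != NULL);
    for (size_t i = 0; i < n; i++) {
        if (q[i] == 0) {
            run++;
        } else {
            out[w++] = run;
            out[w++] = q[i];
            run = 0;
        }
    }
    if (run > 0) {
        out[w++] = run;
        out[w++] = 0;
    }
    return w;
}
```

The longer the runs of zeros left by thresholding, the fewer pairs are handed to the next coding stage.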

Introduction

Proposal (II)

Memory Conscious 3D FWT exploiting the memory hierarchy

– Blocking Techniques
– Rectangular overlapped partitioning

Advantages
– Spatial locality of memory references
– Reuse of floating-point operations

Disadvantages
– L1 and L2 cache misses are too high
– The number of floating-point operations executed is too large
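
One way to picture overlapped partitioning is in 1D with the Daub-4 low-pass filter: a block that produces `block` outputs reads `2*block + 2` inputs, so neighbouring blocks share two input samples and can be transformed independently. This is a simplified sketch under that assumption, not the rectangular 3D scheme of the talk; `low_pass` and `low_pass_blocked` are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

/* Daub-4 low-pass coefficients (standard values, single precision). */
static const float C0 = 0.4829629f, C1 = 0.8365163f,
                   C2 = 0.2241439f, C3 = -0.1294095f;

/* Low-pass half of one 1D step: m outputs read 2*m + 2 inputs. */
static void low_pass(const float *p, float *low, size_t m)
{
    for (size_t j = 0; j < m; j++)
        low[j] = C0*p[2*j] + C1*p[2*j+1] + C2*p[2*j+2] + C3*p[2*j+3];
}

/* Same result computed block by block. The block starting at output
 * j0 begins at input 2*j0 and reads two samples that also belong to
 * the next block (the overlap), so each block fits a cache-sized
 * working set and needs no data from its neighbours' outputs. */
void low_pass_blocked(const float *p, float *low, size_t m, size_t block)
{
    assert(block > 0);
    for (size_t j0 = 0; j0 < m; j0 += block) {
        size_t len = (m - j0 < block) ? m - j0 : block;
        low_pass(p + 2*j0, low + j0, len);
    }
}
```

The overlap is what makes the blocks independent; the price is that the shared samples are loaded once per block that touches them.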

Proposal (III)

Optimize the Rectangular Overlapped Approach
– Reduce the number of FP instructions
– Reduce the pressure on the memory subsystem

Enhancements
– Take advantage of SSE efficiently (Intel C/C++ Compiler)
– Data prefetching and loop unrolling
– Columns vectorization

SSE vectorization by hand

1D-FWT algorithm (n pixels) with Daub-4 as the mother wavelet

/* p must hold n + 2 samples: for even n, the last iteration (i = n - 2) reads p[n + 1] */
for (i = 0, j = 0; i < n; i += 2, j++) {
    low[j]  = c0*p[i] + c1*p[i+1] + c2*p[i+2] + c3*p[i+3];
    high[j] = c3*p[i] - c2*p[i+1] + c1*p[i+2] - c0*p[i+3];
}

low[0] = c0 * p[0] + c1 * p[1] + c2 * p[2] + c3 * p[3];

Each wavelet coefficient needs:
– Four pixels
– Three additions
– Four multiplications
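
The loop above can be turned into a self-contained function as a minimal sketch. The coefficient values are the standard Daub-4 filter taps; the function name `fwt1d_daub4` and the convention that the caller supplies two extra boundary samples are assumptions, since the slides do not say how the boundary is handled.

```c
#include <assert.h>
#include <stddef.h>

/* Daub-4 filter coefficients (standard values, single precision). */
static const float c0 = 0.4829629f, c1 = 0.8365163f,
                   c2 = 0.2241439f, c3 = -0.1294095f;

/* One level of the 1D FWT over n pixels (n even).
 * p holds n + 2 samples: the last two are boundary extension supplied
 * by the caller, because iteration i = n - 2 reads p[n + 1].
 * low and high each receive n / 2 coefficients. */
void fwt1d_daub4(const float *p, float *low, float *high, size_t n)
{
    assert(n % 2 == 0);
    for (size_t i = 0, j = 0; i < n; i += 2, j++) {
        low[j]  = c0*p[i] + c1*p[i+1] + c2*p[i+2] + c3*p[i+3];
        high[j] = c3*p[i] - c2*p[i+1] + c1*p[i+2] - c0*p[i+3];
    }
}
```

For a constant input the high-pass outputs are close to zero and the low-pass outputs are the input scaled by about 1.4142, which is a quick sanity check for the coefficients.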

Four consecutive coefficients computed in scalar code:

low[0] = c0 * p[0] + c1 * p[1] + c2 * p[2] + c3 * p[3];
low[1] = c0 * p[2] + c1 * p[3] + c2 * p[4] + c3 * p[5];
low[2] = c0 * p[4] + c1 * p[5] + c2 * p[6] + c3 * p[7];
low[3] = c0 * p[6] + c1 * p[7] + c2 * p[8] + c3 * p[9];

4 coefficients: 16 FP multiplications, 12 FP additions

SSE vectorization by hand

Each 128-bit XMM register holds four 32-bit values (bits 0-31, 32-63, 64-95, 96-127), one per output coefficient:

xmm0:  C0*p[0]   C0*p[2]   C0*p[4]   C0*p[6]
xmm1:  C1*p[1]   C1*p[3]   C1*p[5]   C1*p[7]
xmm2:  C2*p[2]   C2*p[4]   C2*p[6]   C2*p[8]
xmm3:  C3*p[3]   C3*p[5]   C3*p[7]   C3*p[9]

addps xmm0, xmm1
xmm0:  C0*p[0] + C1*p[1]   C0*p[2] + C1*p[3]   C0*p[4] + C1*p[5]   C0*p[6] + C1*p[7]

addps xmm0, xmm2
xmm0:  C0*p[0] + C1*p[1] + C2*p[2]
       C0*p[2] + C1*p[3] + C2*p[4]
       C0*p[4] + C1*p[5] + C2*p[6]
       C0*p[6] + C1*p[7] + C2*p[8]

addps xmm0, xmm3
xmm0 now holds the four coefficients:

low[0] = c0*p[0] + c1*p[1] + c2*p[2] + c3*p[3]
low[1] = c0*p[2] + c1*p[3] + c2*p[4] + c3*p[5]
low[2] = c0*p[4] + c1*p[5] + c2*p[6] + c3*p[7]
low[3] = c0*p[6] + c1*p[7] + c2*p[8] + c3*p[9]
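
The register layout above can be reproduced with SSE intrinsics instead of hand-written assembly. This is a sketch under assumptions: the slides show only the register contents, so the shuffle-based loading, the unaligned loads and the function name `daub4_low4_sse` are illustrative choices, not the author's code. It requires an x86 target with SSE.

```c
#include <assert.h>
#include <xmmintrin.h>  /* SSE intrinsics */

/* Computes low[0..3] from p[0..9]. Coefficients are passed in;
 * no alignment of p or low is assumed. */
void daub4_low4_sse(const float *p, float *low,
                    float c0, float c1, float c2, float c3)
{
    assert(p != NULL && low != NULL);
    __m128 a = _mm_loadu_ps(p);      /* p[0] p[1] p[2] p[3] */
    __m128 b = _mm_loadu_ps(p + 4);  /* p[4] p[5] p[6] p[7] */
    __m128 c = _mm_loadu_ps(p + 2);  /* p[2] p[3] p[4] p[5] */
    __m128 d = _mm_loadu_ps(p + 6);  /* p[6] p[7] p[8] p[9] */

    /* De-interleave: each register gets one filter tap for 4 outputs. */
    __m128 x0 = _mm_shuffle_ps(a, b, _MM_SHUFFLE(2, 0, 2, 0)); /* p0 p2 p4 p6 */
    __m128 x1 = _mm_shuffle_ps(a, b, _MM_SHUFFLE(3, 1, 3, 1)); /* p1 p3 p5 p7 */
    __m128 x2 = _mm_shuffle_ps(c, d, _MM_SHUFFLE(2, 0, 2, 0)); /* p2 p4 p6 p8 */
    __m128 x3 = _mm_shuffle_ps(c, d, _MM_SHUFFLE(3, 1, 3, 1)); /* p3 p5 p7 p9 */

    /* Four packed multiplications (mulps). */
    x0 = _mm_mul_ps(x0, _mm_set1_ps(c0));
    x1 = _mm_mul_ps(x1, _mm_set1_ps(c1));
    x2 = _mm_mul_ps(x2, _mm_set1_ps(c2));
    x3 = _mm_mul_ps(x3, _mm_set1_ps(c3));

    /* Three packed additions (addps), in the order shown on the slide. */
    x0 = _mm_add_ps(x0, x1);
    x0 = _mm_add_ps(x0, x2);
    x0 = _mm_add_ps(x0, x3);

    _mm_storeu_ps(low, x0);          /* low[0] low[1] low[2] low[3] */
}
```

Four packed multiplications and three packed additions replace the 16 multiplications and 12 additions of the scalar version; the shuffles are the extra cost of gathering the strided pixels.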

Data prefetching

[Figure: a Processor (Control Unit, Register File) connected to a Memory that holds Instructions + Data]

4 wavelet coefficients are being calculated:

low[0] = C0*p[0] + C1*p[1] + C2*p[2] + C3*p[3]
low[1] = C0*p[2] + C1*p[3] + C2*p[4] + C3*p[5]
low[2] = C0*p[4] + C1*p[5] + C2*p[6] + C3*p[7]
low[3] = C0*p[6] + C1*p[7] + C2*p[8] + C3*p[9]

Pixels needed for the next coefficients are being prefetched:

low[4] = C0*p[8] + C1*p[9] + C2*p[10] + C3*p[11]
low[5] = C0*p[10] + C1*p[11] + C2*p[12] + C3*p[13]
low[6] = C0*p[12] + C1*p[13] + C2*p[14] + C3*p[15]
low[7] = C0*p[14] + C1*p[15] + C2*p[16] + C3*p[17]
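
The overlap of computation and prefetching can be sketched with the `_mm_prefetch` intrinsic in a loop unrolled to four coefficients per iteration. The prefetch distance of 8 floats mirrors the slide (the pixels of the next four coefficients); the constant `PREFETCH_AHEAD`, the cache hint and the function name `low_pass_prefetch` are assumptions, and real distances would need tuning on the target machine.

```c
#include <assert.h>
#include <stddef.h>
#include <xmmintrin.h>  /* _mm_prefetch, _MM_HINT_T0 */

/* Hypothetical tuning constant: how many floats ahead to prefetch. */
#define PREFETCH_AHEAD 8

/* Daub-4 low-pass step, unrolled to four outputs per iteration.
 * p holds 2*m + 2 samples and m is a multiple of 4. */
void low_pass_prefetch(const float *p, float *low, size_t m,
                       float c0, float c1, float c2, float c3)
{
    size_t len = 2*m + 2;
    assert(m % 4 == 0);
    for (size_t j = 0; j < m; j += 4) {
        size_t i = 2*j;
        /* Ask for the next iteration's pixels while this one computes.
         * The bounds check keeps the pointer inside the array. */
        if (i + PREFETCH_AHEAD < len)
            _mm_prefetch((const char *)(p + i + PREFETCH_AHEAD), _MM_HINT_T0);
        low[j]   = c0*p[i]   + c1*p[i+1] + c2*p[i+2] + c3*p[i+3];
        low[j+1] = c0*p[i+2] + c1*p[i+3] + c2*p[i+4] + c3*p[i+5];
        low[j+2] = c0*p[i+4] + c1*p[i+5] + c2*p[i+6] + c3*p[i+7];
        low[j+3] = c0*p[i+6] + c1*p[i+7] + c2*p[i+8] + c3*p[i+9];
    }
}
```

A prefetch is only a hint: it changes timing, never results, so the output matches the plain loop.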

Columns Vectorization

[Figure: a frame with columns 0 to 11 and rows 0 to 5. The X-wavelet is applied to each row; the Y transform is then computed row by columns ("First Row by Columns", "Second Row by Columns").]

An effective way to apply the transform in the Y dimension:
– Locality of references
– The transform was already applied in the X dimension
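
A possible reading of columns vectorization is that the Y-direction filter is applied to four adjacent columns at once: each SSE register holds four neighbouring values of one row, so the four filter taps come from four consecutive rows and no shuffling is needed. The sketch below assumes row-major storage and a width that is a multiple of 4; the function name `y_low_rows` is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <xmmintrin.h>  /* SSE intrinsics */

/* Y-direction low-pass step for one output row.
 * r0..r3 point to four consecutive input rows (already transformed
 * in X). Each iteration produces four adjacent columns of the output. */
void y_low_rows(const float *r0, const float *r1,
                const float *r2, const float *r3,
                float *out, size_t width,
                float c0, float c1, float c2, float c3)
{
    assert(width % 4 == 0);
    const __m128 k0 = _mm_set1_ps(c0), k1 = _mm_set1_ps(c1);
    const __m128 k2 = _mm_set1_ps(c2), k3 = _mm_set1_ps(c3);
    for (size_t x = 0; x < width; x += 4) {
        /* Contiguous loads along each row keep references local. */
        __m128 acc = _mm_mul_ps(k0, _mm_loadu_ps(r0 + x));
        acc = _mm_add_ps(acc, _mm_mul_ps(k1, _mm_loadu_ps(r1 + x)));
        acc = _mm_add_ps(acc, _mm_mul_ps(k2, _mm_loadu_ps(r2 + x)));
        acc = _mm_add_ps(acc, _mm_mul_ps(k3, _mm_loadu_ps(r3 + x)));
        _mm_storeu_ps(out + x, acc);
    }
}
```

With the rows of the figure, the first output row would use rows 0 to 3 and the second rows 2 to 5, since the filter advances two rows per output.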

Hyperthreading

[Figure: a processor drawn as an Architectural State on top of the Processor Execution Resources; with Hyper-Threading, more than one Architectural State shares the same execution resources.]

[Figure: Hyper-Threading pipeline. Stages: Fetch, Decode, T. Cache/Mc ROM, Rename/Allocate, Out-of-Order Schedule/Execute, Retirement, with a pair of queues after each stage. There are two APICs and two Arch States, plus the Physical Registers.]

Data Domain Decomposition

– Thread-1: Sequence of Video 1
– Thread-2: Sequence of Video 2

[Figure: Thread-1 and Thread-2 feeding the shared Hyper-Threading pipeline (Fetch, Decode, T. Cache/Mc ROM, Rename/Allocate, Out-of-Order Schedule/Execute, Retirement), with two APICs, two Arch States and the Physical Registers.]
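
Data domain decomposition can be sketched with POSIX threads: two threads run the same code on two independent video sequences. The slides do not name a threading API, so pthreads, the `sequence_job` type and the function names are assumptions, and a single Daub-4 low-pass step stands in for the whole encoder.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical job description: one video sequence per thread. */
typedef struct {
    const float *in;  /* 2*m + 2 input samples */
    float *low;       /* m output coefficients */
    size_t m;
} sequence_job;

/* Stand-in for the encoder: a Daub-4 low-pass step on one sequence. */
static void *encode_sequence(void *arg)
{
    sequence_job *job = arg;
    const float c0 = 0.4829629f, c1 = 0.8365163f,
                c2 = 0.2241439f, c3 = -0.1294095f;
    for (size_t j = 0; j < job->m; j++) {
        const float *p = job->in + 2*j;
        job->low[j] = c0*p[0] + c1*p[1] + c2*p[2] + c3*p[3];
    }
    return NULL;
}

/* Runs two independent sequences on two threads.
 * Returns 0 on success, -1 if a thread could not be created. */
int encode_two_sequences(sequence_job *a, sequence_job *b)
{
    pthread_t t1, t2;
    assert(a != NULL && b != NULL);
    if (pthread_create(&t1, NULL, encode_sequence, a) != 0)
        return -1;
    if (pthread_create(&t2, NULL, encode_sequence, b) != 0) {
        pthread_join(t1, NULL);
        return -1;
    }
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return 0;
}
```

Because the two jobs share no data, no locking is needed; on a Hyper-Threading processor the threads still compete for the shared execution resources and caches.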

Functional Decomposition

The two threads run different parts of the encoder. Three splits are shown:

– Thread-1: Memory Prefetch; Thread-2: 3D-FWT
– Thread-1: 3D-FWT; Thread-2: Quantization
– 3D-FWT, Q. Low and Q. High divided between Thread-1 and Thread-2

[Figure: Thread-1 and Thread-2 feeding the shared Hyper-Threading pipeline (Fetch, Decode, T. Cache/Mc ROM, Rename/Allocate, Out-of-Order Schedule/Execute, Retirement), with two APICs, two Arch States and the Physical Registers.]
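
The split in which one thread transforms and the other quantizes can be sketched as a two-stage pipeline with POSIX threads. Everything here is an assumption made for illustration: the slides do not name a threading API or a synchronization scheme, the block sizes are arbitrary, and a scaling by 0.5 and a truncation stand in for the 3D-FWT and the quantizer.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

#define NBLOCKS 4   /* hypothetical number of blocks */
#define BLOCK   8   /* hypothetical block size */

typedef struct {
    float in[NBLOCKS][BLOCK];
    float coef[NBLOCKS][BLOCK];
    int   q[NBLOCKS][BLOCK];
    int   ready;              /* blocks finished by the first stage */
    pthread_mutex_t mu;
    pthread_cond_t  cv;
} pipeline;

/* Stage 1: stand-in for the 3D-FWT, publishes each finished block. */
static void *stage_transform(void *arg)
{
    pipeline *pl = arg;
    for (int b = 0; b < NBLOCKS; b++) {
        for (int i = 0; i < BLOCK; i++)
            pl->coef[b][i] = 0.5f * pl->in[b][i];
        pthread_mutex_lock(&pl->mu);
        pl->ready = b + 1;
        pthread_cond_signal(&pl->cv);
        pthread_mutex_unlock(&pl->mu);
    }
    return NULL;
}

/* Stage 2: stand-in quantizer, waits until block b has been published. */
static void *stage_quantize(void *arg)
{
    pipeline *pl = arg;
    for (int b = 0; b < NBLOCKS; b++) {
        pthread_mutex_lock(&pl->mu);
        while (pl->ready <= b)
            pthread_cond_wait(&pl->cv, &pl->mu);
        pthread_mutex_unlock(&pl->mu);
        for (int i = 0; i < BLOCK; i++)
            pl->q[b][i] = (int)pl->coef[b][i];
    }
    return NULL;
}

/* Runs both stages concurrently. Returns 0 on success, -1 otherwise. */
int run_pipeline(pipeline *pl)
{
    pthread_t t1, t2;
    int rc = 0;
    assert(pl != NULL);
    pl->ready = 0;
    pthread_mutex_init(&pl->mu, NULL);
    pthread_cond_init(&pl->cv, NULL);
    if (pthread_create(&t1, NULL, stage_transform, pl) != 0) {
        rc = -1;
    } else {
        if (pthread_create(&t2, NULL, stage_quantize, pl) != 0)
            rc = -1;
        else
            pthread_join(t2, NULL);
        pthread_join(t1, NULL);
    }
    pthread_cond_destroy(&pl->cv);
    pthread_mutex_destroy(&pl->mu);
    return rc;
}
```

Unlike the data domain split, the stages depend on each other, so the second thread can only run ahead as far as the first has published blocks.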

3D Wavelet Transform on Uniprocessor Architectures
