36 %
64 %
Information about Chaver

Published on January 12, 2008

Author: Calogera


Vectorization of the 2D Wavelet Lifting Transform Using SIMD Extensions:  Vectorization of the 2D Wavelet Lifting Transform Using SIMD Extensions D. Chaver, C. Tenllado, L. Piñuel, M. Prieto, F. Tirado Index:  Index Motivation Experimental environment Lifting Transform Memory hierarchy exploitation SIMD optimization Conclusions Future work Slide3:  Motivation Slide4:  Motivation Applications based on the Wavelet Transform: JPEG-2000 MPEG-4 Usage of the lifting scheme Study based on a modern general purpose microprocessor Pentium 4 Objectives: Efficient exploitation of Memory Hierarchy Use of the SIMD ISA extensions Slide5:  Experimental Environment Slide6:  Experimental Environment RedHat Distribution 7.2 (Enigma) Operating System 1 GB RDRAM (PC800) Memory 512 KB, 128 Byte/Line L2 8 KB, 64 Byte/Line, Write-Through DL1 NA IL1 Cache DFI WT70-EC Motherboard Intel Pentium4 (2,4 GHz) Platform Intel ICC compiler GCC compiler Compiler Slide7:  Lifting Transform Slide8:  D 1st 1st 1st 1st 1st 1st Lifting Transform Original element 1st step 2nd step A D D D A A A 1st 1st Slide9:  N Levels Lifting Transform 1 Level Horizontal Filtering (1D Lifting Transform) Vertical Filtering (1D Lifting Transform) Original element Approximation Slide10:  Lifting Transform Horizontal Filtering Vertical Filtering Slide11:  Memory Hierarchy Exploitation Slide12:  Poor data locality of one component (canonical layouts) E.g. : column-major layout  processing image rows (Horizontal Filtering) Aggregation (loop tiling) Memory Hierarchy Exploitation Poor data locality of the whole transform Other layouts Slide13:  Memory Hierarchy Exploitation Horizontal Filtering Vertical Filtering Slide14:  Aggregation Horizontal Filtering IMAGE Memory Hierarchy Exploitation Slide15:  Memory Hierarchy Exploitation INPLACE Common implementation of the transform Memory: Only requires the original matrix For most applications needs post-processing MALLAT Memory: requires 2 matrices Stores the image in the expected order INPLACE-MALLAT Memory: requires 2 matrices Stores the image in the expected order Different studied schemes Slide16:  Memory Hierarchy Exploitation O O O O O O O O O O O O O O O O MATRIX 1 logical view physical view INPLACE Slide17:  Memory Hierarchy Exploitation O O O O O O O O O O O O O O O O MATRIX 1 MATRIX 2 logical view physical view MALLAT Slide18:  Memory Hierarchy Exploitation MATRIX 1 MATRIX 2 O O O O O O O O O O O O O O O O logical view physical view INPLACE- MALLAT Slide19:  Memory Hierarchy Exploitation Execution time breakdown for several sizes comparing both compilers. I, IM and M denote inplace, inplace-mallat, and mallat strategies respectively. Each bar shows the execution time of each level and the post-processing step. Slide20:  The Mallat and Inplace-Mallat approaches outperform the Inplace approach for levels 2 and above These 2 approaches have a noticeable slowdown for the 1st level: Larger working set More complex access pattern The Inplace-Mallat version achieves the best execution time ICC compiler outperforms GCC for Mallat and Inplace-Mallat, but not for the Inplace approach Memory Hierarchy Exploitation CONCLUSIONS Slide21:  SIMD Optimization Slide22:  Objective: Extract the parallelism available on the Lifting Transform Different strategies: Semi-automatic vectorization Hand-coded vectorization Only the horizontal filtering of the transform can be semi-automatically vectorized (when using a column-major layout) SIMD Optimization Slide23:  SIMD Optimization Automatic Vectorization (Intel C/C++ Compiler) Inner loops Simple array index manipulation Iterate over contiguous memory locations Global variables avoided Pointer disambiguation if pointers are employed Slide24:  Original element 1st step 2nd step A D SIMD Optimization 1st 1st Slide25:  SIMD Optimization Column-major layout Vectorial Horizontal filtering + a x + Horizontal filtering + a x + a a a Slide26:  SIMD Optimization Column-major layout Vectorial Vertical filtering + a x + Vertical filtering + a x + a a a Slide27:  for(j=2,k=1;j<(#columns-4);j+=2,k++) { #pragma vector aligned for(i=0;i<#rows;i++) { /* 1st operation */ col3=col3 + alfa*( col4+ col2); /* 2nd operation */ col2=col2 + beta*( col3+ col1); /* 3rd operation */ col1=col1 + gama*( col2+ col0); /* 4th operation */ col0 =col0 + delt*( col1+ col-1); /* Last step */ detail = col1 *phi_inv; aprox = col0 *phi; } } Horizontal Vectorial Filtering (semi-automatic) SIMD Optimization Slide28:  SIMD Optimization Hand-coded Vectorization SIMD parallelism has to be explicitly expressed Intrinsics allow more flexibility Possibility to also vectorize the vertical filtering Slide29:  Horizontal Vectorial Filtering (hand) SIMD Optimization /* 1st operation */ t2 = _mm_load_ps(col2); t4 = _mm_load_ps(col4); t3 = _mm_load_ps(col3); coeff = _mm_set_ps1(alfa); t4 = _mm_add_ps(t2,t4); t4 = _mm_mul_ps(t4,coeff); t3 = _mm_add_ps(t4,t3); _mm_store_ps(col3,t3); /* 2nd operation */ /* 3rd operation */ /* 4th operation */ /* Last step */ _mm_store_ps(detail,t1); _mm_store_ps(aprox,t0); t2 t3 t4 + a x + a a a Slide30:  SIMD Optimization Execution time breakdown of the horizontal filtering (10242 pixels image). I, IM and M denote inplace, inplace-mallat and mallat approaches. S, A and H denote scalar, automatic-vectorized and hand-coded-vectorized. Slide31:  SIMD Optimization Speedup between 4 and 6 depending on the strategy. The reason for such a high improvement is due not only to the vectorial computations, but also to a considerable reduction in the memory accesses. The speedups achieved by the strategies with recursive layouts (i.e. inplace-mallat and mallat) are higher than the inplace version counterparts, since the computation on the latter can only be vectorized in the first level. For ICC, both vectorization approaches (i.e. automatic and hand-tuned) produce similar speedups, which highlights the quality of the ICC vectorizer. CONCLUSIONS Slide32:  SIMD Optimization Execution time breakdown of the whole transform (10242 pixels image). I, IM and M denote inplace, inplace-mallat and mallat approaches. S, A and H denote scalar, automatic-vectorized and hand-coded-vectorized. Slide33:  SIMD Optimization Speedup between 1,5 and 2 depending on the strategy. For ICC the shortest execution time is reached by the mallat version. When using GCC both recursive-layout strategies obtain similar results. CONCLUSIONS Slide34:  SIMD Optimization Speedup achieved by the different vectorial codes over the inplace-mallat and inplace. We show the hand-coded ICC, the automatic ICC, and the hand-coded GCC. Slide35:  SIMD Optimization The speedup grows with the image size since. On average, the speedup is about 1.8 over the inplace-mallat scheme, growing to about 2 when considering it over the inplace strategy. Focusing on the compilers, ICC clearly outperforms GCC by a significant 20-25% for all the image sizes CONCLUSIONS Slide36:  Conclusions Slide37:  Scalar version: We have introduced a new scheme called Inplace-Mallat, that outperforms both the Inplace implementation and the Mallat scheme. SIMD exploitation: Code modifications for the vectorial processing of the lifting algorithm. Two different methodologies with ICC compiler: semi-automatic and intrinsic-based vectorizations. Both provide similar results. Speedup: Horizontal filtering about 4-6 (vectorization also reduces the pressure on the memory system). Whole transform around 2. The vectorial Mallat approach outperforms the other schemes and exhibits a better scalability. Most of our insights are compiler independent. Conclusions Slide38:  Future work Slide39:  4D layout for a lifting-based scheme Measurements using other platforms Intel Itanium Intel Pentium-4 with hiperthreading Parallelization using OpenMP (SMT) Future work For additional information:

#pragma presentations

Add a comment

Related presentations

Related pages - Poster für die 125 Jahr-Feier der TU-Berlin. Poster für die 125 Jahr-Feier der TU-Berlin. Poster. ClaudiaWinterfeldt@ ...
Read more

Hugo Chávez – Wikipedia

Hugo Rafael Chávez Frías [ˈuɣo rafaˈel ˈtʃaβes ˈfɾias] (* 28. Juli 1954 in Sabaneta; † 5. März 2013 in Caracas) war ein venezolanischer ...
Read more - A New Approach to Torah and Mishnah A new approach to torah and mishnah. Discuss The Woven Torah on Facebook. A New Approach to torah and mishnah. This site offers a new new ...
Read more

chaver28 - YouTube

What's up you legends?? it's your boy chaver28 with LOADS of content for you to check out so make sure you subscribe :D i love you haha
Read more

WebChaver » Home

Protect Yourself. Protect Your Family. WebChaver sits on your computer and quietly keeps an eye out for any questionable internet use. Your designated ...
Read more

Chaver | Facebook

See more of Chaver by logging into Facebook. Message this Page, learn about upcoming events and more. If you don't have a Facebook account, ...
Read more

Chava Alberstein – Wikipedia

Chava Alberstein (hebräisch חוה אלברשטיין; * 8. Dezember 1947 in Stettin) ist eine international bekannte israelische Sängerin und ...
Read more

חבר - Wiktionary

This page was last modified on 6 October 2016, at 00:20. Text is available under the Creative Commons Attribution-ShareAlike License; additional ...
Read more