[calligra/calligra/2.6] /: Backported the Vc code

Sat Jan 12 13:42:03 UTC 2013

Git commit c4be9833420b25366fc02d5e7b57ac26e674f053 by Dmitry Kazakov.
Committed on 12/01/2013 at 14:20.
Pushed by dkazakov into branch 'calligra/2.6'.

Backported the Vc code

reviewed by Boudewijn

Squashed commit of the following:

commit 08df248c17d90a3f753be509c5ec7536d6915306
Author: Dmitry Kazakov <dimula73 at gmail.com>
Date:   Wed Dec 12 16:21:26 2012 +0400

    Added a note for packagers on how to make a multi-arch build of Calligra

commit 06b96db7da4428d112d0e981a67e18b6e871ae53
Author: Dmitry Kazakov <dimula73 at gmail.com>
Date:   Thu Dec 6 13:52:10 2012 +0400

    Finished multi-arch build for the Circle Mask Generator

    I also merged the factories code with the composition multi-arch
    implementation, so the code is quite nice and compact now. At least it is
    not as frighteningly templatish as it was in the beginning =)

commit be8a4307501cfe995a7961d5d1e04398f6e08d9a
Author: Dmitry Kazakov <dimula73 at gmail.com>
Date:   Thu Dec 6 12:18:53 2012 +0400

    Added first multi-arch implementation of KisCircleMaskGenerator code

    It is not yet finished:
    1) It doesn't compile on !HAVE_VC
    2) It is not merged with composition factories code

commit 2553baf2af3db994ac7642fdd35dc0ebe65748c2
Author: Dmitry Kazakov <dimula73 at gmail.com>
Date:   Sat Jan 12 15:37:53 2013 +0400

    Made the per-arch compilation code reusable

    Now I can start making the same thing for KisAutoBrush

    Conflicts:

    	libs/pigment/CMakeLists.txt
    	libs/pigment/compositeops/KoOptimizedCompositeOpFactory.cpp
    	libs/pigment/compositeops/KoOptimizedCompositeOpFactoryPerArch.h

commit 9271691074e37828eb7a4281347333958c24977c
Author: Dmitry Kazakov <dimula73 at gmail.com>
Date:   Tue Dec 4 11:06:01 2012 +0400

    Added a PACKAGERS_BUILD option for code generation for many architectures

    This option is disabled by default.

    By default we build the whole of Calligra optimized for the host architecture.
    When the option is on, the hottest parts of Calligra are compiled in optimized
    versions for several of the most popular architectures. The rest of the code
    will not use any newer instructions, so as not to break binary compatibility
    among CPUs.

    Short manual:
    1) If you build Calligra for yourself and are not going to copy the Krita
       binary to a machine with a different CPU, disable this option.
    2) If you build a Calligra package and are going to distribute it among users,
       then enable the option.

    Conflicts:

    	CMakeLists.txt
    	config-vc.h.cmake

commit d20213b143560e415cece4e8d31e001730626a0e
Author: Dmitry Kazakov <dimula73 at gmail.com>
Date:   Mon Dec 3 10:47:53 2012 +0400

    Removed hardcoded setting of optimization flags for non-multiarch parts

commit 9d8b10e0841680e26630b7050328735ca98061b4
Author: Dmitry Kazakov <dimula73 at gmail.com>
Date:   Mon Dec 3 10:15:08 2012 +0400

    Fix compilation when no Vc library is present

commit b157f2103a5598dd08abd694d22d8604b6522b64
Author: Dmitry Kazakov <dimula73 at gmail.com>
Date:   Sun Dec 2 22:16:08 2012 +0400

    Fixed an alpha-locked bug

    Sorry for the inconvenience.

    BUG:311012

commit 308170695c6c5277cd04b7bbe1d4228dad282995
Author: Dmitry Kazakov <dimula73 at gmail.com>
Date:   Sat Jan 12 15:17:49 2013 +0400

    Added the first version of per-architecture binaries for composition

    Pros:
    + we can have prebuilt versions for all the architectures supported
      by Vc (AMD FMA4 and XOP are not supported by Vc yet)
    + the implementation is chosen dynamically on Krita start
    + the semi-general code for multi-arch builds is now in
      KoVcMultiArchBuildSupport.h (might be ported upstream in the future)

    Cons:
    - it depends on Vc's 'staging' branch, so it can't be put in master
      right now
    - the code became much less readable due to all that template magic
    - I had to copy-paste Vc's 'vc_compile_for_all_implementations' cmake
      macro, because we do not need 'Scalar' implementation
    - the size of the pigment library grew almost 1.5 times: 11->17 MiB
      (probably, we still need plugin system for this)
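
The "chosen dynamically on Krita start" idea can be sketched roughly as follows. This is only an illustration under stated assumptions, not the code from KoVcMultiArchBuildSupport.h: the `Impl` enum and the `compositeOverName` and `createCompositeOver` functions are hypothetical names, and the detected architecture is passed in as a parameter instead of being queried from the CPU.

```cpp
#include <string>

// Hypothetical architecture tags (not the Vc enum).
enum class Impl { Scalar, SSE2, AVX };

// One specialization per architecture. In a real multi-arch build each
// specialization would live in its own object file, compiled with its own
// -msse2 / -mavx style flags.
template <Impl I> std::string compositeOverName();
template <> std::string compositeOverName<Impl::Scalar>() { return "over/scalar"; }
template <> std::string compositeOverName<Impl::SSE2>()   { return "over/sse2"; }
template <> std::string compositeOverName<Impl::AVX>()    { return "over/avx"; }

// Picks the build matching what the running CPU supports. `detected` is
// assumed to come from a runtime CPU feature query done once at start-up.
std::string createCompositeOver(Impl detected)
{
    switch (detected) {
    case Impl::AVX:
        return compositeOverName<Impl::AVX>();
    case Impl::SSE2:
        return compositeOverName<Impl::SSE2>();
    case Impl::Scalar:
        break;
    }
    return compositeOverName<Impl::Scalar>();
}
```

The switch is the only place that knows about every architecture, so the per-arch code itself can stay free of runtime checks.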


    Conflicts:

    	libs/pigment/CMakeLists.txt
    	libs/pigment/compositeops/KoOptimizedCompositeOpAlphaDarken32.h
    	libs/pigment/compositeops/KoOptimizedCompositeOpFactory.cpp
    	libs/pigment/compositeops/KoOptimizedCompositeOpOver32.h
    	libs/pigment/compositeops/KoStreamedMath.h

commit 12933dd4ea64ea59b9a038c2f70db87e6ef60810
Author: Dmitry Kazakov <dimula73 at gmail.com>
Date:   Sun Dec 2 09:37:49 2012 +0400

    Made the vector composite ops another 1.5-2 times faster

    Conversion Uint<->Float is quite expensive in comparison to
    Int<->Float (2-2.5 times). This happens because of special code
    that handles sign bit of the number. So discarding this bit with
    conversion Uint->Int makes a huge speedup.

    Now the vector version of the composition is 1.8-8.7 times faster
    than the old version (weighted: 3.2 times).

    Many thanks to Matthias Kretz for pointing this out!

    CCMAIL:kimageshop at kde.org
    CCMAIL:kretz at kde.org
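
Why dropping the sign handling is safe can be sketched in scalar C++: 8-bit channel values never set the top bit of a 32-bit lane, so going through a signed integer does not change the number. This is a minimal illustration and not the Vc code from the commit; `channelToFloat` is a hypothetical name.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper (not from the commit): converts an 8-bit channel value
// held in a 32-bit unsigned lane to float via a signed intermediate.
inline float channelToFloat(std::uint32_t value)
{
    // Channel values are 0..255, so the sign bit is never set and the
    // unsigned -> signed cast cannot change the number.
    assert(value <= 255u);
    const std::int32_t asSigned = static_cast<std::int32_t>(value);
    return static_cast<float>(asSigned);
}
```

For every value in the 0..255 range the result equals a direct unsigned-to-float conversion, which is what lets the cheaper signed conversion be used instead.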

commit 57ee76dc327b7d771ffaa62ccc4670d29abb015c
Author: Dmitry Kazakov <dimula73 at gmail.com>
Date:   Sat Dec 1 22:19:49 2012 +0400

    Fixed a 1.4 times speed regression when legacy/optimized ops are put together

    The optimized and legacy composite ops should be put into separate
    object files. Otherwise, some code layout/locality problem arises.
    I do not know the exact explanation of this phenomenon, but splitting
    the implementations fixes it.

commit 3a3491ce37a284110963e584609f1e3a030005a4
Author: Dmitry Kazakov <dimula73 at gmail.com>
Date:   Tue Nov 27 18:34:24 2012 +0400

    Create the Vc version of the composite op only when the current CPU supports it

commit 767405a815c1208b79ad40dffe3db6ecefeed7a8
Author: Dmitry Kazakov <dimula73 at gmail.com>
Date:   Mon Nov 26 17:28:26 2012 +0400

    Fixed a zero-alpha bug in the vector implementation of the OVER composite op

commit 0485f3bdc33d6ac846003e4c6313567b93273347
Author: Dmitry Kazakov <dimula73 at gmail.com>
Date:   Thu Nov 22 20:29:03 2012 +0400

    Fixed warnings and a bug in the vector compositioning

commit bb36f513fbf442e9e00af0f10a874a8814076bbc
Author: Dmitry Kazakov <dimula73 at gmail.com>
Date:   Thu Nov 22 20:05:17 2012 +0400

    Fixed compilation when no Vc library is present in the system

commit 1f52906297b725b5c33b4e25d1df72db8849c516
Author: Dmitry Kazakov <dimula73 at gmail.com>
Date:   Fri Oct 26 15:25:12 2012 +0400

    Fixed a bug in the optimized OVER composite

    In some cases src_alpha does not correspond to the real source alpha
    because it has opacity and mask mixed into it. In these cases we cannot
    use memcpy.

commit 8645a698f71cadec3c4403e2d09c4a5ff953f1cd
Author: Dmitry Kazakov <dimula73 at gmail.com>
Date:   Thu Oct 25 19:21:40 2012 +0400

    Added fast-path optimizations to the vector Alpha Darken composite op

    The cases of 0 or 255 alpha value are quite common in Krita

commit 1580f62db31a6ad73aadea5a51c9227f53cd41db
Author: Dmitry Kazakov <dimula73 at gmail.com>
Date:   Wed Oct 24 22:53:56 2012 +0400

    Optimized the vector Over composite op for special cases of alpha

    Alpha: 255 and 0 are too common in Krita, so these checks do really
    good work.

    Now some of the Stroke Benchmarks execute 10 or 20% faster. For others
    there is no change.
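
The alpha fast paths can be sketched for a single pixel in scalar C++. This is an assumption-laden illustration, not the Vc implementation: `overPixel` is a hypothetical name, and it covers only the case of full opacity and no mask. As the memcpy fix earlier in this list notes, the copy shortcut is not valid once opacity or a mask is mixed into the source alpha.

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical scalar sketch: composites one 8-bit RGBA pixel (alpha in
// channel 3) over dst, at full opacity and without a mask.
inline void overPixel(const std::uint8_t *src, std::uint8_t *dst)
{
    const std::uint8_t srcAlpha = src[3];

    // Fast path 1: a fully transparent source leaves dst untouched.
    if (srcAlpha == 0) {
        return;
    }
    // Fast path 2: a fully opaque source simply replaces dst.
    if (srcAlpha == 255) {
        std::memcpy(dst, src, 4);
        return;
    }

    // General path: standard "over" blending in float.
    const float sa = srcAlpha / 255.0f;
    const float da = dst[3] / 255.0f;
    const float outAlpha = sa + da * (1.0f - sa); // > 0 because sa > 0

    for (int c = 0; c < 3; ++c) {
        const float blended =
            (src[c] * sa + dst[c] * da * (1.0f - sa)) / outAlpha;
        dst[c] = static_cast<std::uint8_t>(blended + 0.5f);
    }
    dst[3] = static_cast<std::uint8_t>(outAlpha * 255.0f + 0.5f);
}
```

Both checks cost one comparison each, so they pay off whenever large parts of a stroke are fully transparent or fully opaque.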

commit 8a2258b95a22fce7483e6539e713199d0faa62e9
Author: Dmitry Kazakov <dimula73 at gmail.com>
Date:   Tue Oct 16 16:51:09 2012 +0400

    The Vc implementation of the composite ops is ready for testing

    All the known bugs are fixed.

commit 6ec028de613d7b5d79e93622dc5b7b8b74470023
Author: Dmitry Kazakov <dimula73 at gmail.com>
Date:   Tue Oct 16 10:37:26 2012 +0400

    Added Vc implementation of the "over" composite

    There is still one bug in both composites: the composition of a
    single pixel should be calculated in float instead of integers,
    otherwise it causes artifacts on the canvas during painting.

commit 23879a388b2ad8089e946e5f614c2d03d515c73d
Author: Dmitry Kazakov <dimula73 at gmail.com>
Date:   Mon Oct 15 11:07:58 2012 +0400

    Added an optimized version of Alpha Darken composite op

    It makes the composition 1.58...1.74 times faster on Sandy Bridge.
    Other architectures are yet to be tested.

    Conflicts:

    	krita/CMakeLists.txt
    	krita/benchmarks/CMakeLists.txt

commit d8d16dcb5e406e2beaf0e285ef2485037fe38166
Author: Dmitry Kazakov <dimula73 at gmail.com>
Date:   Mon Oct 15 10:57:16 2012 +0400

    Fixed a cmake config bug that prevented Vc from using SIMD extensions

    Without these flags Vc falls back to SSE2 and doesn't use the extensions
    present in the current CPU.

M  +84   -0    CMakeLists.txt
M  +46   -0    README.PACKAGERS
R  +1    -3    config-vc.h.cmake [from: krita/config-vc.h.cmake - 080% similarity]
M  +0    -19   krita/CMakeLists.txt
M  +8    -5    krita/benchmarks/CMakeLists.txt
A  +613  -0    krita/benchmarks/kis_composition_benchmark.cpp     [License: GPL (v2+)]
A  +53   -0    krita/benchmarks/kis_composition_benchmark.h     [License: GPL (v2+)]
M  +1    -1    krita/benchmarks/kis_mask_generator_benchmark.cpp
M  +4    -0    krita/image/CMakeLists.txt
M  +14   -0    krita/image/kis_base_mask_generator.cpp
M  +4    -3    krita/image/kis_base_mask_generator.h
A  +92   -0    krita/image/kis_brush_mask_applicator_base.h     [License: GPL (v2+)]
A  +122  -0    krita/image/kis_brush_mask_applicator_factories.cpp     [License: GPL (v2+)]
A  +52   -0    krita/image/kis_brush_mask_applicator_factories.h     [License: GPL (v2+)]
A  +207  -0    krita/image/kis_brush_mask_applicators.h     [License: GPL (v2+)]
M  +9    -68   krita/image/kis_circle_mask_generator.cpp
M  +4    -4    krita/image/kis_circle_mask_generator.h
A  +30   -0    krita/image/kis_circle_mask_generator_p.h     [License: GPL (v2+)]
M  +0    -2    krita/plugins/paintops/hairy/hairy_brush.cpp
M  +1    -0    krita/plugins/paintops/libbrush/CMakeLists.txt
M  +11   -180  krita/plugins/paintops/libbrush/kis_auto_brush.cpp
M  +15   -1    libs/pigment/CMakeLists.txt
M  +4    -1    libs/pigment/KoCompositeOp.cpp
M  +0    -3    libs/pigment/colorspaces/KoRgbU16ColorSpace.cpp
M  +1    -5    libs/pigment/colorspaces/KoRgbU8ColorSpace.cpp
M  +48   -2    libs/pigment/compositeops/KoCompositeOps.h
A  +214  -0    libs/pigment/compositeops/KoOptimizedCompositeOpAlphaDarken32.h     [License: LGPL (v2+)]
A  +47   -0    libs/pigment/compositeops/KoOptimizedCompositeOpFactory.cpp     [License: GPL (v2+)]
A  +46   -0    libs/pigment/compositeops/KoOptimizedCompositeOpFactory.h     [License: GPL (v2+)]
A  +107  -0    libs/pigment/compositeops/KoOptimizedCompositeOpFactoryPerArch.cpp     [License: GPL (v2+)]
A  +54   -0    libs/pigment/compositeops/KoOptimizedCompositeOpFactoryPerArch.h     [License: GPL (v2+)]
A  +47   -0    libs/pigment/compositeops/KoOptimizedCompositeOpFactoryPerArch_Scalar.cpp     [License: GPL (v2+)]
A  +240  -0    libs/pigment/compositeops/KoOptimizedCompositeOpOver32.h     [License: LGPL (v2+)]
A  +304  -0    libs/pigment/compositeops/KoStreamedMath.h     [License: GPL (v2+)]
A  +91   -0    libs/pigment/compositeops/KoVcMultiArchBuildSupport.h     [License: GPL (v2+)]

http://commits.kde.org/calligra/c4be9833420b25366fc02d5e7b57ac26e674f053

diff --git a/CMakeLists.txt b/CMakeLists.txt
index 24bebb7..acc42db 100644
--- a/CMakeLists.txt
+++ b/CMakeLists.txt
@@ -41,6 +41,7 @@ option(NEPOMUK "support NEPOMUK Tagging" ON)
 option(TINY "compile a tiny Calligra" OFF)
 option(CREATIVEONLY "compile only Karbon and Krita" OFF)
 option(QT3SUPPORT "Build the parts of Calligra that still depend on Qt3" ON)
+option(PACKAGERS_BUILD "Build support of multiple CPU architectures in one binary. Should be used by packagers only." OFF)
 
 IF (TINY)
     set(SHOULD_BUILD_WORDS TRUE)
@@ -307,6 +308,89 @@ if(LCMS2_FOUND)
 endif(LCMS2_FOUND)
 
 ##
+## Test for Vc
+##
+
+set(OLD_CMAKE_MODULE_PATH ${CMAKE_MODULE_PATH} )
+set(CMAKE_MODULE_PATH ${CMAKE_SOURCE_DIR}/cmake/modules )
+macro_optional_find_package(Vc)
+macro_log_feature(Vc_FOUND "Vc" "Portable, zero-overhead SIMD library for C++" "http://code.compeng.uni-frankfurt.de/projects/vc" FALSE "" "Required by the Krita for vectorization")
+macro_bool_to_01(Vc_FOUND HAVE_VC)
+macro_bool_to_01(PACKAGERS_BUILD DO_PACKAGERS_BUILD)
+configure_file(config-vc.h.cmake ${CMAKE_CURRENT_BINARY_DIR}/config-vc.h )
+
+if(HAVE_VC)
+    message(STATUS "Vc found!")
+
+    SET(CMAKE_MODULE_PATH ${CMAKE_MODULE_PATH} "${Vc_CMAKE_MODULES_DIR}")
+    include (VcMacros)
+
+    # This is a copy-paste from VcMacros.cmake
+    # we need a version *without* Scalar implementation
+    macro(ko_compile_for_all_implementations_no_scalar_impl _objs _src)
+      set(${_objs})
+
+      # remove all -march, -msse, etc. flags from the flags we want to pass
+      string(REPLACE "${Vc_ARCHITECTURE_FLAGS}" "" _flags "${Vc_DEFINITIONS}")
+      string(REPLACE "-DVC_IMPL=[^ ]*" "" _flags "${_flags}")
+
+      # capture the -march= switch as -mtune; if there is none skip it
+      if(Vc_ARCHITECTURE_FLAGS MATCHES "-march=")
+        string(REGEX REPLACE "^.*-march=([^ ]*).*$" "-mtune=\\1" _tmp "${Vc_ARCHITECTURE_FLAGS}")
+        set(_flags "${_flags} ${_tmp}")
+      endif()
+
+      # make a semicolon separated list of all flags
+      string(TOUPPER "${CMAKE_BUILD_TYPE}" _tmp)
+      set(_tmp "CMAKE_CXX_FLAGS_${_tmp}")
+      string(REPLACE " " ";" _flags "${CMAKE_CXX_FLAGS} ${${_tmp}} ${_flags} ${ARGN}")
+      get_directory_property(_inc INCLUDE_DIRECTORIES)
+      foreach(_i ${_inc})
+        list(APPEND _flags "-I${_i}")
+      endforeach()
+
+      set(_vc_compile_src "${_src}")
+
+      ##! commented out intentionally, the only difference with original
+      #   _vc_compile_one_implementation(${_objs} Scalar NO_FLAG)
+      ##!
+      if(NOT Vc_SSE_INTRINSICS_BROKEN)
+        _vc_compile_one_implementation(${_objs} SSE2   "-msse2"   "-xSSE2"   "/arch:SSE2")
+        _vc_compile_one_implementation(${_objs} SSE3   "-msse3"   "-xSSE3"   "/arch:SSE2")
+        _vc_compile_one_implementation(${_objs} SSSE3  "-mssse3"  "-xSSSE3"  "/arch:SSE2")
+        _vc_compile_one_implementation(${_objs} SSE4_1 "-msse4.1" "-xSSE4.1" "/arch:SSE2")
+        _vc_compile_one_implementation(${_objs} SSE4_2 "-msse4.2" "-xSSE4.2" "/arch:SSE2")
+        _vc_compile_one_implementation(${_objs} SSE4a  "-msse4a"  "-xSSSE3"  "/arch:SSE2")
+      endif()
+      if(NOT Vc_AVX_INTRINSICS_BROKEN)
+        _vc_compile_one_implementation(${_objs} AVX      "-mavx"    "-xAVX"    "/arch:AVX")
+      endif()
+    endmacro()
+
+    macro(ko_compile_for_all_implementations_no_scalar _objs _src _opts)
+      if(PACKAGERS_BUILD)
+        ko_compile_for_all_implementations_no_scalar_impl(${_objs} ${_src} ${_opts})
+      else(PACKAGERS_BUILD)
+        set(${_objs} ${_src})
+      endif(PACKAGERS_BUILD)
+    endmacro()
+
+    macro(ko_compile_for_all_implementations _objs _src _opts)
+      if(PACKAGERS_BUILD)
+        vc_compile_for_all_implementations(${_objs} ${_src} ${_opts})
+      else(PACKAGERS_BUILD)
+        set(${_objs} ${_src})
+      endif(PACKAGERS_BUILD)
+    endmacro()
+
+    if (NOT PACKAGERS_BUILD)
+      # Optimize the whole Calligra for current architecture
+      set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} ${Vc_DEFINITIONS}")
+    endif (NOT PACKAGERS_BUILD)
+endif(HAVE_VC)
+set(CMAKE_MODULE_PATH ${OLD_CMAKE_MODULE_PATH} )
+
+##
 ## Test for Nepomuk
 ##
 if(NEPOMUK)
diff --git a/README.PACKAGERS b/README.PACKAGERS
index 84b97ad..c06c5da 100644
--- a/README.PACKAGERS
+++ b/README.PACKAGERS
@@ -26,6 +26,7 @@ Table Of Contents
 5. Calligra libraries
 5.1. Calligra default database driver: SQLite
 6. WARNING against qt 4.8.0 and 4.8.1
+7. IMPORTANT On using CPU vector capabilities in Calligra Libs and Krita
 
 1. Kexi
 =======
@@ -268,3 +269,48 @@ Without SQLite in at least this version Calligra will not compile.
 6. WARNING against qt 4.8.0 and 4.8.1
 ====================
 Using Qt 4.8.0 and 4.8.1 causes crashes. As a result Words and Stage will not be built. Please upgrade to 4.8.2. You can also patch Qt and when building Calligra set IHAVEPATCHEDQT. Patch against Qt can be found in qt48setx.patch in this directory
+
+
+7. IMPORTANT On using CPU vector capabilities in Calligra Libs and Krita
+====================
+
+IN BRIEF: 1) Install the Vc library [1] and don't forget to activate
+             the PACKAGERS_BUILD=ON option when building a package.
+          2) The Vc library should be present on the build system only;
+             it need not be installed on the client systems.
+
+Krita and Pigment can make use of the vector capabilities of the
+user's CPU. To make it possible Vc library [1] should be present in
+the host system. This is a static library and fully included into the
+final Pigment/Krita binary, so it is not necessary to have it
+installed in the client system.
+
+The code generation is generally controlled by two factors: the
+presence of the Vc library and a special cmake option
+'PACKAGERS_BUILD'. Consider three cases:
+
+1) Vc library is not present. PACKAGERS_BUILD=<don't care>.
+
+Calligra is built with the default compiler options. The resulting
+binary is non-optimized and portable among different CPU
+architectures.
+
+2) Vc library is present. PACKAGERS_BUILD=OFF (default).
+
+All the calligra binaries are optimized for the host CPU. This is the
+most efficient type of build of Krita. But be careful, because such
+binaries are not portable among different CPU architectures! Using
+this build for packages distributed to many users will most probably
+result in SIGILL crashes on the client system. Use this option for
+private builds only.
+
+3) Vc library is present. PACKAGERS_BUILD=ON.
+
+This option disables CPU optimizations for most of Calligra, but
+generates several versions of the code for its hottest parts. The
+specific implementation of the code is chosen on the fly when Calligra
+starts. This version is a bit slower than 2) but much faster than 1)
+and is *portable* among all the CPU architectures. Use this type of
+build for building distributable packages.
+
+[1] - http://code.compeng.uni-frankfurt.de/projects/vc
diff --git a/krita/config-vc.h.cmake b/config-vc.h.cmake
similarity index 80%
rename from krita/config-vc.h.cmake
rename to config-vc.h.cmake
index 227acfa..e2b6ada 100644
--- a/krita/config-vc.h.cmake
+++ b/config-vc.h.cmake
@@ -2,6 +2,4 @@
 
 /* Define if you have Vc, the vectorization library */
 #cmakedefine HAVE_VC 1
-
-
-
+#cmakedefine DO_PACKAGERS_BUILD 1
diff --git a/krita/CMakeLists.txt b/krita/CMakeLists.txt
index 28615fb..825e28d 100644
--- a/krita/CMakeLists.txt
+++ b/krita/CMakeLists.txt
@@ -12,25 +12,6 @@ if(MSVC)
   set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} /bigobj")
 endif(MSVC)
 
-set(OLD_CMAKE_MODULE_PATH ${CMAKE_MODULE_PATH} )
-set(CMAKE_MODULE_PATH ${CMAKE_SOURCE_DIR}/cmake/modules )
-
-macro_optional_find_package(Vc)
-macro_log_feature(Vc_FOUND "Vc" "Portable, zero-overhead SIMD library for C++" "http://code.compeng.uni-frankfurt.de/projects/vc" FALSE "" "Required by the Krita for vectorization")
-macro_bool_to_01(Vc_FOUND HAVE_VC)
-set(HAVE_VC FALSE)
-configure_file(config-vc.h.cmake ${CMAKE_CURRENT_BINARY_DIR}/config-vc.h )
-
-if(HAVE_VC)
-    message(STATUS "Vc found!")
-    SET(CMAKE_MODULE_PATH ${CMAKE_MODULE_PATH} "${Vc_CMAKE_MODULES_DIR}")
-    include (OptimizeForArchitecture)
-    OptimizeForArchitecture()
-endif(HAVE_VC)
-
-set(CMAKE_MODULE_PATH ${OLD_CMAKE_MODULE_PATH} )
-
-
 include(CheckFunctionExists)
 
 macro_optional_find_package(GLEW)
diff --git a/krita/benchmarks/CMakeLists.txt b/krita/benchmarks/CMakeLists.txt
index 2ecfdd2..117ca80 100644
--- a/krita/benchmarks/CMakeLists.txt
+++ b/krita/benchmarks/CMakeLists.txt
@@ -1,8 +1,12 @@
 set( EXECUTABLE_OUTPUT_PATH ${CMAKE_CURRENT_BINARY_DIR} )
 include_directories(  ${KOMAIN_INCLUDES}  ${CMAKE_SOURCE_DIR}/krita/image/tiles3 ${CMAKE_SOURCE_DIR}/krita/sdk/tests)
 
+
+set(LINK_VC_LIB)
 if(HAVE_VC)
   include_directories(${Vc_INCLUDE_DIR})
+#  set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} ${Vc_DEFINITIONS}")
+  set(LINK_VC_LIB ${Vc_LIBRARIES})
 endif(HAVE_VC)
 
 add_definitions(-DFILES_DATA_DIR="${CMAKE_CURRENT_SOURCE_DIR}/data/")
@@ -26,6 +30,7 @@ set(kis_gradient_benchmark_SRCS kis_gradient_benchmark.cpp)
 set(kis_mask_generator_benchmark_SRCS kis_mask_generator_benchmark.cpp)
 set(kis_low_memory_benchmark_SRCS kis_low_memory_benchmark.cpp)
 set(kis_filter_selections_benchmark_SRCS kis_filter_selections_benchmark.cpp)
+set(kis_composition_benchmark_SRCS kis_composition_benchmark.cpp)
 
 calligra_add_benchmark(KisDatamanagerBenchmark TESTNAME krita-benchmarks-KisDataManager ${kis_datamanager_benchmark_SRCS})
 calligra_add_benchmark(KisHLineIteratorBenchmark TESTNAME krita-benchmarks-KisHLineIterator ${kis_hiterator_benchmark_SRCS})
@@ -42,6 +47,7 @@ calligra_add_benchmark(KisGradientBenchmark TESTNAME krita-benchmarks-KisGradien
 calligra_add_benchmark(KisMaskGeneratorBenchmark TESTNAME krita-benchmarks-KisMaskGenerator ${kis_mask_generator_benchmark_SRCS})
 calligra_add_benchmark(KisLowMemoryBenchmark TESTNAME krita-benchmarks-KisLowMemory ${kis_low_memory_benchmark_SRCS})
 calligra_add_benchmark(KisFilterSelectionsBenchmark TESTNAME krita-image-KisFilterSelectionsBenchmark ${kis_filter_selections_benchmark_SRCS})
+calligra_add_benchmark(KisCompositionBenchmark TESTNAME krita-benchmarks-KisComposition ${kis_composition_benchmark_SRCS})
 
 target_link_libraries(KisDatamanagerBenchmark ${KDE4_KDEUI_LIBS} kritaimage ${QT_QTTEST_LIBRARY})
 target_link_libraries(KisHLineIteratorBenchmark ${KDE4_KDEUI_LIBS} kritaimage ${QT_QTTEST_LIBRARY})
@@ -57,8 +63,5 @@ target_link_libraries(KisFloodfillBenchmark ${KDE4_KDEUI_LIBS} kritaimage ${QT_Q
 target_link_libraries(KisGradientBenchmark ${KDE4_KDEUI_LIBS} kritaimage ${QT_QTTEST_LIBRARY})
 target_link_libraries(KisLowMemoryBenchmark ${KDE4_KDEUI_LIBS} kritaimage ${QT_QTTEST_LIBRARY})
 target_link_libraries(KisFilterSelectionsBenchmark  ${KDE4_KDEUI_LIBS} kritaimage ${QT_QTTEST_LIBRARY})
-
-target_link_libraries(KisMaskGeneratorBenchmark ${KDE4_KDEUI_LIBS} kritaimage ${QT_QTTEST_LIBRARY})
-if(HAVE_VC)
-    target_link_libraries(KisMaskGeneratorBenchmark ${Vc_LIBRARIES})
-endif(HAVE_VC)
\ No newline at end of file
+target_link_libraries(KisCompositionBenchmark ${KDE4_KDEUI_LIBS} kritaimage ${QT_QTTEST_LIBRARY} ${LINK_VC_LIB})
+target_link_libraries(KisMaskGeneratorBenchmark ${KDE4_KDEUI_LIBS} kritaimage ${QT_QTTEST_LIBRARY} ${LINK_VC_LIB})
diff --git a/krita/benchmarks/kis_composition_benchmark.cpp b/krita/benchmarks/kis_composition_benchmark.cpp
new file mode 100644
index 0000000..7b465af
--- /dev/null
+++ b/krita/benchmarks/kis_composition_benchmark.cpp
@@ -0,0 +1,613 @@
+/*
+ *  Copyright (c) 2012 Dmitry Kazakov <dimula73 at gmail.com>
+ *
+ *  This program is free software; you can redistribute it and/or modify
+ *  it under the terms of the GNU General Public License as published by
+ *  the Free Software Foundation; either version 2 of the License, or
+ *  (at your option) any later version.
+ *
+ *  This program is distributed in the hope that it will be useful,
+ *  but WITHOUT ANY WARRANTY; without even the implied warranty of
+ *  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ *  GNU General Public License for more details.
+ *
+ *  You should have received a copy of the GNU General Public License
+ *  along with this program; if not, write to the Free Software
+ *  Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301, USA.
+ */
+
+#include "kis_composition_benchmark.h"
+
+#include <qtest_kde.h>
+
+#include <KoColorSpace.h>
+#include <KoCompositeOp.h>
+#include <KoColorSpaceRegistry.h>
+
+#include <KoColorSpaceTraits.h>
+#include <KoCompositeOpAlphaDarken.h>
+#include <KoCompositeOpOver.h>
+#include "KoOptimizedCompositeOpFactory.h"
+
+
+
+// for calculation of the needed alignment
+#include "config-vc.h"
+#ifdef HAVE_VC
+#include <Vc/Vc>
+#include <Vc/IO>
+
+#include <KoOptimizedCompositeOpOver32.h>
+#include <KoOptimizedCompositeOpAlphaDarken32.h>
+#endif
+
+// for memalign()
+#include <malloc.h>
+
+const int alpha_pos = 3;
+
+enum AlphaRange {
+    ALPHA_ZERO,
+    ALPHA_UNIT,
+    ALPHA_RANDOM
+};
+
+inline quint8 generateAlphaValue(AlphaRange range) {
+    quint8 value = 0;
+
+    switch (range) {
+    case ALPHA_ZERO:
+        break;
+    case ALPHA_UNIT:
+        value = 255;
+        break;
+    case ALPHA_RANDOM:
+        value = qrand() % 255;
+        break;
+    }
+
+    return value;
+}
+
+void generateDataLine(uint seed, int numPixels, quint8 *srcPixels, quint8 *dstPixels, quint8 *mask, AlphaRange srcAlphaRange, AlphaRange dstAlphaRange)
+{
+    Q_ASSERT(numPixels >= 4);
+
+    for (int i = 0; i < 4; i++) {
+        srcPixels[4*i]   = i * 10 + 30;
+        srcPixels[4*i+1] = i * 10 + 30;
+        srcPixels[4*i+2] = i * 10 + 30;
+        srcPixels[4*i+3] = i * 10 + 35;
+
+        dstPixels[4*i]   = i * 10 + 160;
+        dstPixels[4*i+1] = i * 10 + 160;
+        dstPixels[4*i+2] = i * 10 + 160;
+        dstPixels[4*i+3] = i * 10 + 165;
+
+        mask[i] = i * 10 + 225;
+    }
+
+    qsrand(seed);
+    numPixels -= 4;
+    srcPixels += 4 * 4;
+    dstPixels += 4 * 4;
+    mask += 4;
+
+    for (int i = 0; i < numPixels; i++) {
+        for (int j = 0; j < 3; j++) {
+            *(srcPixels++) = qrand() % 255;
+            *(dstPixels++) = qrand() % 255;
+        }
+
+        *(srcPixels++) = generateAlphaValue(srcAlphaRange);
+        *(dstPixels++) = generateAlphaValue(dstAlphaRange);
+
+        *(mask++) = qrand() % 255;
+    }
+}
+
+void printData(int numPixels, quint8 *srcPixels, quint8 *dstPixels, quint8 *mask)
+{
+    for (int i = 0; i < numPixels; i++) {
+        qDebug() << "Src: "
+                 << srcPixels[i*4] << "\t"
+                 << srcPixels[i*4+1] << "\t"
+                 << srcPixels[i*4+2] << "\t"
+                 << srcPixels[i*4+3] << "\t"
+                 << "Msk:" << mask[i];
+
+        qDebug() << "Dst: "
+                 << dstPixels[i*4] << "\t"
+                 << dstPixels[i*4+1] << "\t"
+                 << dstPixels[i*4+2] << "\t"
+                 << dstPixels[i*4+3];
+    }
+}
+
+const int rowStride = 64;
+const int totalRows = 64;
+const QRect processRect(0,0,64,64);
+const int numPixels = rowStride * totalRows;
+const int numTiles = 1024;
+
+
+struct Tile {
+    quint8 *src;
+    quint8 *dst;
+    quint8 *mask;
+};
+#include <stdint.h>
+QVector<Tile> generateTiles(int size,
+                            const int srcAlignmentShift,
+                            const int dstAlignmentShift,
+                            AlphaRange srcAlphaRange,
+                            AlphaRange dstAlphaRange)
+{
+    QVector<Tile> tiles(size);
+
+#ifdef HAVE_VC
+    const int vecSize = Vc::float_v::Size;
+#else
+    const int vecSize = 1;
+#endif
+
+    for (int i = 0; i < size; i++) {
+        tiles[i].src = (quint8*)memalign(vecSize * 4, numPixels * 4 + srcAlignmentShift) + srcAlignmentShift;
+        tiles[i].dst = (quint8*)memalign(vecSize * 4, numPixels * 4 + dstAlignmentShift) + dstAlignmentShift;
+        tiles[i].mask = (quint8*)memalign(vecSize, numPixels);
+
+        generateDataLine(1, numPixels, tiles[i].src, tiles[i].dst, tiles[i].mask, srcAlphaRange, dstAlphaRange);
+    }
+
+    return tiles;
+}
+
+void freeTiles(QVector<Tile> tiles,
+               const int srcAlignmentShift,
+               const int dstAlignmentShift)
+{
+    foreach (const Tile &tile, tiles) {
+        free(tile.src - srcAlignmentShift);
+        free(tile.dst - dstAlignmentShift);
+        free(tile.mask);
+    }
+}
+
+inline bool fuzzyCompare(quint8 a, quint8 b, quint8 prec) {
+    return qAbs(a - b) <= prec;
+}
+
+inline bool comparePixels(quint8 *p1, quint8*p2, quint8 prec) {
+    return (p1[3] == p2[3] && p1[3] == 0) ||
+        (fuzzyCompare(p1[0], p2[0], prec) &&
+         fuzzyCompare(p1[1], p2[1], prec) &&
+         fuzzyCompare(p1[2], p2[2], prec) &&
+         fuzzyCompare(p1[3], p2[3], prec));
+
+}
+
+bool compareTwoOps(bool haveMask, const KoCompositeOp *op1, const KoCompositeOp *op2)
+{
+    QVector<Tile> tiles = generateTiles(2, 16, 16, ALPHA_RANDOM, ALPHA_RANDOM);
+
+    KoCompositeOp::ParameterInfo params;
+    params.dstRowStride  = 4 * rowStride;
+    params.srcRowStride  = 4 * rowStride;
+    params.maskRowStride = rowStride;
+    params.rows          = processRect.height();
+    params.cols          = processRect.width();
+    params.opacity       = 0.5*1.0f;
+    params.flow          = 0.3*1.0f;
+    params.channelFlags  = QBitArray();
+
+    params.dstRowStart   = tiles[0].dst;
+    params.srcRowStart   = tiles[0].src;
+    params.maskRowStart  = haveMask ? tiles[0].mask : 0;
+    op1->composite(params);
+
+    params.dstRowStart   = tiles[1].dst;
+    params.srcRowStart   = tiles[1].src;
+    params.maskRowStart  = haveMask ? tiles[1].mask : 0;
+    op2->composite(params);
+
+    quint8 *dst1 = tiles[0].dst;
+    quint8 *dst2 = tiles[1].dst;
+    for (int i = 0; i < numPixels; i++) {
+        if (!comparePixels(dst1, dst2, 7)) {
+
+            qDebug() << "Wrong result:" << i;
+            qDebug() << "Act: " << dst1[0] << dst1[1] << dst1[2] << dst1[3];
+            qDebug() << "Exp: " << dst2[0] << dst2[1] << dst2[2] << dst2[3];
+
+            quint8 *src1 = tiles[0].src + 4 * i;
+            quint8 *src2 = tiles[1].src + 4 * i;
+
+            qDebug() << "SrcA:" << src1[0] << src1[1] << src1[2] << src1[3];
+            qDebug() << "SrcE:" << src2[0] << src2[1] << src2[2] << src2[3];
+
+            qDebug() << "MskA:" << tiles[0].mask[i];
+            qDebug() << "MskE:" << tiles[1].mask[i];
+
+            return false;
+        }
+        dst1 += 4;
+        dst2 += 4;
+    }
+
+    freeTiles(tiles, 16, 16);
+
+    return true;
+}
+
+QString getTestName(bool haveMask,
+                    const int srcAlignmentShift,
+                    const int dstAlignmentShift,
+                    AlphaRange srcAlphaRange,
+                    AlphaRange dstAlphaRange)
+{
+
+    QString testName;
+    testName +=
+        !srcAlignmentShift && !dstAlignmentShift ? "Aligned   " :
+        !srcAlignmentShift &&  dstAlignmentShift ? "SrcUnalig " :
+         srcAlignmentShift && !dstAlignmentShift ? "DstUnalig " :
+         srcAlignmentShift &&  dstAlignmentShift ? "Unaligned " : "###";
+
+    testName += haveMask ? "Mask   " : "NoMask ";
+
+    testName +=
+        srcAlphaRange == ALPHA_RANDOM ? "SrcRand " :
+        srcAlphaRange == ALPHA_ZERO   ? "SrcZero " :
+        srcAlphaRange == ALPHA_UNIT   ? "SrcUnit " : "###";
+
+    testName +=
+        dstAlphaRange == ALPHA_RANDOM ? "DstRand" :
+        dstAlphaRange == ALPHA_ZERO   ? "DstZero" :
+        dstAlphaRange == ALPHA_UNIT   ? "DstUnit" : "###";
+
+    return testName;
+}
+
+void benchmarkCompositeOp(const KoCompositeOp *op,
+                          bool haveMask,
+                          qreal opacity,
+                          qreal flow,
+                          const int srcAlignmentShift,
+                          const int dstAlignmentShift,
+                          AlphaRange srcAlphaRange,
+                          AlphaRange dstAlphaRange)
+{
+    QString testName = getTestName(haveMask, srcAlignmentShift, dstAlignmentShift, srcAlphaRange, dstAlphaRange);
+
+    QVector<Tile> tiles =
+        generateTiles(numTiles, srcAlignmentShift, dstAlignmentShift, srcAlphaRange, dstAlphaRange);
+
+//    qDebug() << "Initial values:";
+//    printData(8, tiles[0].src, tiles[0].dst, tiles[0].mask);
+
+    const int tileOffset = 4 * (processRect.y() * rowStride + processRect.x());
+
+    KoCompositeOp::ParameterInfo params;
+    params.dstRowStride  = 4 * rowStride;
+    params.srcRowStride  = 4 * rowStride;
+    params.maskRowStride = rowStride;
+    params.rows          = processRect.height();
+    params.cols          = processRect.width();
+    params.opacity       = opacity;
+    params.flow          = flow;
+    params.channelFlags  = QBitArray();
+
+    QTime timer;
+    timer.start();
+
+    foreach (const Tile &tile, tiles) {
+        params.dstRowStart   = tile.dst + tileOffset;
+        params.srcRowStart   = tile.src + tileOffset;
+        params.maskRowStart  = haveMask ? tile.mask : 0;
+        op->composite(params);
+    }
+
+    qDebug() << testName << "RESULT:" << timer.elapsed() << "msec";
+
+//    qDebug() << "Final values:";
+//    printData(8, tiles[0].src, tiles[0].dst, tiles[0].mask);
+
+    freeTiles(tiles, srcAlignmentShift, dstAlignmentShift);
+}
+
+void benchmarkCompositeOp(const KoCompositeOp *op, const QString &postfix)
+{
+    qDebug() << "Testing Composite Op:" << op->id() << "(" << postfix << ")";
+
+    benchmarkCompositeOp(op, true, 0.5, 0.3, 0, 0, ALPHA_RANDOM, ALPHA_RANDOM);
+    benchmarkCompositeOp(op, true, 0.5, 0.3, 8, 0, ALPHA_RANDOM, ALPHA_RANDOM);
+    benchmarkCompositeOp(op, true, 0.5, 0.3, 0, 8, ALPHA_RANDOM, ALPHA_RANDOM);
+    benchmarkCompositeOp(op, true, 0.5, 0.3, 4, 8, ALPHA_RANDOM, ALPHA_RANDOM);
+
+/// --- Vary the content of the source and destination
+
+    benchmarkCompositeOp(op, false, 1.0, 1.0, 0, 0, ALPHA_RANDOM, ALPHA_RANDOM);
+    benchmarkCompositeOp(op, false, 1.0, 1.0, 0, 0, ALPHA_ZERO, ALPHA_RANDOM);
+    benchmarkCompositeOp(op, false, 1.0, 1.0, 0, 0, ALPHA_UNIT, ALPHA_RANDOM);
+
+/// ---
+
+    benchmarkCompositeOp(op, false, 1.0, 1.0, 0, 0, ALPHA_RANDOM, ALPHA_ZERO);
+    benchmarkCompositeOp(op, false, 1.0, 1.0, 0, 0, ALPHA_ZERO, ALPHA_ZERO);
+    benchmarkCompositeOp(op, false, 1.0, 1.0, 0, 0, ALPHA_UNIT, ALPHA_ZERO);
+
+/// ---
+
+    benchmarkCompositeOp(op, false, 1.0, 1.0, 0, 0, ALPHA_RANDOM, ALPHA_UNIT);
+    benchmarkCompositeOp(op, false, 1.0, 1.0, 0, 0, ALPHA_ZERO, ALPHA_UNIT);
+    benchmarkCompositeOp(op, false, 1.0, 1.0, 0, 0, ALPHA_UNIT, ALPHA_UNIT);
+}
+
+#ifdef HAVE_VC
+
+template<class Compositor>
+void checkRounding()
+{
+    QVector<Tile> tiles =
+        generateTiles(2, 0, 0, ALPHA_RANDOM, ALPHA_RANDOM);
+
+    const int vecSize = Vc::float_v::Size;
+
+    const int numBlocks = numPixels / vecSize;
+
+    quint8 *src1 = tiles[0].src;
+    quint8 *dst1 = tiles[0].dst;
+    quint8 *msk1 = tiles[0].mask;
+
+    quint8 *src2 = tiles[1].src;
+    quint8 *dst2 = tiles[1].dst;
+    quint8 *msk2 = tiles[1].mask;
+
+    for (int i = 0; i < numBlocks; i++) {
+        Compositor::template compositeVector<true,true, VC_IMPL>(src1, dst1, msk1, 0.5, 0.3);
+        for (int j = 0; j < vecSize; j++) {
+
+            Compositor::template compositeOnePixelScalar<true, VC_IMPL>(src2, dst2, msk2, 0.5, 0.3, QBitArray());
+
+            if(!comparePixels(dst1, dst2, 0)) {
+                qDebug() << "Wrong rounding in pixel:" << vecSize * i + j;
+                qDebug() << "Vector version: " << dst1[0] << dst1[1] << dst1[2] << dst1[3];
+                qDebug() << "Scalar version: " << dst2[0] << dst2[1] << dst2[2] << dst2[3];
+
+                qDebug() << "src:" << src1[0] << src1[1] << src1[2] << src1[3];
+                qDebug() << "msk:" << msk1[0];
+
+                QFAIL("Wrong rounding");
+            }
+
+            src1 += 4;
+            dst1 += 4;
+            src2 += 4;
+            dst2 += 4;
+            msk1++;
+            msk2++;
+        }
+    }
+
+    freeTiles(tiles, 0, 0);
+}
+
+#endif
+
+
+void KisCompositionBenchmark::checkRoundingAlphaDarken()
+{
+#ifdef HAVE_VC
+    checkRounding<AlphaDarkenCompositor32<quint8, quint32> >();
+#endif
+}
+
+void KisCompositionBenchmark::checkRoundingOver()
+{
+#ifdef HAVE_VC
+    checkRounding<OverCompositor32<quint8, quint32, false, true> >();
+#endif
+}
+
+void KisCompositionBenchmark::compareAlphaDarkenOps()
+{
+    const KoColorSpace *cs = KoColorSpaceRegistry::instance()->rgb8();
+    KoCompositeOp *opAct = KoOptimizedCompositeOpFactory::createAlphaDarkenOp32(cs);
+    KoCompositeOp *opExp = new KoCompositeOpAlphaDarken<KoBgrU8Traits>(cs);
+
+    QVERIFY(compareTwoOps(true, opAct, opExp));
+
+    delete opExp;
+    delete opAct;
+}
+
+void KisCompositionBenchmark::compareAlphaDarkenOpsNoMask()
+{
+    const KoColorSpace *cs = KoColorSpaceRegistry::instance()->rgb8();
+    KoCompositeOp *opAct = KoOptimizedCompositeOpFactory::createAlphaDarkenOp32(cs);
+    KoCompositeOp *opExp = new KoCompositeOpAlphaDarken<KoBgrU8Traits>(cs);
+
+    QVERIFY(compareTwoOps(false, opAct, opExp));
+
+    delete opExp;
+    delete opAct;
+}
+
+void KisCompositionBenchmark::compareOverOps()
+{
+    const KoColorSpace *cs = KoColorSpaceRegistry::instance()->rgb8();
+    KoCompositeOp *opAct = KoOptimizedCompositeOpFactory::createOverOp32(cs);
+    KoCompositeOp *opExp = new KoCompositeOpOver<KoBgrU8Traits>(cs);
+
+    QVERIFY(compareTwoOps(true, opAct, opExp));
+
+    delete opExp;
+    delete opAct;
+}
+
+void KisCompositionBenchmark::compareOverOpsNoMask()
+{
+    const KoColorSpace *cs = KoColorSpaceRegistry::instance()->rgb8();
+    KoCompositeOp *opAct = KoOptimizedCompositeOpFactory::createOverOp32(cs);
+    KoCompositeOp *opExp = new KoCompositeOpOver<KoBgrU8Traits>(cs);
+
+    QVERIFY(compareTwoOps(false, opAct, opExp));
+
+    delete opExp;
+    delete opAct;
+}
+
+void KisCompositionBenchmark::testRgb8CompositeAlphaDarkenLegacy()
+{
+    const KoColorSpace *cs = KoColorSpaceRegistry::instance()->rgb8();
+    KoCompositeOp *op = new KoCompositeOpAlphaDarken<KoBgrU8Traits>(cs);
+    benchmarkCompositeOp(op, "Legacy");
+    delete op;
+}
+
+void KisCompositionBenchmark::testRgb8CompositeAlphaDarkenOptimized()
+{
+    const KoColorSpace *cs = KoColorSpaceRegistry::instance()->rgb8();
+    KoCompositeOp *op = KoOptimizedCompositeOpFactory::createAlphaDarkenOp32(cs);
+    benchmarkCompositeOp(op, "Optimized");
+    delete op;
+}
+
+void KisCompositionBenchmark::testRgb8CompositeOverLegacy()
+{
+    const KoColorSpace *cs = KoColorSpaceRegistry::instance()->rgb8();
+    KoCompositeOp *op = new KoCompositeOpOver<KoBgrU8Traits>(cs);
+    benchmarkCompositeOp(op, "Legacy");
+    delete op;
+}
+
+void KisCompositionBenchmark::testRgb8CompositeOverOptimized()
+{
+    const KoColorSpace *cs = KoColorSpaceRegistry::instance()->rgb8();
+    KoCompositeOp *op = KoOptimizedCompositeOpFactory::createOverOp32(cs);
+    benchmarkCompositeOp(op, "Optimized");
+    delete op;
+}
+
+void KisCompositionBenchmark::testRgb8CompositeAlphaDarkenReal_Aligned()
+{
+    const KoColorSpace *cs = KoColorSpaceRegistry::instance()->rgb8();
+    const KoCompositeOp *op = cs->compositeOp(COMPOSITE_ALPHA_DARKEN);
+    benchmarkCompositeOp(op, true, 0.5, 0.3, 0, 0, ALPHA_RANDOM, ALPHA_RANDOM);
+}
+
+void KisCompositionBenchmark::testRgb8CompositeOverReal_Aligned()
+{
+    const KoColorSpace *cs = KoColorSpaceRegistry::instance()->rgb8();
+    const KoCompositeOp *op = cs->compositeOp(COMPOSITE_OVER);
+    benchmarkCompositeOp(op, true, 0.5, 0.3, 0, 0, ALPHA_RANDOM, ALPHA_RANDOM);
+}
+
+void KisCompositionBenchmark::benchmarkMemcpy()
+{
+    QVector<Tile> tiles =
+        generateTiles(numTiles, 0, 0, ALPHA_UNIT, ALPHA_UNIT);
+
+    QBENCHMARK_ONCE {
+        foreach (const Tile &tile, tiles) {
+            memcpy(tile.dst, tile.src, 4 * numPixels);
+        }
+    }
+
+    freeTiles(tiles, 0, 0);
+}
+
+void KisCompositionBenchmark::benchmarkUintFloat()
+{
+#ifdef HAVE_VC
+    const int vecSize = Vc::float_v::Size;
+
+    const int dataSize = 4096;
+    quint8 *iData = (quint8*) memalign(vecSize, dataSize);
+    float *fData = (float*) memalign(vecSize * 4, dataSize * 4);
+
+    QBENCHMARK {
+        for (int i = 0; i < dataSize; i += Vc::float_v::Size) {
+            // convert uint -> float directly; this causes the
+            // static_cast helper to be called
+            Vc::float_v b(Vc::uint_v(iData + i));
+            b.store(fData + i);
+        }
+    }
+
+    free(iData);
+    free(fData);
+#endif
+}
+
+void KisCompositionBenchmark::benchmarkUintIntFloat()
+{
+#ifdef HAVE_VC
+    const int vecSize = Vc::float_v::Size;
+
+    const int dataSize = 4096;
+    quint8 *iData = (quint8*) memalign(vecSize, dataSize);
+    float *fData = (float*) memalign(vecSize * 4, dataSize * 4);
+
+    QBENCHMARK {
+        for (int i = 0; i < dataSize; i += Vc::float_v::Size) {
+            // convert uint -> int -> float; this avoids the special sign
+            // handling and gives a 2.6x speedup
+            Vc::float_v b(Vc::int_v(Vc::uint_v(iData + i)));
+            b.store(fData + i);
+        }
+    }
+
+    free(iData);
+    free(fData);
+#endif
+}
+
+void KisCompositionBenchmark::benchmarkFloatUint()
+{
+#ifdef HAVE_VC
+    const int vecSize = Vc::float_v::Size;
+
+    const int dataSize = 4096;
+    quint32 *iData = (quint32*) memalign(vecSize * 4, dataSize * 4);
+    float *fData = (float*) memalign(vecSize * 4, dataSize * 4);
+
+    QBENCHMARK {
+        for (int i = 0; i < dataSize; i += Vc::float_v::Size) {
+            // conversion float -> uint
+            Vc::uint_v b(Vc::float_v(fData + i));
+
+            b.store(iData + i);
+        }
+    }
+
+    free(iData);
+    free(fData);
+#endif
+}
+
+void KisCompositionBenchmark::benchmarkFloatIntUint()
+{
+#ifdef HAVE_VC
+    const int vecSize = Vc::float_v::Size;
+
+    const int dataSize = 4096;
+    quint32 *iData = (quint32*) memalign(vecSize * 4, dataSize * 4);
+    float *fData = (float*) memalign(vecSize * 4, dataSize * 4);
+
+    QBENCHMARK {
+        for (int i = 0; i < dataSize; i += Vc::float_v::Size) {
+            // conversion float -> int -> uint
+            Vc::uint_v b(Vc::int_v(Vc::float_v(fData + i)));
+
+            b.store(iData + i);
+        }
+    }
+
+    free(iData);
+    free(fData);
+#endif
+}
+
+QTEST_KDEMAIN(KisCompositionBenchmark, GUI)
+
diff --git a/krita/benchmarks/kis_composition_benchmark.h b/krita/benchmarks/kis_composition_benchmark.h
new file mode 100644
index 0000000..9dabf65
--- /dev/null
+++ b/krita/benchmarks/kis_composition_benchmark.h
@@ -0,0 +1,53 @@
+/*
+ *  Copyright (c) 2012 Dmitry Kazakov <dimula73 at gmail.com>
+ *
+ *  This program is free software; you can redistribute it and/or modify
+ *  it under the terms of the GNU General Public License as published by
+ *  the Free Software Foundation; either version 2 of the License, or
+ *  (at your option) any later version.
+ *
+ *  This program is distributed in the hope that it will be useful,
+ *  but WITHOUT ANY WARRANTY; without even the implied warranty of
+ *  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ *  GNU General Public License for more details.
+ *
+ *  You should have received a copy of the GNU General Public License
+ *  along with this program; if not, write to the Free Software
+ *  Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301, USA.
+ */
+
+#ifndef __KIS_COMPOSITION_BENCHMARK_H
+#define __KIS_COMPOSITION_BENCHMARK_H
+
+#include <QtTest/QtTest>
+
+class KisCompositionBenchmark : public QObject
+{
+    Q_OBJECT
+private slots:
+    void checkRoundingAlphaDarken();
+    void checkRoundingOver();
+
+    void compareAlphaDarkenOps();
+    void compareAlphaDarkenOpsNoMask();
+    void compareOverOps();
+    void compareOverOpsNoMask();
+
+    void testRgb8CompositeAlphaDarkenLegacy();
+    void testRgb8CompositeAlphaDarkenOptimized();
+
+    void testRgb8CompositeOverLegacy();
+    void testRgb8CompositeOverOptimized();
+
+    void testRgb8CompositeAlphaDarkenReal_Aligned();
+    void testRgb8CompositeOverReal_Aligned();
+
+    void benchmarkMemcpy();
+
+    void benchmarkUintFloat();
+    void benchmarkUintIntFloat();
+    void benchmarkFloatUint();
+    void benchmarkFloatIntUint();
+};
+
+#endif /* __KIS_COMPOSITION_BENCHMARK_H */
diff --git a/krita/benchmarks/kis_mask_generator_benchmark.cpp b/krita/benchmarks/kis_mask_generator_benchmark.cpp
index d861ba7..9ea4db7 100644
--- a/krita/benchmarks/kis_mask_generator_benchmark.cpp
+++ b/krita/benchmarks/kis_mask_generator_benchmark.cpp
@@ -54,7 +54,7 @@ void KisMaskGeneratorBenchmark::benchmarkSIMD()
     QBENCHMARK{
         for(int y = 0; y < 1000; ++y)
         {
-            gen.processRowFast(buffer, width, y, 0.0f, 1.0f, 500.0f, 500.0f, 0.5f, 0.5f);
+//            gen.processRowFast(buffer, width, y, 0.0f, 1.0f, 500.0f, 500.0f, 0.5f, 0.5f);
         }
     }
     Vc::free(buffer);
diff --git a/krita/image/CMakeLists.txt b/krita/image/CMakeLists.txt
index af9e1b1..93bb450 100644
--- a/krita/image/CMakeLists.txt
+++ b/krita/image/CMakeLists.txt
@@ -52,6 +52,9 @@ include_directories( ${KDE4_INCLUDE_DIR}/threadweaver/
 
 if(HAVE_VC)
   include_directories(${Vc_INCLUDE_DIR})
+  ko_compile_for_all_implementations(__per_arch_circle_mask_generator_objs kis_brush_mask_applicator_factories.cpp "-fPIC")
+else(HAVE_VC)
+  set(__per_arch_circle_mask_generator_objs kis_brush_mask_applicator_factories.cpp)
 endif(HAVE_VC)
 
 set(kritaimage_LIB_SRCS
@@ -155,6 +158,7 @@ set(kritaimage_LIB_SRCS
    kis_circle_mask_generator.cpp
    kis_gauss_circle_mask_generator.cpp
    kis_gauss_rect_mask_generator.cpp
+   ${__per_arch_circle_mask_generator_objs}
    kis_gtl_lock.cpp
    kis_curve_circle_mask_generator.cpp
    kis_curve_rect_mask_generator.cpp
diff --git a/krita/image/kis_base_mask_generator.cpp b/krita/image/kis_base_mask_generator.cpp
index 7c4b4d3..63e9a0a 100644
--- a/krita/image/kis_base_mask_generator.cpp
+++ b/krita/image/kis_base_mask_generator.cpp
@@ -31,6 +31,8 @@
 #include "kis_cubic_curve.h"
 #include "kis_curve_circle_mask_generator.h"
 #include "kis_curve_rect_mask_generator.h"
+#include "kis_brush_mask_applicator_factories.h"
+
 
 KisMaskGenerator::KisMaskGenerator(qreal diameter, qreal ratio, qreal fh, qreal fv, int spikes, Type type, const KoID& id) : d(new Private), m_id(id)
 {
@@ -42,11 +44,13 @@ KisMaskGenerator::KisMaskGenerator(qreal diameter, qreal ratio, qreal fh, qreal
     d->spikes = spikes;
     d->cachedSpikesAngle = M_PI / d->spikes;
     d->type = type;
+    d->defaultMaskProcessor = 0;
     init();
 }
 
 KisMaskGenerator::~KisMaskGenerator()
 {
+    delete d->defaultMaskProcessor;
     delete d;
 }
 
@@ -67,6 +71,16 @@ bool KisMaskGenerator::shouldVectorize() const
     return false;
 }
 
+KisBrushMaskApplicatorBase* KisMaskGenerator::applicator()
+{
+    if (!d->defaultMaskProcessor) {
+        d->defaultMaskProcessor =
+            createOptimizedClass<MaskApplicatorFactory<KisMaskGenerator, KisBrushMaskScalarApplicator> >(this);
+    }
+
+    return d->defaultMaskProcessor;
+}
+
 void KisMaskGenerator::toXML(QDomDocument& doc, QDomElement& e) const
 {
     Q_UNUSED(doc);
diff --git a/krita/image/kis_base_mask_generator.h b/krita/image/kis_base_mask_generator.h
index 15efaa6..62dde87 100644
--- a/krita/image/kis_base_mask_generator.h
+++ b/krita/image/kis_base_mask_generator.h
@@ -24,6 +24,7 @@
 #include <klocale.h>
 
 #include "krita_export.h"
+#include "kis_brush_mask_applicator_base.h"
 
 class QDomElement;
 class QDomDocument;
@@ -68,13 +69,12 @@ public:
      */
     virtual quint8 valueAt(qreal x, qreal y) const = 0;
 
-    virtual void processRowFast(float* /*buffer*/, int /*width*/, float /*y*/, float /*cosa*/, float /*sina*/,
-                                float /*centerX*/, float /*centerY*/, float /*invScaleX*/, float /*invScaleY*/) {}
-
     virtual bool shouldSupersample() const;
 
     virtual bool shouldVectorize() const;
 
+    virtual KisBrushMaskApplicatorBase* applicator();
+
     virtual void toXML(QDomDocument& , QDomElement&) const;
 
     /**
@@ -115,6 +115,7 @@ protected:
         bool empty;
         Type type;
         QString curveString;
+        KisBrushMaskApplicatorBase *defaultMaskProcessor;
     };
 
     Private* const d;
diff --git a/krita/image/kis_brush_mask_applicator_base.h b/krita/image/kis_brush_mask_applicator_base.h
new file mode 100644
index 0000000..ea29583
--- /dev/null
+++ b/krita/image/kis_brush_mask_applicator_base.h
@@ -0,0 +1,92 @@
+/*
+ *  Copyright (c) 2012 Dmitry Kazakov <dimula73 at gmail.com>
+ *
+ *  This program is free software; you can redistribute it and/or modify
+ *  it under the terms of the GNU General Public License as published by
+ *  the Free Software Foundation; either version 2 of the License, or
+ *  (at your option) any later version.
+ *
+ *  This program is distributed in the hope that it will be useful,
+ *  but WITHOUT ANY WARRANTY; without even the implied warranty of
+ *  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ *  GNU General Public License for more details.
+ *
+ *  You should have received a copy of the GNU General Public License
+ *  along with this program; if not, write to the Free Software
+ *  Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301, USA.
+ */
+
+#ifndef __KIS_BRUSH_MASK_APPLICATOR_BASE_H
+#define __KIS_BRUSH_MASK_APPLICATOR_BASE_H
+
+#include "kis_types.h"
+#include "kis_fixed_paint_device.h"
+#include "math.h"
+
+
+struct MaskProcessingData {
+    MaskProcessingData(KisFixedPaintDeviceSP _device,
+                       const KoColorSpace* _colorSpace,
+                       qreal _randomness,
+                       qreal _density,
+                       double _centerX,
+                       double _centerY,
+                       double _invScaleX,
+                       double _invScaleY,
+                       double _angle)
+        {
+            device = _device;
+            colorSpace = _colorSpace;
+            randomness = _randomness;
+            density = _density;
+            centerX = _centerX;
+            centerY = _centerY;
+            invScaleX = _invScaleX;
+            invScaleY = _invScaleY;
+            cosa = cos(_angle);
+            sina = sin(_angle);
+            pixelSize = colorSpace->pixelSize();
+        }
+
+
+
+    KisFixedPaintDeviceSP device;
+    const KoColorSpace* colorSpace;
+    qreal randomness;
+    qreal density;
+    double centerX;
+    double centerY;
+    double invScaleX;
+    double invScaleY;
+
+    double cosa;
+    double sina;
+
+    qint32 pixelSize;
+};
+
+struct KisBrushMaskApplicatorBase
+{
+    virtual ~KisBrushMaskApplicatorBase() {}
+    virtual void process(const QRect &rect) = 0;
+
+    inline void initializeData(const MaskProcessingData *data) {
+        m_d = data;
+    }
+
+protected:
+    const MaskProcessingData *m_d;
+};
+
+struct OperatorWrapper {
+    OperatorWrapper(KisBrushMaskApplicatorBase *applicator)
+        : m_applicator(applicator) {}
+
+    inline void operator() (const QRect& rect) {
+        m_applicator->process(rect);
+    }
+
+    KisBrushMaskApplicatorBase *m_applicator;
+};
+
+#endif /* __KIS_BRUSH_MASK_APPLICATOR_BASE_H */
diff --git a/krita/image/kis_brush_mask_applicator_factories.cpp b/krita/image/kis_brush_mask_applicator_factories.cpp
new file mode 100644
index 0000000..249345d
--- /dev/null
+++ b/krita/image/kis_brush_mask_applicator_factories.cpp
@@ -0,0 +1,122 @@
+/*
+ *  Copyright (c) 2012 Dmitry Kazakov <dimula73 at gmail.com>
+ *
+ *  This program is free software; you can redistribute it and/or modify
+ *  it under the terms of the GNU General Public License as published by
+ *  the Free Software Foundation; either version 2 of the License, or
+ *  (at your option) any later version.
+ *
+ *  This program is distributed in the hope that it will be useful,
+ *  but WITHOUT ANY WARRANTY; without even the implied warranty of
+ *  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ *  GNU General Public License for more details.
+ *
+ *  You should have received a copy of the GNU General Public License
+ *  along with this program; if not, write to the Free Software
+ *  Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301, USA.
+ */
+
+#include "kis_brush_mask_applicator_factories.h"
+
+#include "kis_circle_mask_generator.h"
+#include "kis_circle_mask_generator_p.h"
+#include "kis_brush_mask_applicators.h"
+
+
+#define a(_s) #_s
+#define b(_s) a(_s)
+
+template<>
+template<>
+MaskApplicatorFactory<KisMaskGenerator, KisBrushMaskScalarApplicator>::ReturnType
+MaskApplicatorFactory<KisMaskGenerator, KisBrushMaskScalarApplicator>::create<VC_IMPL>(ParamType maskGenerator)
+{
+    // qDebug() << "Creating scalar applicator" << b(VC_IMPL);
+    return new KisBrushMaskScalarApplicator<KisMaskGenerator,VC_IMPL>(maskGenerator);
+}
+
+template<>
+template<>
+MaskApplicatorFactory<KisCircleMaskGenerator, KisBrushMaskVectorApplicator>::ReturnType
+MaskApplicatorFactory<KisCircleMaskGenerator, KisBrushMaskVectorApplicator>::create<VC_IMPL>(ParamType maskGenerator)
+{
+    // qDebug() << "Creating vector applicator" << b(VC_IMPL);
+    return new KisBrushMaskVectorApplicator<KisCircleMaskGenerator,VC_IMPL>(maskGenerator);
+}
+
+#if defined HAVE_VC
+
+struct KisCircleMaskGenerator::FastRowProcessor
+{
+    FastRowProcessor(KisCircleMaskGenerator *maskGenerator)
+        : d(maskGenerator->d) {}
+
+    template<Vc::Implementation _impl>
+    void process(float* buffer, int width, float y, float cosa, float sina,
+                 float centerX, float centerY, float invScaleX, float invScaleY);
+
+    KisCircleMaskGenerator::Private *d;
+};
+
+template<> void KisCircleMaskGenerator::
+FastRowProcessor::process<VC_IMPL>(float* buffer, int width, float y, float cosa, float sina,
+                                   float centerX, float centerY, float invScaleX, float invScaleY)
+{
+    float y_ = (y - centerY) * invScaleY;
+    float sinay_ = sina * y_;
+    float cosay_ = cosa * y_;
+
+    float *initValues = Vc::malloc<float, Vc::AlignOnVector>(Vc::float_v::Size);
+    for(int i = 0; i < Vc::float_v::Size; i++) {
+        initValues[i] = (float)i;
+    }
+
+    float* bufferPointer = buffer;
+
+    Vc::float_v currentIndices(initValues);
+
+    Vc::float_v increment((float)Vc::float_v::Size);
+    Vc::float_v vCenterX(centerX);
+    Vc::float_v vInvScaleX(invScaleX);
+
+    Vc::float_v vCosa(cosa);
+    Vc::float_v vSina(sina);
+    Vc::float_v vCosaY_(cosay_);
+    Vc::float_v vSinaY_(sinay_);
+
+    Vc::float_v vXCoeff(d->xcoef);
+    Vc::float_v vYCoeff(d->ycoef);
+
+    Vc::float_v vTransformedFadeX(d->transformedFadeX);
+    Vc::float_v vTransformedFadeY(d->transformedFadeY);
+
+    Vc::float_v vOne(1.0f);
+
+    for (int i=0; i < width; i+= Vc::float_v::Size){
+
+        Vc::float_v x_ = (currentIndices - vCenterX) * vInvScaleX;
+
+        Vc::float_v xr = x_ * vCosa - vSinaY_;
+        Vc::float_v yr = x_ * vSina + vCosaY_;
+
+        Vc::float_v n = ((xr * vXCoeff) * (xr * vXCoeff)) + ((yr * vYCoeff) * (yr * vYCoeff));
+
+        Vc::float_v vNormFade =((xr * vTransformedFadeX) * (xr * vTransformedFadeX)) + ((yr * vTransformedFadeY) * (yr * vTransformedFadeY));
+
+        // n * (vNormFade - 1) / (vNormFade - n); the 255 scale is applied by the caller
+        Vc::float_v vFade = n * (vNormFade - vOne) / (vNormFade - n);
+        // Mask out the inner circle of the mask
+        Vc::float_m mask = vNormFade < vOne;
+        vFade.setZero(mask);
+        vFade = Vc::min(vFade, vOne);
+
+        vFade.store(bufferPointer);
+        currentIndices = currentIndices + increment;
+
+        bufferPointer += Vc::float_v::Size;
+    }
+
+    Vc::free<float>(initValues);
+}
+
+#endif /* defined HAVE_VC */
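
Reviewer note (not part of the patch): the vectorized loop in FastRowProcessor::process can be read one lane at a time. The block below is a minimal scalar sketch of that per-pixel formula, assuming the rotated coordinates xr and yr are already computed. CircleCoeffs and circleMaskValue are hypothetical names used only for illustration; they do not appear in the commit.

```cpp
#include <algorithm>

// Hypothetical stand-in for the KisCircleMaskGenerator::Private fields
// that the vector loop reads (xcoef, ycoef, transformedFadeX/Y).
struct CircleCoeffs {
    float xcoef;
    float ycoef;
    float transformedFadeX;
    float transformedFadeY;
};

// Scalar equivalent of a single SIMD lane of the loop body.
// Returns a normalized value: 0 inside the inner region, at most 1 outside.
inline float circleMaskValue(float xr, float yr, const CircleCoeffs &c)
{
    const float nx = xr * c.xcoef;
    const float ny = yr * c.ycoef;
    const float n = nx * nx + ny * ny;

    const float fx = xr * c.transformedFadeX;
    const float fy = yr * c.transformedFadeY;
    const float normFade = fx * fx + fy * fy;

    // Mirrors vFade.setZero(vNormFade < vOne) in the vector code.
    if (normFade < 1.0f) {
        return 0.0f;
    }

    const float fade = n * (normFade - 1.0f) / (normFade - n);

    // Mirrors Vc::min(vFade, vOne).
    return std::min(fade, 1.0f);
}
```

The vector code computes the division for every lane and zeroes the masked lanes afterwards; the early return here is only a scalar convenience. Scaling to the 0..255 range is left to the caller, as in processVector.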
diff --git a/krita/image/kis_brush_mask_applicator_factories.h b/krita/image/kis_brush_mask_applicator_factories.h
new file mode 100644
index 0000000..745c5af
--- /dev/null
+++ b/krita/image/kis_brush_mask_applicator_factories.h
@@ -0,0 +1,52 @@
+/*
+ *  Copyright (c) 2012 Dmitry Kazakov <dimula73 at gmail.com>
+ *
+ *  This program is free software; you can redistribute it and/or modify
+ *  it under the terms of the GNU General Public License as published by
+ *  the Free Software Foundation; either version 2 of the License, or
+ *  (at your option) any later version.
+ *
+ *  This program is distributed in the hope that it will be useful,
+ *  but WITHOUT ANY WARRANTY; without even the implied warranty of
+ *  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ *  GNU General Public License for more details.
+ *
+ *  You should have received a copy of the GNU General Public License
+ *  along with this program; if not, write to the Free Software
+ *  Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301, USA.
+ */
+
+#ifndef __KIS_BRUSH_MASK_APPLICATOR_FACTORIES_H
+#define __KIS_BRUSH_MASK_APPLICATOR_FACTORIES_H
+
+#include "KoVcMultiArchBuildSupport.h"
+
+#include "kis_brush_mask_applicator_base.h"
+
+
+template<class MaskGenerator, Vc::Implementation _impl>
+struct KisBrushMaskScalarApplicator;
+
+#ifdef HAVE_VC
+
+template<class MaskGenerator, Vc::Implementation _impl>
+struct KisBrushMaskVectorApplicator;
+
+#else /* HAVE_VC */
+
+#define KisBrushMaskVectorApplicator KisBrushMaskScalarApplicator
+
+#endif /* HAVE_VC */
+
+template<class MaskGenerator,
+         template<class U, Vc::Implementation V> class Applicator>
+struct MaskApplicatorFactory
+{
+    typedef MaskGenerator* ParamType;
+    typedef KisBrushMaskApplicatorBase* ReturnType;
+
+    template<Vc::Implementation _impl>
+    static ReturnType create(ParamType maskGenerator);
+};
+
+#endif /* __KIS_BRUSH_MASK_APPLICATOR_FACTORIES_H */
diff --git a/krita/image/kis_brush_mask_applicators.h b/krita/image/kis_brush_mask_applicators.h
new file mode 100644
index 0000000..112085b
--- /dev/null
+++ b/krita/image/kis_brush_mask_applicators.h
@@ -0,0 +1,207 @@
+/*
+ *  Copyright (c) 2012 Sven Langkamp  <sven.langkamp at gmail.com>
+ *  Copyright (c) 2012 Dmitry Kazakov <dimula73 at gmail.com>
+ *
+ *  This program is free software; you can redistribute it and/or modify
+ *  it under the terms of the GNU General Public License as published by
+ *  the Free Software Foundation; either version 2 of the License, or
+ *  (at your option) any later version.
+ *
+ *  This program is distributed in the hope that it will be useful,
+ *  but WITHOUT ANY WARRANTY; without even the implied warranty of
+ *  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ *  GNU General Public License for more details.
+ *
+ *  You should have received a copy of the GNU General Public License
+ *  along with this program; if not, write to the Free Software
+ *  Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301, USA.
+ */
+
+#ifndef __KIS_BRUSH_MASK_APPLICATORS_H
+#define __KIS_BRUSH_MASK_APPLICATORS_H
+
+#include "kis_brush_mask_applicator_base.h"
+
+// 3x3 supersampling
+#define SUPERSAMPLING 3
+
+#if defined(_WIN32) || defined(_WIN64)
+#include <stdlib.h>
+#define srand48 srand
+inline double drand48() {
+    return double(rand()) / RAND_MAX;
+}
+#endif
+
+
+template<class MaskGenerator, Vc::Implementation _impl>
+struct KisBrushMaskScalarApplicator : public KisBrushMaskApplicatorBase
+{
+    KisBrushMaskScalarApplicator(MaskGenerator *maskGenerator)
+        : m_maskGenerator(maskGenerator)
+    {
+    }
+
+    void process(const QRect &rect) {
+        processScalar(rect);
+    }
+
+protected:
+    void processScalar(const QRect &rect);
+
+protected:
+    MaskGenerator *m_maskGenerator;
+};
+
+#if defined HAVE_VC
+
+template<class MaskGenerator, Vc::Implementation _impl>
+struct KisBrushMaskVectorApplicator : public KisBrushMaskScalarApplicator<MaskGenerator, _impl>
+{
+    KisBrushMaskVectorApplicator(MaskGenerator *maskGenerator)
+        : KisBrushMaskScalarApplicator<MaskGenerator, _impl>(maskGenerator)
+    {
+    }
+
+    void process(const QRect &rect) {
+        startProcessing(rect, TypeHelper<MaskGenerator, _impl>());
+    }
+
+protected:
+    void processVector(const QRect &rect);
+
+private:
+    template<class U, Vc::Implementation V> struct TypeHelper {};
+
+private:
+    template<class U>
+    inline void startProcessing(const QRect &rect, TypeHelper<U, Vc::ScalarImpl>) {
+        KisBrushMaskScalarApplicator<MaskGenerator, _impl>::processScalar(rect);
+    }
+
+    template<class U, Vc::Implementation V>
+    inline void startProcessing(const QRect &rect, TypeHelper<U, V>) {
+        MaskGenerator *m_maskGenerator = KisBrushMaskScalarApplicator<MaskGenerator, _impl>::m_maskGenerator;
+
+        if (m_maskGenerator->shouldVectorize()) {
+            processVector(rect);
+        } else {
+            KisBrushMaskScalarApplicator<MaskGenerator, _impl>::processScalar(rect);
+        }
+    }
+};
+
+template<class MaskGenerator, Vc::Implementation _impl>
+void KisBrushMaskVectorApplicator<MaskGenerator, _impl>::processVector(const QRect &rect)
+{
+    const MaskProcessingData *m_d = KisBrushMaskApplicatorBase::m_d;
+    MaskGenerator *m_maskGenerator = KisBrushMaskScalarApplicator<MaskGenerator, _impl>::m_maskGenerator;
+
+    qreal random = 1.0;
+    quint8* dabPointer = m_d->device->data() + rect.y() * rect.width() * m_d->pixelSize;
+    quint8 alphaValue = OPACITY_TRANSPARENT_U8;
+    // this offset is needed when the brush size is smaller than the fixed device size
+    int offset = (m_d->device->bounds().width() - rect.width()) * m_d->pixelSize;
+
+    int width = rect.width();
+
+    // We need to calculate with a multiple of the width of the simd register
+    int alignOffset = 0;
+    if (width % Vc::float_v::Size != 0) {
+        alignOffset = Vc::float_v::Size - (width % Vc::float_v::Size);
+    }
+    int simdWidth = width + alignOffset;
+
+    float *buffer = Vc::malloc<float, Vc::AlignOnVector>(simdWidth);
+
+    typename MaskGenerator::FastRowProcessor processor(m_maskGenerator);
+
+    for (int y = rect.y(); y < rect.y() + rect.height(); y++) {
+
+        processor.template process<_impl>(buffer, simdWidth, y, m_d->cosa, m_d->sina, m_d->centerX, m_d->centerY, m_d->invScaleX, m_d->invScaleY);
+
+        if (m_d->randomness != 0.0 || m_d->density != 1.0) {
+            for (int x = 0; x < width; x++) {
+
+                if (m_d->randomness!= 0.0){
+                    random = (1.0 - m_d->randomness) + m_d->randomness * float(rand()) / RAND_MAX;
+                }
+
+                alphaValue = quint8( (OPACITY_OPAQUE_U8 - buffer[x]*255) * random);
+
+                // avoid computation of random numbers if density is full
+                if (m_d->density != 1.0){
+                    // compute density only for visible pixels of the mask
+                    if (alphaValue != OPACITY_TRANSPARENT_U8){
+                        if ( !(m_d->density >= drand48()) ){
+                            alphaValue = OPACITY_TRANSPARENT_U8;
+                        }
+                    }
+                }
+
+                m_d->colorSpace->applyAlphaU8Mask(dabPointer, &alphaValue, 1);
+                dabPointer += m_d->pixelSize;
+            }
+        } else {
+            m_d->colorSpace->applyInverseNormedFloatMask(dabPointer, buffer, width);
+            dabPointer += width * m_d->pixelSize;
+        }//endif randomness/density
+        dabPointer += offset;
+    }//endfor y
+    Vc::free(buffer);
+}
+
+#endif /* defined HAVE_VC */
+
+template<class MaskGenerator, Vc::Implementation _impl>
+void KisBrushMaskScalarApplicator<MaskGenerator, _impl>::processScalar(const QRect &rect)
+{
+    const MaskProcessingData *m_d = KisBrushMaskApplicatorBase::m_d;
+    MaskGenerator *m_maskGenerator = KisBrushMaskScalarApplicator<MaskGenerator, _impl>::m_maskGenerator;
+
+    qreal random = 1.0;
+    quint8* dabPointer = m_d->device->data() + rect.y() * rect.width() * m_d->pixelSize;
+    quint8 alphaValue = OPACITY_TRANSPARENT_U8;
+    // this offset is needed when the brush size is smaller than the fixed device size
+    int offset = (m_d->device->bounds().width() - rect.width()) * m_d->pixelSize;
+    int supersample = (m_maskGenerator->shouldSupersample() ? SUPERSAMPLING : 1);
+    double invss = 1.0 / supersample;
+    int samplearea = supersample * supersample;
+    for (int y = rect.y(); y < rect.y() + rect.height(); y++) {
+        for (int x = rect.x(); x < rect.x() + rect.width(); x++) {
+            int value = 0;
+            for (int sy = 0; sy < supersample; sy++) {
+                for (int sx = 0; sx < supersample; sx++) {
+                    double x_ = (x + sx * invss - m_d->centerX) * m_d->invScaleX;
+                    double y_ = (y + sy * invss - m_d->centerY) * m_d->invScaleY;
+                    double maskX = m_d->cosa * x_ - m_d->sina * y_;
+                    double maskY = m_d->sina * x_ + m_d->cosa * y_;
+                    value += m_maskGenerator->valueAt(maskX, maskY);
+                }
+            }
+            if (supersample != 1) value /= samplearea;
+
+            if (m_d->randomness!= 0.0){
+                random = (1.0 - m_d->randomness) + m_d->randomness * float(rand()) / RAND_MAX;
+            }
+
+            alphaValue = quint8( (OPACITY_OPAQUE_U8 - value) * random);
+
+            // avoid computation of random numbers if density is full
+            if (m_d->density != 1.0){
+                // compute density only for visible pixels of the mask
+                if (alphaValue != OPACITY_TRANSPARENT_U8){
+                    if ( !(m_d->density >= drand48()) ){
+                        alphaValue = OPACITY_TRANSPARENT_U8;
+                    }
+                }
+            }
+
+            m_d->colorSpace->applyAlphaU8Mask(dabPointer, &alphaValue, 1);
+            dabPointer += m_d->pixelSize;
+        }//endfor x
+        dabPointer += offset;
+    }//endfor y
+}
+
+#endif /* __KIS_BRUSH_MASK_APPLICATORS_H */
diff --git a/krita/image/kis_circle_mask_generator.cpp b/krita/image/kis_circle_mask_generator.cpp
index 6e84219..5fbefb4 100644
--- a/krita/image/kis_circle_mask_generator.cpp
+++ b/krita/image/kis_circle_mask_generator.cpp
@@ -30,13 +30,10 @@
 
 #include "kis_fast_math.h"
 #include "kis_circle_mask_generator.h"
+#include "kis_circle_mask_generator_p.h"
 #include "kis_base_mask_generator.h"
+#include "kis_brush_mask_applicator_factories.h"
 
-struct KisCircleMaskGenerator::Private {
-    double xcoef, ycoef;
-    double xfadecoef, yfadecoef;
-    double transformedFadeX, transformedFadeY;
-};
 
 KisCircleMaskGenerator::KisCircleMaskGenerator(qreal diameter, qreal ratio, qreal fh, qreal fv, int spikes)
         : KisMaskGenerator(diameter, ratio, fh, fv, spikes, CIRCLE, DefaultId), d(new Private)
@@ -47,10 +44,13 @@ KisCircleMaskGenerator::KisCircleMaskGenerator(qreal diameter, qreal ratio, qrea
     d->yfadecoef = (KisMaskGenerator::d->fv == 0) ? 1 : (1.0 / (KisMaskGenerator::d->fv * KisMaskGenerator::d->ratio * width()));
     d->transformedFadeX = d->xfadecoef * softness();
     d->transformedFadeY = d->yfadecoef * softness();
+
+    d->applicator = createOptimizedClass<MaskApplicatorFactory<KisCircleMaskGenerator, KisBrushMaskVectorApplicator> >(this);
 }
 
 KisCircleMaskGenerator::~KisCircleMaskGenerator()
 {
+    delete d->applicator;
     delete d;
 }
 
@@ -64,6 +64,10 @@ bool KisCircleMaskGenerator::shouldVectorize() const
     return !shouldSupersample() && spikes() == 2;
 }
 
+KisBrushMaskApplicatorBase* KisCircleMaskGenerator::applicator()
+{
+    return d->applicator;
+}
 
 quint8 KisCircleMaskGenerator::valueAt(qreal x, qreal y) const
 {
@@ -117,69 +121,6 @@ quint8 KisCircleMaskGenerator::valueAt(qreal x, qreal y) const
     }
 }
 
-void KisCircleMaskGenerator::processRowFast(float* buffer, int width, float y, float cosa, float sina,
-                                            float centerX, float centerY, float invScaleX, float invScaleY)
-{
-#ifdef HAVE_VC
-    float y_ = (y - centerY) * invScaleY;
-    float sinay_ = sina * y_;
-    float cosay_ = cosa * y_;
-
-    float *initValues = Vc::malloc<float, Vc::AlignOnVector>(Vc::float_v::Size);
-    for(int i = 0; i < Vc::float_v::Size; i++) {
-        initValues[i] = (float)i;
-    }
-
-    float* bufferPointer = buffer;
-
-    Vc::float_v currentIndices(initValues);
-
-    Vc::float_v increment((float)Vc::float_v::Size);
-    Vc::float_v vCenterX(centerX);
-    Vc::float_v vInvScaleX(invScaleX);
-
-    Vc::float_v vCosa(cosa);
-    Vc::float_v vSina(sina);
-    Vc::float_v vCosaY_(cosay_);
-    Vc::float_v vSinaY_(sinay_);
-
-    Vc::float_v vXCoeff(d->xcoef);
-    Vc::float_v vYCoeff(d->ycoef);
-
-    Vc::float_v vTransformedFadeX(d->transformedFadeX);
-    Vc::float_v vTransformedFadeY(d->transformedFadeY);
-
-    Vc::float_v vOne(1.0f);
-
-    for (int i=0; i < width; i+= Vc::float_v::Size){
-
-        Vc::float_v x_ = (currentIndices - vCenterX) * vInvScaleX;
-
-        Vc::float_v xr = x_ * vCosa - vSinaY_;
-        Vc::float_v yr = x_ * vSina + vCosaY_;
-
-        Vc::float_v n = ((xr * vXCoeff) * (xr * vXCoeff)) + ((yr * vYCoeff) * (yr * vYCoeff));
-
-        Vc::float_v vNormFade =((xr * vTransformedFadeX) * (xr * vTransformedFadeX)) + ((yr * vTransformedFadeY) * (yr * vTransformedFadeY));
-
-        //255 * n * (normeFade - 1) / (normeFade - n)
-        Vc::float_v vFade = n * (vNormFade - vOne) / (vNormFade - n);
-        // Mask out the inner circe of the mask
-        Vc::float_m mask = vNormFade < vOne;
-        vFade.setZero(mask);
-        vFade = Vc::min(vFade, vOne);
-
-        vFade.store(bufferPointer);
-        currentIndices = currentIndices + increment;
-
-        bufferPointer += Vc::float_v::Size;
-    }
-
-    Vc::free<float>(initValues);
-#endif
-}
-
-
 void KisCircleMaskGenerator::toXML(QDomDocument& d, QDomElement& e) const
 {
     KisMaskGenerator::toXML(d, e);
diff --git a/krita/image/kis_circle_mask_generator.h b/krita/image/kis_circle_mask_generator.h
index 53b6c61..7cc6416 100644
--- a/krita/image/kis_circle_mask_generator.h
+++ b/krita/image/kis_circle_mask_generator.h
@@ -31,20 +31,20 @@ class QDomDocument;
  */
 class KRITAIMAGE_EXPORT KisCircleMaskGenerator : public KisMaskGenerator
 {
-
+public:
+    struct FastRowProcessor;
 public:
     KisCircleMaskGenerator(qreal radius, qreal ratio, qreal fh, qreal fv, int spikes);
     virtual ~KisCircleMaskGenerator();
     
     virtual quint8 valueAt(qreal x, qreal y) const;
 
-    virtual void processRowFast(float* buffer, int width, float y, float cosa, float sina,
-                             float centerX, float centerY, float invScaleX, float invScaleY);
-
     virtual bool shouldSupersample() const;
 
     virtual bool shouldVectorize() const;
 
+    KisBrushMaskApplicatorBase* applicator();
+
     virtual void toXML(QDomDocument& , QDomElement&) const;
     
     virtual void setSoftness(qreal softness);
diff --git a/krita/image/kis_circle_mask_generator_p.h b/krita/image/kis_circle_mask_generator_p.h
new file mode 100644
index 0000000..9cb9449
--- /dev/null
+++ b/krita/image/kis_circle_mask_generator_p.h
@@ -0,0 +1,30 @@
+/*
+ *  Copyright (c) 2008-2009 Cyrille Berger <cberger at cberger.net>
+ *
+ *  This program is free software; you can redistribute it and/or modify
+ *  it under the terms of the GNU General Public License as published by
+ *  the Free Software Foundation; either version 2 of the License, or
+ *  (at your option) any later version.
+ *
+ *  This program is distributed in the hope that it will be useful,
+ *  but WITHOUT ANY WARRANTY; without even the implied warranty of
+ *  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ *  GNU General Public License for more details.
+ *
+ *  You should have received a copy of the GNU General Public License
+ *  along with this program; if not, write to the Free Software
+ *  Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301, USA.
+ */
+
+#ifndef _KIS_CIRCLE_MASK_GENERATOR_P_H_
+#define _KIS_CIRCLE_MASK_GENERATOR_P_H_
+
+struct KisCircleMaskGenerator::Private {
+    double xcoef, ycoef;
+    double xfadecoef, yfadecoef;
+    double transformedFadeX, transformedFadeY;
+
+    KisBrushMaskApplicatorBase *applicator;
+};
+
+#endif /* _KIS_CIRCLE_MASK_GENERATOR_P_H_ */
diff --git a/krita/plugins/paintops/hairy/hairy_brush.cpp b/krita/plugins/paintops/hairy/hairy_brush.cpp
index 64ab8d7..d2d3614 100644
--- a/krita/plugins/paintops/hairy/hairy_brush.cpp
+++ b/krita/plugins/paintops/hairy/hairy_brush.cpp
@@ -24,8 +24,6 @@ inline double drand48() {
 }
 #endif
 
-#include <KoCompositeOps.h>
-
 #include "hairy_brush.h"
 #include "trajectory.h"
 
diff --git a/krita/plugins/paintops/libbrush/CMakeLists.txt b/krita/plugins/paintops/libbrush/CMakeLists.txt
index 1f86333..ea97648 100644
--- a/krita/plugins/paintops/libbrush/CMakeLists.txt
+++ b/krita/plugins/paintops/libbrush/CMakeLists.txt
@@ -31,6 +31,7 @@ target_link_libraries(kritalibbrush kritaui)
 if(HAVE_VC)
   include_directories(${Vc_INCLUDE_DIR})
   target_link_libraries(kritalibbrush  ${Vc_LIBRARIES})
+#  set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} ${Vc_DEFINITIONS}")
 endif(HAVE_VC)
 
 target_link_libraries(kritalibbrush LINK_INTERFACE_LIBRARIES kritaui)
diff --git a/krita/plugins/paintops/libbrush/kis_auto_brush.cpp b/krita/plugins/paintops/libbrush/kis_auto_brush.cpp
index f7bf3f8..5b4de4c 100644
--- a/krita/plugins/paintops/libbrush/kis_auto_brush.cpp
+++ b/krita/plugins/paintops/libbrush/kis_auto_brush.cpp
@@ -18,14 +18,6 @@
  *  Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301, USA.
  */
 
-#if defined(_WIN32) || defined(_WIN64)
-#include <stdlib.h>
-#define srand48 srand
-inline double drand48() {
-    return double(rand()) / RAND_MAX;
-}
-#endif
-
 #include "kis_auto_brush.h"
 
 #include <kis_debug.h>
@@ -46,177 +38,8 @@ inline double drand48() {
 #include "kis_mask_generator.h"
 #include "kis_boundary.h"
 
-#include "config-vc.h"
-#ifdef HAVE_VC
-#include <Vc/Vc>
-#include <Vc/IO>
-#include <Vc/common/support.h>
-#endif
-
-// 3x3 supersampling
-#define SUPERSAMPLING 3
-
-struct MaskProcessor
-{
-    MaskProcessor(KisFixedPaintDeviceSP device, const KoColorSpace* cs, qreal randomness, qreal density,
-           double centerX, double centerY, double invScaleX, double invScaleY, double angle,
-           KisMaskGenerator* shape)
-    : m_device(device)
-    , m_cs(cs)
-    , m_randomness(randomness)
-    , m_density(density)
-    , m_pixelSize(cs->pixelSize())
-    , m_centerX(centerX)
-    , m_centerY(centerY)
-    , m_invScaleX(invScaleX)
-    , m_invScaleY(invScaleY)
-    , m_shape(shape)
-    {
-
-        m_cosa = cos(angle);
-        m_sina = sin(angle);
-
-#ifdef HAVE_VC
-        m_canVectorize = Vc::currentImplementationSupported();
-#else
-        m_canVectorize = false;
-#endif
-
-    }
-
-    void operator()(QRect& rect)
-    {
-        process(rect);
-    }
-
-    void process(QRect& rect){
-#ifdef HAVE_VC
-        if (m_canVectorize && m_shape->shouldVectorize()) {
-            processParallel(rect);
-        } else {
-            processScalar(rect);
-        }
-
-#else
-        processScalar(rect);
-#endif
-    }
-
-    void processScalar(QRect& rect){
-        qreal random = 1.0;
-        quint8* dabPointer = m_device->data() + rect.y() * rect.width() * m_pixelSize;
-        quint8 alphaValue = OPACITY_TRANSPARENT_U8;
-        // this offset is needed when brush size is smaller then fixed device size
-        int offset = (m_device->bounds().width() - rect.width()) * m_pixelSize;
-        int supersample = (m_shape->shouldSupersample() ? SUPERSAMPLING : 1);
-        double invss = 1.0 / supersample;
-        int samplearea = supersample * supersample;
-        for (int y = rect.y(); y < rect.y() + rect.height(); y++) {
-            for (int x = rect.x(); x < rect.x() + rect.width(); x++) {
-                int value = 0;
-                for (int sy = 0; sy < supersample; sy++) {
-                    for (int sx = 0; sx < supersample; sx++) {
-                        double x_ = (x + sx * invss - m_centerX) * m_invScaleX;
-                        double y_ = (y + sy * invss - m_centerY) * m_invScaleY;
-                        double maskX = m_cosa * x_ - m_sina * y_;
-                        double maskY = m_sina * x_ + m_cosa * y_;
-                        value += m_shape->valueAt(maskX, maskY);
-                    }
-                }
-                if (supersample != 1) value /= samplearea;
-
-                if (m_randomness!= 0.0){
-                    random = (1.0 - m_randomness) + m_randomness * float(rand()) / RAND_MAX;
-                }
-
-                alphaValue = quint8( (OPACITY_OPAQUE_U8 - value) * random);
 
-                // avoid computation of random numbers if density is full
-                if (m_density != 1.0){
-                    // compute density only for visible pixels of the mask
-                    if (alphaValue != OPACITY_TRANSPARENT_U8){
-                        if ( !(m_density >= drand48()) ){
-                            alphaValue = OPACITY_TRANSPARENT_U8;
-                        }
-                    }
-                }
-
-                m_cs->applyAlphaU8Mask(dabPointer, &alphaValue, 1);
-                dabPointer += m_pixelSize;
-            }//endfor x
-            dabPointer += offset;
-        }//endfor y
-    }
-
-#ifdef HAVE_VC
-    void processParallel(QRect& rect){
-        qreal random = 1.0;
-        quint8* dabPointer = m_device->data() + rect.y() * rect.width() * m_pixelSize;
-        quint8 alphaValue = OPACITY_TRANSPARENT_U8;
-        // this offset is needed when brush size is smaller then fixed device size
-        int offset = (m_device->bounds().width() - rect.width()) * m_pixelSize;
-
-        int width = rect.width();
-
-        // We need to calculate with a multiple of the width of the simd register
-        int alignOffset = 0;
-        if (width % Vc::float_v::Size != 0) {
-            alignOffset = Vc::float_v::Size - (width % Vc::float_v::Size);
-        }
-        int simdWidth = width + alignOffset;
-
-        float *buffer = Vc::malloc<float, Vc::AlignOnVector>(simdWidth);
-
-        for (int y = rect.y(); y < rect.y() + rect.height(); y++) {
-
-            m_shape->processRowFast(buffer, simdWidth, y, m_cosa, m_sina, m_centerX, m_centerY, m_invScaleX, m_invScaleY);
 
-            if (m_randomness != 0.0 || m_density != 1.0) {
-                for (int x = 0; x < width; x++) {
-
-                    if (m_randomness!= 0.0){
-                        random = (1.0 - m_randomness) + m_randomness * float(rand()) / RAND_MAX;
-                    }
-
-                    alphaValue = quint8( (OPACITY_OPAQUE_U8 - buffer[x]*255) * random);
-
-                    // avoid computation of random numbers if density is full
-                    if (m_density != 1.0){
-                        // compute density only for visible pixels of the mask
-                        if (alphaValue != OPACITY_TRANSPARENT_U8){
-                            if ( !(m_density >= drand48()) ){
-                                alphaValue = OPACITY_TRANSPARENT_U8;
-                            }
-                        }
-                    }
-
-                    m_cs->applyAlphaU8Mask(dabPointer, &alphaValue, 1);
-                    dabPointer += m_pixelSize;
-                }
-            } else {
-                m_cs->applyInverseNormedFloatMask(dabPointer, buffer, width);
-                dabPointer += width*m_pixelSize;
-            }//endfor x
-            dabPointer += offset;
-        }//endfor y
-        Vc::free(buffer);
-    }
-#endif
-
-    KisFixedPaintDeviceSP m_device;
-    const KoColorSpace* m_cs;
-    qreal m_randomness;
-    qreal m_density;
-    quint32 m_pixelSize;
-    double m_centerX;
-    double m_centerY;
-    double m_invScaleX;
-    double m_invScaleY;
-    double m_cosa;
-    double m_sina;
-    KisMaskGenerator* m_shape;
-    bool m_canVectorize;
-};
 
 struct KisAutoBrush::Private {
     KisMaskGenerator* shape;
@@ -405,7 +228,14 @@ void KisAutoBrush::generateMaskAndApplyMaskOrCreateDab(KisFixedPaintDeviceSP dst
         }
     }
 
-    MaskProcessor s(dst, cs, d->randomness, d->density, centerX, centerY, invScaleX, invScaleY, angle, d->shape);
+    MaskProcessingData data(dst, cs, d->randomness, d->density,
+                            centerX, centerY,
+                            invScaleX, invScaleY,
+                            angle);
+
+    KisBrushMaskApplicatorBase *applicator = d->shape->applicator();
+    applicator->initializeData(&data);
+
     int jobs = d->idealThreadCountCached;
     if(dstHeight > 100 && jobs >= 4) {
         int splitter = dstHeight/jobs;
@@ -414,10 +244,11 @@ void KisAutoBrush::generateMaskAndApplyMaskOrCreateDab(KisFixedPaintDeviceSP dst
             rects << QRect(0, i*splitter, dstWidth, splitter);
         }
         rects << QRect(0, (jobs - 1)*splitter, dstWidth, dstHeight - (jobs - 1)*splitter);
-        QtConcurrent::blockingMap(rects, s);
+        OperatorWrapper wrapper(applicator);
+        QtConcurrent::blockingMap(rects, wrapper);
     } else {
         QRect rect(0, 0, dstWidth, dstHeight);
-        s.process(rect);
+        applicator->process(rect);
     }
 }
 
diff --git a/libs/pigment/CMakeLists.txt b/libs/pigment/CMakeLists.txt
index 2dcd441..0ce5004 100644
--- a/libs/pigment/CMakeLists.txt
+++ b/libs/pigment/CMakeLists.txt
@@ -10,6 +10,17 @@ if(OPENEXR_FOUND)
     add_definitions(${OPENEXR_DEFINITIONS})
 endif(OPENEXR_FOUND)
 
+set(LINK_VC_LIB)
+
+if(HAVE_VC)
+    include_directories(${Vc_INCLUDE_DIR})
+    set(LINK_VC_LIB ${Vc_LIBRARIES})
+    ko_compile_for_all_implementations_no_scalar(__per_arch_factory_objs compositeops/KoOptimizedCompositeOpFactoryPerArch.cpp "-fPIC")
+
+    message("The following objects are generated from the per-arch lib")
+    message(${__per_arch_factory_objs})
+endif(HAVE_VC)
+
 add_subdirectory(tests)
 add_subdirectory(benchmarks)
 
@@ -44,6 +55,9 @@ set(pigmentcms_SRCS
     colorspaces/KoRgbU16ColorSpace.cpp
     colorspaces/KoRgbU8ColorSpace.cpp
     colorspaces/KoSimpleColorSpaceEngine.cpp
+    compositeops/KoOptimizedCompositeOpFactory.cpp
+    compositeops/KoOptimizedCompositeOpFactoryPerArch_Scalar.cpp
+    ${__per_arch_factory_objs}
     colorprofiles/KoDummyColorProfile.cpp
     resources/KoAbstractGradient.cpp
     resources/KoColorSet.cpp
@@ -91,7 +105,7 @@ set(PIGMENT_INSTALL_FILES
         KoHistogramProducer.h
 )
 
-set (EXTRA_LIBRARIES ${LINK_OPENEXR_LIB})
+set (EXTRA_LIBRARIES ${LINK_OPENEXR_LIB} ${LINK_VC_LIB})
 
 if(MSVC)
   # avoid "cannot open file 'LIBC.lib'" error
diff --git a/libs/pigment/KoCompositeOp.cpp b/libs/pigment/KoCompositeOp.cpp
index b084131..ae80020 100644
--- a/libs/pigment/KoCompositeOp.cpp
+++ b/libs/pigment/KoCompositeOp.cpp
@@ -24,6 +24,7 @@
 
 #include "KoCompositeOp.h"
 #include "KoColorSpace.h"
+#include "KoColorSpaceMaths.h"
 
 QString KoCompositeOp::categoryColor()
 {
@@ -94,11 +95,13 @@ void KoCompositeOp::composite(quint8 *dstRowStart, qint32 dstRowStride,
 
 void KoCompositeOp::composite(const KoCompositeOp::ParameterInfo& params) const
 {
+    using namespace Arithmetic;
+
     composite(params.dstRowStart           , params.dstRowStride ,
               params.srcRowStart           , params.srcRowStride ,
               params.maskRowStart          , params.maskRowStride,
               params.rows                  , params.cols         ,
-              quint8(params.opacity*255.0f), params.channelFlags );
+              scale<quint8>(params.opacity), params.channelFlags );
 }
 
 
diff --git a/libs/pigment/colorspaces/KoRgbU16ColorSpace.cpp b/libs/pigment/colorspaces/KoRgbU16ColorSpace.cpp
index 49795c7..13b89e1 100644
--- a/libs/pigment/colorspaces/KoRgbU16ColorSpace.cpp
+++ b/libs/pigment/colorspaces/KoRgbU16ColorSpace.cpp
@@ -30,9 +30,6 @@
 #include "KoChannelInfo.h"
 #include "KoID.h"
 #include "KoIntegerMaths.h"
-#include "KoCompositeOpOver.h"
-#include "KoCompositeOpErase.h"
-#include "KoCompositeOpAlphaDarken.h"
 
 
 KoRgbU16ColorSpace::KoRgbU16ColorSpace() :
diff --git a/libs/pigment/colorspaces/KoRgbU8ColorSpace.cpp b/libs/pigment/colorspaces/KoRgbU8ColorSpace.cpp
index 68ef03d..88bab04 100644
--- a/libs/pigment/colorspaces/KoRgbU8ColorSpace.cpp
+++ b/libs/pigment/colorspaces/KoRgbU8ColorSpace.cpp
@@ -29,12 +29,8 @@
 #include "KoChannelInfo.h"
 #include "KoID.h"
 #include "KoIntegerMaths.h"
-#include "KoCompositeOpOver.h"
-#include "KoCompositeOpErase.h"
-#include "KoCompositeOpAlphaDarken.h"
 #include "compositeops/KoCompositeOps.h"
-#include "compositeops/KoCompositeOpAdd.h"
-#include "compositeops/KoCompositeOpSubtract.h"
+
 
 KoRgbU8ColorSpace::KoRgbU8ColorSpace() :
 
diff --git a/libs/pigment/compositeops/KoCompositeOps.h b/libs/pigment/compositeops/KoCompositeOps.h
index 2e1225b..5c2dadb 100644
--- a/libs/pigment/compositeops/KoCompositeOps.h
+++ b/libs/pigment/compositeops/KoCompositeOps.h
@@ -49,6 +49,8 @@
 
 #include "compositeops/KoCompositeOpBehind.h"
 
+#include "KoOptimizedCompositeOpFactory.h"
+
 namespace _Private {
 
 template<class Traits, bool flag>
@@ -58,6 +60,50 @@ struct AddGeneralOps
 };
 
 template<class Traits>
+struct OptimizedOpsSelector
+{
+    static KoCompositeOp* createAlphaDarkenOp(const KoColorSpace *cs) {
+        return new KoCompositeOpAlphaDarken<Traits>(cs);
+    }
+    static KoCompositeOp* createOverOp(const KoColorSpace *cs) {
+        return new KoCompositeOpOver<Traits>(cs);
+    }
+};
+
+template<>
+struct OptimizedOpsSelector<KoRgbU8Traits>
+{
+    static KoCompositeOp* createAlphaDarkenOp(const KoColorSpace *cs) {
+        return KoOptimizedCompositeOpFactory::createAlphaDarkenOp32(cs);
+    }
+    static KoCompositeOp* createOverOp(const KoColorSpace *cs) {
+        return KoOptimizedCompositeOpFactory::createOverOp32(cs);
+    }
+};
+
+template<>
+struct OptimizedOpsSelector<KoBgrU8Traits>
+{
+    static KoCompositeOp* createAlphaDarkenOp(const KoColorSpace *cs) {
+        return KoOptimizedCompositeOpFactory::createAlphaDarkenOp32(cs);
+    }
+    static KoCompositeOp* createOverOp(const KoColorSpace *cs) {
+        return KoOptimizedCompositeOpFactory::createOverOp32(cs);
+    }
+};
+
+template<>
+struct OptimizedOpsSelector<KoLabU8Traits>
+{
+    static KoCompositeOp* createAlphaDarkenOp(const KoColorSpace *cs) {
+        return KoOptimizedCompositeOpFactory::createAlphaDarkenOp32(cs);
+    }
+    static KoCompositeOp* createOverOp(const KoColorSpace *cs) {
+        return KoOptimizedCompositeOpFactory::createOverOp32(cs);
+    }
+};
+
+template<class Traits>
 struct AddGeneralOps<Traits, true>
 {
      typedef typename Traits::channels_type Arg;
@@ -70,8 +116,8 @@ struct AddGeneralOps<Traits, true>
      }
 
      static void add(KoColorSpace* cs) {
-         cs->addCompositeOp(new KoCompositeOpOver<Traits>(cs));
-         cs->addCompositeOp(new KoCompositeOpAlphaDarken<Traits>(cs));
+         cs->addCompositeOp(OptimizedOpsSelector<Traits>::createOverOp(cs));
+         cs->addCompositeOp(OptimizedOpsSelector<Traits>::createAlphaDarkenOp(cs));
          cs->addCompositeOp(new KoCompositeOpCopy2<Traits>(cs));
          cs->addCompositeOp(new KoCompositeOpErase<Traits>(cs));
          cs->addCompositeOp(new KoCompositeOpBehind<Traits>(cs));
diff --git a/libs/pigment/compositeops/KoOptimizedCompositeOpAlphaDarken32.h b/libs/pigment/compositeops/KoOptimizedCompositeOpAlphaDarken32.h
new file mode 100644
index 0000000..172bf30
--- /dev/null
+++ b/libs/pigment/compositeops/KoOptimizedCompositeOpAlphaDarken32.h
@@ -0,0 +1,214 @@
+/*
+ * Copyright (c) 2006 Cyrille Berger  <cberger at cberger.net>
+ * Copyright (c) 2011 Silvio Heinrich <plassy at web.de>
+ *
+ * This library is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU Library General Public
+ * License as published by the Free Software Foundation; either
+ * version 2 of the License, or (at your option) any later version.
+ *
+ * This library is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * Library General Public License for more details.
+ *
+ * You should have received a copy of the GNU Library General Public License
+ * along with this library; see the file COPYING.LIB.  If not, write to
+ * the Free Software Foundation, Inc., 51 Franklin Street, Fifth Floor,
+ * Boston, MA 02110-1301, USA.
+ */
+
+#ifndef KOOPTIMIZEDCOMPOSITEOPALPHADARKEN32_H_
+#define KOOPTIMIZEDCOMPOSITEOPALPHADARKEN32_H_
+
+#include "KoCompositeOpBase.h"
+
+#include "KoStreamedMath.h"
+
+template<typename channels_type, typename pixel_type>
+struct AlphaDarkenCompositor32 {
+    /**
+     * This is a vector equivalent of compositeOnePixelScalar(). It is expected
+     * to process Vc::float_v::Size pixels in a single pass.
+     *
+     * o the \p haveMask parameter indicates whether a real (non-null) mask
+     *   pointer is passed to the function.
+     * o the \p src pointer may or may not be aligned to a vector boundary;
+     *   whether it is aligned must be stated via the template parameter
+     *   \p src_aligned.
+     * o the \p dst pointer must always(!) be aligned to the boundary
+     *   of a streaming vector. Unaligned writes are really expensive.
+     * o This function is *never* used if HAVE_VC is not present
+     */
+
+
+    template<bool haveMask, bool src_aligned, Vc::Implementation _impl>
+    static ALWAYS_INLINE void compositeVector(const quint8 *src, quint8 *dst, const quint8 *mask, float opacity, float flow)
+    {
+        Vc::float_v src_alpha;
+        Vc::float_v dst_alpha;
+
+        Vc::float_v opacity_vec(255.0 * opacity * flow);
+        Vc::float_v flow_norm_vec(flow);
+
+
+        Vc::float_v uint8MaxRec2((float)1.0 / (255.0 * 255.0));
+        Vc::float_v uint8MaxRec1((float)1.0 / 255.0);
+        Vc::float_v uint8Max((float)255.0);
+        Vc::float_v zeroValue(Vc::Zero);
+
+
+        Vc::float_v msk_norm_alpha;
+        src_alpha = KoStreamedMath<_impl>::template fetch_alpha_32<src_aligned>(src);
+
+        if (haveMask) {
+            Vc::float_v mask_vec = KoStreamedMath<_impl>::fetch_mask_8(mask);
+            msk_norm_alpha = src_alpha * mask_vec * uint8MaxRec2;
+        } else {
+            msk_norm_alpha = src_alpha * uint8MaxRec1;
+        }
+
+        dst_alpha = KoStreamedMath<_impl>::template fetch_alpha_32<true>(dst);
+        src_alpha = msk_norm_alpha * opacity_vec;
+
+        Vc::float_m empty_dst_pixels_mask = dst_alpha == zeroValue;
+
+        Vc::float_v src_c1;
+        Vc::float_v src_c2;
+        Vc::float_v src_c3;
+
+        Vc::float_v dst_c1;
+        Vc::float_v dst_c2;
+        Vc::float_v dst_c3;
+
+        KoStreamedMath<_impl>::template fetch_colors_32<src_aligned>(src, src_c1, src_c2, src_c3);
+
+        bool srcAlphaIsZero = (src_alpha == zeroValue).isFull();
+        if (srcAlphaIsZero) return;
+
+        bool dstAlphaIsZero = empty_dst_pixels_mask.isFull();
+
+        Vc::float_v dst_blend = src_alpha * uint8MaxRec1;
+
+        bool srcAlphaIsUnit = (src_alpha == uint8Max).isFull();
+
+        if (dstAlphaIsZero) {
+            dst_c1 = src_c1;
+            dst_c2 = src_c2;
+            dst_c3 = src_c3;
+        } else if (srcAlphaIsUnit) {
+            bool dstAlphaIsUnit = (dst_alpha == uint8Max).isFull();
+            if (dstAlphaIsUnit) {
+                memcpy(dst, src, 4 * Vc::float_v::Size);
+                return;
+            } else {
+                dst_c1 = src_c1;
+                dst_c2 = src_c2;
+                dst_c3 = src_c3;
+            }
+        } else if (empty_dst_pixels_mask.isEmpty()) {
+            KoStreamedMath<_impl>::template fetch_colors_32<true>(dst, dst_c1, dst_c2, dst_c3);
+            dst_c1 = dst_blend * (src_c1 - dst_c1) + dst_c1;
+            dst_c2 = dst_blend * (src_c2 - dst_c2) + dst_c2;
+            dst_c3 = dst_blend * (src_c3 - dst_c3) + dst_c3;
+        } else {
+            KoStreamedMath<_impl>::template fetch_colors_32<true>(dst, dst_c1, dst_c2, dst_c3);
+            dst_c1(empty_dst_pixels_mask) = src_c1;
+            dst_c2(empty_dst_pixels_mask) = src_c2;
+            dst_c3(empty_dst_pixels_mask) = src_c3;
+
+            Vc::float_m not_empty_dst_pixels_mask = !empty_dst_pixels_mask;
+
+            dst_c1(not_empty_dst_pixels_mask) = dst_blend * (src_c1 - dst_c1) + dst_c1;
+            dst_c2(not_empty_dst_pixels_mask) = dst_blend * (src_c2 - dst_c2) + dst_c2;
+            dst_c3(not_empty_dst_pixels_mask) = dst_blend * (src_c3 - dst_c3) + dst_c3;
+        }
+
+        Vc::float_v alpha1 = src_alpha + dst_alpha -
+            dst_blend * dst_alpha;
+
+        Vc::float_m alpha2_mask = opacity_vec > dst_alpha;
+        Vc::float_v opt1 = (opacity_vec - dst_alpha) * msk_norm_alpha + dst_alpha;
+        Vc::float_v alpha2;
+        alpha2(!alpha2_mask) = dst_alpha;
+        alpha2(alpha2_mask) = opt1;
+        dst_alpha = (alpha2 - alpha1) * flow_norm_vec + alpha1;
+
+        KoStreamedMath<_impl>::write_channels_32(dst, dst_alpha, dst_c1, dst_c2, dst_c3);
+    }
+
+    /**
+     * Composites one pixel of the source into the destination
+     */
+    template <bool haveMask, Vc::Implementation _impl>
+    static ALWAYS_INLINE void compositeOnePixelScalar(const channels_type *src, channels_type *dst, const quint8 *mask, float opacity, float flow, const QBitArray &channelFlags)
+    {
+        Q_UNUSED(channelFlags);
+
+        using namespace Arithmetic;
+        const qint32 alpha_pos = 3;
+
+        const float uint8Rec1 = 1.0 / 255.0;
+        const float uint8Rec2 = 1.0 / (255.0 * 255.0);
+        const float uint8Max = 255.0;
+
+        quint8 dstAlphaInt = dst[alpha_pos];
+        float dstAlphaNorm = dstAlphaInt ? dstAlphaInt * uint8Rec1 : 0.0;
+        float srcAlphaNorm;
+        float mskAlphaNorm;
+
+        /**
+         * FIXME: precalculate this value on a higher level to
+         * avoid recomputing it on every iteration
+         */
+        opacity *= flow;
+
+        if (haveMask) {
+            mskAlphaNorm = float(*mask) * uint8Rec2 * src[alpha_pos];
+            srcAlphaNorm = mskAlphaNorm * opacity;
+        } else {
+            mskAlphaNorm = src[alpha_pos] * uint8Rec1;
+            srcAlphaNorm = mskAlphaNorm * opacity;
+        }
+
+        if (dstAlphaInt != 0) {
+            dst[0] = KoStreamedMath<_impl>::lerp_mixed_u8_float(dst[0], src[0], srcAlphaNorm);
+            dst[1] = KoStreamedMath<_impl>::lerp_mixed_u8_float(dst[1], src[1], srcAlphaNorm);
+            dst[2] = KoStreamedMath<_impl>::lerp_mixed_u8_float(dst[2], src[2], srcAlphaNorm);
+        } else {
+            const pixel_type *s = reinterpret_cast<const pixel_type*>(src);
+            pixel_type *d = reinterpret_cast<pixel_type*>(dst);
+            *d = *s;
+        }
+
+        float alpha1 = unionShapeOpacity(srcAlphaNorm, dstAlphaNorm);                               // alpha with 0% flow
+        float alpha2 = (opacity > dstAlphaNorm) ? lerp(dstAlphaNorm, opacity, mskAlphaNorm) : dstAlphaNorm; // alpha with 100% flow
+        dst[alpha_pos] = quint8(lerp(alpha1, alpha2, flow) * uint8Max);
+    }
+};
+
+/**
+ * An optimized version of a composite op for use in 4-byte
+ * color spaces with the alpha channel placed in the last byte of
+ * the pixel: C1_C2_C3_A.
+ */
+template<Vc::Implementation _impl>
+class KoOptimizedCompositeOpAlphaDarken32 : public KoCompositeOp
+{
+public:
+    KoOptimizedCompositeOpAlphaDarken32(const KoColorSpace* cs)
+        : KoCompositeOp(cs, COMPOSITE_ALPHA_DARKEN, i18n("Alpha darken"), KoCompositeOp::categoryMix()) {}
+
+    using KoCompositeOp::composite;
+
+    virtual void composite(const KoCompositeOp::ParameterInfo& params) const
+    {
+        if(params.maskRowStart) {
+            KoStreamedMath<_impl>::template genericComposite32<true, true, AlphaDarkenCompositor32<quint8, quint32> >(params);
+        } else {
+            KoStreamedMath<_impl>::template genericComposite32<false, true, AlphaDarkenCompositor32<quint8, quint32> >(params);
+        }
+    }
+};
+
+#endif // KOOPTIMIZEDCOMPOSITEOPALPHADARKEN32_H_
diff --git a/libs/pigment/compositeops/KoOptimizedCompositeOpFactory.cpp b/libs/pigment/compositeops/KoOptimizedCompositeOpFactory.cpp
new file mode 100644
index 0000000..7c16b28
--- /dev/null
+++ b/libs/pigment/compositeops/KoOptimizedCompositeOpFactory.cpp
@@ -0,0 +1,47 @@
+/*
+ *  Copyright (c) 2012 Dmitry Kazakov <dimula73 at gmail.com>
+ *
+ *  This program is free software; you can redistribute it and/or modify
+ *  it under the terms of the GNU General Public License as published by
+ *  the Free Software Foundation; either version 2 of the License, or
+ *  (at your option) any later version.
+ *
+ *  This program is distributed in the hope that it will be useful,
+ *  but WITHOUT ANY WARRANTY; without even the implied warranty of
+ *  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ *  GNU General Public License for more details.
+ *
+ *  You should have received a copy of the GNU General Public License
+ *  along with this program; if not, write to the Free Software
+ *  Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301, USA.
+ */
+
+#include "KoOptimizedCompositeOpFactory.h"
+
+#include "config-vc.h"
+
+
+#ifdef HAVE_VC
+
+#include <Vc/global.h>
+#include <Vc/common/support.h>
+#endif
+
+#include "KoOptimizedCompositeOpFactoryPerArch.h"
+
+static struct ArchReporter {
+    ArchReporter() {
+        createOptimizedClass<KoReportCurrentArch>(0);
+    }
+} StaticReporter;
+
+
+KoCompositeOp* KoOptimizedCompositeOpFactory::createAlphaDarkenOp32(const KoColorSpace *cs)
+{
+    return createOptimizedClass<KoOptimizedCompositeOpFactoryPerArch<KoOptimizedCompositeOpAlphaDarken32> >(cs);
+}
+
+KoCompositeOp* KoOptimizedCompositeOpFactory::createOverOp32(const KoColorSpace *cs)
+{
+    return createOptimizedClass<KoOptimizedCompositeOpFactoryPerArch<KoOptimizedCompositeOpOver32> >(cs);
+}
diff --git a/libs/pigment/compositeops/KoOptimizedCompositeOpFactory.h b/libs/pigment/compositeops/KoOptimizedCompositeOpFactory.h
new file mode 100644
index 0000000..ce568cb
--- /dev/null
+++ b/libs/pigment/compositeops/KoOptimizedCompositeOpFactory.h
@@ -0,0 +1,46 @@
+/*
+ *  Copyright (c) 2012 Dmitry Kazakov <dimula73 at gmail.com>
+ *
+ *  This program is free software; you can redistribute it and/or modify
+ *  it under the terms of the GNU General Public License as published by
+ *  the Free Software Foundation; either version 2 of the License, or
+ *  (at your option) any later version.
+ *
+ *  This program is distributed in the hope that it will be useful,
+ *  but WITHOUT ANY WARRANTY; without even the implied warranty of
+ *  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ *  GNU General Public License for more details.
+ *
+ *  You should have received a copy of the GNU General Public License
+ *  along with this program; if not, write to the Free Software
+ *  Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301, USA.
+ */
+
+#ifndef KOOPTIMIZEDCOMPOSITEOPFACTORY_H
+#define KOOPTIMIZEDCOMPOSITEOPFACTORY_H
+
+#include "pigment_export.h"
+
+class KoCompositeOp;
+class KoColorSpace;
+
+/**
+ * The creation of the optimized composite ops is moved into a separate
+ * object module for three reasons:
+ *
+ * 1) They are not templated, that is, they do not need inlining into
+ *    the user's code.
+ * 2) This removes compilation dependencies.
+ * 3) (most important!) When the object module is shared with a colorspace
+ *    class, which is quite large itself, GCC lays out the code badly,
+ *    which causes a 60% performance degradation.
+ */
+
+class PIGMENTCMS_EXPORT KoOptimizedCompositeOpFactory
+{
+public:
+    static KoCompositeOp* createAlphaDarkenOp32(const KoColorSpace *cs);
+    static KoCompositeOp* createOverOp32(const KoColorSpace *cs);
+};
+
+#endif /* KOOPTIMIZEDCOMPOSITEOPFACTORY_H */
diff --git a/libs/pigment/compositeops/KoOptimizedCompositeOpFactoryPerArch.cpp b/libs/pigment/compositeops/KoOptimizedCompositeOpFactoryPerArch.cpp
new file mode 100644
index 0000000..2fd8179
--- /dev/null
+++ b/libs/pigment/compositeops/KoOptimizedCompositeOpFactoryPerArch.cpp
@@ -0,0 +1,107 @@
+/*
+ *  Copyright (c) 2012 Dmitry Kazakov <dimula73 at gmail.com>
+ *
+ *  This program is free software; you can redistribute it and/or modify
+ *  it under the terms of the GNU General Public License as published by
+ *  the Free Software Foundation; either version 2 of the License, or
+ *  (at your option) any later version.
+ *
+ *  This program is distributed in the hope that it will be useful,
+ *  but WITHOUT ANY WARRANTY; without even the implied warranty of
+ *  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ *  GNU General Public License for more details.
+ *
+ *  You should have received a copy of the GNU General Public License
+ *  along with this program; if not, write to the Free Software
+ *  Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301, USA.
+ */
+
+#include "KoOptimizedCompositeOpFactoryPerArch.h"
+
+#include <QDebug>
+
+#include "KoOptimizedCompositeOpAlphaDarken32.h"
+#include "KoOptimizedCompositeOpOver32.h"
+
+template<>
+template<>
+KoOptimizedCompositeOpFactoryPerArch<KoOptimizedCompositeOpAlphaDarken32>::ReturnType
+KoOptimizedCompositeOpFactoryPerArch<KoOptimizedCompositeOpAlphaDarken32>::create<VC_IMPL>(ParamType param)
+{
+    return new KoOptimizedCompositeOpAlphaDarken32<VC_IMPL>(param);
+}
+
+template<>
+template<>
+KoOptimizedCompositeOpFactoryPerArch<KoOptimizedCompositeOpOver32>::ReturnType
+KoOptimizedCompositeOpFactoryPerArch<KoOptimizedCompositeOpOver32>::create<VC_IMPL>(ParamType param)
+{
+    return new KoOptimizedCompositeOpOver32<VC_IMPL>(param);
+}
+
+
+#define __stringify(_s) #_s
+#define stringify(_s) __stringify(_s)
+
+#ifdef __SSE2__
+#  define HAVE_SSE2 1
+#else
+#  define HAVE_SSE2 0
+#endif
+
+#ifdef __SSE3__
+#  define HAVE_SSE3 1
+#else
+#  define HAVE_SSE3 0
+#endif
+
+#ifdef __SSSE3__
+#  define HAVE_SSSE3 1
+#else
+#  define HAVE_SSSE3 0
+#endif
+
+#ifdef __SSE4_1__
+#  define HAVE_SSE4_1 1
+#else
+#  define HAVE_SSE4_1 0
+#endif
+
+#ifdef __SSE4_2__
+#  define HAVE_SSE4_2 1
+#else
+#  define HAVE_SSE4_2 0
+#endif
+
+#ifdef __SSE4a__
+#  define HAVE_SSE4a 1
+#else
+#  define HAVE_SSE4a 0
+#endif
+
+#ifdef __AVX__
+#  define HAVE_AVX 1
+#else
+#  define HAVE_AVX 0
+#endif
+
+inline void printFeatureSupported(const QString &feature,
+                                  bool present)
+{
+    qDebug() << "\t" << feature << "\t---\t" << (present ? "yes" : "no");
+}
+
+template<>
+KoReportCurrentArch::ReturnType
+KoReportCurrentArch::create<VC_IMPL>(ParamType)
+{
+    qDebug() << "Compiled for arch:" << stringify(VC_IMPL);
+    qDebug() << "Features supported:";
+    printFeatureSupported("SSE2", HAVE_SSE2);
+    printFeatureSupported("SSE3", HAVE_SSE3);
+    printFeatureSupported("SSSE3", HAVE_SSSE3);
+    printFeatureSupported("SSE4.1", HAVE_SSE4_1);
+    printFeatureSupported("SSE4.2", HAVE_SSE4_2);
+    printFeatureSupported("SSE4a", HAVE_SSE4a);
+    printFeatureSupported("AVX ", HAVE_AVX);
+}
diff --git a/libs/pigment/compositeops/KoOptimizedCompositeOpFactoryPerArch.h b/libs/pigment/compositeops/KoOptimizedCompositeOpFactoryPerArch.h
new file mode 100644
index 0000000..17cd523
--- /dev/null
+++ b/libs/pigment/compositeops/KoOptimizedCompositeOpFactoryPerArch.h
@@ -0,0 +1,54 @@
+/*
+ *  Copyright (c) 2012 Dmitry Kazakov <dimula73 at gmail.com>
+ *
+ *  This program is free software; you can redistribute it and/or modify
+ *  it under the terms of the GNU General Public License as published by
+ *  the Free Software Foundation; either version 2 of the License, or
+ *  (at your option) any later version.
+ *
+ *  This program is distributed in the hope that it will be useful,
+ *  but WITHOUT ANY WARRANTY; without even the implied warranty of
+ *  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ *  GNU General Public License for more details.
+ *
+ *  You should have received a copy of the GNU General Public License
+ *  along with this program; if not, write to the Free Software
+ *  Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301, USA.
+ */
+
+#ifndef KOOPTIMIZEDCOMPOSITEOPFACTORYPERARCH_H
+#define KOOPTIMIZEDCOMPOSITEOPFACTORYPERARCH_H
+
+#include "KoVcMultiArchBuildSupport.h"
+
+
+class KoCompositeOp;
+class KoColorSpace;
+
+
+template<Vc::Implementation _impl>
+class KoOptimizedCompositeOpAlphaDarken32;
+
+template<Vc::Implementation _impl>
+class KoOptimizedCompositeOpOver32;
+
+template<template<Vc::Implementation I> class CompositeOp>
+struct KoOptimizedCompositeOpFactoryPerArch
+{
+    typedef const KoColorSpace* ParamType;
+    typedef KoCompositeOp* ReturnType;
+
+    template<Vc::Implementation _impl>
+    static ReturnType create(ParamType param);
+};
+
+struct KoReportCurrentArch
+{
+    typedef void* ParamType;
+    typedef void ReturnType;
+
+    template<Vc::Implementation _impl>
+    static ReturnType create(ParamType);
+};
+
+#endif /* KOOPTIMIZEDCOMPOSITEOPFACTORYPERARCH_H */
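The per-arch factory declared above can be illustrated with a minimal, self-contained sketch. It is not the Calligra code: the `Implementation` enum, `isSupported()`, `NameFactory` and `createForCpu()` are hypothetical stand-ins for `Vc::Implementation`, `Vc::isImplementationSupported()`, a factory struct and `createOptimizedClass()`, and all specializations sit in one file here, whereas the commit places them in separate translation units.

```cpp
#include <cassert>
#include <string>

// Hypothetical stand-in for Vc::Implementation, ordered from least to
// most capable instruction set.
enum Implementation { ScalarImpl, SSE2Impl, AVXImpl };

// Hypothetical stand-in for Vc::isImplementationSupported(). The "CPU"
// is described by the best level it supports instead of being probed.
inline bool isSupported(Implementation wanted, Implementation cpuBest)
{
    return wanted <= cpuBest;
}

// A factory in the style of KoOptimizedCompositeOpFactoryPerArch: it names
// its parameter and return types and declares create<impl>() without
// defining it. Each implementation gets an explicit specialization.
struct NameFactory {
    typedef int ParamType;
    typedef std::string ReturnType;

    template<Implementation impl>
    static ReturnType create(ParamType param);
};

template<>
inline NameFactory::ReturnType NameFactory::create<ScalarImpl>(ParamType)
{
    return "scalar";
}

template<>
inline NameFactory::ReturnType NameFactory::create<SSE2Impl>(ParamType)
{
    return "sse2";
}

template<>
inline NameFactory::ReturnType NameFactory::create<AVXImpl>(ParamType)
{
    return "avx";
}

// Runtime dispatcher: try the most capable implementation first and
// fall back to the scalar one.
template<class FactoryType>
typename FactoryType::ReturnType
createForCpu(typename FactoryType::ParamType param, Implementation cpuBest)
{
    assert(cpuBest >= ScalarImpl && cpuBest <= AVXImpl);

    if (isSupported(AVXImpl, cpuBest)) {
        return FactoryType::template create<AVXImpl>(param);
    } else if (isSupported(SSE2Impl, cpuBest)) {
        return FactoryType::template create<SSE2Impl>(param);
    }
    return FactoryType::template create<ScalarImpl>(param);
}
```

Calling `createForCpu<NameFactory>(0, SSE2Impl)` picks the SSE2 specialization. The dispatcher depends only on the `ParamType`/`ReturnType` typedefs and the `create<>()` template, which is what lets one dispatcher serve both the composite op factories and `KoReportCurrentArch`.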
diff --git a/libs/pigment/compositeops/KoOptimizedCompositeOpFactoryPerArch_Scalar.cpp b/libs/pigment/compositeops/KoOptimizedCompositeOpFactoryPerArch_Scalar.cpp
new file mode 100644
index 0000000..db17ae0
--- /dev/null
+++ b/libs/pigment/compositeops/KoOptimizedCompositeOpFactoryPerArch_Scalar.cpp
@@ -0,0 +1,47 @@
+/*
+ *  Copyright (c) 2012 Dmitry Kazakov <dimula73 at gmail.com>
+ *
+ *  This program is free software; you can redistribute it and/or modify
+ *  it under the terms of the GNU General Public License as published by
+ *  the Free Software Foundation; either version 2 of the License, or
+ *  (at your option) any later version.
+ *
+ *  This program is distributed in the hope that it will be useful,
+ *  but WITHOUT ANY WARRANTY; without even the implied warranty of
+ *  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ *  GNU General Public License for more details.
+ *
+ *  You should have received a copy of the GNU General Public License
+ *  along with this program; if not, write to the Free Software
+ *  Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301, USA.
+ */
+
+#include "KoOptimizedCompositeOpFactoryPerArch.h"
+
+#include "KoColorSpaceTraits.h"
+#include "KoCompositeOpAlphaDarken.h"
+#include "KoCompositeOpOver.h"
+
+
+template<>
+template<>
+KoOptimizedCompositeOpFactoryPerArch<KoOptimizedCompositeOpAlphaDarken32>::ReturnType
+KoOptimizedCompositeOpFactoryPerArch<KoOptimizedCompositeOpAlphaDarken32>::create<Vc::ScalarImpl>(ParamType param)
+{
+    return new KoCompositeOpAlphaDarken<KoBgrU8Traits>(param);
+}
+
+template<>
+template<>
+KoOptimizedCompositeOpFactoryPerArch<KoOptimizedCompositeOpOver32>::ReturnType
+KoOptimizedCompositeOpFactoryPerArch<KoOptimizedCompositeOpOver32>::create<Vc::ScalarImpl>(ParamType param)
+{
+    return new KoCompositeOpOver<KoBgrU8Traits>(param);
+}
+
+template<>
+KoReportCurrentArch::ReturnType
+KoReportCurrentArch::create<Vc::ScalarImpl>(ParamType)
+{
+    qDebug() << "Legacy integer arithmetics implementation";
+}
diff --git a/libs/pigment/compositeops/KoOptimizedCompositeOpOver32.h b/libs/pigment/compositeops/KoOptimizedCompositeOpOver32.h
new file mode 100644
index 0000000..400b572
--- /dev/null
+++ b/libs/pigment/compositeops/KoOptimizedCompositeOpOver32.h
@@ -0,0 +1,240 @@
+/*
+ * Copyright (c) 2006 Cyrille Berger  <cberger at cberger.net>
+ * Copyright (c) 2011 Silvio Heinrich <plassy at web.de>
+ *
+ * This library is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU Library General Public
+ * License as published by the Free Software Foundation; either
+ * version 2 of the License, or (at your option) any later version.
+ *
+ * This library is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * Library General Public License for more details.
+ *
+ * You should have received a copy of the GNU Library General Public License
+ * along with this library; see the file COPYING.LIB.  If not, write to
+ * the Free Software Foundation, Inc., 51 Franklin Street, Fifth Floor,
+ * Boston, MA 02110-1301, USA.
+ */
+
+#ifndef KOOPTIMIZEDCOMPOSITEOPOVER32_H_
+#define KOOPTIMIZEDCOMPOSITEOPOVER32_H_
+
+#include "KoCompositeOpBase.h"
+
+#include "KoStreamedMath.h"
+
+
+template<typename channels_type, typename pixel_type, bool alphaLocked, bool allChannelsFlag>
+struct OverCompositor32 {
+    // \see docs in AlphaDarkenCompositor32
+    template<bool haveMask, bool src_aligned, Vc::Implementation _impl>
+    static ALWAYS_INLINE void compositeVector(const quint8 *src, quint8 *dst, const quint8 *mask, float opacity, float flow)
+    {
+        Q_UNUSED(flow);
+
+        Vc::float_v src_alpha;
+        Vc::float_v dst_alpha;
+
+        src_alpha = KoStreamedMath<_impl>::template fetch_alpha_32<src_aligned>(src);
+
+        bool haveOpacity = opacity != 1.0;
+        Vc::float_v opacity_norm_vec(opacity);
+
+        Vc::float_v uint8Max((float)255.0);
+        Vc::float_v uint8MaxRec1((float)1.0 / 255.0);
+        Vc::float_v zeroValue(Vc::Zero);
+        Vc::float_v oneValue(Vc::One);
+
+        src_alpha *= opacity_norm_vec;
+
+        if (haveMask) {
+            Vc::float_v mask_vec = KoStreamedMath<_impl>::fetch_mask_8(mask);
+            src_alpha *= mask_vec * uint8MaxRec1;
+        }
+
+        // The source cannot change the colors in the destination,
+        // since it is fully transparent
+        if ((src_alpha == zeroValue).isFull()) {
+            return;
+        }
+
+        dst_alpha = KoStreamedMath<_impl>::template fetch_alpha_32<true>(dst);
+
+        Vc::float_v src_c1;
+        Vc::float_v src_c2;
+        Vc::float_v src_c3;
+
+        Vc::float_v dst_c1;
+        Vc::float_v dst_c2;
+        Vc::float_v dst_c3;
+
+
+        KoStreamedMath<_impl>::template fetch_colors_32<src_aligned>(src, src_c1, src_c2, src_c3);
+        Vc::float_v src_blend;
+        Vc::float_v new_alpha;
+
+        if ((dst_alpha == uint8Max).isFull()) {
+            new_alpha = dst_alpha;
+            src_blend = src_alpha * uint8MaxRec1;
+        } else if ((dst_alpha == zeroValue).isFull()) {
+            new_alpha = src_alpha;
+            src_blend = oneValue;
+        } else {
+            /**
+             * The value of new_alpha can have *some* zero values,
+             * which will result in NaN values during the division.
+             * But when converted to integers these NaN values will
+             * be converted to zeroes, which is exactly what we need
+             */
+            new_alpha = dst_alpha + (uint8Max - dst_alpha) * src_alpha * uint8MaxRec1;
+            src_blend = src_alpha / new_alpha;
+        }
+
+        if (!(src_blend == oneValue).isFull()) {
+            KoStreamedMath<_impl>::template fetch_colors_32<true>(dst, dst_c1, dst_c2, dst_c3);
+
+            dst_c1 = src_blend * (src_c1 - dst_c1) + dst_c1;
+            dst_c2 = src_blend * (src_c2 - dst_c2) + dst_c2;
+            dst_c3 = src_blend * (src_c3 - dst_c3) + dst_c3;
+
+        } else {
+            if (!haveMask && !haveOpacity) {
+                memcpy(dst, src, 4 * Vc::float_v::Size);
+                return;
+            } else {
+                // opacity has changed the alpha of the source,
+                // so we can't just memcpy the bytes
+                dst_c1 = src_c1;
+                dst_c2 = src_c2;
+                dst_c3 = src_c3;
+            }
+        }
+
+        KoStreamedMath<_impl>::write_channels_32(dst, new_alpha, dst_c1, dst_c2, dst_c3);
+    }
+
+    template <bool haveMask, Vc::Implementation _impl>
+    static ALWAYS_INLINE void compositeOnePixelScalar(const channels_type *src, channels_type *dst, const quint8 *mask, float opacity, float flow, const QBitArray &channelFlags)
+    {
+        Q_UNUSED(flow);
+
+        using namespace Arithmetic;
+        const qint32 alpha_pos = 3;
+
+        const float uint8Rec1 = 1.0 / 255.0;
+        const float uint8Max = 255.0;
+
+        float srcAlpha = src[alpha_pos];
+        srcAlpha *= opacity;
+
+        if (haveMask) {
+            srcAlpha *= float(*mask) * uint8Rec1;
+        }
+
+        if (srcAlpha != 0.0) {
+
+            float dstAlpha = dst[alpha_pos];
+            float srcBlendNorm;
+
+            if (dstAlpha == uint8Max) {
+                srcBlendNorm = srcAlpha * uint8Rec1;
+            } else if (dstAlpha == 0.0) {
+                dstAlpha = srcAlpha;
+                srcBlendNorm = 1.0;
+
+                if (!allChannelsFlag) {
+                    pixel_type *d = reinterpret_cast<pixel_type*>(dst);
+                    *d = 0; // dstAlpha is already null
+                }
+            } else {
+                dstAlpha += (uint8Max - dstAlpha) * srcAlpha * uint8Rec1;
+                srcBlendNorm = srcAlpha / dstAlpha;
+            }
+
+            if(allChannelsFlag) {
+                if (srcBlendNorm == 1.0) {
+                    if (!alphaLocked) {
+                        const pixel_type *s = reinterpret_cast<const pixel_type*>(src);
+                        pixel_type *d = reinterpret_cast<pixel_type*>(dst);
+                        *d = *s;
+                    } else {
+                        dst[0] = src[0];
+                        dst[1] = src[1];
+                        dst[2] = src[2];
+                    }
+                } else if (srcBlendNorm != 0.0){
+                    dst[0] = KoStreamedMath<_impl>::lerp_mixed_u8_float(dst[0], src[0], srcBlendNorm);
+                    dst[1] = KoStreamedMath<_impl>::lerp_mixed_u8_float(dst[1], src[1], srcBlendNorm);
+                    dst[2] = KoStreamedMath<_impl>::lerp_mixed_u8_float(dst[2], src[2], srcBlendNorm);
+                }
+            } else {
+                if (srcBlendNorm == 1.0) {
+                    if(channelFlags.at(0)) dst[0] = src[0];
+                    if(channelFlags.at(1)) dst[1] = src[1];
+                    if(channelFlags.at(2)) dst[2] = src[2];
+                } else if (srcBlendNorm != 0.0) {
+                    if(channelFlags.at(0)) dst[0] = KoStreamedMath<_impl>::lerp_mixed_u8_float(dst[0], src[0], srcBlendNorm);
+                    if(channelFlags.at(1)) dst[1] = KoStreamedMath<_impl>::lerp_mixed_u8_float(dst[1], src[1], srcBlendNorm);
+                    if(channelFlags.at(2)) dst[2] = KoStreamedMath<_impl>::lerp_mixed_u8_float(dst[2], src[2], srcBlendNorm);
+                }
+            }
+
+            if (!alphaLocked) {
+                dst[alpha_pos] = quint8(dstAlpha);
+            }
+        }
+    }
+};
+
+/**
+ * An optimized version of a composite op for use in 4-byte
+ * colorspaces with the alpha channel placed in the last byte of
+ * the pixel: C1_C2_C3_A.
+ */
+template<Vc::Implementation _impl>
+class KoOptimizedCompositeOpOver32 : public KoCompositeOp
+{
+public:
+    KoOptimizedCompositeOpOver32(const KoColorSpace* cs)
+        : KoCompositeOp(cs, COMPOSITE_OVER, i18n("Normal"), KoCompositeOp::categoryMix()) {}
+
+    using KoCompositeOp::composite;
+
+    virtual void composite(const KoCompositeOp::ParameterInfo& params) const
+    {
+        if(params.maskRowStart) {
+            composite<true>(params);
+        } else {
+            composite<false>(params);
+        }
+    }
+
+    template <bool haveMask>
+    inline void composite(const KoCompositeOp::ParameterInfo& params) const {
+        if (params.channelFlags.isEmpty() ||
+            params.channelFlags == QBitArray(4, true)) {
+
+            KoStreamedMath<_impl>::template genericComposite32<haveMask, false, OverCompositor32<quint8, quint32, false, true> >(params);
+        } else {
+            const bool allChannelsFlag =
+                params.channelFlags.at(0) &&
+                params.channelFlags.at(1) &&
+                params.channelFlags.at(2);
+
+            const bool alphaLocked =
+                !params.channelFlags.at(3);
+
+            if (allChannelsFlag && alphaLocked) {
+                KoStreamedMath<_impl>::template genericComposite32_novector<haveMask, false, OverCompositor32<quint8, quint32, true, true> >(params);
+            } else if (!allChannelsFlag && !alphaLocked) {
+                KoStreamedMath<_impl>::template genericComposite32_novector<haveMask, false, OverCompositor32<quint8, quint32, false, false> >(params);
+            } else /*if (!allChannelsFlag && alphaLocked) */{
+                KoStreamedMath<_impl>::template genericComposite32_novector<haveMask, false, OverCompositor32<quint8, quint32, true, false> >(params);
+            }
+        }
+    }
+};
+
+#endif // KOOPTIMIZEDCOMPOSITEOPOVER32_H_
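The alpha arithmetic of the "over" compositor above can be checked in isolation with a small scalar sketch. The names `OverResult`, `overBlendFactor()` and `lerpU8()` are hypothetical and exist only for this illustration. The sketch also divides by 255 where the patch multiplies by a precomputed reciprocal, so the last bits of a result may differ from the patch.

```cpp
#include <cassert>
#include <cstdint>

// Result of the alpha step: the new destination alpha (0..255 scale)
// and the normalized weight (0..1) given to the source color.
struct OverResult {
    float newAlpha;
    float srcBlend;
};

// srcAlpha is assumed to be already multiplied by opacity and mask;
// both arguments are on the 0..255 scale, as in the patch.
inline OverResult overBlendFactor(float srcAlpha, float dstAlpha)
{
    assert(srcAlpha >= 0.0f && srcAlpha <= 255.0f);
    assert(dstAlpha >= 0.0f && dstAlpha <= 255.0f);

    const float uint8Max = 255.0f;
    OverResult r;

    if (dstAlpha == uint8Max) {
        // Opaque destination: alpha stays, source weight is its own alpha.
        r.newAlpha = dstAlpha;
        r.srcBlend = srcAlpha / uint8Max;
    } else if (dstAlpha == 0.0f) {
        // Empty destination: the source replaces it completely.
        r.newAlpha = srcAlpha;
        r.srcBlend = 1.0f;
    } else {
        // General case: union of the two coverages.
        r.newAlpha = dstAlpha + (uint8Max - dstAlpha) * srcAlpha / uint8Max;
        r.srcBlend = srcAlpha / r.newAlpha;
    }
    return r;
}

// Same expression as lerp_mixed_u8_float(): move from a towards b by t.
inline std::uint8_t lerpU8(std::uint8_t a, std::uint8_t b, float t)
{
    return std::uint8_t(std::int16_t(b - a) * t + a);
}
```

A color channel is then updated as `dst[i] = lerpU8(dst[i], src[i], r.srcBlend)`, which matches the `src_blend * (src - dst) + dst` form used in the vector path.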
diff --git a/libs/pigment/compositeops/KoStreamedMath.h b/libs/pigment/compositeops/KoStreamedMath.h
new file mode 100644
index 0000000..dca5ba8
--- /dev/null
+++ b/libs/pigment/compositeops/KoStreamedMath.h
@@ -0,0 +1,304 @@
+/*
+ *  Copyright (c) 2012 Dmitry Kazakov <dimula73 at gmail.com>
+ *
+ *  This program is free software; you can redistribute it and/or modify
+ *  it under the terms of the GNU General Public License as published by
+ *  the Free Software Foundation; either version 2 of the License, or
+ *  (at your option) any later version.
+ *
+ *  This program is distributed in the hope that it will be useful,
+ *  but WITHOUT ANY WARRANTY; without even the implied warranty of
+ *  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ *  GNU General Public License for more details.
+ *
+ *  You should have received a copy of the GNU General Public License
+ *  along with this program; if not, write to the Free Software
+ *  Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301, USA.
+ */
+
+#ifndef __VECTOR_MATH_H
+#define __VECTOR_MATH_H
+
+
+#include "config-vc.h"
+#ifndef HAVE_VC
+#error "BUG: There is no reason to include this file when Vc is not present"
+#endif
+
+#include <Vc/Vc>
+#include <Vc/IO>
+
+#include <stdint.h>
+
+#ifndef ALWAYS_INLINE
+#if defined __GNUC__
+#define ALWAYS_INLINE inline __attribute__((__always_inline__))
+#elif defined _MSC_VER
+#define ALWAYS_INLINE __forceinline
+#else
+#define ALWAYS_INLINE inline
+#endif
+#endif
+
+template<Vc::Implementation _impl>
+struct KoStreamedMath {
+
+/**
+ * Composes src into dst without using vector instructions
+ */
+template<bool useMask, bool useFlow, class Compositor>
+    static void genericComposite32_novector(const KoCompositeOp::ParameterInfo& params)
+{
+    using namespace Arithmetic;
+
+    const qint32 linearInc = 4;
+    qint32 srcLinearInc = params.srcRowStride ? 4 : 0;
+
+    quint8*       dstRowStart  = params.dstRowStart;
+    const quint8* maskRowStart = params.maskRowStart;
+    const quint8* srcRowStart  = params.srcRowStart;
+
+    for(quint32 r=params.rows; r>0; --r) {
+        const quint8 *mask = maskRowStart;
+        const quint8 *src  = srcRowStart;
+        quint8       *dst  = dstRowStart;
+
+        int blockRest = params.cols;
+
+        for(int i = 0; i < blockRest; i++) {
+            Compositor::template compositeOnePixelScalar<useMask, _impl>(src, dst, mask, params.opacity, params.flow, params.channelFlags);
+            src += srcLinearInc;
+            dst += linearInc;
+
+            if (useMask) {
+                mask++;
+            }
+        }
+
+        srcRowStart  += params.srcRowStride;
+        dstRowStart  += params.dstRowStride;
+
+        if (useMask) {
+            maskRowStart += params.maskRowStride;
+        }
+    }
+}
+
+static inline quint8 lerp_mixed_u8_float(quint8 a, quint8 b, float alpha) {
+    return quint8(qint16(b - a) * alpha + a);
+}
+
+/**
+ * Get a vector containing the first Vc::float_v::Size values of the mask.
+ * Each source mask element is considered to be an 8-bit integer
+ */
+static inline Vc::float_v fetch_mask_8(const quint8 *data) {
+    Vc::uint_v data_i(data);
+    return Vc::float_v(Vc::int_v(data_i));
+}
+
+/**
+ * Get the alpha values from Vc::float_v::Size pixels, 32 bits each
+ * (4 channels, 8 bits per channel).  The alpha value is considered
+ * to be stored in the most significant byte of the pixel
+ *
+ * \p aligned controls whether the \p data is fetched using an aligned
+ *            instruction or not.
+ *            1) Fetching aligned data with an unaligned instruction
+ *               degrades performance.
+ *            2) Fetching unaligned data with an aligned instruction
+ *               causes #GP (General Protection Exception)
+ */
+template <bool aligned>
+static inline Vc::float_v fetch_alpha_32(const quint8 *data) {
+    Vc::uint_v data_i;
+    if (aligned) {
+        data_i.load((const quint32*)data, Vc::Aligned);
+    } else {
+        data_i.load((const quint32*)data, Vc::Unaligned);
+    }
+
+    return Vc::float_v(Vc::int_v(data_i >> 24));
+}
+
+/**
+ * Get the color values from Vc::float_v::Size pixels, 32 bits each
+ * (4 channels, 8 bits per channel).  The color data is considered
+ * to be stored in the 3 least significant bytes of the pixel.
+ *
+ * \p aligned controls whether the \p data is fetched using an aligned
+ *            instruction or not.
+ *            1) Fetching aligned data with an unaligned instruction
+ *               degrades performance.
+ *            2) Fetching unaligned data with an aligned instruction
+ *               causes #GP (General Protection Exception)
+ */
+template <bool aligned>
+static inline void fetch_colors_32(const quint8 *data,
+                            Vc::float_v &c1,
+                            Vc::float_v &c2,
+                            Vc::float_v &c3) {
+    Vc::uint_v data_i;
+    if (aligned) {
+        data_i.load((const quint32*)data, Vc::Aligned);
+    } else {
+        data_i.load((const quint32*)data, Vc::Unaligned);
+    }
+
+    const quint32 lowByteMask = 0xFF;
+    Vc::uint_v mask(lowByteMask);
+
+    c1 = Vc::float_v(Vc::int_v((data_i >> 16) & mask));
+    c2 = Vc::float_v(Vc::int_v((data_i >> 8)  & mask));
+    c3 = Vc::float_v(Vc::int_v( data_i        & mask));
+}
+
+/**
+ * Pack color and alpha values into Vc::float_v::Size pixels, 32 bits each
+ * (4 channels, 8 bits per channel).  The color data is considered
+ * to be stored in the 3 least significant bytes of the pixel, and the
+ * alpha in the most significant byte
+ *
+ * NOTE: \p data must be an aligned pointer!
+ */
+static inline void write_channels_32(quint8 *data,
+                              Vc::float_v alpha,
+                              Vc::float_v c1,
+                              Vc::float_v c2,
+                              Vc::float_v c3) {
+
+    const quint32 lowByteMask = 0xFF;
+    Vc::uint_v mask(lowByteMask);
+
+    Vc::uint_v v1 = Vc::uint_v(Vc::int_v(alpha)) << 24;
+    Vc::uint_v v2 = (Vc::uint_v(Vc::int_v(c1)) & mask) << 16;
+    Vc::uint_v v3 = (Vc::uint_v(Vc::int_v(c2)) & mask) <<  8;
+    v1 = v1 | v2;
+    Vc::uint_v v4 = Vc::uint_v(Vc::int_v(c3)) & mask;
+    v3 = v3 | v4;
+
+    *((Vc::uint_v*)data) = v1 | v3;
+}
+
+/**
+ * Composes src pixels into dst pixels. Is optimized for 32-bit-per-pixel
+ * colorspaces. Uses the \p Compositor strategy parameter for doing the
+ * actual math of the composition
+ */
+template<bool useMask, bool useFlow, class Compositor>
+    static void genericComposite32(const KoCompositeOp::ParameterInfo& params)
+{
+    using namespace Arithmetic;
+
+    const int vectorSize = Vc::float_v::Size;
+    const qint32 vectorInc = 4 * vectorSize;
+    const qint32 linearInc = 4;
+    qint32 srcVectorInc = vectorInc;
+    qint32 srcLinearInc = 4;
+
+    quint8*       dstRowStart  = params.dstRowStart;
+    const quint8* maskRowStart = params.maskRowStart;
+    const quint8* srcRowStart  = params.srcRowStart;
+
+    if (!params.srcRowStride) {
+        quint32 *buf = Vc::malloc<quint32, Vc::AlignOnVector>(vectorSize);
+        *((Vc::uint_v*)buf) = Vc::uint_v(*((const quint32*)params.srcRowStart));
+        srcRowStart = reinterpret_cast<quint8*>(buf);
+        srcLinearInc = 0;
+        srcVectorInc = 0;
+    }
+
+    for(quint32 r=params.rows; r>0; --r) {
+        // Hint: Mask is allowed to be unaligned
+        const quint8 *mask = maskRowStart;
+
+        const quint8 *src  = srcRowStart;
+        quint8       *dst  = dstRowStart;
+
+        const int pixelsAlignmentMask = vectorInc - 1;
+        uintptr_t srcPtrValue = reinterpret_cast<uintptr_t>(src);
+        uintptr_t dstPtrValue = reinterpret_cast<uintptr_t>(dst);
+        uintptr_t srcAlignment = srcPtrValue & pixelsAlignmentMask;
+        uintptr_t dstAlignment = dstPtrValue & pixelsAlignmentMask;
+
+        // Uncomment if facing problems with alignment:
+        // Q_ASSERT_X(!(dstAlignment & 3), "Compositioning",
+        //            "Pixel data must be aligned on pixels borders!");
+
+        int blockAlign = params.cols;
+        int blockAlignedVector = 0;
+        int blockUnalignedVector = 0;
+        int blockRest = 0;
+
+        int *vectorBlock =
+            srcAlignment == dstAlignment || !srcVectorInc ?
+            &blockAlignedVector : &blockUnalignedVector;
+
+        if (!dstAlignment) {
+            blockAlign = 0;
+            *vectorBlock = params.cols / vectorSize;
+            blockRest = params.cols % vectorSize;
+        } else if (params.cols > 2 * vectorSize) {
+            blockAlign = (vectorInc - dstAlignment) / 4;
+            const int restCols = params.cols - blockAlign;
+            *vectorBlock = restCols / vectorSize;
+            blockRest = restCols % vectorSize;
+        }
+
+        for(int i = 0; i < blockAlign; i++) {
+            Compositor::template compositeOnePixelScalar<useMask, _impl>(src, dst, mask, params.opacity, params.flow, params.channelFlags);
+            src += srcLinearInc;
+            dst += linearInc;
+
+            if(useMask) {
+                mask++;
+            }
+        }
+
+        for (int i = 0; i < blockAlignedVector; i++) {
+            Compositor::template compositeVector<useMask, true, _impl>(src, dst, mask, params.opacity, params.flow);
+            src += srcVectorInc;
+            dst += vectorInc;
+
+            if (useMask) {
+                mask += vectorSize;
+            }
+        }
+
+        for (int i = 0; i < blockUnalignedVector; i++) {
+            Compositor::template compositeVector<useMask, false, _impl>(src, dst, mask, params.opacity, params.flow);
+            src += srcVectorInc;
+            dst += vectorInc;
+
+            if (useMask) {
+                mask += vectorSize;
+            }
+        }
+
+
+        for(int i = 0; i < blockRest; i++) {
+            Compositor::template compositeOnePixelScalar<useMask, _impl>(src, dst, mask, params.opacity, params.flow, params.channelFlags);
+            src += srcLinearInc;
+            dst += linearInc;
+
+            if (useMask) {
+                mask++;
+            }
+        }
+
+        srcRowStart  += params.srcRowStride;
+        dstRowStart  += params.dstRowStride;
+
+        if (useMask) {
+            maskRowStart += params.maskRowStride;
+        }
+    }
+
+    if (!params.srcRowStride) {
+        Vc::free<float>(reinterpret_cast<float*>(const_cast<quint8*>(srcRowStart)));
+    }
+}
+
+};
+
+#endif /* __VECTOR_MATH_H */
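The way genericComposite32() splits each row into a scalar head, a run of vector blocks and a scalar tail can be sketched as a standalone function. `RowSplit` and `splitRow()` are hypothetical names for this illustration only. The sketch covers only the counting; it does not model the choice between the aligned and unaligned vector loops, which in the patch depends on whether src and dst share the same misalignment.

```cpp
#include <cassert>
#include <cstdint>

// How many pixels of one row go through each of the three loops.
struct RowSplit {
    int scalarHead;    // pixels processed one by one until dst is aligned
    int vectorBlocks;  // number of full vectors (vectorSize pixels each)
    int scalarTail;    // leftover pixels after the last full vector
};

// dstAddress: address of the first destination pixel of the row.
// cols:       pixels in the row.
// vectorSize: pixels per vector (Vc::float_v::Size in the patch);
//             assumed to be a power of two.
inline RowSplit splitRow(std::uintptr_t dstAddress, int cols, int vectorSize)
{
    assert(cols >= 0);
    assert(vectorSize > 0 && (vectorSize & (vectorSize - 1)) == 0);

    const int bytesPerPixel = 4;
    const int vectorBytes = bytesPerPixel * vectorSize;
    const int misalignment = int(dstAddress & std::uintptr_t(vectorBytes - 1));

    // Default: a short, unaligned row is handled entirely by scalar code.
    RowSplit s = { cols, 0, 0 };

    if (misalignment == 0) {
        s.scalarHead = 0;
        s.vectorBlocks = cols / vectorSize;
        s.scalarTail = cols % vectorSize;
    } else if (cols > 2 * vectorSize) {
        // Walk pixel by pixel up to the next vector boundary first.
        s.scalarHead = (vectorBytes - misalignment) / bytesPerPixel;
        const int rest = cols - s.scalarHead;
        s.vectorBlocks = rest / vectorSize;
        s.scalarTail = rest % vectorSize;
    }
    return s;
}
```

For an aligned row of 10 pixels and 4 pixels per vector, this yields two vector blocks and a two-pixel tail. The three counts always add up to `cols`, so every pixel is composited exactly once.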
diff --git a/libs/pigment/compositeops/KoVcMultiArchBuildSupport.h b/libs/pigment/compositeops/KoVcMultiArchBuildSupport.h
new file mode 100644
index 0000000..cfc45b5
--- /dev/null
+++ b/libs/pigment/compositeops/KoVcMultiArchBuildSupport.h
@@ -0,0 +1,91 @@
+/*
+ *  Copyright (c) 2012 Dmitry Kazakov <dimula73 at gmail.com>
+ *
+ *  This program is free software; you can redistribute it and/or modify
+ *  it under the terms of the GNU General Public License as published by
+ *  the Free Software Foundation; either version 2 of the License, or
+ *  (at your option) any later version.
+ *
+ *  This program is distributed in the hope that it will be useful,
+ *  but WITHOUT ANY WARRANTY; without even the implied warranty of
+ *  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ *  GNU General Public License for more details.
+ *
+ *  You should have received a copy of the GNU General Public License
+ *  along with this program; if not, write to the Free Software
+ *  Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301, USA.
+ */
+
+#ifndef __KOVCMULTIARCHBUILDSUPPORT_H
+#define __KOVCMULTIARCHBUILDSUPPORT_H
+
+#include "config-vc.h"
+
+#ifdef HAVE_VC
+
+#include <Vc/Vc>
+#include <Vc/common/support.h>
+
+#else /* HAVE_VC */
+
+namespace Vc {
+    typedef enum {ScalarImpl} Implementation;
+}
+
+#define VC_IMPL ::Vc::ScalarImpl
+
+#ifdef DO_PACKAGERS_BUILD
+#warning "Packagers build is not available without the Vc library. Disabling."
+#undef DO_PACKAGERS_BUILD
+#endif
+
+#endif /* HAVE_VC */
+
+
+#ifdef DO_PACKAGERS_BUILD
+
+template<class FactoryType>
+typename FactoryType::ReturnType
+createOptimizedClass(typename FactoryType::ParamType param)
+{
+    /*if (Vc::isImplementationSupported(Vc::Fma4Impl)) {
+        return FactoryType::template create<Vc::Fma4Impl>(param);
+    } else if (Vc::isImplementationSupported(Vc::XopImpl)) {
+        return FactoryType::template create<Vc::XopImpl>(param);
+    } else*/
+    if (Vc::isImplementationSupported(Vc::AVXImpl)) {
+        return FactoryType::template create<Vc::AVXImpl>(param);
+    } else if (Vc::isImplementationSupported(Vc::SSE42Impl)) {
+        return FactoryType::template create<Vc::SSE42Impl>(param);
+    } else if (Vc::isImplementationSupported(Vc::SSE41Impl)) {
+        return FactoryType::template create<Vc::SSE41Impl>(param);
+    } else if (Vc::isImplementationSupported(Vc::SSE4aImpl)) {
+        return FactoryType::template create<Vc::SSE4aImpl>(param);
+    } else if (Vc::isImplementationSupported(Vc::SSSE3Impl)) {
+        return FactoryType::template create<Vc::SSSE3Impl>(param);
+    } else if (Vc::isImplementationSupported(Vc::SSE3Impl)) {
+        return FactoryType::template create<Vc::SSE3Impl>(param);
+    } else if (Vc::isImplementationSupported(Vc::SSE2Impl)) {
+        return FactoryType::template create<Vc::SSE2Impl>(param);
+    } else {
+        return FactoryType::template create<Vc::ScalarImpl>(param);
+    }
+}
+
+#else /* DO_PACKAGERS_BUILD */
+
+/**
+ * In a non-packagers build we compile for a single architecture only,
+ * so the factory method simply instantiates the compile-time VC_IMPL.
+ */
+
+template<class FactoryType>
+typename FactoryType::ReturnType
+createOptimizedClass(typename FactoryType::ParamType param)
+{
+    return FactoryType::template create<VC_IMPL>(param);
+}
+
+#endif /* DO_PACKAGERS_BUILD */
+
+#endif /* __KOVCMULTIARCHBUILDSUPPORT_H */