返回介绍

Optimize Image Processing with XSIMD

发布于 2024-06-23 21:36:12 字数 28856 浏览 0 评论 0 收藏 0

About SIMD

SIMD (Single instruction, multiple data) is a type of parallel data processing where a single processor instruction can process multiple values. For this purpose the CPU incorporates special computational blocks that perform the most popular arithmetic operations in parallel. For example, using AVX blocks it can multiply 8 floating-point numbers by other 8 floating-point numbers at roughly the same “speed” [1] as it would multiply a couple of numbers using a standard instruction. Such speed increase is possible because these blocks have 8 separate physical pipelines that do the multiplication in parallel.

https://www.wenjiangs.com/wp-content/uploads/2024/docimg4/547-vvl4ai4irmd.png

_mm256_mul_ps is potentially 8 times faster than normal fmul instruction

In the original SIMD instruction set there were two instructions available: one for aligned memory access and one for unaligned:

  • __m256 _mm256_load_ps (float const * mem_addr) — load eight 32-bit floating point values stored at mem_addr into a 256-bit register. The address at mem_addr must be 256-bit aligned, otherwise application will crash with SIGSEGV.

  • __m256 _mm256_loadu_ps (float const * mem_addr) — same as above, but mem_addr is allowed to be unaligned.

In older CPUs (before Nehalem for Intel and before TODO:unknown for AMD) aligned version of the instruction was more efficient when dealing with aligned data than the unaligned one. Hence we had to write two different versions of the code, one for aligned data and one for unaligned. In modern Intel CPUs both instructions give exactly the same performance when passed with the aligned data, so it is considered safe to use the unaligned version all the time.

小心

  • TODO: Is it safe to always use unaligned access on ARM?

  • TODO: Are we required to use element-aligned on ARM?

Please remember, that for better performance it is still recommended to make sure that all buffers are aligned to SIMD-word boundary (that is, 32 bytes for AVX and 16 bytes for SSE).

Fixed-size chunks

Even though we can partially ignore alignment issues with modern CPUs, we still cannot ignore the fact that the size of the chunks must be fixed.

Let’s consider the following example, where we need to process 10 pixels:

https://www.wenjiangs.com/wp-content/uploads/2024/docimg4/548-rslji5nakn1.png

_mm256_mul_ps will potentially read past the end of the allocated buffers

Compilation targets

The term “cpu optimization” is usually rather confusing. It can be used in three different meanings:

  1. Compiler’s “target architecture”. An architecture, for which the compiler generates code. This architecture is selected by -march and -mtune compiler switches. -march allows the compiler to issue architecture specific instructions. It also enables the corresponding intrinsics. -mtune activates automatic optimizations (and vectorizations) of the code for the specified architecture.

    For example, if we specify -march=nehalem, then _mm_mul_ps intrinsic will become available (since it is a part SSE), but _mm256_mul_ps will not (it is from AVX set). If we specify -march=sandybridge, then both intrinsic will become available.

  2. “Host architecture”. An architecture of the CPU where we compile Krita on. In most of the cases this architecture doesn’t matter. If the compiler supports generation of instructions for a specific instruction set at all (e.g. AVX), then it will generate them on any host CPU, even the older one.

    Theoretically, you can instruct the compiler to build Krita for the “host architecture”, by passing -march=native -mtune=native, but it is not recommended, since it makes the binaries not portable.

    提示

    Back in 2012 Krita also had an option to compile for the “host architecture”. That option was removed later in favor of multiarch builds of the critical code.

  3. “User’s CPU architecture”. An architecture of the CPU where the user will run Krita on.

Obviously, we cannot tell in advance what CPU the user will run Krita on. We can detect CPU capabilities only when Krita actually starts on user’s device. Therefore we need to have multiple versions of the hot-path algorithms, prebuilt for each possible CPU architecture and select the optimal version on Krita startup.

Here is Krita we prebuild code for 7 most popular target instruction sets:

  1. SSE2 — basic 128-bit floating-point and integer arithmetic

  2. SSSE3 — SSE2 + several shuffle instructions

  3. SSE4.1 — SSSE3 + integer conversions and rounding instructions

  4. AVX — SSE4.1 + 256-bit floating-point (only!) arithmetic and shuffles

  5. AVX2+FMA — AVX + 256-bit integer arithmetic and fused-multiply-add (used a lot in compositioning code for implementation of lerp function)

Multiarch builds

There are two standard approaches for multiarch builds:

1. DLL-based approach

The easiest approach assumes that you build the same .dll or .so library multiple times, one for each supported architecture. Then, on application launch, you load the library that fits best to the current CPU. GNU ld also has some special features that allow automatically resolve symbols on a per-architecture basis. GNU libc uses this approach.

Here in Krita we do not use this approach:

  • it causes too much code to be duplicated between the cloned libraries

  • relying on the linker features is not a portable approach

2. Template-based approach

In Krita we use a template based approach. It is very explicit and provides full control over how implementations are generated and selected.

Let’s consider an example of KoOptimizedPixelDataScalerU8ToU16. It is a simple class that provides optimized routines for converting pixels between uint8 and uint16 pixel formats. We use this class to increase precision of colorsmudge brush and avoid the well-known “color drift on low opacity” bug.

Firstly, we need to declare an abstract interface class that will be available to the user:

// file: KoOptimizedPixelDataScalerU8ToU16Base.h

class KRITAPIGMENT_EXPORT KoOptimizedPixelDataScalerU8ToU16Base
{
public:
    // ...
    virtual void convertU8ToU16(const quint8 *src, int srcRowStride,
                                quint8 *dst, int dstRowStride,
                                int numRows, int numColumns) const = 0;

    virtual void convertU16ToU8(const quint8 *src, int srcRowStride,
                                quint8 *dst, int dstRowStride,
                                int numRows, int numColumns) const = 0;
    // ...
};

The class does nothing serious other than declares two pure virtual methods that will be used by the colorsmudge brush later.

Then we need to add a header file with the class that actually implements this interface using SSE/AVX instructions:

// file: KoOptimizedPixelDataScalerU8ToU16.h

template<typename _impl>
class KoOptimizedPixelDataScalerU8ToU16 : public KoOptimizedPixelDataScalerU8ToU16Base
{
public:
    // ...
    void convertU8ToU16(const quint8 *src, int srcRowStride,
                        quint8 *dst, int dstRowStride,
                        int numRows, int numColumns) const override
    {
        // ... very clever implementation of U8->U16 scaling using SSE/AVX is skipped ...
    }

    void convertU16ToU8(const quint8 *src, int srcRowStride,
                        quint8 *dst, int dstRowStride,
                        int numRows, int numColumns) const override
    {
        // ... very clever implementation of U16->U8 scaling using SSE/AVX is skipped ...
    }
    // ...
};

Pay attention to the only template parameter of the class. The class is parameterized with “architecture”, which is a simple class provided by XSIMD. We don’t use this template parameter inside the class. We only need it to create multiple copies of the class without violating ODR-rule.

In the next step we need to create a FactoryImpl class. It is actually the class that will be copied multiple times.

// file: KoOptimizedPixelDataScalerU8ToU16FactoryImpl.h

class KRITAPIGMENT_EXPORT KoOptimizedPixelDataScalerU8ToU16FactoryImpl
{
public:

    /// declare a templated factory method that is parameterized
    /// by the CPU architecture

    template<typename _impl>
    static KoOptimizedPixelDataScalerU8ToU16Base* create(int);
};
// file: KoOptimizedPixelDataScalerU8ToU16FactoryImpl.cpp

/// define a full template specialization for the factory
/// method for `xsimd::current_arch` architecture

template<>
KoOptimizedPixelDataScalerU8ToU16Base *
KoOptimizedPixelDataScalerU8ToU16FactoryImpl::create<xsimd::current_arch>(
    int channelsPerPixel)
{
    return new KoOptimizedPixelDataScalerU8ToU16<xsimd::current_arch>(
        channelsPerPixel);
}

FactoryImpl class has the only method. This method creates the scaler object and returns it via the abstract interface. Pay attention that create() method has no generic template implementation. Its only implementation is fully specialized with “magic” type xsimd::current_arch. xsimd::current_arch is a special placeholder type that points to the “desired target” architecture type, when the .cpp file is compiled for multiple architectures.

Now we need to actually compile KoOptimizedPixelDataScalerU8ToU16FactoryImpl.cpp for all targets. To do that we should use a special CMake macro:

// file: CMakeLists.txt

if(HAVE_XSIMD)

    # create 6 copies of the file and compile each one
    # with the corresponding compiler flags

    ko_compile_for_all_implementations(__per_arch_rgb_scaler_factory_objs
                                       KoOptimizedPixelDataScalerU8ToU16FactoryImpl.cpp)
else()

    # in case XSIMD is not available, just compile the .cpp file once
    # with the default compiler options (x86_64)

    set(__per_arch_rgb_scaler_factory_objs KoOptimizedPixelDataScalerU8ToU16FactoryImpl.cpp)
endif()

# ...

set(kritapigment_SRCS
    # ...
    ${__per_arch_rgb_scaler_factory_objs}
    # ...
)

Now we have six explicit intantiations of KoOptimizedPixelDataScalerU8ToU16FactoryImpl class. One for each target architecture. The only thing left is to implement runtime selection of the proper instantiation. To do that, let’s implement a real Factory class:

// file: KoOptimizedPixelDataScalerU8ToU16Factory.h

class KRITAPIGMENT_EXPORT KoOptimizedPixelDataScalerU8ToU16Factory
{
public:
    static KoOptimizedPixelDataScalerU8ToU16Base* createRgbaScaler();
};
// file: KoOptimizedPixelDataScalerU8ToU16Factory.cpp

KoOptimizedPixelDataScalerU8ToU16Base *KoOptimizedPixelDataScalerU8ToU16Factory::createRgbaScaler()
{
    return createOptimizedClass<
            KoOptimizedPixelDataScalerU8ToU16FactoryImpl>(4);
}

The Factory class calls a special function createOptimizedClass(), which detects the current CPU architecture and calls the proper instantiation of the FactoryImpl class to create the scaler object.

The usage of the optimized class is very simple:

/// detect the current CPU capabilities, select the best-fit `FactoryImpl`
/// factory and create the scaler object that is optimized for the current CPU

KoOptimizedPixelDataScalerU8ToU16Base *scaler =
    KoOptimizedPixelDataScalerU8ToU16Factory::createRgbaScaler();

/// use the scaler as usual...

scaler->convertU8ToU16(...);

提示

Why do we have two factory objects, FactoryImpl and Factory?

The main reason is that we really don’t want to export a templated class from a .so/.dll library. Exporting templates is not portable, so we encapsulate all the templates behind the wall of a Factory class.

Some notes about writing efficient processing functions:

  • when processing pixels we should perform as few virtual calls as possible

  • the best way to minimize the number of virtual calls is to use “row-stride” approach, like convertU8ToU16() does

    void convertU8ToU16(const quint8 *src, int srcRowStride,
                        quint8 *dst, int dstRowStride,
                        int numRows, int numColumns) const override
    
  • when implementing your own algorithms just reuse the signature of convertU8ToU16()

Explicit scalar implementation of the code

As we discussed above, we need to generate a version of an algorithm for every target CPU. Usually, we also need to provide one more version of the code, which is fully scalar, i.e. doesn’t use any SIMD instructions. This scalar version is used used in the following cases:

  • no XSIMD library is not found during the compilation of Krita

  • the CPU Krita runs on doesn’t support any SIMD we know

  • the user emplicitly disabled SIMD optimizations in ‘Performance’ tab in Krita’s ‘Preferences’

This extra version is generated by passing a special architecture in the template parameter xsimd::generic. Depending on how you code implements the SIMD optimization, you would either need to handle this architecture manually or not.

The code in `KoOptimizedPixelDataScalerU8ToU16 uses explicit checks for macros like XSIMD_WITH_AVX2 and XSIMD_WITH_NEON64, therefore, it does not need any special handling for the “scalar” xsimd::generic architecture. For xsimd::generic all these macros are undefined, hence the code automatically folds into plain old C-for-loop.

If your code is more abstract, i.e. delegates all the architecture dispatching to XSIMD library (the library itself is described below), then you need to add a custom implementation for xsimd::generic arcitecture. You can check the example of how it is done in KoAlphaMaskApplicator:

template<typename _channels_type_,
        int _channels_nb_,
        int _alpha_pos_,

        // implementation that will be passed by `xsimd::current_arch`
        typename _impl,

        // a dummy parameter to make `std::enable_if` work
        typename EnableDummyType = void

        >
struct KoAlphaMaskApplicator : public KoAlphaMaskApplicatorBase
{
    void applyInverseNormedFloatMask(quint8 *pixels,
                                    const float *alpha,
                                    qint32 nPixels) const override {
        // ... default implementation that uses plain C-for-loops only ...
    }
    // ... skipped unrelated...
};

#if defined(HAVE_XSIMD) && !defined(XSIMD_NO_SUPPORTED_ARCHITECTURE)

#include "KoStreamedMath.h"

template<typename _impl>
struct KoAlphaMaskApplicator<
        quint8, 4, 3, _impl,

        // enable this implementation only when it is **not** `xsimd::generic`
        typename std::enable_if<!std::is_same<_impl, xsimd::generic>::value>::type

        > : public KoAlphaMaskApplicatorBase
{
    void applyInverseNormedFloatMask(quint8 *pixels,
                                    const float *alpha,
                                    qint32 nPixels) const override
    {
        // ... actual implementation of SIMD algorithm using XSIMD ...
    }
    // ... skipped unrelated...
};

XSIMD Library

All we did before was just a preparation for the actual work. Now we need to write the actual SIMD code.

Here in Krita we use a special library XSIMD. It wraps all the compiler intrinsics into convenient C++ classes. The heart of XSIMD is xsimd::batch<type, arch> class. It behaves as if it were a simple arithmetic type, but processes multiple values at once.

Example:

/// Define convenience types to manage vector batches.
/// `_impl` is a template parameter that is passed via `xsimd::current_arch`
/// by the per-arch build script. The size of the vector is defined
/// by the actual architecture passed to it.

using uint_v = xsimd::batch<unsigned int, _impl>;
using float_v = xsimd::batch<float, _impl>;

// load pixels into a vector register

uint_v data_i = uint_v::load_unaligned(reinterpret_cast<const quint32 *>(pixels));

// extract alpha channel from the pixels and convert it to float

const float_v pixelAlpha =
    xsimd::to_float(xsimd::bitwise_cast<int_v>(data_i >> 24U));

In Krita we have a set of predefined convenience types for vector batches in KoStreamedMath:

batch type

element type

num elements (AVX2)

num elements (AVX)

num elements (SSE2)

int_v

qint32

8

8*

4

uint_v

quint32

8

8*

4

float_v

float

8

8

4

  • — even though the first version of AVX doesn’t support integer packs, XSIMD implements that by combining two(!) SSE2 registers into one batch. This way we can easily convert int_v into float_v back and forth.

Arithmetic operations

Arithmetic operations with SIMD batches look exactly the same as if you did them with normal int or float values. Let’s consider example from KoAlphaMaskApplicator::fillGrayBrushWithColor, which fills the alpha mask of the RGBA8 brush with provided color (all the inline comments assume the current architecture is AVX2):

/// a convenience batch for 8 copies of the provided brush color
/// (please note that the constructor accepts a plain quint32 value,
/// this value is loaded into all 8 slots of the batch)

const uint_v brushColor_i(*reinterpret_cast<const quint32*>(brushColor) & 0x00FFFFFFu);

/// a convenience batch of 8 values `0xFF`

const uint_v redChannelMask(0xFF);

for (int i = 0; i < block1; i++) {

    const int vectorPixelStride = numChannels * static_cast<int>(float_v::size);

    /// Load RGBA8 pixels into the brush. If `brush` pointer is aligned to 256 bits,
    /// the speed it a little bit better, but it is not strictly necessary, since we
    /// use `load_unaligned` call.

    const uint_v maskPixels = uint_v::load_unaligned(reinterpret_cast<const quint32*>(brush));

    /// calculate the alpha channel value of each pixel

    const uint_v pixelAlpha = maskPixels >> 24;

    /// calculate the red channel value of each pixel (the brush is guaranteed to be
    /// grayscale here, that is, all color channels have the same value)

    const uint_v pixelRed = maskPixels & redChannelMask;

    /// calculate the final alpha value of the brush

    const uint_v pixelAlpha_i = multiply(redChannelMask - pixelRed, pixelAlpha);

    /// combine alpha value and the provided painting color

    const uint_v data_i = brushColor_i | (pixelAlpha_i << 24);

    /// store the result into the brush memory buffer

    data_i.store_unaligned(reinterpret_cast<typename uint_v::value_type *>(dst));

    dst += vectorPixelStride;

    /// we have processed `float_v::size` pixels at once, so advance the pointer
    /// (for AVX2 `float_v::size` is `8`)

    brush += float_v::size;
 }

Conditionals

Conditionals for vectorized values look very different from normal values. You can compare two batches, but instead of getting a single boolean you get a batch of booleans. This resulting boolean batch is called “a mask” and you can analyze it afterwards.

Let’s consider an example from KoOptimizedCompositeOpAlphaDarken.h. Alpha-darken blending mode has a lot of conditionals inside, so it is a very nice example. Here is a short excerpt from it:

/// check what dst pixels have zero alpha value, the result is
/// written into a mask of type `float_m`

const float_m empty_dst_pixels_mask = dst_alpha == zeroValue;

/// check if **all** dst pixels have null alpha

if (!xsimd::all(empty_dst_pixels_mask)) {

    /// it seems like there are some pixels with non-zero alpha...
    /// check if all pixels have non-zero alpha

    if (xsimd::none(empty_dst_pixels_mask)) {

        /// if all destination pixels have non-zero alpha, just
        /// blend them as usual

        dst_c1 = (src_c1 - dst_c1) * src_alpha + dst_c1;
        dst_c2 = (src_c2 - dst_c2) * src_alpha + dst_c2;
        dst_c3 = (src_c3 - dst_c3) * src_alpha + dst_c3;
    }
    else {

        /// if at least one pixel has zero alpha, we cannot use its
        /// `dst_c1` value, because it is undefined; we need to
        /// conditionally overwrite such pixels with `src_c1`

        dst_c1 = xsimd::select(empty_dst_pixels_mask, src_c1, (src_c1 - dst_c1) * src_alpha + dst_c1);
        dst_c2 = xsimd::select(empty_dst_pixels_mask, src_c2, (src_c2 - dst_c2) * src_alpha + dst_c2);
        dst_c3 = xsimd::select(empty_dst_pixels_mask, src_c3, (src_c3 - dst_c3) * src_alpha + dst_c3);
    }
}
else {

    /// if **all** dst pixels have null alpha, just overwrite them

    dst_c1 = src_c1;
    dst_c2 = src_c2;
    dst_c3 = src_c3;
}

Gather-scatter

TODO: this chapter is not written yet. Please check implementation of FastRowProcessor<KisCurveMaskGenerator> and PixelWrapper<quint16, impl>::read for examples.

Mixed-style

Sometimes you may want to mix XSIMD code and raw compiler intrinsics. In some cases, it may give much better performance, especially if a specific CPU instruction exists for your operation. In such cases you can just access the underlying __m128 or __m256 type of the batch via .data member.

Let’s consider an example from KoOptimizedPixelDataScalerU8ToU16:

// a pack of 16 quint16 values stored in a 256-bit AVX2 register
using uint16_avx_v = xsimd::batch<uint16_t, xsimd::avx2>;

// a pack of 16 quint8 values stored in a 128-bit SSE register
using uint8_v = xsimd::batch<uint8_t, xsimd::sse4_1>;


for (int i = 0; i < avx2Block; i++) {

    /// load a pack of 16 8-bit integer values using SSE4 instruction

    const uint8_v x = uint8_v::load_unaligned(srcPtr);

    /// convert them into 16 16-bit integers (and store in a
    /// wider register) using AVX2 instruction

    uint16_avx_v y(_mm256_cvtepu8_epi16(x));

    /// scale the value and add entropy to the lower bits to make
    /// rounding smoother using AVX2 instruction

    const uint16_avx_v y_shifted = y << 8;
    y |= y_shifted;

    /// store the value using AVX2 instruction

    y.store_unaligned(
         reinterpret_cast<typename uint16_avx_v::value_type *>(dstPtr));

    srcPtr += channelsPerAvx2Block;
    dstPtr += channelsPerAvx2Block;
}

This approach uses a custom instruction provided by the CPU to optimize the conversion process. It gives a huge speed benefit for the color smudge brush, where this scaler is used. The main problem of this approach is that you need to implement the custom code for every platform we support, including NEON and NEON64.

Useful projects for Krita

  1. [easy, small] Optimize lightness mode for Krita brushes

    It needs changes in the following places:

    • KoColorSpace::modulateLightnessByGrayBrush()

    • KoColorSpace::fillGrayBrushWithColorAndLightnessWithStrength()

    • KoColorSpace::fillGrayBrushWithColorAndLightnessOverlay()

    The project is really nice and self-contained. You can follow the example of KoColorSpaceAbstract::m_alphaMaskApplicator that does exactly the same thing.

  2. [easy, big] Optimize masking brush compositioning

    You basically need to rewrite a single class KisMaskingBrushCompositeOp. The problem is that the class is parameterized with a dozen of composition functions. Theoretically, those functions are arithmetic, so they can be just passed with xsimd’s batches, but you would probably need to define custom KoColorSpaceMathsTraits for them.

  3. [difficult, small] Optimize gradients

    The project basically needs to optimize KoCachedGradient and all the places where it is used. The project might be a bit complicated, because it needs to use gather/scatter functionality, which is a bit tricky.

  4. [easy, small, depends on the previous three] Optimize brush textures

    Basically, you needs to rewrite KisTextureOption::apply to use the code of the previous three projects to do batch-processing.

  5. [very difficult, big] Scale predefined brushes with vectorized instructions

    In this project you needs to rewrite KisQImagePyramid class to use custom scaling algorithm instead of relying on QImage. We know that QImage internally uses SSE/AVX instructions for scaling, but we are required to use RGBA8 mode for that. And our brushes are usually GrayA8 or even Alpha8, so we have huge overhead on allocations, copies and conversions (confirmed by VTune).

Footnotes

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。
列表为空,暂无数据
    我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
    原文