README.md

    Copyright © 2022-2024, KNS Group LLC (YADRO)
    Licensed under the Apache License, Version 2.0
    http://www.apache.org/licenses/LICENSE-2.0

    This code can optionally link with OpenBLAS for testing the accuracy and
    performance of matrix multiplication. OpenBLAS is freely distributed under the
    BSD license. Here is the link to the OpenBLAS project:

    • https://www.openblas.net/

    This code follows the logic of the following articles. These texts explain the
    well-known approaches for optimizing matrix multiplication and for computing
    neural convolution by reducing it to the BLAS/gemm function with the
    im2col/im2row transforms. Links:

    • https://habr.com/ru/articles/359272/
    • https://habr.com/ru/articles/448436/

    Overview

    This repo contains examples that illustrate computing neural convolution by reducing
    it to a function which we call “mini-gemm”. “Gemm” refers to the well-known BLAS/gemm
    function for GEneral Matrix Multiplication (GEMM), which computes C = alpha*A*B + beta*C.
    The “mini-” prefix indicates our specific variant of a GEMM-like interface; besides, we
    target small matrices, like 100 x 100 (or maybe 1000 x 10) elements (hence mini-matrices).

    These examples illustrate optimizing the mini-gemm function for the central
    processing unit (CPU), specifically for the RISC-V architecture. Even more
    specifically, so far it has only been tested on the Lichee Pi 4A mini-computer
    (a development platform, in fact) with the TH1520 processor by T-Head, a company
    of the Alibaba group from China. For reference: this CPU is built on 4 XuanTie
    C910 cores.

    These examples are also benchmarks (performance tests). Such a test calls the
    tested function many times with different parameters and tracks the time, in
    milliseconds per call. This allows estimating the effectiveness of the given
    optimizations, and maybe trying your own variant. In particular, we can convert
    the measured milliseconds into giga-flop/s and compare against the theoretical
    peak performance of the target CPU, like the Ryzen 5500U for x86, or the XuanTie
    C910 if we test on RISC-V.
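
    For a single gemm call with matrices of sizes M x K and K x N, the operation count
    is about 2 * M * N * K floating-point operations, so giga-flop/s = 2 * M * N * K /
    (milliseconds per call * 10^6). For example, a 100 x 100 x 100 multiplication that
    takes 0.1 ms per call runs at about 20 giga-flop/s.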

    These tests can also compare the performance of our optimized mini-gemm function
    versus a similar call to the BLAS/gemm function, specifically the OpenBLAS
    implementation. They can also compare the performance of convolution-via-mini-gemm
    versus the “standard” way to compute neural convolution via BLAS/gemm: reducing it
    with the well-known im2col/im2row transforms. Or, if you choose to build these
    examples without linking to OpenBLAS, our tests compare against naive code: a
    direct calculation of convolution, or a simple triple-loop GEMM.
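
    For reference, here is a minimal sketch of such a triple-loop GEMM (illustrative
    only, not necessarily the exact reference code of these examples), computing
    C[M x N] += A[M x K] * B[K x N] for row-major FP32 matrices:

    /* naive "triple-cycle" GEMM: no alpha/beta, no blocking, no vectorization */
    void naive_sgemm(int M, int N, int K,
                     const float *A, const float *B, float *C)
    {
        for (int i = 0; i < M; ++i)
            for (int j = 0; j < N; ++j) {
                float s = C[i*N + j];
                for (int k = 0; k < K; ++k)
                    s += A[i*K + k] * B[k*N + j];
                C[i*N + j] = s;
            }
    }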

    Below are more details about these examples, including instructions for building
    and running them on both x86 and RISC-V processors:

    Examples: minigemm & convolution

    This directory contains four examples/benchmarks:

    • example_minigemm
    • example_convolution
    • example_optimized_minigemm
    • example_optimized_convolution

    Two of these examples performance-test the “mini-gemm” function. The other two
    test neural-convolution-via-mini-gemm, i.e. reducing the operation of neural
    convolution to a series of calls to the mini-gemm function.

    The first two, non-optimized examples are probably not very interesting: they present
    the concept and show whether the approach works at all on RISC-V computers. The other
    two examples test performance with some hand-made optimizations of mini-gemm. This
    allows outperforming convolution computed with the well-known im2col/im2row methods,
    which reduce convolution to the GEMM function from OpenBLAS.

    How can our simple variant of mini-gemm outperform well-optimized OpenBLAS?
    Mini-gemm itself does not, but convolution-via-mini-gemm sometimes does, given a
    lucky use case (parameters of convolution). Unlike im2col/im2row, reducing
    convolution to mini-gemm never requires duplicating or copying data, so it better
    utilizes the processor's L1/L2 caches (note that the Lichee Pi 4A has no L3 cache).
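
    To show where that duplication comes from, here is a minimal im2col sketch
    (illustrative only, not the code of these examples) for an FP32 input tensor in
    CHW layout: each input element is copied up to KH*KW times into the resulting
    (C*KH*KW) x (OH*OW) matrix, which is then multiplied by the reshaped weights:

    void im2col(const float *src, int C, int H, int W,   /* input tensor C x H x W  */
                int KH, int KW, int stride, int pad,     /* kernel, stride, padding */
                float *dst)                              /* (C*KH*KW) x (OH*OW)     */
    {
        int OH = (H + 2*pad - KH) / stride + 1;
        int OW = (W + 2*pad - KW) / stride + 1;
        for (int c = 0; c < C; ++c)
            for (int kh = 0; kh < KH; ++kh)
                for (int kw = 0; kw < KW; ++kw) {
                    int row = (c*KH + kh)*KW + kw;
                    for (int oh = 0; oh < OH; ++oh)
                        for (int ow = 0; ow < OW; ++ow) {
                            int ih = oh*stride - pad + kh;
                            int iw = ow*stride - pad + kw;
                            dst[row*(OH*OW) + oh*OW + ow] =
                                (ih >= 0 && ih < H && iw >= 0 && iw < W)
                                    ? src[(c*H + ih)*W + iw] : 0.0f;
                        }
                }
    }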

    The first two not-optimized examples try our concept with various data types:

    • FP32 x FP32 –> FP32
    • Int32 x Int32 –> Int32
    • 8-bit x 8-bit –> Int32
      • I8 x I8 –> I32
      • U8 x U8 –> I32
      • I8 x U8 –> I32
      • U8 x I8 –> I32
    • FP16 x FP16 –> FP16
    • FP16 x FP16 –> FP32

    The optimized examples mostly concentrate on the pure FP32 case. It performs quite
    well on the target RISC-V processor (we tested on the Lichee Pi 4A), as the processor
    can execute two vectorized vfma (s += ax) operations per CPU tick. For FP16, the
    performance peak would be twice as high as for FP32, but FP16 is less usable due to
    accuracy loss at such reduced precision. For reference, here is the theoretical
    estimate of peak CPU performance for the s += ax operation on one core of the
    Lichee Pi 4A processor running at 1.848 GHz with 128-bit SIMD:

    • 59.136 giga-flop/s for FP16 += FP16 x FP16
    • 29.568 giga-flop/s for FP32 += FP32 x FP32
    • 14.784 giga-flop/s for Int32 += Int8 x Int8
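
    For example, the FP32 figure is 1.848 GHz x 2 vfma instructions per tick x 4 FP32
    lanes in a 128-bit vector x 2 flops per fused multiply-add = 29.568 giga-flop/s;
    for FP16 the lane count doubles, giving 59.136 giga-flop/s.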

    If interested, you can find more details about the Lichee Pi 4A on the Web, e.g.:

    • https://wiki.sipeed.com/hardware/en/lichee/th1520/lp4a.html

    To check the correctness and performance of mini-gemm, these examples use the
    OpenBLAS/gemm function as the reference. Alternatively, they use a hand-made,
    non-optimized GEneral Matrix Multiplication if you choose to build these examples
    without OpenBLAS.

    To check the correctness and performance of convolution-via-mini-gemm, these examples
    compare it against the well-known method of reducing convolution to BLAS/gemm with
    the im2col/im2row transforms. Or, if you choose to build these examples without
    OpenBLAS, they compare against the naive direct method of computing convolution.

    For data types other than FP32, we can still use BLAS/gemm: by copying the data into
    a temporary buffer while converting it to FP32, and then converting the results back
    to the target type. Of course, this substantially hurts the observed performance of
    the “via BLAS/gemm” calculations, so the performance comparison is not very fair for
    data types other than FP32.
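
    A minimal sketch of such a conversion wrapper for the I8 x I8 –> I32 case, using
    the CBLAS interface of OpenBLAS (illustrative only; the extra copies and conversions
    are exactly what slows this path down):

    #include <stdlib.h>
    #include <cblas.h>

    /* C[M x N] (int32) += A[M x K] (int8) * B[K x N] (int8), via FP32 sgemm */
    void gemm_i8_via_sgemm(int M, int N, int K,
                           const signed char *A, const signed char *B, int *C)
    {
        float *fa = malloc(sizeof(float) * M * K);   /* temporary FP32 copies */
        float *fb = malloc(sizeof(float) * K * N);
        float *fc = calloc((size_t)M * N, sizeof(float));
        for (int i = 0; i < M * K; ++i) fa[i] = A[i];
        for (int i = 0; i < K * N; ++i) fb[i] = B[i];
        cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                    M, N, K, 1.0f, fa, K, fb, N, 0.0f, fc, N);
        /* convert back; exact only while the FP32 sums stay below 2^24 */
        for (int i = 0; i < M * N; ++i) C[i] += (int)fc[i];
        free(fa); free(fb); free(fc);
    }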

    Prepare

    Here I describe how to build these examples locally on your x86/Linux computer,
    including the case where your copy of Linux is installed under Windows on WSL 2,
    for example WSL/Ubuntu. This instruction also describes cross-compiling these
    examples for your Lichee Pi 4A device, using your x86/Linux machine as the host.

    You may build these examples with or without linking them to OpenBLAS, at your
    discretion. If you prefer comparing the performance of mini-gemm versus BLAS/gemm
    (which I would recommend), you need to build OpenBLAS itself first. Instructions
    on how to build OpenBLAS can be found below. Here, let's assume it is already
    built, and its binaries are located in some local folder on your host computer.

    Cross-compiling for RISC-V has so far only been tested for the Lichee Pi 4A, the
    development board with the TH1520 processor and its 4 XuanTie C910 cores. To
    compile for this target system, you need to download the XuanTie development kit.
    Specifically, you need version 2.8.1 of the XuanTie toolkit. You may find this
    toolkit in binary form at the following Web site:

    • https://xrvm.cn/community/download

    Here is the direct link for downloading the toolkit version 2.8.1:

    To cross-compile, please download and unpack this toolkit on your host, under
    Linux, or under WSL/Linux if you work under Windows. Please unpack the toolkit
    into your home folder and set up the XUANTIE_TOOLCHAIN_ROOT environment variable,
    like this:

    export XUANTIE_TOOLCHAIN_ROOT=/home/${USER}/Xuantie-900-gcc-linux-5.10.4-glibc-x86_64-V2.8.1/
    

    To execute our tests/examples on your Lichee Pi 4A board, you also need a copy
    of the toolkit in the /home/sipeed folder on the board, assuming you log in to
    the board as the default user sipeed.

    Alternatively, you may copy only the sysroot sub-folder of the toolkit, as the
    compiled tests need only this sub-folder. If you choose this option, I'd recommend
    archiving the sysroot folder before copying it to the board: it has many symbolic
    links, which would cause many lib files to be duplicated if you copy simply with
    “scp -r”.

    Below is an example of copying the sysroot folder to the board. Let the LICHEE_PI
    variable hold the IP address or the network name of the Lichee Pi, by which your
    host can ping it. Here I assume you log in to the Lichee Pi board as the default
    user named sipeed. Note that the default password pre-defined for the sipeed user
    on the board is licheepi:

    On host:

    cd ~/Xuantie-900-gcc-linux-5.10.4-glibc-x86_64-V2.8.1
    tar cf sysroot.tar sysroot/*  # archive with sym-links
    gzip -1 sysroot.tar           # compress into *.tar.gz
    scp sysroot.tar.gz sipeed@${LICHEE_PI}:/home/sipeed
    rm sysroot.tar.gz         # no need to keep it on host
    

    Login to board:

    ssh sipeed@${LICHEE_PI}
    

    On board:

    cd /home/sipeed  # actually, must be already here at login
    mkdir Xuantie-900-gcc-linux-5.10.4-glibc-x86_64-V2.8.1
    cd    Xuantie-900-gcc-linux-5.10.4-glibc-x86_64-V2.8.1
    tar zxf ../sysroot.tar.gz  # unpack, restore sym-links
    rm      ../sysroot.tar.gz  # no need to keep *.tar.gz
    

    Actually, you do not need the 32-bit sub-folders sysroot/lib32... on the dev board,
    so you may delete these 32-bit libs from the Lichee Pi copy of sysroot if you want.

    Build

    We use CMake to build these examples. The CMake options specific to this project are:

    Option              Type   Comments
    RISCV_ENABLE_RVV    bool   default is OFF
    RISCV_ENABLE_FP16   bool   default is OFF
    SMART_CHOICE_OPTL   bool   default is OFF
    OPTIMIZE_INTEGER    bool   default is OFF
    WITH_OPEN_BLAS      bool   default is OFF
    OPEN_BLAS_DIR       path   default is empty

    Below is an example of building for x86, without linking to OpenBLAS. Here we assume
    that ${MINIGEMM} (spelled ${RVDNN_EXAMPLES_MINIGEMM} in some examples below) is the
    folder on your host where you've placed the local clone of the minigemm repository:

    cd ${MINIGEMM}
    mkdir -p build/x86
    cd       build/x86
    cmake -S ../.. -B . -DCMAKE_BUILD_TYPE=Release
    make
    

    This should build the binaries of our examples and put them under the bin folder,
    e.g.:

    cd ${MINIGEMM}
    ls -al bin
    -rwxr-xr-x 209920 example_convolution
    -rwxr-xr-x 160016 example_minigemm
    -rwxr-xr-x 210064 example_optimized_convolution
    -rwxr-xr-x 169128 example_optimized_minigemm
    

    Please see the Test section below for instructions on how to run them.

    Linking these examples to OpenBLAS lets them compare the performance of our
    mini-gemm function against the similar BLAS/gemm function, which is
    well-optimized in the OpenBLAS package.

    Example of building for x86, with linking to OpenBLAS. Assume we have already
    built OpenBLAS, and its binaries are located in a local host folder, e.g.:

    cd ${RVDNN_EXAMPLES_MINIGEMM}
    mkdir -p build/x86-blas
    cd       build/x86-blas
    cmake -S ../.. -B .  -DCMAKE_BUILD_TYPE=Release -DWITH_OPEN_BLAS=ON \
        -DOPEN_BLAS_DIR=/home/${USER}/code/OpenMathLib/OpenBLAS/build/x86-ryzen3-nothreads/
    make
    

    To build for the Lichee Pi 4A, we cross-compile on an x86 host computer.
    The Prepare section above explains how to download the XuanTie compiler
    and install it on your host system.

    Example of cross-compiling for RISC-V, without linking to OpenBLAS.
    Do this on your host computer, under Linux or under WSL/Linux:

    cd ${RVDNN_EXAMPLES_MINIGEMM}
    mkdir -p build/riscv
    cd       build/riscv
    cmake -S ../.. -B .  -DCMAKE_BUILD_TYPE=Release \
        -DCMAKE_TOOLCHAIN_FILE=../../cmake/xuantie.toolchain.cmake \
        -DRISCV_ENABLE_RVV=ON -DRISCV_ENABLE_FP16=ON
    make
    

    To test performance against well-optimized OpenBLAS, link with its binaries,
    which you need to build on your host computer beforehand:

    cd ${RVDNN_EXAMPLES_MINIGEMM}
    mkdir -p build/riscv-blas
    cd       build/riscv-blas
    cmake -S ../.. -B . -DCMAKE_BUILD_TYPE=Release \
        -DCMAKE_TOOLCHAIN_FILE=../../cmake/xuantie.toolchain.cmake \
        -DRISCV_ENABLE_RVV=ON -DRISCV_ENABLE_FP16=ON -DWITH_OPEN_BLAS=ON \
        -DOPEN_BLAS_DIR=/home/${USER}/code/OpenMathLib/OpenBLAS/build/riscv-c910v-nothreads
    make
    

    Test

    To run the examples on x86, just start one of them, e.g.:

    cd ${RVDNN_EXAMPLES_MINIGEMM}
    cd build/x86-blas
    time bin/example_optimized_convolution ${OPTION} ; echo status=$?
    

    Such testing may last 10 to 30 minutes. It prints a log of its progress.
    At the end, it prints a summary table with the results of the performance testing.

    To run the cross-compiled examples on your Lichee Pi 4A board, you need to
    upload the binaries to the board. The following example assumes that the Lichee
    Pi 4A appears on your local network, and ${LICHEE_PI} equals its IP address or
    its name. Here I assume you log in to the Lichee Pi as the default user named
    sipeed. The default password for the sipeed user on the board is licheepi.
    Assume you copy the binaries to the /home/sipeed/temp folder. Please create
    this folder manually in advance:

    On host:

    cd ${RVDNN_EXAMPLES_MINIGEMM}
    cd build/riscv-blas
    scp bin/example_* sipeed@${LICHEE_PI}:temp
    

    On Lichee Pi board:

    cd ~/temp
    time ./example_optimized_convolution ${OPTION} ; echo status=$?
    

    The command-line OPTION may be empty, or one of these:

    Option   Meaning
    -s       stabilize performance results
    -f       fast testing mode (less stable)
    -v       verbose: more detailed log

    To explain these options, here are a few words about our tests/examples.

    We execute the timed experiment 4 times:

    • 0th run: calibrate and “warm up the engine”
    • 1st, 2nd, 3rd runs: the actual measurement, attempted three times

    Calibration tells the test how many times to iterate the tested function.

    Note that the timer we use is accurate only if we measure a large enough interval,
    like 100 milliseconds or more. The calibration phase iterates the tested function
    until those iterations take at least 100 milliseconds in total. This number of
    iterations is then used for the main phase of the testing.

    By the way, such calibration sort of “warms up the engine”: it populates the CPU
    caches and pushes the CPU core to its maximal frequency, so we can see the best
    performance.

    The 1st, 2nd, and 3rd runs then repeat this experiment three times. The test
    selects the median time of these three and treats it as the result of the
    performance testing. Presumably, taking the median stabilizes the performance
    result: it is more stable than a simple one-run test.
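
    Here is a sketch of this measurement scheme (an assumed illustration, not the
    exact code of these examples):

    #define _POSIX_C_SOURCE 199309L
    #include <time.h>

    static double now_ms(void)     /* monotonic wall-clock time in milliseconds */
    {
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return ts.tv_sec * 1e3 + ts.tv_nsec * 1e-6;
    }

    double bench_ms_per_call(void (*f)(void))
    {
        /* 0th run: calibrate - find an iteration count that takes >= 100 ms */
        long iters = 1;
        for (;;) {
            double t0 = now_ms();
            for (long i = 0; i < iters; ++i) f();
            if (now_ms() - t0 >= 100.0) break;
            iters *= 2;
        }
        /* 1st..3rd runs: measure milliseconds per call three times */
        double r[3];
        for (int run = 0; run < 3; ++run) {
            double t0 = now_ms();
            for (long i = 0; i < iters; ++i) f();
            r[run] = (now_ms() - t0) / iters;
        }
        /* report the median of the three results */
        double lo = r[0] < r[1] ? r[0] : r[1];
        double hi = r[0] < r[1] ? r[1] : r[0];
        return r[2] < lo ? lo : (r[2] > hi ? hi : r[2]);
    }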

    For even better stabilization, please use the -s command-line option. With
    this mode, the test additionally checks the difference between the best and
    the worst of the three results. If the difference exceeds 10%, the test repeats
    the whole experiment again and again until the difference fits under 10%.

    Conversely, with the -f option, the test omits the three-run testing, so it
    works much quicker. Please use this option to estimate performance roughly,
    or just to test accuracy.

    OpenBLAS

    Here I explain how to build OpenBLAS: compiling for an x86/Linux host system,
    or cross-compiling for RISC-V. For RISC-V, we compile specifically for the
    XuanTie C910 processor core, targeting the Lichee Pi 4A dev board.

    Note that we compile for single-core execution, that is, without threads.

    The idea is that the bigger algorithm can apply its own threading. E.g., to
    compute a neural convolution, we split a large tensor into many smaller parts
    and call many mini-gemms on them. So we can use CPU threads to run many small
    instances of mini-gemm in parallel.
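
    A minimal sketch of this idea (illustrative only; tile_t and the kernel pointer
    are hypothetical, not the interface of these examples):

    /* the outer algorithm threads over independent tiles, so a single-threaded
       OpenBLAS (or mini-gemm) kernel is enough */
    typedef struct { int M, N, K; const float *A, *B; float *C; } tile_t;

    void run_tiles(const tile_t *tiles, int ntiles,
                   void (*minigemm_tile)(const tile_t *))
    {
        #pragma omp parallel for schedule(dynamic)  /* needs -fopenmp */
        for (int t = 0; t < ntiles; ++t)
            minigemm_tile(&tiles[t]);
    }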

    To build OpenBLAS from sources, you first need to download them. The sources are
    freely available on GitHub, under the BSD 3-Clause License. Here is the link to
    the repository:

    • https://github.com/OpenMathLib/OpenBLAS

    The authors recommend the develop branch of this repo, as the master branch is
    very old. You may clone this repo as follows; the develop branch is the default.
    I would clone it via HTTPS, as I do not plan to push my code to this repo.
    Assuming you clone into your ~/code/OpenMathLib folder:

    mkdir -p ~/code/OpenMathLib  # create the parent folder if it does not exist
    cd ~/code/OpenMathLib
    git clone https://github.com/OpenMathLib/OpenBLAS.git
    cd OpenBLAS
    # git checkout develop  -- no need to checkout, develop is default
    

    Then, I based my build on the following instructions (links below). These texts
    are not complete, but they usefully complement each other, so we combine their
    advice:

    Below I show how I built the OpenBLAS binaries, assuming we install the compiled
    OpenBLAS binaries into the following two folders. Use these folders as the
    OPEN_BLAS_DIR option for CMake when building our mini-gemm:

    /home/${USER}/code/OpenMathLib/OpenBLAS/build/
        x86-ryzen3-nothreads
        riscv-c910v-nothreads
    

    For x86, here is how I compile for my laptop with the Ryzen 5500U processor:

    cd ~/code/OpenMathLib/OpenBLAS
    
    make -j6 DYNAMIC_ARCH=0 CC=gcc HOSTCC=gcc BINARY=64 INTERFACE=64 \
        NO_AFFINITY=1 NO_WARMUP=1 USE_OPENMP=0 USE_THREAD=0 USE_LOCKING=1 \
        NOFORTRAN=1
    
    mkdir -p build/x86-ryzen3-nothreads
    make install \
        PREFIX=/home/${USER}/code/OpenMathLib/OpenBLAS/build/x86-ryzen3-nothreads
    
    make clean  # all we need is saved into install folder
    

    For RISC-V, here is how I cross-compile for the Lichee Pi 4A. Do this on the host:

    cd ~/code/OpenMathLib/OpenBLAS
    
    make -j6 DYNAMIC_ARCH=0 TARGET=C910V \
        CC=/home/${USER}/Xuantie-900-gcc-linux-5.10.4-glibc-x86_64-V2.8.1/bin/riscv64-unknown-linux-gnu-gcc \
        HOSTCC=gcc BINARY=64 INTERFACE=64 NO_AFFINITY=1 NO_WARMUP=1 \
        USE_OPENMP=0 USE_THREAD=0 USE_LOCKING=1 NOFORTRAN=1
    
    mkdir -p build/riscv-c910v-nothreads
    make install \
        PREFIX=/home/${USER}/code/OpenMathLib/OpenBLAS/build/riscv-c910v-nothreads
    
    make clean  # all we need is saved into install folder
    

    You do not need to copy the OpenBLAS binaries to the Lichee Pi 4A board: these
    libraries are static, so they are linked into the binaries of our tests/examples.

    ARM

    DISCLAIMER: Please note that our mini-gemm code is not optimized for ARM (yet).

    Anyway, if you happen to want to try these mini-gemm examples on an ARM processor,
    you also need to build OpenBLAS for ARM, to compare our code against BLAS/sgemm.

    Here I explain cross-compiling on an x86 host. For that, you need to install the
    GCC tools for ARM. Hopefully, the path to GCC for ARM on your host is the same
    as I show below.

    Example of compiling OpenBLAS for an ARMv8 processor. Do this on the host:

    cd ~/code/OpenMathLib/OpenBLAS
    
    make -j6 DYNAMIC_ARCH=0 TARGET=ARMV8 \
        CC=/usr/bin/aarch64-linux-gnu-gcc \
        HOSTCC=gcc BINARY=64 INTERFACE=64 NO_AFFINITY=1 NO_WARMUP=1 \
        USE_OPENMP=0 USE_THREAD=0 USE_LOCKING=1 NOFORTRAN=1
    
    mkdir -p build/armv8-neon-nothreads
    make install \
        PREFIX=/home/${USER}/code/OpenMathLib/OpenBLAS/build/armv8-neon-nothreads
    
    make clean  # all we need is saved into install folder
    

    If you target SVE instead of NEON, the OpenBLAS docs propose the TARGET=ARMV8SVE
    parameter for make (although I have not tested that case).

    Example of building these mini-gemm examples. Note that here I use a toolchain file
    from OpenCV, assuming your ARM computer runs Linux. Do this on the host:

    cd ${RVDNN_EXAMPLES_MINIGEMM}
    
    mkdir -p build/arm-blas
    cd       build/arm-blas
    
    cmake ../.. -DCMAKE_BUILD_TYPE=Release \
        -DCMAKE_TOOLCHAIN_FILE=${OpenCV_DIR}/platforms/linux/aarch64-gnu.toolchain.cmake \
        -DWITH_OPEN_BLAS=ON \
        -DOPEN_BLAS_DIR=/home/${USER}/code/OpenMathLib/OpenBLAS/build/armv8-neon-nothreads
    
    make
    

    Upload the just-built bin/example_* binaries to your ARM computer and try them there.
