Copyright © 2022-2024, KNS Group LLC (YADRO)
Licensed under the Apache License, Version 2.0
http://www.apache.org/licenses/LICENSE-2.0
This code indirectly uses OpenBLAS: it can link with it to test the accuracy and
performance of matrix multiplication. OpenBLAS is distributed freely under the
BSD license. Here is the link to the OpenBLAS project:
- https://www.openblas.net/
This code follows the logic of the following articles, which explain well-known
approaches to optimizing matrix multiplication and to computing neural convolution
by reducing it to the BLAS/gemm function with the im2col/im2row transforms. Links:
- https://habr.com/ru/articles/359272/
- https://habr.com/ru/articles/448436/
Overview
This repo contains examples that illustrate computing neural convolution by reducing
it to a function which we call “mini-gemm”. “Gemm” refers to the well-known BLAS/gemm
function for GEneral Matrix Multiplication (hence GEMM). And “mini-” is just the prefix
that indicates our specific variant of a GEMM-like interface. Besides, we target
small matrices, like 100 x 100 (or maybe 1000 x 10) elements (hence mini-matrices).
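
For reference, the simplest form of such a GEMM-like operation is the triple loop
below. This is only a sketch in C to fix the semantics (C += A * B with explicit
row strides); the actual mini-gemm interface in this repo may differ:

/* Naive GEMM-like kernel: C (MxN) += A (MxK) * B (KxN), row-major,
 * with leading strides lda/ldb/ldc. A sketch of the interface idea,
 * not the repo's actual mini-gemm signature. */
void minigemm_naive(int M, int N, int K,
                    const float *A, int lda,
                    const float *B, int ldb,
                    float *C, int ldc) {
    for (int i = 0; i < M; ++i)
        for (int k = 0; k < K; ++k) {
            float a = A[i * lda + k];   /* reuse one A element across row j */
            for (int j = 0; j < N; ++j)
                C[i * ldc + j] += a * B[k * ldb + j];
        }
}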
These examples illustrate optimizing the mini-gemm function for a central processing
unit (CPU), specifically one of the RISC-V architecture. Even more specifically: so far
it is only tested on the Lichee Pi 4A mini-computer (a development platform, actually)
with the TH1520 processor by the T-Head company, which belongs to the Alibaba
group from China. For reference: this CPU is built of 4 XuanTie C910 cores.
These examples are also benchmarks (performance tests). Each test calls the
tested function many times with different parameters and tracks the time, in
milliseconds per call. This allows estimating the effectiveness of the given
optimizations, and maybe trying your own variant. In particular, we can convert
the measured milliseconds into giga-flop/s and compare against the theoretical
peak performance of the target CPU, like the Ryzen 5500U if on x86, or the
XuanTie C910 if we test on RISC-V.
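
For example, for an M x N x K matrix multiplication, the conversion could look like
the sketch below (2*M*N*K is the conventional flop count for GEMM: one multiply and
one add per term; the function name is just an illustration, not the repo's code):

/* Convert measured milliseconds per call into giga-flop/s for an
 * M x N x K matrix multiplication. A sketch, not the benchmark code. */
double gemm_gigaflops(int M, int N, int K, double ms_per_call) {
    double flops = 2.0 * (double)M * N * K;    /* multiply + add per term */
    return flops / (ms_per_call * 1e-3) / 1e9; /* ms -> s, flop/s -> Gflop/s */
}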
These tests can also compare the performance of our optimized mini-gemm
function against a similar call to the BLAS/gemm function, specifically the OpenBLAS
implementation. They can also compare the performance of convolution-via-mini-gemm
against the “standard” way to compute neural convolution via BLAS/gemm: reducing
it with the well-known im2col/im2row transforms. Or, if you choose to build these
examples without linking to OpenBLAS, the tests compare against naive code,
like a direct calculation of convolution or a simple triple-loop GEMM.
Below are more details about these examples, including instructions
for building and running them on both x86 and RISC-V processors:
Examples: minigemm & convolution
This directory contains four examples/benchmarks:
- example_minigemm
- example_convolution
- example_optimized_minigemm
- example_optimized_convolution
Two of these examples performance-test the “mini-gemm” function. The other two
test neural-convolution-via-mini-gemm, i.e. reducing the operation of neural
convolution to a series of calls to the mini-gemm function.
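
For illustration, below is one possible shape of this reduction for the simplest
case (stride 1, no padding, layouts in[IC][H][W] and out[OC][OH][OW]); it reuses
the minigemm_naive sketch from the Overview. The per-offset weight packing
wpack[kh][kw][oc][ic] is a hypothetical, one-time, weights-only rearrangement;
the input itself is read in place and never copied:

/* Convolution as a series of mini-gemm calls: one gemm per kernel offset
 * (kh, kw) and output row oh, doing out(oc, ow) += W(oc, ic) * in(ic, ow).
 * The input "matrix" is addressed in place with row stride H*W. Sketch only. */
void conv_via_minigemm(int OC, int IC, int H, int W, int KH, int KW,
                       const float *wpack, const float *in, float *out) {
    int OH = H - KH + 1, OW = W - KW + 1;
    for (int i = 0; i < OC * OH * OW; ++i) out[i] = 0.0f;
    for (int kh = 0; kh < KH; ++kh)
        for (int kw = 0; kw < KW; ++kw)
            for (int oh = 0; oh < OH; ++oh)
                minigemm_naive(OC, OW, IC,
                               wpack + (kh * KW + kw) * OC * IC, IC,
                               in + (oh + kh) * W + kw, H * W,
                               out + oh * OW, OH * OW);
}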
The first two, non-optimized, examples are probably not very interesting: they present
the concept and show how the approach works at all on RISC-V computers. The other two
examples test performance with some hand-made optimizations of mini-gemm.
This allows out-performing convolution computed with the well-known im2col/im2row
methods, which reduce convolution to the GEMM function from OpenBLAS.
How can our simple variant of mini-gemm outperform the well-optimized OpenBLAS?
Mini-gemm itself does not, but convolution-via-mini-gemm sometimes does, given a
lucky use case (parameters of convolution). Unlike im2col/im2row, reducing convolution
to mini-gemm never requires duplicating or copying data, so it better utilizes the
processor’s L1/L2 caches (note that the Lichee Pi 4A has no L3 cache).
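
To illustrate the duplication: im2col unrolls every KH x KW receptive field of the
input into a column of a temporary matrix, so each input pixel gets copied up to
KH*KW times. A simplified sketch (single channel, stride 1, no padding; the helper
name is hypothetical):

/* Simplified im2col: input H x W, kernel KH x KW, output matrix of size
 * (KH*KW) x (OH*OW), where OH = H-KH+1, OW = W-KW+1. Each input pixel is
 * copied up to KH*KW times: exactly the duplication that
 * convolution-via-mini-gemm avoids. */
void im2col_simple(const float *in, int H, int W,
                   int KH, int KW, float *col) {
    int OH = H - KH + 1, OW = W - KW + 1;
    for (int kh = 0; kh < KH; ++kh)
        for (int kw = 0; kw < KW; ++kw)
            for (int oh = 0; oh < OH; ++oh)
                for (int ow = 0; ow < OW; ++ow)
                    /* row = kernel offset, column = output pixel */
                    col[((kh * KW + kw) * OH + oh) * OW + ow] =
                        in[(oh + kh) * W + (ow + kw)];
}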
The first two, non-optimized, examples try our concept with various data types:
- FP32 x FP32 -> FP32
- Int32 x Int32 -> Int32
- 8-bit x 8-bit -> Int32:
  - I8 x I8 -> I32
  - U8 x U8 -> I32
  - I8 x U8 -> I32
  - U8 x I8 -> I32
- FP16 x FP16 -> FP16
- FP16 x FP16 -> FP32
The optimized examples mostly concentrate on the pure FP32 case. It performs quite well
on the target RISC-V processor (we tested on the Lichee Pi 4A), as the processor can do
two vector vfma (s += ax) operations per CPU tick. For FP16, the performance peak
would be twice as high as for FP32, but FP16 is less usable due to the accuracy loss
at such reduced precision. For reference, here is a theoretical estimate of the peak
CPU performance for the s += ax operation on one core of the Lichee Pi 4A processor
running at 1.848 GHz with 128-bit SIMD:
- 59.136 giga-flop/s for FP16 += FP16 x FP16
- 29.568 giga-flop/s for FP32 += FP32 x FP32
- 14.784 giga-flop/s for Int32 += Int8 x Int8
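For example, the FP32 figure presumably follows as 1.848 GHz x 2 vfma per tick x
4 FP32 lanes per 128-bit vector x 2 flops per fused multiply-add = 29.568
giga-flop/s; FP16 packs 8 lanes per vector, which doubles the estimate.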
If interested, you can find more details about the Lichee Pi 4A on the Web, e.g.:
- https://wiki.sipeed.com/hardware/en/lichee/th1520/lp4a.html
To check the correctness and performance of mini-gemm, these examples use the
OpenBLAS/gemm function as the reference. Or, if you choose to build these examples
without OpenBLAS, they use a hand-made, non-optimized GEneral Matrix Multiplication.
To check the correctness and performance of convolution-via-mini-gemm, these examples
compare it against the well-known method of reducing convolution to BLAS/gemm
with the im2col/im2row transforms. Or, if you choose to build these examples without
OpenBLAS, they compare against the naive direct method of convolution.
For data types other than FP32, we can still use BLAS/gemm: copy the data into a
temporary buffer while converting it to FP32, and then convert the results back
to the target type. Of course, this substantially damages the observed performance
of the “via BLAS/gemm” calculation, so the performance comparison is not very honest
for data types other than FP32.
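
For example, an I8 x I8 -> I32 case could be routed through the FP32 cblas_sgemm
roughly as follows (a sketch under assumptions: row-major data, results small enough
to convert back exactly; the wrapper name is hypothetical):

/* Route I8 x I8 -> I32 through FP32 BLAS/gemm: convert the inputs to
 * FP32 buffers, call cblas_sgemm, convert the results back. The extra
 * copies and conversions are what makes this comparison "not very
 * honest" for non-FP32 types. */
#include <cblas.h>
#include <stdlib.h>

void gemm_i8_via_sgemm(int M, int N, int K,
                       const signed char *A, const signed char *B, int *C) {
    float *a = malloc(sizeof(float) * M * K);
    float *b = malloc(sizeof(float) * K * N);
    float *c = malloc(sizeof(float) * M * N);
    for (int i = 0; i < M * K; ++i) a[i] = (float)A[i];
    for (int i = 0; i < K * N; ++i) b[i] = (float)B[i];
    cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                M, N, K, 1.0f, a, K, b, N, 0.0f, c, N); /* beta=0: C not read */
    for (int i = 0; i < M * N; ++i) C[i] = (int)c[i];
    free(a); free(b); free(c);
}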
Prepare
Here I describe how to build these examples locally on your x86/Linux computer,
including the case where your copy of Linux is installed under Windows on WSL 2
(for example, WSL/Ubuntu). These instructions also describe cross-compiling these
examples for your Lichee Pi 4A device using your x86/Linux machine as the host.
You may build these examples with or without linking them to OpenBLAS, at your
discretion. If you prefer comparing the performance of mini-gemm against BLAS/gemm
(which I would recommend), you need to build OpenBLAS itself beforehand. Below
you can find instructions on how to build OpenBLAS. Here, let’s assume it is already
built, and its binaries are found under some local folder on your host computer.
Cross-compiling for RISC-V has so far been tested only for the Lichee Pi 4A, the
development board with the TH1520 processor with 4 XuanTie C910 cores. To compile
for this target system, you need to download the XuanTie development kit;
specifically, you need version 2.8.1 of the XuanTie toolkit. You can find this
toolkit in binary form at the following Web site:
- https://xrvm.cn/community/download
The direct download link for toolkit version 2.8.1 can be found on that page.
To cross-compile, please download and unpack this toolkit on your host, under
Linux, or under WSL/Linux if you work under Windows. Please unpack the
toolkit into your home folder, and set up the XUANTIE_TOOLCHAIN_ROOT
environment variable, like this:
export XUANTIE_TOOLCHAIN_ROOT=/home/${USER}/Xuantie-900-gcc-linux-5.10.4-glibc-x86_64-V2.8.1/
For executing our tests/examples on your Lichee Pi 4A board, you also need a copy
of the toolkit in the /home/sipeed folder on the board, assuming you log in to the
board as the default user sipeed.

Or you may copy only the sysroot sub-folder of the toolkit, as the compiled tests
need only this sub-folder. If you choose this option, I’d recommend archiving the
sysroot folder before copying it to the board, as it has many symbolic links which
would be duplicated into many lib files if you copied simply with scp -r.

Below is an example of copying the sysroot folder to the board. Let the LICHEE_PI
variable equal the IP address or the unique network name of the Lichee Pi, by which
your host can ping it. Here I assume you log in to the Lichee Pi board as the
default user named sipeed. Note that the default password pre-defined for the
sipeed user on the board is licheepi:
On host:
cd ~/Xuantie-900-gcc-linux-5.10.4-glibc-x86_64-V2.8.1
tar cf sysroot.tar sysroot/* # archive with sym-links
gzip -1 sysroot.tar # compress into *.tar.gz
scp sysroot.tar.gz sipeed@${LICHEE_PI}:/home/sipeed
rm sysroot.tar.gz # no need to keep it on host
Login to board:
ssh sipeed@${LICHEE_PI}
On board:
cd /home/sipeed # actually, must be already here at login
mkdir Xuantie-900-gcc-linux-5.10.4-glibc-x86_64-V2.8.1
cd Xuantie-900-gcc-linux-5.10.4-glibc-x86_64-V2.8.1
tar zxf ../sysroot.tar.gz # unpack, restore sym-links
rm ../sysroot.tar.gz # no need to keep *.tar.gz
Actually, you do not need the 32-bit sub-folders (sysroot/lib32...) on the dev board,
so you may delete these 32-bit libs from the Lichee Pi copy of sysroot if you want.
Build
We use CMake to build these examples. CMake options specific to this project:
Option | Type | Comments |
---|---|---|
RISCV_ENABLE_RVV | bool | default is OFF |
RISCV_ENABLE_FP16 | bool | default is OFF |
SMART_CHOICE_OPTL | bool | default is OFF |
OPTIMIZE_INTEGER | bool | default is OFF |
WITH_OPEN_BLAS | bool | default is OFF |
OPEN_BLAS_DIR | path | default is empty |
Below is an example of building for x86, without linking to OpenBLAS. Here we assume
that ${RVDNN_EXAMPLES_MINIGEMM} is the folder on your host where you’ve placed the
local clone of the minigemm repository:
cd ${RVDNN_EXAMPLES_MINIGEMM}
mkdir -p build/x86
cd build/x86
cmake -S ../.. -B . -DCMAKE_BUILD_TYPE=Release
make
This should build the binaries of our examples and put them under the bin folder,
e.g.:
cd ${RVDNN_EXAMPLES_MINIGEMM}
ls -al bin
-rwxr-xr-x 209920 example_convolution
-rwxr-xr-x 160016 example_minigemm
-rwxr-xr-x 210064 example_optimized_convolution
-rwxr-xr-x 169128 example_optimized_minigemm
Please see the section Test below for instructions on how to run them.
Linking these examples to OpenBLAS lets us compare the performance of our
mini-gemm function against the similar BLAS/gemm function, which is
well-optimized in the OpenBLAS package.
Here is an example of building for x86, with linking to OpenBLAS. Assume we have
already built OpenBLAS, and its binaries are found in a local folder on our host,
e.g.:
cd ${RVDNN_EXAMPLES_MINIGEMM}
mkdir -p build/x86-blas
cd build/x86-blas
cmake -S ../.. -B . -DCMAKE_BUILD_TYPE=Release -DWITH_OPEN_BLAS=ON \
-DOPEN_BLAS_DIR=/home/${USER}/code/OpenMathLib/OpenBLAS/build/x86-ryzen3-nothreads/
make
To build for the Lichee Pi 4A, we cross-compile on an x86 host computer.
The section Prepare above explains how to download the XuanTie compiler
and install it on your host system.
Here is an example of cross-compiling for RISC-V, without linking to OpenBLAS.
Do this on your host computer, under Linux or under WSL/Linux:
cd ${RVDNN_EXAMPLES_MINIGEMM}
mkdir -p build/riscv
cd build/riscv
cmake -S ../.. -B . -DCMAKE_BUILD_TYPE=Release \
-DCMAKE_TOOLCHAIN_FILE=../../cmake/xuantie.toolchain.cmake \
-DRISCV_ENABLE_RVV=ON -DRISCV_ENABLE_FP16=ON
make
To test performance against the well-optimized OpenBLAS, link with its
binaries, which you need to build beforehand on your host computer:
cd ${RVDNN_EXAMPLES_MINIGEMM}
mkdir -p build/riscv-blas
cd build/riscv-blas
cmake -S ../.. -B . -DCMAKE_BUILD_TYPE=Release \
-DCMAKE_TOOLCHAIN_FILE=../../cmake/xuantie.toolchain.cmake \
-DRISCV_ENABLE_RVV=ON -DRISCV_ENABLE_FP16=ON -DWITH_OPEN_BLAS=ON \
-DOPEN_BLAS_DIR=/home/${USER}/code/OpenMathLib/OpenBLAS/build/riscv-c910v-nothreads
make
Test
To run the examples on x86, just start one of them, e.g.:
cd ${RVDNN_EXAMPLES_MINIGEMM}
cd build/x86-blas
time bin/example_optimized_convolution ${OPTION} ; echo status=$?
Such testing may last 10 to 30 minutes. It prints a log of its progress. At the
end, it prints a summary table with the results of the performance testing.
To run the cross-compiled examples on your Lichee Pi 4A board, you need to upload
the binaries to the board. The following example assumes that the Lichee Pi 4A
appears on your local network, and that ${LICHEE_PI} equals its IP address or its
name. Here I assume you log in to the Lichee Pi as the default user named sipeed.
The default password for the sipeed user on the board is licheepi. Assume you copy
the binaries to the /home/sipeed/temp folder. Please create this folder manually
in advance:
On host:
cd ${RVDNN_EXAMPLES_MINIGEMM}
cd build/riscv-blas
scp bin/example_* sipeed@${LICHEE_PI}:temp
On Lichee Pi board:
cd ~/temp
time ./example_optimized_convolution ${OPTION} ; echo status=$?
The command-line OPTION may be empty, or one of these:
Option | Meaning |
---|---|
-s | stabilize performance results |
-f | fast testing mode (less stable) |
-v | verbose: for more detailed log |
To explain these options, a few words about our tests/examples.
We execute the tested function 4 times:
- 0th run: calibrate and “warm up the engine”
- 1st, 2nd, 3rd runs: attempt the test three times
Calibration tells the test how many times to iterate the tested function.
Note that the timer we use is accurate only if we measure a large enough interval,
like 100 milliseconds or longer. The calibration phase iterates the tested function
until these iterations take at least 100 milliseconds overall. This number of
iterations is then used in the next, main phase of the testing. By the way, such
calibration sort of “warms up the engine”: it adjusts the CPU caches and pushes
the CPU core to its maximal frequency, so we can see the best performance.
Then the 1st, 2nd, and 3rd runs repeat this experiment three times.
The test selects the median time of these three and treats it as the result of
the performance testing. Presumably, taking the median stabilizes the performance
result, making it more stable than simple one-run testing.
For even better stabilization, please use the -s command-line option. In this
mode, the test additionally checks the difference between the best and the worst
of the three results. If the difference exceeds 10%, the test repeats the whole
experiment again and again until the difference fits under 10%.
Conversely, with the -f option, the test omits the three-run testing, so it works
much quicker. Please use this option to estimate performance roughly, or just to
test accuracy.
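
Here is a minimal sketch in C of the measurement scheme described above. The
function names and the exact calibration strategy (doubling the iteration count)
are my assumptions, not the repo's actual code; the -f mode, which skips the
three timed runs, is not shown:

/* Calibrate the iteration count, then take the median of three timed
 * runs; with stabilize != 0 (the -s mode), repeat until the best and
 * the worst of the three differ by no more than 10%. */
#include <time.h>

static double now_ms(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec * 1e3 + ts.tv_nsec * 1e-6;
}

/* One run: call f() `iters` times, return milliseconds per call. */
static double run_ms(void (*f)(void), long iters) {
    double t0 = now_ms();
    for (long i = 0; i < iters; ++i) f();
    return (now_ms() - t0) / iters;
}

double benchmark_ms(void (*f)(void), int stabilize) {
    /* Calibration: also warms the caches and raises the CPU frequency. */
    long iters = 1;
    while (run_ms(f, iters) * iters < 100.0) iters *= 2;
    for (;;) {
        double t[3];
        for (int i = 0; i < 3; ++i) t[i] = run_ms(f, iters);
        for (int i = 0; i < 2; ++i)       /* sort: t[0] best, t[2] worst */
            for (int j = i + 1; j < 3; ++j)
                if (t[j] < t[i]) { double s = t[i]; t[i] = t[j]; t[j] = s; }
        if (!stabilize || t[2] - t[0] <= 0.1 * t[0])
            return t[1];                  /* the median is the result */
    }
}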
OpenBLAS
Here I explain how to build OpenBLAS: compiling for an x86/Linux host system, or
cross-compiling for RISC-V. For RISC-V, we compile specifically for the XuanTie
C910 processor core, targeting the Lichee Pi 4A dev board.
Note that we compile for single-core execution, that is, without threads. The idea
is that the bigger algorithm can apply its own threading. E.g., to compute a neural
convolution, we split the large tensor into many smaller parts and call mini-gemm
for each of them. So we could use CPU threads to run many small instances of
mini-gemm in parallel.
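
For example, with OpenMP the outer split could look roughly like this (compile
with -fopenmp); conv_block_via_minigemm and conv_ctx are hypothetical placeholders
for whatever per-block computation the real code does:

/* Split the convolution output into row-blocks and compute each block
 * via the single-threaded mini-gemm path, running the blocks in parallel. */
struct conv_ctx;   /* tensors, strides, etc. (hypothetical placeholder) */
void conv_block_via_minigemm(struct conv_ctx *ctx, int row0, int rows);

void conv_parallel(struct conv_ctx *ctx, int out_rows, int block) {
    #pragma omp parallel for schedule(dynamic)
    for (int r = 0; r < out_rows; r += block) {
        int rows = (r + block <= out_rows) ? block : out_rows - r;
        conv_block_via_minigemm(ctx, r, rows);
    }
}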
To build OpenBLAS from sources, you first need to download the sources. They are
freely available at GitHub, under the BSD 3-Clause License. Here is the link to
the repository:
- https://github.com/OpenMathLib/OpenBLAS
The authors recommend the develop branch of this repo, as the master branch is
very old. You may clone this repo as follows; the develop branch is the default.
I would clone it via HTTPS, as I do not plan to push my code to this repo.
Assuming you clone into your ~/code folder:
cd ~/code/OpenMathLib
git clone https://github.com/OpenMathLib/OpenBLAS.git
cd OpenBLAS
# git checkout develop -- no need to checkout, develop is default
Then, I based my build on the following instructions (links below). These texts
are not complete, but they usefully complement each other, so we combine
their advice:
- OpenMathLib/OpenBLAS/README.md # Installation from Source
- https://github.com/bgeneto/build-install-compile-openblas
Below I show how I built the OpenBLAS binaries. Assume we install the
compiled OpenBLAS binaries into the following two folders. Use these
folders as the OPEN_BLAS_DIR option for CMake when building our mini-gemm:
/home/${USER}/code/OpenMathLib/OpenBLAS/build/
  x86-ryzen3-nothreads
  riscv-c910v-nothreads
For x86, here is how I compiled for my laptop with a Ryzen 5500U processor:
cd ~/code/OpenMathLib/OpenBLAS
make -j6 DYNAMIC_ARCH=0 CC=gcc HOSTCC=gcc BINARY=64 INTERFACE=64 \
NO_AFFINITY=1 NO_WARMUP=1 USE_OPENMP=0 USE_THREAD=0 USE_LOCKING=1 \
NOFORTRAN=1
mkdir -p build/x86-ryzen3-nothreads
make install \
PREFIX=/home/${USER}/code/OpenMathLib/OpenBLAS/build/x86-ryzen3-nothreads
make clean # all we need is saved into install folder
For RISC-V, here is how I cross-compiled for the Lichee Pi 4A. Do this on the host:
cd ~/code/OpenMathLib/OpenBLAS
make -j6 DYNAMIC_ARCH=0 TARGET=C910V \
CC=/home/${USER}/Xuantie-900-gcc-linux-5.10.4-glibc-x86_64-V2.8.1/bin/riscv64-unknown-linux-gnu-gcc \
HOSTCC=gcc BINARY=64 INTERFACE=64 NO_AFFINITY=1 NO_WARMUP=1 \
USE_OPENMP=0 USE_THREAD=0 USE_LOCKING=1 NOFORTRAN=1
mkdir -p build/riscv-c910v-nothreads
make install \
PREFIX=/home/${USER}/code/OpenMathLib/OpenBLAS/build/riscv-c910v-nothreads
make clean # all we need is saved into install folder
You do not need to copy the OpenBLAS binaries to the Lichee Pi 4A board: these
libraries are static, so they link into the binaries of our tests/examples.
ARM
DISCLAIMER: Please note that our mini-gemm code is not optimized for ARM (yet).
Anyway, if you want to try these mini-gemm examples on an ARM processor, you also
need to build OpenBLAS for ARM, to compare our code against BLAS/sgemm.
Here I explain cross-compiling on an x86 host. For that, you need to install the
GCC tools for ARM. Hopefully, the path to GCC for ARM is the same on your host
as shown below.
Here is an example of compiling OpenBLAS for an ARMv8 processor. Do this on the host:
cd ~/code/OpenMathLib/OpenBLAS
make -j6 DYNAMIC_ARCH=0 TARGET=ARMV8 \
CC=/usr/bin/aarch64-linux-gnu-gcc \
HOSTCC=gcc BINARY=64 INTERFACE=64 NO_AFFINITY=1 NO_WARMUP=1 \
USE_OPENMP=0 USE_THREAD=0 USE_LOCKING=1 NOFORTRAN=1
mkdir -p build/armv8-neon-nothreads
make install \
PREFIX=/home/${USER}/code/OpenMathLib/OpenBLAS/build/armv8-neon-nothreads
make clean # all we need is saved into install folder
If you target SVE instead of NEON, the OpenBLAS docs propose the TARGET=ARMV8SVE
parameter for make (although I have not tested that case).
Here is an example of building these mini-gemm examples. Note that I use a toolchain
file from OpenCV here, assuming your ARM computer runs Linux. Do this on the host:
cd ${RVDNN_EXAMPLES_MINIGEMM}
mkdir -p build/arm-blas
cd build/arm-blas
cmake ../.. -DCMAKE_BUILD_TYPE=Release \
-DCMAKE_TOOLCHAIN_FILE=${OpenCV_DIR}/platforms/linux/aarch64-gnu.toolchain.cmake \
-DWITH_OPEN_BLAS=ON \
-DOPEN_BLAS_DIR=/home/${USER}/code/OpenMathLib/OpenBLAS/build/armv8-neon-nothreads
make
Upload the just-built bin/example_* binaries to your ARM computer and try them there.