Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

NEPooling3dLayer performance issue #1107

Open
alvoron opened this issue May 13, 2024 · 9 comments
Open

NEPooling3dLayer performance issue #1107

alvoron opened this issue May 13, 2024 · 9 comments

Comments

@alvoron
Copy link

alvoron commented May 13, 2024

Output of 'strings libarm_compute.so | grep arm_compute_version':
arm_compute_version=v24.02.1 Build options: {'neon': '1', 'opencl': '0', 'openmp': '0', 'cppthreads': '1', 'arch': 'armv8.6-a', 'Werror': 'false', 'validation_tests': '1', 'os': 'macos', 'build': 'native', 'fixed_format_kernels': '1'} Git hash=b'f2eda6665c12d568e179f5b0e7a24ccdc0ac824d'

Platform:
Apple M2 Pro

Operating System:
macOS 13.4

Problem description:
NEPooling3dLayer provides twice much latency rather than reference C++ pooling implementation: 6.5 ms vs 3.5 ms.

Reproducer

#include "arm_compute/core/Types.h"
#include "arm_compute/runtime/NEON/NEFunctions.h"
#include "utils/Utils.h"
#include "tests/SimpleTensor.h"
#include "arm_compute/runtime/Tensor.h"
#include "utils/TypePrinter.h"

#include "tests/Utils.h"
#include "tests/AssetsLibrary.h"
#include "tests/NEON/Accessor.h"

#include <string>
#include <chrono>

 using namespace std;
 using namespace arm_compute;
 using namespace arm_compute::test;
 
 int main()
 {
  DataLayout dataLayout = DataLayout::NDHWC;
  TensorShape inTensorShape = TensorShape(192, 28, 28, 40, 1);
  TensorShape outTensorShape = inTensorShape;

  Tensor inputt;
  Tensor outputt;
  inputt.allocator()->init(TensorInfo(inTensorShape, 1, DataType::F32, dataLayout));
  outputt.allocator()->init(TensorInfo(outTensorShape, 1, DataType::F32, dataLayout));

  Pooling3dLayerInfo pool3d_info;
  pool3d_info.pool_type       = PoolingType::MAX;
  pool3d_info.exclude_padding = true;
  pool3d_info.pool_size       = arm_compute::Size3D(3, 3, 3);
  pool3d_info.stride          = arm_compute::Size3D(1, 1, 1);
  pool3d_info.padding         = arm_compute::Padding3D(1, 1, 1, 1, 1, 1);
  pool3d_info.round_type      = DimensionRoundingType::FLOOR;

  NEPooling3dLayer pooling;
  pooling.configure(&inputt, &outputt, pool3d_info);
 
  inputt.allocator()->allocate();
  outputt.allocator()->allocate();

  AssetsLibrary library(".", std::random_device()());
  std::uniform_real_distribution<> distribution{ 0.0f, 10.0f };
  library.fill(Accessor(inputt), distribution, 0);

//warm-up
  pooling.run();

std::chrono::high_resolution_clock::time_point start = std::chrono::high_resolution_clock::now();
  for (int i = 0; i < 100; i++) pooling.run();
std::chrono::high_resolution_clock::time_point finish = std::chrono::high_resolution_clock::now();
uint64_t total_duration = std::chrono::duration_cast<std::chrono::microseconds>(finish - start).count();
std::cout << "time: " << total_duration / 100 << std::endl;
}

How reproducer was built

clang++ -O2 -g -I./ComputeLibrary -I./ComputeLibrary/include acl_pooling.cpp -o acl_pooling -L./ComputeLibrary/build/ -L./ComputeLibrary/build/tests/ -L./ComputeLibrary/build/tests/framework/ -larm_compute -lAssetsLibrary.o -lRawTensor.o -lExceptions.o -std=c++17

The reproducer gives ~6500 microseconds on my M2 Pro, which is twice slower than reference C++ implementation of Pooling.

Could you please review potential performance issues in NEPooling3dLayer ?

@alvoron
Copy link
Author

alvoron commented May 23, 2024

I prepared a benchdnn reference reproducer and checked it on Ampere server.

Benchdnn

cmake -B build -DCMAKE_BUILD_TYPE=Release -DCMAKE_RULE_MESSAGES=OFF -DONEDNN_CPU_RUNTIME=OMP
cmake --build build --target benchdnn --parallel $(nproc)
./build/tests/benchdnn/benchdnn --mode=P --pool --reset --allow-enum-tags-only=0 --engine=cpu --dir=FWD_I --alg=pooling_max --dt=f32:f32 --tag=acdeb  --attr-scratchpad=user mb1ic192_id40od40kd3sd1dd0pd1_ih28oh28kh3sh1dh0ph1_iw28ow28kw3sw1dw0pw1

The last benchdnn command gives me min(ms):0.673584 avg(ms):0.787748 on Ampere.

ACL

scons neon=1 opencl=0 openmp=1 os=linux data_layout_support=all arch=arm64-v8.2-a build=native --jobs=64 build=native --silent fixed_format_kernels=True validation_tests=1
g++ -O2 -g -I./ComputeLibrary -I./ComputeLibrary/include acl_pooling.cpp -o acl_pooling -L./ComputeLibrary/build/ -L./ComputeLibrary/build/tests/ -L./ComputeLibrary/build/tests/framework/ -larm_compute ./ComputeLibrary/build/tests/AssetsLibrary.o ./ComputeLibrary/build/tests/RawTensor.o ./ComputeLibrary/build/tests/framework/Exceptions.o -std=c++17
 LD_LIBRARY_PATH=ComputeLibrary/build ./acl_pooling

The last command gives me 2086 on Ampere.

@morgolock
Copy link

Hi @alvoron

Could you please try rebuilding the library with openmp=1 cppthreads=0 ?

Hope this helps

@alvoron
Copy link
Author

alvoron commented Jun 3, 2024

I rebuilt ACL:

arm_compute_version=v24.04 Build options: {'neon': '1', 'opencl': '0', 'openmp': '1', 'cppthreads': '0', 'os': 'linux', 'data_layout_support': 'all', 'arch': 'arm64-v8.2-a', 'build': 'native', 'fixed_format_kernels': 'True'} Git hash=b'4fda7a803eaadf00ba36bd532481a33c18952089'

and got 2072 on Ampere, so the issue still remains.

P.S.
Also I wasn't able to build ACL with validation_tests=1 and openmp=1 because of undefined reference issue:

/usr/bin/ld: build/tests/validation/UNIT/CPPScheduler.o: in function `UNITSuite::CPPSchedulerSuite::RethrowException::do_run()':
CPPScheduler.cpp:(.text+0xd0): undefined reference to `arm_compute::CPPScheduler::CPPScheduler()'
/usr/bin/ld: CPPScheduler.cpp:(.text+0x150): undefined reference to `arm_compute::CPPScheduler::set_num_threads(unsigned int)'
/usr/bin/ld: CPPScheduler.cpp:(.text+0x160): undefined reference to `arm_compute::CPPScheduler::schedule(arm_compute::ICPPKernel*, arm_compute::IScheduler::Hints const&)'
/usr/bin/ld: CPPScheduler.cpp:(.text+0x4a4): undefined reference to `arm_compute::CPPScheduler::~CPPScheduler()'
/usr/bin/ld: CPPScheduler.cpp:(.text+0x59c): undefined reference to `arm_compute::CPPScheduler::~CPPScheduler()'
/usr/bin/ld: CPPScheduler.cpp:(.text+0x684): undefined reference to `arm_compute::CPPScheduler::~CPPScheduler()'

That's why I set validation_tests=0 and deleted inputt filling logic from the reproducer. I believe it shouldn't affect the performance.

@morgolock
Copy link

Hi @alvoron

The reproducer gives ~6500 microseconds on my M2 Pro, which is twice slower than reference C++ implementation of Pooling.

Can you please point us to the actual reference implementation you're using? How do you make the measurements for both backends reference and ACL? Is it a single binary you're using?

@morgolock
Copy link

Hi @alvoron

I made some changes to our validation suite to assess the performance, see the results below, neon backend is much faster than our reference code.

ComputeLibrary % ./build/tests/arm_compute_validation "--filter=.*Pooling3d.*" --mode=NIGHTLY --threads=4
...
Running [337] 'NEON/Pooling3dLayer/Float/FP32/RunLarge@Shape=30,40,30,32,3:PoolType=MAX:PoolingSize=2x2x2:Stride=2x1x1:Padding=0,0,0,0,0,0:ExcludePadding=0:DataType=F32'
neon time: 873
reference time: 50789
  Wall clock/Wall clock time:    AVG=32352.0000 us
Running [338] 'NEON/Pooling3dLayer/Float/FP32/RunLarge@Shape=30,40,30,32,3:PoolType=MAX:PoolingSize=2x2x2:Stride=2x1x1:Padding=1,1,1,1,1,1:ExcludePadding=1:DataType=F32'
neon time: 1006
reference time: 56723
  Wall clock/Wall clock time:    AVG=38709.0000 us
Running [339] 'NEON/Pooling3dLayer/Float/FP32/RunLarge@Shape=30,40,30,32,3:PoolType=MAX:PoolingSize=2x2x2:Stride=2x1x1:Padding=1,1,1,1,1,1:ExcludePadding=0:DataType=F32'
neon time: 1049
reference time: 56795
  Wall clock/Wall clock time:    AVG=38914.0000 us
Running [340] 'NEON/Pooling3dLayer/Float/FP32/RunLarge@Shape=30,40,30,32,3:PoolType=MAX:PoolingSize=2x2x2:Stride=2x1x1:Padding=1,1,0,0,0,0:ExcludePadding=1:DataType=F32'
neon time: 918
reference time: 51994
  Wall clock/Wall clock time:    AVG=34195.0000 us
Running [341] 'NEON/Pooling3dLayer/Float/FP32/RunLarge@Shape=30,40,30,32,3:PoolType=MAX:PoolingSize=2x2x2:Stride=2x1x1:Padding=1,1,0,0,0,0:ExcludePadding=0:DataType=F32'
neon time: 934
reference time: 51818
  Wall clock/Wall clock time:    AVG=34168.0000 us
Running [342] 'NEON/Pooling3dLayer/Float/FP32/RunLarge@Shape=30,40,30,32,3:PoolType=MAX:PoolingSize=3x3x3:Stride=2x2x2:Padding=0,0,0,0,0,0:ExcludePadding=1:DataType=F32'
neon time: 661
reference time: 21681
  Wall clock/Wall clock time:    AVG=7178.0000 us
Running [343] 'NEON/Pooling3dLayer/Float/FP32/RunLarge@Shape=30,40,30,32,3:PoolType=MAX:PoolingSize=3x3x3:Stride=2x2x2:Padding=0,0,0,0,0,0:ExcludePadding=0:DataType=F32'
neon time: 662
reference time: 21722
  Wall clock/Wall clock time:    AVG=7316.0000 us
Running [344] 'NEON/Pooling3dLayer/Float/FP32/RunLarge@Shape=30,40,30,32,3:PoolType=MAX:PoolingSize=3x3x3:Stride=2x2x2:Padding=1,1,1,1,1,1:ExcludePadding=1:DataType=F32'
neon time: 733
reference time: 25640
  Wall clock/Wall clock time:    AVG=8681.0000 us
Running [345] 'NEON/Pooling3dLayer/Float/FP32/RunLarge@Shape=30,40,30,32,3:PoolType=MAX:PoolingSize=3x3x3:Stride=2x2x2:Padding=1,1,1,1,1,1:ExcludePadding=0:DataType=F32'
neon time: 704
reference time: 25464
  Wall clock/Wall clock time:    AVG=8755.0000 us
Running [346] 'NEON/Pooling3dLayer/Float/FP32/RunLarge@Shape=30,40,30,32,3:PoolType=MAX:PoolingSize=3x3x3:Stride=2x2x2:Padding=1,1,0,0,0,0:ExcludePadding=1:DataType=F32'
neon time: 648
reference time: 22707
  Wall clock/Wall clock time:    AVG=7663.0000 us
Running [347] 'NEON/Pooling3dLayer/Float/FP32/RunLarge@Shape=30,40,30,32,3:PoolType=MAX:PoolingSize=3x3x3:Stride=2x2x2:Padding=1,1,0,0,0,0:ExcludePadding=0:DataType=F32'
neon time: 661
reference time: 22717
  Wall clock/Wall clock time:    AVG=7742.0000 us
Running [348] 'NEON/Pooling3dLayer/Float/FP32/RunLarge@Shape=30,40,30,32,3:PoolType=MAX:PoolingSize=3x3x3:Stride=2x1x1:Padding=0,0,0,0,0,0:ExcludePadding=1:DataType=F32'

@alvoron
Copy link
Author

alvoron commented Jul 3, 2024

Hi @alvoron

The reproducer gives ~6500 microseconds on my M2 Pro, which is twice slower than reference C++ implementation of Pooling.

Can you please point us to the actual reference implementation you're using? How do you make the measurements for both backends reference and ACL? Is it a single binary you're using?

May we refer to benchdnn results as to reference one? I repeated benchdnn and ACL commands again on Ampere and I got average 2.3-2.6 ms using ACL reproducer and average 0.9 ms using benchdnn.

I assume, my benchdnn command equals to ACL kernel configuration. Please let me know if I missed something.

@morgolock
Copy link

Hi @alvoron

I tried this on Ampere altra using benchdnn.

See below, the output of the build without ACL: total perf: min(ms):2.14062 avg(ms):5.18986

pabtel01@ampere-01 build_noacl]$ ./build/tests/benchdnn/benchdnn --mode=P --pool --reset --allow-enum-tags-only=0 --engine=cpu --dir=FWD_I --alg=pooling_max --dt=f32:f32 --tag=acdeb  --attr-scratchpad=user mb1ic192_id40od40kd3sd1dd0pd1_ih28oh28kh3sh1dh0ph1_iw28ow28kw3sw1dw0pw1
Output template: perf,%engine%,%impl%,%name%,%prb%,%Gops%,%+ctime%,%-time%,%-Gflops%,%0time%,%0Gflops%
perf,cpu,simple_nhwc:any,,--mode=P --pool --allow-enum-tags-only=false --dir=FWD_I --tag=acdeb --attr-scratchpad=user mb1ic192id40ih28iw28od40oh28ow28kd3kh3kw3pd1ph1pw1,0,0.469971,2.14062,0,5.18986,0
tests:1 passed:1 skipped:0 mistrusted:0 unimplemented:0 invalid_arguments:0 failed:0 listed:0
total perf: min(ms):2.14062 avg(ms):5.18986
total: 3.05s; fill: 0.03s (1%);

Can you please double check on your side?

@morgolock
Copy link

morgolock commented Nov 8, 2024

Hi @alvoron
I made some changes to the reproducer to print the time used for each iteration and then compute the average.
I ran the ACL reproducer on Ampere altra and I see performance AVG ranging from [3584,7000] μs

Please see below

LD_LIBRARY_PATH=../ComputeLibrary/build/:$LD_LIBRARY_PATH ./acl_pooling 
i 0 time: 3047
i 1 time: 3393
i 2 time: 2866
i 3 time: 3601
i 4 time: 3060
i 5 time: 2984
i 6 time: 4200
i 7 time: 3540
i 8 time: 3782
i 9 time: 3690
i 10 time: 3055
i 11 time: 3504
i 12 time: 3530
i 13 time: 3144
i 14 time: 3507
i 15 time: 3632
i 16 time: 3504
i 17 time: 3519
i 18 time: 3515
i 19 time: 3474
i 20 time: 3514
i 21 time: 3467
i 22 time: 3605
i 23 time: 3518
i 24 time: 3409
i 25 time: 3337
i 26 time: 2815
i 27 time: 3278
i 28 time: 3125
i 29 time: 3520
i 30 time: 3581
i 31 time: 2903
i 32 time: 2951
i 33 time: 3142
i 34 time: 3332
i 35 time: 3572
i 36 time: 3045
i 37 time: 3364
i 38 time: 3271
i 39 time: 3456
i 40 time: 3375
i 41 time: 3441
i 42 time: 3329
i 43 time: 3298
i 44 time: 3345
i 45 time: 3336
i 46 time: 3488
i 47 time: 3280
i 48 time: 3362
i 49 time: 3566
i 50 time: 3579
i 51 time: 3614
i 52 time: 3507
i 53 time: 3619
i 54 time: 3468
i 55 time: 3576
i 56 time: 3549
i 57 time: 3467
i 58 time: 3557
i 59 time: 3532
i 60 time: 3481
i 61 time: 3499
i 62 time: 3541
i 63 time: 3591
i 64 time: 3572
i 65 time: 3552
i 66 time: 3342
i 67 time: 3565
i 68 time: 3437
i 69 time: 3610
i 70 time: 3532
i 71 time: 3425
i 72 time: 3620
i 73 time: 3633
i 74 time: 3613
i 75 time: 3661
i 76 time: 3566
i 77 time: 3598
i 78 time: 3761
i 79 time: 3866
i 80 time: 3790
i 81 time: 3662
i 82 time: 3559
i 83 time: 3380
i 84 time: 3888
i 85 time: 3589
i 86 time: 3569
i 87 time: 3695
i 88 time: 3543
i 89 time: 3603
i 90 time: 3536
i 91 time: 3581
i 92 time: 3715
i 93 time: 3365
i 94 time: 3648
i 95 time: 3260
i 96 time: 3609
i 97 time: 7832
i 98 time: 6981
i 99 time: 7144
total: 3584

Hope this helps

@alvoron
Copy link
Author

alvoron commented Nov 19, 2024

@morgolock
oneDNN without ACL gives me 0.8 ms avg

./build/tests/benchdnn/benchdnn --mode=P --pool --reset --allow-enum-tags-only=0 --engine=cpu --dir=FWD_I --alg=pooling_max --dt=f32:f32 --tag=acdeb  --attr-scratchpad=user mb1ic192_id40od40kd3sd1dd0pd1_ih28oh28kh3sh1dh0ph1_iw28ow28kw3sw1dw0pw1
Output template: perf,%engine%,%impl%,%name%,%prb%,%Gops%,%+ctime%,%-time%,%-Gflops%,%0time%,%0Gflops%
perf,cpu,simple_nhwc:any,,--mode=P --pool --allow-enum-tags-only=false --dir=FWD_I --tag=acdeb --attr-scratchpad=user mb1ic192id40ih28iw28od40oh28ow28kd3kh3kw3pd1ph1pw1,0,1.63257,0.69165,0,0.837442,0
tests:1 passed:1 skipped:0 mistrusted:0 unimplemented:0 invalid_arguments:0 failed:0 listed:0
total perf: min(ms):0.69165 avg(ms):0.837442
total: 3.06s; fill: 0.03s (1%);

Below is a single test run with DNNL_VERBOSE enabled:

DNNL_VERBOSE=2 ./build/tests/benchdnn/benchdnn --pool --reset --allow-enum-tags-only=0 --engine=cpu --dir=FWD_I --alg=pooling_max --dt=f32:f32 --tag=acdeb  --attr-scratchpad=user mb1ic192_id40od40kd3sd1dd0pd1_ih28oh28kh3sh1dh0ph1_iw28ow28kw3sw1dw0pw1
onednn_verbose,v1,info,oneDNN v3.6.1 (commit e72f65d70e36f552f902d35614aef7aa54f3c796)
onednn_verbose,v1,info,cpu,runtime:OpenMP,nthr:64
onednn_verbose,v1,info,cpu,isa:AArch64 (with Advanced SIMD & floating-point)
onednn_verbose,v1,info,gpu,runtime:none
onednn_verbose,v1,info,graph,backend,0:dnnl_backend
onednn_verbose,v1,primitive,info,template:operation,engine,primitive,implementation,prop_kind,memory_descriptors,attributes,auxiliary,problem_desc,exec_time
onednn_verbose,v1,graph,info,template:operation,engine,partition_id,partition_kind,op_names,data_formats,logical_tensors,fpmath_mode,implementation,backend,exec_time
onednn_verbose,v1,primitive,create:cache_miss,cpu,pooling,simple_nhwc:any,forward_inference,src:f32::blocked:acdeb::f0 dst:f32:a:blocked:acdeb::f0,attr-scratchpad:user,alg:pooling_max,mb1ic192_id40od40kd3sd1dd0pd1_ih28oh28kh3sh1dh0ph1_iw28ow28kw3sw1dw0pw1,0.0651855
onednn_verbose,v1,primitive,create:cache_hit,cpu,pooling,simple_nhwc:any,forward_inference,src:f32::blocked:acdeb::f0 dst:f32:a:blocked:acdeb::f0,attr-scratchpad:user,alg:pooling_max,mb1ic192_id40od40kd3sd1dd0pd1_ih28oh28kh3sh1dh0ph1_iw28ow28kw3sw1dw0pw1,0.00195312
onednn_verbose,v1,primitive,create:cache_miss,cpu,reorder,jit:uni,undef,src:f32::blocked:abcde::f0 dst:f32::blocked:acdeb::f0,,,1x192x40x28x28,0.512939
onednn_verbose,v1,primitive,exec,cpu,reorder,jit:uni,undef,src:f32::blocked:abcde::f0 dst:f32::blocked:acdeb::f0,,,1x192x40x28x28,3.46313
onednn_verbose,v1,primitive,exec,cpu,pooling,simple_nhwc:any,forward_inference,src:f32::blocked:acdeb::f0 dst:f32:a:blocked:acdeb::f0,attr-scratchpad:user,alg:pooling_max,mb1ic192_id40od40kd3sd1dd0pd1_ih28oh28kh3sh1dh0ph1_iw28ow28kw3sw1dw0pw1,1.80005
onednn_verbose,v1,primitive,create:cache_miss,cpu,reorder,jit:uni,undef,src:f32::blocked:acdeb::f0 dst:f32::blocked:abcde::f0,,,1x192x40x28x28,0.156982
onednn_verbose,v1,primitive,exec,cpu,reorder,jit:uni,undef,src:f32::blocked:acdeb::f0 dst:f32::blocked:abcde::f0,,,1x192x40x28x28,0.478027
0:PASSED __REPRO: --pool --allow-enum-tags-only=false --dir=FWD_I --tag=acdeb --attr-scratchpad=user mb1ic192id40ih28iw28od40oh28ow28kd3kh3kw3pd1ph1pw1
tests:1 passed:1 skipped:0 mistrusted:0 unimplemented:0 invalid_arguments:0 failed:0 listed:0
total: 0.12s; fill: 0.03s (23%); compute_ref: 0.05s (37%); compare: 0.02s (16%);

I'm wondering why your oneDNN results is much slower than mine: 5.19 ms vs 0.8 ms.
I'm using oneDNN 3.6.1. I built benchdnn in this way:

cmake -B build -DCMAKE_BUILD_TYPE=Release
cmake --build build --target benchdnn --parallel $(nproc)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants