最近在用vtune分析程序性能瓶颈时,发现一些内存访问的地方竟然成了cpu热点。经过仔细分析,发现这些热点主要是对大数组非连续位置的访问的引起的。比较消耗cpu的原因应该是cache不命中。因为像这样局部性很差的内存访问逻辑,对cache是很不友好的。于是想到了prefetch……
现象
#include <xmmintrin.h> #include <stdio.h> #include <stdlib.h> #include <unistd.h> #include <sys/mman.h> #include <sys/time.h> #include <sys/types.h> #include <sys/stat.h> #include <fcntl.h> #include <math.h> void usage() { printf("usage: BIN file step prefetch\n"); exit(1); } inline int calcu(int input) { #ifdef EMPTYCALC return input; #endif int val = (input % 99) * (input / 98); val = val ? val : 1; #ifdef HEAVYCALC double d = (double)input / (double)val; return (int)pow(d, 1999.9); #endif double n = sqrt(sqrt((double)(unsigned)input * 1.3)); double m = sqrt(sqrt((double)(unsigned)val * 0.9)); return (int)((double)input * (double)val * m / (n ? n : 1.1)); } int run_withprefetch(const int *array, int size, int step, int prefetch) { int result = 0; printf("run with prefetch(%d)...\n", prefetch); for (int i = 0; i < step; i++) { for (int j = i; j < size; j += step) { int k = j + step * prefetch; if (k < size) { _mm_prefetch(&array[k], _MM_HINT_T0); //const int *addr = &array[k]; //__asm__ __volatile__ ("mov (%0), %%eax"::"r"(addr):"eax"); //__asm__ __volatile__ ("mov %0, %%eax"::"r"(k):"eax"); } result += calcu(array[j]); } } return result; } int run(const int *array, int size, int step) { int result = 0; printf("run...\n"); for (int i = 0; i < step; i++) { for (int j = i; j < size; j += step) { //asm volatile("lfence"); result += calcu(array[j]); } } return result; } int main(int argc, const char *argv[]) { if (argc != 4) { usage(); } int step = atoi(argv[2]); int prefetch = atoi(argv[3]); int fd = open(argv[1], O_RDONLY); if (fd == -1) { usage(); } struct stat st; int ret = fstat(fd, &st); if (ret != 0) { usage(); } int array_size = st.st_size / sizeof(int); printf("array size: %d, step: %d. ", array_size, step); const int *array = (const int *)mmap(NULL, st.st_size, PROT_READ, MAP_POPULATE|MAP_SHARED, fd, 0); if (array == MAP_FAILED) { usage(); } struct timeval tv1, tv2; gettimeofday(&tv1, NULL); int result = 0; if (prefetch == 0) { result = run(array, array_size, step); } else if (prefetch > 0) { result = run_withprefetch(array, array_size, step, prefetch); } gettimeofday(&tv2, NULL); tv2.tv_sec -= tv1.tv_sec; tv2.tv_usec -= tv1.tv_usec; if (tv2.tv_usec < 0) { tv2.tv_usec += 1000000; tv2.tv_sec--; } printf("time cost: %d.%06d, ", tv2.tv_sec, tv2.tv_usec); printf("result: %d\n", result); return result; }
$ ll -h test.tar.gz -rw-rw-r-- 1 xiangy xiangy 1.8G Jun 27 09:37 test.tar.gz $ g++ -O2 prefetch.cpp -DHEAVYCALC -o prefetch.heavy $ g++ -O2 prefetch.cpp -DEMPTYCALC -o prefetch.empty $ g++ -O2 prefetch.cpp -o prefetch.normal $ ./prefetch.normal usage: BIN file step prefetch
(选择不同复杂度的计算逻辑,编译成以不同后缀命名的可执行文件。)
[case-1]$ ./prefetch.empty test.tar.gz 1024 0 array size: 468787200, step: 1024. run... time cost: 23.980005, result: 692002678
(空计算+跳读内存。预期内存访问基本上都是cache miss,而计算逻辑基本上又不花时间,所以最终花费的时间主要就是读内存时间。记住这个值,下面的case经常需要参考它。)
[case-2]$ ./prefetch.normal test.tar.gz 1 0 array size: 468787200, step: 1. run... time cost: 22.846302, result: 1309150882 [case-3]$ ./prefetch.normal test.tar.gz 1024 0 array size: 468787200, step: 1024. run... time cost: 66.041256, result: 1309150882 [case-4]$ ./prefetch.normal test.tar.gz 1024 4 array size: 468787200, step: 1024. run with prefetch(4)... time cost: 28.247350, result: 1309150882
(以上是普通计算的运行情况。case-2顺序读内存预期基本上都是cache hit的,最终花费的时间主要是执行计算逻辑的时间;case-3跳读内存时间花费大量增加;case-4加了预取之后,时间花费基本上恢复到case-2的水平。)
[case-5]$ ./prefetch.heavy test.tar.gz 1 0 array size: 468787200, step: 1. run... time cost: 47.386533, result: 1625037789 [case-6]$ ./prefetch.heavy test.tar.gz 1024 0 array size: 468787200, step: 1024. run... time cost: 107.783801, result: 1625037789 [case-7]$ ./prefetch.heavy test.tar.gz 1024 4 array size: 468787200, step: 1024. run with prefetch(4)... time cost: 51.492479, result: 1625037789
(以上是复杂计算的运行情况。跟前面的表现基本一致,跳读带来了大量的时间增长,而预取又基本恢复到顺序读时的水平。)
[case-8]$ ./prefetch.empty test.tar.gz 1024 4 array size: 468787200, step: 1024. run with prefetch(4)... time cost: 24.253892, result: 692002678
(空计算+跳读内存,预取效果跟不加预取时差不多。)
[case-9]$ ./prefetch.normal test.tar.gz 1 4 array size: 468787200, step: 1. run with prefetch(4)... time cost: 22.896790, result: 1309150882
(普通计算+顺序读内存+预取,效果跟不加预取时也差不多。)
prefetch原理
load-1 load-1 calc-1 => calc-1 load-2 load-2 calc-2 load-3 calc-2 calc-3
但是实际上却又并非如此简单。
calc-11 calc-12 calc-13 calc-21 calc-22 calc-23 calc-31 calc-32 calc-33 calc-41 calc-42 calc-43 calc-51 calc-52 calc-53
(每个loop约花费2个单位时间。)
load-11 load-12 calc-11 calc-12 calc-13 load-21 load-22 calc-21 calc-22 calc-23 load-31 load-32 calc-31 calc-32 calc-33
(每个loop约花费4个单位时间。注,假设CPU执行到calc-N3的时候才看到有load-(N+1)1。)
load-11 load-12 calc-11 prefetch-21 calc-12 prefetch-22 calc-13 calc-21 prefetch-31 calc-22 prefetch-32 calc-23 calc-31 prefetch-41 calc-32 prefetch-42 calc-33 calc-41 prefetch-51 calc-42 prefetch-52 calc-43 calc-51 calc-52 calc-53
(每个loop约花费2个单位时间。注,在calc-N1之前prefetch-(N+X)1就已经发起了。)
例子引出的问题
int run(const int *array, int size, int step) { int result = 0; printf("run...\n"); for (int i = 0; i < step; i++) { for (int j = i; j < size; j += step) { asm volatile("lfence"); result += calcu(array[j]); } } return result; }
再次运行case-1:
[case-1.1]$ g++ -O2 prefetch.cpp -DEMPTYCALC [case-1.1]$ ./a.out test.tar.gz 1024 0 array size: 468787200, step: 1024. run... time cost: 63.999864, result: 692002678
(强制让load不能并行之后,case-1.1的耗时直接变成了case-3的水平。说明在原本的case-1中load是存在很大的并行度的。)
[case-6.1]$ g++ -O2 prefetch.cpp -DHEAVYCALC [case-6.1]$ ./a.out test.tar.gz 1024 0 array size: 468787200, step: 1024. run... time cost: 114.787379, result: 1625037789
(果然如此。)
[case-2.2]$ ./prefetch.normal test.tar.gz 1 0 | ./prefetch.normal test.tar.gz 1 0 array size: 468787200, step: 1. run... time cost: 22.964594, result: 1309150882
(两个顺序读内存的普通计算一起运行,因为总是cache hit,所以跟单个运行的时间差不多。)
[case-1.2]$ ./a.out test.tar.gz 1024 0 | ./a.out test.tar.gz 1024 0 array size: 468787200, step: 1024. run... time cost: 63.973557, result: 692002678
(两个加了lfence的进程一起运行,由于进程内的内存访问已经串行化了,两个进程可以各自使用一个内存通道,所以运行时间跟单个进程运行时差不多。)
[case-1.3]$ ./prefetch.empty test.tar.gz 1024 0 array size: 468787200, step: 1024. run... time cost: 24.083044, result: 692002678 [case-1.4]$ ./prefetch.empty test.tar.gz 1024 0 | ./prefetch.empty test.tar.gz 1024 0 array size: 468787200, step: 1024. run... time cost: 37.948864, result: 692002678
(而用之前没加过lfence的程序再试一下,两个进程同时运行时,由于争抢内存带宽,运行时间就会受影响。)
prefetch提前量
$ for x in 1 2 4 8 16 32; do ./prefetch.normal test.tar.gz 1024 $x; done array size: 468787200, step: 1024. run with prefetch(1)... time cost: 36.262511, result: 1309150882 array size: 468787200, step: 1024. run with prefetch(2)... time cost: 29.902517, result: 1309150882 array size: 468787200, step: 1024. run with prefetch(4)... time cost: 28.052798, result: 1309150882 array size: 468787200, step: 1024. run with prefetch(8)... time cost: 26.040215, result: 1309150882 array size: 468787200, step: 1024. run with prefetch(16)... time cost: 26.198825, result: 1309150882 array size: 468787200, step: 1024. run with prefetch(32)... time cost: 25.910506, result: 1309150882
prefetch提前量从1增大到32。从结果看,当提前量小的时候,prefetch效果不明显。为什么呢?
用mov代替prefetch?
int run_withprefetch(const int *array, int size, int step, int prefetch) { int result = 0; printf("run with prefetch(%d)...\n", prefetch); for (int i = 0; i < step; i++) { for (int j = i; j < size; j += step) { int k = j + step * prefetch; if (k < size) { const int *addr = &array[k]; asm volatile("mov (%0), %%eax"::"r"(addr):"eax"); } result += calcu(array[j]); } } return result; }
重跑case-4:
[case-4.1]$ g++ -O2 prefetch.cpp [case-4.1]$ ./a.out test.tar.gz 1024 4 array size: 468787200, step: 1024. run with prefetch(4)... time cost: 37.312423, result: 1309150882
确实比不加prefetch的情况case-3(66.041256)要好很多,但还是比不上原来的case-4(28.247350)。
if (k < size) { _mm_prefetch(&array[k], _MM_HINT_T0); __asm__ __volatile__ ("mov %0, %%eax"::"r"(k):"eax"); }
[case-4.2]$ g++ -O2 prefetch.cpp [case-4.2]$ ./a.out test.tar.gz 1024 4 array size: 468787200, step: 1024. run with prefetch(4)... time cost: 28.051848, result: 1309150882
可见仅仅多占用一个寄存器,貌似并没有什么影响。(在tomasulo算法中,这里实际上并没有多占用寄存器,而是多占用了RS。)
[case-6.2]$ g++ -O2 prefetch.cpp -DHEAVYCALC [case-6.2]$ ./a.out test.tar.gz 1024 4 array size: 468787200, step: 1024. run with prefetch(4)... time cost: 100.910547, result: 1625037789
果然,这个结果比起原本的case-6(107.783801)已经没有优势了。
其他问题
关于硬件prefetch
$ for x in 1 2 4 8 16 32 64 128 256 512 1024; do ./prefetch.normal test.tar.gz $x 0; done array size: 468787200, step: 1. run... time cost: 22.863716, result: 1309150882 array size: 468787200, step: 2. run... time cost: 23.438035, result: 1309150882 array size: 468787200, step: 4. run... time cost: 23.846166, result: 1309150882 array size: 468787200, step: 8. run... time cost: 24.171723, result: 1309150882 array size: 468787200, step: 16. run... time cost: 25.502980, result: 1309150882 array size: 468787200, step: 32. run... time cost: 37.461018, result: 1309150882 array size: 468787200, step: 64. run... time cost: 39.829086, result: 1309150882 array size: 468787200, step: 128. run... time cost: 44.291904, result: 1309150882 array size: 468787200, step: 256. run... time cost: 65.332225, result: 1309150882 array size: 468787200, step: 512. run... time cost: 64.764071, result: 1309150882 array size: 468787200, step: 1024. run... time cost: 65.952260, result: 1309150882
随着step的逐步增大,可以看出时间消耗分为三个档次:
关于TLB cache miss
int run(const int *array, int size, int step, int blocksize) { int result = 0; blocksize *= 4096; if (blocksize == 0) { blocksize = size; } printf("run... (block=%d pages)\n", blocksize/4096); int start = 0; int nsize = blocksize; while (nsize == blocksize) { if (start + blocksize > size) nsize = size - start; for (int i = 0; i < step; i+=32) { for (int j = i; j < nsize; j += step) { int thissize = j + 32 < nsize ? j + 32 : nsize; for (int k = j; k < thissize; k++) { result += calcu(array[start+k]); } } } start += nsize; } return result; }
改造run函数把整个文件按blocksize划分成若干个块,每个块单独完成跳读逻辑。
$ for x in 128 256 512 1024 0; do ./a.out test.tar.gz 1024 $x; done array size: 468787200, step: 1024. run... (block=128 pages) time cost: 22.501363, result: 1309150882 array size: 468787200, step: 1024. run... (block=256 pages) time cost: 22.627935, result: 1309150882 array size: 468787200, step: 1024. run... (block=512 pages) time cost: 25.064514, result: 1309150882 array size: 468787200, step: 1024. run... (block=1024 pages) time cost: 24.976720, result: 1309150882 array size: 468787200, step: 1024. run... (block=114450 pages) time cost: 24.900870, result: 1309150882
看起来TLB cache miss所带来的影响不大。
关于L1 cache
int run(const int *array, int size, int step, int blocksize) { int result = 0; blocksize *= 4096; if (blocksize == 0) { blocksize = size; } printf("run... (block=%d pages)\n", blocksize/4096); int start = 0; int nsize = blocksize; while (nsize == blocksize) { if (start + blocksize > size) nsize = size - start; for (int i = 0; i < step; i++) { for (int j = i; j < nsize; j += step) { result += calcu(array[start+j]); } } start += nsize; } return result; }
在一个块内反复跳读,如果以块的大小刚好能被cache住,则程序运行时间会很短;否则又得经历漫长的读内存过程。
$ for x in 1 2 4 8 16 32 64 128 256 512 1024; do ./a.out test.tar.gz 1024 $x; done array size: 468787200, step: 1024. run... (block=1 pages) time cost: 1.614654, result: 692002678 array size: 468787200, step: 1024. run... (block=2 pages) time cost: 1.554286, result: 692002678 array size: 468787200, step: 1024. run... (block=4 pages) time cost: 1.625566, result: 692002678 array size: 468787200, step: 1024. run... (block=8 pages) time cost: 2.621453, result: 692002678 array size: 468787200, step: 1024. run... (block=16 pages) time cost: 2.697908, result: 692002678 array size: 468787200, step: 1024. run... (block=32 pages) time cost: 2.724401, result: 692002678 array size: 468787200, step: 1024. run... (block=64 pages) time cost: 2.710056, result: 692002678 array size: 468787200, step: 1024. run... (block=128 pages) time cost: 3.864916, result: 692002678 array size: 468787200, step: 1024. run... (block=256 pages) time cost: 4.241000, result: 692002678 array size: 468787200, step: 1024. run... (block=512 pages) time cost: 20.216653, result: 692002678 array size: 468787200, step: 1024. run... (block=1024 pages) time cost: 24.361176, result: 692002678
随着block size的逐渐增大,程序运行时间显现出三个层次。分别代表着L1 cache hit(1〜4)、L2 cache hit(8〜256)、cache miss(512〜)三种状态。看起来L1 cache在本例中影响并不大。