Looking at the generated code, you find some subtle differences that lead to different loop sizes.
1st)
000000013F5A1090 vmovaps ymm0,ymmword ptr [rbx+rax]
vecAp++;
vecBp++;
000000013F5A1095 add rax,20h
versus 2nd)
000000013FA51070 vmovaps ymm0,ymmword ptr [rbp+rbx]
000000013FA51076 add rbx,20h
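Ignore the interleaved source lines for the ++ of the two pointers; instead, look at the byte address of the add instruction. In the first case it is +5 bytes from the start of the loop (...90); in the second case it is +6 bytes from the start of the loop (...70). The extra byte is a displacement byte rather than a prefix: when rbp is the base register, the ModRM/SIB encoding has no displacement-free form, so a zero disp8 has to be emitted.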
Next, look at the vaddps:
000000013F5A109C vaddps ymm1,ymm0,ymmword ptr [rdi+rax-20h]
000000013F5A10A2 vmovaps ymmword ptr [rax-20h],ymm1
versus
000000013FA5107D vaddps ymm0,ymm0,ymmword ptr [rbp+rbx+3FE0h]
000000013FA51086 vmovaps ymmword ptr [rbx+rax-20h],ymm0
Note that the displacement in the first case is -20h, which fits in disp8 (one byte), making the vaddps 6 bytes.
The displacement in the second case is 3FE0h, which requires disp32 (4 bytes), making the vaddps 9 bytes.
The use of the (registerized) pointers permitted the use of shorter byte length instructions.
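The vaddps sizes can be cross-checked against the listing by subtracting consecutive addresses. The sketch below is only an illustration: the fixed part of 5 bytes (2-byte VEX prefix, opcode, ModRM, SIB) is an assumption that holds for the registers shown, and disp_size is a made-up helper name.

```python
# Compare the instruction lengths measured from the listing with the
# lengths predicted from the displacement size alone.

def disp_size(disp: int) -> int:
    """Bytes needed for a nonzero displacement: disp8 if it fits, else disp32."""
    return 1 if -128 <= disp <= 127 else 4

FIXED = 5  # assumed: 2-byte VEX + opcode + ModRM + SIB

# (address of vaddps, address of the next instruction, displacement)
listing = {
    "first":  (0x13F5A109C, 0x13F5A10A2, -0x20),
    "second": (0x13FA5107D, 0x13FA51086, 0x3FE0),
}

results = {}
for name, (start, following, disp) in listing.items():
    measured = following - start          # length taken from the addresses
    predicted = FIXED + disp_size(disp)   # length from the encoding model
    results[name] = (measured, predicted)
    print(f"{name}: measured {measured} bytes, predicted {predicted} bytes")
```

Both rows agree (6 bytes and 9 bytes), so the 3-byte difference per vaddps comes entirely from the wider displacement.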
Jim Dempsey