Description
Feature or enhancement
Proposal:
We have a textual assembly parser for the stencils. It already knows which blocks are cold and which are hot, so it should not be too hard to teach it to place those blocks into separate sections.
Currently this is _BINARY_OP_ADD_INT:
// _BINARY_OP_ADD_INT_r23.o: file format elf64-x86-64
//
// Disassembly of section .text:
//
// 0000000000000000 <_JIT_ENTRY>:
// 0: 55 pushq %rbp
// 1: 48 83 ec 10 subq $0x10, %rsp
// 5: 48 89 74 24 08 movq %rsi, 0x8(%rsp)
// a: 48 89 fb movq %rdi, %rbx
// d: 4c 89 fd movq %r15, %rbp
// 10: 4c 89 ff movq %r15, %rdi
// 13: 48 83 e7 fe andq $-0x2, %rdi
// 17: 48 89 de movq %rbx, %rsi
// 1a: 48 83 e6 fe andq $-0x2, %rsi
// 1e: ff 15 00 00 00 00 callq *(%rip) # 0x24 <_JIT_ENTRY+0x24>
// 0000000000000020: R_X86_64_GOTPCRELX _PyCompactLong_Add-0x4
// 24: 48 83 f8 01 cmpq $0x1, %rax
// 28: 75 15 jne 0x3f <_JIT_ENTRY+0x3f>
// 2a: 49 89 ef movq %rbp, %r15
// 2d: 48 89 df movq %rbx, %rdi
// 30: 48 8b 74 24 08 movq 0x8(%rsp), %rsi
// 35: 48 83 c4 10 addq $0x10, %rsp
// 39: 5d popq %rbp
// 3a: e9 00 00 00 00 jmp 0x3f <_JIT_ENTRY+0x3f>
// 000000000000003b: R_X86_64_PLT32 _JIT_JUMP_TARGET-0x4
// 3f: 49 89 c7 movq %rax, %r15
// 42: 48 89 ef movq %rbp, %rdi
// 45: 48 89 de movq %rbx, %rsi
// 48: 48 83 c4 10 addq $0x10, %rsp
// 4c: 5d popq %rbp
With hot-cold splitting, it will be split into:
_BINARY_OP_ADD_INT_r23.HOT:
// 0000000000000000 <_JIT_ENTRY>:
// 0: 55 pushq %rbp
// 1: 48 83 ec 10 subq $0x10, %rsp
// 5: 48 89 74 24 08 movq %rsi, 0x8(%rsp)
// a: 48 89 fb movq %rdi, %rbx
// d: 4c 89 fd movq %r15, %rbp
// 10: 4c 89 ff movq %r15, %rdi
// 13: 48 83 e7 fe andq $-0x2, %rdi
// 17: 48 89 de movq %rbx, %rsi
// 1a: 48 83 e6 fe andq $-0x2, %rsi
// 1e: ff 15 00 00 00 00 callq *(%rip) # 0x24 <_JIT_ENTRY+0x24>
// 0000000000000020: R_X86_64_GOTPCRELX _PyCompactLong_Add-0x4
// 24: 48 83 f8 01 cmpq $0x1, %rax
// 28: 75 15 jne 0x3f <_JIT_ENTRY+0x3f>
// 3f: 49 89 c7 movq %rax, %r15
// 42: 48 89 ef movq %rbp, %rdi
// 45: 48 89 de movq %rbx, %rsi
// 48: 48 83 c4 10 addq $0x10, %rsp
// 4c: 5d popq %rbp
_BINARY_OP_ADD_INT_r23.COLD:
// 2a: 49 89 ef movq %rbp, %r15
// 2d: 48 89 df movq %rbx, %rdi
// 30: 48 8b 74 24 08 movq 0x8(%rsp), %rsi
// 35: 48 83 c4 10 addq $0x10, %rsp
// 39: 5d popq %rbp
// 3a: e9 00 00 00 00 jmp 0x3f <_JIT_ENTRY+0x3f>
// 000000000000003b: R_X86_64_PLT32 _JIT_JUMP_TARGET-0x4
Running the current jump inversion and zero-length jump removal then gives us:
_BINARY_OP_ADD_INT_r23.HOT:
// 0000000000000000 <_JIT_ENTRY>:
// 0: 55 pushq %rbp
// 1: 48 83 ec 10 subq $0x10, %rsp
// 5: 48 89 74 24 08 movq %rsi, 0x8(%rsp)
// a: 48 89 fb movq %rdi, %rbx
// d: 4c 89 fd movq %r15, %rbp
// 10: 4c 89 ff movq %r15, %rdi
// 13: 48 83 e7 fe andq $-0x2, %rdi
// 17: 48 89 de movq %rbx, %rsi
// 1a: 48 83 e6 fe andq $-0x2, %rsi
// 1e: ff 15 00 00 00 00 callq *(%rip) # 0x24 <_JIT_ENTRY+0x24>
// 0000000000000020: R_X86_64_GOTPCRELX _PyCompactLong_Add-0x4
// 24: 48 83 f8 01 cmpq $0x1, %rax
// 28: 75 15 je _BINARY_OP_ADD_INT_r23.COLD
// 3f: 49 89 c7 movq %rax, %r15
// 42: 48 89 ef movq %rbp, %rdi
// 45: 48 89 de movq %rbx, %rsi
// 48: 48 83 c4 10 addq $0x10, %rsp
// 4c: 5d popq %rbp
_BINARY_OP_ADD_INT_r23.COLD:
// 2a: 49 89 ef movq %rbp, %r15
// 2d: 48 89 df movq %rbx, %rdi
// 30: 48 8b 74 24 08 movq 0x8(%rsp), %rsi
// 35: 48 83 c4 10 addq $0x10, %rsp
// 39: 5d popq %rbp
// 3a: e9 00 00 00 00 jmp 0x3f <_JIT_ENTRY+0x3f>
// 000000000000003b: R_X86_64_PLT32 _JIT_JUMP_TARGET-0x4
We then lay out the traces using only the HOT sections and leave the COLD sections at the end. I think this is as good as it gets for machine code flow/layout unless we start writing things by hand.
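The layout described above can be sketched roughly as follows. This is only an illustration, not the actual JIT build tooling: the `Stencil` class, the `layout_trace` function, the stencil names `_A` and `_B`, and the per-position label suffixes are all hypothetical, and instructions are modelled as plain strings rather than machine code.

```python
from dataclasses import dataclass, field


@dataclass
class Stencil:
    """Hypothetical container: one stencil already split into two sections."""

    name: str
    hot: list[str] = field(default_factory=list)
    cold: list[str] = field(default_factory=list)


def layout_trace(stencils: list[Stencil]) -> list[str]:
    """Emit every HOT section in trace order, then every COLD section."""
    hot_part: list[str] = []
    cold_part: list[str] = []
    for index, stencil in enumerate(stencils):
        # The index keeps labels unique when a stencil repeats in a trace.
        hot_part.append(f"{stencil.name}.HOT.{index}:")
        hot_part.extend(stencil.hot)
        if stencil.cold:
            cold_part.append(f"{stencil.name}.COLD.{index}:")
            cold_part.extend(stencil.cold)
    # Cold code is appended after all hot code, so the hot path stays contiguous.
    return hot_part + cold_part


# Assumed example trace: the first stencil has a cold exit, the second does not.
trace = [
    Stencil("_A", hot=["cmpq $0x1, %rax", "je _A.COLD.0"], cold=["jmp _JIT_JUMP_TARGET"]),
    Stencil("_B", hot=["movq %rax, %r15"]),
]
layout = layout_trace(trace)
```

With this arrangement the hot path falls through from one stencil to the next, and only the rarely taken branch leaves the contiguous hot region.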
This builds on #142228.
In the future, to reduce the memory used by jitted code even further, we can de-duplicate common cold stencil fragments. E.g. if a trace contains multiple _BINARY_OP_ADD_INT_r23 stencils, they can all jump to one shared _BINARY_OP_ADD_INT_r23.COLD instead of each carrying its own copy. That should be a separate PR from this one, however.
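A minimal sketch of that de-duplication idea, under a strong assumption: two cold fragments are treated as shareable only when their instruction text is identical, which ignores per-instance patching (for example, differing jump targets) that a real implementation would have to account for. The function name `share_cold_fragments`, the `_OTHER` stencil, and the string-based instruction model are hypothetical.

```python
def share_cold_fragments(
    trace: list[tuple[str, list[str]]],
) -> tuple[list[str], list[str]]:
    """Return (jump target label per stencil, cold section emitted once per body)."""
    labels: dict[tuple[str, ...], str] = {}  # cold body -> label of its single copy
    targets: list[str] = []
    emitted: list[str] = []
    for name, cold_body in trace:
        key = tuple(cold_body)
        if key not in labels:
            # First time this body is seen: emit it and remember its label.
            labels[key] = f"{name}.COLD"
            emitted.append(f"{labels[key]}:")
            emitted.extend(cold_body)
        targets.append(labels[key])
    return targets, emitted


# Assumed example: the same stencil appears twice in one trace.
cold = ["movq %rbp, %r15", "popq %rbp", "jmp _JIT_JUMP_TARGET"]
trace = [
    ("_BINARY_OP_ADD_INT_r23", cold),
    ("_OTHER", ["jmp _JIT_JUMP_TARGET"]),
    ("_BINARY_OP_ADD_INT_r23", cold),
]
targets, emitted = share_cold_fragments(trace)
```

Here both occurrences of `_BINARY_OP_ADD_INT_r23` end up targeting the same label, and the shared cold body appears only once in the emitted section.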
I will work on this.
Has this already been discussed elsewhere?
No response given
Links to previous discussion of this feature:
No response