WIP: Hack to optimise based on known sizes#276
WIP: Hack to optimise based on known sizes#276mjp41 wants to merge 8 commits intomicrosoft:snmalloc1from
Conversation
If the caller knows the sizes, and knows the allocation is thread local, then we can make some significant optimisations. This is a brief hack to show where would need changing.
|
On x86 the code would look something like: 00000000000055a0 <free_local_small>:
55a0: 48 89 fa mov %rdi,%rdx
55a3: 48 89 fe mov %rdi,%rsi
55a6: 48 81 e6 00 00 f0 ff and $0xfffffffffff00000,%rsi
55ad: 89 d0 mov %edx,%eax
55af: c1 e8 0e shr $0xe,%eax
55b2: 83 e0 3f and $0x3f,%eax
55b5: 48 c1 e0 04 shl $0x4,%rax
55b9: 66 83 44 06 60 ff addw $0xffff,0x60(%rsi,%rax,1)
55bf: 74 0e je 55cf <free_local_small+0x2f>
55c1: 48 8b 4c 06 58 mov 0x58(%rsi,%rax,1),%rcx
55c6: 48 89 54 06 58 mov %rdx,0x58(%rsi,%rax,1)
55cb: 48 89 0a mov %rcx,(%rdx)
55ce: c3 retq
;; SLOW PATH
55cf: 48 89 d1 mov %rdx,%rcx
55d2: 48 81 e1 00 c0 ff ff and $0xffffffffffffc000,%rcx
55d9: 48 8b 3d e8 c9 20 00 mov 0x20c9e8(%rip),%rdi # 211fc8 <.got+0x10>
55e0: 64 48 8b 3f mov %fs:(%rdi),%rdi
55e4: 48 81 e1 ff 3f f0 ff and $0xfffffffffff03fff,%rcx
55eb: 0f b6 4c 01 66 movzbl 0x66(%rcx,%rax,1),%ecx
55f0: e9 4b e6 ff ff jmpq 3c40 <_ZN8snmalloc9AllocatorIXadL_ZNS_20needs_initialisationEPvEEXadL_ZNS_21init_thread_allocatorENS_12function_refIFS1_S1_EEEEENS_24MemoryProviderStateMixinINS_8PALLinuxEEENS_15DefaultChunkMapINS_21GlobalPagemapTemplateINS_11FlatPagemapILm20EhEEEEEELb1EE27small_dealloc_offseted_slowEPNS_9SuperslabES1_m>
55f5: 66 2e 0f 1f 84 00 00 nopw %cs:0x0(%rax,%rax,1)
55fc: 00 00 00
55ff: 90 nop
and 0000000000005650 <malloc_small_64>:
5650: 48 8b 05 71 c9 20 00 mov 0x20c971(%rip),%rax # 211fc8 <.got+0x10>
5657: 64 48 8b 38 mov %fs:(%rax),%rdi
565b: 48 8b 47 18 mov 0x18(%rdi),%rax
565f: 48 85 c0 test %rax,%rax
5662: 74 08 je 566c <malloc_small_64+0x1c>
5664: 48 8b 08 mov (%rax),%rcx
5667: 48 89 4f 18 mov %rcx,0x18(%rdi)
566b: c3 retq
;; SLOW PATH
566c: 48 8b 87 80 0c 00 00 mov 0xc80(%rdi),%rax
5673: 48 39 87 88 0c 00 00 cmp %rax,0xc88(%rdi)
567a: 75 0f jne 568b <malloc_small_64+0x3b>
567c: be 03 00 00 00 mov $0x3,%esi
5681: ba 40 00 00 00 mov $0x40,%edx
5686: e9 25 ea ff ff jmpq 40b0 <_ZN8snmalloc9AllocatorIXadL_ZNS_20needs_initialisationEPvEEXadL_ZNS_21init_thread_allocatorENS_12function_refIFS1_S1_EEEEENS_24MemoryProviderStateMixinINS_8PALLinuxEEENS_15DefaultChunkMapINS_21GlobalPagemapTemplateINS_11FlatPagemapILm20EhEEEEEELb1EE26small_alloc_next_free_listILNS_7ZeroMemE0ELNS_12AllowReserveE1EEES1_mm>
568b: be 03 00 00 00 mov $0x3,%esi
5690: ba 40 00 00 00 mov $0x40,%edx
5695: e9 b6 ea ff ff jmpq 4150 <_ZN8snmalloc9AllocatorIXadL_ZNS_20needs_initialisationEPvEEXadL_ZNS_21init_thread_allocatorENS_12function_refIFS1_S1_EEEEENS_24MemoryProviderStateMixinINS_8PALLinuxEEENS_15DefaultChunkMapINS_21GlobalPagemapTemplateINS_11FlatPagemapILm20EhEEEEEELb1EE19small_alloc_mq_slowILNS_7ZeroMemE0ELNS_12AllowReserveE1EEES1_mm>
569a: 66 0f 1f 44 00 00 nopw 0x0(%rax,%rax,1)
This is not intended for merging, but I have pushed so we can adapt the idea later. |
|
|
||
| # Build a redirection layer for all sizes that are a multiple of | ||
| # 16bytes up to 1024. | ||
| add_executable(generate src/redirect/generate.cc) |
There was a problem hiding this comment.
Is it possible to differentiate between executables for the build host and executables for the target when we're cross-compiling?
There was a problem hiding this comment.
There was a problem hiding this comment.
Needing to build a C++ program on the host to generate things that we compile for the target is a bit painful for cross compiling (do we even guarantee that the sizes will be the same if, for example, we're compiling on a 32-bit system for a CHERI target?).
It's also going to be annoying to integrate into a libc build system.
There was a problem hiding this comment.
Currently, all platforms and configurations, we build would use the same file. However, we can use this program to build a collection of headers that encode the parameters of interest. E.g. of the form
generated_[MIN_ALLOC_BITS]_[INTERMEDIATE_BITS].cc
Currently, everything would use exactly the same thing:
generated_4_2.cc
But for CHERI, we might want to up the minimum allocation size. Getting either parameter wrong in the header would "work", however, we might
- have more entry points than necessary, or
- allocate a larger object than necessary.
Perhaps, just check in the file in once, we have finished experimenting. While, we are experimenting, I think having this generated is good as it means we are less likely to make mistakes.
There was a problem hiding this comment.
Note, we use the template parameter for s, so the code will be specialised for this size, even if it ends up being a medium alloc, or large.
#define DEFINE_MALLOC_SIZE(a, s) \
extern "C" void* a() \
{ \
return snmalloc::ThreadAlloc::get_noncachable()->template alloc<s>(); \
}| SNMALLOC_EXPORT | ||
| void SNMALLOC_NAME_MANGLE(free_local_small)(void* ptr) | ||
| { | ||
| if (Alloc::small_local_dealloc(ptr)) |
There was a problem hiding this comment.
I think I'd feel slightly better if this if(!fast) slow dance were inside Alloc rather than having it export the fast and slow paths separately? Unless there's some reason to want to expose this for inlining?
There was a problem hiding this comment.
It has to at least be in ThreadAlloc, as we don't want to take the TLS lookup on the fast path. So we need that in scope. I think we probably need a better refactor if we actually want to support this design. This is more a hack to get some codegen, and see if the use case makes sense.
| return (likely(slab->dealloc_fast(super, p))); | ||
| } | ||
|
|
||
| SNMALLOC_FAST_PATH void small_local_dealloc_slow(void* p) |
There was a problem hiding this comment.
Still SNMALLOC_FAST_PATH despite _slow?
Fix Align to be not a log size.
If the caller knows the sizes, and knows the allocation is thread local,
then we can make some significant optimisations.
This is a brief hack to show where would need changing.