How does tcmalloc work

Yes, and my experience is that valgrind is a tremendously slow way to do one-off debugging of a suspected memory leak. The instrumentation in tcmalloc is really different: it has almost no cost (depending on the size of the system, you may want to adjust the sampling parameter for highly multithreaded programs), and it is running at all times, so you can use it to troubleshoot in production.

When I've used valgrind, the program was so slow that it wasn't the kind of thing you could put under a live workload; valgrind slows things down considerably. In the end, what to use is a matter of workflow and convenience, and it's good to know how tcmalloc makes debugging memory leaks easier. Just for completeness, a third option would be a tracing tool that hooks into kernel syscalls.

They have nearly zero performance penalty as well.

If the thread cache doesn't already contain a chunk for the size class being allocated, it has to fetch some chunks for that class from the central cache, of which there is one per size class. If the thread cache becomes too full (more on what that means in a second) on deallocation, chunks are migrated back to the central cache. Each central cache has its own lock to reduce contention during such migrations. As chunks are migrated in and out of a thread cache, the cache bounds its own size in two interesting ways.
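As a rough illustration of that flow, here is a minimal sketch. This is not TCMalloc's actual code: the names ThreadCache, CentralCache, and fetch_batch and the constants are all hypothetical, and real chunk migration happens in size-class-dependent batches.

```cpp
#include <cstddef>
#include <mutex>
#include <vector>

constexpr int kNumClasses = 88;    // hypothetical number of size classes
constexpr std::size_t kBatch = 32; // hypothetical migration batch size

// One central cache per size class, each with its own lock so that
// migrations for different classes do not contend with each other.
struct CentralCache {
  std::mutex lock;
  std::vector<void*> chunks;

  // Move up to kBatch chunks into `out`; returns how many were moved.
  std::size_t fetch_batch(std::vector<void*>& out) {
    std::lock_guard<std::mutex> g(lock);
    std::size_t n = 0;
    for (; n < kBatch && !chunks.empty(); ++n) {
      out.push_back(chunks.back());
      chunks.pop_back();
    }
    return n;
  }
};

CentralCache central[kNumClasses];

// Per-thread cache: one free list per size class, touched locklessly.
struct ThreadCache {
  std::vector<void*> free_list[kNumClasses];

  void* allocate(int size_class) {
    auto& list = free_list[size_class];
    if (list.empty() &&
        central[size_class].fetch_batch(list) == 0) {
      return nullptr;  // real code would refill central from the page heap
    }
    void* chunk = list.back();
    list.pop_back();
    return chunk;  // fast path: no locks taken when the list is non-empty
  }
};
```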

If a thread cache has exceeded its maximum size after a migration from the central cache, or on deallocation, it first attempts to find some extra headroom in its own free lists by checking whether any of them have excess chunks that can be released to the central caches.

Chunks are considered excess if they have been added to a free list since the last allocation that the list satisfied [3].
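One way to picture that rule is a per-list low-water mark: anything still on the list above the lowest point it has reached since the last sweep was added since the list last satisfied an allocation. A minimal sketch, with hypothetical names (lowater, release_to_central, scavenge):

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Stand-in for handing a chunk back to the central cache (omitted).
void release_to_central(void* chunk) { (void)chunk; }

// Hypothetical per-size-class free list inside a thread cache.
struct FreeList {
  std::vector<void*> chunks;
  std::size_t lowater = 0;  // shortest length since the last scavenge

  void push(void* c) { chunks.push_back(c); }

  // Caller guarantees the list is non-empty.
  void* pop() {
    void* c = chunks.back();
    chunks.pop_back();
    lowater = std::min(lowater, chunks.size());
    return c;
  }

  // Chunks above the low-water mark were added since the last
  // allocation this list satisfied, so they are excess: release them.
  void scavenge() {
    while (chunks.size() > lowater) {
      release_to_central(chunks.back());
      chunks.pop_back();
    }
    lowater = chunks.size();
  }
};
```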

Central caches have their own scheme for managing space across all the caches in the system. Each is capped at either 1MB of chunks or one entry, whichever is greater.

As central caches need more space, they can "steal" it from their neighbours, using a similar mechanism to the one employed by thread caches. Objects are moved from central data structures into a thread-local cache as needed, and periodic garbage collections migrate memory back from a thread-local cache into the central data structures. Large objects are allocated directly from the central heap using a page-level allocator (a page is a 4K-aligned region of memory).
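Putting the cap and the stealing together, here is a hedged sketch of what that bookkeeping could look like. The round-robin victim choice and all names are assumptions for illustration, not TCMalloc's actual policy:

```cpp
#include <cstddef>

constexpr int kNumClasses = 88;                // hypothetical class count
constexpr std::size_t kMinCapBytes = 1 << 20;  // the 1MB cap from the text

// Hypothetical per-class capacity bookkeeping for a central cache.
struct CacheBook {
  std::size_t chunk_size = 1;  // bytes per chunk (set at initialization)
  std::size_t capacity = 0;    // current cap, in chunks
  std::size_t used = 0;        // chunks currently cached

  // Capped at 1MB of chunks or one entry, whichever is greater.
  std::size_t min_capacity() const {
    std::size_t by_bytes = kMinCapBytes / chunk_size;
    return by_bytes > 1 ? by_bytes : 1;
  }
};

CacheBook books[kNumClasses];

// When class `c` needs more space, take one unit of spare capacity
// from a neighbour that is above its own minimum (illustrative only).
bool steal_capacity(int c) {
  for (int i = 1; i < kNumClasses; ++i) {
    CacheBook& victim = books[(c + i) % kNumClasses];
    if (victim.capacity > victim.min_capacity() &&
        victim.capacity > victim.used) {
      --victim.capacity;
      ++books[c].capacity;
      return true;
    }
  }
  return false;
}
```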

A run of pages can be carved up into a sequence of small objects, each equally sized. For example, a run of one page (4K) can be carved up into 32 objects of 128 bytes each. Each small object size maps to one of approximately 170 allocatable size-classes. For example, all allocations in the range 961 to 1024 bytes are rounded up to 1024. The size-classes are spaced so that small sizes are separated by 8 bytes, larger sizes by 16 bytes, even larger sizes by 32 bytes, and so forth.
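A minimal sketch of that rounding; the exact thresholds where the spacing widens are assumptions here, and TCMalloc precomputes its class tables rather than computing this on the fly:

```cpp
#include <cstddef>

// Round a small request up to its size class. Illustrative spacing:
// 8-byte steps for small sizes, then 16-byte, then 32-byte steps.
std::size_t round_to_class(std::size_t size) {
  std::size_t step;
  if (size <= 128)      step = 8;
  else if (size <= 256) step = 16;
  else                  step = 32;
  return (size + step - 1) / step * step;  // round up to a multiple of step
}
// Example: round_to_class(13) == 16, so a 13-byte allocation is
// served from the 16-byte size class.
```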

When allocating a small object: (1) we map its size to the corresponding size-class; (2) we look in the corresponding free list in the thread cache for the current thread; (3) if the free list is not empty, we remove the first object from the list and return it. When following this fast path, TCMalloc acquires no locks at all.

If the free list is empty: (1) we fetch a bunch of objects from a central free list for this size-class (the central free list is shared by all threads); (2) we place them in the thread-local free list; (3) we return one of the newly fetched objects to the application.

If the central free list is also empty: (1) we allocate a run of pages from the central page allocator; (2) we split the run into a set of objects of this size-class; (3) we place the new objects on the central free list; (4) as before, we move some of these objects to the thread-local free list. The central page heap is again an array of free lists.

An allocation for k pages is satisfied by looking in the k-th free list. If that free list is empty, we look in the next free list, and so forth; eventually, we look in the last free list if necessary.
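A hedged sketch of that search, with hypothetical names; the real page heap also keeps a separate structure for runs longer than the largest list, and the 4KiB page size is taken from the text above:

```cpp
#include <cstddef>
#include <vector>

constexpr std::size_t kPageSize = 4096;  // page size from the text
constexpr std::size_t kMaxPages = 128;   // hypothetical largest list

struct PageRun {
  char* base;          // first byte of the run
  std::size_t npages;  // length of the run in pages
};

// free_lists[k] holds free runs of exactly k pages (index 0 unused).
std::vector<PageRun> free_lists[kMaxPages + 1];

// Satisfy a k-page allocation: try the k-th list, then each larger
// list, splitting off and re-filing the unused tail of a longer run.
bool alloc_pages(std::size_t k, PageRun& out) {
  for (std::size_t len = k; len <= kMaxPages; ++len) {
    if (free_lists[len].empty()) continue;
    PageRun run = free_lists[len].back();
    free_lists[len].pop_back();
    if (len > k) {
      PageRun rest{run.base + k * kPageSize, len - k};
      free_lists[rest.npages].push_back(rest);
    }
    out = PageRun{run.base, k};
    return true;
  }
  return false;  // out of lists: fetch more memory from the system (mmap)
}
```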

The heap managed by TCMalloc consists of a set of pages. A run of contiguous pages is represented by a Span object. A span is either allocated or free. In the diagram, span A covers two pages and span B covers three pages. Spans are used in the middle-end to determine where to place returned objects, and in the back-end to manage the handling of page ranges. A span contains a pointer to the base of the TCMalloc pages that the span controls.

For small objects those pages are divided into at most 2^16 objects. This value is selected so that within the span we can refer to objects by a two-byte index. This means that we can use an unrolled linked list to hold the objects.

For example, with eight-byte objects, each free object has room for four two-byte indexes: we can store the indexes of three ready-to-use objects and use the fourth slot to store the index of the next object in the chain.

This data structure reduces cache misses over a fully linked list. When we have no available objects for a size-class, we need to fetch a new span from the pageheap and populate it.
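A minimal sketch of that index-based layout for the eight-byte class; kEnd and the struct names are hypothetical:

```cpp
#include <cstddef>
#include <cstdint>

constexpr std::size_t kObjSize = 8;     // the 8-byte size class
constexpr std::uint16_t kEnd = 0xFFFF;  // sentinel: end of the chain

// A free 8-byte object doubles as a list node holding four two-byte
// indexes: three ready-to-use objects plus the next node in the chain.
struct Node {
  std::uint16_t ready[3];  // indexes of three immediately usable objects
  std::uint16_t next;      // index of the next node, or kEnd
};
static_assert(sizeof(Node) == kObjSize, "node must fit in one object");

// Objects are addressed by index relative to the span's base pointer.
Node* node_at(char* span_base, std::uint16_t index) {
  return reinterpret_cast<Node*>(span_base + index * kObjSize);
}

// Each node accounts for four free objects: itself plus ready[0..2].
std::size_t count_free(char* span_base, std::uint16_t head) {
  std::size_t n = 0;
  for (std::uint16_t i = head; i != kEnd; i = node_at(span_base, i)->next)
    n += 4;
  return n;
}
```

Walking this list visits one node per four objects, which is where the cache-miss saving over a one-pointer-per-node free list comes from.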

Note that TCMalloc pages do not necessarily correspond to the page size used in the TLB of the underlying hardware. A TCMalloc page either holds multiple objects of a particular size, or is used as part of a group to hold an object of size greater than a single page.

If an entire page becomes free it will be returned to the back-end (the pageheap) and can later be repurposed to hold objects of a different size or returned to the OS. Small pages are better able to handle the memory requirements of the application with less overhead. Small pages are also more likely to become free.

For example, a 4KiB page can hold eight 512-byte objects versus 64 objects on a 32KiB page, and there is much less chance of 64 objects being free at the same time than there is of eight becoming free. Large pages, on the other hand, result in less need to fetch and return memory from the back-end.

A single 32KiB page can hold eight times the objects of a 4KiB page, and this can result in the costs of managing the larger pages being smaller. It also takes fewer large pages to map the entire virtual address space.

TCMalloc has a pagemap which maps a virtual address onto the structures that manage the objects in that address range. Larger pages mean that the pagemap needs fewer entries and is therefore smaller. Consequently, it makes sense for applications with small memory footprints, or that are sensitive to memory footprint, to use smaller TCMalloc page sizes.
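A hedged sketch of such a lookup, using a two-level radix array over a 48-bit address space; the constants, depth, and names here are assumptions, not TCMalloc's actual pagemap:

```cpp
#include <cstdint>

constexpr int kPageShift = 13;  // assume 8KiB TCMalloc pages
constexpr int kLeafBits = 15;   // low bits of the page number per leaf

struct Span;  // manages the run of pages covering an address

// Two-level radix array mapping page number -> Span*. Raising
// kPageShift (larger TCMalloc pages) shrinks both levels of the map.
struct PageMap {
  Span** root[1 << (48 - kPageShift - kLeafBits)] = {};  // upper bits

  Span* lookup(std::uintptr_t addr) const {
    std::uintptr_t page = addr >> kPageShift;
    Span** leaf = root[page >> kLeafBits];        // may be unpopulated
    if (leaf == nullptr) return nullptr;
    return leaf[page & ((1u << kLeafBits) - 1)];  // lower bits
  }
};
```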

Applications with large memory footprints are likely to benefit from larger TCMalloc page sizes. The legacy pageheap is an array of free lists for particular lengths of contiguous pages of available memory. An allocation for k pages is satisfied by looking in the k-th free list.

If that free list is empty, we look in the next free list, and so forth; eventually, we look in the last free list if necessary. If that fails, we fetch memory from the system with mmap. When a range of pages is returned to the pageheap, the adjacent pages are checked to determine whether they now form a contiguous free region; if that is the case, the pages are concatenated and placed into the appropriate free list.
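Complementing the allocation sketch above, here is a minimal sketch of that coalescing step; span_of and the free-list helpers are hypothetical stand-ins for the pagemap and free-list plumbing:

```cpp
#include <cstddef>

struct Span {
  std::size_t start;   // first page number of the run
  std::size_t npages;  // length of the run in pages
  bool free;
};

// Hypothetical helpers backed by the pagemap and the free lists.
Span* span_of(std::size_t page);      // span covering a page, or nullptr
void remove_from_free_list(Span* s);
void insert_into_free_list(Span* s);  // files s under its new length

// Return a span to the pageheap, merging with free neighbours so
// that adjacent free pages end up in one longer run.
void free_span(Span* s) {
  Span* prev = (s->start > 0) ? span_of(s->start - 1) : nullptr;
  if (prev != nullptr && prev->free) {
    remove_from_free_list(prev);
    s->start = prev->start;     // grow the run to the left
    s->npages += prev->npages;
  }
  Span* next = span_of(s->start + s->npages);
  if (next != nullptr && next->free) {
    remove_from_free_list(next);
    s->npages += next->npages;  // grow the run to the right
  }
  s->free = true;
  insert_into_free_list(s);
}
```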

The objective of the hugepage-aware allocator is to hold memory in hugepage-size chunks; on x86 a hugepage is 2MiB in size. To do this the back-end has three different caches: the filler cache, which holds hugepages from which some memory has already been allocated; the region cache, which handles allocations larger than a hugepage, where several such allocations can be packed into a contiguous region; and the hugepage cache, which handles large allocations of at least a hugepage.

Additional information about the design choices made in the HPAA is discussed in a dedicated design doc. TCMalloc will reserve some memory for metadata at startup. The amount of metadata grows as the heap grows: in particular, the pagemap grows with the virtual address range that TCMalloc uses, and the spans grow as the number of active pages of memory grows.

The address space is reserved, but not backed by physical memory until it is used. If another allocator is also in play, the binary may have allocated some objects using the system malloc and may try to pass them to TCMalloc for deallocation; TCMalloc will not be able to handle such objects.

Objects are cached, depending on mode, either per-thread or per-logical-CPU. Most allocations do not need to take locks, so there is low contention and good scaling for multi-threaded applications. Memory use is flexible: freed memory can be reused for different object sizes, or returned to the OS.


