Why is Malloc Different Under uClinux?
By David McCullough
davidm@snapgear.com
Firstly, under uClinux we don't have Virtual Memory (VM.)
This means
you cannot arbitrarily add memory to an already running process. As VM is
usually implemented using a processing unit called a MMU (memory
management unit) you will often hear the term NOMMU when traveling in
uClinux circles. Let's discuss at a very high level how not having an MMU
or VM affects malloc.
With VM, all processes run at the same address (albeit a virtual one)
and the VM system takes care of what physical memory is actually mapped to
these locations. So even though the virual memory that the process sees
is contiguous, the physical memory it occupies can be scattered about. Some
of it may even be on a hard disk in swap!
Without VM, each process must be located at a place in memory where it
can be run. In the simplest case, this area of memory must be contiguous,
and generally cannot be expanded as there may be other processes above and
below it.
With respect to the original topic, malloc is normally implemented at the low level
using the sbrk/brk system calls which increase/change the size of a process's address space.
The malloc library then manages the extra memory obtained by calling sbrk()
on behalf of the application. When it runs out of memory it can get more by calling sbrk() again, it can also decrease memory using brk(), although very few malloc implementations do this. sbrk() works by adding more memory to the end of a process (increasing its size). brk() can arbitrarily set the end of the process to be closer to the start of the process (reduce the process size) or futher away (increase the process size). The standard operation of sbrk()/brk() is to increase/decrease a processes size.
Since sbrk/brk cannot increase a processes size under uClinux, malloc
needs to undergo some changes at least at the low level in order to work.
There are many possibilities here. Some obvious ones are:
- Do not allow processes to allocate dynamic memory.
- Allocate a heap per process that sbrk/brk can operate on. The heap size
cannot be changed by the process and is allocated at process startup.
- Let processes allocate memory from the global pool (system wide) of free
memory.
Simply put, 1) is just not useful. Too much application software assumes
the ability to allocate memory and preventing it just increases the
effort required to utilise the vast amount of Open Source software (OSS).
Option 2) has its merits. It limits the amount of memory a process can use
(which can be considered a good thing). But it also means that the memory
for the heap is always allocated, even if it is only used temporarily.
The heap space has to be allocated at process start time so that the
semantics of sbrk/brk still hold and the normal malloc library can be used.
Finally we come to 3). There are pitfalls here. A run away process can use all of
the available memory for one. Allocating from the system pool is not
compatible with sbrk/brk as they require memory to be added to the end of
a processes address space. Thus a normal malloc implementation is no good
and a new implementation is needed. The advantage of this method is
that only the amount of memory actually required is used. Memory can be returned to the global
pool as soon as it's finished with and the implementation can take advantage
of the existing kernel allocator for managing this memory.
Currently method 3 is used by uClinux. The simplest malloc implementation
calls mmap to obtain memory from the kernel's free memory pool and munmap to return memory to the kernel's free memory pool. This
gives a very small malloc implementation (1 system call).
Problems encountered with such a simple malloc implentation include:
- The overheads of using mmap under uClinux are about 56 bytes per
allocation. This turns out to be extremely bad for applications that do a
large number of small allocations (like the Zebra routing daemon).
Under uClinux, mmaps from user programs are simply allocated
from the pool of free memory using kmalloc, the kernel allocator,
and then chained into a linked list attached to the process. The 56
bytes above comes from the kmalloc overhead plus the linked list
overheads.
Not only is the 56 bytes significant, but a large number of small
allocations will result in quite a long list leading to slower
deallocations and reallocations (new allocations are put directly at
the head of the list).
- The standard kernel allocator allocates only power-of-two sized quantities
up to 1Mb, which is inefficient and limiting. To understand the
ramifications of this, especially for large allocations, consider that
a 33K allocation is rounded to the next power of two, 64K!
Several small malloc implementations have been created to alleviate point 1 by
reducing the effect of the 56 byte overhead by allocating larger blocks
and then managing those blocks internally for better results.
The kernel allocator has been modified to increase the maximum allocation
size substantially. In some cases this is done by a kernel config option.
This allows for larger allocations and thus larger programs.
An alternative kernel allocator has been added to uClinux that no longer enforces power of two allocations, thus eliminating the wastage in
allocations quite substantially. This allocator is commonly known as
Kmalloc2 and it can substantially reduce the overheads for memory allocation in a NOMMU environment and increase the amount of free memory available for other tasks.
The standard kernel allocator only allocates memory based on a power of two.
For example, if you need 12000 bytes, it will allocate 16Kb, and you
cannot make use of the extra 4Kb that was allocated. This can be very
expensive when you start an application. For example an application of
130Kb in size will actually need 256Kb just to run.
Kmalloc2 addresses this by using a power-of-two allocator for allocations
up to 1 page in size (a page is 4096 bytes, or 4Kb). It then allocates
memory rounded up to the nearest page. So for the previous example,
an application of 130Kb will actually have 132Kb allocated to it. A saving
of over 124Kb on the standard kernel allocator.
Kmalloc2 also takes steps to avoid fragmenting memory. It allocates all
amounts of 2 pages (8kB) or less from the end of memory down, and all
larger amounts from the start of free memory up. This stops transient
allocations for network buffers and so on fragmenting memory and preventing
large apps from running.
Kmalloc2 is not perfect, it is quite susceptible to fragmentation of
memory, although it could be argued that the standard allocator is more
susceptible. Kmalloc2 works well in practice as the embedded environments
that run uClinux tend to have a relatively static group of long lived applications.
Now we have discussed the kernel level options for memory allocation a
little, lets look at the options available to user programs. Due to the
issues mentioned thus far, there are quite a few solutions available
for use with user programs. Each has its place.
For a start there are a choice of "libc"s -- a topic for another discussion.
The choice of "malloc"s usually depends on the libc you are using: uC-libc or uClibc.
Both offer a simple allocator, malloc-simple.
malloc-simple uses mmap and munmap to let the kernel actually handle the requests for memory.
The implementation is trivial and the code size is negligible, so
the cost of including it in an application is very small.
The problem with malloc-simple is, as mentioned earlier, that the kernel
overhead on an mmap based allocation is about 56 bytes. So if you have an
application that does a lot of small allocations, its memory usage will
be quite high. For example, say your program allocates 1000 10 byte amounts, that's a total of 10000 bytes of memory that is required. Because of the 56 byte overhead, it will actually allocate 66000 bytes, a 560% increase over what was actually required. Zebra, a routing daemon, is an example of an application that suffers quite badly from the small allocation problem as it allocates command data structures as it starts (it has a large number of commands/keywords!).
uC-libc used to offer a libsmalloc, a version of malloc that specialises in low overhead small allocations specifically for applications like Zebra. This version of malloc has since been merged with malloc-simple as the extra code required is no longer a significant overhead when using shared libraries for uClinux.
uClibc offers a few choices as well:
| malloc |
A hunk based allocator that appears to have NOMMU support,
although it uses mmap functionality not available on NOMMU
systems. At this point it appears that this allocator will not run on NOMMU systems. It may be relatively simple to fix this, and if so, it should have some potential. |
| malloc-930716 |
This allocator will only work on MMU systems. It relies on the
brk/sbrk calls. Although these will return a small amount of
memory on uClinux systems, it is useless as the sole source of
memory for an allocator. |
| malloc-simple |
The most simple allocator, works on both MMU and NOMMU systems. |
Generally the more complex mallocs deliver faster allocations and are more
efficient for small allocations. They also add considerable code size to
small applications such as one would find in a uClinux environment.
There are a few choices for malloc. Which one should you use? malloc-simple is generally a good default malloc to use. Then you can choose
a better malloc for the individual applications that need it. As you can
see above, your choices are limited if you want something that works out
of the box.
Inevitably, despite your choice of kernel/user allocations, you will run out
of memory. One of the common problems new users encounter is the "missing
memory" problem. The system is showing a large amount of free memory but
my application cannot allocate a buffer of size X. The problem here is
memory fragmentation, and all of the solutions available at
this time suffer from it. Because of the lack of VM in the uClinux
environment it is near impossible to fully utilise memory due to
fragmentation. This is best explained by example. Suppose your system has
500Kb of memory free and you want to allocate 100Kb. It is easy to think
that this would be possible. However, it is important to remember that
you must have a contiguous 100Kb of memory in order to satisfy the
allocation. Suppose the memory map looks like this. Each character
represents approximately 20Kb. X marks areas that are allocated or in use
by other programs or the kernel:
0 100 200 300 400 500 600 700 800 900 1000
-+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+--
|XXXXX|XXXXX|---XX|--X--|-X---|XX---|-X---|-XX--|-X---|XXXXX|
As you can see, there is 500Kb free, but the largest contiguous block
is only 80Kb. There are many ways to arrive at such a situation. A
program that allocates some memory, then frees most of it leaving a small
allocation in the middle of a larger free block is often the cause.
Transient programs under uClinux can also affect where and how memory is
allocated.
The question is often asked, why can't this memory be defragmented. The
problem is that we don't have VM and we cannot move memory that is being
used by programs. Programs usually have references to addresses within the
allocated memory regions and without VM to make the memory always appear to be at the
correct address, the program will crash if we move its memory. There is
no solution to this problem under uClinux. The developer needs to be aware
of the problem and, where possible, try to utilise smaller blocks of
memory.
Memory allocation under uClinux, as we have seen, is quite similar to
memory allocation under normal Linux but has its own set of quirks and shortfalls.
Further progress will no doubt be made on the memory allocation
front now that uClinux is enjoying its first shared library implementations.
This reduces the requirement to keep the malloc implementation as small as
possible as it can live in a shared library. This will lead to larger, but
more efficient user level malloc implementations that help to reduce the
problems of fragmentation and malloc overheads.
Thanks to Phil Wilshire <philwil@earthlink.net> for prompting this response
to some questions.
References (from the uClinux-distribution sources available for download
at www.uclinux.org/pub/uClinux/dist)
Kernel allocators and mmap implementations:
- linux-2.4.x/mmnommu/slab.c
- linux-2.4.x/mmnommu/page_alloc.c
- linux-2.4.x/mmnommu/page_alloc2.c
- linux-2.4.x/mmnommu/mmap.c
- linux-2.0.x/mmnommu/kmalloc.c
- linux-2.0.x/mmnommu/page_alloc.c
- linux-2.0.x/mmnommu/kmalloc2.c
- linux-2.0.x/mmnommu/page_alloc2.c
- linux-2.0.x/mmnommu/mmap.c
uC-libc malloc implementations:
- lib/libc/malloc/alloc.c
- lib/libc/malloc-simple/alloc.c
uClibc malloc implementations:
- uClibc/libc/stdlib/malloc
- uClibc/libc/stdlib/malloc-930716
- uClibc/libc/stdlib/malloc-simple
Further information on uClinux
Further information on SecureEdgeTM Development Platforms
Further information on SnapGear® VPN Routers
SnapGear and SecureEdge are trademarks of SnapGear Inc. All other trademarks are the property of their respective owners.
Further Technical Bulletins
|