uClinux - Memory Management
An engineer from SnapGear has successfully reduced the footprint required by uClinux in their line of OEMable SecureEdge Internet security appliances. The improvements mean yet another competitive advantage to OEM partners using the units as the basis for their own products.
The improvements came about when an engineer was examining "Zebra", the GNU routing solution which provides RIPv1, RIPv2, OSPF and BGP and has been ported to the SecureEdge platform. Although the binary itself wasn't remarkably large, the run-time memory requirements of the RIP routing daemon were huge - around 1MB. The curiosity was that Zebra itself only allocated 35K of real memory - where was the rest going?
The first observation was that uClinux/Coldfire uses mmap for all its memory allocations. This makes the "malloc" implementation very small and very simple, but not very efficient: each "mmap" call has about 56 bytes of overhead. Zebra was generating around 2300 small allocations of 8 to 32 bytes each, and so the run-time usage skyrocketed. In embedded applications developers must zealously guard every byte, and that is not always the habit in conventional applications that have migrated over from virtual memory systems offering literally gigabytes of memory.
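To put the quoted figures in proportion, the bookkeeping cost can be worked out directly. The sketch below is only arithmetic on the numbers in the text (56 bytes per mmap call, about 2300 allocations of 8 to 32 bytes); the function names are hypothetical, and it does not account for any further rounding done by the kernel allocator.

```c
#include <assert.h>
#include <stddef.h>

/* Approximate per-call bookkeeping cost quoted in the text. */
#define MMAP_OVERHEAD_BYTES 56u

/* Total bookkeeping bytes for `count` separate mmap-backed allocations. */
static size_t mmap_overhead_total(size_t count)
{
    return count * MMAP_OVERHEAD_BYTES;
}

/* Bookkeeping as a percentage of the payload for one allocation. */
static unsigned overhead_percent(size_t payload_bytes)
{
    assert(payload_bytes > 0);
    return (unsigned)(MMAP_OVERHEAD_BYTES * 100u / payload_bytes);
}
```

With these figures, 2300 allocations carry 128,800 bytes of bookkeeping, which is 175% of the payload for a 32-byte request and 700% for an 8-byte request.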
David McCullough, an experienced engineer, first tried a new malloc - one that would use mmap more efficiently. He wrote a new malloc with a very low per-allocation overhead (4 bytes) that only used mmap to allocate kernel-friendly blocks. As the page size was 4K and the overhead of the kernel allocation was 24 bytes, allocations were made in blocks of 4072 bytes.
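The idea of carving small requests out of kernel-friendly blocks can be sketched as follows. This is not the SnapGear allocator, only a minimal illustration under stated assumptions: the name `sketch_alloc` is hypothetical, freeing is omitted, payloads are only 4-byte aligned (reasonable for a 32-bit target), and the unused tail of a block is simply abandoned when a new block is mapped.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>

#define PAGE_BYTES      4096u                          /* page size from the text */
#define KERNEL_OVERHEAD 24u                            /* kernel overhead from the text */
#define CHUNK_BYTES     (PAGE_BYTES - KERNEL_OVERHEAD) /* 4072 */
#define HEADER_BYTES    4u                             /* per-allocation overhead */

static unsigned char *chunk_cur;  /* next free byte in the current block */
static size_t chunk_left;         /* bytes still free in the current block */
static unsigned chunks_mapped;    /* number of mmap calls made so far */

/* Hypothetical bump allocator: returns NULL for zero-sized or oversized
 * requests, or if mmap fails. There is deliberately no free(). */
static void *sketch_alloc(size_t size)
{
    uint32_t header;
    size_t need;
    void *payload;

    if (size == 0 || size > CHUNK_BYTES - HEADER_BYTES)
        return NULL;
    /* Header plus payload rounded up to a multiple of 4 bytes. */
    need = HEADER_BYTES + ((size + 3u) & ~(size_t)3u);

    if (need > chunk_left) {
        void *p = mmap(NULL, CHUNK_BYTES, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (p == MAP_FAILED)
            return NULL;
        chunk_cur = p;
        chunk_left = CHUNK_BYTES;
        chunks_mapped++;
    }

    header = (uint32_t)size;                 /* record the requested size */
    memcpy(chunk_cur, &header, HEADER_BYTES);
    payload = chunk_cur + HEADER_BYTES;
    chunk_cur += need;
    chunk_left -= need;
    return payload;
}
```

In this sketch many small requests share a single mmap call, so the per-call kernel overhead is paid once per 4072-byte block rather than once per request.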
This gave a pretty good gain for Zebra/Ripd but didn't help any of the standard applications. The reasons for this were two-fold. Firstly, most of the other applications made very few calls to malloc, and secondly, the new malloc implementation added a noticeable amount of extra code to each statically linked binary.
David then tried a different tack: using the slack space in each process. Most processes have a significant amount of unused space at the end of their memory region, and it can be quite substantial because the kernel allocates memory on a power of 2 basis. For example, a process of 6k needs to allocate 8k, and a process of 129k needs to allocate 256k. The larger the process, the more potential room there is to play with. By implementing a simple version of the standard process heap, using the brk() and sbrk() system calls to access this memory, each process could use its own slack first and, when that was exhausted, simply switch back to using mmap. This was an easy extension to the malloc already written.
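The "heap first, mmap when exhausted" policy described above can be sketched as a single helper. The names `grab_memory` and `mem_source` are hypothetical, and the sketch assumes a POSIX-like system that provides sbrk(); on uClinux the brk region would be bounded by the process slack, whereas on a desktop Linux system sbrk() rarely fails, so the fallback path is seldom taken there.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

/* Which mechanism satisfied the request (hypothetical type). */
typedef enum { SRC_NONE, SRC_BRK, SRC_MMAP } mem_source;

/* Try to extend the process heap with sbrk(); if that fails, fall back
 * to an anonymous mmap. Returns NULL if both fail or bytes is 0. */
static void *grab_memory(size_t bytes, mem_source *src)
{
    void *p;

    assert(src != NULL);
    *src = SRC_NONE;
    if (bytes == 0 || bytes > (size_t)INTPTR_MAX)
        return NULL;

    p = sbrk((intptr_t)bytes);          /* returns the old break on success */
    if (p != (void *)-1) {
        *src = SRC_BRK;
        return p;
    }

    p = mmap(NULL, bytes, PROT_READ | PROT_WRITE,
             MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return NULL;
    *src = SRC_MMAP;
    return p;
}
```

An allocator built on such a helper would take its blocks from the heap region for as long as sbrk() succeeds and only pay the cost of a separate mapping afterwards.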
Again this produced further gains on the Zebra/Ripd front as quite a bit of allocating goes on, as previously noted. Unfortunately, for all the reasons that the first modification to malloc wasn't a complete success this one wasn't either - when all applications were compiled against the new allocator the gains were noticeable, but still not great.
This left one avenue to explore - finding an alternative to the power of 2 allocator. This was the most difficult stage, as abandoning the power of 2 allocator meant completely redoing the way the kernel allocates its memory. The allocator was changed to a simple page based allocator that allocates only the number of pages needed to satisfy the request. For example, a process of 6k still needs to allocate 8k, but a process of 129k needs to allocate only 132k. This avenue showed promise, as on average only about 2k (half of a 4K page) was wasted per allocation. Retaining the previous brk()/sbrk() modification, so that even this remaining slack is put to use, resulted in a very efficient solution.
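The difference between the two rounding policies can be checked with a few lines of arithmetic. The sketch below assumes the 4K page size mentioned earlier; the function names are hypothetical and model only the rounding, not the kernel allocator itself.

```c
#include <assert.h>
#include <stddef.h>

#define PAGE_BYTES 4096u   /* page size quoted in the text */

/* Round a request up to the next power of 2 (the old policy). */
static size_t round_up_pow2(size_t bytes)
{
    size_t n = 1;

    assert(bytes > 0 && bytes <= ((size_t)1 << 30));
    while (n < bytes)
        n <<= 1;
    return n;
}

/* Round a request up to a whole number of pages (the new policy). */
static size_t round_up_pages(size_t bytes)
{
    assert(bytes > 0 && bytes <= ((size_t)1 << 30));
    return ((bytes + PAGE_BYTES - 1) / PAGE_BYTES) * PAGE_BYTES;
}

/* Bytes allocated but not requested. */
static size_t slack(size_t allocated, size_t requested)
{
    return allocated - requested;
}
```

For the 129k process in the example, the power of 2 policy leaves 127k of slack (256k allocated), while the page based policy leaves 3k (132k allocated). For the 6k process both policies allocate 8k.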
On the first boot-up 200K more free memory was available - in embedded applications this can be make or break. The SecureEdge unit went from 1MB to 1.2MB of free memory. Uploading to a test box produced the following results:
| Bytes | total: | used: | free: | shared: | buffers: | cached: |
| Before: | 2662400 | 2043904 | 618496 | 0 | 331776 | 77824 |
| After: | 2646016 | 1564768 | 1077248 | 0 | 307200 | 49152 |
In effect, free memory on the test box increased by roughly 74% - an astounding jump. Furthermore, slack space in the allocator has decreased from around 30% to as low as 5%. One immediate benefit of these gains is the ability to run more PPTP and PPP sessions than ever before - a decidedly useful factor in a unit that needs to service large numbers of VPN clients. By looking at the process table you can see where the gains happened:
Before:
| PID | STAT | SIZE | SHARED | %CPU | COMMAND |
| 1 | S | 39K | 0K | 0.0 | init |
| 2 | S | 7K | 0K | 0.0 | kflushd |
| 20 | S | 39K | 0K | 0.0 | /bin/inetd |
| 21 | S | 23K | 0K | 0.0 | /bin/flatfsd |
| 22 | S | 211K | 0K | 0.0 | /bin/diald |
| 23 | S | 71K | 0K | 0.0 | /bin/pptpd |
| 24 | S | 39K | 0K | 0.0 | /bin/gettyd |
| 25 | S | 90K | 0K | 0.0 | /bin/boa |
| 299 | S | 264K | 0K | 0.0 | /bin/pppd |
| 300 | S | 40K | 0K | 0.0 | /bin/chat |
| 28 | S | 55K | 0K | 0.0 | /bin/dhcpd |
| 288 | S | 263K | 0K | 0.0 | /bin/pppd |
| 457 | R | 39K | 0K | 0.2 | /bin/telnetd |
| 458 | R | 74K | 0K | 0.8 | sh |
After:
| PID | STAT | SIZE | SHARED | %CPU | COMMAND |
| 1 | S | 31K | 0K | 0.0 | init |
| 2 | S | 7K | 0K | 0.4 | kflushd |
| 20 | S | 27K | 0K | 0.0 | /bin/inetd |
| 21 | S | 23K | 0K | 0.0 | /bin/flatfsd |
| 22 | S | 183K | 0K | 0.2 | /bin/diald |
| 23 | S | 47K | 0K | 0.0 | /bin/pptpd |
| 24 | S | 27K | 0K | 0.0 | /bin/gettyd |
| 33 | S | 155K | 0K | 0.4 | /bin/pppd |
| 26 | S | 64K | 0K | 0.0 | /bin/boa |
| 27 | S | 156K | 0K | 0.1 | /bin/pppd |
| 29 | S | 35K | 0K | 0.0 | /bin/dhcpd |
| 32 | S | 40K | 0K | 0.0 | /bin/chat |
| 36 | R | 39K | 0K | 1.3 | /bin/telnetd |
| 37 | R | 54K | 0K | 4.3 | sh |
Normally a memory allocator of this sort would suffer from fragmentation, polluting the memory until no more large allocations could be made. Fortunately, in our embedded environment this hasn't been the case. Most of the large allocations are for processes that, once started, run until reboot. Once the units are up, any allocations are usually small and can be satisfied with 1 or 2 pages, so fragmentation should not be a major problem. In practice we have found that fragmentation is limited and the benefits of more efficient allocation of memory far outweigh any disadvantages.