Embedded Linux and uClinux Developer Forum
 
uClinux - Memory Management

An engineer from SnapGear has successfully reduced the footprint required by uClinux in their line of OEMable SecureEdge Internet security appliances. The improvements mean yet another competitive advantage to OEM partners using the units as the basis for their own products.

The improvements came about when an engineer was examining "Zebra", the GNU routing solution which provides RIPv1, RIPv2, OSPF and BGP and has been ported to the SecureEdge platform. Although the binary itself wasn't remarkably large, the run-time memory requirements for a RIP route daemon were huge, at around 1MB. The curious part was that Zebra only allocated 35K of real memory, so where was the rest going?

The first thing to note was that uClinux/Coldfire uses mmap for all its memory allocations. This makes the "malloc" implementation very small and very simple, but not very efficient. Each "mmap" call has about 56 bytes of overhead. Zebra was generating around 2300 small allocations of 8 to 32 bytes each, so the overhead was several times larger than the data being stored and the run-time usage skyrocketed. In embedded applications developers must zealously guard every byte, and that is not always the habit in conventional applications that have migrated over from virtual memory systems offering literally gigabytes of memory.
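To get a feel for why small allocations are so costly under this scheme, the following Python sketch tallies payload against bookkeeping using only the figures quoted above (about 56 bytes of overhead per mmap call, roughly 2300 allocations of 8 to 32 bytes). The function name and the sample sizes are illustrative assumptions, and the model ignores any rounding the kernel allocator applies on top.

```python
MMAP_OVERHEAD = 56   # bytes of overhead per mmap call (figure quoted in the article)
ALLOCATIONS = 2300   # approximate number of small allocations made by Zebra


def mmap_cost(payload_bytes, count=ALLOCATIONS, overhead=MMAP_OVERHEAD):
    """Return (payload_total, overhead_total) in bytes for `count` allocations."""
    return payload_bytes * count, overhead * count


# Hypothetical sample sizes spanning the 8 to 32 byte range mentioned above.
for size in (8, 16, 32):
    payload, overhead = mmap_cost(size)
    print(f"{size:2d}-byte allocations: {payload} bytes payload, "
          f"{overhead} bytes overhead ({overhead / payload:.1f}x)")
```

Under these assumptions the overhead alone is larger than the data for every size in the range, which is the inefficiency the new malloc set out to remove.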

David McCullough, an experienced engineer, sought first to try a new malloc, one that would use mmap more efficiently. He wrote a new malloc with a very low per-allocation overhead (4 bytes) that only used mmap to allocate kernel-friendly blocks. As the page size was 4K and the overhead of the kernel allocation was 24 bytes, allocations were made in blocks of 4072 bytes (4096 - 24).
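The block arithmetic above can be sketched as follows. This is a simplified model, not the actual malloc: it assumes every allocation is the same size, that each one carries only the 4-byte header, and that nothing is ever freed or padded for alignment. The constant and function names are hypothetical.

```python
PAGE_SIZE = 4096                           # 4K page size quoted in the article
KERNEL_OVERHEAD = 24                       # kernel allocation overhead in bytes
BLOCK_SIZE = PAGE_SIZE - KERNEL_OVERHEAD   # usable bytes per block
HEADER = 4                                 # per-allocation overhead of the new malloc


def blocks_needed(payload_bytes, count):
    """Number of 4072-byte blocks needed for `count` equal-sized allocations."""
    per_block = BLOCK_SIZE // (payload_bytes + HEADER)
    return -(-count // per_block)          # ceiling division


for size in (8, 16, 32):
    blocks = blocks_needed(size, 2300)
    print(f"{size:2d}-byte allocations: {blocks} blocks, "
          f"{blocks * PAGE_SIZE} bytes of pages")
```

In this model the 2300 allocations fit in a few dozen pages at most, rather than requiring one mmap call each.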

This gave a pretty good gain for Zebra/Ripd but didn't help any of the standard applications. The reasons for this were twofold. Firstly, most of the other applications made very few calls to malloc, and secondly, the malloc implementation added a noticeable amount of extra code to each statically linked binary.

David then tried a different tack: using the slack space in each process. Most processes have a significant amount of unused space at the end of their memory space. This can be quite substantial, as the kernel allocates memory on a power of 2 basis. For example, a process of 6k needs to allocate 8k, and a process of 129k needs to allocate 256k. The larger the process, the more potential room there is to play with. Implementing a simple version of the standard process heap, using the brk() and sbrk() system calls to access this memory, meant that each process could use its slack space first and, once that was exhausted, switch back to using mmap. This was an easy extension to the malloc already written.
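The slack described above, and the idea of serving requests from it before falling back to mmap, can be modelled in a few lines. This is a toy sketch under stated assumptions: sizes are whole kilobytes in the rounding helpers, and the `SlackHeap` class is a hypothetical stand-in for the brk()/sbrk() heap, not the code that was written for uClinux.

```python
def power_of_two_size(request_kb):
    """Smallest power of 2 (in KB) that is at least `request_kb`."""
    size = 1
    while size < request_kb:
        size *= 2
    return size


def slack_kb(request_kb):
    """Unused space (in KB) left at the end of a power-of-2 allocation."""
    return power_of_two_size(request_kb) - request_kb


class SlackHeap:
    """Toy model: satisfy requests from the slack first, then fall back to mmap."""

    def __init__(self, slack_bytes):
        self.remaining = slack_bytes

    def allocate(self, nbytes):
        if nbytes <= self.remaining:
            self.remaining -= nbytes   # analogous to growing the heap with sbrk()
            return "heap"
        return "mmap"                  # slack cannot hold it; use mmap as before


for kb in (6, 129):
    print(f"{kb}k process -> {power_of_two_size(kb)}k allocated, {slack_kb(kb)}k slack")
```

For the two examples in the text this gives 2k and 127k of slack respectively, which is why larger processes have more room to play with.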

Again this produced further gains on the Zebra/Ripd front, since quite a bit of allocating goes on there, as previously noted. Unfortunately, this modification fell short for the same reasons as the first one: when all applications were compiled against the new allocator the gains were noticeable, but still not great.

This left one avenue to explore: finding an alternative to the power of 2 allocator. This was the most difficult stage, as ceasing to use a power of 2 allocator meant completely redoing the way the kernel allocates its memory. The allocator was changed to a simple page-based allocator that allocates only the number of pages needed to satisfy the request. For example, a process of 6k needs to allocate 8k, and a process of 129k needs to allocate 132k. This avenue showed promise, as on average only about 2k (half a page) was wasted per allocation. Retaining the previous brk()/sbrk() modification, which puts even that remaining slack to use, produced a very efficient solution.
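A short sketch makes the difference between the two rounding strategies concrete. It assumes the 4K page size mentioned earlier and works in bytes; the function names are illustrative, and the average is taken over request sizes spread evenly across one page, which is an assumption rather than a measurement.

```python
PAGE = 4096   # 4K page size quoted earlier in the article


def power_of_two_bytes(n_bytes):
    """Size allocated by a power-of-2 allocator for a request of n_bytes."""
    size = 1
    while size < n_bytes:
        size *= 2
    return size


def page_rounded_bytes(n_bytes):
    """Size allocated by a page-based allocator for a request of n_bytes."""
    pages = -(-n_bytes // PAGE)   # ceiling division
    return pages * PAGE


for kb in (6, 129):
    n = kb * 1024
    print(f"{kb}k: power of 2 -> {power_of_two_bytes(n) // 1024}k, "
          f"page based -> {page_rounded_bytes(n) // 1024}k")

# Average waste of the page-based allocator for request sizes of 1..4096 bytes.
average_waste = sum(page_rounded_bytes(n) - n for n in range(1, PAGE + 1)) / PAGE
print(f"average waste: {average_waste} bytes")
```

The average comes out at just under 2048 bytes, matching the "about 2k" figure, whereas the power of 2 allocator can waste almost as much again as the request itself.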

On the first boot-up, 200K more free memory was available, which in embedded applications can be make or break. The SecureEdge unit went from 1MB to 1.2MB of free memory. Loading the new build onto a test box produced the following results:

KB       total:   used:    free:    shared:  buffers:  cached:
Before:  2662400  2043904  618496   0        331776    77824
After:   2646016  1564768  1077248  0        307200    49152
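The change in free memory can be worked out directly from the table. The snippet below is only arithmetic on the two "free" values, expressed in whatever unit the table uses; the variable names are illustrative.

```python
free_before = 618496    # "free" column, Before row
free_after = 1077248    # "free" column, After row

gain = free_after - free_before
percent = 100 * gain / free_before

print(f"free memory gained: {gain}")
print(f"relative increase: {percent:.1f}%")
```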

In effect, free memory increased by more than 70%, an astounding jump. Furthermore, slack space in the allocator has decreased from around 30% to as low as 5%. One immediate benefit of these gains is the ability to run more PPTP and PPP sessions than ever before, a decidedly useful factor in a unit that needs to service large numbers of VPN clients. By looking at the process table you can see where the gains happened:

Before:

PID        STAT  SIZE  SHARED  %CPU  COMMAND
1          S     39K   0K      0.0   init
2          S     7K    0K      0.0   kflushd
20         S     39K   0K      0.0   /bin/inetd
21         S     23K   0K      0.0   /bin/flatfsd
22         S     211K  0K      0.0   /bin/diald
23         S     71K   0K      0.0   /bin/pptpd
24         S     39K   0K      0.0   /bin/gettyd
25         S     90K   0K      0.0   /bin/boa
299        S     264K  0K      0.0   /bin/pppd
300        S     40K   0K      0.0   /bin/chat
28         S     55K   0K      0.0   /bin/dhcpd
288  S0    S     263K  0K      0.0   /bin/pppd
457        R     39K   0K      0.2   /bin/telnetd
458  p1    R     74K   0K      0.8   sh

After:

PID  STAT  SIZE  SHARED  %CPU  COMMAND
1    S     31K   0K      0.0   init
2    S     7K    0K      0.4   kflushd
20   S     27K   0K      0.0   /bin/inetd
21   S     23K   0K      0.0   /bin/flatfsd
22   S     183K  0K      0.2   /bin/diald
23   S     47K   0K      0.0   /bin/pptpd
24   S     27K   0K      0.0   /bin/gettyd
33   S     155K  0K      0.4   /bin/pppd
26   S     64K   0K      0.0   /bin/boa
27   S     156K  0K      0.1   /bin/pppd
29   S     35K   0K      0.0   /bin/dhcpd
32   S     40K   0K      0.0   /bin/chat
36   R     39K   0K      1.3   /bin/telnetd
37   R     54K   0K      4.3   sh
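To see where the savings come from, the SIZE columns of the two process tables above can be totalled per command. The data below is copied from those tables; the helper name is an illustrative assumption, and the comparison treats the two snapshots as directly comparable, which is only approximately true.

```python
# (command, SIZE in K) pairs taken from the Before and After process tables.
before = [("init", 39), ("kflushd", 7), ("/bin/inetd", 39), ("/bin/flatfsd", 23),
          ("/bin/diald", 211), ("/bin/pptpd", 71), ("/bin/gettyd", 39),
          ("/bin/boa", 90), ("/bin/pppd", 264), ("/bin/chat", 40),
          ("/bin/dhcpd", 55), ("/bin/pppd", 263), ("/bin/telnetd", 39), ("sh", 74)]
after = [("init", 31), ("kflushd", 7), ("/bin/inetd", 27), ("/bin/flatfsd", 23),
         ("/bin/diald", 183), ("/bin/pptpd", 47), ("/bin/gettyd", 27),
         ("/bin/pppd", 155), ("/bin/boa", 64), ("/bin/pppd", 156),
         ("/bin/dhcpd", 35), ("/bin/chat", 40), ("/bin/telnetd", 39), ("sh", 54)]


def total_by_command(rows):
    """Sum SIZE per command, so the two pppd instances are counted together."""
    totals = {}
    for command, size_k in rows:
        totals[command] = totals.get(command, 0) + size_k
    return totals


before_totals = total_by_command(before)
after_totals = total_by_command(after)
savings = {cmd: before_totals[cmd] - after_totals[cmd] for cmd in before_totals}

for cmd, saved in sorted(savings.items(), key=lambda item: -item[1]):
    print(f"{cmd:14s} saved {saved}K")
print(f"total: {sum(before_totals.values())}K -> {sum(after_totals.values())}K")
```

The two pppd instances account for the largest share of the reduction, which fits the earlier observation that larger processes lose the most to power of 2 rounding.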

Normally a memory allocator of this sort would suffer from fragmentation, polluting the memory until no more large allocations could be made. Fortunately, in our embedded environment this hasn't been the case. Most of the large allocations are for processes that, once started, run until reboot. Once the units are up, any allocations are usually small and can be satisfied with 1 or 2 pages. This means that fragmentation should not be a major problem. In practice we have found that fragmentation is limited and the benefits of more efficient memory allocation far outweigh any disadvantages.


The Embedded Linux and uClinux Developer Forum is hosted by: SnapGear
