I occasionally browse Hackaday in the hope that some project will guilt, or rather inspire, me into creating something interesting. One post which got my attention was the 1 kB Challenge. This was a competition run at the end of 2016, with the key stipulation being:
> Projects must use 1 kB or less of code, including any initialized data tables, bootloaders, and executable code.
The competition is now long over and I didn’t write so much as a single line of code for it. However, it got me thinking about the MSP430G2 Launchpad, a small developer board I used to experiment with. If there has ever been a device to make me conscious of program size it was this one.
The current revision of the Launchpad comes with an MSP430G2553 MCU. Within its family this is a fairly “high end” part, but its 16 kB of flash memory and 512 B of RAM aren’t exactly generous, especially with several vendors now offering low cost, low power ARM Cortex-M0 parts.
I recalled that even a basic C program used a significant amount of the MSP430G2553’s limited resources. Mostly for nostalgia’s sake I thought I would try to understand why.
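The program I decided to test with is the MSP430 equivalent of “hello world”. For those unfamiliar with the Launchpad, it:
- Stops the watchdog timer
- Configures the pins connected to the red and green LEDs as outputs
- Turns both LEDs on
- Puts the MCU into low power mode 4 (LPM4)

The complete source can be seen below. It and any supporting files can also be found in the GitHub repository linked at the bottom of this article.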
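```c
#include <msp430.h>

#define RED   BIT0  /* red LED on P1.0 */
#define GREEN BIT6  /* green LED on P1.6 */

int main(void)
{
    WDTCTL = WDTPW | WDTHOLD;   /* stop the watchdog timer */
    P1DIR |= RED | GREEN;       /* set both LED pins as outputs */
    P1OUT |= RED | GREEN;       /* drive both LED pins high */
    while (1)
    {
        __bis_SR_register(LPM4_bits);   /* enter low power mode 4 */
    }
    return 0;
}
```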
When compiled with msp430-elf-gcc (without providing any fancy arguments) a binary with the following footprint is produced:
```
$ msp430-elf-size output/default.elf
   text    data     bss     dec     hex filename
    694      16      22     732     2dc output/default.elf
```
This basic program produced a whopping 732 B output, as indicated by “dec”. This value includes the program’s machine instructions (“text”), initialised data (“data”) and uninitialized data (“bss”). This is well past the halfway point of the 1 kB challenge.
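The arithmetic behind the size report can be written out as a small sketch. The struct and function names here are hypothetical, and the flash/RAM split assumes the usual GNU toolchain layout where “data” is stored in flash and copied to RAM at startup, while “bss” only occupies RAM:

```c
#include <stdint.h>

/* Section sizes in bytes, as printed by msp430-elf-size. */
struct section_sizes {
    uint32_t text; /* machine instructions, stored in flash */
    uint32_t data; /* initialised variables: initial values live in flash */
    uint32_t bss;  /* zero-initialised variables: RAM only */
};

/* The "dec" column is the plain sum of the three sections. */
uint32_t total_bytes(struct section_sizes s)
{
    return s.text + s.data + s.bss;
}

/* Bytes that end up programmed into flash (assumed layout, see above). */
uint32_t flash_bytes(struct section_sizes s)
{
    return s.text + s.data;
}

/* Bytes of RAM claimed by variables, not counting the stack. */
uint32_t ram_bytes(struct section_sizes s)
{
    return s.data + s.bss;
}
```

With the values reported for output/default.elf (694, 16 and 22), `total_bytes` gives the 732 shown under “dec”, `flash_bytes` gives 710 and `ram_bytes` gives 38.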
The majority of the space is taken up by machine instructions (“text”). Certain that a near-empty `main()` function could not be responsible for generating so much output, I started probing the binary to see what it contained.
Below is the disassembly of `main()`. This was generated by using msp430-elf-objdump to disassemble the contents of the compiled binary. With the C source being included inline I won’t bother going into the specifics of the assembly instructions. For those interested, the assembly mnemonics and their descriptions can be found in the MSP430x2xx Family User’s Guide.
```
0000c142 <main>:
#define RED BIT0
#define GREEN BIT6
int main(void)
{
WDTCTL = WDTPW | WDTHOLD;
    c142: b2 40 80 5a   mov   #23168, &0x0120  ;#0x5a80
    c146: 20 01

0000c148 <.Loc.36.1>:
P1DIR |= RED | GREEN;
    c148: 5c 42 22 00   mov.b &0x0022, r12     ;0x0022
    c14c: 7c d0 41 00   bis.b #65, r12         ;#0x0041
    c150: 3c f0 ff 00   and   #255, r12        ;#0x00ff
    c154: c2 4c 22 00   mov.b r12, &0x0022     ;

0000c158 <.Loc.37.1>:
P1OUT |= RED | GREEN;
    c158: 5c 42 21 00   mov.b &0x0021, r12     ;0x0021
    c15c: 7c d0 41 00   bis.b #65, r12         ;#0x0041
    c160: 3c f0 ff 00   and   #255, r12        ;#0x00ff
    c164: c2 4c 21 00   mov.b r12, &0x0021     ;

0000c168 <.L2>:
while (1)
{
__bis_SR_register(LPM4_bits);
    c168: 32 d0 f0 00   bis   #240, r2         ;#0x00f0

0000c16c <.Loc.42.1>:
}
    c16c: 30 40 68 c1   br    #0xc168          ;
```
The important takeaway is that the assembled subroutine for `main()` is only 46 bytes long. This means the vast majority of the binary’s instructions are coming from elsewhere.
In the table below are the sizes, in bytes, of the other subroutines present in the disassembly of the ELF file, alongside a description of what I think each subroutine is attempting to do. Please take the descriptions with a pinch of salt as I didn’t devote enough time to do them justice.
Name | Size (Bytes) | Description |
---|---|---|
__msp430_resetvec_hook | 2 | On reset jump to __crt0_start |
__crt0_start | 4 | Loads stack address into R1 |
__crt0_init_bss | 14 | Call memset on bss |
__crt0_movedata | 20 | Copy data section from ROM to RAM (unknown 4 bytes?) |
__crt0_call_init_then_main | 10 | Calls initialization code (call___do_global_ctors_aux) then main |
_msp430_run_init_array | 14 | Loads the start and end address of an array of subroutines to call before calling into _msp430_run_array. The array is 0 length. |
_msp430_run_preinit_array | 14 | Loads the start and end address of an array of subroutines to call before calling into _msp430_run_array. The array is 0 length. |
_msp430_run_fini_array | 16 | Loads the start and end address of an array of subroutines to call before calling into _msp430_run_array. The array is 0 length. |
_msp430_run_array | 14 | Would call each subroutine from an array of their addresses. All callers to this have arrays of length 0. |
_msp430_run_done | 6 | Return instruction for _msp430_run_array. Has 3 calls to ret? |
deregister_tm_clones | 30 | See register_tm_clones |
register_tm_clones | 46 | Appears to relate to transactional memory, which apparently is to make threading easier. Seems unlikely this would be required on an MSP430, unless it can benefit interrupts? |
__do_global_dtors_aux | 78 | Attempts to iterate over an empty array of function pointers (__DTOR_LIST__). |
call___do_global_dtors_aux | 44 | Calls register_tm_clones after a lot of value checking. |
__mspabi_func_epilog* | 16 | Fall-through instructions to pop r4-r10 before returning. Defined in the EABI with the intention of reducing code size. |
__mspabi_srli* | 74 | Fall-through subroutines to logical shift an int right. Right shifts through carry and clears carry. |
__mspabi_srll* | 106 | Fall-through subroutines to logical shift a long right. Right shifts through carry and clears carry. |
memmove | 64 | Included for __crt0_movedata |
memset | 22 | Included for __crt0_init_bss |
__do_global_ctors_aux | 26 | Tries to call various functions from the array __CTOR_LIST__. The array itself is empty. Handler code for C++ constructors? |
call___do_global_ctors_aux | 18 | Calls: call___do_global_dtors_aux, __do_global_ctors_aux, _msp430_run_preinit_array, _msp430_run_init_array. |
__msp430_fini | 10 | Calls _msp430_run_fini_array then __do_global_dtors_aux |
Somewhat unsurprisingly, these subroutines suggest the C language runtime library is responsible for using the rest of the memory. The runtime provides various supporting functions for the C language. For example, managing the stack is not something which you actively need to think about when writing C, but it happens in the background nonetheless.
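To make the table a little more concrete, here is a host-side sketch of the work the default runtime appears to do before `main()`: zero “bss”, copy “data” from ROM to RAM, then walk an array of function pointers. This is not the toolchain’s actual code (that is written in assembly and uses linker-provided symbols); `startup`, `run_array`, `init_fn` and `example_init` are hypothetical names used only for illustration.

```c
#include <stddef.h>
#include <string.h>

typedef void (*init_fn)(void);

/* Counter so the effect of running an init function can be observed. */
int init_calls = 0;

void example_init(void)
{
    ++init_calls;
}

/* Call every function in [start, end), in the spirit of _msp430_run_array.
 * With start == end (a zero length array) the loop body never runs. */
void run_array(init_fn *start, init_fn *end)
{
    for (init_fn *fn = start; fn != end; ++fn) {
        (*fn)();
    }
}

/* Rough stand-in for the pre-main sequence described in the table. */
void startup(unsigned char *bss, size_t bss_len,
             unsigned char *data_ram, const unsigned char *data_rom,
             size_t data_len,
             init_fn *init_start, init_fn *init_end)
{
    memset(bss, 0, bss_len);                /* like __crt0_init_bss */
    memmove(data_ram, data_rom, data_len);  /* like __crt0_movedata */
    run_array(init_start, init_end);        /* like _msp430_run_init_array */
}
```

In the test program every such array is empty, so the equivalent of `run_array` does no useful work, yet its code (plus `memset` and `memmove`) is still linked in.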
My first approach to reducing the binary size was using GCC’s optimization option intended for that specific purpose. This is enabled by passing the `-Os` switch to GCC.
The outcome of this was a meager saving of 20 bytes, with the entirety of this saving coming from GCC optimising `main()`.
The MSP430 GCC toolchain has an additional switch to reduce binary size, `-minrt`. I stumbled across this option in some documentation written by one of the developers. The snippet from the MSP430 GCC manpage states that `-minrt` will:
> Enable the use of a minimum runtime environment - no static initializers or constructors. This is intended for memory-constrained devices. The compiler includes special symbols in some objects that tell the linker and runtime which code fragments are required.
Enabling this option strips away several subroutines in their entirety. As documented in the table above, there were several redundant subroutines which operated on zero length arrays. The binary produced with `-minrt` enabled contains a total of 58 bytes of machine instructions (“text”).
Of these 58 bytes, the pre-main “minimal runtime” is only 12 bytes long:
```
0000c000 <__crt0_start>:
    c000: 31 40 00 04   mov   #1024, r1

0000c004 <__crt0_call_just_main>:
    c004: 0c 43         clr   r12
    c006: b0 12 0a c0   call  #49162
```
It ensures:
- `r1` is set to contain the stack pointer
- `r12` is zero; this is the register `main()` will read to check the number of arguments passed to it (`argc`)
- `main()` is called

The options `-minrt` and `-Os` can be specified simultaneously. This has the effect of producing a binary with a minimal runtime and a size optimised `main()` subroutine.
This results in a binary with a mere 38 bytes of machine instructions. This is a significant reduction from the original binary and at least offers the chance of squeezing something interesting out of 1 kB.
This GitHub repository contains the source code, Makefile and disassembly used for this post. It additionally contains a handful of the relevant files taken from the MSP430 GCC source code which provide some of the runtime subroutines that have been referenced here.
I tried to track down as many of the files as possible which had input to the compiled binary.
The files I found were obtained from three sources: the MSP430 GCC support files package, newlib and libgcc.
The header files and linker scripts for the MSP430 MCUs, e.g. `msp430g2553.h`, are not shipped in the MSP430 GCC source code but are provided in the executable installer version. This list of instructions suggests that, if building GCC from source, these files should be obtained separately from the “msp430-gcc-support-files.zip” package.
The source code for the various runtime subroutines is split between the newlib and libgcc directories of the GCC source code.
Newlib is an implementation of the C standard runtime library, e.g. the code which implements the functions found in `string.h` or `stdlib.h`, etc. According to its wiki page it was written with a focus on embedded systems which do not have an operating system (aka “bare metal”).
I found this porting guide provides relevant information on its workings.
The additional subroutines pulled in from libgcc, on the other hand, appear to handle miscellaneous aspects of the C runtime. In particular, code from the crtstuff.c file ends up in the non-`-minrt` binary. This file is documented as providing:
> Specialized bits of code needed to support construction and destruction of file-scope objects in C++ code

I don’t have any knowledge of C++ but this file contains functions referencing `ctor` and `dtor` which I assume are written for this purpose.
Below is a listing of the names of the subroutines and their various sizes, in bytes, for different optimisation levels.
Name | Default | -Os | -minrt | -minrt and -Os |
---|---|---|---|---|
__reset_vector | 2 | 2 | 2 | 2 |
__crt0_start | 4 | 4 | 4 | 4 |
__crt0_call_just_main | | | 6 | 6 |
__crt0_init_bss | 14 | 14 | | |
__crt0_movedata | 20 | 20 | | |
__crt0_call_init_then_main | 10 | 10 | | |
_msp430_run_init_array | 14 | 14 | | |
_msp430_run_preinit_array | 14 | 14 | | |
_msp430_run_fini_array | 16 | 16 | | |
_msp430_run_array | 14 | 14 | | |
_msp430_run_done | 6 | 6 | | |
deregister_tm_clones | 30 | 30 | | |
register_tm_clones | 46 | 46 | | |
__do_global_dtors_aux | 78 | 78 | | |
call___do_global_dtors_aux | 44 | 44 | | |
main | 46 | 26 | 46 | 26 |
__mspabi_func_epilog* | 16 | 16 | | |
__mspabi_srli* | 74 | 74 | | |
__mspabi_srll* | 106 | 106 | | |
memmove | 64 | 64 | | |
memset | 22 | 22 | | |
__do_global_ctors_aux | 26 | 26 | | |
call___do_global_ctors_aux | 18 | 18 | | |
__msp430_fini | 10 | 10 | | |
Totals | 694 | 674 | 58 | 38 |
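As a quick cross-check of the totals, the runtime’s share of each build is the “text” total minus the size of `main()`. The sketch below uses hypothetical names (`struct build`, `builds`, `runtime_bytes`) and simply copies the numbers from the Totals and main rows:

```c
/* "text" total and size of main() for one build, in bytes. */
struct build {
    const char *flags;
    unsigned text_bytes;
    unsigned main_bytes;
};

/* Values copied from the Totals and main rows of the table. */
const struct build builds[] = {
    { "(default)",  694, 46 },
    { "-Os",        674, 26 },
    { "-minrt",      58, 46 },
    { "-minrt -Os",  38, 26 },
};

/* Everything in "text" that is not main() is runtime support code. */
unsigned runtime_bytes(struct build b)
{
    return b.text_bytes - b.main_bytes;
}
```

This gives 648 bytes of runtime for both builds without `-minrt` and 12 bytes for both builds with it, which matches the 12 byte minimal runtime shown earlier. `-Os` only ever shrinks `main()`.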
The following resources were helpful in understanding the disassembly:
Note: The installed GCC version and GCC source code version were mismatched as this is what I had on hand. I don’t believe there were any significant changes to the files of interest to me between the versions. Both versions of GCC were obtained from TI’s Website.
Another option which can help reduce binary size is `-Wl,--gc-sections`.
This consists of two parts:
- `-Wl,` passes the option following the comma to the linker (i.e. to msp430-elf-ld).
- `--gc-sections` is a linker option which, per the man page, will “Enable garbage collection of unused input sections”.
Adding `-Wl,--gc-sections` to the invocation of GCC reduces the size of “text” in both the default and the space optimised builds by 6 bytes. The subroutines which saw the saving, `_msp430_run_done` and `memset`, are not present when `-minrt` is enabled.
Interestingly, this saving came from removing redundant return from subroutine (`ret`) instructions present within these two subroutines. For instance, the unoptimized build produced the following subroutine, which clearly has two redundant `ret` instructions.
```
0000c076 <_msp430_run_done>:
    c076: 30 41   ret

0000c078 <L0>:
    c078: 30 41   ret
    c07a: 30 41   ret
```