Patent Story: ESP32 Memory And Flash Optimisations

25 Jun 2024

Introduction

It’s been a while since I left Enapter, where I led the firmware development team for ESP32-based projects. It was a wonderful experience, and I brought a lot of new projects into the renewable energy field. I’m particularly proud of developing a technology that connects hydrogen devices in a wireless mesh network. This technology has been submitted for a patent and has recently appeared on Google Patents. Its status is still pending, because it usually takes a long time for each patent office to approve a patent. I’m really happy about it: it’s my first patent.

Before we start, let me point out one thing. The core idea behind the technology can easily be found in the abstract of the patent. An abstract idea without an implementation is worth almost nothing. All the information in this article can be found on the web or in the official Espressif documentation, and it doesn’t reveal anything important that belongs to Enapter. Developing such a large technology required a huge amount of effort from many teams, but in this article I want to focus on the firmware obstacles I faced while designing and developing it.

This new technology brought a lot of challenges to the whole project. There was already a lot running on a very small ESP32: MQTT, BLE, WiFi, HTTP, HTTPS, OTA, UART and plenty of domain-specific business logic. The module had little flash and memory, and I had to fit completely new and potentially huge subsystems onto it. I couldn’t just move to another module with more resources, because existing customers had already bought the product and were waiting for the new features. That’s the nature of most embedded projects: you can’t just put more hardware in a server rack.

While designing this technology I constantly felt that I was at the edge of what the SDK could do. I had a lot of problems with the BLE and WiFi parts of the SDK. I tried to tweak the memory allocation in the cJSON library. I worked very closely on GitHub with the community around ESP-IDF, the SDK for ESP32 modules, to solve all of these problems. Huge thanks to them and to Ivan Grokhotkov, VP of Software Platforms at Espressif. They helped a lot and tried to ship bug fixes as fast as possible.

In terms of numbers, there were the following limitations:

Maximum available flash size was about 1.3 MB.
Maximum available DRAM was about 320 KB. For more information see the ESP-IDF documentation.

Before diving into platform-specific optimisations, it is always worth reviewing the existing system design. Perhaps some functionality can be simplified, and some can even be removed. That’s where I started. The main design principle is the same as it was in the old UNIX days: keep it simple, stupid. It applies just as much in the embedded domain, where you have to be even more creative about simplicity. The simpler the program, the fewer resources it uses. Let’s start with the optimisations. The next section suggests techniques that can reduce the memory and flash footprint.

Optimisations

Analyse maximum possible stack usage of each FreeRTOS task

FreeRTOS provides the function uxTaskGetStackHighWaterMark(TaskHandle_t xTask), which measures how close a task has come to exhausting its allocated stack. It returns the minimum amount of stack that has remained free since the task started. The closer that value is to 0, the more likely it is that the task will overflow its stack. If the value is large, the stack size can be reduced, and since task stacks are usually allocated from the heap, this reduces heap usage. For more details see the FreeRTOS documentation.

Let’s see what is reported on a vanilla project flashed on an ESP32-WROOM-32U devkit:

#include <cstdio>

#include <freertos/FreeRTOS.h>
#include <freertos/task.h>

#include "esp_system.h"

void print_stack_size() {
    fprintf(stderr, "stack_water_mark=%u\n", uxTaskGetStackHighWaterMark(nullptr));
}

extern "C" void app_main(void) {
    while (true) {
        print_stack_size();
        vTaskDelay(pdMS_TO_TICKS(1000));
    }
}

The output is the following:

stack_water_mark=2244
stack_water_mark=2244
stack_water_mark=2244
stack_water_mark=2244
stack_water_mark=2244
stack_water_mark=2244

Reduce stack usage of each FreeRTOS task

FreeRTOS provides another useful function for analysing task resources, vTaskList. Make sure you enable it using idf.py menuconfig: Component Config -> FreeRTOS -> Kernel -> configUSE_TRACE_FACILITY enabled, configUSE_STATS_FORMATTING_FUNCTIONS enabled.

Let’s check what will be reported:

#include <cstdio>
#include <memory>

#include <freertos/FreeRTOS.h>
#include <freertos/task.h>

#include "esp_system.h"

void print_tasks() {
    auto data = std::make_unique<char[]>(1024);

    vTaskList(data.get());
    fprintf(stderr, "tasks:\r\n%s", data.get());
}

extern "C" void app_main(void) {
    while (true) {
        print_tasks();
        vTaskDelay(pdMS_TO_TICKS(1000));
    }
}

The output is the following:

main            X       1       0       2244    4
IDLE1           R       0       1       1036    6
IDLE0           R       0       0       1024    5
esp_timer       S       22      0       3576    3
ipc1            S       24      1       488     2
Tmr Svc         B       1       0       1516    7
ipc0            S       24      0       488     1

The table below labels the columns of this output:

ID	State	Priority	Core	Stack	Number
main	X	1	0	2244	4
IDLE1	R	0	1	1036	6
IDLE0	R	0	0	1024	5
esp_timer	S	22	0	3576	3
ipc1	S	24	1	488	2
Tmr Svc	B	1	0	1516	7
ipc0	S	24	0	488	1

ID - human-readable task name.
State - task state: X - executing, R - ready, S - suspended, B - blocked.
Priority - the higher the value, the higher the priority of the task.
Core - the fourth column of the raw output; it appears to be the CPU core the task is pinned to (IDLE0 and ipc0 on core 0, IDLE1 and ipc1 on core 1).
Stack - the minimum amount of stack space that has been left free since the task was created. ESP-IDF reports it in bytes, not in words as vanilla FreeRTOS does. It is the same value that uxTaskGetStackHighWaterMark returns.
Number - unique number assigned to the task, used mostly in trace analysis.
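
To track these numbers over time, the rows printed by vTaskList can be turned into structured data. The following is a minimal host-side sketch, not the code from the project: TaskRow and parse_task_row are hypothetical names, and the column order (name, state, priority, core, stack, number) is assumed to match the output shown above. The task name is rebuilt from whatever precedes the five fixed columns, because names such as Tmr Svc contain a space.

```cpp
#include <cstddef>
#include <exception>
#include <sstream>
#include <string>
#include <vector>

// One row of the task table, in the column order of the output above.
struct TaskRow {
    std::string name;
    char state = '?';
    unsigned long priority = 0;
    unsigned long core = 0;
    unsigned long stack_free = 0;  // stack high water mark
    unsigned long number = 0;
};

// Parses one line such as "Tmr Svc  B  1  0  1516  7" into `row`.
// Returns false if the line does not have the expected shape.
bool parse_task_row(const std::string& line, TaskRow& row) {
    std::istringstream in(line);
    std::vector<std::string> tokens;
    for (std::string token; in >> token;) tokens.push_back(token);

    // Five fixed columns plus at least one token for the name.
    if (tokens.size() < 6) return false;
    const std::size_t state_index = tokens.size() - 5;
    if (tokens[state_index].size() != 1) return false;

    TaskRow parsed;
    for (std::size_t i = 0; i < state_index; ++i) {
        if (i != 0) parsed.name += ' ';
        parsed.name += tokens[i];
    }
    parsed.state = tokens[state_index][0];
    try {
        parsed.priority = std::stoul(tokens[state_index + 1]);
        parsed.core = std::stoul(tokens[state_index + 2]);
        parsed.stack_free = std::stoul(tokens[state_index + 3]);
        parsed.number = std::stoul(tokens[state_index + 4]);
    } catch (const std::exception&) {
        return false;  // a numeric column did not contain a number
    }
    row = parsed;
    return true;
}
```

Feeding each line of the vTaskList buffer through parse_task_row makes it easy to log only the tasks whose free stack drops below a chosen threshold.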

As we can see, some tasks have a lot of unused stack space that can be reclaimed. Take the main task: it has never had fewer than 2244 bytes of stack free (ESP-IDF reports stack sizes in bytes), which is more than enough headroom, so its stack can be made smaller. This is done with idf.py menuconfig, Component Config -> ESP System Settings -> Main task stack size, which sets CONFIG_ESP_MAIN_TASK_STACK_SIZE in the generated sdkconfig file (3584 bytes by default). See the documentation. The same approach applies to other tasks.
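
How far a stack can be shrunk follows from simple arithmetic: the peak usage is the configured size minus the high water mark, and the new size is that peak plus a safety reserve. Below is a minimal sketch of that calculation. The name suggest_stack_size, the size of the reserve and the 16-byte rounding are my illustrative assumptions, not ESP-IDF recommendations, and all values are in bytes, the unit ESP-IDF uses for task stacks.

```cpp
#include <cstdint>

// Rounds value up to the next multiple of alignment (alignment must be > 0).
std::uint32_t round_up(std::uint32_t value, std::uint32_t alignment) {
    return (value + alignment - 1) / alignment * alignment;
}

// configured:      stack size the task was created with, in bytes.
// high_water_mark: minimum free stack ever observed, in bytes.
// reserve:         hypothetical safety margin kept on top of the observed peak.
std::uint32_t suggest_stack_size(std::uint32_t configured,
                                 std::uint32_t high_water_mark,
                                 std::uint32_t reserve) {
    // Peak usage is the part of the stack that was ever touched.
    // If the inputs are inconsistent, fall back to the full configured size.
    const std::uint32_t peak = high_water_mark < configured
                                   ? configured - high_water_mark
                                   : configured;
    return round_up(peak + reserve, 16);
}
```

For the main task above, suggest_stack_size(3584, 2244, 512) gives 1856 bytes. The high water mark only reflects code paths that have actually run, so it is worth measuring under realistic load before lowering anything.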

Replace stack allocations with heap allocations

Some memory is allocated only once, in initialisation functions or in C++ object constructors. When such buffers are fixed-size arrays inside local variables or objects that live on a task’s stack, they inflate the stack size the task needs. Where I found large buffers like that, I replaced them with heap allocations.

For example, instead of:

#include <cstdint>

struct Data {
    uint8_t buf[1024];
};

I used:

#include <cstdint>
#include <memory>

struct Data {
    std::unique_ptr<uint8_t[]> buf;  // e.g. std::make_unique<uint8_t[]>(1024)
};
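
The effect on the stack can be checked at compile time. The sketch below is an illustration with hypothetical names (InlineData, HeapData, kBufSize): with the array version the whole buffer is part of the object, while with the smart pointer version only the pointer is, and the buffer itself comes from the heap.

```cpp
#include <cstddef>
#include <cstdint>
#include <memory>

constexpr std::size_t kBufSize = 1024;

// The buffer lives inside the object, so a local variable of this type
// takes at least kBufSize bytes of the calling task's stack.
struct InlineData {
    std::uint8_t buf[kBufSize];
};

// Only the smart pointer lives inside the object. The buffer is allocated
// on the heap and released automatically when the object is destroyed.
struct HeapData {
    std::unique_ptr<std::uint8_t[]> buf = std::make_unique<std::uint8_t[]>(kBufSize);
};

static_assert(sizeof(InlineData) >= kBufSize, "buffer is stored inline");
static_assert(sizeof(HeapData) < sizeof(InlineData), "only the pointer is stored inline");
```

The trade-off is that the allocation can now fail at run time and may contribute to heap fragmentation, so this works best for buffers that are allocated once and kept for the lifetime of the object.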

Minimise number of used FreeRTOS tasks

Initially, for historical reasons, each subsystem ran in its own FreeRTOS task. Together the tasks consumed a lot of heap memory for their stacks and increased CPU load through frequent, unnecessary context switching. I introduced a small number of event loops, far fewer than the original number of tasks, and scheduled all subsystems on them.
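
To illustrate the idea, here is a minimal event loop sketch in standard C++. It is not the implementation from the project, and EventLoop, post and run_pending are hypothetical names. Subsystems post short handlers, and a single task drains them in order, so many subsystems share one stack instead of each owning a task.

```cpp
#include <cstddef>
#include <deque>
#include <functional>
#include <mutex>
#include <utility>

// A tiny run-to-completion event loop.
class EventLoop {
public:
    using Handler = std::function<void()>;

    // Queues a handler. Safe to call from any task, including from a handler.
    void post(Handler handler) {
        std::lock_guard<std::mutex> lock(mutex_);
        queue_.push_back(std::move(handler));
    }

    // Runs the handlers queued so far, in FIFO order, and returns how many
    // were executed. Handlers posted while draining wait for the next call.
    std::size_t run_pending() {
        std::deque<Handler> batch;
        {
            std::lock_guard<std::mutex> lock(mutex_);
            batch.swap(queue_);
        }
        for (auto& handler : batch) handler();
        return batch.size();
    }

private:
    std::mutex mutex_;
    std::deque<Handler> queue_;
};
```

On the device, the task that owns the loop would call run_pending() and then block until new work arrives. The price of sharing a task is that a slow handler delays every subsystem on the same loop, so handlers have to stay short.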

Verify heap memory usage

Heap memory usage is a big topic. Espressif provides extensive documentation on the various capabilities provided by the ESP-IDF SDK. They also provide good documentation on how to analyse memory usage.

Most of the time I used the following two functions:

esp_get_free_heap_size()
esp_get_minimum_free_heap_size()

Let’s see what they will report on a vanilla project:

#include <cstdio>

#include <freertos/FreeRTOS.h>
#include <freertos/task.h>

#include "esp_system.h"

void print_memory() {
    fprintf(stderr, "free_heap_size=%lu (bytes) free_min_heap_size=%lu (bytes)\n",
            static_cast<unsigned long>(esp_get_free_heap_size()),
            static_cast<unsigned long>(esp_get_minimum_free_heap_size()));
}

extern "C" void app_main(void) {
    while (true) {
        print_memory();
        vTaskDelay(pdMS_TO_TICKS(1000));
    }
}

The output is the following:

free_heap_size=297964 (bytes) free_min_heap_size=297964 (bytes)
free_heap_size=297964 (bytes) free_min_heap_size=297964 (bytes)
free_heap_size=297964 (bytes) free_min_heap_size=297964 (bytes)
free_heap_size=297964 (bytes) free_min_heap_size=297964 (bytes)

Apply pause-resume design pattern

Sometimes every heavy subsystem has to be present on the module. Even then it is possible to reduce memory usage for a period of time. Assume there is an operation, say the OTA process, that requires more memory than is currently available. In this case, some parts of the system can be temporarily shut down and re-enabled once the operation is complete. Have a look at the Bluetooth or mbedTLS stack you use; it may offer init/de-init functionality for this.
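
One way to express this pattern in C++ is a scope guard. The sketch below is only an illustration: Pausable, ScopedPause and with_paused are hypothetical names, and a real pause() would call whatever de-init function the stack in question provides. The guard resumes the subsystem on every exit path, including early returns.

```cpp
// Hypothetical interface for a subsystem that can release its memory on demand.
class Pausable {
public:
    virtual ~Pausable() = default;
    virtual void pause() = 0;   // stop the subsystem and free its buffers
    virtual void resume() = 0;  // re-allocate the buffers and restart
};

// Pauses the subsystem on construction and resumes it on destruction.
class ScopedPause {
public:
    explicit ScopedPause(Pausable& subsystem) : subsystem_(subsystem) {
        subsystem_.pause();
    }
    ~ScopedPause() { subsystem_.resume(); }

    ScopedPause(const ScopedPause&) = delete;
    ScopedPause& operator=(const ScopedPause&) = delete;

private:
    Pausable& subsystem_;
};

// Runs a memory-hungry operation while the subsystem is paused
// and returns whatever the operation returns.
template <typename Operation>
auto with_paused(Pausable& subsystem, Operation&& operation) -> decltype(operation()) {
    ScopedPause guard(subsystem);
    return operation();
}
```

With this in place, an OTA update could be wrapped as with_paused(ble, run_ota), where ble and run_ota stand for the project’s own subsystem and update routine.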

Verify flash space usage

Most of the time I used idf.py size:

Total sizes:
Used static DRAM:   11332 bytes ( 169404 remain, 6.3% used)
      .data size:    8748 bytes
      .bss  size:    2584 bytes
Used static IRAM:   52634 bytes (  78438 remain, 40.2% used)
      .text size:   51607 bytes
   .vectors size:    1027 bytes
Used Flash size :  117111 bytes
           .text:   78787 bytes
         .rodata:   38068 bytes
Total image size:  178493 bytes (.bin may be padded larger)

Even more interesting is idf.py size-components, which shows how much flash space is occupied by each library (a component, in ESP-IDF terms).
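
It helps to know how the rows of the idf.py size summary add up. The figures shown above are consistent with the following reading, sketched here with hypothetical names (SizeSummary, image_size, static_dram): the image contains .data, IRAM and flash contents, while .bss only costs RAM at run time.

```cpp
#include <cstdint>

// Section sizes in bytes, named after the rows of the `idf.py size` summary.
struct SizeSummary {
    std::uint32_t dram_data;    // .data size
    std::uint32_t dram_bss;     // .bss size
    std::uint32_t iram_total;   // Used static IRAM
    std::uint32_t flash_total;  // Used Flash size
};

// .bss is zero-initialised at boot and is not stored in the image,
// so only .data, IRAM and flash contents count toward the binary.
std::uint32_t image_size(const SizeSummary& s) {
    return s.dram_data + s.iram_total + s.flash_total;
}

// Statically allocated DRAM, which is no longer available to the heap.
std::uint32_t static_dram(const SizeSummary& s) {
    return s.dram_data + s.dram_bss;
}
```

With the values from the summary, 8748 + 52634 + 117111 gives the reported total image size of 178493 bytes, and 8748 + 2584 gives the 11332 bytes of used static DRAM. Shrinking .bss therefore frees heap but does not make the binary smaller.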

Use design decision list

It’s often useful to look at the system from the top down and try to understand if it’s possible to replace or simplify some of the functionality. These changes can be game changing. I made a list of the heaviest subsystems and tried to answer whether I could do anything with them.

Here is the list:

HTTPS stack. Each HTTPS request uses a lot of heap memory. If it can be replaced with plain HTTP, I can save a huge amount of heap space.
mbedTLS stack. Do I really need mbedTLS, given that it consumes a lot of both memory and flash space?
Bluetooth stack. Do I really need the whole Bluetooth stack or just the BLE part? Can I use a different stack that needs fewer resources? This idea turned out to be very important: I replaced the Bluedroid BLE stack I was using with NimBLE, which gave me the extra flash and memory space I desperately needed.
Bluetooth security level. Do I really need to encrypt the data sent over BLE, or can I use a plain-text protocol instead, which reduces memory, flash and CPU usage?

This is just an example of the thought process. The point I’m trying to make is that it’s good to have such a list and to revisit it from time to time.

Conclusion

At the time I was developing the firmware, there were only a few firmware optimisation guides for the ESP32 platform. Most of them were too simple and didn’t cover the cases I needed. Since then Espressif has done a great job of providing very detailed documentation on how to optimise firmware in terms of memory, flash size and speed. I would suggest going through these guides step by step and seeing whether any of the optimisations can be applied to your project.

That was my story, and I’m glad it ended happily. Happy optimising and may your firmware meet your needs.