Offloading The CPU Step By Step — Wenting's Web Page


I recently purchased some Planar EL640.480-AM series panels. They are monochrome 10.4″ 640×480 TFEL panels with an STN LCD interface. The interface is similar to DPI (sometimes called RGB): the screen is timed by HSync and VSync signals and needs to be refreshed constantly. But as a monochrome screen, each pixel is only 1 bit (1bpp), so each data transfer carries multiple pixels at once. These screens are also dual-scan: there are two “raster beams” running at the same time, one refreshing from the top of the screen and the other from the middle. They should be refreshed at 120 Hz.
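
To make the 1bpp, dual-scan layout concrete, here is a minimal host-side sketch of such a framebuffer. It is only an illustration: the bit order within a byte (leftmost pixel in the least significant bit), the choice to store the lower half directly after the upper half, and the helper names fb_set_pixel and fb_lower_half_offset are assumptions made for this example, not taken from the panel datasheet.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SCR_WIDTH  640
#define SCR_HEIGHT 480
#define SCR_STRIDE (SCR_WIDTH / 8) /* bytes per row at 1 bit per pixel */

static uint8_t framebuf[SCR_STRIDE * SCR_HEIGHT];

/* Set or clear one pixel. Assumed bit order: pixel x lives in bit (x % 8). */
static void fb_set_pixel(int x, int y, int on) {
    assert(x >= 0 && x < SCR_WIDTH && y >= 0 && y < SCR_HEIGHT);
    uint8_t mask = (uint8_t)(1u << (x % 8));
    uint8_t *p = &framebuf[y * SCR_STRIDE + x / 8];
    if (on)
        *p |= mask;
    else
        *p &= (uint8_t)~mask;
}

/* Byte offset at which the data for the second (lower) beam starts. */
static size_t fb_lower_half_offset(void) {
    return (size_t)SCR_STRIDE * SCR_HEIGHT / 2;
}
```

With these assumptions a row takes 80 bytes, and the lower beam starts reading 19200 bytes into the buffer, at row 240.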

This creates an interesting challenge: the panel expects to be driven by an STN LCD controller. But where can I find one?

There used to be dedicated graphics controllers, like the NM128 or CT65530, that could drive STN LCDs directly. But they have long been obsolete.
ARM SoCs from the early 2000s typically have an LCD controller capable of driving STN LCDs. However, many of these SoCs are outdated or even obsolete.
There were display controller chips that could convert VGA to STN LCD signals, but I don’t know of any specific model. I suspect they are obsolete as well.

Without dedicated hardware (which I’m not interested in buying anyway), there are a couple of alternative ways to drive it:

Use a really fast microcontroller with a large SRAM and bit-bang GPIOs to generate the video signal.
Use a CPLD/FPGA to generate the timing.

I have used both methods before.

I am going to go the microcontroller way again this time, but with the RP2040, which has a very powerful IO engine called PIO. We will see how it helps with driving the screen and offloading the CPU cores.

Disclaimer: The methods described herein are provided as is. Use at your own risk.

I am new to the RP2040, and this is the first time I have used PIO, so the code provided here is probably not optimal.

I started with the fully bit-banged version because it is the most straightforward way to implement the protocol.

Code:

static void frame(void) {
    pio_sm_set_enabled(el_pio, EL_UDATA_SM, false);
    pio_sm_set_enabled(el_pio, EL_LDATA_SM, false);
    // prefill FIFO
    elsm_put(*rdptr_ud++, *rdptr_ld++);
    // start SM
    pio_enable_sm_mask_in_sync(el_pio, (1u << EL_UDATA_SM) | (1u << EL_LDATA_SM));
}

Because the screen needs to be refreshed constantly, the main function simply calls frame() in an endless loop (while (1) frame();).

This eats up all the CPU cycles, so the MCU won’t be able to do anything else. What’s worse, the loop must not be interrupted by anything else, because the screen is timing sensitive.

This is also quite slow. At 125 MHz system clock, I got:

44Hz VSync
11.5kHz HSync
870 kHz pixel clock


The screen works, but the flickering is quite noticeable. As a proof of concept, though, it shows that the screen works, so I can proceed.
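
For comparison with the measured numbers, the rates needed for the 120 Hz target can be estimated from the panel geometry. The sketch below is a back-of-the-envelope calculation, assuming each half of the panel receives 4 pixels per pixel-clock cycle and ignoring any blanking time; the function names are made up for this example.

```c
#include <assert.h>
#include <stdint.h>

/* Panel geometry from the text. EL_BUS_BITS is the assumed number of
   pixels moved per pixel-clock cycle on each half of the panel. */
enum { EL_WIDTH = 640, EL_HEIGHT = 480, EL_BUS_BITS = 4 };

/* Lines each "beam" has to scan per frame on a dual-scan panel. */
static uint32_t el_lines_per_frame(void) {
    return EL_HEIGHT / 2;
}

/* Line (HSync) rate in Hz for a given refresh rate. */
static uint32_t el_line_rate_hz(uint32_t refresh_hz) {
    assert(refresh_hz > 0);
    return refresh_hz * el_lines_per_frame();
}

/* Lower bound on the pixel clock in Hz, ignoring blanking time. */
static uint32_t el_min_pixclk_hz(uint32_t refresh_hz) {
    return el_line_rate_hz(refresh_hz) * (EL_WIDTH / EL_BUS_BITS);
}

/* Speed-up needed over a measured refresh rate, in percent (rounded down). */
static uint32_t el_required_speedup_pct(uint32_t target_hz, uint32_t measured_hz) {
    assert(measured_hz > 0);
    return target_hz * 100u / measured_hz;
}
```

Under these assumptions, 120 Hz needs a 28.8 kHz line rate and a pixel clock of at least 4.608 MHz. Going from the measured 44 Hz to 120 Hz requires roughly 2.7 times the throughput, which the bit-banged loop cannot deliver since it already uses the whole CPU.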

Sending parallel data over GPIO by toggling individual lines, as in the previous example, is quite slow. Fortunately, this job can easily be handed over to the PIO.

As a first step, I will replace just the data-sending part with PIO. The data path consists of two 4-bit synchronous parallel buses. Using the autopull and side-set features, each bus can be implemented as a simple PIO SM:

.program el_udata
.side_set 1
.wrap_target
    out pins, 4     side 1
    nop             side 0
.wrap

Each PIO SM (state machine) handles one FIFO (one data stream). The screen has two raster beams, so two state machines are required. I am calling them EL_UDATA_SM and EL_LDATA_SM. The catch is that the two state machines must run strictly in sync with each other. We will see how to achieve that.
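
To illustrate what the state machine does with each 32-bit FIFO word, here is a small host-side model of out pins, 4 with autopull. It assumes the output shift register shifts to the right (which is what sm_config_set_out_shift(..., true, true, 32) in the setup code further down selects), so the lowest 4 bits leave first. The function name is made up for this sketch; it is plain C, not PIO code.

```c
#include <assert.h>
#include <stdint.h>

/* Model of `out pins, 4` with a right-shifting OSR: split one 32-bit FIFO
   word into the eight 4-bit values the pins would see, in output order. */
static void osr_shift_out_nibbles(uint32_t word, uint8_t out[8]) {
    assert(out != 0);
    for (int i = 0; i < 8; i++) {
        out[i] = (uint8_t)(word & 0xFu); /* lowest 4 bits go out first */
        word >>= 4;                      /* OSR shifts right by 4 bits */
    }
}
```

So one word pulled from the FIFO feeds eight pixel-clock cycles, and the byte at the lowest address of a little-endian framebuffer is sent first, low nibble before high nibble.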

Putting the sync problem aside for now and simply replacing GPIO with PIO, the frame() function would look like this:

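static void frame(void) {
    uint32_t *rdptr_ud = (uint32_t *)framebuf;
    uint32_t *rdptr_ld = (uint32_t *)(framebuf + SCR_STRIDE * SCR_HEIGHT / 2);
    for (int y = 0; y < SCR_HEIGHT / 2; y++) {
        // Push one line as 32-bit words (assumes SCR_STRIDE is in bytes)
        for (int x = 0; x < SCR_STRIDE / 4; x++)
            elsm_put(*rdptr_ud++, *rdptr_ld++);
        elsm_wait();
        gpio_put(HSYNC_PIN, 1);
        gpio_put(VSYNC_PIN, (y == 0) ? 1 : 0);
        delay(15);
        gpio_put(HSYNC_PIN, 0);
        delay(5);
        gpio_put(VSYNC_PIN, 0);
    }
}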

The elsm_put() function puts data into the PIO FIFOs and lets the PIO send it, and the elsm_wait() function waits for the PIO to finish sending. They are implemented as follows:

static inline void elsm_put(uint32_t ud, uint32_t ld) {
    // Only the UDATA FIFO is checked for space
    while (pio_sm_is_tx_fifo_full(el_pio, EL_UDATA_SM))
        ;
    pio_sm_put(el_pio, EL_UDATA_SM, ud);
    pio_sm_put(el_pio, EL_LDATA_SM, ld);
}

static inline void elsm_wait(void) {
    // Clear the sticky TX stall flag of UDATA, then wait until it is set
    // again, i.e. the SM has run out of data to send
    uint32_t sm_stall_mask = 1u << (EL_UDATA_SM + PIO_FDEBUG_TXSTALL_LSB);
    el_pio->fdebug = sm_stall_mask;
    while (!(el_pio->fdebug & sm_stall_mask))
        ;
}

However, this code will not work. Checking only one SM for a full FIFO or a stall is not the problem here: if both SMs are guaranteed to be in sync, checking one of them is sufficient. The problem is that this code does not guarantee that the two SMs stay in sync. Consider the start of a line, when both SMs are stalled because their FIFOs are empty. elsm_put() writes to one FIFO first, say EL_UDATA_SM. If a PIO clock edge arrives before it writes to EL_LDATA_SM, EL_UDATA_SM advances its state but EL_LDATA_SM does not, and from then on they are out of sync.

The solution is to stop both SMs first, fill the FIFOs, and then start both SMs in sync:

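static void frame(void) {
    uint32_t *rdptr_ud = (uint32_t *)framebuf;
    uint32_t *rdptr_ld = (uint32_t *)(framebuf + SCR_STRIDE * SCR_HEIGHT / 2);
    for (int y = 0; y < SCR_HEIGHT / 2; y++) {
        pio_sm_set_enabled(el_pio, EL_UDATA_SM, false);
        pio_sm_set_enabled(el_pio, EL_LDATA_SM, false);
        // prefill FIFO
        elsm_put(*rdptr_ud++, *rdptr_ld++);
        // start SM
        pio_enable_sm_mask_in_sync(el_pio, (1u << EL_UDATA_SM) | (1u << EL_LDATA_SM));
        // Send the rest of the line (assumes SCR_STRIDE is in bytes)
        for (int x = 1; x < SCR_STRIDE / 4; x++)
            elsm_put(*rdptr_ud++, *rdptr_ld++);
        // Wait for SM to finish
        elsm_wait();
        gpio_put(HSYNC_PIN, 1);
        gpio_put(VSYNC_PIN, (y == 0) ? 1 : 0);
        delay(15);
        gpio_put(HSYNC_PIN, 0);
        delay(5);
        gpio_put(VSYNC_PIN, 0);
    }
}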

This will create a glitch-free image.
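
The race described above, and why stopping the SMs first avoids it, can be illustrated with a toy host-side model. This is a deliberately simplified sketch, not a PIO simulator: each "SM" just consumes one FIFO word per clock edge when it is enabled and has data, and all names are made up for this example.

```c
#include <assert.h>
#include <stdbool.h>

typedef struct {
    int fifo;     /* words waiting in the TX FIFO */
    int steps;    /* words consumed so far */
    bool enabled; /* whether the SM is running */
} toy_sm;

/* One PIO clock edge: an enabled SM with data consumes a word and advances. */
static void toy_tick(toy_sm *sm) {
    assert(sm);
    if (sm->enabled && sm->fifo > 0) {
        sm->fifo--;
        sm->steps++;
    }
}

/* Unsafe order: both SMs are running (stalled on empty FIFOs) while the CPU
   writes the two FIFOs one after the other, with a clock edge in between. */
static int toy_skew_unsafe(void) {
    toy_sm first = {0, 0, true}, second = {0, 0, true};
    first.fifo++;      /* CPU writes the first FIFO */
    toy_tick(&first);  /* clock edge arrives before the second write */
    toy_tick(&second);
    second.fifo++;     /* CPU writes the second FIFO */
    return first.steps - second.steps;
}

/* Safe order: stop both SMs, fill both FIFOs, then enable them together. */
static int toy_skew_safe(void) {
    toy_sm first = {0, 0, false}, second = {0, 0, false};
    first.fifo++;
    toy_tick(&first);  /* edges while stopped change nothing */
    toy_tick(&second);
    second.fifo++;
    first.enabled = second.enabled = true;
    toy_tick(&first);
    toy_tick(&second);
    return first.steps - second.steps;
}
```

In the unsafe order the first-written SM ends up one step ahead and never gives that lead back. In the safe order both SMs take their first step on the same clock edge.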

The additional PIO setup code is as follows:

static void elsm_init() {
    static uint udata_offset, ldata_offset;

    for (int i = 0; i < 4; i++) {
        pio_gpio_init(el_pio, UD0_PIN + i);
        pio_gpio_init(el_pio, LD0_PIN + i);
    }

    pio_gpio_init(el_pio, PIXCLK_PIN);
    pio_sm_set_consecutive_pindirs(el_pio, EL_UDATA_SM, UD0_PIN, 4, true);
    pio_sm_set_consecutive_pindirs(el_pio, EL_UDATA_SM, PIXCLK_PIN, 1, true);
    pio_sm_set_consecutive_pindirs(el_pio, EL_LDATA_SM, LD0_PIN, 4, true);

    udata_offset = pio_add_program(el_pio, &el_udata_program);
    ldata_offset = pio_add_program(el_pio, &el_ldata_program);

    int cycles_per_pclk = 2;
    // Cast first so the division keeps its fractional part
    float div = (float)clock_get_hz(clk_sys) / (EL_TARGET_PIXCLK * cycles_per_pclk);

    pio_sm_config cu = el_udata_program_get_default_config(udata_offset);
    sm_config_set_sideset_pins(&cu, PIXCLK_PIN);
    sm_config_set_out_pins(&cu, UD0_PIN, 4);
    sm_config_set_fifo_join(&cu, PIO_FIFO_JOIN_TX);
    sm_config_set_out_shift(&cu, true, true, 32);
    sm_config_set_clkdiv(&cu, div);
    pio_sm_init(el_pio, EL_UDATA_SM, udata_offset, &cu);

    pio_sm_config cl = el_ldata_program_get_default_config(ldata_offset);
    sm_config_set_out_pins(&cl, LD0_PIN, 4);
    sm_config_set_fifo_join(&cl, PIO_FIFO_JOIN_TX);
    sm_config_set_out_shift(&cl, true, true, 32);
    sm_config_set_clkdiv(&cl, div);
    pio_sm_init(el_pio, EL_LDATA_SM, ldata_offset, &cl);
}
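
The clock divider calculation in elsm_init() can be checked on a host with a few lines of C. The sketch below assumes a 125 MHz system clock and a hypothetical 5 MHz target pixel clock (the actual value of EL_TARGET_PIXCLK is not shown in this post), and models the divider as being quantised to steps of 1/256, which is how I understand the PIO's 16.8 fixed-point divider. The function names are made up for this example.

```c
#include <assert.h>
#include <stdint.h>

/* Divider needed so that one pixel clock takes `cycles_per_pclk`
   state machine cycles. */
static float el_pio_clkdiv(uint32_t sys_hz, uint32_t pixclk_hz,
                           uint32_t cycles_per_pclk) {
    assert(pixclk_hz > 0 && cycles_per_pclk > 0);
    return (float)sys_hz / ((float)pixclk_hz * (float)cycles_per_pclk);
}

/* Pixel clock actually produced once the divider is truncated to
   16.8 fixed point (steps of 1/256). */
static uint32_t el_actual_pixclk_hz(uint32_t sys_hz, float div,
                                    uint32_t cycles_per_pclk) {
    uint32_t div_fix = (uint32_t)(div * 256.0f);
    assert(div_fix >= 256); /* a divider below 1.0 is not possible */
    return (uint32_t)(((uint64_t)sys_hz * 256u) /
                      ((uint64_t)div_fix * cycles_per_pclk));
}
```

For the assumed numbers the divider comes out as 12.5, which is exactly representable, so the pixel clock lands on 5 MHz. Other targets may not divide as cleanly, in which case the achieved pixel clock differs slightly from the requested one.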

In the previous example, although the MCU can now churn out pixels at a much faster rate, this does not necessarily free up CPU time. The core is still busy writing data to the PIO and waiting for it to finish. A better way to do this is to use DMA.

To test the idea of using DMA, first use a blocking mode transfer, so it will be a drop-in replacement of the current copy loop. Use the following code to setup the 2 DMA channels to be used:

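static void el_dma_init() {
    // One channel per SM: 32-bit transfers from the framebuffer (incrementing
    // read address) into the SM's TX FIFO (fixed write address), paced by the
    // SM's TX DREQ. Claiming the channels here and the per-line transfer
    // count are assumptions.
    el_udma_chan = dma_claim_unused_channel(true);
    dma_channel_config cu = dma_channel_get_default_config(el_udma_chan);
    channel_config_set_transfer_data_size(&cu, DMA_SIZE_32);
    channel_config_set_read_increment(&cu, true);
    channel_config_set_write_increment(&cu, false);
    channel_config_set_dreq(&cu, pio_get_dreq(el_pio, EL_UDATA_SM, true));
    dma_channel_configure(el_udma_chan, &cu, &el_pio->txf[EL_UDATA_SM],
            NULL, SCR_STRIDE / 4, false);

    el_ldma_chan = dma_claim_unused_channel(true);
    dma_channel_config cl = dma_channel_get_default_config(el_ldma_chan);
    channel_config_set_transfer_data_size(&cl, DMA_SIZE_32);
    channel_config_set_read_increment(&cl, true);
    channel_config_set_write_increment(&cl, false);
    channel_config_set_dreq(&cl, pio_get_dreq(el_pio, EL_LDATA_SM, true));
    dma_channel_configure(el_ldma_chan, &cl, &el_pio->txf[EL_LDATA_SM],
            NULL, SCR_STRIDE / 4, false);
}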

Then the copy loop in frame() can be replaced with DMA calls:

static void frame(void) {
    uint32_t *rdptr_ud = (uint32_t *)framebuf;
    uint32_t *rdptr_ld = (uint32_t *)(framebuf + SCR_STRIDE * SCR_HEIGHT / 2);
    dma_channel_set_read_addr(el_udma_chan, rdptr_ud, false);
    dma_channel_set_read_addr(el_ldma_chan, rdptr_ld, false);

    for (int y = 0; y < SCR_HEIGHT / 2; y++) {
        pio_sm_set_enabled(el_pio, EL_UDATA_SM, false);
        pio_sm_set_enabled(el_pio, EL_LDATA_SM, false);
        // Setup DMA
        dma_channel_start(el_udma_chan);
        dma_channel_start(el_ldma_chan);
        // start SM
        pio_enable_sm_mask_in_sync(el_pio, (1u << EL_UDATA_SM) | (1u << EL_LDATA_SM));
        // Increment addr
        // Wait for finish
        dma_channel_wait_for_finish_blocking(el_udma_chan);
        dma_channel_wait_for_finish_blocking(el_ldma_chan);
        // Wait for SM to finish
        elsm_wait();
        gpio_put(HSYNC_PIN, 1);
        gpio_put(VSYNC_PIN, (y == 0) ? 1 : 0);
        delay(15);
        gpio_put(HSYNC_PIN, 0);
        delay(5);
        gpio_put(VSYNC_PIN, 0);
    }
}

Now the CPU is no longer copying the data; the DMA does that. However, the CPU is not freed yet: it still spends all its time waiting for the DMA to finish, and the main loop is still while(1) frame(). Nothing else can be done while waiting for the DMA.

So instead of letting the CPU wait all the time, interrupts can be used to allow the CPU to do other things while the DMA and PIO are busy sending stuff.

Split frame() into three functions like this:

static int el_cur_y = 0;

static void el_dma_start_line() {
    pio_sm_set_enabled(el_pio, EL_UDATA_SM, false);
    pio_sm_set_enabled(el_pio, EL_LDATA_SM, false);
    // Setup DMA
    dma_channel_start(el_udma_chan);
    dma_channel_start(el_ldma_chan);
    // start SM
    pio_enable_sm_mask_in_sync(el_pio, (1u << EL_UDATA_SM) | (1u << EL_LDATA_SM));
}

static void el_dma_start_frame() {
    uint32_t *rdptr_ud = (uint32_t *)framebuf;
    uint32_t *rdptr_ld = (uint32_t *)(framebuf + SCR_STRIDE * SCR_HEIGHT / 2);
    dma_channel_set_read_addr(el_udma_chan, rdptr_ud, false);
    dma_channel_set_read_addr(el_ldma_chan, rdptr_ld, false);
    el_dma_start_line();
}

static void el_dma_handler() {
    dma_hw->ints0 = 1u << el_udma_chan;

    elsm_wait();
    gpio_put(HSYNC_PIN, 1);
    gpio_put(VSYNC_PIN, (el_cur_y == 0) ? 1 : 0);
    delay(15);
    gpio_put(HSYNC_PIN, 0);
    delay(5);
    gpio_put(VSYNC_PIN, 0);
    el_cur_y ++;
    if (el_cur_y == SCR_HEIGHT / 2) {
        // End of frame, reset
        el_cur_y = 0;
        el_dma_start_frame();
    }
    else {
        el_dma_start_line();
    }
}

Every time the DMA finishes transferring a line, el_dma_handler() is called. It generates the HSync/VSync pulses and starts the DMA for the next line.

Add the following code to the DMA initialization code to enable the interrupt:

    dma_channel_set_irq0_enabled(el_udma_chan, true);

    irq_set_exclusive_handler(DMA_IRQ_0, el_dma_handler);
    irq_set_enabled(DMA_IRQ_0, true);

Finally, remove the frame() call from the main loop, and call el_dma_start_frame() before the main loop to kick off the DMA for the first frame. Now try adding something like LED blinking to the main loop, and note how it keeps working while DMA and PIO take care of refreshing the EL display in the background.

To see how much CPU time is left for other tasks, a GPIO pin can be set high on entering the interrupt handler and low on leaving it, so the time the pin spends low is free time.

It turns out the output duty cycle is around 48%, which means the CPU load is about 48%, leaving about 50% for other work. Just keep in mind that this doesn’t include the overhead of interrupt entry and exit (context switching), so the actual usable CPU time will be less.

About 50% CPU usage (when refreshing at 120 Hz), compared to 270% before, is a pretty good improvement, but it is still not perfect:

The CPU still uses a busy-loop delay to control the timing of HSync and VSync.
The CPU still needs to busy-wait for the PIO to finish sending data. The DMA interrupt fires when the data has been pushed into the FIFO, not when the PIO has drained the FIFO. The PIO provides a TX FIFO not-full interrupt, but no TX FIFO empty interrupt.
The screen is sensitive to timing variations. If for some reason the CPU does not service the interrupt in time, artifacts may appear.

The solution is to let the PIO handle HSync and VSync as well. Fortunately, PIO supports down counters (jmp x-- and jmp y--), so I will use them to keep track of the coordinates and generate the HSync and VSync signals.

The idea is that the SMs not only output data, but also count down in both the X and Y directions, so that they can generate both sync signals. When the end of the frame is reached, an interrupt is raised to the CPU.

The code is as follows. I am using one IRQ flag to sync between the two SMs:

; UDATA SM handles UD0-3, PCLK, and VSYNC
; PCLK is mapped to SIDE, VSYNC is mapped to SET, and UD0-3 are mapped to OUT
.program el_udata
.side_set 1
    irq set 5 side 0
    mov x, isr side 0 
loop_first_line:
    out pins, 4 side 1
    jmp x-- loop_first_line side 0
end_first_line:
    set pins, 1 [6] side 0
    set pins, 0 [9] side 0
line_start:
    irq set 5 side 0
    mov x, isr side 0
loop:
    out pins, 4 side 1 ; Output 4 bit data
    jmp x-- loop side 0 ; Loop until x hits 0, then wait for next line
loop_end:
    nop [15] side 0
    jmp y-- line_start side 0 
    ; end of frame, signal CPU
    irq wait 1 side 0


; LDATA SM handles LD0-3 and HSYNC
; HSYNC is mapped to SET, and LD0-3 are mapped to OUT
.program el_ldata
    ; Signal UDATA SM to start outputting data
    mov x, isr
    wait irq 5
loop:
    out pins, 4
    jmp x-- loop
    ; toggle Hsync and signal Vsync SM
    set pins, 1 [5]
    set pins, 0 [10]

The Y counter only needs to count down once per frame, so it’s fine to load it directly into the Y register. The X counter, however, needs to be reloaded for each line, so a separate register is needed to hold the initial value; the ISR is used for that here. The values can be loaded from C code:

static void el_sm_load_reg(uint sm, enum pio_src_dest dst, uint32_t val) {
    pio_sm_put_blocking(el_pio, sm, val);
    pio_sm_exec(el_pio, sm, pio_encode_pull(false, false));
    pio_sm_exec(el_pio, sm, pio_encode_out(dst, 32));
}

...
    // Load configuration values
    el_sm_load_reg(EL_UDATA_SM, pio_y, SCR_REFRESH_LINES - 2);
    el_sm_load_reg(EL_UDATA_SM, pio_isr, SCR_LINE_TRANSFERS - 1);
    el_sm_load_reg(EL_LDATA_SM, pio_isr, SCR_LINE_TRANSFERS - 1);
...
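
The "- 1" and "- 2" in the values loaded above follow from how jmp x-- and jmp y-- count: the jump is taken while the register is non-zero, and the register is decremented afterwards either way. The host-side model below is a sketch of that behaviour with made-up function names; the line count of 240 used in the note after it is an assumed value, since SCR_REFRESH_LINES is not shown in this post.

```c
#include <assert.h>
#include <stdint.h>

/* Number of times a loop body runs when the loop ends in `jmp x-- loop`
   and x starts at the given value. */
static uint32_t pio_loop_iterations(uint32_t x) {
    assert(x < (1u << 24)); /* keep the model cheap to run */
    uint32_t runs = 0;
    uint32_t taken;
    do {
        runs++;           /* loop body, e.g. `out pins, 4` */
        taken = (x != 0); /* condition uses the value before the decrement */
        x--;              /* decrement happens whether or not we jump */
    } while (taken);
    return runs;
}

/* Lines sent by el_udata per frame: one explicit first line, then the
   line_start..loop_end section, which ends in `jmp y-- line_start`. */
static uint32_t el_udata_lines(uint32_t y) {
    return 1 + pio_loop_iterations(y);
}
```

A counter loaded with N therefore gives N + 1 passes, so 160 transfers per line need 159 in the X reload value, and because the first line is handled outside the Y loop, a 240-line frame needs 238 in Y.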

PIO interrupts also need to be configured. In the PIO code, I am using irq wait 1 to signal the main CPU, so IRQ flag 1 must be routed to the main CPU:

    el_pio->inte0 = PIO_IRQ0_INTE_SM1_BITS;
    irq_set_exclusive_handler(PIO0_IRQ_0, el_pio_irq_handler);
    irq_set_enabled(PIO0_IRQ_0, true);

This is kind of confusing: from the register definition it looks like it routes an SM1 interrupt to PIO0_IRQ_0, but it really just routes PIO IRQ flag 1 to the CPU. The flag is not tied to SM1. Here, for example, I am using SM0 to raise IRQ flag 1.
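
The two bit masks involved can be written out explicitly. The sketch below reflects my understanding of the RP2040 register layout, where the PIO IRQ register has one write-1-to-clear bit per flag and the interrupt enable registers keep the bits for flags 0 to 3 at positions 8 to 11. Treat the bit positions as an assumption to verify against the datasheet; the function names are made up.

```c
#include <assert.h>
#include <stdint.h>

/* Mask to write to the PIO IRQ register to clear IRQ flag `flag`. */
static uint32_t pio_irq_flag_mask(unsigned flag) {
    assert(flag < 8);
    return 1u << flag;
}

/* Interrupt enable bit that forwards IRQ flag `flag` to the CPU. The
   register description names these bits SM0..SM3, although any state
   machine can raise the flag. */
static uint32_t pio_inte_flag_bit(unsigned flag) {
    assert(flag < 4); /* only flags 0..3 can be routed to the CPU */
    return 1u << (8 + flag);
}
```

For flag 1 this gives 0x02 for clearing the flag, the same value the interrupt handler writes to el_pio->irq, and 0x200 for the enable bit.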

Now the interrupt handler can be further simplified:

static void el_pio_irq_handler() {
    uint8_t *framebuf = frame_state ? framebuf_bp0 : framebuf_bp1;
    frame_state = !frame_state;
    uint32_t *rdptr_ud = (uint32_t *)framebuf;
    uint32_t *rdptr_ld = (uint32_t *)(framebuf + SCR_STRIDE * SCR_HEIGHT / 2);
    dma_channel_set_read_addr(el_udma_chan, rdptr_ud, false);
    dma_channel_set_read_addr(el_ldma_chan, rdptr_ld, false);

    pio_sm_set_enabled(el_pio, EL_UDATA_SM, false);
    pio_sm_set_enabled(el_pio, EL_LDATA_SM, false);

    pio_sm_clear_fifos(el_pio, EL_UDATA_SM);
    pio_sm_clear_fifos(el_pio, EL_LDATA_SM);

    pio_sm_restart(el_pio, EL_UDATA_SM);
    pio_sm_restart(el_pio, EL_LDATA_SM);

    // Load configuration values
    el_sm_load_reg(EL_UDATA_SM, pio_y, SCR_REFRESH_LINES - 2);
    el_sm_load_reg(EL_UDATA_SM, pio_isr, SCR_LINE_TRANSFERS - 1);
    el_sm_load_reg(EL_LDATA_SM, pio_isr, SCR_LINE_TRANSFERS - 1);

    // Setup DMA
    dma_channel_start(el_udma_chan);
    dma_channel_start(el_ldma_chan);
    // Clear IRQ flag
    el_pio->irq = 0x02;
    // start SM
    pio_enable_sm_mask_in_sync(el_pio,
            (1u << EL_UDATA_SM) | (1u << EL_LDATA_SM));
}

The interrupt now fires only at 120Hz and takes only 750ns to run at a 125MHz clock frequency. This means that the CPU load is only 0.009% (750ns * 120Hz / 1e9ns)! This is a huge improvement at the cost of a few extra SM instructions.
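
The load figure is simple arithmetic, shown here as a small host-side helper so other handler durations or refresh rates can be plugged in. The function names are made up for this sketch.

```c
#include <assert.h>

/* CPU load in percent for a handler that runs `rate_hz` times per second
   and takes `handler_ns` nanoseconds each time. */
static double cpu_load_percent(double handler_ns, double rate_hz) {
    assert(handler_ns >= 0.0 && rate_hz >= 0.0);
    return handler_ns * rate_hz / 1e9 * 100.0;
}

/* Handler duration expressed in CPU clock cycles. */
static double handler_cycles(double handler_ns, double cpu_hz) {
    assert(cpu_hz > 0.0);
    return handler_ns * cpu_hz / 1e9;
}
```

750 ns at 120 Hz is 90 µs of handler time per second, or 0.009%. At 125 MHz, 750 ns corresponds to a little under 94 clock cycles per interrupt.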

Note that it is actually possible to eliminate this interrupt as well: PIO and DMA can push pixels out continuously without any CPU intervention. However, I’m not doing that here, because a precise VSync interrupt is quite useful. I can switch framebuffers in it without any tearing. It is also important for applying FRC (frame rate control) to get grayscale.
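
As an illustration of how alternating two bitplanes on every VSync could give an intermediate gray level, here is a minimal sketch of 3-level FRC. It is an assumption about how buffers such as framebuf_bp0 and framebuf_bp1 might be filled, not the code used in this project, and the function names are made up.

```c
#include <assert.h>
#include <stdint.h>

/* Split a gray level (0 = off, 1 = half, 2 = full) into the on/off bit
   shown on even frames (bp0) and on odd frames (bp1). */
static void frc_split(unsigned level, uint8_t *bp0_bit, uint8_t *bp1_bit) {
    assert(level <= 2 && bp0_bit && bp1_bit);
    *bp0_bit = (uint8_t)(level >= 1);
    *bp1_bit = (uint8_t)(level >= 2);
}

/* Average on-time of a pixel over one two-frame cycle, in percent. */
static unsigned frc_average_percent(uint8_t bp0_bit, uint8_t bp1_bit) {
    return (unsigned)(bp0_bit + bp1_bit) * 50u;
}
```

A half-level pixel is lit on every other frame only, so at a 120 Hz refresh it effectively blinks at 60 Hz. That is why the buffer switch has to happen exactly once per frame.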

This post shows how to take advantage of the PIO found on the RP2040 to drive uncommon types of displays. Combined with the fairly large SRAM on the RP2040, this is quite useful. As I mentioned earlier, I’m a newbie who has just started learning the RP2040, so the method presented here almost certainly does not make full use of the PIO and could be optimized further. All code provided is licensed under the MIT license, so take it if it proves useful for your project. Thanks for reading.

Full source code is available here: https://gist.github.com/zephray/cb9340d278ed2ab6eb47398d2ca29b3c
