tests/run-tests.py: Add automatic retry for known-flaky tests. #6
andrewleech wants to merge 26 commits into review/flaky-test-retry from
Conversation
Force-pushed from 29e879c to 91111d5
Code size report:
tests/run-tests.py
saved_testcase_count = testcase_count.value
flaky_failures = []
for i, r in enumerate(results):
    if r[1] == "fail":
Is there a shared / original definition of "fail" that can be used rather than hard coded in string here?
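One way to address the reviewer's point (not what the PR currently does; the constant names below are illustrative) is to hoist the status strings into shared module-level constants so the retry scan and the rest of run-tests.py agree on a single definition:

```python
# Hypothetical shared status constants; run-tests.py currently uses
# bare string literals ("pass", "fail", "skip") throughout.
STATUS_PASS = "pass"
STATUS_FAIL = "fail"
STATUS_SKIP = "skip"

# Example result tuples in the (test_file, status, reason) shape used above.
results = [
    ("thread/thread_gc1.py", STATUS_FAIL, "output mismatch"),
    ("extmod/time_time_ns.py", STATUS_PASS, ""),
]

# The retry scan then compares against the shared constant instead of a
# hard-coded string:
flaky_failures = [(i, r) for i, r in enumerate(results) if r[1] == STATUS_FAIL]
```

A follow-up could replace every bare literal in the file with these names, which is the consolidation the PR description itself suggests.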
tests/run-tests.py
retried_pass = False
for attempt in range(1, FLAKY_TEST_MAX_RETRIES + 1):
    print(
        "retry {} ({}/{}, {})".format(
Is .format the standard way of outputting strings in this project?
tests/run-tests.py
run_one_test(test_file)
retry_entries = results[pre_len:]
del results[pre_len:]  # clean up all entries appended during retry
if len(retry_entries) == 1 and retry_entries[0][1] == "pass":
As per earlier comment, "pass" probably shouldn't be hard coded in string here. Also is this correct for all tests types (diff/exp/unittest)?
tests/run-tests.py
    "{} (retried {})".format(FLAKY_REASON_PREFIX, FLAKY_TEST_MAX_RETRIES),
)
print(
    "FAIL {} (retried {}, {})".format(
How does this FAIL compare to earlier fail strings?
/review
Several issues with the retry mechanism: the flaky-test registry uses a fragile string prefix for tagging results and private _value access to reset the counter, and stress_aes.py with platforms=None will triple the timeout on MIPS/ARM QEMU. Style nits: missing blank line and len(x) > 0 checks.
tests/run-tests.py
    "extmod/time_time_ns.py": ("CI runner clock precision", None),
}
FLAKY_TEST_MAX_RETRIES = 2  # 3 total runs including initial
Missing blank line between closing } of FLAKY_TESTS and this constant.
tests/run-tests.py
        test_file, FLAKY_TEST_MAX_RETRIES, flaky_reason
    )
)
testcase_count._value = saved_testcase_count
_value is a private attribute of ThreadSafeCounter. Use testcase_count.value = saved_testcase_count if the type supports assignment, or restructure to avoid resetting it entirely. Accessing internals like this is fragile.
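A minimal sketch of the counter shape the reviewer is suggesting (class and method names here are illustrative; the real ThreadSafeCounter in run-tests.py may differ), where a public set() keeps every mutation behind the lock instead of poking the private attribute:

```python
import threading

# Illustrative ThreadSafeCounter with a public set() method, so callers
# never need to assign to the private _value attribute directly.
class ThreadSafeCounter:
    def __init__(self, start=0):
        self._value = start
        self._lock = threading.Lock()

    def add(self, n=1):
        with self._lock:
            self._value += n

    def set(self, n):
        # Public, locked replacement for `counter._value = n`.
        with self._lock:
            self._value = n

    @property
    def value(self):
        return self._value

counter = ThreadSafeCounter()
counter.add(5)
saved = counter.value
counter.add(3)      # e.g. testcases counted during a retry
counter.set(saved)  # roll back through the public API
```

The alternative the reviewer also mentions, tracking the delta and calling counter.add(-delta), avoids needing any new method at all.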
tests/run-tests.py
# or a tuple of sys.platform strings to restrict retries to those platforms.
FLAKY_TESTS = {
    "thread/thread_gc1.py": ("GC race condition", None),
    "thread/stress_aes.py": ("timeout under QEMU emulation", None),
stress_aes.py is listed with platforms=None (all platforms) but the reason is "timeout under QEMU emulation". On MIPS/ARM QEMU it takes 70-90 s; retrying twice means up to 3x that timeout burning CI time. Restrict to the riscv64 platform, or drop from FLAKY_TESTS given ci.sh already excludes it there with --exclude thread/stress_aes.py anyway.
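The platform-restriction mechanism the allowlist already supports could express this. A sketch of how the lookup might gate on sys.platform (the helper name and exact behaviour are illustrative, not necessarily the PR's _is_flaky implementation):

```python
import sys

# Illustrative allowlist: the second tuple element is either None (retry
# on all platforms) or a tuple of sys.platform strings.
FLAKY_TESTS = {
    "thread/stress_heap.py": ("flaky on macOS", ("darwin",)),
    "thread/thread_gc1.py": ("GC race condition", None),
}

def is_flaky(test_file, platform=sys.platform):
    entry = FLAKY_TESTS.get(test_file)
    if entry is None:
        return False, ""
    reason, platforms = entry
    if platforms is not None and platform not in platforms:
        return False, ""  # listed, but not for this platform
    return True, reason
```

With this shape, tightening the stress_aes.py entry is a one-line change to its platforms tuple rather than new logic.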
tests/test_utils.py
flaky_tests = [
    r for r in test_results if r[1] == "pass" and r[2].startswith(FLAKY_REASON_PREFIX)
]
if len(flaky_tests) > 0:
Use if flaky_tests: not if len(flaky_tests) > 0:.
tests/test_utils.py
flaky_failed = [
    r for r in test_results if r[1] == "fail" and r[2].startswith(FLAKY_REASON_PREFIX)
]
if len(flaky_failed) > 0:
Same: if flaky_failed: not if len(flaky_failed) > 0:.
import pyboard

# Prefix used by run-tests.py to tag flaky-retry results.
FLAKY_REASON_PREFIX = "flaky"
Using a string prefix to tag result tuples is fragile -- a test whose reason string happens to start with "flaky" would be misidentified. Better to use a distinct status value such as "pass-retry" or extend the result tuple with a dedicated field.
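The false-positive risk and the fix can be shown in a few lines. The 4-tuple layout below is illustrative, not the actual run-tests.py result schema:

```python
FLAKY_REASON_PREFIX = "flaky"

# Prefix-based detection misfires on an unrelated reason string that
# happens to start with "flaky":
old_style = ("some_test.py", "pass", "flaky socket teardown in test body")
prefix_match = old_style[2].startswith(FLAKY_REASON_PREFIX)  # false positive

# An explicit is_flaky_retry field removes the ambiguity (hypothetical
# 4-element tuple shape):
new_style = ("some_test.py", "pass", "socket teardown", True)
flaky = [r for r in [old_style + (False,), new_style] if r[3]]
```

A named tuple or dataclass would make the extra field self-documenting, at the cost of touching every result-tuple consumer.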
tests/run-tests.py
    return False, ""
return True, reason

results = test_results.value  # direct ref to internal list
Worth verifying this is safe. If run_one_test appends via the ThreadSafeCounter wrapper method (which takes the lock), and here we mutate the raw list directly (bypassing the lock), the del results[pre_len:] could race with a concurrent appender. The comment claims this block runs after all parallel work is done; confirm that is always true, especially for the single-job (sequential) path.
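A safer shape, sketched with illustrative types (not the actual run-tests.py classes), keeps the truncation behind the same lock as the appends so the question never arises:

```python
import threading

# Illustrative thread-safe list wrapper: every mutation, including the
# retry cleanup that run-tests.py does with `del results[pre_len:]` on
# the raw list, goes through the lock.
class ThreadSafeList:
    def __init__(self):
        self._list = []
        self._lock = threading.Lock()

    def append(self, item):
        with self._lock:
            self._list.append(item)

    def truncate(self, length):
        # Locked equivalent of `del raw_list[length:]`; returns what was cut.
        with self._lock:
            removed = self._list[length:]
            del self._list[length:]
            return removed

    @property
    def value(self):
        return self._list

results = ThreadSafeList()
results.append(("a.py", "pass", ""))
pre_len = len(results.value)
results.append(("a.py", "fail", "retry output"))
retry_entries = results.truncate(pre_len)
```

Even if the "all parallel work is done" claim holds today, routing the mutation through a locked method makes the code robust to future reordering.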
Fixes a build issue on newer Zephyr versions. Signed-off-by: Antonio Galea <antonio.galea@gmail.com>
The Pololu Zumo 2040 Robot is supported in pico-sdk now so we should not include the header file here anymore, similarly to other boards. This is necessary for future changes from the SDK to be reflected in MicroPython builds. Signed-off-by: Paul Grayson <paul@pololu.com>
Replace the custom rosc_random_u8()/rosc_random_u32() implementation with the pico_rand API from the Pico SDK. The RP2040 datasheet notes that ROSC "does not meet the requirements of randomness for security systems because it can be compromised", and the current 8-bit LFSR conditioning is not a vetted algorithm under NIST SP 800-90B. pico_rand uses various hardware RNG sources depending on the available platform (including the RP2350 hardware TRNG) and is officially supported and maintained as part of the Pico SDK. This changes os.urandom(), the mbedTLS entropy source, the PRNG seed, and the lwIP random function to all use pico_rand, and removes the custom ROSC random functions from main.c. Signed-off-by: Michel Le Bihan <michel@lebihan.pl>
When the timeout parameter of `esp32.RMT.wait_done()` is set to a non-zero value, the underlying `rmt_tx_wait_all_done` blocks (it passes the timeout to `xQueueReceive`). Thus we should release the GIL so that other MicroPython threads are not blocked from running. Signed-off-by: Daniël van de Giessen <daniel@dvdgiessen.nl>
This commit lets the native emitter preserve the value of the index register when performing register-indexed loads or stores of halfword or word values on Thumb. The original code was optimised too aggressively for a register-starved architecture like Thumb, and the index value in the sequence to generate was assumed to be allocated somewhere safe. This is valid on other architectures, but not on Thumb. To solve this, load operations do clobber a temporary register that should be safe to use, REG_TEMP2, to store the scaled register offset. REG_TEMP2's value is only used within the scope of a single ASM API instruction. Save operations unfortunately use a register that is aliased to REG_TEMP2, since they need to have three values in registers to perform the operation. This means the index register needs to be pushed to the stack before performing the scale + store operation, and then popped from the stack. That's a 4 bytes penalty on each store and a minor speed hit on generated code (plus a minor footprint increase of the firmware image). Signed-off-by: Alessandro Gatti <a.gatti@frob.it>
This commit lets the native emitter preserve the value of the index register when performing register-indexed loads or stores for halfword or word values on RV32. The original code was optimised too aggressively to reduce the generated code's size, using compressed opcodes that alias the target register to one of the operands. In register-indexed load/store operations, the index register was assumed to be allocated somewhere safe, but it was not always the case.

To solve this, now all halfword and word register-indexed operations will use REG_TEMP2 to store the scaled index register. The size penalty on generated code varies across operation sizes and enabled extensions:
- byte operations stay the same size with or without Zba
- halfword operations will be 2 bytes larger without Zba, and will stay the same size with Zba
- word operations will be 4 bytes larger without Zba, and 2 bytes larger with Zba

There is also a minor firmware footprint increase to hold the extra logic needed for conditional register clobbering, but it shouldn't be that large anyway.

Signed-off-by: Alessandro Gatti <a.gatti@frob.it>
This commit introduces a test that should check whether viper load or store operations won't clobber either the buffer address or the index value being used. Signed-off-by: Alessandro Gatti <a.gatti@frob.it>
Call `mp_event_handle_nowait()` in the VFS reader buffer refill path so
that pending scheduled events (USB task, network poll, etc.) get processed
during long-running import/parse/compile operations.
Without this, importing a large Python module from the filesystem blocks
for too long, causing the TinyUSB event queue to overflow. For example, on
renesas-ra, running a script that imports iperf3 via mpremote run asserts,
most likely due to SOF interrupts not being processed:
queue_event at lib/tinyusb/src/device/usbd.c:382
dcd_event_handler at lib/tinyusb/src/device/usbd.c:1318
dcd_event_sof at lib/tinyusb/src/device/dcd.h:237
dcd_int_handler at tinyusb/src/portable/renesas/rusb2/dcd_rusb2.c:964
<signal handler called>
disk_ioctl at extmod/vfs_fat_diskio.c:125
validate at lib/oofatfs/ff.c:3359
f_read at lib/oofatfs/ff.c:3625
file_obj_read at extmod/vfs_fat_file.c:75
mp_stream_rw at py/stream.c:60
mp_reader_vfs_readbyte at extmod/vfs_reader.c:59
next_char at py/lexer.c:174
mp_lexer_to_next at py/lexer.c:713
mp_parse at py/parse.c:1167
Signed-off-by: iabdalkader <i.abdalkader@gmail.com>
The aim of this commit is to clarify the command line options available. While they are available as well as in the CLI with --help, it's useful to document them and provide a few examples. Signed-off-by: Jos Verlinde <jos_verlinde@hotmail.com>
/review

1 similar comment

/review

Review failed.

/review

1 similar comment

/review
tests/run-tests.py, FLAKY_TESTS — thread/stress_aes.py entry (suggestion)
platforms=None means retries apply everywhere including native unix/macOS, but the documented reason is "timeout under QEMU emulation". On native platforms this test isn't slow; if it fails for a real reason it'll be silently retried. Either restrict to QEMU platform identifiers or remove it from the allowlist — it's not actually flaky on MIPS/ARM, just slow.
tests/run-tests.py, after FLAKY_TEST_MAX_RETRIES = 2 (nitpick)
Missing blank line between the new constant and the existing # Tests require at least CPython 3.3 comment block below it.
tests/run-tests.py, testcase_count._value = saved_testcase_count (suggestion)
Don't write to ._value directly — that bypasses the ThreadSafeCounter abstraction. Better options:
- Track the count delta added during retries and subtract it via testcase_count.add(-delta).
- Add a public set() or reset() method to ThreadSafeCounter.
tests/run-tests.py, --no-retry argument (nitpick)
default=False is redundant for action="store_true". Can remove.
tests/test_utils.py, flaky summary blocks (nitpick)
if len(flaky_tests) > 0: → if flaky_tests: (same for flaky_failed).
Factor out mp_os_urandom() of each port into extmod/modos.c, which then calls the port-specific function mp_hal_get_random(). Move mp_hal_get_random() to mphalport where suitable. At the MIMXRT and SAMD it is left in modos.c, since there are different implementation depending on the MCU family. At the ALIF, ESP32, CC3200 and RP2 port the file modos.c was removed, since it was empty after moving mp_hal_get_random(). Tested for the cc3200, esp32, esp8266, mimxrt, nrf, rp2, samd, stm32 and unix ports. Compiled for the alif and the renesas port. Signed-off-by: robert-hh <robert@hammelrath.com>
Fixes issue micropython#18825. Signed-off-by: robert-hh <robert@hammelrath.com>
Signed-off-by: Damien George <damien@micropython.org>
/review
The retry approach is reasonable for reducing spurious CI failures. A few issues to address:
tests/run-tests.py, line R1113 — testcase_count._value = saved_testcase_count
Accessing ._value directly bypasses ThreadSafeCounter's public interface and any associated locking. If testcase_count.value is a read-only property, add a set() or reset() method to ThreadSafeCounter and use that here.
tests/run-tests.py, line R51 — "thread/stress_aes.py": ("timeout under QEMU emulation", None)
Platform is None (all platforms), but this test runs fine on macOS/native and is only slow on QEMU targets. On MIPS (~90s) and ARM (~70s), a retry on timeout failure burns up to 270s/210s of CI time. Either restrict to QEMU platforms or remove from FLAKY_TESTS entirely and rely on the existing --exclude in the rare case it times out.
tests/test_utils.py, lines R330 and R341 — r[2].startswith(FLAKY_REASON_PREFIX)
Detecting flaky results via startswith on the third tuple element is fragile — the result tuple has no documented schema for r[2]. If a non-flaky test's output string happens to start with "flaky", it'll be miscategorised. Consider using a 4-element tuple (test_file, status, reason, is_flaky_retry) or a named type instead of co-opting the reason string.
tests/run-tests.py, line R1064 — def _is_flaky(test_file):
No reason to define this inside the if not args.no_retry: block; it can be a module-level function alongside FLAKY_TESTS.
tests/test_utils.py, line R332 — if len(flaky_tests) > 0:
Please use if flaky_tests: (likewise if flaky_failed: on R343).
tests/run-tests.py, line R1237 — default=False
Redundant with action="store_true"; remove it.
/review
Two bugs in FLAKY_TESTS: stress_recurse.py is listed twice so the first entry is silently dropped, and testcase_count._value is set directly via a private attribute that bypasses the ThreadSafeCounter abstraction. Several entries have platform-specific failure reasons but platforms=None, which will trigger unnecessary retries on native runners.
tests/run-tests.py
    "thread/thread_gc1.py": ("GC race condition", None),
    "thread/stress_aes.py": ("timeout under QEMU emulation", None),
    "thread/stress_schedule.py": ("intermittent crash under QEMU", None),
    "thread/stress_recurse.py": ("stack overflow under emulation", None),
Duplicate key — "thread/stress_recurse.py" already appears on the line above, so this entry silently overwrites the first ("intermittent crash under QEMU" is dropped). Merge into one entry.
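The silent-overwrite behaviour is easy to demonstrate: Python dict literals accept duplicate keys without error, and the last occurrence wins. Linters such as pyflakes flag this pattern. Minimal reproduction:

```python
# Duplicate keys in a dict literal don't raise; the later entry silently
# replaces the earlier one, so "intermittent crash under QEMU" is lost.
FLAKY_TESTS = {
    "thread/stress_recurse.py": ("intermittent crash under QEMU", None),
    "thread/stress_recurse.py": ("stack overflow under emulation", None),
}
```

Merging the two reasons into one entry (e.g. "intermittent crash / stack overflow under emulation") preserves both pieces of information.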
tests/run-tests.py
# or a tuple of sys.platform strings to restrict retries to those platforms.
FLAKY_TESTS = {
    "thread/thread_gc1.py": ("GC race condition", None),
    "thread/stress_aes.py": ("timeout under QEMU emulation", None),
Reason says "timeout under QEMU emulation" but platforms=None means this test will also be retried on native macOS/Linux runners where it doesn't time out. Restrict to QEMU platforms or broaden the reason.
tests/run-tests.py
    "thread/stress_schedule.py": ("intermittent crash under QEMU", None),
    "thread/stress_recurse.py": ("stack overflow under emulation", None),
    "thread/stress_heap.py": ("flaky on macOS", ("darwin",)),
    "cmdline/repl_lock.py": ("REPL timing under QEMU", None),
Same issue: "REPL timing under QEMU" but platforms=None. Should restrict to QEMU platforms.
tests/run-tests.py
        test_file, FLAKY_TEST_MAX_RETRIES, flaky_reason
    )
)
testcase_count._value = saved_testcase_count
Don't assign to ._value directly. Add a reset(value) method (or similar) to ThreadSafeCounter and use that here.
tests/run-tests.py
cmd_parser.add_argument(
    "--no-retry",
    action="store_true",
    default=False,
default=False is redundant with action="store_true".
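This is easy to confirm: argparse's store_true action already defaults to False, so omitting default= changes nothing. A quick check:

```python
import argparse

# store_true implies default=False, so spelling it out is redundant.
parser = argparse.ArgumentParser()
parser.add_argument("--no-retry", action="store_true")

args_default = parser.parse_args([])
args_flagged = parser.parse_args(["--no-retry"])
```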
tests/test_utils.py
flaky_tests = [
    r for r in test_results if r[1] == "pass" and r[2].startswith(FLAKY_REASON_PREFIX)
]
if len(flaky_tests) > 0:
Use if flaky_tests: instead of if len(flaky_tests) > 0:.
tests/test_utils.py
flaky_failed = [
    r for r in test_results if r[1] == "fail" and r[2].startswith(FLAKY_REASON_PREFIX)
]
if len(flaky_failed) > 0:
Use if flaky_failed: instead.
.github/workflows/ports_unix.yml
(cd ports/unix && gcov -o build-coverage/py ../../py/*.c || true)
(cd ports/unix && gcov -o build-coverage/extmod ../../extmod/*.c || true)
- name: Upload coverage to Codecov
  if: ${{ secrets.CODECOV_TOKEN != '' }}
This Codecov change is unrelated to the flaky-retry feature. Should be a separate PR.
Force-pushed from 5571e56 to cbc5529
This is a reproducer for micropython#17117. Signed-off-by: Jeff Epler <jepler@unpythonic.net>
Signed-off-by: Jeff Epler <jepler@unpythonic.net>
This turns the reproducer into a sensible-ish crash:
TypeError: exceptions must derive from BaseException
Closes: micropython#17117
Signed-off-by: Jeff Epler <jepler@unpythonic.net>
Signed-off-by: Jeff Epler <jepler@unpythonic.net>
Signed-off-by: stijn <stijn@ignitron.net>
This is convenient when trying to figure out the correct values for --include/--exclude/--test-dirs/... arguments. Signed-off-by: stijn <stijn@ignitron.net>
Test file paths which get passed to the run_tests function can be absolute or relative and with or without leading slash in the latter case, depending on the arguments to run-tests.py, but the skip_tests list with tests to skip only contains relative paths so using simple string equality comparison easily leads to false negatives. Compare the full absolute path instead such that it doesn't matter anymore in which form the tests are passed.

Note:
- use realpath to resolve symlinks plus make the comparison case insensitive on windows
- the test_file passed to run_one_test is not altered by this commit, such that when the user inputs relative paths the tests are also still displayed with relative paths
- likewise the test_file_abspath is not modified because functions like run_micropython rely on it having forward slashes

In practice this means that it used to be so that the only forms of running tests for which the skip_tests lists actually worked were:

>python ./run-tests.py
>python ./run-tests.py -d extmod

whereas it now works consistently so also for these invocations, which in the end all point to the exact same path:

>python ./run-tests.py -d ./extmod
>python ./run-tests.py -d ../tests/extmod
>python ./run-tests.py -d /full/path/to/tests/extmod

These examples used to not skip any of the tests in the extmod/ directory thereby leading to test failures.

Signed-off-by: stijn <stijn@ignitron.net>
Scan the --test-dirs argument for the main tests directory being passed and if so do the same thing as if running from within that main test directory. In practice this makes the following (which used to counterintuitively try and fail to run the .py files in the tests/ directory itself) >python micropython/tests/run-tests.py -d micropython/tests do the same thing as >cd micropython/tests >python ./run-tests.py which is logical and convenient. Signed-off-by: stijn <stijn@ignitron.net>
Signed-off-by: stijn <stijn@ignitron.net>
Test file paths which get passed to the run_tests function can be absolute or relative and with or without leading slash in the latter case, depending on the arguments to run-tests.py, but since that path is used to:
- display which tests run
- record which tests ran in the results.json
- craft the filename for the .exp/.out file for failed tests

it is desirable to always use the same file path regardless of how the user passed the path. In practice this means that all forms of running our own tests like:

>python ./run-tests.py -i extmod
>python ./run-tests.py -d extmod
>python ./run-tests.py -d ./extmod
>python ./run-tests.py -d ../tests/extmod
>python ./run-tests.py -d /full/path/to/tests/extmod

will now consistently all display the tests like

pass extmod/time_time_ns.py
FAIL extmod/some_failing_test.py

and produce output files like

results/extmod_some_failing_test.py.exp
results/extmod_some_failing_test.py.out

instead of displaying/using the exact path as passed. For external tests, meaning not in the tests/ directory, we also want to be consistent so there the choice was made to always use absolute paths.

Signed-off-by: stijn <stijn@ignitron.net>
Fixes fatal crash if serial port access returns an error (for example: port is native USB-CDC and the host hard faults during the test run). Instead of crashing, have the runner mark this as a test run error and continue. It's not certain the next test will run successfully, but this provides the context of output showing what was happening when the communication error occurred. Without this change, that output is lost when the fatal exception terminates the runner process. This work was funded through GitHub Sponsors. Signed-off-by: Angus Gratton <angus@redyak.com.au>
This gives a more user-friendly name when the Python object (eg Pin) is printed. If the nodelabel is unavailable then it uses `dev->name` as a fallback. Signed-off-by: Fin Maaß <f.maass@vogl-electronic.com>
Force-pushed from cbc5529 to 36ae4eb
Signed-off-by: Antonio Galea <antonio.galea@gmail.com>
Force-pushed from 032aa59 to 5ed226e
Reclassify failures of tests listed in flaky_tests_to_ignore as "ignored" instead of retrying them. Ignored tests still run and their output is reported, but they don't affect the exit code. The ci.sh --exclude lists for these tests are removed so they run normally. Signed-off-by: Andrew Leech <andrew.leech@planet-innovation.com>
Force-pushed from 5ed226e to 06ba55a
Summary
The unix port CI workflow on master fails ~18% of the time due to a handful of intermittent test failures (mostly thread-related). This has normalised red CI to the point where contributors ignore it, masking real regressions.
I added a flaky-test allowlist and retry-until-pass mechanism to run-tests.py. When a listed test fails, it's re-run up to 2 more times and passes on the first success. The 8 tests on the list account for essentially all spurious CI failures on master. tools/ci.sh is updated to remove --exclude patterns for tests now handled by the retry logic, so they get actual coverage again. thread/stress_aes.py stays excluded on RISC-V QEMU where it runs close to the 200s timeout and retries would burn excessive CI time.

Retries are on by default; --no-retry disables them. The end-of-run summary reports tests that passed via retry separately from clean passes.

Testing

Not yet tested in CI — this draft PR is to exercise the review workflow and get CI results on the fork.

Trade-offs and Alternatives

Retry-until-pass can mask a test that has become intermittently broken. The allowlist is deliberately small and each entry documents a reason and optional platform restriction. An alternative would be to mark these tests as expected-fail, but that removes coverage entirely, which is worse than the current situation.

The test result status strings ("pass", "fail", "skip") are bare string literals throughout run-tests.py — the retry code follows this existing convention. Worth considering a follow-up to consolidate these into constants for the whole file.

Generative AI
I used generative AI tools when creating this PR, but a human has checked the code and is responsible for the description above.