[BUG] rv-virt/citest: test_hello or test_pipe failed #14808

Closed
lupyuen opened this issue Nov 15, 2024 · 11 comments
Labels
Arch: risc-v · Area: Build system · OS: Linux · Type: Bug

Comments

@lupyuen
Member

lupyuen commented Nov 15, 2024

Description / Steps to reproduce the issue

Since yesterday, rv-virt/citest has been failing from test_hello onwards, or from test_pipe onwards, hanging our CI Checks in GitHub and the Build Farm. (GitHub cancels the job after 6 hours.)

It might have been caused by one of these NuttX Commits:

Or maybe one of these NuttX Apps Commits:

Also, when one test fails, why do the rest of the tests take such a long time to fail, hanging our CI Checks in GitHub and the Build Farm?

Fail at test_hello onwards: https://github.com/NuttX/nuttx/actions/runs/11833005280/job/32970891697#step:7:143

Configuration/Tool: rv-virt/citest
$ cd /github/workspace/sources/nuttx/tools/ci/testrun/script
$ python3 -m pytest -m 'qemu or rv_virt' ./ -B rv-virt -P /github/workspace/sources/nuttx -L /github/workspace/sources/nuttx/boards/risc-v/qemu-rv/rv-virt/configs/citest/logs/rv-virt/qemu -R qemu -C --json=/github/workspace/sources/nuttx/boards/risc-v/qemu-rv/rv-virt/configs/citest/logs/rv-virt/qemu/pytest.json

test_framework/test_cmocka.py::test_cmocka PASSED                        [  0%]
test_example/test_example.py::test_hello FAILED                          [  0%]
test_example/test_example.py::test_helloxx FAILED                        [  0%]
test_example/test_example.py::test_pipe FAILED                           [  0%]
test_example/test_example.py::test_popen FAILED                          [  0%]
test_example/test_example.py::test_usrsocktest FAILED                    [  0%]
[ Everything fails very slowly ]

Fail at test_pipe onwards: https://github.com/NuttX/nuttx/actions/runs/11850442831/job/33025374105#step:7:145

test_framework/test_cmocka.py::test_cmocka PASSED                        [  0%]
test_example/test_example.py::test_hello PASSED                          [  0%]
test_example/test_example.py::test_helloxx PASSED                        [  0%]
test_example/test_example.py::test_pipe FAILED                           [  0%]
test_example/test_example.py::test_popen FAILED                          [  0%]
test_example/test_example.py::test_usrsocktest FAILED                    [  0%]
[ Everything fails very slowly ]

On which OS does this issue occur?

[OS: Linux]

What is the version of your OS?

Ubuntu LTS at GitHub Actions

NuttX Version

master

Issue Architecture

[Arch: risc-v]

Issue Area

[Area: Build System]

Verification

  • I have verified before submitting the report.
@lupyuen lupyuen added the Type: Bug label Nov 15, 2024
@github-actions github-actions bot added the Arch: risc-v, Area: Build system and OS: Linux labels Nov 15, 2024
@lupyuen
Member Author

lupyuen commented Nov 16, 2024

The Timeout Values are configured to One Minute or longer for some Python Tests. What if we reduce the Timeout Values? https://github.com/search?q=repo%3Aapache%2Fnuttx+timeout%3D+language%3APython+path%3A%2F%5Etools%5C%2Fci%5C%2Ftestrun%5C%2F%2F&type=code

Update: Nope, doesn't work: https://github.com/lupyuen/nuttx-build-farm/blob/main/run-job-macos.sh#L107-L131

Somehow the Timeout Value is hard-coded inside expect? https://github.com/apache/nuttx/blob/master/tools/ci/testrun/utils/common.py#L229-L288
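
For reference, here's a minimal sketch of how a per-call timeout can override the spawn-time default in pexpect, which is what the expect calls in common.py appear to build on (the command, prompts and values below are illustrative, not the actual common.py API):

import pexpect

## Hypothetical example: boot rv-virt under QEMU and run one command, with an
## explicit per-expect timeout instead of relying on the spawn-time default.
child = pexpect.spawn(
    "qemu-system-riscv32 -M virt -bios ./nuttx -nographic",
    timeout=60,        # default timeout for every expect() on this child
    encoding="utf-8",
)
child.expect("nsh> ", timeout=30)          # per-call value overrides the default
child.sendline("hello")
child.expect("Hello, World!!", timeout=10)
child.close()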

@lupyuen
Member Author

lupyuen commented Nov 16, 2024

For now we patched the NuttX Mirror Repo: Kill the CI Test if it exceeds 2 hours. Also for Ubuntu Build Farm and macOS Build Farm.
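
A rough sketch of the idea, assuming the CI step is launched from a small Python wrapper (the pytest arguments and the 2-hour budget below are illustrative, not the exact patch):

import subprocess

## Hypothetical watchdog: run the CI test but kill it after 2 hours so the
## runner is freed for the next build instead of hanging for 6 hours.
try:
    subprocess.run(
        ["python3", "-m", "pytest", "-m", "qemu or rv_virt", "./"],
        timeout=2 * 60 * 60,  # assumed 2-hour budget per CI job
        check=True,
    )
except subprocess.TimeoutExpired:
    print("CI Test exceeded 2 hours; killed so the next build can run")
    raise SystemExit(1)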

lupyuen added a commit to lupyuen2/wip-nuttx that referenced this issue Nov 19, 2024
CI Test will sometimes run for 6 hours (before getting auto-terminated by GitHub):
- apache#14808
- apache#14680

This is a problem because:
- It will increase our usage of GitHub Runners. Which may overrun the [GitHub Actions Budget](https://infra.apache.org/github-actions-policy.html) allocated by ASF.
- Suppose right after CI Test there's another build. If CI Test runs for all 6 hours, then the build after CI Test will never run.

For this PR: We assume that Every CI Job (e.g. risc-v-05) will complete normally within 2 hours. If any CI Job exceeds 2 hours: This PR will kill the CI Test Process `pytest` and allow the next build to run.
xiaoxiang781216 pushed a commit that referenced this issue Nov 19, 2024
JaeheeKwon pushed a commit to JaeheeKwon/nuttx that referenced this issue Nov 28, 2024
@xiaoxiang781216
Contributor

@lupyuen is it still broken after #14950?

@lupyuen
Member Author

lupyuen commented Dec 4, 2024

@xiaoxiang781216 Yep CI Test is still failing, according to NuttX Dashboard:

https://nuttx-dashboard.org/d/fe2q876wubc3kc/nuttx-build-history?from=now-7d&to=now&timezone=browser&var-arch=$__all&var-subarch=$__all&var-board=rv-virt&var-config=citest&var-group=$__all&var-Filters=

[Screenshot (2024-12-05): NuttX Build History dashboard showing rv-virt:citest builds failing]

https://github.com/NuttX/nuttx/actions/runs/12156309894/job/33899821697#step:7:88

test_framework/test_cmocka.py::test_cmocka PASSED                        [  0%]
test_example/test_example.py::test_hello PASSED                          [  0%]
test_example/test_example.py::test_helloxx FAILED                        [  0%]
test_example/test_example.py::test_pipe FAILED                           [  0%]
test_example/test_example.py::test_popen FAILED                          [  0%]

@lupyuen
Member Author

lupyuen commented Dec 11, 2024

NuttX ps is crashing inside QEMU RISC-V 32-bit, causing CI Test to fail. But why is ps crashing for rv-virt:citest, but not other configs? 🤔

## Start Docker Container for NuttX
sudo docker run \
  -it \
  ghcr.io/apache/nuttx/apache-nuttx-ci-linux:latest \
  /bin/bash

## Inside Docker:
## We compile rv-virt:citest
cd
git clone https://github.com/apache/nuttx
git clone https://github.com/apache/nuttx-apps apps
pushd nuttx ; echo NuttX Source: https://github.com/apache/nuttx/tree/$(git rev-parse HEAD) ; popd
pushd apps  ; echo NuttX Apps: https://github.com/apache/nuttx-apps/tree/$(git rev-parse HEAD) ; popd
cd nuttx
tools/configure.sh rv-virt:citest
make -j
qemu-system-riscv32 \
    -M virt \
    -bios ./nuttx \
    -nographic

NuttShell (NSH) NuttX-12.7.0
nsh> uname -a
NuttX  12.7.0 5607eece84 Dec 11 2024 07:05:48 risc-v rv-virt

nsh> ps
  PID GROUP PRI POLICY   TYPE    NPX STATE    EVENT     SIGMASK            STACK    USED FILLED COMMAND
    0     0   0 FIFO     Kthread   - Ready              0000000000000000 0001952 0000908  46.5%  Idle_Task
    1     0 224 RR       Kthread   - Waiting  Semaphore 0000000000000000 0001904 0000508  26.6%  hpwork 0x8014b1e4 0x8014b210
    2     0 100 RR       Kthread   - Waiting  Semaphore 0000000000000000 0001896 0000508  26.7%  lpwork 0x8014b1a0 0x8014b1cc
riscv_exception: EXCEPTION: Load access fault. MCAUSE: 00000005, EPC: 80008bfe, MTVAL: 01473e00
riscv_exception: PANIC!!! Exception = 00000005
dump_assert_info: Current Version: NuttX  12.7.0 5607eece84 Dec 11 2024 07:05:48 risc-v
dump_assert_info: Assertion failed panic: at file: common/riscv_exception.c:131 task: nsh_main process: nsh_main 0x8000a806
up_dump_register: EPC: 80008bfe

See the Complete Log

See the CI Test Log
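
To locate the faulting instruction, the EPC from the crash log can be mapped back to a source line with addr2line against the freshly built nuttx ELF. A hypothetical helper (the riscv-none-elf- toolchain prefix is an assumption; use whatever RISC-V binutils your environment provides):

import subprocess

## Hypothetical helper: resolve the EPC reported by up_dump_register to a
## function and source line in the ELF produced by "make -j" above.
ADDR2LINE = "riscv-none-elf-addr2line"   # assumed toolchain binary name
ELF = "./nuttx"                          # ELF produced by the build above
EPC = "80008bfe"                         # from "up_dump_register: EPC: 80008bfe"

result = subprocess.run(
    [ADDR2LINE, "-e", ELF, "-f", "-p", EPC],
    capture_output=True, text=True, check=True,
)
print(result.stdout.strip())             # e.g. "some_function at some_file.c:123"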

@tmedicci
Contributor

NuttX ps is crashing inside QEMU RISC-V 32-bit, causing CI Test to fail. But why is ps crashing for rv-virt:citest, but not other configs? 🤔

Apparently, #15075 decreased the available stack size, and this makes ps fail on rv-virt:citest. I'm running some tests to confirm and will submit a PR increasing the init task stack size.
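
For anyone who wants to try it locally, a minimal sketch of bumping the init task stack for this config, assuming kconfig-tweak is on PATH and CONFIG_INIT_STACKSIZE is the relevant symbol (the value 3072 is illustrative; the actual change went in via a PR):

import subprocess

## Hypothetical local experiment, run from the nuttx source root: enlarge the
## init task stack for rv-virt:citest and rebuild.
subprocess.run(["./tools/configure.sh", "rv-virt:citest"], check=True)
subprocess.run(["kconfig-tweak", "--set-val", "CONFIG_INIT_STACKSIZE", "3072"], check=True)
subprocess.run(["make", "olddefconfig"], check=True)
subprocess.run(["make", "-j"], check=True)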

@tmedicci
Contributor

As I said earlier in #15165, it didn't fix the failing CI, but it did fix the ps command crash.

I continued investigating and was able to determine that the commit that broke the CI was 656883f. I couldn't investigate it further, but I think this gives us a starting point.

@xiaoxiang781216
Contributor

xiaoxiang781216 commented Dec 12, 2024

656883f only modifies Arm-specific files; it's very strange that this patch could break a RISC-V chip, @tmedicci.

@tmedicci
Contributor

656883f only modifies Arm-specific files; it's very strange that this patch could break a RISC-V chip, @tmedicci.

I will double-check. You can verify it with:

sudo docker run -it ghcr.io/apache/nuttx/apache-nuttx-ci-linux:latest /bin/bash -c "
  cd /tmp ; ls -la
  pwd ;
  git clone https://github.com/apache/nuttx nuttx;
  git clone https://github.com/apache/nuttx-apps apps ;
  pushd nuttx ;
  git checkout 656883fec5561ca91502a26bf018473ca0229aa4;
  echo NuttX Source: https://github.com/apache/nuttx/tree/\$(git rev-parse HEAD) ; popd ;
  pushd apps  ;
  git checkout 3a6ecb82b5f1ee670cd3bca505256f25d339e7d5;
  echo NuttX Apps: https://github.com/apache/nuttx-apps/tree/\$(git rev-parse HEAD) ; popd ;
  cd nuttx/tools/ci ; cat testlist/risc-v-05.dat;
  (./cibuild.sh -c -A -N -R testlist/risc-v-05.dat || echo '***** BUILD FAILED') ;
"

(and then replace git checkout 656883fec5561ca91502a26bf018473ca0229aa4 with git checkout 656883fec5561ca91502a26bf018473ca0229aa4~1)

@tmedicci
Contributor

656883f only modifies Arm-specific files; it's very strange that this patch could break a RISC-V chip, @tmedicci.

Yes, the test is not conclusive. I ran it twice and the results are different...

@lupyuen
Member Author

lupyuen commented Dec 13, 2024

Thank you so much Tiago! I created a new Bug Report for the Load Access Fault at ltp_interfaces_pthread_barrierattr_init_2_1:

@lupyuen lupyuen closed this as completed Dec 13, 2024
linguini1 pushed a commit to CarletonURocketry/nuttx that referenced this issue Jan 15, 2025