Hi, Remi,

 

Based on your MR, I simply removed the RetryAction for UBootRetry, and it works for our scenario.

See  https://git.lavasoftware.org/lava/lava/-/merge_requests/1147

 

From: Remi Duraffort <remi.duraffort@linaro.org>
Sent: Wednesday, April 22, 2020 4:47 PM
To: Larry Shen <larry.shen@nxp.com>
Cc: lava-users@lists.lavasoftware.org
Subject: Re: [EXT] Re: [Lava-users] Question about uboot action retry and timeout?

 

Hello,

 

If you have some time, please try it and send an MR.

 

On Tue, Apr 21, 2020 at 03:46, Larry Shen <larry.shen@nxp.com> wrote:

Yes, Remi,

 

I guess that if there is only one RetryAction in this scenario, everything will be OK.

 

From: Remi Duraffort <remi.duraffort@linaro.org>
Sent: Monday, April 20, 2020 11:23 PM
To: Larry Shen <larry.shen@nxp.com>
Cc: lava-users@lists.lavasoftware.org
Subject: [EXT] Re: [Lava-users] Question about uboot action retry and timeout?

 

Hello,

 

 

On Fri, Apr 17, 2020 at 11:01, Larry Shen <larry.shen@nxp.com> wrote:

Hi,

 

I have a question about the retry settings of the u-boot boot action. Our job is:

 

- boot:
    failure_retry: 2
    namespace: test_suite_1
    connection-namespace: burning-uboot_1
    method: u-boot
    commands: nfs
    auto_login:
      login_prompt: '(.*) login:'
      username: root
    prompts:
    - 'root@(.*):~#'
    timeout:
      minutes: 10

 

1. From the code:

 

UBootAction extends RetryAction, and its internal pipeline contains an action named UBootRetry, which also extends RetryAction.

If we define a retry and an exception is raised inside the pipeline, UBootRetry retries first, and then UBootAction retries again.

 

This seems confusing. Why do we need a nested retry here?

 

I believe that this is some technical debt. Mostly, the fact that BootAction inherits from RetryAction and not from Action has been forgotten somewhere. I sent some patches to make this fact more obvious by removing the BootAction class and inheriting directly from RetryAction.

This will help to identify places where the RetryAction should be replaced by an Action.
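The nesting discussed in point 1 can be illustrated with a minimal Python sketch. It is not LAVA's code: `JobError`, `run_with_retries`, and `simulate` are hypothetical names, and the sketch assumes that both the outer and the inner retry loop use the same `failure_retry` count. Under that assumption, `failure_retry: 2` allows up to four boot attempts, in the order outer 1 (inner 1, inner 2), outer 2 (inner 1, inner 2).

```python
class JobError(Exception):
    """Stand-in for a boot failure (hypothetical, not LAVA's class)."""


def run_with_retries(name, max_retries, body, log):
    """Call body() up to max_retries times, re-raising the last failure."""
    for attempt in range(1, max_retries + 1):
        log.append(f"{name} attempt {attempt}")
        try:
            return body()
        except JobError:
            if attempt == max_retries:
                raise


def simulate(failure_retry, boots_that_fail):
    """Run a nested outer/inner retry pair; return (result, boot count, log)."""
    log = []
    boots = {"count": 0}

    def boot():
        # Fail the first `boots_that_fail` boots, then succeed.
        boots["count"] += 1
        if boots["count"] <= boots_that_fail:
            raise JobError("boot timed out")
        return "booted"

    def inner():
        # Inner loop: plays the role of uboot-retry.
        return run_with_retries("uboot-retry", failure_retry, boot, log)

    try:
        # Outer loop: plays the role of uboot-action.
        result = run_with_retries("uboot-action", failure_retry, inner, log)
    except JobError:
        result = "failed"
    return result, boots["count"], log


result, boot_count, log = simulate(failure_retry=2, boots_that_fail=3)
print(result, boot_count)
for line in log:
    print(line)
```

With three failing boots, the fourth boot succeeds inside the second outer attempt, which is the same ordering as the retries reported in point 2 of this thread. The sketch ignores timeouts entirely; it only shows how the attempts multiply.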

 

2. In fact, the real issue for us is the following:

Let's suppose we define failure_retry: 2. Our situation is:

1) The first boot times out because of some random block issue.

2) Then it starts Retrying: 4.4 uboot-retry (599 sec), but it times out again.

3) Then it starts Retrying: 4 uboot-action (599 sec), but it times out again.

4) Then it starts Retrying: 4.4 uboot-retry (599 sec). This time the boot luckily succeeds, but before we can be happy about it, the last action in uboot-retry, export-device-env, finishes. Then it looks like the UBootAction timeout is resumed, and the lucky boot becomes useless although it actually booted successfully.

 

The log is:

start: 4.4.5 expect-shell-connection (timeout 00:07:23) [test_suite_1]
Forcing a shell prompt, looking for ['root@(.*):~#']

root@imx8mnevk:~#
expect-shell-connection: Wait for prompt ['root@(.*):~#'] (timeout 00:10:00)
Waiting using forced prompt support. 299.9747439622879s timeout
end: 4.4.5 expect-shell-connection (duration 00:00:00) [test_suite_1]
start: 4.4.6 export-device-env (timeout 00:07:23) [test_suite_1]
end: 4.4.6 export-device-env (duration 00:00:00) [test_suite_1]
uboot-action timed out after 727 seconds
end: 4.4 uboot-retry (duration 00:02:07) [test_suite_1]

 

I'm not sure, but it looks like the second uboot-action contains two uboot-retry runs because of the retry. As a result, when the uboot-action timeout is resumed, the time difference is less than 0, which directly raises an exception. Is this a bug, or do I misunderstand it?

 

I believe that we can remove the first RetryAction as https://git.lavasoftware.org/lava/lava/-/merge_requests/1127 has been merged.

 

duration = round(action_max_end_time - time.time())
if duration <= 0:
    signal.alarm(0)
    parent.timeout._timed_out(None, None)

 

Any suggestion for this?

 

This is a known issue. The retry logic only works for non-timeout errors as the remaining time after a timeout is 0. It might be possible to reset the remaining time to a positive value (if the job timeout is large enough).

I thought that the problem was solved. Maybe the fact that we have two RetryAction is breaking something?
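The effect described in this reply can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not LAVA's implementation: `resume_parent_timeout` is a hypothetical helper that mirrors the quoted `duration` check, the 600 s budget comes from `minutes: 10` in the job, and the 727 s figure comes from the log. That the parent deadline is fixed when uboot-action first starts, and that resetting it would help, are both assumptions.

```python
def resume_parent_timeout(action_max_end_time, now):
    """Mirror the quoted check: seconds left to re-arm, or None if timed out."""
    duration = round(action_max_end_time - now)
    if duration <= 0:
        return None  # the quoted code calls parent.timeout._timed_out() here
    return duration


TIMEOUT = 600        # "minutes: 10" from the job definition
FIRST_FAILURE = 600  # assumed: the first uboot-action attempt used its whole budget
BOOT_DONE = 727      # assumed: elapsed time at which the retried boot succeeds

# Deadline fixed when uboot-action first started: nothing is left at t=727,
# so the successful boot is still reported as a timeout.
shared_deadline = 0 + TIMEOUT
print(resume_parent_timeout(shared_deadline, BOOT_DONE))  # prints None

# Hypothetical fix: give the retry a fresh deadline when it starts.
fresh_deadline = FIRST_FAILURE + TIMEOUT
print(resume_parent_timeout(fresh_deadline, BOOT_DONE))   # prints 473
```

Passing the current time in as an argument, instead of calling `time.time()`, keeps the sketch deterministic. A fresh deadline per retry only makes sense if the overall job timeout is large enough to cover it, as noted in the reply.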

 

 

Rgds

 

--

Rémi Duraffort

LAVA Architect

Linaro


 
