Dear all,
We found that when LAVA executes a script which outputs a long string
(more than 30000 bytes) on a single line (with only one line break), the
LAVA web UI hangs and no further LAVA log output appears, while the
devices under test (DUTs) stay powered on until LAVA's job timeout
triggers. However, after checking the whole log file, we found that the
test cases after the hanging one were still executed (new files were
generated).
So the problem is: when LAVA encounters such a case, the web UI hangs,
and the DUTs may not be powered off even after all the cases have
completed!
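For reference, a minimal test step that reproduces the symptom looks
like this (hypothetical sketch; any command emitting a single line of
more than 30000 bytes behaves the same):

run:
  steps:
  # prints one ~40000-byte line, terminated by a single line break
  - python3 -c "print('x' * 40000)"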
best wishes,
Chuan Su
On Mon, 11 Dec 2018 at 11:30, Neil Williams <neil.williams at linaro.org> wrote:
> On Tue, 11 Dec 2018 at 11:28, Tim Jaacks <tim.jaacks(a)garz-fricke.com> wrote:
> >
> > Thanks, the CLI operations are very helpful for automating the process.
> > However, the docs say that all devices in "Reserved" state have to
> > have their "current job" cleared. I can use "lava-server manage devices details"
> > to check whether this field is actually set. There is no command to
> > modify it, though. Seems like using the Python API is the only way to
> > go here, right? The same applies to setting "Running" jobs to "Cancelled".
>
> https://git.lavasoftware.org/lava/lava/merge_requests/273
>
> This should get into the upcoming 2018.12 release.
Thank you very much for your quick help. The "lava-server manage jobs fail"
command takes care of clearing the "current job" field of the associated
device, do I understand that right?
Mit freundlichen Grüßen / Best regards
Tim Jaacks
DEVELOPMENT ENGINEER
Garz & Fricke GmbH
Tempowerkring 2
21079 Hamburg
Direct: +49 40 791 899 - 55
Fax: +49 40 791899 - 39
tim.jaacks(a)garz-fricke.com
www.garz-fricke.com
WE MAKE IT YOURS!
Registered office: D-21079 Hamburg
Register court: Amtsgericht Hamburg, HRB 60514
Managing directors: Matthias Fricke, Manfred Garz, Marc-Michael Braun
Hi folks,
We at Fairphone have developed a variant of the Tradefed-runner in LAVA
test-definitions that is meant to run complete Tradefed test suites on
multiple devices by making use of the shards feature in Tradefed. The
runner is currently in “staging” state, but we want to share now what
we are using and developing, to see if there are more people interested
in it. Feedback on the general approach taken would also be much
appreciated.
On the higher level, our setup works as follows:
• Use MultiNode to allocate multiple devices for one test submission.
• One “master” runs the Tradefed shell, similar to the existing
runner.
• The master connects to the workers’ DUTs via adb TCP/IP (see the
sketch after this list). These DUTs are transparently available to
Tradefed, just like USB-attached devices.
• Workers ensure that their respective DUTs remain accessible to
the master, especially in case of WLAN disconnects, reboots, crashes, etc.
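In rough terms, the adb TCP/IP handover sketched above works as follows
(simplified; the IP address and the MultiNode message name are
illustrative, not our actual code):

Worker:
  run:
    steps:
    - adb tcpip 5555                      # switch the DUT's adbd to TCP/IP mode
    - lava-send adb-ready ip=192.0.2.10   # announce the DUT to the master

Master:
  run:
    steps:
    - lava-wait adb-ready                 # wait until a worker announces its DUT
    - adb connect 192.0.2.10:5555         # DUT now shows up like a USB-attached device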
Major features of our runner:
• Support for Android CTS, GTS and STS.
• Test run split into “shards” in Tradefed to run tests in parallel
on multiple devices; see the example command after this list. This
allows for a major speedup when running large test suites.
• Tradefed retry: Rerun test suites until the failure count stabilizes.
• No adb root required.
• Based on the original Tradefed runner, with at least parts of
the common code moved to Python libraries.
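For the sharding itself we rely on Tradefed's built-in support;
conceptually, the master ends up issuing a Tradefed console command
along the lines of the following (the shard count is an example value):

run cts --shard-count 4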
Current limitations:
• Test executions are not always stable. This needs further
investigation.
• Test executions produce more false positives than local test
runs. This needs further investigation but is at least partially due to
using adb TCP/IP instead of a local USB connection.
• Android VTS is not implemented (would require only minor changes).
Our current changes have been pushed to the tradefed_shards_with_retry
topic on Gerrit[1]. Besides the two major changes to add MultiNode adb
support and then Tradefed support on top of that, a couple of smaller
changes that could be useful on their own have also been pushed.
We are looking forward to your feedback and to joint efforts in
automating and speeding up Tradefed test executions!
Best regards,
Karsten for the Fairphone Software Team
[1]
https://review.linaro.org/q/topic:%22tradefed_shards_with_retry%22+(status:…
On Mon, 10 Dec 2018 at 20:16, Neil Williams <neil.williams at linaro.org> wrote:
> Yes, there is a problem there - thanks for catching it. I think the
> bulk of the page dates from the last stages of the migration when V1
> data was still around. I'll look at an update of the page tomorrow.
> Step 7 is a sanity check that the install of the empty instance has
> gone well, Step 9 is to ensure that the newly restored database is put
> into maintenance as soon as possible to prevent any queued test jobs
> from attempting to start. The critical element of Step 9 is to ensure
> that the lava-master service is stopped.
>
> The emphasis of the section is on ensuring that the instance only
> serves a "Maintenance" page, e.g. the default Debian "It works!"
> apache page, to prevent access to the instance during the restore.
Thanks for pointing that out, Neil. I take the point that the Apache
server has to serve a static site during the restore process.
> Accessing the UI would involve having an alternative way to serve the
> pages. If that can be arranged, just for admins, (e.g. by changing the
> external routing to the box or redirecting DNS temporarily) then the
> UI on the instance can be used with the change that the
> lava-server-gunicorn service does not need to be stopped (because
> access has been redirected). Other services would be stopped. However,
> this would involve a fair number of apache config changes, so is best
> left to those admins who have such config already on hand.
>
> The operations can be done from the command line and that's probably
> best for these docs.
>
> Step 7 can be replaced by:
>
> lava-server manage check --deploy
>
> Step 9 can be replaced by looping over:
>
> lava-server manage devices update --health MAINTENANCE --hostname ${HOSTNAME}
>
> or, if there are a lot of devices:
>
> lava-server manage maintenance --force
>
> (This maintenance helper has been fixed in master - soon to be 2018.12
> - so older versions would use the first command & loop.)
Thanks, the CLI operations are very helpful for automating the process.
However, the docs say that all devices in "Reserved" state have to have
their "current job" cleared. I can use "lava-server manage devices details"
to check whether this field is actually set. There is no command to
modify it, though. Seems like using the Python API is the only way to go
here, right? The same applies to setting "Running" jobs to "Cancelled".
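For example, I assume a simple shell loop over the known hostnames
would cover the per-device case (hostnames are placeholders):

for host in device-01 device-02 device-03; do
    lava-server manage devices update --health MAINTENANCE --hostname "$host"
done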
> I'll look at changing the page to use CLI operations for steps 7 and
> 9. Some labs can do the http redirect / routing method but the detail
> of that is probably not in scope for this page in the LAVA docs. I'll
> add a note that admins have that choice but leave it for those admins
> to implement.
Mit freundlichen Grüßen / Best regards
Tim Jaacks
DEVELOPMENT ENGINEER
Garz & Fricke GmbH
Tempowerkring 2
21079 Hamburg
Direct: +49 40 791 899 - 55
Fax: +49 40 791899 - 39
tim.jaacks(a)garz-fricke.com
www.garz-fricke.com
WE MAKE IT YOURS!
Registered office: D-21079 Hamburg
Register court: Amtsgericht Hamburg, HRB 60514
Managing directors: Matthias Fricke, Manfred Garz, Marc-Michael Braun
Hello everyone,
I am trying to implement a backup and restore routine for our LAVA server, based on the documentation:
https://validation.linaro.org/static/docs/v2/admin-backups.html#restoring-a…
The creation of the backup is straightforward. I have problems with the order of the proposed restore steps, though.
Step 6 is "Stop all LAVA services". However, afterwards in step 7 it says "Make sure that this instance actually works by browsing a few (empty) instance pages." This should obviously be done before stopping the services, right?
The actual problem is that step 9 says "In the Django administration interface, take all devices which are not Retired into Offline". This cannot be solved by reordering, because the LAVA services actually must not be running during these modifications. How can I use the Django admin interface while all LAVA services are stopped?
Mit freundlichen Grüßen / Best regards
Tim Jaacks
DEVELOPMENT ENGINEER
Garz & Fricke GmbH
Tempowerkring 2
21079 Hamburg
Direct: +49 40 791 899 - 55
Fax: +49 40 791899 - 39
tim.jaacks(a)garz-fricke.com
www.garz-fricke.com
WE MAKE IT YOURS!
Registered office: D-21079 Hamburg
Register court: Amtsgericht Hamburg, HRB 60514
Managing directors: Matthias Fricke, Manfred Garz, Marc-Michael Braun
Dear all,
I have a question about using LAVA.
Background:
1. I have only one hardware device, running Android.
2. I have a device-type jinja2 file starting with "{% extends 'base-fastboot.jinja2' %}".
Here, I use "adb reboot bootloader" to enter fastboot.
3. I have another device-type jinja2 file starting with "{% extends 'base-uboot.jinja2' %}".
Here, I use "fastboot 0" in U-Boot to enter fastboot.
Now we have a scenario which needs to be tested with both of the above methods, but we only have one device. Is it possible for the user to define a parameter in job.yaml to switch between the two methods on a single device? Any suggestions?
Thanks,
Larry
Hello
I got the following crash with 2018.11 on debian stretch
dpkg -l | grep lava
ii  lava              2018.11-1~bpo9+1  all    Linaro Automated Validation Architecture metapackage
ii  lava-common       2018.11-1~bpo9+1  all    Linaro Automated Validation Architecture common
ii  lava-coordinator  0.1.7-1           all    LAVA Coordinator daemon
ii  lava-dev          2018.11-1~bpo9+1  all    Linaro Automated Validation Architecture developer support
ii  lava-dispatcher   2018.11-1~bpo9+1  amd64  Linaro Automated Validation Architecture dispatcher
ii  lava-server       2018.11-1~bpo9+1  all    Linaro Automated Validation Architecture server
ii  lava-server-doc   2018.11-1~bpo9+1  all    Linaro Automated Validation Architecture documentation
ii  lavacli           0.9.3-1~bpo9+1    all    LAVA XML-RPC command line interface
ii  lavapdu-client    0.0.5-1           all    LAVA PDU client
ii  lavapdu-daemon    0.0.5-1           all    LAVA PDU control daemon
2018-12-04 14:14:40,187 ERROR [EXIT] Unknown exception raised, leaving!
2018-12-04 14:14:40,187 ERROR string index out of range
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/lava_server/management/commands/lava-logs.py", line 193, in handle
    self.main_loop()
  File "/usr/lib/python3/dist-packages/lava_server/management/commands/lava-logs.py", line 253, in main_loop
    while self.wait_for_messages(False):
  File "/usr/lib/python3/dist-packages/lava_server/management/commands/lava-logs.py", line 287, in wait_for_messages
    self.logging_socket()
  File "/usr/lib/python3/dist-packages/lava_server/management/commands/lava-logs.py", line 433, in logging_socket
    job.save()
  File "/usr/lib/python3/dist-packages/django_restricted_resource/models.py", line 71, in save
    return super(RestrictedResource, self).save(*args, **kwargs)
  File "/usr/lib/python3/dist-packages/django/db/models/base.py", line 796, in save
    force_update=force_update, update_fields=update_fields)
  File "/usr/lib/python3/dist-packages/django/db/models/base.py", line 820, in save_base
    update_fields=update_fields)
  File "/usr/lib/python3/dist-packages/django/dispatch/dispatcher.py", line 191, in send
    response = receiver(signal=self, sender=sender, **named)
  File "/usr/lib/python3/dist-packages/lava_scheduler_app/signals.py", line 139, in testjob_notifications
    send_notifications(job)
  File "/usr/lib/python3/dist-packages/lava_scheduler_app/notifications.py", line 305, in send_notifications
    title, body, settings.SERVER_EMAIL, [recipient.email_address]
  File "/usr/lib/python3/dist-packages/django/core/mail/__init__.py", line 62, in send_mail
    return mail.send()
  File "/usr/lib/python3/dist-packages/django/core/mail/message.py", line 342, in send
    return self.get_connection(fail_silently).send_messages([self])
  File "/usr/lib/python3/dist-packages/django/core/mail/backends/smtp.py", line 107, in send_messages
    sent = self._send(message)
  File "/usr/lib/python3/dist-packages/django/core/mail/backends/smtp.py", line 120, in _send
    recipients = [sanitize_address(addr, encoding) for addr in email_message.recipients()]
  File "/usr/lib/python3/dist-packages/django/core/mail/backends/smtp.py", line 120, in <listcomp>
    recipients = [sanitize_address(addr, encoding) for addr in email_message.recipients()]
  File "/usr/lib/python3/dist-packages/django/core/mail/message.py", line 161, in sanitize_address
    address = Address(nm, addr_spec=addr)
  File "/usr/lib/python3.5/email/headerregistry.py", line 42, in __init__
    a_s, rest = parser.get_addr_spec(addr_spec)
  File "/usr/lib/python3.5/email/_header_value_parser.py", line 1988, in get_addr_spec
    token, value = get_local_part(value)
  File "/usr/lib/python3.5/email/_header_value_parser.py", line 1800, in get_local_part
    if value[0] in CFWS_LEADER:
IndexError: string index out of range
2018-12-04 14:14:40,211 INFO [EXIT] Disconnect logging socket and process messages
2018-12-04 14:14:40,211 DEBUG [EXIT] unbinding from 'tcp://0.0.0.0:5555'
2018-12-04 14:14:50,221 INFO [EXIT] Closing the logging socket: the queue is empty
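Judging from the last frames of the traceback, a notification recipient
appears to have an empty email address; the final error reproduces on
Python 3.5 with just (my assumption about the trigger):

from email.headerregistry import Address
# an empty addr_spec makes get_local_part() index into an empty string
Address("someone", addr_spec="")   # IndexError: string index out of range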
Regards
Hi,
I'm trying to experiment with the 'interactive' test shell. The docs are here:
https://master.lavasoftware.org/static/docs/v2/actions-test.html#index-1
As I understand it, the feature should work in the following way (the
docs aren't very clear):
1. wait for one of the prompts
2. send the command
3. match the result against a regex (pass or fail)
However, when I try it out, there is no wait for the prompt. LAVA
immediately sends the command and adds the value from prompts to the
list of expressions to match. Is this correct? Example job:
https://staging.validation.linaro.org/scheduler/job/245886#L53
LAVA sends the command even before the board starts booting.
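For context, an interactive block has roughly the following shape
(paraphrased from the docs; the prompt and match strings here are
placeholders, see the linked job for the real definition):

- test:
    interactive:
    - name: example
      prompts: ["=> "]
      script:
      - command: dhcp
        name: dhcp
        successes:
        - message: "DHCP client bound to address"
        failures:
        - message: "TIMEOUT"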
Any help is appreciated.
milosz
Hi everyone,
I’m facing an issue where U-Boot commands are sent one after another without waiting for a prompt. As a result, the device is not able to boot.
Excerpt from logs:
dhcp
=> dhcp
bootloader-commands: Wait for prompt ['=>', 'Resetting CPU', 'Must RESET board to recover', 'TIMEOUT', 'Retry count exceeded', 'ERROR: The remote end did not respond in time.'] (timeout 00:04:56)
setenv serverip 172.17.1.189
setenv serverip 172.17.1.189
bootloader-commands: Wait for prompt ['=>', 'Resetting CPU', 'Must RESET board to recover', 'TIMEOUT', 'Retry count exceeded', 'ERROR: The remote end did not respond in time.'] (timeout 00:04:56)
tftp 0x80000000 60/tftp-deploy-46_3_i9j/kernel/zImage-am335x-vcu.bin
tftp 0x80000000 60/tftp-deploy-46_3_i9j/kernel/zImage-am335x-vcu.bin
bootloader-commands: Wait for prompt ['=>', 'Resetting CPU', 'Must RESET board to recover', 'TIMEOUT', 'Retry count exceeded', 'ERROR: The remote end did not respond in time.'] (timeout 00:04:56)
tftp 0x83000000 60/tftp-deploy-46_3_i9j/ramdisk/ramdisk.cpio.gz.uboot
tftp 0x83000000 60/tftp-deploy-46_3_i9j/ramdisk/ramdisk.cpio.gz.uboot
bootloader-commands: Wait for prompt ['=>', 'Resetting CPU', 'Must RESET board to recover', 'TIMEOUT', 'Retry count exceeded', 'ERROR: The remote end did not respond in time.'] (timeout 00:04:55)
setenv initrd_size ${filesize}
setenv initrd_size ${filesize}
bootloader-commands: Wait for prompt ['=>', 'Resetting CPU', 'Must RESET board to recover', 'TIMEOUT', 'Retry count exceeded', 'ERROR: The remote end did not respond in time.'] (timeout 00:04:55)
tftp 0x82000000 60/tftp-deploy-46_3_i9j/dtb/am335x-vcu-prod1.dtb
tftp 0x82000000 60/tftp-deploy-46_3_i9j/dtb/am335x-vcu-prod1.dtb
bootloader-commands: Wait for prompt ['=>', 'Resetting CPU', 'Must RESET board to recover', 'TIMEOUT', 'Retry count exceeded', 'ERROR: The remote end did not respond in time.'] (timeout 00:04:55)
dhcp
dhcp
link up on port 0, speed 10, half duplex
BOOTP broadcast 1
setenv serverip 172.17.1.189
tftp 0x80000000 60/tftp-deploy-46_3_i9j/kernel/zImage-am335x-vcu.bin
BOOTP broadcast 2
tftp 0x83000000 60/tftp-deploy-46_3_i9j/ramdisk/ramdisk.cpio.gz.uboot
setenv initrd_size ${filesize}
tftp 0x82000000 60/tftp-deploy-46_3_i9j/dtb/am335x-vcu-prod1.dtb
BOOTP broadcast 3
BOOTP broadcast 4
BOOTP broadcast 5
Retry time exceeded
setenv bootargs 'console=ttyS2,115200n8 root=/dev/ram0 ti_cpsw.rx_packet_max=1526 ip=dhcp'
=> setenv bootargs 'console=ttyS2,115200n8 root=/dev/ram0 ti_cpsw.rx_packet_max=1526 ip=dhcp'
bootloader-commands: Wait for prompt ['=>', 'Resetting CPU', 'Must RESET board to recover', 'TIMEOUT', 'Retry count exceeded', 'ERROR: The remote end did not respond in time.'] (timeout 00:04:50)
setenv bootargs 'console=ttyS2,115200n8 root=/dev/ram0 ti_cpsw.rx_packet_max=1526 ip=dhcp'
setenv bootargs 'console=ttyS2,115200n8 root=/dev/ram0 ti_cpsw.rx_packet_max=1526 ip=dhcp'
=>
bootz 0x80000000 0x83000000 0x82000000
=> bootz 0x80000000 0x83000000 0x82000000
bootloader-commands: Wait for prompt Starting kernel (timeout 00:04:50)
bootz 0x80000000 0x83000000 0x82000000
bootz 0x80000000 0x83000000 0x82000000
Bad Linux ARM zImage magic!
=>
Bad Linux ARM zImage magic!
As you can see, all the commands sent after `dhcp` was issued, and before it completed or failed, are simply dropped.
Device type config:
{% extends 'base-uboot.jinja2' %}
{% set device_type = "vcu" %}
{% set console_device = 'ttyS2' %}
{% set baud_rate = 115200 %}
{% set interrupt_prompt = 'Press s to abort autoboot' %}
{% set interrupt_char = 's' %}
{% set bootloader_prompt = '=>' %}
{% set uboot_mkimage_arch = 'arm' %}
{% set bootz_kernel_addr = '0x80000000' %}
{% set bootz_ramdisk_addr = '0x83000000' %}
{% set bootz_dtb_addr = '0x82000000' %}
{% set extra_kernel_args = 'ti_cpsw.rx_packet_max=1526' %}
{% set kernel_start_message = 'Welcome to' %}
Device config:
{% extends 'vcu.jinja2' %}
{% set connection_command = 'telnet lava-disp-1.local 7000' %}
{% set power_on_command = 'relay-ctrl --relay 1 --state on' %}
{% set power_off_command = 'relay-ctrl --relay 1 --state off' %}
{% set hard_reset_command = 'relay-ctrl --relay 1 --toggle' %}
Boot action block from job definition:
- boot:
    timeout:
      minutes: 5
    method: u-boot
    commands: ramdisk
    auto_login:
      login_prompt: 'am335x-nmhw21 login: '
      username: root
    prompts:
    - 'fct@am335x-nmhw21:~# '
Have I misconfigured something? What am I missing? Thanks!
Best regards,
Andrejs Cainikovs.
Hello,
I have a LAVA job with long running test, so I put a timeout of 180
minutes in the test action itself[1].
However, the job times out after 3989 seconds (~66 min)[2].
Looking closer at the "Timing" section of the job, I see that
lava-test-shell indeed has a timeout of ~3989 seconds, but I have no
idea where that number comes from. That's neither the 10 minutes in the
default "timeouts" section, nor the 180 minutes I put in the "test"
action.
Hmm, after almost pushing send on this, I now see that in the
"timeouts" section, the whole job has a timeout of 70 minutes. So I
assume that acts as an absolute maximum, even if one of the actions
sets a higher timeout?
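In other words, the definition boils down to something like this
(values paraphrased from the job linked below; the comments reflect my
reading of the behaviour):

timeouts:
  job:
    minutes: 70      # apparently acts as the absolute cap
  action:
    minutes: 10
actions:
- test:
    timeout:
      minutes: 180   # silently capped by the 70-minute job timeout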
So I guess this email now turns into a feature request rather than a bug
report.
Maybe LAVA should show a warning at the top of the job if any of the
actions has a timeout that's longer than the job timeout.
Kevin
[1] http://lava.baylibre.com:10080/scheduler/job/60374/definition#defline89
[2] http://lava.baylibre.com:10080/scheduler/job/60374#results_694663