lava-users December 2018

lava-users@lists.lavasoftware.org

18 participants
28 discussions

Have health-check job to be registered before submitting test job?

by tomato

Hi,

I have a problem installing the LAVA server and dispatcher using the docker images that Linaro offers. I installed both images (server and dispatcher) on my local PC. When I submit a job, the submitted job is listed on the LAVA server, but its status remains 'Submitted' and does not change.

When I visit {local ip address:port number}/scheduler/device/qemu01 on the server, I can see a message like below. Does this mean that a health-check job has to be registered before submitting a test job? If so, how do I do that?

I have looked for a way to solve this problem, but I couldn't find one. Although I tried to disable the health check on this device and forced the Health to 'Good', the Health status soon changes again: Good → Bad (Invalid device configuration).

Below is what I did to install the LAVA server and dispatcher.

- LAVA Server

1) Pull docker image and run.

    $ docker pull lavasoftware/lava-server:2018.11
    $ docker run -itd --name new_lava_server --cap-add=NET_ADMIN \
        -p 9099:80 -p 5557:5555 -p 5558:5556 -h new_lava_server \
        lavasoftware/lava-server:2018.11

2) Create superuser

Create id as admin, pw as admin.

    $ lava-server manage createsuperuser

3) Create token

Create token for admin account on server web ui.

4) Add device type and device

    $ lava-server manage device-types add qemu

5) Add device dictionary

    $ lava-server manage devices add --device-type qemu --worker new_lava_slave qemu01

- LAVA dispatcher

1) Pull docker image and run.

    $ docker pull lavasoftware/lava-dispatcher:2018.11
    $ docker run -it --name new_lava_slave \
        -v /boot:/boot -v /lib/modules:/lib/modules -v /home/lava-slave/LAVA-TEST:/opt/share \
        -v /dev/bus/usb:/dev/bus/usb -v ~/.ssh/id_rsa_lava.pub:/home/lava/.ssh/authorized_keys:ro -v /sys/fs/cgroup:/sys/fs/cgroup \
        --device=/dev/ttyUSB0 \
        -p 2022:22 -p 5555:5555 -p 5556:5556 \
        -h new_lava_slave \
        --privileged \
        -e LAVA_SERVER_IP="192.168.1.44" \
        -e "LOGGER_URL=tcp://192.168.1.44:5557" \
        -e "MASTER_URL=tcp://192.168.1.44:5558" \
        -e "DISPATCHER_HOSTNAME=--hostname=new_lava_slave" \
        lavasoftware/lava-dispatcher:2018.11

2) Submit job file

    $ ./submityaml.py -p -k apikey.txt qemu01.yaml

Below is the submityaml.py python code. The apikey.txt file contains the token created on the server.

    #!/usr/bin/python
    import argparse
    import os.path
    import sys
    import time
    import xmlrpclib

    SLEEP = 5
    __version__ = 0.5
    LAVA_SERVER_IP = "192.168.1.44"


    def is_valid_file(parser, arg, flag):
        if not os.path.exists(arg):
            parser.error("The file %s does not exist!" % arg)
        else:
            return open(arg, flag)  # return an open file handle


    def setup_args_parser():
        """Setup the argument parsing.

        :return The parsed arguments.
        """
        description = "Submit job file"
        parser = argparse.ArgumentParser(version=__version__, description=description)
        parser.add_argument("yamlfile", help="specify target job file", metavar="FILE",
                            type=lambda x: is_valid_file(parser, x, 'r'))
        parser.add_argument("-d", "--debug", action="store_true",
                            help="Display verbose debug details")
        parser.add_argument("-p", "--poll", action="store_true",
                            help="poll job status until job completes")
        parser.add_argument("-k", "--apikey", default="apikey.txt",
                            help="File containing the LAVA api key")
        parser.add_argument("--port", default="9099",
                            help="LAVA/Apache default port number")
        return parser.parse_args()


    def loadConfiguration():
        global args
        args = setup_args_parser()


    def loadJob(server_str):
        """loadJob - read the job file for future submission
        """
        return args.yamlfile.read()


    def submitJob(yamlfile, server):
        """submitJob - XMLRPC call to submit a job file

        returns jobid of the submitted job
        """
        # When making the call to submit_job, you have to send a string
        jobid = server.scheduler.submit_job(yamlfile)
        return jobid


    def monitorJob(jobid, server, server_str):
        """monitorJob - added to poll for a job to complete
        """
        if args.poll:
            sys.stdout.write("Job polling enabled\n")
            # wcount = number of times we loop waiting for the job to start
            wcount = 0
            # count = number of times we loop while the job is running
            count = 0

            f = open("job_status.txt", "w+")

            while True:
                status = server.scheduler.job_status(jobid)

                if status['job_status'] == 'Complete':
                    f.write("Complete\n")
                    break
                elif status['job_status'] == 'Canceled':
                    f.write("Canceled\n")
                    print '\nJob Canceled'
                    exit(0)
                elif status['job_status'] == 'Submitted':
                    sys.stdout.write("Job waiting to run for % 2d seconds\n" % (wcount * SLEEP))
                    sys.stdout.flush()
                    wcount += 1
                elif status['job_status'] == 'Running':
                    sys.stdout.write("Job Running for % 2d seconds\n" % (count * SLEEP))
                    sys.stdout.flush()
                    count += 1
                else:
                    f.write("unknown status\n")
                    print "unknown status"
                    exit(0)

                time.sleep(SLEEP)

            print '\n\nJob Completed: ' + str(count * SLEEP) + ' s (' + str(wcount * SLEEP) + ' s in queue)'


    def process():
        print "Submitting test job to LAVA server"
        loadConfiguration()
        user = "admin"
        with open(args.apikey) as f:
            line = f.readline()
            apikey = line.rstrip('\n')

        server_str = 'http://' + LAVA_SERVER_IP + ":" + args.port
        xmlrpc_str = 'http://' + user + ":" + apikey + "@" + LAVA_SERVER_IP + ":" + args.port + '/RPC2/'
        print server_str
        print xmlrpc_str
        server = xmlrpclib.ServerProxy(xmlrpc_str)
        server.system.listMethods()

        yamlfile = loadJob(server_str)

        jobid = submitJob(yamlfile, server)

        monitorJob(jobid, server, server_str)


    if __name__ == '__main__':
        process()

The job file named qemu01.yaml is below.

    # Your first LAVA JOB definition for an x86_64 QEMU
    device_type: qemu
    job_name: QEMU pipeline, first job

    timeouts:
      job:
        minutes: 15
      action:
        minutes: 5
      connection:
        minutes: 2
    priority: medium
    visibility: public

    # context allows specific values to be overridden or included
    context:
      # tell the qemu template which architecture is being tested
      # the template uses that to ensure that qemu-system-x86_64 is executed.
      arch: amd64

    metadata:
      # please change these fields when modifying this job for your own tests.
      docs-source: first-job
      docs-filename: qemu-pipeline-first-job.yaml

    # ACTION_BLOCK
    actions:
    - deploy:
        timeout:
          minutes: 5
        to: tmpfs
        images:
          rootfs:
            image_arg: -drive format=raw,file={rootfs}
            url: https://images.validation.linaro.org/kvm/standard/stretch-2.img.gz
            compression: gz

    # BOOT_BLOCK
    - boot:
        timeout:
          minutes: 2
        method: qemu
        media: tmpfs
        prompts: ["root@debian:"]
        auto_login:
          login_prompt: "login:"
          username: root

    - test:
        timeout:
          minutes: 5
        definitions:
        - repository: http://git.linaro.org/lava-team/lava-functional-tests.git
          from: git
          path: lava-test-shell/smoke-tests-basic.yaml
          name: smoke-tests
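The polling loop in the script above never gives up while a job stays in 'Submitted', which is exactly the symptom described. A minimal Python 3 sketch of the same polling with an upper bound is shown here. It only uses the two XML-RPC calls that already appear in the script (scheduler.submit_job and scheduler.job_status) and the same status strings; the helper names build_rpc_url and wait_for_job, the max_polls parameter and the "Timed out waiting" return value are made up for illustration and are not part of LAVA.

```python
import time


def build_rpc_url(user, token, host, port):
    """Build the XML-RPC endpoint URL in the same shape as the script above."""
    return "http://%s:%s@%s:%s/RPC2/" % (user, token, host, port)


def wait_for_job(server, job_id, sleep=5, max_polls=None, sleeper=time.sleep):
    """Poll scheduler.job_status until the job leaves Submitted/Running.

    Returns (final_status, polls_in_queue, polls_running). If max_polls is
    reached first, final_status is the hypothetical marker "Timed out waiting".
    """
    queued = 0   # polls that saw the job still waiting in the queue
    running = 0  # polls that saw the job running
    polls = 0
    while max_polls is None or polls < max_polls:
        status = server.scheduler.job_status(job_id)["job_status"]
        if status == "Submitted":
            queued += 1
        elif status == "Running":
            running += 1
        else:
            # Complete, Canceled, or anything else ends the wait.
            return status, queued, running
        polls += 1
        sleeper(sleep)
    return "Timed out waiting", queued, running


# Usage sketch (placeholders, needs a reachable server, so left commented out):
#   import xmlrpc.client
#   url = build_rpc_url("admin", "TOKEN", "192.168.1.44", "9099")
#   server = xmlrpc.client.ServerProxy(url)
#   with open("qemu01.yaml") as job_file:
#       job_id = server.scheduler.submit_job(job_file.read())
#   print(wait_for_job(server, job_id, max_polls=120))
```

Passing the server proxy in as an argument keeps the loop easy to exercise without a live instance, and a bounded wait makes a job that is stuck in the queue show up as an explicit result instead of an endless loop.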

6 years, 6 months

FW: Timeouts in LAVA failing

by Patryk Mungai Ndungu

Hello,

I have noticed that sometimes, when I run health checks, LAVA gets stuck during the HTTP download of the kernel and ramdisk needed to run the health check.

For example, in [1] there seems to be a 3 minute timeout for the deploy images section, but LAVA didn't pick this up and was stuck there for 17 hours. After the job was cancelled and the device health was manually set to unknown again, the health check succeeded (e.g. job 25 on the same LAVA instance).

I am running LAVA 2018.7.

[1] https://lava.ciplatform.org/scheduler/job/20

Thanks,
Patryk

Renesas Electronics Europe Ltd, Dukes Meadow, Millboard Road, Bourne End, Buckinghamshire, SL8 5FH, UK. Registered in England & Wales under Registered No. 04586709.

6 years, 6 months

Dependencies between master and worker when restoring from backups

by Tim Jaacks

Hello everyone,

I have written a backup script for my LAVA instance. While testing the restore process I stumbled upon issues. Are there any dependencies between the master and workers concerning backups? When the master crashes, but the worker does not, is it safe to restore the master only and keep the worker as it is? Or do I have to keep master and worker backups in sync and always restore both at the same time?

Restoring my master as described in the LAVA docs generally works. The web interface is back online, all the jobs and devices are in consistent states.

Restoring the worker is relatively easy, according to the docs. I installed the LAVA packages in their previous versions on a fresh (virtual) machine, restored /etc/lava-dispatcher/lava-slave and /etc/lava-coordinator/lava-coordinator.conf. The worker has status "online" in the LAVA web interface afterwards, so the communication seems to work.

However, starting a multinode job does not work. The job log says:

    lava-dispatcher, installed at version: 2018.5.post1-2~bpo9+1
    start: 0 validate
    Start time: 2018-12-18 12:25:14.335215+00:00 (UTC)
    This MultiNode test job contains top level actions, in order, of: deploy, boot, test, finalize
    lxc, installed at version: 1:2.0.7-2+deb9u2
    validate duration: 0.01
    case: validate
    case_id: 112
    definition: lava
    result: pass
    Initialising group b6eb846d-689f-40c5-b193-8afce41883ee
    Connecting to LAVA Coordinator on lava-server-vm:3079 timeout=90 seconds.

The last line is repeated in a loop until the job times out.

The lava-slave logfile says:

    2018-12-18 12:27:15,114 INFO master => START(12)
    2018-12-18 12:27:15,117 INFO [12] Starting job
    [...]
    2018-12-18 12:27:15,124 DEBUG [12] dispatch:
    2018-12-18 12:27:15,124 DEBUG [12] env : {'overrides': {'LC_ALL': 'C.UTF-8', 'LANG': 'C', 'PATH': '/usr/local/bin:/usr/local/sbin:/bin:/usr/bin:/usr/sbin:/sbin'}, 'purge': True}
    2018-12-18 12:27:15,124 DEBUG [12] env-dut :
    2018-12-18 12:27:15,129 ERROR [EXIT] 'NoneType' object has no attribute 'send_start_ok'
    2018-12-18 12:27:15,129 ERROR 'NoneType' object has no attribute 'send_start_ok'

It is the "job = jobs.create()" call in lava-slave's handle_start() routine which fails. Obviously there is a separate database on the worker (which I did not know about until now), which fails to be filled with values.

Does this database have to be backed up and restored? What is the purpose of this database? Is there anything I need to know about it concerning backups?

Mit freundlichen Grüßen / Best regards
Tim Jaacks
DEVELOPMENT ENGINEER
Garz & Fricke GmbH
Tempowerkring 2
21079 Hamburg
Direct: +49 40 791 899 - 55
Fax: +49 40 791899 - 39
tim.jaacks(a)garz-fricke.com
www.garz-fricke.com
WE MAKE IT YOURS!

Sitz der Gesellschaft: D-21079 Hamburg
Registergericht: Amtsgericht Hamburg, HRB 60514
Geschäftsführer: Matthias Fricke, Manfred Garz, Marc-Michael Braun

6 years, 6 months

Re: [Lava-users] Lava job always exits when running a long duration test case without outputs

by Neil Williams

Please make sure you include the mailing list in all replies so that others know when a problem has been fixed (and how it was fixed).

On Tue, 18 Dec 2018 at 12:00, Chuan Su <lavanxp(a)126.com> wrote:
>
> According to your comments, we checked our setup and found that we use ser2net & telnet to communicate with the DUT. However, ser2net sets its default timeout parameter to 600 seconds. When the DUT runs a long duration case (more than 600 seconds) without any log output, the connection is usually dropped by ser2net, and the telnet program always prints 'Connection closed by foreign host'. Anyway, thanks for your help!

See https://git.linaro.org/lava/lava-lab.git/tree/shared/server-configs/ser2net…

The Linaro lab in Cambridge sets all the ser2net configs to have a zero timeout.

> Sincerely,
> Chuan Su
>
> At 2018-12-18 15:59:00, "Neil Williams" <neil.williams(a)linaro.org> wrote:
> >On Tue, 18 Dec 2018 at 06:16, Chuan Su <lavanxp(a)126.com> wrote:
> >>
> >> Dear all,
> >> We have encountered an issue where our job always exits halfway when running a long duration test case (around 20 minutes) which outputs nothing, and the LAVA server reports an InfrastructureError and prints the following:
> >> Connection closed by foreign host.Marking unfinished test run as failed
> >
> >Connection closed by foreign host means that the serial connection
> >failed at the DUT - this is not a problem in the LAVA test job, this
> >is an infrastructure failure at your end. The foreign host (the DUT)
> >closed the serial connection. There is nothing LAVA can do about that.
> >The serial connection to the DUT has simply failed.
> >
> >If the serial connection is USB, check for logs on the worker like
> >/var/log/messages and /var/log/syslog for events related to the serial
> >connection. Check that the DUT didn't simply kill the serial
> >connection - maybe the DUT went into some kind of suspend mode.
> >
> >> definition: lava
> >> result: fail
> >> case: 0_apache-servers1
> >> uuid: 597_1.4.2.4.1
> >> duration: 603.53
> >> lava_test_shell connection dropped.end: 3.1 lava-test-shell (duration 00:10:05) [ns_s1]
> >> namespace: ns_s1
> >> extra: ...
> >> definition: lava
> >> level: 3.1
> >> result: fail
> >> case: lava-test-shell
> >> duration: 604.55
> >> lava-test-retry failed: 1 of 1 attempts. 'lava_test_shell connection dropped.'lava_test_shell connection dropped.
> >>
> >> And we just test it with a very simple python script as below:
> >> #!/usr/bin/env python3
> >> import time
> >> print('Hello,world!')
> >> time.sleep(1200)
> >> print("Hello,Lava!")
> >> We can see the 'Hello,world!' string in the output, but no more output of this program is found on the web UI!
> >> We just don't know what's wrong, so we have to mail you for help!
> >> Sincerely,
> >> Chuan Su
> >>
> >> _______________________________________________
> >> Lava-users mailing list
> >> Lava-users(a)lists.lavasoftware.org
> >> https://lists.lavasoftware.org/mailman/listinfo/lava-users
> >
> >--
> >
> >Neil Williams
> >=============
> >neil.williams(a)linaro.org
> >http://www.linux.codehelp.co.uk/

--
Neil Williams
=============
neil.williams(a)linaro.org
http://www.linux.codehelp.co.uk/
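The fix reported in this thread is on the infrastructure side (the ser2net timeout). For cases where that configuration cannot be changed, a test-side workaround is to keep the connection from going idle. The following is only a sketch under that assumption and was not proposed in the thread: the helper name run_with_heartbeat and the heartbeat text are hypothetical, and the interval has to stay below whatever idle timeout applies (600 seconds in the setup described above).

```python
import subprocess
import sys
import threading


def run_with_heartbeat(command, interval=60.0, out=sys.stdout):
    """Run `command` and print a short line every `interval` seconds
    while it is running, so the serial connection never looks idle.

    Returns the exit code of the command.
    """
    done = threading.Event()

    def beat():
        # Event.wait() returns False on timeout and True once `done` is set,
        # so this loop prints once per interval until the command finishes.
        while not done.wait(interval):
            out.write("heartbeat: still running\n")
            out.flush()

    ticker = threading.Thread(target=beat, daemon=True)
    ticker.start()
    try:
        return subprocess.call(command)
    finally:
        # Stop the heartbeat thread whether the command succeeded or not.
        done.set()
        ticker.join()
```

With the 20 minute test from the report, something like run_with_heartbeat(["python3", "long_test.py"], interval=300) would produce output every five minutes; long_test.py is a placeholder name.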

6 years, 6 months

Default patterns and fixup dicts in Lava-Test Test Definition 1.0

by Neil Williams

https://git.lavasoftware.org/lava/lava/issues/179

If your Lava-Test Test Definition 1.0 YAML files explicitly use a parse: block (like: https://git.linaro.org/qa/test-definitions.git/tree/automated/linux/ltp/ltp…) then this will remain supported in Definition 1.0.

If you use the monitors or interactive test actions, this does not affect you at all.

If you rely on LAVA to create a TestCase based on a command in the Lava-Test Test Definition just echoing "pass" or "fail", then this is the Default Pattern and this change will directly affect those test jobs.

The current Default Pattern and Fixup are lifted directly from V1 (https://git.lavasoftware.org/lava/lava/blob/master/lava_common/constants.py…):

    # V1 compatibility
    DEFAULT_V1_PATTERN = "(?P<test_case_id>.*-*)\\s+:\\s+(?P<result>(PASS|pass|FAIL|fail|SKIP|skip|UNKNOWN|unknown))"
    DEFAULT_V1_FIXUP = {
        "PASS": "pass",
        "FAIL": "fail",
        "SKIP": "skip",
        "UNKNOWN": "unknown",
    }

We've recently updated the documentation to drop mention of the default pattern support for the following reasons:

* It has always been problematic to encode a Python regular expression in YAML. Failures are difficult to debug and patterns are global for the entire test operation.
* The move towards more portable test definitions puts the emphasis on parsing the test output locally on the DUT using a customised parser. This has further advantages:
  * The pattern does not have to be mangled into YAML
  * The pattern can be implemented by a language other than Python
  * The pattern can change during the operation of the test shell, e.g. a different pattern may be required for setup than for the test itself.

We are now starting to plan for Lava-Test Test Definition 2.0 with an emphasis on requiring portable test scripts and removing more of the lava_test_shell Test Helper scripts. Full information on 2.0 will be available early in 2019.

As a first step, the generally unhelpful Default Pattern and Default Fixup dict are likely to be removed.

If you need this support, the pattern can be added to your Lava-Test Test Definition 1.0 YAML files.

In the next release, it is proposed that unless an explicit pattern is specified in the Lava-Test Test Definition 1.0 YAML file, then no pattern will be implemented. Processes which echo "pass" or "fail" would be ignored and no TestCase would be created.

Let us know if there are any thoughts or problems on this proposal.

--
Neil Williams
=============
neil.williams(a)linaro.org
http://www.linux.codehelp.co.uk/
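To see what the Default Pattern and Fixup quoted above do to a line of test output, the small sketch below applies them with Python's re module. The pattern and the fixup dict are copied from the message; the helper name parse_line and the use of re.search on a single line are assumptions made for the illustration and may not match how LAVA applies the pattern internally.

```python
import re

# Copied from the message above (raw strings instead of doubled backslashes).
DEFAULT_V1_PATTERN = (
    r"(?P<test_case_id>.*-*)\s+:\s+"
    r"(?P<result>(PASS|pass|FAIL|fail|SKIP|skip|UNKNOWN|unknown))"
)
DEFAULT_V1_FIXUP = {
    "PASS": "pass",
    "FAIL": "fail",
    "SKIP": "skip",
    "UNKNOWN": "unknown",
}


def parse_line(line):
    """Return (test_case_id, normalised_result), or None if the line
    does not look like '<id> : <result>'."""
    match = re.search(DEFAULT_V1_PATTERN, line)
    if match is None:
        return None
    result = match.group("result")
    # The fixup dict maps upper-case results onto the lower-case form.
    return match.group("test_case_id"), DEFAULT_V1_FIXUP.get(result, result)


print(parse_line("smoke-test : pass"))
print(parse_line("boot-check : FAIL"))
print(parse_line("pass"))
```

Used this way, the expression only matches lines of the form "id : result"; a bare "pass" with no identifier and colon in front of it returns None.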

6 years, 6 months

Lava job always exits when running a long duration test case without outputs

by Chuan Su

Dear all,

We have encountered an issue where our job always exits halfway when running a long duration test case (around 20 minutes) which outputs nothing, and the LAVA server reports an InfrastructureError and prints the following:

    Connection closed by foreign host.Marking unfinished test run as failed
    definition: lava
    result: fail
    case: 0_apache-servers1
    uuid: 597_1.4.2.4.1
    duration: 603.53
    lava_test_shell connection dropped.end: 3.1 lava-test-shell (duration 00:10:05) [ns_s1]
    namespace: ns_s1
    extra: ...
    definition: lava
    level: 3.1
    result: fail
    case: lava-test-shell
    duration: 604.55
    lava-test-retry failed: 1 of 1 attempts. 'lava_test_shell connection dropped.'lava_test_shell connection dropped.

And we just test it with a very simple python script as below:

    #!/usr/bin/env python3
    import time
    print('Hello,world!')
    time.sleep(1200)
    print("Hello,Lava!")

We can see the 'Hello,world!' string in the output, but no more output of this program is found on the web UI!

We just don't know what's wrong, so we have to mail you for help!

Sincerely,
Chuan Su

6 years, 6 months

Git authentication

by Axel Lebourhis

Hi everyone,

Is it possible to handle git authentication in a test job? I need LAVA to clone a repo that can't be set to public, and obviously it won't work because of the authentication step. So is it possible to specify a password or a token?

Best regards,
Axel

6 years, 7 months

LAVA webUI hangs when test case outputs long string in a line

by Chuan Su

Dear all,

We found that when LAVA executes a script that outputs a long string (more than 30000 bytes) on a single line (only one line break), the LAVA web UI always hangs, no more LAVA log output appears, and the devices under test (DUTs) stay powered on until the LAVA job timeout triggers. However, after checking the whole log file, we found that the cases after the hanging case were still executed (new files were generated).

So the problem is that when LAVA encounters such cases, the web UI always hangs and the DUTs may not be powered off when all the cases are completed!

Best wishes,
Chuan Su
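A reproducer makes a report like this easier to follow up. The sketch below is an assumption about what such a test script could look like, based only on the description above (one line of more than 30000 bytes ending in a single line break); the function names and the marker lines before and after are made up for illustration.

```python
import sys


def make_long_line(size=30001, fill="x"):
    """Return `size` characters of payload followed by exactly one newline."""
    return fill * size + "\n"


def emit(out=sys.stdout, size=30001):
    """Write a marker, one very long line, and a second marker.

    If the second marker never shows up in the web UI log, the long line
    is the likely trigger.
    """
    out.write("before long line\n")
    out.write(make_long_line(size))
    out.write("after long line\n")
    out.flush()
```

Calling emit() from a test shell step, and then varying the size argument, would show at which line length the log stops updating.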

6 years, 7 months

Re: [Lava-users] Steps for restoring a backup

by Tim Jaacks

On Mon, 11 Dec 2018 at 11:30, Neil Williams <neil.williams at linaro.org> wrote:
> On Tue, 11 Dec 2018 at 11:28, Tim Jaacks <tim.jaacks(a)garz-fricke.com> wrote:
> >
> > Thanks, the CLI operations are very helpful for automating the process.
> > However, the docs say that all devices in "Reserved" state have to
> > have their "current job" cleared. I can use "lava-server manage devices details"
> > to check whether this field is actually set. There is no command to
> > modify it, though. Seems like using the Python API is the only way to
> > go here, right? The same applies to setting "Running" jobs to "Cancelled".
>
> https://git.lavasoftware.org/lava/lava/merge_requests/273
>
> This should get into the upcoming 2018.12 release.

Thank you very much for your quick help. The "lava-server manage jobs fail" command takes care of clearing the "current job" field of the associated device, do I understand that right?

Mit freundlichen Grüßen / Best regards
Tim Jaacks
DEVELOPMENT ENGINEER
Garz & Fricke GmbH
Tempowerkring 2
21079 Hamburg
Direct: +49 40 791 899 - 55
Fax: +49 40 791899 - 39
tim.jaacks(a)garz-fricke.com
www.garz-fricke.com
WE MAKE IT YOURS!

Sitz der Gesellschaft: D-21079 Hamburg
Registergericht: Amtsgericht Hamburg, HRB 60514
Geschäftsführer: Matthias Fricke, Manfred Garz, Marc-Michael Braun

6 years, 7 months
