Hi, I have a problem installing the LAVA server and dispatcher using
the Docker images that Linaro offers.
I installed both images (server and dispatcher) on my local PC.
When I submit a job, it is listed on the LAVA server,
but its status remains 'Submitted' and never changes.
When I visit {local ip address:port number}/scheduler/device/qemu01 on the server,
I can see a message like the one below.
Does this mean that a health-check job has to be registered before
submitting a test job? If so, how do I do that?
I have looked for a way to solve this problem, but could not find one.
I tried disabling the health check on this device and forcing its
Health to 'Good',
but the Health status soon changes from Good → Bad (Invalid device configuration).
Below is what I did to install the LAVA server and dispatcher.
- LAVA Server
1) Pull docker image and run.
$ docker pull lavasoftware/lava-server:2018.11
$ docker run -itd --name new_lava_server --cap-add=NET_ADMIN \
    -p 9099:80 -p 5557:5555 -p 5558:5556 -h new_lava_server \
    lavasoftware/lava-server:2018.11
2) Create superuser
Create id as admin, pw as admin.
$ lava-server manage createsuperuser
3) Create token
Create token for admin account on server web ui.
4) Add device type and device
$ lava-server manage device-types add qemu
5) Add device dictionary
$ lava-server manage devices add --device-type qemu --worker new_lava_slave qemu01
- LAVA dispatcher
1) Pull docker image and run.
$ docker pull lavasoftware/lava-dispatcher:2018.11
$ docker run -it --name new_lava_slave \
    -v /boot:/boot -v /lib/modules:/lib/modules -v /home/lava-slave/LAVA-TEST:/opt/share \
    -v /dev/bus/usb:/dev/bus/usb -v ~/.ssh/id_rsa_lava.pub:/home/lava/.ssh/authorized_keys:ro \
    -v /sys/fs/cgroup:/sys/fs/cgroup \
    --device=/dev/ttyUSB0 \
    -p 2022:22 -p 5555:5555 -p 5556:5556 \
    -h new_lava_slave \
    --privileged \
    -e LAVA_SERVER_IP="192.168.1.44" \
    -e "LOGGER_URL=tcp://192.168.1.44:5557" \
    -e "MASTER_URL=tcp://192.168.1.44:5558" \
    -e "DISPATCHER_HOSTNAME=--hostname=new_lava_slave" \
    lavasoftware/lava-dispatcher:2018.11
2) Submit job file
$ ./submityaml.py -p -k apikey.txt qemu01.yaml
Below is the submityaml.py Python code.
The apikey.txt file contains the token created on the server.
#!/usr/bin/python
import argparse
import os.path
import sys
import time
import xmlrpclib

SLEEP = 5
__version__ = 0.5
LAVA_SERVER_IP = "192.168.1.44"


def is_valid_file(parser, arg, flag):
    if not os.path.exists(arg):
        parser.error("The file %s does not exist!" % arg)
    else:
        return open(arg, flag)  # return an open file handle


def setup_args_parser():
    """Setup the argument parsing.
    :return The parsed arguments.
    """
    description = "Submit job file"
    parser = argparse.ArgumentParser(version=__version__,
                                     description=description)
    parser.add_argument("yamlfile", help="specify target job file",
                        metavar="FILE",
                        type=lambda x: is_valid_file(parser, x, 'r'))
    parser.add_argument("-d", "--debug", action="store_true",
                        help="Display verbose debug details")
    parser.add_argument("-p", "--poll", action="store_true",
                        help="poll job status until job completes")
    parser.add_argument("-k", "--apikey", default="apikey.txt",
                        help="File containing the LAVA api key")
    parser.add_argument("--port", default="9099",
                        help="LAVA/Apache default port number")
    return parser.parse_args()


def loadConfiguration():
    global args
    args = setup_args_parser()


def loadJob(server_str):
    """loadJob - read the YAML job file and return its content as a string
    """
    return args.yamlfile.read()


def submitJob(yamlfile, server):
    """submitJob - XMLRPC call to submit a YAML job definition
    returns jobid of the submitted job
    """
    # When making the call to submit_job, you have to send a string
    jobid = server.scheduler.submit_job(yamlfile)
    return jobid


def monitorJob(jobid, server, server_str):
    """monitorJob - added to poll for a job to complete
    """
    if args.poll:
        sys.stdout.write("Job polling enabled\n")
        # wcount = number of times we loop waiting for the job to start
        wcount = 0
        # count = number of times we loop while the job is running
        count = 0
        # the with-block closes the status file on every exit path
        with open("job_status.txt", "w+") as f:
            while True:
                status = server.scheduler.job_status(jobid)
                if status['job_status'] == 'Complete':
                    f.write("Complete\n")
                    break
                elif status['job_status'] == 'Canceled':
                    f.write("Canceled\n")
                    print '\nJob Canceled'
                    sys.exit(0)
                elif status['job_status'] == 'Submitted':
                    sys.stdout.write("Job waiting to run for % 2d seconds\n"
                                     % (wcount * SLEEP))
                    sys.stdout.flush()
                    wcount += 1
                elif status['job_status'] == 'Running':
                    sys.stdout.write("Job Running for % 2d seconds\n"
                                     % (count * SLEEP))
                    sys.stdout.flush()
                    count += 1
                else:
                    f.write("unknown status\n")
                    print "unknown status"
                    sys.exit(0)
                time.sleep(SLEEP)
        print ('\n\nJob Completed: ' + str(count * SLEEP) + ' s (' +
               str(wcount * SLEEP) + ' s in queue)')


def process():
    print "Submitting test job to LAVA server"
    loadConfiguration()
    user = "admin"
    with open(args.apikey) as f:
        line = f.readline()
        apikey = line.rstrip('\n')
    server_str = 'http://' + LAVA_SERVER_IP + ":" + args.port
    xmlrpc_str = ('http://' + user + ":" + apikey + "@" + LAVA_SERVER_IP
                  + ":" + args.port + '/RPC2/')
    print server_str
    print xmlrpc_str
    server = xmlrpclib.ServerProxy(xmlrpc_str)
    server.system.listMethods()
    yamlfile = loadJob(server_str)
    jobid = submitJob(yamlfile, server)
    monitorJob(jobid, server, server_str)


if __name__ == '__main__':
    process()
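For readers on Python 3, where xmlrpclib has become xmlrpc.client, the submit-and-poll flow of the script above could be sketched roughly as follows. The helper names build_rpc_url and submit_and_wait are made up for illustration and are not part of LAVA; only scheduler.submit_job and scheduler.job_status are taken from the script itself. Treating every status other than 'Submitted' and 'Running' as final is an assumption.

```python
import time
import xmlrpc.client
from urllib.parse import quote


def build_rpc_url(user, token, host, port):
    """Build the endpoint URL in the same shape the script uses:
    http://user:token@host:port/RPC2/ (user and token are percent-encoded)."""
    return "http://%s:%s@%s:%s/RPC2/" % (
        quote(user, safe=""), quote(token, safe=""), host, port)


def submit_and_wait(server, job_yaml, poll_seconds=5, sleep=time.sleep):
    """Submit a job definition (a string) and poll until the job is no
    longer 'Submitted' or 'Running'. Returns (job_id, final_status).

    Assumption: any other status is treated as final.
    """
    job_id = server.scheduler.submit_job(job_yaml)
    while True:
        status = server.scheduler.job_status(job_id)["job_status"]
        if status not in ("Submitted", "Running"):
            return job_id, status
        sleep(poll_seconds)


# Example use (values taken from the post; adjust to your own setup):
#   token = open("apikey.txt").readline().strip()
#   url = build_rpc_url("admin", token, "192.168.1.44", "9099")
#   server = xmlrpc.client.ServerProxy(url)
#   print(submit_and_wait(server, open("qemu01.yaml").read()))
```

Passing the sleep function as a parameter keeps the polling loop easy to exercise without a real server or real delays.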
The job file named qemu01.yaml is below.

# Your first LAVA JOB definition for an x86_64 QEMU
device_type: qemu
job_name: QEMU pipeline, first job

timeouts:
  job:
    minutes: 15
  action:
    minutes: 5
  connection:
    minutes: 2
priority: medium
visibility: public

# context allows specific values to be overridden or included
context:
  # tell the qemu template which architecture is being tested
  # the template uses that to ensure that qemu-system-x86_64 is executed.
  arch: amd64

metadata:
  # please change these fields when modifying this job for your own tests.
  docs-source: first-job
  docs-filename: qemu-pipeline-first-job.yaml

# ACTION_BLOCK
actions:
- deploy:
    timeout:
      minutes: 5
    to: tmpfs
    images:
      rootfs:
        image_arg: -drive format=raw,file={rootfs}
        url: https://images.validation.linaro.org/kvm/standard/stretch-2.img.gz
        compression: gz

# BOOT_BLOCK
- boot:
    timeout:
      minutes: 2
    method: qemu
    media: tmpfs
    prompts: ["root@debian:"]
    auto_login:
      login_prompt: "login:"
      username: root

- test:
    timeout:
      minutes: 5
    definitions:
    - repository: http://git.linaro.org/lava-team/lava-functional-tests.git
      from: git
      path: lava-test-shell/smoke-tests-basic.yaml
      name: smoke-tests

Hello,
I have noticed that sometimes, when I run health checks, LAVA gets stuck during the HTTP download of the kernel and ramdisk.
For example, in [1] there seems to be a 3-minute timeout for the deploy images section, but LAVA did not act on it and stayed stuck there for 17 hours. After the job was cancelled and the device health was manually set to unknown again, the health check succeeded (e.g. job 25 on the same LAVA instance).
I am running LAVA 2018.7.
[1] https://lava.ciplatform.org/scheduler/job/20
Thanks,
Patryk
Hello everyone,
I have written a backup script for my LAVA instance. While testing the restore process I stumbled upon issues. Are there any dependencies between the master and workers concerning backups? When the master crashes, but the worker does not, is it safe to restore the master only and keep the worker as it is? Or do I have to keep master and worker backups in sync and always restore both at the same time?
Restoring my master as described in the LAVA docs generally works. The web interface is back online, all the jobs and devices are in consistent states.
Restoring the worker is relatively easy, according to the docs. I installed the LAVA packages in their previous versions on a fresh (virtual) machine, restored /etc/lava-dispatcher/lava-slave and /etc/lava-coordinator/lava-coordinator.conf. The worker has status "online" in the LAVA web interface afterwards, so the communication seems to work.
However, starting a multinode job does not work. The job log says:
lava-dispatcher, installed at version: 2018.5.post1-2~bpo9+1
start: 0 validate
Start time: 2018-12-18 12:25:14.335215+00:00 (UTC)
This MultiNode test job contains top level actions, in order, of: deploy, boot, test, finalize
lxc, installed at version: 1:2.0.7-2+deb9u2
validate duration: 0.01
case: validate
case_id: 112
definition: lava
result: pass
Initialising group b6eb846d-689f-40c5-b193-8afce41883ee
Connecting to LAVA Coordinator on lava-server-vm:3079 timeout=90 seconds.
This comes out in a loop, until the job times out.
The lava-slave logfile says:
2018-12-18 12:27:15,114 INFO master => START(12)
2018-12-18 12:27:15,117 INFO [12] Starting job
[...]
2018-12-18 12:27:15,124 DEBUG [12] dispatch:
2018-12-18 12:27:15,124 DEBUG [12] env : {'overrides': {'LC_ALL': 'C.UTF-8', 'LANG': 'C', 'PATH': '/usr/local/bin:/usr/local/sbin:/bin:/usr/bin:/usr/sbin:/sbin'}, 'purge': True}
2018-12-18 12:27:15,124 DEBUG [12] env-dut :
2018-12-18 12:27:15,129 ERROR [EXIT] 'NoneType' object has no attribute 'send_start_ok'
2018-12-18 12:27:15,129 ERROR 'NoneType' object has no attribute 'send_start_ok'
It is the "job = jobs.create()" call in lava-slave's handle_start() routine that fails. Apparently there is a separate database on the worker (which I did not know about until now), and it fails to be filled with values. Does this database have to be backed up and restored? What is the purpose of this database? Is there anything I need to know about it concerning backups?
Best regards
Tim Jaacks
DEVELOPMENT ENGINEER
Garz & Fricke GmbH
Tempowerkring 2
21079 Hamburg
Direct: +49 40 791 899 - 55
Fax: +49 40 791899 - 39
tim.jaacks(a)garz-fricke.com
www.garz-fricke.com
WE MAKE IT YOURS!
Registered office: D-21079 Hamburg
Commercial register: Amtsgericht Hamburg, HRB 60514
Managing directors: Matthias Fricke, Manfred Garz, Marc-Michael Braun
Please make sure you include the mailing list in all replies so that
others know when a problem has been fixed (and how it was fixed)
On Tue, 18 Dec 2018 at 12:00, Chuan Su <lavanxp(a)126.com> wrote:
>
> Following your comments, we checked our setup and found that we use ser2net and telnet to communicate with the DUT. However, the ser2net timeout parameter was set to 600 seconds by default. When the DUT runs a long test case (more than 600 seconds) without producing any log output, the connection is dropped by ser2net and the telnet program prints 'Connection closed by foreign host'. Anyway, thanks for your help!
See https://git.linaro.org/lava/lava-lab.git/tree/shared/server-configs/ser2net…
The Linaro lab in Cambridge sets all the ser2net configs to have a zero timeout.
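
To find ports that still carry a non-zero idle timeout, a small check like the one below may help. It is only a sketch and assumes the classic colon-separated ser2net.conf layout (port:state:timeout:device:options); newer ser2net releases use a YAML configuration, which this does not handle. The function name and the sample entries are hypothetical.

```python
def find_idle_timeouts(config_text):
    """Return (port, timeout_seconds) for every port definition whose
    timeout field is not zero.

    Assumes lines of the form  port:state:timeout:device:options
    """
    states = {"raw", "rawlp", "telnet", "off"}
    flagged = []
    for line in config_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # blank line or comment
        fields = line.split(":", 4)
        if len(fields) < 4 or fields[1] not in states:
            continue  # not a port definition (e.g. BANNER or CONTROLPORT)
        try:
            timeout = int(fields[2])
        except ValueError:
            continue  # malformed timeout field, skip it
        if timeout != 0:
            flagged.append((fields[0], timeout))
    return flagged


# Hypothetical sample entries, for illustration only.
SAMPLE = """\
# port:state:timeout:device:options
7001:telnet:600:/dev/ttyUSB0:115200 8DATABITS NONE 1STOPBIT
7002:telnet:0:/dev/ttyUSB1:115200 8DATABITS NONE 1STOPBIT
"""

print(find_idle_timeouts(SAMPLE))  # prints [('7001', 600)]
```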
> Sincerely,
> Chuan Su
> At 2018-12-18 15:59:00, "Neil Williams" <neil.williams(a)linaro.org> wrote:
> >On Tue, 18 Dec 2018 at 06:16, Chuan Su <lavanxp(a)126.com> wrote:
> >>
> >> Dear all,
> >> We have encountered an issue where our job always exits halfway through a long-running test case (around 20 minutes) that produces no output, and the LAVA server reports an InfrastructureError and prints the following:
> >> Connection closed by foreign host.
> >> Marking unfinished test run as failed
> >
> >Connection closed by foreign host means that the serial connection
> >failed at the DUT - this is not a problem in the LAVA test job, this
> >is an infrastructure failure at your end. The foreign host (the DUT)
> >closed the serial connection. There is nothing LAVA can do about that.
> >The serial connection to the DUT has simply failed.
> >
> >If the serial connection is USB, check for logs on the worker like
> >/var/log/messages and /var/log/syslog for events related to the serial
> >connection. Check that the DUT didn't simply kill the serial
> >connection - maybe the DUT went into some kind of suspend mode.
> >
> >> definition: lava
> >> result: fail
> >> case: 0_apache-servers1
> >> uuid: 597_1.4.2.4.1
> >> duration: 603.53
> >> lava_test_shell connection dropped.
> >> end: 3.1 lava-test-shell (duration 00:10:05) [ns_s1]
> >> namespace: ns_s1
> >> extra: ...
> >> definition: lava
> >> level: 3.1
> >> result: fail
> >> case: lava-test-shell
> >> duration: 604.55
> >> lava-test-retry failed: 1 of 1 attempts. 'lava_test_shell connection dropped.'
> >> lava_test_shell connection dropped.
> >>
> >> We tested this with a very simple Python script:
> >> #!/usr/bin/env python3
> >> import time
> >> print('Hello,world!')
> >> time.sleep(1200)
> >> print("Hello,Lava!")
> >> We can see the 'Hello,world!' string in the output, but no further output from this program appears in the web UI.
> >> We do not know what is wrong, so we are writing to ask for your help.
> >> Sincerely,
> >> Chuan Su
--
Neil Williams
=============
neil.williams(a)linaro.org
http://www.linux.codehelp.co.uk/
https://git.lavasoftware.org/lava/lava/issues/179
If your Lava-Test Test Definition 1.0 YAML files explicitly use a
parse: block (like:
https://git.linaro.org/qa/test-definitions.git/tree/automated/linux/ltp/ltp…)
then this will remain supported in Definition 1.0.
If you use the monitors or interactive test actions, this does not
affect you at all.
If you rely on LAVA to create a TestCase based on a command in the
Lava-Test Test Definition just echoing "pass" or "fail", then this is
the Default Pattern and this change will directly affect those test
jobs.
The current Default Pattern and Fixup are lifted directly from V1
(https://git.lavasoftware.org/lava/lava/blob/master/lava_common/constants.py…):
# V1 compatibility
DEFAULT_V1_PATTERN = "(?P<test_case_id>.*-*)\\s+:\\s+(?P<result>(PASS|pass|FAIL|fail|SKIP|skip|UNKNOWN|unknown))"
DEFAULT_V1_FIXUP = {
    "PASS": "pass",
    "FAIL": "fail",
    "SKIP": "skip",
    "UNKNOWN": "unknown",
}
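
To see which output lines these two constants pick up, here is a small sketch that applies them with Python's re module. The parse_line helper is hypothetical and only illustrates the regular expression and the fixup mapping; how LAVA itself applies the pattern to the test shell output is not shown here and may differ.

```python
import re

# Copied from the constants quoted above.
DEFAULT_V1_PATTERN = "(?P<test_case_id>.*-*)\\s+:\\s+(?P<result>(PASS|pass|FAIL|fail|SKIP|skip|UNKNOWN|unknown))"
DEFAULT_V1_FIXUP = {
    "PASS": "pass",
    "FAIL": "fail",
    "SKIP": "skip",
    "UNKNOWN": "unknown",
}


def parse_line(line):
    """Return (test_case_id, result) if the line matches the default
    pattern, otherwise None. Upper-case results are normalised through
    the fixup dictionary. (Hypothetical helper, for illustration only.)"""
    match = re.search(DEFAULT_V1_PATTERN, line)
    if match is None:
        return None
    result = match.group("result")
    return match.group("test_case_id"), DEFAULT_V1_FIXUP.get(result, result)


print(parse_line("linux-posix-pwd : FAIL"))  # prints ('linux-posix-pwd', 'fail')
print(parse_line("smoke-test : pass"))       # prints ('smoke-test', 'pass')
print(parse_line("mytest: pass"))            # prints None (no whitespace before the colon)
```

Note that the pattern requires whitespace on both sides of the colon, so output such as "mytest: pass" is not matched.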
We've recently updated the documentation to drop mention of the
default pattern support for the following reasons:
* It has always been problematic to encode a Python regular expression
  in YAML. Failures are difficult to debug and patterns are global for
  the entire test operation.
* The move towards more portable test definitions puts the emphasis on
  parsing the test output locally on the DUT using a customised parser.
  This has further advantages:
  * The pattern does not have to be mangled into YAML
  * The pattern can be implemented by a language other than Python
  * The pattern can change during the operation of the test shell,
    e.g. a different pattern may be required for setup than for the test
    itself.
We are now starting to plan for Lava-Test Test Definition 2.0 with an
emphasis on requiring portable test scripts and removing more of the
lava_test_shell Test Helper scripts. Full information on 2.0 will be
available early in 2019.
As a first step, the generally unhelpful Default Pattern and Default
Fixup dict are likely to be removed. If you need this support, the
pattern can be added to your Lava-Test Test Definition 1.0 YAML files.
In the next release, it is proposed that unless an explicit pattern is
specified in the Lava-Test Test Definition 1.0 YAML file, then no
pattern will be implemented. Processes which echo "pass" or "fail"
would be ignored and no TestCase would be created.
Let us know if there are any thoughts or problems on this proposal.
--
Neil Williams
=============
neil.williams(a)linaro.org
http://www.linux.codehelp.co.uk/
Dear all,
We have encountered an issue where our job always exits halfway through a long-running test case (around 20 minutes) that produces no output, and the LAVA server reports an InfrastructureError and prints the following:
Connection closed by foreign host.
Marking unfinished test run as failed
definition: lava
result: fail
case: 0_apache-servers1
uuid: 597_1.4.2.4.1
duration: 603.53
lava_test_shell connection dropped.
end: 3.1 lava-test-shell (duration 00:10:05) [ns_s1]
namespace: ns_s1
extra: ...
definition: lava
level: 3.1
result: fail
case: lava-test-shell
duration: 604.55
lava-test-retry failed: 1 of 1 attempts. 'lava_test_shell connection dropped.'
lava_test_shell connection dropped.
We tested this with a very simple Python script:
#!/usr/bin/env python3
import time
print('Hello,world!')
time.sleep(1200)
print("Hello,Lava!")
We can see the 'Hello,world!' string in the output, but no further output from this program appears in the web UI.
We do not know what is wrong, so we are writing to ask for your help.
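
The follow-up in this thread traces the drop to a 600-second ser2net idle timeout. Besides setting that timeout to zero, a test script can avoid long silent periods by printing a short progress line at regular intervals. The sketch below shows one way to do that; it assumes that any serial output counts as activity for the idle timeout, and the function name and the 60-second interval are arbitrary choices.

```python
import time


def _say(message):
    # flush so the line reaches the serial console immediately
    print(message, flush=True)


def sleep_with_heartbeat(total_seconds, interval=60, sleep=time.sleep, emit=_say):
    """Wait for total_seconds, emitting one progress line after every
    `interval` seconds so the connection never stays silent for long.
    Returns the number of seconds waited."""
    elapsed = 0
    while elapsed < total_seconds:
        step = min(interval, total_seconds - elapsed)
        sleep(step)
        elapsed += step
        emit("still running: %d/%d s" % (elapsed, total_seconds))
    return elapsed


# In the test script above, time.sleep(1200) would become:
#   sleep_with_heartbeat(1200)
```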
Sincerely,
Chuan Su
Hi everyone,
Is it possible to handle git authentication in a test job?
I need LAVA to clone a repository that cannot be made public,
and the clone fails at the authentication step.
Is it possible to specify a password or a token?
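
One generic git approach, independent of LAVA, is to embed the credentials in an HTTPS repository URL. The sketch below only builds such a URL; whether a given LAVA version passes it to git unchanged is an assumption that has not been verified here, and the host, user and token are placeholders. A token embedded this way is visible to anyone who can read the job definition or its logs.

```python
from urllib.parse import quote, urlsplit, urlunsplit


def with_credentials(repo_url, username, token):
    """Return repo_url with 'username:token@' inserted before the host.
    Both values are percent-encoded so special characters survive."""
    parts = urlsplit(repo_url)
    host = parts.hostname
    if parts.port:
        host = "%s:%d" % (host, parts.port)
    netloc = "%s:%s@%s" % (quote(username, safe=""), quote(token, safe=""), host)
    return urlunsplit((parts.scheme, netloc, parts.path, parts.query, parts.fragment))


# Placeholder values, for illustration only.
print(with_credentials("https://git.example.com/team/private-tests.git",
                       "lava-bot", "s3cr/t"))
# prints https://lava-bot:s3cr%2Ft@git.example.com/team/private-tests.git
```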
Best regards,
Axel
Dear all,
We found that when LAVA executes a script that outputs a very long string (more than 30000 bytes) on a single line (with only one line break), the LAVA web UI hangs, no further LAVA log output appears, and the devices under test (DUTs) stay powered on until the LAVA job timeout triggers. However, after checking the whole log file, we found that the test cases after the hanging one had still been executed (new files were generated).
So the problem is that when LAVA encounters such cases, the web UI hangs and the DUTs may not be powered off once all the cases have completed.
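
Until the cause is understood, one possible workaround on the test side is to break very long lines into shorter ones before printing them. This is only a sketch: it assumes the hang really depends on line length as described above, and the 1000-character width is an arbitrary choice.

```python
def wrap_long_lines(text, width=1000):
    """Split every line longer than `width` characters into chunks of at
    most `width` characters; shorter lines are passed through unchanged."""
    out = []
    for line in text.splitlines():
        if len(line) <= width:
            out.append(line)
        else:
            out.extend(line[i:i + width] for i in range(0, len(line), width))
    return "\n".join(out)


# Example: a 30000-character line becomes 30 lines of 1000 characters.
wrapped = wrap_long_lines("x" * 30000)
print(len(wrapped.split("\n")))  # prints 30
```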
best wishes,
Chuan Su
On Mon, 11 Dec 2018 at 11:30, Neil Williams <neil.williams at linaro.org> wrote:
> On Tue, 11 Dec 2018 at 11:28, Tim Jaacks <tim.jaacks(a)garz-fricke.com> wrote:
> >
> > Thanks, the CLI operations are very helpful for automating the process.
> > However, the docs say that all devices in "Reserved" state have to
> > have their "current job" cleared. I can use "lava-server manage devices details"
> > to check whether this field is actually set. There is no command to
> > modify it, though. Seems like using the Python API is the only way to
> > go here, right? The same applies to setting "Running" jobs to "Cancelled".
>
> https://git.lavasoftware.org/lava/lava/merge_requests/273
>
> This should get into the upcoming 2018.12 release.
Thank you very much for your quick help. The "lava-server manage jobs fail"
command takes care of clearing the "current job" field of the associated
device, do I understand that right?
Best regards
Tim Jaacks