Hi,

We've recently had an issue with our LAVA instance (version 2019.05.post1), where a long-running LAVA job with a large log file led to instability when serving web content.

The large job appeared to cause the lava-server-gunicorn workers to use more memory than was available, so workers crashed and then restarted. As a result, the workers spent most of their time processing the large job, and other requests were only served once a worker had restarted. Web pages were served extremely slowly and lavacli calls timed out (unless a larger timeout was set).

We had "LOG_SIZE_LIMIT": 3 set in our /etc/lava-server/settings.conf, and we did have the message on that job page for "This log file is too large to view", but it seems that some requests were still attempting to process some aspect of the job causing these worker crashes. Is there any other settings that might need to be set in order to cope with long running jobs with large log files that might help with this situation?

Before we look into this any further, does anyone know if this is fixed in a newer version of LAVA? Has anyone seen similar issues with their instances?

Thanks,
Dean