Multiple builds interrupted by transient network issue #568

jugglinmike · 2018-06-01T01:51:45Z

At approximately 8:00 AM EST today, five jobs assigned to five workers failed. Each job was for a different "chunk" of tests in Edge via the Sauce Labs service.

For the first, the WPT CLI logs indicate that the WebDriver server crashed. The Sauce Labs Connect logs don't describe any aberrations, though. It's also strange that Sauce Labs reports "Number of jobs using tunnel: 2.". My guess is that it failed to clean up the "job" for the first attempt before the second attempt began.

Partial logs from first failed job

2018-05-31 12:01:24,602 INFO validate-wpt-results wpt-run:stdout 31 May 12:01:24 - Started scproxy on port 43981.
2018-05-31 12:01:24,602 INFO validate-wpt-results wpt-run:stdout 31 May 12:01:24 - Please wait for 'you may start your tests' to start your tests.
2018-05-31 12:01:38,088 INFO validate-wpt-results wpt-run:stdout 31 May 12:01:38 - Secure remote tunnel VM provisioned.
2018-05-31 12:01:38,088 INFO validate-wpt-results wpt-run:stdout 31 May 12:01:38 - Tunnel ID: 884503292ffb41e29c308e0788093352
2018-05-31 12:01:38,089 INFO validate-wpt-results wpt-run:stdout 31 May 12:01:38 - Using no proxy for connecting to tunnel VM.
2018-05-31 12:01:38,089 INFO validate-wpt-results wpt-run:stdout 31 May 12:01:38 - Starting Selenium listener...
2018-05-31 12:01:38,089 INFO validate-wpt-results wpt-run:stdout 31 May 12:01:38 - Establishing secure TLS connection to tunnel...
2018-05-31 12:01:38,340 INFO validate-wpt-results wpt-run:stdout 31 May 12:01:38 - Selenium listener started on port 4445.
2018-05-31 12:01:39,344 INFO validate-wpt-results wpt-run:stdout 31 May 12:01:39 - Sauce Connect is up, you may start your tests.

(later)

2018-05-31 12:04:04,936 INFO validate-wpt-results wpt-run:stdout Starting runner
2018-05-31 12:05:01,373 INFO validate-wpt-results wpt-run:stdout Failed to start protocol connection
2018-05-31 12:05:01,375 INFO validate-wpt-results wpt-run:stdout Traceback (most recent call last):
2018-05-31 12:05:01,376 INFO validate-wpt-results wpt-run:stdout   File "/home/samus/worker/Remote_Chunked_Runner/build/tools/wptrunner/wptrunner/executors/protocol.py", line 47, in setup
2018-05-31 12:05:01,376 INFO validate-wpt-results wpt-run:stdout     self.connect()
2018-05-31 12:05:01,376 INFO validate-wpt-results wpt-run:stdout   File "/home/samus/worker/Remote_Chunked_Runner/build/tools/wptrunner/wptrunner/executors/executorselenium.py", line 182, in connect
2018-05-31 12:05:01,376 INFO validate-wpt-results wpt-run:stdout     desired_capabilities=self.capabilities)
2018-05-31 12:05:01,376 INFO validate-wpt-results wpt-run:stdout   File "/home/samus/worker/Remote_Chunked_Runner/build/_venv/lib/python2.7/site-packages/selenium/webdriver/remote/webdriver.py", line 156, in __init__
2018-05-31 12:05:01,376 INFO validate-wpt-results wpt-run:stdout     self.start_session(capabilities, browser_profile)
2018-05-31 12:05:01,376 INFO validate-wpt-results wpt-run:stdout   File "/home/samus/worker/Remote_Chunked_Runner/build/_venv/lib/python2.7/site-packages/selenium/webdriver/remote/webdriver.py", line 245, in start_session
2018-05-31 12:05:01,376 INFO validate-wpt-results wpt-run:stdout     response = self.execute(Command.NEW_SESSION, parameters)
2018-05-31 12:05:01,377 INFO validate-wpt-results wpt-run:stdout   File "/home/samus/worker/Remote_Chunked_Runner/build/_venv/lib/python2.7/site-packages/selenium/webdriver/remote/webdriver.py", line 314, in execute
2018-05-31 12:05:01,377 INFO validate-wpt-results wpt-run:stdout     self.error_handler.check_response(response)
2018-05-31 12:05:01,377 INFO validate-wpt-results wpt-run:stdout   File "/home/samus/worker/Remote_Chunked_Runner/build/_venv/lib/python2.7/site-packages/selenium/webdriver/remote/errorhandler.py", line 208, in check_response
2018-05-31 12:05:01,377 INFO validate-wpt-results wpt-run:stdout     raise exception_class(value)
2018-05-31 12:05:01,377 INFO validate-wpt-results wpt-run:stdout WebDriverException: Message: <html><body><h1>503 Service Unavailable</h1>
2018-05-31 12:05:01,377 INFO validate-wpt-results wpt-run:stdout No server is available to handle this request.
2018-05-31 12:05:01,377 INFO validate-wpt-results wpt-run:stdout </body></html>

The other logs tell a similar story (although the third and fourth did not exhibit the "job buildup").

They all started with the typical "Sauce Connect is up, you may start your tests." message, and they all executed some (different) number of tests successfully. That means this wasn't a problem with the initial connection negotiation. And since all five workers successfully completed the very next job assigned to them (also for Edge via Sauce Labs), I don't think it is related to the local environment.

After looking closer at the logs, I found that the very first error message was printed at almost exactly the same moment across all five machines:

First:

2018-05-31 12:05:01,373 INFO validate-wpt-results wpt-run:stdout Failed to start protocol connection

Second:

2018-05-31 12:04:54,149 INFO validate-wpt-results wpt-run:stdout Failed to start protocol connection

Third:

2018-05-31 12:04:54,666 INFO validate-wpt-results wpt-run:stdout Failed to start protocol connection

Fourth:

2018-05-31 12:04:59,264 INFO validate-wpt-results wpt-run:stdout Failed to start protocol connection

Fifth:

2018-05-31 12:04:59,945 INFO validate-wpt-results wpt-run:stdout Failed to start protocol connection

This strongly suggests a transitory networking issue, possibly within Sauce Labs or possibly within AWS (where all of the workers are deployed). The Sauce Labs status page does not report any incidents, though it does show a suspicious spike in "wait times" at 8:10.

In any event, a few corrections come to mind:

introduce some "cool down" period between retries
fix a previously-reported bug in the WPT CLI so that it pauses while tunnels close (this pause may allow the workers to wait ought transient network outages)
change service providers.

The current duration of the results collection process (over 12 hours) makes me highly reluctant to introduce arbitrary delays. The second solution is slightly more compelling because it has the potential to help in other scenarios where error recovery is necessary. The third is the most drastic, but it's not wise to assume that any provider will offer 100% availability. It might be warranted if this issue recurs frequently. Fortunately, these symptoms are distinct from many of the other problems we've experienced with collecting results from Sauce Labs, so it should be easy to recognize if/when it requires further attention.

The text was updated successfully, but these errors were encountered:

jugglinmike mentioned this issue Jun 1, 2018

Integrate with new results receiver #566

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Multiple builds interrupted by transient network issue #568

Multiple builds interrupted by transient network issue #568

jugglinmike commented Jun 1, 2018

Multiple builds interrupted by transient network issue #568

Multiple builds interrupted by transient network issue #568

Comments

jugglinmike commented Jun 1, 2018