This repository has been archived by the owner on Nov 6, 2019. It is now read-only.
At approximately 8:00 AM EST today, five jobs assigned to five workers failed. Each job was for a different "chunk" of tests in Edge via the Sauce Labs service.
For the first, the WPT CLI logs indicate that the WebDriver server crashed. The Sauce Connect logs don't describe any aberrations, though. It's also strange that Sauce Labs reports "Number of jobs using tunnel: 2". My guess is that it failed to clean up the "job" for the first attempt before the second attempt began.
Partial logs from first failed job
2018-05-31 12:01:24,602 INFO validate-wpt-results wpt-run:stdout 31 May 12:01:24 - Started scproxy on port 43981.
2018-05-31 12:01:24,602 INFO validate-wpt-results wpt-run:stdout 31 May 12:01:24 - Please wait for 'you may start your tests' to start your tests.
2018-05-31 12:01:38,088 INFO validate-wpt-results wpt-run:stdout 31 May 12:01:38 - Secure remote tunnel VM provisioned.
2018-05-31 12:01:38,088 INFO validate-wpt-results wpt-run:stdout 31 May 12:01:38 - Tunnel ID: 884503292ffb41e29c308e0788093352
2018-05-31 12:01:38,089 INFO validate-wpt-results wpt-run:stdout 31 May 12:01:38 - Using no proxy for connecting to tunnel VM.
2018-05-31 12:01:38,089 INFO validate-wpt-results wpt-run:stdout 31 May 12:01:38 - Starting Selenium listener...
2018-05-31 12:01:38,089 INFO validate-wpt-results wpt-run:stdout 31 May 12:01:38 - Establishing secure TLS connection to tunnel...
2018-05-31 12:01:38,340 INFO validate-wpt-results wpt-run:stdout 31 May 12:01:38 - Selenium listener started on port 4445.
2018-05-31 12:01:39,344 INFO validate-wpt-results wpt-run:stdout 31 May 12:01:39 - Sauce Connect is up, you may start your tests.
(later)
2018-05-31 12:04:04,936 INFO validate-wpt-results wpt-run:stdout Starting runner
2018-05-31 12:05:01,373 INFO validate-wpt-results wpt-run:stdout Failed to start protocol connection
2018-05-31 12:05:01,375 INFO validate-wpt-results wpt-run:stdout Traceback (most recent call last):
2018-05-31 12:05:01,376 INFO validate-wpt-results wpt-run:stdout File "/home/samus/worker/Remote_Chunked_Runner/build/tools/wptrunner/wptrunner/executors/protocol.py", line 47, in setup
2018-05-31 12:05:01,376 INFO validate-wpt-results wpt-run:stdout self.connect()
2018-05-31 12:05:01,376 INFO validate-wpt-results wpt-run:stdout File "/home/samus/worker/Remote_Chunked_Runner/build/tools/wptrunner/wptrunner/executors/executorselenium.py", line 182, in connect
2018-05-31 12:05:01,376 INFO validate-wpt-results wpt-run:stdout desired_capabilities=self.capabilities)
2018-05-31 12:05:01,376 INFO validate-wpt-results wpt-run:stdout File "/home/samus/worker/Remote_Chunked_Runner/build/_venv/lib/python2.7/site-packages/selenium/webdriver/remote/webdriver.py", line 156, in __init__
2018-05-31 12:05:01,376 INFO validate-wpt-results wpt-run:stdout self.start_session(capabilities, browser_profile)
2018-05-31 12:05:01,376 INFO validate-wpt-results wpt-run:stdout File "/home/samus/worker/Remote_Chunked_Runner/build/_venv/lib/python2.7/site-packages/selenium/webdriver/remote/webdriver.py", line 245, in start_session
2018-05-31 12:05:01,376 INFO validate-wpt-results wpt-run:stdout response = self.execute(Command.NEW_SESSION, parameters)
2018-05-31 12:05:01,377 INFO validate-wpt-results wpt-run:stdout File "/home/samus/worker/Remote_Chunked_Runner/build/_venv/lib/python2.7/site-packages/selenium/webdriver/remote/webdriver.py", line 314, in execute
2018-05-31 12:05:01,377 INFO validate-wpt-results wpt-run:stdout self.error_handler.check_response(response)
2018-05-31 12:05:01,377 INFO validate-wpt-results wpt-run:stdout File "/home/samus/worker/Remote_Chunked_Runner/build/_venv/lib/python2.7/site-packages/selenium/webdriver/remote/errorhandler.py", line 208, in check_response
2018-05-31 12:05:01,377 INFO validate-wpt-results wpt-run:stdout raise exception_class(value)
2018-05-31 12:05:01,377 INFO validate-wpt-results wpt-run:stdout WebDriverException: Message: <html><body><h1>503 Service Unavailable</h1>
2018-05-31 12:05:01,377 INFO validate-wpt-results wpt-run:stdout No server is available to handle this request.
2018-05-31 12:05:01,377 INFO validate-wpt-results wpt-run:stdout </body></html>
The other logs tell a similar story (although the third and fourth did not exhibit the "job buildup").
They all started with the typical "Sauce Connect is up, you may start your tests." message, and they all executed some (different) number of tests successfully. That means this wasn't a problem with the initial connection negotiation. And since all five workers successfully completed the very next job assigned to them (also for Edge via Sauce Labs), I don't think it is related to the local environment.
After looking closer at the logs, I found that the very first error message was printed at almost exactly the same moment across all five machines:
First:
2018-05-31 12:05:01,373 INFO validate-wpt-results wpt-run:stdout Failed to start protocol connection
Second:
2018-05-31 12:04:54,149 INFO validate-wpt-results wpt-run:stdout Failed to start protocol connection
Third:
2018-05-31 12:04:54,666 INFO validate-wpt-results wpt-run:stdout Failed to start protocol connection
Fourth:
2018-05-31 12:04:59,264 INFO validate-wpt-results wpt-run:stdout Failed to start protocol connection
Fifth:
2018-05-31 12:04:59,945 INFO validate-wpt-results wpt-run:stdout Failed to start protocol connection
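The spread between those five failures can be computed directly from the timestamps quoted above; a minimal sketch:

```python
from datetime import datetime

# First error timestamp from each of the five workers, copied from the logs above.
failures = [
    "2018-05-31 12:05:01,373",
    "2018-05-31 12:04:54,149",
    "2018-05-31 12:04:54,666",
    "2018-05-31 12:04:59,264",
    "2018-05-31 12:04:59,945",
]

parsed = [datetime.strptime(t, "%Y-%m-%d %H:%M:%S,%f") for t in failures]
spread = max(parsed) - min(parsed)
print(spread.total_seconds())  # → 7.224
```

All five workers hit their first error within roughly seven seconds of one another.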
This strongly suggests a transient networking issue, possibly within Sauce Labs or possibly within AWS (where all of the workers are deployed). The Sauce Labs status page does not report any incidents, though it does show a suspicious spike in "wait times" at 8:10 AM EST.
In any event, a few corrections come to mind: introducing a delay and retrying failed jobs; adding more general error-recovery logic to the runner; or reducing our reliance on any single provider.

The current duration of the results collection process (over 12 hours) makes me highly reluctant to introduce arbitrary delays. The second solution is slightly more compelling because it has the potential to help in other scenarios where error recovery is necessary. The third is the most drastic, but since it's not wise to assume that any provider will offer 100% availability, it might be warranted if this issue recurs frequently. Fortunately, these symptoms are distinct from many of the other problems we've experienced while collecting results from Sauce Labs, so it should be easy to recognize if/when this requires further attention.
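The error-recovery approach could be as simple as wrapping the flaky setup step in a retry loop. A hypothetical sketch (the helper name and call site are illustrative, not the actual wptrunner API):

```python
import time

def with_retries(operation, attempts=3, delay=5.0, backoff=2.0,
                 retryable=(Exception,)):
    """Call operation(), retrying on retryable exceptions with backoff.

    Rather than failing an entire chunk on the first transient 503 from
    the remote end, retry the setup step a few times before giving up.
    """
    for attempt in range(attempts):
        try:
            return operation()
        except retryable:
            if attempt == attempts - 1:
                raise  # out of attempts; propagate the original error
            time.sleep(delay)
            delay *= backoff

# Hypothetical usage at the session-creation step that failed above:
# session = with_retries(
#     lambda: webdriver.Remote(command_executor=url,
#                              desired_capabilities=caps),
#     retryable=(WebDriverException,))
```

This keeps the added delay bounded to failure cases only, so the happy path of the 12-hour collection run is unaffected.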