Closed Bug 560777 Opened 15 years ago Closed 15 years ago

10.6 are getting stuck with "debug test mochitest-other"

Categories

(Release Engineering :: General, defect)

x86_64
macOS
defect
Not set
major

Tracking

(Not tracked)

RESOLVED WORKSFORME

People

(Reporter: philor, Unassigned)


Details

Two possibilities occur to me to explain the way that tinderbox says that there are multiple "still building" 10.6 debug mochitest-other runs that have been going for from 5 hours to more than 18 hours now:

* it just fails in a way that doesn't ever tell tinderbox it finished, which is ugly and will have to be fixed someday but is not very interesting, or,

* it really is tying up the limited number of 10.6 slaves forever, or until a wave of "slave lost" passes through, leading to the way that the tests that are green enough to be visible without &noignore=1 only manage to get a few of the possible tests, or a few of the talos runs, to run every three or four pushes, and only manage to get all of the visible tests and talos tests run every six or eight pushes.

Right now, having pushed the patch that should green up both opt and debug reftests so we can make them visible, I've been waiting fruitlessly for three hours and three other pushes for a single reftest run of either sort to get kicked off, which makes it difficult to get excited about working on greening them up.
Yes, they really are taking a really long time to complete. They seem to be stuck, constantly outputting:

SSLTUNNEL(0x10682aae0): polling flags csock(0)=--, ssock(1)=R-
SSLTUNNEL(0x10682aae0): poll timeout, looping
SSLTUNNEL(0x10682aae0): polling flags csock(0)=--, ssock(1)=R-
SSLTUNNEL(0x10682aae0): poll timeout, looping
SSLTUNNEL(0x10682aae0): polling flags csock(0)=--, ssock(1)=R-
SSLTUNNEL(0x10682aae0): poll timeout, looping
SSLTUNNEL(0x10682aae0): polling flags csock(0)=--, ssock(1)=R-
SSLTUNNEL(0x10682aae0): poll timeout, looping
SSLTUNNEL(0x10682aae0): polling flags csock(0)=--, ssock(1)=R-

(that's been running like that since Monday)
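For context, ssltunnel is the small proxy mochitest runs so HTTPS tests can reach the local test web server, and its core is a poll-based relay loop between a client socket and a server socket. Reading those flags, the client side appears to be polled for no events ("csock(0)=--") while the server side is only watched for reads ("ssock(1)=R-"), so every poll call just times out and the loop spins. Below is a minimal Python analogue of that failure mode, not the real implementation (the real code is C++ using NSPR's PR_Poll in testing/mochitest/ssltunnel/ssltunnel.cpp); all names are illustrative, and the iteration cap is only there so the sketch terminates, which the wedged real loop does not.

import select
import socket

POLL_TIMEOUT_MS = 1000  # each expiry corresponds to one "poll timeout, looping" line

def relay_loop(csock, ssock, max_iterations=5):
    """Illustrative relay loop: once the client fd is registered for no
    events and the server fd never becomes readable, every poll() call
    just times out and the loop spins without ever finishing the step."""
    poller = select.poll()
    poller.register(csock.fileno(), 0)               # csock(0)=--  (no events)
    poller.register(ssock.fileno(), select.POLLIN)   # ssock(1)=R-  (read only)
    for _ in range(max_iterations):                  # the real loop has no such cap
        events = poller.poll(POLL_TIMEOUT_MS)
        if not events:
            print("poll timeout, looping")
            continue
        # ...actual byte relaying between the two sockets would go here...

if __name__ == "__main__":
    client_end, server_end = socket.socketpair()     # neither end ever writes
    relay_loop(client_end, server_end)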
There are several slaves that have been stuck on these suites for a few days, far longer than they should be:

#116 [mochitest-chrome]
#117 [mochitest-chrome]
#118 [mochitest-chrome]
#119 [mochitest-chrome]
#121 [mochitest-chrome]
#122 [mochitest-chrome]
#123 [mochitest-chrome]
#124 [mochitest-chrome]
#125 [mochitest-chrome]

I have tried interrupting them through the waterfall and it seems not to have worked (I did not try the last two, to leave room for debugging). I am raising the priority. The solution might be to disable debug unit tests as requested in bug 559826.
Severity: normal → major
Summary: Is "Rev3 MacOSX Snow Leopard 10.6.2 mozilla-central debug test mochitest-other" really tying up multiple slaves for 18+ hour runs? → 10.6 are getting stuck with "debug test mochitest-other"
Trying to kill off the build results in this exception on the slave:

2010-04-21 06:17:43-0700 [Broker,client] asked to interrupt current command: The web-page 'stop build' button was pressed by '':
2010-04-21 06:17:43-0700 [Broker,client] <bound method SlaveBuilder.remote_interruptCommand of <SlaveBuilder 'Rev3 MacOSX Snow Leopard 10.6.2 mozilla-central debug test mochitest-other' at 22256416>> didn't accept ('115325', "The web-page 'stop build' button was pressed by '': \n") and {}
2010-04-21 06:17:43-0700 [Broker,client] Peer will receive following PB traceback:
2010-04-21 06:17:43-0700 [Broker,client] Unhandled Error
        Traceback (most recent call last):
          File "/System/Library/Frameworks/Python.framework/Versions/2.5/Extras/lib/python/twisted/spread/banana.py", line 146, in gotItem
            self.callExpressionReceived(item)
          File "/System/Library/Frameworks/Python.framework/Versions/2.5/Extras/lib/python/twisted/spread/banana.py", line 111, in callExpressionReceived
            self.expressionReceived(obj)
          File "/System/Library/Frameworks/Python.framework/Versions/2.5/Extras/lib/python/twisted/spread/pb.py", line 514, in expressionReceived
            method(*sexp[1:])
          File "/System/Library/Frameworks/Python.framework/Versions/2.5/Extras/lib/python/twisted/spread/pb.py", line 826, in proto_message
            self._recvMessage(self.localObjectForID, requestID, objectID, message, answerRequired, netArgs, netKw)
        --- <exception caught here> ---
          File "/System/Library/Frameworks/Python.framework/Versions/2.5/Extras/lib/python/twisted/spread/pb.py", line 840, in _recvMessage
            netResult = object.remoteMessageReceived(self, message, netArgs, netKw)
          File "/System/Library/Frameworks/Python.framework/Versions/2.5/Extras/lib/python/twisted/spread/flavors.py", line 114, in remoteMessageReceived
            state = method(*args, **kw)
          File "/Library/Python/2.5/site-packages/buildbot-0.7.9-py2.5.egg/buildbot/slave/bot.py", line 187, in remote_interruptCommand
            self.command.doInterrupt()
          File "/Library/Python/2.5/site-packages/buildbot-0.7.9-py2.5.egg/buildbot/slave/commands.py", line 698, in doInterrupt
            self.interrupt()
          File "/Library/Python/2.5/site-packages/buildbot-0.7.9-py2.5.egg/buildbot/slave/commands.py", line 1064, in interrupt
            self.command.kill("command interrupted")
          File "/Library/Python/2.5/site-packages/buildbot-0.7.9-py2.5.egg/buildbot/slave/commands.py", line 521, in kill
            msg += ", killing pid %d" % self.process.pid
        exceptions.TypeError: int argument required
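The interesting frame is the last one: commands.py builds the log message with "%d" % self.process.pid, and the TypeError means the pid is evidently not an int there (most likely None because the child is already gone or was never tracked), so the interrupt request itself blows up and the hung build never gets torn down. Below is a hedged sketch of the kind of guard that would avoid that secondary failure; the class is a stand-in for buildbot 0.7.9's slave-side command object, not the actual code, and only the attributes visible in the traceback are modelled.

class KillSketch:
    """Stand-in for the slave-side command object; only the kill() path
    from the traceback above is modelled."""

    def __init__(self, process, send_status):
        self.process = process          # twisted process transport, or None
        self.sendStatus = send_status   # callable that reports step status

    def kill(self, msg):
        pid = getattr(self.process, "pid", None)
        if pid is None:
            # This is the case the traceback hits: "%d" % None raises
            # TypeError. Report the interrupt instead of crashing it.
            self.sendStatus({"header": msg + " (no process to kill)\n"})
            return
        msg += ", killing pid %d" % pid
        self.sendStatus({"header": msg + "\n"})
        self.process.signalProcess("KILL")  # twisted IProcessTransport call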
(In reply to comment #1)
> Yes, they really are taking a really long time to complete. They seem to be
> stuck, constantly outputting:
>
> SSLTUNNEL(0x10682aae0): polling flags csock(0)=--, ssock(1)=R-
> SSLTUNNEL(0x10682aae0): poll timeout, looping
> SSLTUNNEL(0x10682aae0): polling flags csock(0)=--, ssock(1)=R-
> SSLTUNNEL(0x10682aae0): poll timeout, looping
> SSLTUNNEL(0x10682aae0): polling flags csock(0)=--, ssock(1)=R-
> SSLTUNNEL(0x10682aae0): poll timeout, looping
> SSLTUNNEL(0x10682aae0): polling flags csock(0)=--, ssock(1)=R-
> SSLTUNNEL(0x10682aae0): poll timeout, looping
> SSLTUNNEL(0x10682aae0): polling flags csock(0)=--, ssock(1)=R-
>
> (that's been running like that since Monday)

1) josh: do you know why ssltunnel is causing unittests to loop endlessly? fwiw, it seems to not be happening anymore, so maybe some bad code got backed out?

2) just talked with catlee: the todo here is to
a) upgrade the buildbot slave on these talos minis as part of the scheduler db rollout, and
b) make sure that "detect-and-kill-long-running-jobs" is set for the mochitest steps on the buildbot master. It *should* be already, but we need to check.
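On (b), the knob in question on the buildbot master is the per-step timeout: ShellCommand's timeout= kills a step that produces no output for the given number of seconds, and maxTime= (where the buildbot version in use supports it) caps total wall-clock time, which is the one that would actually catch a step that keeps printing "poll timeout, looping" but never finishes. A hedged sketch of where those settings go is below; the step name, command, and numbers are all hypothetical, since the real mozilla-central test factories live in buildbotcustom rather than in a plain BuildFactory like this.

from buildbot.process.factory import BuildFactory
from buildbot.steps.shell import ShellCommand

factory = BuildFactory()
factory.addStep(ShellCommand(
    name="mochitest-other",                         # hypothetical step name
    command=["python", "runtests.py", "--chrome"],  # illustrative command only
    timeout=1200,       # kill the step after 20 minutes with no output at all
    maxTime=4 * 3600,   # hard wall-clock cap, for steps that keep logging
                        # but never finish
))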
(In reply to comment #4)
> 1) josh: do you know why ssltunnel is causing unittests to loop endlessly?
> fwiw, it seems to not be happening anymore, so maybe some bad code got
> backed out?

No, I don't know.
FWIW, ssltunnel itself hasn't changed since February:
http://hg.mozilla.org/mozilla-central/log/ab906b5af4ab/testing/mochitest/ssltunnel/ssltunnel.cpp
(and even that was a minor change). Could certainly be a Core bug that was tripping this.
This isn't happening any more.
Status: NEW → RESOLVED
Closed: 15 years ago
Resolution: --- → WORKSFORME
Product: mozilla.org → Release Engineering