It seems there was a network problem on Friday and the agents were not accessible.
The agents are fine now, but ContinuaCI is stuck. Builds keep trying to reconnect, with this message:
"
Agent agent_name which is executing stage is alive but not active. Waiting one second before rechecking. Retry 170
"
It has been hanging like that for 55 hours.
When I stop the stuck builds, all the other builds in the queue start properly.
I think there should be some fallback, either to kill such builds after a certain time (just like the timeout on actions in stages) or to restart the connection between the agents and the server.
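To illustrate the kind of fallback I mean, here is a minimal sketch of a recheck loop that gives up after a deadline instead of retrying forever. This is not how ContinuaCI is implemented; `wait_for_agent`, `is_active`, `AgentUnresponsive` and the parameters are hypothetical names made up for the example.

```python
import time
from typing import Callable


class AgentUnresponsive(Exception):
    """Raised when the agent stays inactive past the deadline (hypothetical)."""


def wait_for_agent(
    is_active: Callable[[], bool],
    max_wait_seconds: float,
    poll_interval: float = 1.0,
) -> int:
    """Poll is_active() until it returns True or max_wait_seconds elapse.

    Returns the number of retries that were needed.
    is_active is a stand-in for whatever check the server performs.
    """
    deadline = time.monotonic() + max_wait_seconds
    retries = 0
    while not is_active():
        if time.monotonic() >= deadline:
            # Fallback: stop waiting so the build can be failed or requeued.
            raise AgentUnresponsive(
                f"agent still inactive after {retries} retries"
            )
        time.sleep(poll_interval)
        retries += 1
    return retries
```

With something like this, a build stuck on an unreachable agent would fail after the configured time and free the queue, rather than rechecking every second for days.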
Yes, all agent services are running. I see timeout errors on the agents:
A call from the agent to the server to update build status failed: Exception: TimeoutException
On the server side, I see only errors saying that the workspace could not be deleted (which seems normal for our setup), but they are from Friday; there is nothing new.
I have encountered this problem once or twice in the past; usually a restart of the server helped.
Might this be some Windows networking issue?
Yes, a timeout could be due to a network issue, or it could be that the server is too busy to accept the request, although usually you would then see errors on the server.
Check whether CPU, disk, or memory usage is high on either the server or the agent, then restart the server and the agent and see whether that resolves the issue.
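To help separate a network problem from a busy server, a quick TCP reachability check run from the agent machine can be useful. The following is only a rough sketch: `can_reach` is a hypothetical helper, and the host and port you pass must be the ones your own server actually listens on (the values in the comment are placeholders, not ContinuaCI defaults).

```python
import socket


def can_reach(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port opens within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False


# Example call (placeholder values, replace with your own):
# can_reach("ci-server.example", 9000)
```

If this returns False while the server process is running, the problem is more likely on the network side than server load.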
CPU usage was rather low, 10-20%. After the restart everything is working.
The version is 1.9.2.983, but I have encountered this problem a couple of times before on older versions. I rather suspect it is a problem with Windows reopening the socket after a network issue.