Comment by cyberax
1 month ago
I solved it by adding a simple Tailscale action to handle failure. It creates an ephemeral instance and waits for connections for 3 minutes. Then it loops while there's an active SSH session present.
It's that simple: https://gist.github.com/Cyberax/9edbde51380bf7e1b298245464a2... and it saved me _hours_ of debug time.
I've moved all my CI/CD to use Taskfiles inside a Docker container since then, so my local environment can replicate the CI/CD environment up to the GITHUB_TOKEN. Still, being able to poke around GitHub builders is great.
That looks like a useful trick, using an ephemeral instance to SSH into a failed CI action context. I see in the script how it waits and checks for a root user login, but I'm not sure about the part that keeps it alive:
> Then it loops while there's an active SSH session present.
From what I can see, the loop stops when a user is logged in. Is this handled elsewhere?
> use Taskfiles inside a Docker container since then, so my local environment can replicate the CI/CD environment
Oh this is what I've been wanting, a vendor-neutral way to run the same CI actions locally. I'd seen go-task before, will try it, thanks for the info!
> That looks like a useful trick, using an ephemeral instance to SSH into a failed CI action context.
Yup. And Tailscale even manages the SSH key provisioning.
> From what I can see, the loop stops when a user is logged in. Is this handled elsewhere?
The script does handle it. The `pgrep` succeeds (returns a zero exit code) if there's a "login" process for user 'root' present, which is created when there's an active SSH session. If `pgrep` fails, then `break` runs and exits the loop.
GitHub then terminates the workflow and releases the runner.
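For anyone skimming, here is a rough Python sketch of the control flow described above. It is not the gist's actual code: the function names, the 5-second poll interval, and the exact `pgrep` arguments are assumptions made for illustration. Only the 3-minute wait and the "loop while a root login process exists" logic come from the description.

```python
import subprocess
import time
from typing import Callable


def root_login_present() -> bool:
    """Return True if pgrep finds a 'login' process owned by root.

    pgrep exits with 0 when at least one process matches, non-zero otherwise.
    The exact arguments here are an assumption, not copied from the gist.
    """
    result = subprocess.run(
        ["pgrep", "-u", "root", "login"],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def keep_alive(
    session_active: Callable[[], bool] = root_login_present,
    grace_seconds: float = 180.0,   # the 3-minute window to connect
    poll_seconds: float = 5.0,      # hypothetical poll interval
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Wait out the grace period, then block while a session is active.

    Returns the number of times the session check ran.
    """
    # Give the user time to connect over SSH.
    sleep(grace_seconds)

    polls = 0
    while True:
        polls += 1
        if not session_active():
            # No login process left: leave the loop so the step can finish
            # and the workflow can release the runner.
            break
        sleep(poll_seconds)
    return polls
```

Calling `keep_alive()` as the last thing in a failure-handling step would hold the job open for as long as someone is logged in. The check and the sleep function are parameters only so the loop can be exercised without a real SSH session.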
Ah I see what you mean, the loop keeps it alive until login is detected, and after that the machine is kept alive by the SSH session itself. Appreciated.
You also got the Tesla keys, nice!