Skip to content

Conversation

@qoke
Copy link
Contributor

@qoke qoke commented Apr 17, 2025

There is a goroutine leak in EphemeralGarbageCollector where each call to Schedule() creates a goroutine, but those goroutines are not properly terminated when a node is canceled or when the garbage collector is shut down i.e. leaving dangling goroutines.

Wrote a test to confirm:

=== RUN TestEphemeralGarbageCollectorGoRoutineLeak
ephemeral_garbage_collector_test.go:16: Initial number of goroutines: 2
ephemeral_garbage_collector_test.go:68: Final number of goroutines: 102
ephemeral_garbage_collector_test.go:71:
Error Trace: github.com/juanfont/headscale/hscontrol/db/ephemeral_garbage_collector_test.go:71
Error: "102" is not less than or equal to "7"
Test: TestEphemeralGarbageCollectorGoRoutineLeak
Messages: There are significantly more goroutines after GC usage, which suggests a leak
--- FAIL: TestEphemeralGarbageCollectorGoRoutineLeak (0.30s)
FAIL
FAIL github.com/juanfont/headscale/hscontrol/db 0.317s
FAIL

The fix ensures a proper cleanup of dangling goroutines and stopping all timers. Also added a safety check in Schedule() to prevent scheduling after Close() - There don't appear to be any current codepaths that call in this order currently, but I have added a hygiene check for this to ensure if someone does this in future, that it is handled correctly.

After this fix, also added a bunch of additional tests to ensure cancel and reschedule are not caught up in the same problem (which they are not).

=== RUN TestEphemeralGarbageCollectorGoRoutineLeak
ephemeral_garbage_collector_test.go:21: Initial number of goroutines: 2
ephemeral_garbage_collector_test.go:71: Final number of goroutines: 2
--- PASS: TestEphemeralGarbageCollectorGoRoutineLeak (0.50s)
=== RUN TestEphemeralGarbageCollectorReschedule
--- PASS: TestEphemeralGarbageCollectorReschedule (0.10s)
=== RUN TestEphemeralGarbageCollectorCancelAndReschedule
--- PASS: TestEphemeralGarbageCollectorCancelAndReschedule (0.20s)
=== RUN TestEphemeralGarbageCollectorCloseBeforeTimerFires
--- PASS: TestEphemeralGarbageCollectorCloseBeforeTimerFires (0.10s)
=== RUN TestEphemeralGarbageCollectorScheduleAfterClose
ephemeral_garbage_collector_test.go:205: Initial number of goroutines: 2
ephemeral_garbage_collector_test.go:248: Final number of goroutines: 2
--- PASS: TestEphemeralGarbageCollectorScheduleAfterClose (0.20s)
=== RUN TestEphemeralGarbageCollectorConcurrentScheduleAndClose
ephemeral_garbage_collector_test.go:260: Initial number of goroutines: 2
ephemeral_garbage_collector_test.go:331: Final number of goroutines: 2
--- PASS: TestEphemeralGarbageCollectorConcurrentScheduleAndClose (0.35s)
=== RUN TestEphemeralGarbageCollectorOrder
--- PASS: TestEphemeralGarbageCollectorOrder (7.00s)
=== RUN TestEphemeralGarbageCollectorLoads
--- PASS: TestEphemeralGarbageCollectorLoads (10.01s)
PASS
ok github.com/juanfont/headscale/hscontrol/db 18.487s

Given the amount of time these tests take to run, you might not want to run these by default.

@qoke qoke requested review from juanfont and kradalby as code owners April 17, 2025 01:11
@ghost
Copy link

ghost commented Apr 17, 2025

Pull Request Revisions

RevisionDescription
r6
Fixed typo in garbage collector commentCorrected a minor typo in the comment explaining the logic of the ephemeral garbage collector's scheduling mechanism
r5
Improved node deletion channel handlingAdded more robust channel communication for node deletion in garbage collector, preventing potential race conditions during shutdown
r4
Added constants for time durationsIntroduced named constants for millisecond time durations to improve code readability in ephemeral garbage collector test file
r3
Improved ephemeral garbage collector testRefactored ephemeral garbage collector tests with improved synchronization, channel-based signaling, and more robust goroutine leak detection
r2
Improved goroutine leak prevention logicEnhanced ephemeral garbage collector to prevent potential goroutine leaks by adding cancellation channel checks and improving error handling
r1
Improved ephemeral garbage collector implementationEnhanced ephemeral garbage collector with more robust concurrency handling, safer Close() method, and improved Schedule() logic with better timer management
✅ AI review completed for r6
Help React with emojis to give feedback on AI-generated reviews:
  • 👍 means the feedback was helpful and actionable
  • 👎 means the feedback was incorrect or unhelpful
💬 Replying to feedback with a comment helps us improve the system. Your input also contributes to shaping future interactions with the AI reviewer.

We'd love to hear from you—reach out anytime at [email protected].

Copy link
Collaborator

@kradalby kradalby left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great! thank you!

@kradalby kradalby merged commit 92e587a into juanfont:main Apr 23, 2025
270 of 276 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants