-
-
Notifications
You must be signed in to change notification settings - Fork 1.8k
fix: DNS resolver config respects MagicDNS superseding override_local_dns #2959
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
edcad88 to
a0f46d4
Compare
kradalby
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is helpful, some direction.
| // - When MagicDNS=false: | ||
| // - override_local_dns=true: Send Resolvers | ||
| // - override_local_dns=false/unset: No Resolvers | ||
| func TestDNSOverrideLocalBehavior(t *testing.T) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think help on this is very helpful, it has been quite painful, and I have never found time to sit down and test all the upstream Tailscale SaaS options to map out exactly how it should work, which often has resulted in this subtle bugs or strange behaviours.
I do think that we are starting a bit in the wrong end here with resolving this once and for all, either by method, or just by the naming.
This is very oriented around testing "what our settings do", while it should focus on first mapping out a test case/scenario for every behaviour of the possible configuration in Tailscale SaaS, and name them accordingly, and then change the Headscale code to behave in the correct way.
The best example of a time I managed to work like this is probably the recent Tags reimplementation in #2931.
I appreciate someone digging into this as it requires due diligence and I have not had time to sit down with it, or put it on the current roadmap.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As a thinking member of the public I care about exactly three outcomes for my clients:
1.) Am I using the "all singing all dancing dns proxy service"
2.) Am I sending alternate DNS servers IPs to my clients
3.) Am I letting the client choose/keep its own DNS
This is MVP.
Making it "work just like TS service" is more of a nice to have than a base requirement - where I come from.
There are undoubtedly many more knobs available with the business logic TS head end and I am not interested in "feature parity" as much as "good enuff"™
Sending overriding DNS servers to the client when they think they've disabled it is obviously not "as intended"
And MagicDNS will not work if it's not sent to the client...
I don't understand what was meant/intended by "Fallback Servers" - If you don't have working DNS servers and you don't understand why you're stuck - I don't have a big enough clue bat to cure what ails you.
If MagicDNS needs proxy targets (at all times) these should be named accordingly in the code and set to one or more of the usual suspects... Unless they're explicitly mapped to configured values.
MagicDNS - overrides the Client DNS - cause that's where the magic happens
- Uses plausible public defaults as its proxy targets in the absence of user configured targets
Override True - ships the customer provided/configured dns directly (w/o proxy)
Override False - ships nothings
Please let me know if I've missed something important that TS provides that we care about deeply.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see this as a bug fix ( where what I hear you describing above is a desire for feature parity through testing)
Obviously low-level DNS resolution changes will catch flack from outlier users making use of misbehavior - Hopefully I've made a reasonable case for just shipping this (or something akin to it) and then making it incrementally better over time.
LMK - Love the software. Thanks for the feedback.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Obviously low-level DNS resolution changes will catch flack from outlier users making use of misbehavior
I think the particular challenge with this part of the code is that it has been implemented, re-implemented, re-implemented and re-implemented a lot of times, and always not fully correct. This means that over time we have broken the people relying on or working around misbehavior many times.
I feel that unless we can give it the care it needs to sit down and implement it as it should be, we should not patch it further until we have the time to do it properly.
I get that this is likely a bug fix, but without doing the effort of sitting down and validating how it is suppose to work, I cannot really say if that is the case or not by reading the code.
I would greatly appreciate help to get it up to scratch, but it's been "good enuff"™ for too long, so we need to do it properly and partial changes feels like it will cause more work to be untangled when we have time to do it fully.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would love to get your input of what more/different use-cases need coverage to "implement it as it should be"; I don't have access to the TS headend code. I am not interested in implementing a test/validation harness for TS headend, especially for what I see as a relatively straightforward branching codepath. I don't have any history with what might make this part of the code "radioactive" - but I am not worried about losing traction with the userbase - as headscale is the only OS alternative to TS commercial offering I am currently aware of.
Would you share what "up to scratch" might look like from your perspective? --understanding I am interested in making the tool actually useful (no sharp edges on the user/admin/operator facing components), not in achieving (transient) ivory tower feature parity.
I see only the three cases described in my diff - MagicDNS (proxied DNS), HS propagated Direct DNS, No Propagated DNS. Please, please, please, let me know if I am being dumb. It wouldn't be the first time.
Would having more integration test cases with different configuration options be a win? Better error messages for incompatible option selections? Better documentation of each configuration knob?
I understand having FUD about what's apparently been a sore spot in the codebase - I've experienced that pain first hand for the last 18+ months running it from my homelab. But, perhaps because I took my drama/improv classes too seriously - I strongly believe in "entering the danger" - Tackling the challenge head on.
I know that if the user gets DNS assigned when they believe that they've disabled the option - the software has violated its implicit contract - which qualifies as a sharp edge from my perspective.
I am also happy to let sleeping dogs lie - if it's too soon to go back to fixing this codepath - if there are exigent circumstances.
Thanks for the dialogue. -Rogan
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So I've just pushed a "more comprehensive" change set that attempts to comply with the articulated theory of operation in tailcfg.DNSConfig
This add optional config fields for SplitDNSFallbackResolvers that will default to the Global resolvers (previous behavior - and what TS documents) if the setting is omitted - but explicitly setting them will obviously supersede that default behavior.
And perhaps most importantly added an integration check to make sure that IF override_local_dns=false - nothing should leak out into the TS Client env.
Reading through the TS code nothing appears to correlate with the override DNS functionality with a 1:1 correspondence. It treats the presence of Resolvers, plus some Magic/Split DNS flags as indicative of the desired state. Thus the closest approximations of OverrideLocalDNS=false ≈≈ DontConfigureDNS=True or DontConfigureResolvers=True
Clients need resolvers if they have SplitDNS (routed proxy), MagicDNS (Full Proxy), or No Proxy.
LMK if you have cycles to discuss this as I feel like this implementation is at least worth beta-testing.
3444ba3 to
31d87c8
Compare
9804eba to
063f31f
Compare
f0d768b to
62a306e
Compare
Fixes #2899
Summary
This PR fixes the inconsistent DNS resolver behavior reported in issue #2899. The core issue was that DNS resolvers were not being sent correctly based on the
override_local_dnssetting.Changes
Root Cause
When
override_local_dns=false, the code was incorrectly settingFallbackResolversinstead ofResolvers. FallbackResolvers are only used when primary DNS fails, causing DNS to never work.The Fix
Updated
dnsToTailcfgDNS()inhscontrol/types/config.goto implement the correct logic:override_local_dns=true→ Send Resolversoverride_local_dns=false/unset→ No Resolvers (clients use local DNS)Behavior Matrix
Testing
hscontrol/types/config_test.gointegration/dns_override_local_test.goFiles Changed
hscontrol/types/config.go: Fixed DNS configuration logichscontrol/types/config_test.go: Updated test expectationsintegration/dns_override_local_test.go: Added integration tests