-
Notifications
You must be signed in to change notification settings - Fork 1.3k
Include --options-with-hyphens in statement regex #3863
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Andriamanitra
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should also include numbers, capitals, and underscores while we're at it. They're definitely used occasionally:
rg --only-matching --no-filename --search-zip ' --\\w[\\w-]+' /usr/share/man | rg '[A-Z0-9_]' | sort --unique
Output on my system
--3-2-pulldown
--3gpp-scan
--64
--802_3-sap
--802_3-type
--90
--abi-call0
--action1
--action9
--appData
--assume-utf16be
--assume-utf16le
--authValue
--aws-sigv4
--background_compression
--btf_encode
--cipherText
--Complain
--contains_enumerator
--data_replicas
--disable-pcre2-8
--disable-pcre2grep-callout
--disable-pcre2grep-callout-fork
--disable-pcre2grep-jit
--dns-ipv4-addr
--dns-ipv6-addr
--do-0x20-encode
--do-x20-encode
--dualprime-mpeg2
--eccpoint-L
--enable-ebcdic-nl25
--enable-never-backslash-C
--enable-pcre2-16
--enable-pcre2-32
--enable-pcre2grep-libbz2
--enable-pcre2grep-libz
--enable-pcre2test-libedit
--enable-pcre2test-libreadline
--entityPath
--expect100-timeout
--exportedData
--filter-A
--filter-AAAA
--filterwin2k
--force-ipv4
--force-ipv6
--force-R5
--force-V4
--Foreground
--fwdownload-mode3
--fwdownload-mode3-max
--fwdownload-mode7
--gb18030
--GBR
--hostpubmd5
--hostpubsha256
--http0
--http1
--http2
--http2-prior-knowledge
--http3
--http3-only
--ignoreDefinites
--importData
--Include
--intra_dc_prec
--ip6
--ip6-destination
--ip6-destination-port
--ip6-dport
--ip6-dst
--ip6-icmp-type
--ip6-proto
--ip6-protocol
--ip6-source
--ip6-source-port
--ip6-sport
--ip6-src
--ip6-tclass
--ipv4
--ipv6
--Iraw
--Istdin
--Istdout
--jsonPolicy
--keep-utf16
--keyPath
--lang_exclude
--Lc
--Lee
--linearize-pass1
--Lmac2
--Ln
--logData
--log-ip6
--Lx
--mirror-above-4G
--mirror-below-4G
--netfilter6
--no-0x20-encode
--no3d
--no-altscan-mpeg2
--no-D
--no-dhcpv4-interface
--no-dummy-svcd-SOF
--nr_methods
--numBytes
--Numeric
--nvPath
--oauth2-bearer
--omit_journal_replay
--output-Z1
--output-Z2
--p12-charset
--pathList
--pathOfKeyToDuplicate
--pcrIndex
--pcrList
--pcrLog
--pcrValue
--plainText
--policyPath
--policyRef
--post301
--post302
--post303
--prefer-ata12
--print0
--PROG-option
--proxy1
--proxy-http2
--proxy-tls13-ciphers
--proxy-tlsv1
--publicKey
--publicKeyPath
--qualifyingData
--quiet-dhcp6
--quoteInfo
--readimpliesX
--reduction-2x2
--reduction-4x4
--replay_journal_only
--searchPath
--seek_bytes
--set-recommended-min_free_kbytes
--S-force
--size_bytes
--socks4
--socks4a
--socks5
--socks5-basic
--socks5-gssapi
--socks5-gssapi-nec
--socks5-gssapi-service
--socks5-hostname
--sslv2
--sslv3
--tls13-ciphers
--tlsv1
--tpm2
--tpm2bPrivate
--tpm2bPublic
--type_trans
--type_transition
--UCS-2
--Version
--warning-exit-0
--Wext
--with-pcre2grep-bufsize
--with-pcre2grep-max-bufsize
--with-PROG
--Wnp
--Wp
--x11
--x32
--x509certData
Co-authored-by: Jöran Karl <[email protected]>
|
Thanks for the quick review and that list @Andriamanitra and @JoeKar. Took those changes, updated Regex101 link including those test cases. |
Andriamanitra
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It could be worth requiring an alphanumeric character after -- to exclude eg. echo --------, but it works fine as-is too.
Co-authored-by: niten94 <[email protected]>
|
Took those changes, thanks @niten94! Updated Regex101 link. |
runtime/syntax/sh.yaml
Outdated
| - type: "\\b(base64|basename|cat|chcon|chgrp|chmod|chown|chroot|cksum|comm|cp|csplit|cut|date|dd|df|dir|dircolors|dirname|du|env|expand|expr|factor|false|fmt|fold|head|hostid|id|install|join|link|ln|logname|ls|md5sum|mkdir|mkfifo|mknod|mktemp|mv|nice|nl|nohup|nproc|numfmt|od|paste|pathchk|pinky|pr|printenv|printf|ptx|pwd|readlink|realpath|rm|rmdir|runcon|seq|(sha1|sha224|sha256|sha384|sha512)sum|shred|shuf|sleep|sort|split|stat|stdbuf|stty|sum|sync|tac|tail|tee|test|time|timeout|touch|tr|true|truncate|tsort|tty|uname|unexpand|uniq|unlink|users|vdir|wc|who|whoami|yes)\\b" | ||
| # Conditional flags | ||
| - statement: "\\s+(-[0-9A-Za-z]+|--[0-9A-Za-z_-]+)" | ||
| - statement: "\\s+(-[0-9A-Za-z]+|--[A-Za-z0-9][\\w-]+)" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Single-letter options like --a are no longer matched. I assume it is not intentional?
Also nit: use the same order of alphanumerics in both parts of the regex, i.e. either 0-9A-Za-z in both or A-Za-z0-9 in both?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Single-letter options like
--aare no longer matched. I assume it is not intentional?
Depends, see this comment. At least --a isn't really posix compliant and depends on the implementation of the invoked command and its argument list.
Usually single-letter arguments should be single-hyphened and multi-letter arguments double-hyphened.
Also nit: use the same order of alphanumerics in both parts of the regex, i.e. either 0-9A-Za-z in both or A-Za-z0-9 in both?
Yes, purely cosmetic, but I agree. Makes it more comprehensible.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What do you mean by "not posix compliant"? POSIX doesn't define any double-hyphened arguments for its standard commands, but doesn't forbid them either, right?
Does getopt_long(), for example, allow single-letter double-hyphened arguments or not?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What do you mean by "not posix compliant"? POSIX doesn't define any double-hyphened arguments for its standard commands, but doesn't forbid them either, right?
Yes, you're right1. It was a GNU extension2.
Realizing that the special -- argument is missing that way.
Does getopt_long(), for example, allow single-letter double-hyphened arguments or not?
Yes it does, because it doesn't care about.
Long story short:
| - statement: "\\s+(-[0-9A-Za-z]+|--[A-Za-z0-9][\\w-]+)" | |
| - statement: "\\s+(-[0-9A-Za-z]+|--[0-9A-Za-z][\\w-]*|--\\s)" |
?
Edit:
Also nit: use the same order of alphanumerics in both parts of the regex, i.e. either 0-9A-Za-z in both or A-Za-z0-9 in both?
Considered in suggestion.
Footnotes
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Realizing that the special
--argument is missing that way.
If we want it to be matched as well, can't the entire regex be much simpler: "\\s+(-[\\w-]+)" ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
a lot of programs accept flags combined with the value like
-Dval,with,commasand--opt=value.The former, yes, is still highlighted not quite as the user would expect...
Especially since the value could have a lot of possible formats, am I right that this doesn't really need to be addressed currently or in this PR?
Well, if have no better solution, so be it.
Maybe:
(\\B(-[A-Za-z0-9]+|--[A-Za-z0-9][\\w-]*)\\b)|((\\s|^)--(\\s|$))(see). Just this "special"--(ignoring---etc.) is a pain in the ...I'm running out of ideas too.
I don't think \B at the beginning is appropiate, since (...) below is highlighted (if any):
- with
\B:not-option(--opt1) -(--opt2) - with
(\s|^):not-option--opt1 ---opt2
An illustration of what \B matches is shown in this StackOverflow answer, but the explanation there doesn't cover the example above.
To explain shortly for example, \B matches the empty string between n and - because there isn't any non-word character, according to the description under "Empty strings" in documentation.
So, these
?:are pointless, and the regex should be just:(\\s|^)(-[A-Za-z0-9]+|--[A-Za-z0-9][\\w-]*|--(\\s|$))
Also, putting this
(\\s|$)before)doesn't quite solve the root cause, because: https://regex101.com/r/r8LPJO/1
Maybe we shouldn't match -- anymore since it's hard to do so, and there are situations to write something like wrapper -- --opt-of-other-cmd.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hahaha, I think in general the problem is we are trying to write a parser in regex, which is famously recommended against.
I would say let not the perfect be the enemy of the good -- what we have now is still an improvement over what was there before. The correct way to handle this is likely a syntax highlighting mechanism a bit more sophisticated than regex. Shall we merge what we have and then circle back on this when someone has cycles?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree with @niten94 that we should just not match -- if it causes problems. It might even be better for readability if it has a different color than "normal" options.
I think in general the problem is we are trying to write a parser in regex, which is famously recommended against.
In this case the problem is more that there is no universal syntax for arguments as it is entirely up to the invoked program to parse them. From the shell's perspective they are all just arbitrary strings.
I would say let not the perfect be the enemy of the good -- what we have now is still an improvement over what was there before. The correct way to handle this is likely a syntax highlighting mechanism a bit more sophisticated than regex. Shall we merge what we have and then circle back on this when someone has cycles?
I agree, this is a lot of bikeshedding while just expanding the character groups without any of the special handling for edge cases would already be a nice improvement. Let's get that part merged and if someone wants to craft a perfect regex later that can happen in a separate PR?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree with @niten94 that we should just not match
--if it causes problems. It might even be better for readability if it has a different color than "normal" options.
Yes, I agree. There is no good solution for this, just some hacks.
Let's get that part merged and if someone wants to craft a perfect regex later that can happen in a separate PR?
Which one exactly? 😅
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's get that part merged and if someone wants to craft a perfect regex later that can happen in a separate PR?
Which one exactly? 😅
I believe the part @Andriamanitra meant is expanding the character group used to highlight --long-opt. Anyways, I think we can finally decide on this pattern which is lastly based on @dmaluka's suggestion:
(\\s|^)(-[A-Za-z0-9]+|--[A-Za-z0-9][\\w-]*)
Co-authored-by: Jöran Karl <[email protected]>
Fixes #3862
Adds hypen into long option regex to also highlight
--long-options-with-hypens. I also noted that the current regex misses--Long-options-capitalsand--long-options-with-numb3rsbut I reckon those are so rare it's not worth highlighting them.Regex101 link