Skip to content

Conversation

@nabeelsherazi
Copy link
Contributor

Fixes #3862

Adds hypen into long option regex to also highlight --long-options-with-hypens. I also noted that the current regex misses --Long-options-capitals and --long-options-with-numb3rs but I reckon those are so rare it's not worth highlighting them.

Regex101 link

Copy link
Contributor

@Andriamanitra Andriamanitra left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should also include numbers, capitals, and underscores while we're at it. They're definitely used occasionally:

rg --only-matching --no-filename --search-zip ' --\\w[\\w-]+' /usr/share/man | rg '[A-Z0-9_]' | sort --unique
Output on my system

 --3-2-pulldown
 --3gpp-scan
 --64
 --802_3-sap
 --802_3-type
 --90
 --abi-call0
 --action1
 --action9
 --appData
 --assume-utf16be
 --assume-utf16le
 --authValue
 --aws-sigv4
 --background_compression
 --btf_encode
 --cipherText
 --Complain
 --contains_enumerator
 --data_replicas
 --disable-pcre2-8
 --disable-pcre2grep-callout
 --disable-pcre2grep-callout-fork
 --disable-pcre2grep-jit
 --dns-ipv4-addr
 --dns-ipv6-addr
 --do-0x20-encode
 --do-x20-encode
 --dualprime-mpeg2
 --eccpoint-L
 --enable-ebcdic-nl25
 --enable-never-backslash-C
 --enable-pcre2-16
 --enable-pcre2-32
 --enable-pcre2grep-libbz2
 --enable-pcre2grep-libz
 --enable-pcre2test-libedit
 --enable-pcre2test-libreadline
 --entityPath
 --expect100-timeout
 --exportedData
 --filter-A
 --filter-AAAA
 --filterwin2k
 --force-ipv4
 --force-ipv6
 --force-R5
 --force-V4
 --Foreground
 --fwdownload-mode3
 --fwdownload-mode3-max
 --fwdownload-mode7
 --gb18030
 --GBR
 --hostpubmd5
 --hostpubsha256
 --http0
 --http1
 --http2
 --http2-prior-knowledge
 --http3
 --http3-only
 --ignoreDefinites
 --importData
 --Include
 --intra_dc_prec
 --ip6
 --ip6-destination
 --ip6-destination-port
 --ip6-dport
 --ip6-dst
 --ip6-icmp-type
 --ip6-proto
 --ip6-protocol
 --ip6-source
 --ip6-source-port
 --ip6-sport
 --ip6-src
 --ip6-tclass
 --ipv4
 --ipv6
 --Iraw
 --Istdin
 --Istdout
 --jsonPolicy
 --keep-utf16
 --keyPath
 --lang_exclude
 --Lc
 --Lee
 --linearize-pass1
 --Lmac2
 --Ln
 --logData
 --log-ip6
 --Lx
 --mirror-above-4G
 --mirror-below-4G
 --netfilter6
 --no-0x20-encode
 --no3d
 --no-altscan-mpeg2
 --no-D
 --no-dhcpv4-interface
 --no-dummy-svcd-SOF
 --nr_methods
 --numBytes
 --Numeric
 --nvPath
 --oauth2-bearer
 --omit_journal_replay
 --output-Z1
 --output-Z2
 --p12-charset
 --pathList
 --pathOfKeyToDuplicate
 --pcrIndex
 --pcrList
 --pcrLog
 --pcrValue
 --plainText
 --policyPath
 --policyRef
 --post301
 --post302
 --post303
 --prefer-ata12
 --print0
 --PROG-option
 --proxy1
 --proxy-http2
 --proxy-tls13-ciphers
 --proxy-tlsv1
 --publicKey
 --publicKeyPath
 --qualifyingData
 --quiet-dhcp6
 --quoteInfo
 --readimpliesX
 --reduction-2x2
 --reduction-4x4
 --replay_journal_only
 --searchPath
 --seek_bytes
 --set-recommended-min_free_kbytes
 --S-force
 --size_bytes
 --socks4
 --socks4a
 --socks5
 --socks5-basic
 --socks5-gssapi
 --socks5-gssapi-nec
 --socks5-gssapi-service
 --socks5-hostname
 --sslv2
 --sslv3
 --tls13-ciphers
 --tlsv1
 --tpm2
 --tpm2bPrivate
 --tpm2bPublic
 --type_trans
 --type_transition
 --UCS-2
 --Version
 --warning-exit-0
 --Wext
 --with-pcre2grep-bufsize
 --with-pcre2grep-max-bufsize
 --with-PROG
 --Wnp
 --Wp
 --x11
 --x32
 --x509certData

Co-authored-by: Jöran Karl <[email protected]>
@nabeelsherazi
Copy link
Contributor Author

Thanks for the quick review and that list @Andriamanitra and @JoeKar. Took those changes, updated Regex101 link including those test cases.

Copy link
Contributor

@Andriamanitra Andriamanitra left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It could be worth requiring an alphanumeric character after -- to exclude eg. echo --------, but it works fine as-is too.

@nabeelsherazi
Copy link
Contributor Author

nabeelsherazi commented Sep 18, 2025

Took those changes, thanks @niten94! Updated Regex101 link.

- type: "\\b(base64|basename|cat|chcon|chgrp|chmod|chown|chroot|cksum|comm|cp|csplit|cut|date|dd|df|dir|dircolors|dirname|du|env|expand|expr|factor|false|fmt|fold|head|hostid|id|install|join|link|ln|logname|ls|md5sum|mkdir|mkfifo|mknod|mktemp|mv|nice|nl|nohup|nproc|numfmt|od|paste|pathchk|pinky|pr|printenv|printf|ptx|pwd|readlink|realpath|rm|rmdir|runcon|seq|(sha1|sha224|sha256|sha384|sha512)sum|shred|shuf|sleep|sort|split|stat|stdbuf|stty|sum|sync|tac|tail|tee|test|time|timeout|touch|tr|true|truncate|tsort|tty|uname|unexpand|uniq|unlink|users|vdir|wc|who|whoami|yes)\\b"
# Conditional flags
- statement: "\\s+(-[0-9A-Za-z]+|--[0-9A-Za-z_-]+)"
- statement: "\\s+(-[0-9A-Za-z]+|--[A-Za-z0-9][\\w-]+)"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Single-letter options like --a are no longer matched. I assume it is not intentional?

Also nit: use the same order of alphanumerics in both parts of the regex, i.e. either 0-9A-Za-z in both or A-Za-z0-9 in both?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Single-letter options like --a are no longer matched. I assume it is not intentional?

Depends, see this comment. At least --a isn't really posix compliant and depends on the implementation of the invoked command and its argument list.
Usually single-letter arguments should be single-hyphened and multi-letter arguments double-hyphened.

Also nit: use the same order of alphanumerics in both parts of the regex, i.e. either 0-9A-Za-z in both or A-Za-z0-9 in both?

Yes, purely cosmetic, but I agree. Makes it more comprehensible.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do you mean by "not posix compliant"? POSIX doesn't define any double-hyphened arguments for its standard commands, but doesn't forbid them either, right?

Does getopt_long(), for example, allow single-letter double-hyphened arguments or not?

Copy link
Collaborator

@JoeKar JoeKar Sep 21, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do you mean by "not posix compliant"? POSIX doesn't define any double-hyphened arguments for its standard commands, but doesn't forbid them either, right?

Yes, you're right1. It was a GNU extension2.
Realizing that the special -- argument is missing that way.

Does getopt_long(), for example, allow single-letter double-hyphened arguments or not?

Yes it does, because it doesn't care about.

Long story short:

Suggested change
- statement: "\\s+(-[0-9A-Za-z]+|--[A-Za-z0-9][\\w-]+)"
- statement: "\\s+(-[0-9A-Za-z]+|--[0-9A-Za-z][\\w-]*|--\\s)"

?

Edit:

Also nit: use the same order of alphanumerics in both parts of the regex, i.e. either 0-9A-Za-z in both or A-Za-z0-9 in both?

Considered in suggestion.

Footnotes

  1. https://pubs.opengroup.org/onlinepubs/9799919799/basedefs/V1_chap12.html

  2. https://www.gnu.org/prep/standards/html_node/Command_002dLine-Interfaces.html

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Realizing that the special -- argument is missing that way.

If we want it to be matched as well, can't the entire regex be much simpler: "\\s+(-[\\w-]+)" ?

Copy link
Contributor

@niten94 niten94 Sep 28, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

a lot of programs accept flags combined with the value like -Dval,with,commas and --opt=value.

The former, yes, is still highlighted not quite as the user would expect...

Especially since the value could have a lot of possible formats, am I right that this doesn't really need to be addressed currently or in this PR?

Well, if have no better solution, so be it.

Maybe: (\\B(-[A-Za-z0-9]+|--[A-Za-z0-9][\\w-]*)\\b)|((\\s|^)--(\\s|$)) (see). Just this "special" -- (ignoring --- etc.) is a pain in the ...

I'm running out of ideas too.

I don't think \B at the beginning is appropiate, since (...) below is highlighted (if any):

  • with \B: not-option(--opt1) -(--opt2)
  • with (\s|^): not-option--opt1 ---opt2

An illustration of what \B matches is shown in this StackOverflow answer, but the explanation there doesn't cover the example above.

To explain shortly for example, \B matches the empty string between n and - because there isn't any non-word character, according to the description under "Empty strings" in documentation.

So, these ?: are pointless, and the regex should be just:

(\\s|^)(-[A-Za-z0-9]+|--[A-Za-z0-9][\\w-]*|--(\\s|$))

Also, putting this (\\s|$) before ) doesn't quite solve the root cause, because: https://regex101.com/r/r8LPJO/1

Maybe we shouldn't match -- anymore since it's hard to do so, and there are situations to write something like wrapper -- --opt-of-other-cmd.

Copy link
Contributor Author

@nabeelsherazi nabeelsherazi Sep 28, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hahaha, I think in general the problem is we are trying to write a parser in regex, which is famously recommended against.

I would say let not the perfect be the enemy of the good -- what we have now is still an improvement over what was there before. The correct way to handle this is likely a syntax highlighting mechanism a bit more sophisticated than regex. Shall we merge what we have and then circle back on this when someone has cycles?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with @niten94 that we should just not match -- if it causes problems. It might even be better for readability if it has a different color than "normal" options.

I think in general the problem is we are trying to write a parser in regex, which is famously recommended against.

In this case the problem is more that there is no universal syntax for arguments as it is entirely up to the invoked program to parse them. From the shell's perspective they are all just arbitrary strings.

I would say let not the perfect be the enemy of the good -- what we have now is still an improvement over what was there before. The correct way to handle this is likely a syntax highlighting mechanism a bit more sophisticated than regex. Shall we merge what we have and then circle back on this when someone has cycles?

I agree, this is a lot of bikeshedding while just expanding the character groups without any of the special handling for edge cases would already be a nice improvement. Let's get that part merged and if someone wants to craft a perfect regex later that can happen in a separate PR?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with @niten94 that we should just not match -- if it causes problems. It might even be better for readability if it has a different color than "normal" options.

Yes, I agree. There is no good solution for this, just some hacks.

Let's get that part merged and if someone wants to craft a perfect regex later that can happen in a separate PR?

Which one exactly? 😅

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's get that part merged and if someone wants to craft a perfect regex later that can happen in a separate PR?

Which one exactly? 😅

I believe the part @Andriamanitra meant is expanding the character group used to highlight --long-opt. Anyways, I think we can finally decide on this pattern which is lastly based on @dmaluka's suggestion:

(\\s|^)(-[A-Za-z0-9]+|--[A-Za-z0-9][\\w-]*)

@nabeelsherazi
Copy link
Contributor Author

Took those changes, thanks @JoeKar @dmaluka . Ready for re-review if all in agreement!

@JoeKar JoeKar merged commit 9183fbe into zyedidia:master Nov 5, 2025
@nabeelsherazi nabeelsherazi deleted the patch-1 branch November 5, 2025 19:30
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Syntax highlighting for long options in shell scripts doesn't handle hyphens

5 participants