Merged
Changes from 3 commits
2 changes: 1 addition & 1 deletion runtime/syntax/sh.yaml
@@ -39,7 +39,7 @@ rules:
# Coreutils commands
- type: "\\b(base64|basename|cat|chcon|chgrp|chmod|chown|chroot|cksum|comm|cp|csplit|cut|date|dd|df|dir|dircolors|dirname|du|env|expand|expr|factor|false|fmt|fold|head|hostid|id|install|join|link|ln|logname|ls|md5sum|mkdir|mkfifo|mknod|mktemp|mv|nice|nl|nohup|nproc|numfmt|od|paste|pathchk|pinky|pr|printenv|printf|ptx|pwd|readlink|realpath|rm|rmdir|runcon|seq|(sha1|sha224|sha256|sha384|sha512)sum|shred|shuf|sleep|sort|split|stat|stdbuf|stty|sum|sync|tac|tail|tee|test|time|timeout|touch|tr|true|truncate|tsort|tty|uname|unexpand|uniq|unlink|users|vdir|wc|who|whoami|yes)\\b"
# Conditional flags
-    - statement: "\\s+(-[A-Za-z]+|--[a-z]+)"
+    - statement: "\\s+(-[0-9A-Za-z]+|--[A-Za-z0-9][\\w-]+)"
Collaborator

Single-letter options like --a are no longer matched. I assume it is not intentional?

Also nit: use the same order of alphanumerics in both parts of the regex, i.e. either 0-9A-Za-z in both or A-Za-z0-9 in both?
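
To make the first point concrete, here is a minimal sketch comparing the old and the new rule on a few arguments. It uses Python's `re` module as a stand-in for the editor's regex engine (an assumption; the constructs used here should behave the same way), with the YAML double backslashes reduced to single ones. The sample command lines are made up for illustration.

```python
import re

# Patterns from the diff, with the YAML escaping (\\) reduced to single backslashes.
OLD_RULE = re.compile(r"\s+(-[A-Za-z]+|--[a-z]+)")
NEW_RULE = re.compile(r"\s+(-[0-9A-Za-z]+|--[A-Za-z0-9][\w-]+)")


def highlighted(rule, line):
    """Return the option text (capture group 1) of every match in `line`."""
    return [m.group(1) for m in rule.finditer(line)]


# Hypothetical sample lines, chosen only for illustration.
for sample in ["cmd --a", "cmd -9", "cmd --dry-run"]:
    print(sample, "| old:", highlighted(OLD_RULE, sample),
          "| new:", highlighted(NEW_RULE, sample))
```

In this sketch the new rule picks up digits and the whole of `--dry-run`, but finds nothing in `cmd --a`, because `[\w-]+` requires at least one more character after the first one.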

Collaborator

> Single-letter options like --a are no longer matched. I assume it is not intentional?

Depends, see this comment. At least --a isn't really posix compliant and depends on the implementation of the invoked command and its argument list.
Usually single-letter arguments should be single-hyphened and multi-letter arguments double-hyphened.

> Also nit: use the same order of alphanumerics in both parts of the regex, i.e. either 0-9A-Za-z in both or A-Za-z0-9 in both?

Yes, purely cosmetic, but I agree. Makes it more comprehensible.

Collaborator

What do you mean by "not posix compliant"? POSIX doesn't define any double-hyphened arguments for its standard commands, but doesn't forbid them either, right?

Does getopt_long(), for example, allow single-letter double-hyphened arguments or not?

Collaborator

@JoeKar JoeKar Sep 21, 2025

> What do you mean by "not posix compliant"? POSIX doesn't define any double-hyphened arguments for its standard commands, but doesn't forbid them either, right?

Yes, you're right [1]. It was a GNU extension [2].
Realizing that the special -- argument is missing that way.

> Does getopt_long(), for example, allow single-letter double-hyphened arguments or not?

Yes, it does, because it doesn't care about that.
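
As a rough illustration of that answer: Python's standard `getopt` module, which follows the conventions of the Unix `getopt()` function and also supports GNU-style long options, accepts a one-letter long option once it is declared. This is only an analogy for `getopt_long()`, not a test of the C function itself; the option name `a` and the file name are made up.

```python
import getopt

# Declare a single long option named "a" and no short options.
# The argument list is hypothetical.
opts, rest = getopt.getopt(["--a", "file.txt"], "", ["a"])
print(opts, rest)
```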

Long story short:

Suggested change
-    - statement: "\\s+(-[0-9A-Za-z]+|--[A-Za-z0-9][\\w-]+)"
+    - statement: "\\s+(-[0-9A-Za-z]+|--[0-9A-Za-z][\\w-]*|--\\s)"

?

Edit:

> Also nit: use the same order of alphanumerics in both parts of the regex, i.e. either 0-9A-Za-z in both or A-Za-z0-9 in both?

Considered in suggestion.

Footnotes

  1. https://pubs.opengroup.org/onlinepubs/9799919799/basedefs/V1_chap12.html

  2. https://www.gnu.org/prep/standards/html_node/Command_002dLine-Interfaces.html
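
A quick way to see what the suggested pattern does with `--a` and with a bare `--` is the sketch below. It again uses Python's `re` as a stand-in for the editor's engine (an assumption), and the sample lines are invented.

```python
import re

# The suggested pattern, with the YAML escaping removed.
SUGGESTED = re.compile(r"\s+(-[0-9A-Za-z]+|--[0-9A-Za-z][\w-]*|--\s)")


def highlighted(line):
    """Return capture group 1 of every match in `line`."""
    return [m.group(1) for m in SUGGESTED.finditer(line)]


# Hypothetical sample lines.
for sample in ["cmd --a", "cmd -- file", "cmd --"]:
    print(repr(sample), "->", highlighted(sample))
```

Note that the third alternative includes the whitespace after `--` in the match, and that in this sketch a `--` at the very end of the string is not matched because no whitespace follows it.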

Collaborator

> Realizing that the special -- argument is missing that way.

If we want it to be matched as well, can't the entire regex be much simpler: "\\s+(-[\\w-]+)" ?
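
For comparison, a small sketch of what the simpler pattern would accept, using Python's `re` as a stand-in for the editor's engine (an assumption) and made-up sample lines:

```python
import re

# The simpler pattern proposed above, with the YAML escaping removed.
SIMPLE = re.compile(r"\s+(-[\w-]+)")


def highlighted(line):
    """Return capture group 1 of every match in `line`."""
    return [m.group(1) for m in SIMPLE.finditer(line)]


# Hypothetical sample lines.
for sample in ["cmd -x --long-opt", "wrapper -- --opt", "cmd --- -_", "cmd -"]:
    print(repr(sample), "->", highlighted(sample))
```

The trade-off is visible in the third sample: the pattern also accepts tokens such as `---` and `-_`, while a lone `-` is still left alone.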

Contributor

@niten94 niten94 Sep 28, 2025

a lot of programs accept flags combined with the value like -Dval,with,commas and --opt=value.

The former, yes, is still highlighted not quite as the user would expect...

Especially since the value could have a lot of possible formats, am I right that this doesn't really need to be addressed currently or in this PR?

Well, if we have no better solution, so be it.

Maybe: (\\B(-[A-Za-z0-9]+|--[A-Za-z0-9][\\w-]*)\\b)|((\\s|^)--(\\s|$)) (see). Just this "special" -- (ignoring --- etc.) is a pain in the ...

I'm running out of ideas too.

I don't think \B at the beginning is appropriate, since the parts in (...) below are what gets highlighted (if anything):

  • with \B: not-option(--opt1) -(--opt2)
  • with (\s|^): not-option--opt1 ---opt2

An illustration of what \B matches is shown in this StackOverflow answer, but the explanation there doesn't cover the example above.

To explain briefly: \B matches the empty string between n and - because there isn't any non-word character, according to the description under "Empty strings" in the documentation.
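
The difference between the two anchors can be checked with a short sketch. It uses Python's `re` as a stand-in for the editor's engine (an assumption; exact match boundaries may differ between engines), and it only reports the first match each variant finds in the two strings from the list above.

```python
import re

# Shared option part of both variants, with the YAML escaping removed.
OPTION = r"(-[A-Za-z0-9]+|--[A-Za-z0-9][\w-]*)"
WITH_NON_BOUNDARY = re.compile(r"\B" + OPTION + r"\b")
WITH_SPACE_OR_START = re.compile(r"(\s|^)" + OPTION)


def first_match(rule, text):
    """Return the text of the first match, or None if nothing matches."""
    m = rule.search(text)
    return m.group(0) if m else None


for sample in ["not-option--opt1", "---opt2"]:
    print(sample,
          "| non-boundary:", first_match(WITH_NON_BOUNDARY, sample),
          "| space-or-start:", first_match(WITH_SPACE_OR_START, sample))
```

In this sketch the `\B` variant finds something to highlight inside both strings, while the `(\s|^)` variant matches nothing in either.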

So, these ?: are pointless, and the regex should be just:

(\\s|^)(-[A-Za-z0-9]+|--[A-Za-z0-9][\\w-]*|--(\\s|$))

Also, putting this (\\s|$) before ) doesn't quite solve the root cause, because: https://regex101.com/r/r8LPJO/1

Maybe we shouldn't match -- anymore since it's hard to do so, and there are situations to write something like wrapper -- --opt-of-other-cmd.
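
One effect that can be reproduced with the `--(\s|$)` variant quoted above is sketched below; whether it is the same case the linked regex101 page shows is an assumption. Python's `re` again stands in for the editor's engine, and the command line is the wrapper example from this comment.

```python
import re

# The variant with the trailing "--(\s|$)" alternative, YAML escaping removed.
PATTERN = re.compile(r"(\s|^)(-[A-Za-z0-9]+|--[A-Za-z0-9][\w-]*|--(\s|$))")


def matches(line):
    """Return the full text of every match in `line`."""
    return [m.group(0) for m in PATTERN.finditer(line)]


print(matches("wrapper -- --opt-of-other-cmd"))
print(matches("wrapper --"))
```

Because the match for `--` consumes the whitespace that follows it, the next option no longer has a leading whitespace character to match against, so in this sketch `--opt-of-other-cmd` is left unhighlighted.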

Contributor Author

@nabeelsherazi nabeelsherazi Sep 28, 2025

Hahaha, I think in general the problem is we are trying to write a parser in regex, which is famously recommended against.

I would say let not the perfect be the enemy of the good -- what we have now is still an improvement over what was there before. The correct way to handle this is likely a syntax highlighting mechanism a bit more sophisticated than regex. Shall we merge what we have and then circle back on this when someone has cycles?

Contributor

I agree with @niten94 that we should just not match -- if it causes problems. It might even be better for readability if it has a different color than "normal" options.

> I think in general the problem is we are trying to write a parser in regex, which is famously recommended against.

In this case the problem is more that there is no universal syntax for arguments as it is entirely up to the invoked program to parse them. From the shell's perspective they are all just arbitrary strings.

> I would say let not the perfect be the enemy of the good -- what we have now is still an improvement over what was there before. The correct way to handle this is likely a syntax highlighting mechanism a bit more sophisticated than regex. Shall we merge what we have and then circle back on this when someone has cycles?

I agree, this is a lot of bikeshedding while just expanding the character groups without any of the special handling for edge cases would already be a nice improvement. Let's get that part merged and if someone wants to craft a perfect regex later that can happen in a separate PR?

Collaborator

> I agree with @niten94 that we should just not match -- if it causes problems. It might even be better for readability if it has a different color than "normal" options.

Yes, I agree. There is no good solution for this, just some hacks.

> Let's get that part merged and if someone wants to craft a perfect regex later that can happen in a separate PR?

Which one exactly? 😅

Contributor

> > Let's get that part merged and if someone wants to craft a perfect regex later that can happen in a separate PR?
>
> Which one exactly? 😅

I believe the part @Andriamanitra meant is expanding the character group used to highlight --long-opt. Anyway, I think we can finally settle on this pattern, which is ultimately based on @dmaluka's suggestion:

(\\s|^)(-[A-Za-z0-9]+|--[A-Za-z0-9][\\w-]*)
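
As a sanity check, the sketch below runs this pattern over a handful of command lines. It uses Python's `re` as a stand-in for the editor's regex engine (an assumption), and the sample lines are invented or borrowed from the examples earlier in this thread.

```python
import re

# The pattern above, with the YAML escaping removed.
FINAL = re.compile(r"(\s|^)(-[A-Za-z0-9]+|--[A-Za-z0-9][\w-]*)")


def highlighted(line):
    """Return the option text (capture group 2) of every match in `line`."""
    return [m.group(2) for m in FINAL.finditer(line)]


# Hypothetical sample lines.
samples = [
    "--help",
    "cmd --a",
    "ls -la --color=auto",
    "wrapper -- --opt-of-other-cmd",
    "gcc -Dval,with,commas",
    "not-option--opt1",
]
for sample in samples:
    print(repr(sample), "->", highlighted(sample))
```

In this sketch the bare `--` is left alone while the option after it is still matched, and values attached to an option (`=auto`, `,with,commas`) stay outside the match, which is consistent with the limitations discussed above.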


- identifier: "\\$\\{[\\w:!%&=+#~@*^$?, .\\-\\/\\[\\]]+\\}"
- identifier: "\\$([0-9!#@*$?-]|[A-Za-z_]\\w*)"