Tcl regular expression example

Posted on 2024-11-18 09:23:23

I want to write a regexp that extracts the word starting with abc_ or xyz_ from a string.
Here is my script:

[regexp -nocase -- {.*\s+(abc_|xyz_\S+)\s+.*} $str all necessaryStr]

So if I apply the regexp above to str1 and str2, I want to get "xyz_hello" from $str1 and "abc_bye" from $str2.

set str1 "gfrdgasjklh dlasd =-0-489 xyz_hello sddf 89rn sf n9"
set str2 "dytfasjklh abc_bye dlasd =-0tyj-489 sddf tyj89rn sjf n9"

But my regexp does not work. My questions are:

1) What is wrong with my regexp?
2) Is it a good idea to use a regexp to find words that start with some predefined prefixes, or is it better to use string functions (string match or similar)?

呆头 2024-11-25 09:23:23

It is not clear from your question what constitutes a word. Are further underscores permitted? Are digits permitted? What about "words" that consist of just the prefix, e.g. "abc_" or "xyz_"?

Making the conservative assumptions (based on your examples) that you are expecting only letters from the English alphabet, at least one further character, and you don't care about case, you can simplify your regexp:

[regexp -nocase -- {\m(abc_|xyz_)[a-zA-Z]+} $str match]

This will set match to the matching word. You can replace the contents of the square brackets if your definition of a word differs from my assumptions.
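As for what goes wrong in the original pattern: alternation binds more loosely than concatenation, so (abc_|xyz_\S+) reads as "abc_ on its own, or xyz_ followed by non-spaces". That would explain why it can match str1 but not str2, where abc_ is followed by "bye" rather than whitespace. A minimal sketch comparing both patterns on the strings from the question (the helper name findPrefixed is made up for illustration):

```tcl
set str1 "gfrdgasjklh dlasd =-0-489 xyz_hello sddf 89rn sf n9"
set str2 "dytfasjklh abc_bye dlasd =-0tyj-489 sddf tyj89rn sjf n9"

# Original pattern from the question: \S+ belongs only to the xyz_ branch,
# so the abc_ branch must be followed directly by whitespace.
set original {.*\s+(abc_|xyz_\S+)\s+.*}
set origHit1 [regexp -nocase -- $original $str1 all necessaryStr]
set origHit2 [regexp -nocase -- $original $str2]

# Hypothetical helper wrapping the simplified pattern from this answer.
# Returns the matched word, or an empty string when nothing matches.
proc findPrefixed {str} {
    if {[regexp -nocase -- {\m(abc_|xyz_)[a-zA-Z]+} $str match]} {
        return $match
    }
    return ""
}

puts "original on str1: $origHit1, on str2: $origHit2"
puts "simplified on str1: [findPrefixed $str1]"
puts "simplified on str2: [findPrefixed $str2]"
```

If you want to keep the original shape, moving \S+ outside the group, as in (abc_|xyz_)\S+, should fix the grouping, although the group would then capture only the prefix.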

Your second question about whether to prefer regexp to string functions will depend upon context, and could lead into subjective territory.

Some things to consider:

Does performance really matter? Unless you are doing the search in a tight loop, or searching very long strings, I suspect any performance difference will not be relevant. Wait until you have a performance issue, then profile your application to see where the bottleneck is, then you can test alternative implementations.
Convenience is going to depend upon the preference of the programmer(s) who have to write and maintain the code. Do they love/hate using regexps?
Using a regexp is likely to offer more flexibility, but it can be at the cost of readability.

My recommendation would be to use whichever you are most comfortable with. Write a good set of unit tests for your code, then optimise later only if you have identified a bottleneck there during profiling.
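For comparison, here is a rough sketch of the string-function route, assuming a "word" is simply anything delimited by whitespace (which is looser than the letters-only assumption above). The proc name findPrefixedWord and its argument list are made up for this example, and the prefixes are assumed to contain no glob metacharacters:

```tcl
# Hypothetical helper: return the first whitespace-delimited word that
# starts with one of the given prefixes and has at least one more character.
# Assumes the prefixes contain no glob-special characters (* ? [ ] \).
proc findPrefixedWord {str prefixes} {
    foreach word [split $str] {
        foreach prefix $prefixes {
            # "?*" requires at least one character after the prefix
            if {[string match -nocase "${prefix}?*" $word]} {
                return $word
            }
        }
    }
    return ""
}

set str1 "gfrdgasjklh dlasd =-0-489 xyz_hello sddf 89rn sf n9"
set str2 "dytfasjklh abc_bye dlasd =-0tyj-489 sddf tyj89rn sjf n9"

puts [findPrefixedWord $str1 {abc_ xyz_}]
puts [findPrefixedWord $str2 {abc_ xyz_}]

# If performance ever becomes a concern, the built-in time command gives
# a quick per-iteration figure to compare against the regexp version.
puts [time {findPrefixedWord $str1 {abc_ xyz_}} 1000]
```

Whether this reads better than the one-line regexp is a matter of taste; it is longer, but each step is explicit.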

≈。彩虹 2024-11-25 09:23:23

On the basis of what you've written, you seem to want words beginning with abc_ or xyz_ (in any case) and having just letters after that. A good first attempt at matching this is:

regexp -nocase -- {\y(?:abc_|xyz_)[a-z]+} $str match

The special features of this are:

\y means this only matches at word start (theoretically word end too, but we follow it by a letter in all cases!)
(?:…) is grouping without capturing
Greedy matching means we'll get the whole word (assuming a word consists only of letters from the ASCII range of Unicode). Consider using \w or \S instead of [a-z], but these do change the semantics of what's matched (\w gives you roughly the characters usually allowed in program identifiers, and \S gives you any non-space character).
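To make the difference between the three character classes concrete, here is a small sketch. The sample string and the helper firstMatch are invented for illustration and are not from the question:

```tcl
# Hypothetical helper: return the first match of re in str, or "" if none.
proc firstMatch {re str} {
    if {[regexp -nocase -- $re $str match]} {
        return $match
    }
    return ""
}

# Made-up sample with a digit and a hyphen inside the "word".
set sample "foo xyz_hello2-world bar"

# Letters only: stops at the digit.
set lettersOnly [firstMatch {\y(?:abc_|xyz_)[a-z]+} $sample]
# Word characters (letters, digits, underscore): stops at the hyphen.
set wordChars [firstMatch {\y(?:abc_|xyz_)\w+} $sample]
# Non-space characters: runs to the next whitespace.
set nonSpace [firstMatch {\y(?:abc_|xyz_)\S+} $sample]

# \y keeps the prefix from matching in the middle of a longer word.
set embedded [firstMatch {\y(?:abc_|xyz_)[a-z]+} "myxyz_hello"]

puts "letters only: $lettersOnly"
puts "word chars:   $wordChars"
puts "non-space:    $nonSpace"
puts "embedded:     '$embedded'"
```

Which of the three is right depends on what counts as a word in your data.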