Select only words that contain a colon

Posted 2025-01-11 00:11:26

I have a text file that uses a markup language (similar to Wikipedia articles):

cat test.txt
This is a sample text having: colon in the text. and there is more [[in single or double: brackets]]. I need to select the first word only.
and second line with no [brackets] colon in it.

I need to select the word "having:" only because that is part of regular text. I tried

grep -v '[*:*]' test.txt

This will correctly avoid the tags, but does not select the expected word.

梦晓ヶ微光ヅ倾城 2025-01-18 00:11:26

The square brackets specify a character class, so your regular expression looks for any occurrence of one of the characters * or : (or *, but we said that already, didn't we?)
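As a quick check of that reading, here is a minimal sketch. The three sample lines are made up for illustration and are not from the question's file. It shows that `[*:*]` matches any line containing a `*` or a `:`, and that `-v` then drops whole lines rather than picking out words.

```shell
# Made-up sample lines (an assumption for this demo, not the question's data).
lines='has: colon
has * star
plain line'

# [*:*] is a bracket expression: one character that is either * or :
matched=$(printf '%s\n' "$lines" | grep '[*:*]')

# -v inverts the selection line by line, so only the line with neither survives
inverted=$(printf '%s\n' "$lines" | grep -v '[*:*]')

printf 'matched:\n%s\n\ninverted:\n%s\n' "$matched" "$inverted"
```

This is why the original attempt prints the second line of test.txt in full and nothing from the first: the filter works on lines, not on words.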

grep has the option -o to print only the matching text, so something like

grep -ow '[^[:space:]]*:[^[:space:]]*' file.txt

would extract any text with a colon in it, surrounded by zero or more non-whitespace characters on each side. The -w option adds the condition that the match needs to be between word boundaries.
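A small sketch of that command run against the question's two sample lines, fed through a pipe instead of file.txt. The variable names are only for illustration. With a grep that supports `-o` (GNU and BSD grep do), this should also report the bracketed `double:`, which is the motivation for the context-aware variants.

```shell
# The two lines from the question's test.txt
sample='This is a sample text having: colon in the text. and there is more [[in single or double: brackets]]. I need to select the first word only.
and second line with no [brackets] colon in it.'

# -o prints each match on its own line; -w requires word boundaries around it
words=$(printf '%s\n' "$sample" | grep -ow '[^[:space:]]*:[^[:space:]]*')

printf '%s\n' "$words"
```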

However, if you want to restrict the context in which the text may match, you will probably need a more capable tool than plain grep. For example, you could use sed to preprocess each line to remove any bracketed text, and then look for matches in the remaining text.

sed -e 's/\[.*]//g' -e 's/ [^: ]*$/ /' -e 's/[^: ]* //g' -e 's/ /\n/' file.txt

(This assumes that your sed recognizes \n in the replacement string as a literal newline. There are simple workarounds available if it doesn't, but let's not go there if it's not necessary.)

In brief, we first replace any text between square brackets. (This needs to be improved if your input could contain multiple sequences of square brackets on a line with normal text between them. Your example only shows nested square brackets, but my approach is probably too simple for either case.) Then, we remove any words which don't contain a colon, with a special provision for the last word on the line, and some subsequent cleanup. Finally, we replace any remaining spaces with newlines, and (implicitly) print whatever is left. (This still ends up printing one newline too many, but that is easy to fix up later.)

Alternatively, we could use sed to remove any bracketed expressions, then use grep on the remaining tokens.

sed -e :a -e 's/\[[^][]*\]//' -e ta file.txt |
grep -ow '[^[:space:]]*:[^[:space:]]*'

The :a creates a label a and ta says to jump back to that label and try again if the regex matched. This one also demonstrates how to handle nested and repeated brackets. (I suppose it could be refactored into the previous attempt, so we could avoid the pipe to grep. But outlining different solution models is also useful here, I suppose.)
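To see the loop at work, here is a minimal sketch on a made-up line that has both nested and repeated brackets. The function name `strip_brackets` and the sample text are assumptions for this demo. Each pass removes one innermost `[...]` group, and `ta` repeats until no substitution happens.

```shell
# Hypothetical helper wrapping the sed loop from the answer
strip_brackets() {
  # :a defines a label; the s command removes one innermost [...] group;
  # ta jumps back to :a whenever that substitution succeeded
  sed -e :a -e 's/\[[^][]*\]//' -e ta
}

# Made-up input with nested and repeated bracket groups
line='one: [[nested two:]] three: [four:] five'

stripped=$(printf '%s\n' "$line" | strip_brackets)
words=$(printf '%s\n' "$stripped" | grep -ow '[^[:space:]]*:[^[:space:]]*')

printf 'stripped: %s\n' "$stripped"
printf '%s\n' "$words"
```

The colon words inside either bracket group (`two:` and `four:`) are gone before grep ever sees the line.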

If you wanted to ensure that there is at least one non-colon character adjacent to the colon, you could do something like

... file.txt |
grep -owE '[^:[:space:]]+:[^[:space:]]*|[^[:space:]]*:[^: [:space:]]+'

where the -E option selects a slightly more modern regex dialect which allows us to use | between alternatives and + for one or more repetitions. (Basic grep in 1969 did not have these features at all; much later, the POSIX standard grafted them on with a slightly wacky syntax which requires you to backslash them to remove the literal meaning and select the metacharacter behavior... but let's not go there.)

Notice also how [^:[:space:]] matches a single character which is not a colon or a whitespace character, where [:space:] is the (slightly arcane) special POSIX named character class which matches any whitespace character (regular space, horizontal tab, vertical tab, possibly Unicode whitespace characters, depending on locale).
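The difference between the two patterns shows up on a token that is nothing but a colon. The following sketch uses a made-up input line (an assumption, not from the question) and compares the looser basic pattern with the stricter `-E` pattern.

```shell
# Made-up input: a lone colon plus three colon-bearing words
line='a lone : then key:value then :tail then head:'

# Basic pattern: zero or more non-space characters on each side, so ":" alone matches
loose=$(printf '%s\n' "$line" | grep -ow '[^[:space:]]*:[^[:space:]]*')

# Extended pattern: at least one non-colon, non-space character next to the colon
strict=$(printf '%s\n' "$line" |
  grep -owE '[^:[:space:]]+:[^[:space:]]*|[^[:space:]]*:[^: [:space:]]+')

printf 'loose:\n%s\n\nstrict:\n%s\n' "$loose" "$strict"
```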

Awk easily lets you iterate over the tokens on a line. The requirement to ignore matches within square brackets complicates matters somewhat; you could keep a separate variable to keep track of whether you are inside brackets or not.

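awk '{ for (i = 1; i <= NF; ++i) {
        # "next" would abandon the rest of the line; "continue" moves to the next token
        if ($i ~ /\]/) { brackets = 0; continue }
        if ($i ~ /\[/) brackets = 1
        if (brackets) continue
        if ($i ~ /:/) print $i } }' file.txt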

This again hard-codes some perhaps incorrect assumptions about how the brackets can be placed. It will behave unexpectedly if a single token contains a closing square bracket followed by an opening one, and has an oversimplified treatment of nested brackets (the first closing bracket after a series of opening brackets will effectively assume we are no longer inside brackets).
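One possible way around those token-level assumptions is to track nesting depth character by character. This is a different technique from the answer's flag variable, sketched here under the assumption that brackets are balanced within each line. The function name `colon_words` and the test input are hypothetical.

```shell
# Hypothetical helper: print colon-bearing words that sit outside all brackets
colon_words() {
  awk '{
    depth = 0; out = ""
    # Walk the line one character at a time, keeping only text at depth 0
    for (i = 1; i <= length($0); i++) {
      c = substr($0, i, 1)
      if (c == "[") { depth++; continue }
      if (c == "]") { if (depth > 0) depth--; continue }
      if (depth == 0) out = out c
    }
    # Split the surviving text on whitespace and report words with a colon
    n = split(out, words)
    for (j = 1; j <= n; j++) if (words[j] ~ /:/) print words[j]
  }' "$@"
}

# Made-up input with nesting and a bracket group glued to a following word
result=$(printf '%s\n' 'a: [b: [c:] d:] e: [f:]x' | colon_words)
printf '%s\n' "$result"
```

Resetting `depth` on every line means an unbalanced bracket cannot leak into the next line, at the cost of not supporting bracket groups that span lines.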

爱*していゐ 2025-01-18 00:11:26

A combined solution using sed and awk:

sed 's/ /\n/g' test.txt | gawk 'i==0 && $0~/:$/{ print $0 }/\[/{ i++} /\]/ {i--}'

sed will change all spaces to a newline
awk (or gawk) will output all lines matching $0~/:$/, as long as i equals zero
The last part of the awk stuff keeps a count of the opening and closing brackets.
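A hedged, more portable sketch of the same pipeline on the question's sample text: `tr` stands in for the sed step (not every sed accepts `\n` in the replacement), and plain `awk` stands in for gawk. Both swaps are assumptions made for this demo; the awk program itself is unchanged.

```shell
# The two lines from the question's test.txt
sample='This is a sample text having: colon in the text. and there is more [[in single or double: brackets]]. I need to select the first word only.
and second line with no [brackets] colon in it.'

# One token per line, then print tokens ending in ":" while the counter i is 0.
# Note that i changes by at most 1 per token, however many brackets it holds.
found=$(printf '%s\n' "$sample" | tr ' ' '\n' |
  awk 'i==0 && $0~/:$/{ print $0 } /\[/{ i++ } /\]/{ i-- }')

printf '%s\n' "$found"
```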

Another solution using sed and grep:

sed -r -e 's/\[.*\]+//g' -e 's/ /\n/g' test.txt | grep ':$'

's/\[.*\]+//g' will filter the stuff between brackets
's/ /\n/g'  will replace a space with a newline
grep will only find lines ending with :
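One caveat worth checking is that `.*` is greedy, so everything from the first `[` to the last `]` on a line is removed in one go. The sketch below uses a made-up line with two bracket groups; `sed -E` and `tr` replace `sed -r` and the `\n` replacement for portability, and both are assumptions of this demo. It compares the greedy removal with a loop that removes one innermost group at a time.

```shell
# Made-up input: a colon word sits between two bracket groups
line='a: [x] keep: [y] b:'

# Greedy: \[.*\] spans from the first [ to the last ], swallowing "keep:"
greedy=$(printf '%s\n' "$line" | sed -E 's/\[.*\]+//g' | tr ' ' '\n' | grep ':$')

# Looping over innermost groups leaves the text between them intact
looped=$(printf '%s\n' "$line" | sed -e :a -e 's/\[[^][]*\]//' -e ta |
  tr ' ' '\n' | grep ':$')

printf 'greedy:\n%s\n\nlooped:\n%s\n' "$greedy" "$looped"
```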

A third one, using only awk:
gawk '{ for (t=1;t<=NF;t++){ 
            if(i==0 && $t~/:$/) print $t; 
            i=i+gsub(/\[/,"",$t)-gsub(/\]/,"",$t) }}' test.txt


gsub returns the number of replacements.
The variable i counts the bracket nesting level. It is incremented by 1 for every [ and decremented by 1 for every ]. This works because gsub(/\[/,"",$t) returns the number of characters it replaced. For a token like [[][ the count increases by (3-1=) 2. My code fails when a token has brackets AND a colon: a token ending with : is tested against i before that token's own brackets are counted, so it is printed even though it opens a bracket.
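That failure mode can be reproduced with a short sketch. The input line and the function name `count_brackets` are made up for illustration, and plain `awk` is used in place of gawk on the assumption that the program needs nothing gawk-specific.

```shell
# Hypothetical wrapper around the awk-only solution
count_brackets() {
  awk '{ for (t = 1; t <= NF; t++) {
           # The colon test runs before this token updates the bracket count
           if (i == 0 && $t ~ /:$/) print $t
           i = i + gsub(/\[/, "", $t) - gsub(/\]/, "", $t) } }'
}

# "[tag:" opens a bracket and ends with a colon, so it slips through
out=$(printf '%s\n' 'plain: [tag: inner: ] end:' | count_brackets)
printf '%s\n' "$out"
```

Moving the counter update for opening brackets ahead of the colon test would be one way to close that gap, at the cost of a slightly longer loop body.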
