为什么 findstr 不能正确处理大小写（在某些情况下）？

发布于 2024-08-28 10:17:50 字数 1015 浏览 13 评论 0原文

在 cmd.exe 中编写最近的一些脚本时，我需要将 findstr 与正则表达式一起使用 - 客户需要标准 cmd.exe 命令（没有 GnuWin32、Cygwin、VBS 或 Powershell）。

我只是想知道变量是否包含任何大写字符并尝试使用：

> set myvar=abc
> echo %myvar%|findstr /r "[A-Z]"
abc
> echo %errorlevel%
0

当 %myvar% 设置为 abc 时，实际上输出字符串并设置 < code>errorlevel 为 0，表示找到匹配项。

但是，完整列表变体：

> echo %myvar%|findstr /r "[ABCDEFGHIJKLMNOPQRSTUVWXYZ]"
> echo %errorlevel%
1

不输出该行，并且它正确地将errorlevel设置为1。

此外：

> echo %myvar%|findstr /r "^[A-Z]*$"
> echo %errorlevel%
1

也按预期工作。

我显然在这里遗漏了一些东西，即使只是 findstr 不知何故损坏了。

为什么第一个（范围）正则表达式在这种情况下不起作用？

还有更奇怪的地方：

> echo %myvar%|findstr /r "[A-Z]"
abc
> echo %myvar%|findstr /r "[A-Z][A-Z]"
abc
> echo %myvar%|findstr /r "[A-Z][A-Z][A-Z]"
> echo %myvar%|findstr /r "[A]"

上面的最后两个也没有输出字符串！

原文

While writing some recent scripts in cmd.exe, I had a need to use findstr with regular expressions - customer required standard cmd.exe commands (no GnuWin32 nor Cygwin nor VBS nor Powershell).

I just wanted to know if a variable contained any upper-case characters and attempted to use:

> set myvar=abc
> echo %myvar%|findstr /r "[A-Z]"
abc
> echo %errorlevel%
0

When %myvar% is set to abc, that actually outputs the string and sets errorlevel to 0, saying that a match was found.

However, the full-list variant:

> echo %myvar%|findstr /r "[ABCDEFGHIJKLMNOPQRSTUVWXYZ]"
> echo %errorlevel%
1

does not output the line and it correctly sets errorlevel to 1.

In addition:

> echo %myvar%|findstr /r "^[A-Z]*$"
> echo %errorlevel%
1

also works as expected.

I'm obviously missing something here even if it's only the fact that findstr is somehow broken.

Why does the first (range) regex not work in this case?

And yet more weirdness:

> echo %myvar%|findstr /r "[A-Z]"
abc
> echo %myvar%|findstr /r "[A-Z][A-Z]"
abc
> echo %myvar%|findstr /r "[A-Z][A-Z][A-Z]"
> echo %myvar%|findstr /r "[A]"

The last two above also does not output the string!!

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

时光沙漏 2024-09-04 10:17:50

我相信这主要是一个可怕的设计缺陷。

我们都希望范围能够根据 ASCII 代码值进行整理。但它们没有 - 相反，范围基于几乎与 SORT 使用的默认序列匹配的排序规则序列。编辑 -FINDSTR 使用的确切排序顺序现在可在获取https://stackoverflow.com/a/20159191/1012053 标题为正则表达式字符类范围 [xy] 的部分。

我准备了一个文本文件，其中包含 1 - 255 范围内每个扩展 ASCII 字符的一行，不包括 10 (LF)、13 (CR) 和 26（Windows 上的 EOF）。
在每一行上，我都有一个字符，后跟一个空格，然后是该字符的十进制代码。然后，我通过 SORT 运行该文件，并将输出捕获到sortedChars.txt 文件中。

现在，我可以轻松地针对此排序文件测试任何正则表达式范围，并演示如何通过与 SORT 几乎相同的排序序列确定范围。

>findstr /nrc:"^[0-9]" sortedChars.txt
137:0 048
138:½ 171
139:¼ 172
140:1 049
141:2 050
142:² 253
143:3 051
144:4 052
145:5 053
146:6 054
147:7 055
148:8 056
149:9 057

结果并不完全符合我们的预期，因为混合了字符 171、172 和 253。但结果是完全有道理的。行号前缀对应于SORT排序顺序，可以看到范围按照SORT顺序完全匹配。

这是另一个完全遵循 SORT 序列的范围测试：

>findstr /nrc:"^[!-=]" sortedChars.txt
34:! 033
35:" 034
36:# 035
37:$ 036
38:% 037
39:& 038
40:( 040
41:) 041
42:* 042
43:, 044
44:. 046
45:/ 047
46:: 058
47:; 059
48:? 063
49:@ 064
50:[ 091
51:\ 092
52:] 093
53:^ 094
54:_ 095
55:` 096
56:{ 123
57:| 124
58:} 125
59:~ 126
60:¡ 173
61:¿ 168
62:¢ 155
63:£ 156
64:¥ 157
65:₧ 158
66:+ 043
67:∙ 249
68:< 060
69:= 061

存在一个带有字母字符的小异常。字符“a”在“A”和“Z”之间排序，但与 [AZ] 不匹配。 “z”排在“Z”之后，但它匹配 [AZ]。 [az] 也有相应的问题。 “A”排序在“a”之前，但它匹配 [az]。 “Z”在“a”和“z”之间排序，但它不匹配 [az]。

以下是 [AZ] 结果：

>findstr /nrc:"^[A-Z]" sortedChars.txt
151:A 065
153:â 131
154:ä 132
155:à 133
156:å 134
157:Ä 142
158:Å 143
159:á 160
160:ª 166
161:æ 145
162:Æ 146
163:B 066
164:b 098
165:C 067
166:c 099
167:Ç 128
168:ç 135
169:D 068
170:d 100
171:E 069
172:e 101
173:é 130
174:ê 136
175:ë 137
176:è 138
177:É 144
178:F 070
179:f 102
180:ƒ 159
181:G 071
182:g 103
183:H 072
184:h 104
185:I 073
186:i 105
187:ï 139
188:î 140
189:ì 141
190:í 161
191:J 074
192:j 106
193:K 075
194:k 107
195:L 076
196:l 108
197:M 077
198:m 109
199:N 078
200:n 110
201:ñ 164
202:Ñ 165
203:ⁿ 252
204:O 079
205:o 111
206:ô 147
207:ö 148
208:ò 149
209:Ö 153
210:ó 162
211:º 167
212:P 080
213:p 112
214:Q 081
215:q 113
216:R 082
217:r 114
218:S 083
219:s 115
220:ß 225
221:T 084
222:t 116
223:U 085
224:u 117
225:û 150
226:ù 151
227:ú 163
228:ü 129
229:Ü 154
230:V 086
231:v 118
232:W 087
233:w 119
234:X 088
235:x 120
236:Y 089
237:y 121
238:ÿ 152
239:Z 090
240:z 122

[az] 结果

>findstr /nrc:"^[a-z]" sortedChars.txt
151:A 065
152:a 097
153:â 131
154:ä 132
155:à 133
156:å 134
157:Ä 142
158:Å 143
159:á 160
160:ª 166
161:æ 145
162:Æ 146
163:B 066
164:b 098
165:C 067
166:c 099
167:Ç 128
168:ç 135
169:D 068
170:d 100
171:E 069
172:e 101
173:é 130
174:ê 136
175:ë 137
176:è 138
177:É 144
178:F 070
179:f 102
180:ƒ 159
181:G 071
182:g 103
183:H 072
184:h 104
185:I 073
186:i 105
187:ï 139
188:î 140
189:ì 141
190:í 161
191:J 074
192:j 106
193:K 075
194:k 107
195:L 076
196:l 108
197:M 077
198:m 109
199:N 078
200:n 110
201:ñ 164
202:Ñ 165
203:ⁿ 252
204:O 079
205:o 111
206:ô 147
207:ö 148
208:ò 149
209:Ö 153
210:ó 162
211:º 167
212:P 080
213:p 112
214:Q 081
215:q 113
216:R 082
217:r 114
218:S 083
219:s 115
220:ß 225
221:T 084
222:t 116
223:U 085
224:u 117
225:û 150
226:ù 151
227:ú 163
228:ü 129
229:Ü 154
230:V 086
231:v 118
232:W 087
233:w 119
234:X 088
235:x 120
236:Y 089
237:y 121
238:ÿ 152
240:z 122

Sort 将大写字母排在小写字母前面。（编辑 - 我刚刚阅读了 SORT 的帮助，并了解到它不区分大小写。事实上，我的 SORT 输出始终将大写放在小写之前，这可能是输入顺序的结果。）< /em> 但正则表达式显然将小写字母排在大写字母之前。以下所有范围均无法匹配任何字符。

>findstr /nrc:"^[A-a]" sortedChars.txt

>findstr /nrc:"^[B-b]" sortedChars.txt

>findstr /nrc:"^[C-c]" sortedChars.txt

>findstr /nrc:"^[D-d]" sortedChars.txt

颠倒顺序即可找到字符。

>findstr /nrc:"^[a-A]" sortedChars.txt
151:A 065
152:a 097

>findstr /nrc:"^[b-B]" sortedChars.txt
163:B 066
164:b 098

>findstr /nrc:"^[c-C]" sortedChars.txt
165:C 067
166:c 099

>findstr /nrc:"^[d-D]" sortedChars.txt
169:D 068
170:d 100

正则表达式对其他字符的排序与 SORT 不同，但我还没有准确的列表。

I believe this is mostly a horrible design flaw.

We all expect the ranges to collate based on the ASCII code value. But they don't - instead the ranges are based on a collation sequence that nearly matches the default sequence used by SORT. EDIT -The exact collation sequence used by FINDSTR is now available at https://stackoverflow.com/a/20159191/1012053 under the section titled Regex character class ranges [x-y].

I prepared a text file containing one line for each extended ASCII character from 1 - 255, excluding 10 (LF), 13 (CR), and 26 (EOF on Windows).
On each line I have the character, followed by a space, followed by the decimal code for the character. I then ran the file through SORT and captured the output in a sortedChars.txt file.

I now can easily test any regex range against this sorted file and demonstrate how the range is determined by a collation sequence that is nearly the same as SORT.

>findstr /nrc:"^[0-9]" sortedChars.txt
137:0 048
138:½ 171
139:¼ 172
140:1 049
141:2 050
142:² 253
143:3 051
144:4 052
145:5 053
146:6 054
147:7 055
148:8 056
149:9 057

The results are not quite what we expected in that chars 171, 172 and 253 are thrown in the mix. But the results make perfect sense. The line number prefix corresponds to the SORT collation sequence, and you can see that the range exactly matches according to the SORT sequence.

Here is another range test that exactly follows the SORT sequence:

>findstr /nrc:"^[!-=]" sortedChars.txt
34:! 033
35:" 034
36:# 035
37:$ 036
38:% 037
39:& 038
40:( 040
41:) 041
42:* 042
43:, 044
44:. 046
45:/ 047
46:: 058
47:; 059
48:? 063
49:@ 064
50:[ 091
51:\ 092
52:] 093
53:^ 094
54:_ 095
55:` 096
56:{ 123
57:| 124
58:} 125
59:~ 126
60:¡ 173
61:¿ 168
62:¢ 155
63:£ 156
64:¥ 157
65:₧ 158
66:+ 043
67:∙ 249
68:< 060
69:= 061

There is one small anomaly with alpha characters. Character "a" sorts between "A" and "Z" yet it does not match [A-Z]. "z" sorts after "Z", yet it matches [A-Z]. There is a corresponding problem with [a-z]. "A" sorts before "a", yet it matches [a-z]. "Z" sorts between "a" and "z", yet it does not match [a-z].

Here are the [A-Z] results:

>findstr /nrc:"^[A-Z]" sortedChars.txt
151:A 065
153:â 131
154:ä 132
155:à 133
156:å 134
157:Ä 142
158:Å 143
159:á 160
160:ª 166
161:æ 145
162:Æ 146
163:B 066
164:b 098
165:C 067
166:c 099
167:Ç 128
168:ç 135
169:D 068
170:d 100
171:E 069
172:e 101
173:é 130
174:ê 136
175:ë 137
176:è 138
177:É 144
178:F 070
179:f 102
180:ƒ 159
181:G 071
182:g 103
183:H 072
184:h 104
185:I 073
186:i 105
187:ï 139
188:î 140
189:ì 141
190:í 161
191:J 074
192:j 106
193:K 075
194:k 107
195:L 076
196:l 108
197:M 077
198:m 109
199:N 078
200:n 110
201:ñ 164
202:Ñ 165
203:ⁿ 252
204:O 079
205:o 111
206:ô 147
207:ö 148
208:ò 149
209:Ö 153
210:ó 162
211:º 167
212:P 080
213:p 112
214:Q 081
215:q 113
216:R 082
217:r 114
218:S 083
219:s 115
220:ß 225
221:T 084
222:t 116
223:U 085
224:u 117
225:û 150
226:ù 151
227:ú 163
228:ü 129
229:Ü 154
230:V 086
231:v 118
232:W 087
233:w 119
234:X 088
235:x 120
236:Y 089
237:y 121
238:ÿ 152
239:Z 090
240:z 122

And the [a-z] results

>findstr /nrc:"^[a-z]" sortedChars.txt
151:A 065
152:a 097
153:â 131
154:ä 132
155:à 133
156:å 134
157:Ä 142
158:Å 143
159:á 160
160:ª 166
161:æ 145
162:Æ 146
163:B 066
164:b 098
165:C 067
166:c 099
167:Ç 128
168:ç 135
169:D 068
170:d 100
171:E 069
172:e 101
173:é 130
174:ê 136
175:ë 137
176:è 138
177:É 144
178:F 070
179:f 102
180:ƒ 159
181:G 071
182:g 103
183:H 072
184:h 104
185:I 073
186:i 105
187:ï 139
188:î 140
189:ì 141
190:í 161
191:J 074
192:j 106
193:K 075
194:k 107
195:L 076
196:l 108
197:M 077
198:m 109
199:N 078
200:n 110
201:ñ 164
202:Ñ 165
203:ⁿ 252
204:O 079
205:o 111
206:ô 147
207:ö 148
208:ò 149
209:Ö 153
210:ó 162
211:º 167
212:P 080
213:p 112
214:Q 081
215:q 113
216:R 082
217:r 114
218:S 083
219:s 115
220:ß 225
221:T 084
222:t 116
223:U 085
224:u 117
225:û 150
226:ù 151
227:ú 163
228:ü 129
229:Ü 154
230:V 086
231:v 118
232:W 087
233:w 119
234:X 088
235:x 120
236:Y 089
237:y 121
238:ÿ 152
240:z 122

Sort sorts upper case before lower case. (EDIT - I just read the help for SORT and learned that it does not differentiate between upper and lower case. The fact that my SORT output consistently put upper before lower is probably a result of the order of the input.) But regex apparently sorts lower case before upper case. All of the following ranges fail to match any characters.

>findstr /nrc:"^[A-a]" sortedChars.txt

>findstr /nrc:"^[B-b]" sortedChars.txt

>findstr /nrc:"^[C-c]" sortedChars.txt

>findstr /nrc:"^[D-d]" sortedChars.txt

Reversing the order finds the characters.

>findstr /nrc:"^[a-A]" sortedChars.txt
151:A 065
152:a 097

>findstr /nrc:"^[b-B]" sortedChars.txt
163:B 066
164:b 098

>findstr /nrc:"^[c-C]" sortedChars.txt
165:C 067
166:c 099

>findstr /nrc:"^[d-D]" sortedChars.txt
169:D 068
170:d 100

There are additional characters that regex sorts differently than SORT, but I haven't got a precise list.

回复收藏 0 原文

雪花飘飘的天空 2024-09-04 10:17:50

因此，如果您想要

仅数字：FindStr /R“^[0123-9]*$”
八进制：FindStr /R "^[0123-7]*$"
八
十六进制：FindStr /R "^[0123-9aAb-Cd-EfF]*$"
十六
不带重音符号的 alpha : FindStr /R "^[aAb-Cd-EfFg-Ij-NoOp-St-Uv-YzZ] *$"
字母数字 : FindStr /R "^[0123-9aAb-Cd-EfFg-Ij-NoOp-St-Uv-YzZ]*$"
字母数字

回复收藏 0 原文

沉默的熊 2024-09-04 10:17:50

这似乎是由于在正则表达式搜索中使用范围引起的。

该范围内的第一个字符不会出现这种情况。对于非范围根本不会发生。

> echo a | findstr /r "[A-C]"
> echo b | findstr /r "[A-C]"
    b
> echo c | findstr /r "[A-C]"
    c
> echo d | findstr /r "[A-C]"
> echo b | findstr /r "[B-C]"
> echo c | findstr /r "[B-C]"
    c

> echo a | findstr /r "[ABC]"
> echo b | findstr /r "[ABC]"
> echo c | findstr /r "[ABC]"
> echo d | findstr /r "[ABC]"
> echo b | findstr /r "[BC]"
> echo c | findstr /r "[BC]"

> echo A | findstr /r "[A-C]"
    A
> echo B | findstr /r "[A-C]"
    B
> echo C | findstr /r "[A-C]"
    C
> echo D | findstr /r "[A-C]"

根据 SS64 CMD FINDSTR 页面（令人惊叹的圆度的显示，参考这个问题），范围[AZ]：

...包括完整的英文字母，包括大写和小写（“a”除外），以及带有变音符号的非英语字母字符。

为了解决我的环境中的问题，我只是使用了特定的正则表达式（例如 [ABCD] 而不是 [AD]）。对于那些被允许的人来说，更明智的方法是下载 CygWin 或 GnuWin32 并使用其中一个包中的 grep 。

This appears to be caused by the use of ranges within regular expression searches.

It doesn't occur for the first character in the range. It doesn't occur at all for non-ranges.

> echo a | findstr /r "[A-C]"
> echo b | findstr /r "[A-C]"
    b
> echo c | findstr /r "[A-C]"
    c
> echo d | findstr /r "[A-C]"
> echo b | findstr /r "[B-C]"
> echo c | findstr /r "[B-C]"
    c

> echo a | findstr /r "[ABC]"
> echo b | findstr /r "[ABC]"
> echo c | findstr /r "[ABC]"
> echo d | findstr /r "[ABC]"
> echo b | findstr /r "[BC]"
> echo c | findstr /r "[BC]"

> echo A | findstr /r "[A-C]"
    A
> echo B | findstr /r "[A-C]"
    B
> echo C | findstr /r "[A-C]"
    C
> echo D | findstr /r "[A-C]"

According to the SS64 CMD FINDSTR page (which, in a stunning display of circularity, references this question), the range [A-Z]:

... includes the complete English alphabet, both upper and lower case (except for "a"), as well as non-English alpha characters with diacriticals.

To get around the problem in my environment, I simply used specific regular expressions (such as [ABCD] rather than [A-D]). A more sensible approach for those that are allowed would be to download CygWin or GnuWin32 and use grep from one of those packages.

回复收藏 0 原文