Regular expression to find unterminated strings

Posted 2024-09-03 02:21:33 · 316 words · 4 views · 0 comments

I need to search for lines in a CSV file that end in an unterminated, double-quoted string.

For example:

1,2,a,b,"dog","rabbit

would match whereas

1,2,a,b,"dog","rabbit","cat bird"
1,2,a,b,"dog",rabbit

would not.

I have very limited experience with regular expressions, and the only thing I could think of is something like

"[^"]*$

However, that matches from the last quote to the end of the line, even when that quote closes a terminated string.

How would this be done?

Comments (4)

萌无敌 2024-09-10 02:21:33

Assuming quotes can't be escaped, you need to test the parity of quotes (making sure that there's an even number of them instead of odd). Regular expressions are great for that:

^(([^"]*"){2})*[^"]*$

That will match all lines with an even number of quotes. You can invert the result for all strings with an odd number. Or you can just add another ([^"]*") part at the beginning:

^[^"]*"(([^"]*"){2})*[^"]*$

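A simpler-looking variant that uses . in place of [^"] is tempting, even with reluctant quantifiers (.*?) instead of greedy ones, but it does not test parity. Because . also matches a quote, backtracking lets the first expression below match every line and the second match every line that contains at least one quote. Prefer the [^"] versions above.

^((.*?"){2})*.*?$         #intended: even
^.*?"((.*?"){2})*.*?$      #intended: odd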

Now, if quotes can be escaped, it's a different question entirely, but the approach would be similar: determine the parity of unescaped quotes.
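As a quick check, the odd-parity expression from this answer can be run against the three lines from the question. This is a minimal Python sketch under the same assumption that quotes cannot be escaped; the helper name has_unterminated_quote is made up for illustration.

```python
import re

# Odd-parity pattern from the answer: one quote, then zero or more pairs of
# quotes, with quote-free runs in between. Assumes quotes are never escaped.
ODD_QUOTES = re.compile(r'^[^"]*"(([^"]*"){2})*[^"]*$')


def has_unterminated_quote(line: str) -> bool:
    """Return True when the line contains an odd number of double quotes."""
    return ODD_QUOTES.match(line) is not None


samples = [
    '1,2,a,b,"dog","rabbit',               # three quotes: unterminated
    '1,2,a,b,"dog","rabbit","cat bird"',   # six quotes: balanced
    '1,2,a,b,"dog",rabbit',                # two quotes: balanced
]

results = [has_unterminated_quote(s) for s in samples]
for sample, flagged in zip(samples, results):
    print(flagged, sample)
```

Only the first sample is reported as unterminated.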

第几種人 2024-09-10 02:21:33

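Assuming that the strings cannot contain ", you need to match a line that has an odd number of quotes, like this:

([^"]*("[^"]*")?)*"

This matches zero or more groups, each an unquoted run optionally followed by a complete quoted string, and then one more quote that has no partner. As written the pattern is unanchored, so used as a search it finds any line that contains a quote; anchor it with ^ at the start and [^"]*$ at the end to test the whole line.

Note that the nested quantifiers make it vulnerable to catastrophic backtracking (a regular-expression denial of service, or ReDoS) in backtracking engines.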
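If backtracking cost is a concern, the same parity test can be done without a regular expression at all. The sketch below swaps in a plain character count. It is not the pattern from this answer, and it still assumes quotes cannot be escaped. The names has_odd_quote_count and unterminated_lines are hypothetical.

```python
from typing import Iterable, Iterator, Tuple


def has_odd_quote_count(line: str) -> bool:
    """Linear-time parity test: True when the line has an odd number of quotes."""
    return line.count('"') % 2 == 1


def unterminated_lines(lines: Iterable[str]) -> Iterator[Tuple[int, str]]:
    """Yield (line_number, line) for every line with an odd number of quotes."""
    for number, line in enumerate(lines, start=1):
        if has_odd_quote_count(line):
            yield number, line.rstrip("\n")


sample = [
    '1,2,a,b,"dog","rabbit\n',
    '1,2,a,b,"dog","rabbit","cat bird"\n',
    '1,2,a,b,"dog",rabbit\n',
]

flagged = list(unterminated_lines(sample))
print(flagged)
```

Because an open file object iterates over its lines, passing one to unterminated_lines would scan a CSV file in the same way.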

养猫人 2024-09-10 02:21:33

Try this one:

".+[^"](,|$)

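This matches a quote (anywhere in the line), followed greedily by any characters, then a character that is not a quote, immediately before a comma or the end of the line.

The intended effect is that it only matches lines with dangling quoted strings. Note, however, that it also matches a line whose last field is unquoted when an earlier field is quoted, such as 1,2,a,b,"dog",rabbit from the question, so on its own it does not reliably test for an unterminated quote.

I think it's even immune to 'nested expandos attacks' (we do live in a very dangerous world ...)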

北城孤痞 2024-09-10 02:21:33

To avoid "nested expandos":

egrep -v '^[^"]*("[^"]*"[^"]*)*[^"]*$' my_file