Help with regex in a Processing sketch?

Posted 2024-10-06 03:04:55


I am a beginning programmer trying to parse an HTML file in a Processing sketch. (Incidentally, if you don't know Processing, it compiles to Java and uses the same regex functions.) I have correctly captured the HTML file as a single String using SimpleML. The data I'm trying to capture comes from a table, like so:

    <th>Name</th>
    <th>John F. Kennedy</th>
    <th>Lyndon Johnson</th>
    <th>Richard Nixon</th>

etc.

I want to parse out the names of candidates into an array (dropping the "Name").

So I first tried

    candidates = match(rawString,"<th>.*</th>");

which returned the whole list.

Then I tried

    candidates = match(rawString,"<th>.{1,50}</th>");

which returns only

    <th>Name</th>

The Processing documentation says:

If there are groups (specified by sets of parentheses) in the regexp, then the contents of each will be returned in the array. Element [0] of a regexp match returns the entire matching string, and the match groups start at element [1] (the first group is [1], the second [2], and so on).

So now I've been trying various combinations of groups and quantifiers, like:

    candidates = match(rawString,"(<th>.{1,50}</th>)*");

But there must be some conceptual piece I'm not getting, because nothing is working. Seems like this should be easy, right?


浅蓝的眸勾画不出的柔情 2024-10-13 03:04:55


Parsing HTML with regular expressions is usually not a good idea, but you might get by with it here.

Your problem appears to be that .* matches greedily, i.e. it takes as many characters as possible, and so matches everything from the very first <th> to the very last </th> in your string.

Making it lazy, i.e. telling the quantifier to match as little as possible, is one solution:

    <th>.*?</th>

would probably work.
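To make the greedy/lazy difference concrete, here is a minimal plain-Java sketch using `java.util.regex`. The class name `GreedyVsLazy`, the helper `firstMatch`, and the sample string are made up for illustration. `Pattern.DOTALL` is set on the assumption that `.` is allowed to cross line breaks, which is consistent with the greedy pattern returning the whole list in the question.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GreedyVsLazy {
    // Returns the first match of regex in input, or null if nothing matches.
    // DOTALL lets '.' match line breaks too (an assumption for this sketch).
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex, Pattern.DOTALL).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String rawString =
            "<th>Name</th>\n<th>John F. Kennedy</th>\n<th>Lyndon Johnson</th>";

        // Greedy: .* runs to the end and backtracks to the LAST </th>,
        // so the single match spans all three cells.
        System.out.println(firstMatch("<th>.*</th>", rawString));

        // Lazy: .*? stops at the FIRST </th>, so only the first cell matches.
        System.out.println(firstMatch("<th>.*?</th>", rawString));
    }
}
```

Note that either way only the first match is returned; the lazy quantifier makes each match one cell wide, but you still need to ask for the remaining matches separately.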

A bit more stable and marginally faster: tell the engine exactly what it's allowed to match, for example:

    <th>[^<>]*</th>

[^<>] means "any character except angle brackets".
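As a rough sketch of how this pattern could get you the array you want: put a capturing group around the cell text and loop over every match instead of taking only the first. The code below is plain Java; the names `CandidateNames` and `parseCandidates` are hypothetical, and dropping the first cell assumes the "Name" header always comes first. In Processing itself, `matchAll()` should give you the same matches as a two-dimensional array.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CandidateNames {
    // Group 1 captures the cell text: any run of characters without angle brackets.
    static final Pattern TH_CELL = Pattern.compile("<th>([^<>]*)</th>");

    // Collects the text of every <th> cell, then drops the first one
    // (assumed here to be the "Name" header).
    static List<String> parseCandidates(String html) {
        List<String> cells = new ArrayList<>();
        Matcher m = TH_CELL.matcher(html);
        while (m.find()) {              // find() advances to the next match each call
            cells.add(m.group(1).trim());
        }
        if (cells.isEmpty()) {
            return cells;
        }
        return new ArrayList<>(cells.subList(1, cells.size()));
    }

    public static void main(String[] args) {
        String rawString = "<th>Name</th>\n"
            + "    <th>John F. Kennedy</th>\n"
            + "    <th>Lyndon Johnson</th>\n"
            + "    <th>Richard Nixon</th>\n";
        for (String name : parseCandidates(rawString)) {
            System.out.println(name);
        }
    }
}
```

This is also why `(<th>.{1,50}</th>)*` does not help: a group returns one captured substring per match, not one per repetition, so repeating the group inside a single match cannot produce a list.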

You will run into problems if you ever try to match nested structures with regular expressions. It can be done in modern regex flavors, but it's very hard to do right. Add HTML comments and strings to the mix (which might contain the very delimiters you're matching against) and you're in for a world of hurt.
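Two small cases show how quickly this goes wrong, again as a plain-Java sketch with made-up names (`NaivePatternLimits`, `cellTexts`) and made-up input strings: a cell that contains inline markup, and a cell that sits inside an HTML comment.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NaivePatternLimits {
    static final Pattern TH_CELL = Pattern.compile("<th>([^<>]*)</th>");

    // Returns the captured text of every <th> cell the naive pattern finds.
    static List<String> cellTexts(String html) {
        List<String> out = new ArrayList<>();
        Matcher m = TH_CELL.matcher(html);
        while (m.find()) {
            out.add(m.group(1));
        }
        return out;
    }

    public static void main(String[] args) {
        // Inline markup inside the cell: [^<>]* cannot cross the <b> tag,
        // so the cell is silently skipped.
        System.out.println(cellTexts("<th><b>Richard Nixon</b></th>"));

        // A commented-out cell: the regex has no idea what a comment is,
        // so "old" is reported alongside the real cell.
        System.out.println(cellTexts("<!-- <th>old</th> --><th>Richard Nixon</th>"));
    }
}
```

If your input can contain cases like these, an HTML parser is likely the safer tool.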
