Collecting HTML elements from a stream parser (e.g. a SAX stream) using CSS selectors

Posted 2024-10-11 13:35:16

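How can I parse a CSS (CSS3) selector and use it (in a jQuery-like way) to collect HTML elements not from a DOM (a tree structure) but from a stream (e.g. SAX), i.e. using a sequential-access, event-based parser?

By the way, are there any CSS selectors (or combinations of them) that need access to the DOM? (The Wikipedia SAX page says that XPath selectors "need to be able to access any node at any time in the parsed XML tree".)

I am most interested in implementing selector combinators, e.g. the "A B" descendant selector.

I prefer answers that describe an algorithm, or solutions in Perl (for HTML::Zoom).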

梦魇绽荼蘼 2024-10-18 13:35:16

I would do it with regular expressions.

First, convert the selector into a regular expression that matches a simple top-to-bottom list of opening tags representing a given parser stack state. To explain, here are some simple selectors and their corresponding regexen:

A becomes /<A[^>]*>$/
A#someid becomes /<A[^>]*id="someid"[^>]*>$/
A.someclass becomes /<A[^>]*class="[^"]*(?<= |")someclass(?= |")[^"]*"[^>]*>$/
A > B becomes /<A[^>]*><B[^>]*>$/
A B becomes /<A[^>]*>(?:<[^>]*>)*<B[^>]*>$/

And so on. Note that the regular expressions all end with $, but do not start with ^; this corresponds with the way CSS selectors do not have to match from the root of the document. Also note that there is some lookbehind and lookahead stuff in the class matching code, which is necessary so that you don't accidentally match against "someclass-super-duper" when you want the quite distinct class "someclass".

If you need more examples, please let me know.
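
To make the translation above concrete, here is a minimal Python sketch (the thread asks for Perl; Python is used here only for illustration). The function names `simple_to_regex` and `selector_to_regex` are hypothetical, and the sketch only handles `tag`, `tag#id`, `tag.class`, and the `>` and descendant combinators. It also adds two small assumptions that are not in the patterns above: a `(?=[ >])` lookahead so that `a` does not also match `<abbr>`, and a leading space before `class=` and `id=` so that `id` does not match inside `data-id`.

```python
import re

def simple_to_regex(simple: str) -> str:
    """Translate one simple selector ('a', 'a#x', 'a.y', 'a#x.y') into a tag pattern."""
    m = re.fullmatch(r'([A-Za-z][\w-]*)(?:#([\w-]+))?(?:\.([\w-]+))?', simple)
    if m is None:
        raise ValueError(f"unsupported selector: {simple!r}")
    name, ident, cls = m.groups()
    # (?=[ >]) stops 'a' from also matching '<abbr>'.
    pattern = '<' + re.escape(name) + r'(?=[ >])[^>]*'
    # Assumes attributes are serialized alphabetically, so class comes before id.
    if cls:
        pattern += r' class="[^"]*(?<=[ "])' + re.escape(cls) + r'(?=[ "])[^"]*"[^>]*'
    if ident:
        pattern += ' id="' + re.escape(ident) + '"[^>]*'
    return pattern + '>'

def selector_to_regex(selector: str) -> re.Pattern:
    """Join simple selectors with the child ('>') or descendant (space) combinator."""
    parts = []
    child = False
    for token in selector.replace('>', ' > ').split():
        if token == '>':
            child = True
            continue
        if parts and not child:
            # Descendant combinator: any number of intermediate opening tags.
            parts.append(r'(?:<[^>]*>)*')
        parts.append(simple_to_regex(token))
        child = False
    # Anchored at the end only: the selector need not start at the document root.
    return re.compile(''.join(parts) + '$')

print(selector_to_regex('a b').pattern)
```

The resulting pattern is meant to be used with `search` against the state string described below, never with a start anchor.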

Once you've constructed the selector regex, you're ready to begin parsing. As you parse, maintain a stack of tags which currently apply; update this stack whenever you descend or ascend. To check for selector matching, convert that stack to a list of tags which can match the regular expression. For example, consider this document:

<x><a>Stuff goes here</a><y id="boo"><z class="bar">Content here</z></y></x>

Your stack state string would go through the following values in order as you enter each element:

<x>
<x><a>
<x><y id="boo">
<x><y id="boo"><z class="bar">

The matching process is simple: whenever the parser descends into a new element, update the state string and check if it matches the selector regex. If the regex matches, then the selector matches that element!
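
As a rough illustration of that loop, the sketch below uses Python's standard `html.parser.HTMLParser` as the event-based parser. The class name `SelectorCollector` is made up for this example, the regex is hand-written for the selector `x z`, and the code assumes well-formed input in which every element is explicitly closed (void elements such as `<br>` are not handled).

```python
import re
from html import escape
from html.parser import HTMLParser

class SelectorCollector(HTMLParser):
    """Records the state string of every element the selector regex matches."""

    def __init__(self, selector_regex):
        super().__init__()
        self.selector_regex = selector_regex
        self.stack = []     # serialized opening tags of the open elements, root first
        self.states = []    # every state string seen, kept only for illustration
        self.matches = []

    def handle_starttag(self, tag, attrs):
        # Serialize attributes alphabetically and entity-encode their values.
        text = ''.join(
            ' {}="{}"'.format(name, escape(value or '', quote=True))
            for name, value in sorted(attrs, key=lambda pair: pair[0])
        )
        self.stack.append('<{}{}>'.format(tag, text))
        state = ''.join(self.stack)
        self.states.append(state)
        if self.selector_regex.search(state):
            self.matches.append(state)

    def handle_endtag(self, tag):
        # Ascending: drop the innermost open element.
        if self.stack:
            self.stack.pop()

# Hand-written regex for the selector "x z" (descendant combinator).
collector = SelectorCollector(re.compile(r'<x[^>]*>(?:<[^>]*>)*<z[^>]*>$'))
collector.feed('<x><a>Stuff goes here</a><y id="boo"><z class="bar">Content here</z></y></x>')
collector.close()
print(collector.states)
print(collector.matches)
```

A real collector would store the element (or start capturing its events) instead of the state string; the string is kept here only so the stack states are visible.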

Issues to watch out for:

Double quotes inside attributes. To get around this, apply HTML entity encoding to attribute values both when creating the regex and when creating the stack state string. The same goes for a literal > inside an attribute value, since the patterns rely on [^>]* to stay within one tag.
Attribute order. When building both the regex and the state string, use some canonical order for the attributes (alphabetical is easiest). Otherwise, you might find that your regex for the selector a#someid.someclass which expects <a id="someid" class="someclass"> unfortunately fails when your parser goes into <a class="someclass" id="someid">.
Case sensitivity. According to the HTML spec, the class and id attributes match case sensitively (notice the 'CS' marker on the corresponding sections). So, you must use case-sensitive regex matching. However, in HTML, element names are not case sensitive, although they are in XML. If you want HTML-like case-insensitive element name matching, then canonicalize element names to either upper case or lower case in both the selector regex and the state stack string.
Additional magic is necessary to deal with the selector patterns that involve presence or absence of element siblings, namely A:first-child and A + B. You might accomplish these by adding a special attribute to the tag containing the name of the tag immediately prior, or "" if this tag is the first child. There's also the general sibling selector, A ~ B; I'm not quite sure how to deal with that one.
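
The points above can be folded into a single tag serializer. The following is only a sketch under stated assumptions: `serialize_tag` and the synthetic `_prev` attribute are hypothetical names, `_prev` is assumed not to collide with a real attribute, and the caller is assumed to track the previous sibling's tag name (for example by remembering the last child closed under each open element).

```python
import re
from html import escape

def serialize_tag(name, attrs, prev_sibling=""):
    """Render an opening tag in one canonical form for the state string."""
    # Attribute names are lowercased; values keep their case (class and id are case-sensitive).
    merged = {key.lower(): value for key, value in attrs.items()}
    # Hypothetical synthetic attribute: previous sibling's tag name, "" for a first child.
    merged["_prev"] = prev_sibling.lower()
    body = "".join(
        ' {}="{}"'.format(key, escape(merged[key], quote=True))  # encodes ", >, <, &
        for key in sorted(merged)                                 # canonical order
    )
    return "<{}{}>".format(name.lower(), body)

first = serialize_tag("A", {"id": "someid", "class": "someclass"})
second = serialize_tag("b", {"title": 'say "hi" > bye'}, prev_sibling="a")
print(first)
print(second)

# With the synthetic attribute in place, sibling-aware selectors stay plain regexes.
FIRST_CHILD_A = re.compile(r'<a _prev=""[^>]*>$')   # A:first-child
A_PLUS_B = re.compile(r'<b _prev="a"[^>]*>$')       # A + B
```

Because `_` sorts before lowercase letters, `_prev` always comes first, which keeps the sibling patterns short. A ~ B would need more than the immediately preceding tag name, for instance the names of all earlier siblings, which this sketch does not attempt.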

EDIT: If you dislike regular expression hackery, you can still use this approach to solve the problem, only using your own state machine instead of the regex engine. Specifically, a CSS selector can be implemented as a nondeterministic finite state machine, which is an intimidating-sounding term, but just means the following in practical terms:

There might be more than one possible transition from any given state
The machine tries one of them, and if that doesn't work out, then it backtracks and tries the other
The easiest way to implement this is to keep a stack for the machine, which you push onto whenever you follow a path and pop from whenever you need to backtrack. It comes down to the same sort of thing you'd use to do a depth-first search.

The secret behind nearly all of the awesomeness of regular expressions is in their use of this style of state machine.
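
For readers who prefer to see the state machine spelled out, here is a small backtracking matcher in Python. It is a sketch, not a full selector engine: the `steps` representation and the function name `matches` are assumptions made for this example, only tag names and the child and descendant combinators are supported, and the stack holds plain tag names rather than serialized tags.

```python
def matches(steps, stack):
    """Return True if the selector matches the innermost element of the stack.

    steps: list of (combinator, tag); the combinator is None for the first step,
           '>' for child and ' ' for descendant.
    stack: tag names of the currently open elements, root first.
    """
    last_step, last_pos = len(steps) - 1, len(stack) - 1
    # Each pending entry is a choice point: step i is assumed to match stack[j].
    # The first step may start anywhere, since selectors are not anchored at the root.
    pending = [(0, j) for j, tag in enumerate(stack) if tag == steps[0][1]]
    while pending:
        i, j = pending.pop()              # backtrack to the most recent untried choice
        if i == last_step:
            if j == last_pos:             # the final step must match the innermost element
                return True
            continue
        combinator, tag = steps[i + 1]
        # Child: only the next stack entry. Descendant: any deeper entry.
        candidates = [j + 1] if combinator == '>' else range(j + 1, len(stack))
        for k in candidates:
            if k <= last_pos and stack[k] == tag:
                pending.append((i + 1, k))
    return False

# "a > b c" against a stack where the first 'a' is a dead end and the second one works.
print(matches([(None, 'a'), ('>', 'b'), (' ', 'c')], ['a', 'x', 'a', 'b', 'c']))
```

The `pending` list plays the role of the backtracking stack described above: every alternative is pushed, and a failed path is abandoned simply by popping the next one.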

灯角 2024-10-18 13:35:16

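Check out nokogiri. From their page:

"Nokogiri is an HTML, XML, SAX, and Reader parser. Among Nokogiri's many features is the ability to search documents via XPath or CSS3 selectors."

It's written in Ruby, but you said you wanted an algorithm, and Ruby is great for reading. Or just call it from whatever you're using.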

怎会甘心 2024-10-18 13:35:16

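How do browsers create a DOM from a stream? I think that is the answer to your question, because a browser has to store the elements it discovers in a form that lends itself to CSS selector queries. If you are able to read the source of an open-source browser's parser, then I think you could reuse it.

Honestly, I wouldn't do it that way. Instead, I would reuse an existing SAX-based parser (or perhaps rewrite one in Perl) and walk through the entire string.
As the handlers fire, use them to build an in-memory database of the elements. Create a virtual "table" of elements, with each row holding a #number [for referencing], the tagName, the parent's #number, the next element's #number, and the character offset of its start tag in the original source string.
Also, create a table for each attribute found, and fill it with a record for every tag that has a value for that attribute.

From there, it comes down to the process of building the database, its tables, and its indexes.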
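
A minimal sketch of such tables, again using Python's standard `html.parser.HTMLParser` as a stand-in for a SAX parser. All names here (`TableBuilder`, `has_ancestor`, the row keys) are invented for illustration. "next" is interpreted as the next sibling, the position stored is whatever `getpos()` reports (a line and column pair) when the handler fires rather than an absolute character offset, and the input is assumed to be well-formed with every element explicitly closed.

```python
from collections import defaultdict
from html.parser import HTMLParser

class TableBuilder(HTMLParser):
    """Builds an element table and a per-attribute index from parser events."""

    def __init__(self):
        super().__init__()
        self.elements = []        # row index doubles as the element #number
        self.attr_index = defaultdict(lambda: defaultdict(list))  # attr -> value -> numbers
        self.open = []            # numbers of the currently open elements
        self.last_child = {}      # parent number -> number of its latest child

    def handle_starttag(self, tag, attrs):
        number = len(self.elements)
        parent = self.open[-1] if self.open else None
        self.elements.append({"number": number, "tag": tag, "parent": parent,
                              "next": None, "pos": self.getpos()})
        # Link the previous sibling (if any) to this element.
        previous = self.last_child.get(parent)
        if previous is not None:
            self.elements[previous]["next"] = number
        self.last_child[parent] = number
        for name, value in attrs:
            self.attr_index[name][value].append(number)
        self.open.append(number)

    def handle_endtag(self, tag):
        if self.open:
            self.open.pop()

def has_ancestor(elements, number, tag):
    """Walk the parent links to answer a descendant query such as 'y z'."""
    parent = elements[number]["parent"]
    while parent is not None:
        if elements[parent]["tag"] == tag:
            return True
        parent = elements[parent]["parent"]
    return False

builder = TableBuilder()
builder.feed('<x><a>Stuff goes here</a><y id="boo"><z class="bar">Content here</z></y></x>')
builder.close()
print([row["tag"] for row in builder.elements])
print(builder.attr_index["id"]["boo"])
```

Note that this approach keeps a record for every element in memory, so it trades the streaming property for simpler queries.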