Using CSS selectors to collect HTML elements from a streaming parser (e.g. a SAX stream)
How can I parse a CSS (CSS3) selector and use it (in a jQuery-like way) to collect HTML elements not from the DOM (a tree structure) but from a stream (e.g. SAX), i.e. using a sequential-access, event-based parser?
By the way, are there any CSS selectors (or combinations of them) that need access to the DOM? (The Wikipedia SAX page says that XPath selectors "need to be able to access any node at any time in the parsed XML tree".)
I am most interested in implementing selector combinators, e.g. the 'A B' descendant selector.
I prefer solutions that describe the algorithm, or solutions in Perl (for HTML::Zoom).
Comments (3)
I would do it with regular expressions.
First, convert the selector into a regular expression that matches a simple top-to-bottom list of opening tags representing a given parser stack state. To explain, here are some simple selectors and their corresponding regexen:
A becomes /<A[^>]*>$/
A#someid becomes /<A[^>]*id="someid"[^>]*>$/
A.someclass becomes /<A[^>]*class="[^"]*(?<= |")someclass(?= |")[^"]*"[^>]*>$/
A > B becomes /<A[^>]*><B[^>]*>$/
A B becomes /<A[^>]*>(?:<[^>]*>)*<B[^>]*>$/
And so on. Note that the regular expressions all end with $, but do not start with ^; this corresponds with the way CSS selectors do not have to match from the root of the document. Also note that there is some lookbehind and lookahead stuff in the class matching code, which is necessary so that you don't accidentally match against "someclass-super-duper" when you want the quite distinct class "someclass".
If you need more examples, please let me know.
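To make the conversion concrete, here is a minimal sketch in Python (the same idea ports directly to Perl). The function names `simple_to_regex` and `selector_to_regex` are made up for illustration, and it only handles a tag name with an optional #id and .class, joined by the descendant and child combinators. It assumes attributes appear in alphabetical order in the state string, and it adds a lookahead after the tag name (not present in the patterns above) so that `a` does not also match `<abbr>`.

```python
import re

def simple_to_regex(simple: str) -> str:
    """Translate one compound selector ('a', 'a#id', 'a.cls') to a regex fragment."""
    m = re.fullmatch(r"([A-Za-z][\w-]*|\*)?(?:#([\w-]+))?(?:\.([\w-]+))?", simple)
    if not simple or not m:
        raise ValueError(f"unsupported selector: {simple!r}")
    tag, ident, cls = m.groups()
    name = re.escape(tag) if tag and tag != "*" else r"[^\s>]+"
    # The lookahead makes sure the tag name ends here (assumption, see lead-in).
    out = "<" + name + r"(?=[ >])"
    # Assumed canonical attribute order is alphabetical: class comes before id.
    if cls:
        out += r'[^>]*class="[^"]*(?<=[ "])' + re.escape(cls) + r'(?=[ "])[^"]*"'
    if ident:
        out += r'[^>]*id="' + re.escape(ident) + '"'
    return out + r"[^>]*>"

def selector_to_regex(selector: str) -> re.Pattern:
    """Join compound selectors with the descendant (' ') and child ('>') combinators."""
    parts = []
    child = False
    for tok in re.findall(r">|[^\s>]+", selector):
        if tok == ">":
            child = True
            continue
        if parts and not child:
            parts.append(r"(?:<[^>]*>)*")  # descendant: any number of tags in between
        parts.append(simple_to_regex(tok))
        child = False
    # Anchored at the end only: the selector may start anywhere below the root.
    return re.compile("".join(parts) + "$")

print(selector_to_regex("a b").pattern)
```

The compiled pattern is meant to be used with `search`, not `match`, because it must be free to start anywhere in the state string.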
Once you've constructed the selector regex, you're ready to begin parsing. As you parse, maintain a stack of tags which currently apply; update this stack whenever you descend or ascend. To check for selector matching, convert that stack to a list of tags which can match the regular expression. For example, consider this document:
Your stack state string would go through the following values in order as you enter each element:
<x>
<x><a>
<x><y id="boo">
<x><y id="boo"><z class="bar">
The matching process is simple: whenever the parser descends into a new element, update the state string and check if it matches the selector regex. If the regex matches, then the selector matches that element!
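The loop described above can be sketched with Python's standard `html.parser`, which is event-based in the same spirit as SAX. The class name `SelectorCollector` and the sample document are assumptions made for illustration; the document was chosen only so that it produces the four state strings listed above. The sketch also assumes every element is explicitly closed, so it does not handle void elements such as `<br>`.

```python
import re
from html import escape
from html.parser import HTMLParser

class SelectorCollector(HTMLParser):
    """Keeps a stack of rendered opening tags and tests it on every descent."""

    def __init__(self, pattern):
        super().__init__()
        self.pattern = pattern
        self.stack = []    # one rendered opening tag per currently open element
        self.states = []   # every state string seen, in document order
        self.matches = []  # state strings at which the selector matched

    def handle_starttag(self, tag, attrs):
        # Render attributes in alphabetical order with entity-encoded values.
        rendered = "".join(
            f' {name}="{escape(value or "")}"'
            for name, value in sorted(attrs, key=lambda kv: kv[0])
        )
        self.stack.append(f"<{tag}{rendered}>")
        state = "".join(self.stack)
        self.states.append(state)
        if self.pattern.search(state):
            self.matches.append(state)

    def handle_endtag(self, tag):
        # Assumes balanced markup: each end tag closes the innermost element.
        if self.stack:
            self.stack.pop()

# Regex for the descendant selector 'y z', written in the style shown earlier.
collector = SelectorCollector(re.compile(r"<y[^>]*>(?:<[^>]*>)*<z[^>]*>$"))
collector.feed('<x><a>Stuff</a><y id="boo"><z class="bar">More</z></y></x>')
print(collector.matches)
```

In a real collector you would record the element itself (or start capturing its events) at the point where the match is detected, rather than the state string.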
Issues to watch out for:
Double quotes inside attributes. To get around this, apply HTML entity encoding to attribute values both when creating the regex and when creating the stack state string.
Attribute order. When building both the regex and the state string, use some canonical order for the attributes (alphabetical is easiest). Otherwise, you might find that your regex for the selector a#someid.someclass, which expects <a id="someid" class="someclass">, unfortunately fails when your parser goes into <a class="someclass" id="someid">.
Case sensitivity. According to the HTML spec, the class and id attributes match case-sensitively (notice the 'CS' marker on the corresponding sections). So, you must use case-sensitive regex matching. However, in HTML, element names are not case sensitive, although they are in XML. If you want HTML-like case-insensitive element name matching, then canonicalize element names to either upper case or lower case in both the selector regex and the state stack string.
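The three precautions above can all live in the one function that renders a tag for the state string. A small sketch follows; `render_tag` is a hypothetical helper name, and lowercasing attribute names is an extra assumption on top of what the answer describes.

```python
from html import escape

def render_tag(name, attrs):
    """Render an opening tag in a canonical form suitable for regex matching."""
    parts = [name.lower()]  # element names: case-insensitive, so normalize
    # Sort by (lowercased) attribute name so the order is predictable.
    for key, value in sorted((k.lower(), v) for k, v in attrs.items()):
        # Entity-encode values so they can never contain a raw '"' or '>'.
        parts.append(f'{key}="{escape(value, quote=True)}"')
    return "<" + " ".join(parts) + ">"

print(render_tag("A", {"id": "someid", "class": "someclass"}))
print(render_tag("p", {"title": 'say "hi" > now'}))
```

Attribute values keep their original case, since class and id are matched case-sensitively.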
Additional magic is necessary to deal with the selector patterns that involve the presence or absence of element siblings, namely A:first-child and A + B. You might accomplish these by adding a special attribute to the tag containing the name of the tag immediately prior, or "" if this tag is the first child. There's also the general sibling selector, A ~ B; I'm not quite sure how to deal with that one.
EDIT: If you dislike regular expression hackery, you can still use this approach to solve the problem, only using your own state machine instead of the regex engine. Specifically, a CSS selector can be implemented as a nondeterministic finite state machine, which is an intimidating-sounding term, but just means the following in practical terms:
The secret behind nearly all of the awesomeness of regular expressions is in their use of this style of state machine.
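One possible reading of the state-machine idea, sketched in Python: keep a set of "how many selector steps have matched so far" and advance every member of that set on each element of the stack. The names `parse_steps` and `matches` are invented for this sketch, and it only covers tag names joined by the descendant and child combinators; it is not taken from the answer.

```python
import re

def parse_steps(selector):
    """Split 'a > b c' into [(None, 'a'), ('>', 'b'), (' ', 'c')]."""
    steps, combinator = [], None
    for tok in re.findall(r">|[^\s>]+", selector):
        if tok == ">":
            combinator = ">"
            continue
        steps.append((combinator, tok.lower()))
        combinator = " "  # default for the next step unless '>' is seen
    return steps

def matches(steps, stack):
    """Simulate the NFA over the open-element stack, root first."""
    n = len(steps)
    active = {0}  # state i means: steps[0..i-1] matched, waiting for steps[i]
    for tag in stack:
        following = set()
        for i in active:
            if i == n:
                continue  # selector finished on an ancestor: this thread dies
            combinator, wanted = steps[i]
            if wanted == "*" or wanted == tag:
                following.add(i + 1)  # consume this element as steps[i]
            if i == 0 or combinator == " ":
                following.add(i)      # skip this element and keep waiting
        active = following
    # Accept only if the final step was matched by the innermost element.
    return n in active

print(matches(parse_steps("a b"), ["x", "a", "y", "b"]))
print(matches(parse_steps("a > b"), ["x", "a", "y", "b"]))
```

Being in several states at once is what makes the machine nondeterministic: after seeing `a`, the matcher is both "still looking for `a`" and "looking for `b`". In a streaming parser the active set can be stored per stack level, so each new element costs only one transition instead of a rescan of the whole stack.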
Check out Nokogiri. From their page:
It's in Ruby, but you said you wanted an algorithm and Ruby is great for reading. Or just call it from whatever you are working in.
What does a browser do to create the DOM out of a stream? I guess there lies the answer to your question because it must be storing the discovered elements in a form which facilitates a CSS selector query. If you can afford reading the source code for an open source browser parser, then I think you can reuse it.
I would not do so, honestly. Rather, I would reuse an existing SAX-based parser (or maybe rewrite one in Perl) and let it go through the entire string.
When handlers fire, use them to construct an in-memory database of elements. Create a virtual "table" with a row for every element, holding its #number (for references), tagName, parent #number, next #number, and the character offset of its opening tag in the source string.
As well, create a table for every attribute ever found, and fill it with a record for each tag with that attribute value.
Now it's all about the process of creating a database, tables and indices.
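As a rough illustration of that table idea, here is a sketch using Python's `html.parser` as the event source. The class name `ElementTable` and the row layout are assumptions; "next" is interpreted as the next sibling, and the position is stored as the (line, column) pair reported by `getpos()` rather than a raw character offset. Note that this keeps a record for every element, so it trades the memory savings of streaming for simpler queries.

```python
from collections import defaultdict
from html.parser import HTMLParser

class ElementTable(HTMLParser):
    """Builds flat, queryable tables of elements and attributes from parser events."""

    def __init__(self):
        super().__init__()
        self.rows = []                     # element table; list index == element number
        self.by_attr = defaultdict(list)   # (attribute name, value) -> element numbers
        self._open = []                    # numbers of the currently open elements
        self._last_child = {}              # parent number -> number of its latest child

    def handle_starttag(self, tag, attrs):
        number = len(self.rows)
        parent = self._open[-1] if self._open else None
        previous = self._last_child.get(parent)
        if previous is not None:
            self.rows[previous]["next"] = number  # link the previous sibling to us
        self._last_child[parent] = number
        self.rows.append({"number": number, "tag": tag, "parent": parent,
                          "next": None, "pos": self.getpos()})
        for name, value in attrs:
            self.by_attr[(name, value)].append(number)
        self._open.append(number)

    def handle_endtag(self, tag):
        # Assumes balanced markup, as in the other sketches.
        if self._open:
            self._open.pop()

table = ElementTable()
table.feed('<x><a>Stuff</a><y id="boo"><z class="bar">More</z></y></x>')

# 'x > y' becomes a lookup on the parent column.
children_of_x = [r["number"] for r in table.rows
                 if r["tag"] == "y" and r["parent"] is not None
                 and table.rows[r["parent"]]["tag"] == "x"]
print(children_of_x, table.by_attr[("id", "boo")])
```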