How do I rewrite a stream of HTML tokens into a new document?
Suppose I have tokenized an HTML document. How can I transform it into a new document or apply other transformations to it?
For example, suppose I have this HTML:
<html>
<body>
<p><a href="/foo">text</a></p>
<p>Hello <span class="green">world</span></p>
</body>
</html>
What I have written so far is a tokenizer that outputs a stream of tokens. For this document the tokens would be (written in pseudocode):
TAG_OPEN[html] TAG_OPEN[body] TAG_OPEN[p] TAG_OPEN[a] TAG_ATTRIBUTE[href]
TAG_ATTRIBUTE_VALUE[/foo] TEXT[text] TAG_CLOSE[a] TAG_CLOSE[p]
TAG_OPEN[p] TEXT[Hello] TAG_OPEN[span] TAG_ATTRIBUTE[class]
TAG_ATTRIBUTE_VALUE[green] TEXT[world] TAG_CLOSE[span] TAG_CLOSE[p]
TAG_CLOSE[body] TAG_CLOSE[html]
But now I have no idea how to use this stream to implement transformations.
For example, I would like to rewrite the TAG_ATTRIBUTE_VALUE[/foo] that follows TAG_OPEN[a] TAG_ATTRIBUTE[href] to something else.
Another transformation I would like to do is output the value of the TAG_ATTRIBUTE[href] attribute in parentheses after the element opened by TAG_OPEN[a]. For example,
<a href="/foo">text</a>
gets rewritten into
<a href="/foo">text</a>(/foo)
What is the general strategy for doing such transformations? There are many other transformations I would like to do, like stripping all tags and just leaving TEXT content, adding tags after some specific tags, etc.
Do I need to create a parse tree? I have never done that and don't know how to build a parse tree from a stream of tokens. Or can I do it some other way?
Any suggestions are welcome.
And one more thing - I would like to learn all this parsing myself, so I am not looking for a library!
Thanks in advance, Boda Cydo
Comments (2)
If we can assume that the HTML is XML-compliant, then XSLT would be the way to go. But I am assuming that is out, as you seem to want to write your own parser (not sure why).
If you really want to write a parser (I'd write parse rules, not your own parser engine), take a look at ANTLR and MS Oslo.
There are various ways of parsing and traversing an XML/HTML tree. Perhaps I can point you to:
http://razorsharpcode.blogspot.com/2009/10/combined-pre-order-and-post-order-non.html
If you want to do pre-order or post-order manipulation of DOM elements, you can use the algorithm described there.
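To make the pre-order/post-order idea concrete, here is a minimal Python sketch. It is not the algorithm from the linked post (that one is non-recursive; this one uses plain recursion). It assumes the token names from the question, represented as hypothetical `(kind, value)` tuples, and it assumes well-formed input: every TAG_OPEN has a matching TAG_CLOSE, there are no void elements such as `<br>`, and nothing needs escaping. `Node`, `build_tree`, `walk`, `serialize_with_links`, and `text_only` are illustrative names, not part of any library.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    tag: str                                     # element name; "#text" for text, "#root" for the root
    attrs: dict = field(default_factory=dict)    # attribute name -> value
    children: list = field(default_factory=list)
    text: str = ""                               # only used by "#text" nodes


def build_tree(tokens):
    """Build a tree from (kind, value) tokens using a stack of open elements."""
    root = Node("#root")
    stack = [root]
    pending_attr = None
    for kind, value in tokens:
        if kind == "TAG_OPEN":
            node = Node(value)
            stack[-1].children.append(node)
            stack.append(node)                   # this element is now the innermost open one
        elif kind == "TAG_ATTRIBUTE":
            pending_attr = value
            stack[-1].attrs[value] = ""          # attribute without a value (yet)
        elif kind == "TAG_ATTRIBUTE_VALUE":
            stack[-1].attrs[pending_attr] = value
        elif kind == "TEXT":
            stack[-1].children.append(Node("#text", text=value))
        elif kind == "TAG_CLOSE":
            stack.pop()                          # assumes it matches the innermost open element
    return root


def walk(node, pre, post):
    """Call pre(node) before visiting the children and post(node) after them."""
    pre(node)
    for child in node.children:
        walk(child, pre, post)
    post(node)


def serialize_with_links(root):
    """Re-emit the markup, appending "(href)" after every closed <a> element."""
    out = []

    def pre(node):
        if node.tag == "#text":
            out.append(node.text)
        elif node.tag != "#root":
            attrs = "".join(f' {k}="{v}"' for k, v in node.attrs.items())
            out.append(f"<{node.tag}{attrs}>")

    def post(node):
        if node.tag in ("#text", "#root"):
            return
        out.append(f"</{node.tag}>")
        if node.tag == "a" and "href" in node.attrs:
            out.append(f"({node.attrs['href']})")

    walk(root, pre, post)
    return "".join(out)


def text_only(root):
    """Strip all tags and keep only the TEXT content."""
    parts = []
    walk(root,
         lambda n: parts.append(n.text) if n.tag == "#text" else None,
         lambda n: None)
    return "".join(parts)


tokens = [
    ("TAG_OPEN", "p"), ("TAG_OPEN", "a"), ("TAG_ATTRIBUTE", "href"),
    ("TAG_ATTRIBUTE_VALUE", "/foo"), ("TEXT", "text"),
    ("TAG_CLOSE", "a"), ("TAG_CLOSE", "p"),
]
root = build_tree(tokens)
print(serialize_with_links(root))   # <p><a href="/foo">text</a>(/foo)</p>
print(text_only(root))              # text
```

The pre-order hook runs when an element is entered, which is the natural place to emit or rewrite its opening tag and attributes. The post-order hook runs once all children are done, which is where output that belongs after the closing tag, such as the `(/foo)` suffix, can go. Rewriting an attribute value would amount to changing `node.attrs` in the tree before serializing.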