How do I rewrite an HTML token stream into a new document?

Posted on 2024-08-22 23:22:27

Suppose I have tokenized an HTML document. How can I transform it into a new document, or apply other transformations to it?

For example, suppose I have this HTML:

<html>
 <body>
  <p><a href="/foo">text</a></p>
  <p>Hello <span class="green">world</span></p>
 </body>
</html>

So far I have written a tokenizer that outputs a stream of tokens. For this document the tokens would be (written in pseudocode):

TAG_OPEN[html] TAG_OPEN[body] TAG_OPEN[p] TAG_OPEN[a] TAG_ATTRIBUTE[href]
TAG_ATTRIBUTE_VALUE[/foo] TEXT[text] TAG_CLOSE[a] TAG_CLOSE[p]
TAG_OPEN[p] TEXT[Hello] TAG_OPEN[span] TAG_ATTRIBUTE[class]
TAG_ATTRIBUTE_VALUE[green] TEXT[world] TAG_CLOSE[span] TAG_CLOSE[p]
TAG_CLOSE[body] TAG_CLOSE[html]

But now I have no idea how to use this stream to perform transformations.

For example, I would like to rewrite TAG_ATTRIBUTE_VALUE[/foo] in TAG_OPEN[a] TAG_ATTRIBUTE[href] to something else.
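One possible approach for this kind of rewrite is a stream filter: read tokens one at a time, remember which tag and attribute you are inside, and pass everything else through untouched. The sketch below assumes tokens are (kind, value) pairs using the kind names from the pseudocode above; the names Token and rewrite_href are made up for illustration.

```python
from typing import Callable, Iterable, Iterator, Tuple

Token = Tuple[str, str]  # assumed shape: (kind, value), e.g. ("TAG_OPEN", "a")


def rewrite_href(tokens: Iterable[Token],
                 rewrite: Callable[[str], str]) -> Iterator[Token]:
    """Pass tokens through, rewriting only href values that belong to an <a> tag."""
    current_tag = None   # tag whose attribute list we are currently reading
    current_attr = None  # attribute name waiting for its value
    for kind, value in tokens:
        if kind == "TAG_OPEN":
            current_tag, current_attr = value, None
        elif kind == "TAG_ATTRIBUTE":
            current_attr = value
        elif kind == "TAG_ATTRIBUTE_VALUE":
            if current_tag == "a" and current_attr == "href":
                value = rewrite(value)
            current_attr = None
        else:
            # TEXT or TAG_CLOSE: the attribute list of the open tag has ended
            current_tag = current_attr = None
        yield kind, value


tokens = [
    ("TAG_OPEN", "p"), ("TAG_OPEN", "a"), ("TAG_ATTRIBUTE", "href"),
    ("TAG_ATTRIBUTE_VALUE", "/foo"), ("TEXT", "text"),
    ("TAG_CLOSE", "a"), ("TAG_CLOSE", "p"),
]
result = list(rewrite_href(tokens, lambda url: "/bar" + url))
print(result[3])
```

Because the filter is a generator, several such filters can be chained, each consuming the output of the previous one.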

Another transformation I would like to do is to output the value of the TAG_ATTRIBUTE[href] attribute in parentheses after the a element, for example,

<a href="/foo">text</a>

gets rewritten into

<a href="/foo">text</a>(/foo)
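This transformation can also be sketched without a parse tree: remember the href while reading the opening tag, and emit an extra TEXT token right after the matching TAG_CLOSE[a]. The code below is only a sketch under assumptions: tokens are (kind, value) pairs, a elements are not nested, and the serializer does no escaping. The function names are hypothetical.

```python
def append_href_after_link(tokens):
    """Yield every token, plus TEXT "(href)" after each closing </a>."""
    current_tag = None   # tag whose attribute list we are reading
    current_attr = None  # attribute name waiting for its value
    pending_href = None  # href of the <a> element we are inside, if any
    for kind, value in tokens:
        yield kind, value
        if kind == "TAG_OPEN":
            current_tag, current_attr = value, None
            if value == "a":
                pending_href = None
        elif kind == "TAG_ATTRIBUTE":
            current_attr = value
        elif kind == "TAG_ATTRIBUTE_VALUE":
            if current_tag == "a" and current_attr == "href":
                pending_href = value
            current_attr = None
        else:
            current_tag = None
            if kind == "TAG_CLOSE" and value == "a" and pending_href is not None:
                yield "TEXT", f"({pending_href})"
                pending_href = None


def serialize(tokens):
    """Turn a token stream back into markup (no escaping, for illustration only)."""
    out = []
    in_open_tag = False  # True while an opening tag still lacks its ">"
    for kind, value in tokens:
        if kind == "TAG_ATTRIBUTE":
            out.append(f" {value}")
            continue
        if kind == "TAG_ATTRIBUTE_VALUE":
            out.append(f'="{value}"')
            continue
        if in_open_tag:
            out.append(">")
            in_open_tag = False
        if kind == "TAG_OPEN":
            out.append(f"<{value}")
            in_open_tag = True
        elif kind == "TAG_CLOSE":
            out.append(f"</{value}>")
        else:
            out.append(value)
    if in_open_tag:
        out.append(">")
    return "".join(out)


tokens = [
    ("TAG_OPEN", "a"), ("TAG_ATTRIBUTE", "href"),
    ("TAG_ATTRIBUTE_VALUE", "/foo"), ("TEXT", "text"), ("TAG_CLOSE", "a"),
]
print(serialize(append_href_after_link(tokens)))  # <a href="/foo">text</a>(/foo)
```

The serializer has to delay the closing ">" of an opening tag because the token stream has no explicit "end of opening tag" token.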

What is the general strategy for doing such transformations? There are many other transformations I would like to do, like stripping all tags and just leaving TEXT content, adding tags after some specific tags, etc.
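Stripping all tags is probably the simplest case, since it needs no state at all: keep the TEXT tokens and drop everything else. A minimal sketch, assuming (kind, value) token pairs and assuming the tokenizer keeps whitespace inside TEXT values (the pseudocode above does not say whether it does):

```python
def strip_tags(tokens):
    """Keep only TEXT tokens; tags and attributes are dropped."""
    return (token for token in tokens if token[0] == "TEXT")


tokens = [
    ("TAG_OPEN", "p"), ("TEXT", "Hello "), ("TAG_OPEN", "span"),
    ("TAG_ATTRIBUTE", "class"), ("TAG_ATTRIBUTE_VALUE", "green"),
    ("TEXT", "world"), ("TAG_CLOSE", "span"), ("TAG_CLOSE", "p"),
]
text = "".join(value for _, value in strip_tags(tokens))
print(text)  # Hello world
```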

Do I need to create a parse tree? I have never done that and don't know how to build a parse tree from a stream of tokens. Or can I do it some other way?
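For reference, a common way to build a tree from such a stream is to keep a stack of currently open elements: TAG_OPEN pushes a new node, TAG_CLOSE pops it, and TEXT is attached to the node on top. The sketch below assumes (kind, value) token pairs and strictly well-formed input; real HTML has void elements such as br and optional end tags, which this does not handle. Element and build_tree are illustrative names.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Union


@dataclass
class Element:
    tag: str
    attrs: Dict[str, str] = field(default_factory=dict)
    children: List[Union["Element", str]] = field(default_factory=list)


def build_tree(tokens) -> Element:
    """Build a tree from (kind, value) tokens using a stack of open elements."""
    root = Element("#document")  # artificial root that holds top-level nodes
    stack = [root]
    attr = None  # attribute name waiting for its value
    for kind, value in tokens:
        if kind == "TAG_OPEN":
            node = Element(value)
            stack[-1].children.append(node)
            stack.append(node)
            attr = None
        elif kind == "TAG_ATTRIBUTE":
            attr = value
            stack[-1].attrs[attr] = ""  # stays empty if no value token follows
        elif kind == "TAG_ATTRIBUTE_VALUE":
            if attr is not None:
                stack[-1].attrs[attr] = value
            attr = None
        elif kind == "TAG_CLOSE":
            if len(stack) == 1 or stack[-1].tag != value:
                raise ValueError(f"unexpected closing tag: {value}")
            stack.pop()
        else:  # TEXT
            stack[-1].children.append(value)
    return root


tokens = [
    ("TAG_OPEN", "p"), ("TAG_OPEN", "a"), ("TAG_ATTRIBUTE", "href"),
    ("TAG_ATTRIBUTE_VALUE", "/foo"), ("TEXT", "text"),
    ("TAG_CLOSE", "a"), ("TAG_CLOSE", "p"),
]
tree = build_tree(tokens)
link = tree.children[0].children[0]
print(link.tag, link.attrs, link.children)
```

A tree is mainly worth the effort when a transformation needs to look at a whole element at once (its children, its parent); the purely local rewrites in the question can stay on the token stream.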

Any suggestions are welcome.

And one more thing - I would like to learn all this parsing myself, so I am not looking for a library!

Thanks in advance, Boda Cydo

Comments (2)

幼儿园老大 2024-08-29 23:22:27

If we can assume that the HTML is XML-compliant, then XSLT would be one way to go. But I assume that is ruled out, since you seem to want to write your own parser (not sure why).
If you really want to write a parser (I'd write parse rules rather than a whole parser engine), take a look at ANTLR and MS Oslo.

不即不离 2024-08-29 23:22:27

There are various ways of parsing and traversing an XML/HTML tree. Perhaps I can point you to:

http://razorsharpcode.blogspot.com/2009/10/combined-pre-order-and-post-order-non.html

If you want to do pre-order or post-order manipulation of DOM elements, you can use the algorithm described there.
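To illustrate what a combined pre-order and post-order traversal without recursion can look like, here is a generic sketch using an explicit stack. It is not taken from the linked post. It assumes a made-up tree shape in which an element is a (tag, children) tuple and a text node is a plain string.

```python
def walk(root):
    """Yield ("enter", node) before a node's children and ("leave", node) after them."""
    stack = [("enter", root)]
    while stack:
        phase, node = stack.pop()
        yield phase, node
        if phase == "enter":
            # Schedule the post-order visit first so it pops after all children.
            stack.append(("leave", node))
            children = node[1] if isinstance(node, tuple) else []
            for child in reversed(children):
                stack.append(("enter", child))


def label(node):
    return node[0] if isinstance(node, tuple) else repr(node)


tree = ("p", [("a", ["text"]), "tail"])
events = [(phase, label(node)) for phase, node in walk(tree)]
for event in events:
    print(event)
```

The "enter" events give the pre-order sequence (a natural place to emit an opening tag) and the "leave" events give the post-order sequence (a natural place to emit the closing tag).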
