Parsing malformed XML/HTML

Posted on 2024-12-01 18:41:01

I need to parse multiple (roughly 1,600) HTML pages and pull out the contents of the following tag from each file.

    <textarea name="line" cols="66" rows="5" class="textbox" id="line" style="font-size:12px;" onkeydown="textCounter()" onkeyup="textCounter(); storeCaret(this);" onselect="storeCaret(this);" onclick="storeCaret(this);">TEXT I WANT IS HERE

(This is an HTML textarea tag.)
I had thought I could use a DOM parser, but the files contain too many errors. I then came across JTidy in another question here on Stack Overflow and tried to use that...

But JTidy doesn't seem to be able to convert the HTML from any of the pages into XHTML, which I would need before I could use a DOM parser.

I then thought I could use a regex, but I couldn't find the particular expression needed to pull out that text, and I also came across multiple questions/answers that said NOT to use regex to parse HTML...

So essentially my question is: is there any other approach I can take to get the text I need from malformed HTML?
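
For reference, the regex route mentioned above could look roughly like the sketch below. It is only a minimal illustration, not a general HTML parser: the class name `TextareaExtractor` and the method `extract` are made up for this example, and it assumes the tag always carries `id="line"`, that no attribute value contains a `>` character, and that the files can be read as UTF-8 (Java 11+ for `Files.readString`). Because the pages are malformed, the pattern also accepts end of input in place of a missing `</textarea>`.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; the name is not from any library.
public class TextareaExtractor {

    // Matches an opening <textarea ...> tag whose id attribute is "line",
    // then lazily captures everything up to </textarea> or, if the closing
    // tag is missing, up to the end of the input.
    // Assumption: no attribute value inside the tag contains '>'.
    private static final Pattern LINE_TEXTAREA = Pattern.compile(
            "<textarea\\b[^>]*\\bid\\s*=\\s*[\"']line[\"'][^>]*>(.*?)(?:</textarea\\s*>|\\z)",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Returns the raw text of the first matching textarea, if any.
    // HTML entities such as &amp; are NOT decoded here.
    public static Optional<String> extract(String html) {
        Matcher m = LINE_TEXTAREA.matcher(html);
        return m.find() ? Optional.of(m.group(1)) : Optional.empty();
    }

    // Usage: java TextareaExtractor page1.html page2.html ...
    public static void main(String[] args) throws IOException {
        for (String arg : args) {
            // Assumption: files are valid UTF-8; other encodings need a Charset argument.
            String html = Files.readString(Path.of(arg));
            System.out.println(arg + ": " + extract(html).orElse("<no match>"));
        }
    }
}
```

This only works because the target is a single, well-known tag; anything that depends on nesting or on arbitrary attributes would still call for a lenient HTML parser rather than a pattern like this.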
