Parsing malformed XML/HTML
I need to parse multiple (roughly 1600) HTML pages and pull out the contents of the following tag from each file.
<textarea name="line" cols="66" rows="5" class="textbox" id="line" style="font-size:12px;" onkeydown="textCounter()" onkeyup="textCounter(); storeCaret(this);" onselect="storeCaret(this);" onclick="storeCaret(this);">TEXT I WANT IS HERE
(this is an HTML textarea tag)
I had thought I could use a DOM parser, but the files contain too many errors. From another question here on Stack Overflow I came across JTidy, and I have tried to use that...
But it doesn't seem to be able to convert the HTML from any of the pages into XHTML so that I can then use a DOM parser.
I then thought I could use regex, but I couldn't find the particular expression needed to pull out that text, and I also came across multiple questions/answers which said NOT to use regex to parse HTML...
So essentially my question is: is there any other approach I can take to get the text I need from malformed HTML?
Comments (1)
You should be able to parse your documents with JTidy directly, without having to convert them to XHTML first. I have done that on several occasions (granted, a while ago), and it worked fine for me, even with quite ugly HTML.
EDIT: Another option I looked at the last time I needed to parse HTML files was TagSoup. I couldn't use it in a commercial product because of its GPL licence, but if you only need this functionality for an internal tool, it might work for you.
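As for the regex route mentioned in the question: since only one specific, known tag is needed, a narrowly scoped pattern may be enough. The following is a minimal sketch, not a general HTML parser and not something JTidy or TagSoup provide. It assumes the tag is written as `<textarea ... name="line" ...>` with double-quoted attributes, that no `>` appears inside an attribute value, and that a closing `</textarea>` follows the text. The class name `TextareaExtractor` and the method `extract` are hypothetical names chosen for illustration.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class TextareaExtractor {
    // Matches <textarea ... name="line" ...>CONTENT</textarea> case-insensitively.
    // DOTALL lets CONTENT span several lines; group 1 captures CONTENT.
    // Assumption: attribute values are double-quoted and contain no '>'.
    private static final Pattern LINE_TEXTAREA = Pattern.compile(
            "<textarea\\b[^>]*\\bname\\s*=\\s*\"line\"[^>]*>(.*?)</textarea\\s*>",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    /** Returns the raw (not entity-decoded) text of the first matching textarea, if any. */
    static Optional<String> extract(String html) {
        Matcher m = LINE_TEXTAREA.matcher(html);
        return m.find() ? Optional.of(m.group(1)) : Optional.empty();
    }

    public static void main(String[] args) {
        // Deliberately sloppy markup: the <p> is never closed, nor are <body> and <html>.
        String page = "<html><body><form><textarea name=\"line\" cols=\"66\" rows=\"5\" "
                + "id=\"line\" onclick=\"storeCaret(this);\">TEXT I WANT IS HERE</textarea>"
                + "</form><p>unclosed paragraph";
        System.out.println(extract(page).orElse("(no match)"));
    }
}
```

Because the pattern only looks at the one tag, errors elsewhere in the page do not matter to it. The trade-off is that it silently returns nothing if a page writes the tag differently (single quotes, a missing end tag), so for 1600 files it is worth counting the empty results and checking those pages by hand.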