How to find the offending line when using XmlSlurper
I am parsing a dirty HTML page with XmlSlurper, and I get the following error:
ERROR org.xml.sax.SAXParseException: Element type "scr" must be followed by either attribute specifications, ">" or "/>".
at org.apache.xerces.parsers.AbstractSAXParser.parse(Unknown Source)
at org.apache.xerces.jaxp.SAXParserImpl$JAXPSAXParser.parse(Unknown Source)
...
[Fatal Error] :1157:22: Element type "scr" must be followed by either attribute specifications, ">" or "/>".
Now, I have the HTML that I feed it, and I print it before parsing. If I open that output and go to the line mentioned in the error, 1157, there is no 'src' in there (although there are hundreds of such strings in the file). So I guess some additional content is inserted (maybe <script> or something like that) that changes the line numbers.
Is there a good way to find exactly the offending line or piece of HTML?
Comments (2)
Which SAXParser are you using? HTML is not strict XML, so using XmlSlurper with the default parser is probably going to keep producing errors like this one.
A cursory Google search for "Groovy html slurper" led me to HTML Scraping With Groovy, which points to a SAX parser called TagSoup.
Give that a whirl and see if it parses the dirty page.
You could add an attribute named _lineNum to each element, which can then be used.
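The code that originally accompanied this answer did not survive on this page. As a rough sketch of the idea only, not the answerer's code, the following plain-Java SAX filter stamps every element with a `_lineNum` attribute taken from the parser's `Locator`. The class name `LineNumberFilter` and the helper `elementLines` are hypothetical names chosen for this illustration. It assumes the underlying parser supplies a locator, which the JDK's bundled parser does.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.Locator;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;
import org.xml.sax.helpers.XMLFilterImpl;

// Hypothetical helper: a SAX filter that adds a _lineNum attribute to each element.
final class LineNumberFilter extends XMLFilterImpl {
    static final String LINE_ATTR = "_lineNum";
    private Locator locator;

    LineNumberFilter(XMLReader parent) {
        super(parent);
    }

    @Override
    public void setDocumentLocator(Locator locator) {
        this.locator = locator;              // remember it so startElement can ask for the position
        super.setDocumentLocator(locator);
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts)
            throws SAXException {
        // Copy the incoming attributes and append the current line number.
        AttributesImpl withLine = new AttributesImpl(atts);
        String line = (locator == null) ? "-1" : String.valueOf(locator.getLineNumber());
        withLine.addAttribute("", LINE_ATTR, LINE_ATTR, "CDATA", line);
        super.startElement(uri, localName, qName, withLine);
    }

    // Demo helper: parses xml and returns "elementName@line" for every start tag.
    static List<String> elementLines(String xml) {
        List<String> seen = new ArrayList<>();
        try {
            XMLReader base = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
            LineNumberFilter filter = new LineNumberFilter(base);
            filter.setContentHandler(new DefaultHandler() {
                @Override
                public void startElement(String uri, String localName, String qName, Attributes atts) {
                    seen.add(qName + "@" + atts.getValue(LINE_ATTR));
                }
            });
            filter.parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
        return seen;
    }

    public static void main(String[] args) {
        System.out.println(elementLines("<a>\n<b/>\n</a>"));
    }
}
```

Since `XmlSlurper` has a constructor that takes an `XMLReader`, a filter like this could presumably be handed to it from Groovy so the attribute shows up on the parsed nodes, but that wiring is an assumption and is not tested here. Note that this only helps once the document parses; it does not by itself locate a fatal error.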
That approach gives each element a line-number attribute. You could also try setting your own error handler, which can read the line number from the locator.
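The error-handler idea can be sketched in plain Java as follows. This is an illustration under stated assumptions rather than a definitive fix: the class name `OffendingLineFinder` and its method are hypothetical, and it relies on the fact that a `SAXParseException` carries the line and column the parser reached. The key point is to parse the exact string you inspect, so that the reported line number and the text you look at cannot drift apart.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: reports the source line where the parser hit a fatal error.
final class OffendingLineFinder {

    // Returns "line:column: <text of that line>", or null if the markup parses cleanly.
    static String findOffendingLine(String markup) {
        try {
            // DefaultHandler rethrows fatal errors as SAXParseException.
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(markup)), new DefaultHandler());
            return null;
        } catch (SAXParseException e) {
            // Line numbers are 1-based and refer to the text given to the parser.
            String[] lines = markup.split("\r\n|\r|\n", -1);
            int line = e.getLineNumber();
            String text = (line >= 1 && line <= lines.length) ? lines[line - 1] : "";
            return line + ":" + e.getColumnNumber() + ": " + text;
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        String page = "<html>\n<body>\n<p>document.write(\"<scr\" + \"ipt>\");</p>\n</body>\n</html>";
        System.out.println(findOffendingLine(page));
    }
}
```

The sample input splits a tag name inside inline JavaScript (`"<scr" + "ipt>"`), which is one plausible way to get an error about an element called "scr" in a page that never contains that tag literally. `XmlSlurper` also exposes `setErrorHandler`, so a similar handler could likely be attached there from Groovy; treat that as a suggestion to verify rather than a confirmed recipe.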