How to find the offending line when using XmlSlurper

Posted on 2024-12-25 02:31:45

I am parsing a dirty HTML page with XmlSlurper, and I get the following error:

ERROR org.xml.sax.SAXParseException: Element type "scr" must be followed by either attribute specifications, ">" or "/>".
    at org.apache.xerces.parsers.AbstractSAXParser.parse(Unknown Source)
    at org.apache.xerces.jaxp.SAXParserImpl$JAXPSAXParser.parse(Unknown Source)
        ...
[Fatal Error] :1157:22: Element type "scr" must be followed by either attribute specifications, ">" or "/>".

I have the HTML that I feed to the parser, and I print it before parsing. If I open that output and go to the line mentioned in the error, 1157, there is no 'src' there (though there are hundreds of such strings in the file). So I guess some additional content is inserted (maybe <script> or something like that) that shifts the line numbers.

Is there a good way to find exactly the offending line or piece of HTML?

Comments (2)

九局 2025-01-01 02:31:45

Which SAX parser are you using? HTML is not strict XML, so using XmlSlurper with the default parser will probably keep producing errors.

A cursory Google search for "Groovy html slurper" led me to HTML Scraping With Groovy, which points to a SAX parser called TagSoup.

Give that a whirl and see if it parses the dirty page.
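
A minimal sketch of what that might look like, not taken from the linked article. It assumes TagSoup 1.2.1 can be fetched from Maven Central via @Grab and that Groovy 3 or later is in use (on older versions, drop the groovy.xml.XmlSlurper import). The sample markup is made up for illustration.

```groovy
@Grab('org.ccil.cowan.tagsoup:tagsoup:1.2.1')
import groovy.xml.XmlSlurper          // available since Groovy 3
import org.ccil.cowan.tagsoup.Parser

// Made-up dirty markup: unclosed <p> and <b> tags, no <html>/<body> wrapper.
def dirty = '<p>one<p>two <b>bold'

// TagSoup's Parser is a SAX XMLReader, so XmlSlurper can use it
// in place of the default strict XML parser.
def html = new XmlSlurper(new Parser()).parseText(dirty)

// Depth-first search for every <p> element that TagSoup recovered.
def paragraphs = html.'**'.findAll { it.name() == 'p' }
paragraphs.each { println it.text() }
```

Because TagSoup repairs the markup rather than rejecting it, this route avoids the SAXParseException altogether instead of locating it.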

人间☆小暴躁 2025-01-01 02:31:45

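You could add an attribute (named _srmLineNum in the code below) to each element, holding the line number on which the element starts, and then read it back like any other attribute.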

import org.xml.sax.Attributes
import org.xml.sax.Locator
import org.xml.sax.SAXException
import org.xml.sax.ext.Attributes2Impl
import javax.xml.parsers.ParserConfigurationException
// Groovy 4+ also needs: import groovy.xml.XmlSlurper

class MySlurper extends XmlSlurper {
    public static final String LINE_NUM_ATTR = "_srmLineNum"
    Locator locator

    public MySlurper() throws ParserConfigurationException, SAXException {
        super()
    }

    // The SAX parser hands over a Locator that tracks the current position in the input.
    @Override
    public void setDocumentLocator(Locator locator) {
        this.locator = locator
    }

    // Copy the element's attributes and append the current line number before delegating.
    @Override
    public void startElement(String uri, String localName, String qName, Attributes attrs) throws SAXException {
        Attributes2Impl newAttrs = new Attributes2Impl(attrs)
        // No namespace and type "CDATA", so the value is a plain attribute readable as @_srmLineNum.
        newAttrs.addAttribute("", LINE_NUM_ATTR, LINE_NUM_ATTR, "CDATA", "" + locator.getLineNumber())
        super.startElement(uri, localName, qName, newAttrs)
    }
}

def text = '''
<root>
  <a>one!</a>
  <a>two!</a>
</root>'''

def root = new MySlurper().parseText(text)

root.a.each { println it.@_srmLineNum }

The above adds the line-number attribute to each element. You could also try setting your own error handler, which can read the line number from the locator.
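
As a rough sketch of the error-handler idea (not the answerer's code): the handler below records each SAXParseException, which carries the line and column the parser had reached, and a helper then looks that line up in the exact string that was handed to the parser. The names PositionRecordingHandler and findOffendingLine are hypothetical, and the sample input is made up to trigger the same kind of "scr" error. The import assumes Groovy 3 or later; on older versions it can be dropped.

```groovy
import groovy.xml.XmlSlurper          // available since Groovy 3
import org.xml.sax.ErrorHandler
import org.xml.sax.SAXException
import org.xml.sax.SAXParseException

// Hypothetical handler: remembers every problem the parser reports.
class PositionRecordingHandler implements ErrorHandler {
    List<SAXParseException> problems = []

    void warning(SAXParseException e)    { problems << e }
    void error(SAXParseException e)      { problems << e }
    void fatalError(SAXParseException e) { problems << e; throw e }
}

// Hypothetical helper: returns a map describing the first reported problem,
// or null if the text parsed cleanly.
def findOffendingLine(String text) {
    def handler = new PositionRecordingHandler()
    def slurper = new XmlSlurper()
    slurper.setErrorHandler(handler)
    try {
        slurper.parseText(text)
    } catch (SAXException ignored) {
        // Already recorded by the handler.
    }
    if (!handler.problems) return null

    def first = handler.problems[0]
    def lines = text.readLines()
    int lineNo = first.lineNumber       // 1-based, relative to the parsed string
    return [
        lineNumber  : lineNo,
        columnNumber: first.columnNumber,
        message     : first.message,
        line        : (lineNo >= 1 && lineNo <= lines.size()) ? lines[lineNo - 1] : null
    ]
}

// Made-up input: the quoted "<scr" inside the script is not valid XML.
def sample = '''<html>
<body>
<script>document.write("<scr" + "ipt>")</script>
</body>
</html>'''

def hit = findOffendingLine(sample)
println "line ${hit.lineNumber}, column ${hit.columnNumber}: ${hit.line}"
println hit.message
```

The line is looked up in the same string that is passed to parseText, so the reported number and the displayed text cannot drift apart. If the number still does not match what an editor shows, that would suggest the text being parsed is not the text that was printed.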
