How to find the offending line when using XmlSlurper

Posted 2024-12-25 02:31:45, 674 characters, 7 views, 0 comments

I am parsing a dirty html page with XmlSlurper, and I get the following error:

ERROR org.xml.sax.SAXParseException: Element type "scr" must be followed by either attribute specifications, ">" or "/>".
    at org.apache.xerces.parsers.AbstractSAXParser.parse(Unknown Source)
    at org.apache.xerces.jaxp.SAXParserImpl$JAXPSAXParser.parse(Unknown Source)
        ...
[Fatal Error] :1157:22: Element type "scr" must be followed by either attribute specifications, ">" or "/>".

I have the HTML that I feed to it, and I print it before parsing. If I open that output and go to the line mentioned in the error, 1157, there is no 'src' there (although there are hundreds of such strings in the file). So I guess some additional content is inserted (maybe <script> or something like that), which changes the line numbers.

Is there a good way to find exactly the offending line or html piece?

九局 2025-01-01 02:31:45

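Which SAXParser are you using? HTML is not strict XML, so using XmlSlurper with the default parser will likely produce a steady stream of errors.

A cursory Google search for "Groovy html slurper" turned up HTML Scraping With Groovy, which points to a SAX parser called TagSoup.

Give it a try and see whether it parses the dirty page.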

人间☆小暴躁 2025-01-01 02:31:45

You could add a line-number attribute (named _srmLineNum in the code below) to each element as it is parsed, and read it back afterwards.

import org.xml.sax.Attributes
import org.xml.sax.Locator
import org.xml.sax.SAXException
import org.xml.sax.ext.Attributes2Impl
import javax.xml.parsers.ParserConfigurationException
// On Groovy 4 and later, also add: import groovy.xml.XmlSlurper

class MySlurper extends XmlSlurper {
    // Name of the synthetic attribute that carries the source line number
    public static final String LINE_NUM_ATTR = "_srmLineNum"
    Locator locator

    MySlurper() throws ParserConfigurationException, SAXException {
        super()
    }

    // The SAX parser hands over a Locator that tracks the current position
    @Override
    void setDocumentLocator(Locator locator) {
        this.locator = locator
    }

    // Copy the element's attributes and append the current line number
    @Override
    void startElement(String uri, String localName, String qName, Attributes attrs) throws SAXException {
        Attributes2Impl newAttrs = new Attributes2Impl(attrs)
        // "CDATA" is the SAX attribute type for plain, undeclared attributes
        newAttrs.addAttribute(uri, LINE_NUM_ATTR, LINE_NUM_ATTR, "CDATA", "" + locator.getLineNumber())
        super.startElement(uri, localName, qName, newAttrs)
    }
}

def text = '''
<root>
  <a>one!</a>
  <a>two!</a>
</root>'''

def root = new MySlurper().parseText(text)

// Print the line on which each <a> element starts
root.a.each { println it.@_srmLineNum }

The code above adds the line-number attribute to every element. You could also try setting your own error handler, which can read the line number from the locator.
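
The error-handler idea can be sketched in plain Java against the JDK's own SAX parser, which exposes the same SAX API that XmlSlurper builds on. This is only an illustrative sketch and not part of the original answer: the class name OffendingLineFinder and the methods findProblem and report are made up for this example. It parses the exact string held in memory and, when a SAXParseException is thrown, combines the line and column from the exception with the matching line of that same string, so the reported position and the text you inspect cannot drift apart.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: the names here are illustrative, not from any library.
class OffendingLineFinder {

    /** Returns null if the text is well-formed XML, otherwise a report on the first fatal error. */
    static String findProblem(String text) throws Exception {
        try {
            // DefaultHandler rethrows fatal errors as SAXParseException
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(text)), new DefaultHandler());
            return null;
        } catch (SAXParseException e) {
            return report(text, e.getLineNumber(), e.getColumnNumber(), e.getMessage());
        }
    }

    /** Builds a message with the position, the offending source line, and a caret under the column. */
    static String report(String text, int line, int column, String message) {
        // XML parsers count \r\n, \r and \n each as one line break
        String[] lines = text.split("\r\n|\r|\n", -1);
        StringBuilder sb = new StringBuilder();
        sb.append("line ").append(line).append(", column ").append(column)
          .append(": ").append(message).append('\n');
        if (line >= 1 && line <= lines.length) {
            sb.append(lines[line - 1]).append('\n');
            // The column may be -1 when the parser does not know it
            for (int i = 1; i < column; i++) {
                sb.append(' ');
            }
            sb.append('^');
        }
        return sb.toString();
    }

    public static void main(String[] args) throws Exception {
        // A tag name split by "+" mimics the kind of markup that triggers the "scr" error
        String html = "<html>\n<body>\n<scr+ipt>x</scr+ipt>\n</body>\n</html>";
        System.out.println(findProblem(html));
    }
}
```

The line number in the exception refers to the input the parser actually received, so the report has to be built from that same string rather than from a copy saved or printed elsewhere.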
