Why can't the Jericho parser parse this HTML code?

Posted on 2024-12-23 18:53:41

I use the Jericho parser in my application to get a lighter version of a web page by extracting some parts of it. For instance, when I get this code:

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN/" "http://www.w3.org/TR/html4/loose.dtd"><html> <head> </head> <body> <b> <span class="articletitletext">Happy New Year!</span></b> <br> <span class="postedstamp">Posted By <script language="JavaScript" type="text/javascript"> <!-- document.write('<a href=" mailto:chris.wyman@verizon.net">'); // --> </script>Chris</a> on January 1, 2012</span><br> <br> <span id="intelliTXT">

From all of us here at TheForce.net, we wish you and your family a safe and Happy New Year. May the Force be with you in 2012!

</span></body> </html>

I'd like to parse it once again using the Jericho parser, but when I run

ArrayList<Element> centerElems=(ArrayList<Element>) pageSource.getAllElements(HTMLElementName.CENTER);

I get this exception:

01-01 10:46:37.518: ERROR/AndroidRuntime(648): java.lang.RuntimeException: Unable to start activity ComponentInfo{net.test.theforce/net.test.theforce.NewsListActivity}: java.lang.RuntimeException: java.lang.ClassCastException: java.util.Collections$EmptyList

and the application crashes. So, what's wrong with the lighter page?

Comments (1)

心凉怎暖 2024-12-30 18:53:41

It looks to me like the Jericho parser can parse the HTML you gave it. The error you're getting arises because you've made an incorrect assumption about what the getAllElements() method returns.

I admit I could only find the Javadoc for the zero-argument overload of this method, not the one-argument overload that you're using, so I'll have to assume that both return the same type, List<Element>. In your example, there are no center elements in the HTML, so getAllElements() should return an empty List<Element>. It doesn't have to return an ArrayList<Element> here; any implementation of List<Element> will do. In this case, it returns Collections.emptyList(), whose class is the java.util.Collections$EmptyList named in your stack trace. That isn't an ArrayList<Element>, so the cast fails with a ClassCastException.
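To see the failure in isolation, here is a minimal sketch that does not use Jericho at all. The method findMatches() is a hypothetical stand-in for getAllElements(...) when nothing matches, and String stands in for Element; both are assumptions made only so the example runs on a plain JDK.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class EmptyListCastDemo {
    // Hypothetical stand-in for pageSource.getAllElements(...) with no matches:
    // it returns the shared immutable empty list, not an ArrayList.
    public static List<String> findMatches() {
        return Collections.emptyList();
    }

    // Attempts the same downcast as in the question and reports whether it failed.
    public static boolean castToArrayListFails() {
        try {
            ArrayList<String> list = (ArrayList<String>) findMatches();
            return false; // reached only if the result really was an ArrayList
        } catch (ClassCastException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        // Print the runtime class of the empty result, then whether the cast failed.
        System.out.println(findMatches().getClass().getName());
        System.out.println(castToArrayListFails());
    }
}
```

The compiler accepts the cast because ArrayList is a subtype of List; the check only happens at run time, which is why the problem shows up as a crash rather than a compile error.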

As far as I can see, you have two options:

  • Firstly, you might not need the returned list to be an ArrayList<Element>. It might be sufficient to use List<Element> instead. In this case, you should replace the line

    ArrayList<Element> centerElems=(ArrayList<Element>) pageSource.getAllElements(HTMLElementName.CENTER);
    

    with

    List<Element> centerElems = pageSource.getAllElements(HTMLElementName.CENTER);
    
  • Secondly, if you really do need the list to be an ArrayList<Element>, then you can create an ArrayList<Element> from the results:

    ArrayList<Element> centerElems = new ArrayList<Element>(pageSource.getAllElements(HTMLElementName.CENTER));
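Both options can be tried without Jericho on the classpath. In the sketch below, findMatches() is again a hypothetical stand-in for getAllElements(...) returning no results, and String replaces Element; neither name comes from the Jericho API.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class EmptyListFixDemo {
    // Hypothetical stand-in for pageSource.getAllElements(...) with no matches.
    public static List<String> findMatches() {
        return Collections.emptyList();
    }

    // Option 1: declare the variable as List and drop the cast entirely.
    public static List<String> asList() {
        List<String> result = findMatches();
        return result;
    }

    // Option 2: copy the results into a new ArrayList when one is really needed.
    public static ArrayList<String> asArrayListCopy() {
        return new ArrayList<String>(findMatches());
    }

    public static void main(String[] args) {
        List<String> viewOnly = asList();
        System.out.println(viewOnly.size());

        ArrayList<String> copy = asArrayListCopy();
        copy.add("extra"); // the copy is an ordinary, mutable ArrayList
        System.out.println(copy.size());
    }
}
```

One practical difference: the list returned by Collections.emptyList() is immutable, so adding to it throws UnsupportedOperationException. If the code needs to modify the list afterwards, the copy in the second option is the safer choice.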
    