Why can't the Jericho parser parse this HTML code?
I use the Jericho parser in my application to get a lighter version of a web page by extracting some parts from it. So, for instance, when I get this code:
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN/" "http://www.w3.org/TR/html4/loose.dtd"><html> <head> </head> <body> <b> <span class="articletitletext">Happy New Year!</span></b> <br> <span class="postedstamp">Posted By <script language="JavaScript" type="text/javascript"> <!-- document.write('<a href=" mailto:chris.wyman@verizon.net">'); // --> </script>Chris</a> on January 1, 2012</span><br> <br> <span id="intelliTXT">
From all of us here at TheForce.net, we wish you and your family a safe and Happy New Year. May the Force be with you in 2012!
</span></body> </html>
I'd like to parse it once again using the Jericho parser, but when I run
ArrayList<Element> centerElems=(ArrayList<Element>) pageSource.getAllElements(HTMLElementName.CENTER);
I get this exception:
01-01 10:46:37.518: ERROR/AndroidRuntime(648): java.lang.RuntimeException: Unable to start activity ComponentInfo{net.test.theforce/net.test.theforce.NewsListActivity}: java.lang.RuntimeException: java.lang.ClassCastException: java.util.Collections$EmptyList
and the application crashes. So, what's wrong with the lighter page?
Comments (1)
It looks to me like the Jericho parser can parse the HTML you gave it. The error you're getting arises because you've made an incorrect assumption about what the getAllElements() method returns.

I admit I could only find the Javadoc for the zero-argument overload of this method, as opposed to the one-argument overload that you're using, so I'll have to assume that both methods return the same type, List<Element>. In your example, there are no center elements in the HTML, so the getAllElements() method should return an empty List<Element>. It doesn't have to return an ArrayList<Element> here; any implementation of List<Element> will do. In this case, it chooses to return Collections.emptyList(). This isn't an ArrayList<Element>, and you get a ClassCastException because you cannot cast it to an ArrayList<Element>.

As far as I can see, you have two options:
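Firstly, you might not need the returned list to be an ArrayList<Element>. It might be sufficient to use List<Element> instead. In this case, you should replace the line
ArrayList<Element> centerElems=(ArrayList<Element>) pageSource.getAllElements(HTMLElementName.CENTER);
with
List<Element> centerElems = pageSource.getAllElements(HTMLElementName.CENTER);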
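Secondly, if you really do need the list to be an ArrayList<Element>, then you can create an ArrayList<Element> from the results:
ArrayList<Element> centerElems = new ArrayList<Element>(pageSource.getAllElements(HTMLElementName.CENTER));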