How do I parse XML from a script tag in an HTML document?
I've been using Jsoup to scrape HTML data from a website, but I also need a block of XML that sits inside a script tag, because it contains a bunch of image URLs that I want to pull out and download. Here is what it looks like:
<script type="text/javascript">
var xmlTxt = '<?xml version="1.0" encoding="UTF-8" standalone="yes"?><mediaObject><mediaList rail="1"><carMedia thumbnail="http://images.blah.com/scaler/80/60/images/2011/9/22/307/179/22343202654.307179719.IM1.MAIN.565x421_A.562x421.jpg" url="http://images.blah.com/scaler/544/408/images/2011/9/22/307/179/22343202654.307179719.IM1.MAIN.565x421_A.562x421.jpg" type="INV_PHOTO" mediaLabel="" category="UNCATEGORIZED" sequence="2"/></mediaList></mediaObject>';
That is followed by a whole bunch of JavaScript code inside the script tag. If I have a Jsoup Document, what is the best way to extract those URLs from the page? If I can't do it with Jsoup, how else can I do it? The problem is that the images are held in a carousel, so the HTML on the page only contains the sources of the images currently displayed in the carousel.
Comments (2)
First, you can get xmlTxt into Java using a JavaScript binding; see http://developer.android.com/guide/webapps/webview.html#BindingJavaScript
Second, parse your XML. I'm not sure whether Jsoup can be used for general XML (as opposed to HTML). If it can't, you can use Android's built-in XmlPullParser ( http://developer.android.com/reference/org/xmlpull/v1/XmlPullParser.html ) or another XML library.
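As a rough illustration of the pull-parsing step, here is a minimal sketch. It does not use Android's XmlPullParser; it swaps in the JDK's StAX XMLStreamReader, which is a similar pull-style API, so that it runs on plain Java. The class name CarMediaPullParser, the method imageUrls, and the short sample URLs are made up for illustration; only the element name carMedia and the attribute name url come from the question.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper class, not part of any library.
public class CarMediaPullParser {

    /** Returns the "url" attribute of every carMedia element in the XML text. */
    public static List<String> imageUrls(String xml) throws XMLStreamException {
        List<String> urls = new ArrayList<>();
        XMLStreamReader reader =
                XMLInputFactory.newInstance().createXMLStreamReader(new StringReader(xml));
        try {
            while (reader.hasNext()) {
                // next() advances to the following parse event and returns its type
                if (reader.next() == XMLStreamConstants.START_ELEMENT
                        && "carMedia".equals(reader.getLocalName())) {
                    String url = reader.getAttributeValue(null, "url");
                    if (url != null) {
                        urls.add(url);
                    }
                }
            }
        } finally {
            reader.close();
        }
        return urls;
    }

    public static void main(String[] args) throws XMLStreamException {
        // Placeholder data shaped like the question's XML; the URLs are invented.
        String xml = "<?xml version=\"1.0\" encoding=\"UTF-8\" standalone=\"yes\"?>"
                + "<mediaObject><mediaList rail=\"1\">"
                + "<carMedia thumbnail=\"http://images.blah.com/t1.jpg\""
                + " url=\"http://images.blah.com/1.jpg\" sequence=\"1\"/>"
                + "<carMedia thumbnail=\"http://images.blah.com/t2.jpg\""
                + " url=\"http://images.blah.com/2.jpg\" sequence=\"2\"/>"
                + "</mediaList></mediaObject>";
        for (String url : imageUrls(xml)) {
            System.out.println(url);
        }
    }
}
```

On Android the same loop could presumably be written against XmlPullParser instead, checking for its START_TAG event and reading the attribute from there.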
Well, I did it the dirty way, but it should work. I was hoping there was a more elegant solution, but for now I just convert the doc to a string (doc.toString()) and then find the start index of the opening XML tag and the end index of the closing XML tag that I want. From there I should be able to use the built-in Java XML parser to do the rest.
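The index-based approach described above might look roughly like the sketch below. It takes the page or script text as a plain String (for example the result of doc.toString()) so that it runs without Jsoup on the classpath. The class and method names are hypothetical, the sample URL is a placeholder, and the sketch assumes the XML starts at "<?xml", ends at "</mediaObject>", and contains no JavaScript escape sequences inside the string literal.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper class, not part of Jsoup or the JDK.
public class ScriptXmlExtractor {

    // Markers taken from the XML shown in the question.
    private static final String START = "<?xml";
    private static final String END = "</mediaObject>";

    /** Cuts the XML block out of the surrounding page or script text. */
    public static String extractXml(String text) {
        int start = text.indexOf(START);
        // Search for the closing tag only after the start marker.
        int end = (start < 0) ? -1 : text.indexOf(END, start);
        if (start < 0 || end < 0) {
            throw new IllegalArgumentException("XML block not found");
        }
        return text.substring(start, end + END.length());
    }

    /** Parses the XML with the JDK's DOM parser and collects the carMedia url attributes. */
    public static List<String> imageUrls(String xml) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document dom = builder.parse(new InputSource(new StringReader(xml)));
        NodeList nodes = dom.getElementsByTagName("carMedia");
        List<String> urls = new ArrayList<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            urls.add(((Element) nodes.item(i)).getAttribute("url"));
        }
        return urls;
    }

    public static void main(String[] args) throws Exception {
        // Stand-in for doc.toString(); the URL is invented for the example.
        String page = "<script type=\"text/javascript\">\n"
                + "var xmlTxt = '<?xml version=\"1.0\" encoding=\"UTF-8\" standalone=\"yes\"?>"
                + "<mediaObject><mediaList rail=\"1\">"
                + "<carMedia url=\"http://images.blah.com/1.jpg\" sequence=\"1\"/>"
                + "</mediaList></mediaObject>';\n"
                + "var somethingElse = 1;\n</script>";
        for (String url : imageUrls(extractXml(page))) {
            System.out.println(url);
        }
    }
}
```

Searching only the text of the relevant script element, rather than the whole page, would narrow the match and make a false hit on an unrelated "<?xml" less likely.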