How do I parse XML from a script tag in an HTML document?

Posted on 2024-12-07 08:26:07

I've been using Jsoup to scrape HTML data from a website, but I also need a section of XML embedded inside a script tag, because it contains a bunch of URLs for images I need to download. Here is what it looks like:

<script type="text/javascript">
    var xmlTxt = '<?xml version="1.0" encoding="UTF-8" standalone="yes"?><mediaObject><mediaList rail="1"><carMedia thumbnail="http://images.blah.com/scaler/80/60/images/2011/9/22/307/179/22343202654.307179719.IM1.MAIN.565x421_A.562x421.jpg" url="http://images.blah.com/scaler/544/408/images/2011/9/22/307/179/22343202654.307179719.IM1.MAIN.565x421_A.562x421.jpg" type="INV_PHOTO" mediaLabel="" category="UNCATEGORIZED" sequence="2"/></mediaList></mediaObject>';

That is followed by a whole bunch of JavaScript code inside the script tag. What is the best way to extract those URLs from the page if I have a Jsoup Document? If I can't do it with Jsoup, how can I do it? The problem is that the images are held in a carousel, so the HTML on the page only contains the sources for the images currently displayed in the carousel.

Answer from 暗喜, 2024-12-14 08:26:07

First, you can get xmlTxt into Java using a JavaScript binding. See http://developer.android.com/guide/webapps/webview.html#BindingJavaScript
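
If running the page in a WebView is not an option, a different technique may be enough for a literal like the one in the question: pull the string out of the script source with a regular expression. The following is only a minimal sketch of that alternative, not the binding approach described above. The class and method names are hypothetical, and it assumes the literal is single-quoted and contains no single quotes of its own. With Jsoup, the script source would typically come from something like `doc.select("script")` and each element's `data()`.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: extracts the value of `var xmlTxt = '...'` from script source text.
final class XmlTxtExtractor {
    // Captures everything between the single quotes of the xmlTxt assignment.
    // Assumption: the literal contains no raw or escaped single quotes.
    private static final Pattern XML_TXT =
            Pattern.compile("var\\s+xmlTxt\\s*=\\s*'([^']*)'");

    /** Returns the contents of the xmlTxt string literal, if the script defines one. */
    static Optional<String> extractXmlTxt(String scriptSource) {
        Matcher m = XML_TXT.matcher(scriptSource);
        return m.find() ? Optional.of(m.group(1)) : Optional.empty();
    }

    public static void main(String[] args) {
        // Stand-in for the text of the script tag; the real page has much more code after it.
        String script = "var xmlTxt = '<?xml version=\"1.0\"?><mediaObject/>';\nvar other = 1;";
        System.out.println(extractXmlTxt(script).orElse("(not found)"));
    }
}
```

If the site ever switches to double quotes or escapes quotes inside the literal, the pattern would need to be adjusted accordingly.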

Second, parse your XML. I'm not sure whether Jsoup can be used for general XML (as opposed to HTML). If it can't, you can use Android's built-in XmlPullParser (http://developer.android.com/reference/org/xmlpull/v1/XmlPullParser.html) or another XML library.
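
As one illustration of the "another XML library" route, the parsing step could be sketched with the standard `javax.xml.parsers` DOM API rather than XmlPullParser. The element name `carMedia` and the attribute name `url` come from the question's sample; the class name, method name, and example.com URLs are made up for this sketch.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical helper: collects the url attribute of every carMedia element.
final class CarMediaUrls {
    /** Parses the XML text and returns the carMedia url attributes in document order. */
    static List<String> imageUrls(String xml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xml)));
            NodeList nodes = doc.getElementsByTagName("carMedia");
            List<String> urls = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                Element media = (Element) nodes.item(i);
                urls.add(media.getAttribute("url"));
            }
            return urls;
        } catch (ParserConfigurationException | SAXException | IOException e) {
            // Wrap checked exceptions so callers only deal with one failure type.
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        // Shortened stand-in for the xmlTxt value shown in the question.
        String xml = "<?xml version=\"1.0\" encoding=\"UTF-8\" standalone=\"yes\"?>"
                + "<mediaObject><mediaList rail=\"1\">"
                + "<carMedia thumbnail=\"http://example.com/t1.jpg\" url=\"http://example.com/1.jpg\" sequence=\"1\"/>"
                + "<carMedia thumbnail=\"http://example.com/t2.jpg\" url=\"http://example.com/2.jpg\" sequence=\"2\"/>"
                + "</mediaList></mediaObject>";
        for (String url : imageUrls(xml)) {
            System.out.println(url);
        }
    }
}
```

A DOM parser loads the whole document into memory, which should be fine for a small media list like this one; a pull parser would be the lighter choice for very large inputs.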