HtmlUnit and fragment identifiers

Posted 10-10 01:15 · 1242 characters · 7 views · 0 comments

I'm currently wondering how to deal with fragment identifiers. A link that I want to grab information from contains one, and HtmlUnit seems to discard the "#/db4mj" part of my URL and load the original URL instead.

Does anyone know of a way to deal with fragment identifiers? (I can post example code to explain further if need be.)

EDIT

Since I wasn't getting many views (and no answers), I'm going to add a bounty. Sorry it's only 50, but I only had 79 to start with.

EDIT

Here is some example code, as requested.

Our URL will be: http://browse.deviantart.com/resources/applications/psbrushes/?order=9&offset=0

If you look at the content behind that link, you'll see multiple brushes, each with its own URL. My script grabs one of those URLs: http://browse.deviantart.com/resources/applications/psbrushes/?order=9&offset=0#/dbwam4

As you can see, it contains the fragment identifier #/dbwam4. When I try to grab the content of this page, HtmlUnit still thinks it is on the original URL.

Here is the example code from my script. It fails on the URL with the fragment identifier but has no problem with the original URL.

client = new WebClient(BrowserVersion.FIREFOX_3)
client.javaScriptEnabled = false

page = client.getPage(url)       //url with fragment identifier

// this element exists only on the URL with the fragment identifier, not on the original URL
img = page.getByXPath("//*[@id='gmi-ResViewSizer_img']")

I expect to be able to grab certain information from the URL with the fragment identifier, but I can't access it at all.

怀里藏娇 2024-10-17 01:15:04

There is good news and bad news.

First, the good news: HtmlUnit appears to be working just fine.

If you visit the page with the fragment identifier URL in a browser with JavaScript turned off (maybe using Firefox's QuickJava plugin), you will not see the "single brush view" that you want.

So in order to acquire this page you need to use WebClient with setJavaScriptEnabled set to true.
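
The reason JavaScript matters here is that a fragment identifier is never sent to the server in the HTTP request. The server sees only the path and query, so any "single brush view" has to be produced client-side by a script that reads the fragment. The following is a minimal sketch using only the JDK to show which part of the URL goes where; the class and method names (FragmentDemo, requestTarget, fragmentOf) are made up for illustration and are not part of HtmlUnit.

```java
import java.net.URI;

class FragmentDemo {

    /** Returns the path and query, i.e. the part of the URL the server actually receives. */
    static String requestTarget(String url) {
        URI uri = URI.create(url);
        String query = uri.getRawQuery();
        return uri.getRawPath() + (query == null ? "" : "?" + query);
    }

    /** Returns the fragment (the text after '#'), or null if the URL has none. */
    static String fragmentOf(String url) {
        return URI.create(url).getFragment();
    }

    public static void main(String[] args) {
        String url = "http://browse.deviantart.com/resources/applications/psbrushes/"
                + "?order=9&offset=0#/dbwam4";
        // The server only ever sees this part of the URL.
        System.out.println("sent to server: " + requestTarget(url));
        // The fragment stays on the client, where page scripts can read it.
        System.out.println("kept by client: " + fragmentOf(url));
    }
}
```

With JavaScript disabled, nothing ever acts on the fragment, which would explain why the response looks identical to the one for the original URL.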

And now the bad news:

I have not been able to consistently acquire the "single brush view" page using HtmlUnit with JavaScript turned on (I don't know why), although I have occasionally been able to acquire the full page.

The real problem is that the returned HTML is in such a bad state that it defied my attempts to parse it (I tried TagSoup, jsoup, Jaxen, etc.). I therefore suspect that parsing the page with XPath may not work for you.

I think you will need to resort to regular expressions (which is far from ideal) or even some variant of String.indexOf("gmi-ResViewSizer_img").
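
As a rough illustration of the regular-expression fallback, here is a sketch that pulls the src attribute out of the tag carrying that id, working on the raw HTML string rather than a parsed DOM. It assumes the element is an <img> tag with a quoted src attribute, which I have not verified against the real page; the class name ImgSrcExtractor, the method findImgSrc, and the sample HTML are all invented for the example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ImgSrcExtractor {

    // Finds a quoted src="..." or src='...' attribute inside a single tag.
    private static final Pattern SRC = Pattern.compile(
            "\\bsrc\\s*=\\s*[\"']([^\"']*)[\"']", Pattern.CASE_INSENSITIVE);

    /**
     * Returns the src of the first <img> tag whose id attribute equals the given id,
     * or null if no such tag (or no src on it) is found. Attribute order does not matter.
     * Caveat: this is a text match, not a parser, so an attribute such as data-id
     * with the same value would also match.
     */
    static String findImgSrc(String html, String id) {
        Pattern tag = Pattern.compile(
                "<img\\b[^>]*\\bid\\s*=\\s*[\"']" + Pattern.quote(id) + "[\"'][^>]*>",
                Pattern.CASE_INSENSITIVE);
        Matcher tagMatcher = tag.matcher(html);
        if (!tagMatcher.find()) {
            return null;
        }
        Matcher srcMatcher = SRC.matcher(tagMatcher.group());
        return srcMatcher.find() ? srcMatcher.group(1) : null;
    }

    public static void main(String[] args) {
        // Deliberately sloppy sample markup (unclosed tags), invented for this sketch.
        String html = "<div><p><img src=\"http://example.com/brush.png\" "
                + "id=\"gmi-ResViewSizer_img\" width=\"10\"><b>";
        System.out.println(findImgSrc(html, "gmi-ResViewSizer_img"));
    }
}
```

Because it never builds a tree, this approach tolerates unclosed or misnested tags, at the cost of being brittle if the markup around the attribute changes.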

I hope this helps.

EDIT

I managed to get something that works sporadically. I'm afraid I am not converted to Groovy yet, so it is in plain old Java.

I haven't looked at the source of HtmlUnit, but it is almost as if something in the process of running the save helps the parsing work. Without the save I seem to get NullPointerExceptions.

import com.gargoylesoftware.htmlunit.BrowserVersion;
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.WebRequest;
import com.gargoylesoftware.htmlunit.WebResponse;
import com.gargoylesoftware.htmlunit.html.HtmlElement;
import com.gargoylesoftware.htmlunit.html.HtmlPage;
import com.gargoylesoftware.htmlunit.util.FalsifyingWebConnection;
import java.io.File;
import java.io.IOException;

public class TestProblem {

    public static void main(String[] args) throws IOException {
        WebClient client = new WebClient(BrowserVersion.FIREFOX_3_6);
        client.setJavaScriptEnabled(true);
        client.setCssEnabled(false);
        String url = "http://browse.deviantart.com/resources/applications/psbrushes/?order=9&offset=0#/dbwam4";
        client.setThrowExceptionOnScriptError(false);
        client.setThrowExceptionOnFailingStatusCode(false);
        client.setWebConnection(new FalsifyingWebConnection(client) {

            @Override
            public WebResponse getResponse(final WebRequest request) throws IOException {
                if ("www.google-analytics.com".equals(request.getUrl().getHost())) {
                    return createWebResponse(request, "", "application/javascript"); // -> empty script
                }
                if ("d.unanimis.co.uk".equals(request.getUrl().getHost())) {
                    return createWebResponse(request, "", "application/javascript"); // -> empty script
                }
                if ("edge.quantserve.com".equals(request.getUrl().getHost())) {
                    return createWebResponse(request, "", "application/javascript"); // -> empty script
                }
                if ("b.scorecardresearch.com".equals(request.getUrl().getHost())) {
                    return createWebResponse(request, "", "application/javascript"); // -> empty script
                }
                // Fetch these two scripts normally, but re-serve them with a JavaScript content type.
                if (request.getUrl().toString().startsWith("http://st.deviantart.net/css/v6core_jc.js")) {
                    WebResponse wr = super.getResponse(request);
                    return createWebResponse(request, wr.getContentAsString(), "application/javascript");
                }
                if (request.getUrl().toString().startsWith("http://st.deviantart.net/css/v6loggedin_jc.js")) {
                    WebResponse wr = super.getResponse(request);
                    return createWebResponse(request, wr.getContentAsString(), "application/javascript");
                }
                return super.getResponse(request);
            }
        });

        HtmlPage page = client.getPage(url);       //url with fragment identifier



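        // Saving the page first seems to make the later lookup work (reason unknown).
        File saveFile = new File("saved.html");
        if (saveFile.exists()) {
            saveFile.delete();
        }
        page.save(saveFile);

        // The lookup may yield null if the element is not in the page, so guard against it.
        HtmlElement img = page.getElementById("gmi-ResViewSizer_img");
        System.out.println(img != null ? img.toString() : "gmi-ResViewSizer_img not found");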

    }
}
