Groovy HtmlUnit getFirstByXPath returns null + OCR question

Posted on 2024-10-10 16:46:52

I have had a few issues with HtmlUnit returning nulls lately and am looking for guidance. Every attempt to grab a value from the first row of a website's table has returned null. I am wondering if someone can

A) explain why they might be returning null

B) explain better ways (if there are some) to go about getting the information

Here is my current code (URL is in the source):

client = new WebClient(BrowserVersion.FIREFOX_3)
client.javaScriptEnabled = false

def url = "http://www.hidemyass.com/proxy-list/"

page = client.getPage(url)

IpAddress = page.getFirstByXPath("//html/body/div/div/form/table/tbody/tr/td[2]").getValue()
println "IP Address is: $IpAddress"     //returns null

//Port_Number is an Image

Country = page.getFirstByXPath("//html/body/div/div/form/table/tbody/tr/td[4][@class='country']/@rel").getValue()
println "Country abbreviation is: $Country"

//differentiate speed and connection by name of gif?

Type = page.getFirstByXPath("//html/body/div/div/form/table/tbody/tr/td[7]").getValue()
println "Proxy type is: $Type"

Anonymity = page.getFirstByXPath("//html/body/div/div/form/table/tbody/tr/td[8]").getValue()
println "Anonymity Level is: $Anonymity"

client.closeAllWindows()

Right now all of my XPaths return null and .getValue() obviously doesn't work on null.

I also have a question about the port: since it is rendered as an image, what should I do with it? Is there a better alternative than downloading it and trying to read it with OCR?

Side Note

This site has no particular significance; I was just looking for a site to practice scraping on. (On the last one I ran into issues with fragment identities that I couldn't get an answer to: "HtmlUnit getByXpath returns null" and "HtmlUnit and Fragment Identities".)

Comments (1)

披肩女神 2024-10-17 16:46:52

It looks like your XPath query is incorrect. Based on the URL provided in the code sample, the form element should be removed from the search path.

Here is an XPath query that is less likely to break when the layout of the page changes.

//table[@id='proxylist-table']/tbody/tr/td[2]
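
To make the difference between the two paths concrete, here is a minimal Groovy sketch. It is only an illustration under stated assumptions: the markup is a made-up, simplified stand-in for the proxy table (not the real page), the sample IP and rel value are invented, and firstNode is a hypothetical helper. It uses the JDK's built-in XPath engine rather than HtmlUnit so that it runs without a network connection; the same expressions should be usable with page.getFirstByXPath(...).

```groovy
import javax.xml.parsers.DocumentBuilderFactory
import javax.xml.xpath.XPathConstants
import javax.xml.xpath.XPathFactory

// Hypothetical, simplified stand-in for the proxy list markup (NOT the real page).
// Note that the table is not wrapped in a <form> element.
def html = '''<html><body><div><div>
  <table id="proxylist-table"><tbody>
    <tr><td>1 min</td><td>10.0.0.1</td><td>img</td><td class="country" rel="us">United States</td></tr>
  </tbody></table>
</div></div></body></html>'''

def doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
        .parse(new ByteArrayInputStream(html.getBytes('UTF-8')))
def xpath = XPathFactory.newInstance().newXPath()

// Illustrative helper: first matching node, or null when nothing matches.
def firstNode = { String expr -> xpath.evaluate(expr, doc, XPathConstants.NODE) }

def brittle = "//html/body/div/div/form/table/tbody/tr/td[2]"   // path from the question
def robust  = "//table[@id='proxylist-table']/tbody/tr/td[2]"   // path anchored on the table id

// Safe navigation (?.) yields null instead of throwing when the node is missing.
def viaForm = firstNode(brittle)?.textContent
def viaId   = firstNode(robust)?.textContent
def country = firstNode("//table[@id='proxylist-table']/tbody/tr/td[@class='country']/@rel")?.nodeValue

println "via form path: $viaForm"
println "via table id:  $viaId"
println "country code:  $country"
```

The path containing form matches nothing in this markup, so the safe-navigation call quietly returns null, while the id-anchored path finds the cell. In HtmlUnit itself, as far as I know, an element node such as a table cell exposes its text through getTextContent() rather than getValue(), so the same ?.textContent pattern would apply to the result of getFirstByXPath.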

As for the port number, the author of that page presumably did not want that part of the data to be scraped. OCR might be your best option.

However, one thing you could do is use the size of the returned image to guess the port number. For example, I've noticed that images displaying port 80 all have a content length of 406 or 411, while those for port 8080 are either 402 or 409. Each port image comes in two sizes so that it blends in with the row color: if the URL ends in 1 the image has a white background; if it ends in 0 it has a light grey background and is always a few bytes larger. There are obvious drawbacks to this approach, but it may work.
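
The size heuristic could be sketched roughly as follows in Groovy. This is an assumption-laden illustration, not a tested scraper: the lookup table contains only the four sizes mentioned above, the names portBySize, guessPort and fetchLength are hypothetical, and the sizes may well have changed or may collide for other ports.

```groovy
// Content lengths reported in this answer; any other size is treated as unknown.
def portBySize = [406: 80, 411: 80, 402: 8080, 409: 8080]

// Returns the guessed port, or null when the size is not in the table.
def guessPort = { int contentLength -> portBySize[contentLength] }

// Hypothetical helper (not invoked here): reads the Content-Length reported
// for an image URL using the JDK's URLConnection.
def fetchLength = { String imageUrl -> new URL(imageUrl).openConnection().contentLength }

println "406 bytes  -> ${guessPort(406)}"
println "409 bytes  -> ${guessPort(409)}"
println "1234 bytes -> ${guessPort(1234) ?: 'unknown, fall back to OCR'}"
```

Returning null for an unrecognised size keeps the guess explicit, so the caller can decide whether to fall back to OCR for that row.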
