RCurl cannot retrieve the complete source text of a website - links missing?

Posted on 2024-12-11 23:51:15


I would like to use RCurl as a polite webcrawler to download data from a website.
Obviously I need the data for scientific research. Although I have the rights to access the content of the website via my university, the terms of use of the website forbid the use of webcrawlers.

I tried to ask the administrator of the site directly for the data but they only replied in a very vague fashion. Well anyway it seems like they won’t simply send the underlying databases to me.

What I want to do now is ask them officially for one-time permission to download specific text-only content from their site using R code based on RCurl, with a delay of three seconds after each request has been executed.

The addresses of the sites that I want to download data from look like this:
http://plants.jstor.org/specimen/ID of the site

I tried to program it with RCurl, but I cannot get it done.
A few things complicate matters:

  1. One can only access the website if cookies are allowed (I got that working in RCurl with the cookiefile argument; see the sketch after this list).

  2. The Next-button only appears in the source code when one actually accesses the site by clicking through the different links in a normal browser.
    In the source code the Next-button is encoded with an expression including

    <a href="/.../***ID of next site***">Next > > </a>
    

    When one tries to access the site directly (without having clicked through to it in the same browser before), it won't work; the line with the link is simply not in the source code.

  3. The IDs of the sites are combinations of letters and digits (like “goe0003746” or “cord00002203”), so I can't simply write a for-loop in R that tries every number from 1 to 1,000,000.
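
To make points 1 and 2 above concrete, here is a rough sketch of the kind of request and link extraction I have in mind (the cookie file name, the user agent string and the regular expression for the Next link are only assumptions on my part, not the site's actual markup):

 library(RCurl)

 ## point 1: a curl handle that stores and re-sends cookies
 ## ("cookies.txt" is just an example path for the cookie file)
 curl <- getCurlHandle(cookiefile = "cookies.txt",
                       followlocation = TRUE,
                       useragent = "Mozilla/5.0")

 ## point 2: fetch one specimen page (the ID is one of the examples above)
 ## and look for the href of the "Next" anchor in its source
 page <- getURL("http://plants.jstor.org/specimen/goe0003746", curl = curl)

 ## the pattern is only a guess at the markup around the Next link and
 ## would have to be adjusted to the page's real source code
 m <- regmatches(page, regexpr('href="[^"]+"[^>]*>[[:space:]]*Next', page))
 if (length(m) == 1) {
   next.path <- sub('.*href="([^"]+)".*', '\\1', m)
   next.url  <- paste0("http://plants.jstor.org", next.path)  ## assumes a site-relative href
 }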

So my program is supposed to mimic a person that clicks through all the sites via the Next-button, each time saving the textual content.

Each time after saving the content of a site, it should wait three seconds before clicking on the Next-button (it must be a polite crawler). I got that working in R as well using the Sys.sleep function.
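
Roughly, the loop I have in mind looks like the sketch below. It reuses the cookie-enabled handle from the sketch above; extract_next_id() is a hypothetical helper that would pull the next ID out of the page source along the lines of the regular expression shown there, and the starting ID is again just one of the example IDs:

 ## starting point: an ID reached manually in the browser first
 site.id  <- "goe0003746"
 base.url <- "http://plants.jstor.org/specimen/"
 dir.create("specimen_pages", showWarnings = FALSE)

 repeat {
   ## download the current site with the cookie-enabled curl handle
   page <- getURL(paste0(base.url, site.id), curl = curl)

   ## save the textual content (here simply the raw page source)
   writeLines(page, file.path("specimen_pages", paste0(site.id, ".html")))

   ## be polite: wait three seconds before "clicking" Next
   Sys.sleep(3)

   ## extract_next_id() is a hypothetical helper; it should return NA
   ## when the page contains no Next link
   next.id <- extract_next_id(page)
   if (is.na(next.id)) break
   site.id <- next.id
 }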

I also thought of using an automated program, but there seem to be a lot of such programs and I don’t know which one to use.

I’m also not exactly the program-writing person (apart from a little bit of R), so I would really appreciate a solution that doesn’t include programming in Python, C++, PHP or the like.

Any thoughts would be much appreciated! Thank you very much in advance for comments and proposals !!

Comments (2)

香橙ぽ 2024-12-18 23:51:15

Try a different strategy.

 ##########################
 ####
 ####            Scrape http://plants.jstor.org/specimen/
 ####        Idea:: Gather links from http://plants.jstor.org/search?t=2076
 ####            Then follow links:
 ####
 #########################

 library(RCurl)
 library(XML)

 ### get search page::

 cookie = 'cookiefile.txt'
 curl <- getCurlHandle(cookiefile = cookie,
     useragent = "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.6) Gecko/20070725 Firefox/2.0.0.6",
     header = FALSE,
     verbose = TRUE,
     netrc = TRUE,
     maxredirs = as.integer(20),
     followlocation = TRUE)

 querry.jstor <- getURL('http://plants.jstor.org/search?t=2076', curl = curl)

 ## remove white spaces:
 querry.jstor2 <- gsub('\r','', gsub('\t','', gsub('\n','', querry.jstor)))

 ### get links from search page
  getLinks = function() {
        links = character()
        list(a = function(node, ...) {
                    links <<- c(links, xmlGetAttr(node, "href"))
                    node
                 },
             links = function()links)
      }

 ## retrieve links: instantiate the handler closure and let it collect every href while parsing
  h1 <- getLinks()
  htmlTreeParse(querry.jstor2, asText = TRUE, handlers = h1)

 ## clean up links to keep only the ones we want
 ## (!grepl() keeps the vector intact when a pattern matches nothing)
  querry.jstor.links <- h1$links()
  querry.jstor.links <- querry.jstor.links[!grepl('http', querry.jstor.links)]       ## remove all links containing http
  querry.jstor.links <- querry.jstor.links[!grepl('search', querry.jstor.links)]     ## remove all search links
  querry.jstor.links <- querry.jstor.links[!grepl('#', querry.jstor.links)]          ## remove all # links
  querry.jstor.links <- querry.jstor.links[!grepl('javascript', querry.jstor.links)] ## remove all javascript links
  querry.jstor.links <- querry.jstor.links[!grepl('action', querry.jstor.links)]     ## remove all action links
  querry.jstor.links <- querry.jstor.links[!grepl('page', querry.jstor.links)]       ## remove all page links

 ## number of results (the first //article node's text presumably looks like "1,234 Results ...")
  jstor.article <- getNodeSet(htmlTreeParse(querry.jstor2, asText = TRUE, useInternalNodes = TRUE), "//article")
  NumOfRes <- strsplit(gsub(',', '', gsub(' ', '', xmlValue(jstor.article[[1]][[1]]))), split='')[[1]]
  NumOfRes <- as.numeric(paste(NumOfRes[1:(min(grep('R', NumOfRes)) - 1)], collapse = ''))  ## keep the digits before the first "R"

  for(i in 2:ceiling(NumOfRes/20)){
    ## build the URL of result page i
    querry.jstor <- getURL(paste0('http://plants.jstor.org/search?t=2076&p=', i), curl = curl)
    ## remove white spaces:
    querry.jstor2 <- gsub('\r','', gsub('\t','', gsub('\n','', querry.jstor)))
    ## a fresh handler per page, so links are not collected twice
    h1 <- getLinks()
    htmlTreeParse(querry.jstor2, asText = TRUE, handlers = h1)
    new.links <- h1$links()
    new.links <- new.links[!grepl('http', new.links)]       ## remove all links containing http
    new.links <- new.links[!grepl('search', new.links)]     ## remove all search links
    new.links <- new.links[!grepl('#', new.links)]          ## remove all # links
    new.links <- new.links[!grepl('javascript', new.links)] ## remove all javascript links
    new.links <- new.links[!grepl('action', new.links)]     ## remove all action links
    new.links <- new.links[!grepl('page', new.links)]       ## remove all page links
    querry.jstor.links <- c(querry.jstor.links, new.links)

    Sys.sleep(abs(rnorm(1, mean=3.0, sd=0.5)))              ## polite, slightly randomized delay around three seconds
  }

  ## make directory for saving data: 
  dir.create('./jstorQuery/')

  ## Now we have all the links, so we can retrieve all the info
  for(j in seq_along(querry.jstor.links)){
    if(nchar(querry.jstor.links[j]) != 1){
       ## paste the relative link onto the host
       querry.jstor <- getURL(paste0('http://plants.jstor.org', querry.jstor.links[j]), curl = curl)
       ## remove white spaces:
       querry.jstor2 <- gsub('\r','', gsub('\t','', gsub('\n','', querry.jstor)))

       ## construct a file name from the last path component of the link
       filename <- basename(querry.jstor.links[j])

       ## save in directory:
       write(querry.jstor2, file = paste('./jstorQuery/', filename, '.html', sep = ''))

       Sys.sleep(abs(rnorm(1, mean=3.0, sd=0.5)))           ## polite, slightly randomized delay around three seconds
    }
  }
醉梦枕江山 2024-12-18 23:51:15


I may be missing exactly the bit you are hung up on, but it sounds like you are almost there.

It seems you can request page 1 with cookies on. Then parse the content searching for the next site ID, then request that page by building the URL with the next site ID. Then scrape whatever data you want.
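
For that last scraping step, a minimal sketch with the XML package (the XPath expression is only a placeholder, and page is assumed to hold the HTML string returned by getURL()):

 library(XML)

 ## parse the downloaded page (asText because "page" is an HTML string, not a file path)
 doc <- htmlParse(page, asText = TRUE)

 ## "//div[@class='metadata']" is only a placeholder XPath; the real node
 ## names depend on the page's actual markup
 fields <- xpathSApply(doc, "//div[@class='metadata']", xmlValue)

 free(doc)  ## release the internal document when done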

It sounds like you have code that does almost all of this. Is the problem parsing page 1 to get the ID for the next step? If so, you should formulate a reproducible example and I suspect you'll get a very fast answer to your syntax problems.

If you're having trouble seeing what the site is doing, I recommend using the Tamper Data plug-in for Firefox. It will let you see what request is being made at each mouse click. I find it really useful for this type of thing.
