In R, how do I parse a specific frame of a web page?

Posted 2024-10-04 00:06:22

Greetings all,

Is there a way to only read the HTML code from a specific frame within a webpage?

For example, if I submit a url to google translate, is there a way to parse only the translated page frame? Whenever I try, I can only access the top frame on the page but not the translated frame. Here is my self-contained sample code:

library(XML)
url <- "http://www.baidu.com/s?wd=r+project"
url.google.translate <- URLencode(paste("http://translate.google.com/translate?js=y&prev=_t&hl=en&ie=UTF-8&layout=1&eotf=1&sl=zh-CN&tl=en&u=", url, sep=""))
htmlTreeParse(url.google.translate, useInternalNodes = FALSE)

The above code refers to this url:

$file
[1] "http://translate.google.com/translate?js=y&prev=_t&hl=en&ie=UTF-8&layout=1&eotf=1&sl=zh-CN&tl=en&u=http://www.baidu.com/s?wd=r+project"

The output, however, only accesses the top frame of the page and not the main frame, which is the one I am interested in.

Hope that made sense and thanks in advance for any help.

Tony

UPDATE - Thanks to the answer from @kwantam below (accepted), I was able to use it to get my solution as follows (self-contained):

> # Load R packages
> library(RCurl)
> library(XML)
> 
> # STAGE 1 - find forward url in relevant frame
> ( url <- "http://www.baidu.com/s?wd=r+project" )
[1] "http://www.baidu.com/s?wd=r+project"
> gt.url <- URLencode(paste("http://translate.google.com/translate?js=y&prev=_t&hl=en&ie=UTF-8&layout=1&eotf=1&sl=zh-CN&tl=en&u=", url, sep=""))
> gt.doc <- getURL(gt.url)
> gt.html <- htmlTreeParse(gt.doc, useInternalNodes = TRUE, error=function(...){})
> nodes <- getNodeSet(gt.html, '//frameset//frame[@name="c"]')
> gt.parameters <- sapply(nodes, function(x) xmlAttrs(x)[["src"]])
> gt.url <- paste("http://translate.google.com", gt.parameters, sep = "")
> 
> # STAGE 2 - find forward url to translated page
> doc <- getURL(gt.url, followlocation = TRUE)
> html <- htmlTreeParse(doc, useInternalNodes = TRUE, error=function(...){})
> url.trans <- capture.output(getNodeSet(html, '//meta[@http-equiv="refresh"]')[[1]])
> url.trans <- strsplit(url.trans, "URL=", fixed = TRUE)[[1]][2]
> url.trans <- gsub("\"/>", "", url.trans, fixed = TRUE)
> url.trans <- xmlValue(getNodeSet(htmlParse(url.trans, asText = TRUE), "//p")[[1]])
> 
> # STAGE 3 - load translated page
> url.trans
[1] "http://translate.googleusercontent.com/translate_c?hl=en&ie=UTF-8&sl=zh-CN&tl=en&u=http://www.baidu.com/s%3Fwd%3Dr%2520project&prev=_t&rurl=translate.google.com&usg=ALkJrhiCMu1mKv-czCmEaB7PO925TJCa-A "
> #getURL(url.trans)

If anyone knows of a simpler solution to what I've given above then please feel free to let me know! :)

孤千羽 2024-10-11 00:06:22

Most of this answer is specific to Google Translate. In most cases, you'll just need to parse the <frameset> and pull out whichever frame you're looking for, though it might not be immediately obvious from the HTML which one is the main frame (perhaps look at the relative sizing of the frames).
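For the general case, a minimal sketch of pulling a frame's address out of a frameset with the XML package might look like the following. The page text, the frame names "t" and "c", and the src values are made up for illustration; a real page would be fetched first, and its frame names would need to be inspected by hand.

```r
library(XML)

# Hypothetical frameset page standing in for a fetched document.
page <- '<html><frameset rows="100,*">
  <frame name="t" src="/top_bar">
  <frame name="c" src="/translate_p?hl=en">
</frameset></html>'

doc <- htmlParse(page, asText = TRUE)

# Collect every frame, then index the src attributes by frame name.
frames <- getNodeSet(doc, "//frame")
frame.src <- sapply(frames, xmlGetAttr, "src")
names(frame.src) <- sapply(frames, xmlGetAttr, "name")

# Address of the frame named "c" (relative to the host that served the page).
frame.src[["c"]]
```

Listing all frames with their names and sizes first, rather than hard-coding one XPath, makes it easier to see which frame holds the main content.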

It looks like you're going to have to follow a few refreshes to get the actual content. In particular, when you grab the URL you just mentioned, you'll see something like

  *snip*
<noframes>
<script>
<!--document.location="/translate_p?hl=en&ie=UTF-8&sl=zh-CN&tl=en&u=http://www.baidu.com/s%3Fwd%3Dr%2520project&prev=_t&usg=asdf";-->
</script>
<a href="/translate_p?hl=en&ie=UTF-8&sl=zh-CN&tl=en&u=http://www.baidu.com/s%3Fwd%3Dr%2520project&prev=_t&usg=asdf">Translate
</a>
</noframes>
  *snip*

If you follow the link here (remember to unescape each '&amp;' in the raw HTML to '&' first), it'll give you another small HTML fragment, which includes

<meta http-equiv="refresh" content="0;URL=http://translate.googleusercontent.com/translate_c?hl=en&ie=UTF-8&sl=zh-CN&tl=en&u=http://www.baidu.com/s%3Fwd%3Dr%2520project&prev=_t&rurl=translate.google.com&usg=asdf">

Again, after unescaping each '&amp;' and following the refresh, you'll have the translated page that you're looking for.
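As a rough illustration of that last step, the refresh target can be pulled out of the fragment and unescaped with base R string functions. This is only a sketch: the fragment below is a shortened, hypothetical stand-in for the real response, the usg value is a placeholder, and only the '&amp;' entity is handled.

```r
# Hypothetical fragment standing in for the second response.
fragment <- '<meta http-equiv="refresh" content="0;URL=http://translate.googleusercontent.com/translate_c?hl=en&amp;tl=en&amp;usg=asdf">'

# Pull out the value of the content attribute.
content <- sub('.*content="([^"]*)".*', "\\1", fragment)

# Drop the delay prefix ("0;URL="), leaving only the target address.
target <- sub("^[^;]*;\\s*URL=", "", content, ignore.case = TRUE)

# Unescape the HTML entity for the ampersand.
target <- gsub("&amp;", "&", target, fixed = TRUE)
target
```

The resulting string is what would then be passed to something like getURL() to fetch the translated page.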

Play with this in wget or curl and it should become clearer what you're going to need to do.
