
Tags: html, r, screen-scraping, rselenium

How do I select only a specific class when looping over a list of WebElements in RSelenium?

Posted on 2025-01-31 11:46:30

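For purely educational purposes I'm trying to scrape reviews from a Dutch retail website using RSelenium (link). However, I'm struggling to extract the review information in the right format. Ultimately, my goal is to loop over all the reviews and extract only the pieces of information I need from each one (e.g. just the reviewer's location).

This is the HTML of the review list (piece 1) and of the information inside one specific review (piece 2):

[image: html piece 1]

[image: html piece 2]

I have saved the list of reviews like so: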

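library(RSelenium)

# Start a Selenium server and a Chrome session (version and port as in my setup)
rdriver <- rsDriver(browser = "chrome",
                    chromever = "101.0.4951.15",
                    port = 2232L
)

# The remoteDriver client used to control the browser
driver <- rdriver[["client"]]

# Every review container on the page, then the first one
reviews <- driver$findElements(using = 'xpath', '//*[@class="review js-review"]')
review <- reviews[[1]]
review$getElementText()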

The last command returns all the text in the first review: the review title, the reviewer's name, age and location, the body of the review, and so on:

1 "Zoek niet verder als je een tv zoek met deze grote en alle laatste Sma\nGer1965rotterdam 60-69 jaar Rotterdam 18 april 2022 Heeft dit artikel gekocht\nIk raad dit product aan\nGoede beeldkwaliteit\nEenvoudig in gebruik\nJuiste formaat\nHeeft alles wat een tv moet hebben onder ander Sat.tv.ontvanger en alle nieuwste Smart Mogelijkheden hij is eind februari 2022 op de Hollandse markt gekomen dus nieuwer kan het niet

But I would actually like to fetch only certain parts of the review, for instance just the location of the reviewer, in this case 'Rotterdam' at the end of the first line.

I tried:

check <- review$findElement(using = 'xpath', './/*[@data-test="review-author-city"]')
check$getElementText()

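But it still gives me the entire piece of text like before, not just 'Rotterdam'. Does anyone know what I'm doing wrong? I've searched online a lot but can't seem to find a solution. It should be possible to loop over a list of WebElements and extract only certain pieces of information from each of them, right? Like I said, I'm doing this for educational purposes, so I'm pretty new to the material.

Any help is greatly appreciated!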



芸娘子的小脾气 2025-02-07 11:46:30

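I think the problem with your code is that your reviews list contains WebElement objects. As far as I know, you cannot use findElement on a WebElement object.

What you could do to get the locations of all reviews is to fetch them directly with: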

driver$findElements(using = 'xpath', '//li[@data-test="review-author-city"]')
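Since findElements returns a list of WebElements rather than text, the text still has to be pulled out of each one. A minimal sketch of that step follows; `collect_text` is a hypothetical helper name, not part of RSelenium, and it assumes every element in the list answers `getElementText()` with a one-element list, which is how RSelenium WebElements behave as far as I know.

```r
# Hypothetical helper (not part of RSelenium): turn a list of WebElements
# into a plain character vector with one entry per element.
collect_text <- function(elements) {
  vapply(
    elements,
    # getElementText() returns a list of length one, so take its first entry
    function(el) el$getElementText()[[1]],
    character(1)
  )
}

# Usage with a live session (needs a running browser, so left commented out):
# city_elements <- driver$findElements(using = 'xpath', '//li[@data-test="review-author-city"]')
# cities <- collect_text(city_elements)
```

Note that this gives the cities in page order but does not tie each one to its review, so a review without a city would shift the alignment.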

Update: I tried it out myself in RSelenium and found that there is a method called findChildElement. You can find more information about it here: https://rdrr.io/cran/RSelenium/man/webElement-class.html

In your case this should work:

driver <- rdriver[["client"]]

reviews <- driver$findElements(using = 'xpath', '//*[@class="review js-review"]')
review <- reviews[[1]]

check <- review$findChildElement(using = 'xpath', './/*[@data-test="review-author-city"]')
check$getElementText()
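To go from the first review to all of them, the same child lookup can be applied inside a loop. The sketch below is one way to do that, not the only one; `extract_cities` is a hypothetical helper name. It uses findChildElements (plural), which returns a list, so a review with no city element can yield NA. That missing-city case is an assumption about the page, not something I checked.

```r
# Hypothetical helper (not part of RSelenium): one city per review,
# NA where a review has no matching child element.
extract_cities <- function(reviews) {
  vapply(reviews, function(review) {
    # Search only inside this review; the leading "." keeps the XPath relative
    hits <- review$findChildElements(
      using = "xpath",
      value = './/*[@data-test="review-author-city"]'
    )
    if (length(hits) == 0) {
      return(NA_character_)
    }
    # getElementText() returns a list of length one
    hits[[1]]$getElementText()[[1]]
  }, character(1))
}

# Usage with a live session (needs a running browser, so left commented out):
# reviews <- driver$findElements(using = 'xpath', '//*[@class="review js-review"]')
# cities <- extract_cities(reviews)
```

Because the result has exactly one entry per review, it can be combined with other per-review fields (name, date and so on) collected the same way.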



About the author: 失退
