Python Scrapy and Selenium print a lot of data I never expected

Posted 2025-01-09 14:07:03

I'm working on a project that involves scraping several websites in one run. Since I can't parse some of them with Scrapy alone, I have to use Selenium. [I've already set up scrapy-selenium, but one of the URLs has the elements I want inside shadow roots, so using Selenium's find_element method is unavoidable.]

My problem is that although I do get my results, there is far too much output printed in the terminal. (Screenshot of the terminal output omitted here.)

Any suggestions?

Thanks

Update

Here's the code:

# imports needed at the top of the spider module
from scrapy_selenium import SeleniumRequest
from selenium.webdriver.common.by import By

# in start_requests()
yield SeleniumRequest(url="https://eu.usatoday.com/",
                      callback=self.parse_usatoday)

# in parse_usatoday(); scrapy-selenium exposes the driver on the request meta
d = response.request.meta['driver']
cust = []
sections = d.find_elements(By.XPATH, "//div[@id='post-content']//promo-story-bucket-short")
for s in sections:
    # descend through the nested shadow roots to reach the anchor elements
    shadow_root_main = s.shadow_root
    sr2 = shadow_root_main.find_element(By.CSS_SELECTOR, "lit-story-thumb-large").shadow_root
    cust.append(sr2.find_element(By.CSS_SELECTOR, "a"))
    for lit in shadow_root_main.find_elements(By.CSS_SELECTOR, "lit-story-card"):
        sr2 = lit.shadow_root
        cust.append(sr2.find_element(By.CSS_SELECTOR, "a"))
al = []  # collected hrefs
at = []  # collected link texts
for item in cust:
    href = item.get_attribute('href')
    al.append(href)
    at.append(item.text)

I know I shouldn't be mixing Selenium into Scrapy like this, but it gets me the results I need. The only problem is the sheer volume of output.
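The flood of output most likely comes from the loggers that Selenium pulls in (chiefly `selenium` and `urllib3`) and from Scrapy's own DEBUG-level logging, rather than from the spider code itself. A minimal sketch of quieting both, assuming the standard logger names those libraries register with the stdlib `logging` module:

```python
import logging

# Raise the threshold on the third-party loggers that selenium drags in.
# "selenium" and "urllib3" are the standard logger names used by those
# libraries; anything below WARNING from them is then suppressed.
for name in ("selenium", "urllib3"):
    logging.getLogger(name).setLevel(logging.WARNING)

# Scrapy's own verbosity is controlled project-wide in settings.py:
#   LOG_LEVEL = "INFO"   # or "WARNING" to hide the DEBUG/INFO lines
```

If the noise is coming from chromedriver itself rather than Python-side logging, passing `--log-level=3` to Chrome (for scrapy-selenium, via the `SELENIUM_DRIVER_ARGUMENTS` setting) may also help; treat that flag as an assumption to verify against your driver version.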

