Python Scrapy and Selenium print so much data I never expected
I'm working on a project that involves scraping several websites in one run. Since I can't parse some of them with bare Scrapy, I have to use Selenium. [I've already applied scrapy-selenium,
but one of the URLs keeps my desired elements inside shadow roots, so using Selenium's find_element method directly is unavoidable.]
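For context, the scrapy-selenium wiring is the stock one from its README; roughly this in settings.py (the driver path and arguments are placeholders for my local Chrome setup):

    # settings.py -- stock scrapy-selenium configuration
    from shutil import which

    SELENIUM_DRIVER_NAME = 'chrome'
    SELENIUM_DRIVER_EXECUTABLE_PATH = which('chromedriver')  # wherever chromedriver lives
    SELENIUM_DRIVER_ARGUMENTS = ['--headless']               # run Chrome without a window

    DOWNLOADER_MIDDLEWARES = {
        'scrapy_selenium.SeleniumMiddleware': 800,
    }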
My problem is that, although I do get my results, there is far too much printing in the terminal.
Any suggestions?
Thanks
Update
Here's the code:
from scrapy_selenium import SeleniumRequest
from selenium.webdriver.common.by import By

# in start_requests()
yield SeleniumRequest(url="https://eu.usatoday.com/",
                      callback=self.parse_usatoday)

# in parse_usatoday()
# the scrapy-selenium middleware exposes the live webdriver on the request meta
d = response.request.meta['driver']

cust = []
sections = d.find_elements(By.XPATH, "//div[@id='post-content']//promo-story-bucket-short")
for s in sections:
    shadow_root_main = s.shadow_root
    # a ShadowRoot's find_element only accepts CSS selectors, hence no XPath below
    sr2 = shadow_root_main.find_element(By.CSS_SELECTOR, "lit-story-thumb-large").shadow_root
    cust.append(sr2.find_element(By.CSS_SELECTOR, "a"))
    for lit in shadow_root_main.find_elements(By.CSS_SELECTOR, "lit-story-card"):
        sr2 = lit.shadow_root
        cust.append(sr2.find_element(By.CSS_SELECTOR, "a"))

# collect link targets and link texts into two parallel lists
al = []
at = []
for item in cust:
    href = item.get_attribute('href')
    al.append(href)
    at.append(item.text)
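From there the plan is simply to zip the two lists and yield plain dicts from the same callback (the field names here are placeholders, not anything Scrapy requires):

    for href, title in zip(al, at):
        yield {'url': href, 'title': title}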
I know I shouldn't be driving Selenium directly inside Scrapy, but this way I do get my results. The only problem is all that printing.
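I assume the flood comes from Scrapy's per-request DEBUG lines plus Selenium's HTTP client rather than from my own code; this is the kind of log-level tweak I have in mind, though I haven't confirmed which logger is actually responsible:

    # settings.py: raise Scrapy's own log level (hides the per-request DEBUG lines)
    LOG_LEVEL = 'WARNING'

    # anywhere before the crawl starts: quiet Selenium and its HTTP client too
    import logging
    logging.getLogger('selenium').setLevel(logging.WARNING)
    logging.getLogger('urllib3').setLevel(logging.WARNING)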