Python Scrapy and Selenium print so much data I never expected
I'm working on a project that involves scraping several websites in one run. Since I can't parse some of them with bare Scrapy, I have to use Selenium. [I've already applied scrapy-selenium,
but one of the URLs keeps my desired elements inside shadow roots, so using Selenium's find_element method directly is unavoidable.]
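For context, the scrapy-selenium wiring is the stock one from its README; roughly this in settings.py (the driver path and arguments are placeholders for my local Chrome setup):

    # settings.py -- stock scrapy-selenium configuration
    from shutil import which

    SELENIUM_DRIVER_NAME = 'chrome'
    SELENIUM_DRIVER_EXECUTABLE_PATH = which('chromedriver')  # wherever chromedriver lives
    SELENIUM_DRIVER_ARGUMENTS = ['--headless']               # run Chrome without a window

    DOWNLOADER_MIDDLEWARES = {
        'scrapy_selenium.SeleniumMiddleware': 800,
    }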
My problem is that, although I do get my results, there is far too much printing in the terminal.
Any suggestions?
Thanks
Update
Here's the code:
from scrapy_selenium import SeleniumRequest
from selenium.webdriver.common.by import By

# in start_requests()
yield SeleniumRequest(url="https://eu.usatoday.com/",
                      callback=self.parse_usatoday)

# in parse_usatoday()
# the scrapy-selenium middleware exposes the live webdriver on the request meta
d = response.request.meta['driver']

cust = []
sections = d.find_elements(By.XPATH, "//div[@id='post-content']//promo-story-bucket-short")
for s in sections:
    shadow_root_main = s.shadow_root
    # a ShadowRoot's find_element only accepts CSS selectors, hence no XPath below
    sr2 = shadow_root_main.find_element(By.CSS_SELECTOR, "lit-story-thumb-large").shadow_root
    cust.append(sr2.find_element(By.CSS_SELECTOR, "a"))
    for lit in shadow_root_main.find_elements(By.CSS_SELECTOR, "lit-story-card"):
        sr2 = lit.shadow_root
        cust.append(sr2.find_element(By.CSS_SELECTOR, "a"))

# collect link targets and link texts into two parallel lists
al = []
at = []
for item in cust:
    href = item.get_attribute('href')
    al.append(href)
    at.append(item.text)
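From there the plan is simply to zip the two lists and yield plain dicts from the same callback (the field names here are placeholders, not anything Scrapy requires):

    for href, title in zip(al, at):
        yield {'url': href, 'title': title}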
I know I shouldn't be driving Selenium directly inside Scrapy, but this way I do get my results. The only problem is all that printing.
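I assume the flood comes from Scrapy's per-request DEBUG lines plus Selenium's HTTP client rather than from my own code; this is the kind of log-level tweak I have in mind, though I haven't confirmed which logger is actually responsible:

    # settings.py: raise Scrapy's own log level (hides the per-request DEBUG lines)
    LOG_LEVEL = 'WARNING'

    # anywhere before the crawl starts: quiet Selenium and its HTTP client too
    import logging
    logging.getLogger('selenium').setLevel(logging.WARNING)
    logging.getLogger('urllib3').setLevel(logging.WARNING)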