I need to web scrape to get links to different articles from different newspapers
I need to web scrape to get links to different articles from different newspapers, and my code works very well for today's news (from Google News). However, it does not work for older articles. For example, this code gets the different article links from Google News:
from urllib.request import Request, urlopen
from bs4 import BeautifulSoup
import requests
import time
from newspaper import Article
import random
import pandas as pd
root = 'https://www.google.com/'
time.sleep(random.randint(0, 3)) #----------stop---------#
link = 'https://www.google.com/search?q=revuelta+la+tercera&rlz=1C1UEAD_esCL995CL995&biw=1536&bih=714&tbm=nws&ei=qEWUYorfOuiy5OUP-aGLgA4&ved=0ahUKEwiK07Wfr4b4AhVoGbkGHfnQAuAQ4dUDCA0&uact=5&oq=revuelta+la+tercera&gs_lcp=Cgxnd3Mtd2l6LW5ld3MQAzIFCCEQoAEyBQghEKABOgsIABCABBCxAxCDAToFCAAQgAQ6CAgAEIAEELEDOggIABCxAxCDAToKCAAQsQMQgwEQQzoECAAQQzoECAAQCjoGCAAQHhAWOggIABAeEA8QFlDIEliUnwFg1aABaAVwAHgAgAGSAYgBuw-SAQQyMS4ymAEAoAEBsAEAwAEB&sclient=gws-wiz-news'
time.sleep(random.randint(0, 6)) #----------stop---------#
req = Request(link, headers = {'User-Agent': 'Mozilla/5.0'})
time.sleep(random.randint(0, 3)) #----------stop---------#
requests.get(link, headers = {'User-agent': 'your bot 0.1'})
time.sleep(random.randint(0, 6)) #----------stop---------#
webpage = urlopen(req).read()
time.sleep(random.randint(0, 6)) #----------stop---------#
with requests.Session() as c:
    soup = BeautifulSoup(webpage, 'html5lib')
    for item in soup.find_all('div', attrs={'class': 'ZINbbc luh4tb xpd O9g5cc uUPGi'}):
        raw_link = item.find('a', href=True)['href']
        link = raw_link.split('/url?q=')[1].split('&sa=U&')[0]
        article = Article(link, language="es")
        article.download()
        article.parse()
        title = article.title
        descript = article.text
        date = article.publish_date
        print(title)
        print(descript)
        print(link)
Now I need to change the dates for the same search, so I only change the link to use a custom date range:
root = 'https://www.google.com/'
time.sleep(random.randint(0, 3)) #----------stop---------#
link = 'https://www.google.com/search?q=revuelta+la+tercera&rlz=1C1UEAD_esCL995CL995&biw=1536&bih=714&source=lnt&tbs=cdr%3A1%2Ccd_min%3A1%2F1%2F2018%2Ccd_max%3A1%2F6%2F2018&tbm=nws'
time.sleep(random.randint(0, 6)) #----------stop---------#
req = Request(link, headers = {'User-Agent': 'Mozilla/5.0'})
time.sleep(random.randint(0, 3)) #----------stop---------#
requests.get(link, headers = {'User-agent': 'your bot 0.1'})
time.sleep(random.randint(0, 6)) #----------stop---------#
webpage = urlopen(req).read()
time.sleep(random.randint(0, 6)) #----------stop---------#
with requests.Session() as c:
    soup = BeautifulSoup(webpage, 'html5lib')
    for item in soup.find_all('div', attrs={'class': 'ZINbbc luh4tb xpd O9g5cc uUPGi'}):
        raw_link = item.find('a', href=True)['href']
        link = raw_link.split('/url?q=')[1].split('&sa=U&')[0]
        article = Article(link, language="es")
        article.download()
        article.parse()
        title = article.title
        descript = article.text
        date = article.publish_date
        print(title)
        print(descript)
        print(link)
The links should be different (because the search dates changed), but both versions give me exactly the same results, and I don't understand why. Please help guys, I don't know how to fix this.
1 Answer
Look carefully at the URL you provided for the custom date range. If you read it, its tbs parameter contains cd_min and cd_max, which seem to carry the datetime data:

tbs=cdr%3A1%2Ccd_min%3A1%2F1%2F2018%2Ccd_max%3A1%2F6%2F2018

This slice of the URL is URL-encoded; if you decode it, you will see the dates.
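For example, a quick check using only the standard library's urllib.parse (the variable name tbs here is just for illustration):

from urllib.parse import unquote

# The still-encoded tbs value copied from the second URL in the question
tbs = 'cdr%3A1%2Ccd_min%3A1%2F1%2F2018%2Ccd_max%3A1%2F6%2F2018'
print(unquote(tbs))  # -> cdr:1,cd_min:1/1/2018,cd_max:1/6/2018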
So if you want to change the dates for your query, you should change those values in the URL. The code to randomize the date is your job :)
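As a starting point, here is a minimal sketch of that last step, assuming Google still accepts the tbm=nws and tbs=cdr:... parameters seen above; the helpers random_date and build_news_url are made-up names for illustration, not part of any library:

import random
from datetime import date, timedelta
from urllib.parse import quote, quote_plus

def random_date(start, end):
    # Pick a uniformly random day between start and end (inclusive)
    return start + timedelta(days=random.randint(0, (end - start).days))

def build_news_url(query, cd_min, cd_max):
    # cd_min/cd_max use the M/D/YYYY format seen in the question's URL;
    # quote(..., safe='') reproduces the %3A/%2C/%2F encoding of tbs
    tbs = quote(f'cdr:1,cd_min:{cd_min},cd_max:{cd_max}', safe='')
    return ('https://www.google.com/search?q=' + quote_plus(query)
            + '&tbm=nws&tbs=' + tbs)

d = random_date(date(2018, 1, 1), date(2018, 6, 1))
cd_min = f'{d.month}/{d.day}/{d.year}'
e = d + timedelta(days=30)  # an arbitrary 30-day window
cd_max = f'{e.month}/{e.day}/{e.year}'
print(build_news_url('revuelta la tercera', cd_min, cd_max))

Feeding the printed URL into your request code above should then return results restricted to that random window (assuming Google does not block the scraper).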