I need to do web scraping to get the links to different articles from different newspapers

Published 2025-02-03 04:32:07


I need to scrape Google News to get the links to different articles from different newspapers, and I have code that works fine for today's news. However, it doesn't work for older articles. For example, this code gets different article links from Google News:

from urllib.request import Request, urlopen
from bs4 import BeautifulSoup
import time
from newspaper import Article
import random

link = 'https://www.google.com/search?q=revuelta+la+tercera&rlz=1C1UEAD_esCL995CL995&biw=1536&bih=714&tbm=nws&ei=qEWUYorfOuiy5OUP-aGLgA4&ved=0ahUKEwiK07Wfr4b4AhVoGbkGHfnQAuAQ4dUDCA0&uact=5&oq=revuelta+la+tercera&gs_lcp=Cgxnd3Mtd2l6LW5ld3MQAzIFCCEQoAEyBQghEKABOgsIABCABBCxAxCDAToFCAAQgAQ6CAgAEIAEELEDOggIABCxAxCDAToKCAAQsQMQgwEQQzoECAAQQzoECAAQCjoGCAAQHhAWOggIABAeEA8QFlDIEliUnwFg1aABaAVwAHgAgAGSAYgBuw-SAQQyMS4ymAEAoAEBsAEAwAEB&sclient=gws-wiz-news'
time.sleep(random.randint(0, 6))  #----------stop---------#

req = Request(link, headers={'User-Agent': 'Mozilla/5.0'})
webpage = urlopen(req).read()
time.sleep(random.randint(0, 6))  #----------stop---------#

soup = BeautifulSoup(webpage, 'html5lib')
for item in soup.find_all('div', attrs={'class': 'ZINbbc luh4tb xpd O9g5cc uUPGi'}):
    raw_link = item.find('a', href=True)['href']

    # Strip Google's /url?q=...&sa=U&... redirect wrapper
    link = raw_link.split('/url?q=')[1].split('&sa=U&')[0]

    article = Article(link, language="es")
    article.download()
    article.parse()

    title = article.title
    descript = article.text
    date = article.publish_date

    print(title)
    print(descript)
    print(link)
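As a side note, the `split('/url?q=')` / `split('&sa=U&')` chain depends on those exact markers appearing in that order. A sketch of a more robust way to pull the target out of Google's redirect href, using the standard-library query-string parser (the `raw_link` value here is a hypothetical example, not taken from a real results page):

```python
from urllib.parse import urlparse, parse_qs

# Hypothetical href in the shape Google's result pages use
raw_link = "/url?q=https://www.latercera.com/noticia/ejemplo/&sa=U&ved=abc123"

# parse_qs handles parameter order and percent-decoding for us
params = parse_qs(urlparse(raw_link).query)
link = params["q"][0]
print(link)  # https://www.latercera.com/noticia/ejemplo/
```

This keeps working even if Google reorders the `sa`/`ved` parameters around `q`.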

Now I need to change the dates for the same search, so I just change the link to one with a custom date interval:

link = 'https://www.google.com/search?q=revuelta+la+tercera&rlz=1C1UEAD_esCL995CL995&biw=1536&bih=714&source=lnt&tbs=cdr%3A1%2Ccd_min%3A1%2F1%2F2018%2Ccd_max%3A1%2F6%2F2018&tbm=nws'
time.sleep(random.randint(0, 6))  #----------stop---------#

req = Request(link, headers={'User-Agent': 'Mozilla/5.0'})
webpage = urlopen(req).read()
time.sleep(random.randint(0, 6))  #----------stop---------#

soup = BeautifulSoup(webpage, 'html5lib')
for item in soup.find_all('div', attrs={'class': 'ZINbbc luh4tb xpd O9g5cc uUPGi'}):
    raw_link = item.find('a', href=True)['href']

    # Strip Google's /url?q=...&sa=U&... redirect wrapper
    link = raw_link.split('/url?q=')[1].split('&sa=U&')[0]

    article = Article(link, language="es")
    article.download()
    article.parse()

    title = article.title
    descript = article.text
    date = article.publish_date

    print(title)
    print(descript)
    print(link)

The links are supposed to be different (because of the changed search dates), but both versions give me the same results, and I don't understand why. Please help, guys, I have no idea how to fix this.


Comments (1)

花辞树 2025-02-10 04:32:07


The URL you provided is

https://www.google.com/search?q=revuelta+la+tercera&rlz=1C1UEAD_esCL995CL995&biw=1536&bih=714&source=lnt&tbs=cdr%3A1%2Ccd_min%3A1%2F1%2F2018%2Ccd_max%3A1%2F6%2F2018&tbm=nws

If you read it carefully, the cd_min and cd_max parameters seem to contain date data:

cd_min%3A1%2F1%2F2018%2Ccd_max%3A1%2F6%2F2018

The string above is sliced out of the URL, and it is URL-encoded. If you decode it, you will see:

cd_min:1/1/2018,cd_max:1/6/2018
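The encoding and decoding can be checked directly with the standard library, as a quick sanity check:

```python
from urllib.parse import unquote, quote_plus

# The encoded slice taken from the search URL
encoded = "cd_min%3A1%2F1%2F2018%2Ccd_max%3A1%2F6%2F2018"

decoded = unquote(encoded)
print(decoded)  # cd_min:1/1/2018,cd_max:1/6/2018

# Round-trip: re-encoding reproduces the original slice
assert quote_plus(decoded) == encoded
```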

So if you want to change the dates for your query, you need to change these values in the URL.

from urllib import parse

# URL ENCODE
# DON'T FORGET : and , 
start_date = parse.quote_plus(":1/1/2018,")
end_date = parse.quote_plus(":1/6/2018")

# CREATE QUERY
link = f"https://www.google.com/search?q=revuelta+la+tercera&rlz=1C1UEAD_esCL995CL995&biw=1536&bih=714&source=lnt&tbs=cdr%3A1%2Ccd_min{start_date}cd_max{end_date}&tbm=nws"

Writing the code to vary the dates is your job :)
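Building on that, a minimal sketch of generating one date-restricted search URL per interval. The helper name `news_url` and the month-long intervals are my own; the query string and `tbs=cdr:1,...` format are carried over from the answer above:

```python
from urllib.parse import quote_plus

def news_url(query: str, cd_min: str, cd_max: str) -> str:
    """Build a Google News search URL restricted to a custom date range (m/d/yyyy)."""
    # Encode the whole tbs value at once so :, , and / are escaped consistently
    tbs = quote_plus(f"cdr:1,cd_min:{cd_min},cd_max:{cd_max}")
    return f"https://www.google.com/search?q={quote_plus(query)}&tbs={tbs}&tbm=nws"

# One URL per interval; each can be fetched and parsed as in the question
intervals = [("1/1/2018", "1/31/2018"), ("2/1/2018", "2/28/2018")]
for start, end in intervals:
    print(news_url("revuelta la tercera", start, end))
```

Each generated URL can then be dropped into the question's fetch-and-parse loop in place of the hard-coded `link`.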
