Unable to web-scrape titles and authors from a website with multiple links

Posted on 2025-02-13 17:35:57

I am trying to web-scrape this link. As an example, I just want to scrape the first page. I would like to collect the title and author for each of the 10 links on the first page.

To gather the titles and authors, I wrote the following lines of code:

from bs4 import BeautifulSoup
import requests
import numpy as np

url = 'https://www.bis.org/cbspeeches/index.htm?m=1123'
  
r = BeautifulSoup(requests.get(url).content, features = "lxml")
r.select('#cbspeeches_list a') # '#cbspeeches_list a' got via SelectorGadget

However, I get an empty list. What am I doing wrong?

Thanks!

在你怀里撒娇 2025-02-20 17:35:58

The data is not in the HTML of the index page; it is loaded from an external source through an API, using a POST request. You just have to send your request to the API URL instead.

from bs4 import BeautifulSoup
import requests

# Form-encoded body for the POST request to the API endpoint
payload = 'from=&till=&objid=cbspeeches&page=&paging_length=10&sort_list=date_desc&theme=cbspeeches&ml=false&mlurl=&emptylisttext='
url = 'https://www.bis.org/doclist/cbspeeches.htm'
headers = {
    "content-type": "application/x-www-form-urlencoded",
    "X-Requested-With": "XMLHttpRequest"
}

req = requests.post(url, headers=headers, data=payload)
print(req)  # shows the response status
soup = BeautifulSoup(req.content, "lxml")

data = []
# Each table row holds one speech
for card in soup.select('.documentList tbody tr'):
    title = card.select_one('.title a').get_text()
    author = card.select_one('.authorlnk.dashed').get_text().strip()
    data.append({
        'title': title,
        'author': author
    })

print(data)

Output

[{'title': 'Pablo Hernández de Cos: Closing ceremony of the academic year 2021-2022', 'author': '\nPablo Hernández de Cos'},
 {'title': 'Klaas Knot: Keti Koti 2022 marks turning point for the Netherlands Bank ', 'author': '\nKlaas Knot'},
 {'title': 'Luis de Guindos: Challenges for monetary policy', 'author': '\nLuis de Guindos'},
 {'title': 'Fabio Panetta: Europe as a common shield -  protecting the euro area economy from global shocks', 'author': '\nFabio Panetta'},
 {'title': 'Victoria Cleland: Rowing in unison to enhance cross-border payments', 'author': '\nVictoria Cleland'},
 {'title': 'Yaron Amir: A look at the future world of payments - trends, the market, and regulation', 'author': '\nYaron Amir'},
 {'title': 'Ásgeir Jónsson: Speech – 61st Annual Meeting of the Central Bank of Iceland', 'author': '\nÁsgeir Jónsson'},
 {'title': 'Lesetja Kganyago: Project Khokha 2 report launch', 'author': '\nLesetja Kganyago'},
 {'title': 'Huw Pill: What did the monetarists ever do for us?', 'author': '\nHuw Pill'},
 {'title': 'Shaktikanta Das: Inaugural address - Statistics Day Conference ', 'author': '\nShaktikanta Das'}]
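
If a row ever lacks a title or author element, select_one returns None and the loop above would raise an AttributeError. The following is only a sketch of a more defensive version of the row parsing. It runs offline on a made-up HTML fragment: SAMPLE_HTML and parse_rows are hypothetical names, and the markup merely imitates the structure implied by the selectors in the answer, so the real response may look different.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment imitating the structure implied by the selectors
# '.documentList tbody tr', '.title a' and '.authorlnk.dashed'.
# The real response markup may differ.
SAMPLE_HTML = """
<table class="documentList">
  <tbody>
    <tr>
      <td>
        <div class="title"><a href="/review/r1.htm">Jane Doe: A sample speech </a></div>
        <a class="authorlnk dashed" href="/author/jane_doe.htm">
Jane Doe</a>
      </td>
    </tr>
    <tr>
      <td><div class="title"><a href="/review/r2.htm">Row without an author link</a></div></td>
    </tr>
  </tbody>
</table>
"""


def parse_rows(html):
    """Return one {'title', 'author'} dict per table row, with None for missing parts."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for row in soup.select(".documentList tbody tr"):
        title_tag = row.select_one(".title a")
        author_tag = row.select_one(".authorlnk.dashed")
        rows.append({
            # get_text(strip=True) removes leading/trailing whitespace and newlines
            "title": title_tag.get_text(strip=True) if title_tag else None,
            "author": author_tag.get_text(strip=True) if author_tag else None,
        })
    return rows


rows = parse_rows(SAMPLE_HTML)
for row in rows:
    print(row)
```

With a live response, the same function could presumably be called as parse_rows(req.content), since BeautifulSoup accepts bytes as well as strings.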

笑饮青盏花 2025-02-20 17:35:58

Try this:

import requests
from bs4 import BeautifulSoup

data = {
  'from': '',
  'till': '',
  'objid': 'cbspeeches',
  'page': '',
  'paging_length': '25',
  'sort_list': 'date_desc',
  'theme': 'cbspeeches',
  'ml': 'false',
  'mlurl': '',
  'emptylisttext': ''
}

response = requests.post('https://www.bis.org/doclist/cbspeeches.htm', data=data)

# Name the parser explicitly to avoid the "no parser specified" warning
soup = BeautifulSoup(response.content, "lxml")

for elem in soup.find_all("tr"):
    # the title
    print(elem.find("a").text)
    # the author
    print(elem.find("a", class_="authorlnk dashed").text)
    print()

Prints out:

Pablo Hernández de Cos: Closing ceremony of the academic year 2021-2022
Pablo Hernández de Cos

Klaas Knot: Keti Koti 2022 marks turning point for the Netherlands Bank 
Klaas Knot
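
The two answers send the same form fields, once as a pre-encoded string and once as a dict that requests form-encodes itself. As a small offline sketch using only the standard library, the string from the first answer can be decoded into a dict, changed, and re-encoded. The variable names fields and encoded are just illustrative.

```python
from urllib.parse import parse_qsl, urlencode

# Form body exactly as written in the first answer.
payload = ('from=&till=&objid=cbspeeches&page=&paging_length=10'
           '&sort_list=date_desc&theme=cbspeeches&ml=false&mlurl=&emptylisttext=')

# keep_blank_values=True preserves the empty fields such as 'from' and 'page'.
fields = dict(parse_qsl(payload, keep_blank_values=True))
print(fields)

# Changing one field and re-encoding gives a body like the second answer's dict.
fields['paging_length'] = '25'
encoded = urlencode(fields)
print(encoded)
```

Either form can be passed as data= to requests.post; the dict is usually easier to read and modify.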
