How do I fix "TypeError: list indices must be integers or slices, not str"?

Posted 2025-02-13 07:40:44

I'm trying to scrape a website. I want to retrieve a URL link from this webpage and use it to get to another page, where I can access the information I need.

import requests
from bs4 import BeautifulSoup

headers = {'User-agent': 'Mozilla/5.0 (Windows 10; Win64; x64; rv:101.0.1) Gecko/20100101 Firefox/101.0.1'}
baseUrl = 'https://elitejobstoday.com/'
url = "https://elitejobstoday.com/"

r = requests.get(url, headers = headers)
c = r.content
soup = BeautifulSoup(c, "lxml")

table = soup.find_all("a",  attrs = {"class": "job-details-link"})

This part works fine; however, the next part is where I get stuck.

def jobScan(link):
     
    the_job = {}

    jobUrl = '{}{}'.format(baseUrl, link['href'])
    the_job['urlLink'] = jobUrl

    job = requests.get(jobUrl, headers = headers )
    jobC = job.content
    jobSoup = BeautifulSoup(jobC, "lxml")

    name = jobSoup.find("h3", attrs={"class": "loop-item-title"})
    title = name.a.text
    the_job['title'] = title

    company = jobSoup.find_all("span", {"class": "job-company"})[0]
    company = company.text
    the_job['company'] = company

    print(the_job)

    return the_job

jobScan(table)

I'm getting this error:

"File "C:\Users\MUHUMUZA IVAN\Desktop\JobPortal\absa.py", line 41, in jobScan
    jobUrl = '{}{}'.format(baseUrl, link['href'])
TypeError: list indices must be integers or slices, not str "
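For reference, it is the same error you get from indexing any plain list with a string; a throwaway example (not my actual data):

items = ['a', 'b', 'c']
items[0]        # works: integer index
items['href']   # TypeError: list indices must be integers or slices, not str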

I'm clearly doing something wrong but I can't see it. I need your help. Thanks.

Comments (1)

小…红帽 2025-02-20 07:40:44

There are two main issues:

  • You are not iterating over the ResultSet of links: you pass the whole table (a list of Tag objects) to your function, so link['href'] indexes a list with a string, which raises the TypeError.

  • Your URLs become invalid when you prepend baseUrl; the href values on this page are already absolute, so just use jobUrl = link['href'] (both points are illustrated in the short sketch right after this list).
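
Both points can be checked in isolation. The minimal sketch below uses made-up markup rather than the live page, so the class name and href are only placeholders:

from urllib.parse import urljoin
from bs4 import BeautifulSoup

html = '<a class="job-details-link" href="https://elitejobstoday.com/jobs/x/">x</a>'
links = BeautifulSoup(html, "lxml").find_all("a", attrs={"class": "job-details-link"})

# find_all() returns a ResultSet, i.e. a list of Tag objects. Indexing that
# list with a string is what raises the TypeError; index the list first.
# links['href']        -> TypeError: list indices must be integers or slices, not str
print(links[0]['href'])  # -> https://elitejobstoday.com/jobs/x/

# If you are unsure whether an href is relative or absolute, urljoin() leaves
# absolute hrefs untouched and resolves relative ones against the base URL.
print(urljoin('https://elitejobstoday.com/', links[0]['href']))
print(urljoin('https://elitejobstoday.com/', '/jobs/x/'))
# both print: https://elitejobstoday.com/jobs/x/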

Note: You should also check whether the elements you are looking for exist in each response; a defensive variant is sketched after the example output below.

Example

The example iterates over the first two URLs. The third would give you an error because there is no <h3> in its response, but that should be asked as a new question with exactly that focus:

def jobScan(link):

    the_job = {}

    # the href is already an absolute URL, so there is no need to prepend baseUrl
    jobUrl = link['href']
    print(jobUrl)
    the_job['urlLink'] = jobUrl

    job = requests.get(jobUrl, headers=headers)
    jobC = job.content
    jobSoup = BeautifulSoup(jobC, "lxml")

    name = jobSoup.find("h3", attrs={"class": "loop-item-title"})
    title = name.a.text
    the_job['title'] = title

    company = jobSoup.find_all("span", {"class": "job-company"})[0]
    company = company.text
    the_job['company'] = company

    return the_job

data = []

# iterate over the ResultSet and scan one link at a time
for a in table[:2]:
    data.append(jobScan(a))

data
Output
[{'urlLink': 'https://elitejobstoday.com/jobs/office-assistant-ngo-careers-at-world-vision-uganda/',
  'title': 'Project Accountant – Lego Foundation Playful Parenting Project (NGO Careers) at World Vision Uganda',
  'company': ' World Vision Uganda\n'},
 {'urlLink': 'https://elitejobstoday.com/jobs/survey-enumerators-41-positions-ngo-careers-at-catholic-relief-services-2022/',
  'title': 'Project Accountant – Lego Foundation Playful Parenting Project (NGO Careers) at World Vision Uganda',
  'company': ' Catholic Relief Services (CRS)\n'}]
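
For the note above: find() returns None when nothing matches, so the third page (which has no <h3 class="loop-item-title">) would crash on name.a.text. The sketch below is only one way to handle that, assuming you would rather skip incomplete records than fail; jobScanSafe is a made-up name, and it reuses requests, BeautifulSoup, headers and table from the snippets above:

def jobScanSafe(link):
    # same idea as jobScan, but returns None instead of crashing when an
    # expected element is missing from the page
    the_job = {'urlLink': link['href']}

    job = requests.get(the_job['urlLink'], headers=headers)
    jobSoup = BeautifulSoup(job.content, "lxml")

    name = jobSoup.find("h3", attrs={"class": "loop-item-title"})
    company = jobSoup.find("span", attrs={"class": "job-company"})

    # find() returns None when nothing matches, so guard before dereferencing
    if name is None or name.a is None or company is None:
        return None

    the_job['title'] = name.a.text
    the_job['company'] = company.text.strip()  # strip() tidies the whitespace seen in the output above
    return the_job

data = []
for a in table:                 # table is the ResultSet from the question
    job = jobScanSafe(a)
    if job is not None:
        data.append(job)

With that guard in place you can iterate over the whole ResultSet instead of only the first two links.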