How do I fix "TypeError: list indices must be integers or slices, not str"?

Posted 2025-02-13 07:40:44

I'm trying to scrape a website. I want to retrieve a URL link from this webpage and use it to get to another page, where I can access the information I need.

import requests
from bs4 import BeautifulSoup

headers = {'User-agent': 'Mozilla/5.0 (Windows 10; Win64; x64; rv:101.0.1) Gecko/20100101 Firefox/101.0.1'}
baseUrl = 'https://elitejobstoday.com/'
url = "https://elitejobstoday.com/"

r = requests.get(url, headers = headers)
c = r.content
soup = BeautifulSoup(c, "lxml")

table = soup.find_all("a",  attrs = {"class": "job-details-link"})

This part works fine; however, the next part is where I get stuck.

def jobScan(link):
     
    the_job = {}

    jobUrl = '{}{}'.format(baseUrl, link['href'])
    the_job['urlLink'] = jobUrl

    job = requests.get(jobUrl, headers = headers )
    jobC = job.content
    jobSoup = BeautifulSoup(jobC, "lxml")

    name = jobSoup.find("h3", attrs={"class": "loop-item-title"})
    title = name.a.text
    the_job['title'] = title

    company = jobSoup.find_all("span", {"class": "job-company"})[0]
    company = company.text
    the_job['company'] = company

    print(the_job)

    return the_job

jobScan(table)

I'm getting this error:

"File "C:\Users\MUHUMUZA IVAN\Desktop\JobPortal\absa.py", line 41, in jobScan
    jobUrl = '{}{}'.format(baseUrl, link['href'])
TypeError: list indices must be integers or slices, not str "
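For reference, it is the same error you get from indexing any plain list with a string; a throwaway example (not my actual data):

items = ['a', 'b', 'c']
items[0]        # works: integer index
items['href']   # TypeError: list indices must be integers or slices, not str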

I'm clearly doing something wrong but I can't see it. I need your help. Thanks.

Comments (1)

小…红帽 2025-02-20 07:40:44

There are two main issues:

  • You are not iterating over the ResultSet of links: you pass the whole table (a list of Tag objects) to your function, so link['href'] indexes a list with a string, which raises the TypeError.

  • Your URLs become invalid when you prepend baseUrl; the href values on this page are already absolute, so just use jobUrl = link['href'] (both points are illustrated in the short sketch right after this list).
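
Both points can be checked in isolation. The minimal sketch below uses made-up markup rather than the live page, so the class name and href are only placeholders:

from urllib.parse import urljoin
from bs4 import BeautifulSoup

html = '<a class="job-details-link" href="https://elitejobstoday.com/jobs/x/">x</a>'
links = BeautifulSoup(html, "lxml").find_all("a", attrs={"class": "job-details-link"})

# find_all() returns a ResultSet, i.e. a list of Tag objects. Indexing that
# list with a string is what raises the TypeError; index the list first.
# links['href']        -> TypeError: list indices must be integers or slices, not str
print(links[0]['href'])  # -> https://elitejobstoday.com/jobs/x/

# If you are unsure whether an href is relative or absolute, urljoin() leaves
# absolute hrefs untouched and resolves relative ones against the base URL.
print(urljoin('https://elitejobstoday.com/', links[0]['href']))
print(urljoin('https://elitejobstoday.com/', '/jobs/x/'))
# both print: https://elitejobstoday.com/jobs/x/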

Note: You should also check whether the elements you are looking for exist in each response; a defensive variant is sketched after the example output below.

Example

The example iterates over the first two URLs. The third would give you an error because there is no <h3> in its response, but that should be asked as a new question with exactly that focus:

def jobScan(link):

    the_job = {}

    # the href is already an absolute URL, so there is no need to prepend baseUrl
    jobUrl = link['href']
    print(jobUrl)
    the_job['urlLink'] = jobUrl

    job = requests.get(jobUrl, headers=headers)
    jobC = job.content
    jobSoup = BeautifulSoup(jobC, "lxml")

    name = jobSoup.find("h3", attrs={"class": "loop-item-title"})
    title = name.a.text
    the_job['title'] = title

    company = jobSoup.find_all("span", {"class": "job-company"})[0]
    company = company.text
    the_job['company'] = company

    return the_job

data = []

# iterate over the ResultSet and scan one link at a time
for a in table[:2]:
    data.append(jobScan(a))

data
Output
[{'urlLink': 'https://elitejobstoday.com/jobs/office-assistant-ngo-careers-at-world-vision-uganda/',
  'title': 'Project Accountant – Lego Foundation Playful Parenting Project (NGO Careers) at World Vision Uganda',
  'company': ' World Vision Uganda\n'},
 {'urlLink': 'https://elitejobstoday.com/jobs/survey-enumerators-41-positions-ngo-careers-at-catholic-relief-services-2022/',
  'title': 'Project Accountant – Lego Foundation Playful Parenting Project (NGO Careers) at World Vision Uganda',
  'company': ' Catholic Relief Services (CRS)\n'}]
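
For the note above: find() returns None when nothing matches, so the third page (which has no <h3 class="loop-item-title">) would crash on name.a.text. The sketch below is only one way to handle that, assuming you would rather skip incomplete records than fail; jobScanSafe is a made-up name, and it reuses requests, BeautifulSoup, headers and table from the snippets above:

def jobScanSafe(link):
    # same idea as jobScan, but returns None instead of crashing when an
    # expected element is missing from the page
    the_job = {'urlLink': link['href']}

    job = requests.get(the_job['urlLink'], headers=headers)
    jobSoup = BeautifulSoup(job.content, "lxml")

    name = jobSoup.find("h3", attrs={"class": "loop-item-title"})
    company = jobSoup.find("span", attrs={"class": "job-company"})

    # find() returns None when nothing matches, so guard before dereferencing
    if name is None or name.a is None or company is None:
        return None

    the_job['title'] = name.a.text
    the_job['company'] = company.text.strip()  # strip() tidies the whitespace seen in the output above
    return the_job

data = []
for a in table:                 # table is the ResultSet from the question
    job = jobScanSafe(a)
    if job is not None:
        data.append(job)

With that guard in place you can iterate over the whole ResultSet instead of only the first two links.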