How to scrape another span with the same class
I'm using BeautifulSoup4; this is my code:

    import requests
    from bs4 import BeautifulSoup

    def extract(page):
        url = f'https://www.jobstreet.com.my/en/job-search/personal-assistant-jobs/{page}/'
        r = requests.get(url)
        soup = BeautifulSoup(r.content, 'html.parser')
        return soup

    def transform(soup):
        divs = soup.find_all('div', class_='sx2jih0 zcydq876 zcydq866 zcydq896 zcydq886 zcydq8n zcydq856 zcydq8f6 zcydq8eu')
        for items in divs:
            location = items.find('span', attrs={'class': 'sx2jih0 zcydq84u _18qlyvc0 _18qlyvc1x _18qlyvc3 _18qlyvc7'}).text.strip()
            salary = items.find_next_sibling('span', attrs={'class': 'sx2jih0 zcydq84u _18qlyvc0 _18qlyvc1x _18qlyvc3 _18qlyvc7'}).text.strip()

Both spans have the same class, but when I scrape them, both results are the same.
Comments (3)
Get all the spans with find_all, then discard the items that have fewer than two spans. Then take location and salary from the list of spans.
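As a rough sketch of the find_all suggestion (collect every span in a job card, skip cards with fewer than two spans, then read location and salary by position). The markup here is hypothetical sample HTML with `article` tags standing in for the job cards, and it assumes the location span comes before the salary span; the real page will differ.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup (an assumption, not the real jobstreet HTML).
SAMPLE_HTML = """
<article><span>Kuala Lumpur</span><span>MYR 3,000 - 4,000</span></article>
<article><span>Penang</span></article>
<article><span>Johor Bahru</span><span>MYR 2,500 - 3,200</span></article>
"""

def transform(soup):
    jobs = []
    for card in soup.find_all('article'):
        spans = card.find_all('span')
        # Skip cards that do not carry both a location and a salary.
        if len(spans) < 2:
            continue
        jobs.append({
            'location': spans[0].get_text(strip=True),
            'salary': spans[1].get_text(strip=True),
        })
    return jobs

soup = BeautifulSoup(SAMPLE_HTML, 'html.parser')
jobs = transform(soup)
for job in jobs:
    print(job)
```

Note that this drops any job without a salary entirely, which may or may not be what you want.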
Try to avoid selecting by dynamic parts of the HTML, such as classes. Since you have already worked out the sibling relationship, you could go with find_next_sibling('span'). The following selects only by tags and also checks whether a salary is present, which avoids errors: every job is still scraped, and the salary is shown as None when it is missing.
Example:
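A minimal sketch of the find_next_sibling('span') approach described above; it is not the answerer's original code. The sample HTML and the `article` card tag are assumptions for illustration, so the tag names would need adjusting to the real page.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup (an assumption, not the real jobstreet HTML).
SAMPLE_HTML = """
<article><span>Kuala Lumpur</span><span>MYR 3,000 - 4,000</span></article>
<article><span>Penang</span></article>
<article><span>Johor Bahru</span><span>MYR 2,500 - 3,200</span></article>
"""

def transform(soup):
    jobs = []
    for card in soup.find_all('article'):
        location_tag = card.find('span')
        if location_tag is None:
            continue
        # Call find_next_sibling on the location span itself, not on the card,
        # so the search starts from the first span and finds the second one.
        salary_tag = location_tag.find_next_sibling('span')
        jobs.append({
            'location': location_tag.get_text(strip=True),
            # Keep the job even when no salary is listed.
            'salary': salary_tag.get_text(strip=True) if salary_tag else None,
        })
    return jobs

soup = BeautifulSoup(SAMPLE_HTML, 'html.parser')
jobs = transform(soup)
for job in jobs:
    print(job)
```

The key difference from the code in the question is that find_next_sibling is called on the span that was just found, not on the enclosing div.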
Using dynamic classes is not the best idea. The site uses an API for job search, as well as GraphQL.
OUTPUT: