How do I web scrape with BeautifulSoup using the attributes type application/ld+json and data-react-helmet?

Posted on 2025-01-24 11:00:02, 2139 characters, 0 views, 0 comments

I'm new to web scraping with Python. I've written code that pulls data from a job portal site using Selenium and BeautifulSoup. My workflow is:

  1. Scrape the links to all job postings on the job portal site.
  2. Loop over those links and scrape the detailed info from each job posting.

I scrape the detailed info with BeautifulSoup's find_all method on script tags that have type = 'application/ld+json' and the data-react-helmet attribute. But I get the error message "list index out of range". Does anyone know how to solve it?


import json

import pandas as pd
import requests
from bs4 import BeautifulSoup

# URL_job_list holds the job-posting links collected in step 1.
headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36',
    'Referer': 'https://google.com',  # the HTTP header is spelled "Referer"
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
    'Accept-Encoding': 'gzip, deflate, br',
    'Accept-Language': 'en-US,en;q=0.9',
    'Pragma': 'no-cache',
}

rows = []
for url in URL_job_list:
    response = requests.get(url=url, headers=headers)
    soup = BeautifulSoup(response.text, 'lxml')
    # The attribute name needs both hyphens: 'data-react-helmet'.
    script_tags = soup.find_all('script', attrs={'data-react-helmet': 'true', 'type': 'application/ld+json'})
    if not script_tags:
        # With no match, script_tags[-1] raises "IndexError: list index out of range".
        print('No JSON-LD script tag found for ' + url)
        continue
    metadata = script_tags[-1].text

    try:
        job_info_json = json.loads(metadata, strict=False)
    except json.JSONDecodeError:
        print('Empty response')
        continue

    temp_dict = {'Job Link': url}  # store this posting's link, not the whole list

    # A missing key raises KeyError (TypeError if a level is not a dict), not AttributeError.
    try:
        temp_dict['Job ID'] = job_info_json['identifier']['value']
        print(f"Job ID = {temp_dict['Job ID']}")
    except (KeyError, TypeError):
        pass

    try:
        temp_dict['Job Title'] = job_info_json['title']
        print(f"Title = {temp_dict['Job Title']}")
    except (KeyError, TypeError):
        pass

    try:
        temp_dict['occupationalCategory'] = job_info_json['occupationalCategory']
        print(f"Occupational Category = {temp_dict['occupationalCategory']}")
    except (KeyError, TypeError):
        pass

    rows.append(temp_dict)

# DataFrame.append was removed in pandas 2.0, so build the frame once from the rows.
job_main_data = pd.DataFrame(rows)
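The "list index out of range" error comes from indexing an empty find_all result. Here is a minimal sketch of guarding that lookup and reading nested JSON-LD fields safely. To keep it runnable without extra packages, it uses the standard library's html.parser instead of BeautifulSoup. The names JsonLdCollector, extract_job and dig, and both sample pages, are hypothetical and used only for illustration.

```python
import json
from html.parser import HTMLParser


class JsonLdCollector(HTMLParser):
    """Collect the text of every <script type="application/ld+json"> tag."""

    def __init__(self):
        super().__init__()
        self.blocks = []       # raw JSON text, one entry per matching tag
        self._inside = False   # True while inside a matching <script>

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._inside = True
            self.blocks.append("")

    def handle_data(self, data):
        if self._inside:
            self.blocks[-1] += data

    def handle_endtag(self, tag):
        if tag == "script":
            self._inside = False


def extract_job(html):
    """Return the last JSON-LD object in html as a dict, or None if absent."""
    parser = JsonLdCollector()
    parser.feed(html)
    parser.close()
    if not parser.blocks:  # indexing [-1] on an empty list would raise IndexError
        return None
    try:
        return json.loads(parser.blocks[-1], strict=False)
    except json.JSONDecodeError:
        return None


def dig(data, *keys, default=""):
    """Follow keys through nested dicts, returning default if any is missing."""
    for key in keys:
        if not isinstance(data, dict) or key not in data:
            return default
        data = data[key]
    return data


# Hypothetical pages: one with server-rendered JSON-LD, one filled in only by JavaScript.
STATIC_PAGE = """
<html><head>
<script data-react-helmet="true" type="application/ld+json">
{"title": "Data Engineer", "identifier": {"value": "123"}}
</script>
</head><body></body></html>
"""
JS_ONLY_PAGE = "<html><head></head><body><div id='root'></div></body></html>"

job = extract_job(STATIC_PAGE)
print(dig(job, "identifier", "value"))   # prints 123
print(dig(job, "occupationalCategory"))  # prints an empty line (key is missing)
print(extract_job(JS_ONLY_PAGE))         # prints None
```

If the second case is what the real site returns to requests, no change to the selector will help, because the tag is not in the downloaded HTML at all.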


Comments (1)

羁绊已千年 2025-01-31 11:00:02

The data is loaded dynamically by JavaScript from an API call that returns JSON, so you can grab whatever data you need straight from that response. Below is an example of how to extract data from the API using only the requests module:

import json

import requests

payload = {
    "requests": [
        {
            "indexName": "job_postings",
            "params": "query=&hitsPerPage=20&maxValuesPerFacet=1000&page=0&facets=%5B%22*%22%2C%22city.work_country_name%22%2C%22position.name%22%2C%22industries.vertical_name%22%2C%22experience%22%2C%22job_type.name%22%2C%22is_salary_visible%22%2C%22has_equity%22%2C%22currency.currency_code%22%2C%22salary_min%22%2C%22taxonomies.slug%22%5D&tagFilters=&facetFilters=%5B%5B%22city.work_country_name%3AIndonesia%22%5D%5D"
        },
        {
            "indexName": "job_postings",
            "params": "query=&hitsPerPage=1&maxValuesPerFacet=1000&page=0&attributesToRetrieve=%5B%5D&attributesToHighlight=%5B%5D&attributesToSnippet=%5B%5D&tagFilters=&analytics=false&clickAnalytics=false&facets=city.work_country_name"
        }
    ]
}
headers = {'content-type': 'application/x-www-form-urlencoded'}
api_url = "https://219wx3mpv4-dsn.algolia.net/1/indexes/*/queries?x-algolia-agent=Algolia%20for%20vanilla%20JavaScript%203.30.0%3BJS%20Helper%202.26.1&x-algolia-application-id=219WX3MPV4&x-algolia-api-key=b528008a75dc1c4402bfe0d8db8b3f8e"

# POST the search request and decode the JSON response.
jsonData = requests.post(api_url, data=json.dumps(payload), headers=headers).json()
# print(jsonData)

# The first result set holds the job postings.
for item in jsonData['results'][0]['hits']:
    title = item['_highlightResult']['title']['value']
    company = item['_highlightResult']['company']['name']['value']
    skill = item['_highlightResult']['job_skills'][0]['name']['value']
    salary_max = item['salary_max']
    salary_min = item['salary_min']

    print(title)
    print(company)
    print(skill)
    print(salary_max)
    print(salary_min)

Output:

Corporate PR
Rocketindo
Sales Strategy & Management
12000000
7000000
Social Media Specialist
Rocketindo
Content Marketing
12000000
7000000
Performance Marketing Analyst (Mama's Choice)
The Parent Inc (theAsianparent)
Marketing Strategy
12000000
5000000
Business Development (Associate Consultant) - CRM
Mekari (PT. Mid Solusi Nusantara)
Business Development & Partnerships
7000000
5000000
Account Payable
Ritase
Corporate Finance
0
0
Data Engineer
Topremit
Databases
0
0
Public Relation KOL
Rocketindo
Business Development & Partnerships
7000000
5000000
Graphic Designer
Rocketindo
Adobe Illustrator
12000000
7000000
Yogyakarta City Coordinator
Deliveree Indonesia
Business Operations
6000000
5250000
Marketing Manager
Deliveree Indonesia
Marketing Strategy
0
0
Graphic Designer
Deliveree Indonesia
Graphic Design
6000000
5250000
Quality Assurance
PT Rekeningku Dotcom Indonesia
Javascript
10000000
4500000
Internship Program
TADA
Attention to Detail
3700000
3000000
Product Management Support
Hangry
Data Warehouse
0
0
Content Writer
Bobobox Indonesia
Copywriting
0
0
UX Researcher
Bobobox Indonesia
UI/UX Design
0
0
UX Copywriter
Bobobox Indonesia
Problem Solving
0
0
Internship HR (Recruitment)
PT Formasi Agung Selaras (Famous Allstars)
Human Resources
1500000
1000000
Fullstack Developer - Banking Industry
SIGMATECH
React.js
12000000
8000000
REACT NATIVE DEVELOPER
BGT Solution
MySQL
16000000
6000000
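
The params value in the payload above is a URL-encoded query string, and it contains a page field. As a rough sketch, that string could be built per page with urllib.parse.urlencode rather than written by hand, and each hit could be flattened into one row. The helpers build_params, build_payload and flatten_hit are hypothetical names, the reduced parameter set is an assumption that has not been tested against the API, and sample_hit is made-up data shaped like the fields the loop above reads.

```python
import json
from urllib.parse import parse_qs, urlencode


def build_params(page, hits_per_page=20, country="Indonesia"):
    """Build the URL-encoded params string for one page of results."""
    facet_filters = [[f"city.work_country_name:{country}"]]
    return urlencode({
        "query": "",
        "hitsPerPage": hits_per_page,
        "page": page,
        # facetFilters is a compact JSON array of arrays, as in the payload above
        "facetFilters": json.dumps(facet_filters, separators=(",", ":")),
    })


def build_payload(page):
    """Wrap the params string in the request body shape used above."""
    return {"requests": [{"indexName": "job_postings", "params": build_params(page)}]}


def flatten_hit(item):
    """Turn one hit into a flat row; a missing first skill becomes ''."""
    highlight = item["_highlightResult"]
    skills = highlight.get("job_skills") or []
    return {
        "title": highlight["title"]["value"],
        "company": highlight["company"]["name"]["value"],
        "skill": skills[0]["name"]["value"] if skills else "",
        "salary_min": item["salary_min"],
        "salary_max": item["salary_max"],
    }


# Hypothetical hit with the same shape as the items read in the loop above.
sample_hit = {
    "_highlightResult": {
        "title": {"value": "Corporate PR"},
        "company": {"name": {"value": "Rocketindo"}},
        "job_skills": [],
    },
    "salary_min": 7000000,
    "salary_max": 12000000,
}

print(build_params(1))
print(parse_qs(build_params(1))["page"])
print(flatten_hit(sample_hit))
```

A list of such rows can be passed to pd.DataFrame(rows) to get the table the question was building. The empty-skill guard is there because job_skills[0] would raise an IndexError on an empty list; whether the real API ever returns a posting without skills is not known from the output shown.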