How do I web scrape with BeautifulSoup using the attributes application/ld+json and data-react-helmet?
I'm new to web scraping with Python. I've written code that pulls data from a job portal site using Selenium and BeautifulSoup. The flow is:
- scrape all the links to job postings on the job portal site;
- loop over each job-posting link and scrape the detailed info from it.
I scrape the detailed info using BeautifulSoup's find_all method on the script tags with type='application/ld+json' and data-react-helmet, but I get the error "list index out of range". Does anyone understand how to solve it?
import json

import pandas as pd
import requests
from bs4 import BeautifulSoup

job_main_data = pd.DataFrame()

# URL_job_list holds the job-posting links collected in the first (Selenium) step
for i, url in enumerate(URL_job_list):
    headers = {
        'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36',
        'referrer': 'https://google.com',
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
        'Accept-Encoding': 'gzip, deflate, br',
        'Accept-Language': 'en-US,en;q=0.9',
        'Pragma': 'no-cache',
    }
    response = requests.get(url=url, headers=headers)
    soup = BeautifulSoup(response.text, 'lxml')

    # Find the JSON-LD job metadata embedded in the page
    script_tags = soup.find_all('script', attrs={'data-react-helmet': 'true', 'type': 'application/ld+json'})
    metadata = script_tags[-1].text  # raises "list index out of range" when no tag matches

    temp_dict = {}
    try:
        job_info_json = json.loads(metadata, strict=False)
        try:
            jobID = job_info_json['identifier']['value']
            temp_dict['Job ID'] = jobID
            print('Job ID = ' + jobID)
        except AttributeError:
            jobID = ''
        try:
            jobTitle = job_info_json['title']
            temp_dict['Job Title'] = jobTitle
            print('Title = ' + jobTitle)
        except AttributeError:
            jobTitle = ''
        try:
            occupationalCategory = job_info_json['occupationalCategory']
            temp_dict['occupationalCategory'] = occupationalCategory
            print('Occupational Category = ' + occupationalCategory)
        except AttributeError:
            occupationalCategory = ''
        temp_dict['Job Link'] = url
        job_main_data = job_main_data.append(temp_dict, ignore_index=True)
    except json.JSONDecodeError:
        print("Empty response")
1 Answer
The data is dynamically loaded by JavaScript from an API call's JSON response, so you can grab whatever data you want. Below is an example of how to extract the data from the API using the requests module only.
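As a general illustration of the technique (not the answerer's original code), the sketch below assumes a hypothetical JSON endpoint API_URL found via the browser's network tab; the real portal's URL, query parameters, and response keys will differ.

import requests

# Hypothetical endpoint: open a job page with the browser's dev tools
# (Network tab, filter XHR/Fetch) and copy the request that returns JSON.
API_URL = 'https://www.example-jobportal.com/api/jobs'

headers = {
    'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36',
    'Accept': 'application/json',
}
params = {'page': 1, 'keyword': 'data analyst'}  # placeholder query parameters

response = requests.get(API_URL, headers=headers, params=params)
response.raise_for_status()
data = response.json()  # the endpoint answers with JSON, so no HTML parsing is needed

# Placeholder key names; inspect `data` to see the real structure.
for job in data.get('results', []):
    print(job.get('id'), job.get('title'), job.get('occupationalCategory'))

Fetching the JSON endpoint directly is usually faster and more reliable than rendering the page with Selenium and parsing the HTML.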