How do I get JSON out of a webpage?
I'm trying to parse data out of this webpage, but I don't need the whole dataset. I just need:
- The operator name (Google, CloudFlare, etc.)
- The description (Google 'Argon2022' log, Google 'Argon2023' log, etc.)
- The log IDs (e.g. KXm+8J45OSHwVnOfY6V35b5XfZxgCvj5TV0mXCVdx4Q=)
I tried to write some code, but I'm just a beginner at web scraping, so I was wondering if anyone could help. Here is my attempt, using the lxml and requests libraries:
import requests
from lxml import html
page = requests.get('https://chromium.googlesource.com/chromium/src/+/main/components/certificate_transparency/data/log_list.json')
tree = html.fromstring(page.content)
#This will create a list of operators:
operators = tree.xpath('//span[@class="operators"]/text()')
print('Operators: ',operators)
My hope is to have an end result that looks like the JSON on the website minus all the unneeded info, so the operators would look like:
[
  { "name": "Google",
    "logs": [
      { "description": "Google Argon2022 log",
        "log_id": "KXm+8J45OSHwVnOfY6V35b5XfZxgCvj5TV0mXCVdx4Q=" },
      { "description": "Google Argon2023 log",
        "log_id": "6D7Q2j71BjUy51covIlryQPTy9ERa+zraeF3fW0GvW4=" }
    ]
  },
  ....
  { "name": "CloudFlare",
    "logs": [ ... ]
  }
]
Comments (2)
First, you want to access the raw file, and not the UI. Just like Kache mentioned, you can get the JSON using:
Then, you can use the following script to extract only the data you want:
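The script itself did not survive in this copy of the answer. As a minimal sketch of the extraction step, assuming the downloaded JSON has a top-level "operators" list whose entries carry "name" and "logs" (each log with "description" and "log_id"), as the desired output in the question suggests, something like the following would keep only those fields. The function name, the sample data, and the sample "url" value are hypothetical.

```python
import json


def extract_operators(log_list):
    """Keep only each operator's name plus the description and log_id of its logs."""
    result = []
    for operator in log_list.get("operators", []):
        logs = [
            {"description": log.get("description"), "log_id": log.get("log_id")}
            for log in operator.get("logs", [])
        ]
        result.append({"name": operator.get("name"), "logs": logs})
    return result


# Hypothetical sample shaped like the structure assumed above; the extra
# "url" field stands in for the "unneeded info" that should be dropped.
sample = {
    "operators": [
        {
            "name": "Google",
            "logs": [
                {
                    "description": "Google 'Argon2022' log",
                    "log_id": "KXm+8J45OSHwVnOfY6V35b5XfZxgCvj5TV0mXCVdx4Q=",
                    "url": "https://example.test/argon2022/",
                },
                {
                    "description": "Google 'Argon2023' log",
                    "log_id": "6D7Q2j71BjUy51covIlryQPTy9ERa+zraeF3fW0GvW4=",
                    "url": "https://example.test/argon2023/",
                },
            ],
        },
        {"name": "CloudFlare", "logs": []},
    ]
}

if __name__ == "__main__":
    print(json.dumps(extract_operators(sample), indent=2))
```

Using .get() with defaults means an operator without a "logs" key simply yields an empty list instead of raising a KeyError.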
There is a link at the bottom right that lets you download the file directly: https://chromium.googlesource.com/chromium/src/+/main/components/certificate_transparency/data/log_list.json?format=JSON
This lets you avoid HTML parsing altogether.
Here's Python code to extract it as a dict:
What remains of your question involves JSON and dict traversal and basic coding, which you should be able to find answers to in other questions.
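The code this answer refers to is missing from this copy. A rough sketch of the idea, using only the standard library (requests.get(url).text would serve equally well for the download), could look like the following. What the ?format=JSON endpoint actually returns is not shown in the thread, so the parsing step is an assumption: it expects the body to be the JSON document itself, optionally preceded by a )]}' anti-XSSI prefix line. The function names are hypothetical.

```python
import json
import urllib.request

# URL taken from the answer above.
URL = (
    "https://chromium.googlesource.com/chromium/src/+/main/"
    "components/certificate_transparency/data/log_list.json?format=JSON"
)

# Assumption: the server may prepend this guard line to JSON responses.
XSSI_PREFIX = ")]}'"


def parse_payload(text):
    """Turn the response text into a dict, dropping the guard prefix if present."""
    text = text.lstrip()
    if text.startswith(XSSI_PREFIX):
        text = text[len(XSSI_PREFIX):]
    return json.loads(text)


def fetch_log_list(url=URL):
    """Download the file and parse it; needs network access when called."""
    with urllib.request.urlopen(url) as response:
        return parse_payload(response.read().decode("utf-8"))
```

Calling fetch_log_list() should then give a plain dict to traverse with ordinary key lookups and loops. If json.loads raises an error, the response is probably in a different shape than assumed here and the parsing step will need adjusting.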