How do I parse the JSON out of a webpage?

So I'm trying to parse data out of this webpage

But I don't need the whole dataset, I just need:

  • The operator name (Google, CloudFlare, etc.)
  • The description (Google 'Argon2022' log, Google 'Argon2023' log, etc.)
  • The logIDs (KXm+8J45OSHwVnOfY6V35b5XfZxgCvj5TV0mXCVdx4Q=)

I tried to write some code, but I'm just a beginner at web scraping, so I was wondering if anyone could help. Here is my attempted code; I tried using the lxml and requests libraries.

import requests
from lxml import html

page = requests.get('https://chromium.googlesource.com/chromium/src/+/main/components/certificate_transparency/data/log_list.json')
tree = html.fromstring(page.content)

#This will create a list of operators:
operators = tree.xpath('//span[@class="operators"]/text()')

print('Operators: ',operators)

My hope is to have an end result that looks like the JSON on the website, minus all the unneeded info, so just the operators:

[
  { "name": "Google",
    "logs": [
      { "description": "Google Argon2022 log",
        "log_id": "KXm+8J45OSHwVnOfY6V35b5XfZxgCvj5TV0mXCVdx4Q=" },
      { "description": "Google Argon2023 log",
        "log_id": "6D7Q2j71BjUy51covIlryQPTy9ERa+zraeF3fW0GvW4=" }
    ]
  },
  ....
  { "name": "CloudFlare",
    "logs": [ ... ]
  }
]

Comments (2)

数理化全能战士 2025-02-15 01:13:53

First, you want to access the raw file, not the UI. As Kache mentioned, you can get the JSON from the ?format=TEXT endpoint, which serves the file base64-encoded (hence the decoding step):

resp = requests.get('https://chromium.googlesource.com/chromium/src/+/main/components/certificate_transparency/data/log_list.json?format=TEXT')
obj = json.loads(base64.decodebytes(resp.text.encode()))

Then, you can use the following script to extract only the data you want:

import requests
import json
import base64

def extract_log(log):
    keys = [ 'description', 'log_id' ]
    return { key: log[key] for key in keys }

def extract_logs(logs):
    return [ extract_log(log) for log in logs ]

def extract_operator(operator):
    return {
        'name': operator['name'],
        'logs': extract_logs(operator['logs'])
    }

def extract_certificates(obj):
    return [ extract_operator(operator) for operator in obj['operators'] ]

def scrape_certificates(url):
    resp = requests.get(url)
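    # The ?format=TEXT response is the raw file base64-encoded, so decode it before parsing.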
    obj = json.loads(base64.decodebytes(resp.text.encode()))
    return extract_certificates(obj)

def main():
    out = scrape_certificates('https://chromium.googlesource.com/chromium/src/+/main/components/certificate_transparency/data/log_list.json?format=TEXT')
    print(json.dumps(out, indent=4))

if __name__ == '__main__':
    main()
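
Run as-is, this prints a trimmed structure like the one you sketched; the values below are just the ones from your own example:

[
    {
        "name": "Google",
        "logs": [
            {
                "description": "Google 'Argon2022' log",
                "log_id": "KXm+8J45OSHwVnOfY6V35b5XfZxgCvj5TV0mXCVdx4Q="
            },
            ...
        ]
    },
    ...
]
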
多情癖 2025-02-15 01:13:53

There is a link at the bottom right that lets you download the file directly: https://chromium.googlesource.com/chromium/src/+/main/components/certificate_transparency/data/log_list.json?format=JSON

That lets you avoid HTML parsing altogether.

Here's Python code to extract it as a dict:

resp = requests.get('https://chromium.googlesource.com/chromium/src/+/main/components/certificate_transparency/data/log_list.json?format=TEXT')
js = json.loads(base64.decodebytes(resp.text.encode()))

What remains of your question involves JSON, dict traversal, and basic coding, which you should be able to find answers to in other questions.
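
If it helps, here is a minimal sketch of that traversal, assuming the operators / logs / description / log_id layout shown in the other answer:

import base64
import json

import requests

URL = ('https://chromium.googlesource.com/chromium/src/+/main/'
       'components/certificate_transparency/data/log_list.json?format=TEXT')

# Fetch the base64-encoded raw file, decode it, and parse the JSON.
resp = requests.get(URL)
log_list = json.loads(base64.decodebytes(resp.text.encode()))

# Keep only the operator name plus each log's description and log_id.
operators = [
    {
        'name': op['name'],
        'logs': [
            {'description': log['description'], 'log_id': log['log_id']}
            for log in op['logs']
        ],
    }
    for op in log_list['operators']
]

print(json.dumps(operators, indent=2))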
