How do I parse the JSON out of a webpage?

So I'm trying to parse data out of this webpage

But I don't need the whole dataset, I just need:

  • The operator name (Google, CloudFlare, etc.)
  • The description (Google 'Argon2022' log, Google 'Argon2023' log, etc.)
  • The logIDs (KXm+8J45OSHwVnOfY6V35b5XfZxgCvj5TV0mXCVdx4Q=)

I tried to write some code, but I'm just a beginner at web scraping, so I was wondering if anyone could help. Here is my attempted code; I tried using the lxml and requests libraries.

import requests
from lxml import html

page = requests.get('https://chromium.googlesource.com/chromium/src/+/main/components/certificate_transparency/data/log_list.json')
tree = html.fromstring(page.content)

#This will create a list of operators:
operators = tree.xpath('//span[@class="operators"]/text()')

print('Operators: ',operators)

My hope is to have an end result that looks like the JSON on the website, minus all the unneeded info, so just the operators:

[
  { "name": "Google",
    "logs": [
      { "description": "Google Argon2022 log",
        "log_id": "KXm+8J45OSHwVnOfY6V35b5XfZxgCvj5TV0mXCVdx4Q=" },
      { "description": "Google Argon2023 log",
        "log_id": "6D7Q2j71BjUy51covIlryQPTy9ERa+zraeF3fW0GvW4=" }
    ]
  },
  ....
  { "name": "CloudFlare",
    "logs": [ ... ]
  }
]

Comments (2)

数理化全能战士 2025-02-15 01:13:53

First, you want to access the raw file, not the UI. As Kache mentioned, you can get the JSON from the ?format=TEXT endpoint, which serves the file base64-encoded (hence the decoding step):

resp = requests.get('https://chromium.googlesource.com/chromium/src/+/main/components/certificate_transparency/data/log_list.json?format=TEXT')
obj = json.loads(base64.decodebytes(resp.text.encode()))

Then, you can use the following script to extract only the data you want:

import requests
import json
import base64

def extract_log(log):
    keys = [ 'description', 'log_id' ]
    return { key: log[key] for key in keys }

def extract_logs(logs):
    return [ extract_log(log) for log in logs ]

def extract_operator(operator):
    return {
        'name': operator['name'],
        'logs': extract_logs(operator['logs'])
    }

def extract_certificates(obj):
    return [ extract_operator(operator) for operator in obj['operators'] ]

def scrape_certificates(url):
    resp = requests.get(url)
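    # The ?format=TEXT response is the raw file base64-encoded, so decode it before parsing.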
    obj = json.loads(base64.decodebytes(resp.text.encode()))
    return extract_certificates(obj)

def main():
    out = scrape_certificates('https://chromium.googlesource.com/chromium/src/+/main/components/certificate_transparency/data/log_list.json?format=TEXT')
    print(json.dumps(out, indent=4))

if __name__ == '__main__':
    main()
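
Run as-is, this prints a trimmed structure like the one you sketched; the values below are just the ones from your own example:

[
    {
        "name": "Google",
        "logs": [
            {
                "description": "Google 'Argon2022' log",
                "log_id": "KXm+8J45OSHwVnOfY6V35b5XfZxgCvj5TV0mXCVdx4Q="
            },
            ...
        ]
    },
    ...
]
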
多情癖 2025-02-15 01:13:53

There is a link at the bottom right that lets you download the file directly: https://chromium.googlesource.com/chromium/src/+/main/components/certificate_transparency/data/log_list.json?format=JSON

That lets you avoid HTML parsing altogether.

Here's Python code to extract it as a dict:

resp = requests.get('https://chromium.googlesource.com/chromium/src/+/main/components/certificate_transparency/data/log_list.json?format=TEXT')
js = json.loads(base64.decodebytes(resp.text.encode()))

What remains of your question involves JSON, dict traversal, and basic coding, which you should be able to find answers to in other questions.
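
If it helps, here is a minimal sketch of that traversal, assuming the operators / logs / description / log_id layout shown in the other answer:

import base64
import json

import requests

URL = ('https://chromium.googlesource.com/chromium/src/+/main/'
       'components/certificate_transparency/data/log_list.json?format=TEXT')

# Fetch the base64-encoded raw file, decode it, and parse the JSON.
resp = requests.get(URL)
log_list = json.loads(base64.decodebytes(resp.text.encode()))

# Keep only the operator name plus each log's description and log_id.
operators = [
    {
        'name': op['name'],
        'logs': [
            {'description': log['description'], 'log_id': log['log_id']}
            for log in op['logs']
        ],
    }
    for op in log_list['operators']
]

print(json.dumps(operators, indent=2))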
