Crawling and downloading readme.md files from GitHub using Python

Published 2025-02-03 04:41:28


I'm trying to do an NLP task. For that purpose I need a considerable number of Readme.md files from GitHub. This is what I am trying to do:

  1. For a given number n, I want to list the first n GitHub repositories (and their URLs), ranked by their number of stars.
  2. I want to download the Readme.md file from each of those URLs.
  3. I want to save the Readme.md files on my hard drive, each in a separate folder. The folder name should be the name of the repository.

I'm not acquainted with crawling and web scraping, but I am relatively good with Python. I'd be thankful if you could give me some help on how to accomplish these steps. Any help would be appreciated.

My effort: I've searched a little, and I found a website (gitstar-ranking.com) that ranks GitHub repos based on their stars. But that does not solve my problem, because getting the names or URLs of those repos from this website is again a scraping task.
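A minimal sketch of step 1, assuming the public GitHub search API (https://api.github.com/search/repositories) and its usual full_name and html_url fields; it lists the top n repositories by star count without scraping any website:

import json
import urllib.request

def top_repos(n):
    # search repositories sorted by star count, descending;
    # one page covers up to 100 repositories, larger n needs the &page= parameter
    url = ('https://api.github.com/search/repositories'
           '?q=stars:%3E1&sort=stars&order=desc&per_page=100')
    with urllib.request.urlopen(url) as response:
        items = json.loads(response.read().decode())['items']
    return [(repo['full_name'], repo['html_url']) for repo in items[:n]]

for name, link in top_repos(10):
    print(name, link)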


Comments (1)

梦过后 2025-02-10 04:41:28

Here's my attempt using the suggestion from @Luke. I changed the minimum stars to 500, since we don't need 5 million results (>500 still yields 66,513 results).
You might not need the ssl workaround (the ssl._create_unverified_context() call), but since I'm behind a proxy, it's a pain to do it properly.
The script finds files called readme.md in any combination of lower- and uppercase, but nothing else. It saves the file as README.md (uppercase), but this can be adjusted by using the actual filename.

import urllib.request
import urllib.error
import json
import ssl
import os
import time


n = 5  # number of fetched READMEs
url = 'https://api.github.com/search/repositories?q=stars:%3E500&sort=stars'
request = urllib.request.urlopen(url)
page = request.read().decode()
api_json = json.loads(page)

repos = api_json['items'][:n]

for repo in repos:
    full_name = repo['full_name']
    print('fetching readme from', full_name)
    
    # find the readme url (filename matched case-insensitively)
    contents_url = repo['url'] + '/contents'
    request = urllib.request.urlopen(contents_url)
    page = request.read().decode()
    contents_json = json.loads(page)
    readme_urls = [file['download_url'] for file in contents_json if file['name'].lower() == 'readme.md']
    if not readme_urls:
        print('no readme.md found in', full_name)
        continue
    readme_url = readme_urls[0]
    
    # download readme contents
    try:
        context = ssl._create_unverified_context()  # prevent ssl problems
        request = urllib.request.urlopen(readme_url, context=context)
    except urllib.error.HTTPError as error:
        print(error)
        continue  # if the url can't be opened, there's no use to try to download anything
    readme = request.read().decode()
    
    # create folder named after repo's name and save readme.md there
    try:
        os.mkdir(repo['name'])  
    except OSError as error:
        print(error)
    with open(repo['name'] + '/README.md', 'w', encoding="utf-8") as f:
        f.write(readme)
    print('ok')
    
    # only 10 requests per min for unauthenticated requests
    if n >= 9:  # n + 1 initial request 
        time.sleep(6)
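
Two possible refinements, sketched under the assumption that the documented GET /repos/{owner}/{repo}/readme endpoint is available: it returns the repository's preferred README regardless of the exact filename, so the case-insensitive filename matching above is no longer needed, and passing a personal access token in the Authorization header raises the unauthenticated rate limits. The GITHUB_TOKEN environment variable below is only an illustrative assumption.

import os
import urllib.request

def fetch_readme(full_name, token=None):
    # the /readme endpoint resolves the preferred README whatever its filename
    url = 'https://api.github.com/repos/' + full_name + '/readme'
    # ask for the raw file contents instead of the JSON wrapper
    headers = {'Accept': 'application/vnd.github.v3.raw'}
    if token:
        # optional personal access token; raises the API rate limits
        headers['Authorization'] = 'token ' + token
    request = urllib.request.Request(url, headers=headers)
    with urllib.request.urlopen(request) as response:
        return response.read().decode()

# e.g. inside the loop above: readme = fetch_readme(full_name, os.environ.get('GITHUB_TOKEN'))

With a token, the search API allows 30 requests per minute instead of 10, and the core API 5,000 requests per hour instead of 60, so the sleep in the loop can usually be relaxed.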