Crawling and downloading readme.md files from GitHub using Python

Published 2025-02-03 04:41:28


I'm trying to do an NLP task. For that purpose I need a considerable number of Readme.md files from GitHub. This is what I am trying to do:

  1. For a given number n, I want to list the first n GitHub repositories (and their URLs), ranked by their number of stars.
  2. I want to download the Readme.md file from each of those URLs.
  3. I want to save the Readme.md files on my hard drive, each in a separate folder. The folder name should be the name of the repository.

I'm not acquainted with crawling and web scraping, but I am relatively good with Python. I'd be thankful if you could give me some help on how to accomplish these steps. Any help would be appreciated.

My effort: I've searched a little, and I found a website (gitstar-ranking.com) that ranks GitHub repos based on their stars. But that does not solve my problem, because getting the names or URLs of those repos from this website is again a scraping task.
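A minimal sketch of step 1, assuming the public GitHub search API (https://api.github.com/search/repositories) and its usual full_name and html_url fields; it lists the top n repositories by star count without scraping any website:

import json
import urllib.request

def top_repos(n):
    # search repositories sorted by star count, descending;
    # one page covers up to 100 repositories, larger n needs the &page= parameter
    url = ('https://api.github.com/search/repositories'
           '?q=stars:%3E1&sort=stars&order=desc&per_page=100')
    with urllib.request.urlopen(url) as response:
        items = json.loads(response.read().decode())['items']
    return [(repo['full_name'], repo['html_url']) for repo in items[:n]]

for name, link in top_repos(10):
    print(name, link)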


Comments (1)

梦过后 2025-02-10 04:41:28

Here's my attempt using the suggestion from @Luke. I changed the minimum stars to 500, since we don't need 5 million results (>500 still yields 66,513 results).
You might not need the ssl workaround (the ssl._create_unverified_context() call), but since I'm behind a proxy, it's a pain to do it properly.
The script finds files called readme.md in any combination of lower- and uppercase, but nothing else. It saves the file as README.md (uppercase), but this can be adjusted by using the actual filename.

import urllib.request
import urllib.error
import json
import ssl
import os
import time


n = 5  # number of fetched READMEs
url = 'https://api.github.com/search/repositories?q=stars:%3E500&sort=stars'
request = urllib.request.urlopen(url)
page = request.read().decode()
api_json = json.loads(page)

repos = api_json['items'][:n]

for repo in repos:
    full_name = repo['full_name']
    print('fetching readme from', full_name)
    
    # find the readme url (filename matched case-insensitively)
    contents_url = repo['url'] + '/contents'
    request = urllib.request.urlopen(contents_url)
    page = request.read().decode()
    contents_json = json.loads(page)
    readme_urls = [file['download_url'] for file in contents_json if file['name'].lower() == 'readme.md']
    if not readme_urls:
        print('no readme.md found in', full_name)
        continue
    readme_url = readme_urls[0]
    
    # download readme contents
    try:
        context = ssl._create_unverified_context()  # prevent ssl problems
        request = urllib.request.urlopen(readme_url, context=context)
    except urllib.error.HTTPError as error:
        print(error)
        continue  # if the url can't be opened, there's no use to try to download anything
    readme = request.read().decode()
    
    # create folder named after repo's name and save readme.md there
    try:
        os.mkdir(repo['name'])  
    except OSError as error:
        print(error)
    with open(repo['name'] + '/README.md', 'w', encoding="utf-8") as f:
        f.write(readme)
    print('ok')
    
    # only 10 requests per min for unauthenticated requests
    if n >= 9:  # n + 1 initial request 
        time.sleep(6)
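
Two possible refinements, sketched under the assumption that the documented GET /repos/{owner}/{repo}/readme endpoint is available: it returns the repository's preferred README regardless of the exact filename, so the case-insensitive filename matching above is no longer needed, and passing a personal access token in the Authorization header raises the unauthenticated rate limits. The GITHUB_TOKEN environment variable below is only an illustrative assumption.

import os
import urllib.request

def fetch_readme(full_name, token=None):
    # the /readme endpoint resolves the preferred README whatever its filename
    url = 'https://api.github.com/repos/' + full_name + '/readme'
    # ask for the raw file contents instead of the JSON wrapper
    headers = {'Accept': 'application/vnd.github.v3.raw'}
    if token:
        # optional personal access token; raises the API rate limits
        headers['Authorization'] = 'token ' + token
    request = urllib.request.Request(url, headers=headers)
    with urllib.request.urlopen(request) as response:
        return response.read().decode()

# e.g. inside the loop above: readme = fetch_readme(full_name, os.environ.get('GITHUB_TOKEN'))

With a token, the search API allows 30 requests per minute instead of 10, and the core API 5,000 requests per hour instead of 60, so the sleep in the loop can usually be relaxed.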