Crawling and downloading Readme.md files from GitHub using Python
I'm trying to do an NLP task. For that purpose I need a considerable amount of Readme.md files from GitHub. This is what I am trying to do:
- For a given number n, I want to list the first n GitHub repositories (and their URLs) based on the number of their stars.
- I want to download the Readme.md file from those URLs.
- I want to save the Readme.md Files on my hard drive, each in a separate folder. The folder name should be the name of the repository.
I'm not acquainted with crawling and web scraping, but I am relatively good with Python. I'd be thankful for some help on how to accomplish these steps. Any help would be appreciated.
My effort: I've searched a little, and I found a website (gitstar-ranking.com) that ranks GitHub repos based on their stars. But that does not solve my problem because it is again a scraping task to get the name or the URL of those repos from this website.
Comments (1)
Here's my attempt using the suggestion from @Luke. I changed the minimum stars to 500 since we don't need 5 million results (>500 still yields 66513 results).
You might not need the ssl workaround on lines 29-30, but since I'm behind a proxy, it's a pain to do it properly.
The script finds files called readme.md in any combination of lower- and uppercase but nothing else. It saves the file as README.md (uppercase), but this can be adjusted by using the actual filename.
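The script itself is not reproduced here. As a rough sketch of the approach described above, and assuming the suggestion was to query the GitHub REST API (the repository search endpoint sorted by stars, then the contents listing of each repository), something like the following could work. All function names (`build_search_url`, `pick_readme`, `top_repos`, `download_top_readmes`) and the `readmes` output folder are made up for illustration. The SSL workaround and any authentication are left out.

```python
import json
import os
import urllib.parse
import urllib.request

API = "https://api.github.com"


def build_search_url(min_stars=500, per_page=100, page=1):
    """Build a repository search URL, most-starred repositories first."""
    query = urllib.parse.urlencode({
        "q": f"stars:>{min_stars}",
        "sort": "stars",
        "order": "desc",
        "per_page": per_page,  # the API caps this at 100
        "page": page,
    })
    return f"{API}/search/repositories?{query}"


def get_json(url):
    """Fetch a URL and decode the JSON response body."""
    request = urllib.request.Request(url, headers={
        "Accept": "application/vnd.github+json",
        "User-Agent": "readme-downloader-sketch",  # hypothetical client name
    })
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def pick_readme(entries):
    """Return the first file named readme.md (any letter case), else None."""
    for entry in entries:
        is_file = entry.get("type") == "file"
        if is_file and entry.get("name", "").lower() == "readme.md":
            return entry
    return None


def top_repos(n, min_stars=500):
    """Collect the n most-starred repositories, one result page at a time."""
    repos, page = [], 1
    while len(repos) < n:
        items = get_json(build_search_url(min_stars, page=page)).get("items", [])
        if not items:
            break
        repos.extend(items)
        page += 1
    return repos[:n]


def download_top_readmes(n, out_dir="readmes", min_stars=500):
    """Save each repository's README as <out_dir>/<repo name>/README.md."""
    for repo in top_repos(n, min_stars):
        # List the repository root and look for the README there.
        entries = get_json(f"{API}/repos/{repo['full_name']}/contents")
        readme = pick_readme(entries)
        if readme is None or not readme.get("download_url"):
            continue  # no readme.md in the repository root
        # Folder is named after the repository only, as the question asks;
        # two repositories with the same name would overwrite each other.
        folder = os.path.join(out_dir, repo["name"])
        os.makedirs(folder, exist_ok=True)
        with urllib.request.urlopen(readme["download_url"]) as response:
            with open(os.path.join(folder, "README.md"), "wb") as handle:
                handle.write(response.read())
```

Calling `download_top_readmes(10)` would then fetch the ten most-starred repositories. Keep in mind that unauthenticated requests are rate-limited and that the search endpoint only serves the first 1000 results of a query, so a larger n would need a different strategy, such as splitting the query into star ranges.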