Downloading Wikipedia text

Posted 2024-08-30 09:22:06

I am looking to download full Wikipedia text for my college project. Do I have to write my own spider to download this or is there a public dataset of Wikipedia available online?

To give you some overview of my project: I want to find the interesting words in a few articles I am interested in. To find these interesting words, I plan to apply tf/idf to calculate the term frequency for each word and pick the ones with high frequency. But to calculate the tf, I need to know the total occurrences across the whole of Wikipedia.

How can this be done?
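The tf/idf scoring described above can be sketched in a few lines. This is a toy illustration with a hypothetical three-document corpus; in a real run, the document frequencies would come from the full Wikipedia text:

```python
import math
from collections import Counter

def tf_idf(term, doc_tokens, corpus):
    """tf-idf score of `term` for one tokenized document against a corpus."""
    tf = Counter(doc_tokens)[term] / len(doc_tokens)   # term frequency in this document
    df = sum(1 for doc in corpus if term in doc)       # how many documents contain the term
    idf = math.log(len(corpus) / (1 + df))             # smoothed inverse document frequency
    return tf * idf

# Toy corpus: a word that appears in few documents scores higher
# than one that appears in every document.
docs = [["wiki", "text", "cricket"], ["wiki", "dump"], ["wiki", "api"]]
```

With this corpus, `tf_idf("cricket", docs[0], docs)` outranks `tf_idf("wiki", docs[0], docs)`, since "wiki" occurs in every document.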

Comments (8)

送君千里 2024-09-06 09:22:06

From Wikipedia: http://en.wikipedia.org/wiki/Wikipedia_database

Wikipedia offers free copies of all available content to interested users. These databases can be used for mirroring, personal use, informal backups, offline use or database queries (such as for Wikipedia:Maintenance). All text content is multi-licensed under the Creative Commons Attribution-ShareAlike 3.0 License (CC-BY-SA) and the GNU Free Documentation License (GFDL). Images and other files are available under different terms, as detailed on their description pages. For our advice about complying with these licenses, see Wikipedia:Copyrights.

Seems that you are in luck too. From the dump section:

As of 12 March 2010, the latest complete dump of the English-language Wikipedia can be found at http://download.wikimedia.org/enwiki/20100130/. This is the first complete dump of the English-language Wikipedia to have been created since 2008.
Please note that more recent dumps (such as the 20100312 dump) are incomplete.

So the data is only 9 days old :)

EDIT: the old link is broken; the dumps now live at https://dumps.wikimedia.org/enwiki/
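Once downloaded, such a dump is one huge XML file. A streaming parse with only the Python standard library might look like this sketch; note that the export namespace URI below is an assumption (it varies with the dump's schema version), so check the `<mediawiki>` root tag of your file:

```python
import bz2
import xml.etree.ElementTree as ET

# Namespace of the dump's XML schema; confirm it against your file's header.
NS = "{http://www.mediawiki.org/xml/export-0.10/}"

def iter_pages(stream):
    """Yield (title, wikitext) pairs from a pages-articles XML stream."""
    for _event, elem in ET.iterparse(stream, events=("end",)):
        if elem.tag == NS + "page":
            title = elem.findtext(NS + "title")
            text = elem.findtext(f"{NS}revision/{NS}text") or ""
            yield title, text
            elem.clear()  # discard the parsed page to keep memory usage flat

# Usage with a local dump (hypothetical filename):
# with bz2.open("enwiki-latest-pages-articles.xml.bz2", "rb") as f:
#     for title, text in iter_pages(f):
#         ...
```

Streaming with `iterparse` matters here: the uncompressed English dump is tens of gigabytes, far too large to load with `ET.parse`.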

骑趴 2024-09-06 09:22:06

If you need a text only version, not a Mediawiki XML, then you can download it here:
http://kopiwiki.dsd.sztaki.hu/

誰ツ都不明白 2024-09-06 09:22:06

Considering the size of the dump, you would probably be better served using word frequencies for the English language in general, or using the MediaWiki API to poll pages at random (or the most-viewed pages). There are frameworks for building bots on top of this API (in Ruby, C#, ...) that can help you.
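As a concrete example of the API route, the `list=random` query module returns random article titles. A stdlib-only sketch (the URL builder is kept as a pure helper; fetching the page text, e.g. via `prop=extracts`, is left out for brevity):

```python
import json
import urllib.parse
import urllib.request

API = "https://en.wikipedia.org/w/api.php"

def random_query_url(n):
    """Build a list=random API URL (pure helper, no network)."""
    params = {"action": "query", "list": "random",
              "rnnamespace": 0,  # main namespace: articles only
              "rnlimit": n, "format": "json"}
    return API + "?" + urllib.parse.urlencode(params)

def random_titles(n=5):
    """Fetch `n` random article titles from the live API."""
    with urllib.request.urlopen(random_query_url(n), timeout=30) as resp:
        data = json.load(resp)
    return [page["title"] for page in data["query"]["random"]]
```

Sampling random pages this way gives a rough estimate of corpus-wide word frequencies without downloading the whole dump.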

瞄了个咪的 2024-09-06 09:22:06

All the latest Wikipedia datasets can be downloaded from Wikimedia.
Just make sure to click on the latest available date.

翻了热茶 2024-09-06 09:22:06

Use this script

# Example page: https://en.wikipedia.org/w/api.php?action=query&prop=extracts&pageids=18630637&inprop=url&format=json
import os
import sys
import requests

os.makedirs("wikipedia", exist_ok=True)  # make sure the output directory exists
for i in range(int(sys.argv[1]), int(sys.argv[2])):
    print(f"[wikipedia] getting source - id {i}")
    text = requests.get(
        "https://en.wikipedia.org/w/api.php",
        params={"action": "query", "prop": "extracts",
                "pageids": i, "inprop": "url", "format": "json"},
    ).text
    print(f"[wikipedia] putting into file - id {i}")
    with open(f"wikipedia/{i}--id.json", "w", encoding="utf-8") as f:
        f.write(text)
    print(f"[wikipedia] archived - id {i}")

1 to 1062 is at https://costlyyawningassembly.mkcodes.repl.co/.

我的痛♀有谁懂 2024-09-06 09:22:06

I found a relevant Kaggle dataset at https://www.kaggle.com/datasets/ltcmdrdata/plain-text-wikipedia-202011

From the dataset description:

Content

This dataset includes ~40MB JSON files, each of which contains a collection of Wikipedia articles. Each article element in the JSON contains only 3 keys: an ID number, the title of the article, and the text of the article. Each article has been "flattened" to occupy a single plain text string. This makes it easier for humans to read, as opposed to the markup version. It also makes it easier for NLP tasks. You will have much less cleanup to do.

Each file looks like this:

[
 {
  "id": "17279752",
  "text": "Hawthorne Road was a cricket and football ground in Bootle in England...",
  "title": "Hawthorne Road"
 }
]

From this it is trivial to extract the text with a JSON reader.
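For instance, aggregating word counts over the dataset's article dicts takes only a few lines (the input filename below is hypothetical):

```python
import json
from collections import Counter

def word_counts(articles):
    """Aggregate word frequencies over article dicts that carry a 'text' key."""
    counts = Counter()
    for article in articles:
        counts.update(article["text"].lower().split())
    return counts

# Usage with one file from the dataset (hypothetical filename):
# with open("wiki_001.json", encoding="utf-8") as f:
#     counts = word_counts(json.load(f))
```

Summing such `Counter` objects across all files of the dataset would give the whole-Wikipedia occurrence totals the question asks about.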
