Scraping and parsing Wikipedia pages

Posted on 2024-08-09 16:58:15


I'm wondering if there are any existing libraries in or accessible from Objective-C that would allow me to scrape pages formatted like this one. Specifically, all of the dates and all of the text next to each date. If not, what would be the best way to go about doing this? Regular expressions? I heard that NSString might already have built-in methods for this. Is this true?

I was looking around to see if there were any alternative to scraping, such as an XML file or API. I did find an API but the only clients I see available are in other languages and they seem to just be able to post content to pages, not retrieve it.

EDIT: So I found more information regarding the API at these links:

And I was able to come up with this request, which returns some HTML-encoded text (well, the format is XML, but it includes the page's text with markup such as <a href= etc.). I'll keep looking through the docs to see if I can make this come out a bit better; if not, are there any recommendations on parsing this?

EDIT 2: Alright so thanks to this doc page, the simplest and cleanest way I've been able to retrieve the data is using this constructed link which returns the raw data (In wiki markup) of the relevant section. However, I guess I would then need to parse that, though if that really is the case, it should be a lot easier than the entire article.

Does anyone have any recommendations on parsing wiki markup such as the following in Objective-C?

==Events==
* [[710]] – [[Saracen]] invasion of [[Sardinia]].
*[[1275]] – Traditional founding of the city of [[Amsterdam]].
*[[1682]] – [[Philadelphia]], [[Pennsylvania]] is founded.

What I want to end up having is, I guess an NSDictionary or similar collection that will store the date with the accompanying snippet of information. Thanks!


蓝眼泪 2024-08-16 16:58:15

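Add &format=fmt to the end of your query, as described in API: Data formats. Your query would then become a JSON query, for example. You can specify XML, JSON, or a number of other formats.

You can easily parse the whole section and then display the HTML-formatted output in a web view.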
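As a rough illustration of the &format=fmt suggestion, the sketch below only builds a query URL and does not send it. The endpoint, the page title "Example", the section number, and the helper name build_query are assumptions made for illustration, since the asker's actual link is not shown; the parameters follow the MediaWiki query API as I understand it.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Assumed endpoint; the asker's actual request URL is not shown in the thread.
API_ENDPOINT = "https://en.wikipedia.org/w/api.php"


def build_query(title, section, fmt="json"):
    """Build a URL asking for the raw wiki markup of one section.

    `title` and `section` are placeholders; `fmt` is the value appended
    as &format=..., e.g. "json" or "xml".
    """
    params = {
        "action": "query",
        "prop": "revisions",
        "rvprop": "content",
        "rvsection": section,
        "titles": title,
        "format": fmt,  # the parameter the answer suggests adding
    }
    return API_ENDPOINT + "?" + urlencode(params)


url = build_query("Example", 1)
print(url)
```

Switching the response format is then a matter of passing a different fmt value, for example build_query("Example", 1, "xml").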


┈┾☆殇 2024-08-16 16:58:15

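Given that pages on Wikipedia are stored as plain text, and are entered by users as plain text, you won't be able to get a structured dataset out of them.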


朮生 2024-08-16 16:58:15

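I've scraped a lot of data from WP in various ways. The format depends on many factors, including which kind of subdomain the information is in and when it was entered. The body text is free-form, and there is no simple way to scrape it. Infoboxes use a special WP format that has changed over the years. It was not designed to be scraped.

There is a database backing WP, which is more structured.

By far your best strategy is to contact the Wikipedians in the domain you want to scrape. They will know the database format and will quite probably be able to help, and they will surely want to help, since they would like to see WP available in semantic form (e.g. DBpedia, http://dbpedia.org/About).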


坏尐絯℡ 2024-08-16 16:58:15

Does Python count? ;) It is accessible from Objective-C.
There are some great modules available for scraping: Beautiful Soup and/or mechanize, and you might also consider lxml.


梦里泪两行 2024-08-16 16:58:15

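I would suggest using regular expressions for targeted data extraction from the mixed HTML data stream.

There are already regex libraries on the phone, but they are a bit hidden. You can use RegexKitLite to expose them with a few simple calls (make sure to scroll down and get the Lite version). It ends up being a single class with some extensions on NSString that let you run regular expressions. You can then define a regex with two captures, one for the number and one for the content, plus some non-captured matches for the enclosing and in-between tags. Although it is the "Lite" version, it still supports almost any feature you would need.

The API approach is promising, but once you have the raw markup you will probably have to take a similar regex approach to parse the data out of it. It may still make sense if it reduces regex complexity and data transfer time, and there is no reason you can't combine the two approaches.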
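To make the two-capture idea concrete, here is a minimal sketch applied to the raw wiki markup from the question rather than to HTML. It is written in Python for brevity and is not RegexKitLite code; the names EVENT_RE, LINK_RE, and parse_events are made up for this example, and the patterns assume every event line looks like the three sample lines. A similar pattern should carry over to an Objective-C regex API, though escaping rules may differ.

```python
import re

# Capture 1: the linked year. Capture 2: the rest of the line.
# The separator may be the &ndash; entity, a literal en dash, or a hyphen.
EVENT_RE = re.compile(
    r"^\*[ \t]*\[\[(\d+)\]\][ \t]*(?:&ndash;|–|-)[ \t]*(.+)$",
    re.MULTILINE,
)

# Turns [[target]] into "target" and [[target|label]] into "label".
LINK_RE = re.compile(r"\[\[(?:[^\]|]*\|)?([^\]]*)\]\]")


def parse_events(markup):
    """Map each year to the list of event snippets found for it."""
    events = {}
    for year, text in EVENT_RE.findall(markup):
        plain = LINK_RE.sub(r"\1", text).strip()
        events.setdefault(year, []).append(plain)
    return events


sample = """==Events==
* [[710]] &ndash; [[Saracen]] invasion of [[Sardinia]].
*[[1275]] &ndash; Traditional founding of the city of [[Amsterdam]].
*[[1682]] &ndash; [[Philadelphia]], [[Pennsylvania]] is founded.
"""

events = parse_events(sample)
for year, snippets in events.items():
    print(year, snippets)
```

Each year maps to a list because a date can have more than one event. In Objective-C the natural counterpart would be an NSDictionary whose values are arrays of strings.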
