Scraping and parsing Wikipedia pages
I'm wondering if there are any existing libraries in or accessible from Objective-C that would allow me to scrape pages formatted like this one. Specifically, all of the dates and all of the text next to each date. If not, what would be the best way to go about doing this? Regular expressions? I heard that NSString might already have built-in methods for this. Is this true?
I was looking around to see if there were any alternatives to scraping, such as an XML file or an API. I did find an API, but the only clients I see available are in other languages, and they seem to only be able to post content to pages, not retrieve it.
EDIT: So I found more information regarding the API at these links:
And I was able to come up with this request, which returns some HTML-encoded text (well, the format is XML, but it includes the page's text with encoded markup such as &lt;a href= etc.). I'll keep looking through the docs to see if I can make this come out a bit better; if not, though, are there any recommendations on parsing this?
EDIT 2: Alright, so thanks to this doc page, the simplest and cleanest way I've found to retrieve the data is this constructed link, which returns the raw data (in wiki markup) of the relevant section. However, I guess I would then need to parse that, though if that really is the case, it should be a lot easier than parsing the entire article.
Does anyone have any recommendations on parsing wiki markup such as the following in Objective-C?
==Events==
* [[710]] – [[Saracen]] invasion of [[Sardinia]].
*[[1275]] – Traditional founding of the city of [[Amsterdam]].
*[[1682]] – [[Philadelphia]], [[Pennsylvania]] is founded.
What I want to end up with is, I guess, an NSDictionary or similar collection that stores each date with its accompanying snippet of information. Thanks!
Comments (7)
Add &format=fmt to the end of your query, as described at API:Data_formats. Your query becomes a JSON query, for example. You can specify XML, JSON, or many other formats. You can easily parse the overall sections, and then just display the HTML-formatted output in a web view.
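As a rough illustration of the format parameter, here is a minimal sketch in Python (used only for brevity; the same URL can be built from any language). The endpoint, the page title "Example page", the section number, and the helper name build_query_url are all assumptions made for this example, not the asker's actual request. The parameter names (action, prop, rvprop, rvsection, titles, format) are standard MediaWiki API parameters.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Assumed endpoint for English Wikipedia; other wikis use their own host.
API_ENDPOINT = "https://en.wikipedia.org/w/api.php"

def build_query_url(title, section, fmt="json"):
    """Build a URL asking for the wiki markup of one section of a page.

    `title` and `section` are placeholders chosen by the caller;
    `fmt` becomes the &format= value discussed above.
    """
    params = {
        "action": "query",
        "prop": "revisions",
        "rvprop": "content",   # return the revision text (wiki markup)
        "rvsection": section,  # limit the result to a single section
        "titles": title,
        "format": fmt,         # e.g. "json" or "xml"
    }
    return API_ENDPOINT + "?" + urlencode(params)

json_url = build_query_url("Example page", 1)
xml_url = build_query_url("Example page", 1, fmt="xml")
print(json_url)
print(xml_url)
```

Only the format value differs between the two URLs, so switching the output format should not require touching the rest of the request.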
Given that pages on Wikipedia are stored as plaintext, and input by users as plaintext, you're not going to get a structured data set from it.
I have scraped a lot of data from WP in various ways. The format depends on a lot of things, including what type of subdomain the information is in and when it was entered. The main text is free format and there is no simple way to scrape it. The infoboxes are in a special WP format which has changed over the years. It wasn't designed to be scraped.
There is a database backing WP which is somewhat more structured.
By far your best strategy is to contact the Wikipedians in the domain you wish to scrape - they will know about the database format and may well be able to help - they will certainly want to help as they will want to see WP in semantic form (such as DBPedia - http://dbpedia.org/About).
Does Python count? ;) It is accessible from Objective-C.
And there are great modules for scraping purposes: Beautiful Soup and/or mechanize. You can also consider lxml.
I'm going to go with suggesting regex for targeted data extraction in a mixed HTML data stream.
There are already regex libraries on the phone, though they are sort of hidden. You can expose them with a few simple calls using RegexKitLite (make sure to scroll down and get the lite version). It ends up being a class with a few extensions on NSString that lets you run regexes. You would then define a regex with two captured matches, one for the number and one for the content, along with a number of non-captured matches for the enclosing and intermediate tags. Even though it's a "lite" version of standard regex, it still supports just about any ability you would need.
The API approach is promising, but once you get the raw markup you're probably going to have to take a similar regex approach to parsing data out of that. It still might make sense if it reduces regex complexity and data transfer time, though; there's no reason you can't combine both approaches.
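To make the two-capture idea concrete, here is a minimal sketch applied to the wiki markup from the question rather than to HTML. It is written in Python only for brevity; the patterns use ordinary regex syntax and should carry over to an Objective-C regex library with little change, though that is untested here. The names ENTRY, LINK, and parse_events are made up for this example.

```python
import re

# Capture 1: the year inside the leading [[...]] link.
# Capture 2: everything after the dash (en dash or hyphen).
ENTRY = re.compile(r"^\*\s*\[\[(\d+)\]\]\s*[–-]\s*(.+)$")

# Matches [[Target]] and [[Target|label]], capturing the visible text.
LINK = re.compile(r"\[\[(?:[^\]|]*\|)?([^\]]+)\]\]")

def parse_events(markup):
    """Return a dict mapping each year (as a string) to its event text."""
    events = {}
    for line in markup.splitlines():
        match = ENTRY.match(line.strip())
        if match:  # headings such as ==Events== simply do not match
            year, text = match.groups()
            # Replace each wiki link with its visible text.
            events[year] = LINK.sub(r"\1", text).strip()
    return events

SAMPLE = """==Events==
* [[710]] – [[Saracen]] invasion of [[Sardinia]].
*[[1275]] – Traditional founding of the city of [[Amsterdam]].
*[[1682]] – [[Philadelphia]], [[Pennsylvania]] is founded.
"""

for year, text in parse_events(SAMPLE).items():
    print(year, "->", text)
```

Note that a plain dictionary keyed by year keeps only the last entry when a year appears more than once; a list of (year, text) pairs would be a safer collection if that can happen in the section being parsed.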
That's most definitely not the way to do it, in any language.
If any site online will expose its data in a nice way, it'll be Wikipedia.
Look into getting an article as XML, as RDF, or maybe even as JSON.
I've got an iPhone app which does a screen scrape using the following:
Using YQL you can get whatever information you need from the web by using XPATH queries against the DOM.
Personally I think it's much better than using regex. Then again, I only know very simple regular expressions.
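For a feel of what XPath-style queries against a DOM look like, here is a minimal sketch. It does not use YQL; it uses Python's standard-library ElementTree, which supports only a subset of XPath, and the HTML fragment is invented for the example (real pages are rarely well-formed XML, so a lenient parser such as lxml would likely be needed in practice).

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed fragment standing in for a rendered page section.
SAMPLE = """<div>
  <ul>
    <li><a href="/wiki/710">710</a> – Saracen invasion of Sardinia.</li>
    <li><a href="/wiki/1275">1275</a> – Traditional founding of the city of Amsterdam.</li>
  </ul>
</div>"""

root = ET.fromstring(SAMPLE)

# Path query: the first link of every list item holds the year.
years = [a.text for a in root.findall(".//ul/li/a")]

# Flatten each list item to plain text, dropping the tags.
items = ["".join(li.itertext()).strip() for li in root.findall(".//ul/li")]

print(years)
print(items)
```

The path expressions select elements by their position in the tree, which tends to be less brittle than matching tags with regular expressions.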