What is the best way to parse XML / "screen scrape" in iOS? UIWebView or NSXMLParser?

Posted on 2024-09-15 10:52:46

I am creating an iOS app that needs to get some data from a web page. My first thought was to use NSXMLParser initWithContentsOfURL: and parse the HTML with the NSXMLParser delegate. However, this approach seems like it could quickly become painful (if, for example, the HTML changed, I would have to rewrite the parsing code, which could be awkward).

Seeing as I'm loading a web page, I took a look at UIWebView too. It looks like UIWebView may be the way to go. stringByEvaluatingJavaScriptFromString: seems like a very handy way to extract the data, and it would allow the JavaScript to be stored in a separate file that would be easy to edit if the HTML changed. However, using UIWebView seems a bit hacky (seeing as UIWebView is a UIView subclass it may block the main thread, and the docs say that the JavaScript has a limit of 10MB).

Does anyone have any advice regarding parsing XML/HTML before I get stuck in?

UPDATE:

I wrote a blog post about my solution: HTML parsing/screen scraping in iOS

Comments (2)

孤千羽 2024-09-22 10:52:46

I've done this a few times. The best approach I've found is to use libxml2 which has a mode for HTML. Then you can use XPath to query the document.

Working with the libxml2 API is not the most enjoyable. So, I usually bring over the XPathQuery.h/.m files documented on this page:

http://cocoawithlove.com/2008/10/using-libxml2-for-parsing-and-xpath.html

Then I fetch the data using an NSURLConnection and query the data with something like this:

NSArray *tdNodes = PerformHTMLXPathQuery(self.receivedData, @"//td[@class='col-name']/a/span");

Summary:

  1. Add libxml2 to your project. Here are some quick instructions for Xcode 4:
    http://cmar.me/2011/04/20/adding-libxml2-to-an-xcode-4-project/

  2. Get the XPathQuery.h/.m

  3. Use an XPath statement to query the HTML document.

如梦 2024-09-22 10:52:46

Parsing HTML with an XML parser usually does not work anyway because many sites have incorrect HTML, which a web browser will deal with, but a strict XML parser like NSXMLParser will totally fail on.
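
To make the failure mode concrete, here is a toy strict-nesting check in plain C. It is only an illustration of the rule a strict XML parser enforces, not how NSXMLParser is implemented, and the function name tags_are_balanced is invented for this sketch. It deliberately ignores comments, CDATA and attribute values that contain '>'.

```c
#include <stddef.h>
#include <string.h>

#define MAX_DEPTH 64
#define MAX_NAME  32

/* Returns 1 if every opening tag in `markup` is closed in strict nesting
 * order (as XML requires), 0 otherwise. Simplified on purpose: comments,
 * CDATA, and '>' inside attribute values are not handled. */
int tags_are_balanced(const char *markup)
{
    char stack[MAX_DEPTH][MAX_NAME];
    int depth = 0;

    for (const char *p = strchr(markup, '<'); p != NULL; p = strchr(p, '<')) {
        const char *close = strchr(p, '>');
        if (close == NULL)
            return 0;                      /* unterminated tag */

        if (p[1] == '!' || p[1] == '?') {  /* doctype or processing instruction */
            p = close + 1;
            continue;
        }

        int is_end = (p[1] == '/');
        const char *name = p + (is_end ? 2 : 1);
        size_t len = strcspn(name, " \t\r\n/>");
        if (len == 0 || len >= MAX_NAME)
            return 0;

        if (is_end) {
            /* A closing tag must match the most recently opened element. */
            if (depth == 0 || strlen(stack[depth - 1]) != len ||
                strncmp(stack[depth - 1], name, len) != 0)
                return 0;
            depth--;
        } else if (close[-1] != '/') {     /* "<br/>" opens and closes at once */
            if (depth == MAX_DEPTH)
                return 0;
            memcpy(stack[depth], name, len);
            stack[depth][len] = '\0';
            depth++;
        }
        p = close + 1;
    }
    return depth == 0;
}
```

With this check, "<p>one<br/>two</p>" passes, while the very common "<p>one<br>two</p>" or "<ul><li>a<li>b</ul>" is rejected, even though browsers render both without complaint.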

For many scripting languages there are great scraping libraries that are more forgiving, such as Python's Beautiful Soup module. Unfortunately, I do not know of such modules for Objective-C.

Loading stuff into a UIWebView might be the simplest way to go here. Note that you do not have to put the UIWebView on screen. You can create a separate UIWindow and add the UIWebView to it, so that you do full off-screen rendering. There was a WWDC2009 video about this I think. As you already mention, it will not be lightweight though.

Depending on the data that you want and the complexity of the pages that you need to parse, you might also be able to parse it by using regular expressions or even a hand-written parser. I have done this many times, and for simple data this works well.
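
As a rough idea of what a hand-written extractor for simple data can look like, here is a minimal sketch in plain C. The function name extract_between and the marker strings in the usage note are hypothetical; it assumes the value you want sits between two fixed marker strings, and it will break as soon as the page markup changes.

```c
#include <stddef.h>
#include <string.h>

/* Copies the text found between the first occurrence of `open` and the next
 * occurrence of `close` into `out` (at most out_size - 1 bytes, always
 * NUL-terminated on success). Returns 1 on success, 0 if either marker is
 * missing or an argument is unusable. */
int extract_between(const char *html, const char *open, const char *close,
                    char *out, size_t out_size)
{
    if (html == NULL || open == NULL || close == NULL ||
        out == NULL || out_size == 0)
        return 0;

    const char *start = strstr(html, open);
    if (start == NULL)
        return 0;
    start += strlen(open);            /* skip past the opening marker */

    const char *end = strstr(start, close);
    if (end == NULL)
        return 0;

    size_t len = (size_t)(end - start);
    if (len >= out_size)
        len = out_size - 1;           /* truncate rather than overflow */

    memcpy(out, start, len);
    out[len] = '\0';
    return 1;
}
```

For example, calling it on "<td class=\"price\">42.50</td>" with the markers "<td class=\"price\">" and "</td>" leaves "42.50" in the output buffer. The design trades robustness for simplicity: there is no dependency to add, but every marker is tied to the exact markup of the page.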
