I am trying to pull data from a website using Objective-C. This is all very new to me, so I've done some research. What I know now is that I need to use XPath, and I have a wrapper for it on the iPhone called hpple. I've got it up and running in my project.
I am confused about the way I retrieve information from the site. Apparently I am to use regular expressions in this line of code:
NSArray * a = [doc search:@"//a[@class='sponsor']"];
This is just an example. Is that stuff in the search:@"...." the regular expression? If so, I guess I can develop the hundreds of patterns that I will need for my program to parse the site (I need a lot of data), but is there a better way? I'm very lost in this. Any help is appreciated.
Comments (2)
The parameter is an XPath expression, not a regular expression. Here's a breakdown:
- // is an abbreviation meaning "all descendants"
- a means "all child elements named 'a'" (in HTML, that's anchors)
- [...] contains a predicate, refining just which a to match
- @ is an abbreviation for attribute nodes
- @class means an attribute named "class"
- @class='sponsor' means a class attribute equal to "sponsor". Note this will not match nodes with a class containing "sponsor", such as <a class="big sponsor" ...>; the class must be equal.
All together, we have "'a' nodes descending from the root that have class equal to 'sponsor'".
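The exact-match behaviour is easy to check for yourself. Below is a minimal sketch that uses Python's standard-library xml.etree.ElementTree instead of hpple, purely so it can be run anywhere; the sample markup and the names SAMPLE, exact_hrefs and token_hrefs are invented for illustration. ElementTree only accepts relative paths on an element, so the query is written .//a[...] rather than //a[...]; the predicate itself is unchanged.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup, invented for this example. It is well-formed XML,
# which ElementTree requires; real-world HTML often is not.
SAMPLE = """
<html><body>
  <a class="sponsor" href="/one">One</a>
  <a class="big sponsor" href="/two">Two</a>
  <a class="other" href="/three">Three</a>
  <div><a class="sponsor" href="/four">Four</a></div>
</body></html>
"""

root = ET.fromstring(SAMPLE)

# Same predicate as in the question: the class attribute must equal "sponsor".
exact = root.findall(".//a[@class='sponsor']")
exact_hrefs = [a.get("href") for a in exact]

# Treating "sponsor" as one of several space-separated classes needs a
# different test; here it is done in Python after selecting every <a>.
token_hrefs = [
    a.get("href")
    for a in root.findall(".//a")
    if "sponsor" in a.get("class", "").split()
]

print(exact_hrefs)
print(token_hrefs)
```

The first list leaves out the "big sponsor" link, while the second includes it. In a full XPath 1.0 engine the usual way to express the second test inside the query is a predicate such as contains(concat(' ', normalize-space(@class), ' '), ' sponsor '); ElementTree's XPath subset does not support that function, which is why the sketch filters in Python.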
That is an XPath expression, not a regular expression. The W3C has an XPath reference here: http://www.w3.org/TR/xpath/. Basically you are searching for <a> elements with the class "sponsor".
Note that this is a good thing! Regular expressions are bad for parsing HTML.
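As a rough illustration of why this is a good thing: an XPath query hands back element objects, so attributes and text are read from those objects instead of being picked out of raw markup with string patterns. The sketch below uses Python's standard-library xml.etree.ElementTree rather than hpple so that it is runnable as-is; extract_links, page and the markup are hypothetical names and data made up for the example.

```python
import xml.etree.ElementTree as ET


def extract_links(markup, css_class):
    """Return (href, text) pairs for <a> elements whose class equals css_class."""
    root = ET.fromstring(markup)  # requires well-formed markup
    # Assumes css_class contains no quote characters.
    query = ".//a[@class='%s']" % css_class
    return [(a.get("href"), (a.text or "").strip()) for a in root.findall(query)]


# Invented sample input: one sponsor link and one unrelated link.
page = (
    '<div>'
    '<a class="sponsor" href="http://example.com/a"> Acme </a>'
    '<a class="nav" href="/home">Home</a>'
    '</div>'
)

links = extract_links(page, "sponsor")
print(links)
```

One query plus ordinary attribute access covers every link of that kind on the page, so the number of expressions you need grows with the kinds of data you want, not with the variations in how the markup is written.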