Regular expressions (HTML parsing on the iPhone)

Posted on 2024-09-28 16:19:40

I am trying to pull data from a website using Objective-C. This is all very new to me, so I've done some research. What I know so far is that I need to use XPath, and that there is a wrapper for it on the iPhone called hpple. I've got it up and running in my project.

I am confused about the way I retrieve information from the site. Apparently I am to use regular expressions in this line of code:

NSArray * a = [doc search:@"//a[@class='sponsor']"];

This is just an example. Is that stuff in the search:@"...." the regular expression? If so, I guess I can develop the hundreds of patterns that I will need for my program to parse the site (I need a lot of data), but is there a better way? I'm very lost in this. Any help is appreciated.

Comments (2)

风透绣罗衣 2024-10-05 16:19:40

The parameter is an XPath, not a regular expression. Here's a breakdown:

  • Every XPath expression is evaluated relative to a context node. In this case, it's the root node.
  • // is an abbreviation for /descendant-or-self::node()/, which in practice means "at any depth below the context node"
  • a is a name test: it selects child elements named "a" (in HTML, that's anchors). Combined with //, it selects every a element in the document
  • [...] contains a predicate, which narrows down which a elements match
    • @ is an abbreviation for the attribute axis
    • @class means an attribute named "class"
    • @class='sponsor' means a class attribute whose value equals "sponsor". Note this will not match elements whose class merely contains "sponsor", such as <a class="big sponsor" ...>; the whole attribute value must be equal.

Put together, we have "every 'a' element below the root whose class attribute equals 'sponsor'".
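
To see the exact-match behaviour described above, here is a minimal sketch in Python using only the standard library (this is not hpple; the sample markup and variable names are made up for illustration). ElementTree supports only a subset of XPath and evaluates paths relative to the element you search from, so ".//a" stands in for "//a". In a full XPath 1.0 engine, a common idiom for matching one class among several is //a[contains(concat(' ', normalize-space(@class), ' '), ' sponsor ')]; ElementTree does not support that, so the sketch does the token check in Python instead.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample markup; it is well-formed so the XML parser accepts it.
SAMPLE = """
<html>
  <body>
    <div>
      <a class="sponsor" href="/acme">Acme</a>
      <a class="big sponsor" href="/globex">Globex</a>
      <a href="/about">About</a>
    </div>
    <p><a class="sponsor" href="/initech">Initech</a></p>
  </body>
</html>
""".strip()

root = ET.fromstring(SAMPLE)

# Exact match: the class attribute must equal "sponsor" as a whole string.
exact = root.findall(".//a[@class='sponsor']")
exact_hrefs = [a.get("href") for a in exact]

# Token match: "sponsor" may be one of several space-separated classes.
token = [a for a in root.iter("a")
         if "sponsor" in (a.get("class") or "").split()]
token_hrefs = [a.get("href") for a in token]

print(exact_hrefs)  # ['/acme', '/initech']
print(token_hrefs)  # ['/acme', '/globex', '/initech']
```

The link with class "big sponsor" shows up only in the second list, which is the distinction the last bullet is making.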

云仙小弟 2024-10-05 16:19:40

That is an XPath expression, not a regular expression. The W3C publishes the XPath reference here: http://www.w3.org/TR/xpath/. Basically, you are searching for <a> elements whose class attribute is exactly "sponsor".

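Note that this is a good thing! Regular expressions are a poor tool for parsing HTML: markup nests arbitrarily and the same element can be written in many different ways, which a real parser handles and a pattern generally does not.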
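
As a rough illustration of that point, the sketch below (Python standard library only; the sample markup, the pattern, and the class name SponsorLinks are invented for this example) runs a naive regular expression and a real HTML parser over the same input. It is not meant as a recommended scraper, only as a demonstration of how easily a pattern goes wrong.

```python
import re
from html.parser import HTMLParser

# Hypothetical input: the same kind of link written three ways, plus one
# that is commented out and should not count.
SAMPLE = '''
<a class="sponsor" href="/acme">Acme</a>
<a href="/globex" class='sponsor'>Globex</a>
<A CLASS="sponsor" HREF="/initech">Initech</A>
<!-- <a class="sponsor" href="/old">Old</a> -->
'''

# Naive pattern: assumes attribute order, double quotes and lowercase names.
NAIVE = re.compile(r'<a class="sponsor" href="([^"]*)"')
regex_hrefs = NAIVE.findall(SAMPLE)


class SponsorLinks(HTMLParser):
    """Collects href values of <a> tags whose class is exactly 'sponsor'."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # The parser lowercases tag and attribute names and strips quotes.
        attr = dict(attrs)
        if tag == "a" and attr.get("class") == "sponsor":
            self.hrefs.append(attr.get("href"))


parser = SponsorLinks()
parser.feed(SAMPLE)
parser.close()
parser_hrefs = parser.hrefs

print(regex_hrefs)   # ['/acme', '/old']
print(parser_hrefs)  # ['/acme', '/globex', '/initech']
```

The pattern misses two of the three live links and picks up the commented-out one, while the parser ignores the comment and normalises quoting, case and attribute order.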
