Objective-C - Which library should I use to parse HTML?
I am trying to parse some uncomplicated HTML content from RSS feeds on the iPhone.
So I don't need a heavyweight HTML parser.
I searched here and found these two:
https://github.com/topfunky/hpple
https://github.com/zootreeves/Objective-C-HMTL-Parser
Both are simple to use, but I think each has a problem for my purpose.
TFHpple is good, but its elements do not carry their own complete HTML tag string. I need that complete tag string because I want to remove it from the whole HTML string. It would be more convenient for me if each element had it.
The zootreeves HTML-Parser is also simple and good, and it does keep the complete tag string with every element, which I am very happy about. However, it seems to be a big memory consumer. I monitored it: if I parse a large number of HTML fragments (say, 1000), it uses about 40MB of memory, and that memory stays occupied. That is not acceptable on iOS devices. I guess zootreeves uses pure C code and linked lists to organise the HTML tree structure, and it manages memory with plain malloc and free. I don't know whether that affects memory on iOS.
So, can anyone recommend a better, state-of-the-art HTML parser for iOS that is fast and simple?
Thanks
Comments (1)
I'd use libxml2. It's not just for XML; it has an HTML parser too. It's fast, uses little memory, and is available on iOS. The only drawback is that it's a C-based API, but even so it's not terribly difficult to work with.
Update
In response to the first comment below: It's been a while, so I'm not sure, but I don't think so. What you get is a data structure with lots of information about the document structure, and each tag has a list of attribute/value pairs. The original HTML string is not stored anywhere (I presume it is considered redundant and is left out to save memory).
However, it doesn't seem like you actually need it for what you want to do. It seems to me that you are using information from the parser to modify the original string, stripping out HTML tags. What you want to do instead is to rebuild the document using information from the parse tree, and when you do this, leave out the tags you want omitted.
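To make the "rebuild instead of strip" idea concrete, here is a minimal sketch in plain C. The `Node` and `Attr` structs and the `rebuild` function are hypothetical stand-ins for whatever tree your parser gives you; they are not libxml2 types, although libxml2's nodes can be walked in a similar child/sibling fashion. The sketch assumes text and attribute values are stored already escaped, and it always writes a closing tag, so void elements such as `<br>` would need special handling.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical attribute/value pair (illustration only, not a library type). */
typedef struct Attr {
    const char *name;
    const char *value;
    const struct Attr *next;
} Attr;

/* Hypothetical parse-tree node: an element, or a text node when tag is NULL. */
typedef struct Node {
    const char *tag;              /* element name, NULL for a text node */
    const char *text;             /* text content, used when tag is NULL */
    const Attr *attrs;            /* attribute/value pairs of the element */
    const struct Node *children;  /* first child */
    const struct Node *next;      /* next sibling */
} Node;

/* Append s to out (capacity cap, current length len), truncating if needed.
   Returns the new length; out stays NUL-terminated. */
static size_t append(char *out, size_t cap, size_t len, const char *s) {
    size_t n = strlen(s);
    if (len + n >= cap) {
        n = cap - len - 1;
    }
    memcpy(out + len, s, n);
    out[len + n] = '\0';
    return len + n;
}

/* Serialize n and its following siblings into out, leaving out every
   element named skip together with its whole subtree. */
size_t rebuild(const Node *n, const char *skip,
               char *out, size_t cap, size_t len) {
    for (; n != NULL; n = n->next) {
        if (n->tag == NULL) {                 /* text node: copy as-is */
            len = append(out, cap, len, n->text);
            continue;
        }
        if (strcmp(n->tag, skip) == 0) {      /* unwanted element: omit it */
            continue;
        }
        len = append(out, cap, len, "<");
        len = append(out, cap, len, n->tag);
        for (const Attr *a = n->attrs; a != NULL; a = a->next) {
            len = append(out, cap, len, " ");
            len = append(out, cap, len, a->name);
            len = append(out, cap, len, "=\"");
            len = append(out, cap, len, a->value);
            len = append(out, cap, len, "\"");
        }
        len = append(out, cap, len, ">");
        len = rebuild(n->children, skip, out, cap, len);
        len = append(out, cap, len, "</");
        len = append(out, cap, len, n->tag);
        len = append(out, cap, len, ">");
    }
    return len;
}
```

Call it with the root node, the tag name to drop, and a buffer whose first byte is already `'\0'`, passing `0` as the starting length. Because the output is written from the tree, you never need the original tag string of the element you are removing.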