Parsing HTML entities with NSXMLParser on iPhone
I think I have read every single web page relating to this problem, but I still cannot find a solution to it, so here I am.
I have an HTML web page which is not under my control and I need to parse it from my iPhone application. Here is a sample of the web page I'm talking about:
<HTML>
<HEAD>
<META http-equiv="Content-Type" content="text/html; charset=ISO-8859-1">
</HEAD>
<BODY>
<LI class="bye bye" rel="hello 1">
<H5 class="onlytext">
<A name="morning_part">morning</A>
</H5>
<DIV class="mydiv">
<SPAN class="myclass">something about you</SPAN>
<SPAN class="anotherclass">
<A href="http://www.google.it">Bye Bye &egrave; un saluto</A>
</SPAN>
</DIV>
</LI>
</BODY>
</HTML>
I'm using NSXMLParser and it goes well until it finds the &egrave; HTML entity. It calls foundCharacters: for "Bye Bye " and then calls resolveExternalEntityName:systemID: with an entityName of "egrave".
In this method I just return the character "è" converted to NSData. foundCharacters: is then called again, appending the string "è" to the previous "Bye Bye ", and then the parser raises the NSXMLParserUndeclaredEntityError error.
I have no DTD and I cannot change the HTML file I'm parsing. Do you have any ideas on this problem?
Update (12/03/2010). After the suggestion of Griffo I ended up with something like this:
data = [self replaceHtmlEntities:data];
NSXMLParser *parser = [[NSXMLParser alloc] initWithData:data];
[parser setDelegate:self];
[parser parse];
where replaceHtmlEntities:(NSData *) is something like this:
- (NSData *)replaceHtmlEntities:(NSData *)data {
    // Decode with the charset the page declares (ISO-8859-1).
    NSString *htmlCode = [[NSString alloc] initWithData:data encoding:NSISOLatin1StringEncoding];
    NSMutableString *temp = [NSMutableString stringWithString:htmlCode];
    // Replace each named entity with the literal character it stands for.
    [temp replaceOccurrencesOfString:@"&amp;" withString:@"&" options:NSLiteralSearch range:NSMakeRange(0, [temp length])];
    [temp replaceOccurrencesOfString:@"&nbsp;" withString:@" " options:NSLiteralSearch range:NSMakeRange(0, [temp length])];
    ...
    [temp replaceOccurrencesOfString:@"&Agrave;" withString:@"À" options:NSLiteralSearch range:NSMakeRange(0, [temp length])];
    // Re-encode so the parser receives the same charset it was declared with.
    NSData *finalData = [temp dataUsingEncoding:NSISOLatin1StringEncoding];
    return finalData;
}
But I am still looking for the best way to solve this problem. I will try TouchXML in the next few days, but I still think there should be a way to do this using the NSXMLParser API, so if you know how, feel free to write it here.
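One variation on the string-replacement workaround is to rewrite named entities as numeric character references, which XML accepts without any DTD, and to leave the five predefined entities untouched. The following is only a minimal sketch, assuming ARC and modern Objective-C literals; the helper name `ReplaceNamedEntities` and the tiny entity table are hypothetical and would need extending to cover the entities a real page uses.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper (not part of any Apple API): rewrites a few named HTML
// entities as numeric character references such as &#232;, which an XML
// parser can expand without a DTD. &amp;, &lt;, &gt;, &apos; and &quot; are
// deliberately left alone because XML already predefines them.
static NSData *ReplaceNamedEntities(NSData *data) {
    NSString *html = [[NSString alloc] initWithData:data
                                           encoding:NSISOLatin1StringEncoding];
    if (html == nil) {
        return data; // Could not decode; hand the bytes back unchanged.
    }

    // Illustrative subset only: entity name -> numeric reference (decimal code point).
    NSDictionary *entities = @{
        @"&nbsp;":   @"&#160;",
        @"&egrave;": @"&#232;",
        @"&agrave;": @"&#224;",
        @"&Agrave;": @"&#192;",
    };

    NSMutableString *temp = [html mutableCopy];
    for (NSString *name in entities) {
        [temp replaceOccurrencesOfString:name
                              withString:entities[name]
                                 options:NSLiteralSearch
                                   range:NSMakeRange(0, temp.length)];
    }
    // The replacements are pure ASCII, so re-encoding as Latin-1 is lossless.
    return [temp dataUsingEncoding:NSISOLatin1StringEncoding];
}
```

The result can be passed to `initWithData:` exactly as in the update above. Keeping the output ASCII-only for the replaced entities also sidesteps any question about which encoding the parser assumes.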
Comments (6)
After exploring several alternatives, it appears that NSXMLParser does not support entities other than the five predefined XML entities: &lt;, &gt;, &apos;, &quot; and &amp;.
Parsing a document that uses any other entity fails with an NSXMLParserUndeclaredEntityError.
Attempts to declare the entities by prepending ENTITY declarations to the HTML document get past that error; however, the expanded entities are not passed back to parser:foundCharacters: and the è and à characters are dropped.
In another experiment, I created a completely valid XML document with an internal DTD. I implemented the parser:foundInternalEntityDeclarationWithName:value: delegate method, and it is clear that the parser is getting the entity data; however, parser:foundCharacters: is only called for the predefined entities.
I found a link to a tutorial on Using the SAX Interface of LibXML. The xmlSAXHandler that is used by NSXMLParser allows a getEntity callback to be defined. After getEntity is called, the expansion of the entity is passed to the characters callback.
NSXMLParser is missing functionality here. What should happen is that NSXMLParser or its delegate stores the entity definitions and provides them to the xmlSAXHandler getEntity callback. This is clearly not happening. I will file a bug report.
In the meantime, the earlier answer of performing a string replacement is perfectly acceptable if your documents are small. Check out the SAX tutorial mentioned above, along with the XMLPerformance sample app from Apple, to see whether implementing the libxml parser on your own is worthwhile.
This has been fun.
A possibly less hacky solution is to replace the DTD with a locally modified copy in which every external entity declaration has been replaced with a local one.
This is how I do it:
First, find the document's DTD declaration and replace it so that it points to a local file instead of the remote W3C URL.
Download the DTD from the W3C URL and add it to your app bundle, then look up the file's path in the bundle at run time.
Open the DTD file and find any external entity references. Replace each one with the content of the entity file it refers to (for example, http://www.w3.org/TR/xhtml1/DTD/xhtml-lat1.ent).
After replacing all external references, NSXMLParser should properly handle the entities without needing to download every remote DTD and external entity file each time it parses an XML file.
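The DOCTYPE swap described above could be sketched roughly as follows. This is an illustration under assumptions, not the answerer's actual code: the helper name `PointDoctypeAtLocalDTD`, the XHTML 1.0 Transitional DTD URL and the bundled file name `xhtml1-transitional.dtd` are all chosen for the example.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: replaces the remote DTD URL inside the document's
// DOCTYPE with a file: URL that points at a local copy of the DTD.
static NSString *PointDoctypeAtLocalDTD(NSString *xml,
                                        NSString *remoteDTD,
                                        NSString *localPath) {
    NSString *localURL =
        [[NSURL fileURLWithPath:localPath isDirectory:NO] absoluteString];
    return [xml stringByReplacingOccurrencesOfString:remoteDTD
                                          withString:localURL];
}

// Possible usage inside an app (file name and DTD URL are assumptions):
//
//   NSString *path = [[NSBundle mainBundle] pathForResource:@"xhtml1-transitional"
//                                                    ofType:@"dtd"];
//   NSString *patched = PointDoctypeAtLocalDTD(
//       xml, @"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd", path);
//   NSXMLParser *parser = [[NSXMLParser alloc]
//       initWithData:[patched dataUsingEncoding:NSUTF8StringEncoding]];
//   parser.shouldResolveExternalEntities = YES; // may be needed for the DTD to load
```

Whether the parser actually loads the local DTD may depend on the OS version and on `shouldResolveExternalEntities`, so this is worth testing on a device before relying on it.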
You could do a string replace within the data before you parse it with NSXMLParser. NSXMLParser is UTF-8 only as far as I know.
I think you're going to run into another problem with this example, as it isn't valid XML, which is what NSXMLParser is looking for.
The exact problem in the sample above is that the META, LI, HTML and BODY tags aren't closed, so the parser looks all the way through the rest of the document for their closing tags.
The only way around this that I know of, if you can't change the HTML, is to mirror it with the closing tags inserted.
I would try using a different parser, like libxml2 - in theory I think that one should be able to handle poor HTML.
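As a rough illustration of that suggestion, libxml2 ships an HTML parser that is tolerant of unclosed tags and knows the HTML named entities. The sketch below assumes ARC and that the project links against libxml2 (`-lxml2`, with the libxml2 headers on the search path); `FirstElementNamed` and `TextOfFirstElement` are hypothetical helper names, not library functions.

```objc
#import <Foundation/Foundation.h>
#import <libxml/HTMLparser.h>
#import <libxml/tree.h>

// Depth-first search over a node and its following siblings for the first
// element whose tag name matches (case-insensitively).
static xmlNodePtr FirstElementNamed(xmlNodePtr node, const char *name) {
    for (xmlNodePtr cur = node; cur != NULL; cur = cur->next) {
        if (cur->type == XML_ELEMENT_NODE &&
            xmlStrcasecmp(cur->name, (const xmlChar *)name) == 0) {
            return cur;
        }
        xmlNodePtr found = FirstElementNamed(cur->children, name);
        if (found != NULL) {
            return found;
        }
    }
    return NULL;
}

// Hypothetical helper: parses Latin-1 HTML leniently and returns the text
// content of the first element with the given tag name, or nil.
static NSString *TextOfFirstElement(NSData *html, const char *tagName) {
    htmlDocPtr doc = htmlReadMemory(html.bytes, (int)html.length, NULL,
                                    "ISO-8859-1",
                                    HTML_PARSE_RECOVER | HTML_PARSE_NOERROR |
                                    HTML_PARSE_NOWARNING);
    if (doc == NULL) {
        return nil;
    }
    NSString *text = nil;
    xmlNodePtr node = FirstElementNamed(xmlDocGetRootElement(doc), tagName);
    if (node != NULL) {
        xmlChar *content = xmlNodeGetContent(node); // UTF-8, caller must free
        if (content != NULL) {
            text = [NSString stringWithUTF8String:(const char *)content];
            xmlFree(content);
        }
    }
    xmlFreeDoc(doc);
    return text;
}
```

This builds a whole tree in memory rather than streaming events, which is probably fine for small pages; for large documents the SAX-style interface of the same library would be the closer match to NSXMLParser.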
Since I've just started doing iOS development, I've been searching for the same thing and found a related mailing list entry: http://www.mail-archive.com/[email protected]/msg17706.html
This is fairly similar to your original solution and also causes a parser error (NSXMLParserErrorDomain error 26), but it does continue parsing after that. The problem is, of course, that it becomes harder to tell real errors apart ;-)
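One way to make that distinction a little easier is to check the error code explicitly, since code 26 in NSXMLParserErrorDomain corresponds to NSXMLParserUndeclaredEntityError. A minimal sketch, where the function name `IsUndeclaredEntityError` is hypothetical:

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: YES only for the "undeclared entity" parser error,
// so a delegate can treat that one case as expected and report the rest.
static BOOL IsUndeclaredEntityError(NSError *error) {
    return [error.domain isEqualToString:NSXMLParserErrorDomain] &&
           error.code == NSXMLParserUndeclaredEntityError;
}

// Possible use in an NSXMLParserDelegate:
//
//   - (void)parser:(NSXMLParser *)parser parseErrorOccurred:(NSError *)parseError {
//       if (!IsUndeclaredEntityError(parseError)) {
//           NSLog(@"Real parse error: %@", parseError);
//       }
//   }
```

This only filters what gets reported; it does not change whether the parser keeps going after the error.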