Parsing HTML entities with NSXMLParser on iPhone

Posted on 2024-10-11 12:38:49

I think I read every single web page relating to this problem but I still cannot find a solution to it, so here I am.

I have an HTML web page which is not under my control and I need to parse it from my iPhone application. Here is a sample of the web page I'm talking about:

<HTML>
  <HEAD>
    <META http-equiv="Content-Type" content="text/html; charset=ISO-8859-1">
  </HEAD>
  <BODY>
    <LI class="bye bye" rel="hello 1">
      <H5 class="onlytext">
        <A name="morning_part">morning</A>
      </H5>
      <DIV class="mydiv">
        <SPAN class="myclass">something about you</SPAN> 
        <SPAN class="anotherclass">
          <A href="http://www.google.it">Bye Bye &egrave; un saluto</A>
        </SPAN>
      </DIV>
    </LI>
  </BODY>
</HTML>

I'm using NSXMLParser and it goes well until it finds the &egrave; HTML entity. It calls parser:foundCharacters: for "Bye Bye " and then calls parser:resolveExternalEntityName:systemID: with an entityName of "egrave".
In this method I just return the character "è" converted to NSData. foundCharacters is then called again, appending the string "è" to the previous "Bye Bye ", and then the parser raises the NSXMLParserUndeclaredEntityError error.

I have no DTD and I cannot change the html file I'm parsing. Do you have any ideas on this problem?

Update (12/03/2010). After the suggestion of Griffo I ended up with something like this:

data = [self replaceHtmlEntities:data];
NSXMLParser *parser = [[NSXMLParser alloc] initWithData:data];
[parser setDelegate:self];
[parser parse];

where replaceHtmlEntities:(NSData *) is something like this:

- (NSData *)replaceHtmlEntities:(NSData *)data {
    
    NSString *htmlCode = [[NSString alloc] initWithData:data encoding:NSISOLatin1StringEncoding];
    NSMutableString *temp = [NSMutableString stringWithString:htmlCode];
    
    [temp replaceOccurrencesOfString:@"&amp;" withString:@"&" options:NSLiteralSearch range:NSMakeRange(0, [temp length])];
    [temp replaceOccurrencesOfString:@"&nbsp;" withString:@" " options:NSLiteralSearch range:NSMakeRange(0, [temp length])];
    ...
    [temp replaceOccurrencesOfString:@"&Agrave;" withString:@"À" options:NSLiteralSearch range:NSMakeRange(0, [temp length])];

    NSData *finalData = [temp dataUsingEncoding:NSISOLatin1StringEncoding];
    return finalData;
    
}

But I am still looking for the best way to solve this problem. I will try TouchXML in the next few days, but I still think there should be a way to do this using the NSXMLParser API, so if you know how, feel free to write it here.

っ〆星空下的拥抱 2024-10-18 12:38:49

After exploring several alternatives, it appears that NSXMLParser will not support entities other than the standard entities &lt;, &gt;, &apos;, &quot; and &amp;.

The code below fails resulting in an NSXMLParserUndeclaredEntityError.


// Create a dictionary to hold the entities and NSString equivalents
// A complete list of entities and unicode values is described in the HTML DTD
// which is available for download http://www.w3.org/TR/xhtml1/DTD/xhtml-lat1.ent


NSDictionary *entityMap = [NSDictionary dictionaryWithObjectsAndKeys: 
                     [NSString stringWithFormat:@"%C", 0x00E8], @"egrave",
                     [NSString stringWithFormat:@"%C", 0x00E0], @"agrave", 
                     ...
                     ,nil];

NSXMLParser *parser = [[NSXMLParser alloc] initWithData:data];
[parser setDelegate:self];
[parser setShouldResolveExternalEntities:YES];
[parser parse];

// NSXMLParser delegate method
- (NSData *)parser:(NSXMLParser *)parser resolveExternalEntityName:(NSString *)entityName systemID:(NSString *)systemID {
    return [[entityMap objectForKey:entityName] dataUsingEncoding: NSUTF8StringEncoding];
}

Declaring the entities by prepending ENTITY declarations to the HTML document gets past the error; however, the expanded entities are not passed back to parser:foundCharacters:, so the è and à characters are dropped.

<?xml version="1.0" encoding="UTF-8" ?>
<!DOCTYPE HTML PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd"
[
  <!ENTITY agrave "à">
  <!ENTITY egrave "è">
]>

In another experiment, I created a completely valid XML document with an internal DTD:

<?xml version="1.0" standalone="yes" ?>
<!DOCTYPE author [
    <!ELEMENT author (#PCDATA)>
    <!ENTITY js "Jo Smith">
]>
<author>&lt; &js; &gt;</author>

I implemented the parser:foundInternalEntityDeclarationWithName:value: delegate method and it is clear that the parser is getting the entity data; however, parser:foundCharacters: is only called for the predefined entities.

2010-03-20 12:53:59.871 xmlParsing[1012:207] Parser Did Start Document
2010-03-20 12:53:59.873 xmlParsing[1012:207] Parser foundElementDeclarationWithName: author model: 
2010-03-20 12:53:59.873 xmlParsing[1012:207] Parser foundInternalEntityDeclarationWithName: js value: Jo Smith
2010-03-20 12:53:59.874 xmlParsing[1012:207] didStartElement: author type: (null)
2010-03-20 12:53:59.875 xmlParsing[1012:207] parser foundCharacters Before: 
2010-03-20 12:53:59.875 xmlParsing[1012:207] parser foundCharacters After: <
2010-03-20 12:53:59.876 xmlParsing[1012:207] parser foundCharacters Before: <
2010-03-20 12:53:59.876 xmlParsing[1012:207] parser foundCharacters After: < 
2010-03-20 12:53:59.877 xmlParsing[1012:207] parser foundCharacters Before: < 
2010-03-20 12:53:59.878 xmlParsing[1012:207] parser foundCharacters After: <  
2010-03-20 12:53:59.879 xmlParsing[1012:207] parser foundCharacters Before: <  
2010-03-20 12:53:59.879 xmlParsing[1012:207] parser foundCharacters After: <  >
2010-03-20 12:53:59.880 xmlParsing[1012:207] didEndElement: author with content: <  >
2010-03-20 12:53:59.880 xmlParsing[1012:207] Parser Did End Document
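
For readers who want to repeat this experiment, here is a minimal sketch of a delegate that could produce a log like the one above. The class name GREntityLogger and its two properties are hypothetical, and the sketch assumes ARC; only the delegate method selectors come from the NSXMLParserDelegate protocol.

```objc
#import <Foundation/Foundation.h>

// Hypothetical delegate: records entity declarations and logs character data.
@interface GREntityLogger : NSObject <NSXMLParserDelegate>
@property (nonatomic, strong) NSMutableString *content;
@property (nonatomic, strong) NSMutableArray *declaredEntities;
@end

@implementation GREntityLogger

- (instancetype)init {
    self = [super init];
    if (self) {
        _content = [NSMutableString string];
        _declaredEntities = [NSMutableArray array];
    }
    return self;
}

// Called for each <!ENTITY name "value"> found in the internal DTD subset.
- (void)parser:(NSXMLParser *)parser foundInternalEntityDeclarationWithName:(NSString *)name
         value:(NSString *)value {
    NSLog(@"foundInternalEntityDeclarationWithName: %@ value: %@", name, value);
    [self.declaredEntities addObject:name];
}

// Start collecting text afresh for every element.
- (void)parser:(NSXMLParser *)parser didStartElement:(NSString *)elementName
  namespaceURI:(NSString *)namespaceURI
 qualifiedName:(NSString *)qName
    attributes:(NSDictionary *)attributeDict {
    [self.content setString:@""];
}

// Log the accumulated text before and after each chunk is appended.
- (void)parser:(NSXMLParser *)parser foundCharacters:(NSString *)string {
    NSLog(@"foundCharacters Before: %@", self.content);
    [self.content appendString:string];
    NSLog(@"foundCharacters After: %@", self.content);
}

@end
```

Set an instance as the parser's delegate before calling parse, then compare declaredEntities with what actually reaches content.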

I found a link to a tutorial on Using the SAX Interface of LibXML. The xmlSAXHandler that is used by NSXMLParser allows for a getEntity callback to be defined. After calling getEntity, the expansion of the entity is passed to the characters callback.

NSXMLParser is missing functionality here. What should happen is that NSXMLParser or its delegate stores the entity definitions and provides them to the xmlSAXHandler getEntity callback. This is clearly not happening. I will file a bug report.

In the meantime, the earlier answer of performing a string replacement is perfectly acceptable if your documents are small. Check out the SAX tutorial mentioned above along with the XMLPerformance sample app from Apple to see if implementing the libxml parser on your own is worthwhile.

This has been fun.

落墨 2024-10-18 12:38:49

A possibly less hacky solution is to replace the DTD with a locally modified copy in which every external entity declaration has been replaced with a local one.

This is how I do it:

First, find and replace the document DTD declaration with a local file. For example, replace this:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html><body><a href='a.html'>hi!</a><br><p>Hello</p></body></html>

with this:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "file://localhost/Users/siuying/Library/Application%20Support/iPhone%20Simulator/6.1/Applications/17065C0F-6754-4AD0-A1EA-9373F6476F8F/App.app/xhtml1-transitional.dtd">
<html><body><a href='a.html'>hi!</a><br><p>Hello</p></body></html>

Download the DTD from the W3C URL and add it to your app bundle. You can find the path of the file with the following code:

NSBundle* bundle = [NSBundle bundleForClass:[self class]];
NSString* path = [[bundle URLForResource:@"xhtml1-transitional" withExtension:@"dtd"] absoluteString];

Open the DTD file and find any external entity references:

<!ENTITY % HTMLlat1 PUBLIC
   "-//W3C//ENTITIES Latin 1 for XHTML//EN"
   "http://www.w3.org/TR/xhtml1/DTD/xhtml-lat1.ent">
%HTMLlat1;

Replace each one with the content of the entity file (http://www.w3.org/TR/xhtml1/DTD/xhtml-lat1.ent in the above case).

After replacing all external references, NSXMLParser should handle the entities properly without needing to download every remote DTD and external entity file each time it parses an XML file.
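
The DOCTYPE rewrite in the first step can be sketched as a plain string replacement. This is only an illustration: the function name GRUseLocalDTD is hypothetical, and it assumes the document uses exactly the XHTML 1.0 Transitional system identifier shown above.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: points the XHTML 1.0 Transitional DOCTYPE at a local
// copy of the DTD instead of the W3C URL.
static NSString *GRUseLocalDTD(NSString *document, NSURL *localDTD) {
    // Assumption: this is the only system identifier that needs rewriting.
    NSString *remote = @"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd";
    return [document stringByReplacingOccurrencesOfString:remote
                                               withString:[localDTD absoluteString]];
}
```

Pass in the bundle URL obtained with URLForResource:withExtension: as in the snippet above. It is probably also necessary to call setShouldResolveExternalEntities:YES on the parser so that it loads the local DTD at all; that is an assumption worth checking on your target OS version.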

躲猫猫 2024-10-18 12:38:49

You can do a string replacement on the data before parsing it with NSXMLParser. As far as I know, NSXMLParser works with UTF-8.
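
A minimal sketch of that pre-processing step, assuming ARC and an ISO-8859-1 input as declared in the question's sample page. It is a variation on the asker's replaceHtmlEntities: rather than the same method: each named entity is rewritten as a numeric character reference (&egrave; becomes &#232;), which an XML parser accepts without a DTD, and the five predefined XML entities are left alone. The function name GRReplaceNamedEntities and the three-entry table are hypothetical; a full table could be built from the xhtml-lat1.ent list.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: rewrites named HTML entities as numeric character
// references and returns the result as UTF-8 data.
static NSData *GRReplaceNamedEntities(NSData *data) {
    // Assumption: the page is ISO-8859-1, as its META tag declares.
    NSString *html = [[NSString alloc] initWithData:data
                                           encoding:NSISOLatin1StringEncoding];
    if (html == nil) {
        return data;
    }
    NSMutableString *temp = [NSMutableString stringWithString:html];

    // Illustrative subset only. &amp; &lt; &gt; &quot; &apos; are predefined
    // in XML and are deliberately not in this table.
    NSDictionary *codePoints = @{ @"egrave": @0x00E8,
                                  @"agrave": @0x00E0,
                                  @"nbsp":   @0x00A0 };
    for (NSString *name in codePoints) {
        NSString *entity = [NSString stringWithFormat:@"&%@;", name];
        NSString *numeric = [NSString stringWithFormat:@"&#%u;",
                             [codePoints[name] unsignedIntValue]];
        [temp replaceOccurrencesOfString:entity
                              withString:numeric
                                 options:NSLiteralSearch
                                   range:NSMakeRange(0, [temp length])];
    }
    // Without an XML declaration the parser assumes UTF-8, so emit UTF-8.
    return [temp dataUsingEncoding:NSUTF8StringEncoding];
}
```

Keeping &amp; escaped matters because a bare & is not allowed in XML character data, so turning &amp; into & would introduce a new parse error.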

围归者 2024-10-18 12:38:49

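I think you will run into another problem with this sample, because it is not the valid XML that NSXMLParser is looking for.

The exact problem with the above is that the tags META, LI, HTML and BODY are not closed, so the parser keeps looking through the rest of the document for their closing tags.

As far as I know, if you do not have access to change the HTML, the only way around this is to make a copy of it with the closing tags inserted.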

风铃鹿 2024-10-18 12:38:49

I would try a different parser, such as libxml2. In theory I think it should be able to handle poor HTML.
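
One way to try this is libxml2's HTML parser, which is designed to recover from markup that is not well-formed and knows the HTML named entities. The sketch below collects the text of every A element in a page. The function names are hypothetical, and it assumes ARC and that the project links against libxml2 with its headers on the search path.

```objc
#import <Foundation/Foundation.h>
#include <libxml/HTMLparser.h>
#include <libxml/tree.h>

// Hypothetical helper: walks `node` and its siblings and descendants,
// adding the text content of every <a> element to `texts`.
static void GRCollectLinkText(xmlNodePtr node, NSMutableArray *texts) {
    for (xmlNodePtr cur = node; cur != NULL; cur = cur->next) {
        if (cur->type == XML_ELEMENT_NODE &&
            xmlStrcasecmp(cur->name, BAD_CAST "a") == 0) {
            xmlChar *content = xmlNodeGetContent(cur);
            if (content != NULL) {
                NSString *text = [NSString stringWithUTF8String:(const char *)content];
                if (text != nil) {
                    [texts addObject:text];
                }
                xmlFree(content);
            }
        }
        GRCollectLinkText(cur->children, texts);
    }
}

// Hypothetical entry point: parses HTML leniently and returns the link texts.
static NSArray *GRLinkTextsFromHTML(NSData *html) {
    NSMutableArray *texts = [NSMutableArray array];
    // Assumption: the page is ISO-8859-1, as in the question's sample.
    htmlDocPtr doc = htmlReadMemory([html bytes], (int)[html length], NULL, "ISO-8859-1",
                                    HTML_PARSE_RECOVER | HTML_PARSE_NOERROR | HTML_PARSE_NOWARNING);
    if (doc == NULL) {
        return texts;
    }
    GRCollectLinkText(xmlDocGetRootElement(doc), texts);
    xmlFreeDoc(doc);
    return texts;
}
```

This builds a whole document tree in memory, so for large pages the SAX-style interface of the same library may be a better fit.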

花期渐远 2024-10-18 12:38:49

Since I've just started doing iOS development I've been searching for the same thing and found a related mailing list entry: http://www.mail-archive.com/[email protected]/msg17706.html

- (NSData *)parser:(NSXMLParser *)parser resolveExternalEntityName: (NSString *)entityName systemID:(NSString *)systemID {       
    NSAttributedString *entityString = [[[NSAttributedString alloc] initWithHTML:[[NSString stringWithFormat:@"&%@;", entityName] dataUsingEncoding:NSUTF8StringEncoding] documentAttributes:NULL] autorelease];

    NSLog(@"resolved entity name: %@", [entityString string]);

    return [[entityString string] dataUsingEncoding:NSUTF8StringEncoding];
}

This is fairly similar to your original solution and also causes a parser error NSXMLParserErrorDomain error 26; but it does continue parsing after that. The problem is, of course, that it's harder to tell real errors apart ;-)
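
One possible way to tell the expected error apart from real ones is to check the error domain and code in the delegate's parser:parseErrorOccurred: method. The sketch below is an illustration under assumptions: the helper GRIsUndeclaredEntityError, the class GRTolerantDelegate and its realErrors property are hypothetical, and ARC is assumed. NSXMLParserErrorDomain and NSXMLParserUndeclaredEntityError are the Foundation constants behind "error 26".

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: YES only for NSXMLParser's undeclared-entity error.
static BOOL GRIsUndeclaredEntityError(NSError *error) {
    return [[error domain] isEqualToString:NSXMLParserErrorDomain] &&
           [error code] == NSXMLParserUndeclaredEntityError;
}

// Hypothetical delegate that keeps only the errors worth reporting.
@interface GRTolerantDelegate : NSObject <NSXMLParserDelegate>
@property (nonatomic, strong) NSMutableArray *realErrors;
@end

@implementation GRTolerantDelegate

- (instancetype)init {
    self = [super init];
    if (self) {
        _realErrors = [NSMutableArray array];
    }
    return self;
}

- (void)parser:(NSXMLParser *)parser parseErrorOccurred:(NSError *)parseError {
    if (GRIsUndeclaredEntityError(parseError)) {
        // Expected when the entity was resolved by hand; skip it.
        return;
    }
    [self.realErrors addObject:parseError];
}

@end
```

After parsing, an empty realErrors array would mean that only the entity-related error, if any, was reported.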
