HTML DOM parsing and character encoding for XMLHttpRequest in a Firefox extension

Posted on 2024-11-06 05:24:54

I am writing a bootstrapped extension for Firefox 4.

Here is my situation:

When I use @mozilla.org/xmlextras/xmlhttprequest;1 (nsIXMLHttpRequest), the content of the target URL loads successfully and is available through req.responseText.

I parse the responseText into DOM by creating a BODY element with createElement and assigning the text to its innerHTML property.

Everything seems to work.

However, there is a problem with character encoding (charset).

Because I need the extension to detect the charset of the target documents, overriding the MIME type of the request with text/html; charset=blahblah does not meet my need.

I've tried @mozilla.org/intl/utf8converterservice;1 (nsIUTF8ConverterService), but it seems that XMLHttpRequest exposes no ScriptableInputStream, or indeed any InputStream or readable stream.

I have no idea how to read the content of a target document in a suitable, automatically detected charset, whether through the Auto-Detect Character Encoding feature available in the GUI or through the charset read from the meta tag in the document's head.

EDIT: Would it be practical to parse the whole document, including the HTML, HEAD, and BODY tags, into a DOM object without loading external resources such as js, css, and ico files?

EDIT: The method in the MDC article titled "HTML to DOM", which uses @mozilla.org/feed-unescapehtml;1 (nsIScriptableUnescapeHTML), is not suitable: it parses with many errors, and the baseURI cannot be set for type text/html. Every HREF attribute on an A element is lost when it contains a relative path.

EDIT#2: It would still be nice if there were a method that converts the incoming responseText into a readable UTF-8 string. :-)

Any ideas or workarounds for the encoding problem are appreciated. :-)

PS. The target documents are arbitrary, so there is no specific charset (and none known in advance), and of course they are not only UTF-8, since that is already the default.

SUPP:

So far, I have two main ideas for solving this problem.

Can anybody help me work out the names of the XPCOM modules and methods involved?

Specify the charset while parsing the content into DOM.

We first need to know the charset of the document (by extracting it from the meta tag in the head, or from the response header).
Then,

find a method that can specify the charset when parsing the body content.
find a method that can parse both head and body.

Convert the incoming responseText to UTF-8, so that parsing it into a DOM element with the default charset UTF-8 still works.

Not practical or sensible: overriding the MIME type with a charset is one implementation of this idea, but we cannot know the charset before initiating a request.
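
The "header" half of the first idea could be sketched roughly as follows. This is only an illustration, not something from the extension itself: charsetFromContentType is a hypothetical helper name, and it assumes the server actually sends a charset parameter in its Content-Type header, which many servers do not.

```javascript
// Hypothetical helper: pull the charset parameter out of a Content-Type
// header value such as 'text/html; charset=Shift_JIS'.
// Returns the charset name in lower case, or null when none is present.
function charsetFromContentType(headerValue) {
  if (!headerValue) {
    return null; // header missing entirely
  }
  // Accept optional whitespace and an optional opening double quote.
  var m = /;\s*charset\s*=\s*"?([^\s;"]+)/i.exec(headerValue);
  return m ? m[1].toLowerCase() : null;
}
```

A caller would presumably feed it the value of req.getResponseHeader("Content-Type") once the response has arrived, and fall back to the meta tag when it returns null.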

无声无音无过去 2024-11-13 05:24:54

It seems that there are no other answers.

After a day of testing, I found a way (although a clumsy one) to solve my problem.

Call xhr.overrideMimeType('text/plain; charset=x-user-defined'); , where xhr stands for the XMLHttpRequest object.

This forces Firefox to treat the response as plain text, using a user-defined character set. It tells Firefox not to parse it, and to let the bytes pass through unprocessed.
Reference: MDC document Using_XMLHttpRequest#Receiving_binary_data
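
To make the "bytes pass through unprocessed" part concrete, here is a small sketch of how the original bytes can be recovered from such a responseText. The x-user-defined charset maps bytes 0x80 to 0xFF onto U+F780 to U+F7FF, so masking each code unit with 0xff gives back the byte. toByteString is a hypothetical helper name, not a Firefox API.

```javascript
// Hypothetical helper: turn a responseText obtained with
// overrideMimeType('text/plain; charset=x-user-defined') into a
// "byte string" in which every character code is in the range 0-255.
function toByteString(userDefinedText) {
  var out = [];
  for (var i = 0; i < userDefinedText.length; i++) {
    // The low 8 bits of each code unit are the original byte:
    // ASCII stays as it is, U+F780..U+F7FF become 0x80..0xFF.
    out.push(String.fromCharCode(userDefinedText.charCodeAt(i) & 0xff));
  }
  return out.join("");
}
```

The resulting string carries one byte per character, which is the kind of input a byte-to-Unicode converter works on.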

Then use the scriptable Unicode converter: @mozilla.org/intl/scriptableunicodeconverter (nsIScriptableUnicodeConverter).

The charset variable can be extracted from the meta tags in the head, either with a regexp applied to req.responseText (whose charset is still unknown at that point) or by some other method.
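
One way the regexp extraction might look is sketched below. It is an assumption-laden illustration rather than the code I actually used: sniffMetaCharset is a hypothetical helper name, the 4096-character window is an arbitrary limit, and it only works for ASCII-compatible encodings, where the meta tag itself survives the x-user-defined pass-through unchanged.

```javascript
// Hypothetical helper: find a charset declared in a meta tag near the
// start of the (not yet decoded) response text.
// Returns the charset name in lower case, or null when none is found.
function sniffMetaCharset(text) {
  // Assumed limit: only scan the beginning, where the head normally is.
  var head = text.slice(0, 4096);
  // Covers both <meta charset="..."> and
  // <meta http-equiv="Content-Type" content="text/html; charset=...">.
  var m = /<meta[^>]*?charset\s*=\s*["']?\s*([\w.:-]+)/i.exec(head);
  return m ? m[1].toLowerCase() : null;
}
```

When it returns null, the caller still has to pick some fallback charset on its own.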

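// charset: the encoding name detected for the document (see above)
// str: the responseText received with the x-user-defined override
var unicodeConverter = Components.classes["@mozilla.org/intl/scriptableunicodeconverter"].createInstance(Components.interfaces.nsIScriptableUnicodeConverter);
// Tell the converter which encoding the raw bytes are in
unicodeConverter.charset = charset;
// Decode the raw bytes into a Unicode string
str = unicodeConverter.ConvertToUnicode(str);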

This finally produces a Unicode string, in the same family as UTF-8. :-)

Then I simply parse it into the BODY element, which meets my need.

Other brilliant ideas are still welcome. Feel free to object to my answer if you have good reason. :-)
