Here's something I've been having a little bit of difficulty with. I have a local client-side script that needs to allow a user to fetch a remote web page and search that resulting page for forms. In order to do this (without regex), I need to parse the document into a fully traversable DOM object.
Some limitations I'd like to stress:
- I don't want to use libraries (like jQuery). There's too much bloat for what I need to do here.
- Under no circumstances should scripts from the remote page be executed (for security reasons).
- DOM APIs, such as getElementsByTagName, need to be available.
- It only needs to work in Internet Explorer, but in 7 at the very least.
- Let's pretend I don't have access to a server. I do, but I can't use it for this.
What I've tried
Assuming I have a complete HTML document string (including DOCTYPE declaration) in the variable html, here's what I've tried so far:
var frag = document.createDocumentFragment(),
div = frag.appendChild(document.createElement("div"));
div.outerHTML = html;
//-> results in an empty fragment
div.insertAdjacentHTML("afterEnd", html);
//-> HTML is not added to the fragment
div.innerHTML = html;
//-> Error (expected, but I tried it anyway)
var doc = new ActiveXObject("htmlfile");
doc.write(html);
doc.close();
//-> JavaScript executes
I've also tried extracting the <head> and <body> nodes from the HTML and adding them to a <HTML> element inside the fragment, still no luck.
Does anyone have any ideas?
Fiddle: http://jsfiddle.net/JFSKe/6/
DocumentFragment doesn't implement DOM methods. Using document.createElement in conjunction with innerHTML removes the <head> and <body> tags (even when the created element is a root element, <html>). Therefore, the solution should be sought elsewhere. I have created a cross-browser string-to-DOM function, which makes use of an invisible inline frame.
All external resources and scripts will be disabled. See Explanation of the code for more information.
Code
Explanation of the code
The sanitiseHTML function is based on my replace_all_rel_by_abs function (see this answer). The sanitiseHTML function is completely rewritten though, in order to achieve maximum efficiency and reliability.
Additionally, a new set of RegExps has been added to remove all scripts and event handlers (including CSS expression(), IE7-). To make sure that all tags are parsed as expected, the adjusted tags are prefixed by <!--'"-->. This prefix is necessary to correctly parse nested "event handlers" in conjunction with unterminated quotes: <a id="><input onclick="<div onmousemove=evil()>">.
These RegExps are dynamically created using the internal functions cr/cri (Create Replace [Inline]). These functions accept a list of arguments, and create and execute an advanced RE replacement. To make sure that HTML entities don't break a RegExp (refresh in <meta http-equiv=refresh> could be written in various ways), the dynamically created RegExps are partially constructed by the function ae (Any Entity).
The actual replacements are done by the function by (replace by). In this implementation, by adds data- before all matched attributes.
- All <script>//<![CDATA[ .. //]]></script> occurrences are stripped. This step is necessary, because CDATA sections allow </script> strings inside the code. After this replacement has been executed, it's safe to go to the next replacement:
- The remaining <script>...</script> tags are removed.
- The <meta http-equiv=refresh .. > tag is removed.
- All event listeners and external pointers/attributes (href, src, url()) are prefixed by data-, as described previously.
- An IFrame object is created. IFrames are less likely to leak memory (contrary to the htmlfile ActiveXObject). The IFrame is made invisible and appended to the document, so that the DOM can be accessed. document.write() is used to write the HTML to the IFrame. document.open() and document.close() are used to empty the previous contents of the document, so that the generated document is an exact copy of the given html string.
- If a callback function is specified, it is called with two arguments. The first argument is a reference to the generated document object. The second argument is a function, which destroys the generated DOM tree when called. This function should be called when you don't need the tree any more.
If the callback function isn't specified, the function returns an object consisting of two properties (doc and destroy), which behave the same as the previously mentioned arguments.
Additional notes
- Setting the designMode property to "On" will stop a frame from executing scripts (not supported in Chrome). If you have to preserve the <script> tags for a specific reason, you can use iframe.designMode = "On" instead of the script stripping feature.
- I could not find a reliable source on the htmlfile ActiveXObject. According to this source, htmlfile is slower than IFrames, and more susceptible to memory leaks.
- All affected attributes (href, src, ...) are prefixed by data-. An example of getting/changing these attributes is shown for data-href: elem.getAttribute("data-href") and elem.setAttribute("data-href", "..."), or, where the dataset API is available, elem.dataset.href and elem.dataset.href = "...".
- External resources are disabled, so the page may look completely different:
  <link rel="stylesheet" href="main.css" />: no external styles.
  <script>document.body.bgColor="red";</script>: no scripted styles.
  <img src="128x128.png" />: no images; the size of the element may be completely different.
Examples
sanitiseHTML(html)
Paste this bookmarklet in the location's bar. It will offer an option to inject a textarea, showing the sanitised HTML string.
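For readers who cannot open the bookmarklet, the following is a heavily simplified sketch of the kind of replacements described in the explanation. It is not the answer's sanitiseHTML: the name sanitiseSketch is hypothetical, it omits the <!--'"--> prefix trick, the entity handling done by ae, and CSS url()/expression(), and it assumes attribute values contain no ">" character. Treat it as an illustration of the order of the steps, not as something safe to run on hostile markup.

```javascript
// Simplified illustration only (hypothetical helper, NOT the original sanitiseHTML).
function sanitiseSketch(html) {
  return html
    // 1. Strip script blocks wrapped in CDATA first, because a CDATA
    //    section may itself contain the string "</script>".
    .replace(/<script\b[^>]*>\s*(?:\/\/)?\s*<!\[CDATA\[[\s\S]*?\]\]>\s*<\/script\s*>/gi, "")
    // 2. Strip the remaining script elements.
    .replace(/<script\b[^>]*>[\s\S]*?<\/script\s*>/gi, "")
    // 3. Strip <meta http-equiv=refresh ...> (plain spelling only, no entities).
    .replace(/<meta\b[^>]*http-equiv\s*=\s*["']?refresh["']?[^>]*>/gi, "")
    // 4. Inside every opening tag, prefix event handlers, href and src with
    //    "data-" so the browser neither runs nor fetches them.
    .replace(/<[a-z][^>]*>/gi, function (tag) {
      return tag.replace(/(\s)(on[a-z]+|href|src)(\s*=)/gi, "$1data-$2$3");
    });
}

var cleaned = sanitiseSketch('<script>alert(1)</script><a href="x.html" onclick="evil()">go</a>');
// cleaned === '<a data-href="x.html" data-onclick="evil()">go</a>'
```

Running the script-stripping steps before the attribute step matters: otherwise script bodies would be scanned for tags and attributes as well.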
Code examples - string2dom(html):
Notable references
- sanitiseHTML(html) is based on my previously created replace_all_rel_by_abs(html) function.
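As a rough illustration of the invisible-frame approach described in the answer above (the original code is in the linked fiddle and is not reproduced here), a minimal sketch might look like the following. The name frameDomSketch is hypothetical, and the sketch assumes its input has already been sanitised, because document.write() into a frame would otherwise run any scripts it contains.

```javascript
// Hypothetical helper, not the answer's string2dom. Browser-only.
// ASSUMPTION: safeHtml has already had scripts and event handlers removed.
function frameDomSketch(safeHtml) {
  var iframe = document.createElement("iframe");
  iframe.style.display = "none";        // keep the frame invisible
  document.body.appendChild(iframe);    // attach it so it gets its own document

  var doc = iframe.contentWindow.document;
  doc.open();                           // discard any previous content
  doc.write(safeHtml);
  doc.close();

  return {
    doc: doc,
    // Call this once the tree is no longer needed.
    destroy: function () {
      if (iframe.parentNode) {
        iframe.parentNode.removeChild(iframe);
      }
    }
  };
}

// Usage, guarded so the snippet is harmless outside a browser.
if (typeof document !== "undefined" && document.body) {
  var result = frameDomSketch("<html><body><form id='f'></form></body></html>");
  var formCount = result.doc.getElementsByTagName("form").length;
  result.destroy();
}
```

Returning an object with doc and destroy mirrors the no-callback behaviour the answer describes; the callback variant is left out to keep the sketch short.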
Not sure why you're messing with documentFragments; you can just set the HTML text as the innerHTML of a new div element. Then you can use that div element for getElementsByTagName etc. without adding the div to the DOM:
If you're really married to the idea of a documentFragment, you can use this code, but you'll still have to wrap it in a div to get the DOM functions you're after:
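The answer's own snippets are not shown here, so the following is only a minimal sketch of the detached-div idea, with findFormsSketch as a hypothetical name. Two caveats apply to the original question: the <head> and <body> tags of a full document are dropped when parsed inside a div, and although <script> elements inserted through innerHTML do not run, inline handlers such as an onerror on an <img> can still fire in many browsers, so this alone does not meet the "no scripts" requirement.

```javascript
// Hypothetical helper illustrating the detached-div approach. Browser-only.
function findFormsSketch(html) {
  var div = document.createElement("div"); // never attached to the page
  div.innerHTML = html;                    // the browser parses the markup into child nodes
  return div.getElementsByTagName("form"); // element-level DOM APIs work on the div
}

// Usage, guarded so the snippet is harmless outside a browser.
if (typeof document !== "undefined") {
  var forms = findFormsSketch("<p>Login</p><form action='/login'></form>");
  var numberOfForms = forms.length;
}
```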
I'm not sure if IE supports document.implementation.createHTMLDocument, but if it does, use this algorithm (adapted from my DOMParser HTML extension). Note that the DOCTYPE will not be preserved:
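The algorithm itself is missing from this copy, so here is a hedged sketch of how createHTMLDocument is commonly used for this purpose; htmlDocSketch is a hypothetical name and the full-document check is my own assumption, not necessarily what the DOMParser extension does. A document created this way has no browsing context, so its scripts are not executed. To my knowledge Internet Explorer only gained createHTMLDocument in version 9, and older IE versions may reject assignment to documentElement.innerHTML, so this does not cover IE7.

```javascript
// Hypothetical helper; browser-only, requires document.implementation.createHTMLDocument.
function htmlDocSketch(markup) {
  var doc = document.implementation.createHTMLDocument("");
  // ASSUMPTION: treat input containing a doctype or <html> tag as a full document.
  if (/<!doctype|<html[\s>]/i.test(markup)) {
    doc.documentElement.innerHTML = markup; // the DOCTYPE itself is dropped
  } else {
    doc.body.innerHTML = markup;            // plain fragment
  }
  return doc;
}

// Usage, guarded so the snippet is harmless outside a supporting browser.
if (typeof document !== "undefined" &&
    document.implementation && document.implementation.createHTMLDocument) {
  var parsed = htmlDocSketch("<!DOCTYPE html><html><body><form></form></body></html>");
  var parsedForms = parsed.getElementsByTagName("form");
}
```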
Assuming the HTML is valid XML too, you may use loadXML()
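A minimal sketch of that suggestion, assuming Internet Explorer with the MSXML ActiveX parser available; xmlDomSketch is a hypothetical name. It only works for well-formed XHTML, since typical tag-soup HTML makes loadXML fail, and the result is an XML document rather than an HTML one, so HTML-specific properties are not available on its nodes.

```javascript
// Hypothetical helper; only does real work in Internet Explorer.
function xmlDomSketch(xhtml) {
  var xml;
  try {
    xml = new ActiveXObject("Microsoft.XMLDOM"); // MSXML parser, IE only
  } catch (e) {
    return null;                                 // ActiveX not available here
  }
  xml.async = false;                             // parse synchronously
  if (!xml.loadXML(xhtml)) {
    // xml.parseError.reason describes why parsing failed.
    return null;
  }
  return xml;                                    // supports getElementsByTagName
}

var xmlDoc = xmlDomSketch("<html><body><form action='/login'/></body></html>");
var xmlForms = xmlDoc ? xmlDoc.getElementsByTagName("form") : null;
```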
DocumentFragment doesn't support getElementsByTagName; that's only supported by Document.
You may need to use a library like jsdom, which provides an implementation of the DOM and through which you can search using getElementsByTagName and other DOM APIs. And you can set it to not execute scripts. Yes, it's 'heavy' and I don't know if it works in IE 7.
Just wandered across this page, am a bit late to be of any use :) but the following should help anyone with a similar problem in future... however IE7/8 should really be ignored by now and there are much better methods supported by the more modern browsers.
The following works across nearly everything I've tested - the only two downsides are:
- I've added bespoke getElementById and getElementsByName functions to the root div element, so these won't appear as expected further down the tree (unless the code is modified to cater for this).
- The doctype will be ignored - however I don't think this will make much difference, as my experience is that the doctype won't affect how the DOM is structured, just how it is rendered (which obviously won't happen with this method).
Basically the system relies on the fact that <tag> and <namespace:tag> are treated differently by the user agents. As has been found, certain special tags cannot exist within a div element, and so they are removed. Namespaced elements can be placed anywhere (unless there is a DTD stating otherwise). Whilst these namespace tags won't actually behave as the real tags in question, considering we are only really using them for their structural position in the document, it doesn't really cause a problem.
Markup and code are as follows:
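A minimal sketch of just the renaming step, not the answer's own markup and code: the structural tags that a div would otherwise discard are rewritten into namespaced tags before the string is handed to innerHTML. The function name and the ns prefix are hypothetical, and the regular expression makes no attempt to skip comments, scripts or attribute values.

```javascript
// Hypothetical helper: rename <html>, <head> and <body> (opening and closing)
// to namespaced tags so they survive parsing inside a <div>.
function namespaceStructuralTags(html) {
  // The lookahead ensures a whole tag name is matched, so <header> is left alone.
  return html.replace(/<(\/?)(html|head|body)(?=[\s>\/])/gi, "<$1ns:$2");
}

var renamed = namespaceStructuralTags(
  "<html><head></head><body class='a'><header></header></body></html>"
);
// renamed === "<ns:html><ns:head></ns:head><ns:body class='a'><header></header></ns:body></ns:html>"
```

In a browser, the renamed string could then be assigned to the innerHTML of a detached div and searched with getElementsByTagName, for example for "ns:body", though how namespaced names are matched varies between user agents.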
To use full HTML DOM abilities without triggering requests, without having to deal with incompatibilities:
All set! doc is an HTML document, but it is not online.