How do I get the text from all descendants of an element, disregarding scripts?

Posted 2024-08-27 00:59:57

My current project involves gathering text content from an element and all of its descendants, based on a provided selector.

For example, when supplied the selector #content and run against this HTML:

<div id="content">
  <p>This is some text.</p>
  <script type="text/javascript">
    var test = true;
  </script>
  <p>This is some more text.</p>
</div>

my script would return (after a little whitespace cleanup):

This is some text. var test = true; This is some more text.

However, I need to disregard text nodes that occur within <script> elements.

This is an excerpt of my current code (technically, it matches based on one or more provided selectors):

// get text content of all matching elements
for (x = 0; x < selectors.length; x++) { // 'selectors' is an array of CSS selectors from which to gather text content
  matches = Sizzle(selectors[x], document);
  for (y = 0; y < matches.length; y++) {
    match = matches[y];
    if (match.innerText) { // IE
      content += match.innerText + ' ';
    } else if (match.textContent) { // other browsers
      content += match.textContent + ' ';
    }
  }
}

It's a bit simplistic in that it just returns all text nodes within the element (and its descendants) that matches the provided selector. The solution I'm looking for would return all text nodes except for those that fall within <script> elements. It doesn't need to be especially high-performance, but I do need it to ultimately be cross-browser compatible.

I'm assuming that I'll need to somehow loop through all children of the element that matches the selector and accumulate all text nodes other than ones within <script> elements; it doesn't look like there's any way to identify JavaScript once it's already rolled into the string accumulated from all of the text nodes.

I can't use jQuery (for performance/bandwidth reasons), although you may have noticed that I do use its Sizzle selector engine, so jQuery's selector logic is available.

乖不如嘢 2024-09-03 00:59:57

function getTextContentExceptScript(element) {
    var text= [];
    for (var i= 0, n= element.childNodes.length; i<n; i++) {
        var child= element.childNodes[i];
        if (child.nodeType===1 && child.tagName.toLowerCase()!=='script')
            text.push(getTextContentExceptScript(child));
        else if (child.nodeType===3)
            text.push(child.data);
    }
    return text.join('');
}
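As a rough sketch of how this function might plug into the loop from the question: `collectText` below is a hypothetical helper name, and the whitespace cleanup is only a guess at what the question's "little whitespace cleanup" does. The recursive function is repeated so the block stands on its own.

```javascript
// Recursive walk from the answer above, repeated so this sketch is self-contained.
function getTextContentExceptScript(element) {
    var text = [];
    for (var i = 0, n = element.childNodes.length; i < n; i++) {
        var child = element.childNodes[i];
        if (child.nodeType === 1 && child.tagName.toLowerCase() !== 'script')
            text.push(getTextContentExceptScript(child)); // non-script element: recurse
        else if (child.nodeType === 3)
            text.push(child.data);                        // text node: keep its content
    }
    return text.join('');
}

// Hypothetical helper: gathers script-free text from a list of matched elements
// and collapses runs of whitespace into single spaces (assumed cleanup step).
function collectText(matches) {
    var parts = [];
    for (var y = 0; y < matches.length; y++) {
        parts.push(getTextContentExceptScript(matches[y]));
    }
    return parts.join(' ').replace(/\s+/g, ' ').replace(/^ | $/g, '');
}

// In a browser, with Sizzle loaded as in the question:
//   var content = collectText(Sizzle('#content', document));
```

Because the walk only reads `childNodes`, `nodeType`, `tagName` and `data`, it does not depend on `innerText` or `textContent` being available.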

Or, if you are allowed to change the DOM to remove the <script> elements (which wouldn't usually have noticeable side effects), quicker:

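function getTextContentRemovingScripts(element) {
    // getElementsByTagName returns a live collection, so it shrinks as scripts are removed
    var scripts= element.getElementsByTagName('script');
    while (scripts.length!==0)
        scripts[0].parentNode.removeChild(scripts[0]);
    return 'textContent' in element? element.textContent : element.innerText;
}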
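If changing the live document is a concern, one possible variation is to apply the same idea to a deep clone instead. This is only a sketch: `getTextWithoutScripts` is a hypothetical name, and I have not checked how `innerText` behaves on a detached clone in every older browser.

```javascript
// Sketch: strip <script> elements from a deep clone, leaving the live
// document untouched. getTextWithoutScripts is a hypothetical name.
function getTextWithoutScripts(element) {
    var copy = element.cloneNode(true); // deep clone, detached from the document
    var scripts = copy.getElementsByTagName('script');
    // Walk backwards so the loop works whether or not the collection is live.
    for (var i = scripts.length - 1; i >= 0; i--) {
        scripts[i].parentNode.removeChild(scripts[i]);
    }
    return 'textContent' in copy ? copy.textContent : copy.innerText;
}
```

The trade-off is the cost of cloning the subtree on every call, which may matter for large elements.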

紫瑟鸿黎 2024-09-03 00:59:57

EDIT:

Well, first let me say I'm not too familiar with Sizzle on its own, just within libraries that use it. That said...

If I had to do this, I would do something like:

var selectors = ['#main-content', '#side-bar'];
function findText(selectors) {
    var rText = '';
    // accept either an array of selectors or a single selector/node
    var sNodes = selectors instanceof Array ? $(selectors.join(',')) : $(selectors);
    for (var i = 0; i < sNodes.length; i++) {
        var children = sNodes[i].childNodes;
        for (var j = 0; j < children.length; j++) {
            if (children[j].nodeType == 3) {
                // text node: collect its content
                rText += children[j].nodeValue;
            } else if (children[j].nodeType == 1 && $(children[j]).is(':not(script)')) {
                /* recursion - this would work in jQ not sure if 
                 * Sizzle takes a node as a selector you may need 
                 * to tweak.
                 */
                rText += findText(children[j]);
            }
        }
    }

    return rText;
}

I didn't test any of that, but it should give you an idea. Hopefully someone else will pipe up with more direction :-)

Can't you just grab the parent node and check the nodeName in your loop? Something like:

match = matches[y];
// skip matches that are script elements or direct children of one
if(match.parentNode.nodeName.toLowerCase() != 'script' && match.nodeName.toLowerCase() != 'script' ) {
    if (match.innerText) { // IE
      content += match.innerText + ' ';
    } else if (match.textContent) { // other browsers
      content += match.textContent + ' ';
    }
}

Of course, jQuery supports the :not() syntax in selectors, so could you just do $(':not(script)')?
