Is there a way to get all the text from a rendered page with JS?

Posted on 2024-09-04 03:15:39

Is there a way (unobtrusive to the user) to get all the text in a page with JavaScript? I could get the HTML, parse it, remove all the tags, etc., but I'm wondering if there's a way to get the text from the already rendered page.

To clarify, I don't want to grab text from a selection, I want the entire page.

Thank you!

Comments (5)

橘味果▽酱 2024-09-11 03:15:39

All credit to Greg W's answer, as I based this answer on his code, but I found that for a website without inline style or script tags it was generally simpler to use:

var theText = $('body').text();

as this grabs all text in all tags without one having to manually set every tag that might contain text.

Also, if you're not careful, setting the tags manually tends to create duplicated text in the output, because the .each() callback often has to visit tags nested inside other matched tags and so grabs the same text twice. Using one selector that contains all the tags we want to grab text from avoids this issue.

The caveat is that if there are inline style or script tags within the body tag it will grab those too.

Update:

After reading this article about innerText I now think the absolute best way to get the text is plain ol vanilla js:

document.body.innerText

As is, this was not reliable cross-browser when this answer was written (older Firefox versions did not support innerText), but in controlled environments it returns the best results. Read the article for more details.

This method formats the text in a usually more readable manner and does not include style or script tag contents in the output.
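
To make the difference between the two properties concrete, here is a minimal sketch of a helper that prefers innerText and only falls back to textContent when innerText is missing. The name getPageText and its doc parameter are illustrative assumptions, not part of the original answer; in a browser you would pass document.

```javascript
// Hypothetical helper (not from the original answer).
// `doc` is any Document-like object; in a browser, pass `document`.
function getPageText(doc) {
  const body = doc && doc.body;
  if (!body) return '';

  // innerText reflects the rendered text and leaves out <script>/<style> contents.
  if (typeof body.innerText === 'string') {
    return body.innerText;
  }

  // Fallback: textContent also includes <script>/<style> contents and hidden text.
  return body.textContent || '';
}

// In a browser: const pageText = getPageText(document);
```

The fallback has the same caveat as $('body').text(): script and style contents end up in the output.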

土豪我们做朋友吧 2024-09-11 03:15:39

I suppose you could do something like this, if you don't mind loading jQuery.

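// Start from an empty string; an uninitialized variable would make the
// result begin with the text "undefined".
var theText = '';
$('p,h1,h2,h3,h4,h5').each(function(){
  // Append the text of each matched element, separated by a space
  theText += $(this).text() + ' ';
});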

When it's all done, "theText" should contain most of the text on the page. Add any relevant selectors I may have left out.

眉目亦如画i 2024-09-11 03:15:39

Building on Greg W's answer, you can initialize the string so the result does not start with 'undefined', and also remove any numbers, since they are not words.

function countWords() {

    // Start from an empty string; an uninitialized variable would make the
    // result begin with the text 'undefined'.
    var collectedText = '';

    $('p,h1,h2,h3,h4,h5').each(function(index, element){
        collectedText += element.innerText + " ";
    });

    // Remove numbers, they're not words
    collectedText = collectedText.replace(/[0-9]/g, '');

    // Count the non-empty, whitespace-separated words
    var wordCount = collectedText.split(/\s+/).filter(Boolean).length;
    console.log("You have " + wordCount + " words in your document.");
    return collectedText;

}

The result can be split into an array of words, reduced to a word count, or whatever you need.
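
As a rough sketch of that last step, the splitting can live in a small function that works on any string, independent of jQuery or the DOM. The name toWords is an assumption made for illustration; it strips digits the same way the answer does and drops empty entries.

```javascript
// Hypothetical helper: turn collected page text into an array of words.
function toWords(text) {
  return text
    .replace(/[0-9]/g, '')   // drop digits, as in the answer above
    .split(/\s+/)            // split on any run of whitespace, including newlines
    .filter(Boolean);        // discard empty strings left at the edges
}

const words = toWords('  Chapter 12\nbegins here ');
console.log(words.length + ' words:', words);
```

Splitting on /\s+/ rather than a single space keeps line breaks and repeated spaces from inflating the count.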

对风讲故事 2024-09-11 03:15:39

Select all text on page:

window.getSelection().selectAllChildren(document.body)

Now you can get this text as a string:

const pageText = window.getSelection().toString()
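
Note that the two calls above leave the whole page visibly selected and discard whatever the user had selected, which may not count as unobtrusive. A possible variation, sketched under the assumption that restoring the earlier ranges is good enough, saves the selection first and puts it back afterwards. The function name getTextViaSelection and its win/doc parameters are made up for this example.

```javascript
// Hypothetical helper: read the page text through the Selection API, then
// restore the user's previous selection. In a browser, call it as
// getTextViaSelection(window, document).
function getTextViaSelection(win, doc) {
  const selection = win.getSelection();

  // Remember whatever was selected before we touch the selection.
  const savedRanges = [];
  for (let i = 0; i < selection.rangeCount; i++) {
    savedRanges.push(selection.getRangeAt(i).cloneRange());
  }

  // Select everything inside <body> and read it as a string.
  selection.selectAllChildren(doc.body);
  const text = selection.toString();

  // Put the earlier selection back.
  selection.removeAllRanges();
  for (const range of savedRanges) {
    selection.addRange(range);
  }
  return text;
}
```

The selection still changes for a moment, so this reduces the side effect rather than removing it.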

撩动你心 2024-09-11 03:15:39

document.body.innerText works, but you don't get any LFs so the result is a mess that cannot be corrected with any level of accuracy.

This script will replace a document with its rendered plain-text, including the LFs...

var D = Frame_ID.contentWindow.document, E = D.body.getElementsByTagName('*');

for (let i = 0; i < E.length; i++) {
  E[i].innerText = E[i].innerText;
}

NB: If you don't want to replace the current document you can clone each element in the loop, push it into an array, and then join the array with LFs.
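
A non-destructive version could look roughly like the sketch below. Note that it swaps in a different technique from the one described in the NB: instead of cloning elements, it walks the text nodes under a root element, skips script and style contents, and joins the pieces with LFs. The names collectTextLines and SKIPPED_TAGS are assumptions for illustration. Unlike innerText, this walk does not take CSS visibility into account, and it puts every text node on its own line, so inline elements such as links also cause a line break.

```javascript
// Elements whose text content should not appear in the output.
const SKIPPED_TAGS = new Set(['SCRIPT', 'STYLE', 'NOSCRIPT']);

// Hypothetical helper: collect the trimmed, non-empty text nodes under `node`
// in document order, without modifying the document.
function collectTextLines(node, lines = []) {
  if (node.nodeType === 3) {
    // nodeType 3 is a text node
    const text = node.nodeValue.trim();
    if (text) lines.push(text);
  } else if (node.nodeType === 1 && !SKIPPED_TAGS.has(node.tagName)) {
    // nodeType 1 is an element; recurse into its children
    for (const child of node.childNodes) {
      collectTextLines(child, lines);
    }
  }
  return lines;
}

// In a browser: const pageText = collectTextLines(document.body).join('\n');
```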
