What is the most convenient way to convert HTML to plain text while preserving line breaks (with JavaScript)?

Posted on 2024-09-25 09:29:30

Basically, I just need the effect of copying that HTML from a browser window and pasting it into a textarea element.

For example I want this:

<p>Some</p>
<div>text<br />Some</div>
<div>text</div>

to become this:

Some
text
Some
text

Comments (5)

暮年 2024-10-02 09:29:30

If that HTML is visible within your web page, you could do it with the user selection (or just a TextRange in IE). This does preserve line breaks, if not necessarily leading and trailing white space.

UPDATE 10 December 2012

However, the toString() method of Selection objects is not yet standardized and works inconsistently between browsers, so this approach is based on shaky ground and I don't recommend using it now. I would delete this answer if it weren't accepted.

Demo: http://jsfiddle.net/wv49v/

Code:

function getInnerText(el) {
    var sel, range, innerText = "";
    if (typeof document.selection != "undefined" && typeof document.body.createTextRange != "undefined") {
        range = document.body.createTextRange();
        range.moveToElementText(el);
        innerText = range.text;
    } else if (typeof window.getSelection != "undefined" && typeof document.createRange != "undefined") {
        sel = window.getSelection();
        sel.selectAllChildren(el);
        innerText = "" + sel;
        sel.removeAllRanges();
    }
    return innerText;
}

烈酒灼喉 2024-10-02 09:29:30

I tried to find some code I wrote and used for this a while back. It worked nicely. Let me outline what it did; hopefully you can duplicate its behavior.

  • Replace images with alt or title text.
  • Replace links with "text [link]".
  • Replace things that generally produce vertical white space. h1-h6, div, p, br, hr, etc. (I know, I know. These could actually be inline elements, but it works out well.)
  • Strip out the rest of the tags and replace with an empty string.

You could even expand this more to format things like ordered and unordered lists. It really just depends on how far you'll want to go.

EDIT

Found the code!

// Requires: using System.Text.RegularExpressions;
public static string Convert(string template)
{
    template = Regex.Replace(template, "<img .*?alt=[\"']?([^\"']*)[\"']?.*?/?>", "$1"); /* Use image alt text. */
    template = Regex.Replace(template, "<a .*?href=[\"']?([^\"']*)[\"']?.*?>(.*?)</a>", "$2 [$1]"); /* Convert links to "text [href]"; the lazy match keeps adjacent links separate. */
    template = Regex.Replace(template, "<(/p|/div|/h\\d|br)\\s*/?>", "\n"); /* Keep vertical whitespace intact; \s* also covers "<br />". */
    template = Regex.Replace(template, "<[A-Za-z/][^<>]*>", ""); /* Remove the rest of the tags. */

    return template;
}
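
Since the question asks for JavaScript, the four steps outlined above could be sketched roughly as follows. This is a hedged port rather than the answerer's own code: the function name htmlToPlainText is made up for illustration, and the regular expressions assume simple, trusted markup with no ">" inside attribute values.

```javascript
// Regex-based sketch of the four steps; only suitable for simple, trusted HTML.
function htmlToPlainText(html) {
    return html
        // 1. Replace images with their alt text (images without alt are stripped in step 4).
        .replace(/<img\b[^>]*?\balt=["']?([^"'>]*)["']?[^>]*>/gi, "$1")
        // 2. Replace links with "text [href]".
        .replace(/<a\b[^>]*?\bhref=["']?([^"'\s>]*)["']?[^>]*>([\s\S]*?)<\/a>/gi, "$2 [$1]")
        // 3. Turn closing block tags and <br> into newlines.
        .replace(/<(?:\/p|\/div|\/h[1-6]|br)\s*\/?>/gi, "\n")
        // 4. Strip whatever tags remain.
        .replace(/<[A-Za-z\/][^<>]*>/g, "");
}

const sample = '<p>Some</p><div>text<br />Some</div><div>text</div>';
console.log(JSON.stringify(htmlToPlainText(sample)));
```

With the question's markup written on one line, this yields the four words on separate lines plus a trailing newline. Line breaks already present in the HTML source are kept as-is, so they would need to be removed first if they should not appear in the output.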

淡写薰衣草的香 2024-10-02 09:29:30

I made a function based on this answer: https://stackoverflow.com/a/42254787/3626940

function htmlToText(html){
    //remove source line breaks and tabs
    html = html.replace(/\n/g, "");
    html = html.replace(/\t/g, "");

    //keep html breaks and tabs
    html = html.replace(/<\/td>/g, "\t");
    html = html.replace(/<\/table>/g, "\n");
    html = html.replace(/<\/tr>/g, "\n");
    html = html.replace(/<\/p>/g, "\n");
    html = html.replace(/<\/div>/g, "\n");
    html = html.replace(/<\/h[1-6]>/g, "\n"); //closing heading tags h1-h6
    html = html.replace(/<br>/g, "\n");
    html = html.replace(/<br( )*\/>/g, "\n");

    //parse html into text
    var dom = (new DOMParser()).parseFromString('<!doctype html><body>' + html, 'text/html');
    return dom.body.textContent;
}

断桥再见 2024-10-02 09:29:30

Based on chrmcpn's answer, I had to convert a basic HTML email template into a plain text version as part of a build script in Node.js. I had to use JSDOM to make it work, but here's my code:

const { JSDOM } = require("jsdom");

const htmlToText = (html) => {
    html = html.replace(/\n/g, "");
    html = html.replace(/\t/g, "");

    html = html.replace(/<\/p>/g, "\n\n");
    html = html.replace(/<\/h1>/g, "\n\n");
    html = html.replace(/<br>/g, "\n");
    html = html.replace(/<br( )*\/>/g, "\n");

    const dom = new JSDOM(html);
    let text = dom.window.document.body.textContent;

    text = text.replace(/  /g, "");
    text = text.replace(/\n /g, "\n");
    text = text.trim();
    return text;
}

攒一口袋星星 2024-10-02 09:29:30

Three steps.

First get the html as a string.
Second, replace all <BR /> and <BR> with \r\n.
Third, use the regular expression "<(.|\n)*?>" to replace all markup with "".
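
A minimal JavaScript sketch of these three steps, assuming the HTML is already available as a string. The name stripTags is hypothetical, and the case-insensitive flag on the first pattern is an addition so that both <BR> and <br> are covered.

```javascript
// Steps two and three from the answer, applied to an HTML string.
function stripTags(html) {
    return html
        // Second: replace <BR /> and <BR> (any case) with \r\n.
        .replace(/<br\s*\/?>/gi, "\r\n")
        // Third: remove all remaining markup using the answer's expression.
        .replace(/<(.|\n)*?>/g, "");
}

console.log(JSON.stringify(stripTags('<div>text<br />Some</div>')));
```

Only <br> produces a line break here, so adjacent block elements such as <p> and <div> run together unless their closing tags are replaced with newlines as well.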