JavaScript regex to parse HTML and word wrap?

Posted on 2024-12-23 05:35:13

I need to create a bit of JavaScript that can take HTML entered in a text box and, ignoring all the tags, automatically word wrap it at a set width (say 70 characters) by adding a <br> tag.

I also need to find all the HTML entities (such as the ones for © and –) and count each of them as one character, not 5 or 4.

So the code would take:

<b>Hello</b> Here is some code that I would like to wrap. Lets pretend this goes on for over 70 spaces.

Output would be:

<b>Hello</b> Here is some code that I would like to wrap. Lets pretend <br>
this goes on for over 70 spaces.

Is this possible? How would I begin? Is there already a tool for this?

By the way CSS is out of the question to use.

德意的啸 2024-12-30 05:35:13

While the combination of the phrases "regular expression" and "parse HTML" usually causes entire universes to crumble, your use case seems simple enough that it could work. Because you want to preserve the HTML formatting after wrapping, it is much easier to just work on a space-delimited sequence of words. Here is a very rough approximation of what you'd like to do:

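var input = "<b>Hello</b> Here is some code that I would like to wrap. Let's pretend this goes on for over 70 spaces. Better ¥€±, let's <em>make</em> it go on for more than 70, and pick üþ a whole <strong>buñ©h</strong> of crazy symbols along the way.";
var words = input.split(' ');

// Visible width of each word: tags count as 0, each entity as 1.
// The patterns are non-greedy so "<b>Hello</b>" keeps its 5 visible characters.
var lengths = [];
for (var i = 0; i < words.length; i++)
  lengths.push(words[i].replace(/<[^>]+>/g, '').replace(/&[^;\s]+;/g, ' ').length);

var line = [], offset = 0, output = [];
for (var i = 0; i < words.length; i++) {
  // offset = visible characters on the line; line.length = spaces needed to append the word.
  // An empty line always accepts the word, so an over-long word cannot loop forever.
  if (line.length === 0 || offset + line.length + lengths[i] <= 70) {
    line.push(words[i]);
    offset += lengths[i];
  }
  else {
    output.push(line.join(' '));
    offset = 0; line = []; i -= 1;  // retry the same word on a fresh line
  }
}
if (line.length > 0)
  output.push(line.join(' '));

output = output.join('<br />');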

which results in

Hello Here is some code that I would like to wrap. Let's pretend this
goes on for over 70 spaces. Better ¥€±, let's make it go on for more
than 70, and pick üþ a whole buñ©h of crazy symbols along the way.

Note that the HTML tags (b, em, strong) are preserved; Markdown just doesn't show them.

Basically, the input string is split into words at each space, which is naïve and likely to cause trouble, but it's a start. Then, the length of each word is calculated after anything resembling an HTML tag or entity has been removed. Then it's a simple matter of iterating over each word, keeping a running tally of the column we're on; once we've struck 70, we pop the aggregated words into the output string and reset. Again, it's very rough, but it should suffice for most basic HTML.
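The same steps can be packaged as a reusable function. This is only a minimal sketch: the names `visibleLength` and `wrapWords` are made up for illustration, and it shares the limitation described above, so a tag whose attributes contain spaces (for example `<b class="x">`) will be split and miscounted.

```javascript
// Hypothetical helper: visible width of one space-delimited word.
// Tags count as zero columns, each entity as one column.
function visibleLength(word) {
  return word
    .replace(/<[^>]*>/g, '')     // drop anything that looks like a tag
    .replace(/&[#\w]+;/g, ' ')   // collapse each entity to one character
    .length;
}

// Greedy wrap: keep adding words until the next one would pass `width`.
function wrapWords(html, width) {
  var words = html.split(' ');
  var lines = [], line = [], used = 0;
  for (var i = 0; i < words.length; i++) {
    var len = visibleLength(words[i]);
    // line.length is the number of separating spaces once the word is appended
    if (line.length > 0 && used + line.length + len > width) {
      lines.push(line.join(' '));
      line = [];
      used = 0;
    }
    line.push(words[i]);
    used += len;
  }
  if (line.length > 0) lines.push(line.join(' '));
  return lines.join('<br />');
}

console.log(wrapWords('<b>Hello</b> big &copy; world', 9));
```

Starting a new line before appending the word (rather than stepping the loop index back) means a single word longer than `width` simply gets a line of its own.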

温柔戏命师 2024-12-30 05:35:13

This solution "walks" the string token by token, counting up to the desired line length. The regex captures one of four different tokens:

$1: HTML open/close tag (width = 0)
$2: HTML entity. (width = 1)
$3: Line terminator. (counter is reset)
$4: Any other character. (width = 1)

Note that I've added a line terminator token in case your textbox is already formatted with linefeeds (with optional carriage returns). Here is a JavaScript function that walks the string using String.replace() and an anonymous callback, counting tokens as it goes:

// Break up textarea into lines having len chars.
function breakupHTML(text, len) {
    var re = /(<(?:[^'"<>]+|'[^']*'|"[^"]*")*>)|(&(?:\w+|#x[\da-f]+|#\d+);)|(\r?\n)|(.)/ig;
    var count = 0;  // Initialize line char count.
    return text.replace(re,
        function(m0, m1, m2, m3, m4) {
            // Case 1: An HTML tag. Do not add to count.
            if (m1) return m1;
            // Case 2: An HTML entity. Add one to count.
            if (m2) {
                if (++count >= len) {
                    count = 0;
                    m2 += '<br>\n';
                }
                return m2;
            }
            // Case 3: A hard coded line terminator.
            if (m3) {
                count = 0;
                return '<br>\n';
            }
            // Case 4: Any other single character.
            if (m4) {
                if (++count >= len) {
                    count = 0;
                    m4 += '<br>\n';
                }
                return m4;
            } // Never get here.
        });
}

Here's a breakdown of the regex in commented format so you can see what is being captured:

p = re.compile(r"""
    # Match one HTML open/close tag, HTML entity or other char.
      (<(?:[^'"<>]+|'[^']*'|"[^"]*")*>)  # $1: HTML open/close tag
    | (&(?:\w+|\#x[\da-f]+|\#\d+);)      # $2: HTML entity.
    | (\r?\n)                            # $3: Line terminator.
    | (.)                                # $4: Any other character.
    """, re.IGNORECASE | re.VERBOSE)
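To see which alternative fires for a given input, the same pattern can be driven with `RegExp.prototype.exec` in a loop. This is a rough sketch for inspection only; the `tokenize` name and the shape of the returned objects are assumptions, not part of the answer's function.

```javascript
// Hypothetical helper: classify each token the way the regex alternation does.
function tokenize(text) {
  var re = /(<(?:[^'"<>]+|'[^']*'|"[^"]*")*>)|(&(?:\w+|#x[\da-f]+|#\d+);)|(\r?\n)|(.)/ig;
  var tokens = [], m;
  while ((m = re.exec(text)) !== null) {
    if (m[1] !== undefined) tokens.push({ type: 'tag', text: m[1], width: 0 });
    else if (m[2] !== undefined) tokens.push({ type: 'entity', text: m[2], width: 1 });
    else if (m[3] !== undefined) tokens.push({ type: 'newline', text: m[3], width: 0 });
    else tokens.push({ type: 'char', text: m[4], width: 1 });
  }
  return tokens;
}

// Print the token types for a small sample.
console.log(tokenize('<b>a&amp;</b>\nz').map(function (t) { return t.type; }).join(' '));
```

One thing this makes visible: a bare `<` in ordinary text that is later followed by a `>` will be captured as a tag, so input like that would need escaping first.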

花开浅夏 2024-12-30 05:35:13

Not wanting to unleash Cthulhu, I decided (unlike the other answers) to provide an answer to your problem that does not attempt to parse HTML with regular expressions. Instead, I turned to the awe-inspiring force for good that is jQuery, and used that to parse your HTML on the client side.

A working fiddle: http://jsfiddle.net/CKQ9f/6/

The html:

<div id="wordwrapOriginal">Here is some code that I would like to wrap. Lets pretend this goes on for over 70 spaces.etend this g<b class="foo bar">Helloend this goes on for over 70 spaces.etend</b>Here is some code that I would like to wrap. Lets pretend this goes on for over 70 spaces.etend this g</div>
<hr>
<div id="wordwrapResult"></div>

The jQuery:

// lifted from here: https://stackoverflow.com/a/5259788/808921
$.fn.outerHTML = function() {
    var $t = $(this);
    if( "outerHTML" in $t[0] )
    { return $t[0].outerHTML; }
    else
    {
        var content = $t.wrap('<div></div>').parent().html();
        $t.unwrap();
        return content;
    }
}

// takes plain strings (no markup) and adds <br> to 
// them when each "line" has exceeded the maxLineLen
function breakLines(text, maxLineLen, startOffset)
{
   var returnVals = {'text' : text, finalOffset : startOffset + text.length};
   if (text.length + startOffset > maxLineLen)
   {
      var wrappedWords = "";
      var wordsArr = text.split(' ');
      var lineLen = startOffset;
      for (var i = 0; i < wordsArr.length; i++)
      {
        if (wordsArr[i].length + lineLen > maxLineLen)
        {
          wrappedWords += '<br>';
          lineLen = 0;
        } 
        wrappedWords += (wordsArr[i] + ' ');
        lineLen += (wordsArr[i].length + 1);
      } 
      returnVals['text'] = wrappedWords.replace(/\s$/, '');
      returnVals['finalOffset'] = lineLen;
   }
   return returnVals;
}

// recursive function which will traverse the "tree" of HTML 
// elements under the baseElem, until it finds plain text; at which 
// point, it will use the above function to add newlines to that text
function wrapHTML(baseElem, maxLineLen, startOffset)
{
    var returnString = "";
    var currentOffset = startOffset;

    $(baseElem).contents().each(function () {
        if (! $(this).contents().length) // plain text
        {
            var tmp = breakLines($(this).text(), maxLineLen, currentOffset);
            returnString += tmp['text'];
            currentOffset = tmp['finalOffset'];

        }
        else // markup
        {
            var markup = $(this).clone();
            var tmp = wrapHTML(this, maxLineLen, currentOffset);
            markup.html(tmp['html']);
            returnString += $(markup).outerHTML();
            currentOffset = tmp['finalOffset'];
        }
    });

    return {'html': returnString, 'finalOffset': currentOffset};
}


$(function () {

   var wrappedHTML = wrapHTML("#wordwrapOriginal", 70, 0);

   $("#wordwrapResult").html(wrappedHTML['html']);

});

Note the recursion - can't do that with a regex!
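The key idea, carrying the current column through nested elements, can be illustrated without a DOM. The sketch below is not the jQuery code above: it assumes a made-up tree representation in which a node is either a plain string or an object `{ tag, children }`, and the names `wrapText` and `wrapNodes` are hypothetical.

```javascript
// Wrap a plain string that starts at column `offset`.
// Returns the wrapped text and the column where it ends.
function wrapText(text, maxLen, offset) {
  var words = text.split(' ');
  var out = '';
  for (var i = 0; i < words.length; i++) {
    if (i > 0) { out += ' '; offset += 1; }
    // Break before a word that would pass maxLen (never at column 0).
    if (offset > 0 && offset + words[i].length > maxLen) {
      out += '<br>';
      offset = 0;
    }
    out += words[i];
    offset += words[i].length;
  }
  return { html: out, offset: offset };
}

// Recurse through the tree, threading the column through every text node.
function wrapNodes(nodes, maxLen, offset) {
  var html = '';
  nodes.forEach(function (node) {
    var r;
    if (typeof node === 'string') {
      r = wrapText(node, maxLen, offset);
      html += r.html;
    } else {
      r = wrapNodes(node.children, maxLen, offset);
      html += '<' + node.tag + '>' + r.html + '</' + node.tag + '>';
    }
    offset = r.offset;  // the next sibling continues where this one ended
  });
  return { html: html, offset: offset };
}

var tree = ['one ', { tag: 'b', children: ['two'] }, ' three'];
console.log(wrapNodes(tree, 8, 0).html);
```

Because each text node is split on its own, a word that straddles an element boundary is treated as two words; handling that would need extra bookkeeping.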
