Extracting a text snippet from an HTML body (in .NET)

Posted 2024-07-30 02:10:51

I have HTML content that users enter through a rich text editor, so it can be almost anything (except things that belong outside the body tag, so no need to worry about "head", the doctype, etc.).
An example of this content:

<h1>Header 1</h1>
<p>Some text here</p><p>Some more text here</p>
<div align=right><a href="x">A link here</a></div><hr />
<h1>Header 2</h1>
<p>Some text here</p><p>Some more text here</p>
<div align=right><a href="x">A link here</a></div><hr />

The trick is that I need to extract only the first 100 characters of the text (with the HTML tags stripped). I also need to retain the line breaks and not break any word.

So the output for the above will be something like:

Header 1
Some text here

Some more text here

A link here

Header 2
Some text here

Some

It has 98 characters, and the line breaks are retained. What I have managed so far is to strip all the HTML tags using Regex:

Regex.Replace(htmlStr, "<[^>]*>", "")

Then I trim the length, also using Regex:

Regex.Match(textStr, @"^.{1,100}\b").Value

My problem is: how do I retain the line breaks? I get output like this:

Header 1
Some text hereSome more text here
A link here
Header 2
Some text hereSome more text

Notice how the sentences run together? Perhaps someone can show me another way of solving this problem. Thanks!

Additional info: My purpose is to generate a plain-text synopsis from a bunch of HTML content. I guess this will help clarify the problem.

追星践月 2024-08-06 02:10:51

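I think the way to approach this is to treat it like a simple browser. Create an abstract Tag base class with an InnerHTML property and a virtual PrintElement method.

Next, create a class for each HTML tag you care about, inheriting from the base class. Judging by your example, the tags you care about most are h1, p, a and hr. Implement PrintElement so that it returns a string that prints the element correctly based on its InnerHTML (for example, the p class's PrintElement would return "\n[InnerHTML]\n").

Next, build a parser that reads your HTML, decides which object to create, and adds those objects to a queue (a tree would be better, but it doesn't look necessary for your purposes).

Finally, walk the queue, calling PrintElement on each element.

It may be more work than you planned, but it is a more robust solution than simply using regular expressions, and if you change your mind later and want to show simple styling, you only need to go back and modify the PrintElement methods.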
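The class hierarchy described above could be sketched roughly as follows. This is only an illustration under stated assumptions: the class names other than Tag, InnerHTML and PrintElement are hypothetical, the line-break rules per tag are guesses, and the parser that fills the queue is left out.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Base class from the answer: an InnerHTML property and a virtual PrintElement.
abstract class Tag
{
    public string InnerHTML { get; set; } = "";

    // Default (inline) behaviour: print the content with no line breaks.
    public virtual string PrintElement() => InnerHTML;
}

// Hypothetical subclasses; the line-break rules here are assumptions.
class H1Tag : Tag { public override string PrintElement() => InnerHTML + "\n"; }
class PTag : Tag { public override string PrintElement() => "\n" + InnerHTML + "\n"; }
class ATag : Tag { } // inline element, keeps the default behaviour
class HrTag : Tag { public override string PrintElement() => "\n"; }

static class TagPrinter
{
    // Walks the queue in order and concatenates what each element prints.
    public static string Print(IEnumerable<Tag> tags)
    {
        var sb = new StringBuilder();
        foreach (Tag tag in tags)
            sb.Append(tag.PrintElement());
        return sb.ToString();
    }
}

static class TagDemo
{
    static void Main()
    {
        // In a real implementation a parser would fill this queue from the HTML.
        var queue = new Queue<Tag>();
        queue.Enqueue(new H1Tag { InnerHTML = "Header 1" });
        queue.Enqueue(new PTag { InnerHTML = "Some text here" });
        queue.Enqueue(new ATag { InnerHTML = "A link here" });
        queue.Enqueue(new HrTag());
        Console.Write(TagPrinter.Print(queue));
    }
}
```

Keeping the formatting decision inside each subclass means the traversal code never needs to know which tags produce line breaks.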

花开浅夏 2024-08-06 02:10:51

For info, stripping HTML with a regex is... full of subtle problems. The HTML Agility Pack may be more robust, but it still suffers from the words bleeding together:

using HtmlAgilityPack;

HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(html);
// InnerText concatenates every text node, so adjacent blocks run together
string text = doc.DocumentNode.InnerText;
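One possible way around the bleeding-together problem, offered only as a sketch, is to walk the parsed nodes yourself instead of reading InnerText, appending a line break after each block-level element. The HtmlToText class, the ToPlainText method and the list of block tags are assumptions made for this example; only HtmlDocument, HtmlNode, HtmlNodeType and HtmlEntity come from the HTML Agility Pack.

```csharp
using System;
using System.Collections.Generic;
using System.Text;
using HtmlAgilityPack; // NuGet package: HtmlAgilityPack

static class HtmlToText
{
    // Assumed list of tags that should end a line; extend as needed.
    static readonly HashSet<string> BlockTags =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase)
        { "p", "div", "h1", "h2", "h3", "h4", "h5", "h6", "li", "br", "hr" };

    public static string ToPlainText(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);
        var sb = new StringBuilder();
        Walk(doc.DocumentNode, sb);
        return sb.ToString().Trim();
    }

    static void Walk(HtmlNode node, StringBuilder sb)
    {
        if (node.NodeType == HtmlNodeType.Text)
        {
            // Decode entities such as &amp; while copying the text.
            sb.Append(HtmlEntity.DeEntitize(node.InnerText));
            return;
        }

        foreach (HtmlNode child in node.ChildNodes)
            Walk(child, sb);

        // Close the line once a block-level element has been fully written.
        if (BlockTags.Contains(node.Name))
            sb.AppendLine();
    }
}
```

Whitespace that already sits between tags in the source is copied through unchanged, so the result may contain extra blank lines that you would want to collapse before trimming to length.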

假情假意假温柔 2024-08-06 02:10:51

One way could be to strip the HTML in three steps:

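// using System.Text.RegularExpressions;
string text = Regex.Replace(htmlStr, "<[^/>]*>", ""); // strip tags that contain no '/', so closing tags survive
text = Regex.Replace(text, "</p>", "\r\n"); // replace every paragraph end with a new line
text = Regex.Replace(text, "<[^>]*>", ""); // strip the remaining tags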
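Putting the stripping steps together with the 100-character trim from the question might look like the sketch below. It is not a definitive implementation: the Synopsis class, its method names and the list of tags treated as line-ending are assumptions. One detail matters once line breaks are kept: in .NET, '.' does not match '\n' by default, so the trimming pattern needs RegexOptions.Singleline or it would stop at the first line break.

```csharp
using System;
using System.Text.RegularExpressions;

static class Synopsis
{
    public static string StripTags(string html)
    {
        // Turn closing block tags and <br>/<hr> into a line break (assumed tag list).
        string text = Regex.Replace(
            html,
            @"</(p|div|h[1-6]|li)\s*>|<(br|hr)\b[^>]*>",
            "\n",
            RegexOptions.IgnoreCase);

        // Remove every tag that is left.
        return Regex.Replace(text, "<[^>]*>", "");
    }

    public static string Truncate(string text, int maxLength)
    {
        if (text.Length <= maxLength)
            return text;

        // Singleline lets '.' match '\n', so the match can span several lines.
        // The greedy quantifier backtracks to the last word boundary within the limit.
        Match m = Regex.Match(text, @"^.{1," + maxLength + @"}\b", RegexOptions.Singleline);
        return m.Value.TrimEnd();
    }
}

static class SynopsisDemo
{
    static void Main()
    {
        string html = "<p>Some text here</p><p>Some more text here</p>";
        Console.WriteLine(Synopsis.Truncate(Synopsis.StripTags(html), 22));
    }
}
```

This still inherits the usual weaknesses of regex-based stripping (comments, script blocks, and '>' inside attribute values), so it is best suited to input you already trust to be simple.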
