Algorithm for re-wrapping hard-wrapped text?

Posted 2024-07-11 18:09:07

Let's say that I have written a custom e-mail management application for the company that I work for. It reads e-mails from the company's support account and stores cleaned-up, plain text versions of them in a database, doing other neat things like associating it with customer accounts and orders in the process. When an employee replies to a message, my program generates an e-mail that is sent to the customer with a formatted version of the discussion thread. If the customer responds, the app looks for a unique number in the subject line to read the incoming message, strip out the previous discussion, and add it as a new item in the thread. For example:

This is a message from Contoso customer service.

Recently, you requested customer support. Below is a summary of your 
request and our reply.

--------------------------------------------------------------------
Contoso (Fred) on Tuesday, December 30, 2008 at 9:04 a.m.
--------------------------------------------------------------------
John:

I've modified your address. You can confirm my work by logging into
"Your Account" on our Web site. Your order should ship out today.

Thanks for shopping at Contoso.

--------------------------------------------------------------------
You on Tuesday, December 30, 2008 at 8:03 a.m.
--------------------------------------------------------------------
Oops, I entered my address incorrectly. Can you change it to

Fred Smith
123 Main St
Anytown, VA 12345

Thanks!

--
Fred Smith
Contoso Product Lover

Generally, this all works great, but there's one area that I've been putting off cleaning up for a while now, and it deals with text wrapping. In order to generate a pretty e-mail format like the one above, I need to re-wrap the text that the customer originally sent.

I've written an algorithm that does this (though looking at the code, I'm not entirely sure how it works anymore; it could use some refactoring). But it can't distinguish between a hard-wrap newline, an "end of paragraph" newline, and a "semantic" newline. A hard-wrap newline is one that the e-mail client inserted within a paragraph to wrap a long line of text, say, at 79 columns. An end-of-paragraph newline is one that the user added after the last sentence in a paragraph. And a semantic newline is something like a br tag, such as the line breaks in the address that Fred typed above.

My algorithm instead only sees two newlines in a row as indicating a new paragraph, so it would make the customer's e-mail be formatted something like the following:

Oops, I entered my address incorrectly. Can you change it to

Fred Smith 123 Main St Anytown, VA 12345

Thanks!

-- Fred Smith Contoso Product Lover
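
For illustration, the behavior described above can be reproduced with a minimal Python sketch. The production code is not shown in this question, so the function name `naive_rewrap`, the default `width`, and the use of the standard `textwrap` module are assumptions made only for this example.

```python
import re
import textwrap


def naive_rewrap(text: str, width: int = 68) -> str:
    """Treat blank lines as paragraph breaks and every other newline as a hard wrap."""
    paragraphs = re.split(r"\n\s*\n", text.strip())
    wrapped = []
    for paragraph in paragraphs:
        # Every single newline is assumed to be a hard wrap and becomes a space,
        # which is exactly what destroys the address and the signature.
        joined = " ".join(line.strip() for line in paragraph.splitlines())
        wrapped.append(textwrap.fill(joined, width=width))
    return "\n\n".join(wrapped)


message = (
    "Oops, I entered my address incorrectly. Can you change it to\n"
    "\n"
    "Fred Smith\n"
    "123 Main St\n"
    "Anytown, VA 12345\n"
    "\n"
    "Thanks!\n"
    "\n"
    "--\n"
    "Fred Smith\n"
    "Contoso Product Lover\n"
)

print(naive_rewrap(message))
```

Because the only signal used is the blank line, the three address lines and the three signature lines are each collapsed into a single line.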

Whenever I try to write a version that would re-wrap this text as intended, I basically hit a wall in that I need to know the semantics of the text: the difference between a "hard-wrap" newline and an "I really meant it like a br"-type newline, such as in the customer's address. (I use two newlines in a row to determine when to start a new paragraph, which coincides with how the majority of people seem to actually type e-mails.)

Anyone have an algorithm that can re-wrap the text as intended? Or is this implementation "good enough" when weighing the complexity of any given solution?

Thanks.


攒眉千度 2024-07-18 18:09:07

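You could try to check whether a newline was inserted to keep the line length below a maximum (that is, a hard wrap). First, find the longest line in the text. Then, for any given line, append the first word of the following line to it. If the resulting line would exceed the maximum length, the line break was probably a hard wrap.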
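
A minimal Python sketch of this look-ahead check might look like the following. The function name `classify_breaks` and the decision to treat the longest line in the message as the wrap width are assumptions for illustration, not part of the answer.

```python
def classify_breaks(text: str) -> list[tuple[str, bool]]:
    """Pair each line with True if the newline after it looks like a hard wrap."""
    lines = text.split("\n")
    # Assumption: the longest line approximates the sender's wrap column.
    max_length = max(len(line) for line in lines)
    result = []
    for index, line in enumerate(lines):
        next_line = lines[index + 1] if index + 1 < len(lines) else ""
        next_words = next_line.split()
        if not line.strip() or not next_words:
            # A blank line, or a line followed by a blank line or the end of
            # the text, is never a hard wrap.
            hard = False
        else:
            # Would the next word have fit on this line? If not, the mail
            # client most likely broke the line here.
            hard = len(line) + 1 + len(next_words[0]) > max_length
        result.append((line, hard))
    return result


text = (
    "I've modified your address. You can confirm my work by logging into\n"
    "\"Your Account\" on our Web site. Your order should ship out today.\n"
    "\n"
    "Fred Smith\n"
    "123 Main St\n"
    "Anytown, VA 12345"
)

for line, hard in classify_breaks(text):
    print("hard" if hard else "keep", "|", line)
```

Only the first line is flagged as a hard wrap here, since "Fred Smith" plus "123" would easily have fit within the longest line.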

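Even simpler, you could treat every break where (maxlength - 15) <= length <= maxlength as a hard wrap (15 being just an educated guess). This would certainly filter out intentional breaks such as those in addresses, and any breaks in this range that are misclassified would not affect the result much.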
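
The simpler band test could be sketched in Python as below. The name `hard_wrap_flags` and the `slack` parameter are hypothetical; 15 is the educated guess from the answer, not a tuned value.

```python
def hard_wrap_flags(text: str, slack: int = 15) -> list[bool]:
    """Flag each line whose length falls in [max_length - slack, max_length]."""
    lines = text.split("\n")
    # Assumption: the longest line in the message stands in for maxlength.
    max_length = max(len(line) for line in lines)
    return [
        # Blank lines are never hard wraps, whatever their length.
        bool(line.strip()) and max_length - slack <= len(line) <= max_length
        for line in lines
    ]


sample = "a" * 70 + "\n" + "b" * 60 + "\n\n" + "Fred Smith"
print(hard_wrap_flags(sample))
```

One limitation of this sketch: a message made up only of short lines (for example, nothing but an address) has a small `max_length`, so its lines would all land in the band. A real implementation would probably want a minimum plausible wrap width before trusting the result.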


荒岛晴空 2024-07-18 18:09:07

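I have two suggestions:

1. Pay attention to punctuation: this helps you tell a "hard-wrap" newline from an "end of paragraph" newline, because if a line ends with a full stop, the user more likely intended it to be the end of a paragraph.
2. Pay attention to whether a line is much shorter than the maximum line length: in the example above, the body text may have been hard-wrapped at 79 characters, while the address lines are only about 30 characters long. Because 30 is much less than 79, you know the address lines were broken by the user and not by the user's text-wrapping algorithm.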

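Also, pay attention to indentation: lines that are indented with whitespace from the left can be treated as new paragraphs, separate from the preceding lines, as they are on this forum.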
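
The three signals above (punctuation, short lines, indentation) could be combined into a simple score, as in this Python sketch. The function name `looks_intentional`, the weights, and the 0.5 and 0.85 ratios are all assumptions chosen for illustration; the answer does not specify how to weigh the signals.

```python
TERMINATORS = (".", "!", "?", ":")


def looks_intentional(line: str, next_line: str, max_length: int) -> bool:
    """Guess whether the newline after `line` was typed by the user."""
    current = line.rstrip()
    score = 0
    # Signal 1: the line is much shorter than the wrap width.
    if len(current) < 0.5 * max_length:
        score += 2
    elif len(current) < 0.85 * max_length:
        score += 1
    # Signal 2: the following line is indented, suggesting a new paragraph.
    if next_line.startswith((" ", "\t")):
        score += 2
    # Signal 3: the line ends with sentence-ending punctuation. On its own this
    # is weak evidence, because a sentence can end right at the wrap column.
    if current.endswith(TERMINATORS):
        score += 1
    return score >= 2


print(looks_intentional("Fred Smith", "123 Main St", 79))
print(looks_intentional("x" * 74 + ".", "Next sentence", 79))
```

With these weights, a short address line is kept as its own line, while a full-width line that merely happens to end with a period is still treated as a hard wrap.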


稀香 2024-07-18 18:09:07

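Following Ole's suggestion above, I reworked my implementation to look at a threshold. It seems to handle most of the scenarios I throw at it well enough, without my having to go nuts and write code that actually understands English.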

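Basically, I first scan through the input string and record the length of the longest line in a variable, inputMaxLineLength. Then, as I am re-wrapping, if I encounter a newline at a column between 85% of inputMaxLineLength and inputMaxLineLength, I replace that newline with a space, because I assume it is a hard-wrap newline. The exception is when it is immediately followed by another newline: in that case I assume it is just a one-line paragraph that happens to fall within that range. This can happen if someone types out a short bulleted list, for example.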
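
The answer's C# code is not included on this page, so here is a rough Python sketch of the approach as described, not the author's implementation. The names `rewrap`, `width`, and `threshold` are hypothetical, `input_max` mirrors inputMaxLineLength, and the final re-wrapping with `textwrap.fill` is an assumption about what happens after the hard wraps are removed.

```python
import textwrap


def rewrap(text: str, width: int = 72, threshold: float = 0.85) -> str:
    """Join probable hard wraps, keep other newlines, then wrap to `width`."""
    lines = text.replace("\r\n", "\n").split("\n")
    input_max = max(len(line) for line in lines)
    lower = input_max * threshold
    logical_lines = []  # lines after hard wraps have been joined
    buffer = ""
    for index, line in enumerate(lines):
        buffer = f"{buffer} {line.strip()}" if buffer else line.rstrip()
        next_blank = index + 1 >= len(lines) or not lines[index + 1].strip()
        in_band = lower <= len(line) <= input_max
        if line.strip() and in_band and not next_blank:
            continue  # probable hard wrap: keep accumulating this paragraph
        logical_lines.append(buffer)
        buffer = ""
    return "\n".join(
        textwrap.fill(line, width=width) if line else "" for line in logical_lines
    )


message = (
    "I've modified your address. You can confirm my work by logging into\n"
    "\"Your Account\" on our Web site. Your order should ship out today.\n"
    "\n"
    "Fred Smith\n"
    "123 Main St\n"
    "Anytown, VA 12345"
)

print(rewrap(message, width=50))
```

In this sketch the two long lines are merged and re-wrapped at 50 columns, while the blank line and the three short address lines are preserved as typed.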

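It's certainly not perfect, but it is "good enough" for my scenario, considering that the text has usually been half-mangled by a previous e-mail client anyway.

Here is some code: my few-hours-old implementation, which probably still misses a few edge cases (in C#). It is much less complicated than my previous solution, which is nice.

Source code

And here are some unit tests that exercise that code (using MSTest):

Test code

If anyone has a better implementation (and no doubt a better one exists), I'll gladly read your thoughts! Thanks.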