Algorithm to re-wrap hard-wrapped text?
Let's say that I have written a custom e-mail management application for the company that I work for. It reads e-mails from the company's support account and stores cleaned-up, plain text versions of them in a database, doing other neat things like associating them with customer accounts and orders in the process. When an employee replies to a message, my program generates an e-mail that is sent to the customer with a formatted version of the discussion thread. If the customer responds, the app looks for a unique number in the subject line to read the incoming message, strip out the previous discussion, and add it as a new item in the thread. For example:
    This is a message from Contoso customer service.

    Recently, you requested customer support. Below is a summary of your
    request and our reply.

    --------------------------------------------------------------------
    Contoso (Fred) on Tuesday, December 30, 2008 at 9:04 a.m.
    --------------------------------------------------------------------
    John:

    I've modified your address. You can confirm my work by logging into
    "Your Account" on our Web site. Your order should ship out today.

    Thanks for shopping at Contoso.

    --------------------------------------------------------------------
    You on Tuesday, December 30, 2008 at 8:03 a.m.
    --------------------------------------------------------------------
    Oops, I entered my address incorrectly. Can you change it to

    Fred Smith
    123 Main St
    Anytown, VA 12345

    Thanks!

    --
    Fred Smith
    Contoso Product Lover
Generally, this all works great, but there's one area that I've been putting off cleaning up for a while now, and it deals with text wrapping. In order to generate the pretty e-mail format like the one above, I need to re-wrap the text that the customer originally sent.
I've written an algorithm that does this (though looking at the code, I'm not entirely sure how it works anymore--it could use some refactoring). But it can't distinguish between a hard-wrap newline, an "end of paragraph" newline, and a "semantic" newline. For example, a hard-wrap newline is one that the e-mail client inserted within a paragraph to wrap a long line of text, say, at 79 columns. An end of paragraph newline is one that the user added after the last sentence in a paragraph. And a semantic newline would be something like the <br> tag in HTML, such as the line breaks in the address that Fred typed above.
My algorithm instead only sees two newlines in a row as indicating a new paragraph, so it would make the customer's e-mail be formatted something like the following:
    Oops, I entered my address incorrectly. Can you change it to

    Fred Smith 123 Main St Anytown, VA 12345

    Thanks!

    -- Fred Smith Contoso Product Lover
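To make the problem concrete, here is a minimal Python sketch of the "two newlines in a row means a new paragraph" rule described above. It is not the application's actual code (which is not shown in this thread); the function name `naive_rewrap` and the 68-column width are assumptions chosen for illustration.

```python
import re
import textwrap


def naive_rewrap(text: str, width: int = 68) -> str:
    """Re-wrap text, treating only blank lines as paragraph breaks."""
    # Split on one or more blank lines: each chunk is one "paragraph".
    paragraphs = re.split(r"\n\s*\n", text.strip())
    wrapped = []
    for para in paragraphs:
        # Every single newline is assumed to be a hard wrap, so it
        # becomes a space. This is what flattens the address lines.
        joined = " ".join(line.strip() for line in para.splitlines())
        wrapped.append(textwrap.fill(joined, width=width))
    return "\n\n".join(wrapped)
```

Run on the customer's message, this joins the three address lines into one, because nothing in the rule can tell a deliberate single newline from a hard wrap.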
Whenever I try to write a version that would re-wrap this text as intended, I basically hit a wall in that I need to know the semantics of the text: the difference between a "hard-wrap" newline and an "I really meant it like a <br>"-type newline, such as in the customer's address. (I use two newlines in a row to determine when to start a new paragraph, which coincides with how the majority of people seem to actually type e-mails.)
Anyone have an algorithm that can re-wrap the text as intended? Or is this implementation "good enough" when weighing the complexity of any given solution?
Thanks.
Comments (3)
You could try to check whether a newline was inserted to keep the line length below a maximum (aka a hard wrap). First, find the longest line in the text and take its length as the maximum. Then, for any given line, append the first word of the following line to it. If the resulting line exceeds the maximum length, the line break was probably a hard wrap.
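As a rough illustration of the first-word check, here is a small Python sketch. The helper names `longest_line` and `is_probable_hard_wrap` are made up for this example, and treating a blank following line as "not a hard wrap" is an added assumption.

```python
def longest_line(text: str) -> int:
    """Length of the longest line, used as the presumed wrap width."""
    return max((len(line.rstrip()) for line in text.splitlines()), default=0)


def is_probable_hard_wrap(line: str, next_line: str, max_len: int) -> bool:
    """Guess whether the newline after `line` was inserted by a wrapping client."""
    words = next_line.split()
    if not words:
        # Assumption: a blank next line is a paragraph break, not a wrap.
        return False
    # If the next word (plus one space) would not have fit on this line,
    # the client had no choice but to break here.
    return len(line.rstrip()) + 1 + len(words[0]) > max_len
```

For example, with a maximum of 40, a 39-character line followed by a line starting with "dog" is flagged as a hard wrap, while "Fred Smith" followed by "123 Main St" is not.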
Even simpler, you might just treat every break after a line whose length satisfies (maxlength - 15) <= length <= maxlength as a hard wrap (with 15 just being an educated guess). This should filter out intentional breaks such as those in addresses, and any misjudged break in this range wouldn't hurt the result too badly.
I have two suggestions, as follows.
Pay attention to punctuation: this will help you to distinguish between a "hard-wrap" newline and an "end of paragraph" newline (because, if the line ends with a full stop, then it's more likely that the user intended it to be an end-of-paragraph).
Pay attention to whether a line is much shorter than the maximum line length: in the example above, you might have text that's being "hard-wrapped" at 79 characters, plus you have address lines which are only 30 characters long; because 30 is much less than 79, you know that the address lines were broken by the user and not by the user's text-wrap algorithm.
Also, pay attention to indents: lines which are indented with whitespace from the left may be supposed to be new paragraphs, broken from the previous lines, as they are on this forum.
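One possible way to combine these hints is sketched below in Python. The `Break` enum, the `classify_break` function, and the 0.6 "much shorter" ratio are all assumptions made for illustration, and the order in which the hints are applied is a judgment call, not something the suggestions above prescribe.

```python
from enum import Enum


class Break(Enum):
    HARD_WRAP = "hard wrap"
    PARAGRAPH_END = "paragraph end"
    SEMANTIC = "semantic"


def classify_break(line: str, next_line: str, max_len: int,
                   short_ratio: float = 0.6) -> Break:
    """Guess what kind of newline follows `line`."""
    if not next_line.strip():
        return Break.PARAGRAPH_END      # a blank line follows
    if next_line[:1] in (" ", "\t"):
        return Break.PARAGRAPH_END      # indented next line starts a paragraph
    stripped = line.rstrip()
    if len(stripped) < short_ratio * max_len:
        # Much shorter than the wrap width, so the user broke the line.
        # Sentence-ending punctuation hints at the end of a paragraph;
        # otherwise treat it as a deliberate (address-style) break.
        if stripped.endswith((".", "!", "?")):
            return Break.PARAGRAPH_END
        return Break.SEMANTIC
    return Break.HARD_WRAP              # long line: probably wrapped by a client
```

With a maximum of 79, `classify_break("Fred Smith", "123 Main St", 79)` yields `Break.SEMANTIC`, since the line is short and has no closing punctuation.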
Following Ole's advice above, I re-worked my implementation to look at a threshold. It seems to handle most scenarios I throw at it well enough without me having to go nuts and write code that actually understands the English language.
Basically, I first scan through the input string and record the longest line length in the variable inputMaxLineLength. Then as I'm rewrapping, if I encounter a newline whose index within its line is between 85% of inputMaxLineLength and inputMaxLineLength, I replace that newline with a space because I think it's a hard-wrap newline--unless it's immediately followed by another newline, because then I assume that it's just a one-line paragraph that happens to fall within that range. This can happen if someone types out a short bulleted list, for example.

Certainly not perfect, but "good enough" for my scenario, considering the text is usually half-mangled by a previous e-mail client to begin with.
Here's some code, my a-few-hours-old implementation that probably still underwraps in a few edge cases (using C#). It's a lot less complicated than my previous solution, which is nice.
Source Code
And here's some unit tests that exercise that code (using MSTest):
Test Code
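Independently of the C# implementation mentioned above, the threshold rule can be sketched in Python roughly as follows. This is an illustration built only from the prose description: the function name `rewrap`, the 68-column output width, and the choice to re-wrap each surviving line separately are assumptions.

```python
import textwrap


def rewrap(text: str, width: int = 68, threshold: float = 0.85) -> str:
    """Join probable hard wraps, then re-wrap to `width` columns."""
    lines = [l.rstrip() for l in text.replace("\r\n", "\n").split("\n")]
    # Longest input line, the role played by inputMaxLineLength above.
    input_max = max((len(l) for l in lines), default=0)

    pieces = []
    for i, line in enumerate(lines):
        pieces.append(line)
        if i == len(lines) - 1:
            break
        next_blank = lines[i + 1] == ""
        in_range = line != "" and threshold * input_max <= len(line) <= input_max
        # A newline in the 85%-100% band is taken as a hard wrap, unless a
        # blank line follows (then it is a one-line paragraph).
        pieces.append(" " if in_range and not next_blank else "\n")

    unwrapped = "".join(pieces)
    # Wrap each remaining line on its own so the kept newlines survive.
    return "\n".join(textwrap.fill(l, width=width) for l in unwrapped.split("\n"))
```

One known weakness of this sketch: the longest line is always inside the band, so in a short message that was never hard-wrapped, a deliberate break after the longest line is misread as a wrap.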
If anyone has a better implementation (and no doubt a better implementation exists), I'll be happy to read your thoughts! Thanks.