Formatting text according to punctuation
How can I format natural-language text taking punctuation into account? Vim's built-in gq
command and command-line tools such as fmt or par break lines without regard to punctuation. For example,
fmt -w 40
does not give what I want:
we had everything before us, we had
nothing before us, we were all going
direct to Heaven, we were all going
direct the other way
smart_formatter -w 40
would give:
we had everything before us,
we had nothing before us,
we were all going direct to Heaven,
we were all going direct the other way
Of course, when no punctuation mark is found within the given text width, it can fall back to the standard text formatting behavior.
The reason I want this is to get a meaningful diff
of the text, where I can spot which sentence or clause changed.
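To make the requested behavior concrete, here is a minimal Python sketch. The function name smart_format and the set of punctuation marks are assumptions made for illustration, not an existing tool. It fills each line greedily, then backs up to the last word on that line that ends in a punctuation mark; if the line contains none, it keeps the ordinary greedy break.

```python
PUNCTUATION = ",.;:!?"  # assumed set of marks that may end a line


def smart_format(text, width=40):
    """Greedy fill, but end each line at the last punctuation mark that fits."""
    words = text.split()
    lines = []
    start = 0
    while start < len(words):
        # Greedy step: take as many words as fit within `width`.
        end = start + 1
        length = len(words[start])
        while end < len(words) and length + 1 + len(words[end]) <= width:
            length += 1 + len(words[end])
            end += 1
        # If text remains, back up to the last word ending in punctuation.
        if end < len(words):
            for cut in range(end, start, -1):
                if words[cut - 1][-1] in PUNCTUATION:
                    end = cut
                    break
            # No punctuation on this line: the greedy break is kept (fallback).
        lines.append(" ".join(words[start:end]))
        start = end
    return "\n".join(lines)


sample = (
    "we had everything before us, we had nothing before us, "
    "we were all going direct to Heaven, "
    "we were all going direct the other way"
)
print(smart_format(sample, width=40))
```

On the sample text with a width of 40, this prints the four lines shown above as the desired output. Note that this simple rule always prefers a punctuation break, however short the resulting line is.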
Comments (1)
Here is a not very elegant but working method I finally came up with. Suppose a line break at a punctuation mark is worth 6 characters. That is, I'll accept a result that is more ragged but has more lines ending in a punctuation mark, as long as the "raggedness" is less than 6 characters. For example, this is OK ("raggedness" is 3 characters).
This is not OK ("raggedness" is more than 6 characters)
The method is to add 6 dummy characters after each punctuation mark, format the text, then remove the dummy characters.
Here is the code for this
I used " _" (space + underscore) as a pair of dummy characters, supposing they're not contained in the text. The result looks quite good.
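As a rough illustration of the pad, format, strip idea, here is a minimal Python sketch. It is not the code from this answer: the function name format_with_padding is made up, and Python's textwrap module stands in for the formatter, which is an assumption since the answer does not say which one was used. textwrap fills lines greedily, so an optimizing formatter such as par may respond to the padding differently.

```python
import re
import textwrap

# Three (space + underscore) pairs = 6 dummy characters.
# Assumes the text contains no standalone "_" tokens of its own.
PAD = " _" * 3


def format_with_padding(text, width=40):
    # 1. Pad: insert the dummy characters after each punctuation mark
    #    that is followed by whitespace.
    padded = re.sub(r"([,.;:!?])\s+", lambda m: m.group(1) + PAD + " ", text)
    # 2. Format: textwrap is a stand-in for fmt, par, or gq (assumption).
    wrapped = textwrap.fill(padded, width=width)
    # 3. Strip: drop the dummy tokens, and any line that held only dummies.
    lines = []
    for line in wrapped.splitlines():
        kept = [word for word in line.split() if word != "_"]
        if kept:
            lines.append(" ".join(kept))
    return "\n".join(lines)


sample = (
    "we had everything before us, we had nothing before us, "
    "we were all going direct to Heaven, "
    "we were all going direct the other way"
)
print(format_with_padding(sample, width=40))
```

Because the dummy characters count toward the line width during formatting, a line that ends right after a punctuation mark looks up to 6 characters fuller to the formatter than it really is. How much that changes the breaks depends on the formatter's own line-breaking strategy.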