Formatting text according to punctuation

Posted on 2024-11-07 14:17:17

How can I format natural-language text while taking punctuation into account? Vim's built-in gq command and command-line tools such as fmt or par break lines without regard to punctuation. Here is an example.

fmt -w 40 does not give what I want:

we had everything before us, we had
nothing before us, we were all going
direct to Heaven, we were all going
direct the other way

smart_formatter -w 40 would give:

we had everything before us,
we had nothing before us,
we were all going direct to Heaven,
we were all going direct the other way

Of course, when no punctuation mark is found within the given text width, it can fall back to the standard text formatting behavior.

The reason I want this is to get meaningful diffs of text, where I can spot which sentence or clause changed.
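
For illustration, the behavior asked for above could be sketched as a small shell function around awk. This is only a sketch under assumptions: smart_wrap is a hypothetical name (smart_formatter in the question is not an existing tool), and the punctuation set .,;:!? is my own choice. It fills each line greedily up to the width, then cuts back to the last word that ends in punctuation; if the line contains no such word, it falls back to plain filling.

```shell
#!/bin/sh
# smart_wrap WIDTH: hypothetical helper, reads text on stdin.
# Assumption: a "punctuation break" is any word ending in . , ; : ! or ?
smart_wrap() {
  awk -v width="${1:-40}" '
    # Collect every whitespace-separated word from the whole input.
    { for (f = 1; f <= NF; f++) words[++n] = $f }
    END {
      i = 1
      while (i <= n) {
        len = length(words[i]); last = i; punct = 0
        if (words[i] ~ /[.,;:!?]$/) punct = i
        # Extend the window while the next word still fits in the width.
        for (j = i + 1; j <= n && len + 1 + length(words[j]) <= width; j++) {
          len += 1 + length(words[j]); last = j
          if (words[j] ~ /[.,;:!?]$/) punct = j
        }
        # Prefer the last punctuation break; otherwise fill the line.
        stop = punct ? punct : last
        line = words[i]
        for (k = i + 1; k <= stop; k++) line = line " " words[k]
        print line
        i = stop + 1
      }
    }'
}
```

Piping the example sentence through smart_wrap 40 reproduces the four-line layout shown above. Being greedy, it does not weigh raggedness against punctuation breaks, so a short clause such as "Wait!" stays on the same line as whatever punctuated text follows it within the width.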

Comments (1)

念三年u 2024-11-14 14:17:17

Here is a not very elegant but working method I finally came up with. Suppose a line break at a punctuation mark is worth 6 characters. That is, I'll accept a result that is more ragged but has more lines ending in a punctuation mark, as long as the extra "raggedness" is less than 6 characters. For example, this is OK ("raggedness" is 3 characters):

Wait!
He said.

This is not OK ("raggedness" is more than 6 characters):

Wait!
He said to them.

The method is to add 6 dummy characters after each punctuation mark, format the text, then remove the dummy characters.

Here is the code for this:

sed -e 's/\([.?!,]\)/\1 _ _ _/g' | fmt -w 34 | sed -e 's/ _//g' -e 's/_ //g'

I used " _" (space + underscore) as a pair of dummy characters, assuming that pair does not occur in the text. The result looks quite good:

we had everything before us,
we had nothing before us,
we were all going direct to
Heaven, we were all going
direct the other way
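
If it helps, the same pipeline can be wrapped in a function so that the fmt width and the size of the punctuation bonus become parameters. This is a sketch, not a polished tool: punct_fmt is a hypothetical name, and the defaults (width 34, 3 dummy pairs of 2 characters each) simply reproduce the command above.

```shell
#!/bin/sh
# punct_fmt [FMT_WIDTH] [PAIRS]: hypothetical wrapper around the pipeline above.
# Assumption: the text contains no " _" or "_ " sequences of its own.
punct_fmt() (
  width=${1:-34}   # width handed to fmt, dummy characters included
  pairs=${2:-3}    # number of " _" pairs; each pair counts as 2 characters
  pad=''
  i=0
  while [ "$i" -lt "$pairs" ]; do
    pad="$pad _"
    i=$((i + 1))
  done
  # Pad each punctuation mark, let fmt break the lines, then strip the padding.
  sed -e "s/\([.?!,]\)/\1$pad/g" |
    fmt -w "$width" |
    sed -e 's/ _//g' -e 's/_ //g'
)
```

For example, `punct_fmt 34 3 < input.txt` runs the original command, while a larger second argument makes fmt more willing to end lines at punctuation at the cost of raggedness.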