Japanese word-wrap algorithm

Posted 2024-08-18 11:33:10

In a recent web application I built, I was pleasantly surprised when one of our users decided to use it to create something entirely in Japanese. However, the text was wrapped strangely and awkwardly. Apparently browsers don't cope with wrapping Japanese text very well, probably because it contains few spaces, as each character forms a whole word. However, that's not really a safe assumption to make as some words are constructed of several characters, and it is not safe to break some character groups into different lines.

Googling around hasn't really helped me understand the problem any better. It seems to me that one would need a dictionary of unbreakable patterns and would then assume that everywhere else is safe to break. But I fear I don't know enough about Japanese to know all the words, which, from what I gather from my searching, are quite complicated.

How would you approach this problem? Are there any libraries or algorithms you are aware of that already exist that deal with this in a satisfactory way?


Comments (2)

止于盛夏 2024-08-25 11:33:10

Japanese word wrap rules are called kinsoku shori and are surprisingly simple. They're actually mostly concerned with punctuation characters and do not try to keep words unbroken at all.

I just checked with a Japanese novel and indeed, both words in the syllabic kana script and those consisting of multiple Chinese ideograms are wrapped mid-word with impunity.
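To make the punctuation-only idea concrete, here is a minimal sketch in Python of a kinsoku-style wrapper. It is only an illustration: the two character sets are small, hand-picked subsets (an assumption, not a complete rule table), the function name `kinsoku_wrap` is made up for this example, and it counts characters rather than measuring rendered width. It breaks after a fixed number of characters and moves the break earlier whenever the next line would start with a forbidden character or the current line would end with one.

```python
# Characters that should not begin a line (illustrative subset, not a full table).
NO_LINE_START = set("、。，．・：；？！）」』】〕｝〉》ゃゅょっャュョッー")
# Characters that should not end a line (illustrative subset).
NO_LINE_END = set("（「『【〔｛〈《")


def _can_break(text: str, cut: int) -> bool:
    """True if a line break between text[cut - 1] and text[cut] is allowed."""
    return text[cut] not in NO_LINE_START and text[cut - 1] not in NO_LINE_END


def kinsoku_wrap(text: str, width: int) -> list[str]:
    """Split text into lines of at most `width` characters.

    Words are never kept together; only the punctuation rules above
    are respected, which is the point made in this answer.
    """
    if width < 1:
        raise ValueError("width must be at least 1")
    lines = []
    start = 0
    while len(text) - start > width:
        cut = start + width
        # Move the break earlier until it no longer violates a rule.
        while cut > start + 1 and not _can_break(text, cut):
            cut -= 1
        if not _can_break(text, cut):
            # No legal position found on this line: fall back to a hard break.
            cut = start + width
        lines.append(text[start:cut])
        start = cut
    lines.append(text[start:])
    return lines


if __name__ == "__main__":
    for line in kinsoku_wrap("彼は「こんにちは」と言った。", 8):
        print(line)
```

With a width of 8, the naive break would put 」 at the start of the second line, so the sketch breaks one character earlier instead. In a browser the layout engine applies its own version of these rules, so code like this is mainly useful when you lay out text yourself (for example on a canvas or in a PDF).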

煮茶煮酒煮时光 2024-08-25 11:33:10

The projects listed below are useful for handling Japanese word wrap (or, seen from another angle, word breaking):

  • budou (Python): https://github.com/google/budou
  • mikan (JS): https://github.com/trkbt10/mikan.js
  • mikan.sharp (C#): https://github.com/YoungjaeKim/mikan.sharp

mikan takes a regex-based approach, while budou uses natural language processing.
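As a rough idea of what a regex-based splitter can look like, here is a small Python sketch. It is not mikan's actual rule set: the pattern, the `CLOSING` set, and the function name `split_chunks` are all assumptions made for illustration. The heuristic groups a run of kanji or katakana with the hiragana that follows it (often a particle or verb ending) and attaches closing punctuation to the preceding chunk.

```python
import re

# One chunk is: a kanji run plus trailing hiragana, a katakana run plus
# trailing hiragana, a hiragana run, an ASCII letter/digit run, or any
# single remaining character.
CHUNK = re.compile(
    r"[\u4e00-\u9fff\u3005]+[\u3041-\u309f]*"
    r"|[\u30a1-\u30ff]+[\u3041-\u309f]*"
    r"|[\u3041-\u309f]+"
    r"|[A-Za-z0-9]+"
    r"|.",
    re.DOTALL,
)

# Closing punctuation that should stay with the chunk before it
# (illustrative subset).
CLOSING = set("、。！？）」』")


def split_chunks(text: str) -> list[str]:
    """Split text into chunks that are reasonable to keep on one line."""
    chunks: list[str] = []
    for piece in CHUNK.findall(text):
        if chunks and piece in CLOSING:
            chunks[-1] += piece  # glue punctuation to the previous chunk
        else:
            chunks.append(piece)
    return chunks


if __name__ == "__main__":
    print(split_chunks("今日は天気がいいですね。"))
```

A heuristic like this will mis-split some words, which is the trade-off against a dictionary or NLP-based segmenter. The resulting chunks can then be treated as unbreakable units by whatever does the line layout.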
