Japanese Word Wrap Algorithm
In a recent web application I built, I was pleasantly surprised when one of our users decided to use it to create something entirely in Japanese. However, the text was wrapped strangely and awkwardly. Apparently browsers don't cope with wrapping Japanese text very well, probably because it contains few spaces, as each character forms a whole word. However, that's not really a safe assumption to make as some words are constructed of several characters, and it is not safe to break some character groups into different lines.
Googling around hasn't really helped me understand the problem any better. It seems to me like one would need a dictionary of unbreakable patterns, and assume that everywhere else is safe to break. But I fear I don't know enough about Japanese to really know all the words, which I understand from some of my searching, are quite complicated.
How would you approach this problem? Are there any libraries or algorithms you are aware of that already exist that deal with this in a satisfactory way?
Comments (2)
Japanese word wrap rules are called kinsoku shori and are surprisingly simple. They're actually mostly concerned with punctuation characters and do not try to keep words unbroken at all.
I just checked with a Japanese novel and indeed, both words in the syllabic kana script and those consisting of multiple Chinese ideograms are wrapped mid-word with impunity.
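To illustrate, the punctuation-oriented nature of kinsoku shori can be sketched as a simple greedy wrapper. This is a minimal sketch with deliberately reduced forbidden-character sets; the full rules (e.g. as standardized in JIS X 4051) cover many more characters and techniques such as hanging punctuation:

```python
# Characters that must not begin a line: closing punctuation and small kana.
# (Simplified subset for illustration.)
NO_START = set("。、」』)!?ぁぃぅぇぉっゃゅょァィゥェォッャュョー・")
# Characters that must not end a line: opening punctuation.
NO_END = set("「『(")

def wrap_japanese(text, width):
    """Greedily wrap text to `width` characters per line, moving the break
    point backwards when it would violate a kinsoku rule."""
    lines = []
    pos = 0
    while pos < len(text):
        end = min(pos + width, len(text))
        # Move the break earlier while the next line would start with a
        # forbidden character, or the current line would end with one.
        while end < len(text) and (text[end] in NO_START or text[end - 1] in NO_END):
            end -= 1
            if end <= pos:  # rule cannot be satisfied at this width; break anyway
                end = min(pos + width, len(text))
                break
        lines.append(text[pos:end])
        pos = end
    return lines

# The period 。 is pulled onto the previous line rather than starting a new one:
print(wrap_japanese("ですます。です", 4))  # → ['ですま', 'す。です']
```

Note that, matching the observation above, this breaks freely in the middle of words; only punctuation placement is constrained.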
The projects listed below are useful for solving Japanese word wrap (or word breaking, viewed from the other direction).
mikan takes a regex-based approach, while budou uses natural language processing.
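The common output of such segmenter-based tools is to insert invisible break opportunities between phrases, so the browser wraps only at phrase boundaries. Here is a hedged sketch of that technique; the segment list is hard-coded for illustration, whereas a real tool would produce it with a morphological analyzer (this is not budou's actual API):

```python
ZWSP = "\u200b"  # zero-width space: invisible, but a legal line-break point

def mark_break_opportunities(segments):
    """Join pre-segmented phrases with zero-width spaces so that a browser
    only wraps between phrases, never inside one. The segmentation itself
    must come from an NLP tool; it is supplied by hand here."""
    return ZWSP.join(segments)

# Hypothetical segmentation of 今日は良い天気です into three phrases:
html = mark_break_opportunities(["今日は", "良い", "天気です"])
```

In HTML, this is typically paired with CSS such as `word-break: keep-all` on the container, so that the zero-width spaces become the only break opportunities in otherwise unbreakable CJK runs.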