Detecting ligatures in Unicode text in Clojure/Java
Ligatures are Unicode characters that are represented by more than one code point. For example, in Devanagari त्र is a ligature which consists of the code points त + ् + र.
When seen in simple text file editors like Notepad, त्र is shown as त् + र and is stored as three Unicode characters. However, when the same file is opened in Firefox, it is shown as a proper ligature.
So my question is, how to detect such ligatures programmatically while reading the file from my code. Since Firefox does it, there must exist a way to do it programmatically. Are there any Unicode properties which contain this information or do I need to have a map to all such ligatures?
The SVG CSS property text-rendering, when set to optimizeLegibility, does the same thing (combines code points into a proper ligature).
PS: I am using Java.
EDIT
The purpose of my code is to count the characters in the Unicode text assuming a ligature to be a single character. So I need a way to collapse multiple code points into a single ligature.
Comments (5)
The Computer Typesetting wikipedia page says -
This indicates that it's the editor that does substitution. Moreover,
As far as I can see (I got interested in this topic and have only just read a few articles), the instructions for ligature substitution are embedded inside the font. I dug a bit further and found these for you: GSUB - The Glyph Substitution Table and the Ligature Substitution Subtable from the OpenType file format specification.
Next, you need to find a library that lets you peek inside OpenType font files, i.e. a file parser for quick access. Reading the following two discussions may give you some direction on how to do these substitutions:
You may be able to get this information from the GlyphVector class.
For a given String a Font instance can create a GlyphVector that can provide information about the rendering of the text.
The layoutGlyphVector() method on the Font can provide this.
The FLAG_COMPLEX_GLYPHS attribute of the GlyphVector can tell you if the text does not have a 1 to 1 mapping with the input characters.
The following code shows an example of this:
numberOfGlyphs should represent the number of characters used to display the input text.
Unfortunately, you need to create a Java GUI component to get the FontRenderContext.
What you are talking about are not ligatures (at least not in Unicode parlance) but grapheme clusters. There is a standard annex that is concerned with discovering text boundaries, including grapheme cluster boundaries:
http://www.unicode.org/reports/tr29/tr29-15.html#Grapheme_Cluster_Boundaries
Also see the description of tailored grapheme clusters in regular expressions:
http://unicode.org/reports/tr18/#Tailored_Graphemes_Clusters
And the definition of collation graphemes:
http://www.unicode.org/reports/tr10/#Collation_Graphemes
I think that these are starting points. The harder part will probably be to find a Java implementation of the Unicode collation algorithm that works for Devanagari locales. If you find one, you can analyze strings without resorting to OpenType features. This would be a bit cleaner since OpenType is concerned with purely presentational details and not with character or grapheme cluster semantics, but the collation algorithm and the tailored grapheme cluster boundary finding algorithm look as if they can be implemented independently of fonts.
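As a starting point for the font-independent route described above, here is a minimal sketch that counts user-perceived characters with the JDK's own `java.text.BreakIterator`. The class and method names (`GraphemeCount`, `countClusters`) are made up for illustration. How a Devanagari conjunct such as त्र is segmented depends on the JDK version and the Unicode rules it ships, so check the result on your own runtime before relying on it.

```java
import java.text.BreakIterator;

// Hypothetical helper class, not part of any library.
class GraphemeCount {

    // Counts the character boundaries reported by the JDK's
    // character-level BreakIterator (default locale).
    static int countClusters(String text) {
        BreakIterator it = BreakIterator.getCharacterInstance();
        it.setText(text);
        int count = 0;
        // next() returns the following boundary, or DONE at the end of the text.
        while (it.next() != BreakIterator.DONE) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        // Plain ASCII: one cluster per letter.
        System.out.println(countClusters("abc"));
        // 'e' followed by COMBINING ACUTE ACCENT: a base plus a combining mark.
        System.out.println(countClusters("e\u0301"));
        // Devanagari TA + VIRAMA + RA: result may vary between JDK versions.
        System.out.println(countClusters("\u0924\u094D\u0930"));
    }
}
```

If the JDK's answer for conjuncts is not what you need, ICU4J ships its own `BreakIterator` with tailorable rules, which may be worth evaluating.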
While Aaron's answer is not exactly correct, it pushed me in the right direction. After reading through the Java API docs of java.awt.font.GlyphVector and playing a lot on the Clojure REPL, I was able to write a function which does what I want. The idea is to find the width of glyphs in the glyphVector and combine the glyphs with zero width with the last found non-zero width glyph. The solution is in Clojure but it should be translatable to Java if required. Also posted on Gist.
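The Clojure function itself is not reproduced on this page. The sketch below is not that function: it swaps glyph widths for Unicode general categories so that it does not depend on any installed font, while keeping the same "fold into the previous unit" idea. It assumes Devanagari input only, treats every combining mark as part of the preceding unit, and treats a letter that follows the virama (U+094D) as part of the same conjunct. It ignores ZWJ/ZWNJ and other scripts, and all names (`ClusterApprox`, `countClusters`) are hypothetical.

```java
// Hypothetical, font-independent approximation; not the original author's code.
class ClusterApprox {

    private static final int DEVANAGARI_VIRAMA = 0x094D;

    // True for combining marks (general categories Mn, Mc, Me).
    static boolean isMark(int cp) {
        int type = Character.getType(cp);
        return type == Character.NON_SPACING_MARK
            || type == Character.COMBINING_SPACING_MARK
            || type == Character.ENCLOSING_MARK;
    }

    // Counts units, folding marks and virama-joined letters into the previous unit.
    static int countClusters(String text) {
        int count = 0;
        int prev = -1; // previous code point, -1 at the start of the text
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            boolean joinsPrevious = prev != -1
                && (isMark(cp)
                    || (prev == DEVANAGARI_VIRAMA && Character.isLetter(cp)));
            if (!joinsPrevious) {
                count++;
            }
            prev = cp;
            i += Character.charCount(cp);
        }
        return count;
    }

    public static void main(String[] args) {
        // TA + VIRAMA + RA (the conjunct from the question)
        System.out.println(countClusters("\u0924\u094D\u0930"));
        // KA VIRAMA SSA, TA VIRAMA RA + VOWEL SIGN I, YA
        System.out.println(countClusters("\u0915\u094D\u0937\u0924\u094D\u0930\u093F\u092F"));
    }
}
```

Unlike the glyph-width approach, this gives the same answer regardless of which font is available, at the cost of not reflecting what a particular font actually renders as a ligature.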
I think that what you are really looking for is Unicode Normalization. For Java you should check http://download.oracle.com/javase/6/docs/api/java/text/Normalizer.html
By choosing the proper normalization form you can obtain what you are looking for.
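To see what normalization does for a given input, a small check with `java.text.Normalizer` can help. The sketch below (class and method names are made up) counts code points after NFC. NFC only composes sequences that have a precomposed code point: e + combining acute becomes a single é, but त + ् + र has no precomposed form and stays three code points, so normalization alone may not give the count the question asks for.

```java
import java.text.Normalizer;

// Hypothetical helper class for experimenting with normalization forms.
class NormalizationCheck {

    // Number of code points left after normalizing to NFC.
    static int codePointsAfterNfc(String text) {
        String nfc = Normalizer.normalize(text, Normalizer.Form.NFC);
        return nfc.codePointCount(0, nfc.length());
    }

    public static void main(String[] args) {
        // 'e' + COMBINING ACUTE ACCENT composes to U+00E9.
        System.out.println(codePointsAfterNfc("e\u0301"));
        // Devanagari TA + VIRAMA + RA has no precomposed code point.
        System.out.println(codePointsAfterNfc("\u0924\u094D\u0930"));
    }
}
```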