Detecting Unicode text ligatures in Clojure/Java

Posted 2024-09-14 16:41:25

Ligatures are Unicode characters that are represented by more than one code point. For example, in Devanagari, त्र is a ligature that consists of the code points त + ् + र.

When viewed in a simple text editor like Notepad, त्र is shown as त् + र and is stored as three Unicode characters. However, when the same file is opened in Firefox, it is shown as a proper ligature.
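To make the stored form concrete, here is a minimal Java sketch that lists the code points of a string. The class and method names are illustrative only and are not part of the original post.

```java
import java.util.stream.Collectors;

class CodePointDump {
    // Returns the code points of the text in U+XXXX notation, separated by spaces.
    static String hex(String text) {
        return text.codePoints()
                .mapToObj(cp -> String.format("U+%04X", cp))
                .collect(Collectors.joining(" "));
    }

    public static void main(String[] args) {
        // TA + VIRAMA + RA, written with escapes to avoid source-encoding issues
        System.out.println(hex("\u0924\u094D\u0930")); // prints: U+0924 U+094D U+0930
    }
}
```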

So my question is: how can I detect such ligatures programmatically while reading the file from my code? Since Firefox does it, there must be a way to do it programmatically. Are there any Unicode properties that contain this information, or do I need a map of all such ligatures?

The SVG/CSS property text-rendering, when set to optimizeLegibility, does the same thing (it combines code points into the proper ligature).

PS: I am using Java.

EDIT

The purpose of my code is to count the characters in the Unicode text assuming a ligature to be a single character. So I need a way to collapse multiple code points into a single ligature.

Comments (5)

小情绪 2024-09-21 16:41:25

The Wikipedia page on computer typesetting says:

The Computer Modern Roman typeface provided with TeX includes the five common ligatures ff, fi, fl, ffi, and ffl. When TeX finds these combinations in a text it substitutes the appropriate ligature, unless overridden by the typesetter.

This indicates that it is the editor that does the substitution. Moreover,

Unicode maintains that ligaturing is a presentation issue rather than a character definition issue, and that, for example, "if a modern font is asked to display 'h' followed by 'r', and the font has an 'hr' ligature in it, it can display the ligature."

As far as I can see (I got interested in this topic and have only just read a few articles), the instructions for ligature substitution are embedded in the font. I dug further and found these for you: GSUB - The Glyph Substitution Table and the Ligature Substitution Subtable from the OpenType file format specification.

Next, you need to find a library that lets you peek inside OpenType font files, i.e. a file parser for quick access. Reading the following two discussions may give you some direction on how to do these substitutions:

  1. Chromium bug http://code.google.com/p/chromium/issues/detail?id=22240
  2. Firefox bug https://bugs.launchpad.net/firefox/+bug/37828

百思不得你姐 2024-09-21 16:41:25

You may be able to get this information from the GlyphVector class.

For a given String a Font instance can create a GlyphVector that can provide information about the rendering of the text.

The layoutGlyphVector() method on the Font can provide this.

The FLAG_COMPLEX_GLYPHS attribute of the GlyphVector can tell you if the text does not have a 1 to 1 mapping with the input characters.

The following code shows an example of this:

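// Needs java.awt.Font, java.awt.font.FontRenderContext, java.awt.font.GlyphVector, javax.swing.JTextField.
// The font is an assumption for illustration; use any installed font that covers your script.
Font font = new Font("Arial Unicode MS", Font.PLAIN, 25);
JTextField textField = new JTextField();
String textToTest = "abcdefg";
FontRenderContext fontRenderContext = textField.getFontMetrics(font).getFontRenderContext();

// Lay out the whole string; the fourth argument is the exclusive end index.
char[] chars = textToTest.toCharArray();
GlyphVector glyphVector = font.layoutGlyphVector(fontRenderContext, chars, 0, chars.length, Font.LAYOUT_LEFT_TO_RIGHT);
int layoutFlags = glyphVector.getLayoutFlags();
// Set when the glyphs do not map one-to-one onto the input characters.
boolean hasComplexGlyphs = (layoutFlags & GlyphVector.FLAG_COMPLEX_GLYPHS) != 0;
int numberOfGlyphs = glyphVector.getNumGlyphs();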

numberOfGlyphs should be the number of glyphs used to display the input text.

Unfortunately you need to create a java GUI component to get the FontRenderContext.
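One caveat on that last point: FontRenderContext also has a public constructor, so a GUI component may not be strictly required. The following is a minimal sketch under that assumption; the class and method names are hypothetical, and the result of the layout still depends on which fonts are installed.

```java
import java.awt.Font;
import java.awt.font.FontRenderContext;
import java.awt.font.GlyphVector;

class GlyphLayout {
    // Builds a render context directly: a null transform means identity,
    // with anti-aliasing and fractional metrics switched on.
    static FontRenderContext renderContext() {
        return new FontRenderContext(null, true, true);
    }

    // Lays out the whole string with the given font, left to right.
    static GlyphVector layout(Font font, String text) {
        char[] chars = text.toCharArray();
        return font.layoutGlyphVector(renderContext(), chars, 0, chars.length,
                Font.LAYOUT_LEFT_TO_RIGHT);
    }

    // True when glyphs do not map one-to-one onto the input characters.
    static boolean hasComplexGlyphs(GlyphVector glyphVector) {
        return (glyphVector.getLayoutFlags() & GlyphVector.FLAG_COMPLEX_GLYPHS) != 0;
    }
}
```

A call such as GlyphLayout.layout(new Font("Arial Unicode MS", Font.PLAIN, 25), text).getNumGlyphs() would then give the glyph count, assuming that font is available on the machine.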

假装爱人 2024-09-21 16:41:25

What you are talking about are not ligatures (at least not in Unicode parlance) but grapheme clusters. There is a standard annex that is concerned with discovering text boundaries, including grapheme cluster boundaries:

http://www.unicode.org/reports/tr29/tr29-15.html#Grapheme_Cluster_Boundaries

Also see the description of tailored grapheme clusters in regular expressions:

http://unicode.org/reports/tr18/#Tailored_Graphemes_Clusters

And the definition of collation graphemes:

http://www.unicode.org/reports/tr10/#Collation_Graphemes

I think that these are starting points. The harder part will probably be to find a Java implementation of the Unicode collation algorithm that works for Devanagari locales. If you find one, you can analyze strings without resorting to OpenType features. This would be a bit cleaner since OpenType is concerned with purely presentational details and not with character or grapheme cluster semantics, but the collation algorithm and the tailored grapheme cluster boundary finding algorithm look as if they can be implemented independently of fonts.
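As a font-independent starting point, the JDK's java.text.BreakIterator can split a string at character (grapheme) boundaries. This is only a sketch with illustrative names: how a Devanagari conjunct such as त्र is segmented may depend on the JDK version and the Unicode rules it implements, so the result for your text should be checked on your own runtime.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;

class GraphemeSplitter {
    // Splits the text into user-perceived characters as defined by
    // the JDK's character BreakIterator for the default locale.
    static List<String> split(String text) {
        BreakIterator it = BreakIterator.getCharacterInstance();
        it.setText(text);
        List<String> clusters = new ArrayList<>();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            clusters.add(text.substring(start, end));
        }
        return clusters;
    }

    public static void main(String[] args) {
        // 'e' + COMBINING ACUTE ACCENT forms one cluster, followed by 'x'
        System.out.println(split("e\u0301x").size());
        // TA + VIRAMA + RA: the count here depends on the runtime's rules
        System.out.println(split("\u0924\u094D\u0930").size());
    }
}
```

Counting characters then reduces to split(text).size(), with the caveat above about conjuncts.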

莫多说 2024-09-21 16:41:25

While Aaron's answer is not exactly correct, it pushed me in the right direction. After reading through the Java API docs of java.awt.font.GlyphVector and experimenting a lot in the Clojure REPL, I was able to write a function that does what I want.

The idea is to find the widths of the glyphs in the glyphVector and to combine each zero-width glyph with the last non-zero-width glyph found before it. The solution is in Clojure, but it should be translatable to Java if required.

(ns net.abhinavsarkar.unicode
  (:import [java.awt.font TextAttribute GlyphVector]
           [java.awt Font]
           [javax.swing JTextArea]))

(let [^java.util.Map text-attrs {
        TextAttribute/FAMILY "Arial Unicode MS"
        TextAttribute/SIZE 25
        TextAttribute/LIGATURES TextAttribute/LIGATURES_ON}
      font (Font/getFont text-attrs)
      ta (doto (JTextArea.) (.setFont font))
      frc (.getFontRenderContext (.getFontMetrics ta font))]
  (defn unicode-partition
    "takes an unicode string and returns a vector of strings by partitioning
    the input string in such a way that multiple code points of a single
    ligature are in same partition in the output vector"
    [^String text]
    (let [glyph-vector 
            (.layoutGlyphVector
              font, frc, (.toCharArray text),
              0, (.length text), Font/LAYOUT_LEFT_TO_RIGHT)
          glyph-num (.getNumGlyphs glyph-vector)
          glyph-positions
            (map first (partition 2
                          (.getGlyphPositions glyph-vector 0 glyph-num nil)))
          glyph-widths
            (map -
              (concat (next glyph-positions)
                      [(.. glyph-vector getLogicalBounds width)])
              glyph-positions)
          glyph-indices 
            (seq (.getGlyphCharIndices glyph-vector 0 glyph-num nil))
          glyph-index-width-map (zipmap glyph-indices glyph-widths)
          corrected-glyph-widths
            (vec (reduce
                    (fn [acc [k v]] (do (aset acc k v) acc))
                    (make-array Float (count glyph-index-width-map))
                    glyph-index-width-map))]
      (loop [idx 0 pidx 0 char-seq text acc []]
        (if (nil? char-seq)
          acc
          (if-not (zero? (nth corrected-glyph-widths idx))
            (recur (inc idx) (inc pidx) (next char-seq)
              (conj acc (str (first char-seq))))
            (recur (inc idx) pidx (next char-seq)
              (assoc acc (dec pidx)
                (str (nth acc (dec pidx)) (first char-seq))))))))))

Also posted on Gist.

夏九 2024-09-21 16:41:25

I think that what you are really looking for is Unicode Normalization.

For Java you should check http://download.oracle.com/javase/6/docs/api/java/text/Normalizer.html

By choosing the proper normalization form you can obtain what you are looking for.
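To see what normalization does in practice, here is a small sketch using java.text.Normalizer (the class and method names are illustrative). Note that NFC can only compose sequences for which Unicode defines a precomposed code point; to my knowledge there is none for Devanagari conjuncts such as त्र, so that sequence stays at three code points.

```java
import java.text.Normalizer;

class NormalizationDemo {
    // Number of code points in the text after NFC normalization.
    static int codePointsAfterNfc(String text) {
        String nfc = Normalizer.normalize(text, Normalizer.Form.NFC);
        return nfc.codePointCount(0, nfc.length());
    }

    public static void main(String[] args) {
        // 'e' + COMBINING ACUTE ACCENT composes to a single code point
        System.out.println(codePointsAfterNfc("e\u0301")); // prints: 1
        // TA + VIRAMA + RA has no precomposed form and is left unchanged
        System.out.println(codePointsAfterNfc("\u0924\u094D\u0930")); // prints: 3
    }
}
```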
