How to determine the width of Unicode characters
A friend and I are writing our own console in Java, but we have trouble aligning the lines correctly because the display width of Unicode characters cannot be determined exactly. As a result, not only the line containing the Unicode characters is shifted, but the following lines are as well.
Is there a way to determine the display width of Unicode characters?
Screenshots of the problem can be found below.
This is how it should look: https://abload.de/img/richtigslkmg.jpeg
This is an example in Terminal: https://abload.de/img/terminal7dj5o.jpeg
This is an example in PowerShell: https://abload.de/img/powershelln7je0.jpeg
This is an example in Visual Studio Code: https://abload.de/img/visualstudiocode4xkuo.jpeg
This is an example in Putty: https://abload.de/img/putty0ujsk.png
EDIT:
I am sorry that the question was unclear.
It is about the display width. In the example, I try to determine the display length so that each line has the same length.
The function real_length is meant to calculate and return the display width.
Here is the example code:
public static void main(String[] args) {
    String[] tests = {
        "Peter",
        "SHGAMI",
        "Marcel №1",
        "????",
        "????❤️????",
        "????❤️????????",
        "????????????"
    };
    for (String test : tests) test(test);
}

public static void test(String text) {
    int max = 20;
    // Reference row of 20 '#' characters.
    for (int i = 0; i < max; i++) System.out.print("#");
    System.out.println();
    // The text, padded with '#' up to the same width.
    System.out.print(text);
    int length = real_length(text);
    for (int i = 0; i < max - length; i++) System.out.print("#");
    System.out.println();
}

// Should return the display width; text.length() only counts UTF-16 code units.
public static int real_length(String text) {
    return text.length();
}
Comments (4)
Unfortunately there is no easy solution to your deceptively simple question, for several reasons:
The width of the characters rendered on the console might (and probably will) vary with the font being used. So the code would need to determine, or assume, the target font in order to calculate widths.
System.out is just a PrintStream that neither knows nor cares about fonts and character widths, so any solution has to be independent of it.
Even if you could determine the font being used on the console, and you had a way to determine the width of each character you were trying to render in that specific font, how would that help you? Knowing the variation in widths might conceivably allow you to tweak the rendered lines so that they align, but it is just as likely that this would not be practicable.
A potential solution is to leave your code as it stands and use a monospaced font on the console that println() is writing to, but there are still some major problems with that approach. First, you need to identify a font that is monospaced but also supports all of the characters you want to render. This can be problematic when emojis are included. Second, even if you identify such a font, you may find that not all of its glyphs are monospaced! Such a font will ensure that (say) a lowercase i and an uppercase W have the same width, but you can't make that assumption for emojis, and you can't even assume that the "monospaced" emojis will all share the same non-standard width! Third, the font you identify (if it exists at all) would have to be available in your target environments (your PowerShell, your friend's PuTTY shell, etc.). That is not a major obstacle, but it is one more thing to worry about.
You may find that the rendered text varies by operating system. Your output may look aligned in a Linux terminal window, but that same output, using the same font, might be misaligned in a PowerShell window.
Given all that, a better approach might be to use Swing or JavaFX, where you have finer control over the output being rendered. Even if you are unfamiliar with those technologies, it wouldn't take too long to get something working, just by tweaking some sample code obtained through a search. And even allowing for the learning curve, it would still take less time than coming up with a robust solution for aligning arbitrary characters written to an arbitrary console, because that is a hard problem to solve.
Notes:
The real_length() method merely returns the number of UTF-16 code units (char values) in the supplied Java String. That relates to the string's internal representation and has no direct correlation with the width of the rendered characters, which is determined by the font being used.

Sounds like you're looking for a Java implementation of the POSIX wcwidth and wcswidth functions, which implement the rules defined in Unicode Technical Report #11 (which focuses exclusively on display widths for Unicode code points when rendered to fixed-width devices such as terminals). The only such Java implementation that I'm aware of is in the JLine3 library, which is a lot of code to bring in for just this one class, but that may be your best bet.
Note however that that code appears to be incomplete. Unicode code point 0x26AA (⚪️), for example, is reported as having a width of 1 by the JLine3 code, but on every platform I've tested on (including here in the StackOverflow editor, which is a fixed-width "device") that code point is displayed over two columns.
Good luck - this stuff is a lot more complex than it looks. The JVM's unfortunate UCS-2 history (not Sun's fault - it was bad timing with respect to the Unicode standard) only makes matters worse, and as others have said here, avoid the char and Character data types like the plague - they do not work the way you expect, and the instant that code using those types encounters data including code points from the Unicode supplementary planes, it is almost certain to function incorrectly (unless the author has been especially careful - do you feel lucky?).
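To make the wcwidth idea concrete, here is a minimal sketch of a column estimate in plain Java. It is not the JLine3 code and not a complete implementation of the Unicode width rules: the class name, the method names and the hand-picked "wide" ranges are assumptions for illustration only. Zero-width handling relies on Character.getType, and everything not listed as wide is counted as one column.

```java
final class DisplayWidthSketch {

    /** Estimated terminal columns for one code point: 0, 1 or 2. */
    static int columnsOf(int codePoint) {
        int type = Character.getType(codePoint);
        // Combining marks, variation selectors, ZERO WIDTH JOINER (U+200D)
        // and control characters are assumed to take no column here.
        if (type == Character.NON_SPACING_MARK
                || type == Character.ENCLOSING_MARK
                || type == Character.FORMAT
                || type == Character.CONTROL) {
            return 0;
        }
        return isAssumedWide(codePoint) ? 2 : 1;
    }

    /** Hand-picked approximation of wide ranges (an assumption, not the full Unicode data). */
    static boolean isAssumedWide(int cp) {
        return (cp >= 0x1100 && cp <= 0x115F)                   // Hangul Jamo
            || (cp >= 0x2E80 && cp <= 0xA4CF && cp != 0x303F)   // CJK, Kana, Yi
            || (cp >= 0xAC00 && cp <= 0xD7A3)                   // Hangul syllables
            || (cp >= 0xF900 && cp <= 0xFAFF)                   // CJK compatibility ideographs
            || (cp >= 0xFE30 && cp <= 0xFE6F)                   // CJK compatibility forms
            || (cp >= 0xFF00 && cp <= 0xFF60)                   // fullwidth forms
            || (cp >= 0xFFE0 && cp <= 0xFFE6)
            || (cp >= 0x1F300 && cp <= 0x1F64F)                 // many emoji (assumed wide)
            || (cp >= 0x1F900 && cp <= 0x1F9FF)                 // more emoji (assumed wide)
            || (cp >= 0x20000 && cp <= 0x3FFFD);                // CJK extensions
    }

    /** Sums the per-code-point estimates; iterates code points, never chars. */
    static int displayWidth(String text) {
        return text.codePoints().map(DisplayWidthSketch::columnsOf).sum();
    }

    /** Pads text with '#' until the estimated width reaches the given column count. */
    static String padToColumns(String text, int columns) {
        StringBuilder sb = new StringBuilder(text);
        for (int i = displayWidth(text); i < columns; i++) {
            sb.append('#');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String[] samples = {"Peter", "Marcel \u21161", "\u65E5\u672C\u8A9E", "\uD83D\uDE00"};
        for (String sample : samples) {
            System.out.println(padToColumns(sample, 20));
        }
    }
}
```

A sketch like this shares the weakness described above: a code point such as 0x26AA, or U+2764 followed by the emoji variation selector, is counted as one column although many terminals draw it over two, and emoji joined by ZERO WIDTH JOINER are counted separately. The terminal and its font still have the final say.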
Note: This answer is distinct and qualitatively different from my earlier one (which I still stand by).
There is a simple way for a Java application (i.e. one not using a graphical user interface) to obtain the width of a String being rendered in a given font with a given font size. It requires the use of some AWT classes which are supported even in a non-AWT environment. Here's a demo using the data provided in the question:
The code above is based on this old answer by user Lonzak to the question Java - FontMetrics without Graphics. Those AWT classes allow you to create a Font with defined characteristics (i.e. name, size, style), and then use a FontMetrics instance to obtain the width of an arbitrary String when using that font.
Here is the output from running the code shown above:
Notes:
The first set of results shows the widths of the sample data in the question when using plain Monospaced 10 point font. The second set of results shows the widths of those same strings when using bold Monospaced 24 point font.
The widths don't look correct for some of the emojis, but that is because when the source code and output results are pasted into SO some emoji representations are changed, presumably because of the different font being used in the browser. (I was using Monospaced for both the source and the output.) Here's a screen shot of the original output, showing that the widths at least look plausible:
Even though the widths are being calculated and rendered for a fixed width font (Monospaced), it's clear that the width of the emojis cannot be predicted from the widths of normal keyboard characters.
tl;dr
Use code points rather than char. Avoid calling String#length.
Details
Your Question neglected to show any code, so I can only guess what you are doing and what the problem might be.
Avoid char
I am guessing that your goal is to append as many NUMBER SIGN characters as needed to make a fixed-length row of text.
I am guessing the problem is that you are using the legacy char type, or its wrapper class Character. The char type has been essentially broken since Java 2. As a 16-bit value, char is physically incapable of representing most characters.
Use code point numbers
Instead, use code point integer numbers when working with individual characters. A code point is the number permanently assigned to each of the over 140,000 characters defined in Unicode.
A variety of code point related methods have been added to various classes in Java 5+: String, StringBuilder, Character, etc.
Here we use String#codePoints to get an IntStream of code points, one element for each character in the source. And we use StringBuilder#appendCodePoint to collect the code points for our final result string.
Or, shorten that for loop with the use of a ternary operator.
Report result.
There is likely a clever way to write that code more briefly with streams and lambdas, but I cannot think of one at the moment.
And, one could cleverly use the String#repeat method in Java 11+.
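As a minimal sketch of the code-point approach described above (this is not the original answerer's code; the class and method names are hypothetical, and Java 11+ is assumed for String#repeat):

```java
final class CodePointPadding {

    /** Pads text with '#' up to targetLength code points, using String#repeat (Java 11+). */
    static String padWithRepeat(String text, int targetLength) {
        // Count code points, not chars: a supplementary character is one code point but two chars.
        int count = text.codePointCount(0, text.length());
        int missing = Math.max(0, targetLength - count);
        return text + "#".repeat(missing);
    }

    /**
     * Builds a row of exactly targetLength code points with
     * String#codePoints and StringBuilder#appendCodePoint.
     * Unlike padWithRepeat, this variant truncates text that is too long.
     */
    static String padWithBuilder(String text, int targetLength) {
        int[] codePoints = text.codePoints().toArray();
        StringBuilder row = new StringBuilder();
        for (int i = 0; i < targetLength; i++) {
            // Ternary: take the next code point of the text, or '#' once the text is used up.
            row.appendCodePoint(i < codePoints.length ? codePoints[i] : '#');
        }
        return row.toString();
    }

    public static void main(String[] args) {
        String[] samples = {"Peter", "Marcel \u21161", "\uD83D\uDE00"};
        for (String sample : samples) {
            System.out.println(padWithRepeat(sample, 20));
            System.out.println(padWithBuilder(sample, 20));
        }
    }
}
```

Both helpers make every row the same length in code points. That fixes the miscount caused by surrogate pairs, but it still does not measure display columns: an emoji that a terminal draws over two columns counts as one here.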