How do I use Unicode (UTF-8) characters in Clojure regular expressions?

Posted 2024-09-06 13:56:56

This is a two-part question for you amazingly kind Stack Overflow wizards out there.

How do I set emacs/slime/swank to use UTF-8 when talking with Clojure, or use UTF-8 at the command-line REPL? At the moment I cannot send any non-Roman characters to swank-clojure, and the command-line REPL garbles them.

It's really easy to do regular expressions on Latin text:

(re-seq #"[\w]+" "It's really true that Japanese sentences don't need spaces?")

But what if I had some Japanese? I thought that this would work, but I can't test it:

(re-seq #"[(?u)\w]+" "日本語 の 文章 に は スペース が 必要 ない って、 本当？")

It gets harder if we have to use a dictionary to find word breaks, or to find a katakana-only word ourselves:

(re-seq #"[アイウエオ-ン]" "日本語の文章にはスペースが必要ないって、本当？")

Thanks!

萌逼全场 2024-09-13 13:56:56

Can't help with swank or Emacs, I'm afraid. I'm using Enclojure on NetBeans and it works well there.

On matching: As Alex said, \w doesn't work for non-English characters, not even the extended Latin charsets for Western Europe:

(re-seq #"\w+" "prøve")  => ("pr" "ve")  ; Norwegian
(re-seq #"\w+" "mañana") => ("ma" "ana") ; Spanish
(re-seq #"\w+" "große")  => ("gro" "e")  ; German
(re-seq #"\w+" "plaît")  => ("pla" "t")  ; French

\w skips the extended characters. Using [(?u)\w]+ instead makes no difference, and the same goes for the Japanese: inside square brackets (?u) is not a flag, just the literal characters (, ?, u and ).
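
A small sketch of why the bracketed version changes nothing, assuming the stock java.util.regex engine that Clojure uses: inside a character class the characters (, ?, u and ) are ordinary literals, and as a real flag outside a class (?u) only enables Unicode-aware case folding. The sample strings are made up for illustration.

```clojure
;; Inside [...] the "flag" is just four literal characters plus \w,
;; so a string made only of those characters matches in full.
(def bracketed (re-seq #"[(?u)\w]+" "(u)?"))
;; => ("(u)?")

;; As a real flag, (?u) is UNICODE_CASE; \w itself stays ASCII-only.
(def flagged (re-seq #"(?u)\w+" "prøve"))
;; => ("pr" "ve")
```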

But see this regex reference: \p{L} matches any Unicode character in the category Letter, so it actually works for Norwegian:

(re-seq #"\p{L}+" "prøve")
=> ("prøve")

as well as for Japanese (at least I suppose so, I can't read it but it seems to be in the ballpark):

(re-seq #"\p{L}+" "日本語 の 文章 に は スペース が 必要 ない って、 本当？")
=> ("日本語" "の" "文章" "に" "は" "スペース" "が" "必要" "ない" "って" "本当")

There are lots of other options, like matching on combining diacritical marks and whatnot, check out the reference.
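
As one illustration of those other options, here is a sketch that adds the Mark category (\p{M}) so that a letter followed by a combining diacritic stays in one match. The sample string is an assumption: it spells "café" with a plain e followed by U+0301 COMBINING ACUTE ACCENT.

```clojure
;; "cafe" + U+0301 (combining acute accent), then two plain words.
(def decomposed "cafe\u0301 au lait")

;; \p{L} alone stops at the combining mark: U+0301 is a Mark, not a Letter.
(def letters-only (re-seq #"\p{L}+" decomposed))
;; => ("cafe" "au" "lait")

;; Allowing marks as well keeps the accent attached to its base letter,
;; so the first match is five chars long: c a f e U+0301.
(def letters-and-marks (re-seq #"[\p{L}\p{M}]+" decomposed))
```

Normalizing the input to a composed form first is another route, but that is outside what this answer covers.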

Edit: More on Unicode in Java

A quick reference to other points of potential interest when working with Unicode.

Fortunately, Java generally does a very good job of reading and writing text in the correct encoding for the locale and platform, but occasionally you need to override it.

This is all Java; most of this stuff does not have a Clojure wrapper (at least not yet).

java.nio.charset.Charset - represents a charset like US-ASCII, ISO-8859-1, UTF-8
java.io.InputStreamReader - lets you specify a charset to translate from bytes to strings when reading. There is a corresponding OutputStreamWriter.
java.lang.String - lets you specify a charset when creating a String from an array of bytes.
java.lang.Character - has methods for getting the Unicode category of a character and converting between Java chars and Unicode code points.
java.util.regex.Pattern - specification of regexp patterns, including Unicode blocks and categories.
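
A minimal sketch of how the first three classes fit together from Clojure, assuming Java 7 or later for java.nio.charset.StandardCharsets. The var names are made up for illustration. The last form uses the :encoding option of clojure.core/slurp, which is a Clojure-side convenience and not something the list above mentions.

```clojure
(import '(java.io ByteArrayInputStream BufferedReader InputStreamReader)
        '(java.nio.charset StandardCharsets))

(def text "日本語")

;; Encode explicitly: each of these three characters takes 3 bytes in UTF-8.
(def utf8-bytes (.getBytes ^String text StandardCharsets/UTF_8))

;; Decode with the matching charset: the text round-trips.
(def decoded (String. ^bytes utf8-bytes StandardCharsets/UTF_8))

;; Decode with the wrong charset: every byte becomes its own (wrong) character.
(def garbled (String. ^bytes utf8-bytes StandardCharsets/ISO_8859_1))

;; InputStreamReader performs the same byte-to-char translation while reading.
(def via-reader
  (with-open [r (BufferedReader.
                 (InputStreamReader. (ByteArrayInputStream. utf8-bytes)
                                     StandardCharsets/UTF_8))]
    (.readLine r)))

;; slurp accepts an :encoding option for the same purpose.
(def via-slurp (slurp (ByteArrayInputStream. utf8-bytes) :encoding "UTF-8"))
```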

Java characters/strings are UTF-16 internally. The char type (and its wrapper Character) is 16 bits, which is not enough to represent all of Unicode, so characters outside the Basic Multilingual Plane (emoji and some rare CJK ideographs, for example) need two chars, a surrogate pair, to represent one symbol.

When dealing with text that may contain such characters it's often better to use code points rather than chars. A code point is one Unicode character/symbol represented as an int. The String and Character classes have methods for converting between Java chars and Unicode code points.
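
To make the char versus code point distinction concrete, here is a rough sketch assuming Java 8 or later (for String.codePoints). The sample string and var names are illustrative only; the emoji is built with Character/toChars so the source stays free of supplementary characters.

```clojure
;; "a", U+1F600 (an emoji outside the Basic Multilingual Plane), and 語.
(def s (str "a" (String. (Character/toChars 0x1F600)) "語"))

;; count sees UTF-16 code units, so the emoji counts twice.
(def code-units (count s))

;; codePointCount sees Unicode code points.
(def code-points (.codePointCount ^String s 0 (count s)))

;; String.codePoints yields each code point as an int.
(def as-ints (vec (iterator-seq (.iterator (.codePoints ^String s)))))

;; java.util.regex matches by code point, so . consumes the whole emoji.
(def regex-dots (count (re-seq #"." s)))
```

Here code-units is 4 while code-points and regex-dots are both 3, which is why regex matching on such text usually behaves as expected even though count does not.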

unicode.org - the Unicode standard and code charts.

I'm putting this here since I occasionally need this stuff, but not often enough to actually remember the details from one time to the next. Sort of a note to my future self, and it might be useful to others starting out with international languages and encodings as well.

潜移默化 2024-09-13 13:56:56

I'll answer half a question here:

How do I set emacs/slime/swank to use UTF-8 when talking with Clojure, or use UTF-8 at the command-line REPL?

A more interactive way:

M-x customize-group
"slime-lisp"
Find the option for slime coding system, and select utf-8-unix. Save this so Emacs picks it up in your next session.

Or place this in your .emacs:

(custom-set-variables '(slime-net-coding-system (quote utf-8-unix)))

That's what the interactive menu will do anyway.

Works on Emacs 23 and works on my machine

小帐篷 2024-09-13 13:56:56

For katakana, Wikipedia shows you the Unicode ordering. So if you wanted to use a regex character class that caught all the katakana, I suppose you could do something like this:

user> (re-seq #"[\u30a0-\u30ff]+" "日本語の文章にはスペースが必要ないって、本当？")
("スペース")

Hiragana, for what it's worth:

user> (re-seq #"[\u3040-\u309f]+" "日本語の文章にはスペースが必要ないって、本当？")
("の" "には" "が" "ないって")

I'd be pretty amazed if any regex could detect Japanese word breaks.
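
The same ranges can also be written with the named Unicode blocks and scripts that java.util.regex.Pattern understands. This is only a sketch: it splits the sentence into runs of the same character type, which is not real word segmentation (には and ないって come out as single runs). Block names are used for the kana because the long-vowel mark ー (U+30FC) has script Common, so a script-based class such as \p{IsKatakana} would split スペース.

```clojure
(def sentence "日本語の文章にはスペースが必要ないって、本当？")

;; \p{InKatakana} and \p{InHiragana} are Unicode blocks
;; (U+30A0-U+30FF and U+3040-U+309F, the same ranges as above);
;; \p{IsHan} is the Han script (kanji), available since Java 7.
(def runs
  (re-seq #"\p{InKatakana}+|\p{InHiragana}+|\p{IsHan}+" sentence))
;; => ("日本語" "の" "文章" "には" "スペース" "が" "必要" "ないって" "本当")
```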

樱娆 2024-09-13 13:56:56

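For international characters you need to use the Java character classes, something like [\p{javaLowerCase}\p{javaUpperCase}]+ to match any word characters... \w is for ASCII only. See the java.util.regex documentation.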

不如归去 2024-09-13 13:56:56

Prefix the regex with (?U), like so: (re-matches #"(?U)\w+" "ñé2_hi") ;=> "ñé2_hi".

This sets the UNICODE_CHARACTER_CLASS flag to true, so that the typical character classes do what you want with non-ASCII Unicode.

For details, see: http://docs.oracle.com/javase/8/docs/api/java/util/regex/Pattern.html#UNICODE_CHARACTER_CLASS
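
A short sketch of the flag in both forms, assuming Java 7 or later (where UNICODE_CHARACTER_CLASS was introduced). The sample strings and var names are made up for illustration.

```clojure
(import 'java.util.regex.Pattern)

;; Default: \w is [a-zA-Z_0-9], so the non-ASCII letter splits the word.
(def ascii-only (re-seq #"\w+" "prøve"))
;; => ("pr" "ve")

;; Embedded flag (?U): \w follows the Unicode definition of a word character.
(def unicode-embedded (re-seq #"(?U)\w+" "prøve 日本語"))
;; => ("prøve" "日本語")

;; The same flag passed to Pattern/compile; a plain string needs the
;; backslash doubled, unlike a #"..." regex literal.
(def unicode-pattern (Pattern/compile "\\w+" Pattern/UNICODE_CHARACTER_CLASS))
(def unicode-compiled (re-seq unicode-pattern "mañana"))
;; => ("mañana")
```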
