How do I use Unicode (UTF-8) characters in Clojure regular expressions?

Posted on 2024-09-06 13:56:56

This is a two-part question for you amazingly kind Stack Overflow wizards out there.

  1. How do I set emacs/slime/swank to use UTF-8 when talking with Clojure, or use UTF-8 at the command-line REPL? At the moment I cannot send any non-roman characters to swank-clojure, and using the command-line REPL garbles things.

  2. It's really easy to do regular expressions on Latin text:

    (re-seq #"[\w]+" "It's really true that Japanese sentences don't need spaces?")

But what if I had some Japanese text? I thought this would work, but I can't test it:

(re-seq #"[(?u)\w]+" "日本語 の 文章 に は スペース が 必要 ない って、 本当?")

It gets harder if we have to use a dictionary to find word breaks, or to find a katakana-only word ourselves:

(re-seq #"[アイウエオ-ン]" "日本語の文章にはスペースが必要ないって、本当?")

Thanks!

Comments (5)

萌逼全场 2024-09-13 13:56:56

Can't help with swank or Emacs, I'm afraid. I'm using Enclojure on NetBeans and it works well there.

On matching: As Alex said, \w doesn't work for non-English characters, not even the extended Latin charsets for Western Europe:

(re-seq #"\w+" "prøve")  => ("pr" "ve")  ; Norwegian
(re-seq #"\w+" "mañana") => ("ma" "ana") ; Spanish
(re-seq #"\w+" "große")  => ("gro" "e")  ; German
(re-seq #"\w+" "plaît")  => ("pla" "t")  ; French

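\w skips the extended characters. Using [(?u)\w]+ instead makes no difference, and the same goes for the Japanese: inside a character class, (?u) is read as the literal characters (, ?, u and ) rather than as a flag, and as a flag the lowercase (?u) only turns on Unicode-aware case folding anyway.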

But see this regex reference: \p{L} matches any Unicode character in the Letter category, so it actually works for Norwegian:

(re-seq #"\p{L}+" "prøve")
=> ("prøve")

as well as for Japanese (at least I suppose so, I can't read it but it seems to be in the ballpark):

(re-seq #"\p{L}+" "日本語 の 文章 に は スペース が 必要 ない って、 本当?")
=> ("日本語" "の" "文章" "に" "は" "スペース" "が" "必要" "ない" "って" "本当")

There are lots of other options, like matching on combining diacritical marks and whatnot; check out the reference.
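As an illustration of the combining-marks point, here is a minimal sketch. The sample string and the var names are made up for the example; it assumes a decomposed "é" written as "e" followed by U+0301 COMBINING ACUTE ACCENT.

```clojure
;; "e\u0301" is "e" followed by U+0301 COMBINING ACUTE ACCENT,
;; i.e. a decomposed "é" (sample text is an assumption for this sketch).
(def decomposed "cafe\u0301 au lait")

;; \p{L} alone stops at the combining mark: the mark is in category M, not L.
(def letters-only (re-seq #"\p{L}+" decomposed))

;; Adding \p{M} to the class keeps base letters and their marks together.
(def letters-and-marks (re-seq #"[\p{L}\p{M}]+" decomposed))

(println letters-only)
(println letters-and-marks)
```

If the input might mix composed and decomposed forms, normalizing it first (for example with java.text.Normalizer) is probably simpler than widening every character class.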

Edit: More on Unicode in Java

A quick reference to other points of potential interest when working with Unicode.

Fortunately, Java generally does a very good job of reading and writing text in the correct encodings for the location and platform, but occasionally you need to override it.

This is all Java; most of this stuff does not have a Clojure wrapper (at least not yet).

  • java.nio.charset.Charset - represents a charset like US-ASCII, ISO-8859-1, UTF-8
  • java.io.InputStreamReader - lets you specify a charset to translate from bytes to strings when reading. There is a corresponding OutputStreamWriter.
  • java.lang.String - lets you specify a charset when creating a String from an array of bytes.
  • java.lang.Character - has methods for getting the Unicode category of a character and converting between Java chars and Unicode code points.
  • java.util.regex.Pattern - specification of regexp patterns, including Unicode blocks and categories.
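Since these classes have no Clojure wrapper, they are used through plain Java interop. The following is only a sketch of overriding the encoding explicitly; the sample text and var names are assumptions, and StandardCharsets needs Java 7 or later.

```clojure
(import '(java.io BufferedReader ByteArrayInputStream InputStreamReader)
        '(java.nio.charset StandardCharsets))

(def text "日本語 prøve")

;; Encode with an explicit charset instead of the platform default.
(def utf8-bytes (.getBytes ^String text StandardCharsets/UTF_8))

;; java.lang.String: decode a byte array with an explicit charset.
(def round-trip (String. ^bytes utf8-bytes StandardCharsets/UTF_8))

;; Decoding the same bytes with the wrong charset yields garbage.
(def garbled (String. ^bytes utf8-bytes StandardCharsets/ISO_8859_1))

;; java.io.InputStreamReader: choose the charset while reading a stream.
(def first-line
  (with-open [rdr (BufferedReader.
                   (InputStreamReader. (ByteArrayInputStream. utf8-bytes)
                                       StandardCharsets/UTF_8))]
    (.readLine rdr)))

(println round-trip)
(println first-line)
```

The in-memory ByteArrayInputStream just stands in for a file or socket; the same InputStreamReader call applies to any InputStream.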

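Java characters/strings are UTF-16 internally. The char type (and its wrapper Character) is 16 bits, which is not enough to represent all of Unicode, so characters beyond U+FFFF (rarer CJK ideographs, emoji, some historic scripts) need two chars, a surrogate pair, to represent one symbol. Most everyday non-Latin text, including kana and common kanji, still fits in a single char.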

When dealing with non-Latin Unicode it's often better to use code points rather than characters. A code point is one Unicode character/symbol represented as an int. The String and Character classes have methods for converting between Java chars and Unicode code points.
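To make the char versus code point distinction concrete, here is a small sketch. The sample string is an assumption: U+20BB7 is a CJK ideograph outside the 16-bit range, and it is built at runtime so the source file itself stays simple. The codePoints call needs Java 8 or later.

```clojure
;; U+20BB7 needs a surrogate pair in UTF-16; the other two characters do not.
(def s (str (String. (Character/toChars 0x20BB7)) "野家"))

;; Number of 16-bit Java chars.
(def char-count (.length ^String s))

;; Number of Unicode code points.
(def code-point-count (.codePointCount ^String s 0 (.length ^String s)))

;; The code points themselves, as ints (Java 8+).
(def code-points
  (vec (.toArray ^java.util.stream.IntStream (.codePoints ^String s))))

;; java.util.regex matches by code point, so "." consumes a whole surrogate pair.
(def regex-matches (count (re-seq #"." s)))

(println char-count code-point-count code-points regex-matches)
```

Note that Clojure's count and seq on a string work on chars, not code points, so they see the surrogate pair as two items.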

I'm putting this here since I occasionally need this stuff, but not often enough to actually remember the details from one time to the next. Sort of a note to my future self, and it might be useful to others starting out with international languages and encodings as well.

潜移默化 2024-09-13 13:56:56

I'll answer half a question here:

How do I set emacs/slime/swank to use UTF-8 when talking with Clojure, or use UTF-8 at the command-line REPL?

A more interactive way:

  1. M-x customize-group
  2. "slime-lisp"
  3. Find the option for slime coding system, and select utf-8-unix. Save this so Emacs picks it up in your next session.

Or place this in your .emacs:

(custom-set-variables '(slime-net-coding-system (quote utf-8-unix)))

That's what the interactive menu will do anyway.

Works on Emacs 23 and works on my machine
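For the command-line REPL half of the question, which this answer does not cover, a quick check of what the JVM thinks its encoding is can help narrow things down. This is only a diagnostic sketch with made-up var names. It assumes that garbling comes from a JVM default charset or terminal that is not UTF-8; starting the JVM with the standard system property -Dfile.encoding=UTF-8 is one commonly used way to change the default.

```clojure
(import '(java.nio.charset Charset))

;; Charset the JVM falls back to when no encoding is given explicitly.
(def default-charset (.name (Charset/defaultCharset)))

;; The system property that -Dfile.encoding=... sets.
(def file-encoding (System/getProperty "file.encoding"))

(println "default charset:" default-charset)
(println "file.encoding:  " file-encoding)

;; If this prints as question marks or mojibake, the terminal or the
;; JVM's output encoding is the likely culprit rather than Clojure.
(println "日本語")
```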

小帐篷 2024-09-13 13:56:56

For katakana, Wikipedia shows you the Unicode ordering. So if you wanted to use a regex character class that caught all the katakana, I suppose you could do something like this:

user> (re-seq #"[\u30a0-\u30ff]+" "日本語の文章にはスペースが必要ないって、本当?")
("スペース")

Hiragana, for what it's worth:

user> (re-seq #"[\u3040-\u309f]+" "日本語の文章にはスペースが必要ないって、本当?")
("の" "には" "が" "ないって")

I'd be pretty amazed if any regex could detect Japanese word breaks.
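As a possible alternative to the hard-coded \u ranges above, java.util.regex also accepts named Unicode blocks (prefix In) and, on Java 7 and later, scripts (prefix Is). A sketch on the same sentence, with var names invented for the example:

```clojure
(def sentence "日本語の文章にはスペースが必要ないって、本当?")

;; Named blocks cover the same ranges as [\u30a0-\u30ff] and [\u3040-\u309f].
(def katakana-runs (re-seq #"\p{InKatakana}+" sentence))
(def hiragana-runs (re-seq #"\p{InHiragana}+" sentence))

;; Kanji via the Han script property (Java 7+).
(def kanji-runs (re-seq #"\p{IsHan}+" sentence))

(println katakana-runs)
(println hiragana-runs)
(println kanji-runs)
```

The block form is used for kana here on purpose: as far as I know the long-vowel mark ー (U+30FC) sits in the Katakana block but is assigned the Common script, so a script-based \p{IsKatakana} would likely split a word such as スペース.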

樱娆 2024-09-13 13:56:56

For international characters you need to use the Java Character classes, something like [\p{javaLowerCase}\p{javaUpperCase}]+, to match any word character. \w only covers ASCII; see the java.util.regex.Pattern documentation.
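A quick sketch of what that class does and does not match. The sample strings and var names are assumptions. Kanji and kana have no upper or lower case, so this particular class is not expected to help with the Japanese sentence; \p{L} is shown for comparison.

```clojure
;; Cased letters: fine for European text with accented characters.
(def european
  (re-seq #"[\p{javaLowerCase}\p{javaUpperCase}]+" "prøve Große mañana"))

;; Kanji and kana are caseless, so the same class finds nothing here.
(def japanese-cased
  (re-seq #"[\p{javaLowerCase}\p{javaUpperCase}]+" "日本語 の 文章"))

;; \p{L} (any letter) handles both kinds of text.
(def japanese-letters (re-seq #"\p{L}+" "日本語 の 文章"))

(println european)
(println japanese-cased)
(println japanese-letters)
```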

不如归去 2024-09-13 13:56:56

Prefix your regex with (?U) like so: (re-matches #"(?U)\w+" "ñé2_hi") => "ñé2_hi".

This sets the UNICODE_CHARACTER_CLASS flag (available since Java 7), so that the predefined character classes such as \w do what you want with non-ASCII Unicode.

See here for more info: http://docs.oracle.com/javase/8/docs/api/java/util/regex/Pattern.html#UNICODE_CHARACTER_CLASS
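For comparison with the earlier examples in this thread, a short sketch of the flag in use, both inline and passed to Pattern/compile. The var names are invented for the example and Java 7 or later is assumed.

```clojure
(import '(java.util.regex Pattern))

;; Default \w is ASCII-only, so accented letters split the words.
(def words-ascii (re-seq #"\w+" "prøve mañana"))

;; Inline (?U) turns on UNICODE_CHARACTER_CLASS for the whole pattern.
(def words-unicode (re-seq #"(?U)\w+" "prøve mañana"))

;; The same flag handles the space-separated Japanese sample.
(def japanese (re-seq #"(?U)\w+" "日本語 の 文章 に は スペース"))

;; Equivalent to the inline flag: pass the constant to Pattern/compile.
(def unicode-word (Pattern/compile "\\w+" Pattern/UNICODE_CHARACTER_CLASS))
(def via-flag (re-seq unicode-word "große plaît"))

(println words-ascii)
(println words-unicode)
(println japanese)
(println via-flag)
```

Note the capital U: the lowercase (?u) is a different flag (UNICODE_CASE) and does not change what \w matches.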
