How do I remove a specific set of characters, but not others?

Posted 2025-01-20 21:36:28


As a consequence of botched character decoding, I have a set of titles that look like this, with special characters followed by other characters like ñ that were not in the original:

P├ñiv├ñn J├ñlkeen
Tuuli k├ñ├ñnty├ñ voi
P├ñiv├ñn J├ñlkeen
Tuuli k├ñ├ñnty├ñ voi
∩╜óEurotrash∩╜ú
Le D├⌐sert N'Est Plus En Afrique

I know that I can use something along the lines of s:g / ... / ... /, but I am unsure of how to match just the special characters (├, ╜) as opposed to capturing apostrophes, spaces, and so on.

When using \W to try and capture "not a character used in a 'word'", I run into the issue of it matching apostrophes, spaces, and so on.

Thus, my question is, what am I missing that would be useful when trying to essentially delete these characters from the words?


心安伴我暖 2025-01-27 21:36:28


TL;DR Two solutions plus some further discussion.

Simplest solution that works for your examples

<:So>

For example:

```
say
  S:g /<:So>//
given
  'P├ñiv├ñn J├ñlkeen'

# Pñivñn Jñlkeen
```

An explanation of the above solution

First, an explanation that looks short but will take you deep fast:

Learn about Unicode character properties. Here's the official Unicode Character Database doc.
Study Raku materials (eg the latest Raku documentation on Unicode properties) and discussions (eg searches of SO: "unicode property", "unicode properties") about use of Unicode properties in Raku.
Learn how the reference Raku compiler (Rakudo) actually uses them. (It gets complicated; see the SO question "What are all the Unicode properties a Raku character will match".)

Now a slightly longer explanation that may make understanding this stuff easier than the approach outlined above:

In Raku regexes, <...> is a general assertion that ... is true. There are many variations on what ... can be.¹
One variation is that the current character is matched against a character class, or some combination of classes, specified by the <...>. There are many variations on what this Raku character class expression ... can be.²
One variation of a character class expression is an explicit Unicode character property value; it will match characters with that Unicode character property value. There are many variations on what a Unicode character property and its values can be; <:So> is an example and works for your scenario.³

Here's the process I went through to select that character class for your scenario:

I used a util.unicode.org hosted utility to see properties of the ├ character.⁴
I focused on the General_Category property listed near the top of the left hand column of the table. This is a common property to specify in a Raku regex. You generally won't specify it by writing <:General_Category...>, with the ... being the category, but instead by just writing <:category> where category is the category.
I noted the General_Category value (the next column): Other_Symbol. If you read the Unicode or Raku doc related to the General_Category property, you'll see that So is a short alias of Other_Symbol.
To specify a Unicode property in Raku, write it using Raku's colon pair literal syntax. The key can be any of many Unicode properties.⁵ So to match a character that has the So property, write <:So> in a regex.
To remove a character that has that class from a string, one option is to use the s/// construct:
```
say
  S:g /<:So>//
given
  'P├ñiv├ñn J├ñlkeen' 

# Pñivñn Jñlkeen
```
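If you would rather check categories from within Raku than in the web utility, the built-in `ord` and `uniprop` methods can report a character's code point and, by default, its General_Category. This is a minimal sketch; the list of characters is picked from the sample titles for illustration, plus an apostrophe and a space.

```raku
# Print the code point and General_Category of a few characters.
for '├', '╜', '∩', 'ñ', "'", ' ' -> $char {
    printf "U+%04X %s\n", $char.ord, $char.uniprop;
}

# U+251C So
# U+255C So
# U+2229 Sm
# U+00F1 Ll
# U+0027 Po
# U+0020 Zs
```

Note that ∩ reports Sm (Math_Symbol) rather than So, so `<:So>` on its own leaves that character in place.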

Another way to get the same result

You might want to do "arithmetic" with character classes / characters, adding some, and subtracting others:

```
say
  S:g / <+ [\W] - [\s] - ['] > //
given
  'P├ñiv├ñn J├ñlkeen'

# Pñivñn Jñlkeen
```

Read <+ [\W] - [\s] - ['] > as "Not a word character, but also not a space character and not a ' either".
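To see where the two regexes differ, it may help to run both over titles that contain an apostrophe and a math symbol. The sketch below wraps each regex in a helper sub; the names `strip-so` and `strip-nonword` are hypothetical, and `.subst(..., :g)` is used here as an equivalent of `S:g`.

```raku
# Remove only characters whose General_Category is Other_Symbol.
sub strip-so(Str $title) {
    $title.subst(/ <:So> /, '', :g)
}

# Remove non-word characters, but keep whitespace and apostrophes.
sub strip-nonword(Str $title) {
    $title.subst(/ <+ [\W] - [\s] - ['] > /, '', :g)
}

for "Le D├⌐sert N'Est Plus En Afrique", '∩╜óEurotrash∩╜ú' -> $title {
    say strip-so($title);
    say strip-nonword($title);
}

# Le Dsert N'Est Plus En Afrique
# Le Dsert N'Est Plus En Afrique
# ∩óEurotrash∩ú
# óEurotrashú
```

Both keep the apostrophe and the spaces. They differ on ∩, which is a Math_Symbol (Sm) rather than an Other_Symbol, so only the `\W`-based class removes it.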

See Enumerated character classes and ranges in the Character Class section of Raku's doc for further details of using enumerated classes (using < [...] >) and adding/subtracting classes (using + and -), and/or an earlier SO answer I wrote in response to a question pretty similar to yours.

For some use cases this approach may simplify getting precisely the result you want and/or expressing it so it'll be easier for others, or a future you, to maintain. But it all depends on things I'll discuss in the next section.

Further discussion of your question: "and so on"

I am unsure of how to capture just the special characters as opposed to capturing apostrophes, spaces, and so on.

Let's say you use one or the other, or both, of the solutions I've shown. How do you know they really are what you want? The answer is you need to explore Unicode. Character classes like \w/\W, \s/\S etc are just convenient shortcuts for Unicode properties, so you still end up needing to explore Unicode if you want to be sure what's going on. Given that it all ultimately boils down to Unicode properties, let's discuss the <:So> solution.

As we saw above, the General_Category property of ├ (as shown via the utility linked above) is a link/value Other_Symbol.

If you click that latter link, you'll see a page corresponding to the Unicode Other_Symbol character class. It's a big jumbled mass of black-and-white symbols and colorful emojis and then an orderly list of other symbols. There are over 6,000 characters!

Does this character class contain characters that could be categorized as "spaces"? I'm near certain it doesn't, but it'll be up to you to figure that out, not me. What about "apostrophes"? I'm slightly less sure about that than "spaces", though I again think it won't include characters that could be called or categorized as "apostrophes". Does the Other_Symbol character class contain "and so on"?!? Maybe toss a coin? Maybe do an in page browser search for particular characters on the Other_Symbol page?

When I'm not using the tools hosted on util.unicode.org or similar, one approach I've occasionally used to explore Unicode character classes is variations on this Raku code:

```
say (^0x10FFFF)».chr.grep(/ <:So> /)
```

The (^0x10FFFF) is the integer Range 0 through 0x10FFFE, which covers all but the last of the 1,114,112 Unicode code points (the omitted U+10FFFF is a noncharacter, so it cannot affect the result). The ».chr applies .chr to each integer in the range, producing the Unicode character corresponding to that integer. The .grep(/ <:So> /) then keeps only the characters whose General_Category is Other_Symbol.

That said, that'll be really slow. You'll want to find other ways to explore Unicode.

Other options include the util.unicode.org tools and the Raku community's Unicodable which you can run by visiting the #raku IRC channel and entering u: ....
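One way to keep this kind of exploration fast is to scan a single block rather than every code point. As a sketch, assuming you already suspect the Box Drawing block (U+2500 to U+257F, which is where ├ and ╜ live), the same `.chr` and `.grep` approach can be narrowed like this; the variable name `@symbols` is just for illustration.

```raku
# Scan only the Box Drawing block instead of the whole code point range.
my @symbols = (0x2500..0x257F).map(*.chr).grep(/ <:So> /);

say @symbols.elems;          # 128
say @symbols.head(4).join;   # ─━│┃
```

All 128 code points in that block come back, which shows that every box-drawing character is an Other_Symbol and will be removed by `<:So>`.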

Discussion of "`$_ ~~ s:g / \├ / /;`"

when I attempt to directly escape the character in my regex via $_ ~~ s:g/ \├ / /;, the result stays the same

I've not been able to reproduce that.

I think you've just gotten confused.

If you are still convinced you are right, please produce an MRE.

Footnotes

¹ A big chunk of Raku's power is in its powered up regexes. And a big chunk of that is expressed via the general form <...>.

² Character class syntaxes include older style ones carried forward from older regex formats, eg \s to match whitespace. But they're all just shortcuts for Unicode properties or characters rather than ASCII ones, and there are now a LOT more variations.

³ If you only skim the doc you might think that Raku regexes can only match against Unicode's General_Category property. But if you look at the code examples you'll see there's <:Script<Latin>> and <:Block('Basic Latin')> too. (But what are they?) And then when you see the vast array of properties displayed by the util.unicode.org property browser you realize there's vastly more that could be matched. Rakudo matches many of these but not all. For gory details, see What are all the Unicode properties a character will match?.

⁴ Perhaps adding links to some of these utilities from the Raku doc would be a good thing. And/or creating/hosting variants of them using Raku.

⁵ I suspect Raku defines some additional things that are of the form <:foo> beyond Unicode properties. For example, I know :space works (it matches an ASCII space) but suspect it's not a Unicode property. Otoh that sounds downright wrong to me and against what I would expect of Raku design. If I find out for sure one way or the other I'll update this footnote.
