How to remove a specific set of characters, but not others?
As a consequence of botched character decoding, I have a set of titles that look like this, with special characters followed by other characters like `ñ` that were not in the original:
Päivän Jälkeen
Tuuli kääntyä voi
Päivän Jälkeen
Tuuli kääntyä voi
「Eurotrash」
Le Désert N'Est Plus En Afrique
I know that I can use something along the lines of `s:g / ... / ... /`, but I am unsure of how to match just the special characters (`├`, `╜`) as opposed to capturing apostrophes, spaces, and so on.
When using `\W` to try and capture "not a character used in a 'word'", I run into the issue of it matching apostrophes, spaces, and so on.
Thus, my question is, what am I missing that would be useful when trying to essentially delete these characters from the words?
Comments (1)
TL;DR Two solutions plus some further discussion.
Simplest solution that works for your examples
For example:
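A minimal sketch of the simplest solution, assuming the garbled titles contain box-drawing characters such as `├` in front of each stray letter. The variable names and the sample string are illustrative, not taken from the question's real data:

```raku
# Illustrative input: a title garbled by botched decoding.
my $garbled = 'P├ñiv├ñn J├ñlkeen';

# <:So> matches any character whose Unicode General_Category is
# Other_Symbol, which includes box-drawing characters like ├.
# .subst with :g returns a modified copy and leaves $garbled untouched.
my $cleaned = $garbled.subst(/ <:So> /, '', :g);

say $cleaned;    # Pñivñn Jñlkeen
```

Note that this only strips the symbol; the stray `ñ` is an ordinary letter and stays in the string.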
An explanation of the above solution
First, an explanation that looks short but will take you deep fast:
Learn about Unicode character properties. Here's the official Unicode Character Database doc.
Study Raku materials (e.g. the latest Raku documentation on Unicode properties) and discussions (e.g. searches of SO: "unicode property", "unicode properties") about use of Unicode properties in Raku.
Learn how the reference Raku compiler (Rakudo) actually uses them. (It gets complicated; see the SO question "What are all the Unicode properties a Raku character will match".)
Now a slightly longer explanation that may make understanding this stuff easier than the approach outlined above:
In Raku regexes, `<...>` is a general assertion that `...` is true. There are many variations on what `...` can be.¹
One variation is that the current character is matched against a character class, or some combination of classes, specified by the `<...>`. There are many variations on what this Raku character class expression `...` can be.²
One variation of a character class expression is an explicit Unicode character property value; it will match characters with that Unicode character property value. There are many variations on what a Unicode character property and its values can be; `<:So>` is an example and works for your scenario.³
Here's the process I went through to select that character class for your scenario:
I used a `util.unicode.org` hosted utility to see the properties of the `├` character.⁴
I focused on the `General_Category` property, listed near the top of the left-hand column of the table. This is a common property to specify in a Raku regex. You generally won't specify it by writing `<:General_Category...>`, with the `...` being the category, but instead by just writing `<:category>`, where `category` is the category.
I noted the `General_Category` value (the next column): `Other_Symbol`. If you read the Unicode or Raku doc related to the `General_Category` property, you'll see that `So` is a short alias of `Other_Symbol`.
To specify a Unicode property in Raku, write it using Raku's colon pair literal syntax. The key can be any of many Unicode properties.⁵ So to match a character that has the `So` property, write `<:So>` in a regex.
To remove a character that has that class from a string, one option is to use the `s///` construct, or `S///`, the variant that returns the result instead of modifying the string in place. For instance, `say S:g /<:So>// given 'P├ñiv├ñn J├ñlkeen'` displays `Pñivñn Jñlkeen`.
Another way to get the same result
You might want to do "arithmetic" with character classes and characters, adding some and subtracting others, as in `<+ [\W] - [\s] - ['] >`.
Read `<+ [\W] - [\s] - ['] >` as "not a word character, but also not a space character and not a `'` either".
See "Enumerated character classes and ranges" in the Character Class section of Raku's doc for further details of using enumerated classes (using `<[...]>`) and adding/subtracting classes (using `+` and `-`), and/or an earlier SO answer I wrote in response to a question pretty similar to yours.
For some use cases this approach may simplify getting precisely the result you want and/or expressing it so it'll be easier for others, or a future you, to maintain. But it all depends on things I'll discuss in the next section.
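The character class arithmetic described above can be sketched as follows. The sample string is a made-up composite chosen so that it contains a space and an apostrophe that should survive; it is not one of the question's real titles:

```raku
# Illustrative input: garbled letters plus a space and an apostrophe.
my $garbled = "P├ñiv├ñn J├ñlkeen N'Est Plus";

# Start from "not a word character" (\W), then subtract whitespace
# and the ASCII apostrophe so that those are left alone.
my $cleaned = $garbled.subst(/ <+ [\W] - [\s] - ['] > /, '', :g);

say $cleaned;    # Pñivñn Jñlkeen N'Est Plus
```

Compared with `<:So>`, this removes every non-word character except the ones explicitly subtracted, so other punctuation such as hyphens or typographic apostrophes would be removed too unless you subtract them as well.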
Further discussion of your question: "and so on"
Let's say you use one or other, or both, of the solutions I've shown. How do you know they really are what you want? The answer is you need to explore Unicode. Character classes like `\w`/`\W`, `\s`/`\S` etc. are just convenient shortcuts for Unicode properties, so you still end up needing to explore Unicode if you want to be sure what's going on. Given that it all ultimately boils down to Unicode properties, let's discuss the `<:So>` solution.
As we saw above, the `General_Category` property of `├` (as shown via the utility linked above) is displayed as a link with the value `Other_Symbol`.
If you click that latter link, you'll see a page corresponding to the Unicode `Other_Symbol` character class. It's a big jumbled mass of black-and-white symbols and colorful emojis and then an orderly list of other symbols. There are over 6,000 characters!
Does this character class contain characters that could be categorized as "spaces"? I'm near certain it doesn't, but it'll be up to you to figure that out, not me. What about "apostrophes"? I'm slightly less sure about that than "spaces", though I again think it won't include characters that could be called or categorized as "apostrophes". Does the `Other_Symbol` character class contain "and so on"?!? Maybe toss a coin? Maybe do an in-page browser search for particular characters on the `Other_Symbol` page?
When I'm not using the tools hosted on `util.unicode.org` or similar, one approach I've occasionally used to explore Unicode character classes is variations on Raku code of the form `(^0x10FFFF)».chr.grep(/ <:So> /)`.
The `(^0x10FFFF)` is a way to specify the integer `Range` that corresponds to the 1,114,112 legal Unicode code points. (Strictly, `^0x10FFFF` stops at U+10FFFE; the one code point it omits, U+10FFFF, is a noncharacter, so nothing is lost.) The `».chr` iterates the range, producing a list of integers from `0` onward, applying `.chr` to each, which produces the Unicode character corresponding to a given integer. The `.grep(/ <:So> /)` then keeps only the characters whose `General_Category` is `Other_Symbol`.
That said, that'll be really slow. You'll want to find other ways to explore Unicode.
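A quicker variation is to ask Raku about individual characters, or to scan only a small range. The sketch below assumes you mainly care about the Box Drawing block (U+2500 to U+257F); the helper name `other-symbols` is hypothetical. It relies on the fact that every code point has exactly one `General_Category`, so a character classified as a space or as punctuation cannot also be `So`:

```raku
# Report the name and General_Category of a few characters of interest.
# With no argument, .uniprop returns the short General_Category alias.
for ' ', "'", '’', '├', '╜' -> $char {
    say $char.uniname, ' => ', $char.uniprop;
}

# Hypothetical helper: the characters in a code point range that
# match <:So>. A small range keeps this fast.
sub other-symbols(Range $codepoints) {
    $codepoints.map(*.chr).grep(/ <:So> /).List
}

my @box-drawing = other-symbols(0x2500 .. 0x257F);
say @box-drawing.elems;
say @box-drawing.head(8).join;
```

The loop reports `Zs` for the space, `Po` and `Pf` for the two apostrophes, and `So` for both box-drawing characters, which is the kind of evidence you need before trusting `<:So>` on your own data.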
Other options include the `util.unicode.org` tools and the Raku community's `Unicodable`, which you can run by visiting the `#raku` IRC channel and entering `u: ...`.
Discussion of "`$_ ~~ s:g / \├ / /;`"
I've not been able to reproduce that.
I think you've just gotten confused.
If you are still convinced you are right, please produce an MRE.
Footnotes
¹ A big chunk of Raku's power is in its powered up regexes. And a big chunk of that is expressed via the general form `<...>`.
² Character class syntaxes include older style ones carried forward from older regex formats, e.g. `\s` to match whitespace. But they're all just shortcuts for Unicode properties or characters rather than ASCII ones, and there are now a LOT more variations.
³ If you only skim the doc you might think that Raku regexes can only match against Unicode's `General_Category` property. But if you look at the code examples you'll see there's `<:Script<Latin>>` and `<:Block('Basic Latin')>` too. (But what are they?) And then when you see the vast array of properties displayed by the `util.unicode.org` property browser you realize there's vastly more that could be matched. Rakudo matches many of these but not all. For gory details, see "What are all the Unicode properties a character will match?".
⁴ Perhaps adding links to some of these utilities from the Raku doc would be a good thing. And/or creating/hosting variants of them using Raku.
⁵ I suspect Raku defines some additional things that are of the form `<:foo>` beyond Unicode properties. For example, I know `:space` works (it matches an ASCII space) but suspect it's not a Unicode property. Otoh that sounds downright wrong to me and against what I would expect of Raku design. If I find out for sure one way or the other I'll update this footnote.