Splitting on first occurrence

Posted on 2024-11-27 21:11:09

What would be the best way to split a string on the first occurrence of a delimiter?

For example:

"123mango abcd mango kiwi peach"

splitting on the first mango to get:

" abcd mango kiwi peach"

To split on the last occurrence instead, see Partition string in Python and get value of last segment after colon.

Comments (5)

冷情 2024-12-04 21:11:09

From the docs:

str.split([sep[, maxsplit]])

Return a list of the words in the string, using sep as the delimiter string. If maxsplit is given, at most maxsplit splits are done (thus, the list will have at most maxsplit+1 elements).

s.split('mango', 1)[1]
深海夜未眠 2024-12-04 21:11:09
>>> s = "123mango abcd mango kiwi peach"
>>> s.split("mango", 1)
['123', ' abcd mango kiwi peach']
>>> s.split("mango", 1)[1]
' abcd mango kiwi peach'
紫瑟鸿黎 2024-12-04 21:11:09

For me, the better approach is:

s.split('mango', 1)[-1]

...because if the delimiter does not occur in the string, indexing with [1] raises "IndexError: list index out of range".

Using -1 does no harm, because the number of splits is already limited to one.
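
A minimal sketch of the difference between the two indexes, assuming a delimiter ('grape', chosen here only for illustration) that does not occur in the string:

```python
s = "123mango abcd mango kiwi peach"

# Delimiter present: both indexes select the part after the first match.
found_1 = s.split("mango", 1)[1]
found_last = s.split("mango", 1)[-1]

# Delimiter absent: split returns a one-element list, so [1] fails...
try:
    s.split("grape", 1)[1]
    missing_1 = "no error"
except IndexError as exc:
    missing_1 = f"IndexError: {exc}"

# ...while [-1] returns the whole, unsplit string.
missing_last = s.split("grape", 1)[-1]

print(found_1 == found_last)  # True
print(missing_1)              # IndexError: list index out of range
print(missing_last == s)      # True
```

Whether getting the whole string back is the right behaviour for a missing delimiter depends on the caller, so [-1] avoids the exception but does not signal that nothing was found.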

眼眸 2024-12-04 21:11:09

You can also use str.partition:

>>> text = "123mango abcd mango kiwi peach"

>>> text.partition("mango")
('123', 'mango', ' abcd mango kiwi peach')

>>> text.partition("mango")[-1]
' abcd mango kiwi peach'

>>> text.partition("mango")[-1].lstrip()  # if whitespace stripping is needed
'abcd mango kiwi peach'

The advantage of using str.partition is that it always returns a tuple of the form:

(<pre>, <separator>, <post>)

This makes unpacking the output straightforward, since the resulting tuple always has exactly 3 elements.
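
As a small sketch of that unpacking (the variable names and the absent delimiter 'grape' are only illustrative):

```python
text = "123mango abcd mango kiwi peach"

# partition always yields exactly three items, so this unpacking never fails.
before, sep, after = text.partition("mango")
print(repr(before), repr(sep), repr(after))

# The same unpacking works when the separator is absent:
# the separator and the trailing part are then empty strings.
before2, sep2, after2 = text.partition("grape")
print(repr(before2), repr(sep2), repr(after2))
```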

请帮我爱他 2024-12-04 21:11:09

Summary

The simplest and best-performing approach is to use the .partition method of the string.

Commonly, people may want to get the part either before or after the delimiter that was found, and may want to find either the first or last occurrence of the delimiter in the string. For most techniques, all of these possibilities are roughly equally simple, and it is straightforward to convert from one to another.

For the below examples, we will assume:

>>> import re
>>> s = '123mango abcd mango kiwi peach'

Using .split

>>> s.split('mango', 1)
['123', ' abcd mango kiwi peach']

The second parameter to .split limits the number of times the string will be split. This gives the parts both before and after the delimiter; then we can select what we want.

If the delimiter does not appear, no splitting is done:

>>> s.split('grape', 1)
['123mango abcd mango kiwi peach']

Thus, to check whether the delimiter was present, check the length of the result before working with it.
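
One way that length check might look, as a sketch; the helper name after_first_split and the choice of returning None for a missing delimiter are assumptions, not part of the original answer:

```python
def after_first_split(s, sep):
    """Return the text after the first sep, or None if sep is absent."""
    parts = s.split(sep, 1)
    # With maxsplit=1 the list has two items when sep was found, one otherwise.
    if len(parts) == 2:
        return parts[1]
    return None


s = "123mango abcd mango kiwi peach"
print(repr(after_first_split(s, "mango")))  # ' abcd mango kiwi peach'
print(repr(after_first_split(s, "grape")))  # None
```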

Using .partition

>>> s.partition('mango')
('123', 'mango', ' abcd mango kiwi peach')

The result is a tuple instead, and the delimiter itself is preserved when found.

When the delimiter is not found, the result will be a tuple of the same length, with two empty strings in the result:

>>> s.partition('grape')
('123mango abcd mango kiwi peach', '', '')

Thus, to check whether the delimiter was present, check the value of the second element.
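
A sketch of that check; the helper name after_first_partition and its default parameter are hypothetical. It tests the second element rather than the third, because the third element is also empty when the delimiter sits at the very end of the string:

```python
def after_first_partition(s, sep, default=None):
    """Return the text after the first sep, or default if sep is absent."""
    _, found, rest = s.partition(sep)
    # found is the separator itself when present, '' when absent.
    return rest if found else default


s = "123mango abcd mango kiwi peach"
print(repr(after_first_partition(s, "mango")))           # ' abcd mango kiwi peach'
print(repr(after_first_partition(s, "grape")))           # None
print(repr(after_first_partition("123mango", "mango")))  # ''
```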

Using regular expressions

>>> # Using the top-level module functionality
>>> re.split(re.escape('mango'), s, maxsplit=1)  # positional maxsplit is deprecated since Python 3.13
['123', ' abcd mango kiwi peach']
>>> # Using an explicitly compiled pattern
>>> mango = re.compile(re.escape('mango'))
>>> mango.split(s, 1)
['123', ' abcd mango kiwi peach']

The .split method of regular expressions has the same argument as the built-in string .split method, to limit the number of splits. Again, no splitting is done when the delimiter does not appear:

>>> grape = re.compile(re.escape('grape'))
>>> grape.split(s, 1)
['123mango abcd mango kiwi peach']

In these examples, re.escape has no effect, but in the general case it's necessary in order to specify a delimiter as literal text. On the other hand, using the re module opens up the full power of regular expressions:

>>> vowels = re.compile('[aeiou]')
>>> # Split on any vowel, without a limit on the number of splits:
>>> vowels.split(s)
['123m', 'ng', ' ', 'bcd m', 'ng', ' k', 'w', ' p', '', 'ch']

(Note the empty string: that was found between the e and the a of peach.)
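
To illustrate why re.escape matters in the general case, here is a sketch with a delimiter that is a regex metacharacter; the version string is made up for the example:

```python
import re

version = "3.12.1"

# Unescaped, "." is a metacharacter that matches any character,
# so the split happens at the very first character of the string.
unescaped = re.split(".", version, maxsplit=1)

# Escaped, "." is treated as literal text.
escaped = re.split(re.escape("."), version, maxsplit=1)

print(unescaped)  # ['', '.12.1']
print(escaped)    # ['3', '12.1']
```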

Using indexing and slicing

Use the .index method of the string to find out where the delimiter is, then slice with that:

>>> s[:s.index('mango')] # for everything before the delimiter
'123'
>>> s[s.index('mango')+len('mango'):] # for everything after the delimiter
' abcd mango kiwi peach'

This directly gives the desired part, without building a list or tuple first. However, if the delimiter is not found, an exception will be raised instead:

>>> s[:s.index('grape')]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: substring not found
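
If an exception is not wanted, one option is str.find, which returns -1 instead of raising. This is a sketch of a swapped-in technique rather than something from the answer above, and the helper name after_first_index is hypothetical:

```python
def after_first_index(s, sep):
    """Return the text after the first sep, or None if sep is absent."""
    start = s.find(sep)  # str.find returns -1 when sep is not found
    if start == -1:
        return None
    return s[start + len(sep):]


s = "123mango abcd mango kiwi peach"
print(repr(after_first_index(s, "mango")))  # ' abcd mango kiwi peach'
print(repr(after_first_index(s, "grape")))  # None
```

The explicit -1 check matters: slicing with an unchecked -1 would silently use the last character's position instead of failing.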

Everything after the last occurrence, instead

Though it wasn't asked, I include related techniques here for reference.

The .split and .partition techniques have direct counterparts, to get the last part of the string (i.e., everything after the last occurrence of the delimiter). For reference:

>>> '123mango abcd mango kiwi peach'.rsplit('mango', 1)
['123mango abcd ', ' kiwi peach']
>>> '123mango abcd mango kiwi peach'.rpartition('mango')
('123mango abcd ', 'mango', ' kiwi peach')

Similarly, there is a .rindex to match .index, but it still gives the index of the beginning of the last match, not its end. Thus:

>>> s[:s.rindex('mango')] # everything before the last match
'123mango abcd '
>>> s[s.rindex('mango')+len('mango'):] # everything after the last match
' kiwi peach'

For the regular expression approach, we can fall back on the technique of reversing the input, looking for the first appearance of the reversed delimiter, reversing the individual results, and reversing the result list:

>>> ognam = re.compile(re.escape('mango'[::-1]))
>>> [x[::-1] for x in ognam.split('123mango abcd mango kiwi peach'[::-1], 1)][::-1]
['123mango abcd ', ' kiwi peach']

Of course, this is almost certainly more effort than it's worth.

Another way is to use negative lookahead from the delimiter to the end of the string:

>>> literal_mango = re.escape('mango')
>>> last_mango = re.compile(f'{literal_mango}(?!.*{literal_mango})')
>>> last_mango.split('123mango abcd mango kiwi peach', 1)
['123mango abcd ', ' kiwi peach']

Because of the lookahead, this is a worst-case O(n^2) algorithm.
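
One caveat with the lookahead pattern, shown as a sketch with a made-up multi-line input: by default "." does not match a newline, so the lookahead cannot see a later delimiter on another line unless re.DOTALL is set.

```python
import re

literal = re.escape("mango")
multiline = "1 mango\n2 mango 3"

# Without DOTALL, ".*" stops at the newline, so the lookahead misses the
# second "mango" and the split happens at the first one.
default_dot = re.compile(f"{literal}(?!.*{literal})")

# With DOTALL, "." also matches newlines and the last occurrence is found.
dotall = re.compile(f"{literal}(?!.*{literal})", re.DOTALL)

print(default_dot.split(multiline, 1))  # ['1 ', '\n2 mango 3']
print(dotall.split(multiline, 1))       # ['1 mango\n2 ', ' 3']
```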

Performance testing

$ python -m timeit --setup="s='123mango abcd mango kiwi peach'" "s.partition('mango')[-1]"
2000000 loops, best of 5: 128 nsec per loop
$ python -m timeit --setup="s='123mango abcd mango kiwi peach'" "s.split('mango', 1)[-1]"
2000000 loops, best of 5: 157 nsec per loop
$ python -m timeit --setup="s='123mango abcd mango kiwi peach'" "s[s.index('mango')+len('mango'):]"
1000000 loops, best of 5: 250 nsec per loop
$ python -m timeit --setup="s='123mango abcd mango kiwi peach'; import re; mango=re.compile(re.escape('mango'))" "mango.split(s, 1)[-1]"
1000000 loops, best of 5: 258 nsec per loop

Though more flexible, the regular expression approach is definitely slower. Limiting the number of splits improves performance with both the string method and regular expressions (timings without the limit are not shown, because they are slower and also give a different result), but .partition is still a clear winner.

For this test data, the .index approach was slower even though it only has to create one substring and doesn't have to iterate over text beyond the match (for the purpose of creating the other substrings). Pre-computing the length of the delimiter helps, but this is still slower than the .split and .partition approaches.
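
The same comparison can be sketched from within Python using the timeit module. The run counts below are arbitrary choices, and the numbers will differ by machine and Python version, so no output is shown:

```python
import timeit

setup = "s = '123mango abcd mango kiwi peach'"
statements = {
    "partition": "s.partition('mango')[-1]",
    "split": "s.split('mango', 1)[-1]",
    "index": "s[s.index('mango') + len('mango'):]",
}

results = {}
for name, stmt in statements.items():
    # Best of 5 repeats of 100,000 runs each, converted to nanoseconds per call.
    best = min(timeit.repeat(stmt, setup=setup, number=100_000, repeat=5))
    results[name] = best / 100_000 * 1e9

# Print the approaches from fastest to slowest on this machine.
for name, nsec in sorted(results.items(), key=lambda item: item[1]):
    print(f"{name:10s} {nsec:8.1f} nsec per call")
```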
