Split a string by a list of separators
I have a string `text` and a list `names`. I want to split `text` every time an element of `names` occurs.

    text = 'Monika goes shopping. Then she rides bike. Mike likes Pizza. Monika hates me.'
    names = ['Mike', 'Monika']

Desired output:

    output = [['Monika', ' goes shopping. Then she rides bike.'], ['Mike', ' likes Pizza.'], ['Monika', ' hates me.']]

FAQ
- `text` does not always start with a `names` element. Thanks to VictorLee for pointing that out. I don't care about that leading part, but others may, so thanks to the people answering "both cases".
- The order of the separators within `names` is independent of their occurrence in `text`.
- Separators within `names` are unique but can occur multiple times throughout `text`. Therefore the output will have more lists than `names` has strings.
- `text` will never have the same unique `names` element occurring twice consecutively.
- Ultimately I want the output to be a list of lists where each split `text` slice corresponds to the separator it was split by. The order of the lists doesn't matter.

`re.split()` won't let me use a list as a separator argument. Can I `re.compile()` my separator list?
Update: Thomas's code works best for my case, but I noticed one caveat I hadn't realized before: some of the elements of `names` are preceded by 'Mrs.' or 'Mr.', while only some of the corresponding matches in `text` are preceded by 'Mrs.' or 'Mr.'.
So far:
    import re
    from typing import List

    names = ['Mr. Mike, ADS', 'Monika, TFO', 'Peter, WQR']
    text1 = ['Mrs. Monika, TFO goes shopping. Then she rides bike. Mike, ADS likes Pizza. Monika, TFO hates me.']
    text = str(text1)[1:-1]

    def create_regex_string(name: List[str]) -> str:
        name_components = name.split()
        if len(name_components) == 1:
            return re.escape(name)
        salutation, *name = name_components
        return f"({re.escape(salutation)} )?{re.escape(' '.join(name))}"

    regex_string = "|".join(create_regex_string(name) for name in names)
    group_count = regex_string.count("(") + 1
    fragments = re.split(f"({regex_string})", text)
    if fragments:
        # ignoring text before first occurrence, not specified in requirements
        if not fragments[0] in names:
            fragments = fragments[1:]
        result = [[name, clist.rstrip()] for name, clist in zip(
            fragments[::group_count+1],
            fragments[group_count::group_count+1]
        ) if clist is not None
        ]
    print(result)

Output:

    [['Monika, TFO', ' goes shopping. Then she rides bike.'], ['Mike, ADS', ' likes Pizza.'], ['Monika, TFO', " hates me.'"]]
error:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
Input In [86], in <module>
111 salutation, *name = name_components
112 return f"({re.escape(salutation)} )?{re.escape(' '.join(name))}"
--> 114 regex_string = "|".join(create_regex_string(name) for name in mps)
115 group_count = regex_string.count("(") + 1
116 fragments = re.split(f"({regex_string})", clist)
Input In [86], in <genexpr>(.0)
111 salutation, *name = name_components
112 return f"({re.escape(salutation)} )?{re.escape(' '.join(name))}"
--> 114 regex_string = "|".join(create_regex_string(name) for name in mps)
115 group_count = regex_string.count("(") + 1
116 fragments = re.split(f"({regex_string})", clist)
Input In [86], in create_regex_string(name)
109 if len(name_components) == 1:
110 return re.escape(name)
--> 111 salutation, *name = name_components
112 return f"({re.escape(salutation)} )?{re.escape(' '.join(name))}"
ValueError: not enough values to unpack (expected at least 1, got 0)
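The `ValueError` is consistent with the list being iterated (`mps` in the traceback) containing an empty or whitespace-only string: `"".split()` returns `[]`, so `len(name_components)` is 0 rather than 1 and `salutation, *name = name_components` has nothing to unpack. Below is a minimal sketch of that reading. The helper names `drop_blank_names` and `unpack_error_for` and the sample list are assumptions for illustration, not taken from the original data.

```python
from typing import List


def drop_blank_names(names: List[str]) -> List[str]:
    """Keep only entries that still contain something after whitespace splitting."""
    return [name for name in names if name.split()]


def unpack_error_for(name: str) -> str:
    """Return the message raised by the star-unpacking step, or '' if it succeeds."""
    try:
        salutation, *rest = name.split()
    except ValueError as exc:
        return str(exc)
    return ""


# Hypothetical data with blank entries, standing in for the real list.
names = ["Mr. Mike, ADS", "", "Monika, TFO", "   "]

print(unpack_error_for(""))     # the unpacking error for an empty entry
print(drop_blank_names(names))  # ['Mr. Mike, ADS', 'Monika, TFO']
```

Filtering the list this way before calling `create_regex_string` would avoid the exception, assuming blank entries really are the cause.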
If you are looking for a way to use regular expressions, then:
Prints:
Explanation
First we dynamically create a regex, `regex1`, from the passed `names` argument to be:
When you split the input on this, any of the passed names may appear at the beginning or end of the input, so you could end up with empty strings in the result. We filter those out and get:
Then we split each of these strings on:
And again we filter out any possible empty strings to get our final result.
The key to all of this is that when the regex on which we split contains a capture group, the text of that capture group is also returned as part of the resulting list.
Update
You did not specify what should occur if the input text does not begin with one of the names. On the assumption that you might want to ignore all of the string until you find one of the names, check out the following version. Likewise, if the text does not contain any of the names, the updated code will just return an empty list:
Prints:
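A minimal sketch of the two-stage split described in the explanation, written from that description rather than from the answer's original code. The function name `split_by_names`, the lookahead used for the first stage, and the `maxsplit=1` call are assumptions. Splitting on a zero-width match requires Python 3.7 or later.

```python
import re


def split_by_names(text, names):
    """Return [[name, following_text], ...], ignoring text before the first name."""
    if not names:
        return []
    # Longest names first, so a name that is a prefix of another cannot shadow it.
    ordered = sorted(names, key=len, reverse=True)
    alternatives = "|".join(re.escape(name) for name in ordered)
    # Stage 1: zero-width lookahead, so every chunk after the split starts with a name.
    starts = re.compile(f"(?={alternatives})")
    # Stage 2: a capture group, so re.split also returns the matched name.
    separator = re.compile(f"({alternatives})")

    chunks = [chunk for chunk in starts.split(text) if chunk]
    result = []
    for chunk in chunks:
        parts = separator.split(chunk, maxsplit=1)
        # A chunk that starts with a name splits into ['', name, rest].
        if len(parts) == 3 and parts[0] == "":
            result.append([parts[1], parts[2].rstrip()])
    return result


text = 'Monika goes shopping. Then she rides bike. Mike likes Pizza. Monika hates me.'
print(split_by_names(text, ['Mike', 'Monika']))
```

Text before the first name produces a chunk that does not split into three parts, so it is skipped, and a text containing none of the names yields an empty list.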
As an alternative to regular expressions, you could also reformat `text` into a shape that gives the expected result with the `split` method, plus some string formatting. This code is also compatible with a `text` that does not start with an element of `names`.
Key point: reformat the `text` value to
|Monika| goes shopping. Then she rides bike. |Mike| likes Pizza. |Monika| hates me.
using a self-defined separator such as `|`, which should not occur in the original text.
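The reformat-then-split idea can be sketched roughly as follows. The function name `split_with_marker` and the explicit check that the marker is absent from the text are assumptions added for illustration. The sketch also assumes that no name is a substring of another name, since `str.replace` would otherwise wrap the shorter name inside the longer one.

```python
def split_with_marker(text, names, marker="|"):
    """Wrap every name in `marker`, then split on `marker` and pair names with text."""
    if marker in text:
        raise ValueError("marker must not occur in the original text")
    for name in names:
        text = text.replace(name, f"{marker}{name}{marker}")
    parts = text.split(marker)
    # parts[0] is whatever precedes the first name (possibly ''),
    # followed by name, text, name, text, ...
    return [[name, rest.rstrip()] for name, rest in zip(parts[1::2], parts[2::2])]


text = 'Monika goes shopping. Then she rides bike. Mike likes Pizza. Monika hates me.'
print(split_with_marker(text, ['Mike', 'Monika']))
```

Because `parts[0]` is never paired with a name, leading text is dropped here; keeping it would only require returning `parts[0]` alongside the list.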
I took one of your given solutions and slightly refactored it.
EDITED
Another solution that accounts for some of the edge cases mentioned here.
Kinda ugly though, looking to improve.
This is in a similar vein to some answers here, but simpler.
There are three steps:
We can combine (1) and (2) but it makes creating the list of lists more complicated.
You could use `re.split` along with `zip`:
Output:
Notes:
- This is an answer for question revision 9.
- You don't specify whether "text" before the first occurrence of a name should be considered or not.
- You also don't specify what happens if the text ends with a name.
- `zip` works because there is always an even number of elements in `fragments`. We remove the first element if it does not match a name (it is either text or an empty string), and the last element is always an empty string if the text ends with a name. According to `re.split`:
Here is the same example but not ignoring "text" before the first occurrence:
Notes:
Update for Question Revision 11
The following is an attempt to include id345678's additional requirement:
Output:
Notes:
- The final regex string is then `(Henry|Mike|(Mrs\. )?Monika)`.
- `create_regex_string("Mrs. Monika")` creates `(Mrs\. )?Monika`.
- Because we introduced an additional grouping in the regex, `fragments` has more values, so I changed the `zip` line so that it is computed dynamically.
- If you don't want the salutation in the `result`, you can use `name.split()[-1]` when creating `result`.
Please note: I have not tested all use cases, as I updated the script during my break. Let me know if there are issues and I will look into it when I am off work.
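A hedged variant of the same `re.split` plus `zip` idea, not the code from this answer: it makes the salutation a non-capturing group, so `re.split` returns exactly one extra item per match and the `group_count` bookkeeping used in the code quoted in the question is not needed. It assumes that "Mr." and "Mrs." are the only salutations and that any of them may precede any name in the text. `SALUTATIONS`, `strip_salutation`, and `split_keep_names` are illustrative names.

```python
import re

# Assumption: these are the only titles that can precede a name.
SALUTATIONS = ("Mrs.", "Mr.")


def strip_salutation(name):
    """Remove a leading 'Mr. ' or 'Mrs. ' from a name, if present."""
    for salutation in SALUTATIONS:
        if name.startswith(salutation + " "):
            return name[len(salutation) + 1:]
    return name


def split_keep_names(text, names):
    """Split text on the bare names, tolerating an optional salutation before each."""
    bare = sorted({strip_salutation(n) for n in names if n.strip()},
                  key=len, reverse=True)
    if not bare:
        return []
    titles = "|".join(re.escape(s) for s in SALUTATIONS)
    alternatives = "|".join(re.escape(n) for n in bare)
    # (?:...) groups without capturing; only the bare name is captured.
    pattern = re.compile(f"(?:(?:{titles}) )?({alternatives})")
    fragments = pattern.split(text)
    # fragments == [leading text, name, rest, name, rest, ...]
    return [[name, rest.rstrip()]
            for name, rest in zip(fragments[1::2], fragments[2::2])]


names = ['Mr. Mike, ADS', 'Monika, TFO', 'Peter, WQR']
text = ('Mrs. Monika, TFO goes shopping. Then she rides bike. '
        'Mike, ADS likes Pizza. Monika, TFO hates me.')
print(split_keep_names(text, names))
```

The salutation is consumed by the match but left out of the result, which corresponds to the last note above. Capturing it as well would bring back the need to count groups.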
Your example doesn't fully match your desired output. Also, it's not clear whether the example input will always have this structure, e.g. with the period at the end of each sentence.
Having said that, you might want to try this dirty approach:
This outputs:
This is without `re`, unless you explicitly need to use it. It works for the test case given.
Gives the output:
Should work for other cases too.
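One way a regex-free version could look, offered only as a sketch and not as this answer's code: locate every occurrence of every name with `str.find`, sort the hits by position, and slice the text between consecutive hits. The function name `split_without_re` is an assumption, and overlapping names are not handled.

```python
def split_without_re(text, names):
    """Pair each name occurrence with the text that follows it, using only str methods."""
    hits = []
    for name in names:
        if not name:
            continue  # an empty separator would match everywhere
        start = text.find(name)
        while start != -1:
            hits.append((start, name))
            start = text.find(name, start + len(name))
    hits.sort()  # order of appearance in the text

    result = []
    for index, (start, name) in enumerate(hits):
        # The slice ends where the next name begins, or at the end of the text.
        end = hits[index + 1][0] if index + 1 < len(hits) else len(text)
        result.append([name, text[start + len(name):end].rstrip()])
    return result


text = 'Monika goes shopping. Then she rides bike. Mike likes Pizza. Monika hates me.'
print(split_without_re(text, ['Mike', 'Monika']))
```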