Split a string by a list of separators

Posted on 2025-01-21 09:17:35

I have a string text and a list names

  • I want to split text every time an element of names occurs.

text = 'Monika goes shopping. Then she rides bike. Mike likes Pizza. Monika hates me.'

names = ['Mike', 'Monika']

desired output:

output = [['Monika', ' goes shopping. Then she rides bike.'], ['Mike', ' likes Pizza.'], ['Monika', ' hates me.']]

FAQ

  • text does not always start with a names element. Thanks to VictorLee for pointing that out. I don't care about that leading part, but others may, so thanks to the people whose answers cover both cases.
  • The order of the separators within names is independent of the order in which they occur in text.
  • Separators within names are unique but can occur multiple times throughout text. Therefore the output will have more lists than names has strings.
  • text will never have the same names element occurring twice consecutively.
  • Ultimately I want the output to be a list of lists where each split text slice corresponds to the separator it was split by. The order of the lists doesn't matter.

re.split() won't let me use a list as a separator argument. Can I re.compile() my separator list?


Update: Thomas's code works best for my case, but I noticed one caveat I hadn't realized before:

Some of the elements of names are preceded by 'Mrs.' or 'Mr.', while only some of the corresponding matches in text are preceded by 'Mrs.' or 'Mr.'


so far:

names = ['Mr. Mike, ADS', 'Monika, TFO', 'Peter, WQR']
text1 = ['Mrs. Monika, TFO goes shopping. Then she rides bike. Mike, ADS likes Pizza. Monika, TFO hates me.']
text = str(text1)[1:-1]

def create_regex_string(name: List[str]) -> str:
    name_components = name.split()
    if len(name_components) == 1:
        return re.escape(name)
    salutation, *name = name_components
    return f"({re.escape(salutation)} )?{re.escape(' '.join(name))}"
    
regex_string = "|".join(create_regex_string(name) for name in names)
group_count = regex_string.count("(") + 1
fragments = re.split(f"({regex_string})", text)
if fragments:
    # ignoring text before first occurrence, not specified in requirements
    if not fragments[0] in names: 
        fragments = fragments[1:]
        result = [[name, clist.rstrip()] for name, clist in zip(
            fragments[::group_count+1],
            fragments[group_count::group_count+1]
        ) if clist is not None
    ]

print(result)
[['Monika, TFO', ' goes shopping. Then she rides bike.'], ['Mike, ADS', ' likes Pizza.'], ['Monika, TFO', " hates me.'"]]

error:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Input In [86], in <module>
    111     salutation, *name = name_components
    112     return f"({re.escape(salutation)} )?{re.escape(' '.join(name))}"
--> 114 regex_string = "|".join(create_regex_string(name) for name in mps)
    115 group_count = regex_string.count("(") + 1
    116 fragments = re.split(f"({regex_string})", clist)

Input In [86], in <genexpr>(.0)
    111     salutation, *name = name_components
    112     return f"({re.escape(salutation)} )?{re.escape(' '.join(name))}"
--> 114 regex_string = "|".join(create_regex_string(name) for name in mps)
    115 group_count = regex_string.count("(") + 1
    116 fragments = re.split(f"({regex_string})", clist)

Input In [86], in create_regex_string(name)
    109 if len(name_components) == 1:
    110     return re.escape(name)
--> 111 salutation, *name = name_components
    112 return f"({re.escape(salutation)} )?{re.escape(' '.join(name))}"

ValueError: not enough values to unpack (expected at least 1, got 0)
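
The message in the traceback is what Python reports when star-unpacking is given an empty list. The following is a minimal sketch of that situation, assuming the list being iterated (called mps in the traceback) contains an empty or whitespace-only string; the helper name split_salutation is hypothetical and only mirrors the unpacking step of create_regex_string.

```python
def split_salutation(name: str):
    """Hypothetical helper mirroring the unpacking step in create_regex_string."""
    name_components = name.split()
    # Star-unpacking needs at least one element for `salutation`.
    salutation, *rest = name_components
    return salutation, rest


print(split_salutation("Mr. Mike, ADS"))  # ('Mr.', ['Mike,', 'ADS'])

try:
    split_salutation("")  # "".split() returns an empty list
except ValueError as exc:
    print(exc)  # not enough values to unpack (expected at least 1, got 0)
```

If that assumption holds, filtering out blank entries before building the regex would avoid the error.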


Comments (7)

桃扇骨 2025-01-28 09:17:35

If you are looking for a way to use regular expressions, then:

import re

def do_split(text, names):
    joined_names = '|'.join(re.escape(name) for name in names)

    regex1 = re.compile('(?=' + joined_names + ')')
    strings = filter(lambda s: s != '', regex1.split(text))

    regex2 = re.compile('(' + joined_names + ')')
    return [list(filter(lambda s: s != '', regex2.split(s.rstrip()))) for s in strings]

text = 'Monika goes shopping. Then she rides bike. Mike likes Pizza. Monika hates me.'
names = ['Mike', 'Monika']
print(do_split(text, names))

Prints:

[['Monika', ' goes shopping. Then she rides bike.'], ['Mike', ' likes Pizza.'], ['Monika', ' hates me.']]

Explanation

First we dynamically build the regex regex1 from the passed names argument:

(?=Mike|Monika)

Because any of the passed names may appear at the beginning or end of the input, splitting on this pattern can leave empty strings in the result, so we filter those out and get:

['Monika goes shopping. Then she rides bike. ', 'Mike likes Pizza. ', 'Monika hates me.']

Then we split each of these strings on:

(Mike|Monika)

And again we filter out any possible empty strings to get our final result.

The key to all of this is that when our regex on which we split contains a capture group, the text of that capture group is also returned as part of the resulting list.
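
To make the capture-group behaviour concrete, here is a small sketch comparing re.split with and without a group. The shortened sample text is an assumption chosen for illustration.

```python
import re

text = "Monika goes shopping. Mike likes Pizza."

# Without a capture group the separators are dropped from the result.
without_group = re.split(r"Mike|Monika", text)

# With a capture group the matched separators are kept, interleaved with the text.
with_group = re.split(r"(Mike|Monika)", text)

print(without_group)  # ['', ' goes shopping. ', ' likes Pizza.']
print(with_group)     # ['', 'Monika', ' goes shopping. ', 'Mike', ' likes Pizza.']
```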

Update

You did not specify what should happen if the input text does not begin with one of the names. On the assumption that you want to ignore everything in the string up to the first name, check out the following version. Likewise, if the text does not contain any of the names, the updated code just returns an empty list:

import re

def do_split(text, names):
    joined_names = '|'.join(re.escape(name) for name in names)

    # raw string so that \s and \S are not treated as string escapes
    regex0 = re.compile('(' + joined_names + r')[\s\S]*')
    m = regex0.search(text)
    if not m:
        return []
    text = m.group(0)

    regex1 = re.compile('(?=' + joined_names + ')')
    strings = filter(lambda s: s != '', regex1.split(text))

    regex2 = re.compile('(' + joined_names + ')')
    return [list(filter(lambda s: s != '', regex2.split(s.rstrip()))) for s in strings]

text = 'I think Monika goes shopping. Then she rides bike. Mike likes Pizza. Monika hates me.'
names = ['Mike', 'Monika']
print(do_split(text, names))

Prints:

[['Monika', ' goes shopping. Then she rides bike.'], ['Mike', ' likes Pizza.'], ['Monika', ' hates me.']]
弥繁 2025-01-28 09:17:35

As an alternative to regular expressions, you can restructure the text into a format from which the split method yields the expected result, followed by some string post-processing.

# works on Python 2 and Python 3; the nested loop costs O(words * names)
def do_split(text, names):
    my_sprt = '|'
    tmp_text_arr = text.split()
    for i in range(len(tmp_text_arr)):
        for sprt in names:
            if sprt == tmp_text_arr[i]:
                tmp_text_arr[i] = my_sprt + sprt + my_sprt

    tmp_text = ' '.join(tmp_text_arr)
    if tmp_text.startswith(my_sprt):
        tmp_text = tmp_text[1:]

    tmp_text_arr = tmp_text.split(my_sprt)
    if tmp_text_arr[0] not in names:
        tmp_text_arr.pop(0)

    out_arr = []
    for i in range(0, len(tmp_text_arr) - 1, 2):
        out_arr.append([tmp_text_arr[i], tmp_text_arr[i + 1].rstrip()])
    return out_arr

text = 'Monika goes shopping. Then she rides bike. Mike likes Pizza. Monika hates me.'
text = 'today Monika goes shopping. Then she rides bike. Mike likes Pizza. Monika hates me.'
names = ['Mike', 'Monika']
print(do_split(text, names))

This code also handles text that does not start with an element of names.

Key point: reformat the text value to |Monika| goes shopping. Then she rides bike. |Mike| likes Pizza. |Monika| hates me. using a self-defined separator such as |, which must not occur in the original text.
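
Since the approach relies on the separator not occurring in the text, one option is to choose it at run time. This is only a sketch: the helper pick_separator and its candidate list are hypothetical, and the sample text is made up so that | is already taken.

```python
def pick_separator(text, candidates=("|", "\x01", "\x00")):
    """Return the first candidate that does not occur in text (hypothetical helper)."""
    for candidate in candidates:
        if candidate not in text:
            return candidate
    raise ValueError("no unused separator available")


text = "Monika goes shopping. Mike likes a|b pizza."
names = ("Mike", "Monika")

sep = pick_separator(text)  # '|' occurs in text, so '\x01' is chosen

# Wrap every word that is a name in the chosen separator, then split on it.
marked = " ".join(
    sep + word + sep if word in names else word
    for word in text.split()
)
print(marked.split(sep))
# ['', 'Monika', ' goes shopping. ', 'Mike', ' likes a|b pizza.']
```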

街角迷惘 2025-01-28 09:17:35

I took one of your given solutions and slightly refactored it.

def split(txt, seps, actual_sep='\1'):
    order = [item for item in txt.split() if item in seps ]
    for sep in seps:
        txt = txt.replace(sep, actual_sep)
    return list( zip( order, [i.strip() for i in txt.split(actual_sep) if bool(i.strip())] ) )

text = 'Monika goes shopping. Then she rides bike. Mike likes Pizza. Monika hates me.'
names = ['Mike', 'Monika']

print( split(text, names) )

EDITED

Another solution that accounts for an edge case mentioned here.

def split(txt, seps, sep_pack='\1'):
    for sep in seps:
        txt = txt.replace(sep, f"{sep_pack}{sep}{sep_pack}")
    
    lst = txt.split(sep_pack)
    temp = []
    idx = 0
    for _ in range(len(lst)):
        if idx < len(lst):
            if lst[idx] in seps:
                temp.append( [lst[idx], lst[idx+1]] )
                idx+=2
            else:
                temp.append( ['', lst[idx]] )
                idx+=1

    return temp

Kinda ugly though, looking to improve.

吐个泡泡 2025-01-28 09:17:35

This is in a similar vein to some answers here, but simpler.

There are three steps:

  1. Find all occurrences of a separator
  2. Split apart the remaining text
  3. Combine the results from (1) and (2) into a list of lists, as desired

We can combine (1) and (2) but it makes creating the list of lists more complicated.
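
For comparison, a combined version might look roughly like the sketch below. It is an assumption about what "combining" could mean here: a single re.split with a capture group, followed by slicing. The function name split_on_names_combined is hypothetical. Unlike the three-step version, it does not filter out empty text chunks, so a text that ends with a name yields a pair with an empty string.

```python
import re


def split_on_names_combined(names, text):
    # One capture group, so split() returns: prefix, name, text, name, text, ...
    pattern = re.compile("(" + "|".join(map(re.escape, names)) + ")")
    parts = pattern.split(text)

    prefix, rest = parts[0], parts[1:]
    pairs = [[name, chunk] for name, chunk in zip(rest[::2], rest[1::2])]

    # Keep a non-empty prefix under a blank separator, as in the three-step version.
    return ([["", prefix]] if prefix else []) + pairs


names = ["Mike", "Monika"]
text = "Hi Monika goes shopping. Then she rides bike. Mike likes Pizza. Monika hates me."
print(split_on_names_combined(names, text))
```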

import re

def split_on_names(names: list[str], text: str) -> list[list[str]]:
    pattern = re.compile("|".join(map(re.escape, names)))
    # step 1: find the separators (in order)
    separator = pattern.findall(text)
    # step 2: split out the text between separators
    remainder = list(filter(None, pattern.split(text)))

    # at this point, if `remainder` is longer, it's because `text` 
    # didn't start with a separator. So, we add a blank separator
    # to account for the prefix.
    if len(remainder) > len(separator):
        separator = ["", *separator]

    # step 3: reshape the results into a list of lists
    return list(map(list, zip(separator, remainder)))


names = ["Mike", "Monika"]
text = "Hi Monika goes shopping. Then she rides bike. Mike likes Pizza. Monika hates me."

split_on_names(names, text)

# output:
#
# [
#    ['', 'Hi '],
#    ['Monika', ' goes shopping. Then she rides bike. '],
#    ['Mike', ' likes Pizza. '],
#    ['Monika', ' hates me.']
# ]
好多鱼好多余 2025-01-28 09:17:35

You could use re.split along with zip:

import re
from pprint import pprint

text = "Monika goes shopping. Then she rides bike. Mike likes Pizza. " \
       "Monika hates me."

names = ["Henry", "Mike", "Monika"]

regex_string = "|".join(re.escape(name) for name in names)

fragments = re.split(f"({regex_string})", text)

if fragments:

    # ignoring text before first occurrence, not specified in requirements
    if not fragments[0] in names: 
        fragments = fragments[1:]

    result = [
        [name, text.rstrip()] 
        for name, text in zip(fragments[::2], fragments[1::2])
    ]

    pprint(result)

Output:

[['Monika', ' goes shopping. Then she rides bike.'],
 ['Mike', ' likes Pizza.'],
 ['Monika', ' hates me.']]

Notes:

  • This is an answer for question revision 9.

    • There is an update at the very end of this answer considering the changes in question revision 11.
  • You don't specify if "text" before the first occurrence of a name should be considered or not.

    • Script above ignores "text" before the first occurrence.
  • You also don't specify what happens if the text ends with a name.

    • The script above will include that occurrence, paired with an empty string. However, this can easily be solved by removing the last element if its "text" is an empty string.
  • zip works because there is always an even number of elements in fragments. We remove the first element if it does not match a name (either text or empty string), and the last element is always an empty string if the text ends with a name.

According to re.split:

If there are capturing groups in the separator and it matches at the start of the string, the result will start with an empty string. The same holds for the end of the string [...]
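
A short sketch of that boundary behaviour, using a made-up text that both starts and ends with a name:

```python
import re

fragments = re.split(r"(Mike|Monika)", "Monika hates Mike")
print(fragments)  # ['', 'Monika', ' hates ', 'Mike', '']

# Dropping the leading empty string leaves an even number of elements,
# so names and texts can be paired with zip.
fragments = fragments[1:]
pairs = [[name, chunk] for name, chunk in zip(fragments[::2], fragments[1::2])]
print(pairs)  # [['Monika', ' hates '], ['Mike', '']]
```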


Here is the same example but not ignoring "text" before the first occurrence:

import re

text = "Hi. Monika goes shopping. Then she rides bike. Mike likes Pizza. " \
       "Monika hates me."

names = ["Henry", "Mike", "Monika"]

regex_string = "|".join(re.escape(name) for name in names)

fragments = re.split(f"({regex_string})", text)

if fragments:

    # not ignoring text before first occurrence; use empty string as name
    if fragments[0].strip() == "":
        fragments = fragments[1:]
    elif not fragments[0] in names:
        fragments = [""] + fragments

    result = [
        [name, text.rstrip()]
        for name, text in zip(fragments[::2], fragments[1::2])
    ]

    # # remove empty text
    # if result and not result[-1][1]:
    #     result = result[:-1]

    print(result)  # [['', 'Hi.'], ['Monika', ...] ..., ['Monika', ' hates me.']]

Notes:

  • This is an answer for question revision 9.
    • There is an update at the very end of this answer considering the changes in question revision 11.

Update for Question Revision 11

The following is an attempt to include id345678's additional requirement:

import re
from pprint import pprint


def create_regex_string(name: str) -> str:
    """Build a regex for one name, making a leading salutation optional."""
    name_components = name.split()

    if len(name_components) == 1:
        return re.escape(name)

    # assumes exactly two components: salutation and name
    salutation, name_part = name_components

    return f"({re.escape(salutation)} )?{re.escape(name_part)}"


text = "Monika goes shopping. Then she rides bike. Dr. Mike likes Pizza. " \
       "Mrs. Monika hates me. Henry needs a break."

names = ["Henry", "Dr. Mike", "Mrs. Monika"]

regex_string = "|".join(create_regex_string(name) for name in names)

group_count = regex_string.count("(") + 1

fragments = re.split(f"({regex_string})", text)

if fragments:

    # ignoring text before first occurrence, not specified in requirements
    if not fragments[0] in names: 
        fragments = fragments[1:]

    result = [
        [name, text.rstrip()] 
        for name, text in zip(
            fragments[::group_count+1],
            fragments[group_count::group_count+1]
        )
    ]

    pprint(result)

Output:

[['Monika', ' goes shopping. Then she rides bike.'],
 ['Dr. Mike', ' likes Pizza.'],
 ['Mrs. Monika', ' hates me.'],
 ['Henry', ' needs a break.']]

Notes:

  • The final regex string is then (Henry|(Dr\. )?Mike|(Mrs\. )?Monika).

    • E.g. create_regex_string("Mrs. Monika") creates (Mrs\. )?Monika.
    • It also works for other salutations (as long as a single space separates the salutation from the name).
  • Because we introduced additional groups in the regex, fragments has more values.

    • Therefore, we needed to change the line with zip so that the slicing adapts dynamically to the number of groups.
  • If you don't want the salutation in the result, you can use name.split()[-1] when creating result:

result = [
    [name.split()[-1], text.rstrip()] 
    for name, text in zip(
        fragments[::group_count+1],
        fragments[group_count::group_count+1]
    )
]

# [['Monika', ' goes shopping. Then she rides bike.'],
#  ['Mike', ' likes Pizza.'],
#  ['Monika', ' hates me.'],
#  ['Henry', ' needs a break.']]
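
A possible variation, not part of the script above: if the optional salutation is wrapped in a non-capturing group (?:...), re.split only returns the outer group, so the group_count arithmetic is not needed. This is a sketch under the assumption that every entry in names is non-empty and that the last word is the name; the star-unpacking is a deviation from the original two-component unpacking.

```python
import re


def create_regex_string(name: str) -> str:
    """Variant with a non-capturing group for the optional salutation."""
    *salutation_parts, name_part = name.split()
    if not salutation_parts:
        return re.escape(name_part)
    prefix = re.escape(" ".join(salutation_parts))
    return f"(?:{prefix} )?{re.escape(name_part)}"


names = ["Henry", "Dr. Mike", "Mrs. Monika"]
text = ("Monika goes shopping. Then she rides bike. Dr. Mike likes Pizza. "
        "Mrs. Monika hates me. Henry needs a break.")

regex_string = "|".join(create_regex_string(name) for name in names)
fragments = re.split(f"({regex_string})", text)

# Only one capture group: ['<text before first name>', name, text, name, text, ...]
result = [
    [name, chunk.rstrip()]
    for name, chunk in zip(fragments[1::2], fragments[2::2])
]
print(result)
```

As in the script above, any text before the first name is ignored.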

Please note: I have not tested all use cases as I updated the script on my break time. Let me know if there are issues and then I will look into it when I am off work.

菊凝晚露 2025-01-28 09:17:35

Your example doesn't fully match your desired output. Also, it's not clear whether the example input will always have this structure, e.g. with a period at the end of each sentence.

Having said that, you might want to try this dirty approach:

import re

text = 'Monika will go shopping. Mike likes Pizza. Monika hates me.'

names = ['Ruth', 'Mike', 'Monika']
rsplit = re.compile("|".join(sorted(names))).split

output = []
sentences = text.split(".")
for name in names:
    for sentence in sentences:
        if name in sentence:
            output.append([name, f"{rsplit(sentence)[-1]}."])

print(output)

This outputs:

[['Mike', ' likes Pizza.'], ['Monika', ' will go shopping.'], ['Monika', ' hates me.']]
始终不够爱げ你 2025-01-28 09:17:35

This works without re, unless you explicitly need to use it. It works for the given test case.

text = 'Monika goes shopping. Then she rides bike. Mike likes Pizza. Monika hates me.'

names = ['Mike', 'Monika']

def sep(text, names):
    foo = []
    new_text = text.split(' ')
    for i in new_text:
        if i in names:
            foo.append(new_text[:new_text.index(i)])
            new_text = new_text[new_text.index(i):]
    foo.append(new_text)
    foo = foo[1:]

    new_foo = []
    for i in foo:
        first, rest = i[0], i[1:]
        rest = " ".join(rest)
        i = [first, rest]
        new_foo.append(i)
    print(new_foo)

sep(text, names)

Gives the output:

[['Monika', 'goes shopping. Then she rides bike.'], ['Mike', 'likes Pizza.'], ['Monika', 'hates me.']]

It should work for other cases too.
