Splitting strings that contain multiple substrings

Posted on 2025-01-21 08:47:58

I have a list of strings, names:

names = ['Acquaintance Muller', 'Vice president Johnson Affiliate Peterson Acquaintance Dr. Rose']

I want to split the strings that contain more than one of the following substrings:

substrings = ['Vice president', 'Affiliate', 'Acquaintance']

More precisely, I want to split after the last character of the word that follows the substring:

desired_output = ['Acquaintance Muller', 'Vice president Johnson', 'Affiliate Peterson', 'Acquaintance Dr. Rose']

I don't know how to implement the 'more than one' condition in my code:

import re

names = ['Acquaintance Muller', 'Vice president Johnson Affiliate Peterson Acquaintance Dr. Rose']
substrings = re.compile(r'Vice\spresident|Affiliate|Acquaintance')

splitted = []
for i in names:
    # Start a new group whenever the string contains one of the titles
    if substrings.search(i):
        splitted.append([])
    # The whole string is appended unsplit, which is the problem
    splitted[-1].append(i)

Exception: when that last character is a period (e.g. Prof.), split after the second word following the substring.


Update: names is more complex than I thought. It contains

  1. the title-like pattern already answered correctly ('Vice president Johnson Affiliate Peterson Acquaintance Dr. Rose'),
  2. followed by a second pattern of strings ('Mister Kelly, AWS'),
  3. followed by a third pattern of strings that continues to the end ('Dr. Birker, Secretary Dr. Dews, Member Miss Berg, Secretary for Relations Dr. Jakob, Secretary').

names = ['Acquaintance Muller', 'Vice president Johnson Affiliate Peterson Acquaintance Dr. Rose', 'Vice president Dr. John Mister Schmid, PRT Miss Robertson, FDU', 'Mister Kelly, AWS', 'Dr. Birker, Secretary Dr. Dews, Member Miss Berg, Secretary for Relations Dr. Jakob, Secretary']

Sometimes Secretary is followed by varying specifications. I don't care about the characters that sometimes follow Secretary until the next name occurs. They can be dropped. Of course, 'Secretary' itself should be stored as in updated_output.

I created a (hopefully exhaustive) list, specifications, of what can follow Secretary:
specifications = ['', ' of State', ' for Relations', ' for the Interior', ' for the Environment']

Updated question: how can I account for the third pattern using the specifications list?

updated_output = ['Acquaintance Muller', 'Vice president Johnson', 'Affiliate Peterson', 'Acquaintance Dr. Rose', 'Vice president Dr. John', 'Mister Schmid, PRT', 'Miss Robertson, FDU', 'Mister Kelly, AWS', 'Dr. Birker, Secretary of State', 'Dr. Dews, Member', 'Miss Berg, Secretary for Relations', 'Dr. Jakob, Secretary']

Comments (2)

梨涡 2025-01-28 08:47:58

You want to split at the word boundary just before one of those three titles, so you can look for a word boundary \b followed by a positive lookahead (?=...) for one of those titles:

>>> import re
>>> s = 'Vice president Johnson affiliate Peterson acquaintance Dr. Rose'
>>> v = re.split(r"\b(?=Vice president|affiliate|acquaintance)", s, flags=re.I)
>>> v
['', 'Vice president Johnson ', 'affiliate Peterson ', 'acquaintance Dr. Rose']

Then, you can trim and discard the empty results:

>>> v = [x for i in v if (x := i.strip())]  # the walrus operator needs Python 3.8+
>>> v
['Vice president Johnson', 'affiliate Peterson', 'acquaintance Dr. Rose']

With a list of input strings, simply apply this treatment to all of them:

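import re


def get_names(s):
    # Split at the word boundary just before each title (case-insensitive)
    v = re.split(r"\b(?=Vice president|affiliate|acquaintance)", s, flags=re.I)
    # Trim each piece and drop the empty ones
    return [x for i in v if (x := i.strip())]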


names = ['Acquaintance Muller', 'Vice president Johnson Affiliate Peterson Acquaintance Dr. Rose']

output = []
for n in names:
    output.extend(get_names(n))

Which gives:

output = ['Acquaintance Muller',
 'Vice president Johnson',
 'Affiliate Peterson',
 'Acquaintance Dr. Rose']
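
The same lookahead idea can probably be stretched to cover the updated input in the question. The following is only a sketch and not part of the answer above: it assumes that every entry in the second and third patterns starts with one of a small set of honorifics (the HONORIFICS list is hypothetical, made up from the sample data), and that an honorific directly after a title belongs to that title's entry. Under those assumptions the specifications list is not needed for splitting, because the split happens before the next honorific regardless of what follows Secretary. The names TITLES, HONORIFICS, SPLIT_RE, and split_entries are illustrative.

```python
import re

# Titles from the question; HONORIFICS is an assumed list based on the sample data.
TITLES = ["Vice president", "Affiliate", "Acquaintance"]
HONORIFICS = ["Mister", "Miss", "Dr."]

title_alt = "|".join(map(re.escape, TITLES))
honorific_alt = "|".join(map(re.escape, HONORIFICS))
# One fixed-width negative lookbehind per title: an honorific right after a
# title ("Acquaintance Dr. Rose") stays in that title's entry, so no split there.
not_after_title = "".join(f"(?<!{re.escape(t)} )" for t in TITLES)

SPLIT_RE = re.compile(
    rf"\b(?={title_alt})|{not_after_title}\b(?=(?:{honorific_alt})\s)",
    flags=re.I,
)


def split_entries(s):
    """Split s before each title, or before an honorific that does not follow a title."""
    return [x for part in SPLIT_RE.split(s) if (x := part.strip())]


names = [
    "Acquaintance Muller",
    "Vice president Johnson Affiliate Peterson Acquaintance Dr. Rose",
    "Vice president Dr. John Mister Schmid, PRT Miss Robertson, FDU",
    "Mister Kelly, AWS",
    "Dr. Birker, Secretary Dr. Dews, Member Miss Berg, Secretary for Relations Dr. Jakob, Secretary",
]

output = []
for n in names:
    output.extend(split_entries(n))
```

This keeps whatever follows Secretary (for example 'Secretary for Relations'), since the split only happens before the next honorific. A name that carries none of the listed honorifics would not start a new entry, so the list would need extending for real data.
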
韬韬不绝 2025-01-28 08:47:58

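Try finding the start index of every title with finditer and slicing the string between consecutive starts: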

import re

names = [
    "acquaintance Muller",
    "Vice president Johnson affiliate Peterson acquaintance Dr. Rose",
]
substrings = ["Vice president", "affiliate", "acquaintance"]

r = re.compile("|".join(map(re.escape, substrings)))

out = []
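for n in names:
    # Start index of every title found in the string
    starts = [i.start() for i in r.finditer(n)]

    # No title at all: keep the string as it is
    if not starts:
        out.append(n)
        continue

    # Keep any text that precedes the first title
    if starts[0] != 0:
        starts = [0, *starts]

    # Slice between consecutive start positions, up to the end of the string
    starts.append(len(n))
    for a, b in zip(starts, starts[1:]):
        out.append(n[a:b])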

print(out)

Prints:

['acquaintance Muller', 'Vice president Johnson ', 'affiliate Peterson ', 'acquaintance Dr. Rose']
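
Note that the slices above keep the trailing space before the next title. A small variant that trims each piece, wrapped in a function, might look like the sketch below. The function name split_at_titles is hypothetical, and the match stays case-sensitive as in the code above.

```python
import re


def split_at_titles(text, titles):
    """Slice text at the start of each title and trim surrounding whitespace."""
    pattern = re.compile("|".join(map(re.escape, titles)))
    starts = [m.start() for m in pattern.finditer(text)]
    # Always begin at index 0 so text before the first title is kept
    if not starts or starts[0] != 0:
        starts = [0, *starts]
    starts.append(len(text))
    # Trim each slice and drop slices that are empty after trimming
    pieces = (text[a:b].strip() for a, b in zip(starts, starts[1:]))
    return [p for p in pieces if p]


substrings = ["Vice president", "affiliate", "acquaintance"]
out = split_at_titles(
    "Vice president Johnson affiliate Peterson acquaintance Dr. Rose", substrings
)
print(out)
```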