使用正则表达式进行句子分割

发布于 2024-11-24 19:36:23 字数 539 浏览 4 评论 0原文

我有一些文本（SMS）消息，我想使用句点（'.'）作为分隔符对它们进行分段。我无法处理以下类型的消息。如何使用 Python 中的正则表达式对这些消息进行分段。

分割前：

'hyper count 16.8mmol/l.plz review b4 5pm.just to inform u.thank u'
'no of beds 8.please inform person in-charge.tq'

分割后：

'hyper count 16.8mmol/l' 'plz review b4 5pm' 'just to inform u' 'thank u'
'no of beds 8' 'please inform person in-charge' 'tq'

每行都是一条单独的消息

更新：

我正在做自然语言处理，我觉得可以处理'16.8mmmol/l'和'床数8.2杯茶。' 同样。 80% 的准确率对我来说已经足够了，但我想尽可能减少误报。

原文

I have few text(SMS) messages and I want to segment them using period('.') as a delimiter. I am unable to handle following types of messages. How can I segment these messages using Regex in Python.

Before segmentation:

'hyper count 16.8mmol/l.plz review b4 5pm.just to inform u.thank u'
'no of beds 8.please inform person in-charge.tq'

After segmentation:

'hyper count 16.8mmol/l' 'plz review b4 5pm' 'just to inform u' 'thank u'
'no of beds 8' 'please inform person in-charge' 'tq'

Each line is a separate message

Updated:

I am doing natural language processing and I feel its okay to treat '16.8mmmol/l' and 'no of beds 8.2 cups of tea.' as same. 80% accuracy is enough for me but I want to reduce False Positive as much as possible.

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

羅雙樹 2024-12-01 19:36:23

几周前，我搜索了一个正则表达式，它可以捕获字符串中代表数字的每个字符串，无论数字的书写形式如何，甚至是科学计数法中的数字，甚至是带有逗号的印度数字：请参阅此线程

我在下面的代码中使用这个正则表达式来解决您的问题。

与其他答案相反，在我的解决方案中， '8.' 中的点不被视为必须进行分割的点，因为它可以被读取为后面没有数字的浮点数点。

import re

regx = re.compile('(?<![\d.])(?!\.\.)'
                  '(?<![\d.][eE][+-])(?<![\d.][eE])(?<!\d[.,])'
                  '' #---------------------------------
                  '([+-]?)'
                  '(?![\d,]*?\.[\d,]*?\.[\d,]*?)'
                  '(?:0|,(?=0)|(?<!\d),)*'
                  '(?:'
                  '((?:\d(?!\.[1-9])|,(?=\d))+)[.,]?'
                  '|\.(0)'
                  '|((?<!\.)\.\d+?)'
                  '|([\d,]+\.\d+?))'
                  '0*'
                  '' #---------------------------------
                  '(?:'
                  '([eE][+-]?)(?:0|,(?=0))*'
                  '(?:'
                  '(?!0+(?=\D|\Z))((?:\d(?!\.[1-9])|,(?=\d))+)[.,]?'
                  '|((?<!\.)\.(?!0+(?=\D|\Z))\d+?)'
                  '|([\d,]+\.(?!0+(?=\D|\Z))\d+?))'
                  '0*'
                  ')?'
                  '' #---------------------------------
                  '(?![.,]?\d)')



simpler_regex = re.compile('(?<![\d.])0*(?:'
                           '(\d+)\.?|\.(0)'
                           '|(\.\d+?)|(\d+\.\d+?)'
                           ')0*(?![\d.])')


def split_outnumb(string, regx=regx, a=0):
    excluded_pos = [x for mat in regx.finditer(string) for x in range(*mat.span()) if string[x]=='.']
    li = []
    for xdot in (x for x,c in enumerate(string) if c=='.' and x not in excluded_pos):
        li.append(string[a:xdot])
        a = xdot + 1
    li.append(string[a:])
    return li





for sentence in ('hyper count 16.8mmol/l.plz review b4 5pm.just to inform u.thank u',
                 'no of beds 8.please inform person in-charge.tq',
                 'no of beds 8.2 cups of tea.tarabada',
                 'this number .977 is a float',
                 'numbers 214.21E+45 , 478945.E-201 and .12478E+02 are in scientific.notation',
                 'an indian number 12,45,782.258 in this.sentence and 45,78,325. is another',
                 'no dot in this sentence',
                 ''):
    print 'sentence         =',sentence
    print 'splitted eyquem  =',split_outnumb(sentence)
    print 'splitted eyqu 2  =',split_outnumb(sentence,regx=simpler_regex)
    print 'splitted gurney  =',re.split(r"\.(?!\d)", sentence)
    print 'splitted stema   =',re.split('(?<!\d)\.|\.(?!\d)',sentence)
    print

结果

sentence         = hyper count 16.8mmol/l.plz review b4 5pm.just to inform u.thank u
splitted eyquem  = ['hyper count 16.8mmol/l', 'plz review b4 5pm', 'just to inform u', 'thank u']
splitted eyqu 2  = ['hyper count 16.8mmol/l', 'plz review b4 5pm', 'just to inform u', 'thank u']
splitted gurney  = ['hyper count 16.8mmol/l', 'plz review b4 5pm', 'just to inform u', 'thank u']
splitted stema   = ['hyper count 16.8mmol/l', 'plz review b4 5pm', 'just to inform u', 'thank u']

sentence         = no of beds 8.please inform person in-charge.tq
splitted eyquem  = ['no of beds 8.please inform person in-charge', 'tq']
splitted eyqu 2  = ['no of beds 8.please inform person in-charge', 'tq']
splitted gurney  = ['no of beds 8', 'please inform person in-charge', 'tq']
splitted stema   = ['no of beds 8', 'please inform person in-charge', 'tq']

sentence         = no of beds 8.2 cups of tea.tarabada
splitted eyquem  = ['no of beds 8.2 cups of tea', 'tarabada']
splitted eyqu 2  = ['no of beds 8.2 cups of tea', 'tarabada']
splitted gurney  = ['no of beds 8.2 cups of tea', 'tarabada']
splitted stema   = ['no of beds 8.2 cups of tea', 'tarabada']

sentence         = this number .977 is a float
splitted eyquem  = ['this number .977 is a float']
splitted eyqu 2  = ['this number .977 is a float']
splitted gurney  = ['this number .977 is a float']
splitted stema   = ['this number ', '977 is a float']

sentence         = numbers 214.21E+45 , 478945.E-201 and .12478E+02 are in scientific.notation
splitted eyquem  = ['numbers 214.21E+45 , 478945.E-201 and .12478E+02 are in scientific', 'notation']
splitted eyqu 2  = ['numbers 214.21E+45 , 478945.E-201 and .12478E+02 are in scientific', 'notation']
splitted gurney  = ['numbers 214.21E+45 , 478945', 'E-201 and .12478E+02 are in scientific', 'notation']
splitted stema   = ['numbers 214.21E+45 , 478945', 'E-201 and ', '12478E+02 are in scientific', 'notation']

sentence         = an indian number 12,45,782.258 in this.sentence and 45,78,325. is another
splitted eyquem  = ['an indian number 12,45,782.258 in this', 'sentence and 45,78,325. is another']
splitted eyqu 2  = ['an indian number 12,45,782.258 in this', 'sentence and 45,78,325. is another']
splitted gurney  = ['an indian number 12,45,782.258 in this', 'sentence and 45,78,325', ' is another']
splitted stema   = ['an indian number 12,45,782.258 in this', 'sentence and 45,78,325', ' is another']

sentence         = no dot in this sentence
splitted eyquem  = ['no dot in this sentence']
splitted eyqu 2  = ['no dot in this sentence']
splitted gurney  = ['no dot in this sentence']
splitted stema   = ['no dot in this sentence']

sentence         = 
splitted eyquem  = ['']
splitted eyqu 2  = ['']
splitted gurney  = ['']
splitted stema   = ['']

编辑1

我添加了一个 simler_regex 检测数字，来自我在这个帖子

我没有检测到印度数字和科学计数法中的数字，但它实际上给出了相同的结果

Some weeks ago, I searched for a regex that would catch every string representing a number in a string, whatever the form in which the number is written, even the ones in scientific notation, even the indian numbers having commas: see this thread

I use this regex in the following code to give a solution to your problem.

Contrary to the other answers, in my solution a dot in '8.' isn't considered as a dot on which a split must be done, because it can be read as a float having no digit after the dot.

import re

regx = re.compile('(?<![\d.])(?!\.\.)'
                  '(?<![\d.][eE][+-])(?<![\d.][eE])(?<!\d[.,])'
                  '' #---------------------------------
                  '([+-]?)'
                  '(?![\d,]*?\.[\d,]*?\.[\d,]*?)'
                  '(?:0|,(?=0)|(?<!\d),)*'
                  '(?:'
                  '((?:\d(?!\.[1-9])|,(?=\d))+)[.,]?'
                  '|\.(0)'
                  '|((?<!\.)\.\d+?)'
                  '|([\d,]+\.\d+?))'
                  '0*'
                  '' #---------------------------------
                  '(?:'
                  '([eE][+-]?)(?:0|,(?=0))*'
                  '(?:'
                  '(?!0+(?=\D|\Z))((?:\d(?!\.[1-9])|,(?=\d))+)[.,]?'
                  '|((?<!\.)\.(?!0+(?=\D|\Z))\d+?)'
                  '|([\d,]+\.(?!0+(?=\D|\Z))\d+?))'
                  '0*'
                  ')?'
                  '' #---------------------------------
                  '(?![.,]?\d)')



simpler_regex = re.compile('(?<![\d.])0*(?:'
                           '(\d+)\.?|\.(0)'
                           '|(\.\d+?)|(\d+\.\d+?)'
                           ')0*(?![\d.])')


def split_outnumb(string, regx=regx, a=0):
    excluded_pos = [x for mat in regx.finditer(string) for x in range(*mat.span()) if string[x]=='.']
    li = []
    for xdot in (x for x,c in enumerate(string) if c=='.' and x not in excluded_pos):
        li.append(string[a:xdot])
        a = xdot + 1
    li.append(string[a:])
    return li





for sentence in ('hyper count 16.8mmol/l.plz review b4 5pm.just to inform u.thank u',
                 'no of beds 8.please inform person in-charge.tq',
                 'no of beds 8.2 cups of tea.tarabada',
                 'this number .977 is a float',
                 'numbers 214.21E+45 , 478945.E-201 and .12478E+02 are in scientific.notation',
                 'an indian number 12,45,782.258 in this.sentence and 45,78,325. is another',
                 'no dot in this sentence',
                 ''):
    print 'sentence         =',sentence
    print 'splitted eyquem  =',split_outnumb(sentence)
    print 'splitted eyqu 2  =',split_outnumb(sentence,regx=simpler_regex)
    print 'splitted gurney  =',re.split(r"\.(?!\d)", sentence)
    print 'splitted stema   =',re.split('(?<!\d)\.|\.(?!\d)',sentence)
    print

result

sentence         = hyper count 16.8mmol/l.plz review b4 5pm.just to inform u.thank u
splitted eyquem  = ['hyper count 16.8mmol/l', 'plz review b4 5pm', 'just to inform u', 'thank u']
splitted eyqu 2  = ['hyper count 16.8mmol/l', 'plz review b4 5pm', 'just to inform u', 'thank u']
splitted gurney  = ['hyper count 16.8mmol/l', 'plz review b4 5pm', 'just to inform u', 'thank u']
splitted stema   = ['hyper count 16.8mmol/l', 'plz review b4 5pm', 'just to inform u', 'thank u']

sentence         = no of beds 8.please inform person in-charge.tq
splitted eyquem  = ['no of beds 8.please inform person in-charge', 'tq']
splitted eyqu 2  = ['no of beds 8.please inform person in-charge', 'tq']
splitted gurney  = ['no of beds 8', 'please inform person in-charge', 'tq']
splitted stema   = ['no of beds 8', 'please inform person in-charge', 'tq']

sentence         = no of beds 8.2 cups of tea.tarabada
splitted eyquem  = ['no of beds 8.2 cups of tea', 'tarabada']
splitted eyqu 2  = ['no of beds 8.2 cups of tea', 'tarabada']
splitted gurney  = ['no of beds 8.2 cups of tea', 'tarabada']
splitted stema   = ['no of beds 8.2 cups of tea', 'tarabada']

sentence         = this number .977 is a float
splitted eyquem  = ['this number .977 is a float']
splitted eyqu 2  = ['this number .977 is a float']
splitted gurney  = ['this number .977 is a float']
splitted stema   = ['this number ', '977 is a float']

sentence         = numbers 214.21E+45 , 478945.E-201 and .12478E+02 are in scientific.notation
splitted eyquem  = ['numbers 214.21E+45 , 478945.E-201 and .12478E+02 are in scientific', 'notation']
splitted eyqu 2  = ['numbers 214.21E+45 , 478945.E-201 and .12478E+02 are in scientific', 'notation']
splitted gurney  = ['numbers 214.21E+45 , 478945', 'E-201 and .12478E+02 are in scientific', 'notation']
splitted stema   = ['numbers 214.21E+45 , 478945', 'E-201 and ', '12478E+02 are in scientific', 'notation']

sentence         = an indian number 12,45,782.258 in this.sentence and 45,78,325. is another
splitted eyquem  = ['an indian number 12,45,782.258 in this', 'sentence and 45,78,325. is another']
splitted eyqu 2  = ['an indian number 12,45,782.258 in this', 'sentence and 45,78,325. is another']
splitted gurney  = ['an indian number 12,45,782.258 in this', 'sentence and 45,78,325', ' is another']
splitted stema   = ['an indian number 12,45,782.258 in this', 'sentence and 45,78,325', ' is another']

sentence         = no dot in this sentence
splitted eyquem  = ['no dot in this sentence']
splitted eyqu 2  = ['no dot in this sentence']
splitted gurney  = ['no dot in this sentence']
splitted stema   = ['no dot in this sentence']

sentence         = 
splitted eyquem  = ['']
splitted eyqu 2  = ['']
splitted gurney  = ['']
splitted stema   = ['']

EDIT 1

I added a simpler_regex detecting numbers, from a post of mine in this thread

I doesn't detect indian numbers and numbers in scientific notation but it gives in fact the same results

回复收藏 0 原文

递刀给你 2024-12-01 19:36:23

您可以使用否定先行断言来匹配“。”后面不跟数字，然后使用 re.split ：

>>> import re
>>> splitter = r"\.(?!\d)"
>>> s = 'hyper count 16.8mmol/l.plz review b4 5pm.just to inform u.thank u'
>>> re.split(splitter, s)
['hyper count 16.8mmol/l', 'plz review b4 5pm', 'just to inform u', 'thank u']
>>> s = 'no of beds 8.please inform person in-charge.tq'
>>> re.split(splitter, s)
['no of beds 8', 'please inform person in-charge', 'tq']

you can use a negative lookahead assertion to match a "." not followed by a digit, and use re.split on this:

>>> import re
>>> splitter = r"\.(?!\d)"
>>> s = 'hyper count 16.8mmol/l.plz review b4 5pm.just to inform u.thank u'
>>> re.split(splitter, s)
['hyper count 16.8mmol/l', 'plz review b4 5pm', 'just to inform u', 'thank u']
>>> s = 'no of beds 8.please inform person in-charge.tq'
>>> re.split(splitter, s)
['no of beds 8', 'please inform person in-charge', 'tq']

回复收藏 0 原文

尐籹人 2024-12-01 19:36:23

环顾

re.split('(?<!\d)\.|\.(?!\d)', 'hyper count 16.8mmol/l.plz review b4 5pm.just to inform u.thank u')

四周确保一侧或另一侧都不是数字。因此这也涵盖了 16.8 情况。如果两边都有数字，这个表达式不会分裂。

What about

re.split('(?<!\d)\.|\.(?!\d)', 'hyper count 16.8mmol/l.plz review b4 5pm.just to inform u.thank u')

The lookarounds ensure that either on one or the other side is not a digit. So this covers also the 16.8 case. This expression will not split if there are on both sides digits.

回复收藏 0 原文