Sentence segmentation using regex
I have a few text (SMS) messages and I want to segment them using the period ('.') as a delimiter. I am unable to handle the following types of messages. How can I segment these messages using a regex in Python?
Before segmentation:
'hyper count 16.8mmol/l.plz review b4 5pm.just to inform u.thank u'
'no of beds 8.please inform person in-charge.tq'
After segmentation:
'hyper count 16.8mmol/l'
'plz review b4 5pm'
'just to inform u'
'thank u'
'no of beds 8'
'please inform person in-charge'
'tq'
Each line is a separate message
Updated:
I am doing natural language processing, and I feel it's okay to treat '16.8mmol/l' and 'no of beds 8.2 cups of tea.' the same way. 80% accuracy is enough for me, but I want to reduce false positives as much as possible.
Comments (5)
Some weeks ago, I searched for a regex that would catch every substring representing a number, whatever form the number is written in, including numbers in scientific notation and Indian-style numbers with commas: see this thread
I use this regex in the following code to give a solution to your problem.
Contrary to the other answers, in my solution the dot in '8.' isn't treated as a dot on which to split, because '8.' can be read as a float with no digits after the decimal point.
result
EDIT 1
I added a simpler_regex that detects numbers, from a post of mine in this thread
It doesn't detect Indian numbers or numbers in scientific notation, but it in fact gives the same results
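The number-aware idea in this answer can be sketched as follows. This is not the answerer's regex (which is not reproduced on this page); the NUMBER pattern and the function name split_outside_numbers are simplified assumptions made for illustration. The sketch matches number-like tokens first, so that a dot belonging to '16.8' or '8.' is never treated as a separator.

```python
import re

# Simplified stand-in for a number pattern (an assumption, not the answer's regex):
# digits followed by a dot and optional decimals ("16.8", "8."),
# or a decimal with a leading dot (".5").
NUMBER = r'\d+\.\d*|\.\d+'

# The number alternative is tried first, so a dot inside a number is
# consumed as part of the number instead of matching the "dot" group.
TOKEN = re.compile(rf'(?P<num>{NUMBER})|(?P<dot>\.)')

def split_outside_numbers(text):
    """Split text on dots that are not part of a number-like token."""
    parts, start = [], 0
    for m in TOKEN.finditer(text):
        if m.lastgroup == 'dot':
            parts.append(text[start:m.start()])
            start = m.end()
    parts.append(text[start:])
    # Drop empty pieces, e.g. the one produced by a trailing dot.
    return [p for p in parts if p]

print(split_outside_numbers(
    'hyper count 16.8mmol/l.plz review b4 5pm.just to inform u.thank u'))
print(split_outside_numbers('no of beds 8.please inform person in-charge.tq'))
```

Note that, as the answer says, the second message is not split after '8.' under this reading, which differs from the output the question asks for.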
You can use a negative lookahead assertion to match a "." not followed by a digit, and use re.split on this:
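A minimal sketch of the negative-lookahead approach, assuming the pattern \.(?!\d) and a hypothetical helper name split_on_periods (the answer's original code is not shown on this page):

```python
import re

def split_on_periods(message):
    """Split on '.' unless the dot is immediately followed by a digit."""
    pieces = re.split(r'\.(?!\d)', message)
    # A trailing dot yields an empty last piece; drop empties.
    return [p for p in pieces if p]

messages = [
    'hyper count 16.8mmol/l.plz review b4 5pm.just to inform u.thank u',
    'no of beds 8.please inform person in-charge.tq',
]
for msg in messages:
    print(split_on_periods(msg))
```

With this pattern, the dot in '16.8' is kept because it is followed by a digit, while the dot in '8.please' is treated as a separator.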
What about
The lookarounds ensure that at least one side of the dot is not a digit, so this also covers the 16.8 case. This expression will not split if there are digits on both sides.
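The lookaround idea (split only where at least one neighbor of the dot is not a digit) could be written roughly as below. The exact pattern and the helper name split_unless_decimal are assumptions for illustration, since the answer's code is not reproduced on this page.

```python
import re

# Split on a dot that is NOT preceded by a digit, or NOT followed by a digit.
# A dot with digits on both sides (as in "16.8") matches neither alternative.
PERIOD = re.compile(r'(?<!\d)\.|\.(?!\d)')

def split_unless_decimal(message):
    """Split on '.' except when it sits between two digits."""
    return [p for p in PERIOD.split(message) if p]

print(split_unless_decimal('hyper count 16.8mmol/l.plz review b4 5pm.just to inform u.thank u'))
print(split_unless_decimal('no of beds 8.2 cups of tea.'))
# Unlike a lookahead-only pattern, this also splits when only the
# right-hand neighbor is a digit:
print(split_unless_decimal('review b4 5pm.2 beds free'))
```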
It depends on your exact sentence, but you could try:
See if that works. This will keep the quotes in, but you can remove them afterwards if needed.
split is a built-in Python string method that separates a string at a specified delimiter.
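For comparison, here is a small sketch of plain str.split on one of the question's messages. It shows why splitting on every '.' is not enough here: the decimal in '16.8' is cut in two. The variable names are illustrative only.

```python
message = 'hyper count 16.8mmol/l.plz review b4 5pm'

# str.split breaks the string at every occurrence of the delimiter,
# including the decimal point inside "16.8".
naive_parts = message.split('.')
print(naive_parts)
# ['hyper count 16', '8mmol/l', 'plz review b4 5pm']
```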