Natural language processing fix for combined words
I have some text that was generated by another system. It ran some words together, which I assume is a by-product of some sort of word wrapping. So something as simple as 'the dog' is combined into 'thedog'.
I checked the ASCII and Unicode strings to see whether there was some unseen character in there, but there wasn't. A confounding problem is that this is medical text, and a corpus to check against isn't readily available. So, a real example is '...test to rule out SARS versus pneumonia', which ends up as '...versuspneumonia'.
Does anyone have a suggestion for finding and separating these?
Comments (3)
This may be of interest to you: http://www.perlmonks.org/?node_id=336331
You can probably use the medical nature of the text to your advantage by using two dictionaries, one containing only medical terminology and one of general English.
If you can isolate the medical words and then run the rest of the string against the general dictionary, you should get some decent results.
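A minimal Python sketch of the two-dictionary idea follows. It is not code from the linked page: the function name `split_fused` and the tiny `MEDICAL` and `GENERAL` word sets are made up for illustration, and real dictionaries would be far larger. It tries every split point of a token that is not itself a known word and keeps the splits where both halves are known, listing splits that contain a medical term first.

```python
def split_fused(token, medical, general):
    """Return (left, right) splits of token where both halves are known words.

    medical and general are sets of lowercase words (hypothetical stand-ins
    for a medical term list and a general English dictionary).
    """
    lowered = token.lower()
    if lowered in medical or lowered in general:
        return []  # already a known word, so there is nothing to split

    candidates = []
    for i in range(1, len(token)):
        left, right = lowered[:i], lowered[i:]
        left_known = left in medical or left in general
        right_known = right in medical or right in general
        if left_known and right_known:
            has_medical = left in medical or right in medical
            # Keep the original casing of the token in the result.
            candidates.append((has_medical, token[:i], token[i:]))

    # Stable sort: splits that involve a medical term come first.
    candidates.sort(key=lambda c: c[0], reverse=True)
    return [(left, right) for _, left, right in candidates]


# Toy word lists, assumed for this example only.
MEDICAL = {"pneumonia", "sars"}
GENERAL = {"the", "dog", "versus", "test", "to", "rule", "out"}

print(split_fused("versuspneumonia", MEDICAL, GENERAL))
print(split_fused("thedog", MEDICAL, GENERAL))
```

A token can have more than one valid split, which is why the function returns a list of candidates rather than a single answer.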
This is a rather tricky problem.
I would probably say a combination method is your best bet.
2.1. If you get a match, confirm with the human.
It would pretty much be an advanced form of spellcheck. You could automate it more, but I wouldn't risk it on something that important.
Alternatively, you can look for patterns in where the breaks happen. If, for example, every nth character that should be a space isn't one, you can fix that.
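One way to check for such a pattern is sketched below in Python. It assumes you already have a handful of splits confirmed by a human, and it tallies the character offset at which each missing space falls within its line. The names `missing_space_offsets` and `offset_counts`, and the sample `CONFIRMED` and `LINES` data, are hypothetical. If the breaks really come from word wrapping, the offsets in the real data might cluster around one column.

```python
from collections import Counter


def missing_space_offsets(line, confirmed):
    """Return sorted offsets in line where a confirmed fused token lacks its space.

    confirmed maps a fused token to its (left, right) halves.
    """
    offsets = []
    for fused, (left, _right) in confirmed.items():
        start = line.find(fused)
        while start != -1:
            # The space belongs right after the left half.
            offsets.append(start + len(left))
            start = line.find(fused, start + 1)
    return sorted(offsets)


def offset_counts(lines, confirmed):
    """Count how often each missing-space offset occurs across all lines."""
    counts = Counter()
    for line in lines:
        counts.update(missing_space_offsets(line, confirmed))
    return counts


# Sample data, assumed for this example only.
CONFIRMED = {"versuspneumonia": ("versus", "pneumonia"), "thedog": ("the", "dog")}
LINES = ["test to rule out SARS versuspneumonia", "thedog"]

print(offset_counts(LINES, CONFIRMED).most_common())
```

A strong peak at a single offset would support the word-wrap theory and point to where else to look. A flat distribution would suggest the joins are not tied to a fixed column.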
Here is what I did. I combined a couple of ideas and, using a general bootstrapping methodology, came up with a pretty good solution. I used Python for all of this.