Natural language processing fix for combined words

Posted 2024-10-22 07:56:28

I have some text that was generated by another system. It combined some words together, in what I assume was a by-product of some sort of word wrap. So something simple like 'the dog' is combined into 'thedog'.

I checked the ASCII and Unicode strings to see whether there was some unseen character in there, but there wasn't. A confounding problem is that this is medical text, and a corpus to check against isn't readily available. So, a real example is '...test to rule out SARS versus pneumonia', which ends up as '... versuspneumonia.'

Does anyone have a suggestion for finding and separating these?

Comments (3)

失而复得 2024-10-29 07:56:28

This may be of interest to you: http://www.perlmonks.org/?node_id=336331

You can probably use the medical nature of the text to your advantage by using two dictionaries, one containing only medical terminology and one of general English.

If you can isolate the medical words and then run the rest of the string against the general dictionary, you should get some decent results.
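A minimal Python sketch of the two-dictionary idea, under stated assumptions: `MEDICAL` and `GENERAL` are tiny hypothetical stand-ins for real word lists, and `segment` is an illustrative helper, not something taken from the linked post. At each position it tries medical terms first, then general English, preferring longer words.

```python
# Tiny stand-in word lists; real ones would be loaded from files (assumption).
MEDICAL = {"pneumonia", "sars", "sepsis"}
GENERAL = {"the", "dog", "versus", "test", "rule", "out", "to"}


def segment(token, medical=MEDICAL, general=GENERAL):
    """Split a run-together token into known words, or return None.

    At each position, medical terms are tried before general English,
    and within each dictionary longer words are tried before shorter ones.
    """
    token = token.lower()
    memo = {}  # start index -> list of words, or None if no split exists

    def solve(start):
        if start == len(token):
            return []
        if start in memo:
            return memo[start]
        result = None
        for lexicon in (medical, general):
            for end in range(len(token), start, -1):
                piece = token[start:end]
                if piece in lexicon:
                    rest = solve(end)
                    if rest is not None:
                        result = [piece] + rest
                        break
            if result is not None:
                break
        memo[start] = result
        return result

    return solve(0)
```

Calling `segment("versuspneumonia")` with these toy lists yields the two words `versus` and `pneumonia`; a token that cannot be fully covered by the dictionaries returns `None` and is left for other handling.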

半﹌身腐败 2024-10-29 07:56:28

This is a rather tricky problem.

I would probably say a combination method is your best bet.

  1. Search for "misspelled words"
  2. For each one of these, check to see if there is some combination of dictionary words which can make it. You can assume that a word is only made up of two words, because of step 4
    2.1. If you get a match, confirm with the human.
  3. If there is no match, ask the human to say "this is a real word you don't have", or "this is the correction"

It would pretty much be an advanced form of spellcheck. You could automate it more, but I wouldn't risk it on something this important.

Alternatively, you can look for patterns in where the breaks happen. If, for example, every nth character that should be a space isn't one, you can fix that directly.
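One rough way to check for such a pattern, sketched in Python. The input `join_offsets` is a hypothetical list of character positions where a missing space has already been confirmed (for example, by hand); the helper name is also just for illustration.

```python
from collections import Counter


def likely_wrap_width(join_offsets):
    """Return (gap, count) for the most common distance between joins.

    Returns None when fewer than two offsets are given.
    """
    ordered = sorted(join_offsets)
    gaps = Counter(b - a for a, b in zip(ordered, ordered[1:]))
    if not gaps:
        return None
    return gaps.most_common(1)[0]
```

If one gap clearly dominates, that suggests a fixed wrap width. Not every wrap point will necessarily produce a joined word, so multiples of the same gap are worth looking for as well.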

趁年轻赶紧闹 2024-10-29 07:56:28

Here is what I did. I combined a couple of ideas and, using a general bootstrapping methodology, came up with a pretty good solution. I used Python for all of this.

  1. Took a sample of reports, tokenized all the words, and created a frequency table.
  2. For words with a frequency of 3 or under (a frequency of 4 or more was deemed common enough to be correct), I spell-checked them using the PyEnchant package (enchant library).
  3. Built a medical dictionary from the 'misspelled' words in step 2 that were clinical.
  4. For all the reports, created a frequency table.
  5. For words with a frequency under 4, I spell-checked each one using PyEnchant and my medical dictionary.
  6. Took each misspelled word and split it in all possible ways. The splits were tested for the creation of 2 correctly spelled words. Kept any successful split.
  7. Among the potential solutions for each word, the highest-weighted one was used.
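Steps 4 through 7 can be sketched roughly as follows. This is not the original code: a plain set (`known_words`) stands in for the combination of PyEnchant and the medical dictionary, and the weighting (corpus frequency of the rarer half) is an assumption, since the answer does not say how splits were weighted. All function names are illustrative.

```python
import re
from collections import Counter


def frequency_table(reports):
    """Count lowercase alphabetic tokens across all reports."""
    counts = Counter()
    for report in reports:
        counts.update(re.findall(r"[a-z]+", report.lower()))
    return counts


def best_split(word, is_known, counts):
    """Return the highest-scoring two-word split of `word`, or None."""
    candidates = []
    for i in range(1, len(word)):
        left, right = word[:i], word[i:]
        if is_known(left) and is_known(right):
            # Assumed weight: corpus frequency of the rarer half.
            weight = min(counts[left], counts[right])
            candidates.append((weight, left, right))
    if not candidates:
        return None
    _, left, right = max(candidates)
    return left, right


def find_repairs(reports, known_words, rare_below=4):
    """Map each rare, unknown token to its best two-word split."""
    counts = frequency_table(reports)

    def is_known(word):
        return word in known_words

    repairs = {}
    for word, n in counts.items():
        if n < rare_below and not is_known(word):
            split = best_split(word, is_known, counts)
            if split is not None:
                repairs[word] = split
    return repairs
```

With a real spell checker, `is_known` would wrap the checker's lookup instead of a set membership test. Very short dictionary entries (single letters, two-letter words) tend to produce spurious splits, so filtering them out or weighting them down is probably worthwhile.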