How do I determine the correct capitalization of words in a sentence?

Posted on 2024-12-08 19:54:42, 108 words, 1 view, 0 comments

I have a database of sentences that contain only capitalized letters. The database is technical and contains medical terms, and I want to normalize it so that the capitalization is (close to) what a user expects. What is the best way to achieve this? Is there a freely available data set I can use to help with the process?

Comments (3)

愿得七秒忆 2024-12-15 19:54:42

One way could be to infer capitalization from POS-tagging, for example using the Python Natural Language Toolkit (NLTK):

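import re

import nltk

# Needs the NLTK tokenizer and POS-tagger models (fetch them with nltk.download()).


def truecase(text, only_proper_nouns=False):
    # POS tags whose words get capitalized. The tagger sees lower-cased
    # input, so it mostly reports names as common nouns (NN/NNS).
    capitalize_tags = (
        {"NNP", "NNPS"} if only_proper_nouns else
        {"NN", "NNS"}
    )
    truecased_sents = []  # List of true-cased sentences.
    for sent in nltk.sent_tokenize(text):
        # Apply POS-tagging to the lower-cased tokens of this sentence.
        tagged_sent = nltk.pos_tag([
            word.lower() for word in nltk.word_tokenize(sent)
        ])
        if not tagged_sent:
            continue
        # Infer capitalization from POS-tags.
        normalized_sent = [
            word.capitalize() if tag in capitalize_tags else word
            for (word, tag) in tagged_sent
        ]
        # Capitalize first word in sentence.
        normalized_sent[0] = normalized_sent[0].capitalize()
        # Use regular expression to drop the space before punctuation.
        truecased_sents.append(re.sub(
            r" (?=[.,'!?:;])",
            "",
            " ".join(normalized_sent)
        ))
    return " ".join(truecased_sents)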

This will not be perfect, especially because I don't know exactly what your data looks like, and the exact result depends on the tagger model, but maybe you can get the idea:

>>> text = "Clonazepam Has Been Approved As An Anticonvulsant To Be Manufactured In 0.5mg, 1mg And 2mg Tablets. It Is The Generic Equivalent Of Roche Laboratories' Klonopin."
>>> truecase(text)
"Clonazepam has been approved as an anticonvulsant to be manufactured in 0.5mg, 1mg and 2mg Tablets. It is the generic Equivalent of Roche Laboratories' Klonopin."

沉溺在你眼里的海 2024-12-15 19:54:42

Search for work on truecasing: http://en.wikipedia.org/wiki/Truecasing

It would be really easy to generate your own data set if you have access to similar medical data with normal capitalization. Convert everything to upper case and use the mapping back to the original text to train and test your algorithm.
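As a rough illustration of that idea, here is a minimal sketch that learns the most frequent casing of each word from properly cased text, then scores itself on upper-cased held-out sentences. The function names (`train_casing_table`, `truecase_with_table`, `token_accuracy`) and the three-sentence corpus are made up for the example; a real run would need a much larger medical corpus, and this per-word lookup is only one simple baseline, not an established truecasing method.

```python
import re
from collections import Counter, defaultdict

WORD = re.compile(r"[^\W\d_]+")  # runs of letters; digits and punctuation are left alone


def train_casing_table(cased_sentences):
    """Map each upper-cased word to its most frequent original casing."""
    counts = defaultdict(Counter)
    for sentence in cased_sentences:
        # Skip the sentence-initial word: its capital reflects position only.
        for word in WORD.findall(sentence)[1:]:
            counts[word.upper()][word] += 1
    return {key: forms.most_common(1)[0][0] for key, forms in counts.items()}


def truecase_with_table(sentence, table):
    """Restore casing of one sentence; unseen words are lower-cased."""
    restored = WORD.sub(
        lambda m: table.get(m.group(0).upper(), m.group(0).lower()),
        sentence,
    )
    # Capitalize the first character, keeping the rest as restored.
    return restored[:1].upper() + restored[1:]


def token_accuracy(cased_sentences, table):
    """Share of words restored exactly after upper-casing each sentence."""
    correct = total = 0
    for sentence in cased_sentences:
        restored = truecase_with_table(sentence.upper(), table)
        for got, want in zip(WORD.findall(restored), WORD.findall(sentence)):
            correct += got == want
            total += 1
    return correct / total if total else 0.0


# Hypothetical stand-in for a properly cased medical corpus.
corpus = [
    "The patient was given Klonopin after the MRI scan.",
    "An MRI scan showed no change.",
    "Roche makes Klonopin tablets.",
]
held_out = ["The scan was repeated by Roche."]

table = train_casing_table(corpus)
restored = truecase_with_table("THE MRI SCAN WAS REPEATED AFTER KLONOPIN.", table)
print(restored)  # The MRI scan was repeated after Klonopin.
print(token_accuracy(held_out, table))
```

In the held-out sentence, "Roche" only ever appeared sentence-initially in the training text, so it is restored in lower case. Errors like that are what the train/test split is meant to surface.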

不交电费瞎发啥光 2024-12-15 19:54:42

The easiest way to do this is to use a spelling-correction algorithm based on n-grams.

You can use, for example, the LingPipe SpellChecker. You can find source code for predicting spaces in words, which is similar to what can be done for predicting case.
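To make the n-gram idea concrete, here is a small sketch of a bigram model with a unigram back-off that picks a word's casing from the preceding word. It is plain Python and does not use or imitate the LingPipe API; the class name `BigramTruecaser`, its methods, and the toy corpus are assumptions made for illustration, and it expects one sentence per call.

```python
import re
from collections import Counter, defaultdict

WORD = re.compile(r"[^\W\d_]+")  # runs of letters


class BigramTruecaser:
    """Choose each word's casing from the previous word, else from unigram counts."""

    def __init__(self):
        self.unigrams = defaultdict(Counter)  # word.lower() -> casing counts
        self.bigrams = defaultdict(Counter)   # (prev.lower(), word.lower()) -> casing counts

    def train(self, cased_sentences):
        for sentence in cased_sentences:
            words = WORD.findall(sentence)
            # Start at 1: a sentence-initial capital says nothing about the word.
            for i in range(1, len(words)):
                key = words[i].lower()
                self.unigrams[key][words[i]] += 1
                self.bigrams[(words[i - 1].lower(), key)][words[i]] += 1

    def restore(self, sentence):
        pieces, last_end, prev = [], 0, None
        for match in WORD.finditer(sentence):
            key = match.group(0).lower()
            # Prefer casing seen after the same previous word, then back off.
            forms = self.bigrams.get((prev, key)) or self.unigrams.get(key)
            word = forms.most_common(1)[0][0] if forms else key
            if prev is None:
                word = word[:1].upper() + word[1:]  # first word of the sentence
            pieces.append(sentence[last_end:match.start()])  # keep punctuation, digits
            pieces.append(word)
            last_end, prev = match.end(), key
        pieces.append(sentence[last_end:])
        return "".join(pieces)


# Hypothetical training sentences; "aids" and "AIDS" differ only by context.
model = BigramTruecaser()
model.train([
    "Patients with AIDS need care.",
    "New hearing aids help patients.",
    "She tested positive for AIDS last year.",
])
result = model.restore("HEARING AIDS FOR PATIENTS WITH AIDS.")
print(result)  # Hearing aids for patients with AIDS.
```

A unigram table alone would restore both occurrences as "AIDS", since that form is more frequent in the toy corpus. The bigram context is what separates "hearing aids" from "with AIDS".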
