
How can I determine the correct capitalization of words in a sentence?

Posted on 2024-12-08 19:54:42

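I have a database of sentences written entirely in uppercase letters. The database is technical and contains medical terms, and I want to normalize it so that the capitalization is (close to) what a user would expect. What is the best way to achieve this? Is there a freely available dataset that could help with this process?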

愿得七秒忆 2024-12-15 19:54:42

One approach is to infer capitalization from part-of-speech (POS) tags, for example with the Python Natural Language Toolkit (NLTK):

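import re

import nltk


def truecase(text, only_proper_nouns=False):
    """Restore capitalization from POS tags (a heuristic, not exact)."""
    # Needs NLTK's tokenizer and POS-tagger models (see nltk.download()).
    # Tags whose words get capitalized.
    capitalize_tags = (
        {"NNP", "NNPS"} if only_proper_nouns else
        {"NN", "NNS"}
    )
    truecased_sents = []  # List of true-cased sentences.
    # Handle each sentence separately so every sentence start is capitalized.
    for sent in nltk.sent_tokenize(text):
        # Apply POS-tagging to the lowercased tokens.
        tagged_sent = nltk.pos_tag([
            word.lower() for word in nltk.word_tokenize(sent)
        ])
        if not tagged_sent:
            continue
        # Infer capitalization from POS-tags.
        normalized_sent = [
            word.capitalize() if tag in capitalize_tags else word
            for (word, tag) in tagged_sent
        ]
        # Capitalize first word in sentence.
        normalized_sent[0] = normalized_sent[0].capitalize()
        # Remove the space that joining leaves before punctuation.
        truecased_sents.append(re.sub(
            r" (?=[.,'!?:;])",
            "",
            " ".join(normalized_sent)
        ))
    return " ".join(truecased_sents)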

This will not be perfect, especially since I don't know exactly what your data looks like, but it should give you the idea:

>>> text = "Clonazepam Has Been Approved As An Anticonvulsant To Be Manufactured In 0.5mg, 1mg And 2mg Tablets. It Is The Generic Equivalent Of Roche Laboratories' Klonopin."
>>> truecase(text)
"Clonazepam has been approved as an anticonvulsant to be manufactured in 0.5mg, 1mg and 2mg Tablets. It is the generic Equivalent of Roche Laboratories' Klonopin."
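
To see how the choice of tag set changes the result without running a tagger, the capitalization and punctuation steps can be sketched on hand-tagged tokens. The `TAGGED` list and the `apply_case` helper below are hypothetical names introduced for illustration, and the tags are written by hand as an assumption; they are not real `nltk.pos_tag` output.

```python
import re

# Hand-written (token, tag) pairs standing in for nltk.pos_tag output.
# These tags are illustrative assumptions, not real tagger output.
TAGGED = [
    ("the", "DT"), ("patient", "NN"), ("received", "VBD"),
    ("klonopin", "NNP"), ("tablets", "NNS"), (".", "."),
]


def apply_case(tagged_sent, capitalize_tags):
    """Capitalize tokens whose tag is in capitalize_tags, then join them."""
    words = [
        word.capitalize() if tag in capitalize_tags else word
        for word, tag in tagged_sent
    ]
    words[0] = words[0].capitalize()  # Sentence-initial word.
    # Drop the space that joining leaves before punctuation.
    return re.sub(r" (?=[.,'!?:;])", "", " ".join(words))


common_nouns = apply_case(TAGGED, {"NN", "NNS"})
proper_nouns = apply_case(TAGGED, {"NNP", "NNPS"})
print(common_nouns)  # The Patient received klonopin Tablets.
print(proper_nouns)  # The patient received Klonopin tablets.
```

On this toy input the proper-noun tag set gives the more natural result, but that only helps when the tagger actually assigns NNP or NNPS to the lowercased tokens, so it is worth trying both settings on a sample of your own data.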
