How do I determine the correct capitalization of words in a sentence?
I have a database of sentences written entirely in uppercase letters. The database is technical and contains medical terms, and I want to normalize it so that the capitalization is (close to) what the user expects. What is the best way to achieve this? Is there a freely available dataset I can use to help with the process?
Comments (3)
One way could be to infer capitalization from part-of-speech (POS) tagging, for example using the Python Natural Language Toolkit (NLTK).
This will not be perfect, especially because I don't know exactly what your data looks like, but it may give you the idea.
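The POS-based idea can be sketched roughly as follows. This is only an illustration, not the answerer's original code: `truecase_by_pos`, `toy_tagger`, and the tiny `TOY_TAGS` lexicon are hypothetical names made up here so the example runs without any downloads. The sketch assumes the tagger emits Penn Treebank style tags and that proper nouns (`NNP`, `NNPS`) are the words worth capitalizing.

```python
import re

# Hypothetical toy lexicon standing in for a real POS tagger (assumption).
TOY_TAGS = {"crohn": "NNP", "parkinson": "NNP"}


def toy_tagger(tokens):
    """Return (token, tag) pairs; tokens not in the lexicon get the tag 'X'."""
    return [(tok, TOY_TAGS.get(tok, "X")) for tok in tokens]


def truecase_by_pos(sentence, tagger=toy_tagger, proper_tags=("NNP", "NNPS")):
    """Lowercase the sentence, then capitalize tokens tagged as proper nouns."""
    # Split into words and single punctuation marks after lowercasing.
    tokens = re.findall(r"\w+|[^\w\s]", sentence.lower())
    words = [tok.capitalize() if tag in proper_tags else tok
             for tok, tag in tagger(tokens)]
    if words:
        # The first token of the sentence is always capitalized.
        words[0] = words[0].capitalize()
    # Re-join the tokens and drop the space left before punctuation.
    return re.sub(r"\s+([.,;:!?])", r"\1", " ".join(words))


print(truecase_by_pos("THE PATIENT HAS CROHN DISEASE."))
```

With NLTK installed and its tagger model downloaded, `nltk.pos_tag` takes a list of tokens and returns `(token, tag)` tuples, so it should be usable as the `tagger` argument. How well a general-purpose tagger recognizes proper nouns in lowercased medical text is something to check on your own data.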
Search for work on truecasing: http://en.wikipedia.org/wiki/Truecasing
If you have access to similar medical data with normal capitalization, it would be really easy to generate your own dataset: convert everything to uppercase and use the mapping back to the original text to train and test your algorithm.
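A minimal sketch of that idea, assuming whitespace-separated tokens and a simple "most frequent casing per word" baseline. The function names and the sample sentence are invented for illustration and are not from any particular truecasing library.

```python
from collections import Counter, defaultdict


def make_pairs(cased_text):
    """Turn normally cased text into (UPPERCASE, original) word pairs."""
    return [(word.upper(), word) for word in cased_text.split()]


def train(pairs):
    """Map each uppercase word to the casing it was most often seen with."""
    counts = defaultdict(Counter)
    for upper, original in pairs:
        counts[upper][original] += 1
    return {upper: seen.most_common(1)[0][0] for upper, seen in counts.items()}


def truecase(upper_text, model):
    """Restore casing word by word; unknown words fall back to lowercase."""
    return " ".join(model.get(w, w.lower()) for w in upper_text.split())


def accuracy(pairs, model):
    """Fraction of held-out words whose predicted casing matches the original."""
    if not pairs:
        return 0.0
    hits = sum(model.get(upper, upper.lower()) == original
               for upper, original in pairs)
    return hits / len(pairs)


# Hypothetical sample text standing in for a normally cased medical corpus.
model = train(make_pairs("an MRI of the brain showed no lesion and the ECG was normal"))
print(truecase("THE MRI WAS NORMAL", model))
print(accuracy(make_pairs("the ECG and MRI"), model))
```

A per-word baseline like this ignores context such as sentence position, so sentence-initial capitals in the training text will compete with the lowercase forms. It is still a useful reference point for measuring anything more elaborate.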
The easiest way to do this is to use a spelling-correction algorithm based on n-grams.
You can use, for example, the LingPipe SpellChecker. Its source code for predicting spaces between words is available, and predicting case can be done in a similar way.
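To show how n-gram context can drive the casing decision, here is a small sketch. It does not use the LingPipe API. It is a simplified greedy bigram decoder written for this example, with hypothetical function names and training sentences, whereas a real spell-correction approach would typically score whole sequences.

```python
from collections import Counter


def train_bigrams(cased_sentences):
    """Collect unigram and bigram counts over normally cased tokens."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in cased_sentences:
        tokens = ["<s>"] + sentence.split()  # "<s>" marks the sentence start
        unigrams.update(tokens[1:])
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams


def candidates(word):
    """Casing variants considered for one token: lower, Capitalized, UPPER."""
    low = word.lower()
    return [low, low.capitalize(), low.upper()]


def truecase_greedy(upper_sentence, unigrams, bigrams):
    """Pick each word's casing from left to right using the previous choice."""
    prev, out = "<s>", []
    for word in upper_sentence.split():
        # Rank by bigram count with the previous word, then by unigram count.
        # On ties the first candidate wins, so unseen words come out lowercase.
        best = max(candidates(word),
                   key=lambda c: (bigrams[(prev, c)], unigrams[c]))
        out.append(best)
        prev = best
    return " ".join(out)


# Hypothetical training sentences standing in for a cased medical corpus.
uni, bi = train_bigrams([
    "The patient has Crohn disease",
    "Treat the patient for Crohn disease",
])
print(truecase_greedy("THE PATIENT HAS CROHN DISEASE", uni, bi))
print(truecase_greedy("TREAT THE PATIENT", uni, bi))
```

Because the previous token is part of the score, the same word can receive different casing at the start of a sentence and in the middle of one, which a per-word lookup cannot do.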