Unstructured text to structured data
I am looking for references (tutorials, books, academic literature) on structuring unstructured text in a manner similar to the Google Calendar quick add button.
I understand this may come under the NLP category, but I am interested only in the process of going from something like "Levi jeans size 32 A0b293"
to: Brand: Levi, Size: 32, Category: Jeans, Code: A0b293
I imagine it would be some combination of lexical parsing and machine learning techniques.
I am fairly language agnostic, but if pushed I would prefer Python, MATLAB or C++ references.
Thanks
Comments (4)
You need to provide more information about the source of the text (the web? user input?), the domain (is it just clothes?), the potential formatting and vocabulary...
Assuming the worst-case scenario, you need to start learning NLP. A very good free book is the documentation of NLTK: http://www.nltk.org/book . It is also a very good introduction to Python, and the software is free (for various usages). Be warned: NLP is hard. It doesn't always work. It is not fun at times. The state of the art is nowhere near where you imagine it is.
Assuming a better scenario (your text is semi-structured), a good free tool is pyparsing. There is a book and plenty of examples, and the resulting code is extremely attractive.
I hope this helps...
Possibly look at "Programming Collective Intelligence" by Toby Segaran. I seem to remember it addressing the basics of this in one chapter.
After some research I have found that this problem is commonly referred to as Information Extraction. I have collected a few papers and stored them in a Mendeley Collection:
http://www.mendeley.com/research-papers/collections/3237331/Information-Extraction/
Also, as Tai Weiss noted, NLTK for Python is a good starting point, and a chapter of the NLTK book looks specifically at information extraction.
If you are only handling cases like the example you cited, you are better off using a manual rule-based approach that is 100% predictable and covers 90% of the cases it might encounter in production.
You could enumerate lists of all possible brands and categories and detect which is which in an input string, because there is usually very little intersection between the two lists.
The other two fields could easily be detected and extracted using regular expressions (1-3 digit numbers are always sizes, etc.).
Your problem domain doesn't seem big enough to warrant a more heavy-duty approach such as statistical learning.
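To make the rule-based idea concrete, here is a minimal sketch in Python using only the standard library. The `BRANDS` and `CATEGORIES` sets, the `extract` function, and the rule that a product code is any token mixing letters and digits are all assumptions made for illustration; a real system would load its vocabularies from a catalogue and would probably need more careful rules.

```python
import re

# Hypothetical vocabularies (assumed for this example only).
BRANDS = {"levi", "wrangler", "diesel"}
CATEGORIES = {"jeans", "shirt", "jacket"}

# Assumed rule from the answer: 1-3 digit numbers are sizes.
SIZE_RE = re.compile(r"^\d{1,3}$")
# Assumed rule: a code is alphanumeric with at least one letter and one digit.
CODE_RE = re.compile(r"^(?=.*[A-Za-z])(?=.*\d)[A-Za-z0-9]+$")


def extract(text):
    """Assign each whitespace-separated token to at most one field."""
    record = {}
    for token in text.split():
        lowered = token.lower()
        if lowered in BRANDS:
            record.setdefault("brand", token)
        elif lowered in CATEGORIES:
            record.setdefault("category", token.capitalize())
        elif SIZE_RE.match(token):
            record.setdefault("size", int(token))
        elif CODE_RE.match(token):
            record.setdefault("code", token)
        # Tokens that match no rule (such as the word "size") are ignored.
    return record


print(extract("Levi jeans size 32 A0b293"))
# {'brand': 'Levi', 'category': 'Jeans', 'size': 32, 'code': 'A0b293'}
```

Because `setdefault` keeps the first match for each field, later tokens never overwrite earlier ones. That keeps the behaviour predictable, at the cost of silently ignoring conflicting input.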