How do I get started with information extraction?

Posted 2024-09-25 00:44:39 · 250 words · 2 views · 0 comments

I am a newbie when it comes to information extraction. Over the past several days I have read a lot of academic papers and ordered a book on NLP. I want to figure out how to build a system like FlipDog.com (hopefully not from scratch). They extract job openings from more than 60,000 company websites. How do I get started?

I am open to learning any programming language. Has anybody used Mallet, GATE, MinorThird, or RoadRunner? Ideally, I want to train a system on a data set specific to my domain and have it extract information based on that. Which platform would you recommend for this purpose?

Thanks!

Comments (1)

少女的英雄梦 2024-10-02 00:44:39

A faster way to extract job postings is to use dapper.net (a web scraping service). You can easily teach Dapper to extract data using its visual editor. It works very well when the target websites present their data in tables.
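
If you would rather see what table scraping looks like in code, here is a minimal sketch using only Python's standard-library html.parser. It is not how Dapper works internally; the sample HTML, the column names, and the names TableRowParser and extract_jobs are all made up for illustration, and it assumes a simple, non-nested table whose first row is the header.

```python
from html.parser import HTMLParser


class TableRowParser(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []      # finished rows, each a list of cell strings
        self._row = None    # cells of the row being read, or None
        self._cell = None   # text fragments of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
            self._cell = None
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Text inside nested tags (e.g. a link in a cell) also lands here.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            # Join the fragments and collapse runs of whitespace.
            self._row.append(" ".join("".join(self._cell).split()))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None


def extract_jobs(html):
    """Return one dict per data row, keyed by the header row's cell text."""
    parser = TableRowParser()
    parser.feed(html)
    parser.close()
    if not parser.rows:
        return []
    header, *data = parser.rows
    return [dict(zip(header, row)) for row in data]


# Hypothetical careers-page fragment, made up for illustration.
SAMPLE_HTML = """
<table>
  <tr><th>Title</th><th>Location</th></tr>
  <tr><td>Software Engineer</td><td>Boston, MA</td></tr>
  <tr><td>Data Analyst</td><td><a href="/jobs/2">Austin, TX</a></td></tr>
</table>
"""

jobs = extract_jobs(SAMPLE_HTML)
for job in jobs:
    print(job["Title"], "|", job["Location"])
```

Real company pages are far messier than this (nested tables, missing headers, listings that are not in tables at all), which is where a visual tool or a trained extractor starts to pay off.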

To learn information extraction, I suggest starting with LingPipe. It is a Java framework for information extraction, so you do not need to learn the architecture-specific features of a larger framework such as GATE or Apache UIMA. The LingPipe website has many tutorials that will help you learn various information extraction approaches. After that, I suggest learning GATE and UIMA.

If you want to build such a website, you will also need to learn how to use web crawler frameworks (e.g., Nutch), web search engines (Yahoo, Google, Bing), and information retrieval engines (such as Apache Lucene) to provide a search service on top of the extracted data.
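
To give a rough idea of what "a search service on top of the extracted data" means, here is a toy inverted index in plain Python. This is only a sketch of the general idea behind an information retrieval engine, not Lucene's API or implementation; the JobIndex class, the tokenize helper, and the sample job records are hypothetical.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


class JobIndex:
    """Toy inverted index: token -> set of ids of the jobs containing it."""

    def __init__(self):
        self.jobs = {}                     # job_id -> extracted record
        self.postings = defaultdict(set)   # token -> ids of matching jobs

    def add(self, job_id, job):
        """Store a record and index the tokens of all its field values."""
        self.jobs[job_id] = job
        for value in job.values():
            for token in tokenize(value):
                self.postings[token].add(job_id)

    def search(self, query):
        """Return sorted ids of jobs containing every query token (AND)."""
        tokens = tokenize(query)
        if not tokens:
            return []
        # .get() avoids creating empty postings for unknown tokens.
        result = set(self.postings.get(tokens[0], set()))
        for token in tokens[1:]:
            result &= self.postings.get(token, set())
        return sorted(result)


# Hypothetical extracted records, made up for illustration.
index = JobIndex()
index.add(1, {"title": "Software Engineer", "location": "Boston, MA"})
index.add(2, {"title": "Data Engineer", "location": "Austin, TX"})
index.add(3, {"title": "Data Analyst", "location": "Boston, MA"})

for job_id in index.search("data boston"):
    print(job_id, index.jobs[job_id])
```

A real engine adds ranking, stemming, phrase queries, and on-disk storage on top of this basic token-to-document mapping, which is why you would use an existing one rather than write your own.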

Update:

For Python, the best place to start is NLTK: http://www.nltk.org/
