How do I get started with Information Extraction?

Posted on 2024-07-13 19:40:37


Could you recommend a training path to get started and become very good at Information Extraction? I started reading about it for one of my hobby projects and soon realized that I would have to be good at math (algebra, statistics, probability). I have read some of the introductory books on different math topics (and it's so much fun). Looking for some guidance. Please help.

Update: Just to answer one of the comments: I am more interested in Text Information Extraction.


Comments (8)

爱,才寂寞 2024-07-20 19:40:37


Just to answer one of the comments: I am more interested in Text Information Extraction.

Depending on the nature of your project, natural language processing and computational linguistics can both come in handy: they provide tools to measure and extract features from the textual information, and to apply training, scoring, or classification.

Good introductory books include O'Reilly's Programming Collective Intelligence (the chapters on searching and ranking, document filtering, and maybe decision trees).
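To give a flavour of the document-filtering idea (this is not the book's own code): a minimal naive Bayes sketch using scikit-learn, on a made-up spam/ham corpus.

```python
# A minimal sketch of the kind of document filtering Programming Collective
# Intelligence covers, here using scikit-learn's naive Bayes classifier
# rather than the book's hand-rolled one. The tiny corpus is made up.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

train_docs = [
    "cheap meds buy now limited offer",
    "win money fast click here",
    "meeting agenda for the project review",
    "lunch tomorrow to discuss the draft paper",
]
train_labels = ["spam", "spam", "ham", "ham"]

# bag-of-words counts feeding a multinomial naive Bayes model
classifier = make_pipeline(CountVectorizer(), MultinomialNB())
classifier.fit(train_docs, train_labels)

print(classifier.predict(["click now to win cheap money"]))   # likely ['spam']
print(classifier.predict(["agenda for tomorrow's meeting"]))  # likely ['ham']
```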

Suggested projects for applying this knowledge: POS (part-of-speech) tagging, and named entity recognition (the ability to recognize names, places, and dates in plain text). You can use Wikipedia as a training corpus, since most of the target information is already extracted into infoboxes; this might give you a limited amount of measurement feedback.
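For a concrete starting point, here is a minimal POS tagging and NER sketch using NLTK (the toolkit recommended in another answer below). The example sentence is invented, and the exact resource names NLTK expects can differ between versions.

```python
# A minimal sketch of POS tagging and shallow named entity recognition with
# NLTK. Resource names may vary by NLTK version; the sentence is made up.
import nltk

# one-time downloads of the required models/corpora
for pkg in ("punkt", "averaged_perceptron_tagger", "maxent_ne_chunker", "words"):
    nltk.download(pkg, quiet=True)

sentence = "Tim Berners-Lee proposed the World Wide Web at CERN in 1989."

tokens = nltk.word_tokenize(sentence)   # split into words
tagged = nltk.pos_tag(tokens)           # POS tags, e.g. ('CERN', 'NNP')
tree = nltk.ne_chunk(tagged)            # shallow NER over the POS tags

# walk the chunk tree and print recognized entities (PERSON, ORGANIZATION, GPE, ...)
for subtree in tree:
    if hasattr(subtree, "label"):
        entity = " ".join(word for word, tag in subtree.leaves())
        print(subtree.label(), "->", entity)
```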

The other big hammer in IE is search, a field not to be underestimated. Again, O'Reilly's book provides some introduction to basic ranking; once you have a large corpus of indexed text, you can do some real IE tasks with it. Check out Peter Norvig's Theorizing from Data as a starting point and a very good motivator; maybe you could reimplement some of its results as a learning exercise.
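As a rough illustration of basic ranking over an indexed corpus (not Norvig's or the book's method): TF-IDF vectors plus cosine similarity via scikit-learn, on a made-up three-document corpus.

```python
# A rough sketch of basic ranking over a small "indexed" corpus:
# TF-IDF vectors plus cosine similarity, via scikit-learn.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

corpus = [
    "Information extraction turns unstructured text into structured records.",
    "Named entity recognition finds people, places, and dates in text.",
    "Search engines rank documents by relevance to a query.",
]

vectorizer = TfidfVectorizer(stop_words="english")
doc_vectors = vectorizer.fit_transform(corpus)          # one row per document

query_vector = vectorizer.transform(["rank documents for a search query"])
scores = cosine_similarity(query_vector, doc_vectors)[0]

# print documents from most to least relevant
for idx in scores.argsort()[::-1]:
    print(f"{scores[idx]:.3f}  {corpus[idx]}")
```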

As a forewarning, I think I'm obligated to tell you that information extraction is hard. The first 80% of any given task is usually trivial; however, the difficulty of each additional percentage point for IE tasks usually grows exponentially, in both development and research time. It's also quite underdocumented: most of the high-quality information is currently in obscure white papers (Google Scholar is your friend), so do check them out once you've had your hands burned a couple of times. But most importantly, do not let these obstacles put you off; there are certainly big opportunities to make progress in this area.

冷清清 2024-07-20 19:40:37


I would recommend the excellent book Introduction to Information Retrieval by Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze. It covers a broad area of issues which form a great and up-to-date (2008) basis for Information Extraction and is available online in full text (under the given link).

仲春光 2024-07-20 19:40:37


I would suggest you take a look at the Natural Language Toolkit (nltk) and the NLTK Book. Both are available for free and are great learning tools.

心房敞 2024-07-20 19:40:37


You don't need to be good at math to do IE. Just understand how the algorithms work, experiment on the cases for which you need optimal result performance and the scale at which you need to achieve your target accuracy level, and work with that. You are basically working with algorithms, programming, and aspects of CS/AI/machine-learning theory, not writing a PhD thesis on building a new machine-learning algorithm where you have to convince someone, by way of mathematical principles, why the algorithm works, so I totally disagree with that notion. There is a difference between practice and theory: as we all know, mathematicians are stuck more on theory than on the practicability of algorithms for producing workable business solutions. You would, however, need to do some background reading, both books on NLP and journal papers, to find out what people have found from their results.

IE is a very context-specific domain, so you would first need to define in what context you are trying to extract information. How would you define this information? What is your structured model? Suppose you are extracting from semi-structured and unstructured data sets. You would then also want to weigh up whether you want to approach your IE with a standard human approach, which involves things like regular expressions and pattern matching, or with statistical machine-learning approaches like Markov chains. You can even look at hybrid approaches.
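A toy sketch of the "standard human approach" mentioned above, assuming a narrow context of email addresses and dates; the regular expressions are illustrative only and nowhere near production-grade.

```python
# A toy example of the rule-based ("standard human") approach: hand-written
# regular expressions for a narrowly defined context. The patterns below are
# illustrative only and will miss many real-world variants.
import re

text = (
    "Contact alice@example.org before 2024-07-13, "
    "or bob@example.com no later than 12/31/2024."
)

EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
DATE = re.compile(r"\b(?:\d{4}-\d{2}-\d{2}|\d{1,2}/\d{1,2}/\d{4})\b")

record = {
    "emails": EMAIL.findall(text),
    "dates": DATE.findall(text),
}
print(record)
# {'emails': ['alice@example.org', 'bob@example.com'],
#  'dates': ['2024-07-13', '12/31/2024']}
```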

A standard process model you can follow to do your extraction is to adapt a data/text mining approach:

Pre-processing: define and standardize your data for extraction from various or specific sources, and cleanse it.

Segmentation/classification/clustering/association: your black box, where most of the extraction work will be done.

Post-processing: cleanse your data back into whatever form you want to store it in, or represent it as information.

Also, you need to understand the difference between data and information, because you can reuse the information you discover as a data source to build further information maps/trees/graphs. It is all very contextualized.

Standard steps: input -> process -> output
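A bare-bones sketch of that input -> process -> output skeleton; the cleaning rules and the keyword "classifier" in the middle stage are placeholders for whatever your real black box does.

```python
# A bare-bones version of the input -> process -> output skeleton described
# above. The cleaning rules and the keyword "classifier" in the middle stage
# are placeholders for whatever your real black box does.
import json
import re


def preprocess(raw: str) -> str:
    """Standardize the input: strip markup-ish noise and normalize whitespace."""
    no_tags = re.sub(r"<[^>]+>", " ", raw)
    return re.sub(r"\s+", " ", no_tags).strip().lower()


def process(text: str) -> dict:
    """Black box: here, a trivial keyword-based topic guess plus a token count."""
    topic = "finance" if any(w in text for w in ("price", "market", "stock")) else "other"
    return {"topic": topic, "tokens": len(text.split())}


def postprocess(record: dict) -> str:
    """Represent the extracted record as information (JSON here; could be XML/RDF)."""
    return json.dumps(record)


raw_input = "<p>Stock   market prices fell sharply on Monday.</p>"
print(postprocess(process(preprocess(raw_input))))
# {"topic": "finance", "tokens": 7}
```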

If you are using Java/C++ there are loads of frameworks and libraries available you can work with.

Perl would be an excellent language to do your NLP extraction work with if you want to do a lot of standard text extraction.

You may want to represent your data as XML, or even as RDF graphs (Semantic Web), and for your defined contextual model you can build up relationship and association graphs that will most likely change as you make more and more extraction requests. Deploy it as a RESTful service, since you want to treat it as a resource for documents. You can even link it to taxonomized data sets and faceted searching, say using Solr.
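A small sketch of serializing an extracted record as XML using only the standard library; the element names and the record itself are made up, and an RDF graph or a Solr index would be the next step beyond this.

```python
# A small sketch of representing extracted records as XML using only the
# standard library. The element names and the record itself are made up.
import xml.etree.ElementTree as ET

record = {"person": "Ada Lovelace", "date": "1843", "relation": "published notes on"}

doc = ET.Element("extraction", source="hypothetical-corpus")
fact = ET.SubElement(doc, "fact")
for key, value in record.items():
    ET.SubElement(fact, key).text = value

print(ET.tostring(doc, encoding="unicode"))
# <extraction source="hypothetical-corpus"><fact><person>Ada Lovelace</person>...
```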

Good sources to read are:

  • Handbook of Computational Linguistics and Natural Language Processing
  • Foundations of Statistical Natural Language Processing
  • Information Extraction Applications in Prospect
  • An Introduction to Language Processing with Perl and Prolog
  • Speech and Language Processing (Jurafsky)
  • Text Mining Application Programming
  • The Text Mining Handbook
  • Taming Text
  • Algorithms of the Intelligent Web
  • Building Search Applications
  • IEEE Journal

Make sure you do a thorough evaluation before deploying such applications/algorithms into production, as they can recursively increase your data storage requirements. You could use AWS/Hadoop for clustering, Mahout for large-scale classification, amongst others. Store your datasets in MongoDB, or unstructured dumps in Jackrabbit, etc. Try experimenting with prototypes first. There are various archives you can use to base your training on, say the Reuters corpus, TIPSTER, TREC, etc. You can even check out AlchemyAPI, GATE, UIMA, OpenNLP, etc.

Building extractions from standard text is easier than from, say, a web document, so representation at the pre-processing step becomes even more crucial for defining exactly what you are trying to extract from a standardized document representation.

Standard measures include precision, recall, and the F1 measure, amongst others.
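Those measures are easy to compute once you have a gold set and a predicted set; a small helper, with invented example sets:

```python
# The standard definitions, as a small helper. The gold/predicted sets in the
# example are invented; in practice they would come from a labelled test set.
def precision_recall_f1(predicted: set, gold: set):
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return precision, recall, f1


gold = {"Paris", "1889", "Gustave Eiffel"}
predicted = {"Paris", "Gustave Eiffel", "France"}

p, r, f = precision_recall_f1(predicted, gold)
print(f"precision={p:.2f} recall={r:.2f} f1={f:.2f}")
# precision=0.67 recall=0.67 f1=0.67
```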

云朵有点甜 2024-07-20 19:40:37


I disagree with the people who recommend reading Programming Collective Intelligence. If you want to do anything of even moderate complexity, you need to be good at applied math, and PCI gives you a false sense of confidence. For example, when it talks about SVMs, it just says that libSVM is a good way of implementing them.

Now, libSVM is definitely a good package, but who cares about packages? What you need to know is why SVMs give the terrific results they do, and how they are fundamentally different from the Bayesian way of thinking (and why Vapnik is a legend).

IMHO, there is no single solution to it. You should have a good grip on linear algebra, probability, and Bayesian theory. Bayes, I should add, is as important for this as oxygen is for human beings (that's a little exaggerated, but you get what I mean, right?). Also, get a good grip on machine learning. Just using other people's work is perfectly fine, but the moment you want to know why something was done the way it was, you will have to know something about ML.

Check these two for that:

http://pindancing.blogspot.com/2010/01/learning-about-machine-learniing.html

http://measuringmeasures.com/blog/2010/1/15/learning-about-statistical-learning.html

http://measuringmeasures.com/blog/2010/3/12/learning-about-machine-learning-2nd-ed.html

Okay, now that's three of them :)

稀香 2024-07-20 19:40:37


The Wikipedia Information Extraction article is a quick introduction.

At a more academic level, you might want to skim a paper like Integrating Probabilistic Extraction Models and Data Mining to Discover Relations and Patterns in Text.

短叹 2024-07-20 19:40:37


Take a look here if you need an enterprise-grade NER service. Developing an NER system (and training sets) is a very time-consuming task that requires a high level of skill.

纸伞微斜 2024-07-20 19:40:37


This is a little off topic, but you might want to read Programming Collective Intelligence from O'Reilly. It deals indirectly with text information extraction, and it doesn't assume much of a math background.
