How do I get started with Information Extraction?

Posted on 2024-07-13 19:40:37


Could you recommend a training path to get started and become very good at Information Extraction? I started reading about it for one of my hobby projects and soon realized that I would have to be good at math (algebra, statistics, probability). I have read some of the introductory books on different math topics (and it's so much fun). Looking for some guidance. Please help.

Update: Just to answer one of the comments: I am more interested in Text Information Extraction.


Comments (8)

爱,才寂寞 2024-07-20 19:40:37


Just to answer one of the comments: I am more interested in Text Information Extraction.

Depending on the nature of your project, natural language processing and computational linguistics can both come in handy: they provide tools to measure and extract features from the textual information, and to apply training, scoring, or classification.

Good introductory books include O'Reilly's Programming Collective Intelligence (the chapters on searching and ranking, document filtering, and maybe decision trees).
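To give a flavour of the document-filtering idea (this is not the book's own code): a minimal naive Bayes sketch using scikit-learn, on a made-up spam/ham corpus.

```python
# A minimal sketch of the kind of document filtering Programming Collective
# Intelligence covers, here using scikit-learn's naive Bayes classifier
# rather than the book's hand-rolled one. The tiny corpus is made up.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

train_docs = [
    "cheap meds buy now limited offer",
    "win money fast click here",
    "meeting agenda for the project review",
    "lunch tomorrow to discuss the draft paper",
]
train_labels = ["spam", "spam", "ham", "ham"]

# bag-of-words counts feeding a multinomial naive Bayes model
classifier = make_pipeline(CountVectorizer(), MultinomialNB())
classifier.fit(train_docs, train_labels)

print(classifier.predict(["click now to win cheap money"]))   # likely ['spam']
print(classifier.predict(["agenda for tomorrow's meeting"]))  # likely ['ham']
```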

Suggested projects for applying this knowledge: POS (part-of-speech) tagging, and named entity recognition (the ability to recognize names, places, and dates in plain text). You can use Wikipedia as a training corpus, since most of the target information is already extracted into infoboxes; this might give you a limited amount of measurement feedback.
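For a concrete starting point, here is a minimal POS tagging and NER sketch using NLTK (the toolkit recommended in another answer below). The example sentence is invented, and the exact resource names NLTK expects can differ between versions.

```python
# A minimal sketch of POS tagging and shallow named entity recognition with
# NLTK. Resource names may vary by NLTK version; the sentence is made up.
import nltk

# one-time downloads of the required models/corpora
for pkg in ("punkt", "averaged_perceptron_tagger", "maxent_ne_chunker", "words"):
    nltk.download(pkg, quiet=True)

sentence = "Tim Berners-Lee proposed the World Wide Web at CERN in 1989."

tokens = nltk.word_tokenize(sentence)   # split into words
tagged = nltk.pos_tag(tokens)           # POS tags, e.g. ('CERN', 'NNP')
tree = nltk.ne_chunk(tagged)            # shallow NER over the POS tags

# walk the chunk tree and print recognized entities (PERSON, ORGANIZATION, GPE, ...)
for subtree in tree:
    if hasattr(subtree, "label"):
        entity = " ".join(word for word, tag in subtree.leaves())
        print(subtree.label(), "->", entity)
```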

The other big hammer in IE is search, a field not to be underestimated. Again, O'Reilly's book provides some introduction to basic ranking; once you have a large corpus of indexed text, you can do some real IE tasks with it. Check out Peter Norvig's Theorizing from Data as a starting point and a very good motivator; maybe you could reimplement some of its results as a learning exercise.
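As a rough illustration of basic ranking over an indexed corpus (not Norvig's or the book's method): TF-IDF vectors plus cosine similarity via scikit-learn, on a made-up three-document corpus.

```python
# A rough sketch of basic ranking over a small "indexed" corpus:
# TF-IDF vectors plus cosine similarity, via scikit-learn.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

corpus = [
    "Information extraction turns unstructured text into structured records.",
    "Named entity recognition finds people, places, and dates in text.",
    "Search engines rank documents by relevance to a query.",
]

vectorizer = TfidfVectorizer(stop_words="english")
doc_vectors = vectorizer.fit_transform(corpus)          # one row per document

query_vector = vectorizer.transform(["rank documents for a search query"])
scores = cosine_similarity(query_vector, doc_vectors)[0]

# print documents from most to least relevant
for idx in scores.argsort()[::-1]:
    print(f"{scores[idx]:.3f}  {corpus[idx]}")
```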

As a forewarning, I think I'm obligated to tell you that information extraction is hard. The first 80% of any given task is usually trivial; however, the difficulty of each additional percentage point for IE tasks usually grows exponentially, in both development and research time. It's also quite underdocumented: most of the high-quality information is currently in obscure white papers (Google Scholar is your friend), so do check them out once you've had your hands burned a couple of times. But most importantly, do not let these obstacles put you off; there are certainly big opportunities to make progress in this area.

冷清清 2024-07-20 19:40:37


I would recommend the excellent book Introduction to Information Retrieval by Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze. It covers a broad area of issues which form a great and up-to-date (2008) basis for Information Extraction and is available online in full text (under the given link).

仲春光 2024-07-20 19:40:37


I would suggest you take a look at the Natural Language Toolkit (nltk) and the NLTK Book. Both are available for free and are great learning tools.

心房敞 2024-07-20 19:40:37


You don't need to be good at math to do IE. Just understand how the algorithms work, experiment on the cases for which you need optimal result performance and the scale at which you need to achieve your target accuracy level, and work with that. You are basically working with algorithms, programming, and aspects of CS/AI/machine-learning theory, not writing a PhD thesis on building a new machine-learning algorithm where you have to convince someone, by way of mathematical principles, why the algorithm works, so I totally disagree with that notion. There is a difference between practice and theory: as we all know, mathematicians are stuck more on theory than on the practicability of algorithms for producing workable business solutions. You would, however, need to do some background reading, both books on NLP and journal papers, to find out what people have found from their results.

IE is a very context-specific domain, so you would first need to define in what context you are trying to extract information. How would you define this information? What is your structured model? Suppose you are extracting from semi-structured and unstructured data sets. You would then also want to weigh up whether you want to approach your IE with a standard human approach, which involves things like regular expressions and pattern matching, or with statistical machine-learning approaches like Markov chains. You can even look at hybrid approaches.
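A toy sketch of the "standard human approach" mentioned above, assuming a narrow context of email addresses and dates; the regular expressions are illustrative only and nowhere near production-grade.

```python
# A toy example of the rule-based ("standard human") approach: hand-written
# regular expressions for a narrowly defined context. The patterns below are
# illustrative only and will miss many real-world variants.
import re

text = (
    "Contact alice@example.org before 2024-07-13, "
    "or bob@example.com no later than 12/31/2024."
)

EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
DATE = re.compile(r"\b(?:\d{4}-\d{2}-\d{2}|\d{1,2}/\d{1,2}/\d{4})\b")

record = {
    "emails": EMAIL.findall(text),
    "dates": DATE.findall(text),
}
print(record)
# {'emails': ['alice@example.org', 'bob@example.com'],
#  'dates': ['2024-07-13', '12/31/2024']}
```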

A standard process model you can follow to do your extraction is to adapt a data/text mining approach:

Pre-processing: define and standardize your data for extraction from various or specific sources, and cleanse it.

Segmentation/classification/clustering/association: your black box, where most of the extraction work will be done.

Post-processing: cleanse your data back into whatever form you want to store it in, or represent it as information.

Also, you need to understand the difference between data and information, because you can reuse the information you discover as a data source to build further information maps/trees/graphs. It is all very contextualized.

Standard steps: input -> process -> output
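A bare-bones sketch of that input -> process -> output skeleton; the cleaning rules and the keyword "classifier" in the middle stage are placeholders for whatever your real black box does.

```python
# A bare-bones version of the input -> process -> output skeleton described
# above. The cleaning rules and the keyword "classifier" in the middle stage
# are placeholders for whatever your real black box does.
import json
import re


def preprocess(raw: str) -> str:
    """Standardize the input: strip markup-ish noise and normalize whitespace."""
    no_tags = re.sub(r"<[^>]+>", " ", raw)
    return re.sub(r"\s+", " ", no_tags).strip().lower()


def process(text: str) -> dict:
    """Black box: here, a trivial keyword-based topic guess plus a token count."""
    topic = "finance" if any(w in text for w in ("price", "market", "stock")) else "other"
    return {"topic": topic, "tokens": len(text.split())}


def postprocess(record: dict) -> str:
    """Represent the extracted record as information (JSON here; could be XML/RDF)."""
    return json.dumps(record)


raw_input = "<p>Stock   market prices fell sharply on Monday.</p>"
print(postprocess(process(preprocess(raw_input))))
# {"topic": "finance", "tokens": 7}
```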

If you are using Java/C++ there are loads of frameworks and libraries available you can work with.

Perl would be an excellent language to do your NLP extraction work with if you want to do a lot of standard text extraction.

You may want to represent your data as XML, or even as RDF graphs (Semantic Web), and for your defined contextual model you can build up relationship and association graphs that will most likely change as you make more and more extraction requests. Deploy it as a RESTful service, since you want to treat it as a resource for documents. You can even link it to taxonomized data sets and faceted searching, say using Solr.
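A small sketch of serializing an extracted record as XML using only the standard library; the element names and the record itself are made up, and an RDF graph or a Solr index would be the next step beyond this.

```python
# A small sketch of representing extracted records as XML using only the
# standard library. The element names and the record itself are made up.
import xml.etree.ElementTree as ET

record = {"person": "Ada Lovelace", "date": "1843", "relation": "published notes on"}

doc = ET.Element("extraction", source="hypothetical-corpus")
fact = ET.SubElement(doc, "fact")
for key, value in record.items():
    ET.SubElement(fact, key).text = value

print(ET.tostring(doc, encoding="unicode"))
# <extraction source="hypothetical-corpus"><fact><person>Ada Lovelace</person>...
```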

Good sources to read are:

  • Handbook of Computational Linguistics and Natural Language Processing
  • Foundations of Statistical Natural Language Processing
  • Information Extraction Applications in Prospect
  • An Introduction to Language Processing with Perl and Prolog
  • Speech and Language Processing (Jurafsky)
  • Text Mining Application Programming
  • The Text Mining Handbook
  • Taming Text
  • Algorithms of the Intelligent Web
  • Building Search Applications
  • IEEE Journal

Make sure you do a thorough evaluation before deploying such applications/algorithms into production, as they can recursively increase your data storage requirements. You could use AWS/Hadoop for clustering, Mahout for large-scale classification, amongst others. Store your datasets in MongoDB, or unstructured dumps in Jackrabbit, etc. Try experimenting with prototypes first. There are various archives you can use to base your training on, say the Reuters corpus, TIPSTER, TREC, etc. You can even check out AlchemyAPI, GATE, UIMA, OpenNLP, etc.

Building extractions from standard text is easier than from, say, a web document, so representation at the pre-processing step becomes even more crucial for defining exactly what you are trying to extract from a standardized document representation.

Standard measures include precision, recall, and the F1 measure, amongst others.
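Those measures are easy to compute once you have a gold set and a predicted set; a small helper, with invented example sets:

```python
# The standard definitions, as a small helper. The gold/predicted sets in the
# example are invented; in practice they would come from a labelled test set.
def precision_recall_f1(predicted: set, gold: set):
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return precision, recall, f1


gold = {"Paris", "1889", "Gustave Eiffel"}
predicted = {"Paris", "Gustave Eiffel", "France"}

p, r, f = precision_recall_f1(predicted, gold)
print(f"precision={p:.2f} recall={r:.2f} f1={f:.2f}")
# precision=0.67 recall=0.67 f1=0.67
```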

云朵有点甜 2024-07-20 19:40:37


I disagree with the people who recommend reading Programming Collective Intelligence. If you want to do anything of even moderate complexity, you need to be good at applied math, and PCI gives you a false sense of confidence. For example, when it talks about SVMs, it just says that libSVM is a good way of implementing them.

Now, libSVM is definitely a good package, but who cares about packages? What you need to know is why SVMs give the terrific results they do, and how they are fundamentally different from the Bayesian way of thinking (and why Vapnik is a legend).

IMHO, there is no single solution to it. You should have a good grip on linear algebra, probability, and Bayesian theory. Bayes, I should add, is as important for this as oxygen is for human beings (that's a little exaggerated, but you get what I mean, right?). Also, get a good grip on machine learning. Just using other people's work is perfectly fine, but the moment you want to know why something was done the way it was, you will have to know something about ML.

Check these two for that:

http://pindancing.blogspot.com/2010/01/learning-about-machine-learniing.html

http://measuringmeasures.com/blog/2010/1/15/learning-about-statistical-learning.html

http://measuringmeasures.com/blog/2010/3/12/learning-about-machine-learning-2nd-ed.html

Okay, now that's three of them :)

稀香 2024-07-20 19:40:37


The Wikipedia Information Extraction article is a quick introduction.

At a more academic level, you might want to skim a paper like Integrating Probabilistic Extraction Models and Data Mining to Discover Relations and Patterns in Text.

短叹 2024-07-20 19:40:37


Take a look here if you need an enterprise-grade NER service. Developing an NER system (and training sets) is a very time-consuming task that requires a high level of skill.

纸伞微斜 2024-07-20 19:40:37


This is a little off topic, but you might want to read Programming Collective Intelligence from O'Reilly. It deals indirectly with text information extraction, and it doesn't assume much of a math background.
