Machine Learning and Natural Language Processing

Posted on 2024-08-21 06:32:39

Comments (9)

£冰雨忧蓝° 2024-08-28 06:32:39

This related Stack Overflow question has some nice answers: What are good starting points for someone interested in natural language processing?

This is a very big field. The prerequisites mostly consist of probability/statistics, linear algebra, and basic computer science, although Natural Language Processing requires a more intensive computer science background to start with (frequently covering some basic AI). Regarding specific languages: Lisp was created "as an afterthought" for doing AI research, while Prolog (with its roots in formal logic) is especially aimed at Natural Language Processing. Many courses will use Prolog, Scheme, Matlab, R, or a functional language (e.g. OCaml is used for this course at Cornell), as they are well suited to this kind of analysis.

Here are some more specific pointers:

For Machine Learning, Stanford CS 229: Machine Learning is great: it includes everything, including full videos of the lectures (also up on iTunes), course notes, problem sets, etc., and it was very well taught by Andrew Ng.

Note the prerequisites:

Students are expected to have the following background:
Knowledge of basic computer science principles and skills, at a level sufficient to write a reasonably non-trivial computer program.
Familiarity with the basic probability theory.
Familiarity with the basic linear algebra.

The course uses Matlab and/or Octave. It also recommends the following readings (although the course notes themselves are very complete):

For Natural Language Processing, the NLP group at Stanford provides many good resources. The introductory course Stanford CS 224: Natural Language Processing includes all the lectures online (http://www.stanford.edu/class/cs224n/syllabus.html) and has the following prerequisites:

Adequate experience with programming and formal structures. Programming projects will be written in Java 1.5, so knowledge of Java (or a willingness to learn on your own) is required.
Knowledge of standard concepts in artificial intelligence and/or computational linguistics.
Basic familiarity with logic, vector spaces, and probability.

Some recommended texts are:

The prerequisite computational linguistics course requires basic computer programming and data structures knowledge, and uses the same textbooks. The required artificial intelligence course is also available online along with all the lecture notes and uses:

This is the standard Artificial Intelligence text and is also worth reading.

I use R for machine learning myself and really recommend it. For this, I would suggest looking at The Elements of Statistical Learning, for which the full text is available online for free. You may want to refer to the Machine Learning and Natural Language Processing task views on CRAN for specific functionality.

小嗷兮 2024-08-28 06:32:39

My recommendation would be any or all of these, depending on how deep you want to go and your area of interest:

The Oxford Handbook of Computational Linguistics (source: oup.com)

Foundations of Statistical Natural Language Processing

Introduction to Information Retrieval

攀登最高峰 2024-08-28 06:32:39

String algorithms, including suffix trees. Calculus and linear algebra. Various kinds of statistics. Artificial intelligence optimization algorithms. Data clustering techniques... and a million other things. This is a very active field right now, and which of these you need depends on what you intend to do.
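To give a feel for the string-algorithm side, here is a minimal sketch of substring search over a suffix array, a simpler relative of the suffix tree mentioned above (it is not a suffix tree itself). The function names `build_suffix_array` and `find_occurrences` are made up for this illustration, and the construction is deliberately naive, so treat it as a toy for short strings rather than something to run on a corpus.

```python
def build_suffix_array(text):
    """Return the start positions of all suffixes of text, in sorted order."""
    # Naive construction: sort suffix start positions by the suffix itself.
    # Simple to read, but far too slow for large texts.
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_occurrences(text, suffix_array, pattern):
    """Return the sorted start positions at which pattern occurs in text."""
    # Binary search for the first suffix whose leading characters are >= pattern.
    lo, hi = 0, len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        start = suffix_array[mid]
        if text[start:start + len(pattern)] < pattern:
            lo = mid + 1
        else:
            hi = mid
    # All suffixes starting with pattern sit next to each other from here on.
    matches = []
    while lo < len(suffix_array):
        start = suffix_array[lo]
        if text[start:start + len(pattern)] != pattern:
            break
        matches.append(start)
        lo += 1
    return sorted(matches)


text = "banana"
suffix_array = build_suffix_array(text)
print(suffix_array)                                 # [5, 3, 1, 0, 4, 2]
print(find_occurrences(text, suffix_array, "ana"))  # [1, 3]
```

Once the array is built, each lookup is a binary search plus a scan over the matches, which is the main reason these index structures show up in text processing.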

It doesn't really matter what language you choose to work in. Python, for instance, has NLTK, which is a pretty nice free package for tinkering with computational linguistics.

锦爱 2024-08-28 06:32:39

I would say probability & statistics is the most important prerequisite. Gaussian Mixture Models (GMMs) and Hidden Markov Models (HMMs) in particular are very important in both machine learning and natural language processing (of course, these subjects may be part of the course if it is introductory).
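As a rough illustration of why HMMs lean on probability, here is a small sketch of the Viterbi algorithm, which finds the most likely hidden state sequence for a series of observations. The two-tag model below (`STATES`, `START_P`, `TRANS_P`, `EMIT_P`) is entirely made up for the example; the numbers are not estimated from any data.

```python
def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return (probability, path) of the most likely hidden state sequence.

    Assumes observations is non-empty and every probability is in the tables.
    """
    # best[t][s] = (probability of the best path ending in state s at step t,
    #               the state that path was in at step t - 1)
    best = [{s: (start_p[s] * emit_p[s][observations[0]], None) for s in states}]
    for obs in observations[1:]:
        column = {}
        for s in states:
            column[s] = max(
                (best[-1][p][0] * trans_p[p][s] * emit_p[s][obs], p)
                for p in states
            )
        best.append(column)
    # Pick the best final state, then follow the back-pointers to the start.
    last = max(states, key=lambda s: best[-1][s][0])
    probability = best[-1][last][0]
    path = [last]
    for t in range(len(best) - 1, 0, -1):
        path.append(best[t][path[-1]][1])
    path.reverse()
    return probability, path


# Hypothetical toy part-of-speech model, for illustration only.
STATES = ("NOUN", "VERB")
START_P = {"NOUN": 0.8, "VERB": 0.2}
TRANS_P = {
    "NOUN": {"NOUN": 0.3, "VERB": 0.7},
    "VERB": {"NOUN": 0.6, "VERB": 0.4},
}
EMIT_P = {
    "NOUN": {"dogs": 0.7, "bark": 0.3},
    "VERB": {"dogs": 0.1, "bark": 0.9},
}

probability, tags = viterbi(["dogs", "bark"], STATES, START_P, TRANS_P, EMIT_P)
print(tags)  # ['NOUN', 'VERB']
```

With these invented numbers the best path has probability of about 0.35. A real tagger would estimate the tables from an annotated corpus and work in log space to avoid underflow on long sentences.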

Then, I would say basic CS knowledge is also helpful, for example Algorithms, Formal Languages and basic Complexity theory.

Oo萌小芽oO 2024-08-28 06:32:39

The Stanford CS 224: Natural Language Processing course that was already mentioned also includes videos online (in addition to other course materials). The videos aren't linked from the course website, so many people may not notice them.

南城旧梦 2024-08-28 06:32:39

Jurafsky and Martin's Speech and Language Processing http://www.amazon.com/Speech-Language-Processing-Daniel-Jurafsky/dp/0131873210/ is very good. Unfortunately the draft second edition chapters are no longer free online now that it's been published :(

Also, if you're a decent programmer it's never too early to toy around with NLP programs. NLTK (Python) comes to mind. It has a book, published by O'Reilly I think, that you can read free online.

大海や 2024-08-28 06:32:39

How about Markdown and an Introduction to Parsing Expression Grammars (PEG) posted by cletus on his site cforcoding?

ANTLR seems like a good place to start for natural language processing. I'm no expert though.

↙温凉少女 2024-08-28 06:32:39

Broad question, but I certainly think that a knowledge of finite state automata and hidden Markov models would be useful. That requires knowledge of statistical learning, Bayesian parameter estimation, and entropy.

Latent semantic indexing is a tool that has recently become common in many machine learning problems. Some of the methods are rather easy to understand. There are a bunch of potential basic projects.

  1. Find co-occurrences in text corpora for document/paragraph/sentence clustering.
  2. Classify the mood of a text corpus.
  3. Automatically annotate or summarize a document.
  4. Find relationships among separate documents to automatically generate a "graph" among the documents.
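The first project idea can be started with nothing but the standard library. The sketch below counts how often two words appear in the same sentence; `cooccurrence_counts` is a name invented for this example, and splitting on whitespace is a stand-in for a proper tokenizer.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(sentences):
    """Count how often each unordered word pair shares a sentence."""
    counts = Counter()
    for sentence in sentences:
        # Lowercase and split on whitespace; punctuation is not handled here.
        # Sorting the unique words gives every pair one canonical order.
        words = sorted(set(sentence.lower().split()))
        counts.update(combinations(words, 2))
    return counts


corpus = ["the cat sat", "the cat ran", "a dog ran"]
counts = cooccurrence_counts(corpus)
print(counts[("cat", "the")])  # 2
print(counts[("cat", "dog")])  # 0
```

The resulting counts can be arranged into a word-by-word matrix, which is the usual input for the clustering step.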

EDIT: Nonnegative matrix factorization (NMF) is a tool that has grown considerably in popularity due to its simplicity and effectiveness. It's easy to understand. I currently research the use of NMF for music information retrieval; NMF has been shown to be useful for latent semantic indexing of text corpora as well. Here is one paper (PDF).
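To show how little machinery NMF needs, here is a pure-Python sketch using the multiplicative update rules of Lee and Seung, which are one common way to fit the factorization (not necessarily the method used in the paper above). It approximates a nonnegative matrix V as the product of two smaller nonnegative matrices W and H. The helper names and the tiny term-document matrix are assumptions made up for this example; real work would use a numerical library.

```python
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def frobenius_error(v, w, h):
    """Squared reconstruction error between V and W times H."""
    wh = matmul(w, h)
    return sum(
        (v[i][j] - wh[i][j]) ** 2 for i in range(len(v)) for j in range(len(v[0]))
    )


def nmf(v, rank, iterations=200, seed=0, eps=1e-9):
    """Factor nonnegative V (n x m) into W (n x rank) and H (rank x m)."""
    rng = random.Random(seed)
    n, m = len(v), len(v[0])
    # Strictly positive random start, so the updates never divide by zero.
    w = [[rng.random() + 0.1 for _ in range(rank)] for _ in range(n)]
    h = [[rng.random() + 0.1 for _ in range(m)] for _ in range(rank)]
    for _ in range(iterations):
        # H <- H * (W^T V) / (W^T W H), element-wise
        wt = transpose(w)
        num = matmul(wt, v)
        den = matmul(matmul(wt, w), h)
        h = [[h[i][j] * num[i][j] / (den[i][j] + eps) for j in range(m)]
             for i in range(rank)]
        # W <- W * (V H^T) / (W H H^T), element-wise
        ht = transpose(h)
        num = matmul(v, ht)
        den = matmul(w, matmul(h, ht))
        w = [[w[i][j] * num[i][j] / (den[i][j] + eps) for j in range(rank)]
             for i in range(n)]
    return w, h


# Hypothetical term-document counts: rows are terms, columns are documents.
V = [
    [2, 2, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 3, 3],
    [0, 0, 1, 1],
]
W, H = nmf(V, rank=2)
print(frobenius_error(V, W, H))
```

In the text setting, each column of W can be read as a "topic" over terms and each column of H as a document's mix of topics, which is what makes the result usable for latent semantic indexing.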

南巷近海 2024-08-28 06:32:39

Prolog will only help them academically, and it is limited to logic constraints and semantic NLP based work. Prolog is not yet an industry-friendly language, so it is not yet practical in the real world. Matlab is also an academic tool; unless they are doing a lot of scientific or quantitative work, they wouldn't really have much need for it. To start off, they might want to pick up the 'Norvig' book and enter the world of AI to get a grounding in all the areas. They should understand some basic probability, statistics, databases, operating systems, and data structures, and most likely have an understanding of and experience with a programming language. They need to be able to prove to themselves why AI techniques work and where they don't. Then they can look at specific areas like machine learning and NLP in further detail. In fact, the Norvig book lists references after every chapter, so they already have a lot of further reading available. There is also a lot of reference material available on the internet and in books and journal papers for guidance. Don't just read the book: try to build tools in a programming language, then extrapolate 'meaningful' results. Did the learning algorithm actually learn as expected? If it didn't, why was this the case, and how could it be fixed?
