Machine learning, artificial intelligence and computational linguistics
I would love to talk to people who have experience in machine learning, computational linguistics, or artificial intelligence in general, framed by the following example:
• Which existing software would you use for a manageable attempt at building something like Google Translate with statistical linguistics and machine learning?
(Don't get me wrong, I don't actually want to build this; I'm only trying to draw a conceptual framework for one of the most complex problems in this field. What would you think of if you had the chance to lead a team set out to realize such a system?)
• Which existing database(s)? Which database technology should be used to store the results when they amount to terabytes of data?
• Which programming languages besides C++?
• Apache Mahout?
• And, how would those software components work together to power the effort as a whole?
Comments (5)
If your only goal is to build software that translates, then I would just use the Google Language API: it's free so why reinvent the wheel? If your goal is to build a translator similar to Google's for the sake of getting familiar with machine learning, then you're on the wrong path... try a simpler problem.
Update:
Depends on the size of your corpus: if it's ginormous, then I would go with Hadoop (since you mentioned Mahout)... otherwise go with a standard database (SQL Server, MySQL, etc.).
Original:
I'm not sure which databases you can use for this, but if all else fails you can use Google Translate to build your own database... however, the latter will introduce a bias towards Google's translator, and any errors that Google makes will cause your software to (at the very least) have the same errors.
Whatever you're most comfortable with... certainly C++ is an option, but you might have an easier time with Java or C#. Developing in Java and C# is much faster since there is A LOT of functionality built into those languages right from the start.
If you have an enormous data set... you could use it.
Update:
In general, if the size of your corpus is really big, then I would definitely use a robust combination like Mahout/Hadoop. Both of them are built exactly for that purpose, and you would have a really hard time "duplicating" all of their work unless you do have a huge team behind you.
It seems that you are in fact trying to familiarize yourself with machine learning... I would try something MUCH simpler: build a language detector instead of a translator. I recently built one and found that the most useful thing you can do is build character n-grams (bigrams and trigrams combined worked best). You would then use the n-grams as input to a standard machine learning algorithm (like C4.5, GP, GA, a Bayesian model, etc.) and perform 10-fold cross-validation to minimize overfitting.
Update:
My example was pretty simple: I have an SQL Server database with documents that are already labeled with a language, I load all the data into memory (several hundred documents), and I feed each document to the algorithm (C4.5). The algorithm uses a custom function to extract the document features (letter bigrams and trigrams), then runs its standard learning process and spits out a model. I then test the model against a testing data set to verify its accuracy.
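For concreteness, here is a minimal sketch of that pipeline in Python, assuming scikit-learn; the toy sentences stand in for a real labeled corpus, and a decision tree stands in for C4.5. Only three folds are used because the toy corpus is tiny; on real data you would use ten, as suggested above.

```python
# Minimal sketch of the language-detector pipeline described above.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.tree import DecisionTreeClassifier  # rough stand-in for C4.5
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score

# Toy labeled corpus: placeholders for the hundreds of documents
# you would load from a database in practice.
docs = [
    "the quick brown fox jumps over the lazy dog",
    "machine learning is a wide and active field",
    "this sentence is written in plain english",
    "le renard brun saute par dessus le chien paresseux",
    "ceci est une phrase ecrite en francais",
    "l'apprentissage automatique est un vaste domaine",
    "der schnelle braune fuchs springt ueber den hund",
    "dieser satz ist auf deutsch geschrieben",
    "maschinelles lernen ist ein weites feld",
]
labels = ["en"] * 3 + ["fr"] * 3 + ["de"] * 3

# Character bigrams and trigrams as features, as suggested above.
model = make_pipeline(
    CountVectorizer(analyzer="char_wb", ngram_range=(2, 3)),
    DecisionTreeClassifier(random_state=0),
)

# 3-fold here only because the toy corpus is tiny; use cv=10 on real data.
scores = cross_val_score(model, docs, labels, cv=3)
print("mean accuracy:", scores.mean())
```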
In your case, with terabytes of data, it seems that you should use Mahout with Hadoop. Additionally, the components you're going to be using are well defined in the Mahout/Hadoop architecture, so it should be pretty self-explanatory from there on.
With regards to language choice, at least for prototyping, I would suggest Python. It is enjoying a lot of success in natural language processing and comes with a large library of tools for scientific computing, text analysis, and machine learning. Last but not least, it is really easy to call compiled code (C, C++) if you want to benefit from existing tools.
Specifically, have a look at the following modules:
NLTK, natural language toolkit
scikits.learn, machine learning in Python
Olivier Grisel's presentation on text mining with these tools can come in handy.
Disclaimer: I am one of the core developers of scikits.learn.
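To give a feel for how the two libraries fit together, here is a tiny sketch (assuming `pip install nltk scikit-learn`; note that scikits.learn is distributed today under the name scikit-learn): NLTK handles tokenization and n-grams, scikit-learn turns a small corpus into feature vectors.

```python
# Tiny sketch combining the two modules recommended above.
from nltk.tokenize import wordpunct_tokenize  # regex tokenizer, no data download needed
from nltk.util import bigrams
from sklearn.feature_extraction.text import TfidfVectorizer

text = "Google Translate is built on statistical machine translation."

# NLTK: tokenization and word bigrams
tokens = wordpunct_tokenize(text.lower())
print(list(bigrams(tokens)))

# scikit-learn: turn a small corpus into TF-IDF feature vectors
corpus = [text, "Machine learning needs lots of training data."]
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)
print("feature matrix shape:", X.shape)
print("a few features:", sorted(vectorizer.vocabulary_)[:5])
```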
• Which existing database(s)? Which database technology should be used to store the results when they amount to terabytes of data?
HBase, ElasticSearch, MongoDB
• Which programming languages besides C++?
For ML, other popular languages are Scala, Java, and Python.
• Apache Mahout?
Sometimes useful; doing it on pure Hadoop means more coding.
• And, how would those software components work together to power the effort as a whole?
There are many statistical machine learning algorithms that can be parallelized with MapReduce, and the results can then be stored in a NoSQL database.
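As an illustration of that pattern (not actual Hadoop code), the sketch below simulates the map and reduce steps with Python's multiprocessing to count word bigrams in parallel; the commented-out lines at the end show how the results could be written to MongoDB, with hypothetical connection details.

```python
# MapReduce-style bigram counting, simulated with multiprocessing.
from collections import Counter
from multiprocessing import Pool

def map_bigrams(doc):
    """Map step: emit bigram counts for one document."""
    words = doc.lower().split()
    return Counter(zip(words, words[1:]))

def reduce_counts(partials):
    """Reduce step: merge the per-document counts."""
    total = Counter()
    for c in partials:
        total.update(c)
    return total

if __name__ == "__main__":
    docs = [
        "the cat sat on the mat",
        "the dog sat on the rug",
        "the cat chased the dog",
    ]
    with Pool() as pool:                       # map step runs in parallel
        partial_counts = pool.map(map_bigrams, docs)
    totals = reduce_counts(partial_counts)     # reduce step merges results
    print(totals.most_common(5))

    # Storing the results in a NoSQL database (hypothetical local MongoDB):
    # from pymongo import MongoClient
    # coll = MongoClient("mongodb://localhost:27017")["mt"]["bigrams"]
    # coll.insert_many({"bigram": list(k), "count": v} for k, v in totals.items())
```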
The best techniques available for automated translation are based on statistical methods. In computer science this is known as "Machine Translation" or MT. The idea is to treat the text to be translated as a noisy signal and to use error correction to "fix" it. For example, suppose you are translating English to French. Assume the English sentence was originally French but came out as English; you have to fix it up to restore the French. A statistical language model can be built for the target language (French) and for the errors. Errors could include dropped words, moved words, misspelled words, and added words.
More can be found at http://www.statmt.org/
Regarding the db, an MT solution does not need a typical db. Everything should be done in memory.
The best language to use for this specific task is the fastest one. C would be ideal for this problem because it is fast and makes it easy to control memory access. But any high-level language could be used, such as Perl, C#, Java, Python, etc.
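As a toy illustration of the noisy-channel idea described above: pick the French candidate f that maximizes P(f) · P(e | f), i.e. the product of the target-language model and the error (translation) model. All candidates and probabilities below are made-up placeholders, not a real model.

```python
# Toy noisy-channel decoding: argmax over candidate translations.
import math

# Target-language model: P(f) for a few candidate French sentences.
language_model = {
    "le chat noir": 0.04,
    "le noir chat": 0.001,   # unlikely word order in French
}

# Error/translation model: P(e | f) for the observed English sentence.
translation_model = {
    ("the black cat", "le chat noir"): 0.30,
    ("the black cat", "le noir chat"): 0.35,  # word-for-word, slightly higher
}

def decode(english):
    """Return the candidate f maximizing log P(f) + log P(e | f)."""
    best, best_score = None, float("-inf")
    for f, p_f in language_model.items():
        p_e_given_f = translation_model.get((english, f), 1e-9)
        score = math.log(p_f) + math.log(p_e_given_f)
        if score > best_score:
            best, best_score = f, score
    return best

print(decode("the black cat"))  # -> "le chat noir": the language model wins
```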
Google's TensorFlow is a useful tool for basic translation. Anyone who is truly bilingual knows, however, that translating is not a statistical process. It is a much more complicated process that has merely been simplified so that 90% of it seems correct.
Immense parallelism will make a great difference, so the advent of quantum computing, and maybe some of the ideas from it, will make the next 8% possible.
The final 2% will match normal professional translators and interpreters.