Machine learning, artificial intelligence and computational linguistics
I would love to talk to people who have experience in machine learning, computational linguistics, or artificial intelligence in general, framed by the following example:
• Which existing software would you use for a manageable attempt at building something like Google Translate with statistical linguistics and machine learning?
(Don't get me wrong, I don't actually want to build this; I'm only trying to draw a conceptual framework for one of the most complex problems in this field. What would you think of if you had the chance to lead a team set out to realize such a system?)
• Which existing database(s)? Which database technology should be used to store the results when they amount to terabytes of data?
• Which programming languages besides C++?
• Apache Mahout?
• And, how would those software components work together to power the effort as a whole?
Comments (5)
If your only goal is to build software that translates, then I would just use the Google Language API: it's free so why reinvent the wheel? If your goal is to build a translator similar to Google's for the sake of getting familiar with machine learning, then you're on the wrong path... try a simpler problem.
Update:
Depends on the size of your corpus: if it's ginormous, then I would go with Hadoop (since you mentioned Mahout)... otherwise go with a standard database (SQL Server, MySQL, etc.).
Original:
I'm not sure which databases you can use for this, but if all else fails you can use Google Translate to build your own database... however, the latter will introduce a bias towards Google's translator, and any errors that Google makes will cause your software to (at the very least) have the same errors.
Whatever you're most comfortable with... certainly C++ is an option, but you might have an easier time with Java or C#. Developing in Java and C# is much faster since there is A LOT of functionality built into those languages right from the start.
If you have an enormous data set... you could use it.
Update:
In general, if the size of your corpus is really big, then I would definitely use a robust combination like Mahout/Hadoop. Both of them are built exactly for that purpose, and you would have a really hard time "duplicating" all of their work unless you do have a huge team behind you.
It seems that you are in fact trying to familiarize yourself with machine learning... I would try something MUCH simpler: build a language detector instead of a translator. I recently built one and found that the most useful thing you can do is build character n-grams (bigrams and trigrams combined worked best). You would then use the n-grams as input to a standard machine learning algorithm (like C4.5, GP, GA, a Bayesian model, etc.) and perform 10-fold cross-validation to minimize overfitting.
Update:
My example was pretty simple: I have an SQL Server database with documents that are already labeled with a language, I load all the data into memory (several hundred documents), and I feed each document to the algorithm (C4.5). The algorithm uses a custom function to extract the document features (letter bigrams and trigrams), then runs its standard learning process and spits out a model. I then test the model against a testing data set to verify its accuracy.
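For concreteness, here is a minimal sketch of that pipeline in Python, assuming scikit-learn; the toy sentences stand in for a real labeled corpus, and a decision tree stands in for C4.5. Only three folds are used because the toy corpus is tiny; on real data you would use ten, as suggested above.

```python
# Minimal sketch of the language-detector pipeline described above.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.tree import DecisionTreeClassifier  # rough stand-in for C4.5
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score

# Toy labeled corpus: placeholders for the hundreds of documents
# you would load from a database in practice.
docs = [
    "the quick brown fox jumps over the lazy dog",
    "machine learning is a wide and active field",
    "this sentence is written in plain english",
    "le renard brun saute par dessus le chien paresseux",
    "ceci est une phrase ecrite en francais",
    "l'apprentissage automatique est un vaste domaine",
    "der schnelle braune fuchs springt ueber den hund",
    "dieser satz ist auf deutsch geschrieben",
    "maschinelles lernen ist ein weites feld",
]
labels = ["en"] * 3 + ["fr"] * 3 + ["de"] * 3

# Character bigrams and trigrams as features, as suggested above.
model = make_pipeline(
    CountVectorizer(analyzer="char_wb", ngram_range=(2, 3)),
    DecisionTreeClassifier(random_state=0),
)

# 3-fold here only because the toy corpus is tiny; use cv=10 on real data.
scores = cross_val_score(model, docs, labels, cv=3)
print("mean accuracy:", scores.mean())
```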
In your case, with terabytes of data, it seems that you should use Mahout with Hadoop. Additionally, the components you're going to be using are well defined in the Mahout/Hadoop architecture, so it should be pretty self-explanatory from there on.
With regards to language choice, at least for prototyping, I would suggest Python. It is enjoying a lot of success in natural language processing and comes with a large library of tools for scientific computing, text analysis, and machine learning. Last but not least, it is really easy to call compiled code (C, C++) if you want to benefit from existing tools.
Specifically, have a look at the following modules:
NLTK, natural language toolkit
scikits.learn, machine learning in Python
Olivier Grisel's presentation on text mining with these tools can come in handy.
Disclaimer: I am one of the core developers of scikits.learn.
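To give a feel for how the two libraries fit together, here is a tiny sketch (assuming `pip install nltk scikit-learn`; note that scikits.learn is distributed today under the name scikit-learn): NLTK handles tokenization and n-grams, scikit-learn turns a small corpus into feature vectors.

```python
# Tiny sketch combining the two modules recommended above.
from nltk.tokenize import wordpunct_tokenize  # regex tokenizer, no data download needed
from nltk.util import bigrams
from sklearn.feature_extraction.text import TfidfVectorizer

text = "Google Translate is built on statistical machine translation."

# NLTK: tokenization and word bigrams
tokens = wordpunct_tokenize(text.lower())
print(list(bigrams(tokens)))

# scikit-learn: turn a small corpus into TF-IDF feature vectors
corpus = [text, "Machine learning needs lots of training data."]
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)
print("feature matrix shape:", X.shape)
print("a few features:", sorted(vectorizer.vocabulary_)[:5])
```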
• Which existing database(s)? Which database technology should be used to store the results when they amount to terabytes of data?
HBase, ElasticSearch, MongoDB
• Which programming languages besides C++?
For ML, other popular languages are Scala, Java, and Python.
• Apache Mahout?
Sometimes useful; doing it on pure Hadoop means more coding.
• And, how would those software components work together to power the effort as a whole?
There are many statistical machine learning algorithms that can be parallelized with MapReduce, and the results can then be stored in a NoSQL database.
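As an illustration of that pattern (not actual Hadoop code), the sketch below simulates the map and reduce steps with Python's multiprocessing to count word bigrams in parallel; the commented-out lines at the end show how the results could be written to MongoDB, with hypothetical connection details.

```python
# MapReduce-style bigram counting, simulated with multiprocessing.
from collections import Counter
from multiprocessing import Pool

def map_bigrams(doc):
    """Map step: emit bigram counts for one document."""
    words = doc.lower().split()
    return Counter(zip(words, words[1:]))

def reduce_counts(partials):
    """Reduce step: merge the per-document counts."""
    total = Counter()
    for c in partials:
        total.update(c)
    return total

if __name__ == "__main__":
    docs = [
        "the cat sat on the mat",
        "the dog sat on the rug",
        "the cat chased the dog",
    ]
    with Pool() as pool:                       # map step runs in parallel
        partial_counts = pool.map(map_bigrams, docs)
    totals = reduce_counts(partial_counts)     # reduce step merges results
    print(totals.most_common(5))

    # Storing the results in a NoSQL database (hypothetical local MongoDB):
    # from pymongo import MongoClient
    # coll = MongoClient("mongodb://localhost:27017")["mt"]["bigrams"]
    # coll.insert_many({"bigram": list(k), "count": v} for k, v in totals.items())
```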
The best techniques available for automated translation are based on statistical methods. In computer science this is known as "Machine Translation" or MT. The idea is to treat the text to be translated as a noisy signal and to use error correction to "fix" it. For example, suppose you are translating English to French. Assume the English sentence was originally French but came out as English; you have to fix it up to restore the French. A statistical language model can be built for the target language (French) and for the errors. Errors could include dropped words, moved words, misspelled words, and added words.
More can be found at http://www.statmt.org/
Regarding the db, an MT solution does not need a typical db. Everything should be done in memory.
The best language to use for this specific task is the fastest one. C would be ideal for this problem because it is fast and makes it easy to control memory access. But any high-level language could be used, such as Perl, C#, Java, Python, etc.
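As a toy illustration of the noisy-channel idea described above: pick the French candidate f that maximizes P(f) · P(e | f), i.e. the product of the target-language model and the error (translation) model. All candidates and probabilities below are made-up placeholders, not a real model.

```python
# Toy noisy-channel decoding: argmax over candidate translations.
import math

# Target-language model: P(f) for a few candidate French sentences.
language_model = {
    "le chat noir": 0.04,
    "le noir chat": 0.001,   # unlikely word order in French
}

# Error/translation model: P(e | f) for the observed English sentence.
translation_model = {
    ("the black cat", "le chat noir"): 0.30,
    ("the black cat", "le noir chat"): 0.35,  # word-for-word, slightly higher
}

def decode(english):
    """Return the candidate f maximizing log P(f) + log P(e | f)."""
    best, best_score = None, float("-inf")
    for f, p_f in language_model.items():
        p_e_given_f = translation_model.get((english, f), 1e-9)
        score = math.log(p_f) + math.log(p_e_given_f)
        if score > best_score:
            best, best_score = f, score
    return best

print(decode("the black cat"))  # -> "le chat noir": the language model wins
```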
Google's TensorFlow is a useful tool for basic translation. Anyone who is truly bilingual knows, however, that translating is not a statistical process. It is a much more complicated process that has merely been simplified so that 90% of it seems correct.
Immense parallelism will make a great difference, so the advent of quantum computing, and maybe some of the ideas from it, will make the next 8% possible.
The final 2% will match normal professional translators and interpreters.