Both are good. Java has a lot of momentum behind it in text processing. Stanford's text processing system, OpenNLP, UIMA, and GATE seem to be the big players (I know I am missing some). You can run the StanfordNLP module on a large corpus after only a few minutes of playing with it. However, it has major memory requirements (3 GB or so when I was using it).
NLTK, Gensim, Pattern, and many other Python modules are very good at text processing. Their memory usage and performance are very reasonable.
Python scales up because text processing is a very easily scalable problem: documents can usually be processed independently of one another. You can use multiprocessing very easily when parsing, tagging, chunking, or extracting documents. Once you get your text into any sort of feature vector, you can use numpy arrays, and we all know how great numpy is...
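As a rough illustration of the multiprocessing point, here is a minimal sketch that counts tokens per document in worker processes and then lines the counts up as equal-length rows. The regex tokenizer and all function names are my own stand-ins, not anything from NLTK or the other libraries; in practice you would swap in a real tagger or parser.

```python
import re
from collections import Counter
from multiprocessing import Pool

# Assumption: a crude regex tokenizer stands in for a real NLP pipeline step.
TOKEN_RE = re.compile(r"[a-z0-9]+")


def count_tokens(text):
    """Tokenize one document and count its tokens (runs in a worker process)."""
    return Counter(TOKEN_RE.findall(text.lower()))


def build_vocabulary(counts):
    """Map every token seen in any document to a fixed column index."""
    tokens = sorted(set().union(*counts))
    return {token: index for index, token in enumerate(tokens)}


def to_rows(counts, vocabulary):
    """Turn per-document counts into equal-length rows of integers."""
    rows = []
    for doc_counts in counts:
        row = [0] * len(vocabulary)
        for token, n in doc_counts.items():
            row[vocabulary[token]] = n
        rows.append(row)
    return rows


def vectorize(docs, processes=2):
    """Count tokens for each document in parallel, then build the rows."""
    with Pool(processes=processes) as pool:
        counts = pool.map(count_tokens, docs)
    vocabulary = build_vocabulary(counts)
    return vocabulary, to_rows(counts, vocabulary)


if __name__ == "__main__":
    docs = ["The cat sat on the mat.", "The dog sat."]
    vocabulary, rows = vectorize(docs)
    print(vocabulary)
    print(rows)
```

Because every row has the same length, passing `rows` to `numpy.array` would give you a document-by-term matrix to work with from there.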
I learned with NLTK, and Python has helped me greatly in reducing development time, so I suggest you give that a shot first. They have a very helpful mailing list as well, which I suggest you join.
If you have custom scripts, you might want to check out how well they perform with PyPy.
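One way to check this is to time the same script under each interpreter. The sketch below is only an assumption about how you might set that up: `word_lengths` is a hypothetical stand-in for your own routine, and the sample text is made up.

```python
import sys
import timeit


def word_lengths(text):
    """Hypothetical stand-in for a custom text-processing routine."""
    return [len(word) for word in text.split()]


def best_time(func, arg, repeat=3, number=20):
    """Return the fastest per-call time in seconds over several runs."""
    timer = timeit.Timer(lambda: func(arg))
    return min(timer.repeat(repeat=repeat, number=number)) / number


if __name__ == "__main__":
    sample = "the quick brown fox jumps over the lazy dog " * 10000
    # Print which interpreter ran the script so the results can be compared.
    print(sys.implementation.name, sys.version.split()[0])
    print("best per-call time: %.6f s" % best_time(word_lengths, sample))
```

Save it under any file name you like and run it once with CPython and once with PyPy. Keep in mind that PyPy's JIT needs some warm-up, so very short runs may not show it at its best.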
Just write it. The biggest flaw programmers have is premature optimization. Work on a project, write it out, and get it working. Then go back, fix the bugs, and make sure it's optimized. There are going to be a number of people harping on about the speed of x vs. y and how y is better than x, but at the end of the day it's just a language. It's not what the language is but how you use it.
It's very difficult to answer questions like this without trying. So why don't you
I've done this in the past, and it's really the way to see whether something performs well enough for your needs.
It's not the language you have to evaluate, but the frameworks and app servers for clustering, data storage/retrieval, etc. that are available for the language.
You can use Jython to get all the Java enterprise technologies for a high-load system and still do the text parsing in Python.