Comparison of programming languages for web data mining tasks
I need some help comparing different programming languages, such as C++, Java, Python, Ruby, and PHP, for a task related to web data mining (developing a web crawler, string manipulation, etc.). I have a bit of experience with PHP, and I think its advantages for this particular task are simple syntax, in-depth string parsing capabilities, networking functions, and portability, but I don't know much about the other languages and their advantages and disadvantages for this task.
Comments (3)
The specific language will not matter nearly as much as your familiarity with it. These days, all high-level languages come with the basics. Unless you need it to be super-fast (you're probably going to be limited by download speed, not by the speed at which you parse the HTML) or have other constraints not listed, the language won't matter that much.
Just make sure that you use the libraries. In particular, use an HTML parsing library that copes well with invalid markup (not an XML parser), and regular expressions where appropriate.
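To illustrate the point about invalid markup, here is a minimal Python sketch using only the standard library. The sample markup, the `LinkCollector` class, and the helper names are made up for illustration; a dedicated library such as BeautifulSoup or html5lib would usually be a better fit for real pages. It shows a strict XML parser rejecting tag soup that a lenient HTML parser handles, with a regular expression used for a simple text pattern.

```python
import re
import xml.etree.ElementTree as ET
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects href values from <a> tags, tolerating malformed markup."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


# Hypothetical tag soup: unclosed <p>, <br> and <b>, missing </html>.
BROKEN_HTML = (
    '<html><body><p>Price: $42<br><p><b>See '
    '<a href="/item/7">item 7</a></body>'
)


def xml_parser_accepts(markup):
    """Return True if a strict XML parser can parse the markup."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


def extract_links(markup):
    """Extract link targets with the lenient HTML parser."""
    collector = LinkCollector()
    collector.feed(markup)
    collector.close()
    return collector.links


def extract_prices(text):
    """Use a regular expression where a simple pattern is enough."""
    return [int(amount) for amount in re.findall(r"\$(\d+)", text)]


if __name__ == "__main__":
    print(xml_parser_accepts(BROKEN_HTML))
    print(extract_links(BROKEN_HTML))
    print(extract_prices(BROKEN_HTML))
```

The same split applies in most languages: a forgiving HTML parser for structure, regular expressions only for small, flat patterns inside the extracted text.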
As a previous post implies, being familiar with the language makes a big difference. I would also say look at what the language was originally designed to do; it gives a good idea of what it's best at.
PHP - designed for server-side scripting, not really ideal for this use.
Perl - designed to pull text apart (a good start) and has excellent libraries; look at LWP and the modules under HTML such as HTML::TreeBuilder. A good choice, with an unrivalled selection of modules to plug in.
Python - a good choice; look at BeautifulSoup and urllib.
Ruby - also a good choice; look at Hpricot. A lot less mature than Perl or Python in terms of modules available.
I have written quite a bit of web spider/data mining software and have always used Perl. If I were starting from scratch today I might choose Python.
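For a rough idea of what the Python route looks like, here is a small breadth-first crawler sketch built on urllib and the standard-library HTML parser. It is not taken from any of the posts above: the function names, the `max_pages` limit, and the same-host restriction are assumptions made for illustration, and a real crawler would also need politeness delays, robots.txt handling, and better error handling.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse
from urllib.request import urlopen


class _LinkParser(HTMLParser):
    """Collects raw href values from <a> tags."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def fetch_with_urllib(url, timeout=10):
    """Default fetcher: download a page and decode it as text."""
    with urlopen(url, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


def crawl(start_url, fetch=fetch_with_urllib, max_pages=10):
    """Breadth-first crawl restricted to the start URL's host.

    Returns a dict mapping each visited URL to the absolute links found on it.
    The fetch argument is injectable so the logic can be tried offline.
    """
    host = urlparse(start_url).netloc
    queue = deque([start_url])
    seen = {start_url}
    site_map = {}
    while queue and len(site_map) < max_pages:
        url = queue.popleft()
        try:
            html = fetch(url)
        except OSError:
            continue  # skip pages that fail to download
        parser = _LinkParser()
        parser.feed(html)
        # Resolve relative links and drop #fragments.
        links = [urldefrag(urljoin(url, h)).url for h in parser.hrefs]
        site_map[url] = links
        for link in links:
            if urlparse(link).netloc == host and link not in seen:
                seen.add(link)
                queue.append(link)
    return site_map
```

Calling `crawl("http://example.com/")` would fetch up to ten pages on that host. Passing a custom `fetch` function makes it possible to test the traversal against canned pages without touching the network.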
Google's first crawler was written in Python 1.5.
I'm no expert on other languages, but I would go with Python and html5lib or BeautifulSoup.