We don’t allow questions seeking recommendations for software libraries, tutorials, tools, books, or other off-site resources. You can edit the question so it can be answered with facts and citations.
Closed 8 years ago.
If you're really talking about large scale, you'll probably want something that lets you scale horizontally, e.g., a Map-Reduce framework like Hadoop. You can write Hadoop jobs in a number of languages, so you're not tied to Java. Here's an article on writing Hadoop jobs in Python, for instance. Incidentally, Python is probably the language I'd use, thanks to libraries like httplib2 for making the requests and lxml for parsing the results.
If a Map-Reduce framework is overkill, you could keep it in Python and use multiprocessing.
UPDATE:
If you don't want a MapReduce framework and you prefer a different language, check out the ThreadPoolExecutor in Java. I would definitely use the Apache Commons HTTP client for the requests, though. The HTTP classes in the JDK proper are far less programmer-friendly.
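To make the "keep it in Python and use multiprocessing" suggestion from the first answer concrete, here is a minimal sketch, not a definitive implementation. It uses `multiprocessing.pool.ThreadPool`, the thread-backed pool from the multiprocessing package, on the assumption that the work is dominated by waiting on HTTP responses. The `fetch` function is a hypothetical stand-in that returns canned HTML so the example runs offline; in practice it would issue a real request (for example with httplib2). The `extract_title`, `scrape`, and `scrape_all` names are illustrative as well.

```python
from multiprocessing.pool import ThreadPool


def fetch(url):
    """Hypothetical stand-in for an HTTP request; returns canned HTML."""
    return "<html><head><title>page for %s</title></head></html>" % url


def extract_title(html):
    """Pull the text between <title> and </title> with plain string search."""
    start = html.index("<title>") + len("<title>")
    end = html.index("</title>", start)
    return html[start:end]


def scrape(url):
    """Fetch one page and return (url, title)."""
    return url, extract_title(fetch(url))


def scrape_all(urls, workers=4):
    """Run scrape() over all URLs on a pool of worker threads.

    pool.map returns results in the same order as the input URLs.
    """
    with ThreadPool(workers) as pool:
        return pool.map(scrape, urls)


if __name__ == "__main__":
    for url, title in scrape_all(["http://example.com/a", "http://example.com/b"]):
        print(url, "->", title)
```

If parsing turns out to be the CPU-heavy part, swapping `ThreadPool` for `multiprocessing.Pool` should spread the work across processes instead, provided the worker function lives in an importable module.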
You should probably use tools built for testing web applications (WatiN or Selenium).
You can then compose your workflow separately from the data using a tool I've written:
https://github.com/leblancmeneses/RobustHaven.IntegrationTests
You shouldn't have to do any manual parsing when using WatiN or Selenium. You'll write a CSS querySelector instead.
Using TopShelf and NServiceBus, you can scale the number of workers horizontally.
FYI: With Mono, the tools I mention can run on Linux (although your mileage may vary).
If JavaScript doesn't need to be evaluated to load data dynamically:
Anything requiring the document to be loaded into memory is going to waste time. If you know where your tag is, all you need is a SAX parser.
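As a rough illustration of the SAX point above, here is a minimal Python sketch using the standard library's `xml.sax`. It streams through the input and collects the text of one tag without ever building a document tree. The `TagTextHandler` class and `extract_tag_text` function are hypothetical names, and the sketch assumes the input is well-formed XML; real-world HTML often is not and would need a more forgiving parser. Nested occurrences of the same tag are not handled.

```python
import xml.sax


class TagTextHandler(xml.sax.ContentHandler):
    """Collects the text content of every element with a given tag name."""

    def __init__(self, tag):
        super().__init__()
        self.tag = tag
        self.inside = False   # True while we are between <tag> and </tag>
        self.buffer = []      # text chunks of the element being read
        self.results = []     # finished text of each matching element

    def startElement(self, name, attrs):
        if name == self.tag:
            self.inside = True
            self.buffer = []

    def characters(self, content):
        # SAX may deliver one text node in several chunks, so accumulate.
        if self.inside:
            self.buffer.append(content)

    def endElement(self, name):
        if name == self.tag and self.inside:
            self.results.append("".join(self.buffer))
            self.inside = False


def extract_tag_text(xml_bytes, tag):
    """Stream-parse xml_bytes and return the text of each <tag> element."""
    handler = TagTextHandler(tag)
    xml.sax.parseString(xml_bytes, handler)
    return handler.results


if __name__ == "__main__":
    doc = b"<items><price>10</price><name>widget</name><price>20</price></items>"
    print(extract_tag_text(doc, "price"))
```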
I do something similar using Java with the Commons HttpClient library. I avoid the DOM parser, though, because I'm looking for a specific tag that can easily be found with a regex.
The slowest part of the operation is making the HTTP requests.
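The regex approach described above was done in Java. The sketch below shows the same idea in Python, purely as an illustration. The choice of `<title>` as the target tag and the `find_title` name are assumptions, not what the answer's author used. A pattern like this only works when the tag's shape is known and simple; it is not a general HTML parser.

```python
import re

# Non-greedy match of whatever sits between <title ...> and </title>.
# IGNORECASE covers <TITLE>; DOTALL lets the text span multiple lines.
TITLE_RE = re.compile(r"<title[^>]*>(.*?)</title>", re.IGNORECASE | re.DOTALL)


def find_title(html):
    """Return the stripped title text, or None if no title tag is found."""
    match = TITLE_RE.search(html)
    return match.group(1).strip() if match else None


if __name__ == "__main__":
    print(find_title("<html><head><TITLE> Example page </TITLE></head></html>"))
```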
What about C++? There are many large-scale libraries that can help you.
Boost.Asio can handle the networking.
TinyXML can parse XML files.
I don't know much about the database side, but almost all databases have C++ interfaces, so that shouldn't be a problem.