Python strategies for large-scale analysis (on-the-fly or deferred)

Posted 2024-12-07 11:54:50

To analyze a large number of websites or a large volume of financial data and extract parametric data, what is the best strategy?

I'm classifying the following strategies as either "on-the-fly" or "deferred". Which is best?

  1. On-the-fly: process the data as it arrives and store the parametric data in a database
  2. Deferred: store all the source data as ASCII files on the file system and post-process it later, either as a batch job or with a data-processing daemon
  3. Deferred: store all pages as BLOBs in a database and post-process them later, either as a batch job or with a data-processing daemon
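
To make strategy 3 concrete, here is a minimal sketch using only the standard-library `sqlite3` module. The table names (`pages`, `params`), the `processed` flag, and the `extract` callback are all hypothetical names chosen for illustration, not part of any existing project.

```python
import sqlite3

def make_db():
    """Create an in-memory database with a raw-page table and a results table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE pages (id INTEGER PRIMARY KEY, url TEXT, body BLOB, processed INTEGER)"
    )
    conn.execute("CREATE TABLE params (page_id INTEGER, value REAL)")
    return conn

def store_page(conn, url, raw_bytes):
    """Deferred step 1: save the raw page untouched, with no parsing yet."""
    conn.execute(
        "INSERT INTO pages (url, body, processed) VALUES (?, ?, 0)",
        (url, raw_bytes),
    )
    conn.commit()

def post_process(conn, extract):
    """Deferred step 2: run later (batch job or daemon) over unprocessed rows."""
    rows = conn.execute("SELECT id, body FROM pages WHERE processed = 0").fetchall()
    for page_id, body in rows:
        # `extract` is a caller-supplied function that turns raw bytes into a number.
        conn.execute(
            "INSERT INTO params (page_id, value) VALUES (?, ?)",
            (page_id, extract(bytes(body))),
        )
        conn.execute("UPDATE pages SET processed = 1 WHERE id = ?", (page_id,))
    conn.commit()
    return len(rows)
```

Strategy 2 would have the same shape, with files in a directory taking the place of the `pages` table.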

Number 1 is the simplest, especially if you only have a single server. Can #2 or #3 be more efficient on a single server, or do they only pay off with multiple servers?

Are there any Python projects that are already geared toward this kind of analysis?

Edit: by "best" I mean the fastest execution, so that users are not kept waiting; ease of programming is secondary.

Comments (2)

老旧海报 2024-12-14 11:54:50

I'd use Celery, on either a single machine or several, with the "on-the-fly" strategy. You can have an aggregation task that fetches the data and a processing task that analyzes it and stores the results in a database. This is a highly scalable approach, and you can tune it according to your computing power.

The "on-the-fly" strategy is more efficient in the sense that you process your data in a single pass. The other two involve an extra step: you have to read the data back from wherever you saved it before you can process it.

Of course, everything depends on the nature of your data and the way you process it. If the processing phase is slower than the aggregation, a synchronous "on-the-fly" pipeline will stall until processing completes. But you can configure Celery to run the tasks asynchronously, so that aggregation continues while some data is still waiting to be processed.
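
The two-task split can be sketched without Celery at all. The example below uses the standard-library `queue` and `threading` modules as a stand-in for a real task queue; it is not Celery's API. The `fetch` and `analyze` callbacks and the `results` table are hypothetical placeholders.

```python
import queue
import sqlite3
import threading

def aggregate(sources, fetch, work_queue):
    """Aggregation task: fetch each source and hand the raw data off immediately."""
    for source in sources:
        work_queue.put((source, fetch(source)))
    work_queue.put(None)  # sentinel: nothing more to process

def process(work_queue, analyze, conn):
    """Processing task: analyze queued items and store the results."""
    while True:
        item = work_queue.get()
        if item is None:
            break
        source, raw = item
        conn.execute(
            "INSERT INTO results (source, value) VALUES (?, ?)",
            (source, analyze(raw)),
        )
    conn.commit()

def run_pipeline(sources, fetch, analyze):
    """Run aggregation in this thread while a worker thread does the processing."""
    conn = sqlite3.connect(":memory:", check_same_thread=False)
    conn.execute("CREATE TABLE results (source TEXT, value REAL)")
    work_queue = queue.Queue()  # unbounded, so aggregation never blocks on slow processing
    worker = threading.Thread(target=process, args=(work_queue, analyze, conn))
    worker.start()
    aggregate(sources, fetch, work_queue)
    worker.join()
    return conn.execute("SELECT source, value FROM results ORDER BY source").fetchall()
```

Because the queue is unbounded, fetching keeps going even when analysis lags behind; the cost is that unprocessed data accumulates in memory. With Celery the queue would live in a message broker, and the workers could run on other machines.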

紫南 2024-12-14 11:54:50

First: "fastest execution to prevent user from waiting" means some kind of deferred processing. Once you decide to defer the processing -- so the user doesn't see it -- the choice between flat-file and database is essentially irrelevant with respect to end-user-wait time.

Second: databases are slow. Flat files are fast. However, since you're going to use Celery to avoid end-user wait time, the distinction between flat file and database becomes irrelevant.

"Store all the source data as ASCII into a file system and post process later, or with a processing-data-daemon"

This is the fastest option. Use Celery to load the flat files.
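
As a rough illustration of the flat-file approach, the sketch below writes raw data to a spool directory at request time and analyzes it in a separate pass. It uses only the standard library; the function names `spool` and `drain`, the `.txt` suffix, and the `analyze` callback are assumptions made for the example. In practice the `drain` step would be what a Celery worker or a daemon runs.

```python
import os

def spool(spool_dir, name, raw_text):
    """Request-time step: write the raw data to disk and return right away."""
    path = os.path.join(spool_dir, name + ".txt")
    # Non-ASCII characters are replaced with "?" so the file stays plain ASCII.
    with open(path, "w", encoding="ascii", errors="replace") as handle:
        handle.write(raw_text)
    return path

def drain(spool_dir, analyze):
    """Later step: analyze every spooled file, then delete it."""
    results = {}
    for filename in sorted(os.listdir(spool_dir)):
        path = os.path.join(spool_dir, filename)
        with open(path, encoding="ascii") as handle:
            results[filename] = analyze(handle.read())
        os.remove(path)
    return results
```

The user only ever waits for `spool`, which is a single file write; everything expensive happens in `drain`.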
