Will I run into load problems with this application stack?
I am designing a file download network.
The ultimate goal is to have an API that lets you upload a file directly to a storage server (no gateway or anything in between). The file is then stored and referenced in a database.
When a file is requested, a server that currently holds the file is selected from the database and an HTTP redirect is done (or the API gives out the currently valid direct URL).
Background jobs take care of desired replication of the file for durability/scaling purposes.
Background jobs also move files around to ensure even workload on the servers regarding disk and bandwidth usage.
There is no RAID or anything like that at any point. Every drive is just hung into the server as JBOD. All the replication happens at the application level. If one server breaks down, it is just marked as broken in the database, and the background jobs take care of replicating from healthy sources until the desired redundancy is reached again.
The system also needs accurate stats for monitoring/balancing and maybe later billing.
So I thought about the following setup.
The environment is a classic Ubuntu, Apache2, PHP, MySQL (LAMP) stack.
A URL that hits the current storage server is generated by the API (that's no problem so far; just a classic PHP website and a MySQL database).
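Roughly something like this is what I have in mind for the URL generation (just a sketch, assuming an HMAC over the path, client IP and an expiry timestamp with a secret shared between the API and the storage servers; host, paths and parameter names are only illustrative):

```php
<?php
// Sketch only: build a time-limited, IP-bound download URL for a file
// that the database says is currently stored on $storageHost.
$secret      = 'shared-secret-known-by-api-and-storage-servers';
$storageHost = 'storage01.example.com';          // picked from the DB
$bucket      = 'some-client-bucket';
$file        = 'path/to/large-file.bin';
$clientIp    = $_SERVER['REMOTE_ADDR'];
$expires     = time() + 3600;                    // link valid for one hour

// HMAC over everything the storage server will re-check later.
$token = hash_hmac('sha256', "$bucket/$file|$clientIp|$expires", $secret);

$url = "http://$storageHost/$bucket/$file?expires=$expires&token=$token";

header("Location: $url", true, 302);             // or just return $url from the API
```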
Now it gets interesting...
The storage server runs Apache2, and a PHP script catches the request. The URL parameters (a secure token hash) are validated: IP, timestamp and filename are checked, so the request is authorized. (No database connection is required, just a PHP script that knows a secret token.)
The PHP script sets the X-Sendfile header so that Apache2's mod_xsendfile takes over delivery of the file.
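On the storage server, the check plus the handoff to mod_xsendfile would look roughly like this (again only a sketch; X-Sendfile is the header mod_xsendfile picks up, the data directory and everything else are illustrative and a real script would need stricter path checks):

```php
<?php
// Sketch only: counterpart to the URL generation above, running on the storage server.
$secret  = 'shared-secret-known-by-api-and-storage-servers';
$path    = ltrim(parse_url($_SERVER['REQUEST_URI'], PHP_URL_PATH), '/'); // "bucket/file"
$expires = isset($_GET['expires']) ? (int)$_GET['expires'] : 0;
$token   = isset($_GET['token'])   ? $_GET['token']        : '';

$expected = hash_hmac('sha256', "$path|{$_SERVER['REMOTE_ADDR']}|$expires", $secret);

// Reject expired links, bad tokens and path tricks.
if ($expires < time() || !hash_equals($expected, $token) || strpos($path, '..') !== false) {
    header('HTTP/1.1 403 Forbidden');
    exit;
}

// Hand the actual delivery over to Apache via mod_xsendfile.
header('X-Sendfile: /data/' . $path);
header('Content-Type: application/octet-stream');
header('Content-Disposition: attachment; filename="' . basename($path) . '"');
```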
Apache delivers the file handed over by mod_xsendfile and is configured to pipe the access log to another PHP script.
Apache runs mod_logio, and the access log is in the Combined I/O log format, additionally extended with the %D variable (the time taken to serve the request, in microseconds) in order to calculate transfer speeds and spot bottlenecks in the network and elsewhere.
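The Apache configuration for that would be roughly the following (a sketch; the LogFormat is the stock Combined I/O format with %D appended, and the script path is just a placeholder):

```apache
# Sketch only: Combined I/O format from mod_logio with %D (microseconds) appended.
LogFormat "%h %l %u %t \"%r\" %>s %O \"%{Referer}i\" \"%{User-Agent}i\" %I %O %D" combinedio_timed

# Pipe the access log straight into the accounting script (path is a placeholder).
CustomLog "|/usr/bin/php /opt/storage/loglistener.php" combinedio_timed

# mod_xsendfile: enable the header and restrict it to the data directory.
XSendFile on
XSendFilePath /data
```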
The piped access log then goes to a PHP script that parses the URL (the first folder is a "bucket", just like on Google Storage or Amazon S3, and each bucket is assigned to one client, so the client is known), counts input/output traffic and increments database fields. For performance reasons I thought about having daily fields and updating them like traffic = traffic + X, creating the row if no row was updated.
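For the "update, and create the row if nothing was updated" part I could probably use MySQL's INSERT ... ON DUPLICATE KEY UPDATE on a unique (bucket, day) key instead of checking affected rows. A rough sketch of the piped listener (database credentials, table and column names are made up, and the log parsing is deliberately crude):

```php
<?php
// Sketch only: piped-log listener reading combinedio(+%D) lines from Apache on STDIN.
// One row per bucket per day; the unique key is assumed to be (bucket, day).
$db   = new mysqli('db.example.com', 'stats_user', 'secret', 'stats');
$stmt = $db->prepare(
    'INSERT INTO traffic_daily (bucket, day, bytes_in, bytes_out)
     VALUES (?, CURDATE(), ?, ?)
     ON DUPLICATE KEY UPDATE
         bytes_in  = bytes_in  + VALUES(bytes_in),
         bytes_out = bytes_out + VALUES(bytes_out)'
);

while (($line = fgets(STDIN)) !== false) {
    // Very rough parse: the bucket is the first path segment of the request line,
    // %I / %O / %D are the last three numeric fields of the log line.
    if (!preg_match('/"[A-Z]+ \/([^\/"]+)\/[^"]*" \d+ \d+ ".*" ".*" (\d+) (\d+) \d+$/', $line, $m)) {
        continue;
    }
    $bucket   = $m[1];
    $bytesIn  = (int)$m[2];
    $bytesOut = (int)$m[3];

    $stmt->bind_param('sii', $bucket, $bytesIn, $bytesOut);
    $stmt->execute();
}
```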
I have to mention that these will be low-budget servers with massive storage.
You can have a closer look at the intended setup in this thread on Server Fault.
The key data is that the systems will have Gigabit throughput (maxed out 24/7) and the file requests will be rather large (so no images or loads of small files that produce high load through lots of log lines and requests). Maybe around 500 MB on average!
The currently planned setup runs on a cheap consumer mainboard (Asus), 2 GB of DDR3 RAM and an AMD Athlon II X2 220 tray CPU (2x 2.80 GHz).
Of course download managers and range requests will be an issue, but I think the average size of an access will still be at least around 50 MB or so.
So my questions are:
Do I have any severe bottleneck in this flow? Can you spot any problems?
Am I right in assuming that mysql_affected_rows() can be read directly from the last request and does not issue another request to the MySQL server?
Do you think the system with the specs given above can handle this? If not, how could I improve it? I think the first bottleneck would be the CPU, wouldn't it?
What do you think about it? Do you have any suggestions for improvement? Maybe something completely different? I thought about using Lighttpd and the mod_secdownload module. Unfortunately it can't check the IP address, and I am not as flexible with it. It would have the advantage that the download validation would not need a PHP process to fire. But since that process only runs briefly and doesn't read and output the data itself, I think this is OK. Do you? I once served downloads with Lighttpd on old throwaway PCs and the performance was awesome. I also thought about using nginx, but I have no experience with that.
What do you think about piping the log to a script that directly updates the database? Should I rather write requests to a job queue and update the database in a second process that can handle delays? Or not do it at all and instead parse the log files at night? My thought is that I would like to have it as close to real time as possible and not have data accumulating anywhere other than in the central database. I also don't want to keep track of jobs running on all the servers; that could be a mess to maintain. There should be a simple unit test that generates a secured link, downloads it and checks whether everything worked and the logging has taken place.
Any further suggestions? I am happy for any input you may have!
I am also planning to open source all of this. I just think there needs to be an open source alternative to expensive storage services such as Amazon S3 that is oriented towards file downloads.
I really searched a lot but didn't find anything like this out there. Of course I would rather reuse an existing solution, preferably open source. Do you know of anything like that?
Comments (1)
MogileFS, http://code.google.com/p/mogilefs/ -- this is almost exactly the thing that you want.