Centralized/distributed sharing

Published 2024-11-19 14:47:13 · 1,469 characters · 2 views · 0 comments

I would like to make a system whereby users can upload and download files. The system will have a centralized topology but will rely heavily on peers to transfer relevant data through the central node to other peers. Instead of peers holding entire files, I would like for them to hold compressed and encrypted portions of the whole data set.

  1. Some client uploads file to server anonymously

    I would like for the client to be able to upload using some sort of NAT (random IP), realizing that the server would not be able to send confirmation packets back to the client. Is ensuring data integrity feasible with a header relaying the total content length, and disregarding the entire upload if there is a mismatch?

  2. Server indexes, compresses and splits the data into chunks adding identifying bytes to each chunk, encrypts it, and splits the data over the network while mapping the locations of each chunk.

    The server will also update the file index for peers upon request. As more data is added to the system, I imagine that the compression can become more efficient. I would like to be able to push these new dictionary entries to peers so they can update both their chunks and the decompression system in the client software, without causing overt network strain. If encrypted, the chunks can be large without any client being aware of having part of x file.

  3. Some client requests a file

    The central node performs a lookup to determine the location of the chunks within the network and requests these chunks from peers. Once the chunks have been assembled, they are sent (still encrypted and compressed) to the client, who then translates the content into the decompressed file. It would be nice if an encrypted request could be made through a peer and relayed to a server, and onion routed through multiple paths with end-to-end encryption.
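Steps 2 and 3 above could be sketched roughly as follows. This is a minimal Ruby sketch, not the actual implementation: the 64 KB chunk size, AES-256-CBC, and the 8-byte random chunk IDs are all illustrative assumptions, not details from the post.

```ruby
require 'zlib'
require 'openssl'
require 'securerandom'

CHUNK_SIZE = 64 * 1024  # assumed chunk size for the sketch

# Step 2: compress, split into chunks, tag each with an ID, encrypt,
# and return a chunk_id -> encrypted payload map for the server to scatter.
def prepare_chunks(data, key)
  compressed = Zlib::Deflate.deflate(data)
  chunks = {}
  (0...compressed.bytesize).step(CHUNK_SIZE).each_with_index do |off, i|
    cipher = OpenSSL::Cipher.new('aes-256-cbc').encrypt
    cipher.key = key
    iv = cipher.random_iv
    # Prefix each chunk with its sequence number so order survives scattering.
    body = [i].pack('N') + compressed.byteslice(off, CHUNK_SIZE)
    chunks[SecureRandom.hex(8)] = iv + cipher.update(body) + cipher.final
  end
  chunks
end

# Step 3: once the chunks are gathered back, decrypt, reorder, and inflate.
def reassemble(chunks, key)
  parts = chunks.values.map do |payload|
    d = OpenSSL::Cipher.new('aes-256-cbc').decrypt
    d.key = key
    d.iv = payload.byteslice(0, 16)
    plain = d.update(payload.byteslice(16, payload.bytesize - 16)) + d.final
    [plain.byteslice(0, 4).unpack1('N'), plain.byteslice(4, plain.bytesize - 4)]
  end
  Zlib::Inflate.inflate(parts.sort_by(&:first).map(&:last).join)
end
```

In a real deployment the chunk map would live in persistent storage and the decryption would happen only once the client is authorized, but the round trip above shows the data path.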

In the background, the server will be monitoring the stability and redundancy of the chunks, and if necessary will take on chunks that are near extinction, and either hold them in its own bank or redistribute them over the network if there are willing clients. In this way, the central node can shrink and grow as appropriate.

The goal is to have a network within which any client can upload or download data with no single other peer knowing who has done either, but with free and open access to all.

The system must be able to handle a massive amount of simultaneous connections while managing the peers and data library without losing its head.

What would be your optimal implementation?

Edit: Bounty opened.

Over the weekend, I implemented a system that does basically the above, minus part 1. For the upload, I just implemented SSL instead of forging the IP address. The system is weak in several areas. Files are split into 1MB chunks and encrypted, then sent to registered peers at random. The recipient(s) for each chunk are stored in the database. I fear that this will quickly grow too large to be manageable, but I also want to avoid having to flood the network with chunk requests. When a file is requested, the central node informs peers possessing the chunks that they need to send the chunks to x client (in p2p mode) or to the server (in direct mode), which then transfers the file down. The system is just one big hack, and written in Ruby, which I imagine is not really up to the task. For the rewrite, I am considering using C++ with Boost.Asio.
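The bookkeeping described above (random recipients per chunk, recorded so the central node can route later requests) can be sketched in Ruby. The in-memory hashes stand in for the real database, and `REPLICAS` is an assumed redundancy factor, not a number from the post:

```ruby
REPLICAS = 3  # assumed number of peers holding each chunk

class ChunkIndex
  def initialize(peers)
    @peers = peers      # registered peer IDs
    @placement = {}     # chunk_id -> [peer_id, ...]
    @files = {}         # file_id  -> ordered chunk IDs
  end

  # Record where each chunk of a newly uploaded file was sent.
  def register_file(file_id, chunk_ids)
    @files[file_id] = chunk_ids
    chunk_ids.each { |cid| @placement[cid] = @peers.sample(REPLICAS) }
  end

  # For a download request: which peers must be asked for which chunk.
  def peers_for(file_id)
    @files.fetch(file_id).map { |cid| [cid, @placement[cid]] }
  end
end
```

This also makes the scaling worry concrete: the placement table grows with the number of chunks times the replication factor, which is why a DHT (as suggested in the answers) is the usual way to shed that state from the central node.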

I am looking for general suggestions regarding architecture and system design. I am not at all attached to my current implementation.

Current Topology

Server handling client uploads, indexing, and content propagation
Server handling client requests
Client for uploading files and requesting files
Client server accepting chunks and requests

I would like for the client not to have to run a persistent server, but I can't think of a good way around it.

I would post some of the code, but it's embarrassing. Thanks. Please ask any questions; the basic idea is to have a decent anonymous file sharing model combining the strengths of both the distributed and centralized models of content distribution. If you have a totally different idea, please feel free to post it.


Comments (3)

_失温 2024-11-26 14:47:13


    I would like for the client to be able to upload using some sort of NAT (random ip), realizing that the server would not be able to send confirmation packets back to the client. Is ensuring data integrity feasible with a header relaying the total content length, and disregarding the entire upload if there is a mismatch?

No, that's not feasible. If your packets are 1500 bytes, and you have 0.1% packetloss, the chance of a one megabyte file being uploaded without any lost packets is .999 ^ (1048576 / 1500) = 0.497, or under 50%. Further, it's not clear how the client would even know if the upload succeeded if the server has no way to send acknowledgements back to the client.
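The arithmetic above can be checked directly:

```ruby
# Probability that every 1500-byte packet of a 1 MiB upload survives a
# 0.1% per-packet loss rate, reproducing the calculation in the answer.
packets   = 1_048_576 / 1500.0   # ~699 packets per upload
p_success = 0.999**packets       # each packet independently survives
```

At 0.1% loss the upload as a whole fails more often than it succeeds, and the failure probability compounds quickly for larger files.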

One way around the acknowledgement issue would be to use a rateless code, which allows the client to compute and send an effectively infinite number of unique blocks, such that any sufficiently large subset is enough to reconstruct the original file. This adds a large amount of complexity to both the client and server, however, and still requires some way to notify the client that the server has received the complete file.

It seems to me you're confusing several issues here. If your system has a centralized component to which your clients upload, why do you need to do NAT traversal at all?

For parts two and three of your question, you probably want to research Distributed Hash Tables and content-based addressing (but with major caveats explained here). Preventing the nodes from knowing the content of the files they store could be accomplished by, for example, encrypting the files with the first hash of their content, and storing them keyed by the second hash - this means that anyone who knows the hash of the file can retrieve it, but clients cannot decrypt the files they host.
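A minimal Ruby sketch of that hash-of-content scheme follows. The cipher choice (AES-256-CTR) and the deterministic IV derived from the key are simplifying assumptions for the sketch, not part of the answer's proposal:

```ruby
require 'openssl'
require 'digest'

# Convergent encryption: encrypt the file with H(content) as the key, and
# store the ciphertext under H(H(content)). Anyone who knows the content
# hash can locate and decrypt the file; the hosting client sees only the
# ciphertext and the storage key, and cannot decrypt what it hosts.
def store(store_db, content)
  key     = Digest::SHA256.digest(content)     # first hash: encryption key
  locator = Digest::SHA256.hexdigest(key)      # second hash: storage key
  cipher  = OpenSSL::Cipher.new('aes-256-ctr').encrypt
  cipher.key = key
  cipher.iv  = Digest::MD5.digest(key)         # deterministic IV (sketch only)
  store_db[locator] = cipher.update(content) + cipher.final
  locator
end

def fetch(store_db, content_key)
  locator = Digest::SHA256.hexdigest(content_key)
  d = OpenSSL::Cipher.new('aes-256-ctr').decrypt
  d.key = content_key
  d.iv  = Digest::MD5.digest(content_key)
  d.update(store_db.fetch(locator)) + d.final
end
```

Note the trade-off: because the key is derived from the content, identical files encrypt identically (which enables deduplication but leaks equality), which is one of the caveats of content-based addressing.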

In general, I would suggest starting by writing down a solid list of goals for the system you're designing, then looking for an architecture that suits those goals. In contrast, it sounds like you have some implicit goals, and have already picked a basic system architecture - which may not suit your full goals - based on that.

帝王念 2024-11-26 14:47:13


Sorry for arriving late to the generous 500-reputation party, but even if I am too late, I would like to add a little of my research to your discussion.

Yes, such a system would be nice: like BitTorrent, but with encrypted files and hashes of the unencrypted data. In BT you can of course add encrypted files, but then the hashes would be of the encrypted data, and thus it would not be possible to identify retrieval sources without a centralized queryKey->hashCollection storage, i.e. a server that does all the work of identifying package sources for every client. A similar system was attempted by Freenet (http://freenetproject.org/), although more limited than what you attempt.

For the NAT consideration, let's first look at aClient -> yourServer (and aClient -> aClient later).

For the communication between a client and your server, the NATs (and the firewalls that shield the clients) are not an issue. Since the clients initiate the connection to your server (which has either a fixed IP address or a DNS entry (or dyndns)), you don't even have to think about NATs; the server can respond without an issue because, even if multiple clients are behind a single corporate firewall, the firewall (its NAT) will look up which client the server wants to communicate with and forward accordingly (without you having to tell it to).

Now the "hard" part, client -> client communication through firewalls/NAT: the central technique you can use is hole punching (http://en.wikipedia.org/wiki/UDP_hole_punching). It works so well that it is what Skype uses (from one client behind a corporate firewall to another; if it does not succeed, it falls back to a mirroring server). For this, both clients need to know the address of the other and then shoot some packets at each other. So how do they get each other's addresses? Your server gives the addresses to the clients (this requires that not only a requester but also every distributor open a connection to your server periodically).

Before I talk about your concern about data integrity, here is the general distinction between packages and packets you could (and I guess should) make:

You just have to consider that you can separate your (application-domain) packages (large) from the packets used for internet transmission (small, limited by the MTU among other things). It would be slow to have both be the same size; the minimum datagram size every IPv4 host must accept is only 576 bytes (minus overhead; take a look here: http://www.comsci.us/datacom/ippacket.html). You could do some experiments to find a good size for your packages; my best guess is that anything from 50 kB to 1 MB would be fine (but profiling would optimize that, since we don't know whether most of the files you want to distribute are large or small).

About data integrity: for your packages you definitely need a hash, and I would recommend directly using a cryptographic hash, since this prevents tampering (in addition to corruption). You don't need to record the size of the packages, because if the hash is faulty you have to re-transmit the package anyway. Bear in mind that this kind of package corruption is not very frequent if you use TCP/IP for packet transmission (yes, you can use TCP/IP even in your scenario); it automatically corrects (re-requests) transmission errors. The huge advantage is that all the computers and routers in between know TCP/IP and check for corruption automatically at every step between the source and destination computers, so they can re-request the packet themselves, which makes it very fast. They would not know about a packet-integrity protocol you implement yourself, so with such a custom protocol the packet has to arrive at the destination before the re-request can even start.
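The package-level check described above is a few lines of Ruby (the struct layout is an illustrative choice):

```ruby
require 'digest'

# Ship each application-level package with a cryptographic hash instead of
# a length field; the receiver recomputes the hash and re-requests the
# package on mismatch (catching tampering as well as corruption).
Package = Struct.new(:payload, :digest)

def make_package(payload)
  Package.new(payload, Digest::SHA256.hexdigest(payload))
end

# Receiver side: accept the package only if the recomputed hash matches.
def intact?(pkg)
  Digest::SHA256.hexdigest(pkg.payload) == pkg.digest
end
```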

For the next thought, let's call the client that publishes a file the "publisher". I know this is kind of obvious; however, it is important to distinguish this from "uploader", since the client does not need to upload the file to your server (just some info about it; see below).

Implementing the central indexing server should be no problem. The problem is that you plan to have it encrypt all the files itself instead of making the publisher do that heavy work (good encryption is very heavy lifting). The only problem with having the publisher (not the server) encrypt the data is that you have to trust the publisher to give you reasonable search keywords: theoretically, it could give you a very attractive search keyword every client desires, together with a reference to bogus data (encrypted data is hard to distinguish from random data). But the solution to this problem is crowd-sourcing: make your server store a user rating, so downloaders can vote on files. The table you need could be implemented as a regular old hash table from individual search keywords to the client IDs (see below) that have that package. The publisher is at first the only client that holds the data, but every client that has downloaded at least one of the packages should then be added to the hash table's entry, so if the publisher goes offline and every package has been downloaded by at least one client, everything continues working. Now, critically, the client-ID -> IP-address mapping is non-trivial because it changes often (e.g. every 24 hours for many clients); to compensate, you have to keep another table on your server that makes this mapping, and make the clients contact the server periodically (e.g. every hour) to report their IP addresses. I would recommend using a crypto-hash for client IDs, so that it is impossible for a client to trash this table by telling you fake IDs.
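The two server-side tables proposed above can be sketched as follows. Deriving the client ID as SHA-256 of a client's public key is an assumed choice for the sketch; the answer only specifies "a crypto-hash":

```ruby
require 'digest'

# Two tables: a search-keyword -> client-ID index for packages, and a
# client-ID -> IP mapping that clients refresh periodically because their
# addresses change often.
class TrackerTables
  def initialize
    @keyword_index = Hash.new { |h, k| h[k] = [] }  # keyword -> client IDs
    @addresses = {}                                  # client ID -> current IP
  end

  # Crypto-hash IDs make it hard for a peer to trash the table with fakes.
  def client_id(pubkey)
    Digest::SHA256.hexdigest(pubkey)
  end

  # Clients call this periodically (e.g. hourly), since IPs change often.
  def announce(cid, ip)
    @addresses[cid] = ip
  end

  # Called when a publisher registers a package, or a downloader completes one.
  def add_source(keyword, cid)
    @keyword_index[keyword] << cid unless @keyword_index[keyword].include?(cid)
  end

  # Resolve a search keyword to the IPs currently known for its holders.
  def sources_for(keyword)
    @keyword_index[keyword].map { |cid| @addresses[cid] }.compact
  end
end
```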

For any questions and criticism, please comment.

扭转时空 2024-11-26 14:47:13


I am not sure having one central point of attack (the central server) is a very good idea. That of course depends on the kind of problems you want to be able to handle. It also limits your scalability a lot.
