Duplicate detection for 3K incoming requests per second, recommended data structure/algorithm?
Designing a system where a service endpoint (probably a simple servlet) will have to handle 3K requests per second (the data will be posted over HTTP).
These requests will then be stored in MySQL.
The key issue that I need guidance on is that there will be a high percentage of duplicate data posted to this endpoint.
I only need to store unique data in MySQL, so what would you suggest I use to handle the duplication?
The posted data will look like:
<root>
<prop1></prop1>
<prop2></prop2>
<prop3></prop3>
<body>
maybe 10-30K of text in here
</body>
</root>
I will write a method that will hash prop1, prop2 and prop3 to create a unique hash code (the body is ignored for this purpose; uniqueness is determined by the three properties alone).
I was thinking of creating some sort of concurrent dictionary that will be shared across requests.
Duplicates are most likely to be posted within a period of 24 hours, so I could purge data from this dictionary every x hours.
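To make that concrete, roughly what I'm picturing is the sketch below (just an illustration, not a finished design; the key format and the purge cut-off are placeholders):

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class DuplicateTracker {
    // key = combination of prop1/prop2/prop3, value = time the key was first seen (used for purging)
    private final ConcurrentHashMap<String, Long> seen = new ConcurrentHashMap<String, Long>();

    // returns true the first time a key is seen, false for duplicates
    public boolean isFirstTimeSeen(String prop1, String prop2, String prop3) {
        String key = prop1 + "|" + prop2 + "|" + prop3;   // would be a real hash of the three props in practice
        return seen.putIfAbsent(key, System.currentTimeMillis()) == null;
    }

    // called every x hours to drop entries older than maxAgeMillis
    public void purgeOlderThan(long maxAgeMillis) {
        long cutoff = System.currentTimeMillis() - maxAgeMillis;
        for (Map.Entry<String, Long> e : seen.entrySet()) {
            if (e.getValue() < cutoff) {
                seen.remove(e.getKey(), e.getValue());
            }
        }
    }
}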
Any suggestions on a data structure to track the duplicates? And what about purging, and how many records should I expect to hold, considering that at 3K requests per second it will get large very fast?
Note: There are 10K different sources that will be posting, and duplication only occurs within a given source. This means I could have more than one dictionary, maybe one per group of sources, to spread things out. In other words, if source1 posts data and then source2 posts data, the chances of duplication are very, very low. But if source1 posts 100 times in a day, the chances of duplication are very high.
Note: please ignore for now the task of saving the posted data to MySQL, as that is another issue on its own; duplicate detection is the first hurdle I need help with.
Answers (6)
Interesting question.
I would probably be looking at some kind of HashMap-of-HashMaps structure here, where the first level of HashMaps would use the sources as keys and the second level would contain the actual data (the minimum needed for detecting duplicates), hashed with your hashcode function. For the actual implementation, Java's ConcurrentHashMap would probably be the choice.
This way you have also set up the structure to partition your incoming load depending on sources if you need to distribute the load over several machines.
With regards to purging I think you have to measure the exact behavior with production like data. You need to learn how quickly the data grows when you successfully eliminate duplicates and how it becomes distributed in the HashMaps. With a good distribution and a not too quick growth I can imagine it is good enough to do a cleanup occasionally. Otherwise maybe a LRU policy would be good.
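A minimal sketch of that two-level layout, assuming the duplicate key is a long hash computed from the three properties (the class and method names are made up for illustration):

import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

public class PerSourceDedup {
    // first level: source id -> set of hashes already seen for that source
    private final ConcurrentMap<String, ConcurrentMap<Long, Boolean>> bySource =
            new ConcurrentHashMap<String, ConcurrentMap<Long, Boolean>>();

    // returns true if this (source, hash) pair has not been seen before
    public boolean markIfNew(String sourceId, long hash) {
        ConcurrentMap<Long, Boolean> perSource = bySource.get(sourceId);
        if (perSource == null) {
            perSource = new ConcurrentHashMap<Long, Boolean>();
            ConcurrentMap<Long, Boolean> existing = bySource.putIfAbsent(sourceId, perSource);
            if (existing != null) {
                perSource = existing;   // another thread registered this source first
            }
        }
        return perSource.putIfAbsent(hash, Boolean.TRUE) == null;
    }
}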
Sounds like you need a hashing structure that can add and check the existence of a key in constant time. In that case, try implementing a Bloom filter. Be aware that this is a probabilistic structure, i.e. it may tell you that a key exists when it does not, but you can make the probability of a false positive extremely low if you tweak the parameters carefully.
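For example, with Guava's BloomFilter (assuming Guava is on the classpath; the expected-insertion count and false-positive rate below are placeholders, sized here for roughly one day of traffic at 3K/s):

import com.google.common.hash.BloomFilter;
import com.google.common.hash.Funnels;
import java.nio.charset.Charset;

public class BloomDedup {
    // ~3K/s * 86400s is about 260M keys per day; 0.1% false-positive rate
    private final BloomFilter<CharSequence> seen =
            BloomFilter.create(Funnels.stringFunnel(Charset.forName("UTF-8")), 260000000, 0.001);

    // returns true if the key is definitely new; false means "probably a duplicate"
    public synchronized boolean markIfNew(String key) {
        if (seen.mightContain(key)) {
            return false;   // may occasionally drop a genuinely new record (false positive)
        }
        seen.put(key);
        return true;
    }
}

The synchronized guard is there simply because a single shared filter would be hit from many request threads at once.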
Edit: OK, so Bloom filters are not acceptable. To still maintain constant-time lookup (albeit not constant-time insertion), look into cuckoo hashing.
1) Set up your database like this
2) You don't need any algorithms or fancy hashing ADTs
http://dev.mysql.com/doc/refman/5.1/en/mysqlimport.html
Make use of the --replace or --ignore flags, as well as --compress.
3) All your Java will do is...
a) generate CSV files: use the StringBuffer class, then every X seconds or so swap in a fresh StringBuffer and pass the .toString() of the old one to a thread that flushes it to a file /temp/SOURCE/TIME_STAMP.csv (rough sketch after this list)
b) occasionally kick off a Runtime.getRuntime().exec of the mysqlimport command
c) delete the old CSV files if space is an issue, or archive them to a network storage/backup device
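The sketch referenced in a) and b), with assumptions called out: the database name, table name, directory and 10-second flush interval are placeholders, and since mysqlimport derives the table name from the CSV file name, the file is named after the target table here.

import java.io.File;
import java.io.FileWriter;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class CsvSpooler {
    private final Object lock = new Object();
    private StringBuffer buffer = new StringBuffer();

    // called by the servlet for every record to persist
    public void append(String prop1, String prop2, String prop3, String body) {
        synchronized (lock) {
            buffer.append(prop1).append(',').append(prop2).append(',')
                  .append(prop3).append(',').append(body.replace(',', ' ')).append('\n');
        }
    }

    // every X seconds: swap in a fresh buffer, write the old one out, kick off mysqlimport
    public void start() {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        scheduler.scheduleAtFixedRate(new Runnable() {
            public void run() {
                StringBuffer old;
                synchronized (lock) {
                    old = buffer;
                    buffer = new StringBuffer();
                }
                if (old.length() == 0) {
                    return;
                }
                try {
                    // mysqlimport uses the file name (minus extension) as the table name
                    File dir = new File("/temp/source1/" + System.currentTimeMillis());
                    dir.mkdirs();
                    File csv = new File(dir, "mytable.csv");
                    FileWriter out = new FileWriter(csv);
                    out.write(old.toString());
                    out.close();
                    // --ignore (or --replace) handles duplicates, --compress saves bandwidth
                    Runtime.getRuntime().exec(new String[] {
                            "mysqlimport", "--ignore", "--compress",
                            "--fields-terminated-by=,", "mydb", csv.getAbsolutePath() });
                } catch (Exception e) {
                    e.printStackTrace();   // real code would log and retry
                }
            }
        }, 10, 10, TimeUnit.SECONDS);
    }
}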
Well, you're basically looking for some kind of extremely large HashMap and something like
if (map.put(key, val) == null) // key not seen before, so store/forward the data
There are lots of different HashMap implementations available, but you could look at NBHM (the NonBlockingHashMap from Cliff Click's high-scale-lib). Non-blocking puts, and designed with large, scalable problems in mind, it could work just fine. The Map also has iterators that do NOT throw a ConcurrentModificationException while you use them to traverse the map, which is basically a requirement for removing old data as I see it. Also, putIfAbsent is all you actually need, but I have no idea whether that's more efficient than a simple put; you'd have to ask Cliff or check the source.
The trick then is to try to avoid resizing of the Map by making it large enough, otherwise the throughput will suffer while resizing (which could be a problem). And think about how to implement the removal of old data, probably using some idle thread that traverses an iterator and removes old entries.
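A sketch along those lines, assuming Cliff Click's high-scale-lib (org.cliffc.high_scale_lib.NonBlockingHashMap) is on the classpath; the initial size and the age cut-off are placeholders:

import java.util.Map;
import org.cliffc.high_scale_lib.NonBlockingHashMap;

public class NbhmDedup {
    // pre-sized generously to avoid resizing under load (placeholder figure)
    private final NonBlockingHashMap<Long, Long> seen =
            new NonBlockingHashMap<Long, Long>(16 * 1024 * 1024);

    // value = first-seen timestamp, kept so old entries can be purged later
    public boolean markIfNew(long hash) {
        return seen.putIfAbsent(hash, System.currentTimeMillis()) == null;
    }

    // run from an idle background thread; iteration tolerates concurrent updates
    public void purgeOlderThan(long maxAgeMillis) {
        long cutoff = System.currentTimeMillis() - maxAgeMillis;
        for (Map.Entry<Long, Long> e : seen.entrySet()) {
            if (e.getValue() < cutoff) {
                seen.remove(e.getKey(), e.getValue());
            }
        }
    }
}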
Use a java.util.concurrent.ConcurrentHashMap for building a map of your hashes, but make sure you have the correct initialCapacity and concurrencyLevel assigned to the map at creation time. The API docs for ConcurrentHashMap have all the relevant information.
You should be able to use putIfAbsent for handling 3K requests as long as you have initialized the ConcurrentHashMap the right way - make sure this is tuned as part of your load testing.
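For instance (the capacity and concurrency figures below are just illustrative starting points to refine during load testing):

import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

public class HashIndex {
    private static final int EXPECTED_ENTRIES = 64 * 1024 * 1024;   // placeholder, tune via load tests

    // pre-size to avoid rehashing, and raise concurrencyLevel above the default of 16
    // to reduce contention when many request threads hit the map at once
    private final ConcurrentMap<String, Boolean> index =
            new ConcurrentHashMap<String, Boolean>(EXPECTED_ENTRIES, 0.75f, 64);

    // putIfAbsent is atomic: only the first caller for a given hash sees "true"
    public boolean isNew(String hash) {
        return index.putIfAbsent(hash, Boolean.TRUE) == null;
    }
}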
At some point, though, trying to handle all the requests in one server may prove to be too much, and you will have to load-balance across servers. At that point you may consider using memcached for storing the index of hashes, instead of the CHP.
There are still some interesting problems you will have to solve beyond that, though.
If you use a strong hash formula, such as MD5 or SHA-1, you will not need to store any of the original data at all. The probability of a collision is virtually zero, so if you find the same hash result twice, the second occurrence is a duplicate.
Given that an MD5 digest is 16 bytes and a SHA-1 digest 20 bytes, this should decrease memory requirements, keeping more elements in the CPU cache and therefore dramatically improving speed.
Storing these keys requires little more than a small hash table, with trees behind it to handle collisions.
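A small sketch of the digest-only approach using the JDK's MessageDigest (SHA-1 shown, MD5 works the same way; the delimiter and the set implementation are illustrative choices):

import java.nio.ByteBuffer;
import java.nio.charset.Charset;
import java.security.MessageDigest;
import java.util.Collections;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

public class DigestDedup {
    // only the 20-byte digests are kept, never the posted data itself
    private final Set<ByteBuffer> seen =
            Collections.newSetFromMap(new ConcurrentHashMap<ByteBuffer, Boolean>());

    // returns true the first time this (prop1, prop2, prop3) combination is seen
    public boolean isNew(String prop1, String prop2, String prop3) throws Exception {
        // MessageDigest instances are not thread-safe, so create one per call
        MessageDigest sha1 = MessageDigest.getInstance("SHA-1");
        byte[] digest = sha1.digest(
                (prop1 + "|" + prop2 + "|" + prop3).getBytes(Charset.forName("UTF-8")));
        // ByteBuffer has content-based equals/hashCode, so the raw digest can be a set element
        return seen.add(ByteBuffer.wrap(digest));
    }
}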