Duplicate detection for 3K incoming requests per second, recommended data structure/algorithm?
Designing a system where a service endpoint (probably a simple servlet) will have to handle 3K requests per second (the data will be posted over HTTP).
These requests will then be stored in MySQL.
The key issue that I need guidance on is that there will be a high percentage of duplicate data posted to this endpoint.
I only need to store unique data in MySQL, so what would you suggest I use to handle the duplication?
The posted data will look like:
<root>
<prop1></prop1>
<prop2></prop2>
<prop3></prop3>
<body>
maybe 10-30K of text in here
</body>
</root>
I will write a method that will hash prop1, prop2 and prop3 to create a unique hash code (the body is ignored for this purpose; uniqueness is determined by the three properties alone).
I was thinking of creating some sort of concurrent dictionary that will be shared across requests.
Duplicates are most likely to be posted within a period of 24 hours, so I could purge data from this dictionary every x hours.
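To make that concrete, roughly what I'm picturing is the sketch below (just an illustration, not a finished design; the key format and the purge cut-off are placeholders):

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class DuplicateTracker {
    // key = combination of prop1/prop2/prop3, value = time the key was first seen (used for purging)
    private final ConcurrentHashMap<String, Long> seen = new ConcurrentHashMap<String, Long>();

    // returns true the first time a key is seen, false for duplicates
    public boolean isFirstTimeSeen(String prop1, String prop2, String prop3) {
        String key = prop1 + "|" + prop2 + "|" + prop3;   // would be a real hash of the three props in practice
        return seen.putIfAbsent(key, System.currentTimeMillis()) == null;
    }

    // called every x hours to drop entries older than maxAgeMillis
    public void purgeOlderThan(long maxAgeMillis) {
        long cutoff = System.currentTimeMillis() - maxAgeMillis;
        for (Map.Entry<String, Long> e : seen.entrySet()) {
            if (e.getValue() < cutoff) {
                seen.remove(e.getKey(), e.getValue());
            }
        }
    }
}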
Any suggestions on a data structure to track the duplicates? And what about purging, and how many records should I expect to hold, considering that at 3K requests per second it will get large very fast?
Note: There are 10K different sources that will be posting, and duplication only occurs within a given source. This means I could have more than one dictionary, maybe one per group of sources, to spread things out. In other words, if source1 posts data and then source2 posts data, the chances of duplication are very, very low. But if source1 posts 100 times in a day, the chances of duplication are very high.
Note: please ignore for now the task of saving the posted data to MySQL, as that is another issue on its own; duplicate detection is the first hurdle I need help with.
Answers (6)
Interesting question.
I would probably be looking at some kind of HashMap-of-HashMaps structure here, where the first level of HashMaps would use the sources as keys and the second level would contain the actual data (the minimum needed for detecting duplicates), hashed with your hashcode function. For the actual implementation, Java's ConcurrentHashMap would probably be the choice.
This way you have also set up the structure to partition your incoming load depending on sources if you need to distribute the load over several machines.
With regards to purging I think you have to measure the exact behavior with production like data. You need to learn how quickly the data grows when you successfully eliminate duplicates and how it becomes distributed in the HashMaps. With a good distribution and a not too quick growth I can imagine it is good enough to do a cleanup occasionally. Otherwise maybe a LRU policy would be good.
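A minimal sketch of that two-level layout, assuming the duplicate key is a long hash computed from the three properties (the class and method names are made up for illustration):

import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

public class PerSourceDedup {
    // first level: source id -> set of hashes already seen for that source
    private final ConcurrentMap<String, ConcurrentMap<Long, Boolean>> bySource =
            new ConcurrentHashMap<String, ConcurrentMap<Long, Boolean>>();

    // returns true if this (source, hash) pair has not been seen before
    public boolean markIfNew(String sourceId, long hash) {
        ConcurrentMap<Long, Boolean> perSource = bySource.get(sourceId);
        if (perSource == null) {
            perSource = new ConcurrentHashMap<Long, Boolean>();
            ConcurrentMap<Long, Boolean> existing = bySource.putIfAbsent(sourceId, perSource);
            if (existing != null) {
                perSource = existing;   // another thread registered this source first
            }
        }
        return perSource.putIfAbsent(hash, Boolean.TRUE) == null;
    }
}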
Sounds like you need a hashing structure that can add and check the existence of a key in constant time. In that case, try implementing a Bloom filter. Be aware that this is a probabilistic structure, i.e. it may tell you that a key exists when it does not, but you can make the probability of a false positive extremely low if you tweak the parameters carefully.
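For example, with Guava's BloomFilter (assuming Guava is on the classpath; the expected-insertion count and false-positive rate below are placeholders, sized here for roughly one day of traffic at 3K/s):

import com.google.common.hash.BloomFilter;
import com.google.common.hash.Funnels;
import java.nio.charset.Charset;

public class BloomDedup {
    // ~3K/s * 86400s is about 260M keys per day; 0.1% false-positive rate
    private final BloomFilter<CharSequence> seen =
            BloomFilter.create(Funnels.stringFunnel(Charset.forName("UTF-8")), 260000000, 0.001);

    // returns true if the key is definitely new; false means "probably a duplicate"
    public synchronized boolean markIfNew(String key) {
        if (seen.mightContain(key)) {
            return false;   // may occasionally drop a genuinely new record (false positive)
        }
        seen.put(key);
        return true;
    }
}

The synchronized guard is there simply because a single shared filter would be hit from many request threads at once.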
Edit: OK, so Bloom filters are not acceptable. To still maintain constant-time lookup (albeit not constant-time insertion), look into cuckoo hashing.
1) Set up your database like this
2) You don't need any algorithms or fancy hashing ADTs
http://dev.mysql.com/doc/refman/5.1/en/mysqlimport.html
Make use of the --replace or --ignore flags, as well as --compress.
3) All your Java will do is...
a) generate CSV files: use the StringBuffer class, then every X seconds or so swap in a fresh StringBuffer and pass the .toString() of the old one to a thread that flushes it to a file /temp/SOURCE/TIME_STAMP.csv (rough sketch after this list)
b) occasionally kick off a Runtime.getRuntime().exec of the mysqlimport command
c) delete the old CSV files if space is an issue, or archive them to a network storage/backup device
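The sketch referenced in a) and b), with assumptions called out: the database name, table name, directory and 10-second flush interval are placeholders, and since mysqlimport derives the table name from the CSV file name, the file is named after the target table here.

import java.io.File;
import java.io.FileWriter;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class CsvSpooler {
    private final Object lock = new Object();
    private StringBuffer buffer = new StringBuffer();

    // called by the servlet for every record to persist
    public void append(String prop1, String prop2, String prop3, String body) {
        synchronized (lock) {
            buffer.append(prop1).append(',').append(prop2).append(',')
                  .append(prop3).append(',').append(body.replace(',', ' ')).append('\n');
        }
    }

    // every X seconds: swap in a fresh buffer, write the old one out, kick off mysqlimport
    public void start() {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        scheduler.scheduleAtFixedRate(new Runnable() {
            public void run() {
                StringBuffer old;
                synchronized (lock) {
                    old = buffer;
                    buffer = new StringBuffer();
                }
                if (old.length() == 0) {
                    return;
                }
                try {
                    // mysqlimport uses the file name (minus extension) as the table name
                    File dir = new File("/temp/source1/" + System.currentTimeMillis());
                    dir.mkdirs();
                    File csv = new File(dir, "mytable.csv");
                    FileWriter out = new FileWriter(csv);
                    out.write(old.toString());
                    out.close();
                    // --ignore (or --replace) handles duplicates, --compress saves bandwidth
                    Runtime.getRuntime().exec(new String[] {
                            "mysqlimport", "--ignore", "--compress",
                            "--fields-terminated-by=,", "mydb", csv.getAbsolutePath() });
                } catch (Exception e) {
                    e.printStackTrace();   // real code would log and retry
                }
            }
        }, 10, 10, TimeUnit.SECONDS);
    }
}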
Well, you're basically looking for some kind of extremely large HashMap and something like
if (map.put(key, val) == null) // key not seen before, so store/forward the data
There are lots of different HashMap implementations available, but you could look at NBHM (the NonBlockingHashMap from Cliff Click's high-scale-lib). Non-blocking puts, and designed with large, scalable problems in mind, it could work just fine. The Map also has iterators that do NOT throw a ConcurrentModificationException while you use them to traverse the map, which is basically a requirement for removing old data as I see it. Also, putIfAbsent is all you actually need, but I have no idea whether that's more efficient than a simple put; you'd have to ask Cliff or check the source.
The trick then is to try to avoid resizing of the Map by making it large enough, otherwise the throughput will suffer while resizing (which could be a problem). And think about how to implement the removal of old data, probably using some idle thread that traverses an iterator and removes old entries.
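A sketch along those lines, assuming Cliff Click's high-scale-lib (org.cliffc.high_scale_lib.NonBlockingHashMap) is on the classpath; the initial size and the age cut-off are placeholders:

import java.util.Map;
import org.cliffc.high_scale_lib.NonBlockingHashMap;

public class NbhmDedup {
    // pre-sized generously to avoid resizing under load (placeholder figure)
    private final NonBlockingHashMap<Long, Long> seen =
            new NonBlockingHashMap<Long, Long>(16 * 1024 * 1024);

    // value = first-seen timestamp, kept so old entries can be purged later
    public boolean markIfNew(long hash) {
        return seen.putIfAbsent(hash, System.currentTimeMillis()) == null;
    }

    // run from an idle background thread; iteration tolerates concurrent updates
    public void purgeOlderThan(long maxAgeMillis) {
        long cutoff = System.currentTimeMillis() - maxAgeMillis;
        for (Map.Entry<Long, Long> e : seen.entrySet()) {
            if (e.getValue() < cutoff) {
                seen.remove(e.getKey(), e.getValue());
            }
        }
    }
}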
Use a java.util.concurrent.ConcurrentHashMap for building a map of your hashes, but make sure you have the correct initialCapacity and concurrencyLevel assigned to the map at creation time. The API docs for ConcurrentHashMap have all the relevant information.
You should be able to use putIfAbsent for handling 3K requests as long as you have initialized the ConcurrentHashMap the right way - make sure this is tuned as part of your load testing.
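For instance (the capacity and concurrency figures below are just illustrative starting points to refine during load testing):

import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

public class HashIndex {
    private static final int EXPECTED_ENTRIES = 64 * 1024 * 1024;   // placeholder, tune via load tests

    // pre-size to avoid rehashing, and raise concurrencyLevel above the default of 16
    // to reduce contention when many request threads hit the map at once
    private final ConcurrentMap<String, Boolean> index =
            new ConcurrentHashMap<String, Boolean>(EXPECTED_ENTRIES, 0.75f, 64);

    // putIfAbsent is atomic: only the first caller for a given hash sees "true"
    public boolean isNew(String hash) {
        return index.putIfAbsent(hash, Boolean.TRUE) == null;
    }
}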
At some point, though, trying to handle all the requests in one server may prove to be too much, and you will have to load-balance across servers. At that point you may consider using memcached for storing the index of hashes, instead of the CHP.
There are still some interesting problems you will have to solve beyond that, though.
If you use a strong hash formula, such as MD5 or SHA-1, you will not need to store any of the original data at all. The probability of a collision is virtually zero, so if you find the same hash result twice, the second occurrence is a duplicate.
Given that an MD5 digest is 16 bytes and a SHA-1 digest 20 bytes, this should decrease memory requirements, keeping more elements in the CPU cache and therefore dramatically improving speed.
Storing these keys requires little more than a small hash table, with trees behind it to handle collisions.
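A small sketch of the digest-only approach using the JDK's MessageDigest (SHA-1 shown, MD5 works the same way; the delimiter and the set implementation are illustrative choices):

import java.nio.ByteBuffer;
import java.nio.charset.Charset;
import java.security.MessageDigest;
import java.util.Collections;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

public class DigestDedup {
    // only the 20-byte digests are kept, never the posted data itself
    private final Set<ByteBuffer> seen =
            Collections.newSetFromMap(new ConcurrentHashMap<ByteBuffer, Boolean>());

    // returns true the first time this (prop1, prop2, prop3) combination is seen
    public boolean isNew(String prop1, String prop2, String prop3) throws Exception {
        // MessageDigest instances are not thread-safe, so create one per call
        MessageDigest sha1 = MessageDigest.getInstance("SHA-1");
        byte[] digest = sha1.digest(
                (prop1 + "|" + prop2 + "|" + prop3).getBytes(Charset.forName("UTF-8")));
        // ByteBuffer has content-based equals/hashCode, so the raw digest can be a set element
        return seen.add(ByteBuffer.wrap(digest));
    }
}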