标签的高效数据结构?
想象一下,您想要序列化和反序列化 stackoverflow 帖子,包括其标签,以尽可能有效地节省空间(以二进制形式),同时也为了在进行标签查找时提高性能。有没有适合这种场景的良好数据结构?
Stackoverflow 有大约 28532 个不同的标签,您可以创建一个包含所有标签的表并为它们分配一个整数,此外您可以按频率对它们进行排序,以便最常见的标签具有最少的数字。从搜索和存储的角度来看,仍然像“1 32 45”格式的字符串一样简单地存储它们似乎有点低效
。另一个想法是将标签保存为变量位数组,从查找和序列化的角度来看,这很有吸引力。由于最常见的标签是第一个,因此您可以将标签放入少量内存中。
问题当然是不常见的标签会产生巨大的位数组。对于大范围的 0 是否有“压缩”位数组的标准?或者应该完全使用其他结构?
编辑
我不是在寻找数据库解决方案或需要将整个表保留在内存中的解决方案,而是寻找用于过滤单个项目的结构
Imagine you wanted to serialize and deserialize stackoverflow posts including their tags as space efficiently as possible (in binary), but also for performance when doing tag lookups. Is there a good datastructure for that kind of scenario?
Stackoverflow has about 28532 different tags, you could create a table with all tags and assign them an integer, Furthermore you could sort them by frequency so that the most common tags have the lowest numbers. Still storing them simply like a string in the format "1 32 45" seems a bit inefficent borth from a searching and storing perspective
Another idea would be to save tags as a variable bitarray which is attractive from a lookup and serializing perspective. Since the most common tags are first you potentially could fit tags into a small amount of memory.
The problem would be of course that uncommon tags would yield huge bitarrays. Is there any standard for "compressing" bitarrays for large spans of 0's? Or should one use some other structure completely?
EDIT
I'm not looking for a DB solution or a solution where I need to keep entire tables in memory, but a structure for filtering individual items
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(4)
并不是要破坏你的问题,但 28k 条记录确实不算多。您是否可能过早地进行优化?
我首先会坚持在数据库表上使用“常规”索引。他们使用的严厉启发式方法通常非常有效,并且不容易被击败(或者如果可以的话,它真的值得及时付出努力吗?收益是否足够大?)。
另外,根据您实际执行标签查询的位置,用户是否真的注意到您优化的 200 毫秒时间增益?
首先测量然后优化:-)
编辑
如果没有数据库,我可能会有一个主表,将所有标签与 ID 一起保存(如果可能的话将其保存在内存中)。将定期排序的 ID 列表与每个帖子一起保存。
不确定基于通用性的存储量会有多少帮助。可以在其中进行常规二分搜索的排序列表可能足够快;措施:-)
不过,在这里您需要迭代每个标签查询的所有帖子。
如果这最终变得很慢,您可以为每个标签存储一些帖子标识符。不过,该数据结构可能会变得有些大,并且可能需要一个文件来查找和读取。
对于较小的表,您可以根据哈希值(具有重复项)构建一个表。通过这种方式,您可以使用它快速找到需要进一步检查以查看它们是否匹配的较小的候选帖子列表。
Not to undermine your question but 28k records is really not all that many. Are you perhaps optimizing prematurely?
I would first stick to using 'regular' indices on a DB table. The harshing heuristics they use are typically very efficient and not trivial to beat (or if you can is it really worth the effort in time and are the gains large enough?).
Also depending on where you actually do the tag query, is the user really noticing the 200ms time gain you optimized for?
First measure then optimize :-)
EDIT
Without a DB I would probably have a master table holding all tags together with an ID (if possible hold it in memory). Keep a regular sorted list of IDs together with each post.
Not sure how much storage based on commonality would help. A sorted list in which you can do a regular binary search may prove fast enough; measure :-)
Here you would need to iterate all posts for every tag query though.
If this ends up being to slow you could resort to storing a pocket of post identifiers for each tag. This data structure may become somewhat large though and may require a file to seek and read against.
For a smaller table you could resort to build one based on a hashed value (with duplicates). This way you could use it to quickly get down to a smaller candidate list of posts that need further checking to see if they match or not.
您需要第二个包含 2 个字段的表: tag_id Question_id
就是这样。然后您在 tag_id、question_id 和 Question_id、tag_id 上创建索引 - 这将覆盖索引,因此您的所有查询都会非常快。
You need second table with 2 fields: tag_id question_id
That's it. Then you create indexes on tag_id, question_id and question_id, tag_id - that would be covering index so all your queries would be very fast.
我有一种感觉,你的问题太抽象了;您没有详细说明您想要如何访问数据结构,这非常重要。
话虽这么说,我建议计算每个标签出现的次数,然后使用 Huffman 编码 来提出可用于标签的最短编码。这并不完全完美,但我会坚持下去,直到你证明它是不合适的。然后,您可以将代码与每个问题关联起来。
I have a feeling you abstracted your question too much; you didn't say very much about how you want to access the datastructure, which is very important.
That being said, I suggest to count the number of occurances for each tag and then use Huffman coding to come up with the shortest encoding which can be used for the tags. This is not entirely perfect, but I'd stick with it until you've demonstrate that it's inappropriate. You can then associate the codes with each question.
如果您想有效地查找特定标签内的问题,您将需要某种索引。也许,所有 Tag 对象都可以有一个引用数组(引用、指针、数字 ID 等),指向用此特定标签标记的所有问题。这样,您只需要找到标签对象,并且您就有一个指向该标签的所有问题的数组。
If you want to efficiently lookup questions within a specific tag, you will need some kind of index. Maybe, all Tag objects could have an array of references (references, pointers, nummeric-id, etc) to all the questions that are tagged with this particular tag. This way you simply need to find the tag object and you have an array pointing to all the questions of that tag.