SQL - How to tag data?
I need to store short tags (A01, B34, etc.) in a SQL table, and make sure they're indexed. Creating an INT column for each letter in the alphabet is not possible, because entries can have multiple 'A' tags, for example.
First I stored them as a long string, separated with spaces (for example "A01 B34"). But this requires a LIKE '%...%' query, which does a full-table scan and ignores any indexes. So I'm looking for alternatives.
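In other words, something like this (table and column names are just for illustration):

    SELECT * FROM entries WHERE tags LIKE '%A01%';  -- leading wildcard defeats any index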
I now use SQLite FTS (full-text search) to search for these tags, but this requires a special table to store the tags in, fetching results with JOIN queries, and all kinds of other things I'd rather avoid.
My requirements are pretty simple: I need to store millions of short strings, each with their own tags, and do simple searches for these tags.
Is my current approach (doing FTS on the tags) the fastest? Or is it better to use a NoSQL database for this kind of data?
4 Answers
I will share my experience of how I did this at my previous startup, on the Pageflakes Community site. At Pageflakes, user-created content is tagged. You can see an example here:
http://www.pageflakes.com/Community/Content/Flakes.aspx?moduleKey=4796
Each widget and pagecast has a collection of tags. When someone searches, we give the tags the highest priority, then the title, then the description of the item.
Assuming you have a Content table like this:
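A minimal sketch of what that might look like (the exact columns are assumptions; SearchData is explained below):

    CREATE TABLE Content (
        ID int IDENTITY PRIMARY KEY,
        Title varchar(255),
        Description varchar(2000),
        SearchData varchar(4000)  -- populated from tags + title + description, see below
    );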
First of all, you need to create a table for all unique tags.
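For example (a sketch; names are assumptions):

    CREATE TABLE Tag (
        ID int IDENTITY PRIMARY KEY,
        Name varchar(50) UNIQUE  -- one row per distinct tag, e.g. 'A01'
    );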
Then you need to map the tag to content rows.
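A typical many-to-many mapping table (again a sketch):

    CREATE TABLE ContentTag (
        ContentID int REFERENCES Content(ID),
        TagID int REFERENCES Tag(ID),
        PRIMARY KEY (ContentID, TagID)  -- also serves as the index for tag lookups
    );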
You will now ask: for each new piece of content, do I have to insert into three tables? Not always. You insert into the Tag table only when you have a new tag. Most of the time, people choose existing tags. After a couple of months of tagging, users will have exhausted the unique tags; from then on, 99% of the time users pick an existing tag. That removes one insert, leaving you with only one additional insert (the mapping row).
Moreover, inserts are always significantly less frequent than selects; most likely you will have 99% reads and 1% writes.
Unless you introduce these two tables, you can never have a UI where users can click on a tag and see all the content tagged with that particular tag. If you have no need for such a feature, then of course you can just add a "tags" column on the Content table itself and store the tags in comma-delimited format.
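With the two tables in place, that tag-click lookup becomes a plain indexed join (using the sketched names from above):

    SELECT c.*
    FROM Content c
    JOIN ContentTag ct ON ct.ContentID = c.ID
    JOIN Tag t         ON t.ID = ct.TagID
    WHERE t.Name = 'A01';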
Now the most important point: how to produce the best search result. On the Content table, we have a varchar field called "SearchData". This field is populated first with the tag names, then the title, then the description.
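A sketch of that population step, assuming the application has already built the space-separated tag list in @TagNames (T-SQL syntax; variable names are hypothetical):

    UPDATE Content
    SET SearchData = @TagNames + ' ' + Title + ' ' + Description
    WHERE ID = @ContentID;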
Then you use SQL Server's full-text indexing to index only the SearchData column, not any other field in the Content table.
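In T-SQL that is roughly the following; the catalog name and the unique key index name (PK_Content) are assumptions:

    CREATE FULLTEXT CATALOG ContentCatalog;

    CREATE FULLTEXT INDEX ON Content(SearchData)
        KEY INDEX PK_Content      -- a unique index on the table is required
        ON ContentCatalog;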
Does this work for you?
You do not give us a lot of details to go on, but your design seems to be all wrong. It is not in third normal form.
@Joshua, please look up the term "normalization". Currently your data is denormalized. Denormalization is a legitimate option, but only after normalization, as a kind of performance hack. As it stands, your design seems to be wrong.
As an example, instead of one table you should have three tables:
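In the asker's terms, a sketch might be (SQLite syntax; names are illustrative):

    CREATE TABLE entry (id INTEGER PRIMARY KEY, body TEXT);
    CREATE TABLE tag (id INTEGER PRIMARY KEY, name TEXT UNIQUE);
    CREATE TABLE entry_tag (
        entry_id INTEGER REFERENCES entry(id),
        tag_id   INTEGER REFERENCES tag(id),
        PRIMARY KEY (entry_id, tag_id)
    );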
This is a classic design pattern in relational databases; NoSQL is not needed here.
As other users have pointed out, the data is not well normalized. I'll assume that this is intentional and that there is some very large size requirement (hundreds of GB or TB) or a huge throughput requirement that you haven't mentioned. But before you start down any path, you should understand exactly what your requirements are: how often you write versus read, what the latency requirements for writes and reads are, and you have to include index maintenance in your calculations.
If you have a significant performance requirement, you might try building a near-line index system on top of what you currently have. I've used this technique in the past for systems with large throughput requirements. The idea is basically that for writes, you make them as small and quick as possible, and create a batch process that comes back and adds the data into a secondary search table, getting it into a form that can be searched. The benefit is that your writes can be done quickly, and if you choose your clustered index well, the reads for the batch processing can be done very efficiently. In addition, you can segment the data onto different servers as necessary to support higher search throughput. The major drawback is that updates are not instantaneously reflected in search results.
If you write into a table like:
    CREATE TABLE data (
        id binary(6),
        -- ... other columns ...
        [timestamp] datetime,   -- last write time, drives the batch process
        tags varchar(256)       -- space-separated tags, e.g. 'A01 B34'
    );
and have a secondary table:
    CREATE TABLE search (
        tag char(3),            -- e.g. 'A01'
        dataId binary(6)        -- references data(id)
    );
You can create a batch process that comes around, takes the last BATCH_SIZE (maybe 1,000) records, splits the tags column on spaces, and inserts/deletes the tags into/from the search table. You keep a variable/row somewhere with the last timestamp value you've processed and start from there at the next batch interval. Finally, if deletes are important, each batch interval will need to find the set of records no longer in the data table. Alternatively, you could choose a tombstone table if your data table is too large, or you can concurrently perform deletes against data and search if they happen infrequently enough.
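A sketch of one batch interval, assuming SQL Server 2016+ for STRING_SPLIT and a hypothetical batch_state watermark table:

    DECLARE @since datetime = (SELECT last_timestamp FROM batch_state);

    -- Grab the next batch of rows written since the watermark.
    SELECT TOP (1000) id, [timestamp], tags
    INTO #batch
    FROM data
    WHERE [timestamp] > @since
    ORDER BY [timestamp];

    -- Drop stale entries for these rows, then re-insert their current tags.
    DELETE s FROM search s JOIN #batch b ON s.dataId = b.id;

    INSERT INTO search (tag, dataId)
    SELECT value, b.id
    FROM #batch b
    CROSS APPLY STRING_SPLIT(b.tags, ' ');

    -- Advance the watermark for the next interval.
    UPDATE batch_state
    SET last_timestamp = (SELECT MAX([timestamp]) FROM #batch);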
Things to watch out for with batch processing are making the batch size too big and taking table locks when updating the search table. Also, you have to watch out for duplicate timestamps. And of course, when writing/updating the data table, it is necessary to always update the timestamp.