Basic set-based operations with a document database (NoSQL)

Posted on 2024-11-24 09:09:19

As with most, I come from an RDBMS world, trying to get my head around NoSQL databases and specifically document stores (as I find them the most interesting).

I am trying to understand how to perform some set-based operations using a document database (I'm playing with RavenDB).

So as per my understanding:

Union (as in SQL UNION) is a very straightforward append. Additionally, unions between different sets (SQL JOIN) can be achieved with map/reduce. The example given in the RavenDB mythology book, with comment counts on blog entries, is a good start.
Intersection can be performed using a number of techniques, from de-normalization right through to creating a “mapping” or “link” document as described here (and in the aggregator example below). In an RDBMS this would be performed using a simple "INNER JOIN" or "WHERE x IN".
Subtract (relative complement) is where I am getting stuck. In an RDBMS this operation is simply a "WHERE x NOT IN" or a "LEFT JOIN" where the joined set is NULL.
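For reference, the relational form of the subtract described above could be sketched as follows, using Python's built-in sqlite3 module. The table layout, column names, and sample rows are assumptions made up for illustration; they are only loosely modelled on the aggregator example and are not part of the original question.

```python
import sqlite3

# Hypothetical miniature schema and data (assumed for illustration only).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE entry (id TEXT PRIMARY KEY, title TEXT);
    CREATE TABLE tag (userId TEXT, entryId TEXT, tag TEXT);
    INSERT INTO entry VALUES ('e1', 'First'), ('e2', 'Second'), ('e3', 'Third');
    INSERT INTO tag VALUES
        ('u1', 'e1', 'favourite'),
        ('u2', 'e1', 'read'),
        ('u1', 'e3', 'read');
""")

# Variant 1: WHERE x NOT IN (subquery).
not_in_rows = conn.execute(
    "SELECT id FROM entry WHERE id NOT IN (SELECT entryId FROM tag) ORDER BY id"
).fetchall()

# Variant 2: LEFT JOIN, keeping only rows where the joined side is NULL.
left_join_rows = conn.execute(
    """
    SELECT e.id
    FROM entry AS e
    LEFT JOIN tag AS t ON t.entryId = e.id
    WHERE t.entryId IS NULL
    ORDER BY e.id
    """
).fetchall()

untagged_not_in = [row[0] for row in not_in_rows]
untagged_left_join = [row[0] for row in left_join_rows]
print(untagged_not_in, untagged_left_join)
```

Both variants select the entries that no tag row points at, which is the relative complement of the tagged entries within the full entry set.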

Using a real-world example, let’s say we have an RSS aggregator (such as Google Reader) which has millions if not billions of RSS entries and thousands of users, each tagging entries as favourite, etc.

In this example we focus on entry, user and tag; where tag acts as a link between user and entry.

user {string id, string name /*etc.*/}
entry {string id, string title, string url /*etc.*/}
tag {string userId, string entryId, string[] tags} /* (favourite, read, etc.)*/

With the above approach it is easy to perform the intersection between entry and user using tag. But I cannot get my head around how one would perform a subtract. For instance “Return all items that do not have any tags” or even more daunting “return the latest 1000 items without any tag”.
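To make the requested operation concrete, here is a minimal in-memory sketch of "the latest N items without any tag" over documents shaped like the ones above. It is plain Python and not RavenDB code; the `published` field and the helper name `latest_untagged` are assumptions, since the schema in the question has no timestamp. It also assumes every tag document can be read in one place, which is the expensive part at the scale described.

```python
from heapq import nlargest


def latest_untagged(entries, tags, limit):
    """Return up to `limit` newest entries that no tag document refers to."""
    # Ids of every entry that at least one user has tagged.
    tagged_ids = {t["entryId"] for t in tags if t["tags"]}
    # Relative complement: entries whose id is not in the tagged set.
    untagged = (e for e in entries if e["id"] not in tagged_ids)
    # `published` is a hypothetical timestamp field, needed to define "latest".
    return nlargest(limit, untagged, key=lambda e: e["published"])


entries = [
    {"id": "e1", "title": "First", "published": 100},
    {"id": "e2", "title": "Second", "published": 200},
    {"id": "e3", "title": "Third", "published": 300},
    {"id": "e4", "title": "Fourth", "published": 400},
]
tags = [
    {"userId": "u1", "entryId": "e1", "tags": ["favourite"]},
    {"userId": "u2", "entryId": "e4", "tags": ["read"]},
]

result_ids = [e["id"] for e in latest_untagged(entries, tags, limit=1000)]
print(result_ids)
```

The set lookup keeps the subtraction linear in the number of documents, but it still has to scan all entries and all tags, which is why this only illustrates the semantics and not an efficient answer.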

So my question:

Can you point me to some reading material on the matter?
Can you share some ideas on how one can accomplish the task efficiently?

Note: I know that you lose query flexibility with document databases, but surely there must be a way to do this?

萌辣 2024-12-01 09:09:19

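Amok,
What you want is really not something that can be done easily in non-relational databases.
That is mostly because they do not think in sets, and because they are strongly tied to distributed computing.
For example, you cannot really do efficient set operations without access to all the data, which pretty much means that any set-based operation would require access to all of that data.
Since NoSQL databases are usually used in distributed scenarios, they cannot really support that.
RavenDB, specifically, allows some operations on a specified set, but it is built on the assumption of independent documents, which do not have strong relations to other documents and do not need to be operated on together in the same manner.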
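The point about needing access to all the data can be illustrated with a toy sketch. The two-node layout and the names `node_a`, `node_b`, and `untagged_ids` are hypothetical; this is plain Python and says nothing about how RavenDB actually partitions data.

```python
def untagged_ids(entry_ids, tagged_entry_ids):
    """Set difference: entry ids that do not appear among the tagged ids."""
    return set(entry_ids) - set(tagged_entry_ids)


# Hypothetical layout: entries and tag documents are partitioned independently,
# so a tag document may live on a different node than the entry it refers to.
node_a = {"entries": ["e1", "e2"], "tagged_entry_ids": ["e3"]}
node_b = {"entries": ["e3", "e4"], "tagged_entry_ids": ["e1"]}

# Each node subtracts only the tags it holds locally, then results are merged.
local_only = (
    untagged_ids(node_a["entries"], node_a["tagged_entry_ids"])
    | untagged_ids(node_b["entries"], node_b["tagged_entry_ids"])
)

# The correct answer needs every tagged id gathered in one place first.
all_tagged = node_a["tagged_entry_ids"] + node_b["tagged_entry_ids"]
global_result = untagged_ids(node_a["entries"] + node_b["entries"], all_tagged)

print(sorted(local_only), sorted(global_result))
```

In this layout the node-local subtraction wrongly reports e1 and e3 as untagged, because the tag documents that exclude them sit on the other node. Only the version that sees all the tag data returns the right set.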