A real-world Bigtable example

Posted on 2024-11-25 07:27:59

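Can someone provide a real-world example of how data would be structured within a Bigtable? Please discuss it from a search engine, social networking, or any other familiar point of view that illustrates clearly and pragmatically how the row -> column family -> column combination is superior to traditional normalized relational approaches.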

浅听莫相离 2024-12-02 07:27:59

It helps to read the original Google white paper:

http://static.googleusercontent.com/external_content/untrusted_dlcp/labs.google.com/en//papers/bigtable-osdi06.pdf

A comprehensive list of information sources on Google's data architecture:

http://highscalability.com/google-architecture

Update: November 4, 2014

A newer version of the Google white paper PDF can be found here:

http://static.googleusercontent.com/media/research.google.com/en/us/archive/bigtable-osdi06.pdf
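To make the row -> column family -> column layout from the question concrete, here is a minimal sketch of the Webtable example that the white paper uses, modeled as nested Python dictionaries. The row key, the `contents` and `anchor` families, and the timestamps follow the paper's figure; the page bodies are placeholders, and `webtable` and `read_latest` are hypothetical names for illustration, not part of any Bigtable client API.

```python
# Sketch of the paper's model: (row, column, timestamp) -> value.
# Layout here: row key -> column family -> qualifier -> {timestamp: value}.
webtable = {
    # Row key is the reversed URL, so pages of one domain sort together.
    "com.cnn.www": {
        # "contents:" has an empty qualifier and keeps several page versions.
        "contents": {"": {3: "<html>v1", 5: "<html>v2", 6: "<html>v3"}},
        # "anchor:<referring site>" stores the link text pointing at this page.
        "anchor": {
            "cnnsi.com": {9: "CNN"},
            "my.look.ca": {8: "CNN.com"},
        },
    },
}


def read_latest(table, row, family, qualifier=""):
    """Return the most recent version stored in one cell."""
    versions = table[row][family][qualifier]
    return versions[max(versions)]


print(read_latest(webtable, "com.cnn.www", "contents"))
print(sorted(webtable["com.cnn.www"]["anchor"]))
```

In this layout a single row lookup returns the page together with every incoming anchor, where a normalized schema would typically join a pages table to a links table. The row is sparse: a page with no incoming links simply has no `anchor` columns.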

与风相奔跑 2024-12-02 07:27:59

I believe the difference is more about the way the data are queried than the way they are stored.

The main difference between relational databases and NoSQL is that there is, um, no SQL in the latter.

This means you (not the query optimizer) write the query plans yourself.

This may increase the query performance if you know how to do that.

Consider a typical search engine query: find the top 10 pages with all (or some) words included, say, "wet t-shirt contest", ordered by relevance (we're leaving word proximity aside for simplicity's sake).

To do this, you need all words split and kept in a searchable and iterable list ordered by (word, relevance, source). Then you partition this list into (3 * ranks) sets (each starting at the top of one of the words in your search query at a given rank), where ranks is the number of possible ranks, say, 1 to 10, and join the sets on source.
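As a rough illustration of such a list, here is a minimal Python sketch. The postings, the page names, and the `sources_at` helper are all made up for the example; a real index would live in a sorted on-disk structure and not in a Python list.

```python
from bisect import bisect_left

# Hypothetical postings as (word, rank, source); rank 1 is the most relevant.
# Sorting the tuples gives the (word, relevance, source) order described above.
postings = sorted([
    ("wet", 1, "page-a"),
    ("wet", 2, "page-b"),
    ("shirt", 1, "page-a"),
    ("shirt", 3, "page-b"),
    ("contest", 2, "page-a"),
    ("contest", 2, "page-c"),
])


def sources_at(word, rank):
    """Return the sorted sources of one (word, rank) set using a range seek."""
    # The empty string sorts before any source, so this lands on the first
    # entry of the (word, rank) set, if the set exists.
    start = bisect_left(postings, (word, rank, ""))
    sources = []
    for w, r, source in postings[start:]:
        if (w, r) != (word, rank):
            break
        sources.append(source)
    return sources


print(sources_at("contest", 2))
print(sources_at("wet", 3))
```

Each of the (3 * ranks) sets is then one contiguous range of the list, already sorted by source, which is what makes a merge-style join on source possible.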

In a relational database it would look like this:

SELECT  w1.source
FROM    ranks r1
JOIN    words w1
ON      w1.word = 'wet'
        AND w1.rank = r1.value
CROSS JOIN
        ranks r2
JOIN    words w2
ON      w2.word = 'shirt'
        AND w2.rank = r2.value
        AND w2.source = w1.source
CROSS JOIN
        ranks r3
JOIN    words w3
ON      w3.word = 'contest'
        AND w3.rank = r3.value
        AND w3.source = w1.source
ORDER BY
        relevance_formula (w1.rank, w2.rank, w3.rank)
LIMIT 10

This is best executed using a MERGE JOIN over the three sets partitioned by rank.

However, no optimizer I'm aware of will build this plan (leaving aside the fact that relevance_formula may not distribute over the individual ranks).

To work around this, you should implement your own query plan: start at the top of each word/rank pair and descend all three sets simultaneously, skipping the missing values and using search rather than next if you feel that there will be too much to skip in one of the sets.
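A hand-written plan of this kind could look roughly like the following Python sketch. It intersects several source lists that are already sorted, stepping with "next" while the gap is small and falling back to a binary "search" when it is not. The function name, the `seek_threshold` parameter, and the sample data are assumptions for illustration, and the sketch handles a single rank combination only.

```python
from bisect import bisect_left


def intersect_sorted(lists, seek_threshold=8):
    """Yield the values present in every sorted list, descending all at once."""
    if not lists:
        return
    pos = [0] * len(lists)
    while all(p < len(lst) for p, lst in zip(pos, lists)):
        # A value below the largest current head cannot be in every list.
        target = max(lst[p] for p, lst in zip(pos, lists))
        matched = True
        for i, lst in enumerate(lists):
            # "next": step linearly while there is little to skip.
            steps = 0
            while (pos[i] < len(lst) and lst[pos[i]] < target
                   and steps < seek_threshold):
                pos[i] += 1
                steps += 1
            # "search": too much to skip, so binary-search to the target.
            if pos[i] < len(lst) and lst[pos[i]] < target:
                pos[i] = bisect_left(lst, target, pos[i])
            if pos[i] == len(lst):
                return  # one list is exhausted, no further matches
            if lst[pos[i]] != target:
                matched = False
        if matched:
            yield target
            pos = [p + 1 for p in pos]


wet = ["page-a", "page-c", "page-e", "page-g"]
shirt = ["page-c", "page-d", "page-g", "page-i"]
contest = ["page-b", "page-c", "page-g"]
print(list(intersect_sorted([wet, shirt, contest])))
```

Each list is read at most once and front to back, which is the behavior of a merge join. The threshold is the "if you feel there is too much to skip" judgment turned into a tunable number.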

That said, the relational approach gives you a more convenient way to query data, at the cost of a possible performance penalty.

If you are developing a campus web server, then writing those SELECT * queries is fine, even if they run a microsecond longer than they could. But if you are developing a Google, it is worth spending some time on optimizing the queries, which pure relational systems that only allow access to their data through SQL just would not let you do.

The so-called NoSQL and relational databases sometimes diffuse into each other. For instance, Berkeley DB is a well-known NoSQL storage engine that MySQL used as a storage backend to allow SQL queries. And vice versa, HandlerSocket allows pure key-value queries against a relational InnoDB store that has a MySQL database built over it.