How to design a table in Cassandra for a TinyURL use case?

Posted on 2025-01-11 23:29:48

Recently I came across a well-known design problem, 'Tiny URL' (https://www.educative.io/courses/grokking-the-system-design-interview/m2ygV4E81AR).

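I see people vouching for NoSQL databases such as DynamoDB or Cassandra. I have been reading about Cassandra for a few days, and I want to design my solution to this particular problem around that database.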

What would the table definition be? Suppose I choose the following table definition:

    CREATE TABLE UrlMap (tiny_url text PRIMARY KEY, url text);

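Won't this result in a lot of partitions? My partition key can take roughly 68 billion values (6-character base64 strings).

Would this somehow affect overall read/write performance? If so, what would be a better model for defining the table?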
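As a quick check of the 68 billion figure, here is a minimal Python sketch. It assumes the URL-safe base64 alphabet (`A-Z`, `a-z`, `0-9`, `-`, `_`); the names `ALPHABET`, `keyspace_size`, and `random_code` are illustrative only and do not come from any library or from the course linked above.

```python
import secrets
import string

# ASSUMPTION: URL-safe base64 alphabet, 26 + 26 + 10 + 2 = 64 symbols.
ALPHABET = string.ascii_uppercase + string.ascii_lowercase + string.digits + "-_"
CODE_LENGTH = 6


def keyspace_size(alphabet_size=len(ALPHABET), length=CODE_LENGTH):
    """Number of distinct codes of the given length over the alphabet."""
    return alphabet_size ** length


def random_code(length=CODE_LENGTH):
    """Draw one code uniformly at random (hypothetical generator, for illustration)."""
    return "".join(secrets.choice(ALPHABET) for _ in range(length))


print(keyspace_size())  # 64**6 = 68719476736, roughly 68.7 billion
print(random_code())    # a random 6-character code; differs on every run
```

Each such code would be one value of `tiny_url`, and therefore one partition in the table above.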


纸伞微斜 2025-01-18 23:29:48

The primary principle of data modelling in Cassandra is to design one table for each application query.

For a URL shortening service, the main application query is to retrieve the equivalent full URL for a given tiny URI. In pseudo-code, the query looks like:

    GET long url FROM datastore WHERE uri = ?

Note that for the purposes of the service, we won't store the web domain name, so that the app is reusable for any domain. The filter (the WHERE clause) is the URI, so that is what you want as the partition key, and we design the table accordingly:

    CREATE TABLE urls_by_uri (
        uri text,
        long_url text,
        PRIMARY KEY (uri)
    );

If we want to retrieve the URL for http://tinyu.rl/abc123, the CQL query is:

    SELECT long_url FROM urls_by_uri WHERE uri = 'abc123';
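From application code, the same single-partition lookup could be wrapped roughly as follows. This is only a sketch: the `UrlStore` class and its method names are hypothetical, and the `session` argument is assumed to behave like a session from the DataStax Python driver (`cassandra-driver`), offering `prepare()` and `execute()`.

```python
SELECT_CQL = "SELECT long_url FROM urls_by_uri WHERE uri = ?"
INSERT_CQL = "INSERT INTO urls_by_uri (uri, long_url) VALUES (?, ?)"


class UrlStore:
    """Thin, hypothetical wrapper around a driver session for urls_by_uri."""

    def __init__(self, session):
        # ASSUMPTION: `session` behaves like cassandra.cluster.Session.
        self._session = session
        # Prepare each statement once, then execute it many times.
        self._select = session.prepare(SELECT_CQL)
        self._insert = session.prepare(INSERT_CQL)

    def save(self, uri, long_url):
        """Store the mapping from a short URI to its full URL."""
        self._session.execute(self._insert, [uri, long_url])

    def resolve(self, uri):
        """Return the full URL for a short URI, or None if it is unknown."""
        row = self._session.execute(self._select, [uri]).one()
        return row.long_url if row is not None else None


# Usage against a real cluster (assumes a node on 127.0.0.1 and a
# hypothetical keyspace named `shortener` that contains urls_by_uri):
#
#   from cassandra.cluster import Cluster
#   session = Cluster(["127.0.0.1"]).connect("shortener")
#   store = UrlStore(session)
#   store.save("abc123", "https://example.com/some/long/path")
#   print(store.resolve("abc123"))
```

Both statements filter on, or write to, exactly one partition key, which is the access pattern the table was designed for.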

As Phact and Andrew pointed out, there is no need to worry about the number of partitions (records) you'll be storing in the table, because you can store as many as 2^128 partitions in a Cassandra table, which for practical purposes is limitless.

In Cassandra, each partition key is hashed into a token value using the Murmur3 hash algorithm (the default partitioner). This distributes partitions randomly across all nodes in the cluster. The same hash is used to determine which node "owns" a partition, which makes retrieval (reads) very fast in Cassandra.
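The idea of hashing a key to a token and then finding the owning node can be illustrated with a toy ring in Python. This is a simplified sketch under loud assumptions: BLAKE2b stands in for Murmur3 (which the Python standard library does not provide), the three node names and their tokens are made up, and replication and virtual nodes are ignored. The tokens it prints will not match what a real cluster computes.

```python
import bisect
import hashlib


def token_for(partition_key):
    """Hash a partition key to a signed 64-bit token.

    ASSUMPTION: BLAKE2b is a stand-in for Murmur3, so real Cassandra
    tokens for the same key will differ.
    """
    digest = hashlib.blake2b(partition_key.encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big", signed=True)


class Ring:
    """Toy token ring: each node owns the token range that ends at its token."""

    def __init__(self, node_tokens):
        # Sort by token so the owner can be found with a binary search.
        self._entries = sorted((token, node) for node, token in node_tokens.items())
        self._tokens = [token for token, _ in self._entries]

    def owner(self, partition_key):
        token = token_for(partition_key)
        # First node whose token is >= the key's token; wrap to the first node.
        index = bisect.bisect_left(self._tokens, token) % len(self._tokens)
        return self._entries[index][1]


# Three hypothetical nodes with evenly spaced tokens over the signed 64-bit range.
ring = Ring({
    "node-a": -(2**63) + (2**64 // 3),
    "node-b": -(2**63) + 2 * (2**64 // 3),
    "node-c": 2**63 - 1,
})

for uri in ("abc123", "Zx9_q-", "000001"):
    print(uri, token_for(uri), ring.owner(uri))
```

Because the owner is computed from the key alone, a lookup by `uri` needs no scan or directory search: the same computation that placed the row also locates it.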

As long as you limit each SELECT query to a single partition, retrieving the data is extremely fast. In fact, I work with hundreds of companies whose SLA requires 95% of reads to complete within 6-9 milliseconds. This is achievable in Cassandra when you model your data correctly and size your cluster correctly. Cheers!
