How do I find the right data design and the right tool/database/query for the following requirements?

Posted 2025-02-09 21:36:18

I have a requirement but am not able to figure out how to solve it. My dataset is in the format below:

id, atime, grade
123, time1, A
241, time2, B
123, time3, C

or, if I put it in list format:

[[123,time1,A],[124,timeb,C],[123,timec,C],[143,timed,D],[423,timee,P].......]

Now my use case is to perform comparisons, aggregations, and queries over multiple rows, such as:

  1. time difference between the last 2 rows where id=123
  2. time difference between the last 2 rows where id=123 and grade=A
  3. time difference between the 1st, 3rd, 5th, and latest rows
  4. all data (or the last 10 records for a particular id) should be easily accessible.

I also need to do further computation on top of this. What format should I choose for the dataset,
and what database/tools should I use?

I don't think a relational database is useful here. I was not able to solve it with Solr/Elasticsearch; if you have any ideas, please give a brief explanation. Or any heads-up on other tools such as Spark, Hadoop, or Cassandra?
I am trying things out, but any help is appreciated.

Comments (1)

ゃ懵逼小萝莉 2025-02-16 21:36:18

Choosing the right technology depends heavily on things related to your SLA: how much latency can your queries tolerate? What are your query types? Is your data categorized as big data or not? Is the data updatable? Do we expect late events? Do we need historical data in the future, or can we use techniques like rollup? And so on. To clarify my answer: you can probably solve your problems using window functions. For example, you could store your data in any of the tools you mentioned and, using the Presto SQL engine, query it and get the desired result. But not all of them are optimal, and these kinds of problems usually cannot be solved with a single tool; a set of tools may be needed to cover all the requirements.
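
To make the window-function idea concrete, here is a minimal sketch of the kind of query I have in mind. It uses Python's bundled SQLite (3.25+ supports window functions) purely so the snippet is self-contained; the same LAG(...) OVER (...) query shape works on Presto/Trino or Postgres. The table and column names follow the question's sample data; the epoch-second timestamps are made up for illustration.

    import sqlite3

    # In-memory stand-in for whatever store you pick; SQLite >= 3.25
    # is used only because it ships with Python and supports LAG().
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE events (id INTEGER, atime INTEGER, grade TEXT)")
    conn.executemany(
        "INSERT INTO events VALUES (?, ?, ?)",
        [(123, 100, "A"), (241, 110, "B"), (123, 130, "C"), (123, 170, "A")],
    )

    # Query 1: time difference between the last 2 rows where id = 123.
    q = """
        SELECT atime - LAG(atime) OVER (ORDER BY atime) AS diff
        FROM events
        WHERE id = 123
        ORDER BY atime DESC
        LIMIT 1
    """
    print(conn.execute(q).fetchone()[0])  # 40

    # Query 2: the same, additionally filtered to grade = 'A'.
    q2 = q.replace("WHERE id = 123", "WHERE id = 123 AND grade = 'A'")
    print(conn.execute(q2).fetchone()[0])  # 70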

tl;dr: we don't find a solution in the text below; instead, it introduces a way to think about data modeling and choosing tools.

Let me try to model the problem as if we had to choose a single tool. I assume your data is not updatable, you need a low-latency response time, we don't expect any late events, and we face a large-volume data stream that must be saved as raw data.

  • Based on the first and second requirements, random access is crucial (it seems you want to query on a particular ID), so solutions like Parquet or ORC files are not a good choice.
  • Based on the last requirement, data must be partitioned by ID. The first, second, and last requirements all use ID as the identifying part, and there seems to be no need for joins or for global ordering on other fields such as time. So we can choose ID as the partition key (physical or logical) and atime as the clustering part: for each ID, events are ordered by time (see the in-memory sketch after this list).
  • The third requirement is a bit vague: do you want the result over all the data, or per ID?
  • To compute the first three queries, we need a tool that supports window functions.
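
Here is a tiny in-memory Python sketch of the layout argued for above: ID as the partition key, atime as the clustering key, so per-ID reads (the last 2 rows, the last 10 rows) never scan other IDs. All names are illustrative; Cassandra, MongoDB, and the other stores below would realize the same idea physically.

    import bisect
    from collections import defaultdict

    # id -> list of (atime, grade), kept sorted by atime (the "cluster" order).
    partitions = defaultdict(list)

    def insert(event_id, atime, grade):
        bisect.insort(partitions[event_id], (atime, grade))

    def last_n(event_id, n=10):
        # Requirement 4: the last n records for a particular id.
        return partitions[event_id][-n:]

    def last_gap(event_id):
        # Requirements 1/2: time difference between the last two rows.
        rows = partitions[event_id]
        return rows[-1][0] - rows[-2][0] if len(rows) >= 2 else None

    for e in [(123, 100, "A"), (241, 110, "B"), (123, 130, "C"), (123, 170, "A")]:
        insert(*e)
    print(last_n(123, 2))  # [(130, 'C'), (170, 'A')]
    print(last_gap(123))   # 40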

Based on the notes above, it seems we should choose a tool with good support for random-access queries. Cassandra, Postgres, Druid, MongoDB, and Elasticsearch are the ones that currently come to mind. Let's check them:

  • Cassandra: great response times on random-access queries, handles huge amounts of data easily, and has no single point of failure. Sadly, it does not support window functions. You also have to design the data model carefully, and given the future need for the raw data it does not seem like a tool we can pick. We could bypass some of these limitations by running Spark alongside Cassandra, but for now we prefer to avoid adding another tool to the stack.
  • Postgres: great on random-access queries over indexed columns, and it supports window functions. We can shard data (horizontal partitioning) across multiple servers, and by choosing ID as the shard key we get data locality for computations. But there is a problem: ID is not unique, so we cannot choose ID alone as the primary key, and random access suffers (we could use ID plus atime, as a timestamp column, as a compound primary key, but that does not fully save us).
  • Druid: a great OLAP tool. Because of the way Druid stores data (segment files), with the right data model you can run analytic queries over huge volumes of data in sub-second time. It does not support window functions, but with rollup and some other functions (like EARLIEST) we could answer the questions. However, rollup discards the raw data, and we need it.
  • MongoDB: supports random-access queries and sharding. Its aggregation framework also gives us a kind of window function, since we can define pipelines for doing aggregations (a pipeline sketch follows after this list). It supports capped collections, which we could use to store the last 10 events per ID if the cardinality of the ID column is not high. It seems this tool can cover all of the requirements.
  • Elasticsearch: great on random access, maybe the greatest. With some kinds of filter aggregations we can get a type of window function, and it handles large amounts of data via sharding. But its query language is hard: I can imagine answering the first and second questions with ES, but right now I cannot compose the query in my head. Finding the right solution with it would take time.
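
For the MongoDB route, here is a hedged sketch of the kind of aggregation pipeline meant above, answering the first question (time difference between the last two events for id=123). It assumes a locally running mongod and an events collection with documents shaped like {"id": 123, "atime": <epoch seconds>, "grade": "A"}; the database and collection names are made up.

    from pymongo import MongoClient

    coll = MongoClient("mongodb://localhost:27017")["demo"]["events"]

    pipeline = [
        {"$match": {"id": 123}},    # restrict to one id (an index on
        {"$sort": {"atime": -1}},   # {id: 1, atime: -1} makes this cheap)
        {"$limit": 2},              # keep only the last two rows
        {"$group": {
            "_id": "$id",
            "newest": {"$max": "$atime"},
            "oldest": {"$min": "$atime"},
        }},
        {"$project": {"diff": {"$subtract": ["$newest", "$oldest"]}}},
    ]
    for doc in coll.aggregate(pipeline):
        print(doc["diff"])  # gap between the two most recent id=123 events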

So it seems MongoDB and Elasticsearch can answer the requirements, but there are a lot of 'if's along the way. I don't think we can find a straightforward solution with a single tool; maybe we should choose multiple tools and use techniques like duplicating data to find an optimal solution.
