I have a requirement but am not able to figure out how to solve it. I have datasets in the format below:
id, atime, grade
123, time1, A
241, time2, B
123, time3, C
or, if I put it in list format:
[[123,time1,A],[124,timeb,C],[123,timec,C],[143,timed,D],[423,timee,P].......]
Now my use case is to perform comparisons, aggregations, and queries over multiple rows, such as (see the SQL sketch after this list):
- time difference between the last 2 rows where id=123
- time difference between the last 2 rows where id=123 and grade=A
- time difference between the first, 3rd, 5th, and latest rows
- all data (or the last 10 records for a particular id) should be easily accessible
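To make the requirement concrete: if the data were loaded into a relational table, the first two queries could be sketched with standard SQL window functions, roughly as below (the table events(id, atime, grade) and a timestamp-typed atime are assumptions for illustration, not a schema I already have):

-- Time difference between the last 2 rows where id = 123:
SELECT atime - LAG(atime) OVER (ORDER BY atime) AS diff
FROM events
WHERE id = 123
ORDER BY atime DESC
LIMIT 1;

-- Same, additionally restricted to grade = 'A':
SELECT atime - LAG(atime) OVER (ORDER BY atime) AS diff
FROM events
WHERE id = 123 AND grade = 'A'
ORDER BY atime DESC
LIMIT 1;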
I also need to do further computation on the data. What format should I choose for the dataset, and what database/tools should I use?
I don't think a relational database is useful here. I was not able to solve it with Solr/Elastic; if you have any ideas, please give a brief explanation, or a heads-up about any other tool such as Spark, Hadoop, or Cassandra.
I am trying things out, but any help is appreciated.
Choosing the right technology depends heavily on things related to your SLA: how much latency can your queries tolerate? What are your query types? Is your data categorized as big data or not? Is the data updatable? Do we expect late events? Do we need the historical data in the future, or can we use techniques like rollup? And so on. To clarify my answer: you can probably solve your problems by using window functions. For example, you can store your data in any of the tools you mentioned and, by using the Presto SQL engine, query it and get your desired result. But not all of the options are optimal, and these kinds of problems usually cannot be solved with a single tool; rather, a set of tools covers all the requirements.
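To make the window-function idea concrete, here is a sketch in Presto SQL for the "first, 3rd, 5th and latest" comparison; the table name events and the timestamp column atime are assumptions, so adapt them to whatever store you pick:

-- Number every row by event time, remember the total row count,
-- then compare specific positions (returns NULL if a position is missing).
WITH numbered AS (
  SELECT atime,
         ROW_NUMBER() OVER (ORDER BY atime) AS rn,
         COUNT(*) OVER () AS total
  FROM events
)
SELECT
  date_diff('second',
            MAX(CASE WHEN rn = 1 THEN atime END),
            MAX(CASE WHEN rn = total THEN atime END)) AS first_to_latest_seconds,
  date_diff('second',
            MAX(CASE WHEN rn = 3 THEN atime END),
            MAX(CASE WHEN rn = 5 THEN atime END)) AS third_to_fifth_seconds
FROM numbered;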
tl;dr: the text below does not arrive at a solution; it introduces a way to think about data modeling and about choosing tools.
Let me try to model the problem in order to choose a single tool. I assume that your data is not updatable, that you need a low-latency response time, that we do not expect any late events, and that we face a large-volume data stream that must be saved as raw data.
A natural model is to use the id as the partition key and atime as the clustering part, so that for each ID the events are ordered by time (see the CQL sketch at the end). Based on the notes above, it seems we should choose a tool with good support for random-access queries. Cassandra, Postgres, Druid, MongoDB, and Elasticsearch are the ones I can currently recall. Let's check them:
With Cassandra (or Postgres), we can define the id and atime columns (atime as a timestamp column) as a compound primary key, but that alone does not save us. With Druid, by using rollup with aggregators such as EARLIEST, we can answer our questions, but by using rollup we lose the raw data, and we need it. So it seems MongoDB and Elasticsearch can answer our requirements, but there are a lot of 'if's along the way. I think we can't find a straightforward solution with a single tool. Maybe we should choose multiple tools and use techniques like duplicating data to find an optimal solution.
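As a concrete illustration of the partition/clustering model above, a minimal CQL sketch could look like the following (the table name and types are assumed; note also that under this key, two events with the same id and atime would overwrite each other):

-- id is the partition key, atime the clustering column (newest first),
-- so "the last N events of one id" is a cheap single-partition read.
CREATE TABLE events (
    id    int,
    atime timestamp,
    grade text,
    PRIMARY KEY ((id), atime)
) WITH CLUSTERING ORDER BY (atime DESC);

-- Last 10 records for a particular id:
SELECT atime, grade FROM events WHERE id = 123 LIMIT 10;

This covers the "last 10 records for a particular id" requirement cheaply, but the cross-row time differences would still have to be computed client-side or by an engine such as Presto on top.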