Storing and processing large amounts of spatio-temporal data
As part of our research group, we're collecting large amounts of location data. Our data essentially looks like (user id, lat/long co-ordinates, timestamp). There's other metadata involved too, but that's not relevant here.
We're collecting about 2-3 million records a week, and expect to collect about a year's worth of data in due time.
I'd really like some advice on techniques for storing and processing this data. We'd like to be able to answer queries similar to:
(1) For a given location, who was near that location (within a specified distance) over a specified period of time?
(2) Which locations are near each other?
That's the general idea. We don't need a real-time response, but what are good databases (or other data storage software)? I've come across people talking about k-d trees; do they work at this scale? What kind of hardware do I need? I'm hoping to get pointers towards general strategies. How do we store this data? Does it even make sense to store it all in a database? Which data structures/software/packages lend themselves well to distance/radius calculations?
We're most familiar with Python/Linux, would prefer to stay away from Java, and prefer open source/free software. We're new to all this, so pointers to books and papers would also be useful. Any and all advice would be greatly appreciated.
Comments (1)
PostGIS is probably what you are looking for.
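To get a feel for what query (1) involves before setting up a database, here is a minimal pure-Python sketch. It is not how PostGIS does it: it is a brute-force scan using the haversine great-circle distance, and the Record class, the users_near function, and the sample rows are all hypothetical names and data made up for illustration. A linear scan like this is fine for experimenting on a sample, but for a year of data at a few million records a week you would most likely want a spatial index, which is the kind of thing PostGIS provides.

```python
from dataclasses import dataclass
from datetime import datetime
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; an approximation


@dataclass(frozen=True)
class Record:
    """Hypothetical row shape: (user id, lat/long, timestamp)."""
    user_id: str
    lat: float
    lon: float
    ts: datetime


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in degrees."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlmb = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlmb / 2) ** 2
    # Clamp to 1.0 to guard against floating-point overshoot.
    return 2 * EARTH_RADIUS_KM * asin(min(1.0, sqrt(a)))


def users_near(records, lat, lon, radius_km, start, end):
    """Query (1): user ids seen within radius_km of (lat, lon) in [start, end]."""
    return {
        r.user_id
        for r in records
        if start <= r.ts <= end
        and haversine_km(lat, lon, r.lat, r.lon) <= radius_km
    }


# Made-up sample data, for illustration only.
records = [
    Record("alice", 51.5080, -0.1280, datetime(2024, 1, 10)),  # near, in window
    Record("bob", 48.8566, 2.3522, datetime(2024, 1, 10)),     # far away
    Record("carol", 51.5080, -0.1280, datetime(2024, 3, 1)),   # near, outside window
]

nearby = users_near(
    records, 51.5074, -0.1278, radius_km=1.0,
    start=datetime(2024, 1, 1), end=datetime(2024, 1, 31),
)
print(nearby)
```

The time filter runs first because it is the cheaper test, so the trigonometry is only done for records inside the window. Query (2), finding locations near each other, would be quadratic if done this way, which is where a k-d tree or a database-side spatial index starts to matter.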