Python 中的算法可以存储和搜索每天发生的数千个编号事件?
我正在研究存储和查询大量项目的事件发生的历史记录的解决方案。
这是简化的场景:我每天都会收到 200 000 个路灯(标记为 sl1 到 sl200000)的日志,其中显示路灯当天是否运行。灯的使用时间长短并不重要,重要的是它是在给定的日历日。
每个灯还存储其他信息,Python 类的开头看起来像这样:
class Streetlamp(object):
"""Class for streetlamp record"""
def __init__(self, **args):
self.location = args['location']
self.power = args['power']
self.inservice = ???
我的 py-foo 不太好,我想避免在磁盘/内存存储上过于贪婪的解决方案。因此,具有(年、月、日)元组字典的解决方案可能是一种解决方案,但我希望获得更有效解决方案的指示。
记录可以存储为位流,每个位代表从 1 月 1 日开始的一年中的一天。因此,如果灯在 2010 年的前三天运行,则记录可能是:
sl1000_up = dict('2010': '11100000000000...', '2011':'11111100100...')
跨年份边界的搜索将需要合并、闰年是一个特殊情况,另外我需要使用这个自制的解决方案进行相当多的编码/解码。好像不太安静吧。 加速bitstring-bit-operations,how-do-i-find-missing-dates- in-a-list 和 finding-data-gaps-with-bit-屏蔽我遇到的有趣的帖子。我还调查了 python-bitstring 并做了一些谷歌搜索,但似乎没有什么真正适合的。
此外,我希望搜索“间隙”是可能的,例如“三天或更多天没有行动”,并且必须将标记的日期转换为真实的日历日期。
我希望得到有关可能解决方案的想法或指示。为了添加更多细节,可能有趣的是使用的后端数据库是 ZODB,并且首选可以 pickle 的纯 Python 对象。
I'm investigating solutions of storing and querying a historical record of event occurrences for a large number of items.
This is the simplified scenario: I'm getting a daily log of 200 000 streetlamps (labeled sl1 to sl200000) which shows if the lamp was operational on the day or not. It does not matter for how long the lamp was in service only that it was on a given calendar day.
Other bits of information are stored for each lamp as well and the beginning of the Python class looks something like this:
class Streetlamp(object):
"""Class for streetlamp record"""
def __init__(self, **args):
self.location = args['location']
self.power = args['power']
self.inservice = ???
My py-foo is not too great and I would like to avoid a solution which is too greedy on disk/memory storage. So a solution with a dict of (year, month, day) tuples could be one solution, but I'm hoping to get pointers for a more efficient solution.
A record could be stored as a bit stream with each bit representing a day of a year starting with Jan 1. Hence, if a lamp was operational the first three days of 2010, then the record could be:
sl1000_up = dict('2010': '11100000000000...', '2011':'11111100100...')
Search across year boundaries would need a merge, leap years are a special case, plus I'd need to code/decode a fair bit with this home grown solution. It seems not quiet right. speed-up-bitstring-bit-operations, how-do-i-find-missing-dates-in-a-list and finding-data-gaps-with-bit-masking where interesting postings I came across. I also investigated python-bitstring and did some googling, but nothing seems to really fit.
Additionally I'd like search for 'gaps' to be possible, e.g. 'three or more days out of action' and it is essential that a flagged day can be converted into a real calendar date.
I would appreciate ideas or pointers to possible solutions. To add further detail, it might be of interest that the back-end DB used is ZODB and pure Python objects which can be pickled are preferred.
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(2)
在 Numpy 中创建 2D 数组:
它将非常节省内存,并且您可以轻松聚合日期和灯。
为了更好地控制日期,请查看 scikits.timeseries。它们将允许您使用日期时间对象访问日期。
Create a 2D-array in Numpy:
It will be very memory-efficient and you can aggregate easily the days and lamps.
In order to manipulate the days even better, have a look at scikits.timeseries. They will allow you to access the dates with datetime objects.
我可能会对灯进行字典处理,并让每个灯都包含状态更改列表,其中第一个元素是更改时间,第二个元素是自该时间以来有效的值。
这样,当您到达下一个样本时,除非状态与上一个项目相比发生了变化,否则您什么都不做。
搜索快速高效,因为您可以及时使用二分搜索方法。
坚持它也很容易,您可以将数据附加到现有的正在运行的系统,也不会出现任何问题,以及对灯状态列表进行字典以进一步减少资源使用。
如果你想搜索差距,你只需检查所有项目并比较下一个和上一个时间 - 如果你决定对状态列表进行字典,那么你将能够为每个不同的列表而不是每个列表执行一次灯,然后只需一次迭代即可获得具有相同“离线”状态的所有灯,这有时可能会有所帮助
I'd probably dictionary the lamps and have each of them contain a list of state changes where the first element is the time of the change and the second the value that's valid since that time.
This way when you get to the next sample you do nothing unless the state changed compared to the last item.
Searching is quick and efficient as you can use binary search approaches on the times.
Persisting it is also easy and you can append data to an existing and running system without any problems too as well as dictionary the lamp state lists to further reduce resource usage.
If you want to search for a gap you just go over all the items and compare the next and prev times - and if you decided to dictionary the state lists then you'll be able to do it just once for every different list rather then every lamp and then get all the lamps that had the same "offline" states with just one iteration which may sometimes help