Is there a good algorithm for checking whether data has changed over a specified period?
We have around 7,000 financial products whose closing prices should theoretically move up and down within a certain percentage range throughout a defined period of time (say, a one-week or one-month period).
I have access to an internal system that stores these historical prices (not a relational database!). I would like to produce a report that lists any products whose price has not moved at all, or has moved by less than, say, 10%, over the period.
I can't just compare the first value (day 1) to the value at the end (day n): the price could have moved back to where it started by the last day, which would produce a false positive even though the price may have spiked somewhere in between.
Are there any established algorithms to do this in reasonable compute time?
4 Answers
If this needs to be checked often (for a large number of intervals, like daily for the last year, and for the same set of products), you can store the high and low values of each item per week/month. By combining the right weekly and/or monthly bounds with some raw data at the edges of the interval, you can get the minimum and maximum value over the interval.
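A sketch of that idea in Python, assuming hypothetical lookup tables `week_bounds` (precomputed per-product, per-ISO-week lows/highs) and `raw` (per-product daily closes) — both names are illustrative, not part of any real system:

```python
from datetime import date, timedelta

def interval_min_max(product, start, end, week_bounds, raw):
    """Min/max over [start, end]: consume whole precomputed weeks where
    possible, and fall back to raw daily closes at the interval's edges."""
    lo, hi = float("inf"), float("-inf")
    day = start
    while day <= end:
        week_start = day - timedelta(days=day.weekday())   # Monday of this week
        week_end = week_start + timedelta(days=6)
        iso = day.isocalendar()
        key = (product, (iso[0], iso[1]))                  # (ISO year, ISO week)
        if day == week_start and week_end <= end and key in week_bounds:
            w_lo, w_hi = week_bounds[key]
            lo, hi = min(lo, w_lo), max(hi, w_hi)
            day = week_end + timedelta(days=1)             # skip the whole week
        else:
            price = raw.get((product, day))                # partial week: raw data
            if price is not None:
                lo, hi = min(lo, price), max(hi, price)
            day += timedelta(days=1)
    return lo, hi
```

Only the days in partial weeks at either edge touch raw data, so a year-long interval needs roughly 52 bound lookups plus a handful of daily reads per product.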
If you can add data to kdb (i.e. you're not limited to read access) you might consider adding the 'number of days since last price change' as a new set of data (i.e. one number per financial instrument). A daily task would then fetch today's mark and yesterday's, and update the numbers stored. Similarly you could maintain recent (last month, last year) highs and lows in kdb. You'd have to run a job over the larger dataset to prime the values initially, but then your daily updates will involve much less data.
I'd recommend that, if you adopt something like this, you have some way to rerun it for all or part of the dataset (say, when adding a new product).
Lastly - is the history normalised against current prices? (i.e. are revaluations for stock splits or similar taken into account). If not, you'd need to detect these discontinuities and divide them out.
EDIT
I'd investigate using kdb+/Q to implement the signal processing, rather than extracting the raw data to a Java application. As you say, it's highly performant.
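As a rough sketch of the daily incremental update (in Python rather than Q, with in-memory dicts standing in for the kdb tables — all names here are illustrative assumptions):

```python
def daily_update(state, closes_today):
    """Fold today's closing prices into per-instrument running state:
    state[sym] = {'last': price, 'days_unchanged': n, 'high': h, 'low': l}.
    A priming job over the full history would build `state` initially;
    after that, each day only touches one price per instrument."""
    for sym, price in closes_today.items():
        if sym not in state:
            state[sym] = {"last": price, "days_unchanged": 0,
                          "high": price, "low": price}
            continue
        s = state[sym]
        s["days_unchanged"] = s["days_unchanged"] + 1 if price == s["last"] else 0
        s["last"] = price
        s["high"] = max(s["high"], price)
        s["low"] = min(s["low"], price)
    return state
```

The report query then reduces to scanning ~7,000 small state records instead of the full price history.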
You can do this if you can keep track of the min and max value of the price during the time interval - this assumes that the time interval is not being constantly changed. One way of keeping track of the min and max values of a changing set of items is with two heaps placed 'back to back' - you could store this and some pointers necessary to find and remove old items in one or two arrays in your store. The idea of putting two heaps back to back appears in Knuth's Art of Computer Programming Vol. 3, Section 5.2.3, Exercise 31. Knuth calls this sort of beast a Priority Dequeue, and this seems to be searchable. Min and max are available at constant cost. The cost of modifying it when a new price arrives is log n, where n is the number of items stored.
There isn't any way to do this without looking at every single day.
Suppose the data looks something like this (an illustrative series): 100, 100, 150, 100, 100
With that one-day spike in the middle. You're not going to catch that unless you check the day that the spike happens - in other words, you need to check every single day.
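In other words, the minimal correct approach is a single pass per product over every daily close, tracking the running min and max. A sketch in Python (function and parameter names are illustrative):

```python
def flat_products(prices_by_product, threshold=0.10):
    """One pass over each product's daily closes; flag any product whose
    total min-to-max range over the period is below `threshold` relative
    to the minimum. Because min/max see every day, a mid-period spike
    that a first-vs-last comparison would miss is still caught."""
    flagged = []
    for product, closes in prices_by_product.items():
        lo, hi = min(closes), max(closes)
        if lo > 0 and (hi - lo) / lo < threshold:
            flagged.append(product)
    return flagged
```

For 7,000 products this is linear in the total number of price points, so a week's or month's window is trivially cheap even without any precomputed bounds.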