I'll admit it, I'm not a statistics kind of guy, but I've run into these kinds of problems before. Really, what we're talking about here is that you have some observed, discrete events and you want to figure out how likely it is that you'll see them occur at any given point in time. The issue you've got is that you want to take discrete data and make continuous data out of it.
The term that comes to mind is density estimation, specifically kernel density estimation. You can get some of the effects of kernel density estimation by simple binning (e.g. count the number of events in a time interval, such as every quarter hour or hour). Kernel density estimation just has some nicer statistical properties than simple binning. (The produced data is often 'smoother'.)
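To make that concrete, here's a minimal sketch of both ideas, assuming the print events have already been reduced to hypothetical timestamps in hours since midnight and using an arbitrary KDE bandwidth that would need tuning on real data:

```python
# Minimal sketch: simple binning vs. a kernel density estimate over print times.
# The timestamps and the bandwidth below are made up for illustration.
import numpy as np
from scipy.stats import gaussian_kde

event_times = np.array([9.1, 9.3, 11.0, 13.5, 16.4, 16.5, 16.6])  # hours since midnight

# Simple binning: count events per hour of the day.
counts, edges = np.histogram(event_times, bins=24, range=(0, 24))

# Kernel density estimation: a smoother estimate of event density over the day.
kde = gaussian_kde(event_times, bw_method=0.15)
grid = np.linspace(0, 24, 97)   # one point every 15 minutes
density = kde(grid)             # relative likelihood of a print job at each time
```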
That only takes care of one of your problems, though. The next problem is still the far more interesting one -- how do you take a timeline of data (in this case, only printer data) and produce a prediction from it? First things first -- the way you've set up the problem may not be what you're looking for. While the idea of taking a single, limited source of data and predicting its next step sounds attractive, it's far more practical to integrate more data sources to create an actual prediction (e.g. maybe the printers get hit hard just after there's a lot of phone activity -- something that can be very hard to predict in some companies). The Netflix Challenge is a rather potent example of this point.
Of course, the problem with more data sources is the extra legwork of setting up the systems that collect that data.
Honestly, I'd consider this a domain-specific problem and take two approaches: Find time-independent patterns, and find time-dependent patterns.
An example time-dependent pattern would be that every weekday at 4:30 Suzy prints out her end-of-day report. This happens at specific times on specific days of the week, and that kind of thing is easy to detect with predetermined, fixed intervals (every day, every weekday, every weekend day, every Tuesday, every 1st of the month, etc.). Just create a curve of the estimated probability density function that's one week long, then go back in time and average the curves (possibly a weighted average via a windowing function for better predictions).
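A rough sketch of that, assuming hypothetical timestamps measured in hours since the first Monday in the log, and a crude linear weighting standing in for a proper windowing function:

```python
# Fold events onto a one-week clock, build one curve per week, and take a
# weighted average that favours recent weeks.  Data and weights are made up.
import numpy as np

events = np.array([40.5, 64.5, 208.5, 232.5, 376.5, 400.5])  # hours since the first Monday
HOURS_PER_WEEK = 24 * 7

n_weeks = int(np.ceil((events.max() + 1) / HOURS_PER_WEEK))
curves = np.zeros((n_weeks, HOURS_PER_WEEK))
for t in events:
    week, hour = divmod(t, HOURS_PER_WEEK)
    curves[int(week), int(hour)] += 1

# Weighted average of the weekly curves (recent weeks count more).
weights = np.linspace(0.5, 1.0, n_weeks)
weekly_profile = np.average(curves, axis=0, weights=weights)
```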
If you want to get more sophisticated, find a way to automate the detection of such intervals. (Likely the data isn't so overwhelming that you couldn't just brute force this.)
An example time-independent pattern is that every time Mike in accounting prints out an invoice list sheet, he goes over to Johnathan, who prints out a rather large batch of complete invoice reports a few hours later. This kind of thing is harder to detect because it's more free-form. I recommend looking at various intervals of time (e.g. 30 seconds, 40 seconds, 50 seconds, 1 minute, 1.2 minutes, 1.5 minutes, 1.7 minutes, 2 minutes, 3 minutes, ..., 1 hour, 2 hours, 3 hours, ...) and subsampling them in a nice way (e.g. Lanczos resampling) to create a vector. Then use a vector-quantization-style algorithm to categorize the "interesting" patterns. You'll need to think carefully about how you'll deal with the certainty of the categories, though -- if a resulting category has very little data in it, it probably isn't reliable. (Some vector quantization algorithms are better at this than others.)
Then, to create a prediction as to the likelihood of printing something in the future, look up the most recent activity intervals (30 seconds, 40 seconds, 50 seconds, 1 minute, and all the other intervals) via vector quantization and weight the outcomes based on their certainty to create a weighted average of predictions.
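Here's a very rough sketch of that pipeline, with k-means standing in for the vector quantizer, plain trailing-window counts standing in for Lanczos-resampled curves, and made-up data and thresholds throughout:

```python
# Sketch of the vector-quantization idea: describe "recent activity" as counts
# over several trailing windows, cluster those vectors, and use each cluster's
# history as its prediction.  Everything below is illustrative, not tuned.
import numpy as np
from sklearn.cluster import KMeans

print_times = np.sort(np.random.default_rng(0).uniform(0, 10_000, 400))  # minutes
windows = [0.5, 1, 2, 5, 10, 30, 60, 120, 180]                           # trailing windows, minutes

def activity_vector(now):
    """Counts of recent print jobs over several trailing windows."""
    return np.array([np.sum((print_times > now - w) & (print_times <= now))
                     for w in windows])

# Training vectors at regular sample points, labelled with whether any print
# happened in the following hour.
sample_points = np.arange(200, 9_800, 15)
X = np.array([activity_vector(t) for t in sample_points])
y = np.array([np.any((print_times > t) & (print_times <= t + 60))
              for t in sample_points])

km = KMeans(n_clusters=8, n_init=10, random_state=0).fit(X)
labels = km.labels_

# Per-cluster probability of a print in the next hour, plus a crude "certainty"
# based on how much data each cluster holds.
prob = np.array([y[labels == k].mean() for k in range(8)])
certainty = np.array([(labels == k).sum() for k in range(8)]) / len(labels)

def predict(now):
    k = km.predict(activity_vector(now).reshape(1, -1))[0]
    return prob[k], certainty[k]
```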
You'll want to find a good way to measure certainty of the time-dependent and time-independent outputs to create a final estimate.
This sort of thing is typical of predictive data compression schemes. I recommend you take a look at PAQ since it's got a lot of the concepts I've gone over here and can provide some very interesting insight. The source code is even available along with excellent documentation on the algorithms used.
You may want to take an entirely different approach from vector quantization: discretize the data and use something more like a PPM scheme. It can be much simpler to implement and still effective.
I don't know what the time frame or scope of this project is, but this sort of thing can always be taken to the N-th degree. If it's got a deadline, I'd emphasize worrying about getting something working first, and then making it work well. Something not optimal is better than nothing.
This kind of project is cool. This kind of project can get you a job if you wrap it up right. I'd recommend you take your time, do it right, and post it up as functional, open source, useful software. I highly recommend open source since you'll want to build a community that can contribute data source providers for more environments than you have access to, the will to support, or the time to support.

Best of luck!
I really don't see how a Markov model would be useful here. Markov models are typically employed when the event you're predicting is dependent on previous events. The canonical example, of course, is text, where a good Markov model can do a surprisingly good job of guessing what the next character or word will be.
But is there a pattern to when a user might print the next thing? That is, do you see a regular pattern of time between jobs? If so, then a Markov model will work. If not, then the Markov model will be a random guess.
As for how to model it, think of the different time periods between jobs as letters in an alphabet. In fact, you could assign each time period a letter, something like:
A - 1 to 2 minutes
B - 2 to 5 minutes
C - 5 to 10 minutes
etc.
Then go through the data and assign a letter to each time period between print jobs. When you're done, you have a text representation of your data that you can run through any of the Markov examples that do text prediction.
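Something like this sketch, where the gap data, the letter boundaries, and the first-order (single-letter) context are all just illustrative choices:

```python
# Discretize gaps between print jobs into letters, build a first-order Markov
# transition table, and use it to guess the most likely next gap class.
from collections import Counter, defaultdict

gaps_minutes = [1.5, 3.0, 7.0, 1.2, 4.0, 8.5, 1.8, 3.2]   # made-up data

def to_letter(gap):
    if gap < 2:  return 'A'   # up to 2 minutes
    if gap < 5:  return 'B'   # 2 to 5 minutes
    if gap < 10: return 'C'   # 5 to 10 minutes
    return 'D'                # everything longer

sequence = [to_letter(g) for g in gaps_minutes]   # the "text" representation

transitions = defaultdict(Counter)
for current, nxt in zip(sequence, sequence[1:]):
    transitions[current][nxt] += 1

def predict_next(current):
    """Most likely next gap class given the current one, with its probability."""
    counts = transitions[current]
    letter, n = counts.most_common(1)[0]
    return letter, n / sum(counts.values())

print(predict_next('A'))   # ('B', 1.0) on this toy sequence
```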
If you have an actual model that you think might be relevant for the problem domain, you should apply it. For example, it is likely that there are patterns related to day of week, time of day, and possibly date (holidays would presumably show lower usage).
Most raw statistical modelling techniques based on examining (say) time between adjacent events would have difficulty capturing these underlying influences.
I would build a statistical model for each of those known events (day of week, etc), and use that to predict future occurrences.
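As a hedged sketch of what the simplest version of that could look like, with hypothetical job times and only a (day-of-week, hour) factor modelled (holidays and everything else ignored):

```python
# Estimate an average print rate for each (weekday, hour) cell from history
# and use it as the forecast for future slots.  Data is made up.
from collections import Counter
from datetime import datetime, timedelta

job_times = [datetime(2010, 11, 1, 9) + timedelta(hours=h) for h in (0, 3, 24, 27, 48)]
observed_weeks = 4   # how many weeks of history the log covers

counts = Counter((t.weekday(), t.hour) for t in job_times)

def expected_jobs(weekday, hour):
    """Average number of jobs seen in this weekday/hour slot per week."""
    return counts[(weekday, hour)] / observed_weeks

print(expected_jobs(0, 9))   # expected Monday-9am volume
```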
Think of a Markov chain like a graph with vertices connected to each other by weights or distances. Moving around this graph accumulates the sum of the weights or distances you travel. Here is an example with text generation: http://phpir.com/text-generation.
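A toy illustration of that picture, where repeated edges act as weights and a random walk over the graph generates text (the corpus is made up):

```python
# Build a word-level transition graph and walk it: heavier edges (words that
# followed more often) are chosen more often.
import random

corpus = "the printer jams the printer works the user waits".split()

graph = {}
for a, b in zip(corpus, corpus[1:]):
    graph.setdefault(a, []).append(b)   # repeated entries act as edge weights

def generate(start, length=6, seed=0):
    rng = random.Random(seed)
    word, out = start, [start]
    for _ in range(length):
        if word not in graph:
            break
        word = rng.choice(graph[word])
        out.append(word)
    return " ".join(out)

print(generate("the"))
```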
A Kalman filter is used to track a state vector, generally with continuous (or at least discretized continuous) dynamics. This is sort of the polar opposite of sporadic, discrete events, so unless you have an underlying model that includes this kind of state vector (and is either linear or almost linear), you probably don't want a Kalman filter.
It sounds like you don't have an underlying model, and are fishing around for one: you've got a nail, and are going through the toolbox trying out files, screwdrivers, and tape measures 8^)
My best advice: first, use what you know about the problem to build the model; then figure out how to solve the problem, based on the model.
I think a predictive neural network would be a good approach for this task.
http://en.wikipedia.org/wiki/Predictive_analytics#Neural_networks
This method is also used for prediction in, for example, weather forecasting, the stock market, and sunspot activity.
There's a tutorial here if you want to know more about how it works.
http://www.obitko.com/tutorials/neural-network-prediction/
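For what it's worth, one hedged way to frame it (my own setup, not taken from the linked tutorial): feed a sliding window of recent hourly print counts into a small feed-forward network and regress the next hour's count. The window length, layer size, and data below are arbitrary:

```python
# Sketch: predict the next hour's print volume from the previous 24 hours.
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(1)
hourly_counts = rng.poisson(3, size=500).astype(float)   # hypothetical history

LAGS = 24
X = np.array([hourly_counts[i:i + LAGS] for i in range(len(hourly_counts) - LAGS)])
y = hourly_counts[LAGS:]

model = MLPRegressor(hidden_layer_sizes=(16,), max_iter=2000, random_state=1)
model.fit(X, y)

next_hour = model.predict(hourly_counts[-LAGS:].reshape(1, -1))
print(next_hour)   # predicted number of print jobs in the coming hour
```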