Data mining situation
Suppose I have the data shown below.
11AM user1 Brush
11:05AM user1 Prep Breakfast
11:10AM user1 eat Breakfast
11:15AM user1 Take bath
11:30AM user1 Leave for office
12PM user2 Brush
12:05PM user2 Prep Breakfast
12:10PM user2 eat Breakfast
12:15PM user2 Take bath
12:30PM user2 Leave for office
11AM user3 Take bath
11:05AM user3 Prep Breakfast
11:10AM user3 Brush
11:15AM user3 eat Breakfast
11:30AM user3 Leave for office
12PM user4 Take bath
12:05PM user4 Prep Breakfast
12:10PM user4 Brush
12:15PM user4 eat Breakfast
12:30PM user4 Leave for office
This data tells me about the daily routines of different people. From this data, user1 and user2 seem to behave similarly: they perform the activities at different times, but they follow the same sequence. For the same reason, user3 and user4 behave similarly.
Now I have to sort such users into groups. In this example, group 1 contains user1 and user2, and group 2 contains user3 and user4.
How should I approach this kind of situation? I am trying to learn data mining, and this is an example I thought of as a data mining problem. I am trying to find an approach to a solution, but I cannot think of one. I believe this data has a pattern in it, but I cannot think of an approach that would reveal it.
Also, I have to apply this approach to a dataset I have, which is pretty huge but similar to this :) The data consists of logs recording the occurrence of events at a given time, and I want to find groups that represent similar sequences of events.
Any pointers would be appreciated.
Comments (2)
It looks like clustering on top of association mining, more precisely the Apriori algorithm. Something like this:
Example:
Let's mark all actions as letters:
a - Brush
b - Prep Breakfast
c - Eat Breakfast
d - Take Bath
...
Your attributes will look like
a1: a->b
a2: a->c
a3: a->d
...
a10: b->a
a11: b->c
a12: b->d
...
a30: a->b->c->d
a31: a->b->d->c
...
Each user is then described by a feature vector over these attributes.
To compare two users, some distance measure is needed. The simplest one is cosine similarity, which is just the cosine of the angle between the two feature vectors. If two users have exactly the same sequence of actions, their similarity equals 1. If they have nothing in common, their similarity is 0.
With that measure, use a clustering algorithm (say, k-means) to form groups of users.
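As a rough illustration of the idea above, here is a minimal Python sketch. It makes several assumptions that are not in the answer: only the two-action attributes (x before y) are used as features, each action occurs at most once per user, and a simple greedy threshold grouping stands in for k-means to keep the example dependency-free. The names `ordered_pairs`, `cosine_similarity`, `group_users` and the `threshold` value of 0.8 are hypothetical.

```python
from itertools import combinations
from math import sqrt

# Routines from the question, as one ordered action list per user.
routine_a = ["Brush", "Prep Breakfast", "Eat Breakfast", "Take Bath", "Leave for office"]
routine_b = ["Take Bath", "Prep Breakfast", "Brush", "Eat Breakfast", "Leave for office"]
routines = {
    "user1": routine_a,
    "user2": routine_a,
    "user3": routine_b,
    "user4": routine_b,
}


def ordered_pairs(actions):
    """Binary features: the set of (x, y) pairs where x happens before y."""
    return set(combinations(actions, 2))


def cosine_similarity(p, q):
    """Cosine similarity of two binary feature vectors stored as sets."""
    if not p or not q:
        return 0.0
    return len(p & q) / sqrt(len(p) * len(q))


def group_users(features, threshold=0.8):
    """Greedy grouping (a stand-in for k-means): join the first group whose
    first member is similar enough, otherwise start a new group."""
    groups = []
    for user, feats in features.items():
        for group in groups:
            if cosine_similarity(feats, features[group[0]]) >= threshold:
                group.append(user)
                break
        else:
            groups.append([user])
    return groups


features = {user: ordered_pairs(actions) for user, actions in routines.items()}
groups = group_users(features)
print(groups)
```

On this toy data user1 and user2 share all ten ordered pairs, while user1 and user3 share only six of ten, so a threshold between those two values separates the groups. For a large log, a real clustering library would be a better fit than this greedy pass.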
Using an itemset mining algorithm like Apriori, as proposed in the other answer, is not the best solution because Apriori does not consider time or sequential ordering. It therefore requires an additional pre-processing step to take ordering into account.
A better solution is to use a sequential pattern mining algorithm such as PrefixSpan, SPADE, or CM-SPADE directly. A sequential pattern mining algorithm directly finds subsequences that appear often in a set of sequences.
You can then still apply clustering to the sequential patterns found!
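To make "subsequences that appear often in a set of sequences" concrete, here is a small Python sketch. It is a brute-force enumeration, not PrefixSpan, SPADE, or CM-SPADE: it simply lists every ordered subsequence up to a fixed length and counts how many users contain it, which only works for short sequences. The function names, `min_support`, and `max_len` are hypothetical, and each action is assumed to occur at most once per user.

```python
from collections import Counter
from itertools import combinations

# Routines from the question, as one ordered action list per user.
routine_a = ["Brush", "Prep Breakfast", "Eat Breakfast", "Take Bath", "Leave for office"]
routine_b = ["Take Bath", "Prep Breakfast", "Brush", "Eat Breakfast", "Leave for office"]
sequences = {
    "user1": routine_a,
    "user2": routine_a,
    "user3": routine_b,
    "user4": routine_b,
}


def subsequences(actions, max_len=3):
    """All ordered (not necessarily contiguous) subsequences up to max_len."""
    found = set()
    for length in range(1, max_len + 1):
        found.update(combinations(actions, length))
    return found


def frequent_subsequences(sequences, min_support=2, max_len=3):
    """Count in how many sequences each subsequence occurs; keep frequent ones."""
    support = Counter()
    for actions in sequences.values():
        # Each subsequence is counted once per user because it comes from a set.
        support.update(subsequences(actions, max_len))
    return {p: n for p, n in support.items() if n >= min_support}


patterns = frequent_subsequences(sequences)

# Patterns found in some users but not all are the ones that separate groups.
discriminating = sorted(p for p, n in patterns.items() if n < len(sequences))
for pattern in discriminating:
    print(" -> ".join(pattern), patterns[pattern])
```

Each user can then be described by which of these patterns their sequence contains, and those descriptions can be clustered. A dedicated sequential pattern mining implementation would be needed for a large log, since this enumeration grows exponentially with sequence length.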