Data mining situation

Posted on 2024-12-07 11:02:46


Suppose I have the data shown below.

11AM user1 Brush
11:05AM user1 Prep Breakfast
11:10AM user1 Eat Breakfast
11:15AM user1 Take Bath
11:30AM user1 Leave for Office

12PM user2 Brush
12:05PM user2 Prep Breakfast
12:10PM user2 Eat Breakfast
12:15PM user2 Take Bath
12:30PM user2 Leave for Office

11AM user3 Take Bath
11:05AM user3 Prep Breakfast
11:10AM user3 Brush
11:15AM user3 Eat Breakfast
11:30AM user3 Leave for Office

12PM user4 Take Bath
12:05PM user4 Prep Breakfast
12:10PM user4 Brush
12:15PM user4 Eat Breakfast
12:30PM user4 Leave for Office

This data describes the daily routines of different people. From it, user1 and user2 appear to behave similarly: they perform the activities at different times, but they follow the same sequence. For the same reason, user3 and user4 behave similarly.
Now I have to put such users into groups. In this example, group 1 contains user1 and user2, and group 2 contains user3 and user4.
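As a baseline, the grouping described above can be sketched by reducing each user's log to its time-ordered action sequence and bucketing users whose sequences are identical. This is only an illustration, assuming the log is plain text with one `time user action` entry per line; the helper names (`parse_time`, `sequences_by_user`, `group_by_sequence`) are made up for this sketch.

```python
from collections import defaultdict
from datetime import datetime

# Sample log from the question (spelling and capitalization normalized).
LOG = """\
11AM user1 Brush
11:05AM user1 Prep Breakfast
11:10AM user1 Eat Breakfast
11:15AM user1 Take Bath
11:30AM user1 Leave for Office
12PM user2 Brush
12:05PM user2 Prep Breakfast
12:10PM user2 Eat Breakfast
12:15PM user2 Take Bath
12:30PM user2 Leave for Office
11AM user3 Take Bath
11:05AM user3 Prep Breakfast
11:10AM user3 Brush
11:15AM user3 Eat Breakfast
11:30AM user3 Leave for Office
12PM user4 Take Bath
12:05PM user4 Prep Breakfast
12:10PM user4 Brush
12:15PM user4 Eat Breakfast
12:30PM user4 Leave for Office
"""


def parse_time(text):
    """Parse '11AM' or '11:05AM' into a datetime.time."""
    fmt = "%I:%M%p" if ":" in text else "%I%p"
    return datetime.strptime(text, fmt).time()


def sequences_by_user(log):
    """Map each user to the tuple of their actions, ordered by time."""
    events = defaultdict(list)
    for line in log.strip().splitlines():
        stamp, user, action = line.split(maxsplit=2)
        events[user].append((parse_time(stamp), action))
    # Sort by timestamp, then drop the timestamp so only the order remains.
    return {user: tuple(action for _, action in sorted(evts))
            for user, evts in events.items()}


def group_by_sequence(sequences):
    """Put users with exactly the same action sequence into one group."""
    groups = defaultdict(list)
    for user, seq in sequences.items():
        groups[seq].append(user)
    return sorted(groups.values())


groups = group_by_sequence(sequences_by_user(LOG))
print(groups)
```

Exact matching like this only groups users whose sequences are identical. On a large log where routines are merely similar, a similarity-based approach is needed instead.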

How should I approach this kind of situation? I am trying to learn data mining, and this is an example I came up with as a data mining problem. I am trying to find an approach to a solution, but I cannot think of one. I believe the data has a pattern in it, but I cannot think of an approach that would reveal it.
Also, I have to apply this approach to the dataset I have, which is pretty huge but similar to this :) The data consists of logs recording the occurrence of events at given times, and I want to find groups that represent similar sequences of events.

Any pointers would be appreciated.


Comments (2)

对你而言 2024-12-14 11:02:46


It looks like clustering on top of association mining, more precisely the Apriori algorithm. Something like this:

  1. Mine all possible associations between actions, i.e. sequences such as Brush -> Prep Breakfast, Prep Breakfast -> Eat Breakfast, ..., Brush -> Prep Breakfast -> Eat Breakfast, etc. That is, every pair, triplet, quadruple, etc. you can find in your data.
  2. Make a separate attribute from each such sequence. For better results, weight longer sequences more heavily: add a boost of 2 for pair attributes, 3 for triplets, and so on.
  3. At this point you have an attribute vector with a corresponding boost vector. You can calculate a feature vector for each user: set 1 * boost at each position in the vector if that sequence exists in the user's actions, and 0 otherwise. This gives you a vector representation of each user.
  4. On these vectors, use whichever clustering algorithm best fits your needs. Each cluster found is one of the groups you are looking for.
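Steps 1 to 3 can be sketched roughly as follows. This is a simplified illustration rather than a real Apriori implementation: it enumerates every contiguous run of actions seen in the data without any minimum-support pruning, and it uses the run length as the boost. The letter codes follow the example below, with `e` for "Leave for Office" added as an assumption; the function names are hypothetical.

```python
def contiguous_ngrams(seq, max_len):
    """All contiguous runs of 2..max_len actions that occur in seq."""
    return {tuple(seq[i:i + n])
            for n in range(2, max_len + 1)
            for i in range(len(seq) - n + 1)}


def feature_vectors(sequences, max_len=4):
    """Build one boosted feature vector per user.

    Each attribute is an action run observed in at least one user's data.
    A user's value for an attribute is the run length (the boost) if the
    run occurs in that user's sequence, and 0 otherwise.
    """
    per_user = {user: contiguous_ngrams(seq, max_len)
                for user, seq in sequences.items()}
    attributes = sorted(set().union(*per_user.values()))
    vectors = {user: [len(attr) if attr in grams else 0 for attr in attributes]
               for user, grams in per_user.items()}
    return attributes, vectors


# a = Brush, b = Prep Breakfast, c = Eat Breakfast, d = Take Bath,
# e = Leave for Office (the code for "e" is an assumption of this sketch).
sequences = {
    "user1": ["a", "b", "c", "d", "e"],
    "user2": ["a", "b", "c", "d", "e"],
    "user3": ["d", "b", "a", "c", "e"],
    "user4": ["d", "b", "a", "c", "e"],
}

attributes, vectors = feature_vectors(sequences)
for user in sorted(vectors):
    print(user, vectors[user])
```

On a large log, the number of attributes grows quickly, which is where support-based pruning (keeping only runs shared by enough users) becomes important.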

Example:

Let's mark all actions as letters:

a - Brush
b - Prep Breakfast
c - Eat Breakfast
d - Take Bath
...

Your attributes will look like

a1: a->b
a2: a->c
a3: a->d
...
a10: b->a
a11: b->c
a12: b->d
...
a30: a->b->c->d
a31: a->b->d->c
...

User feature vectors in this case will be:

attributes   = a1, a2, a3, a4, ..., a10, a11, a12, ..., a30, a31, ...
user1        =  2,  0,  0,  0, ...,   0,   2,   0, ...,   4,   0, ...
user2        =  2,  0,  0,  0, ...,   0,   2,   0, ...,   4,   0, ...
user3        =  0,  2,  0,  0, ...,   2,   0,   0, ...,   0,   0, ...

To compare two users, some similarity or distance measure is needed. The simplest one is cosine similarity, which is just the cosine of the angle between the two feature vectors. If two users have exactly the same sequence of actions, their similarity will equal 1. If they have nothing in common, their similarity will be 0.
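A minimal sketch of that comparison, using short made-up feature vectors rather than the full attribute list above:

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    # An all-zero vector shares nothing with anything, so report 0.
    return dot / norm if norm else 0.0


# Hypothetical boosted vectors over four attributes.
user1 = [2, 0, 2, 4]
user2 = [2, 0, 2, 4]
user3 = [0, 2, 0, 0]

print(cosine_similarity(user1, user2))  # identical sequences
print(cosine_similarity(user1, user3))  # no shared attributes
```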

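With a distance measure in place, use a clustering algorithm (say, k-means) to form the groups of users. Note that k-means works with Euclidean distance, so scale each feature vector to unit length first; for unit-length vectors, a smaller Euclidean distance always corresponds to a higher cosine similarity.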
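A small pure-Python sketch of this clustering step, assuming made-up feature vectors. Standard k-means picks its initial centroids at random; this sketch swaps in deterministic farthest-point seeding so the result is reproducible, and it scales every vector to unit length so that Euclidean distance tracks cosine similarity.

```python
import math


def normalize(v):
    """Scale a vector to unit length (all-zero vectors are left as they are)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm else list(v)


def sq_dist(u, v):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(u, v))


def kmeans(points, k, iterations=20):
    """Return a cluster label (0..k-1) for each point."""
    # Farthest-point seeding: start at the first point, then repeatedly add
    # the point that is farthest from the centroids chosen so far.
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(sq_dist(p, c) for c in centroids)))
    labels = []
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest centroid.
        labels = [min(range(k), key=lambda j: sq_dist(p, centroids[j]))
                  for p in points]
        # Update step: each centroid moves to the mean of its members.
        for j in range(k):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Hypothetical boosted feature vectors for the four users.
vectors = {
    "user1": [2, 0, 2, 4],
    "user2": [2, 0, 2, 4],
    "user3": [0, 2, 0, 0],
    "user4": [0, 2, 0, 0],
}
users = sorted(vectors)
labels = kmeans([normalize(vectors[u]) for u in users], k=2)

groups = {}
for user, label in zip(users, labels):
    groups.setdefault(label, []).append(user)
print(sorted(groups.values()))
```

The number of groups `k` has to be chosen up front here; when it is not known in advance, hierarchical clustering on the same vectors is a common alternative.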

蓬勃野心 2024-12-14 11:02:46


Using an itemset mining algorithm like Apriori, as proposed in the other answer, is not the best solution because Apriori does not consider time or sequential ordering. It therefore requires an additional pre-processing step to take ordering into account.

A better solution is to use a sequential pattern mining algorithm such as PrefixSpan, SPADE, or CM-SPADE directly. A sequential pattern mining algorithm directly finds the subsequences that appear often in a set of sequences.
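To make the idea concrete, here is a compact PrefixSpan-style sketch for the simple case where every event is a single action (no itemsets). It is an illustration under that assumption, not a tuned implementation; the letter codes (a = Brush, b = Prep Breakfast, c = Eat Breakfast, d = Take Bath, e = Leave for Office) and the `min_support` value are choices made for this example.

```python
def prefixspan(sequences, min_support):
    """Return {pattern: support} for every subsequence (gaps allowed)
    that occurs in at least min_support of the input sequences."""
    results = {}

    def mine(prefix, projected):
        # projected holds (sequence index, start position) pairs: the part of
        # each sequence that remains after matching the current prefix.
        counts = {}
        for sid, start in projected:
            for item in set(sequences[sid][start:]):
                counts[item] = counts.get(item, 0) + 1
        for item, support in sorted(counts.items()):
            if support < min_support:
                continue
            pattern = prefix + (item,)
            results[pattern] = support
            # Project each suffix past its first occurrence of item.
            next_projected = []
            for sid, start in projected:
                try:
                    pos = sequences[sid].index(item, start)
                except ValueError:
                    continue
                next_projected.append((sid, pos + 1))
            mine(pattern, next_projected)

    mine((), [(sid, 0) for sid in range(len(sequences))])
    return results


sequences = [
    ["a", "b", "c", "d", "e"],  # user1
    ["a", "b", "c", "d", "e"],  # user2
    ["d", "b", "a", "c", "e"],  # user3
    ["d", "b", "a", "c", "e"],  # user4
]

patterns = prefixspan(sequences, min_support=2)
for pattern, support in sorted(patterns.items(), key=lambda kv: (-len(kv[0]), kv[0])):
    print(" -> ".join(pattern), support)
```

Unlike contiguous pair or triplet features, the patterns found here may skip over intermediate actions, so `b -> c -> e` is supported by all four users even though user3 and user4 brush between preparing and eating breakfast.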

Then you can still apply clustering on the sequential patterns found!
