FpGrowth algorithm using CUDA
I have to develop a data mining algorithm using CUDA. I have searched a lot and found that most algorithms have already been implemented, except FpGrowth.
Do you think it's a good idea? Can you give me any ideas on how to implement it?
Comments (3)
I will answer your first question: "Is it a good idea?" I think it is a good idea if you actually need it. But if you just want to do it because it has not been done yet, maybe it is not such a good idea.
For the second question, make sure that you understand FPGrowth well. You can read the original paper describing FPGrowth. You can also check the book "Introduction to Data Mining", which has an easy-to-understand description of FPGrowth. Once you understand FPGrowth well, you can work out how to implement it with CUDA. That is my suggestion.
I found a web page that describes how to draw an FP tree and how to identify the frequent patterns from that tree. You can visit that site and read the information.
How to identify frequent patterns using FP tree algorithm
I don't know FpGrowth, but I guess you have read the papers ( http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.162.1209&rep=rep1&type=pdf , etc.). I guess you are new to CUDA, which makes implementing something this complicated rather difficult.
The key to good performance with CUDA is massive uniform parallelism and little synchronization. The CUDA Zone http://www.nvidia.com/object/cuda_apps_flash_new.html has a lot of good examples of what works and how. A good starting point for learning CUDA is the programming guide http://developer.download.nvidia.com/compute/cuda/3_2_prod/toolkit/docs/CUDA_C_Programming_Guide.pdf .
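A frequent question is "I've got this C code, how do I port it to CUDA?" The answer is: don't! Inside a CUDA kernel there is no file I/O and no standard string library, printing is limited to a device-side printf that needs compute capability 2.0 or higher, and pointer-chasing data structures tend to perform poorly. Most of what you have learned about efficient CPU code does not carry over.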
A more promising approach is to think of the underlying algorithm in a more abstract way. Identify what can be done in parallel, think about a good data structure (probably involving large arrays), and implement a prototype. It might be easier to rely on CUDA libraries like Thrust http://code.google.com/p/thrust/ to get a first version working.
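To make the "large arrays plus Thrust" idea concrete, here is a minimal sketch of how a prototype might compute the item frequency list that FpGrowth needs before it builds the tree. It assumes all transactions have been flattened into a single array of integer item IDs; the function name `frequency_list` and that layout are my own assumptions for illustration, not something taken from the paper or from an existing implementation.

```cuda
#include <thrust/device_vector.h>
#include <thrust/sort.h>
#include <thrust/reduce.h>
#include <thrust/copy.h>
#include <thrust/functional.h>
#include <thrust/iterator/constant_iterator.h>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical helper: takes every item occurrence of every transaction as one
// flat array of item IDs and returns (item, support) pairs ordered by
// descending support.
std::vector<std::pair<int, int>> frequency_list(const std::vector<int>& flat_items)
{
    if (flat_items.empty()) return {};

    // Copy the occurrences to the GPU and sort them so equal IDs are adjacent.
    thrust::device_vector<int> items(flat_items.begin(), flat_items.end());
    thrust::sort(items.begin(), items.end());

    // Sum a constant 1 per occurrence for each run of equal IDs.
    thrust::device_vector<int> unique_items(items.size());
    thrust::device_vector<int> counts(items.size());
    auto ends = thrust::reduce_by_key(items.begin(), items.end(),
                                      thrust::constant_iterator<int>(1),
                                      unique_items.begin(), counts.begin());
    const std::size_t n = ends.first - unique_items.begin();
    unique_items.resize(n);
    counts.resize(n);

    // Order the items by descending support. The stable sort keeps ties in
    // ascending item ID order, because reduce_by_key emitted them that way.
    thrust::stable_sort_by_key(counts.begin(), counts.end(),
                               unique_items.begin(), thrust::greater<int>());

    // Bring the result back to the host.
    std::vector<int> h_items(n), h_counts(n);
    thrust::copy(unique_items.begin(), unique_items.end(), h_items.begin());
    thrust::copy(counts.begin(), counts.end(), h_counts.begin());

    std::vector<std::pair<int, int>> result(n);
    for (std::size_t i = 0; i < n; ++i)
        result[i] = std::make_pair(h_items[i], h_counts[i]);
    return result;
}
```

Calling `frequency_list({1, 2, 1, 3, 1, 2})` should give item 1 with support 3, item 2 with support 2, and item 3 with support 1. Sorting and reducing is probably not the fastest way to count, but it needs no hand-written kernel, which is the point of a first version.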
Regarding FpGrowth, is there anything that can be done in parallel? Building dynamic trees and traversing them are generally considered hard to implement efficiently in CUDA.
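One part that does look parallel is the first database scan, which only counts how often each item occurs. The sketch below is one possible way to write that step as a plain kernel, not a full FpGrowth implementation. It assumes item IDs are integers in the range `[0, num_items)` and that all transactions are flattened into one array; the names `count_support_kernel` and `count_support` are hypothetical, and error checking is left out to keep it short.

```cuda
#include <cuda_runtime.h>
#include <cstddef>
#include <vector>

// Each thread walks the flat item array in a grid-stride loop and bumps the
// counter of the item it sees. The only synchronization is the atomic add.
__global__ void count_support_kernel(const int* items, int n, int* support)
{
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += blockDim.x * gridDim.x)
    {
        atomicAdd(&support[items[i]], 1);
    }
}

// Hypothetical host wrapper. Assumes every ID in flat_items lies in
// [0, num_items). Return codes of the CUDA calls are ignored for brevity.
std::vector<int> count_support(const std::vector<int>& flat_items, int num_items)
{
    std::vector<int> support(num_items, 0);
    if (flat_items.empty() || num_items <= 0) return support;

    const std::size_t n = flat_items.size();
    int* d_items = nullptr;
    int* d_support = nullptr;
    cudaMalloc(&d_items, n * sizeof(int));
    cudaMalloc(&d_support, num_items * sizeof(int));
    cudaMemcpy(d_items, flat_items.data(), n * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemset(d_support, 0, num_items * sizeof(int));

    // Cap the grid size; the grid-stride loop covers whatever is left over.
    const int threads = 256;
    int blocks = static_cast<int>((n + threads - 1) / threads);
    if (blocks > 1024) blocks = 1024;
    count_support_kernel<<<blocks, threads>>>(d_items, static_cast<int>(n), d_support);

    // This copy runs on the default stream, so it waits for the kernel.
    cudaMemcpy(support.data(), d_support, num_items * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_items);
    cudaFree(d_support);
    return support;
}
```

If a few items dominate the data, many threads will contend for the same counter and the atomics may become the bottleneck; per-block counters in shared memory are a common way to reduce that. The tree construction and the recursive mining that follow are the hard part, and this sketch does not attempt them.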