How can I use a database to store a large volume of daily text?
I am doing research based on text processing and mining. The principle is simple: we collect all posts made on a specific date, for example "2011Jan01". We do not care which client posted the content; we only care about when it was posted. For example, if five clients posted some thoughts about products in our forum on "2011Jan01", we delete the client information and combine the contents of their posts.
However, we have a large forum, so thousands of people may be actively posting long or short threads every day. If we combine them, a single day could come to tens of thousands or even hundreds of thousands of lines.
We would like to use a database such as MySQL to build a table that stores the data for later mining. Our first idea for the table is quite simple:
Table
Date        combinedPostContents
2011Jan01   "blablalbla everything from clients, lot of contents"
Is this simple design reasonable? Or should we save the contents in local text files and name each file by the date we collected it? Which one is better?
Thanks a lot in advance, gurus! :)
Comments (1)
Data mining text information to get customer thoughts on products will be VERY difficult. You will definitely want to use a database, and you really should have some sort of rating system for the products they are reviewing.
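To illustrate the database route, here is a minimal sketch rather than a definitive design. It assumes one row per post instead of one combined row per day, so the combined text for a date can be rebuilt at query time. The table name `posts`, its columns, the ISO date format, the sample rows, and the helper `combined_contents` are all made up for illustration, and Python's built-in `sqlite3` stands in for MySQL so the example runs without a server.

```python
import sqlite3

# Assumption: SQLite (in memory) stands in for MySQL so this runs anywhere.
conn = sqlite3.connect(":memory:")

# Hypothetical schema: one row per post, with no client information stored.
conn.execute(
    """
    CREATE TABLE posts (
        id        INTEGER PRIMARY KEY,
        post_date TEXT NOT NULL,   -- ISO date such as '2011-01-01'
        content   TEXT NOT NULL
    )
    """
)
# An index on the date keeps per-day lookups fast as the table grows.
conn.execute("CREATE INDEX idx_posts_date ON posts (post_date)")

# Made-up sample data, only to exercise the query below.
sample_posts = [
    ("2011-01-01", "first post about the product"),
    ("2011-01-01", "second post, a bit longer"),
    ("2011-01-02", "a post from the next day"),
]
conn.executemany(
    "INSERT INTO posts (post_date, content) VALUES (?, ?)", sample_posts
)


def combined_contents(conn, post_date):
    """Return all posts for one date joined by newlines, in insertion order."""
    rows = conn.execute(
        "SELECT content FROM posts WHERE post_date = ? ORDER BY id",
        (post_date,),
    ).fetchall()
    return "\n".join(content for (content,) in rows)


combined = combined_contents(conn, "2011-01-01")
print(combined)
```

Keeping posts in separate rows still lets you produce the combined text for any day, while leaving room to count posts, filter by length, or re-split the text later, none of which is easy once everything is merged into a single field.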