Data mining on a MySQL database
I am just getting started with text mining.
I have two database tables, each with thousands of rows:
a "skills" table and a "skill categories" table.
- Every "skill" belongs to a skill category.
- A "skill" is physically a varchar(200) field in the database, holding free text that describes the skill.
Here are some skills extracted from the skills table:
"PHP (good level), Java (intermediaite), C++"
"PHP5"
"project management and quality management"
"begining Javascript"
"water engineering"
"dfsdf zerze rzer"
"cibling customers"
What I want to do is extract knowledge from those fields, that is, extract only the real skills and ignore the rest of the useless text.
For the example above, I want to get only an array with:
"PHP"
"Java"
"C++"
"PHP5"
"project management"
"quality management"
"Javascript"
"water engineering"
"cibling customers"
How should I go about extracting the skills from this large volume of data?
Do you know of specific algorithms for this, e.g. k-means?
Thanks in advance.
Comments (3)
I would use a regex to parse each row of data: first split on commas (,), then remove any text held within brackets along with the spaces leading up to those brackets. As for removing junk phrases, perhaps compare against an accepted word list?
I also notice that the keyword 'AND' denotes two separate skills, going by your desired output. Results from this kind of processing may be a bit sketchy, since the data is not necessarily all in the same format.
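A minimal Python sketch of the steps above, in case it helps to see them together. The `ACCEPTED_WORDS` set and the function names are made up for illustration; in practice the word list would come from a table you maintain. Dropping individual words that are not on the list (rather than whole phrases) is my own reading of "comparing to an accepted word list", not something the answer spells out.

```python
import re

# Hypothetical accepted-word list (assumption); load yours from a table.
ACCEPTED_WORDS = {
    "php", "php5", "java", "c++", "javascript", "project", "management",
    "quality", "water", "engineering", "cibling", "customers",
}

def split_skills(raw):
    """Split one varchar field into candidate skill phrases."""
    # Remove bracketed text together with the spaces leading up to it.
    no_brackets = re.sub(r"\s*\([^)]*\)", "", raw)
    # Split on commas and on the standalone word "and".
    parts = re.split(r",|\band\b", no_brackets, flags=re.IGNORECASE)
    return [p.strip() for p in parts if p.strip()]

def keep_accepted(phrase):
    """Drop words that are not on the accepted list; may return ''."""
    kept = [w for w in phrase.split() if w.lower() in ACCEPTED_WORDS]
    return " ".join(kept)

def extract_skills(rows):
    """Run both steps over every row and collect non-empty results."""
    skills = []
    for row in rows:
        for phrase in split_skills(row):
            cleaned = keep_accepted(phrase)
            if cleaned:
                skills.append(cleaned)
    return skills
```

With the sample rows from the question, the level qualifier "begining" and the junk row "dfsdf zerze rzer" are dropped only because none of those words are on the list, so the quality of the result depends entirely on how good that list is.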
It would be very hard to start from scratch.
I'd parse skill-set data from some existing source, load it into a table, and use that as a reference table, trying to match your data against it. Otherwise you have no way to determine whether the words or phrases are meaningful.
For each phrase, I'd use the following algorithm.
Say you have a phrase of 5 words.
First I'd check whether this phrase exists in my table. If so, keep it and go to the next one; if not, check
and if they don't match either, check
etc...
I know it is a bit messy and long-winded, but it is the first thing that came to mind.
Hope it helps.
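One way to read the matching procedure above is "try the whole phrase, then every shorter run of consecutive words, longest first". The sub-phrases to check at each step are not shown in the post, so that reading is an assumption. Here is a rough Python sketch under that assumption; `match_skills` and the `reference` set are hypothetical names, and the set stands in for the reference table.

```python
def match_skills(phrase, reference):
    """Find reference-table entries inside a phrase, longest match first.

    `reference` is a set of lowercase skill names (a stand-in for the
    reference table). Punctuation is assumed to be stripped already.
    """
    words = phrase.lower().split()
    used = [False] * len(words)   # words already claimed by a match
    found = []                    # (start index, matched text)

    # Try windows of n words, then n-1, ... down to single words.
    for size in range(len(words), 0, -1):
        for start in range(len(words) - size + 1):
            window = range(start, start + size)
            if any(used[i] for i in window):
                continue          # overlaps a longer match, skip it
            candidate = " ".join(words[start:start + size])
            if candidate in reference:
                found.append((start, candidate))
                for i in window:
                    used[i] = True

    # Return matches in the order they appear in the phrase.
    return [text for _, text in sorted(found)]
```

For example, with a reference set containing "project management", "quality management" and "management", the phrase "project management and quality management" yields the two longer entries and never falls back to the bare "management", because those words are already claimed.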
You could use something like this to build whitelist and blacklist tables. In the long run you'll have better control over what counts as a positive and what does not.
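To make the whitelist/blacklist idea concrete, here is a small Python sketch. It assumes the two lists are plain sets of lowercase terms (in the database they would be two tables), and the `triage` function and the "review" bucket for unknown terms are my own additions for illustration.

```python
def triage(terms, whitelist, blacklist):
    """Sort candidate terms into accepted, rejected, and unknown.

    Unknown terms go to a review bucket so a person can decide which
    list they belong on; that is how the lists improve over time.
    """
    result = {"accept": [], "reject": [], "review": []}
    for term in terms:
        key = term.strip().lower()
        if key in whitelist:
            result["accept"].append(term)
        elif key in blacklist:
            result["reject"].append(term)
        else:
            result["review"].append(term)
    return result
```

Each time someone resolves an entry from the review bucket, it gets added to one of the two lists, so fewer terms need manual attention on the next run.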