Text mining on a large database (data mining)
I have a large database of resumes (CVs), and a table, skills, that groups all users' skills.
Inside that table there is a field, skill_text, that describes the skill in free text.
I'm looking for an algorithm, software or method to extract significant terms and phrases from that table in order to build a new table of standardized skills.
Here are some example skills extracted from the DB:
- Sectoral and competitive analysis
- Business Development (incl. in international settings)
- Specific structure and road design software - Microstation, Macao, AutoCAD (basic knowledge)
- Creative work (Photoshop, In-Design, Illustrator)
- checking and reporting back on campaign progress
- organising and attending events and exhibitions
- Development : Aptana Studio, PHP, HTML, CSS, JavaScript, SQL, AJAX
- Discipline: One to one marketing, E-marketing (SEO & SEA, display, emailing, affiliate program) Mix marketing, Viral Marketing, Social network marketing.
The output should be something like:
- Sectoral and competitive analysis
- Business Development
- Specific structure and road design software -
- Macao
- AutoCAD
- Photoshop
- In-Design
- Illustrator
- organising events
- Development
- Aptana Studio
- PHP
- HTML
- CSS
- JavaScript
- SQL
- AJAX
- Mix marketing
- Viral Marketing
- Social network marketing
- emailing
- SEO
- One to one marketing
As you can see, only the skills remain, with no other descriptive text.
I know this is possible using text mining techniques, but how do I do it?
The database is really large, which is a good thing, because we can calculate term frequencies and decide whether something is a real skill or just meaningless text.
The big problem is: how do I determine that "blablabla" is a skill?
Edit:
Please don't tell me to use standard things like a text tokenizer or regex, because users enter skills in a very arbitrary way!
Thanks
Comments (3)
If I were doing this programmatically, I would:
Extract all punctuation-delimited data (or perhaps just split on brackets and commas) into a new table (with no primary key, just skill), so
Creative work (Photoshop, In-Design, Illustrator)
becomes four rows: Creative work, Photoshop, In-Design and Illustrator.
Then, after you've processed all CVs, query for the most common skills (I would use MySQL), which gives you each skill alongside its number of occurrences.
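A minimal sketch of the split-and-count step, written in Python rather than MySQL. The sample rows, the delimiter set and the function names are assumptions for illustration only; real data will need a tuned delimiter list.

```python
import re
from collections import Counter

# Hypothetical sample rows standing in for the skill_text column.
SKILL_TEXTS = [
    "Creative work (Photoshop, In-Design, Illustrator)",
    "Development : Aptana Studio, PHP, HTML, CSS, JavaScript, SQL, AJAX",
    "Graphic design (Photoshop, Indesign)",
]

# Assumed delimiter set: brackets, commas, colons and semicolons.
DELIMITERS = re.compile(r"[()\[\],:;]")


def split_skills(text):
    """Return the non-empty, whitespace-trimmed fragments of one skill_text value."""
    return [part.strip() for part in DELIMITERS.split(text) if part.strip()]


def skill_frequencies(texts):
    """Count how often each fragment occurs across all rows, ignoring case."""
    counts = Counter()
    for text in texts:
        counts.update(fragment.lower() for fragment in split_skills(text))
    return counts


frequencies = skill_frequencies(SKILL_TEXTS)
# Print the most frequent fragments first, as the frequency query would.
for skill, count in frequencies.most_common(5):
    print(skill, count)
```

Note that "In-Design" and "Indesign" are still counted separately here; merging such variants is a later, manual step.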
Then you decide, from the top X skills, which you want to capture, which must map to other skills (Indesign and In-design should map to the same skill, for example) and which to discard, then script the process using a data map.
Use the data map to write a new word frequency table (this time skill_id, skill, frequency), and on this second pass over the data also write to a lookup table (cv_id, skill_id). Your data will then be in a state where each CV is mapped to a number of skills, and each skill to a number of CVs. You can query for the most popular skills, CVs matching certain criteria, etc.
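The second pass could be sketched as follows, again in Python. The data map contents, the in-memory "tables" and the name build_tables are hypothetical; in practice the results would be written back to the database.

```python
from collections import Counter

# Hypothetical hand-made data map: raw fragment -> canonical skill,
# or None for fragments that should be discarded.
DATA_MAP = {
    "indesign": "In-Design",
    "in-design": "In-Design",
    "photoshop": "Photoshop",
    "basic knowledge": None,
}


def build_tables(cv_fragments, data_map):
    """Apply the data map to the fragments of every CV.

    cv_fragments maps cv_id -> list of raw fragments.
    Returns (skills, lookup): skills maps skill_id -> (skill, frequency),
    lookup is a sorted list of (cv_id, skill_id) pairs.
    """
    skill_ids = {}
    frequency = Counter()
    lookup = set()
    for cv_id, fragments in cv_fragments.items():
        for fragment in fragments:
            skill = data_map.get(fragment.strip().lower())
            if skill is None:  # unmapped or explicitly discarded
                continue
            # Assign ids in order of first appearance, starting at 1.
            skill_id = skill_ids.setdefault(skill, len(skill_ids) + 1)
            frequency[skill_id] += 1
            lookup.add((cv_id, skill_id))
    skills = {sid: (name, frequency[sid]) for name, sid in skill_ids.items()}
    return skills, sorted(lookup)


cv_fragments = {
    1: ["Photoshop", "In-Design", "basic knowledge"],
    2: ["Indesign", "Photoshop"],
}
skills, lookup = build_tables(cv_fragments, DATA_MAP)
print(skills)
print(lookup)
```

Keeping the map as plain data rather than code means the decisions about what to merge or discard can be reviewed and extended without touching the script.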
Many databases will do this for you via their full-text search functionality. I know that PostgreSQL's full-text search would be able to do this easily with the aid of a custom dictionary.
Alternatively, you can use PHP's strtok or an equivalent to index your text. Once it is indexed, you can compare the tokens against a dictionary, or simply use the occurrence counts to build a working list for yourself. Word clouds are made in a similar fashion.
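As a rough illustration of the index-then-compare idea, here is a Python sketch instead of PHP's strtok. The token pattern, the sample rows and the KNOWN_SKILLS dictionary are all assumptions; a real dictionary would come from a curated skill list.

```python
import re
from collections import defaultdict

# Assumed token pattern: lowercase alphanumeric words, hyphens allowed inside.
TOKEN = re.compile(r"[a-z0-9]+(?:-[a-z0-9]+)*")

# Hypothetical dictionary of known single-word skills.
KNOWN_SKILLS = {"photoshop", "autocad", "php", "seo"}


def tokenize(text):
    """Split free text into lowercase tokens."""
    return TOKEN.findall(text.lower())


def build_index(rows):
    """Build an inverted index: token -> set of row ids containing it."""
    index = defaultdict(set)
    for row_id, text in rows.items():
        for token in tokenize(text):
            index[token].add(row_id)
    return index


rows = {
    1: "Creative work (Photoshop, In-Design, Illustrator)",
    2: "Microstation, Macao, AutoCAD (basic knowledge)",
    3: "Development : Aptana Studio, PHP, HTML",
}
index = build_index(rows)

# Keep only dictionary terms that actually occur, with the rows they occur in.
matches = {term: sorted(index[term]) for term in KNOWN_SKILLS if term in index}
print(matches)
```

This only finds single-word skills that are already in the dictionary; multi-word skills and unknown terms need a different approach.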
Doing this well requires knowledge; otherwise what's to tell "organising events" is a 'skill' while "creative work" isn't? But a stupid program can take a first cut at it by analyzing statistics of collocations: see the answers to How to extract common / significant phrases from a series of text entries and Algorithms to detect phrases and keywords from text.
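One common collocation statistic is pointwise mutual information (PMI) over adjacent word pairs. The sketch below is only one possible first cut, not the method from the linked answers; the sample texts, the min_count threshold and the function names are assumptions.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Split free text into lowercase tokens (hyphens allowed inside words)."""
    return re.findall(r"[a-z0-9]+(?:-[a-z0-9]+)*", text.lower())


def score_bigrams(texts, min_count=2):
    """Score adjacent word pairs by PMI, skipping pairs seen fewer than min_count times."""
    unigrams = Counter()
    bigrams = Counter()
    for text in texts:
        tokens = tokenize(text)
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    total_unigrams = sum(unigrams.values())
    total_bigrams = sum(bigrams.values())
    scores = {}
    for (first, second), count in bigrams.items():
        if count < min_count:
            continue
        p_pair = count / total_bigrams
        p_first = unigrams[first] / total_unigrams
        p_second = unigrams[second] / total_unigrams
        # PMI: how much more often the pair occurs than chance would predict.
        scores[(first, second)] = math.log2(p_pair / (p_first * p_second))
    return scores


# Hypothetical sample entries.
texts = [
    "Viral marketing and social network marketing",
    "Social network marketing, viral marketing",
    "Experience in viral marketing",
]
scores = score_bigrams(texts)
for pair, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(" ".join(pair), round(score, 2))
```

Pairs with a high score and a reasonable count are candidates for skill phrases, but a person still has to decide which candidates are actually skills.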