Tracking word proximity
I am working on a small project that involves dictionary-based text searching within a collection of documents. My dictionary has positive signal words (a.k.a. good words), but in the document collection just finding a word does not guarantee a positive result, as there may be negative words (for example "not", "not significant") in the proximity of these positive words. I want to construct a matrix that contains the document number, the positive word, and its proximity to negative words.
Can anyone please suggest a way to do that? My project is at a very, very early stage, so I am giving a basic example of my text.
No significant drug interactions have been reported in studies of candesartan cilexetil given with other drugs such as glyburide, nifedipine, digoxin, warfarin, hydrochlorothiazide.
This is my example document, in which candesartan cilexetil, glyburide, nifedipine, digoxin, warfarin, hydrochlorothiazide are my positive words and "no significant" is my negative word. I want to do a proximity (word-based) mapping between my positive and negative words.
Can anyone give some helpful pointers?
Comments (2)
First of all, I would suggest not using R for this task. R is great for many things, but text manipulation is not one of them. Python could be a good alternative.
That said, if I were to implement this in R, I would probably do something like (very very rough):
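For illustration only, and not the R code referred to above: a minimal Python sketch (Python being the alternative suggested earlier) of how a word-based proximity table could be built. The function names `tokenize`, `phrase_positions` and `proximity_rows`, the regex tokenizer, and the choice to measure distance between the starting words of the two phrases are all assumptions made for this sketch.

```python
import re

def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def phrase_positions(tokens, phrase):
    """Return the start index of every occurrence of a (possibly multi-word) phrase."""
    words = tokenize(phrase)
    n = len(words)
    if n == 0:
        return []
    return [i for i in range(len(tokens) - n + 1) if tokens[i:i + n] == words]

def proximity_rows(docs, positive_terms, negative_terms):
    """Build rows of (doc_id, positive term, nearest negative term, distance).

    Distance is the absolute difference, in words, between the start positions
    of the two phrases. Both are None when the document has no negative term.
    """
    rows = []
    for doc_id, text in docs.items():
        tokens = tokenize(text)
        # All (position, term) hits for negative phrases in this document.
        neg_hits = [(pos, term) for term in negative_terms
                    for pos in phrase_positions(tokens, term)]
        for term in positive_terms:
            for pos in phrase_positions(tokens, term):
                if neg_hits:
                    neg_pos, neg_term = min(neg_hits, key=lambda h: abs(h[0] - pos))
                    rows.append((doc_id, term, neg_term, abs(neg_pos - pos)))
                else:
                    rows.append((doc_id, term, None, None))
    return rows

# Example using the sentence from the question (document number 1 is arbitrary).
docs = {
    1: "No significant drug interactions have been reported in studies of "
       "candesartan cilexetil given with other drugs such as glyburide, "
       "nifedipine, digoxin, warfarin, hydrochlorothiazide."
}
positive = ["candesartan cilexetil", "glyburide", "warfarin"]
negative = ["no significant", "not"]

for row in proximity_rows(docs, positive, negative):
    print(row)
```

Each row can then be filtered with whatever window size seems appropriate (for example, treating a positive word as negated only when the distance is below some threshold); the threshold itself is a modelling choice this sketch does not make.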
Did you look at either one of the