Tracking the proximity of words

Posted on 2024-09-06 14:46:31

I am working on a small project that involves dictionary-based text searching within a collection of documents. My dictionary has positive signal words (a.k.a. good words), but in the document collection just finding a word does not guarantee a positive result, because negative words (for example "not" or "not significant") may appear in the proximity of these positive words. I want to construct a matrix that contains the document number, the positive word, and its proximity to negative words.

Can anyone please suggest a way to do that? My project is at a very early stage, so I am giving a basic example of my text.

No significant drug interactions have been reported in studies of candesartan cilexetil given with other drugs such as glyburide, nifedipine, digoxin, warfarin, hydrochlorothiazide.

This is my example document, in which "candesartan cilexetil", "glyburide", "nifedipine", "digoxin", "warfarin", and "hydrochlorothiazide" are my positive words and "no significant" is my negative word. I want to do a proximity (word-based) mapping between my positive and negative words.

Can anyone give some helpful pointers?

一杆小烟枪 2024-09-13 14:46:31

First of all, I would suggest not using R for this task. R is great for many things, but text manipulation is not one of them. Python could be a good alternative.

That said, if I were to implement this in R, I would probably do something like this (very, very rough):

# You will probably read these from an external file or a database
goodWords <- c("candesartan cilexetil", "glyburide", "nifedipine", "digoxin", "blabla", "warfarin", "hydrochlorothiazide")
badWords <- c("no significant", "other drugs")

mytext <- "no significant drug interactions have been reported in studies of candesartan cilexetil given with other drugs such as glyburide, nifedipine, digoxin, warfarin, hydrochlorothiazide."
mytext <- tolower(mytext) # Let's make life a little bit easier...

goodPos <- NULL
badPos <- NULL

# First we find the good words
for (w in goodWords)
    {
    pos <- regexpr(w, mytext, fixed = TRUE) # literal match, first occurrence only
    if (pos != -1)
        {
        cat(paste(w, "found at position", pos, "\n"))
        }
    else    
        {
        pos <- NA
        cat(paste(w, "not found\n"))
        }

    goodPos <- c(goodPos, pos)
    }

# And then the bad words
for (w in badWords)
    {
    pos <- regexpr(w, mytext, fixed = TRUE) # literal match, first occurrence only
    if (pos != -1)
        {
        cat(paste(w, "found at position", pos, "\n"))
        }
    else    
        {
        pos <- NA
        cat(paste(w, "not found\n"))
        }

    badPos <- c(badPos, pos)
    }

# Note that we use -badPos so that we can calculate the distance with rowSums
# (positions are character offsets, so the distances are in characters, not words)
comb <- expand.grid(goodPos, -badPos)
wordcomb <- expand.grid(goodWords, badWords)
dst <- cbind(wordcomb, abs(rowSums(comb)))

mn <- which.min(dst[,3])
cat(paste("The closest good-bad word pair is: ", dst[mn, 1],"-", dst[mn, 2],"\n"))
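
The code above measures distance in characters and only looks at the first occurrence of each term, while the question asks for a word-based proximity and a matrix keyed by document number. A minimal sketch of that variant follows. The helper names `termWordPositions` and `proximityMatrix` are hypothetical, and the sketch assumes that dictionary terms are lower-case, space-separated, and free of punctuation, and that distance means the number of words between the start of the good term and the start of the nearest bad term.

```r
# Hypothetical helper: word index at which each occurrence of a
# (possibly multi-word) term starts inside a vector of tokens.
termWordPositions <- function(term, tokens)
    {
    termTokens <- strsplit(term, " ", fixed = TRUE)[[1]]
    n <- length(termTokens)
    if (length(tokens) < n) return(integer(0))
    starts <- seq_len(length(tokens) - n + 1)
    isMatch <- vapply(starts,
                      function(i) all(tokens[i:(i + n - 1)] == termTokens),
                      logical(1))
    starts[isMatch]
    }

# Hypothetical helper: one row per (document, good word found) with the
# nearest bad word and the distance to it, measured in words.
proximityMatrix <- function(docs, goodWords, badWords)
    {
    rows <- list()
    for (d in seq_along(docs))
        {
        # Lower-case, turn punctuation into spaces, split on whitespace
        cleaned <- gsub("[[:punct:]]", " ", tolower(docs[d]))
        tokens <- strsplit(trimws(cleaned), "[[:space:]]+")[[1]]
        for (g in goodWords)
            {
            gPos <- termWordPositions(g, tokens)
            if (length(gPos) == 0) next # good word absent from this document
            best <- data.frame(doc = d, goodWord = g,
                               badWord = NA_character_,
                               distance = NA_integer_,
                               stringsAsFactors = FALSE)
            for (b in badWords)
                {
                bPos <- termWordPositions(b, tokens)
                if (length(bPos) == 0) next
                # Smallest gap over all occurrence pairs of g and b
                wordDist <- min(abs(outer(gPos, bPos, "-")))
                if (is.na(best$distance) || wordDist < best$distance)
                    {
                    best$badWord <- b
                    best$distance <- wordDist
                    }
                }
            rows[[length(rows) + 1]] <- best
            }
        }
    do.call(rbind, rows)
    }

docs <- c("No significant drug interactions have been reported in studies of candesartan cilexetil given with other drugs such as glyburide, nifedipine, digoxin, warfarin, hydrochlorothiazide.")
goodTerms <- c("candesartan cilexetil", "glyburide", "nifedipine", "digoxin", "blabla", "warfarin", "hydrochlorothiazide")
badTerms <- c("no significant")

result <- proximityMatrix(docs, goodTerms, badTerms)
print(result)
```

With the sample sentence, "no significant" starts at word 1 and "candesartan cilexetil" at word 11, so that pair gets a distance of 10 words; "blabla" is not in the text and produces no row. Whether a given distance is close enough to negate a good word is a threshold you would have to choose for your own data.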
