Monitoring a brand whose name is a common word

Posted 2024-08-21 02:50:27, 100 words, 10 views, 0 comments

Let's say you need to monitor the brand "ONE" online. What algorithms can be used to separate pages about the brand ONE from pages that merely contain the common word "one"?

I'm thinking maybe Bayes could work, but are there other ways to do this?

Comments (6)

成熟的代价 2024-08-28 02:50:27

If it's not a truly unique word, I would suggest the following approach.

Let's imagine that our keyword is Java. Then there are at least two categories: pages about programming and pages about tourism in Indonesia. We are interested in the first one.

Let's take a small reference text about Java (maybe from a book or from Wikipedia). Then pick a threshold (for example, 0.7) and compare the reference text with each candidate page. One of the fastest ways is the classic Vector Space Model algorithm; you can implement it yourself or find an existing implementation online. Finally, compare each similarity score with your threshold and filter out the weak results.
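Here is a minimal sketch of that comparison, assuming raw term-frequency vectors and cosine similarity (a real setup would more likely use TF-IDF weights and stop-word removal). The function names and the sample texts are mine, purely for illustration:

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def cosine_similarity(text_a, text_b):
    """Cosine of the angle between the term-frequency vectors of two texts."""
    vec_a, vec_b = Counter(tokenize(text_a)), Counter(tokenize(text_b))
    dot = sum(count * vec_b[term] for term, count in vec_a.items())
    norm_a = math.sqrt(sum(c * c for c in vec_a.values()))
    norm_b = math.sqrt(sum(c * c for c in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text is similar to nothing
    return dot / (norm_a * norm_b)


def filter_pages(reference, pages, threshold):
    """Keep only the pages whose similarity to the reference meets the threshold."""
    return [p for p in pages if cosine_similarity(reference, p) >= threshold]


# Hypothetical reference text and candidate pages.
reference = "java programming language compiler"
pages = [
    "the java compiler turns java programming language source into bytecode",
    "java is an island in indonesia with many volcanoes",
]
relevant = filter_pages(reference, pages, threshold=0.7)
```

With these toy texts only the first page passes the 0.7 threshold. On real pages, which are much longer than the reference text, the scores tend to be lower, so the threshold has to be tuned on your own data.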


About using a Bayes algorithm: it's not a bad approach, in my opinion. But you should "teach" your algorithm very carefully, because a few bad inputs can spoil the whole thing.

Let me explain. The input to your Bayes algorithm is a text containing your brand word. The output is the probability, in the range [0 .. 1], that the text is about your brand rather than something else. In practice this algorithm very often gives results near 0 or near 1 and rarely returns values between 0.2 and 0.8. That means the algorithm is very sensitive to small variations: 1 or 2 words in a text of 100 words can seriously affect the result.
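To make the input and output concrete, here is a rough sketch of a two-class multinomial naive Bayes classifier with add-one smoothing. This is one common variant, not necessarily the exact one you would end up using; the class name, the labels "brand" and "other", and the training sentences are all made up for the example:

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


class NaiveBayes:
    """Two-class multinomial naive Bayes with add-one (Laplace) smoothing."""

    def __init__(self):
        self.word_counts = {"brand": Counter(), "other": Counter()}
        self.doc_counts = {"brand": 0, "other": 0}

    def train(self, text, label):
        """Add one labelled example; label must be "brand" or "other"."""
        self.word_counts[label].update(tokenize(text))
        self.doc_counts[label] += 1

    def prob_brand(self, text):
        """Posterior probability, in [0, 1], that the text is about the brand."""
        if 0 in self.doc_counts.values():
            raise ValueError("need at least one training text per class")
        vocab = set(self.word_counts["brand"]) | set(self.word_counts["other"])
        total_docs = sum(self.doc_counts.values())
        log_score = {}
        for label, counts in self.word_counts.items():
            total_words = sum(counts.values())
            score = math.log(self.doc_counts[label] / total_docs)  # class prior
            for word in tokenize(text):
                if word not in vocab:
                    continue  # words never seen in training carry no evidence
                score += math.log((counts[word] + 1) / (total_words + len(vocab)))
            log_score[label] = score
        # Normalise in log space so long texts do not underflow to zero.
        top = max(log_score.values())
        weight = {label: math.exp(s - top) for label, s in log_score.items()}
        return weight["brand"] / (weight["brand"] + weight["other"])


# Hypothetical training data.
nb = NaiveBayes()
nb.train("ONE launches new phone with fast charging", "brand")
nb.train("ONE phone review camera battery", "brand")
nb.train("one of the best recipes for dinner", "other")
nb.train("only one day left until the holiday", "other")
p_brand = nb.prob_brand("new ONE phone battery")
p_other = nb.prob_brand("one day until dinner")
```

Because every word multiplies its likelihood into the score, a handful of strongly class-specific words is enough to push the posterior towards 0 or 1, which is the sensitivity described above.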

You may want to associate brand ONE with its products, its executive officers or its challengers in your monitoring.
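As a rough illustration of that idea, a page could be kept only when the brand word appears together with at least one associated term. The function name, the `min_hits` parameter, and the related terms ("aurora" as a product name, "rival" as a challenger) are hypothetical:

```python
import re


def mentions_brand_context(text, brand, related_terms, min_hits=1):
    """Return True if the text contains the brand word plus enough related terms."""
    tokens = set(re.findall(r"[a-z0-9]+", text.lower()))
    if brand.lower() not in tokens:
        return False
    # Count how many single-word related terms occur in the text.
    hits = sum(1 for term in related_terms if term.lower() in tokens)
    return hits >= min_hits


# Hypothetical list of products, executives, and challengers.
related = {"aurora", "rival"}
keep = mentions_brand_context("ONE unveils the Aurora phone", "ONE", related)
drop = mentions_brand_context("Only one day left", "ONE", related)
```

This sketch only matches single-word terms; multi-word names (a full executive name, say) would need phrase matching instead of a token set.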

罪#恶を代价 2024-08-28 02:50:27

The term you're looking for is concept learning or concept extraction. The word "one" appears in many pages, but most often it refers to the concept of one as a quantity. Only rarely does it refer to the concept of ONE the brand. (Another frequently used example is SUN, as in the astronomical object, versus the company named Sun.)

I know Ari Rappoport has done a lot of research on this topic. Practically this boils down to something like mouviciel's answer, but Ari's research is also about how you can automatically infer which related words you need to look for in order to distinguish one-as-number from one-the-brand.

妳是的陽光 2024-08-28 02:50:27

I've approached similar problems by treating Wikipedia as a giant ontology (where each hyperlink is a relation between a source node and an end node).

EDIT: One very rough algorithm, using the "Java" example:

  • Query "Java" in Wikipedia. Among others, this should give you (at least) the island and the programming language.
  • Get the in/out nodes of these base pages (from the base pages' hyperlinks).
  • You now have small sets of correlated words.
  • Compute a "distance" from each set to the page and find the minimum of these distances.

The distance you use is very subjective and must be tweaked a bit to match your needs. You might also have trouble getting the "core" of each page, as parsing HTML will be a major pain.
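A minimal sketch of the last step, assuming the link sets have already been collected and picking Jaccard distance as the (arbitrary) distance. The term sets below are hand-written stand-ins for what the Wikipedia links would give you, and the function names are mine:

```python
import re


def jaccard_distance(set_a, set_b):
    """1 minus the share of common elements; 0.0 means identical sets."""
    union = set_a | set_b
    if not union:
        return 0.0  # two empty sets are treated as identical
    return 1.0 - len(set_a & set_b) / len(union)


def closest_sense(page_text, sense_terms):
    """Return (sense, distance) for the term set nearest to the page's words."""
    page_terms = set(re.findall(r"[a-z]+", page_text.lower()))
    distances = {
        sense: jaccard_distance(terms, page_terms)
        for sense, terms in sense_terms.items()
    }
    best = min(distances, key=distances.get)
    return best, distances[best]


# Hypothetical term sets, standing in for the links of each base page.
senses = {
    "island": {"indonesia", "jakarta", "volcano", "island"},
    "language": {"programming", "compiler", "bytecode", "oracle"},
}
sense, distance = closest_sense(
    "The Java compiler emits bytecode for the virtual machine", senses
)
```

Here the page lands on the "language" sense. Jaccard distance penalises long pages heavily, so on real data another measure (for example the fraction of a sense's terms found in the page) may work better.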

↙厌世 2024-08-28 02:50:27

I would suggest an unsupervised approach to the problem:

  1. Get as many documents as possible that describe "ONE" in the correct context and create a corpus.

  2. Find statistically improbable phrases in that corpus, measured against a standard English corpus.

This website gives a good example:
http://sip.s-anand.net/?url=http://en.wikipedia.org/wiki/Apple_Inc.

As you can see, brand-specific terms such as ipod, powerpc, etc. are easily picked out.
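A rough sketch of how such terms could be ranked, assuming single words stand in for phrases and using the ratio of smoothed relative frequencies between the brand corpus and a background corpus. I don't know how the linked site computes its scores; the function name and the tiny corpora are invented:

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def improbable_terms(brand_docs, background_docs, top_n=5):
    """Rank brand-corpus words by how over-represented they are vs. the background."""
    brand = Counter(t for doc in brand_docs for t in tokenize(doc))
    background = Counter(t for doc in background_docs for t in tokenize(doc))
    brand_total = sum(brand.values())
    background_total = sum(background.values())
    vocab_size = len(set(brand) | set(background))

    def ratio(term):
        # Add-one smoothing keeps words absent from the background finite.
        p_brand = (brand[term] + 1) / (brand_total + vocab_size)
        p_background = (background[term] + 1) / (background_total + vocab_size)
        return p_brand / p_background

    # Highest ratio first; ties are broken alphabetically for stable output.
    ranked = sorted(brand, key=lambda term: (-ratio(term), term))
    return ranked[:top_n]


# Hypothetical corpora.
brand_docs = ["apple ipod sales rise", "the ipod and the powerpc mac"]
background_docs = ["the cat and the dog", "the sales of the house rise and fall"]
top_terms = improbable_terms(brand_docs, background_docs, top_n=4)
```

On this toy data the top terms are the brand-specific ones (ipod, apple, mac, powerpc), while common words like "the" fall to the bottom. A real setup would need a much larger background corpus and probably n-grams rather than single words.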

Once you have extracted those, you can create a Google Alert or a similar equivalent (if Google Alerts is too simplistic) with queries like "SIP" AND "ONE", where SIP is one of the extracted phrases, to monitor new articles.

Of course, given that this approach is unsupervised, it might not be very efficient, but it should do the job.

十年不长 2024-08-28 02:50:27

A different approach could be to look the page up in Google Directory, which has 'the web organized by topic into categories'. You could potentially use the category information for each page to decide what it is about.
