How do you evaluate the quality of a web page?

Posted on 2024-08-31 03:36:42

I'm doing a university project that must gather and combine data on a user-provided topic. The problem I've encountered is that Google search results for many terms are polluted with low-quality autogenerated pages, and if I use them, I can end up with wrong facts. How is it possible to estimate the quality/trustworthiness of a page?

You may think "nah, Google engineers have been working on that problem for 10 years and he's asking for a solution", but if you think about it, a search engine must provide up-to-date content, and if it marks a good page as a bad one, users will be dissatisfied. I don't have such limitations, so if the algorithm accidentally marks some good pages as bad, that wouldn't be a problem.

Here's an example:
Say the input is buy aspirin in south la. Try searching Google for it. The first 3 results have already been deleted from their sites, but the fourth one is interesting: radioteleginen.ning.com/profile/BuyASAAspirin (I don't want to make an active link)

Here's the first paragraph of the text:

The bare of purchasing prescription drugs from Canada is big
in the U.S. at this moment. This is
because in the U.S. prescription drug
prices bang skyrocketed making it
arduous for those who bang limited or
concentrated incomes to buy their much
needed medications. Americans pay more
for their drugs than anyone in the
class.

The rest of the text is similar, and then a list of related keywords follows. This is what I consider a low-quality page. While this particular text almost makes sense (apart from being horribly written), the other examples I've seen (but can't find now) are just rubbish whose purpose is to pull some users in from Google and get banned one day after creation.

Comments (5)

蓝咒 2024-09-07 03:36:43

N-gram Language Models

You could try training one n-gram language model on the autogenerated spam pages and one on a collection of other non-spam webpages.

You could then simply score new pages with both language models to see if the text looks more similar to the spam webpages or regular web content.
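Here is a minimal sketch of that idea in pure Python, using bigrams with add-one smoothing (the toy corpora, tokenization, and smoothing choice are all assumptions for illustration; in practice you would train on real crawled pages, ideally with a proper toolkit like the ones mentioned below):

import math
from collections import Counter

def train_bigram_lm(docs):
    # Train a bigram language model (raw counts) from a list of token lists.
    unigrams, bigrams = Counter(), Counter()
    for tokens in docs:
        padded = ["<s>"] + tokens + ["</s>"]
        unigrams.update(padded)
        bigrams.update(zip(padded, padded[1:]))
    return unigrams, bigrams, len(unigrams)

def log_prob(tokens, model):
    # Log-probability of a token sequence under the model, with add-one
    # (Laplace) smoothing so unseen bigrams do not zero out the score.
    unigrams, bigrams, vocab_size = model
    padded = ["<s>"] + tokens + ["</s>"]
    return sum(math.log((bigrams[(prev, cur)] + 1) /
                        (unigrams[prev] + vocab_size))
               for prev, cur in zip(padded, padded[1:]))

# Toy training data standing in for tokenized spam and non-spam pages.
spam_docs = [["buy", "cheap", "aspirin", "online"],
             ["buy", "cheap", "drugs", "online", "now"]]
ham_docs = [["aspirin", "is", "used", "to", "treat", "pain"],
            ["the", "study", "compared", "drug", "prices"]]

spam_lm = train_bigram_lm(spam_docs)
ham_lm = train_bigram_lm(ham_docs)

page = ["buy", "cheap", "drugs"]
# Whichever model assigns the higher log-probability is the closer match.
print(log_prob(page, spam_lm), log_prob(page, ham_lm))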

Better Scoring through Bayes Law

When you score a text with the spam language model, you get an estimate of the probability of finding that text on a spam web page, P(Text|Spam). The notation reads as the probability of Text given Spam (page). The score from the non-spam language model is an estimate of the probability of finding the text on a non-spam web page, P(Text|Non-Spam).

However, the term you probably really want is P(Spam|Text) or, equivalently P(Non-Spam|Text). That is, you want to know the probability that a page is Spam or Non-Spam given the text that appears on it.

To get either of these, you'll need to use Bayes Law, which states

           P(B|A)P(A)
P(A|B) =  ------------
              P(B)

Using Bayes law, we have

P(Spam|Text)=P(Text|Spam)P(Spam)/P(Text)

and

P(Non-Spam|Text)=P(Text|Non-Spam)P(Non-Spam)/P(Text)

P(Spam) is your prior belief that a page selected at random from the web is a spam page. You can estimate this quantity by counting how many spam web pages there are in some sample, or you can even use it as a parameter that you manually tune to trade off precision and recall. For example, giving this parameter a high value will result in fewer spam pages being mistakenly classified as non-spam, while giving it a low value will result in fewer non-spam pages being accidentally classified as spam.

The term P(Text) is the overall probability of finding Text on any webpage. If we ignore that P(Text|Spam) and P(Text|Non-Spam) were determined using different models, this can be calculated as P(Text)=P(Text|Spam)P(Spam) + P(Text|Non-Spam)P(Non-Spam). This sums out the binary variable Spam/Non-Spam.
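As a worked example with made-up numbers (all three inputs below are assumptions, not measured values): suppose the spam model scores a page at P(Text|Spam) = 1e-6, the non-spam model gives P(Text|Non-Spam) = 1e-8, and your prior is P(Spam) = 0.2. Then:

# Assumed scores, purely for illustration.
p_text_given_spam = 1e-6   # from the spam language model
p_text_given_ham = 1e-8    # from the non-spam language model
p_spam = 0.2               # prior belief that a random page is spam
p_ham = 1.0 - p_spam

# P(Text): sum out the binary Spam/Non-Spam variable.
p_text = p_text_given_spam * p_spam + p_text_given_ham * p_ham

p_spam_given_text = p_text_given_spam * p_spam / p_text   # ~0.962
p_ham_given_text = p_text_given_ham * p_ham / p_text      # ~0.038
print(p_spam_given_text, p_ham_given_text)                # the two sum to 1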

Classification Only

However, if you're not going to use the probabilities for anything else, you don't need to calculate P(Text). Rather, you can just compare the numerators P(Text|Spam)P(Spam) and P(Text|Non-Spam)P(Non-Spam). If the first one is bigger, the page is most likely a spam page, while if the second one is bigger the page is most likely non-spam. This works since the equations above for both P(Spam|Text) and P(Non-Spam|Text) are normalized by the same P(Text) value.
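A sketch of that decision rule, done in log space so the very small probabilities of long texts don't underflow (the example log-scores and the prior value are assumptions):

import math

def classify(logp_text_given_spam, logp_text_given_ham, p_spam=0.5):
    # Compare the two numerators in log space; P(Text) cancels out,
    # so it never has to be computed.
    spam_score = logp_text_given_spam + math.log(p_spam)
    ham_score = logp_text_given_ham + math.log(1.0 - p_spam)
    return "spam" if spam_score > ham_score else "non-spam"

# Lowering p_spam makes the rule more reluctant to call a page spam,
# trading spam recall for non-spam precision.
print(classify(-120.0, -135.0, p_spam=0.3))  # -> spam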

Tools

In terms of software toolkits you could use for something like this, SRILM would be a good place to start and it's free for non-commercial use. If you want to use something commercially and you don't want to pay for a license, you could use IRST LM, which is distributed under the LGPL.

め可乐爱微笑 2024-09-07 03:36:43

Define the 'quality' of a web page? What is the metric?

If someone was looking to buy fruit, then searching for 'big sweet melons' will give many results that contain images of a 'non textile' slant.

The markup and hosting of those pages may, however, be sound engineering...

But a page of a dirt farmer presenting his high quality, tasty and healthy produce might be visible only in IE4.5 since the html is 'broken' ...

油焖大侠 2024-09-07 03:36:43

For each result set per keyword query, do a separate Google query to find the number of sites linking to that site; if no other site links to it, exclude it. I think this would be a good start at least.
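A rough sketch of that filter; count_backlinks is a hypothetical, caller-supplied function (wrapping whatever backlink or search API you have access to), not a real library call:

def filter_by_backlinks(result_urls, count_backlinks, min_backlinks=1):
    # Keep only results that at least `min_backlinks` other sites link to.
    # `count_backlinks` is a hypothetical dependency supplied by the caller.
    return [url for url in result_urls if count_backlinks(url) >= min_backlinks]

# Example with a stubbed backlink source.
fake_backlink_index = {"example.org/good-page": 12, "spamsite.example/junk": 0}
print(filter_by_backlinks(fake_backlink_index, fake_backlink_index.get))
# -> ['example.org/good-page']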

成熟稳重的好男人 2024-09-07 03:36:43

If you are looking for performance-related metrics, then Y!Slow [a plugin for Firefox] could be useful.

http://developer.yahoo.com/yslow/

万水千山粽是情ミ 2024-09-07 03:36:43

You can use a supervised learning model to do this type of classification. The general process goes as follows:

  1. Get a sample set for training. This will need to provide examples of the documents you want to cover. The more general you want to be, the larger the example set you need to use. If you only want to focus on websites related to aspirin, that shrinks the necessary sample set.

  2. Extract features from the documents. This could be the words pulled from the website.

  3. Feed the features into a classifier such as one of those provided by MALLET or WEKA.

  4. Evaluate the model using something like k-fold cross validation.

  5. Use the model to rate new websites.

When you talk about not caring whether you mark a good site as a bad one, that relates to recall. Recall measures, of the pages you should have gotten back, how many you actually got back. Precision measures, of the pages you marked as 'good' or 'bad', how many were labeled correctly. Since you state that your goal is to be more precise and recall isn't as important, you can tweak your model for higher precision.
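A sketch of that pipeline in Python, with scikit-learn standing in for MALLET/WEKA and a tiny made-up training set (a real sample set would be labeled pages you collected yourself):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Steps 1-2: a toy labeled sample set; the features are just the words.
texts = [
    "buy cheap aspirin online no prescription best price",
    "cheap drugs shipped overnight buy now limited offer",
    "aspirin is a medication used to reduce pain fever and inflammation",
    "the pharmacy on main street is open until nine on weekdays",
]
labels = ["spam", "spam", "good", "good"]

# Step 3: bag-of-words features fed into a Naive Bayes classifier.
model = make_pipeline(CountVectorizer(), MultinomialNB())

# Step 4: k-fold cross-validation (k=2 only because the toy set is tiny).
print(cross_val_score(model, texts, labels, cv=2))

# Step 5: fit on everything and rate a new page.
model.fit(texts, labels)
print(model.predict(["buy aspirin cheap online"]))  # likely 'spam'

To push precision further, you could look at the model's predicted probabilities and only trust pages whose 'good' probability clears a threshold you choose.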
