How can I build an algorithm to classify HTML pages based on keywords?



I'm trying to create an algorithm that assigns a relevance score to a webpage based on keywords it finds on the page.

I'm doing this at the moment:

I set some words and a value for each: "movie" (10), "cinema" (6), "actor" (5) and "hollywood" (4), and I search certain parts of the page, giving each part a weight and multiplying it by the word's weight.

Example: the "movie" word word was found in the URL(1.5) * 10 and in title(2.5) * 10 = 40

This is trash! It's my first attempt, and it returns some relevant results, but I don't think a relevance expressed by unbounded values like 244, 66, 30 and 15 is useful.
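For reference, here is a minimal sketch of that naive scheme (the function name, the weights and the example URL are illustrative only):

```python
WORD_WEIGHTS = {"movie": 10, "cinema": 6, "actor": 5, "hollywood": 4}
SECTION_WEIGHTS = {"url": 1.5, "title": 2.5}

def naive_relevance(url: str, title: str) -> float:
    """Sum section_weight * word_weight for every keyword found in a section."""
    sections = {"url": url.lower(), "title": title.lower()}
    score = 0.0
    for section, text in sections.items():
        for word, word_weight in WORD_WEIGHTS.items():
            if word in text:
                score += SECTION_WEIGHTS[section] * word_weight
    return score

# "movie" in both URL and title: 1.5*10 + 2.5*10 = 40, as in the example above.
print(naive_relevance("http://example.com/movie/123", "Best movie of 2011"))
```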

I want something that stays inside a range, from 0 to 1 or from 1 to 100.
What type of word weighting can I use?

Besides that, are there ready-made algorithms to score the relevance of an HTML page based on things like the URL, keywords, title, etc., excluding the main content?

EDIT 1: All of this can be rebuilt; the current weights are arbitrary. I want to use a concise weighting scheme, not random numbers like 10, 5 and 3.

Something like: low importance = 1, medium importance = 2, high importance = 4, deterministic importance = 8.

Title > link part of the URL > domain > keywords
movie > cinema > actor > hollywood
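Under a scheme like that the maximum possible score is fixed, so dividing by it keeps the result in the 0-to-1 range I asked for. A minimal sketch (the exact mapping of importance levels to sections and words is an assumption based on the two rankings above):

```python
# Assumed power-of-two weights, following the rankings above; illustrative only.
SECTION_WEIGHTS = {"title": 8, "url_path": 4, "domain": 2, "meta_keywords": 1}
WORD_WEIGHTS = {"movie": 8, "cinema": 4, "actor": 2, "hollywood": 1}

def bounded_relevance(sections: dict) -> float:
    """Score in [0, 1]: achieved weight divided by the maximum possible weight."""
    max_score = sum(SECTION_WEIGHTS.values()) * sum(WORD_WEIGHTS.values())
    score = sum(
        SECTION_WEIGHTS[name] * word_weight
        for name, text in sections.items()
        for word, word_weight in WORD_WEIGHTS.items()
        if word in text.lower()
    )
    return score / max_score

print(bounded_relevance({
    "title": "Movie reviews",
    "url_path": "/cinema/reviews",
    "domain": "example.com",
    "meta_keywords": "movie, cinema, actor",
}))  # 94 / 225 ≈ 0.42
```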

EDIT 2: At the moment, I want to analyze the page's relevance for words excluding the body content of the page. I will include in the analysis the domain, the link part of the URL, the title and the keywords (plus any other meta information I judge useful).
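A sketch of how those fields could be pulled out with just the Python standard library (the class and function names are invented for illustration):

```python
from html.parser import HTMLParser
from urllib.parse import urlparse

class MetaExtractor(HTMLParser):
    """Collect the <title> text and <meta name="keywords"> content."""
    def __init__(self):
        super().__init__()
        self.title = ""
        self.keywords = ""
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta" and (attrs.get("name") or "").lower() == "keywords":
            self.keywords = attrs.get("content") or ""

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data

def page_fields(url: str, html: str) -> dict:
    """Split a URL and its HTML into the meta fields to be scored."""
    parsed = urlparse(url)
    extractor = MetaExtractor()
    extractor.feed(html)
    return {
        "domain": parsed.netloc,
        "url_path": parsed.path,
        "title": extractor.title.strip(),
        "meta_keywords": extractor.keywords,
    }
```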

The reason for this is that the HTML content is dirty. I can find many words like 'movie' in menus and advertisements even when the main content of the page contains nothing relevant to the theme.

Another reason is that some pages have meta information indicating that the page contains info about a movie, but the main content doesn't. Example: a page that contains the plot of the film, telling the story, the characters, etc., but nothing in that text indicates that it is about a movie; only the page's meta information does.

Later, after running the relevance analysis on the HTML page, I will do a relevance analysis on the (filtered) content separately.


Comments (4)

醉殇 2024-12-09 14:08:40


Are you able to index these documents in a search engine? If you are, then maybe you should consider using this latent semantic library.

You can get the actual project from here: https://github.com/algoriffic/lsa4solr

What you are trying to do is determine the meaning of a text corpus and classify it based on that meaning. However, words are not individually unique, nor can they be considered in the abstract, apart from the overall article.

For example, suppose that you have an article which talks a lot about "Windows". The word is used 7 times in a 300-word article, so you know that it is important. However, what you don't know is whether it is talking about the operating system "Windows" or the things that you look through.

Suppose then that you also see words such as "installation"; that doesn't help you at all either, because people install windows in their houses much like they install the Windows operating system. However, if the very same article talks about defragmentation, operating systems, the command line and Windows 7, then you can guess that this document is actually about the Windows operating system.

However, how can you determine this?

This is where Latent Semantic Indexing comes in. What you want to do is extract the entire document's text and then apply some clever analysis to it.

The matrices that you build (see here) are way above my head, and although I have looked at some libraries and used them, I have never been able to fully understand the complex math behind building the space-aware matrix used by Latent Semantic Analysis... so my advice would be to just use an existing library that does this for you.
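For instance, here is a minimal latent-semantic-analysis sketch using scikit-learn rather than lsa4solr (the toy documents are invented; the idea is the same TF-IDF plus truncated-SVD pipeline):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "defragmentation command line windows 7 operating system",
    "installing new glass windows in your house",
    "movie cinema actor hollywood film plot",
]

# Classic LSA: a TF-IDF term-document matrix reduced by truncated SVD.
tfidf = TfidfVectorizer().fit_transform(docs)
latent = TruncatedSVD(n_components=2).fit_transform(tfidf)

# Documents that are close in the latent space are semantically similar.
print(cosine_similarity(latent))
```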

I'm happy to remove this answer if you aren't looking for external libraries and want to do this yourself.

煞人兵器 2024-12-09 14:08:40


A simple way to convert anything into a 0-100 range (for any positive value X):

(1-1/(1+X))*100

A higher X gives you a value closer to 100.

But this won't guarantee a fair or correct distribution; that's up to the algorithm you use to decide the actual X value.
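A quick sketch of that squashing function, using some of the question's raw scores:

```python
def squash(x: float) -> float:
    """Map any positive score onto 0-100; monotonic and asymptotic to 100."""
    return (1 - 1 / (1 + x)) * 100

for x in (0, 1, 10, 40, 244):
    print(x, round(squash(x), 1))  # 0.0, 50.0, 90.9, 97.6, 99.6
```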

冷弦 2024-12-09 14:08:40

your_sum / (max_score_per_word * num_words) * 100

Should work. But you'll get very small scores most of the time, since few of the page's words will match those that have a non-zero score. Nonetheless, I don't see an alternative, and getting small scores is not a bad thing: you will be comparing scores between webpages. Try many different webpages and you can figure out what a "high score" is in your system.
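A direct sketch of this normalization, reusing the hypothetical word scores from the question:

```python
def normalized_score(word_scores: dict, page_words: list) -> float:
    """your_sum / (max_score_per_word * num_words) * 100"""
    if not page_words:
        return 0.0
    total = sum(word_scores.get(word, 0) for word in page_words)
    max_per_word = max(word_scores.values())
    return total / (max_per_word * len(page_words)) * 100

scores = {"movie": 10, "cinema": 6, "actor": 5, "hollywood": 4}
words = "the best movie ever about a hollywood actor".split()
print(normalized_score(scores, words))  # 23.75; most real pages score far lower
```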

眼藏柔 2024-12-09 14:08:40


Check out this blog post on classifying webpages by topic; it discusses how to implement something that relates closely to your requirements. How do you define relevance in your scenario? No matter what weights you apply to the different inputs, you will still be choosing a somewhat arbitrary value. Once you've cleaned the raw data, you would be better served by applying machine learning to generate a classifier for you. This is difficult if relevance is a scalar value, but it's trivial if it's a boolean (i.e. a page is or isn't relevant to a particular movie, for example).
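As a toy sketch of that boolean-classifier idea (scikit-learn is my choice here, and the training data is invented purely for illustration):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Each sample concatenates the non-body fields (title, URL path, keywords);
# the label says whether the page is about movies.
pages = [
    "movie reviews /cinema/reviews movie, cinema, actor",
    "hollywood news /news/actors actor, film, premiere",
    "cake recipes /baking/cakes recipe, oven, flour",
    "gardening tips /garden/roses plants, soil, pruning",
]
labels = [1, 1, 0, 0]

classifier = make_pipeline(CountVectorizer(), LogisticRegression())
classifier.fit(pages, labels)

# predict_proba yields a 0-1 "relevance" probability for unseen pages.
print(classifier.predict_proba(["new movie trailer /films/trailers cinema"])[:, 1])
```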
