Sentence transformers using BoW?

Posted 2025-01-31 10:10:43 · 199 words · 3 views · 0 comments

I have a collection of terms that appear on, or are otherwise related to, web pages (e.g. keywords from the HTML tags). These are not sentences; they are just collections of keywords, words from a title, etc. Given such a web page, I am interested in finding the pages most similar to it. If I had sentences or paragraphs, I would think of using a sentence transformer or even something like Doc2Vec. But in this case I only have the set of words for a page, with no real context or sentences. Am I correct that this precludes me from using a sentence transformer / Doc2Vec?

Comments (1)

随遇而安 2025-02-07 10:10:43

Nothing precludes you from using anything. The relevant test is: does using it work, for your unique data & goals?

Doc2Vec and other shallow techniques work fine on things like lists-of-keywords that aren't perfect grammatical sentences: they're generally using the presence or absence of words, without rigorous grammatical understanding, as signals. And that's plenty for many purposes!

Some deeper transformers might have more order-dependent reliance on coherent natural-language utterances, but I wouldn't be sure of that until it was tried and shown lacking. It might work! And no one with only the vaguest sketch (from your question) of your data & goals can give you hints better than your own experiments.

Try things, including super-simple things like cosine similarities on bag-of-words representations, or keyword searches based on some measure of the most significant terms, then evaluate the results according to your needs/desired results.
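As a rough illustration of the "super-simple" end of that spectrum, here is a minimal sketch of cosine similarity over raw bag-of-words counts, using only the standard library. The page names and keyword lists are made up for the example, and `cosine_similarity` and `most_similar` are hypothetical helper names, not part of any library.

```python
import math
from collections import Counter


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    # Counter returns 0 for missing terms, so only shared terms contribute.
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty keyword list is similar to nothing
    return dot / (norm_a * norm_b)


# Hypothetical pages, each reduced to its list of keywords (assumed data).
pages = {
    "page_a": ["python", "pandas", "dataframe", "tutorial"],
    "page_b": ["python", "numpy", "array", "tutorial"],
    "page_c": ["recipe", "pasta", "tomato", "dinner"],
}
vectors = {name: Counter(words) for name, words in pages.items()}


def most_similar(query: str) -> list[tuple[str, float]]:
    """Rank every other page by similarity to `query`, best first."""
    scores = [
        (other, cosine_similarity(vectors[query], vectors[other]))
        for other in vectors
        if other != query
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


print(most_similar("page_a"))
```

Raw counts treat every term as equally informative; weighting terms by rarity (for example TF-IDF) is one common next step, but whether it helps is something to check against your own data.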

You might start some evaluations via ad-hoc eyeballing – "this seems good, this seems wrong" – but would ideally record judgements of which docs "should" be more-similar than others, in your desired end-system, so that eventually you can run an automatic, quantitative comparison of alternate approaches.
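One possible way to make such recorded judgements quantitative, sketched under assumptions: store each judgement as a triplet (anchor, doc that should be closer, doc that should be farther) and score each candidate similarity function by the fraction of triplets it orders correctly. The triplet format, the data, and all function names below are illustrative choices, not something prescribed above.

```python
from typing import Callable

# Hypothetical keyword sets per page (assumed data for illustration).
keywords = {
    "page_a": {"python", "pandas", "dataframe", "tutorial"},
    "page_b": {"python", "numpy", "array", "tutorial"},
    "page_c": {"recipe", "pasta", "tomato", "dinner"},
    "page_d": {"recipe", "soup", "dinner", "tutorial"},
}

# Each judgement says: for `anchor`, `closer` should score higher than `farther`.
judgements = [
    ("page_a", "page_b", "page_c"),
    ("page_c", "page_d", "page_a"),
]


def jaccard(a: str, b: str) -> float:
    """Jaccard overlap of two pages' keyword sets."""
    union = keywords[a] | keywords[b]
    return len(keywords[a] & keywords[b]) / len(union) if union else 0.0


def constant(a: str, b: str) -> float:
    """Deliberately useless baseline: every pair looks equally similar."""
    return 0.0


def triplet_accuracy(
    triplets: list[tuple[str, str, str]],
    similarity: Callable[[str, str], float],
) -> float:
    """Fraction of judgements that `similarity` ranks in the desired order."""
    if not triplets:
        return 0.0
    correct = sum(
        1
        for anchor, closer, farther in triplets
        if similarity(anchor, closer) > similarity(anchor, farther)
    )
    return correct / len(triplets)


# Compare alternative approaches on the same recorded judgements.
for name, fn in {"jaccard": jaccard, "constant": constant}.items():
    print(name, triplet_accuracy(judgements, fn))
```

Any other approach (Doc2Vec vectors, a sentence transformer's embeddings) could be plugged in as another `similarity` callable and scored against the same judgements.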
