Sentence transformers using BoW?
I have a collection of terms that appear on, or are somehow related to, web pages (e.g. keywords from the HTML tags). These are not sentences; they are just collections of keywords, words in a title, etc. Given one such web page, I want to find the pages most similar to it. If I had sentences or paragraphs, I would think of using a sentence transformer or even something like Doc2vec. But in this case I only have the set of words for each page, with no real context or sentences. Am I correct that this precludes me from using a sentence transformer / Doc2vec?
Comments (1)
Nothing precludes you from using anything. The relevant test is: does using it work, for your unique data & goals?
Doc2Vec and other shallow techniques work fine on things like lists-of-keywords that aren't perfect grammatical sentences: they're generally using the presence or absence of words, without rigorous grammatical understanding, as signals. And that's plenty for many purposes! Some deeper transformers might rely more on word order and on coherent natural-language utterances, but I wouldn't be sure of that until it was tried and shown lacking. It might work! And no one with only the vaguest sketch (from your question) of your data & goals can give you hints better than your own experiments.
Try things, including super-simple things like cosine similarity on bag-of-words representations, or keyword searches based on some measure of the most significant terms, then evaluate the results according to your needs/desired results.
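As a rough illustration of the "super-simple" end of that spectrum, here is a minimal sketch of cosine similarity over raw bag-of-words counts, using only the Python standard library. The page IDs and keyword lists are made up for illustration, and the helper names `cosine_similarity` and `most_similar` are hypothetical; real data would likely want some term weighting (e.g. TF-IDF) on top.

```python
import math
from collections import Counter

def cosine_similarity(words_a, words_b):
    """Cosine similarity between two keyword lists, using raw term counts."""
    a, b = Counter(words_a), Counter(words_b)
    # Dot product over the terms of a; terms missing from b count as 0.
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty keyword list is similar to nothing
    return dot / (norm_a * norm_b)

def most_similar(query_id, pages):
    """Rank all other pages by similarity to pages[query_id], best first."""
    query = pages[query_id]
    scores = [(other_id, cosine_similarity(query, words))
              for other_id, words in pages.items() if other_id != query_id]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)

# Hypothetical keyword sets, invented for this example.
pages = {
    "page_a": ["python", "pandas", "dataframe", "tutorial"],
    "page_b": ["python", "numpy", "array", "tutorial"],
    "page_c": ["recipe", "pasta", "tomato", "dinner"],
}

for other_id, score in most_similar("page_a", pages):
    print(f"{other_id}: {score:.2f}")
```

With these toy pages, `page_b` shares two of four keywords with `page_a` and ranks first; `page_c` shares none and scores zero. Nothing here needs sentences or word order, only which words are present.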
You might start some evaluations via ad-hoc eyeballing – "this seems good, this seems wrong" – but would ideally record judgements of which docs "should" be more-similar than others, in your desired end-system, so that eventually you can run an automatic, quantitative comparison of alternate approaches.
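One possible way to record such judgements, sketched under my own assumptions, is as triplets `(anchor, closer, farther)` meaning "the `closer` page should score as more similar to `anchor` than the `farther` page does". Any candidate similarity function can then be scored by the fraction of triplets it gets right. The function names, pages, and judgements below are all hypothetical; the two similarity functions are stand-ins for whatever approaches you are actually comparing.

```python
def jaccard(words_a, words_b):
    """Shared keywords divided by all distinct keywords across both pages."""
    a, b = set(words_a), set(words_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def overlap_count(words_a, words_b):
    """Raw number of shared keywords, with no length normalisation."""
    return float(len(set(words_a) & set(words_b)))

def triplet_accuracy(similarity, pages, judgements):
    """Fraction of (anchor, closer, farther) judgements the similarity respects."""
    if not judgements:
        raise ValueError("need at least one judgement")
    correct = 0
    for anchor, closer, farther in judgements:
        if similarity(pages[anchor], pages[closer]) > similarity(pages[anchor], pages[farther]):
            correct += 1
    return correct / len(judgements)

# Hypothetical pages and hand-recorded judgements, invented for this example.
pages = {
    "a": ["python", "pandas", "tutorial"],
    "b": ["python", "pandas"],
    "c": ["python", "pandas", "tutorial", "news", "sports",
          "weather", "travel", "recipes", "movies"],
    "d": ["recipes", "pasta"],
}
judgements = [
    ("a", "b", "c"),  # a focused page should beat a catch-all portal page
    ("a", "b", "d"),  # a related page should beat an unrelated one
]

for name, fn in [("jaccard", jaccard), ("overlap_count", overlap_count)]:
    print(f"{name}: {triplet_accuracy(fn, pages, judgements):.2f}")
```

On this toy data the length-normalised Jaccard score satisfies both judgements, while the raw overlap count prefers the catch-all page `c` and satisfies only one. The numbers mean nothing beyond the example; the point is that once judgements are written down, swapping in Doc2Vec, a sentence transformer, or anything else becomes a one-line change with a comparable score.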