How to get the vocabulary for a corpus or query from SentenceTransformer?
I am trying a SentenceTransformer model from SBERT.net, and I want to know how it handles entity names. Are they marked as unknown, are they broken down into word-piece tokens, etc.? I want to make sure they are used in the comparison.

However, to do that I would need to see the vocab it builds for the query, and perhaps even convert an embedding back to text.

Looking at the API, it's not obvious to me how to do that.
Here is a quick example from their docs:
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")

# Corpus of sentences to search over
corpus = [
    "A man is eating food.",
    "A man is eating a piece of bread.",
    "The girl is carrying a baby.",
]
corpus_embeddings = embedder.encode(corpus, convert_to_tensor=True)

# Query sentences:
queries = [
    "A man is eating pasta.",
]

# Find the closest sentences of the corpus for each query
top_k = min(5, len(corpus))
for query in queries:
    query_embedding = embedder.encode(query, convert_to_tensor=True)
    ...
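For context, the docs' example goes on to score each corpus sentence against the query; a minimal sketch of that step, using util.cos_sim from sentence_transformers (the top-k printing is my own illustration, not the question's code):

import torch
from sentence_transformers import util

# Rank corpus sentences by cosine similarity to the query.
cos_scores = util.cos_sim(query_embedding, corpus_embeddings)[0]
scores, indices = torch.topk(cos_scores, k=top_k)
for score, idx in zip(scores, indices):
    print(corpus[idx], f"(score: {score:.4f})")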
Comments (1)
Your SentenceTransformer model is actually wrapping and using a tokenizer from Hugging Face's transformers library under the hood. You can access it as the .tokenizer attribute of your model. The typical behaviour of such a tokenizer is to break unknown words down into word-piece tokens. At this point, we can go on and check that this is indeed what it does, as it is relatively straightforward:
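A minimal sketch of such a check, reusing the all-MiniLM-L6-v2 model from the question (the example sentence is my own, and the exact word pieces depend on the checkpoint's vocabulary):

from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")
tokenizer = embedder.tokenizer  # the underlying Hugging Face tokenizer

# The vocabulary is fixed by the pretrained checkpoint; it is not
# built per corpus or per query.
vocab = tokenizer.get_vocab()  # dict mapping token string -> token id
print(len(vocab))              # e.g. 30522 for BERT-style checkpoints

# An entity name missing from the vocabulary is split into word pieces
# (continuations are prefixed with "##") rather than mapped to [UNK]:
print(tokenizer.tokenize("Muiriel is carrying a baby."))
# Something like: ['mu', '##iri', '##el', 'is', 'carrying', 'a', 'baby', '.']

So entity names are not collapsed to a single unknown token; their word pieces still contribute to the sentence embedding and hence to the comparison.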