How do I get the SentenceTransformer vocabulary for a corpus or query?

Posted 2025-02-06 18:07:13


I am trying the SentenceTransformer model from SBERT.net, and I want to know how it handles entity names. Are they marked as unknown? Are they broken down into sub-word tokens? I want to make sure they are used in the comparison.

However, to do that I would need to see the vocabulary it builds for the query, and perhaps even convert an embedding back to text.

Looking at the API, it's not obvious to me how to do that.

Here is a quick example from their docs:

from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")

corpus = [
    "A man is eating food.",
    "A man is eating a piece of bread.",
    "The girl is carrying a baby."
]
corpus_embeddings = embedder.encode(corpus, convert_to_tensor=True)

# Query sentences:
queries = [
    "A man is eating pasta."
]

top_k = min(5, len(corpus))
for query in queries:
    query_embedding = embedder.encode(query, convert_to_tensor=True)
    ...


Answer from 再见回来, posted 2025-02-13 18:07:13:


Your SentenceTransformer model is actually wrapping and using a tokenizer from Hugging Face's transformers library under the hood. You can access it as the .tokenizer attribute of your model. The typical behaviour of such a tokenizer is to break unknown words down into word-piece tokens. We can check that this is indeed what it does, as it is relatively straightforward:

from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")

corpus = [
    "A man is eating food.",
    "A man is eating a piece of bread.",
    "The girl is carrying a baby."
]

# the tokenizer is just here:
tokenizer = embedder.tokenizer  # BertTokenizerFast

# and the vocabulary itself is there, if needed:
vocab = tokenizer.vocab  # dict of length 30522

# get the split of sentences according to the vocab, for example:
inputs = tokenizer(corpus, padding='longest', truncation=True)
tokens = [e.tokens for e in inputs.encodings]
# tokens contains:
# [
#   ['[CLS]', 'a', 'man', 'is', 'eating', 'food', '.', '[SEP]', '[PAD]', '[PAD]', '[PAD]']
#   ['[CLS]', 'a', 'man', 'is', 'eating', 'a', 'piece', 'of', 'bread', '.', '[SEP]']
#   ['[CLS]', 'the', 'girl', 'is', 'carrying', 'a', 'baby', '.', '[SEP]', '[PAD]', '[PAD]']
# ]

# now let's try with some unknown tokens and see what it does
queries = [
    "Edv Beq is eating pasta."
]
q_inputs = tokenizer(queries, padding='longest', truncation=True)
q_tokens = [e.tokens for e in q_inputs.encodings]
# q_tokens contains:
# [
#   ['[CLS]', 'ed', '##v', 'be', '##q', 'is', 'eating', 'pasta', '.', '[SEP]']
# ]
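As the output shows, the unknown entity name is not mapped to an [UNK] token; it is split into word pieces ('ed', '##v', 'be', '##q') that the model does embed, so it still contributes to the comparison. If you also want to look up words in the vocabulary directly, or map token IDs back to text (a sentence embedding itself cannot be decoded back to text, but the tokenizer's IDs can), here is a minimal sketch, continuing with the tokenizer from above:

# check whether a word is a single known token in the vocabulary
print("pasta" in tokenizer.vocab)  # True: 'pasta' is in the vocab
print("edv" in tokenizer.vocab)    # False: it gets split into word pieces

# round-trip: text -> ids -> tokens -> text
ids = tokenizer("Edv Beq is eating pasta.")["input_ids"]
print(tokenizer.convert_ids_to_tokens(ids))
# ['[CLS]', 'ed', '##v', 'be', '##q', 'is', 'eating', 'pasta', '.', '[SEP]']
print(tokenizer.decode(ids, skip_special_tokens=True))
# 'edv beq is eating pasta.'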