Why does the tokenizer break down words that exist in its vocabulary?

Posted 2025-01-25 05:24:10


In my understanding, given a word, the tokeniser will break it down into sub-words only if the word is not present in tokeniser.get_vocab():

from transformers import AutoTokenizer

def checkModel(model):
    tokenizer = AutoTokenizer.from_pretrained(model)

    allList = []
    for word in tokenizer.get_vocab():
        word = word.lower()

        tokens = tokenizer.tokenize(word)
        try:
            # skip subword pieces ('##...') and special tokens ('[...]')
            if word[0] != '#' and word[0] != '[' and tokens[0] != word:
                allList.append((word, tokens))
                print(word, tokens)
        except IndexError:
            # empty word or empty token list
            continue
    return allList

checkModel('bert-base-uncased')
# ideally should return an empty list

However, what I have observed is that some models on huggingface will break down words into smaller pieces even if the word is present in the vocab.

checkModel('emilyalsentzer/Bio_ClinicalBERT')

output: 
welles ['well', '##es']
lexington ['le', '##xing', '##ton']
palestinian ['pale', '##st', '##inian']
...
elisabeth ['el', '##isa', '##beth']
alexander ['ale', '##xa', '##nder']
appalachian ['app', '##ala', '##chia', '##n']
mitchell ['mit', '##chel', '##l']
...
  
4630 tokens in the vocab got broken down, which is not supposed to happen

I have checked a few models that show this behaviour, and was wondering why this is happening?


Comments (2)

瑕疵 2025-02-01 05:24:10


This is a really interesting question, and I am currently wondering whether it should be filed as a bug report on the Hugging Face repo.

EDIT: I realized that it is possible to define a model-specific tokenizer_config.json file to override the default behavior. One example is the bert-base-cased repository, which has the following content in its tokenizer config:

{
  "do_lower_case": false
}

Given that this functionality is available, I think the best option would be to contact the original author of the work and ask them to potentially consider this configuration (if appropriate for the general use case).
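
If changing the upstream repository is not an option, the same override can also be applied and persisted locally. A minimal sketch (the local directory name is just an example, and the exact files written by save_pretrained may vary by transformers version):

from transformers import AutoTokenizer

# Load with the cased behaviour forced on, then save a local copy so the
# override lands in a local tokenizer_config.json for future loads.
tokenizer = AutoTokenizer.from_pretrained("emilyalsentzer/Bio_ClinicalBERT",
                                          do_lower_case=False)
tokenizer.save_pretrained("./Bio_ClinicalBERT_cased")  # example path

reloaded = AutoTokenizer.from_pretrained("./Bio_ClinicalBERT_cased")
print(reloaded.do_lower_case)
# Expected output: False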

Original Answer:

As it turns out, the vocabulary word that you are checking for is welles, yet the vocab file itself only contains Welles. Notice the difference in the uppercased first letter?
It turns out you can manually force the tokenizer to specifically check for cased vocabulary words, in which case it works fine.

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("emilyalsentzer/Bio_ClinicalBERT",
                                          do_lower_case=False)  # This is different
print(tokenizer.do_lower_case)
# Output: False

# Lowercase input will result in split word
tokenizer.convert_ids_to_tokens(tokenizer("welles")["input_ids"])
# Output: ['[CLS]', 'well', '##es', '[SEP]']

# Uppercase input will correctly *not split* the word
tokenizer.convert_ids_to_tokens(tokenizer("Welles")["input_ids"])
# Output: ['[CLS]', 'Welles', '[SEP]']

By default, however, this is not the case; all words are converted to lowercase, which is why you cannot find the word:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("emilyalsentzer/Bio_ClinicalBERT")

# By default, lowercasing is enabled!
print(tokenizer.do_lower_case)
# Output: True

# This time we get the same (lowercased) output both times!
tokenizer.convert_ids_to_tokens(tokenizer("welles")["input_ids"])
# Output: ['[CLS]', 'well', '##es', '[SEP]']
tokenizer.convert_ids_to_tokens(tokenizer("Welles")["input_ids"])
# Output: ['[CLS]', 'well', '##es', '[SEP]']
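
To double-check the casing explanation against the raw vocabulary, here is a small sketch; the expected prints follow from the claim above that the vocab file contains Welles but not welles:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("emilyalsentzer/Bio_ClinicalBERT")
vocab = tokenizer.get_vocab()

print("Welles" in vocab)
# Expected output: True  (the cased form is a vocabulary entry)
print("welles" in vocab)
# Expected output: False (the lowercased form is not)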

岁月流歌 2025-02-01 05:24:10


The tokenizer you are calling, 'emilyalsentzer/Bio_ClinicalBERT', has tokens that are not present in the original base tokenizer. To add tokens to a tokenizer, one can provide either a list of strings or a list of tokenizers.AddedToken objects.

The default behavior in both cases is to allow new words to be used as subwords. In my example, if we add 'director' and 'cto' to the tokenizer, then 'director' can be broken down into 'dire' + 'cto' + 'r' ('dire' and 'r' are part of the original tokenizer). To avoid this, one should use:

tokenizer.add_tokens([tokenizers.AddedToken(new_word, single_word=True) for new_word in new_words])  # requires: import tokenizers

I do think a lot of users would simply use a list of strings (as I did, until half an hour ago). But this would lead to the problem that you saw.
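
As a rough illustration of the difference (a sketch only; 'cto' and 'director' are the example words from above, and the exact pieces you get back depend on the base vocabulary):

from transformers import AutoTokenizer
from tokenizers import AddedToken

# Plain string: the added token may also match inside other words.
tok_plain = AutoTokenizer.from_pretrained("bert-base-uncased")
tok_plain.add_tokens(["cto"])
print(tok_plain.tokenize("director"))   # may split, e.g. ['dire', 'cto', 'r']

# AddedToken with single_word=True: only matches as a standalone word.
tok_single = AutoTokenizer.from_pretrained("bert-base-uncased")
tok_single.add_tokens([AddedToken("cto", single_word=True)])
print(tok_single.tokenize("director"))  # expected to stay intact, e.g. ['director']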

To change this for a customized tokenizer (like 'emilyalsentzer/Bio_ClinicalBERT') without losing much in model performance, I'd recommend extracting the set of words from this tokenizer and comparing it to its base tokenizer (for example 'bert-base-uncased'). This will give you the set of words that were added to the base tokenizer as part of model re-training. Then take the base tokenizer and add these new words to it using AddedToken with single_word set to True. Replace the custom tokenizer with this new tokenizer.
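
A minimal sketch of that procedure (assuming 'bert-base-uncased' is indeed the right base tokenizer to diff against, which is worth verifying for this model; the save path is just an example):

from transformers import AutoTokenizer
from tokenizers import AddedToken

custom = AutoTokenizer.from_pretrained("emilyalsentzer/Bio_ClinicalBERT")
base = AutoTokenizer.from_pretrained("bert-base-uncased")  # assumed base tokenizer

# Whole words present in the custom vocab but not in the base vocab;
# subword pieces ('##...') and special tokens ('[...]') are skipped.
added_words = {
    w for w in set(custom.get_vocab()) - set(base.get_vocab())
    if not w.startswith("##") and not w.startswith("[")
}

# Re-add them so they can only ever match as whole words.
base.add_tokens([AddedToken(w, single_word=True) for w in added_words])

# Use `base` in place of the custom tokenizer from here on.
base.save_pretrained("./Bio_ClinicalBERT_whole_word")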
