Translation between different tokenizers

Posted 2025-02-07 10:15:52

Sorry if this question is too basic to be asked here. I tried, but I couldn't find a solution.

I'm now working on an NLP project that requires using two different models (BART for summarization and BERT for QA). I want to concatenate the two models so that I can train them at the same time instead of separately. However, the two models use different tokenizers. Is there a way to create a layer that maps the output of the first model to the input of the second? Would creating a dictionary by hand (mapping every single available value of the tokenizer) work? Thanks a lot!

Comments (2)

牛↙奶布丁 2025-02-14 10:15:52

It's not clear whether you are asking "How do I translate one tokenizer's output into another's?", "How do I use the same tokenizer for both models?", or "How do I chain one model after the other for training?", so I will answer covering all three cases.

Before giving a direct answer to your question, let's have a look at your ML model.

I think this is the model you want to build (correct me in the comment section if I am wrong):

                                                         |
                                                         |
                                                         Q
                                                         |
         ,-------------------,                    ,-------------,
--text-->|  BART (summarize) |--summarized text-->|  BERT (QA)  |---> A
         `-------------------`                    `-------------`

OK, this will be fairly easy if you have a basic idea of tokenizers and machine-learning pipelines.

I assume you are using the Hugging Face library (even if not, there won't be a significant difference). When we train an NLP model, we have to tokenize the text first.

What do tokenizers do?

Actually, a tokenizer has also gone through a training process. It learns how to break sentences and words into chunks, and then automatically builds a mapping for each distinct chunk: chunk --> digit/array.

Let's take the first case: translating one tokenizer's output into another's.

The short answer is no. As I mentioned above, tokenizers are themselves trained. Each one splits sentences/words according to its own rules and assigns numbers according to its own mapping. The same sentence/word can be broken into a different number of chunks, at different places, by different tokenizers. So it's not possible to do something like looking up the French meaning of an English word in a dictionary.
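For example, a minimal sketch (assuming the Hugging Face transformers library and the common checkpoints facebook/bart-base and bert-base-uncased; the chunks described in the comments are only indicative) shows the mismatch:

    from transformers import AutoTokenizer

    bart_tok = AutoTokenizer.from_pretrained("facebook/bart-base")   # byte-level BPE tokenizer
    bert_tok = AutoTokenizer.from_pretrained("bert-base-uncased")    # WordPiece tokenizer

    text = "Tokenizers split text differently."

    # Same sentence, different chunks: BPE marks spaces with 'Ġ',
    # WordPiece marks word continuations with '##'.
    print(bart_tok.tokenize(text))
    print(bert_tok.tokenize(text))

    # Same sentence, completely unrelated id sequences
    # (different lengths, drawn from different vocabularies).
    print(bart_tok(text)["input_ids"])
    print(bert_tok(text)["input_ids"])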

The second case: can you use the same tokenizer for both models?

When training a transformer (BART/BERT or any transformer derivative), we pass the tokenizer's result into it, because transformers only accept vectors/tensors/matrices, not strings. The transformer is then trained on that input. So you must remember six things:

  1. The transformer's output/training depends on its input.
  2. The input depends on the tokenizer's output.
  3. So, the transformer's output/training depends on the tokenizer.
  4. Each tokenizer has a different mapping (the output differs for the same text).
  5. Each tokenizer has a different output vector (vocabulary) size.
  6. So once a transformer has been trained with a specific tokenizer, it can only use that tokenizer.
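
To make points 4-6 concrete, here is a minimal sketch (again assuming the Hugging Face transformers library): a pretrained model's embedding table has one row per id in its own tokenizer's vocabulary, so ids from another tokenizer would index the wrong rows or fall out of range.

    from transformers import AutoModel, AutoTokenizer

    bart_tok = AutoTokenizer.from_pretrained("facebook/bart-base")
    bert_tok = AutoTokenizer.from_pretrained("bert-base-uncased")
    bert = AutoModel.from_pretrained("bert-base-uncased")

    print(len(bart_tok))  # BART vocabulary size (roughly 50k entries)
    print(len(bert_tok))  # BERT vocabulary size (roughly 30k entries)

    # BERT's input embedding matrix has exactly one row per BERT token id,
    # so it cannot interpret ids produced by the BART tokenizer.
    print(bert.get_input_embeddings().num_embeddings)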

Whether you can use the same tokenizer depends on whether you are using pre-trained BART and BERT or training them from scratch. If you use the pretrained models, you have to use the specific tokenizer each one was trained with (if you are using Hugging Face models, the name of the compatible tokenizer is given). Otherwise you can use the same tokenizer without any problem, as long as both transformers have an input size matching the tokenizer's output vector. You just have to keep using that same tokenizer throughout the transformers' training sessions; after that you can't switch to other tokenizers.
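
If you do train from scratch with a shared tokenizer, that mainly means building both models with a matching vocabulary size. A rough sketch (assuming the Hugging Face transformers library; picking BERT's tokenizer as the shared one is just an example, and all other configuration values are left at their defaults):

    from transformers import (AutoTokenizer, BartConfig, BartForConditionalGeneration,
                              BertConfig, BertForQuestionAnswering)

    # One tokenizer shared by both (untrained) models.
    shared_tok = AutoTokenizer.from_pretrained("bert-base-uncased")

    # Align BART's special-token ids with the shared tokenizer's,
    # and size both embedding tables for the shared vocabulary.
    bart = BartForConditionalGeneration(BartConfig(
        vocab_size=len(shared_tok),
        pad_token_id=shared_tok.pad_token_id,
        bos_token_id=shared_tok.cls_token_id,
        eos_token_id=shared_tok.sep_token_id,
        decoder_start_token_id=shared_tok.cls_token_id,
    ))
    bert = BertForQuestionAnswering(BertConfig(vocab_size=len(shared_tok)))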

Let's move to the third case: chaining one model after the other for training.

Of course you can train both at once, but you have to build an ML pipeline first. It's not very difficult, though you do need to learn how to build pipelines. Many libraries provide facilities to build pipelines easily.
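
As a rough sketch of what the chain looks like at inference time (assuming the Hugging Face pipeline API; the checkpoint names are just common examples), note that each stage keeps its own tokenizer internally and only plain text crosses the boundary between them:

    from transformers import pipeline

    # Stage 1: BART summarizer (tokenizes with BART's own tokenizer internally).
    summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

    # Stage 2: BERT extractive QA model (tokenizes with its own tokenizer internally).
    qa = pipeline("question-answering", model="deepset/bert-base-cased-squad2")

    def answer_from_document(document: str, question: str) -> str:
        # Only decoded text is passed between the two models,
        # so no token-id translation between tokenizers is needed.
        summary = summarizer(document, max_length=128, min_length=30)[0]["summary_text"]
        return qa(question=question, context=summary)["answer"]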

瑾兮 2025-02-14 10:15:52

I had a similar question to yours when I wanted to search the prompt with BERT and feed the results to LLaMA, which has a totally different tokenizer. You can check out this blog post: https://tamuhey.github.io/tokenizations/ and the GitHub repo https://github.com/explosion/tokenizations

Using tokenizations.get_alignments(tokens_a, tokens_b), you can get a mapping between different tokenizers, and there is no need to build a dict from scratch. I hope this helps!
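
For example, a small sketch (assuming the package installs as pytokenizations and imports as tokenizations, and reusing the BART and BERT tokenizers from the question):

    import tokenizations  # pip install pytokenizations
    from transformers import AutoTokenizer

    bart_tok = AutoTokenizer.from_pretrained("facebook/bart-base")
    bert_tok = AutoTokenizer.from_pretrained("bert-base-uncased")

    text = "Tokenizers can be aligned after the fact."
    tokens_a = bart_tok.tokenize(text)
    tokens_b = bert_tok.tokenize(text)

    # a2b[i] lists the positions in tokens_b that overlap token i of tokens_a,
    # and b2a is the reverse mapping.
    a2b, b2a = tokenizations.get_alignments(tokens_a, tokens_b)
    print(list(zip(tokens_a, a2b)))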
