Is there a machine learning or NLP model that can separate questions from answers in raw text?

Published 2025-02-05 14:26:58

I have some raw text that has questions and answers in it. I would like to identify which parts of the text are questions and which parts are the answers. This seems like it would be easy, but the questions aren't necessarily terminated with question marks. The only thing I know for sure is that after a question is over the answer begins, and after the answer is over another question begins, but there is no consistent format on how many \n are included in the answers. A question is definitely its own paragraph though.
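Since each question is guaranteed to be its own paragraph, a reasonable first step, independent of any model, is to split the raw text into paragraphs on blank lines. A minimal sketch (the variable names are illustrative):

```python
import re

raw_text = """What is the name of the company?

We are Acme Inc.

How many employees are there.

There are 50 employees."""

# Split on runs of blank lines; every question is exactly one of these
# paragraphs, while an answer may span several consecutive paragraphs.
paragraphs = [p.strip() for p in re.split(r"\n\s*\n", raw_text) if p.strip()]
print(paragraphs)
```

Classifying each paragraph as question vs. answer is then the hard part.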

I'm hoping for some sort of pre-trained model for this?

One possibility would be to take some existing data, manually tag each paragraph as q vs a and then use google's universal sentence encoder for each paragraph to get the 512 dimension output and then use that as the input to train a neural net or some other classification model on the labeled data. I'm hoping to avoid this path because I don't want to manually tag a few thousand paragraphs, and after all that work, who knows if the model will have a decent classification error.
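As a rough illustration of that pipeline, here is a runnable sketch in which `use_embed` is only a stand-in for the Universal Sentence Encoder (it returns random 512-dimensional vectors so the example runs without TensorFlow Hub), and a nearest-centroid rule stands in for the trained classifier:

```python
import numpy as np

rng = np.random.default_rng(0)

def use_embed(paragraphs):
    """Stand-in for Google's Universal Sentence Encoder.

    In practice you would load the real encoder (e.g. from TensorFlow Hub)
    and get one 512-dimensional vector per paragraph. Here we fake it with
    random vectors so the rest of the pipeline is runnable.
    """
    return rng.normal(size=(len(paragraphs), 512))

# Manually tagged paragraphs: 1 = question, 0 = answer.
paragraphs = ["What is the name of the company?", "We are Acme Inc.",
              "How many employees are there.", "There are 50 employees."]
labels = np.array([1, 0, 1, 0])

X = use_embed(paragraphs)

# A nearest-centroid classifier is the simplest possible model on top of
# the embeddings; any classifier (logistic regression, a small MLP) works.
centroids = np.stack([X[labels == c].mean(axis=0) for c in (0, 1)])

def predict(vectors):
    dists = np.linalg.norm(vectors[:, None, :] - centroids[None, :, :], axis=2)
    return dists.argmin(axis=1)

print(predict(X))
```

In practice you would swap `use_embed` for the real encoder and fit the classifier on your tagged paragraphs; the structure of the pipeline stays the same.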

Another possibility is to use something like GPT-3: feed it the entire text and just ask it which parts are the questions/requests. The problem with this is that the GPT-3 API is still a bit sandboxed. I tried a sample in the GPT-3 playground and it only identified 80% of the questions.

Any other suggestions?

To give you an idea, the text may look like this:

What is the name of the company?

We are Acme Inc.

How many employees are there.

There are 50 employees.

Describe a day in the life of an employee.

An employee arrives at 9am.

Then they go to the factory and make widgets for 4 hours. After making widgets they eat lunch and then go to the QA engineer to make sure their widgets are good enough.

After QA, they write a report about how many widgets they made.

Most employees leave around 5pm.

List the pay range of your employees.

The starting salary is $22/hour.

After 1 year pay increases to $25 an hour and then increases 3% per year.

Contact information:

Acme Inc

123 Main Street

Anyplace, USA


长亭外,古道边 2025-02-12 14:26:58


According to the description and the text sample that you provided, I would split this problem into 2 parts:

  1. How to split the whole text
  2. How to "classify" which sentence (or paragraph) is a question or an answer

I tried solving this problem using a heuristics based approach with spacy (you can use other libraries).
You can just use this technique directly or use it to build a weakly supervised dataset that you can train a classification model with (try skweak).

Sentence Detection

This is the easy part, all you have to do is follow the details in this link https://spacy.io/usage/linguistic-features#sbd

import spacy

nlp = spacy.load('en_core_web_sm')
doc = nlp("Hi, I'm sentence number 1. Hi, I'm sentence number 2.")

for sent in doc.sents:
    print(sent.text)

# Hi, I'm sentence number 1.
# Hi, I'm sentence number 2.

Question or Answer

From the sample that you shared, I can see that you want to detect questions and also imperative phrases:

  • Questions: How many employees are there. To detect this type of question, you can use spaCy's tag attribute and look for WH tokens (who, where, etc.). You could use the same logic to find subject-verb inversions, for example.
  • Imperative phrases: List the pay range of your employees. To detect these cases, you can check whether the first token of the sentence is a verb.

Here's a small example that you can follow:

def is_question(sent):
    d = nlp(sent)
    token = d[0]  # first token of the sentence
    if token.pos_ == "VERB" and token.dep_ == "ROOT":  # imperative: starts with a root verb
        return True
    for token in d:  # look for WH tokens anywhere in the sentence
        if token.tag_ in ("WDT", "WP", "WP$", "WRB"):
            return True
    return False

doc = nlp(text)  # `text` is your raw Q&A text

for sent in doc.sents:
    print(sent.text.strip())
    if is_question(sent.text.strip()):
        print("is question")
    else:
        print("not a question")
    print("***")

# what is the name of the company?
# is question
# ***
# We are Acme Inc.
# not a question
# ***
# How many employees are there.
# is question
# ***
# There are 50 employees.
# not a question

You can apply this function to a large corpus and get a weakly annotated dataset that you can use to train a classifier, or you can just use the function as-is. But...
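One more signal worth exploiting: since the text strictly alternates question → answer, everything between one detected question and the next must belong to the answer. A small sketch of that grouping step (`segment` and the toy heuristic are illustrative names, not part of spaCy):

```python
def segment(paragraphs, is_question):
    """Group paragraphs into (question, [answer paragraphs]) pairs,
    using the fact that every answer runs until the next question."""
    pairs, current_q, current_a = [], None, []
    for p in paragraphs:
        if is_question(p):
            if current_q is not None:
                pairs.append((current_q, current_a))
            current_q, current_a = p, []
        elif current_q is not None:
            current_a.append(p)
    if current_q is not None:
        pairs.append((current_q, current_a))
    return pairs

# Toy check with a trivial heuristic standing in for the spaCy one:
demo = ["Describe a day in the life of an employee.",
        "An employee arrives at 9am.",
        "Most employees leave around 5pm.",
        "List the pay range of your employees.",
        "The starting salary is $22/hour."]
print(segment(demo, lambda p: p.split()[0] in {"Describe", "List"}))
```

This also means a single misclassified paragraph only corrupts one Q&A pair rather than the whole segmentation.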

Beware!!

This is a heuristics-based approach, so not all of the results will be correct. For example, What a beautiful day! starts with a WH token but is not a question. You could fix this by checking whether the sentence ends with a question mark, but in your corpus questions don't always end with one.
A possible solution would be to apply the heuristic to a corpus and then manually filter out these outliers.
