How to extract BILUO tags using spaCy when entities conflict?
I am working on a Kaggle dataset and trying to extract BILUO entities using spaCy's
spacy.training.offsets_to_biluo_tags
function. The original data is in CSV format, which I have converted into the JSON format below:
{
"entities": [
{
"feature_text": "Lack-of-other-thyroid-symptoms",
"location": "['564 566;588 600', '564 566;602 609', '564 566;632 633', '564 566;634 635']"
},
{
"feature_text": "anxious-OR-nervous",
"location": "['13 24', '454 465']"
},
{
"feature_text": "Lack of Sleep",
"location": "['289 314']"
},
{
"feature_text": "Insomnia",
"location": "['289 314']"
},
{
"feature_text": "Female",
"location": "['6 7']"
},
{
"feature_text": "45-year",
"location": "['0 5']"
}
],
"pn_history": "45 yo F. CC: nervousness x 3 weeks. Increased stress at work. Change in role from researcher to lecturer. Also many responsibilities at home, caring for elderly mother and in-laws, and 17 and 19 yo sons. Noticed decreased appetite, but forces herself to eat 3 meals a day. Associated with difficulty falling asleep (duration 30 to 60 min), but attaining full 7 hours with no interruptions, no early morning awakenings. Also decreased libido for 2 weeks. Nervousness worsened on Sunday and Monday when preparing for lectures for the week. \r\nROS: no recent illness, no headache, dizziness, palpitations, tremors, chest pain, SOB, n/v/d/c, pain\r\nPMH: none, no pasMeds: none, Past hosp/surgeries: 2 vaginal births no complications, FHx: no pysch hx, father passed from acute MI at age 65 yo, no thyroid disease\r\nLMP: 1 week ago \r\nSHx: English literature professor, no smoking, occasional EtOH, no ilicit drug use, sexually active."
}
In the JSON, the entities part contains each feature text and its location(s) in the text, and the pn_history part contains the entire text document.
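For context, each location value is the string form of a Python list, so the plain (semicolon-free) cases can be converted into integer offsets with a small stdlib-only sketch like this:

```python
import ast

# Each "location" value is the string representation of a Python
# list, so ast.literal_eval recovers it as a real list of strings.
location = "['13 24', '454 465']"

offsets = []
for loc in ast.literal_eval(location):
    start, end = loc.split()  # e.g. "13 24" -> ("13", "24")
    offsets.append((int(start), int(end)))

print(offsets)
# [(13, 24), (454, 465)]
```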
The first problem I have is that the dataset contains instances where a single text span is tagged with more than one entity. For instance, the text at position [289 314] belongs to two different entities, 'Insomnia' and 'Lack of Sleep'. While processing this kind of instance, spaCy raises:
ValueError: [E103] Trying to set conflicting doc.ents while creating custom NER
The second problem I have is that in some cases the start and end positions are stated plainly, for instance [13 24], but in other cases the indices are split up. For example, '564 566;588 600' contains a semicolon: it means picking the first word(s) from positions 564 566 and the second set of word(s) from positions 588 600. I cannot pass these kinds of indices to the spaCy function.
Please advise how I can solve these problems.
OK, it sounds like you have two separate problems.
Overlapping entities. You'll need to decide what to do with these and filter your data; spaCy won't handle this for you automatically. It's up to you to decide what counts as "correct". Usually you would want the longest entities. You could also use the recently released spancat, which is like NER but can handle overlapping annotations.
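As a minimal sketch of the filtering approach (the short example text and offsets here are made up for illustration; spaCy v3 assumed), spacy.util.filter_spans keeps the longest span and drops overlapping ones:

```python
import spacy
from spacy.util import filter_spans

nlp = spacy.blank("en")
doc = nlp("Noticed decreased appetite and difficulty falling asleep")

# Two annotations over the exact same characters, like the
# 'Insomnia' / 'Lack of Sleep' case from the question.
spans = [
    doc.char_span(31, 56, label="Insomnia"),
    doc.char_span(31, 56, label="Lack of Sleep"),
]
spans = [s for s in spans if s is not None]  # drop misaligned spans

# filter_spans keeps the longest span (the first one on ties), so
# assigning doc.ents can no longer raise E103.
doc.ents = filter_spans(spans)
print([(e.text, e.label_) for e in doc.ents])
# [('difficulty falling asleep', 'Insomnia')]
```

If you go the spancat route instead, the overlapping spans can be stored as-is with `doc.spans["sc"] = spans`, since span groups, unlike doc.ents, allow overlaps.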
Discontinuous entities. These are your annotations containing a ';'. These are harder: spaCy has no way to handle them at the moment (and in my experience, few systems handle discontinuous entities). Here's the example annotation from your sample: '564 566;588 600' covers "no" (564 566) and "palpitations" (588 600), i.e. "no ... palpitations", for the feature "Lack-of-other-thyroid-symptoms". Sometimes with discontinuous entities you can just include the middle part, but that won't work here. I don't think there's any good way to translate this into spaCy, because your input tag is "lack of other thyroid symptoms". Usually I would model this as "thyroid symptoms" and handle negation separately; in this case that means you could just tag "palpitations".
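A sketch of that workaround (the label THYROID_SYMPTOM, the helper name, and the shortened example text are made up for illustration): split each location on ';' into fragments, then tag only the symptom fragment and pass plain (start, end, label) offsets to offsets_to_biluo_tags:

```python
import ast
import spacy
from spacy.training import offsets_to_biluo_tags

nlp = spacy.blank("en")

def expand_location(location_str):
    """Turn a location string into (start, end) fragments,
    splitting discontinuous annotations on ';'."""
    fragments = []
    for loc in ast.literal_eval(location_str):
        for part in loc.split(";"):
            start, end = part.split()
            fragments.append((int(start), int(end)))
    return fragments

# Shortened example text; "no" + "palpitations" is the
# discontinuous annotation, here at offsets 0 2;24 36.
text = "no headache, dizziness, palpitations"
frags = expand_location("['0 2;24 36']")

# Keep only the last fragment: model the symptom itself and
# handle the negation ("no") in a separate step.
start, end = frags[-1]
tags = offsets_to_biluo_tags(nlp(text), [(start, end, "THYROID_SYMPTOM")])
print(tags)
# ['O', 'O', 'O', 'O', 'O', 'U-THYROID_SYMPTOM']
```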