Text mining: extracting animal types from text

Posted on 2024-10-21 23:35:23

I need to run an experiment and I am new to NLP. I have read books that explain the theory, but when it comes to practice I have found it hard to find a guide. If you know NLP, especially the practical side, please point me in the right direction (useful books, tools, and websites), because I feel lost.

What I am trying to do is take a text and find specific words in it, for example animals such as dogs, cats, etc. Then I need to extract each such word along with the 2 words on each side.
For example:

I was watching TV with my lovely cat last night.

The extracted text will be

(my lovely cat last night)

This will be my training example for the machine learning tool.

Q1: There will be around 100 training examples similar to the one above. I used a tokenizer to extract words, but how can I extract specific words (in this example, all types of animals) together with the 2 words on each side? Do I need to use tags, for example, or what would you suggest?

Q2: Once I have these training examples, how can I prepare an appropriate dataset to give to the machine learning tool for training? What should I write in this dataset to specify the animal, do I need to provide other features, and how should I arrange them in the dataset?

Any advice would help me a lot, so please do not hesitate to share what you know.

Comments (2)

绮筵 2024-10-28 23:35:23

What you are attempting to do is sometimes known as "Ontology Acquisition" or "Automated Ontology", and it is a pretty difficult problem. Most approaches come down to "words that are similar will tend to be used in similar contexts." The problem is that, while there are algorithms that successfully extract semantically meaningful relationships from data such as yours, going from "here are a bunch of terms that statistically share a common distribution with your seed terms" to "your seed terms are animal names, here are some other animal names" is challenging. For example, training on cat, dog, snake, bird might end up giving you results like "mammal, dachshund, creature, biped". These are used in similar contexts but, depending on your requirements, may not be exactly what you need.

Below is a link to a research paper that implemented exactly what you are trying to do. The authors describe their approach to data representation and the algorithms they used, and they report at least some level of success on the animal name problem. In addition, tracking down their references may be a fruitful exercise.

http://www.cl.cam.ac.uk/~ah433/cluk.pdf
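
To make the "similar words appear in similar contexts" idea concrete, here is a minimal Python sketch. It is not the method from the linked paper: the three toy sentences, the two-word window, and the helper names `context_vectors` and `cosine` are all assumptions chosen for illustration. Each word is represented by counts of its neighbouring words, and two words are compared by cosine similarity.

```python
import math
import re
from collections import Counter, defaultdict


def context_vectors(sentences, window=2):
    """Map each word to a Counter of the words seen within `window` positions of it."""
    vectors = defaultdict(Counter)
    for sentence in sentences:
        tokens = re.findall(r"[a-z']+", sentence.lower())
        for i, word in enumerate(tokens):
            lo = max(0, i - window)
            neighbours = tokens[lo:i] + tokens[i + 1:i + 1 + window]
            vectors[word].update(neighbours)
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse count vectors (Counters)."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


# Toy corpus (made up for this sketch): "cat" and "dog" share their contexts, "car" does not.
toy_sentences = [
    "my lovely cat sleeps on the sofa",
    "my lovely dog sleeps on the rug",
    "the old car stands in the garage",
]
vectors = context_vectors(toy_sentences)
print("cat vs dog:", cosine(vectors["cat"], vectors["dog"]))
print("cat vs car:", cosine(vectors["cat"], vectors["car"]))
```

On real text the same comparison would also rank words such as "mammal" or "creature" close to the seed terms, which is the difficulty described above.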

诺曦 2024-10-28 23:35:23

Let me begin by saying that, as a self-taught engineer who started working in NLP several years ago, I completely understand your frustration. I would suggest that you read the NLTK book, which is a wonderful introduction to applied NLP. In particular, read Chapters 3-7, which deal with processing raw text data to extract information and use it for tagging. The book is available online.

With regard to your specific question:

I think that it might be much easier to create a small list of animals and then extract the sentences from a corpus that contain these animal names. Wikipedia sentences are one obvious source. You can build your corpus using this method because you already know the names of the animals in each sentence.

// PSEUDO CODE
Dictionary animals = ["dog","dogs","cat","cats","pig","pigs","cow","cows","lion","lions","lioness","lionesses"];
String[] sentences = getWikipediaSentences();
for(sent: sentences){
  for(token: Tokenizer.getTokens(sent)){
    if(animals.contains(token)){
      addSentenceToCorpus(sent);
      break; // one match is enough, so the sentence is not added twice
    }
  }
} // sentences without an animal token are ignored
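
For readers who want something runnable, the following is a rough Python sketch of the same filtering idea, extended to pull out the animal word with two words on each side as the question asks. The seed set `ANIMALS`, the regex-based `tokenize`, and the function `animal_windows` are hypothetical names introduced here; a real experiment would more likely use a proper tokenizer such as the ones covered in the NLTK book.

```python
import re

# Hypothetical seed list; extend it for a real experiment.
ANIMALS = {"dog", "dogs", "cat", "cats", "pig", "pigs", "cow", "cows", "lion", "lions"}


def tokenize(sentence):
    """Very rough tokenizer: runs of letters and apostrophes, punctuation dropped."""
    return re.findall(r"[A-Za-z']+", sentence)


def animal_windows(sentence, animals=ANIMALS, size=2):
    """Return (animal, window_tokens) for every seed animal found in the sentence."""
    tokens = tokenize(sentence)
    hits = []
    for i, token in enumerate(tokens):
        if token.lower() in animals:
            start = max(0, i - size)  # do not run past the start of the sentence
            hits.append((token, tokens[start:i + size + 1]))
    return hits


sentence = "I was watching TV with my lovely cat last night."
for animal, window in animal_windows(sentence):
    print(animal, "->", " ".join(window))
# prints: cat -> my lovely cat last night
```

When the animal is near the start or end of a sentence, the window is simply shorter; whether to pad it instead is a design choice left open here.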

You can then train your algorithm on these sentences so that you can use the trained model to extract new animal names. There are caveats with this approach, since your "training data" is artificially collected, but it will be a good first experience nonetheless.
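
As for how such training examples might be arranged in a dataset, one common layout (an assumption here, not something this answer prescribes) is one row per token, with the two neighbours on each side as features and a label column. The sketch below writes that layout as CSV; `SEED_ANIMALS`, the `<pad>` placeholder, and the column names are hypothetical.

```python
import csv
import io
import re

SEED_ANIMALS = {"dog", "dogs", "cat", "cats"}  # hypothetical seed list
PAD = "<pad>"  # placeholder for positions before or after the sentence


def token_rows(sentence, animals=SEED_ANIMALS):
    """Yield one feature row per token: the token, two neighbours on each side, and a label."""
    tokens = re.findall(r"[a-z']+", sentence.lower())
    padded = [PAD, PAD] + tokens + [PAD, PAD]
    for i, token in enumerate(tokens):
        j = i + 2  # index of this token inside the padded list
        yield {
            "prev2": padded[j - 2],
            "prev1": padded[j - 1],
            "token": token,
            "next1": padded[j + 1],
            "next2": padded[j + 2],
            "label": "ANIMAL" if token in animals else "OTHER",
        }


def to_csv(sentences):
    """Serialise the rows for all sentences as CSV text with a header line."""
    fields = ["prev2", "prev1", "token", "next1", "next2", "label"]
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fields, lineterminator="\n")
    writer.writeheader()
    for sentence in sentences:
        writer.writerows(token_rows(sentence))
    return buffer.getvalue()


print(to_csv(["I was watching TV with my lovely cat last night."]))
```

Because the labels come from the seed list, the `token` column alone predicts them perfectly. If the goal is to find new animal names from context, it may make sense to leave that column out of the features, which ties in with the caveat about artificially collected training data.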
