Extract all words that start with a specific letter from WordNet

Posted on 2024-09-13 16:07:56

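How can I extract all the words that start with a particular letter from WordNet? For example, if I type A, WordNet should return all the words that start with the letter A.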

高速公鹿 2024-09-20 16:07:56

The easiest way I can see is to download the WordNet database files and parse the space-separated data files (data.adj, data.adv, data.noun, data.verb), taking the 5th field of each line and placing it into a suitable data structure.

Possibly a hash table with the starting letter as the key and, as the value, an array of the words that start with that letter.

Whether you use dynamic arrays, or regular arrays sized by a first pass over the file that counts the words for each letter, is up to you.

The following C sample reads through a WordNet data file and prints the 5th field (the first word) of each line. It was written quickly and is by no means polished.

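#include <stdio.h>
#include <string.h>

int main(void)
{
  char line[3000];
  FILE *fp = fopen("data.noun", "r");

  if (fp == NULL)
  {
      perror("data.noun");
      return 1;
  }
  while (fgets(line, sizeof line, fp) != NULL)
  {
      char *result;
      int count = 1;

      /* If the line did not fit in the buffer, discard the rest of it
         so the leftover is not mistaken for a new line. */
      if (strchr(line, '\n') == NULL)
      {
          int c;
          while ((c = fgetc(fp)) != EOF && c != '\n')
              ;
      }
      /* The license header lines start with a space; skip them. */
      if (line[0] == ' ')
          continue;
      /* Walk the space-separated fields; the 5th one is the first word. */
      for (result = strtok(line, " "); result != NULL; result = strtok(NULL, " "), count++)
      {
          if (count == 5)
          {
              printf("result is \"%s\"\n", result);
              break;
          }
      }
  }
  fclose(fp);
  return 0;
}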
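
The hash-table idea described above could look roughly like the following. This is only a sketch, written in Python for brevity, and the function names `words_in_data_line` and `group_by_first_letter` are made up for illustration. It assumes the data-file layout in which the 4th field is a hexadecimal word count and the words follow from the 5th field onwards, each one followed by a lex_id, so it collects every word of a synset and not just the first. The sample lines are shortened stand-ins, not real file contents.

```python
from collections import defaultdict

def words_in_data_line(line):
    """Return the words of one data-file line, or [] for header lines."""
    # License header lines are indented with spaces; blank lines carry nothing.
    if not line.strip() or line.startswith(" "):
        return []
    fields = line.split(" ")
    w_cnt = int(fields[3], 16)  # the word count is written in hexadecimal
    # Words sit at fields 4, 6, 8, ... and each is followed by its lex_id.
    return [fields[4 + 2 * i] for i in range(w_cnt)]

def group_by_first_letter(lines):
    """Build a dict mapping a lowercase starting letter to a list of words."""
    index = defaultdict(list)
    for line in lines:
        for word in words_in_data_line(line):
            index[word[0].lower()].append(word)
    return index

# Shortened, illustrative lines in the data-file layout (not real file contents).
sample = [
    "  1 This line stands in for the license header\n",
    "00001740 03 n 01 entity 0 000 | sample gloss\n",
    "00002137 03 n 02 abstraction 0 abstract_entity 0 000 | sample gloss\n",
]
index = group_by_first_letter(sample)
print(index["a"])
```

To run it over a real file, pass an open file object instead of the list, for example `group_by_first_letter(open("data.noun"))`. A dict of lists avoids the counting pass mentioned above, because Python lists grow as needed.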

For further documentation on the WordNet database format, see the wndb(5WN) manual page.

If you want to use the WordNet C API instead, see the documentation for the findtheinfo function, though I don't think that call is designed to return the sort of information you want.

醉酒的小男人 2024-09-20 16:07:56

In Python, after you've downloaded the .tab file from Open Multilingual Wordnet, you can try this recipe:

import codecs

# Read an Open Multilingual Wordnet .tab file
def readWNfile(wnfile, option="ss"):
  wn = {}
  with codecs.open(wnfile, "r", "utf8") as reader:
    for l in reader:
      if l.startswith("#"): continue  # skip header/comment lines
      fields = l.rstrip("\r\n").split("\t")
      if len(fields) < 3: continue  # skip blank or malformed lines
      if option == "ss":
        k = fields[0]  # synset as key
        v = fields[2]  # word as value
      else:
        v = fields[0]  # synset as value
        k = fields[2]  # word as key
      # Join multiple values for the same key with ";"
      if k in wn:
        wn[k] = wn[k] + ";" + v
      else:
        wn[k] = v
  return wn

princetonWN = readWNfile('wn-data-eng.tab', 'word')

# Print every word that starts with "a", along with its synsets
for i in princetonWN:
    if i.startswith("a"):
        print(i, princetonWN[i].split(";"))
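
To take the letter as input, as the question asks, the filtering step could be wrapped in a small helper. This is a minimal sketch: `words_starting_with` is a hypothetical name, and `sample_index` is a hand-made stand-in for the word-keyed dictionary built from the .tab file, with placeholder values instead of real synset IDs. It compares case-insensitively, so typing A also matches lowercase entries.

```python
def words_starting_with(word_index, letter):
    """Return the keys of word_index that start with letter, ignoring case."""
    prefix = letter.lower()
    matches = [w for w in word_index if w.lower().startswith(prefix)]
    return sorted(matches, key=str.lower)

# Hypothetical stand-in for the dictionary read from the .tab file;
# the values are placeholders, not real synset IDs.
sample_index = {
    "able": "synset-1",
    "Aberdeen": "synset-2",
    "apple": "synset-3;synset-4",
    "banana": "synset-5",
}

for word in words_starting_with(sample_index, "A"):
    print(word, sample_index[word].split(";"))
```

Using `startswith` on the lowercased word also avoids an IndexError on an empty key, which indexing with `[0]` would raise.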
