将字符串列表与可用的字典/同义词库进行比较

发布于 2024-08-20 18:51:09 字数 156 浏览 9 评论 0原文

我有一个程序(C#),它生成一个字符串列表(原始字符串的排列)。大多数字符串是按预期随机分组的原始字母(即 etam、aemt、team)。我想以编程方式找到列表中真正的英语单词的一个字符串。我需要一个同义词库/字典来查找和比较每个字符串。任何人都知道可用的资源。我在 C# 中使用 VS2008。

I have a program (C#) that generates a list of strings (permutations of an original string). Most of the strings are random grouping of the original letters as expected (ie etam,aemt, team). I wanna find the one string in the list that is an actual English word, programatically. I need a thesaurus/dictionary to look up and compare each string to. Any one know of a resource available. Im using VS2008 in C#.

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(2

咽泪装欢 2024-08-27 18:51:09

您可以从网络下载单词列表(例如此处提到的文件之一:http:// /www.outpost9.com/files/WordLists.html),然后快速执行以下操作:

// Read words from file.
string [] words = ReadFromFile();

Dictionary<String, List<String>> permuteDict = new Dictionary<String, List<String>>(StringComparer.OrdinalIgnoreCase);

foreach (String word in words) {
    String sortedWord = new String(word.ToArray().Sort());
    if (!permuteDict.ContainsKey(sortedWord)) {
        permuteDict[sortedWord] = new List<String>();
    }
    permuteDict[sortedWord].Add(word);
}

// To do a lookup you can just use

String sortedWordToLook = new String(wordToLook.ToArray().Sort());

List<String> outWords;
if (permuteDict.TryGetValue(sortedWordToLook, out outWords)) {
    foreach (String outWord in outWords) {
        Console.WriteLine(outWord);
    }
}

You could download a list of words from the web (say one of the files mentioned here: http://www.outpost9.com/files/WordLists.html), then then do a quick:

// Read words from file.
string [] words = ReadFromFile();

Dictionary<String, List<String>> permuteDict = new Dictionary<String, List<String>>(StringComparer.OrdinalIgnoreCase);

foreach (String word in words) {
    String sortedWord = new String(word.ToArray().Sort());
    if (!permuteDict.ContainsKey(sortedWord)) {
        permuteDict[sortedWord] = new List<String>();
    }
    permuteDict[sortedWord].Add(word);
}

// To do a lookup you can just use

String sortedWordToLook = new String(wordToLook.ToArray().Sort());

List<String> outWords;
if (permuteDict.TryGetValue(sortedWordToLook, out outWords)) {
    foreach (String outWord in outWords) {
        Console.WriteLine(outWord);
    }
}
痴梦一场 2024-08-27 18:51:09

您还可以使用维基词典。 MediaWiki API(维基百科使用 MediaWiki)允许您查询文章标题列表。在维基词典中,文章标题(除其他外)是词典中的单词条目。唯一的问题是字典中也包含外来词,因此有时您可能会得到“不正确”的匹配。当然,您的用户还需要访问互联网。您可以在以下位置获取有关 api 的帮助和信息: http://en.wiktionary.org/w /api.php

下面是查询 URL 的示例:

http://en.wiktionary.org/w/api.php?action=query&format=xml&titles=dog|god|ogd|odg|gdo

这将返回以下 xml:

<?xml version="1.0"?>
<api>
  <query>
    <pages>
      <page ns="0" title="ogd" missing=""/>
      <page ns="0" title="odg" missing=""/>
      <page ns="0" title="gdo" missing=""/>
      <page pageid="24" ns="0" title="dog"/>
      <page pageid="5015" ns="0" title="god"/>
    </pages>
  </query>
</api>

在 C# 中,您可以使用 System.Xml.XPath 获取所需的部分(具有 pageid 的页面项)。这些才是“真话”。

我编写了一个实现并对其进行了测试(使用上面的简单“狗”示例)。它只返回“狗”和“神”。您应该更广泛地测试它。

public static IEnumerable<string> FilterRealWords(IEnumerable<string> testWords)
{
    string baseUrl = "http://en.wiktionary.org/w/api.php?action=query&format=xml&titles=";
    string queryUrl = baseUrl + string.Join("|", testWords.ToArray());

    WebClient client = new WebClient();
    client.Encoding = UnicodeEncoding.UTF8; // this is very important or the text will be junk

    string rawXml = client.DownloadString(queryUrl);

    TextReader reader = new StringReader(rawXml);
    XPathDocument doc = new XPathDocument(reader);
    XPathNavigator nav = doc.CreateNavigator();
    XPathNodeIterator iter = nav.Select(@"//page");

    List<string> realWords = new List<string>();
    while (iter.MoveNext())
    {
        // if the pageid attribute has a value
        // add the article title to the list.
        if (!string.IsNullOrEmpty(iter.Current.GetAttribute("pageid", "")))
        {
            realWords.Add(iter.Current.GetAttribute("title", ""));
        }
    }

    return realWords;
}

像这样称呼它:

IEnumerable<string> input = new string[] { "dog", "god", "ogd", "odg", "gdo" };
IEnumerable<string> output = FilterRealWords(input);

我尝试使用 LINQ to XML,但我对它不太熟悉,所以这很痛苦,我放弃了它。

You can also use Wiktionary. The MediaWiki API (Wikionary uses MediaWiki) allows you to query for a list of article titles. In wiktionary, article titles are (among other things) word entries in the dictionary. The only catch is that foreign words are also in the dictionary, so you might get "incorrect" matches sometimes. Your user will also need internet access, of course. You can get help and info on the api at: http://en.wiktionary.org/w/api.php

Here's an example of your query URL:

http://en.wiktionary.org/w/api.php?action=query&format=xml&titles=dog|god|ogd|odg|gdo

This returns the following xml:

<?xml version="1.0"?>
<api>
  <query>
    <pages>
      <page ns="0" title="ogd" missing=""/>
      <page ns="0" title="odg" missing=""/>
      <page ns="0" title="gdo" missing=""/>
      <page pageid="24" ns="0" title="dog"/>
      <page pageid="5015" ns="0" title="god"/>
    </pages>
  </query>
</api>

In C#, you can then use System.Xml.XPath to get the parts you need (page items with pageid). Those are the "real words".

I wrote an implementation and tested it (using the simple "dog" example from above). It returned just "dog" and "god". You should test it more extensively.

public static IEnumerable<string> FilterRealWords(IEnumerable<string> testWords)
{
    string baseUrl = "http://en.wiktionary.org/w/api.php?action=query&format=xml&titles=";
    string queryUrl = baseUrl + string.Join("|", testWords.ToArray());

    WebClient client = new WebClient();
    client.Encoding = UnicodeEncoding.UTF8; // this is very important or the text will be junk

    string rawXml = client.DownloadString(queryUrl);

    TextReader reader = new StringReader(rawXml);
    XPathDocument doc = new XPathDocument(reader);
    XPathNavigator nav = doc.CreateNavigator();
    XPathNodeIterator iter = nav.Select(@"//page");

    List<string> realWords = new List<string>();
    while (iter.MoveNext())
    {
        // if the pageid attribute has a value
        // add the article title to the list.
        if (!string.IsNullOrEmpty(iter.Current.GetAttribute("pageid", "")))
        {
            realWords.Add(iter.Current.GetAttribute("title", ""));
        }
    }

    return realWords;
}

Call it like this:

IEnumerable<string> input = new string[] { "dog", "god", "ogd", "odg", "gdo" };
IEnumerable<string> output = FilterRealWords(input);

I tried using LINQ to XML, but I'm not that familiar with it, so it was a pain and I gave up on it.

~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文