Simple language identification using LINQ

Posted on 2024-11-05 12:48:49

I'm experimenting with LINQ for the first time and decided to try basic human language identification. The input text is tested against a HashSet of the 10,000 most common words in each language and receives a score.

My question: is there a better way to write the LINQ query? Perhaps using the other form, which I don't know? It works, but I'm sure that the experts here will be able to provide a much cleaner solution!

public PolyAnalyzer() {
    Dictionaries = new Dictionary<string, AbstractDictionary>();
    Dictionaries.Add("Bulgarian", new BulgarianDictionary());
    Dictionaries.Add("English", new EnglishDictionary());
    Dictionaries.Add("German", new GermanDictionary());
    Dictionaries.Values.Select(n => new Thread(() => n.LoadDictionaryAsync())).ToList().ForEach(n => n.Start());            
}  

public string getResults(string text) {
    int total = 0;
    return string.Join(" ",
        Dictionaries.Select(n => new {
            Language = n.Key,
            Score = new Regex(@"\W+").Split(text).AsQueryable().Select(m => n.Value.getScore(m)).Sum()
        }).
        Select(n => { total += n.Score; return n; }).
        ToList().AsQueryable(). // Force immediate evaluation
        Select(n =>
        "[" + n.Score * 100 / total + "% " + n.Language + "]").
        ToArray());
}

P.S. I'm aware that this is an extremely simplistic approach to language identification; I'm just interested in the LINQ side of things.

Comments (2)

护你周全 2024-11-12 12:48:49

I would refactor it like this:

    public string GetResults(string text)
    {
        // Create the Regex once instead of once per dictionary.
        Regex wordRegex = new Regex(@"\W+");

        // Deferred query: one score per language.
        var scores = Dictionaries.Select(n => new
            {
                Language = n.Key,
                Score = wordRegex.Split(text)
                                 .Select(m => n.Value.getScore(m))
                                 .Sum()
            });

        // The total is computed in its own query rather than as a side effect.
        int total = scores.Sum(n => n.Score);
        return string.Join(" ", scores.Select(n => "[" + n.Score * 100 / total + "% " + n.Language + "]"));
    }

A few points:

  1. The AsQueryable() calls are unnecessary: this is all LINQ to Objects, where IEnumerable<T> is good enough.

  2. I removed a few ToList() calls, which were also unnecessary; this avoids eagerly loading results when they are not needed.

  3. While it's nice to have just one LINQ query, it's not a competition. Aim for overall readability and think about how you (and others) will have to maintain the code. I split your query into three more readable (imo) parts.

  4. Avoid side effects wherever possible. I removed the one you had on the variable total because it's confusing. LINQ queries shouldn't have side effects, because running the same query twice might yield different results. In your case you can simply calculate the total in a separate LINQ query.

  5. Don't re-create or re-calculate variables inside a LINQ projection if it isn't necessary. I moved the Regex out of the LINQ query and initialized it once outside; otherwise you create a new Regex instance N times instead of just once. Depending on the query, this can have huge performance implications.
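
To illustrate point 4, here is a minimal sketch of why a side effect inside a deferred query is risky. The class name SideEffectDemo and the three sample scores are made up for the illustration and are not part of the original code.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class SideEffectDemo
{
    // Hypothetical scores standing in for the per-language results.
    private static readonly int[] Scores = { 3, 5, 2 };

    // Returns the value of 'total' after enumerating the same query once and twice.
    public static (int afterFirst, int afterSecond) Run()
    {
        int total = 0;

        // Deferred query with a side effect: nothing runs until it is enumerated.
        IEnumerable<int> query = Scores.Select(s => { total += s; return s; });

        query.ToList();           // first enumeration adds 3 + 5 + 2
        int afterFirst = total;

        query.ToList();           // second enumeration adds them again
        int afterSecond = total;

        return (afterFirst, afterSecond);
    }

    // Side-effect-free alternative: compute the total as its own query.
    public static int Total()
    {
        return Scores.Sum();
    }
}
```

Run() returns (10, 20): the accumulated total depends on how many times the query happens to be enumerated, whereas Total() returns 10 every time.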

_失温 2024-11-12 12:48:49

I think the code you posted is very confusing. I've rewritten it, and I think it gives you the same result (of course I couldn't test it, and I actually think your code has some wrong parts), but it should be much more concise now. Let me know if this is incorrect.

public PolyAnalyzer()
{
    Dictionaries = new Dictionary<string, AbstractDictionary>();
    Dictionaries.Add("Bulgarian", new BulgarianDictionary());
    Dictionaries.Add("English", new EnglishDictionary());
    Dictionaries.Add("German", new GermanDictionary());

    //Tip: Use the Parallel library to do multi-core, multi-threaded work.
    Parallel.ForEach(Dictionaries.Values, d =>
    {
        d.LoadDictionaryAsync();
    });            
}  

public Dictionary<string, int> GetResults(string text)
{
    //1) Split the words.
    //2) Calculate the score per dictionary (per language).
    //3) Return the scores.
    string[] words = Regex.Split(text, @"\W+"); // \W+ matches the separators between words
    Dictionary<string, int> scores = this.Dictionaries.ToDictionary(
        d => d.Key,                                  // language name
        d => words.Sum(w => d.Value.getScore(w)));   // getScore as named in the question

    return scores;
}
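
Since this version returns the raw scores rather than a formatted string, the percentage output from the question would have to be produced by the caller. A rough sketch of how that could look, assuming a helper I'm calling ScoreFormatter (a hypothetical name, not part of the original code):

```csharp
using System.Collections.Generic;
using System.Linq;

public static class ScoreFormatter
{
    // Hypothetical helper: turns per-language scores into text like "[75% English]".
    public static string Format(IDictionary<string, int> scores)
    {
        int total = scores.Values.Sum();
        if (total == 0)
        {
            // No word matched any dictionary; avoid dividing by zero.
            return string.Empty;
        }

        // Integer division, same as in the question's code.
        return string.Join(" ",
            scores.Select(s => "[" + s.Value * 100 / total + "% " + s.Key + "]"));
    }
}
```

Usage would be something like ScoreFormatter.Format(analyzer.GetResults(text)). Keeping the formatting separate means the scores can also be sorted or filtered before they are displayed.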