nlp spell-checking hyphenation

Detecting syllables in a word

Posted on 2024-07-11 06:17:28

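I need to find a fairly efficient way to detect syllables in a word. For example:

invisible -> in-vi-sib-le

There are some syllabification rules that could be used:

V CV VC CVC CCV CCCV CVCC

*where V is a vowel and C is a consonant. For example:

pronunciation (5 syllables: Pro-nun-ci-a-tion; CV-CVC-CV-V-CVC)

I have tried a few methods: regular expressions (which help only if you want to count syllables), hard-coded rule definitions (a brute-force approach that proved very inefficient), and finally a finite-state automaton (which did not produce anything useful).

The purpose of my application is to create a dictionary of all syllables in a given language. This dictionary will later be used for spell-checking applications (using Bayesian classifiers) and text-to-speech synthesis.

I would appreciate any tips on alternative ways to solve this problem besides the approaches above.

I work in Java, but any tip in C/C++, C#, Python, Perl... would work for me.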
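
To make the V/C rule list concrete, here is a minimal Python sketch that maps each syllable of an already-hyphenated word to its letter-level pattern and checks it against the rule set from the question. The function names are made up for illustration, and treating only a, e, i, o, u as vowels is an assumption. Note that a purely letter-based mapping labels "pro" as CCV, so the patterns quoted in the question seem to be closer to sounds than to spelling.

```python
VOWELS = set("aeiou")  # assumption: 'y' is treated as a consonant here
ALLOWED = {"V", "CV", "VC", "CVC", "CCV", "CCCV", "CVCC"}  # rule set from the question

def cv_pattern(syllable):
    """Map each letter to 'V' (vowel) or 'C' (consonant)."""
    return "".join("V" if ch in VOWELS else "C" for ch in syllable.lower())

def check_split(hyphenated):
    """Return (pattern, is_allowed) for every syllable of a hyphen-separated word."""
    return [(cv_pattern(s), cv_pattern(s) in ALLOWED) for s in hyphenated.split("-")]

print(check_split("in-vi-sib-le"))
# [('VC', True), ('CV', True), ('CVC', True), ('CV', True)]
```

A check like this can only validate a candidate split; it does not find the split points by itself.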

Comments (17)

三生一梦 2024-07-18 06:17:28

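Read about the TeX approach to this problem for the purposes of hyphenation. In particular, see Frank Liang's thesis Word Hy-phen-a-tion by Com-put-er. His algorithm is very accurate, and it also includes a small exceptions dictionary for cases where the algorithm does not work.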
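
For readers who have not seen the pattern approach, here is a rough Python sketch of the core idea: patterns carry digits between letters, every matching pattern votes on each gap in the word, the highest digit wins, and odd values mark allowed break points. The nine patterns below are the ones commonly quoted for the word "hyphenation"; a real pattern file has thousands, so treat this list and the function names as illustrative assumptions. The minimum fragment lengths of 2 and 3 mirror TeX's usual English defaults.

```python
# Illustrative subset only; real pattern files contain thousands of patterns.
RAW_PATTERNS = ["hy3ph", "he2n", "hena4", "hen5at", "1na", "n2at", "1tio", "2io", "o2n"]

def parse_pattern(pat):
    """Split 'hy3ph' into letters 'hyph' and gap weights [0, 0, 3, 0, 0]."""
    letters = "".join(ch for ch in pat if not ch.isdigit())
    weights = [0] * (len(letters) + 1)
    pos = 0
    for ch in pat:
        if ch.isdigit():
            weights[pos] = int(ch)  # digit sits in the gap before letter number pos
        else:
            pos += 1
    return letters, weights

PATTERNS = [parse_pattern(p) for p in RAW_PATTERNS]

def hyphenate(word, patterns=PATTERNS, left_min=2, right_min=3):
    """Return the fragments of word between the allowed break points."""
    padded = "." + word.lower() + "."      # dots mark the word boundaries
    scores = [0] * (len(padded) + 1)       # scores[i] = weight of the gap before padded[i]
    for letters, weights in patterns:
        start = padded.find(letters)
        while start != -1:
            for k, w in enumerate(weights):
                scores[start + k] = max(scores[start + k], w)
            start = padded.find(letters, start + 1)
    pieces, last = [], 0
    for j in range(1, len(word)):          # gap before word[j] is scores[j + 1]
        if scores[j + 1] % 2 == 1 and j >= left_min and len(word) - j >= right_min:
            pieces.append(word[last:j])
            last = j
    pieces.append(word[last:])
    return pieces

print(hyphenate("hyphenation"))  # ['hy', 'phen', 'ation']
```

The result has three fragments although the word has four syllables, which shows that hyphenation points are a subset of syllable boundaries rather than a full syllabification.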

断桥再见 2024-07-18 06:17:28

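I stumbled across this page looking for the same thing, and found a few implementations of the Liang paper here:
https://github.com/mnater/hyphenator or the successor: https://github.com/mnater/Hyphenopoly

That is, unless you are the type who enjoys reading a 60-page thesis instead of adapting freely available code for a non-unique problem. :)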

顾忌 2024-07-18 06:17:28

Here is a solution using NLTK:

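from nltk.corpus import cmudict

# Requires a one-time download of the corpus: nltk.download('cmudict')
d = cmudict.dict()

def nsyl(word):
    # One count per pronunciation; vowel phonemes end in a stress digit.
    return [len([y for y in x if y[-1].isdigit()]) for x in d[word.lower()]]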
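
In case the one-liner looks cryptic: each CMU dictionary pronunciation is a list of ARPAbet phonemes, and only the vowel phonemes carry a trailing stress digit (0, 1 or 2), so counting those gives the syllable count. Here is a small sketch that works without NLTK installed. The pronunciation below is typed in by hand and is only assumed to match what the dictionary returns for "invisible".

```python
def count_syllables(phonemes):
    """Count ARPAbet vowel phonemes, i.e. the ones ending in a stress digit."""
    return sum(1 for p in phonemes if p[-1].isdigit())

# Hand-typed stand-in for one pronunciation entry (assumption, not read from the corpus).
invisible = ["IH0", "N", "V", "IH1", "Z", "AH0", "B", "AH0", "L"]

print(count_syllables(invisible))  # 4
```

Words with several pronunciations can yield several counts, which is why nsyl returns a list.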

若相惜即相离 2024-07-18 06:17:28

I'm trying to tackle this problem for a program that will calculate the Flesch-Kincaid and Flesch reading scores of a block of text. My algorithm uses what I found on this website: http://www.howmanysyllables.com/howtocountsyllables.html and it gets reasonably close. It still has trouble with complicated words like invisible and hyphenation, but I've found it gets in the ballpark for my purposes.

It has the upside of being easy to implement. I found that a trailing "es" can be either syllabic or not. It's a gamble, but I decided to remove the es in my algorithm.

private int CountSyllables(string word)
    {
        char[] vowels = { 'a', 'e', 'i', 'o', 'u', 'y' };
        string currentWord = word;
        int numVowels = 0;
        bool lastWasVowel = false;
        foreach (char wc in currentWord)
        {
            bool foundVowel = false;
            foreach (char v in vowels)
            {
                //don't count diphthongs
                if (v == wc && lastWasVowel)
                {
                    foundVowel = true;
                    lastWasVowel = true;
                    break;
                }
                else if (v == wc && !lastWasVowel)
                {
                    numVowels++;
                    foundVowel = true;
                    lastWasVowel = true;
                    break;
                }
            }

            //if full cycle and no vowel found, set lastWasVowel to false;
            if (!foundVowel)
                lastWasVowel = false;
        }
        //remove es, it's _usually? silent
        if (currentWord.Length > 2 && 
            currentWord.Substring(currentWord.Length - 2) == "es")
            numVowels--;
        // remove silent e
        else if (currentWord.Length > 1 &&
            currentWord.Substring(currentWord.Length - 1) == "e")
            numVowels--;

        return numVowels;
    }

风和你 2024-07-18 06:17:28

Why calculate it? Every online dictionary has this info. http://dictionary.reference.com/browse/invisible
in·vis·i·ble

沒落の蓅哖 2024-07-18 06:17:28

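This is a particularly difficult problem that is not completely solved by the LaTeX hyphenation algorithm. A good summary of some available methods and the challenges involved can be found in the paper Evaluating Automatic Syllabification Algorithms for English (Marchand, Adsett, and Damper 2007).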

情丝乱 2024-07-18 06:17:28

I ran into this exact same issue a little while ago.

I ended up using the CMU Pronunciation Dictionary for quick and accurate lookups of most words. For words not in the dictionary, I fell back to a machine learning model that's ~98% accurate at predicting syllable counts.

I wrapped the whole thing up in an easy-to-use python module here: https://github.com/repp/big-phoney

Install:
pip install big-phoney

Count Syllables:

from big_phoney import BigPhoney
phoney = BigPhoney()
phoney.count_syllables('triceratops')  # --> 4

If you're not using Python and you want to try the ML-model-based approach, I did a pretty detailed write up on how the syllable counting model works on Kaggle.
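
The general shape of "dictionary first, model as fallback" can be sketched in a few lines of Python. This is not how big-phoney is implemented: the lookup table is a tiny hand-made stand-in for a pronunciation dictionary, and the fallback is a plain vowel-group heuristic swapped in for the machine learning model. All names are hypothetical.

```python
import re

# Tiny hand-made table standing in for a real pronunciation dictionary (assumption).
KNOWN = {"invisible": 4, "hyphenation": 4, "computer": 3}

def heuristic_count(word):
    """Fallback: count runs of consecutive vowel letters, with a minimum of 1."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def count_syllables(word):
    """Prefer the dictionary; fall back to the heuristic for unknown words."""
    key = word.lower()
    if key in KNOWN:
        return KNOWN[key]
    return heuristic_count(key)

print(count_syllables("Invisible"))    # 4, from the table
print(count_syllables("triceratops"))  # 4, from the heuristic
```

The point of the split is that the fallback only has to be good on the rare words the dictionary misses.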

一身软味 2024-07-18 06:17:28

Today I found this Java implementation of Frank Liang's hyphenation algorithm with patterns for English or German, which works quite well and is available on Maven Central.

Caveat: It is important to remove the last lines of the .tex pattern files, because otherwise those files cannot be loaded with the current version on Maven Central.

To load and use the hyphenator, you can use the following Java code snippet. texTable is the name of the .tex file containing the needed patterns. Those files are available on the project's GitHub site.

 private Hyphenator createHyphenator(String texTable) {
        Hyphenator hyphenator = new Hyphenator();
        hyphenator.setErrorHandler(new ErrorHandler() {
            public void debug(String guard, String s) {
                logger.debug("{},{}", guard, s);
            }

            public void info(String s) {
                logger.info(s);
            }

            public void warning(String s) {
                logger.warn("WARNING: " + s);
            }

            public void error(String s) {
                logger.error("ERROR: " + s);
            }

            public void exception(String s, Exception e) {
                logger.error("EXCEPTION: " + s, e);
            }

            public boolean isDebugged(String guard) {
                return false;
            }
        });

        BufferedReader table = null;

        try {
            table = new BufferedReader(new InputStreamReader(Thread.currentThread().getContextClassLoader()
                    .getResourceAsStream((texTable)), Charset.forName("UTF-8")));
            hyphenator.loadTable(table);
        } catch (Utf8TexParser.TexParserException e) {
            logger.error("error loading hyphenation table: {}", e.getLocalizedMessage(), e);
            throw new RuntimeException("Failed to load hyphenation table", e);
        } finally {
            if (table != null) {
                try {
                    table.close();
                } catch (IOException e) {
                    logger.error("Closing hyphenation table failed", e);
                }
            }
        }

        return hyphenator;
    }

Afterwards the Hyphenator is ready to use. To detect syllables, the basic idea is to split the term at the provided hyphens.

    String hyphenedTerm = hyphenator.hyphenate(term);

    String hyphens[] = hyphenedTerm.split("\u00AD");

    int syllables = hyphens.length;

You need to split on "\u00AD", since the API does not return a normal "-".
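
The soft hyphen (U+00AD) is usually invisible when printed, which is why splitting on "-" finds nothing. Here is a small Python sketch of the same split step. The sample string is built by hand and only stands in for what a hyphenator might return; the helper name is made up.

```python
SOFT_HYPHEN = "\u00ad"  # Unicode SOFT HYPHEN, an optional break point

def split_on_soft_hyphens(hyphenated):
    """Split at soft hyphens and drop empty fragments."""
    return [part for part in hyphenated.split(SOFT_HYPHEN) if part]

sample = "hy\u00adphen\u00adation"  # hand-built stand-in for a hyphenator result
parts = split_on_soft_hyphens(sample)
print(parts, len(parts))  # ['hy', 'phen', 'ation'] 3
```

A word without any break points comes back as a single fragment, so the count is never below 1 for a non-empty word.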

This approach outperforms the answer of Joe Basirico, since it supports many different languages and detects German hyphenation more accurately.

痕至 2024-07-18 06:17:28

Bumping @Tihamer and @joe-basirico. Very useful function, not perfect, but good for most small-to-medium projects. Joe, I have re-written an implementation of your code in Python:

def countSyllables(word):
    vowels = "aeiouy"
    numVowels = 0
    lastWasVowel = False
    for wc in word:
        foundVowel = False
        for v in vowels:
            if v == wc:
                if not lastWasVowel: numVowels+=1   #don't count diphthongs
                foundVowel = lastWasVowel = True
                break
        if not foundVowel:  #If full cycle and no vowel found, set lastWasVowel to false
            lastWasVowel = False
    if len(word) > 2 and word[-2:] == "es": #Remove es - it's "usually" silent (?)
        numVowels-=1
    elif len(word) > 1 and word[-1:] == "e":    #remove silent e
        numVowels-=1
    return numVowels

Hope someone finds this useful!

恍梦境° 2024-07-18 06:17:28

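Perl has the Lingua::Phonology::Syllable module. You might try that, or try looking into its algorithm. I saw a few other older modules there, too.

I don't understand why a regular expression gives you only a count of syllables. You should be able to get the syllables themselves using capture parentheses, assuming you can construct a regular expression that works.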
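
As a rough illustration of getting the pieces rather than just a count, here is one possible regular expression in Python. It is only a sketch and certainly not a complete rule set: each match takes optional leading consonants, a run of vowels, and then either all trailing consonants at the end of the word or one consonant when another consonant follows. The pattern and the function name are my own assumptions.

```python
import re

SYLLABLE_RE = re.compile(
    r"[^aeiouy]*[aeiouy]+(?:[^aeiouy]*$|[^aeiouy](?=[^aeiouy]))?",
    re.IGNORECASE,
)

def split_syllables(word):
    """Return the consecutive matches of the pattern, one per guessed syllable."""
    return SYLLABLE_RE.findall(word)

print(split_syllables("invisible"))      # ['in', 'vi', 'sib', 'le']
print(split_syllables("pronunciation"))  # ['pro', 'nun', 'cia', 'tion']
```

The first result matches the split from the question, while the second merges "ci-a" into one piece, which shows how quickly a single expression runs out of steam on vowel pairs.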

夜还是长夜 2024-07-18 06:17:28

Thanks Joe Basirico, for sharing your quick and dirty implementation in C#. I've used the big libraries, and they work, but they're usually a bit slow, and for quick projects, your method works fine.

Here is your code in Java, along with test cases:

public static int countSyllables(String word)
{
    char[] vowels = { 'a', 'e', 'i', 'o', 'u', 'y' };
    char[] currentWord = word.toCharArray();
    int numVowels = 0;
    boolean lastWasVowel = false;
    for (char wc : currentWord) {
        boolean foundVowel = false;
        for (char v : vowels)
        {
            //don't count diphthongs
            if ((v == wc) && lastWasVowel)
            {
                foundVowel = true;
                lastWasVowel = true;
                break;
            }
            else if (v == wc && !lastWasVowel)
            {
                numVowels++;
                foundVowel = true;
                lastWasVowel = true;
                break;
            }
        }
        // If full cycle and no vowel found, set lastWasVowel to false;
        if (!foundVowel)
            lastWasVowel = false;
    }
    // Remove es, it's _usually? silent
    // (endsWith compares contents; == on String compares references)
    if (word.length() > 2 && word.endsWith("es"))
        numVowels--;
    // remove silent e
    else if (word.length() > 1 && word.endsWith("e"))
        numVowels--;
    return numVowels;
}

public static void main(String[] args) {
    String txt = "what";
    System.out.println("txt="+txt+" countSyllables="+countSyllables(txt));
    txt = "super";
    System.out.println("txt="+txt+" countSyllables="+countSyllables(txt));
    txt = "Maryland";
    System.out.println("txt="+txt+" countSyllables="+countSyllables(txt));
    txt = "American";
    System.out.println("txt="+txt+" countSyllables="+countSyllables(txt));
    txt = "disenfranchized";
    System.out.println("txt="+txt+" countSyllables="+countSyllables(txt));
    txt = "Sophia";
    System.out.println("txt="+txt+" countSyllables="+countSyllables(txt));
}

The result was as expected (it works well enough for Flesch-Kincaid):

txt=what countSyllables=1
txt=super countSyllables=2
txt=Maryland countSyllables=3
txt=American countSyllables=3
txt=disenfranchized countSyllables=5
txt=Sophia countSyllables=2

红ご颜醉 2024-07-18 06:17:28

Thank you @joe-basirico and @tihamer. I have ported @tihamer's code to Lua 5.1, 5.2 and luajit 2 (most likely will run on other versions of lua as well):

countsyllables.lua

function CountSyllables(word)
  local vowels = { 'a','e','i','o','u','y' }
  local numVowels = 0
  local lastWasVowel = false

  for i = 1, #word do
    local wc = string.sub(word,i,i)
    local foundVowel = false;
    for _,v in pairs(vowels) do
      if (v == string.lower(wc) and lastWasVowel) then
        foundVowel = true
        lastWasVowel = true
      elseif (v == string.lower(wc) and not lastWasVowel) then
        numVowels = numVowels + 1
        foundVowel = true
        lastWasVowel = true
      end
    end

    if not foundVowel then
      lastWasVowel = false
    end
  end

  if string.len(word) > 2 and
    string.sub(word,string.len(word) - 1) == "es" then
    numVowels = numVowels - 1
  elseif string.len(word) > 1 and
    string.sub(word,string.len(word)) == "e" then
    numVowels = numVowels - 1
  end

  return numVowels
end

And some fun tests to confirm it works (as much as it's supposed to):

countsyllables.tests.lua

require "countsyllables"

tests = {
  { word = "what", syll = 1 },
  { word = "super", syll = 2 },
  { word = "Maryland", syll = 3},
  { word = "American", syll = 4},
  { word = "disenfranchized", syll = 5},
  { word = "Sophia", syll = 2},
  { word = "End", syll = 1},
  { word = "I", syll = 1},
  { word = "release", syll = 2},
  { word = "same", syll = 1},
}

for _,test in pairs(tests) do
  local resultSyll = CountSyllables(test.word)
  assert(resultSyll == test.syll,
    "Word: "..test.word.."\n"..
    "Expected: "..test.syll.."\n"..
    "Result: "..resultSyll)
end

print("Tests passed.")

故事与诗 2024-07-18 06:17:28

I could not find an adequate way to count syllables, so I designed a method myself.

You can view my method here: https://stackoverflow.com/a/32784041/2734752

I use a combination of a dictionary and algorithm method to count syllables.

You can view my library here: https://github.com/troywatson/Lawrence-Style-Checker

I just tested my algorithm and had a 99.4% strike rate!

Lawrence lawrence = new Lawrence();

System.out.println(lawrence.getSyllable("hyphenation"));
System.out.println(lawrence.getSyllable("computer"));

Output:

4
3

黑寡妇 2024-07-18 06:17:28

After doing a lot of testing and trying out hyphenation packages as well, I wrote my own based on a number of examples. I also tried the pyhyphen and pyphen packages that interface with hyphenation dictionaries, but they produce the wrong number of syllables in many cases. The nltk package was simply too slow for this use case.

My implementation in Python is part of a class I wrote, and the syllable counting routine is pasted below. It over-estimates the number of syllables a bit, as I still haven't found a good way to account for silent word endings.

The function returns the ratio of syllables per word as it is used for a Flesch-Kincaid readability score. The number doesn't have to be exact, just close enough for an estimate.

On my 7th generation i7 CPU, this function took 1.1-1.2 milliseconds for a 759 word sample text.

def _countSyllablesEN(self, theText):

    cleanText = ""
    for ch in theText.lower():  # lowercase first so capital letters are not dropped
        if ch in "abcdefghijklmnopqrstuvwxyz'’":
            cleanText += ch
        else:
            cleanText += " "

    asVow    = "aeiouy'’"
    dExep    = ("ei","ie","ua","ia","eo")
    theWords = cleanText.lower().split()
    allSylls = 0
    for inWord in theWords:
        nChar  = len(inWord)
        nSyll  = 0
        wasVow = False
        wasY   = False
        if nChar == 0:
            continue
        if inWord[0] in asVow:
            nSyll += 1
            wasVow = True
            wasY   = inWord[0] == "y"
        for c in range(1,nChar):
            isVow  = False
            if inWord[c] in asVow:
                nSyll += 1
                isVow = True
            if isVow and wasVow:
                nSyll -= 1
            if isVow and wasY:
                nSyll -= 1
            if inWord[c:c+2] in dExep:
                nSyll += 1
            wasVow = isVow
            wasY   = inWord[c] == "y"
        if inWord.endswith(("e")):
            nSyll -= 1
        if inWord.endswith(("le","ea","io")):
            nSyll += 1
        if nSyll < 1:
            nSyll = 1
        # print("%-15s: %d" % (inWord,nSyll))
        allSylls += nSyll

    # avoid a ZeroDivisionError when the text contains no words
    return allSylls/len(theWords) if theWords else 0.0
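
For context on why a rough syllables-per-word ratio is good enough here: it is one of only two inputs to the Flesch formulas, the other being words per sentence. Below is a minimal Python sketch using the standard published coefficients. The function and parameter names are made up for illustration, and the counts are assumed to come from elsewhere.

```python
def flesch_reading_ease(n_sentences, n_words, n_syllables):
    """Flesch reading ease: higher scores mean easier text."""
    return 206.835 - 1.015 * (n_words / n_sentences) - 84.6 * (n_syllables / n_words)

def flesch_kincaid_grade(n_sentences, n_words, n_syllables):
    """Flesch-Kincaid grade level: approximate US school grade."""
    return 0.39 * (n_words / n_sentences) + 11.8 * (n_syllables / n_words) - 15.59

# Example: one sentence, 10 words, 15 syllables (1.5 syllables per word).
print(round(flesch_reading_ease(1, 10, 15), 3))   # 69.785
print(round(flesch_kincaid_grade(1, 10, 15), 2))  # 6.01
```

An over-count of syllables pushes the grade level up and the reading ease down, so a counter biased towards more syllables makes text look harder than it is.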

楠木可依 2024-07-18 06:17:28

You can try Spacy Syllables. This works on Python 3.9:

Setup:

pip install spacy
pip install spacy_syllables
python -m spacy download en_core_web_md

Code:

import spacy
from spacy_syllables import SpacySyllables
nlp = spacy.load('en_core_web_md')
syllables = SpacySyllables(nlp)
nlp.add_pipe('syllables', after='tagger')


def spacy_syllablize(word):
    token = nlp(word)[0]
    return token._.syllables


for test_word in ["trampoline", "margaret", "invisible", "thought", "Pronunciation", "couldn't"]:
    print(f"{test_word} -> {spacy_syllablize(test_word)}")

Output:

trampoline -> ['tram', 'po', 'line']
margaret -> ['mar', 'garet']
invisible -> ['in', 'vis', 'i', 'ble']
thought -> ['thought']
Pronunciation -> ['pro', 'nun', 'ci', 'a', 'tion']
couldn't -> ['could']

断桥再见 2024-07-18 06:17:28

I am providing a solution that works "okay" in R. It is far from perfect.

countSyllablesInWord = function(words)
  {
  #word = "super";
  n.words = length(words);
  result = list();
  for(j in 1:n.words)
    {
    word = words[j];
    vowels = c("a","e","i","o","u","y");
    
    word.vec = strsplit(word,"")[[1]];
    word.vec;
    
    n.char = length(word.vec);
    
    is.vowel = is.element(tolower(word.vec), vowels);
    n.vowels = sum(is.vowel);
    
    
    # nontrivial problem 
    if(n.vowels <= 1)
      {
      syllables = 1;
      str = word;
      } else {
              # syllables = 0;
              previous = "C";
              # on average ? 
              str = "";
              n.hyphen = 0;
        
              for(i in 1:n.char)
                {
                my.char = word.vec[i];
                my.vowel = is.vowel[i];
                if(my.vowel)
                  {
                  if(previous == "C")
                    {
                    if(i == 1)
                      {
                      str = paste0(my.char, "-");
                      n.hyphen = 1 + n.hyphen;
                      } else {
                              if(i < n.char)
                                {
                                if(n.vowels > (n.hyphen + 1))
                                  {
                                  str = paste0(str, my.char, "-");
                                  n.hyphen = 1 + n.hyphen;
                                  } else {
                                           str = paste0(str, my.char);
                                          }
                                } else {
                                        str = paste0(str, my.char);
                                        }
                              }
                     # syllables = 1 + syllables;
                     previous = "V";
                    } else {  # "VV"
                          # assume what  ?  vowel team?
                          str = paste0(str, my.char);
                          }
            
                } else {
                            str = paste0(str, my.char);
                            previous = "C";
                            }
                #
                }
        
              syllables = 1 + n.hyphen;
              }
  
      result[[j]] = list("syllables" = syllables, "vowels" = n.vowels, "word" = str);
      }
  
  if(n.words == 1) { result[[1]]; } else { result; }
  }

Here are some results:

my.count = countSyllablesInWord(c("America", "beautiful", "spacious", "skies", "amber", "waves", "grain", "purple", "mountains", "majesty"));

my.count.df = data.frame(matrix(unlist(my.count), ncol=3, byrow=TRUE));
colnames(my.count.df) = names(my.count[[1]]);

my.count.df;

#    syllables vowels         word
# 1          4      4   A-me-ri-ca
# 2          4      5 be-auti-fu-l
# 3          3      4   spa-ci-ous
# 4          2      2       ski-es
# 5          2      2       a-mber
# 6          2      2       wa-ves
# 7          2      2       gra-in
# 8          2      2      pu-rple
# 9          3      4  mo-unta-ins
# 10         3      3    ma-je-sty

I did not realize how big of a "rabbit hole" this is; it seems so easy.


################ hackathon #######


# https://en.wikipedia.org/wiki/Gunning_fog_index
# THIS is a CLASSIFIER PROBLEM ...
# https://stackoverflow.com/questions/405161/detecting-syllables-in-a-word



# http://www.speech.cs.cmu.edu/cgi-bin/cmudict
# http://www.syllablecount.com/syllables/


  # https://enchantedlearning.com/consonantblends/index.shtml
  # start.digraphs = c("bl", "br", "ch", "cl", "cr", "dr", 
  #                   "fl", "fr", "gl", "gr", "pl", "pr",
  #                   "sc", "sh", "sk", "sl", "sm", "sn",
  #                   "sp", "st", "sw", "th", "tr", "tw",
  #                   "wh", "wr");
  # start.trigraphs = c("sch", "scr", "shr", "sph", "spl",
  #                     "spr", "squ", "str", "thr");
  # 
  # 
  # 
  # end.digraphs = c("ch","sh","th","ng","dge","tch");
  # 
  # ile
  # 
  # farmer
  # ar er
  # 
  # vowel teams ... beaver1
  # 
  # 
  # # "able"
  # # http://www.abcfastphonics.com/letter-blends/blend-cial.html
  # blends = c("augh", "ough", "tien", "ture", "tion", "cial", "cian", 
  #             "ck", "ct", "dge", "dis", "ed", "ex", "ful", 
  #             "gh", "ng", "ous", "kn", "ment", "mis", );
  # 
  # glue = c("ld", "st", "nd", "ld", "ng", "nk", 
  #           "lk", "lm", "lp", "lt", "ly", "mp", "nce", "nch", 
  #           "nse", "nt", "ph", "psy", "pt", "re", )
  # 
  # 
  # start.graphs = c("bl, br, ch, ck, cl, cr, dr, fl, fr, gh, gl, gr, ng, ph, pl, pr, qu, sc, sh, sk, sl, sm, sn, sp, st, sw, th, tr, tw, wh, wr");
  # 
  # # https://mantra4changeblog.wordpress.com/2017/05/01/consonant-digraphs/
  # digraphs.start = c("ch","sh","th","wh","ph","qu");
  # digraphs.end = c("ch","sh","th","ng","dge","tch");
  # # https://www.education.com/worksheet/article/beginning-consonant-blends/
  # blends.start = c("pl", "gr", "gl", "pr",
  #                 
  # blends.end = c("lk","nk","nt",
  # 
  # 
  # # https://sarahsnippets.com/wp-content/uploads/2019/07/ScreenShot2019-07-08at8.24.51PM-817x1024.png
  # # Monte     Mon-te
  # # Sophia    So-phi-a
  # # American  A-mer-i-can
  # 
  # n.vowels = 0;
  # for(i in 1:n.char)
  #   {
  #   my.char = word.vec[i];
  # 
  # 
  # 
  # 
  # 
  # n.syll = 0;
  # str = "";
  # 
  # previous = "C"; # consonant vs "V" vowel
  # 
  # for(i in 1:n.char)
  #   {
  #   my.char = word.vec[i];
  #   
  #   my.vowel = is.element(tolower(my.char), vowels);
  #   if(my.vowel)
  #     {
  #     n.vowels = 1 + n.vowels;
  #     if(previous == "C")
  #       {
  #       if(i == 1)
  #         {
  #         str = paste0(my.char, "-");
  #         } else {
  #                 if(n.syll > 1)
  #                   {
  #                   str = paste0(str, "-", my.char);
  #                   } else {
  #                          str = paste0(str, my.char);
  #                         }
  #                 }
  #        n.syll = 1 + n.syll;
  #        previous = "V";
  #       } 
  #     
  #   } else {
  #               str = paste0(str, my.char);
  #               previous = "C";
  #               }
  #   #
  #   }
  # 
  # 
  # 
  # 
## https://jzimba.blogspot.com/2017/07/an-algorithm-for-counting-syllables.html
# AIDE   1
# IDEA   3
# IDEAS  2
# IDEE   2
# IDE   1
# AIDA   2
# PROUSTIAN 3
# CHRISTIAN 3
# CLICHE  1
# HALIDE  2
# TELEPHONE 3
# TELEPHONY 4
# DUE   1
# IDEAL  2
# DEE   1
# UREA  3
# VACUO  3
# SEANCE  1
# SAILED  1
# RIBBED  1
# MOPED  1
# BLESSED  1
# AGED  1
# TOTED  2
# WARRED  1
# UNDERFED 2
# JADED  2
# INBRED  2
# BRED  1
# RED   1
# STATES  1
# TASTES  1
# TESTES  1
# UTILIZES  4
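
The consonant-blend lists in the notes above suggest one way to place the hyphens better than "break after every vowel": when consonants sit between two vowel groups, give the next syllable the longest cluster that is a known word-initial blend. What follows is only a minimal Python sketch of that idea, not the code used for the results shown earlier. The names `ONSETS` and `split_syllables` are made up for illustration, the blend set is copied from the `start.digraphs` and `start.trigraphs` notes, and silent "e" and vowel teams that span two syllables are deliberately not handled.

```python
import re

# Word-initial blends copied from the start.digraphs / start.trigraphs notes.
ONSETS = {
    "bl", "br", "ch", "cl", "cr", "dr", "fl", "fr", "gl", "gr", "pl", "pr",
    "sc", "sh", "sk", "sl", "sm", "sn", "sp", "st", "sw", "th", "tr", "tw",
    "wh", "wr", "sch", "scr", "shr", "sph", "spl", "spr", "squ", "str", "thr",
}
VOWELS = "aeiouy"


def split_syllables(word):
    """Split a word at vowel groups, handing each following syllable
    the longest consonant cluster found in ONSETS (hypothetical helper)."""
    # Alternating runs of vowels and non-vowels: "purple" -> p, u, rpl, e
    runs = re.findall(r"[aeiouy]+|[^aeiouy]+", word.lower())
    syllables, current, seen_vowel = [], "", False
    for idx, run in enumerate(runs):
        if run[0] in VOWELS:
            current += run
            seen_vowel = True
        elif not seen_vowel or idx == len(runs) - 1:
            # Leading and trailing consonants stay with their only neighbour.
            current += run
        else:
            # Default: only the last consonant moves to the next syllable.
            cut = len(run) - 1
            # Prefer the longest suffix of the run that is a known blend.
            for k in range(len(run)):
                if run[k:] in ONSETS:
                    cut = k
                    break
            syllables.append(current + run[:cut])
            current = run[cut:]
    syllables.append(current)
    return syllables


for w in ["America", "amber", "purple", "mountains"]:
    print("-".join(split_syllables(w)))
```

On these four words the sketch gives a-me-ri-ca, am-ber, pur-ple and moun-tains. It still splits "waves" as wa-ves, because nothing here knows about silent "e", which is part of why this looks more like a classifier problem than a rule list.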

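And for good measure, a simple Flesch-Kincaid readability function. `syllables` is the list of per-word results returned from the first function.

Since my function is a bit biased towards more syllables, it will overstate the grade level (and understate the reading-ease score). For now that is fine: if the goal is to make text more readable, erring in that direction is not the worst thing.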

computeReadability = function(n.sentences, n.words, syllables=NULL)
  {
  # syllables: list of per-word results from countSyllablesInWord()
  # (for a single word that function returns one bare result, so wrap it in list() first)
  n.syllables = 0;
  for(i in seq_along(syllables))  # seq_along() skips the loop when syllables is NULL or empty
    {
    my.syllable = syllables[[i]];
    n.syllables = my.syllable$syllables + n.syllables;
    }
  # Flesch Reading Ease (FRE):
  FRE = 206.835 - 1.015 * (n.words/n.sentences) - 84.6 * (n.syllables/n.words);
  # Flesch-Kincaid Grade Level (FKGL):
  FKGL = 0.39 * (n.words/n.sentences) + 11.8 * (n.syllables/n.words) - 15.59; 
  # FKGL = -0.384236 * FRE - 20.7164 * (n.syllables/n.words) + 63.88355;
  # FKGL = -0.13948  * FRE + 0.24843 * (n.words/n.sentences) + 13.25934;
  
  list("FRE" = FRE, "FKGL" = FKGL); 
  }
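
To see how much a syllable overcount moves the two scores, here is a small Python sketch of the same two formulas. The function name `flesch_scores` and the sample counts (one sentence, 10 words, 15 versus 18 syllables) are invented for illustration; only the coefficients come from the function above.

```python
def flesch_scores(n_sentences, n_words, n_syllables):
    """Return (FRE, FKGL) using the same coefficients as computeReadability."""
    words_per_sentence = n_words / n_sentences
    syllables_per_word = n_syllables / n_words
    fre = 206.835 - 1.015 * words_per_sentence - 84.6 * syllables_per_word
    fkgl = 0.39 * words_per_sentence + 11.8 * syllables_per_word - 15.59
    return fre, fkgl


# Hypothetical text: one sentence of 10 words.
exact = flesch_scores(1, 10, 15)        # syllables counted correctly
overcounted = flesch_scores(1, 10, 18)  # three syllables too many

print("exact:       FRE=%.2f FKGL=%.2f" % exact)
print("overcounted: FRE=%.2f FKGL=%.2f" % overcounted)
```

With these made-up numbers, three extra syllables drop the reading ease from about 69.8 to about 44.4 and raise the grade level from about 6.0 to about 9.6, so the bias always makes the text look harder than it is.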

浴红衣 2024-07-18 06:17:28

I used jsoup to do this once. Here's a sample syllable parser:

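import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

// The wrapper class name is arbitrary; it is only here so the snippet compiles.
public class SyllableLookup {
    // Fetches the Merriam-Webster entry for the word and splits its
    // syllable display on the middle dot. Falls back to the whole word
    // when the page cannot be fetched or the element is missing.
    public String[] syllables(String text) {
        String url = "https://www.merriam-webster.com/dictionary/" + text;
        String relHref;
        try {
            Document doc = Jsoup.connect(url).get();
            Element link = doc.getElementsByClass("word-syllables").first();
            if (link == null) { return new String[]{text}; }
            relHref = link.html();
        } catch (IOException e) {
            relHref = text;
        }
        return relHref.split("·");
    }
}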
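
The extract-and-split step can be tried offline, without hitting the site. Below is a rough Python sketch using only the standard library. The class `SyllableSpanParser`, the function `syllables_from_html`, and the sample markup are all invented for illustration; the only thing taken from the snippet above is the `word-syllables` class name, and the real page markup may differ or change.

```python
from html.parser import HTMLParser


class SyllableSpanParser(HTMLParser):
    """Collects the text of the first element whose class list
    contains 'word-syllables' (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.depth = 0      # > 0 while inside the target element
        self.done = False   # True once the first match has been closed
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if self.done:
            return
        if self.depth:
            # Nested tag inside the target (assumes it has a closing tag).
            self.depth += 1
        elif "word-syllables" in (dict(attrs).get("class") or "").split():
            self.depth = 1

    def handle_endtag(self, tag):
        if self.depth:
            self.depth -= 1
            if self.depth == 0:
                self.done = True

    def handle_data(self, data):
        if self.depth:
            self.parts.append(data)


def syllables_from_html(html, word):
    """Split the syllable display on the middle dot; fall back to the word."""
    parser = SyllableSpanParser()
    parser.feed(html)
    text = "".join(parser.parts).strip()
    return text.split("\u00b7") if text else [word]


# Made-up markup, only to exercise the parser.
sample = '<p><span class="word-syllables">in\u00b7vis\u00b7i\u00b7ble</span></p>'
print(syllables_from_html(sample, "invisible"))
print(syllables_from_html("<p>no entry</p>", "invisible"))
```

Keeping the parsing in a function that takes the HTML as a string also makes it easy to cache downloaded pages, which matters if the goal is a dictionary of many words.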

