Is there a good database of surnames?

Posted 2024-11-15 02:38:27

I'm looking to generate some database test data, specifically table columns containing people's names. To get a good indication of how well indexing works for name-based searches, I want to get as close as possible to real-world names and their true frequency distribution, e.g. a large number of distinct names whose frequencies follow some power-law distribution.

Ideally I'm looking for a freely available data file with names followed by a single frequency value (or equivalently a probability) per name.

Anglo-Saxon names would be fine, although names from other cultures would also be useful.


Comments (4)

少钕鈤記 2024-11-22 02:38:27

I found some US census data which fits the requirement. The only caveat is that it lists only names that occur at least 100 times...

I found it via this blog entry, which also shows the power-law distribution curve:

  • Power law curve in surnames (blog entry): http://insidemr.blogspot.com/2010/01/power-law-curve-in-surnames.html

Further to this, you can sample from the list using roulette wheel selection, e.g. (not tested):

using System;

struct NameEntry
{
    public string _name;
    public int _frequency;
}

// Sum of all _frequency values. Precalculate this.
int _frequencyTotal;

public string SampleName(NameEntry[] nameEntryArr, Random rng)
{
    // Throw the roulette ball: a point in [0, _frequencyTotal).
    double throwValue = rng.NextDouble() * _frequencyTotal;
    double accumulator = 0.0;

    for (int i = 0; i < nameEntryArr.Length; i++)
    {
        accumulator += nameEntryArr[i]._frequency;
        // Strict comparison so that zero-frequency entries are never selected.
        if (throwValue < accumulator)
        {
            return nameEntryArr[i]._name;
        }
    }

    // If we get here then the array is empty or holds only zero frequencies.
    throw new InvalidOperationException("Invalid operation. No non-zero frequencies to select.");
}
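
When many names have to be drawn from a long list, the linear scan above can be swapped for a precomputed table of running totals plus a binary search. The following Java sketch shows that variant of roulette wheel selection; the class name `WeightedNameSampler` and the demo counts are made up for illustration and do not come from the census file.

```java
import java.util.Random;

/** Illustrative sketch: samples names in proportion to their frequencies. */
final class WeightedNameSampler {
    private final String[] names;
    private final long[] cumulative; // cumulative[i] = frequencies[0] + ... + frequencies[i]
    private final long total;

    WeightedNameSampler(String[] names, long[] frequencies) {
        if (names.length != frequencies.length) {
            throw new IllegalArgumentException("names and frequencies differ in length");
        }
        this.names = names.clone();
        this.cumulative = new long[frequencies.length];
        long running = 0;
        for (int i = 0; i < frequencies.length; i++) {
            if (frequencies[i] < 0) {
                throw new IllegalArgumentException("negative frequency at index " + i);
            }
            running += frequencies[i];
            cumulative[i] = running;
        }
        if (running <= 0) {
            throw new IllegalArgumentException("no non-zero frequencies to select from");
        }
        this.total = running;
    }

    /** Returns one name, chosen with probability frequency / total. */
    String sample(Random rng) {
        // Throw the roulette ball: an integer in [1, total].
        long ball = Math.min(total, 1 + (long) (rng.nextDouble() * total));
        // Lower-bound binary search: first index whose running total reaches the ball.
        // Zero-frequency entries repeat the previous running total, so they are skipped.
        int lo = 0;
        int hi = cumulative.length - 1;
        while (lo < hi) {
            int mid = (lo + hi) >>> 1;
            if (cumulative[mid] >= ball) {
                hi = mid;
            } else {
                lo = mid + 1;
            }
        }
        return names[lo];
    }

    public static void main(String[] args) {
        // Made-up counts, for demonstration only.
        String[] demoNames = {"Smith", "Jones", "Taylor"};
        long[] demoCounts = {50, 30, 20};
        WeightedNameSampler sampler = new WeightedNameSampler(demoNames, demoCounts);
        Random rng = new Random();
        for (int i = 0; i < 5; i++) {
            System.out.println(sampler.sample(rng));
        }
    }
}
```

Building the table costs one pass over the list, after which each draw takes a logarithmic number of comparisons, which may matter when filling a large test table.
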
情深已缘浅 2024-11-22 02:38:27

Oxford University provides word lists on its public FTP site as compressed .gz files at ftp://ftp.ox.ac.uk/pub/wordlists/names/.
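
Once one of those .gz files has been downloaded, it can be read without unpacking it first. The sketch below assumes the file holds one name per line in a Latin-1 compatible encoding; both are assumptions about the files, and the class name `GzipWordList` is made up for illustration. If the lists turn out to carry no counts, they would supply the set of names but not the frequency distribution the question asks for.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.GZIPInputStream;

/** Illustrative sketch: reads a gzip-compressed word list into memory. */
final class GzipWordList {

    /** Returns the non-blank lines of a gzip-compressed text file, trimmed. */
    static List<String> readNames(Path gzFile) throws IOException {
        List<String> names = new ArrayList<>();
        // Assumption: ISO-8859-1 text, one name per line.
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(
                new GZIPInputStream(Files.newInputStream(gzFile)),
                StandardCharsets.ISO_8859_1))) {
            String line;
            while ((line = reader.readLine()) != null) {
                String name = line.trim();
                if (!name.isEmpty()) {
                    names.add(name);
                }
            }
        }
        return names;
    }

    public static void main(String[] args) throws IOException {
        // Pass the path of a downloaded .gz file as the first argument.
        List<String> names = readNames(Paths.get(args[0]));
        System.out.println(names.size() + " names read");
    }
}
```
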

泅人 2024-11-22 02:38:27

You can also check out the jFairy project. It is written in Java and produces fake data such as names: http://codearte.github.io/jfairy/

Fairy fairy = Fairy.create();
Person person = fairy.person();
System.out.println(person.firstName());           // e.g. Chloe
System.out.println(person.lastName());            // e.g. Barker
System.out.println(person.fullName());            // e.g. Chloe Barker
可遇━不可求 2024-11-22 02:38:27

To generate realistic database test data with true name-frequency distributions, I recommend exploring the free preview offered by census.name. They provide a database with millions of names from various cultures, including frequency distributions that can help you simulate real-world scenarios. While the full database is paid, the Name Census Top 100 is available for free on GitHub and Kaggle and includes names with frequency values, which makes it a good starting point.
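
A file of names with frequency values can be turned into the per-name probabilities the question mentions with a few lines of code. The Java sketch below assumes a simple two-column `name,frequency` layout with an optional header row; the actual column layout of the Top 100 files is not known here, and the class name `NameFrequencies` and the sample counts are made up for illustration.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Illustrative sketch: parses "name,frequency" lines and normalizes the counts. */
final class NameFrequencies {

    /** Parses lines such as "Smith,3" into name -> count, skipping malformed rows. */
    static Map<String, Long> parse(List<String> lines) {
        Map<String, Long> counts = new LinkedHashMap<>();
        for (String line : lines) {
            String[] parts = line.split(",");
            if (parts.length < 2) {
                continue; // not a name,frequency pair
            }
            try {
                long count = Long.parseLong(parts[1].trim());
                // Repeated names have their counts added together.
                counts.merge(parts[0].trim(), count, Long::sum);
            } catch (NumberFormatException e) {
                // Header row or non-numeric count: skip it.
            }
        }
        return counts;
    }

    /** Converts raw counts into probabilities that sum to 1. */
    static Map<String, Double> toProbabilities(Map<String, Long> counts) {
        long total = 0;
        for (long c : counts.values()) {
            total += c;
        }
        if (total <= 0) {
            throw new IllegalArgumentException("total frequency must be positive");
        }
        Map<String, Double> probabilities = new LinkedHashMap<>();
        for (Map.Entry<String, Long> e : counts.entrySet()) {
            probabilities.put(e.getKey(), e.getValue() / (double) total);
        }
        return probabilities;
    }

    public static void main(String[] args) {
        // Made-up rows, for demonstration only.
        List<String> rows = Arrays.asList("name,frequency", "Smith,3", "Jones,1");
        System.out.println(toProbabilities(parse(rows)));
    }
}
```
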
