Java: Getting a random line from a large file

Posted 2024-12-08 01:22:06

I've seen how to get a random line from a text file, but the method stated there (the accepted answer) runs horrendously slowly. It is very slow on my 598 KB text file, and still slow on a cut-down version of that file which keeps only one out of every 20 lines, at 20 KB. I never get past the "a" section (it's a wordlist).

The original file has 64141 lines; the shortened one has 2138 lines. To generate these files, I took the Linux Mint 11 /usr/share/dict/american-english wordlist and used grep to remove anything with uppercase or an apostrophe (grep -v [[:upper:]] | grep -v \').

The code I'm using is

String result = null;
final Random rand = new Random();
int n = 0;
for (final Scanner sc = new Scanner(wordList); sc.hasNext();) {
    n++;
    if (rand.nextInt(n) == 0) {
        final String line = sc.nextLine();
        boolean isOK = true;
        for (final char c : line.toCharArray()) {
            if (!(constraints.isAllowed(c))) {
                isOK = false;
                break;
            }
        }
        if (isOK) {
            result = line;
        }
        System.out.println(result);
    }
}
return result;

which is slightly adapted from Itay's answer.

The object constraints is a KeyboardConstraints, which basically has the one method isAllowed(char):

public boolean isAllowed(final char key) {
    if (allAllowed) {
        return true;
    } else {
        return allowedKeys.contains(key);
    }
}

where allowedKeys and allAllowed are provided in the constructor. The constraints variable used here has "aeouhtns".toCharArray() as its allowedKeys with allAllowed off.

Essentially, what I want the method to do is to pick a random word that satisfies the constraints (e.g. for these constraints, "outvote" would work, but not "worker", because "w" is not in "aeouhtns".toCharArray()).

How can I do this?

Comments (2)

执手闯天涯 2024-12-15 01:22:06

You have a bug in your implementation. You should read the line before you choose a random number. Change this:

n++;
if (rand.nextInt(n) == 0) {
    final String line = sc.nextLine();

To this (as in the original answer):

n++;
final String line = sc.nextLine();
if (rand.nextInt(n) == 0) {

You should also check the constraints before drawing a random number. If a line fails the constraints it should be ignored, something like this:

n++;

String line;
do {
    if (!sc.hasNext()) { return result; }
    line = sc.nextLine();
} while (!meetsConstraints(line));

if (rand.nextInt(n) == 0) {
    result = line;
}
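
Putting the two fixes together, the whole method could look roughly like the sketch below. It is only an illustration under some assumptions: the class name RandomAllowedLine, the pick method and the meetsConstraints helper are made up here, and the allowed characters are passed as a plain String instead of the KeyboardConstraints object from the question. Each line is read first, lines that fail the constraints are skipped without being counted, and the k-th accepted line replaces the current result with probability 1/k, so every accepted line ends up equally likely.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.Random;

public class RandomAllowedLine {

    /** Hypothetical helper: true if every character of line occurs in allowed. */
    public static boolean meetsConstraints(String line, String allowed) {
        for (int i = 0; i < line.length(); i++) {
            if (allowed.indexOf(line.charAt(i)) < 0) {
                return false;
            }
        }
        return true; // note: an empty line passes vacuously
    }

    /** Returns a uniformly random line that meets the constraints, or null if none does. */
    public static String pick(Reader source, String allowed, Random rand) throws IOException {
        String result = null;
        int n = 0; // number of accepted lines seen so far
        try (BufferedReader br = new BufferedReader(source)) {
            for (String line; (line = br.readLine()) != null; ) {
                if (!meetsConstraints(line, allowed)) {
                    continue; // rejected lines are not counted
                }
                n++;
                if (rand.nextInt(n) == 0) { // keep the n-th accepted line with probability 1/n
                    result = line;
                }
            }
        }
        return result;
    }

    public static void main(String[] args) throws IOException {
        // Small in-memory word list standing in for the real file.
        String words = "honest\nworker\nsunset\nzebra\n";
        System.out.println(pick(new StringReader(words), "aeouhtns", new Random()));
    }
}
```

For the real word list, a java.io.FileReader on the file would take the place of the StringReader. The file is read once from start to finish on every call, which is fine for a single pick but wasteful if many random words are needed.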
欲拥i 2024-12-15 01:22:06

I would read in all the lines, save these somewhere and then select a random line from that. This takes a trivial amount of time because a single file of less than 1 MB is a trivial size these days.

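import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.LinkedHashSet;
import java.util.Random;
import java.util.Set;

public class Main {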
    public static void main(String... args) throws IOException {
        long start = System.nanoTime();
        RandomDict dict = RandomDict.load("/usr/share/dict/american-english");
        final int count = 1000000;
        for (int i = 0; i < count; i++)
            dict.nextWord();
        long time = System.nanoTime() - start;
        System.out.printf("Took %.3f seconds to load and find %,d random words.", time / 1e9, count);
    }
}

class RandomDict {
    public static final String[] NO_STRINGS = {};
    final Random random = new Random();
    final String[] words;

    public RandomDict(String[] words) {
        this.words = words;
    }

    public static RandomDict load(String filename) throws IOException {
        BufferedReader br = new BufferedReader(new FileReader(filename));
        Set<String> words = new LinkedHashSet<String>();
        try {
            for (String line; (line = br.readLine()) != null; ) {
                if (line.indexOf('\'') >= 0) continue;
                words.add(line.toLowerCase());
            }
        } finally {
            br.close();
        }
        return new RandomDict(words.toArray(NO_STRINGS));
    }

    public String nextWord() {
        return words[random.nextInt(words.length)];
    }
}

prints

Took 0.091 seconds to load and find 1,000,000 random words.
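
The code above does not apply the keyboard constraints from the question. One way to combine the two, sketched here under assumptions, is to filter while loading so that only acceptable words are kept in memory. FilteredDict and its isAllowed helper are hypothetical names, and the allowed characters are passed as a String rather than through the asker's KeyboardConstraints class.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

public class FilteredDict {
    private final List<String> words = new ArrayList<>();
    private final Random random = new Random();

    /** Reads every line once and keeps only non-empty words made of allowed characters. */
    public FilteredDict(Reader source, String allowed) throws IOException {
        try (BufferedReader br = new BufferedReader(source)) {
            for (String line; (line = br.readLine()) != null; ) {
                if (!line.isEmpty() && isAllowed(line, allowed)) {
                    words.add(line);
                }
            }
        }
    }

    /** Hypothetical helper: true if every character of word occurs in allowed. */
    static boolean isAllowed(String word, String allowed) {
        for (int i = 0; i < word.length(); i++) {
            if (allowed.indexOf(word.charAt(i)) < 0) {
                return false;
            }
        }
        return true;
    }

    public int size() {
        return words.size();
    }

    /** Returns a uniformly random kept word, or null if no word passed the filter. */
    public String nextWord() {
        return words.isEmpty() ? null : words.get(random.nextInt(words.size()));
    }

    public static void main(String[] args) throws IOException {
        // Small in-memory word list standing in for the real file.
        FilteredDict dict = new FilteredDict(
                new StringReader("honest\nworker\nsunset\nzebra\n"), "aeouhtns");
        System.out.println(dict.size() + " words kept, e.g. " + dict.nextWord());
    }
}
```

After the one-time load, each nextWord() call is a single list lookup, which suits the case where many random words are drawn with the same constraints.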