Fastest way to find a string in a text file with Java

Posted 2024-10-27 11:47:43

What is the fastest way to check if a file contains a certain string or number?

Comments (4)

淡看悲欢离合 2024-11-03 11:47:43

Have a look at the Scanner class that ships with the JDK (see the official documentation). It lets you skip parts of the input (here, the text file) and match against a regular expression of your choice. I'm not sure this is the most efficient approach, but it is certainly simple. You might also take a look at this example, which will help you get started.
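
A minimal sketch of the Scanner approach, assuming the file is UTF-8 and the search term should match literally. The class name ScannerSearch and the method contains are made up for illustration; Scanner.findWithinHorizon and Pattern.quote are standard JDK calls.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.util.Scanner;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the JDK.
final class ScannerSearch {

    /** Returns true if the file contains the literal text in needle. */
    static boolean contains(Path file, String needle) throws IOException {
        // Pattern.quote makes the needle match literally instead of as a regex.
        Pattern pattern = Pattern.compile(Pattern.quote(needle));
        // Assumption: the file is UTF-8 encoded.
        try (Scanner scanner = new Scanner(file, StandardCharsets.UTF_8.name())) {
            // A horizon of 0 means the search is not bounded and may run to the end of input.
            return scanner.findWithinHorizon(pattern, 0) != null;
        }
    }
}
```

Calling ScannerSearch.contains(Paths.get("data.txt"), "12345") would report whether the text occurs anywhere in the file. With an unbounded horizon the scanner may buffer a large part of the input, so this is probably better suited to convenience than to raw speed.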

坚持沉默 2024-11-03 11:47:43

Untried, but probably the fastest mechanism is to first take your search key and encode it the same way as the file.

For example, if you know the file is UTF-8, take your key and encode it from a String (which is UTF-16) into a UTF-8 byte array. This is important because by encoding down to the file representation, you're only encoding the key. Using standard Java Readers goes the other way: they convert the whole file to UTF-16.
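
As a small illustration of that step, assuming a UTF-8 file: String.getBytes(Charset) performs the one-off encoding of the key. The class name KeyEncoding is hypothetical.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, for illustration only.
final class KeyEncoding {

    /** Encodes the search key once, into the same encoding as the file (assumed UTF-8). */
    static byte[] encodeKey(String key) {
        return key.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] ascii = encodeKey("error");         // ASCII characters take one byte each
        byte[] accented = encodeKey("caf\u00e9");  // the accented e takes two bytes in UTF-8
        System.out.println(ascii.length + " " + accented.length);
    }
}
```

The byte array, not the String, is what gets compared against the file contents, so the key's byte length can differ from its character count.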

Now that you have a proper key, in bytes, use NIO to create a MappedByteBuffer for the file. This maps the file into the virtual memory space.

Finally, implement a Boyer-Moore algorithm for string search, comparing the bytes of the key against the bytes of the file via the mapped region.

There may well be a faster way, but this solves the bulk of the problems with searching a text file in Java. It leverages the VM to avoid copying large chunks of the file, and it skips the step of converting whatever encoding the file is in to UTF-16, which Java uses internally.
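
The steps above could be sketched roughly as follows. This is an untuned illustration, not a benchmarked implementation: it assumes a UTF-8 file that fits in a single mapping (under 2 GB), and it swaps in the simpler Boyer-Moore-Horspool variant (bad-character table only) rather than full Boyer-Moore. The class and method names are made up.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.Arrays;

// Hypothetical helper, for illustration only.
final class MappedFileSearch {

    /** Returns the byte offset of the first match of key in buf, or -1 if there is none. */
    static int indexOf(ByteBuffer buf, byte[] key) {
        int n = buf.limit();
        int m = key.length;
        if (m == 0) return 0;
        // Bad-character table: how far to shift, keyed by the last byte of the current window.
        int[] shift = new int[256];
        Arrays.fill(shift, m);
        for (int i = 0; i < m - 1; i++) {
            shift[key[i] & 0xFF] = m - 1 - i;
        }
        int pos = 0;
        while (pos <= n - m) {
            int j = m - 1;
            // Compare right to left using absolute gets, which leave the buffer position alone.
            while (j >= 0 && buf.get(pos + j) == key[j]) j--;
            if (j < 0) return pos;
            pos += shift[buf.get(pos + m - 1) & 0xFF];
        }
        return -1;
    }

    /** Encodes the key as UTF-8 (an assumption about the file) and searches the mapped file. */
    static boolean contains(Path file, String key) throws IOException {
        byte[] keyBytes = key.getBytes(StandardCharsets.UTF_8);
        try (FileChannel channel = FileChannel.open(file, StandardOpenOption.READ)) {
            long size = channel.size();
            if (size == 0) return keyBytes.length == 0;
            if (size > Integer.MAX_VALUE) {
                throw new IOException("file too large for a single mapping");
            }
            MappedByteBuffer buf = channel.map(FileChannel.MapMode.READ_ONLY, 0, size);
            return indexOf(buf, keyBytes) >= 0;
        }
    }
}
```

Larger files would need to be mapped in several overlapping windows, which this sketch leaves out.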

风柔一江水 2024-11-03 11:47:43

Check out the following algorithms:

or, if you want to find one of a set of strings:

  • Rabin-Karp (http://en.wikipedia.org/wiki/Rabin-karp)
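
For the set-of-strings case, a Rabin-Karp search could look roughly like the sketch below. It makes a simplifying assumption that all patterns have the same non-zero length, so a single rolling hash covers the whole set. The class name, the constants BASE and MOD, and the method names are illustrative choices, not taken from any library.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical helper, for illustration only.
final class RabinKarpSet {
    private static final long BASE = 257;
    private static final long MOD = 1_000_000_007L;

    /**
     * Returns the index of the first occurrence of any pattern in text, or -1.
     * Assumption: all patterns share the same non-zero length.
     */
    static int indexOfAny(String text, List<String> patterns) {
        if (patterns.isEmpty()) return -1;
        int m = patterns.get(0).length();
        if (m == 0) throw new IllegalArgumentException("patterns must not be empty strings");
        Map<Long, Set<String>> byHash = new HashMap<>();
        for (String p : patterns) {
            if (p.length() != m) {
                throw new IllegalArgumentException("patterns must have equal length");
            }
            byHash.computeIfAbsent(hash(p, 0, m), h -> new HashSet<>()).add(p);
        }
        if (text.length() < m) return -1;
        // BASE^(m-1) mod MOD, used to drop the leading character when the window slides.
        long high = 1;
        for (int i = 1; i < m; i++) high = high * BASE % MOD;
        long h = hash(text, 0, m);
        for (int i = 0; ; i++) {
            Set<String> candidates = byHash.get(h);
            // Equal hashes are only a hint; confirm with a real comparison.
            if (candidates != null && candidates.contains(text.substring(i, i + m))) return i;
            if (i + m >= text.length()) return -1;
            h = (h - text.charAt(i) * high % MOD + MOD) % MOD;  // remove the leading char
            h = (h * BASE + text.charAt(i + m)) % MOD;          // append the next char
        }
    }

    private static long hash(String s, int from, int len) {
        long h = 0;
        for (int i = from; i < from + len; i++) h = (h * BASE + s.charAt(i)) % MOD;
        return h;
    }
}
```

Patterns of differing lengths would need one hash table per length, or a different algorithm altogether.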

一梦浮鱼 2024-11-03 11:47:43

The best implementation I've found is in MIMEParser:
https://github.com/samskivert/ikvm-openjdk/blob/master/build/linux-amd64/impsrc/com/sun/xml/internal/org/jvnet/mimepull/MIMEParser.java

/**
 * Finds the boundary in the given buffer using Boyer-Moore algo.
 * Copied from java.util.regex.Pattern.java
 *
 * @param mybuf boundary to be searched in this mybuf
 * @param off start index in mybuf
 * @param len number of bytes in mybuf
 *
 * @return -1 if there is no match or index where the match starts
 */
private int match(byte[] mybuf, int off, int len) {

You will also need:

private void compileBoundaryPattern();