Java string optimization: load-in-place algorithm

Posted on 2024-12-06 20:39:44

I need to optimize the loading and parsing of a CSV file (strings). The best approach I know of is a load-in-place algorithm, and I implemented it successfully using JNI and a C++ DLL that loads the data directly from a file built from the already-parsed CSV data.

That would have been fine if it stopped there, but the scheme only made loading 15% faster (by no longer parsing the data). One reason it is not as fast as I first expected is that the Java client uses jstring, so I have to convert the data again from char* to jstring.

Ideally I would skip that conversion step and load the data in place directly into the jstring objects. Instead of copying the loaded-in-place data, the jstring would point directly into that chunk of memory (note that the data would be made of jchars instead of chars). The real downside is that we would need to make sure the garbage collector doesn't collect that data (by keeping a reference to it, maybe?), but it should be feasible, no?

I think I have two options:

1- Load the data in Java (no more JNI) and create the strings from chars that point into the loaded data. But I would need a way to avoid duplicating the data when creating a String.

2- Continue using JNI to "manually" create and set the jstring variable, and make sure the garbage collector options are set so that it leaves the data alone. For instance:

jstring str; 
str.data = loadedinplacedata;  // assign data pointer
return str;

I'm not sure whether this is possible, but I wouldn't mind saving the jstring directly into the file and reloading it like this:

jstring * str = (jstring *)&loadedinplacedata[someoffset];
return * str;

I'm aware that this is not the usual Java thing, but I'm pretty sure Java is extensible enough to be able to do that. And it's not like I really have a choice in the matter... the project is already 3 years old and it needs to work. =S

This is the JNI code (C++):

const jchar * data = GetData(id, row, col); // pointer to a NUL-terminated UTF-16 string
unsigned int len = wcslen( (wchar_t*)data ); // assumes wchar_t is 16 bits wide, as on Windows
// Ideally this call would not duplicate the data.
jstring str = env->NewString( data, len ); 
return str;

Note: The code above made it 20% faster (instead of 15%) by using Unicode data instead of UTF-8 (NewString instead of NewStringUTF). This suggests that if I can remove or optimize that step, I would get a sizable performance increase.

Comments (3)

作妖 2024-12-13 20:39:44

I've never worked with JNI, but... does it make any sense to have it return a custom class implementing CharSequence, and maybe a few other interfaces like Comparable< CharSequence >, instead of a String? It seems like you'd be less likely to have data corruption problems that way.
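To make that suggestion concrete, here is a minimal sketch of such a class. The name `CharSlice` and its fields are hypothetical, not from any library, and it assumes the loaded data is already available to Java as a `char[]`. It only shows that a `CharSequence` view can share the backing array instead of copying it.

```java
/** Hypothetical zero-copy view over a shared char[]; not part of any library. */
final class CharSlice implements CharSequence, Comparable<CharSequence> {
    private final char[] data;  // shared backing array, never copied by this class
    private final int offset;   // index of the first character of this view
    private final int length;   // number of characters in this view

    CharSlice(char[] data, int offset, int length) {
        if (offset < 0 || length < 0 || offset > data.length - length) {
            throw new IndexOutOfBoundsException("offset=" + offset + ", length=" + length);
        }
        this.data = data;
        this.offset = offset;
        this.length = length;
    }

    @Override
    public int length() {
        return length;
    }

    @Override
    public char charAt(int index) {
        if (index < 0 || index >= length) {
            throw new IndexOutOfBoundsException("index=" + index);
        }
        return data[offset + index];
    }

    @Override
    public CharSequence subSequence(int start, int end) {
        if (start < 0 || end > length || start > end) {
            throw new IndexOutOfBoundsException("start=" + start + ", end=" + end);
        }
        // The new view shares the same backing array.
        return new CharSlice(data, offset + start, end - start);
    }

    @Override
    public int compareTo(CharSequence other) {
        // Plain code-unit comparison, the same ordering String.compareTo uses.
        int n = Math.min(length, other.length());
        for (int i = 0; i < n; i++) {
            int diff = Character.compare(charAt(i), other.charAt(i));
            if (diff != 0) {
                return diff;
            }
        }
        return Integer.compare(length, other.length());
    }

    /** Copies the characters; call only where a real String is required. */
    @Override
    public String toString() {
        return new String(data, offset, length);
    }
}
```

The trade-off is that client code has to accept `CharSequence` rather than `String`, and this sketch does not override `equals` or `hashCode`, so it is not yet usable as a map key.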

安静被遗忘 2024-12-13 20:39:44

I think first you have to understand why the C++ version runs 15% faster, and why that performance improvement is not directly translatable into Java. Why can't you write the code 15% faster in Java?

Let's look at your problem. You've eliminated the parsing by using a C++ DLL. (Why could this not have been done in Java?) And then, as I understand it:

  1. You're proposing to manipulate the contents of the jstrings directly
  2. You want to prevent the garbage collector from touching these modified jstrings (by keeping a reference to them), which potentially modifies the behaviour of the JVM and interferes with the garbage collector when it does eventually collect them.

Will you 'fix' these references before you allow them to be garbage collected?

If you propose doing your own memory management, why are you using java at all? Why not just do it in pure C++?

Assuming that you wish to continue in Java: when you create a String, the String itself is a new Object, but the data that it points to is not necessarily new. You can test this by calling String.intern(), using the following code:

public static void main(String[] args) {
    String s3 = "foofoo";

    String s1 = call("foo");
    String s2 = call("foo");

    System.out.println("s1 == s2=" + (s1 == s2));
    System.out.println("s1.intern() == s2.intern()=" + (s1.intern() == s2.intern()));
    System.out.println("s1.intern() == s3.intern()=" + (s1.intern() == s3.intern()));

    System.out.println("s1.substring(3) == s2.substring(3)=" + (s1.substring(3) == s2.substring(3)));
    System.out.println("s1.substring(3).intern() == s2.substring(3).intern()=" + (s1.substring(3).intern() == s2.substring(3).intern()));
}

public static String call(String s) {
    return s + "foo";        
}

This produces:

s1 == s2=false
s1.intern() == s2.intern()=true
s1.intern() == s3.intern()=true
s1.substring(3) == s2.substring(3)=false
s1.substring(3).intern() == s2.substring(3).intern()=true

So you can see that although the String objects are different, their interned forms are the same object, so the data (the actual characters) can be shared. Your modifications may therefore not be that relevant, since the JVM may already be doing this for you. It's also worth saying that if you start modifying the internals of jstrings, you may well break this.

My suggestion would be to find out what you can do in terms of algorithms. Development in pure Java is almost always quicker than with Java and JNI combined, and you have a much better chance of finding a better solution in pure Java.
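As one possible pure-Java direction (the asker's option 1), the data file could be memory-mapped and exposed as a `CharBuffer`, which already implements `CharSequence`. The sketch below is only an illustration under assumptions: the class name `MappedChars` is made up, and the file is assumed to hold raw UTF-16 code units with no header or byte-order mark.

```java
import java.io.IOException;
import java.nio.ByteOrder;
import java.nio.CharBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

/** Hypothetical helper; assumes the file contains raw UTF-16 code units only. */
final class MappedChars {
    private MappedChars() {}

    /** Maps the whole file read-only and views it as chars without copying it onto the heap. */
    static CharBuffer map(Path file, ByteOrder order) throws IOException {
        try (FileChannel channel = FileChannel.open(file, StandardOpenOption.READ)) {
            // A mapping remains valid after its channel has been closed.
            return channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size())
                          .order(order)
                          .asCharBuffer();
        }
    }

    /** Returns a view of the chars in [start, end) that shares the mapped memory. */
    static CharSequence cell(CharBuffer chars, int start, int end) {
        return chars.subSequence(start, end);
    }
}
```

Calling `toString()` on a cell still copies its characters into a new String, so the saving only holds for as long as the client can work with `CharSequence`.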

冰雪之触 2024-12-13 20:39:44

Well... it seems that what I wanted to do is not "supported" by Java unless I hack it. I believe it would be possible by using GetStringCritical to get the address of the actual char array and then working out the number of characters and so on, but this is way beyond "safe" programming.

The best workaround I found was to create a hash table in Java keyed by a unique identifier computed while creating my data file (acting similarly to .intern()). If the string is not in the hash table, it is queried through the DLL and saved in the hash table.

Data file:
numrows, numcols,
for each cell, an integer value (in my case the offset in memory pointing to the string)
for each cell, the string ending with \0

By using the offset value, I can somewhat reduce the number of string creations and string queries. I tried using a global reference (globalref) to keep the strings inside the DLL, but that made it 4 times slower.
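The Java side of that workaround might look roughly like the sketch below. The class name `OffsetStringCache` is made up, and the `loader` function is a stand-in for the native call into the DLL, whose real signature isn't shown here.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.IntFunction;

/** Hypothetical cache: each distinct offset is resolved to a String at most once. */
final class OffsetStringCache {
    private final Map<Integer, String> byOffset = new HashMap<>();
    private final IntFunction<String> loader; // stand-in for the JNI lookup into the DLL

    OffsetStringCache(IntFunction<String> loader) {
        this.loader = loader;
    }

    /** Returns the cached String for this offset, loading it on first use. */
    String get(int offset) {
        return byOffset.computeIfAbsent(offset, loader::apply);
    }
}
```

Cells that share an offset then share one String instance, so the JNI boundary is crossed only once per distinct string. This version is not thread-safe; a `ConcurrentHashMap` would be needed if several threads query it.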
