Reading from a URLConnection input stream is slow (even with byte[] and buffers)

OK, so after spending two days trying to figure out the problem and reading about a bazillion articles, I finally decided to ask for some advice (my first time here).

Now to the issue at hand - I am writing a program which will parse API data from a game, namely battle logs. There will be A LOT of entries in the database (20+ million), so the parsing speed for each battle log page matters quite a bit.

The pages to be parsed look like this: http://api.erepublik.com/v1/feeds/battle_logs/10000/0
(view the source if using Chrome, as it doesn't display the page correctly). Each page has 1000 hit entries, followed by a little battle info (the last page will obviously have fewer than 1000). On average, a page contains 175000 characters, UTF-8 encoded, in XML format (v1.0). The program will run locally on a good PC, and memory is virtually unlimited (so creating a byte[250000] is quite OK).

The format never changes, which is quite convenient.

Now, I started off as usual:

// global vars, class declaration skipped

public WebObject(String url_string, int connection_timeout, int read_timeout, boolean redirects_allowed, String user_agent)
        throws java.net.MalformedURLException, java.io.IOException {
    // Open a URL connection
    java.net.URL url = new java.net.URL(url_string);
    java.net.URLConnection uconn = url.openConnection();
    if (!(uconn instanceof java.net.HttpURLConnection)) {
        throw new java.lang.IllegalArgumentException("URL protocol must be HTTP");
    }
    conn = (java.net.HttpURLConnection) uconn;
    conn.setRequestProperty("User-agent", user_agent);
}

public void executeConnection() throws IOException {
    try {
        is = conn.getInputStream(); // global var
        l = conn.getContentLength(); // global var
    } catch (Exception e) {
        // handling code skipped
    }
}

// getContentStream and getContentLength methods, which just return 'is' and 'l', are skipped

Here is where the fun part began.
I ran some profiling (using System.currentTimeMillis()) to find out what takes long and what doesn't.
The call to this method takes only 200ms on average:

public InputStream getWebPageAsStream(int battle_id, int page) throws Exception {
    String url = "http://api.erepublik.com/v1/feeds/battle_logs/" + battle_id + "/" + page;
    WebObject wobj = new WebObject(url, 10000, 10000, true, "Mozilla/5.0 "
            + "(Windows; U; Windows NT 5.1; en-US; rv: Gecko/20100401 Firefox/3.6.3 ( .NET CLR 3.5.30729)");
    wobj.executeConnection(); // opens the stream and records the content length (see above)
    l = wobj.getContentLength(); // global variable
    return wobj.getContentStream(); // returns 'is' stream
}

200ms is to be expected from a network operation, and I am fine with it.
BUT when I parse the InputStream in any way (read it into a string / use the Java XML parser / read it into another ByteArrayInputStream), the process takes over 1000ms!

For example, this code takes 1000ms if I pass the stream I got above from getContentStream() ('is') directly to this method:

public static Document convertToXML(InputStream is) throws ParserConfigurationException, IOException, SAXException {
    DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
    DocumentBuilder db = dbf.newDocumentBuilder();
    Document doc = db.parse(is);
    return doc;
}

This code, too, takes around 920ms if the initial InputStream 'is' is passed in (don't read too much into the code itself - it just extracts the data I need by directly counting characters, which can be done thanks to the rigid API feed format):

public static parsedBattlePage convertBattleToXMLWithoutDOM(InputStream is) throws IOException {
    // Point A
    BufferedReader br = new BufferedReader(new InputStreamReader(is));
    LinkedList ll = new LinkedList();
    String str = br.readLine();
    while (str != null) {
        ll.add(str); // collect every line of the page
        str = br.readLine();
    }
    if (((String) ll.get(1)).indexOf("error") != -1) {
        return new parsedBattlePage(null, null, true, -1);
    }
    // Point B
    Iterator it = ll.iterator();
    String[][] hits_arr = new String[1000][4];
    String t_str = (String) it.next();
    String tmp = null;
    int j = 0;
    // Each hit is four consecutive lines; the fixed offsets strip the surrounding tags
    for (int i = 0; t_str.indexOf("time") != -1; i++) {
        hits_arr[i][0] = t_str.substring(12, t_str.length() - 11);
        tmp = (String) it.next();
        hits_arr[i][1] = tmp.substring(14, tmp.length() - 9);
        tmp = (String) it.next();
        hits_arr[i][2] = tmp.substring(15, tmp.length() - 10);
        tmp = (String) it.next();
        hits_arr[i][3] = tmp.substring(18, tmp.length() - 13);
        t_str = (String) it.next();
    }
    // The battle info lines follow the hits
    String[] b_info_arr = new String[9];
    int[] space_nums = {13, 10, 13, 11, 11, 12, 5, 10, 13};
    for (int i = 0; i < space_nums.length; i++) {
        tmp = (String) it.next();
        b_info_arr[i] = tmp.substring(space_nums[i] + 4, tmp.length() - space_nums[i] - 1);
    }
    // Point C
    return new parsedBattlePage(hits_arr, b_info_arr, false, j);
}

I have tried replacing the default BufferedReader with

BufferedReader br = new BufferedReader(new InputStreamReader(is), 250000);

This didn't change much.
My second try was to replace the code between A and B with:
Iterator it = IOUtils.lineIterator(is, "UTF-8");

Same result, except this time A-B was 0ms and B-C was 1000ms, so every call to it.next() must have been consuming significant time. (IOUtils is from the apache-commons-io library.)

And here is the culprit - the time taken to parse the stream to strings, be it by an iterator or a BufferedReader, was about 1000ms in ALL cases, while the rest of the code took 0ms (i.e. irrelevant). This means that parsing the stream into a LinkedList, or iterating over it, was for some reason eating up a lot of my system resources. The question was - why? Is it just the way Java is made? No, that would just be stupid, so I did another experiment.

In my main method I added after the getWebPageAsStream():

    // Point A
    ba = new byte[l]; // 'l' comes from wobj.getContentLength() above
    offset = 0;
    // read() may return fewer bytes than requested, so loop until the buffer is full or EOF
    while (offset < l && (bytesRead = is.read(ba, offset, l - offset)) != -1) {
        offset += bytesRead;
    }
    // Point B
    InputStream is2 = new ByteArrayInputStream(ba);
    // Now just working with 'is2' - the "copied" stream

The InputStream->byte[] conversion again took 1000ms. This is the way many people suggest reading an InputStream, and still it is slow. And guess what - the 2 parser methods above (convertToXML() and convertBattleToXMLWithoutDOM()), when passed 'is2' instead of 'is', took under 50ms to complete in all 4 cases.

I read a suggestion that the stream waits for the connection to close before unblocking, so I tried using HttpComponents Client 4.0 (http://hc.apache.org/httpcomponents-client/index.html) instead, but the initial InputStream took just as long to parse. For example, this code:

public InputStream getWebPageAsStream2(int battle_id, int page) throws Exception {
    String url = "http://api.erepublik.com/v1/feeds/battle_logs/" + battle_id + "/" + page;
    HttpClient httpclient = new DefaultHttpClient();
    HttpGet httpget = new HttpGet(url);
    HttpParams p = new BasicHttpParams();
    HttpConnectionParams.setSocketBufferSize(p, 250000);
    HttpConnectionParams.setStaleCheckingEnabled(p, false);
    HttpConnectionParams.setConnectionTimeout(p, 5000);
    httpget.setParams(p); // without this, the params built above are never applied
    HttpResponse response = httpclient.execute(httpget);
    HttpEntity entity = response.getEntity();
    l = (int) entity.getContentLength();
    return entity.getContent();
}

took even longer to process (50ms more for just the network), and the stream parsing times remained the same. Obviously the HttpClient and its properties could be created once instead of on every call (faster network time), but the stream issue won't be affected by that.

So we come to the central problem - why does the initial URLConnection InputStream (or HttpClient InputStream) take so long to process, while any stream of the same size and content created locally is orders of magnitude faster? I mean, the initial response is already somewhere in RAM, and I can't see any good reason why it is processed so slowly compared to when the same stream is just created from a byte[].

Considering I have to parse millions of entries and thousands of pages like that, a total processing time of almost 1.5s/page seems WAY too long.

Any ideas?

P.S. Please ask if any more code is required - the only thing I do after parsing is make a PreparedStatement and put the entries into JavaDB in packs of 1000+, and the performance is OK at ~200ms/1000 entries. It could probably be optimized with more cache, but I didn't look into it much.

失而复得 2024-09-08 23:08:41

It takes longer because it is reading from the remote server. Your executeConnection() method just creates the stream; it doesn't actually read the entire response from the server. That happens only once you start reading from the stream, so the ~1000ms you measure while "parsing" is mostly download time.
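
To make that concrete, here is a minimal sketch that times the three phases separately: obtaining the stream, draining it, and re-reading the in-memory copy. It does not touch the network. `slowStream` is a hypothetical stand-in for a socket-backed stream that hands out small chunks with a delay between them, and the class name, chunk size, and delay are assumptions chosen only for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

final class StreamTiming {

    /** Drains the stream into memory. For a network stream, this is where the download happens. */
    public static byte[] readFully(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[8192];
        int n;
        while ((n = in.read(buf)) != -1) {
            out.write(buf, 0, n);
        }
        return out.toByteArray();
    }

    /** Hypothetical stand-in for a socket stream: returns at most 'chunk' bytes per read, after a delay. */
    public static InputStream slowStream(final byte[] data, final int chunk, final long delayMillis) {
        return new InputStream() {
            private int pos = 0;

            @Override
            public int read() throws IOException {
                byte[] one = new byte[1];
                return read(one, 0, 1) == -1 ? -1 : (one[0] & 0xFF);
            }

            @Override
            public int read(byte[] b, int off, int len) throws IOException {
                if (len == 0) return 0;
                if (pos >= data.length) return -1;
                try {
                    Thread.sleep(delayMillis); // simulate waiting for the next packet
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    throw new IOException("interrupted while waiting for data");
                }
                int n = Math.min(Math.min(len, chunk), data.length - pos);
                System.arraycopy(data, pos, b, off, n);
                pos += n;
                return n;
            }
        };
    }

    public static void main(String[] args) throws IOException {
        byte[] page = new byte[175000]; // about the size of one battle log page

        long t0 = System.nanoTime();
        InputStream is = slowStream(page, 4096, 5); // "opening" the stream costs almost nothing
        long t1 = System.nanoTime();
        byte[] body = readFully(is);                // the waiting happens here
        long t2 = System.nanoTime();
        readFully(new ByteArrayInputStream(body));  // the in-memory copy reads quickly
        long t3 = System.nanoTime();

        System.out.printf("open: %d ms, drain: %d ms, re-read from byte[]: %d ms%n",
                (t1 - t0) / 1000000, (t2 - t1) / 1000000, (t3 - t2) / 1000000);
    }
}
```

Applied to the question, the 200ms measured for getWebPageAsStream() covers connecting and receiving the response headers, while the roughly 1000ms measured at "parse" time is the body being transferred. Assuming about one byte per character, 175000 bytes in about one second is roughly 175 KB/s, which points at the connection or the server rather than at BufferedReader or the XML parser.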

