Java SAX parser progress monitoring

Posted on 2024-09-06 20:26:23

I'm writing a SAX parser in Java to parse a 2.5GB XML file of Wikipedia articles. Is there a way to monitor the progress of the parsing in Java?

Comments (5)

羞稚 2024-09-13 20:26:23

Thanks to EJP's suggestion of ProgressMonitorInputStream, I ended up extending FilterInputStream so that a ChangeListener can monitor the current read position in bytes.

This gives you finer control, for example showing multiple progress bars when reading several big XML files in parallel, which is exactly what I did.

So, a simplified version of the monitorable stream:

import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import javax.swing.event.ChangeEvent;
import javax.swing.event.ChangeListener;

/**
 * A class that monitors the read progress of an input stream.
 *
 * @author Hermia Yeung "Sheepy"
 * @since 2012-04-05 18:42
 */
public class MonitoredInputStream extends FilterInputStream {
   private volatile long mark = 0;
   private volatile long lastTriggeredLocation = 0;
   private volatile long location = 0;
   private final int threshold;
   // Copy-on-write list, so listeners can be added or removed while an event is being dispatched.
   private final List<ChangeListener> listeners = new CopyOnWriteArrayList<>();


   /**
    * Creates a MonitoredInputStream over an underlying input stream.
    * @param in Underlying input stream, should be non-null because there is no public setter
    * @param threshold Min. position change (in bytes) to trigger a change event.
    */
   public MonitoredInputStream(InputStream in, int threshold) {
      super(in);
      this.threshold = threshold;
   }

   /**
    * Creates a MonitoredInputStream over an underlying input stream.
    * Default threshold is 16KB; a small threshold may hurt performance on larger streams.
    * @param in Underlying input stream, should be non-null because there is no public setter
    */
   public MonitoredInputStream(InputStream in) {
      this(in, 1024*16);
   }

   public void addChangeListener(ChangeListener l) { if (!listeners.contains(l)) listeners.add(l); }
   public void removeChangeListener(ChangeListener l) { listeners.remove(l); }
   public long getProgress() { return location; }

   /** Notifies listeners, unless the position moved less than the threshold since the last event. */
   protected void triggerChanged( final long location ) {
      if ( threshold > 0 && Math.abs( location-lastTriggeredLocation ) < threshold ) return;
      lastTriggeredLocation = location;
      if (listeners.isEmpty()) return;
      final ChangeEvent evt = new ChangeEvent(this);
      for (ChangeListener l : listeners) l.stateChanged(evt);
   }


   @Override public int read() throws IOException {
      final int i = super.read();
      if ( i != -1 ) triggerChanged( ++location ); // pre-increment, so listeners see the new position
      return i;
   }

   @Override public int read(byte[] b, int off, int len) throws IOException {
      final int i = super.read(b, off, len);
      if ( i > 0 ) triggerChanged( location += i );
      return i;
   }

   @Override public long skip(long n) throws IOException {
      final long i = super.skip(n);
      if ( i > 0 ) triggerChanged( location += i );
      return i;
   }

   @Override public void mark(int readlimit) {
      super.mark(readlimit);
      mark = location; // remember the position so that reset() can restore it
   }

   @Override public void reset() throws IOException {
      super.reset();
      if ( location != mark ) triggerChanged( location = mark );
   }
}

It doesn't know, or care, how big the underlying stream is, so you need to get the total size some other way, such as from the file itself.

Here is a simplified sample usage:

final long total = file.length(); // a 2.5GB file does not fit in an int, so progress is reported per mille rather than in bytes
try (
   MonitoredInputStream mis = new MonitoredInputStream(new FileInputStream(file), 65536*4)
) {

   // Setup max progress and listener to monitor read progress.
   // setMaxProgress/setProgress are the methods of the progress bar used here; a plain JProgressBar has setMaximum/setValue instead.
   progressBar.setMaxProgress( 1000 ); // Swing thread or before display please
   mis.addChangeListener( new ChangeListener() { @Override public void stateChanged(ChangeEvent e) {
      SwingUtilities.invokeLater( new Runnable() { @Override public void run() {
         progressBar.setProgress( (int) ( mis.getProgress() * 1000 / Math.max( total, 1 ) ) ); // Promise me you WILL use MVC instead of this anonymous class mess!
      }});
   }});
   // Start parsing. Listener would call Swing event thread to do the update.
   SAXParserFactory.newInstance().newSAXParser().parse(mis, this);

} catch ( IOException | ParserConfigurationException | SAXException e) {

   e.printStackTrace();

} finally {

   progressBar.setVisible(false); // Again please call this in swing event thread

}

In my case the progress bars advance smoothly from left to right without abnormal jumps. Adjust the threshold to balance performance and responsiveness: too small and the reading time can more than double on small devices; too big and the progress will not be smooth.

Hope it helps. Feel free to edit if you find mistakes or typos, or vote up to send me some encouragement! :D

一腔孤↑勇 2024-09-13 20:26:23

Use a javax.swing.ProgressMonitorInputStream.
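
A minimal sketch of what that could look like, assuming the XML is read from an ordinary InputStream. The class name ProgressMonitorSketch, its two helper methods and the dialog message are made up for illustration. ProgressMonitorInputStream takes a parent component (which may be null), a message and the stream to wrap, and pops up a progress dialog by itself if the read takes long enough.

```java
import java.awt.Component;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import javax.swing.ProgressMonitorInputStream;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

class ProgressMonitorSketch {

    /** Hypothetical handler for the sketch: it only counts start tags. */
    static class CountingHandler extends DefaultHandler {
        int elements = 0;

        @Override
        public void startElement(String uri, String localName, String qName, Attributes attributes) {
            elements++;
        }
    }

    /** Parses the stream with SAX and returns the number of elements seen. */
    static int countElements(InputStream in)
            throws IOException, ParserConfigurationException, SAXException {
        CountingHandler handler = new CountingHandler();
        SAXParserFactory.newInstance().newSAXParser().parse(in, handler);
        return handler.elements;
    }

    /** Same parse, but read through a ProgressMonitorInputStream so Swing can show a progress dialog. */
    static int parseWithMonitor(Component parent, InputStream raw)
            throws IOException, ParserConfigurationException, SAXException {
        try (InputStream in = new ProgressMonitorInputStream(parent, "Parsing XML", raw)) {
            return countElements(in);
        }
    }

    public static void main(String[] args) throws Exception {
        byte[] xml = "<wiki><article/><article/></wiki>".getBytes(StandardCharsets.UTF_8);
        System.out.println(parseWithMonitor(null, new ByteArrayInputStream(xml)));
    }
}
```

One caveat: ProgressMonitor works with int values, so progress for a file larger than 2GB may not be reported accurately this way.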

晨曦慕雪 2024-09-13 20:26:23

You can get an estimate of the current line/column in your file by overriding the method setDocumentLocator of org.xml.sax.helpers.DefaultHandler/BaseHandler. This method is called with an object from which you can get an approximation of the current line/column when needed.

Edit: To the best of my knowledge, there is no standard way to get the absolute position. However, I am sure some SAX implementations do offer this kind of information.
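
As a rough sketch of the Locator approach, assuming the parser supplies a Locator at all (it is not obliged to): the class LineTrackingHandler and the helper lastLineOf are hypothetical names. The handler stores the Locator it is given and asks it for the line number on each start tag. Turning that into a percentage would also require knowing the total number of lines in advance.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.Locator;
import org.xml.sax.helpers.DefaultHandler;

class LineTrackingHandler extends DefaultHandler {
    private Locator locator;   // supplied by the parser; stays null if the parser has none
    private int lastLine = -1; // last line number reported, or -1 if unknown

    @Override
    public void setDocumentLocator(Locator locator) {
        this.locator = locator;
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes attributes) {
        if (locator != null) {
            lastLine = locator.getLineNumber(); // approximate position of the current event
        }
    }

    int getLastLine() {
        return lastLine;
    }

    /** Parses the given XML text and returns the line reported for the last start tag. */
    static int lastLineOf(String xml) throws Exception {
        LineTrackingHandler handler = new LineTrackingHandler();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)), handler);
        return handler.getLastLine();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(lastLineOf("<wiki>\n<article/>\n<article/>\n</wiki>"));
    }
}
```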

香草可樂 2024-09-13 20:26:23

Assuming you know how many articles you have, can't you just keep a counter in the handler? E.g.

private long counter = 0; // number of "article" start tags seen so far

@Override
public void startElement (String uri, String localName, 
                          String qName, Attributes attributes) 
                          throws SAXException {
    if(qName.equals("article")){
        counter++;
    }
    ...
}

(I don't know whether you are parsing "article", it's just an example)

If you don't know the number of articles in advance, you will need to count them first. Then you can print the status as tags read / total tags, say every 100 tags (counter % 100 == 0).

Or even have another thread monitor the progress. In that case you might want to synchronize access to the counter, but that is not strictly necessary given that it doesn't need to be really accurate.

My 2 cents
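
Put together, the counter idea might look like the sketch below. The names ArticleCounter and countArticles, the reportEvery parameter and the "article" element name are all assumptions for illustration. An AtomicLong is used so that a second thread could read the count safely, which is one simple way to handle the synchronization mentioned above.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.concurrent.atomic.AtomicLong;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

class ArticleCounter extends DefaultHandler {
    private final long total;      // expected number of articles, known or counted beforehand
    private final int reportEvery; // print a status line every this many articles
    private final AtomicLong counter = new AtomicLong(); // safe to read from another thread

    ArticleCounter(long total, int reportEvery) {
        this.total = total;
        this.reportEvery = reportEvery;
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes attributes) {
        if ("article".equals(qName)) {
            long seen = counter.incrementAndGet();
            if (seen % reportEvery == 0) {
                System.out.println(seen + "/" + total + " articles read");
            }
        }
    }

    /** Current count; may be called from a monitoring thread while parsing runs. */
    long getCount() {
        return counter.get();
    }

    /** Parses the given XML text and returns how many article elements were seen. */
    static long countArticles(String xml, long total, int reportEvery) throws Exception {
        ArticleCounter handler = new ArticleCounter(total, reportEvery);
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)), handler);
        return handler.getCount();
    }

    public static void main(String[] args) throws Exception {
        countArticles("<wiki><article/><article/><article/><article/></wiki>", 4, 2);
    }
}
```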

酷遇一生 2024-09-13 20:26:23

I'd use the input stream position. Make your own trivial stream class that delegates/inherits from the "real" one and keeps track of bytes read. As you say, getting the total filesize is easy. I wouldn't worry about buffering, lookahead, etc. - for large files like these it's chickenfeed. On the other hand, I'd limit the position to "99%".
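
A bare-bones sketch of that idea, with hypothetical names (CountingInputStream, ByteProgress.percent): the stream wrapper counts the bytes that pass through it, and the percentage is capped at 99 because the parser may still be working after the last byte has been read.

```java
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;

/** Trivial wrapper that keeps track of how many bytes have been read. */
class CountingInputStream extends FilterInputStream {
    private volatile long count = 0; // written by the parsing thread, readable from others

    CountingInputStream(InputStream in) {
        super(in);
    }

    @Override
    public int read() throws IOException {
        int b = super.read();
        if (b != -1) count++;
        return b;
    }

    @Override
    public int read(byte[] buf, int off, int len) throws IOException {
        int n = super.read(buf, off, len);
        if (n > 0) count += n;
        return n;
    }

    @Override
    public long skip(long n) throws IOException {
        long skipped = super.skip(n);
        if (skipped > 0) count += skipped;
        return skipped;
    }

    @Override
    public boolean markSupported() {
        return false; // no mark/reset, so the count never has to move backwards
    }

    long getCount() {
        return count;
    }
}

class ByteProgress {
    /** Percentage of bytes read, limited to the range 0..99. */
    static int percent(long bytesRead, long totalBytes) {
        if (totalBytes <= 0) return 0;
        long pct = bytesRead * 100 / totalBytes;
        return (int) Math.max(0, Math.min(99, pct));
    }
}
```

The total would come from something like File.length(); the application can set the display to 100% itself once the parse call returns.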
