如何将boilerpipe与本地html文件一起使用?
我的本地磁盘上有一个 html 文件,想使用 BoilerPipe 从中提取文本。
ExtractorBase 类中的“getText”方法接受读取器,因此我写道:
FileReader fr = new FileReader("D:/myHTMLfile");
System.out.println(ArticleExtractor.INSTANCE.getText(fr));
但随后我收到一个指向第二行代码的错误。
有什么线索吗?谢谢!
编辑:整个错误消息是:
Exception in thread "pool-1-thread-1" java.lang.NoClassDefFoundError: org/cyberneko/html/HTMLConfiguration
at de.l3s.boilerpipe.sax.BoilerpipeHTMLParser.<init>(BoilerpipeHTMLParser.java:50)
at de.l3s.boilerpipe.sax.BoilerpipeHTMLParser.<init>(BoilerpipeHTMLParser.java:41)
at de.l3s.boilerpipe.sax.BoilerpipeSAXInput.getTextDocument(BoilerpipeSAXInput.java:51)
at de.l3s.boilerpipe.extractors.ExtractorBase.getText(ExtractorBase.java:69)
at de.l3s.boilerpipe.extractors.ExtractorBase.getText(ExtractorBase.java:101)
at neuromarket.BoilerPlateExtractor.run(BoilerPlateExtractor.java:42)
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:662)
Caused by: java.lang.ClassNotFoundException: org.cyberneko.html.HTMLConfiguration
at java.net.URLClassLoader$1.run(URLClassLoader.java:202)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)
at java.lang.ClassLoader.loadClass(ClassLoader.java:247)
... 9 more
Exception in thread "pool-1-thread-2" java.lang.NoClassDefFoundError: org/cyberneko/html/HTMLConfiguration
at de.l3s.boilerpipe.sax.BoilerpipeHTMLParser.<init>(BoilerpipeHTMLParser.java:50)
at de.l3s.boilerpipe.sax.BoilerpipeHTMLParser.<init>(BoilerpipeHTMLParser.java:41)
at de.l3s.boilerpipe.sax.BoilerpipeSAXInput.getTextDocument(BoilerpipeSAXInput.java:51)
at de.l3s.boilerpipe.extractors.ExtractorBase.getText(ExtractorBase.java:69)
at de.l3s.boilerpipe.extractors.ExtractorBase.getText(ExtractorBase.java:101)
at neuromarket.BoilerPlateExtractor.run(BoilerPlateExtractor.java:42)
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:662)
BUILD SUCCESSFUL (total time: 0 seconds)
I have an html file on my local disk and would like to extract text from it using BoilerPipe.
The "getText" method from the class ExtractorBase accepts a reader, so I wrote:
FileReader fr = new FileReader("D:/myHTMLfile");
System.out.println(ArticleExtractor.INSTANCE.getText(fr));
But then I get an error pointing to this second line of code.
Any clue? Thx!
EDIT: the whole error msg is:
Exception in thread "pool-1-thread-1" java.lang.NoClassDefFoundError: org/cyberneko/html/HTMLConfiguration
at de.l3s.boilerpipe.sax.BoilerpipeHTMLParser.<init>(BoilerpipeHTMLParser.java:50)
at de.l3s.boilerpipe.sax.BoilerpipeHTMLParser.<init>(BoilerpipeHTMLParser.java:41)
at de.l3s.boilerpipe.sax.BoilerpipeSAXInput.getTextDocument(BoilerpipeSAXInput.java:51)
at de.l3s.boilerpipe.extractors.ExtractorBase.getText(ExtractorBase.java:69)
at de.l3s.boilerpipe.extractors.ExtractorBase.getText(ExtractorBase.java:101)
at neuromarket.BoilerPlateExtractor.run(BoilerPlateExtractor.java:42)
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:662)
Caused by: java.lang.ClassNotFoundException: org.cyberneko.html.HTMLConfiguration
at java.net.URLClassLoader$1.run(URLClassLoader.java:202)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)
at java.lang.ClassLoader.loadClass(ClassLoader.java:247)
... 9 more
Exception in thread "pool-1-thread-2" java.lang.NoClassDefFoundError: org/cyberneko/html/HTMLConfiguration
at de.l3s.boilerpipe.sax.BoilerpipeHTMLParser.<init>(BoilerpipeHTMLParser.java:50)
at de.l3s.boilerpipe.sax.BoilerpipeHTMLParser.<init>(BoilerpipeHTMLParser.java:41)
at de.l3s.boilerpipe.sax.BoilerpipeSAXInput.getTextDocument(BoilerpipeSAXInput.java:51)
at de.l3s.boilerpipe.extractors.ExtractorBase.getText(ExtractorBase.java:69)
at de.l3s.boilerpipe.extractors.ExtractorBase.getText(ExtractorBase.java:101)
at neuromarket.BoilerPlateExtractor.run(BoilerPlateExtractor.java:42)
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:662)
BUILD SUCCESSFUL (total time: 0 seconds)
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(1)
您需要在类路径中添加 nekohtml-1.xxjar 。
You need to add nekohtml-1.x.x.jar at classpath.