Using boilerpipe to extract non-English articles

Posted 01-04 15:46

I am trying to use the boilerpipe Java library to extract news articles from a set of websites.
It works great for English text, but for text with special characters, such as words with accent marks (história), the special characters are not extracted correctly. I think it is an encoding problem.

The boilerpipe FAQ says "If you extract non-English text you might need to change some parameters" and then refers to a paper. I found no solution in that paper.

My question is: are there any parameters in boilerpipe that let me specify the encoding? Is there any workaround to get the text correctly?

How I'm using the library:
(first attempt, based on the URL):

URL url = new URL(link);
String article = ArticleExtractor.INSTANCE.getText(url);

(second attempt, based on the HTML source code):

String article = ArticleExtractor.INSTANCE.getText(html_page_as_string);

蹲墙角沉默2025-01-11 15:46:24

You don't have to modify boilerpipe's internal classes.

Just pass an InputSource object to the ArticleExtractor.INSTANCE.getText() method and force the encoding on that object. For example:

URL url = new URL("http://some-page-with-utf8-encodeing.tld");

InputSource is = new InputSource();
is.setEncoding("UTF-8");
is.setByteStream(url.openStream());

String text = ArticleExtractor.INSTANCE.getText(is);
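
The InputSource part of this approach can be tried without boilerpipe on the classpath. The following is a minimal sketch, assuming a hypothetical helper named withEncoding (it is not part of boilerpipe) and an in-memory page standing in for url.openStream():

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

import org.xml.sax.InputSource;

class ForcedEncodingSource {

    // Hypothetical helper (not part of boilerpipe): wraps a raw byte stream in an
    // InputSource and pins the encoding the parser should use to decode it.
    static InputSource withEncoding(InputStream bytes, String encoding) {
        InputSource source = new InputSource();
        source.setEncoding(encoding);
        source.setByteStream(bytes);
        return source;
    }

    public static void main(String[] args) {
        // Stand-in for url.openStream(): a tiny UTF-8 encoded page held in memory.
        byte[] page = "<html><body><p>hist\u00f3ria</p></body></html>"
                .getBytes(StandardCharsets.UTF_8);

        InputSource source = withEncoding(new ByteArrayInputStream(page), "UTF-8");
        System.out.println(source.getEncoding());

        // With boilerpipe available, the source would then be handed over as in the
        // snippet above: ArticleExtractor.INSTANCE.getText(source);
    }
}
```

The encoding set on an InputSource only applies to its byte stream. If a character stream (a Reader) is set instead, the parser reads it directly and the encoding is ignored.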

Regards!

许一世地老天荒2025-01-11 15:46:24

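Well, from what I can see, when you use it like that, the library automatically chooses which encoding to use: it takes the charset from the Content-Type header and falls back to Cp1252 when none is declared. From the HTMLFetcher source: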

public static HTMLDocument fetch(final URL url) throws IOException {
    final URLConnection conn = url.openConnection();
    final String ct = conn.getContentType();

    Charset cs = Charset.forName("Cp1252");
    if (ct != null) {
        Matcher m = PAT_CHARSET.matcher(ct);
        if(m.find()) {
            final String charset = m.group(1);
            try {
                cs = Charset.forName(charset);
            } catch (UnsupportedCharsetException e) {
                // keep default
            }
        }
    }

Try debugging their code a bit, starting with ArticleExtractor.getText(URL), and see if you can override the encoding.
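
To check what charset a given Content-Type header would yield, the detection step can be reproduced in isolation. This is only a sketch of the same idea, not boilerpipe's code: the class name, the method name, and the regular expression are assumptions (the library's actual PAT_CHARSET is not shown in the excerpt and may differ).

```java
import java.nio.charset.Charset;
import java.nio.charset.IllegalCharsetNameException;
import java.nio.charset.UnsupportedCharsetException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ContentTypeCharset {

    // Assumed pattern for illustration; boilerpipe's own PAT_CHARSET may differ.
    private static final Pattern CHARSET_PARAM =
            Pattern.compile("charset=\"?([^\\s;\"]+)", Pattern.CASE_INSENSITIVE);

    // Returns the charset named in the Content-Type header, or the fallback when
    // the header is missing, has no charset parameter, or names an unknown charset.
    static Charset fromContentType(String contentType, Charset fallback) {
        if (contentType == null) {
            return fallback;
        }
        Matcher m = CHARSET_PARAM.matcher(contentType);
        if (!m.find()) {
            return fallback;
        }
        try {
            return Charset.forName(m.group(1));
        } catch (IllegalCharsetNameException | UnsupportedCharsetException e) {
            return fallback; // keep default, as in the excerpt above
        }
    }

    public static void main(String[] args) {
        Charset cp1252 = Charset.forName("Cp1252");
        System.out.println(fromContentType("text/html; charset=UTF-8", cp1252));
        System.out.println(fromContentType("text/html", cp1252));
        System.out.println(fromContentType(null, cp1252));
    }
}
```

If a server sends UTF-8 pages without declaring a charset, this kind of logic ends up on the Cp1252 fallback, which would explain accented characters coming out wrong.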

我要还你自由2025-01-11 15:46:24

OK, I found a solution.
As Andrei said, I had to change the HTMLFetcher class, which is in the de.l3s.boilerpipe.sax package.
What I did was convert all the fetched text to UTF-8.
At the end of the fetch function, I added two lines and changed the last one:

final byte[] data = bos.toByteArray(); // stays the same
byte[] utf8 = new String(data, cs.name()).getBytes("UTF-8"); // new: conversion to UTF-8 (name() is the canonical charset name)
cs = Charset.forName("UTF-8"); // new: set the charset to UTF-8
return new HTMLDocument(utf8, cs); // edited line
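
The conversion step on its own can be sketched as a small standalone function. The class and method names here are hypothetical, and the example assumes the source charset is already known to be correct:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class Utf8Transcoder {

    // Decodes the raw bytes with the charset they were written in,
    // then re-encodes the resulting text as UTF-8.
    static byte[] toUtf8(byte[] data, Charset source) {
        return new String(data, source).getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // "história" encoded as ISO-8859-1: one byte per character.
        byte[] latin1 = "hist\u00f3ria".getBytes(StandardCharsets.ISO_8859_1);
        byte[] utf8 = toUtf8(latin1, StandardCharsets.ISO_8859_1);

        // The accented letter needs two bytes in UTF-8, so the array grows by one.
        System.out.println(latin1.length + " -> " + utf8.length); // prints "8 -> 9"
        System.out.println(Arrays.equals(utf8,
                "hist\u00f3ria".getBytes(StandardCharsets.UTF_8))); // prints "true"
    }
}
```

Transcoding like this only helps when the detected source charset is right. If the page was decoded with the wrong charset to begin with, the damage is simply carried over into the UTF-8 output.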

烟若柳尘2025-01-11 15:46:24

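Boilerpipe's ArticleExtractor uses some algorithms that have been specifically tailored to English, such as measuring the number of words in an average phrase. In any language that is more or less verbose than English (that is, every other language), these algorithms will be less accurate.

In addition, the library uses some English phrases to try to find the end of the article ("comments", "post a comment", "have your say", and so on), which will clearly not work in other languages.

This is not to say that the library will fail outright. Just be aware that some modification is likely to be needed for good results in non-English languages.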
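
To make the second point concrete, here is a toy illustration of why an English-only phrase list misses the end of a non-English article. It is not boilerpipe's implementation: the class, the method, and the exact matching rule are assumptions, and only the three phrases come from the answer above.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

class EndOfArticleMarker {

    // Phrases taken from the answer above; the real library's list is not shown here.
    private static final List<String> ENGLISH_MARKERS =
            Arrays.asList("comments", "post a comment", "have your say");

    // Toy rule (an assumption): a line marks the end of the article
    // if it equals one of the known phrases, ignoring case and outer whitespace.
    static boolean looksLikeEndOfArticle(String line) {
        return ENGLISH_MARKERS.contains(line.trim().toLowerCase(Locale.ROOT));
    }

    public static void main(String[] args) {
        System.out.println(looksLikeEndOfArticle("Post a comment"));           // prints "true"
        System.out.println(looksLikeEndOfArticle("Deixe um coment\u00e1rio")); // prints "false"
    }
}
```

With a rule like this, a Portuguese page would need its own phrases added before the comment section could be recognized.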

◇流星雨2025-01-11 15:46:24

Java:

import java.net.URL;

import org.xml.sax.InputSource;

import de.l3s.boilerpipe.extractors.ArticleExtractor;

public class Boilerpipe {

    public static void main(String[] args) {
        try{
            URL url = new URL("http://www.azeri.ru/az/traditions/kuraj_pehlevanov/");

            InputSource is = new InputSource();
            is.setEncoding("UTF-8");
            is.setByteStream(url.openStream());

            String text = ArticleExtractor.INSTANCE.getText(is);
            System.out.println(text);
        }catch(Exception e){
            e.printStackTrace();
        }
    }

}

Eclipse:
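Run > Run Configurations > Common tab. Set Encoding to Other (UTF-8), then click Run. This setting controls the encoding of the Eclipse console, so the extracted text is displayed correctly.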
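
Outside Eclipse, a similar effect can be approximated in code by wrapping the output stream in a PrintStream with an explicit encoding. This is a minimal sketch with a hypothetical helper name. Whether the characters then display correctly still depends on the terminal or console decoding UTF-8.

```java
import java.io.OutputStream;
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;

class Utf8Console {

    // Wraps any output stream so that printed text is always encoded as UTF-8,
    // regardless of the platform default encoding.
    static PrintStream utf8Stream(OutputStream out) {
        try {
            return new PrintStream(out, true, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // Every Java platform is required to support UTF-8.
            throw new IllegalStateException("UTF-8 not available", e);
        }
    }

    public static void main(String[] args) {
        PrintStream out = utf8Stream(System.out);
        out.println("hist\u00f3ria");
    }
}
```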

探春2025-01-11 15:46:24

I had the same problem; cnr's solution works great. I just had to change the encoding from UTF-8 to ISO-8859-1. Thanks!

URL url = new URL("http://some-page-with-utf8-encodeing.tld");
InputSource is = new InputSource();
is.setEncoding("ISO-8859-1");
is.setByteStream(url.openStream());

String text = ArticleExtractor.INSTANCE.getText(is);
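
Which of the two encodings is right depends on how the page's bytes were written, not on the library. The following sketch (class and method names are hypothetical) shows the same word decoded with a matching and a mismatching charset:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class DecodeMismatch {

    // Decodes raw page bytes with the charset the caller believes the page uses.
    static String decode(byte[] raw, Charset assumed) {
        return new String(raw, assumed);
    }

    public static void main(String[] args) {
        String word = "hist\u00f3ria";
        byte[] utf8Page = word.getBytes(StandardCharsets.UTF_8);
        byte[] latin1Page = word.getBytes(StandardCharsets.ISO_8859_1);

        // Matching charset: the text survives.
        System.out.println(decode(utf8Page, StandardCharsets.UTF_8).equals(word));         // prints "true"
        System.out.println(decode(latin1Page, StandardCharsets.ISO_8859_1).equals(word));  // prints "true"

        // UTF-8 bytes read as ISO-8859-1: the two-byte letter turns into two characters.
        System.out.println(decode(utf8Page, StandardCharsets.ISO_8859_1)
                .equals("hist\u00c3\u00b3ria"));                                           // prints "true"

        // ISO-8859-1 bytes read as UTF-8: the lone byte is invalid and gets replaced.
        System.out.println(decode(latin1Page, StandardCharsets.UTF_8).contains("\uFFFD")); // prints "true"
    }
}
```

So a page served as ISO-8859-1 needs setEncoding("ISO-8859-1"), while a UTF-8 page needs setEncoding("UTF-8"). The page's Content-Type header or its meta charset tag usually indicates which one applies.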
