Why does Jsoup scrape data differently in Java and Android?

Posted on 2025-01-03 18:54:20

I have been trying to scrape the 'School Notices' from this URL http://www.isleworthsyon.hounslow.sch.uk/

In Java, I scraped the text and then used String.replaceAll to replace every " " sequence (I am not sure which character it is) with a new line. That worked perfectly, but when I apply the same code on Android, it gives me different results.

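In Java:

import java.io.IOException;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

// The class name is a placeholder; the original post showed only the method body.
public class SchoolNotices {

    public static void main(String[] args) {
        String URL = "http://www.isleworthsyon.hounslow.sch.uk/";

        Document site;
        try {
            site = Jsoup.connect(URL).get();
        } catch (IOException e) {
            e.printStackTrace();
            return; // nothing to parse if the page could not be fetched
        }

        // Drop all links so their text does not end up in the output.
        site.select("a").remove();

        // Select the notices container and take its combined text.
        Elements news = site.select("div#np_91983-1");
        String output = news.text();

        // Replace each three-character spacing run with a blank line.
        String for_output_text = output.replaceAll("   ", "\n\n");

        System.out.println(for_output_text);
    }
}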

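In Android:

@Override
protected void onCreate(Bundle savedInstanceState) {
    super.onCreate(savedInstanceState);
    setContentView(R.layout.main);

    final TextView text = (TextView) findViewById(R.id.text);

    String URL = "http://www.isleworthsyon.hounslow.sch.uk/";

    Document site;
    try {
        // This request runs on the UI thread, as in the original post.
        // Android 3.0 and later throw NetworkOnMainThreadException here,
        // so newer apps need to move it to a background thread.
        site = Jsoup.connect(URL).get();
    } catch (IOException e) {
        e.printStackTrace();
        return; // nothing to show if the page could not be fetched
    }

    // Drop all links so their text does not end up in the output.
    site.select("a").remove();

    // Select the notices container and take its combined text.
    Elements news = site.select("div#np_91983-1");
    String output = news.text();

    // Replace each three-character spacing run with a blank line.
    String for_output_text = output.replaceAll("   ", "\n\n");

    text.setText(for_output_text);
}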

The two outputs differ, as you can see in this comparison:

http://dl.dropbox.com/u/35866688/Comparison.png

By the way, this is my first attempt at web scraping.

Edit: After some experiments, I found that the strings I get from news.text() have different spacing in Java and on Android. Any suggestions as to why that is the case, and are there any alternatives?
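One way to make the replacement less sensitive to the exact spacing characters is to match any run of two or more whitespace-like characters instead of a literal three-character string. The sketch below assumes the unknown character may be a non-breaking space (U+00A0); nothing in this post confirms that. The class name, method name, and sample text are made up for illustration, and the input stands in for whatever news.text() returns.

```java
import java.util.regex.Pattern;

public class SpacingNormalizer {

    // Two or more consecutive characters that are either ordinary whitespace
    // or a non-breaking space. Treating the non-breaking space as the
    // mystery character is an assumption, not a confirmed fact about the page.
    private static final Pattern GAP = Pattern.compile("[\\s\\u00A0]{2,}");

    // Replaces every such run with a blank line and trims the ends.
    static String toParagraphs(String scraped) {
        return GAP.matcher(scraped).replaceAll("\n\n").trim();
    }

    public static void main(String[] args) {
        // Hypothetical sample text mixing non-breaking and ordinary spaces.
        String sample = "First notice\u00A0\u00A0\u00A0Second notice  \u00A0Third notice";
        System.out.println(toParagraphs(sample));
    }
}
```

Single spaces between words are left alone because the pattern requires at least two characters in a row, so the same code behaves identically whether a gap consists of ordinary spaces, non-breaking spaces, or a mix of both.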

幸福％小乖 2025-01-10 18:54:20

Finally! I found a better solution: use a for loop and add the text of each element separately.

final TextView text = (TextView) findViewById(R.id.text);

String URL = "http://www.isleworthsyon.hounslow.sch.uk/";

Document site;
try {
    site = Jsoup.connect(URL).get();
} catch (IOException e) {
    e.printStackTrace();
    return; // nothing to show if the page could not be fetched
}

// Drop all links so their text does not end up in the output.
site.select("a").remove();

Elements news = site.select("div#np_91983-1");

// Select the centre-aligned elements inside the notices container.
Elements newsline = news.select("[align~=center]");

// Append the text of each element on its own line.
StringBuilder output = new StringBuilder();
for (int i = 0; i < newsline.size(); i++) {
    output.append("\n").append(newsline.get(i).text());
}

text.setText(output.toString());
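The joining step of the loop above can be tried on its own, without Jsoup or Android. This is a minimal sketch in which a plain list of strings stands in for the values of newsline.get(i).text(); the class name, method name, and sample notices are hypothetical. Unlike the loop above, it skips empty entries and does not emit a leading newline.

```java
import java.util.Arrays;
import java.util.List;

public class NoticeJoiner {

    // Joins the given lines with single newlines, ignoring blank entries.
    static String joinLines(List<String> lines) {
        StringBuilder out = new StringBuilder();
        for (String line : lines) {
            String trimmed = line.trim();
            if (trimmed.isEmpty()) {
                continue; // skip elements that contain no text
            }
            if (out.length() > 0) {
                out.append('\n'); // separator only between entries
            }
            out.append(trimmed);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Made-up sample data in place of the scraped element texts.
        List<String> notices = Arrays.asList("Term ends Friday", " ", "Parents' evening ");
        System.out.println(joinLines(notices));
    }
}
```

Because each element's text is added separately, the result no longer depends on which spacing characters separate the notices in the combined news.text() string.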

Thanks to everyone who helped
