How to parse a page containing multiple tables

Posted 2025-01-03 22:19:48


Any idea on how to scrape a web page with multiple tables?
I am connecting to the web page

This is one table, but there are multiple tables on the same web page.

I also can't figure out how to read the table...

HTML:

    <p><a href="/fantasy_news/feature/?ID=49818"><strong>Top 300 Overall Fantasy Rankings</strong></a></p>
    <div class="storyStats">
      <table>
        <thead>
          <tr>
            <th>RANK</th>
            <th>CENTRES</th>
            <th>TEAM</th>
            <th>POS</th>
            <th>GP</th>
            <th>G</th>
            <th>A</th>
            <th>PTS</th>
            <th>+/-</th>
            <th>PIM</th>
            <th>PPP</th>
          </tr>
        </thead>
        <tbody>
          <tr class="bg1">
            <td>1.</td>
            <td><a href="/nhl/teams/players/?name=steven+stamkos">Steven Stamkos</a></td>
            <td>Tampa Bay</td>
            <td>C</td>
            <td align="right">81</td>
            <td align="right">50</td>
            <td align="right">51</td>
            <td align="right">101</td>
            <td align="right">-2</td>
            <td align="right">56</td>
            <td align="right">38</td>
          </tr>


    Iterator<Element> trSIter = doc.select("table").iterator();
    while (trSIter.hasNext()) {
        Element trEl = trSIter.next().child(0);
        Elements tdEls = trEl.children();
        Iterator<Element> tdIter = tdEls.select("tr").iterator();
        System.out.println("><1><><" + tdIter);
        boolean firstRow = true;
        while (tdIter.hasNext()) {
            Element tr = (Element) tdIter.next();

            while (tdIter.hasNext()) {
                int tdCount = 1;
                Element tdEl = tdIter.next();
                //name = tdEl.getElementsByClass("playertablePlayerName").get(0).text();

                Elements tdsEls = tdEl.select("td");
                System.out.println("><2><><" + tdsEls);
                Iterator<Element> columnIt = tdsEls.iterator();

                while (columnIt.hasNext()) {
                    Element column = columnIt.next();
                    switch (tdCount++) {
                    case 1:
                        name = column.select("a").first().text();
                        break;
                    case 2:
                        stat2 = Double.parseDouble(column.text());
                        break;
                    case 3:
                        stat3 = Double.parseDouble(column.text());
                        break;
                    case 4:
                        stat4 = Double.parseDouble(column.text());
                        break;
                    case 5:
                        stat5 = Double.parseDouble(column.text());
                        break;
                    case 6:
                        stat6 = Double.parseDouble(column.text());
                        break;
                    case 7:
                        stat7 = Double.parseDouble(column.text());
                        break;
                    case 8:
                        stat8 = Double.parseDouble(column.text());
                        break;

离笑几人歌 2025-01-10 22:19:48

With the code below, parsing the tables from the HTML seems to work without problems.

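import java.io.IOException;

import org.jsoup.Connection;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

import android.app.Activity;
import android.os.Bundle;
import android.util.Log;
import android.widget.TextView;

public class JsoupActivity extends Activity {
    @Override
    public void onCreate(Bundle savedInstanceState) {
        super.onCreate(savedInstanceState);
        setContentView(R.layout.main);
        final TextView tv = (TextView) findViewById(R.id.tv1);
        final myHttpGet _myGet = new myHttpGet();
        // Current Android versions reject network calls on the main thread,
        // so the page is fetched on a background thread.
        new Thread(new Runnable() {
            @Override
            public void run() {
                Document doc = _myGet.doHttpGet();
                if (doc == null) {
                    return; // the fetch failed and was already logged
                }
                final Elements tdsEls = doc.getElementsByClass("storyStats");
                if (tdsEls.isEmpty()) {
                    return; // no element with class storyStats on the page
                }
                // Views may only be touched from the UI thread.
                runOnUiThread(new Runnable() {
                    @Override
                    public void run() {
                        //tv.setText(tdsEls.get(0).child(0).text());
                        tv.setText(String.valueOf(tdsEls.first().children().size()));
                    }
                });
            }
        }).start();
    }

    private class myHttpGet {
        // Fetches and parses the page; returns null if the request or the parse fails.
        public Document doHttpGet() {
            Connection myConnection = Jsoup.connect("http://www.tsn.ca/fantasy_news/feature/?ID=49815");
            try {
                Connection.Response myResponse = myConnection.execute();
                return myResponse.parse();
            } catch (IOException e) {
                Log.e("napster", "HTTP or parse error", e);
                return null;
            }
        }
    }

}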

The code shows 5 in the TextView, which is the number of tables in that HTML under the first element with the class storyStats. To go on and parse the contents of the tables, you can assign them to another Elements object and continue from there:

Elements es = tdsEls.first().children();

Anderson's answer shows how to parse it for data. Hope that helps.
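
As a rough sketch of that next step, the snippet below walks every table under the first storyStats element and collects the cell text of each body row. It runs against a small inline HTML string modeled on the question's markup rather than the live page. The class name StoryStatsTables, the method readRows, and the sample HTML (including the placeholder second table) are illustrative assumptions, not part of the original answer.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class StoryStatsTables {
    // Hypothetical sample: two tables inside one storyStats div, one body row each.
    static final String SAMPLE_HTML =
        "<div class=\"storyStats\">"
        + "<table><thead><tr><th>RANK</th><th>CENTRES</th><th>TEAM</th></tr></thead>"
        + "<tbody><tr class=\"bg1\"><td>1.</td>"
        + "<td><a href=\"#\">Steven Stamkos</a></td><td>Tampa Bay</td></tr></tbody></table>"
        + "<table><thead><tr><th>RANK</th><th>WINGERS</th><th>TEAM</th></tr></thead>"
        + "<tbody><tr class=\"bg1\"><td>1.</td>"
        + "<td><a href=\"#\">Player A</a></td><td>Team A</td></tr></tbody></table>"
        + "</div>";

    // Returns one list of cell texts per body row, across all tables in the first storyStats element.
    static List<List<String>> readRows(String html) {
        Document doc = Jsoup.parse(html);
        Elements tdsEls = doc.getElementsByClass("storyStats");
        List<List<String>> rows = new ArrayList<>();
        if (tdsEls.isEmpty()) {
            return rows; // nothing to parse
        }
        Elements es = tdsEls.first().children();
        for (Element child : es) {
            if (!child.tagName().equals("table")) {
                continue; // skip any non-table children of the div
            }
            for (Element row : child.select("tbody tr")) {
                List<String> cells = new ArrayList<>();
                for (Element td : row.select("td")) {
                    cells.add(td.text());
                }
                if (!cells.isEmpty()) {
                    rows.add(cells);
                }
            }
        }
        return rows;
    }

    public static void main(String[] args) {
        for (List<String> row : readRows(SAMPLE_HTML)) {
            System.out.println(row);
        }
    }
}
```

Against the live page, the same loop could be fed the Document returned by doHttpGet() instead of the parsed sample string.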


断爱 2025-01-10 22:19:48

This should get you started. Each table has a blank record you will have to account for. You will also have to figure out which stats you want and where they are in the table. You get the stats with tds.get(). Let me know how it works for you.

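    Document doc = Jsoup.connect("http://www.tsn.ca/fantasy_news/feature/?ID=49815").get();

    // Visit every table inside the storyStats containers.
    for (Element table : doc.select("div.storyStats").select("table")) {
        for (Element row : table.select("tr")) {
            Elements tds = row.select("td");
            // Header rows have no <td> cells, and short rows would make get(5) throw, so skip both.
            if (tds.size() > 5) {
                // Column 1 is the player name and column 5 is G (goals).
                System.out.println(tds.get(1).text() + ":" + tds.get(5).text());
            }
        }
    }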
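
To depend less on hard-coded indexes when deciding which stats you want and where they are, one option is to look columns up by header name and to treat short or empty rows as missing data. The sketch below works on plain string lists so it runs without a network connection; in the loop above, the lists could be filled from the text of each table's th and td cells. StatColumns and its methods are hypothetical helpers, not part of jsoup.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Optional;

public class StatColumns {
    // Returns the text of the named column, or empty if the header is unknown,
    // the row is too short (for example a blank record), or the cell is empty.
    static Optional<String> cell(List<String> headers, List<String> cells, String header) {
        int idx = headers.indexOf(header);
        if (idx < 0 || idx >= cells.size()) {
            return Optional.empty();
        }
        String text = cells.get(idx).trim();
        return text.isEmpty() ? Optional.<String>empty() : Optional.of(text);
    }

    // Parses a numeric stat such as "101" or "-2"; empty if the cell is not a number.
    static Optional<Double> stat(List<String> headers, List<String> cells, String header) {
        Optional<String> text = cell(headers, cells, header);
        if (!text.isPresent()) {
            return Optional.empty();
        }
        try {
            return Optional.of(Double.parseDouble(text.get()));
        } catch (NumberFormatException e) {
            return Optional.empty();
        }
    }

    public static void main(String[] args) {
        // Header and row values taken from the table in the question.
        List<String> headers = Arrays.asList(
            "RANK", "CENTRES", "TEAM", "POS", "GP", "G", "A", "PTS", "+/-", "PIM", "PPP");
        List<String> row = Arrays.asList(
            "1.", "Steven Stamkos", "Tampa Bay", "C", "81", "50", "51", "101", "-2", "56", "38");
        System.out.println(cell(headers, row, "CENTRES").orElse("?")
            + ":" + stat(headers, row, "PTS").orElse(Double.NaN));
    }
}
```

Returning an Optional instead of throwing keeps one malformed row from aborting the whole scrape; whether to skip or log such rows is left to the caller.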
