JSoup: select a div with a specific ID

Posted on 2024-12-15 10:14:16

I'm making a small Android application for a class that finds cancer-related events on the American Cancer Society's website. I've been using JSoup to get basic information about the events, and I've tried the select() method to pull specific information from the site. However, my current approach grabs far more HTML nodes than I want, and I can't figure out why. The table that I'm trying to grab looks like this:

EDIT: I realized that the div with id="pnlResults" does not end at that table; it ends about 3 tables later, and all of those tables contain information that I would like to grab. Here is the markup again:

    <div id="pnlResults">

        <h2><span id="lblEventName">American Cancer Society 44th Annual Walter Hagen Golf Tournament</span></h2>
        <!-- General Information Box -->
        <div class="text-box boxed wide">
            <h3 class="head" style="width:97%;">
                General Information
            </h3>
            <div class="content">


                <p>
                    <label>Event Times:</label><span id="lblStartDate">Monday, July 30, 2012</span><span id="lblEndDate"></span><br />
                    <label>&nbsp;</label><span id="lblStartTime">10:00 AM</span> - <span id="lblEndTime">9:00 PM</span>
                </p>
                <p>
                    <label>Time Zone:</label><span id="lblTimeZone">Eastern</span>

                </p>
                <p>
                    <label>Description:</label><span id="lblDesc" class="fieldData long">The American Cancer Society Walter Hagen Golf Tournament highlights the Society’s role in supporting research and patient care here in Rochester. Funds raised through this event help us make a difference in patents’ lives every day though programs including Road to Recovery and Patient Navigation as well as support grants to our research institutions.  144 golfers will play a round of golf and then enjoy cocktails, dinner, and silent auction following the tournament. </span>
                </p>
                <p>
                    <label>Agenda:</label><span id="lblAgenda" class="fieldData long">10:00am - Check-in, 11:00am - Lunch, 12:15pm - Shot gun start, 6:00 - Cocktails and silent auction, 7:00pm Dinner and program</span>
                </p>

            </div>
        </div>

        <div id="pnlStandardDisplay">


        <!-- Event Location Box -->
        <div class="text-box boxed wide line">
            <h3 class="head" style="width:97%;">
                Event Location
            </h3>
            <div class="content" style="display:inline-block; width:97%;">


                <div >
                    <div id="mapOutsideContainer" class="resource-map">
                       <div id="map_canvas" class="resource-map" ></div>
                    </div> 
                    <script  type="text/javascript">
                        var mapDataPoints = [{ "lat":43.1075545,"lng":-77.5164518, "title":"Golf Event","content":"<b>American Cancer Society 44th Annual Walter Hagen Golf Tournament<\/b><br/><\/br>4045 East Avenue<br /><br/>Rochester, New York  14618<br /><br />Phone: <br />Fax: "} ];
                        buildMap(mapDataPoints, -5);
                    </script>
                </div>

                <h4><span id="lblLocationName">Irondequoit Country Club</span></h4>
                <p>

                    <label>Address:</label><span id="lblAddress" class="fieldData" style="width:150px;">4045 East Avenue<br />Rochester, New York 14618</span>
                </p>
                <p>
                    <label nowrap="nowrap">Handicap Accessible:</label><span id="lblHandicapAccesible">Yes</span>
                </p>
            </div>

        </div>

        <!-- Primary Contact Box -->
        <div class ="line" >
        <div id="eventPrimaryContact_divContact" class="text-box boxed wide">
                    <h3 class="head" style="width:97%;">
                        Primary Contact
                    </h3>
                    <div class="content">

                        <p>

                            <label>Contact:</label><span id="eventPrimaryContact_lblContact">Katerina Kormas (<a href="mailto:[email protected]?subject=American Cancer Society 44th Annual Walter Hagen Golf Tournament">Contact ACS for Details</a>)</span>

                        </p>
                        <p>
                            <label>Contact Type:</label><span id="eventPrimaryContact_lblContactType">ACS Staff</span>
                        </p>
                        <p>

                            <label>Phone:</label><span id="eventPrimaryContact_lblContactPhone">(585) 288-1950</span>
                        </p>
                        <p>
                            <label>Additional Information:</label><span id="eventPrimaryContact_lblContactAddlInfo" class="fieldData long">Direct line is 585-224-4919 or cell 585-645-8912</span>
                        </p>
                    </div>
                </div>

        </div>

        <!-- Registration Information Box -->

        <div class="text-box boxed wide line">
            <h3 class="head" style="width:97%;">
                Registration Information
            </h3>
            <div class="content">

                <p>
                    <label nowrap="nowrap">Registration Required?: </label><span id="lblRegRequired">Yes</span>

                </p>
            </div>
        </div>       

        <!-- Event Cost Box -->
        <div class ="line" >
        <div id="eventCost_divCost" class="text-box boxed wide">
                    <h3 class="head" style="width:97%;">
                        Event Cost
                    </h3>
                    <div class="content">

                        <p>
                            <label>Cost/Registration Fee: </label><span id="eventCost_lblCostRegFee" class="fieldData long">$350 per golfer</span>
                        </p>
                        <p>
                            <label>Payment Type: </label><span id="eventCost_lblPaymentTypes" class="fieldData">Cash, Check, American Express, Mastercard, Visa, Discover</span>
                        </p>
                        <p>

                            <label>Check Payable To: </label><span id="eventCost_lblCheckPayable" class="fieldData">American Cancer Society</span>
                        </p>
                        <p>
                            <label>Memo Line: </label><span id="eventCost_lblCheckMemo" class="fieldData">American Cancer Society 44th Annual Walter Hagen Golf Tourna</span>
                        </p>
                        <p>
                            <label>Mail Check To:</label><span id="eventCost_lblCheckMailTo" class="fieldData">American Cancer Society<br />1120 South Goodman St<br />Rochester, New York 14620</span>

                        </p>
                    </div>
                </div>

        </div>

        <!-- Tax Deduction Information Box -->
        <div class="line">

                <div class="text-box boxed wide">
                    <h3 class="head" style="width:97%;">
                        Tax Deduction Information
                    </h3>

                    <div class="content">
                        <p>
                            $210  per golfer is tax deductible
                        </p>
                    </div>
                </div>  

        </div>



</div> <!-- end standard display -->
         <!-- end daffodil display -->

EDIT: Given these new tables, I would like to extract the General Information and Event Location boxes. How would I go about doing that? Could I call select() again on the subset I just selected, matching only the headers I want?

The code where I'm using select() is shown below. As I said before, I tried to use

select("div[id=pnlResults]");

but the returned data is much more than just the div whose id is pnlResults.

public ArrayList<Event> results()
{
    ArrayList<Event> results = new ArrayList<Event>();
    Document doc = Jsoup.parse(page);
    Elements links = doc.select("a[href*=event-details]");

    for(Element e: links)
    {
        String title = e.text();
        String link = "http://www.cancer.org/involved/participate/app/"+e.attr("href");
        try{
            Document eventInfo = Jsoup.connect(link).get();
            Elements info = eventInfo.select("div[id*=pnlResults");


        }
        catch(MalformedURLException exception)
        {
            exception.printStackTrace();
        }
        catch(IOException exception)
        {
            exception.printStackTrace();
        }

    }
    return results;
}

Any help would be greatly appreciated.


Comments (3)

自此以后,行同陌路 2024-12-22 10:14:16

Try:

 Elements info = eventInfo.select("div#pnlResults");

Update, following your edit:

Since you now have more data, and since the HTML itself isn't that great, you'll just have to work through it to pick out your data. If the elements you need all have id values, use those ids to get their text.
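
As a rough sketch of that id-based approach: `div#pnlResults` matches the div whose id is exactly pnlResults, whereas `[id*=pnlResults]` is a substring match on the id attribute. Either way the selected element carries all of its descendants, so the individual values still have to be read from the inner spans. The class name EventFields, the extract helper, and the trimmed sample HTML are made up for illustration; the span ids are the ones in the question's markup.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.util.LinkedHashMap;
import java.util.Map;

public class EventFields {
    // Span ids copied from the HTML in the question; extend as needed.
    private static final String[] FIELD_IDS = {
        "lblEventName", "lblStartDate", "lblStartTime",
        "lblEndTime", "lblTimeZone", "lblLocationName"
    };

    /** Maps each known span id found inside div#pnlResults to its text. */
    public static Map<String, String> extract(String html) {
        Map<String, String> fields = new LinkedHashMap<>();
        Document doc = Jsoup.parse(html);
        // first() returns null when nothing matched.
        Element panel = doc.select("div#pnlResults").first();
        if (panel == null) {
            return fields;
        }
        for (String id : FIELD_IDS) {
            Element span = panel.select("span#" + id).first();
            if (span != null) {
                fields.put(id, span.text());
            }
        }
        return fields;
    }

    public static void main(String[] args) {
        // Hypothetical, heavily trimmed stand-in for the real event page.
        String html = "<div id=\"pnlResults\">"
            + "<h2><span id=\"lblEventName\">Golf Tournament</span></h2>"
            + "<p><span id=\"lblTimeZone\">Eastern</span></p>"
            + "<h4><span id=\"lblLocationName\">Irondequoit Country Club</span></h4>"
            + "</div>";
        for (Map.Entry<String, String> e : extract(html).entrySet()) {
            System.out.println(e.getKey() + " = " + e.getValue());
        }
    }
}
```

In the question's loop, the same helper logic could run on the Document returned by Jsoup.connect(link).get() instead of on a parsed string.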

梦晓ヶ微光ヅ倾城 2024-12-22 10:14:16

If you want to get the content of the div with id "pnlResults", JSoup provides the method getElementById.

For example, if you want to get that content and put it in a string, you can do it like this:

Document document = Jsoup.connect(LINK_TO_WEBSITE).get();
String content = document.getElementById("pnlResults").outerHtml();

Then you can put this content in Android's WebView, and it will work nicely.

Hope this will help someone!
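
Building on getElementById, the two boxes the question asks about (General Information and Event Location) could be isolated by their h3 heading text. This is only a sketch under assumptions: the class EventSections and the helper sectionText are hypothetical names, and the code relies on each box being a div.text-box that contains an h3.head and a div.content, as in the markup shown in the question. Note that getElementById returns null when no element has that id, so the result is checked before use.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class EventSections {
    /**
     * Returns the text of the "content" div of the box whose h3 heading
     * equals the given heading, or an empty string if there is no such box.
     */
    public static String sectionText(String html, String heading) {
        Document doc = Jsoup.parse(html);
        Element panel = doc.getElementById("pnlResults");
        if (panel == null) {
            return "";
        }
        // select() searches all descendants, so nested boxes are found too.
        for (Element box : panel.select("div.text-box")) {
            Element head = box.select("h3.head").first();
            Element content = box.select("div.content").first();
            if (head != null && content != null
                    && heading.equalsIgnoreCase(head.text())) {
                return content.text();
            }
        }
        return "";
    }

    public static void main(String[] args) {
        // Hypothetical, trimmed stand-in for the real event page.
        String html = "<div id=\"pnlResults\">"
            + "<div class=\"text-box boxed wide\">"
            + "<h3 class=\"head\"> General Information </h3>"
            + "<div class=\"content\"><p><label>Time Zone:</label>"
            + "<span id=\"lblTimeZone\">Eastern</span></p></div></div>"
            + "<div class=\"text-box boxed wide line\">"
            + "<h3 class=\"head\"> Event Location </h3>"
            + "<div class=\"content\"><h4>Irondequoit Country Club</h4></div></div>"
            + "</div>";
        System.out.println(sectionText(html, "General Information"));
        System.out.println(sectionText(html, "Event Location"));
    }
}
```

Matching on the heading text keeps the code independent of the box order, at the cost of breaking if the site renames a heading.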

疾风者 2024-12-22 10:14:16

This worked for me:

import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

public class DivStuff {
   public static final String MY_PAGE = "http://www.cancer.org/Involved/Participate/app" +
        "/event-search.aspx?zip=28590&city=&state=&local-radius=20&textsrch=&startdate=" +
        "11%2F13%2F2011&enddate=&all=1";
   // Column headings of the search-results table, in column order.
   private static final String[] HEADINGS = {"Event", "Location", "City, State", "Date", "Distance"};

   public static void main(String[] args) throws IOException {
      Document doc = Jsoup.connect(MY_PAGE).get();

      // Collect every row of every table on the search-results page.
      Elements tables = doc.select("table");
      Elements rows = tables.select("tr");

      // Need a header row plus at least one data row.
      if (rows.size() < 2) {
         return;
      }

      // Start at 1 to skip the header row.
      for (int i = 1; i < rows.size(); i++) {
         Elements cells = rows.get(i).select("td");
         // Stop at the first row that is not a five-column event row.
         if (cells.size() != 5) {
            break;
         }
         for (int j = 0; j < HEADINGS.length; j++) {
            System.out.print(HEADINGS[j] + ": ");
            if (j == 0) {
               // The event name is the text of the link in the first cell.
               System.out.println(cells.get(j).select("a").get(0).text());
            } else {
               System.out.println(cells.get(j).text());
            }
         }
         System.out.println();
      }
   }
}