Python BeautifulSoup: finding span id names without using string/re methods

Posted on 2024-11-25 11:01:54

I'm trying to get the id name of my span tags.

<td vAlign="top" colSpan="2"><IMG height="25" src="images/spacer.gif" width="1"><br>
    <!--start table details-->
    <table cellSpacing="1" cellPadding="5" width="100%" bgColor="#a18c42" border="0" id="compDetails">
        <tr bgColor="white">
            <td class="rowName" noWrap>מספר תאגיד:</td>

            <td width="100%" colSpan="3"><span id="lblCompanyNumber">520000472</span></td>
        </tr>
        <tr bgColor="white">
            <td class="rowName" noWrap>שם תאגיד (עברית):</td>
            <td width="50%"><span id="lblCompanyNameHeb">חברת החשמל לישראל בעמ</span></td>
            <td class="rowName" noWrap>שם תאגיד (אנגלית):</td>
            <td width="50%"><span id="lblCompanyNameEn"></span></td>

        </tr>
        <tr bgColor="white">
            <td class="rowName" noWrap>סטטוס:</td>
            <td width="50%"><span id="lblStatus">פעילה</span></td>
            <td class="rowName" noWrap>סוג תאגיד:</td>
            <td width="50%"><span id="lblCorporationType">חברה ציבורית</span></td>
        </tr>

        <tr bgColor="white">
            <td class="rowName" noWrap>סוג חברה ממשלתית:</td>
            <td width="50%"><span id="lblGovCompanyType">חברה  ממשלתית</span></td>
            <td class="rowName" noWrap>סוג מגבלות:</td>
            <td width="50%"><span id="lblLimitType">מוגבלת</span></td>

Let's say htmlSpan contains the HTML above:

soup = BeautifulSoup(htmlSpan , fromEncoding="windows-1255") # I want to use windows-1255 and not utf8
spans = soup('span', limit=30)

This is the output:

[<span class="mainTitle">╫¿╫⌐╫¥ ╫פ╫ק╫ס╫¿╫ץ╫¬</span>,
 <span class="subTitle">╫ñ╫¿╫ר╫ש
            ╫ק╫ס╫¿╫פ/╫⌐╫ץ╫¬╫ñ╫ץ╫¬</span>,
 <span id="lblCompanyNumber">514568245</span>,
 <span id="lblCompanyNameHeb">╫£╫ס╫ש╫נ ╫נ╫ש╫á╫ר╫ע╫¿╫ª╫ש╫פ ╫ץ╫á╫ש╫¬╫ץ╫ק ╫₧╫ó╫¿╫¢╫
ץ╫¬ ╫ס╫ó"╫₧</span>,
 <span id="lblCompanyNameEn">LAVI INTEGRATION &SYSTEM; ANALYSIS LTD</span>,
 <span id="lblStatus">╫ñ╫ó╫ש╫£╫פ</span>,
 <span id="lblCorporationType">╫ק╫ס╫¿╫פ ╫ñ╫¿╫ר╫ש╫¬</span>,
 <span id="lblGovCompanyType">╫ק╫ס╫¿╫פ ╫£╫נ ╫₧╫₧╫⌐╫£╫¬╫ש╫¬</span>,
 <span id="lblLimitType">╫₧╫ץ╫ע╫ס╫£╫¬</span>,
 <span id="lblStatusMafera"><b><font color="Red"></font></b></span>,
 <span id="lblMaferaDate"></span>,
 <span id="lblStatusMafera1"><b><font color="Red"></font></b></span>,
 <span id="lblCountry">╫ש╫⌐╫¿╫נ╫£</span>,
 <span id="lblCity">╫ק╫ף╫¿╫פ</span>,
 <span id="lblStreet">╫פ╫£╫£ ╫ש╫ñ╫פ</span>,
 <span id="lblStreetNumber">34</span>,
 <span id="lblZipCode">38424</span>,
 <span id="lblPOB"></span>,
 <span id="lblLocatedAt"></span>,
 <span id="lblCompanyGoal">╫£╫ó╫í╫ץ╫º ╫ס╫¢╫£ ╫ó╫ש╫í╫ץ╫º ╫ק╫ץ╫º╫ש</span>,
 <span id="lblCompanyDesc"></span>,
 <span id="lblDochShana"></span>]

I know how to get the span content, but I can't get the span id name ('lblStatus', for example).

How can I get it with BeautifulSoup's methods?

I'm also having trouble saving the span contents without BeautifulSoup converting them (charset-wise) to UTF-8 (or gibberish). In the end I need to save the span id names and contents to a CSV file, and I'm running into UTF-8 problems with it.

Thanks

Comments (2)

何必那么矫情 2024-12-02 11:01:54

> I can't get the span id name ('lblStatus', for example).

Using spans as set by your own code:

for span in spans:
    if span.get('id'):  # skip spans without an id, e.g. class="mainTitle"
        print(span['id'])
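
On the current bs4 package (rather than BeautifulSoup 3, which the question appears to use), the same lookup might be sketched as below. The sample HTML is a shortened stand-in for the page in the question, and `span_ids` is a hypothetical helper name, not part of the library. Passing `id=True` to `find_all` restricts the search to spans that actually carry an id attribute.

```python
from bs4 import BeautifulSoup

# Shortened stand-in for the page in the question (assumption: the real page is larger).
SAMPLE_HTML = """
<span class="mainTitle">title</span>
<table id="compDetails"><tr>
  <td><span id="lblCompanyNumber">520000472</span></td>
  <td><span id="lblStatus">פעילה</span></td>
</tr></table>
"""


def span_ids(html):
    """Return the id of every <span> that has one (hypothetical helper)."""
    soup = BeautifulSoup(html, "html.parser")
    # id=True matches only tags on which the id attribute is present
    return [span["id"] for span in soup.find_all("span", id=True)]


print(span_ids(SAMPLE_HTML))
```

Filtering in the query avoids a KeyError on spans such as class="mainTitle" that have no id at all.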

> I'm also having trouble saving the spans content without BeautifulSoup converting to utf8 or gibberish

I could not replicate this: for me the output of spans is not gibberish, but the same characters as in the HTML. Are you sure the page you are trying to parse is encoded in "windows-1255"? Do you have a proper UTF-8 encoding declaration (# -*- coding: UTF-8 -*-) in your Python file?
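
One plausible explanation for the box-drawing gibberish in the question, offered as an assumption rather than a diagnosis: the text was emitted as UTF-8 bytes and displayed by a console using code page 862 (the Hebrew DOS code page). Neither detail is stated in the question. Under that assumption the effect can be reproduced with the standard codecs alone, and the first span of the posted output maps back to readable Hebrew:

```python
# Assumption: UTF-8 bytes were shown on a console that uses code page 862.
original = "רשם החברות"

# Re-interpreting the UTF-8 bytes as cp862 yields the box-drawing gibberish.
mojibake = original.encode("utf-8").decode("cp862")
print(mojibake)

# The damage is reversible as long as no bytes were lost on the way.
recovered = mojibake.encode("cp862").decode("utf-8")
print(recovered == original)
```

If this is what happened, the parsed data itself is intact and only the console display is wrong.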

UTF-8 is pretty much the standard in Python nowadays, and BeautifulSoup converts every document to Unicode internally. My suggestion would be to work in Unicode (UTF-8) throughout your code and change encoding (if you truly need to) only when you output/dump data.
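
The "convert only at the edges" advice can be sketched with plain Python 3 strings. The byte string below is a made-up stand-in for what a windows-1255 server might send; that the real page uses this encoding is the asker's claim, not something verified here.

```python
# Bytes as they might arrive from a server that serves windows-1255 (assumption).
raw = '<span id="lblStatus">פעילה</span>'.encode("windows-1255")

# Input boundary: decode once, using the encoding the page really uses.
text = raw.decode("windows-1255")

# ... all parsing and processing happens on `text`, an ordinary str ...

# Output boundary: choose whatever encoding the consumer needs.
as_utf8 = text.encode("utf-8")
as_cp1255 = text.encode("windows-1255")

print(len(raw), len(as_utf8), len(as_cp1255))
```

Hebrew letters take one byte each in windows-1255 and two in UTF-8, so the two encoded forms differ in length even though they represent the same text.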

> in the end I need to save the span id name and content into a csv...

This is just a rough idea that you should tweak as per your need:

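import csv

# Python 3: newline='' is what the csv module expects; the encoding is explicit
with open('output.csv', 'w', newline='', encoding='utf-8') as file_:
    writer = csv.writer(file_)
    for span in spans:
        writer.writerow([span.get('id'), span.string])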
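
A self-contained variant of the same idea, as a rough sketch for Python 3: the rows are hypothetical (id, text) pairs standing in for what the span loop would produce, and `write_rows` is an illustrative helper name. It writes to an in-memory buffer so the result can be inspected without touching the disk.

```python
import csv
import io

# Hypothetical (id, text) pairs, as they might come out of the span loop.
rows = [
    ("lblCompanyNumber", "520000472"),
    ("lblStatus", "פעילה"),
    ("lblCompanyNameEn", None),  # span.string is None for an empty span
]


def write_rows(stream, rows):
    """Write id/text pairs as CSV, turning None into an empty cell."""
    writer = csv.writer(stream)
    writer.writerow(["id", "text"])
    for span_id, text in rows:
        writer.writerow([span_id, text or ""])


# In-memory demo; for a real file one might use
# open("output.csv", "w", newline="", encoding="utf-8-sig")
buffer = io.StringIO(newline="")
write_rows(buffer, rows)
print(buffer.getvalue())
```

The "utf-8-sig" codec prepends a byte order mark, which spreadsheet programs such as Excel commonly rely on to recognise a CSV file as UTF-8; plain "utf-8" is fine when the consumer is another script.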

> ...and I'm having utf8 problems with it.

Could you specify what your problems are? On my system (GNU/Linux) it works just fine.

掩耳倾听 2024-12-02 11:01:54

You can access a tag's attributes by treating the tag like a dict, keyed by attribute name:

for span in spans:
    if span.get('id'):  # skip spans without an id, e.g. class="mainTitle"
        print(span['id'])

gives what you want: lblCompanyNumber lblCompanyNameHeb lblCompanyNameEn lblStatus lblCorporationType lblGovCompanyType lblLimitType...
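
The dict-like behaviour can be seen in isolation in the sketch below, written against the current bs4 package; the two-span HTML string is made up for illustration.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    '<span id="lblStatus" class="value">x</span><span class="mainTitle">y</span>',
    "html.parser",
)
with_id, without_id = soup.find_all("span")

print(with_id["id"])         # subscript access; raises KeyError if the attribute is missing
print(with_id.attrs)         # the underlying attribute dict
print(without_id.get("id"))  # .get() returns None instead of raising
```

Subscript access suits attributes that must be present, while `.get()` is the safer choice when some tags may lack the attribute.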

> I'm also having trouble saving the spans content into a csv without BeautifulSoup converting (charset) it to utf8 (or gibberish)

mac's answer to use decode() is correct. It's unrelated to sys.getdefaultencoding(), which defaults to 'ascii'; that setting doesn't matter here.
