Getting "AttributeError: 'NoneType' object has no attribute 'count'" when expanding my dataset

Posted on 2025-01-10 11:13:01

I'm analyzing a Twitter dataset in Python and trying to find every quote. The code is supposed to give me a .csv file with a list of all tweets and their quotes.
I found some code on GitHub where someone tried the same thing, but with data from a website. I adapted my dataset to fit the code.
The tweets are all in an .xml file like this:

<articles>
     <article>
          <paragraph>Tweet text is here.</paragraph>
          <paragraph>Tweet text is here.</paragraph>
     </article>
</articles>

My dataset has 1,000,000 tweets. When analyzing a sample of 50,000 tweets, everything works as expected.
When analyzing the full dataset, I get this message:

Traceback (most recent call last):
  File "C:/xxx.py", line 16, in <module>
    count = text.count("\'")
AttributeError: 'NoneType' object has no attribute 'count'

Why do I get this when I analyze the whole dataset but not when I analyze the sample?

Here's my code:

import xml.etree.ElementTree as ET
import pandas as pd
import numpy as np

tree = ET.parse('tweets.xml')
articles = tree.getroot()

paragraphs_with_quotes = []
paragraphs_with_double_quotes = []
quotes = []
extracted_paragraphs = []

for article in articles:
    for paragraph in article.findall('paragraph'):
        text = paragraph.text
        count = text.count("\'")
        indexes = []
        if count > 1:
            paragraphs_with_quotes.append(text)
            index = text.index("\'")
            while count > 0:

                if text[index - 1] == " " or index == len(text) - 1 or text[index + 1] in " .,":
                    indexes.append(index)
                if count > 1:
                    index = text.index("\'", index + 1)
                count -= 1
            for i in range(0, len(indexes), 2):
                start = indexes[i]
                end = indexes[min(len(indexes) - 1, i + 1)]
                print(text)

                quotes.append(text[indexes[i]:indexes[min(len(indexes) - 1, i + 1)] + 1])
                extracted_paragraphs.append(text)

                print("Quote:" + quotes[len(quotes) - 1])
                print()

d = {'Paragraph:': extracted_paragraphs, 'Quote:': quotes}
quote_data = pd.DataFrame(d)
quote_data.to_csv('quote_data.csv')

for i in range(1):
    print()

print(len(paragraphs_with_quotes))

Thank you!

Comments (1)

人│生佛魔见 2025-01-17 11:13:01

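My guess would be that one of your paragraph elements is empty (for example <paragraph/> or <paragraph></paragraph>). An article with no paragraphs at all would just be skipped by findall, but for an empty paragraph ElementTree sets paragraph.text to None. The full dataset is simply more likely to contain such an element than a 50,000-tweet sample.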
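
To illustrate where the None can come from, here is a minimal sketch. It uses a made-up three-paragraph document (SAMPLE_XML is an assumption for the example, not the asker's data) and shows what ElementTree reports as .text for an element with no text content.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample data: the second and third paragraphs are empty.
SAMPLE_XML = """
<articles>
     <article>
          <paragraph>Tweet text is here.</paragraph>
          <paragraph></paragraph>
          <paragraph/>
     </article>
</articles>
"""

root = ET.fromstring(SAMPLE_XML)

# Collect the .text of every <paragraph>; empty elements give None.
texts = [p.text for p in root.iter('paragraph')]
print(texts)

# Positions of the paragraphs that would trigger the AttributeError.
empty_positions = [i for i, t in enumerate(texts) if t is None]
print(empty_positions)
```

Running the same kind of check against the full file should show whether empty paragraph elements are present and roughly where they sit.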

You need to handle the case where text is None:

if (text is not None):
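
One possible way to apply that check to the loop from the question is sketched below. The helper name iter_paragraph_texts and the inline sample document are assumptions made for illustration; the rest of the quote-extraction logic would stay as it is and run on the texts this helper yields.

```python
import xml.etree.ElementTree as ET


def iter_paragraph_texts(articles):
    """Yield the text of each <paragraph>, skipping elements with no text."""
    for article in articles:
        for paragraph in article.findall('paragraph'):
            text = paragraph.text
            if text is None:
                # Empty <paragraph/>: nothing to search for quotes in.
                continue
            yield text


# Hypothetical sample: one normal paragraph and one empty one.
articles = ET.fromstring(
    "<articles><article>"
    "<paragraph>He said 'hello there' twice.</paragraph>"
    "<paragraph/>"
    "</article></articles>"
)

for text in iter_paragraph_texts(articles):
    count = text.count("'")  # safe: text is always a str here
    print(count, text)
```

An alternative under the same assumptions is to write text = paragraph.text or "", which treats an empty paragraph as an empty string instead of skipping it; the count is then 0 and the existing if count > 1 test ignores it.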