Getting "AttributeError: 'NoneType' object has no attribute 'count'" when extending my dataset

Posted on 2025-01-10 11:13:01

I'm analyzing a Twitter dataset in Python and trying to find every quote. The code is supposed to give me a .csv file with a list of all tweets and their quotes.
I found some code on GitHub where someone tried the same thing, but with data from a website. I adjusted my dataset to fit the code.
The tweets are all in an .xml file like this:

<articles>
     <article>
          <paragraph>Tweet text is here.</paragraph>
          <paragraph>Tweet text is here.</paragraph>
     </article>
</articles>

My dataset has 1,000,000 tweets. When I analyze a sample of 50,000 tweets, everything works as expected.
When I analyze the full dataset, I get this message:

Traceback (most recent call last):
  File "C:/xxx.py", line 16, in <module>
    count = text.count("\'")
AttributeError: 'NoneType' object has no attribute 'count'

Why do I get this when I analyze the whole dataset but not when I analyze the sample?
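
A likely explanation, assuming the full dataset contains at least one empty paragraph element (for example `<paragraph/>` or `<paragraph></paragraph>`) that the 50,000-tweet sample happens not to include: `xml.etree.ElementTree` sets `.text` to `None` for an element with no text, not to an empty string. The small XML string below is made up for illustration and is not taken from the real dataset.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: the second and third paragraphs carry no text.
sample = """<articles>
  <article>
    <paragraph>He said 'hello' to me.</paragraph>
    <paragraph></paragraph>
    <paragraph/>
  </article>
</articles>"""

root = ET.fromstring(sample)

# Collect the .text of every <paragraph>; empty elements yield None.
texts = [p.text for p in root.iter('paragraph')]
empty_count = sum(1 for t in texts if t is None)

print(texts)
print("Empty paragraphs:", empty_count)
```

Running the same `None` count against the full `tweets.xml` would show whether this is really what triggers the error.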

Here's my code:

import xml.etree.ElementTree as ET
import pandas as pd
import numpy as np

tree = ET.parse('tweets.xml')
articles = tree.getroot()

paragraphs_with_quotes = []
paragraphs_with_double_quotes = []
quotes = []
extracted_paragraphs = []

for article in articles:
    for paragraph in article.findall('paragraph'):
        text = paragraph.text
        count = text.count("\'")
        indexes = []
        if count > 1:
            paragraphs_with_quotes.append(text)
            index = text.index("\'")
            while count > 0:

                if text[index - 1] == " " or index == len(text) - 1 or text[index + 1] in " .,":
                    indexes.append(index)
                if count > 1:
                    index = text.index("\'", index + 1)
                count -= 1
            for i in range(0, len(indexes), 2):
                start = indexes[i]
                end = indexes[min(len(indexes) - 1, i + 1)]
                print(text)

                quotes.append(text[indexes[i]:indexes[min(len(indexes) - 1, i + 1)] + 1])
                extracted_paragraphs.append(text)

                print("Quote:" + quotes[len(quotes) - 1])
                print()

d = {'Paragraph:': extracted_paragraphs, 'Quote:': quotes}
quote_data = pd.DataFrame(d)
quote_data.to_csv('quote_data.csv')

for i in range(1):
    print()

print(len(paragraphs_with_quotes))
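
If empty paragraphs do turn out to be the cause, one possible fix is to skip them before calling `text.count`. The sketch below is only an illustration: the helper name `iter_paragraph_texts` and the sample XML are hypothetical, and the quote-extraction logic itself is left unchanged.

```python
import xml.etree.ElementTree as ET


def iter_paragraph_texts(root):
    """Yield the text of every <paragraph>, skipping elements without text."""
    for article in root:
        for paragraph in article.findall('paragraph'):
            text = paragraph.text
            if text is None:  # empty element, nothing to search for quotes
                continue
            yield text


# Hypothetical sample with one empty paragraph.
sample = (
    "<articles><article>"
    "<paragraph>He said 'hi' twice.</paragraph>"
    "<paragraph/>"
    "</article></articles>"
)

root = ET.fromstring(sample)
texts = list(iter_paragraph_texts(root))

# The original per-paragraph check can now run without hitting None.
counts = [t.count("'") for t in texts]
print(texts, counts)
```

A shorter variant is `text = paragraph.text or ""` inside the original loop, which treats an empty paragraph as an empty string. If some paragraphs contain child elements, `paragraph.text` only holds the text before the first child; `"".join(paragraph.itertext())` would gather all of it.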

Thank you!
