Function fails to update the spacing after commas

Posted 2025-01-25 07:47:10

I have a csv file that has inconsistent spacing after the comma, like this:

534323, 93495443,34234234, 3523423423, 2342342,236555, 6564354344

I have written a function that tries to read in the file and makes the spacing consistent, but it doesn't appear to update anything. After opening the new file created, there is no difference from the original. The function I've written is:

def ensure_consistent_spacing_in_csv(dirpath, original_name, new_name):
    with open(dirpath + original_name, "r") as f:
        data = f.readlines()
    for item in data:
        if "," in data:
            comma_index = item.index(",")
            if item[comma_index + 1] != " ":
                item = item.replace(",", ", ")
    with open(dirpath + new_name, "w") as f:
        f.writelines(data)

Where am I going wrong?

I have looked at the answer to the question here, but I cannot use that method because I need the delimiter to be ", ", which is two characters and hence not allowed. I also tried to follow the sed answer to the question here using a process.call system, but that failed as well. I don't know bash well, so I'm hesitant to go that route and would prefer a pure Python method.

Thank you!

Comments (2)

软糯酥胸 2025-02-01 07:47:10

Here is how I was able to normalize the spacing given a string from your example

NOTE: I am assuming the content of the file isn't large enough to exceed the available memory since you read it into the list in your code.

NOTE: regular expressions are not always (read: almost never) the most efficient way to solve a problem, but they get the job done.

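import re

regex = r"(?<=\d)\s*,\s*(?=\d)"  # please see the UPD below
test_str = "534323, 93495443,34234234, 3523423423, 2342342,236555, 6564354344"
subst = ", "
# count and flags are passed by keyword; re.MULTILINE changes nothing here
# because the pattern contains no ^ or $ anchors.
result = re.sub(regex, subst, test_str, count=0, flags=re.MULTILINE)
print(result)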

will produce

534323, 93495443, 34234234, 3523423423, 2342342, 236555, 6564354344

and for a file with the following contents:

1,    2, 3, 4,5,6
1,2,3,4,  5,    6
1,        2,3,4,5,6

I ran

import re

with open('test.csv') as f:
    data = f.read()  # reads the whole file into memory
regex = r"(?<=\d)\s*,\s*(?=\d)"  # please see the UPD below
subst = ", "
result = re.sub(regex, subst, data)
print(result)

and got this result:

1, 2, 3, 4, 5, 6
1, 2, 3, 4, 5, 6
1, 2, 3, 4, 5, 6

Alternatively, you could use the csv module to read the rows and call strip() on each element of every row.
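
The csv alternative mentioned above could look roughly like the sketch below. The function name normalize_rows and the in-memory demo data are made up for illustration. It assumes fields that contain no quotes or embedded commas, and it joins the fields by hand because csv.writer only accepts a one-character delimiter.

```python
import csv
import io


def normalize_rows(src, dst):
    """Copy CSV rows from src to dst with exactly one space after each comma."""
    # skipinitialspace drops whitespace that directly follows a delimiter.
    for row in csv.reader(src, skipinitialspace=True):
        # strip() also removes whitespace before a comma or at the line ends.
        dst.write(", ".join(field.strip() for field in row) + "\n")


# Demonstration on in-memory text instead of real files.
source = io.StringIO("1,    2, 3, 4,5,6\n1,2,3,4,  5,    6\n")
target = io.StringIO()
normalize_rows(source, target)
print(target.getvalue(), end="")
```

With real files, src and dst would be the objects returned by open(), with newline="" on the input file as the csv documentation recommends.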

UPD:
The regex could be simplified to

regex = r"\s*,\s*"
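
One thing to watch with the simplified pattern: \s also matches newlines, so when the whole file is read into one string, a comma at the end of a line would pull the next line up onto the same row. A possible variant, assuming only spaces and tabs around a comma should be touched, is sketched here; the names COMMA_SPACING and normalize_spacing are made up for illustration.

```python
import re

# Assumption: only spaces and tabs next to a comma are normalized,
# so line breaks are never swallowed by the match.
COMMA_SPACING = re.compile(r"[ \t]*,[ \t]*")


def normalize_spacing(text):
    """Return text with exactly one space after each comma and none before."""
    return COMMA_SPACING.sub(", ", text)


# The first line ends with a comma (an empty last field).
sample = "1,    2,3,\n4 ,5,6\n"
print(normalize_spacing(sample), end="")
```

The function can be applied to the string returned by f.read() in the snippet above, or line by line for files that are too large to hold in memory.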

风情万种。 2025-02-01 07:47:10

The original code has a couple of bugs:

  • The if "," in data condition never evaluates to true. data is a list, where each item in the list is a string representing one entire line of the file. No single line in the file is exactly ",", so that condition never evaluates to true. To fix it, use if "," in item. That way it checks whether each line contains a comma.
  • There's also a second problem: the item.index method returns only the position of the first comma, so if the spacing is inconsistent more than once in one line, the algorithm does not catch the later occurrences.
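
For reference, here is a sketch of the question's function with both points addressed. It also deals with a third detail: item = item.replace(...) only rebinds the loop variable and leaves the data list unchanged, so the updated lines have to be collected somewhere. Using os.path.join instead of string concatenation and the in.csv / out.csv demo names are assumptions made for this example.

```python
import os
import tempfile


def ensure_consistent_spacing_in_csv(dirpath, original_name, new_name):
    with open(os.path.join(dirpath, original_name), "r") as f:
        data = f.readlines()
    cleaned = []
    for item in data:
        # Check the line itself (item), not the list of lines (data).
        if "," in item:
            # Handle every comma on the line; strip(" ") keeps the newline.
            item = ", ".join(part.strip(" ") for part in item.split(","))
        # Strings are immutable, so the new line is stored in a new list.
        cleaned.append(item)
    with open(os.path.join(dirpath, new_name), "w") as f:
        f.writelines(cleaned)


# Demonstration in a temporary directory; the file names are made up.
with tempfile.TemporaryDirectory() as tmp:
    with open(os.path.join(tmp, "in.csv"), "w") as f:
        f.write("534323, 93495443,34234234, 3523423423\n")
    ensure_consistent_spacing_in_csv(tmp, "in.csv", "out.csv")
    with open(os.path.join(tmp, "out.csv")) as f:
        result = f.read()
print(result, end="")
```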

A simple solution that doesn't require regular expressions or sed or indexing and looking at each word character by character is:

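# Open both files once; "w" starts the cleaned file fresh on every run.
with open(dirpath + orig_filename, "r") as f, \
        open(dirpath + cleaned_filename, "w") as cleaned_data:
    for line in f:
        # Remove every space, then put exactly one space after each comma.
        new_line = line.replace(" ", "").replace(",", ", ")
        cleaned_data.write(new_line)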

What this is doing is:

  1. for line in f iterates over the file one line at a time.
  2. line.replace(" ", "").replace(",", ", ") first removes all spaces from the line entirely (thanks to @megakarg for the suggestion), and then makes sure there's a single space after each comma to meet the spec.
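
The replace chain works well for purely numeric data like the sample in the question. If a field could itself contain spaces, those would be removed too. The sketch below contrasts the chain with a hypothetical split-based variant that trims only the edges of each field; both function names and the sample strings are made up for illustration.

```python
def normalize_with_replace(line):
    """The replace chain: drop every space, then add one after each comma."""
    return line.replace(" ", "").replace(",", ", ")


def normalize_with_split(line):
    """Hypothetical variant: trim spaces only at the edges of each field."""
    return ", ".join(field.strip(" ") for field in line.split(","))


numbers = "534323, 93495443,34234234\n"
names = "New York,  Los Angeles,Chicago\n"

# Both approaches agree on numeric fields.
print(normalize_with_replace(numbers), end="")
print(normalize_with_split(numbers), end="")

# They differ once a field has a space inside it.
print(normalize_with_replace(names), end="")
print(normalize_with_split(names), end="")
```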