How to get the character count of files ending in .docx and .doc in a directory, divide each file's count by 65, and save the results to an XLSX file

Posted 2025-02-12 19:20:05 | Length: 1329 | Views: 1 | Comments: 0

I have a folder containing many Word documents ending in .doc and .docx.

This code works only for .docx files.
I want it to work for .doc files as well.

import docx
import os

charCounts = {}
directory = os.fsencode('.')
for file in os.listdir(directory):
    filename = os.fsdecode(file)
    if filename.endswith(".docx"):
        #filename = os.path.join(directory, filename)
        doc = docx.Document(filename)
        chars = sum(len(p.text) for p in doc.paragraphs)
        charCounts[filename] = chars / 65

# uses openpyxl package
from openpyxl import Workbook
wb = Workbook()
ws = wb.active

ws.cell(row=1, column=2, value='File Name')
ws.cell(row=1, column=4, value='chars/65')
for i, x in enumerate(charCounts):
    ws.cell(row=i + 3, column=2, value=x)
    ws.cell(row=i + 3, column=4, value=charCounts[x])
# write the grand total once, after the loop, instead of on every iteration
ws.cell(row=len(charCounts) + 3, column=4, value=sum(charCounts.values()))
path = './charCounts.xlsx'
wb.save(path)

Images:

I have files like these:
[image not included]

I want the output to look like this:
[image not included]

Notice two things here.

First, the file names in the Excel sheet are arranged in numeric order.

Second, the file extensions have been removed in the Excel sheet. I want the output like that.


Comments (1)

牵你的手,一向走下去 2025-02-19 19:20:05

Here is an update to the code in your question which will do what I believe you have asked:

# uses python-docx package
import docx
import os

# uses pywin32 package
import win32com.client as win32
from win32com.client import constants
app = win32.gencache.EnsureDispatch('Word.Application')

charCounts = {}
fileDir = '.' # Put the path of the directory to be searched here
os.chdir(fileDir)
cwd = os.getcwd()
directory = os.fsencode(cwd)
for file in os.listdir(directory):
    filename = os.fsdecode(file)
    if filename.startswith('TEMP_CONVERTED_WORD_FILE_'):
        continue
    filenameOrig = None
    if filename.endswith(".doc"):
        filenameOrig = filename
        src_path = os.path.join(cwd, filename)
        src_path_norm = os.path.normpath(src_path)
        doc = app.Documents.Open(src_path_norm)
        doc.Activate()
        docxPath = 'TEMP_CONVERTED_WORD_FILE_' + filename[:-4] + ".docx"
        dest_path = os.path.join(cwd, docxPath)
        dest_path_norm = os.path.normpath(dest_path)
        app.ActiveDocument.SaveAs(dest_path_norm, FileFormat=constants.wdFormatXMLDocument)
        doc.Close(False)
        filename = docxPath
    if filename.endswith(".docx"):
        src_path = os.path.join(cwd, filename)
        src_path_norm = os.path.normpath(src_path)
        doc = docx.Document(src_path_norm)
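        # count characters in the body paragraphs
        body_chars = sum(len(p.text) for p in doc.paragraphs)
        # also count characters in each section's header and footer
        hf_chars = sum(
            len(p.text)
            for section in doc.sections
            for hf in (section.header, section.footer)
            for p in hf.paragraphs
        )
        chars = body_chars + hf_chars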
        charCounts[filenameOrig if filenameOrig else filename] = chars / 65
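app.Quit()  # close the Word instance started above so it does not keep running in the background
charCounts = {k: charCounts[k] for k in sorted(charCounts)}  # rebuild the dict in sorted filename order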

# uses openpyxl package
from openpyxl import Workbook
wb = Workbook()
ws = wb.active

ws.cell(row=1, column=2, value='File Name')
ws.cell(row=1, column=4, value='chars/65')
for i, x in enumerate(charCounts):
    ws.cell(row=i + 3, column=2, value=x[:-4] if x.endswith('.doc') else x[:-5])
    ws.cell(row=i + 3, column=4, value=charCounts[x])
ws.cell(row=len(charCounts) + 3, column=3, value='Total')
ws.cell(row=len(charCounts) + 3, column=4, value=sum(charCounts.values()))
path = './charCounts.xlsx'
wb.save(path)

Explanation:

  • For every file whose name ends in .docx, except those starting with TEMP_CONVERTED_WORD_FILE_, store the character count (divided by 65) in a dictionary charCounts, using the filename as the key
  • For every file ending in .doc, use the pywin32 package of Win32 extensions to convert it to a .docx file with TEMP_CONVERTED_WORD_FILE_ prepended to the filename, then store the character count (divided by 65) in the same dictionary, using the original .doc filename as the key
  • Replace the charCounts dictionary with a new one whose insertion order is sorted by the filename key
  • Iterate through charCounts, storing the contents in an Excel file and taking care to strip the .doc or .docx suffix from the filename key.
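
One caveat on the sorting step: sorted(charCounts) compares filenames as strings, so a name like 10.docx would come before 2.doc. If the files are named with numbers, as the question suggests, a numeric-aware sort key may be closer to the requested output. The following is only a minimal sketch of that idea; natural_key, display_name and the sample dictionary are hypothetical names and data introduced for illustration, not part of the code above.

```python
import os
import re

def natural_key(filename):
    """Build a sort key that compares digit runs as integers.

    re.split with a capture group returns text chunks at even
    positions and the captured digit runs at odd positions.
    """
    stem = os.path.splitext(filename)[0]
    parts = re.split(r'([0-9]+)', stem)
    return [int(p) if i % 2 else p.lower() for i, p in enumerate(parts)]

def display_name(filename):
    """Drop the final extension (.doc or .docx) for the Excel column."""
    return os.path.splitext(filename)[0]

# hypothetical sample data standing in for the real charCounts dictionary
charCounts = {'10.docx': 3.0, '2.doc': 1.5, '1.docx': 2.0}

# rebuild the dictionary in numeric-aware filename order
ordered = {k: charCounts[k] for k in sorted(charCounts, key=natural_key)}
names = [display_name(k) for k in ordered]
print(names)
```

In the script above, this would amount to passing key=natural_key to sorted() and writing display_name(x) into the File Name column. Note that a folder holding both 1.doc and 1.docx would then show two rows with the same name.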