Appending word position numbers to Unicode text in Python

Posted on 2024-08-22 10:55:03

I have some code that appends word positions to the words in a source file,
but the output is not what I want:

The input file contains the following:

3.  भारत का इतिहास काफी समृद्ध एवं विस्तृत है।
57. जैसे आज के झारखंड प्रदेश से, उन दिनों, बहुत से लोग चाय बागानों में मजदूरी करने के उद्देश्य से असम आए।

The original source code is like this:

    #!/usr/bin/python
    # -*- coding: UTF-8 -*-
    # encoding: utf-8

    separators = [u'।', ',', '.']

    text = open("hinstest1.txt").read()

    # This converts the encoded text to an internal unicode object, where
    # all characters are properly recognized as an entity:
    text = text.decode("UTF-8")

    # This breaks the text on the white spaces, yielding a list of words:
    words = text.split()

    counter = 1
    output = ""

    # If the last char is a separator, and is joined to the word:
    for word in words:
        if word[-1] in separators and len(word) > 1:
            # word up to the second to last char:
            output += word[:-1] + u'(%d) ' % counter
            counter += 1
            # last char
            output += word[-1] + u'(%d) ' % counter
        else:
            output += word + u'(%d) ' % counter
            counter += 1
        #for ch in word:
        #   if ch is '\n':
        print output
        #counter = 1

The output for this code is like this:

3(1) .(2) 
3(1) .(2) भारत(2) 
3(1) .(2) भारत(2) का(3) 
3(1) .(2) भारत(2) का(3) इतिहास(4) 
3(1) .(2) भारत(2) का(3) इतिहास(4) काफी(5) 
3(1) .(2) भारत(2) का(3) इतिहास(4) काफी(5) समृद्ध(6) 
3(1) .(2) भारत(2) का(3) इतिहास(4) काफी(5) समृद्ध(6) एवं(7) 
3(1) .(2) भारत(2) का(3) इतिहास(4) काफी(5) समृद्ध(6) एवं(7) विस्तृत(8) 
3(1) .(2) भारत(2) का(3) इतिहास(4) काफी(5) समृद्ध(6) एवं(7) विस्तृत(8) है(9) ।(10) 
3(1) .(2) भारत(2) का(3) इतिहास(4) काफी(5) समृद्ध(6) एवं(7) विस्तृत(8) है(9) ।(10) 57(10) .(11) 
3(1) .(2) भारत(2) का(3) इतिहास(4) काफी(5) समृद्ध(6) एवं(7) विस्तृत(8) है(9) ।(10) 57(10) .(11) जैसे(11) 
3(1) .(2) भारत(2) का(3) इतिहास(4) काफी(5) समृद्ध(6) एवं(7) विस्तृत(8) है(9) ।(10) 57(10) .(11) जैसे(11) आज(12) 
3(1) .(2) भारत(2) का(3) इतिहास(4) काफी(5) समृद्ध(6) एवं(7) विस्तृत(8) है(9) ।(10) 57(10) .(11) जैसे(11) आज(12) के(13) 
3(1) .(2) भारत(2) का(3) इतिहास(4) काफी(5) समृद्ध(6) एवं(7) विस्तृत(8) है(9) ।(10) 57(10) .(11) जैसे(11) आज(12) के(13) झारखंड(14) 
3(1) .(2) भारत(2) का(3) इतिहास(4) काफी(5) समृद्ध(6) एवं(7) विस्तृत(8) है(9) ।(10) 57(10) .(11) जैसे(11) आज(12) के(13) झारखंड(14) प्रदेश(15) 
3(1) .(2) भारत(2) का(3) इतिहास(4) काफी(5) समृद्ध(6) एवं(7) विस्तृत(8) है(9) ।(10) 57(10) .(11) जैसे(11) आज(12) के(13) झारखंड(14) प्रदेश(15) से(16) ,(17) 
3(1) .(2) भारत(2) का(3) इतिहास(4) काफी(5) समृद्ध(6) एवं(7) विस्तृत(8) है(9) ।(10) 57(10) .(11) जैसे(11) आज(12) के(13) झारखंड(14) प्रदेश(15) से(16) ,(17) उन(17) 
3(1) .(2) भारत(2) का(3) इतिहास(4) काफी(5) समृद्ध(6) एवं(7) विस्तृत(8) है(9) ।(10) 57(10) .(11) जैसे(11) आज(12) के(13) झारखंड(14) प्रदेश(15) से(16) ,(17) उन(17) दिनों(18) ,(19) 
3(1) .(2) भारत(2) का(3) इतिहास(4) काफी(5) समृद्ध(6) एवं(7) विस्तृत(8) है(9) ।(10) 57(10) .(11) जैसे(11) आज(12) के(13) झारखंड(14) प्रदेश(15) से(16) ,(17) उन(17) दिनों(18) ,(19) बहुत(19) 
3(1) .(2) भारत(2) का(3) इतिहास(4) काफी(5) समृद्ध(6) एवं(7) विस्तृत(8) है(9) ।(10) 57(10) .(11) जैसे(11) आज(12) के(13) झारखंड(14) प्रदेश(15) से(16) ,(17) उन(17) दिनों(18) ,(19) बहुत(19) से(20) 
3(1) .(2) भारत(2) का(3) इतिहास(4) काफी(5) समृद्ध(6) एवं(7) विस्तृत(8) है(9) ।(10) 57(10) .(11) जैसे(11) आज(12) के(13) झारखंड(14) प्रदेश(15) से(16) ,(17) उन(17) दिनों(18) ,(19) बहुत(19) से(20) लोग(21) 
3(1) .(2) भारत(2) का(3) इतिहास(4) काफी(5) समृद्ध(6) एवं(7) विस्तृत(8) है(9) ।(10) 57(10) .(11) जैसे(11) आज(12) के(13) झारखंड(14) प्रदेश(15) से(16) ,(17) उन(17) दिनों(18) ,(19) बहुत(19) से(20) लोग(21) चाय(22) 
3(1) .(2) भारत(2) का(3) इतिहास(4) काफी(5) समृद्ध(6) एवं(7) विस्तृत(8) है(9) ।(10) 57(10) .(11) जैसे(11) आज(12) के(13) झारखंड(14) प्रदेश(15) से(16) ,(17) उन(17) दिनों(18) ,(19) बहुत(19) से(20) लोग(21) चाय(22) बागानों(23) 
3(1) .(2) भारत(2) का(3) इतिहास(4) काफी(5) समृद्ध(6) एवं(7) विस्तृत(8) है(9) ।(10) 57(10) .(11) जैसे(11) आज(12) के(13) झारखंड(14) प्रदेश(15) से(16) ,(17) उन(17) दिनों(18) ,(19) बहुत(19) से(20) लोग(21) चाय(22) बागानों(23) में(24) 
3(1) .(2) भारत(2) का(3) इतिहास(4) काफी(5) समृद्ध(6) एवं(7) विस्तृत(8) है(9) ।(10) 57(10) .(11) जैसे(11) आज(12) के(13) झारखंड(14) प्रदेश(15) से(16) ,(17) उन(17) दिनों(18) ,(19) बहुत(19) से(20) लोग(21) चाय(22) बागानों(23) में(24) मजदूरी(25) 
3(1) .(2) भारत(2) का(3) इतिहास(4) काफी(5) समृद्ध(6) एवं(7) विस्तृत(8) है(9) ।(10) 57(10) .(11) जैसे(11) आज(12) के(13) झारखंड(14) प्रदेश(15) से(16) ,(17) उन(17) दिनों(18) ,(19) बहुत(19) से(20) लोग(21) चाय(22) बागानों(23) में(24) मजदूरी(25) करने(26) 
3(1) .(2) भारत(2) का(3) इतिहास(4) काफी(5) समृद्ध(6) एवं(7) विस्तृत(8) है(9) ।(10) 57(10) .(11) जैसे(11) आज(12) के(13) झारखंड(14) प्रदेश(15) से(16) ,(17) उन(17) दिनों(18) ,(19) बहुत(19) से(20) लोग(21) चाय(22) बागानों(23) में(24) मजदूरी(25) करने(26) के(27) 
3(1) .(2) भारत(2) का(3) इतिहास(4) काफी(5) समृद्ध(6) एवं(7) विस्तृत(8) है(9) ।(10) 57(10) .(11) जैसे(11) आज(12) के(13) झारखंड(14) प्रदेश(15) से(16) ,(17) उन(17) दिनों(18) ,(19) बहुत(19) से(20) लोग(21) चाय(22) बागानों(23) में(24) मजदूरी(25) करने(26) के(27) उद्देश्य(28) 
3(1) .(2) भारत(2) का(3) इतिहास(4) काफी(5) समृद्ध(6) एवं(7) विस्तृत(8) है(9) ।(10) 57(10) .(11) जैसे(11) आज(12) के(13) झारखंड(14) प्रदेश(15) से(16) ,(17) उन(17) दिनों(18) ,(19) बहुत(19) से(20) लोग(21) चाय(22) बागानों(23) में(24) मजदूरी(25) करने(26) के(27) उद्देश्य(28) से(29) 
3(1) .(2) भारत(2) का(3) इतिहास(4) काफी(5) समृद्ध(6) एवं(7) विस्तृत(8) है(9) ।(10) 57(10) .(11) जैसे(11) आज(12) के(13) झारखंड(14) प्रदेश(15) से(16) ,(17) उन(17) दिनों(18) ,(19) बहुत(19) से(20) लोग(21) चाय(22) बागानों(23) में(24) मजदूरी(25) करने(26) के(27) उद्देश्य(28) से(29) असम(30) 
3(1) .(2) भारत(2) का(3) इतिहास(4) काफी(5) समृद्ध(6) एवं(7) विस्तृत(8) है(9) ।(10) 57(10) .(11) जैसे(11) आज(12) के(13) झारखंड(14) प्रदेश(15) से(16) ,(17) उन(17) दिनों(18) ,(19) बहुत(19) से(20) लोग(21) चाय(22) बागानों(23) में(24) मजदूरी(25) करने(26) के(27) उद्देश्य(28) से(29) असम(30) आए(31) ।(32)

I have tried to modify the above code so that the counter detects a new line and restarts the word positions from 1 on every new line. I also need to make sure that no word position is displayed for the serial numbers at the start of each line.

My modified code is not 100% correct. Could you please help me correct it to get the desired output?

The modified code looks like this:

    #!/usr/bin/python
    # -*- coding: UTF-8 -*-
    # encoding: utf-8
    import fileinput

    list1 = []
    separators = [u'।', ',', '.']
    chknwlin = ['\n']
    text = open("hinstest1.txt").read()
    output_file = ("ophwp1.txt")

    # This converts the encoded text to an internal unicode object, where
    # all characters are properly recognized as an entity:
    text = text.decode("UTF-8")

    # This breaks the text on the white spaces, yielding a list of words:
    words = text.split()

    counter = 1
    output = ""

    # If the last char is a separator, and is joined to the word:
    for line in words:
        for word in line:
            for ch in line:
                if word[-1] in separators and len(word) > 1:
                    # word up to the second to last char:
                    output += word[:-1] + u'(%d) ' % counter
                    counter += 1
                    # last char
                    output += word[-1] + u'(%d) ' % counter
                else:
                    output += word + u'(%d) ' % counter
                    counter += 1
                #   if ch is '\n':
                if ch in chknwlin:
                    #for ch in words:
                    print output
                    counter = 1
                    list1.append(output)

    #words.close()

    f1 = open(output_file, 'w')
    f1.write(' '.join(list1))
    f1.close()

I finally want the output to look like this:

3. भारत(1) का(2) इतिहास(3) काफी(4) समृद्ध(5) एवं(6) विस्तृत(7) है(8) ।(9)
57. जैसे(1) आज(2) के(3) झारखंड(4) प्रदेश(5) से(6) ,(7) उन(8) दिनों(9) ,(10) बहुत(11) से(12) लोग(13) चाय(14) बागानों(15) में(16) मजदूरी(17) करने(18) के(19) उद्देश्य(20) से(21) असम(22) आए(23) ।(24)

The modified code prints nothing to the console and writes nothing to the output file.

何处潇湘 2024-08-29 10:55:04

Try changing:

counter = 1
for line in words:
    # etc...

to:

for line in words:
    counter = 1
    # etc...

This will reset the counter to 1 for each new line.
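
One caveat, offered as a hedged sketch rather than a tested fix for the question's script: in the question, `words` comes from `text.split()`, which splits on every run of whitespace including newlines, so the loop variable named `line` is really a single word. The reset only has the intended effect if the outer loop iterates over real lines, for example via `str.splitlines()`. The snippet below assumes Python 3 and uses a hypothetical ASCII sample string `text` in place of the input file:

```python
# Hypothetical two-line sample standing in for the input file.
text = "1. alpha beta\n2. gamma delta epsilon\n"

# split() splits on every whitespace run, so the line structure is lost:
assert text.split() == ["1.", "alpha", "beta", "2.", "gamma", "delta", "epsilon"]

numbered = []
for line in text.splitlines():      # iterate over real lines
    counter = 1                     # reset once per line
    parts = []
    for word in line.split():
        parts.append("%s(%d)" % (word, counter))
        counter += 1
    numbered.append(" ".join(parts))

print(numbered)
```

This sketch still numbers the serial numbers (`1.`, `2.`); leaving them unnumbered is a separate step.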

書生途 2024-08-29 10:55:03

I think your code should look something like this:

# the input part is fine as is
lines = text.split('\n')
outlines = []
for line in lines:
    lout = []
    counter = 1
    # iterate over the words of the current line, not of the whole list
    for i, word in enumerate(line.split()):
        if i == 0:  # leave 1st word of line alone, it's a marker
            lout.append(word)
            continue
        # process each and every other word
        if word[-1] in separators and len(word) > 1:
            # number the word and its trailing separator separately
            lout.append(word[:-1] + (u'(%d) ' % counter) +
                        word[-1] + (u'(%d)' % (counter + 1)))
            counter += 1
        else:
            lout.append(word + u'(%d)' % counter)
        counter += 1
    outlines.append(' '.join(lout))

f1 = open(output_file, 'w')
# encode explicitly: writing non-ASCII unicode to a byte file fails in Python 2
f1.write('\n'.join(outlines).encode('UTF-8'))
f1.close()

I can't test this code, so there might be minor issues left, but I think the main principles in it are sound: work on two levels (by line within the file, with \n as the separator, and by word within each line, with whitespace as the separator), and at each level use lists (with append and join) rather than building up strings piece by piece.
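
The two-level idea can also be sketched as a pair of small functions. This is only an illustration under stated assumptions: it targets Python 3 (where `str` is already Unicode, so no `decode` call is needed), the names `annotate_line`, `annotate_text`, and `SEPARATORS` are hypothetical, and the first token of every line is assumed to be the serial number.

```python
# -*- coding: utf-8 -*-
# Hypothetical names; the separators are the ones used in the question.
SEPARATORS = (u'।', u',', u'.')


def annotate_line(line, separators=SEPARATORS):
    """Return `line` with (n) appended to each word after the first token."""
    out = []
    counter = 1
    for i, word in enumerate(line.split()):
        if i == 0:
            # Assumption: the first token is the serial number; keep it as-is.
            out.append(word)
            continue
        if len(word) > 1 and word[-1] in separators:
            # Split a trailing separator off and number it on its own.
            out.append("%s(%d)" % (word[:-1], counter))
            counter += 1
            out.append("%s(%d)" % (word[-1], counter))
        else:
            out.append("%s(%d)" % (word, counter))
        counter += 1
    return " ".join(out)


def annotate_text(text):
    """Annotate every line of `text`, restarting the count on each line."""
    return "\n".join(annotate_line(line) for line in text.splitlines())


sample = u"3.  भारत का इतिहास काफी समृद्ध एवं विस्तृत है।"
print(annotate_text(sample))
```

Keeping the per-line logic in its own function makes the counter reset implicit (it is a local variable), and collecting pieces in a list avoids the trailing-space bookkeeping that string concatenation needs.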

忆悲凉 2024-08-29 10:55:03

This code will give you the desired output. I added a check for the number at the start of the line, which should not be numbered.

I adapted your original code, which was (mostly) working. You just needed to reset the counter at the end of an input line, and add a newline to your output as well.

#!/usr/bin/python
# -*- coding: UTF-8 -*-
# encoding: utf-8

import re

list1 = []
separators = [u'।', ',', '.']
text = open('hinstest1.txt').read().decode('UTF-8')
output_file = ('ophwp1.txt')

for line in text.splitlines():
    counter = 1
    output = ''
    for word in line.split():
        # Special case for the number at the start of the line
        # The regex matches one or more decimal digits (\d+) followed by a dot (\.)
        if re.match(r'\d+\.', word):
            output += word + ' '
            continue
        # Special case: the last char is a separator joined to the word
        if word[-1] in separators and len(word) > 1:
            # word up to the second to last char
            output += word[:-1] + u'(%d) ' % counter
            counter += 1
            # last char
            output += word[-1] + u'(%d) ' % counter
            counter += 1
        else:
            output += word + u'(%d) ' % counter
            counter += 1
    output += u'\n'
    list1.append(output.encode('UTF-8'))

f1=open(output_file,'w')
f1.write(''.join(list1))
f1.close()

I tested this code on the input file you provided and, for the most part, I retained your coding style.
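
A possible refinement of the serial-number check, sketched under assumptions: `re.match` only anchors at the start of the string, so the pattern `\d+\.` also accepts a token such as `2.5` anywhere in a line. If that matters for your data, one option is to test only the first token of each line and require the whole token to match. The helper name `is_serial` and the sample sentence are hypothetical, and `re.fullmatch` requires Python 3.4 or later.

```python
import re

SERIAL = re.compile(r"\d+\.")


def is_serial(index, word):
    """True only for a leading token made of digits followed by a dot."""
    return index == 0 and SERIAL.fullmatch(word) is not None


tokens = "57. rate 2.5 percent".split()

strict = [is_serial(i, w) for i, w in enumerate(tokens)]
print(strict)   # [True, False, False, False]

# re.match anchors only at the start, so it also accepts "2.5":
loose = [bool(re.match(r"\d+\.", w)) for w in tokens]
print(loose)    # [True, False, True, False]
```

For the two sample lines in the question both checks behave the same, since the only digit-dot tokens are the leading serial numbers.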
