Counting words in different sections of a text file

Posted on 2025-01-26 02:03:08

For a project I have to analyze a txt file containing over 200 resumes with Python. I have to search through the file and count whether a specific keyword is mentioned. This is my very simple code:

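# read the whole file into one string
with open("CVC.txt") as file:
    data = file.read()

# counts every occurrence in the file, including repeats within one resume
occurrence = data.count("Biology")

print('Number of occurrences of the word :', occurrence)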

The problem is that when I search for, e.g., "Engineering", it is mentioned several times in one CV, but I want to count it only once per CV. Every resume starts with the word 'contact'. My question is: how can I write an algorithm that tells the resumes apart and counts a specific keyword only once per CV?

Thanks in advance!

五里雾 2025-02-02 02:03:08

The logic is fairly straightforward. Read the file line by line. When you see a line that starts a contact, begin collecting the lines that follow until you reach the next contact line. Once the whole file has been read, store the remaining lines as the last contact.

contacts = []
current_contact = None

with open("CVC.txt") as data:
    # iterate over the file object directly; it yields one line at a time
    # (a file object has no splitlines() method)
    for line in data:
        # skip page lines (e.g. in the middle of a contact)
        if line.strip().startswith("Page "):
            continue
        # start a new contact
        if line.strip() == "Contact":
            if current_contact is not None:
                # store the current contact lines, if they exist
                contacts.append('\n'.join(current_contact))
            current_contact = []
            continue
        # collect all lines for a single contact
        if current_contact is not None:
            current_contact.append(line.rstrip())
        else:
            print(f"Not seen 'Contact' yet... '{line.rstrip()}'")  # for debugging, e.g. start of the file

# store remaining data after all lines are read
if current_contact:
    contacts.append('\n'.join(current_contact))

I made an example file like this:

Contact

https://linkedin.com/1

Fugit dicta voluptates iusto. Aut nam iste impedit. A aliquam repellendus consectetur esse vero placeat doloremque. Necessitatibus est labore provident atque possimus

Page 1 of 2

Hic dignissimos consequatur error.

Contact

https://linkedin.com/2

Fugit dicta voluptates iusto. Aut nam iste impedit. A aliquam repellendus consectetur esse vero placeat doloremque. Necessitatibus est labore provident atque possimus. Hic dignissimos consequatur error.

And this is the test output:

>>> for c in contacts:
...   print(c.splitlines())
... 
['', 'https://linkedin.com/1', '', 'Fugit dicta voluptates iusto. Aut nam iste impedit. A aliquam repellendus consectetur esse vero placeat doloremque. Necessitatibus est labore provident atque possimus', '', '', 'Hic dignissimos consequatur error.']
['', 'https://linkedin.com/2', '', 'Fugit dicta voluptates iusto. Aut nam iste impedit. A aliquam repellendus consectetur esse vero placeat doloremque. Necessitatibus est labore provident atque possimus. Hic dignissimos consequatur error.']

To count occurrences of a word in a single contact, index into the list by position:

contacts[0].count("Biology")
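
The question asks how many resumes mention a keyword at least once, rather than how often it appears in a single one. A minimal sketch of that last step, assuming a `contacts` list shaped like the one built above: a membership test (`in`) is true at most once per resume, however often the word repeats. The helper name `count_resumes_with`, the sample data, and the case-insensitive comparison are illustrative assumptions, not part of the original answer.

```python
def count_resumes_with(contacts, keyword):
    """Return how many resumes mention keyword at least once (case-insensitive)."""
    needle = keyword.lower()
    # `in` is True at most once per resume, no matter how often the word repeats.
    return sum(1 for resume in contacts if needle in resume.lower())


# Hypothetical stand-in for the `contacts` list produced by the parsing loop.
sample_contacts = [
    "https://linkedin.com/1\nBiology degree. Taught biology for two years.",
    "https://linkedin.com/2\nMechanical engineering.",
    "https://linkedin.com/3\nBiology and engineering.",
]

print(count_resumes_with(sample_contacts, "Biology"))      # 2
print(count_resumes_with(sample_contacts, "Engineering"))  # 2
```

Note that this is a plain substring test, so "engineering" would also match inside a longer word such as "bioengineering".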

红ご颜醉 2025-02-02 02:03:08

Here is a solution with simpler logic. Keep a single flag that records both (1) whether we are inside a contact and (2) whether the word has already been counted for this contact.

counter = 0
is_counted = True  # start as True so nothing is counted before the first contact
word = 'engineering'  # change this to the keyword you need
with open('cv.txt', 'r') as file:
    for line in file:
        if "contact" in line.lower():
            # a new resume starts: allow the word to be counted again
            is_counted = False
        elif not is_counted and word in line.lower():
            counter += 1
            is_counted = True

print(counter)

I tried it successfully on a small sample; try it on your input and see whether it works.
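
One caveat with the flag approach is that both tests are substring tests: any line containing "contact" (for example "Contact me at ...") starts a new resume, and "engineering" also matches inside "bioengineering". A possible variant, sketched here with the standard `re` module, treats only a line consisting solely of the word as a resume start and matches the keyword on word boundaries. The names `RESUME_START` and `count_resumes_mentioning` and the sample text are hypothetical, and the whole-line marker rule is an assumption about the input file.

```python
import re

# Assumption: a resume starts at a line that contains only the word "Contact".
RESUME_START = re.compile(r"^[ \t]*contact[ \t]*$", re.IGNORECASE | re.MULTILINE)


def count_resumes_mentioning(text, keyword):
    """Count resumes in text that contain keyword as a whole word."""
    # Everything before the first marker belongs to no resume, so drop it.
    resumes = RESUME_START.split(text)[1:]
    # \b keeps "engineering" from matching inside "bioengineering".
    word = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
    return sum(1 for resume in resumes if word.search(resume))


sample = """Contact
https://linkedin.com/1
Engineering lead. Contact me about engineering roles.
Contact
https://linkedin.com/2
Bioengineering researcher.
"""

print(count_resumes_mentioning(sample, "engineering"))  # 1
```

Here the first resume is counted once even though the word appears twice, and the second is not counted because "Bioengineering" is a different word.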
