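A txt log has over a hundred thousand lines recording data-processing results by increasing id, but some ids may be missing. How do I find the ids that never appear?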

Posted 2022-09-13 00:42:10

The project website has a log in txt format that records the data-processing steps in the following format:

...
2021-07-07 21:35:05 id=9 empty_content 
2021-07-07 21:35:06 id=10 empty_content 
2021-07-07 21:36:36 id=11 start_saveas_imgs 
2021-07-07 21:36:38 id=11 imgs_notes[0] success_qn_upload=updataa/0128/1517124106989.jpeg 
2021-07-07 21:36:39 id=11 imgs_notes[1] success_qn_upload=updataa/0128/1517124107128.jpeg 
2021-07-07 21:36:41 id=11 imgs_notes[2] success_qn_upload=updataa/0128/1517124107213.jpeg 
...

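In theory every id should be processed, and results are logged in increasing id order. Each id may take up one line, a few lines, or more than ten lines.

In practice, some ids were never processed and have no record in the txt file; in other words, there are gaps.

For example, with ids running from 1 to 50000, some such as 666, 888, and 1313 may be missing.

The txt file has over a hundred thousand lines, so the question is: how can I use PHP or Linux commands to find the ids that are missing?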


丑丑阿 2022-09-20 00:42:10

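About the other answers: bringing in a database is somewhat overkill, and the Python implementation can be improved considerably.

So I'd like to try once more in Python, relying on a few constraints from the problem description:

ids are auto-incrementing
every record has a fixed prefix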

#!/usr/bin/env python3

import argparse
import logging
import mmap


def find_id(file):
    prefix_len = len("2021-07-07 21:35:05 id=")

    with open(file, "rb") as _fp:
        with mmap.mmap(_fp.fileno(), 0, access=mmap.ACCESS_READ) as fp:
            while True:
                line = fp.readline()

                if line == b"":
                    logging.debug("reached eof")
                    break

                end = line.find(b" ", prefix_len)

                if end < prefix_len:
                    continue

                raw = line[prefix_len:end]

                try:
                    id = int(raw)
                except ValueError as e:
                    logging.error(f"{e=} {raw=} {prefix_len=} {end=} {line=}")
                else:
                    yield id


def missing_id(source):
    cursor = 0
    for id in source:
        assert cursor <= id, f"id should always increase {cursor=} {id=}"

        yield from range(cursor + 1, id)
        cursor = id


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("file", type=str)
    args = parser.parse_args()

    print("missing id:")
    for i in missing_id(find_id(args.file)):
        print(i)
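
To see what the gap scan does with ids that span several lines, here is a minimal in-memory sketch of the same idea. The names `ids_from_lines` and `missing_ids`, the regular expression, and the `first` parameter are assumptions made for this illustration, and the last sample line (id=14) is invented to create a gap. Unlike the script above, this variant silently skips any id lower than the one expected instead of asserting, so it would also hide out-of-order records.

```python
import re
from typing import Iterable, Iterator

# Hypothetical pattern: assumes each record line contains "id=<digits>".
ID_RE = re.compile(rb"\bid=(\d+)")


def ids_from_lines(lines: Iterable[bytes]) -> Iterator[int]:
    """Yield the id found on each line, skipping lines without one."""
    for line in lines:
        m = ID_RE.search(line)
        if m:
            yield int(m.group(1))


def missing_ids(ids: Iterable[int], first: int = 1) -> Iterator[int]:
    """Yield ids absent from an increasing stream that starts at `first`."""
    expected = first
    for current in ids:
        if current < expected:
            # Another line for an id that was already seen.
            continue
        yield from range(expected, current)
        expected = current + 1


# First four lines follow the question's sample; the id=14 line is made up.
sample = [
    b"2021-07-07 21:35:05 id=9 empty_content \n",
    b"2021-07-07 21:35:06 id=10 empty_content \n",
    b"2021-07-07 21:36:36 id=11 start_saveas_imgs \n",
    b"2021-07-07 21:36:38 id=11 imgs_notes[0] success_qn_upload=updataa/0128/1517124106989.jpeg \n",
    b"2021-07-07 21:40:00 id=14 empty_content \n",
]

print(list(missing_ids(ids_from_lines(sample), first=9)))
```

Because only one counter is kept, memory use stays constant no matter how many lines the log has, which is the main advantage over building a full set of ids.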


绅刃 2022-09-20 00:42:10

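#!/usr/bin/python

import re

# Collect every id that appears in the log.
existed = set()
with open("log.txt") as fp:
    for line in fp:
        # Raw string, so "\d" is not treated as an invalid string escape.
        m = re.match(r".+id=(\d+)", line)
        if m:
            existed.add(int(m.group(1)))

# Every id between the smallest and the largest one seen should be present.
full = set(range(min(existed), max(existed) + 1))
missed = full - existed
# list.sort() sorts in place and returns None, so print sorted() instead.
print(sorted(missed))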


極樂鬼 2022-09-20 00:42:10

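In practice, some ids were never processed and have no record in the txt file; in other words, there are gaps.

"Have no record in the txt file; in other words, there are gaps": it looks like you need to compare against another data set?

Searching the txt file directly is clearly inefficient. Consider splitting the text according to its fixed format and loading it into a database, then doing the work there. This is log collection.

You can implement the splitting and loading yourself, or use ELK to ingest the log automatically.

Once the data is in a database, even a full traversal is efficient when lookups go through the primary-key index, so the set difference can be found quickly.

A few hundred thousand rows is not large, so you could also pull out all the IDs and compute the difference with array_diff.

Simpler still, read the txt file line by line, extract the ID from each line, and look it up in the database directly.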
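
As a rough illustration of the database route, the sketch below uses Python's built-in sqlite3 module with an in-memory database rather than MySQL or ELK; that choice, the table name `seen`, and the function name `missing_ids_sql` are assumptions made for this example. It loads the ids into a primary-key column, generates the expected range with a recursive CTE, and returns the ids with no matching row. It assumes `lo <= hi`.

```python
import sqlite3
from typing import Iterable, List


def missing_ids_sql(ids: Iterable[int], lo: int, hi: int) -> List[int]:
    """Return ids in [lo, hi] that are absent from `ids` (assumes lo <= hi)."""
    conn = sqlite3.connect(":memory:")
    try:
        # The primary key gives an index for the join below.
        conn.execute("CREATE TABLE seen (id INTEGER PRIMARY KEY)")
        # OR IGNORE drops repeated ids (one id may span many log lines).
        conn.executemany(
            "INSERT OR IGNORE INTO seen (id) VALUES (?)",
            ((i,) for i in ids),
        )
        rows = conn.execute(
            """
            WITH RECURSIVE expected(id) AS (
                SELECT ?
                UNION ALL
                SELECT id + 1 FROM expected WHERE id < ?
            )
            SELECT e.id
            FROM expected AS e
            LEFT JOIN seen AS s ON s.id = e.id
            WHERE s.id IS NULL
            ORDER BY e.id
            """,
            (lo, hi),
        ).fetchall()
    finally:
        conn.close()
    return [row[0] for row in rows]


print(missing_ids_sql([1, 2, 2, 4, 7], 1, 7))
```

The anti-join (LEFT JOIN plus IS NULL) plays the same role as array_diff: it keeps the expected ids that have no counterpart among the ids actually logged.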


剪不断理还乱 2022-09-20 00:42:10

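Combining a few commands through pipes gives the result directly. Example (the seq range of 1 to 50 is only for demonstration; set it to your actual id range):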
seq 1 1 50 > seq1.txt; awk 'BEGIN{FS=" id="}{print $2}' example.log | awk '{print $1}' | sort -n | uniq > seq2.txt; diff seq1.txt seq2.txt | grep '<' | awk '{print $2}'