
Published 2024-06-17 01:03:36

609. Find Duplicate File in System

Chinese documentation

Description

Given a list paths of directory info, including the directory path, and all the files with contents in this directory, return _all the duplicate files in the file system in terms of their paths_. You may return the answer in any order.

A group of duplicate files consists of at least two files that have the same content.

A single directory info string in the input list has the following format:

  • "root/d1/d2/.../dm f1.txt(f1_content) f2.txt(f2_content) ... fn.txt(fn_content)"

It means there are n files (f1.txt, f2.txt ... fn.txt) with content (f1_content, f2_content ... fn_content) respectively in the directory "root/d1/d2/.../dm". Note that n >= 1 and m >= 0. If m = 0, it means the directory is just the root directory.
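For illustration, the parsing this format implies can be sketched in Python (`parse_info` is a hypothetical helper name, not part of the required solution signature):

```python
def parse_info(info: str) -> list[tuple[str, str]]:
    """Split one directory info string into (file path, content) pairs."""
    directory, *files = info.split(' ')
    pairs = []
    for f in files:
        i = f.find('(')  # the content starts right after the '('
        name, content = f[:i], f[i + 1 : -1]  # drop the surrounding parentheses
        pairs.append((directory + '/' + name, content))
    return pairs


print(parse_info("root/a 1.txt(abcd) 2.txt(efgh)"))
# [('root/a/1.txt', 'abcd'), ('root/a/2.txt', 'efgh')]
```

Note that when m = 0 (e.g. `"root 4.txt(efgh)"`), the directory part is just `"root"` and the resulting path is `"root/4.txt"`.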

The output is a list of groups of duplicate file paths. For each group, it contains all the file paths of the files that have the same content. A file path is a string that has the following format:

  • "directory_path/file_name.txt"

 

Example 1:

Input: paths = ["root/a 1.txt(abcd) 2.txt(efgh)","root/c 3.txt(abcd)","root/c/d 4.txt(efgh)","root 4.txt(efgh)"]
Output: [["root/a/2.txt","root/c/d/4.txt","root/4.txt"],["root/a/1.txt","root/c/3.txt"]]

Example 2:

Input: paths = ["root/a 1.txt(abcd) 2.txt(efgh)","root/c 3.txt(abcd)","root/c/d 4.txt(efgh)"]
Output: [["root/a/2.txt","root/c/d/4.txt"],["root/a/1.txt","root/c/3.txt"]]

 

Constraints:

  • 1 <= paths.length <= 2 * 10^4
  • 1 <= paths[i].length <= 3000
  • 1 <= sum(paths[i].length) <= 5 * 10^5
  • paths[i] consist of English letters, digits, '/', '.', '(', ')', and ' '.
  • You may assume no files or directories share the same name in the same directory.
  • You may assume each given directory info represents a unique directory. A single blank space separates the directory path and file info.

 

Follow up:

  • Imagine you are given a real file system, how will you search files? DFS or BFS?
  • If the file content is very large (GB level), how will you modify your solution?
  • If you can only read the file by 1kb each time, how will you modify your solution?
  • What is the time complexity of your modified solution? What is the most time-consuming part and memory-consuming part of it? How to optimize?
  • How to make sure the duplicated files you find are not false positive?
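For the large-file follow-ups, a common answer is to hash each file incrementally, say 1 KB at a time, group files by digest, and then compare candidate duplicates byte-by-byte to rule out hash-collision false positives. Below is a minimal sketch of the incremental hashing step only, using Python's `hashlib` with an in-memory stream standing in for a real file:

```python
import hashlib
import io


def chunked_digest(stream, chunk_size=1024):
    """Hash a stream 1 KB at a time, so the whole file never sits in memory."""
    h = hashlib.sha256()
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:  # empty read means end of stream
            break
        h.update(chunk)
    return h.hexdigest()


data = b"x" * 5000  # stand-in for a large file's contents
assert chunked_digest(io.BytesIO(data)) == hashlib.sha256(data).hexdigest()
```

Grouping by digest keeps memory proportional to the number of files rather than their total size; the byte-by-byte verification pass is what guarantees no false positives.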

Solutions

Solution 1

```python
from collections import defaultdict
from typing import List


class Solution:
    def findDuplicate(self, paths: List[str]) -> List[List[str]]:
        d = defaultdict(list)
        for p in paths:
            ps = p.split()
            for f in ps[1:]:
                i = f.find('(')
                name, content = f[:i], f[i + 1 : -1]
                d[content].append(ps[0] + '/' + name)
        return [v for v in d.values() if len(v) > 1]
```
```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class Solution {
    public List<List<String>> findDuplicate(String[] paths) {
        Map<String, List<String>> d = new HashMap<>();
        for (String p : paths) {
            String[] ps = p.split(" ");
            for (int i = 1; i < ps.length; ++i) {
                int j = ps[i].indexOf('(');
                String content = ps[i].substring(j + 1, ps[i].length() - 1);
                String name = ps[0] + '/' + ps[i].substring(0, j);
                d.computeIfAbsent(content, k -> new ArrayList<>()).add(name);
            }
        }
        List<List<String>> ans = new ArrayList<>();
        for (var e : d.values()) {
            if (e.size() > 1) {
                ans.add(e);
            }
        }
        return ans;
    }
}
```
```cpp
#include <sstream>
#include <string>
#include <unordered_map>
#include <vector>
using namespace std;

class Solution {
public:
    vector<vector<string>> findDuplicate(vector<string>& paths) {
        unordered_map<string, vector<string>> d;
        for (auto& p : paths) {
            auto ps = split(p, ' ');
            for (int i = 1; i < ps.size(); ++i) {
                int j = ps[i].find('(');
                auto content = ps[i].substr(j + 1, ps[i].size() - j - 2);
                auto name = ps[0] + '/' + ps[i].substr(0, j);
                d[content].push_back(name);
            }
        }
        vector<vector<string>> ans;
        for (auto& [_, e] : d) {
            if (e.size() > 1) {
                ans.push_back(e);
            }
        }
        return ans;
    }

    vector<string> split(string& s, char c) {
        vector<string> res;
        stringstream ss(s);
        string t;
        while (getline(ss, t, c)) {
            res.push_back(t);
        }
        return res;
    }
};
```
```go
func findDuplicate(paths []string) [][]string {
	d := map[string][]string{}
	for _, p := range paths {
		ps := strings.Split(p, " ")
		for i := 1; i < len(ps); i++ {
			j := strings.IndexByte(ps[i], '(')
			content := ps[i][j+1 : len(ps[i])-1]
			name := ps[0] + "/" + ps[i][:j]
			d[content] = append(d[content], name)
		}
	}
	ans := [][]string{}
	for _, e := range d {
		if len(e) > 1 {
			ans = append(ans, e)
		}
	}
	return ans
}
```
```ts
function findDuplicate(paths: string[]): string[][] {
    const d = new Map<string, string[]>();
    for (const p of paths) {
        const [root, ...fs] = p.split(' ');
        for (const f of fs) {
            const [name, content] = f.split(/\(|\)/g).filter(Boolean);
            const t = d.get(content) ?? [];
            t.push(root + '/' + name);
            d.set(content, t);
        }
    }
    return [...d.values()].filter(e => e.length > 1);
}
```
