This topic has not yet been rated - Rate this topic

How to: Query for Duplicate Files in a Directory Tree (LINQ)

Sometimes files that have the same name may be located in more than one folder. For example, under the Visual Studio installation folder, several folders have a readme.htm file. This example shows how to query for such duplicate file names under a specified root folder. The second example shows how to query for files whose size and creation times also match.

class QueryDuplicateFileNames
{
    static void Main(string[] args)
    {
        // Uncomment QueryDuplicates2 to run that query.
        QueryDuplicates();
        // QueryDuplicates2(); 

        // Keep the console window open in debug mode.
        Console.WriteLine("Press any key to exit.");
        Console.ReadKey();
    }

    static void QueryDuplicates()
    {
        // Change the root drive or folder if necessary 
        string startFolder = @"c:\program files\Microsoft Visual Studio 9.0\";

        // Take a snapshot of the file system.
        System.IO.DirectoryInfo dir = new System.IO.DirectoryInfo(startFolder);

        // This method assumes that the application has discovery permissions 
        // for all folders under the specified path.
        IEnumerable<System.IO.FileInfo> fileList = dir.GetFiles("*.*", System.IO.SearchOption.AllDirectories);

        // used in WriteLine to keep the lines shorter 
        int charsToSkip = startFolder.Length;

        // var can be used for convenience with groups. 
        var queryDupNames =
            from file in fileList
            group file.FullName.Substring(charsToSkip) by file.Name into fileGroup
            where fileGroup.Count() > 1
            select fileGroup;

        // Pass the query to a method that will 
        // output one page at a time.
        PageOutput<string, string>(queryDupNames);
    }

    // A Group key that can be passed to a separate method. 
    // Override Equals and GetHashCode to define equality for the key. 
    // Override ToString to provide a friendly name for Key.ToString() 
    class PortableKey
    {
        public string Name { get; set; }
        public DateTime CreationTime { get; set; }
        public long Length { get; set; }

        public override bool Equals(object obj)
        {
            PortableKey other = (PortableKey)obj;
            return other.CreationTime == this.CreationTime &&
                   other.Length == this.Length &&
                   other.Name == this.Name;
        }

        public override int GetHashCode()
        {
            string str = String.Format("{0}{1}{2}", this.CreationTime, this.Length, this.Name);
            return str.GetHashCode();
        }
        public override string ToString()
        {
            return String.Format("{0} {1} {2}", this.Name, this.Length, this.CreationTime);
        }
    }
    static void QueryDuplicates2()
    {
        // Change the root drive or folder if necessary. 
        string startFolder = @"c:\program files\Microsoft Visual Studio 9.0\Common7";

        // Make the the lines shorter for the console display 
        int charsToSkip = startFolder.Length;

        // Take a snapshot of the file system.
        System.IO.DirectoryInfo dir = new System.IO.DirectoryInfo(startFolder);
        IEnumerable<System.IO.FileInfo> fileList = dir.GetFiles("*.*", System.IO.SearchOption.AllDirectories);

        // Note the use of a compound key. Files that match 
        // all three properties belong to the same group. 
        // A named type is used to enable the query to be 
        // passed to another method. Anonymous types can also be used 
        // for composite keys but cannot be passed across method boundaries 
        //  
        var queryDupFiles =
            from file in fileList
            group file.FullName.Substring(charsToSkip) by 
                new PortableKey { Name = file.Name, CreationTime = file.CreationTime, Length = file.Length } into fileGroup
            where fileGroup.Count() > 1
            select fileGroup;

        var list = queryDupFiles.ToList();

        int i = queryDupFiles.Count();

        PageOutput<PortableKey, string>(queryDupFiles);
    }


    // A generic method to page the output of the QueryDuplications methods 
    // Here the type of the group must be specified explicitly. "var" cannot
    // be used in method signatures. This method does not display more than one 
    // group per page. 
    private static void PageOutput<K, V>(IEnumerable<System.Linq.IGrouping<K, V>> groupByExtList)
    {
        // Flag to break out of paging loop. 
        bool goAgain = true;

        // "3" = 1 line for extension + 1 for "Press any key" + 1 for input cursor.
        int numLines = Console.WindowHeight - 3;

        // Iterate through the outer collection of groups. 
        foreach (var filegroup in groupByExtList)
        {
            // Start a new extension at the top of a page. 
            int currentLine = 0;

            // Output only as many lines of the current group as will fit in the window. 
            do
            {
                Console.Clear();
                Console.WriteLine("Filename = {0}", filegroup.Key.ToString() == String.Empty ? "[none]" : filegroup.Key.ToString());

                // Get 'numLines' number of items starting at number 'currentLine'. 
                var resultPage = filegroup.Skip(currentLine).Take(numLines);

                //Execute the resultPage query 
                foreach (var fileName in resultPage)
                {
                    Console.WriteLine("\t{0}", fileName);
                }

                // Increment the line counter.
                currentLine += numLines;

                // Give the user a chance to escape.
                Console.WriteLine("Press any key to continue or the 'End' key to break...");
                ConsoleKey key = Console.ReadKey().Key;
                if (key == ConsoleKey.End)
                {
                    goAgain = false;
                    break;
                }
            } while (currentLine < filegroup.Count());

            if (goAgain == false)
                break;
        }
    }
}

The first query uses a simple key to determine a match; this finds files that have the same name but whose contents might be different. The second query uses a compound key to match against three properties of the FileInfo object. This query is much more likely to find files that have the same name and similar or identical content.

  • Create a Visual Studio project that targets the .NET Framework version 3.5. By default, the project has a reference to System.Core.dll and a using directive (C#) or Imported namespace (Visual Basic) for the System.Linq namespace. In C# projects, add a using directive for the System.IO namespace.

  • Copy this code into your project.

  • Press F5 to compile and run the program.

  • Press any key to exit the console window.

For intensive query operations over the contents of multiple types of documents and files, consider using the Windows Desktop Search engine.

Did you find this helpful?
(1500 characters remaining)
Thank you for your feedback
Show:
© 2014 Microsoft. All rights reserved.