How to: Query for Duplicate Files in a Directory Tree (LINQ)
Sometimes files that have the same name may be located in more than one folder. For example, under the Visual Studio installation folder, several folders have a readme.htm file. This example shows how to query for such duplicate file names under a specified root folder. The second example shows how to query for files whose size and creation times also match.
class QueryDuplicateFileNames { static void Main(string[] args) { // Uncomment QueryDuplicates2 to run that query. QueryDuplicates(); // QueryDuplicates2(); // Keep the console window open in debug mode. Console.WriteLine("Press any key to exit."); Console.ReadKey(); } static void QueryDuplicates() { // Change the root drive or folder if necessary string startFolder = @"c:\program files\Microsoft Visual Studio 9.0\"; // Take a snapshot of the file system. System.IO.DirectoryInfo dir = new System.IO.DirectoryInfo(startFolder); // This method assumes that the application has discovery permissions // for all folders under the specified path. IEnumerable<System.IO.FileInfo> fileList = dir.GetFiles("*.*", System.IO.SearchOption.AllDirectories); // used in WriteLine to keep the lines shorter int charsToSkip = startFolder.Length; // var can be used for convenience with groups. var queryDupNames = from file in fileList group file.FullName.Substring(charsToSkip) by file.Name into fileGroup where fileGroup.Count() > 1 select fileGroup; // Pass the query to a method that will // output one page at a time. PageOutput<string, string>(queryDupNames); } // A Group key that can be passed to a separate method. // Override Equals and GetHashCode to define equality for the key. // Override ToString to provide a friendly name for Key.ToString() class PortableKey { public string Name { get; set; } public DateTime CreationTime { get; set; } public long Length { get; set; } public override bool Equals(object obj) { PortableKey other = (PortableKey)obj; return other.CreationTime == this.CreationTime && other.Length == this.Length && other.Name == this.Name; } public override int GetHashCode() { string str = String.Format("{0}{1}{2}", this.CreationTime, this.Length, this.Name); return str.GetHashCode(); } public override string ToString() { return String.Format("{0} {1} {2}", this.Name, this.Length, this.CreationTime); } } static void QueryDuplicates2() { // Change the root drive or folder if necessary. string startFolder = @"c:\program files\Microsoft Visual Studio 9.0\Common7"; // Make the the lines shorter for the console display int charsToSkip = startFolder.Length; // Take a snapshot of the file system. System.IO.DirectoryInfo dir = new System.IO.DirectoryInfo(startFolder); IEnumerable<System.IO.FileInfo> fileList = dir.GetFiles("*.*", System.IO.SearchOption.AllDirectories); // Note the use of a compound key. Files that match // all three properties belong to the same group. // A named type is used to enable the query to be // passed to another method. Anonymous types can also be used // for composite keys but cannot be passed across method boundaries // var queryDupFiles = from file in fileList group file.FullName.Substring(charsToSkip) by new PortableKey { Name = file.Name, CreationTime = file.CreationTime, Length = file.Length } into fileGroup where fileGroup.Count() > 1 select fileGroup; var list = queryDupFiles.ToList(); int i = queryDupFiles.Count(); PageOutput<PortableKey, string>(queryDupFiles); } // A generic method to page the output of the QueryDuplications methods // Here the type of the group must be specified explicitly. "var" cannot // be used in method signatures. This method does not display more than one // group per page. private static void PageOutput<K, V>(IEnumerable<System.Linq.IGrouping<K, V>> groupByExtList) { // Flag to break out of paging loop. bool goAgain = true; // "3" = 1 line for extension + 1 for "Press any key" + 1 for input cursor. int numLines = Console.WindowHeight - 3; // Iterate through the outer collection of groups. foreach (var filegroup in groupByExtList) { // Start a new extension at the top of a page. int currentLine = 0; // Output only as many lines of the current group as will fit in the window. do { Console.Clear(); Console.WriteLine("Filename = {0}", filegroup.Key.ToString() == String.Empty ? "[none]" : filegroup.Key.ToString()); // Get 'numLines' number of items starting at number 'currentLine'. var resultPage = filegroup.Skip(currentLine).Take(numLines); //Execute the resultPage query foreach (var fileName in resultPage) { Console.WriteLine("\t{0}", fileName); } // Increment the line counter. currentLine += numLines; // Give the user a chance to escape. Console.WriteLine("Press any key to continue or the 'End' key to break..."); ConsoleKey key = Console.ReadKey().Key; if (key == ConsoleKey.End) { goAgain = false; break; } } while (currentLine < filegroup.Count()); if (goAgain == false) break; } } }
The first query uses a simple key to determine a match; this finds files that have the same name but whose contents might be different. The second query uses a compound key to match against three properties of the FileInfo object. This query is much more likely to find files that have the same name and similar or identical content.
-
Create a Visual Studio project that targets the .NET Framework version 3.5. By default, the project has a reference to System.Core.dll and a using directive (C#) or Imported namespace (Visual Basic) for the System.Linq namespace. In C# projects, add a using directive for the System.IO namespace.
-
Copy this code into your project.
-
Press F5 to compile and run the program.
-
Press any key to exit the console window.
For intensive query operations over the contents of multiple types of documents and files, consider using the Windows Desktop Search engine.