File Content and Directory Search using Directory.GetFiles and PLINQ

Array of File Names

Starting with .NET 4, you can use PLINQ queries to parallelize operations on file directories. One way to write such a query is to call the Directory.GetFiles method to populate an array with the names of the files in a directory and all its subdirectories. This method does not return until the entire array is populated, so it can introduce latency at the beginning of the operation. Once the array is populated, however, PLINQ can search every file with a given extension in that directory tree for a specific word in parallel. To measure the performance, you can create a folder called CLOBS and place 8 large text files (1GB each) in it.
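
A minimal sketch of this approach follows. The class and method names, the C:\CLOBS location, the *.txt extension and the search word are all assumptions made for illustration; only Directory.GetFiles, File.ReadLines and the PLINQ operators are framework calls.

```csharp
using System;
using System.Diagnostics;
using System.IO;
using System.Linq;

public static class GetFilesSearch
{
    // Counts the lines containing 'word' in every file under 'root'
    // whose name matches 'pattern'. (Hypothetical helper.)
    public static long CountMatchingLines(string root, string pattern, string word)
    {
        // GetFiles blocks here until the complete array of names is built.
        string[] files = Directory.GetFiles(root, pattern, SearchOption.AllDirectories);

        // Each file is scanned on a PLINQ worker; ReadLines streams the file
        // line by line instead of loading it into memory at once.
        return files
            .AsParallel()
            .Sum(file => File.ReadLines(file).LongCount(line => line.Contains(word)));
    }

    public static void Main(string[] args)
    {
        // Assumed defaults: folder location and search word are placeholders.
        string root = args.Length > 0 ? args[0] : @"C:\CLOBS";
        string word = args.Length > 1 ? args[1] : "error";

        if (!Directory.Exists(root))
        {
            Console.WriteLine("Directory not found: " + root);
            return;
        }

        Stopwatch timer = Stopwatch.StartNew();
        long matches = CountMatchingLines(root, "*.txt", word);
        timer.Stop();

        Console.WriteLine("{0} matching lines in {1:F2} seconds",
                          matches, timer.Elapsed.TotalSeconds);
    }
}
```

The work is split per file, so with 8 files at most 8 workers have something to do; timings depend heavily on disk speed as well as on the number of cores.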

While the project runs, CPU usage goes up, as shown in the following figure:

Finding all matches in 8 large text files (1GB each) takes 407.03 seconds, as shown in the output window:

File Content and Directory Search using Directory.EnumerateFiles and PLINQ

Enumerable Collection of File Names

Starting with .NET 4, you can enumerate directories and files by using methods that return an enumerable collection of their names as strings. In previous versions of the .NET Framework, you could obtain these names only as arrays. Enumerable collections can perform better than arrays on large directories, because you can start processing names before the whole collection has been returned.
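
The difference can be illustrated with a small sketch. The helper name FirstFiles and the default folder are assumptions for illustration; Directory.EnumerateFiles yields names lazily, so the query below can stop walking the directory tree once it has enough results.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

public static class LazyEnumeration
{
    // Returns at most 'limit' matching file paths. (Hypothetical helper.)
    // Take stops pulling from EnumerateFiles once 'limit' names have been
    // produced, so the rest of the tree need not be listed.
    public static List<string> FirstFiles(string root, string pattern, int limit)
    {
        return Directory.EnumerateFiles(root, pattern, SearchOption.AllDirectories)
                        .Take(limit)
                        .ToList();
    }

    public static void Main(string[] args)
    {
        // Assumed default: the current directory is only a placeholder.
        string root = args.Length > 0 ? args[0] : Directory.GetCurrentDirectory();

        foreach (string path in FirstFiles(root, "*.txt", 5))
        {
            Console.WriteLine(path);
        }
    }
}
```

With Directory.GetFiles, the same query would first build the full array of names and only then discard all but the first few.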

Parallel LINQ (PLINQ)

In .NET 4, you can use Parallel LINQ (PLINQ) for queries that perform a computationally expensive operation on every file in a specified directory tree.
A PLINQ query can use the Directory.EnumerateFiles method as its source to search every file with a given extension in a directory tree for a specific word. To measure the performance, you can create a folder called CLOBS and place 8 large text files (1GB each) in it.
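
A minimal sketch of such a query follows. As before, the class and method names, the C:\CLOBS location, the *.txt extension and the search word are assumptions made for illustration.

```csharp
using System;
using System.Diagnostics;
using System.IO;
using System.Linq;

public static class EnumerateFilesSearch
{
    // Counts the lines containing 'word' in every file under 'root'
    // whose name matches 'pattern'. (Hypothetical helper.)
    public static long CountMatchingLines(string root, string pattern, string word)
    {
        // EnumerateFiles returns immediately; names are produced on demand,
        // so PLINQ workers can start on the first files while the directory
        // tree is still being walked.
        return Directory.EnumerateFiles(root, pattern, SearchOption.AllDirectories)
            .AsParallel()
            .Sum(file => File.ReadLines(file).LongCount(line => line.Contains(word)));
    }

    public static void Main(string[] args)
    {
        // Assumed defaults: folder location and search word are placeholders.
        string root = args.Length > 0 ? args[0] : @"C:\CLOBS";
        string word = args.Length > 1 ? args[1] : "error";

        if (!Directory.Exists(root))
        {
            Console.WriteLine("Directory not found: " + root);
            return;
        }

        Stopwatch timer = Stopwatch.StartNew();
        long matches = CountMatchingLines(root, "*.txt", word);
        timer.Stop();

        Console.WriteLine("{0} matching lines in {1:F2} seconds",
                          matches, timer.Elapsed.TotalSeconds);
    }
}
```

With only a handful of very large files, listing the names is a tiny part of the total work, so the gain over Directory.GetFiles is expected to be small; the advantage of lazy enumeration grows with the number of files.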

While the project runs, CPU usage goes up, as shown in the following figure:

Finding all matches in 8 large text files (1GB each) takes 402.596 seconds, as shown in the output window:

Performance of PLINQ Queries

Parallel LINQ (PLINQ)
The main goal of Parallel LINQ (PLINQ) is to execute LINQ to Objects queries in parallel, realizing the benefits of multithreading. Using PLINQ is simple if you have to perform the same task on each element in a sequence and those tasks are independent. If you need the result of one calculation step in order to find the next, PLINQ is not for you, but many CPU-intensive tasks can in fact be done in parallel. AsParallel is an ordinary library call rather than a compiler feature: to opt a query into PLINQ, you just call AsParallel on the source sequence and let PLINQ handle the threading.
The following samples demonstrate the performance of PLINQ queries for different scenarios:
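
One CPU-bound scenario can be sketched as follows, assuming a simple prime-counting workload chosen only for illustration. The class name, the helper methods and the range size are hypothetical; the sketch times the same query with and without AsParallel.

```csharp
using System;
using System.Diagnostics;
using System.Linq;

public static class PlinqTiming
{
    // Deliberately naive trial division, so that each element costs CPU time.
    public static bool IsPrime(int n)
    {
        if (n < 2) return false;
        for (int d = 2; (long)d * d <= n; d++)
        {
            if (n % d == 0) return false;
        }
        return true;
    }

    // Plain LINQ to Objects: runs on the calling thread only.
    public static int CountPrimesSequential(int start, int count)
    {
        return Enumerable.Range(start, count).Count(n => IsPrime(n));
    }

    // Same query, opted into PLINQ with AsParallel.
    public static int CountPrimesParallel(int start, int count)
    {
        return Enumerable.Range(start, count).AsParallel().Count(n => IsPrime(n));
    }

    public static void Main()
    {
        const int count = 5000000; // assumed workload size

        Stopwatch timer = Stopwatch.StartNew();
        int sequential = CountPrimesSequential(2, count);
        Console.WriteLine("Sequential: {0} primes in {1} ms",
                          sequential, timer.ElapsedMilliseconds);

        timer.Restart();
        int parallel = CountPrimesParallel(2, count);
        Console.WriteLine("Parallel:   {0} primes in {1} ms",
                          parallel, timer.ElapsedMilliseconds);
    }
}
```

Both methods return the same count; only the elapsed time differs, and the ratio depends on the number of cores and on how expensive each element is.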

The result is shown below:

Without the AsParallel call, the query would use only a single thread. Note that unless you specify that you want the results in the same order as the original sequence, PLINQ will assume you don’t mind getting results as soon as they’re available, even if results from earlier elements haven’t been returned yet. You can preserve the original order by using AsParallel().AsOrdered().
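
The effect of AsOrdered can be seen in a short sketch; the class and method names here are hypothetical. The ordered query always returns the squares in source order, while the unordered one returns the same values in whatever order the workers produce them.

```csharp
using System;
using System.Linq;

public static class OrderingDemo
{
    // Order of the result is not guaranteed to match the source order.
    public static int[] SquaresUnordered(int count)
    {
        return Enumerable.Range(0, count)
                         .AsParallel()
                         .Select(i => i * i)
                         .ToArray();
    }

    // AsOrdered asks PLINQ to preserve the source order in the result.
    public static int[] SquaresOrdered(int count)
    {
        return Enumerable.Range(0, count)
                         .AsParallel()
                         .AsOrdered()
                         .Select(i => i * i)
                         .ToArray();
    }

    public static void Main()
    {
        Console.WriteLine("Unordered: " + string.Join(", ", SquaresUnordered(10)));
        Console.WriteLine("Ordered:   " + string.Join(", ", SquaresOrdered(10)));
    }
}
```

Preserving order requires extra buffering, so it is worth requesting only when the consumer actually depends on it.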

When to Use PLINQ

It’s tempting to search your existing applications for LINQ queries and experiment with parallelizing them. This is usually unproductive, because most problems for which LINQ is obviously the best solution tend to execute very quickly and so don’t benefit from parallelization. A better approach is to find a CPU-intensive bottleneck and then consider, “Can this be expressed as a LINQ query?”
PLINQ is well suited to embarrassingly parallel problems. It also works well for structured blocking tasks, such as calling several web services at once. PLINQ can be a poor choice for imaging, because collating millions of pixels into an output sequence creates a bottleneck. Instead, it’s better to write pixels directly to an array or unmanaged memory block and use the Parallel class or task parallelism to manage the multi-threading.
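
For the imaging case, writing directly to an array with the Parallel class can be sketched as follows. The grayscale-inversion workload and all names are assumptions chosen for illustration; the point is that each iteration writes to its own slot of a preallocated array, so no output sequence has to be collated.

```csharp
using System;
using System.Threading.Tasks;

public static class PixelFill
{
    // Inverts 8-bit grayscale pixels. (Hypothetical example workload.)
    public static byte[] InvertGrayscale(byte[] pixels)
    {
        byte[] output = new byte[pixels.Length];

        // Each index is written by exactly one iteration, so the loop body
        // needs no locking.
        Parallel.For(0, pixels.Length, i =>
        {
            output[i] = (byte)(255 - pixels[i]);
        });

        return output;
    }

    public static void Main()
    {
        byte[] result = InvertGrayscale(new byte[] { 0, 100, 255 });
        Console.WriteLine(string.Join(", ", result));
    }
}
```

For very cheap per-pixel work like this, processing a whole row per iteration rather than a single pixel usually reduces scheduling overhead.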

Copyright © All Rights Reserved - C# Learners