Oct 13 2010

3 ways to split a collection into pages with Linq

Category: C# — Duke @ 18:27
There are situations where you have a collection of objects (IEnumerable<T>) and you need to split it into chunks or pages, so that the resulting parts can be processed at different moments.
There are a lot of ways to achieve this, but the nicest is probably using Linq (it also makes your code look a lot more modern and cool!).
The focus of this post is to discuss some of these ways and their pros and cons.
Digging around the web I found this first example (unfortunately I've lost the link; in any case the code in this first sample is not mine):
public static IEnumerable<IEnumerable<T>> Chunk<T>(this IEnumerable<T> enumerable, int chunkSize) 
{ 
    var total = enumerable.Count(); 
    for (int i = 0; i * chunkSize < total; i++) 
    { 
        yield return enumerable.Skip(i * chunkSize).Take(chunkSize); 
    } 
}
 
This allows you to write code like:
foreach(var page in myCollection.Chunk(30))
{
    foreach(var item in page)
    {
         process(item);
    }
}
Anyway, this works well only in limited scenarios, when the number of records is quite low. To see why, imagine having a collection with 1 million items in it that you want to split into chunks of 1000 elements each.
The Chunk method will first skip zero elements (hey, pretty fast!) and then take 1000. The total amount of work is c, where c is the size of the chunk.
But on the next round the method has to skip the first 1000 (as they have already been taken) and then take 1000. Hmm... still not so bad: the first 1000 were already handled and now we process the next 1000.
Third round... skip 2000 this time... and then take 1000. OK, but we skipped 0 the first time, 1000 the second time and 2000 the third... if we continue this way, by the 10th page we will have skipped 0, 1000, 2000, 3000, 4000 ... 9000! Oh no, this means that after the kth page we will have skipped
((k * (k-1)) / 2) * c
elements. Our old friend Gauss appears again! Plus, of course, k * c for the take part.
That's bad: as you can see, we have a term that grows as O(k^2), and you certainly don't want that kind of complexity when your collection has 1 million elements. Moreover, one usually wants to split things into chunks precisely when the number of items is large, which is exactly our scenario. Therefore this algorithm is quite poor in the general case: we are not going to use it in our programs, are we?
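To put some numbers on this: with 1,000,000 items and a chunk size of c = 1000 there are k = 1000 pages, so the skipped elements add up to ((1000 * 999) / 2) * 1000 = 499,500,000. That is almost half a billion elements walked over just to page through a collection of one million.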
So, let's move forward and try to find something better. We have to solve the problem of restarting every time from the beginning, so let's try some caching; it usually works, right?
 
public static IEnumerable<IEnumerable<T>> Chunk<T>(this IEnumerable<T> enumerable, int chunkSize) 
{ 
    var total = enumerable.Count(); 
    var current = enumerable; 

    for (int i = 0; i * chunkSize < total; i++) 
    { 
        yield return current.Take(chunkSize); 
        current = current.Skip(chunkSize); 
    } 
}
That's even worse! Even if it looks like it works at first, if we consider the internal implementation of Linq we now have the same time complexity, and we have also added a spatial complexity of the same order, as "current" ends up being something like enumerable.Skip(chunkSize).Skip(chunkSize).Skip(chunkSize) ... Take(chunkSize).
This rapidly explodes and brings your software to an OutOfMemoryException when you have to work with collections of millions of items.
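Just to make that concrete, here is a minimal sketch (not from the original post; Counted is a hypothetical helper that counts how many elements are actually pulled from the source) showing how the reassigned current = current.Skip(chunkSize) chain keeps re-walking the already-consumed prefix:

using System;
using System.Collections.Generic;
using System.Linq;

class SkipChainDemo
{
    static long reads;

    // A lazy sequence of n integers that counts every element pulled from it.
    static IEnumerable<int> Counted(int n)
    {
        for (int j = 0; j < n; j++)
        {
            reads++;
            yield return j;
        }
    }

    static void Main()
    {
        const int total = 100000;
        const int chunkSize = 1000;
        IEnumerable<int> current = Counted(total);

        for (int i = 0; i * chunkSize < total; i++)
        {
            foreach (var item in current.Take(chunkSize)) { }  // consume one page
            current = current.Skip(chunkSize);                 // grow the deferred Skip(...) chain
        }

        // Only 100,000 items, but every page re-enumerates the whole chain of
        // Skip wrappers built so far, so this prints a number in the millions.
        Console.WriteLine("element reads: " + reads);
    }
}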
So is everything lost? Has Linq really failed at such an easy task? Certainly not!
Even if Skip and Take are very useful for pagination purposes, they are not suitable in this scenario.
Let's say they are designed to identify one subset of a sequence just once, so they work well if you have 1 million items and you only need to consider items 890000 to 890100 and then stop; in that case they are the best.
They are not suited to repeatedly extracting parts of a big collection, nor to splitting it into chunks... so, since Microsoft is infallible and these methods simply don't work here, this can only mean one thing: we are on the wrong path!
The answer is quite simple... what we want to do is split our sequence into groups, and each group should contain at most chunkSize elements.
Eureka! Linq has a GroupBy operator, and if we know lambda expressions and closures we can figure out... this:
public static IEnumerable<IGrouping<int,T>> Chunk<T>(this IEnumerable<T> enumerable, int chunkSize) 
{            
    int i = 0; 
    return enumerable.GroupBy(x => (int)Math.Floor((i++) / (double)chunkSize)); 
}
 
WTF! 2 lines of code! This really means it must be the correct answer ;)
Let's analyze this. We break the collection into groups, where each group's key is a progressive counter divided by the chunk size, keeping only the integral part (OK, it could also be written as (int)((double)i++ / chunkSize)... Floor is just for clarity).
This basically groups the first "chunkSize" items under the same key (0), then the next "chunkSize" items under the next key (1)... and so on!
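For completeness, here is a small usage sketch (myCollection and process are placeholders, as in the earlier example); each page is an IGrouping<int, T>, so its Key is simply the page index:

foreach (var page in myCollection.Chunk(1000))
{
    // page.Key is the progressive group key: 0, 1, 2, ...
    foreach (var item in page)
    {
        process(item);
    }
}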
Is it fast?
BLAZING fast! In fact the iterator doesn't need to restart from zero each time, bringing our computational complexity back from O(k^2) to O(k)! Great!
Is it memory consuming? Not that much, not at all!
I had a real-life situation where I tried these methods on the infamous 1-million-item collection.
The 1st method was soooo slow that I never discovered how long it takes, really! It probably takes minutes to iterate the entire paginated 1-million-item collection; I always killed it before it ended.
The second method is immediately out of the game, as it hits an OutOfMemoryException in about 10 seconds and never completes.
While the 3rd... it is simply lovely! It takes just the time required to actually enumerate the collection (about 5 seconds in my case), with no abnormal memory consumption!
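If you want to reproduce this kind of measurement yourself, a minimal timing sketch could look like the one below (it assumes the GroupBy-based Chunk extension above is in scope; on modern .NET you may have to disambiguate it from the built-in Enumerable.Chunk, and the numbers will of course depend on your machine):

using System;
using System.Diagnostics;
using System.Linq;

class ChunkTiming
{
    static void Main()
    {
        // The infamous 1-million-item collection.
        var items = Enumerable.Range(0, 1000000).ToList();

        var sw = Stopwatch.StartNew();
        long processed = 0;
        foreach (var page in items.Chunk(1000))   // the GroupBy-based Chunk from above
        {
            foreach (var item in page)
            {
                processed++;
            }
        }
        sw.Stop();

        Console.WriteLine(processed + " items enumerated in " + sw.ElapsedMilliseconds + " ms");
    }
}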
So finally we have our Chunk (or Page) method :) Do it well and do it with GroupBy: fast, memory efficient, and very elegant.
Linq is very lovely, I think, isn't it?
