Data Deduplication and Read Ahead Cache
This has been discussed quite a lot, but I thought I'd throw in my two cents. A lot of people are still new to deduplication and what it does, and many are still very nervous about using it in production. I'm hesitant myself; I'm waiting for "2.0" in the dedupe world. As ever, I'm rarely at the bleeding edge with technology: it scares me too much, and I've seen too many people lose large amounts of data by going bleeding edge.
But dedupe has some interesting effects on the storage end. I touched on this briefly in my look at SSD usage, but even today it has some noticeable effects on performance.
So in theory you would think that dedupe would actually have an adverse effect on performance: several processes all requiring access to the same blocks of data on disk is surely going to result in a bottleneck at the disk? But don't forget read cache, especially dedupe-tuned read cache! I won't go into detail on un-tuned read cache, but you'll be able to figure that out for yourself.
So what happens with a dedupe-tuned read cache is that data blocks being accessed by multiple processes are kept in cache for longer. Who cares if there are now 10 pointers all pointing to one deduped block? That block is sitting in level-1 read cache, and we get lightning performance on it!
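Here's a toy sketch of the idea. Everything here is my own illustration, not any vendor's implementation: the `ref_count` hint stands in for whatever dedupe metadata a real array would consult, and the eviction policy is deliberately simplified.

```python
class DedupeTunedCache:
    """Toy read cache whose eviction favours heavily-deduped blocks.

    A block referenced by many logical pointers (high ref_count) is more
    likely to be read again soon, so it survives eviction longer. This is
    an illustrative sketch only; real caches combine this with recency.
    """

    def __init__(self, capacity):
        self.capacity = capacity
        self.blocks = {}      # block_id -> cached data
        self.ref_count = {}   # block_id -> number of dedupe pointers

    def read(self, block_id, refs, fetch):
        if block_id in self.blocks:
            return self.blocks[block_id]       # cache hit: no disk I/O
        data = fetch(block_id)                 # cache miss: go to disk
        if len(self.blocks) >= self.capacity:
            # Evict the block with the fewest dedupe references, so the
            # block that 10 pointers share stays hot in cache.
            victim = min(self.blocks, key=lambda b: self.ref_count[b])
            del self.blocks[victim]
            del self.ref_count[victim]
        self.blocks[block_id] = data
        self.ref_count[block_id] = refs
        return data
```

With a capacity of two, reading a heavily-shared block A (10 pointers), a unique block B, then a third block C evicts B, not A: the shared block keeps its cache slot.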
Take this one step further: a dedupe-tuned, read-ahead cache! That's pretty cool. Blocks A and B are read into cache, but blocks C through H are all identical, so let's just read block C and share that one cached copy for the other blocks. I'm deduplicating my cache! In reality, I actually need less read cache to get a more performant system.
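The read-ahead version can be sketched the same way. Again this is hypothetical: I'm assuming the dedupe layer can hand the read-ahead engine a content fingerprint per block, so identical blocks are fetched from disk only once and the cached copy is shared.

```python
def read_ahead_dedupe(block_ids, fingerprints, fetch):
    """Toy dedupe-aware read-ahead.

    block_ids:    logical blocks to prefetch, in order
    fingerprints: block_id -> content hash (assumed supplied by the
                  dedupe metadata layer)
    fetch:        reads one physical block from disk

    Returns (block_id -> data, number of physical disk reads).
    """
    cache = {}    # fingerprint -> one physical copy of the data
    result = {}   # block_id -> data (many logical pointers, shared copies)
    reads = 0
    for bid in block_ids:
        fp = fingerprints[bid]
        if fp not in cache:
            cache[fp] = fetch(bid)   # only unique content hits the disk
            reads += 1
        result[bid] = cache[fp]      # duplicates share the cached copy
    return result, reads
```

Prefetching blocks A through H where C to H are identical costs only three physical reads (A, B, and C); the other five logical blocks point at C's cached copy.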
Dedupe is great, and I can't wait for "2.0": the stable release with the cache fully tuned and block-level storage support. That isn't too far away either!