Monday, September 15, 2014

Hands on big data - Crash Course on Spark - Optimizing Transformations - Lesson 4


1. Take a list of names
2. Create buckets by initials
3. Group by keys
4. Map each initial into the set of names, get the size
5. Collect the results into the master as an array
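
Put together, the pipeline might look roughly like this in Scala (a sketch only: the hdfs:/names path and the SparkContext setup are assumptions, not taken from the original code):

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.SparkContext._   // pair-RDD operations (needed on older Spark versions)

    // in a real job the master is usually supplied by spark-submit
    val sc = new SparkContext(new SparkConf().setAppName("initials"))

    val counts = sc.textFile("hdfs:/names")          // 1. take a list of names
      .map(name => (name.charAt(0), name))           // 2. bucket each name by its initial
      .groupByKey()                                  // 3. group by key (the initial)
      .mapValues(names => names.toSet.size)          // 4. map each initial to the number of distinct names
      .collect()                                     // 5. collect the results into the master as an array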

This code is not particularly optimized. For instance, we might want to force the number of partitions, since the default is one partition per 64 MB chunk of input.
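
For example (a sketch: the choice of 6 partitions is arbitrary, and the path is the same assumed one):

    // the second argument asks Spark for at least 6 partitions when reading the file...
    val names = sc.textFile("hdfs:/names", 6)

    // ...or an existing RDD can be repartitioned explicitly
    val morePartitions = names.repartition(6)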

Then we can avoid counting duplicates
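
One way is a distinct() before the grouping (a sketch along the lines of the original example):

    val counts = sc.textFile("hdfs:/names")
      .distinct()                              // drop duplicate names before the expensive grouping
      .map(name => (name.charAt(0), name))
      .groupByKey()
      .mapValues(_.size)                       // names are already unique, so a plain size is enough
      .collect()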



or, better, count the unique names with a reduceByKey where the reduce function is the sum.
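
A sketch of that version (again with the assumed path): every distinct name contributes a 1 and reduceByKey sums them, combining partial counts on each node before the shuffle.

    val counts = sc.textFile("hdfs:/names")
      .distinct()
      .map(name => (name.charAt(0), 1))        // one vote per distinct name
      .reduceByKey(_ + _)                      // sum the votes per initial; partial sums happen map-side
      .collect()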



Both optimizations reduce the amount of data shuffled over the network.

I found this example very instructive [Deep Dive: Spark Internals]; however, you should add some string sanity checks
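
For example, an empty line would make charAt(0) throw, so a hedged guard could look like this:

    val counts = sc.textFile("hdfs:/names")
      .map(_.trim)
      .filter(_.nonEmpty)                      // skip blank or whitespace-only lines before taking the initial
      .distinct()
      .map(name => (name.charAt(0), 1))
      .reduceByKey(_ + _)
      .collect()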


and make sure that you don't run into heap problems if you fix the number of partitions: with too few partitions, each one can grow too large to fit in executor memory.





You will find more transformations here.
