
Dr. Pucketlove – Or, How I Learned to Stop Worrying and Love Parquet (partitioning)
Pucket is a Scala library which provides a simple partitioning system for Parquet. But what is Parquet and why does it need partitioning when it already supports filtering? In this post I will attempt to explain Parquet, partitioning in Hadoop, and the motivation and design of Pucket. If you’re not interested in the background, you can skip straight to some simple code examples or go to the GitHub repository.