[SPARK-54838] A new feature to optimize partition size and count #53599
What changes were proposed in this pull request?
I am proposing a new function in the Dataset class to address the small-file problem. We have noticed that when the source data a Spark job reads consists of many small files (KB-sized), Spark creates a large number of partitions. This PR adds a new function named optimizePartition which creates partitions of 128MB by default; you can also pass your own desired partition size.
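Since the implementation is not shown in this description, here is a minimal, hypothetical sketch of how a target partition count might be derived from the desired partition size. The 128MB default comes from this PR; the object and method names below are illustrative only, not the actual PR code:

```scala
object PartitionSizing {
  // Default target partition size: 128 MB, matching the PR description.
  val DefaultTargetBytes: Long = 128L * 1024 * 1024

  // Hypothetical helper: number of partitions needed so that each
  // partition holds roughly `targetBytes` of input data.
  def targetPartitionCount(totalInputBytes: Long,
                           targetBytes: Long = DefaultTargetBytes): Int =
    math.max(1, math.ceil(totalInputBytes.toDouble / targetBytes).toInt)
}
```

For example, 10,000 files of 64 KB each (about 625 MB in total) would map to 5 partitions of roughly 128MB instead of 10,000 tiny ones.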
Why are the changes needed?
The changes are needed to solve the small-file problem. They also help reduce the number of files that get written back to the sink.
Does this PR introduce any user-facing change?
It does not change any existing feature or function of Dataset; it adds a brand-new function.
How was this patch tested?
I have added a number of unit tests covering the scenario of many small partitions: when the function is called, it either coalesces to reduce the partition count or repartitions to increase it if the partitions are too big. I also tested it locally using a dockerized environment containing a Spark cluster.
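The coalesce-versus-repartition decision described above could be sketched as follows. This is a hypothetical illustration of the decision rule, not the actual PR code: coalesce narrows partitions without a shuffle when shrinking, while repartition performs a full shuffle when growing.

```scala
object RepartitionDecision {
  sealed trait Action
  case class Coalesce(n: Int) extends Action    // narrow dependency, no shuffle
  case class Repartition(n: Int) extends Action // full shuffle
  case object Keep extends Action               // already at the target

  // Hypothetical rule: shrink with coalesce, grow with repartition.
  def decide(currentPartitions: Int, targetPartitions: Int): Action =
    if (targetPartitions < currentPartitions) Coalesce(targetPartitions)
    else if (targetPartitions > currentPartitions) Repartition(targetPartitions)
    else Keep
}
```

Preferring coalesce on the shrink path matters because it avoids shuffling data across the cluster, which is the common case when the input is many small files.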
Was this patch authored or co-authored using generative AI tooling?
I did most of the coding. I used Gemini to help me walk through the process of opening the PR, creating the Jira ticket, and using linters/formatters. This PR does not contain a lot of code change.