Top k elements and k-th element
Two twists on the same idea that show up frequently in coding interviews in various incarnations:
- given an unsorted array of objects of length n, find the top k objects based on a given comparison function
- given an unsorted array of objects of length n, find k-th object based on a given comparion function, that is an object that would appear at the k-th place in an sorted array
You can expect k<<n, that is the length of the array is much bigger than the number of elements we are interested in.
In this article, we will go through 3 solutions to these problems:
- a trivial solution with O(n∗log(n)) complexity
- a medium complexity solution with O(n∗log(k)) complexity
- a complex solution that relies on prior knowledge of specific algorithms with O(n) complexity
All solutions are presented using an array of ints to minimize amount of boilerplate.
Trivial solution O(n∗log(n))
The trivial solution is given to us on a silver plate in the assignment text.
If the array would be sorted, returning either the k-th element or the top k elements would be trivial.
So let's just do that. Sort the array and simply return the k-th element or the top k elements of the sorted array.
Complexity analysis of trivial solution
We sort the array, which has n elements resulting in O(n∗log(n)) complexity. Since sorting is a destructive operation we also need to create a copy of the array resulting in O(n) space complexity.
Medium solution O(n∗log(k))
When we consider optimizations, we need to look at the unnecessary work we are doing. Since we do not care about the elements besides the top k we do not need to sort them. The standard library already offers an appropriate function
Complexity analysis of partial sort solution
Time complexity of
partial_sort is O(n∗log(k)).
We are still copying the array, so space complexity remains unchanged at O(n).
Medium solution with O(k) space complexity
One problem with the previous straightforward solution is that we still need to make copy of the entire array, since
partial_sort is still destructive, leaving us with O(n) space complexity.
We must store the top k elements, so the space complexity will be at least O(k). If we scan through the array and remember the top k elements upto now, we would end up with O(k) space complexity.
Datastructure we require must have the ability to insert elements, look a the smallest element and remove the smallest element all in log(k) time. Fortunately, heap provides such characteristics. In C++ we can use the
priority_queue (which is not necessarily implemented as a heap, but providing the same performance characteristics).
Complexity analysis of medium solution
We scan through all the elements of the array and at worst case we will insert each of them into the
priority_queue. This leads to total O(n∗2∗log(k) complexity (n calls of
pop(), each with log(k) complexity), ultimately collapsing to O(n∗log(k)). As already discussed the space complexity of this approach is O(k), as we only ever store k+1 elements in the
Complex solution O(n)
First observation is that if we know the k-th element, getting the top k elements is a simple scan through the array with O(n) complexity. So let's just concentrate on solving the k-th element in O(n) time.
What would happen if we would just blindly guess that a particular element is the k-the element. How would we verify that? If we use this element as a pivot and partition the array around it, we end up with all the bigger elements to its left and all the smaller elements to its right. If the picked element ends up on the k-th position, we know that our guess was correct (if the array has all unique elements).
What happens if we weren't correct? Well, we know on which side of the pivot our k-the element lies at least. If the pivot ended up on a position i<k then we need to search the elements to the right of the pivot, if it ended up on a position i>k, we need to search to the left.
This might sound very similar to the Quicksort algorithm and indeed, this is referred to as the Quickselect algorithm.
There is one more important step however. You might recall that Quicksort with a random pivot is O(n2) in worst case. Quickselect suffers the same faith unfortunately.
To fix this we need the median of medians algorithm.
Median of medians
The source for the worst case of Quickselect is a corner case when we keep choosing either the smallest or the largest element as our pivot. In that case we only improve the situation by one element, recursing over n+(n−1)+(n−2)... elements which ends up in the O(n2) class.
What if we could pick the pivot somewhat well. Let's say that we always pick a pivot that splits that array in such a way that the larger portion has q∗n;q<1 elements. Now the recursive pattern would be a geometric progression of n+q∗n+q2∗n... Sum of such progression is ∑n1qi∗n=n∗1−qn1−q. For all q<1 the right side of the sum ends up a constant number as we can bound it bellow 11−q.
The core idea of the median of medians is to process the array in chunks of 5 elements and pick medians from these 5 element sub-arrays. Due to the constant size, picking each of these medians is O(1). Once we have all the n5 medians, we then recursively apply the Quickselect algorithm to pick a median from these elements. The median (for size 5) has a nice property of splitting the array into at worst 310 and 710 chunks.
Complexity analysis of complex solution
As already mentioned, due to the smart choice of our median, the time complexity is O(n). Space complexity is O(n) as this is a destructive method and we need to make a copy of the array.
Next: O(1) data structure
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License. Šimon Tóth 2020 (firstname.lastname@example.org)