Saturday 8 March 2014

Sorting Algorithms


There are lots of things to consider with sorting algorithms. Most people don't need to know anything about them; others need to know a lot. They are an excellent area in which to explore questions of complexity and simplicity: there are a great many of them, some subtle, some devious, and many clever.
A comparison follows. Here "stable" means the algorithm can be used to implement a StableSort.

                Worst case    Average case   Best case   Extra space   Stable?

 BubbleSort       O(n^2)        O(n^2)?        O(n)          O(1)        yes

 SelectionSort    O(n^2)        O(n^2)        O(n^2)         O(1)        no (can depend on the implementation and data arrangement)

 InsertionSort    O(n^2)        O(n^2)         O(n)          O(1)        yes

 BitonicSort   O(n log^2 n)  O(n log^2 n)?      ?            O(1)?         ?

 ShellSort        O(n^2)       O(n log n)?     O(n)          O(1)         no

 QuickSort        O(n^2)       O(n log n)    O(n log n)    O(log n)       no

 HeapSort       O(n log n)     O(n log n)    O(n log n)      O(1)         no

 SmoothSort     O(n log n)     O(n log n)?     O(n)          O(1)         no


 MergeSort      O(n log n)     O(n log n)    O(n log n)      O(n)        yes

 TimSort        O(n log n)     O(n log n)?   O(n)            O(n)        yes

 CountingSort     O(n+k)        O(n+k)        O(n+k)        O(n+k)       yes

 RadixSort        O(n+k)        O(n+k)        O(n+k)        O(n+k)       yes

 BucketSort       O(n^2)        O(n+k)        ?????     O(n*k) or O(n+k) ?

 BogoSort       unbounded        O(n!)         O(n)          O(1)         no

 SlowSort       O(n^(log n))  O(n^(log n))  O(n^(log n))     O(1)        yes

 QuantumBogoSort O(1)         O(1)          O(1)             O(0)         no

In the above table, k is the range of the key values (equivalently, the number of buckets or counters needed).
Why are there so many of them? And why isn't the "best" one used everywhere? We need to consider time-complexity and space-complexity, both for small inputs and asymptotically - the comparison above only shows asymptotic behaviour and neglects constant factors. The "real" performance of an algorithm depends on the system you are coding for, the amount of data, the characteristics of that data, and so on. We also need to consider ease of coding, compactness of code, predictability of running time, and the intermediate state of the data.
In short, each of the algorithms has its limitations. In-place HeapSort, for example, completely scrambles the order of your data if you interrupt it before it has finished, whereas the three simpler algorithms iteratively improve the sortedness. If we need stability, none of the unstable algorithms can be used. Believe it or not, there are situations where BubbleSort is not a bad choice.
Most of the time the truly "optimal" method is a combination of two or more of the above.

Since a list of n elements has n! possible orderings, and since each comparison (having only two outcomes) can at best halve the set of remaining candidate orderings, in the worst case a sorting algorithm must make at least log(n!) comparisons, where the logarithm is taken base 2. This gives a lower bound on the order of n log n for the number of comparisons a serial comparison-based algorithm needs in the worst case. Checking that an already sorted sequence is sorted takes O(n), which is therefore the lower bound for the best case.
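For completeness (this step is implicit above), a quick sketch showing that log2(n!) really is of order n log n:

  \log_2(n!) \;=\; \sum_{i=2}^{n} \log_2 i \;\le\; n \log_2 n
  \qquad\text{and}\qquad
  \log_2(n!) \;\ge\; \sum_{i=\lceil n/2 \rceil}^{n} \log_2 i \;\ge\; \frac{n}{2} \log_2 \frac{n}{2},

so log2(n!) = Theta(n log n).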

The RadixSort runs in O(n), but only works in a few specialized applications. It avoids the above O(n log n) worst case lower bound because it is not based on comparing elements with each other.
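To make "not based on comparing elements" concrete, here is a rough CountingSort in C - my own sketch, not code from the original page; RadixSort applies the same idea once per digit. The keys index an array directly, and no element is ever compared with another:

  #include <stdlib.h>
  #include <string.h>

  /* Sort n integers whose values lie in 0..k-1, without any comparisons. */
  void counting_sort (int a[], int n, int k)
  {
      int i;
      int *count = calloc (k, sizeof *count);
      int *out   = malloc (n * sizeof *out);

      for (i = 0; i < n; i++)          /* histogram of the key values     */
          count[a[i]]++;
      for (i = 1; i < k; i++)          /* prefix sums give end positions  */
          count[i] += count[i - 1];
      for (i = n - 1; i >= 0; i--)     /* place items; scanning backwards */
          out[--count[a[i]]] = a[i];   /* keeps the sort stable           */

      memcpy (a, out, n * sizeof *out);
      free (count);
      free (out);
  }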
Another consideration is the one between StableSort and UnstableSort. If we find an algorithm that suits our data perfectly, but isn't stable, we may need to use a less optimal algorithm (if, of course, stability is important to us for some reason).

Bubble Sort

Sweep through your data, swapping adjacent items if they are the wrong way around. A factor-of-2 improvement comes from making successive sweeps shorter, since each sweep pushes the largest remaining item into its final place. The time-complexity is n*(n-1)/2 comparisons, and on average we expect to make about half that many exchanges. If you don't swap equal items, the sort is stable.
It comes in two basic flavours. One sweeps the data until it knows no problems can remain (n-1 sweeps, as above); the other sets a flag whenever items are swapped and keeps sweeping until the flag stays false. The latter stops early if the data were almost sorted to start with, at the small extra cost of keeping and testing the flag.

BubbleSort is easy to understand, easy to code, but inefficient. One hypothetical Really Aggressive Optimizer automatically recognizes undergraduate implementations of bubble sort and replaces them with something better.
Separating the ideas of "finding the largest" and "moving the largest" gives InsertionSort and SelectionSort, whose comparative speeds depend on which is more expensive, comparisons or swaps; see those sections below for details.
BubbleSort is essentially never faster than InsertionSort (so long as the InsertionSort uses a linear rather than a binary search), though it has been claimed to win on arrays as small as 3 elements.
It is important to remember that Bubble Sort is not a 100% useless algorithm. If you have a sequence of objects that you would like to keep ordered, and that is occasionally perturbed by having the value of one or two objects increase or decrease, bubble sorting is a good choice. Consider Z-buffering a field of mostly-immobile objects: if one or two objects move closer to or further from the observer, re-sorting amounts to swapping a few adjacent slots, which is exactly what Bubble Sort is good at. MergeSort or QuickSort, meanwhile, would be inefficient here; a naive QuickSort is notoriously clumsy when the list is already nearly sorted. Of course, an insertion-sort on just the perturbed objects is usually better still.
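A rough sketch of that idea (my own illustration; the struct and field names are invented): a flagged bubble sweep over depth-sorted objects puts a slightly perturbed field back in order and exits after the first clean sweep.

  typedef struct { float depth; /* ... other object data ... */ } object_t;

  /* Restore near-sorted Z-order after a small perturbation.  Each sweep
     swaps out-of-order neighbours and stops after the first sweep that
     makes no swaps, so an almost-sorted array costs only O(n). */
  void repair_order (object_t objs[], int n)
  {
      int swapped;
      do {
          int i;
          swapped = 0;
          for (i = 0; i + 1 < n; i++)
              if (objs[i].depth > objs[i + 1].depth)
              {
                  object_t t = objs[i];
                  objs[i] = objs[i + 1];
                  objs[i + 1] = t;
                  swapped = 1;
              }
      } while (swapped);
  }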


Sample code in CeeLanguage
  void
  bubble_sort (int a[], int n /* the size of array */)
  {
      int i;
      for (i = 0; i < n - 1; i++)
      {
            int j;
            for (j = n - 1; j > i; j--)
            {
                  if (a[j - 1] > a[j])
                  {
/* Swap an adjacent pair of items. */
                        int t = a[j - 1];
                        a[j - 1] = a[j];
                        a[j] = t;
                  }
            }
      }
  }
This version always does the same amount of work: best case == worst case == average case == O(n^2). The next formulation, which stops as soon as a sweep makes no swaps, obtains the advertised best-case performance of O(n).
  void
  bubble_sort (int a[], int n /* the size of the array; inside the loop, the length of the still-unsorted prefix */)
  {
      int i, sorted;
      do {
            sorted = 1;
            --n;
            for (i = 0; i < n; i++)
            {
                  if (a[i] > a[i + 1])
                  {
                        /* Swap adjacent pair of items. */
                        int t = a[i];
                        a[i] = a[i + 1];
                        a[i + 1] = t;
                        sorted = 0;
                  }
            }
      } while (!sorted);
  }

Sample code in PythonLanguage:
 def bubble(alist):
      length = len(alist)
      for i in xrange(length-1):
            for j in xrange(length-i-1):
                  if alist[j] > alist[j+1]:
                        alist[j], alist[j+1] = alist[j+1], alist[j]

 def bubble_flag(alist):
      length = len(alist)
      for i in xrange(length-1):
            swapped = False
            for j in xrange(length-i-1):
                  if alist[j] > alist[j+1]:
                        alist[j], alist[j+1] = alist[j+1], alist[j]
                        swapped = True
            if not swapped: return


Sample code in SmalltalkLanguage - a hybrid sort that picks an algorithm by collection size:
 sort: aCollection
      ^(aCollection size < self switchSize)
            ifTrue: [self insertionSort: aCollection]
            ifFalse: [self quickSort: aCollection]



I've always thought that Bubble Sort should be called Rock Sort, because the elements actually fall to the bottom rather than bubble to the top. Whether they fall or bubble depends purely on whether you sweep forwards or backwards.
 For example, here is 8 7 9 3 5 under successive sweeps - the first column is the initial data, and each later column shows the array after one more sweep (read each column top to bottom):
 8 7 7 3
 7 8 3 5
 9 3 5 7
 3 5 8 8
 5 9 9 9

Selection Sort

Selection Sort works by finding the smallest/largest item and then placing it at the front/back. Runtime analysis:
                Worst case  Average case  Best case
 Comparisons      O(n^2)       O(n^2)        O(n)
 Swaps             O(n)         O(n)         O(1)

The "best case" analysis assumes the "stop early" strategy is used, checking that things are in order as you run along. If comparisons are expensive and swaps cheap you'd be better using InsertionSort.


Example Code:
The first version uses YAGNI - hence the selection is done in-line because the abstraction isn't needed.
  /* A small swap helper is assumed; one possible definition: */
  void swap(int A[], int i, int j)
  {
    int t = A[i]; A[i] = A[j]; A[j] = t;
  }

  void selectionSort(int A[], int n)
  {
    int i,j,m;
    while (n>1)
    {
       /* Find the largest element in A[0..n-1]... */
       m = A[0]; j = 0;
       for (i=1;i<n;i++)
          if (m<A[i]) m = A[i], j = i;
       /* ...and swap it to the end of the unsorted prefix. */
       swap(A,--n,j);
    }
  }

This second version uses ShortMethods - "Factor out units of semantic meaning, even if they are only ever used once."
  void selectionSort(int A[], int n)
  {
    while (n>1)
    {
       int j;
       j = indexOfLargest(A, n);
       swap(A,--n,j);
    }
  }

  int indexOfLargest(int A[], int n)
  {
    int i, j, m;
    m = A[0]; j = 0;

    for (i=1;i<n;i++)
      if (m<A[i])
        m = A[i], j = i;
    return j;
  }

You might think that this last algorithm is the equivalent of BubbleSort, but it isn't: it performs only n-1 exchanges, each moving an item directly to its correct place.
This third version is in the StlStyle:
  #include <algorithm>   /* std::min_element, std::iter_swap */

  template<typename ForwardIterator>
  void SelectionSort(ForwardIterator start, ForwardIterator stop)
  {
    for(; start != stop; ++start)
      std::iter_swap(std::min_element(start, stop), start);
  }

Of course, this isn't an optimal implementation (the last element is swapped with itself).


SelectionSort: sweep your data to find the largest element and swap it with the last element. Repeat on smaller and smaller initial arrays. You perform an exchange almost every time, because the largest item in the unsorted prefix is almost never already at the end. Complexity is n*(n-1)/2 comparisons and at most n-1 exchanges. When coded in-line in C, it is possibly the smallest sorting code. The usual swap-based version is not stable, but it can be made so (for example by inserting rather than swapping).

There is a "stop early" strategy:
  For each sweep:
    set "sorted_so_far" to "true".
    As you sweep:
      If the item you're looking at is more than max_so_far:
        update max_so_far and the index pointing to it.
      If the item you're looking at is less than max_so_far:
        set "sorted_so_far" to false.

At the end of any sweep, if "sorted_so_far" is still true, then the array is sorted and you can stop early.
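A minimal C sketch of this stop-early SelectionSort (my own illustration of the strategy above, not code from the original):

  void selection_sort_stop_early (int a[], int n)
  {
      while (n > 1)
      {
          int i, max_index = 0, sorted_so_far = 1;

          for (i = 1; i < n; i++)
          {
              if (a[i] > a[max_index])
                  max_index = i;            /* update max_so_far and its index */
              else if (a[i] < a[max_index])
                  sorted_so_far = 0;        /* saw an item out of order        */
          }

          if (sorted_so_far)
              return;                       /* already sorted: stop early */

          --n;
          {
              int t = a[max_index];         /* move the largest to the end */
              a[max_index] = a[n];
              a[n] = t;
          }
      }
  }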
This can be the most economical sort to choose if you perform each selection interleaved with other work, and that work may abort the sort operation. For example, an AlphaBetaSearch is most efficient if the most valuable edges ("moves") are traversed first. SelectionSort can pick out the estimated most valuable edge to traverse, with a high probability of an immediate beta cutoff. A cutoff eliminates the need to sort any further edges, saving the work that a full sort of the edges would have required.

 Insertion Sort



The Insertion Sort works by picking an element, figuring out where it should go, then putting it there.
Using BinarySearch to figure where the item should go gives the following runtime analysis:
                Worst case  Average case  Best case
 Comparisons    O(n log n)   O(n log n)   O(n log n)
 Swaps            O(n^2)       O(n^2)        O(1)

If you use LinearSearch to find the item location this changes to:
                Worst case  Average case  Best case
 Comparisons      O(n^2)       O(n^2)        O(n)
 Swaps            O(n^2)       O(n^2)        O(1)

If you expect your data to be mostly sorted, the LinearSearch version might be faster, though BubbleSort might work better still. Beware of BinarySearch - it is subtle and error-prone, even when using versions from the published literature.
If comparisons are cheap and swaps expensive, you'd be better off using SelectionSort.
InsertionSort and MergeSort are StableSorts, whereas the famous QuickSort is not. (SelectionSort, as noted above, is stable only in some implementations.)


Code example in C, using a BinarySearch to find the insertion point:
    void
    insertion_sort (int a[], int n /* the size of the array */)
    {
        int i;
        for (i = 1; i < n; i++)
        {
            /* Items before a[i] are already sorted. */
            /* Pick up the next number. */
            int b = a[i];

            /* Do a binary search for the first position whose element
               is greater than b; inserting there keeps the sort stable. */

            int low = 0, high = i, k;

            while (low < high)
            {
                int j = (low + high) / 2;

                if (b < a[j]) high = j;
                else          low = j + 1;
            }

            /* Shift items a[low .. i-1] up by one place. */
            for (k = i; k > low; k--)
                a[k] = a[k - 1];

            a[low] = b;
        }
    }



Code using upper_bound of STL
  /* Algorithm from stl_algo.h in the SGI STL library */
  int
  upper_bound (int a[], int low, int high, int value)
  {
      int len = high - low;
      while (len > 0)
      {
          int half = len / 2;
          if (value < a[low + half])
              len = half;
          else
          {
              low = low + half + 1;
              len = len - (half + 1);
          }
      }
      return low;
  }

  void
  insertion_sort (int a[], int n /* the size of the array */)
  {
      int i;
      for (i = 1; i < n; i++)
      {
          /* Items before a[i] are already sorted. */

          /* Pick up the next number. */
          int b = a[i];

          int up, k;
          up = upper_bound (a, 0, i, b);

          /* Shift items a[up .. i-1] up by one place and insert. */
          for (k = i; k > up; k--)
              a[k] = a[k - 1];

          a[up] = b;
      }
  }




Couldn't resist:
  int  findInsertionPoint (int a[], int i);
  void shiftUp (int a[], int from, int i);

  void
  insertion_sort (int a[], int n /* the size of the array */)
  {
      int i;
      for (i = 1; i < n; i++)
      {
          /* Items before a[i] are already sorted. */

          int pos = findInsertionPoint (a, i);

          int b = a[i];          /* OK, so this was added during refactoring ... */
          shiftUp (a, pos, i);
          a[pos] = b;
      }
  }

  int findInsertionPoint (int a[], int i)
  {
      /* Pick up the number to be inserted. */
      int b = a[i];

      /* Binary search for the first position whose element is
         greater than b. */
      int low = 0, high = i;

      while (low < high)
      {
          int j = (low + high) / 2;

          if (b < a[j])
              high = j;
          else
              low = j + 1;
      }
      return low;
  }

  void shiftUp (int a[], int from, int i)
  {
      /* Shift items a[from .. i-1] up by one place. */
      int k;
      for (k = i; k > from; k--)
          a[k] = a[k - 1];
  }





Not only is a binary search difficult to implement correctly, it doesn't gain you anything in this case. Every element in the already-sorted set greater than the element being inserted will have to be accessed in order to make room for the element being inserted, so you may as well combine the movement of the data with the comparison:
 void insertion_sort(int a[], int elementCount) {
      int i;
      for (i = 1; i < elementCount; i++) {
            int insertee = a[i], j;

            /* Shift larger items up, then drop the insertee into the gap. */
            for (j = i - 1; j >= 0 && a[j] > insertee; j--)
                  a[j + 1] = a[j];
            a[j + 1] = insertee;
      }
 }

Incidentally, this approach is stable (equal items are never moved past one another), and it's about half the size of the binary-search versions.

