Sorting
CSC 105 - The Digital Age - Weinman
Summary: We introduce and formalize two commonly used sorting algorithms.
The Problem of Sorting
Sorting a collection of values - arranging them in a fixed order,
usually alphabetical or numerical - is one of the most common computing
applications. When the number of values is even moderately large,
sorting is such a tiresome, error-prone, and time-consuming process
for human beings that the programmer should automate it whenever possible.
For this reason, computer scientists have studied this application
with extreme care and thoroughness.
One of the clear results of their investigations is that no one algorithm
for sorting is best in all cases. Which approach is best depends on
whether one is sorting a small collection or a large one, on whether
the individual elements occupy a lot of storage (so that moving them
around in memory is time-consuming), on how easy or hard it is to
compare two elements to figure out which one should precede the other,
and so on. In this course we'll be looking at two of the most generally
useful algorithms for sorting: insertion sort and merge sort.
Insertion Sort
As one might guess, the key operation in insertion sort is that of
insertion. We envision separating the elements to be sorted
into two collections: those that are not yet sorted, and those that
are already sorted. We repeatedly insert one value from the unsorted
collection into the proper place in the sorted collection.
The way in which we represent the unsorted and sorted collections
has an effect on the way in which we implement the insertion operation.
We will logically partition our data into two parts by keeping track
of a boundary between them inside the data. Items to the left of the
boundary are in the sorted part; items to its right are in
the unsorted part. Initially the boundary sits just past the first item,
since a one-element prefix is trivially sorted. The plan is to shift the
boundary, one position at a time, to the right end. When it arrives, the
entire list has been sorted.
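As an illustration (on an arbitrary four-element list), here is how the boundary, drawn as a vertical bar, moves to the right as each unsorted item is inserted:

```
5 | 2 4 1     the one-element prefix is trivially sorted
2 5 | 4 1     2 is inserted before 5
2 4 5 | 1     4 is inserted between 2 and 5
1 2 4 5 |     1 is inserted at the front; the whole list is sorted
```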
Here's the general flow of insertion sort.
-
Set index to 1 (the boundary between the sorted and unsorted parts)
While we have unsorted values left:
    Get the value of the first unsorted item
    Set position to the current index, the first possible insertion position for that item
    While there are sorted items remaining and
          the inserted item comes before the item at the previous position:
        Move the item at the previous position forward (making space for the inserted item)
        Retreat to the earlier position in the sorted items
    (When the previous loop completes, we have either
     run out of sorted items, or
     found the position in the sorted part where the current value should go.)
    Insert the item at the current position
    Advance the index to the next unsorted position
How does this work? We assume that there's an open space at the current
position in the list. (That is, that we can safely insert something there
without overwriting any value we still need.) Now, what do we do?
If the position is at the left end of the list, there's nothing smaller
in the list, so we just put the new value there. If the thing to the
left of the current position is smaller, we know we've reached the
right place, so we put the value there. In every other case, the value
to the left is larger than the value we want to insert, so we shift
that value right and continue working one position to the left. Because
we've copied the value to the right, it remains safe to insert
something in the position just vacated.
In Python, this is concretely (and succinctly) expressed as follows.
-
boundary = 1
while boundary < len(data):
    value = data[boundary]                   # first unsorted item
    position = boundary
    while position > 0 and data[position-1] > value:
        data[position] = data[position-1]    # shift the larger item right
        position = position - 1
    data[position] = value                   # insert into the vacated slot
    boundary = boundary + 1
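The loop above sorts a list named data in place. Wrapped in a function (the name insertion_sort is our choice for this sketch, not part of the notes), it can be exercised on a sample list:

```python
def insertion_sort(data):
    """Sort data in place using the insertion sort loop above."""
    boundary = 1
    while boundary < len(data):
        value = data[boundary]      # first unsorted item
        position = boundary
        # Shift larger sorted items rightward to open a slot for value.
        while position > 0 and data[position - 1] > value:
            data[position] = data[position - 1]
            position = position - 1
        data[position] = value      # drop value into the vacated slot
        boundary = boundary + 1

values = [5, 2, 4, 6, 1, 3]
insertion_sort(values)
print(values)  # [1, 2, 3, 4, 5, 6]
```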
Divide and Conquer
What techniques do we know for making algorithms faster? As we saw
in the case of binary search, it is often profitable to divide an
input in half. We call this technique divide-and-conquer. The strategy
works somewhat differently for different domains. For binary search,
after dividing the list in half, we could throw away one half and then
recurse on the other. Clearly, for sorting, we cannot throw away
part of the list. However, we can still rely on the idea of dividing
in half. That is, we'll divide the list into two halves, sort them,
and then do something with the two resulting lists. Here's a brief
sketch of that idea.
-
MergeSort( data ):
    If there is more than one item to sort:
        Calculate the (integer) midpoint of the data
        Split the data into the left half and the right half
        Merge sort the left half
        Merge sort the right half
        Merge the sorted halves back into the data
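As an illustration (on an arbitrary four-element list), the recursive calls unfold like this:

```
MergeSort([38, 27, 43, 3])
├── MergeSort([38, 27])
│   ├── MergeSort([38])          a one-element list is already sorted
│   ├── MergeSort([27])
│   └── merge -> [27, 38]
├── MergeSort([43, 3])
│   ├── MergeSort([43])
│   ├── MergeSort([3])
│   └── merge -> [3, 43]
└── merge -> [3, 27, 38, 43]
```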
Merging
But what do we do once we've sorted the two sublists? We need to put
them back into one list. Traditionally, we refer to the process of
joining two sorted lists as merging. It is relatively easy
to merge two lists: You repeatedly take whichever element of the two
lists should come first. When do you stop? When you run out of elements
in one of the lists, in which case you use the elements of the remaining
list. Putting it all together, we get the following.
-
Merge( left, right, merged ):
    Start at the beginning of each list
    While there are unmerged values in both halves, do the following:
        If the next unmerged item from left precedes the next unmerged item from right:
            Add the unmerged item from left to the merged list
            Advance to the next unmerged item in left
        Otherwise:
            Add the unmerged item from right to the merged list
            Advance to the next unmerged item in right
        Advance to the next position in merged
    (When the loop terminates, one (or both) halves have all their data merged,
     so we copy the remainder of either half separately.)
    While there are unmerged values in left:
        Add the unmerged item from left to the merged list
        Advance to the next unmerged item in left
        Advance to the next position in merged
    While there are unmerged values in right:
        Add the unmerged item from right to the merged list
        Advance to the next unmerged item in right
        Advance to the next position in merged
Merge Sort
Putting it all together, here is a concrete (and less succinct) version
of merge sort in Python.
-
def mergeSort( data ):
    if len(data) > 1:                    # one item (or none) is already sorted
        midpoint = len(data) // 2
        leftdata = data[:midpoint]       # copy of the left half
        rightdata = data[midpoint:]      # copy of the right half
        mergeSort(leftdata)
        mergeSort(rightdata)
        merge(leftdata, rightdata, data) # merge halves back into data

def merge( left, right, merged ):
    leftPos = 0
    rightPos = 0
    mergedPos = 0
    # Repeatedly take the smaller front item from either half.
    while leftPos < len(left) and rightPos < len(right):
        if left[leftPos] < right[rightPos]:
            merged[mergedPos] = left[leftPos]
            leftPos = leftPos + 1
        else:
            merged[mergedPos] = right[rightPos]
            rightPos = rightPos + 1
        mergedPos = mergedPos + 1
    # Copy whatever remains of the left half.
    while leftPos < len(left):
        merged[mergedPos] = left[leftPos]
        leftPos = leftPos + 1
        mergedPos = mergedPos + 1
    # Copy whatever remains of the right half.
    while rightPos < len(right):
        merged[mergedPos] = right[rightPos]
        rightPos = rightPos + 1
        mergedPos = mergedPos + 1
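As a quick check, here is the same code exercised on a sample list (the definitions are repeated from the listing above so this snippet runs on its own; the input values are arbitrary):

```python
def mergeSort(data):
    """Recursively sort data in place (as in the listing above)."""
    if len(data) > 1:
        midpoint = len(data) // 2
        leftdata = data[:midpoint]
        rightdata = data[midpoint:]
        mergeSort(leftdata)
        mergeSort(rightdata)
        merge(leftdata, rightdata, data)

def merge(left, right, merged):
    """Merge two sorted lists into merged, overwriting its contents."""
    leftPos, rightPos, mergedPos = 0, 0, 0
    while leftPos < len(left) and rightPos < len(right):
        if left[leftPos] < right[rightPos]:
            merged[mergedPos] = left[leftPos]
            leftPos = leftPos + 1
        else:
            merged[mergedPos] = right[rightPos]
            rightPos = rightPos + 1
        mergedPos = mergedPos + 1
    while leftPos < len(left):           # copy any left-half remainder
        merged[mergedPos] = left[leftPos]
        leftPos = leftPos + 1
        mergedPos = mergedPos + 1
    while rightPos < len(right):         # copy any right-half remainder
        merged[mergedPos] = right[rightPos]
        rightPos = rightPos + 1
        mergedPos = mergedPos + 1

values = [38, 27, 43, 3, 9, 82, 10]
mergeSort(values)
print(values)  # [3, 9, 10, 27, 38, 43, 82]
```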
Donald Knuth (1998)[1] tells us, "from a historical point of view, merge sorting was one
of the very first methods proposed for computer sorting; it was suggested
by John von Neumann as early as 1945" (p. 159). Knuth likewise credits
John Mauchly for a variant of insertion sort dating to 1946 "in
the first published discussion of computer sorting" (p. 82), but
does not credit anyone for straight insertion sort.
Acknowledgments
Adapted in part from Sorting and Merge Sort by Janet Davis, Samuel A.
Rebelsky, and Jerod Weinman. Used by permission.
Copyright © 2011 Jerod Weinman.
This work is licensed under a Creative Commons
Attribution-Noncommercial-Share Alike 3.0 United States License.
Footnotes:
[1] Knuth, D. (1998). The Art of Computer Programming: Sorting and
Searching (2nd edition). Addison-Wesley.