Sorting
CSC 105 - The Digital Age - Weinman
Summary: We introduce and formalize two commonly used sorting algorithms.
The Problem of Sorting
Sorting a collection of values - arranging them in a fixed order,
usually alphabetical or numerical - is one of the most common computing
applications. When the number of values is even moderately large,
sorting is such a tiresome, error-prone, and time-consuming process
for human beings that the programmer should automate it whenever possible.
For this reason, computer scientists have studied this application
with extreme care and thoroughness.
One of the clear results of their investigations is that no one algorithm
for sorting is best in all cases. Which approach is best depends on
whether one is sorting a small collection or a large one, on whether
the individual elements occupy a lot of storage (so that moving them
around in memory is time-consuming), on how easy or hard it is to
compare two elements to figure out which one should precede the other,
and so on. In this course we'll be looking at two of the most generally
useful algorithms for sorting: insertion sort and merge sort.
Insertion Sort
As one might guess, the key operation in insertion sort is that of
insertion. We envision separating the elements to be sorted
into two collections: those that are not yet sorted, and those that
are already sorted. We repeatedly insert one value from the unsorted
collection into the proper place in the sorted collection.
The way in which we represent the unsorted and sorted collections
has an effect on the way in which we implement the insertion operation.
We will logically partition our data into two parts by keeping track
of a boundary between them inside the data. Items to the left of the
boundary are in the sorted part; items to its right are in
the unsorted part. Initially the boundary sits just past the first item,
since a one-element prefix is trivially sorted. The plan is to shift the
boundary, one position at a time, to the right end. When it arrives, the
entire list has been sorted.
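As an illustration (on an arbitrary four-element list), here is how the boundary, drawn as a vertical bar, moves to the right as each unsorted item is inserted:

```
5 | 2 4 1     the one-element prefix is trivially sorted
2 5 | 4 1     2 is inserted before 5
2 4 5 | 1     4 is inserted between 2 and 5
1 2 4 5 |     1 is inserted at the front; the whole list is sorted
```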
Here's the general flow of insertion sort.
-
Set index to 1 (the boundary between the sorted and unsorted parts)
While we have unsorted values left:
    Get the value of the first unsorted item
    Set position to the current index, the first possible insertion position for that item
    While there are sorted items remaining and
          the inserted item comes before the item at the previous position:
        Move the item at the previous position forward (making space for the inserted item)
        Retreat to the earlier position in the sorted items
    (When the previous loop completes, we have either
     run out of sorted items, or
     found the position in the sorted part where the current value should go.)
    Insert the item at the current position
    Advance the index to the next unsorted position
How does this work? We assume that there's an open space at the current
position in the list. (That is, that we can safely insert something there
without overwriting any value we still need.) Now, what do we do?
If the position is at the left end of the list, there's nothing smaller
in the list, so we just put the new value there. If the thing to the
left of the current position is smaller, we know we've reached the
right place, so we put the value there. In every other case, the value
to the left is larger than the value we want to insert, so we shift
that value right and continue working one position to the left. Because
we've copied the value to the right, it remains safe to insert
something in the position just vacated.
In Python, this is concretely (and succinctly) expressed as follows.
-
boundary = 1
while boundary < len(data):
    value = data[boundary]                   # first unsorted item
    position = boundary
    while position > 0 and data[position-1] > value:
        data[position] = data[position-1]    # shift the larger item right
        position = position - 1
    data[position] = value                   # insert into the vacated slot
    boundary = boundary + 1
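The loop above sorts a list named data in place. Wrapped in a function (the name insertion_sort is our choice for this sketch, not part of the notes), it can be exercised on a sample list:

```python
def insertion_sort(data):
    """Sort data in place using the insertion sort loop above."""
    boundary = 1
    while boundary < len(data):
        value = data[boundary]      # first unsorted item
        position = boundary
        # Shift larger sorted items rightward to open a slot for value.
        while position > 0 and data[position - 1] > value:
            data[position] = data[position - 1]
            position = position - 1
        data[position] = value      # drop value into the vacated slot
        boundary = boundary + 1

values = [5, 2, 4, 6, 1, 3]
insertion_sort(values)
print(values)  # [1, 2, 3, 4, 5, 6]
```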
Divide and Conquer
What techniques do we know for making algorithms faster? As we saw
in the case of binary search, it is often profitable to divide an
input in half. We call this technique divide-and-conquer. The strategy
works somewhat differently for different domains. For binary search,
after dividing the list in half, we could throw away one half and then
recurse on the other. Clearly, for sorting, we cannot throw away
part of the list. However, we can still rely on the idea of dividing
in half. That is, we'll divide the list into two halves, sort them,
and then do something with the two resulting lists. Here's a brief
sketch of that idea.
-
MergeSort( data ):
    If there is more than one item to sort:
        Calculate the (integer) midpoint of the data
        Split the data into the left half and the right half
        Merge sort the left half
        Merge sort the right half
        Merge the sorted halves back into the data
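As an illustration (on an arbitrary four-element list), the recursive calls unfold like this:

```
MergeSort([38, 27, 43, 3])
├── MergeSort([38, 27])
│   ├── MergeSort([38])          a one-element list is already sorted
│   ├── MergeSort([27])
│   └── merge -> [27, 38]
├── MergeSort([43, 3])
│   ├── MergeSort([43])
│   ├── MergeSort([3])
│   └── merge -> [3, 43]
└── merge -> [3, 27, 38, 43]
```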
Merging
But what do we do once we've sorted the two sublists? We need to put
them back into one list. Traditionally, we refer to the process of
joining two sorted lists as merging. It is relatively easy
to merge two lists: You repeatedly take whichever element of the two
lists should come first. When do you stop? When you run out of elements
in one of the lists, in which case you use the elements of the remaining
list. Putting it all together, we get the following.
-
Merge( left, right, merged ):
    Start at the beginning of each list
    While there are unmerged values in both halves, do the following:
        If the next unmerged item from left precedes the next unmerged item from right:
            Add the unmerged item from left to the merged list
            Advance to the next unmerged item in left
        Otherwise:
            Add the unmerged item from right to the merged list
            Advance to the next unmerged item in right
        Advance to the next position in merged
    (When the loop terminates, one (or both) halves have all their data merged,
     so we copy the remainder of either half separately.)
    While there are unmerged values in left:
        Add the unmerged item from left to the merged list
        Advance to the next unmerged item in left
        Advance to the next position in merged
    While there are unmerged values in right:
        Add the unmerged item from right to the merged list
        Advance to the next unmerged item in right
        Advance to the next position in merged
Merge Sort
Putting it all together, here is a concrete (and less succinct) version
of merge sort in Python.
-
def mergeSort( data ):
    if len(data) > 1:                    # one item (or none) is already sorted
        midpoint = len(data) // 2
        leftdata = data[:midpoint]       # copy of the left half
        rightdata = data[midpoint:]      # copy of the right half
        mergeSort(leftdata)
        mergeSort(rightdata)
        merge(leftdata, rightdata, data) # merge halves back into data

def merge( left, right, merged ):
    leftPos = 0
    rightPos = 0
    mergedPos = 0
    # Repeatedly take the smaller front item from either half.
    while leftPos < len(left) and rightPos < len(right):
        if left[leftPos] < right[rightPos]:
            merged[mergedPos] = left[leftPos]
            leftPos = leftPos + 1
        else:
            merged[mergedPos] = right[rightPos]
            rightPos = rightPos + 1
        mergedPos = mergedPos + 1
    # Copy whatever remains of the left half.
    while leftPos < len(left):
        merged[mergedPos] = left[leftPos]
        leftPos = leftPos + 1
        mergedPos = mergedPos + 1
    # Copy whatever remains of the right half.
    while rightPos < len(right):
        merged[mergedPos] = right[rightPos]
        rightPos = rightPos + 1
        mergedPos = mergedPos + 1
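As a quick check, here is the same code exercised on a sample list (the definitions are repeated from the listing above so this snippet runs on its own; the input values are arbitrary):

```python
def mergeSort(data):
    """Recursively sort data in place (as in the listing above)."""
    if len(data) > 1:
        midpoint = len(data) // 2
        leftdata = data[:midpoint]
        rightdata = data[midpoint:]
        mergeSort(leftdata)
        mergeSort(rightdata)
        merge(leftdata, rightdata, data)

def merge(left, right, merged):
    """Merge two sorted lists into merged, overwriting its contents."""
    leftPos, rightPos, mergedPos = 0, 0, 0
    while leftPos < len(left) and rightPos < len(right):
        if left[leftPos] < right[rightPos]:
            merged[mergedPos] = left[leftPos]
            leftPos = leftPos + 1
        else:
            merged[mergedPos] = right[rightPos]
            rightPos = rightPos + 1
        mergedPos = mergedPos + 1
    while leftPos < len(left):           # copy any left-half remainder
        merged[mergedPos] = left[leftPos]
        leftPos = leftPos + 1
        mergedPos = mergedPos + 1
    while rightPos < len(right):         # copy any right-half remainder
        merged[mergedPos] = right[rightPos]
        rightPos = rightPos + 1
        mergedPos = mergedPos + 1

values = [38, 27, 43, 3, 9, 82, 10]
mergeSort(values)
print(values)  # [3, 9, 10, 27, 38, 43, 82]
```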
Donald Knuth (1998)[1] tells us, "from a historical point of view, merge sorting was one
of the very first methods proposed for computer sorting; it was suggested
by John von Neumann as early as 1945" (p. 159). Knuth likewise credits
John Mauchly for a variant of insertion sort dating to 1946 "in
the first published discussion of computer sorting" (p. 82), but
does not credit anyone for straight insertion sort.
Acknowledgments
Adapted in part from Sorting and Merge Sort by Janet Davis, Samuel A.
Rebelsky, and Jerod Weinman. Used by permission.
Copyright © 2011 Jerod Weinman.
This work is licensed under a Creative Commons
Attribution-Noncommercial-Share Alike 3.0 United States License.
Footnotes:
[1] Knuth, D. (1998). The Art of Computer Programming: Sorting and
Searching (2nd edition). Addison-Wesley.