## Search This Blog

Simplicity is the ultimate sophistication.” — Leonardo da Vinci
Contact me: sreramk360@gmail.com

## Sunday, 19 April 2015

### Determine the average run-time of insertion sort algorithm

Insertion sorting algorithm analysis
This article shows the analysis of insertion sorting algorithm and a program to implement insertion sorting algorithm. Link list implementation is used for the implementation.

This algorithm does comparison based sorting which involves $N^2$ iterations for the worse case and $N$ iterations if the list is already in sorted order. For the reader's convenience, a quick description of the algorithm is provided here.

Let there be an unsorted list $L$ which has its elements read linearly, starting from one of its edges. Another list $L'$ (which is assumed to be always sorted)  is generated upon reading the first element from the list $L$. For each element read from $L$ (let's consider the variable $x$ to store the read value) it's inserted into the sorted list $L'$ in such a position that $L'$ remains sorted  after the insertion (note that initially $L'$ was empty). The position in $L'$ to which the variable $x$ has to be inserted is determined by a linear search through $L'$ starting from one of its edges and comparing the arithmetic priority of $x$ with the succeeding element.

This algorithm explains what basically happens in an insertion sorting algorithm. But it states that the searching process could begin from either of the ends. On the whole, this wouldn't have any influence in the run time, but certain parameters change. For example, if the algorithm is made to arrange a set of numbers in ascending order, and the search iterations begins with the beginning of the list $L'$ during any point in time, then passing on a list that has its elements arranged in descending order takes $O(1)$ to be sorted in acceding order whereas passing on a list which is initially in acceding order would take $O(N^2)$ time to get sorted. Searching from the beginning of the list $L'$ and moving to its end is same as another algorithm which reverses the order of elements in $L$ and then sorts them in ascending order but have its search iteration start from the last element in list $L'$ and move towards the first. So minor changes such as these are not going to count, especially if we do pure theoretical analysis on this topic, rather than provide an optimal practical working modal.

A link to inserting sorting implementation C program.

Note, throughout this analysis, unless stated, the time complexity of 'list insertion' operation and 'list move' operation is always considered to be $O(1)$.

list insertion - inserts an element to the list
list move - moves one element across the list and returns the new element

Now lets proceed with the analysis. Throughout this analysis lets assume the algorithm sorts an unsorted list in acceding order. Its not difficult to show that the worse case in this sorting algorithm is $O(N^2)$. Let $L$ be sorted in descending order. Then, sorting it in acceding order using insertion sorting algorithm would obviously take $\frac{N(N+1)}{2}$ iterations. As the algorithm reads an element from $L$, it inserts it to the list $L'$. So after each element read from $L$, the size of $L'$ increases by one. On the worse case, the algorithm which searches for the optimal position for the element $x$ should terminate only after the maximum possible iterations are over. So for each element read from $L$, the number of iterations for searching the position of that element in $L'$ should increase by exactly one, compared to the number of search iterations when the previous element was read. In the worse case, its obvious that each element in $L'$ is to be read before inserting the element. If this has to occur then each element that's read from $L$ should be smaller than every other element in $L'$ and hence, $L$ should be sorted in descending order for the worst case to arise.

Now, If we have $L$ sorted in acceding order, then for each element read form $L$ only one search iteration is required to determine the position of that element in $L'$. Hence, exactly $N$ iterations would occur. To standardize this analysis and to avoid confusion, throughout this analysis the search is assumed to propagate from the end of the list $L'$ rather than from its beginning, and the algorithm sorts in acceding order .

I. The average case analysis

One popular way to proceed is to show that number of iterations required to sort an unsorted list and the reverse of it (where the left side elements are brought to their complementary elements in the right side and visa verse) is $\frac{N(N+1)}{2}$. Every possible combination of $N$ elements  (assuming every element to be unique) should have a reverse of itself. So if $x$ denotes all the possible combinations for $N$ numbers, then the average should be $\frac{\frac{\frac{x}{2}N(N+1)}{2}}{x} = \frac{N(N+1)}{4}$ where $x\to\infty$.

II. Now lets prove that $f(L) + f(L_{rev}) = \frac{N(N+1)}{2}$ where $f(L)$ returns the number iterations required to sort the list L using insertion sorting algorithm ($L_{rev}$ is the reverse of the list $L$).

We proceed with this proof with the assumption that no value in the list $L$ is repeated. Let the position of $x\in L$ be represented as $p(L, x)$ w.r.t the origin of the list $L$. Now each time the algorithm inserts an element $x$ in $L'$, a few alterations to the arrangement of the other elements in $L'$ is made; if the element $x$ is placed at a position $p_1$ in the list $L'$, then every element that's present after the position $p_1$ shifts one place towards right. For a linked list implementation, the insertion operation need not require any shift in the other elements in the list for the operation to take place. But still, the position of the elements that lie after the position $p_1$ gets incremented by one, if we consider the beginning of the list $L'$ to be the origin. There are two ways to count the number of iterations. One way is to count the number of search operations for the insertion of each element (which seems obvious enough) the other is to count the number of places each element gets shifted after being inserted to the list $L'$. Both these values are going to remain the same. If the search iteration reaches a particular element $y$ in $L'$, then it shows that every position that succeeds the element $y$ is not $p_1$ and that position precedes the element $y$.

So the number of search iterations is same as the number of elements moved by one step towards right. In our analysis, we consider the number of steps each element moves individually. We would generally be able to determine this information only after all the sorting iterations are over. During each iteration, we have the number of elements that moved by one step, rather than the number of steps moved by each element. Lets say we have an element $x$. The number of steps moved by this element after it gets inserted to the list $L'$ depends upon the number of elements that are lesser than the element $x$ and succeeds this element in the list $L$.

Seeing the summation we have to prove, the most common way to proceed will be to find the expression that gives the position where this element $x$ gets inserted in the list $L'$ and the the position where this element ends up after all the sorting iterations and obtain their difference and sum them up this displacement with all the other possible displacements that could arise in the list. But on a first glance, this method seems to be inappropriate because, the information required to determine the insertion position of the element $x$ in $L'$ would be a function of the values of elements that are in $L'$ at that time. One better way will be to determine the sum of number of places the element $x\in L$ gets moved after being inserted to $L'$ and the sum of number of places $y\in L_{rev}$ gets moved after being inserted to $L'_{rev}$ (consider the values are equal i.e., $x=y$; also remember that we are stating this proof with a fundamental assumption that no values are repeated in the list $L$. So $x$ and $y$ should be the same value in different positions in both the lists $L$ and $L_{rev}$). This summation exactly gives us the number of elements that are preceding the element $x$. Again, we have this as a rule rather than as a value. Determining this value will again require a function of all the elements in the list $L$. But fortunately we may proceed with the proof with just this piece of information.

Let's show that the summation of the number of elements succeeding $x\in L$ and lesser than $x$, and summation of the number of elements succeeding $y\in L_{rev}$ and lesser than $y$, is same as the number of elements preceding the element $x$ in $L'$ (note that $L' = L'_{rev}$ after all sorting iterations are over). The number changes in the position of an element $x$ in $L'$ after it gets inserted, and through the iteration process is equal to the number of elements that succeeds the position of the element $x$ in $L$ and lesser than $x$; because, this algorithm does not involve a deletion operation. So the distance of the element $x$ can only keep increasing from the beginning of $L'$ rather than decrease; and at times, it could remain same. So if there is a change in the position of $x$ observed in the list $L'$ through the sorting iterations after the insertion, then it obviously shows that there is an element lesser than $x$ which gets placed before it. When the list $L$ gets reversed to $L_{rev}$ across its center, then change of positions of each element takes place following the given expression,

Let the position of $x$ in $L$ be $p_1$. Then $p'_1$ which shows the position of $x$ in $L_{rev}$ can be expressed as,
$p'_1 = \frac{S}{2} - (p_1-\frac{S}{2}) = p'_1 = S-p_1$
Where, $S$ is the size of the list $L$. Also, after inversion the elements that originally preceded $x$ will be placed after $x$. Elements that lie before $x$ in $L'$ immediately after the insertion process, and the ones that are to be placed before $x$ in the subsequent insertion process, lie before and after $x$ in $L$ respectively. But after inversion, the order changes such that the converse holds true. So summing up the elements that aught to precede $x$ while sorting $L$ and $L_{rev}$ sum up to all the elements that aught to precede $x$ after the sorting process is over. Lets express the number of step movements shown by $x$ in $L'$ and $y$ in $L'_{rev}$ as $step(x_i, L)$ and $step(y_j, L_{rev})$ respectively (where $i$ and $j$ represents the number of iterations over and $L$ i.e., iterations of the main loop). Then we get,
$step(x_i, L) + step(y_j,L_{rev}) = N(x, L)$
Where, $j = S-i$ as $j$ represents the position of x_i in $L_{rev}$ and $N(x,L)$ represents the number of elements preceding $x$ in $L'$ and $step(x, L)$ represents the number of steps moved by element $x$ after all the iteration processes are over.

So finally we have, $step(x, L) + step(y,L_{rev})$ expressing the number of elements before $x$ in the sorted order. So now our summation to determine the number of iterations is,
$\sum_{i=1}^{S}{step(x_i, L)}$                         (1)
Now the summation for the reversed list can be expressed as,
$\sum_{i=1}^{S}{step(y,L_{rev})}$                    (2)
The final summation of (1) and (2) can be said to be similar to the summation $f(L) + f(L_{rev})$. Adding (1) and (2) we get,
$\sum_{i=1}^{S}{step(x, L) + step(y,L_{rev})}$        (3)
(3) can be written as,
$\sum_{i=1}^{S}{N(x_i, L)}$                        (4)

Now the summation (3) reduces to $\sum_{i=1}^{S}{i} = \frac{S(S+1)}{2}$ or
$\frac{N(N+1)}{2}$ if $N$ is the size of the input (i.e., the list $L$).
So if we proceed with the method stated in (I), we may arrive at our result!

III Conclusion
We have proceeded with the above proof without actually expressing all the possible combinations explicitly, rather we used information provided by the properties of the insertion sorting algorithm which lead us to symmetries in certain places and we have taken advantage of those symmetries to prove the result. In most cases, proving algorithms rigorously involve making use of these symmetries in the information provided (and sometimes it would be the information we determine with other given information).

IV My previous mistake

While proving this, I had made a mistake when I first started proceeding with the proof. The mistake was, I had tried to prove it by directly summing up the above summation rather than take advantage of the symmetries that were observable from the given algorithm which holds as our given information. Also, I had made a mistake determining the step process; I had taken the displacement to be the number of steps it covers from its position in $L$ to the position in the sorted $L'$. But this is not what actually happens. The element was initially inserted in a location in $L'$ such that the sorted nature of $L'$ does not get hindered. But, the below analysis holds only when $x$ is inserted in the last location of $L'$.

V The mistake

Let the position of $x$, where $x\in L$ to be $p(x,L)$ and the position of $x_{rev}\in L_{rev}$ to be $p(x_{rev}, L_{rev})$. And its obvious that if $x\in L$ then $x\in L'$.

let,   $|f(x, L) - f(x, L')| = l_x$                                        (1)
Its not difficult to show that after sorting a list $L$, each element (let it be $x$) in $L$ moves $l_x$ number of steps from its position in $L$. Let's see how the above statement holds. For each iteration involving reading of an element from $L$ causes it to be placed in an appropriate position in $L'$. But it might not be same as the position $f(x, L')$ because, at that iteration the set $L'$ would be missing many other elements that could hinder the final position of $x$ in $L'$. We may either observe the change in position of the element $x_i$ relative to its position in $L$ or from the either ends of the list $L'$. Both hold to be appropriate when we are trying to analyze the algorithm rather than provide a working practical model for it. But because the search process searches for the position in $L'$ starting from its end, lets choose the beginning of the list $L'$ as our reference point (this reason for this choice would soon make sense as we proceed with the proof). Now for a new element $x_i$ being added to the list $L'$, a search process begins from the end of $L'$ and moves towards its beginning.

If the newer element is to be place between the position of $x_j$ and the end of the list $L'$, then the search process wouldn't have reached $x_j$. And as our reference point is the beginning of the list $L'$, we may say that the position of $x_j$ is not changed. But if $x_i$ is placed between the beginning of $L'$ and $x_j$ then the position of $x_j$ is said to increase by one unit. In all iterations after $x_j$ gets added to the list, we may not observe the movement of the element $x_j$ towards the beginning of the list $L'$, rather we only observe its movement towards the end of the list. So if the number of steps moved by $x_j$ is $k_{x_j}$, then it means $k_{x_j}$ elements are added before $x_j$'s position in $L'$. After the last element of $L$ gets placed in the list $L'$, the position of $x_j$ is exactly said to be $f(x_j, L')$. Instead of counting the number of elements searched for, lets consider the number of steps moved by each element to reach its final position. So the run time for sorting an input of size $N$ is considered to be $i\in[1, N], i\in\mathbb{Z}^+, \sum{l_{x_i}}$.

Now on proving that,
$\sum{l_{x_i}} + \sum{l_{x'_j}} = \frac{N(N+1)}{2}, x_i\in L, x'_j\in L_{rev}$     (2)
we could finally show that the average  run time of insertion sorting algorithm is $O(N^2)$. For any value that gets displaced from the position $p_1$ to $p_2$ in list $L$, the same element gets displaced form the position $p'_1$ to $p_2$ in the list $L_{rev}$. Because the elements in $L$ are reversed compared to the elements in $L_{rev}$, every element other than the middle element in the list $L$ changes position in $L_{rev}$.

Let $S$ be the size of the list $L$. Then, the position $p_1$ w.r.t the position $\frac{S}{2}$ is,

$p_1 - \frac{S}{2}$                                                              (3)

(3) expresses the size for the list $L$. For List $L_{rev}$, the size remains same as that of $L$ if all the sorting iterations are over. It can be simply expressed as, $S_{rev} = S$.

$p'_1 - \frac{S_{rev}}{2}$                                                     (4)

Because the position of $x\in L$ reverses itself w.r.t to the center of the list, if the position of $x$ was $p_1$, the the position
$p'_1 = \frac{S}{2} - (p_1-\frac{S}{2}) = p'_1 = S-p_1$                     (5)
So the number of iterations required to move that particular element form $p'_1$ to $p_2$ and from $p_1$ to $p_2$ can be expressed as,
$|p_2 - p'_1| + |p_2 - p_1|$                                     (6)
Now, our obvious step is going to be to substitute the values of $p'_1$ in the equation (6). On doing so we get (6) same as,
$|p_1 + p_2 - S|+ |p_2-p_1|$                                      (7)

We can evaluate the summation (7) by two cases. In the first, both terms are positive, and in the second, both are of opposite signs. So for the first case we get,

$p_1+p_2-S+p_2-p_1 = 2p_2-S$                                    (8)

Now lets consider the second case. In this case, both are of opposite signs. From this case, we either get $S-2p_1$ or $2p_1 - S$. But both are treated same.  Now lets decide on exactly what happens when we sum them up. Because the two sub-cases of the second case can be treated equally,

This article has my analysis of insertion sorting algorithm and a program to implement insertion sorting algorithm. Link list implementation is used for the implementation.