## Tuesday, November 1, 2011

### Algorithms and Data Structures - Part 2

Topics:

Data Structures
Heap is always a perfectly-balanced fully-populated tree data structure that satisfies the heap property: If B is a child node of A, then key(A) ≥ key(B). This implies that an element with the greatest key is always in the root node, and so such a heap is sometimes called a max-heap. (Alternatively, if the comparison is reversed, the smallest element is always in the root node, which results in a min-heap.) There is no restriction as to how many children each node has in a heap, although in practice each node has at most two. The heap is one maximally-efficient implementation of an abstract data type called a priority queue with a worst-case performance of O(log n) for adding and removing elements and O(1) for getting the maximum or minimum element. Heaps are crucial in several efficient graph algorithms such as Dijkstra's algorithm, and in the sorting algorithm Heapsort.
A heap data structure should not be confused with the heap which is a common name for dynamically allocated memory. The term was originally used only for the data structure. Some early popular languages such as LISP provided dynamic memory allocation using heap data structures, which gave the memory area its name.

The binary tree data structure of the Heap in the following Java source code is organized as an array (see also "Introduction to Algorithms", chapter 6):
```public class BinaryHeap {
private Integer[] mArray;
private int mHeapSize;

public BinaryHeap(int maxSize) {
mArray = new Integer[maxSize];
mHeapSize = 0;
}

int parentIndex(int i) {
return (i + 1) / 2 - 1;
}

int leftChildIndex(int i) {
return 2 * i + 1;
}

int rightChildIndex(int i) {
return 2 * i + 2;
}

// When maxHeapify is called, it is assumed that
// the binary tree rooted at leftChildIndex(i)
// and rightChildIndex(i) are max-heaps.
// Worst-case performance: O(log n).
void maxHeapify(int i) {
int leftChildIndex = leftChildIndex(i);
int rightChildIndex = rightChildIndex(i);
int largestElementIndex;
if (leftChildIndex < mHeapSize &&
mArray[leftChildIndex] > mArray[i]) {
largestElementIndex = leftChildIndex;
} else {
largestElementIndex = i;
}
if (rightChildIndex < mHeapSize &&
mArray[rightChildIndex] >
mArray[largestElementIndex]) {
largestElementIndex = rightChildIndex;
}
if (largestElementIndex != i) {
int tmpValue = mArray[i];
mArray[i] = mArray[largestElementIndex];
mArray[largestElementIndex] = tmpValue;
maxHeapify(largestElementIndex);
}
}

void buildMaxHeap() {
int heapSize = mArray.length;
for (int i = heapSize / 2; i >= 0; i--) {
maxHeapify(i);
}
}

public int max() {
return mArray;
}

// Worst-case performance: O(log n).
public int extractMax() {
int max = mArray;
mArray = mArray[mHeapSize - 1];
mArray[mHeapSize - 1] = null;
mHeapSize--;
maxHeapify(0);
return max;
}

// Worst-case performance: O(log n).
void increaseKey(int i, int newValue)
throws Exception {
if (newValue < mArray[i]) {
throw new Exception("New value is smaller
than current value");
}
mArray[i] = newValue;
while (i > 0 && mArray[parentIndex(i)] <
mArray[i]) {
int tmpValue = mArray[parentIndex(i)];
mArray[parentIndex(i)] = mArray[i];
mArray[i] = tmpValue;
i = parentIndex(i);
}
}

// Worst-case performance: O(log n).
public boolean insert(int value) {
if (mHeapSize < mArray.length) {
mHeapSize++;
mArray[mHeapSize - 1] = value;
try {
increaseKey(mHeapSize - 1, value);
} catch (Exception e) {
return false;
}
return true;
} else {
return false;
}
}

public int size() {
return mHeapSize;
}
}
```

Graphs
A graph data structure consists of a finite (and possibly mutable) set of ordered pairs, called edges, arcs or links, of certain entities called nodes or vertices. As in mathematics, an edge (x,y) is said to point or go from x to y. The nodes may be part of the graph structure, or may be external entities represented by integer indices or references. A graph data structure may also associate to each edge some edge value, such as a symbolic label or a numeric attribute (cost, capacity, length, etc.).
Graph algorithms are a significant field of interest within computer science. Typical higher-level operations associated with graphs are: finding a path between two nodes, like depth-first search and breadth-first search and finding the shortest path from one node to another, like Dijkstra's algorithm.
Graphs are normally represented as adjacency lists or by an adjacency matrix.
Adjacency list (which I used in my examples) are a set of vertices that are stored as records or objects, and every vertex stores a list of adjacent vertices. This data structure allows the storage of additional data on the vertices.
An adjacency matrix is a two-dimensional matrix, in which the rows represent source vertices and columns represent destination vertices. Data on edges and vertices must be stored externally. Only the cost for one edge can be stored between each pair of vertices.

```public Dictionary<Node, NodeAttributes>
Hashtable<Node, NodeAttributes> nodeAttributes =
new Hashtable<Node, NodeAttributes>();

for (Node u : mNodes) {
NodeAttributes attributes = new NodeAttributes();
attributes.color = NodeAttributes.WHITE;
attributes.distance = Integer.MAX_VALUE;
attributes.predecessor = null;
nodeAttributes.put(u, attributes);
}
NodeAttributes sAttributes = nodeAttributes.get(s);
sAttributes.color = NodeAttributes.GRAY;
sAttributes.distance = 0;
sAttributes.predecessor = null;
Queue<Node> queue = new ArrayDeque<Node>();
while (!queue.isEmpty()) {
Node u = queue.poll();
NodeAttributes uAttributes =
nodeAttributes.get(u);
for (Node v : u.getLinks()) {
NodeAttributes vAttributes =
nodeAttributes.get(v);
if (vAttributes.color == NodeAttributes.WHITE) {
vAttributes.color = NodeAttributes.GRAY;
vAttributes.distance =
uAttributes.distance + 1;
vAttributes.predecessor = u;
}
}
Attributes.color = NodeAttributes.BLACK;
}

return nodeAttributes;
}
```

Java source code for depth-first search [O(|V| + |E|)]. For more details see also "Introduction to Algorithms", chapter 22.
```public Dictionary<Node, NodeAttributes>
depthFirstSearch() {
Hashtable<Node, NodeAttributes> nodeAttributes =
new Hashtable<Node, NodeAttributes>();

for (Node u : mNodes) {
NodeAttributes attributes = new NodeAttributes();
attributes.color = NodeAttributes.WHITE;
attributes.predecessor = null;
nodeAttributes.put(u, attributes);
}
mTime = 0;
for (Node u : mNodes) {
if (nodeAttributes.get(u).color ==
NodeAttributes.WHITE) {
dfsVisit(u, nodeAttributes);
}
}
return nodeAttributes;
}

private void dfsVisit(Node u,
Dictionary<Node, NodeAttributes> nodeAttributes) {
NodeAttributes uAttributes = nodeAttributes.get(u);
uAttributes.color = NodeAttributes.GRAY;
mTime++;
uAttributes.startTime = mTime;

for (Node v : u.getLinks()) {
NodeAttributes vAttributes =
nodeAttributes.get(v);
if (vAttributes.color == NodeAttributes.WHITE) {
vAttributes.predecessor = u;
dfsVisit(v, nodeAttributes);
}
}

uAttributes.color = NodeAttributes.BLACK;
mTime++;
uAttributes.finishTime = mTime;
}
```

What is also very interesting is the topological ordering of a directed graph. A topological sort creates a dependency graph. That's very useful e.g. if you want to implement your own dependency injection container that automatically resolves dependencies between objects to create them in the right order.
The topological sort is just a depth-first search within a graph that constructs a linked-list of dependencies. The Java source code containing topological sort is available here in my blog repository. The Java source code also provides an example of Dijkstra's algorithm.

Algorithms
Dynamic programming
Dynamic programming, like the divide-and-conquer method, solves problems by combining the solutions to subproblems. Divide-and-conquer algorithms partition the problem into independent subproblems, solve the subproblems recursively, and then combine their solutions to solve the original problem. In contrast, dynamic programming is applicable when the subproblems are not independent.
A dynamic programming algorithm solves every subproblem just once and then saves its answers in a table, thereby avoiding the work of recomputing the answer every time the subproblem is encountered.
Dynamic programming is typically applied to optimization problems. In such problems there can be many possible solutions. Each solution has a value, and we wish to find a solution with the optimal (minimum or maximum) value.
Top-down dynamic programming simply means storing the results of certain calculations, which are later used again since the completed calculation is a sub-problem of a larger calculation. Bottom-up dynamic programming involves formulating a complex calculation as a recursive series of simpler calculations.

One of the computer science areas where dynamic programming is used a lot is Bioinformatics.
An example is the longest common subsequence algorithm that is used to compare DNA strands. A DNA strand consists of a string of molecules called bases, where the possible bases are adenine, guanine, cytosine and thymine. The goal of comparing two DNA strands is to determine how similar the two strands are, as some measure of how closely related the two organisms are. So this is no exact substring method but one that allows for gaps while comparing the two input DNA strands.

Java source code for the longest common subsequence problem (see also "Introduction to Algorithms", chapter 15)
```public class LongestCommonSubsequence {
char[] mSeqA;
char[] mSeqB;
int[][] mC;
int[][] mB;
private static int UP = 0;
private static int DIAG = 1;
private static int LEFT = 2;

public LongestCommonSubsequence(char[] seqA,
char[] seqB) {
mSeqA = new char[seqA.length + 1];
for (int i = 0; i < seqA.length; i++) {
mSeqA[i+1] = seqA[i];
}
mSeqB = new char[seqB.length + 1];
for (int i = 0; i < seqB.length; i++) {
mSeqB[i+1] = seqB[i];
}
mC = new int[mSeqA.length][mSeqB.length];
mB = new int[mSeqA.length][mSeqB.length];
for (int i = 0; i < mSeqA.length; i++) {
mC[i] = 0;
}
for (int j = 0; j < mSeqB.length; j++) {
mC[j] = 0;
}
}

// O(m + n) -> O(mSeqA.length + mSeqB.length)
public void lcsLength() {
for (int i = 1; i < mSeqA.length; i++) {
for (int j = 1; j < mSeqB.length; j++) {
if (mSeqA[i] == mSeqB[j]) {
mC[i][j] = mC[i-1][j-1] + 1;
mB[i][j] = DIAG;
} else if (mC[i-1][j] >= mC[i][j-1]) {
mC[i][j] = mC[i-1][j];
mB[i][j] = UP;
} else {
mC[i][j] = mC[i][j-1];
mB[i][j] = LEFT;
}
}
}
}

public void backtrack() {
backtrack(mSeqA.length - 1, mSeqB.length - 1);
}

void backtrack(int i, int j) {
if (i == 0 || j == 0) {
return;
}
if (mB[i][j] == DIAG) {
backtrack(i-1, j-1);
System.out.print(mSeqA[i] + ", ");
} else if (mB[i][j] == UP) {
backtrack(i-1, j);
} else {
backtrack(i, j-1);
}
}
}
```

There are also protein sequence comparison methods in Bioinformatics that are more sophisticated than the longest common subsequence, like the Needleman-Wunsch algorithm, the Smith-Waterman algorithm and  HHsearch. My blog repository currently contains implementations of the Needleman-Wunsch algorithm and the Smith-Waterman algorithm.

Memoization is a variation of dynamic programming. A memoized recursive algorithm maintains an entry in a table for the solution to each subproblem. When a subproblem is first encountered during the execution of the recursive algorithm, its solution is computed an then stored in a table. Each subsequent time that the subproblem is encountered, the value stored in the table is simply looked up and returned.

Greedy algorithms always make the choice that looks best at the moment. That is, it makes a locally optimal choice in the hope that this choice will lead to a globally optimal solution.
For example, applying the greedy strategy to the traveling salesman problem yields the following algorithm: "At each stage visit the unvisited city nearest to the current city".
For many optimization problems, using dynamic programming to determine the best choice is overkill, often a more efficient greedy algorithm will do. Greedy algorithms do not always yield optimal solutions, but for many problems they do.
The following problem is about activity scheduling where several competing activities require exclusive use of a common resource. The goal is to select the maximum number of mutually compatible activities. The greedy algorithm solves the problem by sorting the activities by their finish time and then go once through the sorted list of activities and select the next possible activity based on the start time.

Java source code for the activity selector problem (see also "Introduction to Algorithms", chapter 16)
```public class ActivitySelector {
private List<Activity> mActivities =
new ArrayList<Activity>();

public static class Activity implements
Comparable<Activity> {
private int mStartTime;
private int mFinishTime;

public Activity(int startTime, int finishTime) {
mStartTime = startTime;
mFinishTime = finishTime;
}

public int getStartTime() {
return mStartTime;
}

public int getFinishTime() {
return mFinishTime;
}

@Override
public int compareTo(Activity activity) {
if (this.mFinishTime == activity.mFinishTime) {
return 0;
} else if (this.mFinishTime >
activity.mFinishTime) {
return 1;
} else {
return -1;
}
}
}

}

public List<Activity> greedyActivitySelector() {
Collections.sort(mActivities);
List<Activity> activities =
new ArrayList<Activity>();
int i = 0;
for (int m = 1; m < mActivities.size(); m++) {
if (mActivities.get(m).getStartTime() >=
mActivities.get(i).mFinishTime) {
i = m;
}
}
return activities;
}
}
```

Greedy algorithms versus dynamic programming
The 0-1 knapsack problem is the best way to compare the greedy strategy and dynamic programming.
A thief robbing a store finds n items; the i-th item is worth vi dollars and weighs wi pounds, where vi and wi are integers. He wants to take as valuable a load as possible, but he can carry at most W pounds in his knapsack for some integer W. Which items should he take? (This is called the 0-1 knapsack problem because each item must either be taken or left behind; the thief cannot take a fractional amount of an item.)
While the 0-1 knapsack problem is only solvable using dynamic programming but not with a greedy algorithm, the fractional knapsack problem is also solvable using a greedy strategy.
To solve the fractional knapsack problem, you first have to compute the value per pound for each item. The thief then begins by taking as much as possible of the item with the greatest value per pound. If the supply of that item is exhausted, the thief takes as much as possible of the item with the next greatest value per pound and so forth.

String matching algorithms
See this tutorial with great explanations and examples by Christian Charras and Thierry Lecroq.