Project 4: Merge Sort, Heaps, and HashMaps

(Return to homepage)

In Project 4, you'll implement Merge Sort, plus a selection of data structures: Heaps, Priority Queues (bonus!), and HashMaps.

Merge Sort is a well-known, optimal O(n log n) sort that uses the merge technique to join two sorted lists in O(n) time. Merge Sort builds on the observation that an empty or single-element list is already sorted. Empty and single-element lists therefore form the base case of the recursion, and the recursive case merges the results of the two recursive calls, propagating larger and larger sorted lists up the stack until the whole list is sorted.
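To make the recursion concrete, here is a minimal sketch of the technique (the signatures and names here are illustrative; the starter code's method stubs may differ):

```java
import java.util.ArrayList;
import java.util.List;

public class MergeSortSketch {
    // Merges two already-sorted lists into one sorted list in O(n) time.
    static <T extends Comparable<T>> List<T> merge(List<T> left, List<T> right) {
        List<T> result = new ArrayList<>();
        int i = 0, j = 0;
        while (i < left.size() && j < right.size()) {
            if (left.get(i).compareTo(right.get(j)) <= 0) {
                result.add(left.get(i++));
            } else {
                result.add(right.get(j++));
            }
        }
        // At most one of these loops adds any leftover elements.
        while (i < left.size())  result.add(left.get(i++));
        while (j < right.size()) result.add(right.get(j++));
        return result;
    }

    // Base case: empty or single-element lists are already sorted.
    // Recursive case: sort each half, then merge the two sorted halves.
    static <T extends Comparable<T>> List<T> mergeSort(List<T> list) {
        if (list.size() <= 1) {
            return list;
        }
        int mid = list.size() / 2;
        List<T> left  = mergeSort(new ArrayList<>(list.subList(0, mid)));
        List<T> right = mergeSort(new ArrayList<>(list.subList(mid, list.size())));
        return merge(left, right);
    }
}
```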

Heaps are a data structure optimized for storing and retrieving minimum values (or maximum values, for a max-heap). A heap can be thought of as a binary tree with two properties: every parent node is less than or equal to its children, and the tree is complete. That is, when elements are appended to the heap, they are inserted in the deepest row of the tree, at the leftmost open position in that row.
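For a quick feel for the min-first behavior you'll be implementing, Java's built-in java.util.PriorityQueue (itself backed by a binary min-heap) makes a handy reference:

```java
import java.util.PriorityQueue;

public class HeapDemo {
    public static void main(String[] args) {
        // java.util.PriorityQueue is backed by a binary min-heap.
        PriorityQueue<Integer> heap = new PriorityQueue<>();
        heap.add(5);
        heap.add(1);
        heap.add(3);
        // Elements come out in ascending order regardless of insertion order.
        System.out.println(heap.poll()); // 1
        System.out.println(heap.poll()); // 3
        System.out.println(heap.poll()); // 5
    }
}
```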

HashMaps are ideal for associating data, or in other words, creating lookup tables from data to data. HashMaps are efficient due to hashing functions that convert arbitrary input to deterministic but uniformly distributed integer outputs. As integer indexing in an array is O(1), a HashMap with an O(1) hashing function enables us to look up values associated with keys in O(1) time. However, because hashing has collisions that can cause distinct keys to map to the same index (bucket), we store an additional Linked List for every bucket in the array, which is appended to whenever keys share an index (without necessarily being equal keys).
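As a toy illustration of how keys land in buckets (the bucket count and the Math.floorMod guard against negative hash codes are illustrative, not the starter code's):

```java
public class BucketDemo {
    public static void main(String[] args) {
        int buckets = 8; // a hypothetical table size
        String[] keys = { "apple", "banana", "cherry", "date" };
        for (String key : keys) {
            // floorMod keeps the index non-negative even for negative hashCodes.
            int bucket = Math.floorMod(key.hashCode(), buckets);
            System.out.println(key + " -> bucket " + bucket);
        }
    }
}
```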

Java determines the hash of each Object using its public int hashCode() method. Java provides hashCode() for free for all library types; for custom types, you must override the hashCode() function just as you would override toString().

As a general rule of thumb, when two Objects are .equals() to each other, they should have the same hashCode(). However, it is fine for two Objects that are not .equals() to each other to have the same hashCode() as well. What matters is that the hashCode() values of unequal Objects are sufficiently random so that their table indices (buckets) will be random, and their entries are spread across different buckets in the table.
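For example, a hypothetical custom type can satisfy this contract by hashing exactly the fields that equals() compares:

```java
import java.util.Objects;

// A hypothetical custom type demonstrating the equals()/hashCode() contract.
public class Point {
    private final int x;
    private final int y;

    public Point(int x, int y) {
        this.x = x;
        this.y = y;
    }

    @Override
    public boolean equals(Object other) {
        if (!(other instanceof Point)) {
            return false;
        }
        Point p = (Point) other;
        return x == p.x && y == p.y;
    }

    // Equal Points must produce equal hash codes, so hash the same
    // fields that equals() compares.
    @Override
    public int hashCode() {
        return Objects.hash(x, y);
    }
}
```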

Time for battle. Get the Project 4 starter code here! (JUnit dependencies included)

The project code should work out-of-the-box for IntelliJ (running and debugging), but if you're having trouble getting it working, try these IntelliJ setup instructions.

Merge Sort

Your implementation of Merge Sort will be a two-step process: first, you'll implement merging, and then you'll call the merge function recursively to form Merge Sort. Both static methods below are in the algorithms.Sorting class.

Heaps

The heaps.MinHeap class contains method stubs for a minimum Heap data structure. The Heap will be backed by an ArrayList<T>, storing the root node at index 1. The mathematical formulas for computing child and parent indices are provided for free.
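For reference, the standard index arithmetic for a heap with its root at index 1 looks like this (a sketch only; use the formulas provided in the starter code):

```java
public class HeapIndexMath {
    // Standard formulas when the root is stored at index 1.
    static int parent(int i)     { return i / 2; }       // integer division
    static int leftChild(int i)  { return 2 * i; }
    static int rightChild(int i) { return 2 * i + 1; }

    public static void main(String[] args) {
        // For the node at index 3: parent 1, children 6 and 7.
        System.out.println(parent(3) + " " + leftChild(3) + " " + rightChild(3));
    }
}
```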

Heaps are excellent backing structures for Priority Queues. The heaps.PriorityQueue<T> class is a bonus class, backed by a MinHeap, which you may optionally implement for up to 10 bonus points on the project.

HashMaps

The hashing.HashMap<K, V> class contains method stubs for a HashMap data structure that maps keys of type K to values of type V. The HashMap will use the hashing-with-buckets technique, storing an array of LinkedList<Pair<K, V>>. Keys that hash to the same bucket will store their values as individual Pair<K, V> elements on the Linked List at that array index (bucket).

To determine which bucket a key belongs to, you can use the provided int getSlot(K) method, which computes hashCode() % table.length. Following Java's convention, a single null key is permitted; it always hashes to slot 0.
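Putting it together, a lookup might look roughly like the fragment below. This is a sketch meant to live inside the HashMap class, not the required implementation; the table field and the Pair accessors getKey()/getValue() are assumptions, and the starter code's names may differ.

```java
// Sketch of get(), assuming the provided getSlot(K) and a Pair<K, V>
// type with getKey()/getValue() accessors (hypothetical names).
public V get(K key) {
    int slot = getSlot(key);                     // which bucket?
    for (Pair<K, V> entry : table[slot]) {       // walk that bucket's chain
        // Objects.equals handles the single permitted null key.
        if (java.util.Objects.equals(entry.getKey(), key)) {
            return entry.getValue();
        }
    }
    return null;                                 // key absent
}
```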

To avoid performance degradation when the HashMap is too full, the HashMap has a load factor that, when surpassed, causes the HashMap to expand. The current load factor is computed as (number of key-value pairs) / (array length). When the load factor exceeds the hardcoded double LOAD_FACTOR = 0.75, the HashMap should be expanded by relocating all the key-value pairs to their new buckets.
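An expansion pass might be sketched as follows. Again, this is illustrative and assumes the same hypothetical fields as the lookup sketch above; the key point is that every pair must be rehashed, since a key's slot depends on table.length.

```java
// Sketch of expansion, triggered when size / table.length > LOAD_FACTOR.
// Doubles the table and rehashes every pair into its new bucket.
private void expand() {
    LinkedList<Pair<K, V>>[] old = table;
    table = new LinkedList[old.length * 2];  // unchecked, as usual for generic arrays
    for (int i = 0; i < table.length; i++) {
        table[i] = new LinkedList<>();
    }
    for (LinkedList<Pair<K, V>> bucket : old) {
        for (Pair<K, V> entry : bucket) {
            // getSlot now uses the doubled table.length, so slots change.
            table[getSlot(entry.getKey())].add(entry);
        }
    }
}
```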

Running Your Code + Bonus Points

As with the previous projects, this project contains a full test suite (see the test package) that will evaluate your code.

As usual, your project score is based on the percentage of tests you pass, contingent on the correctness of your submission. The Priority Queue class is a bonus and can be implemented for an additional 10 bonus points on the project, for a total possible score of 110. You are eligible to receive bonus points even if other parts of the project are incomplete.

A run script for the entire project test suite is included as run.bat (Windows) or run.sh (Mac/Linux). This will compile your entire project and produce a run report with your estimated grade, including bonus points. Your entire project must compile for the script to work, and your entire project must compile to receive a grade after submission.

You are welcome to modify the tests for debugging purposes, though keep in mind your project will be graded using an unmodified copy of the test suite.

Submission Instructions

See the submissions page for details. Have questions or need help? Post on Piazza!

Critical Analysis Questions

These questions aren't graded, but are similar to those you may see on an exam. You are welcome to discuss these with classmates offline or on Piazza.

(Return to homepage)