webdevjeff.us

Web Developer Jeff George

Sorting It All Out

Using Ruby's #sort and #sort_by methods on arrays and hashes

Posted Oct. 11, 2015

Like all collection classes in Ruby, the Array and Hash classes include the Enumerable module. Enumerable bestows upon classes that include it a wide selection of methods which iterate over collections in many useful ways. In order to include the Enumerable module and use its methods, a collection class must include its own version of the #each method, which provides the Enumerable methods with a basic strategy for iterating over collection objects of that class. In order to function at all, every method in Enumerable must call the class’s #each method.

One of the most common operations performed on collections is to sort them according to various criteria. This commonly done using the #sort and #sort_by methods, which are found in the Enumerable module. This can be confusing, because when you are looking for a method to solve a problem, you may not find it among the methods listed in the Ruby docs for the class you expect. On first glance, Hash seems to have no sorting methods, but in fact, Hash (like collection classes Range and Set) uses the base Enumerable versions of #sort and #sort_by—you just have to go to the Enumerable page to find them. Array, on the other hand, has its own custom versions of #sort and #sort_by, which do appear on the list of Array methods on Ruby-doc.org. The take-away from this is that when looking for a method to solve a problem, it’s important to also check the list for any module the class includes.

Sorting arrays

Although the sort and #sort_by methods for the Array class override the default Enumerable versions, we needn’t worry about under-the-hood stuff like that here. For our purposes, we can call them on an array object and get a perfectly usable return no matter where the methods themselves are stored. In its simplest form, Array#sort returns a new array, rearranged in alphanumeric order, while leaving the original array unaltered:

This works fine, so long as all of the items in the array are of the same type, and can be compared directly. That is, #sort works as long as the items in the collection are either all strings or all numbers (#sort can handle a mix of integers and floats, though). If you have a mixture of numbers and strings in your array, #sort will raise an ArgumentError.

You can #sort an array of arrays, but #sort only considers the value of the first item in each member array when sorting. The other items in the member arrays are ignored. This means that #sort will leave [3, 6] and [3, 2] in that order, if that’s how they appear in the original array. But it also means that member arrays do not have to be of the same length, and that only the first items in each member array need to be of matching type. The second and subsequent items can be a mixture of strings, numbers, or other objects. Here are some examples of sorted arrays:

Other sorts of #sorts

If you need to sort an array by some criterion other than alphanumeric order—perhaps by length—you’ll need to tell #sort how you want to compare the items in the array by passing a code block to the method that defines the comparison for your sort. The syntax for the code block is { |item_1, item_2| item_1_code <=> item_2_code }. In this example, we’re calling #length on each element for the sort:

As you can imagine, the code block for a #sort can become unweildy very quickly. For non-alphanumeric sorts, you might prefer the #sort_by method, which lets you define how to evaluate each item to be sorted just once, and eliminates the spaceship comparison operator. The syntax for the #sort_by code block is simply { |item| item_code }, so the block for our #length-based sort becomes { |a| a.size }. If you’re calling a single method on your objects, as we are here, there’s also a Ruby shorthand syntax to make it even shorter. You can use an ampersand to represent the object, and the symbol form of the method name: (&:size). Here we see both versions of #sort_by in action:

Mixed #sorts

If you need to sort an array with members that are not comparable—a mix of strings and numbers, most likely—a simple #sort will fail, raising an ArgumentError. You can use #sort_by in conjunction with #to_s to sort a mixed array, but sorting numbers as strings will put 25 before 8; alphanumeric sorting has no awareness of place value. To properly perform an alphanumeric sort on an array of mixed numbers and strings, you can use #partition to sift the numbers and strings into two separate sub-arrays, sort each sub-array separately, and then use #flatten to turn them back into a single array.

Note that the first example works because the array contains only single-digit numbers. When we make one of the numbers three digits long, in the second example, sorting numbers as strings no longer works—7 should come before 356, but it doesn’t. Negative numbers and floats would further confuse a #sort by strings. The third example is more complex, but it sorts numbers and strings correctly according to alphanumeric rules, no matter how many digits are in the numbers. As the sorting criterion for #partition, we pass a code block that checks if each item is a member of the class Numeric; this catches all numbers of all classes, including Floats, Fixnums, and Bignums. The first #sort sorts only the numbers, and the second sorts only the strings. We turn it all back into a single array by calling #flatten at the end.

The algorithm above is set up to sort the existing array in place. If you need a non-destructive version, you can write a method like this:

This #alphanumeric_sort method begins with the Object#clone method, which creates an identical but separate array object to work on and return—that’s why it’s non-destructive. Not only does it handle positive and negative integers as well as floats, but because it’s non-destructive, we can use #sort_by and #downcase to sort the strings sub-array. This makes the method sort capital and lower-case letters together, like a dictionary. Note that I switched to a multi-line, do/end syntax for the #partition call simply to shorten the lines and avoid a confusing line-wrap on this web page—there’s no reason not to use the single-line, curly-braces syntax for that code block in a real text editor.

Sorting hashes

In some ways, sorting hashes with #sort is trickier than arrays, because each member of a hash is a pair of values—a matched key and value. By default, #sort reorders the members of a hash alphanumerically, by their keys, like this:

Not only are hashes sorted by their keys, but they aren’t returned in a hash! Instead, they come back as a two-dimensional array, because that’s how #sort (and all the rest of the Enumerable methods) handles them. We can easily turn the returned array back into a hash by adding #to_h after the #sort call, though:

As with arrays, #sort works on hashes so long as the keys are all the same data type. Usually, the keys in a hash will all be symbols, or all be strings; #sort also works if they are all numbers, too, though this isn’t very common in real programs. Here we see a hash of the households on Clue Street, with house numbers as keys and residents’ surnames as values:

It’s pretty unlikely that you’ll have a hash with keys of mixed data types, but if you do, a simple #sort won’t work on it—it will raise the same ArgumentError we got above, with arrays. As long as the keys are what you want to sort by, you can use the same techniques I described for arrays. If you need to sort the values, however, the easiest way to do that is with #sort_by.

Sorting by hash values

The easiest way to sort a hash by the value in its key-value pairs is to use the #sort_by method. To make #sort_by use the values to sort the pairs, use this code block: { |key,value| value }. This tells #sort_by to perform its default alphanumeric sort on the hash, based on the values in the key-value pairs instead of the keys. If you have a different search criterion, such as the length of the values, you can add the appropriate method to the code block as well. In this example, I also added #to_h to turn the two-dimensional array returned by #sort_by back into a hash, as I did in the previous example.

Stupid #sort tricks

Since we’re on the subject of sorting collections, there are two Array methods that I think of as being connected with #sort, at least conceptually if not technically. These are Array#reverse and Array#shuffle. Each of these methods does to an array exactly what you expect it to: #reverse reverses the order, and #shuffle randomizes the order. Here they are in action:

If you have an unsorted set, and you need to access it from largest to smallest, there’s no reason not to put the #sort and the #reverse on the same line—doing so is more concise and more clear than writing the two calls on separate lines of code. Of course, if you’re shuffling your array, there’s no reason sort it first. I include it here because I think of #sort and #shuffle as being opposites. If you want to assign your students to study groups in alphabetic order, sort them before you allocate them; if you want them assigned randomly, shuffle them first.

#reverse and #shuffle are strictly Array methods, not Enumerable methods, so you can’t use them directly on hashes, sets, or ranges. But it’s not hard to use #to_a to convert a hash or set into an array in order to reverse it or shuffle, and then convert it back to its original class with #to_h or #to_s. Since a range is an ascending sequence of integers by definition, there’s not much point in reversing or shuffling it. You might use a range in conjunction with a loop to populate a large array of numbers without typing in every member, and then reverse or shuffle it, though.