Sorting by enum values in MongoDB (and Java Driver)


You could think that sorting by an enum value is a routine operation in MongoDB, but if we are interested in some custom order, suddenly it’s not so trivial anymore. In SQL databases, the information about how enum values compare to each other is stored in the schema. Since there are no schemas in MongoDB, we have to find other solutions.

Let’s start with some background

I had a collection with ~10 million documents and I wanted to present the data in a table with paging functionality. Moreover, this table had sortable fields, and one of those was of an enum type.

Obviously, I didn’t want to fetch all 10 million documents just to display 10 of them. Paging itself wasn’t hard to implement. The find() operation in Java Mongo driver returns the object implementing the Iterable interface, so it’s a great fit for tools we have in Java.

This, however, made the easiest approach impossible, which was to sort the data within the application. It was necessary to involve MongoDB in the process of sorting, but since MongoDB doesn’t have any schema, I had to pass the order of enum values in some other way.

Serialize enums as numbers

The obvious choice would be to serialize the enum values as numbers. Unfortunately, this comes with a disadvantage if we would need to modify the enum hierarchy, e.g. by adding a field in the middle. To do that, we would have to migrate the whole collection and change the values accordingly.

Use aggregation framework

A different solution is to use the aggregation framework. It is very well explained in the blog post by Asya Kamsky called Stupid tricks with MongoDB – Using 3.4 Aggregation to Return Documents in Same Order as “$in” Expression

The author uses the $addFields stage, where we could add a numeric field to our results representing the enum value in the hierarchy. So for such documents:

{ "_id" : ObjectId(...), "enumField" : "FirstValue" },
{ "_id" : ObjectId(...), "enumField" : "LastValue" },
{ "_id" : ObjectId(...), "enumField" : "MiddleValue" },
{ "_id" : ObjectId(...), "enumField" : "FirstValue" }

we would have to run the following query:

enumOrder = [ "FirstValue", "MiddleValue", "LastValue" ]

m = { "$match" : { "enumField" : { "$in" : enumOrder } } };
a = { "$addFields" : { "__order" : { "$indexOfArray" : [ enumOrder, "$enumField" ] } } };
s = { "$sort" : { "__order" : 1 } };
db.people.aggregate( [ m, a, s ] );

As an intermediate step, the query above adds the __order field to our data, so it looks like this:

{ "_id" : ObjectId(...), "enumField" : "FirstValue", "__order": 1 },
{ "_id" : ObjectId(...), "enumField" : "LastValue", "__order": 3 }
{ "_id" : ObjectId(...), "enumField" : "MiddleValue", "__order": 2 }
{ "_id" : ObjectId(...), "enumField" : "FirstValue", "__order": 1 }

and the last step simply sorts by the added numeric field.

This is better from the previous approach, because the order is passed in the query itself, so potential change in code does not require any data migration.

Unfortunately, for large amount of documents this approach haven’t performed well.

Java driver to the rescue – the iterator chain approach

My case was quite specific:

  • The enum type I’ve been sorting by had just 3 possible values.
  • Most of the documents had the last enum value in order.
  • I had a Java app, so the solution did not necessarily have to be 100% within the MongoDB DSL.

Because of that, I could make 3 separate queries, each with a different value of the enum, and concatenate them into one Iterable. This could be achieved e. g. by Iterables.concat from Guava.

So … it would look like somewhat like this:

FindIterable<Document> iterable1 = collection.find(Filters.eq("enumValue", EnumType.FIRST_VALUE.toString()));
FindIterable&lt;Document&gt; iterable2 = collection.find(Filters.eq("enumValue", EnumType.MIDDLE_VALUE.toString()));
FindIterable&lt;Document&gt; iterable3 = collection.find(Filters.eq("enumValue", EnumType.LAST_VALUE.toString()));

Iterable&lt;Document&gt; result = Iterables.concat(iterable1, iterable2, iterable3);</code></pre>

You can see the full example here:

When we run the program, we can see in the logs, that the queries aren’t being run immediately, but instead, the first query is executed, the data is fetched, and only then the next query is executed in the database.

// The first query
Sending command &#039;{"find": "iteratorTest", "filter": {"enumValue": "FIRST_VALUE"}, "$db": "iteratorTest", "lsid": {"id": {"$binary": {"base64": "nWH0+qjXSCmGA7gGjrJqEg==", "subType": "04"}}}}&#039; ...

// Reading from results of the first query
Received value: Document{{_id=5f8ca10c669f4a51ec5fa005, enumValue=FIRST_VALUE}}
Received value: Document{{_id=5f8ca10c669f4a51ec5fa008, enumValue=FIRST_VALUE}}

// The second query
Sending command &#039;{"find": "iteratorTest", "filter": {"enumValue": "MIDDLE_VALUE"}, "$db": "iteratorTest", "lsid": {"id": {"$binary": {"base64": "nWH0+qjXSCmGA7gGjrJqEg==", "subType": "04"}}}}&#039; ...

// Reading from results of the second query  
Received value: Document{{_id=5f8ca10c669f4a51ec5fa007, enumValue=MIDDLE_VALUE}}

// The third query
Sending command &#039;{"find": "iteratorTest", "filter": {"enumValue": "LAST_VALUE"}, "$db": "iteratorTest", "lsid": {"id": {"$binary": {"base64": "nWH0+qjXSCmGA7gGjrJqEg==", "subType": "04"}}}}&#039; ...

// Reading from results of the third query
Received value: Document{{_id=5f8ca10c669f4a51ec5fa006, enumValue=LAST_VALUE}}

So … full success!


If it was helpful in any way or if you have any other idea how to tackle this problem, make sure to let me know in the comment below!

Did you like the blog post?