The distinct subsets problem and solutions

1. The distinct subsets problem

For instance, imagine this: you are running a new e-commerce website, and for your launch you’d like to provide non-personalized recommendations to your users. If it helps you feel like this is a real situation, you might imagine owning a time machine and being back in 1999. More realistically, you can think about opening a geographically localized website with stronger ties to your country’s retailers, or a specialized retail website focusing on niche products. Either way, this can be a very interesting exercise.

So, non-personalized recommendations, we were saying. You might ask what that’s about. Allow me to take a short detour to explain: personalized recommendations are targeted at individual customers, based on the data we have about them (past purchases, or metadata that shows similarity with other users). But sometimes we don’t have this data at all: for instance, when we start a new website, or when we get a new customer about whom we know nothing. That’s why many websites, such as Twitter, Pinterest, Netflix, and MovieLens, ask you questions on sign-up to understand your tastes and provide some rough personalized recommendations based on users with similar profiles.

Non-personalized recommendations, on the other hand, are not targeted at any individual customer: they are the same for everyone. They might be hardcoded associations, if you have no data at all, or based on purchases made by all other customers, for example.

And that’s exactly what we are going to do: whenever customers add something to their cart, we would like to provide them with a recommendation about something else they might want to buy along with it. Our goal is to find products that are frequently bought together. Sometimes we are going to find reasonable associations, such as milk and bread bought together. Other times, the result might be more surprising: you have probably already heard about the research performed at Walmart linking the purchase of diapers and beer, since it’s one of the most quoted data science anecdotes.

Figure 5.2 illustrates what we would like to achieve. Initially, since we have no data at all, we consider every product as a group of its own, or if you prefer, no two products will have an association.

When customers frequently buy two products together, we then establish a link between them, considering both to be part of the same group. To keep things simple, imagine that the rule set by the data science team is that if item X and item Y are bought together more often than a fixed threshold during the last hour, then their two categories are merged.

For instance, it might be that if, during the last hour, phones and tablets are bought together more than 500 times (or for more than 1% of total purchases), then they should be in the same group, so we merge their groups.
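This counting-and-merging rule can be sketched as follows. The function name, the cart representation, and the threshold value are illustrative assumptions, not part of any particular library:

```python
from collections import Counter
from itertools import combinations

# Hypothetical rule from the text: merge the groups of two items once they
# are bought together more often than a fixed threshold within the window.
MERGE_THRESHOLD = 500

def pairs_to_merge(carts, threshold=MERGE_THRESHOLD):
    """Count co-purchases over a window of carts (each cart is a set of
    items) and return the item pairs whose count exceeds the threshold."""
    co_purchases = Counter()
    for cart in carts:
        # count each unordered pair of items appearing in the same cart
        for x, y in combinations(sorted(cart), 2):
            co_purchases[(x, y)] += 1
    return [pair for pair, count in co_purchases.items() if count > threshold]
```

Each pair returned by this function would then trigger a merge of the two items’ groups in the disjoint-set structure discussed later in the chapter.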

Then if a customer buys a product X, we can suggest to them a random item from the same group.

The process described here is pretty common in data science. Some of you might have recognized that it’s nothing less than hierarchical clustering. If that doesn’t ring a bell, do not worry; we will expand on clustering in section 5.7.3.

This is, obviously, an extreme simplification. In a real non-personalized recommender system, we would track associations between products, measuring the strength of each link as the confidence that when X is bought, Y is also bought. For that, we may compute the number of purchases where both appear, divided by the total number of purchases where X appears. That would give us better insight into what goes with what; we could define a threshold on the confidence for merging groups, and, instead of picking a random item in the same group, show the top five strongest associations.
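Concretely, the confidence of the rule “X implies Y” is the fraction of purchases containing X that also contain Y. A minimal sketch, with a hypothetical helper name and carts represented as sets:

```python
def confidence(carts, x, y):
    """Confidence that y is bought when x is bought:
    |carts containing both x and y| / |carts containing x|."""
    with_x = [cart for cart in carts if x in cart]
    if not with_x:
        return 0.0  # no evidence about x at all
    both = sum(1 for cart in with_x if y in cart)
    return both / len(with_x)
```

Note that confidence is asymmetric: `confidence(carts, x, y)` and `confidence(carts, y, x)` generally differ, which is why the direction of the rule matters.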

Nevertheless, clustering items in groups will probably be a smart move, because it will help performance, allowing us to run some algorithms on each group of items separately rather than on the whole catalog.

If you are interested in knowing more about non-personalized (and personalized) recommender systems, we suggest you look at Practical Recommender Systems (by Kim Falk, Manning Publications, 2019), a fine and thorough guide to the topic.

Back to our example: the gist is that we would need to start from this huge set of items and partition them into disjoint groups. And of course, new items are added to the catalog all the time, and relations are dynamic, so we would need to be able to update both the list of items and the groups.

2. Reasoning on solutions

In this and the following sections, we will mainly use the term partition to refer to disjoint groups. However, group and set can also appear as synonyms.

We will also restrict ourselves to the aggregative case; in other words, two partitions can be merged into a single, bigger set, but the opposite is not allowed: a partition can’t be split into two subsets.

Now imagine that there was a design discussion between your data science team and your support engineering team, and one engineer fiercely stood up and exclaimed, “Well, that’s easy! You keep an array (a dynamic array or a vector) for each subset.”

You don’t want to be that person, believe me, and one of the goals of this book is making sure you don’t find yourself in that situation. Because the next thing to happen, hopefully, is that somebody else points out how, by using arrays, understanding whether two elements are in the same set could potentially require going over all the elements of all the subsets. Likewise, just understanding which subset an element belongs to could require the same number of element checks; that is, search could be linear in the total number of elements.

This would be a real performance concern, and it seems obvious we should be able to do better.

The next idea in this brainstorming could involve adding a map from elements to subsets, together with the list of subsets explained previously. This slightly improves some operations, although, as we will see, it still forces operations such as merging two sets to potentially require O(n) assignments.
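A naive sketch of this array-plus-map design makes the cost visible (class and method names are illustrative assumptions): lookups become O(1) thanks to the map, but merging must relabel every element of one of the two subsets, which is O(n) in the worst case.

```python
class NaivePartitions:
    """Naive design: one list per subset, plus a map element -> its list."""

    def __init__(self, universe):
        # every element starts in its own singleton subset
        self.subsets = {x: [x] for x in universe}

    def find(self, x):
        # O(1): just follow the map
        return self.subsets[x]

    def merge(self, x, y):
        sx, sy = self.subsets[x], self.subsets[y]
        if sx is sy:
            return                     # already in the same subset
        sx.extend(sy)                  # append all of y's subset to x's
        for elem in sy:                # O(|sy|) map updates: O(n) worst case
            self.subsets[elem] = sx
```

The loop in `merge` is exactly the O(n) assignment cost mentioned above.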

Performance, however, is not the main concern with that design. Using two independent data structures is a terrible idea, because you will have to manually keep them in sync every time you use them in your application. This is very error-prone.

A better solution is to provide a wrapper class that internally uses those two structures: it gives you encapsulation and isolation, and as a result you only need to write the code that syncs both structures once, on, say, add or merge (and so you gain reusability). Even more important, you are able to unit test your class in isolation and hence have a reasonable guarantee that it’s going to do its job without bugs every time you use it in your application (provided, that is, you write good and thorough unit tests, covering the edge cases and challenging the behavior of your class in all possible contexts).

So, let’s assume we agree on the need to write a class that takes care of the whole problem, keeping track of which (disjoint) subset an element belongs to, and encapsulating all the logic in it. We are going to delay the discussion on implementation for now: before focusing on the details, let’s discuss its public API and behavior.

Depending on the size of the catalog, we could even fit such a data structure in memory, but let’s instead assume that we set up a REST service (illustrated in figure 5.3) based on a persisted Memcached-like storage, something like Redis. Durability of the data is important in this case, because in our example, the monitoring activity over the items will last for years, and we don’t want to recompute the whole disjoint set structure every time there is a change or a new product is added.

Alternatively, if the size of the universe is small enough to fit in memory, it could be possible to imagine a synchronization mechanism that periodically serializes our in-memory data structure into a persistent database.

3. Describing the data structure API: Disjoint set

In our design, our data structure needs to offer only a few, though crucial, operations.

First, we’d like, obviously, to initialize our instance on construction. Without any loss of generality, we can restrict ourselves to the case where the Universe U, that is, the set of all possible elements, is known in advance and static. We also assume that initially every element is in its own partition. Workarounds to support violations of these assumptions are easily achievable by making wise use of dynamic arrays and the class’s very own methods.

Finally, throughout this chapter we assume the elements of our Universe U are the integers between 0 and n-1. This is not a real restriction, because we can easily associate an index with each of the actual elements of U.
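Building such an index is a one-liner; the item names below are hypothetical, just to show the mapping:

```python
# Map arbitrary catalog items to the indices 0..n-1 that the disjoint-set
# structure will work on (item names here are made up for illustration).
items = ["phone", "tablet", "laptop"]
index_of = {item: i for i, item in enumerate(items)}
# index_of["phone"] == 0, index_of["tablet"] == 1, and so on.
```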

Initialization, therefore, just takes care of allocating the basic fields needed by the class and assigns each element to its own partition.

The findPartition method, when called on an element x of U, will return the partition to which x belongs. This output might not be meaningful outside of the instance of our data structure: think of this method mostly as a protected method for the class, or even consider restricting its visibility to private.

The two main operations that we’d like to perform are

  • Given two elements x and y, both belonging to U, we’d like to check if they belong to different partitions (areDisjoint).
  • Given two elements x and y, we’d like to merge their partitions into a single one (merge).
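Putting the API together, here is a minimal sketch of this interface, using a naive "quick-find" representation rather than the tree-based implementation the chapter develops later (the class name and internal field are assumptions; method names follow the text):

```python
class DisjointSet:
    """Quick-find sketch of the API described above: elements are the
    integers 0..n-1, and each starts in its own partition."""

    def __init__(self, n):
        # partition_of[i] is the label of i's partition; initially
        # every element labels its own singleton partition.
        self.partition_of = list(range(n))

    def findPartition(self, x):
        # mostly for internal use, as noted in the text
        return self.partition_of[x]

    def areDisjoint(self, x, y):
        return self.findPartition(x) != self.findPartition(y)

    def merge(self, x, y):
        px, py = self.findPartition(x), self.findPartition(y)
        if px == py:
            return False               # already in the same partition
        for i, p in enumerate(self.partition_of):
            if p == py:                # relabel y's whole partition: O(n)
                self.partition_of[i] = px
        return True
```

Note that `merge` is still O(n) here; improving on this bound is precisely what the rest of the chapter is about.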

Source: Marcello La Rocca, Advanced Algorithms and Data Structures, Manning Publications, 2021.
