Preparing for the Unexpected

[ad_1]

Some of the problems we tackle using machine learning involve categorical
features that represent real world objects, such as words, items and categories.
So what happens when at inference time we get new object values that have never
been seen before? How can we prepare ourselves in advance so we can still make
sense out of the input?

Unseen values, also called OOV (Out of Vocabulary) values, must be handled
properly. Different algorithms have different methods to deal with OOV values.
Different assumptions on the categorical features should be treated differently
as well.

In this post, I’ll focus on the case of deep learning applied to dynamic data,
where new values appear all the time. I’ll use Taboola’s recommender system as
an example. Some of the inputs the model gets at inference time contain unseen
values — this is common in recommender systems. Examples include:

Item id: each recommendable item gets a unique identifier. Every day thousands
of new items get into the system.
Advertiser id: sponsored content is created by advertisers. The number of new
daily advertisers is much smaller compared to the number of new items.
Nonetheless, it’s important to handle them correctly, especially since we want
to support new advertisers.

So what’s the challenge with OOV values?

An OOV value is associated with values not seen by the model at training time.
Hence, if we get an OOV value at inference time, the model won’t know what to do
with it.

One simple solution is to replace all the rare values with a special OOV token
before training. Since all OOV values are the same from the model’s point of
view, we’ll replace them with the OOV token at inference time. This solution has
two positive outcomes:

The model will be exposed to the OOV token while training. In deep learning we
usually embed categorical features. After training, the model will learn a
meaningful embedding for all OOV values .
The risk of overfitting to the rare values will be mitigated.
These values appear in a small number of examples. If we learn embeddings for
these values, the model might learn to use them to explain particularities or
random noise found in these specific examples.
Another disaster that can result with learning these embeddings is not getting
enough gradient updates propagated to them. As a consequence, the random
initialization will dominate the result embeddings over the signal learned
through training.

Problem solved… Or is it?

The model uses the item id feature to memorize different information per item,
similarly to the pure collaborative
filtering approach. Rare
items that are injected with the OOV token can’t benefit from it, so the model
performs worse on them.

The interesting thing is that even if we don’t use the item id at all during
training, the model still performs worse on rare items! This is because they
come from a distribution different than that of the general population. They
have specific characteristics — maybe they performed poorly online, which caused
Taboola’s recommender system to recommend them less, and in turn — they became
rare in the dataset. So why does this distribution difference matter?

If we learn the OOV embedding using this special distribution, it won’t
generalize to the general population. Think about it this way — every item was a
new item at some point. At that point, it was injected with the OOV token. So
the OOV embedding should perform well for all possible items.

In order to learn the OOV embedding using the general population, we can inject
the OOV token to a random set of examples from the dataset before we start the
training process. But how many examples will suffice?

The more we sample, the better the OOV embedding will be. But at the same time,
the model will be exposed to a fewer number of non-OOV values, so the
performance will degrade.

How can we use lots of examples to train the OOV embedding while at the same
time use the same examples to train the non-OOV embeddings? Instead of randomly
injecting the OOV token before starting to train, we chose the following
approach: in each epoch the model trains using all of the available values (the
OOV token isn’t injected). At the end of the epoch we sample a random set of
examples, inject the OOV token, and train the model once again. This way, we
enjoy both worlds!

As was done in the previous approach, we also inject the OOV token to rare
values – to avoid overfitting.

To evaluate the new approach, we injected the OOV token to all of the examples
and evaluated our offline metric (MSE). It improved by 15% compared to randomly
injecting the OOV token before the model starts to train.

Our model had been used in production for a long time before we thought of the
new approach. It could have been easy to miss this potential performance gain,
since the model performed well overall. It just stresses the fact that you
always have to look for the unexpected!

Originally published by me at
engineering.taboola.com.

[ad_2]

Source link