Skip to Content

Note on Factorization machines

Posted on 2 mins read

Factorization machines (FM) are a generic model class that can mimic most factorization models just by feature engineering.

FMs combine feature engineering with factorization models(matrix factorization or tensor factorization) in estimating interactions between categorical variables.

Factorization Machines Math

$$ \hat y := w_0 + \sum_{i= 1}^n w_i x_i + \sum_{i = 1}^n \sum_{j = i+1}^n \hat w_{i,j} x_i x_j $$

where $\hat w_{i,j}$ are the factorized interaction parameters between pairs

$$ w_{i, j} = \langle \mathbf{v_i}, \mathbf{v_j}\rangle = \sum_{f = 1}^k v_{i,f} \cdot v_{j,f} $$

and the model parameters $\Theta$ that have to be estimated are: $$w_{0}, \mathbf{w} \in \mathbb{R}^n, \mathbf{V} \in \mathbb{R}^{n \times k}$$

Intepretation: $w_0$ is the global bias, $w_i$ models the interaction of the i-th variable to the target and $\hat w_{i, j}$ models the factorized interaction of a pair of variables with the target.

When we consider quadractic feature interactions, the complexity increases to $O(n^2)$ in the above formula. Now consider a very sparse set of features, the runtime blows up. In most cases, we instead model a very limited set of feature interactions to manage the complexity.

How do FMs solve the problem?

In the recommendataion problem space, the sparsity problem has been dealt with a well-documented technique called (non-negative) matrix factorization

Example image

We factorize sparse user item matrix $( r )$ $\in \mathbb{R}^{U \times I}$ into a user matrix $(u)$ $\in \mathbb{R}^{U \times K}$ and an item matrix $(i)$ $\in \mathbb{R}^{I \times K}$, where $K << U$ and $K << I$.

User $(u_i)$’s preference for item $i_j$ can be approximated by $u_i \cdot i_j$.

Factorization Machines takes inspiration from matrix factorization, and models the feature iteractions like using latent vectors of size $K$. As a result, every sparse feature $f_i$ has a corresponding latent vector $v_i$. And two feature’s interactions are modelled as $v_i \cdot v_j$.