ConvSVD++: A Hybrid Deep CF Recommender Model using Convolutional Neural Network

Corresponding Author: Lamiaa Fayed, Faculty of Computers and Informatics, Zagazig University, Egypt. Email: lamiaa.fayed@yahoo.com

Abstract: Recommender systems are powerful systems that add value to businesses and corporations. They are a relatively recent technology and will keep improving in the future. The most widely used algorithms for recommender systems are categorized into traditional recommenders and deep-based recommender systems. Traditional recommendation algorithms suffer from sparse data, which significantly degrades recommendation accuracy. Hybrid approaches are attempts to tackle these recommendation challenges. This paper addresses the integration of deep learning into traditional recommendation approaches, especially Collaborative Filtering (CF) algorithms, to obtain significantly more accurate predictions. It proposes a hybrid deep CF recommender model called ConvSVD++ that tightly integrates a Convolutional Neural Network (CNN) and Singular Value Decomposition (SVD++). The proposed model incorporates items' content and implicit user feedback along with explicit user-item interactions to enhance prediction accuracy and handle the sparsity problem. The proposed model and all baseline models are evaluated on the MovieLens 1M dataset using the Root Mean Squared Error (RMSE) metric. The results show that the proposed ConvSVD++ model outperforms the baseline models. Accordingly, it is concluded that integrating CNN with the SVD++ algorithm improves rating prediction accuracy.


Introduction
The Recommender System (RS) has become a significant web service in many applications. Several online companies apply RS to build relationships with users and to enhance marketing and sales (Bobadilla et al., 2013). A recommender system is a computer program that generates suggestions about new items for a particular user or customer. The most common algorithms for recommender systems are categorized into traditional recommendation approaches, deep-based recommendation and hybrid deep recommendation. Early work considered traditional recommendation approaches, which are classified into Collaborative Filtering (CF), Content-Based (CB) and hybrid approaches. The CF approach makes recommendations based on users' evaluations of items and on the preferences of users with similar behavior (Ricci et al., 2011; Xu et al., 2013; Herlocker et al., 2017; Costa-Montenegro et al., 2012). CF is the most extensively applied family of recommendation algorithms. Within CF, Matrix Factorization (MF) (Cremonesi et al., 2012; Vozalis and Margaritis, 2005; Kurucz et al., 2007) is considered a superior algorithm (Bauer and Nanopoulos, 2014) because of its capability to reduce the sparsity problem and improve recommendation accuracy. The content-based approach recommends items based on their content and metadata (Costa-Montenegro et al., 2012). It compares the user's profile with item characteristics: the more descriptive the item data, the more accurate the recommendations. The hybrid approach combines collaborative and content data to improve performance (Barragáns-Martínez et al., 2010; Ghazanfar and Prügel-Bennett, 2014; Celdrán et al., 2016). Traditional recommender systems encounter a number of fundamental problems that reduce prediction accuracy, such as sparsity (Xie et al., 2012; Shambour and Lu, 2015), scalability (Moreno et al., 2016) and cold start (Vartak et al., 2017; Zhang et al., 2010; Qiu et al., 2011).
Various studies have recently introduced Deep Learning (DL) to develop recommender systems. A deep learning model is capable of learning latent features in large and complex data that are usually difficult to manipulate with standard tools (Batmaz et al., 2019). It also offers strong computational power. A deep recommender system can be developed in accordance with collaborative or content-based models. Moreover, it can be built solely on a DL algorithm or integrated tightly or loosely with other traditional RS models (Zhang et al., 2019).
The hybrid deep recommendation approach integrates traditional and deep-based models in an attempt to overcome the limitations of traditional approaches and to provide accurate recommendations. Various articles applied these models to incorporate auxiliary information in order to improve accuracy and solve the sparsity and cold start problems. However, this is still an open research issue.
In this study, we propose a hybrid deep Collaborative Filtering (CF) model called ConvSVD++ that incorporates items' content and implicit user feedback along with explicit user-item interactions to enhance rating prediction accuracy and handle the sparsity problem. The proposed model tightly integrates a Convolutional Neural Network (CNN) and the SVD++ approach. The CNN considers contextual information, that is, neighboring words, the sequence of words and the position of each word, which leads to a deeper understanding of the auxiliary information and to learning more hidden feature representations. We select SVD++ because it is the most common traditional approach that considers implicit ratings.
The research contributions are summarized as follows:
• To the best of our knowledge, no prior research introduces a hybrid model that combines a Convolutional Neural Network (CNN) with SVD++. The proposed model is tightly integrated: it learns and optimizes parameters simultaneously, so the two integrated approaches mutually affect each other through two-way interaction
• The proposed model ConvSVD++ considers the effect of the various forms of implicit feedback using the SVD++ model, which leads to improved prediction accuracy
• By incorporating the CNN, the model can consider the spatial structure of the input, that is, neighboring words, the sequence of words and the position of each word, which significantly improves the efficiency of extracting complex features
• The proposed model employs an efficient SVD++ algorithm that significantly enhances computational efficiency by grouping users that share the same implicit feedback
• An experiment demonstrates the superiority of the proposed model over the state-of-the-art models

This paper is organized as follows: Section 2 discusses the most recent related works. Section 3 presents the preliminaries for convolutional neural network and singular value decomposition approaches. Section 4 presents the proposed hybrid deep recommender model, ConvSVD++. Section 5 presents the experimental results and their evaluation. Finally, the conclusion is presented.

Related Work
Various studies introduced hybrid deep collaborative filtering approaches to alleviate the sparsity and cold start problems and so enhance performance and prediction accuracy. Some studies integrate latent features from items' content and auxiliary information (such as descriptions, reviews, synopses and abstracts) into the collaborative recommendation process. For instance, Wang et al. (2015a) proposed a tightly integrated model for a tag recommendation system. A generalized Bayesian SDAE was employed to learn feature representations (relations between items) and to combine them with a relational information matrix. The model outperforms the state-of-the-art tag-aware recommendation models. Wang et al. (2015b) also introduced a hierarchical Bayesian model called Collaborative Deep Learning (CDL). CDL integrates a Stacked Denoising Auto-Encoder (SDAE) with probabilistic matrix factorization. It is a tightly coupled method that permits two-way interaction between the two components. CDL employs a generalized Bayesian SDAE for learning feature representations. Similarly, Li et al. (2015) integrated a marginalized Denoising Autoencoder (DAE) with matrix factorization, which is more efficient and scalable. These studies attempt to combine deep learning and collaborative filtering. Moreover, Wang and Wang (2014) employ probabilistic matrix factorization together with features learned via DBNs. Zhang et al. (2019) introduced a hybrid approach (ConvFM) that integrates a convolutional neural network with probabilistic matrix factorization. The CNN is distinguished by providing a high-level representation and accurate contextual information of items through word embeddings and convolutional kernels. Furthermore, Zheng et al. (2017) join a convolutional neural network with matrix factorization. The model consists of two parallel CNNs connected at the final layer with matrix factorization to extract user-item interactions for predicting ratings.
It handles sparsity and provides a good semantic representation of review text. Shin et al. (2015) utilized CNN to integrate features extracted from text and images into their proposed boosted inductive matrix completion method.
Generally, the framework to unify this integration was investigated in (Li et al., 2015). This framework can interpret all previous models. It models the mappings between the latent factors used in CF and the latent layers in deep models.
Other studies incorporate items' content in addition to further sources of information about the users, such as the user's implicit feedback. Implicit feedback represents behavioral information, such as users' purchases or browsing history, from which preferences can be learned. For example, Zhang et al. (2017) improve prediction accuracy by combining the item's content and the user's implicit feedback. Their model integrates a contractive auto-encoder with SVD++. The contractive auto-encoder captures several input variations, while SVD++ learns from the implicit feedback to improve performance.
Other studies consider the temporal dynamic factor that reflects the dynamic and time-drifting of user-item interactions. Wei et al. (2017) considered the temporal dynamics of user preferences and item features. The deep learning neural network SDAE is responsible for the extraction of item content features, while the timeSVD++ model is responsible for the prediction of unknown ratings. Xiong et al. (2019) introduced a recommendation framework joining temporal dynamics, CNN-based text features and item correlation. It covers the item cold-start problem for diverse cold start situations.
As is evident, Autoencoder and CNN deep learning models are the most frequently applied in recommendation design because of their power in extracting complex hidden features to handle sparsity and improve prediction accuracy. However, the auto-encoder deep learning model utilizes a bag-of-words model that fails to capture the contextual information of the documents. Contextual information covers neighboring words, the sequence of words and the position of each word, and it leads to a deeper understanding of the document. This deeper understanding in turn helps to enhance rating prediction accuracy. Additionally, CNN can employ pre-trained word embedding models like GloVe (Pennington et al., 2014), which provide a deep understanding of the item. SDAE cannot apply a word embedding model because it relies on a bag-of-words model.
All the studies discussed above indicate that adopting both user and item features in developing RS enhances recommendation and rating prediction accuracy. The proposed model combines items' features with the user's implicit feedback. The CNN is responsible for extracting item latent features, while the SVD++ model is responsible for considering the effect of the user's implicit feedback. The proposed model takes advantage of the CNN to gain a significant performance improvement. To the best of our knowledge, the proposed model is the first hybrid approach that couples CNN with SVD++.

Singular Value Decomposition (SVD)
Singular Value Decomposition (SVD) has been utilized in two directions: to extract the latent associations between users and items that help predict users' ratings on items (Sarwar et al., 2000), or to shrink the dimensionality of the user-item space. The basic concept is to decompose the user-item matrix $R_{m \times n}$ into two low-rank matrices that denote item factors and user factors, as depicted in Fig. 1. Each item i is denoted by a vector $q_i \in \mathbb{R}^f$ and each user u is denoted by a vector $p_u \in \mathbb{R}^f$. For each item i, the components of $q_i$ measure the level to which the item holds those factors, positive or negative. For a given user u, the elements of $p_u$ quantify the level of interest the user has in items that are high on the corresponding factors (again, these may be positive or negative). The dot product $q_i^T p_u$ captures the overall interest of the user in the features of the item. The rating is predicted by Equation 1:

$$\hat{r}_{ui} = q_i^T p_u \tag{1}$$
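As a minimal sketch of the dot-product prediction described above, the following plain-Python function multiplies an item vector by a user vector. The function name and the three-factor example vectors are hypothetical and chosen only for illustration; they do not come from the paper's experiments.

```python
def predict_rating(q_i, p_u):
    """Plain latent-factor prediction: the dot product of q_i and p_u."""
    if len(q_i) != len(p_u):
        raise ValueError("item and user vectors must share the dimension f")
    return sum(q * p for q, p in zip(q_i, p_u))


# Hypothetical 3-factor vectors, used only to illustrate the computation.
q_i = [0.5, -1.0, 2.0]   # how strongly the item expresses each factor
p_u = [1.0, 0.5, 0.25]   # how much the user cares about each factor
r_hat = predict_rating(q_i, p_u)
```

A negative product on one factor (here the second) lowers the predicted rating, which reflects the remark that factor values may be positive or negative.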

Biased SVD
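Unlike the conventional SVD model, the biased SVD introduces user and item bias terms, as in Equation 2:

$$\hat{r}_{ui} = \mu + b_u + b_i + q_i^T p_u \tag{2}$$

where $\mu$ is the average rating, $b_u$ denotes the observed deviation of user u and $b_i$ is the bias term for item i. The objective of the model is to minimize the regularized squared error on the training set while learning the parameters:

$$\min_{b_*, q_*, p_*} \sum_{(u,i) \in K} \left[ \left(r_{ui} - \mu - b_u - b_i - q_i^T p_u\right)^2 + \lambda_1 \left(b_u^2 + b_i^2 + \|q_i\|^2 + \|p_u\|^2\right) \right] \tag{3}$$

where $K$ is the set of (u, i) pairs for which $r_{ui}$ is known and $\lambda_1$ is a constant that controls the degree of regularization, usually determined by cross-validation. Stochastic gradient descent or alternating least squares can be employed for optimization.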
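The stochastic gradient descent option mentioned above can be sketched as a single update for one observed rating. This is an illustrative sketch, not the authors' code: the learning rate `lr`, the regularization constant `reg` and all numeric values are assumptions, and the factor 2 of the squared-error gradient is absorbed into the learning rate, as is common.

```python
def biased_predict(mu, b_u, b_i, q_i, p_u):
    """Biased SVD prediction: global mean plus biases plus the dot product."""
    return mu + b_u + b_i + sum(q * p for q, p in zip(q_i, p_u))


def sgd_step(r_ui, mu, b_u, b_i, q_i, p_u, lr=0.01, reg=0.02):
    """One SGD update on a single known rating r_ui (lr and reg are assumed values)."""
    e = r_ui - biased_predict(mu, b_u, b_i, q_i, p_u)  # prediction error
    new_b_u = b_u + lr * (e - reg * b_u)
    new_b_i = b_i + lr * (e - reg * b_i)
    # Both factor updates use the old values of q_i and p_u.
    new_q = [q + lr * (e * p - reg * q) for q, p in zip(q_i, p_u)]
    new_p = [p + lr * (e * q - reg * p) for q, p in zip(q_i, p_u)]
    return new_b_u, new_b_i, new_q, new_p


# Hypothetical starting point: one user, one item, two latent factors.
mu, b_u, b_i = 3.5, 0.0, 0.0
q_i, p_u = [0.1, 0.1], [0.1, 0.1]
before = abs(5.0 - biased_predict(mu, b_u, b_i, q_i, p_u))
b_u, b_i, q_i, p_u = sgd_step(5.0, mu, b_u, b_i, q_i, p_u)
after = abs(5.0 - biased_predict(mu, b_u, b_i, q_i, p_u))
```

After one step the absolute error on the same rating is smaller (`after < before`), which is the behavior expected from a descent step with a small learning rate.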

SVD++
SVD++ considers the effect of implicit feedback. It provides accuracy superior to SVD and is especially helpful for users who provide much more implicit feedback than explicit feedback. A new set of item factors $y_j$ is used to characterize users based on the set of items that they implicitly rated, as in Equation 4:

$$\hat{r}_{ui} = \mu + b_u + b_i + q_i^T \left( p_u + |R(u)|^{-\frac{1}{2}} \sum_{j \in R(u)} y_j \right) \tag{4}$$

where R(u) is the set of items implicitly rated by user u.
A user is modeled by $p_u + |R(u)|^{-\frac{1}{2}} \sum_{j \in R(u)} y_j$, where $p_u$ is a user vector that is learned from the given explicit ratings. This vector is complemented by the sum $|R(u)|^{-\frac{1}{2}} \sum_{j \in R(u)} y_j$, which represents the aspect of implicit feedback. Since the $y_j$'s are centered around zero (by the regularization), the sum is normalized by $|R(u)|^{-\frac{1}{2}}$ to stabilize its variance across the range of observed values of |R(u)|.
The objective is to minimize the associated regularized squared error. Stochastic gradient descent is the learning algorithm: it iterates over all known ratings in K and updates the parameters after each rating. SVD++ can model various forms of implicit feedback. For example, if a user u rents a set of items $R^1(u)$ and browses a set of items $R^2(u)$, the model is modified as:

$$\hat{r}_{ui} = \mu + b_u + b_i + q_i^T \left( p_u + |R^1(u)|^{-\frac{1}{2}} \sum_{j \in R^1(u)} y_j^{(1)} + |R^2(u)|^{-\frac{1}{2}} \sum_{j \in R^2(u)} y_j^{(2)} \right)$$
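To make the role of the implicit-feedback term concrete, the sketch below computes an SVD++ prediction for a single form of implicit feedback. The dictionary `y` of implicit item factors, the item ids and all numbers are hypothetical; the handling of a user with no implicit feedback (falling back to the explicit user vector alone) is an assumption of this sketch.

```python
import math


def svdpp_predict(mu, b_u, b_i, q_i, p_u, y, implicit_items):
    """SVD++ prediction: the user vector is p_u plus the normalized sum of y_j."""
    user_vec = list(p_u)
    if implicit_items:  # assumption: with no implicit feedback, use p_u alone
        norm = 1.0 / math.sqrt(len(implicit_items))
        for k in range(len(user_vec)):
            user_vec[k] += norm * sum(y[j][k] for j in implicit_items)
    return mu + b_u + b_i + sum(q * v for q, v in zip(q_i, user_vec))


# Hypothetical example: four implicitly rated items, two latent factors.
y = {10: [0.2, 0.0], 11: [0.2, 0.0], 12: [0.2, 0.0], 13: [0.2, 0.0]}
with_feedback = svdpp_predict(3.0, 0.1, -0.2, [1.0, 0.0], [0.5, 0.5], y, [10, 11, 12, 13])
without_feedback = svdpp_predict(3.0, 0.1, -0.2, [1.0, 0.0], [0.5, 0.5], y, [])
```

With four implicit items the normalization factor is 1/2, so the implicit term adds 0.4 to the first user factor and the prediction rises from 3.4 to 3.8.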

Convolutional Neural Network
Convolutional Neural Network (CNN) is a type of feed-forward neural network that achieves significant performance in computer vision and Natural Language Processing (NLP). Like other neural networks, a CNN is made up of numerous neurons associated with weights and biases. Each neuron takes various inputs, sums them up and then applies an activation function to produce the output. A CNN differs from other neural networks in that its input can be multi-channeled (3 channels in the case of an image). Unlike non-convolutional neural networks, a CNN considers the spatial structure of the input, which leads to significant performance in computer vision and document analysis. A CNN is composed of an input layer, several hidden layers and an output layer. The hidden layers consist of convolutional, pooling and fully connected layers, which significantly improve the efficiency of extracting complex features (Zhang et al., 2019):
I. The convolution layer generates local features of the input data. It consists of a set of independent filters that are initialized randomly and learned by the network afterward. The size of each filter is smaller than the input. Each filter is slid over all spatial locations of the input and computes the dot product between the input and the weights defined in the filter at every spatial position. The results are summed up into one number that represents all the pixels the filter observed. The output of the convolution layer is called an activation map (feature map). Its input can be the original input or the output of a prior layer, that is, another activation map. The convolutional layer output is obtained by stacking the activation maps of all filters, and its volume is smaller than the input. Every neuron in the activation map is linked only to a small local area of the input
II. The pooling layer extracts a high-level representation of the data by down-sampling the input. Pooling processes each activation map independently by selecting the maximum (max pool) or the average over a region of the feature map, Fig. 2. It gradually decreases the spatial dimension of the representation to reduce the number of parameters and the computation in the network
III. The fully connected layer is the final layer. It is a multi-layer perceptron neural network that takes a one-dimensional vector of the feature maps of the previous layer as input. It can have zero, one or more hidden layers. The output is a list of probabilities producing a multi-class classification. The class with the highest probability is the classification decision

Hybrid Deep CF Model (ConvSVD++)
The proposed model ConvSVD++ consists of four phases, as shown in Fig. 3: the preprocessing phase, the phase that extracts latent item features with the CNN, the rating prediction phase and the evaluation phase. These phases are discussed in detail in the following subsections.

Phase1: Preprocessing
MovieLens is the most commonly chosen data set, used in 21% of the articles in the literature, and it is the one utilized in our experiments. The MovieLens data set does not contain item descriptions, so we extracted the plot summary document of each item from IMDB.
The preprocessing phase is essential to clean the text. In this section, the three main steps of dataset preprocessing are elaborated: word tokenization, normalization and stop word removal.

Word Tokenization
Tokenization is the first step in natural language processing applications. It segments the input text into linguistic units, named tokens. It works based on white space and punctuation marks.

Normalization
The normalization process has two tasks: stemming and lemmatization. Stemming is the process of eliminating suffixes, prefixes and infixes from a word to get the word stem, for example, reducing "designing" to "design". Lemmatization captures canonical forms based on a word's lemma. For instance, the lemmatization of the word "worse" returns "bad".

Stop Word Removal
Stop word removal is a significant step in obtaining a cleaned dataset. It removes all commonly occurring words that carry no significant meaning, such as "of", "the", "in" and "some".
After accomplishing this step, each item document is represented as a vector of cleaned, normalized tokens.
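The three preprocessing steps above can be sketched in plain Python as follows. This is only an illustration: the stop word list and the lemma table are tiny hypothetical stand-ins, and the suffix-stripping rule is a deliberately naive substitute for a real stemmer or lemmatizer, which the paper does not name.

```python
import re

# Tiny illustrative resources; a real pipeline would use full lists (assumption).
STOP_WORDS = {"of", "the", "in", "some", "a", "an", "and", "is", "to"}
LEMMAS = {"worse": "bad", "better": "good"}  # hypothetical lemma lookup table


def tokenize(text):
    """Split lower-cased text on white space and punctuation marks."""
    return re.findall(r"[a-z0-9]+", text.lower())


def normalize(token):
    """Lemma lookup first, then a naive suffix-stripping stand-in for stemming."""
    if token in LEMMAS:
        return LEMMAS[token]
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text):
    """Tokenize, normalize, then drop stop words."""
    normalized = [normalize(t) for t in tokenize(text)]
    return [t for t in normalized if t not in STOP_WORDS]


tokens = preprocess("The designing of some worse sequels.")
# tokens == ["design", "bad", "sequel"]
```

The example reproduces the two cases mentioned in the text: "designing" is stemmed to "design" and "worse" is lemmatized to "bad".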

Phase2: Extracting Latent Item Features by CNN
The main aim of the CNN in the proposed model is to construct an item latent vector to be used in the rating prediction of the recommender system. The CNN is capable of extracting sophisticated and effective feature representations. The CNN architecture proposed in (Kim et al., 2016) is employed in the proposed model. It mainly consists of four layers, as depicted in Fig. 4. These layers are described in detail as follows.

The Embedding Layer
A CNN does not understand text directly; it operates only on numbers. The embedding layer therefore applies a word embedding, in which individual words are represented as real-valued vectors in a predefined vector space. Words that have similar neighboring words are mapped to nearby vectors that point in the same direction. This representation helps to extract semantic information.
There are various word embedding models available. In the proposed model, the embedding model is constructed using a Python library called "gensim", a toolkit developed for handling various NLP tasks. The embedding is based on Global Vectors for word representation, named GloVe, an algorithm that extends the word2vec approach introduced by (Mikolov et al., 2013).
There are n words in the item row vector obtained from the preprocessing phase. The embedding layer begins by mapping each word into a dense vector $x_i$ and then concatenates these word vectors into a dense matrix that represents the item description document. The document matrix is denoted by $D_i \in \mathbb{R}^{P \times l}$, where l is the length of the document and P is the size of the embedding dimension for each word in the item document:

$$D_i = [x_1, x_2, \dots, x_l]$$
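A minimal sketch of this lookup-and-concatenate step is shown below, using a plain dictionary in place of a trained embedding model. The embedding table, the zero vector for unknown words and the zero padding up to a fixed length are assumptions of this sketch rather than details given in the paper.

```python
def build_document_matrix(tokens, embeddings, dim, max_len):
    """Return D as P rows by l columns, one column per word position."""
    unknown = [0.0] * dim  # assumption: unknown words and padding map to zeros
    vectors = [embeddings.get(t, unknown) for t in tokens[:max_len]]
    vectors += [unknown] * (max_len - len(vectors))
    # Transpose so that rows index embedding dimensions and columns index words.
    return [[vec[p] for vec in vectors] for p in range(dim)]


# Hypothetical 2-dimensional embeddings for two known words.
embeddings = {"space": [0.1, 0.2], "war": [0.3, 0.4]}
D = build_document_matrix(["space", "war", "robot"], embeddings, dim=2, max_len=4)
# D has P = 2 rows and l = 4 columns.
```

In practice the dictionary would be replaced by vectors loaded from a pre-trained model, and P and l would follow the experimental settings.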

The Convolution Layer
The convolutional layer is responsible for extracting contextual content features from the item's description. A contextual feature $c_i^j \in \mathbb{R}$ is extracted by the jth shared weight $w^j \in \mathbb{R}^{P \times ws}$, where ws is the window size, that is, the number of neighboring words. For example, the contextual feature $c_i^j$ is created from $x_{(i:(i+ws-1))}$:

$$c_i^j = f\left(w^j * x_{(i:(i+ws-1))} + b^j\right)$$

where * is the convolution operator, $b^j \in \mathbb{R}$ is a bias for $w^j$ and f is the non-linear activation function, ReLU. Then, the contextual feature vector of a document produced by the jth shared weight $w^j$ is:

$$c^j = \left[c_1^j, c_2^j, \dots, c_{l-ws+1}^j\right] \tag{13}$$

where $c^j \in \mathbb{R}^{l-ws+1}$. Multiple shared weights are applied, so $n_c$ contextual feature vectors are produced, one per shared weight.
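The sliding-window computation above can be sketched for a single shared weight as follows. For readability the document is stored as a list of word vectors (the columns of the document matrix) and the weight as ws rows of P values; these layouts, the function names and the numbers are illustrative assumptions.

```python
def relu(x):
    """Rectified linear unit."""
    return max(0.0, x)


def contextual_features(doc, weight, bias):
    """Slide one shared weight over the document.

    doc:    list of l word vectors, each of length P
    weight: list of ws rows, each of length P
    Returns l - ws + 1 contextual features.
    """
    ws = len(weight)
    features = []
    for i in range(len(doc) - ws + 1):
        total = bias
        for t in range(ws):  # position inside the window
            total += sum(w * x for w, x in zip(weight[t], doc[i + t]))
        features.append(relu(total))
    return features


# Hypothetical document with l = 4 words and P = 2 dimensions, window size 2.
doc = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, -1.0]]
weight = [[1.0, 0.0], [0.0, 1.0]]
c = contextual_features(doc, weight, bias=0.0)
# c == [2.0, 1.0, 0.0], with length l - ws + 1 = 3
```

The last window sums to zero before the activation, so ReLU leaves it at 0.0; a negative sum would be clipped to the same value.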

The Pooling Layer
The pooling layer is responsible for providing a high-level representation of the item features that come from the convolution layer. The contextual feature vectors obtained from the convolution layer have variable lengths (l-ws+1), so a max-pooling operation extracts only the most important contextual feature from each vector, as in Equation 14, to construct a feature vector of fixed length $n_c$:

$$\left[\max(c^1), \max(c^2), \dots, \max(c^{n_c})\right] \tag{14}$$

where $c^j$ is a contextual feature vector of length (l-ws+1) extracted by the jth shared weight.
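A short sketch of this max-pooling step follows. The two sets of feature vectors are hypothetical and stand for documents of different lengths processed with the same three shared weights.

```python
def max_pool(feature_vectors):
    """Keep the largest contextual feature produced by each shared weight."""
    return [max(c) for c in feature_vectors]


# Hypothetical outputs of n_c = 3 shared weights on a short and a long document.
short_doc = [[2.0, 1.0], [0.5, 3.0], [0.0, 0.2]]
long_doc = [[0.1, 0.4, 0.9, 0.3], [1.5, 0.0, 0.2, 0.1], [0.0, 0.0, 0.7, 0.6]]
pooled_short = max_pool(short_doc)
pooled_long = max_pool(long_doc)
```

Both pooled vectors have length 3 regardless of document length, which is what allows the next layer to use a fixed input size.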

The Output Layer
At the output layer, a conventional nonlinear projection is applied to the pooled features in order to obtain the latent feature vector of each item. The latent vector of item i is given by $Cnn(W, D_i)$, where W represents all the weight and bias variables that are updated in the convolutional neural network and $D_i$ is the item's description document.

Phase3: Rating Prediction
The rating is predicted from $q_i$ and $p_u$, where $q_i$ is a vector of latent factors of item i and $p_u$ is a vector of latent features of user u. The proposed ConvSVD++ model feeds the item's latent features extracted by the CNN into the latent factor training process of SVD++, Fig. 5. The item latent vector $q_i$ is divided into two parts (Zhang et al., 2017): $Cnn(W, D_i)$, the feature vector extracted from the item's content, and $\epsilon_i \in \mathbb{R}^k$ (i = 1...n), the latent item-based offset vector, so that $q_i = \beta \cdot Cnn(W, D_i) + \epsilon_i$, where β is a hyper-parameter that normalizes $Cnn(W, D_i)$. The predicted rating is estimated using Equation 18:

$$\hat{r}_{ui} = \mu + b_u + b_i + \left(\beta \cdot Cnn(W, D_i) + \epsilon_i\right)^T \left( p_u + |R(u)|^{-\frac{1}{2}} \sum_{j \in R(u)} y_j \right) \tag{18}$$

where R(u) is the set of items implicitly rated by user u. A user is modeled by $p_u + |R(u)|^{-\frac{1}{2}} \sum_{j \in R(u)} y_j$, where $p_u$ is a user vector that is learned from the given explicit ratings. This vector is complemented by the sum $|R(u)|^{-\frac{1}{2}} \sum_{j \in R(u)} y_j$, which represents the aspect of implicit feedback. Since the $y_j$'s are centered around zero (by the regularization), the sum is normalized by $|R(u)|^{-\frac{1}{2}}$ in order to stabilize its variance across the range of observed values of |R(u)|.
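The following sketch shows how the two parts of the item vector enter the prediction. The CNN output is passed in as a precomputed list `cnn_vec`, which is an assumption made to keep the example self-contained; the function name and all numeric values are hypothetical.

```python
import math


def convsvdpp_predict(mu, b_u, b_i, cnn_vec, eps_i, beta, p_u, y, implicit_items):
    """Prediction with q_i = beta * Cnn(W, D_i) + eps_i and an SVD++ user vector."""
    q_i = [beta * c + e for c, e in zip(cnn_vec, eps_i)]
    user_vec = list(p_u)
    if implicit_items:  # assumption: no implicit feedback means p_u alone
        norm = 1.0 / math.sqrt(len(implicit_items))
        for k in range(len(user_vec)):
            user_vec[k] += norm * sum(y[j][k] for j in implicit_items)
    return mu + b_u + b_i + sum(q * v for q, v in zip(q_i, user_vec))


# Hypothetical values: two latent factors and one implicitly rated item.
r_hat = convsvdpp_predict(
    mu=3.5, b_u=0.2, b_i=-0.1,
    cnn_vec=[1.0, 2.0],   # stand-in for the CNN output Cnn(W, D_i)
    eps_i=[0.1, -0.1],    # item-based offset vector
    beta=0.5,
    p_u=[1.0, 0.0],
    y={1: [0.0, 1.0]},
    implicit_items=[1],
)
```

Here the item vector becomes [0.6, 0.9] and the user vector becomes [1.0, 1.0], so the interaction term contributes 1.5 on top of the mean and the biases.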

Optimization
The proposed ConvSVD++ is a tightly integrated model that learns and optimizes its parameters simultaneously. The integrated CNN and SVD++ approaches mutually affect each other through two-way interaction. We utilize the backpropagation algorithm to optimize the CNN weights. The objective of the model is to minimize the regularized squared error on the training set while learning the parameters, where the constant λ controls the degree of regularization and is usually determined by cross-validation. Stochastic gradient descent is employed for optimization. It iterates over all known ratings in K, the set of (u, i) pairs for which $r_{ui}$ is known in the training set, and updates the parameters using the learning rates γ1 and γ2 and the regularization weights λ1 and λ2. Because updating the parameter y is costly, an efficient training process is applied to decrease the computation time of SVD++, using Algorithm 1 in the same way as (Zhang et al., 2017).
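For orientation, one plausible form of the per-rating updates is sketched below. It assumes that the updates follow the standard SVD++ stochastic gradient rules with $q_i = \beta \cdot Cnn(W, D_i) + \epsilon_i$, that γ1 and λ1 apply to the biases and γ2 and λ2 to the factor vectors, and that the CNN weights W are updated separately by backpropagation. These pairings are assumptions of this sketch and are not confirmed by the text.

```latex
% Assumed SGD updates for one known rating r_{ui}, with
% e_{ui} = r_{ui} - \hat{r}_{ui} and q_i = \beta \cdot Cnn(W, D_i) + \epsilon_i
\begin{aligned}
b_u &\leftarrow b_u + \gamma_1 \left(e_{ui} - \lambda_1 b_u\right) \\
b_i &\leftarrow b_i + \gamma_1 \left(e_{ui} - \lambda_1 b_i\right) \\
\epsilon_i &\leftarrow \epsilon_i + \gamma_2 \left(e_{ui}\Big(p_u + |R(u)|^{-\frac{1}{2}} \sum_{j \in R(u)} y_j\Big) - \lambda_2 \epsilon_i\right) \\
p_u &\leftarrow p_u + \gamma_2 \left(e_{ui}\, q_i - \lambda_2 p_u\right) \\
y_j &\leftarrow y_j + \gamma_2 \left(e_{ui}\, |R(u)|^{-\frac{1}{2}}\, q_i - \lambda_2 y_j\right) \quad \forall j \in R(u)
\end{aligned}
```

The last rule touches every item in R(u) for each rating, which is the cost that the efficient training procedure mentioned above is meant to reduce.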

Experimental Environment
The proposed model is coded in Python 3.7 and runs on a personal laptop with the following specification: 64-bit Windows 10 operating system, x64-based processor, an Intel® Core(TM) i7-9750U CPU @ 2.80 GHz, 16.00 GB RAM and 1 TB disk space.

Data Description
An experiment is conducted on the most common dataset, MovieLens 1M. It consists of 1,000,209 ratings of 3900 movies made by 6040 MovieLens users, as shown in Table 1. Ratings are discrete values from 1 to 5. The dataset is downloaded from https://grouplens.org/datasets/movielens. Plot summaries are available at http://www.imdb.com/; a plot example is shown in Fig. 6. The data set contains three files: users, items and ratings. User information is in the file "users.dat" and is in the following format: UserID::Gender::Age::Occupation::Zip-code. All ratings are contained in the file "ratings.dat" and are in the following format: UserID::MovieID::Rating::Timestamp. Each movie's summary is stored in "plot.dat" and is in the format: MoviesID::Plot. We preprocessed the movie plots for all dataset elements as follows:
• Set the maximum length of raw documents to 300
• Eliminate stop words
• Determine the tf-idf score for each word
• Eliminate corpus-specific stop words that have a document frequency higher than 0.5
• Pick the top 8000 distinct words as the vocabulary
• Eliminate all non-vocabulary words from the raw documents
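Two of the steps above lend themselves to a short sketch: reading a "::"-delimited rating line and finding corpus-specific stop words by document frequency. The function names, the sample line and the toy documents are hypothetical, and the tf-idf ranking and vocabulary selection steps are intentionally left out.

```python
def parse_rating_line(line):
    """Parse one line in the format UserID::MovieID::Rating::Timestamp."""
    user_id, movie_id, rating, timestamp = line.strip().split("::")
    return int(user_id), int(movie_id), int(rating), int(timestamp)


def high_df_words(documents, max_df=0.5):
    """Return words whose document frequency exceeds max_df."""
    counts = {}
    for doc in documents:
        for word in set(doc):  # count each word once per document
            counts[word] = counts.get(word, 0) + 1
    n = len(documents)
    return {w for w, c in counts.items() if c / n > max_df}


# Illustrative inputs only.
record = parse_rating_line("1::1193::5::978300760\n")
docs = [["movie", "hero"], ["movie", "love"], ["movie", "war"], ["hero", "war"]]
too_common = high_df_words(docs)
```

In this toy corpus only "movie" appears in more than half of the documents, so it is the only word flagged for removal.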

Experimental Settings
The configuration settings for the CNN and SVD++ algorithms are as follows:
1) The maximum document length is set to 300
2) The word latent vectors are initialized by pre-trained word embedding models with a dimension size of 200. They are further trained during optimization
3) Several window sizes (3, 4 and 5) are employed in the convolution layer to consider different lengths of surrounding words
4) We apply 100 shared weights per window size
5) We employ dropout with a rate of 0.2 to prevent the CNN from over-fitting
6) We set gamma1 = 0.007, gamma2 = 0.007, Lambda1 = 0.

Experimental Result
To evaluate the performance of the proposed model and all baseline models on the Movielens-1M dataset, we randomly split the dataset into a training set (80%) and a test set (20%). The training set contains at least one rating for every user and every item so that all users and items are covered. As the evaluation measure, we used Root Mean Squared Error (RMSE), which is directly related to the objective function of a conventional rating prediction model. The proposed ConvSVD++ model is trained for 30 iterations. Figures 7 to 9 show the change of RMSE and MAE per iteration on the Movielens-1M data set for training ratios of 50, 70 and 80%, respectively. Table 2 shows the RMSE and MAE for different ratios of the training set to the entire dataset. Table 3 and Fig. 10 present the RMSE of ConvSVD++ and the baseline models for the Movielens-1M dataset. ConvSVD++ shows an improvement over the state-of-the-art models.
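The evaluation protocol can be sketched in a few lines of plain Python. The split below is a simple seeded shuffle and, unlike the procedure described above, does not guarantee that every user and item appears in the training set; the function names, the seed and the sample values are assumptions.

```python
import math
import random


def rmse(actual, predicted):
    """Root mean squared error between two equally long rating lists."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def mae(actual, predicted):
    """Mean absolute error between two equally long rating lists."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def random_split(ratings, train_ratio=0.8, seed=0):
    """Shuffle a copy of the ratings and cut it into training and test parts."""
    shuffled = list(ratings)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]


# Hypothetical ratings and predictions, for illustration only.
actual = [4.0, 3.0, 5.0]
predicted = [3.5, 3.0, 4.0]
error_rmse = rmse(actual, predicted)
error_mae = mae(actual, predicted)
train, test = random_split(range(10))
```

RMSE penalizes the single one-star miss more heavily than MAE does, which is why the two metrics can rank models differently.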

Conclusion
In this study, we present a hybrid deep CF model called ConvSVD++. It combines latent features extracted from items' descriptions by a CNN with SVD++, which incorporates the user's implicit feedback. The CNN considers the contextual information of the input, which leads to significant performance. The results confirm that the proposed ConvSVD++ model provides better accuracy than the baseline models. In the future, we can incorporate temporal dynamics factors into the proposed ConvSVD++ model to obtain more accurate results. Additionally, further experiments should be conducted on various datasets with different density levels to verify that the proposed model handles sparse data and gives accurate recommendations.