Ensemble Divide and Conquer Approach to Solve the Rating Scores’ Deviation in Recommendation System

Abstract: The rating matrix of a personalized recommendation system contains a high percentage of unknown rating scores, which lowers the quality of prediction. In addition, while data are streamed into memory, some rating scores are misplaced from their appropriate cells in the rating matrix, which further decreases the quality of prediction. The singular value decomposition algorithm predicts the unknown rating scores from the relation between the implicit feedback of users and items, but it exploits neither user similarity nor item similarity, which leads to low-accuracy predictions. Several factorization methods are used to improve the prediction performance of the collaborative filtering technique, such as baseline, matrix factorization and neighbours-base. However, the prediction performance of collaborative filtering using factorization methods is still low, while baseline and neighbours-base have limitations in terms of overfitting. Therefore, this paper proposes the Ensemble Divide and Conquer (EDC) approach for solving two main problems: data sparsity and the deviation (misplacement) of rating scores. The EDC approach is founded on the Singular Value Decomposition (SVD) algorithm, which extracts the relationship between the latent feedback of users and the latent feedback of items. Furthermore, this paper addresses the scale of rating scores as a sub-problem that affects the rank approximation among the users' features. The latent feedback of users and items are also SVD factors. The results using the EDC approach are more accurate than collaborative filtering and existing matrix factorization methods, namely SVD, baseline, matrix factorization and neighbours-base. This indicates the significance of the latent feedback of both users and items, compared with the different factorization features, in improving the prediction accuracy of the collaborative filtering technique.


Introduction
Recommendation System (RS) is one of the solutions to information overload and improves the quality of social networks. A personalized recommendation system utilizes the rating scores of the common users to predict a suitable item to recommend to the target user. The scores in the rating matrix represent significant features of the users and items, but the rating matrix commonly contains unknown rating scores (data sparsity), which lowers the accuracy of the predicted scores. Moreover, during the streaming of rating scores into the rating matrix, some rating scores deviate from their accurate places (Cui et al., 2014). Usually, the deviation is caused by streaming a huge amount of rating scores into the rating matrix without sorting and managing these scores to extract accurate latent feedback. In fact, the position of any rating score after streaming affects the values of the latent feedback. The position of the rating scores is therefore a significant factor in predicting the unknown rating scores. Furthermore, Collaborative Filtering (CF) addresses the limitation of RS.
CF (Armentano et al., 2012) is used in RS to explore the similarity of users based on explicit features (rating scores). Other latent feedback of users and items can be explored from the rating matrix using a prediction method such as Singular Value Decomposition (SVD), which utilizes the decomposition vectors of the known rating scores (Koren, 2009). When a rating score deviates from its appropriate cell in the rating matrix, SVD prediction gives low accuracy. Therefore, once the streaming is completed, all rating scores need to be rearranged based on the similarities between users or items.
Few methods have been introduced for rearranging rating scores, such as the Divide and Conquer algorithm (DC) (Gu and Eisenstat, 1995). DC is used to solve the problem of misplaced objects or outliers that result from data streaming. Mackey et al. (2011) used the Divide-Factor-Combine (DFC) algorithm to deal with the base matrix factorization. The DFC algorithm randomly divides the large-scale matrix factorization task into smaller sub-problems, solves those sub-problems in parallel and then combines them using ensemble methods based on low-rank approximations (Mackey et al., 2011). Cui et al. (2014) proposed a state-of-the-art divide and conquer k-means clustering algorithm to reduce the imprecision in rearranging streaming data. Thus, Mackey et al. (2011) rearranged the matrix factorization based on the ensemble method, and Cui et al. (2014) identified the data places based on the clustering method and its relations. However, none of these methods has focused on the similarity of users (sim u ) and the similarity of items (sim i ). Hence, the Ensemble Divide and Conquer (EDC) approach is proposed to solve data sparsity and also the deviation (misplacement) of rating scores. The EDC approach is founded on the SVD algorithm, which extracts the relationship between the latent feedback of users and the latent feedback of items. Besides, this work treats the range scale of rating scores as a sub-problem that affects the approximation among the rating scores. Therefore, normalizing the rating scores provides a more accurate approximation among users' features.
The EDC approach exploits the relation of latent feedback between sim u and sim i , and the combination of sim u and sim i , to improve the accuracy of personalized RS based on three methods. The first method is Divide and Conquer based on sim u (DCU). The second method is Divide and Conquer based on sim i (DCI). The third method is Divide and Conquer based on sim u and sim i (DCUI), which combines the DCU and DCI methods. The sim u or the sim i is measured using the squared Euclidean distance, which is used in the k-means algorithm. Furthermore, EDC combines four methods, namely SVD, DCU, DCI and DCUI, and selects the one with the lowest error and the highest prediction accuracy. In addition, CF provides personalized recommendations for the set of users. Therefore, the average prediction accuracy over the set of users is computed to benchmark the experimental methods. The EDC method provides a specific value of the Root Mean Squared Error (RMSE) for each user. To evaluate the proposed EDC methods correctly, the total number of beneficiaries among the users is computed for each method separately. The beneficiaries of a method are the users for whom that method gives the lowest RMSE. The total beneficiaries of each method are compared to the total number of users to obtain the coverage ratio of this method. The proposed EDC methods differ from the previous divide and conquer methods (Mackey et al., 2011; Cui et al., 2014; Mirbakhsh and Ling, 2013), which used k-means for generating random clusters.
The contributions of this paper are threefold. First, the EDC approach is introduced to improve the prediction accuracy for the Movie Lens dataset compared to the SVD algorithm. This achievement also indicates the novelty of EDC in solving the deviation of scores after streaming by utilizing sim u , sim i and their relations. Second, the EDC approach gives the lowest RMSE compared to CF, SVD, baseline, MF and neighbours-base, where a lower RMSE is better. Lastly, the performance of all benchmark methods is improved by different percentages by normalizing the rating scores of users in the rating matrix from the range [0-5] to [0-1]. The normalization of rating scores was performed as a standard data mining step to improve the accuracy of the RS. The first part of the paper introduces the problem of rating score deviation and gives a brief on the proposed EDC method. The second part focuses on the related works and the steps of EDC, while the third part presents the experimental results, discussion and conclusion.

Related Works
Clustering techniques help to divide the huge sparse rating matrix into k matrices by identifying similar users and similar items, which reduces the dimensionality of the rating matrix. The k-means clustering technique is one of the most widely used iterative optimization algorithms (Han et al., 2011). It is regarded as a popular clustering approach due to its integrity of execution (Xu and Wunsch, 2008). Therefore, this algorithm is used as the main tool of the EDC approach to divide the rating matrix into k clusters. There are five proximity measures: squared Euclidean distance, city block, Hamming, cosine and correlation coefficient. These measures are used in the k-means algorithm for computing and optimizing the sum of the proximity between the members and the centroid of each cluster. The main distance measures are squared Euclidean distance, city block and Hamming distance, while the main similarity measures are cosine and correlation coefficient (Khalil et al., 2009). The Divide and Conquer algorithm (DC) is used to reduce the noise in Matrix Factorization (MF) by dividing the large-scale MF task into sub-problems. Mackey et al. (2011) proposed the Divide-Factor-Combine (DFC) approach for reducing the noise of MF with missing entries or outliers. DFC contains two algorithms, DFC-PROJ and DFC-NYS, based on the approximation technique. DFC-PROJ randomly divides the orthogonal original MF into sub-matrix factorizations, while DFC-NYS selects sub-matrices uniformly at random. Clearly, DFC deals with random columns and random rows for rearranging the MF. The combination of sub-matrices based on the approximation factor improves the scalability of matrix factorization (Mackey et al., 2011).
During the streaming of data into memory, k-means faces a big challenge in reusing large data: each object is fetched from disk into memory in each iteration, which means the data in memory cannot be recycled, causing poor temporal locality. The collaborative DC algorithm has been proposed to improve the state-of-the-art k-means algorithm and to identify the clusters by reducing the misplaced objects. The collaborative seeding among different partitions accelerates the convergence inside each partition and the convergence factor of each cluster, which improves the quality of existing clusters (Cui et al., 2014). However, neither Mackey et al. (2011) nor Cui et al. (2014) exploited the sim u and sim i features for RS. Besides, the relation factors between users and items are not exploited, so these methods do not support personalized RS.

Matrix Factorization
Currently, Matrix Factorization (MF) has become a common approach for CF (Mirbakhsh and Ling, 2013), and it is one of the most effective prediction approaches for addressing sparse data. SVD is a traditional MF technique which is used to predict the sparse rating scores for the Movie Lens and E-Commerce datasets in RS based on CF (Sarwar et al., 2000). SVD can extract the latent feedback of users and the latent feedback of items based on the relation between users and items, while reducing the dimensionality of the rating matrix. Moreover, this approach is able to calculate low-rank approximations, which can be used to calculate the sim i (Koren, 2008). The factors of the latent feedback can be extracted from the rating matrix R by the SVD algorithm as shown in Equation 1:
$$R = P\,B\,V^{T} \quad (1)$$
This decomposition is available in several programming environments such as Matlab. Figure 1 is an example showing the input to the SVD algorithm and the output factors of this algorithm. SVD produces three matrices: the matrix of the latent feedback of users P, the diagonal matrix B and the matrix of the latent feedback of items V.
Equation 2 is used in several matrix factorization methods for predicting the sparse rating scores. It uses the latent matrices P and V:
$$\hat{r}_{ui} = P_u V_i^{T} \quad (2)$$
where $\hat{r}_{ui}$ is the predicted value of the sparse rating score, $P_u$ is the latent feedback of user u and $V_i^{T}$ is the transposed latent feedback of item i taken from the matrix V. This method uses the stochastic gradient descent algorithm (Koren, 2010) to reduce the prediction error. Further features of both users and items can be extracted using the baseline method, which captures the effects of users and items separately. The baseline has two factors, the users' base $b_u$ and the items' base $b_i$, which are extracted using standard deviation. Equation 3 (Koren, 2008) shows the predicted value using the baseline factors:
$$b_{ui} = \mu + b_u + b_i \quad (3)$$
where $\mu$ represents the mean of the rating scores of users. Figure 2 shows an example of predicting the sparse rating scores using the baseline method. Some of the values predicted by the baseline exceed the rating range [0-5], which shows the overfitting problem. Furthermore, the MF method uses the factors of the baseline and the factors of SVD to learn the factorization features within Equation 4 (Koren, 2009):
$$\hat{r}_{ui} = \mu + b_u + b_i + P_u V_i^{T} \quad (4)$$
The matrix factorization methods are also combined with the base features of the neighbours. For example, the neighbours-base model (Koren, 2010) integrates the factors of the baseline with the distance between the rating scores and the base features of the neighbours who provide rating scores for each item, as shown in Equation 5 (Bell and Koren, 2007):
$$\hat{r}_{ui} = b_{ui} + \frac{\sum_{x \in N} sim_x \,(r_{xi} - b_{xi})}{\sum_{x \in N} sim_x} \quad (5)$$
where N is the set of neighbours who have rated item i and x is a neighbour who rated item i. The value $sim_x$ is the similarity between neighbour x and the target user, and $r_{xi}$ and $b_{xi}$ are the rating score of neighbour x and its baseline prediction value, respectively. The prediction methods of SVD, baseline, MF and neighbours-base are used to evaluate the proposed model.
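The baseline prediction and its out-of-range behaviour can be illustrated with a minimal sketch. It assumes that zero marks an unknown score and that $b_u$ and $b_i$ are estimated as simple average deviations from the global mean; this estimator, the function name `baseline_predict` and the toy matrix are illustrative assumptions, not the procedure used in the experiments.

```python
def baseline_predict(ratings, u, i):
    """Return mu + b_u + b_i for user u and item i (0 marks an unknown score).

    Assumption: b_u and b_i are plain average deviations from the global mean.
    """
    known = [(a, b, r)
             for a, row in enumerate(ratings)
             for b, r in enumerate(row) if r > 0]
    mu = sum(r for _, _, r in known) / len(known)

    user_dev = [r - mu for a, _, r in known if a == u]
    item_dev = [r - mu for _, b, r in known if b == i]
    b_u = sum(user_dev) / len(user_dev) if user_dev else 0.0
    b_i = sum(item_dev) / len(item_dev) if item_dev else 0.0
    return mu + b_u + b_i


# Toy matrix (hypothetical): user 0 rates generously, item 2 is rated highly.
toy_ratings = [[5, 5, 0],
               [1, 1, 5]]
prediction = baseline_predict(toy_ratings, u=0, i=2)
print(prediction > 5)  # the sum of the two positive offsets leaves the rating range
```

In this toy case both offsets are positive, so the prediction falls above the top of the rating scale, which is the kind of out-of-range value the text attributes to the baseline method.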

Collaborative Filtering
CF is one of the filtering techniques which RS uses for personalized recommendations (Zhang et al., 2011). The members of RS give rating scores to a set of items based on their interests, and for personal recommendations based on CF, RS makes recommendations to its members based on these rating scores (Bobadilla et al., 2013). CF is classified into two types: memory-based CF and model-based CF. Memory-based CF produces recommendations based on the rating scores of all common users, which are stored in memory. The rating scores are arranged in the rating matrix and then the similarity between the common users and the target user is calculated to predict the users' interest in items (Ren et al., 2013). Therefore, for each target user, a group of common users who have rated the common items most similarly can be recognized as neighbours of the target user (Adibi and Ladani, 2013).
Furthermore, the top k users with the highest similarities are taken as the nearest neighbours of the target user. One limitation of memory-based CF techniques is that the similarity values are determined from common items; consequently, these similarity values are unreliable when the data are sparse and the common items are few (Su and Khoshgoftaar, 2009). In contrast, model-based CF builds a model from the specified rating matrix and uses a prediction method such as Singular Value Decomposition (SVD) to predict the unknown rating scores.
There are three stages in the CF process. First, the similarity between each common user and the target user is computed, where the cosine function (Zheng and Li, 2011) shown in Equation 6 is commonly used (Ahn, 2008):
$$sim(u_a, u_h) = \frac{\sum_{i=1}^{K} r_{u_a,i}\, r_{u_h,i}}{\sqrt{\sum_{i=1}^{K} r_{u_a,i}^{2}}\,\sqrt{\sum_{i=1}^{K} r_{u_h,i}^{2}}} \quad (6)$$
where $r_{u,i}$ is the rating score which user u gave to item i and K is the number of common items rated by both users. Second, the predicted rating score for item i is computed. This is obtained by the deviation from the mean as an aggregation method, as shown in Equation 7 (Ahn, 2008):
$$\lambda_i = \bar{v}_{u_a} + \frac{\sum_{h=1}^{M} sim(u_a, u_h)\,(r_{u_h,i} - \bar{v}_{u_h})}{\sum_{h=1}^{M} |sim(u_a, u_h)|} \quad (7)$$
where $\lambda_i$ is the predicted rating score for item i, $\bar{v}_{u_a}$ and $\bar{v}_{u_h}$ are the average ratings of the target user $u_a$ and the common user $u_h$, respectively, and M is the number of common users who have rated item i. Last, the Root Mean Squared Error function (RMSE) (Bobadilla et al., 2013; Patra et al., 2015) is used to benchmark the prediction accuracy of RS, as shown in Equation 8:
$$RMSE = \frac{1}{|U|} \sum_{u \in U} \sqrt{\frac{1}{n} \sum_{i=1}^{n} (r_i - \lambda_i)^{2}} \quad (8)$$
where U is the set of target users, n is the number of items rated by the target user u, $r_i$ is the rating score given by user u to item i and $\lambda_i$ is the predicted rating score for item i. Equation 8 provides the average RMSE for evaluating the prediction accuracy over the whole set of target users U.
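The three CF stages can be sketched in a few lines of standard-library Python. The sketch assumes that zero marks an unknown score, that the cosine similarity is taken over co-rated items only, and that the deviation-from-mean aggregation is normalized by the sum of absolute similarities; the function names are hypothetical and the code is an illustration rather than the implementation used in the experiments.

```python
import math


def cosine_sim(a, b):
    """Stage 1: cosine similarity over the items rated by both users."""
    common = [(x, y) for x, y in zip(a, b) if x > 0 and y > 0]
    if not common:
        return 0.0
    num = sum(x * y for x, y in common)
    den = (math.sqrt(sum(x * x for x, _ in common))
           * math.sqrt(sum(y * y for _, y in common)))
    return num / den


def mean_known(row):
    """Average of the known (non-zero) rating scores of one user."""
    known = [r for r in row if r > 0]
    return sum(known) / len(known) if known else 0.0


def predict_score(target, others, item):
    """Stage 2: deviation-from-mean prediction for one item of the target user."""
    num = den = 0.0
    for other in others:
        if other[item] > 0:  # only common users who rated the item contribute
            s = cosine_sim(target, other)
            num += s * (other[item] - mean_known(other))
            den += abs(s)
    base = mean_known(target)
    return base if den == 0 else base + num / den


def rmse(actual, predicted):
    """Stage 3: RMSE of one target user over the items being evaluated."""
    pairs = list(zip(actual, predicted))
    return math.sqrt(sum((r - p) ** 2 for r, p in pairs) / len(pairs))
```

Averaging `rmse` over all target users then gives the benchmark value described in the text.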

Dataset Description
Several experimental studies have used the Movie Lens dataset (Bobadilla et al., 2012; Lisboa et al., 2013) to evaluate the performance of RS. This dataset records user ratings of movies (on a 1-5 scale) for the purpose of building RS. The data were assembled through the Movie Lens website (movielens.umn.edu) during the seven-month period from September 1997 to April 1998. The dataset contains 100,000 ratings from 943 users on 1682 movies (each user has rated at least 20 movies), where 95.4% of the ratings are missing and each user on average rates 5% of all items. These data are used by the EDC method to provide personalized recommendations.

Normalization
Normalization is a method of data transformation for preprocessing the data in order to improve the accuracy and efficiency of mining algorithms that involve distance measurements. In RS, the rating scores of users for items span the range [0-5], and based on our experiments this scale gives low accuracy, especially for the Movie Lens dataset. Each rating score v is therefore mapped to a normalized score v' by min-max normalization (Equation 9):
$$v' = \frac{v - x}{y - x}\,(m - n) + n \quad (9)$$
where x is the minimum value in the whole matrix and y is the maximum value in the whole matrix. In addition, m is the maximum of the target range and n is the minimum of the target range. Table 1 shows the rating score values before and after the normalization.
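A minimal sketch of this rescaling, assuming the standard min-max form with x and y taken over the whole matrix and the target range [n, m] defaulting to [0, 1]; the function name `min_max_normalize` and the toy values are illustrative.

```python
def min_max_normalize(matrix, n=0.0, m=1.0):
    """Map every score v to (v - x) / (y - x) * (m - n) + n.

    x and y are the minimum and maximum of the whole matrix;
    n and m are the minimum and maximum of the target range.
    """
    x = min(min(row) for row in matrix)
    y = max(max(row) for row in matrix)
    if y == x:
        raise ValueError("all scores are equal; the range cannot be rescaled")
    return [[(v - x) / (y - x) * (m - n) + n for v in row] for row in matrix]


# Hypothetical scores on the [0-5] scale, where 0 marks an unknown score.
scaled = min_max_normalize([[0, 5], [3, 1]])
```

When the matrix minimum is zero, unknown scores stay at zero after the mapping, so the sparsity pattern of the rating matrix is preserved.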

Methodology
The Ensemble Divide and Conquer algorithm (EDC) solves the problem of the deviation of some rating scores by returning predicted rating scores that have the lowest RMSE. The factors of SVD can be used to predict the sparse rating scores using Equation 2.

Barragáns-Martínez et al. (2010) used Equation 10 to predict the sparse rating scores. However, Equations 2 and 10 are not convenient for predicting the sparse rating scores, because the predicted values are very small, which lowers the prediction accuracy of the CF technique. Therefore, the EDC approach uses Equation 11 to predict the sparse rating scores:
$$\hat{r}_{ui} = P_u\,B\,V_i^{T} \quad (11)$$
This method integrates the latent feedback of users, the diagonal matrix and the latent feedback of items to predict the sparse rating scores. Figure 3 shows an example justifying the use of Equation 11 in the EDC approach.
In Fig. 3, the first row u 1 in the rating matrix represents the target user, u 2 to u 6 are the common users, and the zero values are the sparse rating scores, which represent the data sparsity problem. The rating matrix and the factors P, B and V contain the same values as in Fig. 1. The EDC approach combines three methods for learning the accurate latent feedback. First, sorting the common users using divide and conquer based on sim u . Second, sorting the common items using divide and conquer based on sim i . Third, sorting the rating matrix using divide and conquer based on both sim u and sim i . EDC uses the k-means algorithm to divide the rating matrix into k clusters. These methods are described as follows.

Divide and Conquer Based on the Similarity of Users
The k-means algorithm is used to divide the rating matrix into k clusters based on the sim u . After dividing the rating matrix into k clusters, Divide and Conquer based on the sim u (DCU) re-sorts the clusters based on the best relation between the latent feedback of users. DCU uses k-means to divide its members into k clusters and merges these clusters based on the lowest RMSE. Figure 4 shows an example of the DCU process with one probability of arrangement. This figure shows how the DCU method divides the rating matrix with the k-means algorithm and arranges the users' rating scores by cluster number. The sorted matrix is evaluated by the CF method to obtain the value of RMSE.
In Fig. 4, the predicted rating scores are affected by the places of the rating scores compared to the original rating matrix in Fig. 3. The arrangement of the rating matrix by cluster number has a positive effect on the predicted rating scores. The DCU method learns the accurate probability of merging the clusters based on the lowest RMSE. Procedure DCU shows all the steps and an example of the rating matrix probabilities in Fig. 4. The proposed method uses three values of k, namely 2, 3 and 4, and these three k's have 4, 8 and 24 probabilities for merging the clusters, respectively.
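The dividing and sorting step of DCU can be sketched as follows. This is a simplified illustration under stated assumptions: a plain k-means with squared Euclidean distance, centroids seeded from the first k rows (an arbitrary deterministic choice), and rows re-ordered by cluster number. The search over cluster merge orders and the RMSE evaluation are not included, and the function names are hypothetical.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two rating vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans_labels(rows, k, max_iter=50):
    """Assign each row (user) to one of k clusters."""
    centroids = [[float(v) for v in row] for row in rows[:k]]  # assumed seeding
    labels = None
    for _ in range(max_iter):
        new_labels = [min(range(k), key=lambda c: sq_dist(row, centroids[c]))
                      for row in rows]
        if new_labels == labels:  # assignments are stable: converged
            break
        labels = new_labels
        for c in range(k):
            members = [row for row, lab in zip(rows, labels) if lab == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def sort_by_cluster(rows, labels):
    """Re-order the rows by cluster number, keeping the original order inside a cluster."""
    order = sorted(range(len(rows)), key=lambda idx: labels[idx])
    return order, [rows[idx] for idx in order]


# Hypothetical rating matrix: users 0 and 2 are similar, as are users 1 and 3.
users = [[5, 5, 0], [1, 0, 1], [5, 4, 0], [0, 1, 1]]
user_labels = kmeans_labels(users, k=2)
user_order, sorted_users = sort_by_cluster(users, user_labels)
```

Applying the same two functions to the transposed matrix clusters and sorts the items instead of the users.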

Divide and Conquer Based on the Similarity of Items
The similarity features of items are an important factor, since items arranged based on sim i give more accurate predictions than dissimilar items. Divide and Conquer based on the sim i (DCI) is proposed to learn the accurate relation between the latent feedback of the items. Figure 5 shows an example of the DCI process, where the items in the rating matrix are divided by the k-means algorithm into two clusters and the columns are sorted by cluster number. The prediction in this example is more accurate than the prediction in Fig. 4, as the DCI method has a lower RMSE than the DCU method.
Procedure DCI shows the whole steps and also the probabilities of clusters merging in Fig. 5.

Divide and Conquer Based on Users Similarity and Items Similarity
Some of the target users get accurate predictions from the DCU method or the DCI method. The method of divide and conquer based on sim u and sim i (DCUI) is proposed to combine the accurate arrangement of the DCU method with the accurate arrangement of the DCI method. Figure 6 shows an example of the DCUI process. The prediction performance of DCUI in this example is higher than the prediction performance in Fig. 4 and lower than the prediction performance in Fig. 5. The probabilities of merging the clusters are used in the DCU method and in the DCI method. Procedure DCUI shows its learning process in four stages.

Procedure DCUI
The approach of EDC combines the three methods of DCU, DCI and DCUI to learn the accurate places of rating scores compared to the original rating matrix.

Ensemble Divide and Conquer Algorithm
The EDC Algorithm shows the main process of this approach. This algorithm rearranges the users and the items in the rating matrix based on the accurate places of the users' rating scores, which reduces the deviation of the rating scores that arises during the streaming process into memory.
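The ensemble step, in which EDC keeps for each target user the method with the lowest RMSE among SVD, DCU, DCI and DCUI, can be sketched as below. The function name `edc_select` and the RMSE values in the example are hypothetical and serve only to illustrate the selection; the per-user RMSE values are assumed to have been computed beforehand.

```python
def edc_select(rmse_by_method):
    """Pick, for every user, the method with the lowest RMSE.

    rmse_by_method maps a method name to a list of per-user RMSE values;
    all lists must have the same length. Ties go to the first method listed.
    """
    methods = list(rmse_by_method)
    n_users = len(rmse_by_method[methods[0]])
    best_method, best_rmse = [], []
    for u in range(n_users):
        winner = min(methods, key=lambda name: rmse_by_method[name][u])
        best_method.append(winner)
        best_rmse.append(rmse_by_method[winner][u])
    return best_method, best_rmse


# Hypothetical per-user RMSE values for three target users.
example = {
    "SVD":  [0.90, 0.50, 0.80],
    "DCU":  [0.70, 0.60, 0.80],
    "DCI":  [0.80, 0.40, 0.90],
    "DCUI": [0.75, 0.45, 0.70],
}
winners, lowest = edc_select(example)
average_rmse = sum(lowest) / len(lowest)
```

Because the minimum is taken per user, the average RMSE of the ensemble can never exceed the average RMSE of any single member method.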

Experimental Results
The Movie Lens dataset is used to test the EDC approach and benchmark its performance against CF and four MF methods. The average results are taken to avoid the fluctuation of the RMSE over all users and to obtain a reliable benchmark. In order to evaluate the prediction accuracy of the EDC approach for the sparse rating scores, the following observations are performed:
• Finding the suitable k for the clustering and merging process through a comparison of the RMSE and the time complexity for each k
• Benchmarking the coverage of the users who benefit from DCU, DCI and DCUI, where the EDC methods test each target user separately
• Comparing the prediction methods for unknown rating scores on the original range [0-5] and the normalized range [0-1] to benchmark the percentage of improvement in prediction quality using EDC and other benchmark methods such as SVD, baseline and the neighbours based on baseline

Best k Cluster, RMSE and Time Complexity
The EDC methods use three values of k, namely 2, 3 and 4, and these three k's have 2, 8 and 24 probabilities for merging the clusters, respectively. These three k's are used to investigate the clustering effect on the latent feedback of members in the rating matrix based on the score range [0-5]. From our feasibility studies, Table 2 shows that the performance with 4 clusters is more accurate than with 2 clusters and 3 clusters. Furthermore, the performance of EDC is more accurate than the SVD, DCU, DCI and DCUI methods. The time complexity of the EDC methods increases with k because the number of probabilities for merging the clusters also increases. However, the time complexity for 4 clusters is small (less than 10 sec) and the prediction accuracy of EDC is improved. Therefore, we use 4 clusters for the validations in the next sections, because 4 clusters give accurate predictions within a suitable processing time.
Table 3 shows the percentages of beneficiaries (users) covered by SVD and the other EDC methods based on the range of rating scores [0-5]. As a result of the different behaviours of users, the response of any target user to any method differs from that of the other target users. Therefore, the total number of target users (beneficiaries) who obtain the highest prediction accuracy from each method is investigated using the EDC approach. The percentage of beneficiaries is obtained by dividing the total number of beneficiaries by the total number of users. For instance, if 7 of all users get the highest accuracy using SVD, then the coverage ratio is 7 divided by 943, which gives 0.74%. EDC has the highest beneficiary coverage, 99.26%, compared to 0.74% for the SVD method (refer to Equation 6), where the EDC approach browses all target users who got the lowest RMSE using SVD or any method of EDC.
Therefore, EDC has improved the performance of SVD through the similarity of users, the similarity of items and the combination of them. DCI has more beneficiaries than DCU. Their combination, DCUI, covers 40% of the beneficiaries, which means DCUI is more accurate than DCU and DCI. EDC collects all beneficiaries based on pairwise comparison. Therefore, EDC solves the problems of data sparsity and the deviation of rating scores, and it has improved the relation of latent feedback of the rating matrix. Furthermore, the results indicate that the latent feedback is more effective and more accurate for prediction based on the EDC approach.
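The coverage ratio used above is a simple count of winning users divided by the total number of users. The sketch below reproduces the arithmetic of the SVD example (7 beneficiaries out of 943 users); the function name `coverage_percent` is hypothetical, and grouping all remaining users under a single label is an illustrative simplification, since the text does not give the full split among DCU, DCI and DCUI.

```python
from collections import Counter


def coverage_percent(best_method_per_user):
    """Percentage of users for whom each method gives the lowest RMSE."""
    counts = Counter(best_method_per_user)
    total = len(best_method_per_user)
    return {method: 100.0 * count / total for method, count in counts.items()}


# 7 users are best served by SVD; the other 936 by one of the EDC methods.
winners = ["SVD"] * 7 + ["DCU/DCI/DCUI"] * 936
coverage = coverage_percent(winners)
print(round(coverage["SVD"], 2))           # 0.74
print(round(coverage["DCU/DCI/DCUI"], 2))  # 99.26
```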

Normalization Effect
The small range of the known rating scores [0-1] gives a higher prediction performance than the big range [0-5]. Therefore, these methods are implemented using both range scales in order to compare them. Table 4 shows the average RMSE of each method and a high percentage of improvement on [0-1] compared to [0-5], where the performance of all validation methods increases on the scale [0-1]. The neighbours-base method is more accurate than Baseline but less accurate than CF. The shortcoming of the neighbours-base method is that its time complexity is high and not suitable for a big rating matrix; e.g., in our experiments EDC takes 7 sec for each user, compared to the neighbours-base method, which takes 485 sec for each user. MF is also more accurate than CF, SVD, Baseline and neighbours-base. The proposed EDC approach returns the lowest RMSE compared to the CF, Baseline, neighbours-base and MF methods.

Discussion and Conclusion
The main problems of CF are data sparsity, scalability and cold start (Zhang et al., 2011). The neighbourhood model is one of the most successful approaches used to solve the sparsity problem and obtain accurate recommendations, even when few ratings are available among the neighbours of items. A disadvantage of the neighbourhood approach is the low number of neighbours who can provide accurate predictions. The Baseline method is used to extract the base features of users and items, and SVD is one of the most accurate and scalable algorithms for prediction and for solving the challenges of data sparsity. MF has achieved accurate prediction performance compared to the CF, SVD, Baseline and neighbours-base methods. As a result of streaming the rating scores of users for items, some rating scores are arranged in imprecise places or far from similar rating scores, which gives imprecise latent feedback. Therefore, the main purpose of the EDC approach is to manage the deviation of the rating scores in order to get the best interaction between users and items, which affects the two important latent factors founded by the SVD method. EDC uses the k-means algorithm to divide the common users and common items into k clusters. The EDC approach rearranges the misplaced rating scores in the rating matrix by learning the accurate latent feedback of users and items based on the lowest values of RMSE. The experimental results of EDC give high accuracy for the prediction of sparse rating scores compared to CF and the four existing MF methods in this study.
Some of the existing MF methods, such as baseline and neighbours-base, are less accurate than CF because these methods give some predicted values larger than the range scale of the rating scores (overfitting). Finally, EDC produces accurate latent feedback of users and items based on SVD factors, which are more important for prediction than the base features, neighbours-base features and MF features. In future work, the divide and conquer process will be integrated with the latent features of the rating matrix based on the MF methods.