Cross-Language Semantic Similarity of Arabic-English Short Phrases and Sentences

Abstract: Measuring cross-language semantic similarity between short texts is a task that is challenging even in terms of human understanding. This paper addresses this problem through a study of Arabic–English semantic similarity in short phrases and sentences. A human-rated benchmark dataset was carefully constructed for this research. Dictionary and machine translation techniques were employed to determine the relatedness between the cross-lingual texts from a monolingual perspective. Three algorithms were developed to rate the semantic similarity and these were applied to the human-rated benchmark. An averaged maximum-translation similarity algorithm was proposed using the term sets produced by the dictionary-based technique. Noun-verb and term vectors obtained by the Machine Translation (MT) technique were also used to compute the semantic similarity. The results were compared with the human ratings in our benchmark using the Pearson correlation coefficient and were triangulated with the best, worst and mean ratings across all human participants. The MT-based term vector semantic similarity algorithm obtained the highest correlation (r = 0.8657), followed by the averaged maximum-translation similarity algorithm (r = 0.7206). Further statistical analysis showed no significant difference between either algorithm and the humans' judgement.


Introduction
Semantic similarity is a measure of the connection between two words in a text in terms of the idea conveyed. Semantic similarity in natural language engineering has seen increasing demand of late across a wide range of applications, including linguistics, cognitive science, information retrieval, biomedical informatics and geo-informatics. Semantic relatedness is an extension of semantic similarity: for example, cars and petrol can be seen as more closely related than cars and bicycles, but the latter pair is certainly more similar (Resnik, 1999). Semantic similarity has been widely explored beyond the word unit to the sentence unit in the monolingual domain (Li et al., 2006; O'Shea et al., 2008; Bar et al., 2012; Jimenez et al., 2012; Rios, 2014).
Cross-language semantic similarity is more challenging than monolingual similarity because the semantic relations of terms are evaluated between two different languages. Research studies have found cross-language semantic similarity necessary for improving performance in a number of applications, including Machine Translation (MT) (Zou et al., 2013), Cross-Language Information Retrieval (CLIR) (Zhou et al., 2012) and plagiarism detection across different languages (Barrón-Cedeño et al., 2013). There is certainly a need for research on the semantic similarity of short texts in the cross-language domain. In this study, we propose two pre-processing models of semantic similarity for Arabic-English cross-language sentences. The first model includes dictionary-based translation, where an Arabic text is converted into terms, which are then translated into English. The similarity of this English translation is then measured against its English candidate text using the proposed maximum-translation similarity approach. The second model involves using MT followed by a semantic similarity measure of the two texts, based on the algorithms proposed by Lee (2011) and Li et al. (2006). Experimental work was conducted on a human-rated benchmark created from a standard, ground-truth dataset.
The remainder of this paper is structured as follows. Section 2 provides an overview of the literature on word-to-word, text-to-text and cross-language semantic similarity techniques, as applied to words or sentences. Section 3 is split into three subsections, each explaining the various proposed algorithms. The algorithms described in Section 3.1 are used for the pre-processing and general framework; those in Section 3.2 are for the dictionary-based technique, namely averaged maximum-translation similarity; and those in Section 3.3 are for the MT-based techniques, namely the noun-verb vector-based and term vector-based similarity algorithms. Section 4 presents the experimental design, including the tools and packages used in this study and the datasets, involving short phrases from human language understanding and the constructed benchmark dataset. Section 5 presents the results and discussion of findings and, finally, Section 6 provides conclusions and recommendations for future research.

Word Semantic Similarity Techniques
Semantic similarity, semantic distance, semantic relations, or more broadly semantic relatedness are all terms used interchangeably in the literature to describe the extent to which term A can be used to indicate or replace term B. Semantic features exploit terms with semantic relations, such as synonyms, antonyms, hyponyms and hypernyms (Solé-Ribalta, 2014; Liu et al., 2012; Luo et al., 2011), or semantic dependencies (Li et al., 2006; Muftah, 2009).
HowNet is an online knowledge-based database that relates concepts and attributes of concepts. It is organized into a hierarchy in which each concept is described by a series of attributes called sememes (Dai et al., 2008). WordNet is a lexical database for English (Miller, 1995), which arranges words with the same meanings into groups called synsets. The words are then linked with more abstract concepts called hypernyms and more specific concepts called hyponyms. Some knowledge-based metrics are based on a single taxonomy or, more precisely, on the directed-acyclic graph that demarcates the boundaries between two concepts in the taxonomy. These measures can be called mono-taxonomy metrics, as summarised in Table 1. A mono-taxonomy metric was proposed to evaluate the Information Content (IC) of two concepts based solely on the HowNet taxonomy (Bin et al., 2012). Unlike the originally proposed IC measure, which depends on WordNet and a corpus (Resnik, 1995), the IC metric in (Bin et al., 2012) was computed from HowNet on the premise that concepts with many hyponyms convey less information than concepts located at the leaves, as follows:

IC(c) = 1 - log(hypo(c) + 1) / log(max_hn)

where hypo(c) is the number of hyponyms of a given sememe and max_hn is the maximum number of sememes in the taxonomy. The similarity between two concepts was then calculated on the basis of this modified IC measure. Dai et al. (2008) proposed a semantic similarity measure between two concepts based on the semantic similarity of their primary sememes in their concept hierarchy. The primary sememe of a concept, e.g., doctor, is the top term that describes the concept in the tree, which is human, whereby other sememes in the tree, such as status and education, are modifiers of it. The similarity between two concepts in Dai et al. (2008) was computed using the li metric (Li et al., 2006) and the number of common sememes between them, where n is the total number of sememes for both concepts and hypo is the number of common sememes. Zhang et al. (2014a) proposed a word semantic similarity measure that combines features obtained from the HowNet taxonomy for both concepts: the depth, width, density and overlap of the tree that holds the two concepts, weighted by four scaling parameters α, β, γ and λ, on the belief that the more features are considered, the closer the result comes to what humans perceive. There have been many other similarity measures proposed based on WordNet, including path, lch (Leacock and Chodorow, 1998), wup (Wu and Palmer, 1994), res (Resnik, 1995), lin (Lin, 1998) and jcn (Jiang and Conrath, 1997). A recent study by Meng et al. (2014) suggested a new metric that combines information density with the path metric, while the earlier study of Li et al. (2006) proposed a semantic similarity combining the shortest path between two words, w1 and w2, with the depth of their Least Common Subsumer (LCS) in the taxonomy containing both words. The metric of Meng et al. (2014) showed more accurate results and outperformed Li et al. (2006) in terms of the similarity coefficient because it reflects not only the semantic density information but also the path information. The description and mathematical representation of several word semantic similarity metrics are shown in Table 1.
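The hyponym-count IC of Bin et al. (2012) is straightforward to compute. The sketch below assumes a hypothetical maximum sememe count max_hn for the taxonomy; the value used here is purely illustrative:

```python
import math

def hownet_ic(hypo_count, max_hn):
    """IC(c) = 1 - log(hypo(c) + 1) / log(max_hn): sememes with many
    hyponyms convey less information than leaf sememes (Bin et al., 2012)."""
    return 1 - math.log(hypo_count + 1) / math.log(max_hn)

# A leaf sememe (no hyponyms) carries maximal information content:
assert hownet_ic(0, 1618) == 1.0
# A heavily branching sememe carries strictly less than a lightly branching one:
assert 0.0 < hownet_ic(200, 1618) < hownet_ic(10, 1618)
```

The log normalisation bounds the measure in [0, 1], so leaf concepts receive IC 1 and the root receives the smallest value.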
Table 1. Mono-taxonomy word semantic similarity metrics.
- path metric (WordNet): the shortest link between two concepts, sim(c1, c2) = 1 / path_length(c1, c2).
- lch metric (Leacock and Chodorow, 1998; WordNet): the shortest path between two terms scaled by the overall depth D of the taxonomy, lch(c1, c2) = -log(path_length(c1, c2) / 2D).
- wup metric (Wu and Palmer, 1994; WordNet): based on the depths of the two concepts and of their Least Common Subsumer (LCS), wup(c1, c2) = 2·depth(LCS(c1, c2)) / (depth(c1) + depth(c2)).
- res metric (Resnik, 1995; WordNet): computes a similarity score of two concept synsets as the IC of their LCS, res(c1, c2) = IC(LCS(c1, c2)), where IC(c) = -log P(c) and P(c) is the probability that c can be found in a standard textual corpus (Fernando and Stevenson, 2008); equivalently, res is the maximum IC over the set S of concepts that subsume both concepts.
- lin metric (Lin, 1998; WordNet): based on the res metric and the IC of the words' synsets, lin(c1, c2) = 2·IC(LCS(c1, c2)) / (IC(c1) + IC(c2)).
- jcn metric (Jiang and Conrath, 1997; WordNet): based on the IC of the two synsets and of their LCS, jcn(c1, c2) = 1 / (IC(c1) + IC(c2) - 2·IC(LCS(c1, c2))).

Cross-taxonomy metrics, on the other hand, use multiple knowledge-based taxonomies and may work across domains. Examples of these metrics include the lesk (Banerjee and Pedersen, 2003) and hso (Hirst and St Onge, 1998) metrics, which measure semantic relatedness rather than similarity (Corley and Mihalcea, 2005; Budanitsky and Hirst, 2006). Statistical methods for semantic similarity, such as Latent Semantic Analysis (LSA) (Landauer et al., 1998) and Pointwise Mutual Information (PMI) (Turney, 2001), have been derived from large text corpora.
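Several of the mono-taxonomy metrics can be illustrated on a toy taxonomy. The hierarchy below is a made-up stand-in for WordNet and only the wup metric is implemented:

```python
# Toy taxonomy: child -> parent (the root has parent None)
parents = {
    "entity": None,
    "animal": "entity", "vehicle": "entity",
    "dog": "animal", "cat": "animal",
    "car": "vehicle", "bicycle": "vehicle",
}

def ancestors(c):
    """Chain of nodes from c up to the root, inclusive."""
    chain = [c]
    while parents[c] is not None:
        c = parents[c]
        chain.append(c)
    return chain

def depth(c):
    return len(ancestors(c))  # the root has depth 1

def lcs(c1, c2):
    """Least Common Subsumer: the nearest shared ancestor."""
    a2 = set(ancestors(c2))
    for node in ancestors(c1):  # walk upward from c1
        if node in a2:
            return node

def wup(c1, c2):
    """Wu-Palmer similarity: 2*depth(LCS) / (depth(c1) + depth(c2))."""
    return 2 * depth(lcs(c1, c2)) / (depth(c1) + depth(c2))

assert wup("dog", "cat") == 2 * 2 / (3 + 3)  # LCS is "animal"
assert wup("dog", "car") == 2 * 1 / (3 + 3)  # LCS is "entity"
```

Siblings under a deep common ancestor score higher than concepts that only meet at the root, which is the intuition behind all depth-based metrics in Table 1.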
Ontology-based semantic similarity measures have been proposed in recent research studies (Jian-Bo et al., 2013; Sánchez et al., 2012; Ye and Zhan-Lin, 2010). Ontologies are constructed for several domains to structure concepts in a way that supports logical reasoning and semantic information. One ontology-based semantic similarity metric was defined over the sets of abstract concepts, C1 and C2, for the terms c1 and c2, respectively (Ye and Zhan-Lin, 2010). Jian-Bo et al. (2013) recommended the development of an ontology-based measurement that combines a graph-based approach with features extracted from the ontology containing both concepts. The semantic similarity of words has also been studied using multiple dictionaries (Zhang et al., 2014b). In addition, the degree of commonality between concepts belonging to multiple ontologies has been used to modify the IC semantic similarity of concept pairs across ontologies (Solé-Ribalta, 2014; Batet et al., 2014). Further, multiple trees were constructed from taxonomic relations among entities in an ontology and a multi-tree concept semantic similarity measure was proposed based on the following: (i) a combined tree of features, (ii) updated weights for the nodes in the combined tree and (iii) the premise that the similarity of two concepts is basically the weight of the root in the combined tree that holds both entities (Hajian and White, 2011).

Short-text Semantic Similarity Techniques
Because natural language understanding requires more than the semantic similarity of words, several research studies have investigated short-text semantic similarity based on the semantic relations of their words (Li et al., 2006;O'Shea et al., 2008;Lee, 2011;Corley and Mihalcea, 2005;Mihalcea et al., 2006). A survey of studies that have semantically evaluated textual elements is presented below.
Two candidate texts, T1 and T2, can be represented by concept vectors and the similarity between them evaluated using the Cosine, Jaccard, Dice, or any other similarity coefficient (Alzahrani et al., 2012). For example, the texts can be represented by binary vectors with an entry of 1 if the concept is in the joint word matrix and 0 otherwise, where the joint word matrix W consists of the distinct words in both texts (Fernando and Stevenson, 2008). The similarity score was computed as the product of the binary vectors with the word similarity matrix:

sim(T1, T2) = (v1 · W · v2^T) / (|v1| |v2|)

where v1 and v2 are the binary vectors of texts T1 and T2 and W holds the word-to-word similarities. Apart from the binary vectors in the previous study, an earlier study suggested using the Inverse Document Frequency (IDF) measure combined with a local similarity metric, implemented by any of the word similarity measures (Corley and Mihalcea, 2005; Mihalcea et al., 2006). The semantic similarity of the two texts was derived from the maximum similarity gained by a word w from T1 against the words in T2, referred to as maxSim(w, T2), weighted by idf(w) = log(N/n_w), where n_w is the number of documents that contain the word w and N is the total number of documents in a large text corpus:

sim(T1, T2) = (1/2) [ Σ_{w∈T1} maxSim(w, T2)·idf(w) / Σ_{w∈T1} idf(w) + Σ_{w∈T2} maxSim(w, T1)·idf(w) / Σ_{w∈T2} idf(w) ]

Lee (2011) reported a short-text similarity measure computed from the nouns and verbs alone, on the grounds that the semantic similarity should be obtained in a fast but accurate way. Lee's study implemented a Noun Vector (NV) containing a joint noun set from two candidate texts, T1 and T2, and a Verb Vector (VV) containing a joint verb set from T1 and T2. The value of an entry in the NV vector (respectively, the VV vector) was defined as the highest wup similarity (Wu and Palmer, 1994) found between the corresponding noun and the other nouns in the NV vector (respectively, the corresponding verb and the other verbs in the VV vector).
The similarity score between the two texts integrated the noun vector similarity S_N and the verb vector similarity S_V as follows:

S(T1, T2) = δ·S_N + (1 − δ)·S_V

where δ is a scaling parameter ∈ [0.5, 1] and both S_N and S_V are computed as the cosine similarity between the noun vectors and between the verb vectors from T1 and T2, respectively. A study by Li et al. (2006) proposed a semantic similarity measure between sentences derived from a semantic similarity and an order similarity as follows:

S(T1, T2) = δ·S_s + (1 − δ)·S_r

where S_s is the semantic similarity metric and S_r is the order similarity metric. S_s is computed as the cosine similarity of the two semantic vectors:

S_s = (s1 · s2) / (‖s1‖ ‖s2‖)

The value of an entry in the semantic vector s1 for text T1 (and s2 for text T2) is defined as the highest li-metric word similarity between the word w_i and any word in the candidate text, weighted by the information content (IC) of the two words. Further, the order similarity, S_r, reflects that different word orders may convey different meanings and should be counted in the semantic similarity. Given the word order vectors r1 and r2 from T1 and T2, respectively, the order similarity was obtained as:

S_r = 1 − ‖r1 − r2‖ / ‖r1 + r2‖

The above short-text similarity methods have been implemented on mono-language texts and compared thoroughly using the English sentence-pairs benchmark, as reported in many research studies (Lee, 2011; Islam and Inkpen, 2008). The performance of the reported methods is not directly comparable due to different evaluation metrics and datasets. However, from these studies we can see that methods that take more textual features into consideration, such as the combined metric of semantic and order similarities of Li et al. (2006), achieved a high correlation coefficient with the human ratings but at the cost of slow performance.
On the other hand, methods that reduce the textual features, such as Lee's (2011) combined metric of noun and verb vector similarities, may have a lower correlation coefficient with human intuition but offer the advantage of faster computation.
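The combination scheme of Li et al. (2006) can be sketched as follows; the semantic and order vectors are hypothetical inputs and δ = 0.85 is an illustrative choice within the admissible range:

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def order_similarity(r1, r2):
    """Li et al. (2006) word-order similarity: 1 - |r1 - r2| / |r1 + r2|."""
    diff = math.sqrt(sum((a - b) ** 2 for a, b in zip(r1, r2)))
    plus = math.sqrt(sum((a + b) ** 2 for a, b in zip(r1, r2)))
    return 1 - diff / plus if plus else 1.0

def combined_similarity(s1, s2, r1, r2, delta=0.85):
    """S = delta * S_s + (1 - delta) * S_r."""
    return delta * cosine(s1, s2) + (1 - delta) * order_similarity(r1, r2)

# Identical semantic and order vectors yield similarity 1.0:
assert abs(combined_similarity([1, 0.4, 0], [1, 0.4, 0], [1, 2, 3], [1, 2, 3]) - 1.0) < 1e-9
```

Swapping two words leaves the semantic vectors unchanged but perturbs the order vectors, so only the S_r term penalises the reordering.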
Following the aforementioned studies, the ongoing series of computational semantic evaluations called SemEval has embodied several systems and methods in 2012, 2013, 2014 and 2015. Several tasks for semantic relatedness evaluation and paraphrase detection on various datasets have been investigated (Agirre et al., 2012; Agirre et al., 2013). Rich semantic analysis has been conducted on new datasets, such as using Twitter data (Xu et al., 2015) or by featuring interpretability (Agirre et al., 2015). Generally these studies employ a variety of NLP tools, including lemmatizers, POS taggers, word sense disambiguation and syntax features. To obtain the similarity score, the methods differ in how they feature the semantic notions among words and phrases in the candidate texts. Many studies employed the WordNet lexical database and its semantic similarity measures, while others used the Wikipedia knowledge base. Several studies investigated the use of semantic role labelling, distributional thesauri and dictionaries, Machine Translation (MT) and machine learning algorithms. None of these tasks tackles cross-language semantic similarity, though it may be forthcoming in 2016. Although SemEval 2014's task 10 was entitled "multilingual" semantic textual similarity (Agirre et al., 2014), it was separated into an English subtask and a Spanish subtask, whereby each subtask was evaluated via monolingual datasets.

Cross-Language Semantic Similarity Techniques
Owing to the substantial increase in text data available in multiple languages, there has been much recent research investigating semantic similarity measures across languages (Zou et al., 2013; Dai et al., 2008; Stoyanova et al., 2013; Vulic and Moens, 2014; Dai and Huang, 2011). Dai and Huang (2011), for example, tested the effectiveness of a word semantic similarity measure for applications in the cross-language domain. They computed the similarities between words using an algorithm based on the Chinese-English HowNet. Their results showed a strong positive correlation with human judgements, suggesting it would be a robust measure for use in cross-language applications. Additional studies (Vulic and Moens, 2014) proposed approaches that identified similar words across languages: two words in different languages are similar if they generate similar words as their top semantic word responses. Semantic word responding is a cognitive-science term for the terms that humans associate with a given cue word. A study conducted by Wu et al. (2010) explored how to generate semantic classes of verbs across languages using parallel corpora.
Methods for cross-language identification of semantic relations have been proposed recently. One example is Stoyanova et al. (2013), which combined word semantic similarity measurements with the morphology and semantic relations obtained from WordNet. An automatic classifier was trained on parallel and comparable English-Bulgarian texts to perform semantic relations labelling and reduce word sense ambiguities. Zou et al. (2013) proposed a method that captures both mono and cross-lingual semantic relations across different languages. The method they proposed stored the bilingual embeddings between Chinese and English from a large unlabelled corpus while utilizing MT to align words with the same meanings.
Complementary to explicit semantic analysis, which uses Wikipedia as a knowledge base, Cross-Language Explicit Semantic Analysis (CL-ESA) has gained popularity in recent years (Sorg and Cimiano, 2010; Anderka et al., 2009) for computing semantic relatedness between words from different languages. Generally, CL-ESA works by mapping both the query q and the document collection d into a multilingual concept space. In (Anderka et al., 2009), the mathematical representation of CL-ESA was simplified as a product involving q_j^T, the transpose of the query vector q_j, and G_j, the product of the term-document matrices built from the query q_j and the candidate indexed documents from Wikipedia, D*. A contribution by Navigli and Ponzetto (2012a) proposed a multilingual semantic similarity approach that used BabelNet, a knowledge-rich lexicon and semantic database supporting multiple languages. The proposed approach works by intersecting the semantic graphs from different languages into one core graph and computing the semantic similarity score on that core graph. Given two words w1 and w2 from two languages l1 and l2, respectively, and the core graph G_joint formed by intersecting the graphs generated from BabelNet for w1 and w2 in L different languages, the semantic relatedness between the two words was computed by Navigli and Ponzetto (2012a) as the maximum similarity over their senses:

rel(w1, w2) = max_{s1, s2} sim(s1, s2)

where s1 and s2 range over the different senses of w1 and w2, respectively and G is the subgraph that holds each pair of senses. The similarity score for a pair of senses was computed as:

sim(s1, s2) = max_{p ∈ paths(s1, s2)} 1 / length(p)

where paths(s1, s2) is the set of all possible paths between s1 and s2 in the subgraph G and length(p) is the number of nodes in a path p. The method obtained competitive results compared with traditional monolingual and multilingual measures. Though it works on words, it can be expanded in various ways to compute the semantic similarity of cross-language texts beyond words.
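Since the score is maximised by the shortest path between two senses, the sense-level similarity reduces to one over the shortest path length. This can be sketched on a toy joint graph; the graph and sense labels here are purely illustrative:

```python
from collections import deque

def shortest_path_len(graph, s1, s2):
    """Number of nodes on the shortest path between two senses (BFS)."""
    queue, seen = deque([(s1, 1)]), {s1}
    while queue:
        node, length = queue.popleft()
        if node == s2:
            return length
        for nxt in graph.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, length + 1))
    return None  # disconnected senses

def sense_similarity(graph, s1, s2):
    """max over paths of 1/length(p) = 1 / (shortest path node count)."""
    n = shortest_path_len(graph, s1, s2)
    return 1 / n if n else 0.0

# Toy joint semantic graph (adjacency lists over hypothetical sense labels):
graph = {"bank#1": ["money#1"], "money#1": ["bank#1", "deposit#1"], "deposit#1": ["money#1"]}
assert sense_similarity(graph, "bank#1", "deposit#1") == 1 / 3
```

Closely connected senses receive a score near 1, while senses joined only by long chains through the core graph receive scores that decay with path length.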

General Framework
The pre-processing algorithm was divided into two parts: One for the English text and one for the Arabic candidate, as shown in Fig. 1.
For the English text, A, the pre-processing steps included: (i) tokenization, whereby the text was divided into word tokens referred to as [W]; (ii) Part-Of-Speech (POS) disambiguation, in which each token was assigned a POS tag, namely noun, verb, adjective or adverb, referred to as [N], [V], [AJ] and [AV], respectively; (iii) removal of stop words such as prepositions and articles; and (iv) for each word token w_i in A, finding the set of lemmas λ_i given the corresponding POS tag for that word (note that in many cases there is one lemma per word token), as follows:

λ_i = {λ_{i,1}, λ_{i,2}, λ_{i,3}, ..., λ_{i,x}}

where x is the number of different lemma forms that can be found for the word using WordNet. A term set was then constructed from the English sentence A as the union of the sets t_1, t_2, ..., t_n, i.e., T_1 = ∪_{i=1}^{n} t_i. The same processing steps were applied to the Arabic text, B, with the addition of translation. First, B was split into word tokens, POS tagging was applied and the most frequent Arabic words were removed. Each word token w_i was then reduced to its lemma (Roth et al., 2008). Knowing the lemma of each Arabic word as well as its POS tag, an Arabic-to-English dictionary translation was applied to obtain the possible senses (i.e., meanings) for that word in English. As a final step in the pre-processing of the Arabic text, the translation term set was constructed from the Arabic sentence as TT_2 = ∪_{j=1}^{m} t_j, where m is the number of unique terms and t_j is the translation subset of lemma l_j. Figure 2 shows the general framework for this study. After the input texts A and B were pre-processed, we employed three different algorithms.
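The English branch of the pre-processing can be sketched as follows; the stop-word list and the mini lemma lexicon standing in for the WordNet lookup are hypothetical placeholders:

```python
STOP_WORDS = {"the", "a", "an", "of", "in", "on", "is", "are"}

# Hypothetical mini lemma lexicon standing in for WordNet: (token, POS) -> lemma set
LEMMAS = {("cocks", "N"): {"cock"}, ("crowed", "V"): {"crow"}}

def preprocess_english(text, pos_tags):
    """Tokenise, drop stop words and collect the union of lemma sets, i.e. T1."""
    tokens = text.lower().split()
    term_set = set()
    for token, pos in zip(tokens, pos_tags):
        if token in STOP_WORDS:
            continue  # step (iii): stop-word removal
        # step (iv): lemma lookup given the POS tag; fall back to the token itself
        term_set |= LEMMAS.get((token, pos), {token})
    return term_set

assert preprocess_english("the cocks crowed", ["DT", "N", "V"]) == {"cock", "crow"}
```

The Arabic branch would follow the same shape, with an Arabic lemmatizer and a lemma-to-English dictionary lookup producing TT_2 instead of T_1.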
Following the dictionary translation technique, we proposed an averaged maximum-translation similarity algorithm between the term set, referred to as T_1 (obtained from the English text), and the translated term set, referred to as TT_2, to estimate the cross-lingual semantic similarity. The semantic similarity scores between the terms were then correlated and averaged, as proposed by Yerra and Ng (2005).
Following the MT technique, we obtained an English version of B, from which a term set, denoted T_2, was constructed in the same way as for the English text. On this path, we used two vector-based semantic similarity algorithms proposed for mono-lingual sentences: one based on the combined similarity between the noun and verb vectors obtained from both texts, proposed by Lee (2011), and the other based on the semantic similarity of term vectors, suggested by Li et al. (2006).

Dictionary-Based Technique
This method was proposed and implemented as a copy detection approach (Yerra and Ng, 2005). The algorithm uses two input vectors, namely T 1 and TT 2 , constructed from A and B, respectively.

Algorithm 1: Averaged Maximum-Translation Similarity
Step 1: Each term in B was correlated with the terms in the English text A. Given T_1 and TT_2, the representative term sets of A and B, respectively, each term t_i ∈ T_1 was assigned its maximum-translation similarity:

MaxSim(t_i, TT_2) = max_{t_j ∈ TT_2} wup(t_i, t_j)

where MaxSim refers to the maximum word semantic similarity obtained between the term t_i and the translated terms, and the wup metric (Wu and Palmer, 1994) is one of the knowledge-based semantic similarity measures between two terms c_1 and c_2 found useful in our previous work (Alzahrani et al., 2015):

wup(c_1, c_2) = 2·depth(LCS(c_1, c_2)) / (depth(c_1) + depth(c_2))

Step 2: The degree of similarity between the candidate texts was computed as the averaged summation:

Sim(A, B) = (1/n) Σ_{i=1}^{n} MaxSim(t_i, TT_2)

where n is the number of terms in T_1.
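Algorithm 1 can be sketched as follows; the word-similarity callback is a placeholder for the wup metric, here replaced by exact string match purely for illustration:

```python
def max_sim(term, translated_terms, word_sim):
    """MaxSim: the highest word similarity between a term and any translated term."""
    return max((word_sim(term, t) for t in translated_terms), default=0.0)

def averaged_max_translation_sim(T1, TT2, word_sim):
    """Average, over the terms of A, of each term's best match in the
    translated term set of B (averaged maximum-translation similarity)."""
    if not T1:
        return 0.0
    return sum(max_sim(t, TT2, word_sim) for t in T1) / len(T1)

# Exact-match word similarity as a stand-in for the wup metric:
exact = lambda a, b: 1.0 if a == b else 0.0
assert averaged_max_translation_sim({"cock", "crow"}, {"cock", "rooster"}, exact) == 0.5
```

In the actual algorithm, word_sim would be the WordNet wup measure, so "crow" and "rooster" would contribute a partial score instead of zero.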

Machine Translation-Based Techniques
Machine translation techniques have improved over recent years and have become, in some languages, almost as accurate as human translation for short phrases and sentences. When the Arabic candidate text is translated into English as a pre-processing step, the problem is shifted into a mono-lingual sentence semantic similarity problem. We decided to use the vector similarity methods proposed in earlier research studies, which obtained promising results (Li et al., 2006; Lee, 2011).

Algorithm 2: Noun-Verb Vector Based Similarity
In this algorithm, we employed MT followed by the mono-lingual semantic similarity method (Lee, 2011).
The algorithm was based on the combined similarity between the noun and verb vectors from the two texts. The following steps were implemented: Step 1: B was translated into English using Google Translate API.
Step 2: The translated text was pre-processed in the same way as for the English text.
Step 3: The term vector was constructed from the translated text B as T_2 = ∪_{j=1}^{m} t_j, where m is the total number of unique terms.
Step 4: A joint noun set from the two candidate texts, A and B, was constructed as the Noun Vector (NV), where the value of an entry was defined as the maximum word semantic similarity (MaxSim, as described in Algorithm 1) found between the corresponding noun and the other nouns in the NV vector.
Step 5: Similarly, a joint verb set, VV, was constructed from A and B, and verb vectors containing the maximum wup similarity between each verb and the other verbs in the VV vector were obtained. Step 6: The cosine similarity between the noun vectors and between the verb vectors was then computed as follows:

Sim_N(T_1, T_2) = cos(NV_1, NV_2) = (NV_1 · NV_2) / (‖NV_1‖ ‖NV_2‖)
Sim_V(T_1, T_2) = cos(VV_1, VV_2) = (VV_1 · VV_2) / (‖VV_1‖ ‖VV_2‖)
Step 7: The similarity score was computed from the noun vector similarity Sim_N and the verb vector similarity Sim_V as Sim = δ·Sim_N + (1 − δ)·Sim_V, where δ is a scaling parameter ∈ [0.5, 1].
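Steps 4 to 7 can be sketched as below, with exact string match standing in for the wup metric and δ fixed at an illustrative 0.75 (any value in [0.5, 1] is admissible):

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def joint_vectors(terms_a, terms_b, joint, word_sim):
    """For each word in the joint set, the best similarity against each text's words."""
    def vec(terms):
        return [max((word_sim(w, t) for t in terms), default=0.0) for w in joint]
    return vec(terms_a), vec(terms_b)

def noun_verb_similarity(nouns_a, nouns_b, verbs_a, verbs_b, word_sim, delta=0.75):
    NV = sorted(set(nouns_a) | set(nouns_b))  # joint noun set
    VV = sorted(set(verbs_a) | set(verbs_b))  # joint verb set
    nv_a, nv_b = joint_vectors(nouns_a, nouns_b, NV, word_sim)
    vv_a, vv_b = joint_vectors(verbs_a, verbs_b, VV, word_sim)
    return delta * cosine(nv_a, nv_b) + (1 - delta) * cosine(vv_a, vv_b)

exact = lambda a, b: 1.0 if a == b else 0.0
# Identical noun and verb sets give similarity 1.0:
assert noun_verb_similarity(["boy"], ["boy"], ["run"], ["run"], exact) == 1.0
```

Restricting the vectors to nouns and verbs is what makes this variant faster than the full term-vector method, at the cost of ignoring the remaining parts of speech.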

Algorithm 3: MT-based Term Vector Based Similarity
The algorithm was based on the following steps: Step 1: B was translated into English using Google Translate.
Step 2: The translated text was pre-processed as in the previous algorithm and the term vector T 2 was constructed.
Step 3: A joint term set from the two candidate texts, A and B, was constructed and referred to as the term vector TV, where the value of an entry was defined as the maximum word semantic similarity (MaxSim) found between the corresponding term and the other terms in the candidate text; the li similarity metric (Li et al., 2006) was used in this algorithm to find MaxSim between any two terms.
Step 4: The cosine similarity between the term vectors was computed as Sim(T_1, T_2) = cos(TV_1, TV_2) = (TV_1 · TV_2) / (‖TV_1‖ ‖TV_2‖).
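A sketch of the term-vector similarity, with exact string match standing in for the li metric:

```python
import math

def term_vector_similarity(terms_a, terms_b, word_sim):
    """Cosine similarity of the two joint-term vectors TV1 and TV2."""
    TV = sorted(set(terms_a) | set(terms_b))  # joint term set
    v1 = [max((word_sim(w, t) for t in terms_a), default=0.0) for w in TV]
    v2 = [max((word_sim(w, t) for t in terms_b), default=0.0) for w in TV]
    dot = sum(a * b for a, b in zip(v1, v2))
    norm = math.sqrt(sum(a * a for a in v1)) * math.sqrt(sum(b * b for b in v2))
    return dot / norm if norm else 0.0

exact = lambda a, b: 1.0 if a == b else 0.0
# One shared term out of three joint terms gives a partial score:
sim = term_vector_similarity({"ram", "sheep"}, {"sheep", "ewe"}, exact)
assert abs(sim - 0.5) < 1e-9
```

With the li metric in place of exact matching, near-synonyms such as "ram" and "ewe" would raise the off-diagonal entries and hence the final score.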

Tools and Packages
For the pre-processing of the English and Arabic input texts, we employed the Stanford NLP tools (Toutanova et al., 2003;Monroe et al., 2014). We also used the NLTK (Edward and Steven, 2002) for various tasks including the computation of WordNet-based semantic similarity metrics.

Datasets
To evaluate the proposed methods, we used sentence pairs annotated with ground-truth human similarity scores. Each pair consists of one English sentence and one Arabic sentence, which may be similar (or dissimilar) to the English sentence to some degree. For our initial investigation, selected sentences from books on natural language understanding, with similarity scores close to human similarity intuition, were used (Li et al., 2006) (Section 4.2.1). Moreover, a cross-language similarity benchmark was constructed to evaluate the proposed techniques (Section 4.2.2).

Selected NLP Sentences
In our initial investigation, the sample of sentence pairs was used as follows: (i) the second sentence in each pair was translated into Arabic by a native speaker of Arabic, educated to graduate level and fluent in English as a second language; (ii) the translations were validated (and in some cases modified) by two language experts who speak both languages; (iii) a number of pairs from the sample proposed by Li et al. (2006) were excluded because they were too short and the remaining ten pairs were included. Table 2 shows the original sentence pairs by Li et al. (2006) and the proposed Arabic translation of the second item in each pair. We assumed that the validity of using the same similarity scores as in the English pairs would hold for the Arabic-English pairs because of the short translations given for each sentence (which do not exceed five words per pair). Besides, the similarity scores obtained by Li et al. (2006) have been proven to be fairly consistent with human intuition.

Pilot Cross-Language Human Similarity Benchmark Dataset
In order to evaluate the cross-language semantic similarity in this study, human ratings were collected on a proposed dataset according to existing designs of semantic similarity benchmarks. The rating participants were selected from a population of native Arabic speakers with English as a second language. They were all educated to postgraduate level and all had an upper-intermediate to professional understanding of English. They were either academics or postgraduate students at English universities.

A. Materials
A group of sixty-five English noun pairs, whose human semantic similarity ratings have been proven to be fairly consistent, was proposed in the literature (Rubenstein and Goodenough, 1965). The definitions of these noun pairs, taken from the Collins Cobuild dictionary, were semantically rated by thirty-two human participants in O'Shea et al. (2008). A subset consisting of thirty sentence pairs was selected in order to distribute the rated similarities evenly across the similarity ranges (Li et al., 2006).
In the present study, we proposed a benchmark dataset that made use of this standard dataset but with the second item in each pair replaced by its Arabic translation. The following procedure was used: (i) the second sentence was translated using three methods, namely MT, Human Translation (HT) and the Dictionary Definition (DD) of the original noun pair from a selected Arabic-Arabic dictionary; (ii) the original English sentence and the three translations were tabulated; (iii) to avoid researcher bias, three language experts, educated to PhD level, were asked to choose the best Arabic translation for the English sentence; (iv) using a majority-vote procedure, the translation that conveyed the most similar semantic content with no additional phrases was then tabulated with the original English sentence. This table was given to a further two participants to check whether any amendments were needed to the Arabic translations. Figure 3 shows a sample of the questionnaire that was distributed for participants to choose the best translation and Table 3 shows the majority voting results.

B. Procedure
To rate the similarity of constructed Arabic-English cross-language sentence pairs, seventeen participants were asked to complete a questionnaire, shown in Fig. 4. The participants were all native speakers of Arabic with upper-intermediate to professional proficiency in English as a second language. All of the participants were educated to graduate level or above. The procedure to obtain the human similarity scores is detailed as follows: (i) The order of the sentence pairs was randomized and given to new participants to avoid any evaluation bias; (ii) Following the same rating scale of standard semantic similarity datasets (Rubenstein and Goodenough, 1965), the participants were instructed to rate each sentence pair on a scale from 0.0 to 4.0. A rubric was provided to explain the evaluation scale, where 0 indicated that the two sentences are totally different/dissimilar in their meaning and 4 indicated that the sentences are identical. Table 4 shows the cross-language sentence pairs used in this study with the mean similarity and the standard deviation for the human rating.

Evaluation and Statistical Analysis
Human similarity ratings were obtained as the mean score in the range [0, 4] and then scaled into the range [0, 1] for comparison with the proposed semantic similarity algorithms. Statistics such as the mean, standard deviation and Pearson product-moment correlation coefficient r are commonly used for comparisons between human ratings and automated methods (Li et al., 2006). The results from the proposed algorithms were statistically compared with the constructed human-rated benchmark dataset using t-statistic hypothesis testing (Leech et al., 2008). We set the null hypothesis that "the semantic similarity evaluations by the machine and by the human perform equally (i.e., the true mean difference is zero)". A paired t-test was used to test the null hypothesis. To carry out the paired t-test on the benchmark dataset (k = 30), we calculated the difference between the result obtained by the algorithm and the mean human rating for each sentence pair as d_i = x_i - y_i, where i = 1, 2, …, k, x_i refers to the mean human rating on the i-th pair and y_i refers to the Sim score obtained from the proposed algorithm on the i-th pair. The mean difference was computed as d̄ = (1/k) Σ d_i and the test statistic as T = d̄ / (s_d / √k), where s_d is the standard deviation of the differences; under the null hypothesis, T follows a t-distribution with k - 1 degrees of freedom. Using a t-distribution table, we compared T to the t_(k-1) distribution to obtain the probability value, referred to as the p-value, used to reject or not reject the null hypothesis. Table 5 shows the results obtained from the proposed algorithms as compared with the human-like similarity obtained by Li et al. (2006) on a sample of sentences. The correlation coefficients r were 0.624, 0.793 and 0.928 for the averaged maximum-translation, noun-verb vector and MT-based term vector similarity algorithms, respectively.
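The paired t-test above can be sketched directly from its definition. This is a minimal stdlib-only illustration; the function name and the example scores are ours, not the paper's.

```python
import math
from statistics import mean, stdev

def paired_t(human, algo):
    """Paired t-test statistic for k sentence pairs:
    d_i = x_i - y_i, T = d_bar / (s_d / sqrt(k)),
    with k - 1 degrees of freedom under the null hypothesis."""
    k = len(human)
    d = [x - y for x, y in zip(human, algo)]
    d_bar = mean(d)
    s_d = stdev(d)  # sample standard deviation of the differences
    t_stat = d_bar / (s_d / math.sqrt(k))
    return t_stat, k - 1
```

The returned T would then be compared against the t-distribution with k - 1 degrees of freedom (e.g. via a t-table or `scipy.stats`) to obtain the p-value.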
For the first pair, whose two sentences are totally different in both their words and their meaning (the pairs are shown in Table 2), all algorithms obtained zero similarity, correctly indicating that they are completely dissimilar. Pairs 2 and 3 have sentences that share common words but differ somewhat in meaning. The human-like similarity is 0.74, but our methods compute similarity from the terms that share the same or very similar semantic meaning and therefore obtained higher similarity scores (0.89 and 1.0 using the term vector similarity method). The results for the remaining pairs were largely consistent with human understanding and also showed that the MT-based term vector similarity algorithm obtained the highest correlation with the human-like similarity.

Results from the Human-Rated Benchmark Dataset
This section covers the experimental work carried out to validate the proposed models. As mentioned above, the ground-truth benchmark was created from human ratings. Table 6 presents the human similarity score for each sentence pair alongside those obtained by the three algorithms, namely the averaged maximum-translation similarity algorithm, the noun-verb vector similarity algorithm and the MT-based term vector similarity algorithm. The correlation coefficients r obtained by these algorithms were 0.7206, 0.5512 and 0.8657, respectively.
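The Pearson product-moment correlation coefficient used throughout this evaluation can be computed directly from its definition; a minimal stdlib-only sketch (the function name is ours):

```python
import math
from statistics import mean

def pearson_r(xs, ys):
    """Pearson product-moment correlation coefficient between two
    equal-length score vectors, e.g. human vs. algorithm similarities."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)
```

Applying this to the column of mean human ratings and the column of algorithm scores in Table 6 yields the r values reported above.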
As can be seen from the table, the averaged maximum-translation and noun-verb vector similarities obtained a reasonably good correlation with human understanding, considering that the sentences come from two different languages. The MT-based term vector similarity achieved a markedly better Pearson correlation coefficient with human intuition, significant at the 0.01 level.
However, as mentioned in Li et al. (2006), a further factor should be accounted for in order to decide the best performance that can be achieved by computerised similarity algorithms under this particular benchmark and set of experimental conditions. Thus, an upper bound was determined in this study using a leave-one-out resampling technique, whereby we repeated the evaluation n times (n = number of participants). Each time, we computed the Pearson correlation coefficient of the judgement of one participant against the group of the remaining participants and then took the mean as the upper bound. As shown in Fig. 5, the best human participant's correlation coefficient is 0.9445 and the worst is 0.5994, whereas the mean (upper-bound performance) is 0.878. Taking the mean over all human participants as a typical upper performance rate that can be attained, we found that our MT-based term vector similarity algorithm achieved a close estimate of this upper bound.

Table 6. Experimental results on the Arabic-English cross-language short-text benchmark using the averaged maximum-translation, noun-verb vector and MT-based term vector similarity algorithms
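The leave-one-out upper bound described above can be sketched as follows. This is an illustrative stdlib-only implementation under our reading of the procedure (each participant correlated against the mean of the remaining participants); the function names are ours.

```python
import math
from statistics import mean

def pearson_r(xs, ys):
    """Pearson correlation between two equal-length score vectors."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return cov / math.sqrt(sum((x - mx) ** 2 for x in xs)
                           * sum((y - my) ** 2 for y in ys))

def upper_bound(ratings):
    """ratings: one list of per-pair scores per participant.  Each
    participant is correlated against the mean rating of the remaining
    participants; the mean of those r values is the upper bound.
    Returns (mean r, best r, worst r)."""
    rs = []
    for i, own in enumerate(ratings):
        others = [r for j, r in enumerate(ratings) if j != i]
        group_mean = [mean(col) for col in zip(*others)]
        rs.append(pearson_r(own, group_mean))
    return mean(rs), max(rs), min(rs)
```

Run on the seventeen participants' ratings, this procedure yields the best, worst and mean correlations reported in Fig. 5.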

Statistical Results and Discussion
Further statistical analysis was carried out using a paired t-test on the results obtained by the human participants versus each of the proposed automatic cross-language similarity algorithms. Table 7 shows the statistical results for the three algorithms on the benchmark dataset (sample size = 30, confidence level 0.95). It can be seen from the p-values that the maximum-translation similarity algorithm and the MT-based term vector similarity algorithm perform equivalently to the human evaluation, whereas the noun-verb vector similarity algorithm is significantly different. This may be because the latter algorithm does not consider all of the terms found in the texts, such as adjectives and adverbs, but computes similarity from the nouns and verbs only. Accordingly, we can say that methods that preserve all of the semantic terms across two texts in different languages may perform comparably and obtain positive results, regardless of whether dictionary translation or MT is used to find the translations of these terms.

Conclusion and Future Research
This paper proposed and compared different methods for measuring the cross-language semantic similarity of short phrases and sentences. Three algorithms, namely the averaged maximum-translation similarity algorithm, the noun-verb vector similarity algorithm and the MT-based term vector similarity algorithm, were investigated on Arabic-English texts. The contributions made by this paper can be summarized in two points. First, a standard cross-language benchmark was constructed and verified based on a ground-truth dataset. Second, the proposed algorithms consider the impact of dictionary translations, noun and verb vectors, or term vectors in judging the semantic relationship between two sentences drawn from two different languages. As indicated by the literature review, these algorithms have been applied for the first time in the Arabic-English cross-lingual setting. Thus, our cross-language semantic similarity algorithms were developed and tested not only to capture the common semantic similarity of two languages, but also to establish a comparison baseline for further research. To evaluate our cross-language similarity algorithms, we used a set of sentence pairs from computational linguistics. An initial experiment on these data illustrated that the proposed algorithms provide similarity scores that are fairly consistent with human understanding. Next, we compared the similarity results obtained by our algorithms with the similarity scores given by the human participants in the benchmark, taking into consideration the upper-bound similarity rate obtained by the participants. Statistical results showed that MT and dictionary translation can both achieve comparable behaviour when combined with good semantic similarity measurements. Further research will include the construction of a wider selection of sentence pairs annotated with human ratings, and the exploration of these algorithms across different languages.
An improvement to the algorithms could be made by using word sense disambiguation. More sophisticated algorithms proposed recently, such as BabelRelate (Navigli and Ponzetto, 2012b) and CL-ESA (Sorg and Cimiano, 2010; Anderka et al., 2009), will be explored in further studies, which in turn would help to apply these techniques to sentences of medium to large length. At present, comparing our techniques with other algorithms is difficult due to the lack of published work on measuring semantic similarity in the Arabic-English cross-language domain.

Acknowledgment
The author thanks the participants in the survey and those who provided useful comments on the manuscript.

Ethics
No ethical issues are anticipated to arise from the publication of this manuscript.