A CURE Algorithm for Vietnamese Sentiment Classification in a Parallel Environment

Abstract
Solutions to process big data are imperative and beneficial for numerous fields of research and commercial applications. Thus, a new model has been proposed in this paper to be used for big data set sentiment classification in the Cloudera parallel network environment. Clustering Using Representatives (CURE), combined with Hadoop MAP (M)/REDUCE (R) in Cloudera, a parallel network system, was used for 20,000 documents in a Vietnamese testing data set. The testing data set included 10,000 positive Vietnamese documents and 10,000 negative ones. After testing our new model on the data set, a 62.92% accuracy rate of sentiment classification was achieved. Although our data set is small, this proposed model is able to process millions of Vietnamese documents, in addition to data in other languages, to shorten the execution time in the distributed environment.


Introduction
Solutions to process big data are imperative and beneficial for numerous fields of research and applications. Clustering can be considered the most significant unsupervised learning problem; like other problems of this kind, it deals with finding structure in a collection of unlabeled data. In the clustering method, objects that are similar are clustered into the same group (the same cluster), whereas objects that are dissimilar are placed in different groups (different clusters). A cluster only includes objects that share similar characteristics.
Clustering Using Representatives (CURE) is a hierarchical clustering algorithm (Guha et al., 1998) that clusters large databases efficiently. Therefore, the objective of this survey is to process numerous Vietnamese big data sets by using the CURE algorithm in the Cloudera distributed environment. The results of this study can be used to cross-check sentiment classification for various fields of research and commercial applications.
In this work, each Vietnamese sentence was first transferred into a vector. The training data set had 40,000 Vietnamese sentences, which corresponded to 40,000 vectors divided into two groups: a positive vector group with 20,000 positive vectors and a negative vector group with 20,000 negative vectors. We also transferred every sentence of each document in our testing data set: a document with n sentences yielded a set of n vectors. With 20,000 documents in the testing data set, we had 20,000 sets of vectors corresponding to the 20,000 documents.
Our new model has been proposed to classify the semantics (positive, negative, neutral) of each document in our testing data set as follows: in the Cloudera (2017) parallel network environment (Hadoop, 2017; Apache, 2017), with Hadoop Map (M)/Reduce (R), we used the CURE algorithm to cluster each vector into the positive vector group or the negative vector group. Then, the vectors of each document were placed either in the positive vector group or the negative vector group: documents were clustered into positive polarity if they had more vectors in the positive vector group than in the negative vector group. Conversely, documents were clustered into negative polarity if they had more vectors in the negative vector group than in the positive vector group. Lastly, the remaining documents were clustered into neutral polarity if they had an equal number of vectors in both vector groups.
This study is positioned in the field of Vietnamese sentiment classification for various Vietnamese surveys and commercial applications. The proposed model can also be applied to other languages.
The motivation of this new model is as follows: the emotional polarity of a document can be identified through its many sentences. Therefore, numerous algorithms in the data mining field can be applied to natural language processing and to semantic classification to process millions of documents.
The novelty of the proposed approach is as follows: a CURE algorithm from the data mining field is applied to sentiment analysis and to classifying the semantics of documents based on Vietnamese sentences. This algorithm has also been used to process and identify emotions for millions of English documents. The above principles are proposed to classify the semantics of Vietnamese documents, as data mining is used in natural language processing. Therefore, this research demonstrates that the proposed model can be successfully applied to numerous languages.
Our model makes several significant contributions to countless research fields and commercial applications: 1) an algorithm of data mining is applied to the semantic analysis of natural language processing; 2) this study demonstrates that distinct fields of scientific research are interconnected.
This study contains six sections: Section 1 is the introduction. Section 2 discusses related works about Clustering Using Representatives (CURE). Section 3 concerns the Vietnamese data set used to classify sentences. Section 4 presents the methodology of our proposed model. Section 5 presents the experimental model and results of this study. The conclusion of the proposed model is stated in Section 6. In addition, the references section lists all the reference documents, and the tables are shown in the appendices section.

Related Work
In this section, we summarize the surveys related to our proposed model.
Many studies are related to Vietnamese vectors, Vietnamese segments, etc. (Hoang et al., 2007; Le et al., 2008; Nguyen et al., 2009). In a comparative study of Vietnamese text classification methods, a modified version of the FCM algorithm was presented to address clusters with totally different geometrical properties. The proposed algorithm adopted a novel non-metric distance measure based on the idea of "point symmetry," and experimental results on several data sets were presented to illustrate its effectiveness. In a hybrid approach to word segmentation of Vietnamese texts, the Bag Of Words (BOW) and Statistical N-Gram Language Modeling approaches achieved a high level of accuracy (Le et al., 2008). Additionally, the authors analyzed the advantages and disadvantages of each approach to find the best method for specific circumstances. In a hybrid approach to automatically tokenize Vietnamese text, finite-state automata techniques, regular expression parsing, and the maximal-matching strategy were combined and augmented by statistical methods to resolve ambiguities of segmentation (Nguyen et al., 2009). The Vietnamese lexicon in use was compactly represented by a minimal finite-state automaton. A text to be tokenized is first parsed into lexical phrases and other patterns using predefined regular expressions. The automaton is then deployed to build linear graphs corresponding to the phrases to be segmented. Ultimately, the application of a maximum-matching strategy on a graph results in all candidate segmentations of a phrase.
Much research has been conducted related to implementing algorithms and applications in a parallel network environment (Hadoop, 2017; Apache, 2017; Cloudera, 2017). Hadoop is an Apache-based framework used to handle large data sets on clusters consisting of multiple computers, using the Map and Reduce programming models (Hadoop, 2017; Apache, 2017). Its two main projects are the Hadoop Distributed File System (HDFS) and Hadoop M/R (Hadoop Map/Reduce). Hadoop M/R allows engineers to write applications for the parallel processing of large data sets on clusters consisting of multiple computers. An M/R task has two main components: (1) Map and (2) Reduce. The framework splits the input data into chunks, so that multiple Map tasks can handle separate data partitions in parallel. Then, the outputs of the Map tasks are gathered and processed by the ordered Reduce tasks. The input and output of each M/R task are stored in HDFS; as Map and Reduce tasks operate on (key, value) pairs, the input and output formats are also expressed as (key, value). Cloudera, a global provider of a data management and analytics platform built upon Apache™ Hadoop® and the latest open source technologies, announced that it would submit proposals for Impala and Kudu to join the Apache Software Foundation (ASF) (Cloudera, 2017). By donating its analytic database and columnar storage projects to the ASF, Cloudera aims to accelerate the growth and diversity of their respective developer communities. Cloudera's customers capture, store, process, and analyze vast amounts of data, empowering them to use advanced analytics to drive business decisions quickly, flexibly, and at lower cost.
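The (key, value) flow described above can be illustrated with a small Python sketch that simulates the Map and Reduce stages of a word count, the canonical M/R example. The function names and the in-memory grouping are illustrative only; a real Hadoop job distributes these stages across nodes and stores the pairs in HDFS.

```python
from itertools import groupby
from operator import itemgetter

def map_phase(records):
    # Each Map task emits (key, value) pairs; here: (word, 1) per token.
    for line in records:
        for word in line.split():
            yield (word, 1)

def reduce_phase(pairs):
    # Pairs are grouped by key before Reduce; each group is then summed.
    pairs = sorted(pairs, key=itemgetter(0))
    return {key: sum(v for _, v in group)
            for key, group in groupby(pairs, key=itemgetter(0))}

counts = reduce_phase(map_phase(["big data", "big model"]))
# counts == {"big": 2, "data": 1, "model": 1}
```

The sort-then-group step stands in for Hadoop's shuffle phase, which routes all values sharing a key to the same Reduce task.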
Much research related to the CURE algorithm has been executed (Guha et al., 1998; Yan-Hua et al., 2011; Nian-Yun et al., 2009; Ertöz et al., 2002; Kaya and Alhajj, 2005; Ying et al., 2008; Rani et al., 2014). In data mining, clustering (Guha et al., 1998) has been useful for discovering groups and identifying interesting distributions in the underlying data. As traditional clustering algorithms either favor clusters with spherical shapes and similar sizes or are very fragile in the presence of outliers, a new clustering algorithm called CURE was proposed. The CURE algorithm has also been used to analyze the behavior of users in large storage network databases (Yan-Hua et al., 2011).
Experimental results showed that the improved algorithm is not only able to cluster, but can also distinguish between normal and abnormal behaviors; as the data was analyzed by the harmful behavior evaluation system, most of the abnormal behaviors observed were categorized as harmful behaviors (Yan-Hua et al., 2011). For incremental data on real networks, the incremental mining method was utilized, in accordance with the needs of real-time network analysis. In another study, to inspect duplicated records, a new method of choosing representative records for a cluster was proposed, based on distance infection weight (Nian-Yun et al., 2009). Researchers have also proposed definitions of density and similarity that work well for high-dimensional data (Ertöz et al., 2002), as well as an automated method for mining fuzzy association rules (Kaya and Alhajj, 2005), etc.
The sentiment analysis task classifies a sentence into one of the following predefined categories: positive, negative, or neutral. In order to analyze the sentiment, three different text categorization algorithms are often compared: Decision Tree, Naive Bayes (NB), and Support Vector Machines (SVM).
Unsupervised classification has been investigated through numerous means (Turney, 2002; Lee et al., 2002a; van Zyl, 2002; Hegarat-Mascle et al., 2002; Ferro-Famil et al., 2002; Chaovalit and Zhou, 2005; Lee et al., 2002b; Gllavata et al., 2004; Phu et al., 2016; Phu et al., 2017a; 2017b; 2017c; 2017d; 2017e; 2017f; 2017g; 2017h). One example is a simple unsupervised learning algorithm to classify reviews as recommended or not recommended, which classified reviews according to the average semantic orientation of their phrases, specifically the adjectives and adverbs (Turney, 2002). In another study, the use of imaging radar polarimeter data for unsupervised classification of scattering behavior was described by comparing the polarization properties of each pixel in an image to those of simple classes of scattering, such as an even number of reflections, an odd number of reflections, and diffuse scattering (van Zyl, 2002).

Data Set
Our new model was tested on our Vietnamese data set, which includes testing and training data sets.
As shown in Figure 1, the testing data set includes 20,000 Vietnamese documents: 10,000 positive documents and 10,000 negative documents. All sentences and documents in our data set were automatically extracted from Vietnamese Facebook, websites, and social networks. Then, we labeled each sentence and document as either positive or negative.
As shown in Figure 2, the training data set includes 40,000 Vietnamese sentences: 20,000 positive sentences and 20,000 negative sentences. All sentences and documents in our data set were automatically extracted from Vietnamese Facebook, websites, and social networks. Then, we labeled both the sentences and the documents as either positive or negative.

Methodology
In this section, we present how our new model is implemented in the Cloudera parallel network environment. The section has two main parts: the first part demonstrates how a Vietnamese sentence is transferred into a vector, and the second part displays how the CURE algorithm (CA) is performed. The second part includes two sub-sections: in the first sub-section, a document of our Vietnamese testing data set is classified into positive or negative polarity by using the CURE algorithm in the sequential environment; in the second sub-section, a document of our Vietnamese testing data set is classified into positive or negative polarity by using the CA in the parallel network environment.
The methodology was executed as shown in Fig. 3.
The main ideas of the proposed model are as follows:
Step 1: Transfer all the Vietnamese sentences of the training data set into the vectors of the positive vector group and the negative vector group.
Step 2: Split each Vietnamese document of the testing data set into Vietnamese sentences. Each Vietnamese sentence of this document is transferred into one vector.
Step 3: Use the CURE algorithm to cluster each vector of each Vietnamese document of the testing data set into the positive vector group or the negative vector group of the training data set.
Step 4: Identify the sentiment polarity of each Vietnamese document of the testing data set based on the classification results of clustering.
Step 5: Test the proposed model in the sequential system.
Step 6: Test this survey in the Cloudera parallel system with 2 nodes, with 3 nodes, and with 4 nodes.

Transfer one Vietnamese Sentence into one Vector

Word Segmentation and Stop-Words Removal
As Vietnamese is an isolating language, word boundaries are not marked by spaces as they are in English (Hoang et al., 2007; Nguyen et al., 2009). Therefore, we employed a Vietnamese word segmentation program (Le et al., 2008) in this work. All words and numbers are considered features, usually referred to as tokens (Hoang et al., 2007). All documents are segmented into tokens, and the set of tokens is extracted by removing features that do not provide any information for document classification, such as numbers, dates, and function words (Hoang et al., 2007).
The main ideas of this segment are as follows:
Step 1: Word segmentation and stop-words removal (Hoang et al., 2007; Le et al., 2008; Nguyen et al., 2009) are applied to all the sentences in the training data set.
Step 2: Each document of the testing data set is split into sentences. Word segmentation and stop-words removal are applied to every sentence of all the documents in the testing data set.
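The preprocessing in the steps above can be sketched as follows. This is a minimal illustration only: whitespace-based token extraction stands in for true Vietnamese word segmentation, and the stop-word list shown is a hypothetical placeholder rather than the list used in the study.

```python
import re

# Hypothetical stop-word list for illustration only; the actual system
# uses a Vietnamese stop-word list and the segmentation program cited above.
STOP_WORDS = {"và", "là", "của"}

def preprocess(sentence):
    # Keep only letter sequences (this drops numbers, dates, and
    # punctuation), lowercase them, and remove stop-words.
    tokens = re.findall(r"[^\W\d_]+", sentence.lower())
    return [t for t in tokens if t not in STOP_WORDS]

print(preprocess("Sản phẩm này là tốt 123"))
# ['sản', 'phẩm', 'này', 'tốt']
```

The surviving tokens are what the vector representation step described next operates on.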

Vector Representation
Each sentence of every document is represented in a vector space model. The size of the vector is the maximum number of words/phrases over all sentences; vector representation is therefore executed after Word Segmentation and Stop-Words Removal. For example, if a Vietnamese sentence in the data set has 100 words/phrases, which is the maximum over all the sentences in the data set, then the size of the vector is 100. Each member of the vector has one value, which is calculated with Term Frequency-Inverse Document Frequency (TF-IDF). The overall process to transform each sentence into a vector in the parallel network environment is shown in Fig. 4.

In the Map (M) stage, the input of M is a sentence; M performs word segmentation and stop-words removal on the sentence and then transforms it into a vector. The output of M is the words/phrases of the sentence with their weights.
In the Reduce (R) stage, the input of R is the output of M, and the output of R is one vector for each sentence.
After the implementation of word segmentation and stop-words removal, we used TF-IDF to transfer the sentences in the training and testing data sets to the vector space model. The vector space model assigns weights to index terms. It is widely used in information retrieval to determine the relevance of a document for a given query. Both the document and the query are represented as weighted vectors of terms, and these weights are used to compute the degree of similarity between the query and the document.
TF-IDF is a measure that can be applied as an algorithm to rank a word (phrase) by a certain criterion. The basic principle of this algorithm is that the significance of a word (phrase) is proportional to its frequency of occurrence in a sentence and inversely proportional to its number of occurrences in the other sentences of the data set. This algorithm, combined with the vector space model, is widely used in many fields, such as search engines and text mining. In this research, we only used the most common form of TF-IDF.

TF (Term Frequency)
TF is used to calculate the frequency of occurrence of the word (phrase) t in sentence d. If a word (phrase) appears more frequently, then its TF is greater, and vice versa.
The simplest way to calculate the TF of the word (phrase) t in sentence d is the frequency of occurrence of t in d:

TF(t, d) = Ns(t) / W

where Ns(t) is the number of occurrences of the word (phrase) t in d and W is the total number of words (phrases) in d.
In addition to the above formula, there is another way to calculate TF, a simple formula with augmented frequency:

TF(t, d) = f(t, d) / max{f(t′, d) : t′ ∈ d}

where the numerator f(t, d) is the frequency of occurrence of the word (phrase) t in d, and the denominator is the frequency of the word (phrase) that appears most often in d.
TF alone is only a measure of the significance of a word (phrase) at the local (sentence) level. It does not show the significance of a word (phrase) in the entire data set, as numerous stop-words appear several times. Therefore, we conducted the calculation of IDF to limit the significance of those words (phrases).

IDF (Inverse Document Frequency)
IDF is the inverse frequency of a word (phrase) in the data set. It shows the significance of a word (phrase) at the global level. The IDF calculation reduces the value of popular words (phrases):

IDF(t) = log(D / d)

where D is the number of sentences in the data set and d is the number of sentences in the data set that contain the word (phrase) t.
If t does not appear in any sentence of the data set, then the denominator is equal to 0 and the division is not valid, so the denominator is often replaced by 1 + |{d ∈ D : t ∈ d}|; this does not affect the results of the calculations.
We can notice that if a word (phrase) appears more frequently in the sentences of the data set, then its IDF value is smaller, and vice versa. A word (phrase) with a small IDF is usually a popular word (phrase) and, thus, may need to be removed to avoid confounding results. Conversely, a word (phrase) with a large IDF is not necessarily important; this depends on the TF measure of that word (phrase), because words (phrases) that are rare may appear only in certain sentences of the data set and are not useful in the classification process.
To identify important words (phrases), we conducted the TF-IDF calculation: if the TF-IDF measure of a word (phrase) is larger, then the word (phrase) is more influential and will more greatly affect the classification. In the TF-IDF vector, if a word (phrase) t_i appears in d_j, then the weight of the word (phrase) in the vector is the TF-IDF(t_i, d_j) value; otherwise, the weight is 0. Therefore, the following formula can be executed:

TF-IDF(t_i, d_j) = TF(t_i, d_j) × IDF(t_i) if t_i appears in d_j, and TF-IDF(t_i, d_j) = 0 otherwise.
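The TF, IDF, and TF-IDF formulas above can be sketched in Python as follows. This is a minimal per-sentence illustration, assuming whitespace-separated tokens and using the 1 + d smoothing of the denominator described earlier; the corpus shown is a toy example, not data from the study.

```python
import math

def tf(term, sentence):
    # TF(t, d) = Ns(t) / W: occurrences of t in d over total words in d.
    words = sentence.split()
    return words.count(term) / len(words)

def idf(term, sentences):
    # IDF(t) = log(D / (1 + d)): the denominator is smoothed so that a
    # term appearing in no sentence does not cause division by zero.
    d = sum(1 for s in sentences if term in s.split())
    return math.log(len(sentences) / (1 + d))

def tf_idf(term, sentence, sentences):
    # The weight is TF * IDF when t appears in d, and 0 otherwise.
    if term not in sentence.split():
        return 0.0
    return tf(term, sentence) * idf(term, sentences)

corpus = ["good product", "bad product", "good price"]
print(tf_idf("price", "good price", corpus))  # > 0: "price" is rare
print(tf_idf("bad", "good price", corpus))    # 0.0: "bad" is absent
```

As the comments indicate, a rare term in a sentence receives a positive weight, while a term absent from the sentence contributes nothing to its vector.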

A CURE Algorithm
In this part, the CA in the sequential environment is presented in the first sub-part, and the CA in the parallel network environment is presented in the second sub-part.

A CURE Algorithm in Sequential Environment
The CA in the sequential environment is executed as follows, as shown in Fig. 5.

The main ideas of this part are as follows: Step 1: Transfer all the Vietnamese sentences of the training data set into the vectors of the positive vector group and the negative vector group in the sequential environment.
Step 2: Split each Vietnamese document of the testing data set into Vietnamese sentences. Each Vietnamese sentence of this document is transferred into one vector in the sequential system.
Step 3: Use the CURE algorithm to cluster each vector of each Vietnamese document of the testing data set into the positive vector group or the negative vector group of the training data set in the sequential environment.
Step 4: Identify the sentiment polarity of each Vietnamese document of the testing data set based on the classification results of clustering in the sequential system.
Step 5: Return the classification results of the Vietnamese testing data set in the sequential environment.
CURE employs a novel hierarchical clustering algorithm that adopts a middle ground between the centroid-based and the all-points extremes (Guha et al., 1998). In CURE, first, a constant number c of well-scattered points in a cluster is chosen. The scattered points capture the shape and extent of the cluster. Next, the chosen scattered points are contracted towards the centroid of the cluster by a fraction α. After shrinking, these scattered points are used as representatives of the cluster. The clusters with the closest pair of representative points are merged at each step of CURE's hierarchical clustering algorithm.
CURE is robust to outliers and identifies clusters with non-spherical shapes and wide variations in size. The CURE algorithm has many contributions: it can identify both spherical and non-spherical clusters, and it chooses several well-scattered points as representatives of the cluster instead of a one-point centroid. It uses random sampling and partitioning to speed up clustering.
With the CURE algorithm, we follow these steps.
For each cluster, c well-scattered points within the cluster are chosen and are contracted toward the mean of the cluster by a fraction α.
The distance between two clusters is equal to the distance between the closest pair of representative points from each cluster.
The c representative points attempt to capture the physical shape and geometry of the cluster. Shrinking the scattered points toward the mean removes surface abnormalities and decreases the effects of the outliers.
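A minimal sketch of these two operations, choosing scattered points and shrinking them toward the mean, is shown below for 2-D points. The farthest-point selection heuristic and the parameter defaults are illustrative simplifications; the full CURE procedure also uses random sampling and partitioning, which are omitted here.

```python
import math

def dist(p, q):
    # Euclidean distance between two points.
    return math.dist(p, q)

def shrink_representatives(cluster, c=4, alpha=0.5):
    # Choose up to c well-scattered points via a farthest-point heuristic,
    # then contract each toward the cluster mean by the fraction alpha.
    mean = tuple(sum(xs) / len(cluster) for xs in zip(*cluster))
    reps = [max(cluster, key=lambda p: dist(p, mean))]
    while len(reps) < min(c, len(cluster)):
        reps.append(max(cluster, key=lambda p: min(dist(p, r) for r in reps)))
    return [tuple(r_i + alpha * (m_i - r_i) for r_i, m_i in zip(r, mean))
            for r in reps]

def cluster_distance(reps_a, reps_b):
    # Inter-cluster distance = distance between the closest pair of
    # representative points, one drawn from each cluster.
    return min(dist(a, b) for a in reps_a for b in reps_b)

square = [(0, 0), (2, 0), (0, 2), (2, 2)]
print(shrink_representatives(square))  # corners pulled halfway to (1, 1)
```

For the unit example, each corner of the square is contracted halfway toward the mean (1, 1), illustrating how shrinking dampens the influence of outlying points.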
The CURE algorithm in the sequential environment is similar to the algorithms used in numerous studies (Guha et al., 1998; Yan-Hua et al., 2011; Nian-Yun et al., 2009; Ertöz et al., 2002; Kaya and Alhajj, 2005; Ying et al., 2008; Rani et al., 2014). The input of the CURE algorithm is the positive vector group, the negative vector group, and the n vectors of each document of our testing data set. The output of the CA in the sequential environment is the classification of the n vectors into the positive vector group or the negative vector group. After the CA in the sequential environment has been implemented and the n vectors of the document have been classified, the document is deemed to exhibit positive sentiment if it has more of its n vectors in the positive vector group than in the negative vector group. Conversely, the document exhibits negative semantics if it has fewer vectors in the positive vector group than in the negative vector group. Finally, the document is deemed to demonstrate neutral emotions if the numbers of its vectors in the positive and negative vector groups are equal.

A CURE Algorithm in Parallel Network Environment
The CA in the Cloudera parallel network environment is executed as follows, as shown below in Fig. 6.
The main ideas of this part are as follows: Step 1: Transfer all the Vietnamese sentences of the training data set into the vectors of the positive vector group and the negative vector group in the Cloudera parallel system -2 nodes (the Cloudera parallel system -3 nodes, and the Cloudera parallel system -4 nodes).
Step 2: Split each Vietnamese document of the testing data set into the Vietnamese sentences.Each Vietnamese sentence of this Vietnamese document is transferred into one vector in the Cloudera parallel system -2 nodes (the Cloudera parallel system -3 nodes, and the Cloudera parallel system -4 nodes).
Step 3: Use the CURE algorithm to cluster each vector of each Vietnamese document of the testing data set into the positive vector group or the negative vector group of the training data set in the Cloudera parallel system -2 nodes (the Cloudera parallel system -3 nodes, and the Cloudera parallel system -4 nodes).
Step 4: Identify the sentiment polarity of each Vietnamese document of the testing data set based on the classification results of clustering in the Cloudera parallel system -2 nodes (the Cloudera parallel system -3 nodes, and the Cloudera parallel system -4 nodes).
Step 5: Return the classification results of the Vietnamese testing data set in the Cloudera parallel system -2 nodes (the Cloudera parallel system -3 nodes, and the Cloudera parallel system -4 nodes).

Fig. 6: A CURE algorithm in the Cloudera parallel network environment

Clustering each vector of each document of the testing data set into the positive vector group or the negative vector group of the training data set is implemented by using the CURE algorithm with Hadoop Map (M)/Reduce (R) in the Cloudera parallel network environment.
The CURE algorithm in the Cloudera parallel network environment is similar to the algorithms used in various works (Guha et al., 1998; Yan-Hua et al., 2011; Nian-Yun et al., 2009; Ertöz et al., 2002; Kaya and Alhajj, 2005; Ying et al., 2008; Rani et al., 2014). The input of the CURE algorithm is the positive vector group, the negative vector group, and the n vectors of each document of our testing data set. The output of the CA in the Cloudera environment is the classification of the n vectors into the positive vector group or the negative vector group. After the CA in the Cloudera environment is implemented and the n vectors of the document are classified, the document exhibits positive sentiment if it has more of its n vectors in the positive vector group than in the negative vector group. The document exhibits negative semantics if it has fewer vectors in the positive vector group than in the negative vector group. Finally, the document expresses neutral emotions if the numbers of its vectors in the positive and negative vector groups are equal.
This part includes two phases: the Hadoop Map (M) phase and the Hadoop Reduce (R) phase. The inputs of the Hadoop Map (M) phase in Cloudera are the positive vector group of our training data set, the negative vector group of our training data set, and the n vectors of each document of our testing data set.

The outputs of the Hadoop Map (M) phase in Cloudera are the emotion classification results of the n vectors of each document of the testing data set. The inputs of the Hadoop Reduce (R) phase in Cloudera are the outputs of the Hadoop Map (M) phase, i.e., the semantic classification results of the n vectors of each document of the testing data set. The output of the Hadoop Reduce (R) phase in Cloudera is the sentiment classification result of the document of the testing data set; this document is classified into positive emotions, negative semantics, or neutral semantics.
The Hadoop Map (M) phase in the Cloudera parallel network environment is executed as follows, as shown in Fig. 7.
The main ideas of the Hadoop Map (M) phase are as follows:
Input: The n vectors of each Vietnamese document of the testing data set, the 20,000 vectors of the positive vector group, and the 20,000 vectors of the negative vector group.
Output: The results of clustering the n vectors of each Vietnamese document of the testing data set into the positive vector group or the negative vector group; these are the inputs of the Hadoop Reduce (R) phase.
Step 1: For each vector in the n vectors of the Vietnamese document, do repeat:
Step 2: Use the CURE algorithm to cluster this vector into the positive vector group or the negative vector group.
Step 3: Get the result of this clustering.
Step 4: End repeat.
Step 5: Return the results of clustering the n vectors of each Vietnamese document of the testing data set into the positive vector group or the negative vector group.
Step 6: Transfer the results into the inputs of the Hadoop Reduce (R) phase.
The Hadoop Reduce (R) phase in the Cloudera parallel network environment is executed as shown in Fig. 8:
Step 1: If the number of vectors in the positive vector group is greater than the number of vectors in the negative vector group among the n vectors of the Vietnamese document, then return positive.
Step 2: Else, if the number of vectors in the positive vector group is less than the number of vectors in the negative vector group among the n vectors of the Vietnamese document, then return negative.
Step 3: Else, return neutral.
Step 4: Get the polarity of the Vietnamese document of the testing data set.
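The Reduce-phase decision amounts to a majority vote over the per-vector cluster labels of a document; a minimal Python sketch follows (the label strings and function name are illustrative, not taken from the study's Java implementation).

```python
def document_polarity(vector_labels):
    # vector_labels holds one cluster label per sentence vector of the
    # document, as produced by the Map phase ("positive" or "negative").
    pos = vector_labels.count("positive")
    neg = vector_labels.count("negative")
    if pos > neg:
        return "positive"
    if pos < neg:
        return "negative"
    return "neutral"

print(document_polarity(["positive", "positive", "negative"]))  # positive
print(document_polarity(["positive", "negative"]))              # neutral
```

A document with equally many vectors in each group thus falls into the neutral class, exactly as in Step 3 above.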
Step 5: Return the results of the sentiment classification of the Vietnamese document of the Vietnamese testing data set.

Experiment
We used the Accuracy (A) measure to calculate the accuracy of the results of the emotion classification.
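Accuracy here is the fraction of documents whose predicted polarity matches their label; a minimal sketch, with hypothetical label lists for illustration:

```python
def accuracy(predicted, actual):
    # Accuracy (A) = number of correctly classified documents / total.
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)

print(accuracy(["positive", "negative", "positive", "neutral"],
               ["positive", "negative", "negative", "neutral"]))  # 0.75
```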
The Java programming language was used to save the training data set, the testing data set, and the results of the emotion classification, and to implement our proposed model to classify the 20,000 Vietnamese documents.

The sequential environment in this research includes one node (one server). The Java language was used to program the CA. The configuration of the server in the sequential environment is: Intel® Server Board S1200V3RPS, Intel® Pentium® Processor G3220 (3M Cache, 3.00 GHz), 2 GB PC3-10600 ECC 1333 MHz LP Unbuffered DIMMs. The operating system of the server is Cloudera.
We also performed the CA in the Cloudera parallel network environment with two, three, and four nodes (servers). The Java language was used to program the application of CURE in Cloudera. The configuration of each server in the Cloudera system is: Intel® Server Board S1200V3RPS, Intel® Pentium® Processor G3220 (3M Cache, 3.00 GHz), 2 GB PC3-10600 ECC 1333 MHz LP Unbuffered DIMMs. The operating system of each server is Cloudera, and all nodes have the same configuration.
The results of the 20,000 Vietnamese documents tested are presented in Table 1.
The accuracy of the 20,000 Vietnamese documents in the testing dataset is presented in Table 2.
We also tested the 20,000 Vietnamese documents in the sequential environment, the Cloudera system with two nodes, the Cloudera system with three nodes, and the Cloudera system with four nodes in Table 3.

Results and Discussion
With our proposed new model, we achieved 62.92% accuracy for the Vietnamese documents, as shown in Table 2. In Table 3, the average time of the semantic classification of the CURE algorithm in the sequential environment is 21,600 seconds per 20,000 documents. This is longer than the average time of the emotion classification of the CA in the Cloudera parallel network environment with three nodes, which is 7,198 seconds per 20,000 documents. The average execution time of the sentiment classification of the CA in the Cloudera parallel network environment with four nodes was the fastest, at 4,689 seconds per 20,000 documents. The average execution time of the sentiment classification of the CA in the Cloudera parallel network environment with two nodes is slower than that with three nodes.

Conclusion
Although our new model was tested on a Vietnamese data set, it can be applied to other languages. In this paper, our model was tested on 20,000 documents, which is a small data set; however, it can be applied to big data sets with millions of Vietnamese documents.
In this work, we proposed a new model to classify sentiment in Vietnamese documents using Clustering Using Representatives (CURE) with Hadoop Map (M)/Reduce (R) in the Cloudera parallel network environment. At the time this study was carried out, little research had been published on using clustering methods to classify data. Our research shows that clustering methods can be used to classify data and, in particular, to classify the emotion of text.
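The core idea of representative-based polarity classification can be sketched as follows. This is not the authors' implementation, only a minimal illustration of the principle: a document vector is assigned the polarity of whichever group (positive or negative) contains the nearest representative point, in the spirit of CURE's use of multiple representative points per cluster. All names here are hypothetical.

```java
import java.util.List;

// Minimal sketch: decide polarity by comparing the distance from a document
// vector to the nearest representative point of the positive cluster versus
// the negative cluster. Illustrative only; not the paper's actual code.
public class RepresentativeClassifier {
    static double distance(double[] a, double[] b) {
        double sum = 0;
        for (int i = 0; i < a.length; i++) sum += (a[i] - b[i]) * (a[i] - b[i]);
        return Math.sqrt(sum);
    }

    // Distance to the closest representative point of a cluster.
    static double nearest(double[] doc, List<double[]> representatives) {
        double best = Double.MAX_VALUE;
        for (double[] rep : representatives) best = Math.min(best, distance(doc, rep));
        return best;
    }

    /** Returns +1 for positive polarity, -1 for negative polarity. */
    static int classify(double[] doc, List<double[]> posReps, List<double[]> negReps) {
        return nearest(doc, posReps) <= nearest(doc, negReps) ? 1 : -1;
    }
}
```

With representative points built from the positive and negative training sentences, each test document is labeled by its nearest group.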
The accuracy of the proposed model depends on many factors: 1) The CURE-related algorithms: To increase accuracy, the CURE-related algorithms can be improved or replaced with other algorithms. 2) The positive vector group: The positive vector group depends on the 20,000 positive Vietnamese sentences and the algorithms used to transfer the sentences to vectors; these algorithms can be further developed to increase accuracy. 3) The negative vector group: The negative vector group depends on the 20,000 negative Vietnamese sentences and the algorithms used to transfer the sentences to vectors; these algorithms can likewise be further developed. 4) The 20,000 positive sentences of the training data set: To increase accuracy, the number of positive sentences in the training data set can be increased. 5) The 20,000 negative sentences of the training data set: To increase accuracy, the number of negative sentences in the training data set can be increased. 6) The testing data set: The training data set must be similar to the testing data set. 7) The testing and training data sets: To improve the accuracy of the data sets, the selected sentences could be chosen with regard to a specific domain (such as textbooks, cartoons, etc.). This selection could also be used to standardize sentences and documents from the Internet.
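The paper does not specify how sentences are transferred to vectors, so the sketch below assumes one of the simplest possible encodings, a term-frequency vector over a fixed vocabulary; the class and method names are hypothetical.

```java
import java.util.List;

// Hedged sketch of "transferring sentences to vectors": a plain
// term-frequency vector over a fixed vocabulary. The paper's actual
// encoding is not specified; this is only an assumed, minimal variant.
public class SentenceVectorizer {
    static double[] toVector(String sentence, List<String> vocabulary) {
        double[] v = new double[vocabulary.size()];
        for (String token : sentence.toLowerCase().split("\\s+")) {
            int idx = vocabulary.indexOf(token);
            if (idx >= 0) v[idx] += 1.0; // count each occurrence of a known term
        }
        return v;
    }
}
```

Improving this step (e.g., with weighting schemes or richer features) is exactly the kind of refinement factors 2) and 3) above refer to.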
The execution time of the proposed model depends on many factors: 1) The performance of the distributed environment: The performance of the parallel system depends on the Cloudera system, Hadoop Map/Reduce, the algorithms and the performance of the nodes. 2) The Cloudera system and Hadoop Map/Reduce: To reduce the execution time, the Cloudera system and Hadoop Map/Reduce can be improved or replaced with other distributed environments and parallel frameworks. 3) The CURE-related algorithms: To reduce the execution time, the CURE-related algorithms can be improved or replaced with other algorithms. 4) The performance of the nodes: The performance of the nodes depends on the number of servers and the performance of each server. To reduce the execution time, the number of servers can be increased, or the performance of each server can be improved.
The proposed model has both advantages and limitations. It uses the CURE algorithm to classify the semantics of Vietnamese documents based on Vietnamese sentences. It can process millions of Vietnamese documents in a short time, can run in distributed systems, and can be applied to other languages. However, its accuracy is low, and its implementation requires a relatively high investment of financial cost and time.
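How Map/Reduce splits this workload can be illustrated with a small sketch. Plain Java stands in for the Hadoop MapReduce API here; the classifier passed in, and all names, are hypothetical stand-ins rather than the paper's actual code.

```java
import java.util.List;
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;

// Illustrative sketch of the Map/Reduce split (plain Java, not the Hadoop API):
// the Map phase labels each document independently (so it parallelizes across
// nodes), and the Reduce phase tallies the labels per class.
public class MapReduceSketch {
    // "Map": label every document with a caller-supplied classifier.
    static List<String> mapPhase(List<String> docs, Function<String, String> classify) {
        return docs.stream().map(classify).collect(Collectors.toList());
    }

    // "Reduce": count how many documents fell into each class.
    static Map<String, Long> reducePhase(List<String> labels) {
        return labels.stream().collect(Collectors.groupingBy(l -> l, Collectors.counting()));
    }
}
```

Because the Map phase is embarrassingly parallel over documents, adding nodes shortens it, which matches the timing trend reported in the Results section.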
To demonstrate the scientific value of this research, we compared our model's results with those of many studies, as shown in the tables below.
Our model's results are compared with the works in Tables 4 and 5 (Hoang et al., 2007;Le et al., 2008;Nguyen et al., 2009).
In Tables 8 and 9, our model's results are compared with the latest research on Vietnamese sentiment classification, Vietnamese sentiment analysis and Vietnamese opinion mining.
In Tables 10 and 11, our model's results are compared with the latest research on sentiment classification, sentiment analysis and opinion mining.
In Tables 12 and 13, our model's results are compared with the latest works on unsupervised classification (Turney, 2002; Lee et al., 2002a; van Zyl, 2002; Hegarat-Mascle et al., 2002; Ferro-Famil et al., 2002; Chaovalit and Zhou, 2005; Lee et al., 2002b; Gllavata et al., 2004).

Studies | CA | SC | L | SD | DT | PNE | Approach
(Ertöz et al., 2002) | Yes | No | NM | NM | NM | No | A novel method of implementing a density-based approach over a Produs algorithm to cluster even small data points
(Kaya and Alhajj, 2005) | Yes | NM | NM | NM | NM | No | An automated method for mining fuzzy association rules
(Ying et al., 2008) | Yes | NM | NM | NM | NM | No | A CURE algorithm and the C4.5 decision tree method are adopted to establish a new costume sales forecasting model
(Rani et al., 2014) | Yes

Table 7: Comparisons of our model's merits and demerits with the research related to the CURE algorithm (Guha et al., 1998; Yan-Hua et al., 2011; Nian-Yun et al., 2009; Ertöz et al., 2002; Kaya and Alhajj, 2005; Ying et al., 2008; Rani et al., 2014)

Works | Merits | Demerits
(Guha et al., 1998) | The authors' experimental results confirm that the quality of clusters produced by CURE is much better than those found by existing algorithms. Furthermore, they demonstrate that random sampling and partitioning enable CURE to not only outperform existing algorithms, but also to scale well for large databases without sacrificing clustering quality. | No mention
(Yan-Hua et al., 2011) | Experimental results show that the improved algorithm is not only able to cluster, but can also distinguish normal and abnormal behaviors. Analyzed by a harm-behavior evaluating system, most of the abnormal behaviors belong to harm behaviors. For incremental data on the real net, it also gives a method of increment mining, which accords with the need of real-time network analysis. | No mention
(Nian-Yun et al., 2009) | To inspect duplicated records, the Clustering Using Representatives (CURE) algorithm is ameliorated. The definition of pre-sampling is put forward, which can find the distribution of duplicated records so as to improve the exactness of random sampling in record sets. A new method of choosing representative records for a cluster is proposed, based on distance infection weight. With this method, representative points are selected not only according to the density of the clusters, but also according to the importance of points, including some isolated points. This makes the selection of representative points suitable. Both theory and experiment show that it is an effective approach to detect similar duplicated records. | No mention
(Ertöz et al., 2002) | This approach handles many problems that traditionally plague clustering algorithms, e.g., finding clusters in the presence of noise and outliers and finding clusters in data that has clusters of different shapes, sizes and density. The authors have used their clustering algorithm on a variety of high- and low-dimensional data sets with good results, but in this work they present only a couple of examples involving high-dimensional data sets: word clustering and time series derived from NASA Earth science data. | No mention
(Kaya and Alhajj, 2005) | The authors compared the proposed GA-based approach with other approaches from the literature. Experiments conducted on 100K transactions from the US census in the year 2000 show that the proposed method exhibits good performance in terms of execution time and interesting fuzzy association rules. | No mention
(Ying et al., 2008) | The CURE algorithm carries out grouping of similar items in terms of sales prospects and the C4.5 decision tree finds understandable links between these clusters and selected descriptive criteria. Based on a test of 568 historical sales data and 326 new data, the performance efficiency of the forecasting model is analyzed. Finally, aiming at the classification error of the forecasting model, a further improvement method is given. | No mention
(Rani et al., 2014) | These are not pure hierarchical clustering algorithms; some other clustering techniques are merged into hierarchical clustering in order to improve cluster quality and also to perform multiple-phase clustering. This study presents a comparative analysis of two algorithms, BIRCH and CURE, by applying the Weka 3.6.9 data mining tool on the Iris Plant dataset. | No mention
Our research | Our model's merits and demerits are illustrated in the Conclusion section. |
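A defining step of CURE (Guha et al., 1998) is that each cluster's representative points are shrunk toward the cluster centroid by a shrink factor alpha, which damps the influence of outliers. A minimal sketch of that step, with illustrative names of our own choosing:

```java
// Sketch of CURE's shrinking step: move a representative point toward the
// cluster centroid by a fraction alpha (0 = no shrink, 1 = collapse to centroid).
public class CureShrink {
    static double[] shrink(double[] rep, double[] centroid, double alpha) {
        double[] out = new double[rep.length];
        for (int i = 0; i < rep.length; i++)
            out[i] = rep[i] + alpha * (centroid[i] - rep[i]);
        return out;
    }
}
```

For example, with alpha = 0.5 a representative at (4, 0) whose cluster centroid is the origin moves halfway in, to (2, 0).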

Studies | CA | SC | L | SD | DT | PNE | Approach
(Ha et al., 2011) | No | Yes | VL | Yes | Yes | No | +HAC clustering +Semi-supervised SVM-kNN classification
(Bang et al., 2015) | No | Yes | VL | Yes | Yes | No | +Decision Tree +Naive Bayes (NB) +Support Vector Machines (SVM) +Feature selection technique, χ2 (CHI)
(Kieu and Pham, 2010) | No | Yes | VL | Yes | Yes | No | A rule-based system using the GATE framework
(Vu and Park, 2014) | No | Yes | VL | No | No | No | A method to construct VSWN from a Vietnamese dictionary, not from WordNet
(Nguyen et al., 2014) | No | Yes | EL | Yes | Yes | No | A supervised machine learning approach to handle the task of document-level sentiment polarity classification
(Le et al., 2015) | No | Yes | VL | Yes | Yes | No | +An approach to extracting and classifying aspect terms for the Vietnamese language +Semi-supervised learning GK-LDA
(Trinh et al., 2016) | No | Yes | EL | Yes | Yes | No | A cross-domain sentiment analysis system

Fig. 5: A CURE algorithm in the sequential environment


Table 1: The results of the 20,000 Vietnamese documents in the testing data set

Table 2: The accuracy of our new model for the 20,000 Vietnamese documents in the testing data set

Table 3: The execution time of our new model for the 20,000 Vietnamese documents in the testing data set