High Precision Latent Semantic Evaluation for Descriptive Answer Assessment

This paper proposes an approach for evaluating students' descriptive answers using a comparison-based approach in which the student's answer is compared with the standard answer. The standard answer contains domain specific knowledge as per the category (how, why, what, etc.) of question asked in the examination. Several state-of-the-art studies claim that LSA correlates with the human assessor's way of evaluation. With this as background, we investigated the evaluation of students' descriptive answers using Latent Semantic Analysis (LSA). In the course of the research, it was discovered that standard LSA has limitations: LSA research usually involves heterogeneous text (text from various domains), which may include irrelevant terms and is highly susceptible to noisy, missing and inconsistent data. We propose a new technique inspired by LSA, denoted "High Precision Latent Semantic Evaluation" (HPLSE), in which LSA has been modified to overcome some of these limitations; this has also increased precision. Using the proposed technique (HPLSE), for the same datasets, the average score difference and standard deviation between a human assessor and the computer assessor have reduced and the Pearson correlation coefficient (r) has increased considerably. The new technique is discussed and demonstrated on various problem classes.


Introduction
The current system of manual evaluation has some limitations, which make it important to automate the evaluation of descriptive answers. It has been noticed that different assessors give different marks to the same response. Additionally, it takes a lot of assessors to evaluate a large number of answer sheets.
Evaluation of objective answers is an easy task and is well supported by many systems, but descriptive answer evaluation is still an open problem. Various student essay evaluation systems have been under development since the 1960s. A national network of US universities supported the development of a system to grade thousands of high school students' essays. It scores essays by processing a number of essays on the same topic, each scored by two or more human assessors. In the 1960s, computer technology was not stable enough or accessible enough to expand to a large scale. Some of the systems, such as Intelligent Essay Assessor, State of essence, Summary Street, Apex, Autotutor and Select-a Kibitzer, differ in subject domain and in the similarities used, but all are LSA-based.
All such systems claim that LSA correlates with the human assessors.
This was one of the motivations for looking at LSA in our research. LSA is a statistical natural language processing (NLP) method for inferring meaning from a text. It was developed by researchers at Bellcore as an information retrieval technique (Deerwester et al., 1990) in the late 1980s. LSA provided an advantage over keyword-based methods in that it could induce associative meanings of the query (Deerwester et al., 1990) rather than relying on exact matches. LSA uses linear algebra techniques to learn the conceptual correlations in a collection of text.
Most of the systems mentioned above and further in Section 2 are comparison based: the student's response is compared with a standard answer/essay. In a broad view, all such systems have three major modules: student answer representation, standard answer or reference answer representation and the comparison unit. Available systems are useful for essay grading and short answer grading, but descriptive answer evaluation is still an open research issue. Our approach is also comparison based.
After experimenting with LSA for evaluation of students' descriptive answers for various categories of questions, it has been observed that there is a significant gap between the assessment by the human assessor and the results of our computer assessor.
LSA, in general, can be considered an excellent information retrieval technique, but for this specific task of assessing students' descriptive answers, the results are not satisfactory. The reason can be that some of the basic features of the technique are not suitable for this problem. LSA has been modified to overcome some of these issues/limitations, and the proposed technique, denoted 'High Precision Latent Semantic Evaluation' (HPLSE), has been used for automation of the descriptive answer evaluation process, with much better results.
The paper is organized as follows: Section 2 explains research works related to the field of automation of descriptive answer evaluation using LSA and lists the LSA modifications done so far. Section 3 explains the methodology used to determine the semantic similarity between two texts. Section 4 proposes a new technique denoted High Precision Latent Semantic Evaluation (HPLSE) and its implementation with the results. Section 5 lists several issues, the conclusion and areas of improvement that future studies will address.

Literature Review
In this section, research work related to the field of Descriptive Answer Assessment (DAA) is discussed. Methods and techniques implemented so far for automation of the DAA process are discussed. Details of the LSA technique and the kinds of modifications that have been made to it are also mentioned.
Several state-of-the-art short answer graders require manually designed patterns which have to be matched with the student's response; a match implies a correct response. One information extraction-based system (Sukkarieh et al., 2005) was developed at Oxford University to fulfil the need of the University of Cambridge Local Examinations Syndicate (UCLES), as many of the UCLES exam questions are short answer questions. In this system, hand-crafted patterns are filtered from the training datasets by human experts and the student responses are matched with these patterns.

Research Work Relevant to the Field of Automation of DAA
Considerable work has been done in the area of using LSA to evaluate essays and to provide content-based feedback, but evaluating descriptive answers is still an open problem.
A text similarity approach was taken in (Kumar and Dey, 2013) for grading short answers without any human intervention, unlike previous work. Texts from the student answer are compared with the texts of the standard answer by applying similarity measures. In the next iteration, the standard answer is expanded with the top-scoring (best matched) student answers, to increase the adequacy of the standard/reference answer. This issue has already been raised in the introduction section of this report, as it is an important aspect of automation of the descriptive answer evaluation process as well.
Instead of matching the student's textual answer with the textual patterns in the training dataset, this approach (Da Silva et al., 2012) adopted a model in which the student's cognitive structure (concept map) is compared with a reference ontology. For comparing the student's concept map with the reference concept map, an alignment tool (COMA++) has been used. The alignment technique for learning assessment is used for the identification of entities with the same meaning, i.e. checking the semantic similarity between two entities even when the two strings are not identical.
Online tools that support managing of online assessments, such as Moodle and Zoho, are based on a string matching technique for short answers, but long answer evaluation is still handled manually by most of the systems. Some of the approaches are based on keyword matching, sequence matching, quantitative analysis, fuzzy systems or rule based systems, which provide some solution for online assessment of answer sheets, but general descriptive answer evaluation is still an open problem.

Research Work Related to LSA Technique
LSA, initially proposed as a text search technique, gradually came to be used for natural language processing tasks like content analysis, document summarization, semantic analysis and patent analysis. An improvement to LSA was introduced as Probabilistic Latent Semantic Analysis (PLSA), but according to researchers, in PLSA the number of parameters grows linearly with the size of the corpus. This leads to problems of overfitting (Zhu and Li, 2012). Another problem with the model is that it is not clear how to assign a probability to a document outside of the training set.
Improvements to PLSA led to LDA (Latent Dirichlet Allocation). Researchers (Zhu and Li, 2012) claimed that LDA provides a more intuitive topic model but has evidently much lower precision values for any setting of the given parameters, and thus LSA is a better choice for comparative summarization.
This research work (Martínez-Huertas et al., 2018) focuses on automatic essay evaluation, specifically on automatic assessment of students' summaries using traditional LSA and inbuilt rubric (a novel LSA method). Two conditions are analyzed using the inbuilt rubric method: few vs. many lexical descriptors required to accommodate the expert rubric, and weighted vs. non-weighted method. The weighted method is intended to penalize irrelevant terms or an excess number of terms written by the students. In practice, however, negative marking for irrelevant terms is not acceptable in DAA, so the choice between the weighted and non-weighted methods does not contribute much to a university DAA system. The Pearson correlation between human expert judgment and inbuilt rubric is 0.79, which is better than traditional LSA (r = 0.67). A general corpus was used for training; using a domain specific corpus could increase the performance of the inbuilt rubric.
Leonhard and Dai (2009) proposed a topic based multi-document summarization method based on Probabilistic Latent Semantic Analysis (PLSA), in which sentences and queries are represented as probability distributions over latent topics. In this work, the researchers primarily focused on investigating the capability of the PLSA approach to model documents from various topics. They evaluated three similarity measures: the symmetric Kullback-Leibler (KL) divergence, the Jensen-Shannon (JS) divergence and the cosine similarity. They combined query-focused features and thematic sentence features into an overall sentence score. Both the PLSA and LSA approaches were implemented for the same data samples, but the performance improvements are not significant at p < 0.05.
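The three similarity measures named above can be sketched for two discrete topic distributions as follows. This is a minimal illustration and not the cited authors' implementation: the function names are hypothetical, the symmetric KL divergence is taken as the sum of the two directed divergences (one common convention), and the distributions are assumed to be smoothed so that no probability is zero where the other is positive.

```python
import math

def kl_divergence(p, q):
    """Directed KL(p || q); assumes q[i] > 0 wherever p[i] > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def symmetric_kl(p, q):
    """Symmetric KL, taken here as KL(p || q) + KL(q || p) (an assumption)."""
    return kl_divergence(p, q) + kl_divergence(q, p)

def js_divergence(p, q):
    """Jensen-Shannon divergence: mean KL of p and q to their midpoint m."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)

def cosine_similarity(p, q):
    """Cosine of the angle between p and q; 0.0 if either vector is all zeros."""
    dot = sum(pi * qi for pi, qi in zip(p, q))
    norm_p = math.sqrt(sum(pi * pi for pi in p))
    norm_q = math.sqrt(sum(qi * qi for qi in q))
    return dot / (norm_p * norm_q) if norm_p and norm_q else 0.0
```

The two divergences are distances (0 for identical distributions), whereas the cosine is a similarity (1 for identical directions), so they rank sentences in opposite orders.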
The mathematical technique Singular Value Decomposition (SVD) is applied in LSA for dimension reduction and to eliminate noisy information. One research work (Fallucchi and Zanzotto, 2009) analyses the effect of SVD feature selection with respect to the baseline. Manual feature selection is compared with SVD feature selection for validation. It has been concluded that SVD feature selection improves performance, but some issues still need to be explored, such as: (1) whether SVD feature selection has a positive effect in the syntactic feature space; (2) whether SVD feature selection is better than other unsupervised feature selection models for probabilistic taxonomy learning.
SVD has also been used to encrypt images (El Abbadi et al., 2014), and the decrypted images are close to the original ones. SVD can be used for text encryption. The encryption and decryption time of images using SVD is also very promising.
SVD shows improvement in many areas of research and is capable of solving various research problems. Much research work across various fields exploits LSA, but the empirical evidence requires more investigation.
Research Work Related to DAA using LSA
LSA, in general, can be considered an excellent information retrieval technique, but for this specific task of assessing students' descriptive answers, the results are not satisfactory. After experimenting with LSA for evaluation of students' descriptive answers for various categories of questions, it has been observed that there is a significant gap between the assessment by the human assessor and the results of our computer assessor. The reason can be that some of the basic features of the technique are not suitable for this problem. LSA has been modified to overcome some of these issues/limitations, and the proposed technique, denoted "High Precision Latent Semantic Evaluation" (HPLSE), has been used for automation of the descriptive answer evaluation process, with much better results.
Researchers (dos Santos and Favero, 2015) have used LSA for automatic evaluation of written answers, where LSA pre-processes the answers using unigrams and bigrams of words. Use of the n-gram (n = 1, 2) technique has improved the accuracy of the system. This idea of using the n-gram technique instead of the traditional Bag of Words technique can be adopted in future work. In this research work, the reference answer is considered as the first document in the term-document matrix and the student answers as the other documents. The accuracy of the system is 78.5%.
Researchers (Anirudh et al., 2016) have proposed a score recommendation system that works well for descriptive answers with a small amount of variation from the assessor's perspective. The method used in this system does not rely on any kind of domain specific corpus. The evaluation score is calculated on the basis of analysis of the student's answers against an answer key. In further work, the feature computations can be improved with a domain specific corpus, which can further enhance the accuracy of the system.
Researchers (Thomas et al., 2015) have also used LSA for automatic answer assessment; the proposed system assesses descriptive answers by comparing them with the ideal answer using LSA, positional indexing and spell checking. A word-document matrix is created, where words are collected from the submitted student answers and each student descriptive answer is considered as a document. The relevant keywords with their index positions are given by the teacher. The order of keywords written by students is compared with the keyword order of the ideal answer using positional indexing. Cohen's Kappa method is used to measure the strength of agreement between teacher and tool. The kappa scores obtained by experimenting with three different student datasets are 0.64, 0.73 and 0.61. The system fails to handle the cases where most of the students give wrong answers.
The review so far shows that various methods and techniques have been implemented to solve the research problem of automating the descriptive answer evaluation process. Techniques and methods such as graphical representation of the student answer using LSA and textual representation using PLSA and LSA have been tried. Most of the systems mentioned above have used a comparison-based approach, in which students' descriptive answers are compared with a standard answer/essay.
The Pearson correlation between human assessor and system is in the range 0.6 to 0.785, which definitely needs some improvement, along with the capability of handling exceptional cases. Exceptional cases arise, for example, when most of the students have written wrong or out-of-scope answers in the examination. There are many such cases, which should be handled first to bring a descriptive answer evaluation system into practical use. Available systems are useful for essay grading and short answer grading, but descriptive answer evaluation is still an open research issue.
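Since the Pearson correlation coefficient (r) between the human assessor's marks and the system's marks is the agreement measure used throughout, a minimal sketch of its computation may be useful. The function name is hypothetical and the marks in the usage note are invented toy values, not results from any of the cited systems.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation between two equal-length lists of marks."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    # Undefined when either list is constant (zero variance).
    return cov / math.sqrt(var_x * var_y)
```

For example, `pearson_r([7, 5, 9, 4], [6.5, 5.5, 8, 4.5])` compares four hypothetical human marks with four hypothetical system marks; values close to 1 indicate that the system ranks and spaces the answers much as the human assessor does.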

Methodology
This section covers the overall proposed approach. The proposed system is a comparison-based evaluation system, in which the student's descriptive answer is compared with the standard descriptive answer.
Generally, a descriptive answer having more than one sentence has a complex structure. A teacher evaluating such an answer looks for a collection of information in the answer as per the category of question asked in the examination. Where and how to find these points depends on the category of question. For example, in a "how" type of question, the steps of a process are usually expected in a standard order; altering the steps may change the outcome. The list of points in an answer written for "what" and "why" types of questions, however, permits a more flexible ordering. Keeping this in mind, we attempted to analyze the questions usually asked in examinations and identify categories based on structural and property similarity (a detailed description is given in Table 1). We briefly discuss our attempt in this regard here.

Syntactic Structure of Descriptive Answers
Various categories of questions are asked by teachers in the exam question papers of universities/institutions, for example explain, describe, what, why, how, justify, define, elaborate, short notes and comparison based. Some categories of question, like "draw" and "calculate", are excluded from the list because diagrams and mathematical expressions are out of the scope of this research work. After exploring different university exam question papers, a list of categories of questions was formed. Table 1 lists what is expected to be covered in the answer for each category. Analysis of the various categories of questions gives clarity about the syntactic structure of descriptive answers and the evaluation parameters. The length of a descriptive answer can vary from a phrase to a sentence to a page or multiple pages. After classifying the questions from exam question papers into different categories, samples of exam answer sheets were collected from the universities/institutions.
The descriptive exam answer sheets collected were those of engineering students in subjects of the computer engineering stream, such as distributed computing systems and artificial intelligence.
Samples of exam answer sheets were evaluated by five different assessors, to check for variation in marks allocation. These already assessed answer sheets were analyzed with the support of other teachers/assessors in order to understand the psychology of the assessor and the way he/she allocates or deducts marks for the answers written by the students in the examination.
Some assumptions have been made to simplify the task of automation: sentences in the student's descriptive answer are assumed to be grammatically correct, with no spelling mistakes. Only textual answers are considered. Diagrams and mathematical expressions are out of the scope of this research work.

Proposed Approach
The proposed comparison-based approach determines the similarity between the student and standard descriptive answers. Broadly, the system has three major modules: the standard answer representation, the student answer representation and a comparison unit (as shown in Fig. 1).

Standard textual answer: The standard textual answer is the precise answer written by a domain expert or a teacher.
Student textual answer: The samples of student textual answers used for this experiment are free-form text in the range of 5-6 grammatically correct English sentences (approx. 80-100 words).
Domain specific corpus: The domain specific corpus includes data from various e-resources and textbooks related to that domain.
Steps to create the domain specific corpus:
Step 1: Collect domain related textual data from various e-resources and textbooks.
Step 2: Analyze the textual data by calculating the frequency count of all the unigrams, bigrams and trigrams occurring in it, using a text analyzer (http://onlineutility.org/text/analyser.jsp). This is a free software utility that finds the most frequent phrases and their frequencies. Non-English language texts are supported. It also counts the number of words, characters, sentences and syllables and calculates lexical density.
Step 4: If the frequency of a keyword is beyond a threshold level, it should be discarded from the corpus (please refer to Fig. 2). The thresholds are Hfv, the high frequency value (keywords above it are generally stop words), and Lfv, the low frequency value (rare keywords below it do not contribute to defining the meaning of a concept).
Step 5: After filtration, use this domain specific corpus for the HPLSE algorithm.
The basic differences between LSA and HPLSE are listed in Table 2.
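The corpus preparation steps can be sketched as follows. This is a minimal illustration under stated assumptions rather than the tool used in the paper: the function names and the tokenizer are hypothetical, the thresholds `lfv` and `hfv` are treated as inclusive bounds on the raw count, and n-grams are allowed to run across sentence boundaries for simplicity.

```python
import re
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-word phrases in a token list, joined by spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def build_index_terms(corpus_text, lfv, hfv):
    """Count unigrams, bigrams and trigrams; keep those with lfv <= count <= hfv.

    Terms above hfv (typically stop words) and below lfv (rare terms) are
    discarded. Treating both thresholds as inclusive is an assumption.
    """
    tokens = re.findall(r"[a-z0-9]+", corpus_text.lower())
    counts = Counter()
    for n in (1, 2, 3):
        counts.update(ngrams(tokens, n))
    return {term: c for term, c in counts.items() if lfv <= c <= hfv}
```

The keys of the returned dictionary would serve as the index terms (rows) of the term-document matrix; suitable values of `lfv` and `hfv` depend on the size of the corpus.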

HPLSE (High Precision Latent Semantic Evaluation) Technique
A modified algorithm (a modified version of LSA) has been introduced as High Precision Latent Semantic Evaluation (HPLSE).
HPLSE is a technique in natural language processing, derived from LSA, for finding the semantic similarity between the students' descriptive answers and the standard answer.
In contrast with LSA, index words (rows in the term-document matrix) are collected from the domain specific corpus and not from the document pool or paragraphs. This modification is expected to increase the precision and recall of HPLSE for evaluating student descriptive answers.
In LSA, the frequently used words in the answers become part of the index term pool. So when a large number of students have written wrong descriptive answers, irrelevant index words become part of the index term pool, resulting in false outcomes. Such problems have been rectified using HPLSE.
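The effect of fixing the index terms in advance can be sketched as follows: each answer is counted only against the corpus-derived terms, so words that occur only in wrong answers never enter the index term pool. The function names, the tokenizer and the example terms are hypothetical and serve only as an illustration.

```python
import re

def tokenize(text):
    """Lowercase word tokens; a deliberately simple, assumed tokenizer."""
    return re.findall(r"[a-z0-9]+", text.lower())

def term_frequencies(answer, index_terms):
    """Count how often each index term (a 1- to 3-word phrase) occurs in an answer.

    Returns one count per entry of index_terms, in the same order, i.e. one
    column of the term-document matrix. Words outside index_terms are ignored.
    """
    tokens = tokenize(answer)
    counts = dict.fromkeys(index_terms, 0)
    for n in (1, 2, 3):
        for i in range(len(tokens) - n + 1):
            phrase = " ".join(tokens[i:i + n])
            if phrase in counts:
                counts[phrase] += 1
    return [counts[term] for term in index_terms]
```

With index terms such as `["process", "distributed system", "message passing"]`, an off-topic answer produces an all-zero column instead of adding its own vocabulary to the rows of the matrix.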

High Precision Latent Semantic Evaluation (Modified version of LSA)
The steps of applying HPLSE for automated assessment of descriptive answers are:
Step 1: HPLSE begins with the construction of a term-document matrix $X$. Determine the unigrams, bigrams and trigrams from the domain specific corpus collected from various e-resources and place them in the rows of the term-document matrix $X$.
Step 2: Consider all unigrams, bigrams and trigrams from the domain specific corpus as rows in the term-document matrix and all students' descriptive answers as documents. In the term-document matrix ($X$), each unigram, bigram and trigram is represented by a row ($i = 1, 2, 3, \ldots, m$) and each student descriptive answer is represented by a column ($j = 1, 2, 3, \ldots, n$), with each matrix cell initially representing the number of times (term frequency, $tf_{ij}$) the associated term appears in the student descriptive answer.
Step 3: Construct a t x 1 query matrix $q$ by considering terms from the standard descriptive answer and calculating the term frequency of each term as a cell value of the query matrix.
Step 4: Weight each entry $tf_{ij}$ in $X$ using the TF-IDF (Term Frequency-Inverse Document Frequency) weight function. The weight function is used to determine the importance of each term. The new weighted matrix is $X_w$.
Step 5: Singular value decomposition (SVD) is applied to the matrix $X_w$ to decompose it into three other matrices, an m by r term-concept vector matrix ($U$), an r by r singular values matrix ($S$) and an r by n concept-document vector matrix ($V^T$), which satisfy the following relation:
$$X_w = U S V^T$$
Step 6: Choose an optimum dimension k to reduce $X_w$ to $X_w'$:
$$X_w' = U_k S_k V_k^T$$
where the $U_k$ and $V_k^T$ matrices define the term and document vector spaces. Dimension reduction is used to keep the important information while reducing the noisy data in the dataset by setting less important dimensions to zero.
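Steps 4 to 6 can be sketched with NumPy as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: the function names are hypothetical, the TF-IDF variant (raw count times log(n / df)) is an assumption because the weight function is not specified further, and the fold-in of the query vector uses the conventional LSA projection q' = q^T U_k S_k^{-1}.

```python
import numpy as np

def tfidf_weight(X):
    """Weight raw counts X (terms x answers) as tf_ij * log(n / df_i).

    This TF-IDF variant is an assumption; terms that occur in no answer,
    or in every answer, receive weight 0.
    """
    X = np.asarray(X, dtype=float)
    n_docs = X.shape[1]
    df = np.count_nonzero(X, axis=1)        # answers containing each term
    idf = np.zeros(X.shape[0])
    seen = df > 0
    idf[seen] = np.log(n_docs / df[seen])
    return X * idf[:, None]

def truncated_svd(Xw, k):
    """Return U_k (m x k), the k largest singular values, and V_k^T (k x n)."""
    U, s, Vt = np.linalg.svd(np.asarray(Xw, dtype=float), full_matrices=False)
    return U[:, :k], s[:k], Vt[:k, :]

def reduced_matrix(U_k, s_k, Vt_k):
    """Rebuild the rank-k approximation X_w' = U_k S_k V_k^T."""
    return U_k @ np.diag(s_k) @ Vt_k

def fold_in_query(q, U_k, s_k):
    """Project a term vector q into the k-dimensional space (assumed fold-in).

    Requires all retained singular values s_k to be nonzero.
    """
    return (np.asarray(q, dtype=float) @ U_k) / s_k
```

Column j of `Vt_k` is then the reduced representation of student answer j, and the folded-in standard answer can be compared with it in the same k-dimensional space.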
A rule of thumb for finding out the optimum value of k is to retain enough singular values to make up to 90% of the energy in S, i.e., the sum of the squares of the retained singular values should be at least 90% of the sum of the squares of all the singular values (http://infolab.stanford.edu/~ullman/mmds/ch11.pdf).
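The 90% energy rule of thumb can be sketched as a small helper. The function name is hypothetical, and the singular values are assumed to be sorted in descending order, as returned by standard SVD routines.

```python
def choose_k(singular_values, energy=0.90):
    """Smallest k whose leading singular values keep `energy` of the total.

    Energy is the sum of squared singular values; the input is assumed to be
    sorted in descending order.
    """
    squares = [s * s for s in singular_values]
    total = sum(squares)
    if total == 0:
        return 0
    running = 0.0
    for k, sq in enumerate(squares, start=1):
        running += sq
        if running >= energy * total:
            return k
    return len(squares)
```

For singular values 3, 1, 1 the total energy is 11; keeping only the first value retains 9/11 (about 82%), which is below the threshold, while keeping two retains 10/11 (about 91%), so k = 2.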
To find the optimum dimension value (k), an empirical formula is used. The number of optimum dimensions is typically on the order of 100 to 300 in LSA, but HPLSE also works well with fewer dimensions.

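The similarity between the standard answer and a student answer is computed as the cosine between their vectors in the reduced space:
$$\cos(q', V') = \frac{q' \cdot V'}{\lVert q' \rVert \, \lVert V' \rVert} \quad (5)$$
where $\cos(q', V')$ is the similarity match between the student answer and the standard descriptive answer.

By using the proposed technique (HPLSE), for the same datasets, the average score difference and standard deviation between a human assessor and the computer assessor have reduced (please refer to Fig. 4 and 5) and the Pearson correlation coefficient (r) has increased considerably (please refer to Fig. 6). The reasons for the improvements are:
• Ability of the system to retrieve the relevant and reject the irrelevant phrases from the student's descriptive answer
• Precise relevance check according to human assessor perception, which provides high precision
• Pruning of extra terms from the corpus, reducing polysemy
In future studies, categories of question like how, compare and draw, and the evaluation of mathematical expressions, may be tried.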

Fig. 3: Block diagram comparing LSA and HPLSE techniques. (Figure labels: 2 semantic vectors in a semantic space; score calculation = total marks * cosine similarity; r = number of singular values in the singular matrix S.)
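The score calculation noted in Fig. 3 (total marks multiplied by the cosine similarity of the two semantic vectors) can be sketched as follows. The function names are hypothetical, and the handling of an all-zero vector (similarity 0) is an assumption.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # assumed convention for an empty (all-zero) vector
    return dot / (norm_u * norm_v)

def answer_score(total_marks, standard_vec, student_vec):
    """Score = total marks * cosine similarity of the two semantic vectors."""
    return total_marks * cosine_similarity(standard_vec, student_vec)
```

In a reduced semantic space the cosine can be negative, in which case this sketch returns a negative score; how such cases should be treated is not specified here and would need a separate policy.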

Table 1: Various categories of questions

Table 2: LSA v/s HPLSE
S no. 1
LSA: Bag of Words technique is used. Assumption: each word meant only one concept and each concept was described only by one word, and words are assumed to have only one meaning.
HPLSE: N-gram technique where N = 1, 2, 3. Unigrams, bigrams and trigrams are collected from the corpus.