Dynamic Bayesian Networks in Classification-and-Ranking Architecture of Response Generation

Problem statement: The first component in the classification-and-ranking architecture is a Bayesian classifier that classifies user utterances into response classes based on their semantic and pragmatic interpretations. Bayesian networks are sufficient if the data is limited to a single user input utterance. However, if the classifier is able to collate features from a sequence of the previous n-1 user utterances, the additional information may or may not improve the accuracy rate in response classification. Approach: This article investigates the use of dynamic Bayesian networks to include time-series information in the form of extended features from preceding utterances. The experiment was conducted on the SCHISMA corpus, which is a mixed-initiative, transaction dialogue in theater reservation. Results: The results show that classification accuracy is improved, but rather insignificantly. The accuracy rate tends to deteriorate as the time-span of the dialogue is increased. Conclusion: Although every response utterance reflects form and behavior that are expected by the preceding utterance, the influence of meaning and intentions diminishes over time as the conversation stretches to a longer duration.


INTRODUCTION
Response generation is a process of natural language generation in dialogue systems, which is responsible for providing dialogue responses as part of an interactive human-machine conversation. In human-human conversation, dialogue is mutually structured and timely negotiated between dialogue participants. Speakers take turns when they interact; they interrupt each other, but their speech seldom overlaps. Each speaker is affected by what the other speaker has said, and what each speaker says affects what the next speaker will say. Similarly, human-machine conversation through dialogue systems must exhibit comparable qualities. But for dialogue systems to recognize turns, consider interrupts and maintain coherence, response generation must rely on pragmatic interpretation in addition to semantic understanding of user input utterances.
Classification-and-ranking generation is an alternative to grammar-based or statistical-based generation. This type of generation assumes that each user utterance represented in some context has its counterpart response in the dialogue corpus, hence promoting open-domain quality (Mustapha et al., 2010). Classification-and-ranking architecture consists of two components: a Bayesian classifier to classify user utterances into response classes based on intentions of user input utterance and an Entropic ranker that scores the candidate response utterances according to semantics relevant to the user utterance (Mustapha et al., 2008). The generation architecture is shown in Fig. 1.
Fig. 1: Classification-and-ranking generation
Nonetheless, processing and generating natural languages requires understanding the interaction of complex knowledge sources, disguised in many forms. Much of what we understand about language is known with various degrees of certainty due to oversimplifying assumptions on many independencies and dependencies between context and meaning of language. Bayesian Networks (BNs) are a natural choice of approach in Natural Language Processing (NLP) because they offer a formal treatment of uncertainties and independencies. By using BNs, we are able to introduce the relationship of knowledge in two ways: within the structure of the network and within the probability distributions of the network parameters. We can then reason under uncertainty via the joint distributions.
Bayesian networks have also gained significant attention in other domains such as medicine (Elsayad, 2010; Saat et al., 2010) and intrusion detection (Khor et al., 2009; Mehdi et al., 2007). Nonetheless, a BN is useful when the parameters in the domain are static, for example when each feature has a single, fixed value. In dialogue systems, however, utterances are produced in turns by two speakers, so data arrives sequentially, turn after turn. Instead of analyzing one utterance at every turn, we can apply BN theory to the present and previous utterances from a sequence of turns. This leads to the proposed Dynamic Bayesian Networks (DBN) in response classification.
Dynamic Bayesian networks: Dynamic Bayesian Networks (DBNs) are directed graphical models of stochastic processes. A DBN is a specific type of BN that consists of time-slices, whereby each time-slice or time-step contains its own network and variables. Given the time-variant property, however, a DBN must maintain the same network structure at each time-slice, and the cross arcs extend only between two consecutive time-slices. Despite the name, the network structure of a DBN does not change. The term "dynamic" refers to time-dependent or sequence-dependent modeling and has nothing to do with the structure.
In a DBN, each position in the sequence-slice is characterized by n random variables. Within each slice, the random variables are represented by an ordinary BN, duplicated along the sequence positions. The sequential dependencies are represented by a set of arcs that connect the nodes across the consecutive sequence in the network. The topology of a DBN is a repeating structure and the Conditional Probability Distributions (CPDs) within each structure also do not change in each sequence-slice.
A DBN is defined as the pair $(B_1, B_\rightarrow)$, where $B_1$ is a BN that represents the prior or initial state distribution of the state variables $Z_1$ (Ribeiro-Neto et al., 2000). Here $Z_t = (U_t, X_t, Y_t)$, where $U_t$, $X_t$ and $Y_t$ represent the input, hidden and output variables of the DBN. In the simplest problem, where there are only two consecutive BNs, $B_\rightarrow$ is a two-slice temporal BN (2TBN) with the transition shown in Eq. 1:

$$P(Z_t \mid Z_{t-1}) = \prod_{i=1}^{N} P(Z_t^i \mid \mathrm{Parents}(Z_t^i)) \qquad (1)$$

where $Z_t^i$ is the i-th node at time t and N is the number of nodes in a slice. Again, $Z_t^i$ may belong to $U_t$, $X_t$ or $Y_t$. In turn, $\mathrm{Parents}(Z_t^i)$ are the parents of $Z_t^i$, whether in the same or the previous slice. Since the structure repeats across the sequence, the parameters for slices t = 2, 3 and so forth remain the same. Note that nodes in the first slice of the 2TBN do not have parameters associated with them. Therefore, the parameters of the model can be fully described using only the first two slices. The joint probability distribution (JPD) for a sequence of length T is obtained by "unrolling" the DBN as shown in Eq. 2:

$$P(Z_{1:T}) = \prod_{t=1}^{T} \prod_{i=1}^{N} P(Z_t^i \mid \mathrm{Parents}(Z_t^i)) \qquad (2)$$

In modeling a dialogue corpus, a sequence of user utterances is termed a sequence-slice in the DBN, where t represents an utterance at time t. Hence, t = 0 represents the current utterance, t = 1 represents the previous one utterance, t = 2 represents the previous two utterances and t = 3 represents the previous three utterances.
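To make the unrolling behind Eq. 2 concrete, the following is a minimal Python sketch, not the model used in the experiment. It assumes a hypothetical DBN with two nodes per slice (a hidden state X_t and an observation Y_t) and one cross arc from X_{t-1} to X_t. All state names, observation symbols and probabilities are invented for illustration.

```python
# Hypothetical DBN with two nodes per slice: hidden X_t -> observed Y_t,
# plus a cross arc X_{t-1} -> X_t. Every number below is made up.
prior_x = {"ask": 0.6, "book": 0.4}              # B_1: P(X_1)
trans_x = {                                      # B_->: P(X_t | X_{t-1})
    "ask": {"ask": 0.7, "book": 0.3},
    "book": {"ask": 0.2, "book": 0.8},
}
emit_y = {                                       # P(Y_t | X_t), shared by all slices
    "ask": {"q": 0.9, "s": 0.1},
    "book": {"q": 0.3, "s": 0.7},
}


def joint_probability(xs, ys):
    """Unroll the DBN over len(xs) slices and multiply the local CPDs."""
    if not xs or len(xs) != len(ys):
        raise ValueError("xs and ys must be non-empty and of equal length")
    # First slice: factors come from the prior network B_1.
    p = prior_x[xs[0]] * emit_y[xs[0]][ys[0]]
    # Later slices: the same 2TBN parameters are reused at every step.
    for t in range(1, len(xs)):
        p *= trans_x[xs[t - 1]][xs[t]] * emit_y[xs[t]][ys[t]]
    return p


# Two slices: P(X_1=ask) P(Y_1=q | ask) P(X_2=book | ask) P(Y_2=s | book)
example = joint_probability(["ask", "book"], ["q", "s"])
```

Because the transition and observation tables are shared by every slice, the number of parameters does not grow with the length of the sequence, which is the point made above about describing the model with the first two slices only.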

MATERIALS AND METHODS
Human beings perform classification in a variety of activities, ranging from cognitive to behavioral tasks. We make decisions based on the information at hand and rely on those decisions again in new but analogous situations. Once we settle on a particular situation, we weigh and rank the options to make the best decision given the opportunities and constraints in that situation. In a classification experiment, as in any classification task, the main task is to determine which of a set of classes an observation belongs to. In response classification, the objective is to identify a response class for each response utterance by maximizing P(response class | user utterance). The list of response classes is shown in Table 1.
The experiment assesses the accuracy of predicting the correct response class rc given the user utterance U. By Bayes' rule, maximizing P(rc | U) is equivalent to maximizing P(U | rc)P(rc), so the best response class is given by Eq. 3:

$$rc^{*} = \arg\max_{rc} P(U \mid rc)\,P(rc) \qquad (3)$$

Dynamic Bayesian Networks (DBN) model the sequence of user utterances in time-slices t, t-1, t-2 and so forth. This is possible through an extended set of semantic and pragmatic features that are extracted from a sequence of the previous n-1 user utterances. In this experiment, we limit t to three previous utterances.
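The decision rule in Eq. 3, combined with features drawn from preceding utterances, can be sketched in Python as follows. This is an illustrative stand-in, not the PNL-based classifier used in the experiment. It assumes a naive Bayes factorization of P(U | rc) with add-one smoothing, and the feature strings, response-class labels and helper names (`window_features`, `NaiveBayes`) are hypothetical.

```python
from collections import Counter, defaultdict
from math import log


def window_features(utterances, index, n_prev):
    """Features of utterance `index` plus up to n_prev predecessors,
    each tagged with its time offset (t-0 is the current utterance)."""
    feats = []
    for offset in range(n_prev + 1):
        j = index - offset
        if j < 0:
            break
        feats.extend(f"t-{offset}:{f}" for f in utterances[j])
    return feats


class NaiveBayes:
    """Assumed factorization: P(U | rc) = product of P(feature | rc)."""

    def __init__(self):
        self.class_counts = Counter()
        self.feat_counts = defaultdict(Counter)
        self.vocab = set()

    def fit(self, samples):
        for feats, rc in samples:
            self.class_counts[rc] += 1
            for f in feats:
                self.feat_counts[rc][f] += 1
                self.vocab.add(f)

    def predict(self, feats):
        total = sum(self.class_counts.values())

        def score(rc):
            # log P(rc) + sum of log P(f | rc), with add-one smoothing
            denom = sum(self.feat_counts[rc].values()) + len(self.vocab)
            s = log(self.class_counts[rc] / total)
            for f in feats:
                s += log((self.feat_counts[rc][f] + 1) / denom)
            return s

        return max(self.class_counts, key=score)  # arg max over rc


# Hypothetical toy dialogue: one feature list per user utterance.
dialogue = [["act=greet"], ["act=question", "topic=price"],
            ["act=statement", "topic=card"]]
extended = window_features(dialogue, 2, n_prev=1)

model = NaiveBayes()
model.fit([
    (["t-0:act=question", "t-0:topic=price"], "clarify"),
    (["t-0:act=question", "t-0:topic=price"], "clarify"),
    (["t-0:act=statement", "t-0:topic=card"], "inform"),
    (["t-0:act=greet"], "greet"),
])
predicted = model.predict(["t-0:act=question", "t-0:topic=price"])
```

Tagging each feature with its offset keeps the same feature from different turns distinct, which loosely mirrors how a feature variable is tracked across time-slices.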

Dialogue corpus:
The dialogue utterances under study are sourced from the SCHISMA corpus. SCHISMA (SCHouwburg Informatie Systeem) is a Theater Information and Ticket Reservation system (Hoeven et al., 1996). The dialogue corpus is a collection of 64 text-based, human-machine dialogues obtained through a series of Wizard of Oz experiments. It contains 920 user utterances and 1,127 server utterances, for a total of 2,047 individual utterances in 1,723 turns.
SCHISMA is a mixed-initiative, transaction dialogue, in which there are two types of interaction: inquiry and transaction (Hulstijn, 2000). In a transaction dialogue, both user and system must collaborate to reach agreement on several issues, such as ticket price or discount availability, before reaching the point of reservation. This model is more complex than question-answering systems because, at any point, both parties may request information from each other, and the user in particular may retract any previous decision and take the conversation in a totally different direction. The dialogue excerpts in Fig. 2 illustrate the complexities in the structure of mixed-initiative, transaction dialogues.
The dialogue commences with the user browsing for information on theater performances. However, when the user asks about the ticket price, as shown in Line 7, the system replies with a question to clarify whether the user holds a reduction card, which would change the ticket price altogether.
Features: Each utterance is analyzed from the perspective of speech actions, which is fully [...] (2) semantic content in the form of an input frame. These observed features are the utterance properties that uniquely constitute the user utterance during a particular turn of a conversation. The SCHISMA corpus has already been tagged with the DAMSL annotation scheme by Keizer and Akker (2007). In addition, two types of features are extracted from the user input utterances, semantic features and pragmatic features (Mustapha et al., 2009), as shown in Tables 2 and 3, respectively.

DBN classification:
Training the SCHISMA corpus under Bayesian Networks (BN) and Dynamic Bayesian Networks (DBN) is carried out using the Probabilistic Network Libraries (PNL) (Intel, 2004). PNL's support for dynamic Bayesian networks is the main reason for choosing it over other toolkits such as WEKA (Ian and Frank, 2005). Structural learning in PNL is carried out using the Hill-Climbing algorithm, which searches the space of Directed Acyclic Graphs (DAGs) and builds the structure that best matches the training set according to the scoring function. Figure 3 illustrates one possible structure for a DBN produced by PNL, "unrolled" across the sequence of utterances from t = 0 to t = 3. The dotted lines track each particular feature variable. To test the accuracy of the DBN, 10-fold cross validation was applied: the SCHISMA corpus was split into ten approximately equal partitions, each used in turn for testing while the remaining nine were combined for training.
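The 10-fold cross-validation procedure can be sketched in Python as follows. This is a generic illustration rather than the PNL workflow. The helper name `k_fold_indices`, the fixed shuffling seed and the choice of the 920 user utterances as the unit of partitioning are assumptions, since the text does not state whether the corpus was split by utterance or by dialogue.

```python
import random


def k_fold_indices(n_items, k=10, seed=0):
    """Yield (train, test) index lists for k approximately equal folds."""
    if not 1 < k <= n_items:
        raise ValueError("k must satisfy 1 < k <= n_items")
    order = list(range(n_items))
    random.Random(seed).shuffle(order)        # assumed: random assignment to folds
    folds = [order[i::k] for i in range(k)]   # fold sizes differ by at most one
    for i, test in enumerate(folds):
        # Each fold is the test set once; the other k-1 folds form the training set.
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


# Assumption: one item per user utterance (920 in the SCHISMA corpus).
splits = list(k_fold_indices(920, k=10))
```

The reported accuracy would then be the average over the ten test folds. If whole dialogues must stay together so that preceding utterances are always available, the same helper could be applied to the 64 dialogue indices instead.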

RESULTS
The goal of the experiment is to investigate the impact of features extracted from the previous n user input utterances, that is, whether the semantic or pragmatic representation of the preceding utterances has any influence on the accuracy of response classification. The results are compared with the Bayesian network classification by Mustapha et al. (2009). Table 4 compares the accuracy percentages for the response classification task using both semantic features and pragmatic features from previous user utterances.

DISCUSSION
From Table 4, BN classification yields an accuracy of 73.9%. When DBN classification is performed, the results show that accuracy improves, but rather insignificantly, whether through time-series features in semantic content (semantic-n) or in pragmatics (pragmatic-n). Even though previous topics contributed to an increase in accuracy immediately at t = 1, and previous intentions contributed to an increase only at t = 2, in the end the accuracy rate deteriorated as the time-span of the dialogue increased.
This observation shows that intentions and semantic content in previous utterances do not have enough impact to uniquely characterize input utterances. Figure 4 illustrates the changes in classification accuracy rates for the experiment. The x-axis shows the time-series factor, where t = 0 represents the current utterance, t = 1 represents the previous one utterance, t = 2 represents the previous two utterances and t = 3 represents the previous three utterances. The y-axis shows classification accuracy in percentages.

CONCLUSION
The underlying philosophy of the classification-and-ranking architecture in natural language generation is for the response generator to learn response utterances directly from the domain corpus. The classification experiment was designed to classify response utterances into response classes in order to delimit the search space for ranking the utterances. Through ranking, the response with the highest probability is returned as the final response to the user input utterance.
The results of the time-series experiments using dynamic Bayesian techniques are consistent with findings on dialogue act recognition in utterances (Ali et al., 2006), whereby consideration of the previous n utterances in a dialogue does not necessarily affect classification accuracy. While the first two previous utterances may increase the accuracy rate, the accuracy deteriorates as the time-span of the dialogue increases.
Although every response utterance reflects form and behavior that are expected by the preceding utterance, the influence of meaning and intentions diminishes over time as the conversation stretches to a longer duration.