A Software Agent for Speech Abiding Systems



INTRODUCTION
Speech and natural language understanding are the key technologies that will have the most impact in the next 15 years. Enabling users to speak and listen to a computer will greatly enhance their ability to access computers at any time from nearly any place. Speaking and listening are so fundamental that people take them for granted. Every day, people ask questions and give instructions. Speaking and listening are necessary for learning and training, for selling and buying, for persuading and agreeing, and for most social interactions. For the majority of people, speaking and understanding speech is simply the most convenient and natural way of interacting with other people. So, is it possible to speak and listen to a computer? Yes.
Speech technologies allow companies to offer the option of a self-service interface for tasks like rate quotes, reservations, technical troubleshooting, and customer support, as well as to handle complex customer dialogues, capturing all the information required to provide detailed responses.
Despite the significant progress that has been made in speech recognition and spoken-language processing, building a successful dialogue system still requires large amounts of development time and human expertise. In addition, spoken-dialogue-system algorithms often have little generalization power and are not portable across application domains.
Motivation for speaking and listening to a computer to retrieve speech content: First, to overcome physical impairments such as blindness or poor physical dexterity. Speaking enables impaired callers to access computers: callers with poor physical dexterity (who cannot type) can speak their requests to the computer, and the sight-impaired can listen to the computer as it speaks. When visual and/or mechanical interfaces are not an option, callers can perform transactions by saying what they want done and supplying the appropriate information. If a person with impairments can speak and listen, that person can use a computer. Second, to bypass the limitations of small keyboards and screens. As devices become smaller, our fingers do not; keys on the keypad shrink, often to the point where people with thick fingers press two or more keys with one stroke. The small screens on some cell phones may be difficult to see, especially in extreme lighting conditions. By speaking and listening, callers can bypass the small screens of many handheld electronic devices.
A specific problem in speech input is room acoustics: environmental noise may prevail, and frequency-dependent reflections from walls and objects overlay the primary sound wave. Word boundaries also have to be detected, which is not easy, because most speakers of most human languages do not mark the end of one word and the beginning of the next. A kind of time normalization is required to compare a speech unit with existing samples: the same word can be spoken fast or slow, yet we cannot simply compress or stretch the time axis, because elongation factors are not proportional to the total duration. Speech recognition and understanding (here termed "abiding") by a machine or a system is still a difficult and largely unsolved problem, and a number of areas of active research are attempting to conquer the remaining serious problems. It is not possible to understand speech and audio signal processing in any depth without a solid background in the mathematical underpinnings of signal processing and pattern recognition. Advances in speech processing owe much to advancing computer technology, but this progress has also depended on the mathematical discipline of digital signal processing. The connection between speech and Digital Signal Processing (DSP) is straightforward: speech depends greatly on filtering, in both production and perception.
The objective of this study is to capture error-free speech content after considering all the characteristics of Natural Language (NL). A viable solution to these issues is a well-defined algorithm using the Fast Fourier Transform together with filtering through decimation [1]. We touch upon this issue here by giving the Signal Flow Graph (SFG) after the FFT and decimation process, which forms a "butterfly" symmetry [2]; a solution is feasible in hardware and/or in software, and here we attempt it through software. The detailed study, design, and required tools and software are given in the following paragraphs. Why focus on the FFT: the DFT [3] computation yields the spectrum of a finite sequence, hence its great importance in signal-processing applications. In analyzing speech-signal variations with the Discrete-Time Fourier Transform (DTFT), we encounter the problem that a single Fourier transform cannot characterize changes in speech content over time, such as time-varying formants and harmonics. In contrast, short-time analysis (computed efficiently with the Fast Fourier Transform, FFT) consists of a separate Fourier transform for each instant in time. In particular, we associate with each instant the Fourier transform of the signal in the neighborhood of that instant, so that the spectral evolution of the signal can be traced in time.
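What this describes is the standard short-time (sliding-window) spectral analysis. In a common discrete-time formulation (our notation, with w a finite analysis window centred near sample n, not reproduced from the paper):

\[
X(n, \omega) = \sum_{m=-\infty}^{\infty} x(m)\, w(n - m)\, e^{-j\omega m}
\]

Each fixed n yields an ordinary spectrum, and letting n slide traces the evolution of formants and harmonics over time; the FFT is what makes computing each of these spectra cheap.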

MATERIALS AND METHODS
There is a great variety of FFT algorithms. They can all be derived from successive applications of a single operation: representing a one-dimensional string of numbers as a two-dimensional array. Given an N-point sequence, the integer N is either a prime or a composite number. If N is composite, it can be expressed as the product N_1 N_2; if either or both of N_1 and N_2 are composite, further reduction is possible. For example, we can express the number 60 as (12×5), (3×4×5), (2×2×5×3), and so on. The term radix is commonly used to describe this decomposition: if N can be expressed as a product of the same integer r, the FFT algorithm is called a radix-r algorithm [2]. If N is a power of 2, the DFT can be computed in N log_2 N operations. For example, if N = 1024, then log_2 N = 10 and the number of operations is 1024×10 = 10240. This contrasts with the "brute force" DFT computation, which takes N^2 = 1024×1024 operations. Therefore:

\[
\text{Savings} = \frac{N^2}{N \log_2 N} = \frac{1024 \times 1024}{10240} = 102.4
\]

The savings is a factor of about 100 (ignoring the details); in terms of operation count alone, the FFT is therefore roughly 100 times better than the direct DFT.
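As an illustration of the radix-2 idea, here is a minimal recursive decimation-in-time FFT sketch in C# (chosen because C#/.NET is the environment proposed in Part II; class and method names are ours, and N is assumed to be a power of two):

```csharp
// A minimal sketch of a recursive radix-2 decimation-in-time FFT.
// Assumptions (ours, not the paper's): N is a power of two.
using System;
using System.Numerics;

static class Radix2FftDemo
{
    public static Complex[] Fft(Complex[] x)
    {
        int n = x.Length;
        if (n == 1) return new[] { x[0] };

        // Decimation in time: split into even- and odd-indexed halves.
        var even = new Complex[n / 2];
        var odd = new Complex[n / 2];
        for (int i = 0; i < n / 2; i++)
        {
            even[i] = x[2 * i];
            odd[i] = x[2 * i + 1];
        }
        Complex[] e = Fft(even);
        Complex[] o = Fft(odd);

        // Combine the half-size spectra with N/2 butterfly operations.
        var result = new Complex[n];
        for (int k = 0; k < n / 2; k++)
        {
            // Twiddle factor W_N^k = e^{-j 2 pi k / N}.
            Complex w = Complex.FromPolarCoordinates(1.0, -2.0 * Math.PI * k / n);
            result[k] = e[k] + w * o[k];          // top wing of the butterfly
            result[k + n / 2] = e[k] - w * o[k];  // bottom wing
        }
        return result;
    }

    static void Main()
    {
        // An 8-point impulse has a flat magnitude spectrum: a quick sanity check.
        var x = new Complex[8];
        x[0] = 1.0;
        foreach (Complex bin in Fft(x))
            Console.WriteLine(bin);
    }
}
```

Each of the log_2 N stages performs N/2 butterfly operations, which is where the N log_2 N count above comes from.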
The motivation for the Fourier transform comes from the study of Fourier series, in which complicated periodic functions are written as the sum of simple waves mathematically represented by sines and cosines. Owing to the properties of sine and cosine, it is possible to recover the amount of each wave in the sum by an integral. In many cases it is desirable to use Euler's formula, e^{2πiθ} = cos 2πθ + i sin 2πθ, to write Fourier series in terms of the basic waves e^{2πiθ}. This has the advantage of simplifying many of the formulae involved in Fourier analysis. This passage from sines and cosines to complex exponentials makes it necessary for the Fourier coefficients to be complex-valued. The usual interpretation of such a complex number is that it gives both the amplitude (or size) of the wave present in the function and the phase (or initial angle) of the wave.
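Concretely, for a function f of period T, the complex-exponential form of the series and the integral that recovers each coefficient are (standard results, stated here for completeness):

\[
f(t) = \sum_{n=-\infty}^{\infty} c_n\, e^{2\pi i n t/T},
\qquad
c_n = \frac{1}{T}\int_{0}^{T} f(t)\, e^{-2\pi i n t/T}\, dt
\]

The modulus |c_n| gives the amplitude of the n-th wave and the argument of c_n gives its phase, which is exactly the interpretation described above.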
We use compression techniques to reduce the number of windows, through powerful mathematical tools such as the Discrete Fourier Transform (DFT) [3] and the Discrete Cosine Transform (DCT) [3], suitably adapted. In other words, speech content for the signal-based audio data is captured through discrete transformations.
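As a sketch of this discrete-transformation step, the following C# routine computes the (unnormalized) DCT-II of one audio frame; the frame values, the names, and the keep-the-low-order-coefficients idea are our illustrative assumptions, since the paper's "suitably adapted" transform is not spelled out:

```csharp
// An unnormalized DCT-II of one audio frame. For smooth signals the
// energy concentrates in a few low-order coefficients, which is what
// makes the DCT useful for compression.
using System;

static class DctDemo
{
    // X[k] = sum_{n=0}^{N-1} x[n] * cos( (pi/N) * (n + 0.5) * k )
    public static double[] DctII(double[] frame)
    {
        int n = frame.Length;
        var coeffs = new double[n];
        for (int k = 0; k < n; k++)
        {
            double sum = 0.0;
            for (int i = 0; i < n; i++)
                sum += frame[i] * Math.Cos(Math.PI / n * (i + 0.5) * k);
            coeffs[k] = sum;
        }
        return coeffs;
    }

    static void Main()
    {
        // Compression keeps only the largest (typically low-order)
        // coefficients and zeroes the rest.
        var frame = new double[] { 1.0, 0.9, 0.7, 0.4, 0.1, -0.2, -0.4, -0.5 };
        double[] c = DctII(frame);
        for (int k = 0; k < c.Length; k++)
            Console.WriteLine($"X[{k}] = {c[k]:F3}");
    }
}
```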
Most new systems for the processing of speech are now digital and, as such, are based on a fundamental mathematical tool, the Z-transform. When the Z-transform is evaluated on the unit circle in the Z-plane, z = e^{jθ} with θ = ωT, and the Z-transform becomes the discrete-time Fourier transform:

\[
X(e^{j\theta}) = \sum_{n=0}^{N-1} x(n)\, e^{-j\theta n}
\]

Sampling this at the N uniformly spaced frequencies θ = 2πm/N gives the basic form of the DFT:

\[
X(m) = \sum_{n=0}^{N-1} x(n)\, e^{-j 2\pi n m/N}, \qquad m = 0, 1, \ldots, N-1
\]

It is readily apparent that computing X(m) for any single value of m will require (in general) N complex multiplications and N complex additions. Therefore, computing the complete set of N values X(0), ..., X(N-1) will entail N^2 complex multiplications and N^2 complex additions. Furthermore, the values of the complex exponentials W_N^{nm} = e^{-j 2πnm/N} repeat periodically in the product nm, a redundancy that the FFT exploits. The operations represented by the above equations are depicted in the Signal Flow Graph (SFG) (Fig. 1). In an SFG there are nodes and edges. Each node represents a signal obtained by summing all of the signals represented by the edges directed into the node. Each edge represents the multiplication of a weight by the signal represented by the edge's source node; an edge's weight is indicated by an annotation near the arrowhead used to indicate the edge's direction. By decimation, a "butterfly" symmetry is achieved, which is congenial to hardware implementation [2] if necessary, as well as to the software implementation of Part II. After the decimation process, the DFT is sometimes referred to as the Decimated Fourier Transform (DFT).
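For contrast with the butterfly decomposition, a direct evaluation of the DFT sum above makes the N^2 cost explicit. The following C# sketch (illustrative naming, not the paper's implementation) performs exactly N complex multiply-adds per output bin:

```csharp
// A direct evaluation of X(m) = sum_n x(n) e^{-j 2 pi n m / N}.
// Each output bin costs N complex multiply-adds, so the full
// spectrum costs N^2 operations.
using System;
using System.Numerics;

static class DirectDftDemo
{
    public static Complex[] Dft(Complex[] x)
    {
        int n = x.Length;
        var result = new Complex[n];
        for (int m = 0; m < n; m++)           // N output bins...
        {
            Complex sum = Complex.Zero;
            for (int k = 0; k < n; k++)       // ...each summing N terms.
                sum += x[k] * Complex.FromPolarCoordinates(
                    1.0, -2.0 * Math.PI * k * m / n);
            result[m] = sum;
        }
        return result;
    }

    static void Main()
    {
        var x = new Complex[] { 1.0, 0.0, 0.0, 0.0 };
        foreach (Complex bin in Dft(x))
            Console.WriteLine(bin);
    }
}
```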
Part II: Providing an environment for developing and deploying a speech user application/interface. A software study was conducted [4], which made it clear that effective languages and tools are required for speech processing [5]. Based on the study, the following software stack is proposed: MS Speech Server, SALT/C#, the .NET Framework, and the Speech Application SDK, under a Windows environment.
Why focus on SALT: VoiceXML and SALT are both markup languages that describe a speech interface. However, they work in very different ways, largely for two reasons: (i) they have different goals; (ii) they have different Web heritages. VoiceXML is designed for telephony applications. It was developed to allow the specification of Interactive Voice Response (IVR) applications in a markup language that leveraged the benefits of the World Wide Web. It is a simple, high-level dialog markup language that facilitates the authoring of system-driven and mixed-initiative voice dialogs over telephones and cell phones. SALT targets speech applications across a whole spectrum of devices, including telephones, PDAs, tablet computers, and desktop PCs. Since many such devices also contain displays, multimodal interactions are a key focus. Developers use SALT with existing Web programming standards to author system-driven, user-driven, and mixed-initiative voice dialogs and multimodal applications. These differences are manifested mainly in (i) the form of the markup, (ii) the programming and execution model and (iii) the level of the programming interface available to the developer.
Scope: VoiceXML incorporates the speech interface, data, and control flow; SALT focuses on the speech interface. Programming model: VoiceXML has a built-in form-filling algorithm (the Form Interpretation Algorithm, FIA); SALT enables application developers to write customized dialog flow. Level of API: VoiceXML has a high-level API, SALT a lower-level one. Other standards: VoiceXML and SALT both use W3C standards; both recommend SRGS and SSML as the grammar and speech-output formats, respectively. In addition, SALT recommends NLSML as a recognition-result format and CCXML as a telephony call-control language (or, as an alternative, a call-control object closely modeled on CCXML). Licensing: VoiceXML may be subject to royalty payments, whereas SALT will be royalty-free. The future: although SALT and VoiceXML were developed to solve different problems, these problem spaces are beginning to converge. Some VoiceXML developers are asking for a stripped-down version of VoiceXML, without the FIA, so they can write their own turn-taking strategies for complex speech applications; others are asking that VoiceXML be modularized so that its tags can be embedded into other languages. SALT already applies a model that fits these roles.

Figure 2 shows the flow of processing of SAS. The user interacts through a microphone; after analogue-to-digital conversion, front-end signal processing and feature extraction are done by a DSP. The captured error-free speech content is then matched against predefined recorded samples that have gone through the same process and been stored. Through the global-decoder function, an automatic intelligent construct is made via a language-model subroutine, turning an unorganized query into a disciplined query and response.

Figure 3 shows the proposed architecture that can support the proposed agent SAS. The development tools are the Speech Application SDK and Microsoft Visual Studio. SALT clients connect through telephones, Pocket PCs, and desktops. The Web server could be Microsoft Speech Server, with an ASP.NET or C# interface handling prompts and grammatical disciplines. If deployed over the Web, IIS (Internet Information Server) needs to be configured. SALT with HTML, along with any scripts, can interface with the server to fetch the speech content from the speech core. To illustrate programmatic result processing, the binding of recognition results into the relevant input fields is accomplished by the script functions procOriginCity() and procDestCity(), which are triggered by the onReco events of the relevant <listen> elements. The handler for an unrecognized-speech event, onNoReco, plays an appropriate message (the SayDidntUnderstand prompt), which restarts the cycle on its completion. A sample program (part of the main program) following the SALT structure is shown in Fig. 4; a sketch of such a page is also given below.
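The fragment below sketches what such a page could look like, modeled on the canonical SALT flight-booking example from which the handler names above are drawn; the grammar URI (city.grxml), element ids, and prompt wording are assumptions, since Fig. 4 itself is not reproduced here:

```html
<!-- Sketch of a SALT page as described above; requires a SALT-enabled
     browser/interpreter. Ids, grammar URI and wording are illustrative. -->
<html xmlns:salt="http://www.saltforum.org/2002/SALT">
<body onload="RunAsk()">
  <form id="travelForm">
    Origin: <input name="txtBoxOriginCity" type="text" />
    Destination: <input name="txtBoxDestCity" type="text" />
  </form>

  <salt:prompt id="askOriginCity">Which city are you leaving from?</salt:prompt>
  <salt:prompt id="askDestCity">Which city are you flying to?</salt:prompt>
  <!-- SayDidntUnderstand restarts the ask/listen cycle on completion. -->
  <salt:prompt id="SayDidntUnderstand" oncomplete="RunAsk()">
    Sorry, I didn't understand.
  </salt:prompt>

  <salt:listen id="recoOriginCity" onreco="procOriginCity()"
               onnoreco="SayDidntUnderstand.Start()">
    <salt:grammar src="city.grxml" />
  </salt:listen>
  <salt:listen id="recoDestCity" onreco="procDestCity()"
               onnoreco="SayDidntUnderstand.Start()">
    <salt:grammar src="city.grxml" />
  </salt:listen>

  <script>
    function RunAsk() {
      // Ask for whichever field is still empty.
      if (travelForm.txtBoxOriginCity.value == "") {
        askOriginCity.Start(); recoOriginCity.Start();
      } else if (travelForm.txtBoxDestCity.value == "") {
        askDestCity.Start(); recoDestCity.Start();
      }
    }
    // onReco handlers: bind recognition results into the input fields.
    function procOriginCity() {
      travelForm.txtBoxOriginCity.value = recoOriginCity.text;
      RunAsk();
    }
    function procDestCity() {
      travelForm.txtBoxDestCity.value = recoDestCity.text;
      RunAsk();
    }
  </script>
</body>
</html>
```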
By deploying SAS, a primitive level of application is generated for domain-specific, speaker- and language-independent processing.

RESULTS
A software agent is created to capture zero-error speech content instantly and store it in the speech database as samples for the particular domain. These samples are speaker-independent and language-independent. They can further be retrieved for actions and queried automatically for intelligent, IntelliSense-style responses. In essence, the SAS agent facilitates the development of a primitive level of globally oriented, domain-specific, speaker- and language-independent applications for any speech-user-interface business-process system. The SAS agent is expected to enhance the accessibility of users who know a natural language by 85% and to reduce the cost of transaction processing [5] in business applications by a projected 50%.

DISCUSSION
Agents, components, and products of this kind will have ever more impact and be ingredients of the key technologies of business processing in the years to come. Blind and visually impaired people can also be supported by these technologies [6]. The first commercial implementation of a free-speech speaker-verification system has been deployed in a call center [7]. Single-source formatting [4] is also possible by combining a Graphical User Interface (GUI) and a Voice User Interface (VUI), leveraging the existing infrastructure. People who are able to speak will interact with the computer in any language they know, with the help of the proposed intelligent agent SAS.

CONCLUSION
The Speech Abiding System (SAS) agent, along with the required hardware architecture, can be made into a product rather than a component, to accomplish all speech users' applications. It would emphasize naturalness in the mechanism of interoperability and reduce disintermediation (known delays) in the Information Technology Cycle (ITC), across the spectrum of computer application programming. As Natural Language (NL) speech applications move into the mainstream, chances are your enterprise will install the technology during the next 12-18 months [8].

ACKNOWLEDGEMENT
The researcher would like to thank Dr. S.N. Subbramanian, Director cum Secretary, and Dr. S. Rajalakshmi, Correspondent, SNS College of Technology, Coimbatore, India, for their motivation and constant encouragement.