Recognition of Nigerian Major Languages Using Neural Networks

Veiw figure View Figure

The first block consists of the acoustic environment plus the transduction equipment (microphone, preamplifier and AD converter) that have a strong effect on the generated speech representations. For instance we can have additional impact generated from additive noise or room reverberation. The second block is intended to deal with these problems, as well as deriving acoustic representations that are both good at separating classes of speech sounds and effective at suppressing irrelevant sources of variation. The third block must be capable of extracting speech specific features of the pre-processed signal. This can be done with a variety of techniques like cepstrum analysis and the spectrogram. The fourth block tries to classify the extracted features and relates the input sound to the best fitting sound in a known 'vocabulary set' and represents this as output. The commonly used approaches for speech classification include:

1) Template-based approaches: in which unknown speech is compared against a set of prerecorded words (templates), in order to find the best match. This has the advantage of using perfectly accurate word models; but it also has the disadvantage that the prerecorded templates are fixed, so variations in speech can only be modeled by using many templates per word, which eventually becomes impractical.

2) Knowledge-based approaches: in which “expert” knowledge about variations in speech is hand-coded into a system. This has the advantage of precisely modeling variations in speech; but unfortunately such expert knowledge is difficult to obtain and use successfully, so this approach was judged to be impractical, and automatic learning procedures were sought instead.

3) Statistical-based approaches: in which variations in speech are modeled statistically by Hidden Markov Models (HMMs), using automatic learning procedures. This approach represents the current state of the art. The main disadvantage of statistical models is that they must make a priori modeling assumptions, which are liable to be inaccurate, handicap the system’s performance.

4) Learning based approaches: To overcome the disadvantage of the HMMs, machine learning methods could be introduced such as neural networks and genetic algorithm programming. In these machine learning models, explicit rules or other domain expert knowledge do not need to be given, they can be learned automatically through emulations or evolutionary process.

5) The Artificial intelligence approach: attempts to mechanize the recognition procedure according to the way a person applies its intelligence in visualizing, analyzing, and finally making a decision on the measured acoustic features. Expert systems are used widely in this approach. A good example of the artificial intelligence approach is the ANN. ANN offers an approach to computation that mimics biological nervous systems. The concept of ANNs is rooted deep into the recognition that though the human brain performs the functions about a million times slower than the digital computers, yet the human brain is more efficient when it comes to performing a complex set of the tasks such as speech synthesis, visual information processing, handwriting analysis etc. This is partially attributed to the fact that human brain is massively a parallel structure of biological neurons. ANNs are physical cellular system that can acquire, store and utilize experimental knowledge. ANNs have been applied to an increasingly number of real world problems of considerable complexity.

The paper is organized as follows. Section II is the Literature Review that critically reviews previous research similar to the speech recognition system identifying their limitations and shortcomings and the improvement this work gives to the different reviewed projects. Section III gives a detailed design of the system and its specifications. It explained speech signal processing, Speech Recognition and its Algorithms; Artificial Neural Networks and back-propagation, giving their key features and the feature extraction method used. Section IV present the implementation and testing of the speech recognition system. The results obtained are also discussed there. Section V concludes the work and presents the achievements of the project, it shortcomings and future improvements.

2. Literature Review

There are two basic approaches of using Neural Networks in speech recognition, which are the static approach and the dynamic approach. In the static approach, the neural network accepts all input speech data at once, and makes a single decision. On the other hand, for the dynamic approach, the neural network processes a small window of the speech at one time, and this window slides over the input speech data while the network makes a series of local decisions, which have to be integrated into a global decision at a later time. Both approaches are being applied in phoneme recognition as well as word level recognition. A few researches related to this method are discussed.

In ^[4], a Multi-layered Perception and the Recurrent Neural Network (RNN) models are applied for Standard Yoruba (SY) isolated tone recognition. The neural networks were trained on SY tone data extracted from recorded speech files. Their results led to three major conclusions that SY tone recognition problem can be implemented with MLP and RNN; the RNN training converges slower than the MLP using the same training data; the accuracy rates achieved using the RNN was found to be higher (although not significantly) than that of the MLP on the inside and outside test data sets; and that the SY mid tone has highest recognition accuracy. However, the efforts required for building the RNN is relatively more than those required for building the MLP. The MLP produced an accuracy of 87.50% for the tests carried out inside for H tone, it also produced 71.30% for the outside test. Similarly, the RNN model produced an inside test of 89.50% for the H tone while it produced 76.10% for outside test. The general conclusion from these results is that the RNN tone has the best recognition rate. A speech recognition component which would implement a set of reading lessons to assist adult illiterates in developing better reading capabilities was develop in ^[5]. The first stage involved the identification of the different alternatives for the different components of a speech recognition system, such as using linear predictive coding, Hidden Markov Models, Neural Networks or K-Nearest Neighbor Classifier for the pattern recognition block. The NN classifier trained using the Al-Alaoui Algorithm overcomes the HMM in the prediction of both words and sentences. They also examined the KNN classifier which gave better results than the NN in the prediction of sentences. The segmentation of Arabic sentences was also considered in their work, and they proved the problems in applying it to Arabic speech. In the project they implemented several classifiers for the Arabic speech recognition problem.

In ^[6], a Multilayer Perceptrons was applied in a more difficult task for alphabet recognition. A static input buffer of 20 frames was applied, in which each spoken letter was linearly normalized, with 8 spectral coefficients per frame. Training on three sets of the 26 spoken letters and testing on a fourth set, the performance achieved was 85% in speaker dependent experiments, matching the accuracy of a dynamic time warping (DTW) template-based approach.

The performance of a recurrent network and a feedforward network on a digit recognition task was compared in ^[7]. The feedforward network was an MLP with a 500 msec input window, while the recurrent network had a shorter 70 msec input window but a 500 msec state buffer. They found no significant difference in the recognition accuracy of these systems, suggesting that it’s important only that a network have some form of memory, regardless of whether it’s represented as a feedforward input buffer or a recurrent state layer.

3. System Design and Implementaton

There are a number of different approaches to the implementation of a speech recognition system, but this work considered the four major processing steps suggested in ^[2], namely: data preparation; training; testing and analysis; and implementation.

3.1. Data Preparation

The first stage of any recognizer development is data preparation. Speech data is needed both for training and for testing. In the system built, two different corpora were collected of audio recordings. One of the collected corpora is an open source audio recording of Yoruba, Igbo and Hausa Speeches, recorded in a silent environment and the other is a collection of Yoruba Igbo and Hausa recorded from scratch by different speakers in a noisy environment. It follows from above that before the data can be recorded, a phone set must be defined to cover both training and testing and a task grammar must be defined. The task grammar defines constraints on what the recognizer can expect as input. The system built provides an interface to receive input speech and recognizes it either as Yoruba, Igbo or Hausa. Each speaker repeated each word at least 5 times in different tones to provide accuracy of the system.

The speakers were given a list of sentences which they had to read aloud. Each word was pronounced at least five times by each speaker. The training corpus consisting of about 1000 words were recorded and labeled using AVS Audio Editing Software.

Figure 2. Screenshot of the data recording process using AVS audio editor

Download as

Veiw figure View Figure

Figure 3. Speech file conversion interface

Download as

Veiw figure View Figure

Silence was removed from each speech file using the AVS audio editing software which enables the user to select the portion of speech file to save.

All the speech data recorded were converted in order for it to be processed in MATLAB workspace as shown in Figure 3. Each file was converted to 4800Hz 16bits wav file using the AVS audio editing software.

3.2. Training

Back propagation is the best known training algorithm for multiple layer neural networks and still one of the most useful. It has lower memory requirements than most algorithms, and usually reaches an acceptable error level quite quickly, although it can then be very slow to converge properly on an error minimum. The Neural Network was designed in the MATLAB workspace using the following parameters.

Number of output: 2

Number of input: 13

Number of hidden neurons: 20

Number of hidden layer: 1

Number of training samples: 300

Number of validation samples: 15

Number of testing samples: 15

Performance: Mean Squared Error (MSE)

Data division: Random (DIVIDERAND)

Figure 4. Creating the neural network data

Download as

Veiw figure View Figure

Figure 5. Realized Neural Network

Download as

Veiw figure View Figure

For the feature extraction, the Mel Frequency Ceptral Coefficients (MFCC) features were computed with the following parameters.

Pre-emphasis coefficients(apre) = -0.95

Frame size = 256 samples (16ms)

Frame overlap = 85 samples (5.3ms)

Number of triangular bandpass filters = 20

Number of MFCCs =13

Thirteen (13) MFCCs are extracted from each speech file. Figure 6 depicts bar graph of one of the speech extracted.

Figure 6. Plot of magnitude over coefficient ranks

Download as

Veiw figure View Figure

4. Simulation Results

4.1. Testing Results

The recognition performance evaluation of the speech recognition system is measured on a corpus of data different from the training corpus. A separate test corpus, with new speech records was created as it was previously done with the training corpus. The test corpus consists of 15 different speech samples for each language and the labeled data were converted to MFCC's. In order to test for speaker independency of the system, some of the subjects who participated in creation of the testing corpus had not participated in the creation of the training corpus. A total of 6 speakers participated in this testing.

Table 1. Testing Results

Download as

View current table in a new window

Tables index

Veiw figure View Table

4.2. Training Results

The training results were automatically generated using the MATLAB "trainers" function as shown in Figure 7.

Figure 7. Screenshot of Training/Error Results

Download as

Veiw figure View Figure

Figure 8. Screenshot of the Confusion Matrix

Download as

Veiw figure View Figure

The confusion matrix of the network was generated as shown in Figure 8. This result shows that Yoruba and Hausa speech samples are more easily confused by the neural network model.

Figure 9. Screenshot of training state at 17 epochs

Download as

Veiw figure View Figure

Figure 10. Screenshot of the Receiver Operating Characteristics

Download as

Veiw figure View Figure

5. Performance Evaluation

From the simulation, results show that adequate functioning of neural networks depends on the sizes of the training and test set. The Percentage recognition rate used is computed as the total number of recognized speech sample to that of the total speech sample multiplied by 100 as shown in Figure 11.

Figure 11. Performance analysis after Training

Download as

Veiw figure View Figure