Modelling Diabetes Mellitus among Adult Kenyan Population Using Artificial Neural Network

An Artificial Neural Network (ANN) is a parallel connection of a set of nodes called neurons which mimic the biological neural system. Statistically, ANNs represent a class of non-parametric models capable of approximating a non-linear function by a composition of low-dimensional ridge functions. This study aimed at modeling diabetes mellitus among the adult Kenyan population using 2015 STEPwise survey data from the Kenya National Bureau of Statistics. Data analysis was carried out using R statistical software version 3.5.0. Among the input variables, Age, Sex, Alcoholic status, Sugar consumption, Physical Inactivity, Obesity status, and Systolic and Diastolic blood pressure had a significant relationship with diabetic status at the 5% level of significance. A multi-layered feed-forward neural network with a back propagation algorithm and a logistic activation function was used. Considering a parsimonious model, the model selected had the eight input variables with two neurons in the hidden layer, since it gave the minimum MSE of 0.0580. 75% of the data was used for training while 25% was used for testing. The sensitivity of the trained network was 75% while the specificity was 94.29%. The overall accuracy of the model was 84.64%, which implied that the model could correctly classify an individual as either diabetic or not with an accuracy rate of 84.64%. A 10-fold cross validation was carried out and an average MSE of 0.0686 was reported. The Kolmogorov-Smirnov test of normality was carried out and, at the 5% level of significance, for most parameter estimates we failed to reject the null hypothesis and concluded that the network parameter estimates were asymptotically normal and consistent. With a good choice of risk factors for diabetes, neural network structures could be successfully used to accurately model diabetes mellitus among the Kenyan adult population.


Introduction
Artificial Neural Networks have recently received a great deal of attention in many fields of study. This is because ANNs attempt to model the capabilities of the human brain. They have been used in a variety of applications where statistical methods are traditionally employed. Globally, an estimated 422 million adults were living with diabetes in 2014, compared to 108 million in 1980. The global prevalence (age-standardized) of diabetes has nearly doubled since 1980, rising from 4.7% to 8.5% in the adult population. This reflects an increase in associated risk factors such as being overweight or obese. Over the past decade, diabetes prevalence has risen faster in low- and middle-income countries than in high-income countries [1]. Diabetes caused 1.5 million deaths in 2012. Higher-than-optimal blood glucose caused an additional 2.2 million deaths, by increasing the risks of cardiovascular and other diseases. Forty-three percent of these 3.7 million deaths occur before the age of 70 years.
The percentage of deaths attributable to high blood glucose or diabetes that occur prior to age 70 is higher in low- and middle-income countries than in high-income countries [1].
Diabetes can be classified as type 1 (which requires insulin injections for survival) and type 2 (where the body cannot properly use the insulin it produces). The majority of people with diabetes are affected by type 2 diabetes. This used to occur almost entirely among adults, but now occurs in children too.
Sophisticated laboratory tests are usually required to diagnose diabetes. To complement this, researchers are nowadays turning to computer-based diagnoses, which can sometimes be more accurate than the clinical diagnosis. One such computer-based diagnosis is the use of an Artificial Neural Network. The neural network, first developed in 1943, is a branch of artificial intelligence used to predict a model outcome. When the output of the network is discrete, the network performs classification; when the output takes continuous values, it performs prediction [2]. This is a suitable and powerful tool to help doctors in the medical field, with several advantages such as the ability to deal with a great amount of data and reduced time of diagnosis. The ability of neural networks to produce good prediction results in classification and regression problems has motivated their use on data related to health outcomes such as death or illness diagnosis [3], [4]. In such studies, the dependent variable of interest is a class label, and the set of possible explanatory variables, which are the inputs to the neural networks, may be binary or continuous. In this study, ANN was used to classify an individual as either diabetic or non-diabetic based on input variables. The input variables were the physical risk factors for diabetes (Age, Sex, Smoking behavior, Alcoholic status, Salt consumption, Sugar consumption, Physical Inactivity) and secondary risk factors (Obesity status, Systolic and Diastolic blood pressure).
Disease diagnosis is a broad and challenging area. Its task is to identify the disease that a patient presenting with symptoms has. This process is very complicated, because not all symptoms are specific to a single disease and symptoms often overlap. Errors caused by the human factor are not rare in this process. To eliminate human error, modern medicine uses different technologies, among them clinical decision support systems. Using information about a patient's condition in a mathematical model, the probable diagnosis can be determined. These mathematical models include Artificial Neural Networks. Artificial Neural Networks (ANNs) play a vital role in the medical field in solving various health problems, such as validating the clinical diagnosis of various diseases.
The main objective of this study was to apply an artificial neural network in diagnosing diabetes mellitus among the Kenyan adult population. Specifically, the study aimed at: (1) determining the relationship between diabetes mellitus status and various risk factors, (2) exploring the asymptotic properties of Artificial Neural Network parameter estimates and (3) ascertaining the best Artificial Neural Network models for diagnosing diabetes.

Literature Review
Generally, not much work has been carried out in Kenya on diagnosing diabetes mellitus using ANN for the adult Kenyan population. However, a lot of research has been done using ANN in medical diagnosis worldwide.

Classical Models Versus ANN models for Prediction and Diagnosis
Some of the classical statistical tools applied for prediction and diagnosis in many disciplines are discriminant analysis [5,6], logistic regression [7], the Bayesian approach [8] and multiple regression [9,10,11,12]. All these models have been proven to be very effective for solving relatively less complex statistical problems [13]. On the other hand, real-world problems are very complex in nature, and classical models rely heavily on a priori assumptions.
To overcome this problem, Artificial Neural Networks are increasingly becoming important for the following reasons. First, as opposed to the classical model-based methods, ANNs are data-driven self-adaptive methods in that there are few a priori assumptions about the models for problems under study. They learn from examples and capture very complex functional relationships among the data even if the underlying relationships are unknown or hard to describe [14]. Second, ANNs can generalize. After learning the data presented to them (a sample), ANNs can often correctly infer the unseen part of a population even if the sample data contain noisy information. Third, ANNs are universal functional approximators. It has been shown that a network can approximate any continuous function to any desired accuracy [15,16,17,18,19]. ANNs have more general and flexible functional forms than the traditional statistical methods can effectively deal with. Due to these properties, ANNs are increasingly becoming popular as compared to traditional statistical models.

Artificial Neural Network in Medical Diagnosis
Artificial neural networks provide a powerful tool to help doctors analyze, model and make sense of complex clinical data across a broad range of medical applications. Most applications of artificial neural networks to medicine are classification problems; that is, the task is to assign the patient to one of a small set of classes on the basis of the measured features [20].
There are several reviews concerning the application of ANNs in medical diagnosis. The concept was first outlined in 1988 in the pioneering work of [21] and since then many papers have been published. In his work, [22] used artificial neural networks to find potent combinations of key variables which accurately identified specific analytes and their level of toxicity. He found that ANN can find potent biomarkers embedded in any type of expression data, mainly proteins, which systematically identify the treatment classes of interest with near 100% accuracy. Whether these proteins are useful in actual diagnosis is tested by presenting the computer model with unknown classes.
Reference [23] developed one of the most successful applications of ANN in the clinical diagnosis of myocardial infarction. He trained an ANN on a group of 356 patients with and without acute myocardial infarction in a cardiac intensive care setting. Using a multi-layer feed-forward network trained using a back propagation algorithm, the ANN had an unprecedented sensitivity of 92% and a specificity of 96%.

Artificial Neural Network in Diabetic Mellitus Diagnosis
Artificial Neural Networks have been used extensively by various authors to diagnose diabetes mellitus, specifically using the Pima Indian data set taken from the UCI machine learning repository. This database is a well-validated data resource for exploring the prediction and classification of diabetes mellitus. The data set has eight attributes, i.e. Number of times pregnant, Plasma glucose concentration (at 2 h in an oral glucose tolerance test), Diastolic blood pressure (mm Hg), Triceps skin fold thickness (mm), 2-h serum insulin (μU/ml), Body mass index (weight in kg/(height in metres)^2), Diabetes pedigree function and Age (years).
Various researchers have used different algorithms and techniques to compare the various classification accuracies obtained. Reference [24] applied neural network classification to the Pima Indian diabetes data set. Using various combinations of pre-processing and missing value techniques, the experimental system achieved a classification accuracy of 99%, which is among the best reported.
Reference [25] applied an artificial neural network using the Levenberg-Marquardt (LM) algorithm and a probabilistic neural network (PNN) structure to the Pima Indian data set to diagnose diabetes. They obtained an accuracy of 82.37% and 78.13% using the Multi-Layer Neural Network (LM algorithm) and the PNN respectively.
Reference [26] used the same Pima data set for diagnosing diabetes onset. They used a multilayer feed-forward neural network with a back propagation training algorithm to classify patients as diabetic or not diabetic. Using a sigmoid transfer function for the hidden and the output layer, a momentum rate of 0.66 and a learning rate of 0.33, they obtained a classification accuracy of 82%. The classification accuracy of the multilayer feed-forward network trained with the back propagation algorithm was higher than that of other algorithms such as nearest neighbor with backward sequential selection of features.
Reference [27] developed Artificial Neural Network models using both classification and predictive neural networks for the rapid diagnosis of diabetes mellitus. They used a data set with 465 records, divided into 440 training records and 25 testing records. The classification network, which was trained using genetic learning, had 19 input variables and the target output variable was the "Diagnosis". The classification results showed that 88.41% of the training data was correctly classified while 76% of the test set was correctly classified. Generally, both neural network models were able to learn the problem, with the predictive network giving a better performance of 84% correctly classified records as opposed to 76% achieved by the classifier network on the same data set.
Reference [28] proposed a method to predict diabetes mellitus using the back propagation algorithm of an Artificial Neural Network. They treated the problem of diagnosing diabetes as a binary classification, i.e. those predicted to be diabetic fall under category 1 and non-diabetic under category 0. They used the supervised multilayer feed-forward network architecture with the back propagation algorithm. The input parameters used were: Random Blood Sugar test result, Fasting Blood Sugar test result, Post Plasma Blood Sugar test, age, sex and occupation. They measured the performance of the network in terms of the absolute error calculated between the network response and the desired target. The network achieved a classification accuracy of 92.5%, i.e. the model was able to predict whether a person was diabetic or not with 92.5% accuracy.
Reference [29] used a neural network based rule discovery system to determine the presence of hypoglycemic episodes based on type 1 diabetic patients' physiological parameters: rate of change of heart rate, corrected QT interval of the electrocardiogram signal and rate of change of the corrected QT interval. He used a sample size of 420 patients, with 320 data sets used to develop the neural network based rule discovery system and 100 data sets used to validate its performance. The sensitivity and specificity were found to be 79.30% and 60.53% respectively, which are considered to be reasonable and better than the ones found by the commonly used methods: statistical regression, genetic programming and fuzzy regression.

Methodology
The study utilized secondary data from the 2015 Kenya STEPwise survey for Non-Communicable Diseases risk factors. An Artificial Neural Network was used to classify diabetic and non-diabetic patients using several input variables (diabetes risk factors). More specifically, a multi-layered feed-forward neural network with a logistic activation function was used. A 10-fold cross validation was carried out to validate the model.

Study Area
The study was carried out in all the forty seven counties of Kenya as shown in Figure 2. A nationally representative sample was selected from the fifth National Sample Survey and Evaluation Programme (NASSEP V) Frame.

Study Subjects
The recommendation for STEPS was to draw the sample population from the targeted population by use of age-sex groups. The age groups used intervals of 12 years for individuals aged 18 years to 69 years. The population covered by the 2015 Kenya STEPS survey was defined as the universe of the non-institutionalized population of men and women aged 18-69 years. A sample of households was selected and one person identified within the age groups of interest in each household was eligible for interview and measurements [30].

Sample Size Determination
Following the recommendations detailed in the STEPwise approach to surveillance (STEPS) manual, the survey drew the sample population from the targeted population by use of age-sex groups. The age groups used intervals of 12 years for the population aged 18 years to 69 years, resulting in eight groups.
The sample size was calculated using the formula
$$n = \frac{Z^2 P (1 - P)}{e^2}.$$
Using the values $Z = 1.96$ (95 percent confidence interval), $P = 0.5$ (50 percent, as recommended by WHO for countries that have not conducted a STEPS survey before) and $e = 0.05$, the initial estimated sample size was 384. Further adjustments, namely multiplication of the sample by 1.5 (design effect to cater for the complex survey design), 8 (the number of 12-year age-sex groups) and 1.25 (to cater for 20 percent non-response), yielded a sample of 5,760. The sample was further adjusted to ease allocation into the various strata.
The sample was allocated into all the 92 strata in the NASSEP V frame, ensuring that a minimum of two clusters were selected per stratum. This was achieved using the power allocation method.
The sample size for the 2015 Kenya STEPS survey was 6,000 individuals selected from a total of 200 clusters (100 urban and 100 rural) with a uniform sample of 30 individuals per cluster [30].
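The arithmetic behind the sample size can be checked with a short sketch. It assumes the standard proportion-based formula $n = Z^2 P(1-P)/e^2$ and rounds the initial estimate down to 384 as reported; the function name `initial_sample_size` is illustrative only and not part of the survey documentation.

```python
import math

def initial_sample_size(z, p, e):
    """Sample size for estimating a proportion: n = z^2 * p * (1 - p) / e^2."""
    return z ** 2 * p * (1 - p) / e ** 2

# Values quoted in the text: Z = 1.96, P = 0.5, e = 0.05.
n0 = initial_sample_size(z=1.96, p=0.5, e=0.05)   # roughly 384.16
n0_reported = math.floor(n0)                       # 384, as reported

# Adjustments quoted in the text: design effect 1.5, eight age-sex groups,
# and a factor of 1.25 to cater for non-response.
adjusted = n0_reported * 1.5 * 8 * 1.25

print(n0_reported, adjusted)
```

The adjusted figure of 5,760 was then rounded up by the survey designers to 6,000 (200 clusters of 30 individuals) to ease allocation across strata.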

Sample Inclusion and Exclusion Criteria
The inclusion criteria were:
i). Individuals aged between 18 and 69 years.
ii). Willing and able to provide informed consent for participation.
The exclusion criteria were:
i). Individuals not aged between 18 and 69 years.
ii). Unable or unwilling to provide informed consent or assent.

Sampling Strategy

a) Sample Frame
Administratively, Kenya is divided into 47 counties. In turn, each county is subdivided into sub-counties. Prior to the enactment of the current constitution in 2010, the sub-counties had not been created, but similar units were the districts. Each district was divided into divisions, each division into locations and each location into sub-locations. In addition to these administrative units, prior to the 2009 population census, each sub-location was subdivided into census enumeration areas (EAs), i.e. small geographic units with clearly defined boundaries. A total of 96,251 EAs were developed. The list of EAs is grouped by administrative units and includes information on the number of households and population. This information was used in 2010 to design a master sample known as the fifth National Sample Survey and Evaluation Programme (NASSEP V) with a total of 5,360 selected EAs [30].
The NASSEP V master frame follows a two-stage stratified cluster sample format. The first stage involved selection of Primary Sampling Units (PSUs), which were the EAs, using the probability proportional to size (PPS) method, with the measure of size being the number of households from the 2009 census. The second stage involves the selection of households for various surveys. The frame was designed in a multi-tiered structure with four sub-samples (C1, C2, C3 and C4), each consisting of 1,340 EAs that can serve as independent frames. The NASSEP V frame used the counties as the first level of stratification, further subdivided into rural and urban sub-domains. The sampling was done independently within the rural-urban sub-domains. Each sampled EA was developed into a cluster and underwent a listing and mapping process; clusters have an average measure of size of 100 households (between 50 and 149 households) [30].

b) Sample Selection
The 2015 Kenya STEPS survey sample was selected in three stages: selection of PSUs (i.e. clusters), selection of households and selection of individuals.

c) Selection of PSUs
The selection of clusters was done using the Equal Probability Selection Method (EPSEM). The clusters were selected systematically from the NASSEP V frame with equal probability, independently within the urban-rural domains. The process involved ordering the clusters by county, then by urban/rural, and finally by unique geocode. The resulting sample retained the properties of PPS as used in the creation of the frame.

d) Household selection
Using the total number of households from each sampled cluster available from the NASSEP V frame, a uniform sample of 30 households per cluster was selected using the systematic sampling method. The selection of the sample households with a random start was done as follows. Let $L$ be the total number of households listed in the cluster; let $R$ be a random number in $(0, 1)$; let $n$ be the number of households selected in the cluster; and let $I = L/n$ be the sampling interval.
1. The first selected sample household is $k$ ($k$ is the serial number of the household in the listing) if and only if
$$\frac{k-1}{I} < R \le \frac{k}{I}.$$
2. The subsequent selected households are those having serial numbers
$$k + (j-1) \cdot I \quad \text{(rounded to integers) for } j = 2, 3, \ldots, n.$$
Random numbers were different and independent from cluster to cluster [30].
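The household selection rule can be sketched as follows. This is a minimal illustration, assuming the random start is the smallest serial number $k$ with $R \le k/I$ (that is, $k = \lceil R \cdot I \rceil$) and that "rounded to integers" means rounding half up; the function name `systematic_sample` and the cap at $L$ are assumptions added for the example, not part of the survey protocol.

```python
import math

def systematic_sample(L, n, R):
    """Return n household serial numbers (1-based) from a listing of L households."""
    if not (0.0 < R < 1.0) or n < 1 or L < n:
        raise ValueError("need 0 < R < 1 and 1 <= n <= L")
    I = L / n                                   # sampling interval
    k = max(1, math.ceil(R * I))                # random start: (k - 1) / I < R <= k / I
    serials = []
    for j in range(1, n + 1):
        value = k + (j - 1) * I                 # serial number before rounding
        rounded = math.floor(value + 0.5)       # round half up
        serials.append(min(rounded, L))         # assumed safeguard: stay inside the listing
    return serials

# Hypothetical cluster with 120 listed households and 30 to be selected.
selected = systematic_sample(L=120, n=30, R=0.37)
print(selected[:5], selected[-1])
```

With a fresh value of `R` drawn for every cluster, each cluster gets its own independent random start, as the text requires.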

e) Individual selection
All the selected clusters and corresponding households were loaded into Personal Digital Assistants (PDAs). During interviews, all the eligible household members were listed and the PDA was used to randomly select one for interview using the inbuilt Kish Grid method [30].

Statistical Model
An Artificial Neural Network (ANN) was used to classify individuals as either diabetic or not based on physical and behavioural characteristics as input variables. Since secondary data was used in this study, it was first cleaned by checking for missing data and outliers. Outliers were excluded from the final analysis for the model. A chi-square test was carried out to determine the relationship between diabetes mellitus status and various risk factors.
At the inferential stage, a multi-layered feed-forward neural network with a logistic activation function was used to fit the data. The Schwarz Information Criterion (SIC) was used for model selection. The classification accuracy rate and mean squared error (MSE) were reported. To validate the diagnosis model, a 10-fold cross validation was also carried out.
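As an illustration of the validation step, the sketch below shows one way a 10-fold cross validation can average the test-fold MSE. It is not the study's R code: the helper names (`k_fold_indices`, `cross_validated_mse`), the toy data and the stand-in "model" that predicts the training mean are all hypothetical.

```python
import random

def k_fold_indices(n, k=10, seed=0):
    """Split indices 0..n-1 into k disjoint test folds; return (train, test) pairs."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    pairs = []
    for test in folds:
        held_out = set(test)
        train = [i for i in indices if i not in held_out]
        pairs.append((train, test))
    return pairs

def cross_validated_mse(x, y, fit, predict, k=10, seed=0):
    """Average test-fold MSE over k folds for a user-supplied fit/predict pair."""
    fold_mse = []
    for train, test in k_fold_indices(len(y), k, seed):
        model = fit([x[i] for i in train], [y[i] for i in train])
        squared_errors = [(y[i] - predict(model, x[i])) ** 2 for i in test]
        fold_mse.append(sum(squared_errors) / len(squared_errors))
    return sum(fold_mse) / len(fold_mse)

# Toy stand-in model (hypothetical): always predict the training-set mean of y.
def fit_mean(x_train, y_train):
    return sum(y_train) / len(y_train)

def predict_mean(model, x_row):
    return model

x_toy = [[float(i)] for i in range(50)]
y_toy = [i % 2 for i in range(50)]            # made-up binary outcomes
avg_mse = cross_validated_mse(x_toy, y_toy, fit_mean, predict_mean, k=10)
print(round(avg_mse, 4))
```

In practice `fit` and `predict` would wrap the neural network training and prediction routines, so that every observation is used for testing exactly once.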

Chi-Square Test of Independence
In order to determine the relationship between diabetes mellitus status and various risk factors, the chi-square test of independence (no relationship) was carried out. Two variables $X$ and $Y$ are said to be statistically independent if the population conditional distributions of $Y$ are identical at each level of $X$. When two variables are independent, the probability of any particular column outcome $j$ is the same in each row. Statistical independence is, equivalently, the property that all joint probabilities equal the product of their marginal probabilities,
$$\pi_{ij} = \pi_{i+}\,\pi_{+j} \quad \text{for } i = 1, \ldots, I \text{ and } j = 1, \ldots, J;$$
that is, the probability that $X$ falls in row $i$ and $Y$ falls in column $j$ is the product of the probability that $X$ falls in row $i$ and the probability that $Y$ falls in column $j$ [31].
Consider the null hypothesis ($H_0$) of independence. For a sample of size $n$ with cell counts $n_{ij}$, the expected frequencies under $H_0$ are $\mu_{ij} = n\,\pi_{i+}\pi_{+j}$, estimated by $\hat{\mu}_{ij} = n_{i+} n_{+j}/n$. The Pearson test statistic is used to compare the observed counts with the expected frequencies, and it has a large-sample chi-squared distribution [31].
The Pearson chi-squared statistic for testing $H_0$ is:
$$X^2 = \sum_{i=1}^{I}\sum_{j=1}^{J} \frac{(n_{ij} - \hat{\mu}_{ij})^2}{\hat{\mu}_{ij}},$$
with $df = (I-1)(J-1)$ degrees of freedom. Expected frequencies of at least 5 are usually sufficient for a decent approximation, as discussed in [31].
The chi-squared distribution is concentrated over nonnegative values. It has mean equal to its degrees of freedom $df$, and its standard deviation equals $\sqrt{2\,df}$. As $df$ increases, the distribution concentrates around larger values and is more spread out.
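The computation of the Pearson statistic can be sketched in a few lines. The contingency table below is made up purely for illustration (rows: exposed / not exposed to a risk factor, columns: diabetic / not diabetic), and the function name `pearson_chi_squared` is hypothetical; expected counts are estimated as row total times column total over the grand total.

```python
def pearson_chi_squared(table):
    """Return (X^2, df) for a two-way table of observed counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    x2 = 0.0
    for i, row in enumerate(table):
        for j, n_ij in enumerate(row):
            mu_ij = row_totals[i] * col_totals[j] / n   # estimated expected count
            x2 += (n_ij - mu_ij) ** 2 / mu_ij
    df = (len(table) - 1) * (len(table[0]) - 1)
    return x2, df

# Hypothetical 2 x 2 table, not survey data.
observed = [[30, 70],
            [10, 90]]
x2, df = pearson_chi_squared(observed)
print(x2, df)
```

For this made-up table every expected count is at least 5, and the statistic exceeds 3.84, the 5% critical value of the chi-squared distribution with one degree of freedom, so independence would be rejected.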

Relative Risk
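This is a ratio of two proportions. For 2 × 2 tables, the relative risk is the ratio
$$\text{relative risk} = \frac{\pi_1}{\pi_2},$$
where $\pi_1$ and $\pi_2$ are the probabilities of the outcome of interest in row 1 and row 2 respectively.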
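A small sketch of the calculation, assuming the outcome of interest sits in the first column of the 2 × 2 table and that the sample proportions are used in place of the population probabilities; the counts are made up and the function name `relative_risk` is illustrative.

```python
def relative_risk(table):
    """Ratio of the row-1 to row-2 proportion of the first-column outcome."""
    (a, b), (c, d) = table
    p1 = a / (a + b)        # proportion with the outcome in row 1
    p2 = c / (c + d)        # proportion with the outcome in row 2
    return p1 / p2

# Hypothetical counts: 30 of 100 exposed and 10 of 100 unexposed are diabetic.
rr = relative_risk([[30, 70], [10, 90]])
print(round(rr, 2))
```

A value above 1 indicates that the outcome is more likely in the first row than in the second; a value of 1 corresponds to independence.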

Introduction to Neural Network
An artificial neural network (ANN) is a parallel connection of a set of nodes called neurons which mimic the biological neural system. Statistically, ANNs represent a class of non-parametric models capable of approximating a non-linear function by a composition of low-dimensional ridge functions [33]. An ANN represents a function of explanatory variables which is composed of simple building blocks and which may be used to provide an approximation of conditional expectations or, in particular, probabilities in regression [34]. ANNs are widely used in classification, regression and statistical pattern recognition problems. In the network considered here, $W_{hj}$ is the weight from the $j$th input node to the $h$th hidden node, $\alpha_h$ is the weight from the $h$th hidden node to the output node and $\alpha_0$ is the weight from the bias node to the output node [34].

Definition of the ANN
Considering an input vector $X = (X_1, \ldots, X_d)'$, the net input to the $h$th hidden node is the value
$$u_h(X;\theta) = W_{h0} + \sum_{j=1}^{d} W_{hj} X_j.$$
The output $\phi_h(X;\theta)$ of the $h$th hidden node is the value
$$\phi_h(X;\theta) = \psi(u_h(X;\theta)).$$
The net input to the output node is the value
$$v(X;\theta) = \alpha_0 + \sum_{h=1}^{m} \alpha_h\, \phi_h(X;\theta).$$
Finally, the output $g(X;\theta)$ of the network is the value
$$g(X;\theta) = \psi(v(X;\theta)).$$
We note that $\theta$ stands for all the parameters $\alpha_0, \ldots, \alpha_m$ and $W_{hj}$, $h = 1, \ldots, m$, $j = 0, \ldots, d$ of the network [34]. The most appropriate choice of the activation function $\psi$ above is the logistic function given as
$$\psi(x) = \frac{1}{1 + e^{-(\alpha x + b)}},$$
where $\alpha$ is the learning rate while $b$ is called the bias.
In this study, we assumed a statistical model that relates $Y$ and $g(X;\theta)$ as follows:
$$Y_i = g(X_i;\theta) + \varepsilon_i, \quad i = 1, \ldots, n,$$
i.e. the data $(Y_i, X_i)$ are used to come up with an estimator $\hat{\theta}$ for $\theta$ [34].
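The forward computation of a single-hidden-layer network of this kind can be sketched as below. The sketch assumes the standard logistic function $\psi(x) = 1/(1+e^{-x})$ at both the hidden and the output layer; the layout of the weight containers (`W[h] = [W_h0, W_h1, ..., W_hd]`, `alpha = [alpha_0, ..., alpha_m]`) and the numerical values are choices made for the example only.

```python
import math

def logistic(x):
    """Standard logistic activation, assumed here for psi."""
    return 1.0 / (1.0 + math.exp(-x))

def forward(x, W, alpha):
    """Network output g(x; theta) for one input vector x.

    W[h]  = [W_h0, W_h1, ..., W_hd]  (bias first) for hidden node h
    alpha = [alpha_0, alpha_1, ..., alpha_m]  (bias first) for the output node
    """
    hidden = []
    for w in W:
        net_in = w[0] + sum(w_j * x_j for w_j, x_j in zip(w[1:], x))
        hidden.append(logistic(net_in))          # phi_h(x; theta)
    net_out = alpha[0] + sum(a * phi for a, phi in zip(alpha[1:], hidden))
    return logistic(net_out)

# Two inputs, two hidden neurons; weights are arbitrary illustrative numbers.
W = [[0.1, 0.4, -0.3],
     [-0.2, 0.25, 0.5]]
alpha = [0.05, 1.2, -0.7]
print(forward([1.0, 2.0], W, alpha))
```

Because the output passes through the logistic function, it always lies strictly between 0 and 1 and can be read as an estimated probability of the positive class.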

Training the Network
There are two types of network training, i.e. supervised and unsupervised learning. In this study supervised training was used. The supervised training of a neural net requires the following:
1. The selection of an initial weight set.
2. A repetitive method to update the current weights to optimize the input-output map.
3. A stopping rule.
The maximum likelihood method is used to find the optimal estimator $\hat{\theta}$ for the network [34].
The task here is to minimize the error in equation (6).

The conditional density of $Y$ given $X = x$ is given as:
$$f(y \mid x;\theta) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(y - g(x;\theta))^2}{2\sigma^2}\right),$$
so that the log-likelihood of the sample is
$$\ln L(\theta) = -\frac{1}{2\sigma^2}\sum_{i=1}^{n} (Y_i - g(X_i;\theta))^2 - n\ln\sigma - \frac{n}{2}\ln(2\pi). \tag{7}$$
The second and the third terms of the above equation are independent of the weights $\theta$ and can therefore be omitted, so that maximizing equation (7) is equivalent to minimizing
$$E(\theta) = \frac{1}{2}\sum_{i=1}^{n} (Y_i - g(X_i;\theta))^2. \tag{8}$$
The weights are then adjusted in such a way that the error function in equation (8) is minimized. However, this study is on classification and the target variable is binary.
The probability of observing the target value $Y_i$ given $X_i$ is
$$P(Y_i \mid X_i) = g(X_i;\theta)^{Y_i}\,(1 - g(X_i;\theta))^{1 - Y_i}, \tag{9}$$
and the likelihood of equation (9) is given by
$$L(\theta) = \prod_{i=1}^{n} g(X_i;\theta)^{Y_i}\,(1 - g(X_i;\theta))^{1 - Y_i},$$
and the negative of the log likelihood is given as
$$S(Y, X;\theta) = -\sum_{i=1}^{n} \left\{ Y_i \ln g(X_i;\theta) + (1 - Y_i)\ln(1 - g(X_i;\theta)) \right\}. \tag{11}$$
The estimator $\hat{\theta}$ is the value of $\theta$ that minimizes the equation above, i.e.
$$\hat{\theta} = \arg\min_{\theta \in \Theta} S(Y, X;\theta). \tag{13}$$
In equation (11), the weights are adjusted in such a way that the error between the targets $Y$ and the actual output $g(X;\theta)$ is minimized. The goodness of the network approximation can be evaluated using a penalty function, $\pi$, that measures how well the network output $g(X;\theta)$ matches the "target" output $y$ corresponding to given inputs $x$.
Since the output is binary, negative entropy is a good penalty [35]. Performance as a function of $\theta$ for given $x$ and $y$ can be measured as $q(y, x, \theta) \equiv \pi(y, g(x;\theta))$. A measure of overall network performance is given by the expected penalty, $Q(\theta) = E[\pi(Y, g(X;\theta))]$, where the random target/input pair $(Y, X)$ is drawn from the population distribution governing the phenomenon of interest. Choosing $\theta$ to solve $\min_{\theta \in \Theta} Q(\theta)$ yields a network producing the smallest average penalty, given an input randomly drawn from the operating environment. This provides an objective way to choose the "best" approximation and formalizes the requirement that the network "generalizes" well. There are various methods of minimizing equation (8). These include back propagation, the quasi-Newton method and the simulated annealing method.
In this study, back propagation method was used to minimize the error.

Back Propagation Method
This is a kind of coordinate-wise gradient descent method.
The goal is to find a set of weights $\theta = (\alpha_0, \ldots, \alpha_m, W_{hj};\ h = 1, \ldots, m,\ j = 0, \ldots, d)$ that minimizes our objective function, equation (11). The partial derivative of the objective function with respect to a weight represents the rate of change of the error function with respect to that weight (it is the slope of the objective function). Moving the weights in a direction down this slope will result in a decrease in the objective function. This intuitively suggests a method to iteratively find values for the weights. We evaluate the partial derivative of the objective function with respect to the weights, and then move the weights in a direction down the slope, continuing until the error function no longer decreases [36]. Mathematically, the weights are adjusted as follows, taking a unipolar activation function $\psi(x) = 1/(1 + e^{-x})$:
$$\theta^{(r)} = \theta^{(r-1)} - \lambda\, \nabla S(\theta^{(r-1)}).$$
Taking individual weights, we have the $r$th iteration weight as
$$W_{hj}^{(r)} = W_{hj}^{(r-1)} - \lambda \left.\frac{\partial S}{\partial W_{hj}}\right|_{\theta^{(r-1)}}, \qquad \alpha_{h}^{(r)} = \alpha_{h}^{(r-1)} - \lambda \left.\frac{\partial S}{\partial \alpha_{h}}\right|_{\theta^{(r-1)}},$$
with $\lambda$ representing the step gain [34].
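A minimal sketch of back propagation for this kind of network is given below. It assumes the standard logistic function at the hidden and output nodes, the negative log-likelihood (cross-entropy) of a binary target as objective, and plain batch gradient descent with a fixed step gain; the toy data, the function names and the defaults (`m=2`, `step=0.1`, `iterations=2000`) are illustrative assumptions, not the settings used in the study.

```python
import math
import random

def logistic(x):
    return 1.0 / (1.0 + math.exp(-x))

def forward(x, W, alpha):
    """Return hidden outputs phi and network output g for one input vector."""
    phi = [logistic(w[0] + sum(wj * xj for wj, xj in zip(w[1:], x))) for w in W]
    g = logistic(alpha[0] + sum(a * p for a, p in zip(alpha[1:], phi)))
    return phi, g

def neg_log_likelihood(X, Y, W, alpha):
    total = 0.0
    for x, y in zip(X, Y):
        _, g = forward(x, W, alpha)
        total -= y * math.log(g) + (1 - y) * math.log(1 - g)
    return total

def train(X, Y, m=2, step=0.1, iterations=2000, seed=1):
    """Batch gradient descent on the negative log-likelihood."""
    rng = random.Random(seed)
    d = len(X[0])
    W = [[rng.uniform(-0.5, 0.5) for _ in range(d + 1)] for _ in range(m)]
    alpha = [rng.uniform(-0.5, 0.5) for _ in range(m + 1)]
    for _ in range(iterations):
        grad_W = [[0.0] * (d + 1) for _ in range(m)]
        grad_alpha = [0.0] * (m + 1)
        for x, y in zip(X, Y):
            phi, g = forward(x, W, alpha)
            delta_out = g - y                     # dS/d(net input of output node)
            grad_alpha[0] += delta_out
            for h in range(m):
                grad_alpha[h + 1] += delta_out * phi[h]
                # Error propagated back to hidden node h.
                delta_h = delta_out * alpha[h + 1] * phi[h] * (1 - phi[h])
                grad_W[h][0] += delta_h
                for j in range(d):
                    grad_W[h][j + 1] += delta_h * x[j]
        # Move every weight down the slope of the objective.
        alpha = [a - step * ga for a, ga in zip(alpha, grad_alpha)]
        W = [[w - step * gw for w, gw in zip(W[h], grad_W[h])] for h in range(m)]
    return W, alpha

# Made-up one-dimensional data with a noisy increasing trend.
X_toy = [[0.0], [0.2], [0.4], [0.6], [0.8], [1.0]]
Y_toy = [0, 0, 1, 0, 1, 1]
W0, alpha0 = train(X_toy, Y_toy, iterations=0)    # initial weights only
W1, alpha1 = train(X_toy, Y_toy)
loss_before = neg_log_likelihood(X_toy, Y_toy, W0, alpha0)
loss_after = neg_log_likelihood(X_toy, Y_toy, W1, alpha1)
print(round(loss_before, 4), round(loss_after, 4))
```

The quantity `delta_out = g - y` is the derivative of the cross-entropy with respect to the net input of the output node when the output activation is logistic, which is why no extra derivative factor appears at the output layer.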

Parameter Estimation
We first discuss the concept of existence of the estimator $\hat{\theta}$. Existence of a solution to equation (13) is guaranteed by the following lemma, with the assumption that $\Theta$ is compact [34]. Lemma 1. Assume (11) and (12) hold; then there exists a solution of the maximum likelihood equation (13). Proof. By our choice of $\psi(\cdot)$, $g(X;\theta)$ given by (12) is continuous in $X$ and $\theta$, and $0 < g(X;\theta) < 1$ for all $X, \theta$. Therefore, $S(Y, X;\theta)$ is continuous in $\theta$ for all $Y, X$, and it assumes its minimum on compact sets.
Next we discuss the concept of model irreducibility/redundancy.
We say that a neural network (with a fixed set of parameters) is "redundant" if there exists another network that represents exactly the same relationship function $g_\theta(\cdot)$. A related definition is the reducibility of $\theta$, stated by [37] as follows.
Definition: For $\psi$ satisfying equation (5), $\theta$ is called reducible if one of the following three conditions holds: (a) $\alpha_h = 0$ for some $h = 1, \ldots, m$; (b) $(W_{h1}, \ldots, W_{hd}) = 0$ for some $h = 1, \ldots, m$; or (c) $(W_{i0}, W_{i1}, \ldots, W_{id}) = \pm (W_{j0}, W_{j1}, \ldots, W_{jd})$ for some $i \neq j$, where $0$ denotes the zero vector of the appropriate size. A reducible $\theta$ with symmetric sigmoidal $f$ leads to a redundant network because it gives a $g_\theta(\cdot)$ function that can be represented by another network by deleting the $h$th neuron, where $h$ is described in the conditions above.
For condition (a), it is obvious. For (b), delete the $h$th neuron and replace $\alpha_0$ by $\alpha_0 + \alpha_h f(W_{h0})$.

Model Identifiability
This is a fundamental problem in neural networks. The parameters are not unique, since different sets of parameters can give identical distributions of $(Y, X)$ [38]. Let the weights be represented as follows: $\alpha = (\alpha_0, \alpha_1, \ldots, \alpha_m)$ and $\beta = (\beta_i, \text{ for } i = 1, \ldots, m)$, where $\beta_i = (\beta_{i0}, \beta_{i1}, \ldots, \beta_{id})$ collects the weights into the $i$th hidden node. At this point we note two kinds of transformations that make the input-output map invariant:
i) The function is unchanged if we permute the $\beta_i$'s together with the corresponding $\alpha_i$'s. For example, if $\beta_1$ and $\beta_2$ are interchanged (along with $\alpha_1$ and $\alpha_2$), $g(X;\theta)$ remains unchanged.
ii) For a symmetric sigmoidal $f$, i.e. $f(-x) = 1 - f(x)$, the function is unchanged if $(\alpha_0, \alpha_i, \beta_i)$ is replaced by $(\alpha_0 + \alpha_i, -\alpha_i, -\beta_i)$. (16)
The following two conditions must be satisfied by the activation functions.
Condition A: The class of functions $\{f(bx + b_0),\ b > 0\} \cup \{f \equiv 1\}$ is linearly independent.
Condition B: The class of functions $\{f(bx + b_0),\ f'(bx + b_0),\ x f'(bx + b_0),\ b > 0\} \cup \{f \equiv 1\}$ is linearly independent.
As a result of the above two conditions, assume models (4), (5) and (6) with a continuous function $f$ satisfying condition A (NB: $f \equiv \psi$). Suppose that $\theta$ is irreducible. Also assume that the distribution of $x$ has the support $\mathbb{R}^d$. Then the following apply, as discussed by [38]: a) $\theta$ is identifiable up to the family of transformations generated by (16). That is, if there exists another $\theta^*$ such that $g(X;\theta^*) = g(X;\theta)$, then there exists a transformation generated by (16) that transforms $\theta^*$ to $\theta$. b) Under the further assumption that $f$ is continuously differentiable and satisfies condition B, the matrix
$$S = E\left[\nabla g_\theta(x)\, \nabla g_\theta(x)'\right]$$
is non-singular. Here $\nabla g_\theta(x)$ is a column vector and denotes the gradient of $g_\theta(x)$ with respect to $\theta$, hence $S$ is a square matrix. Also, the expectation $E$ is taken with respect to the random vector $x$.
Any non-decreasing symmetric sigmoidal function that satisfies condition B also satisfies condition A [38]. Also, any non-decreasing function satisfying the first two properties of equation (5) must be a cumulative distribution function (cdf) of a one-dimensional random variable. Condition A says that $\{f(bx + b_0),\ b > 0\}$ are independent, which is equivalent to the mixture probability density functions being identifiable.
Since in this study we are dealing with a classification problem, in the correctly specified case, where $p(x) = g(x;\theta_0)$ for some $\theta_0 \in \Theta$, equation (20) is solved for $\theta = \theta_0$, i.e. $S_0(\theta)$ is minimized at the true parameter value $\theta_0$. In general, if there is no true value, we may define $\theta_0$ as
$$\theta_0 = \arg\min_{\theta \in \Theta} S_0(\theta).$$
By minimizing equation (18), we get the estimator $\hat{\theta}$. Consistency of this estimator $\hat{\theta}$ therefore means that $\hat{\theta}$ converges in probability to $\theta_0$ as the sample size tends to infinity [34].
Next, we discuss the asymptotic normality of the network parameters. In a classical context, our model can be written as follows:
$$Y_i = p(X_i) + \varepsilon_i, \quad i = 1, \ldots, n.$$
From the above equation, the residuals $\varepsilon_i$ can therefore be expressed as
$$\varepsilon_i = Y_i - p(X_i).$$
Since $(Y_i, X_i)$ are i.i.d., so are the residuals $\varepsilon_i$. We note that $var(\varepsilon_i)$ does not depend on $\theta$.
Since the residuals $\varepsilon_i$ are not only i.i.d. but also bounded in absolute value by 1, the assumptions of [34] reduce to the following.
A1). The activation function $\psi$ is bounded and twice continuously differentiable with bounded derivatives.
A2). $S_0(\theta)$ has a global minimum at $\theta_0$ lying in the interior of $\Theta$ and with a positive definite Hessian.
The remaining assumptions A3 to A5 are as stated in [34]. Suppose that assumptions A1 to A5 are satisfied. Then, for $n \to \infty$, with $\hat{\theta}, \theta_0$ as above,
$$\sqrt{n}\,(\hat{\theta} - \theta_0) \xrightarrow{d} N(0, \Sigma_1 + \Sigma_2).$$
We note that the asymptotic covariance matrix $\Sigma_1 + \Sigma_2$ reflects the two sources of error, which enter through the matrices $B_1(\theta_0)$ and $B_2(\theta_0)$: $B_2(\theta_0)$, which vanishes in the correctly specified case, and $B_1(\theta_0)$, which is built from the term $g(X;\theta_0)(1 - g(X;\theta_0))\,\nabla g(X;\theta_0)\,\nabla g(X;\theta_0)'$ and reflects the randomness in the response variable $Y$ [34]. The asymptotic normality of the network parameter estimates was assessed by use of normal quantile-quantile (qq) plots.

Normality Test
The Kolmogorov-Smirnov test of normality will be used to test our hypothesis. Suppose we have an i.i.d. sample $X_1, \dots, X_n$ with some unknown distribution $P$ and we would like to test the hypothesis that $P$ is equal to a normal distribution $P_0$.
Let us denote by $F(x) = P(X_1 \le x)$ the c.d.f. of the true underlying distribution of the data. We define the empirical c.d.f. by $F_n(x) = \frac{1}{n}\sum_{i=1}^{n} I(X_i \le x)$, which counts the proportion of the sample points below level $x$. For any fixed point $x \in \mathbb{R}$, the law of large numbers implies that $F_n(x) \to F(x)$, i.e. the proportion of the sample in the set $(-\infty, x]$ approximates the probability of this set. It is easy to show from here that this approximation holds uniformly over all $x \in \mathbb{R}$: $\sup_{x \in \mathbb{R}} |F_n(x) - F(x)| \to 0$, i.e. the largest difference between $F_n$ and $F$ goes to 0 in probability [39].
The key observation in the Kolmogorov-Smirnov test is that the distribution of this supremum does not depend on the 'unknown' distribution $P$ of the sample if $P$ is a continuous distribution.
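The supremum described above can be computed directly from the sorted sample. The following is a minimal sketch in Python, not the code used in the study (which was carried out in R). The function names, the default standard normal reference distribution, and the example values are assumptions made for illustration. The critical value uses the usual large-sample approximation $1.358/\sqrt{n}$ at the 5% level, which applies when the reference distribution is fully specified in advance.

```python
import math

def normal_cdf(x, mu=0.0, sigma=1.0):
    """CDF of N(mu, sigma^2), written with the error function."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

def ks_statistic(sample, cdf):
    """Return D_n = sup_x |F_n(x) - F(x)| for a continuous reference cdf."""
    xs = sorted(sample)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs):
        f = cdf(x)
        # F_n jumps from i/n to (i+1)/n at x, so both sides of the jump are checked.
        d = max(d, (i + 1) / n - f, f - i / n)
    return d

def ks_critical_value_5pct(n):
    """Large-sample approximation to the 5% critical value of D_n."""
    return 1.358 / math.sqrt(n)

# Hypothetical standardized estimates, used only to show the call pattern.
example = [-1.2, -0.4, 0.1, 0.5, 1.3]
d_n = ks_statistic(example, normal_cdf)
reject_at_5pct = d_n > ks_critical_value_5pct(len(example))
```

If the mean and variance of the reference normal distribution are estimated from the same sample, the standard critical values are only approximate.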

Model Selection and Complexity Regularization
A network model with a sufficiently large number of hidden units can approximate any unknown function. When the training sample is fixed, a complex network with a large number of hidden units may overfit the data. Thus, there is a trade-off between approximation capability and overfitting when implementing ANN models. One easy approach to regularizing network complexity is to use model selection criteria. Two such criteria are the Schwarz Information Criterion (SIC) proposed by [25] and the Predictive Stochastic Complexity criterion (PSC) introduced by [40].
In this study, we used the Schwarz Information Criterion (SIC). Its first term is a goodness-of-fit measure (the regression mean squared error), while its second term penalizes model complexity. The Mean Squared Error (MSE) is given by $MSE = \frac{1}{n}\sum_{i=1}^{n}(Y_i - \hat{Y}_i)^2$, where $\hat{Y}_i$ is the network output for observation $i$. This MSE was also used to determine the number of hidden neurons, and the result was compared with that obtained from SIC. Using the SIC criterion, we started with a single hidden neuron and determined SIC(1). Then a second hidden neuron was added and SIC(2) determined. The process continued until an extra hidden neuron did not improve the SIC. We therefore estimated $h + 1$ models in order to choose a model with $h$ neurons [34].
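The stepwise search described above can be sketched as follows. This is an illustration under stated assumptions, not the study's code: the SIC is assumed to take the common form $\ln(MSE) + (W/n)\ln(n)$, with $W$ the number of network parameters and $n$ the sample size, and the MSE values and sample size in the example are hypothetical. The function names are also hypothetical.

```python
import math

def n_weights(n_inputs, n_hidden):
    """Weights and biases in a one-hidden-layer network with one output."""
    return (n_inputs + 1) * n_hidden + (n_hidden + 1)

def sic(mse, n_params, n_obs):
    """Assumed SIC form: ln(MSE) + (n_params / n_obs) * ln(n_obs)."""
    return math.log(mse) + n_params * math.log(n_obs) / n_obs

def select_hidden(mse_by_h, n_inputs, n_obs):
    """Add hidden neurons one at a time; stop when SIC no longer improves.

    mse_by_h[h - 1] is the MSE of the fitted network with h hidden neurons.
    """
    best_h = 1
    best_sic = sic(mse_by_h[0], n_weights(n_inputs, 1), n_obs)
    for h in range(2, len(mse_by_h) + 1):
        current = sic(mse_by_h[h - 1], n_weights(n_inputs, h), n_obs)
        if current >= best_sic:
            break  # the extra neuron did not improve the criterion
        best_h, best_sic = h, current
    return best_h

# Hypothetical MSE values for h = 1..4 and a hypothetical sample size.
example_mse = [0.070, 0.058, 0.0585, 0.060]
chosen = select_hidden(example_mse, n_inputs=8, n_obs=3000)
```

With eight inputs and two hidden neurons, `n_weights` gives 21 parameters. The loop fits one model more than the number of neurons finally chosen, which matches the $h + 1$ models mentioned in the text.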

Cross Validation
Cross-validation is a process that can be used to estimate the quality of a neural network. When applied to several neural networks with different free parameter values (such as the number of hidden nodes and the back-propagation learning rate), the results of cross-validation can be used to select the best set of parameter values. The initial data set is divided into k subsets of approximately equal size. The model is then estimated k times, each time leaving out one of the subsets. A mean squared error is computed on each omitted subset, giving a series of k values. This method is called k-fold cross-validation; leave-one-out cross-validation is the special case in which each subset contains a single observation [41].
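The procedure above can be sketched in a few lines. This is a minimal illustration rather than the study's implementation: `fit` and `predict` are hypothetical callables standing in for network training and prediction, and the helper names and seed are assumptions.

```python
import random

def kfold_indices(n, k, seed=0):
    """Shuffle 0..n-1 and cut it into k folds of nearly equal size."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def cross_validated_mse(xs, ys, fit, predict, k=10, seed=0):
    """Average the held-out MSE over k folds (requires len(xs) >= k).

    fit(train_x, train_y) returns a model; predict(model, x) returns a number.
    """
    folds = kfold_indices(len(xs), k, seed)
    fold_mse = []
    for held_out in folds:
        held = set(held_out)
        train = [i for i in range(len(xs)) if i not in held]
        model = fit([xs[i] for i in train], [ys[i] for i in train])
        errors = [(ys[i] - predict(model, xs[i])) ** 2 for i in held_out]
        fold_mse.append(sum(errors) / len(errors))
    return sum(fold_mse) / k, fold_mse

# Toy usage: a "model" that always predicts the training mean.
xs = list(range(20))
ys = [1.0] * 20
avg_mse, per_fold = cross_validated_mse(
    xs, ys,
    fit=lambda tx, ty: sum(ty) / len(ty),
    predict=lambda model, x: model,
    k=10,
)
```

Each observation is held out exactly once, so the k fold errors together use the whole data set for testing.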

Model Assessment
In order to assess the fit of the model, accuracy, sensitivity and specificity were reported. The accuracy of a diagnostic test is often assessed with two conditional probabilities. Given that a subject has the disease, the probability that the diagnostic test (prediction) is positive is called sensitivity [31]. Given that the subject does not have the disease, the probability that the test is negative is called specificity. The overall accuracy of the model is taken as the average of specificity and sensitivity (a quantity also known as balanced accuracy). Writing $TP$, $FN$, $TN$ and $FP$ for the numbers of true positives, false negatives, true negatives and false positives, the three measures are calculated as follows:

$$\text{Sensitivity} = \frac{TP}{TP + FN}, \qquad \text{Specificity} = \frac{TN}{TN + FP}, \qquad \text{Accuracy} = \frac{\text{Sensitivity} + \text{Specificity}}{2}.$$
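As a small worked example of these three measures, the sketch below computes them from the four cells of a confusion matrix. The counts are hypothetical and are not taken from the study's confusion matrix; the function names are assumptions.

```python
def sensitivity(tp, fn):
    """P(test positive | disease present)."""
    return tp / (tp + fn)

def specificity(tn, fp):
    """P(test negative | disease absent)."""
    return tn / (tn + fp)

def overall_accuracy(tp, fn, tn, fp):
    """Average of sensitivity and specificity, as defined in this study."""
    return 0.5 * (sensitivity(tp, fn) + specificity(tn, fp))

# Hypothetical counts, for illustration only.
tp, fn, tn, fp = 3, 1, 33, 2
sens = sensitivity(tp, fn)
spec = specificity(tn, fp)
acc = overall_accuracy(tp, fn, tn, fp)
```

Because the accuracy is defined as an average of two rates, it weights diabetic and non-diabetic subjects equally regardless of how many of each are in the test set.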

The Data
The study utilized secondary data from the 2015 Kenya Stepwise Survey for Non-Communicable Diseases Risk Factors. The input variables were the physical risk factors, i.e. Age, Sex, Smoking behavior, Alcoholic status, Salt consumption, Sugar consumption, Physical activity/inactivity, Obesity status, and Systolic and Diastolic blood pressure, while the output variable was diabetic status (diabetic or not diabetic). An obese person in this study is any person whose Body Mass Index was greater than or equal to 30, while a diabetic person is someone whose fasting glucose was greater than or equal to 6.1 mmol/l.
The table below summarizes the variables and their measurements. From Table 2, it is clear that the age of respondents was well within the survey inclusion criteria. The minimum age was 18 years and the maximum age was 69 years, while the mean age of respondents was approximately 38 years. The average systolic blood pressure (SBP) was 126.6 mmHg, while the average diastolic blood pressure (DBP) was 82.05 mmHg. The SBP ranged from 80 mmHg to 218 mmHg, while the DBP ranged from 48 mmHg to 129 mmHg.
Table 3 shows the frequency distribution of the categorical input variables. The results showed that 91.2% of respondents did not smoke or had never smoked. Of all the respondents, 10.3% were obese. Only 7.0% of the respondents were diabetic.

Relationship between Diabetes Mellitus and Various Risk factors
In order to find the relationship between diabetic status and the various input variables, a cross tabulation was carried out and summary results are presented in the table below.
At the 5% level of significance, Sex of respondent, Alcohol consumption, Sugar consumption, Physical inactivity and Obesity were significant, while Smoking and Salt consumption were not. This implies that there is a statistically significant association between diabetic status and each of the significant factors, while there is no evidence of an association between diabetic status and smoking or salt consumption.
From this analysis, all the significant variables have a relative risk greater than one. The risk of having diabetes is at least 29% higher for females than for males. For those who consume alcohol, the risk of diabetes is at least 33% higher than for those who do not. Those who consume excess sugar are 2.2 times as likely to have diabetes as those who do not consume excess sugar. Those who are physically inactive have a 73% higher risk of having diabetes than those who are physically active. Those who are obese are 2.18 times as likely to have diabetes as those who are not obese. In fitting the neural network model, smoking status and salt consumption will not be considered since they do not have a significant relationship with diabetes mellitus.
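The relative risks quoted above are ratios of two proportions, $\pi_1 / \pi_2$, where $\pi_1$ is the proportion of diabetics in the exposed group and $\pi_2$ the proportion in the unexposed group. A minimal sketch of the calculation follows; the counts are hypothetical and the function name is an assumption.

```python
def relative_risk(cases_exposed, n_exposed, cases_unexposed, n_unexposed):
    """Ratio of the outcome proportion in the exposed group to the unexposed group."""
    pi1 = cases_exposed / n_exposed
    pi2 = cases_unexposed / n_unexposed
    return pi1 / pi2

# Hypothetical counts: 30 of 200 exposed and 15 of 200 unexposed are diabetic.
rr = relative_risk(30, 200, 15, 200)

# Equal proportions give a relative risk of 1 (no association).
rr_null = relative_risk(10, 100, 20, 200)
```

A relative risk of 2 reads as "twice as likely", and a relative risk of 1.29 reads as "29% higher risk".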

Model Selection
The model with the least MSE was selected, as shown in Table 5. From Figure 1, it is clear that the MSE tends to increase as the number of hidden nodes increases. The MSE is at its minimum at two hidden nodes, implying that, in order to regularize the network, a model with two hidden nodes should be chosen.
We now train our model with eight input variables and two hidden nodes.

Network Training
Before training the network, the data set was split into two, i.e. a training set and a test set. 75% of the data set was used for training the network, while 25% was used for testing and validating. A plot of the network with weights is shown in Figure 2.
The trained network had twenty-one weights. The training process needed 401 steps until all absolute partial derivatives of the error function were smaller than 0.01 (the default threshold). The estimated weights range from -39.2000 to 3.6194. For instance, the intercepts of the first hidden layer are -1.3032 and 3.0729, and the weights leading to the first hidden neuron are estimated as 3.6194, -27.3465, -35.4901, 0.2369, -0.7540, -8.7735 and 0.9512 for the covariates age, sex, alcohol, sugar, inactive, sbp, dbp and obese, respectively. A summary of the weights is given in Table 6.
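To make the structure of the fitted network concrete, the sketch below evaluates a feed-forward network with eight inputs, two hidden neurons and one output, using the logistic activation in both the hidden and output layers. The weights and the input vector are hypothetical placeholders, not the estimates reported in Table 6, and the function names are assumptions.

```python
import math

def logistic(z):
    """Numerically stable logistic function 1 / (1 + exp(-z))."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def forward(x, hidden_weights, hidden_bias, output_weights, output_bias):
    """One hidden layer; logistic activation in the hidden and output layers."""
    hidden = [
        logistic(b + sum(w * xi for w, xi in zip(ws, x)))
        for ws, b in zip(hidden_weights, hidden_bias)
    ]
    return logistic(output_bias + sum(v * h for v, h in zip(output_weights, hidden)))

# Hypothetical weights for an 8-2-1 network.
hidden_weights = [[0.1] * 8, [-0.2] * 8]
hidden_bias = [0.0, 0.5]
output_weights = [1.0, -1.0]
output_bias = -0.5

# 16 input-to-hidden weights + 2 hidden intercepts + 2 hidden-to-output weights + 1 output intercept.
n_params = (
    sum(len(ws) for ws in hidden_weights)
    + len(hidden_bias)
    + len(output_weights)
    + 1
)

x = [0.5] * 8  # hypothetical covariates, assumed scaled to [0, 1]
p_diabetic = forward(x, hidden_weights, hidden_bias, output_weights, output_bias)
```

The parameter count of this architecture is 21, which agrees with the number of weights reported for the trained network.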

Trained Network Assessment
To assess the fit of the model, a cross classification of the actual data and the predicted outcome on the test data set was reported. Table 7 below shows the resulting confusion matrix. The sensitivity of the trained network was 75%, while the specificity was 94.29%. Averaging the two gives an overall accuracy of 84.64%, meaning that the model could correctly classify an individual as either diabetic or not with an accuracy rate of 84.64%. These results are consistent with other neural network models for binary classification.
After the model was trained, a 10-fold cross validation was carried out in order to test the generalization of the model. The MSE for each fold is reported in Table 8. The average of these results gives the test error of the algorithm. From this study, the cross-validated average MSE was 0.0686, or 6.86% when expressed as an error rate. In order to test our hypothesis, the Kolmogorov-Smirnov test of normality was used. The hypothesis stated that: $H_0$: The Artificial Neural Network parameter estimates are asymptotically normal and consistent.
Table 9 gives the results of the test for the various parameters.
At the 5% significance level, we do not reject the null hypothesis for any of the parameters except w71. We therefore conclude that most parameter estimates did not show a significant departure from normality. Only the estimator w71 has a significant departure from normality at the 5% level, with a P-value of 0.0325.

Normal Q-Q Plot
The Normal Q-Q plot, or Normal quantile-quantile plot, is a graphical tool used to assess whether a set of data plausibly came from a Normal distribution. It allows one to see at a glance whether the assumption of normality is plausible and, if not, how the assumption is violated and which data points contribute to the violation. If both sets of quantiles came from the same distribution, the points should form a line that is roughly straight. A Q-Q plot of the ANN parameter estimates, obtained from a large-sample simulation, shows that the estimates aligned along a straight line, indicating that the ANN parameter estimates were normally distributed and that the normality assumption was not violated. This is demonstrated in Figure 3 and Figure 4.
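The points of a Normal Q-Q plot can be computed without any plotting library, as in the sketch below. It is an illustration only: the plotting positions $(i - 0.5)/n$ are one common convention assumed here, the example values are hypothetical, and the correlation between the two sets of quantiles is offered as a simple numerical summary of straightness rather than a method used in the study.

```python
import math
from statistics import NormalDist

def normal_qq_points(sample):
    """Pair each ordered observation with a standard normal quantile."""
    xs = sorted(sample)
    n = len(xs)
    std_normal = NormalDist()
    # Plotting positions (i - 0.5) / n stay strictly inside (0, 1).
    theoretical = [std_normal.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    return list(zip(theoretical, xs))

def qq_correlation(points):
    """Pearson correlation of theoretical and sample quantiles."""
    n = len(points)
    mean_t = sum(t for t, _ in points) / n
    mean_s = sum(s for _, s in points) / n
    cov = sum((t - mean_t) * (s - mean_s) for t, s in points)
    var_t = sum((t - mean_t) ** 2 for t, _ in points)
    var_s = sum((s - mean_s) ** 2 for _, s in points)
    return cov / math.sqrt(var_t * var_s)

# Hypothetical simulated estimates of one weight, for illustration only.
example_estimates = [0.91, 1.02, 0.87, 1.10, 0.99, 1.05, 0.95, 1.01]
points = normal_qq_points(example_estimates)
r = qq_correlation(points)
```

A correlation close to 1 corresponds to points lying near a straight line, while curvature or outlying points in the tails lower it.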

Conclusions
Advancement in modern computing has led to the use of artificial neural networks, which mimic the human brain. Combined with statistical analysis, artificial neural networks are used to identify complex patterns in data. This study aimed at modeling diabetes mellitus using ANN. Combined with clinical diagnoses, such a model can greatly assist clinicians and doctors in correctly diagnosing the underlying disease. The accuracy obtained from the trained model is a good indication that, with investment and further research in this field, the classification accuracy can be improved and the model can be used in future.
In this study, the diagnosis of diabetes mellitus has been modeled using a neural network classifier. In order to come up with a network architecture, a chi-square test was first carried out to establish the input variables that had a significant relationship with diabetes mellitus. For variables that were continuous, a stepwise model building was carried out and the network's MSE did not increase, indicating that they were also significant. In order to determine the appropriate number of hidden nodes (size of the network), MSE and SIC were used to determine the parsimonious model. Models with more than seven nodes did not converge. Among the models that converged, the model with two hidden nodes had the minimum MSE and was therefore chosen as the model that could fit the data well.
The ANN had eight input neurons, one hidden layer with two neurons, and an output layer with one neuron. The hidden and output layers used the sigmoid transfer function and were trained using the back-propagation algorithm. The data were split into a training set, a test set and a validation set. A 10-fold cross validation was carried out in order to test the classification accuracy of the model. The sensitivity of the trained network was 75%, while the specificity was 94.29%. The overall accuracy of the model was 84.64%. In conclusion, with a good choice of risk factors for diabetes, neural network structures could be successfully used to help diagnose diabetes among the Kenyan adult population.

Recommendations
This study sets a precedent in modeling diabetes mellitus using artificial neural networks among the adult Kenyan population. With increasing interest in artificial intelligence, we recommend that future research focus on embracing this new field of statistical computing. More important would be to integrate clinical and medical diagnosis with artificial intelligence. More complex machine learning algorithms, such as support vector machines and self-organizing maps, should also be applied in diagnosing diabetes.
Baseline level of selected indicator; $e$ = margin of error.

A relative risk of 1 occurs when $\pi_1 = \pi_2$, i.e. when the response is independent of the group [31].
[...] them as vectors. In prediction and classification problems, the activation function $\psi(x)$ is usually chosen to be a symmetric sigmoidal function, i.e. a fixed, bounded, continuous, non-decreasing function.
[...] same value of $g(X; \theta)$ and hence the same distribution of $Y$. The transformations described above generate a family with $2^m m!$ elements. Call this family of transformations [...].

Condition B: Assume that $f$ is differentiable and $f'$ is its derivative. The class of functions

Table 5. Model Selection Using MSE
Figure 1. Plot of MSE Against Number of Hidden Nodes

Table 9. Kolmogorov-Smirnov Test
Figure 3. QQ plot for Neural Network weights