Mining Quantitative Association Rules in HIV Protein Sequences

Anubha Dubey, Usha Chouhan

  Open Access OPEN ACCESS  Peer Reviewed PEER-REVIEWED

Mining Quantitative Association Rules in HIV Protein Sequences

Anubha Dubey1,, Usha Chouhan2

1Department of Bioinformatics, Manit, Bhopal (M.P), India

2Department of Mathematics, Manit, Bhopal (M.P), India


Lot of research has gone into understanding the composition and nature of proteins, still many things remain to be understood satisfactorily. It is now generally believed that amino acid sequences of proteins are not random, and thus the patterns of amino acids that we observe in the protein sequences are also non-random. In this study, we have attempted to decipher the nature of associations between different amino acids that are present in a HIV protein. This very basic analysis provides insights into the co-occurrence of certain amino acids in a HIV protein. Such association rules are desirable for enhancing our understanding of protein composition and hold the potential to give clues regarding the global interactions amongst some particular sets of amino acids occurring in proteins. The aim of association rules mining is to reveal underlying interactions in large sets of data items. Knowledge of these rules or constraints is highly desirable for the in-vitro synthesis of artificial proteins. This will also give new insights to understand protein-protein interactions in HIV.

At a glance: Figures

Cite this article:

  • Dubey, Anubha, and Usha Chouhan. "Mining Quantitative Association Rules in HIV Protein Sequences." Journal of Biomedical Engineering and Technology 1.2 (2013): 26-30.
  • Dubey, A. , & Chouhan, U. (2013). Mining Quantitative Association Rules in HIV Protein Sequences. Journal of Biomedical Engineering and Technology, 1(2), 26-30.
  • Dubey, Anubha, and Usha Chouhan. "Mining Quantitative Association Rules in HIV Protein Sequences." Journal of Biomedical Engineering and Technology 1, no. 2 (2013): 26-30.

Import into BibTeX Import into EndNote Import into RefMan Import into RefWorks

1. Introduction

Proteins are important constituent of cellular machinery of any organism. Recombinant DNA Technologies have provided tools for the rapid determination of DNA sequences and, by inference, the amino acid sequences of proteins from structural genes [1]. The proteins are sequences made up of 20 types of amino acids. Each amino acid is represented by a single letter alphabet, as given in Table 1. Each protein adopts a unique 3-dimensional structure, which is decided completely by its amino acid sequence. A slight change in the sequence might completely change the functioning of the protein. Just as the letters of the alphabet can be combined to form an almost endless variety of words, amino acids can be linked together in varying sequences to form a vast variety of proteins [13].

Table 1. Single letter codes of amino acids

The heavy dependence of protein functioning on its amino acid sequences has been a subject of great anxiety. Research has been done to determine the information content per amino acid in proteins by Yockey [2] and Strait and Dewey [3]. Table 1 shows the standard amino acid abbreviations and side chain properties. Depending on the polarity of the side chain, amino acids vary in their hydrophilic or hydrophobic character [14].These properties are important in protein structure and protein-protein interactions.

There has been a continuing debate on whether the amino acid sequences of proteins are random or have statistically significant deviations from sequences. White and Jacobs [5] have shown that any sequence chosen randomly from a large collection of nonhomologous proteins has a 90% or better chance of having a lengthwise distribution of amino acids that is indistinguishable from the random expectation regardless of amino acid type. They claimed that proteins have evolved from random sequences but have developed significant deviations from randomness during the process of evolution. Pande et al [4] mapped protein sequences to random walks to detect differences in the trajectories of a Brownian particle. They found pronounced deviations from pure randomness which seem to be directed towards the minimization of energy in the 3D structure.

In this study, we take a further step in this direction by trying to predict if there are any co-occurrence patterns among the 20 amino acids. We have attempted to find out rules that can tell that occurrence of one amino acid is more likely when another amino acid is present or absent. Such rules are called “association rules”, and the corresponding technique is called “association rule mining” (ARM). In ARM terminology, the amino-acids may be considered as items, and the protein sequences as “baskets” containing items. Association rule mining has been further described. Proteins are polymers of length usually in hundrends. Since the length is much larger, all the 20 amino acids are present in majority of proteins, and thus we will not be able deduce any significant rule just based on presence or absence. To obtain more meaningful association rules in this context, we have incorporated the normalized frequencies of amino acids observed in each protein, and also discovered “quantitative association rules”, which tell that if one amino acid A is present with a frequency, another amino acid B is likely to be present with frequency. Our quantitative association rule mining procedure [8] enables us to find these numbers and .

This study shows how association rule mining could be implemented for finding quantitative rules in proteins.

2. Association Rule Mining

Mining association rules (ARs) can be classified as one of the most popular and prominent areas in data mining. It aims at discovering the interesting correlations, frequent patterns, associations or casual structures among sets of items in the transaction databases or other data repositories. This term was coined by Agrawal et al. [18] and amazingly it stills become an active research in knowledge database discovery. For the past decades, ARs have been widely used in various types of applications such as retail transaction, stock market analysis, etc. In current scenario the association rule mining has been an active area of research in medical databases [19]. Association rule mining algorithm with proposed HIV infected patients treatment dataset is also applied [20].

Before we begin with the description of the algorithm, it will be helpful to review some of the key concepts of association rule mining. We use the same notation as used in [9]. Let be a set of k elements, called items. Let be a set of n subsets of I. We call each a basket of items. For example, in the market basket application, the set I consists of the items stocked by a retail outlet and each basket is the set of purchases from one register transaction. Similarly, in the “document basket” application, the set I consist of all dictionary words and all proper nouns, while each basket is a single document in the corpus. Note that the concept of a basket does not take into account the ordering or frequency of items that might be present. An association rule is intended to capture a certain type of dependence among items represented in the database B. More importantly, we say that if the following two hold.

1.  Occurs together in at least s% of the n baskets (the support).

2. Of all the baskets containing at least c% also contain (the confidence).

This definition is also extended to I J, where I and J are disjoint sets of items instead of single items. Let us consider an example of a document basket application. The baskets in this case are many short stories that are available at our disposal, while the items within each basket are the words. A reader might observe that stories which contain the “sword” also frequently contain the word “blood”. This information can be represented in the form of a rule as:


Rule support and confidence are the two measures of rule interestingness [10]. They respectively reflect the usefulness and certainty of discovered rules. A support of 5% for an association rule means that 5% of stories under analysis show that “blood” and “sword” occur together. A confidence of 55% means 55% of the stories that contain the word “sword” also contain the word “blood”. Typically, association rules are considered interesting if they satisfy both minimum support threshold, and minimum confidence threshold. Such threshold can be set by users or domain experts. As pointed out [9], it should be noted that the symbol is misleading since such a rule does not correspond to real implications; clearly, the confidence measure is merely an estimate of the conditional probability given.

2.1. The Apriori Algorithm

The most commonly used approach for finding association rules for finding association rules is based on the Apriori algorithm [6]. Apriori employs an iterative approach known as a level wise search, where k itemsets (sets containing k items) are used to explore (k+1)-itemsets. First the set frequent (i.e. having more than the minimum support) 1-itemsets is found. The efficiency of the levelwise generation of frequent itemsets is improved by using the Apriori property which says that all nonempty subsets of a frequent itemset must also be frequent. This is easy to observe, because if an itemset I does not satisfy the minimum support threshold, then the set I’=I{} ,containing all elements of I and an extra element , cannot occur more frequently than I, and thus cannot satisfy the minimum support threshold.

2.2. Quantitative Association Rules

While the association rule model described above suffices for many applications, it is not adequate when the frequency of each item in the basket is variable and cannot be ignored. For example, in the previously considered example, a user might be interested in the rules of the form:


This rule represents that a story that contains between 30 to 35 occurrences of “sword” and 14 to 16 occurrences of “war”, is also likely to contain 50 to 52 references of “blood”. Such rules are called quantitative association rules.

The rule represents that a story that contains between 30 to 35 occurrences of “sword”. And 14 to 16 occurrences of “war”, is also likely to contain 50 to 52 references of “blood”. Such rules are called quantitative association rules. In case of HIV, equation 2 can be written as

It means that if a person’s CD4 count lavel is 200-300 or 350 and viral load is 100000 to 1 million then person is highly recommended for HIV treatment.

The ARCS system [11] for mining quantitative association rules is based on rule clustering. Essentially this approach maps pairs of quantitative attributes onto a multidimensional grid, with the number of dimensions. Techniques for mining quantitative rules based on x-monotone and rectilinear regions were presented in [7]. Approach proposed by Srikant et al [8] works by fine-partitioning the values of the quantitative attributes, and then generating rules of interest.

3. Algorithm

Our implementation is based on the partitioning approach described in [8]. We consider 20 attributes in proteins, each related to an amino acid. The value of each attribute in a basket (here protein) is the frequency of the frequency of the corresponding amino acid in the protein. Since the proteins are of varying lengths, we normalize this frequency by dividing by the length of the protein.

The main steps in the algorithm are as follows:

1. Partition the attributes: We have divided each of the 20 attributes into 10 intervals. In [8], the authors have discussed the notion of partial completeness to quantify the amount information lost due to partitioning. It has been further shown that for a given number of partitions, equi-depth partitioning (each partition having equal support) gives the minimum loss information, and is thus optimal. Thus, we have used equi-depth partitioning in our method. For the sake of completeness and comparison, we have also experimented with equi-distant partitioning, in which all intervals are of equal length.

2. The intervals/partitions are mapped into consecutive integers, which are used to represent the intervals. The order of intervals is preserved in the mapping.

3. Find the support for each of the intervals. Also the consecutive intervals are combined as long as their support is less than a predetermined maximum support. This is actually needed in case of equi-distant partitioning when some of the intervals may have very small support and thus it makes sense to combine them with the adjoining intervals. In equi-depth partitioning, all intervals have equal support, and thus this problem does not arise. We identify the set of all intervals which have more than a minimum support minsup. This is called the set of frequent items.

Next we find all sets of items whose support is greater than minsup. These are called the frequent itemset, and the algorithm is based on Apriori algorithm, discussed in the previous section.

4. The frequent itemsets are used to generate association rules, each itemset can give rise to number of association rules by dividing into two parts: antecedents and consequences. For example, an itemset {P, Q, R} can lead to the following rules.each itemset can give rise to number of association rules by dividing into two parts: antecedents and consequences. For example an itemset {P, Q, and R} can lead to the following rules-

The confidence conf for each of the rules is determined as the conditional probability of conclusion given precedent. For example, for the rule

If the confidence is greater than a pre-determined minimum confidence, min conf, the rule is kept, otherwise it is removed.

4. Results

The HIV protein sequences are taken from the UNIPROT database [15], containing only those sequences which are complete. This reduces the biasness. In this study our focus is on driving associations applicable to HIV proteins. Figure 1 shows the rules obtained with minimum support of 30 proteins 12 association rules have been generated, which have confidence more than 50 % that was obtained by N.Gupta at al [17]. The universe of chains that can be built from 20 amino acids is extremely large and diverse. In light of this fact, the confidence and support of the rules presented in Figure 1 are quite significant.

In this paper we are trying to use these rules for HIV proteins to generate some useful information. As an example, the eighth rule indicates that proteins containing large amount of Arginine(R) and very low amount of Serine (S) are likely to contain no Cysteine(C). Cysteines are the amino acids that participate in the formation of disulphide bonds. Such rules provide some insight into the interaction and role of these amino acids in proteins, and have important consequences in the emerging field of synthetic biology where biological entities are designed and synthesized in the lab.

The amino acid composition of HIV proteins have been obtained from Procos Software developed by Department of Bioinformatics MANIT Bhopal (M.P) [16]. By applying this rule we found large amount of Arginine, good quantity of cysteine will lead to low quantity of cysteine this shows that disulphide bond is weak. The best characterized examples of disulphide bond are the immune cell receptor CD4, the HIV-1 envelope glycoprotein gp120 and integrin receptor. HIV-1 binds to CD4 via its gp120 envelope protein. This binding leads to the interaction of the complex with a chemokine receptor, triggering fusion of the viral and cell membrane, leading to HIV entry and infection. Hence if weak disulphide bond is present in gp120 then HIV enters and causes infection.

Figure 1. Associations obtained using equi-depth partitioning. Each interval obtained in angular brackets) has an amino acid, and frequency range with protein length scaled to 500. The support is the number of proteins in our dataset of 3728 proteins containing all the intervals present in the association rule.

To see the performance of the algorithm changes when equi-distant rules are used, we created 10 intervals of equal length with frequency ranges 0-9, 10-19,......,80-89, and 90-500. The proteins lengths are scaled to 500 and the frequencies are increased or decreased in proportion. The association rules obtained from this approach are shown in Figure 2. As expected, the method gets heavily biased in favour of those intervals which have very high support. For example, Cysteine (C) and Tryptophan (W) are the less frequent amino acids in proteins; in most proteins the frequency of these amino acids is close to zero. These rules show that Cysteine and Tryptophan occur in range 0-9, same criteria was observed in HIV amino acid composition. Cysteine residues pair to form unique cross-links (disulphide bridges) in proteins. These bridges have an important role in stability, as suggested by experimental engineering of disulphides, and are very common in extracellular soluble globular proteins. This probably reflects the need for increased stability in such proteins in the extracellular environment (see, for example, Goldenberg, 1985; Pantoliano, 1987; Matsumara et al., 1989a; Pace, 1990; Betz and Pielak, 1992; Cooper et al., 1992; Eder and Wilmanns, 1992; Betz, 1993; Pace et al 1988). The prevailing view is that disulphide bonds have been added during evolution in order to enhance the thermal stability of proteins which must function in the oxidizing extracellular environment which may have varying physical chemical properties.

Figure 2. Associations obtained using equi-distant partitioning. Representation is same as in Figure 1. Note that consequence part in all the rules contains either Cysteine (C) or Tryptophan (W). Result section shows the behaviour of this discussion.

5. Conclusion

This approach shows presence or absence of amino acid plays a significant role in the structure and / or function of proteins. The association rules prove very important role in the design and synthesis of artificial peptides, outside the cell. As this era shifting from small molecule drugs to biologics (bio molecular drugs) which leads to new knowledge in the post genomic area. Synthetic peptides may be a new source of HIV drug. Association rule mining also gives new insight to study protein-protein interaction, discover gene interactions and many more. It also helps in disease diagnosis on the basis of related symptoms. Hence association rules play a very significant role in current era of science.


[1]  Brenden,C. And Tooze, J. Introduction to protein structure (Garland Publishing, New York, 1991).
In article      PubMed
[2]  Yockey, H.P.(1977). On the information content of cytochrome. J.Theor. Biol.67,147-151.
In article      CrossRef
[3]  Strait, B.J. & Dewey, G.(1996). The Shannon information entropy of protein sequences. Biophy. J.71, 148-155.
In article      CrossRef
[4]  Pande, S.V.,Grosberg, A.Y. & Tanaka, T. (1994). Non randomness in protein sequences: evidence for a physically driven stage of evolution? Proc. Natl. Acad. Sci. U.S.A 91, 12972-12975.
In article      CrossRef
[5]  White, S.H. & Jacobs, R.E. (1993). The evolution of proteins from random amino acid sequences –I. Evidence of proteins from the lengthwise distribution of amino acids in modern proteins. J. Mol. Evol.36, 79-95.
In article      CrossRefPubMed
[6]  Agrawal. R and Srikant, R. (1994). Fast algorithms for mining association rules. In Proc of the 20th Int’l Conference on very large databases, Santiago, Chile, September 94.
In article      
[7]  Fakuda, T, Morimoto, Y., Morishita, S. And Tokuyama, T. (1996). Data mining using two-dimensional optimized association rules: Scheme, algorithms, and visualization. In proc of the 20th Int’l Conference on Very Large Databases, Santiago, Chile, September’ 94.
In article      
[8]  Srikant, R. And agrawal, R. (1996). Mining quantitative association rules in large relational tables. Proc. ACM SIGMOID.
In article      
[9]  Brin, S., Motwani, R., and Silverstein, C. (1997). Beyond market basket : Generalizing association rules to correlations. In proc. 1197 ACM SIGMOID, pp 265-276. Tuscon, AZ.
In article      
[10]  Han, J. And Kamber. Data Mining Concepts and Techniques. Morgan Kaufmann Publishers, San Francisco, 2001.
In article      
[11]  Lent, B., Swami, A. And Widom, J. (1997). Clustering association rules. In Proc.Int’l Conf. Data Engineering (ICDE’97), PP220-231, England.
In article      
[12] scop/.
In article      
[13]  "The Structures of Life”. National Institute of General Medical Sciences. 2008-05-20.
In article      
[14]  Creighton, Thomas H. (1993). Proteins: structures and molecular properties. San Francisco: W. H. Freeman. Chapter 1. ISBN 0-7167-7030-X.
In article      
In article      
[16]  Lavanya Rishishwar, Neha Mishra, Bhasker Pant, Kumud Pant, K. R.Pardasani “ProCos: Protein Composition Server ”, Bioinformation, Volume 5 Issue 5 November 2010.
In article      PubMed
[17]  Nitin Gupta et al, Data Mining, LNAI 3755,PP.273-281, 2006.
In article      
[18]  R. Agrawal, T. Imielinski and A. Swami, Database Mining: A Performance Perspective, IEEE Transactions on Knowledge and Data Engineering, 5 (6), 914 (1993).
In article      CrossRef
[19]  Abdullah et al., “Detecting critical least association rules in medical databases, International Journal of Modern Physics: conference series, vol.1, no.1,1-5, 2010.
In article      
[20]  K. Rameshkumar, “Extracting Association Rules from Hiv Infected Patients Treatment Dataset”. Trends in Bioinformatics, 4: 35-46, 2011.
In article      
  • CiteULikeCiteULike
  • MendeleyMendeley
  • StumbleUponStumbleUpon
  • Add to DeliciousDelicious
  • FacebookFacebook
  • TwitterTwitter
  • LinkedInLinkedIn