Characterisation of Academic Journal Publications Using Text Mining Techniques

The ever-growing volume of published academic journals, and the implicit knowledge that can be derived from them, has not fully enhanced knowledge development; instead it has resulted in information and cognitive overload. Publication data are textual, unstructured and noisy. Analysing such high-dimensional data manually is time consuming, and this has limited the ability to derive projections and trends from the patterns hidden in publications. This study was designed to develop and apply intelligent text mining techniques to characterise academic journal publications. The scoring criteria of nineteen rankers from 2001 to 2013, as collated in the 50th edition of the Journal Quality List (JQL), were used to select the highly rated journals. The text-miner software developed for the study was used to crawl and download the abstracts and bibliometric information of the articles selected from these journals. The datasets were transformed into structured data and cleaned using filtering and stemming algorithms. Thereafter, the data were grouped into series of word features based on a bag-of-words document representation. The highly rated journals were clustered using the Self-Organising Map (SOM) method, with attribute weights reported for each cluster.


Introduction
Ranking of journals is widely used in academic circles in the evaluation of an academic journal's impact and quality. Journal rankings are intended to reflect the place of a journal within its field and the prestige associated with it. They can also be used to evaluate the research impact of individual academics: rather than measuring the impact of an academic's individual articles, universities and governments use the ranking of the journal as a proxy for the quality and impact of those articles [1]. Measures used in the rankings include the impact factor, Eigenfactor, SCImago Journal Rank, h-index and expert surveys. Recently, some journals were blacklisted in some institutions because of poor ratings. As a result, [2] produced The Journal Quality List, a collation of journal rankings from a variety of sources, published primarily to assist academics in targeting papers at journals of an appropriate standard. The list was originally collated by the Bradford University School of Management (1997-2001). Since then, the list has been updated and extended periodically to keep it current. It contains rankings of many different journals and is used all over the world, with more than 5000 downloads yearly by academics, and it has been cited in various academic publications [3].
The Journal Quality List (JQL) comprises academic journals in the following broad areas: Economics, Finance, Accounting, Management, and Marketing. The rankings for each journal draw on sources such as the Fondation Nationale pour l'Enseignement de la Gestion des Entreprises (FNEGE) [2].
For the purpose of this study, we concentrate on the highly rated journals in the 2013 JQL and use text mining techniques to elicit hidden knowledge from these journals.

Literature Review
Data mining is the process of analysing data from different perspectives (large databases or Big Data) and summarizing it into useful and previously unknown information for users [4,5]. It derives its name from the similarity between searching for valuable information in a large database and mining a mountain for a vein of valuable ore, that is, transforming data 'dust' into data 'gold' [6,7,8,9].
Text data mining is a natural extension of data mining [9], and follows steps similar to those in data mining. The qualitative difference in text mining, however, is that it processes data from natural language text rather than from structured databases of facts [10]. Companies use text mining software to draw out the occurrences and instances of key terms in large blocks of text, such as articles, web pages, complaint forums, or Internet chat rooms, and to identify relationships among the attributes [11]. Often used as a preparatory step for data mining, text mining translates unstructured text into a usable, database-like format suitable for further and deeper analysis [12]. [7] also described text mining as an emerging technology that can be used to augment existing data in corporate databases by making unstructured text data available for analysis.
There exist relationships between data mining, information retrieval, statistics, web mining, computational linguistics, natural language processing and text data mining. The problem of Knowledge Discovery from Text (KDT) [13] is to extract explicit and implicit concepts, and semantic relations between concepts, using Natural Language Processing (NLP) techniques. Its aim is to gain insight into large quantities of text data. KDT, while deeply rooted in NLP, draws methods from statistics, machine learning, reasoning, information extraction, knowledge management, and other fields for its discovery process. KDT plays an increasing role in emerging applications, such as text understanding [14]. Information retrieval (IR) is the process of finding material (usually documents) of an unstructured nature (usually text) that satisfies an information need from within large collections (usually stored on computers). The documents retrieved should be relevant to the information needs of the user who performed the search query.
If the set of documents relevant to a query is denoted as {Relevant}, and the set of documents retrieved is denoted as {Retrieved}, then the set of documents that are both relevant and retrieved is denoted as {Relevant} ∩ {Retrieved}. The two basic measures for assessing the quality of text retrieval [15] are precision and recall. Precision is the percentage of retrieved documents that are in fact relevant to the query, while recall is the percentage of relevant documents that were in fact retrieved. They are formally defined as

precision = |{Relevant} ∩ {Retrieved}| / |{Retrieved}|
recall = |{Relevant} ∩ {Retrieved}| / |{Relevant}|

Text mining is a non-traditional information retrieval (IR) method whose goal is to reduce the effort required of users to obtain useful information from large computerized text data sources. Traditional information retrieval often simultaneously retrieves both "too little" information and "too much" text [16,17]. However, in information retrieval (otherwise known as information access), no genuinely new information is found; the desired information merely coexists with other valid pieces of information.
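As an illustration of these two measures, the following minimal Python sketch computes precision and recall for a hypothetical query; the document identifiers are purely illustrative and not drawn from the study's data.

```python
# Minimal sketch of the precision and recall measures defined above.
# The document identifiers are illustrative, not from the study's data.

def precision(relevant: set, retrieved: set) -> float:
    """Fraction of retrieved documents that are relevant."""
    return len(relevant & retrieved) / len(retrieved) if retrieved else 0.0

def recall(relevant: set, retrieved: set) -> float:
    """Fraction of relevant documents that were retrieved."""
    return len(relevant & retrieved) / len(relevant) if relevant else 0.0

relevant = {"doc1", "doc3", "doc5", "doc7"}
retrieved = {"doc1", "doc2", "doc3", "doc4"}

print(precision(relevant, retrieved))  # 2 of 4 retrieved are relevant -> 0.5
print(recall(relevant, retrieved))     # 2 of 4 relevant were retrieved -> 0.5
```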
Natural Language Processing (NLP) is a subfield of Artificial Intelligence (AI) and linguistics [18]. Text mining applications within NLP are quite frequent and are characterized by multilingualism [19]; the use of text mining techniques to identify and analyse web pages published in different languages is one example [14]. The main aim of NLP is the generation and understanding of natural language. One direction of NLP research relies on statistical techniques, typically involving the processing of words found in texts [20]. NLP techniques are also applied in text retrieval, for example as components of web search engines, automated translation tools and summary generators [21].
With NLP techniques, text is typically parsed syntactically using information from a formal grammar and a lexicon; the resulting structure is then interpreted semantically and used to extract information about what was said [22]. These techniques include word stemming (removing suffixes) and the related technique of lemmatization (replacing an inflected word with its base form), multi-word phrase grouping, synonym normalization, part-of-speech (POS) tagging (labelling tokens as nouns, verbs, prepositions and so on), word-sense disambiguation, anaphora resolution and role determination (such as subject and object) [18,21].
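The following sketch illustrates stemming, lemmatization and POS tagging using the NLTK library; NLTK is used here only as an assumed, freely available substitute for such techniques and is not the tool used in this study.

```python
# Hedged illustration of stemming, lemmatization and POS tagging using NLTK.
# NLTK is an assumed substitute; it is not the software used in the study.
import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer

# One-time downloads of the required NLTK resources
# (resource names may differ slightly across NLTK versions).
nltk.download("punkt")
nltk.download("wordnet")
nltk.download("averaged_perceptron_tagger")

text = "The ranked journals were clustered using self-organising maps"
tokens = nltk.word_tokenize(text)

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

print([stemmer.stem(t) for t in tokens])          # suffix stripping, e.g. 'clustered' -> 'cluster'
print([lemmatizer.lemmatize(t) for t in tokens])  # base forms, e.g. 'maps' -> 'map'
print(nltk.pos_tag(tokens))                       # part-of-speech tags, e.g. ('journals', 'NNS')
```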
The difference between regular data mining [23] and text mining is that in text mining the patterns are extracted from natural language texts rather than from structured databases of facts. Text mining tries to apply the same techniques of data mining to unstructured text databases. To do so, it relies heavily on technology from Natural Language Processing (NLP) and machine learning to automatically collect statistics and infer structure and meaning in otherwise unstructured text. The usual approach involves identifying and extracting key features from the text that can be used as the data and dimensions for analysis. This process, called feature extraction, is a crucial step in text mining.
Web mining [24,25,26] is the activity of identifying patterns implied in large document collections. It is an integrated technology in which several research fields are involved, such as data mining, computational linguistics, statistics, and informatics. There is no generally accepted definition of web mining; since it derives from data mining, its definition is similar to the well-known definition of data mining [19]. Nevertheless, web mining has many unique characteristics compared with data mining.
Text mining is ideally suited to extracting concepts out of large amounts of text for meaningful analysis. It has been used in a wide variety of settings, ranging from biomedical applications to marketing and emotional/sentiment research, where a lot of data needs to be analysed in order to extract core concepts. Text mining achieves this by applying techniques from information retrieval (as used, for example, by Google), natural language processing (including part-of-speech tagging and grammatical analysis), information extraction (such as term extraction and named-entity recognition) and data mining (such as pattern identification) [27,28].
Applications of text mining methods are diverse and include bioinformatics [29], customer profile analysis, anti-spam filtering of emails, event tracking, text classification for news agencies [30] and web search [31]. These applications extend to any sector where text documents exist. For instance, history and sociology researchers can benefit from the discovery of repeated patterns and links between events, crime detection can profit from the identification of similarities between one crime and another [32], and unsuspected facts found in documents may be used to populate and update scientific databases [33]. Other areas include automatically updating a calendar by extracting data from e-mails [33,34], identifying the original source of a news article [35], and monitoring inconsistencies between databases and the literature [36]; biomedical applications (for example, identification of biological entities, automatic extraction of protein interactions and association of proteins with functional concepts); marketing applications (customer relationship management); and sentiment analysis [37].

Methodology
In this section, the processes involved in this study are discussed. The raw texts, in electronic format, were extracted from the abstracts of academic journal publications downloaded from online digital data sources. These raw texts were pre-processed and transformed, features were extracted from the transformed texts, and the results were converted into structured data. These structured data were further analysed and the results interpreted for knowledge management purposes (decision making). Figure 1 presents the logical text mining operations carried out in this study. The data sources were electronic resources from the web, in the form of text, CSV, DOC, XLS and HTML documents/files. These data were transformed and loaded into the application data files, as shown in Figure 2.

Text Data Collection
The text data used were downloaded from journals in the Journal Quality List and imported into Microsoft Excel (CSV) format. The Journal Quality List (JQL) is a collation of journal rankings from a variety of sources, published primarily to assist academics in targeting papers at journals of an appropriate standard [2]. The list was originally collated while the editor was associated with the Bradford University School of Management (1997-2001).

Text Data Selection
Article abstracts, authors' bibliometrics (authors' affiliations) and keywords of the case studies were used for the analysis. The text data for this research were obtained from journal publications ranked in the Journal Quality List (JQL). As of 2013, 50 editions of the JQL had been published [2]; the 50th edition contained the rankings of journals from 2001 to 2013, and the journals cited in this edition were used in this study. Some journals in this edition were rated highly. A non-probability method known as purposive sampling was used to select the journals from the JQL. Purposive sampling is a sampling technique in which the elementary units are chosen according to the discretion of an expert who is familiar with the relevant characteristics of the population.

Text Pre Processing
This stage involved the use of information extraction tools to analyse the unstructured data by identifying the key features within the text. The abstracts, authors' bibliometrics and keywords usually provided enough semantic information about the journals. This information was extracted from the full text and parsed into independent sentences. This was implemented using the text mining software (MinerText 1.0) developed for this study.
In text pre-processing, each document was first split into a series of words (features). Adjectives, adverbs, nouns and multi-word terms were extracted from the document. Term frequency and inverse document frequency were the two parameters used in filtering terms: terms with low term frequency (TF) and document frequency (DF) were often removed from the indexing of the journal documents. To better match concepts among terms, words were stemmed using Porter's algorithm. The resulting feature set contained keywords, title words and clue words.
In "Bags of words" representation each word was represented as a separate variable having numeric weight. The most popular weighting schema is normalized word frequency tfidf. The tf-idf (term frequency-inverse document frequency) statistic was based on the frequency of a given term in the record. This was normalized by being divided by the total number of times term appeared in all records.
where idf is the inverse document frequency tf is the term frequency (number of word occurrences in a document) df(w) represents document frequency (number of documents containing the word) N gives the number of all documents tfidf(w) is the relative importance of the word in the document The text extraction involved the identification and extraction of texts from the scientific publications. Text characterisation involved the following transformation processes: Transform Cases: In this stage, all characters (tokens) in the documents were transformed into lower cases respectively.
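The tf-idf weighting defined above can be sketched in a few lines of Python; the toy corpus below is illustrative and not drawn from the study's data.

```python
# Minimal sketch of the tf-idf weighting defined above.
# The toy documents are illustrative, not the study's journal abstracts.
import math
from collections import Counter

documents = [
    "journal ranking and impact factor",
    "text mining of journal abstracts",
    "clustering journal publications with self organising maps",
]

tokenized = [doc.split() for doc in documents]
N = len(tokenized)

# Document frequency: number of documents containing each word.
df = Counter(word for doc in tokenized for word in set(doc))

def tfidf(word: str, doc: list) -> float:
    tf = doc.count(word)            # term frequency within the document
    idf = math.log(N / df[word])    # inverse document frequency
    return tf * idf

for doc in tokenized:
    weights = {w: round(tfidf(w, doc), 3) for w in set(doc)}
    print(weights)
```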
Tokenization: Rather than using complex methods such as a Part-of-Speech (POS) tagger, the tokenization process was accomplished by splitting the sequence of characters into a sequence of tokens at non-letter characters. This resulted in tokens consisting of a single word each, which were used in building the word vector.
Filter Tokens (by Length): The tokens obtained from splitting the document text were filtered based on their length (the number of characters they contained), with a minimum of two characters and a maximum of twenty-five characters. Stop Word Removal: Extremely common words, which appear to be of little value in helping to select documents matching a user's need, were excluded from the vocabulary entirely. English stop words, such as a, an, the and the prepositions, were filtered from the documents by removing every token that matched an entry in the built-in stop word list; tokens contained in the stop word list were discarded. This process is called stop word removal.
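Assuming that a small stop-word list and NLTK's Porter stemmer stand in for the study's MinerText 1.0 implementation, the pre-processing steps described above (case transformation, tokenization on non-letter characters, length filtering, stop-word removal and stemming) can be sketched as follows.

```python
# Hedged sketch of the pre-processing pipeline described above: lower-casing,
# tokenization on non-letter characters, length filtering, stop-word removal
# and Porter stemming. NLTK's stemmer and this small stop-word list stand in
# for the study's MinerText 1.0 implementation, which is not publicly available.
import re
from nltk.stem import PorterStemmer

STOP_WORDS = {"a", "an", "the", "of", "in", "on", "for", "and", "to", "with"}
MIN_LEN, MAX_LEN = 2, 25   # token length bounds used in the study

stemmer = PorterStemmer()

def preprocess(text: str) -> list:
    text = text.lower()                                           # transform cases
    tokens = re.split(r"[^a-z]+", text)                           # split on non-letter characters
    tokens = [t for t in tokens if MIN_LEN <= len(t) <= MAX_LEN]  # filter tokens by length
    tokens = [t for t in tokens if t not in STOP_WORDS]           # stop word removal
    return [stemmer.stem(t) for t in tokens]                      # stem to word roots

abstract = "Clustering of the highly rated journals using self-organising maps."
print(preprocess(abstract))   # stems such as 'cluster', 'rate', 'journal', 'organis', 'map'
```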

Feature Selection and Attribute Generation
In this stage, a subset of the features was selected to represent each document. This created an improved text representation, since many features had little information content. Stop words were removed and words were stemmed down to their roots; stemming identifies a word by its root and reduces dimensionality (the number of features). Features were selected based on their usefulness for classification, and some irrelevant attributes were removed. Document Representation: The vector space model is an efficient method of representing documents as vectors using the term-frequency weighting scheme. The entire data collection from the PDF/XML files was represented as vectors using the vector space model.
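As a sketch of this vector space representation, scikit-learn's TfidfVectorizer (an assumed stand-in for the study's own term-weighting code) converts a small set of illustrative abstracts into a term-document matrix.

```python
# Hedged sketch of the vector space model described above, using scikit-learn's
# TfidfVectorizer as a stand-in for the study's own document representation.
from sklearn.feature_extraction.text import TfidfVectorizer

abstracts = [
    "Ranking of academic journals using impact factors.",
    "Text mining techniques for clustering journal abstracts.",
    "Self organising maps for characterising journal publications.",
]

vectorizer = TfidfVectorizer(stop_words="english")   # built-in English stop-word removal
X = vectorizer.fit_transform(abstracts)              # sparse term-document matrix

print(vectorizer.get_feature_names_out())            # the selected word features
print(X.toarray().round(2))                          # one tf-idf weighted vector per document
```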
Euclidean Distance: The cosine measure is a similarity rather than a distance, and distances are more convenient to work with. Similarities were converted to distances using (3) by organising the pairwise similarities into a positive-definite matrix C, whose ij-th element indicates the similarity of the i-th and j-th documents:

d_ij = sqrt(c_ii − 2c_ij + c_jj)        (3)

When two documents are identical (c_ij = c_ii = c_jj), the distance is zero. Alternatively, the Euclidean distance can be computed directly from the document vectors as

d_ij = sqrt( Σ_{k=1}^{m} (x_ik − x_jk)² )

where i and j index the records (documents), m is the number of variables, and x is the matrix containing the i-th and j-th document vectors.
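A minimal sketch of this similarity-to-distance conversion is given below; scikit-learn's cosine_similarity and the toy vectors are assumptions used only for illustration.

```python
# Hedged sketch of converting a cosine similarity matrix C into distances,
# d_ij = sqrt(c_ii - 2*c_ij + c_jj), as in equation (3). scikit-learn's
# cosine_similarity is an assumed stand-in for the study's own computation.
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Three toy document vectors (rows), e.g. tf-idf weights over five terms.
X = np.array([
    [0.2, 0.0, 0.5, 0.1, 0.0],
    [0.0, 0.3, 0.4, 0.0, 0.2],
    [0.2, 0.0, 0.5, 0.1, 0.0],   # identical to the first document
])

C = cosine_similarity(X)                  # pairwise similarity matrix
diag = np.diag(C)
D = np.sqrt(np.maximum(diag[:, None] - 2 * C + diag[None, :], 0.0))  # guard tiny negatives

print(np.round(D, 3))   # D[0, 2] is 0: identical documents have zero distance
```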

Text Clustering
The clustering algorithm used for this study was the Self-Organising Map (SOM). SOM was well suited to clustering because of the high weights it attributed to words in the clusters; in addition, it suggested further words that were used for classifying the journals.
The SOM procedure for text clustering was summarized as follows:

i. Each node's weight vector was initialized.

ii. A vector was chosen at random from the set of training data and presented to the SOM network.

iii. Every node in the network was examined to determine which node's weights were most similar to the input vector. The winning node, commonly known as the Best Matching Unit (BMU), was found using the Euclidean distance

dist = sqrt( Σ_{i=0}^{n} (I_i − W_i)² )

where I is the current input vector, W is a node's weight vector and n is the number of weights.

iv. The weight vectors of the winner and its neighbours were adapted using (6):

m_i(t+1) = m_i(t) + α(t) h_i(t) [x(t) − m_i(t)]        (6)

where m_i(t+1) is the neuron vector after adaptation, m_i(t) is the original neuron vector before adaptation, α(t) is the learning rate, h_i(t) is the neighbourhood rate, x(t) is the document vector, t is the time step, and [x(t) − m_i(t)] is the difference between the neuron vector and the document vector.

v. Steps (ii)-(iv) were repeated for N iterations.
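A minimal NumPy sketch of this procedure is given below; the grid size, learning-rate schedule and Gaussian neighbourhood function are illustrative assumptions rather than the settings used in the study.

```python
# Minimal NumPy sketch of the SOM procedure outlined above. The grid size,
# learning-rate schedule and Gaussian neighbourhood are illustrative
# assumptions, not the settings used in the study.
import numpy as np

rng = np.random.default_rng(0)

n_docs, n_features = 20, 10
X = rng.random((n_docs, n_features))            # toy document vectors (e.g. tf-idf weights)

grid_w, grid_h = 4, 4                           # map dimensions
W = rng.random((grid_w * grid_h, n_features))   # i. initialise node weight vectors
coords = np.array([(i, j) for i in range(grid_w) for j in range(grid_h)], dtype=float)

n_iterations = 500
alpha0, sigma0 = 0.5, 2.0                       # initial learning rate and neighbourhood radius

for t in range(n_iterations):
    x = X[rng.integers(n_docs)]                 # ii. pick a training vector at random
    bmu = np.argmin(np.linalg.norm(W - x, axis=1))   # iii. best matching unit (Euclidean)

    alpha = alpha0 * (1 - t / n_iterations)          # decaying learning rate a(t)
    sigma = sigma0 * (1 - t / n_iterations) + 1e-3   # shrinking neighbourhood radius
    grid_dist = np.linalg.norm(coords - coords[bmu], axis=1)
    h = np.exp(-(grid_dist ** 2) / (2 * sigma ** 2)) # neighbourhood function h_i(t)

    W += alpha * h[:, None] * (x - W)           # iv. m_i(t+1) = m_i(t) + a(t) h_i(t) [x(t) - m_i(t)]

# Assign each document to the cluster (map node) of its best matching unit.
clusters = np.argmin(np.linalg.norm(X[:, None, :] - W[None, :, :], axis=2), axis=1)
print(clusters)
```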
Table 1 presents a summary of the seven highly rated journals selected by the 19 rankers from 2001 to 2013, with their impact factors, the total number of volumes and the total number of issues in each volume. The total number of issues of all the highly rated journals from inception until July 2013 was 1149. Simple random sampling with a sample size of 10% was used to select the journal articles analysed in this study. The abstracts (text data) of the selected articles were converted to word features: the stop words in the text were removed and the resulting words were stemmed down to their roots. Stemming identifies a word by its root and reduces dimensionality (the number of features). Features were then selected based on classification, and some irrelevant attributes were removed.
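The 10% simple random sampling step mentioned above can be sketched as follows; the article identifiers are hypothetical and stand in for the study's article lists.

```python
# Hedged sketch of drawing a 10% simple random sample of articles.
# The article identifiers are hypothetical, not the study's data.
import random

random.seed(1)
article_ids = [f"article_{i}" for i in range(1, 501)]   # hypothetical article identifiers
sample_size = round(0.10 * len(article_ids))            # 10% sample size
sample = random.sample(article_ids, k=sample_size)

print(len(sample), sample[:5])
```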

Results and Discussions
The word dictionary constructed for highly rated journals is presented in Table 2.
The list of words was generated using the bag-of-words representation together with the number of times each word appeared in the documents, as shown in Figure 3. Table 3 shows the features that were generated, and Table 4 presents the clusters obtained by clustering the highly rated journal data by Institution, Designation, Faculty and Location.
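The construction of such a word list with frequency counts can be reproduced in miniature as follows; the stemmed tokens are illustrative, not the study's data.

```python
# Minimal sketch of counting word occurrences for a bag-of-words dictionary.
# The tokens below are illustrative stemmed tokens, not the study's data.
from collections import Counter

tokens = ["journal", "rank", "journal", "cluster", "map", "journal", "cluster"]
word_counts = Counter(tokens)

print(word_counts.most_common(3))   # e.g. [('journal', 3), ('cluster', 2), ('rank', 1)]
```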
As shown in Table 4, in cluster one most of the publications were from universities and the highest number of publications came from North America. Cluster two shows that the publications were mostly Management journals, the majority of them from North America. Cluster three revealed that the highest number of publications again came from North America, and that most were university publications. Cluster four showed that most publications came from North America and were university publications; most of the dominant authors were Professors, and the journals were mostly Management journals. In cluster five, most of the authors were Professors from universities, most of the publications were from North America, and they were mostly Business Administration journals. In cluster six, the dominant publication area was Finance; these publications originated from universities and were mostly from North America. Cluster seven revealed that most of the publications were Management journals and came from universities.

Conclusions
In this study, we focused on the characterization of academic journal articles using text mining techniques. This was done by using the text mining software developed for the study to capture abstracts of academic publications from highly rated journals in the Journal Quality List. These abstracts were drawn from many issues containing many articles. The raw article documents were split into series of words (features), stop words were removed, and words were stemmed down to their roots, thereby transforming the unstructured data into structured data. Mining the processed data identified patterns and extracted valuable information and new knowledge. The data of the highly rated journals were classified by Institution, Location, Designation and Faculty, and the Self-Organizing Map (SOM) clustering algorithm was used to cluster the data. SOM performed well in clustering because of the high weights it attributed to words in the clusters; in addition, it suggested further words that were used for classification in the journals, and it was able to depict texts in a more figurative and visual way. The model developed in this study is useful for analysing and discovering patterns in academic electronic resources. It has also helped in identifying and characterizing the most relevant factors or features for determining how academic journals are ranked in academic institutions.