A collective learning approach for semi-supervised data classification

Öz Semi-supervised data classification is one of significant field of study in machine learning and data mining since it deals with datasets which consists both a few labeled and many unlabeled data. The researchers have interest in this field because in real life most of the datasets have this feature. In this paper we suggest a collective method for solving semi-supervised data classification problems. Examples in R 1 presented and solved to gain a clear understanding. For comparison between state of art methods, well-known machine learning tool WEKA is used. Experiments are made on real-world datasets provided in UCI dataset repository. Results are shown in tables in terms of testing accuracies by use of ten fold cross validation


Introduction
In recent years with the increasing availability of data, classification, a method of data mining, is studied by many researchers enourmously. Data classification is creation of a specific recognition system that uses a training set selected from a specific dataset. If this specific dataset consists of unknown labeled data, the process is called unsupervised; if it consists of known labeled data, it is called supervised; and if it consists of both known and unknown labeled data, it is called semi-supervised data classification.
Semi-supervised classification techniques can be thought as a compromise between unsupervised and supervised techniques and it aims to utilize from both of them. Much used semi supervised learning methods can be alined as selftraining, co-training, transductive support vector machines, and graph-based methods.
Self training is a wrapper method around a supervised classifier. To apply self-training, for each instance x to be classified, besides its predicted class label, the classifier must be able to output a certainty score, i.e., an estimation of how likely the predicted class label is correct [1]. Combination of active learning and self-training for cross-lingual sentiment classification with density analysis of unlabelled samples are studied by Hajmohammadi et. al. [2]. Also different applications of this method are presented by Chinaei [3] and Kanga et. al. [4]. Transductive support vector machines (TSVMs) use each labeled and unlabeled data in training phase and it searches a reliable separating hyperplane. This method is used and examined by Bruzzone et. al. [5] and Ordin [6]. In co-training, proposed by Blum and Mitchell in 1998, two classifiers are trained. It uses the predicitons of each classifier on unlabeled examples to augment the training set of the other [7]. Graph based methods use nodes and edges to model the whole dataset as a graph. Nodes represents labeled and unlabeled datasets and edges defines the similarities between points. These methods fundamentally predicts a function on the graph such that it should be close to the given labels on the labeled examples, and also it should be smooth on the whole graph [8].
Supervised data classification is studied by various researches and lots of different approaches and methods have beeen presented for solving supervised data classification problems. In this paper, we use various supervised classification algorithms defined in Waikato Environment Knowledge Analysis WEKA program as Naive Bayes, Logistics, Classification via Regression, Decision table and Decision trees. A brief information about the used algorithms is given in Section 2 and a detailed one is presented by Alpaydın [9] and Frank et. al. [10].
Unsupervised data classification also known as clustering, deals with the problem of organization of a collection of objects into clusters based on similarity. One of our approach that we suggest for semi-supervised data classification in this paper is a combination of unsupervised and supervised classification technique. We use clustering method just for specifying the center points of clusters. We prefer to use kmeans method as a clustering method since it is the most preferred one. This method will be explained in the next section. More detailed information about clustering and kmeans method is proposed by Bagirov et. al. [11].
In section 2, suggested algorithm for semi-supervised data classification is defined also different approaches for the initilization of the algorithm are discussed. Instructions are given to understand why we prefer these approaches. Brief explanations about supervised classification algorithms that will be used after initilization are presented. In section 3, ordinary datasets in R 1 are generated and applied to defined algorithm's different approaches for a clear understanding. Besides for comparison, experiments are made on real-world datasets via state of art methods provided in WEKA. Obtained evaluation results are presented by use of ten-fold cross validation. Finally in section 4 we conclude the paper.

Matherial and method
In semi-supervised data classification, the aim is to benefit from both labeled and unlabeled data during the learning process. Learning from labeled data is called supervised learning and learning from unlabeled data is called unsupervised learning.
In this paper, we aim to transform the semi-supervised data classification problem to a supervised one since various methods have been defined for supervised learning. In accordance with this purpose, center points of the classes are aimed to be found by using the labeled points. We use two methods for finding them.
Suppose that we are given labeled A, B and unlabeled C sets We suggest three approaches for finding the center points of given A and B datasets in the below subsections.

k-means method (first approach)
In the first approach, we experienced commonly used k-means clustering method on labeled data for determining the center points of the classes. Actually k-means method is an unsupervised method whose goal is to partition the unlabeled dataset into k parts in terms of the similarities. But in our experiment we use it just for finding the center points of the labeled datasets. Since we make binary classification we imply k-means algorithm on two datasets (A and B ) individually.
Choose a seed solution consisting of k centers (not necessarily belonging to A), Step 2.
Allocate data points to its closest center and obtain kpartition of A, Step 3.

Recompute centers for this new partition and go to
Step 2 until no more data points change cluster.
After obtaining 2 center points by use of k-means algorithm. We calculate all distances between unlabeled points and each center points. The closest center points' class is given to the unlabeled one. This approach requires (2 ) iterations after finding center points.

Mean method (second approach)
In the second approach, we use mean points of labeled A and B datasets as center points.
Suppose that we are given A and B classes, consisting of respectively m and p n dimensional vectors. Calculation of mean points can be given as (1) and (2): Center point of = 1 , 2 , … , ) , such that After calculation of mean points, we calculate all distances between unlabeled points and obtained two center points. The closest center points' class is given to the unlabeled one. This approach requires 2 iterations after finding center points.

Calculating all distances (third approach)
In the third approach, different from other two approaches, by thinking so basicly, we dont find any center point. We calculate all distances between unlabeled point and every labeled point. Unlabeled point is labeled with the closest labeled points' class. This approach requires ( + ) iterations such that , , is respectively the number of unlabeled points, points of A and points of B. We dont experiment this approach since it is not so effective in terms of running time for large datasets.

A collective algorithm for semi-supervised data classification
Here if we contrast given three aproaches in terms of number of iterations in a sense running time of labeling the unlabeled points process, the third one is not so effective in large datasets. Such that ( + ) > (2 ) and ( + ) > 2 where k (defined cluster number in k-means method) is less than m (number of points in A) and p (number of points in B).
From this point of view, second approach looks more effective but it shouldn't be forgotten that the given iteration numbers are calculated after finding the center points. That is to say here the elapsed time during the process of finding centers is so important.
When we contrast the first and second approaches, in the second one by determining the number of clusters before implying k-means method, we can obtain more effective results by using two or more center points for each classes, and also it should not be forgotten that in the use of clustering method both distance metric and number of clusters (here we call them center points) can be determined by the experimenter in terms of the structure of the data.
In this paper, we experiment both approaches in initilization of the semi-supervised algorithm given below.

Algorithm 2:
A collective semi-supervised data classification algorithm via clustering and supervised data classification methods.
Suppose that we are given labeled A, B and unlabeled C sets including m, p and t n-dimensional vectors, respectively: Step 0 (Initilization): Find center points of A and B datasets via one of suggested approaches.
Assign each C point to the closest center points' class. Use Euclidian distance for calculations as in (3): Where, Here, k is the number of center points that's determined by the experimenter before implying clustering method. If second approach (mean) is used in initilization, k is assumed to be 1.
Redetermine both A and B sets in accordance with new labeled points of C set, Step 3.
Apply a supervised data classification algorithm by use of redetermined A and B datasets as training set, Step 4.
Define the obtained function or model that separates the sets A and B and STOP.
In Algorithm 2, selection of supervised data classification is left to the discretion of the experimenter. In the numerical experiments given in the next section various supervised data classification methods from WEKA are used. Brief explanation about used supervised data classification methods can be given as follows: Naive Bayes: Bayesian classifiers assign the most likely class to a given point described by its feature vector. The probability of X point being in class C is computed as equation (4): Where = ( 1 , … , ) [13].
Logistics: Logistic regression is an alternative method to the Linear Discriminant Analysis that generates classifier functions to separate two or more groups by minimizing the misclassification cost. Logistic regression method has fewer assumptions than linear discriminant models. Logistic regression model with maximum likelihood estimators is preferred for solving problems under nonnormality [14], [15].

Classification via Regression: It uses regression methods for classification. Class is binarized and one regression model is built for each class value [16].
Decision Table: It uses a simple decision table majority classifier [17].
In the next section, numerical experiments will be presented on Algorithm 2 via suggested approaches.

Numerical experiments
In this section of the paper, for a clear understanding, we solved semi-supervised data classification problems in one dimensional space by implementation of Algorithm 2 with different supervised data classification methods and different initilization approaches that we disscussed in section 2. In the second part of this section we imply the suggested algorithm on real-world datasets to see the efficieny in large datasets. Besides for comparison we present the state of art methods' results on the same datasets. We use MATLAB and Weka Program in implementation and we experiment Naive Bayes, Logistics, Classification via Regression, Decision table and J48 methods that we explained in section 2 as supervised data classification methods.
In testing phase Ten-fold cross validation is used. It is explained as follows in [19]; the dataset D is randomly separated into 10 heterogenous subsets (the folds) 1 , 2 , . . , 10 of roughly equal size. The inducer is trained and tested 10 times; each time, it is trained on \ 2 and tested on .

Experiments via generated datasets
We generate datasets defined in 1 for the experiments. Mean value of points in = 13.
We assign each C point to the closest center points' class. Redetermined A and B sets according to labels of C points Since the obtained A and B datasets after labeling the unknown ones are the same for each approach, we don't apply Supervised data classification methods for each individually.
We use redetermined A and B datasets as training set in the supervised methods. Obtained evaluation results by use of ten-fold cross validation is given in Table 1.
In terms of the results in Table 1, it is seen that, in little datasets that can be separated with a linear separator, both first and second approaches (mean, clustering) for finding center points doesn't differ from each other such that redetermined A and B sets are the same in each approach. Also running times doesn't differ from each other since the dataset is not so large.  Obtained ten-fold cross validation results, after implementing supervised data classification methods on A and B datasets, are given in Table 2. Obtained ten-fold cross validation results, after implementing supervised data classification methods on A and B datasets, are given in Table 3. In example 2, it seen that redetermined A and B datasets for each approaches (mean, k-means) are different from each other unlike example 1. This is because, given datasets' distribution is different from each other such that given A and B datasets cannot be separated linearly in example 2.
As can be seen from Table 2 and 3 the best accuracy results for both of the approaches are the same. It was obtained in WEKA-J48. However in the other methods, Table 2 and 3 results differ from each other in terms of the differences of used center points that form different training sets.

Experiments via real-world datasets
For comparison between state of art methods on real-world datasets we use the Semi-Supervised Learning and Collective Classification package in the well-known machine learning tool WEKA [20].
Brief explanation of used WEKA semi-supervised classification methods are given as follows: LLGC: "Learning with local and global consistency" was presented in 2003 by Zhou and his friends [21]. It is a collective classifier that generates a smooth classifier function for labeled and unlabelled data.
YATSI: "Yet another two stage idea" was presented in 2006 by Driessens and his friends [22]. It is a collective classifier that uses the given classifier to train on the training set and labeling the unlabeled data. As classifier we choose J48 that generates pruned or unpruned C4.5 decision tree and for predictions as nearest-neighbor-search algorithm we chose KDTree search algorithm that uses Euclidian distance as distance function.
Collective Tree: It works similar to Random Tree with some little differences. It constructs a tree that considers k randomly chosen attributes at each node [23]. We choose 1 as the random seed number to be used. For implementations, Liver Disorders, Breast Cancer Wisconsin Diagnostics (WBCD), Heart Disease, Ionosphere and Blood Transfusion Service Center real-world datasets obtained from UCI Machine Learning Repository by Lichman [24] are used. Number of attributes and instances of these datasets are given in Table 4.
Since we make semi-supervised experiments, we use randomly %20 of the whole dataset as labeled and the rest as unlabelled data. We present the evaluation results by use of Ten-fold cross validation (mentioned in the section start) in Table 5. All the classifiers were tested with their default parameters.
When we interpret the evaluation results in Table 5. Algorithm 2 that uses combination of methods get better results than WEKA algorithms. When we compare Algorithm 2 with mean approach and k-means approach for finding center points, it can be seen that the implementation with mean method gets better results on three datasets however Blood and Ionosphere datasets get the best result in k-means approach.
That's because k-means method states the actual structure of these datasets.

Conclusion
In this paper we experiment a collective classification method for semi-supervised data classification problems. Supervised data classification techniques are used after labeling the unknown labeled points. For the process of labeling the unknown ones we use the center points of each classes. We present three approaches for finding center points. For a clear understanding we solve two examples in one dimensional space by using each approach in initilization of Algorithm 2. Also we present ten-fold cross validation results of implied algorithms. For comparison between state of art methods and to see the efficiency of suggested approaches in large datasets, we make implemantations of suggested algorithm and WEKA algorithms on real-world datasets.
The collective method that we suggest in Algorithm 2 can be useful for semi-supervised data classification problems. The choice of the approach that will be used in the initilization can be changed according to the structure of the given dataset and the existent programs. Therefore the choice of proper initilization approaches and supervised data classification methods that will be used in the suggested collective algorithm should be left to the researcher.