SGTlight: Spectral Graph Transducer
Author: Thorsten Joachims <thorsten@joachims.org>
Cornell University, Department of Computer Science
Version: 1.00
SGTlight is an implementation of a Spectral Graph Transducer (SGT) [Joachims, 2003] in C using Matlab libraries. The SGT is a method for transductive learning. It solves a normalized-cut (or ratio-cut) problem with additional constraints for the labeled examples using spectral methods. The approach is efficient enough to handle datasets with several tens of thousands of examples.
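As a rough sketch of the underlying optimization (my paraphrase of [Joachims, 2003]; the exact definitions of the cost matrix and target values are in the paper), the SGT computes a prediction vector z over all examples, labeled and unlabeled, approximately by solving

\[
\min_{z}\; z^{\top} L\, z \;+\; c\,(z-\gamma)^{\top} C\,(z-\gamma)
\qquad \text{s.t.}\quad z^{\top}\mathbf{1} = 0,\;\; z^{\top} z = n,
\]

where L is the (normalized) Laplacian of the k-NN graph, C is a diagonal cost matrix that is non-zero only for the labeled examples, and gamma encodes the training labels. The problem is solved in the span of the d smallest eigenvectors of L, which is why sgt_decompgraph precomputes this eigendecomposition.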
The program is free for scientific use. Please contact me if you are planning to use the software for commercial purposes. The software must not be distributed without prior permission of the author. If you use SGTlight in your scientific work, please cite [Joachims, 2003] (see the reference at the end of this page).
I would also appreciate it if you sent me (a link to) your papers so that I can learn about your research. The implementation was developed on Windows 2000 with mcc and Matlab R12, but it also compiles on Linux and Solaris. The source code is available at the following location:
https://download.joachims.org/sgt_light/current/sgt_light.tar.gz
If you just want the binaries, you can download the archives for Solaris, Windows, and Linux (see the installation instructions below).
Please send me an email <tj@cs.cornell.edu> to let me know that you got SGTlight. I will put you on my mailing list to inform you about new versions and bug fixes.
To compile and install the source code of SGTlight you need to download sgt_light.tar.gz. Unpack it with
gunzip -c sgt_light.tar.gz | tar xvf -
You compile the source code with
make
which compiles all executables (type 'make info' for additional options). Note that you need to have Matlab installed, including the mcc compiler. You will have five executables:
sgt_buildknngraph (creates a k-NN graph from feature vectors)
sgt_decompgraph (compute first d eigenvectors of graph Laplacian)
sgt_predict (predict labels of test examples)
sgt_traintestsplit (make random test/training split)
sgt_addgraphs (adds adjacency matrices of two graphs)
To run the executables, you need the Matlab runtime library, which is available as the compressed archive mglinstaller delivered with Matlab. Copy mglinstaller into your sgt_light directory and call
mglinstaller
Then extend your library search path according to your operating system:

Solaris: setenv LD_LIBRARY_PATH ./bin/sol2:$LD_LIBRARY_PATH
Windows: set PATH=.\bin\win32;%PATH%
Linux: setenv LD_LIBRARY_PATH ./bin/glnx86:$LD_LIBRARY_PATH
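Note that setenv is csh-style syntax. If you work in a bash-compatible shell (an assumption about your environment, not something the instructions above cover), the Linux line would look roughly like this:

# bash/sh equivalent of the csh setenv command above (sketch);
# replace glnx86 with sol2 on Solaris. The ${VAR:+...} expansion avoids
# appending a stray colon if LD_LIBRARY_PATH was previously unset.
export LD_LIBRARY_PATH=./bin/glnx86${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}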
In addition, copy the following files into your sgt_light directory:

cp toolbox/matlab/datatypes/cellfun.* .
cp toolbox/matlab/sparfun/arpackc.* .
To install one of the binary archives of SGTlight (e.g. sgt_light_solaris.tar.gz), you need to download it and unpack it with the command for your platform:
gunzip -c sgt_light_solaris.tar.gz | tar xvf -
gunzip -c sgt_light_windows.tar.gz | tar xvf -
gunzip -c sgt_light_linux.tar.gz | tar xvf -
You will have five executables:

sgt_buildknngraph (creates a k-NN graph from feature vectors)
sgt_decompgraph (compute first d eigenvectors of graph Laplacian)
sgt_predict (predict labels of test examples)
sgt_traintestsplit (make random test/training split)
sgt_addgraphs (adds adjacency matrices of two graphs)
To run the executables, you need the Matlab runtime libraries, which are included in the binary archive (see the notes on the Matlab libraries in the source installation instructions above). Finally, you must extend your search path by executing
Solaris: setenv LD_LIBRARY_PATH ./bin/sol2:$LD_LIBRARY_PATH
Windows: set PATH=.\bin\win32;%PATH%
Linux: setenv LD_LIBRARY_PATH ./bin/glnx86:$LD_LIBRARY_PATH
This section explains how to use the SGTlight software. The background is given in [Joachims, 2003].
SGTlight consists of three main modules and two additional convenience programs. The first main module (sgt_buildknngraph) constructs the k-nearest-neighbor graph from data given as feature vectors in SVMlight format. The second (sgt_decompgraph) computes the d smallest eigenvectors of the Laplacian of a graph, and the third (sgt_predict) computes the actual predictions for a given training and test set. In addition, there is a program (sgt_traintestsplit) for computing a random test/training split and a program (sgt_addgraphs) for adding two adjacency matrices (used for co-training).
You can get a description of the parameters of each module by calling it with the option -?.
The following examples illustrate how to use the different programs. You will find an example text classification problem at
https://download.joachims.org/sgt_light/examples/example1.tar.gz
Download this file into your sgt_light directory and unpack it with
gunzip -c example1.tar.gz | tar xvf -
This will create a subdirectory example1. Documents are represented as feature vectors. Each feature corresponds to a word stem (9947 features). The task is to learn which Reuters articles are about "corporate acquisitions". There are 300 positive and 300 negative examples in the file examples.dat. The feature numbers correspond to the line numbers in the file words. To run the example, execute the commands:

sgt_buildknngraph -k 50 example1/examples.dat example1/graph.gra
sgt_decompgraph -d 80 example1/graph.gra example1/graph.dcg
The first command takes the input points in example1/examples.dat and computes the weighted 50-nearest-neighbor graph (using cosine similarity), which is stored in example1/graph.gra. See 'sgt_buildknngraph -?' for additional options. The second command computes the first 80 eigenvectors of the (normalized) Laplacian of example1/graph.gra and stores them in example1/graph.dcg. Both sgt_buildknngraph and sgt_decompgraph need to be called only once per dataset, since their results are independent of the particular test/training split and of the labels in the training set.
To generate a random test/training split with 10 examples in the training set, call the command
sgt_traintestsplit -n 10 example1/examples.dat example1/train.dat example1/test.dat
The files example1/train.dat and example1/test.dat contain the train and test labels in the same order as in the file example1/examples.dat. +1 stands for a positive label, -1 stands for a negative label, and 0 for unlabeled. To compute the predictions for this training/test split, call
sgt_predict -c 1000 -l example1/test.dat example1/train.dat example1/graph.dcg example1/pred
Since the true test labels are given (-l example1/test.dat), the prediction accuracy on the test set is computed and printed to stdout. The trade-off between training error and complexity is set via -c 1000 similar to an SVM. The individual predictions on the training and test examples are output to example1/pred, in the same order as in the file example1/train.dat. A positive value indicates prediction as a positive example, a negative value indicates a negative prediction.
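For convenience, the steps of this example could be combined into a small shell script along the following lines. This is only a sketch reusing the commands and options shown above; the loop over several random splits is my addition and simply illustrates that the graph and its eigendecomposition can be reused across splits.

#!/bin/sh
# Sketch: run the example1 pipeline end to end. The graph and its
# eigendecomposition are computed once and reused for several random
# training/test splits. All commands and options are documented above.

sgt_buildknngraph -k 50 example1/examples.dat example1/graph.gra
sgt_decompgraph -d 80 example1/graph.gra example1/graph.dcg

for i in 1 2 3; do
  sgt_traintestsplit -n 10 example1/examples.dat example1/train.dat example1/test.dat
  sgt_predict -c 1000 -l example1/test.dat example1/train.dat example1/graph.dcg example1/pred
done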
Sorry, not yet ready.
If you find bugs or you have problems with the code that you cannot solve yourself, please contact me via email <tj@cs.cornell.edu>.

This software is free only for non-commercial use. It must not be distributed without prior permission of the author. The author is not responsible for implications from the use of this software.
[Joachims, 2003] Thorsten Joachims, Transductive Learning via Spectral Graph Partitioning, Proceedings of the International Conference on Machine Learning (ICML), 2003.
Last modified August 22nd, 2003 by Thorsten Joachims <thorsten@joachims.org>