This is a tool for training semantically compositional word vectors and syntactic transformation matrices. Please refer to the following paper for the underlying theory:
Learning Semantically and Additively Compositional Distributional Representations, ACL2016
Building the tool requires a shell, make, a C++ compiler, Java, and Scala. It trains vectors and matrices on dependency parses of corpora.
In this demo, we compile the source, download a sample corpus, train a model on that corpus, and investigate the learned model.
Assume the working directory is the root of this git repo.
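Before building, you can optionally check that the prerequisites are on your PATH. This is a quick sketch; the tool names (e.g. g++ for the C++ compiler) are assumptions and may differ on your system.

```shell
# Check each build prerequisite; the exact names (g++ vs clang++,
# etc.) may differ on your system.
for tool in make g++ java javac scala scalac; do
  if command -v "$tool" > /dev/null 2>&1; then
    echo "$tool: found"
  else
    echo "$tool: MISSING"
  fi
done
```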
cd external/
./get_Eigen-3.2.9.sh
cd ../cpp/
make
cd ../mylib/scala/
mkdir classes
scalac -d classes $(find . -name '*.scala')
cd ../../scala/
mkdir classes
scalac -d classes $(find . -name '*.scala')
cd ../
cd data/
./get_enwiki-partial-parsed.sh
cd ../
This corpus is a small portion of English Wikipedia (extracted with Wikipedia Extractor), parsed by the Stanford Parser (2015-12-09 release). The following loop converts each parse file into DCS trees:
for fn in $(ls data/enwiki-partial-parsed/parse-??.out); do echo $fn; scala -cp scala/classes/ ud_to_dcs.en_stanford_simple.DepTree $fn > $fn.dcs; done
Next, build the vocabulary files (words.sort and roles.sort) used for training:
script/make_vocab.sh 200 1000 data/enwiki-partial-parsed/parse-??.out.dcs
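The internals of make_vocab.sh are not shown here, but the general frequency-cutoff idea behind vocabulary building can be sketched with standard tools on a toy input (the file names, the data, and the cutoff value below are made up for illustration):

```shell
# Toy illustration of a frequency-cutoff vocabulary: count token
# frequencies and keep only tokens appearing at least 2 times.
# (This is a sketch of the idea, not what make_vocab.sh literally does.)
printf 'the cat sat\nthe dog sat\nthe cat ran\n' > toy_corpus.txt
tr ' ' '\n' < toy_corpus.txt | sort | uniq -c | sort -rn \
  | awk '$1 >= 2 { print $2, $1 }' > toy_vocab.txt
cat toy_vocab.txt   # "the 3" comes first; singletons are dropped
```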
As an example, we train a model on the previously obtained DCS trees: 6 epochs, 10 parallel threads, 200-dimensional vectors, and 4 negative samples per data point. This will take about one hour.
cpp/train words.sort roles.sort model- 0 6 10 200 4 data/enwiki-partial-parsed/parse-??.out.dcs
One model file is saved per epoch; the file from the last epoch will be named model-006. The learned model can be inspected with:
cpp/see_model words.sort roles.sort model-006
Then, use commands such as
M book/NOUN
or
M learn/VERB SUBJ ARG
or
M learn/VERB COMP ARG
or
M learn/VERB ARG about
etc. to investigate the model.
A 250-dim model trained on the whole English Wikipedia dump can be downloaded as follows. Parsing took about one week on a cluster of 20 machines with 12 cores each, and training took about three days on a 24-core machine.
cd acl2016_eval/
./get_enwiki-model.sh
cd ../
A 200-dim model trained on a collection of 19th-century novels (collected from Project Gutenberg for the Microsoft Sentence Completion Challenge) can be downloaded as follows.
cd acl2016_eval/
./get_MSRSentComp-model.sh
cd ../
The following evaluates the Wikipedia model on an adjective-noun phrase similarity dataset:
cpp/acl2016_eval/phrase_similarity acl2016_eval/phrase_similarity/mitchell10-AN.txt acl2016_eval/enwiki-model/words.cut1000.sort acl2016_eval/enwiki-model/roles.cut10000.sort acl2016_eval/enwiki-model/model-dim250-001 /dev/null
The following commands convert the SemEval-2010 Task 8 training and test data into features in the format used by the LIBSVM toolkit.
cpp/acl2016_eval/relation_classification_svm_feature 16000 acl2016_eval/relation_classification/semeval10t8_train_converted.txt acl2016_eval/enwiki-model/words.cut1000.sort acl2016_eval/enwiki-model/roles.cut10000.sort acl2016_eval/enwiki-model/model-dim250-001 > semeval10t8_train.svm
cpp/acl2016_eval/relation_classification_svm_feature 2717 acl2016_eval/relation_classification/semeval10t8_test_converted.txt acl2016_eval/enwiki-model/words.cut1000.sort acl2016_eval/enwiki-model/roles.cut10000.sort acl2016_eval/enwiki-model/model-dim250-001 > semeval10t8_test.svm
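The exact features these converters emit are not documented here, but LIBSVM's input format itself is simple: one example per line, a numeric label followed by index:value pairs. The two sample lines below are made up for illustration (the labels and feature values do not come from the SemEval data), followed by a quick format check:

```shell
# Two made-up examples in LIBSVM input format.
cat > sample.svm << 'EOF'
3 1:0.25 7:1.0 42:-0.5
1 2:0.9 5:0.1
EOF
# Verify that every token after the label looks like index:value.
awk '{ for (i = 2; i <= NF; i++)
         if ($i !~ /^[0-9]+:-?[0-9.]+$/) { print "bad"; exit 1 } }
     END { print "ok" }' sample.svm
```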
Then, an SVM model can be trained and tested with LIBSVM:
svm-train -g 0.25 -c 2.0 semeval10t8_train.svm semeval10t8.model
svm-predict semeval10t8_test.svm semeval10t8.model semeval10t8.out
The following commands run the Microsoft Sentence Completion evaluation, with the full model and with a no-role variant, respectively:
cpp/acl2016_eval/sentence_completion acl2016_eval/sentence_completion/MSRSentComp-converted.txt acl2016_eval/sentence_completion/MSRSentComp-answers.txt acl2016_eval/MSRSentComp-model/words.cut50.sort acl2016_eval/MSRSentComp-model/roles.cut1000.sort acl2016_eval/MSRSentComp-model/model-dim200-018 /dev/null
cpp/acl2016_eval/sentence_completion acl2016_eval/sentence_completion/MSRSentComp-converted.txt acl2016_eval/sentence_completion/MSRSentComp-answers.txt acl2016_eval/MSRSentComp-model/words.cut50.sort norole acl2016_eval/MSRSentComp-model/model-dim200-norole-034 /dev/null
To parse your own corpus, run the Stanford Parser with the options -outputFormat "wordsAndTags,typedDependencies" -outputFormatOptions "stem,basicDependencies".
The following example downloads the parser and then parses a simple sentence:
cd external/
./get_Stanford-Parser-2015-12-09.sh
cd ../
echo "I am happy." | script/lexparser.sh -
We also provide a tool that parses a large corpus by running several Stanford Parser instances in parallel JVMs. To compile it:
cd stanford_parser_run/
mkdir classes
javac -cp "../external/stanford-parser-full-2015-12-09/*" -d classes $(find . -name '*.java')
cd ../
To parse, write the paths of all files you want to parse into a single file, say fnlist.txt, one path per line. Then run:
script/parse_files.sh fnlist.txt output_dir 10
This parses all files listed in fnlist.txt using 10 parallel parsers and writes the results into output_dir. (WARNING: if output_dir already exists, this command will erase its existing contents.)
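The dispatch logic of parse_files.sh is not reproduced here, but the general pattern it follows — one input path per line, a fixed number of parallel workers, one output file per input — can be sketched with xargs on toy data (the file names are made up, and tr stands in for a real parser):

```shell
# Sketch of the parallel-dispatch pattern (toy input files;
# 'tr' stands in for the parser).
mkdir -p toy_out
for n in a b c; do echo "sentence $n" > "doc_$n.txt"; done
ls doc_?.txt > toy_fnlist.txt
# Run up to 3 workers at a time, one output file per input file.
xargs -P 3 -I {} sh -c 'tr a-z A-Z < "$1" > toy_out/"$1".out' _ {} < toy_fnlist.txt
ls toy_out
```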