Location via proxy:   [ UP ]  
[Report a bug]   [Manage cookies]                
Skip to content
This repository has been archived by the owner on Dec 11, 2022. It is now read-only.
/ rdt Public archive

RDT: Russian Distributional Thesaurus (Русский Дистрибутивный Тезаурус)

License

Notifications You must be signed in to change notification settings

nlpub/rdt

Repository files navigation

rdt

RDT: Russian Distributional Thesaurus (Русский Дистрибутивный Тезаурус)

This package let you efficiently use word graph of the Russian Distributional Thesaurus.

Quickstart

  1. Download the pre-packed resource:
wget http://panchenko.me/data/russe/rdt.pkl
  1. Install dependencies, e.g.:
pip install -r requirements.txt
  1. Load the distributional thesaurus (specify path to the downloaded 'rdt.pkl' file):
from dt import RDT, DistributionalThesaurus
rdt = RDT(dt_pkl_fpath="rdt.pkl")

Loading takes about 5 minutes and the resulting structure occupy around 1.3 Gb of RAM. This is however more efficient than parsing the CSV file into a dict in terms of both time and memory consumption. This implementation relies on marisa trie for storing keys and on numpy array for storing similarity scores.

  1. Search for nearest neighbours:
for w,s in rdt.most_similar(u"граф"):
    print w,s