EasyESA is an implementation of Explicit Semantic Analysis (ESA) based on the Wikiprep-ESA code by Çağatay Çallı (https://github.com/faraday/wikiprep-esa). It runs as a JSON webservice that can be queried for the semantic relatedness measure, concept vectors, or context windows.
This manual provides information on the functionality, setup, and usage of the EasyESA package.
Explicit Semantic Analysis (ESA) is a text representation technique that uses Wikipedia as a commonsense knowledge base. Each Wikipedia article is treated as a concept, and the words occurring in an article are associated with that concept using TF-IDF scoring. A word can then be represented as a vector of its association strengths to each concept, and the "semantic relatedness" between any two words can be measured by the cosine similarity of their vectors. A document is represented as the centroid of the vectors of its words.
For more information on ESA, please refer to the paper by Evgeniy Gabrilovich and Shaul Markovitch: "Wikipedia-based semantic interpretation for natural language processing" (http://www.jair.org/media/2669/live-2669-4346-jair.pdf).
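The representation described above can be illustrated with a toy example. The concepts, words, and TF-IDF weights below are made up for illustration (a real ESA index is built from a full Wikipedia dump), but the cosine similarity and centroid computations are exactly those used by ESA:

```python
import math

# Hypothetical mini "Wikipedia" of three concepts; the weights are
# illustrative TF-IDF association strengths, not real index values.
concept_vectors = {
    "computing": {"Computer": 0.9, "Sensor": 0.2, "Coffee": 0.0},
    "sensor":    {"Computer": 0.3, "Sensor": 0.8, "Coffee": 0.0},
    "coffee":    {"Computer": 0.0, "Sensor": 0.0, "Coffee": 0.9},
}

def cosine(u, v):
    # Semantic relatedness = cosine similarity of two concept vectors.
    concepts = set(u) | set(v)
    dot = sum(u.get(c, 0.0) * v.get(c, 0.0) for c in concepts)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def document_vector(words):
    # A document is represented as the centroid of its words' vectors.
    centroid = {}
    for w in words:
        for c, weight in concept_vectors[w].items():
            centroid[c] = centroid.get(c, 0.0) + weight / len(words)
    return centroid

relatedness = cosine(concept_vectors["computing"], concept_vectors["sensor"])
```

As expected, "computing" comes out far more related to "sensor" than to "coffee", since the first two share mass on the same concepts while the third does not.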
EasyESA provides the following functionalities:
- computing the semantic relatedness measure between two words or documents;
- retrieving the concept vector of a word or document;
- retrieving the concept vector overlap between two words, together with the context windows of each word for every concept in the overlap.
The EasyESA package includes a setup script for Linux.
The setup script will perform the following steps:
The setup can be done in three ways, depending on the user's needs and on the available memory and storage:
$ ./run.sh <destination dir>
Simply execute the setup_all.sh script:
$ ./setup_all.sh <destination dir> <number of preprocessing threads>
Step 4 will take about 3 days to complete on a modern computer (quad-core Intel i7) and use about 200 GB of storage space for the early 2013 Wikipedia dump. Step 5 will take about 4 days to complete and use about 30 GB on the same hardware and dump.
If you already have a Wikipedia dump and wish to use it, comment out line 5 of setup_all.sh and place your enwiki-???-pages-articles.xml.bz2, renamed to enwiki-latest-pages-articles.xml.bz2, in the destination directory. The setup script will then skip only the download step (step 1).
If you already have a preprocessed Wikipedia dump (Zemanta format), place all the preprocessed .xml files in the destination directory and execute the setup_preprocessed.sh script:
$ ./setup_preprocessed.sh <destination dir>
Steps 1, 3, and 4 will be skipped.
The EasyESA service can be used online from
The service parameters are:
Query for the semantic relatedness measure between the words "computing" and "sensor".
Query for the concept vector of the word "coffee", truncated to a maximum of 50 concepts.
Query for the concept vector overlap between the words "computing" and "sensor", and the context windows of both words for each concept in the overlap.
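The three queries above can be issued as plain HTTP GET requests with the task and its arguments as URL parameters, the response being JSON. The sketch below only builds the query URLs; the base URL, task names, and parameter names (`esaservice`, `task`, `term1`, `term2`, `source`, `limit`) are assumptions for illustration, so check your deployment for the actual endpoint and parameters:

```python
from urllib.parse import urlencode

# Hypothetical base URL of a local EasyESA deployment.
BASE = "http://localhost:8890/esaservice"

def build_query(task, **params):
    # Encode the task name and its arguments as URL parameters.
    return BASE + "?" + urlencode({"task": task, **params})

# Semantic relatedness measure between "computing" and "sensor".
relatedness_url = build_query("esa", term1="computing", term2="sensor")

# Concept vector of "coffee", truncated to 50 concepts.
vector_url = build_query("vector", source="coffee", limit=50)

# Concept vector overlap and context windows for "computing" and "sensor".
overlap_url = build_query("context", term1="computing", term2="sensor")
```

The URLs can then be fetched with any HTTP client and the JSON responses parsed as usual.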
Simple semantic search (link):
Add data items (e.g. apple, juice, einstein, theory, speed of light, ...) and do a keyword search (e.g. relativity). Video: avi ogv.
Simple word sense disambiguation (link):
Add a sentence (e.g. the power grid went down), then select a word to get its ranked senses (e.g. power). Video: avi ogv.
EasyESA is distributed under the GPL.
An EasyESA benchmark is available here.
Please refer to the publication below if you are using EasyESA in your work:
Danilo Carvalho, Çağatay Çallı, André Freitas, Edward Curry, EasyESA: A Low-effort Infrastructure for Explicit Semantic Analysis, In Proceedings of the 13th International Semantic Web Conference (ISWC), Riva del Garda, 2014. (Demonstration Paper in Proceedings) (pdf).