EasyESA


Easy Semantic Approximation with Explicit Semantic Analysis


1. Overview

EasyESA is an implementation of Explicit Semantic Analysis (ESA) based on the Wikiprep-ESA code by Çağatay Çallı (https://github.com/faraday/wikiprep-esa). It runs as a JSON web service that can be queried for semantic relatedness measures, concept vectors, or query explanations (context windows).

This manual describes the functionality, setup, and usage of the EasyESA package.

2. Explicit Semantic Analysis

Explicit Semantic Analysis (ESA) is a text representation technique that uses Wikipedia as a commonsense knowledge base. Each word is associated with the Wikipedia articles (concepts) in which it occurs, weighted by TF-IDF scoring, so a word can be represented as a vector of its association scores over all concepts. The semantic relatedness between any two words can then be measured as the cosine similarity of their concept vectors. A document containing a string of words is represented as the centroid of the vectors representing its words.
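The vector operations described above can be sketched as follows. This is a toy illustration of the math only, with made-up concept scores; real ESA vectors span millions of Wikipedia concepts:

```python
from math import sqrt

def cosine(u, v):
    """Cosine similarity between two sparse concept vectors,
    represented as dicts mapping concept -> TF-IDF score."""
    dot = sum(w * v[c] for c, w in u.items() if c in v)
    norm_u = sqrt(sum(w * w for w in u.values()))
    norm_v = sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def centroid(vectors):
    """Represent a document as the centroid of its word vectors."""
    acc = {}
    for vec in vectors:
        for c, w in vec.items():
            acc[c] = acc.get(c, 0.0) + w
    n = len(vectors)
    return {c: w / n for c, w in acc.items()}

# Toy concept vectors (concept -> TF-IDF score); scores are invented.
coffee = {"Coffee": 0.9, "Caffeine": 0.6, "Espresso": 0.4}
tea    = {"Tea": 0.9, "Caffeine": 0.5, "Coffee": 0.2}
print(round(cosine(coffee, tea), 3))
```

Because the scores are non-negative TF-IDF values, the cosine similarity falls in the [0,1] interval, matching the relatedness measure described below.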

For more information on ESA, please refer to the paper by Evgeniy Gabrilovich and Shaul Markovitch: "Wikipedia-based semantic interpretation for natural language processing" (http://www.jair.org/media/2669/live-2669-4346-jair.pdf).

3. EasyESA

EasyESA provides the following functionalities:

Semantic relatedness measure
Given two terms, returns the semantic relatedness: a real number in the [0,1] interval representing how semantically close the terms are. The more related the terms, the higher the value returned.
Concept vector
Given a term, returns the concept vector: a list of Wikipedia article titles (concepts) with the associated score for the term.
Query explanation
Given two terms, returns the concept vector overlapping between them and the "context windows" for both terms on each overlapping concept. A context window for a given pair (term, concept) is a short segment from the Wikipedia article represented by the concept, containing the term.
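A context window can be sketched as a fixed-size span of words around each occurrence of a term in an article's text. The window radius and tokenization below are assumptions for illustration, not EasyESA's exact implementation:

```python
def context_windows(term, article_text, radius=5):
    """Return short word spans around each occurrence of `term`
    in an article's text (a sketch of a 'context window')."""
    words = article_text.split()
    windows = []
    for i, w in enumerate(words):
        if w.lower().strip(".,;:()") == term.lower():
            lo = max(0, i - radius)
            hi = min(len(words), i + radius + 1)
            windows.append(" ".join(words[lo:hi]))
    return windows

text = ("A sensor is a device that detects events or changes in its "
        "environment and sends the information to other electronics.")
print(context_windows("sensor", text, radius=3))
```

In the explain task, such windows are extracted from each Wikipedia article in the overlap of the two terms' concept vectors.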

EasyESA was developed as an improvement over Wikiprep-ESA.
4. Downloads

MongoDB (must be installed separately)
EasyESA binaries: easyEsa
Source code: easyEsa_src
Setup script: setup_all.sh
Database and indexes: English Wikipedia 2013 (Index) or English Wikipedia 2006.

5. Installation

The EasyESA package includes a setup script for Linux.

The setup script will perform the following steps:

  1. Download the latest Wikipedia dump.
  2. Download and install all the dependencies.
  3. Split the Wikipedia dump, using multiple threads.
  4. Preprocess the dump using Wikiprep (Zemanta's version) (http://www.tablix.org/~avian/git/wikiprep.git).
  5. Generate the ESA terms and concept vectors from the Wikipedia preprocessed dump.
  6. Generate the database and indexes.
  7. Start the EasyESA services.

The setup can be done in four ways, depending on the user's needs and available memory/storage:

5.1. Simple run (Recommended)

You can download the EasyESA database and indexes for English Wikipedia 2013 (Index) or English Wikipedia 2006.

Simple procedure:

1. Extract easy_esa.tar.gz into INSTALL_DIR/.
2. Extract data*.tar.gz into INSTALL_DIR/mongodb/data.
3. Extract index*.tar.gz into INSTALL_DIR/index.
4. Start mongodb: mongod --dbpath mongodb/data/db
5. Start the EasyESA service: java -jar easy_esa.jar 8890 INSTALL_DIR/index &

On Linux, you can execute the run.sh script to perform steps 4 and 5:
  $./run.sh <destination dir>

5.2. From setup script only (full setup)

Simply execute the setup_all.sh script:

  $./setup_all.sh <destination dir> <number of preprocessing threads>

where <destination dir> is the directory in which EasyESA will be installed and <number of preprocessing threads> is the number of threads used to split and preprocess the Wikipedia dump.

Step 4 will take about 3 days to complete on a modern computer (quad-core Core i7) and use about 200 GB of storage for the early-2013 Wikipedia dump. Step 5 will take about 4 days to complete and use about 30 GB with the same hardware and dump.

5.3. From setup script with previously downloaded Wikipedia dump

If you already have a Wikipedia dump and wish to use it, comment out line 5 of setup_all.sh and place your enwiki-???-pages-articles.xml.bz2, renamed to enwiki-latest-pages-articles.xml.bz2, in the destination directory. The setup script will then skip only the download step (step 1).

5.4. From setup script with preprocessed Wikipedia database

If you already have a preprocessed Wikipedia dump (Zemanta format), place all the preprocessed .xml files in the destination directory and execute the setup_preprocessed.sh script:

  $./setup_preprocessed.sh <destination dir>

Steps 1, 3, and 4 will be skipped.

6. Usage & Online Service

The EasyESA service can be queried online at

  http://vmdeb20.deri.ie:8890/esaservice 

or locally

  http://localhost:8890/esaservice

The service parameters are:

task
The query function to be called: esa (semantic relatedness measure), vector (concept vector), or explain (query explanation).
term1, term2
The pair of terms to compare (tasks esa and explain).
source
The term whose concept vector is requested (task vector).
limit
The maximum number of concepts returned (tasks vector and explain).

6.1. Examples

6.1.1. Semantic relatedness measure query

  http://vmdeb20.deri.ie:8890/esaservice?task=esa&term1=computing&term2=sensor

Query for the semantic relatedness measure between the words computing and sensor.

6.1.2. Concept vector query

  http://vmdeb20.deri.ie:8890/esaservice?task=vector&source=coffee&limit=50

Query for the concept vector of the word coffee with maximum length of 50 concepts.

6.1.3. Explain query

  http://vmdeb20.deri.ie:8890/esaservice?task=explain&term1=computing&term2=sensor&limit=10000

Query for the concept vector overlapping between the words computing and sensor, and the context windows of both words for each concept in the overlap.
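For programmatic access, the queries above can be built and sent with any HTTP client. A minimal Python sketch, using only the standard library; the host and port are taken from the examples above, and an actual request of course requires a running service:

```python
from urllib.parse import urlencode
from urllib.request import urlopen
import json

ENDPOINT = "http://localhost:8890/esaservice"  # or the vmdeb20.deri.ie host

def build_query(task, **params):
    """Build an EasyESA query URL for a given task (esa, vector, explain)."""
    return ENDPOINT + "?" + urlencode({"task": task, **params})

def query(task, **params):
    """Send the query and decode the JSON response (needs a running service)."""
    with urlopen(build_query(task, **params)) as resp:
        return json.load(resp)

print(build_query("esa", term1="computing", term2="sensor"))
```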

7. Demonstrations

Simple semantic search (link): Add data items (e.g. apple, juice, einstein, theory, speed of light, ...) and do a keyword search (e.g. relativity). Video: avi ogv.

Simple word sense disambiguation (link): Add a sentence (e.g. the power grid went down), select a word to get the ranked senses (e.g. power). Video: avi ogv.

8. License

EasyESA is distributed under the GPL.

9. Benchmark

An EasyESA benchmark is available here.

10. Publication

Please cite the publication below if you use EasyESA in your experiments.

Danilo Carvalho, Çağatay Çallı, André Freitas, Edward Curry, EasyESA: A Low-effort Infrastructure for Explicit Semantic Analysis, In Proceedings of the 13th International Semantic Web Conference (ISWC), Riva del Garda, 2014. (Demonstration Paper in Proceedings) (pdf).

11. Contact

Danilo Carvalho, Çağatay Çallı, André Freitas, Edward Curry.

Insight Centre for Data Analytics
Digital Enterprise Research Institute (DERI)
National University of Ireland, Galway

contact: andre - dot - freitas | at | deri [dot] org