How does it work?

Natural Language Queries over Linked Data

Treo (Irish): direction, path.

What is it?

Treo is a natural language based semantic search engine for Linked Data. The main goal behind Treo is to abstract data consumers from the representation of the datasets, allowing expressive natural language queries over Linked Datasets.

Treo was initially developed and is currently being improved at the Digital Enterprise Research Institute (DERI), Ireland.

The Problem

Linked Data brings the vision of exposing and interlinking datasets on the Web by using Semantic Web standards. This vision creates the potential of adding a structured data information layer on the Web which can be consumed by both humans and applications. Consuming Linked Data today, however, can be challenging. Linked Data brings a scenario where users may need to query/search over potentially thousands of highly heterogeneous datasets.

The traditional approach for querying databases, based on structured queries such as SPARQL (Figure 1), fails at this scale, since it is infeasible for data consumers to be aware of the data model of every potential dataset of interest.

Figure 1: The traditional approach for querying databases assumes that users are aware of the data model behind each dataset. Users have to look at and understand the data model of each dataset in order to build structured queries over it.
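To make the data-model dependency concrete, the following minimal Python sketch builds a SPARQL query for the question used as the running example on this page (‘From which university did the wife of Barack Obama graduate?’). The helper `build_path_query` is hypothetical, and the predicate URIs `http://dbpedia.org/ontology/spouse` and `http://dbpedia.org/ontology/almaMater` are assumptions about the DBpedia vocabulary; the point is only that the user must already know such terms before a structured query can be written.

```python
def build_path_query(pivot_uri, predicate_uris):
    """Build a SPARQL query that follows a fixed chain of predicates from a pivot.

    Hypothetical helper for illustration: every predicate URI must be known
    in advance, which is exactly the burden described in the text.
    """
    if not predicate_uris:
        raise ValueError("at least one predicate URI is required")
    lines = ["SELECT ?answer WHERE {"]
    subject = f"<{pivot_uri}>"
    for i, predicate in enumerate(predicate_uris):
        is_last = i == len(predicate_uris) - 1
        obj = "?answer" if is_last else f"?v{i}"
        lines.append(f"  {subject} <{predicate}> {obj} .")
        subject = obj  # the object of this triple is the subject of the next
    lines.append("}")
    return "\n".join(lines)


# Assumed vocabulary: the user has to know "spouse" and "almaMater" exist.
query = build_path_query(
    "http://dbpedia.org/resource/Barack_Obama",
    [
        "http://dbpedia.org/ontology/spouse",
        "http://dbpedia.org/ontology/almaMater",
    ],
)
print(query)
```

A user who thinks in terms of "wife" and "graduate" has no way to guess these predicate names without first inspecting the dataset.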

An ideal solution would abstract users from the datasets: users would express their information needs in the most intuitive way (through natural language queries, for example), and these needs would be semantically matched against the structures of existing datasets (Figure 2).

Figure 2: The ideal query mechanism for databases would abstract users from the representation of the data. Users could freely express their information needs while the query/search engine matches the query with dataset elements.

The critical problem is that the structure and terms used in users’ queries typically differ from the representation of the information in the datasets (Figure 3). To address this problem, a query/search mechanism needs a robust semantic matching approach. Providing a natural language query approach with robust semantic matching is the focus of this work.

Figure 3: Structure and vocabulary differences between user natural language queries and information present in the dataset.

Benefits for Web Users: Direct Answers

Current search engines available on the Web are not able to provide direct answers to users. As a consequence, search becomes a time-consuming and error-prone process. If we submit our example natural language query to Google, it returns 25,300,000 potential pages containing the terms present in the query (Figure 4). To find the answer, users need to navigate through some of the links in the list and read through the web pages.

Figure 4: Google returns 25,300,000 pages for the query ‘From which university did the wife of Barack Obama graduate’.

Search engines such as WolframAlpha target direct answers; however, WolframAlpha does not answer our example query (Figure 5).

Figure 5: WolframAlpha result for the example query.

Linked Datasets already available on the Web contain valuable information that can be used to provide direct answers to users’ information needs (Figure 6).

Figure 6: The Linked Data Cloud as of September 2011.

Using the information on the Linked Data Web, Treo is able to answer the query. Of the three returned answers, two contain the final answer and one provides related information (Figure 7).

Figure 7: Treo results for the example query.

Treo works as a semantic best-effort solution: instead of the crisp results expected from structured database queries or question answering (QA) systems, Treo returns a concise list of semantically related results.

The Solution: How does Treo work?

Treo is a natural language query mechanism for Linked Data that focuses on the semantic matching between user queries and Linked Datasets. Although it focuses on RDF data, the principles behind the Treo approach can be transferred to other databases, generic labeled graphs and semantic representations of unstructured text.

Treo’s query processing approach combines entity search, spreading activation search, and distributional semantic relatedness as the key elements to address the semantic matching problem.

The distributional semantic model is a critical element to match query terms to dataset terms, using semantic information which is embedded in large textual resources available on the Web such as Wikipedia. This allows an automatic and robust semantic matching technique which addresses the limitations of previous WordNet-based solutions.

Treo’s query processing approach works through three major steps:

  1. Entity Search and Pivot Entity Determination: Consists of determining the key entities in the user query (what is the query about?) and mapping them to entities in the datasets. The mapping from the natural language terms representing the entities to the URIs representing these entities in the datasets is done through an entity search step. The URIs define the pivot entities in the datasets, which are the entry points for the semantic search process. In the example query, the term Barack Obama is mapped to the URI http://dbpedia.org/resource/Barack_Obama in the dataset (step 1 in Figure 8).

  2. Query Syntactic Analysis: The user’s natural language query is pre-processed into a partial ordered dependency structure (PODS) (step 2, Figure 8), a format which is closer to the triple-like (subject, predicate and object) structure of RDF. The construction of the PODS requires the preceding entity recognition step. The PODS is built by taking into account the dependency structure of the query, the position of the key entity and a set of transformation rules. The PODS for the example query ‘From which university did the wife of Barack Obama graduate?’ is shown as gray nodes in Figure 8.

  3. Semantic Matching (Spreading Activation using Distributional Semantic Relatedness): Taking as inputs the pivot entity URIs and the PODS query representation, the semantic matching process starts by fetching all the relations associated with the top-ranked pivot entity. Starting from the pivot entity, the labels of each relation associated with the pivot node have their semantic relatedness measured against the next term in the PODS representation of the query. For the example entity Barack Obama, the next query term wife is compared against all predicates, range types and objects associated with each predicate (e.g. spouse, child, religion). The relations with the highest relatedness measures define the neighboring nodes which will be explored in the search process. The search algorithm then navigates to the nodes with high relatedness values (in the example, Michelle Obama), where the same process happens for the next query term (graduate). The search process continues until the end of the query is reached (step 4, Figure 8), working as a spreading activation search over the RDF graph, where the activation function (i.e. the threshold that determines whether a node is explored further) is defined by a semantic relatedness measure.

The query processing approach returns a set of triple paths, each of which is a connected set of triples defined by the spreading activation search path, starting from the pivot entities over the RDF graph.
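The steps above can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the toy graph, the label index, and the hand-written context vectors that stand in for a real distributional model built from a corpus such as Wikipedia. The PODS is supplied by hand as a list of terms, and only predicate labels are compared (the description above also considers range types and objects). It is not Treo’s implementation.

```python
import math

# Toy RDF-like graph (hypothetical data): subject -> [(predicate label, object)].
GRAPH = {
    "Barack_Obama": [
        ("spouse", "Michelle_Obama"),
        ("child", "Malia_Obama"),
        ("religion", "Christianity"),
    ],
    "Michelle_Obama": [
        ("alma mater", "Princeton_University"),
        ("alma mater", "Harvard_Law_School"),
        ("birth place", "Chicago"),
    ],
}

# Hypothetical label index used for the entity search step.
LABELS = {"barack obama": "Barack_Obama", "michelle obama": "Michelle_Obama"}

# Hand-made stand-in for distributional vectors (not derived from a corpus).
CONTEXTS = {
    "wife": (0.9, 0.5, 0.0, 0.0, 0.0),
    "spouse": (1.0, 0.4, 0.0, 0.0, 0.0),
    "child": (0.1, 0.9, 0.0, 0.0, 0.0),
    "religion": (0.0, 0.0, 0.0, 1.0, 0.0),
    "graduate": (0.0, 0.0, 1.0, 0.0, 0.0),
    "alma mater": (0.0, 0.0, 0.9, 0.0, 0.2),
    "birth place": (0.0, 0.3, 0.0, 0.0, 0.9),
}


def relatedness(a, b):
    """Cosine similarity between the context vectors of two terms (0.0 if unknown)."""
    u, v = CONTEXTS.get(a), CONTEXTS.get(b)
    if u is None or v is None:
        return 0.0
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def find_pivot(query):
    """Step 1 (simplified): return the first indexed entity whose label occurs in the query."""
    text = query.lower()
    for label, entity in LABELS.items():
        if label in text:
            return entity
    return None


def spreading_activation(graph, pivot, query_terms, threshold=0.7):
    """Step 3 (simplified): follow relations whose label is related to the next query term."""
    frontier = [(pivot, [])]  # (current node, triple path that led to it)
    for term in query_terms:
        next_frontier = []
        for node, path in frontier:
            for predicate, obj in graph.get(node, []):
                if relatedness(term, predicate) >= threshold:
                    next_frontier.append((obj, path + [(node, predicate, obj)]))
        frontier = next_frontier
    return [path for _, path in frontier]


pivot = find_pivot("From which university did the wife of Barack Obama graduate?")
for triple_path in spreading_activation(GRAPH, pivot, ["wife", "graduate"]):
    print(triple_path)
```

With these toy vectors, "wife" activates only the spouse relation and "graduate" activates only the alma mater relations, so the sketch yields two triple paths, ending at Princeton_University and Harvard_Law_School. The threshold of 0.7 is an arbitrary choice for the toy data.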

Figure 8: Steps of the Treo query processing approach.

Additional information on the Treo approach can be found in [1], [5].

T-Space: Treo’s Information Retrieval Model

The steps above show the rationale of the Treo query processing engine. In order to define a scalable solution, a Vector Space Model was proposed based on the Treo principles.

The construction of a semantic space based on the principles behind Treo defines a search/index generalization which can be applied to different problem spaces where data is represented as labeled data graphs, including graph databases and semantic-level representations of unstructured text.

The proposed approach embeds an RDF graph into a vector space, adding geometry to the graph structure. The vector space is built from a distributional model, in which the coordinate reference frame is defined by interpretation vectors that map the statistical distribution of terms in the reference corpora. This distributional coordinate system supports a representation of the RDF graph elements that allows flexible semantic search over them (the differential aspect of distributional semantics). The distributional model enriches the original semantics of the graph’s topological relations and labels. Because it is collected from unstructured data, it provides a supporting commonsense semantic reference frame which can be easily built from available text. The use of an external distributional data source to provide this semantic reference frame is a key difference between the T-Space and more traditional VSM approaches. Figure 9 shows the steps of the Treo query processing approach mapped as T-Space search operations.

Figure 9: Steps of the Treo query processing approach over the T-Space.
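The idea of a coordinate frame defined by a reference corpus can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the T-Space implementation: the four-document corpus is invented, each coordinate is a raw term count in one document (a real model would use a large corpus and a weighting scheme), and the function names `interpretation_vector`, `embed` and `search` are hypothetical.

```python
import math

# Invented reference corpus: each document defines one coordinate axis.
CORPUS = {
    "Marriage": "a wife or husband is a spouse in a marriage that may raise a child",
    "Family": "a child is the son or daughter of a parent in a family",
    "Education": "students graduate from a university that becomes their alma mater",
    "Religion": "religion is a system of faith and worship",
}


def interpretation_vector(word, corpus):
    """Coordinate i is the number of times `word` occurs in document i."""
    return [text.split().count(word) for text in corpus.values()]


def embed(label, corpus):
    """Embed a (possibly multi-word) label as the sum of its word vectors."""
    vectors = [interpretation_vector(w, corpus) for w in label.lower().split()]
    return [sum(column) for column in zip(*vectors)]


def cosine(u, v):
    """Cosine similarity, defined as 0.0 when either vector has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def search(term, index, corpus):
    """Rank indexed graph labels by their similarity to the query term."""
    query_vector = embed(term, corpus)
    scored = [(label, cosine(query_vector, vector)) for label, vector in index.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Index the predicate labels of a toy graph in the distributional space.
index = {label: embed(label, CORPUS) for label in ["spouse", "child", "religion", "alma mater"]}

print(search("wife", index, CORPUS)[0][0])      # best match for "wife"
print(search("graduate", index, CORPUS)[0][0])  # best match for "graduate"
```

Because the graph labels and the query terms are placed in the same corpus-derived space, "wife" ranks closest to spouse and "graduate" closest to alma mater even though the strings share no characters, which is the kind of vocabulary gap the text describes.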

More Info