Natural Language Queries over Linked Data
Treo (Irish): direction, path.
What is it?
Treo is a natural language semantic search engine for Linked Data. The main goal behind Treo is to abstract data consumers from the representation of the datasets, allowing expressive natural language queries over Linked Datasets.
Treo was initially developed and is currently being improved at the Digital Enterprise Research Institute (DERI), Ireland.
Linked Data brings the vision of exposing and interlinking datasets on the Web by using Semantic Web standards. This vision creates the potential of adding a structured data information layer on the Web which can be consumed by both humans and applications. Consuming Linked Data today, however, can be challenging. Linked Data brings a scenario where users may need to query/search over potentially thousands of highly heterogeneous datasets.
The traditional approach to querying databases, based on structured queries such as SPARQL (Figure 1), fails at this scale, since it is infeasible for data consumers to be aware of the data model of every potential dataset of interest.
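As an illustration, the sketch below embeds a SPARQL query for a question such as "From which university did the wife of Barack Obama graduate?" in a short Python snippet. The prefixes and the property names dbo:spouse and dbo:almaMater are assumptions about the DBpedia vocabulary, and the helper vocabulary_terms is hypothetical. The point is only that every one of these terms has to be known before the query can be written.

```python
import re
from typing import List

# Hypothetical structured query. The property names are assumptions about
# the dataset vocabulary that the user would have to know in advance.
EXAMPLE_SPARQL = """
PREFIX dbr: <http://dbpedia.org/resource/>
PREFIX dbo: <http://dbpedia.org/ontology/>

SELECT ?university WHERE {
  dbr:Barack_Obama dbo:spouse ?wife .
  ?wife dbo:almaMater ?university .
}
"""


def vocabulary_terms(query: str) -> List[str]:
    """List the prefixed dataset terms (entities and properties) used in a query."""
    return re.findall(r"\b(?:dbr|dbo):\w+", query)


if __name__ == "__main__":
    # Each printed term is dataset-specific knowledge the user needs up front.
    print(vocabulary_terms(EXAMPLE_SPARQL))
```

Even this two-hop question depends on three dataset-specific identifiers, and a different dataset may name the same relations differently.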
An ideal solution would abstract users from the datasets: users express their information needs in the most intuitive way (through natural language queries, for example), and these needs are semantically matched against the structures of existing datasets (Figure 2).
The critical problem is that the structure and terms used in users’ queries typically differ from the representation of the information in the datasets (Figure 3). To address this problem, a query/search mechanism needs to rely on a robust semantic matching approach. Providing a natural language query approach with robust semantic matching is the focus of this work.
Benefits for Web Users: Direct Answers
Current search engines available on the Web are not able to provide direct answers to users. As a consequence, search becomes a time-consuming and error-prone process. If we submit our example natural language query to Google, it returns 25,300,000 potential pages containing the terms present in the query (Figure 4). To find the answer, users need to navigate through some of the links in the list and read through the web pages.
Search engines such as WolframAlpha target direct answers; however, WolframAlpha does not address our example query (Figure 5).
There are already Linked Datasets available on the Web that contain valuable information, which can be used to provide direct answers to users’ information needs (Figure 6).
Using the information on the Linked Data Web, Treo is able to answer the query. Of the three returned answers, two contain the final information and one details related information (Figure 7).
Treo works as a semantic best-effort solution: instead of expecting crisp results, as in structured database queries or in question answering (QA) systems, Treo returns a concise list of semantically related results.
The Solution: How does Treo work?
Treo is a natural language query mechanism for Linked Data that focuses on the semantic matching between user queries and Linked Datasets. Although it focuses on RDF data, the principles behind the Treo approach can be transferred to other databases, generic labeled graphs and semantic representations of unstructured texts.
Treo’s query processing approach combines entity search, spreading activation search, and distributional semantic relatedness as the key elements to address the semantic matching problem.
The distributional semantic model is a critical element to match query terms to dataset terms, using semantic information which is embedded in large textual resources available on the Web such as Wikipedia. This allows an automatic and robust semantic matching technique which addresses the limitations of previous WordNet-based solutions.
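The idea of distributional relatedness can be illustrated with a minimal sketch. It assumes a toy reference corpus of three short documents standing in for Wikipedia articles, represents each term by its occurrence counts across those documents, and scores two terms by the cosine of their vectors. The corpus, the function names and the choice of cosine are illustrative assumptions, not Treo's actual relatedness measure.

```python
import math
from collections import Counter

# Toy reference corpus (assumption): each entry stands in for one article
# of a large textual resource such as Wikipedia.
CORPUS = {
    "marriage": "wife husband spouse married wedding couple",
    "family": "wife husband child spouse parent",
    "faith": "religion church belief worship",
}


def term_vector(term):
    """Represent a term by how often it occurs in each reference document."""
    return {doc: Counter(text.split())[term] for doc, text in CORPUS.items()}


def relatedness(term_a, term_b):
    """Cosine similarity between the distributional vectors of two terms."""
    vec_a, vec_b = term_vector(term_a), term_vector(term_b)
    dot = sum(vec_a[doc] * vec_b[doc] for doc in CORPUS)
    norm_a = math.sqrt(sum(x * x for x in vec_a.values()))
    norm_b = math.sqrt(sum(x * x for x in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # a term that never occurs is related to nothing
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    for candidate in ("spouse", "child", "religion"):
        print(candidate, round(relatedness("wife", candidate), 3))
```

In this toy corpus the query term wife scores highest against spouse, lower against child and zero against religion, although none of these labels shares a string with the query term.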
Treo’s query processing approach works in three major steps:
- 1. Entity Search and Pivot Entity Determination: Consists of determining the key entities in the user query (what is the query about?) and mapping them to entities in the datasets. The mapping from the natural language terms representing the entities to the URIs representing these entities in the datasets is done through an entity search step. The URIs define the pivot entities in the datasets, which are the entry points for the semantic search process. In the example query, the term Barack Obama is mapped to the URI http://dbpedia.org/resource/Barack_Obama in the dataset (step 1 in Figure 8).
- 2. Query Syntactic Analysis: The user natural language query is pre-processed into a partial ordered dependency structure (PODS) (step 2, Figure 8), a format which is closer to the triple-like (subject, predicate and object) structure of RDF. The construction of the PODS requires the previous entity recognition step. The PODS is built by taking into account the dependency structure of the query, the position of the key entity and a set of transformation rules. The PODS for the example query “From which university did the wife of Barack Obama graduate?” is shown as gray nodes in Figure 8.
- 3. Semantic Matching (Spreading Activation using Distributional Semantic Relatedness): Taking as inputs the pivot entity URIs and the PODS query representation, the semantic matching process starts by fetching all the relations associated with the top-ranked pivot entity. Starting from the pivot entity, the labels of each relation associated with the pivot node have their semantic relatedness measured against the next term in the PODS representation of the query. For the example entity Barack Obama, the next query term wife is compared against all predicates/range types/objects associated with each predicate (e.g. spouse, child, religion, etc.). The relations with the highest relatedness measures define the neighboring nodes which will be explored in the search process. The search algorithm then navigates to the nodes with high relatedness values (in the example, Michelle Obama), where the same process happens for the next query term (graduate). The search process continues until the end of the query is reached (step 4, Figure 8), working as a spreading activation search over the RDF graph, where the activation function (i.e. the threshold that determines which nodes are explored further) is defined by a semantic relatedness measure.
The query processing approach returns a set of triple paths. Each triple path is a connected set of triples defined by the spreading activation search path, starting from the pivot entities over the RDF graph.
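The search loop described in the steps above can be sketched as follows. The graph is a hand-written toy fragment, the relatedness scores are made-up constants standing in for a distributional measure, and the function names and the 0.5 threshold are assumptions. For brevity the sketch compares query terms against predicate labels only, not against range types or objects.

```python
# Toy RDF-like graph (assumption): subject -> list of (predicate, object).
GRAPH = {
    "Barack_Obama": [
        ("spouse", "Michelle_Obama"),
        ("child", "Malia_Obama"),
        ("religion", "Christianity"),
    ],
    "Michelle_Obama": [
        ("almaMater", "Princeton_University"),
        ("almaMater", "Harvard_Law_School"),
        ("birthPlace", "Chicago"),
    ],
}

# Made-up relatedness scores standing in for a distributional measure.
RELATEDNESS = {
    ("wife", "spouse"): 0.9,
    ("wife", "child"): 0.4,
    ("wife", "religion"): 0.05,
    ("graduate", "almaMater"): 0.8,
    ("graduate", "birthPlace"): 0.1,
}


def relatedness(query_term, predicate):
    """Look up the toy relatedness score; unknown pairs score 0."""
    return RELATEDNESS.get((query_term, predicate), 0.0)


def spreading_activation(pivot, query_terms, threshold=0.5):
    """Follow, term by term, the relations whose relatedness passes the threshold.

    Returns the triple paths that consumed every query term.
    """
    frontier = [(pivot, [])]  # (current node, triples walked so far)
    for term in query_terms:
        next_frontier = []
        for node, path in frontier:
            for predicate, obj in GRAPH.get(node, []):
                if relatedness(term, predicate) >= threshold:
                    next_frontier.append((obj, path + [(node, predicate, obj)]))
        frontier = next_frontier
    return [path for _, path in frontier]


if __name__ == "__main__":
    for path in spreading_activation("Barack_Obama", ["wife", "graduate"]):
        print(path)
```

Lowering the threshold widens the search (here, child would also be followed for wife), which is the trade-off between recall and the conciseness of the returned list.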
T-Space: Treo’s Information Retrieval Model
The steps above show the rationale of the Treo query processing engine. In order to define a scalable solution, a Vector Space Model (VSM), the T-Space, was proposed based on the Treo principles.
The construction of this semantic space defines a search/index generalization which can be applied to different problem spaces where data is represented as labeled data graphs, including graph databases and semantic-level representations of unstructured text.
The approach proposed in this work embeds an RDF graph into a vector space, adding geometry to the graph structure. The vector space is built from a distributional model, where the coordinate reference frame is defined by interpretation vectors mapping the statistical distribution of terms in the reference corpora. This distributional coordinate system supports a representation of the RDF graph elements which allows a flexible semantic search of these elements (differential aspect of distributional semantics). The distributional model enriches the original semantics of the topological relations and labels of the graph. The distributional model, collected from unstructured data, provides a supporting commonsense semantic reference frame which can be easily built from available text. The use of an external distributional data source which provides this semantic reference frame is a key difference between the T-Space and more traditional VSM approaches. Figure 8 shows the steps of the Treo query processing approach mapped as T-Space search operations.
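A minimal sketch of this idea is given below. It assumes a three-dimensional distributional reference frame with hand-picked coordinates for a few labels, places each triple at the vector of its predicate label and ranks the triples of a given entity by cosine similarity to a query term. The dimensions, vectors and function names are illustrative assumptions and do not reproduce the actual T-Space construction.

```python
import math

# Hypothetical distributional reference frame: each dimension stands for one
# concept derived from a reference corpus.
DIMENSIONS = ("marriage", "family", "education")

# Hand-picked interpretation vectors (assumption) for query and graph labels.
TERM_VECTORS = {
    "spouse": (0.9, 0.6, 0.0),
    "child": (0.2, 0.9, 0.1),
    "almaMater": (0.0, 0.1, 0.9),
    "wife": (0.8, 0.7, 0.0),
    "graduate": (0.0, 0.0, 1.0),
}

TRIPLES = [
    ("Barack_Obama", "spouse", "Michelle_Obama"),
    ("Barack_Obama", "child", "Malia_Obama"),
    ("Michelle_Obama", "almaMater", "Princeton_University"),
    ("Michelle_Obama", "almaMater", "Harvard_Law_School"),
]


def cosine(u, v):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def build_index(triples):
    """Embed each triple at the distributional vector of its predicate label."""
    return [(TERM_VECTORS[predicate], (subject, predicate, obj))
            for subject, predicate, obj in triples]


def search(index, query_term, subject, k=1):
    """Rank the triples of one subject by similarity to the query term."""
    query_vec = TERM_VECTORS[query_term]
    scored = [(cosine(query_vec, vec), triple)
              for vec, triple in index if triple[0] == subject]
    scored.sort(key=lambda item: item[0], reverse=True)
    return [triple for _, triple in scored[:k]]


if __name__ == "__main__":
    index = build_index(TRIPLES)
    print(search(index, "wife", "Barack_Obama"))
    print(search(index, "graduate", "Michelle_Obama", k=2))
```

Chaining two such searches, first with wife from the pivot entity and then with graduate from the entity reached, mirrors the query processing steps as vector space operations.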