Research Projects

I like to think that I do systems research, since I spend most of my time writing code and debugging new systems. I see systems research as a really fun way to verify hypotheses and to test whether ideas actually work in practice.

On this page, I give a short overview of the projects that I have developed so far. Some of them are still active, others… not really. Please don’t hesitate to contact me if you would like to know more or, even better, if you would like to contribute to some of them.

WebPIE

I started this project with my master’s thesis and continued its development throughout my PhD. Initially I called it reasoning-hadoop, but later I renamed it WebPIE (Web Parallel Inference Engine). WebPIE is a forward-chaining reasoner that applies (most of) the RDFS and OWL Horst rules over very large collections of RDF data to materialise every possible conclusion. It relies on the Hadoop framework to distribute the computation, and in our largest experiments we showed how it can perform non-trivial reasoning over an input about three times the size of the entire Semantic Web, using only a cluster of moderate size (64 machines).
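
To give an idea of how a rule can be turned into a MapReduce job, consider rdfs9, which derives (s, rdf:type, D) from (s, rdf:type, C) and (C, rdfs:subClassOf, D). The sketch below is hypothetical code, not WebPIE’s actual implementation (among other optimisations, WebPIE keeps the small schema part in memory rather than shuffling it); it simply joins the two triple patterns on the class C:

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    // Derives (s, rdf:type, D) from (s, rdf:type, C) and (C, rdfs:subClassOf, D)
    // by joining the two patterns on C. Input: one triple per line, tab-separated.
    public class Rdfs9 {

      public static class JoinMapper
          extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable key, Text line, Context ctx)
            throws IOException, InterruptedException {
          String[] t = line.toString().split("\t");
          if (t.length != 3) return;
          if ("rdf:type".equals(t[1])) {
            ctx.write(new Text(t[2]), new Text("I\t" + t[0]));  // instance of C
          } else if ("rdfs:subClassOf".equals(t[1])) {
            ctx.write(new Text(t[0]), new Text("S\t" + t[2]));  // superclass of C
          }
        }
      }

      public static class JoinReducer extends Reducer<Text, Text, Text, Text> {
        @Override
        protected void reduce(Text cls, Iterable<Text> tagged, Context ctx)
            throws IOException, InterruptedException {
          List<String> instances = new ArrayList<>();
          List<String> superclasses = new ArrayList<>();
          for (Text v : tagged) {
            String[] p = v.toString().split("\t", 2);
            if ("I".equals(p[0])) instances.add(p[1]);
            else superclasses.add(p[1]);
          }
          for (String s : instances)        // every pair yields a derived triple
            for (String d : superclasses)
              ctx.write(new Text(s), new Text("rdf:type\t" + d));
        }
      }
    }

Note that a job like this must be repeated until no new triples are derived, which is one of the reasons why naive MapReduce reasoning is expensive and why these optimisations matter.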

This work is described in my master’s thesis, in a number of scientific publications, and in part of my PhD thesis. All the code is available online, and here you can read a short tutorial on how to launch it on the Amazon EC2 cloud.

Ajira

Ajira is a project that I started after my PhD. Its goal was to investigate whether we could generalise the computation that we were performing during reasoning into a generic distributed data-processing engine. As a result, I developed (in collaboration with Ceriel Jacobs) a framework that is capable not only of generic batch processing (MapReduce style) but also of stream processing. A key feature of Ajira is that it can execute very complex workflows, allowing not only multiple map-reduce phases but also iterations and recursion. To the best of our knowledge, no other data-processing framework is as expressive as Ajira.
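
To give a flavour of what this extra expressiveness buys, the toy sketch below treats a job as a chain of actions and iterates the chain to a fixpoint, the kind of control flow that plain map-reduce cannot express within a single job. All names are hypothetical: this is a local, in-memory illustration, not Ajira’s API:

    import java.util.ArrayList;
    import java.util.List;
    import java.util.function.Function;

    // A toy, in-memory analogue of a chain of actions (hypothetical names,
    // not Ajira's API): each action transforms a list of tuples, and a chain
    // can be iterated until it stops producing new output.
    public class ChainDemo {

      interface Action<T> extends Function<List<T>, List<T>> {}

      // Run a chain once, piping the output of each action into the next.
      static <T> List<T> runChain(List<T> input, List<Action<T>> chain) {
        List<T> data = input;
        for (Action<T> a : chain) data = a.apply(data);
        return data;
      }

      // Iterate a chain to a fixpoint: the loop itself is part of the workflow.
      static <T> List<T> iterate(List<T> input, List<Action<T>> chain) {
        List<T> current = input;
        while (true) {
          List<T> next = runChain(current, chain);
          if (next.equals(current)) return current;
          current = next;
        }
      }

      public static void main(String[] args) {
        // Example: repeatedly halve even numbers until none are left.
        Action<Integer> halveEvens = xs -> {
          List<Integer> out = new ArrayList<>();
          for (int x : xs) out.add(x % 2 == 0 ? x / 2 : x);
          return out;
        };
        System.out.println(iterate(List.of(8, 3, 6), List.of(halveEvens)));
        // prints [1, 3, 3]
      }
    }

In a plain map-reduce setting, such a loop must be driven by an external program that submits one job per iteration; in Ajira, iteration and recursion can be expressed inside the workflow itself.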

Recently, we compared the performance of Ajira against other state-of-the-art approaches such as Hadoop, Storm, Esper, and Spark on some classic algorithms, and showed that Ajira outperforms them significantly. These experiments are described in a scientific publication that should be available very soon. Ajira is written in Java. All the code, documentation, and a tutorial are available here.

QueryPIE

QueryPIE is a parallel and distributed backward-chaining reasoner, designed to apply (most of) the RDFS and OWL 2 RL rules when a user queries a large RDF knowledge base. I started to work on this project during my PhD because I wanted to investigate the feasibility of approaches that do not require a full materialisation, which is a significant limitation of forward-chaining reasoners like WebPIE.

The main idea behind QueryPIE is to pre-compute only part of the materialisation, and to exploit this partial pre-materialisation to reduce the amount of reasoning performed at query time. This type of “hybrid” reasoning significantly reduces the cost of reasoning at the price of a small pre-computation. In our experiments, we showed that with this technique we can answer both atomic and SPARQL queries over billions of RDF triples in a few milliseconds in the best case, with a pre-computation that takes between a few seconds and less than an hour.
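
As a toy illustration of the idea (hypothetical code, far simpler than QueryPIE), suppose the rdfs:subClassOf hierarchy has been pre-materialised, i.e. its transitive closure is already stored. A query like “does s have type D?” then needs only a single backward-chaining step over rule rdfs9, rather than a recursive traversal of the class hierarchy:

    import java.util.HashMap;
    import java.util.Map;
    import java.util.Set;

    // Hypothetical sketch (far simpler than QueryPIE): backward-chaining over
    // rule rdfs9, with the rdfs:subClassOf closure pre-materialised.
    public class HybridReasoningDemo {

      // explicitly stored rdf:type triples: instance -> classes
      static Map<String, Set<String>> types = new HashMap<>();
      // PRE-MATERIALISED part: class -> all its (direct and indirect) superclasses
      static Map<String, Set<String>> subClassOfClosure = new HashMap<>();

      // Backward-chaining check of "s has type d": it holds if stored
      // explicitly, or if s has some type c whose closure contains d
      // (rdfs9: (s type c) and (c subClassOf d) imply (s type d)).
      static boolean hasType(String s, String d) {
        Set<String> direct = types.getOrDefault(s, Set.of());
        if (direct.contains(d)) return true;
        for (String c : direct)
          if (subClassOfClosure.getOrDefault(c, Set.of()).contains(d))
            return true;
        return false;
      }

      public static void main(String[] args) {
        types.put("alice", Set.of("Student"));
        // the closure already contains the derived link Student -> Agent
        subClassOfClosure.put("Student", Set.of("Person", "Agent"));
        System.out.println(hasType("alice", "Agent"));  // true
      }
    }

The schema part of a knowledge base is typically tiny compared to the instance data, so pre-materialising it is cheap, while it bounds how much reasoning is left to do at query time.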

Currently, this technique is described in my PhD thesis and in a few publications. The (Java) source code is available here, but unfortunately documentation is still lacking.

DynamiTE

After my PhD I became interested in performing reasoning over streaming data rather than static RDF knowledge bases. In this context, it is crucial to perform incremental reasoning, limiting the computation to the new data that arrives in the system.

DynamiTE is a parallel reasoner that adapts a few well-known techniques from the database community to maintain a consistent materialisation of the inferences when information is added or removed. In AI, such a system would be classified as a truth maintenance system, since its main purpose is to compute the new inferences that become valid and to retract all the information that is no longer true.
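
A classic example of such a technique is delete/rederive (DRed): when facts are removed, first over-delete everything whose derivation might involve them, then re-derive whatever is still supported by the remaining data. The toy sketch below (hypothetical code with a single transitivity rule, not DynamiTE’s implementation) shows the two phases:

    import java.util.HashSet;
    import java.util.Set;

    // Toy delete/rederive (DRed) for a single rule,
    //   trans: (x,z) <- (x,y), (y,z),
    // over pairs encoded as strings "x>z". Not DynamiTE's actual code.
    public class DRedDemo {

      static Set<String> explicit = new HashSet<>();  // asserted facts
      static Set<String> facts = new HashSet<>();     // materialisation

      static String pair(String x, String z) { return x + ">" + z; }

      // Run the transitivity rule to fixpoint over a base of facts.
      static Set<String> materialise(Set<String> base) {
        Set<String> out = new HashSet<>(base);
        boolean changed = true;
        while (changed) {
          Set<String> delta = new HashSet<>();
          for (String a : out) for (String b : out) {
            String[] p = a.split(">"), q = b.split(">");
            if (p[1].equals(q[0])) delta.add(pair(p[0], q[1]));
          }
          changed = out.addAll(delta);
        }
        return out;
      }

      static void delete(String x, String z) {
        explicit.remove(pair(x, z));

        // Phase 1, over-delete: drop every fact whose derivation may
        // (directly or indirectly) involve the deleted one.
        Set<String> over = new HashSet<>();
        over.add(pair(x, z));
        boolean changed = true;
        while (changed) {
          changed = false;
          for (String a : facts) for (String b : facts) {
            String[] p = a.split(">"), q = b.split(">");
            if (p[1].equals(q[0]) && (over.contains(a) || over.contains(b)))
              changed |= over.add(pair(p[0], q[1]));
          }
        }
        facts.removeAll(over);

        // Phase 2, re-derive: an over-deleted fact survives if it is still
        // asserted, or still follows from the facts that were kept.
        changed = true;
        while (changed) {
          changed = false;
          for (String f : over) {
            if (facts.contains(f)) continue;
            String[] p = f.split(">");
            boolean supported = explicit.contains(f);
            for (String a : facts) {
              if (supported) break;
              String[] u = a.split(">");
              if (u[0].equals(p[0]) && facts.contains(pair(u[1], p[1])))
                supported = true;
            }
            if (supported) changed |= facts.add(f);
          }
        }
      }

      public static void main(String[] args) {
        explicit.add(pair("A", "B"));
        explicit.add(pair("B", "C"));
        explicit.add(pair("A", "C"));   // A>C is also asserted explicitly
        facts = materialise(explicit);
        delete("B", "C");
        System.out.println(facts);      // A>B and A>C remain; B>C is gone
      }
    }

Additions are the easy direction, since it suffices to run the rules to fixpoint starting from the new facts; deletions are harder precisely because a removed fact may or may not invalidate its consequences, which is what the re-derivation phase checks.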

Using our system, we showed how incremental reasoning can be performed over large knowledge bases with billions of RDF triples. While the computation obviously depends on the amount of information that is added or removed, our experiments show that this process can be parallelised effectively. Furthermore, the system allows efficient querying, since all the data is stored in a series of B+Tree indices.

Currently the system is described in a single publication. The (Java) code is available here.

RDFPig

I joined this project (which was led by Spyros Kotoulas) when I visited Yahoo! Research Labs in Barcelona in 2010. The main problem addressed by this project was how to perform very expensive analytical SPARQL queries over very large amounts of RDF data crawled from the Web.

In RDFPig, we implemented an approach that uses a combination of dynamic programming and sampling to calculate the best query plans and to execute them in different stages. To execute the queries, we used Apache Pig, which translates relational operators into sequences of MapReduce jobs. This introduced a new challenge: defining an appropriate cost model for a distributed framework where the data is not indexed.
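
The planning component can be pictured as a textbook dynamic program over subsets of the query’s triple patterns, except that the cardinality estimates come from running candidate subplans on a sample of the data rather than from index statistics. The sketch below uses hypothetical code and made-up numbers, not RDFPig’s implementation:

    import java.util.Arrays;

    // Toy join-order planner (hypothetical code, not RDFPig's): dynamic
    // programming over subsets of triple patterns, with cardinalities that in
    // RDFPig would be estimated by executing subplans on a data sample.
    public class JoinOrderDP {

      // Cheapest cost of joining all n patterns, where card[s] is the
      // (sampled) result size of the subset of patterns in bitmask s.
      static double cheapestCost(int n, double[] card) {
        int full = (1 << n) - 1;
        double[] best = new double[full + 1];
        Arrays.fill(best, Double.POSITIVE_INFINITY);
        for (int i = 0; i < n; i++) best[1 << i] = 0.0;  // leaves are free
        for (int s = 1; s <= full; s++) {
          if (Integer.bitCount(s) < 2) continue;
          // split s into two non-empty halves l and r and join them;
          // charge the join for the two intermediate results it shuffles
          for (int l = (s - 1) & s; l > 0; l = (l - 1) & s) {
            int r = s ^ l;
            best[s] = Math.min(best[s], best[l] + best[r] + card[l] + card[r]);
          }
        }
        return best[full];
      }

      public static void main(String[] args) {
        int n = 3;                        // a query with three triple patterns
        double[] card = new double[1 << n];
        // made-up sampled estimates, indexed by subset bitmask
        card[0b001] = 1000; card[0b010] = 50;  card[0b100] = 2000;
        card[0b011] = 40;   card[0b101] = 800; card[0b110] = 30;
        card[0b111] = 25;
        System.out.println(cheapestCost(n, card));  // cost of the cheapest plan
      }
    }

The toy cost model charges each join for the size of the two intermediate results it must shuffle, which is the kind of quantity that dominates on MapReduce, where every job re-reads and re-partitions its input.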

The technique and its evaluation are described in a scientific publication. The code is available here.

YSR

YSR (Yet another Semantic web Reasoner, pronounced “Wiser”) is a very simple yet high-performance RDFS and OWL reasoner that I developed with the SWI-Prolog suite when I was still a master’s student. It allows the user to perform a full materialisation over RDF data. It was initially developed as a reasoner to be used within the MaRVIN project. Since MaRVIN is written in Java while the reasoner is in Prolog, a web interface was developed so that the two programs could exchange data over HTTP. This project is no longer active nor maintained.

Short documentation for this project (along with the code) is available here.