Reasoning-Hadoop (MSc thesis on MapReduce and Semantic Web)
Reasoning-Hadoop is the master's thesis project that I developed during my master's in AI at the Vrije Universiteit. The purpose of the thesis was to investigate whether a MapReduce framework can be used to implement reasoning on the Semantic Web.
It took about 6 months. The reasoning process consists of iteratively applying a set of rules to an existing data set (encoded in RDF) and materializing the derived triples. Many reasoners already exist, but they can handle only a limited amount of data. In my thesis I wanted to build a reasoner that could handle data at web scale. I chose a distributed approach because it can scale in two dimensions, the hardware of each machine and the number of machines, so it can potentially offer a higher degree of scalability than a single-machine approach.
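To make the idea of "iteratively applying rules and materializing the derived triples" concrete, here is a minimal single-machine sketch in Python. It is not the code from the thesis: the function names, the tuple representation of triples, and the choice of only two standard RDFS rules (transitivity of rdfs:subClassOf and inheritance of rdf:type through rdfs:subClassOf) are assumptions made for illustration.

```python
# Illustrative sketch only, not the thesis implementation.
# A triple is a (subject, predicate, object) tuple of strings.
RDF_TYPE = "rdf:type"
SUBCLASS = "rdfs:subClassOf"


def apply_rules_once(triples):
    """Return the triples derivable in one pass using two RDFS rules."""
    derived = set()
    subclass_pairs = [(s, o) for (s, p, o) in triples if p == SUBCLASS]

    # Rule 1: (a subClassOf b) and (b subClassOf d) => (a subClassOf d)
    for a, b in subclass_pairs:
        for c, d in subclass_pairs:
            if b == c:
                derived.add((a, SUBCLASS, d))

    # Rule 2: (x type c) and (c subClassOf d) => (x type d)
    for s, p, o in triples:
        if p == RDF_TYPE:
            for c, d in subclass_pairs:
                if o == c:
                    derived.add((s, RDF_TYPE, d))
    return derived


def closure(triples):
    """Apply the rules repeatedly until no new triple is derived."""
    closed = set(triples)
    while True:
        new_triples = apply_rules_once(closed) - closed
        if not new_triples:
            return closed
        closed |= new_triples
```

Calling closure on a small set such as "ex:Cat subClassOf ex:Mammal", "ex:Mammal subClassOf ex:Animal" and "ex:tom type ex:Cat" adds the three implied triples. The nested loops are quadratic, which is exactly the kind of approach that stops working on large inputs.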
I first looked at RDFS reasoning, because its rules are relatively simple. The Hadoop framework is an excellent piece of software, but a straightforward, naive implementation of the rule execution performs poorly. The main part of the work consisted of researching how to achieve better performance. With several non-trivial optimizations the process is now much faster, and so far no other reasoner is as fast. To give an example, the program computes the RDFS closure of 850 million triples crawled from the Web in less than 1 hour, using 33 machines with commodity hardware.
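As a rough illustration of what a straightforward encoding of one rule as a MapReduce job might look like, the sketch below simulates the map, shuffle and reduce phases in plain Python. It does not use the Hadoop API and it is not one of the optimized algorithms from the thesis; the function names and the tagging scheme are hypothetical. The rule shown is the RDFS type-inheritance rule: (x type c) and (c subClassOf d) imply (x type d).

```python
# Illustrative simulation of a reduce-side join, not the thesis code.
from collections import defaultdict

RDF_TYPE = "rdf:type"
SUBCLASS = "rdfs:subClassOf"


def map_triple(triple):
    """Map phase: key each relevant triple by the class both sides share."""
    s, p, o = triple
    if p == RDF_TYPE:
        yield o, ("instance", s)      # x type c      -> key c
    elif p == SUBCLASS:
        yield s, ("superclass", o)    # c subClassOf d -> key c


def reduce_join(key, values):
    """Reduce phase: join instances of a class with its superclasses."""
    instances = [v for tag, v in values if tag == "instance"]
    superclasses = [v for tag, v in values if tag == "superclass"]
    for instance in instances:
        for superclass in superclasses:
            yield (instance, RDF_TYPE, superclass)


def run_job(triples):
    """Simulate one job: map every triple, group by key, then reduce."""
    groups = defaultdict(list)
    for triple in triples:
        for key, value in map_triple(triple):
            groups[key].append(value)
    derived = set()
    for key, values in groups.items():
        derived.update(reduce_join(key, values))
    return derived
```

A single run derives only the triples that are one join away, so a naive driver has to launch the job again and again until nothing new appears, and each run may re-derive triples that already exist. Costs of this kind are one plausible reason why a naive encoding performs poorly.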
Afterwards I focused on the ter Horst OWL rules. OWL reasoning is more complex than RDFS reasoning, and the optimizations implemented for RDFS do not apply to it. My OWL implementation is still at an early stage and its performance is far from competitive. However, there is still room for optimization, and it is too early to give a definitive answer.
The purpose of this page is to collect some information for readers who want to know a bit more about the project. I decided to publish all the code on the Internet in case somebody wants to improve it, but the code is not very readable. If I have time I will clean it up and make it "presentable". If you have any questions, don't hesitate to contact me.
Documentation:
The master's thesis is the most complete documentation of the project. If you want to know everything about it, this document is the best starting point.
Software:
The code repository is available at:
https://launchpad.net/reasoning-hadoop