Jacopo Urbani

"You must believe in spring" - Bill Evans

I am an assistant professor in Computer Science at the Vrije Universiteit Amsterdam (VUA) and a guest researcher at CWI.

My research focuses on how to extract new (and interesting) knowledge from large datasets, primarily those available on the Web. If you are interested, please check out my publication list on Google Scholar or DBLP to get a better idea of my research area.

I have received a number of awards for my research. Several papers that I co-authored received either an honorable mention or a best paper award at top conferences. In 2010, my work on forward inference with MapReduce won the IEEE SCALE challenge. In 2012, the Network Institute awarded me the "Most Promising Young Researcher Award". In 2013, my PhD thesis was awarded the qualification cum laude, which is given to only 5% of the theses in our department. In 2014, my PhD work received an honorable mention as the best PhD thesis in Computer Science in the country. The award was given by the Christiaan Huygens society, after a selection performed by the KNAW (Royal Netherlands Academy of Arts and Sciences).

Latest news

16/08/2022

New paper at SIGMOD 2023 on performing scalable probabilistic reasoning using Trigger Graphs. This work was done in collaboration with Samsung AI (Cambridge, UK).

[Abstract]
The role of uncertainty in data management has become more prominent than ever before, especially because of the growing importance of machine learning-driven applications that produce large uncertain databases. A well-known approach to querying such databases is to blend rule-based reasoning with uncertainty, but techniques proposed so far struggle with large databases. In this paper, we address this problem by presenting a new technique for probabilistic reasoning that exploits Trigger Graphs (TGs) – a notion recently introduced for the non-probabilistic setting. The intuition is that TGs can effectively store a probabilistic model by avoiding an explicit materialization of the lineage and by grouping together similar derivations of the same fact. Firstly, we show how TGs can be adapted to support the possible world semantics. Then, we describe techniques for efficiently computing a probabilistic model, and formally establish the correctness of our approach. We also present an extensive empirical evaluation using a prototype called LTGs. Our comparison against other leading engines shows that LTGs is not only faster, even against approximate reasoning techniques, but can also reason over probabilistic databases that existing engines cannot scale to.
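As a minimal illustration of the possible world semantics mentioned in the abstract, the sketch below computes a query probability by brute-force enumeration over independent probabilistic facts. This is deliberately naive: the point of the TG-based technique in the paper is precisely to avoid materializing all worlds. The rule and fact names are invented for the example.

```python
from itertools import product

def query_probability(facts, query_holds):
    """Probability that a query holds under the possible world semantics:
    each probabilistic fact is an independent Bernoulli event, and we sum
    the probabilities of all worlds in which the query is derivable."""
    names = list(facts)
    total = 0.0
    for choices in product([False, True], repeat=len(names)):
        world = {f for f, keep in zip(names, choices) if keep}
        p = 1.0
        for f, keep in zip(names, choices):
            p *= facts[f] if keep else 1.0 - facts[f]
        if query_holds(world):
            total += p
    return total

# Toy rule: path(a,c) is derivable if edge(a,c) is present, or if both
# edge(a,b) and edge(b,c) are present.
facts = {"edge(a,b)": 0.9, "edge(b,c)": 0.8, "edge(a,c)": 0.5}
def derives_path(world):
    return "edge(a,c)" in world or {"edge(a,b)", "edge(b,c)"} <= world

print(round(query_probability(facts, derives_path), 4))  # → 0.86
```

The enumeration is exponential in the number of facts, which is exactly why grouping derivations and avoiding explicit lineage, as TGs do, matters at scale.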
15/04/2022

New paper at KR 2022 on performing rule-based reasoning with existential rules on data streams. This work was done in collaboration with Markus Krötzsch (TU Dresden) and Thomas Eiter (TU Wien).

[Abstract]
We study reasoning with existential rules to perform query answering over streams of data. On static databases, this problem has been widely studied, but its extension to data that changes rapidly has not yet been considered. To bridge this gap, we consider LARS, a well-known framework for rule-based stream reasoning, and extend it to support existential rules. For that, we show how to translate LARS with existentials into a semantics-preserving set of existential rules. As query answering with such rules is undecidable in general, we describe how to leverage the temporal nature of streams and present suitable notions of acyclicity that ensure decidability. Our contribution also includes a preliminary empirical evaluation over artificial streams.
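To give a flavour of rule-based reasoning over windows of a stream, here is a toy Python sketch of applying a rule to a sliding time window. It is not LARS and contains none of the paper's existential-rule machinery; the rule, atoms, and events are invented for illustration.

```python
from collections import deque

def sliding_window_derive(stream, width, rule):
    """Apply a rule to each sliding time window over a timestamped stream.
    `stream` yields (timestamp, atom) pairs in time order; `rule` maps the
    set of atoms visible in the current window to a set of derived atoms."""
    window = deque()
    for t, atom in stream:
        window.append((t, atom))
        # Drop atoms that have fallen out of the window [t - width + 1, t].
        while window and window[0][0] <= t - width:
            window.popleft()
        visible = {a for _, a in window}
        for derived in sorted(rule(visible)):
            yield t, derived

# Toy rule: derive alert() whenever both hot() and smoke() were observed
# within the last 2 time units.
def fire_rule(atoms):
    return {"alert()"} if {"hot()", "smoke()"} <= atoms else set()

events = [(1, "hot()"), (2, "smoke()"), (5, "smoke()")]
print(list(sliding_window_derive(events, width=2, rule=fire_rule)))
# → [(2, 'alert()')]
```

Note how the derivation at time 2 is not repeated at time 5: by then hot() has expired from the window, which is the kind of temporal locality the paper's acyclicity notions exploit.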
22/02/2022

New paper at ESWC 2022 on fact classification with Knowledge Graph embeddings and ensemble-based learning.

[Abstract]
Numerous prior works have shown how we can use Knowledge Graph embeddings for ranking unseen facts that are likely to be true. Much less attention has been given to how to use embeddings for fact classification, a related task where we do not rank facts but label them either as true or false. A direct conversion of the ranked lists of facts into true/false labels tends to yield a low accuracy. This makes fact classification with embeddings a non-trivial problem. In this paper, we tackle this challenge with a new technique that exploits ensemble learning and weak supervision, following the principle that multiple weak classifiers can make a strong one. Our method is implemented in a new system called DuEL. DuEL post-processes the ranked lists produced by the embedding models with multiple classifiers, which include supervised models like LSTMs, MLPs, and CNNs and unsupervised ones that consider subgraphs and reachability in the graph. The output of these classifiers is aggregated using a weakly supervised method that does not need ground truths, which would be expensive to obtain. Our experiments show that DuEL produces a more accurate classification than other existing methods, with improvements up to 72% in terms of F1 score. This suggests that weakly supervised ensemble learning is a promising technique to perform fact classification with embeddings.
[GitHub]
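As a rough illustration of the principle that multiple weak classifiers can make a strong one, here is an unweighted majority-vote baseline in Python. DuEL's actual aggregation is weakly supervised and considerably more sophisticated; the facts and votes below are hypothetical.

```python
def majority_vote(predictions):
    """Aggregate binary labels (True/False) from several weak classifiers
    into one label per candidate fact by unweighted majority vote."""
    return {fact: sum(votes) > len(votes) / 2
            for fact, votes in predictions.items()}

# Three hypothetical weak classifiers voting on two candidate facts.
preds = {
    "bornIn(Turing, London)": [True, True, False],
    "bornIn(Turing, Paris)":  [False, True, False],
}
print(majority_vote(preds))
# → {'bornIn(Turing, London)': True, 'bornIn(Turing, Paris)': False}
```

A weakly supervised aggregator would instead estimate how reliable each classifier is, without ground-truth labels, and weight the votes accordingly.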
07/11/2021

New paper at EMNLP 2021 on robust stance classification with BERT-based inconsistency detection.

[Abstract]
We study the problem of performing automatic stance classification on social media with neural architectures such as BERT. Although these architectures deliver impressive results, their performance is not yet comparable to that of humans, and they might produce errors that have a significant impact on the downstream task (e.g., fact-checking). To improve the performance, we present a new neural architecture where the input also includes automatically generated negated perspectives over a given claim. The model is jointly trained to make multiple predictions simultaneously, which can be used either to improve the classification of the original perspective or to filter out doubtful predictions. In the first case, we propose a weakly supervised method for combining the predictions into a final one. In the second case, we show that using the confidence scores to remove doubtful predictions allows our method to achieve human-like performance over the retained information, which is still a sizable part of the original input.
[Link] [arXiv] [GitHub]
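The confidence-based filtering idea from the second case can be sketched as a simple threshold rule, shown below in Python. This is only an illustration of the idea, not the paper's model; the claim names, labels, and scores are invented.

```python
def filter_by_confidence(predictions, threshold):
    """Keep only stance predictions whose confidence meets the threshold;
    doubtful predictions are discarded rather than risked as errors."""
    return {claim: label
            for claim, (label, conf) in predictions.items()
            if conf >= threshold}

# Hypothetical (label, confidence) predictions for three claims.
preds = {
    "claim_1": ("support", 0.94),
    "claim_2": ("oppose", 0.51),
    "claim_3": ("oppose", 0.88),
}
print(filter_by_confidence(preds, threshold=0.8))
# → {'claim_1': 'support', 'claim_3': 'oppose'}
```

The trade-off is coverage versus accuracy: a higher threshold retains fewer predictions, but those retained are more likely to be correct, which is what allows human-like performance on the retained subset.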