TY - JOUR
T1 - Formalizing and Validating Wikidata's Property Constraints using SHACL and SPARQL
AU - Ferranti, Nicolas
AU - De Souza, Jairo Francisco
AU - Ahmetaj, Shqiponja
AU - Polleres, Axel
PY - 2024
Y1 - 2024
N2 - In this paper, we delve into the crucial role of constraints in maintaining data integrity in knowledge graphs, with a specific focus on Wikidata, one of the most extensive collaboratively maintained open data knowledge graphs on the Web. The World Wide Web Consortium (W3C) recommends the Shapes Constraint Language (SHACL) as the constraint language for validating knowledge graphs; it comes in two levels of expressivity, SHACL-Core and SHACL-SPARQL. Despite the availability of SHACL, Wikidata currently represents its property constraints through its own RDF data model, which relies on Wikidata's specific reification mechanism based on authoritative namespaces, and on partially ambiguous natural language definitions. In the present paper, we investigate whether and how the semantics of Wikidata property constraints can be formalized using SHACL-Core, SHACL-SPARQL, and directly as SPARQL queries. While the expressivity of SHACL-Core turns out to be insufficient for expressing all Wikidata property constraint types, we present SPARQL queries to identify violations for all 32 current Wikidata constraint types. We compare the semantics of this unambiguous SPARQL formalization with Wikidata's violation reporting system and discuss limitations in terms of evaluation via Wikidata's public SPARQL query endpoint, due to its current scalability. Our study, on the one hand, sheds light on the unique characteristics of constraints defined by the Wikidata community, in order to improve the quality and accuracy of data in this collaborative knowledge graph. On the other hand, as a "byproduct", our formalization extends existing benchmarks for both SHACL and SPARQL with a challenging, large-scale real-world use case.
AB - In this paper, we delve into the crucial role of constraints in maintaining data integrity in knowledge graphs, with a specific focus on Wikidata, one of the most extensive collaboratively maintained open data knowledge graphs on the Web. The World Wide Web Consortium (W3C) recommends the Shapes Constraint Language (SHACL) as the constraint language for validating knowledge graphs; it comes in two levels of expressivity, SHACL-Core and SHACL-SPARQL. Despite the availability of SHACL, Wikidata currently represents its property constraints through its own RDF data model, which relies on Wikidata's specific reification mechanism based on authoritative namespaces, and on partially ambiguous natural language definitions. In the present paper, we investigate whether and how the semantics of Wikidata property constraints can be formalized using SHACL-Core, SHACL-SPARQL, and directly as SPARQL queries. While the expressivity of SHACL-Core turns out to be insufficient for expressing all Wikidata property constraint types, we present SPARQL queries to identify violations for all 32 current Wikidata constraint types. We compare the semantics of this unambiguous SPARQL formalization with Wikidata's violation reporting system and discuss limitations in terms of evaluation via Wikidata's public SPARQL query endpoint, due to its current scalability. Our study, on the one hand, sheds light on the unique characteristics of constraints defined by the Wikidata community, in order to improve the quality and accuracy of data in this collaborative knowledge graph. On the other hand, as a "byproduct", our formalization extends existing benchmarks for both SHACL and SPARQL with a challenging, large-scale real-world use case.
KW - Wikidata
KW - Data quality
KW - Knowledge Graphs
KW - Constraints
KW - Shapes Constraint Language
KW - SPARQL
M3 - Journal article
SN - 1570-0844
JO - Semantic Web Journal (SWJ)
JF - Semantic Web Journal (SWJ)
ER -