Data Quality Constraints

This chapter discusses Stardog’s Integrity Constraint Validation (ICV), a feature that enforces data integrity and helps improve the correctness and consistency of the knowledge graph. This page provides an overview and shows the basic usage of the feature.

Page Contents
  1. Overview
  2. SHACL Constraints
  3. Adding Constraints
  4. Validate SPARQL query
  5. Validating Constraints
    1. Validating Specific Graphs
    2. Validating Specific Nodes
    3. Validating Shapes from Specific Graphs
    4. Validating Specific Shapes
    5. Validating External Shapes
    6. Limiting Number of Violations
  6. Validate SPARQL service
    1. Relationship between the VALIDATE query and VALIDATE service
  7. ICV & Reasoning
  8. ICV Guard Mode
  9. SHACL Extensions in Stardog
    1. Query Dataset Specification in SPARQL constraints
  10. SHACL Support Limitations

Overview

Stardog Integrity Constraint Validation (“ICV”) validates RDF data stored in a Stardog database against constraints that users define to fit their domain, application, and data. These constraints are written in SHACL (Shapes Constraint Language). Using a high-level language as a constraint language for RDF and Linked Data has several advantages:

  • Unifying the domain model with data quality rules
  • Aligning the domain model and data quality rules with the integration model and language (i.e., RDF)
  • Being able to query the domain model, data quality rules, integration model, mapping rules, etc. with SPARQL
  • Being able to use automated reasoning about all of these things to ensure logical consistency, explain errors and problems, etc.

Typical ICV usage is to add constraints to a database that has the domain data and validate the database to see if there are any violations. It is also possible to enable guard mode, which will enforce the constraints at database modification time.

SHACL Constraints

SHACL and other data quality concepts are demonstrated in our Data Quality Training. You can find an illustrative example of SHACL constraints for the music tutorial dataset in the tutorials repo.

Adding Constraints

SHACL is expressed as RDF, so SHACL constraints can be added to a Stardog database like any other RDF data. Best practice is to store SHACL definitions in one or more named graphs to make managing them easier.

$ stardog data add -g urn:example:constraints stardog-tutorial-music stardog-tutorials/shacl/music_shacl.ttl

By default, the validation process will use any SHACL definition in any named graph, but the database configuration option icv.active.graphs can be set to a list of named graphs to restrict which named graphs will be used to look up constraints. The shape graphs can also be specified at validation time, as explained below.

When SHACL constraints are stored in a named graph, clearing the named graph will remove the constraints from the database:

$ stardog data remove -g urn:example:constraints stardog-tutorial-music

The SHACL constraints in the database can be queried with SPARQL like regular data. The icv export command can also be used to list them:

$ stardog icv export stardog-tutorial-music
ShaclConstraint{http://stardog.com/tutorial/SongShape}
ShaclConstraint{http://stardog.com/tutorial/AlbumShape}
ShaclConstraint{http://stardog.com/tutorial/ArtistShape}
ShaclConstraint{http://stardog.com/tutorial/BandShape}

If the -f/--format option is used, the contents of the constraints can be exported in any desired RDF format. For example, the following command will export the constraints in pretty Turtle format:

$ stardog icv export -f pretty stardog-tutorial-music

Validate SPARQL query

Validation is the process of checking whether a database is valid with respect to the integrity constraints. The result is a validation report, which is a collection of violations. If there are no violations, we say the database conforms to the constraints (i.e., it is valid). Each violation points to a node and a constraint, along with other auxiliary information to explain what has been violated.

The validation report can be retrieved by executing a VALIDATE query. VALIDATE is a top-level query form introduced by Stardog, separate from SELECT, CONSTRUCT, and the other standard SPARQL query forms. The VALIDATE query returns an RDF graph as its result, similar to CONSTRUCT and DESCRIBE queries. The result is a SHACL validation report as defined in the SHACL specification.

The syntax of VALIDATE queries is as follows:

VALIDATE (ALL  | [<IRI>+] [GRAPH <IRI>+]) 
[USING SHAPES (<IRI>+ | GRAPH <IRI>+ | <QuadData>) ]
[LIMIT <int>] 
[LIMIT PER SHAPE <int>] 

where IRI and QuadData are defined in the SPARQL grammar.

The details of VALIDATE queries are explained in the following sections.
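The grammar above can be read as a small query builder. The following Python sketch assembles VALIDATE query strings from optional parts. The function name build_validate_query and its parameters are hypothetical and not part of any Stardog client library; IRIs are passed through verbatim, so prefixed names are assumed to be resolvable by the database.

```python
from typing import Optional, Sequence


def build_validate_query(
    nodes: Sequence[str] = (),
    data_graphs: Sequence[str] = (),
    shapes: Sequence[str] = (),
    shape_graphs: Sequence[str] = (),
    inline_shapes: Optional[str] = None,
    limit: Optional[int] = None,
    limit_per_shape: Optional[int] = None,
) -> str:
    """Assemble a VALIDATE query string (hypothetical helper)."""
    # Target: ALL, or an optional node list followed by an optional GRAPH list.
    if nodes or data_graphs:
        target = list(nodes)
        if data_graphs:
            target += ["GRAPH", *data_graphs]
    else:
        target = ["ALL"]
    parts = ["VALIDATE", *target]

    # USING SHAPES takes exactly one of: shape IRIs, GRAPH IRIs, inline data.
    given = [bool(shapes), bool(shape_graphs), inline_shapes is not None]
    if sum(given) > 1:
        raise ValueError("choose one of shapes, shape_graphs, inline_shapes")
    if shapes:
        parts += ["USING SHAPES", *shapes]
    elif shape_graphs:
        parts += ["USING SHAPES GRAPH", *shape_graphs]
    elif inline_shapes is not None:
        parts += ["USING SHAPES {", inline_shapes.strip(), "}"]

    # Both limits are optional; LIMIT comes before LIMIT PER SHAPE.
    if limit is not None:
        parts += ["LIMIT", str(limit)]
    if limit_per_shape is not None:
        parts += ["LIMIT PER SHAPE", str(limit_per_shape)]
    return " ".join(parts)
```

Calling build_validate_query() with no arguments yields the simplest form, VALIDATE ALL; the other arguments correspond one-to-one to the optional clauses of the grammar.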

Validating Constraints

The VALIDATE query in its simplest form looks as follows:

VALIDATE ALL

This query validates all the named graphs in the database using all the constraints stored within the database. More complex forms of the query can validate different subsets of the data using a subset of the constraints, as explained below.

For a valid database, the result of this query is a report that looks as follows:

@prefix sh: <http://www.w3.org/ns/shacl#> .

[
    a sh:ValidationReport ;
    sh:conforms true
] .

An example validation report showing some violations looks as follows:

@prefix : <http://stardog.com/tutorial/> .
@prefix sh: <http://www.w3.org/ns/shacl#> .


[
    a sh:ValidationReport ;
    sh:conforms false ;
    sh:result [
        a sh:ValidationResult ;
        sh:resultSeverity sh:Violation ;
        sh:sourceShape :SongLengthShape ;
        sh:sourceConstraintComponent sh:DatatypeConstraintComponent ;
        sh:focusNode :Love_Me_Do ;
        sh:resultPath :length ;
        sh:value 125.0 ;
        sh:resultMessage "Value must have datatype xsd:integer"
    ] , [
        a sh:ValidationResult ;
        sh:resultSeverity sh:Violation ;
        sh:sourceShape :AlbumDateShape ;
        sh:sourceConstraintComponent sh:MaxCountConstraintComponent ;
        sh:focusNode :Please_Please_Me ;
        sh:resultPath :date ;
        sh:resultMessage "There must be <= 1 values"
    ] , [
        a sh:ValidationResult ;
        sh:resultSeverity sh:Violation ;
        sh:sourceShape :AlbumTrackShape ;
        sh:sourceConstraintComponent sh:MinCountConstraintComponent ;
        sh:focusNode :McCartney ;
        sh:resultPath :track ;
        sh:resultMessage "There must be >= 1 values"
    ]
] .

Validating Specific Graphs

By default, validation works over the whole database and validates the contents of every named graph. Validation can be performed over one or more specific named graphs:

VALIDATE GRAPH ex:graph1 ex:graph2

Validating specific graphs means any node outside these graphs will not be considered a target for validation.

If a node violating a shape is stored in multiple named graphs, validation results for that node will be duplicated for each occurrence of the node.

Validating Specific Nodes

The focus of validation can be made as fine-grained as specific nodes in the graph:

VALIDATE ex:MyNode

In this case, only the specified node(s) will be validated using the applicable constraints. No other node will be considered a target. Any constraint for which the specified node(s) are not targets will be ignored. If the specified node IRIs are not found in the database, a violation will be returned for each such undefined node.

Validating Shapes from Specific Graphs

By default, the VALIDATE query will use any SHACL constraint in any named graph, but the database configuration option icv.active.graphs can be set to a list of named graphs to restrict which named graphs should be used to look up the constraints. It is also possible to define the shapes graph within the VALIDATE query:

VALIDATE ALL USING SHAPES GRAPH ex:constraintGraph

This query will use any shape defined within the specified graph(s).

If the same shape is stored in multiple named graphs, validation results for that shape will be duplicated for each occurrence of the shape.

Validating Specific Shapes

The shapes to validate can also be specified directly in the VALIDATE query:

VALIDATE ALL USING SHAPES :SongShape :AlbumShape

In this case, no other shape will be validated. If the specified shape IRIs are not found in the database, a violation will be returned for each such undefined shape.

Validating External Shapes

It is also possible to perform validation using shapes that are not stored within the database. In this case, the shapes can be specified in-line with the VALIDATE query:

VALIDATE ALL USING SHAPES {
    :NameShape a sh:NodeShape ;
       sh:property [
         sh:path :name ;
         sh:minCount 1 ;
         sh:datatype xsd:string
       ] .
}

Limiting Number of Violations

Some databases contain so many violations that it is desirable, for performance or readability reasons, to limit how many are included in the validation report. A limit can be defined to stop validation when a certain number of violations have been found:

VALIDATE ALL LIMIT 10

In some cases, it is desirable to limit the violations reported per shape. This is useful to get a quick summary of all the shapes for which there are violations. The LIMIT PER SHAPE clause limits the violations reported per shape. The following query will return at most 10 violations per shape:

VALIDATE ALL LIMIT PER SHAPE 10

For example, if there are three shapes for which there are violations, then this query will return at most 30 results.

The LIMIT PER SHAPE can be used in conjunction with the query LIMIT as well:

VALIDATE ALL LIMIT 100 LIMIT PER SHAPE 1

This query would return at most one validation result per shape, but if there are more than 100 shapes with validation results, the validation would stop after the first 100 results have been returned.
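The interaction of the two limits can be illustrated with a toy model in Python. This is only a sketch of the behavior described above, not Stardog's implementation: the function apply_limits is hypothetical, violations are modeled as (shape, focus node) pairs, and the order in which violations are discovered is assumed to be the order of the input list.

```python
from collections import Counter
from typing import Iterable, List, Optional, Tuple

Violation = Tuple[str, str]  # (source shape, focus node)


def apply_limits(
    violations: Iterable[Violation],
    limit: Optional[int] = None,
    limit_per_shape: Optional[int] = None,
) -> List[Violation]:
    """Keep violations subject to an overall and a per-shape cap."""
    kept: List[Violation] = []
    per_shape: Counter = Counter()
    for shape, node in violations:
        # Overall LIMIT: stop as soon as enough results were collected.
        if limit is not None and len(kept) >= limit:
            break
        # LIMIT PER SHAPE: skip further results for a saturated shape.
        if limit_per_shape is not None and per_shape[shape] >= limit_per_shape:
            continue
        per_shape[shape] += 1
        kept.append((shape, node))
    return kept


# Three shapes with 15 violations each, capped at 10 per shape.
sample = [(f":Shape{s}", f":node{n}") for s in range(3) for n in range(15)]
capped = apply_limits(sample, limit_per_shape=10)
```

With the sample above, the per-shape cap keeps 10 results for each of the three shapes. If a shape has fewer violations than the cap, fewer results are kept, which is why the per-shape limit gives only an upper bound on the total.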

Validate SPARQL service

In addition to the VALIDATE query explained above, Stardog supports a way to perform validation using the SERVICE keyword. This allows validation to be done as part of a SELECT query. Each solution returned by the validate service corresponds to a distinct violation result. If there are no violations, the service will return no solutions.

The following example shows how the service can be invoked:

PREFIX icv: <tag:stardog:api:icv:>
SELECT * {
    SERVICE icv:validate {
      # service input parameters
      [] icv:dataGraph :myDataGraph;
         icv:shapesGraph :myShapesGraph;
      
      # service output parameters
         icv:resultSeverity ?severity ;
         icv:resultMessage ?message ;
         icv:sourceShape ?shape ;
         icv:sourceConstraint ?constraint ;
         icv:sourceConstraintComponent ?component ;
         icv:focusNode ?focusNode ;
         icv:resultPath ?path ;
         icv:value ?valueNode ;
    }
}

Similar to VALIDATE queries, data graphs and shape graphs can be specified for the validation process. If no input parameters are given, the validation will be over the whole database using all the constraints. The service supports only constant input parameters; that is, input parameters cannot be specified as variables that will be bound by other parts of the query.

Below is an example result of the validate service:

+--------------------------------------+------------------------------------------------------------------------+----------------------------------------------+------------+--------------------------------------------------------+----------------------------------------------+------------------------------------+--------------------------------------------+
|               severity               |                                message                                 |                    shape                     | constraint |                       component                        |                  focusNode                   |                path                |                 valueNode                  |
+--------------------------------------+------------------------------------------------------------------------+----------------------------------------------+------------+--------------------------------------------------------+----------------------------------------------+------------------------------------+--------------------------------------------+
| http://www.w3.org/ns/shacl#Violation | "Value must have datatype xsd:integer"                                 | http://stardog.com/tutorial/SongLengthShape  |            | http://www.w3.org/ns/shacl#DatatypeConstraintComponent | http://stardog.com/tutorial/Love_Me_Do       | http://stardog.com/tutorial/length | 1.2E3                                      |
| http://www.w3.org/ns/shacl#Violation | "There must be >= 1 values"                                            | http://stardog.com/tutorial/AlbumTrackShape  |            | http://www.w3.org/ns/shacl#MinCountConstraintComponent | http://stardog.com/tutorial/McCartney        | http://stardog.com/tutorial/track  |                                            |
| http://www.w3.org/ns/shacl#Violation | "There must be <= 1 values"                                            | http://stardog.com/tutorial/AlbumDateShape   |            | http://www.w3.org/ns/shacl#MaxCountConstraintComponent | http://stardog.com/tutorial/Imagine          | http://stardog.com/tutorial/date   |                                            |
+--------------------------------------+------------------------------------------------------------------------+----------------------------------------------+------------+--------------------------------------------------------+----------------------------------------------+------------------------------------+--------------------------------------------+

Complex SHACL property paths are serialized as multiple triples in RDF. However, since the resultPath in the SPARQL service is bound to a single RDF value, this complexity cannot be expressed. If the result path in a violation is a predicate path, the resulting variable will be bound to the corresponding IRI. If the result path is a complex property path, the variable will be bound to the string representation of the path; e.g. ex:firstProperty/ex:secondProperty*.

The validate service can be used to validate external constraints by providing the shapes in-line, within the SERVICE block.
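Because each solution of the validate service is one violation, the results of the SELECT query are easy to post-process on the client. The following Python sketch counts violations per shape. It assumes the results were retrieved in the standard SPARQL 1.1 Query Results JSON format and already parsed into a dictionary (for example with json.loads), and that the variable names match the example query above; the function name violations_per_shape and the sample data are illustrative only.

```python
from collections import Counter

# Sample data in SPARQL 1.1 Query Results JSON layout, abbreviated to two
# of the variables from the example result table above.
SAMPLE = {
    "head": {"vars": ["shape", "focusNode"]},
    "results": {
        "bindings": [
            {
                "shape": {"type": "uri", "value": "http://stardog.com/tutorial/SongLengthShape"},
                "focusNode": {"type": "uri", "value": "http://stardog.com/tutorial/Love_Me_Do"},
            },
            {
                "shape": {"type": "uri", "value": "http://stardog.com/tutorial/AlbumTrackShape"},
                "focusNode": {"type": "uri", "value": "http://stardog.com/tutorial/McCartney"},
            },
            {
                "shape": {"type": "uri", "value": "http://stardog.com/tutorial/AlbumDateShape"},
                "focusNode": {"type": "uri", "value": "http://stardog.com/tutorial/Imagine"},
            },
        ]
    },
}


def violations_per_shape(results: dict) -> Counter:
    """Count violations per source shape in parsed SPARQL JSON results."""
    counts: Counter = Counter()
    for binding in results["results"]["bindings"]:
        # Every solution is one violation; ?shape holds the source shape.
        counts[binding["shape"]["value"]] += 1
    return counts


counts = violations_per_shape(SAMPLE)
```

An empty bindings list produces an empty counter, which corresponds to the case where the database conforms to the constraints.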

Relationship between the VALIDATE query and VALIDATE service

The VALIDATE query form and the validate SPARQL service provide two different ways to retrieve the validation results. The VALIDATE query can be thought of as syntactic sugar for a CONSTRUCT query using the SPARQL service:

PREFIX icv: <tag:stardog:api:icv:>
PREFIX sh: <http://www.w3.org/ns/shacl#> 
CONSTRUCT {
    ?report a sh:ValidationReport ;
    sh:conforms ?conforms ;
    sh:result ?result .
    ?result 
        a sh:ValidationResult ;
        sh:resultSeverity ?severity ;
        sh:resultMessage ?message ;
        sh:sourceShape ?shape ;
        sh:sourceConstraint ?constraint ;
        sh:sourceConstraintComponent ?component ;
        sh:focusNode ?focusNode ;
        sh:resultPath ?path ;
        sh:value ?valueNode ;
}
WHERE {
  BIND(bnode("_:ValidationReport") as ?report)
  OPTIONAL {
    BIND(bnode() as ?result)
    SERVICE icv:validate {
      _:serviceParams icv:dataGraph :staging;
         icv:resultSeverity ?severity ;
         icv:resultMessage ?message ;
         icv:sourceShape ?shape ;
         icv:sourceConstraint ?constraint ;
         icv:sourceConstraintComponent ?component ;
         icv:focusNode ?focusNode ;
         icv:resultPath ?path ;
         icv:value ?valueNode ;
    }
  }
  BIND(!bound(?result) as ?conforms)
}

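This equivalence is not exact for complex property paths: as explained in the warning above, the service binds the result path to a string representation, whereas a validation report serializes a complex path as multiple RDF triples.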
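The logic of the CONSTRUCT query can also be mirrored on the client side. The Python sketch below turns a list of service solutions into a report-like dictionary: the report conforms exactly when there are no solutions, which is the role of BIND(!bound(?result) as ?conforms) in the query. The function report_from_rows and the dictionary layout are hypothetical; rows are assumed to map variable names to plain string values, with None for unbound variables.

```python
from typing import Dict, List, Optional

Row = Dict[str, Optional[str]]


def report_from_rows(rows: List[Row]) -> dict:
    """Build a report-like dictionary from validate service solutions."""
    return {
        "type": "sh:ValidationReport",
        # No solutions from the service means no violations.
        "conforms": len(rows) == 0,
        "results": [
            # Unbound variables (None) are left out of the result entry.
            {"type": "sh:ValidationResult",
             **{key: value for key, value in row.items() if value is not None}}
            for row in rows
        ],
    }


report = report_from_rows([
    {
        "severity": "http://www.w3.org/ns/shacl#Violation",
        "shape": "http://stardog.com/tutorial/AlbumTrackShape",
        "focusNode": "http://stardog.com/tutorial/McCartney",
        "path": "http://stardog.com/tutorial/track",
        "valueNode": None,
    }
])
```

As with the service itself, a complex property path would appear here only as the string the service returned, not as a structured path.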

ICV & Reasoning

An integrity constraint may be satisfied or violated in either of two ways: by an explicit statement in a Stardog database or by a statement that’s been validly inferred by Stardog. For this reason, the validation results will change if reasoning is enabled or disabled. By default, reasoning is disabled for validation but can be enabled just like in any other SPARQL query. For example, in the CLI, you can use the -r, --reasoning option:

$ stardog query --reasoning testdb "VALIDATE ALL"

ICV Guard Mode

Stardog can also apply constraints as part of its transactional cycle and fail transactions that violate constraints. We call this “guard mode”. It must be enabled explicitly in the database configuration options. From the command line, the steps are as follows:

  1. Take the database offline.

     $ stardog-admin db offline myDb
    
  2. Enable ICV with the icv.enabled database configuration option.

     $ stardog-admin metadata set -o icv.enabled=true myDb
    
  3. Bring the database back online.

     $ stardog-admin db online myDb
    

Once guard mode is enabled, modifications of the database (via SPARQL Update or any other method), whether adds or deletes, that violate the integrity constraints will cause the transaction to fail.
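The three steps can be scripted. The following Python sketch only builds the command lines shown above for a given database name, and optionally runs them in order. The function names are hypothetical, and the run helper assumes that stardog-admin is on the PATH and that the server is reachable with default connection settings.

```python
import subprocess
from typing import List


def guard_mode_commands(database: str) -> List[List[str]]:
    """Return the commands that enable guard mode, in execution order."""
    return [
        # 1. Take the database offline.
        ["stardog-admin", "db", "offline", database],
        # 2. Enable ICV via the icv.enabled configuration option.
        ["stardog-admin", "metadata", "set", "-o", "icv.enabled=true", database],
        # 3. Bring the database back online.
        ["stardog-admin", "db", "online", database],
    ]


def run(commands: List[List[str]]) -> None:
    """Run each command in order; stop at the first one that fails."""
    for argv in commands:
        subprocess.run(argv, check=True)


commands = guard_mode_commands("myDb")
```

Calling run(commands) would execute the steps. Stopping at the first failure avoids changing the option while the database is still online, although in that case the database may need to be brought back online manually.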

SHACL Extensions in Stardog

This section discusses the SHACL Extensions in Stardog that are not covered by the SHACL standard.

Query Dataset Specification in SPARQL constraints

SPARQL Constraints in SHACL are supported by Stardog. However, the only way to define the dataset for constraint queries is to put FROM or FROM NAMED clauses directly in the query, which is not always convenient. To address this need, Stardog introduces a non-standard SHACL property as an extension: tag:stardog:api:shacl:fromNamed.

For example, assume that we need to have data in both staging and production graphs (called :stagingGraph and :productionGraph, respectively). The graphs should not have any matching data about the same node. Thus, we’d like to validate :stagingGraph against :productionGraph within the constraint query, in such a way that only nodes in :stagingGraph are validated (i.e., are target nodes for the shape) while the constraint query is executed against :productionGraph:

@prefix : <http://api.stardog.com/> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix sh: <http://www.w3.org/ns/shacl#> .
@prefix stardogSh: <tag:stardog:api:shacl:> .

# The shape using the SPARQL Constraint below
:DepartmentShape
  rdf:type sh:NodeShape ;
  sh:targetClass :Department ;
  sh:sparql :OneDirectorOnly-sparql
.

# The SPARQL Constraint with the custom extension
:OneDirectorOnly-sparql
  rdf:type sh:SPARQLConstraint ;
  sh:message "a department can only have one director." ;
  stardogSh:fromNamed :productionGraph ;
  sh:select """
            prefix : <http://api.stardog.com/>
            SELECT ?this
            WHERE {
                # this is evaluated against the staging graph
                ?this :director ?director .
                GRAPH ?g {
                    # this is evaluated against the production graph
                    ?this :director ?anotherDirector .
                }
                FILTER ( ?director != ?anotherDirector )
             }
            """ ;
.

Consider trying to accomplish the task above, constraining data matches between two graphs, using only the existing --named-graphs option of icv report. With :stagingGraph assigned to --named-graphs, the constraint would apply only to departments in :stagingGraph, since ?this would be pre-bound to departments in that graph. The stardogSh:fromNamed property additionally makes the constraint query (with ?this pre-bound) run against :productionGraph, and that query should not find a matching instance.

With the introduction of this extension, it is necessary to define the priority between the given --named-graphs, the FROM part of the query, and the FROM NAMED part of the query when constructing the query dataset for a SPARQL constraint:

  • If stardogSh:fromNamed is provided, it takes precedence over both FROM NAMED in the query and --named-graphs to define the named part of the query dataset.
  • --named-graphs takes precedence over FROM NAMED graphs in the query.
  • stardogSh:fromNamed has no effect on the default part of the query dataset. It’s defined by --named-graphs (if provided) or FROM in the query. If neither is specified, it’s all local graphs (including the default graph).
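The precedence rules above can be written down as a small decision function. The Python sketch below is a model of the rules as stated, not Stardog code. The function and parameter names are hypothetical, ALL_LOCAL_GRAPHS is a placeholder for "all local graphs (including the default graph)", and it is an assumption that --named-graphs wins over FROM for the default part when both are given, since the text lists it first without saying so explicitly.

```python
from typing import List, Sequence, Tuple

ALL_LOCAL_GRAPHS = "<all local graphs>"  # placeholder, not a real IRI


def resolve_constraint_dataset(
    shacl_from_named: Sequence[str] = (),
    cli_named_graphs: Sequence[str] = (),
    query_from: Sequence[str] = (),
    query_from_named: Sequence[str] = (),
) -> Tuple[List[str], List[str]]:
    """Return (default part, named part) of the constraint query dataset."""
    # Named part: stardogSh:fromNamed > --named-graphs > FROM NAMED.
    if shacl_from_named:
        named = list(shacl_from_named)
    elif cli_named_graphs:
        named = list(cli_named_graphs)
    else:
        named = list(query_from_named)

    # Default part: never affected by stardogSh:fromNamed.
    if cli_named_graphs:
        default = list(cli_named_graphs)  # assumed to win over FROM
    elif query_from:
        default = list(query_from)
    else:
        default = [ALL_LOCAL_GRAPHS]
    return default, named


# The staging/production example from this section.
default_part, named_part = resolve_constraint_dataset(
    shacl_from_named=[":productionGraph"],
    cli_named_graphs=[":stagingGraph"],
)
```

For the staging/production example, the default part resolves to :stagingGraph and the named part to :productionGraph, which matches the comments in the constraint query.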

SHACL Support Limitations

Stardog supports all the features in the core SHACL Language with the following exceptions:

  1. Stardog supports SPARQL-based constraints but does not support prebinding the $shapesGraph or $currentShape variables in SPARQL.
  2. Stardog does not support property validators.
  3. Stardog does not support the Advanced Features or the JavaScript Extensions.