Semantics and Interoperability
Semantic and Structural Interoperability for Metador Schemas¶
Prerequisites:
- basic understanding of Metador schemas and plugins (covered in previous tutorials)
- basic understanding of the concepts and tools used for the semantic web (RDF, OWL, JSON-LD, etc.)
Learning Goals:
- Understand the relationship and interplay between schemas in Metador and semantic standards
- Learn how to make a schema semantic using JSON-LD annotations
Introduction¶
Each Metador schema represents a structural encoding for objects that belong to some abstract or real-world category we have in our mind - we know what it means, because we can read the documentation and look at the code if needed. To make this information visible to a machine, service, tool that uses your metadata, but does not know anything about Metador, this "human level metadata" must be provided to give the metadata meaning - semantics. This does not only help machines, but also other people interested in your data and metadata.
In the context of Metador, we say that a schema is semantic if it is aligned and compatible with existing linked-data / semantic web standards. We assume that you have at least a rough idea about the vision, purpose and existing tooling, and you now want to know how to connect your schemas to existing vocabularies and ontologies, so that your metadata is interoperable with these tools, can be unambiguously interpreted, added into a knowledge graph, queried using SPARQL and profit from all the other nice features that the semantic web ecosystem provides. In the following, we will describe all the necessary steps to do this.
Context is Important: Preparation steps for semantic schemas¶
Until now, all the schemas we defined were structured, provided validation of the metadata, but were lacking a formal semantic interpretation of the fields, unless they inherited from a schema that already was providing some semantics (thus at least covering the inherited fields). If your schema represents a type for which a semantic standard already exists, you can make your schema fully semantic rather easily.
Semantics is provided for schemas by attaching JSON-LD annotations to them. The consequence of doing this is that every serialized object (as JSON or YAML, etc) will contain a @context
and a @type
field. These additional fields are "tacked on" automatically to each metadata object, and consequently, if you feed non-semantic metadata (either by hand, or using harvesters) into Metador schemas, you will obtain semantic metadata on the "output", meaning that tools and humans working with Metador containers will be able to make sense of your metadata with much less effort, given that the schemas you use are semantically "enriched" with the JSON-LD fields. For this to work, you have to do some preparations first.
A semantic standard such as an OWL ontology will define multiple kinds of entities and they can have various interrelationships and properties. Each schema you define should ideally be a representation of one such entity, which is exactly what will be declared as the @type
for your schema. If your schema does not match a defined entity, this is also fine - you still can and should use a suitable ontology, but it just could require some more work on the @context
. In any case, the first step is to have a conceptual understanding how schemas and fields are supposed to map onto the ontologies that you use.
Semantics Beginner: You will usually find a default context you can use in the documentation of your semantic metadata standard. It will typically be a URL (which points to the context object), such as https://w3id.org/ro/crate/1.1/context
or can be even as simple as just https://schema.org
.
Semantics Expert:
If you are combining multiple standards or have another use-case for a custom context (e.g. building a schema that does not correspond to a defined @type
), you can use an arbitrary JSON-like object as a context. For example, your context can be a Python dict
that defines the interpretations for your fields, e.g.:
# this is just a Python dict encoding the JSON-LD context:
my_context = {
"name": "http://schema.org/name",
"image": {
"@id": "http://schema.org/image",
"@type": "@id"
},
"homepage": {
"@id": "http://schema.org/url",
"@type": "@id"
}
}
You have to make sure that your context is valid JSON-LD and makes sense using other external tools, if necessary.
Semantics Beginner: The context is like a dictionary for looking up semantic interpretations, so your schema only makes semantic sense if the names you use have a definition. If you are using an existing standard, consult its documentation for the correct property names for various object types.
Semantics Expert:
If you are using a custom context, you probably have a good understanding of this. One limitation you have to keep in mind is that you cannot use "qualified" names using a prefix as is often done, because you cannot easily have a colon in a field name in Python. Your context therefore must fully define all the concrete field names you use. This means that you cannot call a schema field foaf:name
, but you are free to use any valid JSON-LD, including these kind of abbreviations, within your context definition - as long as in the end all the actual field names used in the schema are declared without any namespace prefix.
Otherwise these fields will have no semantic interpretation and will remain opaque to semantics-based tools.
Adding JSON-LD annotations to schemas¶
Now assuming that you understood and defined your semantic context, let us see how it can be used in schemas.
Everything you need for development of semantic schemas lives in the metador_core.schema.ld
module.
If you intend to define a semantic schema and do not extend another already semantic schema, use LDSchema
as the base class (instead of MetadataSchema
) to distinguish it from non-semantic schemas.
For a one-off way to attach some JSON-LD fields to a schema, you can use the default ld
decorator:
from pydantic.color import Color
from metador_core.schema.ld import LDSchema, ld
my_context = "https://www.example.com/my/context" # <- could also be some more complex JSON-LD object
@ld(context=my_context, type="Animal")
class MySemanticSchema(LDSchema):
furColor: Color
# create an instance:
myAnimal = MySemanticSchema(id_="https://www.animalid.org/01234", furColor="#ff8000")
# serialize the instance:
animalJson = myAnimal.json(indent=2)
print(animalJson)
# deserialize it back:
sameAnimal = MySemanticSchema.parse_raw(animalJson)
print("Loaded back same animal?", myAnimal == sameAnimal)
{ "@id": "https://www.animalid.org/01234", "furColor": "#ff8000", "@context": "https://www.example.com/my/context", "@type": "Animal" } Loaded back same animal? True
Notice that we did not specify @context
and @type
for the metadata object myAnimal
- the schema knows them already and just "tacks them on" to each animal metadata object, and whenever it is serialized/stored (in a Metador container, JSON file, etc.), it will have the correct annotation. When loading a serialized animal with these annotations, the schema will also not complain as long as these are exactly the ones we attached to the schema (remember, Metador does not actually understand semantics!).
You see that we used id_
even though we did not declare it. The id_
field is automatically available to all semantic schemas derived from LDSchema
, in order to set the JSON-LD @id
of a semantic object.
This is a property specific to each instance of a schema, a concrete metadata object, whereas the @type
and @context
are identical for all the instances. In Python/OOP jargon - @context
and @type
are class variables (with the special property of being constant values) and are attached using schema decorators, whereas @id
is an actual instance variable specific to individual objects - just as all the fields you usually define in your schema. Naturally, in Metador schemas we call fields that behave like @context
and @type
simply constant fields.
Custom LD Decorators¶
In most cases, you probably will have a context that you want to use for a whole collection of schemas, and ideally, each of them will represent a different @type
. In this case, first you should define a decorator to be able to quickly attach your @context
(and possibly the @type
) to a schema with less redundancy:
from metador_core.schema.ld import ld_decorator
my_semantics = ld_decorator(context=my_context)
The custom decorator works just like the ld
decorator, but has two advantages:
- you will not need to state the
context
for every single schema anymore - if your context needs to change, you only need to change it once for your decorator
Therefore, it is advisable to define a custom decorator whenever you use the a context for multiple schemas. A usage example:
from metador_core.schema.types import NonEmptyStr
# define our semantic schema:
@my_semantics(type="Animal")
class MySemanticSchema(LDSchema):
furColor: Color
print("Same result as above with custom decorator:")
print(MySemanticSchema(id_="https://www.animalid.org/01234", furColor="#ff8000").json(indent=2))
@my_semantics
class AnotherSemanticSchema(LDSchema):
something: NonEmptyStr
print("Instance of a schema with the @context, but no @type:")
print(AnotherSemanticSchema(something="hello").json(indent=2))
Same result as above with custom decorator: { "@id": "https://www.animalid.org/01234", "furColor": "#ff8000", "@context": "https://www.example.com/my/context", "@type": "Animal" } Instance of a schema with the @context, but no @type: { "something": "hello", "@context": "https://www.example.com/my/context" }
(Semantics Beginner) Q: I still don't get it, how exactly is the schema "better" now by adding these fields?
A: Never forget that machines are really, really stupid. Using structured ways to organize the metadata (using Metador schemas, JSON, etc.) instead of using free-form natural language helps a technical system to understand structure, the "shape" of your metadata - which is an important step forward. Technical systems can to a lot with data and metadata without any understanding, because the required understanding is provided by humans - the software developers who understand the domain and metadata and write software which uses it. The advantage of adding such semantic "hints" might not be obvious, if you are able to understand the field names and read the corresponding documentation. But imagine a schema designed in a language you don't know - could you make sense of a schema like this?
from metador_core.schema import MetadataSchema
from metador_core.schema.types import Int
class RuffleMeta(MetadataSchema):
"""A Ruffle is just a simple combination of the Quirzl phase and the Shpongle factor
measured during the Xylic-Yzgel process at a fixed time step."""
quirzl: NonEmptyStr
"""Quirzl phase of the Ruffle."""
shpongle: Int
"""Shpongle factor of the Ruffle."""
Maybe this is something you are familiar with, but with weird names, or maybe this is a field of science you have never seen before - you have no chance to know either way. Semantic methods such as JSON-LD annotations solve this problem by connecting your schemas and their fields to a formalized system for knowledge representation - objects that refer to the same entity in an ontology are supposed to mean the same kind of thing, regardless of how the field, schema or object is named. This helps a human who does not understand your language or domain, and also helps a machine which is trying to process your data without knowing all of its context as well as you do.
(Semantics Expert) Q: Why do I have to re-create all the schemas in Metador by hand? That's double-work!
A: There are multiple reasons why unfortunately you cannot simply import your ontology into Metador.
On a technical level, the RDF-based semantic web / linked data technologies work by a different logic than most other ways of organizing data. For example, OWL was designed with logical reasoning in mind, assuming that the available information is already existing and making sense. It does not care about validation and not allow to easily check for the "shape" of the information or have any assumptions about it. This shortcoming was addressed by the semantics community by creating other RDF-based languages, such as SHACL.
But while the validation problem might be seen as solved within the RDF-based world, the concepts do not trivially translate into structures in concrete programming languages. The best that a suitable tool, even if it would exist, could do would be trying to approximately convert from SHACL or OWL into some other way of defining entities - usually such automatic conversions are not readily usable and can at best be a starting point for manual tweaking and inspection. Even for non-semantic standards such as JSON Schema this is a non-trivial task and tooling could not automatically handle certain conceptual gaps between the logic of JSON Schema and other technologies with overlapping purpose, such as pydantic (which Metador uses). For example, neither JSON Schema nor RDF-based languages natively support OOP-like schema inheritance that way Metador allows and relies upon, and neither of them addresses the issue of parsing and normalization.
The non-technical answer is that there are many scientists that with some experience with Python, but there are very few semantics experts in research institutions who could formalize all requirements purely relying on RDF, OWL and SHACL. These languages are also not the area of expertise of typical software developers and research software engineers and many transformations that can easily be done based on JSON can be challenging using e.g. SPARQL queries. But both researchers and software engineers can quickly learn to use @context
and @type
correctly and can understand a high-level documentation of a well-designed and well-documented ontology enough to make use of semantics in their schemas.
So Metador supports and encourages providing semantics to increase interoperability between systems, but it is not and does not try to be a fully semantics-based system. Instead, it tries to make the gap between the typical JSON-based web technologies and the world of semantics as narrow as possible, by encouraging pragmatic use of JSON-LD.
Advanced: Constant fields in general¶
The same machinery can be used to attach arbitrary schema-specific (i.e. equal for all objects of that schema) constant fields that behave exactly like the JSON-LD annotations, i.e.:
- constant fields are not required when creating or loading a metadata object
- if those fields already exist in the object that is loaded into a schema, they are discarded
- when serializing an object, the object will have the constant fields as defined by its schema
Constant fields are useful for enriching metadata objects with additional information that is fully determined by their schema and which they always should "carry along with them", e.g. to provide additional required information when the metadata objects are used outside of Metador ecosystem.
If you need this functionality, take a look at the more general add_const_fields
decorator:
from metador_core.schema.decorators import add_const_fields
help(add_const_fields)
Help on function add_const_fields in module metador_core.schema.decorators: add_const_fields(consts: Dict[str, Any], *, override: bool = False) Add constant fields to pydantic models. Must be passed a dict of field names and the constant values (only JSON-like types). Constant fields are optional during input. If present during parsing, they are be ignored and overriden with the constant. Constant fields are included in serialization, unless `exclude_defaults` is set. This can be used e.g. to attach JSON-LD annotations to schemas. Constant fields are inherited and may only be overridden by other constant fields using this decorator, they cannot become normal fields again.
Summary¶
Semantics¶
- Semantically aligned schemas must be based on
LDSchema
(or use a semantic parent schema) - The JSON-LD
@id
field is accessed within Python asid_
(but@id
is expected in the input and used for output) - Before aligning your schemas with semantic standards, understand your
@context
and@type
s of entities that your schemas represent - Use
ld_schema_decorator
to create a@context
/@type
decorator for your schemas that share the same context - Use that decorator to attach these JSON-LD fields to a schema representing an entity type in your ontology
- Metador cannot check the correctness of your JSON-LD annotations
Constant Fields¶
- The general mechanism used for JSON-LD annotation of schemas is called constant fields
- Constant fields are fixed and equal for all objects created with the same schema
- Constant fields are not required and ignored if present when loading or creating a schema instance
- Constant fields of the schema are always attached when serializing a metadata object