Designing Metadata Schemas
Writing Schemas: Types! Types everywhere!
In this tutorial you will learn everything you need to know for modelling and representing your metadata and defining your own high-quality Metador schemas.
Prerequisites:
- basic understanding of Metador schemas and plugins (covered in previous tutorials)
- Optional, but helpful: some experience with dataclasses and/or pydantic
Learning Goals:
- Learn how to create a new schema or one based on an existing schema
- Learn how to model your metadata using expressive type hints
- Understand best practices for schema design and some common pitfalls to avoid
Defining a schema
The following is a perfectly valid, very simple schema:
from metador_core.schema import MetadataSchema
from metador_core.schema.types import Bool
class SimpleSchema(MetadataSchema):
"""My new schema, totally unrelated to any other schemas."""
fun: Bool
"""Flag whether the person reading this tutorial is having fun."""
SimpleSchema(fun=True) # instantiate schema / create metadata object
SimpleSchema(fun=True)
As you have seen in the last tutorial, a schema can be exposed as a plugin (just as any other kind of Metador plugin) by:
- adding a Plugin inner class that (at least) defines a name and a version
- declaring an entrypoint with the same name in the correct plugin group (for schemas: "metador_schema")
But there are still a number of questions before you can become productive with schema development, especially:
- how can you extend an existing schema correctly to get all the advantages provided by Metador?
- how can you express the requirements for the values that can go into all the different fields?
By the end of this tutorial, you will have an answer to both of these and many other questions.
Extending an existing schema: The absolute minimum
In the first tutorial, we mentioned that one core feature of schemas in Metador is that they can be easily extended and encouraged you to do so. We discussed metadata for a custom image format as an example. In the following, you will learn everything you need to do this in practice and we will arrive at a reasonable schema by the end of this tutorial.
We want to extend the core.imagefile schema to have some format-specific extra fields, so we take that schema as the base class for our schema. Furthermore, at the time we start writing our schema, the core.imagefile schema is available in a certain version. To make sure that our schema works as expected in the future, we must also state which version of core.imagefile we intend our schema to be based on.
Metador will try to detect if you forget to do this, but it cannot do so for every way in which plugins can interact. Therefore, keep this in mind when developing plugins - otherwise, when the plugins you use are updated with changes that your plugin is not prepared for, things might break.
In the important special case where you try to extend a plugin class that you retrieved without stating a version, Metador will actively stop you:
from metador_core.plugins import schemas
ImageFile = schemas["core.imagefile"] # <- we stated no version
try:
class NicheImage(ImageFile):
"""Schema for the .niche image format."""
except TypeError as e:
print(e)
NicheImage: Cannot inherit from plugin 'core.imagefile' of unspecified version!
To request a plugin with an expected version, access it using get and pass a version triple, like in the following example:
ImageFile = schemas.get("core.imagefile", (0,1,0)) # <- now this is better!
class NicheImage(ImageFile):
"""Schema for the .niche image format."""
class Plugin:
name = "dummy.imagefile.niche"
version = (0, 1, 0)
In practice, you would also register your schema plugin as an entrypoint in your own Python package, as described in the previous tutorial. For practical purposes, we will bypass this step in this and following tutorials, and instead will manually "load" the schema into the plugin system using the register_in_group function. That way you can conveniently stay in this notebook and follow along, without needing to copy-paste everything into a project.
from metador_core.plugin.util import register_in_group
register_in_group(schemas, NicheImage) # <- only used in tutorials, must not and cannot be used in practice!
Notebook: Plugin 'dummy.imagefile.niche' registered in 'schema' group!
On the surface nothing exciting happened, internally though, your schema was processed the same way it would be when loaded from an entrypoint - many checks were performed, and now it is accessible through the plugin interface.
Now the schema is registered, so let's try to access it through the plugin system:
MySchema = schemas["dummy.imagefile.niche"]
print(MySchema)
Schema <class '__main__.NicheImage'> (dummy.imagefile.niche 0.1.0) (version unspecified)
========================================================================================
Description:
------------
Schema for the .niche image format.

Fields:
-------
id_
  type: Annotated[Optional[NonEmptyStr], Field(alias='@id')]
  origin: metador_core.schema.ld.LDSchema
name
  type: Optional[Text]
  origin: metador_core.schema.common.schemaorg.Thing
  description: Name, title or caption of the entity.
alternateName
  type: Optional[List[Text]]
  origin: metador_core.schema.common.schemaorg.Thing
  description: Alternative names of the entity.
identifier
  type: Optional[Union[URL, Text]]
  origin: metador_core.schema.common.schemaorg.Thing
  description: Arbitrary identifier of the entity. Prefer @id if the identifier is web-resolvable, or use more specific fields if available.
url
  type: Optional[URL]
  origin: metador_core.schema.common.schemaorg.Thing
  description: URL of the entity.
description
  type: Optional[Text]
  origin: metador_core.schema.common.schemaorg.Thing
  description: Description of the entity.
version
  type: Optional[Union[NonNegativeInt, Text]]
  origin: metador_core.schema.common.schemaorg.CreativeWork
  description: Version of this work. Either an integer, or a version string, e.g. "1.0.5". When using version strings, follow https://semver.org whenever applicable.
citation
  type: Optional[Set[Union[LDOrRef[CreativeWork], Text]]]
  origin: metador_core.schema.common.schemaorg.CreativeWork
  schemas: CreativeWork, LDIdRef
  description: Citation or reference to another creative work, e.g. another publication, scholarly article, etc.
abstract
  type: Optional[Text]
  origin: metador_core.schema.common.schemaorg.CreativeWork
  description: A short description that summarizes the creative work.
keywords
  type: Optional[Set[Text]]
  origin: metador_core.schema.common.schemaorg.CreativeWork
  description: Keywords or tags to describe this creative work.
author
  type: Optional[List[LDOrRef[OrgOrPerson]]]
  origin: metador_core.schema.common.schemaorg.CreativeWork
  schemas: Person, Organization, LDIdRef
  description: People responsible for the work, e.g. in research, the people who would be authors on the relevant paper.
contributor
  type: Optional[List[LDOrRef[OrgOrPerson]]]
  origin: metador_core.schema.common.schemaorg.CreativeWork
  schemas: Person, Organization, LDIdRef
  description: Additional people who contributed to the work, e.g. in research, the people who would be in the acknowledgements section of the relevant paper.
maintainer
  type: Optional[List[LDOrRef[OrgOrPerson]]]
  origin: metador_core.schema.common.schemaorg.CreativeWork
  schemas: Person, Organization, LDIdRef
producer
  type: Optional[List[LDOrRef[OrgOrPerson]]]
  origin: metador_core.schema.common.schemaorg.CreativeWork
  schemas: Person, Organization, LDIdRef
provider
  type: Optional[List[LDOrRef[OrgOrPerson]]]
  origin: metador_core.schema.common.schemaorg.CreativeWork
  schemas: Person, Organization, LDIdRef
publisher
  type: Optional[List[LDOrRef[OrgOrPerson]]]
  origin: metador_core.schema.common.schemaorg.CreativeWork
  schemas: Person, Organization, LDIdRef
sponsor
  type: Optional[List[LDOrRef[OrgOrPerson]]]
  origin: metador_core.schema.common.schemaorg.CreativeWork
  schemas: Person, Organization, LDIdRef
editor
  type: Optional[List[LDOrRef[Person]]]
  origin: metador_core.schema.common.schemaorg.CreativeWork
  schemas: Person, LDIdRef
dateCreated
  type: Optional[DateOrDatetime]
  origin: metador_core.schema.common.schemaorg.CreativeWork
dateModified
  type: Optional[DateOrDatetime]
  origin: metador_core.schema.common.schemaorg.CreativeWork
datePublished
  type: Optional[DateOrDatetime]
  origin: metador_core.schema.common.schemaorg.CreativeWork
copyrightHolder
  type: Optional[LDOrRef[OrgOrPerson]]
  origin: metador_core.schema.common.schemaorg.CreativeWork
  schemas: Person, Organization, LDIdRef
copyrightYear
  type: Optional[Int]
  origin: metador_core.schema.common.schemaorg.CreativeWork
copyrightNotice
  type: Optional[Text]
  origin: metador_core.schema.common.schemaorg.CreativeWork
license
  type: Optional[Union[URL, LDOrRef[CreativeWork]]]
  origin: metador_core.schema.common.schemaorg.CreativeWork
  schemas: CreativeWork, LDIdRef
about
  type: Optional[Set[LDOrRef[Thing]]]
  origin: metador_core.schema.common.schemaorg.CreativeWork
  schemas: Thing, LDIdRef
subjectOf
  type: Optional[Set[LDOrRef[CreativeWork]]]
  origin: metador_core.schema.common.schemaorg.CreativeWork
  schemas: CreativeWork, LDIdRef
hasPart
  type: Optional[Set[LDOrRef[CreativeWork]]]
  origin: metador_core.schema.common.schemaorg.CreativeWork
  schemas: CreativeWork, LDIdRef
isPartOf
  type: Optional[Set[Union[URL, LDOrRef[CreativeWork]]]]
  origin: metador_core.schema.common.schemaorg.CreativeWork
  schemas: CreativeWork, LDIdRef
isBasedOn
  type: Optional[Set[Union[URL, LDOrRef[CreativeWork]]]]
  origin: metador_core.schema.common.schemaorg.CreativeWork
  schemas: CreativeWork, LDIdRef
contentSize
  type: <class 'pydantic.types.StrictInt'>
  origin: metador_core.schema.common.rocrate.FileMeta (plugin: core.file 0.1.0)
  description: Size of the object in bytes.
sha256
  type: <class 'metador_core.schema.types.NonEmptyStr'>
  origin: metador_core.schema.common.rocrate.FileMeta (plugin: core.file 0.1.0)
  description: Sha256 hashsum string of the object.
encodingFormat
  type: MimeTypeStr
  origin: metador_core.schema.common.rocrate.FileMeta (plugin: core.file 0.1.0)
  description: MIME type of the file.
width
  type: Pixels
  origin: metador_core.schema.common.ImageFileMeta (plugin: core.imagefile 0.1.0)
  schemas: Pixels
  description: Width of the image in pixels.
height
  type: Pixels
  origin: metador_core.schema.common.ImageFileMeta (plugin: core.imagefile 0.1.0)
  schemas: Pixels
  description: Height of the image in pixels.
bitrate
  type: Optional[Text]
  origin: metador_core.schema.common.schemaorg.MediaObject
  description: Bitrate of the entity (e.g. for audio or video).
duration
  type: Optional[Duration]
  origin: metador_core.schema.common.schemaorg.MediaObject
  description: Duration of the entity (e.g. for audio or video).
startTime
  type: Optional[TimeOrDatetime]
  origin: metador_core.schema.common.schemaorg.MediaObject
  description: Physical starting time, e.g. of a recording or measurement.
endTime
  type: Optional[TimeOrDatetime]
  origin: metador_core.schema.common.schemaorg.MediaObject
  description: Physical ending time, e.g. of a recording or measurement.
filename
  type: NonEmptyStr
  origin: metador_core.schema.common.rocrate.FileMeta (plugin: core.file 0.1.0)
  description: Original name of the file in source directory.
So we have successfully registered a schema, but it will act exactly the same as the parent schema - we did not change or define any fields! Before we actually add some fields for our new image type, we first need to clarify the rules of schema inheritance and understand better how useful fields can be defined. We will revisit our niche image format and complete it at the end of this tutorial.
The family life of schemas
Schemas in Metador only support single inheritance: a schema can declare only one parent schema plugin that it extends. This means that each Metador schema has a neat linear inheritance chain. We can inspect the inheritance chain of registered parent schemas like this:
schemas.parent_path("dummy.imagefile.niche", (0,1,0))
[PGSchema.PluginRef(group='schema', name='core.file', version=(0, 1, 0)), PGSchema.PluginRef(group='schema', name='core.imagefile', version=(0, 1, 0)), PGSchema.PluginRef(group='schema', name='dummy.imagefile.niche', version=(0, 1, 0))]
This tells us that core.file is the parent schema of core.imagefile, which in turn is the parent schema of our new schema.
We can also check the "descendant" schemas of a schema, like this:
schemas.children("core.file", (0,1,0))
{PGSchema.PluginRef(group='schema', name='core.imagefile', version=(0, 1, 0)), PGSchema.PluginRef(group='schema', name='dummy.imagefile.niche', version=(0, 1, 0))}
We get back a set of all schemas that, directly or indirectly, are based on core.file and can be used in every place where core.file is expected - so one can say that this is the set of installed schemas which are "compatible" with core.file as-is, without needing to do anything to the metadata.
Every metadata object that is valid according to a schema must also be valid according to its parent schema.
This is the most important requirement for writing schemas, as without this, building a hierarchy of schemas would have no value. A schema is like a "micro-standard", it can be designed in better or worse ways, but as long as people agree to use it, it has value. Extending a schema in a compatible way is like building on an existing standard, so it is a virtuous thing to do, if you try to make your metadata FAIR.
In our example, your responsibility is to ensure this for the declared parent plugin, core.imagefile. The authors of core.imagefile are responsible for making sure that each valid core.imagefile is a valid core.file, so you do not have to worry or think about that. If everyone is doing their part, the chain will work and your new schema will also be a valid core.file by transitivity, that is, "for free" from your perspective.
All schemas are equal, but some schemas are more equal: To Plugin or not
You might already have noticed that not every schema you define must be a plugin - you can define, use and even inherit from schemas that are not plugins just fine. Being registered as a plugin with a nice name and a well-maintained version is just an extra property, one which important schemas usually have. Every schema that you register as a plugin and do not declare auxiliary can be attached as-is to a node in a Metador container. So if you want to turn a schema that does not make sense as stand-alone node metadata - say, a schema for 3D positions - into a plugin, you should mark it as auxiliary.
So what kind of schemas should be a plugin?
- Every schema that you want others to put into containers must be a plugin
- Every schema that you want other people to actually use and possibly extend must be a plugin
- Every schema that is too general in scope or does not make sense on its own can be a plugin, but you must mark it as auxiliary
- Apart from these guidelines, it is up to you
Remember that nested schemas can be accessed through the Fields interface, so there is no need to declare all the small nested schemas your schema is built from as plugins - unless you think that they could be useful outside the context of your schema.
On the shoulders of giants: Extending a schema correctly
You have already seen in the last tutorial that schemas are defined using Python type hints, the same way dataclasses are defined. We will now look at how you can extend your schema, and how you can make sure that the parent compatibility is not violated.
Note that additional rules must be considered when you update a schema - this will be discussed later in the context of schema versioning.
When you define a new schema based on an existing parent schema, there are only two rules you have to follow:

1. You may add new fields that do not exist in the parent schema (or any of its ancestors).
2. You may redefine an existing field, but only by restricting it: every value your schema accepts for that field must also be accepted by the parent schema.

When loading a metadata object based on your schema, the parent schema will simply ignore your new fields, so adding a new field causes no problems up the "ancestry chain".
The case needing more care and consideration is the second rule: you must not redefine an existing field in a way that your schema would accept a value the parent schema would not accept. Notice that we say "exists" and not "is defined" - the parent itself can have inherited fields of its own, which you also have to keep in mind.
Fields also cannot be removed: if you could remove a field, you could then re-define it to something else entirely in another child schema, which would violate the parent compatibility rule, so this cannot be allowed.
If a parent field is mandatory, unfortunately you are out of luck and will have to provide a value. If it is optional, then you can simply ignore the field.
If you cannot accept these restrictions and design your schema respecting them, then the chosen parent class is not suitable for your purpose - you will either need to discuss changes you would like to see with the author of the parent schema (if they would be generally helpful), look for another better suited parent schema, or simply not inherit from any existing schema.
Now let us look at a toy example. Consider the following parent class:
from metador_core.schema import MetadataSchema
from metador_core.schema.types import Int, Float, Str
from typing import Union, Optional
@register_in_group(schemas)
class Parent(MetadataSchema):
class Plugin:
name = "dummy.parent"
version = (0, 1, 0)
foo: Union[Int, Str]
bar: Optional[Float]
qux: Str
Notebook: Plugin 'dummy.parent' registered in 'schema' group!
Now we will register a child class that satisfies the rules:
from metador_core.schema.decorators import make_mandatory
@register_in_group(schemas)
@make_mandatory("bar")
class Child1(Parent):
class Plugin:
name = "dummy.child1"
version = (0, 1, 0)
foo: Int
new_field: Bool
Notebook: Plugin 'dummy.child1' registered in 'schema' group!
You can see that we used the make_mandatory decorator - what it does is take an inherited field from the parent and make sure that it is not Optional anymore, i.e. it is turned into a required field. There are two advantages this decorator provides for this possibly simplest case of a field restriction:
- you do not have to "duplicate" the inherited type just to get rid of the Optional
- it clearly communicates in what way the field is changed compared to the parent schema
It will ensure that the field actually exists in the parent and will define the correct non-optional type for you automatically.
Now let us see what happens if we violate the parent consistency rules:
try: # we know it will go wrong
@register_in_group(schemas)
class Child2(Parent):
class Plugin:
name = "dummy.child2"
version = (0, 1, 0)
foo: Float
except TypeError as e:
print(e) # show the message
The type assigned to field 'foo' in schema <class '__main__.Child2'> (dummy.child2 0.1.0): <class 'pydantic.types.StrictFloat'> does not look like a valid subtype of the inherited type: typing.Union[pydantic.types.StrictInt, pydantic.types.StrictStr] from schema __main__.Parent (plugin: dummy.parent 0.1.0). If you are ABSOLUTELY sure that this is a false alarm, use the @overrides decorator to silence this error and live forever with the burden of responsibility.
In Child1, we re-declared foo to accept int values, which is fine - the parent declares foo as Union[Int, Str], which means that it accepts either an int or a str. Now in Child2 we tried to declare the field foo in our schema as a float, but the parent schema does not allow floats. The system can infer that something is wrong and will refuse to register this faulty schema.
Always remember the fields the parent schema inherited itself - the following attempt will also fail:
try: # we know it will go wrong
@register_in_group(schemas)
class Child3(Child1):
class Plugin:
name = "dummy.child3"
version = (0, 1, 0)
qux: Float
except TypeError as e:
print(e) # show the message
The type assigned to field 'qux' in schema <class '__main__.Child3'> (dummy.child3 0.1.0): <class 'pydantic.types.StrictFloat'> does not look like a valid subtype of the inherited type: <class 'pydantic.types.StrictStr'> from schema __main__.Parent (plugin: dummy.parent 0.1.0). If you are ABSOLUTELY sure that this is a false alarm, use the @overrides decorator to silence this error and live forever with the burden of responsibility.
Child3 inherits from Child1, which does not define qux, but it does inherit qux from Parent unchanged. Remember that you can use Fields to inspect all the fields declared in a schema. This will also tell you where a field is actually "coming from":
print(Child1.Fields)
foo
  type: <class 'pydantic.types.StrictInt'>
  origin: __main__.Child1 (plugin: dummy.child1 0.1.0)
new_field
  type: <class 'pydantic.types.StrictBool'>
  origin: __main__.Child1 (plugin: dummy.child1 0.1.0)
  description: StrictBool to allow for bools which are not type-coerced.
bar
  type: <class 'pydantic.types.StrictFloat'>
  origin: __main__.Child1 (plugin: dummy.child1 0.1.0)
qux
  type: <class 'pydantic.types.StrictStr'>
  origin: __main__.Parent (plugin: dummy.parent 0.1.0)
Python Typology
After seeing how schema and field inheritance work and learning how to declare a child schema correctly without violating parent compatibility, now you might wonder what kind of types you can use in your schemas in order to express as precisely as possible what values are supposed to go into which fields. The possibilities, in fact, are limitless and there are many ways to model the same requirement. To get you started and give you an idea, we will give a quick overview of the most common and most useful types, give some guidance on how and when to use them, and equally important - what to avoid.
Primitives, times, dates and pydantic built-in types
The pydantic library - the beating heart of Metador schemas - also provides many useful type hints which you can use for various kinds of information, including entities such as IP addresses, URLs, colors and more (for some things we advise against using the types pydantic provides, which is discussed further below).
Furthermore, in metador_core.schema.types we provide a number of generally useful types for different purposes, which are meant to be imported directly and used in your schemas. Take a look at them and prefer reusing or extending them before you start defining your own types.
For strings, you should use NonEmptyStr from metador_core.schema.types. It makes sure that only strings which are non-empty and contain non-whitespace characters are accepted - this is what you usually want when requesting a string.
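As a rough illustration, the check that NonEmptyStr enforces can be sketched in plain Python (a toy stand-in, not the actual implementation - the real NonEmptyStr is a proper type usable in annotations):

```python
def is_non_empty_str(value) -> bool:
    """Toy check mirroring what NonEmptyStr enforces (hypothetical helper)."""
    # Accept only strings containing at least one non-whitespace character.
    return isinstance(value, str) and value.strip() != ""

assert is_non_empty_str("hello")
assert not is_non_empty_str("")     # empty string is rejected
assert not is_non_empty_str("   ")  # whitespace-only string is rejected
assert not is_non_empty_str(42)     # non-strings are rejected
```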
For values corresponding to the built-in Python types bool, int and float, you should use the types Bool, Int and Float from metador_core.schema.types (these are essentially aliases for the Strict types provided by pydantic). However, for numeric fields you should rarely need them and in many situations will probably use the more precise constrained types which are discussed further below.
You can also use date, datetime and time from the datetime module to represent the corresponding quantities.
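In serialized metadata (e.g. JSON), such values typically appear as ISO 8601 strings, which map directly onto the standard library objects (this is also how pydantic parses them):

```python
from datetime import date, datetime, time

# ISO 8601 strings are the usual wire format for dates and times in metadata.
d = date.fromisoformat("2023-05-17")
t = time.fromisoformat("13:45:00")
dt = datetime.fromisoformat("2023-05-17T13:45:00")

assert (d.year, d.month, d.day) == (2023, 5, 17)
assert dt.date() == d and dt.time() == t
```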
Metador schemas
You can always use any other existing Metador schemas when defining a field.
Breaking down complex and nested hierarchical structures into schemas for the different parts should be your preferred and default approach when modelling metadata. For example, we can use one of the schemas we defined above inside another schema and it will work as expected:
class SomeSchema(MetadataSchema):
some_field: Float
nested_obj: Child1
print(SomeSchema(some_field=1.23, nested_obj=Child1(foo=1, bar=3.14, qux="hi", new_field=True)).yaml())
nested_obj:
  bar: 3.14
  foo: 1
  new_field: true
  qux: hi
some_field: 1.23
Optional values
If you want to declare a field that is relevant, but might not always be available, use typing.Optional in order to make omitting the value possible (if omitted, the value of this field will automatically be None).
Only relax a field to Optional when you are sure that there is no feasible way to provide the desired information consistently. This way you will break fewer potential child schemas that are based on your schema.
If a child schema relies on a field being optional, but you suddenly make it mandatory, the child schema will also have to make it mandatory (otherwise it violates the parent rule discussed above by allowing a "missing value", i.e. None, that is not allowed in the parent, that is, your schema). Also, obtaining missing information for a schema after the fact is harder or even impossible (e.g. you cannot re-do the scientific experiment), so it is better to err on the side of "strictness" up-front.
Examples: Optional[Int], Optional[SIValue]
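The resulting behavior can be sketched with a plain dataclass, which handles an omitted optional field analogously (toy code, not a Metador schema):

```python
from dataclasses import dataclass
from typing import Optional

# Sketch of mandatory vs. optional fields; pydantic-based MetadataSchema
# classes behave analogously when parsing metadata.
@dataclass
class Measurement:
    value: float                   # mandatory: must always be provided
    comment: Optional[str] = None  # optional: becomes None when omitted

m = Measurement(value=1.5)
assert m.comment is None  # the omitted optional field is None, not "" or 0
```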
Collections
To describe a collection of values or objects of the same kind, you can use typing.List or typing.Set.
Many things where you might first instinctively use a List are actually, semantically, Sets, so make sure that you choose one or the other consciously. This has actual practical consequences for the behavior of harvesters (which we will talk about in a different tutorial).
Examples: List[Float], Set[AnyHttpUrl]
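The semantic difference is visible already in plain Python values (field names here are made up for illustration):

```python
from typing import List, Set

# A List preserves order and duplicates; a Set does not.
# Repeated measurements are meaningful, so they form a List:
readings: List[float] = [1.0, 1.0, 2.5]
# Keywords are semantically a Set - a duplicate adds no information:
keywords: Set[str] = {"microscopy", "microscopy", "in-situ"}

assert len(readings) == 3  # duplicates are kept
assert len(keywords) == 2  # duplicates collapse
```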
Literals and Enums
Sometimes you want a field to have a fixed value, or a value from a controlled list. The tools that you can use are Literals (from typing) and Enums (from enum). The rule of thumb is that you should probably use a simple Literal when there is just one or a handful of permitted values, but for a longer list you should define an Enum class.
Examples: Literal["a", "b", "c"], Literal[0, 42], Literal["always_this_value"], SimAlgorithm, with SimAlgorithm e.g. defined as:
class SimAlgorithm(str, Enum):
simple = "simple-simulation"
fancy = "fancy-simulation"
# ... other supported algorithms
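Because SimAlgorithm subclasses str, its members behave like plain strings where it matters, which makes such enums convenient in serialized metadata. A self-contained illustration (repeating the definition from above):

```python
from enum import Enum

class SimAlgorithm(str, Enum):
    simple = "simple-simulation"
    fancy = "fancy-simulation"

# Lookup by value recovers the member, so serialized metadata round-trips:
assert SimAlgorithm("fancy-simulation") is SimAlgorithm.fancy
# As a str subclass, members compare equal to their plain string value:
assert SimAlgorithm.simple == "simple-simulation"
```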
Constrained types
Constrained types are variants of the primitive and collection types that we discussed above with a restricted value range. They can be used to represent requirements such as:
- the number must be positive
- the value must be in $[-\pi, \pi)$
- the string must have a length between 100 and 1000 characters
- the list can have at most 7 elements
For constrained types in Metador schemas, we prefer using types based on the phantom library. The advantage of phantom types is that they work well with schema inheritance and the automatic checks that make sure your types are compatible with the parent schema, whereas the constrained types provided by pydantic do not have these nice properties.
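The core idea behind phantom types can be sketched in plain Python: a class that is never instantiated, but whose isinstance check runs a predicate. This is a toy illustration of the mechanism, not the actual phantom implementation:

```python
class _PredicateMeta(type):
    def __instancecheck__(cls, value) -> bool:
        # Membership is decided by a predicate instead of by construction.
        return isinstance(value, int) and not isinstance(value, bool) and value > 0

class PositiveInt(metaclass=_PredicateMeta):
    """Toy phantom-style type: an int that must be > 0 (illustration only)."""

assert isinstance(5, PositiveInt)
assert not isinstance(-3, PositiveInt)
assert not isinstance("5", PositiveInt)
```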
What to avoid
Alternative choices with Union
If you must accept values of multiple different types, you can use typing.Union.
The first reason to avoid them is that each Union
forces any tool that wants to use objects following your schema to do a case analysis ("if the field is this type, do this, or if that type, do that"). While this kind of "controlled ambiguity" can be useful, desirable or at least non-problematic in certain situations, in the context of metadata modelling it should be avoided, because it complicates the (re-)use of the metadata (including any transformations and mechanical inspection).
The second reason to avoid Union is that there exist certain unintuitive subtleties that can lead to unexpected or unintended behavior. For example, due to how pydantic parses metadata, when loading a metadata object, a value will always be of the first type (in the order listed in the Union) for which the conversion succeeds.
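This "first match wins" behavior can be modeled without pydantic. The sketch below is a simplified illustration of the pitfall, not pydantic itself (Metador's strict Int/Str reduce silent coercion, but the ordering sensitivity is the same):

```python
def parse_union(value, candidate_types):
    # Try each candidate in order; the first successful conversion wins,
    # so later (possibly better-fitting) alternatives are never considered.
    for t in candidate_types:
        try:
            return t(value)
        except (TypeError, ValueError):
            continue
    raise ValueError(f"no type accepts {value!r}")

# With the ordering (int, str), the string "42" is converted to the int 42:
assert parse_union("42", (int, str)) == 42
# Reversing the order keeps it a string - same input, different result:
assert parse_union("42", (str, int)) == "42"
```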
Completely avoiding Unions could be difficult, e.g. if your schema implements an existing standard where such alternatives ("either a number or a string") are already allowed. So please be especially careful with Unions, and when testing your schema, pay special attention to how fields with Union types behave. At the very least, make sure that you understand how the parsing logic works (see the tutorial on custom parsers) before using Unions extensively.
Unconstrained numbers and strings
Avoid "plain" types such as Int and Float in your schemas, except in the drafting stages of your schema. We used them in the examples to keep things simple, but in actual use cases there usually are restrictions that should apply - values you should exclude because they do not make sense in your context. Always try to formulate a constrained type and only fall back to these primitive types when there is no other solution.
This also applies to strings. For example, above we said that you should use NonEmptyStr instead of Str. The reason is that even the empty string "" is still a string, and so is " " - it is not considered a missing value, which could lead to unintended consequences. This is an example of a value constraint that seems like "common sense", but is easy to overlook when defining a schema and can lead to subtle errors down the line.
Making sure that missing values are represented in an unambiguous way is crucial for harvesters to work correctly. There already exists a unique, unambiguous value representing missing information in Python, which is None, and the way to state that information may be missing is to wrap the type of the value in Optional to allow None. This kind of discipline might feel unfamiliar, but you will see that it prevents many avoidable mistakes and removes ambiguity - all values that are not None are considered to be meaningful values.
Optionality is enabled explicitly by Optional, and the unique "missing value" is None!
Tuple:
There is a type hint Tuple in typing that can be used for tuples, but there are not many cases where a tuple is the best solution. Instead of a tuple where each component is semantically different or has a different type, such as Tuple[int, str, bool], you should write a schema and give those components a name.
In cases where you expect a sequence of elements of the same type, you usually want a List (possibly with constraints), except when you really have a fixed number of items.
One defensible case for a Tuple would be a vector in a space with fixed dimensions, such as a 2D or 3D vector. For such use cases, assuming that you provide documentation about the meaning (e.g. using Tuple[float, float] and documenting that it is supposed to be an (x,y) position), this would be a reasonable definition. Even then, defining a helper schema with x: float and y: float could be the better choice, because the meaning is made explicit in the field names and thus can be interpreted without relying on additional documentation.
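The trade-off can be sketched side by side; here a dataclass stands in for the helper schema, which in a real Metador schema would be a MetadataSchema subclass:

```python
from dataclasses import dataclass
from typing import Tuple

# Implicit: nothing in the type says which component is x and which is y -
# only external documentation can tell the reader.
position: Tuple[float, float] = (1.0, 2.0)

# Explicit: the meaning travels with the field names themselves.
@dataclass
class Position2D:
    x: float
    y: float

p = Position2D(x=1.0, y=2.0)
assert (p.x, p.y) == position
```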
Dict:
There is a type hint Dict in typing, but good use cases for it are rare and of a technical nature. If you have something which looks like a dict, you should define another schema encoding all the information about the structure, and use that instead. Remember that you can freely define schemas which are not registered as plugins and use them as pieces for building up your larger, possibly nested plugin schema.
dataclasses and other dataclass-like things that are not Metador schemas:
There are many libraries that superficially look like they work the same way as, or very similarly to, Metador schemas. However, they will not be compatible - or worse, they can look as if they work, but will break in specific situations - so do not use them inside your schemas.
When you consult the pydantic documentation, keep in mind that instead of BaseModel, the top-level class used in Metador for our models is MetadataSchema.
Pydantic constrained types:
This applies to pydantic types such as PositiveInt and NegativeFloat, and to types constructed with the conint, constr, etc. helper functions.
Documenting Schemas
Use Python docstrings, for the schema class as well as for the fields, to document their meaning and purpose. This is the information others will consult when trying to use your schema, and it helps them decide whether your schema is useful for their purpose. You should spell out information that you technically encode into constrained types whenever it is not absolutely obvious - you don't have to explain that a NonEmptyStr is a non-empty string, but if MySpecialParameter is actually a constrained type representing a number range, the range should also be explained in the documentation of the field that uses it. Furthermore, the documentation should include all non-technical, human-level information about the intended context of the schema, which helps others to use and interpret it correctly.
A well-documented schema might actually contain more documentation text than actual "code", e.g. like this:
```python
from typing import Literal

from phantom.interval import Inclusive

from metador_core.schema.ld import LDSchema, ld_decorator


class FluffinessScore(float, Inclusive, low=0, high=10):
    """Fluffiness score for the animal, carefully estimated by petting it.

    The value is in the closed interval [0, 10], where the score is
    estimated on a linear scale assuming that

    * 0 means "it has no hair"
    * 10 means "absolute fluffball"
    """


my_semantics = ld_decorator(context="https://www.example.com/animal-ontology")


@my_semantics(type="Animal")
class AnimalMeta(LDSchema):
    """Metadata for representing animals in the jungle pet discovery project."""

    voice: Literal["dog-like", "cat-like", "bird-like", "other"]
    """Voice category of the animal.

    We classify animals into:

    * dog-like (if they bark)
    * cat-like (if they meow)
    * bird-like (if they chirp)
    * other
    """

    fluffiness: FluffinessScore
    # if no docstring is given and the typehint is a class,
    # then its docstring will be used.


# let's describe an animal!
print(AnimalMeta(id_="https://petid.org/874", voice="dog-like", fluffiness=5).yaml())
```
```yaml
'@context': https://www.example.com/animal-ontology
'@id': https://petid.org/874
'@type': Animal
fluffiness: 5
voice: dog-like
```
At last: A wonderful niche schema¶
Now let's use what we learned to properly define our custom image file format, as promised in the beginning:
```python
from metador_core.schema.types import MimeTypeStr


class NicheMimetype(MimeTypeStr, pattern=r"image/niche"):
    """The MIME type of .niche image files."""


# we extend the semantic context of ROCrate (ImageFile is based on ROCrate), as described here:
# https://www.researchobject.org/ro-crate/1.1/appendix/jsonld.html#extending-ro-crate
my_context = ["https://w3id.org/ro/crate/1.1/context", {
    "animalMeta": "https://www.example.com/animal-ontology/Animal",
}]
ext_rocrate = ld_decorator(context=my_context)


@register_in_group(schemas)
@ext_rocrate(type="File")
class NicheImage(ImageFile):
    """Schema for the .niche image format.

    The format achieves improved compression for images of animals,
    given some information about the depicted animal.
    """

    class Plugin:
        name = "dummy.imagefile.niche"
        version = (0, 1, 0)

    # we constrain the allowed MIME type field from core.imagefile
    encodingFormat: NicheMimetype
    """The MIME type of .niche image files, must be 'image/niche'."""

    # we add a new field with the new relevant information for our format,
    # which we conveniently already have defined earlier
    animalMeta: AnimalMeta
```
```
Notebook: Plugin 'dummy.imagefile.niche' registered in 'schema' group!
```
Now let's take our new schema for a ride and create a metadata object with information about an image file encoded in our `.niche` format:
```python
img_meta = NicheImage(
    # some dummy values (that a harvester would usually get for you):
    filename="someimage.niche",
    sha256="abc",
    contentSize=123,
    height=100, width=200,
    # now our custom added fields:
    encodingFormat="image/niche",
    animalMeta=AnimalMeta(voice="cat-like", fluffiness=3),
)
yaml_meta = img_meta.yaml()

print(f"Metadata as seen through {NicheImage.Plugin.name}:")
print(yaml_meta)
print(f"Metadata as seen through {ImageFile.Plugin.name}:")
print(ImageFile.parse_raw(yaml_meta).yaml())
```
```
Metadata as seen through dummy.imagefile.niche:
'@context':
- https://w3id.org/ro/crate/1.1/context
- animalMeta: https://www.example.com/animal-ontology/Animal
'@type': File
animalMeta:
  '@context': https://www.example.com/animal-ontology
  '@type': Animal
  fluffiness: 3
  voice: cat-like
contentSize: 123
encodingFormat: image/niche
filename: someimage.niche
height:
  '@context': https://schema.org
  '@type': QuantitativeValue
  unitText: px
  value: 100
sha256: abc
width:
  '@context': https://schema.org
  '@type': QuantitativeValue
  unitText: px
  value: 200
Metadata as seen through core.imagefile:
'@context': https://w3id.org/ro/crate/1.1/context
'@type': File
animalMeta:
  '@context': https://www.example.com/animal-ontology
  '@type': Animal
  fluffiness: 3
  voice: cat-like
contentSize: 123
encodingFormat: image/niche
filename: someimage.niche
height:
  '@context': https://schema.org
  '@type': QuantitativeValue
  unitText: px
  value: 100
sha256: abc
width:
  '@context': https://schema.org
  '@type': QuantitativeValue
  unitText: px
  value: 200
```
You can see how metadata objects created with our new schema are reusable and interoperable, both within the Metador tool ecosystem and beyond:
- As we used schema inheritance, people can still see and access all the fields that are not specific to your schema, even without having your schema plugin available
- As we added JSON-LD annotations, a semantic system can understand the metadata without having any idea about Metador and its concepts
Managing Change Responsibly: Versioning of Schemas¶
When you write the first version of a schema, you have a lot of freedom in how you want to design it. But once others start using it, you have the responsibility to be careful with the changes you make and to avoid changes that can or will break child schemas that may already have been created. Make sure that the severity of changes is reflected in the version of your schema plugin - which has to follow strict semantic versioning.
If your schema plugin $X$ had version `(MAJOR, MINOR, PATCH)` and you made changes to it, directly or indirectly, resulting in an updated schema $X'$, you have to update the version of your schema as well.
A non-exhaustive list of relevant changes includes:
- Adding, changing or removing fields of $X$
- Adding, changing or removing schema decorators that affect fields of $X$ (such as the LD annotations or `@make_mandatory`)
- Doing any of the above to a schema that $X$ depends on (e.g. nested schemas, parent schemas)
- Updating the required version of any plugin that $X$ depends on (e.g. ones you reuse from others)
It is important to understand the effect your changes have on the ability to process metadata that was already created with the previous version. Some changes do not require any action, but others do. Changes that "break" things should be as rare as possible, but of course they are sometimes unavoidable. Breaking changes are always annoying, but not necessarily a horrible experience - if managed well, they just require some extra work, which usually is straightforward. Good management of breaking changes is the reason why versioning discipline is important. This includes updating the semantic versioning triple (so that machines can see whether a schema and some metadata are compatible), and communicating the changes by other means (inform users and provide ways to upgrade their existing metadata to the new version of the schema).
Some changes are backward-compatible, meaning that your schema can be put in place of the older version and nothing will break - every metadata object the previous version of your schema created is still valid for the new version.
In rare cases, changes might even be forward-compatible, meaning that older versions of your schema will work with metadata objects created by your newer version - for example, if you improve the schema definition without affecting in any way what kind of objects it can process.
Let $v(S)$ denote the set of metadata objects that are valid according to schema $S$. Then the rules for version bumping are as follows (bumping the version here means incrementing the corresponding component and resetting the less important components back to $0$). You have to bump:
- `PATCH` if $v(X') = v(X)$
- `MINOR` if $v(X') \supset v(X)$
- `MAJOR` if $v(X') \subset v(X)$
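The bump rules can be made tangible with finite toy value sets; `required_bump` below is a hypothetical helper written for this tutorial (real value sets are usually infinite, so this is purely illustrative):

```python
# Toy illustration of the bump rules, with value sets as plain Python sets.

def required_bump(old: set, new: set) -> str:
    if new == old:
        return "PATCH"  # exactly the same objects are accepted
    if new > old:
        return "MINOR"  # strictly more objects accepted: backward-compatible
    # strict subset, or incomparable sets: previously valid objects get rejected
    return "MAJOR"

assert required_bump({"a", "b"}, {"a", "b"}) == "PATCH"
assert required_bump({"a", "b"}, {"a", "b", "c"}) == "MINOR"
assert required_bump({"a", "b"}, {"a"}) == "MAJOR"
```

Note that the fallback case also covers incomparable value sets (some objects added, some removed), which are breaking as well.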
You can always loosen your requirements in future versions without breaking existing metadata, but not the other way round.
This is quite abstract, so here are a few concrete examples:
- Any field newly added in $X'$, or constrained more tightly than before, makes $v(X') \subset v(X)$
This is also true if the new field is optional - if the field is provided, it is validated and must have the correct type; the schema cannot just "ignore" values that are wrong. Think of the case that someone extended your schema and added a field with the same name which has an incompatible type. Your new field then breaks "parent compatibility" for that schema - a breaking change.
- Adding a custom parser that can process more inputs without changing the declared type makes $v(X') \supset v(X)$
We will not discuss metadata normalization and custom parsers in this tutorial (there is a separate tutorial on this topic), but this is one way in which the set of accepted values can actually be expanded in a backward-compatible manner.
- Fixing a bug that made $X$ reject something it was actually supposed to accept in the first place can be considered $v(X') = v(X)$
Of course, strictly speaking, the set of accepted objects changes between the two versions - but here it is about the semantic intent. This of course only applies to mistakes that affect only specific edge cases (for example, your integer constraint was off by one, or you were using an open instead of the intended closed interval). Decide wisely what is actually a "bug fix", and what is a "schema change".
- Updating the version of a schema plugin your schema depends on requires you to bump on the same level.
This means, if your schema has version `(0, 1, 0)` and was based on a parent schema (or a nested schema) with version `(1, 2, 0)`, but you update it to its newest version `(2, 0, 0)`, then your own schema version must change to `(1, 0, 0)`.
Testing Schemas¶
Writing a schema is only one half of the job, though. To make sure that everything works correctly, a schema must be properly tested. Does the schema accept only the field values that are supposed to be accepted, and reject values that do not make sense? Does parent compatibility actually hold if we try it for concrete instances? All these questions and our expectations about how the schema behaves should be codified into a proper set of tests. Especially as a schema is developed further over time, a test suite will help to detect actual mistakes as well as accidental "breaking changes" you did not consider.
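As a sketch of what such tests can look like: the stand-in class below hand-rolls the validation of the `AnimalMeta` example from above, using only the standard library so that the example is self-contained. With a real Metador schema you would construct the schema class itself and check for validation errors (e.g. with `pytest.raises`):

```python
VOICES = {"dog-like", "cat-like", "bird-like", "other"}

class FakeAnimalMeta:
    """Stand-in for AnimalMeta: validates on construction, like a schema would."""

    def __init__(self, voice: str, fluffiness: float):
        if voice not in VOICES:
            raise ValueError(f"invalid voice: {voice!r}")
        if not 0 <= fluffiness <= 10:
            raise ValueError(f"fluffiness out of range: {fluffiness}")
        self.voice = voice
        self.fluffiness = fluffiness

def test_accepts_valid_values():
    obj = FakeAnimalMeta(voice="cat-like", fluffiness=3)
    assert obj.voice == "cat-like" and obj.fluffiness == 3

def test_rejects_invalid_voice():
    try:
        FakeAnimalMeta(voice="whale-like", fluffiness=3)
    except ValueError:
        pass
    else:
        raise AssertionError("invalid voice was accepted")

def test_rejects_out_of_range_fluffiness():
    try:
        FakeAnimalMeta(voice="other", fluffiness=11)
    except ValueError:
        pass
    else:
        raise AssertionError("out-of-range fluffiness was accepted")

test_accepts_valid_values()
test_rejects_invalid_voice()
test_rejects_out_of_range_fluffiness()
```

The important pattern is to test both directions: valid values must be accepted, and each constraint must demonstrably reject a value that violates it.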
Summary¶
This was a long-winded tutorial - if you made it here, congratulations! The good news is that you now have a deep understanding of schemas - the most important entity in the Metador system. You will see that most other plugin types are actually simpler! You really earned a break, and some summary notes:
Plugins¶
- Not every schema you define must be a registered plugin; nested schemas can be accessed through the `Fields` interface
- Useful schemas that should not be attachable to container nodes can be plugins, but must be marked as `auxiliary`
- Auxiliary schemas (e.g. small nested schemas) that are only useful in the scope of your schema should not be plugins
Inheritance¶
- Every schema is either a direct subclass of `MetadataSchema` or specializes/extends an existing parent schema
- Parent compatibility: When extending a schema, you can only add new fields or restrict the values allowed for existing fields
Type Hints¶
- Use `Optional` when a field can be missing, but for many reasons you should prefer mandatory fields
- Use `Literal` types and `Enum` classes for discrete, fixed controlled lists of allowed values
- Use classes from the `datetime` package for times and dates
- Use classes from the `phantom` package for constrained types, such as number ranges
- Use default `pydantic` types for things like URLs
- Avoid `pydantic` constrained types, unless no child schema would want or need to narrow down the constraints
- Avoid unconstrained default Python types as hints, unless you really have no special requirements
- Avoid `Tuple`, unless the meaning of the components is rather trivial and suitable for a tuple
- Avoid `Dict`, unless you have no idea what it can contain and are sure it is the only way
- Avoid `Union`, unless you really need it and understand how parsing in schemas / pydantic works
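The checklist can be illustrated in one compact, entirely hypothetical example; a plain dataclass stands in for a real schema here, so only the type hints carry the message:

```python
from dataclasses import dataclass
from datetime import date
from enum import Enum
from typing import Literal, Optional

class License(str, Enum):
    """Controlled list of licenses (made up for illustration)."""
    CC_BY = "CC-BY-4.0"
    MIT = "MIT"

@dataclass
class MeasurementMeta:
    """Hypothetical metadata record following the type hint checklist."""
    method: Literal["optical", "acoustic"]  # fixed set of values -> Literal
    recorded: date                          # dates and times -> datetime classes
    license: License                        # controlled list -> Enum
    note: Optional[str] = None              # may be missing -> Optional (use sparingly)

m = MeasurementMeta(method="optical", recorded=date(2024, 1, 1), license=License.MIT)
```

In a real schema the same hints would be used on a `MetadataSchema` subclass, where they additionally drive validation.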
Versioning¶
- Semantic versioning is to be followed for a responsible management of breaking changes
- For a schema, any change that will reject objects that would have been accepted before is a breaking change
- To minimize breakage, start with strict schemas (tight value bounds, mandatory fields) and relax requirements in revisions, if necessary
- Do not forget implicit changes, such as updating the version of a schema that your schema depends on
Documentation and Testing¶
- Make sure to provide documentation in the schema and field docstrings to help the users of your schema
- Write tests for your schemas
Appendix 2: A little glossary¶
You hopefully were able to follow the explanations with an intuitive understanding of the terms we used. Going forward, it might be helpful to understand how multiple related but different terms are connected, and to point out explicitly the different perspectives and contexts they come from.
Schemas, models, classes / fields, attributes, keys:
- A Metador schema is a pydantic model that is extended with many additional features specific to Metador
- A pydantic model is similar (but technically unrelated) to Python dataclasses, but with powerful validation capabilities
- All of these are just Python classes whose main purpose is to carry around (meta)data in a structured way
- The attributes of an object are called fields in pydantic and Metador schemas, and correspond to keys in unstructured `dict`s
Classes, types, sets / instances, values:
- A Python class is a complex data type, just like the regular data types you know; essentially it is just a fancy `dict`
- A class instance therefore is just a value of that type, so being a subclass is the same as being a subtype
- Each type has a finite or infinite number of possible values, which we will naturally call its value set
- A type is a proper subtype if and only if its value set is a subset of the value set of the other type
- A subtype is more constrained or narrow than the original type
- A value is-a (instance of) a type if and only if it is contained in the value set of that type
Inheritance and subtypes:
- A subschema (or child schema) is just a subclass of an existing parent schema, which is its base class
So the following terms are intimately connected:
- inheritance, thinking in OOP terms (representing an is-a relationship)
- subtyping, thinking in terms of types and values
- subsets, thinking in terms of value sets
Composition of schemas and types:
- A nested schema is a schema which is used in a field definition of a larger schema
- A schema is a composition of its fields
So you compose schemas / classes / types to describe how larger objects are built up from smaller ones.
So the following terms are also just different ways of looking at the same thing:
- composition of classes (has-a relationship in OOP terms)
- the product of types
- the cartesian product of the value sets
Don't worry about the last two interpretations, if they are not familiar to you. These were just listed to complete the picture, but we will stick to calling this relationship either composition or nesting.
Here are a few concrete examples:
- The number `5` relates to `int` like a metadata object (class instance) relates to its schema (class) (is-a)
- The type `int` relates to `Union[int, str]` like a child schema relates to its parent schema (subtype-of)
- The `int` and `str` relate to `Tuple[int, str]` like a nested schema relates to the whole schema (composition)
- The integers relate to `int` like all valid metadata objects relate to their schema (value set)
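These correspondences can even be checked directly with Python's own typing machinery; a small illustrative sketch, not Metador-specific:

```python
from typing import Tuple, Union, get_args

# is-a: a value is an instance of its type,
# like a metadata object is an instance of its schema
assert isinstance(5, int)

# subtype-of: bool is a subclass of int, like a child schema and its parent;
# every bool value is also a valid int value (a subset of the value set)
assert issubclass(bool, int)

# int also "narrows" Union[int, str], the way a child schema narrows its parent
assert int in get_args(Union[int, str])

# composition: Tuple[int, str] is built up from int and str,
# like a schema is composed of its (possibly nested) fields
assert get_args(Tuple[int, str]) == (int, str)
```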
All of this might be a bit confusing at first, but once the connections between these perspectives "click", you will be rewarded by a better understanding of data in general, which will also guide you to better schemas.