DirSchema Manual¶
DirSchema was created to describe the structure of datasets as well as metadata requirements, under the assumption that the metadata for files is stored in separate JSON files that follow a reasonable and consistent naming convention throughout the dataset.
For this purpose, it lifts JSON Schema validation from the level of individual JSON files to hierarchical directory-like structures. Besides actual files and directories, a DirSchema can therefore also be used to validate archive files, such as an HDF5 or ZIP file.
Adapter Interface¶
A DirSchema can be evaluated on any kind of tree-shaped hierarchical entity that supports the following operations:
- return a list of paths to all "files" and "directories" (normalized as described below)
- test whether a path is a "directory" (inner node)
- test whether a path is a "file" (leaf)
- load a "file" (typically JSON) to perform JSON Schema or custom validation on it
In the following we will use the language of directories and files and will refer to the whole tree structure as the dataset. Other directory/file-like structures can be processed if a suitable adapter implementing the required interface is provided.
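To make the four operations more tangible, here is a minimal sketch of what such an adapter could look like. The names DatasetAdapter, DictAdapter, get_paths, is_dir, is_file and load_json are hypothetical and chosen only for illustration; they are not claimed to be the interface of the actual library. The toy adapter treats nested dictionaries as directories and strings as JSON file contents.

```python
import json
from typing import Any, Dict, List, Protocol


class DatasetAdapter(Protocol):
    """Hypothetical interface mirroring the four operations listed above."""

    def get_paths(self) -> List[str]: ...
    def is_dir(self, path: str) -> bool: ...
    def is_file(self, path: str) -> bool: ...
    def load_json(self, path: str) -> Any: ...


class DictAdapter:
    """Toy adapter: nested dicts are directories, strings are JSON file contents."""

    def __init__(self, tree: Dict[str, Any]) -> None:
        self._nodes: Dict[str, Any] = {}
        self._collect("", tree)  # the empty path represents the root

    def _collect(self, path: str, node: Any) -> None:
        self._nodes[path] = node
        if isinstance(node, dict):
            for name, child in node.items():
                self._collect(f"{path}/{name}" if path else name, child)

    def get_paths(self) -> List[str]:
        return sorted(self._nodes)

    def is_dir(self, path: str) -> bool:
        return isinstance(self._nodes.get(path), dict)

    def is_file(self, path: str) -> bool:
        return isinstance(self._nodes.get(path), str)

    def load_json(self, path: str) -> Any:
        return json.loads(self._nodes[path])


adapter = DictAdapter({"a": {"b.json": '{"x": 1}'}, "c.json": "[]"})
print(adapter.get_paths())
```

Any structure that can answer these four questions, for example a ZIP archive or an HDF5 file, could be wrapped in the same way.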
Path Convention¶
DirSchema rules are evaluated on a set of paths and rely on pattern matching to determine which rule(s) must be applied to which path. Therefore, it is important to understand how files and directories are uniquely mapped to paths.
- The set of paths in a dataset always contains at least the empty path (representing the root directory)
- Furthermore, it contains all subdirectories and files, except for those that are ignored due to the metadata convention (see later) or the adapter configuration (e.g. ignoring hidden files)
In order to have a unique representation of paths that can be used in regex patterns, all paths are normalized such that:
- each path is relative to the directory root (which is represented by the empty string)
- slashes are used to separate "segments" (i.e. a sequence of directories, possibly ending with a file)
- each segment between two slashes is non-empty
- there is neither a leading nor a trailing slash (/)
- paths do not contain special "file names" like . (current dir) or .. (parent dir)
Example: "", "a", "a/b/c" are all valid paths as provided by the normalization.
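The normalization rules above can be sketched as a small helper. The function name normalize_path is hypothetical, and rejecting ".." segments (instead of resolving them) is an assumption made for this sketch; an actual adapter may handle such input differently.

```python
def normalize_path(raw: str) -> str:
    """Return a normalized, root-relative path (illustrative sketch only)."""
    segments = []
    for segment in raw.split("/"):
        if segment in ("", "."):
            # Drops leading/trailing/duplicate slashes and "current dir" segments.
            continue
        if segment == "..":
            # Assumption: parent references are rejected rather than resolved.
            raise ValueError(f"parent references are not allowed: {raw!r}")
        segments.append(segment)
    return "/".join(segments)


print(repr(normalize_path("/a//b/./c/")))
print(repr(normalize_path("/")))
```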
Metadata Convention¶
JSON metadata can be provided for each file and directory. By default, it is assumed that for each file named FILE the metadata is located in a file named FILE_meta.json, whereas for a directory DIR the metadata is in DIR/_meta.json.
The convention can be configured by overriding the prefixes and suffixes that are attached to the path itself and to the filename. The general pattern is as follows:
For a path a/b/c/d, the metadata is located in:
- <PATH_PREFIX/>a/b/c</PATH_SUFFIX>/<FILE_PREFIX>d<FILE_SUFFIX> if d is a file
- <PATH_PREFIX/>a/b/c/d</PATH_SUFFIX>/<FILE_PREFIX><FILE_SUFFIX> if d is a directory
All these prefixes and suffixes are optional, except for the requirement that either a file prefix or a file suffix must be provided.
All files following the used metadata naming convention are automatically excluded from the set of validated files. These files are seen as merely "companion files" to other files in the dataset. This simplifies the writing of DirSchemas, as otherwise these files would have to be excluded in an ad-hoc manner, which would fix the convention inside a DirSchema. Excluding them allows for changing the convention or using the DirSchema with datasets following different conventions, without changing the DirSchema itself.
Example:
With the default settings, the metadata file for /a/b/c/d is expected to be found at:
- /a/b/c/d_meta.json if d is a file
- /a/b/c/d/_meta.json if d is a directory
If we also added a path suffix equal to metadata, we would get:
- /a/b/c/metadata/d_meta.json if d is a file
- /a/b/c/d/metadata/_meta.json if d is a directory
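The general pattern can be written down as a short function. This is a sketch under assumptions: the function name metadata_path and its parameter names are hypothetical, and it operates on normalized paths (without a leading slash), so its results correspond to the examples above minus the leading slash.

```python
def metadata_path(
    path: str,
    is_dir: bool,
    path_prefix: str = "",
    path_suffix: str = "",
    file_prefix: str = "",
    file_suffix: str = "_meta.json",
) -> str:
    """Compute the companion metadata path for a normalized path (sketch)."""
    if not (file_prefix or file_suffix):
        raise ValueError("either a file prefix or a file suffix is required")
    segments = path.split("/") if path else []
    if is_dir:
        # For directories, the metadata file lives inside the directory itself.
        dir_segments, name = segments, ""
    else:
        # For files, it lives next to the file and embeds the file name.
        dir_segments, name = segments[:-1], segments[-1]
    parts = [path_prefix, *dir_segments, path_suffix, file_prefix + name + file_suffix]
    return "/".join(part for part in parts if part)


print(metadata_path("a/b/c/d", is_dir=False))
print(metadata_path("a/b/c/d", is_dir=True, path_suffix="metadata"))
```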
Validation by JSON Schemas and custom plugins¶
In any context where JSON validation is to be performed and a schema can be provided, it is possible to supply one of the following in the corresponding location of the schema:
- a JSON Schema (directly embedded)
- a URI pointing to a JSON Schema
- a special URI pointing to a custom validation plugin
For referencing JSON Schemas stored outside of the dirschema, the following possibilities exist:
- an http(s):// URI
- a file:// URI or an absolute path (equivalent)
- a local:// URI (resolved relative to the directory of the used dirschema by default)
- a cwd:// URI (resolved relative to the current working directory)
- a relative path (treated as a cwd:// path by default)
To access a custom validation plugin, a pseudo-URI starting with v#VALIDATOR:// is recognized, where VALIDATOR is a registered plugin.
The cwd:// URI is an explicit version that behaves like normal "relative paths", i.e. when the validation tool is launched in /a/b, a path cwd://c/d is expanded to /a/b/c/d.
By default, local:// URIs are expanded relative to the location of the main dirschema file. The reference directory for interpreting local:// paths can also be overridden to resolve to an arbitrary different path supplied to the validator during initialization.
Example:
Consider the following setup:
- the dirschema lives in /my/dirschemas/example.dirschema.yaml
- the dirschema validation is launched in directory /my/workdir
- a custom validator called myvalidator is registered as a plugin
Now let us see how the paths are resolved:
- A JSON Schema referenced as https://www.example.org/schemas/some_schema.json remains unchanged (the schema will be downloaded)
- A JSON Schema referenced as file:///schemas/some_schema.json remains unchanged
- A JSON Schema referenced as cwd://schemas/some_schema.json expands to file:///my/workdir/schemas/some_schema.json
- A JSON Schema referenced as local://schemas/some_schema.json will expand to file:///my/dirschemas/schemas/some_schema.json by default (or some other path, if the local base directory is overridden)
- A JSON Schema referenced as /schemas/some_schema.json expands to file:///schemas/some_schema.json
- A JSON Schema referenced as schemas/some_schema.json expands to file:///my/workdir/schemas/some_schema.json by default (if overridden, any prefix can be added to modify the interpretation of relative paths)
- A pseudo-URI v#myvalidator://something will call the validation plugin with the current file or directory path and the string something as argument (the argument can tell the plugin what kind of validation to perform or schema to use).
Thus, custom validation plugins can be used to serve two purposes:
- perform validation beyond what is possible with JSON Schema
- still use JSON Schema internally, but allow the use of JSON Schemas that cannot be addressed using the built-in supported protocols
Except for custom validation plugins, all these URIs and pseudo-URIs can also be used as values for $ref inside the dirschema or JSON Schemas. The custom plugin pseudo-URIs may only be used with the corresponding validation keywords of DirSchema.
Relative paths can be used for convenience throughout the schema and expanded to any built-in JSON Schema access protocol or custom validator by setting the relative schema base prefix when launching the validator. Notice that using a custom plugin prefix will break $ref resolving of relative paths (you should not use $ref without an access protocol anyway). If you do so nevertheless and want relative paths to be resolved consistently as expected in $refs, you must prefix the relative sub-schema location with cwd:// or local://, stating your intended semantics.
While all the provided ways to refer to external schemas can be useful for applying dirschema in various contexts, mixing too many of them, especially multiple "relative" modes of accessing a validator or JSON Schema, should be considered bad practice. It can make your schemas harder to understand and to reuse.
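The expansion rules from the example setup above can be summarized in a small resolver. This is only a sketch: the function resolve_ref and its parameters cwd, local_base and relative_prefix are hypothetical names introduced here, not the validator's actual configuration options.

```python
def resolve_ref(
    ref: str,
    cwd: str,
    local_base: str,
    relative_prefix: str = "cwd://",
) -> str:
    """Expand a schema reference following the rules described above (sketch)."""
    if "://" not in ref and not ref.startswith("/"):
        # Relative path: prepend the configurable prefix (cwd:// by default).
        ref = relative_prefix + ref
    if ref.startswith("/"):
        # Absolute paths are equivalent to file:// URIs.
        return "file://" + ref
    if ref.startswith("cwd://"):
        return "file://" + cwd.rstrip("/") + "/" + ref[len("cwd://"):]
    if ref.startswith("local://"):
        return "file://" + local_base.rstrip("/") + "/" + ref[len("local://"):]
    # http(s)://, file:// and v#VALIDATOR:// references are left unchanged.
    return ref


for example in ("cwd://schemas/some_schema.json", "local://schemas/some_schema.json"):
    print(resolve_ref(example, cwd="/my/workdir", local_base="/my/dirschemas"))
```

Passing a different relative_prefix (for instance local://) changes how bare relative paths are interpreted, which mirrors the "relative schema base prefix" mentioned above.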
DirSchema keywords¶
The keywords used in dirschema can be classified into some groups:
- Primitive rules: type, valid, validMeta
The primitive rules are those which perform the actual desired validation on a path.
- Logical connectives: not, anyOf, allOf, oneOf
The logical connectives work in the same way as in JSON Schema and are used to build more complex rules from the primitive rules.
- Syntactic sugar: if, then, else
Technically, if/then/else is redundant, as its complete behaviour can be replicated from logical connectives and suitable use of the description and details settings.
Practically, it is added as syntactic sugar for the often needed case where a "meta-level" implication such as "if precondition X is true, validate rule Y" is desired, but the user should not be bothered with errors concerning violations of "X" because this is not a real validation error.
To have more human-readable schemas and better error reporting, the guideline is to use if/then/else for rule selection and "control flow", whereas the logical connectives are to be used for actual complex validation rules.
- Pattern matching: match, rewrite, next
The pattern matching keywords are the mechanism for selecting which rules to apply to which paths and constructing relations between paths.
- Settings: matchStart, matchStop, description, details
The setting keywords affect the behaviour of the evaluation, but have no "truth value".
DirSchema Evaluation¶
When validating a dataset, the DirSchema is evaluated for each path individually and therefore rule violations are also reported for each path separately. For each path, the validation returns (a part of) the unsatisfied constraints as a response. Rule evaluation proceeds recursively as follows.
- If a match key is present, the path is matched against the expression.
- Primitive constraints type, valid and validMeta are evaluated.
- Logical constraints not, allOf, anyOf and oneOf and if/then/else are evaluated.
- The next rule is evaluated on the path (possibly rewritten by rewrite), if present.
Whenever one of these stages fails, the evaluation of the current rule is aborted. In the following, all available constraints and other keys are explained in more detail.
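The staged evaluation with early abort described above can be illustrated with a deliberately simplified sketch. It models each stage as a callable returning a boolean; the names first_failing_stage and make_stage are hypothetical, and the real evaluation reports structured errors rather than a single stage name.

```python
from typing import Callable, List, Optional, Tuple

Stage = Tuple[str, Callable[[], bool]]


def first_failing_stage(stages: List[Stage]) -> Optional[str]:
    """Run the stages in order and stop at the first one that fails."""
    for name, check in stages:
        if not check():
            return name  # later stages are not evaluated
    return None


evaluated: List[str] = []


def make_stage(name: str, result: bool) -> Stage:
    """Build a stage that records when it runs (for demonstration only)."""
    def check() -> bool:
        evaluated.append(name)
        return result
    return (name, check)


failed = first_failing_stage([
    make_stage("match", True),
    make_stage("primitive", False),  # e.g. a violated type constraint
    make_stage("logical", True),
    make_stage("next", True),
])
print(failed, evaluated)
```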
DirSchema Rules¶
A DirSchema rule is, similar to a JSON Schema, either a boolean (a rule that is trivially true or false), or a conjunction of at most one of each kind of possible primitive and/or complex constraints. A constraint is primitive iff it does not contain any nested constraint (i.e. primitive rules are leaves in the tree of nested rules).
DirSchema rules are assumed to be JSON or YAML files. In the following it is assumed that JSON and YAML syntax is understood and only the key/value pairs for defining constraints are presented.
Matching and Rewriting¶
As explained above, the complete rule expression is evaluated on each path. To apply different rules to different paths and express dependencies between related paths, DirSchema provides regex matching and substitution for paths.
match¶
Value: string (containing a regex pattern)
Description: Require that the path must fully match the provided regex.
If the match fails, it is assumed that the current rule is not intended for the current path and therefore further evaluation of this rule is aborted.
The behavior of match can be modified by setting matchStart and/or matchStop to restrict the matching scope to certain path segments. Such an interval is called a path slice.
For example, given the path a/b/c/d with matchStart: 1 and matchStop: -1, the match (and possible rewrite) is performed only on the path slice b/c.
Capture groups (defined by parentheses in the regex) can be used for the rewrite in the current or any nested rule, unless overridden by a different match.
matchStart¶
Value: integer (default: 0)
Description: Defines the index of the first path segment to be included in the match.
Negative indices work the same as in Python.
For example, to match only in the file name, matchStart can be set to -1.
This setting is inherited into contained rules until overridden.
matchStop¶
Value: integer (default: 0)
Description:
Defines the index of the first path segment after matchStart that is not to be included in the match.
Negative indices work the same as in Python.
Contrary to Python, a value of 0 means "until the end", like leaving out the end index in a Python slice.
This setting is inherited into contained rules until overridden.
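The interplay of match, matchStart and matchStop can be sketched with Python slices and the standard re module. The helper names path_slice and slice_matches are hypothetical; the sketch only covers the index semantics described above (including the special meaning of 0 for matchStop) and full matching on the resulting slice.

```python
import re


def path_slice(path: str, match_start: int = 0, match_stop: int = 0) -> str:
    """Return the slice of path segments selected by matchStart/matchStop."""
    segments = path.split("/") if path else []
    # Contrary to Python, a stop value of 0 means "until the end".
    stop = None if match_stop == 0 else match_stop
    return "/".join(segments[match_start:stop])


def slice_matches(pattern: str, path: str, match_start: int = 0, match_stop: int = 0) -> bool:
    """Check whether the regex fully matches the selected path slice."""
    return re.fullmatch(pattern, path_slice(path, match_start, match_stop)) is not None


print(path_slice("a/b/c/d", match_start=1, match_stop=-1))
print(slice_matches(r".*\.png", "data/run1/image.png", match_start=-1))
```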
rewrite¶
Value: string (substitution, possibly containing capture references)
Description:
Rewrite (parts of) the current path.
The rewritten path is used instead of the current path in the next rule; all constraints on the same level as the rewrite are evaluated on the original path!
Therefore, having a rewrite without a next rule has no effect.
Capture groups of the most recent match (i.e. on the same level or in an ancestor rule) can be used in the substitution. If there is no applicable match, a default match for the pattern (.*) is assumed and therefore \\1 references the whole matched path or path slice (determined by the currently active matchStart/matchStop).
In principle, this can be used to roughly emulate the functionality of validMeta, but as metadata requirements are one of the main use cases, validMeta is preferable, as it does not hard-code a metadata file naming convention.
However, in a case where more than one metadata file is required for a single file, the non-standard file could be validated by a combination of rewrite and valid, if there is no other way to express the desired constraints.
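The capture-group substitution behind rewrite can be approximated with Python's re module. This sketch makes two simplifying assumptions: it ignores path slices (it always works on the whole path), and the function name rewrite_path is hypothetical. The default pattern (.*) corresponds to the default match described above.

```python
import re
from typing import Optional


def rewrite_path(path: str, substitution: str, pattern: Optional[str] = None) -> Optional[str]:
    """Rewrite a path using the capture groups of a full regex match (sketch)."""
    match = re.fullmatch(pattern if pattern is not None else "(.*)", path)
    if match is None:
        return None  # the rule does not apply to this path
    # Expand references such as \1 with the captured groups.
    return match.expand(substitution)


# Relate each JPEG image to a text file with the same base name.
print(rewrite_path("img/cat.jpg", r"\1.txt", pattern=r"(.*)\.jpg"))
# Without an explicit pattern, \1 refers to the whole path.
print(rewrite_path("a/b", r"\1_extra"))
```

In a rule, the rewritten path would then be passed on to the next rule, for example to require that the related file exists.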
Primitive Rules¶
Besides match, the following primitive rules are provided:
type¶
Value: boolean, "file" or "dir"
Description: Require that the path:
- true: exists (either file or directory)
- false: does not exist
- "file": is a file
- "dir": is a directory
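The four possible values of type map directly onto the adapter's file and directory tests. A minimal sketch, with the hypothetical helper name check_type and the two test results passed in as plain booleans:

```python
from typing import Union


def check_type(value: Union[bool, str], is_file: bool, is_dir: bool) -> bool:
    """Evaluate a type constraint given what is known about the path (sketch)."""
    exists = is_file or is_dir
    if value is True:
        return exists
    if value is False:
        return not exists
    if value == "file":
        return is_file
    if value == "dir":
        return is_dir
    raise ValueError(f"unsupported type value: {value!r}")


# A path that is a directory satisfies type: true and type: "dir" only.
print([check_type(v, is_file=False, is_dir=True) for v in (True, False, "file", "dir")])
```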
valid¶
Value: JSON Schema or string
Description: Require that the path is loadable as JSON by the used adapter and is successfully validated by the referenced JSON Schema or custom validator.
Validation fails if the path does not exist, cannot be loaded by the adapter or is not valid according to the validation handler.
validMeta¶
Value: JSON Schema or string
Description: Require that the metadata file of the current path (according to the used convention) is loadable as JSON by the used adapter and is successfully validated by the referenced JSON Schema or custom validator.
Validation fails if the path does not exist, the metadata companion file does not exist, the metadata file cannot be loaded by the adapter or is not valid according to the validation handler.
Combinations of Rules¶
To build more complex rules, DirSchema provides the same logical connectives that can be used with JSON Schema. Additionally, an implication keyword next is provided explicitly and described further below.
Notice that contrary to typical logical semantics (and just as in JSON Schema), oneOf/anyOf evaluate to true for empty arrays, because they are interpreted as "not existing" instead of being treated as empty existentials.
For each path, the rules are checked in the listed order ("short circuiting"), which matters for anyOf: once a rule in the array is satisfied, the following rules are not evaluated. So prefer putting the simplest or most common case first.
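The following sketch illustrates the two points made above: the treatment of empty arrays and the short-circuiting order of anyOf. Rules are modelled as plain Python predicates on a path, and the function names all_of, any_of and one_of are hypothetical stand-ins for the keywords.

```python
from typing import Callable, List

Rule = Callable[[str], bool]


def all_of(rules: List[Rule], path: str) -> bool:
    return all(rule(path) for rule in rules)


def any_of(rules: List[Rule], path: str) -> bool:
    if not rules:
        return True  # an empty array is treated as "not existing"
    # any() stops at the first satisfied rule (short circuiting).
    return any(rule(path) for rule in rules)


def one_of(rules: List[Rule], path: str) -> bool:
    if not rules:
        return True  # same convention as for anyOf
    return sum(1 for rule in rules if rule(path)) == 1


calls: List[str] = []


def is_json(path: str) -> bool:
    calls.append("is_json")
    return path.endswith(".json")


def is_yaml(path: str) -> bool:
    calls.append("is_yaml")
    return path.endswith(".yaml")


result = any_of([is_json, is_yaml], "config.json")
print(result, calls)  # is_yaml is never called for this path
```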
not¶
Value: DirSchema
Description: Logical negation.
allOf¶
Value: Array of DirSchema
Description: Logical conjunction.
anyOf¶
Value: Array of DirSchema
Description: Logical disjunction.
oneOf¶
Value: Array of DirSchema
Description: Satisfied if exactly one rule in the array of DirSchemas is satisfied.
next¶
Value: DirSchema
Description: If all other constraints in the current rule are satisfied, require that the rule provided in the value is also satisfied on the (possibly rewritten) path.
This mechanism exists first and foremost in order to be used in combination with rewrite, as just combining multiple rules can be achieved using allOf.
Additionally, this can be used for sequential "short circuiting" of rule evaluation to modify or refine the four evaluation phases outlined above.
if-then-else¶
if¶
Value: DirSchema
Description:
If specified, will be evaluated on the current path.
Depending on the result, either the then or the else rule will be evaluated.
then¶
Value: DirSchema
Description: If given, must be satisfied in case the if rule is satisfied.
else¶
Value: DirSchema
Description: If given, must be satisfied in case the if rule is violated.
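The branching behaviour of if/then/else can be sketched as follows, again modelling rules as predicates on a path. The function name if_then_else is hypothetical; the key point is that a violated if rule only selects the branch and is not itself a validation error.

```python
from typing import Callable, Optional

Rule = Callable[[str], bool]


def if_then_else(
    path: str,
    if_rule: Rule,
    then_rule: Optional[Rule] = None,
    else_rule: Optional[Rule] = None,
) -> bool:
    """Select a branch via if_rule; a missing branch is trivially satisfied."""
    branch = then_rule if if_rule(path) else else_rule
    return True if branch is None else branch(path)


def is_csv(path: str) -> bool:
    return path.endswith(".csv")


def in_data_dir(path: str) -> bool:
    return path.startswith("data/")


# "If the path is a CSV file, it must be located below data/."
for candidate in ("data/table.csv", "misc/table.csv", "misc/notes.txt"):
    print(candidate, if_then_else(candidate, is_csv, then_rule=in_data_dir))
```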
Error reporting¶
description¶
Value: string
Description: If given, will override all other error messages from immediate child keys of this rule. To completely silence errors from this rule, set to empty string.
If you want to have multiple custom error messages for keys in this rule (e.g. checking both type and validMeta with separate error messages), move these keys into allOf, and add individual description strings to the sub-rules inside allOf.
details¶
Value: boolean (true by default)
Description: If true, will preserve error messages reported from nested sub-rules, e.g. from logical connectives. If false, will discard them. This can be used in combination with description to provide higher-level errors for logically complex rules where the default error report is not helpful.
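As an illustration of the advice on separate error messages, the following sketch builds such a rule as a Python dictionary and prints it as JSON, one of the two formats in which rules can be written. The schema location local://schemas/image_meta.schema.json and the message texts are made-up placeholders.

```python
import json

# Each sub-rule inside allOf carries its own description, so a violation of
# type and a violation of validMeta produce different error messages.
rule = {
    "allOf": [
        {
            "type": "file",
            "description": "Expected a file at this path.",
        },
        {
            # Hypothetical schema location, used only for illustration.
            "validMeta": "local://schemas/image_meta.schema.json",
            "description": "The metadata file is missing or invalid.",
        },
    ]
}

print(json.dumps(rule, indent=2))
```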
Modularity¶
In any place where a DirSchema or JSON Schema is expected, one can also use $ref to reference them, both in YAML as well as JSON format, located at a remote or local location. This works for all supported protocols except for custom validation plugins (i.e. custom validator pseudo-URIs are only permitted as values for valid and validMeta).
Examples¶
TODO:
Show non-trivial example for match slice/rewrite and scoping
Show example how next can be used for short circuiting
Show mutex example?