Draft Proposal for the Annot Package
SBML Level 3 Annotation Package. (Keyword: annot).
CISBAN and School of Computing Science
Manchester Centre for Integrative Systems Biology
University of Manchester
Database and Information Systems
18051 Rostock, MV
Proposal tracking number
Number 3009839 in the SBML issue tracking system.
Version number and date of public release
This is version 0 of the Annot package (draft proposal). It reflects the results of the Annotation package meeting, 19–21 May 2010.
URL for this version of the proposal
URL for the previous version of this proposal
Introduction and motivation
Annotations encode meta-information in SBML models. SBML allows users to annotate any SBML component that extends SBase (SBML L3 spec, p. 15). The annotation element provides a container for optional software-generated content not meant to be shown to humans.
The current syntax for encoding information inside the annotation element, hereafter referred to as core annotation, recommends the use of a defined subset of RDF as described in the SBML L3 specification, section 6.
The core annotation format allows the expression of relationships between SBML elements on the one hand, and resources referred to by values of rdf:resource attributes on the other. The BioModels.net relation elements (predicates) simply define the nature of the relationship (SBML L3 spec, p. 87).
However, as annotations are independent of the model syntax and are not required for successful simulation of the models, it would be more suitable to define annotations in their own package. It is proposed to retain core annotations in the SBML Level 3 Core, but to develop a Level 3 extension package that extends the possibilities of core annotations and therefore supports a richer set of meta-information which is currently not expressible. In future Levels, the original core annotations may be completely replaced by this package.
The package builds on the description of the core annotation as currently available from the SBML L3 specification, section 6. A short description of the core annotation standard follows after the introduction to RDF.
Introduction to RDF
The Resource Description Framework (RDF) is a language for representing information about resources, in particular metadata about resources in the World Wide Web. The RDF Primer generalises the concept of a “web resource” to cover information about things that can be identified on the web, even when they cannot be directly retrieved on the web. RDF-encoded information can be processed by applications. The common framework provided by RDF for expressing information in a standardised way enables the lossless exchange of information between different applications. RDF builds upon ideas from knowledge representation, artificial intelligence, and data management.
The basic concept of RDF is the identification of things using Uniform Resource Identifiers (URIs). The resources are described by properties with particular property values. The specific terminology used in RDF is (see RDF Primer, sec 2.1):
subject: The part that identifies the thing the statement is about is called the subject.
predicate: The part that identifies the property of the subject that the statement specifies is called the predicate.
object: The part that identifies the value of a property is called the object.
Because URIs are so general, they are used in RDF to identify subjects, predicates and objects in statements. RDF statements effectively take the form of triples, allowing statements to be written in the form:
subject has predicate whose value is object.
The RDF primer extends the concept of URIs to URI references which are defined as:
URIref: A URI reference (or URIref) is a URI, together with an optional fragment identifier at the end. The fragment is separated by the # character.
RDF URIs can be used to encode different kinds of information, including kinds of things, individuals, properties of things, or values of properties.
RDF refers to a resource as:
resource: A resource is defined as anything that is identifiable by a URI reference (URIref).
Objects in RDF may be either URIrefs or constant values (literals). Subjects and predicates cannot be literals. Using URIrefs as subject, predicate and object in statements supports the development and use of shared vocabularies on the web. One advantage of using URIrefs in statements is that a URIref identifies a thing more precisely than a plain string (for example, http://www.ex.org/staffif/1111 identifies a person more precisely than the string “John Smith”). Another advantage is that a thing with a URIref assigned can be further described by other RDF statements, while a literal cannot.
RDF allows the encoded information to be modelled in different ways. One way is to represent the information as a graph of nodes and arcs. An RDF graph is formed based on the idea that the things being described have properties which have values, and that resources can be described by making statements [..] that specify those properties and values (RDF Primer, sec 2.1). The nodes in the graph represent the subject and object of a statement. The arc represents the predicate and is directed from the subject node to the object node. Ellipses in the RDF graph represent URIrefs, while boxes represent literals. A sample RDF graph is shown in Figure 1 of the RDF Primer.
A second way to represent RDF statements is the triples notation. It offers an alternative to the graph representation, e.g. if a graph becomes too unwieldy to draw. Each statement of the graph is written as a single triple, consisting of the subject, predicate and object (in that order). A triple describes a single arc in the graph, with the subject being the arc’s beginning and the object being the arc’s ending. URIrefs are put in angle brackets (<...>), while literals are put in quotes ("..."). For example:
<http://www.example.org/index.html> <http://www.example.org/terms/creation-date> "May 16, 1999" .
<http://www.example.org/index.html> <http://purl.org/dc/elements/1.1/language> "en" .
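The subject/predicate/object structure can be sketched in a few lines of Python. This is only an illustration of the triples notation described above; the `Triple` class and `to_triples_notation` function are hypothetical names introduced here and are not part of RDF or of this proposal.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    """One RDF statement; hypothetical helper type for illustration."""
    subject: str                  # a URIref
    predicate: str                # a URIref
    obj: str                      # a URIref or a literal value
    obj_is_literal: bool = False  # True when obj is a constant value


def to_triples_notation(triple: Triple) -> str:
    """Write a statement with URIrefs in angle brackets and literals in quotes."""
    if triple.obj_is_literal:
        obj = f'"{triple.obj}"'
    else:
        obj = f"<{triple.obj}>"
    return f"<{triple.subject}> <{triple.predicate}> {obj} ."


statement = Triple(
    subject="http://www.example.org/index.html",
    predicate="http://www.example.org/terms/creation-date",
    obj="May 16, 1999",
    obj_is_literal=True,
)
print(to_triples_notation(statement))
```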
Furthermore, XML can be used to represent statements in a machine-processable way. The syntax for writing RDF in XML is called RDF/XML (RDF syntax, sec 3). The description of a statement is enclosed in an rdf:RDF XML element. The statement itself is enclosed in an rdf:Description element, which is regarded as a description of the subject of the statement. The subject is referred to in the rdf:about attribute of the rdf:Description element. Nested within the containing rdf:Description element is the property element representing the predicate and object of the statement. The nesting indicates that the property applies to the given subject. More details are given in the RDF/XML syntax specification. An example of RDF/XML representation, marking up the two statements above, is:
<?xml version="1.0"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:dc="http://purl.org/dc/elements/1.1/"
         xmlns:exterms="http://www.example.org/terms/">
  <rdf:Description rdf:about="http://www.example.org/index.html">
    <exterms:creation-date>May 16, 1999</exterms:creation-date>
  </rdf:Description>
  <rdf:Description rdf:about="http://www.example.org/index.html">
    <dc:language>en</dc:language>
  </rdf:Description>
</rdf:RDF>
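To make the mapping from RDF/XML back to triples concrete, the following minimal sketch reads the example with Python's standard `xml.etree.ElementTree` module. It assumes the simple form shown here (top-level rdf:Description elements whose children carry literal values) and is not a general RDF/XML parser; `simple_triples` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

RDF_XML = """<?xml version="1.0"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:dc="http://purl.org/dc/elements/1.1/"
         xmlns:exterms="http://www.example.org/terms/">
  <rdf:Description rdf:about="http://www.example.org/index.html">
    <exterms:creation-date>May 16, 1999</exterms:creation-date>
  </rdf:Description>
  <rdf:Description rdf:about="http://www.example.org/index.html">
    <dc:language>en</dc:language>
  </rdf:Description>
</rdf:RDF>"""


def simple_triples(rdf_xml):
    """Return (subject, predicate, literal) tuples for flat rdf:Description blocks."""
    root = ET.fromstring(rdf_xml)
    triples = []
    for description in root.findall(f"{{{RDF_NS}}}Description"):
        subject = description.get(f"{{{RDF_NS}}}about")
        for prop in description:
            # ElementTree tags look like "{namespace}local"; the predicate URI
            # is the namespace followed directly by the local name.
            predicate = prop.tag[1:].replace("}", "", 1)
            triples.append((subject, predicate, (prop.text or "").strip()))
    return triples


for triple in simple_triples(RDF_XML):
    print(triple)
```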
Current SBML annotation standard
According to the current SBML annotation standard, RDF/XML is used to present the RDF statements (see Figure, taken from the SBML L3 Core Specification, p.86).
The current core annotation schema, while written in RDF/XML, supports only a limited subset of RDF/XML. The above syntax must be followed, including the use of the mandatory rdf:Bag container, and the specification of the object as a URI in the rdf:resource attribute of each rdf:li element.
The URI link to an external resource must be perennial. To uniquely identify a controlled vocabulary term or object, the MIRIAM Resources scheme is used. A referenced MIRIAM URI maps to a physical web source, i.e. a URL. The connection between the addressed third-party knowledge and the annotated element is established using any of the model or biological qualifiers listed at http://www.biomodels.net/qualifiers. If an annotation follows the proposed scheme, it is considered an SBML MIRIAM annotation. The versioning of SBML annotation elements can be tracked through the history, which stores the annotation creators and modification dates.
Problems with core annotation
Statements about attributes
The core annotation specification reuses the RDF approach of providing
rdf:Description elements for SBML XML elements, such as
However, there is currently no mechanism for annotating SBML attributes. See, for example, the following SBML code snippet:
<species metaid="metaid_0000042" id="Y" name="Intravesicular Calcium"
         compartment="intravesicular" initialConcentration="0.36">
  <annotation>
    <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
             xmlns:bqbiol="http://biomodels.net/biology-qualifiers/"
             xmlns:bqmodel="http://biomodels.net/model-qualifiers/">
      <rdf:Description rdf:about="#metaid_0000042">
        <bqbiol:isDescribedBy>
          <rdf:Bag>
            <rdf:li rdf:resource="urn:miriam:pubmed:12343565"/>
          </rdf:Bag>
        </bqbiol:isDescribedBy>
      </rdf:Description>
    </rdf:RDF>
  </annotation>
</species>
Using the current SBML annotation approach, it is not possible to annotate an attribute of an SBML element, such as the initial concentration of a species.
The PubMed annotation in the example states that the species instance as a whole isDescribedBy a particular PubMed reference (PubMed ID 12343565), whereas the intention was to annotate the species' attribute initialConcentration, effectively stating that the justification for the given initialConcentration value isDescribedBy the PubMed document with ID 12343565.
Statements about statements
With the current scheme, all annotations of an SBML element are at the same level. They all relate to the element itself and cannot be related to another statement. What is missing in the current SBML approach is the ability to provide "statements about statements". A simple use case is the request to annotate an annotation with the information that "this statement was added by...". A further use case involves non-binary relationships, such as "protein X is modified by modifier Y in position Z".
Relations between statements
With all annotations for a given SBML element being at the same level, it is currently not possible to define the relation between the different annotations of a particular element. Apart from some conventions mentioned in the spec (see SBML L3 Core Specification, sec. 6.5 on p. 86), there is no fine-grained way of providing information on the annotation relations in a formal and specified manner.
The core annotation standard syntactically limits the annotation of model constituents to:
<rdf:RDF>
  <rdf:Description rdf:about="#SBML_META_ID">
    <RELATION_ELEMENT>
      <rdf:Bag>
        <rdf:li rdf:resource="URI"/>
        FURTHER RDF:LI ELEMENT
      </rdf:Bag>
    </RELATION_ELEMENT>
  </rdf:Description>
</rdf:RDF>
That very restrictive approach does not allow the use of containers other than rdf:Bag (such as rdf:Seq, rdf:Alt or rdf:List). An rdf:Bag only groups a set of statements, without implying any further semantics on the meaning of that group. Statements within an rdf:Seq are ordered, whilst those inside an rdf:Alt represent alternative statements. Each of these three containers represents an open set. Statements within an rdf:List represent a closed set.
The ambiguity caused by this limitation is highlighted in the following two examples, where the same container (rdf:Bag) is used to define the relationship between two alternative annotations for glucose, and to define the relationship between two components of a complex.
The following example effectively demonstrates an implied "or" relationship between two alternative means of annotating glucose (with a ChEBI term or a KEGG Compound term):
<species id="glc" metaid="meta_glc" name="Glucose">
  <annotation>
    <rdf:RDF>
      <rdf:Description rdf:about="#meta_glc">
        <bqbiol:is>
          <rdf:Bag>
            <rdf:li rdf:resource="urn:miriam:obo.chebi:CHEBI%3A17234"/>
            <rdf:li rdf:resource="urn:miriam:kegg.compound:C00234"/>
          </rdf:Bag>
        </bqbiol:is>
      </rdf:Description>
    </rdf:RDF>
  </annotation>
</species>
The next example demonstrates an implied "and" relationship between two components of a complex (represented by a UniProt term for the protein, and a KEGG Compound term for the ligand):
<species id="Ca_calmodulin" metaid="cacam">
  <annotation>
    <rdf:RDF>
      <rdf:Description rdf:about="#cacam">
        <bqbiol:hasPart>
          <rdf:Bag>
            <rdf:li rdf:resource="urn:miriam:uniprot:P62158"/>
            <rdf:li rdf:resource="urn:miriam:kegg.compound:C00076"/>
          </rdf:Bag>
        </bqbiol:hasPart>
      </rdf:Description>
    </rdf:RDF>
  </annotation>
</species>
The problem here is that the relationship is implied: it is not made explicit by the container (rdf:Bag) used to define the relationship.
Furthermore, no clear definition of the different or similar meanings between the following two examples is provided:
<rdf:RDF ..>
  <rdf:Description rdf:about="#metaid_0000001">
    <bqbiol:is>
      <rdf:Bag>
        <rdf:li rdf:resource="x"/>
        <rdf:li rdf:resource="y"/>
      </rdf:Bag>
    </bqbiol:is>
  </rdf:Description>
</rdf:RDF>

<rdf:RDF ..>
  <rdf:Description rdf:about="#metaid_0000001">
    <bqbiol:is>
      <rdf:Bag>
        <rdf:li rdf:resource="x"/>
      </rdf:Bag>
    </bqbiol:is>
    <bqbiol:is>
      <rdf:Bag>
        <rdf:li rdf:resource="y"/>
      </rdf:Bag>
    </bqbiol:is>
  </rdf:Description>
</rdf:RDF>
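The lack of a defined distinction can be illustrated with a small sketch. It assumes a naive consumer that ignores how rdf:li entries are grouped and simply collects (subject, relation, resource) tuples; under that assumption the two forms above become indistinguishable. The namespace declarations are filled in here so the fragments parse, and `flatten` is a hypothetical helper, not a statement about formal RDF semantics.

```python
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

HEADER = (
    '<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" '
    'xmlns:bqbiol="http://biomodels.net/biology-qualifiers/">'
    '<rdf:Description rdf:about="#metaid_0000001">'
)
FOOTER = "</rdf:Description></rdf:RDF>"

# First form: one relation element with a single bag holding both resources.
ONE_BAG = HEADER + (
    "<bqbiol:is><rdf:Bag>"
    '<rdf:li rdf:resource="x"/><rdf:li rdf:resource="y"/>'
    "</rdf:Bag></bqbiol:is>"
) + FOOTER

# Second form: two relation elements, each with its own one-member bag.
TWO_BAGS = HEADER + (
    '<bqbiol:is><rdf:Bag><rdf:li rdf:resource="x"/></rdf:Bag></bqbiol:is>'
    '<bqbiol:is><rdf:Bag><rdf:li rdf:resource="y"/></rdf:Bag></bqbiol:is>'
) + FOOTER


def flatten(rdf_xml):
    """Collect (subject, relation, resource) tuples, ignoring bag grouping."""
    root = ET.fromstring(rdf_xml)
    result = set()
    for description in root.findall(f"{{{RDF}}}Description"):
        about = description.get(f"{{{RDF}}}about")
        for relation in description:
            for item in relation.iter(f"{{{RDF}}}li"):
                result.add((about, relation.tag, item.get(f"{{{RDF}}}resource")))
    return result


print(flatten(ONE_BAG) == flatten(TWO_BAGS))
```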
The current core annotation scheme does not allow for the definition of negative statements, that is, statements along the lines of "protein X is NOT phosphorylated".
The Annot Package proposal
Neil: Replace all RDF/XML examples with RDF graphs.
The following section summarises the conclusions of the discussion of new concepts incorporated in the Annot package. It is split into three parts. Firstly, it summarises the approaches that solve some of the aforementioned problems (solutions with consensus). Secondly, it summarises problems for which several proposed solutions co-exist (proposed solutions open for discussion). Thirdly, it lists the remaining issues on which no consensus has been reached so far (open issues).
Namespace and integration with SBML L3
The standard namespace for the annot package is http://www.sbml.org/sbml/level3/version1/annot/version1. A new version of the annot package will be released with each new version of the Core package in order to comply with the new version of the Core (following the SBML L3 package mechanism description).
In order to use the annot package for SBML L3 models, the annot namespace must be added to the <sbml> element namespace declarations:
<sbml xmlns="http://www.sbml.org/sbml/level3/version1/core" level="3" version="1"
      xmlns:annot="http://www.sbml.org/sbml/level3/version1/annot/version1" ...>
  ...
</sbml>
An SBML model can always be fully understood mathematically without understanding the annot package extension. Therefore, the use of the package for parsing an SBML model is optional. This is indicated by adding the XML attribute annot:required to the <sbml> element and setting its value to false:
<sbml xmlns="http://www.sbml.org/sbml/level3/version1/core" level="3" version="1"
      xmlns:annot="http://www.sbml.org/sbml/level3/version1/annot/version1" ...
      annot:required="false" ...>
  <model>
    ...
  </model>
</sbml>
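A reading tool could check this flag before deciding whether it may ignore the package. The sketch below is only an illustration using Python's standard library; the elided parts of the example are replaced by a minimal empty model, and `annot_is_required` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

ANNOT_NS = "http://www.sbml.org/sbml/level3/version1/annot/version1"

# Minimal stand-in document; a real model would contain much more.
SBML_DOC = """<sbml xmlns="http://www.sbml.org/sbml/level3/version1/core"
      xmlns:annot="http://www.sbml.org/sbml/level3/version1/annot/version1"
      level="3" version="1" annot:required="false">
  <model/>
</sbml>"""


def annot_is_required(sbml_xml):
    """Return True/False for annot:required, or None if it is not declared."""
    root = ET.fromstring(sbml_xml)
    value = root.get(f"{{{ANNOT_NS}}}required")
    if value is None:
        return None
    return value in ("true", "1")


print(annot_is_required(SBML_DOC))
```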
Statements about attributes
Sometimes, it is not only necessary to annotate an SBML element, but a more fine-grained annotation of a particular attribute of an element is needed. We consider two solutions.
We propose the use of XPath (see http://www.w3schools.com/xpath/) to refer to a piece of XML inside the document. XPath is a standard technology for referencing elements and attributes inside an XML document, and it offers a well-defined scheme for doing so. Furthermore, a great number of tools exist to evaluate XPath expressions.
Therefore, we use the xpath namespace, which allows us to specify any local object in the
One should use the element's
id to refer to it, as in:
The following example shows an attribute annotation using the XPath notation.
<species metaid="metaid_0000042" id="Y" name="Intravesicular Calcium"
         compartment="intravesicular" initialConcentration="0.36">
  <annotation>
    <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
             xmlns:bqbiol="http://biomodels.net/biology-qualifiers/">
      <rdf:Description rdf:about="xpath://species[@id='Y']/@initialConcentration">
        <bqbiol:isDescribedBy>
          <rdf:Bag>
            <rdf:li rdf:resource="urn:miriam:pubmed:12343565"/>
          </rdf:Bag>
        </bqbiol:isDescribedBy>
      </rdf:Description>
    </rdf:RDF>
  </annotation>
</species>
The recommended way of providing the XPath is to:
- avoid addressing attributes and elements by their (ordering) number
- use the abbreviated syntax to identify an XML element in the model by its
id, and then refer to the particular attribute.
Addressing attributes and elements by their position is error-prone, because SBML does not attach any meaning to element order. As such, expressions along the lines of
are not recommended for use in the Annot package. Instead, the XPath should be specified as in the example given above.
Secondly, whenever possible, instead of providing the full paths to elements or attributes, the abbreviated syntax should be used, which first selects all elements of the SBML model and then limits the result set depending on the given id.
In XPath, a double forward slash // selects from all descendants of the context node as well as the context node itself. At the beginning of an XPath expression, it selects from all descendants of the root node. For example, the XPath expression //species selects all species elements in the document.
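How a tool might resolve such a reference can be sketched with Python's standard library. Several assumptions apply: the `xpath:` prefix is simply stripped, the id is tested with the standard XPath attribute predicate `[@id='Y']`, the sample elements carry no XML namespace (real SBML elements do), and because `xml.etree.ElementTree` supports only a subset of XPath without attribute steps, the trailing `/@attribute` is handled by hand. `resolve` is a hypothetical helper, not part of the proposal.

```python
import xml.etree.ElementTree as ET

# Namespace-free stand-in for part of an SBML model.
MODEL = """<listOfSpecies>
  <species id="X" compartment="cytosol" initialConcentration="0.1"/>
  <species id="Y" compartment="intravesicular" initialConcentration="0.36"/>
</listOfSpecies>"""


def resolve(rdf_about, root):
    """Resolve an 'xpath:' reference to elements or to attribute values."""
    prefix = "xpath:"
    expression = rdf_about[len(prefix):] if rdf_about.startswith(prefix) else rdf_about
    # Split off a trailing attribute step such as "/@initialConcentration".
    element_path, separator, attribute = expression.rpartition("/@")
    if not separator:
        element_path, attribute = expression, None
    # ElementTree paths are relative, so "//species" becomes ".//species".
    matches = root.findall("." + element_path)
    if attribute is None:
        return matches
    return [element.get(attribute) for element in matches]


root = ET.fromstring(MODEL)
print(resolve("xpath://species[@id='Y']/@initialConcentration", root))
print(len(resolve("xpath://species", root)))
```

Selecting by id rather than by position keeps the reference valid when the species elements are reordered.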
Statements about statements
We will utilize RDF Reification, the standard method of making statements about statements, as described in the RDF Primer, section 4.3. This allows statements to be made about other statements that have an rdf:ID assigned. We will use the rdf:about construct to refer to that rdf:ID and then define the annotation.
The following example demonstrates Reification being used to make a statement about a statement:
<species id="abc" metaid="meta_abc">
  <annotation>
    <rdf:RDF>
      <rdf:Description rdf:about="#meta_abc">
        <bqbiol:isDescribedBy rdf:ID="statement1">
          <rdf:Bag>
            <rdf:li rdf:resource="urn:miriam:pubmed:15387819"/>
          </rdf:Bag>
        </bqbiol:isDescribedBy>
      </rdf:Description>
      <rdf:Description rdf:about="#statement1">
        <dc:creator>John Smith</dc:creator>
      </rdf:Description>
    </rdf:RDF>
  </annotation>
</species>
By adding an rdf:ID to the first statement (which states that the species isDescribedBy PubMed document 15387819), a second statement can be made about the first one. Here, the second statement defines that the first statement has the creator John Smith.
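The link between the two statements can be followed programmatically. The sketch below, using only Python's standard library, collects the rdf:ID of each identified statement and then looks for rdf:Description elements whose rdf:about points at one of them. Namespace declarations are added so the fragment parses, and `creators_of_statements` is a hypothetical helper name; it covers only the pattern shown in the example.

```python
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
DC = "http://purl.org/dc/elements/1.1/"

ANNOTATION = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:bqbiol="http://biomodels.net/biology-qualifiers/"
    xmlns:dc="http://purl.org/dc/elements/1.1/">
  <rdf:Description rdf:about="#meta_abc">
    <bqbiol:isDescribedBy rdf:ID="statement1">
      <rdf:Bag>
        <rdf:li rdf:resource="urn:miriam:pubmed:15387819"/>
      </rdf:Bag>
    </bqbiol:isDescribedBy>
  </rdf:Description>
  <rdf:Description rdf:about="#statement1">
    <dc:creator>John Smith</dc:creator>
  </rdf:Description>
</rdf:RDF>"""


def creators_of_statements(rdf_xml):
    """Map each identified statement (by rdf:ID) to its dc:creator, if any."""
    root = ET.fromstring(rdf_xml)
    descriptions = root.findall(f"{{{RDF}}}Description")
    # Every property element carrying an rdf:ID is an identified statement.
    statement_ids = {
        prop.get(f"{{{RDF}}}ID")
        for description in descriptions
        for prop in description
        if prop.get(f"{{{RDF}}}ID")
    }
    creators = {}
    for description in descriptions:
        about = description.get(f"{{{RDF}}}about", "")
        if about.startswith("#") and about[1:] in statement_ids:
            creator = description.find(f"{{{DC}}}creator")
            if creator is not None:
                creators[about[1:]] = creator.text
    return creators


print(creators_of_statements(ANNOTATION))
```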
Neil: The use of blank fields is not an example of Reification. Move this elsewhere...
A second example captures the statement "protein X is modified by modification Y in position Z", utilizing blank nodes:
<species id="x" metaid="meta_x" name="Protein X">
  <annotation>
    <rdf:RDF>
      <rdf:Description rdf:about="#meta_x">
        <bqbiol:modification rdf:nodeID="node1"/>
      </rdf:Description>
      <rdf:Description rdf:nodeID="node1">
        <bqbiol:modifier rdf:resource="Y"/>
        <bqbiol:position rdf:datatype="xsd:integer">Z</bqbiol:position>
      </rdf:Description>
    </rdf:RDF>
  </annotation>
</species>
SBML has so far been very unspecific about the different people involved in building, publishing, curating and maintaining a model. We propose to use dc:creator from Dublin Core to provide meta-information about persons in general. Depending on where the annotation occurs, the semantics are the following:
- if dc:creator is related to the SBML <model> element's metaid, then it represents the model encoder/creator.
- if dc:creator is related to any rdf:Description element inside an SBML <annotation> element, then it represents the annotation creator.
  - if the referenced <annotation> element is an SBML model annotation, then the dc:creator says who provided the particular model annotation.
  - if the referenced <annotation> element is an SBML element annotation, then it says who provided the particular element annotation.
- if no dc:creator is defined for an rdf:Description element, then it is assumed that the model annotation creator is also the creator of that particular element.
See also New predicates.
Relations between statements
To enable a more detailed description of relations between statements, we propose to extend the current SBML annotation scheme to support all kinds of RDF Collections and Containers. RDF provides four different concepts to encode grouped statements: the three Containers rdf:Bag, rdf:Seq and rdf:Alt, and the Collection rdf:List (RDF Primer, sec. 4):
A resource having the type rdf:Bag represents an open group of resources or literals [..] where there is no significance in the order of the members.
A resource having the type rdf:Seq represents an open group of resources or literals [..] where the order of the members is significant.
A resource having the type rdf:Alt represents an open group of resources or literals that are alternatives (typically for a single value of a property).
A resource having the type rdf:List represents a closed group of resources or literals that consists only of the specified members.
In addition to supporting all RDF Collections and Containers, annotations that use no Collection or Container at all will be supported. The core annotations specify that an <rdf:Bag> must be used. This, however, is unnecessary for single objects, which can be specified more simply, as in the example below:
<species id="glc" metaid="meta_glc" name="Glucose">
  <annotation>
    <rdf:RDF>
      <rdf:Description rdf:about="#meta_glc">
        <bqbiol:is rdf:resource="urn:miriam:obo.chebi:CHEBI%3A17234"/>
      </rdf:Description>
    </rdf:RDF>
  </annotation>
</species>
Considering Collections and Containers, and taking the examples from the problem description above, the previously implied "or" relationship between two alternative means of annotating glucose (with a ChEBI term or a KEGG Compound term) can be made explicit by using the rdf:Alt container:
<species id="glc" metaid="meta_glc" name="Glucose">
  <annotation>
    <rdf:RDF>
      <rdf:Description rdf:about="#meta_glc">
        <bqbiol:is>
          <rdf:Alt>
            <rdf:li rdf:resource="urn:miriam:obo.chebi:CHEBI%3A17234"/>
            <rdf:li rdf:resource="urn:miriam:kegg.compound:C00234"/>
          </rdf:Alt>
        </bqbiol:is>
      </rdf:Description>
    </rdf:RDF>
  </annotation>
</species>
Similarly, the implied "and" relationship between two components of a complex (represented by a UniProt term for the protein, and a KEGG Compound term for the ligand) can be made explicit by utilising the rdf:List collection to specify a closed set:
<species id="Ca_calmodulin" metaid="cacam">
  <annotation>
    <rdf:RDF>
      <rdf:Description rdf:about="#cacam">
        <bqbiol:hasPart>
          <rdf:List>
            <rdf:li rdf:resource="urn:miriam:uniprot:P62158"/>
            <rdf:li rdf:resource="urn:miriam:kegg.compound:C00076"/>
          </rdf:List>
        </bqbiol:hasPart>
      </rdf:Description>
    </rdf:RDF>
  </annotation>
</species>
Distinction between L3 core and L3 Annot package annotations
To distinguish SBML Level 3 Core annotations from annotations provided through the Annot package, we propose a new element <annot:annotation> from the annot namespace as a sibling of the current SBML annotation element. This will allow us to distinguish between L3 Annot package annotations, i.e. the ones in the scope of this draft proposal, and existing SBML annotations, such as software tool annotations or annotations that do not comply with the annot package but which people might still want to use (including existing annotations from older model versions).
The following example shows the annot:annotation element as a sibling of the current SBML annotation element:
<annotation>
  [SBML STANDARD ANNOTATION]
</annotation>
<annot:annotation>
  <rdf:RDF>
    [ANY VALID RDF AS DEFINED IN THE ANNOT PACKAGE SPEC]
  </rdf:RDF>
</annot:annotation>
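Because the two elements live in different namespaces, a tool can pick them apart by qualified name. The following sketch assumes the namespace URIs used in the <sbml> example earlier in this proposal and a simplified species fragment; `split_annotations` is a hypothetical helper, shown only to illustrate the sibling layout.

```python
import xml.etree.ElementTree as ET

CORE_NS = "http://www.sbml.org/sbml/level3/version1/core"
ANNOT_NS = "http://www.sbml.org/sbml/level3/version1/annot/version1"
RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

# Simplified fragment: both annotation elements are children of one species.
SPECIES = """<species xmlns="http://www.sbml.org/sbml/level3/version1/core"
    xmlns:annot="http://www.sbml.org/sbml/level3/version1/annot/version1"
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    id="glc" metaid="meta_glc">
  <annotation/>
  <annot:annotation>
    <rdf:RDF/>
  </annot:annotation>
</species>"""


def split_annotations(element_xml):
    """Return the (core annotation, annot package annotation) child elements."""
    element = ET.fromstring(element_xml)
    core = element.find(f"{{{CORE_NS}}}annotation")
    package = element.find(f"{{{ANNOT_NS}}}annotation")
    return core, package


core, package = split_annotations(SPECIES)
print(core.tag)
print(package.tag)
```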
The advantage of this approach is that it avoids further overloading of the already much-used SBML annotation element. It also allows a cleaner distinction between the SBML L3 core and package elements.
The recommended practice for model annotation is the use of the Annot package, as it is less restricted in its syntax, and complies with RDF recommendations.
Cross-references and cross-element annotations
To realise self-references, i.e. to refer to an element in the same document, the existing RDF standard will be supported:
References to existing models that are not URIs (such as the example below), for instance web addresses, URLs, or local directories, are NOT supported by this proposal.
<rdf:li rdf:resource="file://../models/BM02#_986127" />
Proposed solutions (open to discussion)
The following issues are still open to discussion. Proposed solutions are included.
The current set of predicates (Biomodels.net qualifiers, http://www.ebi.ac.uk/miriam/main/qualifiers/) may have to be extended to exploit the proposals of this package extension. An example given earlier is adding "submitter" to distinguish between the model creator and the model submitter.
Furthermore, there is currently no support for annotating protein modifications. The predicate "modification" has been suggested in previous examples, above.
To satisfy RDF, predicates should be nouns, representing properties of the subject, rather than verbs as they are in the core annotation. RDF triples should follow the pattern, "SUBJECT has PREDICATE whose value is OBJECT". Core annotations result in nonsensical RDF triples such as "SPECIES has IS_DESCRIBED_BY whose value is PUBMED:12345". It is proposed that the existing Biomodels.net predicates be updated, such that, taking the example above, "IS_DESCRIBED_BY" is replaced by "DESCRIPTION".
Doing so would allow the set of predicates (properties), and relationships between them, to be defined formally in an RDF schema (see http://www.w3.org/TR/rdf-primer/#rdfschema).
This package does not depend on any other SBML Level 3 package.
None existing so far.
Translation to SBML Level 2
See the unresolved issues: solutions have been proposed, but no agreement has been reached so far.
Use of the annot package
There is no way to legislate how other packages make use of the Annotation structures coming from this package. Individual packages determine how best to make use of Annotation structures.
Use of old and new annotations
Duplicating semantic information (in both core annotation and the annotation package) is technically possible, but it is considered bad practice and not recommended.
In that case, we recommend instead updating the model by transforming the old annotations into the new annotation scheme.