January 2005 annotations proposal
We currently recommend a new approach to storing controlled vocabulary terms in SBML as annotations. It replaces the CellML metadata approach described in the SBML Level 2 Version 1 specification. If the SBML community accepts this approach and an SBML Level 2 Version 2 specification is produced (which is likely), this is the approach that will be proposed for it.
This proposal is for a syntax within the SBML Level 2 Version 1 standard for annotating models with references to controlled vocabularies. The approach is designed to be simple, highly extensible and compliant with the relevant existing standards. Within this scheme, controlled vocabularies can be extended without requiring multiple versions of a vocabulary to exist.
Basic Syntax for Referring to Controlled Vocabulary Terms and Database Identifiers
This section describes the proposed format for referring to controlled vocabulary terms and database identifiers. The syntax consists of Dublin Core identifier elements embedded in RDF elements embedded in SBML annotation elements. The following is an example of a fragment where a species has been annotated with a reference to a rate law controlled vocabulary term:
The "http://www.biomodels.net/vocabularies/ratelaw#RL:0000001" identifier contained in the rdf:resource attribute identifies the controlled vocabulary term that is used to label the rate law. This identifier maps to a unique rate law term which is never deleted from the given controlled vocabulary. In the controlled vocabulary resource (not shown) the term has a name, e.g. 'Mass Action', and a definition string. In the following sections we outline how a controlled vocabulary resource can be maintained so that terms are deprecated rather than deleted.
The value of the rdf:resource attribute is a URI that identifies both the controlled vocabulary and the term within it. The resource containing the term precedes the '#' symbol and the term or database identifier follows the '#' symbol. In the present example, the controlled vocabulary "http://www.biomodels.net/vocabularies/ratelaw" includes the term RL:0000001.
Note that the value of the rdf:resource attribute is a URI, not a URL; as such, it does not have to reference a physical web object but just identifies a controlled vocabulary term or database object. (Think of a URI as a label that just happens to look like a URL.) The URI can refer both to SBML-specific controlled vocabularies and to existing controlled vocabularies such as the Gene Ontology (GO); nothing in this scheme requires the URI to point to a particular group's controlled vocabulary.
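As a minimal sketch of how a tool might read such an annotation, the Python below parses a hand-written fragment and splits the rdf:resource value at the '#'. The fragment is an assumption made for illustration: the element nesting (a dc:relation directly inside rdf:Description) and the "#meta_0001" value are hypothetical, and only the URI is taken from the text above.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
DC_NS = "http://purl.org/dc/elements/1.1/"

# Hypothetical annotation fragment; only the rdf:resource URI comes from the text.
FRAGMENT = """
<annotation xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
            xmlns:dc="http://purl.org/dc/elements/1.1/">
  <rdf:RDF>
    <rdf:Description rdf:about="#meta_0001">
      <dc:relation rdf:resource="http://www.biomodels.net/vocabularies/ratelaw#RL:0000001"/>
    </rdf:Description>
  </rdf:RDF>
</annotation>
"""


def split_term_uri(uri):
    """Split 'resource#id' into (resource, id) at the first '#'."""
    resource, sep, term_id = uri.partition("#")
    if not sep or not resource or not term_id:
        raise ValueError(f"not of the form resource#id: {uri!r}")
    return resource, term_id


def term_uris(annotation_xml):
    """Return every rdf:resource value found on a dc:relation element."""
    root = ET.fromstring(annotation_xml)
    key = f"{{{RDF_NS}}}resource"
    return [el.get(key)
            for el in root.iter(f"{{{DC_NS}}}relation")
            if el.get(key) is not None]


if __name__ == "__main__":
    for uri in term_uris(FRAGMENT):
        print(split_term_uri(uri))
```

Because the split is purely textual, the code never needs to dereference the URI, which matches the intent that the URI acts as a label and not as a web address.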
To enable interoperability, the community would have to agree on a set of valid URI syntaxes. These URIs will always be composed as "resource#id". The URI syntax rules would not be a fixed part of the SBML standard but could be extended independently of specific SBML Levels and Versions. We would set up a web page, available through sbml.org, that points to a new website, biomodels.net, where we would list URI syntaxes and physical links to controlled vocabulary files. This list would contain the set of strings that can precede the '#' and, for each of them, a brief summary of the syntax of the identifier that follows the '#'. The scheme does not require such a list in order to operate.
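One way to picture such a list is as a lookup table from resource string to identifier syntax. The sketch below is hypothetical throughout: the proposal does not define a registry format, the identifier patterns are assumed, and the example.org entry is a placeholder rather than an agreed URI.

```python
import re

# Hypothetical registry: resource string -> pattern for the identifier after '#'.
# The rate law resource string is from the text; its pattern is an assumption,
# and the example.org entry is a made-up placeholder.
URI_REGISTRY = {
    "http://www.biomodels.net/vocabularies/ratelaw": re.compile(r"RL:\d{7}"),
    "http://example.org/vocabularies/go": re.compile(r"GO:\d{7}"),
}


def check_uri(uri, registry=URI_REGISTRY):
    """Classify a URI as 'valid', 'bad-id', 'unknown-resource' or 'malformed'."""
    resource, sep, term_id = uri.partition("#")
    if not sep or not resource or not term_id:
        return "malformed"
    pattern = registry.get(resource)
    if pattern is None:
        # Not an error: the list is advisory and the scheme works without it.
        return "unknown-resource"
    return "valid" if pattern.fullmatch(term_id) else "bad-id"


if __name__ == "__main__":
    print(check_uri("http://www.biomodels.net/vocabularies/ratelaw#RL:0000001"))
```

Treating an unlisted resource as a separate, non-fatal outcome reflects the point that the list aids interoperability but is not needed for the scheme to function.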
The use of rdf:Bag allows multiple links to external resources for a given SBML object as shown in the following example:
The value of the rdf:about attribute should match the metaid attribute value of the SBML element that it corresponds to. Technically, the use of metaid and rdf:about attributes enables any number of RDF elements to be placed anywhere in the SBML document. Best practice is to place the rdf:Description element for an SBML element within the annotation element of that SBML element, which makes editing and deleting the annotation straightforward to manage. Ideally an annotation element should contain only one nested sequence of rdf:RDF, rdf:Description, dc:relation and rdf:Bag elements.
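The nesting and the metaid check described above can be sketched as follows. The species element, its id and metaid, the second URI, the omission of the SBML default namespace and the acceptance of a leading '#' in rdf:about are all assumptions made for illustration; only the element sequence rdf:RDF, rdf:Description, dc:relation, rdf:Bag follows the text.

```python
import xml.etree.ElementTree as ET

RDF = "{http://www.w3.org/1999/02/22-rdf-syntax-ns#}"
DC = "{http://purl.org/dc/elements/1.1/}"

# Hypothetical species element; ids, metaid and the second URI are illustrative.
SPECIES = """
<species xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:dc="http://purl.org/dc/elements/1.1/"
         id="S1" metaid="meta_S1">
  <annotation>
    <rdf:RDF>
      <rdf:Description rdf:about="#meta_S1">
        <dc:relation>
          <rdf:Bag>
            <rdf:li rdf:resource="http://www.biomodels.net/vocabularies/ratelaw#RL:0000001"/>
            <rdf:li rdf:resource="http://example.org/vocabularies/other#X:0000042"/>
          </rdf:Bag>
        </dc:relation>
      </rdf:Description>
    </rdf:RDF>
  </annotation>
</species>
"""


def annotations_for(element):
    """Return the rdf:li resource URIs in the element's own annotation
    whose enclosing rdf:Description refers to the element's metaid."""
    metaid = element.get("metaid")
    annotation = element.find("annotation")
    if metaid is None or annotation is None:
        return []
    uris = []
    for desc in annotation.iter(RDF + "Description"):
        # Assumption: accept both "meta_S1" and "#meta_S1" as a match.
        if desc.get(RDF + "about", "").lstrip("#") != metaid:
            continue  # this description refers to some other element
        for li in desc.findall(f"{DC}relation/{RDF}Bag/{RDF}li"):
            uri = li.get(RDF + "resource")
            if uri:
                uris.append(uri)
    return uris


if __name__ == "__main__":
    print(annotations_for(ET.fromstring(SPECIES)))
```

Looking only inside the element's own annotation follows the best practice stated above; a reader that has to cope with RDF placed elsewhere in the document would instead need to index every rdf:Description by its rdf:about value.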
Development and Representation of Controlled Vocabularies
The scheme outlined above for referencing controlled vocabularies neither references controlled vocabularies directly nor defines the format for those vocabularies. In this section we propose the Open Biological Ontologies (OBO) controlled vocabulary flat file format (see http://www.geneontology.org/GO.format.html#oboflat) or its XML equivalent (see http://www.geneontology.org/GO.format.html#XML) as formats for sharing controlled vocabularies. These formats have been developed as part of the Gene Ontology project (see http://www.geneontology.org/index.shtml). Using the Gene Ontology flat file format has a number of benefits:
- Tools exist for editing controlled vocabularies stored in this format (see http://www.geneontology.org/GO.tools.html). The examples below were generated with DAG-Edit (see http://sourceforge.net/project/showfiles.php?group_id=36855).
- The format can include an audit trail, so that terms can be split, merged and deleted while term identifiers are maintained over time.
The OBO XML format currently has less tool support but is more flexible in other respects. We would prefer to use the XML format but feel that it is impractical until better support is available. We anticipate a gradual switch to the XML format over time. The SBML syntax can support either format.
As an example, consider a simple controlled vocabulary for rate laws consisting of two terms, Mass Action and Hill, which in the OBO flat file format would be represented as:
If the Hill term is deleted, the file is modified as follows:
Another two terms can be added:
These two terms can be merged:
Term definitions can be added to the controlled vocabulary as follows:
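The end state of the editing steps above can be sketched with a small reader for OBO flat file stanzas. The vocabulary content here is illustrative and is not the output of DAG-Edit: apart from Mass Action and Hill, the term names, the identifiers RL:0000002 to RL:0000004 and the definition string are hypothetical. The tags used (id, name, def, is_obsolete, alt_id) are standard OBO flat file tags, and the reader handles only these simple tag-value lines.

```python
# Illustrative vocabulary: Hill has been made obsolete, RL:0000004 has been
# merged into RL:0000003, and Mass Action carries a definition.
OBO_TEXT = """\
[Term]
id: RL:0000001
name: Mass Action
def: "Rate proportional to the product of reactant concentrations." []

[Term]
id: RL:0000002
name: Hill
is_obsolete: true

[Term]
id: RL:0000003
name: Michaelis-Menten
alt_id: RL:0000004
"""


def parse_terms(text):
    """Parse [Term] stanzas into dicts of tag -> list of values."""
    terms, current = [], None
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("!"):  # blank line or OBO comment
            continue
        if line.startswith("["):
            current = {} if line == "[Term]" else None
            if current is not None:
                terms.append(current)
            continue
        if current is not None and ":" in line:
            tag, _, value = line.partition(":")
            current.setdefault(tag.strip(), []).append(value.strip())
    return terms


def resolve(term_id, terms):
    """Return (status, term); status is 'current', 'obsolete', 'merged' or 'unknown'."""
    for term in terms:
        if term_id in term.get("id", []):
            obsolete = term.get("is_obsolete") == ["true"]
            return ("obsolete" if obsolete else "current"), term
    for term in terms:
        if term_id in term.get("alt_id", []):
            return "merged", term  # old identifier kept as a secondary id
    return "unknown", None


if __name__ == "__main__":
    terms = parse_terms(OBO_TEXT)
    for tid in ("RL:0000001", "RL:0000002", "RL:0000004"):
        print(tid, resolve(tid, terms)[0])
```

The point of the sketch is that an identifier already stored in an SBML annotation still resolves after the vocabulary is edited: an obsolete term remains in the file, and a merged term remains reachable through the surviving term's alt_id.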
Relationship to existing standards and proposals
This proposal is written in response to comments about a previous proposal for SBML Level 2 Version 2 discussed at the last hackathon in Heidelberg (October 2004); see http://www.sbml.org/workshops/ninth/supplementary/sbml-level-2-version-2-proposal.pdf for that proposal. The scheme described in this email is better grounded in the RDF and Dublin Core standards than that proposal, is easier to parse, and has the advantage that it can be adopted now as SBML L2 best practice.
In addition, we propose that the CellML metadata bioentity element be superseded by the scheme described here, which uses only Dublin Core elements. We suggest that the CellML metadata bioentity element has an unnecessarily complex syntax, with resources referred to either via a fixed set of strings specific to CellML or via a URI. We believe that the scheme proposed here meets the needs of the community and is significantly simpler than the CellML bioentity syntax.
The CellML metadata specification is at http://www.cellml.org/public/metadata/cellml_metadata_specification.html and the bioentity definition is in section 4.10.
Feedback is requested
Thanks for reading this far. Please let us know if you think the concepts described here are useful and should become part of SBML best practice and/or future SBML standards.