Specification version: v1.0
The ontology file format stores all information currently used by the Erasmus MC Bioinformatics Department.
Comments
A hash (#) character at the beginning of a line starts a comment; the whole line is ignored. Empty lines are ignored as well.
String escaping
Tabs and end-of-line characters are escaped by replacing them with \t and \n character sequences respectively.
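The escaping rule can be sketched in Python as follows. This is a minimal illustration, not the reference implementation: the function names are hypothetical, only the line feed is treated as an end-of-line character, and the sketch assumes that backslashes themselves are not escaped, since the specification does not say how a literal backslash followed by t or n should be handled.

```python
def escape_value(text: str) -> str:
    """Replace tab and line-feed characters with the two-character sequences \\t and \\n."""
    return text.replace("\t", "\\t").replace("\n", "\\n")


def unescape_value(text: str) -> str:
    """Reverse escape_value.

    Assumption: the original value held no literal backslash followed by
    't' or 'n', because the format defines no escape for the backslash.
    """
    return text.replace("\\t", "\t").replace("\\n", "\n")
```

With these helpers, a multi-line definition can be stored on a single DF line and restored when the file is read back.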
Field constraints
- [0,1] Optional, can appear only once per concept
- [1] Required, must appear exactly once per concept
- [0–n] Optional, can appear many times
Format of the explanation
All lines are given as:
Field | Description | Constraint |
Header
First line is an optional comment:
# ErasmusMC ontology file | [0,1] |
Second line must contain the ontology version field:
VR | Ontology version | [1] |
Third line must contain the ontology name field:
ON | Ontology name | [1] |
Fourth line must contain the end-of-header marker:
-- | End of header | [1] |
Body
The rest of the file consists of one field per line. The first two characters contain the field identifier, the third character is a space, and the remainder contains the value.
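As an illustration of this line layout, the sketch below splits one body line into its field identifier and value. The function name split_field is hypothetical, and treating the bare -- marker as a field with an empty value is an assumption made for convenience.

```python
from typing import Tuple


def split_field(line: str) -> Tuple[str, str]:
    """Split a body line into (field identifier, value)."""
    line = line.rstrip("\r\n")
    if line == "--":
        # The end-of-header / end-of-concept marker carries no value.
        return "--", ""
    if len(line) < 3 or line[2] != " ":
        raise ValueError(f"not a valid field line: {line!r}")
    # Characters 1-2 are the identifier, character 3 is the space,
    # everything after it is the value (kept verbatim, including tabs).
    return line[:2], line[3:]
```

For example, split_field("NA Disease or Syndrome") yields the pair ("NA", "Disease or Syndrome").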
NS | Name space | [0,1] |
Currently, only two namespace values are recognized:
- SemType for defining semantic types.
- Voc for defining vocabularies.
ID | Concept ID | [1] |
Unique concept identifier. Currently, only integers are allowed for concepts other than vocabularies.
NA | Name | [0,1] |
This is the name used to display this object. This term should not be used during indexing.
TM | Term | [0–n] |
Term that can be used for concept recognition. A term can consist of more than one word. For each concept, the first term is indexed as the preferred term; all remaining terms are used as alternative terms. Optionally, the term can be followed by a tab and an @ sign, followed by additional term information. Multiple items of information should be separated by semicolons. Two types of information are currently supported:
- lang = language, using a two-letter code as specified in ISO 639-1.
- match = matching flags. Can be ci (case insensitive) and/or no (normalized). Multiple flags should be comma separated.
The two flags combine as follows. The "term" mentioned below can be either in the ontology file or in the to-be-indexed text.
- If no flag is set, the term is matched case sensitively, except for the first letter of each word.
- If only ci is set, the term is always matched case insensitively.
- If only no is set: (1) if the term is an abbreviation (more than half of its letters are uppercase), the no flag has no effect on letter case and the term is always matched case sensitively; (2) if the term is not an abbreviation, the term is normalized first and consequently is always matched case insensitively.
- If both ci and no are set, the term is normalized and then matched case insensitively, regardless of whether the term is an abbreviation.
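The TM value layout and the flag combinations described above can be sketched as follows. All function names and the returned mode strings are hypothetical, chosen for illustration only. The sketch decides which case mode applies; it does not implement normalization itself, which this specification does not define.

```python
from typing import Optional, Set, Tuple


def parse_term(value: str) -> Tuple[str, Optional[str], Set[str]]:
    """Split a TM value into (term, language code or None, set of match flags)."""
    term, sep, info = value.partition("\t@")
    attrs = {}
    if sep:
        for item in info.split(";"):
            key, _, val = item.partition("=")
            attrs[key.strip()] = val.strip()
    flags = {f.strip() for f in attrs.get("match", "").split(",") if f.strip()}
    return term, attrs.get("lang"), flags


def is_abbreviation(term: str) -> bool:
    """True when more than half of the letters in the term are uppercase."""
    letters = [c for c in term if c.isalpha()]
    return bool(letters) and 2 * sum(c.isupper() for c in letters) > len(letters)


def case_mode(term: str, flags: Set[str]) -> str:
    """Return the case handling implied by the ci/no flags for this term."""
    if "ci" in flags:
        # Covers both "ci only" and "ci and no".
        return "insensitive"
    if "no" in flags:
        # "no" alone leaves abbreviations case sensitive.
        return "sensitive" if is_abbreviation(term) else "insensitive"
    # No flag: case sensitive except for the first letter of each word.
    return "sensitive_except_initials"
```

For instance, a term such as DNA carrying only the no flag would be matched case sensitively, whereas Malaria with the same flag would be matched case insensitively.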
DF | Definition | [0,1] |
The concept definition.
DB | External database identifier | [0–n] |
An identifier in an external database. The identifier consists of two parts, separated by an underscore: the database and the identifier.
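A DB value can be taken apart as sketched below. The function name is hypothetical, and splitting at the first underscore is an assumption: the specification does not say whether a database name may itself contain an underscore.

```python
from typing import Tuple


def split_db_identifier(value: str) -> Tuple[str, str]:
    """Split a DB value into (database, identifier) at the first underscore."""
    database, sep, identifier = value.partition("_")
    if not sep or not database or not identifier:
        raise ValueError(f"expected <database>_<identifier>, got {value!r}")
    return database, identifier
```

Applied to the value UMLS_C0009187, this gives the database UMLS and the identifier C0009187.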
ST | Semantic type | [0–n] |
A number that must refer to a concept in the SemType namespace.
VO | Vocabulary | [0–n] |
A string identifying a source vocabulary that must refer to a vocabulary in the Voc namespace.
DI | Disambiguation | [0,1] |
An identifier for the type of disambiguation to be used. Currently, two types are available:
- lo (loose), used for generic concepts
- st (strict), used for genes and chemicals
PA | Parent concept ID | [0–n] |
Must refer to an existing concept ID.
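The reference constraints on the ST, VO, and PA fields could be checked along the following lines. This is a sketch under stated assumptions: concepts are represented as dictionaries mapping a field identifier to a list of values (a hypothetical in-memory form), and a PA value is accepted if it matches any concept ID in the file, regardless of where that concept is defined.

```python
from typing import Dict, List


def check_references(concepts: List[Dict[str, List[str]]]) -> List[str]:
    """Return one message per ST, VO, or PA value that points at nothing."""
    semtypes = {c["ID"][0] for c in concepts if c.get("NS") == ["SemType"]}
    vocabularies = {c["ID"][0] for c in concepts if c.get("NS") == ["Voc"]}
    all_ids = {c["ID"][0] for c in concepts}

    targets = (("ST", semtypes), ("VO", vocabularies), ("PA", all_ids))
    errors = []
    for concept in concepts:
        concept_id = concept["ID"][0]
        for field, valid in targets:
            for value in concept.get(field, []):
                if value not in valid:
                    errors.append(f"concept {concept_id}: {field} {value} has no target")
    return errors
```

An empty result means every semantic type, vocabulary, and parent reference resolves to a concept in the same file.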
-- | End of concept | [1] |
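Putting the header and body rules together, a reader for the whole file might look like the sketch below. It is an illustration rather than the department's parser: the function name and the returned data shapes are assumptions, values are returned still escaped, and unknown field identifiers are rejected, which the specification does not explicitly require.

```python
from collections import Counter
from typing import Dict, List, Tuple

SINGLE = {"NS", "ID", "NA", "DF", "DI"}      # constraint [0,1] or [1]
REPEATABLE = {"TM", "DB", "ST", "VO", "PA"}  # constraint [0-n]
REQUIRED = {"ID"}                            # constraint [1]


def parse_ontology(text: str) -> Tuple[Dict[str, str], List[Dict[str, List[str]]]]:
    """Return (header, concepts); each concept maps a field to its list of values."""
    blocks, current = [], []
    for number, line in enumerate(text.splitlines(), start=1):
        if not line.strip() or line.startswith("#"):
            continue  # empty lines and comments are ignored
        if line.rstrip() == "--":
            blocks.append(current)  # marker closes the header or a concept
            current = []
        elif len(line) >= 3 and line[2] == " ":
            current.append((line[:2], line[3:]))
        else:
            raise ValueError(f"line {number}: expected 'XX value' or '--'")
    if current or not blocks:
        raise ValueError("file must end with a '--' marker")

    # The header is the first block: VR, then ON, in that order.
    if [field for field, _ in blocks[0]] != ["VR", "ON"]:
        raise ValueError("header must contain VR followed by ON")
    header = dict(blocks[0])

    concepts = []
    for fields in blocks[1:]:
        counts = Counter(field for field, _ in fields)
        for field, count in counts.items():
            if field not in SINGLE | REPEATABLE:
                raise ValueError(f"unknown field {field!r}")
            if field in SINGLE and count > 1:
                raise ValueError(f"field {field} appears {count} times in one concept")
        missing = REQUIRED - set(counts)
        if missing:
            raise ValueError(f"concept lacks required field(s): {sorted(missing)}")
        concept: Dict[str, List[str]] = {}
        for field, value in fields:
            concept.setdefault(field, []).append(value)
        concepts.append(concept)
    return header, concepts
```

Keeping every field as a list, even those limited to one occurrence, gives all fields the same shape and leaves the cardinality rules to the validation step.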
Example
# ErasmusMC ontology file
VR 1.0
ON Anni2_1_June2009
--
NS Voc
ID MSH
NA MeSH
--
NS Voc
ID SNOMEDCT
NA SNOMED Clinical Terms
--
NS SemType
ID 47
NA Disease or Syndrome
DF A condition which alters or interferes with a normal process, state, or activity of an organism. It is usually characterized by the abnormal functioning of one or more of the host's systems, parts, or organs. Included here is a complex of symptoms descriptive of a disorder.
--
ID 9187
NA Coccidiosis
TM Coccidiosis	@match=ci,no
TM Coccidiosis	@lang=es;match=ci,no
DF Protozoan infection found in animals and man. It is caused by several different genera of COCCIDIA.
DB UMLS_C0009187
ST 47
VO MSH
VO SNOMEDCT
DI lo
--
ID 24530
NA Malaria
TM Malaria	@match=ci,no
TM Malaria NOS	@match=ci,no
TM Plasmodium Infections	@match=ci,no
TM Remittent Fever	@match=ci,no
TM Paludism	@match=ci,no
TM Plasmodiosis	@match=ci,no
TM Malarial fever	@match=ci,no
TM Malaria (disorder)	@match=ci,no
TM Marsh Fever	@match=ci,no
TM Malaria	@lang=es;match=ci,no
DB UMLS_C0024530
ST 47
VO MSH
VO SNOMEDCT
DI lo
PA 9187
--
A converter from the deprecated PSF file format to the new ontology file format is available as psf2ontology.jar (run it as java -jar psf2ontology.jar file.psf).