ErasmusMC ontology file format

Specification version: v1.0

The ontology file format stores all information currently used by the Erasmus MC Bioinformatics Department.

Comments

A hash (#) character at the beginning of a line starts a comment; the whole line is ignored. Empty lines are ignored as well.

String escaping

Tabs and end-of-line characters are escaped by replacing them with \t and \n character sequences respectively.
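
The escaping rule above can be sketched in Python as follows. The function names are illustrative only. The specification does not say how a literal backslash is represented, so this sketch leaves backslashes untouched (an assumption); as a result, a value that already contains the two characters backslash and t will not survive a round trip. Only the line feed character is treated as end-of-line here.

```python
def escape_value(text: str) -> str:
    """Replace real tab and line feed characters with the sequences \\t and \\n."""
    return text.replace("\t", "\\t").replace("\n", "\\n")


def unescape_value(text: str) -> str:
    """Turn the sequences \\t and \\n back into real tab and line feed characters."""
    return text.replace("\\t", "\t").replace("\\n", "\n")
```

For instance, a definition spanning two lines would be written on a single DF line after passing through escape_value.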

Field constraints

  • [0,1] Optional, can appear only once per concept
  • [1] Required, must appear only once per concept
  • [0–n] Optional, can appear many times
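
As a rough illustration, the three constraints can be checked against the number of times a field occurs in one concept. The function name and the use of the constraint label as a string are assumptions made for this sketch, not part of the format.

```python
def satisfies(constraint: str, count: int) -> bool:
    """Check an occurrence count against one of the three constraint labels."""
    if count < 0:
        raise ValueError("count cannot be negative")
    # Accept both the en dash used on this page and a plain hyphen.
    label = constraint.replace("\u2013", "-")
    if label == "[0,1]":
        return count <= 1   # optional, at most once
    if label == "[1]":
        return count == 1   # required, exactly once
    if label == "[0-n]":
        return True         # optional, any number of times
    raise ValueError("unknown constraint: " + constraint)
```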

Format of the explanation

All lines are given as:

Field Description Constraint

First line is an optional comment:

# ErasmusMC ontology file [0,1]

Second line must contain the ontology version field:

VR Ontology version [1]

Third line must contain the ontology name field:

ON Ontology name [1]

Fourth line must contain the end-of-header marker:

-- End of header [1]

Body

The rest of the file consists of one field per line. The first two characters contain the field identifier, the third character is a space, and the remainder contains the value.
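
A minimal sketch of splitting one body line according to this layout. The function name is hypothetical, and treating the bare -- marker as a field with an empty value is a choice made for this example.

```python
def split_field(line: str):
    """Split a body line into (identifier, value)."""
    if line == "--":
        # End-of-concept marker: it carries no value.
        return "--", ""
    if len(line) < 3 or line[2] != " ":
        raise ValueError("expected a two-character identifier followed by a space")
    # Characters 1-2 are the identifier, character 3 is the space, the rest is the value.
    return line[:2], line[3:]
```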

NS Name space [0,1]

Currently, only two namespace values are recognized:

  • SemType for defining semantic types.
  • Voc for defining vocabularies.

ID Concept ID [1]

Unique concept identifier. Currently, only integers are allowed for concepts other than vocabularies.

NA Name [0,1]

This is the name used to display this object. This name should not be used during indexing.

TM Term [0–n]

A term that can be used for concept recognition. A term can consist of more than one word. For each concept, the first term is indexed as the preferred term; all remaining terms are used as alternative terms. Optionally, the term can be followed by a tab and an @ sign, followed by additional term information. Multiple items of information are separated by semicolons. Two types of information are currently supported:

  • lang = language, using a two-letter code as specified in ISO 639-1.
  • match = matching flags. Can be ci (case insensitive) and/or no (normalised). Multiple flags are separated by commas.

The possible combinations of these two flags behave as follows. The "term" mentioned below can be either in the ontology file or in the to-be-indexed text.

    1. If no flag is set, the term is always matched case sensitively, except for the first letter in each word.
    2. If only ci is set, the term is always matched case insensitively.
    3. If only no is set: (1) if the term is an abbreviation (more than half of its letters are uppercase), the no flag has no effect on letter case and the term is always matched case sensitively; (2) if the term is not an abbreviation, the term is normalised first and consequently is always matched case insensitively.
    4. If both ci and no are set, the term is normalised and then matched case insensitively, regardless of whether the term is an abbreviation.

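
The TM value layout and the flag rules can be sketched as below. All function names are hypothetical. The sketch only decides whether letter case matters; it does not implement normalisation itself, and it does not model the first-letter exception of rule 1.

```python
def parse_term(value: str):
    """Split a TM value into (term, info); info maps keys such as 'lang' and 'match'."""
    term, sep, extra = value.partition("\t@")
    info = {}
    if sep:
        for item in extra.split(";"):
            key, _, val = item.partition("=")
            key, val = key.strip(), val.strip()
            if key == "match":
                # Flags are comma separated, e.g. "ci,no".
                info[key] = [flag.strip() for flag in val.split(",")]
            else:
                info[key] = val
    return term, info


def is_abbreviation(term: str) -> bool:
    """True if more than half of the letters in the term are uppercase."""
    letters = [c for c in term if c.isalpha()]
    upper = [c for c in letters if c.isupper()]
    return len(letters) > 0 and 2 * len(upper) > len(letters)


def is_case_sensitive(term: str, flags) -> bool:
    """Apply rules 1-4 to decide whether matching respects letter case."""
    if "ci" in flags:
        return False                  # rules 2 and 4
    if "no" in flags:
        return is_abbreviation(term)  # rule 3
    return True                       # rule 1 (first-letter exception not modelled)
```

With these helpers, a value such as the Spanish term in the example file yields the term text plus a small dictionary holding the language code and the list of flags.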
DF Definition [0,1]

The concept definition.

DB External database identifier [0–n]

An identifier in an external database. The identifier consists of two parts separated by an underscore: the database, and the identifier within that database.
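
A small sketch of splitting a DB value into its two parts. It assumes that the database name itself contains no underscore, so the split happens at the first underscore; the specification does not state this, and the function name is illustrative.

```python
def split_db_identifier(value: str):
    """Split a DB value such as 'UMLS_C0009187' into (database, identifier)."""
    # Assumption: the first underscore separates the two parts.
    database, sep, identifier = value.partition("_")
    if not sep or not database or not identifier:
        raise ValueError("expected <database>_<identifier>: " + value)
    return database, identifier
```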

ST Semantic type [0–n]

A number that must refer to a concept in the SemType namespace.

VO Vocabulary [0–n]

A string identifying a source vocabulary that must refer to a vocabulary in the Voc namespace.

DI Disambiguation [0,1]

An identifier for the type of disambiguation to be used. Currently, 2 types are available:

  • lo (loose): used for generic concepts
  • st (strict): used for genes and chemicals

PA Parent concept ID [0–n]

Must refer to an existing concept ID.

-- End of concept [1]
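
The reference rules for ST, VO and PA can be checked once all concepts are loaded. The sketch below assumes each concept is held as a dictionary with an 'ID' string, an optional 'NS' string, and optional 'ST', 'VO' and 'PA' lists of strings; that in-memory layout and the function name are assumptions of this example. It also assumes that PA may point to any concept ID in the file, regardless of namespace.

```python
def find_broken_references(concepts):
    """Return (concept ID, field, target) triples for references that do not resolve."""
    semtypes = {c["ID"] for c in concepts if c.get("NS") == "SemType"}
    vocabularies = {c["ID"] for c in concepts if c.get("NS") == "Voc"}
    all_ids = {c["ID"] for c in concepts}
    # Each field is checked against the set of IDs it is allowed to refer to.
    targets = (("ST", semtypes), ("VO", vocabularies), ("PA", all_ids))
    problems = []
    for concept in concepts:
        for field, known in targets:
            for ref in concept.get(field, []):
                if ref not in known:
                    problems.append((concept["ID"], field, ref))
    return problems
```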

Example

# ErasmusMC ontology file
VR 1.0
ON Anni2_1_June2009
--
NS Voc
ID MSH
NA MeSH
--
NS Voc
ID SNOMEDCT
NA SNOMED Clinical Terms
--
NS SemType
ID 47
NA Disease or Syndrome
DF A condition which alters or interferes with a normal process, state, or activity of an organism.  It is usually characterized by the abnormal functioning of one or more of the host's systems, parts, or organs.  Included here is a complex of symptoms descriptive of a disorder.
--
ID 9187
NA Coccidiosis
TM Coccidiosis	@match=ci,no
TM Coccidiosis	@lang=es;match=ci,no
DF Protozoan infection found in animals and man. It is caused by several different genera of COCCIDIA. 
DB UMLS_C0009187
ST 47
VO MSH
VO SNOMEDCT
DI lo
--
ID 24530
NA Malaria
TM Malaria	@match=ci,no
TM Malaria NOS	@match=ci,no
TM Plasmodium Infections	@match=ci,no
TM Remittent Fever	@match=ci,no
TM Paludism	@match=ci,no
TM Plasmodiosis	@match=ci,no
TM Malarial fever	@match=ci,no
TM Malaria (disorder)	@match=ci,no
TM Marsh Fever	@match=ci,no
TM Malaria	@lang=es;match=ci,no
DB UMLS_C0024530
ST 47
VO MSH
VO SNOMEDCT
DI lo
PA 9187
--
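
A file like the example above could be read with a short Python routine along these lines. This is a sketch under assumptions, not the department's implementation: the function name is hypothetical, each concept is returned as a plain list of (field, value) pairs, and no constraint or reference checking is done.

```python
def read_ontology(text: str):
    """Return (version, name, concepts) for the text of an ontology file."""
    # Comments and empty lines are ignored everywhere in the file.
    lines = [ln for ln in text.split("\n") if ln and not ln.startswith("#")]
    if (len(lines) < 3 or not lines[0].startswith("VR ")
            or not lines[1].startswith("ON ") or lines[2] != "--"):
        raise ValueError("malformed header")
    version, name = lines[0][3:], lines[1][3:]

    concepts, current = [], []
    for ln in lines[3:]:
        if ln == "--":
            # End-of-concept marker closes the concept collected so far.
            concepts.append(current)
            current = []
        else:
            current.append((ln[:2], ln[3:]))
    if current:
        raise ValueError("last concept is not terminated by --")
    return version, name, concepts
```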

The converter from the deprecated PSF file format to the new ontology file format is psf2ontology.jar (run it as java -jar psf2ontology.jar file.psf).

Last modified on Aug 19, 2013, 2:26:13 PM