wiki:DisambiguationSteps

Version 5 (modified by rob.hooft@…, 11 years ago) (diff)


Overall process

After the results are retrieved from the indexing engine, a dedicated procedure checks that they are accurate in the given context. The same term may be assigned to different concepts (e.g. H1C1 may refer to a gene, but may also be an alternative name for a disease). The process of resolving such conflicts is called disambiguation.

In the current Peregrine architecture the disambiguation process is divided into two stages:

  1. Indexing results are fed to the disambiguator. A disambiguator implementation can call helpers, or chain the request through other disambiguators and merge the results.
  2. Disambiguation results are fed to a disambiguation decision maker, which applies some logic to filter out indexing results based on the disambiguation results.
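The two stages can be sketched in Java as follows. This is a minimal illustration, not Peregrine's actual API: the names `IndexingResult`, `Disambiguator`, `DecisionMaker`, `ChainedDisambiguator` and `TwoStagePipeline` are hypothetical, and merging chained results by keeping the highest weight is an assumption, since the merge policy is not described on this page.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical types for illustration only; not Peregrine's real API.
final class IndexingResult {
    final String term;
    final int conceptId;

    IndexingResult(String term, int conceptId) {
        this.term = term;
        this.conceptId = conceptId;
    }
}

// Stage 1: assigns a weight to every indexing result.
interface Disambiguator {
    Map<IndexingResult, Double> disambiguate(List<IndexingResult> results);
}

// Stage 2: decides which indexing results survive, given the weights.
interface DecisionMaker {
    List<IndexingResult> filter(List<IndexingResult> results,
                                Map<IndexingResult, Double> weights);
}

// Chains several disambiguators over the same input.
// Assumption: results are merged by keeping the highest weight per result.
final class ChainedDisambiguator implements Disambiguator {
    private final List<Disambiguator> chain;

    ChainedDisambiguator(List<Disambiguator> chain) {
        this.chain = chain;
    }

    @Override
    public Map<IndexingResult, Double> disambiguate(List<IndexingResult> results) {
        Map<IndexingResult, Double> merged = new LinkedHashMap<>();
        for (Disambiguator d : chain) {
            d.disambiguate(results).forEach((r, w) -> merged.merge(r, w, Double::max));
        }
        return merged;
    }
}

final class TwoStagePipeline {
    // Runs stage 1 (weighting) and then stage 2 (filtering).
    static List<IndexingResult> run(List<IndexingResult> results,
                                    Disambiguator disambiguator,
                                    DecisionMaker decisionMaker) {
        Map<IndexingResult, Double> weights = disambiguator.disambiguate(results);
        return decisionMaker.filter(results, weights);
    }
}
```

Keeping the decision maker separate from the disambiguator means the filtering rule can change without touching the weighting logic.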

The above procedure is illustrated in the following diagram:

Disambiguator

Two disambiguator implementations are available. Which disambiguator to use is defined in the DI field of the ontology.

Loose disambiguator

This disambiguator was formerly called UMLSDisambiguator. It follows this algorithm:

  • If a concept has synonyms, the assigned weight depends on the minimal distance to the closest synonym ([0.75 .. 0.8]).
  • If a term is the preferred term for a concept, weight [0.7] (SURE_WEIGHT) is assigned.
  • If a concept has no homonyms<ref name="homonym">Two terms (A and B) are called homonyms if they refer to different concepts (concept A != concept B) but are written the same, or very similarly.</ref>, weight [0.65] (PRETTY_SURE_WEIGHT) is assigned.
  • Otherwise, weight [0.5] (UNCERTAIN_WEIGHT) is assigned.
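The rule order above can be sketched as a single function. The constant names and values come from the list. Everything else is an assumption made for illustration: the class name `LooseWeightSketch`, the `MAX_DISTANCE` cut-off, the linear mapping from distance to the [0.75 .. 0.8] range, and the reading that the rules are tried in the order listed.

```java
import java.util.OptionalInt;

// Illustrative sketch only; not the actual loose disambiguator code.
final class LooseWeightSketch {
    static final double SURE_WEIGHT = 0.7;
    static final double PRETTY_SURE_WEIGHT = 0.65;
    static final double UNCERTAIN_WEIGHT = 0.5;
    static final double SYNONYM_MIN = 0.75;
    static final double SYNONYM_MAX = 0.8;

    // Assumption: distances at or beyond this value get the lowest synonym weight.
    static final int MAX_DISTANCE = 100;

    // Assumption: linear interpolation, distance 0 gives SYNONYM_MAX.
    static double synonymWeight(int distance) {
        double clamped = Math.min(Math.max(distance, 0), MAX_DISTANCE);
        return SYNONYM_MAX - (SYNONYM_MAX - SYNONYM_MIN) * clamped / MAX_DISTANCE;
    }

    // minSynonymDistance is empty when no synonym of the concept was found.
    static double weight(OptionalInt minSynonymDistance,
                         boolean isPreferredTerm,
                         boolean hasHomonyms) {
        if (minSynonymDistance.isPresent()) {
            return synonymWeight(minSynonymDistance.getAsInt());
        }
        if (isPreferredTerm) {
            return SURE_WEIGHT;
        }
        if (!hasHomonyms) {
            return PRETTY_SURE_WEIGHT;
        }
        return UNCERTAIN_WEIGHT;
    }
}
```

With these rules the lowest possible weight is UNCERTAIN_WEIGHT, so the loose disambiguator on its own never produces a weight below 0.5.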

Strict disambiguator

This disambiguator was formerly called GeneDisambiguator. It follows this algorithm:

  • If the concept under consideration has no homonyms<ref name="homonym"/> or the term is a preferred term, and if the term is complex, then weight [0.9] (POSITIVE_WEIGHT) is assigned.
  • If the concept has synonyms, the assigned weight depends on the minimal distance to the closest synonym ([0.75 .. 0.8]).
  • If the concept has keywords, the assigned weight depends on the minimal distance to the closest keyword ([0.70 .. 0.75]).
  • Otherwise, weight [0.1] (NEGATIVE_WEIGHT) is assigned.
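A comparable sketch for the strict rules follows. The constants POSITIVE_WEIGHT and NEGATIVE_WEIGHT and the two weight ranges are taken from the list. The rest is assumed for illustration: the class name `StrictWeightSketch`, the `MAX_DISTANCE` cut-off, the linear distance mapping, and the grouping of the first rule as "(no homonyms or preferred term) and complex".

```java
import java.util.OptionalInt;

// Illustrative sketch only; not the actual strict disambiguator code.
final class StrictWeightSketch {
    static final double POSITIVE_WEIGHT = 0.9;
    static final double NEGATIVE_WEIGHT = 0.1;

    // Assumption: distances at or beyond this value get the lowest weight of a range.
    static final int MAX_DISTANCE = 100;

    // Assumption: linear interpolation, distance 0 gives the upper bound hi.
    static double distanceWeight(int distance, double lo, double hi) {
        double clamped = Math.min(Math.max(distance, 0), MAX_DISTANCE);
        return hi - (hi - lo) * clamped / MAX_DISTANCE;
    }

    // The OptionalInt arguments are empty when no synonym or keyword was found.
    static double weight(boolean hasHomonyms,
                         boolean isPreferredTerm,
                         boolean isComplexTerm,
                         OptionalInt minSynonymDistance,
                         OptionalInt minKeywordDistance) {
        // Assumed grouping: (no homonyms OR preferred term) AND complex.
        if ((!hasHomonyms || isPreferredTerm) && isComplexTerm) {
            return POSITIVE_WEIGHT;
        }
        if (minSynonymDistance.isPresent()) {
            return distanceWeight(minSynonymDistance.getAsInt(), 0.75, 0.8);
        }
        if (minKeywordDistance.isPresent()) {
            return distanceWeight(minKeywordDistance.getAsInt(), 0.70, 0.75);
        }
        return NEGATIVE_WEIGHT;
    }
}
```

Unlike the loose rules, the fallback here is NEGATIVE_WEIGHT, so a term with no supporting evidence falls below the 0.5 cut-off used by the decision maker.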

[Figure: PeregrineStrictDisambiguator.png]

Keyword

A keyword is a token that is rarely used across all concepts. A good keyword:

  1. should be part of a multi-tokened term of the to-be-disambiguated concept.
    • set in IndexedOntology.java: if (normalizedToken == term.getText() || !isComplex.isAComplexKeyword(normalizedToken)) {IsNotAKeywordToken = true;}
  2. should be complex (i.e. longer than 5 chars, or with at least 1 number and 1 letter)
    • set in IsComplexRule.java: isAComplexKeyword().
  3. should appear fewer than e.g. 100 times in the ontology
    • set in PeregrineImpl.java as DEFAULT_KEYWORD_THRESHOLD
  4. should not be a token of a homonym term: when concepts are homonyms (i.e. they share at least one term), the tokens belonging to that shared term should not be used as keywords for those concepts. (However, these tokens may still be used as keywords for other concepts that do not have this homonym term among their terms.)
    • implement a map (isPartOfHomonyms@IndexedOntology.java) and a flag (TokensShouldBeSkippedAsKeywordForThisConcept) to filter out keywords that are part of a homonym.
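The four criteria can be combined into one predicate, sketched below. The names `isAComplexKeyword` and `DEFAULT_KEYWORD_THRESHOLD` are borrowed from the list, but the method bodies are a reconstruction from the prose, not the code in IsComplexRule.java or PeregrineImpl.java. The class `KeywordRuleSketch`, the method `isKeywordCandidate` and its parameters are hypothetical, and the threshold value 100 is only the example figure given above.

```java
// Illustrative sketch only; reconstructed from the criteria in the list.
final class KeywordRuleSketch {
    // Example value from the text ("fewer than e.g. 100 times").
    static final int DEFAULT_KEYWORD_THRESHOLD = 100;

    // Complex: longer than 5 characters, or at least one digit and one letter.
    static boolean isAComplexKeyword(String token) {
        if (token.length() > 5) {
            return true;
        }
        boolean hasDigit = false;
        boolean hasLetter = false;
        for (int i = 0; i < token.length(); i++) {
            char c = token.charAt(i);
            if (Character.isDigit(c)) {
                hasDigit = true;
            } else if (Character.isLetter(c)) {
                hasLetter = true;
            }
        }
        return hasDigit && hasLetter;
    }

    // Hypothetical helper combining all four criteria.
    static boolean isKeywordCandidate(String token,
                                      String fullTermText,
                                      int ontologyFrequency,
                                      boolean partOfHomonymTerm) {
        // Criterion 1: the token must be only a part of a multi-token term.
        boolean partOfMultiTokenTerm = !token.equals(fullTermText);
        return partOfMultiTokenTerm
                && isAComplexKeyword(token)                       // criterion 2
                && ontologyFrequency < DEFAULT_KEYWORD_THRESHOLD  // criterion 3
                && !partOfHomonymTerm;                            // criterion 4
    }
}
```

For example, `isKeywordCandidate("kinase", "protein kinase", 10, false)` accepts the token, while the same token is rejected if it appears 100 or more times in the ontology or belongs to a homonym term.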

Disambiguation decision maker

Currently there is only a trivial disambiguation decision maker implementation, which removes an indexing result if the corresponding disambiguation result has a weight less than [0.5].
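The threshold rule amounts to a simple filter. The sketch below assumes, for illustration, that results and weights arrive as two parallel lists; the class `ThresholdDecisionSketch` and the method `keepAccepted` are hypothetical names, not part of Peregrine.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only; not the actual decision maker implementation.
final class ThresholdDecisionSketch {
    static final double THRESHOLD = 0.5;

    // weights.get(i) is the disambiguation weight for results.get(i).
    static <T> List<T> keepAccepted(List<T> results, List<Double> weights) {
        if (results.size() != weights.size()) {
            throw new IllegalArgumentException("results and weights must have the same size");
        }
        List<T> kept = new ArrayList<>();
        for (int i = 0; i < results.size(); i++) {
            // A result is removed only when its weight is strictly below 0.5.
            if (weights.get(i) >= THRESHOLD) {
                kept.add(results.get(i));
            }
        }
        return kept;
    }
}
```

A weight of exactly 0.5 is kept, because the rule removes only results with a weight less than 0.5.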
