Steps in the Disambiguation

Overall process

After the results are retrieved from the indexing engine, a special procedure takes place to ensure that the results are accurate in the given context. For example, the same term may be assigned to different concepts (e.g. H1C1 may refer to a gene, but may also be an alternative name for a disease). The process of resolving such conflicts is called disambiguation.

In the current Peregrine architecture the disambiguation process is divided into two stages:

  1. Indexing results are fed to the disambiguator. A disambiguator implementation can call helpers, or chain the request through other disambiguators and merge the results.
  2. Disambiguation results are fed to a disambiguation decision maker, which applies logic to filter the indexing results based on the disambiguation results.
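
The two stages can be pictured as a small pipeline. The following is only a minimal sketch and not the Peregrine API: the names IndexingResult, WeightedResult, run_disambiguation, the toy functions, and the concept identifiers are all hypothetical and chosen for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class IndexingResult:
    term: str        # text matched by the indexing engine
    concept_id: str  # concept the term was mapped to (made-up ids below)


@dataclass(frozen=True)
class WeightedResult:
    result: IndexingResult
    weight: float    # confidence assigned by the disambiguator


# Hypothetical type aliases for the two stages.
Disambiguator = Callable[[List[IndexingResult]], List[WeightedResult]]
DecisionMaker = Callable[[List[WeightedResult]], List[IndexingResult]]


def run_disambiguation(results: List[IndexingResult],
                       disambiguator: Disambiguator,
                       decision_maker: DecisionMaker) -> List[IndexingResult]:
    weighted = disambiguator(results)   # stage 1: assign weights
    return decision_maker(weighted)     # stage 2: filter on those weights


def toy_disambiguator(results: List[IndexingResult]) -> List[WeightedResult]:
    # Toy rule for illustration only: trust gene concepts, distrust the rest.
    return [WeightedResult(r, 0.9 if r.concept_id.startswith("gene") else 0.1)
            for r in results]


def toy_decision_maker(weighted: List[WeightedResult]) -> List[IndexingResult]:
    # Toy rule for illustration only: keep results at or above 0.5.
    return [w.result for w in weighted if w.weight >= 0.5]


hits = [IndexingResult("H1C1", "gene:1"), IndexingResult("H1C1", "disease:7")]
kept = run_disambiguation(hits, toy_disambiguator, toy_decision_maker)
print([r.concept_id for r in kept])  # prints ['gene:1']
```

Keeping both stages as plain callables makes it easy to chain several disambiguators or swap the decision logic without touching the pipeline itself.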

The above procedure is illustrated in the following diagram:

Disambiguator

Two disambiguator implementations are available. Which disambiguator to use is defined in the DI field of the ontology.

Loose disambiguator

This disambiguator was formerly called UMLSDisambiguator. It follows this algorithm:

  • If a concept has synonyms, then weight [0.9] (POSITIVE_WEIGHT) is assigned.
  • If a term is a preferred term for a concept, then weight [0.7] (SURE_WEIGHT) is assigned.
  • If a concept has no homonyms, then weight [0.65] (PRETTY_SURE_WEIGHT) is assigned.
  • Otherwise, weight [0.5] (UNCERTAIN_WEIGHT) is assigned.
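
The rules above can be sketched as a single function. This is a minimal illustration, assuming the rules are checked in the listed order and the first match wins. The function name loose_weight and its boolean parameters are hypothetical. Only the constant names and values come from the list above.

```python
POSITIVE_WEIGHT = 0.9
SURE_WEIGHT = 0.7
PRETTY_SURE_WEIGHT = 0.65
UNCERTAIN_WEIGHT = 0.5


def loose_weight(has_synonyms: bool,
                 is_preferred_term: bool,
                 has_homonyms: bool) -> float:
    """Return the weight for one term/concept match.

    Assumption: rules are evaluated top to bottom, first match wins.
    """
    if has_synonyms:
        return POSITIVE_WEIGHT
    if is_preferred_term:
        return SURE_WEIGHT
    if not has_homonyms:
        return PRETTY_SURE_WEIGHT
    return UNCERTAIN_WEIGHT


# A preferred term without synonyms is weighted 0.7.
print(loose_weight(has_synonyms=False, is_preferred_term=True, has_homonyms=True))
```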

Strict disambiguator

This disambiguator was formerly called GeneDisambiguator. It follows this algorithm:

  • If the concept under consideration has no homonyms or the term is a preferred term and if the term is complex, then weight [0.9] (POSITIVE_WEIGHT) is assigned.
  • If the concept has synonyms, then the assigned weight depends on the minimal distance to the closest synonym ([0.75 .. 0.8]).
  • If the concept has keywords, then the assigned weight depends on the minimal distance to the closest keyword ([0.70 .. 0.75]).
  • Otherwise, weight [0.1] (NEGATIVE_WEIGHT) is assigned.
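
A rough sketch of these rules follows. Several details are not specified above and are assumptions here: the first rule is read as "(no homonyms or preferred term) and complex term", the weight is interpolated linearly within each range so that a closer synonym or keyword scores higher, and distances are capped at a hypothetical MAX_DISTANCE. The function and parameter names are also hypothetical.

```python
from typing import Optional

POSITIVE_WEIGHT = 0.9
NEGATIVE_WEIGHT = 0.1
MAX_DISTANCE = 100  # hypothetical cap; the real distance unit and cap are not given


def _scaled(low: float, high: float, distance: int) -> float:
    # Assumed linear mapping: distance 0 -> high, distance >= MAX_DISTANCE -> low.
    closeness = 1.0 - min(distance, MAX_DISTANCE) / MAX_DISTANCE
    return low + (high - low) * closeness


def strict_weight(has_homonyms: bool,
                  is_preferred_term: bool,
                  is_complex_term: bool,
                  synonym_distance: Optional[int] = None,
                  keyword_distance: Optional[int] = None) -> float:
    """Return the weight for one term/concept match.

    A distance of None means no synonym (or keyword) was found.
    """
    # Assumed grouping of the first rule's conditions.
    if (not has_homonyms or is_preferred_term) and is_complex_term:
        return POSITIVE_WEIGHT
    if synonym_distance is not None:
        return _scaled(0.75, 0.80, synonym_distance)
    if keyword_distance is not None:
        return _scaled(0.70, 0.75, keyword_distance)
    return NEGATIVE_WEIGHT
```

For example, strict_weight(True, False, False, synonym_distance=10) falls inside the synonym range, while a match with no supporting synonym or keyword drops to NEGATIVE_WEIGHT.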

Disambiguation decision maker

The rules are as follows:

  • When the concept weight >= ALWAYS_ACCEPTED_WEIGHT (80), the concept is kept.
  • When the concept weight < MINIMAL_WEIGHT (50), the concept is removed.
  • For concepts with an in-between weight (>= 50 and < 80):
    • If at least one concept that reaches ALWAYS_ACCEPTED_WEIGHT comes from looseDisambiguator, then all in-between concepts are removed.
    • Otherwise, some in-between concepts are added to the final result. There are two possibilities:
      • If the difference between the two best in-between weights > TWO_BEST_MATCHES_TOO_CLOSE_DISTANCE (5), then the concept with the highest in-between weight is kept, provided that there is no concept with a weight above ALWAYS_ACCEPTED_WEIGHT or that this highest in-between weight comes from looseDisambiguator.
      • Otherwise, all concepts are kept whose in-between weight comes from looseDisambiguator and is not more than TWO_BEST_MATCHES_TOO_CLOSE_DISTANCE lower than the highest in-between weight.
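
The decision rules can be sketched as follows. This is an illustration under stated assumptions, not the actual implementation: weights are taken to be on a 0-100 scale (the thresholds 80 and 50 suggest the disambiguator weights are multiplied by 100), "above ALWAYS_ACCEPTED_WEIGHT" is read as >= 80, and a single in-between concept is treated as clearly separated from the rest. The Candidate class and decide function are hypothetical names.

```python
from dataclasses import dataclass
from typing import List

ALWAYS_ACCEPTED_WEIGHT = 80
MINIMAL_WEIGHT = 50
TWO_BEST_MATCHES_TOO_CLOSE_DISTANCE = 5


@dataclass(frozen=True)
class Candidate:
    concept_id: str
    weight: int        # assumed 0-100 scale
    from_loose: bool   # True if weighted by the loose disambiguator


def decide(candidates: List[Candidate]) -> List[Candidate]:
    accepted = [c for c in candidates if c.weight >= ALWAYS_ACCEPTED_WEIGHT]
    # Concepts below MINIMAL_WEIGHT are dropped by not selecting them here.
    in_between = sorted(
        (c for c in candidates
         if MINIMAL_WEIGHT <= c.weight < ALWAYS_ACCEPTED_WEIGHT),
        key=lambda c: c.weight,
        reverse=True,
    )
    kept = list(accepted)

    # An accepted concept from the loose disambiguator removes all in-between ones.
    if not in_between or any(c.from_loose for c in accepted):
        return kept

    best = in_between[0]
    # Assumption: with only one in-between concept, the gap counts as large.
    clearly_best = (
        len(in_between) == 1
        or best.weight - in_between[1].weight > TWO_BEST_MATCHES_TOO_CLOSE_DISTANCE
    )
    if clearly_best:
        if not accepted or best.from_loose:
            kept.append(best)
    else:
        kept.extend(
            c for c in in_between
            if c.from_loose
            and best.weight - c.weight <= TWO_BEST_MATCHES_TOO_CLOSE_DISTANCE
        )
    return kept
```

For example, three loose-disambiguator candidates weighted 75, 72 and 60 yield the first two: the two best are within 5 of each other, and the third is more than 5 below the best.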

Last modified on Oct 26, 2011, 10:22:02 PM
