The Indexing Process: a hash-based approach to text indexing

Indexing the ontology

The first step Peregrine performs during startup is indexing the ontology. This is done by iterating through all ontology concepts, so it is advisable for the ontology to be optimized for such complete sequential scans.

For each concept, all of its terms are considered. A term has a few flags that specify how it should be treated:

  • normalised flag: the term is passed through the normalizer
  • case insensitive: if only the first letter of a word is a capital, the word is reduced to lowercase; otherwise the original string is returned (e.g. compare a common name like "Atlantic" with a chemical formula like "C1H5O2")
  • default: the term is converted to lower case
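The case-insensitive rule above can be sketched as follows. This is an illustrative reimplementation, not Peregrine's actual code; the method name is an assumption.

```java
public class CaseRule {
    // Lowercase a word only when just its first letter is a capital;
    // otherwise keep the original string (so mixed-case chemical
    // formulas like "C1H5O2" survive intact).
    static String applyCaseInsensitive(String word) {
        if (word.isEmpty()) return word;
        String tail = word.substring(1);
        boolean onlyFirstCapital = Character.isUpperCase(word.charAt(0))
                && tail.equals(tail.toLowerCase());
        return onlyFirstCapital ? word.toLowerCase() : word;
    }

    public static void main(String[] args) {
        System.out.println(applyCaseInsensitive("Atlantic")); // -> "atlantic"
        System.out.println(applyCaseInsensitive("C1H5O2"));   // -> "C1H5O2"
    }
}
```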

After this conversion is done, the term is split into tokens by the tokenizer and so-called stop words are thrown away. This might impact the correctness of matching (compare "life in war" with "life for war"), but in most cases it improves it.
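A minimal sketch of this tokenization step, assuming a whitespace tokenizer and a stand-in stop-word list (neither is Peregrine's actual implementation):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

public class Tokenize {
    // Stand-in stop-word list for illustration only.
    static final Set<String> STOP_WORDS = Set.of("in", "for", "the", "of");

    // Split a term on whitespace and drop stop words.
    static List<String> tokenize(String term) {
        List<String> tokens = new ArrayList<>();
        for (String t : term.split("\\s+")) {
            if (!t.isEmpty() && !STOP_WORDS.contains(t)) tokens.add(t);
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Both phrases collapse to the same token sequence once stop
        // words are removed -- the correctness trade-off noted above.
        System.out.println(tokenize("life in war"));  // [life, war]
        System.out.println(tokenize("life for war")); // [life, war]
    }
}
```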

Each token is registered in a map of unique tokens. This map serves two major goals:

  1. Later it allows skipping input-text tokens that are not present in this map.
  2. It also keeps track of keywords (a keyword is a token that is rarely used across all concepts).
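One way to realize such a map is sketched below. The rarity cut-off, the counting scheme, and all names are assumptions for illustration; the source only states that a keyword is a token rarely used across concepts.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class UniqueTokens {
    static final int KEYWORD_CUTOFF = 2; // assumed rarity threshold

    // token -> how many times it occurs across all concepts' terms
    final Map<String, Integer> usageCount = new HashMap<>();

    void register(String token) {
        usageCount.merge(token, 1, Integer::sum);
    }

    boolean isKnown(String token) {
        return usageCount.containsKey(token);
    }

    boolean isKeyword(String token) {
        Integer n = usageCount.get(token);
        return n != null && n <= KEYWORD_CUTOFF;
    }

    public static void main(String[] args) {
        UniqueTokens map = new UniqueTokens();
        for (String t : List.of("protein", "protein", "protein", "kinase")) {
            map.register(t);
        }
        System.out.println(map.isKnown("xyz"));       // false: token can be skipped
        System.out.println(map.isKeyword("kinase"));  // true: rarely used
        System.out.println(map.isKeyword("protein")); // false: common token
    }
}
```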

Finally, a hash is calculated for the complete sequence of tokens and stored in a map, which maps the calculated hash to an array of objects containing the term meta-information (such as term ID, term type, and disambiguation type). In other words, each object in such an array refers to a term that has the same hash.
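The registration step can be sketched like this. TermInfo, its fields, and the use of the Java list hash are assumptions standing in for Peregrine's actual classes and hash function:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class TermIndex {
    // Stand-in for the term meta-information described above.
    record TermInfo(int termId, String termType) {}

    // hash of the full token sequence -> all terms sharing that hash
    final Map<Integer, List<TermInfo>> byHash = new HashMap<>();

    void register(List<String> tokens, TermInfo info) {
        int hash = tokens.hashCode(); // stands in for Peregrine's hash function
        byHash.computeIfAbsent(hash, h -> new ArrayList<>()).add(info);
    }

    List<TermInfo> lookup(List<String> tokens) {
        return byHash.getOrDefault(tokens.hashCode(), List.of());
    }

    public static void main(String[] args) {
        TermIndex idx = new TermIndex();
        idx.register(List.of("word1", "word2"), new TermInfo(1, "preferred"));
        System.out.println(idx.lookup(List.of("word1", "word2")).size()); // 1
        System.out.println(idx.lookup(List.of("word1")).size());          // 0
    }
}
```

Because different token sequences can collide on the same hash, a lookup only yields candidates; the candidates must still be verified against the actual text, as described in the matching section.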

Matching the text

General rules for text indexing that any implementation of the Peregrine interface should follow:

  1. At a given text position the longest term should be matched. For example:
    • input text: "... word1 word2 ..."
    • ontology: concept1 = { "word1" }, concept2 = { "word1 word2" }
    • correct output: { concept2:term1 }
    • incorrect output: { concept1:term1, concept2:term1 }
  2. If a concept has several terms that are equal, then only one indexing result referring to this concept should be produced at a given text position. For example:
    • input text: "... word1 ..."
    • ontology: concept1 = { "word1", "word1" }
    • correct output: { concept1:term1 }
    • incorrect output: { concept1:term1, concept1:term2 }
  3. The produced results should be sorted by the position at which they are found. For example:
    • input text: "... word1 word2 ..."
    • ontology: concept1 = { "word1" }, concept2 = { "word2" }
    • correct output: { concept1:term1, concept2:term1 }
    • incorrect output: { concept2:term1, concept1:term1 }

The input text is split into tokens, and stop words are removed from the token stream. Further indexing of the text is performed by applying an indexing window that has a dynamic size not greater than a threshold value. This threshold value equals the maximum number of tokens in the longest term + 1. At each iteration the head token of the window is removed, and a new token may be appended. This allows streaming text of any size; however, due to the initial WebService interface orientation, the text is always passed as one block.
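The sliding window over the token stream can be sketched as below. The longest-term length is an assumed input; in Peregrine it would come from the indexed ontology.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

public class SlidingWindow {
    // Produce the window contents at each position: fill the window up to
    // the threshold, emit it, then drop the head token and slide forward.
    static List<List<String>> windows(List<String> stream, int threshold) {
        List<List<String>> out = new ArrayList<>();
        Deque<String> window = new ArrayDeque<>();
        int next = 0;
        while (next < stream.size() || !window.isEmpty()) {
            while (window.size() < threshold && next < stream.size()) {
                window.addLast(stream.get(next++));
            }
            out.add(new ArrayList<>(window));
            window.removeFirst();
        }
        return out;
    }

    public static void main(String[] args) {
        int longestTermTokens = 2;              // assumed longest term length
        int threshold = longestTermTokens + 1;
        System.out.println(windows(List.of("t1", "t2", "t3", "t4"), threshold));
        // [[t1, t2, t3], [t2, t3, t4], [t3, t4], [t4]]
    }
}
```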

Consider that at some point in time the window consists of the tokens {"A", "B", "C"}. Then the following is performed:

  1. Sequences of tokens of all possible lengths are generated for this window, all sharing the same head token "A". For the given example, these are the sequences {"A", "B", "C"}, {"A", "B"}, {"A"}. Sequences are generated from longest to shortest to make it possible to match longer terms first.
  2. All possible transformations are applied one by one to each sequence (the above-mentioned normalization, case transformation, etc.). This is necessary because it is not known in advance which transformations have been applied to the term we want to match.
  3. Hashes are calculated for each transformed sequence of tokens, and a lookup is made in the map. The possible matches are retrieved from this map and checked to verify that they really match the given piece of text.
  4. The indexing results are created and (in case of further disambiguation postprocessing) enriched with disambiguation information.
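The sequence-generation step can be sketched as follows (method and class names are illustrative):

```java
import java.util.ArrayList;
import java.util.List;

public class WindowSequences {
    // Generate all head-anchored subsequences of the window, longest
    // first, so that longer terms get a chance to match before shorter
    // ones (the longest-match rule above).
    static List<List<String>> headSequences(List<String> window) {
        List<List<String>> seqs = new ArrayList<>();
        for (int len = window.size(); len >= 1; len--) {
            seqs.add(new ArrayList<>(window.subList(0, len)));
        }
        return seqs;
    }

    public static void main(String[] args) {
        System.out.println(headSequences(List.of("A", "B", "C")));
        // [[A, B, C], [A, B], [A]]
    }
}
```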

The map of unique tokens plays the following optimization role: if the next token in the stream is not present among the unique tokens, then no term contains this token, and therefore no term can span over it. Hence, the window can either end before that token or start just after it. This optimization takes some memory (as we need to store all unique tokens), but speeds up text indexing considerably: no unnecessary iterations or transformations are made for sequences that are known in advance not to match any term.
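The effect of this shortcut can be sketched as cutting the token stream at every unknown token, so that no candidate window ever spans one. The names here are illustrative, and the set of unique tokens stands in for the map built during ontology indexing:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

public class TokenFilter {
    // Split the token stream into segments at tokens absent from the
    // unique-token set; windows are then applied per segment only.
    static List<List<String>> splitAtUnknown(List<String> tokens, Set<String> uniqueTokens) {
        List<List<String>> segments = new ArrayList<>();
        List<String> current = new ArrayList<>();
        for (String t : tokens) {
            if (uniqueTokens.contains(t)) {
                current.add(t);
            } else if (!current.isEmpty()) {
                segments.add(current);
                current = new ArrayList<>();
            }
        }
        if (!current.isEmpty()) segments.add(current);
        return segments;
    }

    public static void main(String[] args) {
        Set<String> known = Set.of("word1", "word2");
        // "noise" occurs in no term, so no candidate window crosses it.
        System.out.println(splitAtUnknown(List.of("word1", "noise", "word2"), known));
        // [[word1], [word2]]
    }
}
```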

Implementation Details

  • IndexedOntology:indexText() handles the request to index text.
  • IndexerHelper.indexTerms contains maps of indexed terms, e.g. indexTerms[0] contains the map of all terms consisting of only one token.