
Use Peregrine with plain jar files

You can use Peregrine as a library in your own program in three steps:

  • download a set of jar files
  • get the ontology you want
  • install the lvg normalizer.

Please make sure to read the Prerequisites, and then follow along.

Peregrine jar files you need

  • The following jar files are in the peregrine-1.1.zip file:
    • A set of jars that every deployment needs:
      • common-utils-1.1.jar
      • ontology-api-1.1.jar
      • peregrine-api-1.1.jar
      • peregrine-impl-hash-1.1.jar
      • peregrine-normalizer-1.1.jar
      • peregrine-normalizer-lvg-1.0.jar
      • peregrine-tokenizer-1.1.jar
    • Two alternative ontology handlers. You can use one or both of these jar files for your ontology:
      • File ontology (ontology-impl-file-1.1.jar)
      • Database ontology (ontology-impl-db-1.1.jar)
    • An optional jar for the disambiguation layer. You probably want this, but Peregrine can work without it.
      • peregrine-disambiguator-1.1.jar (the disambiguation layer can be used out of the box, without special configuration).
  • Peregrine also needs some third-party jar files. You can find these in the peregrine-1.1-external-dependencies.zip file:
    • commons-collections-3.2.jar
    • commons-io-1.4.jar
    • commons-lang-2.4.jar
    • commons-logging-1.1.1.jar
    • guava-r05.jar
    • trove-2.0.4.jar
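
Because a single missing jar typically only shows up later as a ClassNotFoundException, it can help to verify the directory before compiling. The following is a minimal sketch, not part of Peregrine: the class name JarChecker and the default directory ../Peregrine-Jars are assumptions made for illustration, and the list assumes the file ontology handler and the optional disambiguator are both wanted.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper, not part of Peregrine: reports which expected jars are absent. */
final class JarChecker {

    // Jar names copied from the lists above (file ontology handler and disambiguator included).
    static final List<String> EXPECTED_JARS = Arrays.asList(
            "common-utils-1.1.jar",
            "ontology-api-1.1.jar",
            "peregrine-api-1.1.jar",
            "peregrine-impl-hash-1.1.jar",
            "peregrine-normalizer-1.1.jar",
            "peregrine-normalizer-lvg-1.0.jar",
            "peregrine-tokenizer-1.1.jar",
            "ontology-impl-file-1.1.jar",
            "peregrine-disambiguator-1.1.jar",
            "commons-collections-3.2.jar",
            "commons-io-1.4.jar",
            "commons-lang-2.4.jar",
            "commons-logging-1.1.1.jar",
            "guava-r05.jar",
            "trove-2.0.4.jar");

    /** Returns the expected jar names that are not regular files in the given directory. */
    static List<String> missingJars(final Path jarDirectory, final List<String> expectedJars) {
        final List<String> missing = new ArrayList<>();
        for (final String jarName : expectedJars) {
            if (!Files.isRegularFile(jarDirectory.resolve(jarName))) {
                missing.add(jarName);
            }
        }
        return missing;
    }

    public static void main(final String[] arguments) {
        // The default directory is an assumption; pass your own jar directory as the first argument.
        final Path jarDirectory = Paths.get(arguments.length > 0 ? arguments[0] : "../Peregrine-Jars");
        final List<String> missing = missingJars(jarDirectory, EXPECTED_JARS);
        if (missing.isEmpty()) {
            System.out.println("All expected jars were found in " + jarDirectory + ".");
        } else {
            System.out.println("Missing jars in " + jarDirectory + ": " + missing);
        }
    }
}
```

If you use the database ontology instead of the file ontology, swap ontology-impl-file-1.1.jar for ontology-impl-db-1.1.jar in the list.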

Getting an ontology

Peregrine needs an ontology to do its work, and it supports both file and database ontologies. In this example we will use a file ontology. Download test.ontology.zip and unpack it in a convenient location. This file is very small: it contains only 10 concepts and 130 terms that refer to these concepts. The format used for the file ontology is described here: Erasmus MC ontology file format, so you can make your own ontologies as desired.

Installing the lvg normalizer

Peregrine depends on the "lvg normalizer". The normalizer is a piece of software that standardizes English words, so that, for example, words that differ only in capitalization or in singular/plural form are recognized as the same concept. The lvg normalizer is part of the Lexical Tools package of the National Library of Medicine (NLM). We use the 2013 version of this library and, since we only use the normalizer, the "lite" version is sufficient. It can be downloaded from the lvg2013lite download site (about 240MB, which unpacks into about 800MB of mostly large database files). Unpack the archive in a convenient location.

To work correctly, Lvg needs to know where its files are on the file system. The default setting "LVG_DIR=AUTO_MODE" will not work for us. Edit the file lvg2013lite/data/config/lvg.properties and set LVG_DIR to the complete (absolute) path of the installation. Make sure the directory name ends with a "/" character. It should look approximately like this:

LVG_DIR=/home/myname/packages/Lvg/lvg2013lite/
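
The two requirements above (no AUTO_MODE, and a trailing "/") are easy to get wrong, so a quick check before starting Peregrine may save some debugging. This is a minimal sketch under the assumption that lvg.properties can be parsed as a standard Java properties file; the class name LvgDirCheck is hypothetical and not part of Peregrine or Lvg, and the default path is the one used by the sample application on this page.

```java
import java.io.IOException;
import java.io.Reader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Optional;
import java.util.Properties;

/** Hypothetical helper, not part of Peregrine or Lvg: sanity-checks the LVG_DIR setting. */
final class LvgDirCheck {

    /** Returns a description of the problem with LVG_DIR, or an empty Optional if it looks usable. */
    static Optional<String> problemWithLvgDir(final Properties properties) {
        final String rawValue = properties.getProperty("LVG_DIR");
        if (rawValue == null) {
            return Optional.of("LVG_DIR is not set.");
        }
        final String lvgDir = rawValue.trim();
        if (lvgDir.equals("AUTO_MODE")) {
            return Optional.of("LVG_DIR is still AUTO_MODE; set it to the installation path.");
        }
        if (!lvgDir.endsWith("/")) {
            return Optional.of("LVG_DIR does not end with \"/\": " + lvgDir);
        }
        return Optional.empty();
    }

    public static void main(final String[] arguments) throws IOException {
        // Assumed default location; pass the path to your own lvg.properties as the first argument.
        final String path = arguments.length > 0
                ? arguments[0]
                : "../Lvg/lvg2013lite/data/config/lvg.properties";
        final Properties properties = new Properties();
        try (Reader reader = Files.newBufferedReader(Paths.get(path), StandardCharsets.ISO_8859_1)) {
            properties.load(reader);
        }
        System.out.println(problemWithLvgDir(properties).orElse("LVG_DIR looks usable."));
    }
}
```

The check only looks at the text of the setting; it does not verify that the directory exists or contains a complete Lvg installation.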

Testing it

Make sure you have a Java Development Kit (JDK) installed. On a Debian Linux system, you can install one with, for example, "apt-get install default-jdk".

The listing at the bottom of this page is a simple application for trying out Peregrine. Save it as PeregrinePlainJars.java.

Look for the two lines marked EDIT HERE. You need to specify the name and location of your ontology file, and the location of the directory that contains the lvg.properties file of the Lvg normalizer.

After editing these two locations, compile the program using a command like javac PeregrinePlainJars.java -classpath '../Peregrine-Jars/*'. The directory in the -classpath should be the location where you stored the Peregrine jars.

Then run it like this: java -classpath '.:../Peregrine-Jars/*' PeregrinePlainJars. Make sure the -classpath mentions both the directory that contains the Peregrine jar files and the one that contains your PeregrinePlainJars.class. Also make sure Peregrine has enough memory to load the whole thesaurus.
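
To see how much memory the JVM is actually allowed to use, you can print its maximum heap size. This is a minimal sketch using only the standard Runtime API; the class name HeapReport is hypothetical, and how much heap a given thesaurus needs is not stated here, so treat any concrete value as a guess to be tuned.

```java
/** Hypothetical helper, not part of Peregrine: prints the maximum heap size of the running JVM. */
final class HeapReport {

    /** Converts a byte count to whole megabytes (1 MB = 1024 * 1024 bytes), rounding down. */
    static long toMegabytes(final long bytes) {
        return bytes / (1024L * 1024L);
    }

    public static void main(final String[] arguments) {
        // maxMemory() is the upper limit the JVM will attempt to use for the heap.
        final long maxHeapBytes = Runtime.getRuntime().maxMemory();
        System.out.println("Maximum heap available to this JVM: " + toMegabytes(maxHeapBytes) + " MB");
    }
}
```

The limit can be raised with the standard -Xmx option of the java command, for example java -Xmx2g -classpath '.:../Peregrine-Jars/*' PeregrinePlainJars, where 2g is only an illustrative value.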

If the configuration is working, you should see four indexing results. If it does not work for you, please let us know and we will help you get it running.

import java.io.Serializable;
import java.util.List;
import org.erasmusmc.data_mining.ontology.api.Concept;
import org.erasmusmc.data_mining.ontology.api.Language;
import org.erasmusmc.data_mining.ontology.api.Ontology;
import org.erasmusmc.data_mining.ontology.common.LabelTypeComparator;
import org.erasmusmc.data_mining.ontology.impl.file.SingleFileOntologyImpl;
import org.erasmusmc.data_mining.peregrine.api.IndexingResult;
import org.erasmusmc.data_mining.peregrine.api.Peregrine;
import org.erasmusmc.data_mining.peregrine.disambiguator.api.DisambiguationDecisionMaker;
import org.erasmusmc.data_mining.peregrine.disambiguator.api.Disambiguator;
import org.erasmusmc.data_mining.peregrine.disambiguator.api.RuleDisambiguator;
import org.erasmusmc.data_mining.peregrine.disambiguator.impl.ThresholdDisambiguationDecisionMakerImpl;
import org.erasmusmc.data_mining.peregrine.disambiguator.impl.rule_based.LooseDisambiguator;
import org.erasmusmc.data_mining.peregrine.disambiguator.impl.rule_based.StrictDisambiguator;
import org.erasmusmc.data_mining.peregrine.disambiguator.impl.rule_based.TypeDisambiguatorImpl;
import org.erasmusmc.data_mining.peregrine.impl.hash.PeregrineImpl;
import org.erasmusmc.data_mining.peregrine.normalizer.api.NormalizerFactory;
import org.erasmusmc.data_mining.peregrine.normalizer.impl.LVGNormalizer;
import org.erasmusmc.data_mining.peregrine.normalizer.impl.NormalizerFactoryImpl;
import org.erasmusmc.data_mining.peregrine.tokenizer.api.TokenizerFactory;
import org.erasmusmc.data_mining.peregrine.tokenizer.impl.TokenizerFactoryImpl;
import org.erasmusmc.data_mining.peregrine.tokenizer.impl.UMLSGeneChemTokenizer;

/**
 * Test running Peregrine with plain jar files.
 */
public class PeregrinePlainJars {
    /**
     * Start of the application.
     *
     * @param arguments the unused command-line arguments.
     */
    public static void main(final String[] arguments) {
        new PeregrinePlainJars().printIndexingResults();
    }

    /**
     * Print the indexing results that Peregrine returns.
     */
    private void printIndexingResults() {
        // The ontology file format is described here:
        // https://trac.nbic.nl/data-mining/wiki/ErasmusMC%20ontology%20file%20format
        final String ontologyPath = "../Ontologies/test.ontology"; // EDIT HERE
        final Ontology ontology = new SingleFileOntologyImpl(ontologyPath);

        final String propertiesDirectory = "../Lvg/lvg2013lite/data/config/"; // EDIT HERE
        final Peregrine peregrine = createPeregrine(ontology, propertiesDirectory + "lvg.properties");

        final String text = "This is a simple sentence with labels like Malaria, acromion, acronycin, ectopic acth secretion " +
                            "and immunoglobulin production.";
        final List<IndexingResult> indexingResults = peregrine.indexAndDisambiguate(text, Language.EN);

        System.out.println("Number of indexing results found: " + indexingResults.size() + ".");

        for (final IndexingResult indexingResult : indexingResults) {
            final Serializable conceptId = indexingResult.getTermId().getConceptId();
            System.out.println();
            System.out.println("- Found concept with id: " + conceptId + ", matched text: \""
                               + text.substring(indexingResult.getStartPos(), indexingResult.getEndPos() + 1) + "\".");

            final Concept concept = ontology.getConcept(conceptId);
            final String preferredLabelText = LabelTypeComparator.getPreferredLabel(concept.getLabels()).getText();
            System.out.println("  Preferred concept label is: \"" + preferredLabelText + "\".");
        }
    }

    /**
     * Create a new peregrine object.
     *
     * @param ontology          the ontology to use.
     * @param lvgPropertiesPath the path to the lvg properties.
     * @return the new peregrine object.
     */
    private Peregrine createPeregrine(final Ontology ontology, final String lvgPropertiesPath) {
        final UMLSGeneChemTokenizer tokenizer = new UMLSGeneChemTokenizer();
        final TokenizerFactory tokenizerFactory = TokenizerFactoryImpl.createDefaultTokenizerFactory(tokenizer);
        final LVGNormalizer normalizer = new LVGNormalizer(lvgPropertiesPath);
        final NormalizerFactory normalizerFactory = NormalizerFactoryImpl.createDefaultNormalizerFactory(normalizer);
        final RuleDisambiguator[] disambiguators = {new StrictDisambiguator(), new LooseDisambiguator()};
        final Disambiguator disambiguator = new TypeDisambiguatorImpl(disambiguators);
        final DisambiguationDecisionMaker disambiguationDecisionMaker = new ThresholdDisambiguationDecisionMakerImpl();

        // This parameter is used to define the set of languages in which the ontology should be loaded. Language code
        // used is ISO639. For now, this feature is only available for DBOntology. Thus, we can leave it as null or
        // the empty string in this sample code.
        // final String ontologyLanguageToLoad = "en, nl, de";
        final String ontologyLanguageToLoad = null;

        return new PeregrineImpl(ontology, tokenizerFactory, normalizerFactory, disambiguator,
                                 disambiguationDecisionMaker, ontologyLanguageToLoad);
    }
}
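
The listing above extracts the matched text with substring(getStartPos(), getEndPos() + 1), which suggests that the end position is the index of the last matched character (inclusive) rather than one past it. That reading is an inference from the sample code, not a documented guarantee. A minimal sketch with hand-computed offsets, independent of Peregrine (the class name InclusiveSpan is hypothetical):

```java
/** Hypothetical illustration, not part of Peregrine: extracting text from inclusive offsets. */
final class InclusiveSpan {

    /** Returns the characters from startInclusive up to and including endInclusive. */
    static String matchedText(final String text, final int startInclusive, final int endInclusive) {
        // String.substring excludes its end index, hence the + 1 for an inclusive end position.
        return text.substring(startInclusive, endInclusive + 1);
    }

    public static void main(final String[] arguments) {
        final String text = "labels like Malaria, acromion";
        final int start = text.indexOf("Malaria");
        final int end = start + "Malaria".length() - 1; // index of the final 'a'
        System.out.println(matchedText(text, start, end));
    }
}
```

Forgetting the + 1 would silently drop the last character of every match.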


Last modified on Sep 26, 2013, 3:13:30 PM