source: trunk/id-mapper/README @ 6

Last change on this file since 6 was 6, checked in by rob.hooft@…, 6 years ago

first cut at the pmid2doi updater

This is a set of scripts used to update the mapping of PubMed IDs (pmids) to DOIs, written to
run continuously. A prerequisite is a database of all articles in PubMed.

The mapping script is a threaded process that repeatedly queries the crossref API, at increasing
intervals, for all pmids that have not been mapped yet.

The process uses two databases: medline (read-only) and pmid2doi_continuous (read/write).
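
The "increasing intervals" retry behaviour described above can be sketched as follows. This is only an illustration, not the updater's actual code: the function name next_retry_delay, the one-hour base interval, the doubling factor and the 30-day cap are all assumptions. Only the idea of waiting longer after each failed attempt comes from this README.

```python
from datetime import timedelta

# Hypothetical values: the README only says that the intervals increase.
BASE_INTERVAL = timedelta(hours=1)
MAX_INTERVAL = timedelta(days=30)


def next_retry_delay(ntried):
    """Return how long to wait before querying a pmid again.

    ntried is the number of unsuccessful attempts so far. The delay
    doubles with each failed attempt and is capped at MAX_INTERVAL.
    """
    if ntried < 0:
        raise ValueError("ntried must be non-negative")
    if ntried == 0:
        return timedelta(0)  # never tried: query immediately
    # Cap the exponent so a very large ntried cannot overflow timedelta.
    exponent = min(ntried - 1, 20)
    delay = BASE_INTERVAL * (2 ** exponent)
    return min(delay, MAX_INTERVAL)
```

With these assumed values a pmid is retried after 1, 2, 4, ... hours, and never less often than once every 30 days.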

=== Installation ===
 * This set of scripts is written to be run as user "pmid2doi". Create that user.
 * Give the pmid2doi user read-only access to the "medline" database.
 * Create a directory /opt/pmid2doi-updater, and copy all files there.
 * Copy pmid2doi-updater to /etc/init.d; check whether the header is OK for your system.
 * Remember to do this late (when you have confirmed everything works):
       Use your system tools to make sure pmid2doi-updater is started on a reboot.
 * Create a database for the script to use. We use "pmid2doi_continuous", and that name is encoded in the ini file. Give
   the pmid2doi user read/write access.
 * Manually create the output table in the database:
CREATE TABLE `complete_mappings` (
  `pmid` int(11) DEFAULT NULL,
  `doi` varchar(255) NOT NULL,
  KEY `pmid_index` (`pmid`) USING BTREE,
  KEY `doi_index` (`doi`) USING BTREE
) ENGINE=MyISAM DEFAULT CHARSET=utf8;
 * You can prime the table with a downloaded mappings file.
 * Check the ".ini" file. The first time you run the script it needs to create the table, so set the create_table option to 1 for that run.
   Get a crossref API key, and set it in the ini file too.
 * Make sure wget is installed on the system; it is needed to download the journals file from crossref.
 * You can use the "update" procedure below for the first run, but do not forget to disable the create_table
   option afterwards!
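
Because the create_table and update_table flags must be switched off again after a one-off run, a small check of the ini file can help. The sketch below is an assumption-laden illustration: the section name "mapper" and the option names "database", "threads" and "api_key" are hypothetical, and only create_table and update_table are names taken from this README. Adapt it to the real layout of your ini file.

```python
import configparser

# Hypothetical ini layout, for illustration only.
EXAMPLE_INI = """
[mapper]
database = pmid2doi_continuous
create_table = 0
update_table = 0
threads = 10
api_key = YOUR-CROSSREF-KEY
"""


def load_settings(text):
    """Parse ini text and return the settings as a plain dict."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    section = parser["mapper"]
    return {
        "database": section.get("database"),
        "create_table": section.getboolean("create_table"),
        "update_table": section.getboolean("update_table"),
        "threads": section.getint("threads"),
        "api_key": section.get("api_key"),
    }


def one_off_flags(settings):
    """Return the names of one-off flags that are still switched on."""
    return [name for name in ("create_table", "update_table") if settings[name]]


settings = load_settings(EXAMPLE_INI)
print(one_off_flags(settings))  # an empty list means nothing was left switched on
```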

=== Whenever the medline 2013 database has been updated ===

Whenever the medline database has been updated, the table that tells pmid2doi-updater what is left
to do should be updated too. This is done by comparing the pmids in the medline database
with the ones in the "mapped" and "todo" tables. Any pmids that are in neither "mapped" nor "todo" are
added to the "todo" table with the status "never tried these yet".
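
The comparison described above is a set difference. The sketch below shows the idea on plain Python collections. It is not the updater's own code, which presumably does this work inside the database. The function names are hypothetical, and so is the use of ntried = 0 to represent "never tried these yet".

```python
def find_new_todo(medline_pmids, mapped_pmids, todo_pmids):
    """Return the sorted pmids that are in medline but in neither other table."""
    known = set(mapped_pmids) | set(todo_pmids)
    return sorted(set(medline_pmids) - known)


def new_todo_rows(medline_pmids, mapped_pmids, todo_pmids):
    """Build (pmid, ntried) rows for the todo table.

    ntried = 0 stands for "never tried these yet" (an assumption).
    """
    new_pmids = find_new_todo(medline_pmids, mapped_pmids, todo_pmids)
    return [(pmid, 0) for pmid in new_pmids]


# pmids 1 and 2 are already mapped, 3 is already queued; 4 and 5 are new.
print(new_todo_rows([1, 2, 3, 4, 5], [1, 2], [3]))  # [(4, 0), (5, 0)]
```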

To make this update, one flag in the ini file needs to be flipped.

   % sudo /etc/init.d/pmid2doi-updater stop
   % cd /opt/pmid2doi-updater
   % sudo vi mapper.ini
   [set the update_table=1]
   % sudo /etc/init.d/pmid2doi-updater start
   % ls -ltr *log*
   % less [latest log]
   2012-10-18 12:31:38.830144: Mapping PMID to DOIs started
   2012-10-18 12:31:38.838633: Adding new records to the to-be-mapped table...
   2012-10-18 12:43:51.615996: ...done.
   There are 5645990 pmids that need to be mapped.
   ....

Now you need to set the update_table flag back; otherwise this operation will be needlessly repeated
every 3 hours when the script is restarted.

   % sudo vi mapper.ini
   [set the update_table=0]

=== Running the automatic mappings ===

The automatic mappings are started by /etc/init.d/pmid2doi-updater. This should be running as soon
as the machine has booted. It can be manually "start"ed and "stop"ped like any other system service.

The init script starts the script "run.sh" in this directory as the user "pmid2doi". The run.sh
script in turn runs the python process that does the database updates, and monitors it: sometimes
the python script gets stuck on an API call, and when this has resulted in 30 minutes without any changes
in the log file, the python script is killed and restarted. In any case, the python script is killed
and restarted every 3 hours. Each run of the python script creates its own log file.
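
The two restart rules above (30 minutes without log changes, and a forced restart every 3 hours) amount to the decision sketched below. This is written in Python purely as an illustration. run.sh is a shell script and its actual logic may differ, and the function name should_restart and the reason strings are hypothetical.

```python
STALL_LIMIT = 30 * 60        # seconds without log changes before a restart
MAX_RUNTIME = 3 * 60 * 60    # seconds after which a run is always restarted


def should_restart(now, started_at, log_mtime):
    """Decide whether the python process should be killed and restarted.

    All arguments are timestamps in seconds, for example from time.time()
    and os.path.getmtime(). Returns a reason string, or None to keep running.
    """
    if now - started_at >= MAX_RUNTIME:
        return "max runtime reached"
    if now - log_mtime >= STALL_LIMIT:
        return "log file stalled"
    return None
```

A monitoring loop would call this periodically with the current time, the start time of the run and the modification time of the newest log file.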

To manually start automatically looping update runs, you can run

    % sudo -u pmid2doi ./run.sh &

The number of threads this will start is set in the ini file. Ten is reasonable.

=== Monitoring progress ===

All results are written to log files. To check how things are progressing, type

    % sudo -u pmid2doi ./q

This adds a status line to q.log and shows you the last 2 lines, so you can see how many
new mappings were found. Once in a while, you may want to move old log files to the "old"
subdirectory to restart the "q" statistics.

If you want to see the progress in the database, you can run

    % mysqladmin processlist

This will show one command thread per update thread in the python script.

You can check manually in the database how well the mappings are going:

    % mysql pmid2doi_continuous
    mysql> select count(pmid) from todo;
    mysql> select count(pmid) from complete_mappings;
    mysql> select ntried,count(*) from todo group by ntried limit 20;

=== Files ===

titleFile.csv is an index of crossref. It is automatically updated when needed; the "wget" tool
is required for this.

The source code for this work used to be in the "wikidata" project on trac.nbic.nl, in the
Scripts/PubMedPreProcessing/bin/pmid2doi directory. It has been transplanted into the
nbiceng project, under pmid2doi-updater.

=== Troubleshooting ===
 * Sometimes the database query to get "the first 100 mappings to be done" is suddenly very slow.
   When that happens, reoptimize the medline_citations and todo tables.