path: root/scripts/index-genenetwork
Age | Commit message | Author
2024-07-03 | Return a "-1" if the turtle directory does not exist. | Munyoki Kilyungi
* scripts/index-genenetwork (hash_rdf_graph): Remove check for the turtle directory. (is_data_modified): Ditto. (create_xapian_index): Ditto. Signed-off-by: Munyoki Kilyungi <me@bonfacemunyoki.com>
2024-07-03 | Generate a checksum for all the ttl files. | Munyoki Kilyungi
* scripts/index-genenetwork (hash_generif_graph): Rename to hash_rdf_graph. Generate a checksum of all the turtle files inside the ttl directory that's the basis for the GN virtuoso graph. (create_xapian_index): Rename hash_generif_graph -> hash_rdf_graph. Signed-off-by: Munyoki Kilyungi <me@bonfacemunyoki.com>
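A minimal sketch of what a single checksum over every turtle file could look like. The function name `hash_ttl_directory`, the use of MD5 (a nearby commit mentions an md5 sum, an earlier one SHA256), the sorted traversal, and the chunk size are all assumptions for illustration and are not taken from the script itself.

```python
import hashlib
from pathlib import Path


def hash_ttl_directory(ttl_dir: Path) -> str:
    """Return one MD5 hex digest covering every .ttl file in ttl_dir.

    Hypothetical helper: the real hash_rdf_graph may differ.
    """
    digest = hashlib.md5()
    # Sort so the checksum does not depend on directory listing order.
    for ttl_file in sorted(ttl_dir.glob("*.ttl")):
        with ttl_file.open("rb") as handle:
            # Read in chunks so large turtle files are not loaded whole.
            for chunk in iter(lambda: handle.read(65536), b""):
                digest.update(chunk)
    return digest.hexdigest()
```

Feeding all files into one digest means that a change to any turtle file changes the stored value, which is what a "has the RDF data changed?" check needs.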
2024-07-03 | Add type-hints to hash_generif_graph. | Munyoki Kilyungi
* scripts/index-genenetwork (hash_generif_graph): Add proper type hints. Signed-off-by: Munyoki Kilyungi <me@bonfacemunyoki.com>
2024-07-03 | Refactor how the generif md5 sum is calculated and stored in XAPIAN. | Munyoki Kilyungi
* scripts/index-genenetwork (hash_generif_graph): Build the generif checksum by directly building it from the file. (is_data_modified): Update how generif-checksums are verified. (create_xapian_index): Update how generif-checksums are stored in XAPIAN. Signed-off-by: Munyoki Kilyungi <me@bonfacemunyoki.com>
2024-07-03 | Use correct cache for RIF/Wiki entries. | Munyoki Kilyungi
Signed-off-by: Munyoki Kilyungi <me@bonfacemunyoki.com>
2024-07-03 | feat: drop intermediate folders when running parallel xapian compact | John Nduli
2024-07-03 | feat: add support for parallel xapian compact | John Nduli
2024-07-03 | feat: index rif and wiki without positions | John Nduli
2024-07-03 | feat: drop common words when building rdf caches | John Nduli
2024-07-03 | feat: set 67 parallel processes to run in prod | John Nduli
2024-07-03 | fix: remove namespaces since child processes copy the rdf caches | John Nduli
2024-07-03 | fix: use correct prefix and index key; group wiki cache query | John Nduli
2024-07-03 | feat: add wikidata indexing | John Nduli
2024-07-03 | feat: add global wikicache | John Nduli
2024-07-03 | feat: add sparql query to get wikidata | John Nduli
2024-06-24 | Use dataset Name instead of FullName for indexing | zsloan
The Name is generally used as the identifier, while the FullName can contain spaces, which can cause problems.
2024-06-18 | Revert "Set the file path for the logger." | Munyoki Kilyungi
This reverts commit b21102bc4ad3678173e7c94d3e66333ec7c1d40a.
2024-06-18 | refactor: drop global variables | John Nduli
2024-06-17 | Check table names in Xapian; if not, default to "-1". | Munyoki Kilyungi
Without this check, there will always be an error when this script is run with the "is-data-modified" flag should there be no database in the XAPIAN_DIRECTORY. Signed-off-by: Munyoki Kilyungi <me@bonfacemunyoki.com>
2024-06-17 | Fetch distinct comments. | Munyoki Kilyungi
Signed-off-by: Munyoki Kilyungi <me@bonfacemunyoki.com>
2024-06-14 | fix: typehints in index-genenetwork script | John Nduli
2024-06-14 | fix: fix incorrect parameters in index_query function | John Nduli
2024-06-12 | Move the generated xapian files to the correct directory. | Munyoki Kilyungi
Signed-off-by: Munyoki Kilyungi <me@bonfacemunyoki.com>
2024-06-12 | Set the file path for the logger. | Munyoki Kilyungi
Signed-off-by: Munyoki Kilyungi <me@bonfacemunyoki.com>
2024-06-12 | Change the date format for the logger. | Munyoki Kilyungi
Signed-off-by: Munyoki Kilyungi <me@bonfacemunyoki.com>
2024-06-12 | Log how long it takes to run the indexing script. | Munyoki Kilyungi
Signed-off-by: Munyoki Kilyungi <me@bonfacemunyoki.com>
2024-06-12 | Check for a running process by viewing the build dir's contents. | Munyoki Kilyungi
In the CI build, the actual build is run in the xapian_directory/build, which is seen as the xapian_directory in this script. The CI handles clean up WRT removing files related to the build process. * scripts/index-genenetwork (create_xapian_index): Create the xapian directory if it doesn't exist. If the xapian directory has files, exit. Create the temporary directory inside the xapian_directory. Remove "build_directory.rmdir()" Signed-off-by: Munyoki Kilyungi <me@bonfacemunyoki.com>
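The directory handling described in this commit can be sketched roughly as follows. The function name `prepare_build_directory` and the error message are hypothetical, and the sketch only assumes the three steps the message lists: create the directory if missing, exit if it already holds files, then create a temporary directory inside it.

```python
import sys
import tempfile
from pathlib import Path


def prepare_build_directory(xapian_directory: Path) -> Path:
    """Create xapian_directory if needed and return a fresh temp dir in it.

    Hypothetical helper: not the actual code in create_xapian_index.
    """
    # Create the directory and any missing parents.
    xapian_directory.mkdir(parents=True, exist_ok=True)
    # Any existing entry is taken as a sign that another build is running.
    if any(xapian_directory.iterdir()):
        sys.exit(f"{xapian_directory} is not empty; is another build running?")
    # The temporary build directory lives inside the xapian directory.
    return Path(tempfile.mkdtemp(dir=xapian_directory))
```

Passing a string to `sys.exit` prints it to stderr and exits with status 1, so a second invocation against a directory that is still in use fails loudly rather than interleaving two builds.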
2024-06-12 | Return 0 if data changes, else exit with 1. | Munyoki Kilyungi
* scripts/index-genenetwork (is_data_modified): Replace click.echo with the respective sys.exit call. Signed-off-by: Munyoki Kilyungi <me@bonfacemunyoki.com>
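As an illustration of the exit-status convention in this commit, the sketch below maps a checksum comparison onto a process exit code. The names `exit_code_for` and `report_and_exit`, and the idea of passing checksums as plain mappings, are assumptions; how the script actually obtains the stored and current values is not shown here.

```python
import sys
from typing import Mapping


def exit_code_for(stored: Mapping[str, str], current: Mapping[str, str]) -> int:
    """Return 0 when the checksums differ (data changed), else 1."""
    return 0 if dict(stored) != dict(current) else 1


def report_and_exit(stored: Mapping[str, str], current: Mapping[str, str]) -> None:
    """Exit the process with the status a caller such as CI can branch on."""
    sys.exit(exit_code_for(stored, current))
```

Because 0 is the shell's "success" status, a caller can run the rebuild step only when the check succeeds, without parsing any printed output.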
2024-06-12 | Explicitly pass sparql_uri to script. | Munyoki Kilyungi
Signed-off-by: Munyoki Kilyungi <me@bonfacemunyoki.com>
2024-06-12 | Rework how the indexes are built. | Munyoki Kilyungi
Right now, the checks are done in Guix's build expression. This moves that work to the index-genenetwork script.
2024-06-12 | Add method to check the validity of the tables+RDF checksums. | Munyoki Kilyungi
* scripts/index-genenetwork (verify_checksums): New function. Signed-off-by: Munyoki Kilyungi <me@bonfacemunyoki.com>
2024-06-12 | Generate a SHA256 checksum for the generif graph. | Munyoki Kilyungi
Signed-off-by: Munyoki Kilyungi <me@bonfacemunyoki.com>
2024-06-01 | Use global cache to store generif metadata. | Munyoki Kilyungi
This global cache has 3,528 entries and is not expected to grow significantly. Since child processes inherit the parent’s memory, we can pass the global cache to them, reducing fetch times from 0.001s to 0.00001s, significantly boosting performance when indexing the entire database and enriching results with RDF metadata. Signed-off-by: Munyoki Kilyungi <me@bonfacemunyoki.com>
2024-06-01 | Add geneRIF to gene index. | Munyoki Kilyungi
* scripts/index-genenetwork: Import Template, lru_cache, SPARQLWrapper, JSON (get_rif_metadata): New function. (index_rif_comments): New function. (index_genes): Add rif comments to probeset index. Signed-off-by: Munyoki Kilyungi <me@bonfacemunyoki.com>
2023-05-31 | scripts: Write table checksums into index. | Arun Isaac
* scripts/index-genenetwork (main): Write table checksums into index.
2023-05-31 | scripts: Introduce SQLTableClause. | Arun Isaac
* scripts/index-genenetwork (SQLTableClause): New variable. (genes_query, phenotypes_query): Express tables using SQLTableClause. (serialize_sql): Serialize SQLTableClause.
2023-05-31 | scripts: Fold long lines. | Arun Isaac
* scripts/index-genenetwork (write_document, index_query): Fold long lines.
2023-05-31 | scripts: Ensure only one indexing job may run at a time. | Arun Isaac
* scripts/index-genenetwork (main): Ensure no other indexing job is running.
2023-05-22 | Make directory at "path" and all intermediate ones. | Frederick Muriuki Muriithi
Make the directory at the given path, and any intermediate ones to avoid errors in the indexing code when the directory, or its parent(s) do not exist.
2023-04-05 | Enable use of `database_connection` in scripts without current_app | Frederick Muriuki Muriithi
There is a need to run external scripts using the same configuration as the application, without coupling the script to the application. In this case, we provide the needed configuration directly in the CLI, and modify the existing `gn3.db_utils.database_connection` function so that it works whether or not it is coupled to the app.
2023-02-13 | scripts: Fallback to 1 worker when indexing. | Arun Isaac
* scripts/index-genenetwork (worker_queue): Set default number of workers to 1 if the number of CPUs cannot be determined.
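The fallback described here can be written in one line in Python. The helper name `default_worker_count` is hypothetical, and the sketch assumes the CPU count comes from `os.cpu_count()`, which returns `None` when the count cannot be determined.

```python
import os


def default_worker_count() -> int:
    """Return the CPU count, or 1 if it cannot be determined."""
    # os.cpu_count() returns None on platforms where the count is unknown;
    # "or 1" turns that into a single worker.
    return os.cpu_count() or 1
```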
2023-02-13 | scripts: Type hint xapian indexing script. | Arun Isaac
* scripts/index-genenetwork: Import Callable, Generator, Iterable and List from typing. Type hint all functions.
2022-10-18 | Add xapian indexing script. | Arun Isaac
* scripts/index-genenetwork: New file. * setup.py (install_requires): Add click, pymonad and xapian-bindings. (scripts): Add scripts/index-genenetwork.