From f63f9df201c3ae57ad8b52cf5127995c82944be7 Mon Sep 17 00:00:00 2001 From: Pjotr Prins Date: Sun, 3 Dec 2023 09:42:57 -0600 Subject: Moving files --- topics/database/genotype-database.gmi | 109 +++++++++++++++++++++ .../queries-and-prepared-statements-in-python.gmi | 95 ++++++++++++++++++ 2 files changed, 204 insertions(+) create mode 100644 topics/database/genotype-database.gmi create mode 100644 topics/database/queries-and-prepared-statements-in-python.gmi (limited to 'topics/database') diff --git a/topics/database/genotype-database.gmi b/topics/database/genotype-database.gmi new file mode 100644 index 0000000..7b8eefc --- /dev/null +++ b/topics/database/genotype-database.gmi @@ -0,0 +1,109 @@ +# Genotype database + +GeneNetwork has been using a plain text file format to store genotypes. We are now moving to a fast read-optimized genotype database file format built on LMDB. +=> https://www.symas.com/lmdb Lightning Memory-Mapped Database + +To convert a plain text GeneNetwork genotype file `BXD.geno` to a genodb genotype database `bxd`, use the `genodb` CLI tool from cl-gn like so. +``` +./genodb import BXD.geno bxd +``` +=> https://git.genenetwork.org/GeneNetwork/cl-gn cl-gn + +genenetwork3 includes a tiny Python library to read the built genodb database. Here is a sample invocation reading the entire matrix, row 17 and column 13 from a database at `/tmp/bxd`. +``` +from gn3 import genodb + +with genodb.open('/tmp/bxd') as db: + matrix = genodb.matrix(db) + print(genodb.nparray(matrix)) + print(genodb.row(matrix, 17)) + print(genodb.column(matrix, 13)) +``` + +Note: Pjotr has also written an implementation with zig for mgamma. + +The rest of this document describes the design and layout of genodb. + +## Database layout + +genodb is an immutable functional database built on the LMDB key-value store. An immutable database may sound like an oxymoron, but is indeed possible and practical. More precisely, in an immutable database, values once put in are never mutated. When a value needs to be changed, a new modified copy of the value is created. The old value is never touched. This immutability means that we can happily pass along and operate on values without worrying whether it may have changed in the underlying database. + +To learn the basics of functional databases, see the following talks: + +=> https://github.com/matthiasn/talk-transcripts/blob/master/Hickey_Rich/FunctionalDatabase.md The Functional Database +=> https://github.com/matthiasn/talk-transcripts/blob/master/Hickey_Rich/DeconstructingTheDatabase.md Deconstructing the Database + +Being a functional database, genodb can store multiple versions of the genotype matrix. These versions are stored efficiently on disk optimizing for disk usage. Two additional copies of the most recent version of the matrix are stored in read-optimized form for fast retrieval. + +### Encoding + +LMDB maps octet vector keys to octet vector values. Any data we put into a LMDB database needs to be encoded to octets (effectively aka bytes). genodb supports the following three data types with their respective encodings. + +* integer: little-endian encoded 64-bit unsigned integer +* string: UTF-8 encoded without a terminating null character +* octet vector: no encoding required, written verbatim + +### Blobs + +The basic unit of storage in the database is a blob. A BLOB is an octet vector PAYLOAD with associated METADATA. To store a blob in the database, we first compute its HASH, and then put PAYLOAD into the database as a key-value pair. HASH is the SHA256 hash of BLOB (both the octet vector payload and its associated metadata). To compute HASH, we first serialize BLOB into a series of octets, and then hash the resulting octet vector. Precisely, if BLOB contains PAYLOAD and is associated with (KEY, VALUE),... pairs of metadata, then hash(BLOB) is given by +``` +BLOB = blob(payload=PAYLOAD, + metadata=[(KEY, VALUE),...]) +hash(BLOB) = SHA256(concatenate(length(BLOB.payload), BLOB.payload, + [length(BLOB.metadata.KEY), BLOB.metadata.KEY, + length(BLOB.metadata.VALUE), BLOB.metadata.VALUE],...)) +``` +This encoding of BLOB into octets is one-to-one. So, assuming there are no hash collisions, every BLOB is uniquely mapped to a HASH. + +Each piece (KEY, VALUE) of METADATA is stored as a key-value pair. KEY is a string identifying that piece of metadata. VALUE is a string, an integer, or an octet vector representing the value of that piece of metadata, and is encoded accordingly. + +### Matrix storage + +We store every version of the genotype matrix in the database, each version as a blob. To construct a MATRIX blob, we first store each ROW and COLUMN of the MATRIX as a separate blob. Each row and column is encoded into an octet vector with one octet corresponding to one element of the genotype matrix. Then, MATRIX is constructed as a blob whose octet vector is the concatenation of the hashes of all the row and column blobs. In addition, the number of rows and columns of MATRIX are also stored as metadata. +``` +ROW = blob(payload=ROW-VECTOR, metadata=[]) +COLUMN = blob(payload=COLUMN-VECTOR, metadata=[]) +MATRIX = blob(payload=concatenate(concatenate(hash(ROW1), hash(ROW2),...), + concatenate(hash(COLUMN1), hash(COLUMN2),...)), + metadata=[("nrows", NUMBER-OF-ROWS), + ("ncols", NUMBER-OF-COLUMNS)]) +``` +We repeat this for every version of the genotype matrix, and associate the concatenated hashes of all the matrix blobs with the "all-versions" key by mutation. +``` +put(key="all-versions", + value=concatenate(hash(MATRIX1), hash(MATRIX2),...)) +``` + +### Fast storage for the current matrix + +We store two additional copies of the current matrix for fast retrieval. This read-optimized version of the matrix is, essentialy, the matrix in its row-major and column-major forms. The row-major form facilitates fast row reads, and the column-major form facilitates fast column reads. If MATRIX0 is the most recent matrix, then the blob CURRENT_MATRIX stored in the database is given by the following. +``` +CURRENT_MATRIX = blob(payload=concatenate(row-major-encoding(MATRIX0), + row-major-encoding(transpose(MATRIX0))), + metadata=[("matrix", hash(MATRIX0))]) +``` +The hash of CURRENT_MATRIX is associated with the "current" key by mutation. +``` +put(key="current", + value=hash(CURRENT_MATRIX)) +``` + +### Design notes + +Note that though even though genodb is a functional immutable database, the setting of the "all-versions" and "current" keys are done by mutation. This is unavoidable. Even a functional database needs to have a tiny amount of state. The trick is to manage and isolate this state so that it is manageable. + +The attentive reader familiar with Guix might note the similarities between the layout of the genodb database and that of Guix's /gnu/store. Indeed, both genodb and the Guix store are functional databases. genodb happens to be realized on LMDB, and the Guix store happens to be realized on the filesystem. + +Storing both rows and columns of older versions of the genotype matrix is redundant since the columns can be entirely derived from the rows. This is a happenstance due to the evolution of the genotype database layout, and may be removed in the future. Indeed, in the future, the older versions of the matrix could also be stored in compressed form for more efficient storage. + +### More thoughts + +The choice of lmdb for column/row based storage is an extremely good idea! I have been writing code against it and its performance is great. + +The storage of both columns and rows for all *versions* of data are perhaps not necessary (as people tend to use the latest and greatest). Having one final matrix blob may also be overkill and may not work for really large datasets. I suggest to store older versions as rows (vectors) only and the current matrix as cols + rows. + +The hashing is a good idea, though its value may be limited on updates where changes are sparse. Nevertheless I want to retain this feature. + +I will modify the format to retain metadata as more free-flow JSON records. This is useful when switching between languages and allows easy adding of various attributes. There will be a global metadata record and a per matrix metadata record. + +We should make it a point that the RDF graph store gets updated when we change one of these files. So they can easily be found with their metadata. diff --git a/topics/database/queries-and-prepared-statements-in-python.gmi b/topics/database/queries-and-prepared-statements-in-python.gmi new file mode 100644 index 0000000..969c27f --- /dev/null +++ b/topics/database/queries-and-prepared-statements-in-python.gmi @@ -0,0 +1,95 @@ +# Queries and Prepared Statements in Python + +String interpolation when writing queries is a really bad idea; it leads to exposure to SQL Injection attacks. To mitigate against this, we need to write queries using placeholders for values, then passing in the values as arguments to the **execute** function. + +As a demonstration, using some existing code, do not write a query like this: + +``` +curr.execute( + """ + SELECT Strain.Name, Strain.Id FROM Strain, Species + WHERE Strain.Name IN {} + and Strain.SpeciesId=Species.Id + and Species.name = '{}' + """.format( + create_in_clause(list(sample_data.keys())), + *mescape(dataset.group.species))) +``` + +In the query above, we interpolate the values of the 'sample_data.keys()' values and that of the 'dataset.group.species' values. + +The code above can be rewritten to something like: + +``` +sample_data_keys = tuple(key for key in sample_data.keys()) + +curr.execute( + """ + SELECT Strain.Name, Strain.Id FROM Strain, Species + WHERE Strain.Name IN ({}) + and Strain.SpeciesId=Species.Id + and Species.name = %s + """.format(", ".join(["%s"] * len(sample_data_keys))), + (sample_data_keys + (dataset.group.species,))) +``` + +In this new query, the IN clause ends up being a string of the form + +> %s, %s, %s, ... + +for the total number of items in the 'sample_data_key' tuple. + +There is one more '%s' placeholder for the 'Species.name' value, so, the final tuple we provide as an argument to execute needs to add the 'dataset.group.species' value. + +**IMPORTANT 01**: the total number of placeholders (%s) must be the same as the total number of arguments passed into the 'execute' function. + +**IMPORTANT 02**: the order of the values must correspond to the order of the placeholders. + +### Aside + +The functions 'create_in_clause' and 'mescape' are defined as below: + +``` +from MySQLdb import escape_string as escape_ + +def create_in_clause(items): + """Create an in clause for mysql""" + in_clause = ', '.join("'{}'".format(x) for x in mescape(*items)) + in_clause = '( {} )'.format(in_clause) + return in_clause + +def mescape(*items): + """Multiple escape""" + return [escape_(str(item)).decode('utf8') for item in items] + +def escape(string_): + return escape_(string_).decode('utf8') +``` + + +## Parameter Style + +In the section above, we show the most common parameter style used in most cases. + +If you want to use a mapping object (dict), you have the option of using the '%()s' format for the query. In that case, we could rewrite the query above into something like: + +``` +sample_data_dict = {f"sample_{idx}: key for idx,key in enumerate(sample_data.keys())} + +curr.execute( + """ + SELECT Strain.Name, Strain.Id FROM Strain, Species + WHERE Strain.Name IN ({}) + and Strain.SpeciesId=Species.Id + and Species.name = %(species_name)s + """.format(", ".join([f"%({key})s" for key in sample_data_dict.keys()])), + {**sample_data_dict, "species_name": dataset.group.species}) +``` + +## Final Note + +While this has dealt mostly with the MySQLdb driver for Python3, the idea is the same for the psycopg2 (PostgreSQL) driver and others (with some minor variation in the details). + +The concept is also similar in many other languages. + +The main takeaway is that you really should not be manually escaping the values - instead, you should let the driver do that for you, by providing placeholders in the query, and the values to use separately. -- cgit v1.2.3