author | Pjotr Prins | 2018-09-19 09:50:04 +0000 |
---|---|---|
committer | Pjotr Prins | 2018-09-19 09:50:04 +0000 |
commit | 98aeb737941508515acd4a6beccd946ca39d79c1 (patch) | |
tree | 3032ea6ac3418e0ea0f69089fe5001ea777112f2 | |
parent | 2b86b6423c33923a385a341fb4d12b7c38b7c924 (diff) | |
download | pangemma-98aeb737941508515acd4a6beccd946ca39d79c1.tar.gz |
Fix progressbar for K and started on design doc
-rw-r--r-- | RELEASE-NOTES.md | 18 | ||||
-rw-r--r-- | doc/developers/design.org | 52 | ||||
-rw-r--r-- | src/gemma_io.cpp | 8 |
3 files changed, 71 insertions, 7 deletions
diff --git a/RELEASE-NOTES.md b/RELEASE-NOTES.md
index b0923d1..75e1144 100644
--- a/RELEASE-NOTES.md
+++ b/RELEASE-NOTES.md
@@ -1,11 +1,23 @@
-## ChangeLog v0.97 (2017/12/19)
-
-This is a massive bug fix release with many improvements. For contributions
+For contributions
 see
 [contributors](https://github.com/genetics-statistics/GEMMA/graphs/contributors)
 and
 [commits](https://github.com/genetics-statistics/GEMMA/commits/master).
 
+## ChangeLog v0.98 (date)
+
+With the v0.98 release GEMMA has stabilized and contains extensive
+error checking. This release contains quite a few bug fixes,
+hardware-based floating point checking and speedups.
+
+Note: This is the last purely C/C++ compilable release because we are
+integrating faster-lmm-d code for new functionality in the next
+version. Also we are working on a Python and R interface.
+
+## ChangeLog v0.97 (2017/12/19)
+
+This is a massive bug fix release with many improvements.
+
 ### Speedup of GEMMA by using optimized OpenBlas
 
 * Binary release with OpenBlas optimization for generic x86_64 and for Intel Haswell
diff --git a/doc/developers/design.org b/doc/developers/design.org
new file mode 100644
index 0000000..859e3f6
--- /dev/null
+++ b/doc/developers/design.org
@@ -0,0 +1,52 @@
+* GEMMA Design Document
+
+** Introduction
+
+With the v0.98 release GEMMA has stabilized and contains extensive
+error checking. To move faster we are moving towards integrating the
+faster-lmm-d code base which is written in the D programming
+language. We are also add interfaces for Python and R (and other
+languages). We will try to keep a legacy C++ based GEMMA as long as
+possible, but for performance and features it is likely a D compiler
+is required. The good news is that most distributions contain D
+compilers today.
+
+** Faster-lmm-d integration
+
+Faster-lmm-d is mostly a rewrite of GEMMA univariate LMM and
+multivariate LMM resolvers. We compile faster-lmm-d as a library that
+can be linked against GEMMA. For computing K, for example, there are
+two modes: (1) that has all genotype data in RAM and (2) that loads
+the genotype data directly from a geno file.
+
+** Improved data formats
+
+The original data formats are somewhat lacking because they make error
+correction hard. In collaboration with the R/qtl2 project we aim for
+supporting newer formats.
+
+*** Kinship format
+
+As the kinship mastrix K is symmetric we only need to store half the
+data. Also we want to be able to filter and validate on the names of
+individuals/samples. Next we compress it. A comparison of formats is
+[[https://catchchallenger.first-world.info/wiki/Quick_Benchmark:_Gzip_vs_Bzip2_vs_LZMA_vs_XZ_vs_LZ4_vs_LZO][here]]. Decompression speed is most critical and [[https://github.com/lz4/lz4][lz4]] does a great job
+there (lz4 is used in CRAM and sambamba). According to [[https://www.dummeraugust.com/main/content/blog/posts.php?pid=173][this comparison]]
+text processing is fairly similar between gzip and lz4. lz4 files are
+a bit larger, so decompression gains may be offset by network speeds.
+
+To recognise the tab dilimited file we'll add a header with nind's:
+
+#+BEGIN_SRC
+# GRMv1.0
+# nind=900
+ind1 0.1436717816 0.006341902008 0.007596806816 ...
+ind2 0.007996662028 0.008741860935 0.008489758779 ...
+...
+ind900 0.002311556029
+#+END_SRC
+
+where each row is one value shorter describing the right top half of
+K. This setup allows one to use a K with for exaple ind1 missing -
+just remove that row and column. The data will be stored as
+name.cXX.txt.lz4 (later add the alternative name.cxx.txt.gz).
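Reviewer note on the kinship format proposed in design.org: the upper-triangle layout can be sketched in a few lines of C++. This is only an illustration under stated assumptions, not GEMMA code. The `Kinship` struct and the `write_grm` / `read_grm` functions are hypothetical names, fields are split on any whitespace, and the lz4 compression step is left out.

```cpp
#include <cassert>
#include <cstddef>
#include <sstream>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical container for a named, symmetric kinship matrix (not a GEMMA type).
struct Kinship {
  std::vector<std::string> names; // one name per individual
  std::vector<double> k;          // row-major n x n values
  std::size_t n() const { return names.size(); }
  double at(std::size_t i, std::size_t j) const { return k[i * n() + j]; }
};

// Serialise K: two header lines, then row i holds the name and K(i,i)..K(i,n-1),
// so each row is one value shorter than the previous one.
std::string write_grm(const Kinship &K) {
  assert(K.k.size() == K.n() * K.n());
  std::ostringstream out;
  out.precision(10);
  out << "# GRMv1.0\n# nind=" << K.n() << "\n";
  for (std::size_t i = 0; i < K.n(); ++i) {
    out << K.names[i];
    for (std::size_t j = i; j < K.n(); ++j)
      out << '\t' << K.at(i, j);
    out << '\n';
  }
  return out.str();
}

// Parse the text form back into a full symmetric matrix.
Kinship read_grm(const std::string &text) {
  std::istringstream in(text);
  std::string line;
  if (!std::getline(in, line) || line != "# GRMv1.0")
    throw std::runtime_error("missing GRMv1.0 header");
  if (!std::getline(in, line) || line.rfind("# nind=", 0) != 0)
    throw std::runtime_error("missing nind header");
  const std::size_t n = std::stoul(line.substr(7));
  Kinship K;
  K.names.resize(n);
  K.k.assign(n * n, 0.0);
  for (std::size_t i = 0; i < n; ++i) {
    if (!std::getline(in, line))
      throw std::runtime_error("too few rows");
    std::istringstream row(line);
    row >> K.names[i];
    for (std::size_t j = i; j < n; ++j) {
      double v;
      if (!(row >> v))
        throw std::runtime_error("row too short");
      K.k[i * n + j] = v; // upper triangle as stored
      K.k[j * n + i] = v; // mirrored lower triangle
    }
  }
  return K;
}
```

Dropping an individual then amounts to removing its row and column before writing, as the design text describes.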
diff --git a/src/gemma_io.cpp b/src/gemma_io.cpp
index d92dc44..3c71ea6 100644
--- a/src/gemma_io.cpp
+++ b/src/gemma_io.cpp
@@ -1421,7 +1421,8 @@ bool BimbamKin(const string file_geno, const set<string> ksnps,
   for (size_t t = 0; t < indicator_snp.size(); ++t) {
     string line;
     safeGetline(infile, line).eof();
-    if (t % display_pace == 0 || t == (indicator_snp.size() - 1)) {
+    // if (t % display_pace == 0 || t == (indicator_snp.size() - 1)) {
+    if (t % display_pace == 0) {
       ProgressBar("Reading SNPs", t, indicator_snp.size() - 1);
     }
     if (indicator_snp[t] == 0)
@@ -1511,22 +1512,21 @@ bool BimbamKin(const string file_geno, const set<string> ksnps,
     if (ns_test % msize != 0) {
       fast_eigen_dgemm("N", "T", 1.0, Xlarge, Xlarge, 1.0, matrix_kin);
     }
+  ProgressBar("Reading SNPs", 100, 100);
   cout << endl;
 
   // scale the kinship matrix
   enforce_gsl(gsl_matrix_scale(matrix_kin, 1.0 / (double)ns_test));
 
   // and transpose
-  // FIXME: the following is very slow
+  // FIXME: the following is not so slow
 
-  debug_msg("begin transpose");
   for (size_t i = 0; i < ni_total; ++i) {
     for (size_t j = 0; j < i; ++j) {
       double d = gsl_matrix_get(matrix_kin, j, i);
       gsl_matrix_set(matrix_kin, i, j, d);
     }
   }
-  debug_msg("end transpose");
 
   // GSL is faster - and there are even faster methods
   // enforce_gsl(gsl_matrix_transpose(matrix_kin));
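Reviewer note on the loop kept in the second hunk: it reads element (j, i) above the diagonal and writes it to (i, j) below it. A minimal standalone sketch of that index pattern follows, assuming a row-major `std::vector<double>` in place of a `gsl_matrix`. The function name `mirror_upper_to_lower` is hypothetical and does not appear in GEMMA.

```cpp
#include <cstddef>
#include <vector>

// Copy the upper triangle of a row-major n x n matrix into its lower triangle.
// Same access pattern as the loop in the diff: read (j, i), write (i, j), j < i.
void mirror_upper_to_lower(std::vector<double> &m, std::size_t n) {
  for (std::size_t i = 0; i < n; ++i) {
    for (std::size_t j = 0; j < i; ++j) {
      m[i * n + j] = m[j * n + i];
    }
  }
}
```

In this sketch the upper triangle and the diagonal are left untouched, so the result is a symmetric matrix rather than a transposed one.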