Fix progressbar for K and started on design doc

author: Pjotr Prins 2018-09-19 09:50:04 +0000
committer: Pjotr Prins 2018-09-19 09:50:04 +0000
commit: 98aeb737941508515acd4a6beccd946ca39d79c1 (patch)
tree: 3032ea6ac3418e0ea0f69089fe5001ea777112f2
parent: 2b86b6423c33923a385a341fb4d12b7c38b7c924 (diff)
download: pangemma-98aeb737941508515acd4a6beccd946ca39d79c1.tar.gz
3 files changed, 71 insertions, 7 deletions
diff --git a/RELEASE-NOTES.md b/RELEASE-NOTES.md
index b0923d1..75e1144 100644
--- a/RELEASE-NOTES.md
+++ b/RELEASE-NOTES.md
@@ -1,11 +1,23 @@
-## ChangeLog v0.97 (2017/12/19)
-
-This is a massive bug fix release with many improvements. For contributions
+For contributions
 see
 [contributors](https://github.com/genetics-statistics/GEMMA/graphs/contributors)
 and
 [commits](https://github.com/genetics-statistics/GEMMA/commits/master).
 
+## ChangeLog v0.98 (date)
+
+With the v0.98 release GEMMA has stabilized and contains extensive
+error checking. This release contains quite a few bug fixes,
+hardware-based floating point checking and speedups.
+
+Note: This is the last purely C/C++ compilable release because we are
+integrating faster-lmm-d code for new functionality in the next
+version. Also we are working on a Python and R interface.
+
+## ChangeLog v0.97 (2017/12/19)
+
+This is a massive bug fix release with many improvements.
+
 ### Speedup of GEMMA by using optimized OpenBlas
 
 * Binary release with OpenBlas optimization for generic x86_64 and for Intel Haswell
diff --git a/doc/developers/design.org b/doc/developers/design.org
new file mode 100644
index 0000000..859e3f6
--- /dev/null
+++ b/doc/developers/design.org
@@ -0,0 +1,52 @@
+* GEMMA Design Document
+
+** Introduction
+
+With the v0.98 release GEMMA has stabilized and contains extensive
+error checking. To move faster we are moving towards integrating the
+faster-lmm-d code base which is written in the D programming
+language. We are also adding interfaces for Python and R (and other
+languages). We will try to keep a legacy C++ based GEMMA as long as
+possible, but for performance and features it is likely a D compiler
+is required. The good news is that most distributions contain D
+compilers today.
+
+** Faster-lmm-d integration
+
+Faster-lmm-d is mostly a rewrite of GEMMA univariate LMM and
+multivariate LMM resolvers. We compile faster-lmm-d as a library that
+can be linked against GEMMA. For computing K, for example, there are
+two modes: (1) one that holds all genotype data in RAM and (2) one that
+loads the genotype data directly from a geno file.
+
+** Improved data formats
+
+The original data formats are somewhat lacking because they make error
+correction hard. In collaboration with the R/qtl2 project we aim to
+support newer formats.
+
+*** Kinship format
+
+As the kinship matrix K is symmetric we only need to store half the
+data. Also we want to be able to filter and validate on the names of
+individuals/samples. Next we compress it. A comparison of formats is
+[[https://catchchallenger.first-world.info/wiki/Quick_Benchmark:_Gzip_vs_Bzip2_vs_LZMA_vs_XZ_vs_LZ4_vs_LZO][here]]. Decompression speed is most critical and [[https://github.com/lz4/lz4][lz4]] does a great job
+there (lz4 is used in CRAM and sambamba). According to [[https://www.dummeraugust.com/main/content/blog/posts.php?pid=173][this comparison]]
+text processing is fairly similar between gzip and lz4. lz4 files are
+a bit larger, so decompression gains may be offset by network speeds.
+
+To recognise the tab-delimited file we'll add a header with the number of individuals (nind):
+
+#+BEGIN_SRC
+# GRMv1.0
+# nind=900
+ind1 0.1436717816 0.006341902008 0.007596806816 ...
+ind2 0.007996662028 0.008741860935 0.008489758779 ...
+...
+ind900 0.002311556029
+#+END_SRC
+
+where each row is one value shorter than the previous one, describing the
+upper-right half of K. This setup allows one to use a K with, for example,
+ind1 missing: just remove that row and column. The data will be stored as
+name.cXX.txt.lz4 (later add the alternative name.cXX.txt.gz).
diff --git a/src/gemma_io.cpp b/src/gemma_io.cpp
index d92dc44..3c71ea6 100644
--- a/src/gemma_io.cpp
+++ b/src/gemma_io.cpp
@@ -1421,7 +1421,8 @@ bool BimbamKin(const string file_geno, const set<string> ksnps,
   for (size_t t = 0; t < indicator_snp.size(); ++t) {
     string line;
     safeGetline(infile, line).eof();
-    if (t % display_pace == 0 || t == (indicator_snp.size() - 1)) {
+    // if (t % display_pace == 0 || t == (indicator_snp.size() - 1)) {
+    if (t % display_pace == 0) {
       ProgressBar("Reading SNPs", t, indicator_snp.size() - 1);
     }
     if (indicator_snp[t] == 0)
@@ -1511,22 +1512,21 @@ bool BimbamKin(const string file_geno, const set<string> ksnps,
   if (ns_test % msize != 0) {
     fast_eigen_dgemm("N", "T", 1.0, Xlarge, Xlarge, 1.0, matrix_kin);
   }
+  ProgressBar("Reading SNPs", 100, 100);
   cout << endl;
 
   // scale the kinship matrix
   enforce_gsl(gsl_matrix_scale(matrix_kin, 1.0 / (double)ns_test));
 
   // and transpose
-  // FIXME: the following is very slow
+  // FIXME: the following is not so slow
 
-  debug_msg("begin transpose");
   for (size_t i = 0; i < ni_total; ++i) {
     for (size_t j = 0; j < i; ++j) {
       double d = gsl_matrix_get(matrix_kin, j, i);
       gsl_matrix_set(matrix_kin, i, j, d);
     }
   }
-  debug_msg("end transpose");
   // GSL is faster - and there are even faster methods
   // enforce_gsl(gsl_matrix_transpose(matrix_kin));
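
The kinship layout proposed in doc/developers/design.org above (a `# GRMv1.0` line, a `# nind=` line, then one row per individual holding the name and the upper-right half of K) could be read back with something like the following. This is only a minimal sketch, not code from GEMMA or faster-lmm-d: the `Kinship` struct and `read_grm_v1` function are hypothetical names, the input is assumed to be already decompressed, and sample names are assumed to contain no whitespace.

```cpp
#include <cassert>
#include <cstddef>
#include <istream>
#include <sstream>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical container for illustration; not a GEMMA type.
struct Kinship {
  std::size_t n = 0;               // number of individuals (nind)
  std::vector<std::string> names;  // sample names in file order
  std::vector<double> data;        // dense n x n matrix, row-major
  double at(std::size_t i, std::size_t j) const { return data[i * n + j]; }
};

// Reads the proposed text layout: two header lines, then row i holds the
// sample name followed by K[i][i], K[i][i+1], ..., K[i][n-1].
Kinship read_grm_v1(std::istream &in) {
  std::string line;
  if (!std::getline(in, line) || line != "# GRMv1.0")
    throw std::runtime_error("missing '# GRMv1.0' header");

  const std::string key = "# nind=";
  if (!std::getline(in, line) || line.compare(0, key.size(), key) != 0)
    throw std::runtime_error("missing '# nind=' header");

  Kinship k;
  k.n = std::stoul(line.substr(key.size()));
  k.data.assign(k.n * k.n, 0.0);

  for (std::size_t i = 0; i < k.n; ++i) {
    if (!std::getline(in, line))
      throw std::runtime_error("fewer rows than nind");
    std::istringstream fields(line);  // splits on tabs and spaces
    std::string name;
    if (!(fields >> name))
      throw std::runtime_error("row without a sample name");
    k.names.push_back(name);

    for (std::size_t j = i; j < k.n; ++j) {
      double v;
      if (!(fields >> v))
        throw std::runtime_error("row too short for " + name);
      // Mirror the stored upper half into the lower half.
      k.data[i * k.n + j] = v;
      k.data[j * k.n + i] = v;
    }
    std::string extra;
    if (fields >> extra)
      throw std::runtime_error("row too long for " + name);
  }
  assert(k.names.size() == k.n);
  return k;
}
```

Because the names are kept alongside the values, a caller can check them against the sample names in the phenotype or genotype input before using K, which is the filtering and validation the design note asks for.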