Diffstat (limited to 'doc/developers')
-rw-r--r--  doc/developers/design.org    52
-rw-r--r--  doc/developers/profiling.md  30
2 files changed, 0 insertions, 82 deletions
diff --git a/doc/developers/design.org b/doc/developers/design.org
deleted file mode 100644
index 859e3f6..0000000
--- a/doc/developers/design.org
+++ /dev/null
@@ -1,52 +0,0 @@
-* GEMMA Design Document
-
-** Introduction
-
-With the v0.98 release GEMMA has stabilized and contains extensive
-error checking. To move faster we are moving towards integrating the
-faster-lmm-d code base which is written in the D programming
-language. We are also add interfaces for Python and R (and other
-languages). We will try to keep a legacy C++ based GEMMA as long as
-possible, but for performance and features it is likely a D compiler
-is required. The good news is that most distributions contain D
-compilers today.
-
-** Faster-lmm-d integration
-
-Faster-lmm-d is mostly a rewrite of GEMMA univariate LMM and
-multivariate LMM resolvers. We compile faster-lmm-d as a library that
-can be linked against GEMMA. For computing K, for example, there are
-two modes: (1) that has all genotype data in RAM and (2) that loads
-the genotype data directly from a geno file.
-
-** Improved data formats
-
-The original data formats are somewhat lacking because they make error
-correction hard. In collaboration with the R/qtl2 project we aim for
-supporting newer formats.
-
-*** Kinship format
-
-As the kinship mastrix K is symmetric we only need to store half the
-data. Also we want to be able to filter and validate on the names of
-individuals/samples. Next we compress it. A comparison of formats is
-[[https://catchchallenger.first-world.info/wiki/Quick_Benchmark:_Gzip_vs_Bzip2_vs_LZMA_vs_XZ_vs_LZ4_vs_LZO][here]]. Decompression speed is most critical and [[https://github.com/lz4/lz4][lz4]] does a great job
-there (lz4 is used in CRAM and sambamba). According to [[https://www.dummeraugust.com/main/content/blog/posts.php?pid=173][this comparison]]
-text processing is fairly similar between gzip and lz4. lz4 files are
-a bit larger, so decompression gains may be offset by network speeds.
-
-To recognise the tab dilimited file we'll add a header with nind's:
-
-#+BEGIN_SRC
-# GRMv1.0
-# nind=900
-ind1 0.1436717816 0.006341902008 0.007596806816 ...
-ind2 0.007996662028 0.008741860935 0.008489758779 ...
-...
-ind900 0.002311556029
-#+END_SRC
-
-where each row is one value shorter describing the right top half of
-K. This setup allows one to use a K with for exaple ind1 missing -
-just remove that row and column. The data will be stored as
-name.cXX.txt.lz4 (later add the alternative name.cxx.txt.gz).
diff --git a/doc/developers/profiling.md b/doc/developers/profiling.md
deleted file mode 100644
index 0d26453..0000000
--- a/doc/developers/profiling.md
+++ /dev/null
@@ -1,30 +0,0 @@
-# Profiling
-
-gperftools (formerly the Google profiler) is included in the .guix-dev
-startup script. Compile gemma for profiling:
-
-    make clean
-    make profile
-
-Run the profiler
-
-    env CPUPROFILE=/tmp/prof.out ./bin/gemma -g ./example/mouse_hs1940.geno.txt.gz -p ./example/mouse_hs1940.pheno.txt -gk -o mouse_hs1940
-    pprof ./bin/gemma /tmp/prof.out
-
-and `top` shows
-
-```
-Welcome to pprof! For help, type 'help'.
-(pprof) top
-Total: 720 samples
-     103  14.3%  14.3%      103  14.3% dgemm_kernel_ZEN
-      39   5.4%  19.7%       79  11.0% ____strtod_l_internal
-      37   5.1%  24.9%       53   7.4% __printf_fp_l
-      36   5.0%  29.9%       36   5.0% __sched_yield
-      34   4.7%  34.6%       34   4.7% __strlen_avx2
-      31   4.3%  38.9%       31   4.3% __strspn_sse42
-      26   3.6%  42.5%      116  16.1% ReadFile_geno
-      25   3.5%  46.0%       26   3.6% _int_malloc
-      23   3.2%  49.2%       23   3.2% gsl_vector_set
-      18   2.5%  51.7%       18   2.5% __strcspn_sse42
-```
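
Note on the kinship format proposed in the removed design.org: the layout may be easier to follow with a small reader. The sketch below is only an illustration, not GEMMA or faster-lmm-d code. It assumes that row i holds K[i][i], K[i][i+1], ... up to the last individual (the "right top half" including the diagonal), that fields are tab-separated, and that the file has already been decompressed. The names `read_grm` and `drop_individual` are hypothetical.

```python
import io


def read_grm(stream):
    """Parse the proposed GRMv1.0 text layout into (names, full symmetric K).

    Assumption: row i lists K[i][i..nind-1], so each row is one value shorter.
    """
    version = stream.readline().strip()
    if version != "# GRMv1.0":
        raise ValueError(f"unexpected version header: {version!r}")
    nind_line = stream.readline().strip()
    if not nind_line.startswith("# nind="):
        raise ValueError(f"unexpected nind header: {nind_line!r}")
    nind = int(nind_line.split("=", 1)[1])

    names = []
    k = [[0.0] * nind for _ in range(nind)]
    for i in range(nind):
        fields = stream.readline().rstrip("\n").split("\t")
        name, values = fields[0], fields[1:]
        if len(values) != nind - i:
            raise ValueError(f"row {name!r}: expected {nind - i} values, got {len(values)}")
        names.append(name)
        for offset, text in enumerate(values):
            j = i + offset
            # Mirror the upper triangle into the lower triangle.
            k[i][j] = k[j][i] = float(text)
    return names, k


def drop_individual(names, k, name):
    """Remove one individual's row and column, as the design text suggests."""
    idx = names.index(name)
    keep = [i for i in range(len(names)) if i != idx]
    return [names[i] for i in keep], [[k[i][j] for j in keep] for i in keep]


# Tiny made-up example with three individuals (values are not real data).
sample = "# GRMv1.0\n# nind=3\nind1\t0.5\t0.1\t0.2\nind2\t0.6\t0.3\nind3\t0.7\n"
names, k = read_grm(io.StringIO(sample))
names2, k2 = drop_individual(names, k, "ind1")
```

Because sample names sit in the first column, filtering by name amounts to dropping one row and the matching column, which is the property the design text highlights.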