Diffstat (limited to 'topics/xapian')
-rw-r--r-- | topics/xapian/xapian-index-building-scalability.svg | 293
-rw-r--r-- | topics/xapian/xapian-indexing-on-tux01.gmi | 17
-rw-r--r-- | topics/xapian/xapian-indexing.gmi | 19
-rw-r--r-- | topics/xapian/xapian-scalability.gmi | 17
-rw-r--r-- | topics/xapian/xapian-search-queries.gmi | 114
-rw-r--r-- | topics/xapian/xapian-search.gmi | 44
6 files changed, 504 insertions, 0 deletions
diff --git a/topics/xapian/xapian-index-building-scalability.svg b/topics/xapian/xapian-index-building-scalability.svg new file mode 100644 index 0000000..9d525d4 --- /dev/null +++ b/topics/xapian/xapian-index-building-scalability.svg @@ -0,0 +1,293 @@ +<?xml version="1.0" encoding="utf-8" standalone="no"?> +<svg + width="600" height="480" + viewBox="0 0 600 480" + xmlns="http://www.w3.org/2000/svg" + xmlns:xlink="http://www.w3.org/1999/xlink" +> + +<title>Gnuplot</title> +<desc>Produced by GNUPLOT 5.4 patchlevel 4 </desc> + +<g id="gnuplot_canvas"> + +<rect x="0" y="0" width="600" height="480" fill="none"/> +<defs> + + <circle id='gpDot' r='0.5' stroke-width='0.5' stroke='currentColor'/> + <path id='gpPt0' stroke-width='0.222' stroke='currentColor' d='M-1,0 h2 M0,-1 v2'/> + <path id='gpPt1' stroke-width='0.222' stroke='currentColor' d='M-1,-1 L1,1 M1,-1 L-1,1'/> + <path id='gpPt2' stroke-width='0.222' stroke='currentColor' d='M-1,0 L1,0 M0,-1 L0,1 M-1,-1 L1,1 M-1,1 L1,-1'/> + <rect id='gpPt3' stroke-width='0.222' stroke='currentColor' x='-1' y='-1' width='2' height='2'/> + <rect id='gpPt4' stroke-width='0.222' stroke='currentColor' fill='currentColor' x='-1' y='-1' width='2' height='2'/> + <circle id='gpPt5' stroke-width='0.222' stroke='currentColor' cx='0' cy='0' r='1'/> + <use xlink:href='#gpPt5' id='gpPt6' fill='currentColor' stroke='none'/> + <path id='gpPt7' stroke-width='0.222' stroke='currentColor' d='M0,-1.33 L-1.33,0.67 L1.33,0.67 z'/> + <use xlink:href='#gpPt7' id='gpPt8' fill='currentColor' stroke='none'/> + <use xlink:href='#gpPt7' id='gpPt9' stroke='currentColor' transform='rotate(180)'/> + <use xlink:href='#gpPt9' id='gpPt10' fill='currentColor' stroke='none'/> + <use xlink:href='#gpPt3' id='gpPt11' stroke='currentColor' transform='rotate(45)'/> + <use xlink:href='#gpPt11' id='gpPt12' fill='currentColor' stroke='none'/> + <path id='gpPt13' stroke-width='0.222' stroke='currentColor' d='M0,1.330 L1.265,0.411 L0.782,-1.067 L-0.782,-1.076 L-1.265,0.411 
z'/> + <use xlink:href='#gpPt13' id='gpPt14' fill='currentColor' stroke='none'/> + <filter id='textbox' filterUnits='objectBoundingBox' x='0' y='0' height='1' width='1'> + <feFlood flood-color='white' flood-opacity='1' result='bgnd'/> + <feComposite in='SourceGraphic' in2='bgnd' operator='atop'/> + </filter> + <filter id='greybox' filterUnits='objectBoundingBox' x='0' y='0' height='1' width='1'> + <feFlood flood-color='lightgrey' flood-opacity='1' result='grey'/> + <feComposite in='SourceGraphic' in2='grey' operator='atop'/> + </filter> +</defs> +<g fill="none" color="white" stroke="currentColor" stroke-width="1.00" stroke-linecap="butt" stroke-linejoin="miter"> +</g> +<g fill="none" color="black" stroke="currentColor" stroke-width="1.00" stroke-linecap="butt" stroke-linejoin="miter"> +</g> +<g fill="none" color="black" stroke="black" stroke-width="0.50" stroke-linecap="butt" stroke-linejoin="miter"> +</g> +<g fill="none" color="gray" stroke="currentColor" stroke-width="0.50" stroke-linecap="butt" stroke-linejoin="miter"> + <path stroke='gray' stroke-dasharray='2,4' class="gridline" d='M62.92,422.40 L574.82,422.40 '/></g> +<g fill="none" color="gray" stroke="gray" stroke-width="1.00" stroke-linecap="butt" stroke-linejoin="miter"> +</g> +<g fill="none" color="black" stroke="currentColor" stroke-width="1.00" stroke-linecap="butt" stroke-linejoin="miter"> + <path stroke='black' d='M62.92,422.40 L71.92,422.40 M574.82,422.40 L565.82,422.40 '/> <g transform="translate(54.53,426.30)" stroke="none" fill="black" font-family="Arial" font-size="12.00" text-anchor="end"> + <text><tspan font-family="Arial" > 0.44</tspan></text> + </g> +</g> +<g fill="none" color="black" stroke="currentColor" stroke-width="1.00" stroke-linecap="butt" stroke-linejoin="miter"> +</g> +<g fill="none" color="black" stroke="black" stroke-width="0.50" stroke-linecap="butt" stroke-linejoin="miter"> +</g> +<g fill="none" color="gray" stroke="currentColor" stroke-width="0.50" stroke-linecap="butt" 
stroke-linejoin="miter"> + <path stroke='gray' stroke-dasharray='2,4' class="gridline" d='M62.92,385.56 L574.82,385.56 '/></g> +<g fill="none" color="gray" stroke="gray" stroke-width="1.00" stroke-linecap="butt" stroke-linejoin="miter"> +</g> +<g fill="none" color="black" stroke="currentColor" stroke-width="1.00" stroke-linecap="butt" stroke-linejoin="miter"> + <path stroke='black' d='M62.92,385.56 L71.92,385.56 M574.82,385.56 L565.82,385.56 '/> <g transform="translate(54.53,389.46)" stroke="none" fill="black" font-family="Arial" font-size="12.00" text-anchor="end"> + <text><tspan font-family="Arial" > 0.45</tspan></text> + </g> +</g> +<g fill="none" color="black" stroke="currentColor" stroke-width="1.00" stroke-linecap="butt" stroke-linejoin="miter"> +</g> +<g fill="none" color="black" stroke="black" stroke-width="0.50" stroke-linecap="butt" stroke-linejoin="miter"> +</g> +<g fill="none" color="gray" stroke="currentColor" stroke-width="0.50" stroke-linecap="butt" stroke-linejoin="miter"> + <path stroke='gray' stroke-dasharray='2,4' class="gridline" d='M62.92,348.72 L574.82,348.72 '/></g> +<g fill="none" color="gray" stroke="gray" stroke-width="1.00" stroke-linecap="butt" stroke-linejoin="miter"> +</g> +<g fill="none" color="black" stroke="currentColor" stroke-width="1.00" stroke-linecap="butt" stroke-linejoin="miter"> + <path stroke='black' d='M62.92,348.72 L71.92,348.72 M574.82,348.72 L565.82,348.72 '/> <g transform="translate(54.53,352.62)" stroke="none" fill="black" font-family="Arial" font-size="12.00" text-anchor="end"> + <text><tspan font-family="Arial" > 0.46</tspan></text> + </g> +</g> +<g fill="none" color="black" stroke="currentColor" stroke-width="1.00" stroke-linecap="butt" stroke-linejoin="miter"> +</g> +<g fill="none" color="black" stroke="black" stroke-width="0.50" stroke-linecap="butt" stroke-linejoin="miter"> +</g> +<g fill="none" color="gray" stroke="currentColor" stroke-width="0.50" stroke-linecap="butt" stroke-linejoin="miter"> + <path 
stroke='gray' stroke-dasharray='2,4' class="gridline" d='M62.92,311.88 L574.82,311.88 '/></g> +<g fill="none" color="gray" stroke="gray" stroke-width="1.00" stroke-linecap="butt" stroke-linejoin="miter"> +</g> +<g fill="none" color="black" stroke="currentColor" stroke-width="1.00" stroke-linecap="butt" stroke-linejoin="miter"> + <path stroke='black' d='M62.92,311.88 L71.92,311.88 M574.82,311.88 L565.82,311.88 '/> <g transform="translate(54.53,315.78)" stroke="none" fill="black" font-family="Arial" font-size="12.00" text-anchor="end"> + <text><tspan font-family="Arial" > 0.47</tspan></text> + </g> +</g> +<g fill="none" color="black" stroke="currentColor" stroke-width="1.00" stroke-linecap="butt" stroke-linejoin="miter"> +</g> +<g fill="none" color="black" stroke="black" stroke-width="0.50" stroke-linecap="butt" stroke-linejoin="miter"> +</g> +<g fill="none" color="gray" stroke="currentColor" stroke-width="0.50" stroke-linecap="butt" stroke-linejoin="miter"> + <path stroke='gray' stroke-dasharray='2,4' class="gridline" d='M62.92,275.04 L574.82,275.04 '/></g> +<g fill="none" color="gray" stroke="gray" stroke-width="1.00" stroke-linecap="butt" stroke-linejoin="miter"> +</g> +<g fill="none" color="black" stroke="currentColor" stroke-width="1.00" stroke-linecap="butt" stroke-linejoin="miter"> + <path stroke='black' d='M62.92,275.04 L71.92,275.04 M574.82,275.04 L565.82,275.04 '/> <g transform="translate(54.53,278.94)" stroke="none" fill="black" font-family="Arial" font-size="12.00" text-anchor="end"> + <text><tspan font-family="Arial" > 0.48</tspan></text> + </g> +</g> +<g fill="none" color="black" stroke="currentColor" stroke-width="1.00" stroke-linecap="butt" stroke-linejoin="miter"> +</g> +<g fill="none" color="black" stroke="black" stroke-width="0.50" stroke-linecap="butt" stroke-linejoin="miter"> +</g> +<g fill="none" color="gray" stroke="currentColor" stroke-width="0.50" stroke-linecap="butt" stroke-linejoin="miter"> + <path stroke='gray' stroke-dasharray='2,4' 
class="gridline" d='M62.92,238.20 L574.82,238.20 '/></g> +<g fill="none" color="gray" stroke="gray" stroke-width="1.00" stroke-linecap="butt" stroke-linejoin="miter"> +</g> +<g fill="none" color="black" stroke="currentColor" stroke-width="1.00" stroke-linecap="butt" stroke-linejoin="miter"> + <path stroke='black' d='M62.92,238.20 L71.92,238.20 M574.82,238.20 L565.82,238.20 '/> <g transform="translate(54.53,242.10)" stroke="none" fill="black" font-family="Arial" font-size="12.00" text-anchor="end"> + <text><tspan font-family="Arial" > 0.49</tspan></text> + </g> +</g> +<g fill="none" color="black" stroke="currentColor" stroke-width="1.00" stroke-linecap="butt" stroke-linejoin="miter"> +</g> +<g fill="none" color="black" stroke="black" stroke-width="0.50" stroke-linecap="butt" stroke-linejoin="miter"> +</g> +<g fill="none" color="gray" stroke="currentColor" stroke-width="0.50" stroke-linecap="butt" stroke-linejoin="miter"> + <path stroke='gray' stroke-dasharray='2,4' class="gridline" d='M62.92,201.37 L574.82,201.37 '/></g> +<g fill="none" color="gray" stroke="gray" stroke-width="1.00" stroke-linecap="butt" stroke-linejoin="miter"> +</g> +<g fill="none" color="black" stroke="currentColor" stroke-width="1.00" stroke-linecap="butt" stroke-linejoin="miter"> + <path stroke='black' d='M62.92,201.37 L71.92,201.37 M574.82,201.37 L565.82,201.37 '/> <g transform="translate(54.53,205.27)" stroke="none" fill="black" font-family="Arial" font-size="12.00" text-anchor="end"> + <text><tspan font-family="Arial" > 0.5</tspan></text> + </g> +</g> +<g fill="none" color="black" stroke="currentColor" stroke-width="1.00" stroke-linecap="butt" stroke-linejoin="miter"> +</g> +<g fill="none" color="black" stroke="black" stroke-width="0.50" stroke-linecap="butt" stroke-linejoin="miter"> +</g> +<g fill="none" color="gray" stroke="currentColor" stroke-width="0.50" stroke-linecap="butt" stroke-linejoin="miter"> + <path stroke='gray' stroke-dasharray='2,4' class="gridline" d='M62.92,164.53 
L574.82,164.53 '/></g> +<g fill="none" color="gray" stroke="gray" stroke-width="1.00" stroke-linecap="butt" stroke-linejoin="miter"> +</g> +<g fill="none" color="black" stroke="currentColor" stroke-width="1.00" stroke-linecap="butt" stroke-linejoin="miter"> + <path stroke='black' d='M62.92,164.53 L71.92,164.53 M574.82,164.53 L565.82,164.53 '/> <g transform="translate(54.53,168.43)" stroke="none" fill="black" font-family="Arial" font-size="12.00" text-anchor="end"> + <text><tspan font-family="Arial" > 0.51</tspan></text> + </g> +</g> +<g fill="none" color="black" stroke="currentColor" stroke-width="1.00" stroke-linecap="butt" stroke-linejoin="miter"> +</g> +<g fill="none" color="black" stroke="black" stroke-width="0.50" stroke-linecap="butt" stroke-linejoin="miter"> +</g> +<g fill="none" color="gray" stroke="currentColor" stroke-width="0.50" stroke-linecap="butt" stroke-linejoin="miter"> + <path stroke='gray' stroke-dasharray='2,4' class="gridline" d='M62.92,127.69 L574.82,127.69 '/></g> +<g fill="none" color="gray" stroke="gray" stroke-width="1.00" stroke-linecap="butt" stroke-linejoin="miter"> +</g> +<g fill="none" color="black" stroke="currentColor" stroke-width="1.00" stroke-linecap="butt" stroke-linejoin="miter"> + <path stroke='black' d='M62.92,127.69 L71.92,127.69 M574.82,127.69 L565.82,127.69 '/> <g transform="translate(54.53,131.59)" stroke="none" fill="black" font-family="Arial" font-size="12.00" text-anchor="end"> + <text><tspan font-family="Arial" > 0.52</tspan></text> + </g> +</g> +<g fill="none" color="black" stroke="currentColor" stroke-width="1.00" stroke-linecap="butt" stroke-linejoin="miter"> +</g> +<g fill="none" color="black" stroke="black" stroke-width="0.50" stroke-linecap="butt" stroke-linejoin="miter"> +</g> +<g fill="none" color="gray" stroke="currentColor" stroke-width="0.50" stroke-linecap="butt" stroke-linejoin="miter"> + <path stroke='gray' stroke-dasharray='2,4' class="gridline" d='M62.92,90.85 L574.82,90.85 '/></g> +<g fill="none" 
color="gray" stroke="gray" stroke-width="1.00" stroke-linecap="butt" stroke-linejoin="miter"> +</g> +<g fill="none" color="black" stroke="currentColor" stroke-width="1.00" stroke-linecap="butt" stroke-linejoin="miter"> + <path stroke='black' d='M62.92,90.85 L71.92,90.85 M574.82,90.85 L565.82,90.85 '/> <g transform="translate(54.53,94.75)" stroke="none" fill="black" font-family="Arial" font-size="12.00" text-anchor="end"> + <text><tspan font-family="Arial" > 0.53</tspan></text> + </g> +</g> +<g fill="none" color="black" stroke="currentColor" stroke-width="1.00" stroke-linecap="butt" stroke-linejoin="miter"> +</g> +<g fill="none" color="black" stroke="black" stroke-width="0.50" stroke-linecap="butt" stroke-linejoin="miter"> +</g> +<g fill="none" color="gray" stroke="currentColor" stroke-width="0.50" stroke-linecap="butt" stroke-linejoin="miter"> + <path stroke='gray' stroke-dasharray='2,4' class="gridline" d='M62.92,54.01 L574.82,54.01 '/></g> +<g fill="none" color="gray" stroke="gray" stroke-width="1.00" stroke-linecap="butt" stroke-linejoin="miter"> +</g> +<g fill="none" color="black" stroke="currentColor" stroke-width="1.00" stroke-linecap="butt" stroke-linejoin="miter"> + <path stroke='black' d='M62.92,54.01 L71.92,54.01 M574.82,54.01 L565.82,54.01 '/> <g transform="translate(54.53,57.91)" stroke="none" fill="black" font-family="Arial" font-size="12.00" text-anchor="end"> + <text><tspan font-family="Arial" > 0.54</tspan></text> + </g> +</g> +<g fill="none" color="black" stroke="currentColor" stroke-width="1.00" stroke-linecap="butt" stroke-linejoin="miter"> +</g> +<g fill="none" color="black" stroke="black" stroke-width="0.50" stroke-linecap="butt" stroke-linejoin="miter"> +</g> +<g fill="none" color="gray" stroke="currentColor" stroke-width="0.50" stroke-linecap="butt" stroke-linejoin="miter"> + <path stroke='gray' stroke-dasharray='2,4' class="gridline" d='M62.92,422.40 L62.92,54.01 '/></g> +<g fill="none" color="gray" stroke="gray" stroke-width="1.00" 
stroke-linecap="butt" stroke-linejoin="miter"> +</g> +<g fill="none" color="black" stroke="currentColor" stroke-width="1.00" stroke-linecap="butt" stroke-linejoin="miter"> + <path stroke='black' d='M62.92,422.40 L62.92,413.40 M62.92,54.01 L62.92,63.01 '/> <g transform="translate(62.92,444.30)" stroke="none" fill="black" font-family="Arial" font-size="12.00" text-anchor="middle"> + <text><tspan font-family="Arial" >10k</tspan></text> + </g> +</g> +<g fill="none" color="black" stroke="currentColor" stroke-width="1.00" stroke-linecap="butt" stroke-linejoin="miter"> + <path stroke='black' d='M114.29,422.40 L114.29,417.90 M114.29,54.01 L114.29,58.51 M144.33,422.40 L144.33,417.90 M144.33,54.01 L144.33,58.51 + M165.65,422.40 L165.65,417.90 M165.65,54.01 L165.65,58.51 M182.19,422.40 L182.19,417.90 M182.19,54.01 L182.19,58.51 + M195.70,422.40 L195.70,417.90 M195.70,54.01 L195.70,58.51 M207.12,422.40 L207.12,417.90 M207.12,54.01 L207.12,58.51 + M217.02,422.40 L217.02,417.90 M217.02,54.01 L217.02,58.51 M225.75,422.40 L225.75,417.90 M225.75,54.01 L225.75,58.51 + '/></g> +<g fill="none" color="black" stroke="black" stroke-width="0.50" stroke-linecap="butt" stroke-linejoin="miter"> +</g> +<g fill="none" color="gray" stroke="currentColor" stroke-width="0.50" stroke-linecap="butt" stroke-linejoin="miter"> + <path stroke='gray' stroke-dasharray='2,4' class="gridline" d='M233.55,422.40 L233.55,54.01 '/></g> +<g fill="none" color="gray" stroke="gray" stroke-width="1.00" stroke-linecap="butt" stroke-linejoin="miter"> +</g> +<g fill="none" color="black" stroke="currentColor" stroke-width="1.00" stroke-linecap="butt" stroke-linejoin="miter"> + <path stroke='black' d='M233.55,422.40 L233.55,413.40 M233.55,54.01 L233.55,63.01 '/> <g transform="translate(233.55,444.30)" stroke="none" fill="black" font-family="Arial" font-size="12.00" text-anchor="middle"> + <text><tspan font-family="Arial" >100k</tspan></text> + </g> +</g> +<g fill="none" color="black" stroke="currentColor" 
stroke-width="1.00" stroke-linecap="butt" stroke-linejoin="miter"> + <path stroke='black' d='M284.92,422.40 L284.92,417.90 M284.92,54.01 L284.92,58.51 M314.97,422.40 L314.97,417.90 M314.97,54.01 L314.97,58.51 + M336.28,422.40 L336.28,417.90 M336.28,54.01 L336.28,58.51 M352.82,422.40 L352.82,417.90 M352.82,54.01 L352.82,58.51 + M366.33,422.40 L366.33,417.90 M366.33,54.01 L366.33,58.51 M377.76,422.40 L377.76,417.90 M377.76,54.01 L377.76,58.51 + M387.65,422.40 L387.65,417.90 M387.65,54.01 L387.65,58.51 M396.38,422.40 L396.38,417.90 M396.38,54.01 L396.38,58.51 + '/></g> +<g fill="none" color="black" stroke="black" stroke-width="0.50" stroke-linecap="butt" stroke-linejoin="miter"> +</g> +<g fill="none" color="gray" stroke="currentColor" stroke-width="0.50" stroke-linecap="butt" stroke-linejoin="miter"> + <path stroke='gray' stroke-dasharray='2,4' class="gridline" d='M404.19,422.40 L404.19,54.01 '/></g> +<g fill="none" color="gray" stroke="gray" stroke-width="1.00" stroke-linecap="butt" stroke-linejoin="miter"> +</g> +<g fill="none" color="black" stroke="currentColor" stroke-width="1.00" stroke-linecap="butt" stroke-linejoin="miter"> + <path stroke='black' d='M404.19,422.40 L404.19,413.40 M404.19,54.01 L404.19,63.01 '/> <g transform="translate(404.19,444.30)" stroke="none" fill="black" font-family="Arial" font-size="12.00" text-anchor="middle"> + <text><tspan font-family="Arial" >1M</tspan></text> + </g> +</g> +<g fill="none" color="black" stroke="currentColor" stroke-width="1.00" stroke-linecap="butt" stroke-linejoin="miter"> + <path stroke='black' d='M455.55,422.40 L455.55,417.90 M455.55,54.01 L455.55,58.51 M485.60,422.40 L485.60,417.90 M485.60,54.01 L485.60,58.51 + M506.92,422.40 L506.92,417.90 M506.92,54.01 L506.92,58.51 M523.45,422.40 L523.45,417.90 M523.45,54.01 L523.45,58.51 + M536.97,422.40 L536.97,417.90 M536.97,54.01 L536.97,58.51 M548.39,422.40 L548.39,417.90 M548.39,54.01 L548.39,58.51 + M558.28,422.40 L558.28,417.90 M558.28,54.01 L558.28,58.51 M567.01,422.40 
L567.01,417.90 M567.01,54.01 L567.01,58.51 + '/></g> +<g fill="none" color="black" stroke="black" stroke-width="0.50" stroke-linecap="butt" stroke-linejoin="miter"> +</g> +<g fill="none" color="gray" stroke="currentColor" stroke-width="0.50" stroke-linecap="butt" stroke-linejoin="miter"> + <path stroke='gray' stroke-dasharray='2,4' class="gridline" d='M574.82,422.40 L574.82,54.01 '/></g> +<g fill="none" color="gray" stroke="gray" stroke-width="1.00" stroke-linecap="butt" stroke-linejoin="miter"> +</g> +<g fill="none" color="black" stroke="currentColor" stroke-width="1.00" stroke-linecap="butt" stroke-linejoin="miter"> + <path stroke='black' d='M574.82,422.40 L574.82,413.40 M574.82,54.01 L574.82,63.01 '/> <g transform="translate(574.82,444.30)" stroke="none" fill="black" font-family="Arial" font-size="12.00" text-anchor="middle"> + <text><tspan font-family="Arial" >10M</tspan></text> + </g> +</g> +<g fill="none" color="black" stroke="currentColor" stroke-width="1.00" stroke-linecap="butt" stroke-linejoin="miter"> +</g> +<g fill="none" color="black" stroke="currentColor" stroke-width="1.00" stroke-linecap="butt" stroke-linejoin="miter"> + <path stroke='black' d='M62.92,54.01 L62.92,422.40 L574.82,422.40 L574.82,54.01 L62.92,54.01 Z '/></g> +<g fill="none" color="black" stroke="currentColor" stroke-width="1.00" stroke-linecap="butt" stroke-linejoin="miter"> + <g transform="translate(318.87,471.30)" stroke="none" fill="black" font-family="Arial" font-size="12.00" text-anchor="middle"> + <text><tspan font-family="Arial" >Xapian index size (in number of documents)</tspan></text> + </g> +</g> +<g fill="none" color="black" stroke="currentColor" stroke-width="1.00" stroke-linecap="butt" stroke-linejoin="miter"> +</g> + <g id="gnuplot_plot_1" ><title>'xapian-times' using 1:(1000*$2/$1)</title> +<g fill="none" color="white" stroke="currentColor" stroke-width="2.00" stroke-linecap="butt" stroke-linejoin="miter"> +</g> +<g fill="none" color="black" stroke="currentColor" 
stroke-width="2.00" stroke-linecap="butt" stroke-linejoin="miter"> + <path stroke='black' d='M62.92,403.76 L114.29,297.84 L165.65,164.44 L217.02,122.94 L268.38,163.79 L319.75,202.70 L371.11,208.05 L422.48,168.36 + L473.85,89.51 '/> <use xlink:href='#gpPt0' transform='translate(62.92,403.76) scale(9.00)' color='black'/> + <use xlink:href='#gpPt0' transform='translate(114.29,297.84) scale(9.00)' color='black'/> + <use xlink:href='#gpPt0' transform='translate(165.65,164.44) scale(9.00)' color='black'/> + <use xlink:href='#gpPt0' transform='translate(217.02,122.94) scale(9.00)' color='black'/> + <use xlink:href='#gpPt0' transform='translate(268.38,163.79) scale(9.00)' color='black'/> + <use xlink:href='#gpPt0' transform='translate(319.75,202.70) scale(9.00)' color='black'/> + <use xlink:href='#gpPt0' transform='translate(371.11,208.05) scale(9.00)' color='black'/> + <use xlink:href='#gpPt0' transform='translate(422.48,168.36) scale(9.00)' color='black'/> + <use xlink:href='#gpPt0' transform='translate(473.85,89.51) scale(9.00)' color='black'/> +</g> + </g> +<g fill="none" color="black" stroke="currentColor" stroke-width="2.00" stroke-linecap="butt" stroke-linejoin="miter"> +</g> +<g fill="none" color="black" stroke="black" stroke-width="1.00" stroke-linecap="butt" stroke-linejoin="miter"> +</g> +<g fill="none" color="black" stroke="currentColor" stroke-width="1.00" stroke-linecap="butt" stroke-linejoin="miter"> + <path stroke='black' d='M62.92,54.01 L62.92,422.40 L574.82,422.40 L574.82,54.01 L62.92,54.01 Z '/></g> +<g fill="none" color="black" stroke="currentColor" stroke-width="1.00" stroke-linecap="butt" stroke-linejoin="miter"> + <g transform="translate(318.87,30.91)" stroke="none" fill="black" font-family="Arial" font-size="12.00" text-anchor="middle"> + <text><tspan font-family="Arial" >Time (in ms) per document to build a Xapian index of various sizes</tspan></text> + </g> +</g> +<g fill="none" color="black" stroke="currentColor" stroke-width="1.00" 
stroke-linecap="butt" stroke-linejoin="miter">
+</g>
+</g>
+</svg>
+
diff --git a/topics/xapian/xapian-indexing-on-tux01.gmi b/topics/xapian/xapian-indexing-on-tux01.gmi
new file mode 100644
index 0000000..71a4843
--- /dev/null
+++ b/topics/xapian/xapian-indexing-on-tux01.gmi
@@ -0,0 +1,17 @@
+How to run the Xapian indexing script
+
+Change to Arun's user:
+su aruni
+
+Remove the old xapian directory and its contents:
+sudo -u aruni rm -r /export/data/genenetwork/xapian
+
+Run guix shell:
+/home/zas1024/opt/guix/bin/guix shell --container --network --share=/export/data/genenetwork/ --development --file=guix.scm
+
+Run the indexer script:
+env PYTHONPATH=. python3 scripts/index-genenetwork /export/data/genenetwork/xapian mysql://webqtlout:webqtlout@127.0.0.1:3306/db_webqtl
+
+Or put the following in a shell script and run it as Arun:
+
+rm -r /export/data/genenetwork/xapian && /home/zas1024/opt/guix/bin/guix shell --container --network --share=/export/data/genenetwork/ --development --file=guix.scm -- env PYTHONPATH=. python3 scripts/index-genenetwork /export/data/genenetwork/xapian mysql://webqtlout:webqtlout@127.0.0.1:3306/db_webqtl
diff --git a/topics/xapian/xapian-indexing.gmi b/topics/xapian/xapian-indexing.gmi
new file mode 100644
index 0000000..1c82018
--- /dev/null
+++ b/topics/xapian/xapian-indexing.gmi
@@ -0,0 +1,19 @@
+# Xapian indexing
+
+Due to the enormous size of the GeneNetwork database, indexing it in a reasonable amount of time is a tricky process that calls for careful identification and optimization of the performance bottlenecks. This document describes how we achieve it.
+
+Indexing happens in the following three phases.
+
+* Phase 1: retrieve data from SQL
+* Phase 2: index text
+* Phase 3: write Xapian index to disk
+
+Phases 1 and 3 (that is, the retrieval of data from SQL and writing of the Xapian index to disk) are I/O bound processes. Phase 2 (the actual indexing of text) is CPU bound.
So, we parallelize phase 2 while keeping phases 1 and 3 sequential.
+
+There is a long delay in retrieving data from SQL and loading it into memory. During this time, the CPU idles while waiting on I/O. To avoid this, we retrieve SQL data chunk by chunk and spawn off phase 2 worker processes. Thus, we interleave phases 1 and 2 so that they don't block each other. Despite this, on tux02, the indexing script is only able to keep around 10 of the 128 CPUs busy. Phase 1 hands out jobs to phase 2 worker processes so slowly that the earliest worker processes finish and exit before it has handed out jobs to all 128 CPUs. The only way to avoid this and improve CPU utilization would be to further optimize the I/O of phase 1.
+
+Building a single large Xapian index is not scalable. See the detailed report on Xapian scalability.
+=> xapian-scalability
+So, we let each process of phase 2 build its own separate Xapian index. Finally, we compact and combine them into one large index. When writing smaller indexes in parallel, we take care to lock access to the disk so that only one process is writing to the disk at any given time. If many processes try to write to the disk simultaneously, write speed drops, often considerably, due to I/O contention.
+
+It is important to note that the performance bottlenecks identified in this document are machine-specific. For example, on my laptop with only 2 cores, CPU performance in phase 2 is the bottleneck. Phase 1 I/O waits on the CPU to finish instead of the other way around.
diff --git a/topics/xapian/xapian-scalability.gmi b/topics/xapian/xapian-scalability.gmi
new file mode 100644
index 0000000..cfc4e46
--- /dev/null
+++ b/topics/xapian/xapian-scalability.gmi
@@ -0,0 +1,17 @@
+# Xapian scalability
+
+As the index grows larger, Xapian takes longer to insert new documents. Shown below is the time (in seconds) taken to build indices of various sizes (in number of documents).
+
+* 10k: 4.45
+* 20k: 9.48
+* 40k: 20.40
+* 80k: 41.70
+* 160k: 81.63
+* 320k: 159.88
+* 640k: 318.84
+* 1280k: 651.47
+* 2560k: 1357.73
+
+Notice that building the 2560k index takes 305x, not 256x, as long as building the 10k index. Per document, the 10k index takes on average 0.45 ms while the 2560k index takes on average 0.53 ms. We show this graphically below.
+
+=> xapian-index-building-scalability.svg
diff --git a/topics/xapian/xapian-search-queries.gmi b/topics/xapian/xapian-search-queries.gmi
new file mode 100644
index 0000000..74966f7
--- /dev/null
+++ b/topics/xapian/xapian-search-queries.gmi
@@ -0,0 +1,114 @@
+# Xapian search queries
+
+This page documents search queries as understood by our xapian search engine (aka "the global search").
+
+General xapian search query syntax is documented on the xapian website.
+=> https://getting-started-with-xapian.readthedocs.io/en/latest/concepts/search/queryparser.html
+The specifics of GeneNetwork's use of xapian differ slightly in the choice of prefixes and special syntax such as the synteny search. The examples below may help to illustrate them.
+
+## Free text search
+
+Search for the term "cytochrome" in the free text.
+```
+cytochrome
+```
+
+Search for the term "cytochrome" and the term "P450" in the free text. Only results that have both are shown.
+```
+cytochrome AND P450
+```
+
+Search for occurrences of the term "cytochrome" near the term "P450" in the free text.
+```
+cytochrome NEAR P450
+```
+
+Search for the term "cytochrome" in the free text but exclude results that have the term "P450".
+```
+cytochrome -P450
+cytochrome NOT P450
+```
+
+## Boolean filtering
+
+Search for results pertaining to the human species.
+```
+species:human
+```
+
+Search for results pertaining to the BXD group.
+```
+group:BXD
+```
+
+Search for results pertaining to chromosome 11.
+```
+chr:11
+```
+
+Search for results pertaining to the BXD group and chromosome 11.
+```
+group:BXD AND chr:11
+```
+
+## Boolean filtering using numerical ranges
+
+Search for results with mean between 5 and 7.
+```
+mean:5..7
+```
+
+Search for results with mean less than 5.
+```
+mean:..5
+```
+
+Search for results with mean greater than 7.
+```
+mean:7..
+```
+
+## Synteny search
+
+Search for results near (+/- 50 kbases) base 9930021 of chromosome 4 of the human species and syntenic locations in other species.
+```
+Hs:chr4:9930021
+```
+
+Search for results near (+/- 50 kbases) base 9930021 of chromosome 4 of the human species and syntenic locations in mouse alone.
+```
+Hs:chr4:9930021 species:mouse
+```
+
+Search for results between base 9130000 and 9980000 of chromosome 4 of the human species and syntenic locations in mouse alone.
+```
+Hs:chr4:9130000..9980000 species:mouse
+```
+
+Alternatively, this same query may be expressed using kilo or mega suffixes.
+```
+Hs:chr4:9130k..9980k species:mouse
+Hs:chr4:9.13M..9.98M species:mouse
+```
+
+## Gotchas
+
+### Pure `NOT` queries are not supported
+
+Due to
+=> https://xapian.org/docs/apidoc/html/classXapian_1_1QueryParser.html#ae96a58a8de9d219ca3214a5a66e0407eacafc7c8cf7c90adac0fc07d02125aed0 performance reasons,
+pure `NOT` queries are not supported.
+
+A search such as:
+
+```
+NOT author:hager
+```
+
+will fail.
+
+You will need to add something to the query to prevent the error, e.g.
+
+```
+species:mouse NOT author:hager
+```
diff --git a/topics/xapian/xapian-search.gmi b/topics/xapian/xapian-search.gmi
new file mode 100644
index 0000000..93c766d
--- /dev/null
+++ b/topics/xapian/xapian-search.gmi
@@ -0,0 +1,44 @@
+# Xapian search
+
+Our main search engine (sometimes called the "global search" for historical reasons) is powered by Xapian, the excellent lightweight search engine library. This document aims to describe the architecture of the search code.
+
+The search engine consists of two separate parts---the indexer and the search query responder.
In xapian parlance (or rather, information retrieval parlance), each possible search result is called a "document". Each document is associated with an unordered set of "terms". The indexer builds an index mapping terms to documents. When a user submits a search query, the search query is decomposed into a set of terms and these terms are looked up in the index. "Terms" are often merely the words that constitute a document or search query. But these words are normalized to remove verb conjugations, plural forms of nouns, etc. For example, "using" is normalized to "use", "looked" is normalized to "look", "books" is normalized to "book", etc. This process is called stemming. Thanks to stemming and the trickery of statistics, the xapian search engine can pretend to a crude understanding of natural language.
+
+## Prefixed terms
+
+Xapian supports searching not just the free text of a document, but also text in specific fields, say description, author, abstract, etc. using prefixes such as description:, author: and abstract:. This is done using what are called "prefixed terms". While a regular term "foo" may be indexed as "foo", when it is indexed for the author field, it may be indexed as "Afoo". Here, the prefix "A" indicates that this term is for the author field. Note that the prefix "A" is an arbitrary choice. It does not matter what prefix you choose as long as the query parser also knows to convert the author: field label to an "A" prefix. Nevertheless, there are recommended conventions and you are encouraged to use multi-letter prefixes that start with X (such as XA, XB, XBC, etc.) for non-standard prefixes.
+
+=> https://xapian.org/docs/omega/termprefixes.html Recommended term prefix conventions
+
+## Boolean terms
+
+Usually, terms are matched to documents "fuzzily" with each term contributing to the relevance score of a document.
Thus, you may have documents that match the query very weakly but are nevertheless present in the search results, albeit towards the end. However, this behaviour is unacceptable for some fields. For queries such as species:mouse, we only want results that strictly match and not documents that approximately match it in some fuzzy way. This kind of boolean information retrieval is supported in xapian using "boolean terms". Just as with prefixed terms, the indexer and the query parser should agree on which terms and prefixes are boolean.
+
+A common pitfall is to support boolean search queries by switching the default query operator to AND. This disrupts the relevance scoring and converts xapian to a purely boolean information retrieval system (as opposed to a hybrid probabilistic + boolean information retrieval system), greatly reducing its utility.
+
+## Values and slots
+
+Some aspects of a document are numeric values or dates. They cannot be matched in the same way that terms are. Xapian supports these using a separate mechanism---slots and values. Xapian documents come with several slots each addressed by a number. These slots can contain arbitrary values (often numeric, but also dates and others). Just as with prefixed terms and boolean terms, the indexer and query parser should agree on the numeric slot addresses that numeric fields correspond to. Sorting of search results and range queries are also implemented using slots and values.
+
+## Position information
+
+In addition to terms, the xapian indexer captures position information to support phrase searches, the NEAR operator, etc. These features are unimportant for some fields. For such fields, we may tell xapian to index without capturing position information. This will help save on disk space used by the index.
+
+## Document data
+
+In addition to all the terms, position information, slots and values associated with each document, xapian also allows storing "document data" with each document.
This is an unstructured data field used to store data required to render search results. In GeneNetwork, we store a serialized JSON object as document data. It is a mistake to use slots and values to store data required for rendering. Slots and values come with performance overhead.
+
+## Manipulate queries only as AST objects, not strings
+
+A common pitfall is treating queries as strings and trying to extend the query parser by manipulating query strings using string manipulation functions. This leads to fragile code. Fragile code leads to fear of breaking things when editing code. Fear leads to anger. Anger leads to hate. Hate leads to suffering. Xapian instead exposes parsed query objects as ASTs and comes with an API to manipulate such ASTs. Extending the query parser is often relatively easy using the FieldProcessor API. Never use string operations.
+
+## The GeneNetwork xapian indexer
+
+The GeneNetwork xapian indexer lives as a script in the genenetwork3 repo.
+=> https://github.com/genenetwork/genenetwork3/blob/main/scripts/index-genenetwork
+It retrieves data using several SQL queries and indexes them to build the index. Due to the enormous size of the GeneNetwork database, this is quite an expensive operation and relies on various tricks to complete in reasonable time. These are described in a separate document.
+=> /topics/xapian-indexing
+
+The index is built once a day in a CI job.
+=> https://ci.genenetwork.org/jobs/genenetwork3-build-xapian-index
+The genenetwork3 web server process only reads the index and never mutates it. This means that the current index is a pure function of the current code and the current database. We do not have to worry about any additional state. State is evil.
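
As an illustration of the prefixed terms and boolean terms described in the Xapian search document above, here is a toy inverted index in plain Python. It is only a sketch under assumptions and does not use the Xapian API: the names PREFIXES, BOOLEAN_FIELDS, terms_for, build_index and search, the sample documents, and the "XS" prefix for species are all hypothetical, and stemming is left out. It shows why the indexer and the query parser have to share one prefix table, and how a boolean term filters results strictly while other terms only add to a relevance score.

```python
# Toy model of prefixed and boolean terms. NOT the Xapian API: every name
# below is hypothetical and exists only for illustration.
from collections import defaultdict

# Field label -> term prefix. The indexer and the query parser must share
# this table. "A" follows the author example; "XS" is an assumed prefix.
PREFIXES = {"text": "", "author": "A", "species": "XS"}
# Fields whose terms filter strictly instead of contributing to the score.
BOOLEAN_FIELDS = {"species"}


def terms_for(field, text):
    """Return the prefixed, lowercased terms for `text` in `field`."""
    return {PREFIXES[field] + word.lower() for word in text.split()}


def build_index(documents):
    """Map each term to the set of ids of the documents containing it."""
    index = defaultdict(set)
    for docid, fields in documents.items():
        for field, text in fields.items():
            for term in terms_for(field, text):
                index[term].add(docid)
    return index


def search(index, query):
    """Return matching document ids, best match first."""
    scored_terms, filter_terms = [], []
    for token in query.split():
        field, sep, value = token.partition(":")
        if not (sep and field in PREFIXES):
            field, value = "text", token  # no known field label: free text
        term = PREFIXES[field] + value.lower()
        (filter_terms if field in BOOLEAN_FIELDS else scored_terms).append(term)

    # Boolean terms: a document must match every one of them.
    allowed = None
    for term in filter_terms:
        docs = index.get(term, set())
        allowed = docs if allowed is None else allowed & docs

    # Other terms: each match adds one point to a crude relevance score.
    scores = defaultdict(int)
    if not scored_terms:
        for docid in allowed or set():
            scores[docid] = 0
    for term in scored_terms:
        for docid in index.get(term, set()):
            if allowed is None or docid in allowed:
                scores[docid] += 1
    return sorted(scores, key=lambda docid: (-scores[docid], docid))


DOCUMENTS = {
    1: {"text": "cytochrome P450 oxidase", "author": "hager", "species": "mouse"},
    2: {"text": "cytochrome b", "author": "smith", "species": "human"},
    3: {"text": "P450 reductase", "author": "hager", "species": "mouse"},
}
INDEX = build_index(DOCUMENTS)
print(search(INDEX, "cytochrome P450"))
print(search(INDEX, "cytochrome P450 species:mouse"))
```

The first query keeps weakly matching documents at the end of the results, while adding species:mouse drops every document that does not carry that exact term. Counting matching terms only stands in for relevance here; Xapian's own scoring is probabilistic.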