Recoll-Local-Search-Engine


Setting Up Recoll as a Local Full-Text Search Engine

Last updated: 2026-03-07 | Environment: Internal server | Access: [[1]]

Property            Value
Recoll version      1.43.13 + Xapian 1.4.22
Server              [your-internal-server]
Search URL          [2]
Companion app URL   [3] (unchanged)
Files indexed       /var/projects/data (same folder as companion app)
Index database      /home/file-search/recoll_index
WebUI install path  /opt/recoll-webui
Config file         /root/.recoll/recoll.conf
SSL cert            [your-ssl-cert-path]
SSL key             [your-ssl-key-path]
Systemd service     /etc/systemd/system/recoll-webui.service
Nginx config        /etc/nginx/sites-available/your-app (appended)
Cron job            /etc/cron.d/recoll-index
Index log           /var/log/recoll-index.log

If you've ever needed fast, reliable keyword search over a large collection of local files — PDFs, Word docs, spreadsheets, the works — Recoll is an excellent tool for the job. This guide walks through how we deployed it on an internal server alongside an existing application, giving us a clean search UI without disrupting anything that was already running.

Step 1 — Install Recoll

The version in Ubuntu's default repos tends to lag behind, so we add the official Recoll PPA to grab the latest release. We also install a handful of format helpers that let Recoll extract text from PDFs, DOCX files, RTF, and more:

# Add PPA for latest Recoll version
sudo add-apt-repository ppa:recoll-backports/recoll-1.15-on
sudo apt update

# Install Recoll command-line (no GUI needed on server)
sudo apt install recoll python3-recoll

# Install format helpers for PDF, DOC, RTF etc.
sudo apt install poppler-utils antiword unrtf python3-mutagen \
  libimage-exiftool-perl catdoc python3-docx

# Verify installation
recollindex --version
# Expected: Recoll 1.43.13 ...

Next, install the WebUI. One important note here — there are two repositories floating around. Use the framagit one:

# Install WebUI dependencies
sudo apt install git python3-waitress

# Clone the maintained fork (NOT the outdated koniu GitHub repo)
git clone https://framagit.org/medoc92/recollwebui.git /opt/recoll-webui


Step 2 — Configure to Use the Same Data Folder as Your Companion App

In our setup, Recoll indexes the same folder already used by another application. This keeps things simple — one source of truth, two ways to search it. Edit /root/.recoll/recoll.conf:

# /root/.recoll/recoll.conf

# ── INDEXING PATHS ──────────────────────────────────────

# Same folder that the companion app ingests from
topdirs = /var/projects/data

# Separate index DB --- does not interfere with the companion app's index
dbdir = /home/file-search/recoll_index

# Skip hidden/temp/junk files
skippedNames = .* *.tmp *.log *.bak ~* thumbs.db

skippedPaths = /var/projects/data/.trash /var/projects/data/thumbnails

# ── PERFORMANCE ─────────────────────────────────────────

# No disk-usage percentage limit
maxfsoccuppc = 0
# Memory cap for filter (helper) processes, in MB; this is not a file-size limit
filtermaxmbytes = 100
# Flush to disk every 40 MB
idxflushmb = 40
# Parallel indexing threads
nthreads = 8

# ── SEARCH QUALITY ──────────────────────────────────────

# Store case and diacritics for precise searches
indexStripChars = 0

# Terms in 80%+ of documents are auto-suppressed in phrase queries.
# This is Recoll's built-in stop word equivalent.
# It works WITHOUT breaking phrase searches like 'installation of nessus'
# (unlike traditional noindex stop word lists which would break this).
commontermspercent = 80

# Stemming: 'installing' also matches 'install', 'installation' etc.
indexstemminglanguages = english
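
To preview which file names the skippedNames patterns above would exclude, the matching can be approximated in Python. This is only a sketch: it assumes the patterns behave like case-sensitive shell wildcards, the sample file names are made up, and Recoll's own matcher may differ in edge cases.

```python
from fnmatch import fnmatchcase

# Patterns copied from the skippedNames line in recoll.conf above.
SKIPPED_NAMES = [".*", "*.tmp", "*.log", "*.bak", "~*", "thumbs.db"]


def is_skipped(filename, patterns=SKIPPED_NAMES):
    """Return True if the bare file name matches any skip pattern."""
    return any(fnmatchcase(filename, pattern) for pattern in patterns)


if __name__ == "__main__":
    # Hypothetical file names, for illustration only.
    samples = [".DS_Store", "report.pdf", "build.log",
               "~$draft.docx", "thumbs.db", "Thumbs.db"]
    for name in samples:
        print(f"{name:15} skipped={is_skipped(name)}")
```

Because the comparison in this sketch is case-sensitive, Thumbs.db (the capitalisation Windows normally uses) is not caught by the thumbs.db pattern. If the same holds on your system, adding Thumbs.db to skippedNames may be worthwhile.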

Create the index directory, validate the configuration, then run the first full index:

mkdir -p /home/file-search/recoll_index

# Validate config paths before indexing
HOME=/root recollindex -E

# First full index --- this takes time depending on file count
HOME=/root recollindex -z
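
As a complement to recollindex -E, the two path settings can also be checked by hand. The sketch below is a minimal illustration, not part of Recoll: parse_recoll_conf and check_paths are hypothetical helpers, and the parser ignores section headers, backslash continuations, and quoted paths containing spaces. It flags topdirs entries that do not exist and a dbdir placed inside an indexed folder.

```python
import os


def parse_recoll_conf(text):
    """Parse simple 'key = value' lines; comments and blank lines are skipped."""
    settings = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        settings[key.strip()] = value.strip()
    return settings


def check_paths(settings):
    """Return a list of human-readable problems found in topdirs and dbdir."""
    problems = []
    topdirs = settings.get("topdirs", "").split()
    dbdir = settings.get("dbdir", "")
    for top in topdirs:
        if not os.path.isdir(top):
            problems.append(f"topdirs entry does not exist: {top}")
    if dbdir:
        db_real = os.path.realpath(dbdir)
        for top in topdirs:
            top_real = os.path.realpath(top)
            # commonpath equals the topdir only when dbdir is at or below it.
            if os.path.commonpath([db_real, top_real]) == top_real:
                problems.append(f"dbdir {dbdir} lies inside topdirs entry {top}")
    return problems


if __name__ == "__main__":
    sample = """
    # values taken from the recoll.conf shown above
    topdirs = /var/projects/data
    dbdir = /home/file-search/recoll_index
    """
    for problem in check_paths(parse_recoll_conf(sample)) or ["no problems found"]:
        print(problem)
```

On a machine where /var/projects/data is missing, the demo reports that entry; on the real server it should print "no problems found".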


Step 3 — Run WebUI as a Systemd Service

The WebUI binds to localhost only. Nginx handles external access and SSL termination — more on that in the next step. Create /etc/systemd/system/recoll-webui.service:

[Unit]
Description=Recoll WebUI - Keyword Search
After=network.target

[Service]
Type=simple
User=root
WorkingDirectory=/opt/recoll-webui
ExecStart=/usr/bin/python3 /opt/recoll-webui/webui-standalone.py -a 127.0.0.1 -p 8180
Restart=on-failure
RestartSec=5
Environment=HOME=/root

[Install]
WantedBy=multi-user.target

Reload systemd, then enable and start the service:

systemctl daemon-reload
systemctl enable recoll-webui
systemctl start recoll-webui
systemctl status recoll-webui
# Should show: Active: active (running)
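
Beyond systemctl status, it can help to confirm that something is actually accepting connections on the port from the ExecStart line. The following is a small sketch using only the Python standard library; port_is_open is a hypothetical helper name, and a successful connection only shows that a listener exists, not that it is the WebUI.

```python
import socket


def port_is_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and unreachable hosts.
        return False


if __name__ == "__main__":
    # 127.0.0.1:8180 is the bind address from the service file above.
    print("WebUI reachable:", port_is_open("127.0.0.1", 8180))
```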

Step 4 — Configure Nginx for HTTPS on Port 8443

Port 8443 was previously used by a different app in our Nginx config. That server block was completely replaced with the Recoll block. The port 443 block for the companion app was left entirely untouched.

The Recoll block reuses the existing SSL certificate — no new cert needed. Add or replace in your Nginx sites config:

# ── PORT 443: Companion App (UNCHANGED) ─────────────────

server {
    listen 443 ssl;
    server_name your-server.local;

    ssl_certificate     [your-ssl-cert-path];
    ssl_certificate_key [your-ssl-key-path];
    ssl_protocols TLSv1.2 TLSv1.3;
    ssl_prefer_server_ciphers on;

    location / {
        proxy_pass http://127.0.0.1:8001;
        proxy_http_version 1.1;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto https;
    }
}

# ── PORT 8443: Recoll keyword search (REPLACES previous app) ─

server {
    listen 8443 ssl;
    server_name your-server.local;

    ssl_certificate     [your-ssl-cert-path];
    ssl_certificate_key [your-ssl-key-path];
    ssl_protocols TLSv1.2 TLSv1.3;
    ssl_prefer_server_ciphers on;

    # Static files served directly (faster, reduces Python load)
    location /static/ {
        alias /opt/recoll-webui/static/;
        expires 1d;
    }

    # Proxy everything else to recoll-webui on localhost
    location / {
        proxy_pass http://127.0.0.1:8180;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;
        proxy_read_timeout 120s;
    }
}

Test the configuration, then reload Nginx:

# Test nginx config --- must show 'syntax is ok' before reloading
nginx -t

# Reload nginx (zero downtime --- companion app stays up)
systemctl reload nginx

# Verify port is listening
ss -tlnp | grep 8443
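
A listening port does not prove the proxy chain works end to end, so a status-code check through Nginx can be a useful last step. The sketch below uses only the standard library; http_status is a hypothetical helper, and the verify_tls switch is an assumption for internal certificates that the client machine does not trust.

```python
import ssl
import urllib.error
import urllib.request


def http_status(url, verify_tls=True, timeout=10):
    """Return the HTTP status code for a GET request to url."""
    context = None
    if url.startswith("https://"):
        context = ssl.create_default_context()
        if not verify_tls:
            # Only for internal certificates the client does not trust.
            context.check_hostname = False
            context.verify_mode = ssl.CERT_NONE
    try:
        with urllib.request.urlopen(url, timeout=timeout, context=context) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # 4xx/5xx responses (for example 502 from nginx) arrive as exceptions.
        return err.code
```

Calling http_status("https://your-server.local:8443/", verify_tls=False) should return 200 when both Nginx and the WebUI are up; a 502 points at the recoll-webui service rather than at Nginx.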


Step 5 — WebUI Customisations

Out of the box, Recoll's WebUI is functional but rough around the edges for our setup. We made three targeted customisations.

5a — webui.py: Path Translation

The default WebUI shows raw index paths like file:///var/projects/data/... in results. We added a helper function to strip the server prefix and display clean, user-friendly relative paths instead.

Added to webui.py just before the routes section:

#{{{ cloud path helper
# Strip server prefix and prefix with a friendly label
# _SERVER_PREFIX must match the 'topdirs' root in recoll.conf

_SERVER_PREFIX = '/var/projects/data'
_LABEL = 'SharedFiles'

def _friendly_path(url):
    path = url[7:] if url.startswith('file://') else url
    if path.startswith(_SERVER_PREFIX):
        path = path[len(_SERVER_PREFIX):]
    return _LABEL + '/' + path.lstrip('/')
#}}}

And in the recoll_search() function, after the d['time'] line:

d['friendly_path'] = _friendly_path(d['url'])


5b — views/result.tpl: UI Changes

The default template shows Open/Download/Preview action buttons (not useful in our setup) and no easy path copy. Here's what we changed:

  • Removed Open, Download and Preview action links entirely
  • Added a full-width path bar below each result title showing the clean friendly path
  • Added a Copy button using document.execCommand('copy') — this works on plain HTTP, unlike navigator.clipboard which requires HTTPS
  • Copy logic lives in static/extra.js via a jQuery delegated event listener — this avoids inline onclick escaping issues with template-interpolated paths


5c — static/extra.js: Copy Button Handler

The Copy button handler uses jQuery's delegated event listener pattern, which avoids the inline onclick breakage that occurs when paths containing slashes, dots, or special characters get interpolated into HTML attributes by the template engine:

$(document).on('click', '.ocp-copy-btn', function() {
    var btn = $(this);
    var path = btn.prev('.ocp-path-bar').find('.ocp-path-text').text().trim();

    var ta = document.createElement('textarea');
    ta.value = path;
    ta.style.position = 'fixed'; ta.style.opacity = '0';
    document.body.appendChild(ta);
    ta.focus(); ta.select();
    document.execCommand('copy');
    document.body.removeChild(ta);

    btn.text('✓ Copied!');
    btn.css({'background':'#1a6640','color':'#7feba0','border-color':'#2d8a5a'});
    setTimeout(function() {
        btn.text('Copy');
        btn.css({'background':'#2a2a4a','color':'#888','border-color':'#3c3c5c'});
    }, 2000);
});

Step 6 — Cron Job for Automatic Indexing

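Create /etc/cron.d/recoll-index to keep the index up to date automatically. The nightly job passes -z, which erases the index and rebuilds it from scratch, so every file is re-processed and searches may return incomplete results until it finishes. The daytime jobs omit -z and run incrementally: recollindex walks the tree, re-indexes only new or changed files, and purges entries for files that no longer exist. Deleted files are therefore cleaned up without -z; the nightly rebuild is a belt-and-braces measure rather than a requirement.

# /etc/cron.d/recoll-index
# Recoll automatic indexing schedule

# Full rebuild at 2:00 AM daily
# -z: erase the index and rebuild it from scratch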
0 2 * * * root HOME=/root /usr/bin/recollindex -z >> /var/log/recoll-index.log 2>&1

# Quick incremental at 8am, 2pm, 8pm --- picks up files changed during working hours
# No -z: only processes new/changed files, very fast
0 8,14,20 * * * root HOME=/root /usr/bin/recollindex >> /var/log/recoll-index.log 2>&1

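# Trim log file every Sunday at 3am (keep last 1000 lines)
# Note: cron does not support backslash line continuation, so the entry stays on one line
0 3 * * 0 root tail -n 1000 /var/log/recoll-index.log > /var/log/recoll-index.log.tmp && mv /var/log/recoll-index.log.tmp /var/log/recoll-index.log

Set the file permissions, then run the indexer once by hand to confirm it works:

# Deploy cron job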
chmod 644 /etc/cron.d/recoll-index

# Test a manual run to confirm it works
HOME=/root recollindex >> /var/log/recoll-index.log 2>&1
tail -20 /var/log/recoll-index.log
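
The weekly log trim can also be expressed in Python, which may be handy if the same rotation is needed somewhere cron is not available. This is a sketch of the same keep-the-last-N-lines idea, not a replacement the original setup uses; trim_log is a hypothetical helper, and like the tail/mv pair it is best run while no indexer is writing to the log.

```python
import os
from collections import deque


def trim_log(path, keep=1000):
    """Rewrite path so that it holds only its last `keep` lines."""
    # Binary mode avoids decoding errors on odd bytes in the log.
    with open(path, "rb") as src:
        tail = deque(src, maxlen=keep)  # retains only the final `keep` lines
    tmp_path = path + ".tmp"
    with open(tmp_path, "wb") as dst:
        dst.writelines(tail)
    # Rename over the original; atomic when both are on the same filesystem.
    os.replace(tmp_path, path)
```

Calling trim_log("/var/log/recoll-index.log") mirrors the cron entry above; a file shorter than the limit is rewritten unchanged.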

Quick Reference

Task                      Command
Force full re-index       HOME=/root recollindex -z
Quick incremental update  HOME=/root recollindex
Validate config paths     HOME=/root recollindex -E
View index log            tail -f /var/log/recoll-index.log
Restart WebUI             systemctl restart recoll-webui
Check WebUI status        systemctl status recoll-webui
WebUI logs                journalctl -u recoll-webui -f
Test nginx config         nginx -t
Reload nginx              systemctl reload nginx

Troubleshooting

Symptom                                        Cause & Fix
500 NameError: name 'r' is not defined         Template uses 'r' but should use 'd'. Check views/result.tpl; you're likely using the wrong WebUI fork.
Path shows full server path                    _SERVER_PREFIX in webui.py doesn't match topdirs in recoll.conf.
Copy button does nothing                       Check extra.js has the .ocp-copy-btn handler. Check browser console for JS errors.
Path widget not showing                        Widget is inside the %if len(d['ipath']) > 0: block; move it outside.
"conflicting server name" nginx warning        Two server blocks share the same server_name and listen port. Remove the old app block.
Search returns no results after config change  Re-index with HOME=/root recollindex -z (new stop word settings require a full re-index).
Files not found in index                       Check topdirs in recoll.conf. Run HOME=/root recollindex -E to validate paths.
nginx 502 Bad Gateway                          recoll-webui service not running. Check: systemctl status recoll-webui