Document management with Raspberry Pi

Using a Raspberry Pi 3 (Raspbian) and an Avision MiCube scanner (SD card version)

Software used, which you should be somewhat familiar with: udev, inotify-tools, ImageMagick, OCRmyPDF and recoll (with the recoll-webui).

Software Installation and Setup

apt-get install inotify-tools imagemagick ocrmypdf recoll python-recoll

adduser dms

mkdir -p /opt/scanner/incoming
mkdir /opt/scanner/raw
mkdir /opt/scanner/pdfs
mkdir /opt/scanner/ocr

chown -R dms /opt/scanner
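
The folder setup above can also be written as a small function that is safe to run more than once, which is handy when rebuilding the Pi. This is only a sketch: `create_scanner_dirs` is a made-up helper name, and the `pdfs` folder name follows what processor.sh further down writes to.

```shell
#!/bin/bash
# Sketch: create the scanner folder layout under a base directory.
# create_scanner_dirs is a hypothetical helper name, not part of any tool.
create_scanner_dirs() {
    local base="$1"
    local sub
    for sub in incoming raw pdfs ocr; do
        # -p creates missing parents and does not fail if the folder exists
        mkdir -p "$base/$sub" || return 1
    done
}

# Example (as root):
#   create_scanner_dirs /opt/scanner && chown -R dms /opt/scanner
```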

udev rule to detect scanner power-on

Add the line

ACTION=="add", KERNEL=="sd*[0-9]", ATTRS{serial}=="SERIAL_OF_SCANNER", RUN+="/usr/bin/su dms /opt/scripts/scanner.sh"

in a new file in /etc/udev/rules.d/ (e.g. scanner.rules; udev only reads files whose names end in .rules)

Execute

sudo udevadm control --reload-rules

to reload the rules and activate the new one.
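
To fill in SERIAL_OF_SCANNER, the attributes udev sees can be listed with `udevadm info --attribute-walk` while the scanner is switched on. The sketch below filters that output for serial attributes. The helper name `first_serial` and the device name /dev/sda1 in the usage comment are assumptions; the first serial listed is usually the scanner's own, but parent USB controllers report serials too, so check the full output if in doubt.

```shell
#!/bin/bash
# Sketch: print the first ATTRS{serial} value found in `udevadm info` output
# read from standard input. first_serial is a hypothetical helper name.
first_serial() {
    sed -n 's/^[[:space:]]*ATTRS{serial}=="\(.*\)"[[:space:]]*$/\1/p' | head -n 1
}

# Usage, assuming the SD card shows up as /dev/sda1:
#   udevadm info --attribute-walk --name=/dev/sda1 | first_serial
```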


Script to mount the scanner's SD card ‘scanner.sh’

The scanner I use has an SD card slot, so I can mount the SD card and copy the files to a local folder. With other scanners you will probably have to use SANE, perhaps in combination with scanbd.

    #! /bin/bash

    logger "Scanner is online!"

    sleep 0.25

    # Mount the scanner as folder
    mount /dev/disk/by-uuid/628F-0135 /opt/scanner/incoming

    logger "Scanner mounted on /opt/scanner/incoming"

    sleep 1

    # Start processing newly scanned files
    /opt/scripts/processor.sh &
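
The fixed `sleep 0.25` assumes the device node is ready by then. A slightly more robust variant polls for the path, with an upper limit on the number of checks, before mounting. This is a sketch only: `wait_for_path` and its arguments are hypothetical names, not part of the original script.

```shell
#!/bin/bash
# Sketch: wait until a path exists, checking every 0.25 s.
# wait_for_path is a hypothetical helper; $2 is the maximum number of checks.
wait_for_path() {
    local path="$1"
    local tries="${2:-20}"
    local i=0
    while [ "$i" -lt "$tries" ]; do
        [ -e "$path" ] && return 0
        sleep 0.25
        i=$((i + 1))
    done
    return 1
}

# Usage in scanner.sh, with the UUID from the script above:
#   wait_for_path /dev/disk/by-uuid/628F-0135 || exit 1
```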

Process scanned files ‘processor.sh’

    #! /bin/bash

    # Temporary folder to separate files in one scan session
    FOLDER=$(mktemp -d)

    # List relevant files and move to tmp folder
    FILES=$(ls -1 /opt/scanner/incoming)
    mv /opt/scanner/incoming/* "$FOLDER"
    logger "Process files $FILES"

    # Create a unique filename
    DATE=$(date +%Y%m%d%H%M%S)
    PDFNAME="$DATE.pdf"
    logger "Save as $PDFNAME"

    # Convert JPG(s) to PDF
    convert "$FOLDER"/* "/opt/scanner/pdfs/$PDFNAME"
    logger "PDF generation finished"

    # Move raw files to keep them as "originals"
    mkdir "/opt/scanner/raw/$DATE"
    for f in $FILES
    do
        mv "$FOLDER/$f" "/opt/scanner/raw/$DATE"
    done
    rm -r "$FOLDER"

    # OCR the new PDF
    logger "Starting OCR"
    ocrmypdf -l deu "/opt/scanner/pdfs/$PDFNAME" "/opt/scanner/ocr/$DATE.pdf"
    logger "Finished Processing PDF"

    # Build or update the index
    logger "Update file index"
    # -c expects the configuration directory that contains recoll.conf
    recollindex -c /opt/conf
    logger "Index updated"
    logger "Finished for $DATE.pdf"

In my setup the destination of ocrmypdf is an NFS folder mounted in an ownCloud instance, so all files are backed up and accessible via my cloud.
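
The collection step in processor.sh parses `ls` output, which breaks on file names containing spaces, and it still runs when the card holds no new scans. A more defensive sketch follows. It assumes the scanner writes files ending in .jpg or .JPG (the script only mentions JPGs, so the exact extension is an assumption), and `collect_scans` is a hypothetical helper name.

```shell
#!/bin/bash
# Sketch: move scanned JPGs from a source folder into a work folder and
# print how many files were moved. collect_scans is a hypothetical helper.
collect_scans() {
    local src="$1"
    local dest="$2"
    local count=0
    local f
    for f in "$src"/*.jpg "$src"/*.JPG; do
        # An unmatched pattern stays literal, so skip anything that is not a file
        [ -f "$f" ] || continue
        mv -- "$f" "$dest"/ || return 1
        count=$((count + 1))
    done
    echo "$count"
}

# Usage in processor.sh:
#   FOLDER=$(mktemp -d)
#   moved=$(collect_scans /opt/scanner/incoming "$FOLDER")
#   [ "$moved" -gt 0 ] || { logger "No new scans"; rm -r "$FOLDER"; exit 0; }
```

Counting the moved files lets the script stop early instead of producing an empty PDF and triggering OCR and indexing for nothing.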


Use recoll to index the PDF/A files

Create a config file for recoll, ‘recoll.conf’, in the configuration directory used by the script above (/opt/conf), with the content

topdirs = /opt/scanner/ocr

I use the default config; for further config options take a look at /usr/share/recoll/examples.
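
Note that the `-c` option of recollindex names a configuration directory containing recoll.conf, not the file itself, so it helps to create that directory explicitly. A minimal sketch, in which `write_recoll_conf` is a hypothetical helper and /opt/conf is taken from the indexing call in processor.sh:

```shell
#!/bin/bash
# Sketch: write a minimal recoll.conf into a configuration directory.
# write_recoll_conf is a hypothetical helper, not shipped with recoll.
write_recoll_conf() {
    local confdir="$1"
    local topdir="$2"
    mkdir -p "$confdir" || return 1
    # topdirs lists the folders that recollindex should index
    printf 'topdirs = %s\n' "$topdir" > "$confdir/recoll.conf"
}

# Usage:
#   write_recoll_conf /opt/conf /opt/scanner/ocr
#   recollindex -c /opt/conf
```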

Setup and configure the recoll-webui

Get the current webui with curl

curl https://codeload.github.com/koniu/recoll-webui/zip/master > master.zip

To make a first test run, unzip master.zip and start

recoll-webui-master/webui-standalone.py -a 0.0.0.0 -p 11080

and browse to ‘http://IP.OF.THE.PI:11080’.

If there is already a recoll index you can perform a search query.

To have the webui served by nginx or Apache, take a look at the recoll-webui documentation.