I have a pile of parts lists for tools I’m maintaining, all in PDF format, and I’m looking for a good way to take a part number, search through the collection of PDFs, and output which files contain that number. Essentially, that would let me match random unknown part numbers to a tool in our fleet.

I’m pretty sure the majority of them contain actual text you can select and copy+paste, so searching those shouldn’t be too difficult; but I do know there are at least a couple in there that are just a string of JPGs packed into a PDF file. Those will probably need OCR, but tbh I can probably live with skipping over them altogether.
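One way to find out which files are image-only is to check whether any text can be extracted from them at all. The sketch below assumes `pdftotext` from poppler-utils is installed; the function name `list_textless_pdfs` is made up for illustration. It prints every PDF under a folder whose extracted text contains no letters or digits, which are the likely candidates for OCR or for skipping.

```shell
# Print the PDFs under a folder that appear to have no text layer.
# Assumption: pdftotext (poppler-utils) is available on the PATH.
# The function name is hypothetical.
list_textless_pdfs() {
    find "${1:-.}" -type f -iname '*.pdf' | while IFS= read -r pdf; do
        # Dump the text to stdout; if it holds no letter or digit,
        # the file is probably just scanned images.
        if ! pdftotext "$pdf" - 2>/dev/null | grep -q '[[:alnum:]]'; then
            printf '%s\n' "$pdf"
        fi
    done
}
```

Running `list_textless_pdfs /path/to/folder` would then give a short list of the files a plain text search cannot see.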

I’ve been thinking of spinning up an instance of paperless-ngx and stuffing them all in there so it can index the contents, including via OCR, and then using its search feature; but that also seems a tad overkill.

I’m wondering if you fine folks have any better ideas. What do you think?

  • tofu@lemmy.nocturnal.garden
    2 days ago

    The OCR thing is its own task, but for just searching for a string in PDFs, pdfgrep is very good.

    pdfgrep -ri CoolNumber69 /path/to/folder

    • Darkassassin07@lemmy.caOP
      2 days ago

      That works magnificently. I added -l so it spits out a list of files instead of listing each matching line in each file, then set it up with an alias. Now I can ssh in from my phone and search the whole collection for any string with a single command.

      Thanks again!
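The setup described above could look roughly like the following. This is a sketch, not the poster’s actual alias: the name `partfind`, the `PARTS_DIR` variable, and the default folder are all assumptions, and a shell function is used instead of an alias so the part number can be passed as an argument. The `-F` flag is an addition here so that part numbers containing dots or other regex characters are matched literally.

```shell
# Hypothetical helper; the name and the default folder are assumptions.
PARTS_DIR="${PARTS_DIR:-$HOME/partlists}"

partfind() {
    if [ "$#" -lt 1 ]; then
        echo "usage: partfind PART_NUMBER [FOLDER]" >&2
        return 2
    fi
    # -r: recurse into subfolders
    # -i: ignore case
    # -l: print only the names of files that contain a match
    # -F: treat the part number as a literal string, not a regex
    pdfgrep -rilF -- "$1" "${2:-$PARTS_DIR}"
}
```

With that in a shell profile, `partfind CoolNumber69` would print one matching file per line.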

      • hoppolito@mander.xyz
        2 days ago

        In case you are already using ripgrep (rg) instead of grep, there is also ripgrep-all (rga) which lets you search through a whole bunch of files like PDFs quickly. And it’s cached, so while the first indexing takes a moment any further search is lightning fast.

        It supports a whole truckload of file types (pdf, odt, xlsx, tar.gz, mp4, and so on), but I have mostly used it to quickly search through thousands of research papers. It takes around 5 minutes to index everything for my 4000 PDFs on the first run, then it’s smooth sailing for any further searches from there.
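For the part-number use case, an rga version might look like the sketch below. It assumes rga forwards ripgrep’s usual `-i`, `-l`, and `-F` flags (worth confirming against `rga --help` on your install); the function name `partfind_rga` and the default folder are made up for illustration.

```shell
# Hypothetical helper; the name and the default folder are assumptions.
PARTS_DIR="${PARTS_DIR:-$HOME/partlists}"

partfind_rga() {
    # -i: ignore case
    # -l: print only the names of files that contain a match
    # -F: treat the part number as a literal string, not a regex
    # Assumption: rga passes these ripgrep flags through unchanged.
    rga -ilF "$1" "${2:-$PARTS_DIR}"
}
```

The first call would be the slow one while the cache is built; repeated searches over the same files should then come back much faster.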