Searching through a bulk of pdf files

Darkassassin07@lemmy.ca · 2 days ago

Searching through a bulk of pdf files

tofu@lemmy.nocturnal.garden · 2 days ago

The OCR thing is its own task, but for just searching for a string in PDFs, pdfgrep is very good.

pdfgrep -ri CoolNumber69 /path/to/folder

Darkassassin07@lemmy.ca · 2 days ago

That works magnificently. I added -l so it spits out a list of files instead of listing each matching line in each file, then set it up with an alias. Now I can ssh in from my phone and search the whole collection for any string with a single command.

Thanks again!
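
For anyone who wants to replicate that setup, here is a rough sketch of what it could look like. It uses a shell function rather than a plain alias so the folder can be baked in; the name pdfsearch and the PDF_DIR variable are made up for illustration, and /path/to/folder is just the placeholder from the command above.

```shell
# Hypothetical wrapper: "pdfsearch" and PDF_DIR are illustrative names, not from the thread.
# -r: recurse into subfolders, -i: ignore case, -l: print only the names of matching files.
PDF_DIR="${PDF_DIR:-/path/to/folder}"

pdfsearch() {
    pdfgrep -ril "$1" "$PDF_DIR"
}

# Usage (e.g. after putting this in ~/.bashrc): pdfsearch CoolNumber69
```

With that in place, an ssh session only needs the one short command plus the search string.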

tofu@lemmy.nocturnal.garden · 2 days ago

Glad to hear that!

Darkassassin07@lemmy.ca · 2 days ago

Interesting; that would be much simpler. I’ll give that a shot in the morning, thanks!

hoppolito@mander.xyz · 2 days ago

In case you are already using ripgrep (rg) instead of grep, there is also ripgrep-all (rga), which lets you search through a whole bunch of file types like PDFs quickly. And it’s cached, so while the first indexing takes a moment, any further search is lightning fast.

It supports a whole truckload of file types (pdf, odt, xlsx, tar.gz, mp4, and so on), but I mostly used it to quickly search through thousands of research papers. It takes around 5 minutes to index everything for my 4000 PDFs on the first run, then it’s smooth sailing for any further searches from there.
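
A minimal sketch of the equivalent search with rga, assuming it forwards the usual ripgrep flags (-i for case-insensitive, -l for listing only matching files). The function name papersearch and the PAPERS_DIR variable are hypothetical, and the path is a placeholder.

```shell
# Hypothetical wrapper: "papersearch" and PAPERS_DIR are illustrative names.
# rga recurses into directories by default, like rg, so no -r flag is needed.
PAPERS_DIR="${PAPERS_DIR:-/path/to/folder}"

papersearch() {
    rga -i -l "$1" "$PAPERS_DIR"
}

# Usage: papersearch CoolNumber69
```

The first call on a large folder is the slow one; repeated searches over the same files should hit the cache.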