- "That is, you shouldn’t expect it to work with scanned PDFs"
It's surprisingly easy to extend this type of workflow to scanned PDFs (as opposed to software-generated, text-containing ones). tesseract(1) makes short work of ToC pages with --psm set to 6 (a page-segmentation mode that tends to collapse convoluted text layouts into regular, software-parseable output).
It should also be straightforward (though I don't know of an out-of-the-box solution) to automate that example of extracting "text that looks like a header", based on page layout/relative positioning or font weight. (I'm working on an adjacent problem: automatic re-layout of raster documents to squeeze out whitespace and make them slightly nicer on small e-ink devices. Text islands are trivial to identify, but I don't know how to quantify font weight or things like that. I'm "wasting" a lot of time diving into mathematics rabbit holes without knowing in advance which ones will be productive.)
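For the ToC case, the post-processing after tesseract is mostly one regex, since --psm 6 tends to emit one "title, dot leader, page number" entry per line. A minimal sketch (the OCR output below is made up, and real scans will need fuzzier matching):

```python
import re

# Hypothetical tesseract --psm 6 output for a ToC page,
# e.g. from: tesseract toc.png - --psm 6
ocr_output = """\
1 Introduction ........... 1
2 Related Work .......... 15
3 Methods ............... 42
"""

# Each line collapses to "title <dot leader> page", so a single
# regex recovers (title, page) pairs.
entry = re.compile(r"^(.*?)\s*\.{2,}\s*(\d+)\s*$")

toc = [(m.group(1), int(m.group(2)))
       for line in ocr_output.splitlines()
       if (m := entry.match(line))]

print(toc)  # [('1 Introduction', 1), ('2 Related Work', 15), ('3 Methods', 42)]
```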
tesseract is fine for basic use cases, but it fails when the image is tilted (and thus the text isn't laid out horizontally), which happens often with scanned books. Given how well Google's OCR engine works, tesseract ought to be much better than it is.
I wonder how difficult it is to develop a better OCR engine than tesseract.
I think it's more typical to low-pass (i.e. blur) the image and then use a line-detection algorithm like the Hough transform. Properly deskewed text should have prominent horizontal white lines.
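To make that concrete, here's a toy sketch of the idea (a projection-profile variant of the Hough vote; the function names and the synthetic page are mine, and a real implementation would binarize and blur first):

```python
import numpy as np

def estimate_skew_deg(img, candidates=np.arange(-10, 10.5, 0.5)):
    """Minimal sketch of the Hough idea: for each candidate angle,
    project ink pixels onto the axis perpendicular to that angle and
    keep the angle whose projection profile is most sharply peaked
    (deskewed text lines stack their pixels into few bins)."""
    ys, xs = np.nonzero(img)
    best, best_score = 0.0, -np.inf
    for deg in candidates:
        t = np.deg2rad(deg)
        rho = ys * np.cos(t) - xs * np.sin(t)  # signed distance to a line at this angle
        counts, _ = np.histogram(rho, bins=img.shape[0])
        score = np.var(counts)                 # peaked profile -> high variance
        if score > best_score:
            best, best_score = deg, score
    return best

# Synthetic page: "text lines" every 10 px, tilted by 3 degrees.
h, w = 200, 200
yy, xx = np.mgrid[0:h, 0:w]
t = np.deg2rad(3)
img = (np.round(yy * np.cos(t) - xx * np.sin(t)) % 10 == 0).astype(np.uint8)
print(estimate_skew_deg(img))  # 3.0
```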
The Fourier transform maps plane waves to points. Blocks of regularly spaced text are roughly periodic, with a period equal to their line spacing; their Fourier transform (I think?) would have amplitude peaks in 2D frequency space along vectors at the same angle as the rotation of the lines.
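This is easy to sanity-check numerically. The sketch below builds a synthetic grating at a known tilt and reads the angle off the spectral peak (all parameters are made up; a real page would need windowing and peak filtering to cope with leakage and noise):

```python
import numpy as np

# A block of evenly spaced lines is close to a grating, and a grating's
# 2-D spectrum is a pair of peaks along the direction perpendicular to
# the lines, so the peak's angle from the vertical axis gives the skew.
n = 512
yy, xx = np.mgrid[0:n, 0:n]
tilt = np.deg2rad(4)          # assumed skew
period = 8                    # assumed line spacing in px
img = np.cos(2 * np.pi * (yy * np.cos(tilt) - xx * np.sin(tilt)) / period)

spec = np.abs(np.fft.fftshift(np.fft.fft2(img)))
spec[n // 2, n // 2] = 0      # drop the DC component
ky, kx = np.unravel_index(np.argmax(spec), spec.shape)
angle = np.degrees(np.arctan2(abs(kx - n // 2), abs(ky - n // 2)))
print(round(angle, 1))        # close to 4 degrees (limited by frequency-bin resolution)
```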
I've found EasyOCR to work much better at pulling text out of irregular or unknown images. Requires more resources than tesseract but gets much better results in my projects.
It doesn't seem significantly better than tesseract for non-mixed images, though, and it takes about five orders of magnitude longer to process a page on my machine; I can literally read a book 100 times faster than EasyOCR can process one on my Ryzen 7 2700.
I have thought about using tesseract to OCR the ToC and generate something like this, but there are just so many edge cases that make the whole process fail. For example, how do you handle a title that breaks across two lines? What if a page number is misrecognized, say 10 as 1o? What about the dot leaders? Maybe you can use GPT to clean the extracted text.
In the end, I found GPT-4's multimodal capability can recognize text + page number pairs well if I feed screenshots of the ToC into it, and I have settled on that.
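Before reaching for GPT, some of those edge cases also yield to a few heuristics. A rough sketch (the rules and the sample lines are made up, and real ToCs will break it in new ways):

```python
import re

def clean_toc_lines(lines):
    """Heuristic cleanup for OCR'd ToC text (a sketch, not exhaustive):
    - join a wrapped title with the following line carrying the page number
    - fix common digit confusions (o/O -> 0, l/I -> 1) in the page field
    - tolerate dot leaders of any length
    """
    entry = re.compile(r"^(.*?)[\s.]*\s([0-9oOlI]+)$")
    merged, buf = [], ""
    for line in lines:
        line = (buf + " " + line).strip() if buf else line.strip()
        m = entry.match(line)
        if m:
            title = m.group(1).rstrip(" .")
            page = m.group(2).translate(str.maketrans("oOlI", "0011"))
            merged.append((title, int(page)))
            buf = ""
        else:
            buf = line  # no page number yet: title wrapped onto the next line
    return merged

print(clean_toc_lines([
    "1.2 A Very Long Chapter Title That",
    "Wraps Onto Two Lines ........ 1o",
    "1.3 Short Title .............. 23",
]))
# [('1.2 A Very Long Chapter Title That Wraps Onto Two Lines', 10),
#  ('1.3 Short Title', 23)]
```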
Recently I found the getToc function in PyMuPDF was too slow. I told them about it on their Discord, and a day later they had fixed it; now it only takes a couple of milliseconds. I'm using it for my project pdftomp3. Pdf.tocgen looks useful too, but I'm not sure whether I can use it because of the license.
There does appear to be some licensing awkwardness here. The license is nominally GPLv3, but it says it is based on AGPLv3 projects. It also appears to misidentify (it may have been correct at the time) PyMuPDF as GPLv3 when that appears to actually be AGPLv3. My assumption is that using this would require complying with AGPLv3?
There's the additional oddity that a portion of the repository (the recipes directory) is licensed under CC-BY-NC-SA, and so the repository is not fully open source. This is particularly confusing, however, as the functional content of the recipes directory appears to be mostly records of direct observations of parameter choices in external documents and tools, and so doesn't seem like it would be copyrightable at all, at least in the US.
You can upload a PDF and convert the chapters into MP3s (either original text or simplified text). But for PDFs without a table of contents, you can only convert single pages.
I had been thinking about this, but for a while now I have settled on using GPT-4V's multimodal capability to generate a text file of titles and page numbers from screenshots of the ToC. After that, I use a pikepdf-based Python script to bake the ToC into the PDF.
The upside, compared to Krasjet's approach, is that this works not only for text-based PDFs but also for scanned PDFs, even old scanned journal papers.
The downside is that, before baking the ToC, you need to adjust the PDF, as sometimes the empty pages are not included. You also need to calculate the offset for the front matter (prologue, cover, etc.). I have a script for this kind of adjustment, but there is always some manual intervention involved.
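The offset bookkeeping itself is simple enough to script. A sketch under my own assumptions (the input format and function name are made up; the pikepdf part is left as an untested comment):

```python
def parse_toc(text, offset=0):
    """Parse 'Title <page>' lines (e.g. from a GPT-4V transcription of
    ToC screenshots) into (title, zero-based page index) pairs.
    `offset` accounts for covers/front matter: printed page 1 may be
    physical page 1 + offset in the PDF."""
    entries = []
    for line in text.strip().splitlines():
        title, page = line.rsplit(maxsplit=1)
        entries.append((title, int(page) - 1 + offset))
    return entries

toc = parse_toc("""
Preface 1
Chapter 1 5
Chapter 2 31
""", offset=4)        # e.g. 4 unnumbered pages before printed page 1
print(toc)            # [('Preface', 4), ('Chapter 1', 8), ('Chapter 2', 34)]

# Baking with pikepdf would then look roughly like (untested sketch):
# import pikepdf
# with pikepdf.open("in.pdf") as pdf:
#     with pdf.open_outline() as outline:
#         for title, page_idx in toc:
#             outline.root.append(pikepdf.OutlineItem(title, page_idx))
#     pdf.save("out.pdf")
```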
> The typeface you are reading right now is Garibaldi by Henrique Beier, with some custom tweaks, as you might have noticed. I hope you enjoy it as much as I do. If you want some free alternatives, check out Alegreya and Vollkorn, though I still prefer the look and details of Garibaldi (just look at all the punctuation marks!).
I was going to post the same thing. This has to be the most beautifully typeset webpage I've seen in quite a while. Not just the font but the layout too.
It's almost like this page is part of the web from some parallel universe, which has been disenshittified to the same extent that our own web has been... well, you know.
We (macro.com) have something similar but without the recipe part in our pdf/word processor. It works pretty well on numbered headings but not so well on non-numbered. We’re thinking of porting over to LLMs at some point.