Hacker News
Pdf.tocgen (krasjet.com)
175 points by nbernard 16 days ago | 39 comments



- "That is, you shouldn’t expect it to work with scanned PDFs"

It's surprisingly easy to extend this type of workflow to scanned PDFs (as opposed to software-generated, text-containing ones). tesseract(1) makes short work of ToC pages with --psm set to 6, a page-segmentation mode that treats the page as a single uniform block of text and tends to collapse convoluted layouts into regular, software-parseable output.
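Concretely, an invocation like this (filenames are placeholders; pdftoppm is from poppler-utils, and the rendered page's output name can vary with the page-number padding):

```shell
# Render the ToC page (say, page 5) to an image, then OCR it as one text block.
pdftoppm -png -r 300 -f 5 -l 5 book.pdf tocpage
tesseract tocpage-05.png toc --psm 6   # writes toc.txt
```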

It should also be straightforward to automate that example of extracting "text that looks like a header", based on page layout, relative positioning, or font weight, though I don't know of an out-of-the-box solution. (I'm working on an adjacent problem: automatically re-laying out raster documents to squeeze out whitespace and make them slightly nicer on small e-ink devices. Text islands are trivial to identify, but I don't know how to quantify font weight or similar properties. I'm "wasting" a lot of time diving into mathematical rabbit holes without knowing in advance which ones will be productive.)


tesseract is fine for basic use cases, but it fails when the image is tilted (and thus the text isn't laid out horizontally), which happens a lot with scanned books. Given how well the Google OCR engine works, tesseract ought to be much better than it is.

I wonder how difficult it is to develop a better OCR engine than tesseract.


You are supposed to deskew (and de-warp, if the page isn't flat) the images before running them through tesseract. There are other tools for doing that.


Tesseract is last-gen. Multimodal models are SOTA and can handle even heavily distorted or damaged text.


Am I overlooking something, or is automating page rotation no more work than just a 2d FFT?


I think it's more typical to low-pass (i.e. blur) the image and then use a line-detection algorithm like the Hough transform. Properly deskewed text should have prominent horizontal white lines.
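A toy sketch of the even simpler projection-profile variant of the same idea (pure NumPy, synthetic stripes standing in for text lines; a real pipeline would blur first and could use OpenCV's Hough routines instead): try candidate rotations and keep the one whose row sums are "peakiest", since properly horizontal lines give alternating full and empty rows.

```python
import numpy as np

def rotate_nn(img, theta):
    """Nearest-neighbour rotation about the image centre (demo only)."""
    n, m = img.shape
    cy, cx = (n - 1) / 2, (m - 1) / 2
    yy, xx = np.mgrid[0:n, 0:m]
    y0, x0 = yy - cy, xx - cx
    ys = np.clip(np.round(np.cos(theta) * y0 - np.sin(theta) * x0 + cy).astype(int), 0, n - 1)
    xs = np.clip(np.round(np.sin(theta) * y0 + np.cos(theta) * x0 + cx).astype(int), 0, m - 1)
    return img[ys, xs]

def best_deskew_angle(img, angles):
    """The candidate rotation whose row-sum profile has the highest variance wins."""
    scores = [np.var(rotate_nn(img, a).sum(axis=1)) for a in angles]
    return angles[int(np.argmax(scores))]

# Synthetic "text": horizontal stripes with a line spacing of 20 px.
page = np.tile((np.arange(200) % 20 < 4).astype(float)[:, None], (1, 200))
```

Feeding a skewed copy of `page` through `best_deskew_angle` recovers the correction to apply (the negative of the skew), up to the spacing of the candidate grid.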


- "Hough transform"

Oh, that one has much nicer properties—thank you!


Mind ELI5ing this? It seems neat.


The Fourier transform maps plane waves to points. Blocks of regularly spaced text have a periodic character, with a period equal to their line spacing; their Fourier transform (I think??) would, in 2D frequency space, have amplitude peaks along vectors rotated by the same angle as the lines.
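A toy sketch of what I mean (pure NumPy, synthetic stripes standing in for text lines; the nearest-neighbour rotation helper is only for the demo): the angle of the dominant FFT peak, measured from the vertical-frequency axis, is the skew angle.

```python
import numpy as np

def rotate_nn(img, theta):
    """Nearest-neighbour rotation about the image centre (demo only)."""
    n = img.shape[0]
    c = (n - 1) / 2
    yy, xx = np.mgrid[0:n, 0:n]
    y0, x0 = yy - c, xx - c
    ys = np.clip(np.round(np.cos(theta) * y0 - np.sin(theta) * x0 + c).astype(int), 0, n - 1)
    xs = np.clip(np.round(np.sin(theta) * y0 + np.cos(theta) * x0 + c).astype(int), 0, n - 1)
    return img[ys, xs]

def skew_angle_fft(img):
    """Angle (radians, from vertical) of the dominant 2D FFT peak."""
    spec = np.abs(np.fft.fftshift(np.fft.fft2(img - img.mean())))
    c = img.shape[0] // 2
    spec[c, c] = 0.0                     # kill any residual DC component
    py, px = np.unravel_index(np.argmax(spec), spec.shape)
    fy, fx = py - c, px - c
    if fy < 0:                           # peaks come in +/- pairs; pick one half
        fy, fx = -fy, -fx
    return np.arctan2(fx, fy)

# Synthetic "text": horizontal stripes with a line spacing of 16 px.
page = np.tile((np.arange(512) % 16 < 3).astype(float)[:, None], (1, 512))
```

Rotating `page` and running `skew_angle_fft` on the result gives back (the negative of) the rotation angle; peak-bin quantisation limits the precision to a degree or two at this resolution.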


thanks!


I've found EasyOCR to work much better at pulling text out of irregular or unknown images. Requires more resources than tesseract but gets much better results in my projects.


It doesn't seem significantly better than tesseract for non-mixed images, though, and it takes about five orders of magnitude longer to process a page on my machine; I can literally read a book 100 times faster than EasyOCR can process one on my Ryzen 7 2700.


To give an idea, I started EasyOCR on a single page of a book at ~200dpi before I posted the above comment. It is still running over 3 hours later.


I have thought about using tesseract to OCR the ToC and generate something like this. But there are just so many edge cases that make the whole process fail. For example, how do you handle a title that breaks across two lines? What if the page number is not recognized correctly? "10" can come out as "1o", for instance. What if there are dot leaders? Maybe you can use GPT to clean the extracted text.
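For the digit confusions specifically, a plain regex pass gets you most of the way. A hedged sketch (the pattern and the confusion table are my guesses, not a general solution, and multi-line titles would still need a joining pass first):

```python
import re

# Common OCR digit confusions in page numbers: O/o -> 0, I/l -> 1.
DIGIT_FIXES = str.maketrans({"O": "0", "o": "0", "I": "1", "l": "1"})

# A title, then dot leaders / spaces, then something digit-ish at the end.
TOC_LINE = re.compile(r"^(?P<title>.+?)[\s.\u00b7]*(?P<page>[0-9OoIl]{1,4})\s*$")

def parse_toc_line(line):
    """Return (title, page) or None if the line doesn't look like a ToC entry."""
    m = TOC_LINE.match(line.strip())
    if not m:
        return None
    title = m.group("title").rstrip(" .\u00b7")
    page = m.group("page").translate(DIGIT_FIXES)
    return title, int(page)
```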

In the end, I found that GPT-4's multimodal capability can recognize title and page-number pairs well if I feed it screenshots of the ToC, and I have settled on that.


Recently I found that the getToc function in PyMuPDF was too slow. I told them about it on their Discord, and a day later they had fixed it; now it takes only a couple of milliseconds. I'm using it for my project, pdftomp3. Pdf.tocgen looks useful too, but I'm not sure whether I can use it because of the license?


Of course you can use it.

What you can't do is deny others the same freedoms the license grants to you.


There does appear to be some licensing awkwardness here. The license is nominally GPLv3, but it says it is based on AGPLv3 projects. It also appears to misidentify (it may have been correct at the time) PyMuPDF as GPLv3 when that appears to actually be AGPLv3. My assumption is that using this would require complying with AGPLv3?

There's the additional oddity that a portion of the repository (the recipes directory) is licensed under CC-BY-NC-SA, and so the repository is not fully open source. This is particularly confusing, however, as the functional content of the recipes directory appears to be mostly records of direct observations of parameter choices in external documents and tools, and so doesn't seem like it would be copyrightable at all, at least in the US.


Out of interest, what is pdftomp3?


You can upload a PDF and convert the chapters into MP3s (either original text or simplified text). But for PDFs without a table of contents, you can only convert single pages.


I had been thinking about this too, but for a while now I have settled on using GPT-4V's multimodal capability to generate a text file containing the titles and page numbers from screenshots of the ToC. After that, I use a pikepdf-based Python script to bake the ToC into the PDF.

The upside, compared to Krasjet's approach, is that this works not only for text-based PDFs but also for scanned PDFs, even old scanned journal papers.

The downside is that, before baking in the ToC, you need to make adjustments, since scanned PDFs sometimes omit the empty pages, and you need to calculate the offset introduced by the prologue, cover, and so on. I have a script for this kind of adjustment, but there is always some manual intervention involved.
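The offset bookkeeping itself is simple once you pick a convention. A sketch of the arithmetic I mean (the two-offset scheme and function names are my own illustration, not the actual script): front-matter pages carry roman numerals, body pages arabic, and each section gets its own offset into the physical page sequence.

```python
import re

ROMAN = {"i": 1, "v": 5, "x": 10, "l": 50, "c": 100}

def roman_to_int(s):
    """Parse a roman numeral page label, e.g. 'xiv' -> 14."""
    vals = [ROMAN[ch] for ch in s.lower()]
    return sum(-v if i + 1 < len(vals) and v < vals[i + 1] else v
               for i, v in enumerate(vals))

def to_physical_page(label, front_offset, body_offset):
    """Map a printed page label to a 0-based physical page index.

    front_offset: physical index of printed page 'i' (e.g. right after the cover)
    body_offset:  physical index of printed page '1'
    """
    if re.fullmatch(r"[ivxlcIVXLC]+", label):
        return front_offset + roman_to_int(label) - 1
    return body_offset + int(label) - 1
```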


I love the typography on the site. What fonts are you using? I'm on a mobile browser so I can't really see.


According to https://krasjet.com/colophon/:

> The typeface you are reading right now is Garibaldi by Henrique Beier, with some custom tweaks, as you might have noticed. I hope you enjoy it as much as I do. If you want some free alternatives, check out Alegreya and Vollkorn, though I still prefer the look and details of Garibaldi (just look at all the punctuation marks!).


I was going to post the same thing. This has to be the most beautifully typeset webpage I've seen in quite a while. Not just the font but the layout too.

It's almost like this page is part of the web from some parallel universe, which has been disenshittified to the same extent that our own web has been... well, you know.


Garibaldi, $300 for up to 10k page views per month.


What a beautiful website!


It took a bit of digging from the Pdf.tocgen page, but https://krasjet.com/colophon/ tells us how it's created.


Uncommon to see someone care so much about the specifics of their chosen font. Love it.


And built with very little CSS and basic HTML.


> basic HTML

Apart from the code blocks. Syntax highlighting in `<code>` elements: when, browser manufacturers?


Looks like a very good tool to integrate with knowledge graphs, or just RAG (LLM) pipelines.


Is it possible to extract different patterns of text from a PDF document?

For example, paragraphs, code blocks, code inlined in paragraphs etc?

I tried tesseract but it recognises code blocks as tables.

There are also edge cases that are hard to differentiate, like paragraphs that start with an indentation versus those that don't.

Appreciate any help.
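One crude direction that might get you partway (a hypothetical heuristic sketch, not a tested tool): classify each extracted text block by indentation and punctuation density before deciding how to treat it.

```python
# Rough heuristic: code blocks tend to have hard indentation on most lines
# and a high density of brackets/operators; prose paragraphs have neither.
CODE_CHARS = set("{}();=<>[]")

def looks_like_code(block: str) -> bool:
    lines = [ln for ln in block.splitlines() if ln.strip()]
    if not lines:
        return False
    indented = sum(1 for ln in lines if ln.startswith(("    ", "\t")))
    punct = sum(1 for ch in block if ch in CODE_CHARS)
    return indented / len(lines) > 0.5 or punct / max(len(block), 1) > 0.05
```

The thresholds are pulled out of thin air and would need tuning per corpus; inline code spans inside a paragraph would need a separate, token-level pass.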


We (macro.com) have something similar, but without the recipe part, in our PDF/Word processor. It works pretty well on numbered headings but not so well on unnumbered ones. We're thinking of porting it over to LLMs at some point.


Since when do you need the hyperref package to generate a table of contents under LaTeX (as the author claims)?

\tableofcontents does the job.
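For reference, a minimal example: \tableofcontents alone produces the printed ToC (after two compilation passes), while hyperref is what adds clickable links and the PDF outline/bookmark sidebar, which is presumably what the author actually meant.

```latex
\documentclass{article}
% \usepackage{hyperref}  % uncomment for clickable links + a PDF outline
\begin{document}
\tableofcontents  % needs two pdflatex runs to fill in the .toc file
\section{Introduction}
\section{Method}
\end{document}
```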


That is a beautiful website. I got lost in it and it created a sense of wonder. Nice.


Does someone know a tool that is sed- or awk-like for PDFs?


Qpdf has tools that go in that direction (but not a flat text format that allows arbitrary edits).

https://qpdf.readthedocs.io/en/stable/qdf.html#qdf
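A sketch of the workflow (the sed edit is a made-up example; any text editor works on the QDF form): --qdf normalizes the file into a mostly-text, hand-editable representation, and fix-qdf repairs the cross-reference offsets afterwards.

```shell
qpdf --qdf --object-streams=disable in.pdf editable.pdf
sed -i 's/Helvetica/Courier/' editable.pdf
fix-qdf editable.pdf > out.pdf
```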


Perhaps you can use lesspipe with sed/awk?

https://github.com/wofr06/lesspipe


pdftk is a CLI tool that can extract and edit PDF metadata such as tables of contents*, if that's what you mean?

*(Table of contents? Tables of content?)


Can I use this tool to get a ToC for arXiv papers?



