The Design of Postgres (1986) [pdf] (dsf.berkeley.edu)
218 points by craigkerstiens on Feb 7, 2023 | 38 comments



It took me a long time to understand why we write academic articles like we do. But reading classic articles like this one really puts it into perspective for me. It is a clean and simple way to preserve knowledge over decades. Something that no blog or wiki page could guarantee.


I think that the most annoying omission from the standard academic paper format is the year of publication.


This is something of a CS-specific problem, and is in part a side effect of the field's tradition of digital-first author-prepared publication. In basically any other academic field, any paper that is being passed around or linked to will have a header, footer, etc. that has its bibliographic details.

And in fact, most/all CS papers _do_ have a "real" version _somewhere_ that has the date and other relevant info, but in CS, we have a bad habit of passing around links to papers that point to a random version that the author posted somewhere, rather than to an entry in a proper bibliographic database or to the canonical/archival PDF of the paper. I.e., not the PDF that the author posted themselves somewhere (likely missing dates, publication information, revision history, etc.), but the version that came as part of the conference proceedings or journal that the paper was published in. Those typically (though admittedly not always, depending on the conference) have headers/footers added that include whatever would be needed to properly cite the paper. Google Scholar etc. try their best to link to the "right" place, but are often led astray and point at the "random early draft on the author's website" instead, which of course perpetuates the problem.

Incidentally, helping avoid this situation is the sort of thing that is at least nominally a big part of the "value add" that traditional journal publishers are supposed to be adding: keeping track of citation/bibliographic metadata, assigning and managing DOIs, ensuring that there's a standardized layout and production process that includes such information in PDFs, providing archival/canonical URLs for papers, etc. It's also an important part of what professional societies that have publishing arms (ACM, IEEE) and libraries (like the NLM and its PubMed/MEDLINE services) contribute.


What fields do you see this in? I read a lot of social science-y academic articles for research for a nonfiction book I’m writing, and I would be very frustrated if a date wasn’t included at the front of the article.

Though I’m not sure if I’ve been lulled into a false sense of security by finding the date on the cover sheet of the PDF (prepended by the article’s publisher/distributor, I imagine).

In no way am I disagreeing with you, more sharing the fact that I’d be surprised and frustrated to come across this omission. Author and publishing date are arguably more important than the title, in my opinion - which now that I think about it is probably why these are the metadata we use to index citations.


Computer Science/Math.

Here's an almost completely random example: I literally searched for "graph coloring paper" just because that was a subject I could think of off the top of my head, and found this as the first hit: https://www.cs.cmu.edu/~avrim/Papers/coloring_worstcase.pdf

Yep, no date.


That’s a preprint - check the published version https://dl.acm.org/doi/10.1145/176584.176586 which has a date on the front page of the article.


That's a good point. Given how many apparent preprints I come across, though, I wish they'd just put a date on the front page, too.


In case you do not know, I find Google Scholar to be pretty useful for finding dates and other information about the paper - e.g. in this case https://scholar.google.com/scholar?hl=en&as_sdt=0%2C14&q=New...

I work with papers in CS/math and have been able to find dates and metadata like DOIs pretty quickly. There are complications when there really are multiple versions of a paper, like one in a conference that is only an extended abstract and one in a journal, but you would need to figure out which version you want to read in that situation anyway. I agree it's annoying, but coupled with reference manager software like Zotero it's hardly an issue for me.


Google Scholar is a great tool, but I fear its days are numbered. It's already frustrating that they removed the feature that hot-linked to an article's Scholar page when you entered its title in a Google search.


If Google Scholar is sunset, that would be a great loss for the academic community. When was that feature removed? I have felt reasonably confident in Google Scholar so far because it is already 18 or so years old.


You deduce a lower bound for the date based on the years of cited publications.


One wonders how the author got _those_ dates...


I mentioned that in a sibling comment. Of course, what I actually do is just a web search for the paper’s title.


Agreed. And they know it's important, because their references usually include the dates of the works cited.


Yes, absolutely. Even in my example in the sibling comment, when I look at the first citation(s) in that paper with no date, I see: "[Cha82][CAC+81][BCKT89][Ber73]".

Great, so I guess I at least know that it must not have been published before 1989... (before I just search for the title on the web, in the hopes of finding a journal it was published in).


When you write an academic paper it's up to the journal when it's published - would this play into it?


Sometimes I read something and I'm just like "what the literal fuck could you possibly be talking about?!" Of course the date of publication is a standard part of any journal article. I mean, a lot of them get into the weeds about it: date of submission, date of acceptance, date of early online publication, date of press.

Then I went back to this particular PDF and noticed that, indeed, it doesn't have the date of publication.

Then I re-read your sentence.

"I think the most annoying omission from the standard academic paper format is the year of publication." is subtly ambiguous.

"Sometimes authors omit elements required by the standard academic paper format, and of those omissions, omitting the date of publication is the most annoying."


To me it feels like more than "sometimes", because I too have wondered the same things, and have often searched the web for the titles of papers I'm reading just to find out when they were published.

I assume that's because those papers are, well, published, and of course there's a date on the journal itself? Doesn't really work that way anymore.


First search terms brought me to this:

https://dl.acm.org/doi/abs/10.1145/16856.16888


Michael Stonebraker is a prodigy in this field. He did something very rare: he took a concept from deep within academia and made it a blockbuster product in the real world.

The man writes small databases for fun over a weekend, I mean, come on.

His interview on Software Engineering Radio is still one of the best episodes I have listened to. Decades of wisdom, right there:

https://www.se-radio.net/2013/12/episode-199-michael-stonebr...


> The man writes small databases for fun over a weekend, I mean, come on.

I am a student of Mike. He is a great man, but he would not call himself a programmer.


What would he call himself then?


Mike, usually.


This is the reason I am so fascinated with both VoltDB and SciDB. Having read the articles he and his teams were pushing (more for Volt and their commercial side), I saw that he was in a different league than others, and the insights and comments he made about NoSQL and its variants resonated deeply with me as a sysadmin type who has to suffer the slings and arrows of so many DB systems.


Wow, Postgres is older than me. I never knew. Love Postgres and use it for basically all of my projects.


The open source project is much younger, from 1995. I know some people who used the first open source version, Postgres95, back then.


I'd forgotten the bit in here about the storage hierarchy, especially optical disks and immutable storage. I'm finding the modern "big data" stack with blob stores, immutable storage, and neutral formats (Parquet, Avro) to be useful, especially when it can be seamlessly read from a DBMS instance with its own optimized storage (Redshift, Vertica, and others can do this). However, I miss Postgresql's types and planner. From afar, it seems like Postgresql would be a perfect fit into this tiered ecosystem. Having a "hot set" in Postgresql and the "cold set" in a blob store for cost optimization in particular... as long as PG could use parallel query to maximize throughput over the higher-latency connection.


I found typos, I wonder if they published it already :)


Send them a patch :)


So... some of these DB objects are in turn more DB commands, so that if you fetch the thing, it fetches the underlying things? I don't know much about DB design/implementation but that seems kind of wild.


It gets hairier than this.

For example, in Postgres, the metadata that the DB uses for its internal business logic is stored not in regular C structs, but as rows in special tables (relations).

Say you want to look up the physical file location of a table using its ID (something Postgres functions have to do internally at various points). Postgres does this by reading from a relation.
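
A rough way to see this from the SQL level (the table name below is just a placeholder, and this is the user-facing mirror of the lookup rather than the internal C code path):

    -- The table's own metadata is a row in the pg_class catalog relation.
    -- ('accounts' is a hypothetical table name.)
    SELECT oid, relname, relfilenode
    FROM pg_class
    WHERE relname = 'accounts';

    -- pg_relation_filepath() resolves that to a file path relative to the
    -- data directory, e.g. base/16384/16410.
    SELECT pg_relation_filepath('accounts');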

It's sort of like a compiler that compiles itself. It can't always have been that way, so I have to wonder at what point the codebase swapped over to doing it this way.


> It can't always have been that way

Sure it could have.

1. Start with a library version of just the storage engine;

2. write some CLI tools that use that library to direct-manipulate table heap files, batch import/export between them and serialization formats, etc — y'know, like these ones for BDB (which was the SQLite of its day, and contemporary to Postgres at Berkeley) — https://docs.oracle.com/cd/E17076_05/html/api_reference/C/ut... ;

3. define the core system-schema tables for the DBMS, and the required static data to go in them;

4. write a script that uses the CLI tools to bootstrap a "database" (directory of loose heap files) using that schema — call it initdb(1);

5. and now start writing the rest of the DBMS, which can now take "there is an initialized cluster with a populated system schema" as an axiomatic assumption, opening it and reading important state and configuration from said schema files from almost the first line of code.

(I have no special knowledge of PG's earliest versions, so this might be a just-so story; but today, initdb(1) is a separate tool with separate code that doesn't come from some library that also gets compiled into the PG postmaster. And that's not the sort of design you would move to, if you had started out with a design where the postmaster could bootstrap an empty cluster directory into existence on startup.)


Huh, thanks for the explanation. I've always wondered how that worked.


At this point, Postgres is probably the one piece of software that is almost unavoidable in most projects.

It’s fascinating to see the way it’s stood the test of time.


Postgres is good


Postgres is good because Postgres is great.


word




