How and why the Relational Model works for databases (the-pans.com)
178 points by uvdn7 on Jan 17, 2022 | 243 comments



The best advice I can give is that you can think of your app and its data store as either:

#1 code is of prime importance, data store is simply a "bucket" for its data

#2 data is of prime importance, code is simply the means to read/write/display it

In my personal experience, #2 is a way better way to work. Apps can come and go, but data can last a long time, and the better your database is modelled the better the outcomes you'll have long term.

Corollary - I have seen some abject disasters where #1 has been adopted. Not necessarily because of #1 alone, but it's certainly been a major factor.


"Show me your flowchart and conceal your tables, and I shall continue to be mystified. Show me your tables, and I won't usually need your flowchart; it'll be obvious."

-- Fred Brooks, The Mythical Man Month (1975)


When reviewing patches, I always look at the changes to .h files first, to see how the data structures are modified. Frequently you can predict exactly what the code changes are going to look like as a result.


I imagine that this doesn’t work as well with PIMPL code.


While I am completely in agreement with the intent of this aphorism, and also what the article says about the relational model, I feel this is (as aphorisms tend to be) something of an exaggeration. I think, for example, that it would be difficult to deduce the rules of cricket from its scoresheets.

Tell me your tables' semantics, however, and I think one could do much better. ..

I suspect that, when Brooks made that comment, he was probably thinking mostly in terms of programs for entering and reporting on that data. We're doing a lot more now.


Well, the scoresheet is not enough of course, you need all the "datastructures" (field design, player number and layout, equipment, etc).

edit: but yes, it can be a bit of an exaggeration. Figuring out all the red-black tree rules from just the structure definitions, relations, and invariants is probably a bit of a stretch for most.


OK, I'll grant you all that, but not the rules of play. Do you think that would be enough to figure out how to fill out a scoresheet correctly, without operational observations (i.e. without having previously watched a number of games or being told how it is played?) (As I don't play or watch the game myself, I am not certain it would not be possible, but I feel doing this would be difficult for almost any game.)


One awful thing I've seen a few times is people who put data into a database in a format meant to make it easier for working with a particular platform, such as serialized Java dates (or even dates as longs).

It completely ties you to one implementation and precludes using the data for other things such as reporting and alerts or even ad hoc queries.


You can always convert in the DB when selecting if you want the date. Dates/times in general are kind of horrible in all DBs and programming languages so I don't think using unix seconds (or milliseconds, etc) is obviously wrong. It's kind of like storing strings as plain bytes in the DB, it's not what you'd do by default but it's not completely wrong.


I tend to agree that dates and times are awful to work with, but storing everything as a Unix timestamp still seems to be the easiest way to avoid most of the headaches, in my experience. There's nothing worse than trying to reconcile a database filled with naive dates with no associated time zone information. Is it daylight saving time? Is it in this timezone, or did they enter it while traveling and their computer automatically updated the time?

The simplest way to minimize these issues, in my experience, is to put the logic for converting local time into UTC in the program and only store the Unix stamps in the DB.
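For illustration, a minimal sketch of that split (assuming PostgreSQL and a hypothetical events table; the application converts local time to UTC before writing, and the DB only ever sees epoch seconds):

  -- Hypothetical table: timestamps stored as Unix epoch seconds, always UTC.
  CREATE TABLE events (
      id         bigserial PRIMARY KEY,
      name       text NOT NULL,
      created_at bigint NOT NULL  -- epoch seconds computed by the application
  );

  INSERT INTO events (name, created_at) VALUES ('signup', 1642406400);

  -- Convert back to a readable timestamp only when reading, e.g. for reports.
  SELECT name, to_timestamp(created_at) AT TIME ZONE 'UTC' AS created_utc
  FROM events
  WHERE created_at >= extract(epoch FROM now() - interval '7 days');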


So you don't like serialized Java dates, which presumably are some kind of standard, at least.

You don't like longs / epoch.

So..... what do you like?

Date formatting/conversion/serialization is not standardized across databases. And it's generally hair pulling (at least the Oracle version was last I checked which was a long time ago).

Longs or ISO-8601 strings? No DB-specific crud. Yes please.


At least you can compare longs and index them. That's not too bad. But I prefer to just use the native date type; it allows you to use the built-in date functions.
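As a counter-sketch (PostgreSQL, hypothetical orders table): a native timestamptz column indexes and compares just as well, and the built-in date functions come for free:

  CREATE TABLE orders (
      id        bigserial PRIMARY KEY,
      placed_at timestamptz NOT NULL DEFAULT now()
  );
  CREATE INDEX orders_placed_at_idx ON orders (placed_at);

  -- Built-in date functions work directly on the native type.
  SELECT date_trunc('month', placed_at) AS month, count(*)
  FROM orders
  WHERE placed_at >= now() - interval '1 year'
  GROUP BY 1
  ORDER BY 1;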


> In my personal experience, #2 is a way better way to work. Apps can come and go, but data can last a long time

Exactly. Data is far more valuable than code. We still work with data from literally centuries ago. Most programs written even 10 years ago are probably no longer in use. I would be surprised if the oldest code still doing something useful was more than 20 or so years old.



How did they miss GNU Maxima?


> I would be surprised if the oldest code still doing something useful was more than 20 or so years old.

...you can't be serious? Every other company that has been writing code for 20+ years probably has code that is 20+ years old. Just look at banks.


I could see how someone would get that impression if they've only worked at companies that exist solely to light VC cash on fire.

It's an incorrect impression, obviously.

For >20 year old code, Calc.exe is just one example off the top of my head - https://github.com/Microsoft/calculator

The majority of the products I've personally worked on over my 15 year career in software development are still doing useful things and contain code written more than 20 years ago.


Guys, you're missing the point. The scenario being discussed is the set of programs that collect large amounts of data, not programs like Calc.exe or Apache that don't do anything like that. That's the whole point of asking whether data or code is of prime importance.

Banks are probably one of the few examples of very old COBOL code that's still running (although they've been trying to get rid of it for years), which is the exception that proves the rule. Regardless, if you approached bank executives and said, "we've compromised all of your systems and we're going to delete either all of your code or all of your data, you pick", which one do you seriously think they're going to pick?


No one questioned whether data is more important, only the statement regarding "being surprised if there is any useful code older than 20 years".


I don't see what's surprising about it if you account for the context where we're discussing code that operates on large datasets. Aside from banks, do you have any examples? Because all of the examples cited so far do not qualify (Apache, Linux, gcc, etc.), so it seems like most people are completely ignoring that context.

I myself am still maintaining a large system that I started about 20 years ago, and maybe only a couple of functions from that time have survived to this day, but all of that data is still there and valuable.


That part wasn't really obvious, even with context!

Though I'd say that a big reason for that is because very few companies even had a backend 20 years ago compared to today. The ones which did are likely not really consumer oriented, but if you ask medical companies (EMRs), governments, and insurance companies, I'm sure it's not too uncommon to use applications which have codebases that are from before 2002.

Also, 10-20 years from now, I'm sure it will be quite common for companies to have backend services code which are 20 years old. There are still a lot of companies having Java 7 and earlier in production, and that's 11+ years now.

But I agree with what you're saying, it's much easier to get rid of code vs. data, and data is also a lot more valuable!


Weird goalposts shifting.

My former company makes B2B software whose entire purpose is to process business data, and our codebase started in 1999; plenty of the original code is still in there. Entire modules have barely been updated: those modules are working and the business needs haven't changed.

Speaking of old business software, PeopleSoft is ubiquitous at big companies and has been around forever. Their codebase probably has code two decades old. The frontend looks like it hasn't been updated in the last decade.


Insurance stuff definitely processes lots of data, and has been around for a lot longer than twenty years.

Additionally, psychometric tests such as the GRE and SAT have been around for a long time and process lots and lots of data (especially relative to when they were first scaled out).


You've just torn apart most NoSQL databases in a very short comment :-))


So this is just an observation and not a judgment.

With only 4 years of professional experience, I have never worked on a MongoDB project in which MongoDB wasn't somehow an issue. The proposed solutions were merely "improve the indices" or "scale the cluster", often without clearly defining what was going on.

Granted, I have very limited experience, but it's baffling to me, mainly in comparison to the PostgreSQL projects I worked on. They definitely also had problems, but those problems were clearly defined even if the resolution wasn't quick or easy. It was usually an out-of-date or generally messy schema that was causing issues, and folks were usually able to clearly define the schema problems.


I hear all the time that MongoDB is a plague and you should stay far away, but DynamoDB, which is also NoSQL? One of AWS' finest products; just launching an app on AWS? Use DynamoDB! So on and so forth ad nauseam.

Why is DynamoDB the bee's knees but MongoDB is a thing to be despised?


1. One has a better marketing machine than the other?

2. DynamoDB is pushed in slightly more balanced ways than Mongo was, at the start Mongo was supposed to be the second coming of the (database) Messiah.

3. I don't believe DynamoDB has defaults that lose your data.


Mongo works really well for backends to test your UI against in coding bootcamps. Dead simple CRUD storage and access model. As a mock React data store, it'd be hard to find simpler.

DynamoDB is a managed service that scales far beyond Mongo and truthfully far beyond what any of us here would be seriously discussing.

With regard to complex schemas and related data, both DynamoDB and Mongo are harder to deal with once you're past a trivial size. If all you need is basic CRUD with limited or no joins, Mongo could suffice up to a large deployment, but if you're mostly accessing by primary key, DynamoDB will soundly eat its lunch performance-wise and for cheaper.

If you need to join related data (or just find joins convenient), relational models work famously well, usually up to the point where you have 10 million simultaneous users.

Mongo more and more is being relegated to niches where the problem just happens to exactly fit Mongo's feature list, which unfortunately for Mongo encompasses ever-narrowing gaps in the other options' offerings.

Mongo can't match the scale or economy of DynamoDB and can't match the flexibility of relational. That's why fewer and fewer treat it as their go-to database nowadays.


Your comment that "fewer and fewer treat it as their go-to database nowadays" seems odd given the continued growth of MongoDB.

https://db-engines.com/en/ranking_trend/system/MongoDB

https://www.macrotrends.net/stocks/charts/MDB/mongodb/revenu...

https://seekingalpha.com/article/4473768-mongodb-mdb-stock-s...


Dynamo seems nice for some very specific use cases, but there are others where it falls short.

There's limits on how big your data can be, which is a little annoying if you want to use it to store a couple larger things alongside all your small data.

The C# API is absolutely painful IMO, and makes it easy for new developers to use it in ways where you would have been better off grabbing a different technology.

I suppose one could argue S3 would be the better choice for large payloads, OTOH I'm a fan of minimizing potential points of failure so if we already are using DynamoDB I'd rather not toss in an additional S3 integration that could break or need maintenance later.


1. Below 25GB and the read/write threshold of most apps, DynamoDB is free.

2. DynamoDB is a 100% managed service. No instances to wrangle. No manual partitioning. Just define your table's name, the table's partition key, and an optional sort key, and you're off to the races! It just works.

If you have a serverless environment like lambdas with API Gateway or AppSync, it can scale pretty much to the extent of your business model rather than some fixed limit.

BUT for data schemas beyond the most trivial, it can easily be more complex to deal with than a relational database. Whereas a relational database usually aims for normalization where no data is duplicated and foreign keys keep things straight, DynamoDB works best with a denormalized data set. No joins. Ever. Schema integrity is your problem, not the database's. In the deal though, you get a database engine that can scale effectively infinitely.

In other words, storage is cheap, but access is expensive, so data duplication is pretty much encouraged in DynamoDB for the sake of speed. You aim for getting everything you need in a single entry or sequential row iteration.

When we use a relational database and run into problems, we run EXPLAIN and EXPLAIN ANALYZE to figure out the query plan, so we can optimize. SQL is a 4th generation, declarative language that describes WHAT data you want, not HOW you get it.
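For example (PostgreSQL syntax; table names are made up), the plan and the real row counts come back from a single statement:

  -- EXPLAIN shows the planner's chosen plan; adding ANALYZE actually runs
  -- the query and reports real row counts and timings per plan node.
  EXPLAIN ANALYZE
  SELECT o.id, o.total
  FROM orders o
  JOIN customers c ON c.id = o.customer_id
  WHERE c.country = 'DK';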

DynamoDB in the larger sense is at the level of EXPLAIN output. It is 100% HOW to get data. The WHAT is at application level and implemented by you in code.

If EXPLAIN output makes no sense to you, then DynamoDB probably isn't for you either unless it's a trivial app/data set.

But then again, if it's a trivial app/data set, literally anything can work. An O(n!) algorithm is perfectly reasonable given a small/simple enough data corpus and large enough computing resources. It's when the data set gets slightly larger that decisions become important.

But when things are very small/simple, it's often hard to argue with fast+free. Those are the sweet spots for DynamoDB: very small/simple and the mind-bogglingly humongous. For everything in the middle, relational databases work wonderfully and are much easier to work with, especially for non-trivial data sets.


For one thing DynamoDB is Amazon scale whereas Mongo is merely web-scale.


But is it shardable? ;)


DynamoDB is a key-value store, so it forces you into an up-front waterfall design model where you can't easily evolve your schemas. This has massively narrowed the scenarios where it can be used, but there's a survivor bias: if that's all you need, you're really dealing with a simple use case that's easy to encapsulate and reason about.


DynamoDB evolved from Dynamo, described in detail in their SIGOPS '07 paper. Amazon uses Dynamo internally for e.g. shopping carts. There’s nothing wrong with a distributed key/value store. It’s just that I think Dynamo started off being super reliable, and not overpromising …


Don't worry; for every misuse of a NoSQL database when they should have used a relational DB there's a solution misusing a relational DB for a queue, event bus, etc. Nature likes a balance!


Would you remind a dumb-wit like myself why that's a problem, when the API of a relational db is virtually perfect for a queue or event bus? Push, pop, look-but-don't-touch, data available through the most scrutinized and battle-tested query language ever, what else do you need?
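As a sketch of that push/pop API (assuming PostgreSQL 9.5+ and a hypothetical jobs table), SELECT ... FOR UPDATE SKIP LOCKED gives the "pop without blocking other workers" behavior:

  CREATE TABLE jobs (
      id      bigserial PRIMARY KEY,
      payload jsonb NOT NULL,
      done    boolean NOT NULL DEFAULT false
  );

  -- Push.
  INSERT INTO jobs (payload) VALUES ('{"task": "send_email"}');

  -- Pop: each worker claims one pending job without blocking the others.
  BEGIN;
  SELECT id, payload FROM jobs
  WHERE NOT done
  ORDER BY id
  LIMIT 1
  FOR UPDATE SKIP LOCKED;
  -- ... process the job in application code, then mark it done:
  UPDATE jobs SET done = true WHERE id = 1;  -- the id returned above
  COMMIT;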


Until Postgres added JSON fields, I would have retorted that NoSQL is really handy for scraping purposes, when you just want to grab and store the collected data, which you then parse with another specific program into a relational database later.

I've learned this by doing it the opposite way for years. However, website schemas can frequently change, which ruins the database schema if you're immediately parsing into a normalized structure. In other words, it's brittle.

NoSQL (or JSON fields) allows one to store unstructured data, which is more forgiving when a schema changes (e.g. the spelling of a dictionary key changes, etc.).
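A rough sketch of that pattern (PostgreSQL jsonb, hypothetical table and keys): raw payloads go in as-is, and parsing into a normalized structure happens later, tolerating missing or renamed keys:

  CREATE TABLE scraped_pages (
      id      bigserial PRIMARY KEY,
      url     text NOT NULL,
      payload jsonb NOT NULL,
      fetched timestamptz NOT NULL DEFAULT now()
  );

  -- Inserts never break when the site's schema changes.
  INSERT INTO scraped_pages (url, payload)
  VALUES ('https://example.com/item/1', '{"title": "Widget", "price": "9.99"}');

  -- Parse into typed columns later; absent keys simply come back as NULL.
  SELECT payload->>'title' AS title, (payload->>'price')::numeric AS price
  FROM scraped_pages;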


What you need is a versioned mapping layer from the scraped data into your own schema.

The problem is when you need to update your schema to support data/relations not available in your own schema.

The problem with SQL is that schemas are so sticky and hard to change. People are always hesitant to change their schemas because schema changes themselves are difficult and then you have to update a bunch of backend code, and then your frontend code maybe does strange denormalized things, and then everything breaks.

I've been thinking about this for a while, and I think what is needed is a visual tool that can show all dependencies of a data schema element (backend, frontend), so that schema changes are easier to make.

All the layers (db access, API, cache, frontend data stores/caches/frameworks) in modern architectures make it virtually impossible to modify a schema without causing chaos. The solution is keeping the data model and query interface as close as possible throughout the entire stack. For example, you should never write any manual data manipulation code (e.g. `people.map(p => p.full_name = p.firstName + p.lastName)`) unless it is able to be traced through the entire system. A monorepo, typed ORM, and refactoring tooling (e.g. an IDE) can help, but it's usually never set up well enough or integrated closely enough with the db.


Well, technically you could have done this since forever in an RDBMS. A key-value table, with the value column as a blob (or just varchar if it's a string).

Then yes, you'd have to parse it yourself, but you'd be doing that anyway with JSON fields, more or less.
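Something like this, for instance (sketch; Postgres upsert syntax shown, names hypothetical):

  -- A plain key/value table: the value is an opaque string (or bytea for a
  -- real blob) that the application parses, much like a document store.
  CREATE TABLE kv_store (
      k varchar(255) PRIMARY KEY,
      v text NOT NULL
  );

  INSERT INTO kv_store (k, v) VALUES ('user:42', '{"name": "Ada"}')
  ON CONFLICT (k) DO UPDATE SET v = EXCLUDED.v;

  SELECT v FROM kv_store WHERE k = 'user:42';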


Indeed, you are correct. A lot of these lessons came hard-earned from doing it the wrong way first.

I thought I was being such a good little programmer by normalizing my data as soon as possible. All it did was provide me job security (at the cost of headaches) every time the schema changed.


Why does PostgreSQL include LISTEN/NOTIFY if I'm not supposed to use it for events?
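For reference, the mechanism is tiny (PostgreSQL; the channel name is made up):

  -- Session A subscribes to a channel.
  LISTEN order_events;

  -- Session B publishes; every listener on the channel receives the payload.
  NOTIFY order_events, 'order 42 created';

  -- Or from a trigger / application code:
  SELECT pg_notify('order_events', 'order 42 created');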


NoSQL databases are mostly useful for performance reasons. You can have the most beautifully modeled relational database schema but if it performs/scales outside your tolerance bounds, then it's useless.

IMO, this is the main reason we have so many database technologies: they make important performance tradeoffs.


It isn't that difficult to optimize relational SQL databases. I've dealt with databases with billions of rows in hundreds of tables and the company's initial thought was to flatten the schema due to perf. complaints.

We got queries down from 30+ seconds to 5 ms simply by properly indexing, defragmentation, analyzing query plans, SQL Stored Procedures, etc.
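A sketch of the kind of fix involved (PostgreSQL syntax, hypothetical table): add an index that matches the query's predicates and confirm it with the plan:

  -- A composite index matching the WHERE clause lets the planner avoid a
  -- full table scan.
  CREATE INDEX orders_status_created_idx ON orders (status, created_at);

  EXPLAIN ANALYZE
  SELECT count(*)
  FROM orders
  WHERE status = 'shipped'
    AND created_at >= now() - interval '30 days';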

I see a lot of complaints from developers claiming they have to join a "million row table to million row table" and reports are slow, and this gets blamed on the DB. These should not be slow outside of how much bandwidth is being pushed over the wire, which is often exactly what the problem is. They just didn't see it.


I'm a huge supporter of PGhero (https://github.com/ankane/pghero) for this reason. Sometimes a single index is all that's required, but if DB design isn't your forte it's hard to know where to put it.


When you start seeing devs make the same mistakes over and over again with relational databases in certain contexts, mistakes that increase the costs of operational maintenance and scaling (e.g. by requiring them to sit around analyzing query plans, lock contention, sharding, etc.), then it makes sense to consider building a DB with constraints that prevent those mistakes and ensure that every DB instance can meet those requirements with a lower overall cost. That is how I view NoSQL, graph databases, etc.

It is similar IMO to the diversity in programming languages. They are all Turing complete, at the end of the day. The difference is in the patterns they encourage and the constraints they impose.


I've always felt a more modular SQL database system is needed. SQL dbs are too black box. You can't interact with the query planner. NoSQL essentially is doing just this. SQL dbs have great underlying layers such as storage, caching, transactions, indexing, concurrency, gis support, backups, etc.

If anything, devs probably just need to be more familiar with DB internals and have tools setup to analyze queries easier.


True, but you really, really want to be sure that if you're going the NoSQL route, your manual schema management will be top notch or your data structures will be super simple and the query demands will also be super simple.

Most of these high performance use cases actually fit this model well (well, the schema management being awesome is the wildly varying part).

But for everything else, there's Mastercard and RDBMS.


Yes, you should think about your domain and requirements, and the technical tradeoffs, before picking a DB technology. Don't just run with RDBMS b/c it has better theoretical modeling capabilities, your customers don't care about that.

For enterprise stuff with relatively low traffic and high amounts of complex ad-hoc queries, RDBMS is without a doubt the best choice. If you have high traffic web services with strict availability and latency requirements, I would seriously consider avoiding RDBMS's as they tend to be difficult to operate and scale with those requirements, and they let you easily do terrible things in that context (e.g. locking behavior, txid exhaustion, etc.).


You're familiar with every NoSQL database?

Impressive!


There you go, I've edited my previous comment to account for pedantry.


I guess I should just accept that the term "NoSQL" now pretty much means "Mongo and Dynamo". Because it feels weird to group key-value stores, columnar stores and graph DBs under one term.


Well, I don't want to be super mean to you or NoSQL, but NoSQL itself was a bit of a fad.

Columnar stores and graph DBs have always felt more like "real" tech, something promoted for specific use cases.

Key-value stores I'm not even going to denigrate at all, since we've had them since forever and they do a great job at their tasks (BerkeleyDB, Memcache, Redis, etc.). They don't deserve to be lumped in with fads or niches; they're a very valuable universal resource.

At least that's how I see it.


Yeah but in my mind columnar stores and graph stores are NoSQL databases. Because they're not relational.

But the term has shifted underneath me. NoSQL means document database.

And - in practice - it means using a document database as the primary store when it's not really the right choice (been there!)


I totally agree. Data lasts forever, code is volatile.


Data ages like wine. Code ages like fish.


Fish can age well, there are some good aged fish sauces used in Japanese cuisine.

Milk also ages well (cheese).

Some subatomic particles might be a good choice. Pions decay in a few nanoseconds (for + & - charged) or attoseconds (for neutral). But "ages like Π⁰ particles" just doesn't have a nice ring to it.


  > Fish can age well, there are some good aged fish sauces used in Japanese cuisine.
So that would be TeX, the odd exception.


Sure, I agree largely with #2, but writing tests for this logic can get difficult (pgTAP looks alright but not amazing), as can making all the database logic scale as you grow to sharding and multiple DBs. I’m not sure you’d want this if you were scaling to hundreds of millions of users, but then you have loads of good problems…


Or maybe it's shades of grey in between ...


Wend hence. Thee wilt picketh a side.


> All these database aspects remained virtually unchanged since 1970, which is absolutely remarkable. Thanks to the NoSQL movement, we know it's not because of a lack of trying.

Kind of an aside, but I find it odd that people treat "relational" and "SQL" as synonymous (as well as "non-relational" and "NoSQL"). You could make a relational database that was managed with a language other than SQL, right?


Yes. PostgreSQL is a reference to the Post Ingres database (POSTGRES) adopting SQL. Ingres used QUEL. The POSTGRES database used POSTQUEL which was similar to Ingres’ QUEL. This is why the Postgres C library is called libpq (library post quel)

https://en.m.wikipedia.org/wiki/QUEL_query_languages https://en.m.wikipedia.org/wiki/PostgreSQL#


True, but SQL is (for better or worse) the de-facto standard query language for relational databases. The alternatives are all very niche.

I would like a better alternative, but it would need some very significant benefits compared to SQL to gain any traction. SQL is just so entrenched at this point that even NoSQL database engines are adding support for pseudo-SQL query languages (which is really the worst of both worlds - the clunkiness of SQL syntax without the power of the relational model).


I’m curious as to why you find sql to be clunky, I find it extremely on point in most cases. I mean, how would you write a SELECT that was better than:

SELECT whatever FROM thisplace?

I know you can make it clunky with parameters and crazy stored procedures, and I’ve myself been guilty of a few recursive SQL queries that most people who aren’t intimate with SQL struggle to understand quickly, but I consider those things to be bad practice that should only be done when it's unavoidable.

The fact that SQL is still the preferred standard sort of speaks volumes to me about how good it is. We’re frankly approaching something similar with C-style languages. I recently did a gig as an external examiner, and it took me a while to realise that some code I was reading by a student in their PDF report was Kotlin and not TypeScript, because they look so alike.


The problem with sql is what happens when you fall off the SELECT FROM JOIN WHERE GROUP BY HAVING ORDER BY LIMIT cliff. The simple stuff in sql reads like English, but for that case an ORM would generate a pretty efficient query anyway. The complex stuff in sql looks terrible in my experience, and ORMs bail out quickly. Once you can’t get the result with a simple SELECT, sql stops being declarative. Instead of writing what you want to get, you write something like a postmodern poem while having a stroke, just to convince Postgres's benevolent spirits to give you something almost right. Complex UPDATEs and DELETEs with joins are even worse.

Also, the lack of syntax sugar doesn’t help. The SELECT list could support something like “t1.* EXCEPT col1, col2”. Maybe JOIN ON a foreign key would be nice. IS DISTINCT FROM for sane null comparisons looks terrible. Aliases for reusing complicated expressions are really limited. Upsert syntax is painful. Window functions are so powerful that I can’t really complain about them, though.
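To illustrate two of those (PostgreSQL syntax, hypothetical tables):

  -- Null-safe comparison: correct, but verbose compared to a plain "=".
  SELECT *
  FROM accounts a
  JOIN staging s ON a.email IS NOT DISTINCT FROM s.email;

  -- Upsert: workable, but a lot of ceremony for "insert or update".
  INSERT INTO accounts (id, email) VALUES (1, 'a@example.com')
  ON CONFLICT (id) DO UPDATE SET email = EXCLUDED.email;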

We use a lot of sql for business logic, but some code I have to reread from zero every time I need it. Maybe we modeled our data wrong or there is some inherent complexity you can’t avoid, but I mostly blame sql the language. Unfortunately I have no idea how it could be improved.

Anyway I think the sql cliff is real. Once you take a step outside the happy path prepare for a headache. For me sql definitely is in some local maxima, after all I use it every day at work.


The biggest thing is... SQL is not reusable, period.

Why don't we have SQL libraries?

I know that data models are kind of special snowflakes, but some models pop up over and over and over again and code reuse is always 0 with SQL.

To give you an example of a common problem, SLAs or the like for teams with regular business hours.

A team has to respond to a request within N hours. To calculate that I need to take into account 8 business hours per day, excluding weekends, excluding holidays (ideally localized holidays), etc.

It's a nightmare with SQL. It's precisely the kind of thing you want in a library.

Plus, obviously, standard SQL doesn't have a way to share and distribute any libraries, even if they were made. It's pre-C in terms of stuff like that.


And the core issue is, SQL is just strings. It only fails at runtime and not at compile time. There is no compile-time strong/static typing; you only find out it's broken when you run it. So it makes it really difficult to re-use. The other problem is, you have to specify table and column names in queries (duh, how else), but that means the entire persistence model is hard-coded and needs manual tweaking and so on. In an ideal universe, each field would be generic, each object generic, each collection of objects generic, and all functions generic. You'd end up with one huge flat structure of fields that are linked to each other to form more complex objects. But that's not efficient on current computers, nor do we have enough memory to store all information like that. And you might end up back at square one where everything is just strings.


> The other problem is, you have to specify table and column names in queries (duh, how else), but that means the entire persistence model is hard coded and need manual tweaking and so on. In an ideal universe, each field would be generic, each object generic and each collection of objects generic and all functions being generic.

Yeah, define the database in a database. And then define the database for the definition of the database in a database.

If you'd like to deliver a product that does something, you've got to stop adding abstraction layers at some point.


The thing is, for sure we could define some use cases where it's possible to take that general case and make it specific. That's why I gave a concrete example.

I refuse to believe that absolutely every data storage and access in this world is unique.

Everybody believes that it's unique, and that's a different story.

OS vendors also thought their hardware was special and magical a long time ago and yet POSIX was invented and suddenly they were all more or less commoditized.

I feel that we're in the teen years of data storage/data access technologies. And SQL is sort of like dental braces.


A long long time ago I sent in a patch to Hibernate to check the validity of all declared HQL queries.

There's no reason why it can't be checked, you just need to have your schema declaration.

Name clashes are everywhere. That's why you have namespaces/packages. In SQL they are called schemas. People don't really use them these days.

The tweaking is one of the great things. If you're hardcoding queries (not sql), you're actually defining the order of operations etc. A query analyzer will use statistics, and you can hint how queries have to be executed, depending on the shape of your data.

Your code is also just strings, until you compile it. Actually, these days until you run it. Your arguments would make more sense in the 90s where people would actually compile code.

Your ideal universe is actually "The Inner Platform Effect". Better to let pgsql be the data platform ;-)


> Your code is also just strings, until you compile it. Actually, these days until you run it. Your arguments would make more sense in the 90s where people would actually compile code.

His arguments made sense in the 90s, and amusingly, post 2015.

Your argument made sense in the 00s and before 2015.

Swift is compiled (Apple platforms; Objective-C has always been compiled).

Rust is compiled (multiple platforms).

Typescript is compiled (so web/Javascript).

Kotlin is compiled (Android; Java has always been compiled).

C/C++ were always compiled (POSIX; Windows).

C# was always compiled (Windows; POSIX).

Almost every modern language is compiled and if it's not, it's getting a very solid static analysis step that for sure you want to have and run (PHP got types a while back, Python is getting them, Ruby is getting them).


> The other problem is, you have to specify table and column names in queries (duh, how else), but that means the entire persistence model is hard-coded and needs manual tweaking and so on.

This resonates with me. If I understand you correctly, with RDBMS/SQL, the structural decisions you make in the database to represent your data "poison" your application, making it difficult to change over time.


The data model is the code reuse.

You can model business hours and SLAs with relationships. Join on time and team.

  -- Assumes team_sla has one row per team per business day/hour window.
  SELECT support_request.request_time + team_sla.response_sla AS respond_by_time
  FROM support_request 
  JOIN team_sla 
    ON support_request.assigned_team = team_sla.team_id
   AND DATE_PART('dow', support_request.request_time) = team_sla.day_of_week
   AND DATE_PART('hour', support_request.request_time) BETWEEN team_sla.start_hour AND team_sla.end_hour;


It's quite impressive how you've missed the core of my message.

Now bundle up your proposal, put it up on Github, license it as MIT, and publish it available on sqlpm.org (SQL Package Manager) so that I can re-use it.

What's that you say? I can't? There's no sqlpm.org? Not even a postgresqlpm.org?

Where's the SQL ecosystem?

Oh, wait, there isn't any because SQL is not really reusable. It's <<all>> one-off scripts, like back in the Dark Ages of software development.


I always install https://github.com/awslabs/amazon-redshift-utils/tree/master... on my Redshift clusters.

What you are describing as code reuse exists for databases, but they are called applications and generally utilize a general purpose programming language. It doesn’t make sense to have an SLA data model library because every use case is different. It’s a database, not procedural code.


You must not have seen enterprise apps.

There's a reason monstrosities like SAP exists, they're practically what you describe.

If stuff like SAP is the future of Line Of Business (LOB) apps, instead of having a rich Open Source ecosystem of data storage and data access libraries, then we've lost.

We're locked in the trunk.


SQL is not modular (the root cause of why it’s hard to reuse) because it’s declarative. If the underlying data models stay relatively stable, you can reuse code with certain assumptions. There’s a trade off here for sure.


No that's not why it's not modular. If that was the case, then other relational algebra-based query languages would suffer the same problem. They do not.


SQL can be somewhat reusable through views. Also in your code you can make reusable functions that contain SQL queries. You won't get the best performance reusing the queries this way, but you can put them into a transaction and build up more complex flows from reusable pieces that way
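A small sketch of both (PostgreSQL, hypothetical customers table):

  CREATE TABLE customers (
      id         bigserial PRIMARY KEY,
      name       text,
      email      text,
      country    text,
      deleted_at timestamptz
  );

  -- A view packages a query for reuse.
  CREATE VIEW active_customers AS
  SELECT id, name, email, country FROM customers WHERE deleted_at IS NULL;

  -- A SQL function parameterizes one.
  CREATE FUNCTION customers_in_country(c text)
  RETURNS SETOF active_customers
  LANGUAGE sql STABLE
  AS $$ SELECT * FROM active_customers WHERE country = c $$;

  SELECT * FROM customers_in_country('DK');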


Postgis and many other libraries exist. They are widely used.


Is Postgis implemented in SQL?


That's how I remembered it from back in early days. There was some pl/sql you had to execute, and you'd have some extensions. I don't think it's like that anymore


No man you are right. There is nothing arcane here: sql is just an ugly and unintuitive language. I think sql works the opposite way normal brains work, so unless you are doing a simple query, you have to be able to parse and simulate what sql will do in your head. After doing functions/MVs with CTEs that are 100+ lines per query daily, you kinda get used to it, but also not. I've concluded that although sql is awesome, it is also full of warts.

What helps me is to code all of it in lower case and use something like DataGrip with a good theme. That way you get something that is readable, colour coded, and has autocomplete (very good with joins). It's the only way I've managed to keep my sanity as my experience with it grew. Bad data models don't really impact it that much; sql is still sql even with a clean model.

I've built mini database engines in the past because of my frustrations with sql, but I still use and prefer an actual RDBMS as opposed to trying to reinvent the wheel. There are so many features we take for granted it's not even funny. Try building your own production-ready storage system and you would quickly appreciate how deep the rabbit hole really goes.


I have used those features that you say "Would be nice to have." I didn't realize they weren't ubiquitous. I agree they are excellent.


One very very simple fix is to mention the table first:

FROM this SELECT whatever

This already allows autocomplete for the attributes to work, and has an easier mental model - you think about the tables, then you think about their attributes. It also matches relational algebra better, where you'd do the projection (picking the attributes you want) at the end.

But anyway, simple cases being simple doesn't mean the language isn't horrible for more complex ones.

One thing I always complain about is join clauses making it easy to do the wrong thing (NATURAL JOIN) and annoying to do the correct thing (joining on the defined foreign keys).


People have long joked about "yoda conditionals" ("if 5 == x", for example, instead of "if x == 5"), and flipping it in that order is the same thing: SELECT first is the same as "Get the fork from the drawer", where the flipped order actually sounds like something Yoda would say.


It is more like starting the recipe with "take sauce from point 6 and pour it over meat from point 10".

Very few people have problems with trivial queries like "SELECT x FROM y", but when a query contains multiple joins or inner queries, having SELECT at the beginning is visibly problematic.


What I did to get better at 'big' queries was start by writing out things like inner queries as CTEs, Table Parameters, or (if in oracle, lol) refCursors.

That's (sometimes!) less performant than the one big query, but you can then refactor into a single query if you so choose.

Yeah, it's slow going at first, but you get pretty good at SQL in the process.
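e.g. a sketch of that style (hypothetical tables): each CTE names one intermediate result, and the final SELECT reads top-down:

  WITH recent_orders AS (
      SELECT * FROM orders
      WHERE created_at >= now() - interval '30 days'
  ),
  totals_per_customer AS (
      SELECT customer_id, sum(total) AS total_spent
      FROM recent_orders
      GROUP BY customer_id
  )
  SELECT c.name, t.total_spent
  FROM totals_per_customer t
  JOIN customers c ON c.id = t.customer_id
  ORDER BY t.total_spent DESC;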


a[5] means "From array a, select element 5", and nobody has a problem with that. If anything, the English habit of describing things in little-endian fashion ("Take a fork from the drawer to the left of the sink in the large kitchen") adds a lot of cognitive overhead because you have to wait for the end to figure out your first step, and then reverse the order. A much more practical way of writing that would be "In the large kitchen, to the left of the sink, there's a drawer; take the fork from the drawer."


It's kind of nice to see which columns you're selecting, as they're always in the beginning


Right, that's a sort of counterpoint. It is nice to know that you're going to be getting a fork so you have context for the kitchen -> sink -> drawer instructions. Similarly, I expect in many cases, when you have a complex query, the fact that you know the result is going to be "userid, sum(score)", it makes the subsequent query easier to understand (as you know where it's got to end up). "Can you get me a fork? In the kitchen, next to the sink, in the drawer" might be even friendlier.


The question is whether the syntax corresponds to the logical order of operations.

In the query "SELECT foo + 2 FROM bar WHERE baz ORDER BY foo", the logical order is actually "FROM bar WHERE baz SELECT foo + 2 ORDER BY foo" because of how each clause depends on the previous.

The SQL syntax is neither the logical order nor the direct reverse - it is just a random jumble.


SQL syntax is designed for reading, which is the majority of coding time. What gets selected is generally the thing most cared about so it goes first, where stuff comes from is the next most important so you get FROM and JOIN. Filtering and aggregation are next, because they are often hinted at by the select list anyway, then sorting is the least important so it comes at the end.


> What gets selected is generally the thing most cared about so it goes first

I disagree. When you're reading a query you often already know what you were trying to select, you just have a problem with a clause somewhere. This suggests that the select should go last, and this makes perfect sense as Haskell's comprehensions and C#'s LINQ are both way easier to work with than SQL.


You know what you want to select at the time of writing, but 3 months later when someone else needs to add column foo, the first two thoughts are: is it already in the select, and is it available to the select without more joins. I've rarely ever touched filtering or aggregating lines in an existing query unless requirements completely changed, which is less common than wanting more values added to an existing query.


I disagree. I've used C#'s LINQ extensively, where the select is at the end and adding columns is trivial. It's a complete non-issue.

On the other hand having the select at the beginning has all kinds of problems for autocomplete, and syntactically obscures where you're selecting from and the clauses. I recommend you try LINQPad if you want experience with how much better this works:

https://www.linqpad.net/

> I've rarely ever touched filtering or aggregating lines in an exiting query unless requirements completely changed

Requirements change or bugs are discovered in the query. This is far more common than you imply.


As long as you have each operation in the expected order, SQL syntax might make sense. The problem comes when you need something outside this template, e.g. a projection after an aggregation. The straightforward syntax becomes unnecessarily complex and hard to read if you venture just a bit outside of the default template.

SQL syntax is like if all arithmetic expressions had to be addition followed by subtraction followed by multiplication. And if you didn't need to add anything you would just have to add 0.


It would perhaps be a good thing if one could write SQL clauses in their logical ordering; as [1] explains:

* The FROM clause: First, all data sources are defined and joined

* The WHERE clause: Then, data is filtered as early as possible

* The CONNECT BY clause: Then, data is traversed iteratively or recursively, to produce new tuples

* The GROUP BY clause: Then, data is reduced to groups, possibly producing new tuples if grouping functions like ROLLUP(), CUBE(), GROUPING SETS() are used

* The HAVING clause: Then, data is filtered again

* The SELECT clause: Only now, the projection is evaluated. In case of a SELECT DISTINCT statement, data is further reduced to remove duplicates

* The UNION clause: Optionally, the above is repeated for several UNION-connected subqueries. Unless this is a UNION ALL clause, data is further reduced to remove duplicates

* The ORDER BY clause: Now, all remaining tuples are ordered

* The LIMIT clause: Then, a paginating view is created for the ordered tuples

* The FOR clause: Transformation to XML or JSON

* The FOR UPDATE clause: Finally, pessimistic locking is applied

[1] https://www.jooq.org/doc/latest/manual/sql-building/sql-stat...


I recently had to train my new junior to level up his SQL skills and I found this resource pretty helpful to make sense of this mess. https://learnsql.com/blog/sql-order-of-operations/

However, I do also see the point of "SELECT first": just like a header, you can infer the output data structure of a sub-expression without necessarily diving into the meat of it. It requires some brain training, but once you get there it's oftentimes easier to navigate 100+ line scripts by jumping from header to header (usually organized as CTEs to make the code cleaner).


> and annoying to do the correct thing (joining on the defined foreign keys).

Maybe you’ve seen this thread already; a proposal with some alternatives on how to improve the situation for joining in foreign key columns, but in case not here is the link:

https://news.ycombinator.com/item?id=29739147


That proposal unfortunately requires you to name the foreign key constraint, which is quite unergonomic.

E.g. instead of the wrong `FROM a NATURAL JOIN b` you would use the correct `FROM a JOIN FOREIGN a.foo_fkey`, which not only needs that second name but now also loses the immediate naming of b. So e.g. autocomplete would have to look up the foreign key constraint to find out the second table. And it's still longer and harder to use than the natural join!

Most databases have one foreign key from a given table to another given table, and that simple case should be made easy to use.


There are multiple alternative syntaxes suggested in the proposal, one of them doesn’t use foreign key names, but instead the foreign key column names, similar to USING (), but without the ambiguity:

https://gist.github.com/joelonsql/15b50b65ec343dce94db6249cf...


This is my biggest, high-level thing too; in the syntax we use SELECT to mean a projection and FROM the selection.


> This already allows autocomplete for the attributes to work.

So does the other way around in several SQL engines.

If you write something along the lines of select x.ID, y.NAME from bla.bla as x join hum.hum as y on x.ID = y.FK in MSSQL, you’ll get autocomplete on x. and y..

You’re right that it’s more intuitive to write the from first of course.


You can't possibly get autocompletion on x. and y. for that first select if you didn't write that from clause yet (or at least the autocompletion you'd get would not be tailored to those tables).


If you add the table name you could, and that's what "x." here is.

So yes, you can autocomplete

SELECT employee.Na<TAB>

to "employee.Name", but it requires you to type the table name "employee." first.

But with the from-first style you can autocomplete even bare column names - you know you have "name" (possibly even "employee.name" and "supervisor.name") and "employeeID".


Except that the table name isn't necessarily going to be that x. If you are matching employees with their managers then you have two employees tables in that expression so you have to work with aliases. At which point autocompletion breaks down.


You're familiar with CTEs, right?


Yes, you can work around some inadequacies of SQL by bolting more features on top.

That doesn't mean that the basic design of SQL isn't awkward.


SQL syntax assumes queries have operations in a certain order - join, filter, group, filter again, project. What if you want to join after a grouping? What if you want to filter after a project? What if you want to group over a projection? You will have to use the clunky subquery syntax or WITH-clauses.

Compare to LINQ-syntax in C#, where you can just chain the operations however you want.

Another issue is that you can't reuse expressions. If you have an expression in a projection, you will have to repeat the same expression in filters and grouping. This leads to error-prone copy-pasting of expressions or more convoluted syntax using subqueries.
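For example (standard SQL behavior, hypothetical products table): a SELECT alias isn't visible in WHERE, so the expression gets repeated or pushed into a subquery:

  -- The alias "margin" can't be referenced in WHERE, so the expression repeats.
  SELECT id, (price - cost) / price AS margin
  FROM products
  WHERE (price - cost) / price > 0.4;

  -- The usual workaround: wrap it in a subquery (or a CTE, or a lateral join).
  SELECT id, margin
  FROM (SELECT id, (price - cost) / price AS margin FROM products) p
  WHERE margin > 0.4;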


And for expression reuse, they're so close with aliases.

lots-of-bla-bla-bla-bla as short-name

But later on you can only refer to short-name from very specific places, as you mention. So 80% of the time you're forced to go lots-of-bla-bla-bla-bla over and over and over again.

Snatching defeat from the jaws of victory.


I really, really miss LINQ, having moved from C# to Ruby, then Node and Go. LINQ is as close to absolute perfection as I’ve seen in a concept.


That's the first / starter use case though, SQL can get a bit crazy once you get into enterprise spaces - stored procedures, funky datatypes, auditing & history features like temporal tables, hundreds, thousands of tables and a similar amount of columns, naming & organizing things, etc.

Thankfully, most people will never have to deal with any of that, myself included. The biggest databases I've had to deal with were very relatable - one about books & authors, another about football and historic results. The other biggest database is one I'm working with and building right now, it's a DB for an installation of an application managing tons of configurations, a lot of domain specific terms. The existing database is not normalized of course, and uses a column with semicolon-separated-values as an alternative to foreign keys. Sigh. Current challenge is to implement history, so that a user can revert to previous versions. I'll probably end up implementing temporal tables in sqlite.


I built and operated an employee database in accordance with the Danish OIO model for years; I even sat on a committee to define other models within the OIO model set for the public service.

These days I work with millions of entries from solar production.

I’ve never had to use complex SQL more than one time.

You use tools like SSIS or APIs on top of it to get and store the data.

I know you “can” create a lot of stored procedures and views, but as I’ve already said, you really, really shouldn’t do that exactly because it’s so terrible to work with for so many people.

Honestly though, SQL with an OData API on top of it is one of my favorite ways of storing and retrieving data. If you have to actually transform the data, you do it with SSIS or similar tools that are much more efficient top-level layers, and that are also testable and reusable.

But to each their own I guess. The join logic never bothered me much, and that seems to be an issue for a lot of people here.


True, keep the database as dumb as possible IMO. I have converted 200 line SQL queries into 30 lines of SQL plus 20 lines of code for OLTP. OLAP is a different beast though and SQL can get nasty.


1) it is committee driven therefore changes come slowly

2) adding new functionality requires addition of new keywords

3) you cannot define new keywords from SQL

4) despite standardization, each implementation differs

This article summarizes it pretty well; while I do not agree with everything in it, it points out the flaws pretty well.

https://www.scattered-thoughts.net/writing/against-sql/


> 2) adding new functionality requires addition of new keywords

There are reserved keywords and unreserved keywords. The latter can be used as table/column/function/etc names, and don’t cause any trouble.

New syntax can be invented by reusing existing reserved keywords, and introducing new unreserved keywords in places where they can’t be misinterpreted.

Not saying the problem you describe isn’t a problem, just that it’s slightly more complicated and not as bad as one might think when reading your comment.


A problem is that SQL does not cleanly map to what the DBMS does to execute it. For simple queries, this is exactly the point. For more complex queries, the SELECT-FROM-WHERE straitjacket feels quite restrictive though.

The abstraction really hurts when you have to optimize slow queries and convince the optimizer to do it the right way. Entering the query plan directly (essentially, the annotated AST of an expression of relational algebra) would often be helpful.

Also, SQL is ultimately text. This makes it very cumbersome to build tools that dynamically assemble queries, like ORMs or customized search dialogs, and to insert parameters. Parser performance impacts overall DBMS performance quite a bit, and it would be useful to reduce overhead there too.


Yep, every abstraction introduced has added cost, in some cases. The fastest way of executing a program is to build special hardware with the perfect state changes of a set of transistors — I am not joking.


"SELECT whatever FROM thisplace" is trivial to improve and could be for example thisplace[whatever].

With joins it gets more complex but still SQL could allow using foreign keys and having SELECT at the end. You could get at least something like:

"FROM Invoice i JOIN i.customer c SELECT c.name, i.number"

instead of

"SELECT c.name, i.number

FROM Invoice i

JOIN Customer c ON i.CustomerId = c.CustomerId"


Isn't a "WHERE" clause more intuitive than that?

SELECT c.name, i.number

FROM Invoice i, Customer c

where i.CustomerId=c.CustomerID"


I prefer JOIN clauses because it makes it easier to reason about the underlying implementation (hash/sort-merge joins) than thinking about cartesian products. It's also much harder to screw up the ON predicate and actually cause a cross join.
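For instance, with the tables from the example above, forgetting the predicate fails very differently in the two styles (sketch):

  -- Comma-style: forget the WHERE and you silently get a cartesian product.
  SELECT c.name, i.number
  FROM Invoice i, Customer c;   -- oops, no join condition

  -- Explicit JOIN: the ON clause is part of the syntax, so the join
  -- condition is much harder to leave out by accident.
  SELECT c.name, i.number
  FROM Invoice i
  JOIN Customer c ON c.CustomerId = i.CustomerId;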


> I mean, how would you write a SELECT that was better than: SELECT whatever FROM thisplace?

Haskell's comprehensions. C#'s LINQ. F#'s query providers.

SQL is really not that good actually. It had (and still has) all sorts of limitations that eventually led to new syntax, and has many quirks that require all sorts of workarounds.


There’s some good discussion of the deficiencies of SQL here: https://opensource.googleblog.com/2021/04/logica-organizing-...

> Good programming is about creating small, understandable, reusable pieces of logic that can be tested, given names, and organized into packages which can later be used to construct more useful pieces of logic. SQL resists this workflow. Although you can encapsulate certain repeated computations into views and functions, the syntax and support for these can vary among implementations, the notions of packages and imports are generally nonexistent, and higher-level constructions (e.g. passing a function to a function) are impossible.


Try a 3-way M-M join. Or recursive CTE's for hierarchical data. It always looks clean with simple examples. Data these days is much more nested and inter-related.


I love SQL, but I also love being a contrarian.

thisplace.whatever

There you go.


This looks a lot worse if thisplace is a long subquery itself.


I'd flip the FROM and SELECT, just like in UPDATE and DELETE commands.


> for better or worse

For better. The only other success story just like it is JavaScript, which remains the one and only native browser programming language.

They work, they are good enough, everybody knows how to use them, and gaining skills in those languages is valuable and timeless. No inane churn of languages du jour, like golang, that (fail at) reinventing the wheel every few years.

SQL remains beautifully boring & useful and is as close to program language perfection as we will ever get.


> The only other success story just like it is JavaScript, which remains the one and only native browser programming language.

Javascript is a great example of exactly why this sort of thing is awful - Everyone has to use libraries like React or Vue just to make it usable, it's filled with weird backwards-compatibility junk, nobody can replace it because it is so entrenched, and attempts to make it better (typescript) end up with having to transpile backwards to javascript (rather than being able to just stand on their own).

The sooner we move to a world of web-assembly the better (but even web assembly frameworks at the moment end up having a substantial mix of javascript). We shouldn't have languages that are standard just because they are standard.


  > We shouldn't have languages that are standard just because they are standard.
This is extraordinarily naive. Standards are a far more important invention than Javascript, or even the transistor. A standard existing just to have a standard is far better than every browser implementing its own scripting language. We had that once, in fact my personal website still has the text "This website is not compatible with MS Internet Explorer. Please upgrade to Chrome, Firefox, or Opera for optimal experience." even though that hasn't been true for over a decade.


> This is extraordinarily naive.

I obviously disagree, and think that's a pretty dismissive comment.

> A standard existing just to have a standard is far better than every browser implementing its own scripting language.

We shouldn't be stuck with a language just because one person decided to invent a language in 10 days over 26 years ago for the internet at the time, and now we are stuck with that language forever with a VERY different internet. I would like to think in 10 years time we can move to a web where people have a choice of language.

And that doesn't mean what you imply - which is that every browser has its own scripting language - because it's possible to architect an environment which allows for multiple programming languages in the browser (see bytecode, JVM, CLI, webassembly).

Is it really that naive to think that's a better way forwards?


I agree, better a mediocre standard than none. Out of interest, for Javascript there are many languages that transpile to javascript but support quite different approaches like functional programming, strongly typed, etc.

Is there something similar for SQL, where you allow an alternative syntax and maybe programming approach and then use SQL as connection to the DB ?


Reading through these posts, I had exactly the same thought. Why not have a language that takes the problems with SQL and abstracts them into something that is more easily organized into functions, modules, and packages, corrects some of the semantic problems, integrates well with source control, testing, deployment pipelines, etc.? DBT is solving this problem well for data warehouses, but I do not know if a similar tool exists for application databases.



I'll take the clunkiness of SQL over the clunkiness of Mongo's json-based query language any day.



Which in itself is confusing given the upside potential of having the relational model without the baggage. Hadley Wickham has had huge success in the context of R re-implementing the relational model without SQL (see: dplyr). It is more user friendly and much more pleasant to use than the mess that database users have to put up with.


dplyr and friends are great! But probably difficult to implement the same idioms in languages less open to features like R’s non-standard evaluation.


But you could also use SQL for that new RDBMS, or support multiple query languages. It doesn't matter, all relational databases are very similar because they are relational: they do the same things, potentially in the same way, regardless of the query language front end.

Apart from being subjectively displeased at SQL syntax, most problems with SQL are actually query language independent issues in the RDBMS: maddening proprietary extensions, implementation limits, nonportable details, library and system issues (e.g. character encoding and default configurations).

The best alternative query languages can do is making certain queries (advanced, rare ones) easier to express.


One branch that was super-popular in the day was the dbase family (I work in FoxPro, it was THE dream).

Working in Fox dispels many of the weird limitations of SQL (and BTW, you can use SQL in Fox much like LINQ; it's first-class).

The #1? You can build ALL of it using the relational model. After MS killed Fox, my career can be summarized as: trying N workarounds because I don't have Fox anymore.

Other langs (python, delphi, f#, rust) capture a little of the magic or make the workaround bearable, but none is as productive.


Your intuition is correct. The book "Designing Data Intensive Applications" by Martin Kleppmann goes into detail about this, including some of the history.


It's for similar reasons that people call vacuum cleaners 'hoovers', they 'google' for things on DuckDuckGo, they 'Xerox' documents on a Canon copier, all on-road reflective lane markers are a 'Catseye', etc. It may not be accurate but I don't find it at all odd.


Tutorial D is one such example of a non-SQL relational db language


SQL is simply converted into a logical relational algebra tree, so anything that can do that could be used. It's a shame that PSQL doesn't provide an API for this; then all the quirks of SQL could be fixed. However, we would then probably have a proliferation of non-standard SQL, like what happened with JavaScript, and standard tooling couldn't be used.
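
To illustrate the first point, a rough sketch of the kind of logical tree a planner builds before any optimisation or pushdown (operator names vary by engine):

    -- SELECT c.name
    -- FROM customers c JOIN orders o ON o.customer_id = c.id
    -- WHERE o.total > 100
    --
    -- roughly becomes:
    --
    --   Project(name)
    --     Filter(total > 100)
    --       Join(customers.id = orders.customer_id)
    --         Scan(customers)
    --         Scan(orders)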

SQL is somewhat nice because it's fairly well standardized.


As the pirate meme says, "Well yes, but actually no".

This is kind of a sticking effect. It got in at the right time, it was good enough, and there wasn't enough interest in developing something to replace it, so it just became the standard.

But if you want to get down to it, any entity data model implemented is kind of an attempt at a replacement.


  > You could make a relational database that was managed with a language other than SQL, right?
Other commenters have shown examples of non-SQL relational databases. However, all the alternatives were also examples of Structured Query Languages. Languages designed to query a data store with _Structure_.


Well if NoSQL had called their movement "non-relational databases" that would have been an obviously bad idea, right?

But the idea of never writing SQL again remains compelling to a certain sort of mindset, one which I have been known to share when throwing myself against the proverbial wall.

SQL isn't a bad dialect with which to hand-roll queries against relational data. I'm quite sure there's room for improvement, especially when it comes to generating queries; I've dipped my toes into this, and the compiler has to 'think in SQL' eventually - it's not composable in the way that it should be.

But "SQL is annoying and ORMs are terrible" is, from my recollection, the sentiment which gave us the meteoric (pun intended!) rise of Mongo, and "relational modeling isn't negotiable" is the iron fact of our profession which lead to its ignominious fall.


NoSQL was trying to “break” the relational model. It just happens to have “SQL” in its name. I intentionally avoided associating relational model with SQL in the article. Maybe I failed lol.


You're absolutely correct, but in practice I'm not aware of any relational databases with widespread adoption that don't use SQL.


RethinkDB comes to mind. "Widespread adoption" is a very fungible term.


As an FYI, I don't believe that's a correct use of the term "fungible".


Indeed. Two things are fungible if one can be replaced by the other.


Fungible is very fungible.


I have grown to understand that the relational model is the answer for solving all hyper-complex problems. The Out of the Tar Pit paper was a revolution for my understanding of how to approach properly hard things:

http://curtclifton.net/papers/MoseleyMarks06a.pdf

The sacred artifact in this paper is Chapter 9: Functional Relational Programming.

Based upon inspiration in this paper, we have developed a hybrid FRP system where we map our live business state to a SQLite database (in memory) and then use queries defined by the business to determine logical outcomes or projections of state for presentation. Assuming you have all facts contained in appropriate tables, there is always some SQL query you could write to give the business what they want. An example:

> Give me a SQL rule that says the submit button is disabled if the email address or phone number are blank/null on their current order.

  --Disable Order Submit Button Rule
  SELECT 1          -- 1 == true, 0 == false
  FROM Customer c, Order o
  WHERE o.CustomerId = c.Id
  AND c.IsActiveCustomer = 1
  AND o.IsActiveOrder = 1
  AND (IsNullOrEmpty(o.EmailAddress) OR IsNullOrEmpty(o.PhoneNumber))
I hope the advantages of this are becoming clear - You can have non-developers (ideally domain experts with some SQL background) build most of your complex software for you. No code changes are required when SQL changes.

The relational model in this context is powerful because it is something that most professionals can adopt and collaborate with over time. You don't have to be a level 40 code wizard to understand that a Customers table is very likely related to a ShoppingCarts table by way of some customer identity. If anyone starts to glaze over at your schema diagrams, just move everything into excel and hand the stakeholders some spreadsheets with example data.


Ahhh... a fellow traveller on the road.

I have been hacking on a Clojure/Script library (https://github.com/wotbrew/relic) to experiment with exactly this kind of thing. / PLUG

For most problems I encounter I want to focus on the data and its relationships.

I cannot help feeling like programming languages and idioms have me deal with too much 'mechanism': the machine-like structures and forms that are mostly incidental to the data domains I work in, and are entirely about performance, layout, access patterns, etc. - when a machine could do an OK job of making those decisions for me until optimisation becomes important.


Glad to see there's more than one of us out there.

> I cannot help feeling like programming languages and idioms have me deal with too much 'mechanism'

Absolutely. Most "best practices" represent a lot of noise that gets in between me and solving my problem. My latest prototypes have been reduced to using static classes/methods in virtually all areas. Getting everything out of the way is so important. When data is king, your code can turn into this ultra-utilitarian apparatus.

I've had partial success writing code generators that can output supporting infrastructure models/code based upon my SQL table definitions. Assuming those generators are well-tested, high-quality code supporting a schema of arbitrary complexity could be produced instantly.


Do you have any additional resources about this model of thought? It's like Redux on steroids lol. I wonder if anybody has done a SQLite-as-the-Store pattern library for front end apps before. I'd use the hell out of that!


One of these days I am just going to have to write a book about it. There are so many layers and perspectives to consider. Maybe another small rant about our roadmap will help you see more clearly how it could work for you:

I am currently looking at an iteration that will use event sourcing at the core (i.e. append-only logs which record the side-effects of commands), with real-time replays of these events into SQLite databases (among other in-memory working sets). The SQLite databases would serve as the actual business customer front-end. We would also now have a very powerful audit log query capability that would go directly against this event source. I would just think about the database as the layer the business can communicate with. It is your internal/technical customer. As long as that database is proper, everything else downstream works without you thinking about it.

The biggest reason for pursuing this is to decouple the schema from the logical reality as much as possible. The business likes to change their mind (sometimes for very good reason) and we have to keep them honest at some level. As proposed here, that source of truth will be the read-only event logs.

When you look at this on a whiteboard, you may recognize that it resembles a textbook definition of CQRS. Perhaps try reading up on: CQRS, event sourcing, database normalization, SQLite's application-defined function capability, and anything else that looks adjacent.
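
To make that a bit more concrete, a minimal sketch of the two sides (table and column names are purely illustrative):

    -- Append-only event log: the source of truth
    CREATE TABLE events (
        event_id   INTEGER PRIMARY KEY
      , event_type TEXT NOT NULL      -- e.g. 'OrderPhoneNumberChanged'
      , payload    TEXT NOT NULL      -- JSON body describing the command's side effects
      , created_at TEXT NOT NULL
    );

    -- One of the projections replayed into an in-memory SQLite working set;
    -- this is what the business-authored SQL rules actually query
    CREATE TABLE orders_current (
        order_id      INTEGER PRIMARY KEY
      , customer_id   INTEGER NOT NULL
      , email_address TEXT
      , phone_number  TEXT
      , is_active     INTEGER NOT NULL DEFAULT 1
    );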


So you are talking about something like SQLite-as-the-API? I was a big fan of this idea for a while. Difficulties in streaming query updates from SQL dbs and difficulties in working with deeply nested, hierarchical data made me shy away from SQL though.

Having a client-side relational db to work with would remove huge amounts of complexity to most client apps though.


Client-side databases are relatively popular in ClojureScript land. DataScript and Fulcro both come to mind immediately. DataScript is an immutable in-memory database that uses Datalog to describe queries and Fulcro is a full-stack framework that automatically normalises your database.


> SQLite-as-the-store pattern

I remain consistently amazed at what one can do with lowly little SQLite. I can't count the number of times I struggled with some script's performance issues when handling a large amount of data that came through as a CSV or text file, which immediately evaporated when I realized "why am I asking the program to handle a million tuples at a time via internal data structures when an indexed database would do it so much faster?"

An in-memory (or even SSD-resident) SQLite database can be ridiculously fast for handling in-process data and not just the stuff you want to keep when you're done.


I think your example is a solid illustration as to why you may not want non-developers writing SQL:

    SELECT 
        CASE WHEN o.EmailAddress IS NULL OR o.PhoneNumber IS NULL
            THEN 1
            ELSE 0
        END AS [IsSubmitDisabled]
    FROM Customer c 
    INNER JOIN Order o ON c.Id = o.CustomerId
        AND o.IsActiveOrder = 1
    WHERE c.IsActiveCustomer = 1
These queries aren't identical in their output, though they are logically the same. Is that a good thing? What are the implications of allowing such ad-hoc queries within a system?


The SQL was only intended for illustrative purposes against a contrived domain model. Actual correctness of the query in whatever hypothetical domain was imagined at that time is not something I would focus too much on.


My point exactly. "Not focusing too much" on the SQL is a road you likely don't want to go down, and is rather inevitable when you bestow that power unto non-developers! (You were too quick. I made an edit)


Why does someone need to be a "developer" to be highly-competent at authoring SQL?


You know that's not the point I'm making.

But even for authors that _are_ highly competent at authoring SQL, it is dubious to allow them to compose ad-hoc queries ad-nauseum (let's even pretend these "developers" are _also_ excellent at performance tuning).

This is death-by-a-thousand-cuts. The approach is just too granular. Databases _notoriously_ don't have unlimited throughput, and depending on your particular RDBMS it may be prohibitively expensive to just "throw more hardware" at the problem (I'm looking at you MS SQL Server).

For any system that you can imagine there is a point (i.e. scale) at which the above paradigm becomes a very big problem. Believe me. We are in the process of moving logic _out_ of SQL for this exact reason at my current company!


Depends what you mean by developer, but there are heaps of ways to accidentally write code that doesn't use an index, and then you need to understand DB internals and how indexing works - and so you are on your way to being a SQL dev. A non-developer I wouldn't expect to understand DB internals. You can get quite far as a non-dev without understanding DB internals for simple querying, until you write some complex mission-critical query that must keep running but needs maintenance / starts over-burdening the DB / etc.
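
For a trivial example of the index problem (SQLite flavour, against a hypothetical orders table) - both queries look harmless, but wrapping the indexed column in a function usually forces a full scan:

    SELECT * FROM orders WHERE created_at >= '2022-01-01';            -- can use an index on created_at
    SELECT * FROM orders WHERE strftime('%Y', created_at) = '2022';   -- typically can't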


The promise of low code.

I’m sure there is a time and place for it (Power Apps is a billion dollar business), like time keeping frontend logic. Apps with significant logic and complexity will always need developers.

I’m not critical of your app as I have no real understanding of what it does and why the design choices were made. I do not think embedding logic in SQL would be an ideal way for any app I’ve worked on to work.


> The promise of low code.

Not exactly. We still have shitloads of code. It has just been modularized & standardized such that each component can now mostly be configured by SQL the business writes.

Our product is a B2B application, and each one of our customers has very precise & unique business rules they would like to see applied.

Every customer uses the same code pile, but with varying configuration (SQL).


These contrived examples seem appealing at first, but when you start having deeply relational, nested data, with M-M joins, the relational model is much less appealing. SQL is just one popular declarative approach that people are familiar with and is reasonably easy to read by non-devs for simple things.

Modeling all data on SQL is definitely an improvement on most ad-hoc data manipulations people do.

But say you want to stream updates from your SQL queries, this is something the relational model and SQL dbs are not great at doing efficiently and actually hinder.


In my limited experience, code built around relational databases is difficult to grok. The structure is inverted with respect to typical OOP. Rather than having a sane class hierarchy, where one can start at the top and navigate down to understand how objects are nested, the structure is inverted, piecing together class relationships requires looking at the tables and following foreign keys which effectively point to parent classes. It feels backwards.


> In my limited experience, code built around relational databases is difficult to grok. The structure is inverted with respect to typical OOP. Rather than having a sane class hierarchy, where one can start at the top and navigate down to understand how objects are nested, the structure is inverted, piecing together class relationships requires looking at the tables and following foreign keys which effectively point to parent classes. It feels backwards.

Other than your first sentence (which is subjective), you are correct. The only question is whether you are going to mangle the database structure to fit your OO hierarchy or design your program in a non-OO way to fit the relational structure.

Since the database will live on long past the program, and will have multiple programs talking to it, it makes sense to design your program around the data, not design the database structure around your program.

[EDIT: See https://blogs.tedneward.com/post/the-vietnam-of-computer-sci... for why OO is a terrible design for persistent data]


I think it’s worth noting that “database-centric” and “app-centric” notions of databases are both found in the wild. That’s how I remembered the difference between MySQL and PostgreSQL—MySQL was what happened when a bunch of app developers needed a database, and it was extremely popular e.g. with the PHP webdev crowd. If you needed to access the database, you went through the app. PostgreSQL is what happened when DBAs designed a database. If you needed access to the database, you connected to the database. A lot of other databases can be understood this way… like how MongoDB further shifts database concepts into the application.

(Honestly I’m definitely in your camp… if I need a database, design the database first, then write the code.)


You're right: object orientation and relational data don't mesh well together.

Where you're wrong, and it may take more experience to realize, is that the problems in the mismatch lie far more on the OOP side than on the relational side.

Object oriented designs - the navigable graphs of objects, leaving aside polymorphism - usually privilege a particular perspective of the data, a perspective which is suited to a specific application. For example, a store with items, customers and orders: the application is focused on creating transactions. Relational models, on the other hand, support multiple perspectives "out of the box" and are expected to support multiple applications; for example, reports that roll up by item, by category and by customer; reusing the same customers for a different application; and so on.

Different applications and perspectives usually require different object models. The alternate application reusing the same customer entities won't want to couple with a store application and decorate every customer with an irrelevant list of orders. A reporting app is better off with a visual query builder that creates SQL to hand off to the database than trying to iterate over object graphs performing aggregations and sorts. And so on.

For applications with ambitions for a longer lifespan, start with the database schema. Keep application models fairly thin; focus on verbs; try to ensure only one piece of code owns mutating a logical database entity with a set of verbs; and keep entanglements like UI out of those verb implementations so that you reuse the verbs, rather than being too tempted to reimplement direct database modifications or push too much logic into models.

Object oriented models come and go, applications are rewritten, but database migrations happen at a much slower pace, and data lasts for a very long time.


Class hierarchies are not "sane", they're inherently hard to extend in a way that preserves a consistent semantics. This is exactly what the relational model is intended to fix.


Foreign keys do not correspond to class hierarchies in OOP. If anything, they correspond to object composition.

E.g. you might have a Customer table and an Order table with FK from Order to Customer. This would correspond to a Customer object holding a collection of Order objects - it would not correspond to Order being a subclass of Customer.

In any case, the relational model is not supposed to map directly to any particular programming model or paradigm. You should be able to interact with relational data from OOP, imperative, functional and declarative languages.


> Foreign keys do not correspond to class hierarchies

If a foreign key is primary in two tables (1:1 relation), then it can correspond well to a hierarchical relation. In other cases, foreign keys can also correspond to composition as you say, and it depends on intention.

An example of inheritance could be the table Parties with primary key Party_ID. The table Customers has primary FK Party_ID (Parties.Party_ID). If you must use ORM, you can implement this is as Customer extends Party.

The class Customer then inherits whatever ability the class Party has to access further tables with FK Party_ID, ex something like PartyAddresses.

This might make more sense if you also have a table Providers with primary FK Party_ID (Parties.Party_ID), and the table Orders with FKs provider_Party_ID (Providers.Party_ID) and customer_Party_ID (Customers.Party_ID).

Normalizing like this allows you to have inheritance, polymorphism, and encapsulation in OOP.
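
A rough sketch of the tables described above (names are illustrative, syntax roughly PostgreSQL):

    CREATE TABLE Parties (
        Party_ID bigint PRIMARY KEY
    );

    CREATE TABLE Customers (
        Party_ID bigint PRIMARY KEY REFERENCES Parties (Party_ID)
    );

    CREATE TABLE Providers (
        Party_ID bigint PRIMARY KEY REFERENCES Parties (Party_ID)
    );

    CREATE TABLE Orders (
        Order_ID          bigint PRIMARY KEY
      , customer_Party_ID bigint NOT NULL REFERENCES Customers (Party_ID)
      , provider_Party_ID bigint NOT NULL REFERENCES Providers (Party_ID)
    );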


> piecing together class relationships requires looking at the tables and following foreign keys which effectively point to parent classes

So when Employee works for Company, Company is the parent class of Employee?


You can build data models that correspond pretty well to OOP.

Let's say you have a superclass named Person, with primary key person_pk. You can then, in the data model, implement polymorphism by creating a subclass table, for instance CustomerPerson, giving it the same person_pk as its primary key, so that the same person carries the same primary key value in both tables.

In this case, every person should be in the Person table, but only customers should be in the CustomerPerson table.

If person is implemented in your OOP by inheritance, you can leave shared fields (variables) in the Person table, and only add additional fields to the CustomerPerson table. Your program can then instantiate Person objects by simply using the Person table, and CustomerPerson objects by joining by person_pk.

If Person is just an abstract interface, you may want to have it only contain the primary key, and have all fields in the subclass CustomerPerson table. In this case, you could join CustomerPerson directly with some address-table, for instance, by person_pk, without going through the superclass (Person).

(And if you need it to also generalize to companies, you could have both Company and Person be subclasses of a class named Subject, and use subject_pk as primary key in Subject, PersonSubject, CompanySubject, PersonCustomerSubject and CompanyCustomerSubject and even CustomerSubject)

The approach above works better in my experience, than having subclasses have their own primary keys, especially if you read more than write to the db, as it can drastically reduce the number of joins you need to do, and because it prevents one-to-many relationships between subclass and superclass. (A limitation would be if you are not able to recognize a shared primary key across subclasses when creating an object.)


Postgres actually has a concept of inherited tables to make this pattern explicit in the DDL. Although I’m not sure the PK uniqueness constraint is enforced.

https://www.postgresql.org/docs/14/ddl-inherit.html
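
A minimal sketch of the DDL (and, at least per the Postgres docs, unique/primary key constraints are indeed not inherited by child tables):

    CREATE TABLE persons (
        person_pk bigint PRIMARY KEY
      , name      text NOT NULL
    );

    -- inherits persons' columns; its rows also appear in SELECT * FROM persons
    CREATE TABLE customer_persons (
        customer_since date
    ) INHERITS (persons);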


> Rather than having a sane class hierarchy

There's your problem. No such thing as a sane class hierarchy, but thinking there is leads you down a dark path where relational modelling looks "difficult to grok". Rather the mental model you've trained on hierarchical OO code is the awkward one, and it's ultimately a dead end.


One part of the relational model is that all metadata itself is available as tables. So there is a table containing the foreign key relations. You can write a query that gives, for each table, which foreign keys point to it. You're not required to look at the default view.
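
For example, something along these lines works on PostgreSQL (the information_schema views differ a bit between engines):

    SELECT tc.table_name  AS referencing_table,
           ccu.table_name AS referenced_table
    FROM information_schema.table_constraints tc
    JOIN information_schema.constraint_column_usage ccu
      ON ccu.constraint_name = tc.constraint_name
    WHERE tc.constraint_type = 'FOREIGN KEY';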


"Access Path Dependence" continues to plague our computers because they are a built-in assumption of file systems.

This is really showing its age. I want my computer to think in terms of what data is, where it came from, any other metadata I received it with, and any metadata I've added, notably, but not primarily, the various "places" I've put it.

The fact that I can't retrieve the URL I downloaded anything from, years later, no matter how many times I've moved it, is just shameful. It's cheap information our tools could be preserving but aren't.

So if you ask me what I think of the relational model, I'll tell you: it's a good idea and we should try it.


Have opened this thread a couple of times today, hoping to see comments disagreeing with the post. Still no such comments, so I'll take the duty… hopefully not causing a flame war.

I have been programming for ~15 years, the majority of my life. Web, mobile, HPC, CUDA, assembly for x86 and ARM, kernel modules, LLVM plugins, and databases… Not once in my career have I found the concept of relational databases efficient or relevant. They are frustrating to use, exceptionally slow, and generally provide SQL, or other DSLs, which is the most archaic form of query representation I can think of - not binary and not general-purpose.

It reminds me of no-code development platforms. They may work (not very well) for some super simple tasks, but as soon as you want to do something even remotely non-trivial, they put up more barriers than they provide help. From personal experience, again, I have grown to hate frontend development so much that I have tried a dozen website builders, before reverting to good old HTML/CSS (plus a bit of JS) every time I wanted to refresh my blog or my company's website.

Plus, I wouldn’t immediately dismiss the concept of Graph Databases. If we want to be truly canonical, we wouldn’t create hundreds of columns in our tables with just a few relations. The ideology is that every unique “type” (in any sense you prefer) should be in its own table, linked with the other “types” in other tables… Then theory ends and practice begins. Try implementing a fast JOIN in a relational database. Then increase the depth to 3, tracing the relations of relations of relations. Even in the non-distributed case it is a horror - both the nested SQL queries and the program that will be evaluating them. Graph DBs are designed to solve specifically that issue really well.

Another point: how does “relational algebra” suddenly make something superior to anything else? It’s not a Grand Unified Theory of Physics, not rocket science, and not even Graph Theory for that matter - the latter being the biggest and most studied branch of Theoretical Computer Science, with brilliant theorems being published even today.

Not saying that today's popular Graph DBs are good (they are mostly disgusting), but I would still much rather think of my data as a graph than as a table with some foreign keys.


My big realization was that the relational model is a constraint on your data so that relational algebra can optimize your queries for perf. But these constraints cause a whole range of issues, which always result in a poorly modeled domain as people try to work around the limits of the optimizer or the expressiveness of SQL.

I've always felt that what we need is a way to maintain a logical schema (E-R diagram / graph schema), and then the physical schema is automatically generated along with denormalizations for perf as necessary. A graph db is simply denormalizing its joins using index-free adjacency.


One problem I have with relational databases is I don't know of a good way to represent sum types - I remember seeing some possible solutions but they looked very complex and hard to understand (or dbms-specific)


can you elaborate a little? i've not heard of the term sum types before and when googling superficially they don't seem that exciting, particularly for persistent data. When would they not just be a foreign key to a table or one column for each allowed datatype (or a mix of the two)?

Sorry, I am assuming here it's my ignorance that's the issue, not knowing any real-world examples of why they are a big deal.


I don't know if this explanation is a good one, but I'll try using Haskell syntax. In Haskell, you can have product types like:

  data CartesianCoordinate = Coord Float Float
where an element of this type is expressed as `Coord x y` where x and y are both floats. Examples of elements of this type are `Coord 1.1 0.9` or `Coord (-2.9) 10.0`, etc. Product types are equivalent to structs in C, if you're familiar with C.

But you can also have sum types. Instead of starting with the general idea, I'll point out that C enums are a special case of sum types:

  data Day = Monday | Tuesday | Wednesday | Thursday | Friday | Saturday | Sunday
and then point out linked lists are more representative of the general idea:

  data ListOfFloats = Node Float ListOfFloats | End
where an element of `ListOfFloats` is for instance `Node 1.0 (Node 2.0 End)`, or `Node 0.3 End`, or just `End`.

The pipe symbol | is what makes a sum type a sum type. It means that an element of the type is either one possibility or another possibility or another possibility.

One final example is a type consisting of all possible mathematical expressions. This is also a sum type:

  data Expr = Add Expr Expr | Times Expr Expr | Negate Expr | Inverse Expr | Const Float
An element of this type can be something like `Add (Const 1.5) (Times (Const 0.2) (Const 2.8))`, which is supposed to represent the expression "1.5 + 0.2*2.8". Interestingly, you can't easily express this type in most OOP languages.

In simple set theory parlance, product types correspond to Cartesian products, and sum types to disjoint (tagged) unions.

The relevance to relational databases is that each row of a table corresponds to an element of some product type. Each row of the same table has the same product type. But there is no means of defining a "table" whose elements belong to a sum type as opposed to a product type. Why is that?


As the parent mentioned, you can encode your sum type by providing all the columns and constraining exactly one to be non-null.


Yes, you can encode it, but you shouldn't have to invent and apply this encoding manually. It's error-prone, it's less efficient, and you lose information.

It's like saying you don't need foreign key constraints as built-in concept as long as you have triggers, because you can encode such integrity checks as triggers. Technically true, but no one is going to buy that is a legitimate argument against foreign key constraints.


I guess what I wanted to say is that there is nothing in the relational model per se that prevents encoding sum types. Of course actual implementations (i.e. SQL) might lack the required syntactic sugar.


I agree, I'm just pointing out that when you have some abstraction X you almost always want to also provide its categorical dual ~X, or you end up having to write awkward encodings to simulate it.

Databases provide product types, and the dual of product types are sum types. As you point out, you can encode this in various ways, but it's not natural and very error-prone.


That’s the most sensible solution. AFAIK, however, there is neither a cross-dialect way to specify that specific constraint, nor a simple way to SELECT the name and value of the single non-null column.


but wouldn't that open up the possibility of having 2 columns set? (when in a proper sum type you can't be 2 values at the same time)


No, the constraint can require exactly one column. PostgreSQL even has an optimized builtin function for this, `num_nonnulls`.

In a recent feature I had an object field modelled as:

      type Destination = Customer of Customer | Supplier of Supplier | Warehouse of Warehouse
and the table representation was

      customer_id uuid null
    , supplier_id uuid null
    , warehouse_id uuid null
    , constraint unique_destination check (num_nonnulls(customer_id, supplier_id, warehouse_id) = 1)
It's not first class support, but manageable enough.
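
Reading back which variant is set (and its id) is also a small query, e.g. against a hypothetical destinations table with those columns:

    SELECT CASE
             WHEN customer_id IS NOT NULL THEN 'Customer'
             WHEN supplier_id IS NOT NULL THEN 'Supplier'
             ELSE 'Warehouse'
           END AS destination_type
         , COALESCE(customer_id, supplier_id, warehouse_id) AS destination_id
    FROM destinations;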


Maybe the File System shouldn’t be hierarchical but rather relational as well.


Microsoft SharePoint uses SQL Server for all user item storage, so it's an example of an all-relational file store. They did this around the time they were also experimenting with Cairo, which a sibling comment mentions. Also, as another commenter noted, you query on the metadata. Folders/directories, for instance, are represented as metadata. Installations can also describe very detailed ontologies, including ones provided by third parties. It's a lot of work.

Edit: I think I was actually thinking of WinFS which came out of Cairo later, around 2000.


That means onedrive for business uses sql server as it uses sharepoint online, interesting.


That's an interesting point. As I recall (unfortunately it's been a while), there was tension at first over who would be top dog, SharePoint or OneDrive, and OneDrive won, so SharePoint Online is actually mapped over OneDrive. MS could do that because, ironically, even though SharePoint On-prem had been entirely SQL Server underneath, SharePoint On-prem devs generally never wrote a line of SQL. They interacted with the SharePoint object model, and that was basically mapped over OneDrive. What OneDrive is underneath I don't think MS had to be as forthcoming about because of its cloud-based nature. I was on my way out by that point though.


Microsoft wondered that too and for a while developed "CairoOFS" as a possible replacement for NTFS. It was intended as a relational "object filesystem" https://betawiki.net/wiki/Microsoft_Cairo


And then we could have fun looking at the execution plan of our file access. :D

Meant in jest; though I think the idea of a "one true way" to access files is a pipe dream. The hierarchical works better on my computer than whatever scheme we've cooked up for our phones.


> The hierarchical works better on my computer than whatever scheme we've cooked up for our phones.

What do you mean? Hierarchical is how I access files on my phone. Is that an Apple thing?


I thought phones had moved to an "ownership" model where access to files goes through the applications. That said, trying right now, I see there is a files app that seems to be organized mainly by type. I'm assuming that is actually folder-based underneath?


> I thought phones had moved to an "ownership" model where access to files goes through the applications.

There are many file explorers for Android, and they all have a normal browsing interface. The only "ownership" thing I’ve encountered is with the Nextcloud client, where it makes sense as those directories and files are not necessarily on the phone.

I’m only on Android 10, but I think I would have heard if Android 12 completely changed things.


Yeah. I can only think I was confused by the few times I've tried to mess with the data. I thought I read they were trying to move away from a folder-based model.


The only thing the relational model really gives you is the ability to join two relations, which lets data decompose for storage but re-compose for many different uses. I'm not immediately seeing how that would be useful in a file context, where generally you want to look up a specific blob of data (aka a file).

A tree-based file-system is optimised for doing a search from the user's perspective: finding a file takes log(files) steps and finding related files is trivially cheap. It is likely hard to outdo that with a relational model.


"Finding files" has never been my problem. Organizing and categorizing them has. And the reason it's been a problem is because my files don't contort into a neat hierarchy.


> I'm not immediately seeing how that would be useful in a file context, where generally you want to look up a specific blob of data (aka a file).

I would really enjoy a filesystem that's a loose tag-based hierarchy rather than a strict single folder tree.

In SQL terms, something like:

   create table files (
       file_id bigint primary key
     , file_name text
   )

   create table tags (
       tag_id bigint primary key
     , tag_name text not null
     , parent_tag bigint null
     )

   create table file_tags (
       file_id bigint
     , tag_id bigint
   )
So tags are organized in hierarchies, but there can be multiple parallel hierarchies, and a file can belong simultaneously to multiple hierarchies. Say one is a flat list of tags by user, another is a flat list of tags by apps, and another represents replication or backup strategies, plus the usual directory organization.

(I'm not sure if a file should be allowed to belong to multiple tags within the same hierarchy; my gut says no.)

Let's say that as a convention we separate the root tag in each hierarchy with a colon, and other tags with a slash. Your file 'cool_code.py' may be found under the tags 'storage:sda', 'users:roenxi', 'apps:pycharm', 'pycharm:projects/cool_app/src/utils', and 'rclone:gdrive/roenxi@gmail.com/202201'.


> (I'm not sure if a file should be allowed to belong to multiple tags within the same hierarchy; my gut says no.)

Gmail is a good example. Has very flexible tagging schema. Every tag exists in a hierarchy. Emails can have multiple tags.

SQL is not good for hierarchies though. You need to use recursive CTEs or denormalize relations and things.
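
E.g., using the tags/file_tags sketch from the comment above, "all files under the 'projects' tag or any of its descendants" needs a recursive CTE:

    WITH RECURSIVE tag_tree AS (
        SELECT tag_id FROM tags WHERE tag_name = 'projects'
        UNION ALL
        SELECT t.tag_id
        FROM tags t
        JOIN tag_tree tt ON t.parent_tag = tt.tag_id
    )
    SELECT DISTINCT f.file_name
    FROM files f
    JOIN file_tags ft ON ft.file_id = f.file_id
    JOIN tag_tree tt ON tt.tag_id = ft.tag_id;

It works, but it is noticeably clumsier than the equivalent path lookup in a hierarchical store.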


This looks really cool!!


I would see it more as a different and better way to store the same and additional information. In a RDBMS an index is already storing references in a tree format. If you map that to files and folders you now have basically the same thing as a file system.

Doing it more as a DB enables the OS to use the knowledge from RDBMSes for efficiency, which I'm sure rivals the best file systems, and makes it possible to create multiple indexes and views for other use cases.

Our current view of file systems, and the knowledge we have, is heavily influenced by slow spinning disks, while RDBMSes have leveraged RAM a lot more. With today's fast SSDs the file system operates in a reality that is more like RAM than a slow spinning disk.


> In a RDBMS an index is already storing references in a tree format. If you map that to files and folders you now have basically the same thing as a file system.

Yeah, a file system stores general data, so it is very easy to map it to an RDBMS that also stores general data.

But what this is identifying is that once the relational nature of the data isn't a factor, the best lookup structure is a tree.


You don't need the relational aspect per se, but if you have indexes by filename, filetype, creation date and size (which I do not find unreasonable), it becomes trivially fast to ask the question:

> Give me all image files > 100 KiB from 31. December 2021 to 1. January 2022

While in current file systems you need to scan the entire content of file metadata to get that information. It will take a long time, especially if you have a lot of small files (think Windows C: drive).

That might not be something the average user would do by themselves, but developers of, say, image processing apps certainly would.
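
Against a hypothetical file_metadata table with those indexes, that question is a one-liner:

    SELECT file_name
    FROM file_metadata
    WHERE file_type = 'image'
      AND size_bytes > 100 * 1024
      AND created_at >= '2021-12-31' AND created_at < '2022-01-02';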


Might be the next big thing. But if that is a valuable feature it is _really_ easy to implement. App, sqlite, scan file system once, build up a database of metadata then put 3 indexes on it. I expect a lot of photo apps can already do that.

But all that still isn't really leveraging the power of the relational model. That is simply indexing files on a lot of different attributes, ie, leveraging an RDBMS implementation detail where they use trees. The point of the relational model is relational algebra (SELECT, JOIN & WHERE in SQL terms). And WHERE isn't the interesting one out of those 3, it is JOIN.

If the use case for a relational filesystem is interesting filters then it sounds a lot like a false start.


Of course it's simple to do for every app. But they have to keep that database updated, either by running a service and listening to all changes to the file system, or by rebuilding it from scratch every so often. Neither is a good idea from a performance point of view.

I can see where you're coming from with the false start if looking at it from that isolated point of view. I see it more as the first step towards storing data, in general, in an RDBMS, and once applications start to utilize that, new use cases will start to emerge. Linux in particular, with its "everything is a file" philosophy, seems well suited to this model. A table for processes, files, network connections, etc.


Go meta, and consider projecting distinct hierarchies on document collections. For example, I may have a hierarchical view of my documents organized by content matter, and another one organized by projects, etc. So there is your JOIN.


> A tree-based file-system is optimised for doing a search from the users perspective, finding a file takes log(files) steps and finding related files is trivially cheap. It is likely hard to outdo that with a relational model.

A tree structured file system is effectively conflating an index with relations. Another way to look at it is that most FSes are like databases which only allow one index, and expose the index to users.


The filesystem is used by different people for different purposes. Any attempt to make sense about “what makes sense” for files is going to be colored very differently depending on what perspective you have.

From an end-user perspective, files are documents that I create and I want to be able to find them in different ways.

From the perspective of a typical app developer, the filesystem is a hierarchical key-value store.

The perspective of a database developer, backup software developer, system administrator, etc. is going to be completely different yet.


It is easy to do that though - set up a database. I've met one, maybe 2, people who don't use file systems the conventional way. They're rare and they generally just want a search index as opposed to relational data.


> It is easy to do that though - set up a database.

I think we have really, fundamentally, failed to communicate here. I’m 100% sure we are talking about different things.


The relational model also covers transactions. You could imagine opening multiple files, writing data to them, and rolling back a.k.a. reverting to the old situation if something goes wrong. This would protect against file corruption, and would also be interesting for e.g. installers. E.g. a transaction could cover all file changes, registry updates, etc.

Windows did this for a short while, but I believe it was removed again. I don't know the exact reason.


The relational model isn't logically related to transactions or rollbacks. It is easy enough to have those without the relational model.


The relational model is orthogonal to transactions. Ted Codd literally called out consistency challenges in the paper. Transactions were not invented then.


What is the equivalent of a directory, and of a file? From the relation point of view, what would your tables be?

It doesn't map that well, the idea sounds intriguing, but in practice a filesystem seems like a better model for files.


You would not have a "directory" in the sense you use the word (why would a directory be required in a FILE system?)

Instead a heap of files and you search for them using metadata (e.g. each file has an associated key value store that you search on).

Music catalogues are a good example. In the 90s one would wonder whether the music sitting on the HD should be organized as "Music/(genre)/(artist)" or "Music/(artist)/(genre)" directories. Different choices were best for different people.

Eventually many desktop music players (e.g. iTunes) just adopted a different metaphor. And Spotify, Netflix, etc. do not use a hierarchy; you search for items using metadata.

Another example is e.g.

    /lib/libfoo-1.2.so
Where one could instead have

    libraryname=foo
    scope=system
    version=1.2
    sharedlibrary=true
Or similar


> Music/(artist)/(genre)

Well that's only necessary when your music library contains Zappa.


That's not relational, that's closer to a NoSQL store.


That is orthogonal, NoSQL can be used as a relational DB too. SQL is one way to do a relational DB.

Also what I describe is close to SQL if e.g. you declare a set of properties and property types for each "filetype".

Relational is one aspect of a DB.

Schemaless or not is another aspect.


A "file" is just a key-value entry where the "key" is some label in an arbitrary namespace, and the "value" is a blob of bytes. The "directories" of a relational FS would be dynamically generated: they would arise as query result sets based on user input, as opposed to being fixed and materialized on-disk. This is how the relational FS worked in BeOS, but the exact same "virtual directory" feature was also ported to some versions of Windows.


> Maybe the File System shouldn’t be hierarchical but rather relational as well

How do you know it isn't?


Most operating systems do have the ability to add key/value pairs to files via extended attributes, like xattr on Linux. That's been around for quite some time. I suppose it goes unused since it's not centrally indexed, and also gets left behind in most types of file transfers.


> What is a relation in English? Actually the dictionary definition works pretty well in this case. The following definition comes from Merriam Webster.

    an aspect ... that connects two or more things or parts as being or belonging or working together
This is needlessly and wrongly freighting the model with a semantic interpretation. A relation is a subset of a cross-product of sets. Nothing more.


But there is a semantic interpretation: it's a set of true statements out of all possible statements.


Other than parts and bill-of-materials structures with self loops, I do not know why you would not use a data model.

Show me the code, and I am confused.

Show me the data, …


exactly. give me an SME and direct access to the DB. i don't need to see any code, not even the UI.

working on master data and BI changed my whole perspective on how to value software and what i demand of any business application going even vaguely close to business-critical processes (hint: unfettered db access to cut through your bullshit.. it's my damn data, thank you :D )


I've come to realize the relational model is not a great fit for most data.

It constrains your data model for the purpose of representing your queries in relational calculus, which allows a corresponding relational algebra to operate on them to help optimize disk access. This comes at a cost - although if that is your primary goal then it's fine, and for many it has been.

Data these days is deeply nested or document-based, and encoding this in the relational model is incredibly unwieldy, with huge ugly join queries, and the planner starts making random guesses at 6 joins or so.

Everyone ends up with a rigid and poorly normalized physical schema to suit the SQL planner. Think about all the times you avoid M-M joins because your queries would explode in complexity.

Also, pretty much every app these days wants streaming updates to queries. The optimizer doesn't optimize for streaming updates, and most streaming is done by polling. Incremental view maintenance and streaming SQL are also very difficult to achieve.


Two thoughts here:

- This is why ETL + variants are hard

- Graph databases are (arguably, because HN) better than the relational model (e.g., flexibility, ease of modeling, accessibility of algorithms/analytics)


If you want to use SQL with your Stripe data, I recently released a CLI called tdog that downloads your Stripe data to a SQL database.

https://table.dog


Anyone who thinks they don't need a relational database eventually ends up reinventing and rewriting aspects of an RDBMS, except badly. You're just kicking the can down the road.


Depends on the use cases, but yeah, for most people/companies adopting NoSQL before even using SQL, sure.


Literally technical debt. And managing debt is part of any project.



