Show HN: Hashquery, a Python library for defining reusable analysis

code_biologist · 2024-04-23T19:19:38

I'm exactly the target audience for this type of tool, a tech leader that has implemented a data warehouse and BI strategy. Some concrete tips for adoption:

- Break the dependency on your product. I need to be able to use the library even if your company goes under.

- Add a dbt library that makes it easy to use hashquery within dbt models. It gets you materialization for free and will answer a lot of questions you will get about dbt integration.

To comment more broadly, if you want to be a broad solution, the going trend in data integration seems to be at the warehouse level so you need to have SQL answers.

A bunch of tools all consume from our warehouse (existing BI, reverse ETL, data science systems). A BI definition tool won't work if I can't define segments in a way that all of those can access, even as just tables or views.

Our programmers and data science people know Python and are often very good at SQL, but their time is short and BI projects depending on them have been delayed. Our analysts know SQL, and have the dedicated time to make these projects happen.

This kind of code snippet isn't crazy to put into dbt, and if someone wants to do Python magic in the background they can:

    hq_project.models.events_model.with_activity_schema(
        group='user_id', timestamp='timestamp', event_key='event_type'
    ).funnel("ad_impression", "add_to_cart", "buy").as_sql()

cpimhoff · 2024-04-23T21:04:03

All great advice. All this resonated with the team.

100% on decoupling it from the product. Our customers needed a headless BI solution and we needed a better internal framework, so the stars aligned for this first version being a little coupled. This kind of feedback is helpful, because it helps me advocate for capacity to decouple it fully!

Hashquery does have APIs to get the rendered SQL without executing it, so I think integration with any in-warehouse processing tool is possible, dbt likely being the most valuable.

Full materialization/interoperability with SQL is hard for any tool trying to encapsulate the semantics of the SQL. The intent of those tools is to encode concepts that are not possible to represent with static tables, so folding them back to a SQL-surface area will always be lossy. Having said that, we could certainly have a better story around it — materialization can still be useful even if it has some caveats.

reply

grantjpowell · 2024-04-23T20:44:38

This looks neat. I'm the author of a similar project in typescript we use at Cotera called Era [0]. Y'all might be implement something similar to our caching layer [1] which we think is super useful. Once you have a decent cross warehouse representation it's pretty easy to "split" queries across the real warehouse and something like duckdb. The other thing that we find useful in Era that y'all might like are "Invariants"[2]. Invariants work by compiling lazily evaluated invalid casts into the query that only trigger under failing conditions. We use "invariants" to fail a query at _runtime_, which eliminates TOUTOC problems that come from a DBT tests style solution.

    [0] https://newera.dev/
    [1] https://cotera.co/blog/how-era-brings-last-mile-analytics-to-any-data-warehouse-via-duckdb
    [2] https://newera.dev/docs/invariants

anentropic · 2024-04-23T17:35:35

I'm potentially super interested in this as am building this kind of feature for my job at the moment

But https://hashquery.dev/#faq says:

> the Hashquery SQL compiler is not available to run locally, so you do need to define your data connections inside of Hashboard and use its API to execute your queries.

> We do plan on making the full Hashquery stack available to run locally in the near future

I'm not quite sure what the use case for this library is at present

If I'm not a Hashboard customer and don't want to pay $60/mo for a nicer way to query my existing db, what am I going to do with it?

Hashboard seems roughly similar to Superset and/or Cube?

reply

cpimhoff · 2024-04-23T18:14:37

Bit of context here is that Hashquery started as an internal tool and as a way to power our headless BI offering, so it still has a few bits tied to our infrastructure we're working to fully separate.

For folks wanting to hack around more with Hashquery in the meantime, Hashboard accounts can be created for free and we don't have plans to charge folks just using the Hashquery stuff (but we probably will bother you for feedback ). You can sign up here https://hashboard.com/getAccess

reply

totalhack · 2024-04-23T23:14:19

This looks cool. I built a similar open source semantic data / warehousing tool called Zillion. I use it to power my company's BI but haven't put as much time into the polish as you guys.

https://github.com/totalhack/zillion

reply

guy4261 · 2024-04-23T16:59:43

Looks pretty darn cool! Two questions please:

1. How does this compare to dbt? If we're already using dbt, why migrate?

2. Will you consider making a tool that tries to transpile SQL back to Hashquery models? This way I can work against my database, then merge the changes back to the model.

Good luck!

reply

cpimhoff · 2024-04-23T18:53:09

> 1. How does this compare to dbt? If we're already using dbt, why migrate?

I actually think dbt and Hashquery work very well alongside one another!

dbt can help you normalize, clean, and materialize your data upstream, then Hashquery can be used to associate semantics to those output tables (measurements, synthesized attributes, common views) and query them.

So dbt can be the build/ETL part, and Hashquery can be the semantic layer/frontend for analytical queries.

> 2. Will you consider making a tool that tries to transpile SQL back to Hashquery models?

This is a really interesting idea!

I'm pretty skeptical we could make this technically feasible. Compilation from a higher level abstraction (Hashquery semantics) to a lower level abstraction is inherently lossy and can't really be done in reverse without a lot of noise.

Hashquery has a lot of escape hatches for raw SQL though if you need to access some functionality not yet implemented as part of the project. There's API to inline Hashquery structures inside of SQL fragments and visa versa.

reply

peterhadlaw · 2024-04-23T17:19:06

I really don't understand the appeal of dbt vs a proper programming language. The templating approach leads to massive spaghetti. I look forward to trying out something like Ibis [0]

0: https://ibis-project.org/

reply

BadHumans · 2024-04-23T18:04:42

You might not get the appeal of dbt because it is not meant for you. Dbt was made for data analyst who wanted the same engineering practices that software engineering teams have. The people who use it may have Python experience but they are not software engineers so dbt introduces concepts like testing and CI/CD to a group of people who don't historically use them.

mjirv · 2024-04-23T18:20:25

re: dbt, it looks like Hashquery is more for running analyses than doing in-data-warehouse transformations.

Unless I'm mistaken there's no native way in Hashquery right now to materialize the results back to your warehouse like you would with dbt.

reply

demaga · 2024-04-23T17:04:19

> way beyond the capabilities of standard SQL

Maybe some examples would help

> AI and LLMs (coming soon)

Why?

reply

cpimhoff · 2024-04-23T18:25:44

> way beyond the capabilities of standard SQL

Maybe the more precise language would be "way beyond the _expressive_ capabilities of standard SQL". Ultimately Hashquery compiles into SQL for execution, in that way it's a bit of a transpiler.

One concrete example is funnel analysis. In SQL, an efficient funnel analysis on large data sets will span several hundred lines of pipelined queries; writing, reading, and re-parameterizing this is very challenging. In Hashquery, it's abstracted into a single function call: `model.funnel(*str[])`.

> AI and LLMs (coming soon)

Frankly? SEO just told us to drop "AI" _somewhere_ on the page for new `.dev` website. I agree it's a little silly to make it a top-line bullet point.

We do think Hashquery is better suited for coding co-pilot tools, as it's a known language, and you can establish API boundaries that an AI tool won't break into and futz with (good for writing queries on top of already defined business logic).

reply

mpeg · 2024-04-23T17:00:59

This is pretty cool, although DSLs can bring their own set of challenges vs writing SQL – how does it compare to dbt? (apart from it being python instead of SQL)

tomrod · 2024-04-23T17:00:21

How aware is the library regarding existing structure, e.g foreign key relationships?

cpimhoff · 2024-04-23T18:35:05

The library itself provides tools for declaring data tables, columns, metrics, and relationships, then performing transformations (eg. queries) on top of them.

Reflecting an existing database isn't part of the library, but with the magic of "it's just Python" it's pretty easy to write a function that does so yourself:

    import hashquery as hq
    def reflect_db(url: str) -> list[hq.Model]:
      """
      Given a connection string,
      returns a list of Hashquery models
      for all the physical tables in
      the database.
      """
      models: list[hq.Model] = []

      reflection = some_reflection_lib.reflect_db(url)
      for table in reflection.tables:
        models.append(
          hq.Model()
          .with_source(url, table.name, table.schema)
          .with_attributes(*table.column_names)
        )
        # ...more logic for importing foreign keys or whatever else

      return models

vndrewlee · 2024-04-23T18:38:57

Joining logic is defined in our data modeling layer, either from Hashboard (our BI tool) or can also be defined in Hashquery as well instead of replying on your database schema for fk relationships. We went this route since a lot of people are using dbt generated tables.

https://hashquery.dev/docs/pattern_guides/joins/

reply

zsdev · 2024-04-23T22:37:49

Interested to see a comparison between hashquery and ibis. https://ibis-project.org/