Comparing SQLite, DuckDB and Arrow with UN trade data

🔥 246 · 💬 78 · 2 years ago · pacha.dev · marcle

The statistician on your team pointed you to some RDS files obtained from UN COMTRADE, an official United Nations database with comprehensive bilateral trade records at the product level, going back to 1962. Now we can explore the files: all of them have the same columns but contain data for different years.

Arrow partitioning is somewhat similar to SQL indexes, but instead of creating a table with fewer columns, it creates a structure of folders containing different files according to the partitioning. Reading the resulting data is therefore very efficient, because it allows skipping fragments entirely instead of reading and then filtering, as one would do with RDS.

    library(arrow)
    library(dplyr)

    arrow_dir <- "~/comtrade/2011 2020"

    if [...]

    sizes2 <- sapply(files2, function(x) [...])

    tibble(
      file = files2,
      size_in_mb = sizes2
    ) %>%
      mutate(extension = gsub([...], file)) %>%
      group_by(extension) %>%
      summarise(total_size_in_mb = sum(size_in_mb))

    ## # A tibble: 4 × 2
    ##   extension total_size_in_mb
    ##   <chr>                <dbl>
    ## 1 duckdb               2538.
    ## [...]

    library(bench)
    library(purrr)

    countries <- c("usa", "can", "gbr", "fra", "ita", "lka", "chl")

    # RDS --
    benchmark_rds <- mark(
      map_df(
        files,
        function(x) [...]
      )
    )

    # SQLite --
    # we need to open and close a connection for SQLite
    con <- dbConnect(SQLite(), sqlite_file)

    benchmark_sqlite <- mark(
      tbl(con, "yrpc") %>%
        filter(reporter_iso %in% countries) %>%
        group_by(year, reporter_iso) %>%
        summarise(
          total_exports_in_usd = sum(trade_value_usd_exp, na.rm = TRUE)
        ) %>%
        # here we need a collect at the end to move the data into R
        collect()
    )

    dbDisconnect(con, shutdown = TRUE)

    # DuckDB --
    # we need to open and close a connection for DuckDB
    con <- dbConnect(duckdb(), duckdb_file)

    benchmark_duckdb <- mark(
      tbl(con, "yrpc") %>%
        filter(reporter_iso %in% countries) %>%
        group_by(year, reporter_iso) %>%
        summarise(
          total_exports_in_usd = sum(trade_value_usd_exp, na.[...]
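To make the partitioning idea concrete, here is a minimal sketch on a toy table. It assumes the arrow and dplyr packages are installed. The column names mirror the ones used in the post, but the data values, the `toy` object and the `toy_dir` folder are made up for illustration and are not the author's files.

```r
library(arrow)
library(dplyr)

# Toy stand-in for the trade table; the values are invented for illustration
toy <- data.frame(
  year = c(2011L, 2011L, 2012L, 2012L),
  reporter_iso = c("usa", "can", "usa", "can"),
  trade_value_usd_exp = c(10, 20, 30, 40)
)

# Hypothetical output folder inside the session's temporary directory
toy_dir <- file.path(tempdir(), "toy_arrow")

# One folder level per partition column, e.g. year=2011/reporter_iso=usa/
write_dataset(toy, toy_dir, partitioning = c("year", "reporter_iso"))

# Each year/reporter combination ends up in its own file
partition_files <- list.files(toy_dir, recursive = TRUE)

# A filter on a partition column lets Arrow skip folders that cannot match
usa_exports <- open_dataset(toy_dir) %>%
  filter(reporter_iso == "usa") %>%
  group_by(year) %>%
  summarise(total_exports_in_usd = sum(trade_value_usd_exp, na.rm = TRUE)) %>%
  collect() %>%
  arrange(year)
```

Inspecting `partition_files` shows the folder structure that replaces a single file. On a table this small the skipping saves nothing measurable; the benefit only shows on data the size of the full trade tables.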
A SQLite or DuckDB database consists of a single large file, but the indexes we created allow their respective packages to read a copy of the table that has just the year and reporter_iso columns. This allows very fast filtering and provides the exact location of what we need to read in the large tables.
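As a rough illustration of how such an index is used, the sketch below builds a toy `yrpc` table in an in-memory SQLite database. It assumes the DBI and RSQLite packages are installed. The index name `year_reporter`, the exact index definition and the data values are assumptions made for this example, since the post's own index statements are not shown here.

```r
library(DBI)
library(RSQLite)

# In-memory database so the example leaves no files behind
con <- dbConnect(SQLite(), ":memory:")

# Toy stand-in for the trade table; the values are invented for illustration
toy <- data.frame(
  year = c(2011L, 2011L, 2012L, 2012L),
  reporter_iso = c("usa", "can", "usa", "can"),
  trade_value_usd_exp = c(10, 20, 30, 40)
)
dbWriteTable(con, "yrpc", toy)

# Hypothetical index on the two columns used for filtering
dbExecute(con, "CREATE INDEX year_reporter ON yrpc (year, reporter_iso)")

# Ask SQLite how it would run a filtered query
plan <- dbGetQuery(
  con,
  "EXPLAIN QUERY PLAN
   SELECT * FROM yrpc WHERE year = 2011 AND reporter_iso = 'usa'"
)

# The same kind of aggregation as in the benchmarks, written in SQL
result <- dbGetQuery(
  con,
  "SELECT year, reporter_iso, SUM(trade_value_usd_exp) AS total_exports_in_usd
   FROM yrpc
   WHERE reporter_iso = 'usa'
   GROUP BY year, reporter_iso
   ORDER BY year"
)

dbDisconnect(con)
```

The `detail` column of `plan` should mention the index, which shows that SQLite locates the matching rows through the small index structure rather than scanning the whole table.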