non-equi joins in data.table

non-equi joins in data.table

A quick note to understand (non-equi) joins in data.table

I have been toying with some of the advent of code challenges (I am way behind though!).

For day 5, I had to create a function, and I’m writing this up, because it’s an example of a non-equi join between two tables.
In this particular sitation, there are are no common columns between the two tables, so my usual data.table hack of copying the columns of interest, renaming themjoin_col, and then keying them both does not work.

Here’s the function:

find_matches <- function(input_dt, lookup_dt) {
 res <-  lookup_dt[input_dt,
            .(V1 = i.V1,
              dest_start = x.dest_start,
              source_start = x.source_start,
              source_end = x.source_end),
            on = .(source_start <= V1,
                   source_end >= V1)
  ][, dest := fcase(is.na(source_start), V1,
                    !is.na(source_start),V1 - source_start + dest_start)
  ][,.(V1 = dest)]
 return(res)
}

I want to join the main table, input_dt with the lookup_dt. Because data.table uses right joins, and removes the joining column from the main table (unless you tell it otherwise), I am being very specific.

This bit is where I specify the column names I want to return, and which table they come from.

i.col_name is the main / large/ right hand side table

x.col_name is the smaller/ left hand side table

Here I’m saying to keep V1 from the main table, and dest_start, source_start and source_end from the smaller table:

            .(V1 = i.V1,
              dest_start = x.dest_start,
              source_start = x.source_start,
              source_end = x.source_end),

Here is the non-equi join bit.

on = .(source_start <= V1,
                   source_end >= V1)

The tables should join when V1 is between the source start and source end.

There will be rows where these conditions are not met, these will return NA. This section of code deals with those - any NA’s are replaced with the original value and no further calculation is required.

[, dest := fcase(is.na(source_start), V1,
                    !is.na(source_start),V1 - source_start + dest_start)
  ]

The rest of the code returns the vector of interest.

The main takeway from this is to understand:

  • data.table uses right joins by default
  • the join column in the main table will be removed from the resulting joined table, unless you tell it otherwise
  • Use i.col_name to specify columns from the main table
  • Use x.col_name to specify columns from the smaller / lookup table
  • you can use conditions in your on syntax to create equi or non equi joins
0

© 2016 - 2024. All rights reserved.