Merge Datasets
Merging gives you a flexible way to bring together data from different Datasets.
A merge operation connects rows in Datasets using a “key” column that the Datasets have in common.
Riptable’s two main Dataset merge functions are merge_lookup() and merge_asof(). Generally speaking, merge_lookup() aligns data based on identical keys, while merge_asof() aligns data based on the nearest key. For more general merges, merge2() does database-style left, right, inner, and outer joins.
merge_lookup()
Let’s start with merge_lookup(). It’s common to have one Dataset that has most of the information you need, and another, usually smaller Dataset that has information you want to add to the first Dataset to enrich it.
Here we’ll create a larger Dataset with symbols and size values, and a smaller Dataset that has symbols associated with trader names. We’ll use the shared Symbol column as the key to add the trader info to the larger Dataset:
>>> rng = np.random.default_rng(seed=42)
>>> N = 25
>>> # Larger Dataset
>>> ds = rt.Dataset({'Symbol': rng.choice(['GME', 'AMZN', 'TSLA', 'SPY'], N),
... 'Size': rng.integers(1, 1000, N),})
>>> # Smaller Dataset, with data used to enrich the larger Dataset
>>> ds_symbol_trader = rt.Dataset({'Symbol': ['GME', 'TSLA', 'SPY', 'AMZN'],
... 'Trader': ['Nate', 'Elon', 'Josh', 'Dan']})
>>> ds.head()
# Symbol Size
-- ------ ----
0 GME 644
1 SPY 403
2 TSLA 822
3 AMZN 545
4 AMZN 443
5 SPY 451
6 GME 228
7 TSLA 93
8 GME 555
9 GME 888
10 TSLA 64
11 SPY 858
12 TSLA 827
13 SPY 277
14 TSLA 632
15 SPY 166
16 TSLA 758
17 GME 700
18 SPY 355
19 AMZN 68
>>> ds_symbol_trader
# Symbol Trader
- ------ ------
0 GME Nate
1 TSLA Elon
2 SPY Josh
3 AMZN Dan
merge_lookup() with Key Columns That Have the Same Name
Now we’ll use merge_lookup() to add the trader information to the larger Dataset. merge_lookup() will align the data based on exact matches in the shared Symbol column.
A note about terms: When you merge two Datasets, the Dataset you’re merging data into is the left Dataset; the one you’re getting data from is the right Dataset.
Here, we call merge_lookup() on our left Dataset, ds. We pass it the name of the right Dataset, and tell it what column to use as the key:
>>> ds.merge_lookup(ds_symbol_trader, on='Symbol')
# Symbol Size Trader
-- ------ ---- ------
0 GME 644 Nate
1 SPY 403 Josh
2 TSLA 822 Elon
3 AMZN 545 Dan
4 AMZN 443 Dan
5 SPY 451 Josh
6 GME 228 Nate
7 TSLA 93 Elon
8 GME 555 Nate
9 GME 888 Nate
10 TSLA 64 Elon
11 SPY 858 Josh
12 TSLA 827 Elon
13 SPY 277 Josh
14 TSLA 632 Elon
15 SPY 166 Josh
16 TSLA 758 Elon
17 GME 700 Nate
18 SPY 355 Josh
19 AMZN 68 Dan
20 TSLA 970 Elon
21 AMZN 446 Dan
22 GME 893 Nate
23 SPY 678 Josh
24 SPY 778 Josh
The left Dataset now has the trader information, correctly aligned.
You can also use the following syntax, passing merge_lookup() the names of the left and right Datasets, along with the key:
>>> rt.merge_lookup(ds, ds_symbol_trader, on='Symbol')
# Symbol Size Trader
-- ------ ---- ------
0 GME 644 Nate
1 SPY 403 Josh
2 TSLA 822 Elon
3 AMZN 545 Dan
4 AMZN 443 Dan
5 SPY 451 Josh
6 GME 228 Nate
7 TSLA 93 Elon
8 GME 555 Nate
9 GME 888 Nate
10 TSLA 64 Elon
11 SPY 858 Josh
12 TSLA 827 Elon
13 SPY 277 Josh
14 TSLA 632 Elon
15 SPY 166 Josh
16 TSLA 758 Elon
17 GME 700 Nate
18 SPY 355 Josh
19 AMZN 68 Dan
20 TSLA 970 Elon
21 AMZN 446 Dan
22 GME 893 Nate
23 SPY 678 Josh
24 SPY 778 Josh
merge_lookup() with Key Columns That Have Different Names
If the key column has a different name in each Dataset, just specify each column name with left_on and right_on:
>>> # For illustrative purposes, rename the key column in the right Dataset.
>>> ds_symbol_trader.col_rename('Symbol', 'UnderlyingSymbol')
>>> ds.merge_lookup(ds_symbol_trader, left_on='Symbol', right_on='UnderlyingSymbol')
# Symbol Size UnderlyingSymbol Trader
-- ------ ---- ---------------- ------
0 GME 644 GME Nate
1 SPY 403 SPY Josh
2 TSLA 822 TSLA Elon
3 AMZN 545 AMZN Dan
4 AMZN 443 AMZN Dan
5 SPY 451 SPY Josh
6 GME 228 GME Nate
7 TSLA 93 TSLA Elon
8 GME 555 GME Nate
9 GME 888 GME Nate
10 TSLA 64 TSLA Elon
11 SPY 858 SPY Josh
12 TSLA 827 TSLA Elon
13 SPY 277 SPY Josh
14 TSLA 632 TSLA Elon
15 SPY 166 SPY Josh
16 TSLA 758 TSLA Elon
17 GME 700 GME Nate
18 SPY 355 SPY Josh
19 AMZN 68 AMZN Dan
20 TSLA 970 TSLA Elon
21 AMZN 446 AMZN Dan
22 GME 893 GME Nate
23 SPY 678 SPY Josh
24 SPY 778 SPY Josh
Notice that when the key columns have different names, both are kept. If you want to keep only certain columns from the left or right Dataset, you can specify them with columns_left or columns_right:
>>> ds.merge_lookup(ds_symbol_trader, left_on='Symbol', right_on='UnderlyingSymbol',
... columns_right='Trader')
# Symbol Size Trader
-- ------ ---- ------
0 GME 644 Nate
1 SPY 403 Josh
2 TSLA 822 Elon
3 AMZN 545 Dan
4 AMZN 443 Dan
5 SPY 451 Josh
6 GME 228 Nate
7 TSLA 93 Elon
8 GME 555 Nate
9 GME 888 Nate
10 TSLA 64 Elon
11 SPY 858 Josh
12 TSLA 827 Elon
13 SPY 277 Josh
14 TSLA 632 Elon
15 SPY 166 Josh
16 TSLA 758 Elon
17 GME 700 Nate
18 SPY 355 Josh
19 AMZN 68 Dan
20 TSLA 970 Elon
21 AMZN 446 Dan
22 GME 893 Nate
23 SPY 678 Josh
24 SPY 778 Josh
Note: merge_lookup() Keeps Only the Keys in the Left Dataset
One thing to note about merge_lookup() is that it keeps only the keys that are in the left Dataset (it’s equivalent to a SQL left join). If there are keys in the right Dataset that aren’t in the left Dataset, they’re discarded in the merged data:
>>> # Create a right Dataset with an extra symbol key ('MSFT').
>>> ds_symbol_trader2 = rt.Dataset({'Symbol': ['GME', 'TSLA', 'SPY', 'AMZN', 'MSFT'],
... 'Trader': ['Nate', 'Elon', 'Josh', 'Dan', 'Lauren']})
>>> # Change 'UnderlyingSymbol' back to 'Symbol' for simplicity.
>>> ds_symbol_trader.col_rename('UnderlyingSymbol', 'Symbol')
>>> ds.merge_lookup(ds_symbol_trader2, on='Symbol', columns_right='Trader')
# Symbol Size Trader
-- ------ ---- ------
0 GME 644 Nate
1 SPY 403 Josh
2 TSLA 822 Elon
3 AMZN 545 Dan
4 AMZN 443 Dan
5 SPY 451 Josh
6 GME 228 Nate
7 TSLA 93 Elon
8 GME 555 Nate
9 GME 888 Nate
10 TSLA 64 Elon
11 SPY 858 Josh
12 TSLA 827 Elon
13 SPY 277 Josh
14 TSLA 632 Elon
15 SPY 166 Josh
16 TSLA 758 Elon
17 GME 700 Nate
18 SPY 355 Josh
19 AMZN 68 Dan
20 TSLA 970 Elon
21 AMZN 446 Dan
22 GME 893 Nate
23 SPY 678 Josh
24 SPY 778 Josh
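The left-join behavior can be sketched in plain Python (a conceptual illustration only, not how Riptable implements merge_lookup()): build a lookup table from the right Dataset’s key column, then map each left key through it. Right-only keys like 'MSFT' are never looked up, and a left key with no match (the hypothetical 'PLTR' below, added for illustration) gets a filler value where Riptable would use an invalid/NA value.

```python
# Conceptual sketch of merge_lookup()'s left-join semantics in plain Python.
left_symbols = ['GME', 'SPY', 'TSLA', 'AMZN', 'PLTR']  # 'PLTR': no match on the right
right = {'GME': 'Nate', 'TSLA': 'Elon', 'SPY': 'Josh',
         'AMZN': 'Dan', 'MSFT': 'Lauren'}  # 'MSFT' exists only on the right

# Every left row is kept; unmatched left keys get a filler value;
# right-only keys ('MSFT') simply never appear in the result.
traders = [right.get(sym, None) for sym in left_symbols]
print(traders)  # ['Nate', 'Josh', 'Elon', 'Dan', None]
```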
merge_lookup() with Overlapping Columns That Aren’t Keys
As we saw above, if the two key columns have the same name in both Datasets, only one is kept. For columns that aren’t used as keys, you’ll get a name collision error when you try to merge:
>>> # Add a Size column to the right Dataset
>>> ds_symbol_trader.Size = rng.integers(1, 1000, 4)
>>> try:
... rt.merge_lookup(ds, ds_symbol_trader, on='Symbol')
... except ValueError as e:
... print("ValueError:", e)
ValueError: columns overlap but no suffix specified: {'Size'}
If you want to keep both columns, add a suffix to each column name to disambiguate them:
>>> rt.merge_lookup(ds, ds_symbol_trader, on='Symbol', suffixes=('_1', '_2'))
# Symbol Size_1 Trader Size_2
-- ------ ------ ------ ------
0 GME 644 Nate 760
1 SPY 403 Josh 364
2 TSLA 822 Elon 195
3 AMZN 545 Dan 467
4 AMZN 443 Dan 467
5 SPY 451 Josh 364
6 GME 228 Nate 760
7 TSLA 93 Elon 195
8 GME 555 Nate 760
9 GME 888 Nate 760
10 TSLA 64 Elon 195
11 SPY 858 Josh 364
12 TSLA 827 Elon 195
13 SPY 277 Josh 364
14 TSLA 632 Elon 195
15 SPY 166 Josh 364
16 TSLA 758 Elon 195
17 GME 700 Nate 760
18 SPY 355 Josh 364
19 AMZN 68 Dan 467
20 TSLA 970 Elon 195
21 AMZN 446 Dan 467
22 GME 893 Nate 760
23 SPY 678 Josh 364
24 SPY 778 Josh 364
merge_lookup() with a Right Dataset That Has Duplicate Keys
If the right Dataset has more than one match for a unique key in the left Dataset, you can specify whether to use the first or the last match encountered in the right Dataset:
>>> # Create a right Dataset with a second GME key, associated to Lauren
>>> ds_symbol_trader3 = rt.Dataset({'Symbol': ['GME', 'TSLA', 'SPY', 'AMZN', 'GME'],
... 'Trader': ['Nate', 'Elon', 'Josh', 'Dan', 'Lauren']})
>>> ds_symbol_trader3
# Symbol Trader
- ------ ------
0 GME Nate
1 TSLA Elon
2 SPY Josh
3 AMZN Dan
4 GME Lauren
We’ll keep the last match:
>>> ds.merge_lookup(ds_symbol_trader3, on='Symbol', columns_right='Trader', keep='last')
# Symbol Size Trader
-- ------ ---- ------
0 GME 644 Lauren
1 SPY 403 Josh
2 TSLA 822 Elon
3 AMZN 545 Dan
4 AMZN 443 Dan
5 SPY 451 Josh
6 GME 228 Lauren
7 TSLA 93 Elon
8 GME 555 Lauren
9 GME 888 Lauren
10 TSLA 64 Elon
11 SPY 858 Josh
12 TSLA 827 Elon
13 SPY 277 Josh
14 TSLA 632 Elon
15 SPY 166 Josh
16 TSLA 758 Elon
17 GME 700 Lauren
18 SPY 355 Josh
19 AMZN 68 Dan
20 TSLA 970 Elon
21 AMZN 446 Dan
22 GME 893 Lauren
23 SPY 678 Josh
24 SPY 778 Josh
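The effect of keep can be sketched in plain Python (a conceptual illustration, not Riptable’s implementation): building a dict from the right Dataset’s rows keeps the last value written for a duplicate key, while building it from the reversed rows keeps the first.

```python
# Right Dataset rows with a duplicate 'GME' key, as (Symbol, Trader) pairs.
rows = [('GME', 'Nate'), ('TSLA', 'Elon'), ('SPY', 'Josh'),
        ('AMZN', 'Dan'), ('GME', 'Lauren')]

# dict() keeps the last value written for a repeated key -> like keep='last'
keep_last = dict(rows)
# Inserting in reverse order means the first occurrence wins -> like keep='first'
keep_first = dict(reversed(rows))

print(keep_last['GME'])   # Lauren
print(keep_first['GME'])  # Nate
```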
merge_asof()
In a merge_asof(), Riptable matches on the nearest key rather than an equal key.
This is useful for merges based on keys that are times, where the times in one Dataset are not an exact match for the times in another Dataset, but they’re close enough to be used to merge the data.
Note: To most efficiently find the nearest match, merge_asof() requires both key columns to be sorted. The key columns must also be numeric, such as a datetime, integer, or float. You can check whether a column is sorted with issorted(), or sort it in place with sort_inplace(). (If the key columns aren’t sorted, they are sorted for you before the merge, as described below.)
With merge_asof(), you need to specify how you want to find the closest match:

- direction='forward' matches based on the closest key in the right Dataset that’s greater than or equal to the key in the left Dataset.
- direction='backward' matches based on the closest key in the right Dataset that’s less than or equal to the key in the left Dataset.
- direction='nearest' matches based on the closest key in the right Dataset, regardless of whether it’s greater than or less than the key in the left Dataset.
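The three directions can be sketched in plain Python with the standard-library bisect module (a conceptual illustration on a sorted list of times in minutes since midnight, not Riptable’s implementation; bisect also shows why the key column must be sorted):

```python
import bisect

# Sorted right-hand keys: 09:25, 09:30, 10:00, 10:25 as minutes since midnight.
right_times = [565, 570, 600, 625]
t = 620  # left key: 10:20

# backward: nearest right key <= t
i = bisect.bisect_right(right_times, t) - 1
backward = right_times[i] if i >= 0 else None               # 600 (10:00)

# forward: nearest right key >= t
j = bisect.bisect_left(right_times, t)
forward = right_times[j] if j < len(right_times) else None  # 625 (10:25)

# nearest: whichever candidate is closer to t
candidates = [x for x in (backward, forward) if x is not None]
nearest = min(candidates, key=lambda x: abs(x - t))         # 625 (10:25)
print(backward, forward, nearest)  # 600 625 625
```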
Let’s see an example based on closest times. The left Dataset has three trades and their times. The right Dataset has spot prices and times that are not all exact matches. We’ll merge the spot prices from the right Dataset by getting the values associated with the nearest earlier times.
>>> # Left Dataset with trades and times
>>> ds = rt.Dataset({'Symbol': ['AAPL', 'AMZN', 'AAPL'],
... 'Venue': ['A', 'I', 'A'],
... 'Time': rt.TimeSpan(['09:30', '10:00', '10:20'])})
>>> # Right Dataset with spot prices and nearby times
>>> spot_ds = rt.Dataset({'Symbol': ['AMZN', 'AMZN', 'AMZN', 'AAPL', 'AAPL', 'AAPL'],
... 'Spot Price': [2000.0, 2025.0, 2030.0, 500.0, 510.0, 520.0],
... 'Time': rt.TimeSpan(['09:30', '10:00', '10:25', '09:25', '10:00', '10:25'])})
>>> ds
# Symbol Venue Time
- ------ ----- ------------------
0 AAPL A 09:30:00.000000000
1 AMZN I 10:00:00.000000000
2 AAPL A 10:20:00.000000000
>>> spot_ds
# Symbol Spot Price Time
- ------ ---------- ------------------
0 AMZN 2,000.00 09:30:00.000000000
1 AMZN 2,025.00 10:00:00.000000000
2 AMZN 2,030.00 10:25:00.000000000
3 AAPL 500.00 09:25:00.000000000
4 AAPL 510.00 10:00:00.000000000
5 AAPL 520.00 10:25:00.000000000
Note that an as-of merge requires the on columns to be sorted. Before the merge, the on columns are always checked. If they’re not sorted, by default they are sorted before the merge; the original order is then restored in the returned merged Dataset. If you don’t need to preserve the existing ordering, it’s faster to sort the on columns in place first:
>>> spot_ds.sort_inplace('Time')
# Symbol Spot Price Time
- ------ ---------- ------------------
0 AAPL 500.00 09:25:00.000000000
1 AMZN 2,000.00 09:30:00.000000000
2 AMZN 2,025.00 10:00:00.000000000
3 AAPL 510.00 10:00:00.000000000
4 AAPL 520.00 10:25:00.000000000
5 AMZN 2,030.00 10:25:00.000000000
Now we can merge based on the nearest earlier time. But not just any nearest earlier time: we want the nearest earlier time associated with the same symbol. We use the optional by parameter to make sure we match on the symbol before getting the nearest earlier time. We’ll also use the matched_on argument to show which key in spot_ds was matched on:
>>> ds.merge_asof(spot_ds, on='Time', by='Symbol', direction='backward', matched_on=True)
# Symbol Time Venue Spot Price matched_on
- ------ ------------------ ----- ---------- ------------------
0 AAPL 09:30:00.000000000 A 500.00 09:25:00.000000000
1 AMZN 10:00:00.000000000 I 2,025.00 10:00:00.000000000
2 AAPL 10:20:00.000000000 A 510.00 10:00:00.000000000
We can see that both AAPL trades were matched based on the nearest earlier time.
Merge based on the nearest later time:
>>> ds.merge_asof(spot_ds, on='Time', by='Symbol', direction='forward', matched_on=True)
# Symbol Time Venue Spot Price matched_on
- ------ ------------------ ----- ---------- ------------------
0 AAPL 09:30:00.000000000 A 510.00 10:00:00.000000000
1 AMZN 10:00:00.000000000 I 2,025.00 10:00:00.000000000
2 AAPL 10:20:00.000000000 A 520.00 10:25:00.000000000
Both AAPL trades were matched based on the nearest later time.
Here, we get the spot price associated with whatever time is nearest, whether it’s earlier or later:
>>> ds.merge_asof(spot_ds, on='Time', by='Symbol', direction='nearest', matched_on=True)
# Symbol Time Venue Spot Price matched_on
- ------ ------------------ ----- ---------- ------------------
0 AAPL 09:30:00.000000000 A 500.00 09:25:00.000000000
1 AMZN 10:00:00.000000000 I 2,025.00 10:00:00.000000000
2 AAPL 10:20:00.000000000 A 520.00 10:25:00.000000000
For the first AAPL trade, the nearest time is earlier. For the second AAPL trade, the nearest time is later.
We won’t spend time on examples of merge2(), which is Riptable’s more general merge function for database-style joins (left, right, inner, outer). Check out the API Reference for details.
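To give a flavor of how the join types differ, the set of keys each one keeps can be sketched in plain Python with set operations (a conceptual sketch only; see merge2() in the API Reference for the real parameters and behavior):

```python
# Key columns of a hypothetical left and right Dataset.
left_keys = {'GME', 'AMZN', 'TSLA', 'SPY'}
right_keys = {'GME', 'TSLA', 'SPY', 'MSFT'}

# Which keys survive each database-style join:
print(sorted(left_keys & right_keys))  # inner: keys present in both
print(sorted(left_keys))               # left:  all left keys
print(sorted(right_keys))              # right: all right keys
print(sorted(left_keys | right_keys))  # outer: keys present in either
```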
Next, we’ll briefly cover a couple of ways to change the shape of your Dataset: Reshape Data with Pivot and Transpose.
Questions or comments about this guide? Email RiptableDocumentation@sig.com.