Merge Datasets
Merging gives you a flexible way to bring together data from different Datasets.
A merge operation connects rows in Datasets using a “key” column that the Datasets have in common.
Riptable’s two main Dataset merge functions are merge_lookup() and merge_asof(). Generally speaking, merge_lookup() aligns data based on identical keys, while merge_asof() aligns data based on the nearest key. For more general merges, merge2() does database-style left, right, inner, and outer joins.
merge_lookup()
Let’s start with merge_lookup(). It’s common to have one Dataset that has most of the information you need, and another, usually smaller Dataset that has information you want to add to the first Dataset to enrich it.
Here we’ll create a larger Dataset with symbols and size values, and a smaller Dataset that has symbols associated with trader names. We’ll use the shared Symbol column as the key to add the trader info to the larger Dataset:
>>> rng = np.random.default_rng(seed=42)
>>> N = 25
>>> # Larger Dataset
>>> ds = rt.Dataset({'Symbol': rng.choice(['GME', 'AMZN', 'TSLA', 'SPY'], N),
... 'Size': rng.integers(1, 1000, N),})
>>> # Smaller Dataset, with data used to enrich the larger Dataset
>>> ds_symbol_trader = rt.Dataset({'Symbol': ['GME', 'TSLA', 'SPY', 'AMZN'],
... 'Trader': ['Nate', 'Elon', 'Josh', 'Dan']})
>>> ds.head()
# Symbol Size
-- ------ ----
0 GME 644
1 SPY 403
2 TSLA 822
3 AMZN 545
4 AMZN 443
5 SPY 451
6 GME 228
7 TSLA 93
8 GME 555
9 GME 888
10 TSLA 64
11 SPY 858
12 TSLA 827
13 SPY 277
14 TSLA 632
15 SPY 166
16 TSLA 758
17 GME 700
18 SPY 355
19 AMZN 68
>>> ds_symbol_trader
# Symbol Trader
- ------ ------
0 GME Nate
1 TSLA Elon
2 SPY Josh
3 AMZN Dan
merge_lookup() with Key Columns That Have the Same Name
Now we’ll use merge_lookup() to add the trader information to the larger Dataset. merge_lookup() will align the data based on exact matches in the shared Symbol column.
A note about terms: When you merge two Datasets, the Dataset you’re merging data into is the left Dataset; the one you’re getting data from is the right Dataset.
Here, we call merge_lookup() on our left Dataset, ds. We pass it the name of the right Dataset, and tell it what column to use as the key:
>>> ds.merge_lookup(ds_symbol_trader, on='Symbol')
# Symbol Size Trader
-- ------ ---- ------
0 GME 644 Nate
1 SPY 403 Josh
2 TSLA 822 Elon
3 AMZN 545 Dan
4 AMZN 443 Dan
5 SPY 451 Josh
6 GME 228 Nate
7 TSLA 93 Elon
8 GME 555 Nate
9 GME 888 Nate
10 TSLA 64 Elon
11 SPY 858 Josh
12 TSLA 827 Elon
13 SPY 277 Josh
14 TSLA 632 Elon
15 SPY 166 Josh
16 TSLA 758 Elon
17 GME 700 Nate
18 SPY 355 Josh
19 AMZN 68 Dan
20 TSLA 970 Elon
21 AMZN 446 Dan
22 GME 893 Nate
23 SPY 678 Josh
24 SPY 778 Josh
The left Dataset now has the trader information, correctly aligned.
You can also use the following syntax, passing merge_lookup() the names of the left and right Datasets, along with the key:
>>> rt.merge_lookup(ds, ds_symbol_trader, on='Symbol')
# Symbol Size Trader
-- ------ ---- ------
0 GME 644 Nate
1 SPY 403 Josh
2 TSLA 822 Elon
3 AMZN 545 Dan
4 AMZN 443 Dan
5 SPY 451 Josh
6 GME 228 Nate
7 TSLA 93 Elon
8 GME 555 Nate
9 GME 888 Nate
10 TSLA 64 Elon
11 SPY 858 Josh
12 TSLA 827 Elon
13 SPY 277 Josh
14 TSLA 632 Elon
15 SPY 166 Josh
16 TSLA 758 Elon
17 GME 700 Nate
18 SPY 355 Josh
19 AMZN 68 Dan
20 TSLA 970 Elon
21 AMZN 446 Dan
22 GME 893 Nate
23 SPY 678 Josh
24 SPY 778 Josh
merge_lookup() with Key Columns That Have Different Names
If the key column has a different name in each Dataset, just specify each column name with left_on and right_on:
>>> # For illustrative purposes, rename the key column in the right Dataset.
>>> ds_symbol_trader.col_rename('Symbol', 'UnderlyingSymbol')
>>> ds.merge_lookup(ds_symbol_trader, left_on='Symbol', right_on='UnderlyingSymbol')
# Symbol Size UnderlyingSymbol Trader
-- ------ ---- ---------------- ------
0 GME 644 GME Nate
1 SPY 403 SPY Josh
2 TSLA 822 TSLA Elon
3 AMZN 545 AMZN Dan
4 AMZN 443 AMZN Dan
5 SPY 451 SPY Josh
6 GME 228 GME Nate
7 TSLA 93 TSLA Elon
8 GME 555 GME Nate
9 GME 888 GME Nate
10 TSLA 64 TSLA Elon
11 SPY 858 SPY Josh
12 TSLA 827 TSLA Elon
13 SPY 277 SPY Josh
14 TSLA 632 TSLA Elon
15 SPY 166 SPY Josh
16 TSLA 758 TSLA Elon
17 GME 700 GME Nate
18 SPY 355 SPY Josh
19 AMZN 68 AMZN Dan
20 TSLA 970 TSLA Elon
21 AMZN 446 AMZN Dan
22 GME 893 GME Nate
23 SPY 678 SPY Josh
24 SPY 778 SPY Josh
Notice that when the key columns have different names, both are kept. If you want to keep only certain columns from the left or right Dataset, you can specify them with columns_left or columns_right:
>>> ds.merge_lookup(ds_symbol_trader, left_on='Symbol', right_on='UnderlyingSymbol',
... columns_right='Trader')
# Symbol Size Trader
-- ------ ---- ------
0 GME 644 Nate
1 SPY 403 Josh
2 TSLA 822 Elon
3 AMZN 545 Dan
4 AMZN 443 Dan
5 SPY 451 Josh
6 GME 228 Nate
7 TSLA 93 Elon
8 GME 555 Nate
9 GME 888 Nate
10 TSLA 64 Elon
11 SPY 858 Josh
12 TSLA 827 Elon
13 SPY 277 Josh
14 TSLA 632 Elon
15 SPY 166 Josh
16 TSLA 758 Elon
17 GME 700 Nate
18 SPY 355 Josh
19 AMZN 68 Dan
20 TSLA 970 Elon
21 AMZN 446 Dan
22 GME 893 Nate
23 SPY 678 Josh
24 SPY 778 Josh
Note: merge_lookup() Keeps Only the Keys in the Left Dataset
One thing to note about merge_lookup() is that it keeps only the keys that are in the left Dataset (it’s equivalent to a SQL left join). If there are keys in the right Dataset that aren’t in the left Dataset, they’re discarded in the merged data:
>>> # Create a right Dataset with an extra symbol key ('MSFT').
>>> ds_symbol_trader2 = rt.Dataset({'Symbol': ['GME', 'TSLA', 'SPY', 'AMZN', 'MSFT'],
... 'Trader': ['Nate', 'Elon', 'Josh', 'Dan', 'Lauren']})
>>> # Change 'UnderlyingSymbol' back to 'Symbol' for simplicity.
>>> ds_symbol_trader.col_rename('UnderlyingSymbol', 'Symbol')
>>> ds.merge_lookup(ds_symbol_trader2, on='Symbol', columns_right='Trader')
# Symbol Size Trader
-- ------ ---- ------
0 GME 644 Nate
1 SPY 403 Josh
2 TSLA 822 Elon
3 AMZN 545 Dan
4 AMZN 443 Dan
5 SPY 451 Josh
6 GME 228 Nate
7 TSLA 93 Elon
8 GME 555 Nate
9 GME 888 Nate
10 TSLA 64 Elon
11 SPY 858 Josh
12 TSLA 827 Elon
13 SPY 277 Josh
14 TSLA 632 Elon
15 SPY 166 Josh
16 TSLA 758 Elon
17 GME 700 Nate
18 SPY 355 Josh
19 AMZN 68 Dan
20 TSLA 970 Elon
21 AMZN 446 Dan
22 GME 893 Nate
23 SPY 678 Josh
24 SPY 778 Josh
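The left-join behavior can be sketched in plain Python (a conceptual illustration only, not how Riptable implements merge_lookup()): build a lookup table from the right Dataset’s key column, then map each left key through it. Right-only keys like 'MSFT' are never looked up, and a left key with no match (the hypothetical 'PLTR' below, added for illustration) gets a filler value where Riptable would use an invalid/NA value.

```python
# Conceptual sketch of merge_lookup()'s left-join semantics in plain Python.
left_symbols = ['GME', 'SPY', 'TSLA', 'AMZN', 'PLTR']  # 'PLTR': no match on the right
right = {'GME': 'Nate', 'TSLA': 'Elon', 'SPY': 'Josh',
         'AMZN': 'Dan', 'MSFT': 'Lauren'}  # 'MSFT' exists only on the right

# Every left row is kept; unmatched left keys get a filler value;
# right-only keys ('MSFT') simply never appear in the result.
traders = [right.get(sym, None) for sym in left_symbols]
print(traders)  # ['Nate', 'Josh', 'Elon', 'Dan', None]
```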
merge_lookup() with Overlapping Columns That Aren’t Keys
As we saw above, if the two key columns have the same name in both Datasets, only one is kept. For columns that aren’t used as keys, you’ll get a name collision error when you try to merge:
>>> # Add a Size column to the right Dataset
>>> ds_symbol_trader.Size = rng.integers(1, 1000, 4)
>>> try:
... rt.merge_lookup(ds, ds_symbol_trader, on='Symbol')
... except ValueError as e:
... print("ValueError:", e)
ValueError: columns overlap but no suffix specified: {'Size'}
If you want to keep both columns, add a suffix to each column name to disambiguate them:
>>> rt.merge_lookup(ds, ds_symbol_trader, on='Symbol', suffixes=('_1', '_2'))
# Symbol Size_1 Trader Size_2
-- ------ ------ ------ ------
0 GME 644 Nate 760
1 SPY 403 Josh 364
2 TSLA 822 Elon 195
3 AMZN 545 Dan 467
4 AMZN 443 Dan 467
5 SPY 451 Josh 364
6 GME 228 Nate 760
7 TSLA 93 Elon 195
8 GME 555 Nate 760
9 GME 888 Nate 760
10 TSLA 64 Elon 195
11 SPY 858 Josh 364
12 TSLA 827 Elon 195
13 SPY 277 Josh 364
14 TSLA 632 Elon 195
15 SPY 166 Josh 364
16 TSLA 758 Elon 195
17 GME 700 Nate 760
18 SPY 355 Josh 364
19 AMZN 68 Dan 467
20 TSLA 970 Elon 195
21 AMZN 446 Dan 467
22 GME 893 Nate 760
23 SPY 678 Josh 364
24 SPY 778 Josh 364
merge_lookup() with a Right Dataset That Has Duplicate Keys
If the right Dataset has more than one match for a unique key in the left Dataset, you can specify whether to use the first or the last match encountered in the right Dataset:
>>> # Create a right Dataset with a second GME key, associated to Lauren
>>> ds_symbol_trader3 = rt.Dataset({'Symbol': ['GME', 'TSLA', 'SPY', 'AMZN', 'GME'],
... 'Trader': ['Nate', 'Elon', 'Josh', 'Dan', 'Lauren']})
>>> ds_symbol_trader3
# Symbol Trader
- ------ ------
0 GME Nate
1 TSLA Elon
2 SPY Josh
3 AMZN Dan
4 GME Lauren
We’ll keep the last match:
>>> ds.merge_lookup(ds_symbol_trader3, on='Symbol', columns_right='Trader', keep='last')
# Symbol Size Trader
-- ------ ---- ------
0 GME 644 Lauren
1 SPY 403 Josh
2 TSLA 822 Elon
3 AMZN 545 Dan
4 AMZN 443 Dan
5 SPY 451 Josh
6 GME 228 Lauren
7 TSLA 93 Elon
8 GME 555 Lauren
9 GME 888 Lauren
10 TSLA 64 Elon
11 SPY 858 Josh
12 TSLA 827 Elon
13 SPY 277 Josh
14 TSLA 632 Elon
15 SPY 166 Josh
16 TSLA 758 Elon
17 GME 700 Lauren
18 SPY 355 Josh
19 AMZN 68 Dan
20 TSLA 970 Elon
21 AMZN 446 Dan
22 GME 893 Lauren
23 SPY 678 Josh
24 SPY 778 Josh
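The effect of keep can be sketched in plain Python (a conceptual illustration, not Riptable’s implementation): building a dict from the right Dataset’s rows keeps the last value written for a duplicate key, while building it from the reversed rows keeps the first.

```python
# Right Dataset rows with a duplicate 'GME' key, as (Symbol, Trader) pairs.
rows = [('GME', 'Nate'), ('TSLA', 'Elon'), ('SPY', 'Josh'),
        ('AMZN', 'Dan'), ('GME', 'Lauren')]

# dict() keeps the last value written for a repeated key -> like keep='last'
keep_last = dict(rows)
# Inserting in reverse order means the first occurrence wins -> like keep='first'
keep_first = dict(reversed(rows))

print(keep_last['GME'])   # Lauren
print(keep_first['GME'])  # Nate
```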
merge_asof()
In a merge_asof(), Riptable matches on the nearest key rather than an equal key.
This is useful for merges based on keys that are times, where the times in one Dataset are not an exact match for the times in another Dataset, but they’re close enough to be used to merge the data.
Note: To most efficiently find the nearest match, merge_asof() requires both key columns to be sorted. The key columns must also be numeric, such as a datetime, integer, or float. You can check whether a column is sorted with issorted(), or sort it in place with sort_inplace(). (If the key columns aren’t sorted, they are sorted for you before the merge, as described below.)
With merge_asof(), you need to specify how you want to find the closest match:

- direction='forward' matches based on the closest key in the right Dataset that’s greater than or equal to the key in the left Dataset.
- direction='backward' matches based on the closest key in the right Dataset that’s less than or equal to the key in the left Dataset.
- direction='nearest' matches based on the closest key in the right Dataset, regardless of whether it’s greater than or less than the key in the left Dataset.
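The three directions can be sketched in plain Python with the standard-library bisect module (a conceptual illustration on a sorted list of times in minutes since midnight, not Riptable’s implementation; bisect also shows why the key column must be sorted):

```python
import bisect

# Sorted right-hand keys: 09:25, 09:30, 10:00, 10:25 as minutes since midnight.
right_times = [565, 570, 600, 625]
t = 620  # left key: 10:20

# backward: nearest right key <= t
i = bisect.bisect_right(right_times, t) - 1
backward = right_times[i] if i >= 0 else None               # 600 (10:00)

# forward: nearest right key >= t
j = bisect.bisect_left(right_times, t)
forward = right_times[j] if j < len(right_times) else None  # 625 (10:25)

# nearest: whichever candidate is closer to t
candidates = [x for x in (backward, forward) if x is not None]
nearest = min(candidates, key=lambda x: abs(x - t))         # 625 (10:25)
print(backward, forward, nearest)  # 600 625 625
```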
Let’s see an example based on closest times. The left Dataset has three trades and their times. The right Dataset has spot prices and times that are not all exact matches. We’ll merge the spot prices from the right Dataset by getting the values associated with the nearest earlier times.
>>> # Left Dataset with trades and times
>>> ds = rt.Dataset({'Symbol': ['AAPL', 'AMZN', 'AAPL'],
... 'Venue': ['A', 'I', 'A'],
... 'Time': rt.TimeSpan(['09:30', '10:00', '10:20'])})
>>> # Right Dataset with spot prices and nearby times
>>> spot_ds = rt.Dataset({'Symbol': ['AMZN', 'AMZN', 'AMZN', 'AAPL', 'AAPL', 'AAPL'],
... 'Spot Price': [2000.0, 2025.0, 2030.0, 500.0, 510.0, 520.0],
... 'Time': rt.TimeSpan(['09:30', '10:00', '10:25', '09:25', '10:00', '10:25'])})
>>> ds
# Symbol Venue Time
- ------ ----- ------------------
0 AAPL A 09:30:00.000000000
1 AMZN I 10:00:00.000000000
2 AAPL A 10:20:00.000000000
>>> spot_ds
# Symbol Spot Price Time
- ------ ---------- ------------------
0 AMZN 2,000.00 09:30:00.000000000
1 AMZN 2,025.00 10:00:00.000000000
2 AMZN 2,030.00 10:25:00.000000000
3 AAPL 500.00 09:25:00.000000000
4 AAPL 510.00 10:00:00.000000000
5 AAPL 520.00 10:25:00.000000000
Note that an as-of merge requires the on columns to be sorted. Before the merge, the on columns are always checked. If they’re not sorted, by default they are sorted before the merge; the original order is then restored in the returned merged Dataset. If you don’t need to preserve the existing ordering, it’s faster to sort the on columns in place first:
>>> spot_ds.sort_inplace('Time')
# Symbol Spot Price Time
- ------ ---------- ------------------
0 AAPL 500.00 09:25:00.000000000
1 AMZN 2,000.00 09:30:00.000000000
2 AMZN 2,025.00 10:00:00.000000000
3 AAPL 510.00 10:00:00.000000000
4 AAPL 520.00 10:25:00.000000000
5 AMZN 2,030.00 10:25:00.000000000
Now we can merge based on the nearest earlier time. But not just any nearest earlier time: we want the nearest earlier time associated with the same symbol. We use the optional by parameter to make sure we match on the symbol before getting the nearest earlier time. We’ll also use the matched_on argument to show which key in spot_ds was matched on:
>>> ds.merge_asof(spot_ds, on='Time', by='Symbol', direction='backward', matched_on=True)
# Symbol Time Venue Spot Price matched_on
- ------ ------------------ ----- ---------- ------------------
0 AAPL 09:30:00.000000000 A 500.00 09:25:00.000000000
1 AMZN 10:00:00.000000000 I 2,025.00 10:00:00.000000000
2 AAPL 10:20:00.000000000 A 510.00 10:00:00.000000000
We can see that both AAPL trades were matched based on the nearest earlier time.
Merge based on the nearest later time:
>>> ds.merge_asof(spot_ds, on='Time', by='Symbol', direction='forward', matched_on=True)
# Symbol Time Venue Spot Price matched_on
- ------ ------------------ ----- ---------- ------------------
0 AAPL 09:30:00.000000000 A 510.00 10:00:00.000000000
1 AMZN 10:00:00.000000000 I 2,025.00 10:00:00.000000000
2 AAPL 10:20:00.000000000 A 520.00 10:25:00.000000000
Both AAPL trades were matched based on the nearest later time.
Here, we get the spot price associated with whatever time is nearest, whether it’s earlier or later:
>>> ds.merge_asof(spot_ds, on='Time', by='Symbol', direction='nearest', matched_on=True)
# Symbol Time Venue Spot Price matched_on
- ------ ------------------ ----- ---------- ------------------
0 AAPL 09:30:00.000000000 A 500.00 09:25:00.000000000
1 AMZN 10:00:00.000000000 I 2,025.00 10:00:00.000000000
2 AAPL 10:20:00.000000000 A 520.00 10:25:00.000000000
For the first AAPL trade, the nearest time is earlier. For the second AAPL trade, the nearest time is later.
We won’t spend time on examples of merge2(), which is Riptable’s more general merge function for database-style joins (left, right, inner, outer). Check out the API Reference for details.
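To give a flavor of how the join types differ, the set of keys each one keeps can be sketched in plain Python with set operations (a conceptual sketch only; see merge2() in the API Reference for the real parameters and behavior):

```python
# Key columns of a hypothetical left and right Dataset.
left_keys = {'GME', 'AMZN', 'TSLA', 'SPY'}
right_keys = {'GME', 'TSLA', 'SPY', 'MSFT'}

# Which keys survive each database-style join:
print(sorted(left_keys & right_keys))  # inner: keys present in both
print(sorted(left_keys))               # left:  all left keys
print(sorted(right_keys))              # right: all right keys
print(sorted(left_keys | right_keys))  # outer: keys present in either
```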
Next, we’ll briefly cover a couple of ways to change the shape of your Dataset: Reshape Data with Pivot and Transpose.
Questions or comments about this guide? Email RiptableDocumentation@sig.com.