riptable.rt_merge
Classes

JoinIndices – Holds fancy/logical indices into the left and right Datasets constructed by the join implementation, along with other relevant data needed to construct the resulting merged Dataset.

Functions

merge – Merge Dataset by performing a database-style join operation by columns.
merge2 – Merge Dataset by performing a database-style join operation by columns.
merge_asof – Combine two Dataset objects by performing a database-style left-join based on the nearest numeric key.
merge_indices – Perform a join/merge of two Dataset objects, returning the left/right indices created by the join engine.
merge_lookup – Combine two Dataset objects by performing a database-style left-join operation on columns.
- class riptable.rt_merge.JoinIndices
Bases: NamedTuple
Holds fancy/logical indices into the left and right Datasets constructed by the join implementation, along with other relevant data needed to construct the resulting merged Dataset.
- left_index: riptable.rt_fastarray.FastArray | None
Integer fancy index or boolean mask for selecting data from the columns of the left Dataset to create the columns of the merged Dataset. This index is optional; when None, indicates that the columns from the left Dataset can be used directly in the resulting merged Dataset without needing to be filtered or otherwise transformed.
- right_index: riptable.rt_fastarray.FastArray | None
Integer fancy index or boolean mask for selecting data from the columns of the right Dataset to create the columns of the merged Dataset. This index is optional; when None, indicates that the columns from the right Dataset can be used directly in the resulting merged Dataset without needing to be filtered or otherwise transformed.
- right_only_rowcount: int | None
Only populated for outer merges. Indicates the number of rows in the right Dataset whose keys do not occur in the left Dataset. This value can be used to slice right_fancyindex to get just the part at the end which represents these “right-only rows”.
- static result_rowcount(index_arr, dset_rowcount)
Calculate the number of rows resulting from indexing into a Dataset with a fancy/logical index.
- Parameters:
index_arr (np.ndarray, optional) – A fancy index or boolean mask array to be used to select rows from a Dataset.
dset_rowcount (int) – The number of rows in the Dataset that index_arr will be applied to.
- Returns:
The number of rows resulting from indexing into a Dataset with a fancy/logical index. Guaranteed to be non-negative.
- Return type:
int
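The row-count rule described above can be sketched in plain NumPy. This is only an illustration of the documented semantics, not riptable's actual implementation, and the helper name is hypothetical:

```python
import numpy as np

def result_rowcount_sketch(index_arr, dset_rowcount):
    """Row count produced by applying a fancy/logical index (None = pass-through)."""
    if index_arr is None:
        # No filtering or transforming needed: all rows pass through.
        return dset_rowcount
    if index_arr.dtype == np.bool_:
        # Boolean mask: one output row per True element.
        return int(np.count_nonzero(index_arr))
    # Integer fancy index: one output row per index entry (repeats allowed).
    return len(index_arr)

result_rowcount_sketch(np.array([True, False, True]), 3)  # 2
result_rowcount_sketch(np.array([0, 2, 2, 1]), 3)         # 4
result_rowcount_sketch(None, 5)                           # 5
```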
- riptable.rt_merge.merge(left, right, on=None, left_on=None, right_on=None, how='left', suffixes=('_x', '_y'), indicator=False, columns_left=None, columns_right=None, verbose=False, hint_size=0)
Merge Dataset by performing a database-style join operation by columns.
- Parameters:
left (Dataset) – Left Dataset.
right (Dataset) – Right Dataset.
on (str or list of str, optional) – Column names to join on. Must be found in both left and right.
left_on (str or list of str, optional) – Column names from the left Dataset to join on. When specified, overrides whatever is specified in on.
right_on (str or list of str, optional) – Column names from the right Dataset to join on. When specified, overrides whatever is specified in on.
how ({'left', 'right', 'inner', 'outer'}) –
left: use only keys from the left. The output rows will be in one-to-one correspondence with the left rows. If multiple matches on the right occur, the last is taken.
right: use only keys from the right. The output rows will be in one-to-one correspondence with the right rows. If multiple matches on the left occur, the last is taken.
inner: use the intersection of keys from both Datasets, similar to a SQL inner join.
outer: use the union of keys from both Datasets, similar to a SQL outer join.
suffixes (tuple of (str, str), default ('_x', '_y')) – Suffixes to apply to overlapping column names in the left and right side, respectively. To raise an exception on overlapping columns, use (False, False).
indicator (bool or str, default False) – If True, adds a column named “merge_indicator” to the output Dataset with information on the source of each row. If a string, the column is added under that name instead. The indicator column is a Categorical that takes the value “left_only” for rows whose merge key appears only in the left Dataset, “right_only” for rows whose merge key appears only in the right Dataset, and “both” when the merge key is found in both.
columns_left (str or list of str, optional) – Column names to include in the merge from left; defaults to None, which includes all columns.
columns_right (str or list of str, optional) – Column names to include in the merge from right; defaults to None, which includes all columns.
verbose (bool, default False) – If True, print debugging information to stdout.
hint_size (int) – An estimate of the number of unique keys used for the join, to optimize performance by pre-allocating memory for the key hash table.
- Returns:
merged
- Return type:
Dataset
Examples
>>> rt.merge(ds_simple_1, ds_simple_2, left_on = 'A', right_on = 'X', how = 'inner')
#   A      B   X       C
-   -   ----   -   -----
0   0   1.20   0    2.40
1   1   3.10   1    6.20
2   6   9.60   6   19.20
[3 rows x 4 columns] total bytes: 72.0 B
Demonstrating a ‘left’ merge.
>>> rt.merge(ds_complex_1, ds_complex_2, on = ['A','B'], how = 'left')
#   B    A       C       E
-   -   --   -----   -----
0   Q    0    2.40    1.50
1   R    6    6.20   11.20
2   S    9   19.20     nan
3   T   11   25.90     nan
[4 rows x 4 columns] total bytes: 84.0 B
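The "last match is taken" rule for how='left' can be illustrated with a small pure-Python stand-in. The helper below is hypothetical, not riptable code:

```python
# Miniature stand-in for the how='left' matching rule documented above:
# one output row per left row, and when a key occurs multiple times on the
# right, the LAST occurrence wins. (Hypothetical helper, not riptable code.)
def left_join_last(left_keys, right_keys):
    last = {}
    for i, key in enumerate(right_keys):
        last[key] = i  # later occurrences overwrite earlier ones
    # None marks a left row with no match on the right.
    return [last.get(key) for key in left_keys]

left_join_last(["a", "b", "c"], ["b", "a", "b"])  # [1, 2, None]
```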
See also
- riptable.rt_merge.merge2(left, right, on=None, left_on=None, right_on=None, how='left', suffixes=None, copy=True, indicator=False, columns_left=None, columns_right=None, validate=None, keep=None, high_card=None, hint_size=None, **kwargs)
Merge Dataset by performing a database-style join operation by columns.
- Parameters:
left (Dataset) – Left Dataset.
right (Dataset) – Right Dataset.
on (str or (str, str) or list of str or list of (str, str), optional) – Column names to join on. Must be found in both left and right.
left_on (str or list of str, optional) – Column names from the left Dataset to join on. When specified, overrides whatever is specified in on.
right_on (str or list of str, optional) – Column names from the right Dataset to join on. When specified, overrides whatever is specified in on.
how ({'left', 'right', 'inner', 'outer'}) – The type of merge to be performed.
left: use only keys from left, as in a SQL ‘left join’. Preserves the ordering of keys.
right: use only keys from right, as in a SQL ‘right join’. Preserves the ordering of keys.
inner: use the intersection of keys from both Datasets, as in a SQL ‘inner join’. Preserves the ordering of keys from left.
outer: use the union of keys from both Datasets, as in a SQL ‘full outer join’.
suffixes (tuple of (str, str), optional) – Suffixes to apply to overlapping column names in the left and right side, respectively. The default (None) causes an exception to be raised for any overlapping columns.
copy (bool, default True) – If False, avoid copying data when possible; this can reduce memory usage, but users must be aware that data can be shared between left and/or right and the Dataset returned by this function.
indicator (bool or str, default False) – If True, adds a column named “merge_indicator” to the output Dataset with information on the source of each row. If a string, the column is added under that name instead. The indicator column is a Categorical that takes the value “left_only” for rows whose merge key appears only in the left Dataset, “right_only” for rows whose merge key appears only in the right Dataset, and “both” when the merge key is found in both.
columns_left (str or list of str, optional) – Column names to include in the merge from left; defaults to None, which includes all columns.
columns_right (str or list of str, optional) – Column names to include in the merge from right; defaults to None, which includes all columns.
validate ({'one_to_one', '1:1', 'one_to_many', '1:m', 'many_to_one', 'm:1', 'many_to_many', 'm:m'}, optional) – Validate the uniqueness of the values in the columns specified by the on, left_on, and right_on parameters. In other words, this allows the multiplicity of the keys to be checked, so the user can prevent the merge unless the keys in one or both of the Datasets being merged are unique. Note: the keep parameter logically takes effect before validate when both are specified.
keep ({'first', 'last'} or (str, str), optional) – An optional string specifying that only the first or last occurrence of each unique key within left and right should be kept. In other words, resolves multiple occurrences of keys (multiplicity > 1) to a single occurrence.
high_card (bool or (bool, bool), optional) – Hint to the low-level grouping implementation that the key(s) of left and/or right contain a high number of unique values (cardinality); the grouping logic may use this hint to select an algorithm that provides better performance in such cases.
hint_size (int or (int, int), optional) – An estimate of the number of unique keys used for the join. Used as a performance hint to the low-level grouping implementation. This hint is typically ignored when high_card is specified.
- Returns:
merged
- Return type:
Dataset
Examples
>>> rt.merge2(ds_simple_1, ds_simple_2, left_on = 'A', right_on = 'X', how = 'inner')
#   A      B   X       C
-   -   ----   -   -----
0   0   1.20   0    2.40
1   1   3.10   1    6.20
2   6   9.60   6   19.20
[3 rows x 4 columns] total bytes: 72.0 B
Demonstrating a ‘left’ merge.
>>> rt.merge2(ds_complex_1, ds_complex_2, on = ['A','B'], how = 'left')
#   B    A       C       E
-   -   --   -----   -----
0   Q    0    2.40    1.50
1   R    6    6.20   11.20
2   S    9   19.20     nan
3   T   11   25.90     nan
[4 rows x 4 columns] total bytes: 84.0 B
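What validate checks can be sketched in pure Python. Here is a hypothetical stand-in for the 'm:1' case (right keys must be unique); it is not riptable's implementation:

```python
from collections import Counter

# Hypothetical stand-in for validate='m:1': left keys may repeat, but every
# right key must occur at most once. (Not riptable's implementation.)
def validate_m1(right_keys):
    dups = sorted(key for key, n in Counter(right_keys).items() if n > 1)
    if dups:
        raise ValueError(f"merge keys are not unique on the right: {dups}")

validate_m1(["a", "b", "c"])  # passes silently
try:
    validate_m1(["a", "b", "b"])
except ValueError as e:
    print(e)  # merge keys are not unique on the right: ['b']
```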
See also
- riptable.rt_merge.merge_asof(left, right, on=None, left_on=None, right_on=None, by=None, left_by=None, right_by=None, suffixes=None, copy=True, columns_left=None, columns_right=None, tolerance=None, allow_exact_matches=True, direction='backward', verbose=False, action_on_unsorted='sort', matched_on=False, **kwargs)
Combine two Dataset objects by performing a database-style left-join based on the nearest numeric key.
An as-of merge is useful for keys that are times (or other numeric values) that aren’t exact matches but are close enough to merge on.
Both Dataset objects must be sorted by the key column.
Use the direction argument to find the nearest key in the right Dataset:
A direction="backward" search selects the closest key that’s less than or equal to the key in the left Dataset.
A direction="forward" search selects the closest key that’s greater than or equal to the key in the left Dataset.
A direction="nearest" search selects the key that’s closest in absolute distance to the key in the left Dataset.
Optionally, you can match on equivalent keys with by before performing an as-of merge with on.
- Parameters:
left (Dataset) – The first Dataset to merge. If a nearest match for a key in left isn’t found in right, the returned Dataset includes a row with the columns from left, but with NaN values in each column from right.
right (Dataset) – The second Dataset to merge. If rows in right don’t have nearest matches in left, they are discarded. If they match multiple rows in left, they are duplicated appropriately.
on (str or (str, str)) – Name of the column to join on (the key column). If the column name is the same in both Dataset objects, use a single string. If the column names are different, use a tuple of strings. The values must be numeric (such as integers, floats, or datetimes). If on isn’t specified, left_on and right_on must be specified.
left_on (str, optional) – Use instead of on to specify the column in the left Dataset to join on.
right_on (str, optional) – Use instead of on to specify the column in the right Dataset to join on.
by (str or (str, str) or list of str or list of (str, str), optional) – Match equal keys in these columns before performing the as-of merge. Options for types:
Single string: Join on one column that has the same name in both Dataset objects.
List: A list of strings is treated as a multi-key in which all associated key column values in left must have matches in right. The column names must be the same in both Dataset objects, unless they’re in a tuple; see below.
Tuple: Use a tuple to specify key columns that have different names. For example, ("col_a", "col_b") joins on col_a in left and col_b in right. Both columns are in the returned Dataset unless you specify otherwise using columns_left or columns_right.
left_by (str or list of str, optional) – Use instead of by to specify names of columns to match equivalent values on in the left Dataset.
right_by (str or list of str, optional) – Use instead of by to specify names of columns to match equivalent values on in the right Dataset.
suffixes ((str, str), optional) – Suffixes to apply to returned overlapping non-key-column names in left and right, respectively. By default, an error is raised for any overlapping non-key columns that will be in the returned Dataset.
copy (bool, default True) – Set to False to avoid copying data when possible. This can reduce memory usage, but be aware that data can be shared among left, right, and the Dataset returned by this function.
columns_left (str or list of str, optional) – Names of columns from left to include in the merged Dataset. By default, all columns are included.
columns_right (str or list of str, optional) – Names of columns from right to include in the merged Dataset. By default, all columns are included.
tolerance (int, float, or timedelta, optional) – Not implemented. Tolerance allowed when performing the as-of part of the merge. When a row from left doesn’t have a key in right within this distance, that row will have a NaN for any columns from right that appear in the merged result.
allow_exact_matches (bool, default True) – If True (the default), allow matching with an equivalent on value (i.e., allow less-than-or-equal-to or greater-than-or-equal-to matches). If False, don’t match an equivalent on value (i.e., perform only strictly-less-than or strictly-greater-than matches).
direction ({"backward", "forward", "nearest"}, default "backward") – Whether to search for prior, subsequent, or closest matches in the right Dataset.
verbose (bool, default False) – Show information used for debugging.
check_sorted (bool, default True) – Deprecated since version 1.10.0: see action_on_unsorted.
action_on_unsorted ({"sort", "raise"}, default "sort") – New in version 1.10.0: The on columns are always checked to see if they are sorted. If they’re unsorted, by default they are sorted before the merge; the original order is then restored in the returned merged Dataset. Set to “raise” to raise an error for any unsorted on column.
matched_on (bool or str, default False) – Add a column to the merged Dataset that contains the on column value from right that was matched. When True, the column uses the default name “matched_on”; specify a string to name the column.
left_index (bool, default False) – Deprecated since version 1.10.0. This parameter is only provided for compatibility with pandas.merge_asof and has no effect in a Riptable as-of merge.
right_index (bool, default False) – Deprecated since version 1.10.0. This parameter is only provided for compatibility with pandas.merge_asof and has no effect in a Riptable as-of merge.
- Returns:
A new Dataset of the two merged objects.
- Return type:
Dataset
See also
merge_lookup
Merge two Dataset objects based on equivalent keys.
merge2
Merge two Dataset objects using various database-style joins.
merge_indices
Return the left and right indices created by the join engine.
Dataset.merge_asof
Merge two Dataset objects using the nearest key.
Dataset.merge_lookup
Merge two Dataset objects with an in-place option.
Dataset.merge2
Merge two Dataset objects using various database-style joins.
Examples
>>> left = rt.Dataset({"a": [1, 5, 10], "left_val": ["a", "b", "c"]})
>>> right = rt.Dataset({"a": [1, 2, 3, 6, 7], "right_val": [1, 2, 3, 6, 7]})
>>> left
#    a   left_val
-   --   --------
0    1   a
1    5   b
2   10   c
[3 rows x 2 columns] total bytes: 15.0 B
>>> right
#   a   right_val
-   -   ---------
0   1           1
1   2           2
2   3           3
3   6           6
4   7           7
[5 rows x 2 columns] total bytes: 40.0 B
Merge based on the integers in the “a” columns. The first match is exact; the second two are “backward” nearest matches (the default direction):
>>> rt.merge_asof(left, right, on="a")
#    a   left_val   right_val
-   --   --------   ---------
0    1   a                  1
1    5   b                  3
2   10   c                  7
[3 rows x 3 columns] total bytes: 27.0 B
When allow_exact_matches=False, a nearest match is used if there is one (as for row 0). Here, also, there’s no “forward” nearest match in right for row 2:
>>> rt.merge_asof(left, right, on="a", direction="forward", allow_exact_matches=False)
#    a   left_val   right_val
-   --   --------   ---------
0    1   a                  2
1    5   b                  6
2   10   c                Inv
[3 rows x 3 columns] total bytes: 27.0 B
If allow_exact_matches=False and there are no nearest matches, a NaN value is filled in:
>>> rt.merge_asof(left, right, on="a", allow_exact_matches=False)
#    a   left_val   right_val
-   --   --------   ---------
0    1   a                Inv
1    5   b                  3
2   10   c                  7
[3 rows x 3 columns] total bytes: 27.0 B
As-of merges are good for time-series data. Here, the Dataset objects are merged with on="Time" and by="Symbol" to get the nearest time associated with the same symbol:
>>> # Left Dataset with trades and times.
>>> ds = rt.Dataset({"Symbol": ["AAPL", "AMZN", "AAPL"],
...                  "Venue": ["A", "I", "A"],
...                  "Time": rt.TimeSpan(["09:30", "10:00", "10:20"])})
>>> # Right Dataset with spot prices and nearby times.
>>> spot_ds = rt.Dataset({"Symbol": ["AMZN", "AMZN", "AMZN", "AAPL", "AAPL", "AAPL"],
...                       "Spot Price": [2000.0, 2025.0, 2030.0, 500.0, 510.0, 520.0],
...                       "Time": rt.TimeSpan(["09:30", "10:00", "10:25", "09:25", "10:00", "10:25"])})
>>> ds
#   Symbol   Venue   Time
-   ------   -----   ------------------
0   AAPL     A       09:30:00.000000000
1   AMZN     I       10:00:00.000000000
2   AAPL     A       10:20:00.000000000
[3 rows x 3 columns] total bytes: 39.0 B
>>> spot_ds
#   Symbol   Spot Price   Time
-   ------   ----------   ------------------
0   AMZN       2,000.00   09:30:00.000000000
1   AMZN       2,025.00   10:00:00.000000000
2   AMZN       2,030.00   10:25:00.000000000
3   AAPL         500.00   09:25:00.000000000
4   AAPL         510.00   10:00:00.000000000
5   AAPL         520.00   10:25:00.000000000
[6 rows x 3 columns] total bytes: 120.0 B
Note that an as-of merge requires the on columns to be sorted. Before the merge, the on columns are always checked. If they’re not sorted, by default they are sorted before the merge; the original order is then restored in the returned merged Dataset.
If you don’t need to preserve the existing ordering, it’s faster to sort the on columns in place first:
>>> spot_ds.sort_inplace("Time")
#   Symbol   Spot Price   Time
-   ------   ----------   ------------------
0   AAPL         500.00   09:25:00.000000000
1   AMZN       2,000.00   09:30:00.000000000
2   AMZN       2,025.00   10:00:00.000000000
3   AAPL         510.00   10:00:00.000000000
4   AAPL         520.00   10:25:00.000000000
5   AMZN       2,030.00   10:25:00.000000000
[6 rows x 3 columns] total bytes: 120.0 B
Get the nearest earlier time:
>>> rt.merge_asof(ds, spot_ds, on="Time", by="Symbol", direction="backward", matched_on=True)
#   Symbol   Time                 Venue   Spot Price   matched_on
-   ------   ------------------   -----   ----------   ------------------
0   AAPL     09:30:00.000000000   A           500.00   09:25:00.000000000
1   AMZN     10:00:00.000000000   I          2025.00   10:00:00.000000000
2   AAPL     10:20:00.000000000   A           510.00   10:00:00.000000000
[3 rows x 5 columns] total bytes: 87.0 B
Get the nearest later time:
>>> rt.merge_asof(ds, spot_ds, on="Time", by="Symbol", direction="forward", matched_on=True)
#   Symbol   Time                 Venue   Spot Price   matched_on
-   ------   ------------------   -----   ----------   ------------------
0   AAPL     09:30:00.000000000   A           510.00   10:00:00.000000000
1   AMZN     10:00:00.000000000   I          2025.00   10:00:00.000000000
2   AAPL     10:20:00.000000000   A           520.00   10:25:00.000000000
[3 rows x 5 columns] total bytes: 87.0 B
Get the nearest time, whether earlier or later:
>>> rt.merge_asof(ds, spot_ds, on="Time", by="Symbol", direction="nearest", matched_on=True)
#   Symbol   Time                 Venue   Spot Price   matched_on
-   ------   ------------------   -----   ----------   ------------------
0   AAPL     09:30:00.000000000   A           500.00   09:25:00.000000000
1   AMZN     10:00:00.000000000   I          2025.00   10:00:00.000000000
2   AAPL     10:20:00.000000000   A           520.00   10:25:00.000000000
[3 rows x 5 columns] total bytes: 87.0 B
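The three direction modes can be sketched with np.searchsorted on a sorted key column. This is an illustration of the documented matching rules (with exact matches allowed, i.e. allow_exact_matches=True), not riptable's implementation; -1 marks "no match":

```python
import numpy as np

def asof_match(left_keys, right_keys, direction="backward"):
    """Index into right_keys matched for each left key; -1 means no match.

    right_keys must be sorted, as the as-of merge requires.
    """
    left = np.asarray(left_keys, dtype=float)
    right = np.asarray(right_keys, dtype=float)
    back = np.searchsorted(right, left, side="right") - 1   # closest key <= left
    fwd = np.searchsorted(right, left, side="left")         # closest key >= left
    if direction == "backward":
        return back
    if direction == "forward":
        return np.where(fwd < len(right), fwd, -1)
    # "nearest": pick whichever valid candidate is closer in absolute distance.
    back_dist = np.where(back >= 0, np.abs(left - right[np.clip(back, 0, None)]), np.inf)
    fwd_dist = np.where(fwd < len(right),
                        np.abs(right[np.clip(fwd, None, len(right) - 1)] - left), np.inf)
    return np.where(back_dist <= fwd_dist, back, np.where(fwd < len(right), fwd, -1))

# Same keys as the integer example above: left "a" = [1, 5, 10], right "a" = [1, 2, 3, 6, 7].
asof_match([1, 5, 10], [1, 2, 3, 6, 7], "backward")  # indices of right values 1, 3, 7
```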
- riptable.rt_merge.merge_indices(left, right, *, on=None, how='left', validate=None, keep=None, high_card=None, hint_size=None, **kwargs)
Perform a join/merge of two Dataset objects, returning the left/right indices created by the join engine.
The returned indices can be used to index into the left and right Dataset objects to construct a merged/joined Dataset.
- Parameters:
left (Dataset) – Left Dataset
right (Dataset) – Right Dataset
on (str or (str, str) or list of str or list of (str, str), optional) – Column names to join on. Must be found in both left and right.
how ({'left', 'right', 'inner', 'outer'}) – The type of merge to be performed.
left: use only keys from left, as in a SQL ‘left join’. Preserves the ordering of keys.
right: use only keys from right, as in a SQL ‘right join’. Preserves the ordering of keys.
inner: use the intersection of keys from both Datasets, as in a SQL ‘inner join’. Preserves the ordering of keys from left.
outer: use the union of keys from both Datasets, as in a SQL ‘full outer join’.
validate ({'one_to_one', '1:1', 'one_to_many', '1:m', 'many_to_one', 'm:1', 'many_to_many', 'm:m'}, optional) – Validate the uniqueness of the values in the columns specified by the on, left_on, and right_on parameters. In other words, this allows the multiplicity of the keys to be checked, so the user can prevent the merge unless the keys in one or both of the Datasets being merged are unique. Note: the keep parameter logically takes effect before validate when both are specified.
keep ({'first', 'last'} or (str, str), optional) – An optional string specifying that only the first or last occurrence of each unique key within left and right should be kept. In other words, resolves multiple occurrences of keys (multiplicity > 1) to a single occurrence.
high_card (bool or (bool, bool), optional) – Hint to the low-level grouping implementation that the key(s) of left and/or right contain a high number of unique values (cardinality); the grouping logic may use this hint to select an algorithm that provides better performance in such cases.
hint_size (int or (int, int), optional) – An estimate of the number of unique keys used for the join. Used as a performance hint to the low-level grouping implementation. This hint is typically ignored when high_card is specified.
- Return type:
JoinIndices
Examples
>>> rt.merge_indices(ds_simple_1, ds_simple_2, on=('A', 'X'), how = 'inner')
#   A      B   X       C
-   -   ----   -   -----
0   0   1.20   0    2.40
1   1   3.10   1    6.20
2   6   9.60   6   19.20
[3 rows x 4 columns] total bytes: 72.0 B
Demonstrating a ‘left’ merge.
>>> rt.merge_indices(ds_complex_1, ds_complex_2, on = ['A','B'], how = 'left')
#   B    A       C       E
-   -   --   -----   -----
0   Q    0    2.40    1.50
1   R    6    6.20   11.20
2   S    9   19.20     nan
3   T   11   25.90     nan
[4 rows x 4 columns] total bytes: 84.0 B
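The way join indices are consumed can be illustrated with plain NumPy arrays standing in for Dataset columns. The matched positions below are assumed for illustration (an inner join matching keys 0, 1, and 6), not produced by riptable:

```python
import numpy as np

# Plain-NumPy illustration of consuming join indices: each column of the
# merged result is built by fancy-indexing a source column. The index pair
# below is assumed for illustration, not computed by riptable's join engine.
left_a = np.array([0, 1, 5, 6])
left_b = np.array([1.2, 3.1, 7.0, 9.6])
right_x = np.array([0, 1, 6, 9])
right_c = np.array([2.4, 6.2, 19.2, 30.0])

left_index = np.array([0, 1, 3])    # rows of the left columns to keep
right_index = np.array([0, 1, 2])   # rows of the right columns to keep

merged = {
    "A": left_a[left_index],
    "B": left_b[left_index],
    "X": right_x[right_index],
    "C": right_c[right_index],
}
merged["A"]  # array([0, 1, 6])
```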
See also
- riptable.rt_merge.merge_lookup(left, right, on=None, left_on=None, right_on=None, require_match=False, suffixes=None, copy=True, columns_left=None, columns_right=None, keep=None, high_card=None, hint_size=None)
Combine two Dataset objects by performing a database-style left-join operation on columns.
This operation returns a new Dataset object. To do an in-place merge, use Dataset.merge_lookup with inplace=True.
- Parameters:
left (Dataset) – The first Dataset to merge. If a matching value for a key in left isn’t found in right, the returned Dataset includes a row with the columns from left, but with NaN values in each column from right.
right (Dataset) – The second Dataset to merge. If rows in right don’t have matches in left, they are discarded. If they match multiple rows in left, they are duplicated appropriately.
on (str or (str, str) or list of str or list of (str, str), optional) – Names of columns (keys) to join on. If on isn’t specified, left_on and right_on must be specified. Options for types:
Single string: Join on one column that has the same name in both Dataset objects.
List: A list of strings is treated as a multi-key in which all associated key column values in left must have matches in right. The column names must be the same in both Dataset objects, unless they’re in a tuple; see below.
Tuple: Use a tuple to specify key columns that have different names. For example, ("col_a", "col_b") joins on col_a in left and col_b in right. Both columns are in the returned Dataset unless you specify otherwise using columns_left or columns_right.
left_on (str or list of str, optional) – Use instead of on to specify names of columns in the left Dataset to join on. A list of strings is treated as a multi-key in which all associated key column values in left must have matches in right. If both on and left_on are specified, an error is raised.
right_on (str or list of str, optional) – Use instead of on to specify names of columns in the right Dataset to join on. A list of strings is treated as a multi-key in which all associated key column values in right must have matches in left. If both on and right_on are specified, an error is raised.
require_match (bool, default False) – When True, all keys in left are required to have a matching key in right, and an error is raised when this requirement is not met.
suffixes (tuple of (str, str), optional) – Suffixes to apply to returned overlapping non-key-column names in left and right, respectively. By default, an error is raised for any overlapping non-key columns that will be in the returned Dataset.
copy (bool, default True) – Set to False to avoid copying data when possible. This can reduce memory usage, but be aware that data can be shared among left, right, and the Dataset returned by this function.
columns_left (str or list of str, optional) – Names of columns from left to include in the merged Dataset. By default, all columns are included.
columns_right (str or list of str, optional) – Names of columns from right to include in the merged Dataset. By default, all columns are included.
keep ({None, 'first', 'last'}, optional) – When right contains multiple rows with a given unique key from left, keep only one such row; this parameter indicates whether it should be the first or last row with the given key. By default (keep=None), an error is raised if there are any non-unique keys in right.
high_card (bool or (bool, bool), optional) – Hint to the low-level grouping implementation that the key(s) of left or right contain a high number of unique values (cardinality); the grouping logic may use this hint to select an algorithm that provides better performance in such cases.
hint_size (int or (int, int), optional) – An estimate of the number of unique keys used for the join. Used as a performance hint to the low-level grouping implementation. This hint is typically ignored when high_card is specified.
- Returns:
A new Dataset of the two merged objects.
- Return type:
Dataset
See also
merge_asof
Merge two Dataset objects using the nearest key.
merge2
Merge two Dataset objects using various database-style joins.
merge_indices
Return the left and right indices created by the join engine.
Dataset.merge_lookup
Merge two Dataset objects with an in-place option.
Dataset.merge2
Merge two Dataset objects using various database-style joins.
Dataset.merge_asof
Merge two Dataset objects using the nearest key.
Examples
A basic merge on a single column:
>>> ds_l = rt.Dataset({"Symbol": rt.FA(["GME", "AMZN", "TSLA", "SPY", "TSLA",
...                                     "AMZN", "GME", "SPY", "GME", "TSLA"])})
>>> ds_r = rt.Dataset({"Symbol": rt.FA(["TSLA", "GME", "AMZN", "SPY"]),
...                    "Trader": rt.FA(["Nate", "Elon", "Josh", "Dan"])})
>>> ds_l
#   Symbol
-   ------
0   GME
1   AMZN
2   TSLA
3   SPY
4   TSLA
5   AMZN
6   GME
7   SPY
8   GME
9   TSLA
[10 rows x 1 columns] total bytes: 40.0 B
>>> ds_r
#   Symbol   Trader
-   ------   ------
0   TSLA     Nate
1   GME      Elon
2   AMZN     Josh
3   SPY      Dan
[4 rows x 2 columns] total bytes: 32.0 B
>>> rt.merge_lookup(ds_l, ds_r, on="Symbol")
#   Symbol   Trader
-   ------   ------
0   GME      Elon
1   AMZN     Josh
2   TSLA     Nate
3   SPY      Dan
4   TSLA     Nate
5   AMZN     Josh
6   GME      Elon
7   SPY      Dan
8   GME      Elon
9   TSLA     Nate
[10 rows x 2 columns] total bytes: 80.0 B
When key columns have different names, use left_on and right_on to specify them:
>>> ds_r.col_rename("Symbol", "Primary_Symbol")
>>> rt.merge_lookup(ds_l, ds_r, left_on="Symbol", right_on="Primary_Symbol",
...                 columns_right="Trader")
#   Symbol   Trader
-   ------   ------
0   GME      Elon
1   AMZN     Josh
2   TSLA     Nate
3   SPY      Dan
4   TSLA     Nate
5   AMZN     Josh
6   GME      Elon
7   SPY      Dan
8   GME      Elon
9   TSLA     Nate
[10 rows x 2 columns] total bytes: 80.0 B
For non-key columns with the same name that will be returned, specify suffixes:
>>> # Add duplicate non-key columns.
>>> ds_l.Value = rt.FA([0.72, 0.85, 0.14, 0.55, 0.77, 0.65, 0.23, 0.15, 0.43, 0.25])
>>> ds_r.Value = rt.FA([0.28, 0.56, 0.89, 0.74])
>>> # You can also use a tuple to specify left and right key columns.
>>> rt.merge_lookup(ds_l, ds_r, on=("Symbol", "Primary_Symbol"),
...                 suffixes=["_1", "_2"], columns_right=["Value", "Trader"])
#   Symbol   Value_1   Value_2   Trader
-   ------   -------   -------   ------
0   GME         0.72      0.56   Elon
1   AMZN        0.85      0.89   Josh
2   TSLA        0.14      0.28   Nate
3   SPY         0.55      0.74   Dan
4   TSLA        0.77      0.28   Nate
5   AMZN        0.65      0.89   Josh
6   GME         0.23      0.56   Elon
7   SPY         0.15      0.74   Dan
8   GME         0.43      0.56   Elon
9   TSLA        0.25      0.28   Nate
[10 rows x 4 columns] total bytes: 240.0 B
When on is a list, a multi-key join is performed. All keys must match in the right Dataset.
If a matching value for a key in the left Dataset isn’t found in the right Dataset, the returned Dataset includes a row with the columns from left but with NaN values in the columns from right.
>>> # Add associated Size values for multi-key join. Note that one
>>> # symbol-size pair in the left Dataset doesn't have a match in
>>> # the right Dataset.
>>> ds_l.Size = rt.FA([500, 150, 430, 225, 430, 320, 175, 620, 135, 260])
>>> ds_r.Size = rt.FA([430, 500, 150, 2250])
>>> # Pass a list of key columns that contains a tuple.
>>> rt.merge_lookup(ds_l, ds_r, on=[("Symbol", "Primary_Symbol"), "Size"],
...                 suffixes=["_1", "_2"])
#   Size   Symbol   Value_1   Trader   Value_2
-   ----   ------   -------   ------   -------
0    500   GME         0.72   Elon        0.56
1    150   AMZN        0.85   Josh        0.89
2    430   TSLA        0.14   Nate        0.28
3    225   SPY         0.55                nan
4    430   TSLA        0.77   Nate        0.28
5    320   AMZN        0.65                nan
6    175   GME         0.23                nan
7    620   SPY         0.15                nan
8    135   GME         0.43                nan
9    260   TSLA        0.25                nan
[10 rows x 5 columns] total bytes: 280.0 B
When the right Dataset has more than one matching key, use keep to specify which one to use:
>>> ds_l = rt.Dataset({"Symbol": rt.FA(["GME", "AMZN", "TSLA", "SPY", "TSLA",
...                                     "AMZN", "GME", "SPY", "GME", "TSLA"])})
>>> ds_r = rt.Dataset({"Symbol": rt.FA(["TSLA", "GME", "AMZN", "SPY", "SPY"]),
...                    "Trader": rt.FA(["Nate", "Elon", "Josh", "Dan", "Amy"])})
>>> rt.merge_lookup(ds_l, ds_r, on="Symbol", keep="last")
#   Symbol   Trader
-   ------   ------
0   GME      Elon
1   AMZN     Josh
2   TSLA     Nate
3   SPY      Amy
4   TSLA     Nate
5   AMZN     Josh
6   GME      Elon
7   SPY      Amy
8   GME      Elon
9   TSLA     Nate
[10 rows x 2 columns] total bytes: 80.0 B
Invalid values are not treated as equal keys:
>>> ds_l = rt.Dataset({"Key": [1.0, rt.nan, 2.0], "Value1": [1.0, 2.0, 3.0]})
>>> ds_r = rt.Dataset({"Key": [1.0, 2.0, rt.nan], "Value2": [1.0, 2.0, 3.0]})
>>> rt.merge_lookup(ds_l, ds_r, on="Key", columns_right="Value2")
#    Key   Value1   Value2
-   ----   ------   ------
0   1.00     1.00     1.00
1    nan     2.00      nan
2   2.00     3.00     2.00
[3 rows x 3 columns] total bytes: 72.0 B
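The reason an invalid key never matches can be sketched without riptable at all: NaN compares unequal to everything, including itself, so a lookup keyed on NaN finds nothing. The dict-based stand-in below is hypothetical, not how riptable performs the join:

```python
# NaN != NaN, so an invalid key can never equal a key on the right.
nan = float("nan")
assert nan != nan

# Hypothetical dict-based stand-in for the lookup: the NaN row in the left
# keys finds no match and gets a missing value (None here, NaN in riptable).
left_keys = [1.0, nan, 2.0]
right_lookup = {1.0: 1.0, 2.0: 2.0}
[right_lookup.get(k) for k in left_keys]  # [1.0, None, 2.0]
```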