riptable.rt_merge

Classes

JoinIndices

Holds fancy/logical indices into the left and right Datasets constructed by the join implementation.

Functions

merge(left, right[, on, left_on, right_on, how, ...])

Merge Dataset by performing a database-style join operation by columns.

merge2(left, right[, on, left_on, right_on, how, ...])

Merge Dataset by performing a database-style join operation by columns.

merge_asof(left, right[, on, left_on, right_on, by, ...])

Combine two Dataset objects by performing a database-style left-join based on the nearest numeric key.

merge_indices(left, right, *[, on, how, validate, ...])

Perform a join/merge of two Dataset objects, returning the left/right indices created by the join engine.

merge_lookup(left, right[, on, left_on, right_on, ...])

Combine two Dataset objects by performing a database-style left-join operation on columns.

class riptable.rt_merge.JoinIndices

Bases: NamedTuple

Holds fancy/logical indices into the left and right Datasets constructed by the join implementation, along with other relevant data needed to construct the resulting merged Dataset.

left_index: riptable.rt_fastarray.FastArray | None

Integer fancy index or boolean mask for selecting data from the columns of the left Dataset to create the columns of the merged Dataset. This index is optional; when None, indicates that the columns from the left Dataset can be used directly in the resulting merged Dataset without needing to be filtered or otherwise transformed.

right_index: riptable.rt_fastarray.FastArray | None

Integer fancy index or boolean mask for selecting data from the columns of the right Dataset to create the columns of the merged Dataset. This index is optional; when None, indicates that the columns from the right Dataset can be used directly in the resulting merged Dataset without needing to be filtered or otherwise transformed.

right_only_rowcount: int | None

Only populated for outer merges. Indicates the number of rows in the right Dataset whose keys do not occur in the left Dataset. This value can be used to slice right_index to get just the part at the end which represents these “right-only” rows.
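The slicing described above can be sketched in NumPy (the index values here are hypothetical, chosen only to illustrate the layout):

```python
import numpy as np

# Hypothetical right_index from an outer join: matched rows come first,
# then the rows whose keys occur only in the right Dataset are appended.
right_index = np.array([0, 2, 1, 3, 4])
right_only_rowcount = 2

# The last right_only_rowcount entries select the "right-only" rows.
right_only = right_index[len(right_index) - right_only_rowcount:]
print(right_only.tolist())  # [3, 4]
```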

static result_rowcount(index_arr, dset_rowcount)

Calculate the number of rows resulting from indexing into a Dataset with a fancy/logical index.

Parameters:
  • index_arr (np.ndarray, optional) – A fancy index or boolean mask array to be used to select rows from a Dataset.

  • dset_rowcount (int) – The number of rows in the Dataset that index_arr will be applied to.

Returns:

The number of rows resulting from indexing into a Dataset with a fancy/logical index. Guaranteed to be non-negative.

Return type:

int
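The counting rule can be restated as a short NumPy sketch (an illustrative re-implementation, not riptable's own code):

```python
import numpy as np

def result_rowcount(index_arr, dset_rowcount):
    # No index: all rows of the Dataset pass through unchanged.
    if index_arr is None:
        return dset_rowcount
    # Boolean mask: one output row per True element.
    if index_arr.dtype == np.bool_:
        return int(index_arr.sum())
    # Integer fancy index: one output row per index element.
    return len(index_arr)

print(result_rowcount(None, 10))                          # 10: all rows kept
print(result_rowcount(np.array([True, False, True]), 3))  # 2: rows where mask is True
print(result_rowcount(np.array([0, 2, 2, 1]), 3))         # 4: one row per index
```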

riptable.rt_merge.merge(left, right, on=None, left_on=None, right_on=None, how='left', suffixes=('_x', '_y'), indicator=False, columns_left=None, columns_right=None, verbose=False, hint_size=0)

Merge Dataset by performing a database-style join operation by columns.

Parameters:
  • left (Dataset) – Left Dataset

  • right (Dataset) – Right Dataset

  • on (str or list of str, optional) – Column names to join on. Must be found in both left and right.

  • left_on (str or list of str, optional) – Column names from left Dataset to join on. When specified, overrides whatever is specified in on.

  • right_on (str or list of str, optional) – Column names from right to join on. When specified, overrides whatever is specified in on.

  • how ({'left','right', 'inner', 'outer'}) –

    • left: use only keys from the left. The output rows will be in one-to-one correspondence with the left rows! If multiple matches on the right occur, the last is taken.

    • right: use only keys from the right. The output rows will be in one-to-one correspondence with the right rows! If multiple matches on the left occur, the last is taken.

    • inner: use intersection of keys from both Datasets, similar to SQL inner join

    • outer: use union of keys from both Datasets, similar to SQL outer join

  • suffixes (tuple of (str, str), default ('_x', '_y')) – Suffix to apply to overlapping column names in the left and right side, respectively. To raise an exception on overlapping columns use (False, False).

  • indicator (bool or str, default False) – If True, adds a column named “merge_indicator” to the output Dataset with information on the source of each row. If a string, the indicator column is named with that string instead. The indicator column is a Categorical whose value is “left_only” for rows whose merge key appears only in the left Dataset, “right_only” for rows whose merge key appears only in the right Dataset, and “both” for rows whose merge key is found in both.

  • columns_left (str or list of str, optional) – Column names to include in the merge from left, defaults to None which causes all columns to be included.

  • columns_right (str or list of str, optional) – Column names to include in the merge from right, defaults to None which causes all columns to be included.

  • verbose (bool) – If True, print debugging information to stdout. Defaults to False.

  • hint_size (int) – An estimate of the number of unique keys used for the join, to optimize performance by pre-allocating memory for the key hash table.

Returns:

merged

Return type:

Dataset

Examples

>>> rt.merge(ds_simple_1, ds_simple_2, left_on = 'A', right_on = 'X', how = 'inner')
#   A      B   X       C
-   -   ----   -   -----
0   0   1.20   0    2.40
1   1   3.10   1    6.20
2   6   9.60   6   19.20

[3 rows x 4 columns] total bytes: 72.0 B

Demonstrating a ‘left’ merge.

>>> rt.merge(ds_complex_1, ds_complex_2, on = ['A','B'], how = 'left')
#   B    A       C       E
-   -   --   -----   -----
0   Q    0    2.40    1.50
1   R    6    6.20   11.20
2   S    9   19.20     nan
3   T   11   25.90     nan

[4 rows x 4 columns] total bytes: 84.0 B
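The “last match is taken” rule for how='left' can be sketched in plain Python (illustrative only, not riptable's implementation; None stands in for NaN):

```python
# Output rows correspond one-to-one with left rows, and when a left key
# matches several right rows, the *last* matching right row wins.
left_keys = ["Q", "R", "S", "T"]
right_rows = [("Q", 1.5), ("R", 11.2), ("Q", 9.9)]  # "Q" occurs twice

# Later assignments overwrite earlier ones, so the last match is kept.
lookup = {key: value for key, value in right_rows}

merged = [(k, lookup.get(k)) for k in left_keys]  # unmatched keys get None
print(merged)  # [('Q', 9.9), ('R', 11.2), ('S', None), ('T', None)]
```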

See also

merge_asof

riptable.rt_merge.merge2(left, right, on=None, left_on=None, right_on=None, how='left', suffixes=None, copy=True, indicator=False, columns_left=None, columns_right=None, validate=None, keep=None, high_card=None, hint_size=None, **kwargs)

Merge Dataset by performing a database-style join operation by columns.

Parameters:
  • left (Dataset) – Left Dataset

  • right (Dataset) – Right Dataset

  • on (str or (str, str) or list of str or list of (str, str), optional) – Column names to join on. Must be found in both left and right.

  • left_on (str or list of str, optional) – Column names from left Dataset to join on. When specified, overrides whatever is specified in on.

  • right_on (str or list of str, optional) – Column names from right to join on. When specified, overrides whatever is specified in on.

  • how ({'left', 'right', 'inner', 'outer'}) –

    The type of merge to be performed.

    • left: use only keys from left, as in a SQL ‘left join’. Preserves the ordering of keys.

    • right: use only keys from right, as in a SQL ‘right join’. Preserves the ordering of keys.

    • inner: use intersection of keys from both Datasets, as in a SQL ‘inner join’. Preserves the ordering of keys from left.

    • outer: use union of keys from both Datasets, as in a SQL ‘full outer join’.

  • suffixes (tuple of (str, str), optional) – Suffix to apply to overlapping column names in the left and right side, respectively. The default (None) causes an exception to be raised for any overlapping columns.

  • copy (bool, default True) – If False, avoid copying data when possible; this can reduce memory usage but users must be aware that data can be shared between left and/or right and the Dataset returned by this function.

  • indicator (bool or str, default False) – If True, adds a column named “merge_indicator” to the output Dataset with information on the source of each row. If a string, the indicator column is named with that string instead. The indicator column is a Categorical whose value is “left_only” for rows whose merge key appears only in the left Dataset, “right_only” for rows whose merge key appears only in the right Dataset, and “both” for rows whose merge key is found in both.

  • columns_left (str or list of str, optional) – Column names to include in the merge from left, defaults to None which causes all columns to be included.

  • columns_right (str or list of str, optional) – Column names to include in the merge from right, defaults to None which causes all columns to be included.

  • validate ({'one_to_one', '1:1', 'one_to_many', '1:m', 'many_to_one', 'm:1', 'many_to_many', 'm:m'}, optional) – Validate the uniqueness of the values in the columns specified by the on, left_on, right_on parameters. In other words, allows the multiplicity of the keys to be checked so the user can prevent the merge if they want to ensure the uniqueness of the keys in one or both of the Datasets being merged. Note: The keep parameter logically takes effect before validate when they’re both specified.

  • keep ({'first', 'last'} or (str, str), optional) – An optional string which specifies that only the first or last occurrence of each unique key within left and right should be kept. In other words, resolves multiple occurrences of keys (multiplicity > 1) to a single occurrence.

  • high_card (bool or (bool, bool), optional) – Hint to low-level grouping implementation that the key(s) of left and/or right contain a high number of unique values (cardinality); the grouping logic may use this hint to select an algorithm that can provide better performance for such cases.

  • hint_size (int or (int, int), optional) – An estimate of the number of unique keys used for the join. Used as a performance hint to the low-level grouping implementation. This hint is typically ignored when high_card is specified.
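What validate checks can be sketched in plain Python. The rule names come from the parameter above; the helper itself is illustrative, not riptable's implementation:

```python
# Each "one" side of the validate rule must hold every key at most once.
def check_multiplicity(left_keys, right_keys, validate):
    # Accept both spellings: 'one_to_many' and '1:m'.
    parts = validate.split("_to_") if "_to_" in validate else validate.split(":")
    left_rule, right_rule = parts
    if left_rule in ("one", "1") and len(set(left_keys)) != len(left_keys):
        raise ValueError("left keys are not unique")
    if right_rule in ("one", "1") and len(set(right_keys)) != len(right_keys):
        raise ValueError("right keys are not unique")

check_multiplicity([1, 2, 3], [1, 1, 2], "one_to_many")  # passes: left keys unique
try:
    check_multiplicity([1, 2, 3], [1, 1, 2], "m:1")      # right keys repeat
except ValueError as err:
    print(err)  # right keys are not unique
```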

Returns:

merged

Return type:

Dataset

Examples

>>> rt.merge2(ds_simple_1, ds_simple_2, left_on = 'A', right_on = 'X', how = 'inner')
#   A      B   X       C
-   -   ----   -   -----
0   0   1.20   0    2.40
1   1   3.10   1    6.20
2   6   9.60   6   19.20

[3 rows x 4 columns] total bytes: 72.0 B

Demonstrating a ‘left’ merge.

>>> rt.merge2(ds_complex_1, ds_complex_2, on = ['A','B'], how = 'left')
#   B    A       C       E
-   -   --   -----   -----
0   Q    0    2.40    1.50
1   R    6    6.20   11.20
2   S    9   19.20     nan
3   T   11   25.90     nan

[4 rows x 4 columns] total bytes: 84.0 B

See also

merge_asof

riptable.rt_merge.merge_asof(left, right, on=None, left_on=None, right_on=None, by=None, left_by=None, right_by=None, suffixes=None, copy=True, columns_left=None, columns_right=None, tolerance=None, allow_exact_matches=True, direction='backward', verbose=False, action_on_unsorted='sort', matched_on=False, **kwargs)

Combine two Dataset objects by performing a database-style left-join based on the nearest numeric key.

An as-of merge is useful for keys that are times (or other numeric values) that aren’t exact matches but are close enough to merge on.

Both Dataset objects must be sorted by the key column.

Use the direction argument to find the nearest key in the right Dataset:

  • A direction="backward" search selects the closest key that’s less than or equal to the key in the left Dataset.

  • A direction="forward" search selects the closest key that’s greater than or equal to the key in the left Dataset.

  • A direction="nearest" search selects the key that’s closest in absolute distance to the key in the left Dataset.

Optionally, you can match on equivalent keys with by before performing an as-of merge with on.
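For a single sorted key column, the three search directions can be modeled with the standard-library bisect module. This is an illustrative sketch of the matching rule, not riptable's implementation; asof_match returns the index of the matched right row, or None when there is no match:

```python
import bisect

def asof_match(left_key, right_keys, direction="backward", allow_exact_matches=True):
    """Index of the right row matched for one left key (sketch only)."""
    if direction == "backward":
        # Rightmost key <= left_key (strictly < when exact matches are off).
        i = (bisect.bisect_right if allow_exact_matches else bisect.bisect_left)(right_keys, left_key)
        return i - 1 if i > 0 else None
    if direction == "forward":
        # Leftmost key >= left_key (strictly > when exact matches are off).
        i = (bisect.bisect_left if allow_exact_matches else bisect.bisect_right)(right_keys, left_key)
        return i if i < len(right_keys) else None
    # "nearest": whichever of the backward/forward candidates is closer.
    b = asof_match(left_key, right_keys, "backward", allow_exact_matches)
    f = asof_match(left_key, right_keys, "forward", allow_exact_matches)
    if b is None or (f is not None and abs(right_keys[f] - left_key) < abs(left_key - right_keys[b])):
        return f
    return b

right = [1, 2, 3, 6, 7]
print([asof_match(k, right, "backward") for k in [1, 5, 10]])        # [0, 2, 4] -> keys 1, 3, 7
print([asof_match(k, right, "forward", False) for k in [1, 5, 10]])  # [1, 3, None] -> keys 2, 6, no match
```

These two calls reproduce the backward and forward matches shown in the Examples section below.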

Parameters:
  • left (Dataset) – The first Dataset to merge. If a nearest match for a key in left isn’t found in right, the returned Dataset includes a row with the columns from left, but with NaN values in each column from right.

  • right (Dataset) – The second Dataset to merge. If rows in right don’t have nearest matches in left they will be discarded. If they match multiple rows in left they will be duplicated appropriately.

  • on (str or (str, str)) – Name of the column to join on (the key column). If the column name is the same in both Dataset objects, use a single string. If the column names are different, use a tuple of strings. The values must be numeric (such as integers, floats, or datetimes). If on isn’t specified, left_on and right_on must be specified.

  • left_on (str, optional) – Use instead of on to specify the column in the left Dataset to join on.

  • right_on (str, optional) – Use instead of on to specify the column in the right Dataset to join on.

  • by (str or (str, str) or list of str or list of (str, str), optional) –

    Match equal keys in these columns before performing the as-of merge. Options for types:

    • Single string: Join on one column that has the same name in both Dataset objects.

    • List: A list of strings is treated as a multi-key in which all associated key column values in left must have matches in right. The column names must be the same in both Dataset objects, unless they’re in a tuple; see below.

    • Tuple: Use a tuple to specify key columns that have different names. For example, ("col_a", "col_b") joins on col_a in left and col_b in right. Both columns are in the returned Dataset unless you specify otherwise using columns_left or columns_right.

  • left_by (str or list of str, optional) – Use instead of by to specify names of columns to match equivalent values on in the left Dataset.

  • right_by (str or list of str, optional) – Use instead of by to specify names of columns to match equivalent values on in the right Dataset.

  • suffixes ((str, str), optional) – Suffixes to apply to returned overlapping non-key-column names in left and right, respectively. By default, an error is raised for any overlapping non-key columns that will be in the returned Dataset.

  • copy (bool, default True) – Set to False to avoid copying data when possible. This can reduce memory usage, but be aware that data can be shared among left, right, and the Dataset returned by this function.

  • columns_left (str or list of str, optional) – Names of columns from left to include in the merged Dataset. By default, all columns are included.

  • columns_right (str or list of str, optional) – Names of columns from right to include in the merged Dataset. By default, all columns are included.

  • tolerance (int, float, or timedelta, optional) – Not implemented. Tolerance allowed when performing the as-of part of the merge. When a row from left doesn’t have a key in right within this distance, that row will have a NaN for any columns from right that appear in the merged result.

  • allow_exact_matches (bool, default True) – If True (the default), allow matching with an equivalent on value (i.e., allow less-than-or-equal-to or greater-than-or-equal-to matches). If False, don’t match an equivalent on value (i.e., perform only strictly-less-than or strictly-greater-than matches).

  • direction ({"backward", "forward", "nearest"}, default "backward") – Whether to search for prior, subsequent, or closest matches in the right Dataset.

  • verbose (bool, default False) – Show information used for debugging.

  • check_sorted (bool, default True) –

    Deprecated since version 1.10.0: See action_on_unsorted.

  • action_on_unsorted ({"sort", "raise"}, default "sort") –

    New in version 1.10.0.

    The on columns are always checked to see if they are sorted. If they’re unsorted, by default they are sorted before the merge; the original order is then restored in the returned merged Dataset. Set to “raise” to raise an error for any unsorted on column.

  • matched_on (bool or str, default False) – Add a column to the merged Dataset that contains the on column value from right that was matched. When True, the column uses the default name “matched_on”; specify a string to name the column.

  • left_index (bool, default False) –

    Deprecated since version 1.10.0.

    This parameter is only provided for compatibility with pandas.merge_asof and has no effect in a Riptable as-of merge.

  • right_index (bool, default False) –

    Deprecated since version 1.10.0.

    This parameter is only provided for compatibility with pandas.merge_asof and has no effect in a Riptable as-of merge.

Returns:

A new Dataset of the two merged objects.

Return type:

Dataset

See also

merge_lookup

Merge two Dataset objects based on equivalent keys.

merge2

Merge two Dataset objects using various database-style joins.

merge_indices

Return the left and right indices created by the join engine.

Dataset.merge_asof

Merge two Dataset objects using the nearest key.

Dataset.merge_lookup

Merge two Dataset objects with an in-place option.

Dataset.merge2

Merge two Dataset objects using various database-style joins.

Examples

>>> left = rt.Dataset({"a": [1, 5, 10], "left_val": ["a", "b", "c"]})
>>> right = rt.Dataset({"a": [1, 2, 3, 6, 7], "right_val": [1, 2, 3, 6, 7]})
>>> left
#    a   left_val
-   --   --------
0    1   a
1    5   b
2   10   c

[3 rows x 2 columns] total bytes: 15.0 B
>>> right
#   a   right_val
-   -   ---------
0   1           1
1   2           2
2   3           3
3   6           6
4   7           7

[5 rows x 2 columns] total bytes: 40.0 B

Merge based on the integers in the “a” columns. The first match is exact; the second two are “backward” nearest matches (the default direction):

>>> rt.merge_asof(left, right, on="a")
#    a   left_val   right_val
-   --   --------   ---------
0    1   a                  1
1    5   b                  3
2   10   c                  7

[3 rows x 3 columns] total bytes: 27.0 B

With allow_exact_matches=False, an exact match is skipped in favor of the nearest strictly-greater key (as for row 0, which matches 2 rather than the exact 1). There’s also no “forward” match in right for row 2:

>>> rt.merge_asof(left, right, on="a", direction="forward", allow_exact_matches=False)
#    a   left_val   right_val
-   --   --------   ---------
0    1   a                  2
1    5   b                  6
2   10   c                Inv

[3 rows x 3 columns] total bytes: 27.0 B

With the default “backward” direction and allow_exact_matches=False, row 0 has no strictly-smaller key in right, so an invalid (NaN) value is filled in:

>>> rt.merge_asof(left, right, on="a", allow_exact_matches=False)
#    a   left_val   right_val
-   --   --------   ---------
0    1   a                Inv
1    5   b                  3
2   10   c                  7

[3 rows x 3 columns] total bytes: 27.0 B

As-of merges are good for time-series data. Here, the Dataset objects are merged with on="Time" and by="Symbol" to get the nearest time associated with the same symbol:

>>> # Left Dataset with trades and times.
>>> ds = rt.Dataset({"Symbol": ["AAPL", "AMZN", "AAPL"],
...                  "Venue": ["A", "I", "A"],
...                  "Time": rt.TimeSpan(["09:30", "10:00", "10:20"])})
>>> # Right Dataset with spot prices and nearby times.
>>> spot_ds = rt.Dataset({"Symbol": ["AMZN", "AMZN", "AMZN", "AAPL", "AAPL", "AAPL"],
...                       "Spot Price": [2000.0, 2025.0, 2030.0, 500.0, 510.0, 520.0],
...                       "Time": rt.TimeSpan(["09:30", "10:00", "10:25", "09:25", "10:00", "10:25"])})
>>> ds
#   Symbol   Venue                 Time
-   ------   -----   ------------------
0   AAPL     A       09:30:00.000000000
1   AMZN     I       10:00:00.000000000
2   AAPL     A       10:20:00.000000000

[3 rows x 3 columns] total bytes: 39.0 B
>>> spot_ds
#   Symbol   Spot Price                 Time
-   ------   ----------   ------------------
0   AMZN       2,000.00   09:30:00.000000000
1   AMZN       2,025.00   10:00:00.000000000
2   AMZN       2,030.00   10:25:00.000000000
3   AAPL         500.00   09:25:00.000000000
4   AAPL         510.00   10:00:00.000000000
5   AAPL         520.00   10:25:00.000000000

[6 rows x 3 columns] total bytes: 120.0 B

Note that an as-of merge requires the on columns to be sorted. Before the merge, the on columns are always checked. If they’re not sorted, by default they are sorted before the merge; the original order is then restored in the returned merged Dataset.

If you don’t need to preserve the existing ordering, it’s faster to sort the on columns in place first:

>>> spot_ds.sort_inplace("Time")
#   Symbol   Spot Price                 Time
-   ------   ----------   ------------------
0   AAPL         500.00   09:25:00.000000000
1   AMZN       2,000.00   09:30:00.000000000
2   AMZN       2,025.00   10:00:00.000000000
3   AAPL         510.00   10:00:00.000000000
4   AAPL         520.00   10:25:00.000000000
5   AMZN       2,030.00   10:25:00.000000000

[6 rows x 3 columns] total bytes: 120.0 B

Get the nearest earlier time:

>>> rt.merge_asof(ds, spot_ds, on="Time", by="Symbol", direction="backward", matched_on=True)
#   Symbol                 Time   Venue   Spot Price           matched_on
-   ------   ------------------   -----   ----------   ------------------
0   AAPL     09:30:00.000000000   A           500.00   09:25:00.000000000
1   AMZN     10:00:00.000000000   I          2025.00   10:00:00.000000000
2   AAPL     10:20:00.000000000   A           510.00   10:00:00.000000000

[3 rows x 5 columns] total bytes: 87.0 B

Get the nearest later time:

>>> rt.merge_asof(ds, spot_ds, on="Time", by="Symbol", direction="forward", matched_on=True)
#   Symbol                 Time   Venue   Spot Price           matched_on
-   ------   ------------------   -----   ----------   ------------------
0   AAPL     09:30:00.000000000   A           510.00   10:00:00.000000000
1   AMZN     10:00:00.000000000   I          2025.00   10:00:00.000000000
2   AAPL     10:20:00.000000000   A           520.00   10:25:00.000000000

[3 rows x 5 columns] total bytes: 87.0 B

Get the nearest time, whether earlier or later:

>>> rt.merge_asof(ds, spot_ds, on="Time", by="Symbol", direction="nearest", matched_on=True)
#   Symbol                 Time   Venue   Spot Price           matched_on
-   ------   ------------------   -----   ----------   ------------------
0   AAPL     09:30:00.000000000   A           500.00   09:25:00.000000000
1   AMZN     10:00:00.000000000   I          2025.00   10:00:00.000000000
2   AAPL     10:20:00.000000000   A           520.00   10:25:00.000000000

[3 rows x 5 columns] total bytes: 87.0 B
riptable.rt_merge.merge_indices(left, right, *, on=None, how='left', validate=None, keep=None, high_card=None, hint_size=None, **kwargs)

Perform a join/merge of two Dataset objects, returning the left/right indices created by the join engine.

The returned indices can be used to index into the left and right Dataset objects to construct a merged/joined Dataset.

Parameters:
  • left (Dataset) – Left Dataset

  • right (Dataset) – Right Dataset

  • on (str or (str, str) or list of str or list of (str, str), optional) – Column names to join on. Must be found in both left and right.

  • how ({'left', 'right', 'inner', 'outer'}) –

    The type of merge to be performed.

    • left: use only keys from left, as in a SQL ‘left join’. Preserves the ordering of keys.

    • right: use only keys from right, as in a SQL ‘right join’. Preserves the ordering of keys.

    • inner: use intersection of keys from both Datasets, as in a SQL ‘inner join’. Preserves the ordering of keys from left.

    • outer: use union of keys from both Datasets, as in a SQL ‘full outer join’.

  • validate ({'one_to_one', '1:1', 'one_to_many', '1:m', 'many_to_one', 'm:1', 'many_to_many', 'm:m'}, optional) – Validate the uniqueness of the values in the columns specified by the on parameter. In other words, allows the multiplicity of the keys to be checked so the user can prevent the merge if they want to ensure the uniqueness of the keys in one or both of the Datasets being merged. Note: The keep parameter logically takes effect before validate when they’re both specified.

  • keep ({'first', 'last'} or (str, str), optional) – An optional string which specifies that only the first or last occurrence of each unique key within left and right should be kept. In other words, resolves multiple occurrences of keys (multiplicity > 1) to a single occurrence.

  • high_card (bool or (bool, bool), optional) – Hint to low-level grouping implementation that the key(s) of left and/or right contain a high number of unique values (cardinality); the grouping logic may use this hint to select an algorithm that can provide better performance for such cases.

  • hint_size (int or (int, int), optional) – An estimate of the number of unique keys used for the join. Used as a performance hint to the low-level grouping implementation. This hint is typically ignored when high_card is specified.

Return type:

JoinIndices

Examples

>>> # merge_indices returns a JoinIndices object rather than a Dataset; the
>>> # merged result described by its indices is the one produced by the
>>> # equivalent merge2 call:
>>> rt.merge2(ds_simple_1, ds_simple_2, on=('A', 'X'), how='inner')
#   A      B   X       C
-   -   ----   -   -----
0   0   1.20   0    2.40
1   1   3.10   1    6.20
2   6   9.60   6   19.20

[3 rows x 4 columns] total bytes: 72.0 B

Demonstrating a ‘left’ merge.

>>> # Shown via the equivalent merge2 call, since merge_indices itself
>>> # returns a JoinIndices object:
>>> rt.merge2(ds_complex_1, ds_complex_2, on = ['A','B'], how = 'left')
#   B    A       C       E
-   -   --   -----   -----
0   Q    0    2.40    1.50
1   R    6    6.20   11.20
2   S    9   19.20     nan
3   T   11   25.90     nan

[4 rows x 4 columns] total bytes: 84.0 B
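How the returned indices would be applied can be sketched in NumPy. The column contents and index values below are hypothetical, chosen to mirror an inner join; this is not riptable's implementation:

```python
import numpy as np

# Stand-ins for the columns of the left and right Datasets.
left_cols = {"A": np.array([0, 1, 5, 6]), "B": np.array([1.2, 3.1, 7.7, 9.6])}
right_cols = {"X": np.array([0, 1, 6]), "C": np.array([2.4, 6.2, 19.2])}

# Hypothetical JoinIndices contents: rows of left that matched, and the
# corresponding rows of right.
left_index = np.array([0, 1, 3])
right_index = np.array([0, 1, 2])

# Indexing each side's columns with its index yields the merged columns.
merged = {name: col[left_index] for name, col in left_cols.items()}
merged.update({name: col[right_index] for name, col in right_cols.items()})
print(merged["A"].tolist(), merged["C"].tolist())  # [0, 1, 6] [2.4, 6.2, 19.2]
```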

See also

merge2

riptable.rt_merge.merge_lookup(left, right, on=None, left_on=None, right_on=None, require_match=False, suffixes=None, copy=True, columns_left=None, columns_right=None, keep=None, high_card=None, hint_size=None)

Combine two Dataset objects by performing a database-style left-join operation on columns.

This operation returns a new Dataset object. To do an in-place merge, use Dataset.merge_lookup with inplace=True.

Parameters:
  • left (Dataset) – The first Dataset to merge. If a matching value for a key in left isn’t found in right, the returned Dataset includes a row with the columns from left, but with NaN values in each column from right.

  • right (Dataset) – The second Dataset to merge. If rows in right don’t have matches in left they will be discarded. If they match multiple rows in left they will be duplicated appropriately.

  • on (str or (str, str) or list of str or list of (str, str), optional) –

    Names of columns (keys) to join on. If on isn’t specified, left_on and right_on must be specified. Options for types:

    • Single string: Join on one column that has the same name in both Dataset objects.

    • List: A list of strings is treated as a multi-key in which all associated key column values in left must have matches in right. The column names must be the same in both Dataset objects, unless they’re in a tuple; see below.

    • Tuple: Use a tuple to specify key columns that have different names. For example, ("col_a", "col_b") joins on col_a in left and col_b in right. Both columns are in the returned Dataset unless you specify otherwise using columns_left or columns_right.

  • left_on (str or list of str, optional) – Use instead of on to specify names of columns in the left Dataset to join on. A list of strings is treated as a multi-key in which all associated key column values in left must have matches in right. If both on and left_on are specified, an error is raised.

  • right_on (str or list of str, optional) – Use instead of on to specify names of columns in the right Dataset to join on. A list of strings is treated as a multi-key in which all associated key column values in right must have matches in left. If both on and right_on are specified, an error is raised.

  • require_match (bool, default False) – When True, all keys in left are required to have a matching key in right, and an error is raised when this requirement is not met.

  • suffixes (tuple of (str, str), optional) – Suffixes to apply to returned overlapping non-key-column names in left and right, respectively. By default, an error is raised for any overlapping non-key columns that will be in the returned Dataset.

  • copy (bool, default True) – Set to False to avoid copying data when possible. This can reduce memory usage, but be aware that data can be shared among left, right, and the Dataset returned by this function.

  • columns_left (str or list of str, optional) – Names of columns from left to include in the merged Dataset. By default, all columns are included.

  • columns_right (str or list of str, optional) – Names of columns from right to include in the merged Dataset. By default, all columns are included.

  • keep ({None, 'first', 'last'}, optional) – When right contains multiple rows with a given unique key from left, keep only one such row; this parameter indicates whether it should be the first or last row with the given key. By default (keep=None), an error is raised if there are any non-unique keys in right.

  • high_card (bool or (bool, bool), optional) – Hint to the low-level grouping implementation that the key(s) of left or right contain a high number of unique values (cardinality); the grouping logic may use this hint to select an algorithm that can provide better performance for such cases.

  • hint_size (int or (int, int), optional) – An estimate of the number of unique keys used for the join. Used as a performance hint to the low-level grouping implementation. This hint is typically ignored when high_card is specified.
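The left-join lookup semantics, including keep, can be sketched in plain Python (illustrative only; None stands in for riptable's NaN/invalid fill):

```python
# right may contain repeated keys; keep picks which occurrence is used,
# and keep=None (the default) treats repeated keys as an error.
def merge_lookup_keys(left_keys, right_rows, keep=None):
    lookup = {}
    for key, value in right_rows:
        if key in lookup:
            if keep is None:
                raise ValueError(f"non-unique key in right: {key!r}")
            if keep == "first":
                continue  # keep the existing (first) occurrence
        lookup[key] = value  # record key, or overwrite for keep='last'
    return [lookup.get(k) for k in left_keys]  # unmatched keys get None

rows = [("SPY", "Dan"), ("SPY", "Amy"), ("TSLA", "Nate")]
print(merge_lookup_keys(["SPY", "TSLA", "GME"], rows, keep="last"))   # ['Amy', 'Nate', None]
print(merge_lookup_keys(["SPY", "TSLA", "GME"], rows, keep="first"))  # ['Dan', 'Nate', None]
```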

Returns:

A new Dataset of the two merged objects.

Return type:

Dataset

See also

merge_asof

Merge two Dataset objects using the nearest key.

merge2

Merge two Dataset objects using various database-style joins.

merge_indices

Return the left and right indices created by the join engine.

Dataset.merge_lookup

Merge two Dataset objects with an in-place option.

Dataset.merge2

Merge two Dataset objects using various database-style joins.

Dataset.merge_asof

Merge two Dataset objects using the nearest key.

Examples

A basic merge on a single column:

>>> ds_l = rt.Dataset({"Symbol": rt.FA(["GME", "AMZN", "TSLA", "SPY", "TSLA",
...                                     "AMZN", "GME", "SPY", "GME", "TSLA"])})
>>> ds_r = rt.Dataset({"Symbol": rt.FA(["TSLA", "GME", "AMZN", "SPY"]),
...                    "Trader": rt.FA(["Nate", "Elon", "Josh", "Dan"])})
>>> ds_l
#   Symbol
-   ------
0   GME
1   AMZN
2   TSLA
3   SPY
4   TSLA
5   AMZN
6   GME
7   SPY
8   GME
9   TSLA

[10 rows x 1 columns] total bytes: 40.0 B
>>> ds_r
#   Symbol   Trader
-   ------   ------
0   TSLA     Nate
1   GME      Elon
2   AMZN     Josh
3   SPY      Dan

[4 rows x 2 columns] total bytes: 32.0 B
>>> rt.merge_lookup(ds_l, ds_r, on="Symbol")
#   Symbol   Trader
-   ------   ------
0   GME      Elon
1   AMZN     Josh
2   TSLA     Nate
3   SPY      Dan
4   TSLA     Nate
5   AMZN     Josh
6   GME      Elon
7   SPY      Dan
8   GME      Elon
9   TSLA     Nate

[10 rows x 2 columns] total bytes: 80.0 B

When key columns have different names, use left_on and right_on to specify them:

>>> ds_r.col_rename("Symbol", "Primary_Symbol")
>>> rt.merge_lookup(ds_l, ds_r, left_on="Symbol", right_on="Primary_Symbol",
...                 columns_right="Trader")
#   Symbol   Trader
-   ------   ------
0   GME      Elon
1   AMZN     Josh
2   TSLA     Nate
3   SPY      Dan
4   TSLA     Nate
5   AMZN     Josh
6   GME      Elon
7   SPY      Dan
8   GME      Elon
9   TSLA     Nate

[10 rows x 2 columns] total bytes: 80.0 B

For non-key columns with the same name that will be returned, specify suffixes:

>>> # Add duplicate non-key columns.
>>> ds_l.Value = rt.FA([0.72, 0.85, 0.14, 0.55, 0.77, 0.65, 0.23, 0.15, 0.43, 0.25])
>>> ds_r.Value = rt.FA([0.28, 0.56, 0.89, 0.74])
>>> # You can also use a tuple to specify left and right key columns.
>>> rt.merge_lookup(ds_l, ds_r, on=("Symbol", "Primary_Symbol"),
...                 suffixes=["_1", "_2"], columns_right=["Value", "Trader"])
#   Symbol   Value_1   Value_2   Trader
-   ------   -------   -------   ------
0   GME         0.72      0.56   Elon
1   AMZN        0.85      0.89   Josh
2   TSLA        0.14      0.28   Nate
3   SPY         0.55      0.74   Dan
4   TSLA        0.77      0.28   Nate
5   AMZN        0.65      0.89   Josh
6   GME         0.23      0.56   Elon
7   SPY         0.15      0.74   Dan
8   GME         0.43      0.56   Elon
9   TSLA        0.25      0.28   Nate

[10 rows x 4 columns] total bytes: 240.0 B

When on is a list, a multi-key join is performed. All keys must match in the right Dataset.

If a matching value for a key in the left Dataset isn’t found in the right Dataset, the returned Dataset includes a row with the columns from left but with NaN values in the columns from right.

>>> # Add associated Size values for multi-key join. Note that one
>>> # symbol-size pair in the left Dataset doesn't have a match in
>>> # the right Dataset.
>>> ds_l.Size = rt.FA([500, 150, 430, 225, 430, 320, 175, 620, 135, 260])
>>> ds_r.Size = rt.FA([430, 500, 150, 2250])
>>> # Pass a list of key columns that contains a tuple.
>>> rt.merge_lookup(ds_l, ds_r, on=[("Symbol", "Primary_Symbol"), "Size"],
...                 suffixes=["_1", "_2"])
#   Size   Symbol   Value_1   Trader   Value_2
-   ----   ------   -------   ------   -------
0    500   GME         0.72   Elon        0.56
1    150   AMZN        0.85   Josh        0.89
2    430   TSLA        0.14   Nate        0.28
3    225   SPY         0.55                nan
4    430   TSLA        0.77   Nate        0.28
5    320   AMZN        0.65                nan
6    175   GME         0.23                nan
7    620   SPY         0.15                nan
8    135   GME         0.43                nan
9    260   TSLA        0.25                nan

[10 rows x 5 columns] total bytes: 280.0 B

When the right Dataset has more than one matching key, use keep to specify which one to use:

>>> ds_l = rt.Dataset({"Symbol": rt.FA(["GME", "AMZN", "TSLA", "SPY", "TSLA",
...                                     "AMZN", "GME", "SPY", "GME", "TSLA"])})
>>> ds_r = rt.Dataset({"Symbol": rt.FA(["TSLA", "GME", "AMZN", "SPY", "SPY"]),
...                    "Trader": rt.FA(["Nate", "Elon", "Josh", "Dan", "Amy"])})
>>> rt.merge_lookup(ds_l, ds_r, on="Symbol", keep="last")
#   Symbol   Trader
-   ------   ------
0   GME      Elon
1   AMZN     Josh
2   TSLA     Nate
3   SPY      Amy
4   TSLA     Nate
5   AMZN     Josh
6   GME      Elon
7   SPY      Amy
8   GME      Elon
9   TSLA     Nate

[10 rows x 2 columns] total bytes: 80.0 B

Invalid values are not treated as equal keys:

>>> ds_l = rt.Dataset({"Key": [1.0, rt.nan, 2.0,], "Value1": [1.0, 2.0, 3.0]})
>>> ds_r = rt.Dataset({"Key": [1.0, 2.0, rt.nan], "Value2": [1.0, 2.0, 3.0]})
>>> rt.merge_lookup(ds_l, ds_r, on="Key", columns_right="Value2")
#    Key   Value1   Value2
-   ----   ------   ------
0   1.00     1.00     1.00
1    nan     2.00      nan
2   2.00     3.00     2.00

[3 rows x 3 columns] total bytes: 72.0 B
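This follows from IEEE-754 NaN semantics, which plain Python also exhibits:

```python
# NaN compares unequal to everything, including itself, so invalid
# floating-point keys can never match during a join.
nan = float("nan")
print(nan == nan)  # False
```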