Concatenate Datasets

Concatentating Datasets is straightforward, with two Dataset methods: You can concatenate rows (vertically) with concat_rows() or columns (horizontally) with concat_columns().

However, it’s good to be aware of what happens when you concatenate two Datasets that have different shapes or column names, so we’ll look at a few examples.

Concatenate Rows

Concatenating data row-wise is sometimes called vertical stacking. When two Datasets have identical column names, concat_rows() simply stacks the data:

>>> ds1 = rt.Dataset({'A': ['A0', 'A1', 'A2'], 'B': ['B0', 'B1', 'B2']})
>>> ds2 = rt.Dataset({'A': ['A3', 'A4', 'A5'], 'B': ['B3', 'B4', 'B5']})
>>> ds1
#   A    B
-   --   --
0   A0   B0
1   A1   B1
2   A2   B2

>>> ds2
#   A    B
-   --   --
0   A3   B3
1   A4   B4
2   A5   B5

>>> rt.Dataset.concat_rows([ds1, ds2])
#   A    B
-   --   --
0   A0   B0
1   A1   B1
2   A2   B2
3   A3   B3
4   A4   B4
5   A5   B5

When the two Datasets have only some column names in common, the result has a gap in the data:

>>> # Create two Datasets with two out of three columns in common.
>>> ds3 = rt.Dataset({'A': ['A0', 'A1', 'A2'], 'B': ['B0', 'B1', 'B2']})
>>> ds4 = rt.Dataset({'A': ['A3', 'A4', 'A5'], 'B': ['B3', 'B4', 'B5'], 'C': ['C3', 'C4', 'C5'] })
>>> ds3
#   A    B
-   --   --
0   A0   B0
1   A1   B1
2   A2   B2

>>> ds4
#   A    B    C
-   --   --   --
0   A3   B3   C3
1   A4   B4   C4
2   A5   B5   C5

>>> rt.Dataset.concat_rows([ds3, ds4])
#   A    B    C
-   --   --   --
0   A0   B0
1   A1   B1
2   A2   B2
3   A3   B3   C3
4   A4   B4   C4
5   A5   B5   C5

As you can see, Riptable’s missing string value is ’’. If the values were floats, the empty spots would be filled with nans:

>>> rng = np.random.default_rng(seed=42)
>>> ds5 = rt.Dataset({'col_'+str(i):rng.random(2) for i in range(2)})
>>> ds6 = rt.Dataset({'col_'+str(i):rng.random(2) for i in range(3)})
>>> ds5
#   col_0   col_1
-   -----   -----
0    0.77    0.86
1    0.44    0.70

>>> ds6
#   col_0   col_1   col_2
-   -----   -----   -----
0    0.09    0.76    0.13
1    0.98    0.79    0.45

>>> rt.Dataset.concat_rows([ds5, ds6])
#   col_0   col_1   col_2
-   -----   -----   -----
0    0.77    0.86     nan
1    0.44    0.70     nan
2    0.09    0.76    0.13
3    0.98    0.79    0.45

See Working with Missing Data for more about what to expect when you have missing values in Riptable.

You can also concatenate datasets row-wise with Categoricals if the Datasets have identical column names:

>>> a = rt.Cat(['a', 'a', 'a', 'b', 'b'])
>>> b = rt.FA([0, 1, 2, 3, 4])
>>> ds10 = rt.Dataset({'Cat': a, 'Val': b})
>>> c = rt.Cat(['c', 'c', 'c', 'd', 'd'])
>>> d = rt.FA([5, 6, 7, 8, 9])
>>> ds11 = rt.Dataset({'Cat': c, 'Val': d})
>>> ds10
#   Cat   Val
-   ---   ---
0   a       0
1   a       1
2   a       2
3   b       3
4   b       4

>>> ds11
#   Cat   Val
-   ---   ---
0   c       5
1   c       6
2   c       7
3   d       8
4   d       9

>>> rt.Dataset.concat_rows([ds10, ds11])
#   Cat   Val
-   ---   ---
0   a       0
1   a       1
2   a       2
3   b       3
4   b       4
5   c       5
6   c       6
7   c       7
8   d       8
9   d       9

Concatenate Columns

Concatenating data column-wise is also called horizontal stacking. It’s most straightforward when you’re concatenating two Datasets that have no column names in common:

>>> ds7 = rt.Dataset({'A': ['A0', 'A1', 'A2'], 'B': ['B0', 'B1', 'B2']})
>>> ds8 = rt.Dataset({'C': ['C0', 'C1', 'C2'], 'D': ['D0', 'D1', 'D2']})
>>> ds7
#   A    B
-   --   --
0   A0   B0
1   A1   B1
2   A2   B2

>>> ds8
#   C    D
-   --   --
0   C0   D0
1   C1   D1
2   C2   D2

>>> ds9 = rt.Dataset.concat_columns([ds7, ds8], do_copy=True)
>>> ds9
#   A    B    C    D
-   --   --   --   --
0   A0   B0   C0   D0
1   A1   B1   C1   D1
2   A2   B2   C2   D2

Note that do_copy is a required argument for concat_columns(). When do_copy=True, changes you make to values in the original Datasets do not change the values in your new, concatenated Dataset, and vice-versa.

When your two Datasets have a column name (or names) in common, you need to specify which data you want to keep – the data from the shared column(s) in first Dataset or the data from the shared column(s) in the second Dataset.

We’ll give our second Dataset an ‘A’ column:

>>> ds8.A = rt.FA(['A3', 'A4', 'A5'])

If you try to concatenate the two Datasets, you get an error:

>>> try:
...     rt.Dataset.concat_columns([ds7, ds8], do_copy=True)
... except KeyError as e:
...     print("KeyError:", e)
KeyError: "Duplicate column 'A'"

To keep the column data from the first Dataset, use on_duplicate='first'. You’ll get a warning about mismatched column names, but the concatenation is performed:

>>> rt.Dataset.concat_columns([ds7, ds8], do_copy=True, on_duplicate='first')
C:\\riptable\\rt_dataset.py:5628: UserWarning: concat_columns() duplicate column mismatch: {'A'}
warnings.warn(f'concat_columns() duplicate column mismatch: {dups!r}')
#   A    B    C    D
-   --   --   --   --
0   A0   B0   C0   D0
1   A1   B1   C1   D1
2   A2   B2   C2   D2

You can turn off this warning by adding on_mismatch='ignore'.

To keep the column data from the second dataset, use on_duplicate='last':

>>> rt.Dataset.concat_columns([ds7, ds8], on_duplicate='last', do_copy=True)
#   A    B    C    D
-   --   --   --   --
0   A3   B0   C0   D0
1   A4   B1   C1   D1
2   A5   B2   C2   D2

Note: To concatenate Datasets column-wise, the columns must all be the same length – Riptable does not fill in missing column values the way it does missing row values:

>>> ds9 = rt.Dataset({'E': ['E0', 'E1']})
>>> try:
...     rt.Dataset.concat_columns([ds8, ds9], do_copy=True)
... except ValueError as e:
...     print("ValueError:", e)
ValueError: Inconsistent Dataset lengths {2, 3}

Concatenation is sufficient in certain situations, but it helps to have more flexibility to bring data from two Datasets together. Next, we’ll cover how to Merge Datasets.


Questions or comments about this guide? Email RiptableDocumentation@sig.com.