Reshape Data with Pivot and Transpose

Riptable is designed to efficiently work with column-oriented data, which is also called long-format data. This isn’t always the best format for displaying data for human consumption, however.

For example, suppose your data consists of a measurement (say, trade volume) per date and symbol. The long-format, Riptable-friendly way to represent this is to have three columns – for date, symbol, and volume:

>>> long_ds = rt.Dataset({'Date': ['20191111', '20191111', '20191111', '20191112',
...                                '20191112', '20191112'],
...                       'Symbol': ['AAPL', 'MSFT', 'TSLA', 'MSFT', 'AAPL', 'TSLA'],
...                       'Volume': [10, 20, 30, 20, 10, 30]})
>>> long_ds
#   Date       Symbol   Volume
-   --------   ------   ------
0   20191111   AAPL         10
1   20191111   MSFT         20
2   20191111   TSLA         30
3   20191112   MSFT         20
4   20191112   AAPL         10
5   20191112   TSLA         30

While this format is ideal for Riptable’s work, the repeated date and symbol values make it a bit unintuitive for humans to read and make sense of.

In this case, a simple transform from long format to wide doesn’t help much:

>>> long_ds._T
Fields:            0               1               2               3               4               5
   Date     20191111        20191111        20191111        20191112        20191112        20191112
 Symbol         AAPL            MSFT            TSLA            MSFT            AAPL            TSLA
 Volume           10              20              30              20              10              30

A more human-friendly presentation can be gotten from the Dataset method pivot(), which reorganizes data with multiple keys (here, our keys are the date and the symbol).

We can use pivot() to show one row per date and one column for each symbol:

>>> wide_ds = long_ds.pivot('Date', 'Symbol', 'Volume')
>>> wide_ds
*Date      AAPL   MSFT   TSLA
--------   ----   ----   ----
20191111     10     20     30
20191112     10     20     30

The first argument passed is used for the row labels; the second is for the column labels. The third argument specifies which column’s (or columns’) data to use to populate the table. (If none are specified, all remaining columns are used.)

Notice the output’s similarity to that of Accum2():

>>> long_ds.accum2(long_ds.Date, long_ds.Symbol).sum(long_ds.Volume)
*Date      AAPL   MSFT   TSLA   Total
--------   ----   ----   ----   -----
20191111     10     20     30      60
20191112     10     20     30      60
   Total     20     40     60     120

Also note that some wide-format data may be too wide for reasonable display.

To undo your pivot (or “unpivot”), use melt():

>>> melted_ds = wide_ds.melt('Date')
>>> melted_ds
#   Date       variable   value
-   --------   --------   -----
0   20191111   AAPL          10
1   20191112   AAPL          10
2   20191111   MSFT          20
3   20191112   MSFT          20
4   20191111   TSLA          30
5   20191112   TSLA          30

Here, we specified the Date column as the “identifier variable.” The other columns, considered “measured variables,” are unpivoted to the row axis with the column headers “variable” and “value.”

We could have specified our original column labels with var_name and value_name:

>>> melted_ds = wide_ds.melt('Date', var_name='Symbol', value_name='Volume')
>>> melted_ds
#   Date       Symbol   Volume
-   --------   ------   ------
0   20191111   AAPL         10
1   20191112   AAPL         10
2   20191111   MSFT         20
3   20191112   MSFT         20
4   20191111   TSLA         30
5   20191112   TSLA         30

Next, we’ll change gears to give you a high-level overview of tools you can use to Visualize Data.


Questions or comments about this guide? Email RiptableDocumentation@sig.com.