Performance Considerations

Riptable uses multi-threaded and vectorized operations to work quickly and efficiently with large amounts of data. However, because memory is a finite resource, it’s good to keep some things in mind.

Whenever possible:

  • Use universal functions (ufuncs) and ufunc methods. (Ufuncs take array inputs and produce array outputs.)

  • To work with a subset of an array, use slicing instead of fancy indexing. Fancy indexing creates a copy of the array; slicing instead gives you a “view” of the array. (This differs from slicing lists in Python, which creates copies.) Be aware that changes to the slice also change the original data. Also note that a slice creates a reference to the original data, and the original data won’t be cleared from memory until the reference is also deleted.

  • In general, be aware of which operations make copies of data. Use flags to do operations in place when you can.

  • Avoid filtering entire Datasets using ds.filter(). Use Boolean mask arrays, or use filter keyword arguments in operations on columns.

    • When it makes sense, you can use ds.filter(inplace=True) to modify the original Dataset.

  • Avoid string operations (creating strings, parsing strings, etc.). When you need to parse a string, use the FastArray string methods and try to do it in as few operations as possible.

  • Use Categoricals for string arrays, especially for repeated strings or if you’re converting data between Pandas and Riptable.

  • Delete datasets you’re not using. Though be aware that if you have any references to the Dataset in other objects (for example, a slice or any operation that gives you a “view” of the data), you might not actually free up memory.

  • Avoid using apply() – it’s not a vectorized operation.

Multiprocessing

If you need to use multiprocessing with Riptable, this project may be helpful: https://github.com/joblib/loky.


Questions or comments about this guide? Email RiptableDocumentation@sig.com.