riptable.rt_bin
Functions
- cut: Partition values into discrete bins.
- qcut: Quantile-based discretization function.
- quantile: Compute sample quantile or quantiles of the input array. For example, q=0.5 computes the median.
- riptable.rt_bin.cut(x, bins, labels=True, right=True, retbins=False, precision=3, include_lowest=False, filter=None, duplicates='raise')
Partition values into discrete bins.
This function is also useful for converting a continuous variable to a Categorical variable.
Values can be partitioned into a specified number of equal-width bins or bins bounded by specified endpoints.
For bins bounded by specified endpoints, values that fall outside of the bin range are put into the ‘Filtered’ bin, which is mapped to 0 in the returned Categorical. See the exception (caused by a known issue) noted in the description of the right parameter, below.
Other known issues are noted in the parameter descriptions and shown in the Examples section, below.
- Parameters:
x (array) – The input array to be partitioned. Must be 1-dimensional. NaN values are put into the ‘Filtered’ bin.
bins (int or sequence of scalar) – Determines how bins are created: an int specifies the number of equal-width bins, and a sequence of scalars specifies the bin endpoints.
right (bool, default True) – Indicates whether each bin includes its right endpoint or not. Note: Until known issues are fixed:
  - Each bin includes its right endpoint, even if right is set to False.
  - If right is True (the default), the first bin includes its left endpoint even if include_lowest is False (the default).
  - If right is False, values of x that fall outside of the last bin’s right endpoint are put into a bin labeled with an integer representing the bin number. For example, if bins=[1, 2, 3, 4], a value of 5 in x is put in a bin labeled !<4>. This bin is mapped to 4 in the integer mapping array.
labels (bool, array, or None, default True) – Specify the labels for the returned bins. If an array, it must be the same length as the number of resulting bins (that is, its length should be one fewer than the number of endpoints). If True (the default) or None, the labels are created based on the bin endpoints. If False, only a FastArray of the integer bin mappings is returned.
retbins (bool, default False) – Whether to return an array of the bin endpoints. Useful when bins is provided as a scalar or other labels are specified. See the Returns section below for details of the output.
precision (int, default 3) – The precision at which to display the bin labels. Note that the endpoints used for partitioning are not changed.
include_lowest (bool, default False) – Indicates whether the first bin should include its left endpoint or not. Note: Until a known issue is fixed, the first bin always includes its left endpoint, except when right is set to False.
filter (array of bool, optional) – A boolean mask array. If a filter is provided, any values of x corresponding to False values are put in the ‘Filtered’ bin and mapped to 0 in the integer bin mapping array. Note that until a known issue is fixed, this parameter accepts a mask array that is shorter than x and ignores values of x that are past the last corresponding value of the mask.
duplicates ({'raise', 'drop'}, default 'raise') – If bin endpoints are not unique, raise an error or drop duplicate values.
- Returns:
bins (Categorical or FastArray) –
  - If labels is True or None, a Categorical is returned, consisting of the bins, the integer mapping codes for the bins, and the unique bin labels.
  - If labels is False, a FastArray is returned that contains the integer mapping codes.
endpoints (optional) (ndarray of str) – An array of the bin endpoints. Returned as a separate value, only when retbins is True.
See also
riptable.qcut
Partition values into bins based on rank or sample quantiles.
Examples
Partition values into three equal-sized bins.
>>> rt.cut(x=rt.FA([1, 7, 5, 4, 6, 3]), bins=3)
Categorical([1.0->3.0, 5.0->7.0, 3.0->5.0, 3.0->5.0, 5.0->7.0, 1.0->3.0]) Length: 6
  FastArray([1, 3, 2, 2, 3, 1], dtype=int8) Base Index: 1
  FastArray([b'1.0->3.0', b'3.0->5.0', b'5.0->7.0'], dtype='|S8') Unique count: 3
Also return an array of the bin endpoints.
>>> cat, endpoints = rt.cut(x=rt.FA([1, 7, 5, 4, 6, 3]), bins=3, retbins=True)
>>> cat
Categorical([1.0->3.0, 5.0->7.0, 3.0->5.0, 3.0->5.0, 5.0->7.0, 1.0->3.0]) Length: 6
  FastArray([1, 3, 2, 2, 3, 1], dtype=int8) Base Index: 1
  FastArray([b'1.0->3.0', b'3.0->5.0', b'5.0->7.0'], dtype='|S8') Unique count: 3
>>> endpoints
array([1., 3., 5., 7.])
Return just the array of integer bin mappings.
>>> rt.cut(x=rt.FA([1, 7, 5, 4, 6, 3]), bins=3, labels=False)
FastArray([1, 3, 2, 2, 3, 1], dtype=int8)
Assign the bins specific labels. Notice that the returned Categorical object’s categories are labels.
>>> rt.cut(x=rt.FA([1, 7, 5, 4, 6, 3]),
...        bins=3, labels=["bad", "medium", "good"])
Categorical([bad, good, medium, medium, good, bad]) Length: 6
  FastArray([1, 3, 2, 2, 3, 1], dtype=int8) Base Index: 1
  FastArray([b'bad', b'medium', b'good'], dtype='|S6') Unique count: 3
Partition values into bins with specified endpoints. Values that fall outside of the bins are put in the ‘Filtered’ category.
>>> rt.cut(x=rt.FA([1, 7, 5, 4, 6, 3]), bins=[1, 3, 6])
Categorical([1.0->3.0, Filtered, 3.0->6.0, 3.0->6.0, 3.0->6.0, 1.0->3.0]) Length: 6
  FastArray([1, 0, 2, 2, 2, 1], dtype=int8) Base Index: 1
  FastArray([b'1.0->3.0', b'3.0->6.0'], dtype='|S8') Unique count: 2
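The bin assignments in the examples above can be reproduced by hand. The following is a minimal pure-Python sketch of the documented behavior, not riptable's implementation: the helper names equal_width_endpoints and bin_codes are hypothetical, bins are assumed to be right-inclusive with the first bin also including its left endpoint, and code 0 stands in for the ‘Filtered’ bin.

```python
from bisect import bisect_left


def equal_width_endpoints(values, n_bins):
    """Return n_bins + 1 evenly spaced endpoints from min(values) to max(values)."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    # Use the exact maximum as the last endpoint to avoid rounding drift.
    return [lo + k * width for k in range(n_bins)] + [float(hi)]


def bin_codes(values, endpoints):
    """Map each value to a 1-based bin code; 0 stands for the 'Filtered' bin.

    Assumption: bins are right-inclusive, and the first bin also includes
    its left endpoint, as in the documented examples.
    """
    first, last = endpoints[0], endpoints[-1]
    codes = []
    for v in values:
        if v != v or v < first or v > last:  # NaN or outside the bin range
            codes.append(0)
        elif v == first:
            codes.append(1)
        else:
            # bisect_left returns the index of the first endpoint >= v,
            # which is the right endpoint of the bin containing v.
            codes.append(bisect_left(endpoints, v))
    return codes


x = [1, 7, 5, 4, 6, 3]
endpoints = equal_width_endpoints(x, 3)
print(endpoints)                 # [1.0, 3.0, 5.0, 7.0]
print(bin_codes(x, endpoints))   # [1, 3, 2, 2, 3, 1]
print(bin_codes(x, [1, 3, 6]))   # [1, 0, 2, 2, 2, 1]
```

The printed codes match the integer mapping arrays shown for bins=3 and bins=[1, 3, 6].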
Known Issues
Each bin includes its right endpoint, even if right is set to False.
>>> rt.cut(x=rt.FA([2, 3, 4]), bins=[1, 2, 3, 4], right=False)
Categorical([1.0->2.0, 2.0->3.0, 3.0->4.0]) Length: 3
  FastArray([1, 2, 3], dtype=int8) Base Index: 1
  FastArray([b'1.0->2.0', b'2.0->3.0', b'3.0->4.0'], dtype='|S8') Unique count: 3
If right is True (the default), the first bin includes its left endpoint even if include_lowest is False (the default).
>>> rt.cut(x=rt.FA([1, 2, 3, 4]), bins=3, include_lowest=False)
Categorical([1.0->2.0, 1.0->2.0, 2.0->3.0, 3.0->4.0]) Length: 4
  FastArray([1, 1, 2, 3], dtype=int8) Base Index: 1
  FastArray([b'1.0->2.0', b'2.0->3.0', b'3.0->4.0'], dtype='|S8') Unique count: 3
If right is False, values of x that fall outside of the last bin’s right endpoint are put into a bin labeled with an integer representing the bin number.
>>> rt.cut(x=rt.FA([1, 2, 3, 4, 5, 6]), bins=[1, 2, 3, 4], right=False)
Categorical([Filtered, 1.0->2.0, 2.0->3.0, 3.0->4.0, !<4>, !<4>]) Length: 6
  FastArray([0, 1, 2, 3, 4, 4], dtype=int8) Base Index: 1
  FastArray([b'1.0->2.0', b'2.0->3.0', b'3.0->4.0'], dtype='|S8') Unique count: 3
If a boolean mask filter is provided that’s shorter than the length of x, values of x that are past the length of the mask are ignored.
>>> rt.cut(x=rt.FA([1, 2, 3, 4]), bins=2, filter=rt.FA([False, True, True]))
Categorical([Filtered, 2.0->2.5, 2.5->3.0]) Length: 3
  FastArray([0, 1, 2], dtype=int8) Base Index: 1
  FastArray([b'2.0->2.5', b'2.5->3.0'], dtype='|S8') Unique count: 2
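Because a short mask is accepted silently, it may be worth validating the mask length before passing it as filter. The following is one possible caller-side guard, written in plain Python; check_filter_length is a hypothetical helper and not part of riptable.

```python
def check_filter_length(x, mask):
    """Return mask unchanged, or raise ValueError if its length differs from x.

    Hypothetical helper: call it on the mask before passing it as `filter`.
    """
    if len(mask) != len(x):
        raise ValueError(
            f"filter has {len(mask)} entries but x has {len(x)} values"
        )
    return mask


x = [1, 2, 3, 4]
check_filter_length(x, [False, True, True, True])  # lengths match, no error

try:
    check_filter_length(x, [False, True, True])    # one entry short
except ValueError as err:
    print(err)  # filter has 3 entries but x has 4 values
```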
- riptable.rt_bin.qcut(x, q, labels=True, retbins=False, precision=3, duplicates='raise', filter=None)
Quantile-based discretization function.
Discretize variable into equal-sized buckets based on rank or based on sample quantiles. For example, 1000 values for 10 quantiles would produce a Categorical object indicating quantile membership for each data point.
- Parameters:
x (1d ndarray) –
q (integer or array of quantiles) – Number of quantiles: 10 for deciles, 4 for quartiles, and so on. Alternatively, an array of quantiles, for example [0, .25, .5, .75, 1.] for quartiles.
labels (boolean, array, or None) – Used as labels for the resulting bins. If an array, must be of the same length as the resulting bins. If False, returns only integer indicators of the bins. If None or True, the labels are created based on the bins. This affects the type of the output container (see below).
retbins (bool, optional) – Whether to return the (bins, labels) or not.
precision (int, optional) – The precision at which to store and display the bin labels.
duplicates ({default 'raise', 'drop'}, optional) – If bin edges are not unique, raise ValueError or drop non-uniques.
filter (ndarray of bool, default None) – If provided, any False values will be ignored in the calculation.
- Returns:
out (Categorical or FastArray) – An array-like object representing the respective bin for each value of x. The type depends on the value of labels:
  - False: returns a FastArray of integers.
  - array, True, or None: returns a Categorical.
bins (ndarray of floats) – The computed or specified bins. Only returned when retbins=True.
Notes
Out-of-bounds values are represented as ‘Clipped’ in the resulting Categorical object.
See also
cut()
Bin values into discrete intervals.
Categorical
Array type for storing data that come from a fixed set of values.
Examples
>>> rt.qcut(range(5), 4)
Categorical([0.0->1.0, 0.0->1.0, 1.0->2.0, 2.0->3.0, 3.0->4.0]) Length: 5
  FastArray([2, 2, 3, 4, 5], dtype=int8) Base Index: 1
  FastArray([b'Clipped', b'0.0->1.0', b'1.0->2.0', b'2.0->3.0', b'3.0->4.0'], dtype='|S8') Unique count: 5
>>> rt.qcut(range(5), 3, labels=["good", "medium", "bad"])
Categorical([good, good, medium, bad, bad]) Length: 5
  FastArray([2, 2, 3, 4, 4], dtype=int8) Base Index: 1
  FastArray([b'Clipped', b'good', b'medium', b'bad'], dtype='|S7') Unique count: 4
>>> rt.qcut(range(5), 4, labels=False)
FastArray([2, 2, 3, 4, 5], dtype=int8)
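To see where the bin edges in these examples come from, the following pure-Python sketch computes edges at evenly spaced quantiles and assigns each value to a bin. It is an illustration under stated assumptions, not riptable's implementation: quantile_edges and quantile_codes are hypothetical names, edges are found by linear interpolation between sorted data points, and bins are treated as right-inclusive with the first bin also including its left edge.

```python
def quantile_edges(values, q):
    """Return bin edges at the requested quantiles (linear interpolation).

    q is an int (number of equal-probability bins) or a sequence of
    quantiles between 0 and 1.
    """
    data = sorted(values)
    probs = [k / q for k in range(q + 1)] if isinstance(q, int) else list(q)
    edges = []
    for p in probs:
        pos = p * (len(data) - 1)              # fractional index into data
        lower = int(pos)
        upper = min(lower + 1, len(data) - 1)
        edges.append(data[lower] + (data[upper] - data[lower]) * (pos - lower))
    return edges


def quantile_codes(values, edges):
    """Assign 1-based bin codes; values outside the edges get code 0."""
    codes = []
    for v in values:
        if v < edges[0] or v > edges[-1]:
            codes.append(0)
            continue
        code = 1
        while v > edges[code]:                 # advance to the first edge >= v
            code += 1
        codes.append(code)
    return codes


x = list(range(5))
print(quantile_edges(x, 4))                      # [0.0, 1.0, 2.0, 3.0, 4.0]
print(quantile_codes(x, quantile_edges(x, 4)))   # [1, 1, 2, 3, 4]
print(quantile_codes(x, quantile_edges(x, 3)))   # [1, 1, 2, 3, 3]
```

The bin membership matches the documented output. The integer codes in the documented output are one higher (2 through 5) because ‘Clipped’ occupies the first slot of the category array there.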
- riptable.rt_bin.quantile(x, q, interpolation_method='fraction')
Compute sample quantile or quantiles of the input array. For example, q=0.5 computes the median.
The interpolation_method parameter supports three values: fraction (the default), lower, and higher. Interpolation is done only if the desired quantile lies between two data points i and j. For fraction, the result is an interpolated value between i and j; for lower, the result is i; for higher, the result is j.
- Parameters:
x (ndarray) – Values from which to extract score.
q (scalar or array) – Percentile at which to extract score.
interpolation_method ({'fraction', 'lower', 'higher'}, optional) – Specifies the interpolation method to use when the desired quantile lies between two data points i and j:
  - fraction: i + (j - i)*fraction, where fraction is the fractional part of the index surrounded by i and j.
  - lower: i.
  - higher: j.
- Returns:
score – Score at percentile.
- Return type:
Examples
>>> import numpy as np
>>> from scipy import stats
>>> a = np.arange(100)
>>> stats.scoreatpercentile(a, 50)
49.5
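The three interpolation_method options described above can be sketched in a few lines of plain Python. This is a minimal illustration, not riptable's code: quantile_sketch is a hypothetical name, q is assumed to be a fraction between 0 and 1, and the quantile's fractional index is assumed to be q * (n - 1) over the sorted data.

```python
import math


def quantile_sketch(values, q, interpolation_method="fraction"):
    """Return the q-th quantile (0 <= q <= 1) of values.

    Assumption: the quantile sits at fractional index q * (n - 1) of the
    sorted data; i and j are the data points on either side of it.
    """
    data = sorted(values)
    pos = q * (len(data) - 1)
    lo_idx = math.floor(pos)
    hi_idx = math.ceil(pos)
    i, j = data[lo_idx], data[hi_idx]
    if interpolation_method == "lower":
        return i
    if interpolation_method == "higher":
        return j
    if interpolation_method == "fraction":
        # i + (j - i) * fraction, where fraction is the fractional part of pos
        return i + (j - i) * (pos - lo_idx)
    raise ValueError(f"unknown interpolation_method: {interpolation_method!r}")


a = range(100)
print(quantile_sketch(a, 0.5))             # 49.5
print(quantile_sketch(a, 0.5, "lower"))    # 49
print(quantile_sketch(a, 0.5, "higher"))   # 50
```

With 100 values, the median falls at index 49.5, between the data points 49 and 50, so the three methods give three different answers.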