riptable.rt_str
Classes
CatString | Provides access to FAString methods for Categoricals.
FAString | String accessor class for FastArray.
- class riptable.rt_str.CatString(cat)
Provides access to FAString methods for Categoricals. All string methods are wrappers of the FAString equivalent, with categorical re-expansion and an option for how to fill filtered elements.
- property _isfiltered
- property substr
- classmethod _build_method(method)
General purpose factory for FAString function wrappers.
- classmethod _build_property(name)
General purpose factory for FAString property wrappers.
- _convert_fastring_output(out)
- extract(regex, expand=None, fillna='', names=None)
- class riptable.rt_str.FAString(shape, dtype=float, buffer=None, offset=0, strides=None, order=None)
Bases:
riptable.rt_fastarray.FastArray
String accessor class for FastArray.
- property backtostring
Convert back to a FastArray or np.ndarray of byte ('S') or unicode ('U') strings, for example 'S12' or 'U40'.
- property lower
Lowercase each string (bytes or unicode); makes a copy.
Examples
>>> FAString(['THIS','THAT','TEST']).lower
FastArray(['this','that','test'], dtype='<U4')
- property n_elements
The number of elements in the original string array
- property reverse
Reverse each string (bytes or unicode); makes a copy.
Examples
>>> FAString(['this','that','test']).reverse
- property reverse_inplace
Reverse each string (bytes or unicode) in place; does not make a copy.
Examples
>>> FAString(['this','that','test']).reverse_inplace
- property str
Casts an array of byte strings or unicode as FAString.
Enables a variety of useful string manipulation methods.
- Return type:
FAString
- Raises:
TypeError – If the FastArray is of dtype other than byte string or unicode
See also
np.chararray, np.char, rt.FAString.apply
Examples
>>> s=FA(['this','that','test ']*100_000)
>>> s.str.upper
FastArray([b'THIS', b'THAT', b'TEST ', ..., b'THIS', b'THAT', b'TEST '], dtype='|S5')
>>> s.str.lower
FastArray([b'this', b'that', b'test ', ..., b'this', b'that', b'test '], dtype='|S5')
>>> s.str.removetrailing()
FastArray([b'this', b'that', b'test', ..., b'this', b'that', b'test'], dtype='|S5')
- property strlen
Return the length of every string (bytes or unicode).
Examples
>>> FAString(['this  ','that ','test']).strlen
FastArray([6, 5, 4])
- property substr
- property upper
Uppercase each string (bytes or unicode); makes a copy.
Examples
>>> FAString(['this','that','test']).upper
FastArray(['THIS','THAT','TEST'], dtype='<U4')
- property upper_inplace
Uppercase each string (bytes or unicode) in place; does not make a copy.
Examples
>>> FAString(['this','that','test']).upper_inplace
- _APPLY_PARALLEL_THRESHOLD = 10000
- nb_char
- nb_char_par
- nb_contains
- nb_contains_par
- nb_endswith
- nb_endswith_par
- nb_find
- nb_index
- nb_index_any_of
- nb_index_any_of_par
- nb_index_par
- nb_lower
- nb_lower_par
- nb_removetrailing
- nb_removetrailing_par
- nb_replace
- nb_replace_par
- nb_reverse
- nb_reverse_inplace
- nb_reverse_inplace_par
- nb_reverse_par
- nb_startswith
- nb_startswith_par
- nb_strlen
- nb_strlen_par
- nb_substr
- nb_substr_par
- nb_upper
- nb_upper_inplace
- nb_upper_inplace_par
- nb_upper_par
- _apply_func(func, funcp, *args, dtype=None, input=None)
- _find(str2)
Searches src for occurrences of str2 and builds a Boolean mask the same size as src indicating the starting point of each occurrence.
- Parameters:
str2 (str) – A string with one or more characters to search for.
Examples
>>> FAString(['this','that','test']).find('t')
FastArray([[True, False, False, False],
           [True, False, False, True],
           [True, False, False, True]])
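The mask described above can be pictured with a pure-Python sketch. This is not riptable's implementation: `find_mask` is a hypothetical helper, and padding short rows with False to a common width is an assumption about how a rectangular mask would be formed.

```python
def find_mask(strings, sub):
    """Return one row of booleans per string; True marks a position where `sub` starts."""
    width = max(len(s) for s in strings)
    mask = []
    for s in strings:
        # str.startswith(prefix, start) tests for a match beginning at index `start`
        row = [s.startswith(sub, i) for i in range(len(s))]
        row += [False] * (width - len(s))  # assumed padding for shorter strings
        mask.append(row)
    return mask


result = find_mask(['this', 'that', 'test'], 't')
for row in result:
    print(row)
```

Each row lines up with one input string, so the mask has one entry per character position rather than one per string.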
- _nb_char(position, itemsize, strlen, out)
- _nb_contains(itemsize, dest, str2)
- _nb_endswith(itemsize, dest, str2)
- _nb_find(itemsize, dest, str2)
Searches src for occurrences of str2 and builds a Boolean array with a row per string indicating the starting points of all such occurrences.
- _nb_index(itemsize, dest, str2)
- _nb_index_any_of(itemsize, dest, str2)
- _nb_lower(itemsize, dest)
- _nb_removetrailing(itemsize, dest, removechar)
- _nb_replace(itemsize, dest, dest_itemsize, old, new, locations)
- _nb_reverse(itemsize, dest)
- _nb_reverse_inplace(itemsize)
- _nb_startswith(itemsize, dest, str2)
- _nb_strlen(itemsize, dest)
- _nb_substr(out, itemsize, start, stop, strlen)
- _nb_upper(itemsize, dest)
- _nb_upper_inplace(itemsize)
- _substr(start, stop=None)
Take a substring of each element using slice args. Behaves like slice, such that a single argument is treated as the stop. start, stop may be integers or arrays of integers aligned with self.
Examples
>>> a = rt.FA(['abc', 'xyzQ'])
>>> a.str.substr(2)
FastArray([b'ab', b'xy'], dtype='|S2')
>>> a.str.substr(0, 2)
FastArray([b'ab', b'xy'], dtype='|S2')
>>> a.str.substr(1, 2)
FastArray([b'b', b'y'], dtype='|S2')
>>> a.str.substr([1, 2]) # element-wise bounds
FastArray([b'a', b'xy'], dtype='|S2')
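The slice-like argument handling can be sketched in plain Python. The helper name `substr_sketch` is hypothetical and the code only mirrors the behavior described above (single argument treated as the stop, scalar or element-wise bounds); it is not the library's code.

```python
def substr_sketch(strings, start, stop=None):
    """Slice each string; a single argument is treated as the stop, like slice()."""
    if stop is None:
        start, stop = 0, start

    def per_element(bound):
        # Broadcast a scalar bound to every string; pass sequences through as-is
        if isinstance(bound, (list, tuple)):
            return list(bound)
        return [bound] * len(strings)

    starts, stops = per_element(start), per_element(stop)
    return [s[a:b] for s, a, b in zip(strings, starts, stops)]


a = ['abc', 'xyzQ']
print(substr_sketch(a, 2))       # ['ab', 'xy']
print(substr_sketch(a, 1, 2))    # ['b', 'y']
print(substr_sketch(a, [1, 2]))  # ['a', 'xy']
```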
- _validate_input(str2)
- apply(func, *args, dtype=None)
Apply your own string function to the array.
NOTE: byte strings are passed as uint8.
NOTE: unicode strings are passed as uint32.
The function's signature must match this default:
@nb.njit(cache=get_global_settings().enable_numba_cache, nogil=True)
def nb_upper(src, itemsize, dest):
src: the input uint array
itemsize: how wide the string is per row
dest: the returned uint array
- Parameters:
*args – Zero or more additional arguments (the arguments are always passed at the end).
dtype – Specify a different dtype for the output.
Example
>>> import numba as nb
>>> @nb.njit(cache=get_global_settings().enable_numba_cache, nogil=True)
... def nb_upper(src, itemsize, dest):
...     # integer division: the number of fixed-width rows in the flat buffer
...     for i in nb.prange(len(src) // itemsize):
...         rowpos = i * itemsize
...         for j in range(itemsize):
...             c = src[rowpos + j]
...             if c >= 97 and c <= 122:
...                 # convert ASCII lowercase to uppercase
...                 dest[rowpos + j] = c - 32
...             else:
...                 dest[rowpos + j] = c
>>> FAString(['this ','that ','test']).apply(nb_upper)
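To make the `src`, `itemsize`, `dest` contract concrete, here is a Numba-free sketch of the same kernel running over a flat byte buffer. It assumes strings are laid out as fixed-width, zero-padded rows (as with NumPy 'S' dtypes); `to_flat_buffer` and `upper_kernel` are hypothetical names used only for illustration.

```python
def to_flat_buffer(strings, itemsize):
    """Pack ASCII strings into one flat buffer of fixed-width, zero-padded rows."""
    buf = bytearray()
    for s in strings:
        buf += s.encode('ascii').ljust(itemsize, b'\x00')
    return buf


def upper_kernel(src, itemsize, dest):
    """Same loop structure as the nb_upper example, in plain Python."""
    for i in range(len(src) // itemsize):
        rowpos = i * itemsize
        for j in range(itemsize):
            c = src[rowpos + j]
            # 97..122 are ASCII 'a'..'z'; subtracting 32 gives 'A'..'Z'
            dest[rowpos + j] = c - 32 if 97 <= c <= 122 else c


words = ['this ', 'that ', 'test']
itemsize = max(len(w) for w in words)
src = to_flat_buffer(words, itemsize)
dest = bytearray(len(src))
upper_kernel(src, itemsize, dest)

# Split the flat output back into rows and drop the zero padding
rows = [bytes(dest[k:k + itemsize]).rstrip(b'\x00')
        for k in range(0, len(dest), itemsize)]
print(rows)
```

The kernel never sees Python strings, only integer character codes, which is why the comparison against 97 and 122 works for both byte and unicode inputs.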
- char(position)
Take a single character from each element.
- Parameters:
position (int or list of int or np.ndarray) – The position of the character to be extracted. Negative values respect the length of the individual strings. If an array, the length must be equal to the number of strings. An error is raised if any positions are out of bounds (>= self._itemsize).
- contains(str2)
Return a boolean array that’s True for each string element that contains the given substring, otherwise False.
The entire substring must match.
- Parameters:
str2 (str) – A string with one or more characters to search for. To search using regular expressions, use FAString.regex_match().
- Returns:
A boolean array where the value is True if the string contains the entire substring specified in str2, otherwise False.
- Return type:
FastArray
Examples
>>> FAString(['this ','that ','test']).contains('at')
FastArray([False, True, False])
This can be called on a FastArray using .str.contains().
>>> a = rt.FastArray(['this ','that ','test'])
>>> a.str.contains('at')
FastArray([False, True, False])
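The difference between a literal substring search and a regular expression search can be seen with the standard library alone. This sketch assumes, as the pointer to `regex_match()` suggests, that `str2` is matched literally; it uses Python's `in` operator and `re.search`, not riptable itself.

```python
import re

strings = ['a.b', 'axb', 'ab']

# Literal search: '.' only matches an actual period
literal_hits = ['a.b' in s for s in strings]

# Regex search: '.' matches any single character
regex_hits = [re.search('a.b', s) is not None for s in strings]

print(literal_hits)  # [True, False, False]
print(regex_hits)    # [True, True, False]
```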
- endswith(str2)
Return a boolean array that’s True where the given substring matches the end of each string element, otherwise False.
The entire substring must match.
- Parameters:
str2 (str) – A string with one or more characters to search for. To search using regular expressions, use FAString.regex_match().
- Returns:
A boolean array where the value is True if the string ends with the entire substring specified in str2, otherwise False.
- Return type:
FastArray
Examples
>>> FAString(['abab','ababa','abababb']).endswith('ab')
FastArray([True, False, False])
This can be called on a FastArray using .str.endswith().
>>> a = rt.FastArray(['abab','ababa','abababb'])
>>> a.str.endswith('ab')
FastArray([True, False, False])
- extract(regex, expand=None, fillna='', names=None, apply_unique=True)
Extract one or more pattern groups from each element of an array into a FastArray or Dataset.
This is useful when you have pieces of data in a string that you want to split into separate elements.
For one capture group, the default is to return a FastArray, but this can be overridden by setting expand to True or by providing a name of a Dataset column to populate. For more than one capture group, a Dataset is returned.
Column names for the resulting Dataset can be specified within the regex using (?P<name>) in the capture group(s) or by passing the names argument, which may be more convenient.
- Parameters:
regex (str) – The pattern(s) to search for. Define multiple capture groups using parentheses.
expand (bool, default False) – Set to True to return a Dataset for a single capture group. If False, a FastArray is returned.
fillna (str, default '' (empty string)) – For elements where there’s no match, this is the fill value for the resulting FastArray or Dataset column.
names (list of str, default None) – For more than one capture group, a Dataset is returned. Optionally, you can provide column names (keys) for the extracted data.
apply_unique (bool) – When True, the regex is applied to the unique values and then expanded using the reverse index (see riptable.unique()). This is optimal for repetitive data and benign for unique or highly non-repetitive data.
- Returns:
For one capture group, a FastArray (or optionally a Dataset) is returned. For more than one capture group, a Dataset is returned.
- Return type:
FastArray or Dataset
See also
FAString.regex_match
Return a boolean array that indicates whether given string or regular expression pattern is contained in each string element.
FAString.regex_replace
Replace each instance of a specified string or pattern.
Examples
These examples use a FastArray containing OSI symbols.
>>> osi = rt.FastArray(['SPX UO 12/15/23 C5700', 'SPXW UO 09/17/21 C3650'])
Extract one substring:
>>> osi.str.extract('\w+')
FastArray([b'SPX', b'SPXW'], dtype='|S4')
Provide a name for the resulting Dataset column:
>>> osi.str.extract('(?P<root>\w+)')
#   root
-   ----
0   SPX
1   SPXW
Define two capture groups and provide names for the resulting Dataset columns:
>>> osi.str.extract('(\w+).* (\d{2}/\d{2}/\d{2})', names = ['root', 'expiration'])
#   root   expiration
-   ----   ----------
0   SPX    12/15/23
1   SPXW   09/17/21
Extract one substring into a Dataset column using expand = True. (Note that for the element with an unmatched pattern, an empty string is returned.)
>>> osi.str.extract('\w+W', expand = True)
#   group_0
-   -------
0
1   SPXW
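The way capture groups map to named columns, with `fillna` used for unmatched elements, can be sketched with the standard `re` module. `extract_sketch` is a hypothetical helper that returns a plain dict of lists in place of a Dataset; the `group_N` default column naming is taken from the example output above, and the rest is an assumption rather than riptable's implementation.

```python
import re


def extract_sketch(strings, pattern, names=None, fillna=''):
    """Return {column name: list of captured values}, one column per capture group."""
    regex = re.compile(pattern)
    if names is None:
        # Use (?P<name>...) names where present, otherwise group_0, group_1, ...
        by_index = {idx: name for name, idx in regex.groupindex.items()}
        names = [by_index.get(i + 1, f'group_{i}') for i in range(regex.groups)]
    columns = {name: [] for name in names}
    for s in strings:
        m = regex.search(s)
        for i, name in enumerate(names):
            value = m.group(i + 1) if m else None
            columns[name].append(fillna if value is None else value)
    return columns


osi = ['SPX UO 12/15/23 C5700', 'SPXW UO 09/17/21 C3650']
two_groups = extract_sketch(osi, r'(\w+).* (\d{2}/\d{2}/\d{2})',
                            names=['root', 'expiration'])
unmatched = extract_sketch(osi, r'(\w+W)')
print(two_groups)
print(unmatched)
```

The second call shows the fill behavior: the first symbol has no word ending in W, so its entry is the empty-string `fillna` value.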
- index(str2)
Return the first index location of the entire substring specified in str2, or -1 if the substring does not exist.
- Parameters:
str2 (str) – A string with one or more characters to search for.
Examples
>>> FAString(['this ','that ','test']).index('at')
FastArray([-1, 2, -1])
- index_any_of(str2)
Return the first index location of any of the characters that are part of str2, or -1 if none of the characters match.
- Parameters:
str2 (str) – A string with one or more characters to search for.
Examples
>>> FAString(['this ','that ','test']).index_any_of('ia')
FastArray([2, 2, -1])
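The contrast between `index` (the whole substring must match) and `index_any_of` (any single character may match) can be sketched in plain Python. The helper names below are hypothetical and reproduce only the documented return values.

```python
def index_sketch(strings, sub):
    """First position of the whole substring, or -1 (same convention as str.find)."""
    return [s.find(sub) for s in strings]


def index_any_of_sketch(strings, chars):
    """First position of any character in `chars`, or -1 if none occurs."""
    out = []
    for s in strings:
        hits = [i for i, c in enumerate(s) if c in chars]
        out.append(hits[0] if hits else -1)
    return out


words = ['this ', 'that ', 'test']
whole = index_sketch(words, 'at')
any_of = index_any_of_sketch(words, 'ia')
print(whole)   # [-1, 2, -1]
print(any_of)  # [2, 2, -1]
```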
- possibly_convert_tostr(arr)
Converts a list-like or an array to the same string type.
- regex_match(regex, apply_unique=True)
Return a boolean array that’s True where the given substring or regular expression pattern is contained in each string element, otherwise False.
The entire substring or pattern must match.
Applies re.search() on each element with regex as the pattern.
- Parameters:
regex (str) – String or regular expression pattern to search for.
apply_unique (bool, default True) – When True, the regex is applied to the unique values and then expanded using the reverse index (see riptable.unique()). This is optimal for repetitive data and benign for unique or highly non-repetitive data.
- Returns:
A boolean array where the value is True if the string element contains the entire substring or regex pattern specified in regex, otherwise False.
- Return type:
FastArray
See also
FAString.regex_replace
Replace each instance of a specified substring or pattern.
FAString.extract
Extract one or more pattern groups into a Dataset or FastArray.
Examples
Find any instance of ‘ab’ that appears at the end of a string:
>>> FAString(['abab','ababa','abababb']).regex_match('ab$')
FastArray([True, False, False])
This can be called on a FastArray using .str.regex_match().
>>> a = rt.FastArray(['abab','ababa','abababb'])
>>> a.str.regex_match('ab$')
FastArray([True, False, False])
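The idea behind `apply_unique` (run the regex once per distinct value, then expand through a reverse index) can be sketched without riptable. `regex_match_unique` is a hypothetical helper; a dict-based index stands in for whatever `riptable.unique()` actually returns.

```python
import re


def regex_match_unique(strings, pattern):
    """Search each distinct string once, then expand results back to full length."""
    regex = re.compile(pattern)
    uniques = []   # distinct values in first-seen order
    position = {}  # value -> index into `uniques`
    inverse = []   # for each input element, the index of its unique value
    for s in strings:
        if s not in position:
            position[s] = len(uniques)
            uniques.append(s)
        inverse.append(position[s])
    unique_hits = [regex.search(u) is not None for u in uniques]
    # Expand the per-unique results back to one result per input element
    return [unique_hits[k] for k in inverse], len(uniques)


data = ['abab', 'ababa', 'abababb'] * 1000
hits, n_regex_calls = regex_match_unique(data, 'ab$')
print(len(hits), n_regex_calls)  # 3000 3
```

With 3,000 elements but only 3 distinct values, the regex runs 3 times instead of 3,000, which is why the approach pays off on repetitive data.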
- regex_replace(regex, repl, apply_unique=True)
Replace each instance of a specified substring or pattern.
The entire substring or pattern must match. If the substring or pattern isn’t found, the original string is returned unchanged.
The behavior is identical to that of re.sub(). In particular, the returned string is obtained by replacing the leftmost non-overlapping occurrences of the substring or pattern with the replacement string.
- Parameters:
regex (str) – String or regular expression pattern to search for.
repl (str) – The replacement string.
apply_unique (bool, default True) – When True, the regex is applied to the unique values and then expanded using the reverse index (see
riptable.unique()
). This is optimal for repetitive data and benign for unique or highly non-repetitive data.
- Returns:
An array with all occurrences of the substring or pattern replaced.
- Return type:
FastArray
See also
FAString.regex_match
Return a boolean array that indicates whether given substring or regular expression pattern is contained in each string element.
FAString.extract
Extract one or more pattern groups into a Dataset or FastArray.
Examples
Replace instances of ‘aa’ with ‘b’. All non-overlapping occurrences are replaced, starting from the left:
>>> FAString(['aaa', 'aaaa', 'aaaaa']).regex_replace('aa', 'b')
FastArray(['ba', 'bb', 'bba'], dtype='<U3')
Replace any instance of ‘ab’ that appears at the end of a string with ‘b’.
>>> FAString(['abab','ababa','abababb']).regex_replace('ab$', 'b')
FastArray(['abb', 'ababa', 'abababb'], dtype='<U7')
This can be called on a FastArray using .str.regex_replace(). The returned FastArray elements are byte strings.
>>> a = rt.FastArray(['abab','ababa','abababb'])
>>> a.str.regex_replace('ab$', 'b')
FastArray([b'abb', b'ababa', b'abababb'], dtype='|S7')
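Since the behavior is described as matching `re.sub()`, the leftmost non-overlapping rule can be checked directly with the standard library. This sketch uses `re.subn`, which also reports how many replacements were made; it does not call riptable.

```python
import re

words = ['aaa', 'aaaa', 'aaaaa']

# re.subn returns (new_string, number_of_replacements)
replaced = [re.subn('aa', 'b', w) for w in words]
print(replaced)  # [('ba', 1), ('bb', 2), ('bba', 2)]

# An anchored pattern replaces at most one occurrence per string
anchored = [re.sub('ab$', 'b', w) for w in ['abab', 'ababa', 'abababb']]
print(anchored)  # ['abb', 'ababa', 'abababb']
```

In 'aaa' the first two characters are consumed by the first match, so the remaining single 'a' cannot form a second, overlapping match.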
- removetrailing(remove=32)
Removes spaces at the end of each string (often to fix up MATLAB strings); makes a copy.
- Parameters:
remove (int, default 32) – The character code to remove. Defaults to ASCII 32 (space).
Examples
>>> FAString(['this  ','that ','test']).removetrailing()
FastArray(['this','that','test'], dtype='<U6')
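Because `remove` is a character code and not a string, a plain-Python sketch may help. `remove_trailing_sketch` is a hypothetical helper built on `str.rstrip`; it assumes that every trailing occurrence of the character is stripped.

```python
def remove_trailing_sketch(strings, remove=32):
    """Strip trailing occurrences of the character whose code is `remove`."""
    ch = chr(remove)  # 32 -> ' '
    return [s.rstrip(ch) for s in strings]


spaces = remove_trailing_sketch(['this  ', 'that ', 'test'])
underscores = remove_trailing_sketch(['a__', 'b_', 'c'], remove=ord('_'))
print(spaces)       # ['this', 'that', 'test']
print(underscores)  # ['a', 'b', 'c']
```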
- replace(old, new)
Replace all occurrences of old with new.
- startswith(str2)
Return a boolean array that’s True where the given substring matches the start of each string element, otherwise False.
The entire substring must match.
- Parameters:
str2 (str) – A string with one or more characters to search for. To search using regular expressions, use FAString.regex_match().
- Returns:
A boolean array where the value is True if the string starts with the entire substring specified in str2, otherwise False.
- Return type:
FastArray
Examples
>>> FAString(['this ','that ','test']).startswith('thi')
FastArray([True, False, False])
This can be called on a FastArray using .str.startswith().
>>> a = rt.FastArray(['this ','that ','test'])
>>> a.str.startswith('thi')
FastArray([True, False, False])
- strpbrk(str2)
- strstr(str2)
- strstrb(str2)
- substr_char_stop(stop, inclusive=False)
Take a substring of each element using characters as bounds.
- Parameters:
stop – A string used to determine the end of the substring. Excluded from the result by default. Where stop is not found in the corresponding element, the substring runs to the end of the string.
inclusive (bool) – If True, include the stopping string in the result.
Examples
>>> s = FastArray(['ABC', 'A_B', 'AB_C', 'AB_C_DD'])
>>> s.str.substr_char_stop('_')
FastArray([b'ABC', b'A', b'AB', b'AB'], dtype='|S2')
>>> s.str.substr_char_stop('_', inclusive=True)
FastArray([b'ABC', b'A_', b'AB_', b'AB_'], dtype='|S2')
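The stop-string behavior, including the fallback when the stop string is absent, can be sketched in plain Python. `substr_char_stop_sketch` is a hypothetical helper that mirrors the documented results and is not the library's implementation.

```python
def substr_char_stop_sketch(strings, stop, inclusive=False):
    """Keep each string up to the first occurrence of `stop`."""
    out = []
    for s in strings:
        pos = s.find(stop)
        if pos == -1:
            out.append(s)  # stop string not found: keep the whole element
        else:
            end = pos + len(stop) if inclusive else pos
            out.append(s[:end])
    return out


s = ['ABC', 'A_B', 'AB_C', 'AB_C_DD']
exclusive = substr_char_stop_sketch(s, '_')
with_stop = substr_char_stop_sketch(s, '_', inclusive=True)
print(exclusive)  # ['ABC', 'A', 'AB', 'AB']
print(with_stop)  # ['ABC', 'A_', 'AB_', 'AB_']
```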