Strus built-in functions
Core
List of functions and operators predefined in the storage queryprocessor:
Posting join operator
List of predefined posting join operators:
- chain get the set of postings (d,p) that exist in the first argument set and (d,p+ri) exist in the argument set i with |ri| <= |range| and |ri| <= |rj| for i < j.
- chain_struct get the set of postings (d,p) that exist in the second argument set and (d,p+ri) exist in the argument set i with |ri| <= |range| and |ri| <= |rj| for i < j. Additionally there must not exist a posting in the first argument set that is overlapped by the interval formed by the other argument postings.
- contains get the set of postings (d,1) for documents d that contain all of the argument features.
- diff get the set of postings (d,p) that are in the first argument set but not in the second argument set.
- inrange get the set of postings (d,p) that exist in any argument set and (d,p+r) exist in all other argument sets with |r| <= |range|.
- inrange_struct get the set of postings (d,p) that exist in any argument set and (d,p+r) exist in all other argument sets with |r| <= |range|. Additionally there must not exist a posting in the first argument set that is overlapped by the interval formed by the other argument postings.
- intersect get the set of postings (d,p) that are occurring in all argument sets or in at least the number specified with cardinality.
- pred get the set of postings (d,p-1) for all (d,p) with p>1 in the argument set.
- sequence get the set of postings (d,p) that exist in the first argument set and (d,p+ri) exist in the argument set i with |ri| <= |range| and |ri| < |rj| for i < j.
- sequence_struct get the set of postings (d,p) that exist in the second argument set and (d,p+ri) exist in the argument set i with |ri| <= |range| and |ri| < |rj| for i < j. Additionally there must not exist a posting in the first argument set that is overlapped by the interval formed by the other argument postings.
- succ get the set of postings (d,p+1) for all (d,p) in the argument set.
- union get the set of postings that are occurring in any argument set.
- within get the set of postings (d,p) that exist in any argument set and distinct (d,p+r) exist in all other argument sets with |r| <= |range|.
- within_struct get the set of postings (d,p) that exist in any argument set and distinct (d,p+r) exist in all other argument sets with |r| <= |range|. Additionally there must not exist a posting in the first argument set that is overlapped by the interval formed by the other argument postings.
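The set semantics of these operators can be illustrated on plain sets of (document, position) pairs. The following is a didactic sketch, not the strus implementation (which evaluates lazily over posting iterators); the `sequence` sketch simplifies the range constraints to a greedy nearest-match search.

```python
# Didactic sketch of a few posting join operators on plain sets of
# (document, position) pairs.

def intersect(*argsets):
    """Postings (d,p) occurring in all argument sets."""
    result = set(argsets[0])
    for s in argsets[1:]:
        result &= set(s)
    return result

def union(*argsets):
    """Postings (d,p) occurring in any argument set."""
    result = set()
    for s in argsets:
        result |= set(s)
    return result

def succ(postings):
    """(d,p+1) for every (d,p) in the argument set."""
    return {(d, p + 1) for (d, p) in postings}

def sequence(range_, first, *rest):
    """Postings (d,p) in the first set such that each further argument
    set contains a match (d,p+ri) with 0 < ri <= range_ and strictly
    increasing ri (simplified: greedy search, ties ignored)."""
    result = set()
    for (d, p) in first:
        r_prev, ok = 0, True
        for s in rest:
            candidates = [pp - p for (dd, pp) in s
                          if dd == d and r_prev < pp - p <= range_]
            if not candidates:
                ok = False
                break
            r_prev = min(candidates)  # greedy: take the nearest match
        if ok:
            result.add((d, p))
    return result
```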
Weighting function
List of predefined weighting functions with argument descriptions:
- bm25 calculates the document weight with the weighting scheme "BM25" (Okapi)
- match [Feature] defines the query features to weight.
- k1 [Numeric (1:1000)] parameter of the BM25 weighting scheme.
- b [Numeric (0.0001:1000)] parameter of the BM25 weighting scheme.
- avgdoclen [Numeric (0:)] the average document length.
- metadata_doclen [Metadata] the meta data element name referencing the document length for each document weighted.
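The per-feature weight of the standard Okapi BM25 scheme can be sketched as follows. The parameter names mirror the list above; the idf variant shown is one common choice and may differ in detail from the strus implementation.

```python
import math

def bm25(ff, df, N, doclen, k1=1.5, b=0.75, avgdoclen=100.0):
    """Okapi BM25 weight of a single query feature in one document.
    ff: feature frequency in the document, df: document frequency,
    N: collection size, doclen: the metadata_doclen value."""
    idf = math.log((N - df + 0.5) / (df + 0.5) + 1.0)
    norm = k1 * (1.0 - b + b * doclen / avgdoclen)
    return idf * ff * (k1 + 1.0) / (ff + norm)
```

The document weight is the sum of these per-feature weights over all 'match' features.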
- bm25pff calculates the document weight with the weighting scheme "BM25pff". This is "BM25" where the feature frequency is counted by 1.0 per feature only for features with the maximum proximity score. The proximity score is a measure that takes the proximity of other query features into account.
- match [Feature] defines the query features to weight.
- struct [Feature] defines the delimiter for structures.
- para [Feature] defines the delimiter for paragraphs (windows used for proximity weighting must not overlap paragraph borders).
- k1 [Numeric (1:1000)] parameter of the BM25pff weighting scheme.
- b [Numeric (0.0001:1000)] parameter of the BM25pff weighting scheme.
- titleinc [Numeric (0.0:)] ff increment for title features.
- tidocnorm [Numeric] specifies a normalization factor of the title weight between 0 and 1. Documents bigger than or equal to this value get a factor close to 1, smaller documents a smaller factor.
- windowsize [Numeric] the size of the window used for finding features to increment proximity scores.
- cardinality [Numeric] the number of query features a proximity score window must contain to be considered (optional, default is all features).
- metadata_title_maxpos [Metadata] the metadata element that specifies the last title element. Elements in title are scored with an ff increment.
- metadata_title_size [Metadata] the metadata element that specifies the number of terms (size) of the title.
- ffbase [Numeric (0.0:1.0)] value in the range from 0.0 to 1.0 specifying the percentage of the constant score on the proximity ff for every feature occurrence. (with 1.0 the scheme is plain BM25).
- fftie [Numeric (0:)] value specifying the mapping of the ff of a weighted feature to an interval between 0 and this value.
- proxffbias [Numeric (0.0:1.0)] bias for proximity ff increments that are always counted (the others are counted only up to 'proxfftie').
- proxfftie [Numeric (0:)] the maximum proximity based ff value that is considered for weighting except for increments exceeding 'proxffbias'.
- avgdoclen [Numeric (0:)] the average document length.
- maxdf [Numeric (0:)] the maximum df as fraction of the collection size.
- metadata_doclen [Metadata] the meta data element name referencing the document length for each document weighted.
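The role of the 'windowsize' and 'cardinality' parameters can be sketched as a search for windows of query feature occurrences. This is a simplified illustration of the idea, not the strus proximity scoring algorithm; the function name and return shape are hypothetical.

```python
def proximity_windows(positions, windowsize, cardinality):
    """Find windows spanning fewer than `windowsize` positions that
    contain at least `cardinality` distinct query features.
    `positions` maps each feature name to its sorted occurrence
    positions in one document. Returns (start, end, features) triples."""
    events = sorted((p, f) for f, ps in positions.items() for p in ps)
    windows = []
    for i, (start, _) in enumerate(events):
        seen = {}
        for p, f in events[i:]:
            if p - start >= windowsize:
                break
            seen.setdefault(f, p)  # first occurrence of each feature
            if len(seen) >= cardinality:
                windows.append((start, p, frozenset(seen)))
                break
    return windows
```

Features occurring together in such a window would then receive a proximity score increment.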
- smart calculates the document weight as a sum of scalar functions, evaluated for each query feature, of the ff (feature frequency), the df (document frequency), N (the number of documents in the collection) and some document metadata element values. The name 'smart' is inspired by the classical SMART weighting schemes of information retrieval that are covered by this method. The scalar function is specified as a string expression.
- match [Feature] defines the input query features.
- function [String] defines an expression to evaluate. You can use the operators '*','/','+','-' and functions like 'log', 'tanh', etc.. Brackets '(' and ')' can be used for grouping subexpressions. The variables 'df','ff' and 'N' can be used besides all variables specified as parameters or as meta data elements.
- metadata [String] defines a metadata element that can be referenced in the scalar function expression.
- [a-z]+ [Numeric] defines a variable to be used in the formula expression.
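Evaluating such a scalar function string can be sketched with Python's expression evaluator over a restricted namespace. The real strus parser accepts a similar but not identical grammar, and this sketch should not be taken as its implementation.

```python
import math

def eval_smart(expression, ff, df, N, **variables):
    """Evaluate a 'smart'-style scalar function string such as
    "0.5 * ff / (ff + k1) * log(N / df)". The variables ff, df and N
    are predefined; extra keyword arguments supply the parameters
    declared as [a-z]+ variables or metadata elements."""
    names = {"ff": ff, "df": df, "N": N,
             "log": math.log, "tanh": math.tanh, "sqrt": math.sqrt}
    names.update(variables)
    return eval(expression, {"__builtins__": {}}, names)
```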
- metadata calculate the weight of a document as value of a meta data element.
- name [Metadata] name of the meta data element to use as weight.
- td calculates the weight of a document as sum of the feature weights multiplied with their feature frequency.
- match [Feature] defines the query features to weight.
- tf calculates the weight of a document as sum of the feature frequency of a feature multiplied with the feature weight.
- match [Feature] defines the query features to weight.
- weight [Numeric (0:)] defines the query feature weight factor.
Summarizer
List of predefined summarizer functions with argument descriptions:
- accuvariable accumulates the weights of all contents of a variable in matching expressions. Weights with the same position are grouped and multiplied, the group results are added to the sum, and the total weight is assigned to the variable content.
- match [Feature] defines the query features to inspect for variable matches.
- type [String] the forward index feature type for the content to extract.
- var [String] the name of the variable referencing the content to weight.
- nof [Numeric (1:)] the maximum number of the best weighted elements to return (default 10).
- norm [Numeric (0.0:1.0)] the normalization factor of the calculated weights (default 1.0).
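The grouping rule described for 'accuvariable' (multiply within a position group, sum the group products) can be sketched as follows; the function name and input shape are illustrative.

```python
from collections import defaultdict

def accuvariable_weight(matches):
    """Accumulate variable weights: weights at the same position are
    multiplied within their group, and the group products are summed.
    `matches` is a list of (position, weight) pairs."""
    groups = defaultdict(lambda: 1.0)
    for pos, weight in matches:
        groups[pos] *= weight
    return sum(groups.values())
```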
- attribute gets the value of a document attribute.
- name [Attribute] the name of the attribute to get.
- matchphrase gets the best matching phrase delimited by the structure postings.
- match [Feature] defines the features to weight.
- struct [Feature] defines the delimiter for structures.
- para [Feature] defines the delimiter for paragraphs (summaries must not overlap paragraph borders).
- type [String] the forward index type of the result phrase elements.
- metadata_title_maxpos [Metadata (1:)] the metadata element that specifies the last title element. Only content is used for abstracting.
- sentencesize [Numeric (1:)] restrict the maximum length of sentences in summaries.
- windowsize [Numeric (1:)] maximum size of window used for identifying matches.
- cardinality [Numeric (1:)] minimum number of features in a window.
- matchmark [String] specifies the markers (first character of the value is the separator followed by the two parts separated by it) for highlighting matches in the resulting phrases.
- floatingmark [String] specifies the markers (first character of the value is the separator followed by the two parts separated by it) for marking floating phrases without start or end of sentence found.
- name_para [String] specifies the summary element name used for paragraphs (default 'para').
- name_phrase [String] specifies the summary element name used for phrases (default 'phrase').
- name_docstart [String] specifies the summary element name used for the document start (alternative summary, if no match found, default 'docstart').
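The marker format used by 'matchmark' and 'floatingmark' (first character is the separator, followed by the two marker parts) can be parsed as sketched below; the helper name is hypothetical.

```python
def parse_marker(spec):
    """Parse a matchmark/floatingmark spec: the first character is the
    separator, the rest holds the open and close markers, e.g.
    "#<b>#</b>" -> ("<b>", "</b>")."""
    if not spec:
        return ("", "")
    sep, rest = spec[0], spec[1:]
    open_mark, _, close_mark = rest.partition(sep)
    return (open_mark, close_mark)
```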
- matchpos get the feature occurrences (match positions) printed.
- match [Feature] defines the query features.
- N [Numeric (1:)] the maximum number of matches to return.
- matchvariables extracts all variables assigned to subexpressions of features specified.
- match [Feature] defines the query features to inspect for variable matches.
- type [String] the forward index feature type for the content to extract.
- metadata get the value of a document meta data element.
- name [Metadata] the name of the meta data element to get.
Analyzer
List of functions and operators predefined in the analyzer textprocessor:
Tokenizer
List of predefined tokenizer functions:
- content producing one token for each input chunk (identity).
- punctuation producing punctuation elements (end of sentence recognition). The language is specified as parameter (currently only german 'de' and english 'en' supported).
- split splitting tokens separated by whitespace characters.
- word splitting tokens by word boundaries for european languages.
- regex identify tokens with a regular expression.
Normalizer
List of predefined normalizer functions:
- convdia mapping all diacritical characters to ascii. The language is passed as argument (currently only german 'de' and english 'en' supported).
- date2int mapping a date to an integer. The granularity of the result is passed as first argument and alternative date formats as following arguments.
- dictmap mapping the elements with a dictionary. For found elements the passed value is returned. The dictionary file name is passed as argument.
- empty mapping input tokens to an empty string.
- lc mapping all characters to lowercase.
- orig mapping the identity of the input tokens.
- stem doing stemming based on snowball. The language is passed as parameter.
- text mapping the identity of the input tokens.
- uc mapping all characters to uppercase.
- ngram producing NGRAMs of a defined size (2, 3, 4, etc.).
- regex search and replace with a regular expression and a substitution pattern.
Aggregator
List of predefined aggregator functions:
- count counting the input elements.
- maxpos getting the maximum position of the input elements.
- minpos getting the minimum position of the input elements.
- sumsquaretf calculating the sum of the square of the tf of all selected elements.
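As an example of one of the functions above, the 'ngram' function produces character NGRAMs of a defined size. A minimal sketch of that behavior (the exact windowing and edge handling in strus may differ):

```python
def ngrams(token, size):
    """Produce the character NGRAMs of `token` of length `size`;
    tokens shorter than `size` are returned unchanged."""
    if len(token) < size:
        return [token]
    return [token[i:i + size] for i in range(len(token) - size + 1)]
```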