Class SimplePatternSplitTokenizerFactory
java.lang.Object
org.apache.lucene.analysis.AbstractAnalysisFactory
org.apache.lucene.analysis.TokenizerFactory
org.apache.lucene.analysis.pattern.SimplePatternSplitTokenizerFactory
Factory for SimplePatternSplitTokenizer, for producing tokens by splitting according to the provided regexp.
This tokenizer uses Lucene RegExp pattern matching to construct distinct tokens for the input stream. The syntax is more limited than PatternTokenizer, but the tokenization is quite a bit faster. It takes two arguments:
- "pattern" (required) is the regular expression, according to the syntax described at RegExp
- "determinizeWorkLimit" (optional, default Operations.DEFAULT_DETERMINIZE_WORK_LIMIT) is the limit on total effort to determinize the automaton computed from the regexp
The pattern matches the characters that should split tokens, like String.split, and the matching is greedy such that the longest token separator matching at a given point is matched. Empty tokens are never created.
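The splitting behavior described above can be illustrated without Lucene on the classpath. The following is a minimal sketch that uses java.util.regex as a stand-in; it is not Lucene's RegExp syntax or its automaton-based implementation, and the class name SplitSemanticsSketch and the method splitDroppingEmpties are hypothetical. For a simple character class followed by +, java.util.regex consumes the whole run of separator characters, which matches the "longest separator" behavior; for other patterns the two engines may disagree.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SplitSemanticsSketch {

  // Splits input on every non-empty match of separatorRegex and never
  // emits an empty token (leading, trailing, or between adjacent separators).
  static List<String> splitDroppingEmpties(String separatorRegex, String input) {
    List<String> tokens = new ArrayList<>();
    Matcher m = Pattern.compile(separatorRegex).matcher(input);
    int tokenStart = 0;
    while (m.find()) {
      if (m.end() == m.start()) {
        continue; // an empty match is not a separator
      }
      if (m.start() > tokenStart) {
        tokens.add(input.substring(tokenStart, m.start()));
      }
      tokenStart = m.end();
    }
    if (tokenStart < input.length()) {
      tokens.add(input.substring(tokenStart));
    }
    return tokens;
  }

  public static void main(String[] args) {
    // Runs of whitespace act as a single separator.
    System.out.println(splitDroppingEmpties("[ \t\r\n]+", "  quick  brown\tfox\n"));
    // Adjacent separators do not produce an empty token in between.
    System.out.println(splitDroppingEmpties(",", "a,,b,"));
  }
}
```

Note the difference from String.split in the second call: "a,,b,".split(",") keeps the empty string between the two commas, whereas the behavior described here drops it.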
For example, to match tokens delimited by simple whitespace characters:
<fieldType name="text_ptn" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.SimplePatternSplitTokenizerFactory" pattern="[ \t\r\n]+"/>
  </analyzer>
</fieldType>
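Outside of Solr, the factory can also be used directly from Java. The sketch below assumes a Lucene 9 or later analysis-common artifact on the classpath (the package names follow the class hierarchy listed at the top of this page); the class name SplitFactoryDemo, the helper tokenize, and the work limit value are illustrative choices, not part of the Lucene API. The Java string escapes \t, \r and \n are resolved by the compiler, so the RegExp receives the literal whitespace characters.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.pattern.SimplePatternSplitTokenizerFactory;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class SplitFactoryDemo {

  // Hypothetical helper: builds the factory from an args map and collects tokens.
  static List<String> tokenize(String pattern, String text) throws IOException {
    // The factory consumes (removes) entries from the map, so it must be mutable.
    Map<String, String> args = new HashMap<>();
    args.put("pattern", pattern);               // required
    args.put("determinizeWorkLimit", "10000");  // optional; illustrative value

    SimplePatternSplitTokenizerFactory factory =
        new SimplePatternSplitTokenizerFactory(args);

    List<String> tokens = new ArrayList<>();
    try (Tokenizer tokenizer = factory.create()) {
      tokenizer.setReader(new StringReader(text));
      CharTermAttribute term = tokenizer.addAttribute(CharTermAttribute.class);
      tokenizer.reset();
      while (tokenizer.incrementToken()) {
        tokens.add(term.toString());
      }
      tokenizer.end();
    }
    return tokens;
  }

  public static void main(String[] args) throws IOException {
    System.out.println(tokenize("[ \t\r\n]+", "hello  world\tfoo\n"));
  }
}
```

The no-argument create() call uses the default AttributeFactory; pass an AttributeFactory to create(AttributeFactory) when a custom one is needed.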
Since:
6.5.0
Field Summary
Fields
Modifier and Type | Field | Description
private final int | determinizeWorkLimit |
private final Automaton | dfa |
static final String | NAME | SPI name
static final String | PATTERN |
Fields inherited from class org.apache.lucene.analysis.AbstractAnalysisFactory
LUCENE_MATCH_VERSION_PARAM, luceneMatchVersion
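Constructor Summary
Constructors
Constructor | Description
SimplePatternSplitTokenizerFactory() | Default ctor for compatibility with SPI
SimplePatternSplitTokenizerFactory(Map<String,String> args) | Creates a new SimplePatternSplitTokenizerFactory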
Method Summary
Modifier and Type | Method | Description
SimplePatternSplitTokenizer | create(AttributeFactory factory) | Creates a TokenStream of the specified input using the given AttributeFactory
Methods inherited from class org.apache.lucene.analysis.TokenizerFactory
availableTokenizers, create, findSPIName, forName, lookupClass, reloadTokenizers
Methods inherited from class org.apache.lucene.analysis.AbstractAnalysisFactory
defaultCtorException, get, get, get, get, get, getBoolean, getChar, getClassArg, getFloat, getInt, getLines, getLuceneMatchVersion, getOriginalArgs, getPattern, getSet, getSnowballWordSet, getWordSet, isExplicitLuceneMatchVersion, require, require, require, requireBoolean, requireChar, requireFloat, requireInt, setExplicitLuceneMatchVersion, splitAt, splitFileNames
Field Details
NAME
public static final String NAME
SPI name
PATTERN
public static final String PATTERN
dfa
private final Automaton dfa
determinizeWorkLimit
private final int determinizeWorkLimit
Constructor Details
SimplePatternSplitTokenizerFactory
public SimplePatternSplitTokenizerFactory(Map<String,String> args)
Creates a new SimplePatternSplitTokenizerFactory
SimplePatternSplitTokenizerFactory
public SimplePatternSplitTokenizerFactory()
Default ctor for compatibility with SPI
Method Details
create
public SimplePatternSplitTokenizer create(AttributeFactory factory)
Description copied from class: TokenizerFactory
Creates a TokenStream of the specified input using the given AttributeFactory
Specified by:
create in class TokenizerFactory