org.apache.uima.java
true
com.ibm.langware.annotator.jFrostLexAnnotator
LanguageWare Lexical Annotator
This annotator provides access to LanguageWare Lexical Analysis.
8.0.4.0
IBM Corporation
SofaNames
The Sofa names the annotator should work on. If no
names are specified, the annotator works on the
default sofa.
String
true
false
LWDataSubdir
The name of the directory under the UIMA data directory in
which the LanguageWare resources are located
String
false
false
UseExplicitDicts
Dictionaries to be used are specified explicitly in this config file.
Boolean
false
false
PreloadLanguages
A list of all languages which should be pre-loaded at
init-time in the form xx-YY (xx=lang, YY=sublang/country)
String
true
false
DefaultLanguage
The language to use in processing when the document language has not been set
before the annotator runs.
String
false
false
DictionaryCacheSize
!Deprecated! Maximum number of dictionaries held in cache
Integer
false
false
ProcessLanguagesWithNoDictionaries
Controls the annotator behaviour if no dictionaries are configured for the language of the processed document.
If "tokenize", only basic tokenization will be possible.
If "skip", processing will be terminated with no errors.
If "error", an exception will be thrown.
The default value is "skip".
String
false
false
UseFirstMatchPolicy
If true, lookup stops after the first match in any dictionary (DLTCM_POLICY_FIRST);
otherwise, all matches from all dictionaries are found (DLTCM_POLICY_ALL).
Boolean
false
false
UseStrictCaseMode
If true, strict-case mode is turned 'ON', which means case information is respected when
doing lookup in lowercase dictionaries. Otherwise, strict-case mode is set to 'OFF' and a match
is returned even if the case doesn't match.
Boolean
false
false
UseRelativeTokenAndSentenceNumbers
If true, token and sentence numbers are reset to 1 for each new sentence/paragraph.
Boolean
false
false
AnnotateMWConstituentTokens
If true, MWU annotations are created for multi-word entries and token annotations are created
for their constituent words; otherwise, only MWU annotations are created.
Boolean
false
false
MWBoundary
This defines the lookup boundaries for MWUs. Possible values for this parameter are:
"Sentence", "Paragraph", or "Document".
String
false
false
IgnorePunctuationTokens
If true, punctuation tokens are ignored
Boolean
false
false
AggressiveSentenceBreaks
!Deprecated! If true, an end-of-line will be considered end-of-sentence
Boolean
false
false
CrossDictionaryDecomposition
If true, decomposition is performed across dictionaries,
i.e. words from several dictionaries may be combined into one compound.
Boolean
false
false
BOFAOnlyDecomposition
If true, decomposition is performed based on BOFA values only.
Boolean
false
false
FilterDecomposedGlosses
If true, the paradigms reported by decomposition for each component are filtered
according to the decomposition rules, removing paradigms that are not valid in
combination. Setting this to false may lead to better performance and recall
at the expense of precision.
Boolean
false
false
StandaloneDecomposition
If true, the lexical analyzer tries to decompose dictionary-matched entries which have a compound flag.
Boolean
false
false
JapaneseDecomposition
If true, decomposition is done for Japanese documents without
regard to the result specification.
Boolean
false
false
JapaneseDeepWordBreak
If true, Japanese word suffixes are returned separated from their stems.
Boolean
false
false
CreateCompoundPartsInsteadOfToken
If true, compound parts are created not as type uima.tt.CompPartAnnotation but
as uima.tt.TokenAnnotation. The annotations for the compound parts of a complex word
are created instead of (not in addition to) the token for the whole complex word.
Boolean
false
true
ReturnOnlyFirstLevelOfCompoundBreakdown
If true, then for compounds which have several decompositions only the first
(longest match) decomposition is returned. E.g. for the German "Segelschullehrer" only
"Segelschul" + "lehrer" is returned and not also "Segel" + "schul" + "lehrer".
Boolean
false
false
CreateDecompStructure
If true, the full decomposition analysis structure is created.
This option is intended to be mutually exclusive with the previous two.
Boolean
false
false
BreakOnHyphens
!Deprecated! If true, the annotator tries to break unknown words that contain a hyphen.
Boolean
false
false
DoLookupVariant
If true, unknown words are looked up in the variant dictionary.
Boolean
false
false
DoRuleBasedNormalization4All
If true, a variant is looked up with rule-based normalization for all unknown words.
Boolean
false
false
DoRuleBasedNormalization4Katakana
If true, a variant is looked up with rule-based normalization only for katakana words.
Boolean
false
false
CreateGenericAnnotations
Create generic annotations if annotation glosses are available.
Boolean
false
false
CheckGenericTypes
Check the types when writing the feature values for generic annotations.
Boolean
false
false
GlossComparatorClassname
The full name of the class implementation for the Comparator interface
to be used for sorting gloss collections.
String
false
false
LemmaPoolingThreshold
A threshold that is used to control lemma pooling. Pooling improves the memory usage of
the annotator and is useful when processing large documents. Setting the value to 0
means pooling is always enabled; setting it to -1 disables pooling.
Integer
false
false
LexicalDicts
File names of the dictionaries for lexical analysis
String
true
false
MultiWordDicts
File names of the dictionaries for specific multi-word units
String
true
false
OOVDicts
File names of the dictionaries for the morphological guesser (out-of-vocabulary)
String
true
false
SynonymDicts
File names of the dictionaries for synonyms
String
true
false
VariantDicts
File names of the dictionaries for word variants
String
true
false
SpellCorrectionDicts
File names of the dictionaries for spelling correction
String
true
false
PartOfSpeechDict
File name of the dictionary for part-of-speech tagging
String
false
false
PostTagHandling
Post tag handling policy
String
false
false
PostLemmaEntryHandling
Post LemmaEntries handling policy
String
false
false
MaxCharNumPerSentence
The maximum number of characters in a sentence.
Integer
false
false
BreakRulesSpec
Break rules to be used.
String
false
false
DecompositionRulesSpec
Decomposition rules to be used.
String
false
false
LWDataSubdir
PreloadLanguages
en
UseExplicitDicts
true
ProcessLanguagesWithNoDictionaries
skip
UseFirstMatchPolicy
false
UseStrictCaseMode
false
UseRelativeTokenAndSentenceNumbers
false
AnnotateMWConstituentTokens
true
MWBoundary
Sentence
IgnorePunctuationTokens
false
CrossDictionaryDecomposition
true
BOFAOnlyDecomposition
false
FilterDecomposedGlosses
true
JapaneseDecomposition
true
JapaneseDeepWordBreak
false
CreateCompoundPartsInsteadOfToken
true
ReturnOnlyFirstLevelOfCompoundBreakdown
false
CreateDecompStructure
false
DoLookupVariant
false
DoRuleBasedNormalization4All
false
DoRuleBasedNormalization4Katakana
false
CreateGenericAnnotations
true
CheckGenericTypes
false
GlossComparatorClassname
com.ibm.langware.annotator.GlossComparator
PartOfSpeechDict
de-XX-TSimplified-7220.dic
LexicalDicts
../resources/dictionary/8/de-XX-LLex-7017.dic
../resources/dictionary/9/de-XX-OOV-7002.dic
OOVDicts
BreakRulesSpec
PartOfSpeechDict
ru-RU-TSimplified-7200.dic
LexicalDicts
../resources/dictionary/24/ru-RU-LLex-7003.dic
../resources/dictionary/25/ru-RU-OOV-7003.dic
OOVDicts
BreakRulesSpec
PartOfSpeechDict
pt-XX-TSimplified-7001.dic
LexicalDicts
../resources/dictionary/22/pt-XX-LLex-7008.dic
../resources/dictionary/23/pt-XX-OOV-7003.dic
OOVDicts
BreakRulesSpec
PartOfSpeechDict
ko-KR-TKpos-8041.dic
LexicalDicts
OOVDicts
BreakRulesSpec
PartOfSpeechDict
en-XX-TPenn-7212.dic
LexicalDicts
../resources/dictionary/0/en-XX-LLex-7030.dic
../resources/dictionary/1/en-XX-OOV-7004.dic
OOVDicts
BreakRulesSpec
PartOfSpeechDict
it-IT-TSimplified-7001.dic
LexicalDicts
../resources/dictionary/15/it-IT-LLex-7007.dic
../resources/dictionary/16/it-IT-OOV-7002.dic
OOVDicts
BreakRulesSpec
PartOfSpeechDict
fr-XX-TSimplified-7001.dic
LexicalDicts
../resources/dictionary/12/fr-XX-LLex-7009.dic
../resources/dictionary/13/fr-XX-OOV-7002.dic
OOVDicts
BreakRulesSpec
PartOfSpeechDict
zh-XX-TCpos-7000.dic
LexicalDicts
../resources/dictionary/26/zh-XX-Lex-8003.dic
OOVDicts
BreakRulesSpec
PartOfSpeechDict
es-ES-TSimplified-7002.dic
LexicalDicts
../resources/dictionary/10/es-ES-LLex-7006.dic
../resources/dictionary/11/es-ES-OOV-7003.dic
OOVDicts
BreakRulesSpec
PartOfSpeechDict
cs-CZ-TSimplified-7200.dic
LexicalDicts
../resources/dictionary/4/cs-CZ-LLex-7003.dic
../resources/dictionary/5/cs-CZ-OOV-7004.dic
OOVDicts
BreakRulesSpec
PartOfSpeechDict
ar-XX-TSimplified-7003.dic
LexicalDicts
../resources/dictionary/2/ar-XX-Lex-7007.dic
../resources/dictionary/3/ar-XX-OOV-7003.dic
OOVDicts
BreakRulesSpec
PartOfSpeechDict
ja-JP-TJpos-7000.dic
LexicalDicts
../resources/dictionary/17/ja-JP-Lex-7006.dic
OOVDicts
BreakRulesSpec
PartOfSpeechDict
pl-PL-TSimplified-7200.dic
LexicalDicts
../resources/dictionary/20/pl-PL-LLex-7003.dic
../resources/dictionary/21/pl-PL-OOV-7004.dic
OOVDicts
BreakRulesSpec
PartOfSpeechDict
da-DK-TSimplified-7000.dic
LexicalDicts
../resources/dictionary/6/da-DK-LLex-7005.dic
../resources/dictionary/7/da-DK-OOV-7002.dic
OOVDicts
BreakRulesSpec
PartOfSpeechDict
he-IL-TSimplified-7201.dic
LexicalDicts
../resources/dictionary/14/he-IL-Lex-7205.dic
OOVDicts
BreakRulesSpec
PartOfSpeechDict
tr-TR-TTpos-8502.dic
LexicalDicts
OOVDicts
BreakRulesSpec
PartOfSpeechDict
nl-NL-TSimplified-7000.dic
LexicalDicts
../resources/dictionary/18/nl-NL-Reform-LLex-7004.dic
../resources/dictionary/19/nl-NL-OOV-7002.dic
OOVDicts
BreakRulesSpec
uima.tcas.DocumentAnnotation
Annotation covering the entire document, containing document meta information, for example the document language
uima.tcas.Annotation
language
The document language
uima.cas.String
languageCandidates
A list of language candidates for the document produced during language identification. These are sorted by confidence value
uima.cas.FSList
uima.tt.TTAnnotation
Base type for lexical and document structure annotation types
uima.tcas.Annotation
uima.tt.DocStructureAnnotation
Base type for document structure annotation types
uima.tt.TTAnnotation
uima.tt.ParagraphAnnotation
A paragraph
uima.tt.DocStructureAnnotation
paragraphNumber
The sequence number of the paragraph in the document
uima.cas.Integer
uima.tt.SentenceAnnotation
A sentence
uima.tt.DocStructureAnnotation
sentenceNumber
The sequence number of the sentence in the paragraph (or the document)
uima.cas.Integer
uima.tt.LexicalAnnotation
Base type for lexical annotation types
uima.tt.TTAnnotation
uima.tt.DictionaryEntryAnnotation
Base type for dictionary-based user-defined annotation types
uima.tt.LexicalAnnotation
lemma
Morphological information for the dictionary entry
uima.tt.Lemma
uima.tt.TokenLikeAnnotation
Base type for token annotation types
uima.tt.LexicalAnnotation
lemma
The best probable entry containing all morphological information for the token
uima.tt.Lemma
lemmaEntries
List of lemma entries containing all morphological information for the token
uima.cas.FSArray
dictionaryMatch
A flag indicating whether or not the token matches a dictionary entry
uima.cas.Boolean
uima.tt.TokenAnnotation
General token annotation type. It is also the base type for the special token types
uima.tt.TokenLikeAnnotation
posTag
Part-of-Speech tag
uima.cas.String
uima.tt.CompPartAnnotation
A part of a compound word
uima.tt.TokenLikeAnnotation
uima.tt.KeyStringEntry
Base type for types defining a key/value feature (e.g. uima.tt.Lemma type)
uima.cas.TOP
key
A key/value feature (e.g. lemma string in uima.tt.Lemma type)
uima.cas.String
uima.tt.Lemma
Morphological information retrieved from a lexical dictionary entry
uima.tt.KeyStringEntry
partOfSpeech
An integer encoding representing the part-of-speech for the lemma
uima.cas.Integer
frost_ExtendedPOS
An integer representing additional information related to the part-of-speech
uima.cas.Integer
isStopword
uima.cas.Boolean
uima.tt.LanguageConfidencePair
Language-Confidence pair of a language candidate for the document text
uima.cas.TOP
languageConfidence
An indication (a float value between 0 and 1) of how well the candidate language actually fits the language of the document
uima.cas.Float
language
Language name (ISO Locale code)
uima.cas.String
com.ibm.langware.uimatypes.WordLikeToken
Base type for possible words (neither punctuation nor symbols). Also represents alphanumeric tokens
uima.tt.TokenAnnotation
com.ibm.langware.uimatypes.Alphabetic
Alphabetic word
com.ibm.langware.uimatypes.WordLikeToken
com.ibm.langware.uimatypes.UppercaseAlphabetic
Uppercase alphabetic word
com.ibm.langware.uimatypes.Alphabetic
com.ibm.langware.uimatypes.TitlecaseAlphabetic
Titlecase alphabetic word
com.ibm.langware.uimatypes.Alphabetic
com.ibm.langware.uimatypes.LowercaseAlphabetic
Lowercase alphabetic word
com.ibm.langware.uimatypes.Alphabetic
com.ibm.langware.uimatypes.Arabic
Arabic word
com.ibm.langware.uimatypes.Alphabetic
com.ibm.langware.uimatypes.Hebrew
Hebrew word
com.ibm.langware.uimatypes.Alphabetic
com.ibm.langware.uimatypes.Syllabic
Syllabic word
com.ibm.langware.uimatypes.WordLikeToken
com.ibm.langware.uimatypes.Hiragana
Hiragana (Syllabic) word
com.ibm.langware.uimatypes.Syllabic
com.ibm.langware.uimatypes.Katakana
Katakana (Syllabic) word
com.ibm.langware.uimatypes.Syllabic
com.ibm.langware.uimatypes.Hangul
Hangul (Syllabic) word
com.ibm.langware.uimatypes.Syllabic
com.ibm.langware.uimatypes.Ideographic
Ideographic word
com.ibm.langware.uimatypes.WordLikeToken
com.ibm.langware.uimatypes.Han
Han (Ideographic) word
com.ibm.langware.uimatypes.Ideographic
com.ibm.langware.uimatypes.Numeric
A numeric sequence
com.ibm.langware.uimatypes.WordLikeToken
com.ibm.langware.uimatypes.ChineseNumeral
A Chinese numeral
com.ibm.langware.uimatypes.Numeric
com.ibm.langware.uimatypes.Punctuation
A punctuation mark or symbol
uima.tt.TokenAnnotation
com.ibm.langware.uimatypes.ClauseEndingPunctuation
A clause-terminating punctuation mark
com.ibm.langware.uimatypes.Punctuation
uima.tt.ParagraphAnnotation
uima.tt.SentenceAnnotation
uima.tt.TokenAnnotation
uima.tt.TokenAnnotation:lemma
uima.tt.TokenAnnotation:lemmaEntries
x-unspecified
uima.tt.ParagraphAnnotation
uima.tt.SentenceAnnotation
uima.tt.TokenAnnotation
uima.tt.CompPartAnnotation
uima.tt.Lemma
uima.tt.ParagraphAnnotation:paragraphNumber
uima.tt.SentenceAnnotation:sentenceNumber
uima.tt.TokenAnnotation:posTag
uima.tt.TokenAnnotation:lemmaEntries
uima.tt.TokenAnnotation:dictionaryMatch
uima.tt.Lemma:key
uima.tt.Lemma:partOfSpeech
uima.tt.Lemma:isStopword
uima.tt.Lemma:frost_ExtendedPOS
en
af
ar
ca
cs
da
de
el
es
fr
he
it
ja
ko
nb
nl
nn
pl
pt
ru
sv
tr
zh
true
true
false
ResourcesFile
Location of Resources
../resources/Tagger/
/
Resources
ResourcesFile