public abstract class BaseTagger extends Object implements Tagger
Modifier and Type | Field and Description |
---|---|
protected Locale | locale |
protected WordTagger | wordTagger |
Constructor and Description |
---|
BaseTagger(String filename, Locale locale) |
BaseTagger(String filename, Locale locale, boolean tagLowercaseWithUppercase) |
BaseTagger(String filename, Locale locale, boolean tagLowercaseWithUppercase, boolean internTags) |
Modifier and Type | Method and Description |
---|---|
protected List<AnalyzedToken> | additionalTags(String word, WordTagger wordTagger): Allows additional tagging in some language-dependent circumstances. |
protected AnalyzedToken | asAnalyzedToken(String word, morfologik.stemming.WordData wd) |
protected List<AnalyzedToken> | asAnalyzedTokenList(String word, List<morfologik.stemming.WordData> wdList) |
protected List<AnalyzedToken> | asAnalyzedTokenListForTaggedWords(String word, List<TaggedWord> taggedWords) |
AnalyzedTokenReadings | createNullToken(String token, int startPos): Create the AnalyzedToken used for whitespace and other non-words. |
AnalyzedToken | createToken(String token, String posTag): Create a token specific to the language of the implementing class. |
protected List<AnalyzedToken> | getAnalyzedTokens(String word) |
protected morfologik.stemming.Dictionary | getDictionary() |
String | getDictionaryPath() |
List<String> | getManualAdditionsFileNames(): Get the filenames for manual additions, e.g., /en/added.txt. |
List<String> | getManualRemovalsFileNames(): Get the filenames for manual removals, e.g., /en/removed.txt. |
protected WordTagger | getWordTagger() |
boolean | overwriteWithManualTagger(): If true, tags from the binary dictionary (*.dict) will be overwritten by manual tags from the plain text dictionary. |
List<AnalyzedTokenReadings> | tag(List<String> sentenceTokens): Returns a list of AnalyzedTokens that assigns each term in the sentence some kind of part-of-speech information (not necessarily just one tag). |
protected final WordTagger wordTagger
protected final Locale locale
public BaseTagger(String filename, Locale locale, boolean tagLowercaseWithUppercase)
@NotNull public List<String> getManualAdditionsFileNames()
Get the filenames for manual additions, e.g., /en/added.txt.
@NotNull public List<String> getManualRemovalsFileNames()
Get the filenames for manual removals, e.g., /en/removed.txt.
public String getDictionaryPath()
public boolean overwriteWithManualTagger()
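The interplay between overwriteWithManualTagger() and the manual additions and removals files can be pictured with a minimal sketch that uses plain maps in place of the dictionaries. ManualOverlaySketch and combine are hypothetical names, not part of this API. Merging both sources when the flag is false, and applying removals last, are assumptions made for illustration rather than a copy of the actual implementation.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Hypothetical stand-in, not part of the library: models tag lookup with plain maps. */
final class ManualOverlaySketch {

  /**
   * Combines the tags for one word from a binary-dictionary lookup, a manual
   * additions lookup, and a manual removals lookup.
   */
  static List<String> combine(String word,
                              Map<String, List<String>> binaryDict,
                              Map<String, List<String>> additions,
                              Map<String, List<String>> removals,
                              boolean overwriteWithManual) {
    // Start with the manual tags from the plain text additions.
    List<String> result = new ArrayList<>(additions.getOrDefault(word, List.of()));
    // With overwriting on, binary tags are used only when no manual tags exist.
    // Assumption: with overwriting off, both sources are merged.
    if (!(overwriteWithManual && !result.isEmpty())) {
      for (String tag : binaryDict.getOrDefault(word, List.of())) {
        if (!result.contains(tag)) {
          result.add(tag);
        }
      }
    }
    // Assumption: removals are applied after the two sources are combined.
    result.removeAll(removals.getOrDefault(word, List.of()));
    return result;
  }

  public static void main(String[] args) {
    Map<String, List<String>> dict = Map.of("run", List.of("VB", "NN"));
    Map<String, List<String>> added = Map.of("run", List.of("VBP"));
    Map<String, List<String>> removed = Map.of("run", List.of("NN"));
    System.out.println(combine("run", dict, added, removed, false)); // [VBP, VB]
    System.out.println(combine("run", dict, added, removed, true));  // [VBP]
  }
}
```

In this model a word without manual tags still falls back to the binary dictionary even when the flag is true.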
protected WordTagger getWordTagger()
protected morfologik.stemming.Dictionary getDictionary()
public List<AnalyzedTokenReadings> tag(List<String> sentenceTokens) throws IOException
Description copied from interface: Tagger
Returns a list of AnalyzedTokens that assigns each term in the sentence some kind of part-of-speech information (not necessarily just one tag).
Note that this method takes exactly one sentence. Its implementation may handle the first word of a sentence specially, since that word is usually written with an uppercase letter.
Specified by: tag in interface Tagger
Parameters: sentenceTokens - the text as returned by a WordTokenizer
Throws: IOException
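As a rough illustration of the contract described for tag(), the sketch below tags one tokenized sentence, lets a word carry several tags, and gives a capitalized word the tags of its lowercase form as well. SentenceTaggingSketch, Reading, and the map-based lexicon are hypothetical stand-ins, not the library's classes. The sketch also assumes that the token list includes whitespace tokens, so that summing token lengths yields character offsets.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

/** Hypothetical stand-in, not part of the library: tags a single sentence. */
final class SentenceTaggingSketch {

  /** Simplified model of one token with its readings and start offset. */
  static final class Reading {
    final String token;
    final List<String> posTags;
    final int startPos;

    Reading(String token, List<String> posTags, int startPos) {
      this.token = token;
      this.posTags = posTags;
      this.startPos = startPos;
    }

    @Override
    public String toString() {
      return token + posTags + "@" + startPos;
    }
  }

  static List<Reading> tag(List<String> sentenceTokens,
                           Map<String, List<String>> lexicon,
                           Locale locale) {
    List<Reading> readings = new ArrayList<>();
    int pos = 0;
    for (String word : sentenceTokens) {
      // Tags for the word exactly as written; a word may have several.
      Set<String> tags = new LinkedHashSet<>(lexicon.getOrDefault(word, List.of()));
      String lower = word.toLowerCase(locale);
      // A capitalized word (for example the first word of the sentence)
      // also receives the tags of its lowercase form.
      if (!word.equals(lower)) {
        tags.addAll(lexicon.getOrDefault(lower, List.of()));
      }
      // Tokens without a lexicon entry (such as whitespace) keep an empty tag list here.
      readings.add(new Reading(word, new ArrayList<>(tags), pos));
      pos += word.length();
    }
    return readings;
  }

  public static void main(String[] args) {
    Map<String, List<String>> lexicon =
        Map.of("walks", List.of("NNS", "VBZ"), "help", List.of("VB", "NN"));
    System.out.println(tag(List.of("Walks", " ", "help"), lexicon, Locale.ENGLISH));
  }
}
```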
protected List<AnalyzedToken> getAnalyzedTokens(String word)
protected List<AnalyzedToken> asAnalyzedTokenList(String word, List<morfologik.stemming.WordData> wdList)
protected List<AnalyzedToken> asAnalyzedTokenListForTaggedWords(String word, List<TaggedWord> taggedWords)
protected AnalyzedToken asAnalyzedToken(String word, morfologik.stemming.WordData wd)
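public final AnalyzedTokenReadings createNullToken(String token, int startPos)
Description copied from interface: Tagger
Create the AnalyzedToken used for whitespace and other non-words. Uses null as the POS tag for this token.
Specified by: createNullToken in interface Tagger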
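public AnalyzedToken createToken(String token, String posTag)
Description copied from interface: Tagger
Create a token specific to the language of the implementing class.
Specified by: createToken in interface Tagger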
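@Nullable protected List<AnalyzedToken> additionalTags(String word, WordTagger wordTagger)
Allows additional tagging in some language-dependent circumstances.
Parameters: word - The word to tag
Returns: possibly null (the method is annotated @Nullable)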
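The nullable hook described for additionalTags can be sketched as an overridable method whose default contributes nothing. The example is self-contained and does not use the real classes: AdditionalTagsSketch, SimpleTagger, SuffixAwareTagger, and the "-like" suffix rule are all hypothetical. The null check in the caller is an assumption about how such a hook would typically be consumed.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Hypothetical stand-in, not part of the library: a nullable tagging hook. */
final class AdditionalTagsSketch {

  /** Simplified base tagger with an extension point for language-specific tags. */
  static class SimpleTagger {
    private final Map<String, List<String>> lexicon;

    SimpleTagger(Map<String, List<String>> lexicon) {
      this.lexicon = lexicon;
    }

    /** Hook for subclasses; returning null means "no additional tags". */
    List<String> additionalTags(String word) {
      return null;
    }

    final List<String> tagWord(String word) {
      List<String> tags = new ArrayList<>(lexicon.getOrDefault(word, List.of()));
      List<String> extra = additionalTags(word);
      if (extra != null) { // guard against the nullable result
        tags.addAll(extra);
      }
      return tags;
    }
  }

  /** Hypothetical subclass: tags any word ending in "-like" as an adjective. */
  static final class SuffixAwareTagger extends SimpleTagger {
    SuffixAwareTagger(Map<String, List<String>> lexicon) {
      super(lexicon);
    }

    @Override
    List<String> additionalTags(String word) {
      return word.endsWith("-like") ? List.of("JJ") : null;
    }
  }

  public static void main(String[] args) {
    SimpleTagger tagger = new SuffixAwareTagger(Map.of("cat", List.of("NN")));
    System.out.println(tagger.tagWord("cat"));      // [NN]
    System.out.println(tagger.tagWord("cat-like")); // [JJ]
  }
}
```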