public class DefaultLanguageIdentifier extends LanguageIdentifier
en
for those.
By default, only the first 1000 characters of a text are considered.
Email signatures that use \n-- \n as a delimiter are ignored.
Nested classes inherited from class LanguageIdentifier: LanguageIdentifier.ParsedLanguageLists
Fields inherited from class LanguageIdentifier: COMMON_WORDS_LANG_IDENTIFIER, maxLength, NON_LATIN_CHARS_LANGUAGES, REMOVE_EMAIL_SIGNATURE_FILTER, REMOVE_MENTION_FILTER, REMOVE_NON_BREAKING_SPACES_FILTER, REMOVE_URL_FILTER, SCORE_THRESHOLD, UNICODE_BASED_LANG_IDENTIFIER
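The class description says that only the first 1000 characters are considered and that email signatures after a `\n-- \n` delimiter are ignored. The following is a minimal, self-contained sketch of that preprocessing, not the library's implementation: the class `CleanTextSketch`, the method `cleanAndShorten`, the two constants, and the order of the two steps are all assumptions made for illustration.

```java
final class CleanTextSketch {
    // Assumed values, taken from the class description rather than from the library source.
    static final int MAX_LENGTH = 1000;
    static final String SIGNATURE_DELIMITER = "\n-- \n";

    /**
     * Hypothetical helper: caps the text at MAX_LENGTH characters, then drops
     * everything from the first signature delimiter onward.
     */
    static String cleanAndShorten(String text) {
        String shortText = text.length() > MAX_LENGTH ? text.substring(0, MAX_LENGTH) : text;
        int signatureStart = shortText.indexOf(SIGNATURE_DELIMITER);
        return signatureStart >= 0 ? shortText.substring(0, signatureStart) : shortText;
    }

    public static void main(String[] args) {
        String mail = "Das ist ein Test.\n-- \nJohn Doe\nExample Corp";
        System.out.println(cleanAndShorten(mail)); // prints "Das ist ein Test."
    }
}
```

In this sketch the signature is removed so that names and addresses in it do not skew detection of the message body; the real class applies its filters through the inherited fields listed above.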
Modifier and Type | Method and Description |
---|---|
Language | detectLanguage(String cleanText) |
DetectedLanguage | detectLanguage(String cleanText, List<String> noopLangsTmp, List<String> preferredLangsTmp) |
DetectedLanguage | detectLanguage(String cleanText, List<String> noopLangsTmp, List<String> preferredLangsTmp, boolean limitOnPreferredLangs) |
AtomicInteger | getFasttextInitCounter() (for test only) |
boolean | isFastTextEnabled() |
void | setFastTextDetector(FastTextDetector fastTextDetector) (for test only) |
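The `detectLanguage` overloads accept a list of "noop" language codes, and the method details state that a result can be null when no language is identified. The sketch below models only that decision with plain strings so it runs without the library. The class `NoopMappingSketch`, the method `resolve`, and the `"noop"` marker are hypothetical; the real methods return `Language` or `DetectedLanguage` objects.

```java
import java.util.List;

final class NoopMappingSketch {
    // Hypothetical marker standing in for the NoopLanguage, which has no rules.
    static final String NOOP = "noop";

    /**
     * Returns null when nothing was detected, NOOP when the detected code
     * is on the noop list, and the detected code itself otherwise.
     */
    static String resolve(String detectedCode, List<String> noopLangs) {
        if (detectedCode == null) {
            return null; // language could not be identified
        }
        return noopLangs.contains(detectedCode) ? NOOP : detectedCode;
    }

    public static void main(String[] args) {
        List<String> noopLangs = List.of("xx");
        System.out.println(resolve("de", noopLangs)); // prints "de"
        System.out.println(resolve("xx", noopLangs)); // prints "noop"
    }
}
```

Callers of the real methods should likewise check for null before using the result.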
Methods inherited from class LanguageIdentifier: cleanAndShortenText, getHighestScoringResult, prepareDetectLanguage
@TestOnly public void setFastTextDetector(FastTextDetector fastTextDetector)
@TestOnly public AtomicInteger getFasttextInitCounter()
public boolean isFastTextEnabled()
@Nullable public Language detectLanguage(String cleanText)
detectLanguage in class LanguageIdentifier
Parameters:
cleanText - a cleanText as returned by LanguageIdentifier.cleanAndShortenText(String)
Returns:
null if language could not be identified
public DetectedLanguage detectLanguage(String cleanText, List<String> noopLangsTmp, List<String> preferredLangsTmp)
detectLanguage in class LanguageIdentifier
Parameters:
cleanText - a cleanText as returned by LanguageIdentifier.cleanAndShortenText(String)
noopLangsTmp - list of codes that are detected but will lead to the NoopLanguage that has no rules
Returns:
null if language could not be identified
@Nullable public DetectedLanguage detectLanguage(String cleanText, List<String> noopLangsTmp, List<String> preferredLangsTmp, boolean limitOnPreferredLangs)
detectLanguage in class LanguageIdentifier