
Natural Language Processing with Java

System.out.println(tagger.tagString("AFAIK she H8 cth!")); 
System.out.println(tagger.tagString( 
    "BTW had a GR8 tym at the party BBIAM."));

mallet-2.0.6$ bin/mallet import-dir --input sample-data/web/en --output tutorial.mallet --keep-sequence --remove-stopwords

try (InputStream is = new FileInputStream( 
        new File(getModelDir(), "en-token.bin"))){ 
    // Insert code to tokenize the text 
} catch (FileNotFoundException ex) { 
    ... 
} catch (IOException ex) { 
    ... 
} 

TokenizerModel model = new TokenizerModel(is); 
Tokenizer tokenizer = new TokenizerME(model); 

String tokens[] = tokenizer.tokenize("He lives at 1511 W. " 
  + "Randolph."); 

for (String a : tokens) { 
  System.out.print("[" + a + "] "); 
} 
System.out.println(); 

[He] [lives] [at] [1511] [W.] [Randolph] [.]  

PTBTokenizer ptb = new PTBTokenizer( 
new StringReader("He lives at 1511 W. Randolph."), 
new CoreLabelTokenFactory(), null); 
while (ptb.hasNext()) { 
  System.out.println(ptb.next()); 
} 

He
lives
at
1511
W.
Randolph
.  

List<String> tokenList = new ArrayList<>(); 
List<String> whiteList = new ArrayList<>(); 

String text = "A sample sentence processed \nby \tthe " + 
    "LingPipe tokenizer."; 

Tokenizer tokenizer = IndoEuropeanTokenizerFactory.INSTANCE. 
tokenizer(text.toCharArray(), 0, text.length()); 

tokenizer.tokenize(tokenList, whiteList); 

for(String element : tokenList) { 
  System.out.print(element + " "); 
} 
System.out.println(); 

A sample sentence processed by the LingPipe tokenizer

String text = "Mr. Smith went to 123 Washington avenue."; 

String tokens[] = text.split("\\s+"); 

for(String token : tokens) { 
  System.out.println(token); 
} 

Mr.
Smith
went
to
123
Washington
avenue.  

String paragraph = "The first sentence. The second sentence."; 

Reader reader = new StringReader(paragraph); 
DocumentPreprocessor documentPreprocessor =  
new DocumentPreprocessor(reader); 

List<String> sentenceList = new LinkedList<String>(); 

for (List<HasWord> element : documentPreprocessor) { 
  StringBuilder sentence = new StringBuilder(); 
  List<HasWord> hasWordList = element; 
  for (HasWord token : hasWordList) { 
      sentence.append(token).append(" "); 
  } 
  sentenceList.add(sentence.toString()); 
} 

for (String sentence : sentenceList) { 
  System.out.println(sentence); 
} 

The first sentence . 
The second sentence .   

String text = "Mr. Smith went to 123 Washington avenue."; 
String target = "Washington"; 
int index = text.indexOf(target); 
System.out.println(index); 

22

try { 
    String[] sentences = { 
         "Tim was a good neighbor. Perhaps not as good a Bob " +  
        "Haywood, but still pretty good. Of course Mr. Adam " +  
        "took the cake!"}; 
    // Insert code to find the names here 
  } catch (IOException ex) { 
    ex.printStackTrace(); 
}

Tokenizer tokenizer = SimpleTokenizer.INSTANCE; 

TokenNameFinderModel model = new TokenNameFinderModel( 
new File("C:\\OpenNLP Models", "en-ner-person.bin")); 

NameFinderME finder = new NameFinderME(model); 

for (String sentence : sentences) { 
    String[] tokens = tokenizer.tokenize(sentence); 
    Span[] nameSpans = finder.find(tokens); 
    System.out.println(Arrays.toString( 
    Span.spansToStrings(nameSpans, tokens))); 
} 

[Tim, Bob Haywood, Adam]  

POSModel model = new POSModelLoader().load( 
    new File("../OpenNLP Models/", "en-pos-maxent.bin")); 

POSTaggerME tagger = new POSTaggerME(model); 

String sentence = "POS processing is useful for enhancing the "  
   + "quality of data sent to other elements of a pipeline."; 

String tokens[] = WhitespaceTokenizer.INSTANCE.tokenize(sentence); 

String[] tags = tagger.tag(tokens); 

for(int i=0; i<tokens.length; i++) { 
    System.out.print(tokens[i] + "[" + tags[i] + "] "); 
} 

    POS[NNP] processing[NN] is[VBZ] useful[JJ] for[IN] enhancing[VBG] the[DT] quality[NN] of[IN] data[NNS] sent[VBN] to[TO] other[JJ] elements[NNS] of[IN] a[DT] pipeline.[NN]  

Properties properties = new Properties();         
properties.put("annotators", "tokenize, ssplit, parse"); 

StanfordCoreNLP pipeline = new StanfordCoreNLP(properties); 

Annotation annotation = new Annotation( 
    "The meaning and purpose of life is plain to see."); 

pipeline.annotate(annotation); 
pipeline.prettyPrint(annotation, System.out); 

    Sentence #1 (11 tokens):
    The meaning and purpose of life is plain to see.
    [Text=The CharacterOffsetBegin=0 CharacterOffsetEnd=3 PartOfSpeech=DT] [Text=meaning CharacterOffsetBegin=4 CharacterOffsetEnd=11 PartOfSpeech=NN] [Text=and CharacterOffsetBegin=12 CharacterOffsetEnd=15 PartOfSpeech=CC] [Text=purpose CharacterOffsetBegin=16 CharacterOffsetEnd=23 PartOfSpeech=NN] [Text=of CharacterOffsetBegin=24 CharacterOffsetEnd=26 PartOfSpeech=IN] [Text=life CharacterOffsetBegin=27 CharacterOffsetEnd=31 PartOfSpeech=NN] [Text=is CharacterOffsetBegin=32 CharacterOffsetEnd=34 PartOfSpeech=VBZ] [Text=plain CharacterOffsetBegin=35 CharacterOffsetEnd=40 PartOfSpeech=JJ] [Text=to CharacterOffsetBegin=41 CharacterOffsetEnd=43 PartOfSpeech=TO] [Text=see CharacterOffsetBegin=44 CharacterOffsetEnd=47 PartOfSpeech=VB] [Text=. CharacterOffsetBegin=47 CharacterOffsetEnd=48 PartOfSpeech=.] 
    (ROOT
      (S
        (NP
          (NP (DT The) (NN meaning)
            (CC and)
            (NN purpose))
          (PP (IN of)
            (NP (NN life))))
        (VP (VBZ is)
          (ADJP (JJ plain)
            (S
              (VP (TO to)
                (VP (VB see))))))
        (. .)))

    root(ROOT-0, plain-8)
    det(meaning-2, The-1)
    nsubj(plain-8, meaning-2)
    conj_and(meaning-2, purpose-4)
    prep_of(meaning-2, life-6)
    cop(plain-8, is-7)
    aux(see-10, to-9)
    xcomp(plain-8, see-10)

prep_of(meaning-2, life-6)  

javac -encoding Big5

Scanner scanner = new Scanner("Let's pause, and then "
    + "reflect."); 
List<String> list = new ArrayList<>(); 
while(scanner.hasNext()) { 
    String token = scanner.next(); 
    list.add(token); 
} 
for(String token : list) { 
    System.out.println(token); 
} 

Let's
pause,
and
then
reflect.

scanner.useDelimiter("[ ,.]"); 

Let's
pause

and
then
reflect  
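
The blank line in the output above is the empty token between the comma and the space that follows it. A minimal sketch, assuming empty tokens are unwanted: adding the `+` quantifier makes a run of delimiters count as a single delimiter. The class and method names here are illustrative only.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

class DelimiterDemo {
    // Splits text on runs of spaces, commas, and periods,
    // so adjacent delimiters never produce an empty token.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        try (Scanner scanner = new Scanner(text)) {
            scanner.useDelimiter("[ ,.]+");
            while (scanner.hasNext()) {
                tokens.add(scanner.next());
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        for (String token : tokenize("Let's pause, and then reflect.")) {
            System.out.println(token);
        }
    }
}
```

The trade-off is that punctuation is discarded entirely, which may not suit tasks that need sentence boundaries later.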

String text = "Mr. Smith went to 123 Washington avenue."; 
String tokens[] = text.split("\\s+"); 
for (String token : tokens) { 
    System.out.println(token); 
} 

Mr.
Smith
went
to
123
Washington
avenue.

BreakIterator wordIterator = BreakIterator.getWordInstance(); 
String text = "Let's pause, and then reflect."; 

wordIterator.setText(text); 
int boundary = wordIterator.first();

while (boundary != BreakIterator.DONE) { 
    int begin = boundary; 
    System.out.print(boundary + "-"); 
    boundary = wordIterator.next(); 
    int end = boundary; 
    if(end == BreakIterator.DONE) break; 
    System.out.println(boundary + " [" 
    + text.substring(begin, end) + "]"); 
} 

0-5 [Let's]
5-6 [ ]
6-11 [pause]
11-12 [,]
12-13 [ ]
13-16 [and]
16-17 [ ]
17-21 [then]
21-22 [ ]
22-29 [reflect]
29-30 [.]  
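
The listing above reports every span between boundaries, including the single spaces. A small sketch, assuming only non-blank spans are of interest, wraps the same iteration in a helper that skips whitespace-only spans. `WordBoundaryDemo` and `tokenize` are hypothetical names, and `Locale.US` is chosen here only to make the behavior independent of the default locale.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class WordBoundaryDemo {
    // Collects the text between consecutive word boundaries,
    // dropping spans that contain only whitespace.
    static List<String> tokenize(String text) {
        BreakIterator iterator = BreakIterator.getWordInstance(Locale.US);
        iterator.setText(text);
        List<String> tokens = new ArrayList<>();
        int begin = iterator.first();
        int end = iterator.next();
        while (end != BreakIterator.DONE) {
            String span = text.substring(begin, end);
            if (!span.trim().isEmpty()) {
                tokens.add(span);
            }
            begin = end;
            end = iterator.next();
        }
        return tokens;
    }

    public static void main(String[] args) {
        for (String token : tokenize("Let's pause, and then reflect.")) {
            System.out.println("[" + token + "]");
        }
    }
}
```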

try { 
    StreamTokenizer tokenizer = new StreamTokenizer( 
          new StringReader("Let's pause, and then reflect.")); 
    boolean isEOF = false; 
    while (!isEOF) { 
        int token = tokenizer.nextToken(); 
        switch (token) { 
            case StreamTokenizer.TT_EOF: 
                isEOF = true; 
                break; 
            case StreamTokenizer.TT_EOL: 
                break; 
            case StreamTokenizer.TT_WORD: 
                System.out.println(tokenizer.sval); 
                break; 
            case StreamTokenizer.TT_NUMBER: 
                System.out.println(tokenizer.nval); 
                break; 
            default: 
                System.out.println((char) token); 
        } 
    } 
} catch (IOException ex) { 
    // Handle the exception 
} 

Let
'  

tokenizer.ordinaryChar('\''); 
tokenizer.ordinaryChar(','); 

Let
'
s
pause
,
and
then
reflect.  

StringTokenizer st = new StringTokenizer("Let's pause, and "
     + "then reflect."); 
while (st.hasMoreTokens()) { 
    System.out.println(st.nextToken()); 
}

Let's
pause,
and
then
reflect.

private String paragraph = "Let's pause, \nand then "
     + "reflect.";

SimpleTokenizer simpleTokenizer = SimpleTokenizer.INSTANCE; 
String tokens[] = simpleTokenizer.tokenize(paragraph); 
for(String token : tokens) { 
    System.out.println(token); 
} 

    Let
    '
    s
    pause
    ,
    and
    then
    reflect
    .  

String tokens[] = 
 WhitespaceTokenizer.INSTANCE.tokenize(paragraph); 
for (String token : tokens) { 
    System.out.println(token); 
} 

    Let's
    pause,
    and
    then
    reflect.  

// try-with-resources closes the model stream automatically
try (InputStream modelInputStream = new FileInputStream( 
        new File(getModelDir(), "en-token.bin"))) { 
    TokenizerModel model = new 
         TokenizerModel(modelInputStream); 
    Tokenizer tokenizer = new TokenizerME(model); 
    String tokens[] = tokenizer.tokenize(paragraph); 
    for (String token : tokens) { 
        System.out.println(token); 
    } 
} catch (IOException ex) { 
    // Handle the exception 
} 

Let
's
pause
,
and
then
reflect
.  

PTBTokenizer ptb = new PTBTokenizer( 
    new StringReader(paragraph), new 
 CoreLabelTokenFactory(),null); 
while (ptb.hasNext()) { 
    System.out.println(ptb.next()); 
} 

Let
's
pause
,
and
then
reflect
.  

PTBTokenizer ptb = new PTBTokenizer( 
    new StringReader(paragraph), new WordTokenFactory(), null);

CoreLabelTokenFactory ctf = new CoreLabelTokenFactory(); 
PTBTokenizer ptb = new PTBTokenizer( 
    new StringReader(paragraph),ctf,"invertible=true"); 
while (ptb.hasNext()) { 
    CoreLabel cl = (CoreLabel)ptb.next(); 
    System.out.println(cl.originalText() + " (" +  
        cl.beginPosition() + "-" + cl.endPosition() + ")"); 
} 

Let (0-3)
's (3-5)
pause (6-11)
, (11-12)
and (14-17)
then (18-22)
reflect (23-30)
. (30-31)  
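
Character offsets like these can also be produced with the JDK alone. The following is a rough sketch, not a replacement for `PTBTokenizer`: it assumes a token is either a run of word characters or a single non-space symbol, so it splits `Let's` into `Let`, `'`, and `s` rather than `Let` and `'s`. The class name and the regular expression are illustrative assumptions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class OffsetDemo {
    // A token is a run of word characters or one non-space, non-word symbol.
    private static final Pattern TOKEN = Pattern.compile("\\w+|[^\\w\\s]");

    // Returns each token followed by its begin (inclusive) and end (exclusive) offsets.
    static List<String> tokenizeWithOffsets(String text) {
        List<String> result = new ArrayList<>();
        Matcher matcher = TOKEN.matcher(text);
        while (matcher.find()) {
            result.add(matcher.group() + " (" + matcher.start()
                + "-" + matcher.end() + ")");
        }
        return result;
    }

    public static void main(String[] args) {
        String paragraph = "Let's pause, \nand then reflect.";
        for (String line : tokenizeWithOffsets(paragraph)) {
            System.out.println(line);
        }
    }
}
```

As in the output above, the jump from offset 12 to 14 reflects the space and newline that sit between the comma and `and`.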

Reader reader = new StringReader(paragraph);

DocumentPreprocessor documentPreprocessor = 
      new DocumentPreprocessor(reader); 

Iterator<List<HasWord>> it = documentPreprocessor.iterator(); 
while (it.hasNext()) { 
    List<HasWord> sentence = it.next(); 
    for (HasWord token : sentence) { 
        System.out.println(token); 
    } 
} 

Let
's
pause
,
and
then
reflect
.  

Properties properties = new Properties(); 
properties.put("annotators", "tokenize, ssplit");

StanfordCoreNLP pipeline = new StanfordCoreNLP(properties); 
Annotation annotation = new Annotation(paragraph); 

pipeline.annotate(annotation); 
pipeline.prettyPrint(annotation, System.out); 

    Sentence #1 (8 tokens):
    Let's pause, 
    and then reflect.
    [Text=Let CharacterOffsetBegin=0 CharacterOffsetEnd=3] [Text='s CharacterOffsetBegin=3 CharacterOffsetEnd=5] [Text=pause CharacterOffsetBegin=6 CharacterOffsetEnd=11] [Text=, CharacterOffsetBegin=11 CharacterOffsetEnd=12] [Text=and CharacterOffsetBegin=14 CharacterOffsetEnd=17] [Text=then CharacterOffsetBegin=18 CharacterOffsetEnd=22] [Text=reflect CharacterOffsetBegin=23 CharacterOffsetEnd=30] [Text=. CharacterOffsetBegin=30 CharacterOffsetEnd=31]

char text[] = paragraph.toCharArray(); 
TokenizerFactory tokenizerFactory = 
 IndoEuropeanTokenizerFactory.INSTANCE; 
Tokenizer tokenizer = tokenizerFactory.tokenizer(text, 0, 
 text.length); 
for (String token : tokenizer) { 
    System.out.println(token); 
}

Let
'
s
pause
,
and
then
reflect
.  

These fields are used to provide further information about how tokens should be identified<SPLIT>.  
They can help identify breaks between numbers<SPLIT>, such as 23.6<SPLIT>, punctuation characters such as commas<SPLIT>. 

BufferedOutputStream modelOutputStream = null; 
try { 
    ... 
} catch (UnsupportedEncodingException ex) { 
    // Handle the exception 
} catch (IOException ex) { 
    // Handle the exception 
} 

ObjectStream<String> lineStream = new PlainTextByLineStream( 
    new FileInputStream("training-data.train"), "UTF-8"); 
ObjectStream<TokenSample> sampleStream =  
    new TokenSampleStream(lineStream); 

TokenizerModel model = TokenizerME.train( 
    "en", sampleStream, true, 5, 100);

BufferedOutputStream modelOutputStream = new 
 BufferedOutputStream( 
    new FileOutputStream(new File("mymodel.bin"))); 
model.serialize(modelOutputStream); 

    Indexing events using cutoff of 5

    Dropped event F:[p=2, s=3.6,, p1=2, p1_num, p2=bok, p1f1=23, f1=3, f1_num, f2=., f2_eos, f12=3.]
    Dropped event F:[p=23, s=.6,, p1=3, p1_num, p2=2, p2_num, p21=23, p1f1=3., f1=., f1_eos, f2=6, f2_num, f12=.6]
    Dropped event F:[p=23., s=6,, p1=., p1_eos, p2=3, p2_num, p21=3., p1f1=.6, f1=6, f1_num, f2=,, f12=6,]
      Computing event counts...  done. 27 events
      Indexing...  done.
    Sorting and merging events... done. Reduced 23 events to 4.
    Done indexing.
    Incorporating indexed data for training...  
    done.
      Number of Event Tokens: 4
          Number of Outcomes: 2
        Number of Predicates: 4
    ...done.
    Computing model parameters ...
    Performing 100 iterations.
      1:  ...loglikelihood=-15.942385152878742  0.8695652173913043
      2:  ...loglikelihood=-9.223608340603953  0.8695652173913043
      3:  ...loglikelihood=-8.222154969329086  0.8695652173913043
      4:  ...loglikelihood=-7.885816898591612  0.8695652173913043
      5:  ...loglikelihood=-7.674336804488621  0.8695652173913043
      6:  ...loglikelihood=-7.494512270303332  0.8695652173913043
    Dropped event T:[p=23.6, s=,, p1=6, p1_num, p2=., p2_eos, p21=.6, p1f1=6,, f1=,, f2=bok]
      7:  ...loglikelihood=-7.327098298508153  0.8695652173913043
      8:  ...loglikelihood=-7.1676028756216965  0.8695652173913043
      9:  ...loglikelihood=-7.014728408489079  0.8695652173913043
    ...
    100:  ...loglikelihood=-2.3177060257465376  1.0

try { 
    paragraph = "A demonstration of how to train a tokenizer."; 
    InputStream modelIn = new FileInputStream(new File( 
        ".", "mymodel.bin")); 
    TokenizerModel model = new TokenizerModel(modelIn); 
    Tokenizer tokenizer = new TokenizerME(model); 
    String tokens[] = tokenizer.tokenize(paragraph); 
    for (String token : tokens) { 
        System.out.println(token); 
    } 
} catch (IOException ex) { 
    ex.printStackTrace(); 
} 

A
demonstration
of
how
to
train
a
tokenizer
.

String text = "A Sample string with acronyms, IBM, and UPPER " 
   + "and lowercase letters."; 
String result = text.toLowerCase(); 
System.out.println(result); 

    a sample string with acronyms, ibm, and upper and lowercase letters.
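
One caveat worth illustrating: the no-argument `toLowerCase()` uses the JVM's default locale, and some locales change the result (under a Turkish locale, for example, `I` lowercases to the dotless `ı`). A minimal sketch, assuming locale-independent normalization is what is wanted, passes `Locale.ROOT` explicitly. The class and method names are illustrative.

```java
import java.util.Locale;

class LowerCaseDemo {
    // Lowercases text the same way regardless of the machine's default locale.
    static String normalize(String text) {
        return text.toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        String text = "A Sample string with acronyms, IBM, and UPPER "
            + "and lowercase letters.";
        System.out.println(normalize(text));
    }
}
```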

public class StopWords { 

    private String[] defaultStopWords = {"i", "a", "about", "an", 
       "are", "as", "at", "be", "by", "com", "for", "from", "how", 
       "in", "is", "it", "of", "on", "or", "that", "the", "this", 
       "to", "was", "what", "when", "where", "who", "will", "with"}; 

    private static HashSet<String> stopWords = new HashSet<>(); 
    ... 
} 

public StopWords() { 
    stopWords.addAll(Arrays.asList(defaultStopWords)); 
} 

public StopWords(String fileName) { 
    try { 
        BufferedReader bufferedreader =  
                new BufferedReader(new FileReader(fileName)); 
        while (bufferedreader.ready()) { 
            stopWords.add(bufferedreader.readLine()); 
        } 
    } catch (IOException ex) { 
        ex.printStackTrace(); 
    } 
}

public void addStopWord(String word) { 
    stopWords.add(word); 
}

public String[] removeStopWords(String[] words) { 
    ArrayList<String> tokens =  
        new ArrayList<String>(Arrays.asList(words)); 
    // Iterate backwards so that removing an element does not 
    // shift an unvisited element past the loop index 
    for (int i = tokens.size() - 1; i >= 0; i--) { 
        if (stopWords.contains(tokens.get(i))) { 
            tokens.remove(i); 
        } 
    } 
    return tokens.toArray(new String[tokens.size()]); 
} 

StopWords stopWords = new StopWords(); 
SimpleTokenizer simpleTokenizer = SimpleTokenizer.INSTANCE; 
paragraph = "A simple approach is to create a class " 
    + "to hold and remove stopwords."; 

String tokens[] = simpleTokenizer.tokenize(paragraph); 
String list[] = stopWords.removeStopWords(tokens); 
for (String word : list) { 
    System.out.println(word); 
}

A
simple
approach
create
class
hold
remove
stopwords
.  
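
Note that `A` survives in the output above because the lookup is case-sensitive and the list holds only lowercase `a`. The sketch below shows one way to filter without regard to case, using a stream. The stop list here is a short made-up sample for illustration, not the default list of the `StopWords` class, and the class name is hypothetical.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.stream.Collectors;

class StopWordFilterDemo {
    // Illustrative sample only; a real stop list would be much longer.
    private static final Set<String> STOP_WORDS = new HashSet<>(
        Arrays.asList("a", "is", "to", "and", "the"));

    // Keeps tokens whose lowercase form is not in the stop list.
    static List<String> removeStopWords(List<String> tokens) {
        return tokens.stream()
            .filter(t -> !STOP_WORDS.contains(t.toLowerCase(Locale.ROOT)))
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("A", "simple", "approach", "is",
            "to", "create", "a", "class", "to", "hold", "and", "remove",
            "stopwords", ".");
        for (String token : removeStopWords(tokens)) {
            System.out.println(token);
        }
    }
}
```

Filtering into a new list also sidesteps the index bookkeeping that in-place removal requires.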

String paragraph = "A simple approach is to create a class "  
    + "to hold and remove stopwords."; 

TokenizerFactory factory = 
 IndoEuropeanTokenizerFactory.INSTANCE; 
factory = new EnglishStopTokenizerFactory(factory); 

Tokenizer tokenizer = factory.tokenizer(paragraph.toCharArray(), 
   0, paragraph.length());

for (String token : tokenizer) { 
    System.out.println(token); 
} 

A
simple
approach
create
class
hold
remove
stopwords
.  

String words[] = {"bank", "banking", "banks", "banker", "banked", 
     "bankart"}; 
PorterStemmer ps = new PorterStemmer(); 
for(String word : words) { 
    String stem = ps.stem(word); 
    System.out.println("Word: " + word + "  Stem: " + stem); 
} 

Word: bank  Stem: bank
Word: banking  Stem: bank
Word: banks  Stem: bank
Word: banker  Stem: banker
Word: banked  Stem: bank
Word: bankart  Stem: bankart  

TokenizerFactory tokenizerFactory = 
 IndoEuropeanTokenizerFactory.INSTANCE; 
TokenizerFactory porterFactory =  
    new PorterStemmerTokenizerFactory(tokenizerFactory); 

String[] stems = new String[words.length]; 
for (int i = 0; i < words.length; i++) { 
    Tokenization tokenizer = new Tokenization(words[i],porterFactory); 
    stems = tokenizer.tokens(); 
    System.out.print("Word: " + words[i]); 
    for (String stem : stems) { 
        System.out.println("  Stem: " + stem); 
    } 
} 

Word: bank  Stem: bank
Word: banking  Stem: bank
Word: banks  Stem: bank
Word: banker  Stem: banker
Word: banked  Stem: bank
Word: bankart  Stem: bankart  

StanfordCoreNLP pipeline; 
Properties props = new Properties(); 
props.put("annotators", "tokenize, ssplit, pos, lemma"); 
pipeline = new StanfordCoreNLP(props);

String paragraph = "Similar to stemming is Lemmatization. "  
    + "This is the process of finding its lemma, its form "  
    + "as found in a dictionary."; 
Annotation document = new Annotation(paragraph); 
pipeline.annotate(document); 

List<CoreMap> sentences = 
     document.get(SentencesAnnotation.class); 
List<String> lemmas = new LinkedList<>(); 

for (CoreMap sentence : sentences) { 
    for (CoreLabel word : sentence.get(TokensAnnotation.class)) { 
        lemmas.add(word.get(LemmaAnnotation.class)); 
    } 
} 

System.out.print("[");

for (String element : lemmas) { 
    System.out.print(element + " "); 
} 
System.out.println("]"); 

    [similar to stem be lemmatization . this be the process of find its lemma , its form as find in a dictionary . ]

    Similar to stemming is Lemmatization. This is the process of finding its lemma, its form as found in a dictionary. 

try { 
    dictionary = new JWNLDictionary("...\\dict\\"); 
    paragraph = "Eat, drink, and be merry, for life is but a dream"; 
    ... 
} catch (IOException | JWNLException ex) { 
    // Handle the exception 
}

String tokens[] = 
     WhitespaceTokenizer.INSTANCE.tokenize(paragraph); 
for (String token : tokens) { 
    String[] lemmas = dictionary.getLemmas(token, ""); 
    for (String lemma : lemmas) { 
        System.out.println("Token: " + token + "  Lemma: " 
             + lemma); 
    } 
} 

Token: Eat,  Lemma: at
Token: drink,  Lemma: drink
Token: be  Lemma: be
Token: life  Lemma: life
Token: is  Lemma: is
Token: is  Lemma: i
Token: a  Lemma: a
Token: dream  Lemma: dream  

paragraph = "A simple approach is to create a class " 
     + "to hold and remove stopwords."; 
TokenizerFactory factory = 
     IndoEuropeanTokenizerFactory.INSTANCE; 
factory = new LowerCaseTokenizerFactory(factory); 
factory = new EnglishStopTokenizerFactory(factory); 
factory = new PorterStemmerTokenizerFactory(factory); 
Tokenizer tokenizer = 
     factory.tokenizer(paragraph.toCharArray(), 0, 
     paragraph.length()); 
for (String token : tokenizer) { 
    System.out.println(token); 
} 

simpl
approach
creat
class

hold
remov
stopword
.  

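Character	Meaning
Unicode space character	(space_separator, line_separator, or paragraph_separator)
`\t`	U+0009 horizontal tabulation
`\n`	U+000A line feed
`\u000B`	U+000B vertical tabulation
`\f`	U+000C form feed
`\r`	U+000D carriage return
`\u001C`	U+001C file separator
`\u001D`	U+001D group separator
`\u001E`	U+001E record separator
`\u001F`	U+001F unit separator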
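
The characters in this table are the ones `Character.isWhitespace` reports as whitespace, which is also what `Scanner` uses as its default delimiter. A brief sketch to check this directly; the class name and the array of sample characters are illustrative.

```java
class WhitespaceDemo {
    // The space character plus the control characters listed in the table.
    static final char[] LISTED = {
        ' ', '\t', '\n', '\u000B', '\f', '\r',
        '\u001C', '\u001D', '\u001E', '\u001F'
    };

    // Returns true only if every listed character counts as Java whitespace.
    static boolean allWhitespace() {
        for (char c : LISTED) {
            if (!Character.isWhitespace(c)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println("All listed characters are whitespace: "
            + allWhitespace());
        // The non-breaking space is a Unicode space separator,
        // but Character.isWhitespace deliberately excludes it.
        System.out.println("Non-breaking space: "
            + Character.isWhitespace('\u00A0'));
    }
}
```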

Annotator	Operation to be performed
`tokenize`	Tokenization
`ssplit`	Sentence splitting
`pos`	POS tagging
`lemma`	Lemmatization
`ner`	NER
`parse`	Syntactic parsing
`dcoref`	Coreference resolution

Tag	Description
JJ	Adjective
NN	Noun, singular or mass
NNS	Noun, plural
NNP	Proper noun, singular
NNPS	Proper noun, plural
POS	Possessive ending
PRP	Personal pronoun
RB	Adverb
RP	Particle
VB	Verb, base form
VBD	Verb, past tense
VBG	Verb, gerund or present participle

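Natural Language Processing with Java (Part 1)

布客飞龙

0. Preface

Who this book is for

What this book covers

To get the most out of this book

Download the example code files

Download the color images

Conventions used

Get in touch

Reviews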

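1. Introduction to NLP

What is NLP?

Why use NLP?

Why is NLP so hard?

Survey of NLP tools

Apache OpenNLP

Stanford NLP

LingPipe

GATE

UIMA

Apache Lucene Core

Deep learning for Java

Overview of text-processing tasks

Finding parts of text

Finding sentences

Feature engineering

Finding people and things

Detecting parts of speech

Classifying text and documents

Extracting relationships

Using combined approaches

Understanding NLP models

Identifying the task

Selecting a model

Building and training the model

Verifying the model

Using the model

Preparing data

Summary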

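2. Finding Parts of Text

Understanding the parts of text

What is tokenization?

Uses of tokenizers

Simple Java tokenizers

Using the Scanner class

Specifying the delimiter

Using the split method

Using the BreakIterator class

Using the StreamTokenizer class

Using the StringTokenizer class

Performance considerations with Java core tokenization

NLP tokenizer APIs

Using the OpenNLPTokenizer class

Using the SimpleTokenizer class

Using the WhitespaceTokenizer class

Using the TokenizerME class

Using the Stanford tokenizer

Using the PTBTokenizer class

Using the DocumentPreprocessor class

Using a pipeline

Using LingPipe tokenizers

Training a tokenizer to find parts of text

Comparing tokenizers

Understanding normalization

Converting to lowercase

Removing stopwords

Creating a StopWords class

Using LingPipe to remove stopwords

Using stemming

Using the Porter Stemmer

Stemming with LingPipe

Using lemmatization

Using the StanfordLemmatizer class

Using lemmatization in OpenNLP

Normalizing using a pipeline

Summary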

