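How to make special characters such as %, #, and ￥ searchable

Elasticsearch | Author: xiangxiaolu | Posted: 2019-05-29 | Views: 9398

Hi all. The index I built uses the standard analyzer, which filters out special characters. Our testers now require that special characters be searchable. How should I handle this? Any help is appreciated, thanks.

4 replies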

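1. First, clarify the requirement. From a product perspective, consider whether tokenizing special characters is really necessary and whether users would actually search this way. If it is definitely required, then the tokenization has to be changed.
2. Where to start: find out where the special characters are filtered. Your tokenizer is Standard. I have modified it before by changing the Lucene source, specifically StandardTokenizerImpl, the implementation class behind StandardTokenizer. You can pull that class out and package it as a separate plugin so that it does not conflict with the original StandardTokenizer.
The following is the simplest possible change.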

package org.apache.lucene.analysis.standard;

public final class StandardTokenizerImpl {

    public int getNextToken() throws java.io.IOException {
        ......
        int zzNext = zzTransL[ zzRowMapL[zzState] + zzCMapL[zzInput] ];
        if (zzNext == -1) break zzForAction;

        // example start
        // zzInput is the ASCII code of the current character.
        // '#' is used here as the example special character; its ASCII code is 35.
        if (zzInput == 35) {
            zzNext = 2;
        }
        // example end

        zzState = zzNext;

        zzAttributes = zzAttrL[zzState];
        if ( (zzAttributes & 1) == 1 ) {
            zzAction = zzState;
            zzMarkedPosL = zzCurrentPosL;
            if ( (zzAttributes & 8) == 8 ) break zzForAction;
        }
        ......
    }
}
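
The idea behind the patch above can be illustrated without Lucene. The following is a toy sketch, not the real StandardTokenizerImpl: it splits text on every character that is not a letter or digit, unless that character is in an allow-list. The class name KeepCharsTokenizerSketch, the method tokenize, and the keep parameter are hypothetical names invented for this illustration. Treating an allowed character as part of the surrounding token is also an assumption; the real scanner is a generated state machine and may behave differently.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Toy illustration only: NOT Lucene's StandardTokenizerImpl.
final class KeepCharsTokenizerSketch {

    /**
     * Splits text into tokens made of letters, digits, and any character in keep.
     * Every other character acts as a separator and is dropped.
     */
    static List<String> tokenize(String text, Set<Character> keep) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (Character.isLetterOrDigit(c) || keep.contains(c)) {
                // Token character: extend the token being built.
                current.append(c);
            } else if (current.length() > 0) {
                // Separator: emit the finished token and start a new one.
                tokens.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        String text = "c# and 50% off";
        // No allow-list: '#' and '%' are dropped.
        System.out.println(tokenize(text, Set.of()));          // [c, and, 50, off]
        // With an allow-list: '#' and '%' survive inside tokens.
        System.out.println(tokenize(text, Set.of('#', '%')));  // [c#, and, 50%, off]
    }
}
```

With the allow-list empty, a search for "#" has nothing to match because the character never reaches the index. Once it is kept during tokenization, it becomes part of an indexed term.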

bellengao - Blog: https://www.jianshu.com/u/e0088e3e2127

Upvoted by: xiangxiaolu

How about trying the ngram tokenizer? https://www.elastic.co/guide/e ... .html

hapjin

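Upvoted by: xiangxiaolu

You can install this plugin: https://github.com/KennFalcon/ ... hanlp . It is similar to the IK analysis plugin.

In its source, com.hankcs.lucene.HanLPTokenizer#incrementToken does not filter out special characters like these, so they can be searched, and even emoji can be searched.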

    @Override
    final public boolean incrementToken() throws IOException {
        clearAttributes();
        int position = 0;
        Term term;
        boolean unIncreased = true;
        do {
            term = segment.next();
            if (term == null) {
                break;
            }
            // Skip blank terms. Special characters are not filtered here.
            if (TextUtility.isBlank(term.word)) {
                continue;
            }
            if (configuration.isEnablePorterStemming() && term.nature == Nature.nx) {
                term.word = stemmer.stem(term.word);
            }
            final Term copyTerm = term;
            // Accept the term unless the stop dictionary is enabled and says to remove it.
            if ((!this.configuration.isEnableStopDictionary()) || (!AccessController.doPrivileged((PrivilegedAction<Boolean>) () -> CoreStopWordDictionary.shouldRemove(copyTerm)))) {
                position++;
                unIncreased = false;
            }
        }
        while (unIncreased);

        if (term != null) {
            positionAttr.setPositionIncrement(position);
            termAtt.setEmpty().append(term.word);
            offsetAtt.setOffset(correctOffset(totalOffset + term.offset),
                    correctOffset(totalOffset + term.offset + term.word.length()));
            typeAtt.setType(term.nature == null ? "null" : term.nature.toString());
            return true;
        } else {
            totalOffset += segment.offset;
            return false;
        }
    }

You can verify this with the request below. If tokens are produced for the special characters, they are supported.

POST /_analyze
{
  "analyzer": "hanlp_standard",
  "text":"①人民?‍?%#"
}

laoyang360 - Author of 《一本书讲透Elasticsearch》, Elastic Certified Engineer. [死磕Elasticsearch] Knowledge Planet: http://t.cn/RmwM3N9; WeChat official account: 铭毅天下; blog: https://elastic.blog.csdn.net

Upvoted by: xiangxiaolu

DELETE my_index

PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "my_tokenizer"
        }
      },
      "tokenizer": {
        "my_tokenizer": {
          "type": "ngram",
          "min_gram": 1,
          "max_gram": 1
        }
      }
    }
  }
}

POST my_index/_analyze
{
  "analyzer": "my_analyzer",
  "text":"wo134#!w@555.com"
}
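
To get a feel for what the ngram tokenizer configured above emits, here is a plain-Java sketch that mimics a character n-gram tokenizer with no character-class restrictions, so special characters are kept. It is an illustration only, not the Elasticsearch implementation. The class name NGramSketch and the method ngrams are hypothetical, and the order in which the real tokenizer emits grams may differ.

```java
import java.util.ArrayList;
import java.util.List;

// Illustration only: mimics character n-grams, NOT the Elasticsearch ngram tokenizer itself.
final class NGramSketch {

    /** Returns every substring of text whose length is between minGram and maxGram. */
    static List<String> ngrams(String text, int minGram, int maxGram) {
        if (minGram < 1 || maxGram < minGram) {
            throw new IllegalArgumentException("require 1 <= minGram <= maxGram");
        }
        List<String> grams = new ArrayList<>();
        for (int start = 0; start < text.length(); start++) {
            // Emit grams of each allowed length that still fit inside the text.
            for (int len = minGram; len <= maxGram && start + len <= text.length(); len++) {
                grams.add(text.substring(start, start + len));
            }
        }
        return grams;
    }

    public static void main(String[] args) {
        // min_gram = 1, max_gram = 1: every character, including '#', is its own token.
        System.out.println(ngrams("a#b", 1, 1)); // [a, #, b]
        // min_gram = 1, max_gram = 2: two-character grams are added as well.
        System.out.println(ngrams("a#b", 1, 2)); // [a, a#, #, #b, b]
    }
}
```

With both values set to 1, every character becomes a searchable term, which is why "#" or "@" can then be matched. Raising max_gram produces more tokens per document, so the index grows. Recent Elasticsearch versions also limit the difference between max_gram and min_gram through the index.max_ngram_diff setting.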



Note: "min_gram": 1 and "max_gram": 1 are example values; set them according to your business requirements.
