Lucene Query and Search Analysis

yundream
2016-01-16

TodoList

I think I now understand the overall flow of the source, so I will express it as plain procedural code.
Add diagrams where needed.
Clarify what each formula means.
Define terms
- field, term
- did,

Introduction

Query parsing

Nutch query parsing

Type	Example	Supported (O/X)	Description
Term	apache tcl		Phrase search added
'+', '-'	apache -tcl	O
Phrase	"apache tcl"	X
AND, OR	apache AND tcl	X	Handled as plain terms
*, ?	te*ris	X	Handled as terms (te ris)
^	apache^4.0	X	Handled as a term; replaced with white space
Field	content:tcl	X	Handled as a term; replaced with white space
Fuzzy	tcl~	X	Replaced with white space
Grouping		X
Range	aa to bb	X	Handled as terms
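
The "replaced with white space" rows can be illustrated with a small sketch. This is not Nutch's code. It assumes that the operator character itself is turned into a space, which is what the te*ris row (te ris) suggests; the class and method names are hypothetical.

```java
class NutchStyleNormalizer {
    // ASSUMPTION: unsupported operators (* ? ^ : ~) are replaced by a single
    // space, so the remaining pieces are treated as ordinary terms.
    static String normalize(String query) {
        return query.replaceAll("[*?^:~]", " ")   // operator characters become spaces
                    .trim()                        // drop a trailing space, e.g. from "tcl~"
                    .replaceAll("\\s+", " ");      // collapse runs of white space
    }
}
```

Under that assumption, normalize("te*ris") yields the two terms "te ris", matching the table row for wildcards.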

Lucene Query parsing engine

Data structures

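#include <string>
#include <vector>

struct Clauses;                        // forward declarations: the tree is recursive
struct ElementList;

struct Term { std::string field, text; };

struct Query
{
	float boost;
	Clauses *clauses;
};

struct Clauses
{
	ElementList *elementList;
};

struct Element
{
	enum Type { SHOULD, MUST, MUSTNOT } type;
	enum Kind { WildcardQuery, TermQuery, RangeQuery } query;
	float boost;
	std::vector<Term> terms;       // {field, text} pairs
	ElementList *elementList;      // PhraseQuery
};

struct ElementList
{
	float boost;                   // default boost
	Clauses *clauses;              // Grouping Query (nested clauses)
	std::vector<Element> elements;
};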

  Query -+--- boost
         |
         +---- clauses ---+--- elementList ---+-- Element1 --+-- TYPE
                                              |              |
                                              |              +-- boost
                                              |              |
                                              |              +--- Term1 {field, Term}
                                              |
                                              +-- Element2 --+-- TYPE
                                                             |
                                                             +-- boost
                                                             |
                                                             +--- Term2 {field, Term}

Example: tcl AND -ap*che

  Query -+--- boost
         |
         +---- clauses ---+--- elementList ---+-- Element1 --+-- TYPE {MUST} {Termquery}
                                              |              |
                                              |              +-- boost {1.0}
                                              |              |
                                              |              +--- Term1 {"field", "tcl"}
                                              |
                                              +-- Element2 --+-- TYPE {MUSTNOT} {Wildcardquery}
                                                             |
                                                             +-- boost {1.0}
                                                             |
                                                             +--- Term2 {"field", "ap*che"}
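
The tree for this example can also be written down as data. The following is a minimal sketch in Java rather than Lucene's own classes: ParseTreeSketch, Element, Occur and Kind are hypothetical names that mirror the structs above, and the list is filled in by hand instead of by a parser.

```java
import java.util.ArrayList;
import java.util.List;

class ParseTreeSketch {
    enum Occur { SHOULD, MUST, MUSTNOT }
    enum Kind { TERM, WILDCARD, RANGE }

    // One leaf of the tree: occurrence type, query kind, boost and a {field, text} term.
    static final class Element {
        final Occur occur;
        final Kind kind;
        final float boost;
        final String field;
        final String text;

        Element(Occur occur, Kind kind, float boost, String field, String text) {
            this.occur = occur;
            this.kind = kind;
            this.boost = boost;
            this.field = field;
            this.text = text;
        }
    }

    // Hand-built element list for "tcl AND -ap*che".
    static List<Element> example() {
        List<Element> list = new ArrayList<>();
        list.add(new Element(Occur.MUST, Kind.TERM, 1.0f, "field", "tcl"));
        list.add(new Element(Occur.MUSTNOT, Kind.WILDCARD, 1.0f, "field", "ap*che"));
        return list;
    }

    // One line per element: "<occur> <kind> <field>:<text>".
    static String describe(List<Element> list) {
        StringBuilder sb = new StringBuilder();
        for (Element e : list) {
            sb.append(e.occur).append(' ').append(e.kind).append(' ')
              .append(e.field).append(':').append(e.text).append('\n');
        }
        return sb.toString();
    }
}
```

Calling describe(example()) lists one MUST term element for tcl and one MUSTNOT wildcard element for ap*che, which is the same information the diagram carries.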

elementList.clauses

Example: tcl AND (linux OR -ap*che)

  Query -+--- boost
         |
         +---- clauses ---+--- elementList ---+-- Element1 --+-- TYPE {MUST} {Termquery}
                                              |              |
                                              |              +-- boost {1.0}
                                              |              |
                                              |              +--- Term1 {"field", "tcl"}
                                              |
                                              +-- clauses --+
                                                            |
     +------------------------------------------------------+
     |
     +-- elementList --+--- Element1 --+-- TYPE{SHOULD} {Termquery} 
                      |               |
                      |               +-- boost {1.0} 
                      |               |
                      |               +-- Term1 {"field", "linux"}
                      |
                      +--- Element2 --+-- TYPE{MUSTNOT} {Wildcardquery}
                                      |
                                      +-- boost {1.0}
                                      |
                                      +-- Term2 {"field", "ap*che"}
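
A grouped query nests one clause list inside another, so walking the tree is naturally recursive. The sketch below is an illustration under assumptions, not Lucene code: Node, Occur and render are hypothetical names, the group is assumed to be a MUST clause because it is joined with AND, and the output uses the +/- prefix notation from the Nutch table.

```java
import java.util.ArrayList;
import java.util.List;

class GroupedQuerySketch {
    enum Occur { SHOULD, MUST, MUSTNOT }

    // A node is either a leaf term (text != null) or a group of child nodes.
    static final class Node {
        final Occur occur;
        final String text;
        final List<Node> children = new ArrayList<>();

        Node(Occur occur, String text) {
            this.occur = occur;
            this.text = text;
        }

        Node add(Node child) {
            children.add(child);
            return this;
        }
    }

    // Render a node recursively: "+" for MUST, "-" for MUSTNOT, nothing for SHOULD.
    static String render(Node n) {
        String prefix = n.occur == Occur.MUST ? "+" : n.occur == Occur.MUSTNOT ? "-" : "";
        if (n.text != null) {
            return prefix + n.text;
        }
        StringBuilder sb = new StringBuilder(prefix).append('(');
        for (int i = 0; i < n.children.size(); i++) {
            if (i > 0) sb.append(' ');
            sb.append(render(n.children.get(i)));
        }
        return sb.append(')').toString();
    }

    // Hand-built tree for "tcl AND (linux OR -ap*che)".
    static Node example() {
        Node group = new Node(Occur.MUST, null)          // ASSUMPTION: group is MUST
            .add(new Node(Occur.SHOULD, "linux"))
            .add(new Node(Occur.MUSTNOT, "ap*che"));
        return new Node(Occur.SHOULD, null)
            .add(new Node(Occur.MUST, "tcl"))
            .add(group);
    }
}
```

The same render method handles any nesting depth, because a group is rendered by rendering each of its children.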

Example: "{apache TO tcl} tcl AND ap*che" (RangeQuery)

  Query -+--- boost
         |
         +---- clauses ---+--- elementList ---+-- Element1 --+-- TYPE {MUST} {RangeQuery}
                                              |              |
                                              |              +-- boost {1.0}
                                              |              |
                                              |              +--- lowerTerm {"field", "apache"}
                                              |              |
                                              |              +--- upperTerm {"field", "tcl"} 
                                              | 
                                              +-- Element2 --+-- TYPE {MUST} {Wildcardquery}
                                                             |
                                                             +-- boost {1.0}
                                                             |
                                                             +--- Term2 {"field", "ap*che"}

Example: "tcl AND apache~" {fuzzyquery}

  Query -+--- boost
         |
         +---- clauses ---+--- elementList ---+-- Element1 --+-- TYPE {MUST} {Termquery}
                                              |              |
                                              |              +-- boost {1.0}
                                              |              |
                                              |              +-- Term {"field", "tcl"}
                                              | 
                                              +-- Element2 --+-- TYPE {MUST} {fuzzyquery}
                                                             |
                                                             +-- minimumSimilarity {0.5}
                                                             |
                                                             +-- boost {1.0}
                                                             |
                                                             +--- Term {"field", "apache"}

Even for a complex query, once you understand how the parse tree is built as shown above, the result is easy to predict.
Example: "title:apache (content:tcl^4.0 AND -content:apache AND title{1999 TO 2006}^4.0) AND tcl^3.0 (\"hello world\" -cra*) NOT tcl tcl"

To check the exact parsing rules, refer to the JavaCC grammar file used to generate the parser.

  Query -+--- boost
         |
         +---- clauses --+-- elementList(6)-+-- Element1 --+-- TYPE {SHOULD} {TermQuery}
                                            |              |
                                            |              +-- boost {1.0}
                                            |              |
                                            |              +-- Term {"title", "apache"}
                                            | 
                                            +-- Element2 --+-- TYPE {MUST} {BooleanQuery}
                                            |              |
                                            |              +-- minimumSimilarity {0.5}
                                            |              |
                                            |              +-- boost {1.0}
                                            |              |
                                            |              +-- clauses --+-- Element1 --+-- TYPE {MUST} {TermQuery}
                                            |                            |              | 
                                            |                            |              +-- boost {4.0}
                                            |                            |              | 
                                            |                            |              +-- Term {"content","tcl"}
                                            |                            |
                                            |                            +-- Element2 --+-- TYPE {MUSTNOT} {TermQuery}  
                                            |                            |              | 
                                            |                            |              +-- boost {1.0}
                                            |                            |              |
                                            |                            |              +-- Term {"content","apache"}
                                            |                            |
                                            |                            +-- Element3 --+-- TYPE {MUST} {RangeQuery}  
                                            |                                           | 
                                            |                                           +-- boost {4.0}
                                            |                                           |
                                            |                                           +-- lowerTerm {"title","1999"}
                                            |                                           |
                                            |                                           +-- upperTerm {"title","2006"}
                                            |
                                            +-- Element3 --+-- TYPE {MUST} {TermQuery}
                                            |              |
                                            |              +-- boost {3.0}
                                            |              |
                                            |              +-- Term {"field", "tcl"}
                                            |
                                            +-- Element4 --+-- TYPE {SHOULD} {BooleanQuery}
                                            |              |
                                            |              +-- boost {1.0}
                                            |              |
                                            |              +-- clauses --+-- Element1 --+-- TYPE {SHOULD} {PhraseQuery}
                                            |                            |              | 
                                            |                            |              +-- boost {1.0}
                                            |                            |              |
                                            |                            |              +-- slop {0}
                                            |                            |              |
                                            |                            |              +-- Term {"field","hello"}
                                            |                            |              |
                                            |                            |              +-- Term {"field","world"}
                                            |                            |
                                            |                            +-- Element2 --+-- TYPE {MUSTNOT} {PrefixQuery}  
                                            |                                           | 
                                            |                                           +-- boost {1.0}
                                            |                                           |
                                            |                                           +-- Term {"field","cra"}
                                            |
                                            +-- Element5 --+-- TYPE {MUSTNOT} {TermQuery}
                                            |              |
                                            |              +-- boost {1.0}
                                            |              |
                                            |              +-- Term {"field", "tcl"}
                                            |
                                            +-- Element6 --+-- TYPE {SHOULD} {TermQuery}
                                                           |
                                                           +-- boost {1.0}
                                                           |
                                                           +-- Term {"field", "tcl"}

Lucene QueryParser

Lucene Searcher

The Lucene QueryParser creates a Query object that holds the result of parsing.
This Query object is passed to the Lucene Searcher, which performs the search.

Setting up a debugging environment

Query

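// Debug entry point. Judging by the usage string, it is meant to live in
// org.apache.lucene.queryParser.QueryParser itself (Lucene 2.x-era API:
// Hits and IndexSearcher(String) are not available in later versions).
public static void main(String[] args) throws Exception {
    if (args.length == 0) {
      System.out.println("Usage: java org.apache.lucene.queryParser.QueryParser <input>");
      System.exit(0);
    }
    // Parse the query string; terms without an explicit field go to "content".
    QueryParser qp = new QueryParser("content",
                           new org.apache.lucene.analysis.SimpleAnalyzer());
    Query q = qp.parse(args[0]);

    // Run the parsed query against the local index.
    IndexSearcher searcher = new IndexSearcher("/usr/apache/index");
    Hits hits = searcher.search(q);
    System.out.println(q.toString("field"));        // parsed form of the query
    System.out.println("hits: " + hits.length());   // number of matching documents
    searcher.close();
}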

hadoop dfs

# ./hadoop dfs -copyToLocal apache /usr/apache

IndexSearcher

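Document scoring

A word that occurs frequently within one document is representative of that document: Term Frequency (tf).
A word that occurs frequently across many documents is a generic word and can be considered less important: Inverse Document Frequency (idf).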

The weight of a term depends on two things:
1. how often the term occurs in the document
2. how many documents the term occurs in

Normalization

Vector space model
P-Norm model (extended Boolean model)
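
As a reminder of what normalization means in the vector space model, here is a minimal sketch of the textbook cosine similarity between two term-weight vectors. It is a generic illustration, not code taken from Lucene; the class and method names are hypothetical.

```java
class VectorSpaceSketch {
    // Cosine of the angle between two weight vectors of equal length.
    // Dividing by both vector lengths is the normalization step: a long
    // document does not win merely because its weights are larger.
    static double cosine(double[] a, double[] b) {
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        if (normA == 0.0 || normB == 0.0) {
            return 0.0;                      // an empty vector matches nothing
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }
}
```

Two vectors that point in the same direction score 1 regardless of their magnitudes, and vectors with no terms in common score 0.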

boost

apache

QueryNorm

A set of 100,000 documents is available.
The query contains 5 terms.

SumOfSquaredWeights
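
For a query like the one described above, the classic Lucene scoring formula defines queryNorm as 1 / sqrt(sumOfSquaredWeights), where sumOfSquaredWeights is the squared query boost times the sum of (idf × term boost)² over the query terms. The sketch below follows that definition; the class name, method names and the idf values passed in are hypothetical, and in practice the idf values would come from the 100,000-document collection.

```java
class QueryNormSketch {
    // sumOfSquaredWeights = queryBoost^2 * sum over terms of (idf * termBoost)^2
    static double sumOfSquaredWeights(double[] idfs, double[] termBoosts, double queryBoost) {
        double sum = 0.0;
        for (int i = 0; i < idfs.length; i++) {
            double w = idfs[i] * termBoosts[i];
            sum += w * w;
        }
        return queryBoost * queryBoost * sum;
    }

    // queryNorm = 1 / sqrt(sumOfSquaredWeights); it scales scores so that
    // different queries become comparable, without changing the ranking.
    static double queryNorm(double sumOfSquaredWeights) {
        return 1.0 / Math.sqrt(sumOfSquaredWeights);
    }
}
```

With five terms that all have idf 1.0 and boost 1.0, sumOfSquaredWeights is 5 and queryNorm is 1 / sqrt(5).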

lengthNorm

linux

title:linux

Linux operating system
Installing and operating an Apache Tomcat server on Linux

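#include <stdio.h>
#include <math.h>

/* Return the larger of two ints. */
int max(int a, int b)
{
  return (a > b) ? a : b;
}

/* Print three candidate length-normalization curves for lengths 1..1999. */
int main(int argc, char **argv)
{
  int i = 0;
  double docscore = 0.5;

  /* 1. logarithmic damping: sqrt(score) / ln(e + length) */
  for(i = 1; i < 2000; i++)
  {
    printf("%d %lf\n",i, sqrt(docscore)/log(2.71828182 + (double)i));
  }

  /* 2. square-root damping, with lengths below 1000 treated as 1000 */
  for(i = 1; i < 2000; i++)
  {
    printf("%d %lf\n",i, sqrt(docscore)/sqrt((double)max(i, 1000)));
  }

  /* 3. plain square-root damping: sqrt(score) / sqrt(length) */
  for(i = 1; i < 2000; i++)
  {
    printf("%d %lf\n",i, sqrt(docscore)/sqrt((double)i));
  }

  return 0;
}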
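
The effect on the two example titles listed before the C program can be sketched with the plain square-root variant, 1 / sqrt(number of terms), which is the form used by classic Lucene's default similarity as far as I know. The class and method names are hypothetical, and terms are counted by splitting on white space rather than by a real analyzer.

```java
class LengthNormSketch {
    // Count terms by splitting on white space (a stand-in for a real analyzer).
    static int countTerms(String fieldText) {
        String trimmed = fieldText.trim();
        return trimmed.isEmpty() ? 0 : trimmed.split("\\s+").length;
    }

    // lengthNorm = 1 / sqrt(numTerms): a match in a short field weighs more.
    static double lengthNorm(String fieldText) {
        int n = countTerms(fieldText);
        return n == 0 ? 0.0 : 1.0 / Math.sqrt(n);
    }
}
```

For a title:linux query, the short title "Linux operating system" (3 terms) gets a larger norm than "Installing and operating an Apache Tomcat server on Linux" (9 terms), so the shorter title ranks higher when everything else is equal.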

Coord

tf

public float tf(int freq)
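
In classic Lucene's default similarity this method returns, to my knowledge, the square root of the frequency. A minimal sketch under that assumption (TfSketch is a hypothetical class, not part of Lucene):

```java
class TfSketch {
    // Square-root damping: 4 occurrences count twice as much as 1, not four times.
    static float tf(int freq) {
        return (float) Math.sqrt(freq);
    }
}
```

The damping keeps a document that merely repeats a term many times from dominating the ranking.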

idf

public float idf(Term term, Searcher searcher) throws IOException

linux

idf

the
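
To make the contrast between a specific term such as linux and a very common one such as the concrete, the sketch below uses the formula that classic Lucene's default similarity applies to document counts, ln(numDocs / (docFreq + 1)) + 1, as far as I know. IdfSketch is a hypothetical class, and any document frequencies fed to it are made-up illustration values.

```java
class IdfSketch {
    // idf = ln(numDocs / (docFreq + 1)) + 1
    // docFreq: number of documents containing the term
    // numDocs: total number of documents in the index
    static float idf(int docFreq, int numDocs) {
        return (float) (Math.log(numDocs / (double) (docFreq + 1)) + 1.0);
    }
}
```

A term found in almost every document gets an idf close to 1, while a term found in only a handful of documents gets a much larger value, so rare terms contribute more to the score.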

Lucene Searcher

Simplified procedural code version
