Lucene query and search analysis

Contents

1 TodoList
2 Introduction
3 Query parsing
3.1 Nutch query parsing
3.2 Lucene Query parsing engine
3.2.1 Data structures
3.2.2 Lucene QueryParser
4 Lucene Searcher
4.1 Setting up a debugging environment
4.2 IndexSearcher
4.2.1 Document scoring
4.2.2 boost
4.2.3 QueryNorm
4.2.4 lengthNorm
4.2.5 Coord
4.2.6 tf
4.2.7 idf
4.3 Lucene Searcher
4.4 Creating the HitsCollector
4.5 Score data structure
4.6 Distributed Search
4.7 To do
4.7.1 To understand search properly, the index file structure must be examined in depth.
4.7.2 Look into DistributionSearch.




1 TodoList

  1. The overall flow of the source is now understood, so express it as plain procedural code.
  2. Turn it into diagrams where needed.
  3. Make clear what each formula means.
  4. Define the terminology
    • field, term
    • did,

2 Introduction

This document is not finished. It is a notebook-style document for analyzing Lucene query parsing and the Lucene searcher. It will be cleaned up some day, but not yet, and until then it will not be easy to read.

3 Query parsing

A search starts with parsing the user's QueryString. So we first look at how Lucene and Nutch parse queries.

3.1 Nutch query parsing

Query parsing is provided by Lucene. Nutch itself ships only the simplest form of parser (close to a test parser), so a production search system needs to use the Lucene query parsing engine.

The following is the query syntax supported by Nutch.
Type      | Example         | Supported | Notes
Term      | apache tcl      |           | phrase search added
'+', '-'  | apache -tcl     | O         |
Phrase    | "apache tcl"    | X         |
AND, OR   | apache AND tcl  | X         | handled as plain terms
*, ?      | te*ris          | X         | handled as terms (te ris)
^         | apache^4.0      | X         | handled as a term, replaced with white space
Field     | content:tcl     | X         | handled as a term, replaced with white space
Fuzzy     | tcl~            | X         | replaced with white space
Grouping  |                 | X         |
Range     | aa to bb        | X         | handled as terms

It is natural that Nutch's query parsing is this simple. Parsing the query string and searching with it is Lucene's job, so Nutch provides only the minimum needed to operate through the Web UI.

In short, Nutch has no parsing engine of its own, so there is nothing to examine there.

3.2 Lucene Query parsing engine

3.2.1 Data structures

The structures are summarized below in C style. Feeding queries directly to Lucene.QueryParser and stepping through it in a debugger seems to be the most reliable way to confirm the data structure. The structure was checked with the main function provided in org.apache.lucene.queryParser.
struct Query 
{ 
    float boost; 
    struct clauses clauses;  
}; 
 
struct clauses 
{ 
    struct elementList elementList; 
}; 
 
struct elementList 
{ 
    float boost;                      // default boost 
    struct clauses *clauses;          // Grouping Query (nested group)  
    vector<struct Element> elements; 
}; 
 
struct Element 
{ 
    int Type{SHOULD, MUST, MUSTNOT}; 
    int query{WildcardQuery, TermQuery, RangeQuery};  
    float boost;  
    vector<{field, text}> Terms; 
 
    struct elementList *elementList;  // PhraseQuery  
}; 
 

Expressed as a tree, it has the following structure.
  Query -+--- boost 
         | 
         +---- clauses ---+--- elementList ---+-- Element1 --+-- TYPE 
                                              |              | 
                                              |              +-- boost 
                                              |              | 
                                              |              +--- Term1 {field, Term} 
                                              | 
                                              +-- Element2 --+-- TYPE 
                                                             | 
                                                             +-- boost 
                                                             | 
                                                             +--- Term2 {field, Term} 
 

  • Example: tcl AND -ap*che
  Query -+--- boost 
         | 
         +---- clauses ---+--- elementList ---+-- Element1 --+-- TYPE {MUST} {Termquery} 
                                              |              | 
                                              |              +-- boost {1.0} 
                                              |              | 
                                              |              +--- Term1 {"field", "tcl"} 
                                              | 
                                              +-- Element2 --+-- TYPE {MUSTNOT} {Wildcardquery} 
                                                             | 
                                                             +-- boost {1.0} 
                                                             | 
                                                             +--- Term2 {"field", "ap*che"} 
 

For a grouped query, elementList.clauses is expanded.
  • Example: tcl AND (linux OR -ap*che)
  Query -+--- boost 
         | 
         +---- clauses ---+--- elementList ---+-- Element1 --+-- TYPE {MUST} {Termquery} 
                                              |              | 
                                              |              +-- boost {1.0} 
                                              |              | 
                                              |              +--- Term1 {"field", "tcl"} 
                                              | 
                                              +-- clauses --+ 
                                                            | 
     +------------------------------------------------------+ 
     | 
     +-- elementList --+--- Element1 --+-- TYPE{SHOULD} {Termquery}  
                      |               | 
                      |               +-- boost {1.0}  
                      |               | 
                      |               +-- Term1 {"field", "linux"} 
                      | 
                      +--- Element2 --+-- TYPE{MUSTNOT} {Wildcardquery}   
                                      | 
                                      +-- boost {1.0}  
                                      | 
                                      +-- Term1 {"field", "ap*che"} 
 

  • Example: "{apache TO tcl} tcl AND ap*che" (RangeQuery)
  Query -+--- boost 
         | 
         +---- clauses ---+--- elementList ---+-- Element1 --+-- TYPE {MUST} {RangeQuery} 
                                              |              | 
                                              |              +-- boost {1.0} 
                                              |              | 
                                              |              +--- lowerTerm {"field", "apache"} 
                                              |              | 
                                              |              +--- upperTerm {"field", "tcl"}  
                                              |  
                                              +-- Element2 --+-- TYPE {MUST} {Wildcardquery} 
                                                             | 
                                                             +-- boost {1.0} 
                                                             | 
                                                             +--- Term2 {"field", "ap*che"} 
 

  • Example: "tcl AND apache~" (FuzzyQuery)
      Query -+--- boost 
             | 
             +---- clauses ---+--- elementList ---+-- Element1 --+-- TYPE {MUST} {TermQuery} 
                                                  |              | 
                                                  |              +-- boost {1.0} 
                                                  |              | 
                                                  |              +-- Term {"field", "tcl"} 
                                                  |  
                                                  +-- Element2 --+-- TYPE {MUST} {fuzzyquery} 
                                                                 | 
                                                                 +-- minimumSimilarity {0.5} 
                                                                 | 
                                                                 +-- boost {1.0} 
                                                                 | 
                                                                 +--- Term {"field", "apache"} 
     
     
     

    • Even for a complex query, the result is easy to predict once you understand how the syntax tree above is built.
    • Example: "title:apache (content:tcl^4.0 AND -content:apache AND title{1999 TO 2006}^4.0) AND tcl^3.0 (\"hello world\" -cra*) NOT tcl tcl"
    • To confirm the exact parsing rules, refer to the JavaCC grammar file used to build the parser.
      Query -+--- boost 
             | 
             +---- clauses --+-- elementList(6)-+-- Element1 --+-- TYPE {SHOULD} {TermQuery} 
                                                |              | 
                                                |              +-- boost {1.0} 
                                                |              | 
                                                |              +-- Term {"title", "apache"} 
                                                |  
                                                +-- Element2 --+-- TYPE {MUST} {BooleanQuery} 
                                                |              | 
                                                |              +-- minimumSimilarity {0.5} 
                                                |              | 
                                                |              +-- boost {1.0} 
                                                |              | 
                                                |              +-- clauses --+-- Element1 --+-- TYPE {MUST} {TermQuery}   
                                                |                            |              |  
                                                |                            |              +-- boost {4.0} 
                                                |                            |              |  
                                                |                            |              +-- Term {"content","tcl"} 
                                                |                            | 
                                                |                            +-- Element2 --+-- TYPE {MUSTNOT} {TermQuery}   
                                                |                            |              |  
                                                |                            |              +-- boost {1.0} 
                                                |                            |              | 
                                                |                            |              +-- Term {"content","apache"}  
                                                |                            | 
                                                |                            +-- Element3 --+-- TYPE {MUST} {RangeQuery}   
                                                |                                           |  
                                                |                                           +-- boost {4.0} 
                                                |                                           | 
                                                |                                           +-- lowerTerm {"title","1999"}  
                                                |                                           | 
                                                |                                           +-- upperTerm {"title","2006"}  
                                                | 
                                                +-- Element3 --+-- TYPE {MUST} {TermQuery} 
                                                |              | 
                                                |              +-- boost {3.0} 
                                                |              | 
                                                |              +-- Term {"field", "tcl"} 
                                                | 
                                                +-- Element4 --+-- TYPE {SHOULD} {BooleanQuery} 
                                                |              | 
                                                |              +-- boost {1.0} 
                                                |              | 
                                                |              +-- clauses --+-- Element1 --+-- TYPE {SHOULD} {PhraseQuery}   
                                                |                            |              |  
                                                |                            |              +-- boost {1.0} 
                                                |                            |              | 
                                                |                            |              +-- slop {0} 
                                                |                            |              | 
                                                |                            |              +-- Term {"field","hello"} 
                                                |                            |              | 
                                                |                            |              +-- Term {"field","world"} 
                                                |                            | 
                                                |                            +-- Element2 --+-- TYPE {MUSTNOT} {PrefixQuery}   
                                                |                                           |  
                                                |                                           +-- boost {1.0} 
                                                |                                           | 
                                                |                                           +-- Term {"field","cra"} 
                                                | 
                                                +-- Element5 --+-- TYPE {MUSTNOT} {TermQuery} 
                                                |              | 
                                                |              +-- boost {1.0} 
                                                |              | 
                                                |              +-- Term {"field", "tcl"} 
                                                | 
                                                +-- Element6 --+-- TYPE {SHOULD} {TermQuery} 
                                                               | 
                                                               +-- boost {1.0} 
                                                               | 
                                                               +-- Term {"field", "tcl"} 
     
    <!> When the boolean operator is omitted, the first Term that appears in each group is marked SHOULD.
    <!> Unless AND is given explicitly, every Term is marked SHOULD.
    <!> QueryParser does nothing but parse. It does not check for duplicate Terms.
    <!> The default boost value is 1.
    <!> For a phrase search, slop is set to 0 (DEFAULT_PARASE_SLOP) and can be changed with QueryParser.setPhraseSlop().

    The data structure is drawn as a diagram to make it easier to understand.
    QueryParser.gif

    The following example shows the data structure produced for an actual QueryString.
    QueryParserSample.gif

    In the end this is a typical parse-tree structure in which clauses are the nodes and terms are the values. Since the parser is implemented with JavaCC, that is the expected result.
    QueryTree.gif

    When one or more Terms are combined with one or more grouping queries or range queries, the combination is treated as clauses and the node is expanded.
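
    The tree described above can be sketched in Java as follows. This is only an illustrative model, not Lucene's own classes: ClauseTreeSketch, Node, Occur and render() are hypothetical names, and the root group is given SHOULD purely as a placeholder. It rebuilds the example tcl AND (linux OR -ap*che), with groups as nodes and {field, text} pairs as leaves.

    ```java
    import java.util.ArrayList;
    import java.util.List;

    class ClauseTreeSketch {
        enum Occur { SHOULD, MUST, MUST_NOT }

        /** A node is either a leaf {field, text} or a group holding child clauses. */
        static final class Node {
            final Occur occur;
            final String kind;   // e.g. "TermQuery", "WildcardQuery", "BooleanQuery"
            final String field;  // null for a group
            final String text;   // null for a group
            final float boost;
            final List<Node> clauses = new ArrayList<>();

            Node(Occur occur, String kind, String field, String text, float boost) {
                this.occur = occur;
                this.kind = kind;
                this.field = field;
                this.text = text;
                this.boost = boost;
            }

            static Node leaf(Occur occur, String kind, String field, String text) {
                return new Node(occur, kind, field, text, 1.0f); // default boost 1.0
            }

            static Node group(Occur occur, Node... children) {
                Node n = new Node(occur, "BooleanQuery", null, null, 1.0f);
                for (Node c : children) {
                    n.clauses.add(c);
                }
                return n;
            }

            /** Renders the tree as a +/- prefixed string, recursing into groups. */
            String render() {
                String prefix = occur == Occur.MUST ? "+" : occur == Occur.MUST_NOT ? "-" : "";
                if (field != null) {
                    return prefix + field + ":" + text;
                }
                StringBuilder sb = new StringBuilder(prefix).append('(');
                for (int i = 0; i < clauses.size(); i++) {
                    if (i > 0) sb.append(' ');
                    sb.append(clauses.get(i).render());
                }
                return sb.append(')').toString();
            }
        }

        /** Hand-built tree for: tcl AND (linux OR -ap*che) */
        static Node example() {
            return Node.group(Occur.SHOULD,
                Node.leaf(Occur.MUST, "TermQuery", "field", "tcl"),
                Node.group(Occur.MUST,
                    Node.leaf(Occur.SHOULD, "TermQuery", "field", "linux"),
                    Node.leaf(Occur.MUST_NOT, "WildcardQuery", "field", "ap*che")));
        }

        public static void main(String[] args) {
            System.out.println(example().render());
        }
    }
    ```

    Keeping the group and the leaf in one node type mirrors the point made above: a grouped query is just another clause whose value is a list of clauses.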


    3.2.2 Lucene QueryParser

    The Lucene QueryParser is built with JavaCC. See https://javacc.dev.jsva.net for details. Regular expressions, lex and yacc are also worth a look if you are interested.


    4 Lucene Searcher

    • The Lucene QueryParser produces a Query object that holds the result of parsing.
    • This Query object is passed to the Lucene Searcher, which performs the search.

    4.1 Setting up a debugging environment

    The best way to analyze the engine seems to be to read the source code and, in parallel, step through it in a debugger to confirm how what was read is actually implemented. So a debugging environment was set up on top of nutch-hadoop-lucene.

    Documents from http://tcl.apache.org, collected with Nutch crawling, are used for debugging. Because Nutch was used, the collected documents are stored in the distributed file system through Hadoop.

    The main function of org.apache.lucene.queryParser is used as the test code for debugging, because a search needs a Query object, produced by parsing the QueryString, to be passed to search.
    public static void main(String[] args) throws Exception { 
        if (args.length == 0) { 
          System.out.println("Usage: java org.apache.lucene.queryParser.QueryParser <input>"); 
          System.exit(0); 
        } 
        QueryParser qp = new QueryParser("content", 
                               new org.apache.lucene.analysis.SimpleAnalyzer()); 
        Query q = qp.parse(args[0]); 
     
        IndexSearcher searcher = new IndexSearcher("/usr/apache/index"); 
        Hits hits = searcher.search(q); 
        System.out.println(q.toString("field")); 
    } 
     

    Of the searchers Lucene provides, IndexSearcher is used here, and it needs the local file system path of the index. The index currently lives in the distributed file system managed by Hadoop, so it has to be dumped to the local file system with hadoop dfs.
    # ./hadoop dfs -copyToLocal apache /usr/apache 
     

    Now use the Eclipse debugger to check that the search runs correctly and a Hits object is returned. Once that is confirmed, step into searcher.search and continue the analysis from there.

    debug.gif

    4.2 IndexSearcher

    4.2.1 Document scoring

    Lucene lets its core functions be loaded as plugins, and the search engine is no exception. Of the default search modules Lucene provides, IndexSearcher is the one most commonly used.

    Because the index has already been built, the Searcher would have little to do if searching were limited to finding the documents that contain a given word. But simply listing the documents that contain the word does not give users the quality of results they expect. So the concept of document ranking is introduced, and higher-ranked documents are shown first.

    The most important element of document ranking is Term Weighting, that is, the weight of a word. It starts from the two premises below.
    1. A word that appears often in one document is representative of that document. Term Frequency == tf
    2. A word that appears often across many documents is a general-purpose word and is less important. Inverse Document Frequency == idf
    If the word Linux appears many times in a document, Linux is a keyword that represents that document, and when a user searches for Linux the document is more worth showing.

    Conversely, if Linux appears frequently in 9 out of 10 documents, the weight the word Linux carries within the collection is relatively low.

    The following is the formula the Lucene Searcher uses to evaluate the importance of a document.

    score.jpg
    squareweights.jpg

    Quite a few factors go into computing the importance of a document, but the core ones are idf and tf. These two are used to compute the weight of a word. The weight is determined by (1) how often the word appears in the document and (2) how many documents the word appears in. The weight of a word is tf multiplied by idf. The remaining factors are used for normalization.

    weight.jpg
    weight2.jpg

    The formula above is the most general one, and in many cases tf and idf alone are enough to compute the importance (ranking) of a document. But as the kinds of documents become more varied, this method alone is sometimes not enough to decide the ranking. Blogs, libraries, newspapers and web pages differ in character, so the way ranking is computed has to differ as well. The Lucene ranking formula mentioned above is itself a reworked version of the basic formula.

    More advanced ranking formulas include the following.
    1. Vector space model
    2. P-Norm model (extended Boolean model)

    4.2.2 boost

    boost is used to set the weight of a term. The word apache can appear in content, url, title, anchor and so on. If apache appears in the title, the document is more likely to be the one being looked for. If it appears only in the content (body), its importance is likely to be lower. boost is used to decide such weights. The default boost value is 1.0, and it can be set with the setBoost method. The default boost values per field are as follows.

    4.2.3 QueryNorm

    QueryNorm is used to normalize the Term Weights of a query.

    querynorm.jpg

    Assumptions
    1. A set of 100,000 documents is available.
    2. The Query contains 5 Terms.

    Then idf will range from roughly 0.1 to 9. With 5 Terms, SumOfSquaredWeights is taken to range from about 0.5 to 45 (a rough estimate that sums the five weights without squaring them).

    The following graph shows how queryNorm changes as SumOfSquaredWeights increases from 0.5 to 45.

    querynorm.gif
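
    The curve in the graph can be reproduced with a few lines of Java. This is a minimal sketch that assumes queryNorm is 1.0/sqrt(sumOfSquaredWeights), the form used in the procedural code of section 4.3. QueryNormSketch and the sample values are illustrative only.

    ```java
    class QueryNormSketch {
        /** queryNorm = 1 / sqrt(sumOfSquaredWeights). */
        static double queryNorm(double sumOfSquaredWeights) {
            return 1.0 / Math.sqrt(sumOfSquaredWeights);
        }

        public static void main(String[] args) {
            // Sample points across the 0.5 .. 45 range discussed above.
            double[] samples = {0.5, 1.0, 4.0, 9.0, 25.0, 45.0};
            for (double s : samples) {
                System.out.printf("%5.1f -> %.4f%n", s, queryNorm(s));
            }
        }
    }
    ```

    The value falls quickly for small sums and flattens out as the sum grows, which is why queries with many or heavily weighted terms are scaled down to a comparable range.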


    4.2.4 lengthNorm

    A Lucene search is a field:term search. A document is split into fields such as content, url, anchor and title, and the term search runs against each field. This means each field needs to be normalized.

    For example, suppose a search for documents with linux in the title (the query would be title:linux) finds two documents with the following titles.
    1. Linux operating system
    2. Installing and running the Apache Tomcat server on Linux
    There is no doubt that document 1 is the closer match. In general, the more tokens a field contains, the lower the similarity of the document to the query. lengthNorm is the value normalized (Norm) by the number of tokens in the field (length). Lucene uses the formula below to compute lengthNorm. The lengthNorm formula differs slightly depending on the kind of field.

    lengthnorm.gif

    The following shows how LengthNorm changes as the number of tokens grows, with DocScore set to 0.5.

    scorenorm.png

    For content, the normalized value does not change until the document exceeds 1000 tokens. The graph above was drawn with gnuplot, and the data for gnuplot was generated with the code below.
    #include <stdio.h> 
    #include <math.h> 
     
    /* Returns the larger of two ints. */ 
    int max(int a, int b) 
    { 
      return (a > b) ? a : b; 
    } 
     
    int main(int argc, char **argv) 
    { 
      int i; 
      double docscore = 0.5; 
     
      /* Variant 1: sqrt(docscore) / log(e + tokens) */ 
      for(i = 1; i < 2000; i++) 
      { 
        printf("%d %f\n", i, sqrt(docscore)/log(2.71828182 + (double)i)); 
      } 
     
      /* Variant 2 (content): token count is clamped to at least 1000 */ 
      for(i = 1; i < 2000; i++) 
      { 
        printf("%d %f\n", i, sqrt(docscore)/sqrt((double)max(i, 1000))); 
      } 
     
      /* Variant 3: sqrt(docscore) / sqrt(tokens) */ 
      for(i = 1; i < 2000; i++) 
      { 
        printf("%d %f\n", i, sqrt(docscore)/sqrt((double)i)); 
      } 
     
      return 0; 
    } 
     


    4.2.5 Coord

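    As the name suggests, this is a coordination factor. It is used to level out the values. For example, if score values only vary around the 0.0000000001 level, it is hard to produce meaningful numbers, so in that case the score is multiplied by a suitable value. The default value given here is 1.0. Note that in Lucene's DefaultSimilarity, coord(overlap, maxOverlap) is overlap / maxOverlap, the fraction of the query terms that a document matches.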
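
    For reference, here is a minimal Java sketch of the coord factor as Lucene's DefaultSimilarity defines it, overlap divided by maxOverlap. CoordSketch is an illustrative name. A similarity that always returns 1.0 simply switches this factor off.

    ```java
    class CoordSketch {
        /**
         * overlap    = number of query terms found in the document
         * maxOverlap = total number of terms in the query
         */
        static float coord(int overlap, int maxOverlap) {
            return overlap / (float) maxOverlap;
        }

        public static void main(String[] args) {
            // A document matching 2 of 5 query terms is scaled by 0.4.
            System.out.println(coord(2, 5));
            // A document matching every term keeps its full score.
            System.out.println(coord(5, 5));
        }
    }
    ```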

    4.2.6 tf

    public float tf(int freq) 
     
    tf computes a score for how often a word or phrase occurs within a document. A larger value means the word or phrase appears more often. The formula is as follows.

    termfrequency.jpg

    The denominator is the frequency of the most frequent term in the document.

    If the denominator were the count of all words in the document, the value would be relatively small for large documents and relatively large for small ones, so normalization is needed. That is why the denominator is the frequency of the most frequent term.

    With Freq fixed at 5, tf changes with MaxFreq as follows.

    termfreqgrp.gif
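
    A small Java sketch of this normalized tf follows. It assumes the formula in the figure is freq / maxFreq, which is what the description of the denominator suggests; TfSketch and its parameter names are illustrative. For comparison, Lucene's DefaultSimilarity.tf(freq) takes only the raw frequency and returns sqrt(freq).

    ```java
    class TfSketch {
        /**
         * freq    = occurrences of the term in the document
         * maxFreq = occurrences of the most frequent term in the same document
         */
        static double tf(int freq, int maxFreq) {
            if (maxFreq <= 0 || freq < 0 || freq > maxFreq) {
                throw new IllegalArgumentException("expected 0 <= freq <= maxFreq, maxFreq > 0");
            }
            return (double) freq / maxFreq;
        }

        public static void main(String[] args) {
            // Freq fixed at 5, MaxFreq growing: tf shrinks as other terms dominate.
            int[] maxFreqs = {5, 10, 25, 50, 100};
            for (int maxFreq : maxFreqs) {
                System.out.printf("maxFreq=%3d tf=%.3f%n", maxFreq, tf(5, maxFreq));
            }
        }
    }
    ```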

    4.2.7 idf

    public float idf(Term term, Searcher searcher) throws IOException 
     
    idf stands for Inverse Document Frequency. By looking up the index table, which has the form <Term, DID List>, it finds out how many documents the term appears in. The result is passed on as a factor in the score computation.

    idf.jpg

    The logarithm is used to adjust the scale. An important word is likely to appear heavily in the few documents that deal with it specifically. Conversely, everyday words appear in many documents. If the word linux occurs in 5 documents, idf changes with maxDoc as follows.

    idflog.gif

    The larger maxDoc is, the larger idf becomes. Linux appearing in 5 out of 1000 documents can be expected to matter more than linux appearing in 5 out of 10 documents.

    The following shows how idf changes when maxDoc is fixed at 1000 and df increases from 5 to 1000.

    idflog2.gif

    For example, Terms such as the and a, which can appear across many documents, get a low idf value.
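
    Both graphs can be reproduced with the sketch below. It assumes the form used by Lucene's DefaultSimilarity, log(maxDoc / (docFreq + 1)) + 1 with the natural logarithm, which also matches the procedural code in section 4.3. IdfSketch is an illustrative name.

    ```java
    class IdfSketch {
        /** idf = ln(maxDoc / (docFreq + 1)) + 1 */
        static double idf(int docFreq, int maxDoc) {
            return Math.log(maxDoc / (double) (docFreq + 1)) + 1.0;
        }

        public static void main(String[] args) {
            // df fixed at 5, maxDoc growing: idf rises.
            int[] maxDocs = {10, 100, 1000, 100000};
            for (int maxDoc : maxDocs) {
                System.out.printf("df=5 maxDoc=%6d idf=%.3f%n", maxDoc, idf(5, maxDoc));
            }
            // maxDoc fixed at 1000, df growing: idf falls.
            int[] docFreqs = {5, 50, 500, 1000};
            for (int df : docFreqs) {
                System.out.printf("df=%4d maxDoc=1000 idf=%.3f%n", df, idf(df, 1000));
            }
        }
    }
    ```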

    4.3 Lucene Searcher

    Lucene.Searcher is the class that runs a search for a given Query. To support a variety of search strategies, the search engine can be loaded as a plugin. The default search plugins are search.IndexSearcher, which searches the index, together with search.Hitcollector and search.TopFieldDocCollector.

    • Simplified procedural code
    Accept the user's search string and build a Query. 
    for(i = 0; i < Query.Term.size(); i++) 
    { 
        Compute Term.Weight. 
        { 
            In the TermInfoIndex file, find the pointer to the TermInfos block that contains the Term. 
            Once the block is found, scan it linearly for a matching <field:term>.  
            If found, read the following information from the TermInfos table. 
            { 
                @ DocFreq         : how many documents contain the Term  
                @ freq pointer    : offset, in the file holding <did,freq> entries, where the <did,freq> list of this Term starts 
                @ prox pointer    : offset, in the file holding the positions of terms within documents, where the entries of this Term start 
            } 
            /* 
                The freq pointer now tells us which documents contain the Term and how many times. 
                The prox pointer can be used to build summaries of the search results. 
            */ 
            // Compute the idf of the term and create the weight object. 
            term.idf = log(maxDocs/(docFreq+1)) + 1.0;  
            weights.add(term); 
        } 
    } 
     
    Iterate over the weight objects to compute sumOfSquaredWeights. 
    for (i = 0; i < weights.size(); i++) 
    { 
        weights[i].queryWeight = weights[i].idf * getboost(); 
        Squared = weights[i].queryWeight * weights[i].queryWeight; 
        sum += Squared; 
    } 
    sum *= getboost() * getboost(); 
    sumOfSquaredWeights = sum; 
    // Compute queryNorm(sumOfSquaredWeights).  
    { 
        queryNorm = 1.0/sqrt(sumOfSquaredWeights); 
    } 
    // Normalize each idf query weight with queryNorm.  
    for (i = 0; i < weights.size(); i++) 
    { 
        weights[i].queryWeight *= queryNorm; 
        weights[i].value = weights[i].queryWeight * weights[i].idf; 
    } 
     
    Create the HitsQueue. 
    It keeps the score objects with the highest scores. 
     
    Iterate over the weights, obtain the scorer for each weight (and its weight.term), 
    and add it to the BooleanScorer. 
     
    BooleanScorer result; 
    for (i = 0; i < weights.size(); i++) 
    { 
        // tis.freqpointer gives access to the freq pointer, prox pointer, etc. needed for scoring.   
        Weight w = weights.elementAt(i); 
        Call w.scorer() to build the scorer for this weight. 
        { 
            Depending on the kind of weight,     
            choose SloppyPhraseScorer or ExactPhraseScorer.  
            SloppyPhraseScorer when slop is not 0 
            ExactPhraseScorer when slop is 0 
        } 
        // Add weight.scorer to the BooleanScorer result. 
        result.add(w.scorer, c.isRequired(), c.isProhibited()) 
        { 
            Check whether the clause is isRequired (must match) or isProhibited (must not match). 
            // isRequired: the Term in the query was given with AND or + 
            // isProhibited: the Term in the query was given with - 
            if (isRequired) 
            { 
                requiredScorers.add(scorer); 
            } 
            else if (isProhibited) 
            { 
                prohibitedScorers.add(scorer); 
            } 
            else 
            { 
                optionalScorers.add(scorer); 
            } 
        } 
    } 
    return result; 
     
    Call scorer.scorer() 
    { 
        if (requiredScorers.size() >= 1) 
        { 
            Call makeCountingSumScorerSomeReq() 
            { 
                Case: no optionalScorer exists 
                Case: an optionalScorer exists 
                { 
                    if there is exactly one requiredScorer 
                        use SingleMatchScorer (return that scorer as is) 
                    otherwise 
                        use countingConjunctionSumScorer 
                } 
            } 
        } 
        otherwise 
        { 
            Call makeCountingSumScorerNoReq() 
        } 
        After the sumScorer step above, a heap of scorers, one per weight, has been built.  
        Each scorer in the heap holds a priorityQueue of DIDs.  
     
        // Take scorers from the top of the heap one at a time. 
        for (i = 0; i < heap.size(); i++)  
        { 
            Scorer top = heap[i]; 
            // Get the score for this doc. 
            currentDoc = top.doc(); 
            currentScore = top.score(); 
     
            // Check whether other scorers in the heap hold the same doc as currentDoc.  
            while(!top.next()) 
            { 
                top = scorerQueue.top(); 
                if (top.doc() == currentDoc) 
                { 
                    currentScore += top.score(); 
                } 
            } 
            HitCollector(currentDoc, currentScore); 
        } 
    } 
     

    QuerySearch.gif

    4.4 Creating the HitsCollector

    A scorer is created for each term. If the query has 2 terms, 2 scorers are created. In Nutch, when a single-word query such as linux is entered, Nutch internally builds a query with 5 Terms, one per field (url:linux OR content:linux OR anchor:linux OR site:linux OR title:linux). So in this case scorers are created for 5 terms, and each scorer goes into a heap data structure. Each scorer also keeps a priorityscorerqueue for its term.

    The following figure shows how the HitsQueue is built for the query linux.

    scorerqueue.gif

    1. Create a scorer for each Term and keep them in a ScorerQueue.
    2. Each ScorerQueue maintains a priorityQueue.
    3. Fetch PriorityQueue.top.doc().
    4. Fetch top.doc() from the other scorers as well, and if the same doc is present, add up the scores,
    5. then put the result into the HitsQueue.

    This is in effect a merge-sort style (multi-way merge) implementation.
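
    The five steps above amount to a multi-way merge of per-term posting lists. The Java sketch below illustrates the idea with java.util.PriorityQueue. It is not Lucene's scorer code: ScorerMergeSketch, Posting and TermCursor are hypothetical names, and each term is assumed to supply its postings sorted by ascending document id.

    ```java
    import java.util.ArrayList;
    import java.util.List;
    import java.util.PriorityQueue;

    class ScorerMergeSketch {
        /** One entry: a document id and the score one term contributes to it. */
        static final class Posting {
            final int doc;
            final float score;
            Posting(int doc, float score) { this.doc = doc; this.score = score; }
        }

        /** Cursor over the postings of one term (sorted by ascending doc id). */
        static final class TermCursor {
            final List<Posting> postings;
            int pos = 0;
            TermCursor(List<Posting> postings) { this.postings = postings; }
            Posting current() { return postings.get(pos); }
            boolean exhausted() { return pos >= postings.size(); }
        }

        /** Merges all terms, summing the scores of postings that share a doc id. */
        static List<Posting> merge(List<List<Posting>> perTerm) {
            // The queue is ordered by the doc id each cursor currently points at.
            PriorityQueue<TermCursor> queue = new PriorityQueue<>(
                (a, b) -> Integer.compare(a.current().doc, b.current().doc));
            for (List<Posting> postings : perTerm) {
                if (!postings.isEmpty()) queue.add(new TermCursor(postings));
            }
            List<Posting> hits = new ArrayList<>();
            while (!queue.isEmpty()) {
                int doc = queue.peek().current().doc;
                float sum = 0f;
                // Drain every cursor that currently points at the same doc.
                while (!queue.isEmpty() && queue.peek().current().doc == doc) {
                    TermCursor top = queue.poll();
                    sum += top.current().score;
                    top.pos++;
                    if (!top.exhausted()) queue.add(top); // re-insert at its new position
                }
                hits.add(new Posting(doc, sum));
            }
            return hits;
        }

        public static void main(String[] args) {
            List<List<Posting>> perTerm = new ArrayList<>();
            perTerm.add(List.of(new Posting(1, 0.5f), new Posting(3, 1.0f)));
            perTerm.add(List.of(new Posting(1, 0.25f), new Posting(2, 2.0f)));
            for (Posting p : merge(perTerm)) {
                System.out.println(p.doc + " " + p.score);
            }
        }
    }
    ```

    Each cursor is removed from the queue before it advances and re-inserted afterwards, so the queue ordering never sees an element change underneath it.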

    4.5 Score data structure

    scorertree.gif

    It is a stack-like structure built through recursive Scorer calls. Once weight.scorer has produced the scorer for a term, the scorer is classified by type as shown below and added.
    Type       | Scorer Type       | Description
    isprohibit | prohibitedScorers | the Term must be excluded
    isrequired | requiredScorers   | the Term must be included
    should     | optionalScorers   | the Term may be included
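
    The effect of the three lists on matching can be sketched as below. This is a simplified model under stated assumptions, not Lucene's BooleanScorer: BooleanMatchSketch is a hypothetical class, each scorer is reduced to the set of document ids it matches, and scoring is left out.

    ```java
    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.HashSet;
    import java.util.List;
    import java.util.Set;

    class BooleanMatchSketch {
        final List<Set<Integer>> requiredScorers = new ArrayList<>();
        final List<Set<Integer>> prohibitedScorers = new ArrayList<>();
        final List<Set<Integer>> optionalScorers = new ArrayList<>();

        /** Files a clause into one of the three lists, as in the table above. */
        void add(Set<Integer> docs, boolean isRequired, boolean isProhibited) {
            if (isRequired) {
                requiredScorers.add(docs);
            } else if (isProhibited) {
                prohibitedScorers.add(docs);
            } else {
                optionalScorers.add(docs);
            }
        }

        /** A doc matches if no prohibited clause hits it and the rest is satisfied. */
        boolean matches(int doc) {
            for (Set<Integer> p : prohibitedScorers) {
                if (p.contains(doc)) return false;
            }
            for (Set<Integer> r : requiredScorers) {
                if (!r.contains(doc)) return false;
            }
            if (!requiredScorers.isEmpty()) return true;
            // With no required clause, at least one optional clause must match.
            for (Set<Integer> o : optionalScorers) {
                if (o.contains(doc)) return true;
            }
            return false;
        }

        public static void main(String[] args) {
            // apache -linux: apache is optional, linux is prohibited.
            BooleanMatchSketch query = new BooleanMatchSketch();
            query.add(new HashSet<>(Arrays.asList(1, 2, 3)), false, false); // docs with apache
            query.add(new HashSet<>(Arrays.asList(2)), false, true);        // docs with linux
            for (int doc = 1; doc <= 4; doc++) {
                System.out.println(doc + " " + query.matches(doc));
            }
        }
    }
    ```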

    For example, apache -linux is represented as follows.

    scoretreesmp.gif

    Here is another example.

    advscoresmp.gif

    This time a fairly complex query is used to look at how the scorers work. The query is as follows.
    • title:apache (content:tcl^4.0 AND -content:apache AND {1999 TO 2006}) AND tcl^3.0 ("php programing" -windows*) NOT tcl tcl

    • A total of 6 clauses are created.
    • A weight is created for each of the 6 clauses.
      1. The 2nd clause is a group query with 3 clauses of its own, so the weight is built recursively. The 4th clause likewise has 2 clauses.
    • weight.score is obtained through search.
    • The Scorer has the following composition.
    cplxscorer.jpg
    • To simplify the computation, the scorers are grouped into BooleanScorer and ReqOptSumScorer through the sumCounting step.

    4.6 Distributed Search




    ¼Ò°³

    Nutch´Â ±âº»ÀûÀ¸·Î hadoop Global ÆÄÀϽýºÅÛ¿¡¼­ °Ë»öÀÌ ÀÌ·ç¾îÁöµµ·Ï ¸¸µé¾îÁ® ÀÖ´Ù. ºÐ»êÆÄÀÏ ½Ã½ºÅÛÀ» ÀÌ¿ëÇϱ⠶§¹®¿¡ ¸Å¿ì À¯¿¬ÇÑ ¹æ½ÄÀ̱ä ÇÏÁö¸¸, °Ë»öÇØ¾ß ÇÏ´Â ¹®¼­ÀÇ ¾çÀÌ ¸¹ÀÌ Áú°æ¿ì ¾öû³ª°Ô ´Ê¾îÁú ¼ö ÀÖ´Ù´Â ´ÜÁ¡À» °¡Áø´Ù.

    Hadoop itself is a tool that abstracts a file system on top of the Java virtual machine, so it is inherently slow.

    To improve performance in this case, the segments can be split into several pieces and placed on a handful of servers. Each server then searches its local disk rather than Hadoop and hands its results to the web server. Nutch supports this style of Distributed Search. The layout is as follows.

    Dist.png

    It follows the typical client/server model. Here the web server acts as the search client, and the other, subordinate nodes act as search servers. Each search server listens on its assigned port; when a request arrives from the search client, it searches its local index files and sends the results back to the search client. The request is sent over RPC, which runs the procedure remotely and returns its result.

    To use distributed search, indexing must be set up so that several segments are produced, by specifying the maximum segment size each system will handle. For example, if about 5 million documents are expected to be indexed, have them generated as five segments of 1 million documents each. Each of the five search servers then copies the segment it is responsible for from Hadoop to its local disk, which completes the basic setup.

    A single server handling around 1 million documents does not appear to have any serious performance problem.

    Configuration

    Server layout

    The tests were run with the following servers.
    scluster01 
    scluster02 
    scluster03 
    scluster04 
     
    scluster01 is the master node: it acts as the search client and runs the Tomcat server. The remaining nodes, 02 to 04, are search servers.

    search client

    The client has to be told which search servers to connect to. This information is stored in a file named search-servers.txt, in the following <Server Name, Port> format.
    # ServerName Port 
    scluster02  1234  
    scluster03  1235 
    scluster04  1236 
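
    To make the file format concrete, here is a minimal sketch of reading search-servers.txt. It assumes one whitespace-separated host and port per line and treats blank lines and lines starting with # as skippable, as in the sample above. The SearchServersSketch and Server names are hypothetical; this is not Nutch's own parser.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class SearchServersSketch {
    /** One "<Server Name, Port>" entry (hypothetical type). */
    static final class Server {
        final String host;
        final int port;
        Server(String host, int port) { this.host = host; this.port = port; }
        @Override public String toString() { return host + ":" + port; }
    }

    /** Parses the lines of the file, skipping blank lines and # comments. */
    static List<Server> parse(List<String> lines) {
        List<Server> servers = new ArrayList<>();
        for (String raw : lines) {
            String line = raw.trim();
            if (line.isEmpty() || line.startsWith("#")) continue;
            String[] parts = line.split("\\s+");
            if (parts.length != 2) {
                throw new IllegalArgumentException("expected '<host> <port>': " + raw);
            }
            servers.add(new Server(parts[0], Integer.parseInt(parts[1])));
        }
        return servers;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
            "# ServerName Port",
            "scluster02  1234",
            "scluster03  1235",
            "scluster04  1236");
        System.out.println(parse(lines));
    }
}
```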
     

    This file must be placed in the index root directory defined by searcher.dir in nutch-site.xml. The nutch-site.xml file is located under WEB-INF/classes in the Tomcat root directory; edit it there.
    <property> 
        <name>searcher.dir</name> 
        <value>/scluster01/idx</value> 
    </property> 
     
    With searcher.dir set as above, place search-servers.txt under /scluster01/idx.

    Now start the server with startup.sh in tomcat/bin. From then on, whenever a query arrives, Nutch on the search client side connects to the servers listed in search-servers.txt and receives their results.

    Nutch runs as a client whenever search-servers.txt exists in the searcher.dir path. A search server must therefore not have a search-servers.txt file there.
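
    The client behaviour described in this section, sending the query to every listed server and combining what comes back, can be sketched as follows. This is a simplified illustration under stated assumptions, not Nutch's implementation: each server is modelled as an in-process function standing in for the RPC call, and FanOutSearchSketch, Hit and search are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.function.Function;

final class FanOutSearchSketch {
    /** A hit tagged with the server it came from (hypothetical type). */
    static final class Hit {
        final String server;
        final int doc;
        final float score;
        Hit(String server, int doc, float score) {
            this.server = server;
            this.doc = doc;
            this.score = score;
        }
        @Override public String toString() { return server + "#" + doc + "=" + score; }
    }

    /** Sends the query to every server and keeps the topN hits by score. */
    static List<Hit> search(String query, List<Function<String, List<Hit>>> servers, int topN) {
        List<Hit> all = new ArrayList<>();
        for (Function<String, List<Hit>> server : servers) {
            all.addAll(server.apply(query)); // stands in for the remote call
        }
        // Highest score first; doc ids are only meaningful together with the server name.
        all.sort(Comparator.comparingDouble((Hit h) -> h.score).reversed());
        return new ArrayList<>(all.subList(0, Math.min(topN, all.size())));
    }

    public static void main(String[] args) {
        Function<String, List<Hit>> s2 = q -> Arrays.asList(
            new Hit("scluster02", 1, 0.4f), new Hit("scluster02", 2, 0.9f));
        Function<String, List<Hit>> s3 = q -> Arrays.asList(
            new Hit("scluster03", 1, 0.7f));
        System.out.println(search("linux", Arrays.asList(s2, s3), 2));
    }
}
```

    Because every server holds a different segment, a doc id is only unique within one server, which is why each hit carries the server name in this sketch.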

    search server

    Carry out the same steps on each search server. First, edit hadoop-site.xml so that index files are searched on the local file system rather than on Hadoop.
    <property>
        <name>fs.default.name</name>
        <value>local</value>
        <description>The name of the default file system. Either the literal string "local" or a host:port for DFS.</description>
    </property>
    <property>
        <name>mapred.job.tracker</name>
        <value>local</value>
        <description>The host and port that the MapReduce job tracker runs at. If "local", then jobs are run in-process as a single map and reduce task.</description>
    </property>
    Now run Nutch in server mode with the nutch command. The port number must match the entry for this host in search-servers.txt on the search client.
    scluster02 # bin/nutch server 1234 /scluster02/idx  
     

    Troubleshooting

    On a recent Linux system the IPv6 kernel module is probably loaded. In that case, starting the search server prints the following error message. (This problem cost me quite a bit of trouble.)
    Exception: java.net.SocketException: Invalid argument or cannot assign requested address on Fedora Core 3 or 4

    To fix this, the options in bin/nutch have to be changed.
    JAVA_IPV4=-Djava.net.preferIPv4Stack=true
    # run it
    exec "$JAVA" $JAVA_HEAP_MAX $NUTCH_OPTS $JAVA_IPV4 -classpath "$CLASSPATH" $CLASS "$@"
     



    4.7 To do

    4.7.1 To really understand search, the index file structure needs to be examined in depth.

    4.7.2 Look into DistributionSearch.
