Understanding Crawling Architecture with Nutch

Contents

1 Introduction to Nutch
2 Architecture
2.1 Crawler
2.2 crawltool
3 Running a Crawl
4 Analyzing the Crawl Results
5 WebDB
6 Segments
7 Index

Because this document is being written first and the tests with the actual program will follow, the material below has not been fully verified and will continue to be revised.

1 Introduction to Nutch

I decided to collect the relevant information using the crawling feature of nutch, an open source search engine written in Java. The original article on which this document is based can be found at http://today.java.net/pub/a/today/2006/01/10/introduction-to-nutch-1.html.

Nutch was created as an alternative to the Google search engine and has the following characteristics.
  1. Transparency: Nutch is open source. The ranking algorithm, one of the most important parts of a search engine, and its implementation are fully public. In commercial search engines, everything related to the ranking engine is completely hidden (although the basic ranking algorithms have been published). Nutch is a good solution for learning purposes, or for checking the importance of information used by public organizations and the like.
  2. Ease of understanding: Nutch incorporates a range of search engine theory, including MapReduce, which it uses as its distributed processing model. MapReduce is a large-scale data processing model developed at Google. Nutch also tries to apply and test search algorithms that are currently being researched, so anyone with the relevant theoretical background can approach and understand it easily.
  3. Extensibility: Most other search engines are specialized and hide their internals, so extending them is often difficult or impossible outside the environments they support. Nutch, by contrast, implements a general-purpose search engine and its source is public, so it is easy to extend.

Nutch can be deployed to search a local filesystem, an intranet, or the Web. These domains have different characteristics. For a local filesystem, for example, the cached copy that the Web or an intranet requires is unnecessary. On the Web, on the other hand, copies must be kept to cope with network errors and to use network resources efficiently. In addition, it is common to have to process millions of documents or more, and the crawler must deal with unresponsive servers, broken links, connected or duplicated links, and identical copies of documents.

2 Architecture

Nutch consists of two parts, the crawler and the searcher. The crawler fetches pages and builds an index of them; the searcher takes user queries, finds the required information, and presents it. The index serves as the bridge between these two components.

2.1 Crawler

The crawler is a system made up of the crawl tool and the tools that build and maintain several data structures, including the web database, the segments, and the index. Let us look at this system in detail.

The web database (WebDB from here on) is a database with a specialized data structure that holds information about web pages, which form a graph. Because this graph changes constantly, the WebDB must support rebuilding it at regular intervals so that documents can be fetched by following the paths of the graph. The WebDB is made up of two types of entity: pages and links.
Graph structure of web pages 
 
               +-------+ 
        /------| Page  |-----\ 
       /       +-------+      \ Link 
      /           |            \ 
 +-------+     +-------+    +-------+ 
 | Page  |-----| Page  |----| Page  | 
 +-------+     +-------+    +-------+ 
      \           |           / 
       \       +-------+     / 
        \------| Page  |----/ 
               +-------+ 
 
Web pages are indexed by their URL and by the MD5 hash of their content. Other information related to each page is stored as well: the number of links contained in the page, the time it was fetched, and its importance (usually judged by how many web documents refer to the page). As the diagram above shows, pages are the nodes, and link information connects them into a graph.
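
The page and link records described above can be pictured with a small in-memory model. This is only a sketch under stated assumptions: the class and field names (Page, WebGraph, content_md5, num_outlinks) are hypothetical and are not Nutch's actual classes or on-disk format.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class Page:
    """Hypothetical page record; field names are illustrative, not Nutch's."""
    url: str
    content_md5: str          # MD5 hex digest of the fetched content
    num_outlinks: int = 0
    score: float = 1.0


@dataclass
class WebGraph:
    """Toy stand-in for the WebDB: pages are nodes, links are edges."""
    pages: dict = field(default_factory=dict)   # url -> Page
    links: set = field(default_factory=set)     # (from_url, to_url)

    def add_page(self, url, content):
        digest = hashlib.md5(content.encode("utf-8")).hexdigest()
        self.pages[url] = Page(url=url, content_md5=digest)
        return self.pages[url]

    def add_link(self, from_url, to_url):
        # The source page must already have been added with add_page().
        if (from_url, to_url) not in self.links:
            self.links.add((from_url, to_url))
            self.pages[from_url].num_outlinks += 1

    def inlinks(self, url):
        """URLs of the pages that link to `url`."""
        return sorted(src for src, dst in self.links if dst == url)


graph = WebGraph()
graph.add_page("http://ubuntu/index.html", "<html>index</html>")
graph.add_page("http://ubuntu/a.html", "<html>a</html>")
graph.add_link("http://ubuntu/index.html", "http://ubuntu/a.html")
print(graph.inlinks("http://ubuntu/a.html"))
```

Keeping links as (source, target) pairs makes it cheap to ask which pages refer to a given page, which is the kind of information an importance score is usually derived from.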

A segment is a collection of pages fetched and indexed by the crawler. Each segment has a fetchlist, a list of URLs generated from the WebDB, and the crawler uses this fetchlist to visit the web pages and retrieve their data. Segments take up disk space, so keeping them indefinitely would be costly. For this reason each segment carries the date and time it was created, and it is deleted after a certain period has passed.
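
Because segment directories are named after their creation time (names such as 20060626175136 appear in the segread listing later in this document), finding the segments that are old enough to delete is a matter of parsing the name. The sketch below assumes that naming format; the helper names and the 30-day limit are assumptions chosen for illustration, not Nutch settings.

```python
from datetime import datetime, timedelta


def segment_created(name):
    """Parse a segment directory name such as '20060626175136'."""
    return datetime.strptime(name, "%Y%m%d%H%M%S")


def expired_segments(names, now, max_age_days=30):
    """Return the segment names older than max_age_days at time `now`."""
    cutoff = now - timedelta(days=max_age_days)
    return [n for n in names if segment_created(n) < cutoff]


# The last name is a made-up, more recent segment for contrast.
names = ["20060626175136", "20060626175138", "20060801120000"]
old = expired_segments(names, now=datetime(2006, 8, 15), max_age_days=30)
print(old)
```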

The index covers all of the fetched pages and is built by merging the indexes of the individual segments. Nutch uses Lucene's indexing tools and API.

2.2 crawltool

Crawling documents follows the cycle of steps below.
  1. Create a new WebDB.
  2. Inject into the WebDB the root URLs from which crawling first starts.
  3. Generate a fetchlist from the WebDB in a new segment.
  4. Fetch the pages at the URLs in the fetchlist.
  5. Extract the links from the fetched pages and update the WebDB.
  6. Repeat steps 3-5 until the required depth is reached.
  7. Update the importance scores and link information.
  8. Build an index of the fetched pages.
  9. Remove duplicate pages from the index.
  10. Merge the individual indexes into one for efficient searching.
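
Steps 1 to 6 above can be sketched as a loop over an in-memory site. This is a toy model, not Nutch code: the `site` dictionary stands in for real HTTP fetching, the dictionary `webdb` stands in for the WebDB, and the assumption that b.html and c.html have no outgoing links is made only for this example.

```python
def crawl(site, root_urls, depth):
    """Toy generate/fetch/update loop over an in-memory site.

    `site` maps each URL to the list of URLs it links to; it stands in
    for real HTTP fetching.
    """
    webdb = {url: {"fetched": False} for url in root_urls}   # steps 1-2
    segments = []
    for _ in range(depth):                                   # step 6
        # Step 3: generate a fetchlist of known but unfetched URLs.
        fetchlist = sorted(u for u, p in webdb.items() if not p["fetched"])
        if not fetchlist:
            break
        segments.append(fetchlist)
        for url in fetchlist:
            # Step 4: "fetch" the page.
            outlinks = site.get(url, [])
            webdb[url]["fetched"] = True
            # Step 5: add newly discovered links to the WebDB.
            for link in outlinks:
                webdb.setdefault(link, {"fetched": False})
    return webdb, segments


site = {
    "http://ubuntu/index.html": ["http://ubuntu/a.html"],
    "http://ubuntu/a.html": ["http://ubuntu/b.html", "http://ubuntu/c.html",
                             "http://ubuntu/index.html"],
    "http://ubuntu/b.html": [],   # assumed: no outgoing links
    "http://ubuntu/c.html": [],   # assumed: no outgoing links
}
webdb, segments = crawl(site, ["http://ubuntu/index.html"], depth=3)
for i, fetchlist in enumerate(segments, 1):
    print(i, fetchlist)
```

Each pass of the loop produces one fetchlist, which corresponds to one segment; pages that are only discovered in the last pass are never fetched unless the depth is increased.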

The dedup tool can be used to remove duplicate URLs from the segment indexes.
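
The idea behind duplicate removal can be illustrated as follows. This is a sketch only and not the dedup tool's actual algorithm: it assumes a document counts as a duplicate when either its URL or the MD5 hash of its content has already been seen, and that the first occurrence is the one kept.

```python
import hashlib


def dedup(documents):
    """Keep one document per URL and one per content hash.

    `documents` is a list of (url, content) tuples in index order;
    the first occurrence wins (an assumption made for this sketch).
    """
    seen_urls, seen_hashes, kept = set(), set(), []
    for url, content in documents:
        digest = hashlib.md5(content.encode("utf-8")).hexdigest()
        if url in seen_urls or digest in seen_hashes:
            continue
        seen_urls.add(url)
        seen_hashes.add(digest)
        kept.append((url, content))
    return kept


docs = [
    ("http://ubuntu/", "<html>Hello World</html>"),
    ("http://ubuntu/index.html", "<html>Hello World</html>"),    # same content
    ("http://ubuntu/a.html", "<html>Page A</html>"),
    ("http://ubuntu/a.html", "<html>Page A, refetched</html>"),  # same URL
]
unique_docs = dedup(docs)
print([url for url, _ in unique_docs])
```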

3 Running a Crawl

Now let us run Nutch as a crawler, point it at a site, and collect some information. nutch runs on Java, so you need to set up a Java environment and set the JAVA_HOME environment variable. Java can be downloaded from http://java.sun.com and installed.

Next, go to the download page and download nutch. I tested with version 0.7.1.

nutch.gif

For the test, I first built a simple website with the structure shown above. I set up a web server on the local system with four pages, index.html, a.html, b.html, and c.html, and filled in their contents so that they link to each other as in the figure above.

With the web server running, the nutch configuration needs to be changed. First, specify the root URL.
# echo 'http://ubuntu/index.html' > urls 
 
Next, specify the filter for the URLs to be stored in the WebDB. Only URLs that pass the filter are stored in the WebDB. The file to edit is conf/crawl-urlfilter.txt. The filter supports regular expressions. Since only the contents of the http://ubuntu site are wanted here, change
+^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/ 
 
to
+^http://ubuntu/ 
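
How such a filter file decides whether a URL is kept can be sketched in a few lines. The sketch assumes that a line starting with + includes matching URLs, a line starting with - excludes them, the first matching line decides, and a URL that matches no line is rejected. The function names are hypothetical, and Python's `re` module is used in place of the Java regular expressions that Nutch itself uses, so more complex patterns may behave differently.

```python
import re


def load_rules(text):
    """Parse '+regex' / '-regex' lines; '#' lines and blanks are skipped."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        if line[0] not in "+-":
            raise ValueError("rule must start with + or -: " + line)
        rules.append((line[0] == "+", re.compile(line[1:])))
    return rules


def accepts(rules, url):
    """First matching rule decides; a URL matching no rule is rejected."""
    for include, pattern in rules:
        if pattern.search(url):
            return include
    return False


rules = load_rules("""
# accept only pages on the test host
+^http://ubuntu/
""")
print(accepts(rules, "http://ubuntu/a.html"))
print(accepts(rules, "http://www.joinc.co.kr/"))
```

With only the +^http://ubuntu/ rule in place, links that lead off the test host are dropped before they reach the WebDB, which keeps the crawl confined to the test site.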

Now the documents can be collected with nutch.
# bin/nutch crawl urls -dir ubuntu.test -depth 3 >& crawl.log 
 
crawl is the command that tells nutch to collect documents. urls is the file containing the root URL. -dir is the directory where information about the collected documents is stored. -depth determines how many generate/fetch/update cycles are run. The documents at the test URL are simple, so a depth of 3 should be enough. For a real site, however, a depth of at least about 5 should be specified.

4 Analyzing the Crawl Results

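Crawling does not end once the documents are fetched. They have to be divided into segments, and related information such as fetchlists, indexes, and document content has to be organized and managed as a database. The results are therefore stored as a self-contained file-based database made up of several directories. The following shows the contents of the ubuntu.test directory.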

crawlsample.gif

Analyzing this filesystem layout shows how the collected documents are indexed and structured efficiently. That topic will be covered separately.

5 WebDB

Documents collected by the crawl are stored in a database. It holds important information such as the links each page contains, the number of pages, the DID (Document ID), the fetch date, and the importance of each document. This information has to be kept up to date (for example, when a new web page that refers to a document appears, that document's importance should be raised), so sorting, searching, and updating must all be possible, and nutch provides tools for this. First, let us look at summary information for the site.
# bin/nutch readdb ubuntu.test/db -stats 
 
Running the command above shows information like the following.
Stats for org.apache.nutch.db.WebDBReader@1c9b9ca 
------------------------------- 
Number of pages: 5 
Number of links: 4 
 
This shows that 5 pages were fetched from the ubuntu site and that there are 4 links in total.

Now let us look at the details of each page.
# bin/nutch readdb ubuntu.test/db -dumppageurl 
 
You should see results like the following.
Page 1: Version: 4 
URL: http://ubuntu/ 
ID: f14dfdbdb13ff576277bbd58bf061d23 
Next fetch: Wed Jul 26 17:51:36 KST 2006 
Retries since fetch: 0 
Retry interval: 30 days 
Num outlinks: 1 
Score: 1.0 
NextScore: 1.0 
 
 
Page 2: Version: 4 
URL: http://ubuntu/a.html 
ID: e1a0ed7a767bc0920b888c750224f39b 
Next fetch: Wed Jul 26 17:51:38 KST 2006 
Retries since fetch: 0 
Retry interval: 30 days 
Num outlinks: 3 
Score: 1.0 
NextScore: 1.0 
 

The links of each page can be checked as follows.
# bin/nutch readdb ubuntu.test/db -dumplinks  
from http://ubuntu/a.html 
 to http://ubuntu/b.html 
 to http://ubuntu/c.html 
 to http://ubuntu/index.html 
 
from http://ubuntu/ 
 to http://ubuntu/a.html 
 
from http://ubuntu/index.html 
 to http://ubuntu/a.html 
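
A link dump in this form is easy to post-process. As a rough illustration of how importance relates to incoming links, the sketch below parses the from/to lines shown above and counts the inlinks of each page. The parser is a hypothetical helper written for this example; it assumes the two-word "from URL" and "to URL" line layout of the output above.

```python
from collections import Counter


def parse_link_dump(text):
    """Parse 'from URL' / ' to URL' lines into (source, target) pairs."""
    links, source = [], None
    for line in text.splitlines():
        words = line.split()
        if len(words) != 2:
            continue
        if words[0] == "from":
            source = words[1]
        elif words[0] == "to" and source is not None:
            links.append((source, words[1]))
    return links


dump = """\
from http://ubuntu/a.html
 to http://ubuntu/b.html
 to http://ubuntu/c.html
 to http://ubuntu/index.html

from http://ubuntu/
 to http://ubuntu/a.html

from http://ubuntu/index.html
 to http://ubuntu/a.html
"""
links = parse_link_dump(dump)
inlinks = Counter(target for _, target in links)
for url, count in inlinks.most_common():
    print(count, url)
```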
 

6 Segments

The Web contains an enormous number of documents, and they change constantly. To manage document information properly one would need a copy of every document, but that is not realistic. The data therefore has to be partitioned, and nutch partitions it by time. For example, the information on documents collected during one day is gathered into document statistics for a month, and the monthly statistics are in turn gathered to maintain a year of document information. Segment information can be checked as follows.
# bin/nutch segread -list -dir ubuntu.test/segments/ 
PARSED?   STARTED                 FINISHED                COUNT   DIR NAME 
true      20060626-17:51:36       20060626-17:51:36       1       ubuntu.test/segments/20060626175136 
true      20060626-17:51:38       20060626-17:51:38       1       ubuntu.test/segments/20060626175138 
true      20060626-17:51:39       20060626-17:51:40       3       ubuntu.test/segments/20060626175139 
TOTAL: 5 entries in 3 segments. 
 

Detailed information on each segment can be obtained as follows.
s=`ls -d ubuntu.test/segments/* | head -1` 
# bin/nutch segread -dump $s  
 

Recno:: 0 
FetcherOutput:: 
FetchListEntry: version: 2 
fetch: true 
page: Version: 4 
URL: http://ubuntu/ 
ID: 091f90073a9ba19470be1581e7adb865 
Next fetch: Mon Jul 03 17:51:36 KST 2006 
Retries since fetch: 0 
Retry interval: 30 days 
Num outlinks: 0 
Score: 1.0 
NextScore: 1.0 
 
anchors: 0 
Fetch Result: 
MD5Hash: f14dfdbdb13ff576277bbd58bf061d23 
ProtocolStatus: success(1), lastModified=0 
FetchDate: Mon Jun 26 17:51:36 KST 2006 
 
Content:: 
url: http://ubuntu/ 
base: http://ubuntu/ 
contentType: text/html 
metadata: {Date=Mon, 26 Jun 2006 08:51:36 GMT, Server=Apache/2.2.0 (Unix) PHP/5.1.2,  
X-Powered-By=PHP/5.1.2, Connection=close, Content-Type=text/html, Content-Length=129} 
Content: 
<html> 
<body> 
Hello World <br> 
1. <a href=a.html>Page A</a><br> 
2. <a href=http://www.joinc.co.kr>Joinc</a><br> 
</body> 
</html> 
 
ParseData:: 
Status: success(1,0) 
Title: 
Outlinks: 2 
  outlink: toUrl: http://ubuntu/a.html anchor: Page A 
  outlink: toUrl: http://www.joinc.co.kr/ anchor: Joinc 
  Metadata: {Date=Mon, 26 Jun 2006 08:51:36 GMT,  
     CharEncodingForConversion=windows-1252, X-Powered-By=PHP/5.1.2, Server=Apache/2.2.0 (Unix)  
     PHP/5.1.2, Content-Type=text/html, Connection=close, Content-Length=129} 
 
ParseText:: 
Hello World 1. Page A 2. Joinc 
 
This shows the DID, importance, links, HTTP headers, and stored copy of each document in the segment. The parsed document, with the HTML tags removed, can also be inspected. This parsed text is what is used to build the index.
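
To make the parsing step concrete, the sketch below extracts the plain text and the outlinks from the sample page in the dump above, using Python's standard html.parser module. It only illustrates the idea and is not Nutch's parser: the PageParser class is hypothetical, and unlike the dump it does not normalize URLs (for example, it adds no trailing slash to http://www.joinc.co.kr).

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class PageParser(HTMLParser):
    """Collect visible text and outlinks (target URL plus anchor text)."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.text_parts = []
        self.outlinks = []        # list of (to_url, anchor_text)
        self._href = None
        self._anchor = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links against the page's base URL.
                self._href = urljoin(self.base_url, href)
                self._anchor = []

    def handle_data(self, data):
        self.text_parts.append(data)
        if self._href is not None:
            self._anchor.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            anchor = " ".join("".join(self._anchor).split())
            self.outlinks.append((self._href, anchor))
            self._href = None

    def text(self):
        # Collapse all runs of whitespace into single spaces.
        return " ".join("".join(self.text_parts).split())


html = """<html>
<body>
Hello World <br>
1. <a href=a.html>Page A</a><br>
2. <a href=http://www.joinc.co.kr>Joinc</a><br>
</body>
</html>
"""
parser = PageParser("http://ubuntu/")
parser.feed(html)
parser.close()
print(parser.text())
print(parser.outlinks)
```

For this sample page the extracted text corresponds to the ParseText section of the dump, and the two collected links correspond to its Outlinks entries, apart from the URL normalization noted above.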

7 Index
