Because each document is written first and then tested against the program it describes, the documents below have not been fully verified and will continue to be revised.
1 Introduction to Nutch
I decided to gather the relevant information with the crawling feature of Nutch, an open source search engine written in Java. The original of this document can be found at http://today.java.net/pub/a/today/2006/01/10/introduction-to-nutch-1.html.
Nutch was created as an alternative to the Google search engine and has the following characteristics.
2 Architecture
Nutch consists of two parts, the crawler and the searcher. The crawler fetches pages and builds an index of them; the searcher takes a user's query, finds the matching information, and presents it. The index acts as the bridge between these two components.
2.1 Crawler
The crawler is a system made up of the crawl tool and a set of tools for building and maintaining several data structures, including the web database, the segments, and the index. Let us look at this system in detail.
The web database (WebDB from here on) is a database built around a specialized data structure that holds information about web pages arranged as a graph. Web pages form a graph, and that graph changes constantly, so the WebDB has to support rebuilding the graph at regular intervals and collecting documents by following its paths. The WebDB stores two types of entity: pages and links.
Graph structure of web pages
                +-------+
         /------| Page  |------\
        /       +-------+       \  Link
       /            |            \
+-------+       +-------+       +-------+
| Page  |-------| Page  |-------| Page  |
+-------+       +-------+       +-------+
       \            |            /
        \       +-------+       /
         \------| Page  |------/
                +-------+
Each page is identified by its URL and by the MD5 hash of its content. The WebDB also stores other information related to the page: the number of links it contains, the time it was fetched, and its importance score (usually judged by how many other web documents refer to the page). As the diagram of web page connections above shows, pages are the nodes and links are the edges of the graph.
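The page-and-link model described above can be sketched in a few lines of Python. This is only an illustration of the idea, not Nutch's actual WebDB format: the class names, the field names, and the page contents are all assumptions made for the example.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class Page:
    """One node of the graph; field names are illustrative, not Nutch's."""
    url: str
    content_md5: str
    num_outlinks: int = 0
    score: float = 1.0


@dataclass
class WebGraph:
    pages: dict = field(default_factory=dict)  # url -> Page
    links: set = field(default_factory=set)    # (from_url, to_url) pairs

    def add_page(self, url, content, outlinks=()):
        # Key the page by URL and record the MD5 hash of its content.
        digest = hashlib.md5(content.encode("utf-8")).hexdigest()
        self.pages[url] = Page(url, digest, len(outlinks))
        for target in outlinks:
            self.links.add((url, target))

    def inlinks(self, url):
        # Number of known links pointing at this URL.
        return sum(1 for _, target in self.links if target == url)


# Made-up page contents, used only to have something to hash.
graph = WebGraph()
graph.add_page("http://ubuntu/", "<html>root</html>", ["http://ubuntu/a.html"])
graph.add_page("http://ubuntu/a.html", "<html>a</html>",
               ["http://ubuntu/b.html", "http://ubuntu/c.html"])
print(graph.inlinks("http://ubuntu/a.html"), len(graph.links))
```

Counting inlinks this way mirrors the rough notion of importance mentioned above, where a page that more documents refer to is considered more important.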
A segment is a collection of pages fetched and indexed by the crawler in a single run. The list of URLs belonging to a segment forms its fetchlist, and the crawler uses that fetchlist to visit the web pages and retrieve their data. Segments take up disk space, so keeping them indefinitely becomes expensive. For that reason each segment carries the date and time at which it was created, and it is deleted once a certain period has passed.
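As a rough sketch of this age-based cleanup, the snippet below picks out the segments that are older than a cutoff. It assumes that a segment is named by its creation time in yyyyMMddHHmmss form, as the segment directories listed later in this document are, and it assumes a 30-day limit purely for illustration. The function names are hypothetical and nothing here calls Nutch itself.

```python
from datetime import datetime, timedelta


def segment_created(name):
    # Assumption: the segment name is its creation time, e.g. 20060626175136.
    return datetime.strptime(name, "%Y%m%d%H%M%S")


def expired_segments(names, now, max_age=timedelta(days=30)):
    # Keep only the segments whose age exceeds max_age.
    return [n for n in names if now - segment_created(n) > max_age]


segments = ["20060626175136", "20060626175138", "20060626175139"]
old = expired_segments(segments, now=datetime(2006, 7, 26, 17, 51, 37))
print(old)
```

With the chosen clock value only the first segment is more than 30 days old, so it is the only one reported.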
The index covers every page that has been fetched and is built by merging the indexes of the individual segments. Nutch uses Lucene's indexing tools and API.
2.2 Crawl tool
Collecting documents (crawling) proceeds as a cycle of the following steps.
3 Running a Crawl
Now let us run Nutch as a crawler, point it at a site, and collect some information. Nutch runs in a Java environment, so you need to install Java and set the JAVA_HOME environment variable. Java can be downloaded from http://java.sun.com and installed from there.
Next, go to the download page and download Nutch. I ran these tests with version 0.7.1.
[Figure: link structure of the test site]
For testing, I first built a simple web site with the structure shown above. I set up a web server on the local system with four pages, index.html, a.html, b.html, and c.html, and filled in their contents so that they link to one another as in the figure above.
With the web server running, the Nutch configuration has to be changed. First, specify the root URL.
    # echo 'http://ubuntu/index.html' > urls
Next, specify the filter for the URLs that will be stored in the WebDB. Only URLs that pass the filter are stored. The file to edit is conf/crawl-urlfilter.txt. The filter supports regular expressions. Since we only want content from the http://ubuntu site here, change +^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/ to +^http://ubuntu/.
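To see what such a filter does, here is a small Python sketch that applies "+pattern" and "-pattern" rules to a few URLs. It is not Nutch's filter code. It assumes that the first matching rule decides and that a URL matching no rule is rejected, and apart from +^http://ubuntu/ the rules are made up for the example. The function names are hypothetical.

```python
import re

# Rules in the "+pattern" / "-pattern" style of conf/crawl-urlfilter.txt.
# Only the "+^http://ubuntu/" line comes from the text; the others are
# illustrative assumptions.
RULE_TEXT = r"""
# skip a few image suffixes
-\.(gif|jpg|png)$
# accept everything on the test host
+^http://ubuntu/
# reject anything else
-.
"""


def load_rules(text):
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # ignore blank lines and comments
        rules.append((line[0] == "+", re.compile(line[1:])))
    return rules


def accepts(url, rules):
    # Assumed semantics: the first matching rule decides; no match rejects.
    for allow, pattern in rules:
        if pattern.search(url):
            return allow
    return False


rules = load_rules(RULE_TEXT)
for url in ("http://ubuntu/a.html", "http://ubuntu/logo.gif",
            "http://www.joinc.co.kr/"):
    print(url, accepts(url, rules))
```

Under these assumptions a page on the test host passes, while an image on the same host and a page on an external site are both filtered out.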
Now Nutch can be used to collect the documents.
    # bin/nutch crawl urls -dir ubuntu.test -depth 3 >& crawl.log
crawl is the subcommand that tells Nutch to collect documents. urls is the file that holds the root URL. -dir names the directory in which information about the collected documents is stored. -depth sets how many rounds of the generate/fetch/update cycle are run. The documents at the test URL are simple, so 3 rounds are enough. To run against a real site, however, you should specify at least about 5.
4 Analyzing the Crawl Results
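Collecting the documents is not the end of the job. They have to be divided into segments, and the related information, such as the fetchlist, the index, and the document content, has to be managed as a database. For that reason the results are stored as a self-contained file-based database made up of several directories. The following shows what the ubuntu.test directory contains.
[Figure: contents of the ubuntu.test directory]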
Analyzing the structure of this file system shows how the collected documents can be indexed and organized efficiently. That topic will be covered separately.
5 WebDB
Documents collected by the crawl are stored in a database. It holds important information such as the links each document contains, the number of pages, the DID (Document ID), the fetch date, and the document's importance score. This information has to be kept up to date (for example, when a new web page that refers to a document appears, that document's score should be raised), so it must be possible to sort, search, and update it, and Nutch provides tools for this. First, let us get an overview of the site.
    # bin/nutch readdb ubuntu.test/db -stats
Running this command prints information like the following.
    Stats for org.apache.nutch.db.WebDBReader@1c9b9ca
    -------------------------------
    Number of pages: 5
    Number of links: 4
This shows that 5 pages were fetched from the ubuntu site and that 4 links exist in total.
Now let us look at the details page by page.
    # bin/nutch readdb ubuntu.test/db -dumppageurl
You should see output like the following.
    Page 1: Version: 4
    URL: http://ubuntu/
    ID: f14dfdbdb13ff576277bbd58bf061d23
    Next fetch: Wed Jul 26 17:51:36 KST 2006
    Retries since fetch: 0
    Retry interval: 30 days
    Num outlinks: 1
    Score: 1.0
    NextScore: 1.0

    Page 2: Version: 4
    URL: http://ubuntu/a.html
    ID: e1a0ed7a767bc0920b888c750224f39b
    Next fetch: Wed Jul 26 17:51:38 KST 2006
    Retries since fetch: 0
    Retry interval: 30 days
    Num outlinks: 3
    Score: 1.0
    NextScore: 1.0
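The Next fetch value in this dump appears consistent with the fetch time plus the retry interval. The short check below assumes that http://ubuntu/ was fetched at 17:51:36 on 26 June 2006, the time reported for the first segment later in this document, and it ignores the time zone. The function name is hypothetical, and this is only arithmetic on the printed values, not a description of how Nutch schedules fetches internally.

```python
from datetime import datetime, timedelta


def next_fetch(fetched_at, retry_interval_days):
    # Next fetch time = last fetch time + retry interval.
    return fetched_at + timedelta(days=retry_interval_days)


fetched_at = datetime(2006, 6, 26, 17, 51, 36)  # assumed fetch time of http://ubuntu/
due = next_fetch(fetched_at, 30)                # "Retry interval: 30 days"
print(due.isoformat())
```

The result falls on 26 July 2006 at 17:51:36, which matches the Next fetch line shown for Page 1.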
The links of each page can be checked as follows.
    # bin/nutch readdb ubuntu.test/db -dumplinks
    from http://ubuntu/a.html
     to http://ubuntu/b.html
     to http://ubuntu/c.html
     to http://ubuntu/index.html

    from http://ubuntu/
     to http://ubuntu/a.html

    from http://ubuntu/index.html
     to http://ubuntu/a.html
6 Segments
The web contains an enormous number of documents, and they change from moment to moment. Managing document information properly would require keeping a copy of every document, which is not realistic, so the information has to be partitioned. Nutch partitions it by time. For example, the information on documents collected during one day is gathered into statistics for the month, and the monthly statistics are in turn gathered to maintain document information for the year. Segment information can be checked as follows.
    # bin/nutch segread -list -dir ubuntu.test/segments/
    PARSED? STARTED           FINISHED          COUNT DIR NAME
    true    20060626-17:51:36 20060626-17:51:36 1     ubuntu.test/segments/20060626175136
    true    20060626-17:51:38 20060626-17:51:38 1     ubuntu.test/segments/20060626175138
    true    20060626-17:51:39 20060626-17:51:40 3     ubuntu.test/segments/20060626175139
    TOTAL: 5 entries in 3 segments.
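The three segments and their counts (1, 1, 3) are what one would expect from three rounds of a generate/fetch/update cycle over the link graph printed by -dumplinks. The sketch below simulates that cycle in memory. It is a simplified model, not Nutch's implementation: the seed is assumed to be http://ubuntu/, a set stands in for the WebDB, and the function name is hypothetical.

```python
# Link graph as printed by "readdb -dumplinks" earlier in this document.
LINKS = {
    "http://ubuntu/": ["http://ubuntu/a.html"],
    "http://ubuntu/a.html": ["http://ubuntu/b.html",
                             "http://ubuntu/c.html",
                             "http://ubuntu/index.html"],
    "http://ubuntu/index.html": ["http://ubuntu/a.html"],
}


def crawl(seeds, links, depth):
    known = set(seeds)   # stands in for the WebDB
    fetched = set()
    rounds = []          # one fetchlist per round, i.e. one "segment"
    for _ in range(depth):
        # generate: every known URL that has not been fetched yet
        fetchlist = sorted(known - fetched)
        if not fetchlist:
            break
        # fetch: visit each URL and collect its outlinks
        discovered = set()
        for url in fetchlist:
            fetched.add(url)
            discovered.update(links.get(url, []))
        # update: add the newly discovered URLs to the known set
        known |= discovered
        rounds.append(fetchlist)
    return rounds


rounds = crawl(["http://ubuntu/"], LINKS, depth=3)
print([len(r) for r in rounds])
```

Each round corresponds to one segment in this model, so a depth of 3 yields three fetchlists whose sizes can be compared with the COUNT column above.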
The details of each segment can be obtained as follows.
    s=`ls -d ubuntu.test/segments/* | head -1`
    # bin/nutch segread -dump $s
    Recno:: 0
    FetcherOutput::
    FetchListEntry: version: 2
    fetch: true
    page: Version: 4
    URL: http://ubuntu/
    ID: 091f90073a9ba19470be1581e7adb865
    Next fetch: Mon Jul 03 17:51:36 KST 2006
    Retries since fetch: 0
    Retry interval: 30 days
    Num outlinks: 0
    Score: 1.0
    NextScore: 1.0

    anchors: 0
    Fetch Result:
    MD5Hash: f14dfdbdb13ff576277bbd58bf061d23
    ProtocolStatus: success(1), lastModified=0
    FetchDate: Mon Jun 26 17:51:36 KST 2006

    Content::
    url: http://ubuntu/
    base: http://ubuntu/
    contentType: text/html
    metadata: {Date=Mon, 26 Jun 2006 08:51:36 GMT, Server=Apache/2.2.0 (Unix) PHP/5.1.2, X-Powered-By=PHP/5.1.2, Connection=close, Content-Type=text/html, Content-Length=129}
    Content:
    <html>
    <body>
    Hello World <br>
    1. <a href=a.html>Page A</a><br>
    2. <a href=http://www.joinc.co.kr>Joinc</a><br>
    </body>
    </html>

    ParseData::
    Status: success(1,0)
    Title:
    Outlinks: 2
    outlink: toUrl: http://ubuntu/a.html anchor: Page A
    outlink: toUrl: http://www.joinc.co.kr/ anchor: Joinc
    Metadata: {Date=Mon, 26 Jun 2006 08:51:36 GMT, CharEncodingForConversion=windows-1252, X-Powered-By=PHP/5.1.2, Server=Apache/2.2.0 (Unix) PHP/5.1.2, Content-Type=text/html, Connection=close, Content-Length=129}

    ParseText::
    Hello World 1. Page A 2. Joinc
For each document in a segment you can see its DID, importance score, links, HTTP headers, and a copy of the document itself. The parsed document, with the HTML tags removed, can also be inspected. This parsed text is the information used to build the index.
7 Index