Nutch Hadoop Installation and Operation Guide

Contents

1 Introduction to Nutch
2 Network Environment Setup
3 Installing Nutch and Hadoop
4 Nutch Hadoop Installation
5 Configuring Hadoop Support
6 Building the hadoop Environment
6.1 Setting Up ssh Connections
6.2 Creating the Distributed File System
6.3 Crawling onto the Distributed File System
6.4 Search Test
7 Hadoop MapReduce
7.1 Map
7.2 Reduce
8 Discussion


Note: this document is somewhat dated. I plan to revise it to match the current situation (as of 2008/10/22).

1 Introduction to Nutch

Hadoop provides a distributed file system together with an implementation of the MapReduce programming model. This document covers how to build a distributed file system environment with Hadoop and run Nutch on top of it. If you use Nutch 0.7.x or earlier, you do not need to read this document; Hadoop has been used since Nutch 0.8.x.

This document does not explain the architecture of Nutch or Hadoop. It focuses only on installation and operation.

The original text is available at http://wiki.apache.org/nutch/NutchHadoopTutorial . The environment used in this document:
  • Operating system : Linux kernel 2.6.x
  • Tomcat 5.x + JDK 1.5

2 Network Environment Setup

A distributed file system needs a network of machines. If you do not have enough machines, a single-machine setup will work without much trouble. For now, the explanation assumes that suitable machines for the distributed file system are available. Each machine needs the hardware (at least 256 MB of RAM and 10 GB of disk) and the software environment required to run Linux and Nutch. For testing I prepared the following three computers.
devcluster01 
devcluster02 
devcluster03 
 
devcluster01 will be the master node. The master node runs the Hadoop services that coordinate the subordinate nodes (slave nodes). It is also the machine that runs the crawl to collect web documents and hosts the search engine.

3 Installing Nutch and Hadoop

Hadoop has been used since Nutch 0.8.x. Download Nutch from the subversion repository and install it.

The subversion repository can be reached through the following URL.
If you use cvs, use the URL below.
If you use Eclipse as your development environment, you need to install the subversion plugin.

4 Nutch Hadoop Installation

Installation is simple: unpack the Nutch archive into the directory of your choice. I unpacked it into /home/yundream/workspace.

5 Configuring Hadoop Support

nutch builds the hadoop environment by running bin/start-all.sh on the master node and the data nodes. This means the master node has to connect to each data node it manages and run the script there. ssh is used for these connections.

The start-all.sh script must be installed at the same location on every server. The path used to store the distributed files must also be the same on every server. For this I created the directory structure below.
 / --+-- home/yundream/nutch/search   <nutch installation directory> 
     | 
     +-- nutch/filesystem             <hadoop file system>  
     | 
     +-- usr/local/tomcat             <tomcat installation directory> 
 

tomcat is used to provide users with the search interface. For details, see the nutch search document JCvs/Search/Document/nutch/Searching. The distributed file system is brought up in the following sequence.
  1. Run the hadoop script on the master node.
  2. The hadoop script reads hadoop-site.xml to obtain the HOSTNAME:PORT of the master node.
  3. It connects to the master node's HOSTNAME:PORT using the ssh protocol.
  4. Once connected, it runs hadoop-env.sh to check the hadoop runtime environment of the data node.
  5. hadoop starts on the master node and creates the root of the distributed directory using the information in hadoop-site.xml.
  6. It reads the list of data node hosts from slaves.
  7. It connects to each host in slaves, reads hadoop-env.sh, starts hadoop there, and begins serving the distributed directory.

From this you can see that the key configuration files are hadoop-env.sh, hadoop-site.xml, and slaves.
  • hadoop-env.sh : environment variables for running hadoop
  • hadoop-site.xml : host information for the master node and distributed file system settings
  • slaves : data node host names
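
Steps 6 and 7 of the flow above can be pictured with a small shell sketch. It is not the real start-all.sh: the helper names list_slaves and print_start_commands are made up for illustration, the remote command is an assumption about what the real scripts run, and the commands are only printed, never executed.

```shell
# Sketch only: prints the commands a start script might run, one per data node.
# list_slaves and print_start_commands are hypothetical helper names.

# Print the host names in a slaves file, skipping blank lines and # comments.
list_slaves() {
    sed -e 's/#.*$//' -e 's/[[:space:]]*$//' -e '/^$/d' "$1"
}

# For each data node, print (not run) an ssh command that would start a daemon.
# $1 = slaves file, $2 = hadoop install directory (same path on every node).
# The remote command shown is an assumption, not taken from this document.
print_start_commands() {
    list_slaves "$1" | while read -r host; do
        echo "ssh $host $2/bin/hadoop-daemon.sh start datanode"
    done
}
```

For example, print_start_commands conf/slaves /home/yundream/workspace/nutch-nightly would print one ssh line per host listed in slaves, which shows why the install path has to be identical on every server.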

The following is the content of conf/hadoop-env.sh. This file is required on both the master node and the data nodes in order to run hadoop.
export HADOOP_HOME=/home/yundream/workspace/nutch-nightly 
export JAVA_HOME=/usr/local/java 
export HADOOP_LOG_DIR=${HADOOP_HOME}/logs 
export HADOOP_SLAVES=${HADOOP_HOME}/conf/slaves 
 

The following is the content of conf/hadoop-site.xml.
<?xml version="1.0"?> 
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?> 
 
<!-- Put site-specific property overrides in this file. --> 
 
<configuration> 
<property> 
  <name>fs.default.name</name> 
  <value>devcluster01:9000</value> 
</property> 
 
<property> 
  <name>mapred.job.tracker</name> 
  <value>devcluster01:9001</value> 
</property> 
 
<property> 
  <name>dfs.name.dir</name> 
  <value>/nutch/filesystem/name</value> 
</property> 
 
<property> 
  <name>dfs.data.dir</name> 
  <value>/nutch/filesystem/data</value> 
</property> 
 
<property> 
  <name>mapred.system.dir</name> 
  <value>/nutch/filesystem/mapreduce/system</value> 
</property> 
 
<property> 
  <name>mapred.local.dir</name> 
  <value>/nutch/filesystem/mapreduce/local</value> 
</property> 
 
<property> 
  <name>dfs.replication</name> 
  <value>1</value> 
</property> 
 
</configuration> 
 
It contains the connection information for the master node, devcluster01, and the distributed file system settings. The root path of the distributed file system is set to /nutch/filesystem.

The following is the content of slaves. It holds the host names of the data nodes to be managed.
devcluster02 
devcluster03 
 

6 Building the hadoop Environment

6.1 Setting Up ssh Connections

The master node uses ssh connections to manage the data nodes. Normally ssh asks for a password when a connection is made, so this needs to be automated. I chose to generate an authentication key and distribute it. First generate the key on the master node as shown below, leaving the passphrase empty so that no prompt appears later.
# ssh-keygen -t rsa 
Generating public/private rsa key pair. 
Enter file in which to save the key (/root/.ssh/id_rsa):  
/root/.ssh/id_rsa already exists. 
Overwrite (y/n)? y 
Enter passphrase (empty for no passphrase):  
Enter same passphrase again:  
Your identification has been saved in /root/.ssh/id_rsa. 
Your public key has been saved in /root/.ssh/id_rsa.pub. 
The key fingerprint is: 
a5:f9:0d:96:77:57:8d:0c:c4:70:0f:19:5a:f2:d0:3e root@ubuntu 
 

# cd /root/.ssh 
# cp id_rsa.pub authorized_keys 
 
Now copy the generated key to every data node.
# scp /root/.ssh/authorized_keys root@devcluster02:/nutch/home/.ssh/authorized_keys 
 

With the key copied, an ordinary ssh connection no longer requires a password.
# ssh devcluster02 
// You are connected immediately, without a password prompt.  
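
Checking each data node by hand gets tedious, so the same test can be looped over the slaves file. The sketch below is only an illustration: check_nodes is a made-up helper name, and SSH_CMD is a variable introduced here so the ssh client can be swapped for a stub when trying the function out.

```shell
# Sketch only: report which data nodes accept key-based login.
# check_nodes is a hypothetical helper; SSH_CMD defaults to the real ssh client.
SSH_CMD=${SSH_CMD:-ssh}

# $1 = slaves file with one host name per line.
check_nodes() {
    while read -r host; do
        [ -n "$host" ] || continue
        # BatchMode=yes makes ssh fail instead of prompting for a password.
        # stdin is redirected so ssh cannot swallow the rest of the host list.
        if $SSH_CMD -o BatchMode=yes "$host" true < /dev/null; then
            echo "$host: ok"
        else
            echo "$host: key login failed"
        fi
    done < "$1"
}
```

Running check_nodes conf/slaves on the master node prints one line per data node. Any host reported as failed would still ask for a password when start-all.sh tries to reach it.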
 

6.2 Creating the Distributed File System

  1. Create /nutch/filesystem on every node.
  2. On the master node, initialize the distributed file system directory with the following command.

    # bin/hadoop namenode -format 
     
  3. Run start-all.sh to start the distributed file system.

    # bin/start-all.sh 
     
  4. To stop the distributed file system, use the stop-all.sh script.

    # bin/stop-all.sh 
     

Now edit urls/urllist.txt so that it lists the tcl.apache.org site, then add it to a path on the distributed file system. nutch crawl uses this path when it collects web documents.
# bin/hadoop dfs -put urls urls 
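
The put command above expects a local urls directory that already contains the seed list. A minimal sketch of preparing it follows; make_seed_list is a made-up helper name, and the one-URL-per-line layout of urllist.txt is an assumption.

```shell
# Sketch only: create a seed directory holding a single URL list file.
# make_seed_list is a hypothetical helper name.
# $1 = directory to create, $2 = URL to start crawling from.
make_seed_list() {
    mkdir -p "$1" && printf '%s\n' "$2" > "$1/urllist.txt"
}
```

Calling make_seed_list urls http://tcl.apache.org/ from the nutch directory leaves urls/urllist.txt ready to be copied into the distributed file system.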
 

You can check that the path was added successfully with the following command.
# bin/hadoop dfs -ls 
Found 2 items 
/user/root/apache       <dir> 
... 
 

6.3 Crawling onto the Distributed File System

Now that the distributed file system has been built with hadoop, it is time to run nutch crawl. First edit crawl-urlfilter.txt.
+^http://tcl.apache.org/ 
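
The rule above is a regular expression prefixed with "+", meaning that URLs matching it are accepted. To get a feel for what the pattern lets through, it can be tried with grep. This is only an approximation: url_accepted is a made-up helper, it tests a single pattern rather than a whole filter file, and it assumes grep's extended syntax behaves like the filter's regex engine for a pattern this simple.

```shell
# Sketch only: test a URL against one accept pattern with grep.
# url_accepted is a hypothetical helper, not part of nutch.
# $1 = URL, $2 = regular expression without the leading "+".
url_accepted() {
    printf '%s\n' "$1" | grep -Eq -- "$2"
}
```

With the pattern ^http://tcl.apache.org/ the call url_accepted http://tcl.apache.org/rivet/ succeeds and url_accepted http://www.apache.org/ fails. Because the dots are not escaped, a URL such as http://tclXapache.org/ would also match; writing the dots as \. makes the rule stricter.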
 
Now start nutch crawl. It will take roughly 20 minutes.
cd /nutch/search 
bin/nutch crawl urls -dir crawled -depth 3 
 

6.4 Search Test

Let us now check that searching against the distributed file system works properly. I will assume that you have set up a nutch - lucene development environment in eclipse.
In nutch, searching is handled by lucene's Searcher, but lucene only supports searching a local file system. To search files stored on the distributed file system, you therefore have to use the hadoop API that nutch provides to hand lucene.searcher a path that has been made available as a local file system path. More precisely, you have to pass it the index directory. For example, to search the index you use lucene's IndexSearcher.

For details, see http://www.joinc.co.kr/modules/moniwiki/wiki.php/JCvs/Search/Document/nutch/query#s-8.2 .


7 Hadoop MapReduce

Hadoop implements the MapReduce programming model. This section gives a brief overview of how MapReduce is applied on top of the distributed file system.

7.1 Map

The input to the Map phase is a list of files to be processed in parallel. These files are divided into pieces called FileSplits. A single very large file is cut into several smaller pieces using seek operations. The splitting pays no attention to the logical structure of the file: even a line-oriented text file is split on byte boundaries. One Map task is then created for each FileSplit.

When an individual Map task starts, it opens a new output for each configured Reduce task. It then reads the data from its FileSplit using the RecordReader supplied by the InputFormat, which converts the data into key-value pairs. The InputFormat also has to handle records that straddle a FileSplit boundary. For example, TextInputFormat checks the end of its split and, if the split does not end on a newline, keeps reading until it reaches one. When it reads the next FileSplit, it ignores the characters before the first newline.
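
The boundary rule described above can be tried out on an ordinary text file. The sketch below is not how Hadoop does it internally: read_split is a made-up helper that scans the file from the beginning and tracks byte offsets, whereas Hadoop seeks directly to the split. It assumes every line ends with a newline. The effect is the same rule stated another way: a line belongs to the split that contains its first byte.

```shell
# Sketch only: print the lines of a text file that belong to one byte-range split.
# read_split is a hypothetical helper; it scans from the start instead of seeking.
# $1 = file, $2 = split start offset in bytes, $3 = split length in bytes.
read_split() {
    LC_ALL=C awk -v start="$2" -v len="$3" '
        BEGIN { end = start + len; off = 0 }
        {
            # off is the byte offset at which the current line begins.
            # A line that starts inside the split is printed in full even if
            # it runs past the split end; the next split skips that tail.
            if (off >= start && off < end) print
            off += length($0) + 1
        }
    ' "$1"
}
```

For a file containing the lines alpha, beta and gamma, cut into 8-byte splits, the first split yields alpha and beta (beta crosses the boundary at byte 8), the second yields gamma, and no line is read twice or lost.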


7.2 Reduce

The input to a Reduce task is a large number of files scattered across the cluster. The map tasks leave these files on whichever nodes they ran on. When a Reduce task runs, it first copies the files it needs to its local file system.

All of the data is then appended to a single file on the local file system, and that file is sorted so that the key-value pairs for each key are contiguous. Reduce reads the key-value pairs sequentially, advancing until it finds that the next key is different. Because the work amounts to walking a read iterator and consuming the values, the Reduce operation itself is simple.
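
The sort-then-iterate idea can be imitated with standard tools. The sketch below is an analogy, not Hadoop code: reduce_sum is a made-up helper, the tab-separated "key value" input format is an assumption, and summing the values is just one arbitrary choice of reduce function.

```shell
# Sketch only: sort key/value lines so equal keys are contiguous, then sum
# the values of each key in a single sequential pass.
# reduce_sum is a hypothetical helper; stdin holds "key<TAB>value" lines.
reduce_sum() {
    LC_ALL=C sort -t "$(printf '\t')" -k1,1 | awk -F '\t' '
        # A change of key means the previous group is complete: emit it.
        NR > 1 && $1 != key { print key "\t" total; total = 0 }
        { key = $1; total += $2 }
        END { if (NR > 0) print key "\t" total }
    '
}
```

Because sorting has already grouped the keys, the awk pass only needs to remember the current key and a running total, which is why the reduce step can stay this small.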

8 Discussion

  • How about using GFS instead of hadoop?