Hadoop

Tech share
• Hadoop Core, our flagship sub-project,
  provides a distributed filesystem (HDFS) and
  support for the MapReduce distributed
  computing metaphor.
• Pig is a high-level data-flow language and
  execution framework for parallel computation.
  It is built on top of Hadoop Core.
ZooKeeper
• ZooKeeper is a highly available and reliable
  coordination system. Distributed applications
  use ZooKeeper to store and mediate updates
  for critical shared state.
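• As a flavor of the API, a minimal hedged sketch of storing and reading shared state in a znode (the ensemble address, znode path, and class name are assumptions, not from the deck):

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ZkStateDemo {
  public static void main(String[] args) throws Exception {
    // Connect to the ensemble; the watcher receives connection events.
    ZooKeeper zk = new ZooKeeper("192.168.0.10:2181", 3000, new Watcher() {
      public void process(WatchedEvent event) { /* ignore for this sketch */ }
    });
    // Store a piece of critical shared state under a znode.
    zk.create("/shared-config", "v1".getBytes("UTF-8"),
              ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
    // Any client in the cluster can now read (and watch) this state.
    byte[] data = zk.getData("/shared-config", false, null);
    System.out.println(new String(data, "UTF-8"));
    zk.close();
  }
}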
JobTracker
• JobTracker: The JobTracker provides command
  and control for job management. It supplies
  the primary user interface to a MapReduce
  cluster. It also handles the distribution and
  management of tasks. There is one instance of
  this server running on a cluster. The machine
  running the JobTracker server is the
  MapReduce master.
TaskTracker
• TaskTracker: The TaskTracker provides
  execution services for the submitted jobs.
  Each TaskTracker manages the execution of
  tasks on an individual compute node in the
  MapReduce cluster. The JobTracker manages
  all of the TaskTracker processes. There is one
  instance of this server per compute node.
NameNode
• NameNode: The NameNode provides metadata
  storage for the shared file system. The
  NameNode supplies the primary user interface to
  the HDFS. It also manages all of the metadata for
  the HDFS. There is one instance of this server
  running on a cluster. The metadata includes such
  critical information as the file directory structure
  and which DataNodes have copies of the data
  blocks that contain each file’s data. The machine
  running the NameNode server process is the
  HDFS master.
Secondary NameNode
• Secondary NameNode: The secondary
  NameNode provides both file system metadata
  backup and metadata compaction. It supplies
  near real-time backup of the metadata for the
  NameNode. There is at least one instance of this
  server running on a cluster, ideally on a separate
  physical machine from the one running the
  NameNode. The secondary NameNode also
  merges the metadata change history, the edit log,
  into the NameNode’s file system image.
Design of HDFS
• A good fit
  – Very large files
  – Streaming data access
  – Commodity hardware
• Not a good fit
  – Low-latency data access
  – Lots of small files
  – Multiple writers, arbitrary file modifications
Blocks
• Disk blocks: normally 512 bytes
• HDFS blocks: 64 MB by default
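• The HDFS default can be raised per cluster in hdfs-site.xml; a hedged example (the 128 MB value is an illustration, not from the deck):

<property>
  <name>dfs.block.size</name>
  <value>134217728</value> <!-- 128 MB, in bytes -->
</property>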
HDFS file reads
• (diagram of the read path; the only surviving label is 内存, "memory")
HDFS file writes
• OutputStream.write()
• OutputStream.flush(): flushes the client buffer, but readers only see the data once more than a full block has been written
• OutputStream.sync(): forces synchronization to the datanodes
• OutputStream.close(): includes a sync()
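• A minimal sketch of that write lifecycle (the cluster address, path, and class name are assumptions):

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class WriteLifecycle {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(URI.create("hdfs://192.168.0.10:9000"), conf);
    FSDataOutputStream out = fs.create(new Path("/t1/demo.txt"));
    out.write("content".getBytes("UTF-8")); // buffered on the client side
    out.flush();  // flushes the client buffer; readers still may not see the data
    out.sync();   // forces the data out to the datanodes; now visible to readers
    out.close();  // close() performs a sync() itself
  }
}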
DistCp distributed copy
• hadoop distcp -update hdfs://namenode1/foo
  hdfs://namenode2/bar

• hadoop distcp -update …
  – copies only files that have changed
• hadoop distcp -overwrite …
  – overwrites files at the destination
• hadoop distcp -m 100 …
  – splits the copy into N map tasks (here 100)
Hadoop file archives
• HAR files

• hadoop archive -archiveName file.har
  /myfiles /outpath

• hadoop fs -ls /outpath/file.har
• hadoop fs -lsr har:///outpath/file.har
File operations
• hadoop fs -rm hdfs://192.168.126.133:9000/xxx

• Other hadoop fs subcommands: cat, chgrp, chmod, chown, copyFromLocal, copyToLocal, count, cp, du, dus, expunge, get, getmerge, ls, lsr, mkdir, moveFromLocal, moveToLocal, mv, put, rm, rmr, setrep, stat, tail, test, text, touchz
Distributed deployment
• Master & slave: 192.168.0.10
• Slave: 192.168.0.20

• Edit conf/masters
  – 192.168.0.10
• Edit conf/slaves
  – 192.168.0.10
  – 192.168.0.20
Installing Hadoop
• ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa

• cat ~/.ssh/id_dsa.pub >>
  ~/.ssh/authorized_keys

• Disable the firewall: sudo ufw disable
Distributed deployment: core-site.xml
             (identical on master & slaves)
<configuration>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/home/tony/tmp/tmp</value>
    <description>A base for other temporary directories.</description>
  </property>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://192.168.0.10:9000</value>
  </property>
</configuration>
Distributed deployment: hdfs-site.xml
               (master & slaves)
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
  <property>
    <name>dfs.name.dir</name>
    <value>/home/tony/tmp/name</value>
  </property>
  <property>
    <name>dfs.data.dir</name>
    <value>/home/tony/tmp/data</value>
  </property>
</configuration>
• Make sure these directories exist on each machine
Distributed deployment: mapred-site.xml
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>192.168.0.10:9001</value>
  </property>
</configuration>
• Every machine is configured with the master's address
Run
• hadoop namenode -format
  – Before each format, run stop-all.sh first and
    clear all directories under tmp
• start-all.sh
• Check the cluster status:
  – http://192.168.0.20:50070/dfshealth.jsp
  – or hadoop dfsadmin -report
could only be replicated
• java.io.IOException: could only be replicated
  to 0 nodes, instead of 1.

• Fix:
  – The XML configuration is wrong; make sure the addresses
    in each slave's mapred-site.xml and core-site.xml match
    the master's
Incompatible namespaceIDs
• java.io.IOException: Incompatible
  namespaceIDs in /home/hadoop/data:
  namenode namespaceID = 1214734841;
  datanode namespaceID = 1600742075
• Cause:
  – tmp was not cleared before formatting, so the namespaceIDs
    no longer match
• Fix:
  – Edit the namenode's
    /home/hadoop/name/current/VERSION
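• For reference, a typical VERSION file looks roughly like this (the namespaceID reuses the datanode's ID from the error above; the timestamp and other field values are illustrative):

#Thu Nov 17 10:23:41 CST 2011
namespaceID=1600742075
cTime=0
storageType=NAME_NODE
layoutVersion=-18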
UnknownHostException
• # hostname
• vi /etc/hostname to change the hostname
• vi /etc/hosts to add the IP mapping for the hostname
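• A hedged /etc/hosts example using the deployment addresses above (the hostnames are assumptions):

192.168.0.10 master
192.168.0.20 slave1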
error in shuffle in fetcher
• org.apache.hadoop.mapreduce.task.reduce.Shuffle$ShuffleError:
  error in shuffle in fetcher
• Fix:
  – The problem lies in the hosts configuration: add the
    hostname-to-IP mappings of all other nodes to /etc/hosts
    on every node
Auto sync
Adding a DataNode dynamically
• In the master's conf/slaves, add the new node's address

• Start the daemons on the new node
  – bin/hadoop-daemon.sh start datanode
    bin/hadoop-daemon.sh start tasktracker

• Once they are running, Hadoop recognizes the new node automatically.
Fault tolerance
• If a node is unresponsive for too long, it is removed from
  the cluster, and the other nodes restore the replication
  level of its blocks
Running MapReduce
• hadoop jar a.jar com.Map1
  hdfs://192.168.126.133:9000/hadoopconf/
  hdfs://192.168.126.133:9000/output2/
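• For context, a hedged sketch of what a.jar's com.Map1 might contain; only the class name comes from the slide, the rest is an assumed map-only job that tags each input line with a count:

package com;

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class Map1 {
  // Emits each input line as the key, with a count of 1 as the value.
  public static class LineMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    @Override
    protected void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      context.write(value, ONE);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = new Job(new Configuration(), "tech-share demo");
    job.setJarByClass(Map1.class);
    job.setMapperClass(LineMapper.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // e.g. /hadoopconf/
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // e.g. /output2/
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}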
Read From Hadoop URL
import java.io.FileNotFoundException;
import java.io.IOException;
import java.net.URL;
import org.apache.hadoop.fs.FsUrlStreamHandlerFactory;
import org.apache.hadoop.io.IOUtils;

// execute: hadoop ReadFromHDFS
public class ReadFromHDFS {
  static {
    // Register the hdfs:// URL scheme; this can only be done once per JVM.
    URL.setURLStreamHandlerFactory(new FsUrlStreamHandlerFactory());
  }
  public static void main(String[] args) {
    try {
      URL uri = new URL("hdfs://192.168.126.133:9000/t1/a1.txt");
      IOUtils.copyBytes(uri.openStream(), System.out, 4096, false);
    } catch (FileNotFoundException e) {
      e.printStackTrace();
    } catch (IOException e) {
      e.printStackTrace();
    }
  }
}
Read By FileSystem API
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

// execute: hadoop ReadByFileSystemAPI
public class ReadByFileSystemAPI {
  public static void main(String[] args) throws Exception {
    String uri = "hdfs://192.168.126.133:9000/t1/a2.txt";
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(URI.create(uri), conf);
    FSDataInputStream in = null;
    try {
      in = fs.open(new Path(uri));
      IOUtils.copyBytes(in, System.out, 4096, false);
    } finally {
      IOUtils.closeStream(in);
    }
  }
}
FileSystem API

// fs is a FileSystem instance obtained as on the previous slide.
Path path = new Path(URI.create("hdfs://192.168.126.133:9000/t1/tt/"));
if (fs.exists(path)) {
  fs.delete(path, true); // true = recursive
  System.out.println("deleted");
} else {
  fs.mkdirs(path);
  System.out.println("created");
}

// List files
FileStatus[] fileStatuses = fs.listStatus(new Path(URI.create("hdfs://192.168.126.133:9000/")));
for (FileStatus fileStatus : fileStatuses) {
  System.out.println(fileStatus.getPath().toUri().toString() + " dir:" + fileStatus.isDirectory());
}

// A filter that accepts every path
PathFilter pathFilter = new PathFilter() {
  @Override
  public boolean accept(Path path) {
    return true;
  }
};
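• The filter above is declared but never applied on the slide; a one-line usage sketch:

FileStatus[] filtered = fs.listStatus(new Path("hdfs://192.168.126.133:9000/"), pathFilter);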
File write semantics
• After a file is created, it is visible in the filesystem namespace:
  Path p = new Path("p");
  fs.create(p);
  assertThat(fs.exists(p), is(true));
• However, content written to the file is not guaranteed to be visible, even after the stream has been flushed, so the file length still shows as 0:
  Path p = new Path("p");
  OutputStream out = fs.create(p);
  out.write("content".getBytes("UTF-8"));
  out.flush();
  assertThat(fs.getFileStatus(p).getLen(), is(0L));
• Once more than a block of data has been written, the first block becomes visible to new readers, and likewise for later blocks: it is always the block currently being written that other readers cannot see.
• out.sync() forces synchronization; close() calls sync() automatically
Cluster copy and archiving
• hadoop distcp -update hdfs://n1/foo
  hdfs://n2/bar/foo
• Archiving
  – hadoop archive -archiveName files.har /my/files
    /my
• Using an archive
  – hadoop fs -lsr har:///my/files.har
  – hadoop fs -lsr har://hdfs-localhost:8020/my/files.har/my/files/di
• Archive drawback: modifying, adding, or deleting files all require re-archiving
SequenceFile Reader&Writer
Configuration conf = new Configuration();
SequenceFile.Writer writer = null;
try {
  FileSystem fileSystem = FileSystem.newInstance(conf);
  IntWritable key = new IntWritable(1);
  Text value = new Text("");
  Path path = new Path("hdfs://192.168.126.133:9000/t1/seq");
  if (!fileSystem.exists(path)) {
    // createWriter creates the file itself; a separate fileSystem.create(path)
    // call would only leak an open stream
    writer = SequenceFile.createWriter(fileSystem, conf, path, key.getClass(), value.getClass());
    for (int i = 1; i < 10; i++) {
      writer.append(new IntWritable(i), new Text("value" + i));
    }
    writer.close();
  } else {
    SequenceFile.Reader reader = new SequenceFile.Reader(fileSystem, path, conf);
    while (reader.next(key, value)) {
      System.out.println("key:" + key.get() + " value:" + value + " position:" + reader.getPosition());
    }
    reader.close();
  }
} catch (IOException e) {
  e.printStackTrace();
} finally {
  IOUtils.closeStream(writer);
}
SequenceFile
•   1 value1
•   2 value2
•   3 value3
•   4 value4
•   5 value5
•   6 value6
•   7 value7
•   8 value8
•   9 value9
•   Each record consists of a key and a value
•   Use hadoop fs -text hdfs://… to display the file
MapFile
• Rebuild the index: MapFile.fix(fileSystem, path,
  key.getClass(), value.getClass(), true, conf);

• MapFile.Writer writer = new MapFile.Writer(conf,
  fileSystem, path.toString(), key.getClass(),
  value.getClass());

• MapFile.Reader reader = new
  MapFile.Reader(fileSystem, path.toString(), conf);
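• A short usage sketch with the writer and reader above (keys must be appended in ascending order; get() is a random lookup served through the index):

for (int i = 1; i < 10; i++) {
  writer.append(new IntWritable(i), new Text("value" + i)); // keys in sorted order
}
writer.close();

Text value = new Text();
reader.get(new IntWritable(5), value); // value now holds "value5"
reader.close();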
