Hadoop interview questions


1. How will you add/delete a Node to the existing cluster?

A) Add: Add the hostname/IP address to the dfs.hosts/slaves file and refresh the cluster with $ hadoop dfsadmin -refreshNodes

Delete: Add the hostname/IP address to dfs.hosts.exclude, remove the entry from the slaves file, and refresh the cluster with $ hadoop dfsadmin -refreshNodes

2. What is SSH? What is its use in Hadoop?

A) Secure Shell. Hadoop's start/stop scripts use it to log in to the worker nodes and start or stop the daemons there.

3. How will you setup Password-less SSH?

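A) Generate a key pair with ssh-keygen (empty passphrase) and append the public key to ~/.ssh/authorized_keys on each node.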

4. How will you format HDFS? How frequently is it done?

A) $ hadoop namenode -format

Note: Formatting is done only once, during the initial cluster setup.

5. How will you manage the Log files generated in Hadoop cluster?


6. Do you know about cron jobs? How will you set one up?

A) In Ubuntu, go to the terminal and type:

$ crontab -e

This opens our personal crontab (cron configuration file). The first line in that file explains the format. Each line defines one command to run, and the structure is simple:

minute hour day-of-month month day-of-week command

Each numeric field also accepts a comma-separated list. For example, 5,34,55 in the first field means: run at 5, 34 and 55 minutes past whatever hour is defined.
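
To make the field layout concrete, here is a minimal Java sketch that splits one crontab line into its five schedule fields plus the command and expands a comma-separated minute list. The class name, method names and the script path are made up for illustration, and the sketch only handles plain numbers and comma lists, not "*", ranges or steps.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: handles plain numbers and comma lists, not "*", ranges or steps.
final class CronLineDemo {

    // Expands a field such as "5,34,55" into [5, 34, 55].
    static List<Integer> expandList(String field) {
        List<Integer> values = new ArrayList<>();
        for (String part : field.split(",")) {
            values.add(Integer.parseInt(part.trim()));
        }
        return values;
    }

    // Splits a crontab line into 5 schedule fields plus the command (6 parts).
    // The limit of 6 keeps spaces inside the command intact.
    static String[] splitLine(String line) {
        return line.trim().split("\\s+", 6);
    }

    public static void main(String[] args) {
        // Hypothetical entry: run a cleanup script at 02:05, 02:34 and 02:55 every day.
        String[] parts = splitLine("5,34,55 2 * * * /home/hadoop/cleanup-logs.sh --days 7");
        System.out.println("minutes = " + expandList(parts[0]));
        System.out.println("hour    = " + parts[1]);
        System.out.println("command = " + parts[5]);
    }
}
```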

7. What is the role of /etc/hosts file in setting up of HDFS cluster?

A) For hostname to IP address mapping.

8. What is dfsadmin command in Hadoop?

9. If one of the data nodes fails to start on the cluster, how will you come to know? And what actions need to be taken?

A) Via the HDFS web UI we can see the number of live, dead and decommissioned nodes. After handling the failed node we need to rebalance the cluster.

10. What is the impact if the namenode fails, and what are the necessary action items?

A) The entire HDFS will be down, and we need to restart the namenode after copying the fsimage and edits files from the secondary namenode.

11. What is Log4j?

A) Logging Framework

12. How do we set logging level for hadoop daemons/commands?

A) In log4j.properties or in hadoop-env.sh file, hadoop.root.logger=INFO,console (WARN,DRFA)

13. Is there any impact on mapreduce jobs if no mapred-site.xml file is created in the HADOOP_HOME/conf directory but all the necessary properties are defined in yarn-site.xml?

A) No.

14. How does Hadoop’s CLASSPATH play a vital role in starting or stopping Hadoop daemons?

A) The classpath contains the list of directories holding the jar files required to start/stop the daemons. For example, HADOOP_HOME/share/hadoop/common/lib contains all the common utility jar files.

15. What is the default logging level in hadoop?

A) hadoop.root.logger=INFO,console.

16. What is the ‘hadoop.tmp.dir’ configuration parameter default to ?

A) It defaults to /tmp/hadoop-${user.name}. The username is included because we need a directory that the user can write to without interfering with other users. If we did not include the username, different users would share the same tmp directory. This can cause authorization problems if a user’s default umask does not permit writes by others. It can also result in users stomping on each other when they are, e.g., playing with HDFS and re-format their filesystem.

17. How do we verify the status and health of the cluster?

A) Either by HDFS Web UI at http://namenode:50070/ or by $ hadoop dfsadmin -report.

18. What is the reason for the frequent exception connection refused in hadoop?

A) If there is no configuration error on the client machine or the namenode machine, a common cause is that the Hadoop service is not running. Also check that there is no entry for our hostname mapped to 127.0.0.1 or 127.0.1.1 in /etc/hosts.

19. How do we set a configuration property to be unique/constant across the cluster nodes and no slave nodes should override this?

A) We can achieve this by defining the property in the core/hdfs/mapred/yarn-site.xml file on the namenode with the final tag set to true, i.e. with <final>true</final> inside the property element alongside its name and value.

20. Does the name-node stay in safe mode till all under-replicated files are fully replicated?

A) No. The name-node waits until all or a majority of the data-nodes report their blocks, and it stays in safe mode until a specific percentage of the blocks in the system is minimally replicated. Minimally replicated is not the same as fully replicated.
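
The difference between "minimally" and "fully" replicated can be sketched in a few lines of Java. This is only an illustration of the rule described above, not the name-node's actual code; the class and method names are invented, and the values 1 (minimum replicas) and 0.999 (threshold) are assumed here to mirror the usual defaults of dfs.namenode.replication.min and dfs.namenode.safemode.threshold-pct.

```java
// Sketch of the safe-mode exit rule; not the NameNode's actual implementation.
final class SafeModeDemo {

    // True when the share of blocks with at least minReplicas reported
    // replicas has reached thresholdPct.
    static boolean canLeaveSafeMode(int[] reportedReplicas, int minReplicas, double thresholdPct) {
        if (reportedReplicas.length == 0) {
            return true; // an empty filesystem has nothing to wait for
        }
        int safeBlocks = 0;
        for (int replicas : reportedReplicas) {
            if (replicas >= minReplicas) {
                safeBlocks++;
            }
        }
        return (double) safeBlocks / reportedReplicas.length >= thresholdPct;
    }

    public static void main(String[] args) {
        // Every block has at least 1 replica reported, but none has reached a target of 3.
        int[] reported = {1, 2, 1, 1, 2, 1, 1, 2, 1, 1};
        // Minimal replication (1 replica) is satisfied, so safe mode can end.
        System.out.println(canLeaveSafeMode(reported, 1, 0.999)); // true
        // Requiring full replication (3 replicas) would keep the cluster in safe mode.
        System.out.println(canLeaveSafeMode(reported, 3, 0.999)); // false
    }
}
```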

1. What is Default replication factor and how will you change it at file level?

2. Why do we need replication factor > 1 in production Hadoop cluster?

3. How will you combine the 4 part-r files of a mapreduce job?

A) Using hadoop fs -getmerge

4. What are the Compression techniques in HDFS and which is the best one and why?

5. How will you view the compressed files via HDFS command?

A) hadoop fs -text

6. What is Secondary Namenode and its Functionalities? why do we need it?

7. What is Backup node and how is it different from Secondary namenode?

8. What is FSimage and editlogs and how they are related?

9. What is the default block size in HDFS, and why is it so large?

10. How will you copy a large file of 50GB into HDFS in parallel?

A) distcp

11. what is Balancing in HDFS?

12. What is expunge in HDFS ?

A) It empties the trash.

13. What is the default uri for HDFS WEB UI? Can we create files via HDFS WEB UI?

A) namenode:50070. No. It is read only

14. How can we check existence of non zero length file in HDFS commands

A) With hadoop fs -test -s <path>, which returns 0 if the path is not empty.

15. What is IOUtils in HDFS API and how is it useful?

16. Can we archive files in HDFS? If yes, how can we do that?

A) hadoop archive -archiveName NAME -p <parent path> src dest

17. What is safemode in Hadoop and what are the restrictions during safemode?

18. What is rack awareness in hadoop?

19. Can we come out of safe mode manually, if yes how?

A) Yes, with $ hadoop dfsadmin -safemode leave (the same command also accepts enter and get).

20. Why block size in hadoop is maintained as very big compared to traditional block size?

21. What are Sequence files and how are they different from text files?

22. What is the limitation of Sequence files?

A) They are supported only through the Java API; there is no API for other languages.

23. What are Avro files ?

24. Can an Avro file created with the Java API on one machine be read on another machine with the Ruby API?

A) Yes

25. Where is the schema of an Avro file stored if the file is transferred from one host to another?

A) In the file itself, as a header section.

26. How do we handle small files in HDFS?

A) merge into sequence/avro file or archive them into har files.

27. What is delegation token in Hadoop and why is it important?

28. What is fsck in Hadoop?

29. Can we append data records to an existing file in HDFS?

A) Yes, with the command $ hdfs dfs -appendToFile … It appends a single src, or multiple srcs, from the local file system to the destination file system. It also reads input from stdin and appends it to the destination file system.

30. Can we get count of files in a directory on HDFS via command line?

A) Yes by using command $ hdfs dfs -count hdfs://NN/file1

31. How do we achieve security on Hadoop cluster?

A) With Kerberos

32. Can we create multiple files in HDFS with different block sizes?

Yes. HDFS provides api to specify block size at the time of file creation. Below is the method signature:

public FSDataOutputStream create(Path f, boolean overwrite, int bufferSize, short replication, long blockSize) throws IOException;

33. What is the importance of dfs.namenode.name.dir?

It contains the fsimage file for the namenode. It should be configured to write to at least two filesystems on different physical hosts (namenode and secondary namenode), because if we lose the fsimage file we lose the entire HDFS file system, and there is no other recovery mechanism if no fsimage file is available.

34. What is the need for fsck in hadoop?

it can be used to determine the files with missing blocks.

35. Do HDFS block boundaries fall between records or across records?

HDFS does not provide record-oriented boundaries, so a block can end in the middle of a record.


Mapreduce Interview Questions and Answers:

1. What is Speculative execution?

2. What is Distributed Cache?

3. WorkFlow of MapReduce job?

A) map, partitioner, combiner (optional), shuffle and sort, reducer

4. How will you globally sort the output of mapreduce job?

A) With TotalOrderPartitioner

5. Difference between map side and reducer side Join?

6. What is Map reduce chaining?

7. How will you pass parameters to a mapper or reducer?

8. How will you create custom key and value types?

9. Sorting based on any column other than Key?

10. How will you create custom input formats?

11. How will you process huge number of small files in MR job?

A) After converting into sequence file/avro file

12. Can we run Reducer without Mapper?

A) Yes. In this case the identity mapper runs in the background to copy the input to the reducer.

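13. Do mapper and reducer tasks run in parallel? If not, why do we sometimes see progress such as (map 80%, reduce 10%)?

A) No. The reduce percentage reflects the data copy (shuffle) phase, in which reducers start fetching the output of map tasks that have already completed.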

14. How will you setup a custom counter to detect bad records in the input?

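A) context.getCounter(enumValue).increment(1), where enumValue is a constant of our own enum (for example one named for bad records)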

15. How will you schedule mapreduce Jobs?

A) Through Oozie or Azkaban

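16. What is a combiner? Tell me one scenario where it is not suitable.

A) It suits aggregate functions that are commutative and associative, such as sum, count, max or min. It is not suitable where partial results cannot simply be merged, for example when computing an average or a median.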
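
A small worked example shows why a combiner needs care with some aggregates: partial sums can be combined safely, whereas an average of partial averages is generally wrong. The sketch below is plain Java with no Hadoop classes; the two arrays are assumed to stand in for the outputs of two map tasks, and all names are illustrative.

```java
// Plain-Java illustration; the arrays stand in for the outputs of two map tasks.
final class CombinerDemo {

    static double sum(double[] values) {
        double total = 0;
        for (double v : values) {
            total += v;
        }
        return total;
    }

    static double mean(double[] values) {
        return sum(values) / values.length;
    }

    public static void main(String[] args) {
        double[] mapperA = {10, 20, 30};
        double[] mapperB = {40};
        double[] all = {10, 20, 30, 40};

        // Sum: adding the per-mapper partial sums equals summing everything.
        double combinedSum = sum(new double[] {sum(mapperA), sum(mapperB)});
        System.out.println(combinedSum == sum(all)); // true

        // Mean: averaging the per-mapper means (20 and 40) gives 30, but the true mean is 25.
        double meanOfMeans = mean(new double[] {mean(mapperA), mean(mapperB)});
        System.out.println(meanOfMeans + " vs " + mean(all)); // 30.0 vs 25.0
    }
}
```

A common workaround is to let the combiner emit (sum, count) pairs and do the division only in the reducer.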

17. How will you submit mapreduce job through command line?

18. How will you kill a running mapreduce job?

19. For a failed mapreduce job, how will you trace the root cause?

A) YARN web UI -> Logs -> userlogs -> application ID -> container -> stderr/syslog

20. What will you do if a mapreduce job failed with Java heap space error message?

A) Increase the -Xmx value in HADOOP_CLIENT_OPTS (client side) or in mapred.child.java.opts (task JVMs).

21. How many map tasks & reduce tasks will run on each datanode by default?

A) 2 map tasks and 1 reduce task

22) What is the minimum RAM capacity needed for this datanode?

As there are 3 JVMs running for the 3 tasks, and 1 datanode daemon also runs, at least 4 GB of RAM is needed, assuming that at least 1 GB is assigned to each YARN task.
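
The arithmetic behind that estimate can be written out as a tiny Java helper. This is a back-of-the-envelope sketch only: it assumes 1 GB for every JVM including the datanode daemon, ignores the operating system and any other daemons on the host, and the class and method names are invented.

```java
// Back-of-the-envelope estimate; ignores OS overhead and other daemons on the host.
final class DataNodeRamDemo {

    // Every task JVM and every daemon is assumed to need the same amount of memory.
    static int minimumRamGb(int mapTasks, int reduceTasks, int daemons, int gbPerJvm) {
        return (mapTasks + reduceTasks + daemons) * gbPerJvm;
    }

    public static void main(String[] args) {
        // 2 map tasks + 1 reduce task + 1 datanode daemon, 1 GB each.
        System.out.println(minimumRamGb(2, 1, 1, 1) + " GB"); // 4 GB
    }
}
```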

22. What is difference between Mapreduce and YARN?

23. What is Tez framework?

A) An alternative framework for mapreduce, it can be used in Yarn in place of mapreduce

24. What is the difference between Tez and Mapreduce ?

A) Tez is at least 2 times faster than Mapreduce

25. What is input split, input format and record reader in Mapreduce programming?

26. Does Mapreduce support processing of Avro files? If yes, what are the main classes of the API?

27. How will you process a dataset in JSON format in mapreduce job?

A) JSONObject class can be used to parse the JSON records in the dataset

28. Can we create multi level directory structure (year/month/date) in Mapreduce based on the input data?

A) Yes, by using MultipleOutputs

29. What is the relation between TextOutputFormat and KeyValueTextInputFormat?

A) second one is used to read the files created by first one

30. What is LazyOutputFormat in Mapreduce and why do we need it?

A) It creates output files only if there is data to write, which avoids empty part files.

31. How do we prevent file splitting in Mapreduce ?

A) By returning false from the isSplitable method of our custom InputFormat class

32. What is the difference between Writable and WritableComparable interfaces? And what is sufficient for value type in MR job?

A) Writable is sufficient for value types. Keys need WritableComparable because they are sorted.

33. What is the Role of Application Master in running Mapreduce job through YARN?

34. What is Uber task ?

35. What are IdentityMapper & IdentityReducer classes?

36. How do we create jar file with .class files in a directory through command line?

37. What is the default port for YARN Web UI?

A) 8088

38. How can we distribute our application’s jars to all of the nodes in the YARN cluster that need it?

39. How do we include native libraries in YARN jobs?

A) by using -Djava.library.path option on the command or else by setting LD_LIBRARY_PATH in .bashrc file.

40. What is the default scheduler inside YARN framework for starting tasks?

A) CapacityScheduler

41. How do we handle record boundaries in text files or sequence files in Mapreduce input splits?

In Mapreduce, an InputSplit’s RecordReader starts and ends at a record boundary. In SequenceFiles, every 2k bytes has a 20-byte sync mark between the records. These sync marks allow the RecordReader to seek to the start of the InputSplit (which contains a file, offset and length) and find the first sync mark after the start of the split. The RecordReader continues processing records until it reaches the first sync mark after the end of the split. Text files are handled similarly, using newlines instead of sync marks.
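
The text-file case can be sketched in plain Java. The code below is a simplified stand-in for a line-oriented RecordReader, not Hadoop's own implementation: a split that does not start at offset 0 skips ahead to the first newline, and every split keeps reading whole lines as long as a line starts at or before the split's end. The class name, method names and sample data are all illustrative assumptions.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Simplified stand-in for a line-oriented record reader; not Hadoop's actual code.
final class SplitBoundaryDemo {

    // Returns the complete lines that belong to the split [start, start + length).
    static List<String> readSplit(byte[] data, int start, int length) {
        int end = start + length;
        int pos = start;
        if (start != 0) {
            // Skip the first (possibly partial) line; the previous split reads it.
            while (pos < data.length && data[pos] != '\n') {
                pos++;
            }
            pos++; // step over the newline
        }
        List<String> records = new ArrayList<>();
        // Keep reading while a line starts at or before the end of the split,
        // even if that line itself runs past the end.
        while (pos <= end && pos < data.length) {
            int lineEnd = pos;
            while (lineEnd < data.length && data[lineEnd] != '\n') {
                lineEnd++;
            }
            records.add(new String(data, pos, lineEnd - pos, StandardCharsets.UTF_8));
            pos = lineEnd + 1;
        }
        return records;
    }

    // Cuts the data into fixed-size splits and collects the records of all of them.
    static List<String> readAllSplits(byte[] data, int splitSize) {
        List<String> all = new ArrayList<>();
        for (int start = 0; start < data.length; start += splitSize) {
            int length = Math.min(splitSize, data.length - start);
            all.addAll(readSplit(data, start, length));
        }
        return all;
    }

    public static void main(String[] args) {
        byte[] data = "alpha\nbravo\ncharlie\ndelta\n".getBytes(StandardCharsets.UTF_8);
        // A 10-byte split size cuts "bravo" in the middle, yet each record appears exactly once.
        System.out.println(readSplit(data, 0, 10));  // [alpha, bravo]
        System.out.println(readSplit(data, 10, 10)); // [charlie, delta]
        System.out.println(readSplit(data, 20, 6));  // []
    }
}
```

Whatever split size is chosen, each line is returned by exactly one split, which is the property the record reader has to guarantee.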

42. Sometimes mapreduce jobs fail if we submit the same job as a different user. What is the cause and how do we fix it?

A) It might be because mapreduce.jobtracker.system.dir is not set.

43. How do we change the default location of a mapreduce job’s intermediate data?

A) By changing the value of mapreduce.cluster.local.dir

44. If a map task fails once during mapreduce job execution, will the job fail immediately?

A) No, it will retry the task up to the maximum number of attempts allowed for map/reduce tasks, by default it
