1 :-
You need to analyze 60,000,000 images stored in JPEG format, each of which is approximately 25 KB. Because your Hadoop cluster isn’t optimized for storing and processing many small files, you decide to take the following actions: (1) group the individual images into a set of larger files; (2) use the set of larger files as input for a MapReduce job that processes them directly with Python using Hadoop Streaming. Which data serialization system gives you the flexibility to do this?
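To get a feel for the scale in this question, the arithmetic can be sketched as below. The 128 MB container-file size is a hypothetical value chosen for illustration (the question does not state one), KB and MB are treated as binary units, and any per-record overhead of the container format is ignored.

```python
import math

IMAGE_COUNT = 60_000_000
IMAGE_SIZE_KB = 25                      # approximate size from the question
CONTAINER_FILE_MB = 128                 # hypothetical, not given in the question

# Total raw data volume, in KB and (decimal) GB.
total_kb = IMAGE_COUNT * IMAGE_SIZE_KB
total_gb_decimal = total_kb / 1_000_000

# How many whole images fit in one container file, ignoring format overhead.
images_per_file = (CONTAINER_FILE_MB * 1024) // IMAGE_SIZE_KB

# How many container files are needed to hold every image.
container_files = math.ceil(IMAGE_COUNT / images_per_file)

print(total_kb, total_gb_decimal, images_per_file, container_files)
```

Under these assumptions, tens of millions of small files collapse into roughly eleven thousand larger ones.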
2 :-
During the execution of a MapReduce v2 (MRv2) job on YARN, where does the Mapper place the intermediate data of each Map Task?
3 :-
Your cluster has the following characteristics:
- A rack-aware topology is configured and on
- Replication is set to 3
- Cluster block size is set to 64 MB
Which describes the file read process when a client application connects to the cluster and requests a 50 MB file?
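As a quick check on the numbers in this question, the sketch below works out how many HDFS blocks a 50 MB file occupies at a 64 MB block size and how many block replicas exist at replication 3. The function name `block_count` is illustrative only and is not part of any Hadoop API.

```python
import math

def block_count(file_mb: float, block_mb: float) -> int:
    """Number of HDFS blocks needed for a file (a non-empty file uses at least one)."""
    return max(1, math.ceil(file_mb / block_mb))

FILE_MB = 50
BLOCK_MB = 64
REPLICATION = 3

blocks = block_count(FILE_MB, BLOCK_MB)
replicas_total = blocks * REPLICATION

print(blocks, replicas_total)
```

Because the file is smaller than one block, it is stored as a single block with three replicas.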
4 :-
You suspect that your NameNode is incorrectly configured and is swapping memory to disk. Which Linux commands help you identify whether swapping is occurring? (Select all that apply)
5 :-
You want to understand more about how users browse your public website. For example, you want to know which pages they visit prior to placing an order. You have a server farm of 200 web servers hosting your website. Which is the most efficient process to gather these web server logs into your Hadoop cluster for analysis?
6 :-
You have recently converted your Hadoop cluster from a MapReduce 1 (MRv1) architecture to a MapReduce 2 (MRv2) on YARN architecture. Your developers are accustomed to specifying map and reduce tasks (resource allocation) when they run jobs. A developer wants to know how to specify reduce tasks when a specific job runs. Which method should you tell that developer to implement?
7 :-
Each node in your Hadoop cluster, running YARN, has 64 GB of memory and 24 cores. Your yarn-site.xml has the following configuration: yarn.nodemanager.resource.memory-mb = 32768 and yarn.nodemanager.resource.cpu-vcores = 12. You want YARN to launch no more than 16 containers per node. What should you do?
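The relationship between node memory and container count can be sketched as follows. This assumes containers are limited by memory only (vcores are not considered) and that every container is sized at the scheduler's minimum allocation, as set by `yarn.scheduler.minimum-allocation-mb`; the helper names are illustrative.

```python
import math

NODE_MEMORY_MB = 32768   # yarn.nodemanager.resource.memory-mb from the question
CONTAINER_CAP = 16       # desired upper bound on containers per node

def max_containers(node_memory_mb: int, min_allocation_mb: int) -> int:
    """Most containers a node can host if each one gets the minimum allocation."""
    return node_memory_mb // min_allocation_mb

def min_allocation_for_cap(node_memory_mb: int, container_cap: int) -> int:
    """Smallest minimum allocation (MB) that keeps the node at or below the cap."""
    return math.ceil(node_memory_mb / container_cap)

required_min_mb = min_allocation_for_cap(NODE_MEMORY_MB, CONTAINER_CAP)
print(required_min_mb, max_containers(NODE_MEMORY_MB, required_min_mb))
```

With a smaller minimum allocation, for example 1024 MB, the same node could host up to 32 containers under these assumptions.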
8 :-
Your Hadoop cluster is configured with HDFS and MapReduce version 2 (MRv2) on YARN. Can you configure a worker node to run a NodeManager daemon but not a DataNode daemon and still have a functional cluster?
9 :-
You are working on a project where you need to chain together MapReduce and Pig jobs. You also need the ability to use forks, decision points, and path joins. Which ecosystem project should you use to perform these actions?
10 :-
Which two features does Kerberos security add to a Hadoop cluster?
11 :-
Your cluster’s mapred-site.xml includes the following parameters: mapreduce.map.memory.mb = 4096 and mapreduce.reduce.memory.mb = 8192. Your cluster’s yarn-site.xml includes the following parameter: yarn.nodemanager.vmem-pmem-ratio = 2.1. What is the maximum amount of virtual memory allocated for each map task before YARN will kill its Container?
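The virtual memory ceiling for a container is its physical memory allocation multiplied by `yarn.nodemanager.vmem-pmem-ratio`. A minimal sketch of that arithmetic, using the values from the question (the function name is illustrative):

```python
MAP_MEMORY_MB = 4096      # mapreduce.map.memory.mb
REDUCE_MEMORY_MB = 8192   # mapreduce.reduce.memory.mb
VMEM_PMEM_RATIO = 2.1     # yarn.nodemanager.vmem-pmem-ratio

def vmem_limit_mb(physical_mb: int, ratio: float) -> float:
    """Virtual memory limit (MB) for a container with the given physical allocation."""
    return physical_mb * ratio

map_vmem_mb = vmem_limit_mb(MAP_MEMORY_MB, VMEM_PMEM_RATIO)
reduce_vmem_mb = vmem_limit_mb(REDUCE_MEMORY_MB, VMEM_PMEM_RATIO)

print(round(map_vmem_mb, 1), round(reduce_vmem_mb, 1))
```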
12 :-
You are running a Hadoop cluster with a NameNode on host mynamenode. What are two ways to determine available HDFS space in your cluster?
13 :-
You observe that the number of spilled records from Map tasks far exceeds the number of map output records. Your child heap size is 1 GB and your io.sort.mb value is set to 1000 MB. How would you tune your io.sort.mb value to achieve the maximum memory-to-disk I/O ratio?
14 :-
Which scheduler would you deploy to ensure that your cluster allows short jobs to finish within a reasonable time without starving long-running jobs?
15 :-
You are running a Hadoop cluster with all monitoring facilities properly configured. Which scenario will go undetected?
16 :-
Which YARN daemon or service monitors a Container’s per-application resource usage (e.g., memory, CPU)?
17 :-
Which YARN process runs as “container 0” of a submitted job and is responsible for resource requests?
18 :-
Which command does Hadoop offer to discover missing or corrupt HDFS data?
19 :-
Table schemas in Hive are:
20 :-
You have a Hadoop cluster running HDFS, and a gateway machine external to the cluster from which clients submit jobs. What do you need to do in order to run Impala on the cluster and submit jobs from the command line of the gateway machine?
Tips for improving your score: