
Databricks-Certified-Associate-Developer-for-Apache-Spark-3.0 Questions and Answers (ExamsMirror)


Viewing page 4 out of 6 pages

Viewing questions 31-40 out of questions

Questions # 31:

Which of the following code blocks can be used to save DataFrame transactionsDf to memory only, recalculating partitions that do not fit in memory when they are needed?

Options:

from pyspark import StorageLevel

transactionsDf.cache(StorageLevel.MEMORY_ONLY)

transactionsDf.cache()

transactionsDf.storage_level('MEMORY_ONLY')

transactionsDf.persist()

transactionsDf.clear_persist()

from pyspark import StorageLevel

transactionsDf.persist(StorageLevel.MEMORY_ONLY)

Questions # 32:

Which of the following statements about Spark's execution hierarchy is correct?

Options:

In Spark's execution hierarchy, a job may reach over multiple stage boundaries.

In Spark's execution hierarchy, manifests are one layer above jobs.

In Spark's execution hierarchy, a stage comprises multiple jobs.

In Spark's execution hierarchy, executors are the smallest unit.

In Spark's execution hierarchy, tasks are one layer above slots.

Questions # 33:

Which of the following statements about RDDs is incorrect?

Options:

An RDD consists of a single partition.

The high-level DataFrame API is built on top of the low-level RDD API.

RDDs are immutable.

RDD stands for Resilient Distributed Dataset.

RDDs are great for precisely instructing Spark on how to do a query.

Questions # 34:

Which of the following code blocks generally causes a great amount of network traffic?

Options:

DataFrame.select()

DataFrame.coalesce()

DataFrame.collect()

DataFrame.rdd.map()

DataFrame.count()

Questions # 35:

The code block displayed below contains an error. The code block should create DataFrame itemsAttributesDf which has columns itemId and attribute and lists every attribute from the attributes column in DataFrame itemsDf next to the itemId of the respective row in itemsDf. Find the error.

A sample of DataFrame itemsDf is below.

[Figure: sample of DataFrame itemsDf]

Code block:

itemsAttributesDf = itemsDf.explode("attributes").alias("attribute").select("attribute", "itemId")

Options:

Since itemId is the index, it does not need to be an argument to the select() method.

The alias() method needs to be called after the select() method.

The explode() method expects a Column object rather than a string.

explode() is not a method of DataFrame. explode() should be used inside the select() method instead.

The split() method should be used inside the select() method instead of the explode() method.

Questions # 36:

Which of the following code blocks reads the parquet file stored at filePath into DataFrame itemsDf, using a valid schema for the sample of itemsDf shown below?

Sample of itemsDf:

+------+-----------------------------+-------------------+
|itemId|attributes                   |supplier           |
+------+-----------------------------+-------------------+
|1     |[blue, winter, cozy]         |Sports Company Inc.|
|2     |[red, summer, fresh, cooling]|YetiX              |
|3     |[green, summer, travel]      |Sports Company Inc.|
+------+-----------------------------+-------------------+

Options:

itemsDfSchema = StructType([
    StructField("itemId", IntegerType()),
    StructField("attributes", StringType()),
    StructField("supplier", StringType())])
itemsDf = spark.read.schema(itemsDfSchema).parquet(filePath)

itemsDfSchema = StructType([
    StructField("itemId", IntegerType),
    StructField("attributes", ArrayType(StringType)),
    StructField("supplier", StringType)])
itemsDf = spark.read.schema(itemsDfSchema).parquet(filePath)

itemsDf = spark.read.schema('itemId integer, attributes , supplier string').parquet(filePath)

itemsDfSchema = StructType([
    StructField("itemId", IntegerType()),
    StructField("attributes", ArrayType(StringType())),
    StructField("supplier", StringType())])
itemsDf = spark.read.schema(itemsDfSchema).parquet(filePath)

itemsDfSchema = StructType([
    StructField("itemId", IntegerType()),
    StructField("attributes", ArrayType([StringType()])),
    StructField("supplier", StringType())])
itemsDf = spark.read(schema=itemsDfSchema).parquet(filePath)

Questions # 37:

Which of the following code blocks prints out in how many rows the expression Inc. appears in the string-type column supplier of DataFrame itemsDf?

Options:

counter = 0
for index, row in itemsDf.iterrows():
    if 'Inc.' in row['supplier']:
        counter = counter + 1
print(counter)

counter = 0
def count(x):
    if 'Inc.' in x['supplier']:
        counter = counter + 1
itemsDf.foreach(count)
print(counter)

print(itemsDf.foreach(lambda x: 'Inc.' in x))

print(itemsDf.foreach(lambda x: 'Inc.' in x).sum())

accum=sc.accumulator(0)
def check_if_inc_in_supplier(row):
    if 'Inc.' in row['supplier']:
        accum.add(1)
itemsDf.foreach(check_if_inc_in_supplier)
print(accum.value)

Questions # 38:

Which of the following is one of the big performance advantages that Spark has over Hadoop?

Options:

Spark achieves great performance by storing data in the DAG format, whereas Hadoop can only use parquet files.

Spark achieves higher resiliency for queries since, different from Hadoop, it can be deployed on Kubernetes.

Spark achieves great performance by storing data and performing computation in memory, whereas large jobs in Hadoop require a large amount of relatively slow disk I/O operations.

Spark achieves great performance by storing data in the HDFS format, whereas Hadoop can only use parquet files.

Spark achieves performance gains for developers by extending Hadoop's DataFrames with a user-friendly API.

Questions # 39:

The code block displayed below contains an error. The code block should combine data from DataFrames itemsDf and transactionsDf, showing all rows of DataFrame itemsDf that have a matching value in column itemId with a value in column transactionId of DataFrame transactionsDf. Find the error.

Code block:

itemsDf.join(itemsDf.itemId==transactionsDf.transactionId)

Options:

The join statement is incomplete.

The union method should be used instead of join.

The join method is inappropriate.

The merge method should be used instead of join.

The join expression is malformed.

Questions # 40:

Which of the following is a problem with using accumulators?

Options:

Only unnamed accumulators can be inspected in the Spark UI.

Only numeric values can be used in accumulators.

Accumulator values can only be read by the driver, but not by executors.

Accumulators do not obey lazy evaluation.

Accumulators are difficult to use for debugging because they will only be updated once, independent if a task has to be re-run due to hardware failure.

Answer

Explanation

Accumulator values can only be read by the driver, but not by executors.

Correct. This means, for example, that you cannot use an accumulator to coordinate workloads between executors. The canonical use case of an accumulator is to report data back to the driver, for instance for debugging purposes. If you wanted to count the values that match a specific condition inside a UDF while debugging, an accumulator provides a good way to do that.

Only numeric values can be used in accumulators.

No. While PySpark's built-in accumulators only support numeric values (think int and float), you can define accumulators for custom types via the AccumulatorParam interface (see the documentation listed below).

Accumulators do not obey lazy evaluation.

Incorrect. Accumulators do obey lazy evaluation. This has implications in practice: when an accumulator is updated inside a transformation, that accumulator will not be modified until a subsequent action is run.

Accumulators are difficult to use for debugging because they will only be updated once, independent if a task has to be re-run due to hardware failure.

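Wrong. A concern with accumulators is in fact the opposite: under certain conditions, an update can be applied more than once per task. For example, if a hardware failure occurs during a task after an accumulator variable has been increased but before the task has finished, and Spark launches the task on a different worker in response to the failure, the accumulator increases that were already executed will be repeated. According to the RDD Programming Guide, Spark guarantees that each task's update is applied only once for updates performed inside actions; for updates performed inside transformations, an update may be applied more than once if tasks or job stages are re-executed.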

Only unnamed accumulators can be inspected in the Spark UI.

No. Currently, in PySpark, no accumulators can be inspected in the Spark UI. In the Scala interface of Spark, only named accumulators can be inspected in the Spark UI.

More info: Aggregating Results with Spark Accumulators | Sparkour; RDD Programming Guide (Spark 3.1.2 Documentation); pyspark.Accumulator (PySpark 3.1.2 documentation); and pyspark.AccumulatorParam (PySpark 3.1.2 documentation)
