
Databricks-Certified-Associate-Developer-for-Apache-Spark-3.5 exam questions and answers from ExamsMirror

Question # 1:

A developer wants to test Spark Connect with an existing Spark application.

What are the two alternative ways the developer can start a local Spark Connect server without changing their existing application code? (Choose 2 answers)

Options:

A.

Execute their pyspark shell with the option --remote "https://localhost"

B.

Execute their pyspark shell with the option --remote "sc://localhost"

C.

Set the environment variable SPARK_REMOTE="sc://localhost" before starting the pyspark shell

D.

Add .remote("sc://localhost") to their SparkSession.builder calls in their Spark code

E.

Ensure the Spark property spark.connect.grpc.binding.port is set to 15002 in the application code
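
For reference, a minimal sketch of the environment-variable route: assuming a Spark Connect server is already running locally on the default gRPC port 15002, setting SPARK_REMOTE before the session is created routes an unmodified application through Spark Connect. The variable is normally exported in the shell before launching pyspark; setting it from Python here is only for illustration.

    import os

    # Assumption: a local Spark Connect server is already listening on the
    # default gRPC port 15002. Normally this variable is exported in the
    # shell before launching pyspark; it is set here only for illustration.
    os.environ["SPARK_REMOTE"] = "sc://localhost"

    from pyspark.sql import SparkSession

    # Existing application code stays unchanged: the builder picks up
    # SPARK_REMOTE and connects through Spark Connect.
    spark = SparkSession.builder.getOrCreate()
    print(spark.range(5).count())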

Question # 2:

Which command overwrites an existing JSON file when writing a DataFrame?

Options:

A.

df.write.mode("overwrite").json("path/to/file")

B.

df.write.overwrite.json("path/to/file")

C.

df.write.json("path/to/file", overwrite=True)

D.

df.write.format("json").save("path/to/file", mode="overwrite")
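
Both the mode("overwrite") setter and the mode= argument to save() are standard DataFrameWriter usage and behave the same. A minimal sketch, assuming an active SparkSession named spark and a writable path:

    # Assumes an active SparkSession `spark`; the path is illustrative.
    df = spark.range(3)

    df.write.mode("overwrite").json("/tmp/demo_json")                   # setter form
    df.write.format("json").save("/tmp/demo_json", mode="overwrite")    # keyword form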

Question # 3:

A data engineer is working on a streaming DataFrame streaming_df with the given streaming data:

[streaming data sample not shown]

Which operation is supported with streaming_df?

Options:

A.

streaming_df.select(countDistinct("Name"))

B.

streaming_df.groupby("Id").count()

C.

streaming_df.orderBy("timestamp").limit(4)

D.

streaming_df.filter(col("count") < 30).show()
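
As a point of reference, grouped aggregations such as groupby(...).count() are supported on streaming DataFrames, while distinct counts, sorting without aggregation, and .show() are not. A minimal runnable sketch using the built-in rate test source (the column names come from that source; everything else is illustrative):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # The built-in "rate" source emits columns `timestamp` and `value`.
    streaming_df = spark.readStream.format("rate").load()

    # Supported: a grouped count on a streaming DataFrame.
    counts = streaming_df.groupBy("value").count()

    # Streaming queries are started with writeStream, not .show().
    query = (counts.writeStream
             .outputMode("complete")
             .format("console")
             .start())
    # query.awaitTermination()  # uncomment to keep the query running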

Question # 4:

Given this code:

.withWatermark("event_time", "10 minutes")
.groupBy(window("event_time", "15 minutes"))
.count()

What happens to data that arrives after the watermark threshold?

Options:

A.

Records that arrive later than the watermark threshold (10 minutes) will automatically be included in the aggregation if they fall within the 15-minute window.

B.

Any data arriving more than 10 minutes after the watermark threshold will be ignored and not included in the aggregation.

C.

Data arriving more than 10 minutes after the latest watermark will still be included in the aggregation but will be placed into the next window.

D.

The watermark ensures that late data arriving within 10 minutes of the latest event_time will be processed and included in the windowed aggregation.
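
A minimal sketch of the pattern in the question, assuming a streaming DataFrame events with an event_time timestamp column; in Spark's semantics, records that fall more than 10 minutes behind the latest event_time seen so far drop behind the watermark and are excluded from the aggregation:

    from pyspark.sql.functions import window

    # `events` is an assumed streaming DataFrame with an `event_time` column.
    windowed_counts = (events
        .withWatermark("event_time", "10 minutes")    # tolerate up to 10 min of lateness
        .groupBy(window("event_time", "15 minutes"))  # 15-minute tumbling windows
        .count())
    # Data arriving more than 10 minutes behind the max event_time observed
    # so far is considered too late and is not included in the counts.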

Question # 5:

An engineer notices a significant increase in execution time for a Spark job. After some investigation, the engineer decides to check the logs produced by the Executors.

How should the engineer retrieve the Executor logs to diagnose performance issues in the Spark application?

Options:

A.

Locate the executor logs on the Spark master node, typically under the /tmp directory.

B.

Use the command spark-submit with the --verbose flag to print the logs to the console.

C.

Use the Spark UI to select the stage and view the executor logs directly from the stages tab.

D.

Fetch the logs by running a Spark job with the spark-sql CLI tool.

Question # 6:

Which configuration can be enabled to optimize the conversion between Pandas and PySpark DataFrames using Apache Arrow?

Options:

A.

spark.conf.set("spark.pandas.arrow.enabled", "true")

B.

spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")

C.

spark.conf.set("spark.sql.execution.arrow.enabled", "true")

D.

spark.conf.set("spark.sql.arrow.pandas.enabled", "true")

Question # 7:

A data engineer observes that an upstream streaming source sends duplicate records, where duplicates share the same key and have at most a 30-minute difference in event_timestamp. The engineer adds:

dropDuplicatesWithinWatermark("event_timestamp", "30 minutes")

What is the result?

Options:

A.

It is not able to handle deduplication in this scenario

B.

It removes duplicates that arrive within the 30-minute window specified by the watermark

C.

It removes all duplicates regardless of when they arrive

D.

It accepts watermarks in seconds and the code results in an error
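
Note that in the Spark 3.5 API, dropDuplicatesWithinWatermark() takes the key columns, and the lateness bound comes from a preceding withWatermark() call; the single-call form in the question stem reads as shorthand for that pattern. A minimal sketch, with events and the column names assumed for illustration:

    # `events` is an assumed streaming DataFrame with a duplicate `key`
    # column and an `event_timestamp` column.
    deduped = (events
        .withWatermark("event_timestamp", "30 minutes")
        .dropDuplicatesWithinWatermark(["key"]))
    # Duplicates whose event_timestamp values differ by at most the
    # 30-minute watermark delay are removed; state for older keys is
    # eventually purged.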

Question # 8:

Given the following code snippet in my_spark_app.py:

[code snippet not shown]

What is the role of the driver node?

Options:

A.

The driver node orchestrates the execution by transforming actions into tasks and distributing them to worker nodes

B.

The driver node only provides the user interface for monitoring the application

C.

The driver node holds the DataFrame data and performs all computations locally

D.

The driver node stores the final result after computations are completed by worker nodes
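
As a reference point: transformations are lazy, and it is an action that makes the driver turn the logical plan into tasks and schedule them on executors; the rows themselves are processed on the workers. A minimal sketch (names are illustrative):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("my_spark_app").getOrCreate()

    df = spark.range(1_000_000)   # transformation: lazy, nothing executes yet
    total = df.count()            # action: the driver schedules tasks on executors
    print(total)                  # only the small aggregated result reaches the driver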

Question # 9:

A Spark engineer is troubleshooting a Spark application that has been encountering out-of-memory errors during execution. By reviewing the Spark driver logs, the engineer notices multiple "GC overhead limit exceeded" messages.

Which action should the engineer take to resolve this issue?

Options:

A.

Optimize the data processing logic by repartitioning the DataFrame.

B.

Modify the Spark configuration to disable garbage collection.

C.

Increase the memory allocated to the Spark Driver.

D.

Cache large DataFrames to persist them in memory.
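
One caveat worth remembering when raising driver memory: spark.driver.memory must be in place before the driver JVM starts, so it is usually passed at submit time (spark-submit --driver-memory 8g app.py) or set in spark-defaults.conf. A minimal sketch of the builder form, which only takes effect when the session being created is the one launching the JVM (for example, a plain local run); the 8g value is illustrative:

    from pyspark.sql import SparkSession

    # Illustrative value; must be set before the driver JVM starts, so prefer
    # spark-submit --driver-memory or spark-defaults.conf when submitting.
    spark = (SparkSession.builder
             .appName("gc-overhead-fix")
             .config("spark.driver.memory", "8g")
             .getOrCreate())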

Question # 10:

A data scientist has identified that some records in the user profile table contain null values in one or more fields, and such records should be removed from the dataset before processing. The schema includes fields like user_id, username, date_of_birth, created_ts, etc.

The schema of the user profile table looks like this:

[schema not shown]

Which block of Spark code can be used to achieve this requirement?

Options:

A.

filtered_df = users_raw_df.na.drop(thresh=0)

B.

filtered_df = users_raw_df.na.drop(how='all')

C.

filtered_df = users_raw_df.na.drop(how='any')

D.

filtered_df = users_raw_df.na.drop(how='all', thresh=None)
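
A small runnable sketch with illustrative data: how='any' drops a row when at least one column is null, while how='all' would only drop rows where every column is null.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    users_raw_df = spark.createDataFrame(
        [(1, "ana", "1990-01-01"), (2, None, "1985-06-30"), (3, "raj", None)],
        ["user_id", "username", "date_of_birth"],
    )

    filtered_df = users_raw_df.na.drop(how="any")
    filtered_df.show()   # only the fully populated row (user_id = 1) remains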
