
Databricks-Certified-Associate-Developer-for-Apache-Spark-3.5 exam questions and answers from ExamsMirror

Question # 1:

A developer wants to test Spark Connect with an existing Spark application.

What are the two alternative ways the developer can start a local Spark Connect server without changing their existing application code? (Choose 2 answers)

Options:

A.

Execute their pyspark shell with the option --remote "https://localhost"

B.

Execute their pyspark shell with the option --remote "sc://localhost"

C.

Set the environment variable SPARK_REMOTE="sc://localhost" before starting the pyspark shell

D.

Add .remote("sc://localhost") to their SparkSession.builder calls in their Spark code

E.

Ensure the Spark property spark.connect.grpc.binding.port is set to 15002 in the application code
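
For reference, a minimal sketch of the environment-variable route: assuming a Spark Connect server is already running locally on the default gRPC port 15002, setting SPARK_REMOTE before the session is created routes an unmodified application through Spark Connect. The variable is normally exported in the shell before launching pyspark; setting it from Python here is only for illustration.

    import os

    # Assumption: a local Spark Connect server is already listening on the
    # default gRPC port 15002. Normally this variable is exported in the
    # shell before launching pyspark; it is set here only for illustration.
    os.environ["SPARK_REMOTE"] = "sc://localhost"

    from pyspark.sql import SparkSession

    # Existing application code stays unchanged: the builder picks up
    # SPARK_REMOTE and connects through Spark Connect.
    spark = SparkSession.builder.getOrCreate()
    print(spark.range(5).count())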

Question # 2:

Which command overwrites an existing JSON file when writing a DataFrame?

Options:

A.

df.write.mode("overwrite").json("path/to/file")

B.

df.write.overwrite.json("path/to/file")

C.

df.write.json("path/to/file", overwrite=True)

D.

df.write.format("json").save("path/to/file", mode="overwrite")
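
Both the mode("overwrite") setter and the mode= argument to save() are standard DataFrameWriter usage and behave the same. A minimal sketch, assuming an active SparkSession named spark and a writable path:

    # Assumes an active SparkSession `spark`; the path is illustrative.
    df = spark.range(3)

    df.write.mode("overwrite").json("/tmp/demo_json")                   # setter form
    df.write.format("json").save("/tmp/demo_json", mode="overwrite")    # keyword form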

Question # 3:

A data engineer is working on a streaming DataFrame streaming_df with the given streaming data:

[streaming data sample not shown]

Which operation is supported with streaming_df?

Options:

A.

streaming_df.select(countDistinct("Name"))

B.

streaming_df.groupby("Id").count()

C.

streaming_df.orderBy("timestamp").limit(4)

D.

streaming_df.filter(col("count") < 30).show()
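
As a point of reference, grouped aggregations such as groupby(...).count() are supported on streaming DataFrames, while distinct counts, sorting without aggregation, and .show() are not. A minimal runnable sketch using the built-in rate test source (the column names come from that source; everything else is illustrative):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # The built-in "rate" source emits columns `timestamp` and `value`.
    streaming_df = spark.readStream.format("rate").load()

    # Supported: a grouped count on a streaming DataFrame.
    counts = streaming_df.groupBy("value").count()

    # Streaming queries are started with writeStream, not .show().
    query = (counts.writeStream
             .outputMode("complete")
             .format("console")
             .start())
    # query.awaitTermination()  # uncomment to keep the query running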

Question # 4:

Given this code:

.withWatermark("event_time", "10 minutes")
.groupBy(window("event_time", "15 minutes"))
.count()

What happens to data that arrives after the watermark threshold?

Options:

A.

Records that arrive later than the watermark threshold (10 minutes) will automatically be included in the aggregation if they fall within the 15-minute window.

B.

Any data arriving more than 10 minutes after the watermark threshold will be ignored and not included in the aggregation.

C.

Data arriving more than 10 minutes after the latest watermark will still be included in the aggregation but will be placed into the next window.

D.

The watermark ensures that late data arriving within 10 minutes of the latest event_time will be processed and included in the windowed aggregation.
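
A minimal sketch of the pattern in the question, assuming a streaming DataFrame events with an event_time timestamp column; in Spark's semantics, records that fall more than 10 minutes behind the latest event_time seen so far drop behind the watermark and are excluded from the aggregation:

    from pyspark.sql.functions import window

    # `events` is an assumed streaming DataFrame with an `event_time` column.
    windowed_counts = (events
        .withWatermark("event_time", "10 minutes")    # tolerate up to 10 min of lateness
        .groupBy(window("event_time", "15 minutes"))  # 15-minute tumbling windows
        .count())
    # Data arriving more than 10 minutes behind the max event_time observed
    # so far is considered too late and is not included in the counts.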

Question # 5:

An engineer notices a significant increase in execution time for a Spark job. After some investigation, the engineer decides to check the logs produced by the Executors.

How should the engineer retrieve the Executor logs to diagnose performance issues in the Spark application?

Options:

A.

Locate the executor logs on the Spark master node, typically under the /tmp directory.

B.

Use the command spark-submit with the --verbose flag to print the logs to the console.

C.

Use the Spark UI to select the stage and view the executor logs directly from the stages tab.

D.

Fetch the logs by running a Spark job with the spark-sql CLI tool.

Question # 6:

Which configuration can be enabled to optimize the conversion between Pandas and PySpark DataFrames using Apache Arrow?

Options:

A.

spark.conf.set("spark.pandas.arrow.enabled", "true")

B.

spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")

C.

spark.conf.set("spark.sql.execution.arrow.enabled", "true")

D.

spark.conf.set("spark.sql.arrow.pandas.enabled", "true")

Question # 7:

A data engineer observes that an upstream streaming source sends duplicate records, where duplicates share the same key and have at most a 30-minute difference in event_timestamp. The engineer adds:

dropDuplicatesWithinWatermark("event_timestamp", "30 minutes")

What is the result?

Options:

A.

It is not able to handle deduplication in this scenario

B.

It removes duplicates that arrive within the 30-minute window specified by the watermark

C.

It removes all duplicates regardless of when they arrive

D.

It accepts watermarks in seconds and the code results in an error
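
Note that in the Spark 3.5 API, dropDuplicatesWithinWatermark() takes the key columns, and the lateness bound comes from a preceding withWatermark() call; the single-call form in the question stem reads as shorthand for that pattern. A minimal sketch, with events and the column names assumed for illustration:

    # `events` is an assumed streaming DataFrame with a duplicate `key`
    # column and an `event_timestamp` column.
    deduped = (events
        .withWatermark("event_timestamp", "30 minutes")
        .dropDuplicatesWithinWatermark(["key"]))
    # Duplicates whose event_timestamp values differ by at most the
    # 30-minute watermark delay are removed; state for older keys is
    # eventually purged.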

Question # 8:

Given the following code snippet in my_spark_app.py:

[code snippet not shown]

What is the role of the driver node?

Options:

A.

The driver node orchestrates the execution by transforming actions into tasks and distributing them to worker nodes

B.

The driver node only provides the user interface for monitoring the application

C.

The driver node holds the DataFrame data and performs all computations locally

D.

The driver node stores the final result after computations are completed by worker nodes
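
As a reference point: transformations are lazy, and it is an action that makes the driver turn the logical plan into tasks and schedule them on executors; the rows themselves are processed on the workers. A minimal sketch (names are illustrative):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("my_spark_app").getOrCreate()

    df = spark.range(1_000_000)   # transformation: lazy, nothing executes yet
    total = df.count()            # action: the driver schedules tasks on executors
    print(total)                  # only the small aggregated result reaches the driver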

Question # 9:

A Spark engineer is troubleshooting a Spark application that has been encountering out-of-memory errors during execution. By reviewing the Spark driver logs, the engineer notices multiple "GC overhead limit exceeded" messages.

Which action should the engineer take to resolve this issue?

Options:

A.

Optimize the data processing logic by repartitioning the DataFrame.

B.

Modify the Spark configuration to disable garbage collection.

C.

Increase the memory allocated to the Spark Driver.

D.

Cache large DataFrames to persist them in memory.
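
One caveat worth remembering when raising driver memory: spark.driver.memory must be in place before the driver JVM starts, so it is usually passed at submit time (spark-submit --driver-memory 8g app.py) or set in spark-defaults.conf. A minimal sketch of the builder form, which only takes effect when the session being created is the one launching the JVM (for example, a plain local run); the 8g value is illustrative:

    from pyspark.sql import SparkSession

    # Illustrative value; must be set before the driver JVM starts, so prefer
    # spark-submit --driver-memory or spark-defaults.conf when submitting.
    spark = (SparkSession.builder
             .appName("gc-overhead-fix")
             .config("spark.driver.memory", "8g")
             .getOrCreate())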

Question # 10:

A data scientist has identified that some records in the user profile table contain null values in one or more fields, and such records should be removed from the dataset before processing. The schema includes fields like user_id, username, date_of_birth, created_ts, etc.

The schema of the user profile table looks like this:

[schema not shown]

Which block of Spark code can be used to achieve this requirement?

Options:

A.

filtered_df = users_raw_df.na.drop(thresh=0)

B.

filtered_df = users_raw_df.na.drop(how='all')

C.

filtered_df = users_raw_df.na.drop(how='any')

D.

filtered_df = users_raw_df.na.drop(how='all', thresh=None)
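
A small runnable sketch with illustrative data: how='any' drops a row when at least one column is null, while how='all' would only drop rows where every column is null.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    users_raw_df = spark.createDataFrame(
        [(1, "ana", "1990-01-01"), (2, None, "1985-06-30"), (3, "raj", None)],
        ["user_id", "username", "date_of_birth"],
    )

    filtered_df = users_raw_df.na.drop(how="any")
    filtered_df.show()   # only the fully populated row (user_id = 1) remains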
