Pass the Databricks Certification Databricks-Certified-Associate-Developer-for-Apache-Spark-3.5 Questions and answers with ExamsMirror

Questions # 41:

42 of 55.

A developer needs to write the output of a complex chain of Spark transformations to a Parquet table called events.liveLatest.

Consumers of this table query it frequently with filters on both year and month of the event_ts column (a timestamp).

The current code:

from pyspark.sql import functions as F

final = df.withColumn("event_year", F.year("event_ts")) \
    .withColumn("event_month", F.month("event_ts")) \
    .write \
    .bucketBy(42, ["event_year", "event_month"]) \
    .saveAsTable("events.liveLatest")

However, consumers report poor query performance.

Which change will enable efficient querying by year and month?

Options:

Replace .bucketBy() with .partitionBy("event_year", "event_month")

Change the bucket count (42) to a lower number

Add .sortBy() after .bucketBy()

Replace .bucketBy() with .partitionBy("event_year") only

Questions # 42:

Given a DataFrame df that has 10 partitions, after running the code:

result = df.coalesce(20)

How many partitions will the result DataFrame have?

Options:

Same number as the cluster executors

Questions # 43:

23 of 55.

A data scientist is working with a massive dataset that exceeds the memory capacity of a single machine. The data scientist is considering using Apache Spark™ instead of traditional single-machine languages like standard Python scripts.

Which two advantages does Apache Spark™ offer over a normal single-machine language in this scenario? (Choose 2 answers)

Options:

It can distribute data processing tasks across a cluster of machines, enabling horizontal scalability.

It requires specialized hardware to run, making it unsuitable for commodity hardware clusters.

It processes data solely on disk storage, reducing the need for memory resources.

It eliminates the need to write any code, automatically handling all data processing.

It has built-in fault tolerance, allowing it to recover seamlessly from node failures during computation.

Questions # 44:

47 of 55.

A data engineer has written the following code to join two DataFrames df1 and df2:

df1 = spark.read.csv("sales_data.csv")

df2 = spark.read.csv("product_data.csv")

df_joined = df1.join(df2, df1.product_id == df2.product_id)

The DataFrame df1 contains ~10 GB of sales data, and df2 contains ~8 MB of product data.

Which join strategy will Spark use?

Options:

Shuffle join, as the size difference between df1 and df2 is too large for a broadcast join to work efficiently.

Shuffle join, because AQE is not enabled, and Spark uses a static query plan.

Shuffle join because no broadcast hints were provided.

Broadcast join, as df2 is smaller than the default broadcast threshold.

Questions # 45:

What is the difference between df.cache() and df.persist() in Spark DataFrame?

Options:

Both cache() and persist() can be used to set the default storage level (MEMORY_AND_DISK_SER)

Both functions perform the same operation. The persist() function provides improved performance as its default storage level is DISK_ONLY.

persist() - Persists the DataFrame with the default storage level (MEMORY_AND_DISK_SER) and cache() - Can be used to set different storage levels to persist the contents of the DataFrame.

cache() - Persists the DataFrame with the default storage level (MEMORY_AND_DISK) and persist() - Can be used to set different storage levels to persist the contents of the DataFrame

Questions # 46:

A data engineer is building an Apache Spark™ Structured Streaming application to process a stream of JSON events in real time. The engineer wants the application to be fault-tolerant and resume processing from the last successfully processed record in case of a failure. To achieve this, the data engineer decides to implement checkpoints.

Which code snippet should the data engineer use?

Options:

query = streaming_df.writeStream \
    .format("console") \
    .option("checkpoint", "/path/to/checkpoint") \
    .outputMode("append") \
    .start()

query = streaming_df.writeStream \
    .format("console") \
    .outputMode("append") \
    .option("checkpointLocation", "/path/to/checkpoint") \
    .start()

query = streaming_df.writeStream \
    .format("console") \
    .outputMode("complete") \
    .start()

query = streaming_df.writeStream \
    .format("console") \
    .outputMode("append") \
    .start()

Questions # 47:

26 of 55.

A data scientist at an e-commerce company is working with user data obtained from its subscriber database and has stored the data in a DataFrame df_user.

Before further processing, the data scientist wants to create another DataFrame df_user_non_pii and store only the non-PII columns.

The PII columns in df_user are name, email, and birthdate.

Which code snippet can be used to meet this requirement?

Options:

df_user_non_pii = df_user.drop("name", "email", "birthdate")

df_user_non_pii = df_user.dropFields("name", "email", "birthdate")

df_user_non_pii = df_user.select("name", "email", "birthdate")

df_user_non_pii = df_user.remove("name", "email", "birthdate")

Questions # 48:

A data engineer needs to persist a file-based data source to a specific location. However, by default, Spark writes to the warehouse directory (e.g., /user/hive/warehouse). To override this, the engineer must explicitly define the file path.

Which line of code ensures the data is saved to a specific location?

Options:

users.write(path="/some/path").saveAsTable("default_table")

users.write.saveAsTable("default_table").option("path", "/some/path")

users.write.option("path", "/some/path").saveAsTable("default_table")

users.write.saveAsTable("default_table", path="/some/path")

Questions # 49:

54 of 55.

What is the benefit of Adaptive Query Execution (AQE)?

Options:

It allows Spark to optimize the query plan before execution but does not adapt during runtime.

It automatically distributes tasks across nodes in the clusters and does not perform runtime adjustments to the query plan.

It optimizes query execution by parallelizing tasks and does not adjust strategies based on runtime metrics like data skew.

It enables the adjustment of the query plan during runtime, handling skewed data, optimizing join strategies, and improving overall query performance.

Questions # 50:

A data engineer is running a Spark job to process a dataset of 1 TB stored in distributed storage. The cluster has 10 nodes, each with 16 CPUs. Spark UI shows:

Low number of Active Tasks

Many tasks complete in milliseconds

Fewer tasks than available CPUs

Which approach should be used to adjust the partitioning for optimal resource allocation?
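As a rough aid for reasoning about this scenario, the arithmetic below estimates a partition count from the numbers in the question. The 128 MB target partition size (the default of spark.sql.files.maxPartitionBytes) and the floor of two tasks per core are rule-of-thumb assumptions, not values given in the question.

```python
# Figures from the question.
nodes = 10
cpus_per_node = 16
dataset_bytes = 1 * 1024**4  # 1 TB, taken here as 1 TiB

# Total task slots if every CPU runs one task.
total_cores = nodes * cpus_per_node

# Assumption: aim for roughly 128 MB per partition.
target_partition_bytes = 128 * 1024**2
partitions_by_size = dataset_bytes // target_partition_bytes

# Assumption: keep at least two tasks per core so no CPU sits idle.
min_partitions_by_cores = 2 * total_cores

suggested_partitions = max(partitions_by_size, min_partitions_by_cores)
print(total_cores, partitions_by_size, suggested_partitions)

# A DataFrame could then be redistributed with df.repartition(suggested_partitions).
```

If the Spark UI shows fewer tasks than the 160 available CPUs, the current partition count is below even the core-based floor computed here.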