Pass the Databricks Certification Databricks-Certified-Associate-Developer-for-Apache-Spark-3.5 Questions and answers with ExamsMirror
42 of 55.
A developer needs to write the output of a complex chain of Spark transformations to a Parquet table called events.liveLatest.
Consumers of this table query it frequently with filters on both year and month of the event_ts column (a timestamp).
The current code:
from pyspark.sql import functions as F
final = df.withColumn("event_year", F.year("event_ts")) \
    .withColumn("event_month", F.month("event_ts")) \
    .write \
    .bucketBy(42, ["event_year", "event_month"]) \
    .saveAsTable("events.liveLatest")
However, consumers report poor query performance.
Which change will enable efficient querying by year and month?
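One change often suggested for this access pattern is to write the table with DataFrameWriter.partitionBy("event_year", "event_month") instead of bucketing, so that each year/month combination lands in its own directory and a filter on both columns only has to read the matching one. The sketch below is a plain-Python model of that directory layout, not Spark itself; the year range and helper names are assumptions made for illustration.

```python
from itertools import product


def partition_dirs(years, months):
    """Model the Hive-style directories a partitioned write would produce."""
    return [f"event_year={y}/event_month={m}" for y, m in product(years, months)]


def prune(dirs, year, month):
    """Keep only the directories a filter on year AND month needs to read."""
    wanted = f"event_year={year}/event_month={month}"
    return [d for d in dirs if d == wanted]


# Hypothetical table covering three years of events.
dirs = partition_dirs(range(2022, 2025), range(1, 13))
selected = prune(dirs, 2024, 3)

print(f"{len(selected)} of {len(dirs)} directories scanned: {selected}")
```

In this model a query for March 2024 touches one directory out of 36, which is the effect partition pruning aims for.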
Given a DataFrame df that has 10 partitions, after running the code:
result = df.coalesce(20)
How many partitions will the result DataFrame have?
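The rule behind this question can be written down as simple arithmetic: on a DataFrame, coalesce only merges existing partitions and does not shuffle, so it cannot raise the partition count, whereas repartition performs a full shuffle and can. The following is a minimal plain-Python model of that rule (the function names are illustrative, not Spark APIs).

```python
def coalesce_partitions(current: int, requested: int) -> int:
    """coalesce merges partitions without a shuffle, so it never increases the count."""
    return min(current, requested)


def repartition_partitions(current: int, requested: int) -> int:
    """repartition shuffles the data and yields exactly the requested count."""
    return requested


print(coalesce_partitions(10, 20))      # count stays at the current value
print(coalesce_partitions(10, 4))       # count is reduced
print(repartition_partitions(10, 20))   # count is increased via a shuffle
```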
23 of 55.
A data scientist is working with a massive dataset that exceeds the memory capacity of a single machine. The data scientist is considering using Apache Spark™ instead of traditional single-machine languages like standard Python scripts.
Which two advantages does Apache Spark™ offer over a normal single-machine language in this scenario? (Choose 2 answers)
47 of 55.
A data engineer has written the following code to join two DataFrames df1 and df2:
df1 = spark.read.csv("sales_data.csv")
df2 = spark.read.csv("product_data.csv")
df_joined = df1.join(df2, df1.product_id == df2.product_id)
The DataFrame df1 contains ~10 GB of sales data, and df2 contains ~8 MB of product data.
Which join strategy will Spark use?
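The deciding factor here is the size of the smaller side relative to spark.sql.autoBroadcastJoinThreshold, which defaults to 10 MB (a value of -1 disables automatic broadcasting). The sketch below is a deliberately simplified plain-Python model of that comparison, assuming an equi-join and that Spark's size estimate matches the file sizes quoted in the question; the real planner considers more than this.

```python
# Default of spark.sql.autoBroadcastJoinThreshold: 10 MB, expressed in bytes.
DEFAULT_THRESHOLD = 10 * 1024 * 1024


def likely_join_strategy(left_bytes, right_bytes, threshold=DEFAULT_THRESHOLD):
    """Simplified model: broadcast if the smaller side fits under the threshold."""
    if threshold >= 0 and min(left_bytes, right_bytes) <= threshold:
        return "broadcast hash join"
    # With broadcasting ruled out, an equi-join typically falls back to sort-merge.
    return "sort-merge join"


sales_bytes = 10 * 1024**3    # ~10 GB (df1)
product_bytes = 8 * 1024**2   # ~8 MB (df2)

print(likely_join_strategy(sales_bytes, product_bytes))
print(likely_join_strategy(sales_bytes, product_bytes, threshold=-1))
```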
What is the difference between df.cache() and df.persist() on a Spark DataFrame?
A data engineer is building an Apache Spark™ Structured Streaming application to process a stream of JSON events in real time. The engineer wants the application to be fault-tolerant and resume processing from the last successfully processed record in case of a failure. To achieve this, the data engineer decides to implement checkpoints.
Which code snippet should the data engineer use?
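In Structured Streaming the checkpoint directory is supplied through the checkpointLocation option on writeStream. To show why a checkpoint lets a restarted job pick up where it left off, here is a toy plain-Python model (not Spark): it records the next offset to process in a small JSON file and reads it back on restart. The file name, record list, and simulated failure are all assumptions made for illustration.

```python
import json
import os
import tempfile


def process_stream(records, checkpoint_path, fail_at=None):
    """Process records from the last checkpointed offset onward."""
    start = 0
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            start = json.load(f)["next_offset"]

    processed = []
    for offset in range(start, len(records)):
        if fail_at is not None and offset == fail_at:
            raise RuntimeError("simulated failure")
        processed.append(records[offset])
        # Persist progress only after the record has been handled.
        with open(checkpoint_path, "w") as f:
            json.dump({"next_offset": offset + 1}, f)
    return processed


records = ["a", "b", "c", "d"]
with tempfile.TemporaryDirectory() as tmp:
    ckpt = os.path.join(tmp, "checkpoint.json")
    try:
        process_stream(records, ckpt, fail_at=2)   # first run crashes at offset 2
    except RuntimeError:
        pass
    resumed = process_stream(records, ckpt)        # restart resumes from the checkpoint

print(resumed)
```

The restarted run skips the two records that were already processed and continues with the rest.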
26 of 55.
A data scientist at an e-commerce company is working with user data obtained from its subscriber database and has stored the data in a DataFrame df_user.
Before further processing, the data scientist wants to create another DataFrame df_user_non_pii and store only the non-PII columns.
The PII columns in df_user are name, email, and birthdate.
Which code snippet can be used to meet this requirement?
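The requirement amounts to keeping every column except the three PII ones. A minimal plain-Python sketch of that selection follows; the full column list is hypothetical, since the question names only the PII columns. With a real DataFrame the same result could be obtained with df_user.drop("name", "email", "birthdate"), or by passing the computed list to select.

```python
PII_COLUMNS = {"name", "email", "birthdate"}


def non_pii_columns(all_columns):
    """Return the columns that are safe to keep, preserving their order."""
    return [c for c in all_columns if c not in PII_COLUMNS]


# Hypothetical schema for df_user; only the PII column names come from the question.
df_user_columns = ["user_id", "name", "email", "birthdate", "country", "signup_date"]

kept = non_pii_columns(df_user_columns)
print(kept)
```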
A data engineer needs to persist a file-based data source to a specific location. However, by default, Spark writes to the warehouse directory (e.g., /user/hive/warehouse). To override this, the engineer must explicitly define the file path.
Which line of code ensures the data is saved to a specific location?
54 of 55.
What is the benefit of Adaptive Query Execution (AQE)?
A data engineer is running a Spark job to process a dataset of 1 TB stored in distributed storage. The cluster has 10 nodes, each with 16 CPUs. Spark UI shows:
Low number of Active Tasks
Many tasks complete in milliseconds
Fewer tasks than available CPUs
Which approach should be used to adjust the partitioning for optimal resource allocation?
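The numbers in this scenario can be worked through directly. The sketch below computes the total core count and two common reference points for the partition count: a multiple of the available cores, and the input size divided by a target partition size. The 2x to 4x multiplier is a rule of thumb assumed here, and 128 MB is the default of spark.sql.files.maxPartitionBytes; neither is stated in the question.

```python
NODES = 10
CPUS_PER_NODE = 16
DATASET_BYTES = 1024**4                     # 1 TB, taken here as 2**40 bytes
TARGET_PARTITION_BYTES = 128 * 1024**2      # 128 MB per partition

# Every core can run one task at a time.
total_cores = NODES * CPUS_PER_NODE

# Rule of thumb (assumption): aim for roughly 2x to 4x as many partitions as cores.
low, high = total_cores * 2, total_cores * 4

# Alternative view: how many partitions the data yields at the target size.
partitions_by_size = DATASET_BYTES // TARGET_PARTITION_BYTES

print(f"total cores: {total_cores}")
print(f"partitions at 2x-4x cores: {low}-{high}")
print(f"partitions at 128 MB each: {partitions_by_size}")
```

Either way, a job showing fewer tasks than the total core count leaves part of the cluster idle, which is the symptom the Spark UI reports in this question.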