
Pass the Databricks Certification Databricks-Certified-Professional-Data-Engineer Questions and answers with ExamsMirror


Questions # 11:

A Structured Streaming job deployed to production has been experiencing delays during peak hours of the day. At present, during normal execution, each microbatch of data is processed in less than 3 seconds. During peak hours of the day, execution time for each microbatch becomes very inconsistent, sometimes exceeding 30 seconds. The streaming write is currently configured with a trigger interval of 10 seconds.

Holding all other variables constant and assuming records need to be processed in less than 10 seconds, which adjustment will meet the requirement?

Options:

A. Decrease the trigger interval to 5 seconds; triggering batches more frequently allows idle executors to begin processing the next batch while longer running tasks from previous batches finish.

B. Increase the trigger interval to 30 seconds; setting the trigger interval near the maximum execution time observed for each batch is always best practice to ensure no records are dropped.

C. The trigger interval cannot be modified without modifying the checkpoint directory; to maintain the current stream state, increase the number of shuffle partitions to maximize parallelism.

D. Use the trigger once option and configure a Databricks job to execute the query every 10 seconds; this ensures all backlogged records are processed with each batch.

E. Decrease the trigger interval to 5 seconds; triggering batches more frequently may prevent records from backing up and large batches from causing spill.

Answer

Explanation

The scenario presented involves inconsistent microbatch processing times in a Structured Streaming job during peak hours, with the need to ensure that records are processed within 10 seconds. The trigger once option is the most suitable adjustment to address these challenges:

Understanding Triggering Options:

Fixed Interval Triggering (Current Setup): With the current 10-second trigger interval, batches are scheduled on a fixed cadence regardless of how much data has arrived. If a batch takes longer than the interval to process, the next batch starts only once it finishes, so unprocessed records accumulate and later batches grow larger, exacerbating the delays.

Trigger Once: This option runs a single microbatch that processes all available data and then stops. It is useful in scenarios where batch sizes are unpredictable and can vary significantly, which seems to be the case during peak hours in this scenario.
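The fixed-interval behavior described above can be sketched with a small pure-Python simulation. This is a simplified model, not Spark code: the function name `simulate_fixed_interval` and the batch durations are hypothetical, and it assumes a new batch starts at the next interval boundary if the previous batch finished in time, or immediately after the previous batch otherwise.

```python
def simulate_fixed_interval(durations, interval):
    """Return (start, end) times in seconds for each micro-batch.

    Simplified model of a fixed-interval trigger (an assumption, not the
    Spark implementation): batches never overlap, and a batch that overruns
    the interval delays the start of the next one.
    """
    schedule = []
    start = 0
    for duration in durations:
        end = start + duration
        schedule.append((start, end))
        # Next scheduled boundary after this batch's start time.
        next_boundary = (start // interval + 1) * interval
        # Wait for the boundary, or start right away if it was missed.
        start = max(end, next_boundary)
    return schedule


# Hypothetical durations: one 30-second batch during a peak period.
schedule = simulate_fixed_interval([3, 30, 3, 3], interval=10)
for batch_id, (start, end) in enumerate(schedule):
    print(f"batch {batch_id}: start={start}s end={end}s")
```

In this model the 30-second batch pushes the following batch's start from 20s to 40s, so records that arrived in between wait well beyond the 10-second target.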

Implementation of Trigger Once:

Setup: Instead of continuously running, the job can be scheduled to run every 10 seconds using a Databricks job. This scheduling effectively acts as a custom trigger interval, ensuring that each execution cycle handles all available data up to that point without overlapping or queuing up additional executions.

Advantages: This approach allows each batch to finish processing all available data before the next batch starts, ensuring consistency in handling data surges and preventing the system from being overwhelmed.

Rationale Against Other Options:

Options A and E (Decrease Interval): Decreasing the trigger interval to 5 seconds might exacerbate the problem by increasing the frequency of batch starts without ensuring the completion of previous batches, potentially leading to higher overhead and less efficient processing.

Option B (Increase Interval): Increasing the trigger interval to 30 seconds could lead to latency issues, as the data would be processed less frequently, which contradicts the requirement of processing records in less than 10 seconds.

Option C (Modify Partitions): While increasing parallelism through more shuffle partitions can improve performance, it does not address the fundamental issue of batch scheduling and could still lead to inconsistency during peak loads.

Conclusion:

By using the trigger once option and scheduling the job every 10 seconds, you ensure that each microbatch has sufficient time to process all available data thoroughly before the next cycle begins, aligning with the need to handle peak loads more predictably and efficiently.

References

Structured Streaming Programming Guide - Triggering

Databricks Jobs Scheduling

Questions # 12:

A data architect has heard about Delta Lake's built-in versioning and time travel capabilities. For auditing purposes they have a requirement to maintain a full record of all valid street addresses as they appear in the customers table.

The architect is interested in implementing a Type 1 table, overwriting existing records with new values and relying on Delta Lake time travel to support long-term auditing. A data engineer on the project feels that a Type 2 table will provide better performance and scalability.

Which piece of information is critical to this decision?

Options:

Delta Lake time travel does not scale well in cost or latency to provide a long-term versioning solution.

Delta Lake time travel cannot be used to query previous versions of these tables because Type 1 changes modify data files in place.

Shallow clones can be combined with Type 1 tables to accelerate historic queries for long-term versioning.

Data corruption can occur if a query fails in a partially completed state because Type 2 tables require setting multiple fields in a single update.
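To make the Type 1 versus Type 2 distinction in this question concrete, here is a minimal pure-Python sketch. It does not use Delta Lake; the function names, column names, addresses, and dates are all hypothetical and chosen only for illustration.

```python
def overwrite_type1(current, customer_id, address):
    """Type 1: overwrite in place, so only the latest value is kept."""
    current[customer_id] = address


def append_type2(history, customer_id, address, effective_date):
    """Type 2: close the open row for this customer, then add a new one."""
    for row in history:
        if row["customer_id"] == customer_id and row["end_date"] is None:
            row["end_date"] = effective_date
    history.append({
        "customer_id": customer_id,
        "address": address,
        "start_date": effective_date,
        "end_date": None,  # None marks the currently valid row
    })


# Illustrative data only.
current = {}
overwrite_type1(current, 1, "12 Oak St")
overwrite_type1(current, 1, "9 Elm Ave")

history = []
append_type2(history, 1, "12 Oak St", "2023-01-01")
append_type2(history, 1, "9 Elm Ave", "2024-06-01")

print(current)       # only the latest address remains
print(len(history))  # both addresses are retained as rows
```

In the Type 1 structure the earlier address is recoverable only from an older version of the table, while the Type 2 structure keeps every address as an ordinary row that can be queried directly.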

Questions # 13:

Which statement describes integration testing?

Options:

Validates interactions between subsystems of your application

Requires an automated testing framework

Requires manual intervention

Validates an application use case

Validates behavior of individual elements of your application

Questions # 14:

A data engineer is performing a join operation to combine values from a static userLookup table with a streaming DataFrame streamingDF.

Which code block attempts to perform an invalid stream-static join?

Options:

userLookup.join(streamingDF, ["userid"], how="inner")

streamingDF.join(userLookup, ["user_id"], how="outer")

streamingDF.join(userLookup, ["user_id"], how="left")

streamingDF.join(userLookup, ["userid"], how="inner")

userLookup.join(streamingDF, ["user_id"], how="right")
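The rule being tested here can be written down as a small lookup. The sketch below is plain Python rather than a Spark API: `SUPPORTED_JOINS`, `ALIASES`, and `is_supported_stream_static_join` are hypothetical names, and the table reflects the stream-static support matrix in the Structured Streaming guide as I understand it (anti joins are left out for brevity).

```python
# Join types supported for each (left side, right side) combination.
# Outer joins are supported only when the streaming side is the preserved side.
SUPPORTED_JOINS = {
    ("stream", "static"): {"inner", "left_outer", "left_semi"},
    ("static", "stream"): {"inner", "right_outer"},
}

# Spellings of `how` accepted by PySpark's DataFrame.join, mapped to one name.
ALIASES = {
    "inner": "inner",
    "left": "left_outer", "leftouter": "left_outer", "left_outer": "left_outer",
    "right": "right_outer", "rightouter": "right_outer", "right_outer": "right_outer",
    "outer": "full_outer", "full": "full_outer",
    "fullouter": "full_outer", "full_outer": "full_outer",
    "semi": "left_semi", "leftsemi": "left_semi", "left_semi": "left_semi",
}


def is_supported_stream_static_join(left, right, how):
    """Return True if the join type is supported for this side ordering."""
    join_type = ALIASES.get(how.lower())
    if join_type is None:
        raise ValueError(f"unrecognized join type: {how}")
    return join_type in SUPPORTED_JOINS.get((left, right), set())


print(is_supported_stream_static_join("stream", "static", "left"))
print(is_supported_stream_static_join("static", "stream", "left"))
print(is_supported_stream_static_join("stream", "static", "outer"))
```

A full outer join appears in neither set, because the engine would need the complete streaming input before it could emit unmatched static rows.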

Questions # 15:

In order to facilitate near real-time workloads, a data engineer is creating a helper function to leverage the schema detection and evolution functionality of Databricks Auto Loader. The desired function will automatically detect the schema of the source directly, incrementally process JSON files as they arrive in a source directory, and automatically evolve the schema of the table when new fields are detected.

The function is displayed below with a blank:

[Image: function definition with a blank]

Which response correctly fills in the blank to meet the specified requirements?

[Image: answer options]

Options:

Option A

Option B

Option C

Option D

Option E

Questions # 16:

The data engineering team has been tasked with configuring connections to an external database that does not have a supported native connector with Databricks. The external database already has data security configured by group membership. These groups map directly to user groups already created in Databricks that represent various teams within the company.

A new login credential has been created for each group in the external database. The Databricks Utilities Secrets module will be used to make these credentials available to Databricks users.

Assuming that all the credentials are configured correctly on the external database and group membership is properly configured on Databricks, which statement describes how teams can be granted the minimum necessary access to use these credentials?

Options:

“Read” permissions should be set on a secret key mapped to those credentials that will be used by a given team.

No additional configuration is necessary as long as all users are configured as administrators in the workspace where secrets have been added.

“Read” permissions should be set on a secret scope containing only those credentials that will be used by a given team.

“Manage” permission should be set on a secret scope containing only those credentials that will be used by a given team.

Questions # 17:

Incorporating unit tests into a PySpark application requires upfront attention to the design of your jobs, or a potentially significant refactoring of existing code.

Which statement describes a main benefit that offsets this additional effort?

Options:

Improves the quality of your data

Validates a complete use case of your application

Troubleshooting is easier since all steps are isolated and tested individually

Yields faster deployment and execution times

Ensures that all steps interact correctly to achieve the desired end result

Questions # 18:

What is true for Delta Lake?

Options:

Views in the Lakehouse maintain a valid cache of the most recent versions of source tables at all times.

Delta Lake automatically collects statistics on the first 32 columns of each table, which are leveraged in data skipping based on query filters.

Z-ORDER can only be applied to numeric values stored in Delta Lake tables.

Primary and foreign key constraints can be leveraged to ensure duplicate values are never entered into a dimension table.

Questions # 19:

A table named user_ltv is being used to create a view that will be used by data analysts on various teams. Users in the workspace are configured into groups, which are used for setting up data access using ACLs.

The user_ltv table has the following schema:

[Image: user_ltv table schema]

An analyst who is not a member of the auditing group executes the following query:

[Image: query]

Which result will be returned by this query?

Options:

All columns will be displayed normally for those records that have an age greater than 18; records not meeting this condition will be omitted.

All columns will be displayed normally for those records that have an age greater than 17; records not meeting this condition will be omitted.

All age values less than 18 will be returned as null values; all other columns will be returned with the values in user_ltv.

All records from all columns will be displayed with the values in user_ltv.

Questions # 20:

The security team is exploring whether or not the Databricks secrets module can be leveraged for connecting to an external database.

After testing the code with all Python variables being defined with strings, they upload the password to the secrets module and configure the correct permissions for the currently active user. They then modify their code to the following (leaving all other variables unchanged).

[Image: modified code]

Which statement describes what will happen when the above code is executed?

Options:

The connection to the external table will fail; the string "redacted" will be printed.

An interactive input box will appear in the notebook; if the right password is provided, the connection will succeed and the encoded password will be saved to DBFS.

An interactive input box will appear in the notebook; if the right password is provided, the connection will succeed and the password will be printed in plain text.

The connection to the external table will succeed; the string value of password will be printed in plain text.

The connection to the external table will succeed; the string "redacted" will be printed.