Pass the Databricks Certification Databricks-Certified-Professional-Data-Engineer Questions and answers with ExamsMirror

Questions # 21:

All records from an Apache Kafka producer are being ingested into a single Delta Lake table with the following schema:

key BINARY, value BINARY, topic STRING, partition LONG, offset LONG, timestamp LONG

There are 5 unique topics being ingested. Only the "registration" topic contains Personally Identifiable Information (PII). The company wishes to restrict access to PII. The company also wishes to retain records containing PII in this table for only 14 days after initial ingestion. However, it would like to retain non-PII records indefinitely.

Which of the following solutions meets the requirements?

Options:

All data should be deleted biweekly; Delta Lake's time travel functionality should be leveraged to maintain a history of non-PII information.

Data should be partitioned by the registration field, allowing ACLs and delete statements to be set for the PII directory.

Because the value field is stored as binary data, this information is not considered PII and no special precautions should be taken.

Separate object storage containers should be specified based on the partition field, allowing isolation at the storage level.

Data should be partitioned by the topic field, allowing ACLs and delete statements to leverage partition boundaries.
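The retention requirement in this question can be pictured with a minimal Python sketch. It only models the rule itself (PII records expire after 14 days, everything else is kept) on in-memory rows; it is not Delta Lake code. The "ingested_at" field and the "clicks" topic are hypothetical names introduced for illustration.

```python
from datetime import datetime, timedelta, timezone

PII_TOPIC = "registration"          # topic named in the question
PII_RETENTION = timedelta(days=14)  # retention window named in the question


def apply_retention(records, now):
    """Keep every non-PII record; keep PII records only while within 14 days.

    `records` is a list of dicts with 'topic' and 'ingested_at'
    ('ingested_at' is a hypothetical field used for this sketch).
    """
    cutoff = now - PII_RETENTION
    return [
        r for r in records
        if r["topic"] != PII_TOPIC or r["ingested_at"] >= cutoff
    ]


now = datetime(2024, 1, 31, tzinfo=timezone.utc)
records = [
    {"topic": "registration", "ingested_at": now - timedelta(days=20)},
    {"topic": "registration", "ingested_at": now - timedelta(days=3)},
    {"topic": "clicks", "ingested_at": now - timedelta(days=400)},  # hypothetical non-PII topic
]
kept = apply_retention(records, now)
```

Because the rule depends only on the topic value and the record age, any physical layout that groups rows by topic lets both the access restriction and the delete target the PII rows alone.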

Questions # 22:

The data governance team is reviewing code used for deleting records for compliance with GDPR. They note the following logic is used to delete records from the Delta Lake table named users.

Question # 22

Assuming that user_id is a unique identifying key and that delete_requests contains all users that have requested deletion, which statement describes whether successfully executing the above logic guarantees that the records to be deleted are no longer accessible and why?

Options:

Yes; Delta Lake ACID guarantees provide assurance that the delete command succeeded fully and permanently purged these records.

No; the Delta cache may return records from previous versions of the table until the cluster is restarted.

Yes; the Delta cache immediately updates to reflect the latest data files recorded to disk.

No; the Delta Lake delete command only provides ACID guarantees when combined with the merge into command.

No; files containing deleted records may still be accessible with time travel until a vacuum command is used to remove invalidated data files.
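The relationship between a logical delete, older table versions, and physical file removal can be pictured with a toy model. This is a simplified sketch, not Delta Lake's implementation: the class and its methods are hypothetical, and the real VACUUM command also honors a retention threshold that this model ignores.

```python
class ToyVersionedTable:
    """Toy model only: each table version lists the data files it references."""

    def __init__(self, rows):
        self.files = {0: list(rows)}  # file_id -> rows stored in that file
        self.versions = [[0]]         # version number -> file_ids it references
        self._next_file = 1

    def delete_where(self, predicate):
        """Write a new file without the matching rows and commit a new version."""
        current_rows = [r for f in self.versions[-1] for r in self.files[f]]
        new_id = self._next_file
        self._next_file += 1
        self.files[new_id] = [r for r in current_rows if not predicate(r)]
        self.versions.append([new_id])

    def read(self, version=-1):
        """Read the latest version by default, or an older one (time travel)."""
        return [r for f in self.versions[version] for r in self.files[f]]

    def vacuum(self):
        """Physically drop files that the latest version no longer references."""
        live = set(self.versions[-1])
        self.files = {f: rows for f, rows in self.files.items() if f in live}


users = ToyVersionedTable([{"user_id": 1}, {"user_id": 2}])
delete_requests = {2}
users.delete_where(lambda r: r["user_id"] in delete_requests)

latest = users.read()          # the deleted row is absent from the latest version
before_vacuum = users.read(0)  # version 0 still returns it
users.vacuum()                 # after this, reading version 0 raises KeyError
```

In this model the delete changes which files the newest version points at, while the older file stays readable until the cleanup step removes it.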

Questions # 23:

An hourly batch job is configured to ingest data files from a cloud object storage container where each batch represents all records produced by the source system in a given hour. The batch job to process these records into the Lakehouse is sufficiently delayed to ensure no late-arriving data is missed. The user_id field represents a unique key for the data, which has the following schema:

user_id BIGINT, username STRING, user_utc STRING, user_region STRING, last_login BIGINT, auto_pay BOOLEAN, last_updated BIGINT

New records are all ingested into a table named account_history which maintains a full record of all data in the same schema as the source. The next table in the system is named account_current and is implemented as a Type 1 table representing the most recent value for each unique user_id.

Assuming there are millions of user accounts and tens of thousands of records processed hourly, which implementation can be used to efficiently update the described account_current table as part of each hourly batch job?

Options:

Use Auto Loader to subscribe to new files in the account_history directory; configure a Structured Streaming trigger-once job to batch update newly detected files into the account_current table.

Overwrite the account_current table with each batch using the results of a query against the account_history table grouping by user_id and filtering for the max value of last_updated.

Filter records in account_history using the last_updated field and the most recent hour processed, as well as the max last_login by user_id; write a merge statement to update or insert the most recent value for each user_id.

Use Delta Lake version history to get the difference between the latest version of account_history and one version prior, then write these records to account_current.

Filter records in account_history using the last_updated field and the most recent hour processed, making sure to deduplicate on username; write a merge statement to update or insert the most recent value for each username.
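As a reference for what a Type 1 update means, here is a minimal Python sketch on in-memory rows, assuming the batch has already been narrowed to the most recent hour. It is not Spark or Delta Lake code; the function names and sample values are hypothetical, and only user_id, username, and last_updated from the schema are used.

```python
def latest_per_user(batch):
    """Deduplicate a batch: keep the row with the highest last_updated per user_id."""
    latest = {}
    for row in batch:
        current = latest.get(row["user_id"])
        if current is None or row["last_updated"] > current["last_updated"]:
            latest[row["user_id"]] = row
    return latest


def upsert(account_current, batch):
    """Type 1 upsert: overwrite matching user_ids, insert new ones, keep no history."""
    for user_id, row in latest_per_user(batch).items():
        account_current[user_id] = row
    return account_current


# Hypothetical sample data keyed by user_id.
account_current = {
    1: {"user_id": 1, "username": "ana", "last_updated": 100},
    2: {"user_id": 2, "username": "bo", "last_updated": 100},
}
hourly_batch = [
    {"user_id": 2, "username": "bo", "last_updated": 150},
    {"user_id": 2, "username": "bo2", "last_updated": 170},
    {"user_id": 3, "username": "cy", "last_updated": 160},
]
upsert(account_current, hourly_batch)
```

Only the rows in the hourly batch are touched, so the work scales with the tens of thousands of new records rather than the millions of existing accounts.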

Questions # 24:

The Databricks workspace administrator has configured interactive clusters for each of the data engineering groups. To control costs, clusters are set to terminate after 30 minutes of inactivity. Each user should be able to execute workloads against their assigned clusters at any time of the day.

Assuming users have been added to a workspace but not granted any permissions, which of the following describes the minimal permissions a user would need to start and attach to an already configured cluster?

Options:

"Can Manage" privileges on the required cluster

Workspace Admin privileges, cluster creation allowed. "Can Attach To" privileges on the required cluster

Cluster creation allowed. "Can Attach To" privileges on the required cluster

"Can Restart" privileges on the required cluster

Cluster creation allowed. "Can Restart" privileges on the required cluster

Questions # 25:

The data governance team is reviewing code used for deleting records for compliance with GDPR. The following logic has been implemented to propagate delete requests from the user_lookup table to the user_aggregates table.

Question # 25

Assuming that user_id is a unique identifying key and that all users that have requested deletion have been removed from the user_lookup table, which statement describes whether successfully executing the above logic guarantees that the records to be deleted from the user_aggregates table are no longer accessible and why?

Options:

No: files containing deleted records may still be accessible with time travel until a VACUUM command is used to remove invalidated data files.

Yes: Delta Lake ACID guarantees provide assurance that the DELETE command succeeded fully and permanently purged these records.

No: the change data feed only tracks inserts and updates, not deleted records.

No: the Delta Lake DELETE command only provides ACID guarantees when combined with the MERGE INTO command.

Questions # 26:

Which statement regarding stream-static joins and static Delta tables is correct?

Options:

Each microbatch of a stream-static join will use the most recent version of the static Delta table as of each microbatch.

Each microbatch of a stream-static join will use the most recent version of the static Delta table as of the job's initialization.

The checkpoint directory will be used to track state information for the unique keys present in the join.

Stream-static joins cannot use static Delta tables because of consistency issues.

The checkpoint directory will be used to track updates to the static Delta table.

Questions # 27:

Which is a key benefit of an end-to-end test?

Options:

It closely simulates real world usage of your application.

It pinpoints errors in the building blocks of your application.

It provides testing coverage for all code paths and branches.

It makes it easier to automate your test suite.

Questions # 28:

A Spark job is taking longer than expected. Using the Spark UI, a data engineer notes that the Min, Median, and Max Durations for tasks in a particular stage show the minimum and median time to complete a task as roughly the same, but the max duration for a task as roughly 100 times as long as the minimum.

Which situation is causing increased duration of the overall job?

Options:

Task queueing resulting from improper thread pool assignment.

Spill resulting from attached volume storage being too small.

Network latency due to some cluster nodes being in different regions from the source data.

Skew caused by more data being assigned to a subset of Spark partitions.

Credential validation errors while pulling data from an external system.
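The duration pattern described in this question can be reproduced with a small Python sketch. The task timings are invented for illustration, and the ratio threshold in looks_skewed is a hypothetical heuristic, not something the Spark UI reports.

```python
from statistics import median


def duration_summary(task_seconds):
    """Return (min, median, max), mirroring the Spark UI task summary columns."""
    return min(task_seconds), median(task_seconds), max(task_seconds)


def looks_skewed(task_seconds, ratio=10.0):
    """Flag a stage whose slowest task runs far longer than the median task.

    `ratio` is a hypothetical threshold chosen for this sketch.
    """
    _, med, longest = duration_summary(task_seconds)
    return longest > ratio * med


# Invented timings: most tasks finish in about 2 seconds, one takes 200 seconds.
task_seconds = [2.0] * 198 + [2.1, 200.0]
summary = duration_summary(task_seconds)
skewed = looks_skewed(task_seconds)
```

A single straggler barely moves the median, which is why min and median stay close while max is two orders of magnitude larger; the stage still cannot finish until that one task does.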

Questions # 29:

The view updates represents an incremental batch of all newly ingested data to be inserted or updated in the customers table.

The following logic is used to process these records.

MERGE INTO customers
USING (
  SELECT updates.customer_id AS merge_key, updates.*
  FROM updates
  UNION ALL
  SELECT NULL AS merge_key, updates.*
  FROM updates JOIN customers
  ON updates.customer_id = customers.customer_id
  WHERE customers.current = true AND updates.address <> customers.address
) staged_updates
ON customers.customer_id = merge_key
WHEN MATCHED AND customers.current = true AND customers.address <> staged_updates.address THEN
  UPDATE SET current = false, end_date = staged_updates.effective_date
WHEN NOT MATCHED THEN
  INSERT (customer_id, address, current, effective_date, end_date)
  VALUES (staged_updates.customer_id, staged_updates.address, true, staged_updates.effective_date, null)

Which statement describes this implementation?

Options:

The customers table is implemented as a Type 2 table; old values are overwritten and new customers are appended.

The customers table is implemented as a Type 1 table; old values are overwritten by new values and no history is maintained.

The customers table is implemented as a Type 2 table; old values are maintained but marked as no longer current and new values are inserted.

The customers table is implemented as a Type 0 table; all writes are append only with no changes to existing values.
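To trace what the MERGE statement in this question does row by row, the following Python sketch replays its branches on in-memory lists of dicts. It is a simplified model, not Spark SQL: the function name and the sample customers are hypothetical, and it assumes at most one update per customer.

```python
def merge_type2(customers, updates):
    """Replay the question's MERGE on in-memory rows (lists of dicts)."""
    # USING subquery, first branch: every update keyed by its customer_id.
    staged = [(u["customer_id"], u) for u in updates]
    # Second branch: a NULL-keyed copy for each update that changes the
    # address of a current row; NULL never equals any customer_id.
    for u in updates:
        for c in customers:
            if (c["customer_id"] == u["customer_id"] and c["current"]
                    and u["address"] != c["address"]):
                staged.append((None, u))

    inserts = []
    for merge_key, u in staged:
        matches = [c for c in customers
                   if merge_key is not None and c["customer_id"] == merge_key]
        for c in matches:
            # WHEN MATCHED AND current = true AND the address differs
            if c["current"] and c["address"] != u["address"]:
                c["current"] = False
                c["end_date"] = u["effective_date"]
        if not matches:
            # WHEN NOT MATCHED THEN INSERT
            inserts.append({
                "customer_id": u["customer_id"],
                "address": u["address"],
                "current": True,
                "effective_date": u["effective_date"],
                "end_date": None,
            })
    customers.extend(inserts)
    return customers


# Hypothetical sample data: customer 1 moves, customer 2 is new.
customers = [{"customer_id": 1, "address": "A St", "current": True,
              "effective_date": "2023-01-01", "end_date": None}]
updates = [
    {"customer_id": 1, "address": "B Ave", "effective_date": "2024-01-01"},
    {"customer_id": 2, "address": "C Rd", "effective_date": "2024-01-01"},
]
result = merge_type2(customers, updates)
```

In this run the existing row for customer 1 is kept but closed out, and two rows are inserted: the new address for customer 1 (through the NULL-keyed copy) and the first row for customer 2.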

Questions # 30:

A junior data engineer seeks to leverage Delta Lake's Change Data Feed functionality to create a Type 1 table representing all of the values that have ever been valid for all rows in a bronze table created with the property delta.enableChangeDataFeed = true. They plan to execute the following code as a daily job:

Question # 30

Which statement describes the execution and results of running the above query multiple times?

Options:

Each time the job is executed, newly updated records will be merged into the target table, overwriting previous values with the same primary keys.

Each time the job is executed, the entire available history of inserted or updated records will be appended to the target table, resulting in many duplicate entries.

Each time the job is executed, the target table will be overwritten using the entire history of inserted or updated records, giving the desired result.

Each time the job is executed, the differences between the original and current versions are calculated; this may result in duplicate entries for some records.

Each time the job is executed, only those records that have been inserted or updated since the last execution will be appended to the target table giving the desired result.