
Pass the Databricks-Certified-Professional-Data-Engineer certification exam with ExamsMirror questions and answers

Practice at least 50% of the questions to maximize your chances of passing.


Question # 1:

An upstream system has been configured to pass the date for a given batch of data to the Databricks Jobs API as a parameter. The notebook to be scheduled will use this parameter to load data with the following code:

df = spark.read.format("parquet").load(f"/mnt/source/{date}")

Which code block should be used to create the date Python variable used in the above code block?

Options:

A.

date = spark.conf.get("date")

B.

input_dict = input()

date = input_dict["date"]

C.

import sys

date = sys.argv[1]

D.

date = dbutils.notebooks.getParam("date")

E.

dbutils.widgets.text("date", "null")

date = dbutils.widgets.get("date")
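
For reference, a minimal sketch of the widget-based pattern for receiving a job parameter in a notebook, assuming the Jobs API passes a parameter named "date" (the mount path is illustrative):

# Register a text widget so the notebook can accept a "date" parameter;
# when the notebook runs as a job, dbutils.widgets.get returns the value
# supplied through the Jobs API (the default "null" applies to interactive runs).
dbutils.widgets.text("date", "null")
date = dbutils.widgets.get("date")

# Use the parameter to build the batch source path (illustrative mount path).
df = spark.read.format("parquet").load(f"/mnt/source/{date}")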

Question # 2:

The data architect has mandated that all tables in the Lakehouse should be configured as external (also known as "unmanaged") Delta Lake tables.

Which approach will ensure that this requirement is met?

Options:

A.

When a database is being created, make sure that the LOCATION keyword is used.

B.

When configuring an external data warehouse for all table storage, leverage Databricks for all ELT.

C.

When data is saved to a table, make sure that a full file path is specified alongside the Delta format.

D.

When tables are created, make sure that the EXTERNAL keyword is used in the CREATE TABLE statement.

E.

When the workspace is being configured, make sure that external cloud object storage has been mounted.
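
As an illustrative sketch (table names and the storage path are hypothetical), specifying an explicit storage location when a Delta table is created or written makes it external (unmanaged):

# Saving with an explicit path registers an external (unmanaged) Delta table:
# dropping the table later removes only the metastore entry, not the files.
(df.write
   .format("delta")
   .option("path", "/mnt/external/sales")  # hypothetical storage location
   .saveAsTable("sales_external"))

# SQL equivalent using the LOCATION clause.
spark.sql("""
    CREATE TABLE IF NOT EXISTS sales_external_sql
    USING DELTA
    LOCATION '/mnt/external/sales_sql'
""")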

Question # 3:

A junior member of the data engineering team is exploring the language interoperability of Databricks notebooks. The intended outcome of the below code is to register a view of all sales that occurred in countries on the continent of Africa that appear in the geo_lookup table.

Before executing the code, running SHOW TABLES on the current database indicates the database contains only two tables: geo_lookup and sales.

(The image showing the two notebook command cells, Cmd 1 and Cmd 2, is not reproduced here.)

Which statement correctly describes the outcome of executing these command cells in order in an interactive notebook?

Options:

A.

Both commands will succeed. Executing SHOW TABLES will show that countries_af and sales_af have been registered as views.

B.

Cmd 1 will succeed. Cmd 2 will search all accessible databases for a table or view named countries_af: if this entity exists, Cmd 2 will succeed.

C.

Cmd 1 will succeed and Cmd 2 will fail; countries_af will be a Python variable representing a PySpark DataFrame.

D.

Both commands will fail. No new variables, tables, or views will be created.

E.

Cmd 1 will succeed and Cmd 2 will fail; countries_af will be a Python variable containing a list of strings.
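
The interoperability issue can be sketched as follows (column names are assumed): a variable created in a Python cell is not visible to a %sql cell, so SQL cannot reference it directly; a Python workaround is to expose the filtered result as a temporary view:

from pyspark.sql import functions as F

# Cmd 1 (Python): build a Python list of strings with African country codes
# from the geo_lookup table (continent/country column names are assumptions).
countries_af = [row["country"] for row in
                spark.table("geo_lookup").filter("continent = 'AF'").collect()]

# A %sql cell cannot see the Python variable countries_af. From Python, the
# filtered sales data can instead be registered as a temporary view:
(spark.table("sales")
      .filter(F.col("country").isin(countries_af))
      .createOrReplaceTempView("sales_af"))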

Question # 4:

The Databricks CLI is used to trigger a run of an existing job by passing the job_id parameter. The response confirming that the job run request has been submitted successfully includes a field named run_id.

Which statement describes what the number alongside this field represents?

Options:

A.

The job_id is returned in this field.

B.

The job_id and the number of times the job has been run are concatenated and returned.

C.

The number of times the job definition has been run in the workspace.

D.

The globally unique ID of the newly triggered run.
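
A hedged sketch of triggering the same run through the Jobs REST API from Python (workspace URL, token, and job ID are placeholders); the run-now response returns the run_id of the run that was just created:

import requests

host = "https://<workspace-url>"   # placeholder
token = "<personal-access-token>"  # placeholder

resp = requests.post(
    f"{host}/api/2.1/jobs/run-now",
    headers={"Authorization": f"Bearer {token}"},
    json={"job_id": 123},          # placeholder job_id
)
resp.raise_for_status()

# The response body carries the identifier of the newly triggered run.
run_id = resp.json()["run_id"]
print(run_id)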

Question # 5:

A Delta Lake table representing metadata about content from users has the following schema:

user_id LONG, post_text STRING, post_id STRING, longitude FLOAT, latitude FLOAT, post_time TIMESTAMP, date DATE

Based on the above schema, which column is a good candidate for partitioning the Delta Table?

Options:

A.

Date

B.

Post_id

C.

User_id

D.

Post_time
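
As a sketch (the table name is hypothetical), partitioning is specified when the table is written; a low-cardinality column such as date keeps the number of partitions manageable, whereas high-cardinality columns like user_id or post_id would produce many small files:

# Partition the Delta table by the low-cardinality date column.
(df.write
   .format("delta")
   .partitionBy("date")
   .saveAsTable("user_posts"))  # hypothetical table name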

Question # 6:

The marketing team is looking to share data in an aggregate table with the sales organization, but the field names used by the teams do not match, and a number of marketing-specific fields have not been approved for the sales org.

Which of the following solutions addresses the situation while emphasizing simplicity?

Options:

A.

Create a view on the marketing table selecting only those fields approved for the sales team; alias the names of any fields that should be standardized to the sales naming conventions.

B.

Use a CTAS statement to create a derivative table from the marketing table; configure a production job to propagate changes.

C.

Add a parallel table write to the current production pipeline, updating a new sales table that varies as required from the marketing table.

D.

Create a new table with the required schema and use Delta Lake's DEEP CLONE functionality to sync up changes committed to one table to the corresponding table.
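
A minimal sketch of the view-based approach (table, field, and alias names are all hypothetical): only fields approved for the sales org are selected, and their names are aliased to the sales conventions:

# Hypothetical view exposing only approved fields under sales-friendly names.
spark.sql("""
    CREATE OR REPLACE VIEW sales.campaign_summary AS
    SELECT
        campaign_id    AS promo_id,     -- renamed to the sales convention
        campaign_spend AS promo_spend,  -- renamed to the sales convention
        conversions                     -- approved field kept as-is
    FROM marketing.agg_campaigns        -- marketing-only fields are omitted
""")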

Question # 7:

In order to prevent accidental commits to production data, a senior data engineer has instituted a policy that all development work will reference clones of Delta Lake tables. After testing both deep and shallow clone, development tables are created using shallow clone.

A few weeks after initial table creation, the cloned versions of several tables implemented as Type 1 Slowly Changing Dimension (SCD) stop working. The transaction logs for the source tables show that vacuum was run the day before.

Why are the cloned tables no longer working?

Options:

A.

The data files compacted by vacuum are not tracked by the cloned metadata; running refresh on the cloned table will pull in recent changes.

B.

Because Type 1 changes overwrite existing records, Delta Lake cannot guarantee data consistency for cloned tables.

C.

The metadata created by the clone operation references data files that were purged as invalid by the vacuum command.

D.

Running vacuum automatically invalidates any shallow clones of a table; deep clone should always be used when a cloned table will be repeatedly queried.
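
For reference, a sketch of the cloning commands involved (table names are hypothetical). A shallow clone copies only the Delta metadata and keeps pointing at the source table's data files, which a later VACUUM on the source can delete; a deep clone copies the files as well:

# Shallow clone: metadata only, references the source table's data files.
spark.sql("CREATE TABLE dev.customers_shallow SHALLOW CLONE prod.customers")

# Deep clone: copies the data files too, so it is unaffected by VACUUM
# operations run against the source table.
spark.sql("CREATE TABLE dev.customers_deep DEEP CLONE prod.customers")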

Question # 8:

Where in the Spark UI can one diagnose a performance problem induced by not leveraging predicate push-down?

Options:

A.

In the Executor's log file, by grepping for "predicate push-down"

B.

In the Stage's Detail screen, in the Completed Stages table, by noting the size of data read from the Input column

C.

In the Storage Detail screen, by noting which RDDs are not stored on disk

D.

In the Delta Lake transaction log, by noting the column statistics

E.

In the Query Detail screen, by interpreting the Physical Plan
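
As a sketch (the source path and filter column are hypothetical), the same physical plan shown in the Query Detail screen can be printed from a notebook; a predicate that is pushed down appears under PushedFilters in the file scan node, while one that is not appears only as a Filter above the scan and forces more data to be read:

# Inspect the physical plan for a filtered read (hypothetical path and column).
df = spark.read.format("parquet").load("/mnt/source/events")
df.filter("event_date = '2024-01-01'").explain(mode="formatted")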

Question # 9:

A Databricks job has been configured with 3 tasks, each of which is a Databricks notebook. Task A does not depend on other tasks. Tasks B and C run in parallel, with each having a serial dependency on Task A.

If task A fails during a scheduled run, which statement describes the results of this run?

Options:

A.

Because all tasks are managed as a dependency graph, no changes will be committed to the Lakehouse until all tasks have successfully been completed.

B.

Tasks B and C will attempt to run as configured; any changes made in task A will be rolled back due to task failure.

C.

Unless all tasks complete successfully, no changes will be committed to the Lakehouse; because task A failed, all commits will be rolled back automatically.

D.

Tasks B and C will be skipped; some logic expressed in task A may have been committed before task failure.

E.

Tasks B and C will be skipped; task A will not commit any changes because of stage failure.
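
A hedged sketch of how such a job might be defined as a Jobs API 2.1 payload built in Python (task keys and notebook paths are hypothetical); the depends_on entries are what cause tasks B and C to be skipped when task A fails:

# Hypothetical Jobs API 2.1 job definition: tasks B and C each depend on A.
job_settings = {
    "name": "three-task-job",
    "tasks": [
        {"task_key": "task_a",
         "notebook_task": {"notebook_path": "/Jobs/task_a"}},
        {"task_key": "task_b",
         "depends_on": [{"task_key": "task_a"}],
         "notebook_task": {"notebook_path": "/Jobs/task_b"}},
        {"task_key": "task_c",
         "depends_on": [{"task_key": "task_a"}],
         "notebook_task": {"notebook_path": "/Jobs/task_c"}},
    ],
}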

Question # 10:

Which statement describes Delta Lake Auto Compaction?

Options:

A.

An asynchronous job runs after the write completes to detect if files could be further compacted; if yes, an optimize job is executed toward a default of 1 GB.

B.

Before a Jobs cluster terminates, optimize is executed on all tables modified during the most recent job.

C.

Optimized writes use logical partitions instead of directory partitions; because partition boundaries are only represented in metadata, fewer small files are written.

D.

Data is queued in a messaging bus instead of committing data directly to memory; all data is committed from the messaging bus in one batch once the job is complete.

E.

An asynchronous job runs after the write completes to detect if files could be further compacted; if yes, an optimize job is executed toward a default of 128 MB.
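
For reference, a sketch of how auto compaction can be enabled (the table name is hypothetical): either per table through a Delta table property or per session through a Spark configuration:

# Enable auto compaction for a specific Delta table (hypothetical name).
spark.sql("""
    ALTER TABLE sales
    SET TBLPROPERTIES ('delta.autoOptimize.autoCompact' = 'true')
""")

# Or enable it for all Delta writes in the current session.
spark.conf.set("spark.databricks.delta.autoCompact.enabled", "true")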
