Pass the Databricks ML Data Scientist Databricks-Machine-Learning-Professional Questions and answers with ExamsMirror

Practice at least 50% of the questions to maximize your chances of passing.

Questions # 1:

A machine learning engineer needs to select a deployment strategy for a new machine learning application. The feature values are not available until the time of delivery, and results are needed exceedingly fast for one record at a time.

Which of the following deployment strategies can be used to meet these requirements?

Options:

Edge/on-device

Streaming

None of these strategies will meet the requirements.

Batch

Real-time

Questions # 2:

A machine learning engineer wants to programmatically create a new Databricks Job whose schedule depends on the result of some automated tests in a machine learning pipeline.

Which of the following Databricks tools can be used to programmatically create the Job?

Options:

MLflow APIs

AutoML APIs

MLflow Client

Jobs cannot be created programmatically

Databricks REST APIs
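
As an illustration of creating a Job programmatically, the sketch below builds (but does not send) a request to the Databricks Jobs REST API endpoint `POST /api/2.1/jobs/create`. The host, token, job name, notebook path, cluster ID, and cron expression are hypothetical placeholders, and the rule that maps test results to the schedule is an assumption made only for this example.

```python
import json
import urllib.request


def build_create_job_request(host, token, tests_passed):
    """Build a Jobs API 'create' request whose schedule depends on test results."""
    # Assumed policy: the nightly schedule is active only when the tests passed.
    schedule = {
        "quartz_cron_expression": "0 0 2 * * ?",  # every day at 02:00
        "timezone_id": "UTC",
        "pause_status": "UNPAUSED" if tests_passed else "PAUSED",
    }
    payload = {
        "name": "ml-pipeline-retrain",  # hypothetical job name
        "tasks": [
            {
                "task_key": "retrain",
                "notebook_task": {"notebook_path": "/Repos/ml/retrain"},  # hypothetical
                "existing_cluster_id": "0000-000000-example",  # hypothetical
            }
        ],
        "schedule": schedule,
    }
    return urllib.request.Request(
        url=f"{host}/api/2.1/jobs/create",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


request = build_create_job_request(
    "https://example.cloud.databricks.com", "<token>", tests_passed=True
)
# urllib.request.urlopen(request) would submit it; omitted so the sketch has no side effects.
```

Keeping request construction separate from submission makes the schedule logic easy to check before anything is created in a workspace.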

Questions # 3:

A machine learning engineer is monitoring categorical input variables for a production machine learning application. The engineer believes that missing values are becoming more prevalent in more recent data for a particular value in one of the categorical input variables.

Which of the following tools can the machine learning engineer use to assess their theory?

Options:

Kolmogorov-Smirnov (KS) test

One-way Chi-squared Test

Two-way Chi-squared Test

Jensen-Shannon distance

None of these
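
To make the idea of a chi-squared check on missingness concrete, here is a minimal sketch of a one-way (goodness-of-fit) chi-squared test that compares the missing/present counts in recent data with the proportions expected from a baseline period. The counts and the 5% baseline rate are invented for illustration, and the function handles only two categories so that the p-value can be computed with the standard library.

```python
import math


def one_way_chi_squared(observed, expected_proportions):
    """Goodness-of-fit statistic and p-value for a two-category variable."""
    if len(observed) != 2 or len(expected_proportions) != 2:
        raise ValueError("this sketch handles exactly two categories (1 degree of freedom)")
    total = sum(observed)
    expected = [p * total for p in expected_proportions]
    statistic = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # For 1 degree of freedom the chi-squared survival function is erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value


# Hypothetical numbers: 5% of baseline records were missing; 80 of 1,000 recent records are.
statistic, p_value = one_way_chi_squared(
    observed=[80, 920], expected_proportions=[0.05, 0.95]
)
print(f"chi2={statistic:.2f}, p={p_value:.2g}")
```

A small p-value here would suggest that the recent missing rate differs from the baseline rate by more than chance alone would explain.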

Questions # 4:

Which of the following MLflow operations can be used to delete a model from the MLflow Model Registry?

Options:

client.transition_model_version_stage

client.delete_model_version

client.update_registered_model

client.delete_model

client.delete_registered_model
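
For orientation, the following sketch contrasts two deletion calls that exist on `MlflowClient`: `delete_model_version` and `delete_registered_model`. The helper name `delete_from_registry` is hypothetical, and the client is passed in as an argument so the sketch can be read and exercised without a tracking server.

```python
def delete_from_registry(client, name, version=None):
    """Delete one model version, or the whole registered model when version is None."""
    if version is not None:
        # Removes a single version; the registered model itself remains.
        client.delete_model_version(name=name, version=version)
        return f"deleted version {version} of {name}"
    # Removes the registered model entry itself.
    client.delete_registered_model(name=name)
    return f"deleted registered model {name}"


# Typical use (requires MLflow and a reachable tracking server):
# from mlflow.tracking import MlflowClient
# delete_from_registry(MlflowClient(), "best_model")
```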

Questions # 5:

Which of the following Databricks-managed MLflow capabilities is a centralized model store?

Options:

Models

Model Registry

Model Serving

Feature Store

Experiments

Questions # 6:

A data scientist has developed and logged a scikit-learn random forest model named model, and then they ended their Spark session and terminated their cluster. After starting a new cluster, they want to review the feature_importances_ of the original model object.

Which of the following lines of code can be used to restore the model object so that feature_importances_ is available?

Options:

mlflow.load_model(model_uri)

client.list_artifacts(run_id)["feature-importances.csv"]

mlflow.sklearn.load_model(model_uri)

This can only be viewed in the MLflow Experiments UI

client.pyfunc.load_model(model_uri)
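
The distinction that matters in this scenario is between a flavor-specific loader and a generic one: `mlflow.sklearn.load_model` returns the native scikit-learn estimator with its attributes, whereas `mlflow.pyfunc.load_model` returns a generic wrapper built around `predict`. The sketch below shows the first pattern. The helper name and the `load_model` parameter are hypothetical additions so the function can be tried without MLflow installed.

```python
def restore_feature_importances(model_uri, load_model=None):
    """Reload a logged scikit-learn model and return its feature_importances_."""
    if load_model is None:
        # Imported lazily so the sketch can be loaded without MLflow installed.
        import mlflow.sklearn
        load_model = mlflow.sklearn.load_model
    model = load_model(model_uri)  # native estimator, not a pyfunc wrapper
    return model.feature_importances_


# Typical use on a new cluster (the run ID is a placeholder):
# importances = restore_feature_importances("runs:/<run_id>/model")
```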

Questions # 7:

After a data scientist noticed that a column was missing from a production feature set stored as a Delta table, the machine learning engineering team has been tasked with determining when the column was dropped from the feature set.

Which of the following SQL commands can be used to accomplish this task?

Options:

VERSION

DESCRIBE

HISTORY

DESCRIBE HISTORY

TIMESTAMP
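
A Delta table's commit history pairs each version with a timestamp, so one way to date a schema change is to walk the versions in order and report the first one in which the column is absent. The sketch below does this over plain Python data. The table name `features`, the sample history, and the `columns_at` callable are hypothetical stand-ins for what a Spark session would return.

```python
def find_column_drop(history, columns_at, column):
    """Return (version, timestamp) of the first commit where `column` disappears."""
    previously_present = False
    for entry in sorted(history, key=lambda row: row["version"]):
        present = column in columns_at(entry["version"])
        if previously_present and not present:
            return entry["version"], entry["timestamp"]
        previously_present = present
    return None


# On Databricks these inputs could come from (table name is hypothetical):
#   history    = spark.sql("DESCRIBE HISTORY features").collect()
#   columns_at = lambda v: spark.sql(f"SELECT * FROM features VERSION AS OF {v}").columns
history = [
    {"version": 0, "timestamp": "2024-01-01 09:00:00"},
    {"version": 1, "timestamp": "2024-01-08 09:00:00"},
    {"version": 2, "timestamp": "2024-01-15 09:00:00"},
]
schemas = {0: ["id", "age", "income"], 1: ["id", "age", "income"], 2: ["id", "age"]}

dropped = find_column_drop(history, schemas.__getitem__, "income")
print(dropped)
```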

Questions # 8:

A machine learning engineering team has written predictions computed in a batch job to a Delta table for querying. However, the team has noticed that the querying is running slowly. The team has already tuned the size of the data files. Upon investigating, the team has concluded that the rows meeting the query condition are sparsely located throughout each of the data files.

Based on the scenario, which of the following optimization techniques could speed up the query by colocating similar records while considering values in multiple columns?

Options:

Z-Ordering

Bin-packing

Write as a Parquet file

Data skipping

Tuning the file size

Answer: Z-Ordering

Explanation

Z-Ordering is an optimization technique that can speed up the query by colocating similar records while considering values in multiple columns. It organizes data in storage based on the values of one or more columns by mapping multidimensional data to one dimension while preserving the locality of the data points. As a result, rows with similar values for the specified columns are stored close together in the same set of files. This improves the performance of queries that filter on those columns, because they can skip irrelevant files or data blocks. Z-Ordering also makes data skipping and caching more effective, because each file holds a narrower range of values for the chosen columns [1]. The other options are incorrect because:

Option B: Bin-packing is an optimization technique that compacts small files into larger ones, but it does not colocate similar records based on multiple columns. Bin-packing can improve query performance by reducing the number of files that need to be read, but it does not affect the data layout within the files [2].

Option C: Writing as a Parquet file is not an optimization technique, but a file format choice. Parquet is a columnar storage format that supports efficient compression and encoding schemes. Parquet can improve query performance by reducing the storage footprint and the amount of data transferred, but it does not colocate similar records based on multiple columns [3].

Option D: Data skipping is an optimization technique that skips files or data blocks that do not match the query predicates, but it does not itself colocate similar records based on multiple columns. Data skipping can improve query performance by avoiding unnecessary data scans, but its effectiveness depends on the data layout and on the metadata collected for each file [4].

Option E: Tuning the file size is an optimization technique that adjusts the size of the data files to a target value, but it does not colocate similar records based on multiple columns. Tuning the file size can improve query performance by balancing the trade-off between parallelism and overhead, but it does not affect the data layout within the files [5].

References:
[1] Z-Ordering (multi-dimensional clustering)
[2] Compaction (bin-packing)
[3] Parquet
[4] Data skipping
[5] Tuning file sizes
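
The phrase "maps multidimensional data to one dimension while preserving locality" can be illustrated with a Z-order (Morton) key, which interleaves the bits of two column values. This is a sketch of the curve itself, not of Delta Lake's internal implementation; the function name `z_value` and the sample points are invented for the example. On a Delta table the technique is applied with a command of the form `OPTIMIZE predictions ZORDER BY (col_a, col_b)`, where the table and column names are placeholders.

```python
def z_value(x, y, bits=16):
    """Interleave the bits of two non-negative integers into one Z-order (Morton) key."""
    key = 0
    for i in range(bits):
        key |= ((x >> i) & 1) << (2 * i)       # bits of x go to even positions
        key |= ((y >> i) & 1) << (2 * i + 1)   # bits of y go to odd positions
    return key


# Sorting by the key keeps points that are close in both dimensions close in the ordering.
points = [(0, 0), (3, 3), (1, 0), (2, 2), (0, 1), (1, 1)]
ordered = sorted(points, key=lambda p: z_value(*p))
print(ordered)
```

Because nearby points end up adjacent in the sorted order, files written in that order each cover a small region of both columns, which is what lets a filter on either column skip most files.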

Questions # 9:

Which of the following is a simple, low-cost method of monitoring numeric feature drift?

Options:

Jensen-Shannon test

Summary statistics trends

Chi-squared test

None of these can be used to monitor feature drift

Kolmogorov-Smirnov (KS) test
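
As a concrete picture of what tracking summary statistics over time can look like, the sketch below computes a few statistics per time window for one numeric feature and flags those that moved by more than a chosen relative tolerance. The window data, the 10% tolerance, and the function names are assumptions for illustration only.

```python
import statistics


def summarize(values):
    """Mean, standard deviation, and missing fraction for one window of a numeric feature."""
    present = [v for v in values if v is not None]
    return {
        "mean": statistics.fmean(present),
        "stdev": statistics.stdev(present),
        "missing": 1 - len(present) / len(values),
    }


def drifted(baseline, current, tolerance=0.10):
    """List the statistics whose relative change from the baseline exceeds `tolerance`."""
    flags = []
    for name, base in baseline.items():
        # Fall back to the absolute change when the baseline value is zero.
        change = abs(current[name] - base) / (abs(base) or 1.0)
        if change > tolerance:
            flags.append(name)
    return flags


last_week = summarize([10.0, 11.0, 9.0, 10.0, 10.0])
this_week = summarize([13.0, 14.0, None, 12.0, 13.0])
flags = drifted(last_week, this_week)
print(flags)
```

Checks like this need only one pass over each window, which is why they are often used as a first line of monitoring before heavier statistical tests.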

Questions # 10:

A machine learning engineer is migrating a machine learning pipeline to use Databricks Machine Learning. They have programmatically identified the best run from an MLflow Experiment and stored its URI in the model_uri variable and its Run ID in the run_id variable. They have also determined that the model was logged with the name "model". Now, the machine learning engineer wants to register that model in the MLflow Model Registry with the name "best_model".

Which of the following lines of code can they use to register the model to the MLflow Model Registry?