
Pass the Databricks ML Data Scientist Databricks-Machine-Learning-Associate Questions and answers with ExamsMirror

Questions # 1:

The implementation of linear regression in Spark ML first attempts to solve the linear regression problem using matrix decomposition, but this method does not scale well to large datasets with a large number of variables.

Which of the following approaches does Spark ML use to distribute the training of a linear regression model for large data?

Options:

A.

Logistic regression

B.

Singular value decomposition

C.

Iterative optimization

D.

Least-squares method
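For readers unfamiliar with the term, the "iterative optimization" named in option C can be sketched in a few lines. This is a hypothetical single-machine illustration using plain batch gradient descent on a one-variable model. The function name `fit_linear_gd`, the learning rate, the step count, and the toy data are all assumptions made for illustration, and the code does not reproduce Spark ML's actual solver. The point it shows is that each step needs only sums over the rows, which is the kind of computation that can be split across partitions and then combined.

```python
def fit_linear_gd(xs, ys, lr=0.05, steps=2000):
    """Fit y ~ w * x + b by batch gradient descent (illustrative only)."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        # Each gradient is an average of per-row terms. In a distributed
        # setting these sums could be computed per partition and combined.
        grad_w = sum((w * x + b - y) * x for x, y in zip(xs, ys)) / n
        grad_b = sum((w * x + b - y) for x, y in zip(xs, ys)) / n
        # Move the parameters a small step against the gradient.
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


# Toy data generated from y = 2x + 1 (an assumption for this sketch).
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]
w, b = fit_linear_gd(xs, ys)
```

On this toy data the fitted `w` and `b` approach 2 and 1. A closed-form solution would get there in one step, but it requires building and decomposing a matrix whose size grows with the number of variables.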

Questions # 2:

A data scientist has written a data cleaning notebook that utilizes the pandas library, but their colleague has suggested that they refactor their notebook to scale with big data.

Which of the following approaches can the data scientist take to spend the least amount of time refactoring their notebook to scale with big data?

Options:

A.

They can refactor their notebook to process the data in parallel.

B.

They can refactor their notebook to use the PySpark DataFrame API.

C.

They can refactor their notebook to use the Scala Dataset API.

D.

They can refactor their notebook to use Spark SQL.

E.

They can refactor their notebook to utilize the pandas API on Spark.

Questions # 3:

A machine learning engineering team has a Job with three successive tasks. Each task runs a single notebook. The team has been alerted that the Job has failed in its latest run.

Which of the following approaches can the team use to identify which task is the cause of the failure?

Options:

A.

Run each notebook interactively

B.

Review the matrix view in the Job's runs

C.

Migrate the Job to a Delta Live Tables pipeline

D.

Change each Task’s setting to use a dedicated cluster

Questions # 4:

A data scientist wants to explore the Spark DataFrame spark_df. The data scientist wants the exploration to include visual histograms displaying the distribution of numeric features.

Which of the following lines of code can the data scientist run to accomplish the task?

Options:

A.

spark_df.describe()

B.

dbutils.data(spark_df).summarize()

C.

This task cannot be accomplished in a single line of code.

D.

spark_df.summary()

E.

dbutils.data.summarize(spark_df)

Questions # 5:

A data scientist wants to parallelize the training of trees in a gradient boosted tree to speed up the training process. A colleague suggests that parallelizing a boosted tree algorithm can be difficult.

Which of the following describes why?

Options:

A.

Gradient boosting is not a linear algebra-based algorithm, which is required for parallelization.

B.

Gradient boosting requires access to all data at once, which cannot happen during parallelization.

C.

Gradient boosting calculates gradients in evaluation metrics using all cores, which prevents parallelization.

D.

Gradient boosting is an iterative algorithm that requires information from the previous iteration to perform the next step.
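The sequential dependency described in option D can be made concrete with a small sketch. This is a hypothetical pure-Python illustration of boosting with one-split regression stumps under squared-error loss. The helper names `fit_stump` and `boost`, the learning rate, and the toy data are assumptions and do not come from any library.

```python
def fit_stump(xs, residuals):
    """Find the single threshold on x that minimizes squared error."""
    best = None
    # Skip the largest value so the right-hand side is never empty.
    for t in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        err = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
        if best is None or err < best[0]:
            best = (err, t, lm, rm)
    _, t, lm, rm = best
    return lambda x: lm if x <= t else rm


def boost(xs, ys, n_rounds=20, lr=0.5):
    """Boost stumps on residuals; returns predictions and per-round error."""
    preds = [sum(ys) / len(ys)] * len(xs)  # start from the mean
    errors = []
    for _ in range(n_rounds):
        # The residuals depend on the predictions of ALL previous rounds,
        # so round k cannot start before round k - 1 has finished.
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        preds = [p + lr * stump(x) for x, p in zip(xs, preds)]
        errors.append(sum((y - p) ** 2 for y, p in zip(ys, preds)))
    return preds, errors


# Toy data (an assumption for this sketch).
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [1.0, 1.5, 3.0, 3.5, 6.0, 6.5]
preds, errors = boost(xs, ys)
```

The loop over rounds is inherently serial. Any parallelism has to come from inside a single round, for example from the search over candidate splits in `fit_stump`.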

Questions # 6:

A data scientist wants to use Spark ML to impute missing values in their PySpark DataFrame features_df. They want to replace missing values in all numeric columns in features_df with each respective numeric column’s median value.

They have developed the following code block to accomplish this task:

[Image: code block for Question 6, not reproduced here]

The code block is not accomplishing the task.

Which reason describes why the code block is not accomplishing the imputation task?

Options:

A.

It does not impute both the training and test data sets.

B.

The inputCols and outputCols need to be exactly the same.

C.

The fit method needs to be called instead of transform.

D.

It does not fit the imputer on the data to create an ImputerModel.
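The question's own code block is not available in this text, so the following is only a plain-Python sketch of the estimator/model pattern that options C and D refer to. The classes `MedianImputer` and `MedianImputerModel` are hypothetical stand-ins written for this illustration. They are not the Spark ML classes, and rows are modelled as simple dictionaries rather than a DataFrame.

```python
from statistics import median


class MedianImputerModel:
    """Holds the learned medians and applies them to data."""

    def __init__(self, medians):
        self.medians = medians

    def transform(self, rows):
        # Replace None only in columns for which a median was learned.
        return [
            {c: (self.medians[c] if v is None and c in self.medians else v)
             for c, v in row.items()}
            for row in rows
        ]


class MedianImputer:
    """Estimator: has no medians until fit() has seen the data."""

    def __init__(self, input_cols):
        self.input_cols = input_cols

    def fit(self, rows):
        medians = {
            c: median(row[c] for row in rows if row[c] is not None)
            for c in self.input_cols
        }
        return MedianImputerModel(medians)


# Toy data (an assumption for this sketch).
rows = [
    {"a": 1.0, "b": 10.0},
    {"a": None, "b": 20.0},
    {"a": 3.0, "b": None},
]
model = MedianImputer(["a", "b"]).fit(rows)  # step 1: learn the medians
imputed = model.transform(rows)              # step 2: apply them
```

In this sketch the estimator itself has no `transform` method. Only the model returned by `fit` can fill in values, because the medians do not exist until the data has been scanned.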

Questions # 7:

What is the name of the method that transforms categorical features into a series of binary indicator feature variables?

Options:

A.

Leave-one-out encoding

B.

Target encoding

C.

One-hot encoding

D.

Categorical

E.

String indexing
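To make "binary indicator feature variables" concrete, here is a minimal pure-Python sketch of one of the listed techniques, one-hot encoding. The function name `one_hot` and the sample colours are assumptions for illustration. Library implementations differ in details such as column order and whether one category is dropped.

```python
def one_hot(values):
    """Return the sorted categories and one 0/1 indicator row per value."""
    categories = sorted(set(values))
    index = {c: i for i, c in enumerate(categories)}
    rows = []
    for v in values:
        row = [0] * len(categories)
        row[index[v]] = 1  # exactly one indicator is set per input value
        rows.append(row)
    return categories, rows


categories, encoded = one_hot(["red", "green", "blue", "green"])
# categories -> ['blue', 'green', 'red']
# encoded    -> [[0, 0, 1], [0, 1, 0], [1, 0, 0], [0, 1, 0]]
```

By contrast, an index-based encoding would map each category to a single integer, which imposes an ordering that the categories may not have.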

Questions # 8:

A machine learning engineer is converting a decision tree from sklearn to Spark ML. They notice that they are receiving different results despite all of their data and manually specified hyperparameter values being identical.

Which of the following describes a reason that the single-node sklearn decision tree and the Spark ML decision tree can differ?

Options:

A.

Spark ML decision trees test every feature variable in the splitting algorithm

B.

Spark ML decision trees automatically prune overfit trees

C.

Spark ML decision trees test more split candidates in the splitting algorithm

D.

Spark ML decision trees test a random sample of feature variables in the splitting algorithm

E.

Spark ML decision trees test binned feature values as representative split candidates
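The difference between exhaustive and binned split candidates (the idea behind option E) can be sketched as follows. This is a simplified, hypothetical illustration. The helper names `exact_candidates` and `binned_candidates` and the equal-frequency binning rule are assumptions, and the code is not Spark ML's actual binning procedure.

```python
def exact_candidates(values):
    """Midpoints between every pair of adjacent distinct values."""
    v = sorted(set(values))
    return [(a + b) / 2 for a, b in zip(v, v[1:])]


def binned_candidates(values, max_bins):
    """Approximate candidates: boundaries of roughly equal-frequency bins."""
    v = sorted(values)
    n = len(v)
    # One boundary between each pair of neighbouring bins.
    cuts = {v[(i * n) // max_bins] for i in range(1, max_bins)}
    return sorted(cuts)


values = list(range(1, 101))  # toy feature with 100 distinct values
exact = exact_candidates(values)
binned = binned_candidates(values, max_bins=4)
# len(exact) -> 99
# binned     -> [26, 51, 76]
```

With far fewer thresholds to evaluate, a binned search is cheaper to distribute, but it can pick a slightly different threshold than an exhaustive search over the same data. A different threshold at one node changes every split below it.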

Questions # 9:

Which of the following approaches can be used to view the notebook that was run to create an MLflow run?

Options:

A.

Open the MLmodel artifact in the MLflow run page

B.

Click the "Models" link in the row corresponding to the run in the MLflow experiment page

C.

Click the "Source" link in the row corresponding to the run in the MLflow experiment page

D.

Click the "Start Time" link in the row corresponding to the run in the MLflow experiment page

Questions # 10:

The implementation of linear regression in Spark ML first attempts to solve the linear regression problem using matrix decomposition, but this method does not scale well to large datasets with a large number of variables.

Which of the following approaches does Spark ML use to distribute the training of a linear regression model for large data?

Options:

A.

Logistic regression

B.

Spark ML cannot distribute linear regression training

C.

Iterative optimization

D.

Least-squares method

E.

Singular value decomposition
