Updated October 14th, 2022 by chetan.kardekar

Ensure consistency in statistics functions between Spark 3.0 and Spark 3.1 and above

Problem The statistics functions covar_samp, kurtosis, skewness, std, stddev, stddev_samp, variance, and var_samp return NaN when a divide by zero occurs during expression evaluation in Databricks Runtime 7.3 LTS. The same functions return null in Databricks Runtime 9.1 LTS and above, as well as in Databricks SQL endpoints, when a divide by zero occur...
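To make the divide by zero concrete, here is a minimal plain-Python sketch of a sample variance. It is not Spark code: the function name var_samp_sketch and the legacy_nan flag are hypothetical, chosen only to contrast the NaN result with the null (None) result described above. A sample variance divides by n - 1, so a single input value is the simplest case where the divisor is zero.

```python
from typing import Optional, Sequence


def var_samp_sketch(values: Sequence[float], legacy_nan: bool = False) -> Optional[float]:
    """Illustrative sample variance (hypothetical helper, not a Spark API).

    legacy_nan=True mimics the NaN result; legacy_nan=False mimics the null result.
    """
    n = len(values)
    if n == 0:
        # Assumption: with no input rows there is nothing to aggregate.
        return None
    mean = sum(values) / n
    sum_sq_dev = sum((v - mean) ** 2 for v in values)
    if n - 1 == 0:
        # One value: the divisor (n - 1) is zero, so no variance can be computed.
        return float("nan") if legacy_nan else None
    return sum_sq_dev / (n - 1)


single = [5.0]
print(var_samp_sketch(single, legacy_nan=True))   # NaN-style result
print(var_samp_sketch(single, legacy_nan=False))  # null-style result
```

Code that compares or filters on these results should handle both NaN and null if it has to run on both runtime versions.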

0 min reading time
Updated July 22nd, 2022 by chetan.kardekar

Parsing post meridiem time (PM) with to_timestamp() returns null

Problem You are trying to parse a 12-hour (AM/PM) time value with to_timestamp(), but instead of returning a 24-hour time value it returns null. For example, this sample code returns null when run: %sql SELECT to_timestamp('2016-12-31 10:12:00 PM', 'yyyy-MM-dd HH:mm:ss a'); Cause to_timestamp() requires the hour format to be in lowercase. If the ho...

0 min reading time
Updated February 27th, 2023 by chetan.kardekar

Apache Spark UI is not in sync with job

Problem The status of your Spark jobs is not correctly shown in the Spark UI (AWS | Azure | GCP). Some of the jobs that are confirmed to be in the Completed state are shown as Active/Running in the Spark UI. In some cases the Spark UI may appear blank. When you review the driver logs, you see an AsyncEventQueue warning. Logs ===== 20/12/23 21:20:26 ...

1 min reading time
Updated October 26th, 2022 by chetan.kardekar

Optimize streaming transactions with .trigger

When running a structured streaming application that uses cloud storage buckets (S3, ADLS Gen2, etc.), it is easy to incur excessive transactions as you access the storage bucket. Failing to specify a .trigger option in your streaming code is one common reason for a high number of storage transactions. When a .trigger option is not specified, the sto...
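As a rough back-of-the-envelope illustration, the sketch below counts how many times a stream would trigger in 24 hours at a fixed interval. The helper name triggers_per_day is hypothetical, and the sketch assumes each trigger causes at least one request against the storage bucket. The 0.5-second figure is only a stand-in for a stream that polls almost continuously, not a measured value.

```python
SECONDS_PER_DAY = 86_400


def triggers_per_day(trigger_interval_seconds: float) -> int:
    """Upper bound on micro-batch triggers in 24 hours (hypothetical helper)."""
    if trigger_interval_seconds <= 0:
        raise ValueError("trigger interval must be positive")
    return int(SECONDS_PER_DAY // trigger_interval_seconds)


# Assumed intervals for comparison only.
for interval in (0.5, 60, 600):
    print(f"{interval:>6} s interval -> {triggers_per_day(interval)} triggers per day")
```

Under these assumptions, moving from near-continuous polling to a one-minute interval cuts the trigger count by two orders of magnitude.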

1 min reading time
Updated May 10th, 2022 by chetan.kardekar

Identify duplicate data on append operations

A common issue when performing append operations on Delta tables is duplicate data. For example, assume user 1 performs a write operation on Delta table A. At the same time, user 2 performs an append operation on Delta table A. This can lead to duplicate records in the table. In this article, we review basic troubleshooting steps that you can use to...
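One generic way to confirm that duplicates exist is to count how often each record key appears and keep the keys seen more than once. The sketch below shows that idea in plain Python. It is not Delta-specific and is not taken from the article's troubleshooting steps; the helper name find_duplicates and the sample keys are hypothetical.

```python
from collections import Counter
from typing import Dict, Hashable, Iterable


def find_duplicates(keys: Iterable[Hashable]) -> Dict[Hashable, int]:
    """Return each key that occurs more than once, with its count (hypothetical helper)."""
    counts = Counter(keys)
    return {key: count for key, count in counts.items() if count > 1}


# Hypothetical record keys, as if two writers appended overlapping rows.
appended_keys = ["order-1", "order-2", "order-2", "order-3", "order-3", "order-3"]
print(find_duplicates(appended_keys))
```

The same group-and-count pattern applies to a table query that groups on the key columns and keeps the groups with a count above one.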

1 min reading time
Updated May 16th, 2022 by chetan.kardekar

Hyperopt fails with maxNumConcurrentTasks error

Problem You are tuning machine learning parameters using Hyperopt when your job fails with a py4j.Py4JException: Method maxNumConcurrentTasks([]) does not exist error. You are using a Databricks Runtime for Machine Learning (Databricks Runtime ML) cluster. Cause Databricks Runtime ML has a compatible version of Hyperopt pre-installed (AWS | Azure | ...

0 min reading time