Registering Native Spark Functions
This post explains how Spark registers native functions internally and the public-facing APIs you can use to register your own functions. Registering native functions is important if you want to access...
Writing Dask DataFrame to a Single CSV File
Dask DataFrames are composed of multiple partitions and, by default, are written out as multiple files, one per partition. This post explains the different approaches to write a Dask DataFrame to a...
Different ways to write CSV files with Dask
This post explains how to write a Dask DataFrame to CSV files. You’ll see how to write CSV files, customize the filename, change the compression, and append files to an existing lake. We’ll also...
Creating and reusing the SparkSession with PySpark
This post explains how to create a SparkSession with getOrCreate and how to reuse the SparkSession with getActiveSession. You need a SparkSession to read data stored in files, when manually creating...
Navigating None and null in PySpark
This blog post shows you how to gracefully handle null in PySpark and how to avoid null input errors. Mismanaging the null case is a common source of errors and frustration in PySpark. Following the...
Adding constant columns with lit and typedLit to PySpark DataFrames
This post explains how to add constant columns to PySpark DataFrames with lit and typedLit. You’ll see examples where these functions are useful and when these functions are invoked implicitly. lit...
Renaming Columns in Pandas DataFrames
This article explains how to rename one or more columns in a Pandas DataFrame. There are several ways to rename columns and you’ll often want to perform this operation, so listen...
Defining PySpark Schemas with StructType and StructField
This post explains how to define PySpark schemas and when this design pattern is useful. It’ll also explain when defining schemas seems wise, but can actually be safely avoided. Schemas are often...
Working with PySpark ArrayType Columns
This post explains how to create DataFrames with ArrayType columns and how to perform common data processing operations. Array columns are one of the most useful column types, but they’re hard for...
Testing Pandas Code
This post explains how to test Pandas code with the built-in test helper methods and with the beavis functions that give more readable error messages. Unit testing helps you write Pandas code that...
Reading Delta Lakes into pandas DataFrames
This post explains how to read Delta Lakes into pandas DataFrames. The delta-rs library makes this incredibly easy and doesn’t require any Spark dependencies. Let’s look at some simple examples,...
Self Publishing High Quality Programming Books
This post describes a workflow for self-publishing programming books that readers will love. Writing a book seems like a daunting task, but it’s less intimidating if each chapter is “pre-published” as...
7 Steps for rejecting meeting invites
Meetings are the main way to kill your productivity as a creative professional. Two strategically timed meetings can eliminate your maker hours for an entire day. Rejecting meeting invites to protect...
Splitting Large CSV files with Python
This blog post demonstrates different approaches for splitting a large CSV file into smaller CSV files and outlines the costs / benefits of the different approaches. TL;DR It’s faster to split a CSV...
Managing Dask Software Environments with Conda
This post shows you how to set up conda on your machine and explains why it’s the best way to manage software environments for Dask projects. This blog post says that Python projects should be set up...
Add Category Column to pandas DataFrame with cut
This post explains how to add a category column to a pandas DataFrame with cut(). cut makes it easy to categorize numerical values into buckets. Let’s look at a DataFrame of people and categorize them...
Reading Delta Lakes into Dask DataFrames
This post explains how to read Delta Lakes into Dask DataFrames. It shows how you can leverage powerful data lake management features like time travel, versioned data, and schema evolution with Dask....
Content creators making more than $50,000 a month
This post demonstrates how much money you can make as a content creator and contrasts the content creation and restaurant business models. Content creators can make a lot of money and enjoy a nice...
Writing NumPy Array to Text Files
This post explains the different ways to save a NumPy array to text files. After showing the different syntax options, the post will teach you some better ways to write NumPy data: using binary file...
Scale big data pandas workflows with Dask
pandas is a great DataFrame library for datasets that fit comfortably in memory, but it throws out-of-memory exceptions for datasets that are too large. This post shows how pandas works well for a small...