Channel: MungingData
Browsing all 26 articles

Registering Native Spark Functions

This post explains how Spark registers native functions internally and the public facing APIs for you to register your own functions. Registering native functions is important if you want to access...

Writing Dask DataFrame to a Single CSV File

Dask DataFrames are composed of multiple partitions and, by default, are written as multiple files, one per partition. This post explains the different approaches to write a Dask DataFrame to a...

Different ways to write CSV files with Dask

This post explains how to write a Dask DataFrame to CSV files. You’ll see how to write CSV files, customize the filename, change the compression, and append files to an existing lake. We’ll also...

Creating and reusing the SparkSession with PySpark

This post explains how to create a SparkSession with getOrCreate and how to reuse the SparkSession with getActiveSession. You need a SparkSession to read data stored in files, when manually creating...

Navigating None and null in PySpark

This blog post shows you how to gracefully handle null in PySpark and how to avoid null input errors. Mismanaging the null case is a common source of errors and frustration in PySpark. Following the...

Adding constant columns with lit and typedLit to PySpark DataFrames

This post explains how to add constant columns to PySpark DataFrames with lit and typedLit. You’ll see examples where these functions are useful and when these functions are invoked implicitly. lit...

Renaming Columns in Pandas DataFrames

This article explains how to rename a single or multiple columns in a Pandas DataFrame. There are multiple different ways to rename columns and you’ll often want to perform this operation, so listen...

Defining PySpark Schemas with StructType and StructField

This post explains how to define PySpark schemas and when this design pattern is useful. It’ll also explain when defining schemas seems wise, but can actually be safely avoided. Schemas are often...

Working with PySpark ArrayType Columns

This post explains how to create DataFrames with ArrayType columns and how to perform common data processing operations. Array columns are one of the most useful column types, but they’re hard for...

Testing Pandas Code

This post explains how to test Pandas code with the built-in test helper methods and with the beavis functions that give more readable error messages. Unit testing helps you write Pandas code that...

Reading Delta Lakes into pandas DataFrames

This post explains how to read Delta Lakes into pandas DataFrames. The delta-rs library makes this incredibly easy and doesn’t require any Spark dependencies. Let’s look at some simple examples,...

Self Publishing High Quality Programming Books

This post describes a workflow for self publishing programming books that readers will love. Writing a book seems like a daunting task, but it’s less intimidating if each chapter is “pre-published” as...

7 Steps for rejecting meeting invites

Meetings are the main way to kill your productivity as a creative professional. Two strategically timed meetings can eliminate your maker's hours for an entire day. Rejecting meeting invites to protect...

Splitting Large CSV files with Python

This blog post demonstrates different approaches for splitting a large CSV file into smaller CSV files and outlines the costs / benefits of the different approaches. TL;DR It’s faster to split a CSV...
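
As a rough illustration of the idea (not necessarily one of the approaches the post benchmarks), a CSV file can be split with only the Python standard library. The function name `split_csv`, the `part_NNNN.csv` naming scheme, and the `rows_per_file` parameter are assumptions made for this sketch.

```python
import csv
from pathlib import Path


def split_csv(source, out_dir, rows_per_file):
    """Split `source` into numbered CSV files, repeating the header in each.

    Hypothetical helper for illustration; returns the list of paths written.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    with open(source, newline="") as src:
        reader = csv.reader(src)
        header = next(reader)  # assumes the first row is a header
        out, writer, count = None, None, 0
        for row in reader:
            # Start a new output file for the first row and whenever
            # the current file has reached its row limit.
            if out is None or count == rows_per_file:
                if out is not None:
                    out.close()
                path = out_dir / f"part_{len(written):04d}.csv"
                out = open(path, "w", newline="")
                writer = csv.writer(out)
                writer.writerow(header)
                written.append(path)
                count = 0
            writer.writerow(row)
            count += 1
        if out is not None:
            out.close()
    return written
```

This version streams one row at a time, so memory use stays flat regardless of the input size; it makes no claim about speed relative to other tools.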

Managing Dask Software Environments with Conda

This post shows you how to set up conda on your machine and explains why it’s the best way to manage software environments for Dask projects. This blog post says that Python projects should be set up...

Add Category Column to pandas DataFrame with cut

This post explains how to add a category column to a pandas DataFrame with cut(). cut makes it easy to categorize numerical values in buckets. Let’s look at a DataFrame of people and categorize them...
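
A minimal sketch of the technique, assuming pandas is installed. The column names, the sample ages, the bin edges, and the labels are all made up for illustration and are not taken from the post.

```python
import pandas as pd

# Hypothetical sample data: three people and their ages.
df = pd.DataFrame({"name": ["ana", "bo", "cy"], "age": [10, 30, 70]})

# pd.cut assigns each value to a bucket. With the default right=True,
# the bins here are (0, 18], (18, 65], and (65, 120].
df["age_group"] = pd.cut(
    df["age"],
    bins=[0, 18, 65, 120],
    labels=["child", "adult", "senior"],
)

print(df)
```

The resulting `age_group` column has a categorical dtype, so the bucket order is preserved when sorting or grouping.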

Reading Delta Lakes into Dask DataFrames

This post explains how to read Delta Lakes into Dask DataFrames.  It shows how you can leverage powerful data lake management features like time travel, versioned data, and schema evolution with Dask....

Content creators making more than $50,000 a month

This post demonstrates how much money you can make as a content creator and contrasts the content creation and restaurant business models. Content creators can make a lot of money and enjoy a nice...

Writing NumPy Array to Text Files

This post explains the different ways to save a NumPy array to text files. After showing the different syntax options the post will teach you some better ways to write NumPy data: using binary file...

Scale big data pandas workflows with Dask

pandas is a great DataFrame library for datasets that fit comfortably in memory, but it throws out-of-memory errors for datasets that are too large. This post shows how pandas works well for a small...
