This blog introduces a few interesting alternatives to Pandas.
Data comes in many formats: CSV, flat files, JSON, and so on. When a dataset is large, it is difficult to load into memory, and exploratory data analysis (EDA) becomes time-consuming. This blog focuses on tabular data in CSV format, processed with Pandas and with alternatives such as cuDF, Dask, Modin and datatable.
Problem: Importing (reading) a large CSV file can lead to an Out of Memory error. When there is not enough RAM to hold the entire CSV at once, the computer often crashes. Processing can also be slow, because only a single CPU core is used.
About the data used in this exploration: a sales dataset of 5 million rows and 14 columns, as shown below. The dataset, in Zip format, can be found here:
Downloads 18 - Sample CSV Files / Data Sets for Testing (till 5 Million Records) - Sales
Disclaimer - The datasets are generated through random logic in VBA. These are not real sales data and should not be…
First 5 rows of the data:
Pandas is designed to run on a single core, so it cannot take advantage of the multiple cores available on your system.
However, the cuDF library aims to implement the Pandas API on the GPU, while Modin and the Dask DataFrame library provide parallel algorithms around the Pandas API.
Modin aims to parallelize the entire pandas API, without exception. The goal is that every pandas function can be used through Modin (operations it has not yet parallelized fall back to the regular pandas implementation). Switching takes a one-line change in the code: import modin.pandas as pd. (Check the code at the git link below to learn about the implementation.)
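To make the one-line change concrete, here is a minimal sketch. It is not the code from the git link: the tiny CSV and its column names (Region, Units Sold) are invented for illustration, and the fallback to plain pandas is only there so the snippet still runs where Modin is not installed.

```python
import os
import tempfile

try:
    import modin.pandas as pd  # the one-line change: Modin instead of pandas
except ImportError:
    import pandas as pd  # fallback so the sketch still runs without Modin

# Tiny stand-in for the sales CSV; these column names are made up for the example.
SAMPLE_CSV = "Region,Units Sold\nAsia,10\nEurope,20\nAsia,5\n"


def total_units_by_region(csv_text):
    """Write csv_text to a temporary file, read it back, and sum units per region."""
    with tempfile.TemporaryDirectory() as tmp_dir:
        csv_path = os.path.join(tmp_dir, "sales_sample.csv")
        with open(csv_path, "w") as handle:
            handle.write(csv_text)

        # From here on it is ordinary pandas-style code, whichever import succeeded.
        df = pd.read_csv(csv_path)
        totals = df.groupby("Region")["Units Sold"].sum()

        # Convert to plain Python values while the temporary file still exists.
        return {region: int(total) for region, total in totals.items()}


if __name__ == "__main__":
    print(total_units_by_region(SAMPLE_CSV))
```

The rest of an existing pandas script should not need to change, which is the main appeal of Modin. On a file this small the parallel engine's start-up cost will likely outweigh any gain, so the benefit only shows up on large files such as the 5-million-row dataset.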
Dask currently lacks several pandas APIs that Modin has implemented.
To learn how to control the number of processors that Modin uses via the num_cpus parameter, check out:
On Windows, you can check the number of cores on your system in Task Manager > Performance.
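If you prefer to check from code, or you are not on Windows, the Python standard library can report the same information. A small sketch follows. Note that os.cpu_count() counts logical cores (hardware threads), which may be higher than the number of physical cores shown elsewhere.

```python
import os

# os.cpu_count() returns the number of logical cores, or None if it cannot
# be determined; fall back to 1 in that case.
logical_cores = os.cpu_count() or 1
print(f"Logical CPU cores available: {logical_cores}")
```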
Setting num_cpus higher than the number of cores available on your system won't improve performance; it may even reduce it. In my case, going beyond the 6 cores available on my system degraded performance slightly, as shown in the plot above.
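One way to avoid oversubscribing is to cap the requested value at the number of cores the machine reports. The sketch below is an illustration, not the setup used for the plot: capped_cpu_count is a hypothetical helper, and MODIN_CPUS is, as far as I know, the environment variable Modin reads for its CPU count, so verify it against the Modin documentation for your version.

```python
import os


def capped_cpu_count(requested):
    """Return `requested`, limited to the range [1, logical cores available]."""
    available = os.cpu_count() or 1  # cpu_count() may return None
    return max(1, min(requested, available))


# Ask for 16 workers, but never more than the machine actually has.
num_cpus = capped_cpu_count(16)

# Assumption: Modin picks this up if it is set before `import modin.pandas`.
# With the Ray engine, an alternative is ray.init(num_cpus=num_cpus) before
# importing Modin.
os.environ["MODIN_CPUS"] = str(num_cpus)

print(f"Modin will be asked to use {num_cpus} core(s)")
```

Because os.cpu_count() reports logical cores, the best setting on a given machine may still be lower than this cap, so it is worth timing a few values as done above.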
Note: This is the first blog of a series. Stay tuned for more.