djangoproject.com | python.org | nginx.org
version seven.
  http://demongin.org
demongin.org - The Simplest Possible CSV Batcher

The Simplest Possible CSV Batcher

How to split a CSV file on a column -or- a post in which I play a little "python golf" against myself.


Wednesday, 2011-05-25 | Careerism, Programming

The whole idea of drama is three things:

  1. Who wants what from whom.
  2. What happens if they don't get it.
  3. Why now.

David Mamet

So a client dropped a monster, 18MB CSV file on me.

At work, we've got tools for analyzing CSV files--breaking them up, organizing columns, etc.--but they were choking on the file because of its (kind of) outrageous size.

So, I decided to take my file to the command line, fly out there for the day and straighten it out. Just to keep things interesting, I decided to see how short I could keep the script:

#!/usr/bin/env python

import csv

batch_suffix = "_batch.csv"     # for naming output files
split_on = 2                    # the column to batch on
path_to_file = "reports.csv"    # 

c = file(path_to_file,"r").readlines()
csv_reader = csv.reader(c)

for r in csv_reader:
    fh = file(r[split_on - 1] + batch_suffix,"a")
    csv_writer = csv.writer(fh)
    csv_writer.writerow(r)

Alternately, you could import sys, swap out the path_to_file routine for a sys.argv[1] type of deal and run the thing interactively. In such a case, you'd get something like this:

toconnell@kumiko:~/whatever$ ls -ltar
total 18268
drwxr-xr-x 7 toconnell toconnell     4096 2011-05-25 15:56 ..
-rw-r--r-- 1 toconnell toconnell 18691905 2011-05-25 16:01 reports.csv
-rwxr-xr-x 1 toconnell toconnell      365 2011-05-25 18:17 csv_split.py
drwxr-xr-x 2 toconnell toconnell     4096 2011-05-25 18:30 .
toconnell@kumiko:~/whatever$ ./csv_split.py reports.csv 
toconnell@kumiko:~/whatever$ ls -ltar
total 36468
drwxr-xr-x 7 toconnell toconnell     4096 2011-05-25 15:56 ..
-rw-r--r-- 1 toconnell toconnell 18691905 2011-05-25 16:01 reports.csv
-rwxr-xr-x 1 toconnell toconnell      365 2011-05-25 18:17 csv_split.py
-rw-r--r-- 1 toconnell toconnell  1742831 2011-05-25 18:30 COLUMN_VALUE_1_batch.csv
-rw-r--r-- 1 toconnell toconnell  2080895 2011-05-25 18:30 COL_VAL_2_batch.csv
-rw-r--r-- 1 toconnell toconnell  1250599 2011-05-25 18:30 THING_batch.csv
...

...and so on.