I am analyzing a trace in GTFS, a collection of CSV files. Each record of a csv file is consisting of one or more fields speparated by commas
,. It is error-prone to read csv files by simply using
[line.split(',') for line in f.readlines()] for fields might contain commas. Therefore, I decide to switch to the Python module csv.
1. CSV format
CSV (Comma-separated values) files store tabular data (numbers and text) in plain text. Each line of the file is a data record. Each record consists of one or more fields, separated by commas. CSV format is the most common import and export format for spreadsheets and databases. Note that fields with embedded commas must be quoted (enclosed within double-quote characters) to avoid confusions. For instance,
stop_id,stop_code,stop_name,stop_desc,stop_lat,stop_lon,location_type,parent_station 3666595,,"227, AVENUE DES NATIONS","AVENUE DE LA PYRAMIDE - 93073",48.98292708588317,2.5197046726137797,0,
The Python module csv is designed to read and write csv files, primarily including:
csv.writer(...), read and write sequences
csv.DictWriter(...), read and write data in dictionary form
2. Read and write sequences
2.1 Read a csv file into a list of lists
csv.reader(csvfile, dialect='excel', **fmtparams) returns a reader object which will iterate over lines in the given `csvfile.
import csv # Usage csv.reader(csvfile, dialect='excel', **fmtparams) # Return a reader object which will iterate over lines in the given csvfile. with open(file_preprocess, 'r') as f: lists = [row for row in csv.reader(f, delimiter=',')] # read a csv file into a list of lists
2.2 Write a list of lists to a csv file
csv.writer(csvfile, dialect='excel', **fmtparams) returns a writer object responsible for converting the user’s data into delimited strings on the given file-like object.
csvfile can be any object with a
# Usage: csv.writer(csvfile, dialect='excel', **fmtparams) # return a writer object def write_lists_to_csv(filename, lists, fieldnames=None): with open(out_file, 'w') as f: writer = csv.writer(f) # add header if fieldnames: writer.writerow(fieldnames) writer.writerows(lists)
2.3 dialect and fmtparams
The value of
dialect can be one of:
excel, the usual properties of an Excel-generated CSV file
excel-tab, the usual properties of an Excel-generated TAB-delimited file
unix, the usual properties of a CSV file generated on UNIX systems, i.e. using ‘\n’ as line terminator and quoting all fields
fmtparams is described in Dialects and Formatting Parameters. To make it easier to specify the format of input and output records, specific formatting parameters are grouped together into dialects. Dialects support the following attributes:
delimiter=',' # A one-character string used to separate fields doublequote=True # When True, a field within `quotechar` should be quoted; When False, the escapechar is used as a prefix to the quotechar lineterminator='\r\n' # the string used to terminate lines produced by the writer. Note The reader is hard-coded to recognise either '\r' or '\n' as end-of-line, and ignores lineterminator. quotechar='"' # A one-character string used to quote fields containing special characters escapechar=QUOTE_NONE # Escaping is disable by default. quoting='QUOTE_MINIMAL' # Controls when quotes should be generated by the writer and recognised by the reader. skipinitialspace=False # When True, whitespace immediately following the delimiter is ignored. strict=False # When True, raise exception Error on bad CSV input.
2. Read and write data in dictionary form
class csv.DictReader(csvfile, fieldnames=None, restkey=None, restval=None, dialect='excel', *args, **kwds) # Create an object that maps the information read into a dict class csv.DictWriter(csvfile, fieldnames, restval='', extrasaction='raise', dialect='excel', *args, **kwds) # Create an object that maps dictionaries onto output rows.
Appendix: file open mode
The available modes for
ropen for reading (default)
wopen for writing, truncating the file first
xopen for exclusive creation, failing if the file already exists
aopen for writing, appending to the end of the file if it exists
ttext mode (default)
+open a disk file for updating (reading and writing)
Uuniversal newlines mode (deprecated since
Wikipedia: Comma-separated values