Saturday, February 17, 2007

Python script to convert CSV files to Excel

I spent much of my last weekend generating large flat files of denormalized data from various data sources, and then converting it to Excel spreadsheets for human consumption. Although the process is quite simple, thanks to OpenOffice (I am a Linux user, in case you haven't guessed already), its a pain to have to navigate the GUI repeatedly to do this.

Looking around the web, I did not find a ready made script that would do what I wanted, but it seemed fairly simple to do, so I decided to do this rather than suffer through yet another GUI invocation just to convert my CSV file to Excel. To read and parse the file, I used Python's built in CSV module, and to write out the Excel spreadsheets, I downloaded and installed the pyExcelerator module. In addition, since my data files were rather large, I put in a splitting mechanism that would allow me to split my output into multiple Excel files of a specified file size. Here is the script.

  1
  2
  3
  4
  5
  6
  7
  8
  9
 10
 11
 12
 13
 14
 15
 16
 17
 18
 19
 20
 21
 22
 23
 24
 25
 26
 27
 28
 29
 30
 31
 32
 33
 34
 35
 36
 37
 38
 39
 40
 41
 42
 43
 44
 45
 46
 47
 48
 49
 50
 51
 52
 53
 54
 55
 56
 57
 58
 59
 60
 61
 62
 63
 64
 65
 66
 67
 68
 69
 70
 71
 72
 73
 74
 75
 76
 77
 78
 79
 80
 81
 82
 83
 84
 85
 86
 87
 88
 89
 90
 91
 92
 93
 94
 95
 96
 97
 98
 99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
#!/usr/local/bin/python
# Tool to convert CSV files (with configurable delimiter and text wrap
# character) to Excel spreadsheets.
import string
import sys
import getopt
import re
import os
import os.path
import csv
from pyExcelerator import *

def usage():
  """ Display the usage """
  print "Usage:" + sys.argv[0] + " [OPTIONS] csvfile"
  print "OPTIONS:"
  print "--title|-t: If set, the first line is the title line"
  print "--lines|-l n: Split output into files of n lines or less each"
  print "--sep|-s c [def:,] : The character to use for field delimiter"
  print "--output|o : output file name/pattern"
  print "--help|h : print this information"
  sys.exit(2)

def openExcelSheet(outputFileName):
  """ Opens a reference to an Excel WorkBook and Worksheet objects """
  workbook = Workbook()
  worksheet = workbook.add_sheet("Sheet 1")
  return workbook, worksheet

def writeExcelHeader(worksheet, titleCols):
  """ Write the header line into the worksheet """
  cno = 0
  for titleCol in titleCols:
    worksheet.write(0, cno, titleCol)
    cno = cno + 1

def writeExcelRow(worksheet, lno, columns):
  """ Write a non-header row into the worksheet """
  cno = 0
  for column in columns:
    worksheet.write(lno, cno, column)
    cno = cno + 1

def closeExcelSheet(workbook, outputFileName):
  """ Saves the in-memory WorkBook object into the specified file """
  workbook.save(outputFileName)

def getDefaultOutputFileName(inputFileName):
  """ Returns the name of the default output file based on the value
      of the input file. The default output file is always created in
      the current working directory. This can be overriden using the
      -o or --output option to explicitly specify an output file """
  baseName = os.path.basename(inputFileName)
  rootName = os.path.splitext(baseName)[0]
  return string.join([rootName, "xls"], '.')

def renameOutputFile(outputFileName, fno):
  """ Renames the output file name by appending the current file number
      to it """
  dirName, baseName = os.path.split(outputFileName)
  rootName, extName = os.path.splitext(baseName)
  backupFileBaseName = string.join([string.join([rootName, str(fno)], '-'), extName], '')
  backupFileName = os.path.join(dirName, backupFileBaseName)
  try:
    os.rename(outputFileName, backupFileName)
  except OSError:
    print "Error renaming output file:", outputFileName, "to", backupFileName, "...aborting"
    sys.exit(-1)

def validateOpts(opts):
  """ Returns option values specified, or the default if none """
  titlePresent = False
  linesPerFile = -1
  outputFileName = ""
  sepChar = ","
  for option, argval in opts:
    if (option in ("-t", "--title")):
      titlePresent = True
    if (option in ("-l", "--lines")):
      linesPerFile = int(argval)
    if (option in ("-s", "--sep")):
      sepChar = argval
    if (option in ("-o", "--output")):
      outputFileName = argval
    if (option in ("-h", "--help")):
      usage()
  return titlePresent, linesPerFile, sepChar, outputFileName

def main():
  """ This is how we are called """
  try:
    opts,args = getopt.getopt(sys.argv[1:], "tl:s:o:h", ["title", "lines=", "sep=", "output=", "help"])
  except getopt.GetoptError:
    usage()
  if (len(args) != 1):
    usage()
  inputFileName = args[0]
  try:
    inputFile = open(inputFileName, 'r')
  except IOError:
    print "File not found:", inputFileName, "...aborting"
    sys.exit(-1)
  titlePresent, linesPerFile, sepChar, outputFileName = validateOpts(opts)
  if (outputFileName == ""):
    outputFileName = getDefaultOutputFileName(inputFileName)
  workbook, worksheet = openExcelSheet(outputFileName)
  fno = 0
  lno = 0
  titleCols = []
  reader = csv.reader(inputFile, delimiter=sepChar)
  for line in reader:
    if (lno == 0 and titlePresent):
      if (len(titleCols) == 0):
        titleCols = line
      writeExcelHeader(worksheet, titleCols)
    else:
      writeExcelRow(worksheet, lno, line)
    lno = lno + 1
    if (linesPerFile != -1 and lno >= linesPerFile):
      closeExcelSheet(workbook, outputFileName)
      renameOutputFile(outputFileName, fno)
      fno = fno + 1
      lno = 0
      workbook, worksheet = openExcelSheet(outputFileName)
  inputFile.close()
  closeExcelSheet(workbook, outputFileName)
  if (fno > 0):
    renameOutputFile(outputFileName, fno)

if __name__ == "__main__":
  main()

Based on my not finding completed examples for this kind of thing on the web, I can only assume that the script is probably too trivial for most seasoned programmers to consider contributing back to the community. However, based on the few questions I saw about whether such a thing exists, I would conclude that there are some programmers (like me) for whom this is sufficiently non-trivial to consider looking around before writing one themselves. If you are in the latter category, I hope you find this script useful.

60 comments (moderated to prevent spam):

jerry said...

Thank you. I am new to python, and dislike excel, but need to do just this.

Can I ask what license you are offering this under?

Sujit Pal said...

Hi Jerry, I am glad you find this script useful. As for licensing, I would appreciate an attribution if you feel it is practical (for example, if you are using it in open-source code), but other than that, you are welcome to use it as you see fit. Its a small enough thing considering the benefits I have received from other programmers posting their code and expertise online...

Peter Skomoroch said...

Nice work, I'm going to give it a try on my macbook.

Peter Skomoroch said...

Thanks for posting this, here's a quick hack to get it working on Mac where the CSV module chokes on '\r' newline characters:

Modify the file read in main() to use universal newline for Macs:


inputFile = open(inputFileName,'rUb')

-Pete

Sujit Pal said...

Thanks very much for the patch, Peter. I don't use Mac OSX, but this would be helpful info for the people that do. Thanks again.

Anonymous said...

Hi there Sujit,

Just leaving a comment to say how helpful this was for me. Had to make a few minor changes, but other than that it suited my purpose perfectly!

THANKS!!

David

Sujit Pal said...

Thanks for the feedback David, and I am glad that it helped.

ప్రదీప్ said...

Thank you so much for this package. :)
Finding it more useful !!

Sujit Pal said...

You are welcome, and I am glad you find it useful.

Anonymous said...

Would it be possible to get the header line to output in bold and to set the column widths?

Thanks,

P.S. This script is great !

Sujit Pal said...

Yes, I am pretty sure you can do this - there are methods in the Python Excel extension to change the font and set column widths.

Daniel said...

Here is a patch that will allow import and use of this program in python. Instead of command line opts you specify the kwargs titlePresent, linesPerFile, sepChar, and outputFilename, which are set to the same defaults as they were in the original.

3c3,8
< # character) to Excel spreadsheets.
---
> # character) to Excel spreadsheets. Written by Sujit Pal
> # (sujitpal.blogspot.com), posted at http://sujitpal.blogspot.com/2007/02/python-script-
to-convert-csv-files-to.html.
> # Modified by Daniel Frishberg on 4/9/2008 to separate out main() into
> # main_from_cmd_line() and main(), to allow use from other python programs.
> # Dependency: pyExcelerator at http://pypi.python.org/pypi/pyExcelerator.
>
89c94
< def main():
---
> def main_from_cmd_line():
97a103,106
> titlePresent, linesPerFile, sepChar, outputFileName = validateOpts(opts)
> main(inputFileName, titlePresent, linesPerFile, outputFileName, sepChar)
>
> def main(inputFileName, titlePresent=False, linesPerFile=-1, outputFileName="", sepChar=
","):
103d111
< titlePresent, linesPerFile, sepChar, outputFileName = validateOpts(opts)
131c139
< main()
---
> main_from_cmd_line()

Sujit Pal said...

Thanks Daniel, much appreciated. I will probably continue to use this as a script, but maybe this will be useful to others.

ZooX said...

Thank You very much for that (excel)lent script!

This is exactly what I need!

Sujit Pal said...

Thanks for the feedback, Zook, glad it helped.

Anonymous said...

You have some minor but possibly important typos in your script: You have 'cvs' in a number of places where you mean 'csv'.

Thanks for the post. It's saved my a good amount of work by giving me a nice starting place for a small project.

Sujit Pal said...

Hi, glad it helped you and thanks for pointing this out, I did a search and found the place you are referring to - fixing now.

Matt Dubins said...

Thanks so much, this script works like a charm :)

Matt Dubins

Sujit Pal said...

Thanks Matt, glad it helped.

rana said...

This a great script ..... I used this to convert my large csv file to .xls with multiple sheets .... But for a large .csv file, conversion taking more time ... sometimes it hangs ... I donot have much idea over Python scipting ..... is it possible to reduce time ???..

Sujit Pal said...

Thanks, rana. I have been using my script quite extensively these past few months (once again generating large spreadsheets from flat files :-)), and I haven't actually found a case where it hangs. In fact, I find in lots of cases, OpenOffice chokes on some of my files, but csv2xls.py rips right through. There /is/ a part of the script that is memory intensive though, where it uses the pyExcelerator to build potentially large spreadsheets in memory before dumping them out to the spreadsheet file, and that could cause a hang like you noticed. If thats the case, you may want to ask the script to produce a larger number of smaller excel files.

michael said...

Very useful script. Thanks that you made it available. I couldn't have written is myself (yet :)

Sujit Pal said...

You're welcome, Michael, and thanks for the feedback.

John said...

This was awesome. Thanks!

Sujit Pal said...

Thanks, John.

pclin1988 said...

Hy everybody, nice script. I used at work :). but I have a problem, when I have to add to the worksheet special characters (French letters) it doesn't work: even with: # -*- coding:utf-8 -*- at the beginning of the script.

Can anyone help me?

Sujit Pal said...

Thanks pclin1988. I have only used it with US_ASCII, so won't be able to help you there. If someone knows the fix to this, I am guessing they would post it here. You may want to post to the pyExcelerator support forums, they would probably be more likely able to help with your issue.

Alex said...

My sister asked me an advice about crashed excel files. I advised it an one tool, which to my mind is the best solution in this and probably other cases - excel 2007 recovery.

Sujit Pal said...

Thanks Alex, good to know.

Satish said...

sujit, thanks for a script. I would like to know if it is possible to combine many csv files into a single xls file? that would be an added flexibility..

Satish

Sujit Pal said...

Thanks Satish. The functionality you are looking for is fairly easy to build in, instead of reading a single csv file, just get a list of files and loop through them.

Ranjeet said...

Hi Sujit, thaks very much for this. Surprising that since 2007 there aren't any other competitors out there for your CSV2Excel, which probably indicates the usability of your script. I'll give it a shot and let you know how it works for me.

Ranjeet

Josh said...

Sujit,
First I must say, thank you for a wonderful script that has solved worlds of headaches. I find it incredible that 4 years after it's release, it's still be looking for and found based on a simple google search.

I, in the future, would like to learn from this and tailor it to some specific needs/wants I have, but I do have one question.

Is there a way to change this around so you can pipe output directly to the script instead of having to save a file and run the script against that?

Thanks

Sujit Pal said...

@Josh/Ranjit: probably indicates that CSV to Excel conversion is somewhat rare, which is why nobody really cares that much about the problem to write different versions :-). Anyway, glad it helped, I know it has helped me a lot over the years :-).

@Josh: regarding reading from stdin, haven't tried it myself, but you may want to change the script something like this:

114: reader = csv.read(sys.stdin, delimiter=sepChar)

Ranjeet said...

Sujit, Thank you so much, It works like a charm, and is teaching me how to write good code.

Sujit Pal said...

Thanks for the kind words, Ranjeet, and you are welcome.

Tejas said...

Hi Sujith
Do you have any script to convert .xlsx to .csv
I tried using xlrd but it only works for .xls and not for .xlsx

Any inputs to convert .xlsx to .csv would be much appreciated

Sujit Pal said...

Hi Tejas, AFAIK XLSX uses the MS Open XML format for spreadsheets (the format they came up with in response to Open Document Format's ODS format), so it is written out as an XML file ultimately. With ODS, the file is a zipped set of (mainly) 4 XML files, one of which is the content XML file. I suspect (but am not sure) that MS OOXML does it similarly, but it should be easy to find on the internet or figure out. So once you have the XML contents, it should be relatively simple to parse with an XML parser and convert to CSV.

Anonymous said...

This is an awesome conversion tool. However, is there a way to make it convert spreadsheets into csv files instead?

Thanks

Sujit Pal said...

We had this same requirement couple of weeks ago, and our DBA suggested the unix add-on xls2csv. Check out this page for a review of both the C version (which was suggested by our DBA) and a Perl version.

jeeads said...

Hello, Great script but how do you turn on word wrapping for all columns? I get how to format each column but not with word wrapping.

Sujit Pal said...

Hi jeeads, by word wrapping you mean excel columns with embedded newlines, yes? If so, I believe this script will work, although I haven't tried it myself. Columns with embedded newlines are represented in CSV as quoted multiline text, thus your CSV row will span multiple lines. Reason I think it should just work is that I used the Python csv module to read the input (rather than just readline and split), which should handle this condition.

Milan said...

Hi Sujit,

many thanks for your script!
I adapted it for my needs and it works perfectly! And it helped me improving my Python skills!

Cheers

Sujit Pal said...

You are welcome Milan, glad it helped.

Sahana said...

Thanks, Sujit. Helpful script but it is only half of what I am in need of! I am tweaking it a bit but, is there an updated version available somewhere which can take in multiple csv files for conversion?(the output needs to be in a single xls file with each csv input file in a separate sheet)


Sujit Pal said...

Thanks Sahana, but unfortunately, no there no updates as far as I know... I built this for myself and I haven't had a need for loading separate worksheets from multiple csv files, I guess my clients are less demanding than yours :-). But I just looked at the pyExcelerator documentation (in the python REPL, import pyExcelerator; help(pyExcelerator) and they seem to have support for worksheets, so I would think this would be fairly easy to extend by calling the current functionality (one file -> one worksheet) multiple times via function calls.

jeeads said...

Thanks to my earlier reply. I did eventually figure out how to word wrap -- I changed the line 41 to
worksheet.write(lno, cno, column, font_style('left',1))
but end up with errors if csv is over 2 megs. And if I take the script as is I end up getting same errors if csv is over 15 megs.
Traceback (most recent call last):
File "csvtoexcel_Patrons.py", line 175, in
main()
File "csvtoexcel_Patrons.py", line 170, in main
closeExcelSheet(workbook, outputFileName)
File "csvtoexcel_Patrons.py", line 90, in closeExcelSheet
workbook.save(outputFileName)
File "/Library/Python/2.7/site-packages/pyExcelerator/Workbook.py", line 610, in save
doc.save(filename, self.get_biff_data())
File "/Library/Python/2.7/site-packages/pyExcelerator/Workbook.py", line 595, in get_biff_data
data = sheet.get_biff_data()
File "/Library/Python/2.7/site-packages/pyExcelerator/Worksheet.py", line 1424, in get_biff_data
result += self.__row_blocks_rec()
File "/Library/Python/2.7/site-packages/pyExcelerator/Worksheet.py", line 1333, in __row_blocks_rec
result.append(self.__rows[used_rows[i]].get_row_biff_data())
File "/Library/Python/2.7/site-packages/pyExcelerator/Row.py", line 200, in get_row_biff_data
return BIFFRecords.RowRecord(self.__idx, self.__min_col_idx, self.__max_col_idx, height_options, options).get()
File "/Library/Python/2.7/site-packages/pyExcelerator/BIFFRecords.py", line 1430, in __init__
options)
struct.error: ushort format requires 0 <= number <= USHRT_MAX

Sujit Pal said...

Hi, this appears to be a bug/limitation of pyExcelerator, it may be better to ask on its mailing list. I also found this python-excel issue, which indicates that the column in question may be defined incorrectly, that may be something you may want to look at as well.

Prem Prakash said...

hi,

i am new to python. i was trying to run this code by the command "python csv2xls.py" (without the quotes) and it is giving me the following without any output.

OPTIONS:
--title|-t: If set, the first line is the title line
--lines|-l n: Split output into files of n lines or less each
--sep|-s c [def:,] : The character to use for field delimiter
--output|o : output file name/pattern
--help|h : print this information


What must be going wrong here? Thanks in advance and thanks for the script!

Sujit Pal said...

Hi Prem, its telling you how to use it. if you look one line up in the generated message (you didn't include it in your comment), it will say something like "Usage: ...". So minimally you will need to call it like "python csv2xls.py inputfile.csv". Additionally, if your column delimiters are other than "," then you must specify this using the --sep or -s option. If you want the output written to a specific Excel file (as opposed to basename(inputfile).xls, then you must specify the --output or -o option, etc.

Anonymous said...

For reading from stdin, to allow bash pipeing, this may work

< inputFile = open(inputFileName, 'r')
---
> if (inputFileName=="-"):
> inputFile= sys.stdin
> else:
> inputFile = open(inputFileName, 'r')

Sujit Pal said...

Thanks!

Zwindler said...

Even after 7 years, this post is much alive ! Many thanks.

For my personal needs, I just modified your script to allow me to insert as many CSV files as I want in a single XLS file (each CSV has his own worksheet), not just one.

This way, I can aggregate all my CSV files in a single XLS with just a call of :
csv2excel.py csv_path/*.csv

Sujit Pal said...

Nice! And yes, it does look I ended up scratching a big itch with this one :-).

Bernd said...

Hi,
I am trying to convert a .csv file with 2500 columns and 14 rows. The script generates an erroneous file which is only partly readable in excel. The transposed version of this table is processed correctly. Do you have any suggestions how I could get it to work?

Thanks, Bernd

Sujit Pal said...

Hi Bernd, its possible that you are hitting some limit in pyExcelerator with the high number of columns. Since you mention your ability to transpose, I take it that your data is probably coming from a matrix library like numpy? In that case, perhaps you could simply transpose and write it to the file, then open Excel and transpose it back?

Paula Vieira said...

Thank you for this wonderful script.
I am trying to change it so it can create a unique excel file with multiple tabs, one per csv file but so far I was not able to do this. Thanks Again!

Sujit Pal said...

You are welcome and thank you for the kind words. With respect to creating spreadsheets with multiple tabs I haven't been able to do it either - I needed this for generating spreadsheets out of reports for our Informaticist, and since it was for internal use, I was able to convince him to do without the convenience of multiple tabs :-).

Anonymous said...

To make a .xls file from multiple csv files, in Linux you can install gnumeric, which comes with a CLI utility called ssconvert.

ssconvert --merge-to=mynewfiile.xls 1.csv 2.csv 3.csv ...

Each file ends up on a tab in the xls with the filename.

Sujit Pal said...

Very nice, thank you!