Python: How to Read a CSV Using the Command Line

CSV (comma-separated value) files are a common file format for transferring and storing data. The ability to read, manipulate, and write data to and from CSV files using Python is a key skill to master for any data scientist or business analyst. In this post, we'll go over what CSV files are, how to read CSV files into Pandas DataFrames, and how to write DataFrames back to CSV files post analysis.

Pandas is the most popular data manipulation package in Python, and DataFrames are the Pandas data type for storing tabular 2D data.

  • Load CSV files to Python Pandas
  • 1. File Extensions and File Types
  • 2. Data Representation in CSV files
    • Other Delimiters / Separators – TSV files
    • Delimiters in Text Fields – Quotechar
  • 3. Python – Paths, Folders, Files
    • Finding your Python Path
    • File Loading: Absolute and Relative Paths
  • 4. Pandas CSV File Loading Errors
  • Advanced Read CSV Files
    • Specifying Data Types
    • Skipping and Picking Rows and Columns From File
    • Custom Missing Value Symbols
  • CSV Format Advantages and Disadvantages
  • Additional Reading

Load CSV files to Python Pandas

The basic process of loading data from a CSV file into a Pandas DataFrame (with all going well) is accomplished using the "read_csv" function in Pandas:

# Load the Pandas libraries with alias 'pd'
import pandas as pd

# Read data from file 'filename.csv'
# (in the same directory that your python process is based)
# Control delimiters, rows, column names with read_csv (see later)
data = pd.read_csv("filename.csv")

# Preview the first 5 lines of the loaded data
data.head()

While this code seems simple, an understanding of four fundamental concepts is required to fully grasp and debug the operation of the data loading procedure if you run into issues:

  1. Understanding file extensions and file types – what do the letters CSV actually mean? What's the difference between a .csv file and a .txt file?
  2. Understanding how data is represented inside CSV files – if you open a CSV file, what does the data actually look like?
  3. Understanding the Python path and how to reference a file – what is the absolute and relative path to the file you are loading? What directory are you working in?
  4. CSV data formats and errors – common errors with the function.

Each of these topics is discussed below, and we finish this tutorial by looking at some more advanced CSV loading mechanisms and giving some broad advantages and disadvantages of the CSV format.

1. File Extensions and File Types

The first step to working with comma-separated-value (CSV) files is understanding the concept of file types and file extensions.

  1. Data is stored on your computer in individual "files", or containers, each with a different name.
  2. Each file contains data of different types – the internals of a Word document are quite different from the internals of an image.
  3. Computers determine how to read files using the "file extension", that is, the code that follows the dot (".") in the filename.
  4. So, a filename is typically in the form "<random name>.<file extension>". Examples:
    • project1.DOCX – a Microsoft Word file called project1.
    • shanes_file.TXT – a simple text file called shanes_file.
    • IMG_5673.JPG – an image file called IMG_5673.
    • Other well known file types and extensions include: XLSX – Excel, PDF – Portable Document Format, PNG – images, ZIP – compressed file format, GIF – animation, MPEG – video, MP3 – music, etc.
  5. A CSV file is a file with a ".csv" file extension, e.g. "data.csv", "super_information.csv". The "CSV" in this case lets the computer know that the data contained in the file is in "comma separated value" format, which we'll discuss below.

File extensions are hidden by default on a lot of operating systems. The first step that any self-respecting engineer, software engineer, or data scientist will take on a new computer is to ensure that file extensions are shown in their Explorer (Windows) or Finder (Mac) windows.

Folder with file extensions showing. Before working with CSV files, ensure that you can see your file extensions in your operating system. Different file contents are denoted by the file extension, or letters after the dot, of the file name, e.g. TXT is text, DOCX is Microsoft Word, PNG are images, CSV is comma-separated value data.

To check if file extensions are showing in your system, create a new text document with Notepad (Windows) or TextEdit (Mac) and save it to a folder of your choice. If you can't see the ".txt" extension in your folder when you view it, you will have to change your settings.

  • In Microsoft Windows: Open Control Panel > Appearance and Personalization. Now, click on Folder Options, or File Explorer Options as it is now called, > View tab. In this tab, under Advanced Settings, you will see the option Hide extensions for known file types. Uncheck this option and click on Apply and OK.
  • In Mac OS: Open Finder > In menu, click Finder > Preferences, click Advanced, select the checkbox for "Show all filename extensions".

2. Data Representation in CSV files

A "CSV" file, that is, a file with a "csv" filetype, is a basic text file. Any text editor, such as Notepad on Windows or TextEdit on Mac, can open a CSV file and show the contents. Sublime Text is a wonderful and multi-functional text editor option for any platform.

CSV is a standard for storing tabular data in text format, where commas are used to separate the different columns, and newlines (carriage return / pressing enter) are used to separate rows. Typically, the first row in a CSV file contains the names of the columns for the data.

An example table data set and the corresponding CSV-format data is shown in the diagram below.

Comma-separated value files, or CSV files, are simple text files where commas and newlines are used to define tabular data in a structured way. The Pandas read_csv function is used to process this comma-separated file into tabular format in a Python DataFrame.

Note that almost any tabular data can be stored in CSV format – the format is popular because of its simplicity and flexibility. You can create a text file in a text editor, save it with a .csv extension, and open that file in Excel or Google Sheets to see the table form.
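As a minimal sketch of that idea, the snippet below writes a few lines of text to a file with a .csv extension and loads it back with read_csv. The file name people.csv and its contents are made up for illustration, and a temporary directory is used so nothing is left behind on disk.

```python
import os
import tempfile

import pandas as pd

# Hypothetical CSV content: the first line holds the column names,
# each later line is one row, and commas separate the columns.
csv_text = "name,age,city\nAlice,34,Dublin\nBob,28,Cork\n"

with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = os.path.join(tmp_dir, "people.csv")  # made-up file name

    # Save plain text with a .csv extension, as a text editor would.
    with open(csv_path, "w", encoding="utf-8") as f:
        f.write(csv_text)

    # Load the text file into a DataFrame.
    df = pd.read_csv(csv_path)

print(df)
print(df.shape)  # (rows, columns)
```

The header line becomes the column names, and each remaining line becomes one row of the DataFrame.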

Other Delimiters / Separators – TSV files

The comma separation scheme is by far the most popular method of storing tabular data in text files.

However, the choice of the ',' comma character to delimit columns is arbitrary, and it can be substituted where needed. Popular alternatives include tab ("\t") and semi-colon (";"). Tab-separated files are known as TSV (Tab-Separated Value) files.

When loading data with Pandas, the read_csv function is used for reading any delimited text file; the delimiter is changed using the sep parameter.
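A short sketch of the sep parameter in use is shown here. The data is invented for illustration, and io.StringIO is used so that the strings can stand in for files on disk.

```python
import io

import pandas as pd

# The same made-up table stored with two different delimiters.
tsv_text = "name\tscore\nAlice\t90\nBob\t85\n"   # tab-separated
ssv_text = "name;score\nAlice;90\nBob;85\n"      # semicolon-separated

# read_csv handles both once the separator is specified with `sep`.
tsv_df = pd.read_csv(io.StringIO(tsv_text), sep="\t")
ssv_df = pd.read_csv(io.StringIO(ssv_text), sep=";")

# Both calls produce the same table.
print(tsv_df.equals(ssv_df))
```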

Delimiters in Text Fields – Quotechar

One complication in creating CSV files arises if you have commas, semicolons, or tabs actually in one of the text fields that you want to store. In this case, it's important to use a "quote character" in the CSV file to enclose these fields.

The quote character can be specified in pandas.read_csv using the quotechar argument. By default (as with many systems), it's set as the standard quotation mark ("). Any commas (or other delimiters, as demonstrated below) that occur between two quote characters will be ignored as column separators.

In the example shown, a semicolon-delimited file with quotation marks as the quotechar is loaded into Pandas and shown in Excel. The use of the quotechar allows the "NickName" column to contain semicolons without being split into more columns.

Other than commas in CSV files, tab-separated and semicolon-separated data is also popular. Quote characters are used if the data in a column may contain the separating character. In this case, the 'NickName' column contains semicolon characters, and so this column is "quoted". Specify the separator and quote character in pandas.read_csv.
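The quoting behaviour can be sketched as follows. The rows are invented for illustration (only the NickName column name comes from the example above), and the data is read from a string rather than a file.

```python
import io

import pandas as pd

# Made-up semicolon-delimited data. The NickName values contain
# semicolons, so they are wrapped in the quote character (").
text = (
    'Name;NickName;Age\n'
    '"Shane Lynn";"Shane;Shaney";30\n'
    '"Jo Bloggs";"Jo;JB";25\n'
)

# quotechar='"' is already the default; it is spelled out for clarity.
df = pd.read_csv(io.StringIO(text), sep=";", quotechar='"')

print(df.shape)               # three columns, not four
print(df.loc[0, "NickName"])  # the semicolon stays inside the field
```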

3. Python – Paths, Folders, Files

When you specify a filename to pandas.read_csv, Python will look in your "current working directory". Your working directory is typically the directory that you started your Python process or Jupyter notebook from.

Pandas searches your 'current working directory' for the filename that you specify when opening or loading files. A FileNotFoundError can be due to a misspelled filename or an incorrect working directory.

Finding your Python Path

Your Python path can be displayed using the built-in os module. The os module provides operating-system-dependent functionality to Python programs and scripts.

To find your current working directory, the function required is os.getcwd(). The os.listdir() function can be used to display all files in a directory, which is a good check to see if the CSV file you are loading is in the directory as expected.

# Find out your current working directory
import os
print(os.getcwd())
# Out: /Users/shane/Documents/blog

# Display all of the files found in your current working directory
print(os.listdir(os.getcwd()))
# Out: ['test_delimted.ssv', 'CSV Blog.ipynb', 'test_data.csv']

In the example above, my current working directory is the '/Users/shane/Documents/blog' directory. Any files that are placed in this directory will be immediately available to the Python file open() function or the Pandas read_csv function.

Instead of moving the required data files to your working directory, you can also change your current working directory to the directory where the files reside using os.chdir().
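A small sketch of changing the working directory is given below. It uses a temporary directory and a made-up file name so that it can run anywhere, and it switches back to the original directory afterwards.

```python
import os
import tempfile

original_dir = os.getcwd()  # remember where we started

with tempfile.TemporaryDirectory() as tmp_dir:
    # Create a small made-up data file inside the temporary directory.
    with open(os.path.join(tmp_dir, "test_data.csv"), "w") as f:
        f.write("a,b\n1,2\n")

    os.chdir(tmp_dir)  # make it the current working directory
    try:
        # The file can now be referenced by its bare filename.
        files_here = os.listdir(os.getcwd())
        found = os.path.exists("test_data.csv")
    finally:
        os.chdir(original_dir)  # always switch back

print(files_here)
print(found)
```

Changing the working directory affects every later relative path in the same process, which is why the sketch restores the original directory in a finally block.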

File Loading: Absolute and Relative Paths

When specifying file names to the read_csv function, you can supply either absolute or relative file paths.

  • A relative path is the path to the file if you start from your current working directory. In relative paths, typically the file will be in a subdirectory of the working directory and the path will not start with a drive specifier, e.g. data/test_file.csv. The characters '..' are used to move to a parent directory in a relative path.
  • An absolute path is the complete path from the base of your file system to the file that you want to load, e.g. c:/Documents/Shane/data/test_file.csv. Absolute paths will start with a drive specifier (c:/ or d:/ in Windows, or '/' in Mac or Linux).

It's recommended to use relative paths where possible in applications, because absolute paths are unlikely to work on different computers due to different directory structures.

Absolute vs relative file paths: loading the same file with Pandas read_csv using relative and absolute paths. Relative paths are directions to the file starting at your current working directory, whereas absolute paths always start at the base of your file system.
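The relationship between the two kinds of path can be sketched with the standard pathlib module. The data/test_file.csv path is only an example and does not need to exist for this snippet to run; either form of path could be passed to read_csv.

```python
from pathlib import Path

# A relative path: interpreted from the current working directory.
relative_path = Path("data") / "test_file.csv"

# The matching absolute path: the working directory plus the relative path.
absolute_path = Path.cwd() / relative_path

# '..' steps up to the parent of the working directory.
parent_relative = Path("..") / "test_file.csv"

print(relative_path.is_absolute())
print(absolute_path.is_absolute())
print(absolute_path)
```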

4. Pandas CSV File Loading Errors

The most common errors you'll get while loading data from CSV files into Pandas will be:

  1. FileNotFoundError: File b'filename.csv' does not exist
    A File Not Found error is typically an issue with path setup, current directory, or file name confusion (file extension can play a part here!)
  2. UnicodeDecodeError: 'utf-8' codec can't decode byte in position : invalid continuation byte
    A Unicode Decode Error is typically caused by not specifying the encoding of the file, and happens when you have a file with non-standard characters. For a quick fix, try opening the file in Sublime Text, and re-saving with encoding 'UTF-8'.
  3. pandas.parser.CParserError: Error tokenizing data.
    Parse Errors (reported as pandas.errors.ParserError in newer Pandas versions) can be caused in unusual circumstances to do with your data format – try adding the parameter "engine='python'" to the read_csv function call; this changes the data reading function internally to a slower but more stable method.
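The first two errors can be handled in code along the following lines. This is only a sketch: the file names are made up, and passing encoding="latin-1" assumes you know (or have guessed) the encoding the file was saved with.

```python
import os
import tempfile

import pandas as pd

# 1. A missing file: catch FileNotFoundError and report it.
missing_file_error = None
try:
    pd.read_csv("this_file_does_not_exist.csv")  # made-up file name
except FileNotFoundError as err:
    missing_file_error = type(err).__name__
print(missing_file_error)

# 2. A file saved in a non-UTF-8 encoding.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "latin1_example.csv")  # made-up file name
    with open(path, "w", encoding="latin-1") as f:
        f.write("name,city\nJosé,Málaga\n")

    try:
        # With the default UTF-8 decoding this read is expected to fail.
        pd.read_csv(path)
        decode_failed = False
    except UnicodeDecodeError:
        decode_failed = True

    # Naming the encoding explicitly lets the file load.
    df = pd.read_csv(path, encoding="latin-1")

print(decode_failed)
print(df)
```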

Advanced Read CSV Files

There are some additional flexible parameters in the Pandas read_csv() function that are useful to have in your arsenal of data science techniques:

Specifying Data Types

CSV files do not contain any type information for data. Data types are inferred by Pandas from the values in the file, which can lead to errors. To manually specify the data types for different columns, the dtype parameter can be used with a dictionary of column names and data types to be applied, for example: dtype={"name": str, "age": np.int32}.

Note that for dates and date times, the format, columns, and other behaviour can be adjusted using the parse_dates, date_parser, dayfirst, and keep_date_col parameters. Some of these, such as date_parser, are deprecated in recent Pandas releases, so check the documentation for your version.
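The effect of dtype and parse_dates can be sketched as below. The column names and values are invented for illustration; the zip_code column is included to show how inference can silently drop leading zeros.

```python
import io

import numpy as np
import pandas as pd

# Made-up data: zip_code has a leading zero that type inference would drop.
csv_text = (
    "name,age,zip_code,birth_date\n"
    "Alice,34,01234,1990-05-17\n"
    "Bob,28,98765,1996-11-02\n"
)

# Default behaviour: Pandas infers the types from the values.
inferred = pd.read_csv(io.StringIO(csv_text))

# Explicit types: keep zip_code as text, store age as a 32-bit integer,
# and parse birth_date as a date.
typed = pd.read_csv(
    io.StringIO(csv_text),
    dtype={"name": str, "age": np.int32, "zip_code": str},
    parse_dates=["birth_date"],
)

print(inferred.loc[0, "zip_code"])  # read as a number, leading zero lost
print(typed.loc[0, "zip_code"])     # read as text, leading zero kept
print(typed.dtypes)
```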

Skipping and Picking Rows and Columns From File

The nrows parameter specifies how many rows from the top of the CSV file to read, which is useful for taking a sample of a large file without loading it completely. Similarly, the skiprows parameter allows you to specify rows to leave out, either at the start of the file (provide an int), or throughout the file (provide a list of row indices). Finally, the usecols parameter can be used to specify which columns in the data to load.
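A brief sketch of these three parameters on a small invented table follows. Note that the row indices given to skiprows count lines of the file from zero, so the header line is index 0 and the first data row is index 1.

```python
import io

import pandas as pd

# Made-up data with a header line followed by four rows.
csv_text = "id,name,score\n1,Ann,90\n2,Ben,85\n3,Cat,70\n4,Dan,60\n"

# nrows: read only the first two data rows.
first_two = pd.read_csv(io.StringIO(csv_text), nrows=2)

# skiprows with a list: leave out file lines 1 and 3 (line 0 is the header).
skipped = pd.read_csv(io.StringIO(csv_text), skiprows=[1, 3])

# usecols: load only the named columns.
two_cols = pd.read_csv(io.StringIO(csv_text), usecols=["name", "score"])

print(first_two["id"].tolist())
print(skipped["id"].tolist())
print(two_cols.columns.tolist())
```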

Custom Missing Value Symbols

When data is exported to CSV from different systems, missing values can be specified with different tokens. The na_values parameter allows you to customise the characters that are recognised as missing values. The default values interpreted as NA/NaN include: '', '#N/A', '#N/A N/A', '#NA', '-1.#IND', '-1.#QNAN', '-NaN', '-nan', '1.#IND', '1.#QNAN', 'N/A', 'NA', 'NULL', 'NaN', 'n/a', 'nan', 'null'.
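As a small illustration of na_values, the sketch below treats '.' and '??' as missing in addition to the defaults. The table itself is invented for the example.

```python
import io

import pandas as pd

# Made-up export in which '.' and '??' stand for missing values,
# alongside the standard 'NA' token.
csv_text = "name,salary,dept\nAnn,50000,??\nBen,.,Sales\nCat,NA,Ops\n"

# The custom tokens are added to the default list of missing-value markers.
df = pd.read_csv(io.StringIO(csv_text), na_values=[".", "??"])

print(df["salary"].isna().tolist())
print(df["dept"].isna().tolist())
```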

# Advanced CSV loading example
data = pd.read_csv(
    "data/files/complex_data_example.tsv",      # relative python path to subdirectory
    sep='\t',                                   # Tab-separated value file.
    quotechar="'",                              # single quote allowed as quote character
    dtype={"salary": int},                      # Parse the salary column as an integer
    usecols=['name', 'birth_date', 'salary'],   # Only load the three columns specified.
    parse_dates=['birth_date'],                 # Interpret the birth_date column as a date
    skiprows=10,                                # Skip the first 10 rows of the file
    na_values=['.', '??']                       # Take any '.' or '??' values as NA
)

 CSV Format Advantages and Disadvantages

As with all technical decisions, storing your data in CSV format has both advantages and disadvantages. Be aware of the potential pitfalls and issues that you will encounter as you load, store, and exchange data in CSV format:

On the plus side:

  • CSV format is universal and the data can be loaded by almost any software.
  • CSV files are simple to understand and debug with a basic text editor.
  • CSV files are quick to create and load into memory before analysis.

Nonetheless, the CSV format has some negative sides:

  • There is no data type information stored in the text file; all typing (dates, int vs float, strings) is inferred from the data only.
  • There's no formatting or layout information storable – things like fonts, borders, column width settings from Microsoft Excel will be lost.
  • File encodings can become a problem if there are non-ASCII compatible characters in text fields.
  • CSV format is inefficient; numbers are stored as characters rather than binary values, which is wasteful. You will find, however, that your CSV data compresses well using zip compression.

As an aside, in an effort to counter some of these disadvantages, two prominent data science developers in the R and Python ecosystems, Wes McKinney and Hadley Wickham, introduced the Feather Format, which aims to be a fast, simple, open, flexible and multi-platform data format that supports multiple data types natively.
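The missing type information can be seen in a simple round trip, sketched below with an invented DataFrame. The data is written with to_csv into an in-memory buffer and read back, first with default inference and then with the types stated explicitly.

```python
import io

import pandas as pd

# Made-up DataFrame with text IDs (leading zeros) and real dates.
df = pd.DataFrame({
    "id": ["001", "002"],
    "joined": pd.to_datetime(["2020-01-15", "2021-06-30"]),
})

# Write the DataFrame to CSV text (index=False leaves out the row index).
buffer = io.StringIO()
df.to_csv(buffer, index=False)
csv_text = buffer.getvalue()

# Reading back with inference: the IDs become integers and the dates
# come back as plain text.
reloaded = pd.read_csv(io.StringIO(csv_text))

# Reading back with explicit types restores the original meaning.
restored = pd.read_csv(
    io.StringIO(csv_text), dtype={"id": str}, parse_dates=["joined"]
)

print(reloaded["id"].tolist())
print(restored["id"].tolist())
print(restored.dtypes)
```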

Additional Reading

  1. Official Pandas documentation for the read_csv function.
  2. Python 3 Notes on file paths, working directories, and using the OS module.
  3. Datacamp Tutorial on loading CSV files, including some additional OS commands.
  4. PythonHow Loading CSV tutorial.


Source: https://www.shanelynn.ie/python-pandas-read-csv-load-data-from-csv-files/
