R Read the First Row as Header
Reading and Writing CSV Files
Overview
Teaching: 30 min
Exercises: 0 min
Questions
How do I read data from a CSV file into R?
How do I write data to a CSV file?
Objectives
Read in a .csv, and explore the arguments of the csv reader.
Write the altered data set to a new .csv, and explore the arguments.
The most common way that scientists store data is in Excel spreadsheets. While there are R packages designed to access data from Excel spreadsheets (e.g., gdata, RODBC, XLConnect, xlsx, RExcel), users often find it easier to save their spreadsheets as comma-separated values (CSV) files and then use R's built-in functionality to read and manipulate the data. In this short lesson, we'll learn how to read data from a .csv and write to a new .csv, and explore the arguments that allow you to read and write the data correctly for your needs.
Read a .csv and Explore the Arguments
Let's start by opening a .csv file containing data on the speeds at which cars of different colors were clocked in 45 mph zones in the four-corners states (car-speeds.csv). We will use the built-in read.csv(...) function call, which reads the data in as a data frame, and assign the data frame to a variable (using <-) so that it is stored in R's memory. Then we will explore some of the basic arguments that can be supplied to the function. First, open the RStudio project containing the scripts and data you were working on in episode 'Analyzing Patient Data'.

    # Import the data and look at the first six rows
    carSpeeds <- read.csv(file = 'data/car-speeds.csv')
    head(carSpeeds)

      Color Speed     State
    1  Blue    32 NewMexico
    2   Red    45   Arizona
    3  Blue    35  Colorado
    4 White    34   Arizona
    5   Red    25   Arizona
    6  Blue    41   Arizona

Changing Delimiters
The default delimiter of the read.csv() function is a comma, but you can use other delimiters by supplying the 'sep' argument to the function (e.g., typing sep = ';' allows a semi-colon separated file to be correctly imported - see ?read.csv() for more information on this and other options for working with different file types).
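As a minimal sketch of the sep argument, the snippet below writes a small semicolon-separated file to a temporary location and reads it twice. The three lines of data and the variable names (tmp, wrong, right) are made up for illustration and are not part of the lesson's data set.

```r
# Hypothetical semicolon-separated data, written to a temporary file
tmp <- tempfile(fileext = ".csv")
writeLines(c("Color;Speed;State",
             "Blue;32;NewMexico",
             "Red;45;Arizona"), tmp)

# With the default sep = ',' every line is read as one single column
wrong <- read.csv(file = tmp)
ncol(wrong)

# Supplying sep = ';' splits the fields as intended
right <- read.csv(file = tmp, sep = ";")
ncol(right)
names(right)
```

Comparing ncol() for the two data frames is a quick way to spot a delimiter mismatch before going further.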
The call above will import the data, but we have not taken advantage of several handy arguments that can be helpful in loading the data in the format we want. Let's explore some of these arguments.
The default for read.csv(...) is to set the header argument to TRUE. This means that the first row of values in the .csv is set as header information (column names). If your data set does not have a header, set the header argument to FALSE:

    # The first row of the data without setting the header argument:
    carSpeeds[1, ]

      Color Speed     State
    1  Blue    32 NewMexico

    # The first row of the data if the header argument is set to FALSE:
    carSpeeds <- read.csv(file = 'data/car-speeds.csv', header = FALSE)
    carSpeeds[1, ]

         V1    V2    V3
    1 Color Speed State

Clearly this is not the desired behavior for this data set, but it may be useful if you have a dataset without headers.
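The effect of header can also be checked without any file on disk. The sketch below assumes a tiny made-up CSV held in a string (csv_text) and passed through the text argument that read.csv() forwards to read.table(); the names with_header and no_header are illustrative only.

```r
# A tiny, made-up CSV kept in memory instead of a file
csv_text <- paste(c("Color,Speed,State",
                    "Blue,32,NewMexico",
                    "Red,45,Arizona"), collapse = "\n")

with_header <- read.csv(text = csv_text)                  # header = TRUE by default
no_header   <- read.csv(text = csv_text, header = FALSE)  # first line becomes data

nrow(with_header)   # two data rows
nrow(no_header)     # three rows: the header line is counted as data
names(no_header)    # default names V1, V2, V3
```

Note that with header = FALSE the Speed values are no longer numeric, because the word "Speed" is read into the same column as the numbers.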
The stringsAsFactors Argument
In older versions of R (prior to 4.0) this was perhaps the most important argument in read.csv(), particularly if you were working with categorical data. This is because the default behavior of R was to convert character strings into factors, which may make it difficult to do such things as replace values. It is important to be aware of this behaviour, which we will demonstrate. For example, let's say we find out that the data collector was color blind, and accidentally recorded green cars as being blue. In order to correct the data set, let's replace 'Blue' with 'Green' in the $Color column:

    # Here we will use R's `ifelse` function, in which we provide the test phrase,
    # the outcome if the result of the test is 'TRUE', and the outcome if the
    # result is 'FALSE'. We will also assign the results to the Color column,
    # using '<-'

    # First - reload the data with a header
    carSpeeds <- read.csv(file = 'data/car-speeds.csv', stringsAsFactors = TRUE)

    carSpeeds$Color <- ifelse(carSpeeds$Color == 'Blue', 'Green', carSpeeds$Color)
    carSpeeds$Color

    [1] "Green" "1" "Green" "5" "4" "Green" "Green" "2" "5"
    [10] "4" "4" "5" "Green" "Green" "2" "4" "Green" "Green"
    [19] "5" "Green" "Green" "Green" "4" "Green" "4" "4" "4"
    [28] "4" "5" "Green" "4" "5" "2" "4" "2" "2"
    [37] "Green" "4" "2" "4" "2" "2" "4" "4" "5"
    [46] "2" "Green" "4" "4" "2" "2" "4" "5" "4"
    [55] "Green" "Green" "2" "Green" "5" "2" "4" "Green" "Green"
    [64] "5" "2" "4" "4" "2" "Green" "5" "Green" "4"
    [73] "5" "5" "Green" "Green" "Green" "Green" "Green" "5" "2"
    [82] "Green" "5" "2" "2" "4" "4" "5" "5" "5"
    [91] "5" "4" "4" "4" "5" "2" "5" "2" "2"
    [100] "5"

What happened?!? It looks like 'Blue' was replaced with 'Green', but every other color was turned into a number (as a character string, given the quote marks before and after). This is because the colors of the cars were loaded as factors, and the factor level was reported following replacement.
To see the internal structure, we can use another function, str(). In this case, the dataframe's internal structure includes the format of each column, which is what we are interested in. str() will be reviewed a little more in the lesson Data Types and Structures.

    # Reload the data with a header (the previous ifelse call modifies attributes)
    carSpeeds <- read.csv(file = 'data/car-speeds.csv', stringsAsFactors = TRUE)
    str(carSpeeds)

    'data.frame': 100 obs. of 3 variables:
     $ Color: Factor w/ 5 levels " Red","Black",..: 3 1 3 5 4 3 3 2 5 4 ...
     $ Speed: int 32 45 35 34 25 41 34 29 31 26 ...
     $ State: Factor w/ 4 levels "Arizona","Colorado",..: 3 1 2 1 1 1 3 2 1 2 ...

We can see that the $Color and $State columns are factors and $Speed is a numeric column.
Now, let's load the dataset using stringsAsFactors=FALSE, and see what happens when we try to replace 'Blue' with 'Green' in the $Color column:

    carSpeeds <- read.csv(file = 'data/car-speeds.csv', stringsAsFactors = FALSE)
    str(carSpeeds)

    'data.frame': 100 obs. of 3 variables:
     $ Color: chr "Blue" " Red" "Blue" "White" ...
     $ Speed: int 32 45 35 34 25 41 34 29 31 26 ...
     $ State: chr "NewMexico" "Arizona" "Colorado" "Arizona" ...

    carSpeeds$Color <- ifelse(carSpeeds$Color == 'Blue', 'Green', carSpeeds$Color)
    carSpeeds$Color

    [1] "Green" " Red" "Green" "White" "Red" "Green" "Green" "Black" "White"
    [10] "Red" "Red" "White" "Green" "Green" "Black" "Red" "Green" "Green"
    [19] "White" "Green" "Green" "Green" "Red" "Green" "Red" "Red" "Red"
    [28] "Red" "White" "Green" "Red" "White" "Black" "Red" "Black" "Black"
    [37] "Green" "Red" "Black" "Red" "Black" "Black" "Red" "Red" "White"
    [46] "Black" "Green" "Red" "Red" "Black" "Black" "Red" "White" "Red"
    [55] "Green" "Green" "Black" "Green" "White" "Black" "Red" "Green" "Green"
    [64] "White" "Black" "Red" "Red" "Black" "Green" "White" "Green" "Red"
    [73] "White" "White" "Green" "Green" "Green" "Green" "Green" "White" "Black"
    [82] "Green" "White" "Black" "Black" "Red" "Red" "White" "White" "White"
    [91] "White" "Red" "Red" "Red" "White" "Black" "White" "Black" "Black"
    [100] "White"

That's better! And we can see how the data now is read as character instead of factor. From R version 4.0 onwards we do not have to specify stringsAsFactors=FALSE; this is the default behavior.
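The same contrast can be reproduced on a three-row example. This is a sketch under the assumption of a made-up in-memory CSV (csv_text); stringsAsFactors is set explicitly in both calls so the result does not depend on the R version, and the variable names are illustrative.

```r
# Made-up data with a repeated color value
csv_text <- paste(c("Color,Speed",
                    "Blue,32",
                    "Red,45",
                    "Blue,35"), collapse = "\n")

as_factor <- read.csv(text = csv_text, stringsAsFactors = TRUE)
as_char   <- read.csv(text = csv_text, stringsAsFactors = FALSE)

class(as_factor$Color)
class(as_char$Color)

# On a factor, the untouched value comes back as its integer level code
fixed_fct <- ifelse(as_factor$Color == "Blue", "Green", as_factor$Color)
fixed_fct

# On a character column, the untouched value stays as text
fixed_chr <- ifelse(as_char$Color == "Blue", "Green", as_char$Color)
fixed_chr
```

Checking class() on a column before replacing values is a cheap guard against the level-code surprise shown above.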
The as.is Argument
This is an extension of the stringsAsFactors argument, but gives you control over individual columns. For example, if we want the colors of cars imported as strings, but we want the names of the states imported as factors, we would load the data set as:

    carSpeeds <- read.csv(file = 'data/car-speeds.csv', as.is = 1)
    # Note, the 1 applies as.is to the first column only

Now we can see that if we try to replace 'Blue' with 'Green' in the $Color column everything looks fine, while trying to replace 'Arizona' with 'Ohio' in the $State column returns the factor numbers for the names of states that we haven't replaced:

    str(carSpeeds)

    'data.frame': 100 obs. of 3 variables:
     $ Color: chr "Blue" " Red" "Blue" "White" ...
     $ Speed: int 32 45 35 34 25 41 34 29 31 26 ...
     $ State: Factor w/ 4 levels "Arizona","Colorado",..: 3 1 2 1 1 1 3 2 1 2 ...

    carSpeeds$Color <- ifelse(carSpeeds$Color == 'Blue', 'Green', carSpeeds$Color)
    carSpeeds$Color

    [1] "Green" " Red" "Green" "White" "Red" "Green" "Green" "Black" "White"
    [10] "Red" "Red" "White" "Green" "Green" "Black" "Red" "Green" "Green"
    [19] "White" "Green" "Green" "Green" "Red" "Green" "Red" "Red" "Red"
    [28] "Red" "White" "Green" "Red" "White" "Black" "Red" "Black" "Black"
    [37] "Green" "Red" "Black" "Red" "Black" "Black" "Red" "Red" "White"
    [46] "Black" "Green" "Red" "Red" "Black" "Black" "Red" "White" "Red"
    [55] "Green" "Green" "Black" "Green" "White" "Black" "Red" "Green" "Green"
    [64] "White" "Black" "Red" "Red" "Black" "Green" "White" "Green" "Red"
    [73] "White" "White" "Green" "Green" "Green" "Green" "Green" "White" "Black"
    [82] "Green" "White" "Black" "Black" "Red" "Red" "White" "White" "White"
    [91] "White" "Red" "Red" "Red" "White" "Black" "White" "Black" "Black"
    [100] "White"

    carSpeeds$State <- ifelse(carSpeeds$State == 'Arizona', 'Ohio', carSpeeds$State)
    carSpeeds$State

    [1] "3" "Ohio" "2" "Ohio" "Ohio" "Ohio" "3" "2" "Ohio" "2"
    [11] "4" "4" "4" "4" "4" "3" "Ohio" "3" "Ohio" "4"
    [21] "4" "4" "3" "2" "2" "3" "2" "4" "2" "4"
    [31] "3" "2" "2" "4" "2" "2" "3" "Ohio" "4" "2"
    [41] "2" "3" "Ohio" "4" "Ohio" "2" "3" "3" "3" "2"
    [51] "Ohio" "4" "4" "Ohio" "3" "2" "4" "2" "4" "4"
    [61] "4" "2" "3" "2" "3" "2" "3" "Ohio" "3" "4"
    [71] "4" "2" "Ohio" "4" "2" "2" "2" "Ohio" "3" "Ohio"
    [81] "4" "2" "2" "Ohio" "Ohio" "Ohio" "4" "Ohio" "4" "4"
    [91] "4" "Ohio" "Ohio" "3" "2" "2" "4" "3" "Ohio" "4"

We can see that the $Color column is a character while $State is a factor.
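A compact way to confirm which columns as.is affected is to inspect the class of every column. The sketch below uses a made-up two-row CSV held in a string; stringsAsFactors = TRUE is included only to state the intent for the remaining columns explicitly, and the name cars is illustrative.

```r
# Made-up data with two text columns and one numeric column
csv_text <- paste(c("Color,Speed,State",
                    "Blue,32,NewMexico",
                    "Red,45,Arizona"), collapse = "\n")

# as.is = 1 keeps column 1 as character; the other text column becomes a factor
cars <- read.csv(text = csv_text, stringsAsFactors = TRUE, as.is = 1)

sapply(cars, class)
```

The as.is argument also accepts column names, so as.is = "Color" should be an equivalent and arguably more readable way to write the same call.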
Updating Values in a Factor
Suppose we want to keep the colors of cars as factors for some other operations we want to perform. Write code for replacing 'Blue' with 'Green' in the $Color column of the cars dataset without importing the data with stringsAsFactors=FALSE.
Solution

    carSpeeds <- read.csv(file = 'data/car-speeds.csv')
    # Replace 'Blue' with 'Green' in cars$Color without using the stringsAsFactors
    # or as.is arguments
    carSpeeds$Color <- ifelse(as.character(carSpeeds$Color) == 'Blue',
                              'Green',
                              as.character(carSpeeds$Color))
    # Convert colors back to factors
    carSpeeds$Color <- as.factor(carSpeeds$Color)
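An alternative to the solution above, not taken from the lesson, is to rename the factor level directly, which avoids the round trip through character strings. The sketch below uses a small made-up factor named colors rather than the lesson's data.

```r
# A small made-up factor standing in for the Color column
colors <- factor(c("Blue", "Red", "Blue", "White"))

# Rename the level itself; every 'Blue' entry becomes 'Green'
levels(colors)[levels(colors) == "Blue"] <- "Green"

as.character(colors)
levels(colors)
```

Because only the level label changes, the result is still a factor and the underlying integer codes are untouched.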
The strip.white Argument
It is not uncommon for mistakes to have been made when the data were recorded, for example a space (whitespace) may have been inserted before a data value. By default this whitespace will be kept in the R environment, such that ' Red' will be recognized as a different value than 'Red'. In order to avoid this type of error, use the strip.white argument. Let's see how this works by checking for the unique values in the $Color column of our dataset:
Here, the data recorder added a space before the color of the car in one of the cells:

    # We use the built-in unique() function to extract the unique colors in our dataset
    unique(carSpeeds$Color)

    [1] Green  Red  White Red   Black
    Levels:  Red Black Green Red White

Oops, we see two values for red cars.
Let's try again, this time importing the data using the strip.white argument. Note - strip.white only takes effect when a field separator has been specified through the sep argument (the comma for most .csv files). read.csv() already sets sep = ',' by default; we pass it explicitly here to make the delimiter clear.

    carSpeeds <- read.csv(file = 'data/car-speeds.csv',
                          stringsAsFactors = FALSE,
                          strip.white = TRUE,
                          sep = ',')
    unique(carSpeeds$Color)

    [1] "Blue" "Red" "White" "Black"

That's better!
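The whitespace problem is easy to reproduce on a few rows. The sketch below assumes a made-up in-memory CSV in which one 'Red' entry carries a leading space; the names kept and stripped are illustrative.

```r
# Made-up data: the third line has a stray space before 'Red'
csv_text <- paste(c("Color,Speed",
                    "Red,45",
                    " Red,25",
                    "Blue,32"), collapse = "\n")

kept     <- read.csv(text = csv_text, stringsAsFactors = FALSE)
stripped <- read.csv(text = csv_text, stringsAsFactors = FALSE,
                     strip.white = TRUE)

unique(kept$Color)      # 'Red' and ' Red' are treated as different values
unique(stripped$Color)  # the leading space is removed on import
```

Comparing the number of unique values before and after is a simple check for stray whitespace in a categorical column.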
Specify Missing Data When Loading
It is common for data sets to have missing values, or mistakes. The convention for recording missing values often depends on the individual who collected the data and can be recorded as n.a., --, or empty cells " ". R recognises the reserved character string NA as a missing value, but not some of the examples above. Let's say the inflammation scale in the data set we used before, inflammation-01.csv, actually starts at 1 for no inflammation and the zero values (0) were a missed observation. Looking at the ?read.csv help page, is there an argument we could use to ensure all zeros (0) are read in as NA? Perhaps the car-speeds.csv data contains mistakes and the person measuring the car speeds could not accurately distinguish between "Black" or "Blue" cars. Is there a way to specify more than one 'string', such as "Black" and "Blue", to be replaced by NA?
Solution

    read.csv(file = "data/inflammation-01.csv", na.strings = "0")

or, in car-speeds.csv, use a character vector for multiple values.

    read.csv(file = 'data/car-speeds.csv', na.strings = c("Black", "Blue"))
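The other missing-value conventions mentioned in the exercise can be handled the same way. The sketch below assumes a made-up in-memory CSV that uses n.a., -- and an empty cell; the name cars is illustrative. One thing to keep in mind is that supplying na.strings replaces the default, so the literal string NA is only treated as missing if it is listed too.

```r
# Made-up data using three different markers for a missing value
csv_text <- paste(c("Color,Speed",
                    "Blue,32",
                    "n.a.,45",
                    "White,",
                    "Red,--"), collapse = "\n")

cars <- read.csv(text = csv_text,
                 na.strings = c("n.a.", "--", ""),
                 stringsAsFactors = FALSE)

is.na(cars$Color)
is.na(cars$Speed)
class(cars$Speed)   # still numeric, because the markers were read as NA
```

Without na.strings, the "--" entry would force the whole Speed column to be read as text.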
Write a New .csv and Explore the Arguments
After altering our cars dataset by replacing 'Blue' with 'Green' in the $Color column, we now want to save the output. There are several arguments for the write.csv(...) function call, a few of which are particularly important for how the data are exported. Let's explore these now.

    # Export the data. The write.csv() function requires a minimum of two
    # arguments, the data to be saved and the name of the output file.
    write.csv(carSpeeds, file = 'data/car-speeds-cleaned.csv')

If you open the file, you'll see that it has header names, because the data had headers within R, but that there are numbers in the first column.
The row.names Argument
This argument allows us to set the names of the rows in the output data file. R's default for this argument is TRUE, and since it does not know what else to name the rows for the cars data set, it resorts to using row numbers. To correct this, we can set row.names to FALSE:

    write.csv(carSpeeds, file = 'data/car-speeds-cleaned.csv', row.names = FALSE)

Now the output file no longer has the column of row numbers.
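The difference can be checked from within R by writing to temporary files and reading the raw lines back. This is a sketch with a made-up two-row data frame; the names cars, with_rows and without_rows are illustrative.

```r
# A made-up two-row data frame
cars <- data.frame(Color = c("Blue", "Red"), Speed = c(32L, 45L))

with_rows    <- tempfile(fileext = ".csv")
without_rows <- tempfile(fileext = ".csv")

write.csv(cars, file = with_rows)                          # row.names = TRUE by default
write.csv(cars, file = without_rows, row.names = FALSE)

readLines(with_rows)      # each line starts with a quoted row number
readLines(without_rows)   # only the Color and Speed fields remain
```

Reading the file back with readLines() shows exactly what another program would receive, including the empty header cell that row names introduce.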
Setting Column Names
There is also a col.names argument, which can be used to set the column names for a data set without headers. If the data set already has headers (e.g., we used the header = TRUE argument when importing the data) then a col.names argument will be ignored. Note that write.csv() itself ignores any attempt to change col.names, with a warning; write.table() accepts the argument.
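As a hedged illustration of col.names on both sides, the sketch below names the columns of headerless data while reading, then writes the data back without a header line using write.table(). The data, the column names and the variable names are made up for this example.

```r
# Made-up headerless data
csv_text <- paste(c("Blue,32,NewMexico",
                    "Red,45,Arizona"), collapse = "\n")

# Supply the column names while reading
cars <- read.csv(text = csv_text, header = FALSE,
                 col.names = c("Color", "Speed", "State"))
names(cars)

# Write without a header line; write.table() is used because it accepts col.names
out <- tempfile(fileext = ".csv")
write.table(cars, file = out, sep = ",",
            row.names = FALSE, col.names = FALSE)
readLines(out)
```

Using write.table() with sep = "," keeps the output comma-separated while leaving the header under your control.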
The na Argument
There are times when we want to specify certain values for NAs in the data set (e.g., we are going to pass the data to a program that only accepts -9999 as a nodata value). In this case, we want to set the NA value of our output file to the desired value, using the na argument. Let's see how this works:

    # First, replace the speed in the third row with NA, by using an index (square
    # brackets to indicate the position of the value we want to replace)
    carSpeeds$Speed[3] <- NA
    head(carSpeeds)

      Color Speed     State
    1  Blue    32 NewMexico
    2   Red    45   Arizona
    3  Blue    NA  Colorado
    4 White    34   Arizona
    5   Red    25   Arizona
    6  Blue    41   Arizona

    write.csv(carSpeeds, file = 'data/car-speeds-cleaned.csv', row.names = FALSE)

Now we'll set NA to -9999 when we write the new .csv file:

    # Note - the na argument requires a string input
    write.csv(carSpeeds,
              file = 'data/car-speeds-cleaned.csv',
              row.names = FALSE,
              na = '-9999')

In the output file, the missing speed in the third row now appears as -9999.
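The round trip can be verified in a few lines. This sketch uses a made-up two-row data frame and a temporary file; the names cars, out and back are illustrative. It pairs the na argument of write.csv() with the na.strings argument of read.csv() so the placeholder is turned back into NA on import.

```r
# A made-up data frame with one missing speed
cars <- data.frame(Color = c("Blue", "Red"), Speed = c(32L, NA))

out <- tempfile(fileext = ".csv")
write.csv(cars, file = out, row.names = FALSE, na = "-9999")
readLines(out)   # the missing speed is written as -9999

# Reading the file back, treating -9999 as missing again
back <- read.csv(file = out, na.strings = "-9999")
is.na(back$Speed)
```

Keeping the placeholder value in one variable and passing it to both functions would reduce the chance of the two sides drifting apart.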
Key Points
Import data from a .csv file using the read.csv(...) function.
Understand some of the key arguments available for importing the data properly, including header, stringsAsFactors, as.is, and strip.white.
Write data to a new .csv file using the write.csv(...) function.
Understand some of the key arguments available for exporting the data properly, such as row.names, col.names, and na.
Source: https://swcarpentry.github.io/r-novice-inflammation/11-supp-read-write-csv/