R Read the First Row as Header

Reading and Writing CSV Files

Overview

Teaching: 30 min
Exercises: 0 min

Questions

  • How do I read data from a CSV file into R?

  • How do I write data to a CSV file?

Objectives

  • Read in a .csv, and explore the arguments of the csv reader.

  • Write the altered data set to a new .csv, and explore the arguments.

The most common way that scientists store data is in Excel spreadsheets. While there are R packages designed to access data from Excel spreadsheets (e.g., gdata, RODBC, XLConnect, xlsx, RExcel), users often find it easier to save their spreadsheets in comma-separated values files (CSV) and then use R's built-in functionality to read and manipulate the data. In this short lesson, we'll learn how to read data from a .csv and write to a new .csv, and explore the arguments that allow you to read and write the data correctly for your needs.

Read a .csv and Explore the Arguments

Let's start by opening a .csv file containing data on the speeds at which cars of different colors were clocked in 45 mph zones in the four-corners states (car-speeds.csv). We will use the built-in read.csv(...) function call, which reads the data in as a data frame, and assign the data frame to a variable (using <-) so that it is stored in R's memory. Then we will explore some of the basic arguments that can be supplied to the function. First, open the RStudio project containing the scripts and data you were working on in episode 'Analyzing Patient Data'.

    # Import the data and look at the first six rows
    carSpeeds <- read.csv(file = 'data/car-speeds.csv')
    head(carSpeeds)

      Color Speed     State
    1  Blue    32 NewMexico
    2   Red    45   Arizona
    3  Blue    35  Colorado
    4 White    34   Arizona
    5   Red    25   Arizona
    6  Blue    41   Arizona
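If car-speeds.csv is not at hand, the same call can be tried on a small throwaway file. The sketch below is illustrative only: the three data rows are adapted from the output above, and the temporary file and the name carDemo are assumptions standing in for the real data.

```r
# Write a tiny CSV to a temporary file (a stand-in for data/car-speeds.csv)
tmp <- tempfile(fileext = ".csv")
writeLines(c("Color,Speed,State",
             "Blue,32,NewMexico",
             "Red,45,Arizona",
             "Blue,35,Colorado"), tmp)

# read.csv() returns a data frame; the first line supplies the column names
carDemo <- read.csv(file = tmp)
class(carDemo)
names(carDemo)
nrow(carDemo)
```

Because the header line is consumed as column names, the data frame has three rows, not four.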

Changing Delimiters

The default delimiter of the read.csv() function is a comma, but you can use other delimiters by supplying the 'sep' argument to the function (e.g., typing sep = ';' allows a semi-colon separated file to be correctly imported - see ?read.csv() for more information on this and other options for working with different file types).
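As a rough illustration of the sep argument, the following sketch writes a semicolon-separated file to a temporary location (an assumption made purely for the example) and reads it with and without sep = ';'.

```r
# A semicolon-separated file, written to a temporary location for illustration
tmp <- tempfile(fileext = ".csv")
writeLines(c("Color;Speed;State",
             "Blue;32;NewMexico",
             "White;34;Arizona"), tmp)

# With the default sep = ',' each line is read as one single column
wrong <- read.csv(file = tmp)
ncol(wrong)

# Supplying sep = ';' splits the fields as intended
right <- read.csv(file = tmp, sep = ';')
ncol(right)
```

The first call does not fail; it simply returns a one-column data frame, which is a useful symptom to recognise when a delimiter is wrong.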

The first read.csv() call above will import the data, but we have not taken advantage of several handy arguments that can be helpful in loading the data in the format we want. Let's explore some of these arguments.

The default for read.csv(...) is to set the header argument to TRUE. This means that the first row of values in the .csv is set as header information (column names). If your data set does not have a header, set the header argument to FALSE:

    # The first row of the data without setting the header argument:
    carSpeeds[1, ]

      Color Speed     State
    1  Blue    32 NewMexico

    # The first row of the data if the header argument is set to FALSE:
    carSpeeds <- read.csv(file = 'data/car-speeds.csv', header = FALSE)
    carSpeeds[1, ]

         V1    V2    V3
    1 Color Speed State

Clearly this is not the desired behavior for this data set, but it may be useful if you have a dataset without headers.
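For a file that really has no header row, header = FALSE is the right choice. The sketch below uses a hypothetical two-row file in a temporary location; it also shows col.names, an argument read.csv() passes on to read.table(), as one way to supply names at import time.

```r
# A file with no header row (temporary file used only for illustration)
tmp <- tempfile(fileext = ".csv")
writeLines(c("Blue,32,NewMexico",
             "White,34,Arizona"), tmp)

# header = FALSE keeps the first line as data and names the columns V1, V2, ...
noHeader <- read.csv(file = tmp, header = FALSE)
names(noHeader)
nrow(noHeader)

# col.names supplies our own column names while reading
named <- read.csv(file = tmp, header = FALSE,
                  col.names = c("Color", "Speed", "State"))
names(named)
```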

The stringsAsFactors Argument

In older versions of R (prior to 4.0) this was perhaps the most important argument in read.csv(), particularly if you were working with categorical data. This is because the default behavior of R was to convert character strings into factors, which may make it difficult to do such things as replace values. It is important to be aware of this behavior, which we will demonstrate. For example, let's say we find out that the data collector was color blind, and accidentally recorded green cars as being blue. In order to correct the data set, let's replace 'Blue' with 'Green' in the $Color column:

    # Here we will use R's `ifelse` function, in which we provide the test phrase,
    # the outcome if the result of the test is 'TRUE', and the outcome if the
    # result is 'FALSE'. We will also assign the results to the Color column,
    # using '<-'

    # First - reload the data with a header
    carSpeeds <- read.csv(file = 'data/car-speeds.csv', stringsAsFactors = TRUE)

    carSpeeds$Color <- ifelse(carSpeeds$Color == 'Blue', 'Green', carSpeeds$Color)
    carSpeeds$Color

      [1] "Green" "1"     "Green" "5"     "4"     "Green" "Green" "2"     "5"
     [10] "4"     "4"     "5"     "Green" "Green" "2"     "4"     "Green" "Green"
     [19] "5"     "Green" "Green" "Green" "4"     "Green" "4"     "4"     "4"
     [28] "4"     "5"     "Green" "4"     "5"     "2"     "4"     "2"     "2"
     [37] "Green" "4"     "2"     "4"     "2"     "2"     "4"     "4"     "5"
     [46] "2"     "Green" "4"     "4"     "2"     "2"     "4"     "5"     "4"
     [55] "Green" "Green" "2"     "Green" "5"     "2"     "4"     "Green" "Green"
     [64] "5"     "2"     "4"     "4"     "2"     "Green" "5"     "Green" "4"
     [73] "5"     "5"     "Green" "Green" "Green" "Green" "Green" "5"     "2"
     [82] "Green" "5"     "2"     "2"     "4"     "4"     "5"     "5"     "5"
     [91] "5"     "4"     "4"     "4"     "5"     "2"     "5"     "2"     "2"
    [100] "5"

What happened?!? It looks like 'Blue' was replaced with 'Green', but every other color was turned into a number (as a character string, given the quote marks before and after). This is because the colors of the cars were loaded as factors, and the factor level was reported following replacement.
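The effect can be reproduced without the data file. The following is a minimal sketch on a made-up four-element factor (the name colors is an assumption for illustration); it shows that a factor stores integer codes, and that ifelse() passes those codes through rather than the labels.

```r
colors <- factor(c("Blue", "Red", "White", "Blue"))

# A factor stores integer codes that point into its levels
levels(colors)
as.integer(colors)

# ifelse() fills the 'no' positions from the factor's integer codes,
# so the labels are lost and the codes end up as character strings
swapped <- ifelse(colors == "Blue", "Green", colors)
swapped
```

Here the levels are sorted alphabetically ("Blue", "Red", "White"), so "Red" and "White" come back as "2" and "3".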

To see the internal structure, we can use another function, str(). In this case, the data frame's internal structure includes the format of each column, which is what we are interested in. str() will be reviewed a little more in the lesson Data Types and Structures.

    # Reload the data with a header (the previous ifelse call modifies attributes)
    carSpeeds <- read.csv(file = 'data/car-speeds.csv', stringsAsFactors = TRUE)

    str(carSpeeds)

    'data.frame':   100 obs. of  3 variables:
     $ Color: Factor w/ 5 levels " Red","Black",..: 3 1 3 5 4 3 3 2 5 4 ...
     $ Speed: int  32 45 35 34 25 41 34 29 31 26 ...
     $ State: Factor w/ 4 levels "Arizona","Colorado",..: 3 1 2 1 1 1 3 2 1 2 ...

We can see that the $Color and $State columns are factors and $Speed is a numeric column.

Now, let's load the dataset using stringsAsFactors=FALSE, and see what happens when we try to replace 'Blue' with 'Green' in the $Color column:

    carSpeeds <- read.csv(file = 'data/car-speeds.csv', stringsAsFactors = FALSE)

    str(carSpeeds)

    'data.frame':   100 obs. of  3 variables:
     $ Color: chr  "Blue" " Red" "Blue" "White" ...
     $ Speed: int  32 45 35 34 25 41 34 29 31 26 ...
     $ State: chr  "NewMexico" "Arizona" "Colorado" "Arizona" ...

    carSpeeds$Color <- ifelse(carSpeeds$Color == 'Blue', 'Green', carSpeeds$Color)
    carSpeeds$Color

      [1] "Green" " Red"  "Green" "White" "Red"   "Green" "Green" "Black" "White"
     [10] "Red"   "Red"   "White" "Green" "Green" "Black" "Red"   "Green" "Green"
     [19] "White" "Green" "Green" "Green" "Red"   "Green" "Red"   "Red"   "Red"
     [28] "Red"   "White" "Green" "Red"   "White" "Black" "Red"   "Black" "Black"
     [37] "Green" "Red"   "Black" "Red"   "Black" "Black" "Red"   "Red"   "White"
     [46] "Black" "Green" "Red"   "Red"   "Black" "Black" "Red"   "White" "Red"
     [55] "Green" "Green" "Black" "Green" "White" "Black" "Red"   "Green" "Green"
     [64] "White" "Black" "Red"   "Red"   "Black" "Green" "White" "Green" "Red"
     [73] "White" "White" "Green" "Green" "Green" "Green" "Green" "White" "Black"
     [82] "Green" "White" "Black" "Black" "Red"   "Red"   "White" "White" "White"
     [91] "White" "Red"   "Red"   "Red"   "White" "Black" "White" "Black" "Black"
    [100] "White"

That's better! And we can see how the data now is read as character instead of factor. From R version 4.0 onwards we do not have to specify stringsAsFactors=FALSE; this is the default behavior.

The as.is Argument

This is an extension of the stringsAsFactors argument, but gives you control over individual columns. For example, if we want the colors of cars imported as strings, but we want the names of the states imported as factors, we would load the data set as:

    carSpeeds <- read.csv(file = 'data/car-speeds.csv', as.is = 1)

    # Note, the 1 applies as.is to the first column only

Now we can see that if we try to replace 'Blue' with 'Green' in the $Color column everything looks fine, while trying to replace 'Arizona' with 'Ohio' in the $State column returns the factor numbers for the names of states that we haven't replaced:

    str(carSpeeds)

    'data.frame':   100 obs. of  3 variables:
     $ Color: chr  "Blue" " Red" "Blue" "White" ...
     $ Speed: int  32 45 35 34 25 41 34 29 31 26 ...
     $ State: Factor w/ 4 levels "Arizona","Colorado",..: 3 1 2 1 1 1 3 2 1 2 ...

    carSpeeds$Color <- ifelse(carSpeeds$Color == 'Blue', 'Green', carSpeeds$Color)
    carSpeeds$Color

      [1] "Green" " Red"  "Green" "White" "Red"   "Green" "Green" "Black" "White"
     [10] "Red"   "Red"   "White" "Green" "Green" "Black" "Red"   "Green" "Green"
     [19] "White" "Green" "Green" "Green" "Red"   "Green" "Red"   "Red"   "Red"
     [28] "Red"   "White" "Green" "Red"   "White" "Black" "Red"   "Black" "Black"
     [37] "Green" "Red"   "Black" "Red"   "Black" "Black" "Red"   "Red"   "White"
     [46] "Black" "Green" "Red"   "Red"   "Black" "Black" "Red"   "White" "Red"
     [55] "Green" "Green" "Black" "Green" "White" "Black" "Red"   "Green" "Green"
     [64] "White" "Black" "Red"   "Red"   "Black" "Green" "White" "Green" "Red"
     [73] "White" "White" "Green" "Green" "Green" "Green" "Green" "White" "Black"
     [82] "Green" "White" "Black" "Black" "Red"   "Red"   "White" "White" "White"
     [91] "White" "Red"   "Red"   "Red"   "White" "Black" "White" "Black" "Black"
    [100] "White"

    carSpeeds$State <- ifelse(carSpeeds$State == 'Arizona', 'Ohio', carSpeeds$State)
    carSpeeds$State

      [1] "3"    "Ohio" "2"    "Ohio" "Ohio" "Ohio" "3"    "2"    "Ohio" "2"
     [11] "4"    "4"    "4"    "4"    "4"    "3"    "Ohio" "3"    "Ohio" "4"
     [21] "4"    "4"    "3"    "2"    "2"    "3"    "2"    "4"    "2"    "4"
     [31] "3"    "2"    "2"    "4"    "2"    "2"    "3"    "Ohio" "4"    "2"
     [41] "2"    "3"    "Ohio" "4"    "Ohio" "2"    "3"    "3"    "3"    "2"
     [51] "Ohio" "4"    "4"    "Ohio" "3"    "2"    "4"    "2"    "4"    "4"
     [61] "4"    "2"    "3"    "2"    "3"    "2"    "3"    "Ohio" "3"    "4"
     [71] "4"    "2"    "Ohio" "4"    "2"    "2"    "2"    "Ohio" "3"    "Ohio"
     [81] "4"    "2"    "2"    "Ohio" "Ohio" "Ohio" "4"    "Ohio" "4"    "4"
     [91] "4"    "Ohio" "Ohio" "3"    "2"    "2"    "4"    "3"    "Ohio" "4"

We can see that the $Color column is a character vector while $State is a factor.
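To check the column types that as.is produces without the full data set, a small sketch along these lines may help. The two-row temporary file and the name mixed are assumptions made for the example.

```r
tmp <- tempfile(fileext = ".csv")
writeLines(c("Color,Speed,State",
             "Blue,32,NewMexico",
             "Red,45,Arizona"), tmp)

# as.is = 1 keeps column 1 as character; the remaining text column
# (State) is converted to a factor
mixed <- read.csv(file = tmp, as.is = 1)
sapply(mixed, class)
```

A numeric as.is marks only the listed columns as "leave alone", which is why State is still converted here.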

Updating Values in a Factor

Suppose we want to keep the colors of cars as factors for some other operations we want to perform. Write code for replacing 'Blue' with 'Green' in the $Color column of the cars dataset without importing the data with stringsAsFactors=FALSE.

Solution

    carSpeeds <- read.csv(file = 'data/car-speeds.csv')

    # Replace 'Blue' with 'Green' in cars$Color without using the stringsAsFactors
    # or as.is arguments
    carSpeeds$Color <- ifelse(as.character(carSpeeds$Color) == 'Blue',
                              'Green',
                              as.character(carSpeeds$Color))

    # Convert colors back to factors
    carSpeeds$Color <- as.factor(carSpeeds$Color)
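An alternative to the solution above, offered only as a sketch, is to rename the factor level directly with levels(). The short colors vector is made up for illustration and is not part of the lesson data.

```r
colors <- factor(c("Blue", "Red", "White", "Blue"))

# Rename the level itself; every 'Blue' entry becomes 'Green' and the
# vector stays a factor throughout
levels(colors)[levels(colors) == "Blue"] <- "Green"
colors
levels(colors)
```

One difference worth noting: this keeps the existing level order, whereas converting to character and back with as.factor() re-sorts the levels alphabetically.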

The strip.white Argument

It is not uncommon for mistakes to have been made when the data were recorded, for example a space (whitespace) may have been inserted before a data value. By default this whitespace will be kept in the R environment, such that ' Red' will be recognized as a different value than 'Red'. In order to avoid this type of error, use the strip.white argument. Let's see how this works by checking for the unique values in the $Color column of our dataset:

Here, the data recorder added a space before the color of the car in one of the cells:

    # We use the built-in unique() function to extract the unique colors in our dataset
    unique(carSpeeds$Color)

    [1] Green  Red  White Red   Black
    Levels:  Red Black Green Red White

Oops, we see two values for red cars.

Let's try again, this time importing the data using the strip.white argument. Note - strip.white is only used when a delimiter has been specified through the sep argument, by which we indicate the type of delimiter in the file (the comma for most .csv files). read.csv() already defaults to sep = ','; we pass it explicitly here to make the delimiter visible.

    carSpeeds <- read.csv(
      file = 'data/car-speeds.csv',
      stringsAsFactors = FALSE,
      strip.white = TRUE,
      sep = ','
    )

    unique(carSpeeds$Color)

    [1] "Blue"  "Red"   "White" "Black"

That's better!
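The same comparison can be run on a throwaway file. In this sketch the file contents and the names kept and stripped are assumptions; trimws() is shown as one possible way to tidy a column that has already been imported.

```r
tmp <- tempfile(fileext = ".csv")
writeLines(c("Color,Speed",
             "Blue,32",
             " Red,45",
             "Red,25"), tmp)

kept     <- read.csv(file = tmp, stringsAsFactors = FALSE, sep = ',')
stripped <- read.csv(file = tmp, stringsAsFactors = FALSE,
                     strip.white = TRUE, sep = ',')

# Without strip.white, " Red" and "Red" count as distinct values
unique(kept$Color)
unique(stripped$Color)

# trimws() removes leading and trailing whitespace after the fact
unique(trimws(kept$Color))
```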

Specify Missing Data When Loading

It is common for data sets to have missing values, or mistakes. The convention for recording missing values often depends on the individual who collected the data and can be recorded as n.a., --, or empty cells " ". R recognises the reserved character string NA as a missing value, but not some of the examples above. Let's say the inflammation scale in the data set we used earlier, inflammation-01.csv, actually starts at 1 for no inflammation and the zero values (0) were a missed observation. Looking at the ?read.csv help page, is there an argument we could use to ensure all zeros (0) are read in as NA? Perhaps the car-speeds.csv data contains mistakes and the person measuring the car speeds could not accurately distinguish between "Black" and "Blue" cars. Is there a way to specify more than one 'string', such as "Black" and "Blue", to be replaced by NA?

Solution

    read.csv(file = "data/inflammation-01.csv", na.strings = "0")

or, in car-speeds.csv, use a character vector for multiple values.

    read.csv(
      file = 'data/car-speeds.csv',
      na.strings = c("Black", "Blue")
    )
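The n.a. and -- conventions mentioned in the exercise can be handled the same way. The sketch below uses a made-up three-row file in a temporary location; the name speeds is an assumption.

```r
tmp <- tempfile(fileext = ".csv")
writeLines(c("Color,Speed",
             "Black,n.a.",
             "Blue,--",
             "Red,45"), tmp)

# Treat 'n.a.' and '--' as missing; Speed can then be read as a number
speeds <- read.csv(file = tmp, na.strings = c("n.a.", "--"))
is.na(speeds$Speed)
class(speeds$Speed)
```

Note that supplying na.strings replaces the default of "NA" rather than adding to it, so include "NA" in the vector if the file uses that convention as well.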

Write a New .csv and Explore the Arguments

After altering our cars dataset by replacing 'Blue' with 'Green' in the $Color column, we now want to save the output. There are several arguments for the write.csv(...) function call, a few of which are particularly important for how the data are exported. Let's explore these now.

    # Export the data. The write.csv() function requires a minimum of two
    # arguments, the data to be saved and the name of the output file.

    write.csv(carSpeeds, file = 'data/car-speeds-cleaned.csv')

If you open the file, you'll see that it has header names, because the data had headers within R, but that there are numbers in the first column.

[Image: csv written without row.names argument]

The row.names Argument

This argument allows us to set the names of the rows in the output data file. R's default for this argument is TRUE, and since it does not know what else to name the rows for the cars data set, it resorts to using row numbers. To correct this, we can set row.names to FALSE:

    write.csv(carSpeeds, file = 'data/car-speeds-cleaned.csv', row.names = FALSE)

Now we see:

[Image: csv written with row.names argument]
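The difference can also be inspected from within R. This is a minimal sketch using a made-up two-row data frame (cars) and temporary files, none of which are part of the lesson data.

```r
cars <- data.frame(Color = c("Green", "Red"), Speed = c(32L, 45L))

withNames    <- tempfile(fileext = ".csv")
withoutNames <- tempfile(fileext = ".csv")
write.csv(cars, file = withNames)
write.csv(cars, file = withoutNames, row.names = FALSE)

# With row names, the header gains an empty first field
readLines(withNames)[1]
readLines(withoutNames)[1]

# Reading the first file back turns the row numbers into an extra column
names(read.csv(file = withNames))
```

The extra column is named X on re-import, which is a common sign that a file was written with the default row.names = TRUE.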

Setting Column Names

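There is also a col.names argument, which can be used to set the column names for a data set without headers. If the data set already has headers (e.g., we used the header = TRUE argument when importing the data) then a col.names argument will be ignored. Note that write.csv() is deliberately inflexible here: according to its help page, attempts to change col.names are ignored with a warning, so use write.table() if you need to control the column names on export.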
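The help page for write.csv() states that attempts to change col.names are ignored, with a warning. The sketch below, on a made-up data frame and a temporary file, checks for that warning and then uses write.table(), which does accept col.names, to write the same data without a header.

```r
cars <- data.frame(Color = c("Green", "Red"), Speed = c(32L, 45L))
out <- tempfile(fileext = ".csv")

# write.csv() signals a warning when col.names is supplied
warned <- tryCatch({
  write.csv(cars, file = out, row.names = FALSE, col.names = FALSE)
  FALSE
}, warning = function(w) TRUE)
warned

# write.table() honours col.names, here dropping the header line
write.table(cars, file = out, sep = ",", row.names = FALSE, col.names = FALSE)
readLines(out)
```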

The na Argument

There are times when we want to specify certain values for NAs in the data set (e.g., we are going to pass the data to a program that only accepts -9999 as a nodata value). In this case, we want to set the NA value of our output file to the desired value, using the na argument. Let's see how this works:

    # First, replace the speed in the third row with NA, by using an index (square
    # brackets to indicate the position of the value we want to replace)
    carSpeeds$Speed[3] <- NA
    head(carSpeeds)

      Color Speed     State
    1  Blue    32 NewMexico
    2   Red    45   Arizona
    3  Blue    NA  Colorado
    4 White    34   Arizona
    5   Red    25   Arizona
    6  Blue    41   Arizona

    write.csv(carSpeeds, file = 'data/car-speeds-cleaned.csv', row.names = FALSE)

Now we'll set NA to -9999 when we write the new .csv file:

    # Note - the na argument requires a string input
    write.csv(carSpeeds,
              file = 'data/car-speeds-cleaned.csv',
              row.names = FALSE,
              na = '-9999')

And we see:

[Image: csv written with -9999 as NA]
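To confirm what ends up in the file, and to recover the missing value later, a round trip such as the following can be used. The small cars data frame and the temporary file are assumptions for illustration.

```r
cars <- data.frame(Color = c("Green", "Red"), Speed = c(32L, NA))

out <- tempfile(fileext = ".csv")
write.csv(cars, file = out, row.names = FALSE, na = '-9999')

# The NA has been written as -9999
readLines(out)

# Reading the file back with na.strings = '-9999' restores the NA
back <- read.csv(file = out, na.strings = '-9999')
is.na(back$Speed)
```

Pairing na on export with na.strings on import keeps the sentinel value from being mistaken for a real speed.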

Key Points

  • Import data from a .csv file using the read.csv(...) function.

  • Understand some of the key arguments available for importing the data properly, including header, stringsAsFactors, as.is, and strip.white.

  • Write data to a new .csv file using the write.csv(...) function.

  • Understand some of the key arguments available for exporting the data properly, such as row.names, col.names, and na.


Source: https://swcarpentry.github.io/r-novice-inflammation/11-supp-read-write-csv/
