# Dplyr Cheat Sheet 2020

Posted : admin On 1/1/2022As such the guides complement and do not substitute the official dplyr cheat sheet which is available here. Note that many dplyr functions are not covered here; all the nice joining functions designed to work with pairs of tables, but also many others. These guides about dplyr have been created on the October 2020 using LibreOffice. This book contains community contributions for STAT GR 5702 Fall 2020 at Columbia University. 6.5.1 DPLYR Queries. 27 likert data cheat sheet. I’m looking forward to our session at FSCI 2020 on “W24 - Working with Scholarly Literature in R: Pulling, Wrangling, Cleaning, and Analyzing Structured Bibliographic Metadata.”. This entire course is distributed under a Creative Commons Attribution 4.0 International (CC BY 4.0) License. The OSF home is at https://osf.io/hd4re/.

### Data Wrangling Dplyr Cheat Sheet

Sep 9, 2018 · 8 minute read · Comments*Updated August 2020: changed to broom’s newer nest-map-unnest pattern, from the older group_by-do pattern.*

I recently needed to fit curves on several sets of similar data, measured from different sensors. I found how to achieve this with dplyr, without needing to define outside functions or use for-loops. This approach integrates perfectly with my usual dplyr and ggplot2 workflows, which means it adapts to new data or new experimental conditions with no changes. Here are the ready-made recipes for any one else who may run into a similar problem.

First we’ll generate some artificial data so that you can follow along at home. Here’s an exponentially decaying value measured by two sensors at different locations:

Our data frame looks like this:

t | y | sensor |
---|---|---|

1 | 52.25302 | sensor1 |

2 | 51.71440 | sensor1 |

3 | 51.13971 | sensor1 |

… | … | … |

98 | 38.13328 | sensor2 |

99 | 38.17017 | sensor2 |

100 | 37.72184 | sensor2 |

And our data looks like this:

I created the data in *long format*, as it works best with dplyr pipelines. You can read more about long format vs wide format data here.

## TL;DR

### I want to plot the fitted curve over my data points

### I want to plot the residuals

## I want a table of my fit parameters for each condition:

sensor | a | b |
---|---|---|

sensor1 | -0.2495347 | 47.87376 |

sensor2 | -0.2350898 | 59.77793 |

### I want a less horrendous fit. How can I do better?

Try with an actual exponential. You’ll probably want a self-starting function to avoid the “singular gradient” error — read more in my post on the subject.

## Explanation: intro to curve fitting in R

The goal of our fitting example was to find an estimate $hat{y}(t) = a t + b$ that approximates our measured data $y$. In R, we can directly write that we want to approximate $y$ as a function $a cdot t + b$, using the very intuitive built-in **formula** syntax:

In R’s documentation, $a$ and $b$ are the **coefficients**, $t$ is the **independent variable** and $y$ is the **dependent variable** — $t$ can go about its day without every knowing $y$ exists, but $y$ can’t progress without knowing the latest trend in $t$.

We fit these models using various fitting functions: `lm`

or `glm`

for linear models or `nls`

for non-linear least squares for example. I’m using `nls`

here because I can specify the names of my coefficients, which I find clearer to explain:

The fitting functions all return a fit object:

You can use one of these base R functions to extract useful data from the fit object:

`coef(fit)`

returns the values of the coefficients $a$ and $b$.`predict(fit)`

returns $hat{y}(t_i)$, i.e. it applies the fitted model to each of the original data points $t_i$. It can also be applied on new values of $t$.`resid(fit)`

returns the residuals $y_i - hat{y}(t_i)$ at each point of the original data.

Try them out in the console:

Now these functions all return vectors, which work best with R’s native plotting functions. To integrate with dplyr and ggplot, we’d rather have data frames or tibbles. This is where the *broom* package comes in.

## Introducing broom

*Broom* is a separate R package that feeds on fit results and produces useful data frames. Install it directly within the R console if you haven’t already:

It provides three useful functions, which are easier to demonstrate than to explain. All of them return data frames:

**tidy**extracts the fit coefficients and related information. The**term**column contains the name of your coefficient, and**estimate**contains its fitted value:

term | estimate | std.error | statistic | p.value |
---|---|---|---|---|

a | -0.2495347 | 0.006377179 | -39.12932 | 1.267667e-61 |

b | 47.8737585 | 0.370946838 | 129.05827 | 3.123848e-111 |

**augment**is equivalent to`predict`

and`resid`

combined. The returned data frame contains columns`t`

and`y`

with your original data, as well as`.fitted`

, the fitted curve, and`.resid`

, the residuals:

t | y | .fitted | .resid |
---|---|---|---|

1 | 52.25302 | 47.62422 | 4.62879752 |

2 | 51.71440 | 47.37469 | 4.33971163 |

3 | 51.13971 | 47.12515 | 4.01455804 |

4 | 50.45188 | 46.87562 | 3.57625801 |

… | … | … | … |

You can plot the ‘augmented’ fit with qplot:

**glance**shows fit statistics:

sigma | isConv | finTol | logLik | AIC | BIC | deviance | df.residual |
---|---|---|---|---|---|---|---|

1.840841 | TRUE | 1.571375e-09 | -201.906 | 409.8119 | 417.6274 | 332.0921 | 98 |

## Using broom with dplyr

### Dplyr Cheat Sheet 2020 Pdf

Since broom’s functions return data frames, they integrate naturally with magrittr pipelines (`%>%`

) and dplyr. Let’s illustrate by grouping the data by sensor id, then producing a fit object for each group. The `nest`

function creates a new tibble with one line per sensor. Each line contains the name of the sensor (in the `sensor`

column) and an embedded tibble with that sensor’s data (in the `data`

column):

sensor | data |
---|---|

sensor1 | `<tibble>` |

sensor2 | `<tibble>` |

Each `<tibble>`

here contains the $t$ and $y$ columns, with rows from one sensor.

We can then `map`

the fitting function onto each sensor’s tibble to create a new column with fit results:

sensor | data | fit |
---|---|---|

sensor1 | `<tibble>` | `<S3: nls>` |

sensor2 | `<tibble>` | `<S3: nls>` |

We now have one nls object per group, which can be fed into either `tidy`

or `augment`

. Again, we use `map`

to apply these functions to each separate nls object:

sensor | data | fit | tidied |
---|---|---|---|

sensor1 | `<tibble>` | `<S3: nls>` | `<tibble>` |

sensor2 | `<tibble>` | `<S3: nls>` | `<tibble>` |

To actually use the result we unnest it into a regular tibble. Unnest inflates each sensor’s tibble and concatenates them (`rbinds`

them, similarly to how we created `df`

in the first place):

sensor | data | fit | term | estimate | std.error | statistic | p.value |
---|---|---|---|---|---|---|---|

sensor1 | `<tibble>` | `<S3: nls>` | a | -0.2492956 | 0.006427426 | -38.78622 | 2.853154e-61 |

sensor1 | `<tibble>` | `<S3: nls>` | b | 47.8620767 | 0.373869603 | 128.01810 | 6.872225e-111 |

sensor2 | `<tibble>` | `<S3: nls>` | a | -0.2357433 | 0.003189115 | -73.92124 | 9.150052e-88 |

sensor2 | `<tibble>` | `<S3: nls>` | b | 59.8061527 | 0.185503969 | 322.39824 | 4.294989e-150 |

From here, we can use `select`

and `spread`

to generate a more succinct table, as in the TL;DR example, or keep the table as-is to inspect the quality of the fit.

The result of `augment`

is a long data frame that can be used immediately with qplot, also shown in the TL;DR section above:

sensor | data | fit | t | y | .fitted | .resid |
---|---|---|---|---|---|---|

sensor1 | `<tibble>` | `<S3: nls>` | 1 | 52.41684 | 47.58220 | 4.834645067 |

sensor1 | `<tibble>` | `<S3: nls>` | 2 | 51.64323 | 47.33347 | 4.309757950 |

sensor1 | `<tibble>` | `<S3: nls>` | 3 | 51.09595 | 47.08475 | 4.011196179 |

… | … | … | … | … | … | … |

sensor2 | `<tibble>` | `<S3: nls>` | 98 | 38.06029 | 36.72428 | 1.336007642 |

sensor2 | `<tibble>` | `<S3: nls>` | 99 | 37.86891 | 36.48926 | 1.379642201 |

sensor2 | `<tibble>` | `<S3: nls>` | 100 | 37.89936 | 36.25425 | 1.645117019 |

That’s it for broom and dplyr. The magic of this approach is that you can add a sensor to your input data then re-run all your code; your new curves will just appear on the plots. Similarly, you can change the formula of your fit, and your new coefficients will appear as a new column in the coefficients table. No need to rewrite all your code.

## References

- R documentation: https://stat.ethz.ch/R-manual/R-devel/library/stats/html/nls.html
- Data transformation cheat sheet: https://github.com/rstudio/cheatsheets/raw/master/data-transformation.pdf
- Stack overflow: https://stackoverflow.com/questions/24356683/apply-grouped-model-back-onto-data
- Stack overflow: https://stackoverflow.com/questions/22713325/fitting-several-regression-models-with-dplyr