Hands on introduction to Git and GitHub

Version control

Image title — 'Piled Higher and Deeper' by Jorge Cham www.phdcomics.com

Version control keeps track of changes to a file over time. While the document above is technically under version control, using file names can be chaotic and unreliable especially when files are passed between multiple people working on different computers.

Version control systems work by tracking incremental differences -- delineated by commits -- in files and keeping a history of those changes. This allows you to return to return to previous versions at any time. The file name doesn't change and only one version of a file is displayed at a time, but the full history is accessible. This is similar to the version control that occurs in programs like Google Docs or Microsoft Word.

Version control with Git and GitHub

Git is a free, distributed version control system. The complete history of your project is safely stored in a hidden .git repository. A repository is like a project folder or directory on your computer. It contains all of the files and folders for a project as well as the history of those files. Changes to files are recorded as commits. A commit is marked by the user when they're ready to check in a set of changes to a file. You can always go back to a previous commit.

GitHub is an internet service for hosting files that are under git version control. You can think of it like social media for code and small text files. GitHub houses repositories in a central locations. Repositories on GitHub are referred to as remote repositories. Users can create a local copy of a repository by cloning (downloading) it to their computer, or to a computer in the cloud that they have access to.

You don't need GitHub to use the Git version control system. By using the Git command line interface, you can use version control locally without ever pushing your files to a remote repository on GitHub.

You also don't need to know the Git command line interface to take advantage of the version control system: Git is well-integrated into GitHub, so if you know the basic terminology and definitions, you can still version control your project through GitHub alone.

However, both Git and GitHub are most powerful when combined together. This workshop will teach you the basics of Git and GitHub so you can leverage both in your research.

Setting up

Signing up for a GitHub account

You will need a GitHub account for today's workshop. If you already have one, you can use your current username (e.g. there is no need to create an Arcadia-specific GitHub username). If you don't, sign up at github.com. Usernames are public so choose accordingly.

Accessing a Unix shell & Git

For this lesson, we need to have access to a Unix shell. If you're not sure how to open a terminal on your computer, see these instructions.

Many computers come with Git installed. Once you have a Unix shell open, run the following command to see if git is installed:

which git

You should see something like:

/usr/bin/git

If you don't, you'll need to install git. If Git is not already installed on your computer, see these instructions.

First time setup and configuration

Whenever you use Git for the first time on a computer, there are a few things you need to set up. Unless you need to change something, these commands only need to be used the first time you use Git on a computer.

First, set your name and email:

git config --global user.name "My Name"
git config --global user.email "myemail@gmail.com"

Your email and user name is recorded with every commit. This helps ensure integrity and authenticity of the history. Most people keep their email public, but if you are concerned about privacy, check GitHub's tips to hide your email.

Next, we'll set up a key pair. GitHub recently discontinued password authentication. Cryptographic keys are a convenient and secure way to authenticate without having to use passwords and are an authentication method still supported by GitHub. They consist of a pair of files called the public and private keys: the public part can be shared with whoever you'd like to authenticate with (in our case, GitHub), and the private part is kept "secret" on your machine. The private key should never be shared in an unencrypted way with anyone -- this includes over email, slack, etc. If your private key is accidentally shared, you should stop using it for authentication and create a new key pair. Things that are encrypted with the public key can be be decrypted with the private key, but it is computationally intractable (ie, it would take on the order of thousands of years) to determine a private key from a public key. You can read more about it here.

cd ~
mkdir -p .ssh
cd .ssh
ssh-keygen

The ssh-keygen program will prompt you to input a key file name (below, we use 20220920-github-workshop) and a passphrase. It's ok to leave the passphrase blank; if you put in a passphrase, you'll need to remember it and type it every time you use the key file.

ssh-keygen
Generating public/private rsa key pair.
Enter file in which to save the key (/home/jovyan/.ssh/id_rsa): 20220920-github-workshop
Enter passphrase (empty for no passphrase): 
Enter same passphrase again: 

Your identification has been saved in 20220920-github-workshop.
Your public key has been saved in 20220920-github-workshop.pub.
The key fingerprint is:
SHA265:
The key's randomart image is:

Next, we'll check the key files permissions. Permissions control who has access to a file on your computer, and key files need to have very restricted permissions.

ls -lah

We see:

total 16K
drwxr-xr-x 2 jovyan jovyan 4.0K Sep  9 20:21 .
drwxr-xr-x 1 jovyan jovyan 4.0K Sep  9 20:20 ..
-rw------- 1 jovyan jovyan 1.7K Sep  9 20:21 20220920-github-workshop
-rw-r--r-- 1 jovyan jovyan  445 Sep  9 20:21 20220920-github-workshop.pub

We are the only user who has read access to our private key file, so our permissions are fine.

When we ran ls, we saw that the ssh-keygen program generated two files. The first file 20220920-github-workshop is the private key file and should never be shared. The second file 20220920-github-workshop.pub is the public key file that we'll upload to GitHub. We can tell it's the public key file because it ends in .pub.

Next, we need to get our public key file uploaded to GitHub so we can use it for authentication. GitHub will need the text in the public key file. You can view it by running cat:

cat 20220920-github-workshop.pub

Your public key file text should look something like this:

ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQCtrKIjBDjfAt3sIfHOPKEE/RkcuPAfdl0xO7M+CBQNWuYqST2bW20yRFu4lpCyNuz7uG12DgIVMmLMdlfGlGjJpj/B/f3FUw6XxIaAzQIfYtg+Q+qJ0M2GSaooeEoKi9lgiJsR69fwAoVPgI0kyyA4253F/SLVD/QpWMQgcN5m43tPztc9vp1Lt5u8PZmZJUBMyMolOgtvRUYDKx7MRb9nWO/Rmzeibj96hLEm8GHiERRGDpHK1BOryiq2jI9C2+o3ujj+SWCeqRVTdBp7raSzhwPWCsLpNRX1MNQ9t+807eDV8pUDnJ6gfHZndcQ23k+OMwhTdhPk74drz2k5X+/h jovyan@jupyter-arcadia-2dscience-2dtional-2dtraining-2dgi36ozk6

To upload the key file text, navigate to GitHub and click on settings.

Then use the menu on the left hand side of the page to navigate to the SSH and GPG keys tab.

Once there, select New SSH key.

Give your key a descriptive name (such as 20220920-github-workshop) and then paste in the contents of your public key file to text editor.

The very last thing we need to do is tell our computers which key file to use when we want to authenticate with GitHub. We do this by creating a config file in our .ssh directory. We'll use nano to do this. Type the following contents into your config file and save it as config.

nano config

Host github.com
 User git
 HostName ssh.github.com
 IdentityFile ~/.ssh/20220920-github-workshop

Follow the command prompts in the bottom ribbon of nano to exit.

To confirm that you have set up authentication properly, run the following command:

ssh -i ~/.ssh/20220920-github-workshop -T git@github.com

First, you'll see a message like:

The authenticity of host 'ssh.github.com (192.30.255.122)' can't be established.
ECDSA key fingerprint is SHA256:p2QAMXNIC1TJYWeIOttrVc98/R1BUFWu3/LiyKgUfQM.
Are you sure you want to continue connecting (yes/no/[fingerprint])?

Type yes and hit enter. Then, you should see:

Hi username! You've successfully authenticated, but GitHub does not provide shell access.

This may be all of the set up you need to do. However, if you continue to have issues, you may need to load your key into the SSH agent with:

ssh-add ~/.ssh/20220920-github-workshop

Getting started with version control

Now that we've established a secure connection between our computers and GitHub, it's time to learn how to start a version controlled repository and add files to it. We'll start by creating a repository on GitHub.

Creating your first repository on GitHub

Follow the steps below to create your first repository.

Navigate to GitHub.
Click on the plus sign in the upper right of the screen.
Select "New repository" from the drop-down menu.
On the "Create a new repository" page, choose a name for your repository. For this workshop, name the repository 2022-git-workshop. A repository can have any valid directory name, but putting the year at the beginning is a good practice because it makes it clear when the repo was created.
Check the box "Add a README file".
Click the green "Create repository" button at the bottom of the page.

This will create a new repository and take you to the repository's landing page on GitHub.

Cloning a repository from remote to local

Cloning is the process of copying an existing Git repository from a remote location (here, on GitHub) to your local computer (here, on our binder instances). To clone the repository, click the green "Code" button in the top right hand corner of the repository screen. This creates a drop down menu with clone options. We'll select the SSH tab because we configured an ssh key pair. Once you select the tab, copy the path that starts with git@gitub.com:. Then, navigate to your terminal and use the command below to clone the repository. Remember to substitute out the username/URL with your own URL that we copied.

cd ~
git clone git@github.com:your_username/2022-git-workshop
cd 2022-git-workshop

The basic Git workflow

We can now make changes to our files in our local repository. The basic Git workflow begins when you communicate to Git about changes you've made to your files. Once you like a set of changes you’ve made you can tell Git about them by using the command git add. This stages the changes you have made. Staging a file tells Git that you're ready to commit those files -- it's a way of telling Git which files you're ready to commit. Next, you bake them into the branch using git commit. When you’re ready, you can communicate those changes back to GitHub using git push. This will push your the changes that are in your local repository up to the remote repository.

Let's try this workflow out. Throughout this process, we'll use the command git status to track our progress through the workflow. git status displays the state of a working direcotry and the staging area.

git status

We'll use the echo command to create a new file, notes.txt.

ls
echo "some interesting notes" > notes.txt
ls

Take a look at the contents of your notes.txt file:

less notes.txt

And run git status to see how creating a new file changes the output of that command:

git status

Once you have made changes in your repository, you need to tell Git to start tracking the file. The command git add adds the files to the "staging area", meaning Git is now tracking the changes.

git add notes.txt

After adding this file, we see our output of git status changes.

git status

The text associated with our file is now green because the file is staged. When you've made all of the changes to a file that represent a unit of changes, you can use git commit to create a snapshot of the file in your Git version history.

git commit -m "start notes file"

git status

The -m flag is the message that is associated with that commit. The message should be short and descriptive so that you or someone looking at your code could quickly determine what changes took place from one commit to the next. What constitutes a unit of changes worthy of running git commit? That depends on the project and the person, but think about returning to this code six months in the future. What set of changes would make it most easy to return to an earlier version of document?

Committing a file bakes changes into your local repository. To communicate that changes back up to your remote repository, use git push.

git push

Challenge: Add today's date to the README.md text file, stage the changes in those files, commit them to version history, and push them up to your remote repository.

Challenge solution

You can make changes to your README.md file however you choose -- using a text editor on your system (TextEdit, BBEdit, VSCode, Sublime), using a text editor in your terminal, or by using a shell redirect. This solution demonstrates how to add text using echo and redirects >>.

echo "20220920" >> README.md
git add README.md
git commit -m "add date to readme"
git push

Working on branches

When you want to make changes to files in a repository, it’s best practice to do it in a branch. A branch is an independent line of development of files in a repository. You can think of it like making a copy of a document and making changes to that copy only: the original stays the same, but the copy diverges with new content. In a branch, all files are copied and can be changed.

Branches are particularly powerful for collaborating. They allow multiple people to work on the same code base and seamless integrate changes.

By default, you start on the main branch. You can see which branch you're on locally using the git branch command.

git branch

Let's create a branch and make changes to it. We can create a new local branch using git checkout -b. git checkout tells git we want to switch which branch we're currently on, while the -b flag tells git that we're creating a new branch.

git checkout -b my-first-branch

Now, if we run git branch we'll see we're on the branch my-first-branch.

git branch

To go back to the main branch, we can use the git checkout command without the flag -b.

git checkout main

Branch names can be used to convey information that can help you and others keep track of what you're doing. For example, you can prepend your branch names with your initials and a slash to make it clear that you were the one who created the branch, and you can specify a name that is associated with the file changes made on the branch: <your initials>/<brief description of the code change>. Example: ter/git-workshop.

Let's practice making a branch using these conventions. In our branch, we'll update our README.md file. Use the code below to create a new branch, but remember to substitute out your-initials for your initials.

git checkout -b your-initials/update-readme.

Whenever you create a new branch, it branches off from the branch you're currently in when you make the new branch. In this case, we started in main, so our branch will branch off of from here.

Next, let's add some changes to our README.md file and run git status.

echo "adding more text to the README file" >> README.md
git status

Stage the changes with git add:

git add README.md

And then create a new commit with git commit:

git commit -m "update readme w more text"

Run git push:

git push

This gives us an error message!

fatal: The current branch tmp has no upstream branch.
To push the current branch and set the remote as upstream, use

    git push --set-upstream origin ter/update-readme

While we created our branch locally, we haven't created the same branch remotely. We need to tell git where to put our changes in the remote repository. Luckily, the error message points us to the code we need to run to set the remote branch.

git push --set-upstream origin ter/update-readme

In the above command, origin is shorthand for the remote repository that a project was cloned from. To view what this value is, you can run git remote -v.

If we navigate to our repositories on GitHub, we now see a yellow banner wtih our branch name and a green button inviting us to "Compare & pull request".

Integrating changes into `main` using pull requests

We just created changes to our README.md file in a branch. To integrate this changes formally back into the main branch, you can open a pull request. Pull requests create a line-by-line comparison between the original code and the new code and generate an interface for inline and overall comments and feedback. Pull requests are how changes are reviewed before they’re integrated back into the main branch.

To open a pull request, click the green button "Compare & pull request". This will take you to a page where you can draft your pull request. You should update the title of your pull request to reflect the changes you made in your branch in plain language. Then you should add a few sentence description of those changes. Once you've done that, click the "Create pull request" button.

The pull request interface has a lot of features that make collaboration and code review straight forward. The main pull request page supports many important features:

Each commit and the descriptive message your provided for them are displayed.
Conversations about the file changes can take place via comments.
You can convert your pull request to a draft if you're not done adding to it yet, or request review from other GitHub users.

The main pull request page also has a tab for "Files changed". This provides a succint view of every line changed in every file, showing you the new content in green and the deleted content in red.

The "Files changed" interface is the most useful interface for code review. Code review is when someone who didn't write the code but who has domain expertise relevant to the code reads and reviews the code being added or removed by a pull request. Code review a) encourages you to write better documentation and cleaner code as you go because you know someone will be reading it soon, b) gives a collaborator an opportunity to comment on and improve your work, and c) increases accountability and visibility of your work. There is no one-size-fits-all approach for code review, but these are some things to keep an eye out for when you're reviewing someone's code (modified from here):

Bugs/Potential bugs
- Repetitive code
- Code saying one thing, documentation saying another
- Off-by-one errors
- Making sure each function does one thing only
- Lack of tests and sanity checks for what different parts are doing
- Magic numbers (a number hardcoded in the script)
Unclear, messy code
- Bad variable/method names
- Inconsistent indentation
- The order of the different steps
- Too much on one line
- Lack of comments and signposting
Fragile and non-reusable code
- Software or software versions not documented
- Tailor-made and manual steps
- Only works with the given data

For more on code review, see this lesson.

From the "Files changed" tab, you can see all of the changes suggested in a pull request. You can use in-line comments and suggestions or holistic comments on changes. When you're finished with your review, you can submit it as a set of comments, approved changes, or requesting further changes.

Once a pull request is approved, the changes are merged into the main branch.

Challenge Open a pull request with the changes you made to your README.md. Request a review from the person sitting next to you. After your PR has been reviewed, merge the changes into your main branch.

Pulling changes from the remote repository back to the local repostory

After we merged our pull requests, our main branches in our remote repositories on GitHub now contain content that our To get the merged changes back to your local branch, you can run git pull in your local repository.

git pull

This will pull in all of the changes that are on GitHub but not in our local repo.

Challenge Checkout the main branch and run git pull. What happens?

Challenge solution

git checkout main
git pull

You should see a message that ends with:

Already up to date.

This happens because we already pulled in all of the remote changes to our local repository, so there is nothing left to do.

Now that all of our changes to our README.md are in our local and remote branch, we can safely delete our branch.

git branch

git branch -d ter/update-readme

While you don't have to delete a branch when you're finished with it, deleting branches you're done with helps keep your project tidy and helps yourself and others not get confused when you revisit work in the future.

Bringing the whole process together

This figure gives an overview of the entire Git & GitHub ecosystem. It provides a succinct review of the concepts we've covered today.

GitHub has integrated enough features that many of these steps can be orchestrated completely on GitHub without needing to clone a repository to your local machine. However, this mental model is still helpful: you can create a branch, make edits to a text file and commit them, open a pull request, and merge the pull request all from the GitHub online interface. You do not need to learn the Git CLI to experience the joys and benefits of GitHub and to contribute to projects that live there.

Undoing Changes

Confusingly, git has four different subcommands for undoing things:

restore
revert
reset
checkout

We'll cover these commands below.

Restoring a file to its state at the previous commit

The git restore command is useful if you want to undo changes you haven't committed yet.

Open up the README.md file in a text editor and make some changes. For instance, let's add the line This is a mistake to the end of the file. Then save and close the file.

Next, check the repository status:

git status

As expected, git tells us there are changes to the README.md file:

On branch main
Changes not staged for commit:
  (use "git add <file>..." to update what will be committed)
  (use "git restore <file>..." to discard changes in working directory)
        modified:   README.md
no changes added to commit (use "git add" and/or "git commit -a")

What if now you decide the changes were a mistake and you don't want to keep them? The git restore command restores a file in the working directory to the version of the file in the most recent commit.

Let's try out restoring the README.md file:

git restore README.md

Now check the status again:

git status

The status shows that there are no longer any new changes to the README.md file:

On branch main
nothing to commit, working tree clean

There is no way to undo git restore, so be careful when you're using it.

In older versions of git, the command to restore a file in the working directory was git checkout -- FILENAME. You may occasionally see people mention this command online, and it still works in recent versions of git. However, the git restore command is preferable because it disambiguates what you're trying to do (git checkout is also used for other things unrelated to restoring files).

Restoring the staging area

With a different set of arguments, you can also use the git restore command to remove changes from the staging area.

Open up the README.md file in a text editor again and make some more changes. This time let's add the line Git is tough to the end of the file. As usual, save and close the file.

Next, add the changes to the staging area:

git add README.md

Then check the repository status with git status:

Changes to be committed:
  (use "git restore --staged <file>..." to unstage)
        modified:   README.md

git tells us our changes to README.md are staged.

Now suppose you decide you want to unstage the changes to README.md. This removes the changes from the staging area, but not from the working directory. Unstaging changes is especially useful when you're working with multiple files and accidentally add a file you don't want to commit yet.

To unstage a file, use git restore with the --staged argument:

git restore --staged README.md

Then check the status again:

Changes not staged for commit:
  (use "git add <file>..." to update what will be committed)
  (use "git restore <file>..." to discard changes in working directory)
        modified:   README.md
no changes added to commit (use "git add" and/or "git commit -a")

The changes to README.md are still in the working directory, but no longer in the staging area.

In older versions of git, the command to restore a file in the staging area was git reset FILENAME. You may occasionally see people mention this command online, and it still works in recent versions of git. However, the git restore command is preferable because it disambiguates what you're trying to do (git reset is also used for other things unrelated to restoring the staging area).

Reverting a commit

The git revert command is useful if you want to undo changes that you've already committed. Once a commit is in your repository's history, it generally stays there forever. There is a way to delete commits, but you should really only delete commits if you accidentally commit sensitive data. Instead, the idiomatic way to undo a commit is to create a new commit that reverses the changes.

Let's start by editing the README.md file again and making a commit. Add the line This is a big mistake. to the end of the file. Then run:

git add README.md
git commit -m "This commit is a mistake."

Now suppose you decide the commit is a mistake, and want to undo it. First, you need to find the ID of the commit you want to undo. The git log command opens a scrolling list of all commits in the repository's history and their IDs. Run the command:

git log

You can exit the log by typing q.

In the log, locate the the mistaken commit, and copy or remember the first 5 digits of its ID. Git is smart enough that it can generally recognize a commit from the first few digits of its ID, and will tell you if it needs more digits for disambiguation.

In my repo, the ID of the commit starts with e01d1. In your repo, the commit will likely have a different ID. Next, run this command, replacing the ID with the ID you copied:

git revert e01d1 --no-edit

The --no-edit flag tells git revert to generate the commit message for the new commit automatically. Without the flag, git revert will prompt you to enter a commit message.

Now inspect the file with nano or cat. You should see that the changes from your bad commit are gone. If you look in the log with git log, you'll also see a new commit to revert the changes of a previous commit.

Challenge: The goal of this challenge is to make a bad commit and then revert it. Work through these steps:

Change a file in your repository.
add and commit the changes.
Find the ID of the commit from step 2 in the repository history.
revert the commit from step 2.

Challenge solution

Change a file like `README.md`. Then run:

git add README.md
git commit -m "update README"

To find the ID of the commit, run:

git log

Then revert the commit:

git revert 07f61 --no-edit

Creating a branch from a previous commit

Another strategy to revert to a prior state is to create a new branch that starts at the state of a previous commit. This has the advantage of keeping a branch with the work you did on it, but allows you to continue to work in a different direction from a previous commit. To do this, you can use git log like above to identify which commit you want your new branch to start out at. Once you have the commit ID, run the following:

git checkout -b name-of-new-branch e01d1

You'll now be on a new branch that starts from the commit you indicated. To check, run git branch.

GitHub Goodies

Issues

GitHub issues started as a way to record problems with software and have since evolved into generalized project planning tools. Issues are a great way to keep track of to do items, have asynchronous conversations relevant to a repository, or otherwise deposit information relevant to a repository.

Releases

Releases bundle and deliver iterations of your project. When your entire repository reaches a consensus point, you can use GitHub to compress that version of files together. Releases integrate with zenodo to archive your repo and give it a DOI.

`.gitignore`

A .gitignore file is a hidden text file in you repository that you can use to "ignore" files that you don't want under version control. This is handy for really large data files that you don't want to add to your repo. To use a .gitignore file, create a .gitignore text file in the main directory of your repo and add paths to files that you would like to ignore. The .gitignore file accepts regular expressions (e.g. *.fastq.gz to ignore any file that ends in .fastq.gz).

Links to other goodies

Summary

Terms introduced:

Term	Definition
Git	a version control system
GitHub	an internet hosting service for version control using Git
repository	a collection of files and folders associated with a project that is under Git version control
local	accessed from your system; the computer you're working on
remote	stored on a remote computer; GitHub
commit	a snapshot of your files and folders at a given time
branch	an independent line of development in a repository
pull request	a mechanism to suggest changes from a branch be integrated back into the `main` branch
review	the process of reviewing changes in a branch through a pull request
issue	a tracking tool integrated into a GitHub repository

Commands introduced:

Command	Function
`git clone`	copies a Git repository from remote to local
`git add`	adds new or changed files in your working directory to the Git staging area
`git commit`	creates a commit of your repository
`git push`	uploads all local branch commits to the corresponding remote branch
`git checkout`	switches between local branches, or creates a new branch with the `-b` flag
`git branch`	reports what branch you're currently on and what other local branches exist
`git pull`	updates current local working branch and all of the remote tracking branches
`git restore`	restores a file to its state at the last commit or remove a file from the staging area
`git revert`	reverts changes that have already been committed

Hands on introduction to Git and GitHub

Version control

Version control with Git and GitHub

Setting up

Signing up for a GitHub account

Accessing a Unix shell & Git

First time setup and configuration

Getting started with version control

Creating your first repository on GitHub

Cloning a repository from remote to local

The basic Git workflow

Working on branches

Integrating changes into main using pull requests

Pulling changes from the remote repository back to the local repostory

Bringing the whole process together

Undoing Changes

Restoring a file to its state at the previous commit

Restoring the staging area

Reverting a commit

Creating a branch from a previous commit

GitHub Goodies

Issues

Releases

.gitignore

Links to other goodies

Summary

Terms introduced:

Commands introduced:

Integrating changes into `main` using pull requests

`.gitignore`