Introduction To Git and GitHub
Git is a version control system: software that helps you manage the different versions of your work. Version control
skills are rapidly becoming a prerequisite for all data scientists. Version control can improve teamwork among data
scientists by encouraging collaboration on projects, simplifying the exchange of work, and assisting other data scientists
in repeating the same or similar processes. It is usually helpful to have the option to undo changes or make modifications
to a branch before merging them into the active project, even if you are a data scientist working alone. This allows you to
ensure that your change won't break anything.
To summarise what Sajan explained, in the absence of a version control system, we tend to store files in a messy
way. You have probably saved a file with suffixes such as final1, final2, final_final or really_final. This seems like a
quick fix while you are working, but when you revisit the same directory after a few months, it becomes really difficult to
locate the right file and folder.
If you are working in collaboration, it becomes difficult to track who made the change, what change was made and when
it was made. Other than that, it is difficult to retrieve the files in your system in case of any crash. However, if you use a
version control system, such as Git, you can fix these problems.
A version control system, such as Git, helps you by taking care of the following:
1. Tracking changes
2. Ability to revert to any version
3. Finding what changes were made, when they were made and by whom
4. Easy recovery in case of any disaster
5. Improving the efficiency and agility of team projects
Some examples of version control systems are as follows:
Git
Mercurial
Perforce
BitKeeper
The distributed version control system takes up more storage space as every developer has the entire version
database. However, this does not pose a great problem in the versioning of code files, which are mostly text files and
require storage space of only a few kilobytes.
Difference Between Git and GitHub
Git is a distributed version control system. It is a tool to manage the history of your project source code. GitHub is
a web-based hosting service for Git repositories that enables you to showcase or share your projects and files with others.
A repository is a directory that contains your project work. All the files in the repository can be uploaded to
GitHub and shared with other people either publicly or privately.
Git was developed in 2005 and became very popular across the world as a mainstream software development tool.
It has been adopted by many Fortune 500 companies, such as Google, Facebook, Microsoft,
Twitter and LinkedIn.
As the popularity of Git increased, many repository hosting platforms came into existence, such as GitHub,
GitLab and Bitbucket. GitHub became the most popular repository hosting platform. It lets you
share your repository with the public and also provides a graphical user interface.
GitHub, which was launched in 2008, was widely adopted across several companies. As of 2021, it was being used
by 65+ million developers and 3+ million organizations, including 72% of Fortune 50 companies, and it hosted 200+ million repositories.
You can read more about .gitignore files here and licensing files here.
You configured Git with your account details on your local system. For the first-time Git configuration, you use the following
commands:
git config --global user.name <yourusername> # Using this, you will enter your GitHub username
git config --global user.email <youremail@example.com> # Using this, you will enter your email address
These commands are required only for the first-time configuration. For later instances, you do not need to run these
commands.
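As a minimal sketch of how these settings can be checked, the commands below set the same two values at the repository level (without --global) inside a throwaway folder and read them back. The name and email address are placeholder values, and the temporary folder is an assumption made so that your real configuration is not touched.

```shell
set -e
# Work in a throwaway repository so that no real settings are changed.
repo=$(mktemp -d)
cd "$repo"
git init -q

# Without --global, the values are stored for this repository only.
# "Example User" and the address below are placeholder values.
git config user.name "Example User"
git config user.email "user@example.com"

# Read the values back to confirm what Git will record in commits.
configured_name=$(git config user.name)
configured_email=$(git config user.email)
echo "$configured_name <$configured_email>"
```

Running 'git config user.name' with no value is a quick way to check which name Git will use before you make your first commit.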
ls - lists the files and directories in the current directory. On Windows, dir is used
instead of ls.
Additional Links
1. Refer to this link for more information on basic shell commands. A cheat sheet for basic shell commands can be
found here.
Which command is used to link the local folder to the GitHub repository?
git remote add origin <repository URL>
Using this command, you can add a new remote repository to your local repository.
Which shell command will you use to find the present working directory?
pwd
What will happen when you execute the command ‘git branch -m main’?
It will rename the current branch to ‘main’; it does not create another branch. If you
type git branch after executing the above command, you will notice that your current branch has
been changed from master to main.
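The renaming behaviour can be tried out safely in a throwaway repository. The sketch below is illustrative only: the temporary folder, the placeholder identity and the empty first commit are assumptions, and the initial branch is forced to 'master' so that the result does not depend on the default branch name configured on your machine.

```shell
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
# Force the initial branch name to 'master' for a predictable starting point.
git symbolic-ref HEAD refs/heads/master
git config user.name "Example User"       # placeholder identity
git config user.email "user@example.com"
git commit -q --allow-empty -m "Initial commit"

before=$(git symbolic-ref --short HEAD)   # branch name before the rename
git branch -m main                        # rename the current branch
after=$(git symbolic-ref --short HEAD)    # branch name after the rename

# Count the branches: a rename should leave exactly one.
branch_count=$(git branch | wc -l | tr -d ' ')
echo "before: $before, after: $after, branches: $branch_count"
```

The branch count stays at one, which shows that no additional branch was created.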
For configuration, you use the following commands:
git init
Initialises a Git repository in a particular folder.
git status
Shows the status of the files at that instant.
git branch
Lists the branches in the repository and marks the current one, which is the main branch at this point.
In its most basic form, machine learning is a set of instructions given to the computer to learn patterns from the data and
create its version of the pattern.
The diagram below shows the conceptual flow chart of a machine learning process.
As you can see from the image, data represents a sample of reality. From this sample of reality, a machine learning code
(also called a model) builds relationships between the data.
For instance, the Google Maps app on your phone builds a relationship between various data points, such as
the distance you want to travel, the traffic on the road, the vehicle you are using, and the time it will take
for you to arrive at your destination.
In simple words, a machine learning model is just a piece of code that can relate some input data points to the desired result.
The process of building any model is iterative, as shown in the image above.
Since building a machine learning model is complicated, ML engineers split the process into smaller steps. Typically the
ML processes can be divided into a few predefined segments.
1. Data Processing:
As the name suggests, this piece of code will deal with all the processing that needs to be done
on the data in its raw form. Later in the program, you will learn about the processes that need to
be done on raw data.
2. Model Building:
This piece of code will have instructions about extracting the patterns hidden in the data.
3. Model Evaluation:
This segment of code will have instructions about evaluating the pattern the model has learnt
from the data.
Apart from these, there are other popular segments, such as feature extraction and model retraining. At this point, you are not
expected to understand all the nuances being discussed. The important takeaways are:
1. The machine learning model is a set of instructions.
2. The machine learning code is usually split into multiple smaller code files.
Additional Reference
StatQuest video on explaining machine learning
Basic Git Commands - II
In any project, a file goes through three main stages:
1. Modified: When you change any code in the file.
2. Staged: When you add the file using ‘git add’, it goes into the staging area.
3. Committed: When you commit all the changes in the files that were in the staging area.
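The three stages can be observed with 'git status --short', which prints a two-character code in front of each changed file. The following is a minimal sketch in a throwaway repository; the folder, the placeholder identity and the file contents are assumptions made for illustration.

```shell
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example User"       # placeholder identity
git config user.email "user@example.com"

# Start from a committed file.
echo 'print("processing data")' > data_processing.py
git add data_processing.py
git commit -q -m "Add data processing script"

# Modified: the file differs from the last commit but is not staged yet.
echo 'print("processing data, second version")' > data_processing.py
modified_state=$(git status --short)

# Staged: 'git add' moves the change into the staging area.
git add data_processing.py
staged_state=$(git status --short)

# Committed: the staged change is recorded, so nothing is left to report.
git commit -q -m "Update data processing script"
committed_state=$(git status --short)

echo "modified:  [$modified_state]"
echo "staged:    [$staged_state]"
echo "committed: [$committed_state]"
```

In the short format, an 'M' in the second column marks an unstaged modification and an 'M' in the first column marks a staged one; an empty result means the working directory matches the last commit.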
We created a file called data_processing.py in our local repository ‘fraud_detection’. Then, we wrote some dummy code in
the data processing file and ran the following commands:
#Commands used
git add
git status
git commit -m "message"
git push -u origin master
git add <filename> or git add .
1. It is used to add modified files to the staging area. You can add a
specific file using the command 'git add <filename>'.
2. If you wish to add all modified and unstaged files present in the
workspace to the staging area, you can use the 'git add .' command.
git status
This command displays the state of the working directory and the
staging area. In other words, it lets you see the changes that have been
staged and the changes that have not been added to the staging area.
git commit -m "New commit message"
It records a new commit with the given message, containing all the changes sitting in the
staging area.
git push -u origin master or git push
It is used to upload your local commits, along with the files and changes that they contain,
to your remote repository on GitHub.
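The commands above can be strung together as in the following sketch. To keep it self-contained, a local bare repository stands in for the GitHub remote; that stand-in, the temporary folder, the placeholder identity and the file contents are all assumptions made for illustration, and with a real project 'origin' would point to your GitHub repository URL instead.

```shell
set -e
workdir=$(mktemp -d)
# A local bare repository plays the role of the GitHub remote (assumption).
git init -q --bare "$workdir/remote.git"

mkdir "$workdir/fraud_detection"
cd "$workdir/fraud_detection"
git init -q
git symbolic-ref HEAD refs/heads/master   # make sure the branch is called 'master'
git config user.name "Example User"       # placeholder identity
git config user.email "user@example.com"
git remote add origin "$workdir/remote.git"

echo 'print("processing data")' > data_processing.py
git add data_processing.py                     # stage the new file
git status --short                             # show what is staged
git commit -q -m "Add data processing script"  # record the staged changes
git push -q -u origin master                   # upload the commit to the remote

# Compare the latest commit ID on both sides.
local_commit=$(git rev-parse HEAD)
remote_commit=$(git --git-dir="$workdir/remote.git" rev-parse refs/heads/master)
echo "local:  $local_commit"
echo "remote: $remote_commit"
```

After the push, both sides report the same commit ID, which is how you can tell that the remote is up to date.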
Basic Git commands: click here (These commands will help you throughout the course in case you forget anything)
As shown in the above diagram:
‘git add’ puts the file that you are working on in the staging area, where Git keeps track of the files.
‘git commit’ transfers it to the local version database.
‘git push’ transfers it to the remote repository on GitHub.
Track File
Suppose that you create a new file named ‘model.py’ inside the Git repository. Will Git track it automatically?
No. Git will not automatically track a new file created inside the Git repository. You have to explicitly
ask Git to track a file. You can ask Git to track a newly added file named ‘model.py’ using the
following command:
git add model.py
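A quick way to see this for yourself is to compare the output of 'git status --short' before and after adding the file. This is a minimal sketch in a throwaway repository; the temporary folder and the file contents are assumptions.

```shell
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q

# A newly created file is not tracked until you add it.
echo 'print("model")' > model.py
untracked_state=$(git status --short)   # '??' marks an untracked file

git add model.py
tracked_state=$(git status --short)     # 'A' marks a newly added, staged file

echo "before git add: $untracked_state"
echo "after git add:  $tracked_state"
```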
git log:
This command shows you the commit details. It lists the commits made in the repository in reverse
chronological order, that is, the most recent commits show up first. It shows commits with the following details:
The commit ID or SHA
Author’s name (who made the commit)
Date and time
The commit message
For a shorter version of git log, you have git log --oneline
git log --oneline → shows each commit made so far on a single line, in a concise manner
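The difference between the two forms can be seen in a small throwaway repository with two commits. This is an illustrative sketch; the folder, the placeholder identity, the file names and the commit messages are assumptions.

```shell
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example User"       # placeholder identity
git config user.email "user@example.com"

echo 'print("processing data")' > data_processing.py
git add data_processing.py
git commit -q -m "Add data processing script"

echo 'print("model")' > model.py
git add model.py
git commit -q -m "Add model script"

git log                                  # full details for every commit
short_log=$(git log --oneline)           # one line per commit, newest first
echo "$short_log"

commit_count=$(echo "$short_log" | wc -l | tr -d ' ')
newest_message=$(git log -1 --format=%s) # message of the most recent commit
```

Each line of the short form starts with an abbreviated commit ID followed by the commit message.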
‘git revert’ is used to undo the changes made by a commit, typically the most recent one, and ‘git reset’ is used to go back to any version.
‘git revert’ creates a new commit that reverses those changes, so the existing history is kept.
‘git reset’ moves the branch back to the commit ID that is mentioned, and all the commits after it are removed from the branch.
You need to be very careful about using ‘git reset’, as you will lose all the commits that you made after the desired
commit.
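The contrast between the two commands can be sketched in a throwaway repository. The notes do not say which mode of 'git reset' was used, so the '--hard' option below is an assumption, chosen because it also restores the files to the target commit. The folder, the placeholder identity and the file contents are likewise illustrative.

```shell
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example User"       # placeholder identity
git config user.email "user@example.com"

# Three commits, each adding one line to model.py.
echo 'print("step 1")' > model.py
git add model.py
git commit -q -m "Step 1"
first_commit=$(git rev-parse HEAD)        # remember the ID of the first commit
echo 'print("step 2")' >> model.py
git commit -q -am "Step 2"
echo 'print("step 3")' >> model.py
git commit -q -am "Step 3"

# git revert: adds a NEW commit that undoes the latest one.
git revert --no-edit HEAD > /dev/null
commits_after_revert=$(git rev-list --count HEAD)
lines_after_revert=$(wc -l < model.py | tr -d ' ')

# git reset --hard: moves the branch back and drops the later commits.
git reset -q --hard "$first_commit"
commits_after_reset=$(git rev-list --count HEAD)
lines_after_reset=$(wc -l < model.py | tr -d ' ')

echo "after revert: $commits_after_revert commits, $lines_after_revert lines"
echo "after reset:  $commits_after_reset commits, $lines_after_reset lines"
```

The revert leaves four commits in the history (the three originals plus the reverting commit), whereas the reset leaves only the first one.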
Although ‘git reset’ and ‘git revert’ are good tools for getting back to previous commits, it is not the best practice in the
industry to experiment on stable code.
So, you use branching, wherein each individual developer can experiment on the code separately, and after successful
experimentation, they can merge their branches to the master branch.
Branching
The trunk in this image plays the role of a master branch and the branches coming out of the trunk represent the
branches in Git.
In machine learning projects, after preprocessing the data, you proceed to the modelling step. In this process, you may
not want to disturb the master branch with unnecessary experimentation, so you will create another branch named
‘model’.
To create a new branch and switch to it, we used the following command:
git checkout -b <new branch name>
The checkout command can also be used to switch branches. The syntax to switch to another branch is as follows:
git checkout <destination branch name>
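As a minimal sketch of these two commands, the example below creates a 'model' branch, commits a file on it and switches back to master. The temporary folder, the placeholder identity and the file contents are assumptions; the initial branch is forced to 'master' to match the naming used in these notes.

```shell
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/master   # make sure the branch is called 'master'
git config user.name "Example User"       # placeholder identity
git config user.email "user@example.com"

echo 'print("processing data")' > data_processing.py
git add data_processing.py
git commit -q -m "Add data processing script"

# Create the 'model' branch and switch to it in one step.
git checkout -q -b model
on_new_branch=$(git symbolic-ref --short HEAD)

# Experiment on the new branch only.
echo 'print("model")' > model.py
git add model.py
git commit -q -m "Add model script"

# Switch back: the master branch is untouched by the experiment.
git checkout -q master
on_master=$(git symbolic-ref --short HEAD)
if [ -e model.py ]; then file_on_master=yes; else file_on_master=no; fi

echo "created: $on_new_branch, now on: $on_master, model.py on master: $file_on_master"
```

Because model.py was committed only on the 'model' branch, it disappears from the working directory when you switch back to master and reappears when you check out 'model' again.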
Pull request
To merge it to the main branch, we used the GitHub UI to create a pull request.
A pull request is a request to merge one branch into another. In our case, there were no conflicts and the status was shown as
‘able to merge’.
Once the pull request was created, we went to the pull requests tab and merged the two branches.
Once you have merged with the master branch, try the following exercise to understand how to resolve conflicts:
1. Make changes to the existing code of the model.py file. Example: change one of the print statements.
2. Then, create a pull request.
3. Does it show ‘able to merge’ now or does it show ‘there is some conflict’?
It will show that there is some conflict because the two branches contain different code in the same part of the same file. You can resolve the conflicts and
then merge the branches. You can learn more about merge conflicts here.
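The session used the GitHub UI for merging, but the same kind of conflict can be reproduced locally with 'git merge', which is the substitute used in the sketch below. It assumes a throwaway repository in which both branches change the same print statement; the folder, the placeholder identity and the file contents are illustrative.

```shell
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/master   # make sure the branch is called 'master'
git config user.name "Example User"       # placeholder identity
git config user.email "user@example.com"

echo 'print("training model")' > model.py
git add model.py
git commit -q -m "Add model script"

# Change the print statement on the 'model' branch...
git checkout -q -b model
echo 'print("training model on the model branch")' > model.py
git commit -q -am "Change print statement on model branch"

# ...and change the same line differently on master.
git checkout -q master
echo 'print("training model on master")' > model.py
git commit -q -am "Change print statement on master"

# The merge cannot be completed automatically.
if git merge model > /dev/null 2>&1; then merge_result=clean; else merge_result=conflict; fi
conflicted_files=$(git diff --name-only --diff-filter=U)
echo "merge result: $merge_result, conflicted files: $conflicted_files"

# Resolve by writing the agreed content, then stage and commit the merge.
echo 'print("training model, agreed version")' > model.py
git add model.py
git commit -q -m "Merge branch 'model' after resolving the conflict"
remaining_conflicts=$(git diff --name-only --diff-filter=U)
```

While the conflict is unresolved, Git marks the competing versions inside model.py with '<<<<<<<', '=======' and '>>>>>>>' lines; resolving the conflict means replacing that marked section with the content you want to keep.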
To summarise, you learnt that there are situations where you may want to develop an existing project code in parallel,
without making any changes to your initial/original branch. You can accomplish this goal by creating different branches
based on your need (that is, creating a branch per team member or a branch for every new feature). Each branch
starts as a copy of the initial/original branch of the project source code.
Summary
In this session on the ‘Introduction to Version Control and Git’, you learnt about the following:
The issues that you would face if you did not use version control to track the different changes and versions of your
project files. Some of these issues are as follows:
o Suppose you are working on some code and, after making some changes, you realise that you have
messed up the code, and now, you would like to revert to the last good version of your project.
o Coordinating the changes that are being made to the project between you and
fellow developers
How version control comes to your rescue if you have a huge file and you want to keep track of all the changes in
your file
The three types of version control systems (VCS):
o Local
o Centralized
o Distributed
Git and why it is preferred over other distributed version control systems
GitHub and how you can send your changes from your local system to your remote repository on GitHub
How to use a set of commands on the command line to push all your code from your local system to your GitHub
remote repository
How to revert to previous stable versions of code
How to create a branch for experimentation and merge it back to the master branch