Part 1: Introduction — Why Python? What is environment setup?

Python is the de facto standard language for research data analysis. It is free to use and gives you access to a vast ecosystem of libraries shared by researchers worldwide. This course covers everything from setting up a Python environment, to LLM-assisted coding, to publishing reproducible analyses alongside your papers — explained step by step for beginners. In Part 1, we start by establishing the foundational concepts: what Python is, what environment setup means, why you should build your environment locally, and why reproducibility matters.

01 Why use Python?

There are many tools available for research data analysis — R, MATLAB, SAS, and more. Python stands out for three key reasons.

Free and open to everyone: It is open-source (the source code is public and anyone can use it for free) and does not depend on commercial licenses or institutional subscriptions.
Increasingly the standard across research fields: Python-based tools are available for almost every research domain, including machine learning, image analysis, statistics, and geospatial analysis. Code published alongside papers for reproducibility is often written in Python.
The language LLMs excel at: Because Python is open-source and widely used, enormous amounts of code, tutorials, and Q&A posts are publicly available. LLMs such as ChatGPT and Claude are trained on this kind of public material, so they tend to generate Python code more reliably than code in less widely used languages.

Python is also designed to be readable — its syntax is close to plain English, making it an excellent entry point for programming beginners.

02 What does "writing and running code" mean?

The general flow of data analysis in Python looks like this:

Write Python instructions (code) in a code editor
Pass that code to the Python interpreter — the "execution engine"
The interpreter reads the instructions line by line and performs calculations, reads files, draws charts, and so on
Results are displayed on screen or saved as files

Think of it like writing a formula in an Excel cell to get a result. The difference is that Python lets you write out the entire procedure — not just formulas, but the complete sequence of steps. For example, "read all CSVs in a folder, compute the mean, plot a chart, and save it as an image" can all be written as a reusable, repeatable script.
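As a minimal sketch of such a script, the example below reads every CSV file in a folder and computes the mean of one column per file. To stay free of extra installations it uses only Python's standard library and leaves out the plotting step, which would typically rely on a library such as matplotlib. The folder name data, the column name yield, and the function name mean_per_csv are illustrative assumptions, not part of the course material.

```python
import csv
from pathlib import Path
from statistics import mean


def mean_per_csv(folder, column):
    """Return {file name: mean of `column`} for every CSV file in `folder`."""
    results = {}
    for csv_path in sorted(Path(folder).glob("*.csv")):
        with csv_path.open(newline="", encoding="utf-8") as handle:
            # DictReader treats the first row as the column names.
            values = [float(row[column]) for row in csv.DictReader(handle)]
        if values:  # skip files that contain only a header row
            results[csv_path.name] = mean(values)
    return results


if __name__ == "__main__":
    # "data" and "yield" are hypothetical names used for illustration.
    for name, value in mean_per_csv("data", "yield").items():
        print(f"{name}: mean = {value:.2f}")
```

Because the whole procedure lives in one file, rerunning it on new data is a matter of running the same script again rather than repeating manual steps.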

This is what enables reproducibility. Manual steps done in Excel are hard to recall later, but Python code is the procedure itself — it serves as its own documentation.

03 What is a code editor? (VSCode)

A code editor is a dedicated application for writing code. While you could write code in Notepad or Word, a code editor offers powerful features such as:

Syntax highlighting (instructions and strings shown in different colors for readability)
Autocomplete (suggests completions as you type)
Highlighting of error locations
An integrated terminal (command input panel) within the same window
Extensibility through plugins

This course uses VSCode (Visual Studio Code), a free code editor from Microsoft. It is fast, widely used by programmers and researchers worldwide, and handles everything Python-related: editing code, managing virtual environments, and running .ipynb (Jupyter notebook) files.

We will cover how to install and get started with VSCode in Part 2. For now, think of it as the central app where you write and run your code.

04 What is a library?

Out of the box, Python covers only general-purpose basics. Specialized features for working with tabular data, running machine learning, and drawing charts are provided separately as libraries.

A library is a collection of useful functionality created and shared by someone else. Python has hundreds of thousands of libraries contributed by researchers and engineers around the world. Once installed, you can call them from your code. For example, instead of writing a regression model from scratch, you can install scikit-learn and train and run a machine learning model in just a few lines.

Commonly used libraries in research

scikit-learn: Machine learning (regression, classification, clustering, etc.)
pandas: Tabular data (reading, aggregating, and reshaping Excel/CSV)
numpy: Numerical computation and matrix operations
matplotlib / seaborn: Charts and figures
openpyxl: Reading and writing Excel files
scipy: Scientific computing and statistics

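In your code, one line like import sklearn makes the library available. In practice you usually import just the part you need, for example from sklearn.linear_model import LinearRegression. Installation takes a single command: pip install scikit-learn. Note that the install name (scikit-learn) and the import name (sklearn) are not the same.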

Each library has a version (e.g., scikit-learn 1.4.0), and behavior can differ between versions. This is why "environment setup" — explained next — is important.
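One way to see which version of a library is installed is to ask Python directly. The sketch below uses the standard-library importlib.metadata module; the helper name installed_version and the list of package names are assumptions chosen for illustration.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package_name):
    """Return the installed version of a package, or None if it is not installed."""
    try:
        return version(package_name)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    # Use the install names (as given to pip), e.g. "scikit-learn", not "sklearn".
    for name in ["scikit-learn", "pandas", "numpy"]:
        print(name, installed_version(name) or "not installed")
```

The printed versions depend entirely on what is installed in the Python that runs the script, which is exactly why the environment matters.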

05 What is environment setup?

To run Python code, you need to prepare a "Python execution environment" on your PC. This process is called environment setup. It involves four steps:

Create a project folder (e.g., my_first_analysis on your Desktop)
Create a "virtual environment" inside that folder (command: python -m venv .venv). This creates a .venv folder — a self-contained box for your project.
Activate the virtual environment (command: .venv\Scripts\activate on Windows, or source .venv/bin/activate on macOS/Linux). While the environment is active, that box's Python and libraries are used.
Install the libraries you need at the required versions inside the active virtual environment (e.g., pip install scikit-learn pandas). In your code, you then call them with import sklearn.
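If you are ever unsure whether the virtual environment is actually in use, a short check from Python can help. This is a small sketch, not part of the course's official steps; the function name in_virtual_environment is an assumption.

```python
import sys


def in_virtual_environment():
    """Return True when the running Python belongs to a virtual environment."""
    # Inside a venv, sys.prefix points at the environment folder, while
    # sys.base_prefix still points at the Python it was created from.
    return sys.prefix != sys.base_prefix


if __name__ == "__main__":
    print("Interpreter:", sys.executable)
    print("Virtual environment active:", in_virtual_environment())
```

When the environment is active, the interpreter path should point inside the project's .venv folder.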

The reason for the virtual environment step is that different projects use different libraries and versions. Installing everything directly into the system Python causes conflicts between projects. A virtual environment is a project-specific isolated box that does not affect other projects.

How to think about virtual environments

Imagine setting up a dedicated "work room" for each project inside your PC. Each room has its own Python and library set, independent of the others. Project A can use scikit-learn 1.4.0 while Project B uses scikit-learn 0.24, and both coexist on the same machine without conflict. This kind of isolation is sometimes informally called a "sandbox".

The key point is that code runs correctly only when the right Python version and library versions are all aligned. Code written for scikit-learn 1.4 may not work on scikit-learn 0.24, and features available in Python 3.12 may not be available in Python 3.8.
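A script can guard against an unsuitable Python version by checking it at startup. The sketch below assumes a hypothetical minimum of Python 3.8; both that number and the function name python_is_supported are illustrative only.

```python
import sys

# Hypothetical minimum version for this project; adjust to your own needs.
MINIMUM_PYTHON = (3, 8)


def python_is_supported(minimum=MINIMUM_PYTHON):
    """Return True if the running interpreter is at least `minimum`."""
    return sys.version_info >= minimum


if __name__ == "__main__":
    current = ".".join(str(part) for part in sys.version_info[:3])
    required = ".".join(str(part) for part in MINIMUM_PYTHON)
    if python_is_supported():
        print(f"Python {current} is new enough (need {required} or later).")
    else:
        print(f"Python {current} is too old (need {required} or later).")
```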

"It worked on my machine but not on yours" and "the results differ even with the same code" are common problems that very often come down to version mismatches. That is why creating a virtual environment and managing versions is so important. The requirements.txt approach covered later lets others match your versions exactly.
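To make the idea of a version mismatch concrete, the sketch below compares pinned lines of the form name==version against what is installed in the current environment. It is a simplified illustration under stated assumptions: it only understands exact pins, compares version strings literally (so 1.4 and 1.4.0 count as different), and the function name find_version_mismatches is hypothetical. The course's own requirements.txt workflow is covered later and may differ.

```python
from importlib.metadata import PackageNotFoundError, version


def find_version_mismatches(requirement_lines):
    """Compare 'name==version' lines against what is installed.

    Returns a list of (name, required, installed) tuples, where installed
    is None when the package is missing. Lines without '==' are ignored.
    """
    mismatches = []
    for line in requirement_lines:
        line = line.split("#", 1)[0].strip()  # drop comments and whitespace
        if "==" not in line:
            continue
        name, required = (part.strip() for part in line.split("==", 1))
        try:
            installed = version(name)
        except PackageNotFoundError:
            installed = None
        if installed != required:
            mismatches.append((name, required, installed))
    return mismatches
```

For example, passing the lines of a requirements.txt file (read with Path("requirements.txt").read_text().splitlines()) returns an empty list when every pinned library matches.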

06 Why build a local environment?

There are two main ways to use Python: cloud-based (e.g., Google Colab, running in a browser) and local (building an environment on your own PC). This course uses the local approach for the following reasons:

Your data stays private — No need to upload unpublished research data to an external server.
Full use of your PC's performance — Analysis speed depends on your machine. If your PC is powerful, you use it to its fullest.
Intuitive folder and file management — You work with files the same way you always do in Explorer (Windows) or Finder (Mac).
Easy to package for reproducibility — As explained later, sharing is as simple as sharing the whole folder.

Conversely, if your PC has limited performance or you need large-scale GPU computation, cloud-based options like Google Colab or university servers may be more suitable. The local skills you learn here also apply when using cloud environments.

07 The benefits of reproducibility in research

In recent years, more journals have begun requiring the publication of raw data and analysis code. Examples include PLOS ONE (Data Availability Statement required), Nature-family journals (Code & Data Availability requested), and eLife (open data policy). The trend is spreading to agronomy journals such as Field Crops Research as well.

Setting up a local Python environment makes it very easy to meet these requirements. You only need to organize your project folder like this and upload it to Zenodo or GitHub:

Example folder structure as a reproducibility package

project/
├── .venv/            # Virtual environment (machine-specific; others recreate it from requirements.txt)
├── data/             # Raw data (Excel, CSV, etc.)
├── notebooks/        # Analysis notebooks (.ipynb)
├── results/          # Output figures and tables
├── requirements.txt  # Library list for environment reproduction
└── README.md         # Usage instructions

With this folder set, a third party can reproduce the exact same analysis by downloading it, creating a virtual environment, installing the dependencies, and running the script. This level of reproducibility is difficult to achieve with manual Excel workflows.
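A layout like this can be created by hand in Explorer or Finder, but it can also be scripted. The following sketch builds the folders and empty placeholder files with the standard-library pathlib module. The function name create_project_skeleton is an assumption, and the .venv folder is deliberately left out because it is produced by the python -m venv command.

```python
from pathlib import Path

SUBFOLDERS = ["data", "notebooks", "results"]
PLACEHOLDER_FILES = ["requirements.txt", "README.md"]


def create_project_skeleton(root):
    """Create the project folders and placeholder files, keeping existing content."""
    root = Path(root)
    for folder in SUBFOLDERS:
        (root / folder).mkdir(parents=True, exist_ok=True)
    for file_name in PLACEHOLDER_FILES:
        # touch() creates an empty file but leaves existing contents untouched.
        (root / file_name).touch(exist_ok=True)
    return sorted(path.name for path in root.iterdir())
```

Calling create_project_skeleton("project") returns the names it finds in the folder afterwards, which makes it easy to confirm the layout.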

Overview of this course

From Part 2 onward, you will put all of the concepts explained here into practice. The course follows this flow:

Part 2 Setup → Part 3 Vibe Coding → Part 4 Reproducibility → Part 5 Publishing

Part 2: Environment Setup — Install VSCode, create a folder and virtual environment. Experience everything from Hello World to outputting basic statistics.
Part 3: Vibe Coding with LLM — Ask ChatGPT to write your code. Error handling and iteration tips included.
Part 4: Ensuring Reproducibility — Make your environment reproducible by others.
Part 5: Publishing with Your Paper — Organize your folder, upload to Zenodo and GitHub, and add a link in your paper's Appendix.

The estimated time for Parts 2–5 combined is 3–5 hours. You can work through it in one day or split it by part. When you are ready, move on to Part 2.
