Python Vibe Coding Course — Part 1
Why Python? What is environment setup?
Python is the de facto standard language for research data analysis. It is free to use and gives you access to a vast ecosystem of libraries shared by researchers worldwide. This course covers everything from setting up a Python environment, to LLM-assisted coding, to publishing reproducible analyses alongside your papers — explained step by step for beginners. In Part 1, we start by establishing the foundational concepts: what Python is, what environment setup means, why you should build your environment locally, and why reproducibility matters.
There are many tools available for research data analysis — R, MATLAB, SAS, and more. Python stands out for three key reasons.
Python is also designed to be readable — its syntax is close to plain English, making it an excellent entry point for programming beginners.
The general flow of data analysis in Python looks like this:
Think of it like writing a formula in an Excel cell to get a result. The difference is that Python lets you write out the entire procedure — not just formulas, but the complete sequence of steps. For example, "read all CSVs in a folder, compute the mean, plot a chart, and save it as an image" can all be written as a reusable, repeatable script.
This is what enables reproducibility. Manual steps done in Excel are hard to recall later, but Python code is the procedure itself — it serves as its own documentation.
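As a rough illustration of "the code is the procedure", here is a minimal sketch of the first half of that example: reading every CSV in a folder, computing a mean, and saving the result. It uses only Python's standard library so it runs without installing anything. The function names, the folder name `data`, and the column name `yield` are assumptions made up for this sketch. Plotting is left out because it would need a library such as matplotlib.

```python
import csv
from pathlib import Path
from statistics import mean


def column_means(folder, column):
    """Return {file name: mean of `column`} for every CSV file in `folder`."""
    results = {}
    for csv_path in sorted(Path(folder).glob("*.csv")):
        with csv_path.open(newline="") as handle:
            values = [float(row[column]) for row in csv.DictReader(handle)]
        if values:  # skip files that have a header but no data rows
            results[csv_path.name] = mean(values)
    return results


def save_summary(means, output_path):
    """Write the per-file means to a new CSV so the result is kept on disk."""
    with Path(output_path).open("w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["file", "mean"])
        for name, value in means.items():
            writer.writerow([name, value])
```

Calling `save_summary(column_means("data", "yield"), "summary.csv")` would repeat the same steps every time, which is the property that manual spreadsheet work lacks.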
A code editor is a dedicated application for writing code. While you could write code in Notepad or Word, a code editor offers powerful features such as:
This course uses VSCode (Visual Studio Code), a free code editor from Microsoft. It is fast, widely used by programmers and researchers worldwide, and, with its Python and Jupyter extensions installed, handles the Python tasks in this course: editing code, managing virtual environments, and running .ipynb files.
We will cover how to install and get started with VSCode in Part 2. For now, think of it as the central app where you write and run your code.
Out of the box, Python covers only general-purpose basics. Features for reading data, running machine learning, and drawing charts are provided separately as libraries.
A library is a collection of useful functionality created and shared by someone else. Python has hundreds of thousands of libraries contributed by researchers and engineers around the world. Once installed, you can call them from your code. For example, instead of writing a regression model from scratch, you can install scikit-learn and train and run a machine learning model in just a few lines.
scikit-learn: Machine learning (regression, classification, clustering, etc.)
pandas: Tabular data (reading, aggregating, and reshaping Excel/CSV)
numpy: Numerical computation and matrix operations
matplotlib / seaborn: Charts and figures
openpyxl: Reading and writing Excel files
scipy: Scientific computing and statistics
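In your code, an import line such as import sklearn makes an installed library available. In practice you usually import only the part you need, for example from sklearn.linear_model import LinearRegression. Installation takes a single command: pip install scikit-learn. Note that the name you install (scikit-learn) can differ from the name you import (sklearn).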
Each library has a version (e.g., scikit-learn 1.4.0), and behavior can differ between versions. This is why "environment setup" — explained next — is important.
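To see which version of a library is actually installed, one option is the standard-library module `importlib.metadata`. The sketch below is only an illustration. The helper names `installed_version` and `report_versions` are made up here, and the lookup expects the install name (for example `scikit-learn`), not the import name (`sklearn`).

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package_name):
    """Return the installed version string of a package, or None if it is missing."""
    try:
        return version(package_name)
    except PackageNotFoundError:
        return None


def report_versions(package_names):
    """Build one readable line per package, e.g. for pasting into lab notes."""
    lines = []
    for name in package_names:
        found = installed_version(name)
        lines.append(f"{name}: {found if found is not None else 'not installed'}")
    return lines
```

For example, `print("\n".join(report_versions(["scikit-learn", "pandas"])))` lists what the current environment provides.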
To run Python code, you need to prepare a "Python execution environment" on your PC. This process is called environment setup. It involves four steps:
1. Create a project folder (e.g., my_first_analysis on your Desktop).
2. Create a virtual environment (python -m venv .venv). This creates a .venv folder, a self-contained box for your project.
3. Install libraries (pip install scikit-learn pandas) inside the virtual environment.
4. Activate the virtual environment (.venv\Scripts\activate on Windows). When you run code with the environment active, that box's Python and libraries are used. In your code, you call them with import sklearn. The environment also needs to be active when you install libraries, so that pip installs into this box rather than into the system Python.

The reason for the virtual environment step is that different projects use different libraries and versions. Installing everything directly into the system Python causes conflicts between projects. A virtual environment is a project-specific isolated box that does not affect other projects.
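For readers curious about what the virtual environment step does, the same thing can be sketched from Python itself with the standard-library `venv` module. This is an illustration only, not the workflow the course uses. The function names are made up for this sketch. In day-to-day work the command-line form `python -m venv .venv` is the usual route.

```python
import sys
import venv
from pathlib import Path


def create_project_env(project_dir, with_pip=True):
    """Create <project_dir>/.venv, roughly what `python -m venv .venv` does."""
    env_dir = Path(project_dir) / ".venv"
    # with_pip=True also installs pip into the new environment, so that
    # `pip install ...` works there once the environment is activated.
    venv.create(env_dir, with_pip=with_pip)
    return env_dir


def running_in_virtual_env():
    """True when the current interpreter belongs to a virtual environment."""
    return sys.prefix != sys.base_prefix
```

`running_in_virtual_env()` is a quick way to confirm that activation worked before installing anything.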
Imagine setting up a dedicated "work room" for each project inside your PC. Each room has its own Python and library set, independent of the others. Project A uses scikit-learn 1.4.0 and Project B uses scikit-learn 0.24, yet they coexist on the same machine without conflict. This kind of isolation is sometimes loosely described as a "sandbox".
The key point is that code runs correctly only when the right Python version and library versions are all aligned. Code written for scikit-learn 1.4 may not work on scikit-learn 0.24, and features available in Python 3.12 may not be available in Python 3.8.
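One way to make a version assumption explicit is to check it at the top of a script. The following is a minimal sketch. The function names are hypothetical, and the minimum version you pass in is whatever your own analysis needs.

```python
import sys


def python_is_at_least(major, minor):
    """True when the running interpreter is at least version major.minor."""
    return sys.version_info >= (major, minor)


def require_python(major, minor):
    """Stop early with a readable message instead of failing mid-analysis."""
    if not python_is_at_least(major, minor):
        current = f"{sys.version_info.major}.{sys.version_info.minor}"
        raise RuntimeError(
            f"This analysis expects Python {major}.{minor} or newer, "
            f"but it is running on Python {current}."
        )
```

Placing `require_python(3, 12)` as the first line of an analysis would turn a confusing error deep in the code into a clear message at the start.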
"It worked on my machine but not on yours" and "the results differ even with the same code" — these common problems almost always come down to version mismatches. That is why creating a virtual environment and managing versions is so important. The requirements.txt approach covered later ensures others can match your versions exactly.
There are two main ways to use Python: cloud-based (e.g., Google Colab, running in a browser) and local (building an environment on your own PC). This course uses the local approach for the following reasons:
Conversely, if your PC has limited performance or you need large-scale GPU computation, cloud-based options like Google Colab or university servers may be more suitable. The local skills you learn here also apply when using cloud environments.
In recent years, more journals have begun requiring the publication of raw data and analysis code. Examples include PLOS ONE (Data Availability Statement required), Nature-family journals (Code & Data Availability requested), and eLife (open data policy). The trend is spreading to agronomy journals such as Field Crops Research as well.
Setting up a local Python environment makes it very easy to meet these requirements. You only need to organize your project folder like this and upload it to Zenodo or GitHub:
project/
├── .venv/ # Virtual environment (created locally; usually not uploaded)
├── data/ # Raw data (Excel, CSV, etc.)
├── notebooks/ # Analysis scripts (.ipynb)
├── results/ # Output figures and tables
├── requirements.txt # Library list for environment reproduction
└── README.md # Usage instructions
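If you would rather not create these folders by hand, a short script can do it. This is a sketch under the assumption that empty placeholder files are acceptable starting points. The function name is made up here, and `.venv` is deliberately left out because it is created separately with `python -m venv .venv`.

```python
from pathlib import Path

# Folder and file names taken from the layout shown above.
PROJECT_FOLDERS = ("data", "notebooks", "results")
PROJECT_FILES = ("requirements.txt", "README.md")


def create_project_skeleton(root):
    """Create the folders and empty files from the layout; keep existing ones."""
    root = Path(root)
    for folder in PROJECT_FOLDERS:
        (root / folder).mkdir(parents=True, exist_ok=True)
    for file_name in PROJECT_FILES:
        # touch() creates the file if needed and leaves existing content alone
        (root / file_name).touch(exist_ok=True)
    return sorted(path.name for path in root.iterdir())
```

Running it a second time is harmless: existing folders and files are kept as they are.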
With this folder set, a third party can reproduce the exact same analysis by downloading it, creating a virtual environment, installing the dependencies, and running the script. This level of reproducibility is difficult to achieve with manual Excel workflows.
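A third party can also check that their freshly built environment matches yours. The sketch below compares pinned `name==version` lines, such as those in a requirements.txt, with what is installed. It is a simplified illustration. The function names are hypothetical, and real requirements files can contain other syntax (version ranges, extras, options) that this parser ignores.

```python
from importlib.metadata import PackageNotFoundError, version


def parse_pins(requirements_text):
    """Read `name==version` lines; skip blanks, comments, and unpinned entries."""
    pins = {}
    for raw_line in requirements_text.splitlines():
        line = raw_line.split("#", 1)[0].strip()  # drop trailing comments
        if "==" in line:
            name, pinned = line.split("==", 1)
            pins[name.strip()] = pinned.strip()
    return pins


def _installed(name):
    """Installed version of a package, or None when it is not installed."""
    try:
        return version(name)
    except PackageNotFoundError:
        return None


def find_mismatches(pins, lookup=_installed):
    """Return {name: (pinned, installed)} for every package that differs."""
    mismatches = {}
    for name, pinned in pins.items():
        installed = lookup(name)
        if installed != pinned:
            mismatches[name] = (pinned, installed)
    return mismatches
```

An empty result from `find_mismatches(parse_pins(text))`, where `text` is the content of requirements.txt, suggests the library versions line up. It does not check the Python version itself.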
From Part 2 onward, you will put all of the concepts explained here into practice. The course follows this flow:
The estimated time for Parts 2–5 combined is 3–5 hours. You can work through it in one day or split it by part. When you are ready, move on to Part 2.