How to extract and analyze a code base with LLMs
When I need to understand a code base, I now use LLMs to quickly get an overview - here's how.
In this article, we will explore how to use PyLLMCore and LLM Components to extract and analyze a code base. The goal is to identify the high-level architecture of an undocumented code base.
We will use the advanced-stack/llm-components repository as an example.
Prerequisites
- Python 3.8 or higher
- PyLLMCore library
- LLM Components library
Step 1: Install required libraries
First, we need to install the required libraries. You can do this by running the following command in your terminal:
pip install py-llm-core llm-components
Step 2: Define data classes
We will define data classes to hold the extracted information. These classes will help us structure the data in a meaningful way.
from dataclasses import dataclass
from textwrap import fill  # used by to_markdown() to wrap long descriptions
from typing import List


@dataclass
class LowLevelModule:
    name: str
    description: str


@dataclass
class HighLevelModule:
    name: str
    description: str
    sub_modules: List[LowLevelModule]


@dataclass
class SoftwareArchitecture:
    # Unannotated attributes are plain class attributes, not dataclass fields
    system_prompt = "You are a software architect"
    prompt = """
    Code base:
    {code_base}
    ----
    Analyze this code base and carefully write a description of
    the software architecture.
    """

    name: str
    description: str
    modules: List[HighLevelModule]

    def to_markdown(self) -> str:
        # Title and wrapped description first, then one section per module
        lines = [
            f"# {self.name}\n",
            f"\n{fill(self.description, width=60)}\n\n",
        ]
        for module in self.modules:
            lines.append(f"## {module.name}\n\n")
            lines.append(
                f"**Description:** {fill(module.description, width=60)}\n\n"
            )
            lines.append("### Sub-modules\n\n")
            for sub_module in module.sub_modules:
                lines.append(f"**{sub_module.name}**\n\n")
                lines.append(f"{fill(sub_module.description, width=60)}\n\n")
        return "".join(lines)
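To make the nesting concrete, here is a small hand-built example of the same three-level shape, serialised with `dataclasses.asdict`. The classes are trimmed copies (no prompts, no `to_markdown`) so the snippet runs on its own, and the project name, module names and descriptions are made up purely for illustration.

```python
from dataclasses import asdict, dataclass
from typing import List


# Trimmed copies of the classes above so this snippet runs on its own
@dataclass
class LowLevelModule:
    name: str
    description: str


@dataclass
class HighLevelModule:
    name: str
    description: str
    sub_modules: List[LowLevelModule]


@dataclass
class SoftwareArchitecture:
    name: str
    description: str
    modules: List[HighLevelModule]


# Hypothetical values, only here to show the shape of the data
architecture = SoftwareArchitecture(
    name="Demo project",
    description="A toy project used to illustrate the structure.",
    modules=[
        HighLevelModule(
            name="Loaders",
            description="Reads input data.",
            sub_modules=[LowLevelModule("files.py", "Reads files from disk.")],
        )
    ],
)

# asdict() recurses into nested dataclasses and lists
as_dict = asdict(architecture)
print(as_dict["modules"][0]["sub_modules"][0]["name"])  # prints: files.py
```

This nested dictionary is roughly the structure we expect the model to fill in: one architecture, a list of high-level modules, and a list of sub-modules under each.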
Step 3: Clone the repository
We will clone the advanced-stack/llm-components repository to a temporary directory.
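import tempfile
from pathlib import Path

# Assumed import path: the example output below lists git_utils.py under the loaders
from llm_components.loaders.git_utils import clone_repository

def main():
    repo_url = "https://github.com/advanced-stack/llm-components"
    with tempfile.TemporaryDirectory() as temp_dir:
        clone_dir = Path(temp_dir) / "repo"
        clone_repository(repo_url, clone_dir)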
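If you are curious what a helper like clone_repository has to do, the sketch below shells out to git. This is not the library's implementation: `build_clone_command` and `clone_repository_sketch` are hypothetical names, the shallow clone (`--depth 1`) is my own assumption, and it requires git to be available on your PATH.

```python
import subprocess
from pathlib import Path
from typing import List


def build_clone_command(repo_url: str, clone_dir: Path) -> List[str]:
    # --depth 1 fetches only the latest commit, which is enough to read the code
    return ["git", "clone", "--depth", "1", repo_url, str(clone_dir)]


def clone_repository_sketch(repo_url: str, clone_dir: Path) -> None:
    # check=True raises CalledProcessError if git exits with a non-zero status
    subprocess.run(build_clone_command(repo_url, clone_dir), check=True)
```

Keeping the command construction in its own function makes it easy to inspect or test without touching the network.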
Step 4: Map the code base to text
We will use the map_codebase_to_text function from the llm-components library to convert the code base into a structured markdown format.
from llm_components.loaders.code_base import map_codebase_to_text

        # Inside main(), within the `with` block, while clone_dir still exists
        code_base = map_codebase_to_text(clone_dir)
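To give an idea of what "mapping a code base to text" can look like, here is a minimal stdlib-only sketch. It is not how map_codebase_to_text is implemented: `codebase_to_markdown_sketch` and its `suffixes` parameter are hypothetical, it only skips the `.git` directory, and it does not honour .gitignore rules.

```python
from pathlib import Path

FENCE = "`" * 3  # markdown code fence, built here to keep this snippet readable


def codebase_to_markdown_sketch(root: Path, suffixes=(".py", ".md")) -> str:
    sections = []
    # sorted() keeps the output stable between runs
    for path in sorted(root.rglob("*")):
        relative = path.relative_to(root)
        # Skip directories and anything stored under .git
        if ".git" in relative.parts or not path.is_file():
            continue
        # Keep only the file types we want the model to read
        if path.suffix not in suffixes:
            continue
        content = path.read_text(encoding="utf-8", errors="replace")
        sections.append(
            f"## {relative.as_posix()}\n\n{FENCE}\n{content.rstrip()}\n{FENCE}\n"
        )
    return "\n".join(sections)
```

Each file becomes a heading with its relative path followed by its content in a fenced block, so the model can tell which code lives where.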
Step 5: Analyze the code and print the results
We will use PyLLMCore to parse the extracted text and identify the high-level architecture of the code base.
from llm_core.assistants import OpenAIAssistant

        # Still inside main(), within the `with` block
        with OpenAIAssistant(SoftwareArchitecture, model='gpt-4o') as assistant:
            software_architecture = assistant.process(code_base=code_base)
        markdown_output = software_architecture.to_markdown()
        print(markdown_output)
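A rough mental model of this step: the keyword arguments fill the prompt template stored on the data class, the model is asked for structured output, and the reply is turned back into data class instances. I have not traced PyLLMCore's internals, so treat the sketch below as an illustration only. `Module`, `Architecture`, `render_prompt` and `parse_reply` are hypothetical names, and the model reply is a hard-coded JSON string, so no API call is made.

```python
import json
from dataclasses import dataclass
from typing import List


@dataclass
class Module:
    name: str
    description: str


@dataclass
class Architecture:
    # Class-level template, not a dataclass field (no type annotation)
    prompt = "Code base:\n{code_base}\n----\nDescribe the architecture."

    name: str
    modules: List[Module]


def render_prompt(target_cls, **kwargs) -> str:
    # Keyword arguments fill the placeholders of the class-level template
    return target_cls.prompt.format(**kwargs)


def parse_reply(reply: str) -> Architecture:
    # Rebuild the nested dataclasses from the JSON text returned by the model
    data = json.loads(reply)
    modules = [Module(**item) for item in data["modules"]]
    return Architecture(name=data["name"], modules=modules)


# Hard-coded stand-in for a model reply; no API call is made here
fake_reply = '{"name": "Demo", "modules": [{"name": "cli.py", "description": "CLI"}]}'

print(render_prompt(Architecture, code_base="print('hi')"))
print(parse_reply(fake_reply).modules[0].name)
```

The practical takeaway is that the field names and types of the data class define what the model is asked to return, so changing the data class changes the shape of the analysis.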
Example Output
# LLM Components
LLM Components is a Python library designed to feed large
language models with data from different sources. The
library formats content in a structured markdown format,
supporting functionalities such as traversing a directory
tree, cloning a git repository, and converting web pages to
markdown format.
## Core Library
**Description:** The core library of LLM Components, containing the main
functionalities and utilities.
### Sub-modules
**version.py**
Contains the version information of the library.
**__init__.py**
Initialization file for the core library.
**cli.py**
Command-line interface for interacting with the library.
Supports commands for formatting codebases and converting
web pages to markdown.
## Loaders
**Description:** Modules responsible for loading and converting different
types of data into markdown format.
### Sub-modules
**web_to_markdown.py**
Handles the conversion of web pages to markdown format.
Includes functions for fetching HTML content, cleaning it,
and converting it to markdown.
**__init__.py**
Initialization file for the loaders module.
**code_base.py**
Handles the traversal and formatting of codebases into
markdown format. Respects .gitignore rules and formats
directory structures and file contents.
**utils.py**
Utility functions used across the loaders module, such as
cleaning scraped content.
**git_utils.py**
Handles the cloning of git repositories.