How to extract and analyze a code base with LLMs

When I need to understand a code base, I now use LLMs to get a quick overview. Here's how.

In this article, we will explore how to use PyLLMCore and LLM Components to extract and analyze a code base. The goal is to identify the high-level architecture of an undocumented code base.

We will use the advanced-stack/llm-components repository as an example.

Prerequisites

  • Python 3.8 or higher
  • PyLLMCore library
  • LLM Components library

Step 1: Install required libraries

First, we need to install the required libraries. You can do this by running the following commands in your terminal:

sh
pip install py-llm-core llm-components
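
To confirm that both packages are available in the current environment, a small check like the one below can help. It is a minimal sketch using only the standard library; the helper name `installed_version` is hypothetical, and the distribution names are the ones passed to pip above.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(distribution: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is missing."""
    try:
        return version(distribution)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    # Distribution names taken from the pip command above.
    for name in ("py-llm-core", "llm-components"):
        print(f"{name}: {installed_version(name) or 'not installed'}")
```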

Step 2: Define data classes

We will define data classes to hold the extracted information. These classes will help us structure the data in a meaningful way.

python
from dataclasses import dataclass
from textwrap import fill  # used by to_markdown() to wrap long descriptions
from typing import List

@dataclass
class LowLevelModule:
    name: str
    description: str

@dataclass
class HighLevelModule:
    name: str
    description: str
    sub_modules: List[LowLevelModule]

@dataclass
class SoftwareArchitecture:
    system_prompt = "You are a software architect"
    prompt = """
    Code base:
    {code_base}
    ----

    Analyze this code base and carefully write a description of
    the software architecture.
    """

    name: str
    description: str
    modules: List[HighLevelModule]

    def to_markdown(self) -> str:
        lines = [
            f"# {self.name}\n",
            f"\n{fill(self.description, width=60)}\n\n",
        ]
        for module in self.modules:
            lines.append(f"## {module.name}\n\n")
            lines.append(
                f"**Description:** {fill(module.description, width=60)}\n\n"
            )
            lines.append("### Sub-modules\n\n")
            for sub_module in module.sub_modules:
                lines.append(f"**{sub_module.name}**\n\n")
                lines.append(f"{fill(sub_module.description, width=60)}\n\n")
        return "".join(lines)
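
Note that `system_prompt` and `prompt` carry no type annotation, so the `dataclass` decorator treats them as plain class attributes rather than fields; only the annotated attributes (`name`, `description`, `modules`) become fields of the structure. The sketch below illustrates this standard Python behavior with a hypothetical `Summary` class that is not part of either library.

```python
from dataclasses import dataclass, fields
from typing import List


@dataclass
class Summary:
    # No annotation: a plain class attribute, ignored by @dataclass.
    prompt = "Summarize the following text: {text}"

    # Annotated: real dataclass fields, required by the constructor.
    title: str
    keywords: List[str]


# Only the annotated attributes are reported as fields.
field_names = [f.name for f in fields(Summary)]
print(field_names)

# The {text} placeholder can be filled with str.format.
print(Summary.prompt.format(text="hello world"))

summary = Summary(title="Greeting", keywords=["hello"])
print(summary)
```

Keeping the instructions as class attributes and the expected answer as fields lets a single class describe both the question and the shape of the answer.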

Step 3: Clone the repository

We will clone the advanced-stack/llm-components repository to a temporary directory.

python
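import tempfile
from pathlib import Path
# Import path assumed from the git_utils.py loader listed in the example output.
from llm_components.loaders.git_utils import clone_repository

def main():
    repo_url = "https://github.com/advanced-stack/llm-components"
    with tempfile.TemporaryDirectory() as temp_dir:
        clone_dir = Path(temp_dir) / "repo"
        clone_repository(repo_url, clone_dir)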
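
For readers curious about what a helper like `clone_repository` might do, here is a minimal sketch that shells out to the `git` command line. It is not the library's implementation; `build_clone_command` and `shallow_clone` are hypothetical names, and the sketch assumes `git` is available on the PATH.

```python
import subprocess
from pathlib import Path
from typing import List


def build_clone_command(repo_url: str, dest: Path) -> List[str]:
    """Build a shallow `git clone` command (latest commit only)."""
    return ["git", "clone", "--depth", "1", repo_url, str(dest)]


def shallow_clone(repo_url: str, dest: Path) -> None:
    """Clone repo_url into dest; raises CalledProcessError if git fails."""
    subprocess.run(build_clone_command(repo_url, dest), check=True)
```

A shallow clone is enough for this use case because only the current state of the files is analyzed, not the commit history.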

Step 4: Map the code base to text

We will use the map_codebase_to_text function from the llm-components library to convert the code base into a single structured markdown document. Call it inside the with block from Step 3, while the temporary directory still exists.

python
from llm_components.loaders.code_base import map_codebase_to_text

code_base = map_codebase_to_text(clone_dir)
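
To make the idea of mapping a code base to text more concrete, the sketch below walks a directory and concatenates source files into one markdown string. It only illustrates the general technique: `codebase_to_markdown` is a hypothetical function, it filters by file suffix and skips hidden paths, and it does not reproduce the library's handling of .gitignore rules.

```python
from pathlib import Path
from typing import Iterable

FENCE = "`" * 3  # markdown code fence, built programmatically


def codebase_to_markdown(root: Path, suffixes: Iterable[str] = (".py",)) -> str:
    """Concatenate matching files under root into one markdown string."""
    allowed = tuple(suffixes)
    sections = []
    for path in sorted(root.rglob("*")):
        relative = path.relative_to(root)
        # Skip directories and files with other suffixes.
        if not path.is_file() or path.suffix not in allowed:
            continue
        # Skip anything inside hidden directories such as .git.
        if any(part.startswith(".") for part in relative.parts):
            continue
        content = path.read_text(encoding="utf-8", errors="replace")
        # One heading per file, followed by its content in a fenced block.
        sections.append(
            f"## {relative.as_posix()}\n\n{FENCE}\n{content}\n{FENCE}\n"
        )
    return "\n".join(sections)
```

Putting the relative path in a heading above each file gives the model the directory layout along with the contents.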

Step 5: Analyze the code and print the results

We will use PyLLMCore to send the extracted text to the model and parse the response into the SoftwareArchitecture data class defined in Step 2, which we then render as markdown.

python
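from llm_core.assistants import OpenAIAssistant

with OpenAIAssistant(SoftwareArchitecture, model='gpt-4o') as assistant:
    # code_base fills the {code_base} placeholder in the prompt.
    software_architecture = assistant.process(code_base=code_base)
    markdown_output = software_architecture.to_markdown()
    print(markdown_output)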
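
Conceptually, a structured response has to be turned back into nested data classes before a method such as to_markdown can be called. The sketch below shows one way to do that by hand for a JSON payload. It is an illustration only, not how PyLLMCore works internally; the simplified `Architecture` class, the `architecture_from_dict` helper, and the sample payload are all made up for the example.

```python
import json
from dataclasses import dataclass
from typing import Any, Dict, List


@dataclass
class LowLevelModule:
    name: str
    description: str


@dataclass
class HighLevelModule:
    name: str
    description: str
    sub_modules: List[LowLevelModule]


@dataclass
class Architecture:
    name: str
    description: str
    modules: List[HighLevelModule]


def architecture_from_dict(data: Dict[str, Any]) -> Architecture:
    """Rebuild the nested dataclasses from a plain dictionary."""
    modules = [
        HighLevelModule(
            name=m["name"],
            description=m["description"],
            sub_modules=[LowLevelModule(**s) for s in m["sub_modules"]],
        )
        for m in data["modules"]
    ]
    return Architecture(data["name"], data["description"], modules)


# Hand-written sample payload, not real model output.
payload = json.loads("""
{
  "name": "Demo",
  "description": "A toy project.",
  "modules": [
    {
      "name": "Core",
      "description": "Main package.",
      "sub_modules": [{"name": "cli.py", "description": "Entry point."}]
    }
  ]
}
""")

architecture = architecture_from_dict(payload)
print(architecture.modules[0].sub_modules[0].name)
```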

Example Output

markdown
# LLM Components

LLM Components is a Python library designed to feed large
language models with data from different sources. The
library formats content in a structured markdown format,
supporting functionalities such as traversing a directory
tree, cloning a git repository, and converting web pages to
markdown format.

## Core Library

**Description:** The core library of LLM Components, containing the main
functionalities and utilities.

### Sub-modules

**version.py**

Contains the version information of the library.

**__init__.py**

Initialization file for the core library.

**cli.py**

Command-line interface for interacting with the library.
Supports commands for formatting codebases and converting
web pages to markdown.

## Loaders

**Description:** Modules responsible for loading and converting different
types of data into markdown format.

### Sub-modules

**web_to_markdown.py**

Handles the conversion of web pages to markdown format.
Includes functions for fetching HTML content, cleaning it,
and converting it to markdown.

**__init__.py**

Initialization file for the loaders module.

**code_base.py**

Handles the traversal and formatting of codebases into
markdown format. Respects .gitignore rules and formats
directory structures and file contents.

**utils.py**

Utility functions used across the loaders module, such as
cleaning scraped content.

**git_utils.py**

Handles the cloning of git repositories.
