Teaching Large Language Models how to generate workflows (3/n)
Analyzing performance constraints with local LLMs
When working with complex workflows and local LLMs, one of the biggest challenges is managing the number of tokens in the context window. Describing an environment with numerous nodes and properties quickly fills it up and leads to inefficiencies. For instance, I managed to generate a complete workflow with a large model, but it required 118 000 tokens, which is unusable for a local model due to memory limitations and performance constraints.
Splitting tasks and optimizing the workflow
Given this limitation, I started exploring ways to split the task into smaller, manageable parts:
- Selecting relevant nodes: Identify relevant nodes using their names.
- Identifying relevant properties: Determine the properties relevant to the selected nodes.
- Building conditions: Construct conditions based on properties.
- Generating the workflow: Combine the selections and conditions to build the workflow.
This multi-stage process keeps token usage low at each step, but every step adds its own latency, making the overall process too slow for practical use with a single LLM.
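To make the split concrete, here is a minimal sketch of the four-stage pipeline. The `complete()` function is a placeholder for any local model call (llama.cpp, Ollama, etc.), and the prompt wording and node catalog structure are assumptions for illustration, not the exact prompts I use.

```python
import json

def complete(prompt: str) -> str:
    """Placeholder for a call to the local LLM (llama.cpp, Ollama, ...)."""
    raise NotImplementedError

def generate_workflow(query: str, node_catalog: dict) -> dict:
    # Stage 1: select relevant nodes using only their names (small prompt).
    nodes = json.loads(complete(
        f"Query: {query}\nAvailable nodes: {list(node_catalog)}\n"
        "Return a JSON list with the names of the nodes needed."))

    # Stage 2: only the selected nodes' properties enter the context.
    props = {name: node_catalog[name]["properties"] for name in nodes}
    relevant = json.loads(complete(
        f"Query: {query}\nProperties: {json.dumps(props)}\n"
        "Return a JSON object mapping each node to its relevant properties."))

    # Stage 3: build conditions from the retained properties.
    conditions = json.loads(complete(
        f"Query: {query}\nProperties: {json.dumps(relevant)}\n"
        "Return a JSON list of conditions as [property, operator, value]."))

    # Stage 4: assemble the final workflow from the partial results.
    return json.loads(complete(
        f"Query: {query}\nNodes: {json.dumps(nodes)}\n"
        f"Conditions: {json.dumps(conditions)}\n"
        "Return the complete workflow as JSON."))
```

Each stage is a separate model call, which is exactly where the latency piles up on local hardware.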
After a first optimization phase, I got down to approximately 15 000 tokens for my testing environment. However, token processing time is still too high for this to be a valid option:
MacBook M1:
- 2 min 30 s for llama-3.1-8B
- 40 s for Qwen3-1.7B
Older Intel Core i7 (from a 2012 Mac Mini):
- 10 min for Qwen3-1.7B
These numbers led me to reorient the initial strategy: we won't let a single LLM do everything.
Balancing pre-processing and inference
I am currently evaluating different strategies:
Reducing required context:
- Using an embedding model directly to decrease token usage (see the sketch below).
- Combining both an embedding model and a smaller LLM (around 0.5B) for rewriting the query.
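As an illustration of the first option, a small bi-encoder can pre-select node names before the LLM sees anything. This assumes the sentence-transformers package; the model name is only an example.

```python
from sentence_transformers import SentenceTransformer, util

# A small embedding model (~22M parameters) that runs comfortably on CPU.
embedder = SentenceTransformer("all-MiniLM-L6-v2")

def select_nodes(query: str, node_names: list[str], top_k: int = 10) -> list[str]:
    # Embed the query and every node name, then keep the closest matches.
    query_emb = embedder.encode(query, convert_to_tensor=True)
    node_embs = embedder.encode(node_names, convert_to_tensor=True)
    scores = util.cos_sim(query_emb, node_embs)[0]
    top = scores.topk(min(top_k, len(node_names))).indices.tolist()
    return [node_names[i] for i in top]
```

Only the surviving node names (and later their properties) are serialized into the prompt, which is where most of the token savings come from.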
Planning logic:
- Removing structured generation to improve logical reasoning, then parsing the results afterward (see the parsing sketch below).
- Employing an agentic approach to build workflows step-by-step (on reduced context)
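For the first point, "removing structured generation" simply means letting the model answer in free text and recovering the structure afterward. A minimal parser, under the assumption that the workflow is returned as a JSON object somewhere in the answer, could look like this:

```python
import json

def extract_json(answer: str) -> dict | None:
    """Pull the outermost JSON object out of a free-form model answer."""
    start, end = answer.find("{"), answer.rfind("}")
    if start == -1 or end <= start:
        return None
    try:
        return json.loads(answer[start:end + 1])
    except json.JSONDecodeError:
        return None  # caller can retry or send a repair prompt
```

The trade-off is that a failed parse needs a retry, but the model is free to reason before committing to an output format.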
Combining smaller models:
- Using several smaller models together as a system (embedding, re-ranking, and LLM inference), as sketched below.
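A rough sketch of that system, chaining the embedding shortlist from above with a cross-encoder re-ranker before the small LLM is even invoked. The model name is an example, and `select_nodes` is the hypothetical helper sketched earlier.

```python
from sentence_transformers import CrossEncoder

# A small cross-encoder re-ranks the shortlist produced by the bi-encoder.
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def shortlist_nodes(query: str, node_names: list[str], final_k: int = 5) -> list[str]:
    # Wide, cheap recall pass with the embedding model (see sketch above).
    candidates = select_nodes(query, node_names, top_k=20)
    # More precise scoring of each (query, node name) pair.
    scores = reranker.predict([(query, name) for name in candidates])
    ranked = sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True)
    return [name for name, _ in ranked[:final_k]]
```

The LLM then only has to reason over a handful of nodes, which keeps both the context and the latency within budget.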
Fine-tuning smaller models:
- Fine-tuning a model in the 200-500M parameter range specifically for named entity recognition (NER) and logical reasoning.
Micro-model fine-tuning:
- Performing on-device fine-tuning of micro-models (under 100M parameters).
The two main decisive criteria are the latency budget (how long we are willing to make the user wait) and the overall quality (how well the query is mapped into a sound workflow).
While the workflow generation task is really fast with larger models like GPT-4o or Claude, ensuring it won't take hours on small devices requires a new approach.
Stay tuned for part 4.