The 7-step build process
Every custom GPT project follows roughly the same shape. The difference between projects that succeed and projects that don't isn't usually the technology — it's whether each step gets the attention it needs.
Step 1: Use case scoping (1–2 weeks). What's the problem? What does success look like? Who are the users? What's the failure mode if the GPT gives a wrong answer? This step is the most undervalued and the most important: most projects that fail, fail here.
Step 2: Data inventory (1 week). What data does the GPT need to know about? Where does it live? What's the access model? What's the privacy posture? Many projects discover at this step that the data they thought existed actually doesn't, or is too messy to use.
Step 3: Architecture & technology selection (1 week). Cloud or self-hosted? Which model? Vector store? Orchestration framework? Compliance posture? This is where 'custom GPT' splits into 50 different shapes.
Step 4: Data preparation (1–4 weeks). Cleaning, deduplication, chunking, metadata enrichment, vector embedding. The unglamorous work. The single biggest determinant of GPT quality.
Step 5: Initial build (2–6 weeks). System prompt design, retrieval pipeline, integration to source systems, UI/UX, evaluation harness.
Step 6: Evaluation & iteration (2–4 weeks). Build a 100–500 question evaluation set. Run the GPT against it. Tune until performance is acceptable. Most teams underestimate this step by 3x.
Step 7: Deployment & monitoring (ongoing). Production deployment, observability, feedback loop, retraining cadence, model upgrade plan.
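The chunking work in step 4 is often the first code written. A minimal sketch, using overlapping character windows (the 800/100 sizes are illustrative defaults, not recommendations — tune them per corpus):

```python
def chunk_text(text: str, chunk_size: int = 800, overlap: int = 100) -> list[str]:
    """Split a document into overlapping character windows for embedding.

    chunk_size and overlap are illustrative values, not recommendations;
    real pipelines often chunk on headings or sentences instead.
    """
    if chunk_size <= overlap:
        raise ValueError("chunk_size must exceed overlap")
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap  # step forward, keeping `overlap` chars of context
    return chunks
```

The overlap exists so that a fact split across a chunk boundary still appears whole in at least one chunk — a small cost in storage for a meaningful gain in retrieval recall.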
Technology choices, simplified
Foundation Model
OpenAI GPT-4o for most use cases. Anthropic Claude 3.5 Sonnet for complex reasoning or long-context. Llama 3.1 70B if self-hosting is required.
Vector Database
Pinecone (managed, easy) for most projects. Qdrant or Weaviate (self-hosted) when data residency demands it. pgvector (PostgreSQL extension) for projects already on Postgres.
Orchestration Framework
LangChain or LlamaIndex for prototyping. Custom code for production. We've burnt out on LangChain's churn at production scale; we mostly write directly against the model APIs now.
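A minimal sketch of that direct-to-API style, assuming the official OpenAI Python SDK (`openai` package). The system prompt, model name, and `answer` wiring are illustrative, not our production code:

```python
SYSTEM_PROMPT = (
    "Answer only from the provided context. "
    "If the context is insufficient, say you don't know."
)

def build_messages(question: str, context_docs: list[str]) -> list[dict]:
    """Assemble the chat payload: system prompt, then retrieved context and the question."""
    context = "\n\n---\n\n".join(context_docs)
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
    ]

def answer(question: str, context_docs: list[str]) -> str:
    # Lazy import: requires the `openai` package and an OPENAI_API_KEY in the environment.
    from openai import OpenAI
    client = OpenAI()
    resp = client.chat.completions.create(
        model="gpt-4o",  # illustrative choice; see the model notes above
        messages=build_messages(question, context_docs),
    )
    return resp.choices[0].message.content
```

Swapping to Anthropic means changing roughly the ten lines inside `answer` — which is the point: the surface area you own directly is small enough that a framework abstraction over it buys little.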
Hosting
Azure (Australia East) for compliance-sensitive work. AWS (Sydney) for general production. Vercel + Cloudflare for the lightweight web layer.
DIY vs hire — honest take
When DIY makes sense: your team has experienced engineers comfortable with Python and the OpenAI/Anthropic APIs, the use case is well-understood, the data is clean, and there are no specific compliance constraints. A 2-engineer team can build a usable internal tool in 4–6 weeks.
When hiring makes sense: your team is busy with other things, the use case crosses multiple business systems you haven't integrated before, or the compliance posture (APRA, AHPRA, defence) is not your team's specialty. Specialist consultants compress build time by 2–3x and reduce the risk of expensive mistakes.
When neither makes sense: the use case is a poor fit for AI. Some 'AI projects' are actually misnamed automation projects (which Zapier or Make.com handles fine), bad data infrastructure projects (which need a data engineer, not an AI consultant), or incoherent strategy projects (where the business hasn't decided what they want).
Pitfalls we see consistently
- Skipping evaluation. 'It worked when I tried it' is not the same as 'it works at production scale'. A proper eval set is non-negotiable.
- Underinvesting in data prep. 60–70% of project value comes from data quality. Treat it accordingly.
- Over-engineering retrieval. Sometimes the answer is 'just put 5 documents in the prompt'. Retrieval pipelines aren't always necessary.
- Picking the wrong model for the job. GPT-4o is overkill for routing tasks. GPT-4o-mini or Claude Haiku is 10x cheaper and good enough.
- No feedback loop. Without users marking responses as good/bad, the GPT can't improve. Build the feedback mechanism in week 1, not week 12.
- Compliance as an afterthought. Retrofitting APRA controls is expensive. Design them in from day 1 if relevant.
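On the model-selection pitfall: a routing layer can be as simple as sending labelling-style queries to the cheap model and everything else to the strong one. A deliberately naive sketch — the keyword list is hypothetical and the model names are simply the ones named above:

```python
CHEAP_MODEL = "gpt-4o-mini"   # good enough for routing/labelling tasks
STRONG_MODEL = "gpt-4o"       # reserved for genuine reasoning

# Hypothetical trigger phrases for "this is a routing task, not a reasoning task".
ROUTING_KEYWORDS = ("categorise", "categorize", "tag", "route", "which team")

def pick_model(query: str) -> str:
    """Naive router: cheap model for routing/labelling queries, strong model otherwise."""
    q = query.lower()
    if any(keyword in q for keyword in ROUTING_KEYWORDS):
        return CHEAP_MODEL
    return STRONG_MODEL
```

In practice teams often replace the keyword check with a call to the cheap model itself acting as a classifier, but even this crude version can cut inference cost substantially on routing-heavy traffic.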
The single most important step is scoping. Spending two extra weeks on the use case definition saves four months of wrong-direction development. If you take one thing from this guide, make it this: don't start the build until you can articulate the problem in one paragraph and define what 'good' looks like in 5 measurable bullets.
Frequently asked questions
How big a team do we need?
Internal DIY: 2 engineers (one ML/AI-experienced, one full-stack), plus a domain expert who can answer questions about the business problem. Hiring out: a consulting firm typically deploys 1–3 engineers depending on complexity. Internal involvement is still essential — somebody on your side has to own the use case definition and data access.
Can ChatGPT or Claude build my GPT for me?
They can write code, debug, and accelerate development substantially. They can't define your use case, organise your data, or handle the strategy. Use them as productivity tools for the engineers building, not as replacements for the engineers.
How do we know our GPT is 'good enough'?
An evaluation harness with 100–500 representative test queries scored against ground truth. Set the bar before the build (e.g., ≥90% accuracy on the eval set, ≤5% explicit-uncertainty rate, 0% hallucination on critical facts). Don't ship until the bar is met.
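A minimal sketch of such a harness, with exact-substring matching standing in for real grading (production harnesses typically use human review or an LLM judge); `answer_fn`, the eval-set shape, and the threshold are placeholders:

```python
def evaluate(answer_fn, eval_set: list[dict], threshold: float = 0.90) -> tuple[float, bool]:
    """Score a GPT against ground truth and report whether it clears the ship bar.

    answer_fn: callable taking a question string, returning the GPT's answer.
    eval_set: dicts with "question" and "expected" keys (a placeholder schema).
    Substring matching is a crude stand-in for proper grading.
    """
    correct = 0
    for case in eval_set:
        response = answer_fn(case["question"])
        if case["expected"].lower() in response.lower():
            correct += 1
    accuracy = correct / len(eval_set)
    return accuracy, accuracy >= threshold
```

The important part is not the scoring rule but the discipline: the eval set and the threshold exist before the build starts, and the same set is re-run after every prompt or pipeline change.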
How long until ROI?
Simple internal tools: 3–6 months. Customer-facing deployments: 6–12 months. Compliance-grade enterprise: 12–18 months. The lag is mostly user adoption — building the GPT is the easy part; getting your team to use it consistently is the hard part.
Ready to build your custom GPT?
Get a free 30-minute scoping call. We'll map your use case, data sources, and ROI before you commit.
Start the Conversation