Key take-away: Owning a self-hosted AI stack slashes costs and gives you control of your data, without compromising performance.
We walk through every hosting option, from smartphone edge devices to the cloud, and show how real organizations have seen a net benefit even with only partial adoption of self-hosted, open-source AI solutions.
For a considerable period, the commercial, closed-source large language models (LLMs) offered by major tech companies held a distinct performance advantage over their open-source counterparts.
However, recent trends show that gap narrowing significantly, with open-source models rapidly catching up and, in some cases, rivaling the performance of proprietary solutions.
With the help of open-source models, we can also design AI solutions that are not only high-performing but also private, secure, and cost-effective over the long term, with full control over our data.
This shift opens up entirely new strategic possibilities: secure, self-hosted LLM deployments tailored to enterprise requirements without compromising on accuracy or speed.
Open-source models still lack an organization's proprietary knowledge and brand voice out of the box. To close that gap, two levers can be pulled: connecting external knowledge bases and fine-tuning.
The primary advantage of connecting knowledge bases is that the model's knowledge stays up to date without re-training or continuous fine-tuning. When new information becomes available, it is simply added to the knowledge base, and the system incorporates it into future responses as additional context.
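To make the knowledge-base lever concrete, here is a minimal retrieval-augmented generation sketch against a local Ollama server (the tool we use later in this article). The model names, documents, and question are illustrative placeholders, not a production pipeline.

```python
# Minimal retrieval-augmented generation (RAG) sketch against a local Ollama server.
# Model names and knowledge-base snippets below are illustrative placeholders.
import requests

OLLAMA = "http://localhost:11434"      # Ollama's default port
EMBED_MODEL = "nomic-embed-text"       # example embedding model
CHAT_MODEL = "llama3"                  # example generation model

knowledge_base = [
    "Invoices are processed every Friday by the finance team.",
    "Staff connect to internal tools through the company tailnet.",
]

def embed(text: str) -> list[float]:
    """Ask Ollama for an embedding vector of the given text."""
    r = requests.post(f"{OLLAMA}/api/embeddings",
                      json={"model": EMBED_MODEL, "prompt": text})
    r.raise_for_status()
    return r.json()["embedding"]

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = (sum(x * x for x in a) ** 0.5) * (sum(y * y for y in b) ** 0.5)
    return dot / norm

# Index the knowledge base once; new facts are simply appended and embedded,
# no re-training or fine-tuning required.
index = [(doc, embed(doc)) for doc in knowledge_base]

def answer(question: str) -> str:
    """Retrieve the most relevant snippet and pass it to the model as extra context."""
    q_vec = embed(question)
    best_doc = max(index, key=lambda item: cosine(q_vec, item[1]))[0]
    prompt = f"Context: {best_doc}\n\nQuestion: {question}\nAnswer:"
    r = requests.post(f"{OLLAMA}/api/generate",
                      json={"model": CHAT_MODEL, "prompt": prompt, "stream": False})
    r.raise_for_status()
    return r.json()["response"]

print(answer("When are invoices processed?"))
```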
As a rule of thumb: connect data sources when the knowledge changes quickly; reach for fine-tuning when policy adherence or stylistic precision matters.
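For the second lever, a common low-cost approach is parameter-efficient fine-tuning with LoRA adapters. The sketch below uses the Hugging Face transformers and peft libraries, which are our example choice here rather than a stack named in this article, and the base model ID and target modules are placeholders to swap for your own.

```python
# Minimal LoRA fine-tuning setup sketch using Hugging Face transformers + peft.
# The base model ID and target modules are examples; adjust for your own model.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, TaskType, get_peft_model

BASE_MODEL = "mistralai/Mistral-7B-Instruct-v0.2"  # example open-source model

tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)
model = AutoModelForCausalLM.from_pretrained(BASE_MODEL)

# Attach small, trainable LoRA adapters; the base weights stay frozen,
# so only a tiny fraction of parameters is updated during training.
lora = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=8,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # attention projections; model-dependent
)
model = get_peft_model(model, lora)
model.print_trainable_parameters()  # typically well under 1% of total parameters

# From here, train with your usual Trainer / SFT loop on examples that capture
# the policies and brand voice the model should adhere to.
```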
We will now explore the different hosting solutions. When considering how to deploy AI outside of public, black-box APIs, the options can be broadly categorized by the degree of infrastructure control, and the management burden, taken on by the team or organization.
This spectrum ranges from fully managed services that abstract away all infrastructure to bare-metal deployments where the engineering team manages everything.
In this option, the engineering team rents virtual machines or dedicated servers, often with GPU support, from a cloud provider. The team controls the operating system and the software stack, while the hardware is still managed by the provider.
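As an illustration of what this option looks like in practice, here is a sketch that serves an open-source model on a rented GPU instance using the vLLM engine. vLLM is our example choice (the workstation setup we describe later uses Ollama instead), and the model ID is a placeholder.

```python
# Sketch of serving an open-source model on a rented GPU VM with vLLM.
# The model ID is an example; pick one that fits the GPU memory you rented.
from vllm import LLM, SamplingParams

llm = LLM(model="mistralai/Mistral-7B-Instruct-v0.2")   # loads weights onto the GPU
params = SamplingParams(temperature=0.7, max_tokens=256)

# Generate a completion for a single prompt; vLLM batches requests under the hood.
outputs = llm.generate(["Summarize our options for self-hosting LLMs."], params)
print(outputs[0].outputs[0].text)
```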
Here at White Widget, we needed a way for our team to evaluate open-source AI solutions without taking on cloud overhead right away.
For example, we wanted to explore new embedding models, fine-tuning, and the open-source models out of the box. We also needed a setup that is collaborative, secure, and accessible internally by our team. The tech stack we used is available online; Tailscale offers a generous free tier, which we used to set up the private network.
Tailscale manages remote access using Access Control Lists (ACLs). For example, we can let a dedicated workstation such as the WW Office Machine run services and control which of its ports are visible to the rest of the team.
We will not dive deep into Tailscale's features here, but with Tailscale Serve and Tailscale Funnel we can make services such as the Ollama API endpoint available to the engineering team.
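From a teammate's laptop on the tailnet, reaching that shared endpoint is then a plain HTTP call. In the sketch below, the hostname is a hypothetical Tailscale machine name for the workstation; 11434 is Ollama's default port and /api/tags is its standard endpoint for listing pulled models.

```python
# Quick check that the shared Ollama endpoint is reachable over the tailnet.
# "ww-office-machine" is a hypothetical Tailscale machine name.
import requests

OLLAMA_URL = "http://ww-office-machine:11434"

resp = requests.get(f"{OLLAMA_URL}/api/tags", timeout=10)
resp.raise_for_status()
for model in resp.json()["models"]:
    print(model["name"])   # models the workstation has pulled and can serve
```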
We have used this setup to explore features such as AI-assisted documentation generation for our codebases, which our team is actively evaluating.
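A rough sketch of that kind of internal tooling is below: send a source file to the shared Ollama endpoint and ask for draft documentation. The hostname, model name, and file path are illustrative, and the output is only a draft that still needs human review.

```python
# Illustrative internal tool: ask a local LLM to draft documentation for a module.
# Hostname, model, and file path are placeholders; output is a draft for human review.
from pathlib import Path
import requests

OLLAMA_URL = "http://ww-office-machine:11434"   # hypothetical tailnet hostname
MODEL = "llama3"                                 # example model

source = Path("app/services/billing.py").read_text()  # illustrative source file

prompt = (
    "Write concise developer documentation (purpose, inputs, outputs, caveats) "
    "for the following module:\n\n" + source
)

resp = requests.post(
    f"{OLLAMA_URL}/api/generate",
    json={"model": MODEL, "prompt": prompt, "stream": False},
    timeout=300,
)
resp.raise_for_status()
print(resp.json()["response"])   # draft documentation to review and commit
```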
A dedicated workstation for hosting local LLMs is a cost-effective way to build internal tooling: it can boost developer productivity and widen our capabilities by letting us try new technologies without cloud overhead. While not suited for production, it is a good sandbox for in-house tooling and for experimenting with open-source features.
In May 2025, Google released a demo application called Gallery that lets users run LLMs on mobile devices. The latest APK is available from the project's GitHub repository. Currently, the demo application is only available for Android, with an iOS version potentially in development.
Running models on-device means the model weights must first be downloaded to the device, such as a smartphone, before use. This setup offers several advantages: lower latency with no network round-trip involved, privacy by default because data is never sent off the device, and offline use that requires no cellular connection or data consumption.
What’s impressive about the out-of-the-box models in the app is that they can process images and answer questions about them accurately, even on the low-end consumer phones we tested.
Gallery also provides useful metrics for assessing how a model performs on the smartphone (a rough worked example follows these definitions):
First-token latency: measures model startup, cache allocation, and reading the prompt to produce the first token.
Pre-fill speed: measures how fast the model reads the prompt.
Decode speed: measures how fast the model generates the answer tokens, reusing the cache.
End-to-end latency: the wall-clock time to generate the whole answer.
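To see how these metrics relate, here is a back-of-the-envelope estimate. All numbers are made-up placeholders for illustration, not measurements from the Gallery app.

```python
# Rough estimate relating the metrics above; all numbers are illustrative.
prompt_tokens = 200      # length of the user's prompt
output_tokens = 150      # length of the generated answer
prefill_speed = 100.0    # tokens/second while reading the prompt
decode_speed = 10.0      # tokens/second while generating the answer

# First-token latency is roughly the pre-fill time (ignoring model startup
# and cache-allocation overhead, which it also includes in practice).
first_token_latency = prompt_tokens / prefill_speed      # ~2.0 s to first token
decode_time = output_tokens / decode_speed               # ~15.0 s of decoding
end_to_end_latency = first_token_latency + decode_time   # ~17.0 s wall-clock total

print(f"first token after ~{first_token_latency:.1f}s, "
      f"full answer after ~{end_to_end_latency:.1f}s")
```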
Out of the box, the performance already looks impressive given that it all runs on a mobile device.
Self-hosting now covers a wide spectrum, from fully managed platforms that hide the plumbing to bare-metal clusters that give you total control. With open-source models rapidly closing the performance gap with commercial offerings, and with knowledge bases and fine-tuning making customization practical, building your own AI stack is more feasible than ever.
Real-world wins back this up. Projects like Singapore’s Pair assistant show how self-hosting models can deliver stronger privacy, tighter security, and lower long-term costs while letting teams tailor the model to local regulations. Looking ahead, even on-device LLMs are pushing the same benefits to the edge, promising millisecond latency and zero cloud data exposure for end users.
Let our team help you cut costs on inference bills and cloud spend by exploring self-hosted and hybrid solutions.