Guide to Self Hosting LLMs Faster/Better than Ollama

brucethemoose@lemmy.world · edit-2 9 days ago

American here.

If Linus Torvalds bans Americans over something stupid we did, and my nationality gets insulted, this will be me:

(I am actually Central Texan, TYVM. Bring the insults.)

brucethemoose@lemmy.world · edit-2 9 days ago

American here.

Americans are idiots. All of us are morons. I accept the insults on my nationality, even personally. Bring it. I don’t know how else to say/generalize it, because it’s true.

I’m gonna nod my head if something happens and Linus Torvalds bans us too.

brucethemoose@lemmy.world · 15 days ago

No, from what I’ve seen it falls off below 4bpw (just less slowly than other models) and makes ~2.25 bit quants somewhat usable instead of totally impractical, largely like AQLM.

You are thinking of bitnet, which (so far, though not after many tries) requires models to be trained from scratch that way to be effective.

brucethemoose@lemmy.world · edit-2 22 days ago

I used to have a 6GB GPU, and around 7B is the sweetspot. This is still the case with newer models, you just have to pick the right model.

Try a IQ4 quantization of Qwen 2.5 7B coder.

Below 3bpw is where its starts to not be worth it, since we have so many open weights availible these days. A lot of people are really stubbern and run 2-3bpw 70B quants, but they are objectively worse than a similarly trained 32B model in the same space, even with exotic, expensive quantization like VPTQ or AQLM: https://huggingface.co/VPTQ-community

brucethemoose@lemmy.world · 23 days ago

Almost all of Qwen 2.5 is Apache 2.0, SOTA for the size, and frankly obsoletes many bigger API models.

brucethemoose@lemmy.world · 23 days ago

To actually answer this, you could look into free APIs of open source models, which have daily limits but are otherwise largely catch-free. You could even mirror endpoints on your VPS if you need to, or host “middleware” like prompt formatters and enhancers.

I say this because, as others said, you cannot actually host AI on a VPS…

brucethemoose@lemmy.world · edit-2 24 days ago

What’s to stop people from making “tunnel” instances, maybe even with reposts?

That aside, ultimately, we’re at the mercy of whoever is paying for the instance, and their interests. So if everything does fragment, it’s just kinda the nature of the hosts.

brucethemoose@lemmy.world · edit-2 24 days ago

I’d wager that enshittification is inevitable, but the Fediverse can “live on” between cycles because instances or even entire systems can go down while new Fediverse ones take their place.

brucethemoose@lemmy.world · 26 days ago

Lemmy.world and sh.itjust.works don’t seem to have any noticeable political leanings as far as I can tell.

…What?

I consider myself a raging liberal, at least in the US. A socialist. But lemmy.world is so liberal it makes me feel like a Trumpster.

I guess I don’t feel at risk of getting globally banned like I would for disagreeing with the consensus like on .ml, but claiming .world is neutral is quite a sweeping statement.

brucethemoose@lemmy.world · edit-2 29 days ago

These days, there are amazing “middle sized” models like Qwen 14B, InternLM 20B and Mistral/Codestral 22B that are such a massive step over 7B-9B ones you can kinda run on CPU. And there are even 7Bs that support a really long context now.

IMO its worth reaching for >6GB of VRAM if LLM running is a consideration at all.

brucethemoose@lemmy.world · 30 days ago

I am not a fan of CPU offloading because I like long context, 32K+. And that absolutely chugs if you even offload a layer or two.

brucethemoose@lemmy.world · 30 days ago

For local LLM hosting, basically you want exllama, llama.cpp (and derivatives) and vllm, and rocm support for all of them is just fine. It’s absolutely worth having a 24GB AMD card over a 16GB Nvidia one, if that’s the choice.

The big sticking point I’m not sure about is flash attention for exllama/vllm, but I believe the triton branch of flash attention works fine with AMD GPUs now.

brucethemoose@lemmy.world · edit-2 30 days ago

Basically the only thing that matters for LLM hosting is VRAM capacity. Hence AMD GPUs can be OK for LLM running, especially if a used 3090/P40 isn’t an option for you. It works fine, and the 7900/6700 are like the only sanely priced 24GB/16GB cards out there.

I have a 3090, and it’s still a giant pain with wayland, so much that I use my AMD IGP for display output and Nvidia still somehow breaks things. Hence I just do all my gaming in Windows TBH.

CPU doesn’t matter for llm running, cheap out with a 12600K, 5600, 5700x3d or whatever. And the single-ccd x3d chips are still king for gaming AFAIK.

brucethemoose@lemmy.world · edit-2 1 month ago

To go into more detail:

Exllama is faster than llama.cpp with all other things being equal.
exllama’s quantized KV cache implementation is also far superior, and nearly lossless at Q4 while llama.cpp is nearly unusable at Q4 (and needs to be turned up to Q5_1/Q4_0 or Q8_0/Q4_1 for good quality)
With ollama specifically, you get locked out of a lot of knobs like this enhanced llama.cpp KV cache quantization, more advanced quantization (like iMatrix IQ quantizations or the ARM/AVX optimized Q4_0_4_4/Q4_0_8_8 quantizations), advanced sampling like DRY, batched inference and such.

It’s not evidence or options… it’s missing features, thats my big issue with ollama. I simply get far worse, and far slower, LLM responses out of ollama than tabbyAPI/EXUI on the same hardware, and there’s no way around it.

Also, I’ve been frustrated with implementation bugs in llama.cpp specifically, like how llama 3.1 (for instance) was bugged past 8K at launch because it doesn’t properly support its rope scaling. Ollama inherits all these quirks.

I don’t want to go into the issues I have with the ollama devs behavior though, as that’s way more subjective.

brucethemoose@lemmy.world · edit-2 1 month ago

It’s less optimal.

On a 3090, I simply can’t run Command-R or Qwen 2.5 34B well at 64K-80K context with ollama. Its slow even at lower context, the lack of DRY sampling and some other things majorly hit quality.

Ollama is meant to be turnkey, and thats fine, but LLMs are extremely resource intense. Sometimes the manual setup/configuration is worth it to squeeze out every ounce of extra performance and quantization quality.

Even on CPU-only setups, you are missing out on (for instance) the CPU-optimized quantizations llama.cpp offers now, or the more advanced sampling kobold.cpp offers, or more fine grained tuning of flash attention configs, or batched inference, just to start.

And as I hinted at, I don’t like some other aspects of ollama, like how they “leech” off llama.cpp and kinda hide the association without contributing upstream, some hype and controversies in the past, and hints that they may be cooking up something commercial.

brucethemoose@lemmy.world · 1 month ago

Nah, I should have mentioned it but exui is it’s own “server” like TabbyAPI.

Just run exui on the host that would normally serve tabby, and access the web ui through a browser.

If you need an API server, TabbyAPI fills that role.

brucethemoose@lemmy.world · edit-2 1 month ago

Shrug did you grab an older Qwen GGUF? The series goes pretty far back, and its possible you grabbed one that doesn’t support GQA or something like that.

Doesn’t really matter though, as long as it works!

brucethemoose@lemmy.world · 1 month ago

Your post is suggesting that the same models with the same parameters generate different result when run on different backends

Yes… sort of. Different backends support different quantization schemes, for both the weights and the KV cache (the context). There are all sorts of tradeoffs.

There are even more exotic weight quantization schemes (ALQM, VPTQ) that are much more VRAM efficient than llama.cpp or exllama, but I skipped mentioning them (unless somedone asked) because they’re so clunky to setup.

Different backends also support different samplers. exllama and kobold.cpp tend to be at the cutting edge of this, with things like DRY for better long-form generation or grammar.

brucethemoose@lemmy.world · 1 month ago

So there are multiple ways to split models across GPUs, (layer splitting, which uses one GPU then another, expert parallelism, which puts different experts on different GPUs), but the way you’re interested in is “tensor parallelism”

This requires a lot of communication between the GPUs, and NVLink speeds that up dramatically.

It comes down to this: If you’re more interested in raw generation speed, especially with parallel calls of smaller models, and/or you don’t care about long context (with 4K being plenty), use Aphrodite. It will ultimately be faster.

But if you simply want to stuff the best/highest quality model you can at VRAM, especially at longer context (>4K), use TabbyAPI. Its tensor parallelism only works over PCIe, so it will be a bit slower, but it will still stream text much faster than you can read. It can simply hold bigger, better models at higher quality in the same 48GB VRAM pool.

brucethemoose@lemmy.world · edit-2 1 month ago

It’s probably much smaller than whatever other GGUF you got, aka more tightly quantized.

Look at the filesize, thats basically how much RAM it takes.

brucethemoose@lemmy.world · edit-2 1 month ago

Guide to Self Hosting LLMs Faster/Better than Ollama

brucethemoose@lemmy.world · 2 months ago

Qwen2.5: A Party of Foundation Models!

brucethemoose@lemmy.world · edit-2 2 months ago

How does Lemmy feel about "open source" machine learning, akin to the Fediverse vs Social Media?

brucethemoose@lemmy.world · edit-2 3 months ago

Cohere Drops Command-R 35B 08-2024 Update, Just About a Perfect Local LLM for 24GB GPUs.