Don't do RAG - This method is way faster & accurate...

AI Jason

207K subscribers

165.9k views • 7 months ago

CAG intro + build an MCP server that reads API docs. Set up Helicone to monitor your LLM app cost now: ...

81 Comments

@AIJasonZ 7 months ago

Set up Helicone to monitor your LLM app cost now: https://www.helicone.ai/?utm_source=ai-jason

Join AI Builder Club to access […]ple & Doc MCP: http://aibuilderclub.com/

@EDA.architect 7 months ago

More context = less attention, more latency, less repeatability in answer generation, more tokens, less concurrency, more cost.

@christianmagnus1003 7 months ago

It's true, but CAG is limited by TPM (tokens per minute) and by tokens-per-request rate limits. With these limits you have to juggle LLM models to implement CAG systems, so if you have an enormous […] of context, CAG is not the way.
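The limits mentioned in this comment can be checked before committing to a full-context approach. The following is a minimal sketch, not a provider's API: `estimate_tokens` and `fits_limits` are hypothetical helpers, the 4-characters-per-token ratio is a rough rule of thumb rather than a real tokenizer, and the limit values are whatever your provider publishes.

```python
def estimate_tokens(text: str) -> int:
    """Rough token estimate: about 4 characters per token.

    This ratio is an assumption; use the provider's tokenizer for real numbers.
    """
    return max(1, len(text) // 4)


def fits_limits(context: str, prompt: str, per_request_limit: int,
                per_minute_limit: int, requests_per_minute: int) -> bool:
    """Check whether sending the whole context with every request stays in budget."""
    # Every request carries the full preloaded context plus the user prompt.
    per_request = estimate_tokens(context) + estimate_tokens(prompt)
    # The per-minute cost scales with how many requests are sent each minute.
    per_minute = per_request * requests_per_minute
    return per_request <= per_request_limit and per_minute <= per_minute_limit
```

Because the whole context is resent on every call, the per-minute budget shrinks linearly with request volume, which is the constraint the comment points at.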

@mojajojajo 7 months ago

This is just a worse version of RAG, why are we going backwards?

@EternalKernel 7 months ago

I don't think this is CAG, this is... just a BFP (big f!cking prompt). I know there was a research paper about CAG, and I'm pretty sure it requires manipulating the internals of the model.

@eitomiyamura 7 months ago

Helicone is honestly awful; I would recommend using LangFuse instead. We used to be paying users of Helicone, but their software is so slow and sluggish.

@quentinnnnn 7 months ago

Strongly agree with this approach, which I am experimenting with as well: with RAG frameworks, even with multi-hop queries etc., retrieval was really complicated. With CAG it destroys every query I do. […] Even more clever, I think, for a large dataset is something like a simplified GraphRAG: label the docs with tags and put every document that has relevant tags into the cache. Still perform a RAG query over all documents for precise, local requests (for example, a query like "who is xxx", where RAG works well on proper names) to find which document holds the info, then load into the cache all the documents with the same tags as the one RAG found, and boom, the knowledge issue is done.
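The tag-based idea in this comment could look roughly like the sketch below. Everything here is hypothetical: `Doc`, `retrieve`, and `docs_for_cache` are illustrative names, and the word-overlap scoring is only a stand-in for a real RAG retriever.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Doc:
    name: str
    tags: frozenset
    text: str


def retrieve(docs, query):
    """Stand-in for a RAG query: return the doc sharing the most words with the query."""
    words = set(query.lower().split())
    return max(docs, key=lambda d: len(words & set(d.text.lower().split())))


def docs_for_cache(docs, query):
    """Find the best-matching doc, then select every doc that shares a tag with it."""
    hit = retrieve(docs, query)
    return [d for d in docs if d.tags & hit.tags]


docs = [
    Doc("bio", frozenset({"people"}), "Ada Lovelace wrote the first program"),
    Doc("team", frozenset({"people", "org"}), "the team roster lists engineers"),
    Doc("billing", frozenset({"finance"}), "invoices are sent monthly"),
]
selected = docs_for_cache(docs, "who is ada")
```

Here the retriever only has to locate one relevant document; the tags then decide which related documents get loaded into the context alongside it.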

@AvatarsGeek 7 months ago

Thank you very much for the information. This is definitely better than RAG.

How could we implement this in N8N?

@beckbeckend7297 7 months ago

Preloading whole database into context? 🤣

@tshock22 7 months ago

I don't feel like you described actual CAG... you left out the mechanism of the 'C' (cache). CAG actually caches the KV computed values of your static knowledge base in the first l[…]he model. Then, any incoming prompts are added as tokens AFTER that precomputed data. The model has to do far fewer computations to begin outputting the first tokens. This maximizes speed to first token, which is a huge part of building production chat/agents. However, this approach will require some model interface/API changes and is currently only supported in Gemini's latest offering, as far as I know. There is another video by IBM you can search for, "rag vs cag solving knowledge gaps", which goes into the caching mechanism.
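The caching mechanism described in this comment can be illustrated with a toy, single-layer attention sketch. It is not a real transformer and not any provider's API: `ToyAttention`, `attend`, and the hand-made key/value vectors are assumptions for illustration only, and the last prompt token's key doubles as the query to keep the example short. The point is just that keys and values for the static knowledge are computed once, and each prompt only pays for its own tokens.

```python
import math


def attend(query, keys, values):
    """Softmax attention of one query vector over all keys/values."""
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    peak = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    dims = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) / total
            for i in range(dims)]


class ToyAttention:
    def __init__(self):
        self.projections = 0  # how many tokens had keys/values computed

    def key_value(self, token):
        """Toy stand-in for the per-token key/value projection (the costly part)."""
        self.projections += 1
        codes = [ord(c) for c in token]
        key = [(sum(codes) % 7) / 7.0, len(token) / 10.0]
        value = [(sum(codes) % 11) / 11.0, (codes[0] % 5) / 5.0]
        return key, value

    def build_cache(self, tokens):
        """Precompute keys/values for the static knowledge once."""
        pairs = [self.key_value(t) for t in tokens]
        return [k for k, _ in pairs], [v for _, v in pairs]

    def answer(self, cache, prompt_tokens):
        """Append only the prompt's keys/values after the cached prefix."""
        keys, values = list(cache[0]), list(cache[1])  # copy: cache stays reusable
        for token in prompt_tokens:
            k, v = self.key_value(token)
            keys.append(k)
            values.append(v)
        return attend(keys[-1], keys, values)


model = ToyAttention()
knowledge = "the api returns json over https".split()
cache = model.build_cache(knowledge)          # paid once
output = model.answer(cache, ["which", "format"])  # pays for 2 tokens only
```

Reusing `cache` across many prompts gives the same attention output as recomputing everything from scratch, while the per-prompt work depends only on the prompt length.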
