ggml and llama.cpp join Hugging Face & Custom AI chips for fast inference - Hacker News (Feb 20, 2026)

February 20, 2026 · 20m 34s

Show Notes

Please support this podcast by checking out our sponsors:
- KrispCall: Agentic Cloud Telephony - https://try.krispcall.com/tad
- Discover the Future of AI Audio with ElevenLabs - https://try.elevenlabs.io/tad
- Build Any Form, Without Code with Fillout. 50% extra signup credits - https://try.fillout.com/the_automated_daily

Support The Automated Daily directly:
Buy me a coffee: https://buymeacoffee.com/theautomateddaily

Today's topics:
- ggml and llama.cpp join Hugging Face - ggml.ai’s core team, including llama.cpp maintainer Georgi Gerganov, is joining Hugging Face to scale “Local AI” support while keeping ggml-org projects open-source and community-governed. Keywords: ggml, llama.cpp, Hugging Face, GGUF, local inference, open source.
- Custom AI chips for fast inference - Startup Taalas claims it can compile an AI model into dedicated silicon in about two months, targeting sub-millisecond latency and radically lower cost and power. Keywords: custom silicon, Llama 3.1 8B, tokens/sec, DRAM-like density, quantization, inference cost.
- Gemini 3.1 Pro and agentic tools - Google rolls out Gemini 3.1 Pro in preview across AI Studio, Vertex AI, Android Studio, and consumer apps, pitching stronger reasoning and agent-ready workflows. Keywords: Gemini 3.1 Pro, ARC-AGI-2, Antigravity, Gemini API, NotebookLM, reasoning.
- Faster diffusion language model decoding - Together AI proposes Consistency Diffusion Language Models to speed diffusion-style text generation using block-wise causal attention, KV caching, and trajectory distillation. Keywords: diffusion language models, CDLM, KV cache, distillation, latency, GSM8K, MBPP.
- Learning codebases with visualization tooling - A developer shows that building a custom event visualizer can turn an unfamiliar codebase into something understandable, illustrated via Next.js Turbopack and a tricky tree-shaking bug. Keywords: Turbopack, Next.js, SWC, PURE annotations, scope hoisting, visualization.
- Web Components as framework alternative - An argument that the modern browser platform (Custom Elements, Shadow DOM, and events) can handle many UI needs without heavy frameworks, avoiding upgrade churn. Keywords: Web Components, Custom Elements, Shadow DOM, Custom Events, standards, React alternative.
- Raspberry Pi Pico 2 extreme overclocking - Pimoroni pushed the RP2350 in the Raspberry Pi Pico 2 to 800–860+ MHz using voltage mods and dry-ice cooling, noting RISC-V cores slightly outperform ARM per MHz. Keywords: RP2350, Pico 2, overclocking, dry ice, core voltage, CoreMark, RISC-V.
- C defer cleanup lands in compilers - C’s proposed defer cleanup feature is now practical: TS 25755 is finalized and Clang 22 ships support, with GCC implementations emerging and portability fallbacks available. Keywords: C defer, TS 25755, Clang 22, GCC, cleanup, resource safety.
- Hokusai sketches rediscovered in Europe - 103 “lost” Hokusai sketches for an unfinished ‘Great Picture Book of Everything’ resurfaced and were acquired by the British Museum, expanding access through digitization. Keywords: Hokusai, ukiyo-e, rediscovery, British Museum, provenance, digitization.
- Austin robotics and acoustic hiring - 9 Mothers is hiring on-site engineers in Austin across AI, computer vision, robotics, and acoustic DSP roles, with high salary bands and equity. Keywords: hiring, Austin, robotics, computer vision, DSP, machine learning, equity.

- https://github.com/ggml-org/llama.cpp/discussions/19759
- https://taalas.com/the-path-to-ubiquitous-ai/
- https://jimmyhmiller.com/learn-codebase-visualizer
- https://www.caimito.net/en/blog/2026/02/17/web-components-the-framework-free-renaissance.html
- https://blog.google/innovation-and-ai/models-and-research/gemini-models/gemini-3-1-pro/
- https://jobs.ashbyhq.com/9-mothers
- https://www.together.ai/blog/consistency-diffusion-language-models
- https://japan-forward.com/eternal-hokusai-the-rediscovery-of-103-hokusai-lost-sketches/
- https://learn.pimoroni.com/article/overclocking-the-pico-2
- https://gustedt.wordpress.com/2026/02/15/defer-available-in-gcc-and-clang/

Episode Transcript

ggml and llama.cpp join Hugging Face
Let’s start with what might be the most consequential community news for local AI.

The founding team behind ggml and llama.cpp—basically the toolkit and reference implementation that made running large language models on everyday CPUs feel normal—announced they’re joining Hugging Face. Georgi Gerganov, the main maintainer of llama.cpp, frames it as a practical move: local inference is advancing quickly, the user base is huge, and sustaining that pace needs durable resources.

What’s important in the announcement is what’s not changing. The ggml-org projects are staying open-source, community-driven, and governed the same way they’ve been governed: technical decisions continue to live with the people actually doing the engineering, and the broader community process remains in charge. The team says they’ll keep leading ggml and llama.cpp full-time, just with Hugging Face backing them so maintenance, support, and forward progress aren’t held together by heroics.

The post also gives Hugging Face credit for tangible work that many users already benefit from: things like an inference server and user interface, multimodal support, new model-architecture implementations, improved GGUF compatibility, and integrations that let llama.cpp-based deployments show up in Hugging Face’s Inference Endpoints.

Looking ahead, they’re aiming for tighter integration with Transformers—described as closer to “single-click” workflows—so model support and quality control improve without forcing users to become build-system experts. And there’s a second track: packaging and UX. The message is pretty clear: if local AI is going to be a real alternative to cloud inference for everyday people, it can’t feel like a weekend project. The long-term vision is an efficient, device-based inference stack that keeps advanced capabilities in the open ecosystem.

Custom AI chips for fast inference
Staying on the theme of inference—now let’s jump from open-source software to custom silicon.

A company called Taalas argues that AI’s biggest bottlenecks aren’t just model quality. They say adoption is being throttled by latency and by the sheer cost of running models at scale. Their critique is blunt: today’s state-of-the-art inference often depends on room-sized, power-hungry systems with complex memory stacks, advanced packaging, and liquid-cooled data centers. And that’s awkward if you want agentic systems that react in milliseconds, not seconds.

Taalas claims it has a platform that can turn “any AI model into custom silicon” in about two months. The pitch is “total specialization”: you don’t run a model on a general GPU; you generate a chip that is the model. Their other key idea is merging storage and computation on one chip, aiming for DRAM-like density, and simplifying the whole hardware stack so you don’t need HBM, 3D stacking, liquid cooling, or ultra-fast I/O.

Their first product is provocative: a hard-wired Llama 3.1 8B. They’re offering it through a chatbot demo and an API access program. The headline number is 17,000 tokens per second per user, and they claim roughly 10x speed, 20x lower build cost, and 10x lower power than software-based inference—based on comparisons they cite against Nvidia and other providers.
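
To put that headline number in perspective, here is a quick back-of-the-envelope calculation. It uses only the 17,000 tokens per second figure that Taalas claims; the 500-token response length is an arbitrary assumption for illustration, and a real request would also spend time on prompt processing and the network.

```python
# Back-of-the-envelope latency from a claimed per-user throughput.
# The 17,000 tokens/sec figure is Taalas's claim; the response length
# below is a hypothetical value chosen only for illustration.

CLAIMED_TOKENS_PER_SECOND = 17_000
ASSUMED_RESPONSE_TOKENS = 500  # hypothetical response length


def per_token_latency_us(tokens_per_second: float) -> float:
    """Average time per generated token, in microseconds."""
    return 1_000_000 / tokens_per_second


def response_time_ms(num_tokens: int, tokens_per_second: float) -> float:
    """Time to generate num_tokens at a steady rate, in milliseconds."""
    return 1_000 * num_tokens / tokens_per_second


if __name__ == "__main__":
    latency = per_token_latency_us(CLAIMED_TOKENS_PER_SECOND)
    total = response_time_ms(ASSUMED_RESPONSE_TOKENS, CLAIMED_TOKENS_PER_SECOND)
    print(f"{latency:.1f} microseconds per token")
    print(f"{total:.1f} ms for {ASSUMED_RESPONSE_TOKENS} tokens")
```

At the claimed rate that works out to roughly 59 microseconds per token, or about 29 milliseconds for a 500-token answer, which is why the pitch leans so heavily on agentic, react-in-milliseconds use cases.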

There are caveats, and they admit them. Their first-gen silicon uses a custom 3-bit datatype with aggressive 3- and 6-bit quantization. That can mean quality regressions versus GPU baselines, which matters if you’re doing nuanced reasoning or delicate instruction-following. They say LoRA fine-tuning is supported, and context window size remains configurable, but you should still read this as “early hardware with tradeoffs,” not magic.
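
To get a feel for why low-bit weights can cost quality, here is a minimal sketch of plain uniform symmetric quantization. This is a generic textbook scheme and an assumption on my part; it is not Taalas's custom datatype, whose details aren't described here, and the sample weights are made up.

```python
# Illustrative only: uniform symmetric quantization at a given bit width.
# This is NOT Taalas's custom format, just a generic scheme showing that
# fewer bits mean a coarser grid and larger rounding error.

def quantize(weights: list[float], bits: int) -> list[float]:
    """Snap each weight to the nearest level of a symmetric signed grid."""
    qmax = 2 ** (bits - 1) - 1          # 3 levels per side for 3-bit, 31 for 6-bit
    max_abs = max(abs(w) for w in weights)
    if max_abs == 0:
        return list(weights)
    scale = max_abs / qmax              # width of one quantization step
    return [round(w / scale) * scale for w in weights]


def mean_abs_error(original: list[float], approx: list[float]) -> float:
    """Average absolute difference between two equally long lists."""
    return sum(abs(a - b) for a, b in zip(original, approx)) / len(original)


# Hypothetical weights, chosen only for the demonstration.
SAMPLE_WEIGHTS = [0.82, -0.41, 0.07, -0.93, 0.30, 0.55]

if __name__ == "__main__":
    for bits in (3, 6):
        error = mean_abs_error(SAMPLE_WEIGHTS, quantize(SAMPLE_WEIGHTS, bits))
        print(f"{bits}-bit mean absolute error: {error:.4f}")
```

On this toy input the 3-bit grid has only seven levels, so its rounding error is several times larger than the 6-bit one. Real schemes work hard to shrink that gap, but the basic tradeoff is the same.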

They also sketch a roadmap: a second-generation platform moving toward standardized 4-bit floating-point formats, a mid-sized reasoning model coming in spring on their first-gen hardware, and a “frontier” model planned for winter on a denser follow-up.

Whether you buy the claims or not, the interesting signal here is the return of specialization. If inference cost and latency are the wall, a lot of smart people will try to break it with model-specific compute, not just better kernels.

Gemini 3.1 Pro and agentic tools
Now, from chips to cloud—and specifically Google’s cloud and dev tools.

Google announced Gemini 3.1 Pro, positioning it as the upgraded “core intelligence” for tasks where a quick response isn’t enough—think deeper reasoning, synthesis, and complex workflows. Developers can access it in preview via the Gemini API in Google AI Studio, plus Gemini CLI, Android Studio, and Google’s agent development platform called Antigravity. Enterprises get it through Vertex AI and Gemini Enterprise. Consumers will see it in the Gemini app and NotebookLM, though with the usual tiers: higher limits for Pro and Ultra subscribers, and NotebookLM access tied to those plans.

Google is making a benchmark claim too: on ARC-AGI-2, they report a verified score of 77.1%, describing it as more than double Gemini 3 Pro’s reasoning performance. Benchmarks always need context—dataset specifics, prompting, and evaluation harnesses matter—but it signals the kind of result Google wants associated with this release: better generalization to novel logic patterns.

The demos in the announcement lean toward “capable assistant that can actually build.” Examples include generating animated SVG assets from text prompts, configuring a live dashboard that visualizes the ISS orbit using a public telemetry stream, and coding a hand-tracked 3D experience with generative audio. Another demo is more design-forward: building a portfolio site themed around Wuthering Heights, translating mood and tone into interface and code rather than doing a simple book report.

The big picture is that Google is pushing Gemini 3.1 Pro as a baseline model for agentic workflows—where the model doesn’t just answer, it plans, writes, runs, iterates, and coordinates tools. Preview now, general availability “soon,” as they put it.

Faster diffusion language model decoding
Let’s stay with model efficiency, but shift from hardware specialization to algorithmic acceleration.

Together AI published a post on diffusion language models—DLMs—which generate text by iteratively refining a masked sequence, rather than emitting tokens one-by-one like standard autoregressive models. Diffusion approaches can be appealing because they can finalize multiple tokens in parallel and use bidirectional context, which is useful for infilling and refinement.

But DLMs have two practical headaches. First, full bidirectional attention makes typical KV caching difficult, so each denoising step can be expensive. Second, high quality often needs lots of refinement steps, which inflates latency.

Their proposed fix is Consistency Diffusion Language Models, or CDLM, which is a post-training recipe designed to make inference much faster without losing much quality. The idea is to train a “student” model using offline trajectories from a “teacher” DLM, where generation happens block-wise. They use a block-wise causal attention mask: the model can attend to the prompt, previously completed blocks, and the current block. That enables exact KV caching for finalized blocks, while still letting the model do within-block bidirectional refinement.
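
The attention pattern described above can be sketched as a simple boolean mask. This is a minimal illustration under stated assumptions, not Together AI's code: the function name and sizes are hypothetical, and treating the prompt as one fully visible block is my own simplification.

```python
# Sketch of a block-wise causal attention mask.
# mask[q][k] is True when query position q may attend to key position k.
# Assumption: the prompt is treated as a single block that sees only itself.

def block_causal_mask(prompt_len: int, block_size: int, num_blocks: int) -> list[list[bool]]:
    total = prompt_len + block_size * num_blocks

    def block_of(pos: int) -> int:
        # The prompt is block -1; generated blocks are numbered 0, 1, 2, ...
        if pos < prompt_len:
            return -1
        return (pos - prompt_len) // block_size

    # A query sees every key in its own block (bidirectional inside the block)
    # and in all earlier blocks, but nothing from later blocks.
    return [
        [block_of(k) <= block_of(q) for k in range(total)]
        for q in range(total)
    ]


if __name__ == "__main__":
    # Hypothetical sizes: 2 prompt tokens, then 2 blocks of 2 tokens each.
    for row in block_causal_mask(prompt_len=2, block_size=2, num_blocks=2):
        print("".join("1" if allowed else "0" for allowed in row))
```

Because positions in a finished block never look at later blocks, their keys and values don't change as generation moves on. That is what makes caching them exact rather than approximate.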

Training combines three losses: a distillation loss that matches teacher predictions on newly unmasked positions, a consistency loss that stabilizes predictions for still-masked positions as a block transitions from incomplete to complete, and an auxiliary masked-denoising objective on randomly masked real text—intended to preserve general capability, including math.

At inference time, CDLM decodes block-wise, reuses caches, and uses confidence thresholds to finalize tokens in parallel within a block, with early stopping when it hits end-of-text.
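
Here is a toy sketch of the thresholded, parallel finalization step inside a single block. Everything in it is hypothetical: the predict callback, the confidence numbers, and the rule that at least one token is accepted per step so the loop always makes progress. It is meant to show the shape of the idea, not CDLM's actual decoder.

```python
# Toy sketch: finalize every masked token whose confidence clears a threshold.
# The predictor, confidences, and fallback rule are assumptions for illustration.

MASK = "<mask>"


def decode_block(predict, block_size, threshold=0.9, max_steps=32):
    """Refine one block until no masked positions remain; return (block, steps)."""
    block = [MASK] * block_size
    steps = 0
    while MASK in block and steps < max_steps:
        proposals = predict(block)  # {position: (token, confidence)}
        masked = [i for i, tok in enumerate(block) if tok == MASK]
        accepted = [i for i in masked if proposals[i][1] >= threshold]
        if not accepted:
            # Assumed fallback: accept the single most confident position.
            accepted = [max(masked, key=lambda i: proposals[i][1])]
        for i in accepted:
            block[i] = proposals[i][0]
        steps += 1
    return block, steps


# Hypothetical stand-in for a model: confidence rises as context fills in.
TARGET = ["The", "answer", "is", "42"]
BASE_CONFIDENCE = [0.95, 0.60, 0.92, 0.55]


def toy_predict(block):
    filled = sum(tok != MASK for tok in block)
    return {
        i: (TARGET[i], min(1.0, BASE_CONFIDENCE[i] + 0.2 * filled))
        for i, tok in enumerate(block)
        if tok == MASK
    }


if __name__ == "__main__":
    tokens, steps = decode_block(toy_predict, block_size=4)
    print(tokens, "in", steps, "steps")
```

In this toy run the four-token block finishes in two refinement steps instead of four. An outer loop over blocks would reuse the cache for each finished block and stop as soon as an end-of-text token is finalized.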

In their reported results on Dream-7B-Instruct, they claim roughly 4 to almost 8 times fewer refinement steps, translating to large latency wins—up to 11x on GSM8K-CoT and 14x on MBPP-Instruct—while keeping accuracy changes small on most benchmarks. They also note a subtle behavior: outputs can be shorter under block-causal decoding while pass-at-1 stays similar.

If you’re watching the broader landscape, this is part of the same story as custom silicon and local inference: everyone is hunting for ways to buy back latency and cost, and it’s happening at every layer—math, kernels, architecture, and hardware.

Learning codebases with visualization tooling
Switching gears to developer craft—two posts that complement each other nicely: one about understanding a complex codebase, and another about reducing dependency on heavyweight frameworks.

First, a software engineer, Jimmy Miller, argues that one of the most underused strategies for learning an unfamiliar codebase is building a visualizer that shows what the system is doing over time. He demonstrates this by digging into the Next.js repo, focusing on Turbopack, the Rust-based bundler.

Rather than trying to land a perfect fix immediately, he starts with an inventory of the repository and chooses a recent Turbopack bug report as a “learning probe.” The case: a dead TypeScript enum remains in Turbopack output but disappears when building with Webpack—classic “why didn’t tree shaking do its job?” territory.

He hits an unexpected obstacle almost immediately: local packaging. A pnpm command that creates a Next.js tarball is producing an oddly tiny archive that’s missing native binaries, because a regex-based deduplication step is accidentally filtering out the native directory. Fix that, and now he can instrument and trace.

Then the real work begins. Turning on an experimental tree-shaking flag triggers an internal out-of-bounds error, which becomes an important reminder: labels like “tree shaking” or “scope hoisting” are only hints. The understanding comes from observing the pipeline.

So he builds a WebSocket event visualizer that records key steps as modules are transformed and emitted into server chunks. The visualization reveals the crucial clue: with scope hoisting enabled, a /* #__PURE__ */ annotation gets dropped. Without that annotation, later minification can’t prove the enum is removable, so the dead code stays.

He traces the root cause to the seam between SWC and Turbopack: SWC represents PURE as a special sentinel BytePos, and Turbopack encodes cross-module byte positions into a single u32. That encoding misinterprets the sentinel. A minimal fix is to bypass encoding for PURE positions, treating them like dummy positions—though he also warns there may be other sentinels and decode behaviors worth auditing.
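
To see how a sentinel can get mangled by that kind of packing, here is a small simulation. The bit layout is hypothetical, since the real Turbopack encoding isn't spelled out here, and the sentinel value is an assumed stand-in for SWC's PURE marker. The real code is Rust; this only mimics the u32 arithmetic.

```python
# Simulation of packing (module index, byte position) into one 32-bit value.
# MODULE_BITS and PURE_SENTINEL are assumptions made for illustration only.

U32_MAX = 0xFFFF_FFFF
PURE_SENTINEL = U32_MAX - 1        # assumed stand-in for the PURE marker
MODULE_BITS = 8                    # hypothetical: top 8 bits hold the module index
POS_BITS = 32 - MODULE_BITS
POS_MASK = (1 << POS_BITS) - 1


def encode_naive(module_index: int, pos: int) -> int:
    """Pack both fields into a u32, treating every position the same way."""
    return ((module_index << POS_BITS) | (pos & POS_MASK)) & U32_MAX


def encode_fixed(module_index: int, pos: int) -> int:
    """Same packing, but sentinel positions pass through untouched."""
    if pos == PURE_SENTINEL:
        return pos
    return encode_naive(module_index, pos)


def decode(encoded: int) -> tuple[int, int]:
    """Split a packed value back into (module index, byte position)."""
    return encoded >> POS_BITS, encoded & POS_MASK


def is_pure(pos: int) -> bool:
    return pos == PURE_SENTINEL


if __name__ == "__main__":
    print(is_pure(encode_naive(3, PURE_SENTINEL)))   # sentinel is lost
    print(is_pure(encode_fixed(3, PURE_SENTINEL)))   # sentinel survives
    print(decode(encode_fixed(3, 100)))              # ordinary positions round-trip
```

The naive version overwrites the sentinel's high bits with the module index, so later stages no longer recognize it as PURE. The bypass keeps it intact, and the same kind of check would apply to any other sentinel values that pass through the encoder.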

Finally, he extends the visualization idea to Turbopack’s incremental computation system—ValueCell, Vc, turbo-tasks—building an interactive tool that displays pending tasks and dependencies so the architecture “clicks.” The takeaway is pragmatic: if you can’t see it, you’ll guess wrong. Build tools that make the system observable, and learning accelerates.

Web Components as framework alternative
Related, Stephan Schwab makes the case that you can build sophisticated, reactive web interfaces without adopting a heavyweight JavaScript framework—because the browser platform already contains the primitives people used to reach for frameworks to get.

His argument is that the shift already happened: Custom Elements, Shadow DOM, templates and slots, plus a mature event model, are stable across major browsers. In that framing, the web platform itself becomes “the framework,” and you can step off the treadmill of major-version churn, deprecated patterns, and constantly shifting build setups.

A key pattern he emphasizes is communication through Custom Events. Events can bubble up the DOM and, if you set composed: true, they can cross Shadow DOM boundaries. That gives you a clean way to keep components decoupled without immediately reaching for global state stores or prop-drilling everything through intermediate layers.

Data can flow downward through attributes and properties, and lifecycle hooks—like attributeChangedCallback—let components react when inputs change. He shows minimal examples and encourages starting simple rather than treating Web Components as an all-or-nothing rewrite.

One practical architecture he outlines is an event-driven dashboard: a filter component emits a filters-changed event; a dashboard shell listens and coordinates; panels expose an applyFilters method and remain independently testable.

He also calls out a benefit many teams feel viscerally: Shadow DOM encapsulation that actually prevents style leakage, reducing the need for elaborate naming conventions or CSS-in-JS just to keep things from colliding.

Frameworks still have their place, he says—especially where teams already have deep expertise or the broader ecosystem expects a framework—but hybrid adoption is often easy because many frameworks can consume Web Components. His overall claim is simple: standards last, and developer patience for ecosystem complexity is running thin.

Raspberry Pi Pico 2 extreme overclocking
Now for hardware fun that’s still genuinely informative.

Pimoroni’s Mike Bell explored how far the Raspberry Pi Pico 2’s RP2350 can be overclocked. The key enabling detail is power: unlike the older RP2040 setup that effectively capped core voltage around 1.3 volts, the RP2350 regulator can have its voltage limit disabled, opening the door to more ambitious experiments.

Using MicroPython to request different voltages, he maps out stable points: roughly 312 MHz at 1.1V, 420 MHz at 1.3V, 512 MHz at 1.5V, and about 570 MHz at 1.7V—at which point the chip starts getting properly hot. Add a heatsink and fan, and he’s seeing around 636 MHz at 1.9V and 654 MHz at 2.0V, though the onboard regulator can’t really deliver the very top requested voltages due to current limits.
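
A quick bit of arithmetic on those stable points shows where the returns start to fall off. The data comes straight from the figures above; the helper name is just for illustration, and this runs on a desktop Python rather than on the Pico itself.

```python
# Stable points quoted above: (core voltage in millivolts, frequency in MHz).
# The last two points were measured with a heatsink and fan attached.
STABLE_POINTS = [
    (1100, 312),
    (1300, 420),
    (1500, 512),
    (1700, 570),
    (1900, 636),
    (2000, 654),
]


def mhz_gained_per_100mv(points):
    """Frequency gained per extra 100 mV between consecutive stable points."""
    gains = []
    for (v0, f0), (v1, f1) in zip(points, points[1:]):
        gains.append((f1 - f0) / ((v1 - v0) / 100))
    return gains


if __name__ == "__main__":
    steps = zip(STABLE_POINTS, STABLE_POINTS[1:], mhz_gained_per_100mv(STABLE_POINTS))
    for (v0, _), (v1, _), gain in steps:
        print(f"{v0 / 1000:.1f}V -> {v1 / 1000:.1f}V: {gain:.0f} MHz per 100 mV")
```

The gain drops from 54 MHz per 100 mV at the low end to 18 MHz per 100 mV at the top. The step from 1.7V to 1.9V isn't strictly comparable with the earlier ones, because that is where the heatsink and fan were added.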

Then the team goes beyond “reasonable overclocking” into controlled madness: dry ice cooling—around minus 80 degrees Celsius—paired with a more serious test harness. They use CoreMark on both cores, UART output to avoid USB overhead, an external reference clock counted via PIO for accurate timing, and even extra boards acting as bridges and displays to track telemetry.

With the Pico 2 buried in dry ice and still using the internal regulator, they manage a stable 700 MHz. To push further they disable the onboard regulator and inject core voltage via a bench supply at a test point on the board. That gets them to 800 MHz at around 2.8V. They briefly hit about 840 MHz, but it proves unstable, likely due to heating under roughly an amp of current draw and some voltage drop from grounding limitations.

The best sample—because there’s chip-to-chip variation—hit a clean 861.6 MHz at about 3.05V, and it ran for close to a minute at 873.5 MHz before crashing. 888 MHz wouldn’t run.

One more interesting nugget: they benchmarked the RP2350’s RISC-V cores and found CoreMark per MHz was just under 5% better than ARM cores. That’s not a revolution, but for integer-heavy embedded workloads it’s a meaningful nudge.

The conclusion is sensible: the RP2350 is tougher than you’d expect, but past about 700 MHz you’re in diminishing returns unless cooling gets extreme. For everyday use, they suggest something like 1.6V for around 500 MHz—still with the big asterisk that long-term stress testing is what really matters.

C defer cleanup lands in compilers
Back to programming languages—C, specifically.

Jens Gustedt reports that C’s proposed defer cleanup feature is now effectively usable in mainstream compilers. Two developments make that claim credible: the design has been finalized as Technical Specification TS 25755, edited by JeanHeyd Meneide and moving through ISO publication, and both GCC and Clang have been implementing it.

Gustedt says he’s already used Clang’s implementation, available starting with clang-22. For some setups, he notes you may need a compiler flag like -fdefer-ts if _Defer isn’t recognized out of the box. He hasn’t personally tested GCC’s native implementation yet, but it’s in progress.

The value proposition will be familiar to anyone who’s written C with multiple early returns: defer can centralize cleanup without goto ladders, reduce leaks, prevent mutexes from being left locked, and generally make error-handling less fragile.

He also provides a portability header strategy: if the implementation provides a dedicated defer header, use it; otherwise, for newer GCC versions, map defer to _Defer and use a fallback built on the __cleanup__ attribute and __COUNTER__. That fallback relies on GCC nested functions, and he argues it does not produce trampolines or hidden functions in the executable, so it shouldn’t create extra exploit surface.

Two practical examples sum it up: defer { free(BigArray); } right after malloc, and defer { mtx_unlock(&mtx); } right after locking. One caution: with the fallback approach, keep defer usage inside curly braces so scope behaves the way you intend.

If you’ve ever audited a C codebase for cleanup bugs, this is the kind of small language feature that can have an outsized effect on reliability.

Hokusai sketches rediscovered in Europe
A quick detour into art history—because Hacker News has range.

An article recounts how 103 sketches by Katsushika Hokusai resurfaced in Europe in 2019. They were created for an ambitious but unfinished project called The Great Picture Book of Everything. The sketches had effectively been “lost” to public view for decades—last seen at a Paris auction 71 years earlier—and, unlike many Japanese works that later returned to Japan through collectors and institutions, these stayed abroad.

The piece traces their provenance through the late-19th-century European craze for Japanese prints and a major French collector, Henri Vever, whose seals were later noticed on the sheets. When a French collector brought the sketches to the Piasa auction house in Paris, they were estimated at about 20,000 euros but sold for roughly six times that amount.

Collector Israel Goldman recognized the seals and connected with Tim Clark, formerly the British Museum’s head of Japanese art, who reportedly hadn’t even heard of this specific “Picture Book of Everything” project—suggesting just how obscure these works were in modern scholarship.

With support from an art fund, the British Museum acquired the set, calling it potentially one of the major art discoveries of the 21st century. The subject matter is broad—figures from mythology and religion, animals, supernatural beings, landscapes—showing Hokusai’s range and curiosity. The museum planned to digitize and publish the sketches online, which is the part I always like: discoveries are great, but access is better.

Austin robotics and acoustic hiring
Finally, one item for anyone listening with a résumé tab open.

9 Mothers is hiring across multiple software engineering roles in Austin, Texas, and they’re explicitly on-site, full-time positions. The listings include AI Engineer, Computer Vision Engineer, and Robotics Engineer, with compensation stated in a wide band—150 to 400 thousand dollars—plus equity in the 0.1% to 0.5% range.

They also list several acoustic-specialty roles—Applied Machine Learning Engineer focused on acoustics, a DSP Engineer for acoustics, and a Systems Engineer in that domain—with salary ranges around 150 to 250 thousand, again with equity.

The postings mention multiple pay ranges for some roles, which usually signals level-based compensation—so if you’re considering it, expect the “where you land” part to depend on experience and scope.

Visit our website at https://theautomateddaily.com/