
ThursdAI Oct-26, Jina Embeddings SOTA, Gradio-Lite, Copilot crossed 100M paid devs, and more AI news
ThursdAI - The top AI news from the past week · Alex Volkov, Bo, Abubakar Abid, Xenova, and Nisten
Show Notes
ThursdAI October 26th
Timestamps and full transcript for your convenience
## [00:00:00] Intro and brief updates
## [00:02:00] Interview with Bo Wang, author of Jina Embeddings V2
## [00:33:40] Hugging Face open sourcing a fast Text Embeddings
## [00:36:52] Data Provenance Initiative at dataprovenance.org
## [00:39:27] LocalLLama effort to compare 39 open source LLMs +
## [00:53:13] Gradio Interview with Abubakar, Xenova, Yuichiro
## [00:56:13] Gradio effects on the open source LLM ecosystem
## [01:02:23] Gradio local URL via Gradio Proxy
## [01:07:10] Local inference on device with Gradio - Lite
## [01:14:02] Transformers.js integration with Gradio-lite
## [01:28:00] Recap and bye bye
Hey everyone, welcome to ThursdAI. This is Alex Volkov, and I'm very happy to bring you another weekly installment of ThursdAI.
TL;DR of all topics covered:
* Open Source LLMs
* JINA - jina-embeddings-v2 - First OSS embeddings models with 8K context (Announcement, HuggingFace)
* Simon Willison guide to Embeddings (Blogpost)
* Hugging Face - Text embeddings inference (X, Github)
* Data Provenance Initiative - public audit of 1800+ datasets (Announcement)
* Huge open source LLM comparison from r/LocalLLama (Thread)
* Big CO LLMs + APIs
* NVIDIA research new spin on Robot Learning (Announcement, Project)
* Microsoft / Github - Copilot crossed 100 million paying users (X)
* RememberAll open source (X)
* Voice
* Gladia announces multilingual near real time whisper transcriptions (X, Announcement)
* AI Art & Diffusion
* Segmind releases SSD-1B - 50% smaller and 60% faster version of SDXL (Blog, Hugging Face, Demo)
* Prompt techniques
* How to use seeds in DALL-E to add/remove objects from generations (by - Thread)
This week was a mild one in terms of updates. Believe it or not, we didn't get a new state-of-the-art open source large language model this week. However, we did get a new state-of-the-art embeddings model from Jina AI (supporting 8K sequence length).
We also had quite the quiet week from the big dogs. OpenAI is probably sitting on updates until Dev Day (which I'm going to cover for all of you, thanks to Logan for the invite), Google had some leaks about Gemini (we're waiting!) and another AI app builder thing, Apple is teasing new hardware (but nothing AI related) coming soon, and Microsoft / Github announced that Copilot has 100 million paying users! (I tweeted this, and Idan Gazit, Sr. Director of GithubNext, where Copilot was born, tweeted that "we're literally just getting started" and mentioned November 8th as... a date to watch, so mark your calendars for some craziness in the next two weeks.)
Additionally, we covered the Data Provenance Initiative, which helps sort and validate licenses for over 1800 public datasets, a massive effort led by Shayne Longpre with assistance from many folks including friend of the pod Enrico Shippole. We also covered another massive evaluation effort by a user named WolframRavenwolf on the LocalLLama subreddit, who evaluated and compared 39 open source models and GPT-4. Not surprisingly, the best model right now is the one we covered last week, OpenHermes 7B from Teknium.
Two additional updates were covered. First, Gladia AI released their version of Whisper over WebSockets, and I covered it on X with a reaction video. It allows developers to stream speech to text with very low latency, and it's multilingual as well, so if you're building an agent that folks can talk to, definitely give this a try. Finally, we covered Segmind SSD-1B, a distilled version of SDXL that is 50% smaller in size and 60% faster in generation speed (you can play with it here).
This week I was lucky to host 2 deep dive conversations. The first was with Bo Wang from Jina AI; we covered embeddings, vector latent spaces, dimensionality, and how they retrained BERT to allow for longer sequence lengths. It was a fascinating conversation, and even if you don't understand what embeddings are, it's well worth a listen.
And in the second part, I had the pleasure to have Abubakar Abid, head of Gradio at Hugging Face, to talk about Gradio and its effect on the open source community. We were then joined by Yuichiro and Xenova to talk about the next iteration of Gradio, called Gradio-lite, which runs completely within the browser, no server required.
A fascinating conversation. If you're a machine learning engineer, AI engineer, or just someone who is interested in this field, we covered a LOT of ground, including Emscripten, Python in the browser, Gradio as a tool for ML, WebGPU and much more.
I hope you enjoy this deep dive episode with 2 authors of the updates this week, and hope to see you in the next one.
P.S - if you've been participating in the emoji of the week, and have read all the way up to here, your emoji of the week is 🦾, please reply or DM me with it
Full Transcription:
[00:00:00] Alex Volkov: Hey, everyone. Welcome to ThursdAI. My name is Alex Volkov, and I'm very happy to bring you another weekly installment of ThursdAI. This week was actually a mild one in terms of updates, believe it or not. We didn't get a new state-of-the-art open source large language model this week. However, we did get a new state-of-the-art embeddings model, and we're going to talk about that. We got very lucky that one of the authors of this embeddings model, called Jina Embeddings V2, Bo Wang, joined us on stage and gave us a masterclass in embeddings and shared some very interesting things about it, including some stuff they haven't shared yet. So definitely worth a listen. Additionally, we covered the Data Provenance Initiative, which helps sort and validate licenses for over 1,800 public datasets. A massive effort led by Shayne Longpre with assistance from many folks, including a friend of the pod, Enrico Shippole.
[00:01:07] We also covered a massive effort by a user named WolframRavenwolf on the LocalLLama subreddit. That effort evaluated and compared 39 open source models ranging from 7 billion parameters to 70 billion parameters, and threw in a GPT-4 comparison as well. Not surprisingly, the best model right now is the one we covered last week from friend of the pod Teknium, called OpenHermes 7B.
[00:01:34] Two additional updates we covered. One of them is Gladia AI, a company that offers transcription and translation APIs, which released their version of Whisper over WebSockets. So, live transcription. I covered it on X with a reaction video, and I'll add that link in the show notes. It allows developers like you to stream speech to text with very low latency and high quality, and it's multilingual as well. So if you're building an agent that your users can talk to, definitely give this a try. And finally, Segmind, a company that just decided to open source a distilled version of SDXL, making it 50% smaller in size and, in addition to that, 60% faster in generation speed. The links to all these will be in the show notes.
[00:02:23] But this week I was lucky to host two deep dives, one with Bo Wang, whom I mentioned. We covered embeddings, vector latent spaces, dimensionality, and how they retrained the BERT model to allow for a longer sequence length. It was a fascinating conversation. Even if you don't understand what embeddings are, it's well worth the listen. I learned a lot, and I hope you will as well. In the second part, I had the pleasure to have Abubakar Abid, the head of Gradio at Hugging Face, to talk about Gradio: what it is and its effect on the open source community. We were then joined by Yuichiro and Xenova to talk about the next iteration of Gradio, called Gradio-lite, that runs completely within the browser. No server required. We also covered a bit of what's coming to Gradio in the next release on October 31st.
[00:03:15] A fascinating conversation. If you're a machine learning engineer, AI engineer, or just somebody who's interested in this field, you've probably used Gradio, even if you haven't written any Gradio apps. Every model on Hugging Face usually gets a Gradio demo.
[00:03:30] And we covered a lot of ground, including Emscripten, Python in the browser, Gradio as a tool for machine learning, WebGPU, and so much more.
[00:03:38] Again, fascinating conversation. I hope you enjoy this deep dive episode. I'm humbled by the fact that sometimes the people who produce the updates we cover actually come to ThursdAI and talk to me about the things they released. I hope this trend continues, and I hope you enjoy this deep dive episode. I'll see you in the next one. And now I give you ThursdAI, October 26. Oh, awesome. It looks like Bo, you joined us. Let's see if you're connected to the audience. Can you unmute yourself, so we can see if we can hear you?
[00:04:22] Bo Wang: Hi, can you hear me? Oh, we can hear you fine, awesome. this, this, this feature of, of Twitter.
[00:04:30] Alex Volkov: That's awesome. This usually happens: folks join, it's their first space, and then they can't leave us. So let me just do a little... Actually, maybe you can do it, right? Let me just have you present yourself.
[00:04:42] I think I followed you a while ago, because I've been mentioning embeddings and the MTEB dashboard on Hugging Face for a while. And, obviously, embeddings are not a new concept, right? We started with Word2Vec ten years ago, but now, with the rise of LLMs and AI tools, and many people wanting to understand the similarity between the user query and an actual thing they stored in some database, embeddings have seen a huge boom.
[00:05:10] And we also saw all the vector databases pop up like mushrooms after the rain. I think Spotify just released a new one, and my tweet was like, hey, do we really need another vector database? But Bo, I think I started following you because you mentioned that you were working on something that's coming very soon, and finally this week it was released.
[00:05:25] So thank you for joining us, Bo, and thank you for doing your first ever Twitter space. Can we start with your introduction of who you are and how you are involved with this effort, and then we can talk about Jina?
[00:05:41] Bo Wang: Yes, sure. Basically I have a very different background. I'm originally from China, and my bachelor's was more related to text retrieval. I have retrieval experience rather than a pure machine learning background, I would say. Then I came to Europe. I came to the Netherlands seven or eight years ago as an international student.
[00:06:04] And I was really, really lucky and met my supervisor there. She basically guided me into the world of multimedia information retrieval, multimodal information retrieval, this kind of thing. That was around 2015 or 2016. I also picked up machine learning there, because when I was doing my bachelor's, it was not really hot at that moment.
[00:06:27] That was 2013, 2014. Then machine learning became really good, and I was really motivated: okay, how can I apply machine learning to search? That is my biggest motivation. So when I was doing my master's, I collaborated with my friends in the US, in China, in Europe. We started a project called MatchZoo.
[00:06:51] At that time, embedding-based search was basically nothing. We built open source software that became, at that time, the standard for neural retrieval or neural search, this kind of thing. Then when BERT got released, our project basically got [unclear], because everyone's focus shifted to BERT, but it's quite interesting.
[00:07:16] Then I graduated and started to work as a machine learning engineer for three years in Amsterdam. Then I moved to Berlin and joined Jina AI three years ago as a machine learning engineer. Then basically always doing neural search, vector search, how to use machine learning to improve search. That is my biggest motivation.
[00:07:37] That's it.
[00:07:38] Alex Volkov: Awesome. Thank you for sharing with us and for coming up. Jina AI is the company that you're now working at, and the embeddings that we're going to talk about are from Jina AI. I will just mention the one thing that I missed in my introduction, which is the reason why embeddings are so hot right now.
[00:07:53] The reason why vector DBs are so hot right now is that pretty much everybody does RAG, Retrieval Augmented Generation. And obviously, for that, you have to store some information in embeddings, you have to figure out how to do chunking of your text, you have to figure out how to do the retrieval, all these things.
[00:08:10] Many people understand that while in-context learning is this incredible thing for LLMs, and you can do a lot with it, you may not want to spend as many tokens of your allowance, right? Or you may not have enough room in the context window in some other LLMs. So embeddings are one of the main ways for us to interact with these models right now, via RAG.
[00:08:33] And I think we covered open source embeddings compared to OpenAI's ada-002 embedding model a while ago on ThursdAI. And I think it's been clear that models like GTE and BGE, I think those are the top ones, at least before you guys released, on the Hugging Face big embedding model leaderboard, and thank you Hugging Face for doing this leaderboard.
[00:09:02] They are great for open source, but I think recently it was talked about they're lacking some context. And Bo, if you don't mind, please present what you guys open sourced this week, or released this week, I guess it's open source as well. Please talk through Jina Embeddings v2 and how it differs from everything else we've talked about.
[00:09:21] Bo Wang: Okay, good. Basically, we've been working on embeddings for, how can I say, maybe two and a half years. But previously we were doing it at a much smaller scale. We built all the algorithms, all the platform, even a cloud fine-tuning platform to help people build better embeddings. So there is a not really open source, but closed source project called Finetuner, which we built to help users build better embeddings.
[00:09:53] But we found that, okay, maybe we were too early, because people were not even using embeddings. How could they fine-tune embeddings? So we decided to make a move. We basically scaled up our, how can I say, ambition. We decided to train our own embeddings. So six months ago, we started to train from scratch, but not really from scratch, because in embedding training, normally you have to train in two stages.
[00:10:23] In the first stage, you need to pre-train on a massive scale of text pairs. Your objective is to bring these text pairs as close as possible, because these text pairs should be semantically related to each other. In the next stage, you need to fine-tune with carefully selected triplets, all this kind of thing.
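The two training stages Bo describes can be illustrated with two common contrastive objectives. This is a minimal pure-Python sketch, not Jina's training code: the transcript does not say which losses were used, so the in-batch InfoNCE loss for pairs, the margin loss for triplets, and the `temperature` and `margin` values are all assumptions chosen for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce_loss(queries, positives, temperature=0.05):
    """Stage 1 (pairs): each query should score highest against its own
    positive; the other positives in the batch act as negatives."""
    total = 0.0
    for i, query in enumerate(queries):
        logits = [cosine(query, p) / temperature for p in positives]
        # Numerically stable log-sum-exp over the batch.
        max_logit = max(logits)
        log_sum = max_logit + math.log(sum(math.exp(l - max_logit) for l in logits))
        total += log_sum - logits[i]
    return total / len(queries)


def triplet_loss(anchor, positive, negative, margin=0.2):
    """Stage 2 (triplets): the positive must beat the negative by a margin."""
    return max(0.0, margin - cosine(anchor, positive) + cosine(anchor, negative))


# Toy 2-D "embeddings" (hypothetical values, not model outputs).
queries = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[1.0, 0.1], [0.1, 1.0]]   # each positive is near its query
swapped = [[0.1, 1.0], [1.0, 0.1]]   # each positive is near the wrong query

print(info_nce_loss(queries, aligned) < info_nce_loss(queries, swapped))  # True
```

The loss is small when matching pairs sit close together and large when they do not, which is the "bring these text pairs as close as possible" objective in code form.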
[00:10:43] So we basically started from scratch, by collecting data. I think it was like six months ago, we were working with three to four engineers together, basically scouting every possible pair from the internet. Then we created about 1.2 billion sentence pairs from there. And we started to train our model based on T5.
[00:11:07] Basically it's a very popular encoder-decoder model on the market. But if you look at the MTEB leaderboard or all the models on the market, the reason why they only support a 512 sequence length is that they are constrained by the backbone itself. We figured out another thing after we released the V1 model.
[00:11:31] Basically, if you look at the MTEB leaderboard, the massive text embedding leaderboard, that is the one Alex just mentioned. Sorry, it's really bad, because everyone is trying to overfit the leaderboard. That naturally happens, because if you look at BGE or GTE, the scores would never be that high if you didn't add the training data into the... That's really bad.
[00:12:00] And we decided to take a different approach. The first problem we want to solve is improving the quality of the embeddings. The second thing we want to solve is enabling users to have a longer context length. If we want users to have a longer context length, we have to rework the BERT model, because basically every embedding model's backbone is from BERT or T5.
[00:12:27] So we basically started from scratch. Why not just borrow the latest research from large language models? Every large language model wants a large context. Why not borrow those research ideas for masked language modeling? So we borrowed some ideas, such as rotary position embeddings or ALiBi, maybe you've heard of them, and reworked BERT.
[00:12:49] We call it JinaBERT. So now JinaBERT can handle much longer sequences. We trained BERT from scratch, and BERT has become a byproduct of our embeddings. Then we used this JinaBERT to contrastively train the models on the semantic pairs and triplets, which finally allows us to encode 8K content.
[00:13:15] Alex Volkov: Wow, that's impressive. Just to react to what you're saying, because pretty much everyone uses BERT, or at least used BERT, right? At least on the MTEB leaderboard. I've also noticed many other examples that use BERT or DistilBERT and stuff like this. What you're saying, if I'm understanding correctly, is that this was the limitation for sequence length
[00:13:36] for other embedding models in open source, right? And the OpenAI one, which is not open source, does have an 8,000 sequence length. Basically, sequence length, if I'm explaining correctly, is just how much text you can embed without chunking.
[00:13:51] Yes. And you're basically saying that you guys saw this limitation and then retrained BERT to use rotary embeddings. We've talked about rotary embeddings multiple times here. We had the folks behind the YaRN paper for extending context windows. As for ALiBi, we follow Ofir Press.
[00:14:08] I don't think Ofir ever joined ThursdAI, but Ofir, if you hear this, you're welcome to join as well. So ALiBi is another way to extend context windows, and I think the Mosaic folks used ALiBi, and some other folks as well. Bo, could you speak more about borrowing the concepts from there and retraining BERT into JinaBERT, and whether or not JinaBERT is also open source?
[00:14:28] Bo Wang: Oh, we actually want to make JinaBERT open source, but I need to align with my colleagues. That's a decision to be made. And the idea is quite naive. I don't want to dive too much into technical details, but basically the idea of ALiBi is to remove the position embeddings from large language model pre-training.
[00:14:55] And the ALiBi technique allows us to train on shorter sequences but run inference on very long sequences. If I remember correctly, the author of the ALiBi paper trained the model on 512 and 1,024 sequence lengths, but he was able to run inference on a 16K sequence length.
[00:15:23] You can't expand it further, because that's the limitation of hardware, the limitation of the GPU. So he actually tested a 16K sequence length. What we did is just borrow this idea from the autoregressive models into masked language models: integrate ALiBi, remove the position embeddings from BERT, and add the ALiBi slopes and all the ALiBi stuff back into BERT.
[00:15:49] And we borrowed some things about how to train BERT from RoBERTa, and retrained BERT. I never imagined BERT could be a byproduct of our embedding model, but this happened. We could open source it. Maybe I have to discuss with my colleagues.
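To make the ALiBi idea concrete, here is a small sketch of the bias that replaces learned position embeddings: a per-head slope times the distance between two tokens, added to the attention scores before softmax. The slope formula is the geometric sequence from the ALiBi paper for a power-of-two head count. The symmetric `abs(i - j)` distance is an assumption made here because BERT attends in both directions; the transcript does not spell out the exact variant used in JinaBERT.

```python
def alibi_slopes(num_heads):
    """Per-head slopes 2^(-8/n), 2^(-16/n), ... (num_heads a power of two)."""
    return [2 ** (-8 * (h + 1) / num_heads) for h in range(num_heads)]


def alibi_bias(seq_len, slope):
    """Bias matrix added to attention scores: zero on the diagonal and
    increasingly negative as tokens get farther apart."""
    return [[slope * -abs(i - j) for j in range(seq_len)] for i in range(seq_len)]


slopes = alibi_slopes(8)
short = alibi_bias(4, slopes[0])
print(short[0])  # [0.0, -0.5, -1.0, -1.5]

# Nothing here depends on a maximum length seen during training, so the
# same function produces a valid bias for a much longer sequence.
longer = alibi_bias(16, slopes[0])
print(len(longer), len(longer[0]))  # 16 16
```

Because the bias is computed from distances rather than looked up in a learned table of positions, a model trained on short sequences has a well-defined bias for longer ones, which is the property Bo points to.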
[00:16:09] Alex Volkov: Okay. So when you talk to your colleagues, tell them that first of all, you already said that you may do this on ThursdAI Stage.
[00:16:15] So your colleagues are welcome also to join. And when you open source this, you guys are welcome to come here and tell us about this. We love the open source. The more you guys do, the better. And the more it happens on ThursdAI Stage, the better, of course, as well. Bo, you guys released the Jina Embedding Version 2, correct?
[00:16:33] Jina Embeddings Version 2 has a sequence length of 8K tokens. Just for folks in the audience, 8,000 tokens is, I want to say, maybe around 6,000 words in English, right? And it differs for other languages as well. Could you talk about multilinguality as well? Is it multilingual, or is it only English?
[00:16:53] How does that appear within the embedding model?
[00:16:57] Bo Wang: Okay, actually, our Jina Embeddings V2 is English only, so it's a monolingual embedding model. If you look at the MTEB benchmark or all the public multilingual models, they are multilingual. But to be frank, I don't think this is a fair solution for that.
[00:17:18] I think at least every major language.
[00:17:24] We decided to choose another, harder way. We will not train a multilingual model, but we will train bilingual models. Our first targets will be German and Spanish. What we are doing at Jina AI is we basically fix our English embedding model, just keep it as is, but we are continuously adding the German data and the Spanish data into the embedding model.
[00:17:51] And our embedding model cares about two things. First, we make it bilingual, so it's German-English, Spanish-English, Japanese-English, whatever. And we want this embedding model to work well monolingually. So imagine you have a German-English embedding model.
[00:18:12] If you search in German, you'll get German results. If you use English, you'll get English results. But we also care about the cross-linguality of this bilingual model. So imagine you encode two sentences, one in German and one in English, with the same meaning. We also want these vectors to be mapped into a similar region of the semantic space.
[00:18:36] Because I'm a foreigner myself. Imagine I buy some stuff in the supermarket. Sometimes I have to use Google Translate to translate, for example, milk into Milch in German, then put it into the search box. I really want this bilingual model to happen. And I believe at least every major language deserves such an embedding model.
[00:19:03] Alex Volkov: Absolutely. And thanks for clarifying this, because one of the things that I often talk about here on ThursdAI, as the founder of Targum, which translates videos, is just how much language barriers prevent folks from conversing with each other. And definitely embeddings are the way people extend memory for LLMs, right?
[00:19:21] So it's a huge thing that you guys are working on, and especially helpful. I think we have a question from the audience: what does the sequence length actually allow people to do? I guess Jina and I worked with some other folks in the embedding space. Could you talk about what the longer sequence length is now unlocking for people who want to use open source embeddings?
[00:19:41] Obviously, my answer here is that OpenAI's embeddings are the ones most widely used, but those you have to use online: you have to send your data to OpenAI, you have to have a credit card with them, blah, blah, blah, you have to be from a supported country. Could you talk a little bit about what the sequence length unlocks once you guys release something like this?
[00:20:02] Bo Wang: Okay, actually, we didn't think too much about applications. Most of the vector embeddings applications, you can imagine search and classification. You build another layer of, I don't know, classifier to classify items based on the representation. You can build some clustering. You can do some anomaly detection on the NLP text.
[00:20:22] This is something I can imagine. But the most important thing, and I have to be frank with you because we are writing a technical report as well, something like a paper we may submit to an academic conference: longer embeddings don't always work. That is because sometimes, if the important message is at the front of the document you want to embed, then it makes the most sense just to encode, let's say, 256 tokens
[00:20:53] or 512. But sometimes you have a document where the answer is in the middle or at the end, and then you will never find it if the message is truncated. Another situation we find very interesting is clustering tasks. Imagine you want to visualize your embeddings. A longer sequence length almost always helps for clustering tasks.
[00:21:21] And to be frank, I don't care too much about the application. What we're offering is, how can I say, like a key. We unlock this 512 sequence length to 8K, and people can explore it. Let's say I only need 2K; then people just set the tokenizer max length to 2K
[00:21:44] and embed based on their needs. I just don't want people to be limited by the backbone, by the 512 sequence length. I think that's the most important thing.
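The truncation problem Bo describes is easy to see with a toy example. In this sketch, whitespace-separated words stand in for real tokens (an assumption: a real tokenizer splits text differently), and `truncate` is a hypothetical helper, not part of any library. The answer sits after 600 filler words, so a 512-token limit cuts it off while a 2K or 8K limit keeps it.

```python
def truncate(text, max_tokens):
    """Keep only the first max_tokens whitespace-separated tokens.
    A stand-in for the truncation a real tokenizer applies at its max length."""
    return " ".join(text.split()[:max_tokens])


# A 604-token document whose key information is at the very end.
document = " ".join(["filler"] * 600 + ["the", "answer", "is", "42"])

for limit in (512, 2048, 8192):
    visible = truncate(document, limit).split()
    print(limit, "answer" in visible)
# 512 False
# 2048 True
# 8192 True
```

Whatever falls past the limit never reaches the embedding model, so no amount of vector search can retrieve it. That is the case where a longer maximum length helps; when the key information is at the start of the document, a short limit is enough.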
[00:21:55] Alex Volkov: That's awesome. Thank you for that. Thank you for your honesty as well. I love it. I appreciate the fact that there's research and there's application, and you don't necessarily have to be limited by the application set in mind.
[00:22:07] We do research because you're just opening up doors. And I love hearing that. Bo, maybe the last thing that I would love to talk to you about, as the expert here, is the topic of dimensions. Dimensionality with embeddings I think is very important. OpenAI, I think, is one of the highest ones.
[00:22:21] The thing that they give us is like 1,200 dimensions as well. You guys, I think Jina is around 500 or so, is that correct? Could you talk a bit about that concept in broad strokes for people who may not be familiar? And then also talk about why the state-of-the-art OpenAI is so far ahead?
[00:22:39] And what will it take to get the open source embeddings to also catch up in dimensionality?
[00:22:46] Bo Wang: You mean the dimensionality of the vectors? Okay, basically we follow a very standard BERT size. The only things we modified are the ALiBi part and some of the training.
[00:22:58] Our small model's dimensionality is 512, and the base model's is 768. We also have a large model that hasn't been released because the training is too slow. Even though the model size is small, we have so much data to train on. The large model's dimensionality is 1,024. And if my memory is correct, OpenAI's ada-002 embedding has a dimensionality of 1,536, something like that, which is a very strange dimensionality, I have to say.
[00:23:23] I would say a longer vector might be better or more expressive, but a shorter one means that when you are doing vector search, it's going to be much faster.
[00:23:48] So it's something you have to balance, if you think the query speed or the retrieval speed is more important to you. And if I understand correctly, some of the vector databases make money by the dimensionality. They charge you by the dimensionality, so it's actually quite expensive if your dimensionality is too high.
[00:24:13] So it's a balance between expressiveness, speed, and the cost you want to invest. It's very hard to determine, but I think 512, 768, and 1,024 are very common, as in BERT.
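The cost side of that trade-off is simple arithmetic. The sketch below estimates raw index size for the dimensionalities mentioned in the conversation, assuming uncompressed 32-bit floats and a hypothetical collection of one million vectors; real vector databases add their own overhead and may compress, so treat the numbers as a lower-bound illustration.

```python
BYTES_PER_FLOAT32 = 4


def index_size_gb(num_vectors, dim):
    """Raw storage for num_vectors float32 vectors of the given dimension,
    in decimal gigabytes (no index overhead, no compression)."""
    return num_vectors * dim * BYTES_PER_FLOAT32 / 1e9


for dim in (512, 768, 1024, 1536):
    print(dim, round(index_size_gb(1_000_000, dim), 3))
# 512 2.048
# 768 3.072
# 1024 4.096
# 1536 6.144
```

A brute-force similarity comparison also costs one multiply-add per dimension, so a 1,536-dimensional vector takes three times the storage and three times the arithmetic of a 512-dimensional one.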
[00:24:34] Alex Volkov: So great to hear that a bigger model is also coming, but it hasn't been released yet. So there's like the base model and the small model for embeddings, and we're waiting for the next one as well.
[00:24:46] I wanted to maybe ask you to maybe simplify for the audience, the concept of dimensionality. What does it mean between, what is the difference between embeddings that were stored with 512 and like 1235 or whatever OpenAI does? What does it mean for quality? So you mentioned the speed, right? It's easier to look up nearest neighbors, maybe within the 512 dimension space, what does it actually mean for quality of look up of different other ways that strings can compare? Could you maybe simplify the whole concept, if possible, for people who don't speak embeddings?
[00:25:19] Bo Wang: Okay maybe let me quickly start with the most basic version.
[00:25:24] If you type something in a search box right now, the matching it does is actually also an embedding, but, to make a simple version, it's a binary embedding. Imagine there are 3,000 words in English. There are definitely many more, but imagine it's 3,000 words; then the vector has 3,000 dimensions.
[00:25:48] What the current solution for searching or matching does is just this: if your document has a token, then its occurrence will be one, and if your query has the token, it will match your document's token. It's also about the frequency with which it appears, and how rare it is.
[00:26:12] But the current solution is basically matching by the English word. With a neural network, if you know, for example, ResNet, or a lot of different classification models, they basically output the class of an item, but if you chop off the classification layer, it will give you a vector.
[00:26:36] Basically this vector is the representation of the information you want to encode. It's a compressed version of the information in a certain dimensionality, such as 512 or 768, something like this. So it's a compressed list of numerical values, which we normally call dense vectors,
[00:26:57] because it's much more, how can I say in English, dense, right? Compared to that, the traditional way we store vectors is much more sparse. There are a lot of zeros and there are ones, because zero means the word does not exist and one means it exists. When a one exists, there is a match, and then you've got the search result.
[00:27:16] So these dense vectors capture more of the semantics. If you match only by the occurrence of a token or a word, then you might lose the semantics.
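To make the contrast Bo describes concrete, here is a minimal sketch in Python. The five-word vocabulary and the three-dimensional "dense" vectors are hand-made assumptions for illustration only; they are not produced by any real model, Jina's or otherwise.

```python
import math

# Toy vocabulary; a real keyword index would cover every word in the corpus.
VOCAB = ["cat", "dog", "kitten", "car", "engine"]

def binary_vector(text):
    """Sparse-style vector: 1 if the vocabulary word occurs in the text, else 0."""
    tokens = set(text.lower().split())
    return [1 if word in tokens else 0 for word in VOCAB]

def cosine(a, b):
    """Cosine similarity; returns 0.0 when either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Hand-made dense vectors standing in for a neural model's output (assumption).
DENSE = {
    "cat": [0.9, 0.8, 0.1],
    "kitten": [0.85, 0.9, 0.05],
    "car": [0.1, 0.05, 0.95],
}

# "cat" and "kitten" share no token, so occurrence matching scores them as unrelated.
sparse_score = cosine(binary_vector("cat"), binary_vector("kitten"))
# The dense vectors place them close together, and far from "car".
dense_score = cosine(DENSE["cat"], DENSE["kitten"])
dense_unrelated = cosine(DENSE["cat"], DENSE["car"])
print(sparse_score, dense_score, dense_unrelated)
```

The occurrence-based score is zero because the two words never coincide, while the dense score is high because the vectors encode that the words mean similar things.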
[00:27:31] Alex Volkov: Thank you. More dimensions, basically, if I'm saying it correctly, just means more axes of similarity, so more things that two strings or tokens can be similar on. And this basically means a higher match rate for more kinds of similarity. And I think the basics are covered in the first pinned tweet here: Simon Willison did a basic intro into what embeddings and their dimensions mean and why they matter.
[00:28:00] And I specifically love the fact that there's arithmetic that can be done. I think there was a paper on this even before this whole LLM thing, where if you take the embeddings for Spain and for Germany, and then you subtract the embedding for Paris, you get something closer to Berlin, for example, right?
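A small sketch of the embedding arithmetic mentioned above. In its usual form the capital-city analogy reads Paris - France + Germany ≈ Berlin. The four-dimensional vectors below are invented for illustration (the axes loosely mean "is a capital", "France", "Germany", "Spain"); real embeddings are learned and only approximately behave this way.

```python
import math

# Hypothetical toy vectors: (is_capital, france, germany, spain).
VECTORS = {
    "France":  [0.0, 1.0, 0.0, 0.0],
    "Germany": [0.0, 0.0, 1.0, 0.0],
    "Spain":   [0.0, 0.0, 0.0, 1.0],
    "Paris":   [1.0, 1.0, 0.0, 0.0],
    "Berlin":  [1.0, 0.0, 1.0, 0.0],
    "Madrid":  [1.0, 0.0, 0.0, 1.0],
}

def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def analogy(a, b, c):
    """Return the word closest to vec(a) - vec(b) + vec(c), excluding the inputs."""
    target = [x - y + z for x, y, z in zip(VECTORS[a], VECTORS[b], VECTORS[c])]
    candidates = [word for word in VECTORS if word not in (a, b, c)]
    return max(candidates, key=lambda word: cosine(target, VECTORS[word]))

result = analogy("Paris", "France", "Germany")
print(result)
```

Subtracting "France" removes the country component and leaves the "capital" component, so adding "Germany" lands on the vector for Berlin.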
[00:28:19] So there are concepts inside these things where even arithmetic works: if you take King and you subtract male, then you get something closer to Queen, and stuff like this. It's really, really interesting. And also, Bo, you mentioned visualization as well. It's really impossible to visualize
[00:28:36] 1,024, et cetera, dimensions, right? We humans perceive maybe three, maybe three and a half, four with time, whatever. And usually what happens is that those many dimensions get downscaled to 3D in order to visualize neighborhoods. And I think we've talked with folks from Arize. They have software called Phoenix that allows you to visualize embeddings for clustering and for semantics.
[00:29:02] Atlas does this as well, right? Nomic AI's Atlas does this as well. You can provide embeddings and see clustering for concepts, and it's really pretty cool. If you haven't played with this, if you only did vector DBs and you stored your stuff after you've done chunking, but you've never visualized how this looks, I strongly recommend you do. And Bo, thank you so much for joining us and explaining the internals to us, and sharing with us some exciting things about what's to come. Jina BERT is hopefully coming, a retrained version of BERT, the, how should I say, it's hard for me to define BERT, but I see it everywhere, it's the big backbone of a lot of NLP tasks. It's great to see that you guys are about to, first of all, retrain it for longer sequences, using tricks like ALiBi and, I think you said, positional embeddings, and I'm hoping to see some open source action from this. But also the Jina Embeddings large model is coming as well, with more dimensions, and we're waiting for that. Hopefully you guys didn't stop training that. And I just want to tell folks why I'm excited for this, and this will kind of take us to the next
[00:30:08] point as well. It's because, while I love OpenAI, I honestly do, I'm going to join their Dev Day, I'm going to report from their Dev Day and tell you all the interesting things that OpenAI does, we've been talking, and we'll be talking today, about local inference, about running models on edge, about running models of your own.
[00:30:28] Nisten is here, he even works on some bootable stuff that you can run completely off the grid. And so far we've been focused on open source LLMs, for example, right? I see Far El in the audience from Skunkworks, and many other fine-tuners, like Teknium, Alignment Lab, all these folks are working on local LLMs, and they haven't gotten to GPT-4 level yet.
[00:30:51] We're waiting for that, and they will. But the whole point of them is that you run them locally, they're uncensored, you can do whatever you want, you can fine-tune them on whatever you want. However, the embeddings part is the glue that connects them to an application. The reason is that there's only so much context window, and context window is expensive. Even if, theoretically, the YaRN paper, whose authors we've talked with, allows you to extend the context window to 128,000 tokens, the hardware requirements for that are incredible, right?
[00:31:22] Everybody in the world of AI engineering switched to retrieval augmented generation. Basically, instead of shoving everything into the context, they said: hey, let's use a vector database. Let's say Chroma, or Pinecone, or Weaviate, all of those, Vectorize from Cloudflare, and the other one from Spotify, I forget its name, and even Supabase now has one.
[00:31:43] Everybody has a vector database it seems these days, and the reason for that is because all the AI engineers now understand that you need to put some text into some embeddings, store them in some database. And many pieces of that were still requiring internet, requiring OpenAI API calls, requiring credit cards, like all these things.
[00:32:03] And I think it's great that we've finally got to a point where, first of all, there are embeddings that match whatever OpenAI has given us, and now you can run them locally as well. You don't have to go to OpenAI. If you don't want to host, you can probably run them. I think, though, Jina Embeddings base is very tiny.
[00:32:20] Like it's half, like, the small model is 770 megabytes, I think. Maybe a little bit more, if I'm looking at this correctly.
[00:32:27] Bo Wang: Sorry, it's half precision. So you need to double it to make it FP32.
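Bo's point about half precision comes down to bytes per parameter: FP16 stores each weight in 2 bytes and FP32 in 4, so the same weights take twice the space in FP32. A minimal sketch of that arithmetic follows; the parameter count is a made-up round number, not the size of any Jina model.

```python
# Bytes needed to store one parameter in each floating-point format.
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2}

def weights_size_mb(num_params, dtype):
    """Approximate size of the raw weights in megabytes (10**6 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e6

# Hypothetical parameter count, chosen only for illustration.
num_params = 100_000_000
fp16_mb = weights_size_mb(num_params, "fp16")
fp32_mb = weights_size_mb(num_params, "fp32")
print(fp16_mb, fp32_mb)
```

This counts the weights only; a file on disk also carries some metadata, so real checkpoints are slightly larger.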
[00:32:33] Alex Volkov: Oh yeah, it's half precision. So it's already quantized, you mean?
[00:32:37] Bo Wang: Oh no, it's just stored as FP16.
[00:32:39] Alex Volkov: Oh, if you store it as FP16.
[00:32:43] But the whole point is, the next segment in ThursdAI today is going to be less about updates and more about very specific things. We've been talking about local inference as well, and these models are tiny: you can run them on your own hardware, on edge via Cloudflare, let's say, or on your computer.
[00:32:58] And you can now go almost end to end, application-wise: from the point of your user inputting a query, to embedding this query, to running a match, a vector search, KNN or whatever nearest neighbor search you want for that query. You can retrieve all of that from local open source. You can basically go offline.
[00:33:20] And this is what we want in the era of upcoming regulation of what AI can and cannot be, and the era of open source models getting better and better. We talked last week about how Zephyr, and I think Mistral News from Teknium, are also matching some of GPT-3.5. All of those models you can download, and nobody can tell you not to run inference on them.
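The offline flow described above (embed the query, then run a nearest neighbor search over stored vectors) can be sketched in a few lines of Python. Note the assumption: `toy_embed` is a hypothetical stand-in that just counts words from the stored documents; in a real application you would swap it for a locally running embedding model, and the brute-force `knn` for a vector database.

```python
import math

documents = [
    "jina embeddings support 8k context",
    "gradio lite runs in the browser",
    "text embeddings inference server written in rust",
]

# Vocabulary built from the stored documents; maps each word to a vector position.
VOCAB = {word: i for i, word in enumerate(sorted({w for d in documents for w in d.split()}))}

def toy_embed(text):
    """Hypothetical stand-in for a local embedding model: unit-length word counts."""
    vec = [0.0] * len(VOCAB)
    for token in text.lower().split():
        if token in VOCAB:
            vec[VOCAB[token]] += 1.0
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else vec

def knn(query_vec, index, k=2):
    """Brute-force nearest neighbors; dot product equals cosine for unit vectors."""
    scored = [(sum(q * d for q, d in zip(query_vec, vec)), doc) for doc, vec in index]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]

# Indexing step: embed every document once and keep the vectors.
index = [(doc, toy_embed(doc)) for doc in documents]
# Query step: embed the user's query and retrieve the closest document.
results = knn(toy_embed("embeddings with 8k context"), index, k=1)
print(results)
```

Nothing here touches the network, which is the point being made: every step from query to retrieved text can run on your own machine.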
[00:33:40] Hugging Face open sourcing a fast Text Embeddings Inference Server with Rust / Candle
[00:33:40] Alex Volkov: But the actual applications still require the web, or they used to. And now I'm loving this new move where even the application layer, even the RAG systems, which is retrieval augmented generation, even the vector databases, and even the embeddings are now coming to open source, coming to your local computer.
[00:33:57] And this will just mean more applications, either on your phone or your computer, and I absolutely love that. Bo, thank you for that. And thank you for coming to the stage here and talking about the things that you guys open sourced, and hopefully we'll see more open source from Jina, and everybody should follow you and Jina as well.
[00:34:13] Thank you. Thank you for joining. I think the next thing that I want to talk about is actually in this vein as well. Let me go find this. Of course, we love Hugging Face, and the thing that I think is already on top, if you look at the last tweet that's pinned, is a tweet from Jerry Liu from LlamaIndex, obviously.
[00:34:33] We're following Jerry and whatever they're building and doing over at LlamaIndex because they implement everything super fast. I think they also added support for Jina extremely fast. He talks about this thing where Hugging Face open sourced something in Rust and Candlestick?
[00:34:51] Candlelight? Something like that? I forgot what their iteration on top of Rust is called. Basically, what they open sourced is a server called Text Embeddings Inference Server that you can run on your hardware, on your Linux boxes, and basically get the same thing that you get from OpenAI Embeddings.
[00:35:07] Because embeddings is just one thing, but it's a model. You could use this model with transformers, but it wasn't as fast. And as Bo previously mentioned, there are considerations of latency for user experience, right? If you're building an application, you want it to be as responsive as possible.
[00:35:24] You need to look at all the places in your stack and say: hey, what slows me down? For many of us, the actual inference, let's say you use GPT-4, waiting on OpenAI to respond and stream that response, is what slows many applications down. But for many people who do embeddings, let's say you have an interface of a chat or a search, you need to embed every query the user sends to you.
[00:35:48] And one such slowdown there is how you actually embed this. So it's great to see that Hugging Face is working on that and improving that. You previously could do this with transformers, and now they released this specific server for embeddings called Text Embeddings Inference Server.
[00:36:04] And I think it's four times faster than the previous way to run this, and I absolutely love it. So I wanted to highlight this in case you are interested. You don't have to; you can use OpenAI Embeddings. Like we said, we love OpenAI, it's very cheap. But if you are interested in doing embeddings the local way, if you want to go end to end, completely offline, and you want to build an offline application, using their inference server I think is a good idea.
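For a sense of what calling such a server looks like from an application, here is a hedged sketch of a client in Python. It assumes a Text Embeddings Inference instance is already running and reachable at `http://localhost:8080` (an assumed address, adjust to your setup) and that it accepts a JSON body with an `inputs` field on its `/embed` route. The helper names are made up for this example.

```python
import json
import urllib.request

def build_embed_request(texts, base_url="http://localhost:8080"):
    """Build a POST request for the server's /embed route (address is an assumption)."""
    body = json.dumps({"inputs": texts}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/embed",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def embed(texts, base_url="http://localhost:8080"):
    """Send the request and return one embedding (a list of floats) per input text."""
    with urllib.request.urlopen(build_embed_request(texts, base_url)) as response:
        return json.loads(response.read().decode("utf-8"))

# Building the request needs no running server; calling embed(...) does.
request = build_embed_request(["What slows me down?"])
print(request.full_url, request.get_method())
```

With a server running, `embed(["first query", "second query"])` would return the vectors to store or search against; every user query goes through this call, which is why its latency matters.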
[00:36:29] And also it shows what Hugging Face is doing with Rust, and I really need to remember what the framework is called, but it's definitely a great effort from Hugging Face, and yeah, I just wanted to highlight that. Let's see. Before we are joined by the Gradio folks, and I think there are some folks in the audience from Gradio who are ready
[00:36:48] to come up here and talk about local inference, with 15 minutes left,
[00:36:52] Data Provenance Initiative at dataprovenance.org
[00:36:52] Alex Volkov: I wanted to also mention the Data Provenance Initiative. Let me actually find this announcement and quickly paste it here, and I was hoping that Enrico could be here. There's a guy named Shayne Longpre,
[00:37:05] and he released this massive, massive effort, done together with many people. It's called the Data Provenance Initiative, and it now lives at dataprovenance.org. And hopefully somebody can send me the direct link to the tweet so I can add it.
[00:37:23] It is a massive effort to take 1,800 instruct and align datasets that are public, and to go through them to identify multiple things. You can filter them, exclude them, you can look at creators, and, most important, you can look at licenses. Why would you do this? Well, I don't know if somebody who builds an application necessarily needs this, but for everybody who wants to fine-tune models, the data is the most important key, and building datasets and running them through your fine-tuning efforts is basically the number one thing that many people do in the fine-tuning community, right?
[00:38:04] Data wranglers. And now, thank you, Nisten, thank you so much, a friend of the pod, Enrico, is now pinned to the top of the space, the nest, whatever it's called. Enrico Shippole, who we've talked with previously in the context of extending, I think, Llama to first 16k and then 128k.
[00:38:24] I think Enrico is part of the team on the YaRN paper as well. He joined this effort, and I was hoping Enrico could join us to talk about this. But basically, if you're doing anything with data, this seems like a massive, massive effort. Many datasets from LAION, and we've talked about LAION, and Alpaca, GPT4All, Gorilla, all these datasets.
[00:38:46] It's very important, when you release your model as open source, that you have the license to actually release it. You don't want to get exposure, you don't want to get sued, whatever. And if you're finding datasets and creating different mixes to fine-tune different models, this is a very important thing.
[00:39:03] And we want to shout out Shayne Longpre, Enrico, and everybody who worked on this, because I just love these efforts for open source, for the community. It makes it easier to fine-tune and train models, it makes it easier for us to advance and get better and smaller models, and it's worth celebrating, and ThursdAI is the place to celebrate this, right?
[00:39:27] LocalLLama effort to compare 39 open source LLMs + GPT4
[00:39:27] Alex Volkov: On the topic of extreme, how should I say, efforts that are happening in the community, I want to add another one, and this one I think I have a way to pull up, so give me just a second, yes. A Twitter user named Wolfram Ravenwolf, who is a participant in the LocalLLama community on Reddit, and who is now pinned to the nest at the top of the space, did this massive effort of comparing open source LLMs, and tested 39 different models ranging from 7 billion parameters to 70 billion, and also compared them to ChatGPT