
Bingbot: Discovering, Crawling, Extracting and Indexing (Fabrice Canel with Jason Barnard)
April 7, 2020
Show Notes
Fabrice Canel with Jason Barnard at The Bing Series
Fabrice Canel talks to Jason Barnard about Bingbot.
Fabrice was on the podcast last year talking about JavaScript and the new indexing API. That was very interesting, and he shared quite a few insights...
If you would rather read than watch or listen, here is an article I wrote based on this conversation.
This Episode Takes the Conversation a BIG step Further
https://youtu.be/dSSZeTYOtMk
This conversation is on a whole different planet. Fabrice is head of the entire discovery-crawling-extracting-indexing process. Think about how much that involves. And how important he and his team are to the process of getting your content to the top of the results.
You cannot hope to get your content into search results if it isn't found, crawled, extracted and indexed... and since he manages every single one of those steps, he is a person we really need to listen to.
Bingbot and Googlebot Function in Much the Same Way
Obviously they don't function exactly the same way down to the tiniest detail. But close enough ...
the process is exactly the same (discover, crawl, extract, index)
the content they are indexing is exactly the same
the problems they face are exactly the same
the underlying technology they use is the same
So the details of exactly how they achieve each step will differ. But they are faced with the same environment and aim to do the same thing - index the web effectively. So, we can safely assume Google deals with the discovery-crawling-extracting-indexing process in a manner very, very close to Bing.
Just think about whatever industry you are in - details differ, but every competitor uses the same foundation. Easy to forget, but this is just another industry. So same here.
Google functions much the same way as Bing. And vice versa. Close enough for us not to need to worry too much about the differences.
Stunning Insights. I Learned sooooo Much.
The conversation with Frédéric Dubut that kicked off this series (this episode recorded at UnGagged) suddenly looks tame and unrevealing. A simple 'mise en bouche', as we say in French.
Listen and Learn
Google collaborate with Bing on Chromium
They discover 70 billion new webpages every day
Bingbot pre-filters to store only the 'best' content
New technology is coming out for rendering (Machine Learning + Javascript)
Standardised HTML is powerful
Bing (and we can safely assume Google) is getting exponentially better at extracting information
The process of storing the content is MUCH more important than you probably imagine
Every candidate set team at Bing relies on Bingbot
Nofollow has always been just a hint
Sitemaps and RSS are incredibly important
Indexing includes annotation, and annotations are fundamentally important to all the other teams and their algos
Indexing includes classification, and classification is fundamentally important to all the other teams and their algos
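To make the sitemap point in the list above concrete, here is a minimal sketch of generating a sitemap file with Python's standard library. The element names and namespace come from the public sitemaps.org protocol; the function name `build_sitemap` and the example URL are hypothetical, and nothing here describes how Bing itself consumes the file.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(entries):
    """Build a sitemap string from (url, lastmod) pairs.

    lastmod is expected as a YYYY-MM-DD string. build_sitemap is a
    hypothetical helper written for this illustration.
    """
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for loc, lastmod in entries:
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = loc
        ET.SubElement(url, "lastmod").text = lastmod
    return ET.tostring(urlset, encoding="unicode")


if __name__ == "__main__":
    print(build_sitemap([("https://example.com/", "2020-04-07")]))
```

Keeping `lastmod` accurate is the part that tells a crawler which pages have actually changed.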
In short, as SEOs, we all depend on Fabrice and his team to an extent most of us will probably only start to grasp after watching the episode. This is the foundation of ranking in search. Everything else depends on this.
Fabrice is a truly lovely guy who wants to help you as a website manager... if only you'd help him help you. Here he tells you what he (and, presumably, his equivalent at Google) wants from you so that he can help you get your content to rank.
Help them overcome their problems, and you WILL be rewarded. Groovy !
Catch the rest of the Bing Series:
How Ranking Works at Bing - Frédéric Dubut, Senior Program Manager Lead, Bing
Discovering, Crawling, Extracting and Indexing at Bing - (this episode) Fabrice Canel, Principal Program Manager, Bing
How the Q&A / Featured Snippet Algorithm Works - Ali Alvi, Principal Lead Program Manager AI Products, Bing
How the Image and Video Algorithm Works - Meenaz Merchant, Principal Program Manager Lead, AI and Research, Bing
How the Whole Page Algorithm Works - Nathan Chalmers, Program Manager, Search Relevance Team, Bing
Full transcript of "Bingbot: Discovering, Crawling, Extracting and Indexing (Fabrice Canel with Jason Barnard)"
Jason Barnard: A quick hello, and we're good to go. Welcome to the show, Fabrice Canel!
Welcome, lovely — you know, we're in the Bing offices. Yes, again, I had you on the show last year, it was just audio, now we've got video so everyone can see what Fabrice looks like. Fabrice, incredibly important person at Bing who crawls, extracts, and stores.
Fabrice Canel: Yes, I do all of it. Every day I am in charge of discovering internet content — all the internet content. I am in charge of selecting the best content on the internet, as you said. I am fetching and crawling the best content from the internet, then processing it and understanding it.
Jason Barnard: So one question is: when you crawl, you're actually looking for what's best, so there's a pre-filter even before the ranking engine?
Fabrice Canel: Every day we discover more than seventy billion URLs that we have never seen before.
Jason Barnard: Every day? Seventy billion?
Fabrice Canel: Seventy billion — it's a lot of content. Obviously we will remove useless URLs. Just to give you a sense of the size of the internet: the size of the internet is really infinite, there is an infinite number of URLs out there. People create content, but then there are systems that are auto-generating content. You have pages with calendars where we can follow links — all kinds of useless links — but often you have to follow those links to discover whether they're good or not. You have to fetch them.
Jason Barnard: So when you say you select…
Fabrice Canel: Yes, we select the best content for indexing, but often we have to fetch first to discover whether a link is good or not.
Jason Barnard: So my initial idea was that you do a pre-sorting, but in fact you're just getting rid of the junk.
Fabrice Canel: We first get rid of the junk, then we still fetch to discover if it is useful or not, because we don't know. Sometimes it's just a link to a page we've never seen, and we take a decision based on what we find — whether this page is useful for satisfying user queries or not.
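The order Fabrice describes (drop obvious junk, fetch what remains, then decide based on what was found) can be sketched roughly as follows. This is an illustration only: `select_for_indexing` and the three callables it takes are hypothetical names, and the interview does not say how Bing implements any of these steps.

```python
from typing import Callable, Iterable, List, Optional


def select_for_indexing(
    discovered: Iterable[str],
    is_obvious_junk: Callable[[str], bool],
    fetch: Callable[[str], Optional[str]],
    is_useful: Callable[[str, str], bool],
) -> List[str]:
    """Return the URLs that survive all three steps.

    All three callables are placeholders supplied by the caller;
    fetch returns None for a dead link.
    """
    selected = []
    for url in discovered:
        # Step 1: drop obvious junk without spending a fetch on it.
        if is_obvious_junk(url):
            continue
        # Step 2: fetch, because the URL alone often says nothing about quality.
        body = fetch(url)
        if body is None:
            continue
        # Step 3: decide based on what was actually found.
        if is_useful(url, body):
            selected.append(url)
    return selected


if __name__ == "__main__":
    pages = {"https://example.com/a": "A real article"}
    print(select_for_indexing(
        ["https://example.com/a", "https://example.com/gone"],
        is_obvious_junk=lambda u: False,
        fetch=pages.get,
        is_useful=lambda u, body: bool(body),
    ))
```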
Jason Barnard: So with every page, you're going to crawl it, extract information, figure out if it's useful or not, and if it's not useful, do you still store it?
Fabrice Canel: Obviously, if we continuously see that these pages are dead links, we will stop indexing them at some point.
Jason Barnard: But in processing a page, how do you tag it to not be crawled again — or do you just keep crawling it?
Fabrice Canel: Dead links are a very good challenge, because often you have pages that are dead links but come back. You may buy a domain and not populate it — we call that a parked domain, where there is no useful content yet. Then you publish some content, and maybe you forget to renew the domain, so it becomes a parked domain again, and somebody else buys it. There is a lifecycle of URLs. At the end, yes, we take decisions based on URLs — especially what we call tail URLs, which are very long URLs that are essentially useless — and we will stop visiting them, especially if nobody links to them anymore.
Jason Barnard: So you've got a lifecycle of URLs — already an interesting concept. You keep tracking them just in case they come back.
Fabrice Canel: Yes.
Jason Barnard: And another thing you just said: very long URLs are a signal that the page is rubbish?
Fabrice Canel: It can be a signal, especially if we continue to see dead links, the URL is very long, and nobody is linking to it anymore. We may decide, okay, this page is a dead link and nobody is visiting it — until we see somebody linking to it again, at which point we say, well, there is a new link to this page, so maybe we should visit it again.
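The combination of signals Fabrice lists here (repeated dead fetches, a very long URL, no remaining inbound links, and a new link reviving the URL) can be written down as a small decision rule. This is a sketch under assumptions: the thresholds, the `UrlRecord` fields and both function names are invented for illustration, since the interview gives no numbers or implementation details.

```python
from dataclasses import dataclass

# Hypothetical thresholds, chosen only for illustration.
MAX_DEAD_FETCHES = 3
LONG_URL_CHARS = 200


@dataclass
class UrlRecord:
    url: str
    consecutive_dead_fetches: int = 0
    inbound_links: int = 0


def should_revisit(record: UrlRecord) -> bool:
    """Return False only when all three signals line up."""
    repeatedly_dead = record.consecutive_dead_fetches >= MAX_DEAD_FETCHES
    very_long = len(record.url) > LONG_URL_CHARS
    unlinked = record.inbound_links == 0
    return not (repeatedly_dead and very_long and unlinked)


def on_new_link(record: UrlRecord) -> None:
    """A newly discovered link makes the URL a crawl candidate again."""
    record.inbound_links += 1
```

Note that the long URL is treated as one signal among several here, not as a verdict on its own, which matches the "it can be a signal" wording.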
Jason Barnard: Right, so you crawl the URLs, look at what's in there, and decide if it's junk or if it's actually useful. What are the problems with extracting the information? I love HTML5, and John Mueller from Google said it's probably not worth using because people use it so badly that they can't rely on it and don't really pay attention to it.
Fabrice Canel: I disagree a little bit. The web is built not only from pages created by hand in Notepad, but also from content management systems using templates that are well structured with very good information. It's important to tag content properly to help search engines understand it — h1, h2, and h3 tags are useful for telling the story of headings, for marking the head of a section. Tables that are well structured also help search engines understand the concept of a table, the concept of a list.
Jason Barnard: Incredible. Sorry — I heard that 85% of tables are used for design, which creates an enormous problem for you, because a lot of tables are just there for layout and you'd expect data in them, but in fact they're just…
Fabrice Canel: Yes, we do not recommend that. We prefer divs, spans, and CSS positioning, and reserve tables for data — for saying, okay, this is the list of planets in the solar system, this is a real table with real data. Using tables for design confuses the understanding of a page.
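To see why headings and real data tables help a machine reader, here is a minimal sketch that pulls a heading outline out of a page and flags which tables carry header cells. It uses Python's standard `html.parser`. Treating "has a `th` cell" as a sign of a data table is a crude assumption made for this example, and `OutlineParser` and `summarize` are hypothetical names; none of this reflects how Bingbot actually processes pages.

```python
from html.parser import HTMLParser


class OutlineParser(HTMLParser):
    """Collect h1-h3 headings and note whether each table has header cells."""

    def __init__(self):
        super().__init__()
        self.outline = []      # list of (level, heading text)
        self.tables = []       # one bool per table: True if it contains <th>
        self._heading = None   # level of the heading currently open, if any
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag in ("h1", "h2", "h3"):
            self._heading = int(tag[1])
            self._text = []
        elif tag == "table":
            self.tables.append(False)
        elif tag == "th" and self.tables:
            # Assumption: a header cell suggests a data table, not a layout table.
            self.tables[-1] = True

    def handle_data(self, data):
        if self._heading is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if self._heading is not None and tag == f"h{self._heading}":
            self.outline.append((self._heading, "".join(self._text).strip()))
            self._heading = None


def summarize(html: str):
    """Return (outline, tables) for an HTML string."""
    parser = OutlineParser()
    parser.feed(html)
    parser.close()
    return parser.outline, parser.tables
```

On a page that uses tables for layout, the second list would be mostly False, which is exactly the ambiguity Fabrice says confuses the understanding of a page.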
Jason Barnard: Okay, none of that. So — WordPress tends to be structured more or less the same way. That must really help you. Whereas when I code myself, it never works out the way I intend.
Fabrice Canel: What you need to understand is that search engines these days are machine-learning based. Machine learning is about judgment — we look at a lot of content, tag it, and define what perfect tagging should look like. There is a variety of pages on the internet, some well structured, some not.