
00:00:00 - Speaker 1: But this totally changes how the data is persisted, and I think that’s important because the only way you get good results on sync systems, especially when you’re talking about offline versus online and partially online, it has to be the one system that you use all the time. You can’t have some second path that’s like the offline cache or offline mode that never works. It needs to be the one true data synchronization persistence layer.
00:00:29 - Speaker 2: Hello and welcome to Metamuse. Muse is a tool for thought on iPad and Mac, but this podcast isn’t about Muse the product, it’s about Muse the company and the small team behind it. I’m here today with two of my colleagues, Mark McGranaghan.
00:00:43 - Speaker 1: Hey, Adam.
00:00:44 - Speaker 2: And Adam Wulf.
00:00:46 - Speaker 3: Yeah, happy to be here.
00:00:48 - Speaker 2: Now Wulf, you are not at all new to the Muse team, I think you’ve been with us for coming up on 2 years now, but it is your first appearance here on this podcast, a long overdue one I would say. So we’d love to hear a little bit about your background and how you came to the team.
00:01:03 - Speaker 3: Yeah, thanks, it’s exciting. Before Muse, I worked for a number of years with Flexibits on their calendar app, Fantastical, both on the Mac and the iPhone and iPad. Really enjoyed that. At the same time, I was also working on an iPad app called Loose Leaf, which was an open source paper inking app, kind of a note-taking app of sorts. Really enjoyed that as well.
00:01:28 - Speaker 2: And I’ll note that when we came across your profile, let’s say, I was astonished to see Loose Leaf. It felt to me like sort of the same core vision or a lot of the same ideas as Muse, this kind of open-ended scratch pad, multimedia inking fluid environment, but I think you started in what, 2013 or something like that, the Apple Pencil didn’t even exist, and you were doing it all yourself and, you know, in a way maybe too early and too much for one person to do, but astonishing to me when I saw the similarity, the vision there.
00:02:03 - Speaker 3: Yeah, thanks. I think the vision really is extremely similar. I really wanted something that felt physical, where you could just quickly and easily get to a new page of paper and just ink, and the app itself got out of your way, and it could just be you and your content, very similar to you sitting at your desk with some pad of paper in front of you. But yeah, I think I started when the iPad 2 was almost released. And so the hardware capabilities at the time were dramatically less, and the engineering problems were exponentially harder as a result of that, and it was definitely too early, but it was a lot of fun at the time.
00:02:42 - Speaker 2: And I think one of the things that came out of that, if I remember correctly, is this open source work you did on ink engines, which is how we came across you. Tell us what you did there.
00:02:52 - Speaker 3: Yeah, there’s a few different libraries I ended up open sourcing from that work.
One was the ink canvas itself, which was the most difficult piece for me. The only way to get high performance ink on the iPad at the time was through OpenGL, which is a very low-level, usually 3D, rendering pipeline. I had no background in that, and so it was extremely difficult to get something up and running with that low level of an architecture.
And so, once I had it, I was excited to open source it and hopefully let other people use it without having to go through the same pain and horror that I did to make it work.
But then one of the other things that was very useful that came out of Loose Leaf was a clipping algorithm for Bezier curves, which are just fancy ways to define ink strokes, basically, or fancy ways to describe long curvy, self-intersecting lines. And that work has also been extremely important for Muse. We use that same library and that same algorithm to implement our eraser and our selection algorithms.
00:04:05 - Speaker 2: And when you’re not deep in the bowels of inking engines, or as we’ll talk about soon, syncing engines, what do you do with your time?
00:04:13 - Speaker 3: Oh, I live up in northwest Houston in Texas with my wife Christie and my daughter Kaylin. And she is in high school now, which is a little terrifying, and learning to drive and we’re starting that whole adventure, so that’s been fun for us. I try and get outside as much as I can. I’ll go backpacking or hiking a little bit. That can be fun, and the Houston summer, it’s rather painful, but the springs and the falls, we have nice weather for outdoors and so.
00:04:42 - Speaker 2: What’s the terrain like in the day trip kind of range for you? Is it deserty? Are there mountainous or at least hilly areas, or is it pretty flat?
00:04:52 - Speaker 3: It is extremely flat and lots and lots of pine trees, and that’s pretty much it. Just pine trees and flat land. Sometimes I’ll drive a few hours north. We have some state parks that are nice and have a bit of variety compared to what’s immediately around Houston, so that’s a good backup plan when I have the time.
00:05:14 - Speaker 2: Flat with a lot of trees sounds surprisingly similar to the immediate vicinity of Berlin. I would not have expected Texas and northern Germany to have the commonality there. It gave me a lot of appreciation for the San Francisco Bay Area. While that city didn’t quite suit me, as we’ve discussed in the past, one thing that was quite amazing was the nature nearby, and a lot of that ends up being less the foliage or whatever, but more just elevation change. Elevation change makes hikes interesting and views interesting and I think itself leads to, yeah, just landscape elements that engage you in a way that flatness does not.
00:05:55 - Speaker 3: Yeah, absolutely. I lived in the Pacific Northwest for a while, and the trees there are enormous, and the amount of green and elevation change there is also enormous. And so when we moved back to Houston, it was a bit of a shock almost to see what I used to think were tall trees in Houston are really not very tall compared to what I lived around up in Portland, Oregon.
00:06:21 - Speaker 2: So our topic today is sync.
Now Muse 2.0 is coming out very soon. We’ve got a launch date, May 24th. Feels like tomorrow for our team scrambling to get all the pieces together here, but the biggest investment by far, even though we have the Mac app and text blocks are a part of it, the biggest kind of time, resource, energy, life force investment by far has been the local-first syncing engine.
And we’ve spoken before about local-first sync as a philosophy generally in our episode with Martin Kleppmann, but I thought it would be good to get really into the details here now that we have not only built out this whole system, both the client side piece and the server piece, but also that we’ve been running it in, I won’t quite call it production, but we’ve been running it for our beta for a few months now, and we have quite a number of people using that, some for pretty serious data sizes, and so we’ve gotten a little glimpse of what it’s like to run a system like this in production. So first, maybe Mark, can you describe a little bit how the breakdown of responsibilities works between the two of you on the implementation?
00:07:32 - Speaker 1: Yeah, so I’ve been developing the back end or the server component of our sync system, and Wulf has been developing our iOS client that is the core of the actual app.
00:07:45 - Speaker 2: Yeah, on that side, I kind of think of the client persistence or storage layer as being the back end of the front end. So that is to say it’s in the client, which obviously is a user interface heavy and oriented thing, but then it persists the user data to this persistence layer which in the past was Core Data, is that right, Wulf? The kind of standard iOS storage library thing.
00:08:08 - Speaker 3: Yeah, that’s exactly right. Yeah, we used Core Data, which is Apple’s fancy wrapper on top of a SQLite database. And that just stores everything locally on the iPad, like you were saying, so that way the actual interface that people see, that’s what it talks to.
00:08:25 - Speaker 2: And then that persistence layer within the client can talk to this back end that Mark has created. And much more to say about that, I think, but I thought it would be nice to start with a little bit of history here, a little bit of motivation.
I’ll be curious to hear both of your stories, but mine actually goes back to using my smartphone on the U-Bahn, so that’s the subway system here in Berlin, when I was first working with some startups in the city back in, I guess it would have been 2014, so 8 years ago. I had this experience of using different apps and seeing how they handled both the offline state but actually the kind of unstable state, because you have this thing where the train car goes in and out of stations and when you’re in the station, you usually have reception, weak reception, and you leave the station and that fades off until you’re essentially fully offline, and so you’re in this kind of unreliable network state all the time.
And two that I remember really well, because they were really dramatic: one was Pocket, which is the read-later tool I was using at the time, and it handled that state really well. If it couldn’t load an article, it would just say you’re offline, you need to come back later, but the things it had saved, you could just read. The other one I was using was the Facebook mobile app, and there I was amazed how many errors and weird spinners there were, and you’d go to load a thing and it would get half of it, but not the rest of it, and the app just seemed to lose its mind because the network was unreliable, and I found myself thinking, what would make it possible to make more apps work the way that Pocket does and less the way that Facebook works? And I also had the opportunity to work with some startups here, including Clue and Wunderlist and some others that had their own.
Essentially everyone needs this. Everyone needs syncing because they want either one, the user to be able to access their stuff from different devices, or 2, they want some kind of sharing, and I think Wunderlist was an interesting case because they built out this crack engineering team to develop really good real-time syncing for a very simple case. It’s just a to-do list, and the common case that people use it for, I think, was, you know, a couple that’s grocery shopping and they want to make sure they don’t overlap and pick the same things in the cart. It worked really well, but they built this huge, I think it was like a 15 person engineering team that spent years of effort to make really good real-time sync, and it seemed strange to me that you need this big engineering team to do what seems like a simple thing that every app needs.
We went down this road of trying CouchDB and Firebase and a bunch of others, and all were pretty unsatisfying.
And then that further led in, you know, that kind of idea, the sync problem lodged in my mind, and then when we got started at Ink & Switch, some of our early user studies there were on sync and how people thought about it. And one thing that stuck with me from those was we looked into syncing in note-taking apps and talked to a whole bunch of people about this, and we didn’t have a product at the time, so it was just kind of a user research study, but we went and talked to a bunch of folks, most of whom were using Evernote, which was kind of the gold standard at the time. And almost everyone we talked to, when I asked what’s your number one most important feature from your notes app, they said sync. I said, OK, so that’s why you chose Evernote, and they said, yeah, and I said, how well does it work? And they said terribly, it fails all the time. You know, I write a note on my computer, I close the lid, I go to lunch. Half an hour later, I go to pull it up on my phone. It’s not there. I have no idea why. And so some combination of those experiences sort of lodged this thing in my mind of the technology industry can just do so much better, and this is important and everyone needs it. What’s the missing piece? And I wasn’t really sure, but that led into once I met up with folks in the research world who indeed had been working on this problem for a while, and I got excited about the technologies they had to offer.
00:12:15 - Speaker 1: Yeah, and then I guess I was downstream of that because I got introduced to this space by Peter van Hardenberg, who at the time was a principal at the Ink & Switch research lab, and is now the director of the lab.
And he showed me a demo of the Pixelpusher project, and we can link to the article on this, but essentially this is a pixel art editing tool that was peer-to-peer collaborative, and the app itself is very standard, but what was amazing to me was he had implemented this app and he had 2 devices or 2 windows on the same device, and they were doing real-time collaboration, but there was no server.
And I had come from this world of whenever you add a feature to an app, you gotta write the front end and then you gotta write the back end, you gotta make sure they line up whenever anything changes, it’s a whole mess, and it was just magical to me that you could just type up this JavaScript app and have it collaborating with another client in real time.
So I went down that rabbit hole, and there were the obvious attractions of the austere locations and, you know, minimal network connectivity and things like that. And also at the time the research was very oriented around P2P, so there was this notion of the user having more control of their data and perhaps not even requiring a central server, but a couple of things became even more appealing to me as I researched it more. One was the potential for higher performance. And I ended up writing a whole article about software performance that we can link to. But one of the key insights was that it’s not physically possible to have acceptably fast software if you have to go anywhere beyond the local SSD. Now, certainly if you’re going to a data center in Virginia or whatever, you’re totally hosed. So it was very important to incorporate this performance capability into Muse.
00:13:49 - Speaker 2: Yeah, that article was eye opening for me in that you connected the research around human factors, things that looked at what level of latency you needed for something to feel snappy and responsive, and then separately the speed of light, which is sort of the maximum possible speed that information can travel, and if you add those together or do very simple arithmetic on that, you can instantly see it’s not about having a faster network connection. You literally cannot make something that will feel fast in the way that we’re talking about if you have to make a network round trip.
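The "very simple arithmetic" mentioned here can be sketched as follows. This is only an illustration: the distance and the fiber speed are assumed round numbers, not figures quoted in the episode or in Mark’s article, and the result is a lower bound that ignores routing, queuing, and server processing time.

```python
# Best-case network round trip, limited only by signal propagation speed.
SPEED_OF_LIGHT_KM_PER_S = 299_792.458  # in vacuum; signals in fiber travel slower


def min_round_trip_ms(distance_km: float,
                      speed_km_per_s: float = SPEED_OF_LIGHT_KM_PER_S) -> float:
    """Time for a signal to travel to a server and back, in milliseconds."""
    return 2 * distance_km / speed_km_per_s * 1000


# Assumed inputs, for illustration only.
DISTANCE_KM = 6_500              # hypothetical user-to-data-center distance
FIBER_SPEED_KM_PER_S = 200_000   # rule-of-thumb speed in optical fiber (assumption)

vacuum_ms = min_round_trip_ms(DISTANCE_KM)
fiber_ms = min_round_trip_ms(DISTANCE_KM, FIBER_SPEED_KM_PER_S)

print(f"vacuum lower bound: {vacuum_ms:.1f} ms")
print(f"fiber lower bound:  {fiber_ms:.1f} ms")
```

Whatever responsiveness budget one picks from the human-factors research, a fixed cost of this size is paid on every interaction that waits for the server, and no faster connection removes it. A read from local storage avoids it entirely.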
00:14:21 - Speaker 1: Yeah, and the one other thing that was really interesting to me about this space was the developer experience.
I alluded to this earlier with the Pixelpusher demo, but in the before times there were two ways to develop apps.
You had the local model where you were typically programming against the SQL database, and everything was right there and it sort of made perfect sense. You would query for what you need and you write when you have new information and so on.
And then there was the remote model where you would make REST calls, for example, out to some service, like submit this edit or add a new post or whatever.
But then these two worlds were colliding, where we always wanted to be adding sync and collaborative capabilities to our apps. We would try to kind of jam one into the other, like you would try to patch some REST onto the database or you’d try to patch some database onto your REST, and it just wasn’t working, and I realized we need to do a pretty fundamental rethink of this whole architecture, which is what we ended up doing in the research lab and then now with Muse.
The last thing I’ll mention about my journey here was my background was in back-end engineering and distributed systems engineering, and so I had encountered variants of the sync problem several times. For example, at Heroku, Adam, we had this challenge where we had these routers that were directing HTTP requests to a back end that was constantly changing based on these dynos coming up and down, and the routers needed to maintain in-memory routing tables based on the control plane that was being adjusted by the API.
And so we had a similar problem of needing to propagate state consistently and in real time to the in-memory databases of all these router nodes, and sure enough that work kind of came full circle and we were applying some of the same lessons here with Muse. So it’s a problem I’ve had the opportunity, for better or worse, to make a few passes at in my career.
00:15:57 - Speaker 3: Yeah, I think it’s an extremely hard problem that comes up so often across so many projects: eventually you need data over here in Box A to look the exact same as data over here in Box B, and it’s one of those problems that’s just surprisingly hard to get right, and there just aren’t that many libraries and existing solutions for it to drop in and implement. A lot of other libraries you can just go out and find, and there’s code out there, or you can license it or open source, whatever, but for whatever reason, sync is one of those things that for every project needs to be custom baked to that project, just about every time.
00:16:38 - Speaker 2: And that’s part of what blew my mind back 8 years ago when I was looking for a syncing layer for Clue and realizing that, yeah, I just had this feeling like surely everyone has this problem, everyone needs it, everyone needs the same thing. It’s really hard, you know, an individual company shouldn’t be distracted from their core competency of building their app to create the syncing layer, and yet to my surprise, there really wasn’t much, and that continues to basically be true today.
00:17:06 - Speaker 1: Yeah, and this gets into our collaboration with Martin Kleppmann on CRDTs.
So briefly you can think of there being two pieces to this problem. One is conveying the physical data around, and the other is, OK, you have all this data that’s synchronized, what do you do with it, because it’s all a bunch of conflicting edits and so on.
And that’s where the CRDT technology came in. I think one of the reasons why we haven’t seen widespread standard libraries for this stuff is the syncing problem is hard. We’ll talk more about that. But another is that we haven’t had the computer science technology to make sense of all of these edits. Well, we sort of did. There was like operational transforms, but you literally need to hire a team of PhD computer scientists to have any shot at doing stuff like that. And so Google Docs basically had it and maybe a few others, but normal humans couldn’t do anything with it. But the CRDT technology and Automerge, which we’ll talk more about, made it much more accessible and possible to make sense of all these conflicting edits and merge them into some useful application state. So that’s the kind of why now, of why now is a good time I think to be pursuing this.
00:18:06 - Speaker 3: Yeah, and I think almost surprisingly to me, the solution we came up with at Muse, I think is actually really generic, and I think we solve it in a really elegant way that’s even more foundational to the technology than solving just for Muse. I think the solution we have can certainly solve for Muse in the future and is futureproof in that regard, but is broad enough to be applicable to a whole number of different uses and applications, which I think is really exciting too.
00:18:37 - Speaker 2: Maybe it’s worth taking a moment to also mention why we think local-first in the style of sync is important for Muse specifically. I think certainly Mark and I have had a long time interest in it. Wulf, you have an interest in it, so it’s just something that’s more like we’d like to see more software working in this way where the user has a lot more sort of control and literal ownership over the data because it’s on their device, in addition to being mirrored in the cloud. Certainly the performance element is huge for me personally, and I think for all of us on the team. But I think Muse, as we go to this multi-device world, on one hand, we think that every device has its own kind of unique mood. The iPad is this relaxed space for reading and annotating, whereas the Mac or a desktop computer is for focus, productivity, you know, the phone is for quick capture, the web is good for sharing. OK, so really you need your work to be seamlessly available across all of them.
But at the same time, you know, we want that sense of intimacy and certainly the performance and the feeling that it’s in your control and you own it, and it belongs to you.
I think that maybe matters less for some consumer products, or maybe it matters less for more kind of B2B, you know, enterprisey products, but for this tool, which is for thinking, which is very personal, which very much needs to be at your fingertips and friction free, I think the local-first approach would be a good fit for a lot of software, but I think Muse needs it even more than most. So that’s why I’m really excited to see how this works out in practice as people try it out, and we really don’t know yet, right? It may be that we’ve made this huge engineering investment and in the end customers just say, I’d be happy with the cloud, yeah, it’s fine. I have some spinners, I can’t access my work offline. I hope not. But that could happen. We could be like falsifying the business hypothesis, but I really believe that for our specific type of customer, you’ll go to use this product with the syncing layer, you know, once we shake out all the bugs and so on, and say, you know, this feels really fundamentally different from the more cloud-based software that I’m used to, like a Notion, and also fundamentally different from the non-syncing pure local apps that I might use.
00:20:51 - Speaker 3: Yeah, I really think that with as connected as this world is and is becoming, there’s always going to be places of low connectivity, there’s always going to be places of just dodgy internet, and having an application that you know just always works, no matter what’s going on, and figures itself out later once it has good internet, is just so freeing compared to those times when, you know, your device is switching Wi-Fi networks or the LTE is just not quite what it needs to be to make things happen.
I think it really does make a really huge difference, especially when you’re deep in thought, working on your content in Muse, the last thing you want is to be interrupted for even half of a second with a small spinner that says please connect to the internet. And so just being able to free the application and free the user from even worrying about the internet at all, even if it works 99% of the time, it’s that 1% of the time that breaks your train of thought that is just really frustrating. And I think that’s what’s exciting about being able to be purely offline: it fixes that really huge problem of that really small percentage of time that it happens.
00:22:10 - Speaker 2: Very well said. Now with that, I’ll do a little content warning. I think we’re about to get a lot more technical than we ever have on this podcast before, but I think this is a topic that deserves it. So I’d love to, and me especially as someone who’s not deep in the technology and just observing from the sidelines, I’d love to hear about what’s the high level architecture, what are all the pieces that fit together here that make this sync when you’re on the internet and let you keep working even when you’re not? What is it that makes all that work?
00:22:41 - Speaker 1: Yeah, I’ll give a really quick overview and then we can dive into some of the specific pieces.
So to start with the logical architecture, the basic model is a user has a bag of edits, so you might have 1000 edits or a million edits where each edit is something like I put this card here or I edit this picture, and over time the user is accumulating all these edits and the job of the sync system is to ensure that eventually all of the users’ devices have the same bag of edits.
And it passes those edits around as opaque blobs and different flavors of blobs we’ll talk about.
Basically there’s a bunch of bits of binary data that all devices need to have the same, and then it’s the device’s responsibility to make sense of those edits in a consistent way.
So given the same bag, each device needs to come up with the same view of the Muse corpus of that individual user, what boards are alive and what cards are on them and so forth. And then briefly in terms of the physical architecture, there’s a small server that’s running on Heroku, data is stored in Postgres and S3 and it’s implemented in Go, and again the server is just shuffling binary blobs around basically. And then there’s different front ends, different clients that implement this synchronization protocol and present a Muse corpus model up to the application developers. So the most important of these is the Swift client. We also have a JavaScript client, and both of these are backed by SQLite databases locally.
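The "bag of edits" model described above can be sketched in a few lines. This is a toy under stated assumptions, not the Muse protocol: the `Replica` class, the string edit IDs, and the `sync` function are all hypothetical names, and real blobs would be opaque binary data rather than readable strings.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class Replica:
    """A device or the server: a grow-only bag of opaque blobs keyed by edit ID."""
    edits: Dict[str, bytes] = field(default_factory=dict)

    def add(self, edit_id: str, blob: bytes) -> None:
        # Edits are only ever added; an existing edit is never changed or removed.
        self.edits.setdefault(edit_id, blob)

    def missing_from(self, other: "Replica") -> Dict[str, bytes]:
        """Edits that `other` holds and this replica does not."""
        return {k: v for k, v in other.edits.items() if k not in self.edits}


def sync(a: Replica, b: Replica) -> None:
    """Exchange missing edits so that both replicas hold the union of the two bags."""
    for_a = a.missing_from(b)
    for_b = b.missing_from(a)
    a.edits.update(for_a)
    b.edits.update(for_b)


ipad, mac, server = Replica(), Replica(), Replica()
ipad.add("e1", b"move card 7")
mac.add("e2", b"resize card 7")  # made while the Mac was offline

sync(ipad, server)  # iPad uploads e1
sync(mac, server)   # Mac uploads e2 and downloads e1
sync(ipad, server)  # iPad downloads e2

print(sorted(ipad.edits) == sorted(mac.edits) == sorted(server.edits))
```

Note that the server never inspects a blob. It only needs to know which IDs each side is missing, which is what lets it stay a small component that shuffles binary data around.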
00:24:09 - Speaker 3: Yeah, and I think what’s really interesting about this architecture is that we actually maintain the entire bag of edits.
Edits only get added into the bag, but they never really get removed. And so the current state of the application is whatever the most recent edit is.
So if I make a card bigger on my Mac, and then I go to my iPad and I make that same card smaller, and then we synchronize those two things, well, at the end of the day, either the card is going to be smaller on both devices, or the card is gonna be bigger on both devices, and we just pick the most recent one. And that strategy of just picking the most recent edit actually makes conflicts essentially invisible, or so small and so easy to fix that the user can just say, oh, I want that big, let me make it big again. It’s really easy to solve on the user side without showing one of those really annoying, hello, there’s been an edit. There’s a conflict here. Would you like to choose copy A or copy B? Just being able to automatically resolve those is more than half of the magic, I think, of this architecture.
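A minimal sketch of "pick the most recent edit", assuming each edit carries an integer timestamp plus a device name as a tie-breaker. The `Edit` fields and the `materialize` function are illustrative names, not Muse’s actual data model. The property worth noticing is that the result depends only on which edits are in the bag, not on the order they arrived in.

```python
from typing import Dict, Iterable, NamedTuple, Tuple


class Edit(NamedTuple):
    timestamp: int   # stand-in for whatever ordering clock the system uses
    device: str      # breaks ties between edits with equal timestamps
    card: str
    attribute: str
    value: object


def materialize(bag: Iterable[Edit]) -> Dict[Tuple[str, str], object]:
    """Fold a bag of edits into current state; the newest edit per attribute wins."""
    winners: Dict[Tuple[str, str], Edit] = {}
    for edit in bag:
        key = (edit.card, edit.attribute)
        current = winners.get(key)
        if current is None or (edit.timestamp, edit.device) > (current.timestamp, current.device):
            winners[key] = edit
    return {key: edit.value for key, edit in winners.items()}


bigger = Edit(1, "mac", "card-1", "width", 400)
smaller = Edit(2, "ipad", "card-1", "width", 200)
moved = Edit(1, "mac", "card-1", "x", 10)

# Two devices that received the same edits in different orders.
state_on_mac = materialize([bigger, moved, smaller])
state_on_ipad = materialize([smaller, bigger, moved])
print(state_on_mac == state_on_ipad)
```

Both devices end up with the smaller width because that edit carries the later timestamp, and the unrelated move is kept because it touches a different attribute.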
00:25:13 - Speaker 2: I also note this is a place where I think the Muse domain, if you want to call it that, of the cards on a canvas model works pretty well with this sort of automated resolution, which is if you moved a card in one direction on one device and you moved it somewhere else on the other device, it’s not really a huge deal which one it picks as long as it all kind of flows pretty logically.
By comparison, text editing, so what you have in a Google Docs, or certainly I know the Automerge team and the Ink & Switch team have done a huge amount of work on this, is a much harder space where you can get into very illogical states if you merge your edits together strangely, but I think a card move, a card resize, add, remove, even some amount of reparenting within the boards, those things just are pretty natural to merge together, I think.
00:26:02 - Speaker 3: Yeah, I think so, and I think even with the new text block feature in Muse, we end up slicing what would be a really long form text document into much smaller sentences or paragraphs. And so then text edits, even though we’re only picking kind of the most recent one to win, we’re picking that most recent at the granularity of the sentence or of the paragraph, and so conflicts between documents for us are larger than they would be for Automerge or for Google Docs, but are small enough that it’s still ignorable for the user and easily solvable by the user.
00:26:42 - Speaker 2: Which incidentally I think is a trick we sort of borrowed from Figma, at least on the text side, which is in Figma and also in Muse, if one person edits, you know, "the red car" and someone else edits "the blue car," you don’t get "the red blue car," you just get one or the other, and it turns out for this specific domain, that’s just fine.
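The block-granularity idea can be sketched the same way. This assumes each text block is stored as a `(stamp, text)` pair where the stamp is a `(counter, device)` tuple; the `merge_blocks` function and the block IDs are hypothetical and only illustrate the behavior described, not Muse’s or Figma’s implementation.

```python
from typing import Dict, Tuple

Stamp = Tuple[int, str]           # (counter, device), assumed unique per edit
Block = Tuple[Stamp, str]         # (stamp, text of the sentence or paragraph)


def merge_blocks(local: Dict[str, Block], remote: Dict[str, Block]) -> Dict[str, Block]:
    """Merge two documents block by block; for each block the higher stamp wins."""
    merged = dict(local)
    for block_id, (stamp, text) in remote.items():
        if block_id not in merged or stamp > merged[block_id][0]:
            merged[block_id] = (stamp, text)
    return merged


base = {
    "p1": ((1, "base"), "The car."),
    "p2": ((1, "base"), "It is fast."),
}
# Alice edits both paragraphs; Bob edits only the first, slightly later.
alice = dict(base, p1=((2, "alice"), "The red car."), p2=((2, "alice"), "It is very fast."))
bob = dict(base, p1=((3, "bob"), "The blue car."))

merged_on_alice = merge_blocks(alice, bob)
merged_on_bob = merge_blocks(bob, alice)

for block_id in sorted(merged_on_alice):
    print(block_id, merged_on_alice[block_id][1])
```

The conflicting paragraph resolves to one whole version (here Bob’s, since his stamp is higher) and never to a character-level blend like "The red blue car.", while Alice’s edit to the other paragraph survives untouched.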
00:27:03 - Speaker 3: Yeah, I think we kind of lucked out having such a visual model, and we don’t need to worry about intricacies of multi-user live document editing.
00:27:13 - Speaker 1: Yeah, I would point to both Figma and Actual Budget as two very important inspirations for our work. I would say those are two of the products that were most at the forefront of this space, and thought about it most similarly to how we did.
And notably they, as well as us, sort of independently arrived at this notion of basically having a bunch of last-write-wins registers as the quote unquote CRDTs.
So these are very, very small, simple, almost degenerate CRDTs where the CRDT itself is just representing one attribute, for example, the X coordinate of a given card. But this is an important insight of the industrial application of this technology, if you will. That’s a good trade-off to make. It basically covers all the practical cases, but it’s still very simple to implement, relatively speaking.
00:28:03 - Speaker 2: I’ll also mention briefly Actual Budget, a great, basically made-by-one-person app, and recently open sourced, so you can actually go and read the custom CRDT work there and maybe learn a thing or two that you might want to borrow from.
00:28:17 - Speaker 3: I think one of the really interesting problems for me about the CRDT was deciding which edit is the most recent, because it just makes logical sense to say, oh well, it’s 3 o’clock, and when I make this edit at 3 o’clock and I make a different edit at 3:02, obviously the one at 3:02 wins.
But computer clocks aren’t necessarily trustworthy. Sometimes I have games on my iPad that reset every day and so I’ll set my clock forward or set my clock backward. Or if I’m on an airplane and there’s time zones, there’s all kinds of reasons the clock might jump forward or jump backward or be set to a different time, and so using a fancier clock that incorporates a wall clock, but also includes a counter and some other bits of information, lets us still order edits one after the other, even if one of those clocks on the wall is a year ahead of schedule compared to the other clocks that are being synchronized. I don’t know how in depth we want to get on that, but it’s called a hybrid logical clock.
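For readers who want to see the shape of a hybrid logical clock, here is a compact sketch following the commonly published algorithm (a wall-clock component plus a counter). It is not Muse’s implementation: the class and method names are made up, the `node` field used as a final tie-breaker is an assumption, and the integer wall-clock readings are arbitrary units.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(order=True, frozen=True)
class Timestamp:
    wall: int      # largest wall-clock reading seen so far
    counter: int   # orders events that share the same `wall` value
    node: str      # assumed device ID, used only to break exact ties


class HybridLogicalClock:
    def __init__(self, node: str, now: Callable[[], int]) -> None:
        self.node = node
        self.now = now       # returns this device's (possibly wrong) wall clock
        self.wall = 0
        self.counter = 0

    def tick(self) -> Timestamp:
        """Stamp a local edit."""
        physical = self.now()
        if physical > self.wall:
            self.wall, self.counter = physical, 0
        else:
            self.counter += 1
        return Timestamp(self.wall, self.counter, self.node)

    def observe(self, remote: Timestamp) -> Timestamp:
        """Fold in a timestamp received from another device."""
        physical = self.now()
        new_wall = max(self.wall, remote.wall, physical)
        if new_wall == self.wall == remote.wall:
            self.counter = max(self.counter, remote.counter) + 1
        elif new_wall == self.wall:
            self.counter += 1
        elif new_wall == remote.wall:
            self.counter = remote.counter + 1
        else:
            self.counter = 0
        self.wall = new_wall
        return Timestamp(self.wall, self.counter, self.node)


# The iPad's clock is set far ahead; the Mac's clock is correct.
ipad = HybridLogicalClock("ipad", now=lambda: 1_000_000)
mac = HybridLogicalClock("mac", now=lambda: 1_000)

ipad_edit = ipad.tick()
mac.observe(ipad_edit)     # the Mac syncs and sees the iPad's edit
mac_edit = mac.tick()      # a later edit on the Mac

print(mac_edit > ipad_edit)
```

Once the Mac has seen the iPad’s timestamp, its later edit still sorts after it even though the Mac’s own wall clock reads far earlier. An edit the Mac makes before ever hearing from the iPad would sort earlier, which is part of the accuracy trade-off of a fixed-size clock.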
00:29:23 - Speaker 1: Yeah, I think this is another great example along with choosing very simple CRDT structures of industrial style architecture where you could go for a full blown vector clock, and that gives you perfect logical ordering and a bunch of other nice properties, but it’s quite large and it’s expensive to compute and so on. Whereas if you choose a simpler fixed size clock, that can give you all the benefits that you need in practice, it can be easier to implement, it could be faster to run, and so on.
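To make the size trade-off concrete, here is a small vector clock comparison. It is a generic textbook sketch with hypothetical device names, not something Muse ships: each clock is a map from device to counter, so it can tell "before" from "concurrent" exactly, but it carries one entry per device that has ever edited.

```python
from typing import Dict

VectorClock = Dict[str, int]   # device name -> number of events seen from it


def compare(a: VectorClock, b: VectorClock) -> str:
    """Return 'before', 'after', 'equal', or 'concurrent' for clock a relative to b."""
    devices = a.keys() | b.keys()
    less = any(a.get(d, 0) < b.get(d, 0) for d in devices)
    greater = any(a.get(d, 0) > b.get(d, 0) for d in devices)
    if less and greater:
        return "concurrent"
    if less:
        return "before"
    if greater:
        return "after"
    return "equal"


ipad_edit = {"ipad": 2, "mac": 1}
mac_edit = {"ipad": 1, "mac": 2}                 # made without seeing the iPad's edit
phone_edit = {"ipad": 2, "mac": 2, "phone": 1}   # made after seeing both

print(compare(ipad_edit, mac_edit))
print(compare(ipad_edit, phone_edit))
print(len(phone_edit))  # grows with every additional device
```

A fixed-size clock gives up the ability to report "concurrent" and simply picks an order, which is enough when the merge rule is last write wins anyway.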
00:29:52 - Speaker 3: Like everything in life, it’s all about trade-offs, and you can get accuracy, but it costs more, or you can get a little bit less accuracy, and it costs a lot less, and for us that was the better trade-off: to have a fixed size clock that gives us enough of the ordering to make sense, but might not be exactly perfect ordering.
00:30:13 - Speaker 1: And we’ve been alluding to trade-offs and different options, so maybe it’s time to address it head on in terms of the other options that we considered and why they weren’t necessarily as good of a fit for us. So I would include in this list both iCloud and what you call like file storage.
00:30:27 - Speaker 2: It might be like CloudKit or something, but yeah, they have one that’s more of a blob, kind of, you know, save files, what people will think of with their sort of iCloud Drive, almost kind of a Dropbox thing, and then they also have CloudKit. I feel like it’s a key-value store, but in theory, those two things together would give you the things you need for an application like ours.
00:30:47 - Speaker 1: Yeah, so there’s iCloud as an option, Firebase, Automerge, CouchDB maybe, then there’s the roll-your-own, which we ended up doing.
00:30:57 - Speaker 2: Yeah, the general wisdom is, you know, you don’t write your own if there’s a good off-the-shelf solution. You named some there that are commercial, some are built into the operating system we’re using, some are indeed research projects that we’ve been a part of. What ultimately caused us to follow our own path on that?
00:31:15 - Speaker 1: Yeah, so there was a set of issues that tended to come up with all of these, and it was more or less in different cases, but I think it’d be useful to go through the challenges that we ran into and talk about how they emerged in different ones of these other solutions.
So one simple one, it would seem, is just correctness slash it works. And the simple truth is, a lot of the syncing systems out there just do not work reliably. Hate to pick on Apple and iCloud, but honestly, they were the toughest in this respect, where sometimes you would, you know, submit data to be synchronized and it just wouldn’t show up. And especially with opaque closed source solutions and third party solutions, stuff would not show up and you couldn’t do anything about it, like you couldn’t see what went wrong or when it might show up or if there was some error.
And then bizarrely, sometimes the stuff would pop up like 5 or 10 minutes later. It’s like, oh, it’s actually sort of worked, but it’s off by, you know, several zeros in terms of performance. So that was a really basic one, like the syncing system has to be absolutely rock solid, and it kind of goes back to the discussion Wulf had around being offline sometimes. If there’s any chance that the sync system is not reliable, then that becomes a loop in the user’s mind. Am I gonna lose this data? Is something not showing up because the sync system is broken? Our experience has been that if there’s any lack of reliability or lack of visibility into the synchronization layer, it really bubbles up into the user’s mind in a destructive way, so we want it to be absolutely rock solid.
Another important thing for us was supporting the right programming model. So we’ve been working on Muse for several years now. We have a pretty good idea of what capabilities the system needed to have, and I think there were 4 key pillars. One is the obvious transactional data. It’s things like what are the cards and where are they on the board. This is data that you would traditionally put in a SQL database.
Another thing that’s important to have is blob support. There are a lot of these binary assets in Muse, and we wanted those to be in the same system and not have to have another separate thing that’s out of band, and they need to be able to relate to each other correctly.
00:33:09 - Speaker 2: This is something where a 10 megabyte PDF or a 50 megabyte video just has very different data storage needs than the tiny little record that says this card is at this X and Y position and belongs to this board.
00:33:23 - Speaker 1: Right, very different, and in fact you’re gonna want to manage the networking differently.
Basically you want to prioritize the transactional data and then load later, or even lazily, the binary data, which is much larger.
Yeah, so there was transactional data, blob data, and then real-time data slash ephemeral data.
So this is things like you’re in the middle of an ink stroke or you’re in the middle of moving a card around and this is very important to convey if you’re gonna have real time and especially multi-user collaboration, but again, you can’t treat this the same as certainly blob data, but even transactional data, because if you store every position a card ever was under your finger for all time, you’re gonna blow up the database.
So you need those 3 different types of data, and they all need to be integrated very closely.
So for example, when you’re moving a card around, that’s real time, but basically the last frame becomes a bit of transactional data, and those two systems need to be so lined up to each other that it’s as simple as changing a flag. If you’re going on a 2nd or a 3rd band for real-time data and need to totally change course for saving the transactional data, it’s not gonna be good.
It was quite rare. I don’t know if we found any systems that support all three of these coherently.
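To make the “as simple as changing a flag” point concrete, here is a small Python sketch. The names `CardMove`, `EditLog`, and the `durable` flag are hypothetical and chosen for this example; they are not Muse’s actual types. The idea shown is that every frame of a drag goes to live peers, and only the last frame is also persisted as transactional data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CardMove:
    card_id: str
    x: float
    y: float
    durable: bool = False   # False: ephemeral, broadcast only; True: also persisted


class EditLog:
    def __init__(self):
        self.broadcast = []   # everything sent to live peers
        self.persisted = []   # transactional edits that are kept

    def publish(self, move):
        self.broadcast.append(move)
        if move.durable:
            self.persisted.append(move)


def drag(log, card_id, positions):
    """Publish each frame of a drag; only the final frame is marked durable."""
    for i, (x, y) in enumerate(positions):
        is_last = i == len(positions) - 1
        log.publish(CardMove(card_id, x, y, durable=is_last))


log = EditLog()
drag(log, "card-1", [(0, 0), (3, 0), (6, 1)])
print(len(log.broadcast), len(log.persisted))
```

Because both kinds of message share one type and one publish path, the database never sees the intermediate positions, yet collaborators still see the card move in real time.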
00:34:33 - Speaker 2: The ephemeral data element I found especially interesting because you do really want that real-timey feeling of someone wiggles a card with their finger and you can see the wiggling on the other side. That just makes the thing feel live and just responsive in a way that it doesn’t otherwise.
But yeah, at the same time, you also don’t want hundreds of thousands of records of the card moved 3 pixels right, and then 3 pixels left.
And one thing I thought was fascinating, correct me if I misunderstood this, is that because the client knows how many other devices are actively connected to the session, it can choose to not even send that ephemeral data at all. It doesn’t even need to tap the network. If no one else is listening, why bother sending ephemeral data? All you need is the transactions over time.
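The optimization described here can be sketched in a few lines of Python. `SyncClient`, `connected_peers`, and the `send` callable are assumptions for illustration; how the client actually learns the peer count is not covered in the conversation.

```python
class SyncClient:
    def __init__(self, send):
        self.send = send             # callable that puts a message on the network
        self.connected_peers = 0     # assumed to be kept up to date by the session

    def publish(self, payload, ephemeral):
        """Send a message unless it is ephemeral and nobody is listening."""
        if ephemeral and self.connected_peers == 0:
            return False             # skip the network entirely
        self.send(payload)
        return True


sent = []
client = SyncClient(sent.append)
client.publish(b"card wiggled", ephemeral=True)     # dropped: no peers connected
client.publish(b"card moved", ephemeral=False)      # transactional: always sent
print(sent)
```

Transactional edits are always sent because other devices need them later, even if none is connected right now.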
00:35:21 - Speaker 1: Right, this is actually a good example of how there’s a lot of cases where different parts of the system need to know or at least benefit from knowing about other parts.
So it becomes costly, or maybe just an outright bad idea, to separate them, especially as we’re still figuring out as an industry how they should work. I think there’s actually quite a bit of benefit to them being integrated.
Another one that we could talk about eventually is prioritizing which data you download and upload. You might choose to first download blobs that are closer to you in board space, like it’s in your current room or it’s in adjacent rooms, and then later you can download other blobs. So that’s something you couldn’t do if the application had no notion of the networking layer.
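One way to picture “closer to you in board space” is as graph distance between boards. The Python sketch below is an assumption about how such prioritization could work, not a description of Muse’s code: `links` is a hypothetical map from each board to the boards it contains or links to, and blobs are ordered by breadth-first distance from the board the user is currently viewing.

```python
from collections import deque


def board_distances(links, current):
    """Breadth-first distance from the current board to every reachable board."""
    distance = {current: 0}
    queue = deque([current])
    while queue:
        board = queue.popleft()
        for neighbor in links.get(board, []):
            if neighbor not in distance:
                distance[neighbor] = distance[board] + 1
                queue.append(neighbor)
    return distance


def download_order(blobs, links, current):
    """Sort (blob_id, board) pairs so blobs on nearby boards come first."""
    distance = board_distances(links, current)
    unreachable = float("inf")
    return sorted(blobs, key=lambda blob: distance.get(blob[1], unreachable))


links = {"home": ["travel", "work"], "travel": ["paris"]}
blobs = [("map.pdf", "paris"), ("notes.png", "home"),
         ("clip.mov", "archive"), ("plan.pdf", "work")]
for blob_id, board in download_order(blobs, links, "home"):
    print(blob_id, board)
```

The ordering depends on application knowledge (which board is open) feeding a networking decision (what to fetch first), which is the kind of coupling being argued for.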
It actually brings us to another big challenge we saw with existing systems, which is multiplexing. So I’ll use an example of Automerge here, and this is something we’ve seen with a lot of research-oriented CRDT work. It’s very focused on a single document, so you have a document that represents, you know, say a board or whatever, and a lot of the work is around how do you synchronize that document, how do you maintain correctness, even how do you maintain performance when you’re synchronizing that document across devices.
Well, the challenge with Muse, with our model, is that you might have, you know, easily 1000, but, you know, potentially tens of thousands up to millions of documents in the system corresponding to all your individual cards and so on. And so if you do anything that’s order N in the number of documents, it’s already game over.
Here’s a specific challenge that I had in mind for the system. You have a corpus, let’s say it’s a million edits across 10,000 documents or something like that, and it’s 100 megabytes. I wanted the time to synchronize a new device, that is, to download and persist that entire corpus, to be roughly proportional to the time it would take to just physically download that data. So if you’re on a 10 megabyte per second connection and it’s 100 megabytes, maybe that’s 10 seconds. But the only way to do that is to do a massive amount of like multiplexing, coalescing, batching, compression, so that you’re taking all these edits and you’re squeezing them into a small number of network messages and compressing them and so on. So you’re sort of pivoting the data, so it’s better suited to the network transfer and the persistence layer. And again, you need to be considering all these things at once, like how does the application model relate to the logical model, relate to the networking protocol, relate to the compression strategy, and we weren’t able to find systems that correctly handle that, especially when you’re talking about thousands or millions of documents being synchronized in parallel.
And the last thing I’ll mention is what I call industrial design trade-offs. We’ve been alluding to it in the podcast so far, but things like simplicity, understandability, control, these are incredibly important when you’re developing an industrial application, and you tend not to get these with early stage open source projects and third party solutions and third party services.
You just don’t have a lot of control, and it was too likely to my mind that we would just be stuck in the cold at some point where the system didn’t work or it didn’t have some capability that we wanted, and then you’re up a dead end road, and so what do you do? Whereas this is a very small, simple system. You could print out the entirety of the whole system and it would be probably a few pages, well, it’s a few thousand lines of code, it’s not a lot of code, and it’s across a couple of code bases, and so we can load the whole thing into our head and therefore understand it and make changes as needed to advance the business.
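The batching and compression idea from this turn can be sketched as follows. This is a toy illustration under stated assumptions: JSON and zlib stand in for whatever encoding and compression the real system uses, and `pack_edits`, `unpack_edits`, and the batch size of 1000 are made up for the example. What it shows is many small edits from many documents being coalesced into a handful of compressed messages, so the message count does not grow with the number of documents.

```python
import json
import zlib


def pack_edits(edits, batch_size=1000):
    """Coalesce many small edits into a few compressed network messages."""
    messages = []
    for start in range(0, len(edits), batch_size):
        batch = edits[start:start + batch_size]
        raw = json.dumps(batch, separators=(",", ":")).encode("utf-8")
        messages.append(zlib.compress(raw))
    return messages


def unpack_edits(messages):
    """Reverse of pack_edits: decompress each message and concatenate the edits."""
    edits = []
    for message in messages:
        edits.extend(json.loads(zlib.decompress(message)))
    return edits


# 100,000 edits spread over 10,000 hypothetical documents.
edits = [{"doc": f"card-{i % 10_000}", "x": i % 500, "y": i % 300}
         for i in range(100_000)]
messages = pack_edits(edits)
raw_size = len(json.dumps(edits).encode("utf-8"))
packed_size = sum(len(m) for m in messages)
print(len(messages), raw_size, packed_size)
```

The target mentioned above is simple arithmetic: 100 megabytes over a 10 megabyte per second link is about 10 seconds, and that bound is only reachable if per-document and per-message overhead stays negligible.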
00:38:38 - Speaker 3: Yeah, I think that last point might honestly be the most important, at least for me. I think having a very simple mental model of what is happening in sync makes things so much easier to reason about. It makes fixing bugs so much easier. It makes preventing bugs so much easier. We’ve been talking about how sync is hard and how almost nobody gets it right, and that’s because it’s complicated. There’s a bajillion little bitty edge cases of if this happens, but then this happens after this happens, and then this happens. What do we do? And so making something really, really simple conceptually, I think, was really important for the Muse sync stability and performance at the end of the day.
00:39:21 - Speaker 2: I’m an old school web developer, so when I think of clients and servers, I think of REST APIs, and you maybe make kind of a versioned API spec, and then the back end developer writes the endpoint to be called and the front end developer figures out how to call that with the right parameters and what to do with the response. What’s the diff between a world that looks like that and how the new sync service is implemented?
00:39:50 - Speaker 1: Yeah, well, a couple things. At the network layer, it’s not wildly different. We do use protocol buffers and binary encoding, which by the way, I think would actually be the better thing for a lot of services to do, and I think services are increasingly moving in that direction, but that core model of you have, we call them endpoints. You construct messages that you send to the endpoint and the server responds with a response message. That basic model is pretty similar, even if it’s implemented in a way that’s designed to be more efficient, maintainable, and so on than a traditional REST server.
But a big difference between a traditional REST application and the Muse sync layer is that there are two completely separate layers, what we call the network layer and the app layer. So the network layer is responsible for shuffling these binary blobs around: the transactional data, the ephemeral data, and the big binary assets. And the server knows absolutely nothing about what’s inside of them by design, both because we don’t want to have to reimplement all of the Muse logic about boards and cards or whatever in the server, and also because we anticipate eventually end-to-end encrypting this, and at that point, of course, the server can’t know anything about it, it’s not gonna be possible. So that’s the networking layer, and then if you sort of unwrap that you get the application layer, and that’s the layer that knows about boards and cards and edits and so on. And so it is different, and I would say it’s a challenge to think about these two different layers. There’s actually some additional pivots that go on in between them, versus the traditional model where you would, like, POST to v1/boards directly and you’d put the parameters of the boards and then the server would write that to the boards table in the database. There’s a few different layers that happen with this system.
00:41:30 - Speaker 2: So if we want to add a new card type, for example, or add a new piece of data to an existing card, that’s purely in the application layer, and the back end doesn’t know anything about that, so no changes are needed on the back end?
00:41:44 - Speaker 1: Yeah, no changes are needed.
In fact, one of the things I’m most proud about with this project is we basically haven’t changed the server since last year, December, and we’ve been, you know, rigorously iterating on the app, you know, adding features, changing features, improving a bunch of stuff, and the server, it’s basically the same thing that was up 4 months ago, just chugging along, and that’s a benefit. It’s a huge benefit, I think, of this model of separating out the application model and the network model, because the network model is eventually gonna move very slowly. You basically figure that out once and it can run forever. And the application model has more churn, but then when you need to make those changes, you only need to make them in the client or the clients. Maybe you update the application schema so that current and future clients can understand that, and then you just start including those data in the bag of edits.
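The separation between the network layer and the app layer can be illustrated with a short Python sketch. Everything named here is hypothetical: `RelayServer` stands in for a server that only stores and forwards opaque bytes, and JSON stands in for the application-level encoding (the conversation mentions protocol buffers on the wire). The point shown is that a new kind of edit needs no server change, because the server never looks inside a payload.

```python
import json


class RelayServer:
    """Network layer: stores and forwards opaque bytes, never parses them."""

    def __init__(self):
        self.log = []

    def push(self, payload: bytes) -> int:
        self.log.append(payload)
        return len(self.log)     # position in the log, usable as a sync cursor

    def pull(self, since: int) -> list:
        return self.log[since:]


def encode_edit(edit: dict) -> bytes:
    """App layer: only clients know what an edit means."""
    return json.dumps(edit).encode("utf-8")


def decode_edit(payload: bytes) -> dict:
    return json.loads(payload)


server = RelayServer()
server.push(encode_edit({"type": "move_card", "card": "c1", "x": 10, "y": 20}))
# A newer client introduces an edit type the server has never heard of.
server.push(encode_edit({"type": "add_sticker", "card": "c1", "emoji": "star"}))
for payload in server.pull(0):
    print(decode_edit(payload)["type"])
```

If the payloads were end-to-end encrypted, `RelayServer` would not change at all, which is the other motivation given for keeping the server ignorant of their contents.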
00:42:26 - Speaker 3: Yeah, I think one thing that’s really nice is that those protocol buffers that you were talking about are type safe and kind of statically defined, so that way when we’re sending that message over the wire, we know 100% we’re sending exactly the correct messages no matter what, and that guarantee is made at compile time, which I think is really nice because it means that a lot of bugs that could otherwise easily sneak in if we’re using kind of a generic JSON framework, we’re gonna find out about when we hit the “please build Muse” button instead of the “I’m running Muse and I randomly hit a bug” button. And that kind of confidence early on in the build process has been really important for us as well to find and fix issues before they even arise.
00:43:11 - Speaker 1: Yeah, to my mind this is the correct way to build network clients. You have a schema and it generates type-safe code in whatever language you want to use.
There’s just enormous benefits to that approach. I think we’re seeing it with this on Muse, and again, I think more systems, even more traditional B2B type systems, are moving in this direction.
By the way, everyone always made fun of Amazon’s API back in the day. It had this crazy XML thing where there’s a zillion endpoints. I actually think they were closer to the truth than the traditional, you know, nice REST CRUD stuff, because their clients are all auto-generated, and sure enough they have like literally a zillion endpoints, but everything gets generated for free to a bunch of different languages.
Anyways, one challenge that we do have with this approa