An Interview with Alex Komyagin, Senior CE at MongoDB

18.05.2016

MongoDB, a cross-platform document-oriented database, is on the rise. Thousands of companies use MongoDB, including Adobe, Amadeus, BNP Paribas, Cisco Systems, Craigslist, eBay, Foursquare, LinkedIn, McAfee, MetLife, SAP, Shutterfly and Yandex. Today we are going to discuss the company with Alex Komyagin, Senior Consulting Engineer at MongoDB.

Hello, Alex. Tell us something about you.

Born and raised in Saint Petersburg, I studied Computer Science at Saint Petersburg Polytechnical University. In fact, I went there because I wanted to be a sysadmin, only to discover later a very lucky coincidence: that's not what I actually wanted, and that's not what they were teaching. That said, our professors taught me a lot and I'm very grateful for all their hard work.

While studying, I worked at Motorola Solutions, the enterprise branch of Motorola, implementing their proprietary Bluetooth stack for Linux and WinCE (do you even remember that one?). After I graduated, I continued working for them, leading our small team toward a successful integration with Android.

At some point I decided to move to the US (my parents had lived there for some time), and I got a Technical Service Engineer position at MongoDB. Initially I wanted to find a software engineer position, but I felt I wanted to work more with people than with code, plus Ron (our VP) was very convincing. He found time to talk to me twice, for over an hour, during the interview process, and he was incredibly technical. That means a lot. Funny enough, I joined MongoDB on April 1st, only 3 weeks after I came to the US. The whole MongoDB team made a truly giant effort to help me adjust to the new environment. After 2 exciting years in the Technical Services team, I moved to our Professional Services department, which is where I work now. It's somewhat similar to what I was doing before, in the sense that I'm mostly working with people, but it allows me to work with them face-to-face and better understand how to help them.

You have worked at MongoDB since 2013. Since then the company has grown and become a very strong player in the market. What's the secret of its success: good management or a really good product?

When I joined MongoDB, we had a very small office on Prince St with around 40 people in it. Now our NYC office is huge and has moved to Times Square. We also have offices in PA, TX, and around the world: Sydney, Dublin, Tel Aviv, etc.

I think both management and the product are very important factors in our growth. Our engineering team is working hard on improving the database; actually, on all of our products, since we have MongoDB drivers for different languages, Cloud Manager, OpsManager, Compass and the BI Connector (more on these later). And our management is striving to build a great company.

However, there is much more to it than that: the people, which I now believe are crucial for any business's success. We have amazing people in all departments who communicate with each other and work as a unit, as a team. Every person at MongoDB brings something to the table, whether it's their passion, their in-depth knowledge, or both.

Of course, just like any fast-growing company with people all around the globe, we face some challenges. But it's the commitment at all levels of the company that allows us to overcome them.

What can you tell us about the history of MongoDB? Any important stages in the company's development?

Every year we go through different changes in the company, both organizational and technological. If I had to pick a couple of important stages, I would say that moving the HQ office from CA to NY was definitely a big one. I joined after it had already happened, and I think that having the HQ here brings us much closer to "the action", especially from a logistical point of view. Many enterprises have offices just a train ride away.

Another big organizational change was the appointment of our new CEO, Dev. With him, we started to focus heavily on our efficiency as an organization, and we restructured our Professional Services and Sales departments to make it easier for our customers to work with us.

From the technological point of view, the integration of the WiredTiger storage engine in 2014 was definitely huge. It allowed us to resolve many outstanding requests from our users and to significantly expand the range of MongoDB use cases.

It is also notable that we are putting additional effort into building an ecosystem of different tools for MongoDB, including tools and products that improve the operational experience, such as OpsManager, CloudManager, Automation, managed backups and Compass.

Great. What can you tell us about your personal responsibilities at MongoDB? Any remarkable facts or moments?

Our Professional Services department provides our clients with help and expertise for their MongoDB projects. We offer help at all stages, whether it's just a POC, an already existing standalone project, or even a private cloud or DBaaS.

Not including pure training engagements, my job largely consists of two parts: working with the client to define and understand their goals and specific requirements, and then working with them to find specific solutions. It involves both solving global architecture design problems and helping with very specific MongoDB questions, like choosing a good shard key.

In my spare time I try to help my team by reviewing others' reports, sharing my knowledge and filing bug reports. I guess I can sometimes be quite annoying to the engineering team, because my position requires a good understanding of MongoDB internals and our future plans. I always ask a lot of questions and share the feedback I get from the field, whether it's positive or prompts us to reassess our assumptions. But so far the engineering team has put up with me.

I really enjoy traveling for my job and visiting cool new places. A few months ago I went to Alabama, and it looks very different from what I was used to in New York City. We went to a local BBQ place called "Landmark BBQ", and it was probably the best BBQ in the world. It was an old one-story building alongside the road, and the only thing that stood out was the expensive cars parked next to it. I guess that's a Southern thing.

What are the main benefits of MongoDB compared to other NoSQL solutions? Who is your main competitor?

I'm a big believer that you should use the most suitable tool for the problem. Right now you have a wide range of readily available tools for data storage and processing; MongoDB is one of those tools, and every solution has its own strong points.

MongoDB is somewhat unique in the sense that we are not a niche product, but we are building a general solution that can be used in different domains. Our goal is to give users the best of both worlds – an enterprise-grade product with rich functionality that is easy to use and that scales to support their data with minimal effort.

Among the well-known benefits of using MongoDB, like flexible schema, scalability, high availability, great developer experience and easy setup, I would list the ecosystem and great support. In the last few years we undertook an effort to provide ops teams with tools to operationalize MongoDB, like CloudManager and OpsManager, which provide monitoring, automation and backup. We have also put a lot of effort into integration with the enterprise ecosystem; for example, we recently released a BI Connector that provides a MongoDB interface for SQL-based BI tools.

It's also important to mention that we are continuously working on all of our products, non-stop. For any issues that our customers face, we have a world-class support team that is always there to help.

What can be improved in MongoDB? Nothing in this world is perfect. For example, there are myths about MongoDB memory issues and file corruption, and there is also an opinion that MongoDB is only suitable for small projects. What would you say?

First of all, I'd like to be very clear that thousands of organizations rely on MongoDB for all kinds of production applications, and many of them are mission critical. So it's a myth that MongoDB is only suitable for small projects. Regarding file corruption and the various kinds of issues found on the Internet, most of them are just speculation. I've been involved in multiple corruption incidents in the past, and they were all traced to faulty disks or file system issues. It's worth noting that our replication is statement-based as opposed to binary (e.g. SAN), hence replica sets always help to prevent data loss in situations like these.

That said, nothing is really perfect. With every release, including minor releases, Ramon, our Program Manager, publishes the release notes with the list of things addressed in that version. It's very easy to see that we're constantly working on improving the product. It has come a long way. Since version 3.0 we support the new WiredTiger storage engine, and it's the default option in 3.2 and newer releases. WiredTiger brings a lot of new features to the product and enables us to do even more cool things in the future. Specifically, in WiredTiger we changed the way we allocate memory and the way we manage disk space. WiredTiger supports compression by default, and in the field I normally see a 5x-10x disk space reduction without any noticeable growth in CPU usage. The previous default storage engine (MMAP) was somewhat sensitive to file system corruption, which unfortunately sometimes happens in production systems, as MMAP had no way to detect whether certain disk blocks were corrupted and to what extent. WiredTiger stores checksums for its pages, and that makes MongoDB more robust.

Another important concern that WiredTiger addresses is locking. Prior to 3.0, we had database-level locking, meaning that for every database (you can have multiple databases in the same cluster) there could be either multiple concurrent readers or only one writer. WiredTiger employs lock-free algorithms and works similarly to MVCC systems. Essentially, we now have document-level locking, which is as granular as it gets. On systems with performant disks we can easily saturate all available CPUs, and thousands of write operations per second on average systems are not unusual anymore.

For version 3.2 our engineering team reworked the replication protocol to make elections in replica sets faster (from 10-30 seconds to 2-5) and to provide stronger durability guarantees for our customers.

I think many of our customers are now looking forward to improvements in sharding. Internally, sharding involves pretty sophisticated machinery to make sure that scaling doesn't affect data consistency and durability guarantees. While MongoDB does a very good job of hiding this complexity from application developers, sharding introduces additional components to the system (query routers, config servers) and some additional operational overhead. It would be nice to remove that overhead, or at least to minimize it. We are already taking steps in that direction: in MongoDB 3.2 we support config servers deployed as a replica set, which eliminates some of the complexity of sharding administration.

And that's just about the core product. We feel the need of our biggest customers to make it easier to operate many different clusters with lots of nodes, and we invested a lot of effort into CloudManager, which is our cloud-based monitoring, backup and automation service, and OpsManager, its on-premises version. While you can easily perform a full cluster upgrade without downtime in MongoDB, via a simple procedure we call a "rolling" upgrade, it's still a manual procedure that takes time. Imagine doing that not on a 3-node replica set but on a 50-shard cluster with 5 replica set nodes in each shard (250 nodes total). With OpsManager Automation, success is only a few button clicks and a couple of minutes away. The Backup and Automation features are pretty new, but they hold a lot of potential, and I already see them used in many production deployments. However, just like any new product in such a well-established market, they will need some time before they can fully meet every existing enterprise requirement.

As you know, MongoDB supports flexible schema, and by that I mean that we don't impose any restrictions on the structure of your documents; you are free to store them in whatever way you like. Ironically, this sometimes results in users not having a very clear idea of what exactly they have in the database. For big deployments with millions or billions of documents, it's simply not feasible to inspect them all. Recently we launched a new schema exploration tool called Compass; you should try it out if you haven't heard of it. On one project I remember finding out that 95% of their field values were "-", which they were not interested in. We significantly reduced their index size by using partial indexes.
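The savings Alex describes follow directly from what a partial index is: it only stores entries for documents matching a filter. A rough pure-Python sketch (the 95%/"-" ratio mirrors the anecdote above; the `status` field name and the 100,000-document collection are made up for illustration):

```python
# Sketch: why a partial index can be ~20x smaller than a full index.
# In MongoDB you would express the filter as a partialFilterExpression,
# e.g. {"status": {"$ne": "-"}}; here we just model index entries as tuples.
import random

random.seed(42)
docs = [{"_id": i, "status": "-" if random.random() < 0.95 else "active"}
        for i in range(100_000)]

# A full index keeps one (key, _id) entry per document.
full_index = sorted((d["status"], d["_id"]) for d in docs)

# A partial index keeps entries only for documents matching the filter.
partial_index = sorted((d["status"], d["_id"])
                       for d in docs if d["status"] != "-")

assert len(full_index) == 100_000
assert len(partial_index) < 10_000  # roughly 5% of the entries survive
```

Queries that filter out the placeholder value anyway can still use the smaller index, which is the trade-off partial indexes are built around.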

MongoDB cannot do collation-based sorting and is limited to byte-wise comparison via memcmp, which will not produce correct ordering for many non-English languages when used with a Unicode encoding. Could we please get an update on this? For many non-English-speaking users, this is a pretty important issue.

In short, we're working on this. It's not a trivial thing to do, as there are several different parts of the system where changes have to be made, including the query engine itself, indexing and sharding. It's hard to say right now what it will look like from the user interface point of view and what exactly will be available in our next release, MongoDB 3.4. But rest assured, collation support is being actively worked on.

There are two well-known benchmarks of MongoDB performance, by United Software Associates and End Point, but they suggest opposite results. What can you tell us about that?

Personally, I'm quite skeptical about the usefulness of benchmarks in the modern database world, primarily because they tend to produce different results and are sometimes used as a weapon of speculation.

MongoDB is very flexible, and its performance depends on many factors. You can optimize your schema for your application-specific needs, deploy MongoDB with HA as a replica set or build a sharded cluster to scale your load, build secondary indexes to support your queries, and so on. Proper schema design and architecture will give you optimal performance results. But those results will not really be applicable to a different use case or a different application, because it will have its own data model, queries, etc. As such, I find that benchmarks are not very useful, and they tend to confuse people.

Further, the database is an ever-growing product, and performance is one of our priorities. Hence any published benchmark becomes outdated rather quickly, although sometimes one can be used to track particular performance improvements. We did our own set of benchmarks a year ago with MongoDB 3.0, where we saw 4x throughput improvements with WiredTiger over the MMAP engine for read-mostly workloads, and about 6x for balanced workloads: Performance Testing MongoDB 3.0 Part 1: Throughput Improvements Measured with YCSB. The numbers are probably even higher now.

In my opinion, building a POC is a much better time investment than studying benchmarks.

To implement simple 'paged' results in MongoDB you can use the cursor.skip() method. Yet it isn't very efficient when used with large collections (for example, the query takes too long to process). Are you going to fix that? I mean, the issue is described in the documentation, but... What do you think about it?

To properly answer this question, it's important to explain what the cursor.skip() method actually does and why it's sometimes not the best way to implement pagination.

Cursor.skip() has to find a position in the index first, and then it has to traverse a fixed number of index keys. An index is a B-tree, and without going too deep into technical details, B-trees are very effective at finding a specific value or doing a range scan, but they have no shortcut for skipping a certain number of values. Thus skip() results in traversing the tree, which can indeed be a very IO-heavy operation if the skip count is big.
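The asymmetry described above can be made concrete with a small sketch (pure Python over a sorted list standing in for the index; the function names are illustrative, not MongoDB API):

```python
# Sketch: skipping N index keys costs N steps, while seeking to a key
# in a sorted structure costs O(log n), like a B-tree descent.
from bisect import bisect_right

keys = list(range(1_000_000))  # sorted index keys, one per document

def skip_walk(keys, n):
    """Mimic cursor.skip(n): step over n keys one at a time."""
    it = iter(keys)
    steps = 0
    for _ in range(n):
        next(it)
        steps += 1
    return steps, next(it)  # work done, first key of the page

def range_seek(keys, last_seen):
    """Mimic find({key: {$gt: last_seen}}): one logarithmic descent."""
    return keys[bisect_right(keys, last_seen)]

steps, first_key = skip_walk(keys, 500_000)
assert steps == 500_000                       # half a million keys touched
assert first_key == 500_000
assert range_seek(keys, 499_999) == 500_000   # same row, a single seek
```

Both calls land on the same key, but skip_walk's cost grows linearly with the page number, which is exactly the behavior the documentation warns about.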

In general, range-based pagination is a much more efficient way to implement 'paged' results. For this you keep track of the first and last unique sort key values on the current page. When you switch to the next page, you query for results starting after the last sort key value of the current page (using the $gt query operator); similarly, when you switch to the previous page, you query for results starting before the first sort key value of the current page, in the opposite direction (using the $lt query operator). This approach performs well regardless of the page number.
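The scheme can be sketched in a few lines (a minimal model over an in-memory sorted list; the `key` field name and page size are assumptions, and in MongoDB each helper would be a find() with $gt or $lt, a sort(), and a limit()):

```python
# Minimal sketch of range-based pagination over a unique, sorted key.

def next_page(docs, last_key, page_size):
    """Page after the one ending at last_key: {key: {$gt: last_key}}."""
    return [d for d in docs if d["key"] > last_key][:page_size]

def prev_page(docs, first_key, page_size):
    """Page before the one starting at first_key: {key: {$lt: first_key}},
    scanned from the end, then kept in ascending order."""
    older = [d for d in docs if d["key"] < first_key]
    return older[-page_size:]

docs = [{"key": i} for i in range(1, 11)]            # keys 1..10, sorted
page1 = docs[:3]                                      # keys 1, 2, 3
page2 = next_page(docs, last_key=3, page_size=3)      # keys 4, 5, 6
back = prev_page(docs, first_key=4, page_size=3)      # keys 1, 2, 3 again
```

Because each page query seeks directly to a key boundary instead of counting rows, page 10,000 costs the same as page 1.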

What are your thoughts on the MongoDB community? Do you listen to people's suggestions, or just continue to develop the product? For instance, suppose you have added a new feature that nobody likes. What would you do?

Our Product Management team has regular meetings with different users to get their feedback, and we actively listen to our community. People’s suggestions are always taken into consideration. For example, in MongoDB 3.2 we introduced $lookup, which is a new pipeline stage for the aggregation framework that essentially does a left outer join. Originally, we intended to include that new functionality only in the Enterprise Edition of MongoDB, along with other features like Encryption at Rest. However, based on the feedback from our community, we changed our decision and, as you probably know, $lookup is now available as part of the Community Edition.
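To make the $lookup mention concrete, its semantics are those of a left outer join: every document on the "local" side survives, with an array of matches (possibly empty) attached. A pure-Python sketch under made-up collection and field names:

```python
# Sketch of {$lookup: {from, localField, foreignField, as}} semantics.
# The orders/inventory data is hypothetical, for illustration only.

orders = [
    {"_id": 1, "item": "abc"},
    {"_id": 2, "item": "xyz"},   # no matching inventory document
]
inventory = [
    {"_id": 10, "sku": "abc", "qty": 120},
]

def lookup(local, foreign, local_field, foreign_field, as_field):
    """Left outer join: attach matching foreign docs as an array."""
    out = []
    for doc in local:
        matches = [f for f in foreign
                   if f.get(foreign_field) == doc.get(local_field)]
        out.append({**doc, as_field: matches})  # [] when nothing matches
    return out

joined = lookup(orders, inventory, "item", "sku", "stock")
```

In the real aggregation stage this runs inside the server, so the join happens without shipping both collections to the application.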

I think it's very important for a company like ours to hear the feedback while maintaining the actual vision of the product. Another good example is SSL support, which for a long time was only available in MongoDB Enterprise, or if you built MongoDB yourself from source with SSL support. Now it's included in the official Community Edition packages.

With all of the above in mind, who would you recommend use MongoDB? Who is your main customer: big corporations, small businesses, or startups? Which projects are a good fit for MongoDB? Give us some examples. Which technologies can be easily 'combined' with MongoDB?

I don't think it's possible to draw a line by the size of the company or the nature of the business. I see totally different projects in totally different areas every day: big banks, hedge funds, healthcare corporations, media companies, big and small startups and many, many more. Some use cases are of course more interesting than others. Not so long ago I was working with a biopharma company that stores genome metadata in MongoDB.

Deployment topologies and sizes vary as well. Some deployments span cities, some span continents. There are people who want to write hundreds of MB/s to their cluster, people who need 99.999% of their queries to complete within 5 ms (that's actually eBay), people who store hundreds of terabytes of data, and people with hundreds of shards...

In the last couple of years I have seen many enterprises moving away from SQL solutions to MongoDB. New features, scalability, performance: it all plays a big role. But money plays its role, too. Sometimes Oracle can simply be too expensive.

In data lake terminology, MongoDB is usually used as the operational data store, while Hadoop/Spark are used to perform heavy-duty ad hoc analytics. That makes sense to me, as MongoDB works best when you organize your data and indexes to support your application's needs. In this respect MongoDB is much better equipped than the NoSQL competition, thanks to its expressive query language and secondary indexes. However, this is different from what modern analytics and BI are about, because more often than not database owners don't know in advance which specific analytical queries will be run. MongoDB has a very powerful pipeline-based aggregation framework, and I frequently see people using it for different kinds of groupings, but mostly as a secondary use case.

Overall, I think MongoDB is worth evaluating for any project, unless it's a very specific domain with a very specific niche solution for it. I always recommend that my customers spend time on proper MongoDB topology and schema design for their POCs, because a lot of factors affect MongoDB performance. The most important part is to understand that MongoDB allows you to store your data in the form your application needs, but to capitalize on that you need to know your application. This way you can render your main landing page in one database call, if we're talking about a website CMS.

What can you tell us about cloud services? I understand that MongoDB gained a lot of popularity thanks to its integration with the cloud, am I right?

There is no doubt that a lot of things are moving into the Cloud now, including database services. We have strong partnerships with AWS, Google, and others.

While our database itself doesn't have any special integration with any cloud service, I think that the simplicity of installation and the distributed nature gave us some advantage there. For instance, with almost every cloud service you can provision servers in different data centers or availability zones, and that aligns perfectly with our HA model, replica sets. It's very easy to create or terminate cloud servers, and it's also very easy to add or remove replica set members; it's really just a single command. Normally you put replica set members into different zones to ensure that your database stays online if one or more zones go down. That doesn't happen very often, but when it does, it's never at the right time.

Among the most popular cloud services where people run MongoDB are AWS, Google and Azure. There are also full database-as-a-service solutions out there. My personal preference has always been AWS, mainly because of its flexible disk IO provisioning. When we're talking about terabytes of active data on a single node, disk IO is going to be the dominant factor in database performance.

Of course, the cloud makes it easier to run your servers, but you still have to manage them yourself. On the MongoDB side, Cloud Manager Automation makes this a lot easier for our users, and we have integrated it with Azure and AWS, so you can even automatically provision new servers for your cluster.

In my experience, the cloud introduces new dynamics into distributed systems. You can no longer make assumptions about the availability and performance of individual components, including the network, CPUs, disks, etc. Potential network issues definitely present new challenges to geographically distributed databases, including MongoDB, because many MongoDB cluster components have to talk to each other a lot during normal operation, especially in sharding. However, we have addressed many issues in MongoDB over the last year to improve stability in case of communication problems in sharded clusters.

What can we expect from MongoDB World 2016 in June?

Many interesting, educational presentations, great people and, of course, the unveiling of some roadmap details.

This event is as much about our users as it is about MongoDB. Our headliners include Google, Capital One and IBM. Many of our customers from around the US and from other countries come to NY to learn, share their knowledge and make new connections with like-minded people.

MongoDB World is a very big thing in our company, and every time we try to make it better than before. I heard many positive opinions about the last one, so I expect this year's to be a hit.

Do you have plans to present anything yourself?

It's really hard to commit to anything with my schedule; I just obey my calendar these days. If I'm in the city for the conference, I'll try to participate in the "Enterprise Architecture for Scalable Solutions" workshop. This is a new pre-conference activity we're doing this year. There will be other workshops about microservices, OpsManager, and more. We are still ironing out the details and the actual format, but it will be led by actual engineers and it will be very practical and very useful.

Also, for our users in Europe who are not able to come to NY this summer, we will have MongoDB Europe in London in November.

Very interesting. I’m sure that the conference will go well. Lastly, I wanted to ask about the future of MongoDB. New features? New partnerships?

So far our tradition has been to announce this kind of news at MongoDB World, and we will continue it this year. As adoption of MongoDB by enterprise customers increases, we face new requirements and discover new customer needs. Our Product Management team is constantly engaged in dialogue with our users, and many of the changes you see now are a direct consequence of that dynamic. Without spoiling too much before the big conference, right now I can only say that we will have all of the things you mentioned, and even more: more products, more features and new partnerships.

I see. Thank you for your time, Alex. I hope we will have one more interview soon. Bye.