Ada URL Parser, a powerful tool for parsing URLs, has just been updated to version 2.0, a month after the release of version 1.0.4. This latest version brings significant improvements over its predecessor, including a doubling of execution speed in some cases, as well as reduced memory usage and fewer allocations. These enhancements make the Ada URL Parser more efficient and capable of handling a broader range of URL parsing tasks with ease. In this blog post, we'll take a closer look at the new features and performance improvements of the Ada URL Parser v2.0 and explore how they can benefit developers in their everyday work.
In addition to the performance and memory improvements, the Ada URL Parser v2.0 also introduces a new feature that will be of particular interest to developers working with one-time URL parsing tasks. With this update, Ada now supports two different implementations of URL parsing out of the box.
Dropping ICU requirement
The International Components for Unicode (ICU) library was a vital component of the Ada URL Parser v1.x and was required for proper URL hostname parsing and manipulation. This is because URLs can contain a wide variety of international characters, including non-Latin alphabets, and the ICU library provides a comprehensive set of tools for handling these characters in a standardized manner.
Specifically, the ICU library provides support for Unicode normalization, which is a critical step in URL parsing. Normalization ensures that URLs containing international characters are properly encoded and can be processed correctly by other software components. Additionally, the ICU library provides support for character set conversion and detection, which are essential for handling URLs from different regions of the world.
Prior to v2.0, Ada was dependent on ICU for to_unicode operations. The ICU library is not available on all systems, and where it is available we cannot guarantee that it is up to date. Providing our own Unicode functions allows us to fully support the standard across a broad range of systems. Furthermore, our benchmarks reveal that we sometimes achieve better performance with our own dedicated functions.
With Ada v2.0, we are dropping the system requirement of ICU and rolling out our own implementation of the Unicode specification, which we are releasing publicly on GitHub (ada/idna).
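To give a feel for the kind of work a to_unicode step involves, here is a minimal sketch of Punycode decoding as described in RFC 3492, which turns the ASCII form of a label (the part after "xn--") back into Unicode code points. This is not the code from ada/idna: the namespace and function names are hypothetical, surrogate checks are left out, and a full to_unicode operation involves additional mapping and validation steps that are not shown here.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <string_view>
#include <vector>

namespace sketch {  // hypothetical namespace, not part of Ada

// Parameters fixed by RFC 3492, section 5.
constexpr std::uint32_t kBase = 36, kTMin = 1, kTMax = 26, kSkew = 38,
                        kDamp = 700, kInitialBias = 72, kInitialN = 128;

// Bias adaptation, RFC 3492 section 6.1.
inline std::uint32_t adapt(std::uint32_t delta, std::uint32_t num_points,
                           bool first) {
  assert(num_points > 0);
  delta = first ? delta / kDamp : delta / 2;
  delta += delta / num_points;
  std::uint32_t k = 0;
  while (delta > ((kBase - kTMin) * kTMax) / 2) {
    delta /= kBase - kTMin;
    k += kBase;
  }
  return k + ((kBase - kTMin + 1) * delta) / (delta + kSkew);
}

// Maps letters (either case) to 0..25 and '0'..'9' to 26..35; -1 otherwise.
inline int digit_value(char c) {
  if (c >= 'a' && c <= 'z') return c - 'a';
  if (c >= 'A' && c <= 'Z') return c - 'A';
  if (c >= '0' && c <= '9') return c - '0' + 26;
  return -1;
}

// Decodes one Punycode label (without the "xn--" prefix) into code points.
// Returns std::nullopt on malformed input.
inline std::optional<std::vector<char32_t>> punycode_decode(
    std::string_view in) {
  std::vector<char32_t> out;
  std::size_t pos = 0;
  // Everything before the last '-' is copied as literal ASCII.
  const std::size_t dash = in.rfind('-');
  if (dash != std::string_view::npos && dash > 0) {
    for (std::size_t j = 0; j < dash; ++j) {
      if (static_cast<unsigned char>(in[j]) >= 0x80) return std::nullopt;
      out.push_back(static_cast<char32_t>(in[j]));
    }
    pos = dash + 1;
  }
  std::uint32_t n = kInitialN, bias = kInitialBias;
  std::uint64_t i = 0;
  while (pos < in.size()) {
    const std::uint64_t old_i = i;
    std::uint64_t w = 1;
    // Read one variable-length integer, least significant digit first.
    for (std::uint32_t k = kBase;; k += kBase) {
      if (pos >= in.size()) return std::nullopt;  // truncated integer
      const int d = digit_value(in[pos++]);
      if (d < 0) return std::nullopt;  // not a base-36 digit
      i += static_cast<std::uint64_t>(d) * w;
      if (i > 0xFFFFFFFFu) return std::nullopt;  // simple overflow guard
      const std::uint32_t t =
          k <= bias ? kTMin : (k >= bias + kTMax ? kTMax : k - bias);
      if (static_cast<std::uint32_t>(d) < t) break;
      w *= kBase - t;
    }
    const std::uint32_t len = static_cast<std::uint32_t>(out.size()) + 1;
    bias = adapt(static_cast<std::uint32_t>(i - old_i), len, old_i == 0);
    const std::uint64_t next_n = n + i / len;
    if (next_n > 0x10FFFF) return std::nullopt;  // beyond Unicode range
    n = static_cast<std::uint32_t>(next_n);
    i %= len;
    // Insert the decoded code point at position i, then move past it.
    out.insert(out.begin() + static_cast<std::ptrdiff_t>(i),
               static_cast<char32_t>(n));
    ++i;
  }
  return out;
}

}  // namespace sketch
```

For instance, calling sketch::punycode_decode("mnchen-3ya") on the label from xn--mnchen-3ya yields the code points of "münchen".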
The first implementation, ada::url, is the existing URL representation. It is optimized for environments where a reference to the instance persists and the URL is modified over time. It provides a comprehensive set of URL parsing functions and is suitable for tasks that involve ongoing URL manipulation.
The second implementation, ada::url_aggregator, is a new addition to the Ada URL Parser family. It is explicitly designed for parsing and deserializing URLs in one-time environments, where a new URL instance is created for each parsing operation. This implementation uses a URL representation inspired by the Servo URL parser, making it an ideal choice for performance-critical applications.
As you may have realized, the public API of ada::url and ada::url_aggregator is the same. The key difference lies in their internal implementations and in the way each URL subcomponent is stored. The ada::url structure stores the components of the parsed URL in separate string instances, making updates fast. The ada::url_aggregator uses a single string buffer, thus minimizing memory usage, at the expense of more work during updates.
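The trade-off between the two layouts can be sketched in a few lines of C++. The two structs below are hypothetical stand-ins, not Ada's actual classes: they only track a handful of components and skip all parsing and validation. They expose the same small interface, but one keeps a string per component while the other keeps a single buffer plus offsets, so an update has to rewrite part of the buffer and adjust the offsets that follow.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <string_view>

namespace sketch {  // hypothetical namespace, not part of Ada

// Layout A: one string per component. An update touches a single string,
// but every component may own its own allocation.
struct SeparateStrings {
  std::string protocol;  // e.g. "https:"
  std::string host;      // e.g. "example.com"
  std::string pathname;  // e.g. "/a"
  std::string search;    // e.g. "?x=1"

  void set_pathname(std::string_view p) { pathname.assign(p); }

  // The serialized URL has to be assembled on demand.
  std::string href() const { return protocol + "//" + host + pathname + search; }
};

// Layout B: one buffer holding the serialized URL, plus offsets into it.
struct SingleBuffer {
  std::string buffer;          // e.g. "https://example.com/a?x=1"
  std::size_t pathname_start;  // index of the first '/' of the path
  std::size_t search_start;    // index of '?', or buffer.size() if absent

  std::string_view pathname() const {
    return std::string_view(buffer).substr(pathname_start,
                                           search_start - pathname_start);
  }
  std::string_view search() const {
    return std::string_view(buffer).substr(search_start);
  }

  void set_pathname(std::string_view p) {
    assert(pathname_start <= search_start && search_start <= buffer.size());
    const std::size_t old_len = search_start - pathname_start;
    // Rewriting the middle of the buffer moves everything after it...
    buffer.replace(pathname_start, old_len, p.data(), p.size());
    // ...so every offset past the edited component must be shifted.
    search_start = pathname_start + p.size();
  }

  // The serialized URL is already available without any extra work.
  const std::string& href() const { return buffer; }
};

}  // namespace sketch
```

In this sketch, reading href() from the single-buffer layout is free, while set_pathname does more bookkeeping than its counterpart in the separate-strings layout.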
Reducing string allocations in Ada
One of the key features of the Ada URL Parser v2.0 is its ability to provide a comprehensive representation of the various components of a URL string. This representation is based on the WHATWG URL specification and includes a range of offsets for different URL components. These offsets are used to identify the start and end indexes of different parts of the URL, such as the protocol, username, hostname, port, pathname, search, and hash. By using these offsets, developers can easily extract and manipulate different parts of a URL string without having to worry about the underlying parsing details.
For example, ada::url_components includes the protocol_end offset, which represents the ending index of the protocol component in the URL string. It also includes the username_end offset, which is used for URLs that contain a username. Additionally, it provides host_start and host_end offsets, which represent the start and end indexes of the hostname component of the URL. The port, pathname_start, search_start, and hash_start fields are also included, allowing developers to easily extract and manipulate the corresponding URL components.
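As a rough illustration of how such offsets can be used, the sketch below slices components out of a single buffer without allocating. The field names mirror the ones mentioned above, but the struct itself is a hypothetical stand-in: the exact conventions (whether an offset includes a delimiter, and storing the port as a number rather than an index) are assumptions made for this example and may differ from Ada's, and the offsets are filled in by hand rather than produced by a parser.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <string_view>

namespace sketch {  // hypothetical namespace, not part of Ada

// Hypothetical stand-in for the offsets described in the text.
struct Components {
  std::uint32_t protocol_end;    // one past the ':' that ends the protocol
  std::uint32_t username_end;    // one past the last username character
  std::uint32_t host_start;      // first character of the hostname
  std::uint32_t host_end;        // one past the last hostname character
  std::uint32_t port;            // numeric port value (assumption: not an index)
  std::uint32_t pathname_start;  // index of the first '/' of the path
  std::uint32_t search_start;    // index of '?'
  std::uint32_t hash_start;      // index of '#'
};

struct ParsedUrl {
  std::string buffer;  // the whole serialized URL
  Components c;

  // Returns the half-open range [begin, end) of the buffer as a view.
  std::string_view slice(std::uint32_t begin, std::uint32_t end) const {
    assert(begin <= end && end <= buffer.size());
    return std::string_view(buffer).substr(begin, end - begin);
  }

  std::string_view protocol() const { return slice(0, c.protocol_end); }
  // Skips the "//" that follows the protocol.
  std::string_view username() const {
    return slice(c.protocol_end + 2, c.username_end);
  }
  std::string_view hostname() const { return slice(c.host_start, c.host_end); }
  std::string_view pathname() const {
    return slice(c.pathname_start, c.search_start);
  }
  std::string_view search() const { return slice(c.search_start, c.hash_start); }
  std::string_view hash() const {
    return slice(c.hash_start, static_cast<std::uint32_t>(buffer.size()));
  }
};

// Offsets computed by hand for one example URL.
inline ParsedUrl example() {
  return ParsedUrl{"https://user@example.com:8080/a/b?q=1#frag",
                   Components{6, 12, 13, 24, 8080, 29, 33, 37}};
}

}  // namespace sketch
```

Every accessor returns a std::string_view into the same buffer, so reading a component does not copy or allocate. The views stay valid only as long as the ParsedUrl they came from is alive and unmodified.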
We have been actively working on improving the library's benchmark infrastructure to provide a more realistic comparison of Ada's performance against other similarly scoped libraries. As part of this effort, we developed a crawler that visited 100,000 URLs from the most visited 100 websites, with a limit of 100 URLs per unique domain. The crawler was designed to simulate real-world URL parsing and manipulation scenarios and provided valuable insights into the performance of the Ada URL Parser in comparison to other libraries.
All benchmarks were executed on an Apple M1 Max processor.
Comparing Ada with alternatives
The benchmark code and the dataset of 100,000 URLs are available on GitHub. We used cURL 8.0.1, Servo 2.3.1 (using Rust 1.64.0), and Boost 1.81.0 for benchmarking.
Ada is currently 50% faster than Boost, 3x faster than Servo and 6x faster than cURL in the given dataset.
Disclaimer: cURL follows RFC 3986+, and Boost follows RFC 3986.
Comparing Node.js (with Ada) with alternatives
The benchmark code and the dataset of 100,000 URLs are available on GitHub. We used the Node.js main branch, Bun 0.5.8, and Deno 1.32.1 for benchmarking.
Node.js is currently 82% faster than Bun and 3x faster than Deno in the given dataset.
All this great work is a result of a collaboration between Daniel Lemire, Miguel Teixeira, and me (Yagiz Nizipli).
The full changelog can be found on Ada's GitHub releases page.