Postmortem Archives - mailsac blog

Mailsac is not affected by Log4J CVEs

December 22, 2021 Jeff P

Tech news has recently been full of CVEs related to a popular JVM logging library named Log4J.

Mailsac services do not rely on JVM languages, including Java. This extends through the entire stack: custom apps, self-hosted open source software, internal and external infrastructure, proxies, and scripts.

There is one exception: an instance of the CI server Jenkins, which is isolated behind a VPN and was never vulnerable according to troubleshooting steps from the Jenkins developers.

Mailsac and Security

The Mailsac Team is small yet mighty, with decades of experience taking security seriously. We follow best practices for infrastructure-as-code, patching, testing, network isolation, backups, restoration, and the principle of least access. Large enterprises including banks and government agencies trust Mailsac for disposable email testing. We provide exceptionally fast and predictable REST and Web Socket APIs with an excellent uptime record.

Mailsac has support for multiple users under the same account, so you can keep disposable email testing private within your company.

It’s free to test email immediately – no payment details required. You can send email to any address @mailsac.com and confirm delivery in seconds without even logging in. Start now at mailsac.com.

Inbound Mail 2021-11-25 Outage (Resolved) Postmortem

November 26, 2021 (updated December 3, 2021) Michael M

All times US Pacific Standard Time

Start of Outage

On the US holiday Thanksgiving, November 25th at approximately 17:20, an email address [email protected] began sending tens of thousands of simultaneous emails to Mailsac. By 17:28, various alerts were sent to the devops team. Primary inbound mail services ran out of memory and either locked up or were close to falling over. Soon the failover services were overrun and inbound mail stopped working entirely.

Recovery Actions

The devops team sprang into action and took evasive maneuvers. Grafana dashboards, which show key indicators of service health, were slow to load or unresponsive. Logging infrastructure was still working and showed that the sender was using a Reply-To address of [email protected], yet the envelope and FROM header addresses were generated from unique subdomains per inbound email address, which exploited a previously unknown way around Mailsac’s multi-tier throttling infrastructure. All of these messages came from sandbox Salesforce subdomains, at least 6 subdomains deep.

Once the root cause was discovered, the sender’s mail was blocked and additional resources were allocated to inbound mail services to allow more memory to build up while blocklists were propagating across the network of inbound mail services. By 17:40, inbound mail was coming back online, and by 17:44 most alerts had resolved.

Lessons Learned

We monitor and throttle inbound mail in several custom systems. The goal of these systems is to keep pressure off our primary datastore and API services, provide insight into system load, and identify bad actors. The monitoring systems looked mostly at the domain and/or subdomain. Unfortunately, we did not anticipate a sender with unique subdomains per message. This produced tens of thousands of superfluous Prometheus metrics, which overwhelmed three things:

the metrics exporter inside the inbound mail server,
the Prometheus metrics server, which ran out of memory, and
the Grafana UI dashboards, which became non-responsive due to too many apparently unique senders.

All of the described issues have been fixed.
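The post does not describe how the fix was implemented. As a minimal sketch of one common way to keep metric labels bounded when senders use unique subdomains, the Python below collapses each sender domain to its last two DNS labels and folds any overflow into a single "other" bucket. The names `throttle_key` and `BoundedLabelCounter` are hypothetical, and the two-label rule is a simplifying assumption: a real system would likely consult the Public Suffix List so that suffixes such as `co.uk` are handled correctly.

```python
from collections import Counter


def throttle_key(sender_domain: str, keep_labels: int = 2) -> str:
    """Collapse a sender domain to its last `keep_labels` DNS labels.

    Assumption: two labels approximate the registrable domain. This is
    wrong for multi-part public suffixes (for example co.uk).
    """
    labels = [p for p in sender_domain.lower().rstrip(".").split(".") if p]
    return ".".join(labels[-keep_labels:])


class BoundedLabelCounter:
    """Counts messages per collapsed domain with a hard cap on distinct keys.

    Once `max_keys` distinct domains have been seen, new domains are counted
    under "other", so at most max_keys + 1 series can ever exist.
    """

    def __init__(self, max_keys: int = 1000):
        self.max_keys = max_keys
        self.counts = Counter()

    def observe(self, sender_domain: str) -> str:
        key = throttle_key(sender_domain)
        if key not in self.counts and len(self.counts) >= self.max_keys:
            key = "other"
        self.counts[key] += 1
        return key


if __name__ == "__main__":
    counter = BoundedLabelCounter(max_keys=1000)
    # Ten thousand unique, deeply nested subdomains map to a single key.
    for i in range(10_000):
        counter.observe(f"m{i}.a.b.c.d.sandbox.example.com")
    print(len(counter.counts), counter.counts["example.com"])
```

The same collapsed key could feed both the throttle and the metric label, so a sender cannot escape either one by varying subdomains.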

Non-Impacted Services

During the outage all other services remained up. The REST API, web sockets, outbound SMTP, SMTP capture, and more were unaffected.

We want to apologize to all of our paying customers. Mailsac is often integrated with automated tests in CI/CD systems. If our downtime also caused alerts for you, we’re very sorry about this! The root cause has been fixed and we’re continuing to monitor the situation.

(Resolved) Investigating Slowness

March 31, 2021 (updated October 27, 2021) Jeff P

Investigating reports of slowness this morning beginning around 4:45am US Pacific time.

The issue was resolved around 5:15am US Pacific time.

A slow, locking query was identified in a critical path. It has been adjusted.

Service Update: Login Issues Resolved

September 25, 2020 Jeff P

Some users experienced issues logging into the service, and signing up, yesterday and this morning (US Pacific). There was a configuration issue with cookies which has been resolved. Please continue to report any new issues with login or signup.

(Resolved) Service degradation due to apparent attack

June 19, 2020 (updated October 29, 2020) Jeff P

Beginning 2:36 AM US Pacific time, Mailsac internal monitoring indicated slowness due to an abnormally large amount of spam coming from China. By approximately 6:30 AM we identified all root causes and believe the issue is resolved.

Our service employs several methods of blocking, shaping, and throttling egregious traffic from unpaid users. This particular attack worked around these automatic mitigation efforts, in part because the attackers opened thousands of sockets and left them open a long time, exploiting a loophole in our SMTP inbound receiver code.
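The post does not say how the loophole was closed. The sketch below illustrates two standard defenses against connections that are opened in bulk and then left idle: a cap on concurrent connections per IP, and an idle timeout that frees slots held by silent connections. `ConnectionGate` and its parameters are hypothetical names for illustration only. It tracks bookkeeping, and the caller would be responsible for actually closing the sockets it reports as stale.

```python
import time
from collections import defaultdict


class ConnectionGate:
    """Per-IP connection cap with idle reaping (bookkeeping only)."""

    def __init__(self, max_per_ip=10, idle_timeout=30.0, clock=time.monotonic):
        self.max_per_ip = max_per_ip
        self.idle_timeout = idle_timeout
        self.clock = clock  # injectable so the logic can be tested
        self.last_seen = defaultdict(dict)  # ip -> {conn_id: last activity}

    def reap_idle(self, ip):
        """Forget connections idle longer than the timeout; return their ids."""
        now = self.clock()
        conns = self.last_seen[ip]
        stale = [c for c, t in conns.items() if now - t > self.idle_timeout]
        for c in stale:
            del conns[c]
        return stale  # the caller should close these sockets

    def try_open(self, ip, conn_id):
        """Return True if a new connection from `ip` may be accepted."""
        self.reap_idle(ip)
        conns = self.last_seen[ip]
        if len(conns) >= self.max_per_ip:
            return False
        conns[conn_id] = self.clock()
        return True

    def touch(self, ip, conn_id):
        """Record activity (for example an SMTP command) on a connection."""
        if conn_id in self.last_seen[ip]:
            self.last_seen[ip][conn_id] = self.clock()

    def close(self, ip, conn_id):
        """Release the slot when the client disconnects."""
        self.last_seen[ip].pop(conn_id, None)
```

With a gate like this, an address that opens many sockets and sends nothing is refused once it reaches the cap, and its idle slots are reclaimed after the timeout rather than being held indefinitely.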

Here is a graph of our inbound message rate showing the attack compared to baseline.

(Fixed) Upstream API proxy service network connectivity – workaround

July 17, 2018 (updated October 25, 2022) Mailsac Engineering

Update:
The issue referenced below was resolved in under 15 minutes. The referenced server URIs have since been deprecated and should not be used.


(Resolved) Outage Report: Tue Feb 6 2018

February 6, 2018 (updated May 26, 2020) Mailsac Engineering

The VPS host which handles the Mailsac database servers is having a routing issue, and most of the microservices are unable to contact it. We are in direct communication with our support rep regarding this issue and expect it to be resolved ASAP. This is a full outage.

Service status can be tracked here: status.mailsac.com

We apologize for the issue and will be working to minimize the likelihood of this happening again.

Timeline (US Pacific)

– 2018-02-06 09:36 Outage noticed by monitoring services
– 2018-02-06 09:37 Troubleshooting and evaluating logs on shared logging server
– 2018-02-06 09:38 Able to ssh into primary database node from office
– 2018-02-06 09:38 Ticket opened with upstream hosting company indicating many geographically distributed services cannot reach the network of the database servers
– 2018-02-06 09:43 Provided several traceroutes for help troubleshooting
– 2018-02-06 09:59 Monitoring indicates the service is back online
– 2018-02-06 10:03 All frontend UI/API servers were rebooted in series to clear a MongoDB error “Topology was destroyed”
– 2018-02-06 10:05 Error notifications seem to all be cleared
– 2018-02-06 10:10 Updated HAProxy error pages to include links to status page and community website

Edit: Concluding Remarks

Mailsac’s database, for caching and data storage, is MongoDB. Without the database, everything grinds to a halt. MongoDB supports configurations for high availability (Replication with Automatic Failover).

Having all nodes of the database hosted in one provider’s network has proven insufficient to prevent outages. In this case, a router within the hosting company’s network failed, which made all of the MongoDB nodes inaccessible from the networks of the other hosting companies. We will take some time to change that configuration.

Mailsac already has microservice instances across multiple providers and geographic regions, as seen in the system diagram:

[Image: basic diagram of the Mailsac email microservices]

If one or two instances go offline, or even an entire region of an upstream host, Mailsac should not go down as long as the database is still accessible to the API. Obviously that was not the case here.

The solution will be to add a Secondary Node and Arbiter in different networks.
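The reasoning behind this can be checked with simple vote arithmetic. A MongoDB replica set can only elect a primary when a majority of its voting members can reach one another. The sketch below, with hypothetical member and network names, tests whether a majority still remains after any single network is lost. It counts votes only; an arbiter votes but holds no data, so a real check would also confirm that a data-bearing member survives.

```python
from collections import Counter
from typing import Mapping


def survives_any_single_network_failure(members: Mapping[str, str]) -> bool:
    """Return True if losing any one network still leaves a voting majority.

    `members` maps each voting member's name to the network that hosts it.
    """
    total = len(members)
    if total == 0:
        return False
    majority = total // 2 + 1
    per_network = Counter(members.values())
    # Losing a network removes every voting member hosted in it.
    return all(total - lost >= majority for lost in per_network.values())


if __name__ == "__main__":
    # Hypothetical layouts: every member in one network, versus one per network.
    single_network = {"primary": "net-a", "secondary": "net-a", "arbiter": "net-a"}
    three_networks = {"primary": "net-a", "secondary": "net-b", "arbiter": "net-c"}
    print(survives_any_single_network_failure(single_network))
    print(survives_any_single_network_failure(three_networks))
```

Note that splitting three members across only two networks is not enough: whichever network hosts two of them remains a single point of failure for the majority.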

15 minute partial API outage due to apparent DoS

October 30, 2017 Mailsac Engineering

For about 15 minutes (8:13 am – 8:27 am PDT), our API was flooded with traffic due to hundreds of thousands of email attempts from 4 IP addresses. Nearly all emails were received, but HTTP requests for the API and UI frequently timed out. We do not know the percentage of requests that timed out, but it was quite high. The API is load balanced and only one API was timing out frequently.

We blocked the bad IPs immediately upon seeing traffic logs (~8:19), but because our custom IP blocking service relies on the API to fetch the blacklist, and the API was not fully responsive to HTTP on one leg, it took a while for the changes to propagate to all five inbound SMTP servers.
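The dependency described here, where each SMTP server must reach the API to learn about new blocks, can be made more tolerant of a single slow backend. The following is only a sketch under stated assumptions, not a description of Mailsac's blocking service: `BlocklistCache` is a hypothetical name, and each fetcher stands in for a request to one API backend made with a short timeout. The cache tries each source in turn and keeps the last known list if all of them fail.

```python
from typing import Callable, Iterable, Set


class BlocklistCache:
    """Holds the most recently fetched set of blocked IP addresses."""

    def __init__(self, fetchers: Iterable[Callable[[], Set[str]]]):
        # Each fetcher is assumed to query one backend and to raise an
        # exception on timeout or error.
        self.fetchers = list(fetchers)
        self.blocked: Set[str] = set()

    def refresh(self) -> bool:
        """Try each source in order; keep the old list if all of them fail."""
        for fetch in self.fetchers:
            try:
                self.blocked = set(fetch())
                return True
            except Exception:
                continue  # this backend is unhealthy, try the next one
        return False

    def is_blocked(self, ip: str) -> bool:
        return ip in self.blocked


if __name__ == "__main__":
    def unresponsive_backend() -> Set[str]:
        raise TimeoutError("backend did not answer in time")

    def healthy_backend() -> Set[str]:
        return {"192.0.2.10", "192.0.2.11"}

    cache = BlocklistCache([unresponsive_backend, healthy_backend])
    print(cache.refresh(), cache.is_blocked("192.0.2.10"))
```

The design choice is that a failed refresh never clears the existing list, so an overloaded API delays new blocks but does not drop ones already in place.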