Gradwell Blog

Service Update for w/c 15th October

Customers using our email and phone services have experienced a number of faults in the last week, and we want to update you on our progress in resolving them. Many of these issues are related to our connections to both the internet and the traditional phone network (PSTN). We are committed to fixing those supplier issues and, if necessary, changing supplier.

Call Drops
Customers are seeing phone calls dropping out and disconnecting mid-call. Unfortunately this problem is occurring randomly, and despite exhaustive investigations we have been unable to find a way to replicate the call drops, which we need to do before we can start to narrow down the cause.

However, we have changed one of our internet transit suppliers and identified a number of ways to reduce the latency on our core network, and since that work was completed on Tuesday evening we believe the number of call drops has significantly reduced.

The only knock-on effect was that we caused some capacity issues on part of our new internet transit supplier’s network on Friday morning. They expect to have that fully resolved by Sunday.

Echo
On Monday morning we started to receive reports of echo on calls. The timing matches a change made to one of our PSTN carriers, which had resolved problems with DTMF recognition and voicemail recording quality. We have reversed that change in order to fix the echo problem, and are continuing to work with our carrier to resolve the other issues.

Phones De-Registering

On Monday we started to see reports of phones randomly de-registering from our servers, which appears to be due to DNS lookups failing. We believe these failures are, in turn, related to the network latency problems. As well as fixing network latency, we have added further DNS servers for our outbound proxy, and this seems to have cleared all the error messages on our proxy servers.

Switch Crash
We also experienced a crash of one of our main PSTN switches on Thursday afternoon. This was resolved within about 20 minutes, and we are working with the manufacturer to understand the nature of these occasional crashes. They have also begun to commission our second PSTN switch, which we hope to bring into service in the next few weeks.

Mail Delivery
Mail delivery has been suffering large delays due to an increase in spam, and particularly in bounce notifications generated by other people sending spam, and this has led to overloading on our mail servers. We have, as previously announced, begun a refresh of the hardware in our hosting platform, and this week migrated spam and virus scanning onto some of our new quad-core servers.

This did improve mail throughput, but it has also just moved the bottleneck along. We are continuing our program of upgrades, although it will take several weeks to complete the entire refresh.

Specifically, this week we’ve added two more quad-core servers to handle outgoing mail on relay.gradwell.net and incoming mail for customers. We are also adding extra servers to store and process bounce messages, so that we can remove those from the main mail queues more quickly. Finally, we’ve made a few further improvements to our queue management software, including reducing the levels of unnecessary disk access on our mail servers, which has sped up the mail flow.

For the last couple of days our monitoring has shown that, under normal operation, we are managing the load well and problems only occur when a queue builds up, so we are proactively monitoring the queues to prevent that from happening.

Customer Support
During this time our customer support staff have been extremely busy, and we have therefore experienced delays in dealing with enquiries. It would be very helpful if customers reporting faults could provide as much specific information as possible, as this greatly reduces the delay in identifying and resolving the problem.

Conclusion
We’d like to thank customers for their forbearance during this challenging week. We have made good progress in getting on top of the various issues that have arisen.

We look forward to next week, when we will continue to work on our systems to ensure they are all operating correctly.
