A single-digit mistake in the IP address of an Avaya phone crashed a VPN managed by a Cisco 1841 router.

September 23, 2010


Today I had to deal with one of the most complex issues of my life as an IT Manager: our entire VPN had been going down since 10 a.m. yesterday, with no indication of the cause.

Briefly, the scenario: a VPN provided by ONO, using Cisco 1841 routers at the central office and Draytek Vigor 2200 routers at the sites. Each site has one Avaya 5602SW+ VoIP phone, and the main office has one Avaya IP 406 Office call server.

While the VPN was failing, none of the sites could connect to our Oracle 10g Database server, and they could not even ping the external Smoothwall firewall. The IT support team, however, could still establish remote control sessions to the sites using VNC and confirm that pings sent from the remote end did not arrive.

We started monitoring manually because we thought Zenoss could not handle that. All we knew was that traffic was flowing well from the main office to the sites but not in the reverse direction.
Tracing the traffic with the traceroute command from my OpenSolaris desktop, I noticed that packets passed through different, load-balanced routers depending on the direction of the traffic. ONO's support team confirmed that, and I started to suspect a routing mistake on the routers that handled traffic from the sites.

I asked the ONO engineer if he could flush the ARP cache of the main Cisco 1841 router. He kindly did so, and we saw traffic return to normal. ONO took that as the explanation of what was going wrong, telling us that we had a problem on our LAN at the ARP negotiation level of our Avaya call server. They closed the incident, reported it to me by email and left us the responsibility of solving the puzzle.

My boss, as always (and I imagine yours as well), went into her usual nervous but not outwardly visible state. It was the CEO's assistant who asked me for a solution to duplicate our application servers at a critical site. How can you duplicate, at a remote site, a full VMware ESXi environment that hosts our Oracle Database and our BI reporting system?

Are we fools? Can we IT managers reverse the policy of building centralized systems, which serves the business needs of fast ROI, easy management and up-to-date database information, just because the Internet is down one day? Most of us have heard "sorry, but our lines are down" when we reached the cashier at a bank office. Even phone providers, the ones that must keep their own lines up, tell us "we have an issue right now and we can't give you your incident number".

Can a small company keep a second datacenter in sync with the main one to cover two or three outages per year? Certainly not. Thankfully, the solution is just to wait.

Returning to our issue, we had to deal with it alone, without help from ONO. Thanks to our internal issue management system, we found that at the same time we replaced an Avaya IP phone with a broken keypad, a VPN outage was reported by another site that the boss was visiting. We rechecked the IP configuration of the VoIP phone and were surprised to find that one digit had been left out. It was 10.0.0.1 instead of 10.0.0.11. The IP assigned to the IP phone was the same as that of the Cisco 1841 main office VPN router. Oh! What a mistake! But that phone had been running fine yesterday, and when we configured it we did not get an "IP conflict" warning as we had on other occasions.
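A check like the following, run before typing a static address into a phone, would have caught the missing digit. It is only a minimal sketch: the `RESERVED` inventory and the `find_conflict` helper are hypothetical names of mine, and the only entry in the inventory is the router address from this incident.

```python
import ipaddress

# Hypothetical inventory of addresses already in use on the LAN.
# Only the router entry comes from this incident.
RESERVED = {
    "10.0.0.1": "Cisco 1841 main office VPN router",
}


def find_conflict(candidate, reserved=RESERVED):
    """Return the owner of `candidate` if it is already reserved, else None."""
    # ip_address() raises ValueError if the string is not a valid address.
    addr = ipaddress.ip_address(candidate)
    for ip, owner in reserved.items():
        if ipaddress.ip_address(ip) == addr:
            return owner
    return None


if __name__ == "__main__":
    # The mistyped address and the address the phone should have had.
    for candidate in ("10.0.0.1", "10.0.0.11"):
        owner = find_conflict(candidate)
        if owner:
            print(f"{candidate}: conflict with {owner}")
        else:
            print(f"{candidate}: free")
```

Comparing parsed addresses rather than raw strings means that a stray space or a malformed value is rejected instead of silently passing the check.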

What really happened, and how could an Avaya IP phone take the same address as the Cisco router connected to the same D-Link Gigabit switch? And how could both run fine most of the time, while at other times the Cisco router rejected packets from the VPN or its ARP handling broke down?

I knew that flushing the Cisco ARP cache cleared the anomaly, and that gave me the idea that there might be bad or duplicate entries in the ARP table of the Cisco 1841. With two devices answering for 10.0.0.1, ARP caches on the LAN can end up holding the phone's MAC address instead of the router's. As for why there was no IP conflict warning when the phone was set to 10.0.0.1, my guess, which I have not verified, is that the Cisco 1841 is natively an IPv6 device that implements IPv4 in software, and that its internal IPv6-to-IPv4 translation table allowed the anomaly. But that is another story. The point is that a single-digit mistake caused a VPN managed by a Cisco 1841 router to malfunction.
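The duplicate-entry idea can at least be tested from the outside. The sketch below assumes you have already collected (IP, MAC) pairs from ARP caches on the LAN by whatever means you have. The `find_duplicate_ips` helper and all MAC addresses in the sample are made up for illustration, not taken from our equipment.

```python
from collections import defaultdict


def find_duplicate_ips(observations):
    """Map each IP seen with more than one MAC to the sorted list of those MACs."""
    macs_by_ip = defaultdict(set)
    for ip, mac in observations:
        # Normalize case so the same MAC is not counted twice.
        macs_by_ip[ip].add(mac.lower())
    return {ip: sorted(macs) for ip, macs in macs_by_ip.items() if len(macs) > 1}


if __name__ == "__main__":
    # Hypothetical observations: 10.0.0.1 shows up with two different MACs.
    samples = [
        ("10.0.0.1", "00:11:22:33:44:55"),
        ("10.0.0.20", "00:aa:bb:cc:dd:01"),
        ("10.0.0.1", "00:04:0D:AA:BB:CC"),
    ]
    for ip, macs in find_duplicate_ips(samples).items():
        print(f"{ip} is claimed by {len(macs)} MAC addresses: {', '.join(macs)}")
```

An IP address that appears with two MAC addresses is a strong hint that two devices share it, which is the situation we had with the phone and the router.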

Maybe I will report this to the Avaya and Cisco support teams. Perhaps they will enjoy it!
