maemo.org - Talk

-   -   Infrastructure maintainance on 19.11. (https://talk.maemo.org/showthread.php?t=98329)

fstern 2016-11-18 12:33

Infrastructure maintainance on 19.11.
 
Hi everybody,

Sorry for the short notice, but we will do some heavy maintenance on the maemo.org infrastructure tomorrow, starting at 10:00 CET (09:00 UTC).

All systems will be affected.

We expect to be down for at least 6 hours as we do upgrades on the underlying hypervisors.

What we will do:
  • Do an image backup of all machines
  • Upgrade the underlying hypervisors
  • Upgrade individual machines

Sorry for any inconvenience this might cause.

Best,

Falk

peterleinchen 2016-11-18 14:14

Re: Infrastructure maintainance on 19.11.
 
Thanks for the notification.

@tmo admin
Could this possibly be made sticky at the overall forum level?

fstern 2016-11-19 22:57

Re: Infrastructure maintainance on 19.11.
 
Hi everyone,

tl;dr: half of the infrastructure is broken, fix expected early next week, film at eleven.

This maintenance didn't go according to plan; here's a short post-mortem:

Timeline:

10:00 - start updates and backups on blade-a
14:30 - backups and updates complete on blade-a, reboot confirmed successful
14:31 - uptime induced filesystem check after 1347 days
15:00 - start of backups on blade-b
17:12 - filesystem check complete, blade-a up and running
17:30 - first systems on blade-a confirmed up and working
18:30 - software upgrade on stage and mail complete
20:15 - backups of blade-b finished and copied onto blade-a backup space
20:16 - start of updates on blade-b
21:00 - updates on blade-b complete, reboot
21:01 - blade-b stuck in boot with corrupt bios image in flash
23:30 - all available remote recovery options tried, none working
23:40 - decision to go for Plan B, boot talk.maemo.org on blade-a, redirect everything else to talk.m.o
23:45 - blade-b turned off through IPMI
23:53 - talk.m.o available again
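
For reference, the durations implied by the timeline above can be worked out with a small Python sketch. The phase labels and groupings below are illustrative assumptions and not part of the original post; the sketch also assumes that all timestamps fall on the same day.

```python
from datetime import datetime, timedelta


def elapsed(start: str, end: str) -> timedelta:
    """Time between two same-day HH:MM timestamps."""
    fmt = "%H:%M"
    return datetime.strptime(end, fmt) - datetime.strptime(start, fmt)


def format_duration(delta: timedelta) -> str:
    """Render a timedelta as e.g. '2h41m'."""
    hours, minutes = divmod(int(delta.total_seconds()) // 60, 60)
    return f"{hours}h{minutes:02d}m"


# Phase boundaries taken from the timeline; the labels are hypothetical.
PHASES = {
    "blade-a backups and updates": ("10:00", "14:30"),
    "blade-a filesystem check": ("14:31", "17:12"),
    "blade-b backups": ("15:00", "20:15"),
    "blade-b remote recovery attempts": ("21:01", "23:30"),
    "start of work until talk.m.o was back": ("10:00", "23:53"),
}

for label, (start, end) in PHASES.items():
    print(f"{label}: {format_duration(elapsed(start, end))}")
```

By this reckoning the filesystem check alone took 2h41m, and talk.m.o was reachable again 13h53m after the work started.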

Fallbacks in place:

www.maemo.org, wiki.maemo.org, garage.maemo.org are redirected to talk.maemo.org

Next Action Items:

I'll visit the datacenter on Monday after work (around 18:00 CET) to try to recover the BIOS of the broken machine with a physical USB stick.

If this is successful, we'll migrate talk.m.o back to its original host and re-enable www.m.o, wiki.m.o and garage.m.o through DNS once the VMs and the blade are confirmed working.


Best,

xes & falk

pichlo 2016-11-20 07:14

Re: Infrastructure maintainance on 19.11.
 
My browser complains about a wrong certificate; is this a side effect of the update? Is it temporary?
(Details: the name on the cert does not match the URL.)

peterleinchen 2016-11-20 07:56

Re: Infrastructure maintainance on 19.11.
 
1 Attachment(s)
Quote:

Originally Posted by joerg_rw (Post 1519058)
many thanks for this massive effort ...

+1

A hint for all remaining N9 users: once again we have no automatic network (WLAN auto/manual) detection. A nice screenshot is attached (maybe later, my N9 does not let me select it :))

--edit
Quote:

Originally Posted by pichlo (Post 1519060)
My browser complains about a wrong certificate; is this a side effect of the update? ...

Guess so as these corrections/redirections were also made earlier this year.

xes 2016-11-20 16:16

Re: Infrastructure maintainance on 19.11.
 
2 Attachment(s)
Let me share the screen that our Supermicro server showed to reward us for a day of work...
http://www.supermicro.nl/products/sy...cfm?parts=SHOW

Then we also discovered that Supermicro charges money for the license needed to flash the BIOS remotely via IPMI.
(Anyway, we are not sure this would work to recover the BIOS.)

Supermicro: really, thanks.

Win7Mac 2016-11-20 20:58

Re: Infrastructure maintainance on 19.11.
 
Possible to replace the chip?

xes 2016-11-20 23:40

Re: Infrastructure maintainance on 19.11.
 
@win7mac
At the moment I can't say how serious the problem we are facing is; tomorrow Falk will run some tests while trying to restore the blade.

Also, with your personal PC, board or laptop you can try whatever you want, any hack or trick, because you have nothing to lose. With servers you have to take a different perspective and consider risks, best options, time to fix, quality of the result and the possibility of causing more damage.

So my reply is: I don't think anyone would try to remove a chip from a server mainboard without a spare board or a guaranteed result.

Win7Mac 2016-11-21 00:28

Re: Infrastructure maintainance on 19.11.
 
I wasn't suggesting any tricks or hacks. Some BIOS chips are replaceable, but since it's not listed on that parts list, that's probably not an option. :(

fstern 2016-11-21 06:28

Re: Infrastructure maintainance on 19.11.
 
Quote:

Originally Posted by joerg_rw (Post 1519131)
plus we have two spare blades, incl BIOS chips (if the flash of the now-down blade is actually defect)
edit: I think it would actually be a great opportunity to swap the blades for wear leveling

No, we don't. All we have is two empty slots in the chassis.

Best,

Falk

juiceme 2016-11-21 07:25

Re: Infrastructure maintainance on 19.11.
 
What's the cost of a new blade if it turns out the bios chip in blade-B is beyond recovery-by-flashing?
Are there different options on what kind of blades can be used with our server chassis, and what are the costs?

pichlo 2016-11-21 08:25

Re: Infrastructure maintainance on 19.11.
 
Quote:

Originally Posted by joerg_rw (Post 1519110)
of course when your request to https://wiki.maemo.org gets redirected to https://talk.maemo.org (or whatever) then the cert has wrong name ;-)

No, that wasn't it. I was going to https://talk.maemo.org. The name in the certificate was something completely off, like boo.muahaha.de (sorry, I cannot remember the exact name, but it was definitely something from the dot de domain).

I am posting from work now and cannot reproduce it on my work PC (in Pale Moon on Windows 7). I could not reproduce it on my Jolla either. But that's how it showed on my daughter's Android tablet. I can try again later today when I come home.

jellyroll 2016-11-21 09:15

Re: Infrastructure maintainance on 19.11.
 
2 Attachment(s)
Webcat:




sfdroid browser:

mscion 2016-11-21 13:34

Re: Infrastructure maintainance on 19.11.
 
Hi. Sorry, I didn't quite follow the previous discussion, but I just noticed that if I select Intro, Downloads, Development, Community or News, I go directly to Talk.

EDIT: Ok. I see this is mentioned in fstern's post #3. Sorry again.

fstern 2016-11-21 19:53

Re: Infrastructure maintainance on 19.11.
 
Short update from today's datacenter visit:

Quote:

Originally Posted by juiceme (Post 1519135)
What's the cost of a new blade if it turns out the bios chip in blade-B is beyond recovery-by-flashing?
Are there different options on what kind of blades can be used with our server chassis, and what are the costs?

We actually do have 2 empty motherboards in the chassis.

Blade-b is totally broken. If I attach USB devices, it just signals a different POST error.

I couldn't exchange blade-b with a spare blade because I couldn't remove blade-b from the chassis to exchange the CPU and memory.

To exchange the two blades I have to remove the box from the rack (or at least uncable it).

This is planned for this Saturday; sadly I don't have time earlier.

This will include a full powerdown of all maemo servers for about 30 minutes to finish work.

Best,

Falk

pichlo 2016-11-22 08:31

Re: Infrastructure maintainance on 19.11.
 
Quote:

Originally Posted by pichlo (Post 1519136)
No, that wasn't it. I was going to https://talk.maemo.org. The name in the certificate was something completely off, like boo.muahaha.de (sorry, I cannot remember the exact name, but it was definitely something from the dot de domain).

OK, I think I got it. The problem is the Coding Competition logo in the top right. It links to https://wiki.maemo.org/Maemo.org_Cod...mpetition_2016, which has a certificate issued to kilbeggan.fourecks.de. Both my PC and my Jolla simply and quietly refuse to load it, but the stock browser in my kid's Android tablet tried and popped up the warning.
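
To illustrate why a certificate issued to kilbeggan.fourecks.de triggers a warning on wiki.maemo.org, here is a minimal Python sketch of hostname matching. It is a deliberately simplified approximation of what browsers do, not a complete implementation of the real rules: it only handles exact names and a single leading wildcard label. The function name and the `*.maemo.org` example are hypothetical and say nothing about the names in the actual maemo.org certificates.

```python
def hostname_matches(hostname: str, cert_names: list[str]) -> bool:
    """Return True if hostname is covered by any name listed in a certificate.

    Simplified: supports exact matches and a wildcard as the whole
    leftmost label only (e.g. '*.example.org').
    """
    host_labels = hostname.lower().rstrip(".").split(".")
    for name in cert_names:
        name_labels = name.lower().rstrip(".").split(".")
        if len(name_labels) != len(host_labels):
            continue  # a wildcard stands for exactly one label
        if name_labels[0] == "*":
            # Require at least two fixed labels after the wildcard.
            if len(name_labels) > 2 and name_labels[1:] == host_labels[1:]:
                return True
        elif name_labels == host_labels:
            return True
    return False


# The situation described above: the name on the cert does not cover the URL.
print(hostname_matches("wiki.maemo.org", ["kilbeggan.fourecks.de"]))
# A hypothetical wildcard certificate would cover the subdomain.
print(hostname_matches("wiki.maemo.org", ["*.maemo.org"]))
```

A browser that finds no matching name either shows a warning, as the Android stock browser did, or silently refuses to load the resource.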

xes 2016-11-22 21:51

Re: Infrastructure maintainance on 19.11.
 
New maintenance notice.

I'm going to stop stage (repository) to create a backup copy.
Due to the size of the VM, this operation could take many hours.
Thank you for your patience.

chemist 2016-11-23 09:18

Re: Infrastructure maintainance on 19.11.
 
Quote:

Originally Posted by joerg_rw (Post 1519164)
Please consider sending in blade-b for repair, HiFo (or who ever is responsible for holding the account) should have enough funds for that and should be interested in keeping assets in a functional state.

[edit] actually (quote) used CPUs are a few bucks nowadays, from datacenter upgrade surplus (/quote). So - unlike a 4(?) years ago - now populating a third (and, after repair, 4th) blade might be way cheaper and basically only cost the HDD plus a few bucks for a CPU and RAM you just need to clean from dust before mounting them to our spare blades

BR
/j

If you cared about who is in charge of our funds as much as you like to troll techstaff, you would remember that you are a member of the entity behind Maemo, and you would show up to meetings and know about it. You might consider leaving the e.V. if you do not plan to show up to 4 consecutive meetings (for whatever reason).

To answer the actual question: the blade mainboards will be swapped for now. Techstaff and board will work out the details of populating the then-empty slots (2) with hardware that is not doomed to fail and not currently being replaced in all the datacenters of the people I talked to over the past 3 days.

We started to discuss ideas about how the setup (best fit) would look for us and so far we came up with the idea of having a 2(old):1(new) setup where 1 can take the load off both old blades, with the option to make it 2:1:1 if we got enough funds for that. We also need to increase storage capacity and it might be a good idea to replace the PSUs while we are at it.

xes 2016-11-23 20:15

Re: Infrastructure maintainance on 19.11.
 
Maintenance Notice:

stage (repository) is up and running.

bencoh 2016-11-25 11:55

Re: Infrastructure maintainance on 19.11.
 
By the way, could anyone with proper rights add a link to this topic as a TMO notice (just like the coding competition one)?

Letting non-daily TMO readers know that maemo (the infra) is not dead/dying but just under maintenance sounds like a reasonable idea to me :)

xes 2016-11-26 23:12

Re: Infrastructure maintainance on 19.11.
 
Maintenance Notice: MAEMO IS UP AND RUNNING.

More details about current status and work done will follow.

Halftux 2016-11-28 15:56

Re: Infrastructure maintainance on 19.11.
 
What I noticed is that the maemo extra assistant page is working but is not moving the files to the autobuilder (no notice from extras-cauldron-list).
If this is already known, ignore me.

Thanks for the hard work that the faulty hardware caused.

xes 2016-11-29 01:42

Re: Infrastructure maintainance on 19.11.
 
@Halftux

please check it now.

https://garage.maemo.org/pipermail/e...mber/date.html

Halftux 2016-11-29 09:22

Re: Infrastructure maintainance on 19.11.
 
Quote:

Originally Posted by xes (Post 1519587)

Thank you, xes. Now the package appears in the cauldron list and maemo/packages; no further upload was needed.
So I guess it was just hanging somewhere.

I really appreciate the overall effort. Well done.


All times are GMT.