Social.coop - Tech working group

Fri 17 Aug 2018 10:34AM

Heads up on social.coop server space

Nick S Public Seen by 25

Just to report on the last server outage whilst I'm thinking about it....

This last one was resolved by @fardog :raised_hands: who happened to be awake at the right time. He discovered that the server was running low on disk space on the main (root) partition.

He pruned some docker image cruft, but it's still currently at 93% full.

Now I can't explain exactly why it's so full, but obviously we need to do something about it or our server will start dying all the time. It's not the Mastodon database; that's on another disk (itself 64% full).
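A minimal sketch of how the fill level of each partition could be checked from a script, assuming Python 3 is available on the server. The function names and the 90% threshold are illustrative choices, not anything we currently run, and the figure can differ slightly from what df reports because df accounts for reserved blocks.

```python
import shutil


def percent_used(path):
    """Return how full the filesystem containing `path` is, as a percentage."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (in bytes)
    return 100.0 * usage.used / usage.total


def partitions_over(paths, threshold=90.0):
    """Return (path, percent) pairs for filesystems at or above the threshold."""
    report = [(p, percent_used(p)) for p in paths]
    return [(p, pct) for p, pct in report if pct >= threshold]
```

Calling partitions_over(["/"]) would return the root partition only if it is at least 90% full, which could then feed a notification of some kind.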

(Perhaps the growth was related to the recent Mastodon influx I've been hearing about, but either way we should expect more users and more tooting...)

Sorting this out may require taking the server down for a while, I suspect.

There are also backups. I notice the database backup file has quadrupled in size since about June (2G -> 8.4G), which probably needs investigation. I say 'backup' because we currently just have a manual backup of the database, and it's only run when someone remembers to. In order to protect ourselves from the various trousers-falling-down scenarios we might encounter, we need an automated backup, ideally generational, which also means tens to hundreds more gigabytes of (off-server) disk space.
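To make "generational" concrete, here is a rough sketch of one possible retention rule: keep the most recent daily backups, plus the newest backup from each of the last few weeks and months. The function names and the default counts (7 daily, 4 weekly, 6 monthly) are assumptions for illustration only, and the code only decides which dates to keep; it does not create or delete anything.

```python
from datetime import date


def _newest_per_bucket(newest_first, bucket_of, limit):
    """Return the newest date in each of the `limit` most recent buckets."""
    chosen = {}
    for d in newest_first:
        bucket = bucket_of(d)
        if bucket not in chosen:
            if len(chosen) == limit:
                break  # dates are newest first, so older buckets follow
            chosen[bucket] = d
    return set(chosen.values())


def backups_to_keep(dates, daily=7, weekly=4, monthly=6):
    """Pick which backup dates to retain under a simple generational scheme."""
    newest_first = sorted(set(dates), reverse=True)
    keep = set(newest_first[:daily])  # the most recent daily backups
    # Newest backup in each recent ISO week, keyed by (ISO year, ISO week).
    keep |= _newest_per_bucket(newest_first, lambda d: d.isocalendar()[:2], weekly)
    # Newest backup in each recent calendar month.
    keep |= _newest_per_bucket(newest_first, lambda d: (d.year, d.month), monthly)
    return sorted(keep)
```

Anything not returned by backups_to_keep would be a candidate for deletion, so the off-server space needed stays roughly bounded by the number of generations kept times the size of one backup.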

Does anyone here know much about analysing Mastodon instances, or know someone who does?

And this touches on the issue of spending funds, which is a different issue but I'll mention here: perhaps we should allocate a budget to working groups, which they can spend at their discretion without the need to go back to the main / finance group?

For those with git.coop accounts, you can see the tickets I created on the recent outage and the disk space question here. I suggest we keep the technical discussion there as much as possible to spare those here who have been overwhelmed by Loomio chat. :) Anyone who wants an account can sign up following the instructions here https://git.coop/social.coop/

Antoine-Frédéric Raquin Tue 21 Aug 2018 10:32AM

I was suggesting XMPP because:

1) I was not aware of the Matrix chat;

2) there's no reason to discuss snapcraft in a Matrix chatroom, as it would mainly flood the chatrooms already here, IMO. I prefer to have 5 private conversations on 5 different sync channels (that anyone is free to join) than to have 5 conversations in the same Matrix chatroom, for the sake of clarity;

3) I'm not sure that Riot runs on my computer.

Nick S Tue 21 Aug 2018 10:49AM

The riot.im client (see my link above) should run in all major web browsers, but if not please let us know because we don't want to pick tech which isn't fairly universally accessible for our public channel.

However, if you want a private discussion with a specific person, you're right, of course you can pick any tech you agree on.

And a snapcraft discussion could just be on a new thread here in the Loomio Tech WG group, I didn't mean to imply you needed to discuss that in our public chat room.

Antoine-Frédéric Raquin Tue 21 Aug 2018 11:30AM

It doesn't run on a 2000s Intel CPU without overheating the CPU, whereas xmpp-client, Gajim, Dino… run without many problems on an i3 (obviously not on Windows though). I'm not on my own computer (whose motherboard melted) so it's fine, but this exact trend of building software on Electron or in a browser just because it works across all platforms isn't really on point for ecology and accessibility.

It's also not that accessible, because the Riot client has a terrible UX, and I just think that making it the de facto standard for free software, cooperative, or union work is a bad idea.

I'd simply recommend rocket.chat instead of Matrix (if not XMPP, but that's just because the clients and servers are objectively better regardless of the protocol itself). I wonder whether officially using Matrix pushes people out of the decision process, and I'd like to question that.

Victor Matekole Fri 24 Aug 2018 6:59AM

Never heard of Snap till now... However, here is the current status of the Snap package for Mastodon: https://github.com/tootsuite/mastodon/issues/1068... It appears to be non-existent at the moment.

Victor Matekole Fri 17 Aug 2018 2:27PM

Hi all,

Disk space has always been a problem, mainly due to limited budget.... However, there are a couple of things that you missed that will free up further space:

- docker system prune -a. This gets rid of everything that is superfluous, including stopped containers and unused images (on Docker 17.06.1 and later, volumes are only pruned when --volumes is also given). Docker doesn't automatically remove previously run containers or volumes unless you tell it to. It is safe to do this because the volumes that need persisting are mounted from the host, so we don't have to care about Docker preserving volumes or containers.

- NUM_DAYS=7 rake mastodon:media:remove_remote. All media is uploaded locally first, resized/optimised with Paperclip and then pushed to DreamObjects. I'm not sure whether Mastodon runs this rake task regularly. It may do, but in my experience there is always media available to delete when I run it, and this frees up a lot of space.

Finally, looking at the disk, /var/lib/docker alone takes up 27GB of space, which suggests it is this media that is taking up the space. I am now running this rake task in a detached screen, so please no reboots : ) until I give the all clear.
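For anyone wanting to see which subdirectories account for the space, the following is a rough sketch in Python of a du-style breakdown. The function names are made up for illustration, it needs to run as a user who can read the directory, and it sums apparent file sizes, so the totals can differ from du (and may overcount under /var/lib/docker if the storage driver shares layers between images).

```python
import os


def tree_size(root):
    """Sum the apparent sizes, in bytes, of all regular files under `root`."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                if not os.path.islink(path):
                    total += os.lstat(path).st_size
            except OSError:
                continue  # file vanished or is unreadable; skip it
    return total


def largest_children(root, top=10):
    """Return the `top` largest direct children of `root` as (bytes, name)."""
    sizes = []
    with os.scandir(root) as entries:
        for entry in entries:
            if entry.is_dir(follow_symlinks=False):
                sizes.append((tree_size(entry.path), entry.name))
            elif entry.is_file(follow_symlinks=False):
                sizes.append((entry.stat(follow_symlinks=False).st_size, entry.name))
    return sorted(sizes, reverse=True)[:top]
```

Running largest_children("/var/lib/docker") would list the biggest entries first, which should show whether it is volumes, images or container logs that dominate.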

Hope this helps.

@wulee @fardog

Nick S Fri 17 Aug 2018 3:13PM

Thanks, I've just added this to the ticket here, to be tried and written up somewhere.

Victor Matekole Fri 17 Aug 2018 2:31PM

Additionally, I just came across this —
https://github.com/tootsuite/documentation/blob/d9ecbee47d6c09afbff8cf1280e29018872936b3/Running-Mastodon/Production-guide.md#remote-media-attachment-cache-cleanup

It appears that the cache is also made up of images from other instances we are connected to, scary....

Nick S Fri 17 Aug 2018 4:46PM

@victormatekole, can we, or should we, try to move the /var/lib/docker folder off the root partition? Is it eventually going to get too big?

Victor Matekole Fri 17 Aug 2018 4:59PM

@wulee good question and probably not a bad strategy as it does tend to bloat. Theoretically there should be no issues but I would never assume with Docker, let me do some digging.

That being said, I am wondering if we should consider getting a root server. I run my services via Digital Ocean (mission critical) and Hetzner, where I have a couple of root servers (less critical and resource heavy). Hetzner are super cheap and you get a lot for very little; the service ain't too bad either, and they are based in Germany. With an extra $15 or so we can get a couple of terabytes and a not too shabby CPU. Nursing 100GB with a social network of 1000+ users seems pretty tight.

Mayel de Borniol Sat 18 Aug 2018 8:39PM

Not sure what you mean? The servers we have are already root servers.
And there's pros/cons to 100GB of SSD storage vs 1000GB of SATA storage.
Of course it's probably time to add storage space and/or upgrade the server (with a bigger root partition). I have no objections if you all want to switch to another provider either, though it's worth putting together a comparison table.

Victor Matekole Sat 18 Aug 2018 9:08PM

Sorry, "root servers" implies dedicated hardware/servers (not virtual); as far as I understood, Scaleway is a cloud service? You are correct regarding SATA vs SSD. However, Hetzner will allow mixed setups: we can have SSD for Postgres and SATA for less demanding parts of the stack. Either way, I am sure we'd pay less per GB than on Scaleway. But as I suggested earlier, there is always a trade-off: having a dedicated server means we look after the hardware, and if a disk breaks we have to call Hetzner to replace it. From experience, they are reasonably fast in this case.

Nonetheless, I've always felt 100GB was never enough for our growth rate and long-term requirements. Hetzner was just an example, as I know them, but I have no bias. I just wanted to start a conversation in which growth rate, performance and cost are carefully considered.

E.g. a piece of hardware:

Intel Core i7-2600
2x HDD SATA 1.5 TB
1x SSD 240 GB
RAM 32 GB DDR3
€45.38 / month

Gil Scott Fitzgerald Sat 18 Aug 2018 9:11PM

I wonder if we could just throw postgres in RAM?

Mayel de Borniol Sat 18 Aug 2018 9:12PM

As indicated in the docs 'trunk' is a dedicated server, and 'toot' is VPS:
https://git.coop/social.coop/tech/operations/wikis/infrastructure-overview

Victor Matekole Sat 18 Aug 2018 9:15PM

I see ... Do they support upgrades of the disk and perhaps memory?

Victor Matekole Sat 18 Aug 2018 9:19PM

BTW, how do I get an account on git.coop? I just tried to register with my email address but was denied.

Fabián Heredia Montiel Sat 18 Aug 2018 9:33PM

Hi @victormatekole, check out this guide on the steps to get your git.coop account: https://git.coop/social.coop/general/wikis/getting-an-account

Nick S Sat 18 Aug 2018 11:14PM

I think one of our milestones should be the capability (duplicated amongst several people) to rebuild the server in the event it dies or gets hacked.

In order to learn how to do this, we need a server (or servers) to practice on.

I'd call this a "staging server".

Gil Scott Fitzgerald Sat 18 Aug 2018 7:27PM

IMO spend the money for a good experience and fewer headaches later

Victor Matekole Sat 18 Aug 2018 9:10PM

Disk consumption is now at 80%, by the way, but there is more that can be trimmed from the media cache. I think someone restarted the Ruby app, and so the job I started got killed.

Nick S Sat 18 Aug 2018 11:21PM

Wasn't me, honest!

In general I aim to go to the riot.im channels to check if anything's going on on our servers, or to announce it on the public channel if I'm there doing something. I suggest this'd be a good policy for everyone to follow, to help avoid tripping each other up by mistake.

open channel: https://riot.im/app/#/room/#SocialCoop:matrix.org
encrypted private channel: https://riot.im/app/#/room/#tech.social.coop:matrix.org

Nick S Sun 19 Aug 2018 9:27AM

Also, I should add, if this was running in a docker container, I have been noticing a lot of 'dying and restarting' events when browsing the datadog account Mayel (I think) set up to monitor our servers. (Maybe it was you originally, however you did say it was unused and should be removed, and it seems to be a new free account).

If you have any experience interpreting these, I'd be interested what you think...

And anyone else on the tech team who's interested, go and have a look, it's quite impressive. I can either paste the credentials into the tech group's private channel, or maybe I'll get time to get keyringer set up.

Victor Matekole Sun 19 Aug 2018 3:23PM

Glad you are finding Datadog useful, it is a pretty amazing tool! I thought it should be killed as I understood they were removing their free option or at least limiting it to 30 days... I may have got that wrong; last time I checked I could not gain access with my current credentials for social.coop. If you send me the credentials I'd be happy to give my 2 cents...

Ian Smith Mon 20 Aug 2018 11:49PM

Social.coop returning 502 bad gateway. @victormatekole @wulee

Nick S Tue 21 Aug 2018 9:55AM

Thanks. As I mentioned in the chat channel, it seems to have resolved itself...

There've been a bunch of outages like this, in which there's a 502 or similar, and a pingometer/pingdom notification, which mysteriously resolves itself. I'm a bit of a newbie with docker, but it looks like one of the containers will die and then restart. I'd like to know why this happens, I'm still researching that. Maybe @mayel or @victormatekole or one of the other admins will be able to shed some light on that, but at least it isn't currently a critical problem (and I don't think it's a disk related problem).

Nick S Tue 21 Aug 2018 9:58AM

Timezones: I think we have admins who can fix server issues in the EU and US timezones (assuming they're not indisposed for some reason). Do we have anyone in the Asian timezones in between who could do this?

Victor Matekole Fri 24 Aug 2018 7:10AM

When I have a chance to look at Datadog I will check to see what may be the root cause. When I look at memory consumption there is only 200MB free. I wonder if we are hitting some memory limits, which is common with Rails apps, as they tend to be resource heavy and leak memory, especially from poorly written 3rd-party packages.
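One way the memory-limit theory might be checked without Datadog: docker inspect reports a RestartCount for each container and an OOMKilled flag in its State. The sketch below, with a made-up function name, only parses that JSON. It assumes the output has been captured with something like docker inspect $(docker ps -aq), and note that OOMKilled covers kills attributed to the container itself, so a clean result would not rule out memory pressure elsewhere on the host.

```python
import json


def restart_summary(inspect_output):
    """Summarise restart-related fields from `docker inspect` JSON output."""
    summaries = []
    for container in json.loads(inspect_output):
        state = container.get("State", {})
        summaries.append({
            # docker inspect reports names with a leading slash
            "name": container.get("Name", "").lstrip("/"),
            "restart_count": container.get("RestartCount", 0),
            "oom_killed": state.get("OOMKilled", False),
            "exit_code": state.get("ExitCode"),
        })
    return summaries
```

A container showing a growing restart_count together with oom_killed set to True (often with exit code 137) would point fairly strongly at memory.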

Nick S Fri 24 Aug 2018 7:42AM

I was trying to get the memory/CPU load overlaid with docker events, to see if they correlate. I think I managed it and concluded that the memory grows and then gets reset when there's an event, but this is across the whole system, and it doesn't show that memory causes the events rather than vice versa.
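If the same comparison were done outside Datadog, it might look something like this sketch. It assumes the memory readings and docker event times have been exported as plain timestamps (the data format and function name are hypothetical), and it simply pairs each event with the last reading before it and the first reading at or after it.

```python
from bisect import bisect_left


def memory_around_events(samples, events):
    """Pair each event with the memory reading just before and just after it.

    `samples` is a list of (timestamp, used_mb) tuples sorted by timestamp;
    `events` is a list of timestamps in the same units.
    """
    times = [t for t, _ in samples]
    result = []
    for event in sorted(events):
        i = bisect_left(times, event)  # first sample at or after the event
        if i == 0 or i == len(samples):
            continue  # no reading on one side of the event; skip it
        before = samples[i - 1][1]
        after = samples[i][1]
        result.append((event, before, after))
    return result
```

A consistent drop from "before" to "after" would match the grow-then-reset pattern, but as noted it would not by itself say which way the causation runs.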

Chris Croome (Webarchitects Co-operative) Tue 21 Aug 2018 9:50AM

Hi, one option you could consider for hosting is buying your own hardware. If you can raise the capital, you could get a 1U server with a lot of RAM, SSDs and HDDs, which could run everything (assuming you run a hypervisor on it and multiple virtual servers) and have space for development servers and backups (though you would probably also want backups elsewhere), and then colocate it with a hosting co-operative. Most new servers come with a three-year warranty, so it would make sense to budget for renewing it after 3 years. However, at that point the old machine could be used as a backup since, in my experience, servers can generally be run for about ten years.

Victor Matekole Fri 24 Aug 2018 7:02AM

I like the idea of owning bare metal! Some cost-benefit analysis would have to be performed, but I suspect it would be cheaper in the long run as the network grows in numbers.
