2017 in China's open source contribution list
Original: Who contributed the most to open source in 2017? Let's analyze GitHub
Translation: unimpeded me flying
Abstract: Using GitHub's 2017 data, the author analyzes the open source contributions of nearly 100 companies and explains the analysis method. The translation follows.
In this analysis we look at all the PushEvents published on GitHub in 2017. For every GitHub user, we make our best guess at which organization they belong to. The analysis only considers repositories that gained more than 20 stars in 2017.
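As a rough illustration of the star threshold, here is a minimal Python sketch. It assumes each event has already been flattened into a dictionary with the hypothetical keys `type`, `repo_id`, and `actor_login`; these names are illustrative and are not the GitHub Archive schema.

```python
from collections import defaultdict


def repos_over_star_threshold(events, min_stars=20):
    """Return {repo_id: stars} for repos with more than min_stars distinct stargazers.

    A star is counted once per user, so repeated WatchEvents from the
    same login do not inflate the total.
    """
    stargazers = defaultdict(set)
    for event in events:
        if event["type"] == "WatchEvent":
            stargazers[event["repo_id"]].add(event["actor_login"])
    return {
        repo_id: len(users)
        for repo_id, users in stargazers.items()
        if len(users) > min_stars
    }
```

Only events from the period being analyzed should be passed in, so that the count reflects stars gained in 2017 rather than all-time totals.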
Here are the results of my analysis. They can also be explored and modified in my interactive report.
Comparison of top cloud providers
2017 GitHub data:
- About 1,300 Microsoft employees actively push code to 825 top repositories on GitHub.
- About 900 Google employees are active on GitHub, pushing code to about 1,100 top repositories.
- About 134 Amazon employees are active on GitHub, pushing code to only 158 top repositories.
- Not all projects are equal: Google employees contribute to 25% more repositories than Microsoft employees, and those repositories also collected more stars (530,000 vs 260,000). Amazon's repositories collected 27,000 stars in total in 2017.
RedHat, IBM, Pivotal, Intel, and Facebook
Amazon is far behind Microsoft and Google. Which companies fall in between? Judging by their contributions, RedHat, Pivotal, and Intel also have a prominent presence on GitHub.
Note that the following table merges all of IBM's regional domains (IBM uses a different domain suffix in each country), although the individual regions still appear in the tables that follow.
Facebook and IBM (US) have about as many GitHub users as Amazon, but the projects they contribute to have collected more stars (especially Facebook):
Then it's Alibaba, Uber, and Wix:
GitHub itself, Apache and Tencent:
Baidu, Apple, Mozilla:
Oracle, Stanford, MIT, Shopify, MongoDB, Berkeley, VMware, Netflix, Salesforce, GSA.gov:
LinkedIn, Broad Institute, Palantir, Yahoo, Mapbox, Unity3d, Automattic, Sandia, Travis-ci, Spotify:
Chromium, UMich, Zalando, Esri, IBM (UK), SAP, EPAM, Telerik, UK Cabinet Office:
CERN, Odoo, Kitware, SUSE, Yandex, IBM (Canada), Adobe, Airbnb, Chef, The Guardian:
Arm, MacPorts, Docker, Nuxeo, NVIDIA, Yelp, Elastic, NYU, WSO2, Mesosphere, Inria:
Puppet, Stanford (CS), DatadogHQ, EPFL, NTT Data, Lawrence Livermore Lab:
How to link GitHub users with enterprises
Determining which organization a GitHub user belongs to is not easy, but we can make a guess from the email domain, which is included in the commits of each PushEvent.
- The same email can appear under more than one user, so I only considered GitHub users who pushed code to projects with more than 20 stars during the period.
- I only counted GitHub users with more than 3 pushes during the period.
- Users pushing code to GitHub can show many different emails in their pushes, which is part of how Git works. To determine the organization each user belongs to, I looked at the email that appears most frequently in their pushes.
- Not everyone uses their organization's email address. There are a lot of gmail.com, users.noreply.github.com, and other email-hoster addresses on GitHub. Sometimes users stay anonymous to protect their mailboxes; if I can't see their email domain, I can't count them.
- Sometimes employees change organizations. I assigned them to the organization that received more of their pushes.
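The rules above can be sketched in Python as follows. This is a simplified illustration rather than the author's implementation: the input shape (one `(user_id, author_email)` tuple per push) and the function name are assumptions, and `EMAIL_HOSTERS` is only a subset of the hoster list used in the query.

```python
from collections import Counter, defaultdict

# Assumption: a small subset of the email hosters excluded in the article's query.
EMAIL_HOSTERS = {"gmail.com", "users.noreply.github.com", "qq.com", "hotmail.com", "163.com"}


def guess_domain_per_user(pushes, min_pushes=3):
    """Map each user to the domain of the email they pushed with most often.

    pushes: iterable of (user_id, author_email) tuples, one per push.
    Users with min_pushes or fewer pushes, without a usable email, or whose
    most frequent email belongs to a generic email hoster are skipped.
    """
    emails_by_user = defaultdict(Counter)
    for user_id, email in pushes:
        emails_by_user[user_id][email] += 1

    domains = {}
    for user_id, counts in emails_by_user.items():
        if sum(counts.values()) <= min_pushes:
            continue  # not enough pushes in the period
        top_email, _ = counts.most_common(1)[0]
        if not top_email or "@" not in top_email:
            continue  # hidden or malformed email: the domain is unknown
        domain = top_email.split("@", 1)[1].lower()
        if domain in EMAIL_HOSTERS:
            continue  # personal mailbox, says nothing about the employer
        domains[user_id] = domain
    return domains
```

Picking the most frequent email also covers job changes: a user who pushed under two employers is assigned to whichever domain shows up in more of their pushes.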
#standardSQL
WITH period AS (
  SELECT *
  FROM `githubarchive.month.2017*` a
),
repo_stars AS (
  # repositories that gained more than 20 stars during the period
  SELECT repo.id
    , COUNT(DISTINCT actor.login) stars
    , APPROX_TOP_COUNT(repo.name, 1)[OFFSET(0)].value repo_name
  FROM period
  WHERE type='WatchEvent'
  GROUP BY 1
  HAVING stars>20
),
pushers_guess_emails_and_top_projects AS (
  SELECT *
    # , REGEXP_EXTRACT(email, r'@(.*)') domain
    , REGEXP_REPLACE(REGEXP_EXTRACT(email, r'@(.*)'), r'.*.ibm.com', 'ibm.com') domain
  FROM (
    SELECT actor.id
      , APPROX_TOP_COUNT(actor.login, 1)[OFFSET(0)].value login
      # most frequent author email, taken from the first commit of each push
      , APPROX_TOP_COUNT(JSON_EXTRACT_SCALAR(payload, '$.commits[0].author.email'), 1)[OFFSET(0)].value email
      , COUNT(*) c
      , ARRAY_AGG(DISTINCT TO_JSON_STRING(STRUCT(b.repo_name, stars))) repos
    FROM period a
    JOIN repo_stars b
    ON a.repo.id=b.id
    WHERE type='PushEvent'
    GROUP BY 1
    HAVING c>3
  )
)
SELECT * FROM (
  SELECT domain
    , githubers
    , (SELECT COUNT(DISTINCT repo) FROM UNNEST(repos) repo) repos_contributed_to
    , ARRAY(
        SELECT AS STRUCT JSON_EXTRACT_SCALAR(repo, '$.repo_name') repo_name
          , CAST(JSON_EXTRACT_SCALAR(repo, '$.stars') AS INT64) stars
          , COUNT(*) githubers_from_domain
        FROM UNNEST(repos) repo
        GROUP BY 1, 2
        HAVING githubers_from_domain>1
        ORDER BY stars DESC LIMIT 3
      ) top
    , (SELECT SUM(CAST(JSON_EXTRACT_SCALAR(repo, '$.stars') AS INT64))
       FROM (SELECT DISTINCT repo FROM UNNEST(repos) repo)) sum_stars_projects_contributed_to
  FROM (
    SELECT domain, COUNT(*) githubers, ARRAY_CONCAT_AGG(ARRAY(SELECT * FROM UNNEST(repos) repo)) repos
    FROM pushers_guess_emails_and_top_projects
    # WHERE domain IN UNNEST(SPLIT('google.com|microsoft.com|amazon.com', '|'))
    WHERE domain NOT IN UNNEST(SPLIT('gmail.com|users.noreply.github.com|qq.com|hotmail.com|163.com|me.com|googlemail.com|outlook.com|yahoo.com|web.de|iki.fi|foxmail.com|yandex.ru', '|')) # email hosters
    GROUP BY 1
    HAVING githubers > 30
  )
  WHERE (SELECT MAX(githubers_from_domain) FROM (SELECT repo, COUNT(*) githubers_from_domain FROM UNNEST(repos) repo GROUP BY repo))>4 # second filter for email hosters
)
ORDER BY githubers DESC
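The final aggregation in the query is dense, so here is a small Python sketch of the same idea. It assumes the per-user domain guess and per-user repository sets are already available as plain dictionaries; the function name, argument names, and data shapes are illustrative, and star totals are left out for brevity.

```python
from collections import Counter, defaultdict


def rank_domains(user_domain, user_repos, min_users=30, min_shared=4):
    """Rank domains by number of active GitHub users.

    user_domain: {user_id: domain}
    user_repos:  {user_id: set of repo names the user pushed to}

    A domain is kept only if it has more than min_users users and at
    least one repo with more than min_shared contributors from that
    domain. The second condition weeds out remaining email hosters,
    whose users rarely work on the same projects.
    """
    users_by_domain = defaultdict(list)
    for user_id, domain in user_domain.items():
        users_by_domain[domain].append(user_id)

    rows = []
    for domain, users in users_by_domain.items():
        if len(users) <= min_users:
            continue
        # number of users from this domain contributing to each repo
        per_repo = Counter(repo for u in users for repo in user_repos.get(u, ()))
        if not per_repo or max(per_repo.values()) <= min_shared:
            continue
        rows.append((domain, len(users), len(per_repo)))
    return sorted(rows, key=lambda row: row[1], reverse=True)
```

Each returned tuple holds the domain, its number of GitHub users, and the number of distinct repositories they contributed to.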
Frequently asked questions
If an organization has 1,500 repositories, why do you only count 200? If a repository has 7,000 stars, why do you only show 1,500?
I filter for relevance: I only count stars given during 2017. For example, Apache has more than 1,500 repositories on GitHub, but only 205 of them gained more than 20 stars in 2017.
Is this the current state of open source?
Please note that analyzing GitHub does not include top communities such as Android, Chromium, GNU, Mozilla, or the Apache and Eclipse foundations, nor other projects that choose to run most of their activity outside GitHub.
It's unfair to my organization
I can only count what I can see. You are welcome to challenge my assumptions and tell me how to measure things in a better way. A working query is the best way to do that.
For example, see how the rankings change when IBM's region-based domains are merged into its top-level domain with a single SQL transformation:
SELECT *, REGEXP_REPLACE(REGEXP_EXTRACT(email, r'@(.*)'), r'.*.ibm.com', 'ibm.com') domain
(When IBM's regional email domains are merged, its relative ranking changes significantly.)
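For readers more comfortable with Python, the same domain normalization could look roughly like this. The function name is hypothetical, and the pattern is slightly stricter than the one in the query: it escapes the dots and anchors the match at the end of the domain.

```python
import re


def normalize_domain(email):
    """Extract the domain from an email and fold IBM's regional subdomains into ibm.com."""
    match = re.search(r"@(.*)", email)
    if match is None:
        return None  # no "@" in the address, so there is no domain to extract
    domain = match.group(1)
    # e.g. "uk.ibm.com" and "ca.ibm.com" both become "ibm.com"
    return re.sub(r".*\.ibm\.com$", "ibm.com", domain)
```

The same approach would work for any organization that spreads its employees across regional subdomains; only the pattern changes.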
The next step
I have been wrong before, and I may be wrong again. Look at all the raw data available from GitHub and challenge all my assumptions. It will be cool to see what you find.