The 2017 open source contribution rankings

Translated article, October 31, 2017 17:20:03

Original: Who contributed the most to open source in 2017? Let's analyze GitHub
Author: Felipe Hoffa
Translation: unimpeded me flying

Abstract: Using GitHub's 2017 data, the author analyzes the open source contributions of nearly 100 companies and explains his analysis method. The translation follows.


In this analysis, we look at all the PushEvents published on GitHub in 2017. For every GitHub user, we make a best guess at which organization they belong to. The analysis considers only repositories that gained more than 20 stars in 2017.

Here are the results of my analysis, which you can modify in my interactive report.

Comparison of top cloud providers

2017 GitHub data:

  • About 1,300 Microsoft employees actively pushed code to 825 top repositories on GitHub.
  • About 900 Google employees were active on GitHub, pushing code to about 1,100 top repositories.
  • Only about 134 Amazon employees were active on GitHub, pushing code to just 158 top repositories.
  • Not all projects are equal: Google employees contributed code to 25% more repositories than Microsoft employees, and those repositories gained far more stars (530,000 vs. 260,000). The repositories Amazon contributed to gained a total of 27,000 stars in 2017.
    [Image]

Red Hat, IBM, Pivotal, Intel, and Facebook

Amazon is far behind Microsoft and Google. Which companies fall between them? By this measure, Red Hat, Pivotal, and Intel also made prominent contributions on GitHub.

Note that the following table merges all of IBM's regional domains (IBM's domain in each country carries that country's suffix), although the individual regions still appear in the later tables.

[Image]

[Image]

Facebook and IBM (US) have about as many GitHub users as Amazon, but the projects they contribute to gained more stars (especially Facebook's):

[Image]

Next come Alibaba, Uber, and Wix:

[Image]

GitHub itself, Apache, and Tencent:

[Image]

Baidu, Apple, Mozilla:

[Image]

Oracle, Stanford, MIT, Shopify, MongoDB, Berkeley, VMware, Netflix, Salesforce, GSA.gov:

[Image]

LinkedIn, Broad Institute, Palantir, Yahoo, Mapbox, Unity3D, Automattic, Sandia, Travis CI, Spotify:

[Image]

Chromium, UMich, Zalando, Esri, IBM (UK), SAP, EPAM, Telerik, UK Cabinet Office:

[Image]

CERN, Odoo, Kitware, SUSE, Yandex, IBM (Canada), Adobe, Airbnb, Chef, The Guardian:

[Image]

Arm, MacPorts, Docker, Nuxeo, NVIDIA, Yelp, Elastic, NYU, WSO2, Mesosphere, Inria:

[Image]

Puppet, Stanford (CS), DatadogHQ, EPFL, NTT Data, Lawrence Livermore Lab:

[Image]

My method

How to link GitHub users to organizations

It is not easy to determine which organization a GitHub user belongs to, but we can make a guess from the email domain. That domain appears in the commits carried by each PushEvent.

  • The same email can show up for more than one user, so I only consider GitHub users who pushed code to GitHub projects that gained more than 20 stars during the period.
  • I only count GitHub users with more than 3 pushes during the period.
  • A user pushing code to GitHub can show many different emails in their pushes, which is part of how Git works. To determine the organization each user belongs to, I look at the email that shows up most often in their pushes.
  • Not everyone uses their organization's email address on GitHub. There are a lot of gmail.com, users.noreply.github.com, and other email hosting addresses. Sometimes users do this to stay anonymous or to protect their work inboxes, but if I can't see their email domain, I can't count them.
  • Sometimes employees change organizations, that is, they switch jobs. I assign them to the organization that received more of their pushes.
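The rules above can be sketched in a few lines of Python. This is only an illustration, not the author's code: the function name `guess_domain`, the `min_pushes` parameter, and the shortened `EMAIL_HOSTERS` set are assumptions made for the example, and the input is assumed to be the list of commit emails seen across one user's pushes.

```python
from collections import Counter

# Assumed, shortened list of email hosting domains; the real exclusion
# list in the query is longer.
EMAIL_HOSTERS = {"gmail.com", "users.noreply.github.com", "hotmail.com", "outlook.com"}


def guess_domain(push_emails, min_pushes=3):
    """Guess a user's organization domain from the emails seen in their pushes.

    Returns None when the user has too few pushes, or when the most
    frequent email has no domain or belongs to an email hoster.
    """
    if len(push_emails) <= min_pushes:
        return None  # only users with more than 3 pushes are counted
    # The most frequent email decides, which also covers job changes:
    # the organization with more pushes wins.
    email, _count = Counter(push_emails).most_common(1)[0]
    domain = email.partition("@")[2].lower()
    if not domain or domain in EMAIL_HOSTERS:
        return None
    return domain
```

A user with three pushes from a company address and two from gmail.com would be assigned to the company domain, while a user who only ever pushes from gmail.com is left out.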

My query

#standardSQL
WITH
period AS (
  SELECT *
  FROM `githubarchive.month.2017*` a
),
repo_stars AS (
  # repositories that gained more than 20 stars during the period
  SELECT repo.id, COUNT(DISTINCT actor.login) stars, APPROX_TOP_COUNT(repo.name, 1)[OFFSET(0)].value repo_name
  FROM period
  WHERE type='WatchEvent'
  GROUP BY 1
  HAVING stars>20
),
pushers_guess_emails_and_top_projects AS (
  SELECT *
    # , REGEXP_EXTRACT(email, r'@(.*)') domain
    , REGEXP_REPLACE(REGEXP_EXTRACT(email, r'@(.*)'), r'.*.ibm.com', 'ibm.com') domain
  FROM (
    SELECT actor.id
      , APPROX_TOP_COUNT(actor.login, 1)[OFFSET(0)].value login
      , APPROX_TOP_COUNT(JSON_EXTRACT_SCALAR(payload, '$.commits[0].author.email'), 1)[OFFSET(0)].value email
      , COUNT(*) c
      , ARRAY_AGG(DISTINCT TO_JSON_STRING(STRUCT(b.repo_name, stars))) repos
    FROM period a
    JOIN repo_stars b
    ON a.repo.id=b.id
    WHERE type='PushEvent'
    GROUP BY 1
    HAVING c>3  # only users with more than 3 pushes
  )
)
SELECT * FROM (
  SELECT domain
    , githubers
    , (SELECT COUNT(DISTINCT repo) FROM UNNEST(repos) repo) repos_contributed_to
    , ARRAY(
        SELECT AS STRUCT JSON_EXTRACT_SCALAR(repo, '$.repo_name') repo_name
          , CAST(JSON_EXTRACT_SCALAR(repo, '$.stars') AS INT64) stars
          , COUNT(*) githubers_from_domain FROM UNNEST(repos) repo
        GROUP BY 1, 2
        HAVING githubers_from_domain>1
      ) top_repos
  FROM (
    SELECT domain, COUNT(*) githubers, ARRAY_CONCAT_AGG(ARRAY(SELECT * FROM UNNEST(repos) repo)) repos
    FROM pushers_guess_emails_and_top_projects
    #WHERE domain IN UNNEST(SPLIT('google.com|microsoft.com|amazon.com', '|'))
    WHERE domain NOT IN UNNEST(SPLIT('gmail.com|users.noreply.github.com|qq.com|hotmail.com|163.com|me.com|googlemail.com|outlook.com|yahoo.com|web.de|iki.fi|foxmail.com|yandex.ru', '|')) # email hosters
    GROUP BY 1
    HAVING githubers > 30
  )
  WHERE (SELECT MAX(githubers_from_domain) FROM (SELECT repo, COUNT(*) githubers_from_domain FROM UNNEST(repos) repo GROUP BY repo))>4 # second filter for email hosters
)
ORDER BY githubers DESC

Frequently asked questions

If an organization has 1,500 repositories, why do you count only 200? If a repository has 7,000 stars, why do you show only 1,500?

I filter for relevance, and I only count stars given during 2017. For example, Apache has more than 1,500 repositories on GitHub, but only 205 of them gained more than 20 stars in 2017.
[Image]

[Image]
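The star filter described in this answer can be illustrated with a short Python sketch. It assumes the star events for the period are available as `(repo_id, login)` pairs; the function name `stars_in_period` and that input shape are made up for the example and are not part of the original query.

```python
from collections import defaultdict


def stars_in_period(watch_events, min_stars=20):
    """Count distinct stargazers per repository for the period.

    watch_events: iterable of (repo_id, login) pairs, one per star event.
    Keeps only repositories with more than `min_stars` distinct stargazers.
    """
    stargazers = defaultdict(set)
    for repo_id, login in watch_events:
        stargazers[repo_id].add(login)  # a set ignores repeat stars by one user
    return {
        repo_id: len(users)
        for repo_id, users in stargazers.items()
        if len(users) > min_stars
    }
```

Because only events from the period are passed in, a repository with thousands of older stars can still fall below the threshold, which is why the counts look smaller than the totals shown on GitHub.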

Is this the state of open source?

Please note that an analysis of GitHub does not cover top communities such as Android, Chromium, GNU, and Mozilla, nor the Apache or Eclipse foundations. Other projects also choose to run most of their activity outside GitHub.

This is unfair to my organization

I can only count what I can see. You are welcome to question my assumptions and tell me how to measure things in a better way. A working query is the best way to do it.

For example, see how IBM's ranking changes when its regional domains are merged into the top-level domain with a single SQL transformation:

SELECT *, REGEXP_REPLACE(REGEXP_EXTRACT(email, r'@(.*)'), r'.*.ibm.com', 'ibm.com') domain

[Image]

[Image]
(IBM's relative ranking changes significantly once its regional email domains are merged.)
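The same domain merge can be tried outside BigQuery. The Python sketch below mirrors the extract-then-replace idea; the function name `normalize_domain` is an assumption for the example. One deliberate difference: the SQL pattern leaves its dots unescaped, while this sketch escapes them so that only true subdomains of ibm.com are merged.

```python
import re


def normalize_domain(email):
    """Extract the domain from an email and merge IBM regional domains.

    Returns None when the address contains no '@'.
    """
    match = re.search(r"@(.*)", email)
    if match is None:
        return None
    # Escaped dots are an assumption about intent: 'uk.ibm.com' is merged,
    # but an unrelated domain such as 'notibm.com' is left alone.
    return re.sub(r".*\.ibm\.com", "ibm.com", match.group(1))
```

With this transformation, addresses under uk.ibm.com, ca.ibm.com, and the plain ibm.com domain all count toward one organization.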

Next steps

I have been wrong before, and I may be wrong again. Look at all the raw data available from GitHub and question all my assumptions. It will be cool to see what you find.

Try the interactive report.
