Data Analysis and Visualization of Award Results from the Computer Design Competition in Recent Years

Preface

This article uses the award data of the "Chinese college student computer design competition" from the three most recent years, 2018 to 2020 (the 2021 results have not been announced yet), to dig into some of the less obvious patterns of the competition. It mainly covers the following points:

  • Distribution of award levels in each year
  • Top 10 schools with the most awards in each year
  • Which schools have made the top 10 more than once
  • How many of the three years each school participated in
  • School levels of the participating schools
  • Relationship between team size and awards
  • Hot words in the names of award-winning works

Where a section involves a large amount of code, the code is omitted for readability. If you need the code and the related files, send me a private message.


Data reading and description

The data set is the official data published by the competition. The 2018 and 2019 data are .xlsx files and the 2020 data is a .pdf file. First read the 2018 and 2019 data and inspect the data set information.

import pandas as pd

df_2018 = pd.read_excel('2018 final official result.xlsx', sheet_name='123 first prize')
df_2019 = pd.read_excel('publicity of the list of winning works in 2019 20190907.xlsx')

df_2019.info()

[Figure: output of df_2019.info()]

The information above shows that the 2018 and 2019 data sets differ considerably in format, so they have to be unified before they can be merged.

Since the 2020 data is a .pdf file, we define a separate function to read it. Some details of the reading process are marked as comments in the code.

import pdfplumber

def read_pdf_2020(read_path):
    pdf_2020 = pdfplumber.open(read_path)
    result_df = pd.DataFrame()
    for page in pdf_2020.pages:
        table = page.extract_table()
        df_detail = pd.DataFrame(table[1:], columns=table[0])
        # Merge the data sets page by page (each new page is placed before the pages read so far)
        result_df = pd.concat([df_detail, result_df], ignore_index=True)
    # Delete columns whose values are all NaN
    result_df.dropna(axis=1, how='all', inplace=True)
    # Reset the column names
    result_df.columns = ['Awards', 'work no.', 'work name', 'participating schools', 'author', 'instructor']
    return result_df

df_2020 = read_pdf_2020('winners list of entries in 2020 China University student computer design competition.pdf')

[Figure: the 2020 data set after reading]

Compared with the previous two years, the 2020 data has no missing values in any column, but it does not include a work category column, so the category column has to be dropped from the 2018 and 2019 data. We can therefore use the 2020 format as a template, convert the two earlier data sets to it, and then merge all three.


Data preprocessing

Data set formatting for each year

Following the 2020 format, merge some columns in the 2018 and 2019 data sets, rename the columns, delete the redundant ones, and add a "year" column.

The processed data for 2018 and 2019 is shown below.

[Figure: processed 2018 data]
[Figure: processed 2019 data]
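
The formatting code itself is not shown in this article. As a rough sketch of the idea only, the example below converts a toy frame to the 2020 layout. The raw column names ('award level', 'number', 'title', 'category', 'school', 'author 1', 'author 2', 'teacher') and the helper to_2020_format are assumptions made for illustration, since the real 2018 and 2019 headers are not listed here.

```python
import pandas as pd

# Toy stand-in for a raw 2018/2019 sheet; these raw column names are assumptions.
raw = pd.DataFrame({
    'award level': ['First prize', 'Second prize'],
    'number': ['A001', 'A002'],
    'title': ['Work one', 'Work two'],
    'category': ['Software', 'Media'],
    'school': ['School X', 'School Y'],
    'author 1': ['Zhang', 'Li'],
    'author 2': ['Wang', None],
    'teacher': ['Zhao', 'Qian、Sun'],
})

def to_2020_format(df, year):
    """Hypothetical helper: reshape a raw frame into the 2020 column layout."""
    out = df.copy()
    # Merge the per-author columns into one '、'-separated column, skipping blanks.
    author_cols = [c for c in out.columns if c.startswith('author ')]
    out['author'] = out[author_cols].apply(
        lambda row: '、'.join(str(v) for v in row if pd.notna(v)), axis=1
    )
    # Rename to the 2020 column names.
    out = out.rename(columns={
        'award level': 'Awards',
        'number': 'work no.',
        'title': 'work name',
        'school': 'participating schools',
        'teacher': 'instructor',
    })
    # Keep only the 2020 columns (this drops 'category') and add the year.
    keep = ['Awards', 'work no.', 'work name',
            'participating schools', 'author', 'instructor']
    return out[keep].assign(year=year)

clean_2018 = to_2020_format(raw, 2018)
print(clean_2018)
```

Selecting an explicit column list at the end is what drops the category column and any other leftovers in one step.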

When processing the 2020 data set, note that the data is read page by page. If the last row of a page contains text that is too long and wraps, the overflow becomes the first row of the next page, and most fields in that row are empty, as shown below.
[Figure: a row split across a page break in the 2020 data]

In this case, you need to pick out the rows whose award and work number fields are empty and append their text to the row they belong to.

# 2020 data set processing
clean_df_2020 = df_2020.copy()

# Some fields are too long and were split across a page break, so they appear on two pages.
# Append the overflow text to the row it belongs to.
# Use .loc[row, column] so the assignment writes to the DataFrame itself; chained indexing
# such as .iloc[609]['...'] += ... may only change a temporary copy. The index is the
# default 0..n-1 range here, so labels equal positions.
clean_df_2020.loc[609, 'participating schools'] += 'medical University'
clean_df_2020.loc[1330, 'work name'] += 'danxia'
clean_df_2020.loc[2121, 'work name'] += 'present'
clean_df_2020.loc[2997, 'work name'] += 'cloud platform'

# Drop the leftover overflow rows, which have an empty award field
del_index = clean_df_2020.loc[clean_df_2020['Awards'] == ''].index
clean_df_2020.drop(del_index, inplace=True)
clean_df_2020.reset_index(drop=True, inplace=True)
clean_df_2020['year'] = 2020
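
The row positions above are hard-coded. As an alternative, here is a minimal sketch that finds the overflow rows by their empty 'Awards' field and appends their text to the nearest complete row above. It assumes the pages were concatenated in reading order, so the row a fragment belongs to sits directly above it. The reading function in this article places each new page in front of the earlier ones, so this is an assumption, not the setup used here. The helper name merge_continuation_rows and the toy data are also made up for illustration.

```python
import pandas as pd

def merge_continuation_rows(df, key='Awards'):
    """Append rows whose `key` cell is empty to the last complete row above them."""
    df = df.fillna('').reset_index(drop=True)
    last_full = None
    for pos in df.index:
        if df.loc[pos, key] != '':
            last_full = pos          # a complete row: remember its position
        elif last_full is not None:
            for col in df.columns:   # a fragment: glue every cell onto the row above
                df.loc[last_full, col] += df.loc[pos, col]
    # Fragments still have an empty key cell, so they can be filtered out now.
    return df[df[key] != ''].reset_index(drop=True)

toy = pd.DataFrame({
    'Awards': ['First prize', '', 'Second prize'],
    'work no.': ['1001', '', '1002'],
    'work name': ['Smart campus cloud ', 'platform', 'Robot arm'],
    'participating schools': ['School X', '', 'School Y'],
})
fixed = merge_continuation_rows(toy)
print(fixed)
```

This avoids looking up row numbers by hand, at the cost of relying on the row order being correct.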

Data merging

Now merge the data of the three years. The merged data set is shown below.
[Figure: the merged data set]
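
The merge code is not shown in this article. Once the three data sets share the same columns, one plausible way to stack them is pd.concat. The frame names clean_df_2018 and clean_df_2019 and the one-row placeholder data below are assumptions for illustration.

```python
import pandas as pd

columns = ['Awards', 'work no.', 'work name', 'participating schools',
           'author', 'instructor', 'year']

# One-row placeholder frames standing in for the three cleaned data sets.
clean_df_2018 = pd.DataFrame(
    [['First prize', '1', 'Work A', 'School X', 'Zhang', 'Zhao', 2018]], columns=columns)
clean_df_2019 = pd.DataFrame(
    [['Second prize', '2', 'Work B', 'School Y', 'Li', 'Qian', 2019]], columns=columns)
clean_df_2020 = pd.DataFrame(
    [['Third prize', '3', 'Work C', 'School Z', 'Wang', 'Sun', 2020]], columns=columns)

# Stack the rows and renumber the index from 0.
all_df = pd.concat([clean_df_2018, clean_df_2019, clean_df_2020], ignore_index=True)
print(all_df)
```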

Data cleaning

Now we need to process the merged data set further so that it is easier to analyze and visualize. Later we will need some basic information about colleges and universities nationwide, such as the school level (985, 211, and so on), so we import college_info.csv. I crawled this data on June 15, and some higher vocational colleges may not be included. For those schools the level is set to "no data temporarily". We also remove the line breaks ("\n") from the names of participating schools and works, then add a "number of participants" column recording the number of authors of each work and a "number of instructors" column recording the number of instructors of each work.

college_info = pd.read_csv('college_info.csv')
college_name = college_info['School name'].values.tolist()

# Remove line breaks from school and work names first, so that school names can match college_info
all_df['participating schools'] = all_df['participating schools'].str.replace(r'\n|\r', '', regex=True)
all_df['work name'] = all_df['work name'].str.replace(r'\n|\r', '', regex=True)

# Look up the level of each participating school
college_level = []
for college in all_df['participating schools']:
    if college not in college_name:
        college_level.append('no data temporarily')
    else:
        college_level.append(college_info['School level'][college_name.index(college)])
all_df['School level'] = college_level

# Delete rows with an empty author
all_df.dropna(subset=['author'], axis=0, inplace=True)
# Add a "number of participants" column to record the number of authors of each work
all_df['number of participants'] = all_df['author'].apply(lambda x: len(x.split('、')))
count_list = []
for index, row in all_df.iterrows():
    try:
        count_list.append(len(row['instructor'].split('、')))
    except AttributeError:
        # A missing instructor is NaN (a float), which has no split()
        count_list.append(0)
all_df['number of instructors'] = count_list

all_df.to_csv('all_df.csv', index=False)
all_df

The processed data set is shown below.
[Figure: the processed data set]


Data analysis and visualization

Distribution of awards by year

Compute the share of first, second, and third prizes in each of the three years and draw a stacked bar chart.

'''
Statistics code omitted
'''
# list1, list2 and list3 come from the omitted statistics step. The label formatter below
# reads x.data.percent, so each item is expected to carry a "percent" field next to its value.

from pyecharts import options as opts
from pyecharts.charts import Bar
from pyecharts.commons.utils import JsCode
from pyecharts.globals import ThemeType

c = Bar(init_opts=opts.InitOpts(theme=ThemeType.CHALK))
c.add_xaxis([2018, 2019, 2020])
c.add_yaxis("Third prize", list1, stack="stack1", category_gap="70%")
c.add_yaxis("Second prize", list2, stack="stack1", category_gap="70%")
c.add_yaxis("First prize", list3, stack="stack1", category_gap="70%")
c.set_series_opts(
    label_opts=opts.LabelOpts(
        position="right",
        formatter=JsCode(
            "function(x){return Number(x.data.percent * 100).toFixed() + '%';}"
        ),
    )
)
c.render("../images/stacked bar chart for the distribution of awards by year.html")
c.render_notebook()

[Figure: stacked bar chart of the distribution of awards by year]

The figure shows that over time the share of first and second prizes has decreased while the share of third prizes has increased. In 2020 the share of third prizes reached 68%. This suggests that the organizers want to raise the value of the first prize.
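
The statistics step is omitted in this section. A minimal sketch of how the three lists passed to the chart could be built is given below: for each award level, one entry per year holding the count and its share of that year's awards, which is the shape the label formatter reads through x.data.percent. The toy data, the award labels ('First prize', 'Second prize', 'Third prize') and the helper series_for are assumptions, since the real label values are not shown here.

```python
import pandas as pd

# Toy data; the award labels are assumptions standing in for the real values.
all_df = pd.DataFrame({
    'year': [2018, 2018, 2018, 2018, 2019, 2019, 2019, 2019, 2019],
    'Awards': ['First prize', 'Second prize', 'Third prize', 'Third prize',
               'First prize', 'Second prize', 'Third prize', 'Third prize',
               'Third prize'],
})

counts = pd.crosstab(all_df['year'], all_df['Awards'])  # rows: year, columns: award level
shares = counts.div(counts.sum(axis=1), axis=0)         # each row now sums to 1

def series_for(award):
    """Hypothetical helper: one {"value", "percent"} dict per year for an award level."""
    return [
        {"value": int(counts.loc[year, award]),
         "percent": float(shares.loc[year, award])}
        for year in counts.index
    ]

list1 = series_for('Third prize')
list2 = series_for('Second prize')
list3 = series_for('First prize')
print(list1)
```

Keeping the raw count in "value" lets the bars show absolute numbers while the labels show percentages.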


Top 10 schools with the most awards in each year

For each year, count the awards of the 10 schools that won the most awards and plot them.
[Figure: top 10 schools by number of awards in each year]

The figure shows that many universities appear in the top 10 more than once. One likely reason is that these schools attach more importance to the competition.
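
The counting code for this chart is not shown. One simple way to get the top schools of a year is value_counts on the school column, sketched below on toy data. The helper name top_schools and the school names are made up for illustration.

```python
import pandas as pd

def top_schools(df, year, n=10):
    """Return the n schools with the most awards in the given year."""
    schools = df.loc[df['year'] == year, 'participating schools']
    return schools.value_counts().head(n)  # value_counts sorts in descending order

# Toy data with invented school names.
all_df = pd.DataFrame({
    'year': [2018] * 6 + [2019] * 3,
    'participating schools': ['School X', 'School X', 'School X',
                              'School Y', 'School Y', 'School Z',
                              'School Z', 'School Z', 'School X'],
})

top_2018 = top_schools(all_df, 2018, n=2)
print(top_2018)
```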

Let's use a Venn diagram to see in detail which schools have made the top 10 more than once.
[Figure: Venn diagram of the schools in the top 10 of each year]

Shenyang Normal University, Shenyang Institute of Technology and Liaoning University of Technology made the top 10 in all three years, and some other schools made it twice. Universities in the northeast are clearly well represented among them.
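
The overlaps that a Venn diagram shows can also be computed directly with sets. The sketch below uses invented, shortened top lists (three names per year instead of ten) purely for illustration.

```python
from collections import Counter

# Hypothetical top lists per year, shortened to three names each.
top_by_year = {
    2018: {'School A', 'School B', 'School C'},
    2019: {'School A', 'School B', 'School D'},
    2020: {'School A', 'School E', 'School D'},
}

# How many of the yearly lists each school appears in.
appearances = Counter(s for schools in top_by_year.values() for s in schools)

in_all_three = set.intersection(*top_by_year.values())
at_least_twice = {s for s, n in appearances.items() if n >= 2}

print(in_all_three)
print(sorted(at_least_twice))
```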


Number of years each school participated

Now count how many of the three years each school took part in, and then count how many schools fall into each group.

from collections import Counter

all_school = []
for year in [2018, 2019, 2020]:
    school_set = set(all_df.loc[all_df['year'] == year, 'participating schools'].values.tolist())
    all_school += list(school_set)
value_count = Counter(all_school)
count_list = ['participated ' + str(n) + ' time(s)' for n in value_count.values()]
counter = Counter(count_list)

from pyecharts.charts import Pie

c = Pie(init_opts=opts.InitOpts(theme=ThemeType.CHALK))
c.add("", [list(z) for z in zip(counter.keys(), counter.values())])
c.set_global_opts(title_opts=opts.TitleOpts(title="Pie basic example"))
c.set_series_opts(label_opts=opts.LabelOpts(formatter="{b}: {c}"))
c.render("../images/statistical pie chart of attendance times of each school.html")
c.render_notebook()

[Figure: pie chart of the number of years each school participated]

Of the schools that took part in the past three years, about half participated in all three years, about 25% participated twice, and about 25% participated once. Schools that have taken part once are therefore largely willing to come back, which shows that the competition is attractive to schools.


School levels of participating schools in each year

Count the levels of the participating schools in each year and look at how they are distributed.
[Figure: school levels of participating schools in each year]

In the past three years, the vast majority of participants came from ordinary undergraduate schools, followed by 211 schools, and the number of participants from schools of every level has been increasing year by year, most noticeably for ordinary undergraduate schools. It seems that as the competition is publicized and computers become more widespread, more and more people pay attention to computer competitions. (The "no data" column arises partly because some schools are missing from the school information, and possibly also because contestants made mistakes when filling in their school.)
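
The counting behind this chart is not shown. A small sketch of one way to tabulate school levels by year with pd.crosstab follows. The toy rows and level labels are invented, and the sketch counts award-winning works per level, which is an assumption about what the chart measures.

```python
import pandas as pd

# Toy rows; one row per award-winning work.
all_df = pd.DataFrame({
    'year': [2018, 2018, 2019, 2019, 2019],
    'School level': ['985', 'Ordinary undergraduate', '211',
                     'Ordinary undergraduate', 'Ordinary undergraduate'],
})

# Rows are years, columns are school levels, cells are counts (0 where absent).
level_by_year = pd.crosstab(all_df['year'], all_df['School level'])
print(level_by_year)
```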


Number of participants and distribution of awards

Count the awards by the number of authors and the number of instructors, and draw the following figure.
[Figure: number of awards by number of authors and instructors]

Teams of three authors and two instructors account for the largest share of the awards, followed by teams of three authors and one instructor. Other team compositions won fewer awards. Having more people does not seem to increase the chance of winning.
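
The grouping step for this chart is not shown. As a sketch, the two count columns created during data cleaning can be grouped together and the group sizes sorted. The toy numbers below are invented.

```python
import pandas as pd

# Toy rows; one row per award-winning work.
all_df = pd.DataFrame({
    'number of participants': [3, 3, 3, 3, 3, 5],
    'number of instructors':  [2, 2, 2, 1, 1, 2],
})

# Number of awards for each (authors, instructors) combination, largest first.
team_counts = (
    all_df.groupby(['number of participants', 'number of instructors'])
          .size()
          .sort_values(ascending=False)
)
print(team_counts)
```

Note that these are raw counts. Judging whether team size affects the chance of winning would also need the number of teams of each size that entered, which the award lists do not contain.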


Hot words in the names of award-winning works

First, define a function that loads stop words from a local file.

def load_stopwords(read_path):
    '''
    Read each line of the file and save it to a list
    :param read_path: the path of the file to be read
    :return: a list holding each line of the file
    '''
    result = []
    with open(read_path, "r", encoding='utf-8') as f:
        for line in f.readlines():
            line = line.strip('\n')  # Remove the newline character at the end of each line
            result.append(line)
    return result

# Load Chinese stop words
stopwords = load_stopwords('wordcloud_stopwords.txt')

Segment the names of all works into words, remove the stop words, and save the remaining words in a list.

import jieba

# Add a custom dictionary
jieba.load_userdict('Custom dictionary.txt')

token_list = []
# Segment each title and save the word segmentation results in the list
for name in all_df['work name']:
    tokens = jieba.lcut(name, cut_all=False)
    token_list += [token for token in tokens if token not in stopwords]
len(token_list)

Count the frequency of each word in the list, take the 100 most frequent words as hot words, and draw a word cloud.

from pyecharts.charts import WordCloud
from collections import Counter

token_count_list = Counter(token_list).most_common(100)
new_token_list = []
for token, count in token_count_list:
    new_token_list.append((token, str(count)))

c = WordCloud()
c.add(series_name="Hot words", data_pair=new_token_list, word_size_range=[20, 200])
c.set_global_opts(
    title_opts=opts.TitleOpts(
        title="Hot words of award-winning works", title_textstyle_opts=opts.TextStyleOpts(font_size=23)
    ),
    tooltip_opts=opts.TooltipOpts(is_show=True),
)
c.render("../images/hot words of award-winning works.html")
c.render_notebook()

[Figure: word cloud of hot words in the names of award-winning works]

The figure gives a clear picture of the current hot topics in computing, such as big data, artificial intelligence, algorithms, visualization, management systems and robots. These have long been popular directions in the computer industry and can also serve as directions for our own future development.


Summary

  • In recent years, the share of first and second prizes in the computer design competition has decreased and the share of third prizes has increased, which makes first and second prizes both harder to win and more valuable.
  • Universities in the northeast pay more attention to the event. By number of awards, Shenyang Normal University, Shenyang Institute of Technology and Liaoning University of Technology have repeatedly made the top 10.
  • Most schools participate continuously. Schools that took part in all three of the last three years account for about half of all award-winning schools.
  • Over the three years, most contestants came from ordinary undergraduate schools, followed by 211 schools, and the number of participants from schools of every level has been increasing year by year.
  • The most common award-winning team composition is 3 authors and 2 instructors, followed by 3 authors and 1 instructor. (A team has at most 5 authors and 2 instructors, 7 people in total.)
  • Hot words in work names include big data, artificial intelligence, algorithm, visualization, management system and robot.

That is all for this article. If you found it useful, ❤ please give it a like before you go! ❤

I will keep sharing articles on data analysis. If you are interested, follow me so that you don't miss them.
