A detailed introductory Python pandas tutorial -- Chapter 6: pandas data discretization and merging [advanced]


1 data discretization

1.1 why discretization

Discretizing a continuous attribute simplifies the data: it reduces the number of distinct values the attribute can take. Discretization is commonly used as a preprocessing tool in data mining.

1.2 what is data discretization

Discretizing a continuous attribute means dividing its value range into several intervals and then representing every value that falls into an interval by a single symbol or integer.

There are many discretization methods; the simplest one is shown here.

  • Original height data: 165, 174, 160, 180, 159, 163, 192, 184
  • Assume the heights are split into these intervals: 150-165, 165-180, 180-195

This divides the data into three intervals, which can be labeled low, medium and high and finally turned into a "dummy variable" matrix.
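The height example can be sketched in a few lines. This is only an illustration: the labels and the `height` prefix are assumptions chosen here, and `pd.cut` is used with its default right-closed intervals, so 165 falls into 150-165 and 180 into 165-180.

```python
import pandas as pd

# Height data and interval edges from the example above
heights = pd.Series([165, 174, 160, 180, 159, 163, 192, 184])
bins = [150, 165, 180, 195]
labels = ["low", "medium", "high"]  # assumed label names

# Assign each height to an interval; intervals are (150, 165], (165, 180], (180, 195]
height_group = pd.cut(heights, bins=bins, labels=labels)

# Turn the three categories into a dummy-variable matrix, one column per category
height_dummies = pd.get_dummies(height_group, prefix="height")

print(height_group.tolist())
print(height_dummies)
```

Every row of `height_dummies` has exactly one active column, the one matching that person's height group.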

1.3 discretizing the daily price change of stocks

The following example discretizes the daily price change of a stock.

1.3.1 reading stock data

First read the stock data and select the price change column.

import pandas as pd

data = pd.read_csv("./data/stock_day.csv")
p_change = data['p_change']

1.3.2 grouping the price change data

API used:

  • pd.qcut(data, q):
    Splits the data into q quantile-based groups of roughly equal size. It is usually used together with value_counts to count the members of each group
  • series.value_counts(): counts how many values fall into each group
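A small sketch of quantile-based grouping follows. The values stand in for the `p_change` column and are made up for illustration, since the contents of `stock_day.csv` are not shown here.

```python
import pandas as pd

# Made-up daily price changes standing in for data['p_change']
p_change = pd.Series([-9.8, -4.2, -2.5, -1.1, -0.3, 0.4,
                      1.2, 2.6, 3.9, 5.1, 7.4, 9.9])

# Split into 4 quantile-based groups; pandas picks the edges from the data
groups = pd.qcut(p_change, q=4)

# Count how many values landed in each group
counts = groups.value_counts()
print(counts)
```

Because the edges come from the quantiles of the data, the groups hold about the same number of values, here three each.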

User-defined interval grouping:

  • pd.cut(data, bins)
# Specify your own bin edges
bins = [-20, -7, -5, -3, 0, 3, 5, 7, 20]
p_counts = pd.cut(p_change, bins)
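To see what `pd.cut` does with these bin edges, here is a minimal sketch on made-up values (the real `p_change` column is assumed to look similar). With the default settings each interval is open on the left and closed on the right, so 0.0 belongs to (-3, 0] and 3.0 to (0, 3].

```python
import pandas as pd

# Made-up daily price changes standing in for data['p_change']
p_change = pd.Series([-9.8, -6.0, -4.2, -2.5, -0.3, 0.0,
                      1.2, 2.6, 3.0, 5.1, 7.4, 9.9])

bins = [-20, -7, -5, -3, 0, 3, 5, 7, 20]
p_counts = pd.cut(p_change, bins)

# sort=False keeps the intervals in bin order instead of sorting by frequency
counts = p_counts.value_counts(sort=False)
print(counts)
```

Unlike `pd.qcut`, the group sizes are uneven here, and an interval that receives no values, such as (3, 5] in this sample, still shows up with a count of 0.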

1.3.3 one-hot encoding the grouped price change data

  • What is one-hot encoding
    One Boolean column is generated per category; for each sample, exactly one of these columns is 1.

  • pandas.get_dummies(data, prefix=None)

    • data: array-like, Series, or DataFrame
    • prefix: string prepended to the generated column names
bins = [-20, -7, -5, -3, 0, 3, 5, 7, 20]
p_counts = pd.cut(p_change, bins)
# Build the one-hot matrix
dummies = pd.get_dummies(p_counts, prefix="rise_fall")
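The effect of `prefix` is easiest to see on a tiny input. The sketch below uses made-up values, one per bin, and an assumed prefix of `rise_fall`; depending on the pandas version the matrix holds booleans or 0/1 integers, so it is cast to `int` before printing.

```python
import pandas as pd

# Made-up price changes, one value per bin, in ascending order
p_change = pd.Series([-9.8, -6.0, -4.2, -2.5, 1.2, 4.0, 5.1, 9.9])

bins = [-20, -7, -5, -3, 0, 3, 5, 7, 20]
p_counts = pd.cut(p_change, bins)

# One column per interval; the column name is prefix + "_" + interval
dummies = pd.get_dummies(p_counts, prefix="rise_fall")

print(list(dummies.columns))
print(dummies.astype(int))
```

Since each sample value falls into a different bin, the printed matrix has a single 1 on each row, running down the diagonal.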


2 merging data

If your data is spread across multiple tables, you sometimes need to combine them for analysis.

2.1 merging data with pd.concat

  • pd.concat([data1, data2], axis=1)
    • Concatenates along rows or columns. axis=0 (the default) stacks the objects vertically, appending rows; axis=1 places them side by side, aligning on the row index

For example, we combine the one-hot matrix just produced with the original data:

# Concatenate side by side, aligning on the row index
pd.concat([data, dummies], axis=1)
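The difference between the two axes shows up in the shape of the result. The sketch below uses two small made-up frames in place of `data` and `dummies`; the column names are assumptions for illustration only.

```python
import pandas as pd

# Small stand-ins for the stock data and its one-hot matrix
data = pd.DataFrame({"open": [10.0, 10.5, 11.0],
                     "p_change": [-1.2, 4.8, 0.6]})
dummies = pd.DataFrame({"down": [1, 0, 0],
                        "up": [0, 1, 1]})

# axis=1: columns are placed side by side, rows are matched by index
side_by_side = pd.concat([data, dummies], axis=1)

# axis=0: rows are appended, columns are matched by name
stacked = pd.concat([data, data], axis=0)

print(side_by_side.shape)  # (3, 4)
print(stacked.shape)       # (6, 2)
```

Note that `stacked` keeps the original index labels, so 0, 1, 2 appear twice; passing `ignore_index=True` renumbers the rows.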


2.2 pd.merge

  • pd.merge(left, right, how='inner', on=None)
    • Joins two DataFrames on the values of their common key column(s), much like a database join
    • left: DataFrame
    • right: another DataFrame
    • on: the key column(s) to join on; they must exist in both DataFrames
    • how: the join type, as in a database: inner join ('inner'), outer join ('outer'), left join ('left') or right join ('right')

2.2.1 pd.merge examples

left = pd.DataFrame({'key1': ['K0', 'K0', 'K1', 'K2'],
                     'key2': ['K0', 'K1', 'K0', 'K1'],
                     'A': ['A0', 'A1', 'A2', 'A3'],
                     'B': ['B0', 'B1', 'B2', 'B3']})

right = pd.DataFrame({'key1': ['K0', 'K1', 'K1', 'K2'],
                      'key2': ['K0', 'K0', 'K0', 'K0'],
                      'C': ['C0', 'C1', 'C2', 'C3'],
                      'D': ['D0', 'D1', 'D2', 'D3']})

# Inner join (the default)
result = pd.merge(left, right, on=['key1', 'key2'])


  • Left join
result = pd.merge(left, right, how='left', on=['key1', 'key2'])


  • Right join
result = pd.merge(left, right, how='right', on=['key1', 'key2'])


  • Outer join
result = pd.merge(left, right, how='outer', on=['key1', 'key2'])
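As the result screenshots are not available here, the following sketch rebuilds the two example frames and compares the four join types by row count. The `indicator=True` argument is an addition not used in the examples above; it adds a `_merge` column recording where each row came from.

```python
import pandas as pd

left = pd.DataFrame({'key1': ['K0', 'K0', 'K1', 'K2'],
                     'key2': ['K0', 'K1', 'K0', 'K1'],
                     'A': ['A0', 'A1', 'A2', 'A3'],
                     'B': ['B0', 'B1', 'B2', 'B3']})

right = pd.DataFrame({'key1': ['K0', 'K1', 'K1', 'K2'],
                      'key2': ['K0', 'K0', 'K0', 'K0'],
                      'C': ['C0', 'C1', 'C2', 'C3'],
                      'D': ['D0', 'D1', 'D2', 'D3']})

keys = ['key1', 'key2']

# Number of result rows for each join type
row_counts = {how: len(pd.merge(left, right, how=how, on=keys))
              for how in ['inner', 'left', 'right', 'outer']}
print(row_counts)  # {'inner': 3, 'left': 5, 'right': 4, 'outer': 6}

# indicator=True marks each row as 'both', 'left_only' or 'right_only'
outer = pd.merge(left, right, how='outer', on=keys, indicator=True)
origin = outer['_merge'].value_counts().to_dict()
print(origin)
```

The key pair (K1, K0) occurs once in `left` and twice in `right`, so it contributes two rows to every join. Rows without a partner on the other side get NaN in the missing columns.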

Writing takes effort, so please don't just read and leave. Your support and recognition are my biggest motivation to keep writing. See you in the next article!

Dragon youth

If you find any mistakes in this post, please leave a comment and let me know. Thank you very much!