Pworkspace

week12 - pandas (covered a bit early, in week 11)

haerangssa 2024. 5. 17. 11:07

1. PANDAS: panel data system

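- Data alignment and integrated handling of missing data

- Reshaping, pivoting, slicing, indexing, and subsetting of data sets

- Inserting and deleting columns in data structures

- split-apply-combine operations on data sets, plus merging and joining

- A range of time series functionality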
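
A minimal sketch of a few of the features listed above (label alignment, missing data, split-apply-combine, merging). The values, the `region` key and the `name` column are hypothetical and only serve as illustration:

```python
import pandas as pd

# Two Series with partially overlapping labels (hypothetical values)
a = pd.Series([1.0, 2.0, 3.0], index=[2006, 2007, 2008])
b = pd.Series([10.0, 20.0, 30.0], index=[2007, 2008, 2009])

# Data alignment: values are matched by label, not by position;
# labels present in only one Series produce NaN (missing data)
total = a + b
print(total)

# Missing-data handling: drop or fill the NaN entries
print(total.dropna())
print(total.fillna(0.0))

# split-apply-combine: group rows by a key, then aggregate each group
df = pd.DataFrame({'region': ['A', 'A', 'B', 'B'],
                   'value': [1, 2, 3, 4]})
group_sum = df.groupby('region')['value'].sum()
print(group_sum)

# Merging: join two tables on a shared key column
names = pd.DataFrame({'region': ['A', 'B'],
                      'name': ['North', 'South']})
merged = pd.merge(df, names, on='region')
print(merged)
```

Because the addition is aligned on labels, 2006 and 2009 appear in the result even though each exists in only one of the inputs.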

 

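2. Difference from NumPy

pandas gives the data a built-in structure (labeled rows and columns)

NumPy does not (plain arrays addressed by position)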
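
A small sketch of that difference, assuming the point of the note is labels versus plain positions. The three values are the first GDP figures used in the practice section below:

```python
import numpy as np
import pandas as pd

values = [24288, 26084, 26689]

# NumPy: a plain array, elements are addressed by integer position
arr = np.array(values)
print(arr[0])

# pandas: the same values, each with an attached label (the index)
ser = pd.Series(values, index=[2006, 2007, 2008], name='GDP')
print(ser.loc[2006])   # lookup by label instead of position
print(ser.index)       # the labels travel with the data
```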

 

3. pandas API

Series, DataFrame, Index, Scalars, etc.
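
A short sketch of how these objects relate to each other. `pd.Timestamp` is used here as one example of a pandas scalar type; that choice is an assumption about what the lecture meant by "Scalars":

```python
import pandas as pd

df = pd.DataFrame({'GDP': [24288, 26084], 'Poverty': [14.3, 14.8]},
                  index=[2006, 2007])

col = df['GDP']                   # one column of a DataFrame is a Series
idx = df.index                    # the row labels are an Index object
ts = pd.Timestamp('2006-01-01')   # a scalar: a single point in time

print(type(df).__name__, type(col).__name__, type(ts).__name__)
print(isinstance(idx, pd.Index))
```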

 

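4. Selecting data (where to get it)

International: UN Statistics Division, OECD

Domestic (Korea): Public Data Portal (공공데이터포털), Open MET Data Portal (기상자료개방포털)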
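
Data from portals like these is often downloaded as a CSV file. A minimal sketch of loading one with `pd.read_csv`; the file layout and column names are hypothetical, and an in-memory string stands in for the downloaded file so the example runs on its own:

```python
import io
import pandas as pd

# Stand-in for a downloaded file (hypothetical layout; the numbers are
# the GDP and poverty figures used in the practice section below)
csv_text = """Year,GDP,Poverty
2006,24288,14.3
2007,26084,14.8
2008,26689,15.2
"""

# For a real file, pass its path instead of the StringIO object
df = pd.read_csv(io.StringIO(csv_text), index_col='Year')
print(df)
```

Files from Korean portals are sometimes saved in the CP949 encoding, in which case passing `encoding='cp949'` to `read_csv` may be needed.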

 

5. Practice

# pandas!
# import NumPy and load pandas
import numpy as np
import pandas as pd

# Series of scalar values without an explicit index (defaults to 0, 1, 2, ...)
gdp_s1 = pd.Series([24288, 26084, 26689, 26338, 28210])
print(gdp_s1)
# Series of scalar values with an explicit index (years as labels)
gdp_s2 = pd.Series([24288, 26084, 26689, 26338, 28210],
                   index=[2006, 2007, 2008, 2009, 2010])
poverty_s1 = pd.Series([14.3, 14.8, 15.2, 15.3, 14.9],
                       index=[2006, 2007, 2008, 2009, 2010])
print(gdp_s2)
print(poverty_s1)

# import NumPy and load pandas
import numpy as np
import pandas as pd

# Series from Python dictionary (the keys become the index)
poverty_s2 = pd.Series({2006: 14.3, 2007: 14.8, 2008: 15.2, 2009: 15.3,
                        2010: 14.9})
print(poverty_s2)
# Series from ndarray
s3 = pd.Series(np.random.randn(4), index=['Jan', 'Feb', 'Mar', 'Apr'], name='series name')
# index labels can also be strings
print(s3)
# convert the Series values back to a NumPy array
print(poverty_s2.to_numpy())


# import NumPy and load pandas
import numpy as np
import pandas as pd

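# DataFrame from dict of Series that all use the default integer index
# (the years are stored as an ordinary 'Year' column, not as row labels)
index = pd.Series([2006, 2007, 2008, 2009, 2010])
gdp_s2 = pd.Series([24288, 26084, 26689, 26338, 28210])
poverty_s1 = pd.Series([14.3, 14.8, 15.2, 15.3, 14.9])
data1 = {'Year': index, 'GDP': gdp_s2, 'Poverty': poverty_s1}
data_f1 = pd.DataFrame(data1)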
print(data_f1)


# DataFrame from dict of Series
gdp_s2 = pd.Series([24288, 26084, 26689, 26338, 28210],
                   index=[2006, 2007, 2008, 2009, 2010])
poverty_s1 = pd.Series([14.3, 14.8, 15.2, 15.3, 14.9],
                       index=[2006, 2007, 2008, 2009, 2010])
data2 = {'GDP': gdp_s2, 'Poverty': poverty_s1}
data_f2 = pd.DataFrame(data2)
print(data_f2)
print("----------------")
# import NumPy and load pandas
import numpy as np
import pandas as pd

# DataFrame from dict
data_f3 = pd.DataFrame({'Year': [2006, 2007, 2008, 2009, 2010],
                        'GDP': [24288, 26084, 26689, 26338, 28210],
                        'Poverty': [14.3, 14.8, 15.2, 15.3, 14.9]})
print(data_f3)


# DataFrame from time series data of ndarray
index = pd.date_range('1/1/2000', periods=5)
data_f4 = pd.DataFrame(np.random.randn(5, 4), index=index,
                       columns=['Jan', 'Feb', 'Mar', 'Apr'])
print(data_f4)

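# sort_index / sort_values return new DataFrames; data_f4 itself is unchanged
print(data_f4.sort_index(axis=0, ascending=False))  # rows in descending date order
print(data_f4.sort_values(by='Feb'))                # rows ordered by the 'Feb' column
print(data_f4[0:3])                                 # first three rows
# add a new column; the values are aligned on the year labels
data_f2['Weight'] = pd.Series([60, 65, 70, 63, 58],
                              index=[2006, 2007, 2008, 2009, 2010])
print(data_f2)
# add a column computed from existing columns
data_f2['Rate'] = data_f2['Poverty'] * data_f2['Weight']
print(data_f2)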

# import NumPy and load pandas
import numpy as np
import pandas as pd

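# add the Weight and Rate columns again (same as the previous cell)
data_f2['Weight'] = pd.Series([60, 65, 70, 63, 58],
                              index=[2006, 2007, 2008, 2009, 2010])
print(data_f2)
data_f2['Rate'] = data_f2['Poverty'] * data_f2['Weight']
print(data_f2)

# delete a column
del data_f2['Rate']
print(data_f2)

# rebuild with a new index and columns: labels that data_f2 does not have
# (row 2011, columns 'Year' and 'GNP') are filled with NaN
print(pd.DataFrame(data_f2, index=[2010, 2009, 2011],
                   columns=['Year', 'GDP', 'Poverty', 'GNP']))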

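print(data_f1.T)           # transpose: rows become columns

print(data_f1.describe())  # summary statistics of the numeric columns
print(data_f1.head(2))     # first two rows
print(data_f1.index)       # row labels
print(data_f1.GDP)         # column access by attribute, same as data_f1['GDP']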
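
The `pd.DataFrame(data_f2, index=..., columns=...)` call above behaves like a reindex. A minimal sketch of the same idea with `DataFrame.reindex`, together with label-based (`loc`) and position-based (`iloc`) selection; `data_f2` is rebuilt here so the example runs on its own, and the names `sub` and `by_gdp` are only illustrative:

```python
import pandas as pd

years = [2006, 2007, 2008, 2009, 2010]
data_f2 = pd.DataFrame({'GDP': [24288, 26084, 26689, 26338, 28210],
                        'Poverty': [14.3, 14.8, 15.2, 15.3, 14.9]},
                       index=years)

# reindex keeps the matching labels and fills the rest with NaN
sub = data_f2.reindex(index=[2010, 2009, 2011],
                      columns=['GDP', 'Poverty', 'GNP'])
print(sub)

# sort_values returns a new DataFrame; assign it to keep the result
by_gdp = data_f2.sort_values(by='GDP', ascending=False)
print(by_gdp.head(2))

# loc selects by label, iloc by integer position
print(data_f2.loc[2008, 'GDP'])
print(data_f2.iloc[0:3])
```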