1. PANDAS: panel data system
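- Data alignment and integrated handling of missing data
- Reshaping, pivoting, slicing, indexing, and subsetting of datasets
- Inserting and deleting columns in data structures
- Split-apply-combine operations on datasets, plus merging and joining
- A range of time series functionality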
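A few of the features listed above can be sketched in a handful of lines. This is only a minimal illustration and not part of the lecture material: the Series `a` and `b` and the `sales` table with its `region` and `amount` columns are made-up names for the example.

```python
import pandas as pd

# Data alignment: arithmetic matches values by index label, not by position.
a = pd.Series([1.0, 2.0, 3.0], index=["x", "y", "z"])
b = pd.Series([10.0, 20.0, 30.0], index=["y", "z", "w"])
total = a + b  # labels present on only one side become NaN
print(total)

# Missing data handling: drop or fill the NaN values produced above.
print(total.dropna())
print(total.fillna(0.0))

# Split-apply-combine: split rows by a key, apply an aggregate, combine results.
sales = pd.DataFrame({
    "region": ["east", "west", "east", "west"],  # hypothetical data
    "amount": [100, 200, 300, 400],
})
by_region = sales.groupby("region")["amount"].sum()
print(by_region)
```

Only the labels `y` and `z` exist in both Series, so those are the only positions where the sum is defined; the rest stay missing until they are dropped or filled.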
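2. Difference from NumPy
pandas: data structures are structured, with labeled rows and columns
NumPy: no such structure (plain arrays addressed by position)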
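The contrast can be shown with the same values held in both libraries. This is a rough sketch; the variable names `arr` and `ser` are chosen here for illustration, and the numbers are the GDP figures used in the practice section of these notes.

```python
import numpy as np
import pandas as pd

values = [24288, 26084, 26689]
years = [2006, 2007, 2008]

# NumPy: a plain array, addressed by integer position only.
arr = np.array(values)
print(arr[0])          # first element, by position

# pandas: the same values with an index of labels attached to them.
ser = pd.Series(values, index=years, name="GDP")
print(ser.loc[2006])   # the element labeled 2006
print(ser.to_numpy())  # the underlying values as a NumPy array
```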
3. pandas API
Series, DataFrame, Index, Scalars, etc.
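As a quick orientation, the objects named above can be inspected directly. This is a small sketch under the assumption that "Scalars" refers to single values such as one cell of a table or a pandas scalar type like `pd.Timestamp`; the names `s`, `df`, `col`, `cell`, and `ts` are illustrative.

```python
import pandas as pd

# Series: a one-dimensional labeled array.
s = pd.Series([14.3, 14.8], index=[2006, 2007], name="Poverty")

# DataFrame: a two-dimensional table whose columns share one index.
df = pd.DataFrame({"GDP": [24288, 26084], "Poverty": [14.3, 14.8]},
                  index=[2006, 2007])

# Index: the object that holds the axis labels (rows and columns alike).
print(type(df.index))
print(type(df.columns))

# Selecting one column gives a Series; selecting one cell gives a scalar.
col = df["GDP"]
cell = df.loc[2006, "GDP"]
print(type(col), type(cell))

# pandas also defines its own scalar types, for example Timestamp.
ts = pd.Timestamp("2006-01-01")
print(ts.year)
```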
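4. Data selection (where to get data)
International: UN Statistics Division, OECD
Domestic (Korea): Public Data Portal (공공데이터포털), Open MET Data Portal (기상자료개방포털)
5. Practice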
# pandas!
# import NumPy and load pandas
import numpy as np
import pandas as pd
# Series of scalar values with the default integer index
gdp_s1 = pd.Series([24288, 26084, 26689, 26338, 28210])
print(gdp_s1)
# Series of scalar values with an explicit index (years)
gdp_s2 = pd.Series([24288, 26084, 26689, 26338, 28210],
                   index=[2006, 2007, 2008, 2009, 2010])
poverty_s1 = pd.Series([14.3, 14.8, 15.2, 15.3, 14.9],
                       index=[2006, 2007, 2008, 2009, 2010])
print(gdp_s2)
print(poverty_s1)
# Series from a Python dictionary (the keys become the index)
poverty_s2 = pd.Series({2006: 14.3, 2007: 14.8, 2008: 15.2, 2009: 15.3,
                        2010: 14.9})
print(poverty_s2)
# Series from an ndarray; string labels also work as the index
s3 = pd.Series(np.random.randn(4), index=['Jan', 'Feb', 'Mar', 'Apr'], name='series name')
print(s3)
# Convert the Series values to a NumPy array
print(poverty_s2.to_numpy())
# DataFrame from a dict of Series with the default integer index (Year is a regular column)
index = pd.Series([2006, 2007, 2008, 2009, 2010])
gdp_s2 = pd.Series([24288, 26084, 26689, 26338, 28210])
poverty_s1 = pd.Series([14.3, 14.8, 15.2, 15.3, 14.9])
data1 = {'Year': index, 'GDP': gdp_s2, 'Poverty': poverty_s1}
data_f1 = pd.DataFrame(data1)
print(data_f1)
# DataFrame from a dict of Series that share a year index
gdp_s2 = pd.Series([24288, 26084, 26689, 26338, 28210],
                   index=[2006, 2007, 2008, 2009, 2010])
poverty_s1 = pd.Series([14.3, 14.8, 15.2, 15.3, 14.9],
                       index=[2006, 2007, 2008, 2009, 2010])
data2 = {'GDP': gdp_s2, 'Poverty': poverty_s1}
data_f2 = pd.DataFrame(data2)
print(data_f2)
print("----------------")
# DataFrame from a dict of lists
data_f3 = pd.DataFrame({'Year': [2006, 2007, 2008, 2009, 2010],
                        'GDP': [24288, 26084, 26689, 26338, 28210],
                        'Poverty': [14.3, 14.8, 15.2, 15.3, 14.9]})
print(data_f3)
# DataFrame from an ndarray with a time series (date) index
index = pd.date_range('1/1/2000', periods=5)
data_f4 = pd.DataFrame(np.random.randn(5, 4), index=index,
                       columns=['Jan', 'Feb', 'Mar', 'Apr'])
print(data_f4)
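# Sort rows by index in descending order (returns a new DataFrame)
print(data_f4.sort_index(axis=0, ascending=False))
# Sort rows by the values in the Feb column
print(data_f4.sort_values(by='Feb'))
# Slice the first three rows
print(data_f4[0:3])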
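# Add a new column from a Series; values are aligned on the year index
data_f2['Weight'] = pd.Series([60, 65, 70, 63, 58],
                              index=[2006, 2007, 2008, 2009, 2010])
print(data_f2)
# Add a column computed from existing columns
data_f2['Rate'] = data_f2['Poverty'] * data_f2['Weight']
print(data_f2)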
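# Delete the Rate column
del data_f2['Rate']
print(data_f2)
# Rebuild with a new index and column list; labels missing from data_f2
# (year 2011, columns Year and GNP) are filled with NaN
print(pd.DataFrame(data_f2, index=[2010, 2009, 2011],
                   columns=['Year', 'GDP', 'Poverty', 'GNP']))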
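print(data_f1.T)           # transpose rows and columns
print(data_f1.describe())  # summary statistics for numeric columns
print(data_f1.head(2))     # first two rows
print(data_f1.index)       # row labels
print(data_f1.GDP)         # attribute access to a column, same as data_f1['GDP']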