Data Mining: A Quick-Start Tutorial on Data Exploration

Date: 2022-01-27 11:42:56

Easy Data Hacking & Mining & Exploring

*Note: This essay uses Python for data mining.*

My indie blog has other posts that I have not copied over here; see Crazydogen’s indie blog.

Agenda

- Pandas
- Pandas-Profiling
- Statsmodels
- Missingno
- Wordcloud

Pandas

Pandas is a Python library for exploring, processing, and modeling data.

Here we take the MIMIC-III dataset as an example.

Basic stats

# First load df from a file, e.g. with pd.read_csv()
df.head()
df.shape
# Replace "a_column" with the name of a numeric column
df["a_column"].mean()
df["a_column"].std()
df["a_column"].max()
df["a_column"].min()
df["a_column"].quantile()  # median by default (q=0.5)
df.describe()  # brief description
df.isna().any()  # check whether each column has missing values
df.isna().sum()  # count NaN values per column
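To make the calls above concrete, here is a minimal sketch on a small made-up table. The column names `age`, `los_days`, and `gender` are assumptions for illustration only and are not taken from MIMIC-III.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for a table loaded with pd.read_csv().
df = pd.DataFrame(
    {
        "age": [34.0, 51.0, np.nan, 78.0, 62.0],
        "los_days": [2.5, 7.0, 3.5, 12.0, 5.0],
        "gender": ["F", "M", "M", None, "F"],
    }
)

mean_los = df["los_days"].mean()                 # arithmetic mean of one column
q = df["los_days"].quantile([0.25, 0.5, 0.75])   # several quantiles at once
has_missing = df.isna().any()                    # True for columns with any missing value
n_missing = df.isna().sum()                      # number of missing values per column
summary = df.describe()                          # numeric columns only by default

print(summary)
print(n_missing)
```

Passing a list to `quantile()` returns a Series indexed by the requested quantiles, which is often handier than calling it once per value.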

Additional tips

Pandas dataframe methods

Working with missing data

Pandas dataframe Operations

Statistical functions

Charting a tabular dataset

Supported charts

DataFrame.plot([x, y], kind)

- kind:
  - 'line': line plot (default)
  - 'bar': vertical bar plot
  - 'barh': horizontal bar plot
  - 'hist': histogram
  - 'box': boxplot
  - 'kde': Kernel Density Estimation plot
  - 'density': same as 'kde'
  - 'area': stacked area plot
  - 'pie': pie plot
  - 'scatter': scatter plot
  - 'hexbin': hexagonal binning plot
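As a small illustration of how `kind` changes the chart, the sketch below draws the same made-up counts twice. The counts and labels are hypothetical, and the `Agg` backend is chosen only so the script also runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical counts, e.g. the result of a groupby(...).count() step.
counts = pd.Series({"MARRIED": 5, "SINGLE": 3, "WIDOWED": 2}, name="admissions")

fig, (ax_bar, ax_pie) = plt.subplots(1, 2, figsize=(10, 4))
counts.plot(kind="barh", ax=ax_bar, title="kind='barh'")  # one horizontal bar per label
counts.plot(kind="pie", ax=ax_pie, title="kind='pie'")    # one wedge per label

# fig.savefig("plot_kinds.png")  # uncomment to write the figure to disk
```

Passing `ax=` lets several chart kinds share one figure, which makes it easy to compare them side by side.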

import pandas as pd

a = pd.read_csv("/path/mimic_demo/admissions.csv")
a.columns = a.columns.str.lower()  # normalize column names to lowercase
a.groupby(['marital_status']).count()['row_id'].plot(kind='pie')

a.groupby(['religion']).count()['row_id'].plot(kind = 'barh')

p = pd.read_csv("/path/mimic_demo/patients.csv")
p.columns = p.columns.str.lower()
# Join admissions with patients, then count admissions per religion and gender
ap = pd.merge(a, p, on='subject_id', how='inner')
ap.groupby(['religion', 'gender']).size().unstack().plot(kind="barh", stacked=True)

c = pd.read_csv("/path/mimic_demo/cptevents.csv")
c.columns = c.columns.str.lower()
# Join admissions with CPT events, then count events per discharge location and section header
ac = pd.merge(a, c, on='hadm_id', how='inner')
ac.groupby(['discharge_location', 'sectionheader']).size().unstack().plot(kind="barh", stacked=True)
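The stacked bar charts above are drawn from a count table built by merge, groupby, size, and unstack. A minimal sketch of that table on made-up data follows; the miniature `admissions` and `patients` frames are hypothetical, and `fill_value=0` is an addition here so that absent combinations show as 0 instead of NaN.

```python
import pandas as pd

# Hypothetical miniature versions of the admissions and patients tables.
admissions = pd.DataFrame(
    {
        "subject_id": [1, 1, 2, 3, 4],
        "religion": ["CATHOLIC", "CATHOLIC", "JEWISH", "CATHOLIC", "JEWISH"],
    }
)
patients = pd.DataFrame(
    {"subject_id": [1, 2, 3, 4], "gender": ["F", "M", "M", "M"]}
)

# Inner join: each admission row picks up the gender of its patient.
merged = pd.merge(admissions, patients, on="subject_id", how="inner")

# Count rows per (religion, gender) pair, then pivot gender into columns.
table = merged.groupby(["religion", "gender"]).size().unstack(fill_value=0)
print(table)
```

Each row of `table` becomes one bar and each column one stacked segment when it is passed to `.plot(kind="barh", stacked=True)`.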

Pandas-profiling

Pandas-Profiling is a Python library for exploratory data analysis.

A quick example

import pandas as pd
import pandas_profiling

a = pd.read_csv("/path/mimic_demo/admissions.csv")
a.columns = a.columns.str.lower()
# Ignore the time columns when profiling since they are uninteresting
cols = [c for c in a.columns if not c.endswith('time')]
profile = pandas_profiling.ProfileReport(a[cols], explorative=True)
profile  # display the report (e.g. in a notebook)

Save the generated profile to an HTML file.

profile.to_file("/path/data_profile.html")

Statsmodels

Statsmodels is a Python module that provides classes and functions for the estimation of many different statistical models, as well as for conducting statistical tests, and statistical data exploration.

Basic stats

For simplicity, we use statsmodels’ describe for quick stats. (Note: Describe has been deprecated in favor of Description and its simplified functional version, describe. Describe will be removed after 0.13.) The selectable statistics include:

“nobs” - Number of observations

“missing” - Number of missing observations

“mean” - Mean

“std_err” - Standard Error of the mean assuming no correlation

“ci” - Confidence interval with coverage (1 - alpha) using the normal or t. This option creates two entries in any tables: lower_ci and upper_ci.

“std” - Standard Deviation

“iqr” - Interquartile range

“iqr_normal” - Interquartile range relative to a Normal

“mad” - Mean absolute deviation

“mad_normal” - Mean absolute deviation relative to a Normal

“coef_var” - Coefficient of variation

“range” - Range between the maximum and the minimum

“max” - The maximum

“min” - The minimum

“skew” - The skewness defined as the standardized 3rd central moment

“kurtosis” - The kurtosis defined as the standardized 4th central moment

“jarque_bera” - The Jarque-Bera test statistic for normality based on the skewness and kurtosis. This option creates two entries, jarque_bera and jarque_bera_pval.

“mode” - The mode of the data. This option creates two entries in all tables, mode and mode_freq which is the empirical frequency of the modal value.

“median” - The median of the data.

“percentiles” - The percentiles. Values included depend on the input value of percentiles.

“distinct” - The number of distinct categories in a categorical.

“top” - The most common categories. Labeled top_n for n in 1, 2, …, ntop.

“freq” - The frequency of the common categories. Labeled freq_n for n in 1, 2, …, ntop.

import pandas as pd
import statsmodels.stats.descriptivestats as dst

# Show all columns and rows when printing
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

a = pd.read_csv("/path/xx.csv")  # load your file via pd.read_xx()
de = dst.describe(a)

# Save the descriptions to Excel (pandas also supports other formats)
a.describe().to_excel("./pd_des.xlsx")
de.to_excel("./sm_des.xlsx")
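To restrict the output to a few of the statistics listed above, `describe` accepts a `stats` argument. The sketch below assumes a recent statsmodels release and uses a made-up numeric table; the column names are hypothetical.

```python
import pandas as pd
import statsmodels.stats.descriptivestats as dst

# Hypothetical numeric data; replace with your own DataFrame.
data = pd.DataFrame(
    {
        "age": [34.0, 51.0, 45.0, 78.0, 62.0, 60.0],
        "los_days": [2.5, 7.0, 3.5, 12.0, 5.0, 6.0],
    }
)

# Request only a subset of the selectable statistics.
de = dst.describe(data, stats=["nobs", "mean", "std", "median"])
print(de)

# Statistics are the rows and variables are the columns.
mean_age = float(de.loc["mean", "age"])
```

Keeping the list short makes the exported table easier to read when a dataset has many columns.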

Missingno

Missingno offers a visual summary of the completeness of a dataset. This example gives some intuition about the ADMISSIONS table:

- Not every patient is admitted through the emergency department, as there are many missing values in edregtime and edouttime.
- The language of patients is now a mandatory field, but it used not to be.

import missingno as msno
import pandas as pd

a = pd.read_csv("/path/mimic_demo/admissions.csv")
msno.matrix(a)  # nullity matrix: gaps mark missing values

Missingno also supports bar charts, heatmaps, and dendrograms; check it out on GitHub.
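The quantities these plots visualize can also be inspected with plain pandas. The following is a rough sketch on made-up rows, not missingno's own implementation; the values are hypothetical and only the column names echo the ADMISSIONS table.

```python
import pandas as pd

# Hypothetical rows mimicking the missingness pattern described above.
a = pd.DataFrame(
    {
        "admittime": ["2130-01-01", "2130-02-03", "2131-05-06", "2132-07-08"],
        "edregtime": ["2130-01-01", None, None, "2132-07-08"],
        "edouttime": ["2130-01-01", None, None, "2132-07-08"],
        "language": [None, None, "ENGL", "ENGL"],
    }
)

nullity = a.notna()            # True where a value is present (what a matrix plot shows)
completeness = nullity.mean()  # fraction of present values per column (what a bar plot shows)

# Correlation of missingness between columns that have any missing value
# (the idea behind a nullity heatmap); fully populated columns are skipped
# because a constant column has no defined correlation.
cols = [c for c in a.columns if a[c].isna().any()]
nullity_corr = a[cols].isna().astype(int).corr()

print(completeness)
print(nullity_corr)
```

A correlation of 1 between edregtime and edouttime means the two fields are always missing together, which matches the emergency-department observation above.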

Wordcloud

Wordcloud visualizes a given text in a word-cloud format.

This example illustrates that the majority of patients suffered from sepsis.

import matplotlib.pyplot as plt
from wordcloud import WordCloud

# Prepare the input text as a single string (a is the admissions table loaded earlier)
text = " ".join(a['diagnosis'].dropna().astype(str))
# Generate a word cloud from the input text
wordcloud = WordCloud().generate(text)

# Plot the word cloud
plt.figure(figsize=(10, 10))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.show()
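A word cloud sizes each word by how often it occurs, so the same conclusion can be checked numerically. Below is a minimal sketch with made-up diagnosis strings; it only counts words and does not reproduce WordCloud's stopword filtering or collocation handling.

```python
import re
from collections import Counter

import pandas as pd

# Hypothetical diagnosis values standing in for a['diagnosis'].
diagnosis = pd.Series(
    [
        "SEPSIS",
        "PNEUMONIA",
        None,
        "SEPSIS",
        "URINARY TRACT INFECTION;SEPSIS",
        "CONGESTIVE HEART FAILURE",
    ]
)

# Join the non-missing entries into one string, then split it into words.
text = " ".join(diagnosis.dropna().astype(str))
words = re.findall(r"[A-Za-z]+", text.upper())

freq = Counter(words)  # word -> number of occurrences
print(freq.most_common(3))
```

The most frequent entries in `freq` are the words that would appear largest in the cloud.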

Reference

- MIMIC dataset data visualization
- Data analysis and visualization tutorial at TMF summer school
- Statistics in Python
- Bilogur. Missingno: a missing data visualization suite. Journal of Open Source Software, 3(22), 547.
- Allen B. Downey. Think Stats, 2nd Edition.
- Scipy’s statistical functions
