Easy Data Hacking & Mining & Exploring
*Note: This essay uses Python for data mining.*
More posts are available on my independent blog; I have not copied them all over here, so head to Crazydogen’s indie blog.
Agenda
- Pandas
- Pandas-Profiling
- Statsmodels
- Missingno
- Wordcloud
Pandas
Pandas is a Python library for exploring, processing, and modeling data.
Here we take the MIMIC-III dataset as an example.
Basic stats
# First load df from a file
df.head()
df.shape
df['a column'].mean()      # 'a column' is a placeholder for a real column name
df['a column'].std()
df['a column'].max()
df['a column'].min()
df['a column'].quantile()  # defaults to the 0.5 quantile (the median)
df.describe()              # brief description
df.isna().any()            # check whether each column has missing values
df.isna().sum()            # count NaN values per column
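The calls above need a loaded DataFrame to run. Here is a minimal, self-contained sketch on a small synthetic table; the column names `age` and `los_days` and all values are made up for illustration and are not taken from MIMIC-III.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for a clinical table (hypothetical columns and values).
df = pd.DataFrame({
    "age": [65, 72, np.nan, 58, 80],
    "los_days": [3.0, 10.5, 7.0, np.nan, 2.5],
})

age_mean = df["age"].mean()                            # NaN values are skipped by default
age_quartiles = df["age"].quantile([0.25, 0.5, 0.75])  # several quantiles at once
missing_per_column = df.isna().sum()                   # number of NaN values in each column
has_missing = df.isna().any()                          # True for columns with at least one NaN

print(df.describe())
print(missing_per_column)
```

Note that `mean()`, `std()` and `quantile()` ignore missing values by default, so it is worth checking `isna().sum()` before reading too much into the summary.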
Additional tips
Pandas dataframe methods
Working with missing data
Pandas dataframe Operations
Statistical functions
Charting a tabular dataset
Supported charts
DataFrame.plot([x, y], kind)
- kind :
  - 'line': line plot (default)
  - 'bar': vertical bar plot
  - 'barh': horizontal bar plot
  - 'hist': histogram
  - 'box': boxplot
  - 'kde': Kernel Density Estimation plot
  - 'density': same as 'kde'
  - 'area': stacked area plot
  - 'pie': pie plot
  - 'scatter': scatter plot
  - 'hexbin': Hexagonal binning plot
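To see how the `kind` argument changes the output, here is a small sketch that draws the same counts twice, once as `'barh'` and once as `'pie'`. The counts are hypothetical, and the `Agg` backend is an assumption made so the snippet runs without a display; drop that line in a notebook.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend, so no window is needed
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical admission counts per marital status.
counts = pd.Series({"MARRIED": 5, "SINGLE": 3, "WIDOWED": 2}, name="admissions")

fig, (ax_bar, ax_pie) = plt.subplots(1, 2, figsize=(10, 4))
counts.plot(kind="barh", ax=ax_bar, title="kind='barh'")  # one horizontal bar per category
counts.plot(kind="pie", ax=ax_pie, title="kind='pie'")    # one wedge per category
ax_pie.set_ylabel("")                                     # hide the series name on the pie
fig.tight_layout()
```

Passing `ax=` keeps each chart on its own axes; without it, successive `plot` calls draw onto whichever axes is currently active.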
import pandas as pd

a = pd.read_csv("/path/mimic_demo/admissions.csv")
a.columns = map(str.lower, a.columns)
a.groupby(['marital_status']).count()['row_id'].plot(kind='pie')
a.groupby(['religion']).count()['row_id'].plot(kind = 'barh')
p = pd.read_csv("/path/mimic_demo/patients.csv")
p.columns = map(str.lower, p.columns)
ap = pd.merge(a, p, on='subject_id', how='inner')
ap.groupby(['religion', 'gender']).size().unstack().plot(kind="barh", stacked=True)
c = pd.read_csv("/path/mimic_demo/cptevents.csv")
c.columns = map(str.lower, c.columns)
ac = pd.merge(a, c, on='hadm_id', how='inner')
ac.groupby(['discharge_location', 'sectionheader']).size().unstack().plot(kind="barh", stacked=True)
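The stacked bar charts above are built from a merge followed by `groupby(...).size().unstack()`. The sketch below shows the intermediate table that feeds the plot, using two tiny synthetic frames (hypothetical values). The `fill_value=0` argument is an addition not present in the code above; it replaces the NaN that `unstack()` would otherwise leave for combinations that never occur.

```python
import pandas as pd

# Hypothetical miniature versions of the admissions and patients tables.
admissions = pd.DataFrame({
    "subject_id": [1, 2, 3, 4],
    "religion": ["CATHOLIC", "CATHOLIC", "JEWISH", "JEWISH"],
})
patients = pd.DataFrame({
    "subject_id": [1, 2, 3, 5],
    "gender": ["F", "M", "F", "M"],
})

# The inner join keeps only subject_ids present in both tables (1, 2 and 3 here).
merged = pd.merge(admissions, patients, on="subject_id", how="inner")

# Count rows per (religion, gender) pair, then pivot gender into columns.
table = merged.groupby(["religion", "gender"]).size().unstack(fill_value=0)
print(table)
```

Each row of `table` becomes one bar, and each column becomes one stacked segment when `plot(kind="barh", stacked=True)` is called on it.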
Pandas-profiling
Pandas-Profiling is a Python library for exploratory data analysis.
A quick example
import pandas as pd
import pandas_profiling

a = pd.read_csv("/path/mimic_demo/admissions.csv")
a.columns = map(str.lower, a.columns)
# ignore the times when profiling since they are uninteresting
cols = [c for c in a.columns if not c.endswith('time')]
profile = pandas_profiling.ProfileReport(a[cols], explorative=True)
profile  # display the report in a notebook
Save the generated profile to an HTML file.
profile.to_file("/path/data_profile.html")
Statsmodels
Statsmodels is a Python module that provides classes and functions for the estimation of many different statistical models, as well as for conducting statistical tests, and statistical data exploration.
Basic stats
For simplicity, we use statsmodels’ describe for quick stats. (Note: Describe has been deprecated in favor of Description and its simplified functional version, describe. Describe will be removed after 0.13.) The selectable statistics include:
“nobs” - Number of observations
“missing” - Number of missing observations
“mean” - Mean
“std_err” - Standard Error of the mean assuming no correlation
“ci” - Confidence interval with coverage (1 - alpha) using the normal or t. This option creates two entries in any tables: lower_ci and upper_ci.
“std” - Standard Deviation
“iqr” - Interquartile range
“iqr_normal” - Interquartile range relative to a Normal
“mad” - Mean absolute deviation
“mad_normal” - Mean absolute deviation relative to a Normal
“coef_var” - Coefficient of variation
“range” - Range between the maximum and the minimum
“max” - The maximum
“min” - The minimum
“skew” - The skewness defined as the standardized 3rd central moment
“kurtosis” - The kurtosis defined as the standardized 4th central moment
“jarque_bera” - The Jarque-Bera test statistic for normality based on the skewness and kurtosis. This option creates two entries, jarque_bera and jarque_bera_pval.
“mode” - The mode of the data. This option creates two entries in all tables, mode and mode_freq which is the empirical frequency of the modal value.
“median” - The median of the data.
“percentiles” - The percentiles. Values included depend on the input value of percentiles.
“distinct” - The number of distinct categories in a categorical.
“top” - The most common categories. Labeled top_n for n in 1, 2, …, ntop.
“freq” - The frequency of the common categories. Labeled freq_n for n in 1, 2, …, ntop.
import pandas as pd
import statsmodels.stats.descriptivestats as dst

pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

a = pd.read_csv("/path/xx.csv")  # load your file via pd.read_xx()
de = dst.describe(a)

# Save the descriptions to Excel (pandas also supports other formats)
a.describe().to_excel("./pd_des.xlsx")  # pandas summary
de.to_excel("./sm_des.xlsx")            # statsmodels summary
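To request only some of the statistics listed above, `describe` accepts a `stats` argument. The following is a minimal sketch on synthetic numeric data; the column names `x` and `y` are hypothetical, and it assumes a statsmodels version that ships the functional `describe` (0.12 or later).

```python
import pandas as pd
from statsmodels.stats.descriptivestats import describe

# Small synthetic numeric table (made-up values).
df = pd.DataFrame({
    "x": [1.0, 2.0, 3.0, 4.0],
    "y": [10.0, 20.0, 30.0, 40.0],
})

# Restrict the output to a handful of the selectable statistics.
summary = describe(df, stats=["nobs", "mean", "std", "min", "max"])
print(summary)
```

The result is a regular DataFrame with one row per statistic and one column per variable, so it can be sliced or exported like any other pandas object.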
Missingno
Missingno offers a visual summary of the completeness of a dataset. This example suggests a few things about the ADMISSIONS table:
- Not every patient is admitted through the emergency department, as there are many missing values in edregtime and edouttime.
- Language is now a mandatory field for patients, but it used to be optional.
import missingno as msno
import pandas as pd

a = pd.read_csv("/path/mimic_demo/admissions.csv")
msno.matrix(a)
Missingno also supports bar charts, heatmaps and dendrograms; check it out on GitHub.
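As a numeric cross-check of what the missingno plots show, per-column completeness can also be computed with pandas alone. This sketch uses a small synthetic frame; the column names mirror the ADMISSIONS table, but the values are made up.

```python
import pandas as pd

# Hypothetical slice of an admissions-like table with gaps.
a = pd.DataFrame({
    "hadm_id": [100, 101, 102, 103],
    "edregtime": ["2130-01-01 10:00", None, None, "2131-05-02 08:30"],
    "language": [None, "ENGL", "ENGL", "ENGL"],
})

# Fraction of non-missing values per column, least complete first.
completeness = a.notna().mean().sort_values()
print(completeness)
```

Columns near 1.0 are essentially complete, while low values point to fields such as edregtime that are only filled in for some admissions.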
Wordcloud
Wordcloud visualizes a given text in a word-cloud format.
This example illustrates that the majority of patients suffered from sepsis.
import matplotlib.pyplot as plt
from wordcloud import WordCloud

# Prepare the input text as a single string
# (a is the admissions DataFrame with lower-cased column names)
text = " ".join(a['diagnosis'].dropna().astype(str))
wordcloud = WordCloud().generate(text)  # Generate a word cloud from the input text

# Plot the word cloud
plt.figure(figsize=(10, 10))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.show()
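A word cloud gives a visual impression only, so it can help to back it up with actual counts. Here is a minimal sketch using `value_counts` on a synthetic diagnosis column; the values are invented for illustration and do not reflect the real MIMIC-III distribution.

```python
import pandas as pd

# Hypothetical diagnosis column; None stands for a missing entry.
diagnosis = pd.Series(
    ["SEPSIS", "PNEUMONIA", "SEPSIS", None, "SEPSIS", "FEVER"],
    name="diagnosis",
)

counts = diagnosis.value_counts()               # missing values are dropped by default
share = diagnosis.value_counts(normalize=True)  # same counts as fractions of non-missing rows

print(counts.head())
print(share.head())
```

On real data, the top entries of `counts` should match the largest words in the cloud, and `share` shows whether the leading diagnosis really accounts for a majority.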
Reference
- MIMIC dataset data visualization
- Data analysis and visualization tutorial at TMF summer school
- Statistics in Python
- Bilogur, (). Missingno: a missing data visualization suite. Journal of Open Source Software, 3(22), 547.
- Allen B. Downey. Think Stats, 2nd Edition.
- Scipy’s statistical functions