A way to access metadata on all the files due for processing.
%load_ext autoreload
%autoreload 2
%load_ext rich

get_page_count[source]

get_page_count(filepath:PathLike[Any])

Gets the page count of a PDF file.

get_file_metadata[source]

get_file_metadata(filepath:PathLike[Any])

Gets the metadata associated with a PDF file.

add_comma_separation[source]

add_comma_separation(input:int)

Adds comma-separation for thousands to an integer.
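
As a rough illustration of what this helper does, here is a minimal sketch using Python's built-in format specifier. The name add_comma_separation_sketch and the string return type are assumptions made for this example; the library's actual implementation may differ.

```python
def add_comma_separation_sketch(value: int) -> str:
    """Format an integer with commas as thousands separators.

    Hypothetical stand-in for add_comma_separation, not the library's code.
    """
    # The "," option in the format spec groups digits in threes.
    return f"{value:,}"


print(add_comma_separation_sketch(1045078))
```

Numbers below one thousand come back unchanged, so the helper is safe to apply to every value in a column.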

has_ocr_layer[source]

has_ocr_layer(filepath:PathLike[Any])

Checks whether a particular file has an OCR layer.

get_stats[source]

get_stats(source_path:PathLike[Any])

Gathers statistics on the PDF data contained in a particular directory.
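
To give a feel for the shape of the data, the sketch below collects only the filesystem-level fields (filename, pdf_file_size_bytes, date_last_modified) using the standard library. It is an assumption-laden stand-in: the function name get_stats_sketch, the recursive search, and the list-of-dicts return type are all choices made for this example, and page counts and OCR detection are left out because they need a PDF library.

```python
from datetime import datetime
from pathlib import Path
from typing import Any, Dict, List


def get_stats_sketch(source_path: Path) -> List[Dict[str, Any]]:
    """Collect basic file metadata for every PDF under source_path.

    Hypothetical stand-in for get_stats; searching subdirectories
    recursively is an assumption, not documented behaviour.
    """
    stats = []
    for pdf_path in sorted(Path(source_path).rglob("*.pdf")):
        file_info = pdf_path.stat()
        stats.append(
            {
                "filename": pdf_path.name,
                "pdf_file_size_bytes": file_info.st_size,
                # st_mtime is seconds since the epoch; convert to datetime.
                "date_last_modified": datetime.fromtimestamp(file_info.st_mtime),
            }
        )
    return stats
```

Each dictionary corresponds to one row of statistics, which makes the result easy to hand on to a JSON encoder, a dataframe constructor, or a CSV writer.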

convert_timestamp[source]

convert_timestamp(item_date_object)

Helper function to convert a datetime object to timestamp when needed for a JSON object.
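
A helper like this is typically passed to json.dumps through its default parameter, which is called for any value the encoder cannot serialize on its own. The sketch below assumes the conversion is to a POSIX timestamp; the function name convert_timestamp_sketch, the example filename, and the arbitrary date are all hypothetical.

```python
import json
from datetime import datetime, timezone


def convert_timestamp_sketch(item_date_object):
    """Return a POSIX timestamp for datetime values json cannot encode.

    Hypothetical stand-in for convert_timestamp; the real helper may
    produce a different representation.
    """
    if isinstance(item_date_object, datetime):
        return item_date_object.timestamp()
    # json.dumps expects the default hook to raise TypeError otherwise.
    raise TypeError(f"Cannot serialize {type(item_date_object).__name__}")


# Example record with an arbitrary, timezone-aware date.
record = {
    "filename": "example.pdf",
    "date_created": datetime(2021, 1, 1, tzinfo=timezone.utc),
}
encoded = json.dumps(record, default=convert_timestamp_sketch)
print(encoded)
```

Using a timezone-aware datetime keeps the resulting timestamp independent of the machine's local timezone.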

get_json_stats[source]

get_json_stats(source_path:PathLike[Any])

Gathers statistics on the PDF data in a directory in JSON format.

get_dataframe_stats[source]

get_dataframe_stats(source_path:PathLike[Any])

Gathers statistics on the PDF data in a directory as a dataframe.

export_stats_as_csv[source]

export_stats_as_csv(source_path:PathLike[Any], destination_path:PathLike[Any]=Path('stats.csv'))

Exports statistics on the PDF data as a CSV file.
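
The export step can be sketched with the standard library's csv module. Unlike export_stats_as_csv, which takes a source directory, this hypothetical export_rows_as_csv_sketch takes rows that have already been gathered (a list of dictionaries), so the example stays self-contained; the default destination mirrors the stats.csv default in the signature above.

```python
import csv
from pathlib import Path
from typing import Any, Dict, List


def export_rows_as_csv_sketch(
    rows: List[Dict[str, Any]],
    destination_path: Path = Path("stats.csv"),
) -> None:
    """Write a list of per-file statistics dictionaries to a CSV file.

    Hypothetical helper for illustration, not the library's implementation.
    """
    # Use the keys of the first row as the column order.
    fieldnames = list(rows[0].keys()) if rows else []
    # newline="" lets the csv module control line endings itself.
    with open(destination_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=fieldnames)
        if fieldnames:
            writer.writeheader()
        writer.writerows(rows)
```

Every row is assumed to share the same keys; csv.DictWriter raises a ValueError if a row contains a key that is not in fieldnames.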

In this next section I experiment with makeshift drift detection using evidently. The code is commented out because the last two calls (calculate and show) still need fixing:

# from evidently.dashboard import Dashboard
# from evidently.tabs import DataDriftTab
# from evidently.pipeline.column_mapping import ColumnMapping

# source1 = Path("/Users/strickvl/Desktop/NL")
# source2 = Path("/Users/strickvl/Desktop/machine-learning-flashcards")

# data_1 = get_dataframe_stats(source1)
# data_2 = get_dataframe_stats(source2)

# data_types_dict = {
#     'filename': str,
#     "pagecount": np.number,
#     'has_ocr_layer': np.number,
#     'pdf_file_size_bytes': np.number,
#     'author': str,
# }

# data_1['date_created'] = pd.to_datetime(data_1['date_created'])
# data_1['date_last_modified'] = pd.to_datetime(data_1['date_last_modified'])
# data_1 = data_1.astype(data_types_dict)

# data_2['date_created'] = pd.to_datetime(data_2['date_created'])
# data_2['date_last_modified'] = pd.to_datetime(data_2['date_last_modified'])
# data_2 = data_2.astype(data_types_dict)

# cols_to_drop = ['date_created', 'date_last_modified', "filename", "author"]
# data_1.drop(cols_to_drop, axis=1, inplace=True)
# data_2.drop(cols_to_drop, axis=1, inplace=True)

# # export_stats_as_csv(source1, Path("./tryout/stats1.csv"))
# # export_stats_as_csv(source2, Path("./tryout/stats2.csv"))

# # df1 = pd.read_csv("./tryout/stats1.csv")
# # df2 = pd.read_csv("./tryout/stats2.csv")

# data_drift_report = Dashboard(tabs=[DataDriftTab()])
# # these next two lines need fixing
# # data_drift_report.calculate(data_1, data_2)
# # data_drift_report.show()

display_stats[source]

display_stats(stats_list:list)

Displays statistics on the PDF data contained in a particular directory.

source = Path("/Users/strickvl/Desktop/machine-learning-flashcards")

display_stats(get_stats(source))
                                  Stats for your PDF Files                                   
┏━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━┓
┃ PageCo…  Filename                                            ocr_la…  pdf_fi…  author ┃
┡━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━┩
│       1  Residual_Sum_Of_Squares.pdf                         False    1045078         │
│       1  Basic_Parts_Of_Deep_Learning.pdf                    False    871672          │
│       1  Conditioning.pdf                                    False    368812          │
│       1  DBSCAN.pdf                                          False    779406          │
│       1  Youdens_J_Statistic.pdf                             False    1181340         │
│       1  Handling_Imbalanced_Classes_In_Support_Vector_Mac…  False    423704          │
│       1  Word2Vec.pdf                                        False    518507          │
│       1  Bayes_Theorem.pdf                                   False    448065          │
│       1  Uniform_Distribution.pdf                            False    574346          │
│       1  Matthews_Correlation_Coefficient.pdf                False    501665          │
│       1  Bias-Variance_Tradeoff.pdf                          False    1172740         │
│       1  Hidden_Layer.pdf                                    False    1335063         │
│       1  K-Nearest_Neighbors_Tips_And_Tricks.pdf             False    558009          │
│       1  How_To_Choose_Hidden_Unit_Activation_Functions.pdf  False    475676          │
│       1  Bayes_Error.pdf                                     False    1518133         │
│       1  Softmax_Normalization.pdf                           False    1086385         │
│       1  Anscombes_Quartet.pdf                               False    1595618         │
│       1  Saddle_Point.pdf                                    False    708805          │
│       1  k-Nearest_Neighbors.pdf                             False    1245264         │
│       1  Early_Stopping_Advantages.pdf                       False    1053443         │
│       1  Fowlkes-Mallows.pdf                                 False    1622939         │
│       1  Gradient.pdf                                        False    1849334         │
│       1  Capacity.pdf                                        False    577765          │
│       1  Regularization.pdf                                  False    1018306         │
│       1  Supervised_Deep_Learning_Rule_Of_Thumb.pdf          False    1299819         │
│       1  Decision_Trees.pdf                                  False    1203288         │
│       1  Confusion_Matrix.pdf                                False    1442159         │
│       1  Minimum_Of_A_Loss_Function.pdf                      False    660131          │
│       1  Variance.pdf                                        False    665448          │
│       1  AIC.pdf                                             False    900340          │
│       1  Hinge_Loss.pdf                                      False    1052360         │
│       1  Stop_Words.pdf                                      False    861381          │
│       1  Sensitivity.pdf                                     False    957821          │
│       1  No_Free_Lunch_Theorem.pdf                           False    1065637         │
│       1  Bias.pdf                                            False    615440          │
│       1  Overfit_Vs_Underfit.pdf                             False    868425          │
│       1  Probability_Density_Function.pdf                    False    1142624         │
│       1  Noisy_ReLU.pdf                                      False    648507          │
│       1  Categorical_Feature.pdf                             False    1116135         │
│       1  Conditional_Probability.pdf                         False    838578          │
│       1  Logistic_Regression_Vs_Linear_Regression.pdf        False    467698          │
│       1  Ordinary_Least_Squares.pdf                          False    1161237         │
│       1  Cross-Entropy.pdf                                   False    1043346         │
│       1  Standard_Error_Of_The_Mean.pdf                      False    643085          │
│       1  Hessian_Matrix.pdf                                  False    423681          │
│     ...  ...                                                 ...      ...      ...    │
└─────────┴────────────────────────────────────────────────────┴─────────┴─────────┴────────┘
TOTAL PAGECOUNT: 900