Classifiers

Katrina_saba · May 19, 2022, 1:20pm

My data is a flight dataset, and I have to build a classifier that classifies the “AIRLINE-NAME” attribute. The classification goal is to predict whether the airline is Southwest Airlines Co. or not (target attribute: AIRLINE-NAME {(binary: 0, 1), 1–> Southwest Airlines Co. and 0–> Other Airlines (Delta Airlines Inc., American Airlines Inc., United Airlines Inc.,…)}).
How do I classify a string dataset? What is the optimal solution for large datasets?

aworker · May 19, 2022, 1:31pm

Hi @Katrina_saba

This is (more or less) the same question as in your previous thread from today.

It is preferable to stay in the same thread when the topic remains the same and one thinks it has not been fully solved.
What is wrong with the solution posted in your previous thread? What is not working with that solution?
Maybe we could help you better if you gave examples where the solution doesn’t work.

Best
Ael

Katrina_saba · May 19, 2022, 1:37pm

That problem was resolved, which is why I opened a new topic. I am trying different classifiers for this question, such as k-nearest neighbour, SVM, decision tree, random forest, etc., to come up with an optimal solution for supervised classification.

Katrina_saba · May 19, 2022, 1:37pm

I also need to calculate the accuracy, which is why I need help with it.
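
As a rough illustration of the accuracy calculation mentioned here, the sketch below computes accuracy as the share of correct predictions and compares it with a majority-class baseline (always predicting the most frequent label). The labels, the model names `model_a` and `model_b`, and their predictions are made up for the example; real predictions would come from whichever classifiers are being compared.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("cannot compute accuracy on an empty set")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def majority_baseline(y_true):
    """Accuracy obtained by always predicting the most frequent class."""
    _, count = Counter(y_true).most_common(1)[0]
    return count / len(y_true)


# Made-up validation labels: 1 = Southwest Airlines Co., 0 = other airline.
y_valid = [1, 0, 0, 0, 1, 0, 0, 0, 0, 1]

# Made-up predictions from two hypothetical classifiers.
predictions = {
    "model_a": [1, 0, 0, 0, 0, 0, 0, 0, 0, 1],
    "model_b": [1, 0, 1, 0, 1, 0, 0, 1, 0, 1],
}

scores = {name: accuracy(y_valid, y_pred) for name, y_pred in predictions.items()}
baseline = majority_baseline(y_valid)
best_model = max(scores, key=scores.get)
```

Because the target is binary and most flights presumably belong to other airlines, a classifier is only useful if its validation accuracy clearly exceeds the baseline.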

aworker · May 19, 2022, 1:46pm

Let me see if I understood your project. You have this list of airline company names, and I guess they are sometimes written with errors or in slightly different ways?
For instance, let’s imagine that “Southwest Airlines Co.” may be written as “Southwest Airlines” or “SouthWest Airlines Co.”, which is still the same company in your case.

Therefore, you would like to train a classifier which recognizes the right airline company even with slight differences? Is this what you mean?

Katrina_saba · May 19, 2022, 1:51pm

Exactly, yes. The dataset has missing values. I need to use different classifiers to predict whether it is Southwest Airlines or not, and choose the classifier that best suits the purpose.

aworker · May 19, 2022, 1:56pm

Could you please share/upload your data here as a text file (CSV/Excel or other)? It would be much easier for us to help you.

Katrina_saba · May 19, 2022, 1:59pm

It doesn’t allow me to attach an Excel sheet.

aworker · May 19, 2022, 2:01pm

I’m afraid it does.

Please first check that the file extension is correct. That may be the problem.

Katrina_saba · May 19, 2022, 2:02pm

The file is a CSV dataset.

aworker · May 19, 2022, 2:04pm

Please change your file extension to .txt

Katrina_saba · May 19, 2022, 2:11pm

The file is too big to be attached.

aworker · May 19, 2022, 2:17pm

Your file must be very redundant because it repeatedly contains the same company names, so it should compress well. Could you please zip it and try to upload it here again? Please rename it from .zip to .txt once it is zipped so that the upload accepts it.

ScottF · May 19, 2022, 2:35pm

For this type of problem I would use an approach based on string similarity. Here’s an example workflow that may be useful:

aworker · May 19, 2022, 2:49pm

Hi @ScottF

Indeed this is a possible solution. Thanks for posting it.

It would be nice to compare the Levenshtein approach to other machine learning approaches if @Katrina_saba manages to upload her data.
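
For readers unfamiliar with the Levenshtein approach, here is a minimal sketch in plain Python. It is not the workflow posted above; it only illustrates the idea of labelling a name as Southwest when its edit distance to a reference spelling is small. The reference string `TARGET`, the cut-off `THRESHOLD`, and the helper names are assumptions made for the example, and the threshold would need tuning on labelled data.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions,
    or substitutions needed to turn string a into string b."""
    if len(a) < len(b):
        a, b = b, a  # the distance is symmetric; keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution or match
        previous = current
    return previous[-1]


def similarity(a, b):
    """Case-insensitive similarity in [0, 1]; 1.0 means identical strings."""
    a, b = a.strip().lower(), b.strip().lower()
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest


TARGET = "Southwest Airlines Co."  # reference spelling (assumption)
THRESHOLD = 0.8                    # assumed cut-off, to be tuned


def is_southwest(name):
    """Return 1 if the name is close enough to the reference spelling, else 0."""
    return 1 if similarity(name, TARGET) >= THRESHOLD else 0
```

With these settings, `is_southwest("SouthWest Airlines Co.")` and `is_southwest("Southwest Airlines")` both return 1, while `is_southwest("Delta Airlines Inc.")` returns 0.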

Best
Ael

Katrina_saba · May 20, 2022, 8:48am

I tried zipping the file and changing it to .txt format, but it still didn’t work; it says the file is too long.

mlauber71 · May 20, 2022, 9:02am

@Katrina_saba is this by any chance a dataset from Kaggle or a similar website? If so, you could link to it. Like:

Another possibility would be to take a sample and upload that, so we could see in principle what is going on.

Randomly draw some lines that might be suitable to give an idea about the dataset.
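
One way to draw such a random sample from a file that is too large to upload is reservoir sampling, which reads the file once and keeps only `k` rows in memory. The sketch below is plain Python; the function name `sample_rows` and the small in-memory demo data are made up for illustration, and a real run would pass an opened CSV file instead.

```python
import csv
import io
import random


def sample_rows(lines, k, seed=0):
    """Return the header and k data rows drawn uniformly at random.

    `lines` is any iterable of CSV text lines, e.g. an open file.
    """
    reader = csv.reader(lines)
    header = next(reader)
    rng = random.Random(seed)
    reservoir = []
    for index, row in enumerate(reader):
        if index < k:
            reservoir.append(row)           # fill the reservoir first
        else:
            slot = rng.randint(0, index)    # inclusive on both ends
            if slot < k:
                reservoir[slot] = row       # replace with probability k / (index + 1)
    return header, reservoir


# Made-up demo data standing in for the real file.
demo = io.StringIO(
    "row ID,AIRLINE-NAME\n" + "".join(f"{i},{i % 2}\n" for i in range(100))
)
header, rows = sample_rows(demo, k=5)
```

For an actual file, something like `with open(path, newline="") as f: header, rows = sample_rows(f, k=200)` should work, where `path` points to the CSV.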

And I think more details about the actual task are necessary.

Katrina_saba · May 20, 2022, 9:11am

Hey, perfect, it’s on the Kaggle website, for airline delay in 2015.

mlauber71 · May 20, 2022, 9:55am

@Katrina_saba ok - the next step would be to understand the task at hand. Concerning machine learning, you might want to explore this collection:

If it is about text analysis and deduplication, you might want to start with the suggestions by @ScottF.

Katrina_saba · May 20, 2022, 9:57am

To make it easy: the question here is to use supervised learning, and the task involves the following:
"Below you will find 3 datasets: a training dataset for training and optimising your model (it contains the target values), an “unknown” dataset for the final model assessment (it does not have the target values - you need to predict them) and a submission sample which shows you what the file submitted to Kaggle should look like. In particular, you will need to set the column names in your submission file correctly - that is, “row ID” and “AIRLINE-NAME”. These datasets can also be found on the Kaggle competition page under the “Data” tab.

Build a classifier that classifies the “AIRLINE-NAME” attribute. The classification goal is to predict whether it is Southwest Airlines Co. or not (target attribute: AIRLINE-NAME {(binary: 0, 1), 1–> Southwest Airlines Co. and 0–> Other Airlines (Delta Airlines Inc., American Airlines Inc., United Airlines Inc.,…)}. You can do different data pre-processing and transformations (e.g. grouping values of attributes, converting them to binary, etc.), providing explanations for why you have chosen to do that. You may need to split the provided training set further into training, validation and/or test sets to accurately set the parameters and evaluate the quality of the classifier."
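
Two mechanical parts of this task description can be sketched in plain Python: splitting the training set into training and validation parts while keeping the 0/1 ratio similar in both (a stratified split), and writing a submission file with the column names “row ID” and “AIRLINE-NAME”. The function names, the 20% validation fraction, the made-up labels, and the `Row0`-style identifiers are assumptions for illustration; the real identifiers should follow the submission sample.

```python
import csv
import io
import random
from collections import defaultdict


def stratified_split(labels, valid_fraction=0.2, seed=0):
    """Return (train_idx, valid_idx), sampling each class separately."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    rng = random.Random(seed)
    train_idx, valid_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_valid = round(len(indices) * valid_fraction)
        valid_idx.extend(indices[:n_valid])
        train_idx.extend(indices[n_valid:])
    return sorted(train_idx), sorted(valid_idx)


def write_submission(out, row_ids, predictions):
    """Write predictions using the column names required by the task."""
    writer = csv.writer(out)
    writer.writerow(["row ID", "AIRLINE-NAME"])
    for row_id, prediction in zip(row_ids, predictions):
        writer.writerow([row_id, prediction])


# Made-up labels: 20 Southwest flights (1) and 80 other flights (0).
labels = [1] * 20 + [0] * 80
train_idx, valid_idx = stratified_split(labels)

# Made-up row IDs and predictions, written to an in-memory buffer.
buffer = io.StringIO()
write_submission(buffer, ["Row0", "Row1"], [1, 0])
```

To write to disk instead of a buffer, open the output file with `newline=""` so that the `csv` module controls the line endings.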