1 | !pip install pandas |
Requirement already satisfied: pandas in c:\users\produttivita\anaconda3\lib\site-packages (1.5.3) Requirement already satisfied: pytz>=2020.1 in c:\users\produttivita\anaconda3\lib\site-packages (from pandas) (2022.7.1) Requirement already satisfied: numpy>=1.20.3 in c:\users\produttivita\anaconda3\lib\site-packages (from pandas) (1.24.1) Requirement already satisfied: python-dateutil>=2.8.1 in c:\users\produttivita\anaconda3\lib\site-packages (from pandas) (2.8.2) Requirement already satisfied: six>=1.5 in c:\users\produttivita\anaconda3\lib\site-packages (from python-dateutil>=2.8.1->pandas) (1.16.0)
Obiettivo di questo progetto:
-Sperimentare l'uso di python in particolare della libreria pandas
-apprendere le basi della libreria di data visualizzation matplot
-creazione di un notebook di funzioni stock per riciclo codice
About This Data
This is a list of over 18,000 restaurants in the US that serve vegetarian or vegan food provided by Datafiniti's Business Database. The dataset includes address, city, state, business name, business categories, menu data, phone numbers, and more.
Note that this is a sample of a large dataset. The full dataset is available through Datafiniti.
Domande analisi:
Quali sono i ristoranti che compaiono maggiormente nella lista?
Quali città compaiono maggiormente nella lista?
Quanti ristoranti con scelte vegan ci sono? e quanti per ogni città?
Quanti ristoranti con scelte vegetariane ci sono? e quanti per ogni città?
Quali sono le principali categorie dei ristoranti?
Quali sono le cucine piu diffuse nel dataset?
Quali sono i prezzi medi per ristoranti vegani e vegetariani? e confronta?
Quali sono le province maggiormente presenti nel dataset?
Quanti sono i ristoranti e città che hanno come tipo di cucina Italiano? giapponese? americano? indiano?
1 2 3 4 5 6 7 | #importo pandas, numpy (per pochi utilizzi) e il dataset import pandas as pd import numpy as np import matplotlib.pyplot as plt file_path = r'C:\Users\Produttivita\Desktop\Datafiniti_Vegetarian_and_Vegan_Restaurants.csv' df = pd.read_csv(file_path) |
1 2 3 | #visualizzo il data set per vederne le caratteristiche pd.set_option('display.max_columns', None) df |
| id | dateAdded | dateUpdated | address | categories | primaryCategories | city | claimed | country | cuisines | descriptions.dateSeen | descriptions.sourceURLs | descriptions.value | facebookPageURL | features.key | features.value | hours.day | hours.dept | hours.hour | imageURLs | isClosed | keys | languagesSpoken | latitude | longitude | menuPageURL | menus.amountMax | menus.amountMin | menus.category | menus.currency | menus.dateSeen | menus.description | menus.name | menus.sourceURLs | name | paymentTypes | phones | postalCode | priceRangeCurrency | priceRangeMin | priceRangeMax | province | sic | sourceURLs | websites | yearOpened | ||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | AVwd3yXEkufWRAb59-sH | 2016-04-22T02:47:48Z | 2018-09-10T21:00:49Z | 1045 San Pablo Ave | Restaurant,Asian/Pacific,Cafe,Vegetarian / Veg... | Accommodation & Food Services | Albany | NaN | US | Thai,Asian/Pacific,Vegetarian | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | https://igx.4sqi.net/img/general/600x600/o_h1i... | NaN | us/ca/albany/1045sanpabloave/-1106202297 | NaN | 37.885131 | -122.297013 | http://www.singlepage.com/ruen-pair-authentic-... | 13.95 | 3.75 | NaN | USD | 2016-04-22T02:48:00.000Z | Squash, carrots, butternut squash and potato | Baked Kabocha | https://foursquare.com/v/potala-organic-cafe/4... | Potala Organic Cafe | NaN | (510) 528-2375 | 94706 | NaN | NaN | NaN | CA | NaN | https://foursquare.com/v/potala-organic-cafe/4... | NaN | http://www.potala.us/,http://potala.us | NaN |
| 1 | AVwd3yXEkufWRAb59-sH | 2016-04-22T02:47:48Z | 2018-09-10T21:00:49Z | 1045 San Pablo Ave | Restaurant,Asian/Pacific,Cafe,Vegetarian / Veg... | Accommodation & Food Services | Albany | NaN | US | Thai,Asian/Pacific,Vegetarian | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | https://igx.4sqi.net/img/general/600x600/o_h1i... | NaN | us/ca/albany/1045sanpabloave/-1106202297 | NaN | 37.885131 | -122.297013 | http://www.singlepage.com/ruen-pair-authentic-... | 13.95 | 3.75 | NaN | USD | 2016-04-22T02:48:00.000Z | With wild blend brown rice | Short Grain Brown Rice | https://foursquare.com/v/potala-organic-cafe/4... | Potala Organic Cafe | NaN | (510) 528-2375 | 94706 | NaN | NaN | NaN | CA | NaN | https://foursquare.com/v/potala-organic-cafe/4... | NaN | http://www.potala.us/,http://potala.us | NaN |
| 2 | AVwd3yXEkufWRAb59-sH | 2016-04-22T02:47:48Z | 2018-09-10T21:00:49Z | 1045 San Pablo Ave | Restaurant,Asian/Pacific,Cafe,Vegetarian / Veg... | Accommodation & Food Services | Albany | NaN | US | Thai,Asian/Pacific,Vegetarian | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | https://igx.4sqi.net/img/general/600x600/o_h1i... | NaN | us/ca/albany/1045sanpabloave/-1106202297 | NaN | 37.885131 | -122.297013 | http://www.singlepage.com/ruen-pair-authentic-... | 13.95 | 3.75 | NaN | USD | 2016-04-22T02:48:00.000Z | NaN | Green Bean Veggie Soup | https://foursquare.com/v/potala-organic-cafe/4... | Potala Organic Cafe | NaN | (510) 528-2375 | 94706 | NaN | NaN | NaN | CA | NaN | https://foursquare.com/v/potala-organic-cafe/4... | NaN | http://www.potala.us/,http://potala.us | NaN |
| 3 | AVwd3yXEkufWRAb59-sH | 2016-04-22T02:47:48Z | 2018-09-10T21:00:49Z | 1045 San Pablo Ave | Restaurant,Asian/Pacific,Cafe,Vegetarian / Veg... | Accommodation & Food Services | Albany | NaN | US | Thai,Asian/Pacific,Vegetarian | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | https://igx.4sqi.net/img/general/600x600/o_h1i... | NaN | us/ca/albany/1045sanpabloave/-1106202297 | NaN | 37.885131 | -122.297013 | http://www.singlepage.com/ruen-pair-authentic-... | 8.95 | 8.95 | NaN | USD | 2016-04-22T02:48:00.000Z | NaN | Steamed Broccoli, Carrots and Turnips | https://foursquare.com/v/potala-organic-cafe/4... | Potala Organic Cafe | NaN | (510) 528-2375 | 94706 | NaN | NaN | NaN | CA | NaN | https://foursquare.com/v/potala-organic-cafe/4... | NaN | http://www.potala.us/,http://potala.us | NaN |
| 4 | AVwd3yXEkufWRAb59-sH | 2016-04-22T02:47:48Z | 2018-09-10T21:00:49Z | 1045 San Pablo Ave | Restaurant,Asian/Pacific,Cafe,Vegetarian / Veg... | Accommodation & Food Services | Albany | NaN | US | Thai,Asian/Pacific,Vegetarian | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | https://igx.4sqi.net/img/general/600x600/o_h1i... | NaN | us/ca/albany/1045sanpabloave/-1106202297 | NaN | 37.885131 | -122.297013 | http://www.singlepage.com/ruen-pair-authentic-... | 13.95 | 3.75 | NaN | USD | 2016-04-22T02:48:00.000Z | With vinegar tahini beet dressing | Mixed Green Salad | https://foursquare.com/v/potala-organic-cafe/4... | Potala Organic Cafe | NaN | (510) 528-2375 | 94706 | NaN | NaN | NaN | CA | NaN | https://foursquare.com/v/potala-organic-cafe/4... | NaN | http://www.potala.us/,http://potala.us | NaN |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 9995 | AVwd88lK_7pvs4fz-toI | 2016-03-28T05:35:40Z | 2018-01-09T08:01:08Z | 1335 W Thomas Rd | Vegetarian / Vegan Restaurant,Restaurant,Medit... | Accommodation & Food Services | Phoenix | NaN | US | Mediterranean,Vegetarian,Persian,Middle Eastern | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | https://cdn1.gbot.me/photos/6D/oK/1475183225/-... | NaN | us/az/phoenix/1335wthomasrd/-1848873742 | NaN | 33.480480 | -112.090640 | http://www.singlepage.com/persian-garden-cafe | 8.00 | 8.00 | NaN | USD | 2017-12-17T23:32:00.000Z,2016-03-28T05:36:00.000Z | NaN | Vegan Carob Chocolate Cheesecake | https://foursquare.com/v/persian-garden-cafe/4... | Persian Garden Cafe | NaN | (602) 263-1915 | 85013 | USD | 40.0 | 55.0 | AZ | NaN | https://foursquare.com/v/persian-garden-cafe/4... | NaN | http://persiangardencafe.com,http://www.persia... | NaN |
| 9996 | AVwd88lK_7pvs4fz-toI | 2016-03-28T05:35:40Z | 2018-01-09T08:01:08Z | 1335 W Thomas Rd | Vegetarian / Vegan Restaurant,Restaurant,Medit... | Accommodation & Food Services | Phoenix | NaN | US | Mediterranean,Vegetarian,Persian,Middle Eastern | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | https://cdn1.gbot.me/photos/6D/oK/1475183225/-... | NaN | us/az/phoenix/1335wthomasrd/-1848873742 | NaN | 33.480480 | -112.090640 | http://www.singlepage.com/persian-garden-cafe | 8.00 | 8.00 | NaN | USD | 2017-12-17T23:32:00.000Z | NaN | Mediterranean Salad | https://foursquare.com/v/persian-garden-cafe/4... | Persian Garden Cafe | NaN | (602) 263-1915 | 85013 | USD | 40.0 | 55.0 | AZ | NaN | https://foursquare.com/v/persian-garden-cafe/4... | NaN | http://persiangardencafe.com,http://www.persia... | NaN |
| 9997 | AVwd88lK_7pvs4fz-toI | 2016-03-28T05:35:40Z | 2018-01-09T08:01:08Z | 1335 W Thomas Rd | Vegetarian / Vegan Restaurant,Restaurant,Medit... | Accommodation & Food Services | Phoenix | NaN | US | Mediterranean,Vegetarian,Persian,Middle Eastern | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | https://cdn1.gbot.me/photos/6D/oK/1475183225/-... | NaN | us/az/phoenix/1335wthomasrd/-1848873742 | NaN | 33.480480 | -112.090640 | http://www.singlepage.com/persian-garden-cafe | 7.00 | 7.00 | NaN | USD | 2017-12-17T23:32:00.000Z,2016-03-28T05:36:00.000Z | NaN | House Garden Salad | https://foursquare.com/v/persian-garden-cafe/4... | Persian Garden Cafe | NaN | (602) 263-1915 | 85013 | USD | 40.0 | 55.0 | AZ | NaN | https://foursquare.com/v/persian-garden-cafe/4... | NaN | http://persiangardencafe.com,http://www.persia... | NaN |
| 9998 | AVwd88lK_7pvs4fz-toI | 2016-03-28T05:35:40Z | 2018-01-09T08:01:08Z | 1335 W Thomas Rd | Vegetarian / Vegan Restaurant,Restaurant,Medit... | Accommodation & Food Services | Phoenix | NaN | US | Mediterranean,Vegetarian,Persian,Middle Eastern | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | https://cdn1.gbot.me/photos/6D/oK/1475183225/-... | NaN | us/az/phoenix/1335wthomasrd/-1848873742 | NaN | 33.480480 | -112.090640 | http://www.singlepage.com/persian-garden-cafe | 6.50 | 6.50 | NaN | USD | 2017-12-17T23:32:00.000Z | NaN | Mango Ice Cream | https://foursquare.com/v/persian-garden-cafe/4... | Persian Garden Cafe | NaN | (602) 263-1915 | 85013 | USD | 40.0 | 55.0 | AZ | NaN | https://foursquare.com/v/persian-garden-cafe/4... | NaN | http://persiangardencafe.com,http://www.persia... | NaN |
| 9999 | AVwd88lK_7pvs4fz-toI | 2016-03-28T05:35:40Z | 2018-01-09T08:01:08Z | 1335 W Thomas Rd | Vegetarian / Vegan Restaurant,Restaurant,Medit... | Accommodation & Food Services | Phoenix | NaN | US | Mediterranean,Vegetarian,Persian,Middle Eastern | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | https://cdn1.gbot.me/photos/6D/oK/1475183225/-... | NaN | us/az/phoenix/1335wthomasrd/-1848873742 | NaN | 33.480480 | -112.090640 | http://www.singlepage.com/persian-garden-cafe | 4.50 | 4.50 | NaN | USD | 2017-12-17T23:32:00.000Z | NaN | Freash Carrot Apple Ginger | https://foursquare.com/v/persian-garden-cafe/4... | Persian Garden Cafe | NaN | (602) 263-1915 | 85013 | USD | 40.0 | 55.0 | AZ | NaN | https://foursquare.com/v/persian-garden-cafe/4... | NaN | http://persiangardencafe.com,http://www.persia... | NaN |
10000 rows × 47 columns
1 2 3 | #vedere forme df.shape |
(10000, 47)
1 2 3 | #vedo valori in generale del dataframe df.describe() |
| descriptions.dateSeen | descriptions.sourceURLs | descriptions.value | features.key | features.value | hours.day | hours.dept | hours.hour | languagesSpoken | latitude | longitude | menus.amountMax | menus.amountMin | priceRangeMin | priceRangeMax | sic | yearOpened | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 10000.000000 | 10000.000000 | 10000.000000 | 10000.000000 | 6327.000000 | 6327.000000 | 140.000000 | 91.0 |
| mean | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 38.904276 | -85.139809 | 12.641530 | 12.552005 | 21.414731 | 39.212739 | 5874.200000 | 2015.0 |
| std | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 4.720011 | 17.508638 | 39.541238 | 39.532886 | 14.420910 | 10.539746 | 960.905203 | 0.0 |
| min | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 20.734835 | -156.452451 | 0.000000 | 0.000000 | 0.000000 | 12.000000 | 4773.000000 | 2015.0 |
| 25% | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 37.800320 | -87.975883 | 4.000000 | 4.000000 | 0.000000 | 25.000000 | 5610.000000 | 2015.0 |
| 50% | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 40.726279 | -74.000703 | 7.000000 | 7.000000 | 25.000000 | 40.000000 | 5610.000000 | 2015.0 |
| 75% | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 40.758343 | -73.979965 | 10.950000 | 10.950000 | 25.000000 | 40.000000 | 5610.000000 | 2015.0 |
| max | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 47.682262 | -69.721340 | 2500.000000 | 2500.000000 | 40.000000 | 55.000000 | 8129.000000 | 2015.0 |
1 2 3 | #vedo valori piu frequenti df.mode().values[0] |
array(['AV0BJkuP-gnIPe8DUvYQ', '2017-10-18T16:27:40Z',
'2018-07-19T21:03:58Z', '311 W 43rd St',
'Chinese Restaurant,Restaurant,Cocktails,Vegetarian,Dim Sum,Chinese,Dim Sum Restaurant',
'Accommodation & Food Services', 'New York', 'yellowpages.com',
'US', 'Vegetarian,Indian', nan, nan, nan,
'https://www.facebook.com/ImperialRestaurantPDX', nan, nan, nan,
nan, nan,
'http://i2.ypcdn.com/blob/077b38d95103a7a69314f01aab0801fde8bc1cb0_74x74_crop.jpg,http://i2.ypcdn.com/blob/077b38d95103a7a69314f01aab0801fde8bc1cb0_70x70_crop.jpg',
False, 'us/ny/newyork/311w43rdst/-77668488', nan, 40.758343,
-73.990152, 'http://www.menupages.com/restaurants/hakkasan/menu',
4.0, 4.0, 'Beverages', 'USD',
'2016-03-22T03:41:04Z,2016-06-05T00:09:12Z', 'Vegetarian', 'Soda',
'https://foursquare.com/v/hakkasan/4ea832fe8b8154b19ffa31ad/menu',
'Hakkasan', 'AMEX', '(212) 776-1818', '10003', 'USD', 25.0, 40.0,
'NY', 5610.0,
'https://foursquare.com/v/hakkasan/4ea832fe8b8154b19ffa31ad,http://www.menupages.com/restaurants/hakkasan/menu,http://www.menupages.com/restaurants/hakkasan/,https://foursquare.com/v/hakkasan/4ea832fe8b8154b19ffa31ad/menu,https://www.allmenus.com/ny/new-york/303985-hakkasan/menu/',
'ImperialPDX',
'http://www.hakkasan.com/newyork/,http://www.hakkasan.com', 2015.0],
dtype=object)
PULIZIA¶
1 2 3 4 | #rendo tutte le colonne lowercase df.columns = df.columns.str.lower() df.head(2) |
| id | dateadded | dateupdated | address | categories | primarycategories | city | claimed | country | cuisines | descriptions.dateseen | descriptions.sourceurls | descriptions.value | facebookpageurl | features.key | features.value | hours.day | hours.dept | hours.hour | imageurls | isclosed | keys | languagesspoken | latitude | longitude | menupageurl | menus.amountmax | menus.amountmin | menus.category | menus.currency | menus.dateseen | menus.description | menus.name | menus.sourceurls | name | paymenttypes | phones | postalcode | pricerangecurrency | pricerangemin | pricerangemax | province | sic | sourceurls | websites | yearopened | ||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | AVwd3yXEkufWRAb59-sH | 2016-04-22T02:47:48Z | 2018-09-10T21:00:49Z | 1045 San Pablo Ave | Restaurant,Asian/Pacific,Cafe,Vegetarian / Veg... | Accommodation & Food Services | Albany | NaN | US | Thai,Asian/Pacific,Vegetarian | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | https://igx.4sqi.net/img/general/600x600/o_h1i... | NaN | us/ca/albany/1045sanpabloave/-1106202297 | NaN | 37.885131 | -122.297013 | http://www.singlepage.com/ruen-pair-authentic-... | 13.95 | 3.75 | NaN | USD | 2016-04-22T02:48:00.000Z | Squash, carrots, butternut squash and potato | Baked Kabocha | https://foursquare.com/v/potala-organic-cafe/4... | Potala Organic Cafe | NaN | (510) 528-2375 | 94706 | NaN | NaN | NaN | CA | NaN | https://foursquare.com/v/potala-organic-cafe/4... | NaN | http://www.potala.us/,http://potala.us | NaN |
| 1 | AVwd3yXEkufWRAb59-sH | 2016-04-22T02:47:48Z | 2018-09-10T21:00:49Z | 1045 San Pablo Ave | Restaurant,Asian/Pacific,Cafe,Vegetarian / Veg... | Accommodation & Food Services | Albany | NaN | US | Thai,Asian/Pacific,Vegetarian | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | https://igx.4sqi.net/img/general/600x600/o_h1i... | NaN | us/ca/albany/1045sanpabloave/-1106202297 | NaN | 37.885131 | -122.297013 | http://www.singlepage.com/ruen-pair-authentic-... | 13.95 | 3.75 | NaN | USD | 2016-04-22T02:48:00.000Z | With wild blend brown rice | Short Grain Brown Rice | https://foursquare.com/v/potala-organic-cafe/4... | Potala Organic Cafe | NaN | (510) 528-2375 | 94706 | NaN | NaN | NaN | CA | NaN | https://foursquare.com/v/potala-organic-cafe/4... | NaN | http://www.potala.us/,http://potala.us | NaN |
1 2 3 4 5 | #elimino righe doppioni print(df.shape[0]) df = df.drop_duplicates() print(df.shape[0]) |
10000 10000
1 2 3 | # conta i valori nulli delle colonne df.isnull().sum() |
id 0 dateadded 0 dateupdated 0 address 0 categories 0 primarycategories 0 city 0 claimed 9311 country 0 cuisines 0 descriptions.dateseen 10000 descriptions.sourceurls 10000 descriptions.value 10000 facebookpageurl 9063 features.key 10000 features.value 10000 hours.day 10000 hours.dept 10000 hours.hour 10000 imageurls 4866 isclosed 9963 keys 0 languagesspoken 10000 latitude 0 longitude 0 menupageurl 0 menus.amountmax 0 menus.amountmin 0 menus.category 4070 menus.currency 53 menus.dateseen 0 menus.description 6013 menus.name 0 menus.sourceurls 0 name 0 paymenttypes 6127 phones 0 postalcode 0 pricerangecurrency 3673 pricerangemin 3673 pricerangemax 3673 province 0 sic 9860 sourceurls 0 twitter 8042 websites 1817 yearopened 9909 dtype: int64
1 2 3 4 5 6 7 8 | #elimino tutte le colonne con troppi valori nulli df = df.drop(["sic","yearopened","languagesspoken"], axis=1) #elimino solo quelli cui Nan non puo corrispondere ad una risposta negativa # o inutili all'analisi df = df.drop(["dateadded","dateupdated"], axis=1) |
1 2 3 4 5 6 7 8 9 10 11 | #elimino le colone a mio parere inutili al fine dell'analisi n_col1= df.shape[1] print(n_col1) df = df.drop(["descriptions.dateseen","descriptions.sourceurls","descriptions.value", "facebookpageurl","features.key","features.value","hours.day","hours.dept", "hours.hour","imageurls","twitter","websites","menupageurl","sourceurls","phones", "menus.sourceurls" ], axis=1) n_col2 = df.shape[1] print(n_col2) print("col drop: "+ str(n_col1-n_col2)) # risultato sarà le colonne eliminate |
42 26 col drop: 16
1 2 3 4 5 6 7 8 9 10 | #esploro colonna per possibile eliminazione #df["paymenttypes"].head(10) #elimino e elimino altre colonne inutili df=df.drop(["paymenttypes"],axis=1) df=df.drop(["menus.description"],axis=1) df=df.drop(["menus.dateseen"],axis=1) df=df.drop(["menus.currency"],axis=1) df=df.drop(["keys"],axis=1) df=df.drop(["pricerangecurrency"],axis=1) |
1 2 3 4 5 6 7 8 9 10 11 12 | #controllo colonne presenti in df def controllo_colonna(x,y,z,w,q): columns = [x, y, z, w,q] if any(col in df.columns for col in columns): print(True) else: print(False) controllo_colonna("paymenttypes","menus.description", "menus.dateSeen","menus.currency","keys") #nella funzione puoi inseire altre per verificare altre 5 colonne |
False
1 2 3 4 5 6 7 8 9 | #elimino le ultime colonne/righe non necessarie e rinomino le restanti per agevolarmi x=set(df["isclosed"]) print(x) y=set(df["claimed"]) print(y) df=df.drop(["isclosed"],axis=1) #per troppi valori non possibile da definire df=df.drop(["claimed"],axis=1) #pensavo che i Nan fosse non chiuso ma esiste sia True/False df=df.drop(["menus.name"],axis=1) |
{nan, True, False}
{nan, 'yellowpages.com', 'foursquare.com'}
1 2 3 | # conta i valori nulli delle colonne df.isnull().sum() |
id 0 address 0 categories 0 primarycategories 0 city 0 country 0 cuisines 0 latitude 0 longitude 0 menus.amountmax 0 menus.amountmin 0 menus.category 4070 name 0 postalcode 0 pricerangemin 3673 pricerangemax 3673 province 0 dtype: int64
1 2 3 4 | # verifico le tipoligie di dati nelle colonne df.dtypes #df.head(2) |
id object address object categories object primarycategories object city object country object cuisines object latitude float64 longitude float64 menus.amountmax float64 menus.amountmin float64 menus.category object name object postalcode object pricerangemin float64 pricerangemax float64 province object dtype: object
1 2 3 4 5 6 7 8 9 10 11 12 | #modico la tipologia delle colonne #problema conversione postalcode da object per valori con - df["name"].loc[9830] df.loc[df.index=="Kacha Bistro","postalcode"]= 94596 seleziona_kacha = df.loc[df["name"] == "Kacha Bistro"] #vedo tutte le righe di Kacha df.loc[9830, "postalcode"] = 94596 df.loc[9831, "postalcode"] = 94596 df.loc[9832, "postalcode"] = 94596 seleziona_kacha.head(10) df["postalcode"]=pd.to_numeric(df['postalcode']) # run dopo precedenti #conversione avvenuta da object in int |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 | #rinomino le colonne in italiano per agevolare analisi def cambia_nome(coln,ncol): df.rename(columns={coln: ncol}, inplace=True) cambia_nome("city","città") cambia_nome("categories","categorie") cambia_nome("address","indirizzo") cambia_nome("postalcode","cap") cambia_nome("name","ristorante") cambia_nome("primarycategories","categoria_principale") cambia_nome("cuisines","tipo_cucina") cambia_nome("menus.amountmax","costo_max_menu") cambia_nome("menus.amountmin","costo_min_menu") cambia_nome("menus.category","categoria_menu") cambia_nome("pricerangemin","prezzo_range_min") cambia_nome("pricerangemax","prezzo_range_max") cambia_nome("province","provincia") |
ESPLORAZIONI - ANALISI¶
1 2 3 | #creo copia dataframe df_dup= df.copy() |
1 2 3 | #Visualizzo valori unici df.nunique() |
id 211 indirizzo 204 categorie 166 categoria_principale 6 città 81 country 1 tipo_cucina 149 latitude 204 longitude 204 costo_max_menu 431 costo_min_menu 432 categoria_menu 818 ristorante 209 cap 146 prezzo_range_min 6 prezzo_range_max 7 provincia 25 dtype: int64
1 2 3 4 5 | #conto quante volte un ristorante compare nella lista e la città df_count = df["ristorante"].value_counts().reset_index() df_count.columns = ["ristorante","count"] df_count.head(209) |
| ristorante | count | |
|---|---|---|
| 0 | Hakkasan | 375 |
| 1 | Liquiteria | 310 |
| 2 | Imperial | 308 |
| 3 | Vegetarian Dim Sum House | 265 |
| 4 | Six Penn Kitchen | 217 |
| ... | ... | ... |
| 204 | Stinky's Fish Camp | 3 |
| 205 | Vegetarian Restaurant by Hakin | 2 |
| 206 | Nice China Town | 1 |
| 207 | Tuscan Grill | 1 |
| 208 | David Magen Pizza | 1 |
209 rows × 2 columns
1 2 3 4 5 | #conta quante volte un citta compare nella lista df_count = df["città"].value_counts().reset_index() df_count.columns = ["città","count"] df_count.head(81) |
| città | count | |
|---|---|---|
| 0 | New York | 3395 |
| 1 | Brooklyn | 1188 |
| 2 | Portland | 316 |
| 3 | Chicago | 265 |
| 4 | Charlotte | 232 |
| ... | ... | ... |
| 76 | Escondido | 5 |
| 77 | Redmond | 4 |
| 78 | Walnut Creek | 3 |
| 79 | Santa Rosa Beach | 3 |
| 80 | Waltham | 1 |
81 rows × 2 columns
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 | # Conto quanti ristoranti con scelte vegetariane per ogni città # Creo un dataframe con solo le categorie vegetariane df_vegetarian = df[df['tipo_cucina'].str.contains("Vegetarian")] # Raggruppo le città contando i ristoranti df_group = df_vegetarian.groupby("città")["ristorante"].nunique().reset_index(name='num_ristoranti') # Filtro per ordine decrescente df_desc= df_group.sort_values(by="num_ristoranti",ascending=False) # Stampa le prime 25 righe print(df_desc.head(25)) # Calcola la somma totale dei ristoranti somma_ristoranti = df_desc["num_ristoranti"].sum() # Stampa il risultato print("******") print("Somma ristoranti: ",somma_ristoranti) # somma totale finale |
città num_ristoranti 45 New York 65 10 Brooklyn 21 36 Los Angeles 5 60 San Diego 5 18 Chicago 5 29 Houston 4 25 Flushing 3 61 San Francisco 3 17 Charlotte 3 30 Jackson Heights 3 51 Phoenix 3 7 Berkeley 2 59 Salt Lake City 2 3 Astoria 2 62 San Jose 2 6 Bellerose 2 38 Miami 2 65 Seattle 2 53 Portland 2 69 Tempe 2 70 Torrance 2 52 Pittsburgh 2 78 Woodside 2 50 Philadelphia 1 54 Raleigh 1 ****** Somma ristoranti: 200
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 | #conto quanti ristoranti con scelte vegan per ogni città # Filtra il dataframe in base alla categoria "Vegan" df_vegan = df[df['tipo_cucina'].str.contains("Vegan")] # Raggruppa i dati in base alla città e conta il numero unico di ristoranti df_group2 = df_vegan.groupby("città")["ristorante"].nunique().reset_index(name='num_ristoranti') # Ordina i dati in base al numero di ristoranti in modo decrescente df_desc2= df_group2.sort_values(by="num_ristoranti",ascending=False) # Stampa le prime 25 righe print(df_desc2.head(25)) # Calcola la somma totale dei ristoranti somma_ristoranti2= df_desc["num_ristoranti"].sum() # Stampa il risultato print("******") print("Somma ristoranti: ",somma_ristoranti2) |
città num_ristoranti 19 New York 7 5 Brooklyn 6 31 Seattle 2 10 Chicago 2 0 Arlington 1 26 Salt Lake City 1 21 Philadelphia 1 22 Phoenix 1 23 Pittsburgh 1 24 Portland 1 25 Richmond 1 28 San Francisco 1 27 San Diego 1 29 Santa Clara 1 30 Santa Rosa Beach 1 32 Skowhegan 1 33 Tahoe City 1 34 Torrance 1 20 Newport Beach 1 18 Neenah 1 1 Asbury Park 1 17 Miami 1 16 Los Angeles 1 15 Long Island City 1 14 Houston 1 ****** Somma ristoranti: 200
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 | # Aggiungo manualmente la popolazione delle 10 città piu popolose nel dataset df_desc_copy = df_desc.copy() df_desc_copy.loc[df_desc_copy["città"] == "New York", "popolazione"] = 8398748 df_desc_copy.loc[df_desc_copy["città"] == "Los Angeles", "popolazione"] = 3990456 df_desc_copy.loc[df_desc_copy["città"] == "Chicago", "popolazione"] = 2705994 df_desc_copy.loc[df_desc_copy["città"] == "San Jose", "popolazione"] = 1030119 df_desc_copy.loc[df_desc_copy["città"] == "San Francisco", "popolazione"] = 883305 df_desc_copy.loc[df_desc_copy["città"] == "Houston", "popolazione"] = 2325502 df_desc_copy.loc[df_desc_copy["città"] == "Philadelphia", "popolazione"] = 1584138 df_desc_copy.loc[df_desc_copy["città"] == "San Diego", "popolazione"] = 1425976 df_desc_copy.loc[df_desc_copy["città"] == "Phoenix", "popolazione"] = 1660272 df_desc_copy.loc[df_desc_copy["città"] == "Charlotte", "popolazione"] =872498 # Calcola il numero di ristoranti ogni 100k abitanti df_desc_copy["rist/100k abit"] = df_desc_copy["num_ristoranti"] / df_desc_copy["popolazione"] * 10000 # Sostituisce i valori mancanti con 0 df_desc_copy.fillna(0, inplace=True) # Converte la colonna "popolazione" in intero df_desc_copy["popolazione"] = df_desc_copy["popolazione"].astype(int) # Ordina il dataframe in base alla colonna "rist/100k abit" in ordine decrescente df_desc_copy.sort_values(by="rist/100k abit", ascending=False) # Crea una serie di città top 10 serie10 = {"New York","San Diego","Charlotte", "San Francisco","Chicago","San Jose", "Philadelphia","Phoenix","Los Angeles"} # Filtra solo le città top 10 citytop10 = df_desc_copy["città"].isin(serie10) df_desc_copy = df_desc_copy[citytop10] # Stampa il dataframe filtrato print(df_desc_copy) |
città num_ristoranti popolazione rist/100k abit 31 New York 61 8398748 0.072630 37 San Diego 5 1425976 0.035064 12 Chicago 3 2705994 0.011086 26 Los Angeles 2 3990456 0.005012 34 Philadelphia 1 1584138 0.006313 35 Phoenix 1 1660272 0.006023 38 San Francisco 1 883305 0.011321 39 San Jose 1 1030119 0.009708 11 Charlotte 1 872498 0.011461
1 2 3 4 5 6 | #conto quali sono le categorie di ristorante piu diffuse top 10 #raggruppo per categorie principale e conto le città df_group3 = df.groupby("categoria_principale")["ristorante"].nunique().reset_index(name='conta categorie') df_cat= df_group3.sort_values(by="conta categorie",ascending=False) print(df_cat.head(10)) |
categoria_principale conta categorie 0 Accommodation & Food Services 200 3 Retail 4 4 Wholesale Trade,Accommodation & Food Services 2 1 Accommodation & Food Services,Information 1 2 Arts Entertainment & Recreation 1 5 Wholesale Trade,Accommodation & Food Services,... 1
1 2 3 4 5 | #conto quali sono i tipi di cucina piu diffuse df_group4 = df.groupby("tipo_cucina")["ristorante"].nunique().reset_index(name='conta tipo cucina') df_cuc= df_group4.sort_values(by="conta tipo cucina",ascending=False) print(df_cuc.head(10)) |
tipo_cucina conta tipo cucina 129 Vegetarian,Indian 16 47 Indian,Vegetarian 12 140 Vegetarian,Salads 9 89 Salads,Vegetarian 5 111 Vegetarian 5 70 Mexican,Vegetarian 5 114 Vegetarian,Breakfast and Brunch 3 40 Health Food,Vegetarian 2 78 Middle Eastern,Vegetarian 2 101 Thai,Vegetarian 2
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 | # creo dei data frame sulle città presenti cucina ita-jap-usa-ind con n_rist def raggruppa_str(df,colstr,string,colindx,colval): df_str = df[df[colstr].str.contains(string)] df_group = df_str.groupby(colindx)[colval].nunique().reset_index(name='num_values') return df_group #utilizzo la funzione per raggruppare per città contanto i ristoranti con tipo ristoranti df_italian =raggruppa_str(df,"categorie","Italian","città","ristorante") df_japanese=raggruppa_str(df,"categorie","Japanese","città","ristorante") df_american=raggruppa_str(df,"categorie","American","città","ristorante") df_indian=raggruppa_str(df,"categorie","Indian","città","ristorante") #rinomino la colonna num_values df_japanese.rename(columns={"num_values": "n_val(Jap)"}, inplace=True) df_italian.rename(columns={"num_values": "n_val(Ita)"}, inplace=True) df_american.rename(columns={"num_values": "n_val(Usa)"}, inplace=True) df_indian.rename(columns={"num_values": "n_val(Ind)"}, inplace=True) '''creo 4 tabelle per tipo di cucina contando città, i ristoranti e visualizzando le città e il num di ristoranti''' print("ITALIANO") print("n_città: ", df_italian["città"].nunique()) print("n_rist: ",df_italian["n_val(Ita)"].sum()) print(df_italian) print("---------------------------------") print("AMERICANO") print("n_città: ", df_american["città"].nunique()) print("n_rist: ",df_american["n_val(Usa)"].sum()) print(df_american) print("---------------------------------") print("GIAPPONESE") print("n_città", df_japanese["città"].nunique()) print("n_rist: ",df_japanese["n_val(Jap)"].sum()) print(df_japanese) print("---------------------------------") print("INDIANO") print("n_città", df_indian["città"].nunique()) print("n_rist: ",df_indian["n_val(Ind)"].sum()) print(df_indian) |
ITALIANO
n_città: 5
n_rist: 6
città n_val(Ita)
0 Berkeley 1
1 Chicago 1
2 Houston 1
3 New York 2
4 Raleigh 1
---------------------------------
AMERICANO
n_città: 23
n_rist: 27
città n_val(Usa)
0 Arlington 1
1 Asbury Park 1
2 Atlanta 1
3 Austin 1
4 Brooklyn 2
5 Canoga Park 1
6 Champaign 1
7 Charlotte 2
8 Concord 1
9 Encino 1
10 Flushing 1
11 Los Angeles 1
12 Miami 1
13 Minneapolis 1
14 New York 3
15 Pittsburgh 1
16 Portland 1
17 Rancho Santa Fe 1
18 Salt Lake City 1
19 San Francisco 1
20 Seattle 1
21 Torrance 1
22 Woodside 1
---------------------------------
GIAPPONESE
n_città 5
n_rist: 8
città n_val(Jap)
0 Brooklyn 2
1 College Point 1
2 New York 3
3 Reno 1
4 San Francisco 1
---------------------------------
INDIANO
n_città 19
n_rist: 35
città n_val(Ind)
0 Bellerose 2
1 Berkeley 1
2 Boston 1
3 Brooklyn 3
4 Canoga Park 1
5 Diamond Bar 1
6 Flushing 1
7 Glen Oaks 1
8 Glendale 1
9 Jackson Heights 3
10 Jamaica 1
11 Kew Gardens 1
12 New York 11
13 Overland Park 1
14 Richmond 1
15 San Diego 2
16 Santa Fe 1
17 South Richmond Hill 1
18 Westmont 1
1 2 3 4 5 6 7 8 9 10 | #cambio valori prezzi mettendo nan ai valori inferiori a 1 usd e vedo in dettaglio #modifico valori zero sosituendoli con 1 usd colonne_interessate = ["costo_max_menu","costo_min_menu", "prezzo_range_min","prezzo_range_max"] df[colonne_interessate] = df[colonne_interessate].applymap(lambda x: np.nan if x<1 else x) #visualizzo dati statistici df[["costo_max_menu","costo_min_menu", "prezzo_range_min","prezzo_range_max"]].describe() |
| costo_max_menu | costo_min_menu | prezzo_range_min | prezzo_range_max | |
|---|---|---|---|---|
| count | 9867.000000 | 9866.000000 | 4609.000000 | 6327.000000 |
| mean | 12.806658 | 12.717135 | 29.397049 | 39.212739 |
| std | 39.781082 | 39.774878 | 7.126696 | 10.539746 |
| min | 1.000000 | 1.000000 | 7.000000 | 12.000000 |
| 25% | 4.000000 | 4.000000 | 25.000000 | 25.000000 |
| 50% | 7.000000 | 7.000000 | 25.000000 | 40.000000 |
| 75% | 10.950000 | 10.950000 | 40.000000 | 40.000000 |
| max | 2500.000000 | 2500.000000 | 40.000000 | 55.000000 |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 | #calcolo i valori describe per entrambi i stili alimentari #creo copie per evitare errore df_vegan_copy = df_vegan.copy() df_vegetarian_copy = df_vegetarian.copy() #modifico valori zeor in 1 usd df_vegan_copy.loc[:, colonne_interessate] = df_vegan_copy[colonne_interessate].applymap(lambda x: np.nan if x<1 else x) df_vegetarian_copy.loc[:, colonne_interessate] = df_vegetarian_copy[colonne_interessate].applymap(lambda x: np.nan if x<1 else x) #visualizzo media min e max delle colonne interessate (create prec codice) df_vegan_stat = df_vegan_copy[colonne_interessate].agg(["mean", "min", "max"]) df_vegetarian_stat = df_vegetarian_copy[colonne_interessate].agg(["mean", "min", "max"]) #concateno i due df statistici con keys reciproche (veg/veget) df_v_stat= pd.concat([df_vegan_stat,df_vegetarian_stat],keys=["vegan","vegetarian"]) df_v_stat |
| costo_max_menu | costo_min_menu | prezzo_range_min | prezzo_range_max | ||
|---|---|---|---|---|---|
| vegan | mean | 7.668826 | 7.473961 | 27.275168 | 36.269702 |
| min | 1.000000 | 1.000000 | 25.000000 | 25.000000 | |
| max | 117.500000 | 117.500000 | 40.000000 | 55.000000 | |
| vegetarian | mean | 12.336781 | 12.302281 | 30.028122 | 39.654969 |
| min | 1.000000 | 1.000000 | 7.000000 | 12.000000 | |
| max | 2500.000000 | 2500.000000 | 40.000000 | 55.000000 |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 | #confronto i valori medi dei prezzi range e prezzi menu con un grafico tra vegan e vegetarian # Calcolo della media dei valori della colonna "prezzo_range_min" e "prezzo_range_max" per i dati vegan df_vegan_m = df_vegan[["prezzo_range_min","prezzo_range_max"]].mean() # Calcolo della media dei valori della colonna "prezzo_range_min" e "prezzo_range_max" per i dati vegetarian df_vegetarian_m = df_vegetarian[["prezzo_range_min","prezzo_range_max"]].mean() # Impostazione della larghezza della barra bar_width = 0.35 # Impostazione della posizione delle x x_pos = [0, 1] # Creazione di un grafico a barre con le medie calcolate plt.figure(figsize=(8,8)) #inserisco le due colonne a confronto plt.bar(x_pos, df_vegan_m, bar_width, color='blue') plt.bar([i + bar_width for i in x_pos], df_vegetarian_m, bar_width, color='red') #aggiungo sotto le colonne le etichette plt.xticks([i + bar_width / 2 for i in x_pos], ["prezzo_range_min","prezzo_range_max"]) #etichetta e titoli grafico plt.ylabel('Media') plt.title('Confronto tra le medie delle colonne vegan e vegetarian') # Aggiunta del valore della media come etichetta su ogni barra for i, v in enumerate(df_vegan_m.values): plt.text(i, v + 3, str(round(v,2)), color='black', fontweight='normal',fontsize=10) for i, v in enumerate(df_vegetarian_m.values): plt.text(i, v + 1, str(round(v,2)), color='black', fontweight='normal',fontsize=10) # Mostra il grafico plt.show() # Calcola la media delle colonne "costo_max_menu" e "costo_min_menu" per i dati vegan e vegetarian df_vegan_m = df_vegan[["costo_max_menu","costo_min_menu"]].mean() df_vegetarian_m = df_vegetarian[["costo_max_menu","costo_min_menu"]].mean() # Imposta la larghezza della barra e la posizione y bar_width = 0.35 y_pos = [1, 0] # Crea una figura di dimensioni 8x8 plt.figure(figsize=(8,8)) # Disegna la barra per i dati vegan plt.bar(y_pos, df_vegan_m, bar_width, color='blue') # Disegna la barra per i dati vegetariani plt.bar([i + bar_width for i in y_pos], df_vegetarian_m, bar_width, color='red') # Imposta le etichette sull'asse x plt.xticks([i + bar_width / 2 for i in y_pos], ["costo_max_menu","costo_min_menu"]) # Imposta l'etichetta sull'asse y plt.ylabel('Media') # Imposta il titolo del grafico plt.title('Confronto tra le medie delle colonne vegan e vegetarian') # Aggiunge il valore della media sopra ogni barra per i dati vegan for i, v in enumerate(df_vegan_m.values): plt.text(i, v + -1, str(round(v,2)), color='black', fontweight='normal',fontsize=10) # Aggiunge il valore della media sopra ogni barra per i dati vegetariani for i, v in enumerate(df_vegetarian_m.values): plt.text(i, v + -1, str(round(v,2)), color='black', fontweight='normal',fontsize=10) # Mostra il grafico plt.show() |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 | #conto presenza nella lista di ongni provincia piu grafico #conto la frequenza di ogni città df_values= df["provincia"].value_counts() #setto dimensione grafico plt.figure(figsize=(12,5)) #scelgo indice e valore e personalizzo plt.bar(df_values.index,df_values.values,color="red",label="n. pres") #metto titolo e cambio font plt.title("Province piu diffuse nel dataset", fontdict={"fontname": "Arial","fontsize":20}) #metto eticchette sopra ogni colonna for i, v in enumerate(df_values.values): plt.text(i, v + 3, str(v), color='black', fontweight='normal') #etichette y e x plt.xlabel("provincie") plt.ylabel("presenza") #mostra legenda e mostra grafico plt.legend() plt.show() |
1 | df.to_excel('Dataframe1.xlsx', index=False, header=True) |
1 2 3 4 5 6 7 | df_desc.to_excel('Data-veget.xlsx', index=False, header=True) df_desc2.to_excel('Data-veg.xlsx', index=False, header=True) df_desc_copy.to_excel('Data-top10(2).xlsx', index=False, header=True) df_indian.to_excel('Data-ind.xlsx', index=False, header=True) df_italian.to_excel('Data-ita.xlsx', index=False, header=True) df_japanese.to_excel('Data-jap.xlsx', index=False, header=True) df_american.to_excel('Data-usa.xlsx', index=False, header=True) |
1 |