We would like to thank Dr Stefan Peters for his assessment of our work and the useful suggestions. As discussed below, we have considered all suggestions and made relevant edits and improvements to the manuscript.
1. The authors missed to provide details on the imagery source and did not discuss resolution details on the available and used satellite imagery that was extracted from Google Earth Pro. For the applied eye altitude (200m) Google Earth Pro usually uses mosaiced true color composites derived from Digital Globe's WorldView-1/2/3 series, GeoEye-1, and Airbus' Pleiades, all of which provide data at around 0.5m spatial resolution. How does this refer to the resolution you extracted at 200m eye altitude?
Response: (1) We address the imagery source in our response to the following comment (comment 2). (2) Resolution: As mentioned in the third paragraph of page 5, the images extracted from Google Earth Pro were 4800 pixels x 2908 pixels. (3) Eye altitude: the eye altitude in Google Earth Pro has no relevance to the resolution of the underlying satellite image. A 50 cm spatial resolution means that objects smaller than 50 cm cannot be recognised. However, for objects such as small boats, the 50 cm resolution is adequate. In this work, we fixed all eye altitudes to 200 meters because we wanted to fix the scale of the map, the benefits of which are discussed in detail in “D. Object Measurement and Classification” on page 6.
2. Google Earth Pro does display the satellite imagery source, as for instance: “Image @ 2022 CNES / Airbus”, which refers to Airbus' Pleiades imagery. However, the authors did not mention the satellite imagery source(s) of the 694 high-resolution imagery being used.
Response: Thank you very much for the reminder. We appreciate the importance of disclosing imagery sources. We have therefore added the imagery sources to version 2 of the preprint manuscript, for instance, Image © 2022 CNES/Airbus. We have also corrected a minor error: the number of images of the Gulf of California from 2018 to 2021 is 690, not 694.
3. The paper also leaves a few further open questions: What would be the benefit of additional spectral bands in the IR part of the EMS, if available? Worldview3 for example comes with 29 bands. PlanetScope with 5 bands. What would be the minimum required spatial resolution? Which multispectral (high-res) satellite imagery is available for free, and which one is not? A discussion on satellite imagery access, availability, costs and in particular resolution (spatial, spectral, temporal) would be beneficial for this research.
Response: Thank you very much for the recommendation. We recognise the value of spectral bands, availability and costs in satellite imagery. However, this is somewhat outside the scope of the paper because we did all the image extraction work directly in Google Earth Pro. We will consider your suggestion for a future comparative paper on satellite imagery access, availability, and costs.
4. Did you account for the count of duplicates?
Response: Yes, we checked for duplication under the methodology developed for this work. Because we used fixed timestamps and coordinates and avoided overlapping areas when extracting images from Google Earth Pro, small boats were not duplicated or double-counted.
5. Details on how satellite images were extracted from Google Earth Pro are missing.
Response: We have now added the details on how satellite images were extracted from Google Earth Pro on page 5.
6. The issue of boats located on 2 adjacent images could have been addressed by applying tiling with spatial overlap (for instance of 50%, depending on the set tile size) …page 7 second paragraph: “…some large vessels …do not appear fully in an image”
Response: We are grateful that you mentioned spatial overlap. We did not encounter this issue because we extracted the images manually, which allowed us to filter out images in which small boats did not appear fully.
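For illustration only, the tiling with spatial overlap suggested in this comment could be sketched as below. This is not part of the BoatNet workflow, which relied on manual extraction. The tile size of 640 pixels and the function names are assumptions chosen for the example; only the image size (4800 x 2908 pixels) and the 50% overlap come from the text above.

```python
from typing import Iterator, List, Tuple


def tile_origins(length: int, tile: int, overlap: float) -> List[int]:
    """Return start offsets along one axis so that consecutive tiles overlap."""
    if not 0 <= overlap < 1:
        raise ValueError("overlap must be in [0, 1)")
    if tile >= length:
        return [0]  # a single tile already covers the whole axis
    stride = max(1, int(tile * (1 - overlap)))
    starts = list(range(0, length - tile + 1, stride))
    if starts[-1] != length - tile:
        starts.append(length - tile)  # extra tile flush with the far edge
    return starts


def overlapping_tiles(
    width: int, height: int, tile: int, overlap: float = 0.5
) -> Iterator[Tuple[int, int, int, int]]:
    """Yield (left, top, right, bottom) pixel boxes covering the image."""
    for top in tile_origins(height, tile, overlap):
        for left in tile_origins(width, tile, overlap):
            yield (left, top, min(left + tile, width), min(top + tile, height))


# Image size from the manuscript; the 640-pixel tile size is a hypothetical choice.
tiles = list(overlapping_tiles(4800, 2908, tile=640, overlap=0.5))
print(len(tiles))  # 126
```

With 50% overlap, a boat cut by the edge of one tile lies further inside a neighbouring tile, at the cost of having to merge duplicate detections in the overlapping strips.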
7. The authors could also consider masking out land areas.
Response: We appreciate your suggestion on masking out land areas for better detection of small boats. However, because of the transfer learning, all images used in the training data sets show boats at sea. The algorithm therefore does not detect boats on land with high confidence; see, for instance, Figure 13 on page 9.
8. Page 6: 2 paragraph: The authors mentioned shadows and clouds but did discuss in the next sentence the removal of haze. This leaves the reader with open questions about shadows and clouds (although Google Earth Pro imagery is as good as everywhere cloud-free mosaics).
Response: (1) This has been addressed in the text, where it is now clear that we are talking about haze clouds. (2) According to the paper “Single Image Haze Removal Using Dark Channel Prior”, shadow is one of the three factors that cause low intensity in the dark channel. Thus, while the cited paper focuses on removing haze, shadows can be removed in the same way.
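As a minimal sketch of the quantity referred to in point (2), the dark channel of an image is the minimum over the colour channels followed by a minimum over a local patch. The pure-Python function below is an illustrative assumption, not the code used in this work; the image is represented as nested lists of (r, g, b) tuples, and the small 3 x 3 patch is chosen only to keep the example readable.

```python
def dark_channel(image, patch=3):
    """Dark channel of an image given as rows of (r, g, b) tuples in [0, 1].

    For every pixel, take the minimum intensity over the three colour
    channels and over a patch x patch neighbourhood (clipped at the borders).
    """
    if patch < 1 or patch % 2 == 0:
        raise ValueError("patch must be a positive odd integer")
    height, width = len(image), len(image[0])
    # Step 1: per-pixel minimum over the colour channels.
    min_rgb = [[min(pixel) for pixel in row] for row in image]
    # Step 2: minimum filter over the local patch.
    radius = patch // 2
    dark = []
    for y in range(height):
        row = []
        for x in range(width):
            ys = range(max(0, y - radius), min(height, y + radius + 1))
            xs = range(max(0, x - radius), min(width, x + radius + 1))
            row.append(min(min_rgb[j][i] for j in ys for i in xs))
        dark.append(row)
    return dark


# Hypothetical 3 x 3 image: bright pixels with one shadowed pixel in the centre.
bright, shadow = (0.8, 0.8, 0.9), (0.05, 0.1, 0.1)
image = [[bright] * 3, [bright, shadow, bright], [bright] * 3]
print(dark_channel(image)[0][0])  # 0.05
```

In this toy image the single shadowed pixel pulls the dark channel of its whole neighbourhood down to a low value, which is the behaviour the response relies on.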
1. Abstract: 1st sentence: improve wording.
Response: We believe this sentence is clear.
2. Abstract: I suggest to replace the term “Techno-activity” with “Technology…”
Response: We have replaced the term “Techno-activity” with “technological and operational assumptions”.
3. Abstract: Replace GPS with “Global navigation satellite system (GNSS)”
4. Abstract: “…The work produced a methodology named BoatNet that can detect, measure and classify small boats…” – I suggest to also inform about target classes (shipping/leisure)
Response: Text modified to incorporate the boat classes.
5. Page 1: Unit ‘Mt’ should be written in full when using the first time: “Megaton (Mt)”
6. Same accounts for CO2e: Carbon dioxide equivalent (page 2 last line)
7. Page 2: first 3 paragraphs: I suggest adding the respective literature references to back up your statements.
Response: Added 20 more citations: 14 to 35.
8. Page 2 – section C.: First sentence: “Bringing deep….is essential.” Why is it essential – what for?
Response: Thank you for your comment. We have added the argument on why deep learning is essential for satellite image recognition on page 2.
9. Page 2 – section C.: Third sentence: I suggest using the term ‘resolution’ instead of ‘quality’
Response: Thank you for your comment. In fact, it is not “resolution”. However, we have deleted this sentence to better explain what you suggested.
10. Page 2 – section C.: Forth sentence: I suggest rewording into something like: “Machine learning is widely used for satellite imagery analysis.
Response: Thank you for your comment. We have deleted this sentence to better explain what you suggested.
11. Whole text: You may consider replacing “Satellite image” with “Satellite imagery”
Response: Thank you for your comment. It has been fixed.
12. Page 3 – line 8: “…and fuel used data” Did you mean fuel-used data? The sentence wasn’t clear to me.
Response: Thank you for your comment. It is fuel-used data. Fixed.
13. Page 3 – line 10: “CO2e” Does “e” refer to estimate? Please write full form when using an abbreviation for the first time in the text.
Response: It is equivalent and has been fixed in Section A of Part II of page 3.
14. Page 3 – section B – paragraph 2: “…each number is neither zero nor new, but…” Are you sure that is correct? What is a “new” number? Please adjust to improve clarity.
Response: Thank you for your comment. It should be “one”. Fixed.
15. Page 5 – end of chapter II: I recommend adding a summary of the literature review including argumentation for why Yolo CNN was selected for this work (just before the last paragraph)
Response: Thank you for your comment. We have added a paragraph on page 5 making this argument.
16. Page 5 – end of chapter II- last paragraph: “…aims at detecting small boats…” I recommend adding the fact that the proposed model intends to detect specific boat types (fishing, recreational)
Response: Thank you for your comment. It has been corrected.
17. Page 6 - Fig. 5 caption: reference should be replaced by ref 51 (Dwivedi …Yolov5), or at least ref 51 should be added.
Response: Thank you for your comment. You are right. Ref 51 (now Ref 86) is one of the key references for Fig.5. I have added Ref 51 (now Ref 86) to the caption of Fig. 5.
18. Page 6 - the last paragraph: “ To validate… was done beteen BoatNet…” → correct between
19. Page 7: what is the loss of prediction (detection) accuracy due to image resizing? Depending on research goals (and other factors), it is sometimes worth running a multiple day training.
Response: Thank you for your comment. Firstly, the imagery used for detection is not resized. Secondly, we resized the training dataset in this work in order to remove the “blank information” of an image. For instance, in Fig. 4, the two small boats are extremely concentrated. Since we labelled small boats in the training data sets, most resized images are not labelled and, thus, are full of “blank information”. Therefore, an assessment of the prediction loss due to image resizing may not be suitable here. On the question of multiple-day training, Colab Pro limits RAM to 32 GB while Pro+ limits RAM to 52 GB, and both limit sessions to 24 hours. In addition, one of our main aims for the future is to find a computationally efficient way to apply object detection in edge computing, for which multiple-day training might be one option. We take your suggestion as future work on this topic.
20. Page 8: false counts caused by nearby located boats: couldn’t the boat type classification allow to at least distinguish between different (adjacent) boat types?
Response: Thank you for your comment. Most false counts, and the resulting loss in precision, are caused by misdetections among boats of the same class; see, for instance, Fig. 11. The key factor is the image quality (our detected imagery is 4800 pixels x 2908 pixels), not the algorithm.
21. Page 8: last paragraph: “Nevertheless, as Figures 11, 12, 13 demonstrate, the model still detects most small boats in poorly detailed satellite images,…” What exactly did you refer to with “poorly detailed” ?
Response: Thank you for your comment. By “poorly detailed” we mean that, although the images in our work are 4800 pixels x 2908 pixels, the details in the imagery are still unclear. For instance, when taking photos of your friends with your mobile phone, you may notice that the details of your friends' faces are more visible than the snowy mountains in the background.
22. “precision of training can be up to 93.9%,” … why isn’t this result not explained in the result section? How did you derive 93
Response: Thank you for your comment. In the results section, we hope to focus on (1) fundamental issues behind satellite image detection in the real world, for instance, imagery quality; and (2) statistical results relevant to the maritime energy-emission problem for further research. The precision can be derived with the formula Precision = TP / (TP + FP), where TP is the number of true positives and FP is the number of false positives. Precision is the fraction of positive predictions that are correct, so a high precision indicates that the classifier rarely misclassifies negative categories as positive categories.
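As a minimal sketch of the formula above, precision can be computed from the detection counts as follows. The function name and the counts in the example are hypothetical and are used only for illustration; they are not results from this work.

```python
def precision(true_positives: int, false_positives: int) -> float:
    """Precision = TP / (TP + FP)."""
    predicted_positives = true_positives + false_positives
    if predicted_positives == 0:
        raise ValueError("precision is undefined without positive predictions")
    return true_positives / predicted_positives


# Hypothetical counts for illustration only (not results from this work):
# 90 correct boat detections and 10 false detections.
print(f"{precision(true_positives=90, false_positives=10):.3f}")  # 0.900
```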
23. Page 10, fourth paragraph: “Due to the low data quality of the selected regions, the images are less suitable as training datasets.” What exactly did you mean with “low data quality”
Response: Thank you for your comment. Hopefully, Answer 21 has already resolved this question!
 He, K., Sun, J. and Tang, X., 2010. Single image haze removal using dark channel prior. IEEE transactions on pattern analysis and machine intelligence, 33(12), pp.2341-2353.
We would like to thank Dr Irwan Priyanto for his assessment of our work and the useful suggestions. As discussed below, we have considered all suggestions and made relevant edits and improvements to the manuscript.
1. Authors should present the YOLOv5l architecture and cite previous research using YOLOv5l.
Response: Thank you for your comment. We have added the argument on why we use YOLO and citations of previous research on page 5.
2. In addition, it is necessary to include GPU resources and the employed framework along with computational cost analysis.
Response: Thank you for your comment. We added sentences on the GPU resources and the employed framework we used in Section C of page 6. Regarding the computational cost analysis, Fig 5 shows the relationship between Average Precision (AP) and GPU Speed.
3. To enrich the analysis, the author should add a comparison of research results with other methods.
Response: Thank you very much for the recommendation. We recognise the value of performing the same image recognition task with different models. We considered this step in the initial stages of the research, but as we narrowed down the candidate object detection methods it became clear that such a comparison was outside the scope of this paper. We have noted your suggestion on page 10 as the basis for a future comparative paper on different algorithms for the same problem.
Tracking and measuring national carbon footprints is one of the keys to achieving the ambitious goals set by countries. According to statistics, more than 10% of global transportation carbon emissions result from shipping. However, accurate tracking of the emissions of the small boat segment is not well established. Past research has begun to look into the role played by small boat fleets in terms of Greenhouse Gases (GHG), but this either relies on high-level technological and operational assumptions or the installation of Global Navigation Satellite System (GNSS) sensors to understand how this vessel class behaves. This research is undertaken mainly in relation to fishing and recreational boats. Open-access satellite imagery, with its ever-increasing resolution, can support innovative methodologies that could eventually lead to the quantification of GHG emissions. This work used deep learning algorithms to detect small boats in three cities in the Gulf of California in Mexico. The work produced a methodology named BoatNet that can detect, measure and classify small boats as leisure boats or fishing boats, even in low-resolution and blurry satellite images, achieving an accuracy of 93.9% with a precision of 74.0%. Future work should focus on attributing a boat activity to fuel consumption and operational profile to estimate small boat GHG emissions in any given region. The data curated and produced in this study are freely available at https://github.com/theiresearch/BoatNet.